243 files changed, 24911 insertions, 7328 deletions
diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
index 5bc55a4e3bce..dceeb0d763aa 100644
--- a/Documentation/networking/af_xdp.rst
+++ b/Documentation/networking/af_xdp.rst
@@ -243,8 +243,8 @@ Configuration Flags and Socket Options
 These are the various configuration flags that can be used to control
 and monitor the behavior of AF_XDP sockets.
 
-XDP_COPY and XDP_ZERO_COPY bind flags
--------------------------------------
+XDP_COPY and XDP_ZEROCOPY bind flags
+------------------------------------
 
 When you bind to a socket, the kernel will first try to use zero-copy
 copy. If zero-copy is not supported, it will fall back on using copy
@@ -252,20 +252,27 @@ mode, i.e. copying all packets out to user space. But if you would
 like to force a certain mode, you can use the following flags. If you
 pass the XDP_COPY flag to the bind call, the kernel will force the
 socket into copy mode. If it cannot use copy mode, the bind call will
-fail with an error. Conversely, the XDP_ZERO_COPY flag will force the
+fail with an error. Conversely, the XDP_ZEROCOPY flag will force the
 socket into zero-copy mode or fail.
 
 XDP_SHARED_UMEM bind flag
 -------------------------
 
-This flag enables you to bind multiple sockets to the same UMEM, but
-only if they share the same queue id. In this mode, each socket has
-their own RX and TX rings, but the UMEM (tied to the fist socket
-created) only has a single FILL ring and a single COMPLETION
-ring. To use this mode, create the first socket and bind it in the normal
-way. Create a second socket and create an RX and a TX ring, or at
-least one of them, but no FILL or COMPLETION rings as the ones from
-the first socket will be used. In the bind call, set he
+This flag enables you to bind multiple sockets to the same UMEM. It
+works on the same queue id, between queue ids and between
+netdevs/devices. In this mode, each socket has their own RX and TX
+rings as usual, but you are going to have one or more FILL and
+COMPLETION ring pairs. You have to create one of these pairs per
+unique netdev and queue id tuple that you bind to.
+
+Starting with the case were we would like to share a UMEM between
+sockets bound to the same netdev and queue id. The UMEM (tied to the
+fist socket created) will only have a single FILL ring and a single
+COMPLETION ring as there is only on unique netdev,queue_id tuple that
+we have bound to. To use this mode, create the first socket and bind
+it in the normal way. Create a second socket and create an RX and a TX
+ring, or at least one of them, but no FILL or COMPLETION rings as the
+ones from the first socket will be used. In the bind call, set he
 XDP_SHARED_UMEM option and provide the initial socket's fd in the
 sxdp_shared_umem_fd field. You can attach an arbitrary number of extra
 sockets this way.
@@ -283,19 +290,19 @@ round-robin example of distributing packets is shown below:
    #define MAX_SOCKS 16
 
    struct {
-        __uint(type, BPF_MAP_TYPE_XSKMAP);
-        __uint(max_entries, MAX_SOCKS);
-        __uint(key_size, sizeof(int));
-        __uint(value_size, sizeof(int));
+       __uint(type, BPF_MAP_TYPE_XSKMAP);
+       __uint(max_entries, MAX_SOCKS);
+       __uint(key_size, sizeof(int));
+       __uint(value_size, sizeof(int));
    } xsks_map SEC(".maps");
 
    static unsigned int rr;
 
    SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
    {
-	rr = (rr + 1) & (MAX_SOCKS - 1);
+       rr = (rr + 1) & (MAX_SOCKS - 1);
 
-	return bpf_redirect_map(&xsks_map, rr, XDP_DROP);
+       return bpf_redirect_map(&xsks_map, rr, XDP_DROP);
    }
 
 Note, that since there is only a single set of FILL and COMPLETION
@@ -305,11 +312,41 @@ concurrently. There are no synchronization primitives in the
 libbpf code that protects multiple users at this point in time.
 
 Libbpf uses this mode if you create more than one socket tied to the
-same umem. However, note that you need to supply the
+same UMEM. However, note that you need to supply the
 XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD libbpf_flag with the
 xsk_socket__create calls and load your own XDP program as there is no
 built in one in libbpf that will route the traffic for you.
 
+The second case is when you share a UMEM between sockets that are
+bound to different queue ids and/or netdevs. In this case you have to
+create one FILL ring and one COMPLETION ring for each unique
+netdev,queue_id pair. Let us say you want to create two sockets bound
+to two different queue ids on the same netdev. Create the first socket
+and bind it in the normal way. Create a second socket and create an RX
+and a TX ring, or at least one of them, and then one FILL and
+COMPLETION ring for this socket. Then in the bind call, set he
+XDP_SHARED_UMEM option and provide the initial socket's fd in the
+sxdp_shared_umem_fd field as you registered the UMEM on that
+socket. These two sockets will now share one and the same UMEM.
+
+There is no need to supply an XDP program like the one in the previous
+case where sockets were bound to the same queue id and
+device. Instead, use the NIC's packet steering capabilities to steer
+the packets to the right queue. In the previous example, there is only
+one queue shared among sockets, so the NIC cannot do this steering. It
+can only steer between queues.
+
+In libbpf, you need to use the xsk_socket__create_shared() API as it
+takes a reference to a FILL ring and a COMPLETION ring that will be
+created for you and bound to the shared UMEM. You can use this
+function for all the sockets you create, or you can use it for the
+second and following ones and use xsk_socket__create() for the first
+one. Both methods yield the same result.
+
+Note that a UMEM can be shared between sockets on the same queue id
+and device, as well as between queues on the same device and between
+devices at the same time.
+
 XDP_USE_NEED_WAKEUP bind flag
 -----------------------------
 
@@ -342,7 +379,7 @@ would look like this for the TX path:
 .. code-block:: c
 
    if (xsk_ring_prod__needs_wakeup(&my_tx_ring))
-      sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0);
+       sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0);
 
 I.e., only use the syscall if the flag is set.
 
@@ -364,7 +401,7 @@ resources by only setting up one of them. Both the FILL ring and the
 COMPLETION ring are mandatory as you need to have a UMEM tied to your
 socket. But if the XDP_SHARED_UMEM flag is used, any socket after the
 first one does not have a UMEM and should in that case not have any
-FILL or COMPLETION rings created as the ones from the shared umem will
+FILL or COMPLETION rings created as the ones from the shared UMEM will
 be used. Note, that the rings are single-producer single-consumer, so
 do not try to access them from multiple processes at the same
 time. See the XDP_SHARED_UMEM section.
@@ -382,7 +419,7 @@ XDP_UMEM_REG setsockopt
 -----------------------
 
 This setsockopt registers a UMEM to a socket. This is the area that
-contain all the buffers that packet can recide in. The call takes a
+contain all the buffers that packet can reside in. The call takes a
 pointer to the beginning of this area and the size of it. Moreover, it
 also has parameter called chunk_size that is the size that the UMEM is
 divided into. It can only be 2K or 4K at the moment. If you have an
@@ -396,6 +433,15 @@ start N bytes into the buffer leaving the first N bytes for the
 application to use. The final option is the flags field, but it will
 be dealt with in separate sections for each UMEM flag.
 
+SO_BINDTODEVICE setsockopt
+--------------------------
+
+This is a generic SOL_SOCKET option that can be used to tie AF_XDP
+socket to a particular network interface.  It is useful when a socket
+is created by a privileged process and passed to a non-privileged one.
+Once the option is set, kernel will refuse attempts to bind that socket
+to a different interface.  Updating the value requires CAP_NET_RAW.
+
 XDP_STATISTICS getsockopt
 -------------------------
 
@@ -405,9 +451,9 @@ purposes. The supported statistics are shown below:
 .. code-block:: c
 
    struct xdp_statistics {
-	  __u64 rx_dropped; /* Dropped for reasons other than invalid desc */
-	  __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
-	  __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
+       __u64 rx_dropped; /* Dropped for reasons other than invalid desc */
+       __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
+       __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
    };
 
 XDP_OPTIONS getsockopt
@@ -416,8 +462,92 @@ XDP_OPTIONS getsockopt
 Gets options from an XDP socket. The only one supported so far is
 XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not.
 
+Multi-Buffer Support
+====================
+
+With multi-buffer support, programs using AF_XDP sockets can receive
+and transmit packets consisting of multiple buffers both in copy and
+zero-copy mode. For example, a packet can consist of two
+frames/buffers, one with the header and the other one with the data,
+or a 9K Ethernet jumbo frame can be constructed by chaining together
+three 4K frames.
+
+Some definitions:
+
+* A packet consists of one or more frames
+
+* A descriptor in one of the AF_XDP rings always refers to a single
+  frame. In the case the packet consists of a single frame, the
+  descriptor refers to the whole packet.
+
+To enable multi-buffer support for an AF_XDP socket, use the new bind
+flag XDP_USE_SG. If this is not provided, all multi-buffer packets
+will be dropped just as before. Note that the XDP program loaded also
+needs to be in multi-buffer mode. This can be accomplished by using
+"xdp.frags" as the section name of the XDP program used.
+
+To represent a packet consisting of multiple frames, a new flag called
+XDP_PKT_CONTD is introduced in the options field of the Rx and Tx
+descriptors. If it is true (1) the packet continues with the next
+descriptor and if it is false (0) it means this is the last descriptor
+of the packet. Why the reverse logic of end-of-packet (eop) flag found
+in many NICs? Just to preserve compatibility with non-multi-buffer
+applications that have this bit set to false for all packets on Rx,
+and the apps set the options field to zero for Tx, as anything else
+will be treated as an invalid descriptor.
+
+These are the semantics for producing packets onto AF_XDP Tx ring
+consisting of multiple frames:
+
+* When an invalid descriptor is found, all the other
+  descriptors/frames of this packet are marked as invalid and not
+  completed. The next descriptor is treated as the start of a new
+  packet, even if this was not the intent (because we cannot guess
+  the intent). As before, if your program is producing invalid
+  descriptors you have a bug that must be fixed.
+
+* Zero length descriptors are treated as invalid descriptors.
+
+* For copy mode, the maximum supported number of frames in a packet is
+  equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all
+  descriptors accumulated so far are dropped and treated as
+  invalid. To produce an application that will work on any system
+  regardless of this config setting, limit the number of frags to 18,
+  as the minimum value of the config is 17.
+
+* For zero-copy mode, the limit is up to what the NIC HW
+  supports. Usually at least five on the NICs we have checked. We
+  consciously chose to not enforce a rigid limit (such as
+  CONFIG_MAX_SKB_FRAGS + 1) for zero-copy mode, as it would have
+  resulted in copy actions under the hood to fit into what limit the
+  NIC supports. Kind of defeats the purpose of zero-copy mode. How to
+  probe for this limit is explained in the "probe for multi-buffer
+  support" section.
+
+On the Rx path in copy-mode, the xsk core copies the XDP data into
+multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as
+detailed before. Zero-copy mode works the same, though the data is not
+copied. When the application gets a descriptor with the XDP_PKT_CONTD
+flag set to one, it means that the packet consists of multiple buffers
+and it continues with the next buffer in the following
+descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it
+means that this is the last buffer of the packet. AF_XDP guarantees
+that only a complete packet (all frames in the packet) is sent to the
+application. If there is not enough space in the AF_XDP Rx ring, all
+frames of the packet will be dropped.
+
+If application reads a batch of descriptors, using for example the libxdp
+interfaces, it is not guaranteed that the batch will end with a full
+packet. It might end in the middle of a packet and the rest of the
+buffers of that packet will arrive at the beginning of the next batch,
+since the libxdp interface does not read the whole ring (unless you
+have an enormous batch size or a very small ring size).
+
+An example program each for Rx and Tx multi-buffer support can be found
+later in this document.
+
 Usage
-=====
+-----
 
 In order to use AF_XDP sockets two parts are needed. The
 user-space application and the XDP program. For a complete setup and
@@ -446,15 +576,15 @@ like this:
 .. code-block:: c
 
     // struct xdp_rxtx_ring {
-    // 	__u32 *producer;
-    // 	__u32 *consumer;
-    // 	struct xdp_desc *desc;
+    //     __u32 *producer;
+    //     __u32 *consumer;
+    //     struct xdp_desc *desc;
     // };
 
     // struct xdp_umem_ring {
-    // 	__u32 *producer;
-    // 	__u32 *consumer;
-    // 	__u64 *desc;
+    //     __u32 *producer;
+    //     __u32 *consumer;
+    //     __u64 *desc;
     // };
 
     // typedef struct xdp_rxtx_ring RING;
@@ -495,6 +625,131 @@ like this:
 But please use the libbpf functions as they are optimized and ready to
 use. Will make your life easier.
 
+Usage Multi-Buffer Rx
+---------------------
+
+Here is a simple Rx path pseudo-code example (using libxdp interfaces
+for simplicity). Error paths have been excluded to keep it short:
+
+.. code-block:: c
+
+    void rx_packets(struct xsk_socket_info *xsk)
+    {
+        static bool new_packet = true;
+        u32 idx_rx = 0, idx_fq = 0;
+        static char *pkt;
+
+        int rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx);
+
+        xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
+
+        for (int i = 0; i < rcvd; i++) {
+            struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++);
+            char *frag = xsk_umem__get_data(xsk->umem->buffer, desc->addr);
+            bool eop = !(desc->options & XDP_PKT_CONTD);
+
+            if (new_packet)
+                pkt = frag;
+            else
+                add_frag_to_pkt(pkt, frag);
+
+            if (eop)
+                process_pkt(pkt);
+
+            new_packet = eop;
+
+            *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr;
+        }
+
+        xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
+        xsk_ring_cons__release(&xsk->rx, rcvd);
+    }
+
+Usage Multi-Buffer Tx
+---------------------
+
+Here is an example Tx path pseudo-code (using libxdp interfaces for
+simplicity) ignoring that the umem is finite in size, and that we
+eventually will run out of packets to send. Also assumes pkts.addr
+points to a valid location in the umem.
+
+.. code-block:: c
+
+    void tx_packets(struct xsk_socket_info *xsk, struct pkt *pkts,
+                    int batch_size)
+    {
+        u32 idx, i, pkt_nb = 0;
+
+        xsk_ring_prod__reserve(&xsk->tx, batch_size, &idx);
+
+        for (i = 0; i < batch_size;) {
+            u64 addr = pkts[pkt_nb].addr;
+            u32 len = pkts[pkt_nb].size;
+
+            do {
+                struct xdp_desc *tx_desc;
+
+                tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx + i++);
+                tx_desc->addr = addr;
+
+                if (len > xsk_frame_size) {
+                    tx_desc->len = xsk_frame_size;
+                    tx_desc->options = XDP_PKT_CONTD;
+                } else {
+                    tx_desc->len = len;
+                    tx_desc->options = 0;
+                    pkt_nb++;
+                }
+                len -= tx_desc->len;
+                addr += xsk_frame_size;
+
+                if (i == batch_size) {
+                    /* Remember len, addr, pkt_nb for next iteration.
+                     * Skipped for simplicity.
+                     */
+                    break;
+                }
+            } while (len);
+        }
+
+        xsk_ring_prod__submit(&xsk->tx, i);
+    }
+
+Probing for Multi-Buffer Support
+--------------------------------
+
+To discover if a driver supports multi-buffer AF_XDP in SKB or DRV
+mode, use the XDP_FEATURES feature of netlink in linux/netdev.h to
+query for NETDEV_XDP_ACT_RX_SG support. This is the same flag as for
+querying for XDP multi-buffer support. If XDP supports multi-buffer in
+a driver, then AF_XDP will also support that in SKB and DRV mode.
+
+To discover if a driver supports multi-buffer AF_XDP in zero-copy
+mode, use XDP_FEATURES and first check the NETDEV_XDP_ACT_XSK_ZEROCOPY
+flag. If it is set, it means that at least zero-copy is supported and
+you should go and check the netlink attribute
+NETDEV_A_DEV_XDP_ZC_MAX_SEGS in linux/netdev.h. An unsigned integer
+value will be returned stating the max number of frags that are
+supported by this device in zero-copy mode. These are the possible
+return values:
+
+1: Multi-buffer for zero-copy is not supported by this device, as max
+   one fragment supported means that multi-buffer is not possible.
+
+>=2: Multi-buffer is supported in zero-copy mode for this device. The
+     returned number signifies the max number of frags supported.
+
+For an example on how these are used through libbpf, please take a
+look at tools/testing/selftests/bpf/xskxceiver.c.
+
+Multi-Buffer Support for Zero-Copy Drivers
+------------------------------------------
+
+Zero-copy drivers usually use the batched APIs for Rx and Tx
+processing. Note that the Tx batch API guarantees that it will provide
+a batch of Tx descriptors that ends with full packet at the end. This
+to facilitate extending a zero-copy driver with multi-buffer support.
+
 Sample application
 ==================
 
@@ -555,7 +810,7 @@ A: When a netdev of a physical NIC is initialized, Linux usually
    A number of other ways are possible all up to the capabilities of
    the NIC you have.
 
-Q: Can I use the XSKMAP to implement a switch betwen different umems
+Q: Can I use the XSKMAP to implement a switch between different umems
    in copy mode?
 
 A: The short answer is no, that is not supported at the moment. The
@@ -567,6 +822,17 @@ A: The short answer is no, that is not supported at the moment. The
    switch, or other distribution mechanism, in your NIC to direct
    traffic to the correct queue id and socket.
 
+Q: My packets are sometimes corrupted. What is wrong?
+
+A: Care has to be taken not to feed the same buffer in the UMEM into
+   more than one ring at the same time. If you for example feed the
+   same buffer into the FILL ring and the TX ring at the same time, the
+   NIC might receive data into the buffer at the same time it is
+   sending it. This will cause some packets to become corrupted. Same
+   thing goes for feeding the same buffer into the FILL rings
+   belonging to different queue ids or netdevs bound with the
+   XDP_SHARED_UMEM flag.
+
 Credits
 =======
 
diff --git a/Documentation/networking/arcnet-hardware.rst b/Documentation/networking/arcnet-hardware.rst
index ac249ac8fcf2..3bf7f99cd7bb 100644
--- a/Documentation/networking/arcnet-hardware.rst
+++ b/Documentation/networking/arcnet-hardware.rst
@@ -1902,7 +1902,7 @@ of 32 possible I/O Base addresses using the following tables::
      6    |  10
 
 The I/O address is sum of all switches set to "1". Remember that
-the I/O address space bellow 0x200 is RESERVED for mainboard, so
+the I/O address space below 0x200 is RESERVED for mainboard, so
 switch 1 should be ALWAYS SET TO OFF.
 
 
@@ -3152,7 +3152,7 @@ Tiara
 (model unknown)
 ---------------
 
-  - from Christoph Lameter <christoph@lameter.com>
+  - from Christoph Lameter <cl@gentwo.org>
 
 
 Here is information about my card as far as I could figure it out::
diff --git a/Documentation/networking/ax25.rst b/Documentation/networking/ax25.rst
index f060cfb1445a..605e72c6c877 100644
--- a/Documentation/networking/ax25.rst
+++ b/Documentation/networking/ax25.rst
@@ -7,9 +7,9 @@ AX.25
 To use the amateur radio protocols within Linux you will need to get a
 suitable copy of the AX.25 Utilities. More detailed information about
 AX.25, NET/ROM and ROSE, associated programs and utilities can be
-found on http://www.linux-ax25.org.
+found on https://linux-ax25.in-berlin.de.
 
-There is an active mailing list for discussing Linux amateur radio matters
+There is a mailing list for discussing Linux amateur radio matters
 called linux-hams@vger.kernel.org. To subscribe to it, send a message to
 majordomo@vger.kernel.org with the words "subscribe linux-hams" in the body
 of the message, the subject field is ignored.  You don't need to be
diff --git a/Documentation/networking/bareudp.rst b/Documentation/networking/bareudp.rst
index 465a8b251bfe..621cb9575c8f 100644
--- a/Documentation/networking/bareudp.rst
+++ b/Documentation/networking/bareudp.rst
@@ -6,17 +6,17 @@ Bare UDP Tunnelling Module Documentation
 
 There are various L3 encapsulation standards using UDP being discussed to
 leverage the UDP based load balancing capability of different networks.
-MPLSoUDP (__ https://tools.ietf.org/html/rfc7510) is one among them.
+MPLSoUDP (https://tools.ietf.org/html/rfc7510) is one among them.
 
-The Bareudp tunnel module provides a generic L3 encapsulation tunnelling
-support for tunnelling different L3 protocols like MPLS, IP, NSH etc. inside
-a UDP tunnel.
+The Bareudp tunnel module provides a generic L3 encapsulation support for
+tunnelling different L3 protocols like MPLS, IP, NSH etc. inside a UDP tunnel.
 
 Special Handling
 ----------------
+
 The bareudp device supports special handling for MPLS & IP as they can have
 multiple ethertypes.
-MPLS procotcol can have ethertypes ETH_P_MPLS_UC  (unicast) & ETH_P_MPLS_MC (multicast).
+The MPLS protocol can have ethertypes ETH_P_MPLS_UC (unicast) & ETH_P_MPLS_MC (multicast).
 IP protocol can have ethertypes ETH_P_IP (v4) & ETH_P_IPV6 (v6).
 This special handling can be enabled only for ethertypes ETH_P_IP & ETH_P_MPLS_UC
 with a flag called multiproto mode.
@@ -26,7 +26,7 @@ Usage
 
 1) Device creation & deletion
 
-    a) ip link add dev bareudp0 type bareudp dstport 6635 ethertype 0x8847.
+    a) ip link add dev bareudp0 type bareudp dstport 6635 ethertype mpls_uc
 
        This creates a bareudp tunnel device which tunnels L3 traffic with ethertype
        0x8847 (MPLS traffic). The destination port of the UDP header will be set to
@@ -34,19 +34,26 @@ Usage
 
     b) ip link delete bareudp0
 
-2) Device creation with multiple proto mode enabled
+2) Device creation with multiproto mode enabled
+
+The multiproto mode allows bareudp tunnels to handle several protocols of the
+same family. It is currently only available for IP and MPLS. This mode has to
+be enabled explicitly with the "multiproto" flag.
+
+    a) ip link add dev bareudp0 type bareudp dstport 6635 ethertype ipv4 multiproto
 
-There are two ways to create a bareudp device for MPLS & IP with multiproto mode
-enabled.
+       For an IPv4 tunnel the multiproto mode allows the tunnel to also handle
+       IPv6.
 
-    a) ip link add dev  bareudp0 type bareudp dstport 6635 ethertype 0x8847 multiproto
+    b) ip link add dev bareudp0 type bareudp dstport 6635 ethertype mpls_uc multiproto
 
-    b) ip link add dev  bareudp0 type bareudp dstport 6635 ethertype mpls
+       For MPLS, the multiproto mode allows the tunnel to handle both unicast
+       and multicast MPLS packets.
 
 3) Device Usage
 
 The bareudp device could be used along with OVS or flower filter in TC.
-The OVS or TC flower layer must set the tunnel information in SKB dst field before
-sending packet buffer to the bareudp device for transmission. On reception the
-bareudp device extracts and stores the tunnel information in SKB dst field before
+The OVS or TC flower layer must set the tunnel information in the SKB dst field before
+sending the packet buffer to the bareudp device for transmission. On reception, the
+bareUDP device extracts and stores the tunnel information in the SKB dst field before
 passing the packet buffer to the network stack.
diff --git a/Documentation/networking/batman-adv.rst b/Documentation/networking/batman-adv.rst
index 18020943ba25..ec53a42499c1 100644
--- a/Documentation/networking/batman-adv.rst
+++ b/Documentation/networking/batman-adv.rst
@@ -27,7 +27,7 @@ Load the batman-adv module into your kernel::
   $ insmod batman-adv.ko
 
 The module is now waiting for activation. You must add some interfaces on which
-batman-adv can operate. The batman-adv soft-interface can be created using the
+batman-adv can operate. The batman-adv mesh-interface can be created using the
 iproute2 tool ``ip``::
 
   $ ip link add name bat0 type batadv
@@ -73,7 +73,7 @@ lower value. This will make the mesh more responsive to topology changes, but
 will also increase the overhead.
 
 Information about the current state can be accessed via the batadv generic
-netlink family. batctl provides human readable version via its debug tables
+netlink family. batctl provides a human readable version via its debug tables
 subcommands.
 
 
@@ -115,8 +115,8 @@ are prefixed with "batman-adv:" So to see just these messages try::
   $ dmesg | grep batman-adv
 
 When investigating problems with your mesh network, it is sometimes necessary to
-see more detail debug messages. This must be enabled when compiling the
-batman-adv module. When building batman-adv as part of kernel, use "make
+see more detailed debug messages. This must be enabled when compiling the
+batman-adv module. When building batman-adv as part of the kernel, use "make
 menuconfig" and enable the option ``B.A.T.M.A.N. debugging``
 (``CONFIG_BATMAN_ADV_DEBUG=y``).
 
@@ -157,12 +157,12 @@ Contact
 Please send us comments, experiences, questions, anything :)
 
 IRC:
-  #batman on irc.freenode.org
+  #batadv on ircs://irc.hackint.org/
 Mailing-list:
-  b.a.t.m.a.n@open-mesh.org (optional subscription at
-  https://lists.open-mesh.org/mm/listinfo/b.a.t.m.a.n)
+  b.a.t.m.a.n@lists.open-mesh.org (optional subscription at
+  https://lists.open-mesh.org/mailman3/postorius/lists/b.a.t.m.a.n.lists.open-mesh.org/)
 
 You can also contact the Authors:
 
-* Marek Lindner <mareklindner@neomailbox.ch>
+* Marek Lindner <marek.lindner@mailbox.org>
 * Simon Wunderlich <sw@simonwunderlich.de>
diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
index 24168b0d16bd..a4c1291d2561 100644
--- a/Documentation/networking/bonding.rst
+++ b/Documentation/networking/bonding.rst
@@ -196,11 +196,12 @@ ad_actor_sys_prio
 ad_actor_system
 
 	In an AD system, this specifies the mac-address for the actor in
-	protocol packet exchanges (LACPDUs). The value cannot be NULL or
-	multicast. It is preferred to have the local-admin bit set for this
-	mac but driver does not enforce it. If the value is not given then
-	system defaults to using the masters' mac address as actors' system
-	address.
+	protocol packet exchanges (LACPDUs). The value cannot be a multicast
+	address. If the all-zeroes MAC is specified, bonding will internally
+	use the MAC of the bond itself. It is preferred to have the
+	local-admin bit set for this mac but driver does not enforce it. If
+	the value is not given then system defaults to using the masters'
+	mac address as actors' system address.
 
 	This parameter has effect only in 802.3ad mode and is available through
 	SysFs interface.
@@ -312,6 +313,17 @@ arp_ip_target
 	maximum number of targets that can be specified is 16.  The
 	default value is no IP addresses.
 
+ns_ip6_target
+
+	Specifies the IPv6 addresses to use as IPv6 monitoring peers when
+	arp_interval is > 0.  These are the targets of the NS request
+	sent to determine the health of the link to the targets.
+	Specify these values in ffff:ffff::ffff:ffff format.  Multiple IPv6
+	addresses must be separated by a comma.  At least one IPv6
+	address must be given for NS/NA monitoring to function.  The
+	maximum number of targets that can be specified is 16.  The
+	default value is no IPv6 addresses.
+
 arp_validate
 
 	Specifies whether or not ARP probes and replies should be
@@ -421,6 +433,29 @@ arp_all_targets
 		consider the slave up only when all of the arp_ip_targets
 		are reachable
 
+arp_missed_max
+
+	Specifies the number of arp_interval monitor checks that must
+	fail in order for an interface to be marked down by the ARP monitor.
+
+	In order to provide orderly failover semantics, backup interfaces
+	are permitted an extra monitor check (i.e., they must fail
+	arp_missed_max + 1 times before being marked down).
+
+	The default value is 2, and the allowable range is 1 - 255.
+
+coupled_control
+
+    Specifies whether the LACP state machine's MUX in the 802.3ad mode
+    should have separate Collecting and Distributing states.
+
+    This is by implementing the independent control state machine per
+    IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled control
+    state machine.
+
+    The default value is 1. This setting does not separate the Collecting
+    and Distributing states, maintaining the bond in coupled control.
+
 downdelay
 
 	Specifies the time, in milliseconds, to wait before disabling
@@ -501,6 +536,18 @@ fail_over_mac
 	This option was added in bonding version 3.2.0.  The "follow"
 	policy was added in bonding version 3.3.0.
 
+lacp_active
+	Option specifying whether to send LACPDU frames periodically.
+
+	off or 0
+		LACPDU frames acts as "speak when spoken to".
+
+	on or 1
+		LACPDU frames are sent along the configured links
+		periodically. See lacp_rate for more details.
+
+	The default is on.
+
 lacp_rate
 
 	Option specifying the rate in which we'll ask our link partner
@@ -531,7 +578,8 @@ miimon
 	link monitoring.  A value of 100 is a good starting point.
 	The use_carrier option, below, affects how the link state is
 	determined.  See the High Availability section for additional
-	information.  The default value is 0.
+	information.  The default value is 100 if arp_interval is not
+	set.
 
 min_links
 
@@ -740,10 +788,22 @@ peer_notif_delay
 	Specify the delay, in milliseconds, between each peer
 	notification (gratuitous ARP and unsolicited IPv6 Neighbor
 	Advertisement) when they are issued after a failover event.
-	This delay should be a multiple of the link monitor interval
-	(arp_interval or miimon, whichever is active). The default
-	value is 0 which means to match the value of the link monitor
-	interval.
+	This delay should be a multiple of the MII link monitor interval
+	(miimon).
+
+	The valid range is 0 - 300000. The default value is 0, which means
+	to match the value of the MII link monitor interval.
+
+prio
+	Slave priority. A higher number means higher priority.
+	The primary slave has the highest priority. This option also
+	follows the primary_reselect rules.
+
+	This option could only be configured via netlink, and is only valid
+	for active-backup(1), balance-tlb (5) and balance-alb (6) mode.
+	The valid value range is a signed 32 bit integer.
+
+	The default value is 0.
 
 primary
 
@@ -800,7 +860,7 @@ primary_reselect
 tlb_dynamic_lb
 
 	Specifies if dynamic shuffling of flows is enabled in tlb
-	mode. The value has no effect on any other modes.
+	or alb mode. The value has no effect on any other modes.
 
 	The default behavior of tlb mode is to shuffle active flows across
 	slaves based on the load in that interval. This gives nice lb
@@ -859,7 +919,7 @@ xmit_hash_policy
 		Uses XOR of hardware MAC addresses and packet type ID
 		field to generate the hash. The formula is
 
-		hash = source MAC XOR destination MAC XOR packet type ID
+		hash = source MAC[5] XOR destination MAC[5] XOR packet type ID
 		slave number = hash modulo slave count
 
 		This algorithm will place all traffic to a particular
@@ -875,7 +935,7 @@ xmit_hash_policy
 		Uses XOR of hardware MAC addresses and IP addresses to
 		generate the hash.  The formula is
 
-		hash = source MAC XOR destination MAC XOR packet type ID
+		hash = source MAC[5] XOR destination MAC[5] XOR packet type ID
 		hash = hash XOR source IP XOR destination IP
 		hash = hash XOR (hash RSHIFT 16)
 		hash = hash XOR (hash RSHIFT 8)
@@ -910,6 +970,7 @@ xmit_hash_policy
 		hash = hash XOR source IP XOR destination IP
 		hash = hash XOR (hash RSHIFT 16)
 		hash = hash XOR (hash RSHIFT 8)
+		hash = hash RSHIFT 1
 		And then hash is reduced modulo slave count.
 
 		If the protocol is IPv6 then the source and destination
@@ -951,6 +1012,19 @@ xmit_hash_policy
 		packets will be distributed according to the encapsulated
 		flows.
 
+	vlan+srcmac
+
+		This policy uses a very rudimentary vlan ID and source mac
+		hash to load-balance traffic per-vlan, with failover
+		should one leg fail. The intended use case is for a bond
+		shared by multiple virtual machines, all configured to
+		use their own vlan, to give lacp-like functionality
+		without requiring lacp-capable switching hardware.
+
+		The formula for the hash is simply
+
+		hash = (vlan ID) XOR (source MAC vendor) XOR (source MAC dev)
+
 	The default value is layer2.  This option was added in bonding
 	version 2.6.3.  In earlier versions of bonding, this parameter
 	does not exist, and the layer2 policy is the only policy.  The
@@ -1574,7 +1648,7 @@ your init script::
 -----------------------------------------
 
 This section applies to distros which use /etc/network/interfaces file
-to describe network interface configuration, most notably Debian and it's
+to describe network interface configuration, most notably Debian and its
 derivatives.
 
 The ifup and ifdown commands on Debian don't support bonding out of
@@ -1889,7 +1963,7 @@ obtain its hardware address from the first slave, which might not
 match the hardware address of the VLAN interfaces (which was
 ultimately copied from an earlier slave).
 
-There are two methods to insure that the VLAN device operates
+There are two methods to ensure that the VLAN device operates
 with the correct hardware address if all slaves are removed from a
 bond interface:
 
@@ -1923,15 +1997,6 @@ uses the response as an indication that the link is operating.  This
 gives some assurance that traffic is actually flowing to and from one
 or more peers on the local network.
 
-The ARP monitor relies on the device driver itself to verify
-that traffic is flowing.  In particular, the driver must keep up to
-date the last receive time, dev->last_rx.  Drivers that use NETIF_F_LLTX
-flag must also update netdev_queue->trans_start.  If they do not, then the
-ARP monitor will immediately fail any slaves using that driver, and
-those slaves will stay down.  If networking monitoring (tcpdump, etc)
-shows the ARP requests and replies on the network, then it may be that
-your device driver is not updating last_rx and trans_start.
-
 7.2 Configuring Multiple ARP Targets
 ------------------------------------
 
@@ -1975,7 +2040,7 @@ netif_carrier.
 If use_carrier is 0, then the MII monitor will first query the
 device's (via ioctl) MII registers and check the link state.  If that
 request fails (not just that it returns carrier down), then the MII
-monitor will make an ethtool ETHOOL_GLINK request to attempt to obtain
+monitor will make an ethtool ETHTOOL_GLINK request to attempt to obtain
 the same information.  If both methods fail (i.e., the driver either
 does not support or had some error in processing both the MII register
 and ethtool requests), then the MII monitor will assume the link is
@@ -2013,7 +2078,7 @@ as an unsolicited ARP reply (because ARP matches replies on an
 interface basis), and is discarded.  The MII monitor is not affected
 by the state of the routing table.
 
-The solution here is simply to insure that slaves do not have
+The solution here is simply to ensure that slaves do not have
 routes of their own, and if for some reason they must, those routes do
 not supersede routes of their master.  This should generally be the
 case, but unusual configurations or errant manual or automatic static
@@ -2230,7 +2295,7 @@ active-backup:
 	the switches have an ISL and play together well.  If the
 	network configuration is such that one switch is specifically
 	a backup switch (e.g., has lower capacity, higher cost, etc),
-	then the primary option can be used to insure that the
+	then the primary option can be used to ensure that the
 	preferred link is always used when it is available.
 
 broadcast:
@@ -2257,7 +2322,7 @@ monitor can provide a higher level of reliability in detecting end to
 end connectivity failures (which may be caused by the failure of any
 individual component to pass traffic for any reason).  Additionally,
 the ARP monitor should be configured with multiple targets (at least
-one for each switch in the network).  This will insure that,
+one for each switch in the network).  This will ensure that,
 regardless of which switch is active, the ARP monitor has a suitable
 target to query.
 
@@ -2851,6 +2916,17 @@ from the bond (``ifenslave -d bond0 eth0``). The bonding driver will
 then restore the MAC addresses that the slaves had before they were
 enslaved.
 
+9.  What bonding modes support native XDP?
+------------------------------------------
+
+  * balance-rr (0)
+  * active-backup (1)
+  * balance-xor (2)
+  * 802.3ad (4)
+
+Note that the vlan+srcmac hash policy does not support native XDP.
+For other bonding modes, the XDP program must be loaded with generic mode.
+
 16. Resources and Links
 =======================
 
@@ -2860,17 +2936,6 @@ version of the linux kernel, found on http://kernel.org
 The latest version of this document can be found in the latest kernel
 source (named Documentation/networking/bonding.rst).
 
-Discussions regarding the usage of the bonding driver take place on the
-bonding-devel mailing list, hosted at sourceforge.net. If you have questions or
-problems, post them to the list.  The list address is:
-
-bonding-devel@lists.sourceforge.net
-
-The administrative interface (to subscribe or unsubscribe) can
-be found at:
-
-https://lists.sourceforge.net/lists/listinfo/bonding-devel
-
 Discussions regarding the development of the bonding driver take place
 on the main Linux network mailing list, hosted at vger.kernel.org. The list
 address is:
@@ -2881,10 +2946,3 @@ The administrative interface (to subscribe or unsubscribe) can
 be found at:
 
 http://vger.kernel.org/vger-lists.html#netdev
-
-Donald Becker's Ethernet Drivers and diag programs may be found at :
-
- - http://web.archive.org/web/%2E/http://www.scyld.com/network/
-
-You will also find a lot of information regarding Ethernet, NWay, MII,
-etc. at www.scyld.com.
diff --git a/Documentation/networking/bridge.rst b/Documentation/networking/bridge.rst
index 4aef9cddde2f..ef8b73e157b2 100644
--- a/Documentation/networking/bridge.rst
+++ b/Documentation/networking/bridge.rst
@@ -4,18 +4,332 @@
 Ethernet Bridging
 =================
 
-In order to use the Ethernet bridging functionality, you'll need the
-userspace tools.
+Introduction
+============
 
-Documentation for Linux bridging is on:
-   http://www.linuxfoundation.org/collaborate/workgroups/networking/bridge
+The IEEE 802.1Q-2022 (Bridges and Bridged Networks) standard defines the
+operation of bridges in computer networks. A bridge, in the context of this
+standard, is a device that connects two or more network segments and operates
+at the data link layer (Layer 2) of the OSI (Open Systems Interconnection)
+model. The purpose of a bridge is to filter and forward frames between
+different segments based on the destination MAC (Media Access Control) address.
 
-The bridge-utilities are maintained at:
-   git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/bridge-utils.git
+Bridge kAPI
+===========
 
-Additionally, the iproute2 utilities can be used to configure
-bridge devices.
+Here are some core structures of bridge code. Note that the kAPI is *unstable*,
+and can be changed at any time.
 
-If you still have questions, don't hesitate to post to the mailing list 
-(more info https://lists.linux-foundation.org/mailman/listinfo/bridge).
+.. kernel-doc:: net/bridge/br_private.h
+   :identifiers: net_bridge_vlan
 
+Bridge uAPI
+===========
+
+Modern Linux bridge uAPI is accessed via Netlink interface. You can find
+below files where the bridge and bridge port netlink attributes are defined.
+
+Bridge netlink attributes
+-------------------------
+
+.. kernel-doc:: include/uapi/linux/if_link.h
+   :doc: Bridge enum definition
+
+Bridge port netlink attributes
+------------------------------
+
+.. kernel-doc:: include/uapi/linux/if_link.h
+   :doc: Bridge port enum definition
+
+Bridge sysfs
+------------
+
+The sysfs interface is deprecated and should not be extended if new
+options are added.
+
+STP
+===
+
+The STP (Spanning Tree Protocol) implementation in the Linux bridge driver
+is a critical feature that helps prevent loops and broadcast storms in
+Ethernet networks by identifying and disabling redundant links. In a Linux
+bridge context, STP is crucial for network stability and availability.
+
+STP is a Layer 2 protocol that operates at the Data Link Layer of the OSI
+model. It was originally developed as IEEE 802.1D and has since evolved into
+multiple versions, including Rapid Spanning Tree Protocol (RSTP) and
+`Multiple Spanning Tree Protocol (MSTP)
+<https://lore.kernel.org/netdev/20220316150857.2442916-1-tobias@waldekranz.com/>`_.
+
+The 802.1D-2004 removed the original Spanning Tree Protocol, instead
+incorporating the Rapid Spanning Tree Protocol (RSTP). By 2014, all the
+functionality defined by IEEE 802.1D has been incorporated into either
+IEEE 802.1Q (Bridges and Bridged Networks) or IEEE 802.1AC (MAC Service
+Definition). 802.1D has been officially withdrawn in 2022.
+
+Bridge Ports and STP States
+---------------------------
+
+In the context of STP, bridge ports can be in one of the following states:
+  * Blocking: The port is disabled for data traffic and only listens for
+    BPDUs (Bridge Protocol Data Units) from other devices to determine the
+    network topology.
+  * Listening: The port begins to participate in the STP process and listens
+    for BPDUs.
+  * Learning: The port continues to listen for BPDUs and begins to learn MAC
+    addresses from incoming frames but does not forward data frames.
+  * Forwarding: The port is fully operational and forwards both BPDUs and
+    data frames.
+  * Disabled: The port is administratively disabled and does not participate
+    in the STP process. The data frames forwarding are also disabled.
+
+Root Bridge and Convergence
+---------------------------
+
+In the context of networking and Ethernet bridging in Linux, the root bridge
+is a designated switch in a bridged network that serves as a reference point
+for the spanning tree algorithm to create a loop-free topology.
+
+Here's how the STP works and root bridge is chosen:
+  1. Bridge Priority: Each bridge running a spanning tree protocol, has a
+     configurable Bridge Priority value. The lower the value, the higher the
+     priority. By default, the Bridge Priority is set to a standard value
+     (e.g., 32768).
+  2. Bridge ID: The Bridge ID is composed of two components: Bridge Priority
+     and the MAC address of the bridge. It uniquely identifies each bridge
+     in the network. The Bridge ID is used to compare the priorities of
+     different bridges.
+  3. Bridge Election: When the network starts, all bridges initially assume
+     that they are the root bridge. They start advertising Bridge Protocol
+     Data Units (BPDU) to their neighbors, containing their Bridge ID and
+     other information.
+  4. BPDU Comparison: Bridges exchange BPDUs to determine the root bridge.
+     Each bridge examines the received BPDUs, including the Bridge Priority
+     and Bridge ID, to determine if it should adjust its own priorities.
+     The bridge with the lowest Bridge ID will become the root bridge.
+  5. Root Bridge Announcement: Once the root bridge is determined, it sends
+     BPDUs with information about the root bridge to all other bridges in the
+     network. This information is used by other bridges to calculate the
+     shortest path to the root bridge and, in doing so, create a loop-free
+     topology.
+  6. Forwarding Ports: After the root bridge is selected and the spanning tree
+     topology is established, each bridge determines which of its ports should
+     be in the forwarding state (used for data traffic) and which should be in
+     the blocking state (used to prevent loops). The root bridge's ports are
+     all in the forwarding state. while other bridges have some ports in the
+     blocking state to avoid loops.
+  7. Root Ports: After the root bridge is selected and the spanning tree
+     topology is established, each non-root bridge processes incoming
+     BPDUs and determines which of its ports provides the shortest path to the
+     root bridge based on the information in the received BPDUs. This port is
+     designated as the root port. And it is in the Forwarding state, allowing
+     it to actively forward network traffic.
+  8. Designated ports: A designated port is the port through which the non-root
+     bridge will forward traffic towards the designated segment. Designated ports
+     are placed in the Forwarding state. All other ports on the non-root
+     bridge that are not designated for specific segments are placed in the
+     Blocking state to prevent network loops.
+
+STP ensures network convergence by calculating the shortest path and disabling
+redundant links. When network topology changes occur (e.g., a link failure),
+STP recalculates the network topology to restore connectivity while avoiding loops.
+
+Proper configuration of STP parameters, such as the bridge priority, can
+influence network performance, path selection and which bridge becomes the
+Root Bridge.
+
+User space STP helper
+---------------------
+
+The user space STP helper *bridge-stp* is a program to control whether to use
+user mode spanning tree. The ``/sbin/bridge-stp <bridge> <start|stop>`` is
+called by the kernel when STP is enabled/disabled on a bridge
+(via ``brctl stp <bridge> <on|off>`` or ``ip link set <bridge> type bridge
+stp_state <0|1>``).  The kernel enables user_stp mode if that command returns
+0, or enables kernel_stp mode if that command returns any other value.
+
+VLAN
+====
+
+A LAN (Local Area Network) is a network that covers a small geographic area,
+typically within a single building or a campus. LANs are used to connect
+computers, servers, printers, and other networked devices within a localized
+area. LANs can be wired (using Ethernet cables) or wireless (using Wi-Fi).
+
+A VLAN (Virtual Local Area Network) is a logical segmentation of a physical
+network into multiple isolated broadcast domains. VLANs are used to divide
+a single physical LAN into multiple virtual LANs, allowing different groups of
+devices to communicate as if they were on separate physical networks.
+
+Typically there are two VLAN implementations, IEEE 802.1Q and IEEE 802.1ad
+(also known as QinQ). IEEE 802.1Q is a standard for VLAN tagging in Ethernet
+networks. It allows network administrators to create logical VLANs on a
+physical network and tag Ethernet frames with VLAN information, which is
+called *VLAN-tagged frames*. IEEE 802.1ad, commonly known as QinQ or Double
+VLAN, is an extension of the IEEE 802.1Q standard. QinQ allows for the
+stacking of multiple VLAN tags within a single Ethernet frame. The Linux
+bridge supports both the IEEE 802.1Q and `802.1AD
+<https://lore.kernel.org/netdev/1402401565-15423-1-git-send-email-makita.toshiaki@lab.ntt.co.jp/>`_
+protocol for VLAN tagging.
+
+`VLAN filtering <https://lore.kernel.org/netdev/1360792820-14116-1-git-send-email-vyasevic@redhat.com/>`_
+on a bridge is disabled by default. After enabling VLAN filtering on a bridge,
+it will start forwarding frames to appropriate destinations based on their
+destination MAC address and VLAN tag (both must match).
+
+Multicast
+=========
+
+The Linux bridge driver has multicast support allowing it to process Internet
+Group Management Protocol (IGMP) or Multicast Listener Discovery (MLD)
+messages, and to efficiently forward multicast data packets. The bridge
+driver supports IGMPv2/IGMPv3 and MLDv1/MLDv2.
+
+Multicast snooping
+------------------
+
+Multicast snooping is a networking technology that allows network switches
+to intelligently manage multicast traffic within a local area network (LAN).
+
+The switch maintains a multicast group table, which records the association
+between multicast group addresses and the ports where hosts have joined these
+groups. The group table is dynamically updated based on the IGMP/MLD messages
+received. With the multicast group information gathered through snooping, the
+switch optimizes the forwarding of multicast traffic. Instead of blindly
+broadcasting the multicast traffic to all ports, it sends the multicast
+traffic based on the destination MAC address only to ports which have
+subscribed the respective destination multicast group.
+
+When created, the Linux bridge devices have multicast snooping enabled by
+default. It maintains a Multicast forwarding database (MDB) which keeps track
+of port and group relationships.
+
+IGMPv3/MLDv2 EHT support
+------------------------
+
+The Linux bridge supports IGMPv3/MLDv2 EHT (Explicit Host Tracking), which
+was added by `474ddb37fa3a ("net: bridge: multicast: add EHT allow/block handling")
+<https://lore.kernel.org/netdev/20210120145203.1109140-1-razor@blackwall.org/>`_
+
+The explicit host tracking enables the device to keep track of each
+individual host that is joined to a particular group or channel. The main
+benefit of the explicit host tracking in IGMP is to allow minimal leave
+latencies when a host leaves a multicast group or channel.
+
+The length of time between a host wanting to leave and a device stopping
+traffic forwarding is called the IGMP leave latency. A device configured
+with IGMPv3 or MLDv2 and explicit tracking can immediately stop forwarding
+traffic if the last host to request to receive traffic from the device
+indicates that it no longer wants to receive traffic. The leave latency
+is thus bound only by the packet transmission latencies in the multiaccess
+network and the processing time in the device.
+
+Other multicast features
+------------------------
+
+The Linux bridge also supports `per-VLAN multicast snooping
+<https://lore.kernel.org/netdev/20210719170637.435541-1-razor@blackwall.org/>`_,
+which is disabled by default but can be enabled. And `Multicast Router Discovery
+<https://lore.kernel.org/netdev/20190121062628.2710-1-linus.luessing@c0d3.blue/>`_,
+which help identify the location of multicast routers.
+
+Switchdev
+=========
+
+Linux Bridge Switchdev is a feature in the Linux kernel that extends the
+capabilities of the traditional Linux bridge to work more efficiently with
+hardware switches that support switchdev. With Linux Bridge Switchdev, certain
+networking functions like forwarding, filtering, and learning of Ethernet
+frames can be offloaded to a hardware switch. This offloading reduces the
+burden on the Linux kernel and CPU, leading to improved network performance
+and lower latency.
+
+To use Linux Bridge Switchdev, you need hardware switches that support the
+switchdev interface. This means that the switch hardware needs to have the
+necessary drivers and functionality to work in conjunction with the Linux
+kernel.
+
+Please see the :ref:`switchdev` document for more details.
+
+Netfilter
+=========
+
+The bridge netfilter module is a legacy feature that allows to filter bridged
+packets with iptables and ip6tables. Its use is discouraged. Users should
+consider using nftables for packet filtering.
+
+The older ebtables tool is more feature-limited compared to nftables, but
+just like nftables it doesn't need this module either to function.
+
+The br_netfilter module intercepts packets entering the bridge, performs
+minimal sanity tests on ipv4 and ipv6 packets and then pretends that
+these packets are being routed, not bridged. br_netfilter then calls
+the ip and ipv6 netfilter hooks from the bridge layer, i.e. ip(6)tables
+rulesets will also see these packets.
+
+br_netfilter is also the reason for the iptables *physdev* match:
+This match is the only way to reliably tell routed and bridged packets
+apart in an iptables ruleset.
+
+Note that ebtables and nftables will work fine without the br_netfilter module.
+iptables/ip6tables/arptables do not work for bridged traffic because they
+plug in the routing stack. nftables rules in ip/ip6/inet/arp families won't
+see traffic that is forwarded by a bridge either, but that's very much how it
+should be.
+
+Historically the feature set of ebtables was very limited (it still is),
+this module was added to pretend packets are routed and invoke the ipv4/ipv6
+netfilter hooks from the bridge so users had access to the more feature-rich
+iptables matching capabilities (including conntrack). nftables doesn't have
+this limitation, pretty much all features work regardless of the protocol family.
+
+So, br_netfilter is only needed if users, for some reason, need to use
+ip(6)tables to filter packets forwarded by the bridge, or NAT bridged
+traffic. For pure link layer filtering, this module isn't needed.
+
+Other Features
+==============
+
+The Linux bridge also supports `IEEE 802.11 Proxy ARP
+<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=958501163ddd6ea22a98f94fa0e7ce6d4734e5c4>`_,
+`Media Redundancy Protocol (MRP)
+<https://lore.kernel.org/netdev/20200426132208.3232-1-horatiu.vultur@microchip.com/>`_,
+`Media Redundancy Protocol (MRP) LC mode
+<https://lore.kernel.org/r/20201124082525.273820-1-horatiu.vultur@microchip.com>`_,
+`IEEE 802.1X port authentication
+<https://lore.kernel.org/netdev/20220218155148.2329797-1-schultz.hans+netdev@gmail.com/>`_,
+and `MAC Authentication Bypass (MAB)
+<https://lore.kernel.org/netdev/20221101193922.2125323-2-idosch@nvidia.com/>`_.
+
+FAQ
+===
+
+What does a bridge do?
+----------------------
+
+A bridge transparently forwards traffic between multiple network interfaces.
+In plain English this means that a bridge connects two or more physical
+Ethernet networks, to form one larger (logical) Ethernet network.
+
+Is it L3 protocol independent?
+------------------------------
+
+Yes. The bridge sees all frames, but it *uses* only L2 headers/information.
+As such, the bridging functionality is protocol independent, and there should
+be no trouble forwarding IPX, NetBEUI, IP, IPv6, etc.
+
+Contact Info
+============
+
+The code is currently maintained by Roopa Prabhu <roopa@nvidia.com> and
+Nikolay Aleksandrov <razor@blackwall.org>. Bridge bugs and enhancements
+are discussed on the linux-netdev mailing list netdev@vger.kernel.org and
+bridge@lists.linux.dev.
+
+The list is open to anyone interested: http://vger.kernel.org/vger-lists.html#netdev
+
+External Links
+==============
+
+The old Documentation for Linux bridging is on:
+https://wiki.linuxfoundation.org/networking/bridge
diff --git a/Documentation/networking/caif/caif.rst b/Documentation/networking/caif/caif.rst
index a07213030ccf..d922d419c513 100644
--- a/Documentation/networking/caif/caif.rst
+++ b/Documentation/networking/caif/caif.rst
@@ -68,11 +68,10 @@ There are debugfs parameters provided for serial communication.
 * tty_status: Prints the bit-mask tty status information
 
   - 0x01 - tty->warned is on.
-  - 0x02 - tty->low_latency is on.
   - 0x04 - tty->packed is on.
-  - 0x08 - tty->flow_stopped is on.
+  - 0x08 - tty->flow.tco_stopped is on.
   - 0x10 - tty->hw_stopped is on.
-  - 0x20 - tty->stopped is on.
+  - 0x20 - tty->flow.stopped is on.
 
 * last_tx_msg: Binary blob Prints the last transmitted frame.
 
diff --git a/Documentation/networking/caif/index.rst b/Documentation/networking/caif/index.rst
index 86e5b7832ec3..ec29b6f4bdb4 100644
--- a/Documentation/networking/caif/index.rst
+++ b/Documentation/networking/caif/index.rst
@@ -10,4 +10,3 @@ Contents:
 
    linux_caif
    caif
-   spi_porting
diff --git a/Documentation/networking/caif/spi_porting.rst b/Documentation/networking/caif/spi_porting.rst
deleted file mode 100644
index d49f874b20ac..000000000000
--- a/Documentation/networking/caif/spi_porting.rst
+++ /dev/null
@@ -1,229 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-================
-CAIF SPI porting
-================
-
-CAIF SPI basics
-===============
-
-Running CAIF over SPI needs some extra setup, owing to the nature of SPI.
-Two extra GPIOs have been added in order to negotiate the transfers
-between the master and the slave. The minimum requirement for running
-CAIF over SPI is a SPI slave chip and two GPIOs (more details below).
-Please note that running as a slave implies that you need to keep up
-with the master clock. An overrun or underrun event is fatal.
-
-CAIF SPI framework
-==================
-
-To make porting as easy as possible, the CAIF SPI has been divided in
-two parts. The first part (called the interface part) deals with all
-generic functionality such as length framing, SPI frame negotiation
-and SPI frame delivery and transmission. The other part is the CAIF
-SPI slave device part, which is the module that you have to write if
-you want to run SPI CAIF on a new hardware. This part takes care of
-the physical hardware, both with regard to SPI and to GPIOs.
-
-- Implementing a CAIF SPI device:
-
-	- Functionality provided by the CAIF SPI slave device:
-
-	In order to implement a SPI device you will, as a minimum,
-	need to implement the following
-	functions:
-
-	::
-
-	    int (*init_xfer) (struct cfspi_xfer * xfer, struct cfspi_dev *dev):
-
-	This function is called by the CAIF SPI interface to give
-	you a chance to set up your hardware to be ready to receive
-	a stream of data from the master. The xfer structure contains
-	both physical and logical addresses, as well as the total length
-	of the transfer in both directions.The dev parameter can be used
-	to map to different CAIF SPI slave devices.
-
-	::
-
-	    void (*sig_xfer) (bool xfer, struct cfspi_dev *dev):
-
-	This function is called by the CAIF SPI interface when the output
-	(SPI_INT) GPIO needs to change state. The boolean value of the xfer
-	variable indicates whether the GPIO should be asserted (HIGH) or
-	deasserted (LOW). The dev parameter can be used to map to different CAIF
-	SPI slave devices.
-
-	- Functionality provided by the CAIF SPI interface:
-
-	::
-
-	    void (*ss_cb) (bool assert, struct cfspi_ifc *ifc);
-
-	This function is called by the CAIF SPI slave device in order to
-	signal a change of state of the input GPIO (SS) to the interface.
-	Only active edges are mandatory to be reported.
-	This function can be called from IRQ context (recommended in order
-	not to introduce latency). The ifc parameter should be the pointer
-	returned from the platform probe function in the SPI device structure.
-
-	::
-
-	    void (*xfer_done_cb) (struct cfspi_ifc *ifc);
-
-	This function is called by the CAIF SPI slave device in order to
-	report that a transfer is completed. This function should only be
-	called once both the transmission and the reception are completed.
-	This function can be called from IRQ context (recommended in order
-	not to introduce latency). The ifc parameter should be the pointer
-	returned from the platform probe function in the SPI device structure.
-
-	- Connecting the bits and pieces:
-
-		- Filling in the SPI slave device structure:
-
-		  Connect the necessary callback functions.
-
-		  Indicate clock speed (used to calculate toggle delays).
-
-		  Chose a suitable name (helps debugging if you use several CAIF
-		  SPI slave devices).
-
-		  Assign your private data (can be used to map to your
-		  structure).
-
-		- Filling in the SPI slave platform device structure:
-
-		  Add name of driver to connect to ("cfspi_sspi").
-
-		  Assign the SPI slave device structure as platform data.
-
-Padding
-=======
-
-In order to optimize throughput, a number of SPI padding options are provided.
-Padding can be enabled independently for uplink and downlink transfers.
-Padding can be enabled for the head, the tail and for the total frame size.
-The padding needs to be correctly configured on both sides of the link.
-The padding can be changed via module parameters in cfspi_sspi.c or via
-the sysfs directory of the cfspi_sspi driver (before device registration).
-
-- CAIF SPI device template::
-
-    /*
-    *	Copyright (C) ST-Ericsson AB 2010
-    *	Author: Daniel Martensson / Daniel.Martensson@stericsson.com
-    *	License terms: GNU General Public License (GPL), version 2.
-    *
-    */
-
-    #include <linux/init.h>
-    #include <linux/module.h>
-    #include <linux/device.h>
-    #include <linux/wait.h>
-    #include <linux/interrupt.h>
-    #include <linux/dma-mapping.h>
-    #include <net/caif/caif_spi.h>
-
-    MODULE_LICENSE("GPL");
-
-    struct sspi_struct {
-	    struct cfspi_dev sdev;
-	    struct cfspi_xfer *xfer;
-    };
-
-    static struct sspi_struct slave;
-    static struct platform_device slave_device;
-
-    static irqreturn_t sspi_irq(int irq, void *arg)
-    {
-	    /* You only need to trigger on an edge to the active state of the
-	    * SS signal. Once a edge is detected, the ss_cb() function should be
-	    * called with the parameter assert set to true. It is OK
-	    * (and even advised) to call the ss_cb() function in IRQ context in
-	    * order not to add any delay. */
-
-	    return IRQ_HANDLED;
-    }
-
-    static void sspi_complete(void *context)
-    {
-	    /* Normally the DMA or the SPI framework will call you back
-	    * in something similar to this. The only thing you need to
-	    * do is to call the xfer_done_cb() function, providing the pointer
-	    * to the CAIF SPI interface. It is OK to call this function
-	    * from IRQ context. */
-    }
-
-    static int sspi_init_xfer(struct cfspi_xfer *xfer, struct cfspi_dev *dev)
-    {
-	    /* Store transfer info. For a normal implementation you should
-	    * set up your DMA here and make sure that you are ready to
-	    * receive the data from the master SPI. */
-
-	    struct sspi_struct *sspi = (struct sspi_struct *)dev->priv;
-
-	    sspi->xfer = xfer;
-
-	    return 0;
-    }
-
-    void sspi_sig_xfer(bool xfer, struct cfspi_dev *dev)
-    {
-	    /* If xfer is true then you should assert the SPI_INT to indicate to
-	    * the master that you are ready to receive the data from the master
-	    * SPI. If xfer is false then you should de-assert SPI_INT to indicate
-	    * that the transfer is done.
-	    */
-
-	    struct sspi_struct *sspi = (struct sspi_struct *)dev->priv;
-    }
-
-    static void sspi_release(struct device *dev)
-    {
-	    /*
-	    * Here you should release your SPI device resources.
-	    */
-    }
-
-    static int __init sspi_init(void)
-    {
-	    /* Here you should initialize your SPI device by providing the
-	    * necessary functions, clock speed, name and private data. Once
-	    * done, you can register your device with the
-	    * platform_device_register() function. This function will return
-	    * with the CAIF SPI interface initialized. This is probably also
-	    * the place where you should set up your GPIOs, interrupts and SPI
-	    * resources. */
-
-	    int res = 0;
-
-	    /* Initialize slave device. */
-	    slave.sdev.init_xfer = sspi_init_xfer;
-	    slave.sdev.sig_xfer = sspi_sig_xfer;
-	    slave.sdev.clk_mhz = 13;
-	    slave.sdev.priv = &slave;
-	    slave.sdev.name = "spi_sspi";
-	    slave_device.dev.release = sspi_release;
-
-	    /* Initialize platform device. */
-	    slave_device.name = "cfspi_sspi";
-	    slave_device.dev.platform_data = &slave.sdev;
-
-	    /* Register platform device. */
-	    res = platform_device_register(&slave_device);
-	    if (res) {
-		    printk(KERN_WARNING "sspi_init: failed to register dev.\n");
-		    return -ENODEV;
-	    }
-
-	    return res;
-    }
-
-    static void __exit sspi_exit(void)
-    {
-	    platform_device_del(&slave_device);
-    }
-
-    module_init(sspi_init);
-    module_exit(sspi_exit);
diff --git a/Documentation/networking/can.rst b/Documentation/networking/can.rst
index ff05cbd05e0d..b018ce346392 100644
--- a/Documentation/networking/can.rst
+++ b/Documentation/networking/can.rst
@@ -168,7 +168,7 @@ reflect the correct [#f1]_ traffic on the node the loopback of the sent
 data has to be performed right after a successful transmission. If
 the CAN network interface is not capable of performing the loopback for
 some reason the SocketCAN core can do this task as a fallback solution.
-See :ref:`socketcan-local-loopback1` for details (recommended).
+See :ref:`socketcan-local-loopback2` for details (recommended).
 
 The loopback functionality is enabled by default to reflect standard
 networking behaviour for CAN applications. Due to some requests from
@@ -228,20 +228,36 @@ send(2), sendto(2), sendmsg(2) and the recv* counterpart operations
 on the socket as usual. There are also CAN specific socket options
 described below.
 
-The basic CAN frame structure and the sockaddr structure are defined
-in include/linux/can.h:
+The Classical CAN frame structure (aka CAN 2.0B), the CAN FD frame structure
+and the sockaddr structure are defined in include/linux/can.h:
 
 .. code-block:: C
 
     struct can_frame {
             canid_t can_id;  /* 32 bit CAN_ID + EFF/RTR/ERR flags */
-            __u8    can_dlc; /* frame payload length in byte (0 .. 8) */
+            union {
+                    /* CAN frame payload length in byte (0 .. CAN_MAX_DLEN)
+                     * was previously named can_dlc so we need to carry that
+                     * name for legacy support
+                     */
+                    __u8 len;
+                    __u8 can_dlc; /* deprecated */
+            };
             __u8    __pad;   /* padding */
             __u8    __res0;  /* reserved / padding */
-            __u8    __res1;  /* reserved / padding */
+            __u8    len8_dlc; /* optional DLC for 8 byte payload length (9 .. 15) */
             __u8    data[8] __attribute__((aligned(8)));
     };
 
+Remark: The len element contains the payload length in bytes and should be
+used instead of can_dlc. The deprecated can_dlc was misleadingly named as
+it always contained the plain payload length in bytes and not the so called
+'data length code' (DLC).
+
+To pass the raw DLC from/to a Classical CAN network device the len8_dlc
+element can contain values 9 .. 15 when the len element is 8 (the real
+payload length for all DLC values greater or equal to 8).
+
 The alignment of the (linear) payload data[] to a 64bit boundary
 allows the user to define their own structs and unions to easily access
 the CAN payload. There is no given byteorder on the CAN bus by
@@ -260,6 +276,23 @@ PF_PACKET socket, that also binds to a specific interface:
                     /* transport protocol class address info (e.g. ISOTP) */
                     struct { canid_t rx_id, tx_id; } tp;
 
+                    /* J1939 address information */
+                    struct {
+                            /* 8 byte name when using dynamic addressing */
+                            __u64 name;
+
+                            /* pgn:
+                             * 8 bit: PS in PDU2 case, else 0
+                             * 8 bit: PF
+                             * 1 bit: DP
+                             * 1 bit: reserved
+                             */
+                            __u32 pgn;
+
+                            /* 1 byte address */
+                            __u8 addr;
+                    } j1939;
+
                     /* reserved for future CAN protocols address information */
             } can_addr;
     };
@@ -371,7 +404,7 @@ kernel interfaces (ABI) which heavily rely on the CAN frame with fixed eight
 bytes of payload (struct can_frame) like the CAN_RAW socket. Therefore e.g.
 the CAN_RAW socket supports a new socket option CAN_RAW_FD_FRAMES that
 switches the socket into a mode that allows the handling of CAN FD frames
-and (legacy) CAN frames simultaneously (see :ref:`socketcan-rawfd`).
+and Classical CAN frames simultaneously (see :ref:`socketcan-rawfd`).
 
 The struct canfd_frame is defined in include/linux/can.h:
 
@@ -397,7 +430,7 @@ code (DLC) of the struct can_frame was used as a length information as the
 length and the DLC has a 1:1 mapping in the range of 0 .. 8. To preserve
 the easy handling of the length information the canfd_frame.len element
 contains a plain length value from 0 .. 64. So both canfd_frame.len and
-can_frame.can_dlc are equal and contain a length information and no DLC.
+can_frame.len are equal and contain a length information and no DLC.
 For details about the distinction of CAN and CAN FD capable devices and
 the mapping to the bus-relevant data length code (DLC), see :ref:`socketcan-can-fd-driver`.
 
@@ -407,10 +440,28 @@ definitions are specified for CAN specific MTUs in include/linux/can.h:
 
 .. code-block:: C
 
-  #define CAN_MTU   (sizeof(struct can_frame))   == 16  => 'legacy' CAN frame
+  #define CAN_MTU   (sizeof(struct can_frame))   == 16  => Classical CAN frame
   #define CANFD_MTU (sizeof(struct canfd_frame)) == 72  => CAN FD frame
 
 
+Returned Message Flags
+----------------------
+
+When using the system call recvmsg(2) on a RAW or a BCM socket, the
+msg->msg_flags field may contain the following flags:
+
+MSG_DONTROUTE:
+	set when the received frame was created on the local host.
+
+MSG_CONFIRM:
+	set when the frame was sent via the socket it is received on.
+	This flag can be interpreted as a 'transmission confirmation' when the
+	CAN driver supports the echo of frames on driver level, see
+	:ref:`socketcan-local-loopback1` and :ref:`socketcan-local-loopback2`.
+	(Note: In order to receive such messages on a RAW socket,
+	CAN_RAW_RECV_OWN_MSGS must be set.)
+
+
 .. _socketcan-raw-sockets:
 
 RAW Protocol Sockets with can_filters (SOCK_RAW)
@@ -575,6 +626,8 @@ demand:
     setsockopt(s, SOL_CAN_RAW, CAN_RAW_RECV_OWN_MSGS,
                &recv_own_msgs, sizeof(recv_own_msgs));
 
+Note that reception of a socket's own CAN frames are subject to the same
+filtering as other CAN frames (see :ref:`socketcan-rawfilter`).
 
 .. _socketcan-rawfd:
 
@@ -609,7 +662,7 @@ Example:
             printf("got CAN FD frame with length %d\n", cfd.len);
             /* cfd.flags contains valid data */
     } else if (nbytes == CAN_MTU) {
-            printf("got legacy CAN frame with length %d\n", cfd.len);
+            printf("got Classical CAN frame with length %d\n", cfd.len);
             /* cfd.flags is undefined */
     } else {
             fprintf(stderr, "read: invalid CAN(FD) frame\n");
@@ -623,7 +676,7 @@ Example:
             printf("%02X ", cfd.data[i]);
 
 When reading with size CANFD_MTU only returns CAN_MTU bytes that have
-been received from the socket a legacy CAN frame has been read into the
+been received from the socket a Classical CAN frame has been read into the
 provided CAN FD structure. Note that the canfd_frame.flags data field is
 not specified in the struct can_frame and therefore it is only valid in
 CANFD_MTU sized CAN FD frames.
@@ -633,7 +686,7 @@ Implementation hint for new CAN applications:
 To build a CAN FD aware application use struct canfd_frame as basic CAN
 data structure for CAN_RAW based applications. When the application is
 executed on an older Linux kernel and switching the CAN_RAW_FD_FRAMES
-socket option returns an error: No problem. You'll get legacy CAN frames
+socket option returns an error: No problem. You'll get Classical CAN frames
 or CAN FD frames and can process them the same way.
 
 When sending to CAN devices make sure that the device is capable to handle
@@ -646,10 +699,10 @@ RAW socket option CAN_RAW_JOIN_FILTERS
 
 The CAN_RAW socket can set multiple CAN identifier specific filters that
 lead to multiple filters in the af_can.c filter processing. These filters
-are indenpendent from each other which leads to logical OR'ed filters when
+are independent from each other which leads to logical OR'ed filters when
 applied (see :ref:`socketcan-rawfilter`).
 
-This socket option joines the given CAN filters in the way that only CAN
+This socket option joins the given CAN filters in the way that only CAN
 frames are passed to user space that matched *all* given CAN filters. The
 semantic for the applied filters is therefore changed to a logical AND.
 
@@ -658,22 +711,6 @@ where the CAN_INV_FILTER flag is set in order to notch single CAN IDs or
 CAN ID ranges from the incoming traffic.
 
 
-RAW Socket Returned Message Flags
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-When using recvmsg() call, the msg->msg_flags may contain following flags:
-
-MSG_DONTROUTE:
-	set when the received frame was created on the local host.
-
-MSG_CONFIRM:
-	set when the frame was sent via the socket it is received on.
-	This flag can be interpreted as a 'transmission confirmation' when the
-	CAN driver supports the echo of frames on driver level, see
-	:ref:`socketcan-local-loopback1` and :ref:`socketcan-local-loopback2`.
-	In order to receive such messages, CAN_RAW_RECV_OWN_MSGS must be set.
-
-
 Broadcast Manager Protocol Sockets (SOCK_DGRAM)
 -----------------------------------------------
 
@@ -842,6 +879,8 @@ TX_RESET_MULTI_IDX:
 RX_RTR_FRAME:
 	Send reply for RTR-request (placed in op->frames[0]).
 
+CAN_FD_FRAME:
+	The CAN frames following the bcm_msg_head are struct canfd_frame's
 
 Broadcast Manager Transmission Timers
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -894,7 +933,7 @@ ival1:
 ival2:
 	Throttle the received message rate down to the value of ival2. This
 	is useful to reduce messages for the application when the signal inside the
-	CAN frame is stateless as state changes within the ival2 periode may get
+	CAN frame is stateless as state changes within the ival2 period may get
 	lost.
 
 Broadcast Manager Multiplex Message Receive Filter
@@ -1026,7 +1065,7 @@ Additional procfs files in /proc/net/can::
 
     stats       - SocketCAN core statistics (rx/tx frames, match ratios, ...)
     reset_stats - manual statistic reset
-    version     - prints the SocketCAN core version and the ABI version
+    version     - prints SocketCAN core and ABI version (removed in Linux 5.10)
 
 
 Writing Own CAN Protocol Modules
@@ -1070,7 +1109,7 @@ General Settings
     dev->type  = ARPHRD_CAN; /* the netdevice hardware type */
     dev->flags = IFF_NOARP;  /* CAN has no arp */
 
-    dev->mtu = CAN_MTU; /* sizeof(struct can_frame) -> legacy CAN interface */
+    dev->mtu = CAN_MTU; /* sizeof(struct can_frame) -> Classical CAN interface */
 
     or alternative, when the controller supports CAN with flexible data rate:
     dev->mtu = CANFD_MTU; /* sizeof(struct canfd_frame) -> CAN FD interface */
@@ -1111,6 +1150,39 @@ tuning on deep embedded systems'. The author is running a MPC603e
 load without any problems ...
 
 
+Switchable Termination Resistors
+--------------------------------
+
+CAN bus requires a specific impedance across the differential pair,
+typically provided by two 120Ohm resistors on the farthest nodes of
+the bus. Some CAN controllers support activating / deactivating a
+termination resistor(s) to provide the correct impedance.
+
+Query the available resistances::
+
+    $ ip -details link show can0
+    ...
+    termination 120 [ 0, 120 ]
+
+Activate the terminating resistor::
+
+    $ ip link set dev can0 type can termination 120
+
+Deactivate the terminating resistor::
+
+    $ ip link set dev can0 type can termination 0
+
+To enable termination resistor support to a can-controller, either
+implement in the controller's struct can-priv::
+
+    termination_const
+    termination_const_cnt
+    do_set_termination
+
+or add gpio control with the device tree entries from
+Documentation/devicetree/bindings/net/can/can-controller.yaml
+
+
 The Virtual CAN Driver (vcan)
 -----------------------------
 
@@ -1184,6 +1256,7 @@ Setting CAN device properties::
         [ fd { on | off } ]
         [ fd-non-iso { on | off } ]
         [ presume-ack { on | off } ]
+        [ cc-len8-dlc { on | off } ]
 
         [ restart-ms TIME-MS ]
         [ restart ]
@@ -1326,22 +1399,22 @@ arbitration phase and the payload phase of the CAN FD frame. Therefore a
 second bit timing has to be specified in order to enable the CAN FD bitrate.
 
 Additionally CAN FD capable CAN controllers support up to 64 bytes of
-payload. The representation of this length in can_frame.can_dlc and
+payload. The representation of this length in can_frame.len and
 canfd_frame.len for userspace applications and inside the Linux network
 layer is a plain value from 0 .. 64 instead of the CAN 'data length code'.
-The data length code was a 1:1 mapping to the payload length in the legacy
+The data length code was a 1:1 mapping to the payload length in the Classical
 CAN frames anyway. The payload length to the bus-relevant DLC mapping is
 only performed inside the CAN drivers, preferably with the helper
-functions can_dlc2len() and can_len2dlc().
+functions can_fd_dlc2len() and can_fd_len2dlc().
 
 The CAN netdevice driver capabilities can be distinguished by the network
 devices maximum transfer unit (MTU)::
 
-  MTU = 16 (CAN_MTU)   => sizeof(struct can_frame)   => 'legacy' CAN device
+  MTU = 16 (CAN_MTU)   => sizeof(struct can_frame)   => Classical CAN device
   MTU = 72 (CANFD_MTU) => sizeof(struct canfd_frame) => CAN FD capable device
 
 The CAN device MTU can be retrieved e.g. with a SIOCGIFMTU ioctl() syscall.
-N.B. CAN FD capable devices can also handle and send legacy CAN frames.
+N.B. CAN FD capable devices can also handle and send Classical CAN frames.
 
 When configuring CAN FD capable CAN controllers an additional 'data' bitrate
 has to be set. This bitrate for the data phase of the CAN FD frame has to be
diff --git a/Documentation/networking/can_ucan_protocol.rst b/Documentation/networking/can_ucan_protocol.rst
index 638ac1ee7914..935d872ae87c 100644
--- a/Documentation/networking/can_ucan_protocol.rst
+++ b/Documentation/networking/can_ucan_protocol.rst
@@ -50,7 +50,7 @@ Setup Packet
 ``wIndex``         USB Interface Index (0 for device commands)
 ``wLength``        * Host to Device - Number of bytes to transmit
                    * Device to Host - Maximum Number of bytes to
-                     receive. If the device send less. Commom ZLP
+                     receive. If the device send less. Common ZLP
                      semantics are used.
 =================  =====================================================
 
diff --git a/Documentation/networking/cdc_mbim.rst b/Documentation/networking/cdc_mbim.rst
index 0048409c06b4..8404a3f794f3 100644
--- a/Documentation/networking/cdc_mbim.rst
+++ b/Documentation/networking/cdc_mbim.rst
@@ -51,7 +51,7 @@ Such userspace applications includes, but are not limited to:
  - mbimcli (included with the libmbim [3] library), and
  - ModemManager [4]
 
-Establishing a MBIM IP session reequires at least these actions by the
+Establishing a MBIM IP session requires at least these actions by the
 management application:
 
  - open the control channel
@@ -93,7 +93,7 @@ MBIM function can be looked up using sysfs.  For example::
 USB configuration descriptors
 -----------------------------
 The wMaxControlMessage field of the CDC MBIM functional descriptor
-limits the maximum control message size. The managament application is
+limits the maximum control message size. The management application is
 responsible for negotiating a control message size complying with the
 requirements in section 9.3.1 of [1], taking this descriptor field
 into consideration.
diff --git a/Documentation/networking/cops.rst b/Documentation/networking/cops.rst
deleted file mode 100644
index 964ba80599a9..000000000000
--- a/Documentation/networking/cops.rst
+++ /dev/null
@@ -1,80 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-========================================
-The COPS LocalTalk Linux driver (cops.c)
-========================================
-
-By Jay Schulist <jschlst@samba.org>
-
-This driver has two modes and they are: Dayna mode and Tangent mode.
-Each mode corresponds with the type of card. It has been found
-that there are 2 main types of cards and all other cards are
-the same and just have different names or only have minor differences
-such as more IO ports. As this driver is tested it will
-become more clear exactly what cards are supported.
-
-Right now these cards are known to work with the COPS driver. The
-LT-200 cards work in a somewhat more limited capacity than the
-DL200 cards, which work very well and are in use by many people.
-
-TANGENT driver mode:
-	- Tangent ATB-II, Novell NL-1000, Daystar Digital LT-200
-
-DAYNA driver mode:
-	- Dayna DL2000/DaynaTalk PC (Half Length), COPS LT-95,
-	- Farallon PhoneNET PC III, Farallon PhoneNET PC II
-
-Other cards possibly supported mode unknown though:
-	- Dayna DL2000 (Full length)
-
-The COPS driver defaults to using Dayna mode. To change the driver's
-mode if you built a driver with dual support use board_type=1 or
-board_type=2 for Dayna or Tangent with insmod.
-
-Operation/loading of the driver
-===============================
-
-Use modprobe like this:	/sbin/modprobe cops.o (IO #) (IRQ #)
-If you do not specify any options the driver will try and use the IO = 0x240,
-IRQ = 5. As of right now I would only use IRQ 5 for the card, if autoprobing.
-
-To load multiple COPS driver Localtalk cards you can do one of the following::
-
-	insmod cops io=0x240 irq=5
-	insmod -o cops2 cops io=0x260 irq=3
-
-Or in lilo.conf put something like this::
-
-	append="ether=5,0x240,lt0 ether=3,0x260,lt1"
-
-Then bring up the interface with ifconfig. It will look something like this::
-
-  lt0       Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-F7-00-00-00-00-00-00-00-00
-	    inet addr:192.168.1.2  Bcast:192.168.1.255  Mask:255.255.255.0
-	    UP BROADCAST RUNNING NOARP MULTICAST  MTU:600  Metric:1
-	    RX packets:0 errors:0 dropped:0 overruns:0 frame:0
-	    TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 coll:0
-
-Netatalk Configuration
-======================
-
-You will need to configure atalkd with something like the following to make
-it work with the cops.c driver.
-
-* For single LTalk card use::
-
-    dummy -seed -phase 2 -net 2000 -addr 2000.10 -zone "1033"
-    lt0 -seed -phase 1 -net 1000 -addr 1000.50 -zone "1033"
-
-* For multiple cards, Ethernet and LocalTalk::
-
-    eth0 -seed -phase 2 -net 3000 -addr 3000.20 -zone "1033"
-    lt0 -seed -phase 1 -net 1000 -addr 1000.50 -zone "1033"
-
-* For multiple LocalTalk cards, and an Ethernet card.
-
-* Order seems to matter here, Ethernet last::
-
-    lt0 -seed -phase 1 -net 1000 -addr 1000.10 -zone "LocalTalk1"
-    lt1 -seed -phase 1 -net 2000 -addr 2000.20 -zone "LocalTalk2"
-    eth0 -seed -phase 2 -net 3000 -addr 3000.30 -zone "EtherTalk"
diff --git a/Documentation/networking/dccp.rst b/Documentation/networking/dccp.rst
deleted file mode 100644
index dde16be04456..000000000000
--- a/Documentation/networking/dccp.rst
+++ /dev/null
@@ -1,216 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=============
-DCCP protocol
-=============
-
-
-.. Contents
-   - Introduction
-   - Missing features
-   - Socket options
-   - Sysctl variables
-   - IOCTLs
-   - Other tunables
-   - Notes
-
-
-Introduction
-============
-Datagram Congestion Control Protocol (DCCP) is an unreliable, connection
-oriented protocol designed to solve issues present in UDP and TCP, particularly
-for real-time and multimedia (streaming) traffic.
-It divides into a base protocol (RFC 4340) and pluggable congestion control
-modules called CCIDs. Like pluggable TCP congestion control, at least one CCID
-needs to be enabled in order for the protocol to function properly. In the Linux
-implementation, this is the TCP-like CCID2 (RFC 4341). Additional CCIDs, such as
-the TCP-friendly CCID3 (RFC 4342), are optional.
-For a brief introduction to CCIDs and suggestions for choosing a CCID to match
-given applications, see section 10 of RFC 4340.
-
-It has a base protocol and pluggable congestion control IDs (CCIDs).
-
-DCCP is a Proposed Standard (RFC 2026), and the homepage for DCCP as a protocol
-is at http://www.ietf.org/html.charters/dccp-charter.html
-
-
-Missing features
-================
-The Linux DCCP implementation does not currently support all the features that are
-specified in RFCs 4340...42.
-
-The known bugs are at:
-
-	http://www.linuxfoundation.org/collaborate/workgroups/networking/todo#DCCP
-
-For more up-to-date versions of the DCCP implementation, please consider using
-the experimental DCCP test tree; instructions for checking this out are on:
-http://www.linuxfoundation.org/collaborate/workgroups/networking/dccp_testing#Experimental_DCCP_source_tree
-
-
-Socket options
-==============
-DCCP_SOCKOPT_QPOLICY_ID sets the dequeuing policy for outgoing packets. It takes
-a policy ID as argument and can only be set before the connection (i.e. changes
-during an established connection are not supported). Currently, two policies are
-defined: the "simple" policy (DCCPQ_POLICY_SIMPLE), which does nothing special,
-and a priority-based variant (DCCPQ_POLICY_PRIO). The latter allows to pass an
-u32 priority value as ancillary data to sendmsg(), where higher numbers indicate
-a higher packet priority (similar to SO_PRIORITY). This ancillary data needs to
-be formatted using a cmsg(3) message header filled in as follows::
-
-	cmsg->cmsg_level = SOL_DCCP;
-	cmsg->cmsg_type	 = DCCP_SCM_PRIORITY;
-	cmsg->cmsg_len	 = CMSG_LEN(sizeof(uint32_t));	/* or CMSG_LEN(4) */
-
-DCCP_SOCKOPT_QPOLICY_TXQLEN sets the maximum length of the output queue. A zero
-value is always interpreted as unbounded queue length. If different from zero,
-the interpretation of this parameter depends on the current dequeuing policy
-(see above): the "simple" policy will enforce a fixed queue size by returning
-EAGAIN, whereas the "prio" policy enforces a fixed queue length by dropping the
-lowest-priority packet first. The default value for this parameter is
-initialised from /proc/sys/net/dccp/default/tx_qlen.
-
-DCCP_SOCKOPT_SERVICE sets the service. The specification mandates use of
-service codes (RFC 4340, sec. 8.1.2); if this socket option is not set,
-the socket will fall back to 0 (which means that no meaningful service code
-is present). On active sockets this is set before connect(); specifying more
-than one code has no effect (all subsequent service codes are ignored). The
-case is different for passive sockets, where multiple service codes (up to 32)
-can be set before calling bind().
-
-DCCP_SOCKOPT_GET_CUR_MPS is read-only and retrieves the current maximum packet
-size (application payload size) in bytes, see RFC 4340, section 14.
-
-DCCP_SOCKOPT_AVAILABLE_CCIDS is also read-only and returns the list of CCIDs
-supported by the endpoint. The option value is an array of type uint8_t whose
-size is passed as option length. The minimum array size is 4 elements, the
-value returned in the optlen argument always reflects the true number of
-built-in CCIDs.
-
-DCCP_SOCKOPT_CCID is write-only and sets both the TX and RX CCIDs at the same
-time, combining the operation of the next two socket options. This option is
-preferable over the latter two, since often applications will use the same
-type of CCID for both directions; and mixed use of CCIDs is not currently well
-understood. This socket option takes as argument at least one uint8_t value, or
-an array of uint8_t values, which must match available CCIDS (see above). CCIDs
-must be registered on the socket before calling connect() or listen().
-
-DCCP_SOCKOPT_TX_CCID is read/write. It returns the current CCID (if set) or sets
-the preference list for the TX CCID, using the same format as DCCP_SOCKOPT_CCID.
-Please note that the getsockopt argument type here is ``int``, not uint8_t.
-
-DCCP_SOCKOPT_RX_CCID is analogous to DCCP_SOCKOPT_TX_CCID, but for the RX CCID.
-
-DCCP_SOCKOPT_SERVER_TIMEWAIT enables the server (listening socket) to hold
-timewait state when closing the connection (RFC 4340, 8.3). The usual case is
-that the closing server sends a CloseReq, whereupon the client holds timewait
-state. When this boolean socket option is on, the server sends a Close instead
-and will enter TIMEWAIT. This option must be set after accept() returns.
-
-DCCP_SOCKOPT_SEND_CSCOV and DCCP_SOCKOPT_RECV_CSCOV are used for setting the
-partial checksum coverage (RFC 4340, sec. 9.2). The default is that checksums
-always cover the entire packet and that only fully covered application data is
-accepted by the receiver. Hence, when using this feature on the sender, it must
-be enabled at the receiver, too with suitable choice of CsCov.
-
-DCCP_SOCKOPT_SEND_CSCOV sets the sender checksum coverage. Values in the
-	range 0..15 are acceptable. The default setting is 0 (full coverage),
-	values between 1..15 indicate partial coverage.
-
-DCCP_SOCKOPT_RECV_CSCOV is for the receiver and has a different meaning: it
-	sets a threshold, where again values 0..15 are acceptable. The default
-	of 0 means that all packets with a partial coverage will be discarded.
-	Values in the range 1..15 indicate that packets with minimally such a
-	coverage value are also acceptable. The higher the number, the more
-	restrictive this setting (see [RFC 4340, sec. 9.2.1]). Partial coverage
-	settings are inherited to the child socket after accept().
-
-The following two options apply to CCID 3 exclusively and are getsockopt()-only.
-In either case, a TFRC info struct (defined in <linux/tfrc.h>) is returned.
-
-DCCP_SOCKOPT_CCID_RX_INFO
-	Returns a ``struct tfrc_rx_info`` in optval; the buffer for optval and
-	optlen must be set to at least sizeof(struct tfrc_rx_info).
-
-DCCP_SOCKOPT_CCID_TX_INFO
-	Returns a ``struct tfrc_tx_info`` in optval; the buffer for optval and
-	optlen must be set to at least sizeof(struct tfrc_tx_info).
-
-On unidirectional connections it is useful to close the unused half-connection
-via shutdown (SHUT_WR or SHUT_RD): this will reduce per-packet processing costs.
-
-
-Sysctl variables
-================
-Several DCCP default parameters can be managed by the following sysctls
-(sysctl net.dccp.default or /proc/sys/net/dccp/default):
-
-request_retries
-	The number of active connection initiation retries (the number of
-	Requests minus one) before timing out. In addition, it also governs
-	the behaviour of the other, passive side: this variable also sets
-	the number of times DCCP repeats sending a Response when the initial
-	handshake does not progress from RESPOND to OPEN (i.e. when no Ack
-	is received after the initial Request).  This value should be greater
-	than 0, suggested is less than 10. Analogue of tcp_syn_retries.
-
-retries1
-	How often a DCCP Response is retransmitted until the listening DCCP
-	side considers its connecting peer dead. Analogue of tcp_retries1.
-
-retries2
-	The number of times a general DCCP packet is retransmitted. This has
-	importance for retransmitted acknowledgments and feature negotiation,
-	data packets are never retransmitted. Analogue of tcp_retries2.
-
-tx_ccid = 2
-	Default CCID for the sender-receiver half-connection. Depending on the
-	choice of CCID, the Send Ack Vector feature is enabled automatically.
-
-rx_ccid = 2
-	Default CCID for the receiver-sender half-connection; see tx_ccid.
-
-seq_window = 100
-	The initial sequence window (sec. 7.5.2) of the sender. This influences
-	the local ackno validity and the remote seqno validity windows (7.5.1).
-	Values in the range Wmin = 32 (RFC 4340, 7.5.2) up to 2^32-1 can be set.
-
-tx_qlen = 5
-	The size of the transmit buffer in packets. A value of 0 corresponds
-	to an unbounded transmit buffer.
-
-sync_ratelimit = 125 ms
-	The timeout between subsequent DCCP-Sync packets sent in response to
-	sequence-invalid packets on the same socket (RFC 4340, 7.5.4). The unit
-	of this parameter is milliseconds; a value of 0 disables rate-limiting.
-
-
-IOCTLS
-======
-FIONREAD
-	Works as in udp(7): returns in the ``int`` argument pointer the size of
-	the next pending datagram in bytes, or 0 when no datagram is pending.
-
-
-Other tunables
-==============
-Per-route rto_min support
-	CCID-2 supports the RTAX_RTO_MIN per-route setting for the minimum value
-	of the RTO timer. This setting can be modified via the 'rto_min' option
-	of iproute2; for example::
-
-		> ip route change 10.0.0.0/24   rto_min 250j dev wlan0
-		> ip route add    10.0.0.254/32 rto_min 800j dev wlan0
-		> ip route show dev wlan0
-
-	CCID-3 also supports the rto_min setting: it is used to define the lower
-	bound for the expiry of the nofeedback timer. This can be useful on LANs
-	with very low RTTs (e.g., loopback, Gbit ethernet).
-
-
-Notes
-=====
-DCCP does not travel through NAT successfully at present on many boxes. This is
-because the checksum covers the pseudo-header as per TCP and UDP. Linux NAT
-support for DCCP has been added.
diff --git a/Documentation/networking/decnet.rst b/Documentation/networking/decnet.rst
deleted file mode 100644
index b8bc11ff8370..000000000000
--- a/Documentation/networking/decnet.rst
+++ /dev/null
@@ -1,243 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=========================================
-Linux DECnet Networking Layer Information
-=========================================
-
-1. Other documentation....
-==========================
-
-   - Project Home Pages
-     - http://www.chygwyn.com/				   - Kernel info
-     - http://linux-decnet.sourceforge.net/                - Userland tools
-     - http://www.sourceforge.net/projects/linux-decnet/   - Status page
-
-2. Configuring the kernel
-=========================
-
-Be sure to turn on the following options:
-
-    - CONFIG_DECNET (obviously)
-    - CONFIG_PROC_FS (to see what's going on)
-    - CONFIG_SYSCTL (for easy configuration)
-
-if you want to try out router support (not properly debugged yet)
-you'll need the following options as well...
-
-    - CONFIG_DECNET_ROUTER (to be able to add/delete routes)
-    - CONFIG_NETFILTER (will be required for the DECnet routing daemon)
-
-Don't turn on SIOCGIFCONF support for DECnet unless you are really sure
-that you need it, in general you won't and it can cause ifconfig to
-malfunction.
-
-Run time configuration has changed slightly from the 2.4 system. If you
-want to configure an endnode, then the simplified procedure is as follows:
-
- - Set the MAC address on your ethernet card before starting _any_ other
-   network protocols.
-
-As soon as your network card is brought into the UP state, DECnet should
-start working. If you need something more complicated or are unsure how
-to set the MAC address, see the next section. Also all configurations which
-worked with 2.4 will work under 2.5 with no change.
-
-3. Command line options
-=======================
-
-You can set a DECnet address on the kernel command line for compatibility
-with the 2.4 configuration procedure, but in general it's not needed any more.
-If you do st a DECnet address on the command line, it has only one purpose
-which is that its added to the addresses on the loopback device.
-
-With 2.4 kernels, DECnet would only recognise addresses as local if they
-were added to the loopback device. In 2.5, any local interface address
-can be used to loop back to the local machine. Of course this does not
-prevent you adding further addresses to the loopback device if you
-want to.
-
-N.B. Since the address list of an interface determines the addresses for
-which "hello" messages are sent, if you don't set an address on the loopback
-interface then you won't see any entries in /proc/net/neigh for the local
-host until such time as you start a connection. This doesn't affect the
-operation of the local communications in any other way though.
-
-The kernel command line takes options looking like the following::
-
-    decnet.addr=1,2
-
-the two numbers are the node address 1,2 = 1.2 For 2.2.xx kernels
-and early 2.3.xx kernels, you must use a comma when specifying the
-DECnet address like this. For more recent 2.3.xx kernels, you may
-use almost any character except space, although a `.` would be the most
-obvious choice :-)
-
-There used to be a third number specifying the node type. This option
-has gone away in favour of a per interface node type. This is now set
-using /proc/sys/net/decnet/conf/<dev>/forwarding. This file can be
-set with a single digit, 0=EndNode, 1=L1 Router and  2=L2 Router.
-
-There are also equivalent options for modules. The node address can
-also be set through the /proc/sys/net/decnet/ files, as can other system
-parameters.
-
-Currently the only supported devices are ethernet and ip_gre. The
-ethernet address of your ethernet card has to be set according to the DECnet
-address of the node in order for it to be autoconfigured (and then appear in
-/proc/net/decnet_dev). There is a utility available at the above
-FTP sites called dn2ethaddr which can compute the correct ethernet
-address to use. The address can be set by ifconfig either before or
-at the time the device is brought up. If you are using RedHat you can
-add the line::
-
-    MACADDR=AA:00:04:00:03:04
-
-or something similar, to /etc/sysconfig/network-scripts/ifcfg-eth0 or
-wherever your network card's configuration lives. Setting the MAC address
-of your ethernet card to an address starting with "hi-ord" will cause a
-DECnet address which matches to be added to the interface (which you can
-verify with iproute2).
-
-The default device for routing can be set through the /proc filesystem
-by setting /proc/sys/net/decnet/default_device to the
-device you want DECnet to route packets out of when no specific route
-is available. Usually this will be eth0, for example::
-
-    echo -n "eth0" >/proc/sys/net/decnet/default_device
-
-If you don't set the default device, then it will default to the first
-ethernet card which has been autoconfigured as described above. You can
-confirm that by looking in the default_device file of course.
-
-There is a list of what the other files under /proc/sys/net/decnet/ do
-on the kernel patch web site (shown above).
-
-4. Run time kernel configuration
-================================
-
-
-This is either done through the sysctl/proc interface (see the kernel web
-pages for details on what the various options do) or through the iproute2
-package in the same way as IPv4/6 configuration is performed.
-
-Documentation for iproute2 is included with the package, although there is
-as yet no specific section on DECnet, most of the features apply to both
-IP and DECnet, albeit with DECnet addresses instead of IP addresses and
-a reduced functionality.
-
-If you want to configure a DECnet router you'll need the iproute2 package
-since its the _only_ way to add and delete routes currently. Eventually
-there will be a routing daemon to send and receive routing messages for
-each interface and update the kernel routing tables accordingly. The
-routing daemon will use netfilter to listen to routing packets, and
-rtnetlink to update the kernels routing tables.
-
-The DECnet raw socket layer has been removed since it was there purely
-for use by the routing daemon which will now use netfilter (a much cleaner
-and more generic solution) instead.
-
-5. How can I tell if its working?
-=================================
-
-Here is a quick guide of what to look for in order to know if your DECnet
-kernel subsystem is working.
-
-   - Is the node address set (see /proc/sys/net/decnet/node_address)
-   - Is the node of the correct type
-     (see /proc/sys/net/decnet/conf/<dev>/forwarding)
-   - Is the Ethernet MAC address of each Ethernet card set to match
-     the DECnet address. If in doubt use the dn2ethaddr utility available
-     at the ftp archive.
-   - If the previous two steps are satisfied, and the Ethernet card is up,
-     you should find that it is listed in /proc/net/decnet_dev and also
-     that it appears as a directory in /proc/sys/net/decnet/conf/. The
-     loopback device (lo) should also appear and is required to communicate
-     within a node.
-   - If you have any DECnet routers on your network, they should appear
-     in /proc/net/decnet_neigh, otherwise this file will only contain the
-     entry for the node itself (if it doesn't check to see if lo is up).
-   - If you want to send to any node which is not listed in the
-     /proc/net/decnet_neigh file, you'll need to set the default device
-     to point to an Ethernet card with connection to a router. This is
-     again done with the /proc/sys/net/decnet/default_device file.
-   - Try starting a simple server and client, like the dnping/dnmirror
-     over the loopback interface. With luck they should communicate.
-     For this step and those after, you'll need the DECnet library
-     which can be obtained from the above ftp sites as well as the
-     actual utilities themselves.
-   - If this seems to work, then try talking to a node on your local
-     network, and see if you can obtain the same results.
-   - At this point you are on your own... :-)
-
-6. How to send a bug report
-===========================
-
-If you've found a bug and want to report it, then there are several things
-you can do to help me work out exactly what it is that is wrong. Useful
-information (_most_ of which _is_ _essential_) includes:
-
- - What kernel version are you running ?
- - What version of the patch are you running ?
- - How far though the above set of tests can you get ?
- - What is in the /proc/decnet* files and /proc/sys/net/decnet/* files ?
- - Which services are you running ?
- - Which client caused the problem ?
- - How much data was being transferred ?
- - Was the network congested ?
- - How can the problem be reproduced ?
- - Can you use tcpdump to get a trace ? (N.B. Most (all?) versions of
-   tcpdump don't understand how to dump DECnet properly, so including
-   the hex listing of the packet contents is _essential_, usually the -x flag.
-   You may also need to increase the length grabbed with the -s flag. The
-   -e flag also provides very useful information (ethernet MAC addresses))
-
-7. MAC FAQ
-==========
-
-A quick FAQ on ethernet MAC addresses to explain how Linux and DECnet
-interact and how to get the best performance from your hardware.
-
-Ethernet cards are designed to normally only pass received network frames
-to a host computer when they are addressed to it, or to the broadcast address.
-
-Linux has an interface which allows the setting of extra addresses for
-an ethernet card to listen to. If the ethernet card supports it, the
-filtering operation will be done in hardware, if not the extra unwanted packets
-received will be discarded by the host computer. In the latter case,
-significant processor time and bus bandwidth can be used up on a busy
-network (see the NAPI documentation for a longer explanation of these
-effects).
-
-DECnet makes use of this interface to allow running DECnet on an ethernet
-card which has already been configured using TCP/IP (presumably using the
-built in MAC address of the card, as usual) and/or to allow multiple DECnet
-addresses on each physical interface. If you do this, be aware that if your
-ethernet card doesn't support perfect hashing in its MAC address filter
-then your computer will be doing more work than required. Some cards
-will simply set themselves into promiscuous mode in order to receive
-packets from the DECnet specified addresses. So if you have one of these
-cards its better to set the MAC address of the card as described above
-to gain the best efficiency. Better still is to use a card which supports
-NAPI as well.
-
-
-8. Mailing list
-===============
-
-If you are keen to get involved in development, or want to ask questions
-about configuration, or even just report bugs, then there is a mailing
-list that you can join, details are at:
-
-http://sourceforge.net/mail/?group_id=4993
-
-9. Legal Info
-=============
-
-The Linux DECnet project team have placed their code under the GPL. The
-software is provided "as is" and without warranty express or implied.
-DECnet is a trademark of Compaq. This software is not a product of
-Compaq. We acknowledge the help of people at Compaq in providing extra
-documentation above and beyond what was previously publicly available.
-
-Steve Whitehouse <SteveW@ACM.org>
-
diff --git a/Documentation/networking/cxacru-cf.py b/Documentation/networking/device_drivers/atm/cxacru-cf.py
index b41d298398c8..b41d298398c8 100644
--- a/Documentation/networking/cxacru-cf.py
+++ b/Documentation/networking/device_drivers/atm/cxacru-cf.py
diff --git a/Documentation/networking/cxacru.rst b/Documentation/networking/device_drivers/atm/cxacru.rst
index 6088af2ffeda..6088af2ffeda 100644
--- a/Documentation/networking/cxacru.rst
+++ b/Documentation/networking/device_drivers/atm/cxacru.rst
diff --git a/Documentation/networking/fore200e.rst b/Documentation/networking/device_drivers/atm/fore200e.rst
index 55df9ec09ac8..55df9ec09ac8 100644
--- a/Documentation/networking/fore200e.rst
+++ b/Documentation/networking/device_drivers/atm/fore200e.rst
diff --git a/Documentation/networking/device_drivers/atm/index.rst b/Documentation/networking/device_drivers/atm/index.rst
new file mode 100644
index 000000000000..7b593f031a60
--- /dev/null
+++ b/Documentation/networking/device_drivers/atm/index.rst
@@ -0,0 +1,20 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Asynchronous Transfer Mode (ATM) Device Drivers
+===============================================
+
+Contents:
+
+.. toctree::
+   :maxdepth: 2
+
+   cxacru
+   fore200e
+   iphase
+
+.. only::  subproject and html
+
+   Indices
+   =======
+
+   * :ref:`genindex`
diff --git a/Documentation/networking/iphase.rst b/Documentation/networking/device_drivers/atm/iphase.rst
index 92d9b757d75a..388c7101e2cb 100644
--- a/Documentation/networking/iphase.rst
+++ b/Documentation/networking/device_drivers/atm/iphase.rst
@@ -4,7 +4,7 @@
 ATM (i)Chip IA Linux Driver Source
 ==================================
 
-			      READ ME FISRT
+			      READ ME FIRST
 
 --------------------------------------------------------------------------------
 
diff --git a/Documentation/networking/device_drivers/can/can327.rst b/Documentation/networking/device_drivers/can/can327.rst
new file mode 100644
index 000000000000..b87bfbe5d51c
--- /dev/null
+++ b/Documentation/networking/device_drivers/can/can327.rst
@@ -0,0 +1,331 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-3-Clause)
+
+can327: ELM327 driver for Linux SocketCAN
+==========================================
+
+Authors
+--------
+
+Max Staudt <max@enpas.org>
+
+
+
+Motivation
+-----------
+
+This driver aims to lower the initial cost for hackers interested in
+working with CAN buses.
+
+CAN adapters are expensive, few, and far between.
+ELM327 interfaces are cheap and plentiful.
+Let's use ELM327s as CAN adapters.
+
+
+
+Introduction
+-------------
+
+This driver is an effort to turn abundant ELM327 based OBD interfaces
+into full fledged (as far as possible) CAN interfaces.
+
+Since the ELM327 was never meant to be a stand alone CAN controller,
+the driver has to switch between its modes as quickly as possible in
+order to fake full-duplex operation.
+
+As such, can327 is a best effort driver. However, this is more than
+enough to implement simple request-response protocols (such as OBD II),
+and to monitor broadcast messages on a bus (such as in a vehicle).
+
+Most ELM327s come as nondescript serial devices, attached via USB or
+Bluetooth. The driver cannot recognize them by itself, and as such it
+is up to the user to attach it in form of a TTY line discipline
+(similar to PPP, SLIP, slcan, ...).
+
+This driver is meant for ELM327 versions 1.4b and up, see below for
+known limitations in older controllers and clones.
+
+
+
+Data sheet
+-----------
+
+The official data sheets can be found at ELM electronics' home page:
+
+  https://www.elmelectronics.com/
+
+
+
+How to attach the line discipline
+----------------------------------
+
+Every ELM327 chip is factory programmed to operate at a serial setting
+of 38400 baud/s, 8 data bits, no parity, 1 stopbit.
+
+If you have kept this default configuration, the line discipline can
+be attached on a command prompt as follows::
+
+    sudo ldattach \
+           --debug \
+           --speed 38400 \
+           --eightbits \
+           --noparity \
+           --onestopbit \
+           --iflag -ICRNL,INLCR,-IXOFF \
+           30 \
+           /dev/ttyUSB0
+
+To change the ELM327's serial settings, please refer to its data
+sheet. This needs to be done before attaching the line discipline.
+
+Once the ldisc is attached, the CAN interface starts out unconfigured.
+Set the speed before starting it::
+
+    # The interface needs to be down to change parameters
+    sudo ip link set can0 down
+    sudo ip link set can0 type can bitrate 500000
+    sudo ip link set can0 up
+
+500000 bit/s is a common rate for OBD-II diagnostics.
+If you're connecting straight to a car's OBD port, this is the speed
+that most cars (but not all!) expect.
+
+After this, you can set out as usual with candump, cansniffer, etc.
+
+
+
+How to check the controller version
+------------------------------------
+
+Use a terminal program to attach to the controller.
+
+After issuing the "``AT WS``" command, the controller will respond with
+its version::
+
+    >AT WS
+
+
+    ELM327 v1.4b
+
+    >
+
+Note that clones may claim to be any version they like.
+It is not indicative of their actual feature set.
+
+
+
+
+Communication example
+----------------------
+
+This is a short and incomplete introduction on how to talk to an ELM327.
+It is here to guide understanding of the controller's and the driver's
+limitation (listed below) as well as manual testing.
+
+
+The ELM327 has two modes:
+
+- Command mode
+- Reception mode
+
+In command mode, it expects one command per line, terminated by CR.
+By default, the prompt is a "``>``", after which a command can be
+entered::
+
+    >ATE1
+    OK
+    >
+
+The init script in the driver switches off several configuration options
+that are only meaningful in the original OBD scenario the chip is meant
+for, and are actually a hindrance for can327.
+
+
+When a command is not recognized, such as by an older version of the
+ELM327, a question mark is printed as a response instead of OK::
+
+    >ATUNKNOWN
+    ?
+    >
+
+At present, can327 does not evaluate this response. See the section
+below on known limitations for details.
+
+
+When a CAN frame is to be sent, the target address is configured, after
+which the frame is sent as a command that consists of the data's hex
+dump::
+
+    >ATSH123
+    OK
+    >DEADBEEF12345678
+    OK
+    >
+
+The above interaction sends the SFF frame "``DE AD BE EF 12 34 56 78``"
+with (11 bit) CAN ID ``0x123``.
+For this to function, the controller must be configured for SFF sending
+mode (using "``AT PB``", see code or datasheet).
+
+
+Once a frame has been sent and wait-for-reply mode is on (``ATR1``,
+configured on ``listen-only=off``), or when the reply timeout expires
+and the driver sets the controller into monitoring mode (``ATMA``),
+the ELM327 will send one line for each received CAN frame, consisting
+of CAN ID, DLC, and data::
+
+    123 8 DEADBEEF12345678
+
+For EFF (29 bit) CAN frames, the address format is slightly different,
+which can327 uses to tell the two apart::
+
+    12 34 56 78 8 DEADBEEF12345678
+
+The ELM327 will receive both SFF and EFF frames - the current CAN
+config (``ATPB``) does not matter.
+
+
+If the ELM327's internal UART sending buffer runs full, it will abort
+the monitoring mode, print "BUFFER FULL" and drop back into command
+mode. Note that in this case, unlike with other error messages, the
+error message may appear on the same line as the last (usually
+incomplete) data frame::
+
+    12 34 56 78 8 DEADBEEF123 BUFFER FULL
+
+
+
+Known limitations of the controller
+------------------------------------
+
+- Clone devices ("v1.5" and others)
+
+  Sending RTR frames is not supported and will be dropped silently.
+
+  Receiving RTR with DLC 8 will appear to be a regular frame with
+  the last received frame's DLC and payload.
+
+  "``AT CSM``" (CAN Silent Monitoring, i.e. don't send CAN ACKs) is
+  not supported, and is hard coded to ON. Thus, frames are not ACKed
+  while listening: "``AT MA``" (Monitor All) will always be "silent".
+  However, immediately after sending a frame, the ELM327 will be in
+  "receive reply" mode, in which it *does* ACK any received frames.
+  Once the bus goes silent, or an error occurs (such as BUFFER FULL),
+  or the receive reply timeout runs out, the ELM327 will end reply
+  reception mode on its own and can327 will fall back to "``AT MA``"
+  in order to keep monitoring the bus.
+
+  Other limitations may apply, depending on the clone and the quality
+  of its firmware.
+
+
+- All versions
+
+  No full duplex operation is supported. The driver will switch
+  between input/output mode as quickly as possible.
+
+  The length of outgoing RTR frames cannot be set. In fact, some
+  clones (tested with one identifying as "``v1.5``") are unable to
+  send RTR frames at all.
+
+  We don't have a way to get real-time notifications on CAN errors.
+  While there is a command (``AT CS``) to retrieve some basic stats,
+  we don't poll it as it would force us to interrupt reception mode.
+
+
+- Versions prior to 1.4b
+
+  These versions do not send CAN ACKs when in monitoring mode (AT MA).
+  However, they do send ACKs while waiting for a reply immediately
+  after sending a frame. The driver maximizes this time to make the
+  controller as useful as possible.
+
+  Starting with version 1.4b, the ELM327 supports the "``AT CSM``"
+  command, and the "listen-only" CAN option will take effect.
+
+
+- Versions prior to 1.4
+
+  These chips do not support the "``AT PB``" command, and thus cannot
+  change bitrate or SFF/EFF mode on-the-fly. This will have to be
+  programmed by the user before attaching the line discipline. See the
+  data sheet for details.
+
+
+- Versions prior to 1.3
+
+  These chips cannot be used at all with can327. They do not support
+  the "``AT D1``" command, which is necessary to avoid parsing conflicts
+  on incoming data, as well as distinction of RTR frame lengths.
+
+  Specifically, this allows for easy distinction of SFF and EFF
+  frames, and to check whether frames are complete. While it is possible
+  to deduce the type and length from the length of the line the ELM327
+  sends us, this method fails when the ELM327's UART output buffer
+  overruns. It may abort sending in the middle of the line, which will
+  then be mistaken for something else.
+
+
+
+Known limitations of the driver
+--------------------------------
+
+- No 8/7 timing.
+
+  ELM327 can only set CAN bitrates that are of the form 500000/n, where
+  n is an integer divisor.
+  However there is an exception: With a separate flag, it may set the
+  speed to be 8/7 of the speed indicated by the divisor.
+  This mode is not currently implemented.
+
+- No evaluation of command responses.
+
+  The ELM327 will reply with OK when a command is understood, and with ?
+  when it is not. The driver does not currently check this, and simply
+  assumes that the chip understands every command.
+  The driver is built such that functionality degrades gracefully
+  nevertheless. See the section on known limitations of the controller.
+
+- No use of hardware CAN ID filtering
+
+  An ELM327's UART sending buffer will easily overflow on heavy CAN bus
+  load, resulting in the "``BUFFER FULL``" message. Using the hardware
+  filters available through "``AT CF xxx``" and "``AT CM xxx``" would be
+  helpful here, however SocketCAN does not currently provide a facility
+  to make use of such hardware features.
+
+
+
+Rationale behind the chosen configuration
+------------------------------------------
+
+``AT E1``
+  Echo on
+
+  We need this to be able to get a prompt reliably.
+
+``AT S1``
+  Spaces on
+
+  We need this to distinguish 11/29 bit CAN addresses received.
+
+  Note:
+  We can usually do this using the line length (odd/even),
+  but this fails if the line is not transmitted fully to
+  the host (BUFFER FULL).
+
+``AT D1``
+  DLC on
+
+  We need this to tell the "length" of RTR frames.
+
+
+
+A note on CAN bus termination
+------------------------------
+
+Your adapter may have resistors soldered in which are meant to terminate
+the bus. This is correct when it is plugged into a OBD-II socket, but
+not helpful when trying to tap into the middle of an existing CAN bus.
+
+If communications don't work with the adapter connected, check for the
+termination resistors on its PCB and try removing them.
diff --git a/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst b/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst
new file mode 100644
index 000000000000..1661d13174d5
--- /dev/null
+++ b/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst
@@ -0,0 +1,638 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+CTU CAN FD Driver
+=================
+
+Author: Martin Jerabek <martin.jerabek01@gmail.com>
+
+
+About CTU CAN FD IP Core
+------------------------
+
+`CTU CAN FD <https://gitlab.fel.cvut.cz/canbus/ctucanfd_ip_core>`_
+is an open source soft core written in VHDL.
+It originated in 2015 as Ondrej Ille's project
+at the `Department of Measurement <https://meas.fel.cvut.cz/>`_
+of `FEE <http://www.fel.cvut.cz/en/>`_ at `CTU <https://www.cvut.cz/en>`_.
+
+The SocketCAN driver for Xilinx Zynq SoC based MicroZed board
+`Vivado integration <https://gitlab.fel.cvut.cz/canbus/zynq/zynq-can-sja1000-top>`_
+and Intel Cyclone V 5CSEMA4U23C6 based DE0-Nano-SoC Terasic board
+`QSys integration <https://gitlab.fel.cvut.cz/canbus/intel-soc-ctucanfd>`_
+has been developed as well as support for
+`PCIe integration <https://gitlab.fel.cvut.cz/canbus/pcie-ctucanfd>`_ of the core.
+
+In the case of Zynq, the core is connected via the APB system bus, which does
+not have enumeration support, and the device must be specified in Device Tree.
+This kind of devices is called platform device in the kernel and is
+handled by a platform device driver.
+
+The basic functional model of the CTU CAN FD peripheral has been
+accepted into QEMU mainline. See QEMU `CAN emulation support <https://www.qemu.org/docs/master/system/devices/can.html>`_
+for CAN FD buses, host connection and CTU CAN FD core emulation. The development
+version of emulation support can be cloned from ctu-canfd branch of QEMU local
+development `repository <https://gitlab.fel.cvut.cz/canbus/qemu-canbus>`_.
+
+
+About SocketCAN
+---------------
+
+SocketCAN is a standard common interface for CAN devices in the Linux
+kernel. As the name suggests, the bus is accessed via sockets, similarly
+to common network devices. The reasoning behind this is in depth
+described in `Linux SocketCAN <https://www.kernel.org/doc/html/latest/networking/can.html>`_.
+In short, it offers a
+natural way to implement and work with higher layer protocols over CAN,
+in the same way as, e.g., UDP/IP over Ethernet.
+
+Device probe
+~~~~~~~~~~~~
+
+Before going into detail about the structure of a CAN bus device driver,
+let's reiterate how the kernel gets to know about the device at all.
+Some buses, like PCI or PCIe, support device enumeration. That is, when
+the system boots, it discovers all the devices on the bus and reads
+their configuration. The kernel identifies the device via its vendor ID
+and device ID, and if there is a driver registered for this identifier
+combination, its probe method is invoked to populate the driver's
+instance for the given hardware. A similar situation goes with USB, only
+it allows for device hot-plug.
+
+The situation is different for peripherals which are directly embedded
+in the SoC and connected to an internal system bus (AXI, APB, Avalon,
+and others). These buses do not support enumeration, and thus the kernel
+has to learn about the devices from elsewhere. This is exactly what the
+Device Tree was made for.
+
+Device tree
+~~~~~~~~~~~
+
+An entry in device tree states that a device exists in the system, how
+it is reachable (on which bus it resides) and its configuration –
+registers address, interrupts and so on. An example of such a device
+tree is given in .
+
+::
+
+           / {
+               /* ... */
+               amba: amba {
+                   #address-cells = <1>;
+                   #size-cells = <1>;
+                   compatible = "simple-bus";
+
+                   CTU_CAN_FD_0: CTU_CAN_FD@43c30000 {
+                       compatible = "ctu,ctucanfd";
+                       interrupt-parent = <&intc>;
+                       interrupts = <0 30 4>;
+                       clocks = <&clkc 15>;
+                       reg = <0x43c30000 0x10000>;
+                   };
+               };
+           };
+
+
+.. _sec:socketcan:drv:
+
+Driver structure
+~~~~~~~~~~~~~~~~
+
+The driver can be divided into two parts – platform-dependent device
+discovery and set up, and platform-independent CAN network device
+implementation.
+
+.. _sec:socketcan:platdev:
+
+Platform device driver
+^^^^^^^^^^^^^^^^^^^^^^
+
+In the case of Zynq, the core is connected via the AXI system bus, which
+does not have enumeration support, and the device must be specified in
+Device Tree. This kind of devices is called *platform device* in the
+kernel and is handled by a *platform device driver*\  [1]_.
+
+A platform device driver provides the following things:
+
+-  A *probe* function
+
+-  A *remove* function
+
+-  A table of *compatible* devices that the driver can handle
+
+The *probe* function is called exactly once when the device appears (or
+the driver is loaded, whichever happens later). If there are more
+devices handled by the same driver, the *probe* function is called for
+each one of them. Its role is to allocate and initialize resources
+required for handling the device, as well as set up low-level functions
+for the platform-independent layer, e.g., *read_reg* and *write_reg*.
+After that, the driver registers the device to a higher layer, in our
+case as a *network device*.
+
+The *remove* function is called when the device disappears, or the
+driver is about to be unloaded. It serves to free the resources
+allocated in *probe* and to unregister the device from higher layers.
+
+Finally, the table of *compatible* devices states which devices the
+driver can handle. The Device Tree entry ``compatible`` is matched
+against the tables of all *platform drivers*.
+
+.. code:: c
+
+           /* Match table for OF platform binding */
+           static const struct of_device_id ctucan_of_match[] = {
+               { .compatible = "ctu,canfd-2", },
+               { .compatible = "ctu,ctucanfd", },
+               { /* end of list */ },
+           };
+           MODULE_DEVICE_TABLE(of, ctucan_of_match);
+
+           static int ctucan_probe(struct platform_device *pdev);
+           static int ctucan_remove(struct platform_device *pdev);
+
+           static struct platform_driver ctucanfd_driver = {
+               .probe  = ctucan_probe,
+               .remove = ctucan_remove,
+               .driver = {
+                   .name = DRIVER_NAME,
+                   .of_match_table = ctucan_of_match,
+               },
+           };
+           module_platform_driver(ctucanfd_driver);
+
+
+.. _sec:socketcan:netdev:
+
+Network device driver
+^^^^^^^^^^^^^^^^^^^^^
+
+Each network device must support at least these operations:
+
+-  Bring the device up: ``ndo_open``
+
+-  Bring the device down: ``ndo_close``
+
+-  Submit TX frames to the device: ``ndo_start_xmit``
+
+-  Signal TX completion and errors to the network subsystem: ISR
+
+-  Submit RX frames to the network subsystem: ISR and NAPI
+
+There are two possible event sources: the device and the network
+subsystem. Device events are usually signaled via an interrupt, handled
+in an Interrupt Service Routine (ISR). Handlers for the events
+originating in the network subsystem are then specified in
+``struct net_device_ops``.
+
+When the device is brought up, e.g., by calling ``ip link set can0 up``,
+the driver’s function ``ndo_open`` is called. It should validate the
+interface configuration and configure and enable the device. The
+analogous opposite is ``ndo_close``, called when the device is being
+brought down, be it explicitly or implicitly.
+
+When the system should transmit a frame, it does so by calling
+``ndo_start_xmit``, which enqueues the frame into the device. If the
+device HW queue (FIFO, mailboxes or whatever the implementation is)
+becomes full, the ``ndo_start_xmit`` implementation informs the network
+subsystem that it should stop the TX queue (via ``netif_stop_queue``).
+It is then re-enabled later in ISR when the device has some space
+available again and is able to enqueue another frame.
+
+All the device events are handled in ISR, namely:
+
+#. **TX completion**. When the device successfully finishes transmitting
+   a frame, the frame is echoed locally. On error, an informative error
+   frame [2]_ is sent to the network subsystem instead. In both cases,
+   the software TX queue is resumed so that more frames may be sent.
+
+#. **Error condition**. If something goes wrong (e.g., the device goes
+   bus-off or RX overrun happens), error counters are updated, and
+   informative error frames are enqueued to SW RX queue.
+
+#. **RX buffer not empty**. In this case, read the RX frames and enqueue
+   them to SW RX queue. Usually NAPI is used as a middle layer (see ).
+
+.. _sec:socketcan:napi:
+
+NAPI
+~~~~
+
+The frequency of incoming frames can be high and the overhead to invoke
+the interrupt service routine for each frame can cause significant
+system load. There are multiple mechanisms in the Linux kernel to deal
+with this situation. They evolved over the years of Linux kernel
+development and enhancements. For network devices, the current standard
+is NAPI – *the New API*. It is similar to classical top-half/bottom-half
+interrupt handling in that it only acknowledges the interrupt in the ISR
+and signals that the rest of the processing should be done in softirq
+context. On top of that, it offers the possibility to *poll* for new
+frames for a while. This has a potential to avoid the costly round of
+enabling interrupts, handling an incoming IRQ in ISR, re-enabling the
+softirq and switching context back to softirq.
+
+See :ref:`Documentation/networking/napi.rst <napi>` for more information.
+
+Integrating the core to Xilinx Zynq
+-----------------------------------
+
+The core interfaces a simple subset of the Avalon
+(search for Intel **Avalon Interface Specifications**)
+bus as it was originally used on
+Alterra FPGA chips, yet Xilinx natively interfaces with AXI
+(search for ARM **AMBA AXI and ACE Protocol Specification AXI3,
+AXI4, and AXI4-Lite, ACE and ACE-Lite**).
+The most obvious solution would be to use
+an Avalon/AXI bridge or implement some simple conversion entity.
+However, the core’s interface is half-duplex with no handshake
+signaling, whereas AXI is full duplex with two-way signaling. Moreover,
+even AXI-Lite slave interface is quite resource-intensive, and the
+flexibility and speed of AXI are not required for a CAN core.
+
+Thus a much simpler bus was chosen – APB (Advanced Peripheral Bus)
+(search for ARM **AMBA APB Protocol Specification**).
+APB-AXI bridge is directly available in
+Xilinx Vivado, and the interface adaptor entity is just a few simple
+combinatorial assignments.
+
+Finally, to be able to include the core in a block diagram as a custom
+IP, the core, together with the APB interface, has been packaged as a
+Vivado component.
+
+CTU CAN FD Driver design
+------------------------
+
+The general structure of a CAN device driver has already been examined
+in . The next paragraphs provide a more detailed description of the CTU
+CAN FD core driver in particular.
+
+Low-level driver
+~~~~~~~~~~~~~~~~
+
+The core is not intended to be used solely with SocketCAN, and thus it
+is desirable to have an OS-independent low-level driver. This low-level
+driver can then be used in implementations of OS driver or directly
+either on bare metal or in a user-space application. Another advantage
+is that if the hardware slightly changes, only the low-level driver
+needs to be modified.
+
+The code [3]_ is in part automatically generated and in part written
+manually by the core author, with contributions of the thesis’ author.
+The low-level driver supports operations such as: set bit timing, set
+controller mode, enable/disable, read RX frame, write TX frame, and so
+on.
+
+Configuring bit timing
+~~~~~~~~~~~~~~~~~~~~~~
+
+On CAN, each bit is divided into four segments: SYNC, PROP, PHASE1, and
+PHASE2. Their duration is expressed in multiples of a Time Quantum
+(details in `CAN Specification, Version 2.0 <http://esd.cs.ucr.edu/webres/can20.pdf>`_, chapter 8).
+When configuring
+bitrate, the durations of all the segments (and time quantum) must be
+computed from the bitrate and Sample Point. This is performed
+independently for both the Nominal bitrate and Data bitrate for CAN FD.
+
+SocketCAN is fairly flexible and offers either highly customized
+configuration by setting all the segment durations manually, or a
+convenient configuration by setting just the bitrate and sample point
+(and even that is chosen automatically per Bosch recommendation if not
+specified). However, each CAN controller may have different base clock
+frequency and different width of segment duration registers. The
+algorithm thus needs the minimum and maximum values for the durations
+(and clock prescaler) and tries to optimize the numbers to fit both the
+constraints and the requested parameters.
+
+.. code:: c
+
+           struct can_bittiming_const {
+               char name[16];      /* Name of the CAN controller hardware */
+               __u32 tseg1_min;    /* Time segment 1 = prop_seg + phase_seg1 */
+               __u32 tseg1_max;
+               __u32 tseg2_min;    /* Time segment 2 = phase_seg2 */
+               __u32 tseg2_max;
+               __u32 sjw_max;      /* Synchronisation jump width */
+               __u32 brp_min;      /* Bit-rate prescaler */
+               __u32 brp_max;
+               __u32 brp_inc;
+           };
+
+
+[lst:can_bittiming_const]
+
+A curious reader will notice that the durations of the segments PROP_SEG
+and PHASE_SEG1 are not determined separately but rather combined and
+then, by default, the resulting TSEG1 is evenly divided between PROP_SEG
+and PHASE_SEG1. In practice, this has virtually no consequences as the
+sample point is between PHASE_SEG1 and PHASE_SEG2. In CTU CAN FD,
+however, the duration registers ``PROP`` and ``PH1`` have different
+widths (6 and 7 bits, respectively), so the auto-computed values might
+overflow the shorter register and must thus be redistributed among the
+two [4]_.
+
+Handling RX
+~~~~~~~~~~~
+
+Frame reception is handled in NAPI queue, which is enabled from ISR when
+the RXNE (RX FIFO Not Empty) bit is set. Frames are read one by one
+until either no frame is left in the RX FIFO or the maximum work quota
+has been reached for the NAPI poll run (see ). Each frame is then passed
+to the network interface RX queue.
+
+An incoming frame may be either a CAN 2.0 frame or a CAN FD frame. The
+way to distinguish between these two in the kernel is to allocate either
+``struct can_frame`` or ``struct canfd_frame``, the two having different
+sizes. In the controller, the information about the frame type is stored
+in the first word of RX FIFO.
+
+This brings us a chicken-egg problem: we want to allocate the ``skb``
+for the frame, and only if it succeeds, fetch the frame from FIFO;
+otherwise keep it there for later. But to be able to allocate the
+correct ``skb``, we have to fetch the first work of FIFO. There are
+several possible solutions:
+
+#. Read the word, then allocate. If it fails, discard the rest of the
+   frame. When the system is low on memory, the situation is bad anyway.
+
+#. Always allocate ``skb`` big enough for an FD frame beforehand. Then
+   tweak the ``skb`` internals to look like it has been allocated for
+   the smaller CAN 2.0 frame.
+
+#. Add option to peek into the FIFO instead of consuming the word.
+
+#. If the allocation fails, store the read word into driver’s data. On
+   the next try, use the stored word instead of reading it again.
+
+Option 1 is simple enough, but not very satisfying if we could do
+better. Option 2 is not acceptable, as it would require modifying the
+private state of an integral kernel structure. The slightly higher
+memory consumption is just a virtual cherry on top of the “cake”. Option
+3 requires non-trivial HW changes and is not ideal from the HW point of
+view.
+
+Option 4 seems like a good compromise, with its disadvantage being that
+a partial frame may stay in the FIFO for a prolonged time. Nonetheless,
+there may be just one owner of the RX FIFO, and thus no one else should
+see the partial frame (disregarding some exotic debugging scenarios).
+Basides, the driver resets the core on its initialization, so the
+partial frame cannot be “adopted” either. In the end, option 4 was
+selected [5]_.
+
+.. _subsec:ctucanfd:rxtimestamp:
+
+Timestamping RX frames
+^^^^^^^^^^^^^^^^^^^^^^
+
+The CTU CAN FD core reports the exact timestamp when the frame has been
+received. The timestamp is by default captured at the sample point of
+the last bit of EOF but is configurable to be captured at the SOF bit.
+The timestamp source is external to the core and may be up to 64 bits
+wide. At the time of writing, passing the timestamp from kernel to
+userspace is not yet implemented, but is planned in the future.
+
+Handling TX
+~~~~~~~~~~~
+
+The CTU CAN FD core has 4 independent TX buffers, each with its own
+state and priority. When the core wants to transmit, a TX buffer in
+Ready state with the highest priority is selected.
+
+The priorities are 3bit numbers in register TX_PRIORITY
+(nibble-aligned). This should be flexible enough for most use cases.
+SocketCAN, however, supports only one FIFO queue for outgoing
+frames [6]_. The buffer priorities may be used to simulate the FIFO
+behavior by assigning each buffer a distinct priority and *rotating* the
+priorities after a frame transmission is completed.
+
+In addition to priority rotation, the SW must maintain head and tail
+pointers into the FIFO formed by the TX buffers to be able to determine
+which buffer should be used for next frame (``txb_head``) and which
+should be the first completed one (``txb_tail``). The actual buffer
+indices are (obviously) modulo 4 (number of TX buffers), but the
+pointers must be at least one bit wider to be able to distinguish
+between FIFO full and FIFO empty – in this situation,
+:math:`txb\_head \equiv txb\_tail\ (\textrm{mod}\ 4)`. An example of how
+the FIFO is maintained, together with priority rotation, is depicted in
+
+|
+
++------+---+---+---+---+
+| TXB# | 0 | 1 | 2 | 3 |
++======+===+===+===+===+
+| Seq  | A | B | C |   |
++------+---+---+---+---+
+| Prio | 7 | 6 | 5 | 4 |
++------+---+---+---+---+
+|      |   | T |   | H |
++------+---+---+---+---+
+
+|
+
++------+---+---+---+---+
+| TXB# | 0 | 1 | 2 | 3 |
++======+===+===+===+===+
+| Seq  |   | B | C |   |
++------+---+---+---+---+
+| Prio | 4 | 7 | 6 | 5 |
++------+---+---+---+---+
+|      |   | T |   | H |
++------+---+---+---+---+
+
+|
+
++------+---+---+---+---+----+
+| TXB# | 0 | 1 | 2 | 3 | 0’ |
++======+===+===+===+===+====+
+| Seq  | E | B | C | D |    |
++------+---+---+---+---+----+
+| Prio | 4 | 7 | 6 | 5 |    |
++------+---+---+---+---+----+
+|      |   | T |   |   | H  |
++------+---+---+---+---+----+
+
+|
+
+.. kernel-figure:: fsm_txt_buffer_user.svg
+
+   TX Buffer states with possible transitions
+
+.. _subsec:ctucanfd:txtimestamp:
+
+Timestamping TX frames
+^^^^^^^^^^^^^^^^^^^^^^
+
+When submitting a frame to a TX buffer, one may specify the timestamp at
+which the frame should be transmitted. The frame transmission may start
+later, but not sooner. Note that the timestamp does not participate in
+buffer prioritization – that is decided solely by the mechanism
+described above.
+
+Support for time-based packet transmission was recently merged to Linux
+v4.19 `Time-based packet transmission <https://lwn.net/Articles/748879/>`_,
+but it remains yet to be researched
+whether this functionality will be practical for CAN.
+
+Also similarly to retrieving the timestamp of RX frames, the core
+supports retrieving the timestamp of TX frames – that is the time when
+the frame was successfully delivered. The particulars are very similar
+to timestamping RX frames and are described in .
+
+Handling RX buffer overrun
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a received frame does no more fit into the hardware RX FIFO in its
+entirety, RX FIFO overrun flag (STATUS[DOR]) is set and Data Overrun
+Interrupt (DOI) is triggered. When servicing the interrupt, care must be
+taken first to clear the DOR flag (via COMMAND[CDO]) and after that
+clear the DOI interrupt flag. Otherwise, the interrupt would be
+immediately [7]_ rearmed.
+
+**Note**: During development, it was discussed whether the internal HW
+pipelining cannot disrupt this clear sequence and whether an additional
+dummy cycle is necessary between clearing the flag and the interrupt. On
+the Avalon interface, it indeed proved to be the case, but APB being
+safe because it uses 2-cycle transactions. Essentially, the DOR flag
+would be cleared, but DOI register’s Preset input would still be high
+the cycle when the DOI clear request would also be applied (by setting
+the register’s Reset input high). As Set had higher priority than Reset,
+the DOI flag would not be reset. This has been already fixed by swapping
+the Set/Reset priority (see issue #187).
+
+Reporting Error Passive and Bus Off conditions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It may be desirable to report when the node reaches *Error Passive*,
+*Error Warning*, and *Bus Off* conditions. The driver is notified about
+error state change by an interrupt (EPI, EWLI), and then proceeds to
+determine the core’s error state by reading its error counters.
+
+There is, however, a slight race condition here – there is a delay
+between the time when the state transition occurs (and the interrupt is
+triggered) and when the error counters are read. When EPI is received,
+the node may be either *Error Passive* or *Bus Off*. If the node goes
+*Bus Off*, it obviously remains in the state until it is reset.
+Otherwise, the node is *or was* *Error Passive*. However, it may happen
+that the read state is *Error Warning* or even *Error Active*. It may be
+unclear whether and what exactly to report in that case, but I
+personally entertain the idea that the past error condition should still
+be reported. Similarly, when EWLI is received but the state is later
+detected to be *Error Passive*, *Error Passive* should be reported.
+
+
+CTU CAN FD Driver Sources Reference
+-----------------------------------
+
+.. kernel-doc:: drivers/net/can/ctucanfd/ctucanfd.h
+   :internal:
+
+.. kernel-doc:: drivers/net/can/ctucanfd/ctucanfd_base.c
+   :internal:
+
+.. kernel-doc:: drivers/net/can/ctucanfd/ctucanfd_pci.c
+   :internal:
+
+.. kernel-doc:: drivers/net/can/ctucanfd/ctucanfd_platform.c
+   :internal:
+
+CTU CAN FD IP Core and Driver Development Acknowledgment
+---------------------------------------------------------
+
+* Odrej Ille <ondrej.ille@gmail.com>
+
+  * started the project as student at Department of Measurement, FEE, CTU
+  * invested great amount of personal time and enthusiasm to the project over years
+  * worked on more funded tasks
+
+* `Department of Measurement <https://meas.fel.cvut.cz/>`_,
+  `Faculty of Electrical Engineering <http://www.fel.cvut.cz/en/>`_,
+  `Czech Technical University <https://www.cvut.cz/en>`_
+
+  * is the main investor into the project over many years
+  * uses project in their CAN/CAN FD diagnostics framework for `Skoda Auto <https://www.skoda-auto.cz/>`_
+
+* `Digiteq Automotive <https://www.digiteqautomotive.com/en>`_
+
+  * funding of the project CAN FD Open Cores Support Linux Kernel Based Systems
+  * negotiated and paid CTU to allow public access to the project
+  * provided additional funding of the work
+
+* `Department of Control Engineering <https://control.fel.cvut.cz/en>`_,
+  `Faculty of Electrical Engineering <http://www.fel.cvut.cz/en/>`_,
+  `Czech Technical University <https://www.cvut.cz/en>`_
+
+  * solving the project CAN FD Open Cores Support Linux Kernel Based Systems
+  * providing GitLab management
+  * virtual servers and computational power for continuous integration
+  * providing hardware for HIL continuous integration tests
+
+* `PiKRON Ltd. <http://pikron.com/>`_
+
+  * minor funding to initiate preparation of the project open-sourcing
+
+* Petr Porazil <porazil@pikron.com>
+
+  * design of PCIe transceiver addon board and assembly of boards
+  * design and assembly of MZ_APO baseboard for MicroZed/Zynq based system
+
+* Martin Jerabek <martin.jerabek01@gmail.com>
+
+  * Linux driver development
+  * continuous integration platform architect and GHDL updates
+  * thesis `Open-source and Open-hardware CAN FD Protocol Support <https://dspace.cvut.cz/bitstream/handle/10467/80366/F3-DP-2019-Jerabek-Martin-Jerabek-thesis-2019-canfd.pdf>`_
+
+* Jiri Novak <jnovak@fel.cvut.cz>
+
+  * project initiation, management and use at Department of Measurement, FEE, CTU
+
+* Pavel Pisa <pisa@cmp.felk.cvut.cz>
+
+  * initiate open-sourcing, project coordination, management at Department of Control Engineering, FEE, CTU
+
+* Jaroslav Beran<jara.beran@gmail.com>
+
+ * system integration for Intel SoC, core and driver testing and updates
+
+* Carsten Emde (`OSADL <https://www.osadl.org/>`_)
+
+ * provided OSADL expertise to discuss IP core licensing
+ * pointed to possible deadlock for LGPL and CAN bus possible patent case which lead to relicense IP core design to BSD like license
+
+* Reiner Zitzmann and Holger Zeltwanger (`CAN in Automation <https://www.can-cia.org/>`_)
+
+ * provided suggestions and help to inform community about the project and invited us to events focused on CAN bus future development directions
+
+* Jan Charvat
+
+ * implemented CTU CAN FD functional model for QEMU which has been integrated into QEMU mainline (`docs/system/devices/can.rst <https://www.qemu.org/docs/master/system/devices/can.html>`_)
+ * Bachelor thesis Model of CAN FD Communication Controller for QEMU Emulator
+
+Notes
+-----
+
+
+.. [1]
+   Other buses have their own specific driver interface to set up the
+   device.
+
+.. [2]
+   Not to be mistaken with CAN Error Frame. This is a ``can_frame`` with
+   ``CAN_ERR_FLAG`` set and some error info in its ``data`` field.
+
+.. [3]
+   Available in CTU CAN FD repository
+   `<https://gitlab.fel.cvut.cz/canbus/ctucanfd_ip_core>`_
+
+.. [4]
+   As is done in the low-level driver functions
+   ``ctucan_hw_set_nom_bittiming`` and
+   ``ctucan_hw_set_data_bittiming``.
+
+.. [5]
+   At the time of writing this thesis, option 1 is still being used and
+   the modification is queued in gitlab issue #222
+
+.. [6]
+   Strictly speaking, multiple CAN TX queues are supported since v4.19
+   `can: enable multi-queue for SocketCAN devices <https://lore.kernel.org/patchwork/patch/913526/>`_ but no mainline driver is using
+   them yet.
+
+.. [7]
+   Or rather in the next clock cycle
diff --git a/Documentation/networking/device_drivers/can/ctu/fsm_txt_buffer_user.svg b/Documentation/networking/device_drivers/can/ctu/fsm_txt_buffer_user.svg
new file mode 100644
index 000000000000..381323423b4c
--- /dev/null
+++ b/Documentation/networking/device_drivers/can/ctu/fsm_txt_buffer_user.svg
@@ -0,0 +1,151 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<svg width="113.611mm" height="86.6873mm" version="1.1" viewBox="0 0 113.611 86.6873" xmlns="http://www.w3.org/2000/svg" xmlns:cc="http://creativecommons.org/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
+ <defs>
+  <marker id="marker3667" overflow="visible" orient="auto">
+   <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/>
+  </marker>
+  <marker id="marker3517" overflow="visible" orient="auto">
+   <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/>
+  </marker>
+  <marker id="marker3373" overflow="visible" orient="auto">
+   <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/>
+  </marker>
+  <marker id="marker3199" overflow="visible" orient="auto">
+   <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/>
+  </marker>
+  <marker id="marker3037" overflow="visible" orient="auto">
+   <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/>
+  </marker>
+  <marker id="marker2779" overflow="visible" orient="auto">
+   <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/>
+  </marker>
+  <marker id="marker2477" overflow="visible" orient="auto">
+   <path transform="scale(.6) rotate(180) translate(0)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/>
+  </marker>
+  <marker id="marker2074" overflow="visible" orient="auto">
+   <path transform="scale(.6) rotate(180) translate(0)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/>
+  </marker>
+  <marker id="marker1964" overflow="visible" orient="auto">
+   <path transform="scale(.6) rotate(180) translate(0)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/>
+  </marker>
+  <marker id="marker1856" overflow="visible" orient="auto">
+   <path transform="scale(.6) rotate(180) translate(0)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/>
+  </marker>
+  <marker id="Arrow2Mend" overflow="visible" orient="auto">
+   <path transform="scale(.6) rotate(180) translate(0)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/>
+  </marker>
+  <filter id="filter1204" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+   <feGaussianBlur stdDeviation="0.00018829868"/>
+  </filter>
+  <marker id="marker2074-3" overflow="visible" orient="auto">
+   <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/>
+  </marker>
+  <filter id="filter1204-6" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+   <feGaussianBlur stdDeviation="0.00018829868"/>
+  </filter>
+  <filter id="filter1204-6-9" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+   <feGaussianBlur stdDeviation="0.00018829868"/>
+  </filter>
+  <filter id="filter1204-6-2" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+   <feGaussianBlur stdDeviation="0.00018829868"/>
+  </filter>
+  <filter id="filter1204-6-2-9" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+   <feGaussianBlur stdDeviation="0.00018829868"/>
+  </filter>
+  <filter id="filter1204-6-2-9-4" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+   <feGaussianBlur stdDeviation="0.00018829868"/>
+  </filter>
+  <filter id="filter1204-6-2-9-1" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+   <feGaussianBlur stdDeviation="0.00018829868"/>
+  </filter>
+  <filter id="filter1204-6-2-9-1-3" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+   <feGaussianBlur stdDeviation="0.00018829868"/>
+  </filter>
+  <filter id="filter1204-6-2-9-1-3-1" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+   <feGaussianBlur stdDeviation="0.00018829868"/>
+  </filter>
+ </defs>
+ <metadata>
+  <rdf:RDF>
+   <cc:Work rdf:about="">
+    <dc:format>image/svg+xml</dc:format>
+    <dc:type rdf:resource="http://purl.org/dc/dcmitype/StillImage"/>
+    <dc:title/>
+   </cc:Work>
+  </rdf:RDF>
+ </metadata>
+ <g transform="translate(-49.0277 -104.823)">
+  <g>
+   <path d="m130.534 165.429h-71.1816v-17.5315" fill="none" marker-end="url(#marker2477)" stroke="#28a4ff" stroke-width=".6"/>
+   <path d="m145.034 122.959v-11.5914h-43.1215" fill="none" marker-end="url(#marker3037)" stroke="#28a4ff" stroke-width=".6"/>
+   <rect x="130.679" y="122.933" width="28.2965" height="45.2319" rx="0" ry="0" fill="#e5e5e5" stroke="#717171" stroke-linecap="square" stroke-width=".499999"/>
+   <path d="m102.044 116.236h23.3126l-0.13388 18.8185h19.9383v3.66603" fill="none" marker-end="url(#marker3199)" stroke="#28a4ff" stroke-width=".6"/>
+   <path d="m59.5006 138.391v-24.2517h20.6338" fill="none" marker-end="url(#marker2779)" stroke="#28a4ff" stroke-width=".6"/>
+   <rect x="78.1389" y="126.411" width="28.0037" height="35.0443" rx="0" ry="0" fill="#e5e5e5" stroke="#717171" stroke-linecap="square" stroke-width=".5"/>
+  </g>
+  <g fill="#ffcb35" stroke="#000" stroke-linecap="square">
+   <ellipse cx="92.1408" cy="114.239" rx="10.8866" ry="4.39308" stroke-width=".5"/>
+   <ellipse cx="92.1408" cy="134.185" rx="10.8866" ry="4.39308" stroke-width=".499999"/>
+   <ellipse cx="92.1408" cy="152.199" rx="10.8866" ry="4.39308" stroke-width=".499999"/>
+  </g>
+  <g fill="#28a4ff" stroke="#000" stroke-linecap="square" stroke-width=".499999">
+   <ellipse cx="144.827" cy="143.316" rx="10.8866" ry="4.39308"/>
+   <ellipse cx="144.827" cy="159.143" rx="10.8866" ry="4.39308"/>
+   <ellipse cx="59.4364" cy="142.823" rx="7.36455" ry="4.39308"/>
+   <ellipse cx="144.827" cy="129.196" rx="10.8866" ry="4.39308"/>
+   <ellipse cx="143.077" cy="180.53" rx="10.8866" ry="4.39308"/>
+  </g>
+  <ellipse cx="110.386" cy="180.53" rx="10.8866" ry="4.39308" fill="#ffcb35" stroke="#000" stroke-linecap="square" stroke-width=".499999"/>
+  <text x="110.90907" y="179.42688" font-size="3.175px" xml:space="preserve"><tspan x="110.90907" y="179.42688" dy="0.60000002" text-align="center" text-anchor="middle">Accessible</tspan><tspan x="110.90907" y="183.39563"><tspan font-size="3.175px" text-align="center" text-anchor="middle">for S</tspan>W</tspan></text>
+  <text x="143.5869" y="179.52795" xml:space="preserve"><tspan x="143.5869" y="179.52795" dy="1 0 0 0 0 0" font-family="sans-serif" font-size="2.82222px" text-align="center" text-anchor="middle" style="font-variant-caps:normal;font-variant-east-asian:normal;font-variant-ligatures:normal;font-variant-numeric:normal">Inaccessible</tspan><tspan x="143.5869" y="183.36786" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">for S</tspan>W</tspan></text>
+  <g font-size="3.175px">
+   <text x="91.95018" y="115.29005" xml:space="preserve"><tspan x="91.95018" y="115.29005" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">Ready</tspan></tspan></text>
+   <text x="145.25127" y="130.49019" xml:space="preserve"><tspan x="145.25127" y="130.49019" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">TX OK</tspan></tspan></text>
+   <text x="145.31845" y="144.43121" xml:space="preserve"><tspan x="145.31845" y="144.43121" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">Aborted</tspan></tspan></text>
+   <text x="145.40399" y="160.36035" xml:space="preserve"><tspan x="145.40399" y="160.36035" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">TX failed</tspan></tspan></text>
+   <text x="91.823967" y="133.53941" text-align="center" text-anchor="middle" style="line-height:0.9" xml:space="preserve"><tspan x="91.823967" y="133.53941" text-align="center"><tspan font-size="3.175px" text-align="center" text-anchor="middle">TX in</tspan></tspan><tspan x="91.823967" y="136.39691" text-align="center">progress</tspan></text>
+   <text x="91.648918" y="151.84813" text-align="center" text-anchor="middle" style="line-height:0.9" xml:space="preserve"><tspan x="91.648918" y="151.84813" text-align="center"><tspan font-size="3.175px" text-align="center" text-anchor="middle">Abort in</tspan></tspan><tspan x="91.648918" y="154.70563" text-align="center">progress</tspan></text>
+   <text x="59.456043" y="143.91658" xml:space="preserve"><tspan x="59.456043" y="143.91658" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">Empty</tspan></tspan></text>
+  </g>
+  <g fill="none">
+   <g stroke="#000">
+    <rect x="52.3943" y="171.63" width="106.581" height="16.601" rx="0" ry="0" stroke-linecap="square" stroke-width=".499999"/>
+    <g stroke-width=".6">
+     <path d="m106.383 159.046h26.4967" marker-end="url(#Arrow2Mend)"/>
+     <path d="m103.138 152.268h41.5564v-3.92426" marker-end="url(#marker1856)"/>
+     <path d="m106.38 129.354h17.7785"/>
+     <path d="m125.818 129.359h7.2418" marker-end="url(#marker1964)"/>
+    </g>
+    <path d="m124.169 129.354a0.959514 0.97091 0 0 1 0.47587-0.84557 0.959514 0.97091 0 0 1 0.96164-3e-3 0.959514 0.97091 0 0 1 0.48149 0.84231" stroke-linecap="square" stroke-width=".600001"/>
+    <path d="m55.7026 180.832h34.8131" marker-end="url(#marker2074)" stroke-width=".6"/>
+   </g>
+   <g>
+    <path d="m55.6464 185.744h34.8131" marker-end="url(#marker2074-3)" stroke="#28a4ff" stroke-width=".600001"/>
+    <g stroke-width=".6">
+     <path d="m94.0487 129.889v-10.6493" marker-end="url(#marker3373)" stroke="#000"/>
+     <path d="m89.7534 118.621v10.662" marker-end="url(#marker3517)" stroke="#000"/>
+     <path d="m92.119 138.812v7.9718" marker-end="url(#marker3667)" stroke="#28a4ff"/>
+    </g>
+   </g>
+  </g>
+  <text transform="matrix(.264583 0 0 .264583 91.8919 139.964)" x="26.959213" y="9.11724" fill="#2aa1ff" filter="url(#filter1204-6-2-9-1-3-1)" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle" style="line-height:1.1" xml:space="preserve"><tspan x="26.959213" y="9.11724" text-align="center">Set</tspan><tspan x="26.959213" y="22.31724" text-align="center">abort</tspan></text>
+  <text transform="translate(49.0277 104.823)" x="57.620724" y="16.855087" filter="url(#filter1204)" font-size="3.175px" text-align="center" text-anchor="middle" style="line-height:1.1" xml:space="preserve"><tspan x="57.620724" y="16.855087" text-align="center">Transmission</tspan><tspan x="57.620724" y="20.347588" text-align="center">unsuccessful</tspan></text>
+  <g font-size="12px" stroke-width="3.77953" text-anchor="middle">
+   <text transform="matrix(.264583 0 0 .264583 68.5988 118.913)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">starts</tspan></text>
+   <text transform="matrix(.264583 0 0 .264583 106.802 130.509)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">successful</tspan></text>
+   <text transform="matrix(.264583 0 0 .264583 107.77 145.476)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">sborted</tspan></text>
+  </g>
+  <g stroke-width="3.77953" text-anchor="middle">
+   <text transform="matrix(.264583 0 0 .264583 107.574 155.948)" x="38.824219" y="9.1171875" filter="url(#filter1204)" font-size="10.6667px" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Retransmit</tspan><tspan x="38.824219" y="20.850557" text-align="center">limit reached or</tspan><tspan x="38.824219" y="32.583927" text-align="center">node went bus off</tspan><tspan x="38.824219" y="44.317299" text-align="center"/></text>
+   <text transform="matrix(.264583 0 0 .264583 60.7127 177.384)" x="38.824539" y="9.1173134" filter="url(#filter1204-6)" font-size="12px" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824539" y="9.1173134" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Transmission result</tspan></text>
+   <text transform="matrix(.264583 0 0 .264583 45.6885 173.226)" x="57.727047" y="9.11724" filter="url(#filter1204-6-9)" font-size="12px" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="57.727047" y="9.11724" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Legend:</tspan></text>
+  </g>
+  <g fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-anchor="middle">
+   <text transform="matrix(.264583 0 0 .264583 57.0045 182.079)" x="57.727047" y="9.11724" filter="url(#filter1204-6-2)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="57.727047" y="9.11724" fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">SW command</tspan></text>
+   <text transform="matrix(.264583 0 0 .264583 57.7865 110.104)" x="40.822609" y="9.11724" filter="url(#filter1204-6-2-9)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="40.822609" y="9.11724" fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Set ready</tspan></text>
+   <text transform="matrix(.264583 0 0 .264583 116.893 107.491)" x="28.049065" y="9.1172523" filter="url(#filter1204-6-2-9-4)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="28.049065" y="9.1172523" fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Set ready</tspan></text>
+   <text transform="matrix(.264583 0 0 .264583 87.5687 166.324)" x="28.049065" y="9.1172523" filter="url(#filter1204-6-2-9-1)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="28.049065" y="9.1172523" fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Set empty</tspan></text>
+   <text transform="matrix(.264583 0 0 .264583 106.53 113.074)" x="30.228771" y="8.9063139" filter="url(#filter1204-6-2-9-1-3)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="30.228771" y="8.9063139" fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Set abort</tspan></text>
+  </g>
+ </g>
+</svg>
diff --git a/Documentation/networking/device_drivers/can/freescale/flexcan.rst b/Documentation/networking/device_drivers/can/freescale/flexcan.rst
new file mode 100644
index 000000000000..106cd2890135
--- /dev/null
+++ b/Documentation/networking/device_drivers/can/freescale/flexcan.rst
@@ -0,0 +1,54 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=============================
+Flexcan CAN Controller driver
+=============================
+
+Authors: Marc Kleine-Budde <mkl@pengutronix.de>,
+Dario Binacchi <dario.binacchi@amarulasolutions.com>
+
+On/off RTR frames reception
+===========================
+
+For most flexcan IP cores the driver supports 2 RX modes:
+
+- FIFO
+- mailbox
+
+The older flexcan cores (integrated into the i.MX25, i.MX28, i.MX35
+and i.MX53 SOCs) only receive RTR frames if the controller is
+configured for RX-FIFO mode.
+
+The RX FIFO mode uses a hardware FIFO with a depth of 6 CAN frames,
+while the mailbox mode uses a software FIFO with a depth of up to 62
+CAN frames. With the help of the bigger buffer, the mailbox mode
+performs better under high system load situations.
+
+As reception of RTR frames is part of the CAN standard, all flexcan
+cores come up in a mode where RTR reception is possible.
+
+With the "rx-rtr" private flag the ability to receive RTR frames can
+be waived at the expense of losing the ability to receive RTR
+messages. This trade off is beneficial in certain use cases.
+
+"rx-rtr" on
+  Receive RTR frames. (default)
+
+  The CAN controller can and will receive RTR frames.
+
+  On some IP cores the controller cannot receive RTR frames in the
+  more performant "RX mailbox" mode and will use "RX FIFO" mode
+  instead.
+
+"rx-rtr" off
+
+  Waive ability to receive RTR frames. (not supported on all IP cores)
+
+  This mode activates the "RX mailbox mode" for better performance, on
+  some IP cores RTR frames cannot be received anymore.
+
+The setting can only be changed if the interface is down::
+
+    ip link set dev can0 down
+    ethtool --set-priv-flags can0 rx-rtr {off|on}
+    ip link set dev can0 up
diff --git a/Documentation/networking/device_drivers/can/index.rst b/Documentation/networking/device_drivers/can/index.rst
new file mode 100644
index 000000000000..6a8a4f74fa26
--- /dev/null
+++ b/Documentation/networking/device_drivers/can/index.rst
@@ -0,0 +1,22 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Controller Area Network (CAN) Device Drivers
+============================================
+
+Device drivers for CAN devices.
+
+Contents:
+
+.. toctree::
+   :maxdepth: 2
+
+   can327
+   ctu/ctucanfd-driver
+   freescale/flexcan
+
+.. only::  subproject and html
+
+   Indices
+   =======
+
+   * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/cellular/index.rst b/Documentation/networking/device_drivers/cellular/index.rst
new file mode 100644
index 000000000000..fc1812d3fc70
--- /dev/null
+++ b/Documentation/networking/device_drivers/cellular/index.rst
@@ -0,0 +1,18 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Cellular Modem Device Drivers
+=============================
+
+Contents:
+
+.. toctree::
+   :maxdepth: 2
+
+   qualcomm/rmnet
+
+.. only::  subproject and html
+
+   Indices
+   =======
+
+   * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst b/Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst
new file mode 100644
index 000000000000..289c146a8291
--- /dev/null
+++ b/Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst
@@ -0,0 +1,196 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+Rmnet Driver
+============
+
+1. Introduction
+===============
+
+rmnet driver is used for supporting the Multiplexing and aggregation
+Protocol (MAP). This protocol is used by all recent chipsets using Qualcomm
+Technologies, Inc. modems.
+
+This driver can be used to register onto any physical network device in
+IP mode. Physical transports include USB, HSIC, PCIe and IP accelerator.
+
+Multiplexing allows for creation of logical netdevices (rmnet devices) to
+handle multiple private data networks (PDN) like a default internet, tethering,
+multimedia messaging service (MMS) or IP media subsystem (IMS). Hardware sends
+packets with MAP headers to rmnet. Based on the multiplexer id, rmnet
+routes to the appropriate PDN after removing the MAP header.
+
+Aggregation is required to achieve high data rates. This involves hardware
+sending aggregated bunch of MAP frames. rmnet driver will de-aggregate
+these MAP frames and send them to appropriate PDN's.
+
+2. Packet format
+================
+
+a. MAP packet v1 (data / control)
+
+MAP header fields are in big endian format.
+
+Packet format::
+
+  Bit             0             1           2-7      8-15           16-31
+  Function   Command / Data   Reserved     Pad   Multiplexer ID    Payload length
+
+  Bit            32-x
+  Function      Raw bytes
+
+Command (1)/ Data (0) bit value is to indicate if the packet is a MAP command
+or data packet. Command packet is used for transport level flow control. Data
+packets are standard IP packets.
+
+Reserved bits must be zero when sent and ignored when received.
+
+Padding is the number of bytes to be appended to the payload to
+ensure 4 byte alignment.
+
+Multiplexer ID is to indicate the PDN on which data has to be sent.
+
+Payload length includes the padding length but does not include MAP header
+length.
+
+b. Map packet v4 (data / control)
+
+MAP header fields are in big endian format.
+
+Packet format::
+
+  Bit             0             1           2-7      8-15           16-31
+  Function   Command / Data   Reserved     Pad   Multiplexer ID    Payload length
+
+  Bit            32-(x-33)      (x-32)-x
+  Function      Raw bytes      Checksum offload header
+
+Command (1)/ Data (0) bit value is to indicate if the packet is a MAP command
+or data packet. Command packet is used for transport level flow control. Data
+packets are standard IP packets.
+
+Reserved bits must be zero when sent and ignored when received.
+
+Padding is the number of bytes to be appended to the payload to
+ensure 4 byte alignment.
+
+Multiplexer ID is to indicate the PDN on which data has to be sent.
+
+Payload length includes the padding length but does not include MAP header
+length.
+
+Checksum offload header, has the information about the checksum processing done
+by the hardware.Checksum offload header fields are in big endian format.
+
+Packet format::
+
+  Bit             0-14        15              16-31
+  Function      Reserved   Valid     Checksum start offset
+
+  Bit                31-47                    48-64
+  Function      Checksum length           Checksum value
+
+Reserved bits must be zero when sent and ignored when received.
+
+Valid bit indicates whether the partial checksum is calculated and is valid.
+Set to 1, if its is valid. Set to 0 otherwise.
+
+Padding is the number of bytes to be appended to the payload to
+ensure 4 byte alignment.
+
+Checksum start offset, Indicates the offset in bytes from the beginning of the
+IP header, from which modem computed checksum.
+
+Checksum length is the Length in bytes starting from CKSUM_START_OFFSET,
+over which checksum is computed.
+
+Checksum value, indicates the checksum computed.
+
+c. MAP packet v5 (data / control)
+
+MAP header fields are in big endian format.
+
+Packet format::
+
+  Bit             0             1         2-7      8-15           16-31
+  Function   Command / Data  Next header  Pad   Multiplexer ID   Payload length
+
+  Bit            32-x
+  Function      Raw bytes
+
+Command (1)/ Data (0) bit value is to indicate if the packet is a MAP command
+or data packet. Command packet is used for transport level flow control. Data
+packets are standard IP packets.
+
+Next header is used to indicate the presence of another header, currently is
+limited to checksum header.
+
+Padding is the number of bytes to be appended to the payload to
+ensure 4 byte alignment.
+
+Multiplexer ID is to indicate the PDN on which data has to be sent.
+
+Payload length includes the padding length but does not include MAP header
+length.
+
+d. Checksum offload header v5
+
+Checksum offload header fields are in big endian format.
+
+  Bit            0 - 6          7               8-15              16-31
+  Function     Header Type    Next Header     Checksum Valid    Reserved
+
+Header Type is to indicate the type of header, this usually is set to CHECKSUM
+
+Header types
+= ==========================================
+0 Reserved
+1 Reserved
+2 checksum header
+
+Checksum Valid is to indicate whether the header checksum is valid. Value of 1
+implies that checksum is calculated on this packet and is valid, value of 0
+indicates that the calculated packet checksum is invalid.
+
+Reserved bits must be zero when sent and ignored when received.
+
+e. MAP packet v1/v5 (command specific)::
+
+    Bit             0             1         2-7      8 - 15           16 - 31
+    Function   Command         Reserved     Pad   Multiplexer ID    Payload length
+    Bit          32 - 39        40 - 45    46 - 47       48 - 63
+    Function   Command name    Reserved   Command Type   Reserved
+    Bit          64 - 95
+    Function   Transaction ID
+    Bit          96 - 127
+    Function   Command data
+
+Command 1 indicates disabling flow while 2 is enabling flow
+
+Command types
+
+= ==========================================
+0 for MAP command request
+1 is to acknowledge the receipt of a command
+2 is for unsupported commands
+3 is for error during processing of commands
+= ==========================================
+
+f. Aggregation
+
+Aggregation is multiple MAP packets (can be data or command) delivered to
+rmnet in a single linear skb. rmnet will process the individual
+packets and either ACK the MAP command or deliver the IP packet to the
+network stack as needed
+
+MAP header|IP Packet|Optional padding|MAP header|IP Packet|Optional padding....
+
+MAP header|IP Packet|Optional padding|MAP header|Command Packet|Optional pad...
+
+3. Userspace configuration
+==========================
+
+rmnet userspace configuration is done through netlink using iproute2
+https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/
+
+The driver uses rtnl_link_ops for communication.
diff --git a/Documentation/networking/device_drivers/dec/de4x5.rst b/Documentation/networking/device_drivers/dec/de4x5.rst
deleted file mode 100644
index e03e9c631879..000000000000
--- a/Documentation/networking/device_drivers/dec/de4x5.rst
+++ /dev/null
@@ -1,189 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===================================
-DEC EtherWORKS Ethernet De4x5 cards
-===================================
-
-    Originally,   this  driver  was    written  for the  Digital   Equipment
-    Corporation series of EtherWORKS Ethernet cards:
-
-	 - DE425 TP/COAX EISA
-	 - DE434 TP PCI
-	 - DE435 TP/COAX/AUI PCI
-	 - DE450 TP/COAX/AUI PCI
-	 - DE500 10/100 PCI Fasternet
-
-    but it  will  now attempt  to  support all  cards which   conform to the
-    Digital Semiconductor   SROM   Specification.    The  driver   currently
-    recognises the following chips:
-
-	 - DC21040  (no SROM)
-	 - DC21041[A]
-	 - DC21140[A]
-	 - DC21142
-	 - DC21143
-
-    So far the driver is known to work with the following cards:
-
-	 - KINGSTON
-	 - Linksys
-	 - ZNYX342
-	 - SMC8432
-	 - SMC9332 (w/new SROM)
-	 - ZNYX31[45]
-	 - ZNYX346 10/100 4 port (can act as a 10/100 bridge!)
-
-    The driver has been tested on a relatively busy network using the DE425,
-    DE434, DE435 and DE500 cards and benchmarked with 'ttcp': it transferred
-    16M of data to a DECstation 5000/200 as follows::
-
-		  TCP           UDP
-	       TX     RX     TX     RX
-      DE425   1030k  997k   1170k  1128k
-      DE434   1063k  995k   1170k  1125k
-      DE435   1063k  995k   1170k  1125k
-      DE500   1063k  998k   1170k  1125k  in 10Mb/s mode
-
-    All  values are typical (in   kBytes/sec) from a  sample  of 4 for  each
-    measurement. Their error is +/-20k on a quiet (private) network and also
-    depend on what load the CPU has.
-
-----------------------------------------------------------------------------
-
-    The ability to load this  driver as a loadable  module has been included
-    and used extensively  during the driver development  (to save those long
-    reboot sequences).  Loadable module support  under PCI and EISA has been
-    achieved by letting the driver autoprobe as if it were compiled into the
-    kernel. Do make sure  you're not sharing  interrupts with anything  that
-    cannot accommodate  interrupt  sharing!
-
-    To utilise this ability, you have to do 8 things:
-
-    0) have a copy of the loadable modules code installed on your system.
-    1) copy de4x5.c from the  /linux/drivers/net directory to your favourite
-       temporary directory.
-    2) for fixed  autoprobes (not  recommended),  edit the source code  near
-       line 5594 to reflect the I/O address  you're using, or assign these when
-       loading by::
-
-		   insmod de4x5 io=0xghh           where g = bus number
-							hh = device number
-
-       .. note::
-
-	   autoprobing for modules is now supported by default. You may just
-	   use::
-
-		   insmod de4x5
-
-	   to load all available boards. For a specific board, still use
-	   the 'io=?' above.
-    3) compile  de4x5.c, but include -DMODULE in  the command line to ensure
-       that the correct bits are compiled (see end of source code).
-    4) if you are wanting to add a new  card, goto 5. Otherwise, recompile a
-       kernel with the de4x5 configuration turned off and reboot.
-    5) insmod de4x5 [io=0xghh]
-    6) run the net startup bits for your new eth?? interface(s) manually
-       (usually /etc/rc.inet[12] at boot time).
-    7) enjoy!
-
-    To unload a module, turn off the associated interface(s)
-    'ifconfig eth?? down' then 'rmmod de4x5'.
-
-    Automedia detection is included so that in  principle you can disconnect
-    from, e.g.  TP, reconnect  to BNC  and  things will still work  (after a
-    pause while the   driver figures out   where its media went).  My tests
-    using ping showed that it appears to work....
-
-    By  default,  the driver will  now   autodetect any  DECchip based card.
-    Should you have a need to restrict the driver to DIGITAL only cards, you
-    can compile with a  DEC_ONLY define, or if  loading as a module, use the
-    'dec_only=1'  parameter.
-
-    I've changed the timing routines to  use the kernel timer and scheduling
-    functions  so that the  hangs  and other assorted problems that occurred
-    while autosensing the  media  should be gone.  A  bonus  for the DC21040
-    auto  media sense algorithm is  that it can now  use one that is more in
-    line with the  rest (the DC21040  chip doesn't  have a hardware  timer).
-    The downside is the 1 'jiffies' (10ms) resolution.
-
-    IEEE 802.3u MII interface code has  been added in anticipation that some
-    products may use it in the future.
-
-    The SMC9332 card  has a non-compliant SROM  which needs fixing -  I have
-    patched this  driver to detect it  because the SROM format used complies
-    to a previous DEC-STD format.
-
-    I have removed the buffer copies needed for receive on Intels.  I cannot
-    remove them for   Alphas since  the  Tulip hardware   only does longword
-    aligned  DMA transfers  and  the  Alphas get   alignment traps with  non
-    longword aligned data copies (which makes them really slow). No comment.
-
-    I  have added SROM decoding  routines to make this  driver work with any
-    card that  supports the Digital  Semiconductor SROM spec. This will help
-    all  cards running the dc2114x  series chips in particular.  Cards using
-    the dc2104x  chips should run correctly with  the basic  driver.  I'm in
-    debt to <mjacob@feral.com> for the  testing and feedback that helped get
-    this feature working.  So far we have  tested KINGSTON, SMC8432, SMC9332
-    (with the latest SROM complying  with the SROM spec  V3: their first was
-    broken), ZNYX342  and  LinkSys. ZNYX314 (dual  21041  MAC) and  ZNYX 315
-    (quad 21041 MAC)  cards also  appear  to work despite their  incorrectly
-    wired IRQs.
-
-    I have added a temporary fix for interrupt problems when some SCSI cards
-    share the same interrupt as the DECchip based  cards. The problem occurs
-    because  the SCSI card wants to  grab the interrupt  as a fast interrupt
-    (runs the   service routine with interrupts turned   off) vs.  this card
-    which really needs to run the service routine with interrupts turned on.
-    This driver will  now   add the interrupt service   routine  as  a  fast
-    interrupt if it   is bounced from the   slow interrupt.  THIS IS NOT   A
-    RECOMMENDED WAY TO RUN THE DRIVER  and has been done  for a limited time
-    until  people   sort  out their  compatibility    issues and the  kernel
-    interrupt  service code  is  fixed.   YOU  SHOULD SEPARATE OUT  THE FAST
-    INTERRUPT CARDS FROM THE SLOW INTERRUPT CARDS to ensure that they do not
-    run on the same interrupt. PCMCIA/CardBus is another can of worms...
-
-    Finally, I think  I have really  fixed  the module  loading problem with
-    more than one DECchip based  card.  As a  side effect, I don't mess with
-    the  device structure any  more which means that  if more than 1 card in
-    2.0.x is    installed (4  in   2.1.x),  the  user   will have   to  edit
-    linux/drivers/net/Space.c  to make room for  them. Hence, module loading
-    is  the preferred way to use   this driver, since  it  doesn't have this
-    limitation.
-
-    Where SROM media  detection is used and  full duplex is specified in the
-    SROM,  the feature is  ignored unless  lp->params.fdx  is set at compile
-    time  OR during  a   module load  (insmod  de4x5   args='eth??:fdx' [see
-    below]).  This is because there  is no way  to automatically detect full
-    duplex   links  except through   autonegotiation.    When I  include the
-    autonegotiation feature in  the SROM autoconf  code, this detection will
-    occur automatically for that case.
-
-    Command line  arguments are  now allowed, similar to  passing  arguments
-    through LILO. This will allow a per adapter board set  up of full duplex
-    and media. The only lexical constraints are:  the board name (dev->name)
-    appears in  the list before its parameters.  The list of parameters ends
-    either at the end of the parameter list or with another board name.  The
-    following parameters are allowed:
-
-	    =========  ===============================================
-	    fdx        for full duplex
-	    autosense  to set the media/speed; with the following
-		       sub-parameters:
-		       TP, TP_NW, BNC, AUI, BNC_AUI, 100Mb, 10Mb, AUTO
-	    =========  ===============================================
-
-    Case sensitivity is important  for  the sub-parameters. They *must*   be
-    upper case. Examples::
-
-	insmod de4x5 args='eth1:fdx autosense=BNC eth0:autosense=100Mb'.
-
-    For a compiled in driver, in linux/drivers/net/CONFIG, place e.g.::
-
-	DE4X5_OPTS = -DDE4X5_PARM='"eth0:fdx autosense=AUI eth2:autosense=TP"'
-
-    Yes,  I know full duplex  isn't permissible on BNC  or AUI; they're just
-    examples. By default, full duplex is turned  off and AUTO is the default
-    autosense setting. In  reality, I expect only the  full duplex option to
-    be used. Note the use of single quotes in the two examples above and the
-    lack of commas to separate items.
diff --git a/Documentation/networking/device_drivers/3com/3c509.rst b/Documentation/networking/device_drivers/ethernet/3com/3c509.rst
index 47f706bacdd9..47f706bacdd9 100644
--- a/Documentation/networking/device_drivers/3com/3c509.rst
+++ b/Documentation/networking/device_drivers/ethernet/3com/3c509.rst
diff --git a/Documentation/networking/device_drivers/3com/vortex.rst b/Documentation/networking/device_drivers/ethernet/3com/vortex.rst
index 800add5be338..a060f84c4f96 100644
--- a/Documentation/networking/device_drivers/3com/vortex.rst
+++ b/Documentation/networking/device_drivers/ethernet/3com/vortex.rst
@@ -4,8 +4,6 @@
 3Com Vortex device driver
 =========================
 
-Documentation/networking/device_drivers/3com/vortex.rst
-
 Andrew Morton
 
 30 April 2000
@@ -256,7 +254,7 @@ Media selection
 A number of the older NICs such as the 3c590 and 3c900 series have
 10base2 and AUI interfaces.
 
-Prior to January, 2001 this driver would autoeselect the 10base2 or AUI
+Prior to January, 2001 this driver would autoselect the 10base2 or AUI
 port if it didn't detect activity on the 10baseT port.  It would then
 get stuck on the 10base2 port and a driver reload was necessary to
 switch back to 10baseT.  This behaviour could not be prevented with a
@@ -376,8 +374,8 @@ steps you should take:
    email address will be in the driver source or in the MAINTAINERS file.
 
 - The contents of your report will vary a lot depending upon the
-  problem.  If it's a kernel crash then you should refer to the
-  admin-guide/reporting-bugs.rst file.
+  problem.  If it's a kernel crash then you should refer to
+  'Documentation/admin-guide/reporting-issues.rst'.
 
   But for most problems it is useful to provide the following:
 
diff --git a/Documentation/networking/altera_tse.rst b/Documentation/networking/device_drivers/ethernet/altera/altera_tse.rst
index 7a7040072e58..7a7040072e58 100644
--- a/Documentation/networking/altera_tse.rst
+++ b/Documentation/networking/device_drivers/ethernet/altera/altera_tse.rst
diff --git a/Documentation/networking/device_drivers/amazon/ena.rst b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
index 11af6388ea87..4561e8ab9e08 100644
--- a/Documentation/networking/device_drivers/amazon/ena.rst
+++ b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
@@ -11,12 +11,12 @@ ENA is a networking interface designed to make good use of modern CPU
 features and system architectures.
 
 The ENA device exposes a lightweight management interface with a
-minimal set of memory mapped registers and extendable command set
+minimal set of memory mapped registers and extendible command set
 through an Admin Queue.
 
 The driver supports a range of ENA devices, is link-speed independent
-(i.e., the same driver is used for 10GbE, 25GbE, 40GbE, etc.), and has
-a negotiated and extendable feature set.
+(i.e., the same driver is used for 10GbE, 25GbE, 40GbE, etc), and has
+a negotiated and extendible feature set.
 
 Some ENA devices support SR-IOV. This driver is used for both the
 SR-IOV Physical Function (PF) and Virtual Function (VF) devices.
@@ -27,9 +27,9 @@ is advertised by the device via the Admin Queue), a dedicated MSI-X
 interrupt vector per Tx/Rx queue pair, adaptive interrupt moderation,
 and CPU cacheline optimized data placement.
 
-The ENA driver supports industry standard TCP/IP offload features such
-as checksum offload and TCP transmit segmentation offload (TSO).
-Receive-side scaling (RSS) is supported for multi-core scaling.
+The ENA driver supports industry standard TCP/IP offload features such as
+checksum offload. Receive-side scaling (RSS) is supported for multi-core
+scaling.
 
 The ENA driver and its corresponding devices implement health
 monitoring mechanisms such as watchdog, enabling the device and driver
@@ -39,32 +39,22 @@ debug logs.
 Some of the ENA devices support a working mode called Low-latency
 Queue (LLQ), which saves several more microseconds.
 
-Supported PCI vendor ID/device IDs
-==================================
-
-=========   =======================
-1d0f:0ec2   ENA PF
-1d0f:1ec2   ENA PF with LLQ support
-1d0f:ec20   ENA VF
-1d0f:ec21   ENA VF with LLQ support
-=========   =======================
-
 ENA Source Code Directory Structure
 ===================================
 
 =================   ======================================================
 ena_com.[ch]        Management communication layer. This layer is
-		    responsible for the handling all the management
-		    (admin) communication between the device and the
-		    driver.
+                    responsible for the handling all the management
+                    (admin) communication between the device and the
+                    driver.
 ena_eth_com.[ch]    Tx/Rx data path.
 ena_admin_defs.h    Definition of ENA management interface.
 ena_eth_io_defs.h   Definition of ENA data path interface.
 ena_common_defs.h   Common definitions for ena_com layer.
 ena_regs_defs.h     Definition of ENA PCI memory-mapped (MMIO) registers.
 ena_netdev.[ch]     Main Linux kernel driver.
-ena_syfsfs.[ch]     Sysfs files.
 ena_ethtool.c       ethtool callbacks.
+ena_xdp.[ch]        XDP files
 ena_pci_id_tbl.h    Supported device IDs.
 =================   ======================================================
 
@@ -79,7 +69,7 @@ ENA management interface is exposed by means of:
 - Asynchronous Event Notification Queue (AENQ)
 
 ENA device MMIO Registers are accessed only during driver
-initialization and are not involved in further normal device
+initialization and are not used during further normal device
 operation.
 
 AQ is used for submitting management commands, and the
@@ -110,28 +100,27 @@ group may have multiple syndromes, as shown below
 
 The events are:
 
-	====================	===============
-	Group			Syndrome
-	====================	===============
-	Link state change	**X**
-	Fatal error		**X**
-	Notification		Suspend traffic
-	Notification		Resume traffic
-	Keep-Alive		**X**
-	====================	===============
+====================    ===============
+Group                   Syndrome
+====================    ===============
+Link state change       **X**
+Fatal error             **X**
+Notification            Suspend traffic
+Notification            Resume traffic
+Keep-Alive              **X**
+====================    ===============
 
 ACQ and AENQ share the same MSI-X vector.
 
-Keep-Alive is a special mechanism that allows monitoring of the
-device's health. The driver maintains a watchdog (WD) handler which,
-if fired, logs the current state and statistics then resets and
-restarts the ENA device and driver. A Keep-Alive event is delivered by
-the device every second. The driver re-arms the WD upon reception of a
-Keep-Alive event. A missed Keep-Alive event causes the WD handler to
-fire.
+Keep-Alive is a special mechanism that allows monitoring the device's health.
+A Keep-Alive event is delivered by the device every second.
+The driver maintains a watchdog (WD) handler which logs the current state and
+statistics. If the keep-alive events aren't delivered as expected the WD resets
+the device and the driver.
 
 Data Path Interface
 ===================
+
 I/O operations are based on Tx and Rx Submission Queues (Tx SQ and Rx
 SQ correspondingly). Each SQ has a completion queue (CQ) associated
 with it.
@@ -141,26 +130,24 @@ physical memory.
 
 The ENA driver supports two Queue Operation modes for Tx SQs:
 
-- Regular mode
+- **Regular mode:**
+  In this mode the Tx SQs reside in the host's memory. The ENA
+  device fetches the ENA Tx descriptors and packet data from host
+  memory.
 
-  * In this mode the Tx SQs reside in the host's memory. The ENA
-    device fetches the ENA Tx descriptors and packet data from host
-    memory.
+- **Low Latency Queue (LLQ) mode or "push-mode":**
+  In this mode the driver pushes the transmit descriptors and the
+  first 96 bytes of the packet directly to the ENA device memory
+  space. The rest of the packet payload is fetched by the
+  device. For this operation mode, the driver uses a dedicated PCI
+  device memory BAR, which is mapped with write-combine capability.
 
-- Low Latency Queue (LLQ) mode or "push-mode".
-
-  * In this mode the driver pushes the transmit descriptors and the
-    first 128 bytes of the packet directly to the ENA device memory
-    space. The rest of the packet payload is fetched by the
-    device. For this operation mode, the driver uses a dedicated PCI
-    device memory BAR, which is mapped with write-combine capability.
+  **Note that** not all ENA devices support LLQ, and this feature is negotiated
+  with the device upon initialization. If the ENA device does not
+  support LLQ mode, the driver falls back to the regular mode.
 
 The Rx SQs support only the regular mode.
 
-Note: Not all ENA devices support LLQ, and this feature is negotiated
-      with the device upon initialization. If the ENA device does not
-      support LLQ mode, the driver falls back to the regular mode.
-
 The driver supports multi-queue for both Tx and Rx. This has various
 benefits:
 
@@ -175,6 +162,7 @@ benefits:
 
 Interrupt Modes
 ===============
+
 The driver assigns a single MSI-X vector per queue pair (for both Tx
 and Rx directions). The driver assigns an additional dedicated MSI-X vector
 for management (for ACQ and AENQ).
@@ -200,73 +188,72 @@ unmasked by the driver after NAPI processing is complete.
 
 Interrupt Moderation
 ====================
+
 ENA driver and device can operate in conventional or adaptive interrupt
 moderation mode.
 
-In conventional mode the driver instructs device to postpone interrupt
+**In conventional mode** the driver instructs device to postpone interrupt
 posting according to static interrupt delay value. The interrupt delay
-value can be configured through ethtool(8). The following ethtool
-parameters are supported by the driver: tx-usecs, rx-usecs
+value can be configured through `ethtool(8)`. The following `ethtool`
+parameters are supported by the driver: ``tx-usecs``, ``rx-usecs``
 
-In adaptive interrupt moderation mode the interrupt delay value is
+**In adaptive interrupt** moderation mode the interrupt delay value is
 updated by the driver dynamically and adjusted every NAPI cycle
 according to the traffic nature.
 
-By default ENA driver applies adaptive coalescing on Rx traffic and
-conventional coalescing on Tx traffic.
-
-Adaptive coalescing can be switched on/off through ethtool(8)
-adaptive_rx on|off parameter.
+Adaptive coalescing can be switched on/off through `ethtool(8)`'s
+:code:`adaptive_rx on|off` parameter.
 
-The driver chooses interrupt delay value according to the number of
-bytes and packets received between interrupt unmasking and interrupt
-posting. The driver uses interrupt delay table that subdivides the
-range of received bytes/packets into 5 levels and assigns interrupt
-delay value to each level.
+More information about Adaptive Interrupt Moderation (DIM) can be found in
+Documentation/networking/net_dim.rst
 
-The user can enable/disable adaptive moderation, modify the interrupt
-delay table and restore its default values through sysfs.
+.. _`RX copybreak`:
 
 RX copybreak
 ============
+
 The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK
 and can be configured by the ETHTOOL_STUNABLE command of the
 SIOCETHTOOL ioctl.
 
-SKB
-===
-The driver-allocated SKB for frames received from Rx handling using
-NAPI context. The allocation method depends on the size of the packet.
-If the frame length is larger than rx_copybreak, napi_get_frags()
-is used, otherwise netdev_alloc_skb_ip_align() is used, the buffer
-content is copied (by CPU) to the SKB, and the buffer is recycled.
+This option controls the maximum packet length for which the RX
+descriptor it was received on would be recycled. When a packet smaller
+than RX copybreak bytes is received, it is copied into a new memory
+buffer and the RX descriptor is returned to HW.
 
 Statistics
 ==========
-The user can obtain ENA device and driver statistics using ethtool.
+
+The user can obtain ENA device and driver statistics using `ethtool`.
 The driver can collect regular or extended statistics (including
 per-queue stats) from the device.
 
 In addition the driver logs the stats to syslog upon device reset.
 
+On supported instance types, the statistics will also include the
+ENA Express data (fields prefixed with `ena_srd`). For a complete
+documentation of ENA Express data refer to
+https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ena-express.html#ena-express-monitor
+
 MTU
 ===
+
 The driver supports an arbitrarily large MTU with a maximum that is
 negotiated with the device. The driver configures MTU using the
 SetFeature command (ENA_ADMIN_MTU property). The user can change MTU
-via ip(8) and similar legacy tools.
+via `ip(8)` and similar legacy tools.
 
 Stateless Offloads
 ==================
+
 The ENA driver supports:
 
-- TSO over IPv4/IPv6
-- TSO with ECN
 - IPv4 header checksum offload
 - TCP/UDP over IPv4/IPv6 checksum offloads
 
 RSS
 ===
+
 - The ENA device supports RSS that allows flexible Rx traffic
   steering.
 - Toeplitz and CRC32 hash functions are supported.
@@ -274,46 +261,47 @@ RSS
   inputs for hash functions.
 - The driver configures RSS settings using the AQ SetFeature command
   (ENA_ADMIN_RSS_HASH_FUNCTION, ENA_ADMIN_RSS_HASH_INPUT and
-  ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG properties).
+  ENA_ADMIN_RSS_INDIRECTION_TABLE_CONFIG properties).
 - If the NETIF_F_RXHASH flag is set, the 32-bit result of the hash
   function delivered in the Rx CQ descriptor is set in the received
   SKB.
 - The user can provide a hash key, hash function, and configure the
-  indirection table through ethtool(8).
+  indirection table through `ethtool(8)`.
 
 DATA PATH
 =========
+
 Tx
 --
 
-end_start_xmit() is called by the stack. This function does the following:
+:code:`ena_start_xmit()` is called by the stack. This function does the following:
 
-- Maps data buffers (skb->data and frags).
-- Populates ena_buf for the push buffer (if the driver and device are
-  in push mode.)
+- Maps data buffers (``skb->data`` and frags).
+- Populates ``ena_buf`` for the push buffer (if the driver and device are
+  in push mode).
 - Prepares ENA bufs for the remaining frags.
-- Allocates a new request ID from the empty req_id ring. The request
+- Allocates a new request ID from the empty ``req_id`` ring. The request
   ID is the index of the packet in the Tx info. This is used for
-  out-of-order TX completions.
+  out-of-order Tx completions.
 - Adds the packet to the proper place in the Tx ring.
-- Calls ena_com_prepare_tx(), an ENA communication layer that converts
-  the ena_bufs to ENA descriptors (and adds meta ENA descriptors as
-  needed.)
+- Calls :code:`ena_com_prepare_tx()`, an ENA communication layer that converts
+  the ``ena_bufs`` to ENA descriptors (and adds meta ENA descriptors as
+  needed).
 
   * This function also copies the ENA descriptors and the push buffer
-    to the Device memory space (if in push mode.)
+    to the Device memory space (if in push mode).
 
-- Writes doorbell to the ENA device.
+- Writes a doorbell to the ENA device.
 - When the ENA device finishes sending the packet, a completion
   interrupt is raised.
 - The interrupt handler schedules NAPI.
-- The ena_clean_tx_irq() function is called. This function handles the
+- The :code:`ena_clean_tx_irq()` function is called. This function handles the
   completion descriptors generated by the ENA, with a single
   completion descriptor per completed packet.
 
-  * req_id is retrieved from the completion descriptor. The tx_info of
-    the packet is retrieved via the req_id. The data buffers are
-    unmapped and req_id is returned to the empty req_id ring.
+  * ``req_id`` is retrieved from the completion descriptor. The ``tx_info`` of
+    the packet is retrieved via the ``req_id``. The data buffers are
+    unmapped and ``req_id`` is returned to the empty ``req_id`` ring.
   * The function stops when the completion descriptors are completed or
     the budget is reached.
 
@@ -322,12 +310,11 @@ Rx
 
 - When a packet is received from the ENA device.
 - The interrupt handler schedules NAPI.
-- The ena_clean_rx_irq() function is called. This function calls
-  ena_rx_pkt(), an ENA communication layer function, which returns the
-  number of descriptors used for a new unhandled packet, and zero if
+- The :code:`ena_clean_rx_irq()` function is called. This function calls
+  :code:`ena_com_rx_pkt()`, an ENA communication layer function, which returns the
+  number of descriptors used for a new packet, and zero if
   no new packet is found.
-- Then it calls the ena_clean_rx_irq() function.
-- ena_eth_rx_skb() checks packet length:
+- :code:`ena_rx_skb()` checks packet length:
 
   * If the packet is small (len < rx_copybreak), the driver allocates
     a SKB for the new packet, and copies the packet payload into the
@@ -336,9 +323,41 @@ Rx
     - In this way the original data buffer is not passed to the stack
       and is reused for future Rx packets.
 
-  * Otherwise the function unmaps the Rx buffer, then allocates the
-    new SKB structure and hooks the Rx buffer to the SKB frags.
+  * Otherwise the function unmaps the Rx buffer, sets the first
+    descriptor as `skb`'s linear part and the other descriptors as the
+    `skb`'s frags.
 
 - The new SKB is updated with the necessary information (protocol,
-  checksum hw verify result, etc.), and then passed to the network
-  stack, using the NAPI interface function napi_gro_receive().
+  checksum hw verify result, etc), and then passed to the network
+  stack, using the NAPI interface function :code:`napi_gro_receive()`.
+
+Dynamic RX Buffers (DRB)
+------------------------
+
+Each RX descriptor in the RX ring is a single memory page (which is either 4KB
+or 16KB long depending on system's configurations).
+To reduce the memory allocations required when dealing with a high rate of small
+packets, the driver tries to reuse the remaining RX descriptor's space if more
+than 2KB of this page remain unused.
+
+A simple example of this mechanism is the following sequence of events:
+
+::
+
+        1. Driver allocates page-sized RX buffer and passes it to hardware
+                +----------------------+
+                |4KB RX Buffer         |
+                +----------------------+
+
+        2. A 300Bytes packet is received on this buffer
+
+        3. The driver increases the ref count on this page and returns it back to
+           HW as an RX buffer of size 4KB - 300Bytes = 3796 Bytes
+               +----+--------------------+
+               |****|3796 Bytes RX Buffer|
+               +----+--------------------+
+
+This mechanism isn't used when an XDP program is loaded, or when the
+RX packet is less than rx_copybreak bytes (in which case the packet is
+copied out of the RX buffer into the linear part of a new skb allocated
+for it and the RX buffer remains the same size, see `RX copybreak`_).
diff --git a/Documentation/networking/device_drivers/ethernet/amd/pds_core.rst b/Documentation/networking/device_drivers/ethernet/amd/pds_core.rst
new file mode 100644
index 000000000000..9e8a16c44102
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/amd/pds_core.rst
@@ -0,0 +1,139 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+========================================================
+Linux Driver for the AMD/Pensando(R) DSC adapter family
+========================================================
+
+Copyright(c) 2023 Advanced Micro Devices, Inc
+
+Identifying the Adapter
+=======================
+
+To find if one or more AMD/Pensando PCI Core devices are installed on the
+host, check for the PCI devices::
+
+  # lspci -d 1dd8:100c
+  b5:00.0 Processing accelerators: Pensando Systems Device 100c
+  b6:00.0 Processing accelerators: Pensando Systems Device 100c
+
+If such devices are listed as above, then the pds_core.ko driver should find
+and configure them for use.  There should be log entries in the kernel
+messages such as these::
+
+  $ dmesg | grep pds_core
+  pds_core 0000:b5:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
+  pds_core 0000:b5:00.0: FW: 1.60.0-73
+  pds_core 0000:b6:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
+  pds_core 0000:b6:00.0: FW: 1.60.0-73
+
+Driver and firmware version information can be gathered with devlink::
+
+  $ devlink dev info pci/0000:b5:00.0
+  pci/0000:b5:00.0:
+    driver pds_core
+    serial_number FLM18420073
+    versions:
+        fixed:
+          asic.id 0x0
+          asic.rev 0x0
+        running:
+          fw 1.51.0-73
+        stored:
+          fw.goldfw 1.15.9-C-22
+          fw.mainfwa 1.60.0-73
+          fw.mainfwb 1.60.0-57
+
+Info versions
+=============
+
+The ``pds_core`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+   :widths: 5 5 90
+
+   * - Name
+     - Type
+     - Description
+   * - ``fw``
+     - running
+     - Version of firmware running on the device
+   * - ``fw.goldfw``
+     - stored
+     - Version of firmware stored in the goldfw slot
+   * - ``fw.mainfwa``
+     - stored
+     - Version of firmware stored in the mainfwa slot
+   * - ``fw.mainfwb``
+     - stored
+     - Version of firmware stored in the mainfwb slot
+   * - ``asic.id``
+     - fixed
+     - The ASIC type for this device
+   * - ``asic.rev``
+     - fixed
+     - The revision of the ASIC for this device
+
+Parameters
+==========
+
+The ``pds_core`` driver implements the following generic
+parameters for controlling the functionality to be made available
+as auxiliary_bus devices.
+
+.. list-table:: Generic parameters implemented
+   :widths: 5 5 8 82
+
+   * - Name
+     - Mode
+     - Type
+     - Description
+   * - ``enable_vnet``
+     - runtime
+     - Boolean
+     - Enables vDPA functionality through an auxiliary_bus device
+
+Firmware Management
+===================
+
+The ``flash`` command can update a the DSC firmware.  The downloaded firmware
+will be saved into either of firmware bank 1 or bank 2, whichever is not
+currently in use, and that bank will used for the next boot::
+
+  # devlink dev flash pci/0000:b5:00.0 \
+            file pensando/dsc_fw_1.63.0-22.tar
+
+Health Reporters
+================
+
+The driver supports a devlink health reporter for FW status::
+
+  # devlink health show pci/0000:2b:00.0 reporter fw
+  pci/0000:2b:00.0:
+    reporter fw
+      state healthy error 0 recover 0
+  # devlink health diagnose pci/0000:2b:00.0 reporter fw
+   Status: healthy State: 1 Generation: 0 Recoveries: 0
+
+Enabling the driver
+===================
+
+The driver is enabled via the standard kernel configuration system,
+using the make command::
+
+  make oldconfig/menuconfig/etc.
+
+The driver is located in the menu structure at:
+
+  -> Device Drivers
+    -> Network device support (NETDEVICES [=y])
+      -> Ethernet driver support
+        -> AMD devices
+          -> AMD/Pensando Ethernet PDS_CORE Support
+
+Support
+=======
+
+For general Linux networking support, please use the netdev mailing
+list, which is monitored by AMD/Pensando personnel::
+
+  netdev@vger.kernel.org
diff --git a/Documentation/networking/device_drivers/ethernet/amd/pds_vdpa.rst b/Documentation/networking/device_drivers/ethernet/amd/pds_vdpa.rst
new file mode 100644
index 000000000000..587927d3de92
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/amd/pds_vdpa.rst
@@ -0,0 +1,85 @@
+.. SPDX-License-Identifier: GPL-2.0+
+.. note: can be edited and viewed with /usr/bin/formiko-vim
+
+==========================================================
+PCI vDPA driver for the AMD/Pensando(R) DSC adapter family
+==========================================================
+
+AMD/Pensando vDPA VF Device Driver
+
+Copyright(c) 2023 Advanced Micro Devices, Inc
+
+Overview
+========
+
+The ``pds_vdpa`` driver is an auxiliary bus driver that supplies
+a vDPA device for use by the virtio network stack.  It is used with
+the Pensando Virtual Function devices that offer vDPA and virtio queue
+services.  It depends on the ``pds_core`` driver and hardware for the PF
+and VF PCI handling as well as for device configuration services.
+
+Using the device
+================
+
+The ``pds_vdpa`` device is enabled via multiple configuration steps and
+depends on the ``pds_core`` driver to create and enable SR-IOV Virtual
+Function devices.  After the VFs are enabled, we enable the vDPA service
+in the ``pds_core`` device to create the auxiliary devices used by pds_vdpa.
+
+Example steps:
+
+.. code-block:: bash
+
+  #!/bin/bash
+
+  modprobe pds_core
+  modprobe vdpa
+  modprobe pds_vdpa
+
+  PF_BDF=`ls /sys/module/pds_core/drivers/pci\:pds_core/*/sriov_numvfs | awk -F / '{print $7}'`
+
+  # Enable vDPA VF auxiliary device(s) in the PF
+  devlink dev param set pci/$PF_BDF name enable_vnet cmode runtime value true
+
+  # Create a VF for vDPA use
+  echo 1 > /sys/bus/pci/drivers/pds_core/$PF_BDF/sriov_numvfs
+
+  # Find the vDPA services/devices available
+  PDS_VDPA_MGMT=`vdpa mgmtdev show | grep vDPA | head -1 | cut -d: -f1`
+
+  # Create a vDPA device for use in virtio network configurations
+  vdpa dev add name vdpa1 mgmtdev $PDS_VDPA_MGMT mac 00:11:22:33:44:55
+
+  # Set up an ethernet interface on the vdpa device
+  modprobe virtio_vdpa
+
+
+
+Enabling the driver
+===================
+
+The driver is enabled via the standard kernel configuration system,
+using the make command::
+
+  make oldconfig/menuconfig/etc.
+
+The driver is located in the menu structure at:
+
+  -> Device Drivers
+    -> Network device support (NETDEVICES [=y])
+      -> Ethernet driver support
+        -> Pensando devices
+          -> Pensando Ethernet PDS_VDPA Support
+
+Support
+=======
+
+For general Linux networking support, please use the netdev mailing
+list, which is monitored by Pensando personnel::
+
+  netdev@vger.kernel.org
+
+For more specific support needs, please use the Pensando driver support
+email::
+
+  drivers@pensando.io
diff --git a/Documentation/networking/device_drivers/ethernet/amd/pds_vfio_pci.rst b/Documentation/networking/device_drivers/ethernet/amd/pds_vfio_pci.rst
new file mode 100644
index 000000000000..7a6bc848a2b2
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/amd/pds_vfio_pci.rst
@@ -0,0 +1,79 @@
+.. SPDX-License-Identifier: GPL-2.0+
+.. note: can be edited and viewed with /usr/bin/formiko-vim
+
+==========================================================
+PCI VFIO driver for the AMD/Pensando(R) DSC adapter family
+==========================================================
+
+AMD/Pensando Linux VFIO PCI Device Driver
+Copyright(c) 2023 Advanced Micro Devices, Inc.
+
+Overview
+========
+
+The ``pds-vfio-pci`` module is a PCI driver that supports Live Migration
+capable Virtual Function (VF) devices in the DSC hardware.
+
+Using the device
+================
+
+The pds-vfio-pci device is enabled via multiple configuration steps and
+depends on the ``pds_core`` driver to create and enable SR-IOV Virtual
+Function devices.
+
+Shown below are the steps to bind the driver to a VF and also to the
+associated auxiliary device created by the ``pds_core`` driver. This
+example assumes the pds_core and pds-vfio-pci modules are already
+loaded.
+
+.. code-block:: bash
+  :name: example-setup-script
+
+  #!/bin/bash
+
+  PF_BUS="0000:60"
+  PF_BDF="0000:60:00.0"
+  VF_BDF="0000:60:00.1"
+
+  # Prevent non-vfio VF driver from probing the VF device
+  echo 0 > /sys/class/pci_bus/$PF_BUS/device/$PF_BDF/sriov_drivers_autoprobe
+
+  # Create single VF for Live Migration via pds_core
+  echo 1 > /sys/bus/pci/drivers/pds_core/$PF_BDF/sriov_numvfs
+
+  # Allow the VF to be bound to the pds-vfio-pci driver
+  echo "pds-vfio-pci" > /sys/class/pci_bus/$PF_BUS/device/$VF_BDF/driver_override
+
+  # Bind the VF to the pds-vfio-pci driver
+  echo "$VF_BDF" > /sys/bus/pci/drivers/pds-vfio-pci/bind
+
+After performing the steps above, a file in /dev/vfio/<iommu_group>
+should have been created.
+
+
+Enabling the driver
+===================
+
+The driver is enabled via the standard kernel configuration system,
+using the make command::
+
+  make oldconfig/menuconfig/etc.
+
+The driver is located in the menu structure at:
+
+  -> Device Drivers
+    -> VFIO Non-Privileged userspace driver framework
+      -> VFIO support for PDS PCI devices
+
+Support
+=======
+
+For general Linux networking support, please use the netdev mailing
+list, which is monitored by Pensando personnel::
+
+  netdev@vger.kernel.org
+
+For more specific support needs, please use the Pensando driver support
+email::
+
+  drivers@pensando.io
diff --git a/Documentation/networking/device_drivers/aquantia/atlantic.rst b/Documentation/networking/device_drivers/ethernet/aquantia/atlantic.rst
index 595ddef1c8b3..099280a261be 100644
--- a/Documentation/networking/device_drivers/aquantia/atlantic.rst
+++ b/Documentation/networking/device_drivers/ethernet/aquantia/atlantic.rst
@@ -270,7 +270,7 @@ RX flow rules (ntuple filters)
 
     ethtool -K ethX ntuple <on|off>
 
- When disabling ntuple filters, all the user programed filters are
+ When disabling ntuple filters, all the user programmed filters are
  flushed from the driver cache and hardware. All needed filters must
  be re-added when ntuple is re-enabled.
 
@@ -418,7 +418,7 @@ Default value: 0xFFFF
 0        Disable interrupt throttling.
 1        Enable interrupt throttling and use specified tx and rx rates.
 0xFFFF   Auto throttling mode. Driver will choose the best RX and TX
-	 interrupt throtting settings based on link speed.
+	 interrupt throttling settings based on link speed.
 ======   ==============================================================
 
 aq_itr_tx - TX interrupt throttle rate
@@ -456,7 +456,7 @@ AQ_CFG_RX_PAGEORDER
 
 Default value: 0
 
-RX page order override. Thats a power of 2 number of RX pages allocated for
+RX page order override. That's a power of 2 number of RX pages allocated for
 each descriptor. Received descriptor size is still limited by
 AQ_CFG_RX_FRAME_MAX.
 
diff --git a/Documentation/networking/device_drivers/chelsio/cxgb.rst b/Documentation/networking/device_drivers/ethernet/chelsio/cxgb.rst
index 435dce5fa2c7..435dce5fa2c7 100644
--- a/Documentation/networking/device_drivers/chelsio/cxgb.rst
+++ b/Documentation/networking/device_drivers/ethernet/chelsio/cxgb.rst
diff --git a/Documentation/networking/device_drivers/cirrus/cs89x0.rst b/Documentation/networking/device_drivers/ethernet/cirrus/cs89x0.rst
index e5c283940ac5..e5c283940ac5 100644
--- a/Documentation/networking/device_drivers/cirrus/cs89x0.rst
+++ b/Documentation/networking/device_drivers/ethernet/cirrus/cs89x0.rst
diff --git a/Documentation/networking/device_drivers/davicom/dm9000.rst b/Documentation/networking/device_drivers/ethernet/davicom/dm9000.rst
index d5458da01083..14eb0a4d4e4e 100644
--- a/Documentation/networking/device_drivers/davicom/dm9000.rst
+++ b/Documentation/networking/device_drivers/ethernet/davicom/dm9000.rst
@@ -34,7 +34,7 @@ These resources should be specified in that order, as the ordering of the
 two address regions is important (the driver expects these to be address
 and then data).
 
-An example from arch/arm/mach-s3c2410/mach-bast.c is::
+An example from arch/arm/mach-s3c/mach-bast.c is::
 
   static struct resource bast_dm9k_resource[] = {
 	[0] = {
diff --git a/Documentation/networking/device_drivers/dec/dmfe.rst b/Documentation/networking/device_drivers/ethernet/dec/dmfe.rst
index c4cf809cad84..c4cf809cad84 100644
--- a/Documentation/networking/device_drivers/dec/dmfe.rst
+++ b/Documentation/networking/device_drivers/ethernet/dec/dmfe.rst
diff --git a/Documentation/networking/device_drivers/dlink/dl2k.rst b/Documentation/networking/device_drivers/ethernet/dlink/dl2k.rst
index ccdb5d0d7460..ccdb5d0d7460 100644
--- a/Documentation/networking/device_drivers/dlink/dl2k.rst
+++ b/Documentation/networking/device_drivers/ethernet/dlink/dl2k.rst
diff --git a/Documentation/networking/device_drivers/freescale/dpaa.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa.rst
index 241c6c6f6e68..241c6c6f6e68 100644
--- a/Documentation/networking/device_drivers/freescale/dpaa.rst
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa.rst
diff --git a/Documentation/networking/device_drivers/freescale/dpaa2/dpio-driver.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/dpio-driver.rst
index 17dbee1ac53e..e4ebfe62a183 100644
--- a/Documentation/networking/device_drivers/freescale/dpaa2/dpio-driver.rst
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/dpio-driver.rst
@@ -1,5 +1,6 @@
 .. include:: <isonum.txt>
 
+===================================
 DPAA2 DPIO (Data Path I/O) Overview
 ===================================
 
@@ -19,8 +20,10 @@ pool management for network interfaces.
 This document provides an overview the Linux DPIO driver, its
 subcomponents, and its APIs.
 
-See Documentation/networking/device_drivers/freescale/dpaa2/overview.rst for
-a general overview of DPAA2 and the general DPAA2 driver architecture in Linux.
+See
+Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
+for a general overview of DPAA2 and the general DPAA2 driver architecture
+in Linux.
 
 Driver Overview
 ---------------
diff --git a/Documentation/networking/device_drivers/freescale/dpaa2/ethernet-driver.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/ethernet-driver.rst
index cb4c9a0c5a17..682f3986c15b 100644
--- a/Documentation/networking/device_drivers/freescale/dpaa2/ethernet-driver.rst
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/ethernet-driver.rst
@@ -33,7 +33,8 @@ hardware resources, like queues, do not have a corresponding MC object and
 are treated as internal resources of other objects.
 
 For a more detailed description of the DPAA2 architecture and its object
-abstractions see *Documentation/networking/device_drivers/freescale/dpaa2/overview.rst*.
+abstractions see
+*Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst*.
 
 Each Linux net device is built on top of a Datapath Network Interface (DPNI)
 object and uses Buffer Pools (DPBPs), I/O Portals (DPIOs) and Concentrators
diff --git a/Documentation/networking/device_drivers/freescale/dpaa2/index.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/index.rst
index ee40fcc5ddff..62f4a4aff6ec 100644
--- a/Documentation/networking/device_drivers/freescale/dpaa2/index.rst
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/index.rst
@@ -9,3 +9,4 @@ DPAA2 Documentation
    dpio-driver
    ethernet-driver
    mac-phy-support
+   switch-driver
diff --git a/Documentation/networking/device_drivers/freescale/dpaa2/mac-phy-support.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst
index 51e6624fb774..e2a36d0d88ef 100644
--- a/Documentation/networking/device_drivers/freescale/dpaa2/mac-phy-support.rst
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst
@@ -11,7 +11,7 @@ Overview
 --------
 
 The DPAA2 MAC / PHY support consists of a set of APIs that help DPAA2 network
-drivers (dpaa2-eth, dpaa2-ethsw) interract with the PHY library.
+drivers (dpaa2-eth, dpaa2-ethsw) interact with the PHY library.
 
 DPAA2 Software Architecture
 ---------------------------
@@ -181,10 +181,13 @@ when necessary using the below listed API::
  - int dpaa2_mac_connect(struct dpaa2_mac *mac);
  - void dpaa2_mac_disconnect(struct dpaa2_mac *mac);
 
-A phylink integration is necessary only when the partner DPMAC is not of TYPE_FIXED.
-One can check for this condition using the below API::
+A phylink integration is necessary only when the partner DPMAC is not of
+``TYPE_FIXED``. This means it is either of ``TYPE_PHY``, or of
+``TYPE_BACKPLANE`` (the difference being the two that in the ``TYPE_BACKPLANE``
+mode, the MC firmware does not access the PCS registers). One can check for
+this condition using the following helper::
 
- - bool dpaa2_mac_is_type_fixed(struct fsl_mc_device *dpmac_dev,struct fsl_mc_io *mc_io);
+ - static inline bool dpaa2_mac_is_type_phy(struct dpaa2_mac *mac);
 
 Before connection to a MAC, the caller must allocate and populate the
 dpaa2_mac structure with the associated net_device, a pointer to the MC portal
diff --git a/Documentation/networking/device_drivers/freescale/dpaa2/overview.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
index d638b5a8aadd..32ee827a3a2c 100644
--- a/Documentation/networking/device_drivers/freescale/dpaa2/overview.rst
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
@@ -183,6 +183,7 @@ PHY and allows physical transmission and reception of Ethernet frames.
   IRQ config, enable, reset
 
 DPNI (Datapath Network Interface)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 Contains TX/RX queues, network interface configuration, and RX buffer pool
 configuration mechanisms.  The TX/RX queues are in memory and are identified
 by queue number.
@@ -338,7 +339,7 @@ Key functions include:
   a bind of the root DPRC to the DPRC driver
 
 The binding for the MC-bus device-tree node can be consulted at
-*Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt*.
+*Documentation/devicetree/bindings/misc/fsl,qoriq-mc.yaml*.
 The sysfs bind/unbind interfaces for the MC-bus can be consulted at
 *Documentation/ABI/testing/sysfs-bus-fsl-mc*.
 
diff --git a/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/switch-driver.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/switch-driver.rst
new file mode 100644
index 000000000000..5f3885e56f58
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/switch-driver.rst
@@ -0,0 +1,217 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===================
+DPAA2 Switch driver
+===================
+
+:Copyright: |copy| 2021 NXP
+
+The DPAA2 Switch driver probes on the Datapath Switch (DPSW) object which can
+be instantiated on the following DPAA2 SoCs and their variants: LS2088A and
+LX2160A.
+
+The driver uses the switch device driver model and exposes each switch port as
+a network interface, which can be included in a bridge or used as a standalone
+interface. Traffic switched between ports is offloaded into the hardware.
+
+The DPSW can have ports connected to DPNIs or to DPMACs for external access.
+::
+
+         [ethA]     [ethB]      [ethC]     [ethD]     [ethE]     [ethF]
+            :          :          :          :          :          :
+            :          :          :          :          :          :
+       [dpaa2-eth]  [dpaa2-eth]  [              dpaa2-switch              ]
+            :          :          :          :          :          :        kernel
+       =============================================================================
+            :          :          :          :          :          :        hardware
+         [DPNI]      [DPNI]     [============= DPSW =================]
+            |          |          |          |          |          |
+            |           ----------           |       [DPMAC]    [DPMAC]
+             -------------------------------            |          |
+                                                        |          |
+                                                      [PHY]      [PHY]
+
+Creating an Ethernet Switch
+===========================
+
+The dpaa2-switch driver probes on DPSW devices found on the fsl-mc bus. These
+devices can be either created statically through the boot time configuration
+file - DataPath Layout (DPL) - or at runtime using the DPAA2 object APIs
+(incorporated already into the restool userspace tool).
+
+At the moment, the dpaa2-switch driver imposes the following restrictions on
+the DPSW object that it will probe:
+
+ * The minimum number of FDBs should be at least equal to the number of switch
+   interfaces. This is necessary so that separation of switch ports can be
+   done, ie when not under a bridge, each switch port will have its own FDB.
+   ::
+
+        fsl_dpaa2_switch dpsw.0: The number of FDBs is lower than the number of ports, cannot probe
+
+ * Both the broadcast and flooding configuration should be per FDB. This
+   enables the driver to restrict the broadcast and flooding domains of each
+   FDB depending on the switch ports that are sharing it (aka are under the
+   same bridge).
+   ::
+
+        fsl_dpaa2_switch dpsw.0: Flooding domain is not per FDB, cannot probe
+        fsl_dpaa2_switch dpsw.0: Broadcast domain is not per FDB, cannot probe
+
+ * The control interface of the switch should not be disabled
+   (DPSW_OPT_CTRL_IF_DIS not passed as a create time option). Without the
+   control interface, the driver is not capable to provide proper Rx/Tx traffic
+   support on the switch port netdevices.
+   ::
+
+        fsl_dpaa2_switch dpsw.0: Control Interface is disabled, cannot probe
+
+Besides the configuration of the actual DPSW object, the dpaa2-switch driver
+will need the following DPAA2 objects:
+
+ * 1 DPMCP - A Management Command Portal object is needed for any interaction
+   with the MC firmware.
+
+ * 1 DPBP - A Buffer Pool is used for seeding buffers intended for the Rx path
+   on the control interface.
+
+ * Access to at least one DPIO object (Software Portal) is needed for any
+   enqueue/dequeue operation to be performed on the control interface queues.
+   The DPIO object will be shared, no need for a private one.
+
+Switching features
+==================
+
+The driver supports the configuration of L2 forwarding rules in hardware for
+port bridging as well as standalone usage of the independent switch interfaces.
+
+The hardware is not configurable with respect to VLAN awareness, thus any DPAA2
+switch port should be used only in usecases with a VLAN aware bridge::
+
+        $ ip link add dev br0 type bridge vlan_filtering 1
+
+        $ ip link add dev br1 type bridge
+        $ ip link set dev ethX master br1
+        Error: fsl_dpaa2_switch: Cannot join a VLAN-unaware bridge
+
+Topology and loop detection through STP is supported when ``stp_state 1`` is
+used at bridge create ::
+
+        $ ip link add dev br0 type bridge vlan_filtering 1 stp_state 1
+
+L2 FDB manipulation (add/delete/dump) is supported.
+
+HW FDB learning can be configured on each switch port independently through
+bridge commands. When the HW learning is disabled, a fast age procedure will be
+run and any previously learnt addresses will be removed.
+::
+
+        $ bridge link set dev ethX learning off
+        $ bridge link set dev ethX learning on
+
+Restricting the unknown unicast and multicast flooding domain is supported, but
+not independently of each other::
+
+        $ ip link set dev ethX type bridge_slave flood off mcast_flood off
+        $ ip link set dev ethX type bridge_slave flood off mcast_flood on
+        Error: fsl_dpaa2_switch: Cannot configure multicast flooding independently of unicast.
+
+Broadcast flooding on a switch port can be disabled/enabled through the brport sysfs::
+
+        $ echo 0 > /sys/bus/fsl-mc/devices/dpsw.Y/net/ethX/brport/broadcast_flood
+
+Offloads
+========
+
+Routing actions (redirect, trap, drop)
+--------------------------------------
+
+The DPAA2 switch is able to offload flow-based redirection of packets making
+use of ACL tables. Shared filter blocks are supported by sharing a single ACL
+table between multiple ports.
+
+The following flow keys are supported:
+
+ * Ethernet: dst_mac/src_mac
+ * IPv4: dst_ip/src_ip/ip_proto/tos
+ * VLAN: vlan_id/vlan_prio/vlan_tpid/vlan_dei
+ * L4: dst_port/src_port
+
+Also, the matchall filter can be used to redirect the entire traffic received
+on a port.
+
+As per flow actions, the following are supported:
+
+ * drop
+ * mirred egress redirect
+ * trap
+
+Each ACL entry (filter) can be setup with only one of the listed
+actions.
+
+Example 1: send frames received on eth4 with a SA of 00:01:02:03:04:05 to the
+CPU::
+
+        $ tc qdisc add dev eth4 clsact
+        $ tc filter add dev eth4 ingress flower src_mac 00:01:02:03:04:05 skip_sw action trap
+
+Example 2: drop frames received on eth4 with VID 100 and PCP of 3::
+
+        $ tc filter add dev eth4 ingress protocol 802.1q flower skip_sw vlan_id 100 vlan_prio 3 action drop
+
+Example 3: redirect all frames received on eth4 to eth1::
+
+        $ tc filter add dev eth4 ingress matchall action mirred egress redirect dev eth1
+
+Example 4: Use a single shared filter block on both eth5 and eth6::
+
+        $ tc qdisc add dev eth5 ingress_block 1 clsact
+        $ tc qdisc add dev eth6 ingress_block 1 clsact
+        $ tc filter add block 1 ingress flower dst_mac 00:01:02:03:04:04 skip_sw \
+                action trap
+        $ tc filter add block 1 ingress protocol ipv4 flower src_ip 192.168.1.1 skip_sw \
+                action mirred egress redirect dev eth3
+
+Mirroring
+~~~~~~~~~
+
+The DPAA2 switch supports only per port mirroring and per VLAN mirroring.
+Adding mirroring filters in shared blocks is also supported.
+
+When using the tc-flower classifier with the 802.1q protocol, only the
+''vlan_id'' key will be accepted. Mirroring based on any other fields from the
+802.1q protocol will be rejected::
+
+        $ tc qdisc add dev eth8 ingress_block 1 clsact
+        $ tc filter add block 1 ingress protocol 802.1q flower skip_sw vlan_prio 3 action mirred egress mirror dev eth6
+        Error: fsl_dpaa2_switch: Only matching on VLAN ID supported.
+        We have an error talking to the kernel
+
+If a mirroring VLAN filter is requested on a port, the VLAN must to be
+installed on the switch port in question either using ''bridge'' or by creating
+a VLAN upper device if the switch port is used as a standalone interface::
+
+        $ tc qdisc add dev eth8 ingress_block 1 clsact
+        $ tc filter add block 1 ingress protocol 802.1q flower skip_sw vlan_id 200 action mirred egress mirror dev eth6
+        Error: VLAN must be installed on the switch port.
+        We have an error talking to the kernel
+
+        $ bridge vlan add vid 200 dev eth8
+        $ tc filter add block 1 ingress protocol 802.1q flower skip_sw vlan_id 200 action mirred egress mirror dev eth6
+
+        $ ip link add link eth8 name eth8.200 type vlan id 200
+        $ tc filter add block 1 ingress protocol 802.1q flower skip_sw vlan_id 200 action mirred egress mirror dev eth6
+
+Also, it should be noted that the mirrored traffic will be subject to the same
+egress restrictions as any other traffic. This means that when a mirrored
+packet will reach the mirror port, if the VLAN found in the packet is not
+installed on the port it will get dropped.
+
+The DPAA2 switch supports only a single mirroring destination, thus multiple
+mirror rules can be installed but their ''to'' port has to be the same::
+
+        $ tc filter add block 1 ingress protocol 802.1q flower skip_sw vlan_id 200 action mirred egress mirror dev eth6
+        $ tc filter add block 1 ingress protocol 802.1q flower skip_sw vlan_id 100 action mirred egress mirror dev eth7
+        Error: fsl_dpaa2_switch: Multiple mirror ports not supported.
+        We have an error talking to the kernel
diff --git a/Documentation/networking/device_drivers/freescale/gianfar.rst b/Documentation/networking/device_drivers/ethernet/freescale/gianfar.rst
index 9c4a91d3824b..9c4a91d3824b 100644
--- a/Documentation/networking/device_drivers/freescale/gianfar.rst
+++ b/Documentation/networking/device_drivers/ethernet/freescale/gianfar.rst
diff --git a/Documentation/networking/device_drivers/google/gve.rst b/Documentation/networking/device_drivers/ethernet/google/gve.rst
index 793693cef6e3..31d621bca82e 100644
--- a/Documentation/networking/device_drivers/google/gve.rst
+++ b/Documentation/networking/device_drivers/ethernet/google/gve.rst
@@ -47,13 +47,33 @@ The driver interacts with the device in the following ways:
  - Transmit and Receive Queues
     - See description below
 
+Descriptor Formats
+------------------
+GVE supports two descriptor formats: GQI and DQO. These two formats have
+entirely different descriptors, which will be described below.
+
+Addressing Mode
+------------------
+GVE supports two addressing modes: QPL and RDA.
+QPL ("queue-page-list") mode communicates data through a set of
+pre-registered pages.
+
+For RDA ("raw DMA addressing") mode, the set of pages is dynamic.
+Therefore, the packet buffers can be anywhere in guest memory.
+
 Registers
 ---------
-All registers are MMIO and big endian.
+All registers are MMIO.
 
 The registers are used for initializing and configuring the device as well as
 querying device status in response to management interrupts.
 
+Endianness
+----------
+- Admin Queue messages and registers are all Big Endian.
+- GQI descriptors and datapath registers are Big Endian.
+- DQO descriptors and datapath registers are Little Endian.
+
 Admin Queue (AQ)
 ----------------
 The Admin Queue is a PAGE_SIZE memory block, treated as an array of AQ
@@ -97,10 +117,10 @@ the queues associated with that interrupt.
 The handler for these irqs schedule the napi for that block to run
 and poll the queues.
 
-Traffic Queues
---------------
-gVNIC's queues are composed of a descriptor ring and a buffer and are
-assigned to a notification block.
+GQI Traffic Queues
+------------------
+GQI queues are composed of a descriptor ring and a buffer and are assigned to a
+notification block.
 
 The descriptor rings are power-of-two-sized ring buffers consisting of
 fixed-size descriptors. They advance their head pointer using a __be32
@@ -121,3 +141,35 @@ Receive
 The buffers for receive rings are put into a data ring that is the same
 length as the descriptor ring and the head and tail pointers advance over
 the rings together.
+
+DQO Traffic Queues
+------------------
+- Every TX and RX queue is assigned a notification block.
+
+- TX and RX buffers queues, which send descriptors to the device, use MMIO
+  doorbells to notify the device of new descriptors.
+
+- RX and TX completion queues, which receive descriptors from the device, use a
+  "generation bit" to know when a descriptor was populated by the device. The
+  driver initializes all bits with the "current generation". The device will
+  populate received descriptors with the "next generation" which is inverted
+  from the current generation. When the ring wraps, the current/next generation
+  are swapped.
+
+- It's the driver's responsibility to ensure that the RX and TX completion
+  queues are not overrun. This can be accomplished by limiting the number of
+  descriptors posted to HW.
+
+- TX packets have a 16 bit completion_tag and RX buffers have a 16 bit
+  buffer_id. These will be returned on the TX completion and RX queues
+  respectively to let the driver know which packet/buffer was completed.
+
+Transmit
+~~~~~~~~
+A packet's buffers are DMA mapped for the device to access before transmission.
+After the packet was successfully transmitted, the buffers are unmapped.
+
+Receive
+~~~~~~~
+The driver posts fixed sized buffers to HW on the RX buffer queue. The packet
+received on the associated RX queue may span multiple descriptors.
diff --git a/Documentation/networking/hinic.rst b/Documentation/networking/device_drivers/ethernet/huawei/hinic.rst
index 867ac8f4e04a..867ac8f4e04a 100644
--- a/Documentation/networking/hinic.rst
+++ b/Documentation/networking/device_drivers/ethernet/huawei/hinic.rst
diff --git a/Documentation/networking/device_drivers/ethernet/huawei/hinic3.rst b/Documentation/networking/device_drivers/ethernet/huawei/hinic3.rst
new file mode 100644
index 000000000000..e3dfd083fa52
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/huawei/hinic3.rst
@@ -0,0 +1,137 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================================================
+Linux kernel driver for Huawei Ethernet Device Driver (hinic3) family
+=====================================================================
+
+Overview
+========
+
+The hinic3 is a network interface card (NIC) for Data Center. It supports
+a range of link-speed devices (10GE, 25GE, 100GE, etc.). The hinic3
+devices can have multiple physical forms: LOM (Lan on Motherboard) NIC,
+PCIe standard NIC, OCP (Open Compute Project) NIC, etc.
+
+The hinic3 driver supports the following features:
+- IPv4/IPv6 TCP/UDP checksum offload
+- TSO (TCP Segmentation Offload), LRO (Large Receive Offload)
+- RSS (Receive Side Scaling)
+- MSI-X interrupt aggregation configuration and interrupt adaptation.
+- SR-IOV (Single Root I/O Virtualization).
+
+Content
+=======
+
+- Supported PCI vendor ID/device IDs
+- Source Code Structure of Hinic3 Driver
+- Management Interface
+
+Supported PCI vendor ID/device IDs
+==================================
+
+19e5:0222 - hinic3 PF/PPF
+19e5:375F - hinic3 VF
+
+Prime Physical Function (PPF) is responsible for the management of the
+whole NIC card. For example, clock synchronization between the NIC and
+the host. Any PF may serve as a PPF. The PPF is selected dynamically.
+
+Source Code Structure of Hinic3 Driver
+======================================
+
+========================  ================================================
+hinic3_pci_id_tbl.h       Supported device IDs
+hinic3_hw_intf.h          Interface between HW and driver
+hinic3_queue_common.[ch]  Common structures and methods for NIC queues
+hinic3_common.[ch]        Encapsulation of memory operations in Linux
+hinic3_csr.h              Register definitions in the BAR
+hinic3_hwif.[ch]          Interface for BAR
+hinic3_eqs.[ch]           Interface for AEQs and CEQs
+hinic3_mbox.[ch]          Interface for mailbox
+hinic3_mgmt.[ch]          Management interface based on mailbox and AEQ
+hinic3_wq.[ch]            Work queue data structures and interface
+hinic3_cmdq.[ch]          Command queue is used to post command to HW
+hinic3_hwdev.[ch]         HW structures and methods abstractions
+hinic3_lld.[ch]           Auxiliary driver adaptation layer
+hinic3_hw_comm.[ch]       Interface for common HW operations
+hinic3_mgmt_interface.h   Interface between firmware and driver
+hinic3_hw_cfg.[ch]        Interface for HW configuration
+hinic3_irq.c              Interrupt request
+hinic3_netdev_ops.c       Operations registered to Linux kernel stack
+hinic3_nic_dev.h          NIC structures and methods abstractions
+hinic3_main.c             Main Linux kernel driver
+hinic3_nic_cfg.[ch]       NIC service configuration
+hinic3_nic_io.[ch]        Management plane interface for TX and RX
+hinic3_rss.[ch]           Interface for Receive Side Scaling (RSS)
+hinic3_rx.[ch]            Interface for transmit
+hinic3_tx.[ch]            Interface for receive
+hinic3_ethtool.c          Interface for ethtool operations (ops)
+hinic3_filter.c           Interface for MAC address
+========================  ================================================
+
+Management Interface
+====================
+
+Asynchronous Event Queue (AEQ)
+------------------------------
+
+AEQ receives high priority events from the HW over a descriptor queue.
+Every descriptor is a fixed size of 64 bytes. AEQ can receive solicited or
+unsolicited events. Every device, VF or PF, can have up to 4 AEQs.
+Every AEQ is associated to a dedicated IRQ. AEQ can receive multiple types
+of events, but in practice the hinic3 driver ignores all events except for
+2 mailbox related events.
+
+Mailbox
+-------
+
+Mailbox is a communication mechanism between the hinic3 driver and the HW.
+Each device has an independent mailbox. Driver can use the mailbox to send
+requests to management. Driver receives mailbox messages, such as responses
+to requests, over the AEQ (using event HINIC3_AEQ_FOR_MBOX). Due to the
+limited size of mailbox data register, mailbox messages are sent
+segment-by-segment.
+
+Every device can use its mailbox to post request to firmware. The mailbox
+can also be used to post requests and responses between the PF and its VFs.
+
+Completion Event Queue (CEQ)
+----------------------------
+
+The implementation of CEQ is the same as AEQ. It receives completion events
+from HW over a fixed size descriptor of 32 bits. Every device can have up
+to 32 CEQs. Every CEQ has a dedicated IRQ. CEQ only receives solicited
+events that are responses to requests from the driver. CEQ can receive
+multiple types of events, but in practice the hinic3 driver ignores all
+events except for HINIC3_CMDQ that represents completion of previously
+posted commands on a cmdq.
+
+Command Queue (cmdq)
+--------------------
+
+Every cmdq has a dedicated work queue on which commands are posted.
+Commands on the work queue are fixed size descriptor of size 64 bytes.
+Completion of a command will be indicated using ctrl bits in the
+descriptor that carried the command. Notification of command completions
+will also be provided via event on CEQ. Every device has 4 command queues
+that are initialized as a set (called cmdqs), each with its own type.
+Hinic3 driver only uses type HINIC3_CMDQ_SYNC.
+
+Work Queues(WQ)
+---------------
+
+Work queues are logical arrays of fixed size WQEs. The array may be spread
+over multiple non-contiguous pages using indirection table. Work queues are
+used by I/O queues and command queues.
+
+Global function ID
+------------------
+
+Every function, PF or VF, has a unique ordinal identification within the device.
+Many management commands (mbox or cmdq) contain this ID so HW can apply the
+command effect to the right function.
+
+PF is allowed to post management commands to a subordinate VF by specifying the
+VFs ID. A VF must provide its own ID. Anti-spoofing in the HW will cause
+command from a VF to fail if it contains the wrong ID.
+
diff --git a/Documentation/networking/device_drivers/ethernet/index.rst b/Documentation/networking/device_drivers/ethernet/index.rst
new file mode 100644
index 000000000000..139b4c75a191
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/index.rst
@@ -0,0 +1,68 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Ethernet Device Drivers
+=======================
+
+Device drivers for Ethernet and Ethernet-based virtual function devices.
+
+Contents:
+
+.. toctree::
+   :maxdepth: 2
+
+   3com/3c509
+   3com/vortex
+   amazon/ena
+   altera/altera_tse
+   amd/pds_core
+   amd/pds_vdpa
+   amd/pds_vfio_pci
+   aquantia/atlantic
+   chelsio/cxgb
+   cirrus/cs89x0
+   dlink/dl2k
+   davicom/dm9000
+   dec/dmfe
+   freescale/dpaa
+   freescale/dpaa2/index
+   freescale/gianfar
+   google/gve
+   huawei/hinic
+   huawei/hinic3
+   intel/e100
+   intel/e1000
+   intel/e1000e
+   intel/fm10k
+   intel/idpf
+   intel/igb
+   intel/igbvf
+   intel/ixgbe
+   intel/ixgbevf
+   intel/i40e
+   intel/iavf
+   intel/ice
+   marvell/octeontx2
+   marvell/octeon_ep
+   marvell/octeon_ep_vf
+   mellanox/mlx5/index
+   meta/fbnic
+   microsoft/netvsc
+   neterion/s2io
+   netronome/nfp
+   pensando/ionic
+   smsc/smc9
+   stmicro/stmmac
+   ti/cpsw
+   ti/cpsw_switchdev
+   ti/am65_nuss_cpsw_switchdev
+   ti/tlan
+   ti/icssg_prueth
+   wangxun/txgbe
+   wangxun/ngbe
+
+.. only::  subproject and html
+
+   Indices
+   =======
+
+   * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/intel/e100.rst b/Documentation/networking/device_drivers/ethernet/intel/e100.rst
index 3ac21e7119a7..5dee1b53e977 100644
--- a/Documentation/networking/device_drivers/intel/e100.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/e100.rst
@@ -41,7 +41,7 @@ Identifying Your Adapter
 
 For information on how to identify your adapter, and for the latest Intel
 network drivers, refer to the Intel Support website:
-http://www.intel.com/support
+https://www.intel.com/support
 
 Driver Configuration Parameters
 ===============================
@@ -151,8 +151,7 @@ NAPI
 
 NAPI (Rx polling mode) is supported in the e100 driver.
 
-See https://wiki.linuxfoundation.org/networking/napi for more
-information on NAPI.
+See :ref:`Documentation/networking/napi.rst <napi>` for more information.
 
 Multiple Interfaces on Same Ethernet Broadcast Network
 ------------------------------------------------------
@@ -179,10 +178,8 @@ filtering by
 Support
 =======
 For general information, go to the Intel support website at:
-http://www.intel.com/support/
+https://www.intel.com/support/
 
-or the Intel Wired Networking project hosted by Sourceforge at:
-http://sourceforge.net/projects/e1000
 If an issue is identified with the released source code on a supported kernel
 with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/intel/e1000.rst b/Documentation/networking/device_drivers/ethernet/intel/e1000.rst
index 4aaae0f7d6ba..52a7fb9ce8d9 100644
--- a/Documentation/networking/device_drivers/intel/e1000.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/e1000.rst
@@ -451,13 +451,8 @@ Support
 =======
 
 For general information, go to the Intel support website at:
-
-    http://support.intel.com
-
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-    http://sourceforge.net/projects/e1000
+http://support.intel.com
 
 If an issue is identified with the released source code on the supported
 kernel with a supported adapter, email the specific information related
-to the issue to e1000-devel@lists.sf.net
+to the issue to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/intel/e1000e.rst b/Documentation/networking/device_drivers/ethernet/intel/e1000e.rst
index f49cd370e7bf..d8f810afdd49 100644
--- a/Documentation/networking/device_drivers/intel/e1000e.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/e1000e.rst
@@ -371,13 +371,8 @@ NOTE: Wake on LAN is only supported on port A for the following devices:
 Support
 =======
 For general information, go to the Intel support website at:
-
 https://www.intel.com/support/
 
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
 If an issue is identified with the released source code on a supported kernel
 with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/intel/fm10k.rst b/Documentation/networking/device_drivers/ethernet/intel/fm10k.rst
index 4d279e64e221..396a2c8c3db1 100644
--- a/Documentation/networking/device_drivers/intel/fm10k.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/fm10k.rst
@@ -22,7 +22,7 @@ Ethernet Multi-host Controller.
 
 For information on how to identify your adapter, and for the latest Intel
 network drivers, refer to the Intel Support website:
-http://www.intel.com/support
+https://www.intel.com/support
 
 
 Flow Control
@@ -130,13 +130,8 @@ the Intel Ethernet Controller XL710.
 Support
 =======
 For general information, go to the Intel support website at:
-
 https://www.intel.com/support/
 
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
 If an issue is identified with the released source code on a supported kernel
 with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/intel/i40e.rst b/Documentation/networking/device_drivers/ethernet/intel/i40e.rst
index 8a9b18573688..53d9d5829d69 100644
--- a/Documentation/networking/device_drivers/intel/i40e.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/i40e.rst
@@ -173,7 +173,7 @@ Director rule is added from ethtool (Sideband filter), ATR is turned off by the
 driver. To re-enable ATR, the sideband can be disabled with the ethtool -K
 option. For example::
 
-  ethtool –K [adapter] ntuple [off|on]
+  ethtool -K [adapter] ntuple [off|on]
 
 If sideband is re-enabled after ATR is re-enabled, ATR remains enabled until a
 TCP-IP flow is added. When all TCP-IP sideband rules are deleted, ATR is
@@ -299,6 +299,18 @@ Use ethtool to view and set link-down-on-close, as follows::
   ethtool --show-priv-flags ethX
   ethtool --set-priv-flags ethX link-down-on-close [on|off]
 
+Setting the mdd-auto-reset-vf Private Flag
+------------------------------------------
+
+When the mdd-auto-reset-vf private flag is set to "on", the problematic VF will
+be automatically reset if a malformed descriptor is detected. If the flag is
+set to "off", the problematic VF will be disabled.
+
+Use ethtool to view and set mdd-auto-reset-vf, as follows::
+
+  ethtool --show-priv-flags ethX
+  ethtool --set-priv-flags ethX mdd-auto-reset-vf [on|off]
+
 Viewing Link Messages
 ---------------------
 Link messages will not be displayed to the console if the distribution is
@@ -399,8 +411,8 @@ operate only in full duplex and only at their native speed.
 NAPI
 ----
 NAPI (Rx polling mode) is supported in the i40e driver.
-For more information on NAPI, see
-https://wiki.linuxfoundation.org/networking/napi
+
+See :ref:`Documentation/networking/napi.rst <napi>` for more information.
 
 Flow Control
 ------------
@@ -466,7 +478,7 @@ network. PTP support varies among Intel devices that support this driver. Use
 "ethtool -T <netdev name>" to get a definitive list of PTP capabilities
 supported by the device.
 
-IEEE 802.1ad (QinQ) Support
+IEEE 802.1ad (QinQ) Support
 ---------------------------
 The IEEE 802.1ad standard, informally known as QinQ, allows for multiple VLAN
 IDs within a single Ethernet frame. VLAN IDs are sometimes referred to as
@@ -523,8 +535,8 @@ of a port's bandwidth (should it be available). The sum of all the values for
 Maximum Bandwidth is not restricted, because no more than 100% of a port's
 bandwidth can ever be used.
 
-NOTE: X710/XXV710 devices fail to enable Max VFs (64) when Multiple Functions
-per Port (MFP) and SR-IOV are enabled. An error from i40e is logged that says
+NOTE: X710/XXV710 devices fail to enable Max VFs (64) when Multiple Functions
+per Port (MFP) and SR-IOV are enabled. An error from i40e is logged that says
 "add vsi failed for VF N, aq_err 16". To workaround the issue, enable less than
 64 virtual functions (VFs).
 
@@ -688,7 +700,7 @@ shaper bw_rlimit: for each tc, sets minimum and maximum bandwidth rates.
 Totals must be equal or less than port speed.
 
 For example: min_rate 1Gbit 3Gbit: Verify bandwidth limit using network
-monitoring tools such as ifstat or sar –n DEV [interval] [number of samples]
+monitoring tools such as `ifstat` or `sar -n DEV [interval] [number of samples]`
 
 2. Enable HW TC offload on interface::
 
@@ -759,13 +771,8 @@ enabled when setting up DCB on your switch.
 Support
 =======
 For general information, go to the Intel support website at:
-
 https://www.intel.com/support/
 
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
 If an issue is identified with the released source code on a supported kernel
 with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/intel/iavf.rst b/Documentation/networking/device_drivers/ethernet/intel/iavf.rst
index 84ac7e75f363..eb926c3bd4cd 100644
--- a/Documentation/networking/device_drivers/intel/iavf.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/iavf.rst
@@ -43,7 +43,7 @@ device.
 
 For information on how to identify your adapter, and for the latest NVM/FW
 images and Intel network drivers, refer to the Intel Support website:
-http://www.intel.com/support
+https://www.intel.com/support
 
 
 Additional Features and Configurations
@@ -113,7 +113,7 @@ which the AVF is associated. The following are base mode features:
 - AVF device ID
 - HW mailbox is used for VF to PF communications (including on Windows)
 
-IEEE 802.1ad (QinQ) Support
+IEEE 802.1ad (QinQ) Support
 ---------------------------
 The IEEE 802.1ad standard, informally known as QinQ, allows for multiple VLAN
 IDs within a single Ethernet frame. VLAN IDs are sometimes referred to as
@@ -179,7 +179,7 @@ shaper bw_rlimit: for each tc, sets minimum and maximum bandwidth rates.
 Totals must be equal or less than port speed.
 
 For example: min_rate 1Gbit 3Gbit: Verify bandwidth limit using network
-monitoring tools such as ifstat or sar –n DEV [interval] [number of samples]
+monitoring tools such as ``ifstat`` or ``sar -n DEV [interval] [number of samples]``
 
 NOTE:
   Setting up channels via ethtool (ethtool -L) is not supported when the
@@ -319,13 +319,8 @@ This is caused by the way the Linux kernel reports this stressed condition.
 Support
 =======
 For general information, go to the Intel support website at:
-
 https://support.intel.com
 
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
 If an issue is identified with the released source code on the supported kernel
 with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/ice.rst b/Documentation/networking/device_drivers/ethernet/intel/ice.rst
new file mode 100644
index 000000000000..3c46a48d99ba
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/intel/ice.rst
@@ -0,0 +1,1206 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=================================================================
+Linux Base Driver for the Intel(R) Ethernet Controller 800 Series
+=================================================================
+
+Intel ice Linux driver.
+Copyright(c) 2018-2021 Intel Corporation.
+
+Contents
+========
+
+- Overview
+- Identifying Your Adapter
+- Important Notes
+- Additional Features & Configurations
+- Performance Optimization
+
+
+The associated Virtual Function (VF) driver for this driver is iavf.
+
+Driver information can be obtained using ethtool and lspci.
+
+For questions related to hardware requirements, refer to the documentation
+supplied with your Intel adapter. All hardware requirements listed apply to use
+with Linux.
+
+This driver supports XDP (Express Data Path) and AF_XDP zero-copy. Note that
+XDP is blocked for frame sizes larger than 3KB.
+
+
+Identifying Your Adapter
+========================
+For information on how to identify your adapter, and for the latest Intel
+network drivers, refer to the Intel Support website:
+https://www.intel.com/support
+
+
+Important Notes
+===============
+
+Packet drops may occur under receive stress
+-------------------------------------------
+Devices based on the Intel(R) Ethernet Controller 800 Series are designed to
+tolerate a limited amount of system latency during PCIe and DMA transactions.
+If these transactions take longer than the tolerated latency, it can impact the
+length of time the packets are buffered in the device and associated memory,
+which may result in dropped packets. These packets drops typically do not have
+a noticeable impact on throughput and performance under standard workloads.
+
+If these packet drops appear to affect your workload, the following may improve
+the situation:
+
+1) Make sure that your system's physical memory is in a high-performance
+   configuration, as recommended by the platform vendor. A common
+   recommendation is for all channels to be populated with a single DIMM
+   module.
+2) In your system's BIOS/UEFI settings, select the "Performance" profile.
+3) Your distribution may provide tools like "tuned," which can help tweak
+   kernel settings to achieve better standard settings for different workloads.
+
+
+Configuring SR-IOV for improved network security
+------------------------------------------------
+In a virtualized environment, on Intel(R) Ethernet Network Adapters that
+support SR-IOV, the virtual function (VF) may be subject to malicious behavior.
+Software-generated layer two frames, like IEEE 802.3x (link flow control), IEEE
+802.1Qbb (priority based flow-control), and others of this type, are not
+expected and can throttle traffic between the host and the virtual switch,
+reducing performance. To resolve this issue, and to ensure isolation from
+unintended traffic streams, configure all SR-IOV enabled ports for VLAN tagging
+from the administrative interface on the PF. This configuration allows
+unexpected, and potentially malicious, frames to be dropped.
+
+See "Configuring VLAN Tagging on SR-IOV Enabled Adapter Ports" later in this
+README for configuration instructions.
+
+
+Do not unload port driver if VF with active VM is bound to it
+-------------------------------------------------------------
+Do not unload a port's driver if a Virtual Function (VF) with an active Virtual
+Machine (VM) is bound to it. Doing so will cause the port to appear to hang.
+Once the VM shuts down, or otherwise releases the VF, the command will
+complete.
+
+
+Additional Features and Configurations
+======================================
+
+ethtool
+-------
+The driver utilizes the ethtool interface for driver configuration and
+diagnostics, as well as displaying statistical information. The latest ethtool
+version is required for this functionality. Download it at:
+https://kernel.org/pub/software/network/ethtool/
+
+NOTE: The rx_bytes value of ethtool does not match the rx_bytes value of
+Netdev, due to the 4-byte CRC being stripped by the device. The difference
+between the two rx_bytes values will be 4 x the number of Rx packets. For
+example, if Rx packets are 10 and Netdev (software statistics) displays
+rx_bytes as "X", then ethtool (hardware statistics) will display rx_bytes as
+"X+40" (4 bytes CRC x 10 packets).
+
+ethtool reset
+-------------
+The driver supports 3 types of resets:
+
+- PF reset - resets only components associated with the given PF, does not
+  impact other PFs
+
+- CORE reset - whole adapter is affected, reset all PFs
+
+- GLOBAL reset - same as CORE but mac and phy components are also reinitialized
+
+These are mapped to ethtool reset flags as follow:
+
+- PF reset:
+
+  # ethtool --reset <ethX> irq dma filter offload
+
+- CORE reset:
+
+  # ethtool --reset <ethX> irq-shared dma-shared filter-shared offload-shared \
+  ram-shared
+
+- GLOBAL reset:
+
+  # ethtool --reset <ethX> irq-shared dma-shared filter-shared offload-shared \
+  mac-shared phy-shared ram-shared
+
+In switchdev mode you can reset a VF using port representor:
+
+  # ethtool --reset <repr> irq dma filter offload
+
+
+Viewing Link Messages
+---------------------
+Link messages will not be displayed to the console if the distribution is
+restricting system messages. In order to see network driver link messages on
+your console, set dmesg to eight by entering the following::
+
+  # dmesg -n 8
+
+NOTE: This setting is not saved across reboots.
+
+
+Dynamic Device Personalization
+------------------------------
+Dynamic Device Personalization (DDP) allows you to change the packet processing
+pipeline of a device by applying a profile package to the device at runtime.
+Profiles can be used to, for example, add support for new protocols, change
+existing protocols, or change default settings. DDP profiles can also be rolled
+back without rebooting the system.
+
+The DDP package loads during device initialization. The driver looks for
+``intel/ice/ddp/ice.pkg`` in your firmware root (typically ``/lib/firmware/``
+or ``/lib/firmware/updates/``) and checks that it contains a valid DDP package
+file.
+
+NOTE: Your distribution should likely have provided the latest DDP file, but if
+ice.pkg is missing, you can find it in the linux-firmware repository or from
+intel.com.
+
+If the driver is unable to load the DDP package, the device will enter Safe
+Mode. Safe Mode disables advanced and performance features and supports only
+basic traffic and minimal functionality, such as updating the NVM or
+downloading a new driver or DDP package. Safe Mode only applies to the affected
+physical function and does not impact any other PFs. See the "Intel(R) Ethernet
+Adapters and Devices User Guide" for more details on DDP and Safe Mode.
+
+NOTES:
+
+- If you encounter issues with the DDP package file, you may need to download
+  an updated driver or DDP package file. See the log messages for more
+  information.
+
+- The ice.pkg file is a symbolic link to the default DDP package file.
+
+- You cannot update the DDP package if any PF drivers are already loaded. To
+  overwrite a package, unload all PFs and then reload the driver with the new
+  package.
+
+- Only the first loaded PF per device can download a package for that device.
+
+You can install specific DDP package files for different physical devices in
+the same system. To install a specific DDP package file:
+
+1. Download the DDP package file you want for your device.
+
+2. Rename the file ice-xxxxxxxxxxxxxxxx.pkg, where 'xxxxxxxxxxxxxxxx' is the
+   unique 64-bit PCI Express device serial number (in hex) of the device you
+   want the package downloaded on. The filename must include the complete
+   serial number (including leading zeros) and be all lowercase. For example,
+   if the 64-bit serial number is b887a3ffffca0568, then the file name would be
+   ice-b887a3ffffca0568.pkg.
+
+   To find the serial number from the PCI bus address, you can use the
+   following command::
+
+     # lspci -vv -s af:00.0 | grep -i Serial
+     Capabilities: [150 v1] Device Serial Number b8-87-a3-ff-ff-ca-05-68
+
+   You can use the following command to format the serial number without the
+   dashes::
+
+     # lspci -vv -s af:00.0 | grep -i Serial | awk '{print $7}' | sed s/-//g
+     b887a3ffffca0568
+
+3. Copy the renamed DDP package file to
+   ``/lib/firmware/updates/intel/ice/ddp/``. If the directory does not yet
+   exist, create it before copying the file.
+
+4. Unload all of the PFs on the device.
+
+5. Reload the driver with the new package.
+
+NOTE: The presence of a device-specific DDP package file overrides the loading
+of the default DDP package file (ice.pkg).
+
+
+Intel(R) Ethernet Flow Director
+-------------------------------
+The Intel Ethernet Flow Director performs the following tasks:
+
+- Directs receive packets according to their flows to different queues
+- Enables tight control on routing a flow in the platform
+- Matches flows and CPU cores for flow affinity
+
+NOTE: This driver supports the following flow types:
+
+- IPv4
+- TCPv4
+- UDPv4
+- SCTPv4
+- IPv6
+- TCPv6
+- UDPv6
+- SCTPv6
+
+Each flow type supports valid combinations of IP addresses (source or
+destination) and UDP/TCP/SCTP ports (source and destination). You can supply
+only a source IP address, a source IP address and a destination port, or any
+combination of one or more of these four parameters.
+
+NOTE: This driver allows you to filter traffic based on a user-defined flexible
+two-byte pattern and offset by using the ethtool user-def and mask fields. Only
+L3 and L4 flow types are supported for user-defined flexible filters. For a
+given flow type, you must clear all Intel Ethernet Flow Director filters before
+changing the input set (for that flow type).
+
+
+Flow Director Filters
+---------------------
+Flow Director filters are used to direct traffic that matches specified
+characteristics. They are enabled through ethtool's ntuple interface. To enable
+or disable the Intel Ethernet Flow Director and these filters::
+
+  # ethtool -K <ethX> ntuple <off|on>
+
+NOTE: When you disable ntuple filters, all the user programmed filters are
+flushed from the driver cache and hardware. All needed filters must be re-added
+when ntuple is re-enabled.
+
+To display all of the active filters::
+
+  # ethtool -u <ethX>
+
+To add a new filter::
+
+  # ethtool -U <ethX> flow-type <type> src-ip <ip> [m <ip_mask>] dst-ip <ip>
+  [m <ip_mask>] src-port <port> [m <port_mask>] dst-port <port> [m <port_mask>]
+  action <queue>
+
+  Where:
+    <ethX> - the Ethernet device to program
+    <type> - can be ip4, tcp4, udp4, sctp4, ip6, tcp6, udp6, sctp6
+    <ip> - the IP address to match on
+    <ip_mask> - the IPv4 address to mask on
+              NOTE: These filters use inverted masks.
+    <port> - the port number to match on
+    <port_mask> - the 16-bit integer for masking
+              NOTE: These filters use inverted masks.
+    <queue> - the queue to direct traffic toward (-1 discards the
+              matched traffic)
+
+To delete a filter::
+
+  # ethtool -U <ethX> delete <N>
+
+  Where <N> is the filter ID displayed when printing all the active filters,
+  and may also have been specified using "loc <N>" when adding the filter.
+
+EXAMPLES:
+
+To add a filter that directs packet to queue 2::
+
+  # ethtool -U <ethX> flow-type tcp4 src-ip 192.168.10.1 dst-ip \
+  192.168.10.2 src-port 2000 dst-port 2001 action 2 [loc 1]
+
+To set a filter using only the source and destination IP address::
+
+  # ethtool -U <ethX> flow-type tcp4 src-ip 192.168.10.1 dst-ip \
+  192.168.10.2 action 2 [loc 1]
+
+To set a filter based on a user-defined pattern and offset::
+
+  # ethtool -U <ethX> flow-type tcp4 src-ip 192.168.10.1 dst-ip \
+  192.168.10.2 user-def 0x4FFFF action 2 [loc 1]
+
+  where the value of the user-def field contains the offset (4 bytes) and
+  the pattern (0xffff).
+
+To match TCP traffic sent from 192.168.0.1, port 5300, directed to 192.168.0.5,
+port 80, and then send it to queue 7::
+
+  # ethtool -U enp130s0 flow-type tcp4 src-ip 192.168.0.1 dst-ip 192.168.0.5
+  src-port 5300 dst-port 80 action 7
+
+To add a TCPv4 filter with a partial mask for a source IP subnet::
+
+  # ethtool -U <ethX> flow-type tcp4 src-ip 192.168.0.0 m 0.255.255.255 dst-ip
+  192.168.5.12 src-port 12600 dst-port 31 action 12
+
+NOTES:
+
+For each flow-type, the programmed filters must all have the same matching
+input set. For example, issuing the following two commands is acceptable::
+
+  # ethtool -U enp130s0 flow-type ip4 src-ip 192.168.0.1 src-port 5300 action 7
+  # ethtool -U enp130s0 flow-type ip4 src-ip 192.168.0.5 src-port 55 action 10
+
+Issuing the next two commands, however, is not acceptable, since the first
+specifies src-ip and the second specifies dst-ip::
+
+  # ethtool -U enp130s0 flow-type ip4 src-ip 192.168.0.1 src-port 5300 action 7
+  # ethtool -U enp130s0 flow-type ip4 dst-ip 192.168.0.5 src-port 55 action 10
+
+The second command will fail with an error. You may program multiple filters
+with the same fields, using different values, but, on one device, you may not
+program two tcp4 filters with different matching fields.
+
+The ice driver does not support matching on a subportion of a field, thus
+partial mask fields are not supported.
+
+
+Flex Byte Flow Director Filters
+-------------------------------
+The driver also supports matching user-defined data within the packet payload.
+This flexible data is specified using the "user-def" field of the ethtool
+command in the following way:
+
+.. table::
+
+    ============================== ============================
+    ``31    28    24    20    16`` ``15    12    8    4    0``
+    ``offset into packet payload`` ``2 bytes of flexible data``
+    ============================== ============================
+
+For example,
+
+::
+
+  ... user-def 0x4FFFF ...
+
+tells the filter to look 4 bytes into the payload and match that value against
+0xFFFF. The offset is based on the beginning of the payload, and not the
+beginning of the packet. Thus
+
+::
+
+  flow-type tcp4 ... user-def 0x8BEAF ...
+
+would match TCP/IPv4 packets which have the value 0xBEAF 8 bytes into the
+TCP/IPv4 payload.
+
+Note that ICMP headers are parsed as 4 bytes of header and 4 bytes of payload.
+Thus to match the first byte of the payload, you must actually add 4 bytes to
+the offset. Also note that ip4 filters match both ICMP frames as well as raw
+(unknown) ip4 frames, where the payload will be the L3 payload of the IP4
+frame.
+
+The maximum offset is 64. The hardware will only read up to 64 bytes of data
+from the payload. The offset must be even because the flexible data is 2 bytes
+long and must be aligned to byte 0 of the packet payload.
+
+The user-defined flexible offset is also considered part of the input set and
+cannot be programmed separately for multiple filters of the same type. However,
+the flexible data is not part of the input set and multiple filters may use the
+same offset but match against different data.
+
+
+RSS Hash Flow
+-------------
+Allows you to set the hash bytes per flow type and any combination of one or
+more options for Receive Side Scaling (RSS) hash byte configuration.
+
+::
+
+  # ethtool -N <ethX> rx-flow-hash <type> <option>
+
+  Where <type> is:
+    tcp4    signifying TCP over IPv4
+    udp4    signifying UDP over IPv4
+    gtpc4   signifying GTP-C over IPv4
+    gtpc4t  signifying GTP-C (include TEID) over IPv4
+    gtpu4   signifying GTP-U over IPV4
+    gtpu4e  signifying GTP-U and Extension Header over IPV4
+    gtpu4u  signifying GTP-U PSC Uplink over IPV4
+    gtpu4d  signifying GTP-U PSC Downlink over IPV4
+    tcp6    signifying TCP over IPv6
+    udp6    signifying UDP over IPv6
+    gtpc6   signifying GTP-C over IPv6
+    gtpc6t  signifying GTP-C (include TEID) over IPv6
+    gtpu6   signifying GTP-U over IPV6
+    gtpu6e  signifying GTP-U and Extension Header over IPV6
+    gtpu6u  signifying GTP-U PSC Uplink over IPV6
+    gtpu6d  signifying GTP-U PSC Downlink over IPV6
+  And <option> is one or more of:
+    s     Hash on the IP source address of the Rx packet.
+    d     Hash on the IP destination address of the Rx packet.
+    f     Hash on bytes 0 and 1 of the Layer 4 header of the Rx packet.
+    n     Hash on bytes 2 and 3 of the Layer 4 header of the Rx packet.
+    e     Hash on GTP Packet on TEID (4bytes) of the Rx packet.
+
+
+Accelerated Receive Flow Steering (aRFS)
+----------------------------------------
+Devices based on the Intel(R) Ethernet Controller 800 Series support
+Accelerated Receive Flow Steering (aRFS) on the PF. aRFS is a load-balancing
+mechanism that allows you to direct packets to the same CPU where an
+application is running or consuming the packets in that flow.
+
+NOTES:
+
+- aRFS requires that ntuple filtering is enabled via ethtool.
+- aRFS support is limited to the following packet types:
+
+    - TCP over IPv4 and IPv6
+    - UDP over IPv4 and IPv6
+    - Nonfragmented packets
+
+- aRFS only supports Flow Director filters, which consist of the
+  source/destination IP addresses and source/destination ports.
+- aRFS and ethtool's ntuple interface both use the device's Flow Director. aRFS
+  and ntuple features can coexist, but you may encounter unexpected results if
+  there's a conflict between aRFS and ntuple requests. See "Intel(R) Ethernet
+  Flow Director" for additional information.
+
+To set up aRFS:
+
+1. Enable the Intel Ethernet Flow Director and ntuple filters using ethtool.
+
+::
+
+   # ethtool -K <ethX> ntuple on
+
+2. Set up the number of entries in the global flow table. For example:
+
+::
+
+   # NUM_RPS_ENTRIES=16384
+   # echo $NUM_RPS_ENTRIES > /proc/sys/net/core/rps_sock_flow_entries
+
+3. Set up the number of entries in the per-queue flow table. For example:
+
+::
+
+   # NUM_RX_QUEUES=64
+   # for file in /sys/class/net/$IFACE/queues/rx-*/rps_flow_cnt; do
+   # echo $(($NUM_RPS_ENTRIES/$NUM_RX_QUEUES)) > $file;
+   # done
+
+4. Disable the IRQ balance daemon (this is only a temporary stop of the service
+   until the next reboot).
+
+::
+
+   # systemctl stop irqbalance
+
+5. Configure the interrupt affinity.
+
+   See ``/Documentation/core-api/irq/irq-affinity.rst``
+
+
+To disable aRFS using ethtool::
+
+  # ethtool -K <ethX> ntuple off
+
+NOTE: This command will disable ntuple filters and clear any aRFS filters in
+software and hardware.
+
+Example Use Case:
+
+1. Set the server application on the desired CPU (e.g., CPU 4).
+
+::
+
+   # taskset -c 4 netserver
+
+2. Use netperf to route traffic from the client to CPU 4 on the server with
+   aRFS configured. This example uses TCP over IPv4.
+
+::
+
+   # netperf -H <Host IPv4 Address> -t TCP_STREAM
+
+
+Enabling Virtual Functions (VFs)
+--------------------------------
+Use sysfs to enable virtual functions (VF).
+
+For example, you can create 4 VFs as follows::
+
+  # echo 4 > /sys/class/net/<ethX>/device/sriov_numvfs
+
+To disable VFs, write 0 to the same file::
+
+  # echo 0 > /sys/class/net/<ethX>/device/sriov_numvfs
+
+The maximum number of VFs for the ice driver is 256 total (all ports). To check
+how many VFs each PF supports, use the following command::
+
+  # cat /sys/class/net/<ethX>/device/sriov_totalvfs
+
+Note: You cannot use SR-IOV when link aggregation (LAG)/bonding is active, and
+vice versa. To enforce this, the driver checks for this mutual exclusion.
+
+
+Displaying VF Statistics on the PF
+----------------------------------
+Use the following command to display the statistics for the PF and its VFs::
+
+  # ip -s link show dev <ethX>
+
+NOTE: The output of this command can be very large due to the maximum number of
+possible VFs.
+
+The PF driver will display a subset of the statistics for the PF and for all
+VFs that are configured. The PF will always print a statistics block for each
+of the possible VFs, and it will show zero for all unconfigured VFs.
+
+
+Configuring VLAN Tagging on SR-IOV Enabled Adapter Ports
+--------------------------------------------------------
+To configure VLAN tagging for the ports on an SR-IOV enabled adapter, use the
+following command. The VLAN configuration should be done before the VF driver
+is loaded or the VM is booted. The VF is not aware of the VLAN tag being
+inserted on transmit and removed on received frames (sometimes called "port
+VLAN" mode).
+
+::
+
+  # ip link set dev <ethX> vf <id> vlan <vlan id>
+
+For example, the following will configure PF eth0 and the first VF on VLAN 10::
+
+  # ip link set dev eth0 vf 0 vlan 10
+
+
+Enabling a VF link if the port is disconnected
+----------------------------------------------
+If the physical function (PF) link is down, you can force link up (from the
+host PF) on any virtual functions (VF) bound to the PF.
+
+For example, to force link up on VF 0 bound to PF eth0::
+
+  # ip link set eth0 vf 0 state enable
+
+Note: If the command does not work, it may not be supported by your system.
+
+
+Setting the MAC Address for a VF
+--------------------------------
+To change the MAC address for the specified VF::
+
+  # ip link set <ethX> vf 0 mac <address>
+
+For example::
+
+  # ip link set <ethX> vf 0 mac 00:01:02:03:04:05
+
+This setting lasts until the PF is reloaded.
+
+NOTE: Assigning a MAC address for a VF from the host will disable any
+subsequent requests to change the MAC address from within the VM. This is a
+security feature. The VM is not aware of this restriction, so if this is
+attempted in the VM, it will trigger MDD events.
+
+
+Trusted VFs and VF Promiscuous Mode
+-----------------------------------
+This feature allows you to designate a particular VF as trusted and allows that
+trusted VF to request selective promiscuous mode on the Physical Function (PF).
+
+To set a VF as trusted or untrusted, enter the following command in the
+Hypervisor::
+
+  # ip link set dev <ethX> vf 1 trust [on|off]
+
+NOTE: It's important to set the VF to trusted before setting promiscuous mode.
+If the VM is not trusted, the PF will ignore promiscuous mode requests from the
+VF. If the VM becomes trusted after the VF driver is loaded, you must make a
+new request to set the VF to promiscuous.
+
+Once the VF is designated as trusted, use the following commands in the VM to
+set the VF to promiscuous mode.
+
+For promiscuous all::
+
+  # ip link set <ethX> promisc on
+  Where <ethX> is a VF interface in the VM
+
+For promiscuous Multicast::
+
+  # ip link set <ethX> allmulticast on
+  Where <ethX> is a VF interface in the VM
+
+NOTE: By default, the ethtool private flag vf-true-promisc-support is set to
+"off," meaning that promiscuous mode for the VF will be limited. To set the
+promiscuous mode for the VF to true promiscuous and allow the VF to see all
+ingress traffic, use the following command::
+
+  # ethtool --set-priv-flags <ethX> vf-true-promisc-support on
+
+The vf-true-promisc-support private flag does not enable promiscuous mode;
+rather, it designates which type of promiscuous mode (limited or true) you will
+get when you enable promiscuous mode using the ip link commands above. Note
+that this is a global setting that affects the entire device. However, the
+vf-true-promisc-support private flag is only exposed to the first PF of the
+device. The PF remains in limited promiscuous mode regardless of the
+vf-true-promisc-support setting.
+
+Next, add a VLAN interface on the VF interface. For example::
+
+  # ip link add link eth2 name eth2.100 type vlan id 100
+
+Note that the order in which you set the VF to promiscuous mode and add the
+VLAN interface does not matter (you can do either first). The result in this
+example is that the VF will get all traffic that is tagged with VLAN 100.
+
+
+Malicious Driver Detection (MDD) for VFs
+----------------------------------------
+Some Intel Ethernet devices use Malicious Driver Detection (MDD) to detect
+malicious traffic from the VF and disable Tx/Rx queues or drop the offending
+packet until a VF driver reset occurs. You can view MDD messages in the PF's
+system log using the dmesg command.
+
+- If the PF driver logs MDD events from the VF, confirm that the correct VF
+  driver is installed.
+- To restore functionality, you can manually reload the VF or VM or enable
+  automatic VF resets.
+- When automatic VF resets are enabled, the PF driver will immediately reset
+  the VF and reenable queues when it detects MDD events on the receive path.
+- If automatic VF resets are disabled, the PF will not automatically reset the
+  VF when it detects MDD events.
+
+To enable or disable automatic VF resets, use the following command::
+
+  # ethtool --set-priv-flags <ethX> mdd-auto-reset-vf on|off
+
+
+MAC and VLAN Anti-Spoofing Feature for VFs
+------------------------------------------
+When a malicious driver on a Virtual Function (VF) interface attempts to send a
+spoofed packet, it is dropped by the hardware and not transmitted.
+
+NOTE: This feature can be disabled for a specific VF::
+
+  # ip link set <ethX> vf <vf id> spoofchk {off|on}
+
+
+Jumbo Frames
+------------
+Jumbo Frames support is enabled by changing the Maximum Transmission Unit (MTU)
+to a value larger than the default value of 1500.
+
+Use the ifconfig command to increase the MTU size. For example, enter the
+following where <ethX> is the interface number::
+
+  # ifconfig <ethX> mtu 9000 up
+
+Alternatively, you can use the ip command as follows::
+
+  # ip link set mtu 9000 dev <ethX>
+  # ip link set up dev <ethX>
+
+This setting is not saved across reboots.
+
+
+NOTE: The maximum MTU setting for jumbo frames is 9702. This corresponds to the
+maximum jumbo frame size of 9728 bytes.
+
+NOTE: This driver will attempt to use multiple page sized buffers to receive
+each jumbo packet. This should help to avoid buffer starvation issues when
+allocating receive packets.
+
+NOTE: Packet loss may have a greater impact on throughput when you use jumbo
+frames. If you observe a drop in performance after enabling jumbo frames,
+enabling flow control may mitigate the issue.
+
+
+Speed and Duplex Configuration
+------------------------------
+In addressing speed and duplex configuration issues, you need to distinguish
+between copper-based adapters and fiber-based adapters.
+
+In the default mode, an Intel(R) Ethernet Network Adapter using copper
+connections will attempt to auto-negotiate with its link partner to determine
+the best setting. If the adapter cannot establish link with the link partner
+using auto-negotiation, you may need to manually configure the adapter and link
+partner to identical settings to establish link and pass packets. This should
+only be needed when attempting to link with an older switch that does not
+support auto-negotiation or one that has been forced to a specific speed or
+duplex mode. Your link partner must match the setting you choose. 1 Gbps speeds
+and higher cannot be forced. Use the autonegotiation advertising setting to
+manually set devices for 1 Gbps and higher.
+
+Speed, duplex, and autonegotiation advertising are configured through the
+ethtool utility. For the latest version, download and install ethtool from the
+following website:
+
+   https://kernel.org/pub/software/network/ethtool/
+
+To see the speed configurations your device supports, run the following::
+
+  # ethtool <ethX>
+
+Caution: Only experienced network administrators should force speed and duplex
+or change autonegotiation advertising manually. The settings at the switch must
+always match the adapter settings. Adapter performance may suffer or your
+adapter may not operate if you configure the adapter differently from your
+switch.
+
+
+Data Center Bridging (DCB)
+--------------------------
+NOTE: The kernel assumes that TC0 is available, and will disable Priority Flow
+Control (PFC) on the device if TC0 is not available. To fix this, ensure TC0 is
+enabled when setting up DCB on your switch.
+
+DCB is a configuration Quality of Service implementation in hardware. It uses
+the VLAN priority tag (802.1p) to filter traffic. That means that there are 8
+different priorities that traffic can be filtered into. It also enables
+priority flow control (802.1Qbb) which can limit or eliminate the number of
+dropped packets during network stress. Bandwidth can be allocated to each of
+these priorities, which is enforced at the hardware level (802.1Qaz).
+
+DCB is normally configured on the network using the DCBX protocol (802.1Qaz), a
+specialization of LLDP (802.1AB). The ice driver supports the following
+mutually exclusive variants of DCBX support:
+
+1) Firmware-based LLDP Agent
+2) Software-based LLDP Agent
+
+In firmware-based mode, firmware intercepts all LLDP traffic and handles DCBX
+negotiation transparently for the user. In this mode, the adapter operates in
+"willing" DCBX mode, receiving DCB settings from the link partner (typically a
+switch). The local user can only query the negotiated DCB configuration. For
+information on configuring DCBX parameters on a switch, please consult the
+switch manufacturer's documentation.
+
+In software-based mode, LLDP traffic is forwarded to the network stack and user
+space, where a software agent can handle it. In this mode, the adapter can
+operate in either "willing" or "nonwilling" DCBX mode and DCB configuration can
+be both queried and set locally. This mode requires the FW-based LLDP Agent to
+be disabled.
+
+NOTE:
+
+- You can enable and disable the firmware-based LLDP Agent using an ethtool
+  private flag. Refer to the "FW-LLDP (Firmware Link Layer Discovery Protocol)"
+  section in this README for more information.
+- In software-based DCBX mode, you can configure DCB parameters using software
+  LLDP/DCBX agents that interface with the Linux kernel's DCB Netlink API. We
+  recommend using OpenLLDP as the DCBX agent when running in software mode. For
+  more information, see the OpenLLDP man pages and
+  https://github.com/intel/openlldp.
+- The driver implements the DCB netlink interface layer to allow the user space
+  to communicate with the driver and query DCB configuration for the port.
+- iSCSI with DCB is not supported.
+
+
+FW-LLDP (Firmware Link Layer Discovery Protocol)
+------------------------------------------------
+Use ethtool to change FW-LLDP settings. The FW-LLDP setting is per port and
+persists across boots.
+
+To enable LLDP::
+
+  # ethtool --set-priv-flags <ethX> fw-lldp-agent on
+
+To disable LLDP::
+
+  # ethtool --set-priv-flags <ethX> fw-lldp-agent off
+
+To check the current LLDP setting::
+
+  # ethtool --show-priv-flags <ethX>
+
+NOTE: You must enable the UEFI HII "LLDP Agent" attribute for this setting to
+take effect. If "LLDP AGENT" is set to disabled, you cannot enable it from the
+OS.
+
+
+Flow Control
+------------
+Ethernet Flow Control (IEEE 802.3x) can be configured with ethtool to enable
+receiving and transmitting pause frames for ice. When transmit is enabled,
+pause frames are generated when the receive packet buffer crosses a predefined
+threshold. When receive is enabled, the transmit unit will halt for the time
+delay specified when a pause frame is received.
+
+NOTE: You must have a flow control capable link partner.
+
+Flow Control is disabled by default.
+
+Use ethtool to change the flow control settings.
+
+To enable or disable Rx or Tx Flow Control::
+
+  # ethtool -A <ethX> rx <on|off> tx <on|off>
+
+Note: This command only enables or disables Flow Control if auto-negotiation is
+disabled. If auto-negotiation is enabled, this command changes the parameters
+used for auto-negotiation with the link partner.
+
+Note: Flow Control auto-negotiation is part of link auto-negotiation. Depending
+on your device, you may not be able to change the auto-negotiation setting.
+
+NOTE:
+
+- The ice driver requires flow control on both the port and link partner. If
+  flow control is disabled on one of the sides, the port may appear to hang on
+  heavy traffic.
+- You may encounter issues with link-level flow control (LFC) after disabling
+  DCB. The LFC status may show as enabled but traffic is not paused. To resolve
+  this issue, disable and reenable LFC using ethtool::
+
+   # ethtool -A <ethX> rx off tx off
+   # ethtool -A <ethX> rx on tx on
+
+
+NAPI
+----
+
+This driver supports NAPI (Rx polling mode).
+
+See :ref:`Documentation/networking/napi.rst <napi>` for more information.
+
+MACVLAN
+-------
+This driver supports MACVLAN. Kernel support for MACVLAN can be tested by
+checking if the MACVLAN driver is loaded. You can run 'lsmod | grep macvlan' to
+see if the MACVLAN driver is loaded or run 'modprobe macvlan' to try to load
+the MACVLAN driver.
+
+NOTE:
+
+- In passthru mode, you can only set up one MACVLAN device. It will inherit the
+  MAC address of the underlying PF (Physical Function) device.
+
+
+IEEE 802.1ad (QinQ) Support
+---------------------------
+The IEEE 802.1ad standard, informally known as QinQ, allows for multiple VLAN
+IDs within a single Ethernet frame. VLAN IDs are sometimes referred to as
+"tags," and multiple VLAN IDs are thus referred to as a "tag stack." Tag stacks
+allow L2 tunneling and the ability to segregate traffic within a particular
+VLAN ID, among other uses.
+
+NOTES:
+
+- Receive checksum offloads and VLAN acceleration are not supported for 802.1ad
+  (QinQ) packets.
+
+- 0x88A8 traffic will not be received unless VLAN stripping is disabled with
+  the following command::
+
+    # ethtool -K <ethX> rxvlan off
+
+- 0x88A8/0x8100 double VLANs cannot be used with 0x8100 or 0x8100/0x8100 VLANS
+  configured on the same port. 0x88a8/0x8100 traffic will not be received if
+  0x8100 VLANs are configured.
+
+- The VF can only transmit 0x88A8/0x8100 (i.e., 802.1ad/802.1Q) traffic if:
+
+    1) The VF is not assigned a port VLAN.
+    2) spoofchk is disabled from the PF. If you enable spoofchk, the VF will
+       not transmit 0x88A8/0x8100 traffic.
+
+- The VF may not receive all network traffic based on the Inner VLAN header
+  when VF true promiscuous mode (vf-true-promisc-support) and double VLANs are
+  enabled in SR-IOV mode.
+
+The following are examples of how to configure 802.1ad (QinQ)::
+
+  # ip link add link eth0 eth0.24 type vlan proto 802.1ad id 24
+  # ip link add link eth0.24 eth0.24.371 type vlan proto 802.1Q id 371
+
+  Where "24" and "371" are example VLAN IDs.
+
+
+Tunnel/Overlay Stateless Offloads
+---------------------------------
+Supported tunnels and overlays include VXLAN, GENEVE, and others depending on
+hardware and software configuration. Stateless offloads are enabled by default.
+
+To view the current state of all offloads::
+
+  # ethtool -k <ethX>
+
+
+UDP Segmentation Offload
+------------------------
+Allows the adapter to offload transmit segmentation of UDP packets with
+payloads up to 64K into valid Ethernet frames. Because the adapter hardware is
+able to complete data segmentation much faster than operating system software,
+this feature may improve transmission performance.
+In addition, the adapter may use fewer CPU resources.
+
+NOTE:
+
+- The application sending UDP packets must support UDP segmentation offload.
+
+To enable/disable UDP Segmentation Offload, issue the following command::
+
+  # ethtool -K <ethX> tx-udp-segmentation [off|on]
+
+
+GNSS module
+-----------
+Requires kernel compiled with CONFIG_GNSS=y or CONFIG_GNSS=m.
+Allows user to read messages from the GNSS hardware module and write supported
+commands. If the module is physically present, a GNSS device is spawned:
+``/dev/gnss<id>``.
+The protocol of write command is dependent on the GNSS hardware module as the
+driver writes raw bytes by the GNSS object to the receiver through i2c. Please
+refer to the hardware GNSS module documentation for configuration details.
+
+
+Firmware (FW) logging
+---------------------
+The driver supports FW logging via the debugfs interface on PF 0 only. The FW
+running on the NIC must support FW logging; if the FW doesn't support FW logging
+the 'fwlog' file will not get created in the ice debugfs directory.
+
+Module configuration
+~~~~~~~~~~~~~~~~~~~~
+Firmware logging is configured on a per module basis. Each module can be set to
+a value independent of the other modules (unless the module 'all' is specified).
+The modules will be instantiated under the 'fwlog/modules' directory.
+
+The user can set the log level for a module by writing to the module file like
+this::
+
+  # echo <log_level> > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/modules/<module>
+
+where
+
+* log_level is a name as described below. Each level includes the
+  messages from the previous/lower level
+
+      *	none
+      *	error
+      *	warning
+      *	normal
+      *	verbose
+
+* module is a name that represents the module to receive events for. The
+  module names are
+
+      *	general
+      *	ctrl
+      *	link
+      *	link_topo
+      *	dnl
+      *	i2c
+      *	sdp
+      *	mdio
+      *	adminq
+      *	hdma
+      *	lldp
+      *	dcbx
+      *	dcb
+      *	xlr
+      *	nvm
+      *	auth
+      *	vpd
+      *	iosf
+      *	parser
+      *	sw
+      *	scheduler
+      *	txq
+      *	rsvd
+      *	post
+      *	watchdog
+      *	task_dispatch
+      *	mng
+      *	synce
+      *	health
+      *	tsdrv
+      *	pfreg
+      *	mdlver
+      *	all
+
+The name 'all' is special and allows the user to set all of the modules to the
+specified log_level or to read the log_level of all of the modules.
+
+Example usage to configure the modules
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To set a single module to 'verbose'::
+
+  # echo verbose > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/modules/link
+
+To set multiple modules then issue the command multiple times::
+
+  # echo verbose > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/modules/link
+  # echo warning > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/modules/ctrl
+  # echo none > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/modules/dcb
+
+To set all the modules to the same value::
+
+  # echo normal > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/modules/all
+
+To read the log_level of a specific module (e.g. module 'general')::
+
+  # cat /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/modules/general
+
+To read the log_level of all the modules::
+
+  # cat /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/modules/all
+
+Enabling FW log
+~~~~~~~~~~~~~~~
+Configuring the modules indicates to the FW that the configured modules should
+generate events that the driver is interested in, but it **does not** send the
+events to the driver until the enable message is sent to the FW. To do this
+the user can write a 1 (enable) or 0 (disable) to 'fwlog/enable'. An example
+is::
+
+  # echo 1 > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/enable
+
+Retrieving FW log data
+~~~~~~~~~~~~~~~~~~~~~~
+The FW log data can be retrieved by reading from 'fwlog/data'. The user can
+write any value to 'fwlog/data' to clear the data. The data can only be cleared
+when FW logging is disabled. The FW log data is a binary file that is sent to
+Intel and used to help debug user issues.
+
+An example to read the data is::
+
+  # cat /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/data > fwlog.bin
+
+An example to clear the data is::
+
+  # echo 0 > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/data
+
+Changing how often the log events are sent to the driver
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The driver receives FW log data from the Admin Receive Queue (ARQ). The
+frequency that the FW sends the ARQ events can be configured by writing to
+'fwlog/nr_messages'. The range is 1-128 (1 means push every log message, 128
+means push only when the max AQ command buffer is full). The suggested value is
+10. The user can see what the value is configured to by reading
+'fwlog/nr_messages'. An example to set the value is::
+
+  # echo 50 > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/nr_messages
+
+Configuring the amount of memory used to store FW log data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The driver stores FW log data within the driver. The default size of the memory
+used to store the data is 1MB. Some use cases may require more or less data so
+the user can change the amount of memory that is allocated for FW log data.
+To change the amount of memory then write to 'fwlog/log_size'. The value must be
+one of: 128K, 256K, 512K, 1M, or 2M. FW logging must be disabled to change the
+value. An example of changing the value is::
+
+  # echo 128K > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/log_size
+
+
+Performance Optimization
+========================
+Driver defaults are meant to fit a wide variety of workloads, but if further
+optimization is required, we recommend experimenting with the following
+settings.
+
+
+Rx Descriptor Ring Size
+-----------------------
+To reduce the number of Rx packet discards, increase the number of Rx
+descriptors for each Rx ring using ethtool.
+
+  Check if the interface is dropping Rx packets due to buffers being full
+  (rx_dropped.nic can mean that there is no PCIe bandwidth)::
+
+    # ethtool -S <ethX> | grep "rx_dropped"
+
+  If the previous command shows drops on queues, it may help to increase
+  the number of descriptors using 'ethtool -G'::
+
+    # ethtool -G <ethX> rx <N>
+    Where <N> is the desired number of ring entries/descriptors
+
+  This can provide temporary buffering for issues that create latency while
+  the CPUs process descriptors.
+
+
+Interrupt Rate Limiting
+-----------------------
+This driver supports an adaptive interrupt throttle rate (ITR) mechanism that
+is tuned for general workloads. The user can customize the interrupt rate
+control for specific workloads, via ethtool, adjusting the number of
+microseconds between interrupts.
+
+To set the interrupt rate manually, you must disable adaptive mode::
+
+  # ethtool -C <ethX> adaptive-rx off adaptive-tx off
+
+For lower CPU utilization:
+
+  Disable adaptive ITR and lower Rx and Tx interrupts. The examples below
+  affect every queue of the specified interface.
+
+  Setting rx-usecs and tx-usecs to 80 will limit interrupts to about
+  12,500 interrupts per second per queue::
+
+    # ethtool -C <ethX> adaptive-rx off adaptive-tx off rx-usecs 80 tx-usecs 80
+
+For reduced latency:
+
+  Disable adaptive ITR and ITR by setting rx-usecs and tx-usecs to 0
+  using ethtool::
+
+    # ethtool -C <ethX> adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0
+
+Per-queue interrupt rate settings:
+
+  The following examples are for queues 1 and 3, but you can adjust other
+  queues.
+
+  To disable Rx adaptive ITR and set static Rx ITR to 10 microseconds or
+  about 100,000 interrupts/second, for queues 1 and 3::
+
+    # ethtool --per-queue <ethX> queue_mask 0xa --coalesce adaptive-rx off
+    rx-usecs 10
+
+  To show the current coalesce settings for queues 1 and 3::
+
+    # ethtool --per-queue <ethX> queue_mask 0xa --show-coalesce
+
+Bounding interrupt rates using rx-usecs-high:
+
+  :Valid Range: 0-236 (0=no limit)
+
+   The range of 0-236 microseconds provides an effective range of 4,237 to
+   250,000 interrupts per second. The value of rx-usecs-high can be set
+   independently of rx-usecs and tx-usecs in the same ethtool command, and is
+   also independent of the adaptive interrupt moderation algorithm. The
+   underlying hardware supports granularity in 4-microsecond intervals, so
+   adjacent values may result in the same interrupt rate.
+
+  The following command would disable adaptive interrupt moderation, and allow
+  a maximum of 5 microseconds before indicating a receive or transmit was
+  complete. However, instead of resulting in as many as 200,000 interrupts per
+  second, it limits total interrupts per second to 50,000 via the rx-usecs-high
+  parameter.
+
+  ::
+
+    # ethtool -C <ethX> adaptive-rx off adaptive-tx off rx-usecs-high 20
+    rx-usecs 5 tx-usecs 5
+
+
+Virtualized Environments
+------------------------
+In addition to the other suggestions in this section, the following may be
+helpful to optimize performance in VMs.
+
+  Using the appropriate mechanism (vcpupin) in the VM, pin the CPUs to
+  individual LCPUs, making sure to use a set of CPUs included in the
+  device's local_cpulist: ``/sys/class/net/<ethX>/device/local_cpulist``.
+
+  Configure as many Rx/Tx queues in the VM as available. (See the iavf driver
+  documentation for the number of queues supported.) For example::
+
+    # ethtool -L <virt_interface> rx <max> tx <max>
+
+
+Support
+=======
+For general information, go to the Intel support website at:
+https://www.intel.com/support/
+
+If an issue is identified with the released source code on a supported kernel
+with a supported adapter, email the specific information related to the issue
+to intel-wired-lan@lists.osuosl.org.
+
+
+Trademarks
+==========
+Intel is a trademark or registered trademark of Intel Corporation or its
+subsidiaries in the United States and/or other countries.
+
+* Other names and brands may be claimed as the property of others.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/idpf.rst b/Documentation/networking/device_drivers/ethernet/intel/idpf.rst
new file mode 100644
index 000000000000..adb16e2abd21
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/intel/idpf.rst
@@ -0,0 +1,160 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+==========================================================================
+idpf Linux* Base Driver for the Intel(R) Infrastructure Data Path Function
+==========================================================================
+
+Intel idpf Linux driver.
+Copyright(C) 2023 Intel Corporation.
+
+.. contents::
+
+The idpf driver serves as both the Physical Function (PF) and Virtual Function
+(VF) driver for the Intel(R) Infrastructure Data Path Function.
+
+Driver information can be obtained using ethtool, lspci, and ip.
+
+For questions related to hardware requirements, refer to the documentation
+supplied with your Intel adapter. All hardware requirements listed apply to use
+with Linux.
+
+
+Identifying Your Adapter
+========================
+For information on how to identify your adapter, and for the latest Intel
+network drivers, refer to the Intel Support website:
+http://www.intel.com/support
+
+
+Additional Features and Configurations
+======================================
+
+ethtool
+-------
+The driver utilizes the ethtool interface for driver configuration and
+diagnostics, as well as displaying statistical information. The latest ethtool
+version is required for this functionality. If you don't have one yet, you can
+obtain it at:
+https://kernel.org/pub/software/network/ethtool/
+
+
+Viewing Link Messages
+---------------------
+Link messages will not be displayed to the console if the distribution is
+restricting system messages. In order to see network driver link messages on
+your console, set dmesg to eight by entering the following::
+
+  # dmesg -n 8
+
+.. note::
+   This setting is not saved across reboots.
+
+
+Jumbo Frames
+------------
+Jumbo Frames support is enabled by changing the Maximum Transmission Unit (MTU)
+to a value larger than the default value of 1500.
+
+Use the ip command to increase the MTU size. For example, enter the following
+where <ethX> is the interface number::
+
+  # ip link set mtu 9000 dev <ethX>
+  # ip link set up dev <ethX>
+
+.. note::
+   The maximum MTU setting for jumbo frames is 9706. This corresponds to the
+   maximum jumbo frame size of 9728 bytes.
+
+.. note::
+   This driver will attempt to use multiple page sized buffers to receive
+   each jumbo packet. This should help to avoid buffer starvation issues when
+   allocating receive packets.
+
+.. note::
+   Packet loss may have a greater impact on throughput when you use jumbo
+   frames. If you observe a drop in performance after enabling jumbo frames,
+   enabling flow control may mitigate the issue.
+
+
+Performance Optimization
+========================
+Driver defaults are meant to fit a wide variety of workloads, but if further
+optimization is required, we recommend experimenting with the following
+settings.
+
+
+Interrupt Rate Limiting
+-----------------------
+This driver supports an adaptive interrupt throttle rate (ITR) mechanism that
+is tuned for general workloads. The user can customize the interrupt rate
+control for specific workloads, via ethtool, adjusting the number of
+microseconds between interrupts.
+
+To set the interrupt rate manually, you must disable adaptive mode::
+
+  # ethtool -C <ethX> adaptive-rx off adaptive-tx off
+
+For lower CPU utilization:
+ - Disable adaptive ITR and lower Rx and Tx interrupts. The examples below
+   affect every queue of the specified interface.
+
+ - Setting rx-usecs and tx-usecs to 80 will limit interrupts to about
+   12,500 interrupts per second per queue::
+
+     # ethtool -C <ethX> adaptive-rx off adaptive-tx off rx-usecs 80
+     tx-usecs 80
+
+For reduced latency:
+ - Disable adaptive ITR and ITR by setting rx-usecs and tx-usecs to 0
+   using ethtool::
+
+     # ethtool -C <ethX> adaptive-rx off adaptive-tx off rx-usecs 0
+     tx-usecs 0
+
+Per-queue interrupt rate settings:
+ - The following examples are for queues 1 and 3, but you can adjust other
+   queues.
+
+ - To disable Rx adaptive ITR and set static Rx ITR to 10 microseconds or
+   about 100,000 interrupts/second, for queues 1 and 3::
+
+     # ethtool --per-queue <ethX> queue_mask 0xa --coalesce adaptive-rx off
+     rx-usecs 10
+
+ - To show the current coalesce settings for queues 1 and 3::
+
+     # ethtool --per-queue <ethX> queue_mask 0xa --show-coalesce
+
+
+
+Virtualized Environments
+------------------------
+In addition to the other suggestions in this section, the following may be
+helpful to optimize performance in VMs.
+
+ - Using the appropriate mechanism (vcpupin) in the VM, pin the CPUs to
+   individual LCPUs, making sure to use a set of CPUs included in the
+   device's local_cpulist: /sys/class/net/<ethX>/device/local_cpulist.
+
+ - Configure as many Rx/Tx queues in the VM as available. (See the idpf driver
+   documentation for the number of queues supported.) For example::
+
+     # ethtool -L <virt_interface> rx <max> tx <max>
+
+
+Support
+=======
+For general information, go to the Intel support website at:
+http://www.intel.com/support/
+
+If an issue is identified with the released source code on a supported kernel
+with a supported adapter, email the specific information related to the issue
+to intel-wired-lan@lists.osuosl.org.
+
+
+Trademarks
+==========
+Intel is a trademark or registered trademark of Intel Corporation or its
+subsidiaries in the United States and/or other countries.
+
+* Other names and brands may be claimed as the property of others.
diff --git a/Documentation/networking/device_drivers/intel/igb.rst b/Documentation/networking/device_drivers/ethernet/intel/igb.rst
index 87e560fe5eaa..fbd590b6a0d6 100644
--- a/Documentation/networking/device_drivers/intel/igb.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/igb.rst
@@ -20,7 +20,7 @@ Identifying Your Adapter
 ========================
 For information on how to identify your adapter, and for the latest Intel
 network drivers, refer to the Intel Support website:
-http://www.intel.com/support
+https://www.intel.com/support
 
 
 Command Line Parameters
@@ -201,13 +201,8 @@ NOTE: This feature is exclusive to i210 models.
 Support
 =======
 For general information, go to the Intel support website at:
-
 https://www.intel.com/support/
 
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
 If an issue is identified with the released source code on a supported kernel
 with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/intel/igbvf.rst b/Documentation/networking/device_drivers/ethernet/intel/igbvf.rst
index 557fc020ef31..11a9017f3069 100644
--- a/Documentation/networking/device_drivers/intel/igbvf.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/igbvf.rst
@@ -35,7 +35,7 @@ Identifying Your Adapter
 ========================
 For information on how to identify your adapter, and for the latest Intel
 network drivers, refer to the Intel Support website:
-http://www.intel.com/support
+https://www.intel.com/support
 
 
 Additional Features and Configurations
@@ -53,13 +53,8 @@ https://www.kernel.org/pub/software/network/ethtool/
 Support
 =======
 For general information, go to the Intel support website at:
-
 https://www.intel.com/support/
 
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
 If an issue is identified with the released source code on a supported kernel
 with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/intel/ixgbe.rst b/Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst
index f1d5233e5e51..1e5f16993f69 100644
--- a/Documentation/networking/device_drivers/intel/ixgbe.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst
@@ -440,6 +440,22 @@ NOTE: For 82599-based network connections, if you are enabling jumbo frames in
 a virtual function (VF), jumbo frames must first be enabled in the physical
 function (PF). The VF MTU setting cannot be larger than the PF MTU.
 
+NBASE-T Support
+---------------
+The ixgbe driver supports NBASE-T on some devices. However, the advertisement
+of NBASE-T speeds is suppressed by default, to accommodate broken network
+switches which cannot cope with advertised NBASE-T speeds. Use the ethtool
+command to enable advertising NBASE-T speeds on devices which support it::
+
+  ethtool -s eth? advertise 0x1800000001028
+
+On Linux systems with INTERFACES(5), this can be specified as a pre-up command
+in /etc/network/interfaces so that the interface is always brought up with
+NBASE-T support, e.g.::
+
+  iface eth? inet dhcp
+       pre-up ethtool -s eth? advertise 0x1800000001028 || true
+
 Generic Receive Offload, aka GRO
 --------------------------------
 The driver supports the in-kernel software implementation of GRO. GRO has
@@ -529,13 +545,8 @@ on the Intel Ethernet Controller XL710.
 Support
 =======
 For general information, go to the Intel support website at:
-
 https://www.intel.com/support/
 
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
 If an issue is identified with the released source code on a supported kernel
 with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/intel/ixgbevf.rst b/Documentation/networking/device_drivers/ethernet/intel/ixgbevf.rst
index 76bbde736f21..08dc0d368a48 100644
--- a/Documentation/networking/device_drivers/intel/ixgbevf.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/ixgbevf.rst
@@ -55,13 +55,8 @@ VLANs: There is a limit of a total of 64 shared VLANs to 1 or more VFs.
 Support
 =======
 For general information, go to the Intel support website at:
-
 https://www.intel.com/support/
 
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
 If an issue is identified with the released source code on a supported kernel
 with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst b/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst
new file mode 100644
index 000000000000..c96d262b30be
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst
@@ -0,0 +1,41 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+====================================================================
+Linux kernel networking driver for Marvell's Octeon PCI Endpoint NIC
+====================================================================
+
+Network driver for Marvell's Octeon PCI EndPoint NIC.
+Copyright (c) 2020 Marvell International Ltd.
+
+Contents
+========
+
+- `Overview`_
+- `Supported Devices`_
+- `Interface Control`_
+
+Overview
+========
+This driver implements networking functionality of Marvell's Octeon PCI
+EndPoint NIC.
+
+Supported Devices
+=================
+Currently, this driver support following devices:
+ * Network controller: Cavium, Inc. Device b100
+ * Network controller: Cavium, Inc. Device b200
+ * Network controller: Cavium, Inc. Device b400
+ * Network controller: Cavium, Inc. Device b900
+ * Network controller: Cavium, Inc. Device ba00
+ * Network controller: Cavium, Inc. Device bc00
+ * Network controller: Cavium, Inc. Device bd00
+
+Interface Control
+=================
+Network Interface control like changing mtu, link speed, link down/up are
+done by writing command to mailbox command queue, a mailbox interface
+implemented through a reserved region in BAR4.
+This driver writes the commands into the mailbox and the firmware on the
+Octeon device processes them. The firmware also sends unsolicited notifications
+to driver for events suchs as link change, through notification queue
+implemented as part of mailbox interface.
diff --git a/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep_vf.rst b/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep_vf.rst
new file mode 100644
index 000000000000..603133d0b92f
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep_vf.rst
@@ -0,0 +1,24 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=======================================================================
+Linux kernel networking driver for Marvell's Octeon PCI Endpoint NIC VF
+=======================================================================
+
+Network driver for Marvell's Octeon PCI EndPoint NIC VF.
+Copyright (c) 2020 Marvell International Ltd.
+
+Overview
+========
+This driver implements networking functionality of Marvell's Octeon PCI
+EndPoint NIC VF.
+
+Supported Devices
+=================
+Currently, this driver support following devices:
+ * Network controller: Cavium, Inc. Device b203
+ * Network controller: Cavium, Inc. Device b403
+ * Network controller: Cavium, Inc. Device b103
+ * Network controller: Cavium, Inc. Device b903
+ * Network controller: Cavium, Inc. Device ba03
+ * Network controller: Cavium, Inc. Device bc03
+ * Network controller: Cavium, Inc. Device bd03
diff --git a/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst
new file mode 100644
index 000000000000..a52850602cd8
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst
@@ -0,0 +1,433 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+====================================
+Marvell OcteonTx2 RVU Kernel Drivers
+====================================
+
+Copyright (c) 2020 Marvell International Ltd.
+
+Contents
+========
+
+- `Overview`_
+- `Drivers`_
+- `Basic packet flow`_
+- `Devlink health reporters`_
+- `Quality of service`_
+- `RVU representors`_
+
+Overview
+========
+
+Resource virtualization unit (RVU) on Marvell's OcteonTX2 SOC maps HW
+resources from the network, crypto and other functional blocks into
+PCI-compatible physical and virtual functions. Each functional block
+again has multiple local functions (LFs) for provisioning to PCI devices.
+RVU supports multiple PCIe SRIOV physical functions (PFs) and virtual
+functions (VFs). PF0 is called the administrative / admin function (AF)
+and has privileges to provision RVU functional block's LFs to each of the
+PF/VF.
+
+RVU managed networking functional blocks
+ - Network pool or buffer allocator (NPA)
+ - Network interface controller (NIX)
+ - Network parser CAM (NPC)
+ - Schedule/Synchronize/Order unit (SSO)
+ - Loopback interface (LBK)
+
+RVU managed non-networking functional blocks
+ - Crypto accelerator (CPT)
+ - Scheduled timers unit (TIM)
+ - Schedule/Synchronize/Order unit (SSO)
+   Used for both networking and non networking usecases
+
+Resource provisioning examples
+ - A PF/VF with NIX-LF & NPA-LF resources works as a pure network device
+ - A PF/VF with CPT-LF resource works as a pure crypto offload device.
+
+RVU functional blocks are highly configurable as per software requirements.
+
+Firmware setups following stuff before kernel boots
+ - Enables required number of RVU PFs based on number of physical links.
+ - Number of VFs per PF are either static or configurable at compile time.
+   Based on config, firmware assigns VFs to each of the PFs.
+ - Also assigns MSIX vectors to each of PF and VFs.
+ - These are not changed after kernel boot.
+
+Drivers
+=======
+
+Linux kernel will have multiple drivers registering to different PF and VFs
+of RVU. Wrt networking there will be 3 flavours of drivers.
+
+Admin Function driver
+---------------------
+
+As mentioned above RVU PF0 is called the admin function (AF), this driver
+supports resource provisioning and configuration of functional blocks.
+Doesn't handle any I/O. It sets up few basic stuff but most of the
+functionality is achieved via configuration requests from PFs and VFs.
+
+PF/VFs communicates with AF via a shared memory region (mailbox). Upon
+receiving requests AF does resource provisioning and other HW configuration.
+AF is always attached to host kernel, but PFs and their VFs may be used by host
+kernel itself, or attached to VMs or to userspace applications like
+DPDK etc. So AF has to handle provisioning/configuration requests sent
+by any device from any domain.
+
+AF driver also interacts with underlying firmware to
+ - Manage physical ethernet links ie CGX LMACs.
+ - Retrieve information like speed, duplex, autoneg etc
+ - Retrieve PHY EEPROM and stats.
+ - Configure FEC, PAM modes
+ - etc
+
+From pure networking side AF driver supports following functionality.
+ - Map a physical link to a RVU PF to which a netdev is registered.
+ - Attach NIX and NPA block LFs to RVU PF/VF which provide buffer pools, RQs, SQs
+   for regular networking functionality.
+ - Flow control (pause frames) enable/disable/config.
+ - HW PTP timestamping related config.
+ - NPC parser profile config, basically how to parse pkt and what info to extract.
+ - NPC extract profile config, what to extract from the pkt to match data in MCAM entries.
+ - Manage NPC MCAM entries, upon request can frame and install requested packet forwarding rules.
+ - Defines receive side scaling (RSS) algorithms.
+ - Defines segmentation offload algorithms (eg TSO)
+ - VLAN stripping, capture and insertion config.
+ - SSO and TIM blocks config which provide packet scheduling support.
+ - Debugfs support, to check current resource provising, current status of
+   NPA pools, NIX RQ, SQ and CQs, various stats etc which helps in debugging issues.
+ - And many more.
+
+Physical Function driver
+------------------------
+
+This RVU PF handles IO, is mapped to a physical ethernet link and this
+driver registers a netdev. This supports SR-IOV. As said above this driver
+communicates with AF with a mailbox. To retrieve information from physical
+links this driver talks to AF and AF gets that info from firmware and responds
+back ie cannot talk to firmware directly.
+
+Supports ethtool for configuring links, RSS, queue count, queue size,
+flow control, ntuple filters, dump PHY EEPROM, config FEC etc.
+
+Virtual Function driver
+-----------------------
+
+There are two types VFs, VFs that share the physical link with their parent
+SR-IOV PF and the VFs which work in pairs using internal HW loopback channels (LBK).
+
+Type1:
+ - These VFs and their parent PF share a physical link and used for outside communication.
+ - VFs cannot communicate with AF directly, they send mbox message to PF and PF
+   forwards that to AF. AF after processing, responds back to PF and PF forwards
+   the reply to VF.
+ - From functionality point of view there is no difference between PF and VF as same type
+   HW resources are attached to both. But user would be able to configure few stuff only
+   from PF as PF is treated as owner/admin of the link.
+
+Type2:
+ - RVU PF0 ie admin function creates these VFs and maps them to loopback block's channels.
+ - A set of two VFs (VF0 & VF1, VF2 & VF3 .. so on) works as a pair ie pkts sent out of
+   VF0 will be received by VF1 and vice versa.
+ - These VFs can be used by applications or virtual machines to communicate between them
+   without sending traffic outside. There is no switch present in HW, hence the support
+   for loopback VFs.
+ - These communicate directly with AF (PF0) via mbox.
+
+Except for the IO channels or links used for packet reception and transmission there is
+no other difference between these VF types. AF driver takes care of IO channel mapping,
+hence same VF driver works for both types of devices.
+
+Basic packet flow
+=================
+
+Ingress
+-------
+
+1. CGX LMAC receives packet.
+2. Forwards the packet to the NIX block.
+3. Then submitted to NPC block for parsing and then MCAM lookup to get the destination RVU device.
+4. NIX LF attached to the destination RVU device allocates a buffer from RQ mapped buffer pool of NPA block LF.
+5. RQ may be selected by RSS or by configuring MCAM rule with a RQ number.
+6. Packet is DMA'ed and driver is notified.
+
+Egress
+------
+
+1. Driver prepares a send descriptor and submits to SQ for transmission.
+2. The SQ is already configured (by AF) to transmit on a specific link/channel.
+3. The SQ descriptor ring is maintained in buffers allocated from SQ mapped pool of NPA block LF.
+4. NIX block transmits the pkt on the designated channel.
+5. NPC MCAM entries can be installed to divert pkt onto a different channel.
+
+Devlink health reporters
+========================
+
+NPA Reporters
+-------------
+The NPA reporters are responsible for reporting and recovering the following group of errors:
+
+1. GENERAL events
+
+   - Error due to operation of unmapped PF.
+   - Error due to disabled alloc/free for other HW blocks (NIX, SSO, TIM, DPI and AURA).
+
+2. ERROR events
+
+   - Fault due to NPA_AQ_INST_S read or NPA_AQ_RES_S write.
+   - AQ Doorbell Error.
+
+3. RAS events
+
+   - RAS Error Reporting for NPA_AQ_INST_S/NPA_AQ_RES_S.
+
+4. RVU events
+
+   - Error due to unmapped slot.
+
+Sample Output::
+
+	~# devlink health
+	pci/0002:01:00.0:
+	  reporter hw_npa_intr
+	      state healthy error 2872 recover 2872 last_dump_date 2020-12-10 last_dump_time 09:39:09 grace_period 0 auto_recover true auto_dump true
+	  reporter hw_npa_gen
+	      state healthy error 2872 recover 2872 last_dump_date 2020-12-11 last_dump_time 04:43:04 grace_period 0 auto_recover true auto_dump true
+	  reporter hw_npa_err
+	      state healthy error 2871 recover 2871 last_dump_date 2020-12-10 last_dump_time 09:39:17 grace_period 0 auto_recover true auto_dump true
+	   reporter hw_npa_ras
+	      state healthy error 0 recover 0 last_dump_date 2020-12-10 last_dump_time 09:32:40 grace_period 0 auto_recover true auto_dump true
+
+Each reporter dumps the
+
+ - Error Type
+ - Error Register value
+ - Reason in words
+
+For example::
+
+	~# devlink health dump show  pci/0002:01:00.0 reporter hw_npa_gen
+	 NPA_AF_GENERAL:
+	         NPA General Interrupt Reg : 1
+	         NIX0: free disabled RX
+	~# devlink health dump show  pci/0002:01:00.0 reporter hw_npa_intr
+	 NPA_AF_RVU:
+	         NPA RVU Interrupt Reg : 1
+	         Unmap Slot Error
+	~# devlink health dump show  pci/0002:01:00.0 reporter hw_npa_err
+	 NPA_AF_ERR:
+	        NPA Error Interrupt Reg : 4096
+	        AQ Doorbell Error
+
+
+NIX Reporters
+-------------
+The NIX reporters are responsible for reporting and recovering the following group of errors:
+
+1. GENERAL events
+
+   - Receive mirror/multicast packet drop due to insufficient buffer.
+   - SMQ Flush operation.
+
+2. ERROR events
+
+   - Memory Fault due to WQE read/write from multicast/mirror buffer.
+   - Receive multicast/mirror replication list error.
+   - Receive packet on an unmapped PF.
+   - Fault due to NIX_AQ_INST_S read or NIX_AQ_RES_S write.
+   - AQ Doorbell Error.
+
+3. RAS events
+
+   - RAS Error Reporting for NIX Receive Multicast/Mirror Entry Structure.
+   - RAS Error Reporting for WQE/Packet Data read from Multicast/Mirror Buffer..
+   - RAS Error Reporting for NIX_AQ_INST_S/NIX_AQ_RES_S.
+
+4. RVU events
+
+   - Error due to unmapped slot.
+
+Sample Output::
+
+	~# ./devlink health
+	pci/0002:01:00.0:
+	  reporter hw_npa_intr
+	    state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
+	  reporter hw_npa_gen
+	    state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
+	  reporter hw_npa_err
+	    state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
+	  reporter hw_npa_ras
+	    state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
+	  reporter hw_nix_intr
+	    state healthy error 1121 recover 1121 last_dump_date 2021-01-19 last_dump_time 05:42:26 grace_period 0 auto_recover true auto_dump true
+	  reporter hw_nix_gen
+	    state healthy error 949 recover 949 last_dump_date 2021-01-19 last_dump_time 05:42:43 grace_period 0 auto_recover true auto_dump true
+	  reporter hw_nix_err
+	    state healthy error 1147 recover 1147 last_dump_date 2021-01-19 last_dump_time 05:42:59 grace_period 0 auto_recover true auto_dump true
+	  reporter hw_nix_ras
+	    state healthy error 409 recover 409 last_dump_date 2021-01-19 last_dump_time 05:43:16 grace_period 0 auto_recover true auto_dump true
+
+Each reporter dumps the
+
+ - Error Type
+ - Error Register value
+ - Reason in words
+
+For example::
+
+	~# devlink health dump show pci/0002:01:00.0 reporter hw_nix_intr
+	 NIX_AF_RVU:
+	        NIX RVU Interrupt Reg : 1
+	        Unmap Slot Error
+	~# devlink health dump show pci/0002:01:00.0 reporter hw_nix_gen
+	 NIX_AF_GENERAL:
+	        NIX General Interrupt Reg : 1
+	        Rx multicast pkt drop
+	~# devlink health dump show pci/0002:01:00.0 reporter hw_nix_err
+	 NIX_AF_ERR:
+	        NIX Error Interrupt Reg : 64
+	        Rx on unmapped PF_FUNC
+
+
+Quality of service
+==================
+
+
+Hardware algorithms used in scheduling
+--------------------------------------
+
+octeontx2 silicon and CN10K transmit interface consists of five transmit levels
+starting from SMQ/MDQ, TL4 to TL1. Each packet will traverse MDQ, TL4 to TL1
+levels. Each level contains an array of queues to support scheduling and shaping.
+The hardware uses the below algorithms depending on the priority of scheduler queues.
+once the usercreates tc classes with different priorities, the driver configures
+schedulers allocated to the class with specified priority along with rate-limiting
+configuration.
+
+1. Strict Priority
+
+      -  Once packets are submitted to MDQ, hardware picks all active MDQs having different priority
+         using strict priority.
+
+2. Round Robin
+
+      - Active MDQs having the same priority level are chosen using round robin.
+
+
+Setup HTB offload
+-----------------
+
+1. Enable HW TC offload on the interface::
+
+        # ethtool -K <interface> hw-tc-offload on
+
+2. Crate htb root::
+
+        # tc qdisc add dev <interface> clsact
+        # tc qdisc replace dev <interface> root handle 1: htb offload
+
+3. Create tc classes with different priorities::
+
+        # tc class add dev <interface> parent 1: classid 1:1 htb rate 10Gbit prio 1
+
+        # tc class add dev <interface> parent 1: classid 1:2 htb rate 10Gbit prio 7
+
+4. Create tc classes with same priorities and different quantum::
+
+        # tc class add dev <interface> parent 1: classid 1:1 htb rate 10Gbit prio 2 quantum 409600
+
+        # tc class add dev <interface> parent 1: classid 1:2 htb rate 10Gbit prio 2 quantum 188416
+
+        # tc class add dev <interface> parent 1: classid 1:3 htb rate 10Gbit prio 2 quantum 32768
+
+
+RVU Representors
+================
+
+RVU representor driver adds support for creation of representor devices for
+RVU PFs' VFs in the system. Representor devices are created when user enables
+the switchdev mode.
+Switchdev mode can be enabled either before or after setting up SRIOV numVFs.
+All representor devices share a single NIXLF but each has a dedicated Rx/Tx
+queues. RVU PF representor driver registers a separate netdev for each
+Rx/Tx queue pair.
+
+Current HW does not support built-in switch which can do L2 learning and
+forwarding packets between representee and representor. Hence, packet path
+between representee and it's representor is achieved by setting up appropriate
+NPC MCAM filters.
+Transmit packets matching these filters will be loopbacked through hardware
+loopback channel/interface (i.e, instead of sending them out of MAC interface).
+Which will again match the installed filters and will be forwarded.
+This way representee => representor and representor => representee packet
+path is achieved. These rules get installed when representors are created
+and gets active/deactivate based on the representor/representee interface state.
+
+Usage example:
+
+ - Change device to switchdev mode::
+
+	# devlink dev eswitch set pci/0002:1c:00.0 mode switchdev
+
+ - List of representor devices on the system::
+
+	# ip link show
+	Rpf1vf0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000 link/ether f6:43:83:ee:26:21 brd ff:ff:ff:ff:ff:ff
+	Rpf1vf1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000 link/ether 12:b2:54:0e:24:54 brd ff:ff:ff:ff:ff:ff
+	Rpf1vf2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000 link/ether 4a:12:c4:4c:32:62 brd ff:ff:ff:ff:ff:ff
+	Rpf1vf3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000 link/ether ca:cb:68:0e:e2:6e brd ff:ff:ff:ff:ff:ff
+	Rpf2vf0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000 link/ether 06:cc:ad:b4:f0:93 brd ff:ff:ff:ff:ff:ff
+
+
+To delete the representors devices from the system. Change the device to legacy mode.
+
+ - Change device to legacy mode::
+
+	# devlink dev eswitch set pci/0002:1c:00.0 mode legacy
+
+RVU representors can be managed using devlink ports
+(see :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`) interface.
+
+ - Show devlink ports of representors::
+
+	# devlink port
+	pci/0002:1c:00.0/0: type eth netdev Rpf1vf0 flavour physical port 0 splittable false
+	pci/0002:1c:00.0/1: type eth netdev Rpf1vf1 flavour pcivf controller 0 pfnum 1 vfnum 1 external false splittable false
+	pci/0002:1c:00.0/2: type eth netdev Rpf1vf2 flavour pcivf controller 0 pfnum 1 vfnum 2 external false splittable false
+	pci/0002:1c:00.0/3: type eth netdev Rpf1vf3 flavour pcivf controller 0 pfnum 1 vfnum 3 external false splittable false
+
+Function attributes
+===================
+
+The RVU representor support function attributes for representors.
+Port function configuration of the representors are supported through devlink eswitch port.
+
+MAC address setup
+-----------------
+
+RVU representor driver support devlink port function attr mechanism to setup MAC
+address. (refer to Documentation/networking/devlink/devlink-port.rst)
+
+ - To setup MAC address for port 2::
+
+	# devlink port function set pci/0002:1c:00.0/2 hw_addr 5c:a1:1b:5e:43:11
+	# devlink port show pci/0002:1c:00.0/2
+	pci/0002:1c:00.0/2: type eth netdev Rpf1vf2 flavour pcivf controller 0 pfnum 1 vfnum 2 external false splittable false
+	function:
+		hw_addr 5c:a1:1b:5e:43:11
+
+
+TC offload
+==========
+
+The rvu representor driver implements support for offloading tc rules using port representors.
+
+ - Drop packets with vlan id 3::
+
+	# tc filter add dev Rpf1vf0 protocol 802.1Q parent ffff: flower vlan_id 3 vlan_ethtype ipv4 skip_sw action drop
+
+ - Redirect packets with vlan id 5 and IPv4 packets to eth1, after stripping vlan header.::
+
+	# tc filter add dev Rpf1vf0 ingress protocol 802.1Q flower vlan_id 5 vlan_ethtype ipv4 skip_sw action vlan pop action mirred ingress redirect dev eth1
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
new file mode 100644
index 000000000000..43d72c8b713b
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
@@ -0,0 +1,1343 @@
+.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+.. include:: <isonum.txt>
+
+================
+Ethtool counters
+================
+
+:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+Contents
+========
+
+- `Overview`_
+- `Groups`_
+- `Types`_
+- `Descriptions`_
+
+Overview
+========
+
+There are several counter groups based on where the counter is being counted. In
+addition, each group of counters may have different counter types.
+
+These counter groups are based on which component in a networking setup,
+illustrated below, that they describe::
+
+                                                  ----------------------------------------
+                                                  |                                      |
+    ----------------------------------------    ---------------------------------------- |
+    |              Hypervisor              |    |                  VM                  | |
+    |                                      |    |                                      | |
+    | -------------------  --------------- |    | -------------------  --------------- | |
+    | | Ethernet driver |  | RDMA driver | |    | | Ethernet driver |  | RDMA driver | | |
+    | -------------------  --------------- |    | -------------------  --------------- | |
+    |           |                 |        |    |           |                 |        | |
+    |           -------------------        |    |           -------------------        | |
+    |                   |                  |    |                   |                  |--
+    ----------------------------------------    ----------------------------------------
+                        |                                           |
+            -------------               -----------------------------
+            |                           |
+         ------                      ------ ------ ------         ------      ------      ------
+    -----| PF |----------------------| VF |-| VF |-| VF |-----  --| PF |--- --| PF |--- --| PF |---
+    |    ------                      ------ ------ ------    |  | ------  | | ------  | | ------  |
+    |                                                        |  |         | |         | |         |
+    |                                                        |  |         | |         | |         |
+    |                                                        |  |         | |         | |         |
+    | eSwitch                                                |  | eSwitch | | eSwitch | | eSwitch |
+    ----------------------------------------------------------  ----------- ----------- -----------
+               -------------------------------------------------------------------------------
+               |                                                                             |
+               |                                                                             |
+               | Uplink (no counters)                                                        |
+               -------------------------------------------------------------------------------
+                       ---------------------------------------------------------------
+                       |                                                             |
+                       |                                                             |
+                       | MPFS (no counters)                                          |
+                       ---------------------------------------------------------------
+                                                     |
+                                                     |
+                                                     | Port
+
+Groups
+======
+
+Ring
+  Software counters populated by the driver stack.
+
+Netdev
+  An aggregation of software ring counters.
+
+vPort counters
+  Traffic counters and drops due to steering or no buffers. May indicate issues
+  with NIC. These counters include Ethernet traffic counters (including Raw
+  Ethernet) and RDMA/RoCE traffic counters.
+
+Physical port counters
+  Counters that collect statistics about the PFs and VFs. May indicate issues
+  with NIC, link, or network. This measuring point holds information on
+  standardized counters like IEEE 802.3, RFC2863, RFC 2819, RFC 3635 and
+  additional counters like flow control, FEC and more. Physical port counters
+  are not exposed to virtual machines.
+
+Priority Port Counters
+  A set of the physical port counters, per priority per port.
+
+Types
+=====
+
+Counters are divided into three types.
+
+Traffic Informative Counters
+  Counters which count traffic. These counters can be used for load estimation
+  or for general debug.
+
+Traffic Acceleration Counters
+  Counters which count traffic that was accelerated by Mellanox driver or by
+  hardware. The counters are an additional layer to the informative counter set,
+  and the same traffic is counted in both informative and acceleration counters.
+
+.. [#accel] Traffic acceleration counter.
+
+Error Counters
+  Increment of these counters might indicate a problem. Each of these counters
+  has an explanation and correction action.
+
+Statistic can be fetched via the `ip link` or `ethtool` commands. `ethtool`
+provides more detailed information.::
+
+    ip –s link show <if-name>
+    ethtool -S <if-name>
+
+Descriptions
+============
+
+XSK, PTP, and QoS counters that are similar to counters defined previously will
+not be separately listed. For example, `ptp_tx[i]_packets` will not be
+explicitly documented since `tx[i]_packets` describes the behavior of both
+counters, except `ptp_tx[i]_packets` is only counted when precision time
+protocol is used.
+
+Ring / Netdev Counter
+----------------------------
+The following counters are available per ring or software port.
+
+These counters provide information on the amount of traffic that was accelerated
+by the NIC. The counters are counting the accelerated traffic in addition to the
+standard counters which counts it (i.e. accelerated traffic is counted twice).
+
+The counter names in the table below refers to both ring and port counters. The
+notation for ring counters includes the [i] index without the braces. The
+notation for port counters doesn't include the [i]. A counter name
+`rx[i]_packets` will be printed as `rx0_packets` for ring 0 and `rx_packets` for
+the software port.
+
+.. flat-table:: Ring / Software Port Counter Table
+   :widths: 2 3 1
+
+   * - Counter
+     - Description
+     - Type
+
+   * - `rx[i]_packets`
+     - The number of packets received on ring i.
+     - Informative
+
+   * - `rx[i]_bytes`
+     - The number of bytes received on ring i.
+     - Informative
+
+   * - `tx[i]_packets`
+     - The number of packets transmitted on ring i.
+     - Informative
+
+   * - `tx[i]_bytes`
+     - The number of bytes transmitted on ring i.
+     - Informative
+
+   * - `tx[i]_recover`
+     - The number of times the SQ was recovered.
+     - Error
+
+   * - `tx[i]_cqes`
+     - Number of CQEs events on SQ issued on ring i.
+     - Informative
+
+   * - `tx[i]_cqe_err`
+     - The number of error CQEs encountered on the SQ for ring i.
+     - Error
+
+   * - `tx[i]_tso_packets`
+     - The number of TSO packets transmitted on ring i [#accel]_.
+     - Acceleration
+
+   * - `tx[i]_tso_bytes`
+     - The number of TSO bytes transmitted on ring i [#accel]_.
+     - Acceleration
+
+   * - `tx[i]_tso_inner_packets`
+     - The number of TSO packets which are indicated to be carry internal
+       encapsulation transmitted on ring i [#accel]_.
+     - Acceleration
+
+   * - `tx[i]_tso_inner_bytes`
+     - The number of TSO bytes which are indicated to be carry internal
+       encapsulation transmitted on ring i [#accel]_.
+     - Acceleration
+
+   * - `rx[i]_gro_packets`
+     - Number of received packets processed using hardware-accelerated GRO. The
+       number of hardware GRO offloaded packets received on ring i. Only true GRO
+       packets are counted: only packets that are in an SKB with a GRO count > 1.
+     - Acceleration
+
+   * - `rx[i]_gro_bytes`
+     - Number of received bytes processed using hardware-accelerated GRO. The
+       number of hardware GRO offloaded bytes received on ring i. Only true GRO
+       packets are counted: only packets that are in an SKB with a GRO count > 1.
+     - Acceleration
+
+   * - `rx[i]_gro_skbs`
+     - The number of GRO SKBs constructed from hardware-accelerated GRO. Only SKBs
+       with a GRO count > 1 are counted.
+     - Informative
+
+   * - `rx[i]_gro_large_hds`
+     - Number of receive packets using hardware-accelerated GRO that have large
+       headers that require additional memory to be allocated.
+     - Informative
+
+   * - `rx[i]_hds_nodata_packets`
+     - Number of header only packets in header/data split mode [#accel]_.
+     - Informative
+
+   * - `rx[i]_hds_nodata_bytes`
+     - Number of bytes for header only packets in header/data split mode
+       [#accel]_.
+     - Informative
+
+   * - `rx[i]_hds_nosplit_packets`
+     - Number of packets that were not split in header/data split mode. A
+       packet will not get split when the hardware does not support its
+       protocol splitting. An example such a protocol is ICMPv4/v6. Currently
+       TCP and UDP with IPv4/IPv6 are supported for header/data split
+       [#accel]_.
+     - Informative
+
+   * - `rx[i]_hds_nosplit_bytes`
+     - Number of bytes for packets that were not split in header/data split
+       mode. A packet will not get split when the hardware does not support its
+       protocol splitting. An example such a protocol is ICMPv4/v6. Currently
+       TCP and UDP with IPv4/IPv6 are supported for header/data split
+       [#accel]_.
+     - Informative
+
+   * - `rx[i]_lro_packets`
+     - The number of LRO packets received on ring i [#accel]_.
+     - Acceleration
+
+   * - `rx[i]_lro_bytes`
+     - The number of LRO bytes received on ring i [#accel]_.
+     - Acceleration
+
+   * - `rx[i]_ecn_mark`
+     - The number of received packets where the ECN mark was turned on.
+     - Informative
+
+   * - `rx_oversize_pkts_buffer`
+     - The number of dropped received packets due to length which arrived to RQ
+       and exceed software buffer size allocated by the device for incoming
+       traffic. It might imply that the device MTU is larger than the software
+       buffers size.
+     - Error
+
+   * - `rx_oversize_pkts_sw_drop`
+     - Number of received packets dropped in software because the CQE data is
+       larger than the MTU size.
+     - Error
+
+   * - `rx[i]_csum_unnecessary`
+     - Packets received with a `CHECKSUM_UNNECESSARY` on ring i [#accel]_.
+     - Acceleration
+
+   * - `rx[i]_csum_unnecessary_inner`
+     - Packets received with inner encapsulation with a `CHECKSUM_UNNECESSARY`
+       on ring i [#accel]_.
+     - Acceleration
+
+   * - `rx[i]_csum_none`
+     - Packets received with a `CHECKSUM_NONE` on ring i [#accel]_.
+     - Acceleration
+
+   * - `rx[i]_csum_complete`
+     - Packets received with a `CHECKSUM_COMPLETE` on ring i [#accel]_.
+     - Acceleration
+
+   * - `rx[i]_csum_complete_tail`
+     - Number of received packets that had checksum calculation computed,
+       potentially needed padding, and were able to do so with
+       `CHECKSUM_PARTIAL`.
+     - Informative
+
+   * - `rx[i]_csum_complete_tail_slow`
+     - Number of received packets that need padding larger than eight bytes for
+       the checksum.
+     - Informative
+
+   * - `tx[i]_csum_partial`
+     - Packets transmitted with a `CHECKSUM_PARTIAL` on ring i [#accel]_.
+     - Acceleration
+
+   * - `tx[i]_csum_partial_inner`
+     - Packets transmitted with inner encapsulation with a `CHECKSUM_PARTIAL` on
+       ring i [#accel]_.
+     - Acceleration
+
+   * - `tx[i]_csum_none`
+     - Packets transmitted with no hardware checksum acceleration on ring i.
+     - Informative
+
+   * - `tx[i]_stopped` / `tx_queue_stopped` [#ring_global]_
+     - Events where SQ was full on ring i. If this counter is increased, check
+       the amount of buffers allocated for transmission.
+     - Informative
+
+   * - `tx[i]_wake` / `tx_queue_wake` [#ring_global]_
+     - Events where SQ was full and has become not full on ring i.
+     - Informative
+
+   * - `tx[i]_dropped` / `tx_queue_dropped` [#ring_global]_
+     - Packets transmitted that were dropped due to DMA mapping failure on
+       ring i. If this counter is increased, check the amount of buffers
+       allocated for transmission.
+     - Error
+
+   * - `tx[i]_nop`
+     - The number of nop WQEs (empty WQEs) inserted to the SQ (related to
+       ring i) due to the reach of the end of the cyclic buffer. When reaching
+       near to the end of cyclic buffer the driver may add those empty WQEs to
+       avoid handling a state the a WQE start in the end of the queue and ends
+       in the beginning of the queue. This is a normal condition.
+     - Informative
+
+   * - `tx[i]_timestamps`
+     - Transmitted packets that were hardware timestamped at the device's DMA
+       layer.
+     - Informative
+
+   * - `tx[i]_added_vlan_packets`
+     - The number of packets sent where vlan tag insertion was offloaded to the
+       hardware.
+     - Acceleration
+
+   * - `rx[i]_removed_vlan_packets`
+     - The number of packets received where vlan tag stripping was offloaded to
+       the hardware.
+     - Acceleration
+
+   * - `rx[i]_wqe_err`
+     - The number of wrong opcodes received on ring i.
+     - Error
+
+   * - `rx[i]_mpwqe_frag`
+     - The number of WQEs that failed to allocate compound page and hence
+       fragmented MPWQE’s (Multi Packet WQEs) were used on ring i. If this
+       counter raise, it may suggest that there is no enough memory for large
+       pages, the driver allocated fragmented pages. This is not abnormal
+       condition.
+     - Informative
+
+   * - `rx[i]_mpwqe_filler_cqes`
+     - The number of filler CQEs events that were issued on ring i.
+     - Informative
+
+   * - `rx[i]_mpwqe_filler_strides`
+     - The number of strides consumed by filler CQEs on ring i.
+     - Informative
+
+   * - `tx[i]_mpwqe_blks`
+     - The number of send blocks processed from Multi-Packet WQEs (mpwqe).
+     - Informative
+
+   * - `tx[i]_mpwqe_pkts`
+     - The number of send packets processed from Multi-Packet WQEs (mpwqe).
+     - Informative
+
+   * - `rx[i]_cqe_compress_blks`
+     - The number of receive blocks with CQE compression on ring i [#accel]_.
+     - Acceleration
+
+   * - `rx[i]_cqe_compress_pkts`
+     - The number of receive packets with CQE compression on ring i [#accel]_.
+     - Acceleration
+
+   * - `rx[i]_arfs_add`
+     - The number of aRFS flow rules added to the device for direct RQ steering
+       on ring i [#accel]_.
+     - Acceleration
+
+   * - `rx[i]_arfs_request_in`
+     - Number of flow rules that have been requested to move into ring i for
+       direct RQ steering [#accel]_.
+     - Acceleration
+
+   * - `rx[i]_arfs_request_out`
+     - Number of flow rules that have been requested to move out of ring i [#accel]_.
+     - Acceleration
+
+   * - `rx[i]_arfs_expired`
+     - Number of flow rules that have been expired and removed [#accel]_.
+     - Acceleration
+
+   * - `rx[i]_arfs_err`
+     - Number of flow rules that failed to be added to the flow table.
+     - Error
+
+   * - `rx[i]_recover`
+     - The number of times the RQ was recovered.
+     - Error
+
+   * - `tx[i]_xmit_more`
+     - The number of packets sent with `xmit_more` indication set on the skbuff
+       (no doorbell).
+     - Acceleration
+
+   * - `ch[i]_poll`
+     - The number of invocations of NAPI poll of channel i.
+     - Informative
+
+   * - `ch[i]_arm`
+     - The number of times the NAPI poll function completed and armed the
+       completion queues on channel i.
+     - Informative
+
+   * - `ch[i]_aff_change`
+     - The number of times the NAPI poll function explicitly stopped execution
+       on a CPU due to a change in affinity, on channel i.
+     - Informative
+
+   * - `ch[i]_events`
+     - The number of hard interrupt events on the completion queues of channel i.
+     - Informative
+
+   * - `ch[i]_eq_rearm`
+     - The number of times the EQ was recovered.
+     - Error
+
+   * - `ch[i]_force_irq`
+     - Number of times NAPI is triggered by XSK wakeups by posting a NOP to
+       ICOSQ.
+     - Acceleration
+
+   * - `rx[i]_congst_umr`
+     - The number of times an outstanding UMR request is delayed due to
+       congestion, on ring i.
+     - Informative
+
+   * - `rx_pp_alloc_fast`
+     - Number of successful fast path allocations.
+     - Informative
+
+   * - `rx_pp_alloc_slow`
+     - Number of slow path order-0 allocations.
+     - Informative
+
+   * - `rx_pp_alloc_slow_high_order`
+     - Number of slow path high order allocations.
+     - Informative
+
+   * - `rx_pp_alloc_empty`
+     - Counter is incremented when ptr ring is empty, so a slow path allocation
+       was forced.
+     - Informative
+
+   * - `rx_pp_alloc_refill`
+     - Counter is incremented when an allocation which triggered a refill of the
+       cache.
+     - Informative
+
+   * - `rx_pp_alloc_waive`
+     - Counter is incremented when pages obtained from the ptr ring that cannot
+       be added to the cache due to a NUMA mismatch.
+     - Informative
+
+   * - `rx_pp_recycle_cached`
+     - Counter is incremented when recycling placed page in the page pool cache.
+     - Informative
+
+   * - `rx_pp_recycle_cache_full`
+     - Counter is incremented when page pool cache was full.
+     - Informative
+
+   * - `rx_pp_recycle_ring`
+     - Counter is incremented when page placed into the ptr ring.
+     - Informative
+
+   * - `rx_pp_recycle_ring_full`
+     - Counter is incremented when page released from page pool because the ptr
+       ring was full.
+     - Informative
+
+   * - `rx_pp_recycle_released_ref`
+     - Counter is incremented when page released (and not recycled) because
+       refcnt > 1.
+     - Informative
+
+   * - `rx[i]_xsk_buff_alloc_err`
+     - The number of times allocating an skb or XSK buffer failed in the XSK RQ
+       context.
+     - Error
+
+   * - `rx[i]_xdp_tx_xmit`
+     - The number of packets forwarded back to the port due to XDP program
+       `XDP_TX` action (bouncing). these packets are not counted by other
+       software counters. These packets are counted by physical port and vPort
+       counters.
+     - Informative
+
+   * - `rx[i]_xdp_tx_mpwqe`
+     - Number of multi-packet WQEs transmitted by the netdev and `XDP_TX`-ed by
+       the netdev during the RQ context.
+     - Acceleration
+
+   * - `rx[i]_xdp_tx_inlnw`
+     - Number of WQE data segments transmitted where the data could be inlined
+       in the WQE and then `XDP_TX`-ed during the RQ context.
+     - Acceleration
+
+   * - `rx[i]_xdp_tx_nops`
+     - Number of NOP WQEBBs (WQE building blocks) received posted to the XDP SQ.
+     - Acceleration
+
+   * - `rx[i]_xdp_tx_full`
+     - The number of packets that should have been forwarded back to the port
+       due to `XDP_TX` action but were dropped due to full tx queue. These packets
+       are not counted by other software counters. These packets are counted by
+       physical port and vPort counters. You may open more rx queues and spread
+       traffic rx over all queues and/or increase rx ring size.
+     - Error
+
+   * - `rx[i]_xdp_tx_err`
+     - The number of times an `XDP_TX` error such as frame too long and frame
+       too short occurred on `XDP_TX` ring of RX ring.
+     - Error
+
+   * - `rx[i]_xdp_tx_cqes` / `rx_xdp_tx_cqe` [#ring_global]_
+     - The number of completions received on the CQ of the `XDP_TX` ring.
+     - Informative
+
+   * - `rx[i]_xdp_drop`
+     - The number of packets dropped due to XDP program `XDP_DROP` action. these
+       packets are not counted by other software counters. These packets are
+       counted by physical port and vPort counters.
+     - Informative
+
+   * - `rx[i]_xdp_redirect`
+     - The number of times an XDP redirect action was triggered on ring i.
+     - Acceleration
+
+   * - `tx[i]_xdp_xmit`
+     - The number of packets redirected to the interface(due to XDP redirect).
+       These packets are not counted by other software counters. These packets
+       are counted by physical port and vPort counters.
+     - Informative
+
+   * - `tx[i]_xdp_full`
+     - The number of packets redirected to the interface(due to XDP redirect),
+       but were dropped due to full tx queue. these packets are not counted by
+       other software counters. you may enlarge tx queues.
+     - Informative
+
+   * - `tx[i]_xdp_mpwqe`
+     - Number of multi-packet WQEs offloaded onto the NIC that were
+       `XDP_REDIRECT`-ed from other netdevs.
+     - Acceleration
+
+   * - `tx[i]_xdp_inlnw`
+     - Number of WQE data segments where the data could be inlined in the WQE
+       where the data segments were `XDP_REDIRECT`-ed from other netdevs.
+     - Acceleration
+
+   * - `tx[i]_xdp_nops`
+     - Number of NOP WQEBBs (WQE building blocks) posted to the SQ that were
+       `XDP_REDIRECT`-ed from other netdevs.
+     - Acceleration
+
+   * - `tx[i]_xdp_err`
+     - The number of packets redirected to the interface(due to XDP redirect)
+       but were dropped due to error such as frame too long and frame too short.
+     - Error
+
+   * - `tx[i]_xdp_cqes`
+     - The number of completions received for packets redirected to the
+       interface(due to XDP redirect) on the CQ.
+     - Informative
+
+   * - `tx[i]_xsk_xmit`
+     - The number of packets transmitted using XSK zerocopy functionality.
+     - Acceleration
+
+   * - `tx[i]_xsk_mpwqe`
+     - Number of multi-packet WQEs offloaded onto the NIC that were
+       `XDP_REDIRECT`-ed from other netdevs.
+     - Acceleration
+
+   * - `tx[i]_xsk_inlnw`
+     - Number of WQE data segments where the data could be inlined in the WQE
+       that are transmitted using XSK zerocopy.
+     - Acceleration
+
+   * - `tx[i]_xsk_full`
+     - Number of times doorbell is rung in XSK zerocopy mode when SQ is full.
+     - Error
+
+   * - `tx[i]_xsk_err`
+     - Number of errors that occurred in XSK zerocopy mode such as if the data
+       size is larger than the MTU size.
+     - Error
+
+   * - `tx[i]_xsk_cqes`
+     - Number of CQEs processed in XSK zerocopy mode.
+     - Acceleration
+
+   * - `tx_tls_ctx`
+     - Number of TLS TX HW offload contexts added to device for encryption.
+     - Acceleration
+
+   * - `tx_tls_del`
+     - Number of TLS TX HW offload contexts removed from device (connection
+       closed).
+     - Acceleration
+
+   * - `tx_tls_pool_alloc`
+     - Number of times a unit of work is successfully allocated in the TLS HW
+       offload pool.
+     - Acceleration
+
+   * - `tx_tls_pool_free`
+     - Number of times a unit of work is freed in the TLS HW offload pool.
+     - Acceleration
+
+   * - `rx_tls_ctx`
+     - Number of TLS RX HW offload contexts added to device for decryption.
+     - Acceleration
+
+   * - `rx_tls_del`
+     - Number of TLS RX HW offload contexts deleted from device (connection has
+       finished).
+     - Acceleration
+
+   * - `rx[i]_tls_decrypted_packets`
+     - Number of successfully decrypted RX packets which were part of a TLS
+       stream.
+     - Acceleration
+
+   * - `rx[i]_tls_decrypted_bytes`
+     - Number of TLS payload bytes in RX packets which were successfully
+       decrypted.
+     - Acceleration
+
+   * - `rx[i]_tls_resync_req_pkt`
+     - Number of received TLS packets with a resync request.
+     - Acceleration
+
+   * - `rx[i]_tls_resync_req_start`
+     - Number of times the TLS async resync request was started.
+     - Acceleration
+
+   * - `rx[i]_tls_resync_req_end`
+     - Number of times the TLS async resync request properly ended with
+       providing the HW tracked tcp-seq.
+     - Acceleration
+
+   * - `rx[i]_tls_resync_req_skip`
+     - Number of times the TLS async resync request procedure was started but
+       not properly ended.
+     - Error
+
+   * - `rx[i]_tls_resync_res_ok`
+     - Number of times the TLS resync response call to the driver was
+       successfully handled.
+     - Acceleration
+
+   * - `rx[i]_tls_resync_res_retry`
+     - Number of times the TLS resync response call to the driver was
+       reattempted when ICOSQ is full.
+     - Error
+
+   * - `rx[i]_tls_resync_res_skip`
+     - Number of times the TLS resync response call to the driver was terminated
+       unsuccessfully.
+     - Error
+
+   * - `rx[i]_tls_err`
+     - Number of times when CQE TLS offload was problematic.
+     - Error
+
+   * - `tx[i]_tls_encrypted_packets`
+     - The number of send packets that are TLS encrypted by the kernel.
+     - Acceleration
+
+   * - `tx[i]_tls_encrypted_bytes`
+     - The number of send bytes that are TLS encrypted by the kernel.
+     - Acceleration
+
+   * - `tx[i]_tls_ooo`
+     - Number of times out of order TLS SQE fragments were handled on ring i.
+     - Acceleration
+
+   * - `tx[i]_tls_dump_packets`
+     - Number of TLS decrypted packets copied over from NIC over DMA.
+     - Acceleration
+
+   * - `tx[i]_tls_dump_bytes`
+     - Number of TLS decrypted bytes copied over from NIC over DMA.
+     - Acceleration
+
+   * - `tx[i]_tls_resync_bytes`
+     - Number of TLS bytes requested to be resynchronized in order to be
+       decrypted.
+     - Acceleration
+
+   * - `tx[i]_tls_skip_no_sync_data`
+     - Number of TLS send data that can safely be skipped / do not need to be
+       decrypted.
+     - Acceleration
+
+   * - `tx[i]_tls_drop_no_sync_data`
+     - Number of TLS send data that were dropped due to retransmission of TLS
+       data.
+     - Acceleration
+
+   * - `ptp_cq[i]_abort`
+     - Number of times a CQE has to be skipped in precision time protocol due to
+       a skew between the port timestamp and CQE timestamp being greater than
+       128 seconds.
+     - Error
+
+   * - `ptp_cq[i]_abort_abs_diff_ns`
+     - Accumulation of time differences between the port timestamp and CQE
+       timestamp when the difference is greater than 128 seconds in precision
+       time protocol.
+     - Error
+
+   * - `ptp_cq[i]_late_cqe`
+     - Number of times a CQE has been delivered on the PTP timestamping CQ when
+       the CQE was not expected since a certain amount of time had elapsed where
+       the device typically ensures not posting the CQE.
+     - Error
+
+   * - `ptp_cq[i]_lost_cqe`
+     - Number of times a CQE is expected to not be delivered on the PTP
+       timestamping CQE by the device due to a time delta elapsing. If such a
+       CQE is somehow delivered, `ptp_cq[i]_late_cqe` is incremented.
+     - Error
+
+.. [#ring_global] The corresponding ring and global counters do not share the
+                  same name (i.e. do not follow the common naming scheme).
+
+vPort Counters
+--------------
+Counters on the NIC port that is connected to a eSwitch.
+
+.. flat-table:: vPort Counter Table
+   :widths: 2 3 1
+
+   * - Counter
+     - Description
+     - Type
+
+   * - `rx_vport_unicast_packets`
+     - Unicast packets received, steered to a port including Raw Ethernet
+       QP/DPDK traffic, excluding RDMA traffic.
+     - Informative
+
+   * - `rx_vport_unicast_bytes`
+     - Unicast bytes received, steered to a port including Raw Ethernet QP/DPDK
+       traffic, excluding RDMA traffic.
+     - Informative
+
+   * - `tx_vport_unicast_packets`
+     - Unicast packets transmitted, steered from a port including Raw Ethernet
+       QP/DPDK traffic, excluding RDMA traffic.
+     - Informative
+
+   * - `tx_vport_unicast_bytes`
+     - Unicast bytes transmitted, steered from a port including Raw Ethernet
+       QP/DPDK traffic, excluding RDMA traffic.
+     - Informative
+
+   * - `rx_vport_multicast_packets`
+     - Multicast packets received, steered to a port including Raw Ethernet
+       QP/DPDK traffic, excluding RDMA traffic.
+     - Informative
+
+   * - `rx_vport_multicast_bytes`
+     - Multicast bytes received, steered to a port including Raw Ethernet
+       QP/DPDK traffic, excluding RDMA traffic.
+     - Informative
+
+   * - `tx_vport_multicast_packets`
+     - Multicast packets transmitted, steered from a port including Raw Ethernet
+       QP/DPDK traffic, excluding RDMA traffic.
+     - Informative
+
+   * - `tx_vport_multicast_bytes`
+     - Multicast bytes transmitted, steered from a port including Raw Ethernet
+       QP/DPDK traffic, excluding RDMA traffic.
+     - Informative
+
+   * - `rx_vport_broadcast_packets`
+     - Broadcast packets received, steered to a port including Raw Ethernet
+       QP/DPDK traffic, excluding RDMA traffic.
+     - Informative
+
+   * - `rx_vport_broadcast_bytes`
+     - Broadcast bytes received, steered to a port including Raw Ethernet
+       QP/DPDK traffic, excluding RDMA traffic.
+     - Informative
+
+   * - `tx_vport_broadcast_packets`
+     - Broadcast packets transmitted, steered from a port including Raw Ethernet
+       QP/DPDK traffic, excluding RDMA traffic.
+     - Informative
+
+   * - `tx_vport_broadcast_bytes`
+     - Broadcast bytes transmitted, steered from a port including Raw Ethernet
+       QP/DPDK traffic, excluding RDMA traffic.
+     - Informative
+
+   * - `rx_vport_rdma_unicast_packets`
+     - RDMA unicast packets received, steered to a port (counters counts
+       RoCE/UD/RC traffic) [#accel]_.
+     - Acceleration
+
+   * - `rx_vport_rdma_unicast_bytes`
+     - RDMA unicast bytes received, steered to a port (counters counts
+       RoCE/UD/RC traffic) [#accel]_.
+     - Acceleration
+
+   * - `tx_vport_rdma_unicast_packets`
+     - RDMA unicast packets transmitted, steered from a port (counters counts
+       RoCE/UD/RC traffic) [#accel]_.
+     - Acceleration
+
+   * - `tx_vport_rdma_unicast_bytes`
+     - RDMA unicast bytes transmitted, steered from a port (counters counts
+       RoCE/UD/RC traffic) [#accel]_.
+     - Acceleration
+
+   * - `rx_vport_rdma_multicast_packets`
+     - RDMA multicast packets received, steered to a port (counters counts
+       RoCE/UD/RC traffic) [#accel]_.
+     - Acceleration
+
+   * - `rx_vport_rdma_multicast_bytes`
+     - RDMA multicast bytes received, steered to a port (counters counts
+       RoCE/UD/RC traffic) [#accel]_.
+     - Acceleration
+
+   * - `tx_vport_rdma_multicast_packets`
+     - RDMA multicast packets transmitted, steered from a port (counters counts
+       RoCE/UD/RC traffic) [#accel]_.
+     - Acceleration
+
+   * - `tx_vport_rdma_multicast_bytes`
+     - RDMA multicast bytes transmitted, steered from a port (counters counts
+       RoCE/UD/RC traffic) [#accel]_.
+     - Acceleration
+
+   * - `vport_loopback_packets`
+     - Unicast, multicast and broadcast packets that were loop-back (received
+       and transmitted), IB/Eth  [#accel]_.
+     - Acceleration
+
+   * - `vport_loopback_bytes`
+     - Unicast, multicast and broadcast bytes that were loop-back (received
+       and transmitted), IB/Eth  [#accel]_.
+     - Acceleration
+
+   * - `rx_steer_missed_packets`
+     - Number of packets that was received by the NIC, however was discarded
+       because it did not match any flow in the NIC flow table.
+     - Error
+
+   * - `rx_packets`
+     - Representor only: packets received, that were handled by the hypervisor.
+     - Informative
+
+   * - `rx_bytes`
+     - Representor only: bytes received, that were handled by the hypervisor.
+     - Informative
+
+   * - `tx_packets`
+     - Representor only: packets transmitted, that were handled by the
+       hypervisor.
+     - Informative
+
+   * - `tx_bytes`
+     - Representor only: bytes transmitted, that were handled by the hypervisor.
+     - Informative
+
+   * - `dev_internal_queue_oob`
+     - The number of dropped packets due to lack of receive WQEs for an internal
+       device RQ.
+     - Error
+
+Physical Port Counters
+----------------------
+The physical port counters are the counters on the external port connecting the
+adapter to the network. This measuring point holds information on standardized
+counters like IEEE 802.3, RFC2863, RFC 2819, RFC 3635 and additional counters
+like flow control, FEC and more.
+
+.. flat-table:: Physical Port Counter Table
+   :widths: 2 3 1
+
+   * - Counter
+     - Description
+     - Type
+
+   * - `rx_packets_phy`
+     - The number of packets received on the physical port. This counter doesn’t
+       include packets that were discarded due to FCS, frame size and similar
+       errors.
+     - Informative
+
+   * - `tx_packets_phy`
+     - The number of packets transmitted on the physical port.
+     - Informative
+
+   * - `rx_bytes_phy`
+     - The number of bytes received on the physical port, including Ethernet
+       header and FCS.
+     - Informative
+
+   * - `tx_bytes_phy`
+     - The number of bytes transmitted on the physical port.
+     - Informative
+
+   * - `rx_multicast_phy`
+     - The number of multicast packets received on the physical port.
+     - Informative
+
+   * - `tx_multicast_phy`
+     - The number of multicast packets transmitted on the physical port.
+     - Informative
+
+   * - `rx_broadcast_phy`
+     - The number of broadcast packets received on the physical port.
+     - Informative
+
+   * - `tx_broadcast_phy`
+     - The number of broadcast packets transmitted on the physical port.
+     - Informative
+
+   * - `rx_crc_errors_phy`
+     - The number of dropped received packets due to FCS (Frame Check Sequence)
+       error on the physical port. If this counter is increased in high rate,
+       check the link quality using `rx_symbol_error_phy` and
+       `rx_corrected_bits_phy` counters below.
+     - Error
+
+   * - `rx_in_range_len_errors_phy`
+     - The number of received packets dropped due to length/type errors on a
+       physical port.
+     - Error
+
+   * - `rx_out_of_range_len_phy`
+     - The number of received packets dropped due to length greater than allowed
+       on a physical port. If this counter is increasing, it implies that the
+       peer connected to the adapter has a larger MTU configured. Using same MTU
+       configuration shall resolve this issue.
+     - Error
+
+   * - `rx_oversize_pkts_phy`
+     - The number of dropped received packets due to length which exceed MTU
+       size on a physical port. If this counter is increasing, it implies that
+       the peer connected to the adapter has a larger MTU configured. Using same
+       MTU configuration shall resolve this issue.
+     - Error
+
+   * - `rx_symbol_err_phy`
+     - The number of received packets dropped due to physical coding errors
+       (symbol errors) on a physical port.
+     - Error
+
+   * - `rx_mac_control_phy`
+     - The number of MAC control packets received on the physical port.
+     - Informative
+
+   * - `tx_mac_control_phy`
+     - The number of MAC control packets transmitted on the physical port.
+     - Informative
+
+   * - `rx_pause_ctrl_phy`
+     - The number of link layer pause packets received on a physical port. If
+       this counter is increasing, it implies that the network is congested and
+       cannot absorb the traffic coming from to the adapter.
+     - Informative
+
+   * - `tx_pause_ctrl_phy`
+     - The number of link layer pause packets transmitted on a physical port. If
+       this counter is increasing, it implies that the NIC is congested and
+       cannot absorb the traffic coming from the network.
+     - Informative
+
+   * - `rx_unsupported_op_phy`
+     - The number of MAC control packets received with unsupported opcode on a
+       physical port.
+     - Error
+
+   * - `rx_discards_phy`
+     - The number of received packets dropped due to lack of buffers on a
+       physical port. If this counter is increasing, it implies that the adapter
+       is congested and cannot absorb the traffic coming from the network.
+     - Error
+
+   * - `tx_discards_phy`
+     - The number of packets which were discarded on transmission, even no
+       errors were detected. the drop might occur due to link in down state,
+       head of line drop, pause from the network, etc.
+     - Error
+
+   * - `tx_errors_phy`
+     - The number of transmitted packets dropped due to a length which exceed
+       MTU size on a physical port.
+     - Error
+
+   * - `rx_undersize_pkts_phy`
+     - The number of received packets dropped due to length which is shorter
+       than 64 bytes on a physical port. If this counter is increasing, it
+       implies that the peer connected to the adapter has a non-standard MTU
+       configured or malformed packet had arrived.
+     - Error
+
+   * - `rx_fragments_phy`
+     - The number of received packets dropped due to a length which is shorter
+       than 64 bytes and has FCS error on a physical port. If this counter is
+       increasing, it implies that the peer connected to the adapter has a
+       non-standard MTU configured.
+     - Error
+
+   * - `rx_jabbers_phy`
+     - The number of received packets d due to a length which is longer than 64
+       bytes and had FCS error on a physical port.
+     - Error
+
+   * - `rx_64_bytes_phy`
+     - The number of packets received on the physical port with size of 64 bytes.
+     - Informative
+
+   * - `rx_65_to_127_bytes_phy`
+     - The number of packets received on the physical port with size of 65 to
+       127 bytes.
+     - Informative
+
+   * - `rx_128_to_255_bytes_phy`
+     - The number of packets received on the physical port with size of 128 to
+       255 bytes.
+     - Informative
+
+   * - `rx_256_to_511_bytes_phy`
+     - The number of packets received on the physical port with size of 256 to
+       512 bytes.
+     - Informative
+
+   * - `rx_512_to_1023_bytes_phy`
+     - The number of packets received on the physical port with size of 512 to
+       1023 bytes.
+     - Informative
+
+   * - `rx_1024_to_1518_bytes_phy`
+     - The number of packets received on the physical port with size of 1024 to
+       1518 bytes.
+     - Informative
+
+   * - `rx_1519_to_2047_bytes_phy`
+     - The number of packets received on the physical port with size of 1519 to
+       2047 bytes.
+     - Informative
+
+   * - `rx_2048_to_4095_bytes_phy`
+     - The number of packets received on the physical port with size of 2048 to
+       4095 bytes.
+     - Informative
+
+   * - `rx_4096_to_8191_bytes_phy`
+     - The number of packets received on the physical port with size of 4096 to
+       8191 bytes.
+     - Informative
+
+   * - `rx_8192_to_10239_bytes_phy`
+     - The number of packets received on the physical port with size of 8192 to
+       10239 bytes.
+     - Informative
+
+   * - `link_down_events_phy`
+     - The number of times where the link operative state changed to down. In
+       case this counter is increasing it may imply on port flapping. You may
+       need to replace the cable/transceiver.
+     - Error
+
+   * - `total_success_recovery_phy`
+     - The number of total successful recovery events of any type during
+       ports reset cycle.
+     - Error
+
+   * - `rx_out_of_buffer`
+     - Number of times receive queue had no software buffers allocated for the
+       adapter's incoming traffic.
+     - Error
+
+   * - `module_bus_stuck`
+     - The number of times that module's I\ :sup:`2`\C bus (data or clock)
+       short-wire was detected. You may need to replace the cable/transceiver.
+     - Error
+
+   * - `module_high_temp`
+     - The number of times that the module temperature was too high. If this
+       issue persist, you may need to check the ambient temperature or replace
+       the cable/transceiver module.
+     - Error
+
+   * - `module_bad_shorted`
+     - The number of times that the module cables were shorted. You may need to
+       replace the cable/transceiver module.
+     - Error
+
+   * - `module_unplug`
+     - The number of times that module was ejected.
+     - Informative
+
+   * - `rx_buffer_passed_thres_phy`
+     - The number of events where the port receive buffer was over 85% full.
+     - Informative
+
+   * - `tx_pause_storm_warning_events`
+     - The number of times the device was sending pauses for a long period of
+       time.
+     - Informative
+
+   * - `tx_pause_storm_error_events`
+     - The number of times the device was sending pauses for a long period of
+       time, reaching time out and disabling transmission of pause frames. on
+       the period where pause frames were disabled, drop could have been
+       occurred.
+     - Error
+
+   * - `rx[i]_buff_alloc_err`
+     - Failed to allocate a buffer to received packet (or SKB) on ring i.
+     - Error
+
+   * - `rx_bits_phy`
+     - This counter provides information on the total amount of traffic that
+       could have been received and can be used as a guideline to measure the
+       ratio of errored traffic in `rx_pcs_symbol_err_phy` and
+       `rx_corrected_bits_phy`.
+     - Informative
+
+   * - `rx_pcs_symbol_err_phy`
+     - This counter counts the number of symbol errors that wasn’t corrected by
+       FEC correction algorithm or that FEC algorithm was not active on this
+       interface. If this counter is increasing, it implies that the link
+       between the NIC and the network is suffering from high BER, and that
+       traffic is lost. You may need to replace the cable/transceiver. The error
+       rate is the number of `rx_pcs_symbol_err_phy` divided by the number of
+       `rx_bits_phy` on a specific time frame.
+     - Error
+
+   * - `rx_corrected_bits_phy`
+     - The number of corrected bits on this port according to active FEC
+       (RS/FC). If this counter is increasing, it implies that the link between
+       the NIC and the network is suffering from high BER. The corrected bit
+       rate is the number of `rx_corrected_bits_phy` divided by the number of
+       `rx_bits_phy` on a specific time frame.
+     - Error
+
+   * - `rx_err_lane_[l]_phy`
+     - This counter counts the number of physical raw errors per lane l index.
+       The counter counts errors before FEC corrections. If this counter is
+       increasing, it implies that the link between the NIC and the network is
+       suffering from high BER, and that traffic might be lost. You may need to
+       replace the cable/transceiver. Please check in accordance with
+       `rx_corrected_bits_phy`.
+     - Error
+
+   * - `rx_global_pause`
+     - The number of pause packets received on the physical port. If this
+       counter is increasing, it implies that the network is congested and
+       cannot absorb the traffic coming from the adapter. Note: This counter is
+       only enabled when global pause mode is enabled.
+     - Informative
+
+   * - `rx_global_pause_duration`
+     - The duration of pause received (in microSec) on the physical port. The
+       counter represents the time the port did not send any traffic. If this
+       counter is increasing, it implies that the network is congested and
+       cannot absorb the traffic coming from the adapter. Note: This counter is
+       only enabled when global pause mode is enabled.
+     - Informative
+
+   * - `tx_global_pause`
+     - The number of pause packets transmitted on a physical port. If this
+       counter is increasing, it implies that the adapter is congested and
+       cannot absorb the traffic coming from the network. Note: This counter is
+       only enabled when global pause mode is enabled.
+     - Informative
+
+   * - `tx_global_pause_duration`
+     - The duration of pause transmitter (in microSec) on the physical port.
+       Note: This counter is only enabled when global pause mode is enabled.
+     - Informative
+
+   * - `rx_global_pause_transition`
+     - The number of times a transition from Xoff to Xon on the physical port
+       has occurred. Note: This counter is only enabled when global pause mode
+       is enabled.
+     - Informative
+
+   * - `rx_if_down_packets`
+     - The number of received packets that were dropped due to interface down.
+     - Informative
+
+Priority Port Counters
+----------------------
+The following counters are physical port counters that are counted per L2
+priority (0-7).
+
+**Note:** `p` in the counter name represents the priority.
+
+.. flat-table:: Priority Port Counter Table
+   :widths: 2 3 1
+
+   * - Counter
+     - Description
+     - Type
+
+   * - `rx_prio[p]_bytes`
+     - The number of bytes received with priority p on the physical port.
+     - Informative
+
+   * - `rx_prio[p]_packets`
+     - The number of packets received with priority p on the physical port.
+     - Informative
+
+   * - `tx_prio[p]_bytes`
+     - The number of bytes transmitted on priority p on the physical port.
+     - Informative
+
+   * - `tx_prio[p]_packets`
+     - The number of packets transmitted on priority p on the physical port.
+     - Informative
+
+   * - `rx_prio[p]_pause`
+     - The number of pause packets received with priority p on a physical port.
+       If this counter is increasing, it implies that the network is congested
+       and cannot absorb the traffic coming from the adapter. Note: This counter
+       is available only if PFC was enabled on priority p.
+     - Informative
+
+   * - `rx_prio[p]_pause_duration`
+     - The duration of pause received (in microSec) on priority p on the
+       physical port. The counter represents the time the port did not send any
+       traffic on this priority. If this counter is increasing, it implies that
+       the network is congested and cannot absorb the traffic coming from the
+       adapter. Note: This counter is available only if PFC was enabled on
+       priority p.
+     - Informative
+
+   * - `rx_prio[p]_pause_transition`
+     - The number of times a transition from Xoff to Xon on priority p on the
+       physical port has occurred. Note: This counter is available only if PFC
+       was enabled on priority p.
+     - Informative
+
+   * - `tx_prio[p]_pause`
+     - The number of pause packets transmitted on priority p on a physical port.
+       If this counter is increasing, it implies that the adapter is congested
+       and cannot absorb the traffic coming from the network. Note: This counter
+       is available only if PFC was enabled on priority p.
+     - Informative
+
+   * - `tx_prio[p]_pause_duration`
+     - The duration of pause transmitter (in microSec) on priority p on the
+       physical port. Note: This counter is available only if PFC was enabled on
+       priority p.
+     - Informative
+
+   * - `rx_prio[p]_buf_discard`
+     - The number of packets discarded by device due to lack of per host receive
+       buffers.
+     - Informative
+
+   * - `rx_prio[p]_cong_discard`
+     - The number of packets discarded by device due to per host congestion.
+     - Informative
+
+   * - `rx_prio[p]_marked`
+     - The number of packets ecn marked by device due to per host congestion.
+     - Informative
+
+   * - `rx_prio[p]_discards`
+     - The number of packets discarded by device due to lack of receive buffers.
+     - Informative
+
+Device Counters
+---------------
+.. flat-table:: Device Counter Table
+   :widths: 2 3 1
+
+   * - Counter
+     - Description
+     - Type
+
+   * - `rx_pci_signal_integrity`
+     - Counts physical layer PCIe signal integrity errors, the number of
+       transitions to recovery due to Framing errors and CRC (dlp and tlp). If
+       this counter is raising, try moving the adapter card to a different slot
+       to rule out a bad PCI slot. Validate that you are running with the latest
+       firmware available and latest server BIOS version.
+     - Error
+
+   * - `tx_pci_signal_integrity`
+     - Counts physical layer PCIe signal integrity errors, the number of
+       transition to recovery initiated by the other side (moving to recovery
+       due to getting TS/EIEOS). If this counter is raising, try moving the
+       adapter card to a different slot to rule out a bad PCI slot. Validate
+       that you are running with the latest firmware available and latest server
+       BIOS version.
+     - Error
+
+   * - `outbound_pci_buffer_overflow`
+     - The number of packets dropped due to pci buffer overflow. If this counter
+       is raising in high rate, it might indicate that the receive traffic rate
+       for a host is larger than the PCIe bus and therefore a congestion occurs.
+     - Informative
+
+   * - `outbound_pci_stalled_rd`
+     - The percentage (in the range 0...100) of time within the last second that
+       the NIC had outbound non-posted reads requests but could not perform the
+       operation due to insufficient posted credits.
+     - Informative
+
+   * - `outbound_pci_stalled_wr`
+     - The percentage (in the range 0...100) of time within the last second that
+       the NIC had outbound posted writes requests but could not perform the
+       operation due to insufficient posted credits.
+     - Informative
+
+   * - `outbound_pci_stalled_rd_events`
+     - The number of seconds where `outbound_pci_stalled_rd` was above 30%.
+     - Informative
+
+   * - `outbound_pci_stalled_wr_events`
+     - The number of seconds where `outbound_pci_stalled_wr` was above 30%.
+     - Informative
+
+   * - `dev_out_of_buffer`
+     - The number of times the device owned queue had not enough buffers
+       allocated.
+     - Error
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst
new file mode 100644
index 000000000000..581a91caa579
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst
@@ -0,0 +1,25 @@
+.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+.. include:: <isonum.txt>
+
+Mellanox ConnectX(R) mlx5 core VPI Network Driver
+=================================================
+
+:Copyright: |copy| 2019, Mellanox Technologies LTD.
+:Copyright: |copy| 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+Contents:
+
+.. toctree::
+   :maxdepth: 2
+
+   kconfig
+   switchdev
+   tracepoints
+   counters
+
+.. only::  subproject and html
+
+   Indices
+   =======
+
+   * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst
new file mode 100644
index 000000000000..34e911480108
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst
@@ -0,0 +1,171 @@
+.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+.. include:: <isonum.txt>
+
+=======================================
+Enabling the driver and kconfig options
+=======================================
+
+:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+| mlx5 core is modular and most of the major mlx5 core driver features can be selected (compiled in/out)
+| at build time via kernel Kconfig flags.
+| Basic features, ethernet net device rx/tx offloads and XDP, are available with the most basic flags
+| CONFIG_MLX5_CORE=y/m and CONFIG_MLX5_CORE_EN=y.
+| For the list of advanced features, please see below.
+
+**CONFIG_MLX5_BRIDGE=(y/n)**
+
+|    Enable :ref:`Ethernet Bridging (BRIDGE) offloading support <mlx5_bridge_offload>`.
+|    This will provide the ability to add representors of mlx5 uplink and VF
+|    ports to Bridge and offloading rules for traffic between such ports.
+|    Supports VLANs (trunk and access modes).
+
+
+**CONFIG_MLX5_CORE=(y/m/n)** (module mlx5_core.ko)
+
+|    The driver can be enabled by choosing CONFIG_MLX5_CORE=y/m in kernel config.
+|    This will provide mlx5 core driver for mlx5 ulps to interface with (mlx5e, mlx5_ib).
+
+
+**CONFIG_MLX5_CORE_EN=(y/n)**
+
+|    Choosing this option will allow basic ethernet netdevice support with all of the standard rx/tx offloads.
+|    mlx5e is the mlx5 ulp driver which provides netdevice kernel interface, when chosen, mlx5e will be
+|    built-in into mlx5_core.ko.
+
+
+**CONFIG_MLX5_CORE_EN_DCB=(y/n)**:
+
+|    Enables `Data Center Bridging (DCB) Support <https://enterprise-support.nvidia.com/s/article/howto-auto-config-pfc-and-ets-on-connectx-4-via-lldp-dcbx>`_.
+
+
+**CONFIG_MLX5_CORE_IPOIB=(y/n)**
+
+|    IPoIB offloads & acceleration support.
+|    Requires CONFIG_MLX5_CORE_EN to provide an accelerated interface for the rdma
+|    IPoIB ulp netdevice.
+
+
+**CONFIG_MLX5_CLS_ACT=(y/n)**
+
+|    Enables offload support for TC classifier action (NET_CLS_ACT).
+|    Works in both native NIC mode and Switchdev SRIOV mode.
+|    Flow-based classifiers, such as those registered through
+|    `tc-flower(8)`, are processed by the device, rather than the
+|    host. Actions that would then overwrite matching classification
+|    results would then be instant due to the offload.
+
+
+**CONFIG_MLX5_EN_ARFS=(y/n)**
+
+|    Enables Hardware-accelerated receive flow steering (arfs) support, and ntuple filtering.
+|    https://enterprise-support.nvidia.com/s/article/howto-configure-arfs-on-connectx-4
+
+
+**CONFIG_MLX5_EN_IPSEC=(y/n)**
+
+|    Enables :ref:`IPSec XFRM cryptography-offload acceleration <xfrm_device>`.
+
+
+**CONFIG_MLX5_MACSEC=(y/n)**
+
+|    Build support for MACsec cryptography-offload acceleration in the NIC.
+
+
+**CONFIG_MLX5_EN_RXNFC=(y/n)**
+
+|    Enables ethtool receive network flow classification, which allows user defined
+|    flow rules to direct traffic into arbitrary rx queue via ethtool set/get_rxnfc API.
+
+
+**CONFIG_MLX5_EN_TLS=(y/n)**
+
+|    TLS cryptography-offload acceleration.
+
+
+**CONFIG_MLX5_ESWITCH=(y/n)**
+
+|    Ethernet SRIOV E-Switch support in ConnectX NIC. E-Switch provides internal SRIOV packet steering
+|    and switching for the enabled VFs and PF in two available modes:
+|           1) `Legacy SRIOV mode (L2 mac vlan steering based) <https://enterprise-support.nvidia.com/s/article/HowTo-Configure-SR-IOV-for-ConnectX-4-ConnectX-5-ConnectX-6-with-KVM-Ethernet>`_.
+|           2) :ref:`Switchdev mode (eswitch offloads) <switchdev>`.
+
+
+**CONFIG_MLX5_FPGA=(y/n)**
+
+|    Build support for the Innova family of network cards by Mellanox Technologies.
+|    Innova network cards are comprised of a ConnectX chip and an FPGA chip on one board.
+|    If you select this option, the mlx5_core driver will include the Innova FPGA core and allow
+|    building sandbox-specific client drivers.
+
+
+**CONFIG_MLX5_INFINIBAND=(y/n/m)** (module mlx5_ib.ko)
+
+|    Provides low-level InfiniBand/RDMA and `RoCE <https://enterprise-support.nvidia.com/s/article/recommended-network-configuration-examples-for-roce-deployment>`_ support.
+
+
+**CONFIG_MLX5_MPFS=(y/n)**
+
+|    Ethernet Multi-Physical Function Switch (MPFS) support in ConnectX NIC.
+|    MPFs is required for when `Multi-Host <https://www.nvidia.com/en-us/networking/multi-host/>`_ configuration is enabled to allow passing
+|    user configured unicast MAC addresses to the requesting PF.
+
+
+**CONFIG_MLX5_SF=(y/n)**
+
+|    Build support for subfunction.
+|    Subfunctons are more light weight than PCI SRIOV VFs. Choosing this option
+|    will enable support for creating subfunction devices.
+
+
+**CONFIG_MLX5_SF_MANAGER=(y/n)**
+
+|    Build support for subfuction port in the NIC. A Mellanox subfunction
+|    port is managed through devlink.  A subfunction supports RDMA, netdevice
+|    and vdpa device. It is similar to a SRIOV VF but it doesn't require
+|    SRIOV support.
+
+
+**CONFIG_MLX5_SW_STEERING=(y/n)**
+
+|    Build support for software-managed steering in the NIC.
+
+**CONFIG_MLX5_HW_STEERING=(y/n)**
+
+|    Build support for hardware-managed steering in the NIC.
+
+**CONFIG_MLX5_TC_CT=(y/n)**
+
+|    Support offloading connection tracking rules via tc ct action.
+
+
+**CONFIG_MLX5_TC_SAMPLE=(y/n)**
+
+|    Support offloading sample rules via tc sample action.
+
+
+**CONFIG_MLX5_VDPA=(y/n)**
+
+|    Support library for Mellanox VDPA drivers. Provides code that is
+|    common for all types of VDPA drivers. The following drivers are planned:
+|    net, block.
+
+
+**CONFIG_MLX5_VDPA_NET=(y/n)**
+
+|    VDPA network driver for ConnectX6 and newer. Provides offloading
+|    of virtio net datapath such that descriptors put on the ring will
+|    be executed by the hardware. It also supports a variety of stateless
+|    offloads depending on the actual device used and firmware version.
+
+
+**CONFIG_MLX5_VFIO_PCI=(y/n)**
+
+|    This provides migration support for MLX5 devices using the VFIO framework.
+
+
+**External options** ( Choose if the corresponding mlx5 feature is required )
+
+- CONFIG_MLXFW: When chosen, mlx5 firmware flashing support will be enabled (via devlink and ethtool).
+- CONFIG_PTP_1588_CLOCK: When chosen, mlx5 ptp support will be enabled
+- CONFIG_VXLAN: When chosen, mlx5 vxlan support will be enabled.
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst
new file mode 100644
index 000000000000..b617e93d7c2c
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst
@@ -0,0 +1,281 @@
+.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+.. include:: <isonum.txt>
+
+=========
+Switchdev
+=========
+
+:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+.. _mlx5_bridge_offload:
+
+Bridge offload
+==============
+
+The mlx5 driver implements support for offloading bridge rules when in switchdev
+mode. Linux bridge FDBs are automatically offloaded when mlx5 switchdev
+representor is attached to bridge.
+
+- Change device to switchdev mode::
+
+    $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
+
+- Attach mlx5 switchdev representor 'enp8s0f0' to bridge netdev 'bridge1'::
+
+    $ ip link set enp8s0f0 master bridge1
+
+VLANs
+-----
+
+Following bridge VLAN functions are supported by mlx5:
+
+- VLAN filtering (including multiple VLANs per port)::
+
+    $ ip link set bridge1 type bridge vlan_filtering 1
+    $ bridge vlan add dev enp8s0f0 vid 2-3
+
+- VLAN push on bridge ingress::
+
+    $ bridge vlan add dev enp8s0f0 vid 3 pvid
+
+- VLAN pop on bridge egress::
+
+    $ bridge vlan add dev enp8s0f0 vid 3 untagged
+
+Subfunction
+===========
+
+Subfunction which are spawned over the E-switch are created only with devlink
+device, and by default all the SF auxiliary devices are disabled.
+This will allow user to configure the SF before the SF have been fully probed,
+which will save time.
+
+Usage example:
+
+- Create SF::
+
+    $ devlink port add pci/0000:08:00.0 flavour pcisf pfnum 0 sfnum 11
+    $ devlink port function set pci/0000:08:00.0/32768 hw_addr 00:00:00:00:00:11 state active
+
+- Enable ETH auxiliary device::
+
+    $ devlink dev param set auxiliary/mlx5_core.sf.1 name enable_eth value true cmode driverinit
+
+- Now, in order to fully probe the SF, use devlink reload::
+
+    $ devlink dev reload auxiliary/mlx5_core.sf.1
+
+mlx5 supports ETH,rdma and vdpa (vnet) auxiliary devices devlink params (see :ref:`Documentation/networking/devlink/devlink-params.rst <devlink_params_generic>`).
+
+mlx5 supports subfunction management using devlink port (see :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`) interface.
+
+A subfunction has its own function capabilities and its own resources. This
+means a subfunction has its own dedicated queues (txq, rxq, cq, eq). These
+queues are neither shared nor stolen from the parent PCI function.
+
+When a subfunction is RDMA capable, it has its own QP1, GID table, and RDMA
+resources neither shared nor stolen from the parent PCI function.
+
+A subfunction has a dedicated window in PCI BAR space that is not shared
+with the other subfunctions or the parent PCI function. This ensures that all
+devices (netdev, rdma, vdpa, etc.) of the subfunction accesses only assigned
+PCI BAR space.
+
+A subfunction supports eswitch representation through which it supports tc
+offloads. The user configures eswitch to send/receive packets from/to
+the subfunction port.
+
+Subfunctions share PCI level resources such as PCI MSI-X IRQs with
+other subfunctions and/or with its parent PCI function.
+
+Example mlx5 software, system, and device view::
+
+       _______
+      | admin |
+      | user  |----------
+      |_______|         |
+          |             |
+      ____|____       __|______            _________________
+     |         |     |         |          |                 |
+     | devlink |     | tc tool |          |    user         |
+     | tool    |     |_________|          | applications    |
+     |_________|         |                |_________________|
+           |             |                   |          |
+           |             |                   |          |         Userspace
+ +---------|-------------|-------------------|----------|--------------------+
+           |             |           +----------+   +----------+   Kernel
+           |             |           |  netdev  |   | rdma dev |
+           |             |           +----------+   +----------+
+   (devlink port add/del |              ^               ^
+    port function set)   |              |               |
+           |             |              +---------------|
+      _____|___          |              |        _______|_______
+     |         |         |              |       | mlx5 class    |
+     | devlink |   +------------+       |       |   drivers     |
+     | kernel  |   | rep netdev |       |       |(mlx5_core,ib) |
+     |_________|   +------------+       |       |_______________|
+           |             |              |               ^
+   (devlink ops)         |              |          (probe/remove)
+  _________|________     |              |           ____|________
+ | subfunction      |    |     +---------------+   | subfunction |
+ | management driver|-----     | subfunction   |---|  driver     |
+ | (mlx5_core)      |          | auxiliary dev |   | (mlx5_core) |
+ |__________________|          +---------------+   |_____________|
+           |                                            ^
+  (sf add/del, vhca events)                             |
+           |                                      (device add/del)
+      _____|____                                    ____|________
+     |          |                                  | subfunction |
+     |  PCI NIC |--- activate/deactivate events--->| host driver |
+     |__________|                                  | (mlx5_core) |
+                                                   |_____________|
+
+Subfunction is created using devlink port interface.
+
+- Change device to switchdev mode::
+
+    $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
+
+- Add a devlink port of subfunction flavour::
+
+    $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
+    pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+      function:
+        hw_addr 00:00:00:00:00:00 state inactive opstate detached
+
+- Show a devlink port of the subfunction::
+
+    $ devlink port show pci/0000:06:00.0/32768
+    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
+      function:
+        hw_addr 00:00:00:00:00:00 state inactive opstate detached
+
+- Delete a devlink port of subfunction after use::
+
+    $ devlink port del pci/0000:06:00.0/32768
+
+Function attributes
+===================
+
+The mlx5 driver provides a mechanism to setup PCI VF/SF function attributes in
+a unified way for SmartNIC and non-SmartNIC.
+
+This is supported only when the eswitch mode is set to switchdev. Port function
+configuration of the PCI VF/SF is supported through devlink eswitch port.
+
+Port function attributes should be set before PCI VF/SF is enumerated by the
+driver.
+
+MAC address setup
+-----------------
+
+mlx5 driver support devlink port function attr mechanism to setup MAC
+address. (refer to Documentation/networking/devlink/devlink-port.rst)
+
+RoCE capability setup
+~~~~~~~~~~~~~~~~~~~~~
+Not all mlx5 PCI devices/SFs require RoCE capability.
+
+When RoCE capability is disabled, it saves 1 Mbytes worth of system memory per
+PCI devices/SF.
+
+mlx5 driver support devlink port function attr mechanism to setup RoCE
+capability. (refer to Documentation/networking/devlink/devlink-port.rst)
+
+migratable capability setup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+User who wants mlx5 PCI VFs to be able to perform live migration need to
+explicitly enable the VF migratable capability.
+
+mlx5 driver support devlink port function attr mechanism to setup migratable
+capability. (refer to Documentation/networking/devlink/devlink-port.rst)
+
+IPsec crypto capability setup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+User who wants mlx5 PCI VFs to be able to perform IPsec crypto offloading need
+to explicitly enable the VF ipsec_crypto capability. Enabling IPsec capability
+for VFs is supported starting with ConnectX6dx devices and above. When a VF has
+IPsec capability enabled, any IPsec offloading is blocked on the PF.
+
+mlx5 driver support devlink port function attr mechanism to setup ipsec_crypto
+capability. (refer to Documentation/networking/devlink/devlink-port.rst)
+
+IPsec packet capability setup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+User who wants mlx5 PCI VFs to be able to perform IPsec packet offloading need
+to explicitly enable the VF ipsec_packet capability. Enabling IPsec capability
+for VFs is supported starting with ConnectX6dx devices and above. When a VF has
+IPsec capability enabled, any IPsec offloading is blocked on the PF.
+
+mlx5 driver support devlink port function attr mechanism to setup ipsec_packet
+capability. (refer to Documentation/networking/devlink/devlink-port.rst)
+
+SF state setup
+--------------
+
+To use the SF, the user must activate the SF using the SF function state
+attribute.
+
+- Get the state of the SF identified by its unique devlink port index::
+
+   $ devlink port show ens2f0npf0sf88
+   pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+     function:
+       hw_addr 00:00:00:00:88:88 state inactive opstate detached
+
+- Activate the function and verify its state is active::
+
+   $ devlink port function set ens2f0npf0sf88 state active
+
+   $ devlink port show ens2f0npf0sf88
+   pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+     function:
+       hw_addr 00:00:00:00:88:88 state active opstate detached
+
+Upon function activation, the PF driver instance gets the event from the device
+that a particular SF was activated. It's the cue to put the device on bus, probe
+it and instantiate the devlink instance and class specific auxiliary devices
+for it.
+
+- Show the auxiliary device and port of the subfunction::
+
+    $ devlink dev show
+    devlink dev show auxiliary/mlx5_core.sf.4
+
+    $ devlink port show auxiliary/mlx5_core.sf.4/1
+    auxiliary/mlx5_core.sf.4/1: type eth netdev p0sf88 flavour virtual port 0 splittable false
+
+    $ rdma link show mlx5_0/1
+    link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev p0sf88
+
+    $ rdma dev show
+    8: rocep6s0f1: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d113 sys_image_guid 248a:0703:00b3:d112
+    13: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112
+
+- Subfunction auxiliary device and class device hierarchy::
+
+                 mlx5_core.sf.4
+          (subfunction auxiliary device)
+                       /\
+                      /  \
+                     /    \
+                    /      \
+                   /        \
+      mlx5_core.eth.4     mlx5_core.rdma.4
+     (sf eth aux dev)     (sf rdma aux dev)
+         |                      |
+         |                      |
+      p0sf88                  mlx5_0
+     (sf netdev)          (sf rdma device)
+
+Additionally, the SF port also gets the event when the driver attaches to the
+auxiliary device of the subfunction. This results in changing the operational
+state of the function. This provides visibility to the user to decide when is it
+safe to delete the SF port for graceful termination of the subfunction.
+
+- Show the SF port operational state::
+
+    $ devlink port show ens2f0npf0sf88
+    pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+      function:
+        hw_addr 00:00:00:00:88:88 state active opstate attached
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/tracepoints.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/tracepoints.rst
new file mode 100644
index 000000000000..da8e53cebb6c
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/tracepoints.rst
@@ -0,0 +1,229 @@
+.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+.. include:: <isonum.txt>
+
+===========
+Tracepoints
+===========
+
+:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+mlx5 driver provides internal tracepoints for tracking and debugging using
+kernel tracepoints interfaces (refer to Documentation/trace/ftrace.rst).
+
+For the list of support mlx5 events, check /sys/kernel/tracing/events/mlx5/.
+
+tc and eswitch offloads tracepoints:
+
+- mlx5e_configure_flower: trace flower filter actions and cookies offloaded to mlx5::
+
+    $ echo mlx5:mlx5e_configure_flower >> /sys/kernel/tracing/set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    tc-6535  [019] ...1  2672.404466: mlx5e_configure_flower: cookie=0000000067874a55 actions= REDIRECT
+
+- mlx5e_delete_flower: trace flower filter actions and cookies deleted from mlx5::
+
+    $ echo mlx5:mlx5e_delete_flower >> /sys/kernel/tracing/set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    tc-6569  [010] .N.1  2686.379075: mlx5e_delete_flower: cookie=0000000067874a55 actions= NULL
+
+- mlx5e_stats_flower: trace flower stats request::
+
+    $ echo mlx5:mlx5e_stats_flower >> /sys/kernel/tracing/set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    tc-6546  [010] ...1  2679.704889: mlx5e_stats_flower: cookie=0000000060eb3d6a bytes=0 packets=0 lastused=4295560217
+
+- mlx5e_tc_update_neigh_used_value: trace tunnel rule neigh update value offloaded to mlx5::
+
+    $ echo mlx5:mlx5e_tc_update_neigh_used_value >> /sys/kernel/tracing/set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    kworker/u48:4-8806  [009] ...1 55117.882428: mlx5e_tc_update_neigh_used_value: netdev: ens1f0 IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_used=1
+
+- mlx5e_rep_neigh_update: trace neigh update tasks scheduled due to neigh state change events::
+
+    $ echo mlx5:mlx5e_rep_neigh_update >> /sys/kernel/tracing/set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    kworker/u48:7-2221  [009] ...1  1475.387435: mlx5e_rep_neigh_update: netdev: ens1f0 MAC: 24:8a:07:9a:17:9a IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_connected=1
+
+Bridge offloads tracepoints:
+
+- mlx5_esw_bridge_fdb_entry_init: trace bridge FDB entry offloaded to mlx5::
+
+    $ echo mlx5:mlx5_esw_bridge_fdb_entry_init >> set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    kworker/u20:9-2217    [003] ...1   318.582243: mlx5_esw_bridge_fdb_entry_init: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=0 flags=0 used=0
+
+- mlx5_esw_bridge_fdb_entry_cleanup: trace bridge FDB entry deleted from mlx5::
+
+    $ echo mlx5:mlx5_esw_bridge_fdb_entry_cleanup >> set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    ip-2581    [005] ...1   318.629871: mlx5_esw_bridge_fdb_entry_cleanup: net_device=enp8s0f0_1 addr=e4:fd:05:08:00:03 vid=0 flags=0 used=16
+
+- mlx5_esw_bridge_fdb_entry_refresh: trace bridge FDB entry offload refreshed in
+  mlx5::
+
+    $ echo mlx5:mlx5_esw_bridge_fdb_entry_refresh >> set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    kworker/u20:8-3849    [003] ...1       466716: mlx5_esw_bridge_fdb_entry_refresh: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=3 flags=0 used=0
+
+- mlx5_esw_bridge_vlan_create: trace bridge VLAN object add on mlx5
+  representor::
+
+    $ echo mlx5:mlx5_esw_bridge_vlan_create >> set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    ip-2560    [007] ...1   318.460258: mlx5_esw_bridge_vlan_create: vid=1 flags=6
+
+- mlx5_esw_bridge_vlan_cleanup: trace bridge VLAN object delete from mlx5
+  representor::
+
+    $ echo mlx5:mlx5_esw_bridge_vlan_cleanup >> set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    bridge-2582    [007] ...1   318.653496: mlx5_esw_bridge_vlan_cleanup: vid=2 flags=8
+
+- mlx5_esw_bridge_vport_init: trace mlx5 vport assigned with bridge upper
+  device::
+
+    $ echo mlx5:mlx5_esw_bridge_vport_init >> set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    ip-2560    [007] ...1   318.458915: mlx5_esw_bridge_vport_init: vport_num=1
+
+- mlx5_esw_bridge_vport_cleanup: trace mlx5 vport removed from bridge upper
+  device::
+
+    $ echo mlx5:mlx5_esw_bridge_vport_cleanup >> set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    ip-5387    [000] ...1       573713: mlx5_esw_bridge_vport_cleanup: vport_num=1
+
+Eswitch QoS tracepoints:
+
+- mlx5_esw_vport_qos_create: trace creation of transmit scheduler arbiter for vport::
+
+    $ echo mlx5:mlx5_esw_vport_qos_create >> /sys/kernel/tracing/set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    <...>-23496   [018] .... 73136.838831: mlx5_esw_vport_qos_create: (0000:82:00.0) vport=2 tsar_ix=4 bw_share=0, max_rate=0 group=000000007b576bb3
+
+- mlx5_esw_vport_qos_config: trace configuration of transmit scheduler arbiter for vport::
+
+    $ echo mlx5:mlx5_esw_vport_qos_config >> /sys/kernel/tracing/set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    <...>-26548   [023] .... 75754.223823: mlx5_esw_vport_qos_config: (0000:82:00.0) vport=1 tsar_ix=3 bw_share=34, max_rate=10000 group=000000007b576bb3
+
+- mlx5_esw_vport_qos_destroy: trace deletion of transmit scheduler arbiter for vport::
+
+    $ echo mlx5:mlx5_esw_vport_qos_destroy >> /sys/kernel/tracing/set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    <...>-27418   [004] .... 76546.680901: mlx5_esw_vport_qos_destroy: (0000:82:00.0) vport=1 tsar_ix=3
+
+- mlx5_esw_group_qos_create: trace creation of transmit scheduler arbiter for rate group::
+
+    $ echo mlx5:mlx5_esw_group_qos_create >> /sys/kernel/tracing/set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    <...>-26578   [008] .... 75776.022112: mlx5_esw_group_qos_create: (0000:82:00.0) group=000000008dac63ea tsar_ix=5
+
+- mlx5_esw_group_qos_config: trace configuration of transmit scheduler arbiter for rate group::
+
+    $ echo mlx5:mlx5_esw_group_qos_config >> /sys/kernel/tracing/set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    <...>-27303   [020] .... 76461.455356: mlx5_esw_group_qos_config: (0000:82:00.0) group=000000008dac63ea tsar_ix=5 bw_share=100 max_rate=20000
+
+- mlx5_esw_group_qos_destroy: trace deletion of transmit scheduler arbiter for group::
+
+    $ echo mlx5:mlx5_esw_group_qos_destroy >> /sys/kernel/tracing/set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    <...>-27418   [006] .... 76547.187258: mlx5_esw_group_qos_destroy: (0000:82:00.0) group=000000007b576bb3 tsar_ix=1
+
+SF tracepoints:
+
+- mlx5_sf_add: trace addition of the SF port::
+
+    $ echo mlx5:mlx5_sf_add >> /sys/kernel/tracing/set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    devlink-9363    [031] ..... 24610.188722: mlx5_sf_add: (0000:06:00.0) port_index=32768 controller=0 hw_id=0x8000 sfnum=88
+
+- mlx5_sf_free: trace freeing of the SF port::
+
+    $ echo mlx5:mlx5_sf_free >> /sys/kernel/tracing/set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    devlink-9830    [038] ..... 26300.404749: mlx5_sf_free: (0000:06:00.0) port_index=32768 controller=0 hw_id=0x8000
+
+- mlx5_sf_activate: trace activation of the SF port::
+
+    $ echo mlx5:mlx5_sf_activate >> /sys/kernel/tracing/set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    devlink-29841   [008] .....  3669.635095: mlx5_sf_activate: (0000:08:00.0) port_index=32768 controller=0 hw_id=0x8000
+
+- mlx5_sf_deactivate: trace deactivation of the SF port::
+
+    $ echo mlx5:mlx5_sf_deactivate >> /sys/kernel/tracing/set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    devlink-29994   [008] .....  4015.969467: mlx5_sf_deactivate: (0000:08:00.0) port_index=32768 controller=0 hw_id=0x8000
+
+- mlx5_sf_hwc_alloc: trace allocating of the hardware SF context::
+
+    $ echo mlx5:mlx5_sf_hwc_alloc >> /sys/kernel/tracing/set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    devlink-9775    [031] ..... 26296.385259: mlx5_sf_hwc_alloc: (0000:06:00.0) controller=0 hw_id=0x8000 sfnum=88
+
+- mlx5_sf_hwc_free: trace freeing of the hardware SF context::
+
+    $ echo mlx5:mlx5_sf_hwc_free >> /sys/kernel/tracing/set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    kworker/u128:3-9093    [046] ..... 24625.365771: mlx5_sf_hwc_free: (0000:06:00.0) hw_id=0x8000
+
+- mlx5_sf_hwc_deferred_free: trace deferred freeing of the hardware SF context::
+
+    $ echo mlx5:mlx5_sf_hwc_deferred_free >> /sys/kernel/tracing/set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    devlink-9519    [046] ..... 24624.400271: mlx5_sf_hwc_deferred_free: (0000:06:00.0) hw_id=0x8000
+
+- mlx5_sf_update_state: trace state updates for SF contexts::
+
+    $ echo mlx5:mlx5_sf_update_state >> /sys/kernel/tracing/set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    kworker/u20:3-29490   [009] .....  4141.453530: mlx5_sf_update_state: (0000:08:00.0) port_index=32768 controller=0 hw_id=0x8000 state=2
+
+- mlx5_sf_vhca_event: trace SF vhca event and state::
+
+    $ echo mlx5:mlx5_sf_vhca_event >> /sys/kernel/tracing/set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    kworker/u128:3-9093    [046] ..... 24625.365525: mlx5_sf_vhca_event: (0000:06:00.0) hw_id=0x8000 sfnum=88 vhca_state=1
+
+- mlx5_sf_dev_add: trace SF device add event::
+
+    $ echo mlx5:mlx5_sf_dev_add>> /sys/kernel/tracing/set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    kworker/u128:3-9093    [000] ..... 24616.524495: mlx5_sf_dev_add: (0000:06:00.0) sfdev=00000000fc5d96fd aux_id=4 hw_id=0x8000 sfnum=88
+
+- mlx5_sf_dev_del: trace SF device delete event::
+
+    $ echo mlx5:mlx5_sf_dev_del >> /sys/kernel/tracing/set_event
+    $ cat /sys/kernel/tracing/trace
+    ...
+    kworker/u128:3-9093    [044] ..... 24624.400749: mlx5_sf_dev_del: (0000:06:00.0) sfdev=00000000fc5d96fd aux_id=4 hw_id=0x8000 sfnum=88
diff --git a/Documentation/networking/device_drivers/ethernet/meta/fbnic.rst b/Documentation/networking/device_drivers/ethernet/meta/fbnic.rst
new file mode 100644
index 000000000000..f8592dec8851
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/meta/fbnic.rst
@@ -0,0 +1,132 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=====================================
+Meta Platforms Host Network Interface
+=====================================
+
+Firmware Versions
+-----------------
+
+fbnic has three components stored on the flash which are provided in one PLDM
+image:
+
+1. fw - The control firmware used to view and modify firmware settings, request
+   firmware actions, and retrieve firmware counters outside of the data path.
+   This is the firmware which fbnic_fw.c interacts with.
+2. bootloader - The firmware which validate firmware security and control basic
+   operations including loading and updating the firmware. This is also known
+   as the cmrt firmware.
+3. undi - This is the UEFI driver which is based on the Linux driver.
+
+fbnic stores two copies of these three components on flash. This allows fbnic
+to fall back to an older version of firmware automatically in case firmware
+fails to boot. Version information for both is provided as running and stored.
+The undi is only provided in stored as it is not actively running once the Linux
+driver takes over.
+
+devlink dev info provides version information for all three components. In
+addition to the version the hg commit hash of the build is included as a
+separate entry.
+
+Upgrading Firmware
+------------------
+
+fbnic supports updating firmware using signed PLDM images with devlink dev
+flash. PLDM images are written into the flash. Flashing does not interrupt
+the operation of the device.
+
+On host boot the latest UEFI driver is always used, no explicit activation
+is required. Firmware activation is required to run new control firmware. cmrt
+firmware can only be activated by power cycling the NIC.
+
+Statistics
+----------
+
+TX MAC Interface
+~~~~~~~~~~~~~~~~
+
+ - ``ptp_illegal_req``: packets sent to the NIC with PTP request bit set but routed to BMC/FW
+ - ``ptp_good_ts``: packets successfully routed to MAC with PTP request bit set
+ - ``ptp_bad_ts``: packets destined for MAC with PTP request bit set but aborted because of some error (e.g., DMA read error)
+
+TX Extension (TEI) Interface (TTI)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ - ``tti_cm_drop``: control messages dropped at the TX Extension (TEI) Interface because of credit starvation
+ - ``tti_frame_drop``: packets dropped at the TX Extension (TEI) Interface because of credit starvation
+ - ``tti_tbi_drop``: packets dropped at the TX BMC Interface (TBI) because of credit starvation
+
+RXB (RX Buffer) Enqueue
+~~~~~~~~~~~~~~~~~~~~~~~
+
+ - ``rxb_integrity_err[i]``: frames enqueued with integrity errors (e.g., multi-bit ECC errors) on RXB input i
+ - ``rxb_mac_err[i]``: frames enqueued with MAC end-of-frame errors (e.g., bad FCS) on RXB input i
+ - ``rxb_parser_err[i]``: frames experienced RPC parser errors
+ - ``rxb_frm_err[i]``: frames experienced signaling errors (e.g., missing end-of-packet/start-of-packet) on RXB input i
+ - ``rxb_drbo[i]_frames``: frames received at RXB input i
+ - ``rxb_drbo[i]_bytes``: bytes received at RXB input i
+
+RXB (RX Buffer) FIFO
+~~~~~~~~~~~~~~~~~~~~
+
+ - ``rxb_fifo[i]_drop``: transitions into the drop state on RXB pool i
+ - ``rxb_fifo[i]_dropped_frames``: frames dropped on RXB pool i
+ - ``rxb_fifo[i]_ecn``: transitions into the ECN mark state on RXB pool i
+ - ``rxb_fifo[i]_level``: current occupancy of RXB pool i
+
+RXB (RX Buffer) Dequeue
+~~~~~~~~~~~~~~~~~~~~~~~
+
+   - ``rxb_intf[i]_frames``: frames sent to the output i
+   - ``rxb_intf[i]_bytes``: bytes sent to the output i
+   - ``rxb_pbuf[i]_frames``: frames sent to output i from the perspective of internal packet buffer
+   - ``rxb_pbuf[i]_bytes``: bytes sent to output i from the perspective of internal packet buffer
+
+RPC (Rx parser)
+~~~~~~~~~~~~~~~
+
+ - ``rpc_unkn_etype``: frames containing unknown EtherType
+ - ``rpc_unkn_ext_hdr``: frames containing unknown IPv6 extension header
+ - ``rpc_ipv4_frag``: frames containing IPv4 fragment
+ - ``rpc_ipv6_frag``: frames containing IPv6 fragment
+ - ``rpc_ipv4_esp``: frames with IPv4 ESP encapsulation
+ - ``rpc_ipv6_esp``: frames with IPv6 ESP encapsulation
+ - ``rpc_tcp_opt_err``: frames which encountered TCP option parsing error
+ - ``rpc_out_of_hdr_err``: frames where header was larger than parsable region
+ - ``ovr_size_err``: oversized frames
+
+Hardware Queues
+~~~~~~~~~~~~~~~
+
+1. RX DMA Engine:
+
+ - ``rde_[i]_pkt_err``: packets with MAC EOP, RPC parser, RXB truncation, or RDE frame truncation errors. These error are flagged in the packet metadata because of cut-through support but the actual drop happens once PCIE/RDE is reached.
+ - ``rde_[i]_pkt_cq_drop``: packets dropped because RCQ is full
+ - ``rde_[i]_pkt_bdq_drop``: packets dropped because HPQ or PPQ ran out of host buffer
+
+PCIe
+~~~~
+
+The fbnic driver exposes PCIe hardware performance statistics through debugfs
+(``pcie_stats``). These statistics provide insights into PCIe transaction
+behavior and potential performance bottlenecks.
+
+1. PCIe Transaction Counters:
+
+   These counters track PCIe transaction activity:
+        - ``pcie_ob_rd_tlp``: Outbound read Transaction Layer Packets count
+        - ``pcie_ob_rd_dword``: DWORDs transferred in outbound read transactions
+        - ``pcie_ob_wr_tlp``: Outbound write Transaction Layer Packets count
+        - ``pcie_ob_wr_dword``: DWORDs transferred in outbound write
+	  transactions
+        - ``pcie_ob_cpl_tlp``: Outbound completion TLP count
+        - ``pcie_ob_cpl_dword``: DWORDs transferred in outbound completion TLPs
+
+2. PCIe Resource Monitoring:
+
+   These counters indicate PCIe resource exhaustion events:
+        - ``pcie_ob_rd_no_tag``: Read requests dropped due to tag unavailability
+        - ``pcie_ob_rd_no_cpl_cred``: Read requests dropped due to completion
+	  credit exhaustion
+        - ``pcie_ob_rd_no_np_cred``: Read requests dropped due to non-posted
+	  credit exhaustion
diff --git a/Documentation/networking/device_drivers/microsoft/netvsc.rst b/Documentation/networking/device_drivers/ethernet/microsoft/netvsc.rst
index c3f51c672a68..fc5acd427a5d 100644
--- a/Documentation/networking/device_drivers/microsoft/netvsc.rst
+++ b/Documentation/networking/device_drivers/ethernet/microsoft/netvsc.rst
@@ -87,11 +87,15 @@ Receive Buffer
   contain one or more packets. The number of receive sections may be changed
   via ethtool Rx ring parameters.
 
-  There is a similar send buffer which is used to aggregate packets for sending.
-  The send area is broken into chunks of 6144 bytes, each of section may
-  contain one or more packets. The send buffer is an optimization, the driver
-  will use slower method to handle very large packets or if the send buffer
-  area is exhausted.
+  There is a similar send buffer which is used to aggregate packets
+  for sending.  The send area is broken into chunks, typically of 6144
+  bytes, each of section may contain one or more packets. Small
+  packets are usually transmitted via copy to the send buffer. However,
+  if the buffer is temporarily exhausted, or the packet to be transmitted is
+  an LSO packet, the driver will provide the host with pointers to the data
+  from the SKB. This attempts to achieve a balance between the overhead of
+  data copy and the impact of remapping VM memory to be accessible by the
+  host.
 
 XDP support
 -----------
diff --git a/Documentation/networking/device_drivers/neterion/s2io.rst b/Documentation/networking/device_drivers/ethernet/neterion/s2io.rst
index c5673ec4559b..d731b5a98561 100644
--- a/Documentation/networking/device_drivers/neterion/s2io.rst
+++ b/Documentation/networking/device_drivers/ethernet/neterion/s2io.rst
@@ -64,8 +64,8 @@ c. Multi-buffer receive mode. Scattering of packet across multiple
    IBM xSeries).
 
 d. MSI/MSI-X. Can be enabled on platforms which support this feature
-   (IA64, Xeon) resulting in noticeable performance improvement(up to 7%
-   on certain platforms).
+   resulting in noticeable performance improvement (up to 7% on certain
+   platforms).
 
 e. Statistics. Comprehensive MAC-level and software statistics displayed
    using "ethtool -S" option.
diff --git a/Documentation/networking/device_drivers/netronome/nfp.rst b/Documentation/networking/device_drivers/ethernet/netronome/nfp.rst
index ada611fb427c..650b57742d51 100644
--- a/Documentation/networking/device_drivers/netronome/nfp.rst
+++ b/Documentation/networking/device_drivers/ethernet/netronome/nfp.rst
@@ -1,50 +1,57 @@
 .. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+.. include:: <isonum.txt>
 
-=============================================
-Netronome Flow Processor (NFP) Kernel Drivers
-=============================================
+===========================================
+Network Flow Processor (NFP) Kernel Drivers
+===========================================
 
-Copyright (c) 2019, Netronome Systems, Inc.
+:Copyright: |copy| 2019, Netronome Systems, Inc.
+:Copyright: |copy| 2022, Corigine, Inc.
 
 Contents
 ========
 
 - `Overview`_
 - `Acquiring Firmware`_
+- `Devlink Info`_
+- `Configure Device`_
+- `Statistics`_
 
 Overview
 ========
 
-This driver supports Netronome's line of Flow Processor devices,
-including the NFP4000, NFP5000, and NFP6000 models, which are also
-incorporated in the company's family of Agilio SmartNICs. The SR-IOV
-physical and virtual functions for these devices are supported by
-the driver.
+This driver supports Netronome and Corigine's line of Network Flow Processor
+devices, including the NFP3800, NFP4000, NFP5000, and NFP6000 models, which
+are also incorporated in the companies' family of Agilio SmartNICs. The SR-IOV
+physical and virtual functions for these devices are supported by the driver.
 
 Acquiring Firmware
 ==================
 
-The NFP4000 and NFP6000 devices require application specific firmware
-to function.  Application firmware can be located either on the host file system
+The NFP3800, NFP4000 and NFP6000 devices require application specific firmware
+to function. Application firmware can be located either on the host file system
 or in the device flash (if supported by management firmware).
 
 Firmware files on the host filesystem contain card type (`AMDA-*` string), media
-config etc.  They should be placed in `/lib/firmware/netronome` directory to
+config etc. They should be placed in `/lib/firmware/netronome` directory to
 load firmware from the host file system.
 
 Firmware for basic NIC operation is available in the upstream
 `linux-firmware.git` repository.
 
+A more comprehensive list of firmware can be downloaded from the
+`Corigine Support site <https://www.corigine.com/DPUDownload.html>`_.
+
 Firmware in NVRAM
 -----------------
 
 Recent versions of management firmware supports loading application
-firmware from flash when the host driver gets probed.  The firmware loading
+firmware from flash when the host driver gets probed. The firmware loading
 policy configuration may be used to configure this feature appropriately.
 
 Devlink or ethtool can be used to update the application firmware on the device
 flash by providing the appropriate `nic_AMDA*.nffw` file to the respective
-command.  Users need to take care to write the correct firmware image for the
+command. Users need to take care to write the correct firmware image for the
 card and media configuration to flash.
 
 Available storage space in flash depends on the card being used.
@@ -79,9 +86,9 @@ You may need to use hard instead of symbolic links on distributions
 which use old `mkinitrd` command instead of `dracut` (e.g. Ubuntu).
 
 After changing firmware files you may need to regenerate the initramfs
-image.  Initramfs contains drivers and firmware files your system may
-need to boot.  Refer to the documentation of your distribution to find
-out how to update initramfs.  Good indication of stale initramfs
+image. Initramfs contains drivers and firmware files your system may
+need to boot. Refer to the documentation of your distribution to find
+out how to update initramfs. Good indication of stale initramfs
 is system loading wrong driver or firmware on boot, but when driver is
 later reloaded manually everything works correctly.
 
@@ -89,9 +96,9 @@ Selecting firmware per device
 -----------------------------
 
 Most commonly all cards on the system use the same type of firmware.
-If you want to load specific firmware image for a specific card, you
-can use either the PCI bus address or serial number.  Driver will print
-which files it's looking for when it recognizes a NFP device::
+If you want to load a specific firmware image for a specific card, you
+can use either the PCI bus address or serial number. The driver will
+print which files it's looking for when it recognizes a NFP device::
 
     nfp: Looking for firmware file in order of priority:
     nfp:  netronome/serial-00-12-34-aa-bb-cc-10-ff.nffw: not found
@@ -106,6 +113,15 @@ Note that `serial-*` and `pci-*` files are **not** automatically included
 in initramfs, you will have to refer to documentation of appropriate tools
 to find out how to include them.
 
+Running firmware version
+------------------------
+
+The version of the loaded firmware for a particular <netdev> interface,
+(e.g. enp4s0), or an interface's port <netdev port> (e.g. enp4s0np0) can
+be displayed with the ethtool command::
+
+  $ ethtool -i <netdev>
+
 Firmware loading policy
 -----------------------
 
@@ -132,6 +148,115 @@ abi_drv_load_ifc
     Defines a list of PF devices allowed to load FW on the device.
     This variable is not currently user configurable.
 
+Devlink Info
+============
+
+The devlink info command displays the running and stored firmware versions
+on the device, serial number and board information.
+
+Devlink info command example (replace PCI address)::
+
+  $ devlink dev info pci/0000:03:00.0
+    pci/0000:03:00.0:
+      driver nfp
+      serial_number CSAAMDA2001-1003000111
+      versions:
+          fixed:
+            board.id AMDA2001-1003
+            board.rev 01
+            board.manufacture CSA
+            board.model mozart
+          running:
+            fw.mgmt 22.10.0-rc3
+            fw.cpld 0x1000003
+            fw.app nic-22.09.0
+            chip.init AMDA-2001-1003  1003000111
+          stored:
+            fw.bundle_id bspbundle_1003000111
+            fw.mgmt 22.10.0-rc3
+            fw.cpld 0x0
+            chip.init AMDA-2001-1003  1003000111
+
+Configure Device
+================
+
+This section explains how to use Agilio SmartNICs running basic NIC firmware.
+
+Configure interface link-speed
+------------------------------
+The following steps explains how to change between 10G mode and 25G mode on
+Agilio CX 2x25GbE cards. The changing of port speed must be done in order,
+port 0 (p0) must be set to 10G before port 1 (p1) may be set to 10G.
+
+Down the respective interface(s)::
+
+  $ ip link set dev <netdev port 0> down
+  $ ip link set dev <netdev port 1> down
+
+Set interface link-speed to 10G::
+
+  $ ethtool -s <netdev port 0> speed 10000
+  $ ethtool -s <netdev port 1> speed 10000
+
+Set interface link-speed to 25G::
+
+  $ ethtool -s <netdev port 0> speed 25000
+  $ ethtool -s <netdev port 1> speed 25000
+
+Reload driver for changes to take effect::
+
+  $ rmmod nfp; modprobe nfp
+
+Configure interface Maximum Transmission Unit (MTU)
+---------------------------------------------------
+
+The MTU of interfaces can temporarily be set using the iproute2, ip link or
+ifconfig tools. Note that this change will not persist. Setting this via
+Network Manager, or another appropriate OS configuration tool, is
+recommended as changes to the MTU using Network Manager can be made to
+persist.
+
+Set interface MTU to 9000 bytes::
+
+  $ ip link set dev <netdev port> mtu 9000
+
+It is the responsibility of the user or the orchestration layer to set
+appropriate MTU values when handling jumbo frames or utilizing tunnels. For
+example, if packets sent from a VM are to be encapsulated on the card and
+egress a physical port, then the MTU of the VF should be set to lower than
+that of the physical port to account for the extra bytes added by the
+additional header. If a setup is expected to see fallback traffic between
+the SmartNIC and the kernel then the user should also ensure that the PF MTU
+is appropriately set to avoid unexpected drops on this path.
+
+Configure Forward Error Correction (FEC) modes
+----------------------------------------------
+
+Agilio SmartNICs support FEC mode configuration, e.g. Auto, Firecode Base-R,
+ReedSolomon and Off modes. Each physical port's FEC mode can be set
+independently using ethtool. The supported FEC modes for an interface can
+be viewed using::
+
+  $ ethtool <netdev>
+
+The currently configured FEC mode can be viewed using::
+
+  $ ethtool --show-fec <netdev>
+
+To force the FEC mode for a particular port, auto-negotiation must be disabled
+(see the `Auto-negotiation`_ section). An example of how to set the FEC mode
+to Reed-Solomon is::
+
+  $ ethtool --set-fec <netdev> encoding rs
+
+Auto-negotiation
+----------------
+
+To change auto-negotiation settings, the link must first be put down. After the
+link is down, auto-negotiation can be enabled or disabled using::
+
+  ethtool -s <netdev> autoneg <on|off>
+
 Statistics
 ==========
 
diff --git a/Documentation/networking/device_drivers/pensando/ionic.rst b/Documentation/networking/device_drivers/ethernet/pensando/ionic.rst
index 0eabbc347d6c..05fe2b11bb18 100644
--- a/Documentation/networking/device_drivers/pensando/ionic.rst
+++ b/Documentation/networking/device_drivers/ethernet/pensando/ionic.rst
@@ -83,7 +83,7 @@ Configuring the Driver
 MTU
 ---
 
-Jumbo frame support is available with a maximim size of 9194 bytes.
+Jumbo frame support is available with a maximum size of 9194 bytes.
 
 Interrupt coalescing
 --------------------
@@ -99,6 +99,12 @@ Minimal SR-IOV support is currently offered and can be enabled by setting
 the sysfs 'sriov_numvfs' value, if supported by your particular firmware
 configuration.
 
+XDP
+---
+
+Support for XDP includes the basics, plus Jumbo frames, Redirect and
+ndo_xmit.  There is no current support for zero-copy sockets or HW offload.
+
 Statistics
 ==========
 
@@ -138,6 +144,12 @@ Driver port specific::
      rx_csum_none: 0
      rx_csum_complete: 3
      rx_csum_error: 0
+     xdp_drop: 0
+     xdp_aborted: 0
+     xdp_pass: 0
+     xdp_tx: 0
+     xdp_redirect: 0
+     xdp_frames: 0
 
 Driver queue specific::
 
@@ -149,9 +161,12 @@ Driver queue specific::
      tx_0_frags: 0
      tx_0_tso: 0
      tx_0_tso_bytes: 0
+     tx_0_hwstamp_valid: 0
+     tx_0_hwstamp_invalid: 0
      tx_0_csum_none: 3
      tx_0_csum: 0
      tx_0_vlan_inserted: 0
+     tx_0_xdp_frames: 0
      rx_0_pkts: 2
      rx_0_bytes: 120
      rx_0_dma_map_err: 0
@@ -159,8 +174,15 @@ Driver queue specific::
      rx_0_csum_none: 0
      rx_0_csum_complete: 0
      rx_0_csum_error: 0
+     rx_0_hwstamp_valid: 0
+     rx_0_hwstamp_invalid: 0
      rx_0_dropped: 0
      rx_0_vlan_stripped: 0
+     rx_0_xdp_drop: 0
+     rx_0_xdp_aborted: 0
+     rx_0_xdp_pass: 0
+     rx_0_xdp_tx: 0
+     rx_0_xdp_redirect: 0
 
 Firmware port specific::
 
diff --git a/Documentation/networking/device_drivers/smsc/smc9.rst b/Documentation/networking/device_drivers/ethernet/smsc/smc9.rst
index e5eac896a631..e5eac896a631 100644
--- a/Documentation/networking/device_drivers/smsc/smc9.rst
+++ b/Documentation/networking/device_drivers/ethernet/smsc/smc9.rst
diff --git a/Documentation/networking/device_drivers/stmicro/stmmac.rst b/Documentation/networking/device_drivers/ethernet/stmicro/stmmac.rst
index 5d46e5036129..5d46e5036129 100644
--- a/Documentation/networking/device_drivers/stmicro/stmmac.rst
+++ b/Documentation/networking/device_drivers/ethernet/stmicro/stmmac.rst
diff --git a/Documentation/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst b/Documentation/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst
new file mode 100644
index 000000000000..25fd9aa284e2
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst
@@ -0,0 +1,143 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================================================
+Texas Instruments K3 AM65 CPSW NUSS switchdev based ethernet driver
+===================================================================
+
+:Version: 1.0
+
+Port renaming
+=============
+
+In order to rename via udev::
+
+    ip -d link show dev sw0p1 | grep switchid
+
+    SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}==<switchid>, \
+	    ATTR{phys_port_name}!="", NAME="sw0$attr{phys_port_name}"
+
+
+Multi mac mode
+==============
+
+- The driver is operating in multi-mac mode by default, thus
+  working as N individual network interfaces.
+
+Devlink configuration parameters
+================================
+
+See Documentation/networking/devlink/am65-nuss-cpsw-switch.rst
+
+Enabling "switch"
+=================
+
+The Switch mode can be enabled by configuring devlink driver parameter
+"switch_mode" to 1/true::
+
+        devlink dev param set platform/c000000.ethernet \
+        name switch_mode value true cmode runtime
+
+This can be done regardless of the state of Port's netdev devices - UP/DOWN, but
+Port's netdev devices have to be in UP before joining to the bridge to avoid
+overwriting of bridge configuration as CPSW switch driver completely reloads its
+configuration when first port changes its state to UP.
+
+When the both interfaces joined the bridge - CPSW switch driver will enable
+marking packets with offload_fwd_mark flag.
+
+All configuration is implemented via switchdev API.
+
+Bridge setup
+============
+
+::
+
+        devlink dev param set platform/c000000.ethernet \
+        name switch_mode value true cmode runtime
+
+	ip link add name br0 type bridge
+	ip link set dev br0 type bridge ageing_time 1000
+	ip link set dev sw0p1 up
+	ip link set dev sw0p2 up
+	ip link set dev sw0p1 master br0
+	ip link set dev sw0p2 master br0
+
+	[*] bridge vlan add dev br0 vid 1 pvid untagged self
+
+	[*] if vlan_filtering=1. where default_pvid=1
+
+	Note. Steps [*] are mandatory.
+
+
+On/off STP
+==========
+
+::
+
+	ip link set dev BRDEV type bridge stp_state 1/0
+
+VLAN configuration
+==================
+
+::
+
+  bridge vlan add dev br0 vid 1 pvid untagged self <---- add cpu port to VLAN 1
+
+Note. This step is mandatory for bridge/default_pvid.
+
+Add extra VLANs
+===============
+
+ 1. untagged::
+
+	bridge vlan add dev sw0p1 vid 100 pvid untagged master
+	bridge vlan add dev sw0p2 vid 100 pvid untagged master
+	bridge vlan add dev br0 vid 100 pvid untagged self <---- Add cpu port to VLAN100
+
+ 2. tagged::
+
+	bridge vlan add dev sw0p1 vid 100 master
+	bridge vlan add dev sw0p2 vid 100 master
+	bridge vlan add dev br0 vid 100 pvid tagged self <---- Add cpu port to VLAN100
+
+FDBs
+----
+
+FDBs are automatically added on the appropriate switch port upon detection
+
+Manually adding FDBs::
+
+    bridge fdb add aa:bb:cc:dd:ee:ff dev sw0p1 master vlan 100
+    bridge fdb add aa:bb:cc:dd:ee:fe dev sw0p2 master <---- Add on all VLANs
+
+MDBs
+----
+
+MDBs are automatically added on the appropriate switch port upon detection
+
+Manually adding MDBs::
+
+  bridge mdb add dev br0 port sw0p1 grp 239.1.1.1 permanent vid 100
+  bridge mdb add dev br0 port sw0p1 grp 239.1.1.1 permanent <---- Add on all VLANs
+
+Multicast flooding
+==================
+CPU port mcast_flooding is always on
+
+Turning flooding on/off on switch ports:
+bridge link set dev sw0p1 mcast_flood on/off
+
+Access and Trunk port
+=====================
+
+::
+
+ bridge vlan add dev sw0p1 vid 100 pvid untagged master
+ bridge vlan add dev sw0p2 vid 100 master
+
+
+ bridge vlan add dev br0 vid 100 self
+ ip link add link br0 name br0.100 type vlan id 100
+
+Note. Setting PVID on Bridge device itself works only for
+default VLAN (default_pvid).
diff --git a/Documentation/networking/device_drivers/ti/cpsw.rst b/Documentation/networking/device_drivers/ethernet/ti/cpsw.rst
index a88946bd188b..a88946bd188b 100644
--- a/Documentation/networking/device_drivers/ti/cpsw.rst
+++ b/Documentation/networking/device_drivers/ethernet/ti/cpsw.rst
diff --git a/Documentation/networking/device_drivers/ti/cpsw_switchdev.rst b/Documentation/networking/device_drivers/ethernet/ti/cpsw_switchdev.rst
index 1241ecac73bd..464dce938ed1 100644
--- a/Documentation/networking/device_drivers/ti/cpsw_switchdev.rst
+++ b/Documentation/networking/device_drivers/ethernet/ti/cpsw_switchdev.rst
@@ -174,7 +174,7 @@ Multicast flooding
 ==================
 CPU port mcast_flooding is always on
 
-Turning flooding on/off on swithch ports:
+Turning flooding on/off on switch ports:
 bridge link set dev sw0p1 mcast_flood on/off
 
 Access and Trunk port
diff --git a/Documentation/networking/device_drivers/ethernet/ti/icssg_prueth.rst b/Documentation/networking/device_drivers/ethernet/ti/icssg_prueth.rst
new file mode 100644
index 000000000000..da21ddf431bb
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/ti/icssg_prueth.rst
@@ -0,0 +1,56 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================================
+Texas Instruments ICSSG PRUETH ethernet driver
+==============================================
+
+:Version: 1.0
+
+ICSSG Firmware
+==============
+
+Every ICSSG core has two Programmable Real-Time Unit(PRUs), two auxiliary
+Real-Time Transfer Unit (RTUs), and two Transmit Real-Time Transfer Units
+(TX_PRUs). Each one of these runs its own firmware. The firmwares combnined are
+referred as ICSSG Firmware.
+
+Firmware Statistics
+===================
+
+The ICSSG firmware maintains certain statistics which are dumped by the driver
+via ``ethtool -S <interface>``
+
+These statistics are as follows,
+
+ - ``FW_RTU_PKT_DROP``: Diagnostic error counter which increments when RTU drops a locally injected packet due to port being disabled or rule violation.
+ - ``FW_Q0_OVERFLOW``: TX overflow counter for queue0
+ - ``FW_Q1_OVERFLOW``: TX overflow counter for queue1
+ - ``FW_Q2_OVERFLOW``: TX overflow counter for queue2
+ - ``FW_Q3_OVERFLOW``: TX overflow counter for queue3
+ - ``FW_Q4_OVERFLOW``: TX overflow counter for queue4
+ - ``FW_Q5_OVERFLOW``: TX overflow counter for queue5
+ - ``FW_Q6_OVERFLOW``: TX overflow counter for queue6
+ - ``FW_Q7_OVERFLOW``: TX overflow counter for queue7
+ - ``FW_DROPPED_PKT``: This counter is incremented when a packet is dropped at PRU because of rule violation.
+ - ``FW_RX_ERROR``: Incremented if there was a CRC error or Min/Max frame error at PRU
+ - ``FW_RX_DS_INVALID``: Incremented when RTU detects Data Status invalid condition
+ - ``FW_TX_DROPPED_PACKET``: Counter for packets dropped via TX Port
+ - ``FW_TX_TS_DROPPED_PACKET``: Counter for packets with TS flag dropped via TX Port
+ - ``FW_INF_PORT_DISABLED``: Incremented when RX frame is dropped due to port being disabled
+ - ``FW_INF_SAV``: Incremented when RX frame is dropped due to Source Address violation
+ - ``FW_INF_SA_DL``: Incremented when RX frame is dropped due to Source Address being in the denylist
+ - ``FW_INF_PORT_BLOCKED``: Incremented when RX frame is dropped due to port being blocked and frame being a special frame
+ - ``FW_INF_DROP_TAGGED`` : Incremented when RX frame is dropped for being tagged
+ - ``FW_INF_DROP_PRIOTAGGED``: Incremented when RX frame is dropped for being priority tagged
+ - ``FW_INF_DROP_NOTAG``: Incremented when RX frame is dropped for being untagged
+ - ``FW_INF_DROP_NOTMEMBER``: Incremented when RX frame is dropped for port not being member of VLAN
+ - ``FW_RX_EOF_SHORT_FRMERR``: Incremented if End Of Frame (EOF) task is scheduled without seeing RX_B1
+ - ``FW_RX_B0_DROP_EARLY_EOF``: Incremented when frame is dropped due to Early EOF
+ - ``FW_TX_JUMBO_FRM_CUTOFF``: Incremented when frame is cut off to prevent packet size > 2000 Bytes
+ - ``FW_RX_EXP_FRAG_Q_DROP``: Incremented when express frame is received in the same queue as the previous fragment
+ - ``FW_RX_FIFO_OVERRUN``: RX fifo overrun counter
+ - ``FW_CUT_THR_PKT``: Incremented when a packet is forwarded using Cut-Through forwarding method
+ - ``FW_HOST_RX_PKT_CNT``: Number of valid packets sent by Rx PRU to Host on PSI
+ - ``FW_HOST_TX_PKT_CNT``: Number of valid packets copied by RTU0 to Tx queues
+ - ``FW_HOST_EGRESS_Q_PRE_OVERFLOW``: Host Egress Q (Pre-emptible) Overflow Counter
+ - ``FW_HOST_EGRESS_Q_EXP_OVERFLOW``: Host Egress Q (Pre-emptible) Overflow Counter
diff --git a/Documentation/networking/device_drivers/ti/tlan.rst b/Documentation/networking/device_drivers/ethernet/ti/tlan.rst
index 4fdc0907f4fc..4fdc0907f4fc 100644
--- a/Documentation/networking/device_drivers/ti/tlan.rst
+++ b/Documentation/networking/device_drivers/ethernet/ti/tlan.rst
diff --git a/Documentation/networking/device_drivers/ethernet/wangxun/ngbe.rst b/Documentation/networking/device_drivers/ethernet/wangxun/ngbe.rst
new file mode 100644
index 000000000000..43a02f9943e1
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/wangxun/ngbe.rst
@@ -0,0 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================================================
+Linux Base Driver for WangXun(R) Gigabit PCI Express Adapters
+=============================================================
+
+WangXun Gigabit Linux driver.
+Copyright (c) 2019 - 2022 Beijing WangXun Technology Co., Ltd.
+
+Support
+=======
+ If you have problems with the software or hardware, please contact our
+ customer support team via email at nic-support@net-swift.com or check our website
+ at https://www.net-swift.com
diff --git a/Documentation/networking/device_drivers/ethernet/wangxun/txgbe.rst b/Documentation/networking/device_drivers/ethernet/wangxun/txgbe.rst
new file mode 100644
index 000000000000..d052ef40fe36
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/wangxun/txgbe.rst
@@ -0,0 +1,20 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================================================
+Linux Base Driver for WangXun(R) 10 Gigabit PCI Express Adapters
+================================================================
+
+WangXun 10 Gigabit Linux driver.
+Copyright (c) 2015 - 2022 Beijing WangXun Technology Co., Ltd.
+
+
+Contents
+========
+
+- Support
+
+
+Support
+=======
+If you got any problem, contact Wangxun support team via nic-support@net-swift.com
+and Cc: netdev.
diff --git a/Documentation/networking/defza.rst b/Documentation/networking/device_drivers/fddi/defza.rst
index 73c2f793ea26..7393f33ea705 100644
--- a/Documentation/networking/defza.rst
+++ b/Documentation/networking/device_drivers/fddi/defza.rst
@@ -60,4 +60,4 @@ To do:
 
 Both success and failure reports are welcome.
 
-Maciej W. Rozycki  <macro@linux-mips.org>
+Maciej W. Rozycki  <macro@orcam.me.uk>
diff --git a/Documentation/networking/device_drivers/fddi/index.rst b/Documentation/networking/device_drivers/fddi/index.rst
new file mode 100644
index 000000000000..0b75294e6c8b
--- /dev/null
+++ b/Documentation/networking/device_drivers/fddi/index.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Fiber Distributed Data Interface (FDDI) Device Drivers
+======================================================
+
+Contents:
+
+.. toctree::
+   :maxdepth: 2
+
+   defza
+   skfp
+
+.. only::  subproject and html
+
+   Indices
+   =======
+
+   * :ref:`genindex`
diff --git a/Documentation/networking/skfp.rst b/Documentation/networking/device_drivers/fddi/skfp.rst
index 58f548105c1d..58f548105c1d 100644
--- a/Documentation/networking/skfp.rst
+++ b/Documentation/networking/device_drivers/fddi/skfp.rst
diff --git a/Documentation/networking/baycom.rst b/Documentation/networking/device_drivers/hamradio/baycom.rst
index fe2d010f0e86..fe2d010f0e86 100644
--- a/Documentation/networking/baycom.rst
+++ b/Documentation/networking/device_drivers/hamradio/baycom.rst
diff --git a/Documentation/networking/device_drivers/hamradio/index.rst b/Documentation/networking/device_drivers/hamradio/index.rst
new file mode 100644
index 000000000000..7e731732057b
--- /dev/null
+++ b/Documentation/networking/device_drivers/hamradio/index.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Amateur Radio Device Drivers
+============================
+
+Contents:
+
+.. toctree::
+   :maxdepth: 2
+
+   baycom
+   z8530drv
+
+.. only::  subproject and html
+
+   Indices
+   =======
+
+   * :ref:`genindex`
diff --git a/Documentation/networking/z8530drv.rst b/Documentation/networking/device_drivers/hamradio/z8530drv.rst
index d2942760f167..d2942760f167 100644
--- a/Documentation/networking/z8530drv.rst
+++ b/Documentation/networking/device_drivers/hamradio/z8530drv.rst
diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst
index e18dad11bc72..a254af25b7ef 100644
--- a/Documentation/networking/device_drivers/index.rst
+++ b/Documentation/networking/device_drivers/index.rst
@@ -1,56 +1,21 @@
 .. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
 
-Vendor Device Drivers
-=====================
+Hardware Device Drivers
+=======================
 
 Contents:
 
 .. toctree::
    :maxdepth: 2
 
-   freescale/dpaa2/index
-   intel/e100
-   intel/e1000
-   intel/e1000e
-   intel/fm10k
-   intel/igb
-   intel/igbvf
-   intel/ixgb
-   intel/ixgbe
-   intel/ixgbevf
-   intel/i40e
-   intel/iavf
-   intel/ice
-   google/gve
-   marvell/octeontx2
-   mellanox/mlx5
-   netronome/nfp
-   pensando/ionic
-   stmicro/stmmac
-   3com/3c509
-   3com/vortex
-   amazon/ena
-   aquantia/atlantic
-   chelsio/cxgb
-   cirrus/cs89x0
-   davicom/dm9000
-   dec/de4x5
-   dec/dmfe
-   dlink/dl2k
-   freescale/dpaa
-   freescale/gianfar
-   intel/ipw2100
-   intel/ipw2200
-   microsoft/netvsc
-   neterion/s2io
-   neterion/vxge
-   qualcomm/rmnet
-   sb1000
-   smsc/smc9
-   ti/cpsw_switchdev
-   ti/cpsw
-   ti/tlan
-   toshiba/spider_net
+   atm/index
+   can/index
+   cellular/index
+   ethernet/index
+   fddi/index
+   hamradio/index
+   wifi/index
+   wwan/index
 
 .. only::  subproject and html
 
diff --git a/Documentation/networking/device_drivers/intel/ice.rst b/Documentation/networking/device_drivers/intel/ice.rst
deleted file mode 100644
index ee43ea57d443..000000000000
--- a/Documentation/networking/device_drivers/intel/ice.rst
+++ /dev/null
@@ -1,46 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0+
-
-==================================================================
-Linux Base Driver for the Intel(R) Ethernet Connection E800 Series
-==================================================================
-
-Intel ice Linux driver.
-Copyright(c) 2018 Intel Corporation.
-
-Contents
-========
-
-- Enabling the driver
-- Support
-
-The driver in this release supports Intel's E800 Series of products. For
-more information, visit Intel's support page at https://support.intel.com.
-
-Enabling the driver
-===================
-The driver is enabled via the standard kernel configuration system,
-using the make command::
-
-  make oldconfig/menuconfig/etc.
-
-The driver is located in the menu structure at:
-
-  -> Device Drivers
-    -> Network device support (NETDEVICES [=y])
-      -> Ethernet driver support
-        -> Intel devices
-          -> Intel(R) Ethernet Connection E800 Series Support
-
-Support
-=======
-For general information, go to the Intel support website at:
-
-https://www.intel.com/support/
-
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
-If an issue is identified with the released source code on a supported kernel
-with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
diff --git a/Documentation/networking/device_drivers/intel/ixgb.rst b/Documentation/networking/device_drivers/intel/ixgb.rst
deleted file mode 100644
index ab624f1a44a8..000000000000
--- a/Documentation/networking/device_drivers/intel/ixgb.rst
+++ /dev/null
@@ -1,468 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0+
-
-=====================================================================
-Linux Base Driver for 10 Gigabit Intel(R) Ethernet Network Connection
-=====================================================================
-
-October 1, 2018
-
-
-Contents
-========
-
-- In This Release
-- Identifying Your Adapter
-- Command Line Parameters
-- Improving Performance
-- Additional Configurations
-- Known Issues/Troubleshooting
-- Support
-
-
-
-In This Release
-===============
-
-This file describes the ixgb Linux Base Driver for the 10 Gigabit Intel(R)
-Network Connection.  This driver includes support for Itanium(R)2-based
-systems.
-
-For questions related to hardware requirements, refer to the documentation
-supplied with your 10 Gigabit adapter.  All hardware requirements listed apply
-to use with Linux.
-
-The following features are available in this kernel:
- - Native VLANs
- - Channel Bonding (teaming)
- - SNMP
-
-Channel Bonding documentation can be found in the Linux kernel source:
-/Documentation/networking/bonding.rst
-
-The driver information previously displayed in the /proc filesystem is not
-supported in this release.  Alternatively, you can use ethtool (version 1.6
-or later), lspci, and iproute2 to obtain the same information.
-
-Instructions on updating ethtool can be found in the section "Additional
-Configurations" later in this document.
-
-
-Identifying Your Adapter
-========================
-
-The following Intel network adapters are compatible with the drivers in this
-release:
-
-+------------+------------------------------+----------------------------------+
-| Controller | Adapter Name                 | Physical Layer                   |
-+============+==============================+==================================+
-| 82597EX    | Intel(R) PRO/10GbE LR/SR/CX4 | - 10G Base-LR (fiber)            |
-|            | Server Adapters              | - 10G Base-SR (fiber)            |
-|            |                              | - 10G Base-CX4 (copper)          |
-+------------+------------------------------+----------------------------------+
-
-For more information on how to identify your adapter, go to the Adapter &
-Driver ID Guide at:
-
-    https://support.intel.com
-
-
-Command Line Parameters
-=======================
-
-If the driver is built as a module, the  following optional parameters are
-used by entering them on the command line with the modprobe command using
-this syntax::
-
-    modprobe ixgb [<option>=<VAL1>,<VAL2>,...]
-
-For example, with two 10GbE PCI adapters, entering::
-
-    modprobe ixgb TxDescriptors=80,128
-
-loads the ixgb driver with 80 TX resources for the first adapter and 128 TX
-resources for the second adapter.
-
-The default value for each parameter is generally the recommended setting,
-unless otherwise noted.
-
-Copybreak
----------
-:Valid Range: 0-XXXX
-:Default Value: 256
-
-    This is the maximum size of packet that is copied to a new buffer on
-    receive.
-
-Debug
------
-:Valid Range: 0-16 (0=none,...,16=all)
-:Default Value: 0
-
-    This parameter adjusts the level of debug messages displayed in the
-    system logs.
-
-FlowControl
------------
-:Valid Range: 0-3 (0=none, 1=Rx only, 2=Tx only, 3=Rx&Tx)
-:Default Value: 1 if no EEPROM, otherwise read from EEPROM
-
-    This parameter controls the automatic generation(Tx) and response(Rx) to
-    Ethernet PAUSE frames.  There are hardware bugs associated with enabling
-    Tx flow control so beware.
-
-RxDescriptors
--------------
-:Valid Range: 64-4096
-:Default Value: 1024
-
-    This value is the number of receive descriptors allocated by the driver.
-    Increasing this value allows the driver to buffer more incoming packets.
-    Each descriptor is 16 bytes.  A receive buffer is also allocated for
-    each descriptor and can be either 2048, 4056, 8192, or 16384 bytes,
-    depending on the MTU setting.  When the MTU size is 1500 or less, the
-    receive buffer size is 2048 bytes. When the MTU is greater than 1500 the
-    receive buffer size will be either 4056, 8192, or 16384 bytes.  The
-    maximum MTU size is 16114.
-
-TxDescriptors
--------------
-:Valid Range: 64-4096
-:Default Value: 256
-
-    This value is the number of transmit descriptors allocated by the driver.
-    Increasing this value allows the driver to queue more transmits.  Each
-    descriptor is 16 bytes.
-
-RxIntDelay
-----------
-:Valid Range: 0-65535 (0=off)
-:Default Value: 72
-
-    This value delays the generation of receive interrupts in units of
-    0.8192 microseconds.  Receive interrupt reduction can improve CPU
-    efficiency if properly tuned for specific network traffic.  Increasing
-    this value adds extra latency to frame reception and can end up
-    decreasing the throughput of TCP traffic.  If the system is reporting
-    dropped receives, this value may be set too high, causing the driver to
-    run out of available receive descriptors.
-
-TxIntDelay
-----------
-:Valid Range: 0-65535 (0=off)
-:Default Value: 32
-
-    This value delays the generation of transmit interrupts in units of
-    0.8192 microseconds.  Transmit interrupt reduction can improve CPU
-    efficiency if properly tuned for specific network traffic.  Increasing
-    this value adds extra latency to frame transmission and can end up
-    decreasing the throughput of TCP traffic.  If this value is set too high,
-    it will cause the driver to run out of available transmit descriptors.
-
-XsumRX
-------
-:Valid Range: 0-1
-:Default Value: 1
-
-    A value of '1' indicates that the driver should enable IP checksum
-    offload for received packets (both UDP and TCP) to the adapter hardware.
-
-RxFCHighThresh
---------------
-:Valid Range: 1,536-262,136 (0x600 - 0x3FFF8, 8 byte granularity)
-:Default Value: 196,608 (0x30000)
-
-    Receive Flow control high threshold (when we send a pause frame)
-
-RxFCLowThresh
--------------
-:Valid Range: 64-262,136 (0x40 - 0x3FFF8, 8 byte granularity)
-:Default Value: 163,840 (0x28000)
-
-    Receive Flow control low threshold (when we send a resume frame)
-
-FCReqTimeout
-------------
-:Valid Range: 1-65535
-:Default Value: 65535
-
-    Flow control request timeout (how long to pause the link partner's tx)
-
-IntDelayEnable
---------------
-:Value Range: 0,1
-:Default Value: 1
-
-    Interrupt Delay, 0 disables transmit interrupt delay and 1 enables it.
-
-
-Improving Performance
-=====================
-
-With the 10 Gigabit server adapters, the default Linux configuration will
-very likely limit the total available throughput artificially.  There is a set
-of configuration changes that, when applied together, will increase the ability
-of Linux to transmit and receive data.  The following enhancements were
-originally acquired from settings published at http://www.spec.org/web99/ for
-various submitted results using Linux.
-
-NOTE:
-  These changes are only suggestions, and serve as a starting point for
-  tuning your network performance.
-
-The changes are made in three major ways, listed in order of greatest effect:
-
-- Use ip link to modify the mtu (maximum transmission unit) and the txqueuelen
-  parameter.
-- Use sysctl to modify /proc parameters (essentially kernel tuning)
-- Use setpci to modify the MMRBC field in PCI-X configuration space to increase
-  transmit burst lengths on the bus.
-
-NOTE:
-  setpci modifies the adapter's configuration registers to allow it to read
-  up to 4k bytes at a time (for transmits).  However, for some systems the
-  behavior after modifying this register may be undefined (possibly errors of
-  some kind).  A power-cycle, hard reset or explicitly setting the e6 register
-  back to 22 (setpci -d 8086:1a48 e6.b=22) may be required to get back to a
-  stable configuration.
-
-- COPY these lines and paste them into ixgb_perf.sh:
-
-::
-
-  #!/bin/bash
-  echo "configuring network performance , edit this file to change the interface
-  or device ID of 10GbE card"
-  # set mmrbc to 4k reads, modify only Intel 10GbE device IDs
-  # replace 1a48 with appropriate 10GbE device's ID installed on the system,
-  # if needed.
-  setpci -d 8086:1a48 e6.b=2e
-  # set the MTU (max transmission unit) - it requires your switch and clients
-  # to change as well.
-  # set the txqueuelen
-  # your ixgb adapter should be loaded as eth1 for this to work, change if needed
-  ip li set dev eth1 mtu 9000 txqueuelen 1000 up
-  # call the sysctl utility to modify /proc/sys entries
-  sysctl -p ./sysctl_ixgb.conf
-
-- COPY these lines and paste them into sysctl_ixgb.conf:
-
-::
-
-  # some of the defaults may be different for your kernel
-  # call this file with sysctl -p <this file>
-  # these are just suggested values that worked well to increase throughput in
-  # several network benchmark tests, your mileage may vary
-
-  ### IPV4 specific settings
-  # turn TCP timestamp support off, default 1, reduces CPU use
-  net.ipv4.tcp_timestamps = 0
-  # turn SACK support off, default on
-  # on systems with a VERY fast bus -> memory interface this is the big gainer
-  net.ipv4.tcp_sack = 0
-  # set min/default/max TCP read buffer, default 4096 87380 174760
-  net.ipv4.tcp_rmem = 10000000 10000000 10000000
-  # set min/pressure/max TCP write buffer, default 4096 16384 131072
-  net.ipv4.tcp_wmem = 10000000 10000000 10000000
-  # set min/pressure/max TCP buffer space, default 31744 32256 32768
-  net.ipv4.tcp_mem = 10000000 10000000 10000000
-
-  ### CORE settings (mostly for socket and UDP effect)
-  # set maximum receive socket buffer size, default 131071
-  net.core.rmem_max = 524287
-  # set maximum send socket buffer size, default 131071
-  net.core.wmem_max = 524287
-  # set default receive socket buffer size, default 65535
-  net.core.rmem_default = 524287
-  # set default send socket buffer size, default 65535
-  net.core.wmem_default = 524287
-  # set maximum amount of option memory buffers, default 10240
-  net.core.optmem_max = 524287
-  # set number of unprocessed input packets before kernel starts dropping them; default 300
-  net.core.netdev_max_backlog = 300000
-
-Edit the ixgb_perf.sh script if necessary to change eth1 to whatever interface
-your ixgb driver is using and/or replace '1a48' with appropriate 10GbE device's
-ID installed on the system.
-
-NOTE:
-  Unless these scripts are added to the boot process, these changes will
-  only last only until the next system reboot.
-
-
-Resolving Slow UDP Traffic
---------------------------
-If your server does not seem to be able to receive UDP traffic as fast as it
-can receive TCP traffic, it could be because Linux, by default, does not set
-the network stack buffers as large as they need to be to support high UDP
-transfer rates.  One way to alleviate this problem is to allow more memory to
-be used by the IP stack to store incoming data.
-
-For instance, use the commands::
-
-    sysctl -w net.core.rmem_max=262143
-
-and::
-
-    sysctl -w net.core.rmem_default=262143
-
-to increase the read buffer memory max and default to 262143 (256k - 1) from
-defaults of max=131071 (128k - 1) and default=65535 (64k - 1).  These variables
-will increase the amount of memory used by the network stack for receives, and
-can be increased significantly more if necessary for your application.
-
-
-Additional Configurations
-=========================
-
-Configuring the Driver on Different Distributions
--------------------------------------------------
-Configuring a network driver to load properly when the system is started is
-distribution dependent. Typically, the configuration process involves adding
-an alias line to /etc/modprobe.conf as well as editing other system startup
-scripts and/or configuration files.  Many popular Linux distributions ship
-with tools to make these changes for you.  To learn the proper way to
-configure a network device for your system, refer to your distribution
-documentation.  If during this process you are asked for the driver or module
-name, the name for the Linux Base Driver for the Intel 10GbE Family of
-Adapters is ixgb.
-
-Viewing Link Messages
----------------------
-Link messages will not be displayed to the console if the distribution is
-restricting system messages. In order to see network driver link messages on
-your console, set dmesg to eight by entering the following::
-
-    dmesg -n 8
-
-NOTE: This setting is not saved across reboots.
-
-Jumbo Frames
-------------
-The driver supports Jumbo Frames for all adapters. Jumbo Frames support is
-enabled by changing the MTU to a value larger than the default of 1500.
-The maximum value for the MTU is 16114.  Use the ip command to
-increase the MTU size.  For example::
-
-    ip li set dev ethx mtu 9000
-
-The maximum MTU setting for Jumbo Frames is 16114.  This value coincides
-with the maximum Jumbo Frames size of 16128.
-
-Ethtool
--------
-The driver utilizes the ethtool interface for driver configuration and
-diagnostics, as well as displaying statistical information.  The ethtool
-version 1.6 or later is required for this functionality.
-
-The latest release of ethtool can be found from
-https://www.kernel.org/pub/software/network/ethtool/
-
-NOTE:
-  The ethtool version 1.6 only supports a limited set of ethtool options.
-  Support for a more complete ethtool feature set can be enabled by
-  upgrading to the latest version.
-
-NAPI
-----
-NAPI (Rx polling mode) is supported in the ixgb driver.
-
-See https://wiki.linuxfoundation.org/networking/napi for more information on
-NAPI.
-
-
-Known Issues/Troubleshooting
-============================
-
-NOTE:
-  After installing the driver, if your Intel Network Connection is not
-  working, verify in the "In This Release" section of the readme that you have
-  installed the correct driver.
-
-Cable Interoperability Issue with Fujitsu XENPAK Module in SmartBits Chassis
-----------------------------------------------------------------------------
-Excessive CRC errors may be observed if the Intel(R) PRO/10GbE CX4
-Server adapter is connected to a Fujitsu XENPAK CX4 module in a SmartBits
-chassis using 15 m/24AWG cable assemblies manufactured by Fujitsu or Leoni.
-The CRC errors may be received either by the Intel(R) PRO/10GbE CX4
-Server adapter or the SmartBits. If this situation occurs using a different
-cable assembly may resolve the issue.
-
-Cable Interoperability Issues with HP Procurve 3400cl Switch Port
------------------------------------------------------------------
-Excessive CRC errors may be observed if the Intel(R) PRO/10GbE CX4 Server
-adapter is connected to an HP Procurve 3400cl switch port using short cables
-(1 m or shorter). If this situation occurs, using a longer cable may resolve
-the issue.
-
-Excessive CRC errors may be observed using Fujitsu 24AWG cable assemblies that
-Are 10 m or longer or where using a Leoni 15 m/24AWG cable assembly. The CRC
-errors may be received either by the CX4 Server adapter or at the switch. If
-this situation occurs, using a different cable assembly may resolve the issue.
-
-Jumbo Frames System Requirement
--------------------------------
-Memory allocation failures have been observed on Linux systems with 64 MB
-of RAM or less that are running Jumbo Frames.  If you are using Jumbo
-Frames, your system may require more than the advertised minimum
-requirement of 64 MB of system memory.
-
-Performance Degradation with Jumbo Frames
------------------------------------------
-Degradation in throughput performance may be observed in some Jumbo frames
-environments.  If this is observed, increasing the application's socket buffer
-size and/or increasing the /proc/sys/net/ipv4/tcp_*mem entry values may help.
-See the specific application manual and /usr/src/linux*/Documentation/
-networking/ip-sysctl.txt for more details.
-
-Allocating Rx Buffers when Using Jumbo Frames
----------------------------------------------
-Allocating Rx buffers when using Jumbo Frames on 2.6.x kernels may fail if
-the available memory is heavily fragmented. This issue may be seen with PCI-X
-adapters or with packet split disabled. This can be reduced or eliminated
-by changing the amount of available memory for receive buffer allocation, by
-increasing /proc/sys/vm/min_free_kbytes.
-
-Multiple Interfaces on Same Ethernet Broadcast Network
-------------------------------------------------------
-Due to the default ARP behavior on Linux, it is not possible to have
-one system on two IP networks in the same Ethernet broadcast domain
-(non-partitioned switch) behave as expected.  All Ethernet interfaces
-will respond to IP traffic for any IP address assigned to the system.
-This results in unbalanced receive traffic.
-
-If you have multiple interfaces in a server, do either of the following:
-
-  - Turn on ARP filtering by entering::
-
-      echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
-
-  - Install the interfaces in separate broadcast domains - either in
-    different switches or in a switch partitioned to VLANs.
-
-UDP Stress Test Dropped Packet Issue
---------------------------------------
-Under small packets UDP stress test with 10GbE driver, the Linux system
-may drop UDP packets due to the fullness of socket buffers. You may want
-to change the driver's Flow Control variables to the minimum value for
-controlling packet reception.
-
-Tx Hangs Possible Under Stress
-------------------------------
-Under stress conditions, if TX hangs occur, turning off TSO
-"ethtool -K eth0 tso off" may resolve the problem.
-
-
-Support
-=======
-For general information, go to the Intel support website at:
-
-https://www.intel.com/support/
-
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
-If an issue is identified with the released source code on a supported kernel
-with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net
diff --git a/Documentation/networking/device_drivers/marvell/octeontx2.rst b/Documentation/networking/device_drivers/marvell/octeontx2.rst
deleted file mode 100644
index 88f508338c5f..000000000000
--- a/Documentation/networking/device_drivers/marvell/octeontx2.rst
+++ /dev/null
@@ -1,159 +0,0 @@
-.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
-
-====================================
-Marvell OcteonTx2 RVU Kernel Drivers
-====================================
-
-Copyright (c) 2020 Marvell International Ltd.
-
-Contents
-========
-
-- `Overview`_
-- `Drivers`_
-- `Basic packet flow`_
-
-Overview
-========
-
-Resource virtualization unit (RVU) on Marvell's OcteonTX2 SOC maps HW
-resources from the network, crypto and other functional blocks into
-PCI-compatible physical and virtual functions. Each functional block
-again has multiple local functions (LFs) for provisioning to PCI devices.
-RVU supports multiple PCIe SRIOV physical functions (PFs) and virtual
-functions (VFs). PF0 is called the administrative / admin function (AF)
-and has privileges to provision RVU functional block's LFs to each of the
-PF/VF.
-
-RVU managed networking functional blocks
- - Network pool or buffer allocator (NPA)
- - Network interface controller (NIX)
- - Network parser CAM (NPC)
- - Schedule/Synchronize/Order unit (SSO)
- - Loopback interface (LBK)
-
-RVU managed non-networking functional blocks
- - Crypto accelerator (CPT)
- - Scheduled timers unit (TIM)
- - Schedule/Synchronize/Order unit (SSO)
-   Used for both networking and non networking usecases
-
-Resource provisioning examples
- - A PF/VF with NIX-LF & NPA-LF resources works as a pure network device
- - A PF/VF with CPT-LF resource works as a pure crypto offload device.
-
-RVU functional blocks are highly configurable as per software requirements.
-
-Firmware setups following stuff before kernel boots
- - Enables required number of RVU PFs based on number of physical links.
- - Number of VFs per PF are either static or configurable at compile time.
-   Based on config, firmware assigns VFs to each of the PFs.
- - Also assigns MSIX vectors to each of PF and VFs.
- - These are not changed after kernel boot.
-
-Drivers
-=======
-
-Linux kernel will have multiple drivers registering to different PF and VFs
-of RVU. Wrt networking there will be 3 flavours of drivers.
-
-Admin Function driver
----------------------
-
-As mentioned above RVU PF0 is called the admin function (AF), this driver
-supports resource provisioning and configuration of functional blocks.
-Doesn't handle any I/O. It sets up few basic stuff but most of the
-funcionality is achieved via configuration requests from PFs and VFs.
-
-PF/VFs communicates with AF via a shared memory region (mailbox). Upon
-receiving requests AF does resource provisioning and other HW configuration.
-AF is always attached to host kernel, but PFs and their VFs may be used by host
-kernel itself, or attached to VMs or to userspace applications like
-DPDK etc. So AF has to handle provisioning/configuration requests sent
-by any device from any domain.
-
-AF driver also interacts with underlying firmware to
- - Manage physical ethernet links ie CGX LMACs.
- - Retrieve information like speed, duplex, autoneg etc
- - Retrieve PHY EEPROM and stats.
- - Configure FEC, PAM modes
- - etc
-
-From pure networking side AF driver supports following functionality.
- - Map a physical link to a RVU PF to which a netdev is registered.
- - Attach NIX and NPA block LFs to RVU PF/VF which provide buffer pools, RQs, SQs
-   for regular networking functionality.
- - Flow control (pause frames) enable/disable/config.
- - HW PTP timestamping related config.
- - NPC parser profile config, basically how to parse pkt and what info to extract.
- - NPC extract profile config, what to extract from the pkt to match data in MCAM entries.
- - Manage NPC MCAM entries, upon request can frame and install requested packet forwarding rules.
- - Defines receive side scaling (RSS) algorithms.
- - Defines segmentation offload algorithms (eg TSO)
- - VLAN stripping, capture and insertion config.
- - SSO and TIM blocks config which provide packet scheduling support.
- - Debugfs support, to check current resource provising, current status of
-   NPA pools, NIX RQ, SQ and CQs, various stats etc which helps in debugging issues.
- - And many more.
-
-Physical Function driver
-------------------------
-
-This RVU PF handles IO, is mapped to a physical ethernet link and this
-driver registers a netdev. This supports SR-IOV. As said above this driver
-communicates with AF with a mailbox. To retrieve information from physical
-links this driver talks to AF and AF gets that info from firmware and responds
-back ie cannot talk to firmware directly.
-
-Supports ethtool for configuring links, RSS, queue count, queue size,
-flow control, ntuple filters, dump PHY EEPROM, config FEC etc.
-
-Virtual Function driver
------------------------
-
-There are two types VFs, VFs that share the physical link with their parent
-SR-IOV PF and the VFs which work in pairs using internal HW loopback channels (LBK).
-
-Type1:
- - These VFs and their parent PF share a physical link and used for outside communication.
- - VFs cannot communicate with AF directly, they send mbox message to PF and PF
-   forwards that to AF. AF after processing, responds back to PF and PF forwards
-   the reply to VF.
- - From functionality point of view there is no difference between PF and VF as same type
-   HW resources are attached to both. But user would be able to configure few stuff only
-   from PF as PF is treated as owner/admin of the link.
-
-Type2:
- - RVU PF0 ie admin function creates these VFs and maps them to loopback block's channels.
- - A set of two VFs (VF0 & VF1, VF2 & VF3 .. so on) works as a pair ie pkts sent out of
-   VF0 will be received by VF1 and viceversa.
- - These VFs can be used by applications or virtual machines to communicate between them
-   without sending traffic outside. There is no switch present in HW, hence the support
-   for loopback VFs.
- - These communicate directly with AF (PF0) via mbox.
-
-Except for the IO channels or links used for packet reception and transmission there is
-no other difference between these VF types. AF driver takes care of IO channel mapping,
-hence same VF driver works for both types of devices.
-
-Basic packet flow
-=================
-
-Ingress
--------
-
-1. CGX LMAC receives packet.
-2. Forwards the packet to the NIX block.
-3. Then submitted to NPC block for parsing and then MCAM lookup to get the destination RVU device.
-4. NIX LF attached to the destination RVU device allocates a buffer from RQ mapped buffer pool of NPA block LF.
-5. RQ may be selected by RSS or by configuring MCAM rule with a RQ number.
-6. Packet is DMA'ed and driver is notified.
-
-Egress
-------
-
-1. Driver prepares a send descriptor and submits to SQ for transmission.
-2. The SQ is already configured (by AF) to transmit on a specific link/channel.
-3. The SQ descriptor ring is maintained in buffers allocated from SQ mapped pool of NPA block LF.
-4. NIX block transmits the pkt on the designated channel.
-5. NPC MCAM entries can be installed to divert pkt onto a different channel.
diff --git a/Documentation/networking/device_drivers/mellanox/mlx5.rst b/Documentation/networking/device_drivers/mellanox/mlx5.rst
deleted file mode 100644
index e9b65035cd47..000000000000
--- a/Documentation/networking/device_drivers/mellanox/mlx5.rst
+++ /dev/null
@@ -1,321 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
-
-=================================================
-Mellanox ConnectX(R) mlx5 core VPI Network Driver
-=================================================
-
-Copyright (c) 2019, Mellanox Technologies LTD.
-
-Contents
-========
-
-- `Enabling the driver and kconfig options`_
-- `Devlink info`_
-- `Devlink parameters`_
-- `Devlink health reporters`_
-- `mlx5 tracepoints`_
-
-Enabling the driver and kconfig options
-================================================
-
-| mlx5 core is modular and most of the major mlx5 core driver features can be selected (compiled in/out)
-| at build time via kernel Kconfig flags.
-| Basic features, ethernet net device rx/tx offloads and XDP, are available with the most basic flags
-| CONFIG_MLX5_CORE=y/m and CONFIG_MLX5_CORE_EN=y.
-| For the list of advanced features please see below.
-
-**CONFIG_MLX5_CORE=(y/m/n)** (module mlx5_core.ko)
-
-|    The driver can be enabled by choosing CONFIG_MLX5_CORE=y/m in kernel config.
-|    This will provide mlx5 core driver for mlx5 ulps to interface with (mlx5e, mlx5_ib).
-
-
-**CONFIG_MLX5_CORE_EN=(y/n)**
-
-|    Choosing this option will allow basic ethernet netdevice support with all of the standard rx/tx offloads.
-|    mlx5e is the mlx5 ulp driver which provides netdevice kernel interface, when chosen, mlx5e will be
-|    built-in into mlx5_core.ko.
-
-
-**CONFIG_MLX5_EN_ARFS=(y/n)**
-
-|     Enables Hardware-accelerated receive flow steering (arfs) support, and ntuple filtering.
-|     https://community.mellanox.com/s/article/howto-configure-arfs-on-connectx-4
-
-
-**CONFIG_MLX5_EN_RXNFC=(y/n)**
-
-|    Enables ethtool receive network flow classification, which allows user defined
-|    flow rules to direct traffic into arbitrary rx queue via ethtool set/get_rxnfc API.
-
-
-**CONFIG_MLX5_CORE_EN_DCB=(y/n)**:
-
-|    Enables `Data Center Bridging (DCB) Support <https://community.mellanox.com/s/article/howto-auto-config-pfc-and-ets-on-connectx-4-via-lldp-dcbx>`_.
-
-
-**CONFIG_MLX5_MPFS=(y/n)**
-
-|    Ethernet Multi-Physical Function Switch (MPFS) support in ConnectX NIC.
-|    MPFs is required for when `Multi-Host <http://www.mellanox.com/page/multihost>`_ configuration is enabled to allow passing
-|    user configured unicast MAC addresses to the requesting PF.
-
-
-**CONFIG_MLX5_ESWITCH=(y/n)**
-
-|    Ethernet SRIOV E-Switch support in ConnectX NIC. E-Switch provides internal SRIOV packet steering
-|    and switching for the enabled VFs and PF in two available modes:
-|           1) `Legacy SRIOV mode (L2 mac vlan steering based) <https://community.mellanox.com/s/article/howto-configure-sr-iov-for-connectx-4-connectx-5-with-kvm--ethernet-x>`_.
-|           2) `Switchdev mode (eswitch offloads) <https://www.mellanox.com/related-docs/prod_software/ASAP2_Hardware_Offloading_for_vSwitches_User_Manual_v4.4.pdf>`_.
-
-
-**CONFIG_MLX5_CORE_IPOIB=(y/n)**
-
-|    IPoIB offloads & acceleration support.
-|    Requires CONFIG_MLX5_CORE_EN to provide an accelerated interface for the rdma
-|    IPoIB ulp netdevice.
-
-
-**CONFIG_MLX5_FPGA=(y/n)**
-
-|    Build support for the Innova family of network cards by Mellanox Technologies.
-|    Innova network cards are comprised of a ConnectX chip and an FPGA chip on one board.
-|    If you select this option, the mlx5_core driver will include the Innova FPGA core and allow
-|    building sandbox-specific client drivers.
-
-
-**CONFIG_MLX5_EN_IPSEC=(y/n)**
-
-|    Enables `IPSec XFRM cryptography-offload accelaration <http://www.mellanox.com/related-docs/prod_software/Mellanox_Innova_IPsec_Ethernet_Adapter_Card_User_Manual.pdf>`_.
-
-**CONFIG_MLX5_EN_TLS=(y/n)**
-
-|   TLS cryptography-offload accelaration.
-
-
-**CONFIG_MLX5_INFINIBAND=(y/n/m)** (module mlx5_ib.ko)
-
-|   Provides low-level InfiniBand/RDMA and `RoCE <https://community.mellanox.com/s/article/recommended-network-configuration-examples-for-roce-deployment>`_ support.
-
-
-**External options** ( Choose if the corresponding mlx5 feature is required )
-
-- CONFIG_PTP_1588_CLOCK: When chosen, mlx5 ptp support will be enabled
-- CONFIG_VXLAN: When chosen, mlx5 vxlan support will be enabled.
-- CONFIG_MLXFW: When chosen, mlx5 firmware flashing support will be enabled (via devlink and ethtool).
-
-Devlink info
-============
-
-The devlink info reports the running and stored firmware versions on device.
-It also prints the device PSID which represents the HCA board type ID.
-
-User command example::
-
-   $ devlink dev info pci/0000:00:06.0
-      pci/0000:00:06.0:
-      driver mlx5_core
-      versions:
-         fixed:
-            fw.psid MT_0000000009
-         running:
-            fw.version 16.26.0100
-         stored:
-            fw.version 16.26.0100
-
-Devlink parameters
-==================
-
-flow_steering_mode: Device flow steering mode
----------------------------------------------
-The flow steering mode parameter controls the flow steering mode of the driver.
-Two modes are supported:
-1. 'dmfs' - Device managed flow steering.
-2. 'smfs  - Software/Driver managed flow steering.
-
-In DMFS mode, the HW steering entities are created and managed through the
-Firmware.
-In SMFS mode, the HW steering entities are created and managed though by
-the driver directly into Hardware without firmware intervention.
-
-SMFS mode is faster and provides better rule inserstion rate compared to default DMFS mode.
-
-User command examples:
-
-- Set SMFS flow steering mode::
-
-    $ devlink dev param set pci/0000:06:00.0 name flow_steering_mode value "smfs" cmode runtime
-
-- Read device flow steering mode::
-
-    $ devlink dev param show pci/0000:06:00.0 name flow_steering_mode
-      pci/0000:06:00.0:
-      name flow_steering_mode type driver-specific
-      values:
-         cmode runtime value smfs
-
-enable_roce: RoCE enablement state
-----------------------------------
-RoCE enablement state controls driver support for RoCE traffic.
-When RoCE is disabled, there is no gid table, only raw ethernet QPs are supported and traffic on the well known UDP RoCE port is handled as raw ethernet traffic.
-
-To change RoCE enablement state a user must change the driverinit cmode value and run devlink reload.
-
-User command examples:
-
-- Disable RoCE::
-
-    $ devlink dev param set pci/0000:06:00.0 name enable_roce value false cmode driverinit
-    $ devlink dev reload pci/0000:06:00.0
-
-- Read RoCE enablement state::
-
-    $ devlink dev param show pci/0000:06:00.0 name enable_roce
-      pci/0000:06:00.0:
-      name enable_roce type generic
-      values:
-         cmode driverinit value true
-
-Devlink health reporters
-========================
-
-tx reporter
------------
-The tx reporter is responsible for reporting and recovering of the following two error scenarios:
-
-- TX timeout
-    Report on kernel tx timeout detection.
-    Recover by searching lost interrupts.
-- TX error completion
-    Report on error tx completion.
-    Recover by flushing the TX queue and reset it.
-
-TX reporter also support on demand diagnose callback, on which it provides
-real time information of its send queues status.
-
-User commands examples:
-
-- Diagnose send queues status::
-
-    $ devlink health diagnose pci/0000:82:00.0 reporter tx
-
-NOTE: This command has valid output only when interface is up, otherwise the command has empty output.
-
-- Show number of tx errors indicated, number of recover flows ended successfully,
-  is autorecover enabled and graceful period from last recover::
-
-    $ devlink health show pci/0000:82:00.0 reporter tx
-
-rx reporter
------------
-The rx reporter is responsible for reporting and recovering of the following two error scenarios:
-
-- RX queues initialization (population) timeout
-    RX queues descriptors population on ring initialization is done in
-    napi context via triggering an irq, in case of a failure to get
-    the minimum amount of descriptors, a timeout would occur and it
-    could be recoverable by polling the EQ (Event Queue).
-- RX completions with errors (reported by HW on interrupt context)
-    Report on rx completion error.
-    Recover (if needed) by flushing the related queue and reset it.
-
-RX reporter also supports on demand diagnose callback, on which it
-provides real time information of its receive queues status.
-
-- Diagnose rx queues status, and corresponding completion queue::
-
-    $ devlink health diagnose pci/0000:82:00.0 reporter rx
-
-NOTE: This command has valid output only when interface is up, otherwise the command has empty output.
-
-- Show number of rx errors indicated, number of recover flows ended successfully,
-  is autorecover enabled and graceful period from last recover::
-
-    $ devlink health show pci/0000:82:00.0 reporter rx
-
-fw reporter
------------
-The fw reporter implements diagnose and dump callbacks.
-It follows symptoms of fw error such as fw syndrome by triggering
-fw core dump and storing it into the dump buffer.
-The fw reporter diagnose command can be triggered any time by the user to check
-current fw status.
-
-User commands examples:
-
-- Check fw heath status::
-
-    $ devlink health diagnose pci/0000:82:00.0 reporter fw
-
-- Read FW core dump if already stored or trigger new one::
-
-    $ devlink health dump show pci/0000:82:00.0 reporter fw
-
-NOTE: This command can run only on the PF which has fw tracer ownership,
-running it on other PF or any VF will return "Operation not permitted".
-
-fw fatal reporter
------------------
-The fw fatal reporter implements dump and recover callbacks.
-It follows fatal errors indications by CR-space dump and recover flow.
-The CR-space dump uses vsc interface which is valid even if the FW command
-interface is not functional, which is the case in most FW fatal errors.
-The recover function runs recover flow which reloads the driver and triggers fw
-reset if needed.
-
-User commands examples:
-
-- Run fw recover flow manually::
-
-    $ devlink health recover pci/0000:82:00.0 reporter fw_fatal
-
-- Read FW CR-space dump if already strored or trigger new one::
-
-    $ devlink health dump show pci/0000:82:00.1 reporter fw_fatal
-
-NOTE: This command can run only on PF.
-
-mlx5 tracepoints
-================
-
-mlx5 driver provides internal trace points for tracking and debugging using
-kernel tracepoints interfaces (refer to Documentation/trace/ftrace.rst).
-
-For the list of support mlx5 events check /sys/kernel/debug/tracing/events/mlx5/
-
-tc and eswitch offloads tracepoints:
-
-- mlx5e_configure_flower: trace flower filter actions and cookies offloaded to mlx5::
-
-    $ echo mlx5:mlx5e_configure_flower >> /sys/kernel/debug/tracing/set_event
-    $ cat /sys/kernel/debug/tracing/trace
-    ...
-    tc-6535  [019] ...1  2672.404466: mlx5e_configure_flower: cookie=0000000067874a55 actions= REDIRECT
-
-- mlx5e_delete_flower: trace flower filter actions and cookies deleted from mlx5::
-
-    $ echo mlx5:mlx5e_delete_flower >> /sys/kernel/debug/tracing/set_event
-    $ cat /sys/kernel/debug/tracing/trace
-    ...
-    tc-6569  [010] .N.1  2686.379075: mlx5e_delete_flower: cookie=0000000067874a55 actions= NULL
-
-- mlx5e_stats_flower: trace flower stats request::
-
-    $ echo mlx5:mlx5e_stats_flower >> /sys/kernel/debug/tracing/set_event
-    $ cat /sys/kernel/debug/tracing/trace
-    ...
-    tc-6546  [010] ...1  2679.704889: mlx5e_stats_flower: cookie=0000000060eb3d6a bytes=0 packets=0 lastused=4295560217
-
-- mlx5e_tc_update_neigh_used_value: trace tunnel rule neigh update value offloaded to mlx5::
-
-    $ echo mlx5:mlx5e_tc_update_neigh_used_value >> /sys/kernel/debug/tracing/set_event
-    $ cat /sys/kernel/debug/tracing/trace
-    ...
-    kworker/u48:4-8806  [009] ...1 55117.882428: mlx5e_tc_update_neigh_used_value: netdev: ens1f0 IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_used=1
-
-- mlx5e_rep_neigh_update: trace neigh update tasks scheduled due to neigh state change events::
-
-    $ echo mlx5:mlx5e_rep_neigh_update >> /sys/kernel/debug/tracing/set_event
-    $ cat /sys/kernel/debug/tracing/trace
-    ...
-    kworker/u48:7-2221  [009] ...1  1475.387435: mlx5e_rep_neigh_update: netdev: ens1f0 MAC: 24:8a:07:9a:17:9a IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_connected=1
diff --git a/Documentation/networking/device_drivers/neterion/vxge.rst b/Documentation/networking/device_drivers/neterion/vxge.rst
deleted file mode 100644
index 589c6b15c63d..000000000000
--- a/Documentation/networking/device_drivers/neterion/vxge.rst
+++ /dev/null
@@ -1,115 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-==============================================================================
-Neterion's (Formerly S2io) X3100 Series 10GbE PCIe Server Adapter Linux driver
-==============================================================================
-
-.. Contents
-
-  1) Introduction
-  2) Features supported
-  3) Configurable driver parameters
-  4) Troubleshooting
-
-1. Introduction
-===============
-
-This Linux driver supports all Neterion's X3100 series 10 GbE PCIe I/O
-Virtualized Server adapters.
-
-The X3100 series supports four modes of operation, configurable via
-firmware:
-
-	- Single function mode
-	- Multi function mode
-	- SRIOV mode
-	- MRIOV mode
-
-The functions share a 10GbE link and the pci-e bus, but hardly anything else
-inside the ASIC. Features like independent hw reset, statistics, bandwidth/
-priority allocation and guarantees, GRO, TSO, interrupt moderation etc are
-supported independently on each function.
-
-(See below for a complete list of features supported for both IPv4 and IPv6)
-
-2. Features supported
-=====================
-
-i)   Single function mode (up to 17 queues)
-
-ii)  Multi function mode (up to 17 functions)
-
-iii) PCI-SIG's I/O Virtualization
-
-       - Single Root mode: v1.0 (up to 17 functions)
-       - Multi-Root mode: v1.0 (up to 17 functions)
-
-iv)  Jumbo frames
-
-       X3100 Series supports MTU up to 9600 bytes, modifiable using
-       ip command.
-
-v)   Offloads supported: (Enabled by default)
-
-       - Checksum offload (TCP/UDP/IP) on transmit and receive paths
-       - TCP Segmentation Offload (TSO) on transmit path
-       - Generic Receive Offload (GRO) on receive path
-
-vi)  MSI-X: (Enabled by default)
-
-       Resulting in noticeable performance improvement (up to 7% on certain
-       platforms).
-
-vii) NAPI: (Enabled by default)
-
-       For better Rx interrupt moderation.
-
-viii)RTH (Receive Traffic Hash): (Enabled by default)
-
-       Receive side steering for better scaling.
-
-ix)  Statistics
-
-       Comprehensive MAC-level and software statistics displayed using
-       "ethtool -S" option.
-
-x)   Multiple hardware queues: (Enabled by default)
-
-       Up to 17 hardware based transmit and receive data channels, with
-       multiple steering options (transmit multiqueue enabled by default).
-
-3) Configurable driver parameters:
-----------------------------------
-
-i)  max_config_dev
-       Specifies maximum device functions to be enabled.
-
-       Valid range: 1-8
-
-ii) max_config_port
-       Specifies number of ports to be enabled.
-
-       Valid range: 1,2
-
-       Default: 1
-
-iii) max_config_vpath
-       Specifies maximum VPATH(s) configured for each device function.
-
-       Valid range: 1-17
-
-iv) vlan_tag_strip
-       Enables/disables vlan tag stripping from all received tagged frames that
-       are not replicated at the internal L2 switch.
-
-       Valid range: 0,1 (disabled, enabled respectively)
-
-       Default: 1
-
-v)  addr_learn_en
-       Enable learning the mac address of the guest OS interface in
-       virtualization environment.
-
-       Valid range: 0,1 (disabled, enabled respectively)
-
-       Default: 0
diff --git a/Documentation/networking/device_drivers/qlogic/LICENSE.qla3xxx b/Documentation/networking/device_drivers/qlogic/LICENSE.qla3xxx
deleted file mode 100644
index 2f2077e34d81..000000000000
--- a/Documentation/networking/device_drivers/qlogic/LICENSE.qla3xxx
+++ /dev/null
@@ -1,46 +0,0 @@
-Copyright (c)  2003-2006 QLogic Corporation
-QLogic Linux Networking HBA Driver
-
-This program includes a device driver for Linux 2.6 that may be
-distributed with QLogic hardware specific firmware binary file.
-You may modify and redistribute the device driver code under the
-GNU General Public License as published by the Free Software
-Foundation (version 2 or a later version).
-
-You may redistribute the hardware specific firmware binary file
-under the following terms:
-
-	1. Redistribution of source code (only if applicable),
-	   must retain the above copyright notice, this list of
-	   conditions and the following disclaimer.
-
-	2. Redistribution in binary form must reproduce the above
-	   copyright notice, this list of conditions and the
-	   following disclaimer in the documentation and/or other
-	   materials provided with the distribution.
-
-	3. The name of QLogic Corporation may not be used to
-	   endorse or promote products derived from this software
-	   without specific prior written permission
-
-REGARDLESS OF WHAT LICENSING MECHANISM IS USED OR APPLICABLE,
-THIS PROGRAM IS PROVIDED BY QLOGIC CORPORATION "AS IS'' AND ANY
-EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
-IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR
-BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
-TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
-DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
-ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
-OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
-OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
-POSSIBILITY OF SUCH DAMAGE.
-
-USER ACKNOWLEDGES AND AGREES THAT USE OF THIS PROGRAM WILL NOT
-CREATE OR GIVE GROUNDS FOR A LICENSE BY IMPLICATION, ESTOPPEL, OR
-OTHERWISE IN ANY INTELLECTUAL PROPERTY RIGHTS (PATENT, COPYRIGHT,
-TRADE SECRET, MASK WORK, OR OTHER PROPRIETARY RIGHT) EMBODIED IN
-ANY OTHER QLOGIC HARDWARE OR SOFTWARE EITHER SOLELY OR IN
-COMBINATION WITH THIS PROGRAM.
-
diff --git a/Documentation/networking/device_drivers/qlogic/LICENSE.qlcnic b/Documentation/networking/device_drivers/qlogic/LICENSE.qlcnic
deleted file mode 100644
index 2ae3b64983ab..000000000000
--- a/Documentation/networking/device_drivers/qlogic/LICENSE.qlcnic
+++ /dev/null
@@ -1,288 +0,0 @@
-Copyright (c) 2009-2013 QLogic Corporation
-QLogic Linux qlcnic NIC Driver
-
-You may modify and redistribute the device driver code under the
-GNU General Public License (a copy of which is attached hereto as
-Exhibit A) published by the Free Software Foundation (version 2).
-
-
-EXHIBIT A
-
-		    GNU GENERAL PUBLIC LICENSE
-		       Version 2, June 1991
-
- Copyright (C) 1989, 1991 Free Software Foundation, Inc.
- 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA
- Everyone is permitted to copy and distribute verbatim copies
- of this license document, but changing it is not allowed.
-
-			    Preamble
-
-  The licenses for most software are designed to take away your
-freedom to share and change it.  By contrast, the GNU General Public
-License is intended to guarantee your freedom to share and change free
-software--to make sure the software is free for all its users.  This
-General Public License applies to most of the Free Software
-Foundation's software and to any other program whose authors commit to
-using it.  (Some other Free Software Foundation software is covered by
-the GNU Lesser General Public License instead.)  You can apply it to
-your programs, too.
-
-  When we speak of free software, we are referring to freedom, not
-price.  Our General Public Licenses are designed to make sure that you
-have the freedom to distribute copies of free software (and charge for
-this service if you wish), that you receive source code or can get it
-if you want it, that you can change the software or use pieces of it
-in new free programs; and that you know you can do these things.
-
-  To protect your rights, we need to make restrictions that forbid
-anyone to deny you these rights or to ask you to surrender the rights.
-These restrictions translate to certain responsibilities for you if you
-distribute copies of the software, or if you modify it.
-
-  For example, if you distribute copies of such a program, whether
-gratis or for a fee, you must give the recipients all the rights that
-you have.  You must make sure that they, too, receive or can get the
-source code.  And you must show them these terms so they know their
-rights.
-
-  We protect your rights with two steps: (1) copyright the software, and
-(2) offer you this license which gives you legal permission to copy,
-distribute and/or modify the software.
-
-  Also, for each author's protection and ours, we want to make certain
-that everyone understands that there is no warranty for this free
-software.  If the software is modified by someone else and passed on, we
-want its recipients to know that what they have is not the original, so
-that any problems introduced by others will not reflect on the original
-authors' reputations.
-
-  Finally, any free program is threatened constantly by software
-patents.  We wish to avoid the danger that redistributors of a free
-program will individually obtain patent licenses, in effect making the
-program proprietary.  To prevent this, we have made it clear that any
-patent must be licensed for everyone's free use or not licensed at all.
-
-  The precise terms and conditions for copying, distribution and
-modification follow.
-
-		    GNU GENERAL PUBLIC LICENSE
-   TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
-
-  0. This License applies to any program or other work which contains
-a notice placed by the copyright holder saying it may be distributed
-under the terms of this General Public License.  The "Program", below,
-refers to any such program or work, and a "work based on the Program"
-means either the Program or any derivative work under copyright law:
-that is to say, a work containing the Program or a portion of it,
-either verbatim or with modifications and/or translated into another
-language.  (Hereinafter, translation is included without limitation in
-the term "modification".)  Each licensee is addressed as "you".
-
-Activities other than copying, distribution and modification are not
-covered by this License; they are outside its scope.  The act of
-running the Program is not restricted, and the output from the Program
-is covered only if its contents constitute a work based on the
-Program (independent of having been made by running the Program).
-Whether that is true depends on what the Program does.
-
-  1. You may copy and distribute verbatim copies of the Program's
-source code as you receive it, in any medium, provided that you
-conspicuously and appropriately publish on each copy an appropriate
-copyright notice and disclaimer of warranty; keep intact all the
-notices that refer to this License and to the absence of any warranty;
-and give any other recipients of the Program a copy of this License
-along with the Program.
-
-You may charge a fee for the physical act of transferring a copy, and
-you may at your option offer warranty protection in exchange for a fee.
-
-  2. You may modify your copy or copies of the Program or any portion
-of it, thus forming a work based on the Program, and copy and
-distribute such modifications or work under the terms of Section 1
-above, provided that you also meet all of these conditions:
-
-    a) You must cause the modified files to carry prominent notices
-    stating that you changed the files and the date of any change.
-
-    b) You must cause any work that you distribute or publish, that in
-    whole or in part contains or is derived from the Program or any
-    part thereof, to be licensed as a whole at no charge to all third
-    parties under the terms of this License.
-
-    c) If the modified program normally reads commands interactively
-    when run, you must cause it, when started running for such
-    interactive use in the most ordinary way, to print or display an
-    announcement including an appropriate copyright notice and a
-    notice that there is no warranty (or else, saying that you provide
-    a warranty) and that users may redistribute the program under
-    these conditions, and telling the user how to view a copy of this
-    License.  (Exception: if the Program itself is interactive but
-    does not normally print such an announcement, your work based on
-    the Program is not required to print an announcement.)
-
-These requirements apply to the modified work as a whole.  If
-identifiable sections of that work are not derived from the Program,
-and can be reasonably considered independent and separate works in
-themselves, then this License, and its terms, do not apply to those
-sections when you distribute them as separate works.  But when you
-distribute the same sections as part of a whole which is a work based
-on the Program, the distribution of the whole must be on the terms of
-this License, whose permissions for other licensees extend to the
-entire whole, and thus to each and every part regardless of who wrote it.
-
-Thus, it is not the intent of this section to claim rights or contest
-your rights to work written entirely by you; rather, the intent is to
-exercise the right to control the distribution of derivative or
-collective works based on the Program.
-
-In addition, mere aggregation of another work not based on the Program
-with the Program (or with a work based on the Program) on a volume of
-a storage or distribution medium does not bring the other work under
-the scope of this License.
-
-  3. You may copy and distribute the Program (or a work based on it,
-under Section 2) in object code or executable form under the terms of
-Sections 1 and 2 above provided that you also do one of the following:
-
-    a) Accompany it with the complete corresponding machine-readable
-    source code, which must be distributed under the terms of Sections
-    1 and 2 above on a medium customarily used for software interchange; or,
-
-    b) Accompany it with a written offer, valid for at least three
-    years, to give any third party, for a charge no more than your
-    cost of physically performing source distribution, a complete
-    machine-readable copy of the corresponding source code, to be
-    distributed under the terms of Sections 1 and 2 above on a medium
-    customarily used for software interchange; or,
-
-    c) Accompany it with the information you received as to the offer
-    to distribute corresponding source code.  (This alternative is
-    allowed only for noncommercial distribution and only if you
-    received the program in object code or executable form with such
-    an offer, in accord with Subsection b above.)
-
-The source code for a work means the preferred form of the work for
-making modifications to it.  For an executable work, complete source
-code means all the source code for all modules it contains, plus any
-associated interface definition files, plus the scripts used to
-control compilation and installation of the executable.  However, as a
-special exception, the source code distributed need not include
-anything that is normally distributed (in either source or binary
-form) with the major components (compiler, kernel, and so on) of the
-operating system on which the executable runs, unless that component
-itself accompanies the executable.
-
-If distribution of executable or object code is made by offering
-access to copy from a designated place, then offering equivalent
-access to copy the source code from the same place counts as
-distribution of the source code, even though third parties are not
-compelled to copy the source along with the object code.
-
-  4. You may not copy, modify, sublicense, or distribute the Program
-except as expressly provided under this License.  Any attempt
-otherwise to copy, modify, sublicense or distribute the Program is
-void, and will automatically terminate your rights under this License.
-However, parties who have received copies, or rights, from you under
-this License will not have their licenses terminated so long as such
-parties remain in full compliance.
-
-  5. You are not required to accept this License, since you have not
-signed it.  However, nothing else grants you permission to modify or
-distribute the Program or its derivative works.  These actions are
-prohibited by law if you do not accept this License.  Therefore, by
-modifying or distributing the Program (or any work based on the
-Program), you indicate your acceptance of this License to do so, and
-all its terms and conditions for copying, distributing or modifying
-the Program or works based on it.
-
-  6. Each time you redistribute the Program (or any work based on the
-Program), the recipient automatically receives a license from the
-original licensor to copy, distribute or modify the Program subject to
-these terms and conditions.  You may not impose any further
-restrictions on the recipients' exercise of the rights granted herein.
-You are not responsible for enforcing compliance by third parties to
-this License.
-
-  7. If, as a consequence of a court judgment or allegation of patent
-infringement or for any other reason (not limited to patent issues),
-conditions are imposed on you (whether by court order, agreement or
-otherwise) that contradict the conditions of this License, they do not
-excuse you from the conditions of this License.  If you cannot
-distribute so as to satisfy simultaneously your obligations under this
-License and any other pertinent obligations, then as a consequence you
-may not distribute the Program at all.  For example, if a patent
-license would not permit royalty-free redistribution of the Program by
-all those who receive copies directly or indirectly through you, then
-the only way you could satisfy both it and this License would be to
-refrain entirely from distribution of the Program.
-
-If any portion of this section is held invalid or unenforceable under
-any particular circumstance, the balance of the section is intended to
-apply and the section as a whole is intended to apply in other
-circumstances.
-
-It is not the purpose of this section to induce you to infringe any
-patents or other property right claims or to contest validity of any
-such claims; this section has the sole purpose of protecting the
-integrity of the free software distribution system, which is
-implemented by public license practices.  Many people have made
-generous contributions to the wide range of software distributed
-through that system in reliance on consistent application of that
-system; it is up to the author/donor to decide if he or she is willing
-to distribute software through any other system and a licensee cannot
-impose that choice.
-
-This section is intended to make thoroughly clear what is believed to
-be a consequence of the rest of this License.
-
-  8. If the distribution and/or use of the Program is restricted in
-certain countries either by patents or by copyrighted interfaces, the
-original copyright holder who places the Program under this License
-may add an explicit geographical distribution limitation excluding
-those countries, so that distribution is permitted only in or among
-countries not thus excluded.  In such case, this License incorporates
-the limitation as if written in the body of this License.
-
-  9. The Free Software Foundation may publish revised and/or new versions
-of the General Public License from time to time.  Such new versions will
-be similar in spirit to the present version, but may differ in detail to
-address new problems or concerns.
-
-Each version is given a distinguishing version number.  If the Program
-specifies a version number of this License which applies to it and "any
-later version", you have the option of following the terms and conditions
-either of that version or of any later version published by the Free
-Software Foundation.  If the Program does not specify a version number of
-this License, you may choose any version ever published by the Free Software
-Foundation.
-
-  10. If you wish to incorporate parts of the Program into other free
-programs whose distribution conditions are different, write to the author
-to ask for permission.  For software which is copyrighted by the Free
-Software Foundation, write to the Free Software Foundation; we sometimes
-make exceptions for this.  Our decision will be guided by the two goals
-of preserving the free status of all derivatives of our free software and
-of promoting the sharing and reuse of software generally.
-
-			    NO WARRANTY
-
-  11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
-FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW.  EXCEPT WHEN
-OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
-PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
-OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
-MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.  THE ENTIRE RISK AS
-TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU.  SHOULD THE
-PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
-REPAIR OR CORRECTION.
-
-  12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
-WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
-REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
-INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
-OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
-TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
-YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
-PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
-POSSIBILITY OF SUCH DAMAGES.
diff --git a/Documentation/networking/device_drivers/qlogic/LICENSE.qlge b/Documentation/networking/device_drivers/qlogic/LICENSE.qlge
deleted file mode 100644
index ce64e4d15b21..000000000000
--- a/Documentation/networking/device_drivers/qlogic/LICENSE.qlge
+++ /dev/null
@@ -1,288 +0,0 @@
-Copyright (c) 2003-2011 QLogic Corporation
-QLogic Linux qlge NIC Driver
-
-You may modify and redistribute the device driver code under the
-GNU General Public License (a copy of which is attached hereto as
-Exhibit A) published by the Free Software Foundation (version 2).
-
-
-EXHIBIT A
-
-		    GNU GENERAL PUBLIC LICENSE
-		       Version 2, June 1991
-
- Copyright (C) 1989, 1991 Free Software Foundation, Inc.
- 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA
- Everyone is permitted to copy and distribute verbatim copies
- of this license document, but changing it is not allowed.
-
-			    Preamble
-
-  The licenses for most software are designed to take away your
-freedom to share and change it.  By contrast, the GNU General Public
-License is intended to guarantee your freedom to share and change free
-software--to make sure the software is free for all its users.  This
-General Public License applies to most of the Free Software
-Foundation's software and to any other program whose authors commit to
-using it.  (Some other Free Software Foundation software is covered by
-the GNU Lesser General Public License instead.)  You can apply it to
-your programs, too.
-
-  When we speak of free software, we are referring to freedom, not
-price.  Our General Public Licenses are designed to make sure that you
-have the freedom to distribute copies of free software (and charge for
-this service if you wish), that you receive source code or can get it
-if you want it, that you can change the software or use pieces of it
-in new free programs; and that you know you can do these things.
-
-  To protect your rights, we need to make restrictions that forbid
-anyone to deny you these rights or to ask you to surrender the rights.
-These restrictions translate to certain responsibilities for you if you
-distribute copies of the software, or if you modify it.
-
-  For example, if you distribute copies of such a program, whether
-gratis or for a fee, you must give the recipients all the rights that
-you have.  You must make sure that they, too, receive or can get the
-source code.  And you must show them these terms so they know their
-rights.
-
-  We protect your rights with two steps: (1) copyright the software, and
-(2) offer you this license which gives you legal permission to copy,
-distribute and/or modify the software.
-
-  Also, for each author's protection and ours, we want to make certain
-that everyone understands that there is no warranty for this free
-software.  If the software is modified by someone else and passed on, we
-want its recipients to know that what they have is not the original, so
-that any problems introduced by others will not reflect on the original
-authors' reputations.
-
-  Finally, any free program is threatened constantly by software
-patents.  We wish to avoid the danger that redistributors of a free
-program will individually obtain patent licenses, in effect making the
-program proprietary.  To prevent this, we have made it clear that any
-patent must be licensed for everyone's free use or not licensed at all.
-
-  The precise terms and conditions for copying, distribution and
-modification follow.
-
-		    GNU GENERAL PUBLIC LICENSE
-   TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
-
-  0. This License applies to any program or other work which contains
-a notice placed by the copyright holder saying it may be distributed
-under the terms of this General Public License.  The "Program", below,
-refers to any such program or work, and a "work based on the Program"
-means either the Program or any derivative work under copyright law:
-that is to say, a work containing the Program or a portion of it,
-either verbatim or with modifications and/or translated into another
-language.  (Hereinafter, translation is included without limitation in
-the term "modification".)  Each licensee is addressed as "you".
-
-Activities other than copying, distribution and modification are not
-covered by this License; they are outside its scope.  The act of
-running the Program is not restricted, and the output from the Program
-is covered only if its contents constitute a work based on the
-Program (independent of having been made by running the Program).
-Whether that is true depends on what the Program does.
-
-  1. You may copy and distribute verbatim copies of the Program's
-source code as you receive it, in any medium, provided that you
-conspicuously and appropriately publish on each copy an appropriate
-copyright notice and disclaimer of warranty; keep intact all the
-notices that refer to this License and to the absence of any warranty;
-and give any other recipients of the Program a copy of this License
-along with the Program.
-
-You may charge a fee for the physical act of transferring a copy, and
-you may at your option offer warranty protection in exchange for a fee.
-
-  2. You may modify your copy or copies of the Program or any portion
-of it, thus forming a work based on the Program, and copy and
-distribute such modifications or work under the terms of Section 1
-above, provided that you also meet all of these conditions:
-
-    a) You must cause the modified files to carry prominent notices
-    stating that you changed the files and the date of any change.
-
-    b) You must cause any work that you distribute or publish, that in
-    whole or in part contains or is derived from the Program or any
-    part thereof, to be licensed as a whole at no charge to all third
-    parties under the terms of this License.
-
-    c) If the modified program normally reads commands interactively
-    when run, you must cause it, when started running for such
-    interactive use in the most ordinary way, to print or display an
-    announcement including an appropriate copyright notice and a
-    notice that there is no warranty (or else, saying that you provide
-    a warranty) and that users may redistribute the program under
-    these conditions, and telling the user how to view a copy of this
-    License.  (Exception: if the Program itself is interactive but
-    does not normally print such an announcement, your work based on
-    the Program is not required to print an announcement.)
-
-These requirements apply to the modified work as a whole.  If
-identifiable sections of that work are not derived from the Program,
-and can be reasonably considered independent and separate works in
-themselves, then this License, and its terms, do not apply to those
-sections when you distribute them as separate works.  But when you
-distribute the same sections as part of a whole which is a work based
-on the Program, the distribution of the whole must be on the terms of
-this License, whose permissions for other licensees extend to the
-entire whole, and thus to each and every part regardless of who wrote it.
-
-Thus, it is not the intent of this section to claim rights or contest
-your rights to work written entirely by you; rather, the intent is to
-exercise the right to control the distribution of derivative or
-collective works based on the Program.
-
-In addition, mere aggregation of another work not based on the Program
-with the Program (or with a work based on the Program) on a volume of
-a storage or distribution medium does not bring the other work under
-the scope of this License.
-
-  3. You may copy and distribute the Program (or a work based on it,
-under Section 2) in object code or executable form under the terms of
-Sections 1 and 2 above provided that you also do one of the following:
-
-    a) Accompany it with the complete corresponding machine-readable
-    source code, which must be distributed under the terms of Sections
-    1 and 2 above on a medium customarily used for software interchange; or,
-
-    b) Accompany it with a written offer, valid for at least three
-    years, to give any third party, for a charge no more than your
-    cost of physically performing source distribution, a complete
-    machine-readable copy of the corresponding source code, to be
-    distributed under the terms of Sections 1 and 2 above on a medium
-    customarily used for software interchange; or,
-
-    c) Accompany it with the information you received as to the offer
-    to distribute corresponding source code.  (This alternative is
-    allowed only for noncommercial distribution and only if you
-    received the program in object code or executable form with such
-    an offer, in accord with Subsection b above.)
-
-The source code for a work means the preferred form of the work for
-making modifications to it.  For an executable work, complete source
-code means all the source code for all modules it contains, plus any
-associated interface definition files, plus the scripts used to
-control compilation and installation of the executable.  However, as a
-special exception, the source code distributed need not include
-anything that is normally distributed (in either source or binary
-form) with the major components (compiler, kernel, and so on) of the
-operating system on which the executable runs, unless that component
-itself accompanies the executable.
-
-If distribution of executable or object code is made by offering
-access to copy from a designated place, then offering equivalent
-access to copy the source code from the same place counts as
-distribution of the source code, even though third parties are not
-compelled to copy the source along with the object code.
-
-  4. You may not copy, modify, sublicense, or distribute the Program
-except as expressly provided under this License.  Any attempt
-otherwise to copy, modify, sublicense or distribute the Program is
-void, and will automatically terminate your rights under this License.
-However, parties who have received copies, or rights, from you under
-this License will not have their licenses terminated so long as such
-parties remain in full compliance.
-
-  5. You are not required to accept this License, since you have not
-signed it.  However, nothing else grants you permission to modify or
-distribute the Program or its derivative works.  These actions are
-prohibited by law if you do not accept this License.  Therefore, by
-modifying or distributing the Program (or any work based on the
-Program), you indicate your acceptance of this License to do so, and
-all its terms and conditions for copying, distributing or modifying
-the Program or works based on it.
-
-  6. Each time you redistribute the Program (or any work based on the
-Program), the recipient automatically receives a license from the
-original licensor to copy, distribute or modify the Program subject to
-these terms and conditions.  You may not impose any further
-restrictions on the recipients' exercise of the rights granted herein.
-You are not responsible for enforcing compliance by third parties to
-this License.
-
-  7. If, as a consequence of a court judgment or allegation of patent
-infringement or for any other reason (not limited to patent issues),
-conditions are imposed on you (whether by court order, agreement or
-otherwise) that contradict the conditions of this License, they do not
-excuse you from the conditions of this License.  If you cannot
-distribute so as to satisfy simultaneously your obligations under this
-License and any other pertinent obligations, then as a consequence you
-may not distribute the Program at all.  For example, if a patent
-license would not permit royalty-free redistribution of the Program by
-all those who receive copies directly or indirectly through you, then
-the only way you could satisfy both it and this License would be to
-refrain entirely from distribution of the Program.
-
-If any portion of this section is held invalid or unenforceable under
-any particular circumstance, the balance of the section is intended to
-apply and the section as a whole is intended to apply in other
-circumstances.
-
-It is not the purpose of this section to induce you to infringe any
-patents or other property right claims or to contest validity of any
-such claims; this section has the sole purpose of protecting the
-integrity of the free software distribution system, which is
-implemented by public license practices.  Many people have made
-generous contributions to the wide range of software distributed
-through that system in reliance on consistent application of that
-system; it is up to the author/donor to decide if he or she is willing
-to distribute software through any other system and a licensee cannot
-impose that choice.
-
-This section is intended to make thoroughly clear what is believed to
-be a consequence of the rest of this License.
-
-  8. If the distribution and/or use of the Program is restricted in
-certain countries either by patents or by copyrighted interfaces, the
-original copyright holder who places the Program under this License
-may add an explicit geographical distribution limitation excluding
-those countries, so that distribution is permitted only in or among
-countries not thus excluded.  In such case, this License incorporates
-the limitation as if written in the body of this License.
-
-  9. The Free Software Foundation may publish revised and/or new versions
-of the General Public License from time to time.  Such new versions will
-be similar in spirit to the present version, but may differ in detail to
-address new problems or concerns.
-
-Each version is given a distinguishing version number.  If the Program
-specifies a version number of this License which applies to it and "any
-later version", you have the option of following the terms and conditions
-either of that version or of any later version published by the Free
-Software Foundation.  If the Program does not specify a version number of
-this License, you may choose any version ever published by the Free Software
-Foundation.
-
-  10. If you wish to incorporate parts of the Program into other free
-programs whose distribution conditions are different, write to the author
-to ask for permission.  For software which is copyrighted by the Free
-Software Foundation, write to the Free Software Foundation; we sometimes
-make exceptions for this.  Our decision will be guided by the two goals
-of preserving the free status of all derivatives of our free software and
-of promoting the sharing and reuse of software generally.
-
-			    NO WARRANTY
-
-  11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
-FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW.  EXCEPT WHEN
-OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
-PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
-OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
-MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.  THE ENTIRE RISK AS
-TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU.  SHOULD THE
-PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
-REPAIR OR CORRECTION.
-
-  12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
-WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
-REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
-INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
-OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
-TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
-YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
-PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
-POSSIBILITY OF SUCH DAMAGES.
diff --git a/Documentation/networking/device_drivers/qualcomm/rmnet.rst b/Documentation/networking/device_drivers/qualcomm/rmnet.rst
deleted file mode 100644
index 70643b58de05..000000000000
--- a/Documentation/networking/device_drivers/qualcomm/rmnet.rst
+++ /dev/null
@@ -1,95 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-============
-Rmnet Driver
-============
-
-1. Introduction
-===============
-
-rmnet driver is used for supporting the Multiplexing and aggregation
-Protocol (MAP). This protocol is used by all recent chipsets using Qualcomm
-Technologies, Inc. modems.
-
-This driver can be used to register onto any physical network device in
-IP mode. Physical transports include USB, HSIC, PCIe and IP accelerator.
-
-Multiplexing allows for creation of logical netdevices (rmnet devices) to
-handle multiple private data networks (PDN) like a default internet, tethering,
-multimedia messaging service (MMS) or IP media subsystem (IMS). Hardware sends
-packets with MAP headers to rmnet. Based on the multiplexer id, rmnet
-routes to the appropriate PDN after removing the MAP header.
-
-Aggregation is required to achieve high data rates. This involves hardware
-sending aggregated bunch of MAP frames. rmnet driver will de-aggregate
-these MAP frames and send them to appropriate PDN's.
-
-2. Packet format
-================
-
-a. MAP packet (data / control)
-
-MAP header has the same endianness of the IP packet.
-
-Packet format::
-
-  Bit             0             1           2-7      8 - 15           16 - 31
-  Function   Command / Data   Reserved     Pad   Multiplexer ID    Payload length
-  Bit            32 - x
-  Function     Raw  Bytes
-
-Command (1)/ Data (0) bit value is to indicate if the packet is a MAP command
-or data packet. Control packet is used for transport level flow control. Data
-packets are standard IP packets.
-
-Reserved bits are usually zeroed out and to be ignored by receiver.
-
-Padding is number of bytes to be added for 4 byte alignment if required by
-hardware.
-
-Multiplexer ID is to indicate the PDN on which data has to be sent.
-
-Payload length includes the padding length but does not include MAP header
-length.
-
-b. MAP packet (command specific)::
-
-    Bit             0             1           2-7      8 - 15           16 - 31
-    Function   Command         Reserved     Pad   Multiplexer ID    Payload length
-    Bit          32 - 39        40 - 45    46 - 47       48 - 63
-    Function   Command name    Reserved   Command Type   Reserved
-    Bit          64 - 95
-    Function   Transaction ID
-    Bit          96 - 127
-    Function   Command data
-
-Command 1 indicates disabling flow while 2 is enabling flow
-
-Command types
-
-= ==========================================
-0 for MAP command request
-1 is to acknowledge the receipt of a command
-2 is for unsupported commands
-3 is for error during processing of commands
-= ==========================================
-
-c. Aggregation
-
-Aggregation is multiple MAP packets (can be data or command) delivered to
-rmnet in a single linear skb. rmnet will process the individual
-packets and either ACK the MAP command or deliver the IP packet to the
-network stack as needed
-
-MAP header|IP Packet|Optional padding|MAP header|IP Packet|Optional padding....
-
-MAP header|IP Packet|Optional padding|MAP header|Command Packet|Optional pad...
-
-3. Userspace configuration
-==========================
-
-rmnet userspace configuration is done through netlink library librmnetctl
-and command line utility rmnetcli. Utility is hosted in codeaurora forum git.
-The driver uses rtnl_link_ops for communication.
-
-https://source.codeaurora.org/quic/la/platform/vendor/qcom-opensource/dataservices/tree/rmnetctl
diff --git a/Documentation/networking/device_drivers/sb1000.rst b/Documentation/networking/device_drivers/sb1000.rst
deleted file mode 100644
index c8582ca4034d..000000000000
--- a/Documentation/networking/device_drivers/sb1000.rst
+++ /dev/null
@@ -1,222 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===================
-SB100 device driver
-===================
-
-sb1000 is a module network device driver for the General Instrument (also known
-as NextLevel) SURFboard1000 internal cable modem board.  This is an ISA card
-which is used by a number of cable TV companies to provide cable modem access.
-It's a one-way downstream-only cable modem, meaning that your upstream net link
-is provided by your regular phone modem.
-
-This driver was written by Franco Venturi <fventuri@mediaone.net>.  He deserves
-a great deal of thanks for this wonderful piece of code!
-
-Needed tools
-============
-
-Support for this device is now a part of the standard Linux kernel.  The
-driver source code file is drivers/net/sb1000.c.  In addition to this
-you will need:
-
-1. The "cmconfig" program.  This is a utility which supplements "ifconfig"
-   to configure the cable modem and network interface (usually called "cm0");
-
-2. Several PPP scripts which live in /etc/ppp to make connecting via your
-   cable modem easy.
-
-   These utilities can be obtained from:
-
-      http://www.jacksonville.net/~fventuri/
-
-   in Franco's original source code distribution .tar.gz file.  Support for
-   the sb1000 driver can be found at:
-
-      - http://web.archive.org/web/%2E/http://home.adelphia.net/~siglercm/sb1000.html
-      - http://web.archive.org/web/%2E/http://linuxpower.cx/~cable/
-
-   along with these utilities.
-
-3. The standard isapnp tools.  These are necessary to configure your SB1000
-   card at boot time (or afterwards by hand) since it's a PnP card.
-
-   If you don't have these installed as a standard part of your Linux
-   distribution, you can find them at:
-
-      http://www.roestock.demon.co.uk/isapnptools/
-
-   or check your Linux distribution binary CD or their web site.  For help with
-   isapnp, pnpdump, or /etc/isapnp.conf, go to:
-
-      http://www.roestock.demon.co.uk/isapnptools/isapnpfaq.html
-
-Using the driver
-================
-
-To make the SB1000 card work, follow these steps:
-
-1. Run ``make config``, or ``make menuconfig``, or ``make xconfig``, whichever
-   you prefer, in the top kernel tree directory to set up your kernel
-   configuration.  Make sure to say "Y" to "Prompt for development drivers"
-   and to say "M" to the sb1000 driver.  Also say "Y" or "M" to all the standard
-   networking questions to get TCP/IP and PPP networking support.
-
-2. **BEFORE** you build the kernel, edit drivers/net/sb1000.c.  Make sure
-   to redefine the value of READ_DATA_PORT to match the I/O address used
-   by isapnp to access your PnP cards.  This is the value of READPORT in
-   /etc/isapnp.conf or given by the output of pnpdump.
-
-3. Build and install the kernel and modules as usual.
-
-4. Boot your new kernel following the usual procedures.
-
-5. Set up to configure the new SB1000 PnP card by capturing the output
-   of "pnpdump" to a file and editing this file to set the correct I/O ports,
-   IRQ, and DMA settings for all your PnP cards.  Make sure none of the settings
-   conflict with one another.  Then test this configuration by running the
-   "isapnp" command with your new config file as the input.  Check for
-   errors and fix as necessary.  (As an aside, I use I/O ports 0x110 and
-   0x310 and IRQ 11 for my SB1000 card and these work well for me.  YMMV.)
-   Then save the finished config file as /etc/isapnp.conf for proper
-   configuration on subsequent reboots.
-
-6. Download the original file sb1000-1.1.2.tar.gz from Franco's site or one of
-   the others referenced above.  As root, unpack it into a temporary directory
-   and do a ``make cmconfig`` and then ``install -c cmconfig /usr/local/sbin``.
-   Don't do ``make install`` because it expects to find all the utilities built
-   and ready for installation, not just cmconfig.
-
-7. As root, copy all the files under the ppp/ subdirectory in Franco's
-   tar file into /etc/ppp, being careful not to overwrite any files that are
-   already in there.  Then modify ppp@gi-on to set the correct login name,
-   phone number, and frequency for the cable modem.  Also edit pap-secrets
-   to specify your login name and password and any site-specific information
-   you need.
-
-8. Be sure to modify /etc/ppp/firewall to use ipchains instead of
-   the older ipfwadm commands from the 2.0.x kernels.  There's a neat utility to
-   convert ipfwadm commands to ipchains commands:
-
-	http://users.dhp.com/~whisper/ipfwadm2ipchains/
-
-   You may also wish to modify the firewall script to implement a different
-   firewalling scheme.
-
-9. Start the PPP connection via the script /etc/ppp/ppp@gi-on.  You must be
-   root to do this.  It's better to use a utility like sudo to execute
-   frequently used commands like this with root permissions if possible.  If you
-   connect successfully the cable modem interface will come up and you'll see a
-   driver message like this at the console::
-
-	 cm0: sb1000 at (0x110,0x310), csn 1, S/N 0x2a0d16d8, IRQ 11.
-	 sb1000.c:v1.1.2 6/01/98 (fventuri@mediaone.net)
-
-   The "ifconfig" command should show two new interfaces, ppp0 and cm0.
-
-   The command "cmconfig cm0" will give you information about the cable modem
-   interface.
-
-10. Try pinging a site via ``ping -c 5 www.yahoo.com``, for example.  You should
-    see packets received.
-
-11. If you can't get site names (like www.yahoo.com) to resolve into
-    IP addresses (like 204.71.200.67), be sure your /etc/resolv.conf file
-    has no syntax errors and has the right nameserver IP addresses in it.
-    If this doesn't help, try something like ``ping -c 5 204.71.200.67`` to
-    see if the networking is running but the DNS resolution is where the
-    problem lies.
-
-12. If you still have problems, go to the support web sites mentioned above
-    and read the information and documentation there.
-
-Common problems
-===============
-
-1. Packets go out on the ppp0 interface but don't come back on the cm0
-   interface.  It looks like I'm connected but I can't even ping any
-   numerical IP addresses.  (This happens predominantly on Debian systems due
-   to a default boot-time configuration script.)
-
-Solution
-   As root ``echo 0 > /proc/sys/net/ipv4/conf/cm0/rp_filter`` so it
-   can share the same IP address as the ppp0 interface.  Note that this
-   command should probably be added to the /etc/ppp/cablemodem script
-   *right*between* the "/sbin/ifconfig" and "/sbin/cmconfig" commands.
-   You may need to do this to /proc/sys/net/ipv4/conf/ppp0/rp_filter as well.
-   If you do this to /proc/sys/net/ipv4/conf/default/rp_filter on each reboot
-   (in rc.local or some such) then any interfaces can share the same IP
-   addresses.
-
-2. I get "unresolved symbol" error messages on executing ``insmod sb1000.o``.
-
-Solution
-   You probably have a non-matching kernel source tree and
-   /usr/include/linux and /usr/include/asm header files.  Make sure you
-   install the correct versions of the header files in these two directories.
-   Then rebuild and reinstall the kernel.
-
-3. When isapnp runs it reports an error, and my SB1000 card isn't working.
-
-Solution
-   There's a problem with later versions of isapnp using the "(CHECK)"
-   option in the lines that allocate the two I/O addresses for the SB1000 card.
-   This first popped up on RH 6.0.  Delete "(CHECK)" for the SB1000 I/O addresses.
-   Make sure they don't conflict with any other pieces of hardware first!  Then
-   rerun isapnp and go from there.
-
-4. I can't execute the /etc/ppp/ppp@gi-on file.
-
-Solution
-   As root do ``chmod ug+x /etc/ppp/ppp@gi-on``.
-
-5. The firewall script isn't working (with 2.2.x and higher kernels).
-
-Solution
-   Use the ipfwadm2ipchains script referenced above to convert the
-   /etc/ppp/firewall script from the deprecated ipfwadm commands to ipchains.
-
-6. I'm getting *tons* of firewall deny messages in the /var/kern.log,
-   /var/messages, and/or /var/syslog files, and they're filling up my /var
-   partition!!!
-
-Solution
-   First, tell your ISP that you're receiving DoS (Denial of Service)
-   and/or portscanning (UDP connection attempts) attacks!  Look over the deny
-   messages to figure out what the attack is and where it's coming from.  Next,
-   edit /etc/ppp/cablemodem and make sure the ",nobroadcast" option is turned on
-   to the "cmconfig" command (uncomment that line).  If you're not receiving these
-   denied packets on your broadcast interface (IP address xxx.yyy.zzz.255
-   typically), then someone is attacking your machine in particular.  Be careful
-   out there....
-
-7. Everything seems to work fine but my computer locks up after a while
-   (and typically during a lengthy download through the cable modem)!
-
-Solution
-   You may need to add a short delay in the driver to 'slow down' the
-   SURFboard because your PC might not be able to keep up with the transfer rate
-   of the SB1000. To do this, it's probably best to download Franco's
-   sb1000-1.1.2.tar.gz archive and build and install sb1000.o manually.  You'll
-   want to edit the 'Makefile' and look for the 'SB1000_DELAY'
-   define.  Uncomment those 'CFLAGS' lines (and comment out the default ones)
-   and try setting the delay to something like 60 microseconds with:
-   '-DSB1000_DELAY=60'.  Then do ``make`` and as root ``make install`` and try
-   it out.  If it still doesn't work or you like playing with the driver, you may
-   try other numbers.  Remember though that the higher the delay, the slower the
-   driver (which slows down the rest of the PC too when it is actively
-   used). Thanks to Ed Daiga for this tip!
-
-Credits
-=======
-
-This README came from Franco Venturi's original README file which is
-still supplied with his driver .tar.gz archive.  I and all other sb1000 users
-owe Franco a tremendous "Thank you!"  Additional thanks goes to Carl Patten
-and Ralph Bonnell who are now managing the Linux SB1000 web site, and to
-the SB1000 users who reported and helped debug the common problems listed
-above.
-
-
-					Clemmitt Sigler
-					csigler@vt.edu
diff --git a/Documentation/networking/device_drivers/toshiba/spider_net.rst b/Documentation/networking/device_drivers/toshiba/spider_net.rst
deleted file mode 100644
index fe5b32be15cd..000000000000
--- a/Documentation/networking/device_drivers/toshiba/spider_net.rst
+++ /dev/null
@@ -1,202 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===========================
-The Spidernet Device Driver
-===========================
-
-Written by Linas Vepstas <linas@austin.ibm.com>
-
-Version of 7 June 2007
-
-Abstract
-========
-This document sketches the structure of portions of the spidernet
-device driver in the Linux kernel tree. The spidernet is a gigabit
-ethernet device built into the Toshiba southbridge commonly used
-in the SONY Playstation 3 and the IBM QS20 Cell blade.
-
-The Structure of the RX Ring.
-=============================
-The receive (RX) ring is a circular linked list of RX descriptors,
-together with three pointers into the ring that are used to manage its
-contents.
-
-The elements of the ring are called "descriptors" or "descrs"; they
-describe the received data. This includes a pointer to a buffer
-containing the received data, the buffer size, and various status bits.
-
-There are three primary states that a descriptor can be in: "empty",
-"full" and "not-in-use".  An "empty" or "ready" descriptor is ready
-to receive data from the hardware. A "full" descriptor has data in it,
-and is waiting to be emptied and processed by the OS. A "not-in-use"
-descriptor is neither empty or full; it is simply not ready. It may
-not even have a data buffer in it, or is otherwise unusable.
-
-During normal operation, on device startup, the OS (specifically, the
-spidernet device driver) allocates a set of RX descriptors and RX
-buffers. These are all marked "empty", ready to receive data. This
-ring is handed off to the hardware, which sequentially fills in the
-buffers, and marks them "full". The OS follows up, taking the full
-buffers, processing them, and re-marking them empty.
-
-This filling and emptying is managed by three pointers, the "head"
-and "tail" pointers, managed by the OS, and a hardware current
-descriptor pointer (GDACTDPA). The GDACTDPA points at the descr
-currently being filled. When this descr is filled, the hardware
-marks it full, and advances the GDACTDPA by one.  Thus, when there is
-flowing RX traffic, every descr behind it should be marked "full",
-and everything in front of it should be "empty".  If the hardware
-discovers that the current descr is not empty, it will signal an
-interrupt, and halt processing.
-
-The tail pointer tails or trails the hardware pointer. When the
-hardware is ahead, the tail pointer will be pointing at a "full"
-descr. The OS will process this descr, and then mark it "not-in-use",
-and advance the tail pointer.  Thus, when there is flowing RX traffic,
-all of the descrs in front of the tail pointer should be "full", and
-all of those behind it should be "not-in-use". When RX traffic is not
-flowing, then the tail pointer can catch up to the hardware pointer.
-The OS will then note that the current tail is "empty", and halt
-processing.
-
-The head pointer (somewhat mis-named) follows after the tail pointer.
-When traffic is flowing, then the head pointer will be pointing at
-a "not-in-use" descr. The OS will perform various housekeeping duties
-on this descr. This includes allocating a new data buffer and
-dma-mapping it so as to make it visible to the hardware. The OS will
-then mark the descr as "empty", ready to receive data. Thus, when there
-is flowing RX traffic, everything in front of the head pointer should
-be "not-in-use", and everything behind it should be "empty". If no
-RX traffic is flowing, then the head pointer can catch up to the tail
-pointer, at which point the OS will notice that the head descr is
-"empty", and it will halt processing.
-
-Thus, in an idle system, the GDACTDPA, tail and head pointers will
-all be pointing at the same descr, which should be "empty". All of the
-other descrs in the ring should be "empty" as well.
-
-The show_rx_chain() routine will print out the locations of the
-GDACTDPA, tail and head pointers. It will also summarize the contents
-of the ring, starting at the tail pointer, and listing the status
-of the descrs that follow.
-
-A typical example of the output, for a nearly idle system, might be::
-
-    net eth1: Total number of descrs=256
-    net eth1: Chain tail located at descr=20
-    net eth1: Chain head is at 20
-    net eth1: HW curr desc (GDACTDPA) is at 21
-    net eth1: Have 1 descrs with stat=x40800101
-    net eth1: HW next desc (GDACNEXTDA) is at 22
-    net eth1: Last 255 descrs with stat=xa0800000
-
-In the above, the hardware has filled in one descr, number 20. Both
-head and tail are pointing at 20, because it has not yet been emptied.
-Meanwhile, hw is pointing at 21, which is free.
-
-The "Have nnn decrs" refers to the descr starting at the tail: in this
-case, nnn=1 descr, starting at descr 20. The "Last nnn descrs" refers
-to all of the rest of the descrs, from the last status change. The "nnn"
-is a count of how many descrs have exactly the same status.
-
-The status x4... corresponds to "full" and status xa... corresponds
-to "empty". The actual value printed is RXCOMST_A.
-
-In the device driver source code, a different set of names are
-used for these same concepts, so that::
-
-    "empty" == SPIDER_NET_DESCR_CARDOWNED == 0xa
-    "full"  == SPIDER_NET_DESCR_FRAME_END == 0x4
-    "not in use" == SPIDER_NET_DESCR_NOT_IN_USE == 0xf
-
-
-The RX RAM full bug/feature
-===========================
-
-As long as the OS can empty out the RX buffers at a rate faster than
-the hardware can fill them, there is no problem. If, for some reason,
-the OS fails to empty the RX ring fast enough, the hardware GDACTDPA
-pointer will catch up to the head, notice the not-empty condition,
-ad stop. However, RX packets may still continue arriving on the wire.
-The spidernet chip can save some limited number of these in local RAM.
-When this local ram fills up, the spider chip will issue an interrupt
-indicating this (GHIINT0STS will show ERRINT, and the GRMFLLINT bit
-will be set in GHIINT1STS).  When the RX ram full condition occurs,
-a certain bug/feature is triggered that has to be specially handled.
-This section describes the special handling for this condition.
-
-When the OS finally has a chance to run, it will empty out the RX ring.
-In particular, it will clear the descriptor on which the hardware had
-stopped. However, once the hardware has decided that a certain
-descriptor is invalid, it will not restart at that descriptor; instead
-it will restart at the next descr. This potentially will lead to a
-deadlock condition, as the tail pointer will be pointing at this descr,
-which, from the OS point of view, is empty; the OS will be waiting for
-this descr to be filled. However, the hardware has skipped this descr,
-and is filling the next descrs. Since the OS doesn't see this, there
-is a potential deadlock, with the OS waiting for one descr to fill,
-while the hardware is waiting for a different set of descrs to become
-empty.
-
-A call to show_rx_chain() at this point indicates the nature of the
-problem. A typical print when the network is hung shows the following::
-
-    net eth1: Spider RX RAM full, incoming packets might be discarded!
-    net eth1: Total number of descrs=256
-    net eth1: Chain tail located at descr=255
-    net eth1: Chain head is at 255
-    net eth1: HW curr desc (GDACTDPA) is at 0
-    net eth1: Have 1 descrs with stat=xa0800000
-    net eth1: HW next desc (GDACNEXTDA) is at 1
-    net eth1: Have 127 descrs with stat=x40800101
-    net eth1: Have 1 descrs with stat=x40800001
-    net eth1: Have 126 descrs with stat=x40800101
-    net eth1: Last 1 descrs with stat=xa0800000
-
-Both the tail and head pointers are pointing at descr 255, which is
-marked xa... which is "empty". Thus, from the OS point of view, there
-is nothing to be done. In particular, there is the implicit assumption
-that everything in front of the "empty" descr must surely also be empty,
-as explained in the last section. The OS is waiting for descr 255 to
-become non-empty, which, in this case, will never happen.
-
-The HW pointer is at descr 0. This descr is marked 0x4.. or "full".
-Since its already full, the hardware can do nothing more, and thus has
-halted processing. Notice that descrs 0 through 254 are all marked
-"full", while descr 254 and 255 are empty. (The "Last 1 descrs" is
-descr 254, since tail was at 255.) Thus, the system is deadlocked,
-and there can be no forward progress; the OS thinks there's nothing
-to do, and the hardware has nowhere to put incoming data.
-
-This bug/feature is worked around with the spider_net_resync_head_ptr()
-routine. When the driver receives RX interrupts, but an examination
-of the RX chain seems to show it is empty, then it is probable that
-the hardware has skipped a descr or two (sometimes dozens under heavy
-network conditions). The spider_net_resync_head_ptr() subroutine will
-search the ring for the next full descr, and the driver will resume
-operations there.  Since this will leave "holes" in the ring, there
-is also a spider_net_resync_tail_ptr() that will skip over such holes.
-
-As of this writing, the spider_net_resync() strategy seems to work very
-well, even under heavy network loads.
-
-
-The TX ring
-===========
-The TX ring uses a low-watermark interrupt scheme to make sure that
-the TX queue is appropriately serviced for large packet sizes.
-
-For packet sizes greater than about 1KBytes, the kernel can fill
-the TX ring quicker than the device can drain it. Once the ring
-is full, the netdev is stopped. When there is room in the ring,
-the netdev needs to be reawakened, so that more TX packets are placed
-in the ring. The hardware can empty the ring about four times per jiffy,
-so its not appropriate to wait for the poll routine to refill, since
-the poll routine runs only once per jiffy.  The low-watermark mechanism
-marks a descr about 1/4th of the way from the bottom of the queue, so
-that an interrupt is generated when the descr is processed. This
-interrupt wakes up the netdev, which can then refill the queue.
-For large packets, this mechanism generates a relatively small number
-of interrupts, about 1K/sec. For smaller packets, this will drop to zero
-interrupts, as the hardware can empty the queue faster than the kernel
-can fill it.
diff --git a/Documentation/networking/device_drivers/wifi/index.rst b/Documentation/networking/device_drivers/wifi/index.rst
new file mode 100644
index 000000000000..fb394f5de4a9
--- /dev/null
+++ b/Documentation/networking/device_drivers/wifi/index.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Wi-Fi Device Drivers
+====================
+
+Contents:
+
+.. toctree::
+   :maxdepth: 2
+
+   intel/ipw2100
+   intel/ipw2200
+
+.. only::  subproject and html
+
+   Indices
+   =======
+
+   * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/intel/ipw2100.rst b/Documentation/networking/device_drivers/wifi/intel/ipw2100.rst
index d54ad522f937..883e96355799 100644
--- a/Documentation/networking/device_drivers/intel/ipw2100.rst
+++ b/Documentation/networking/device_drivers/wifi/intel/ipw2100.rst
@@ -78,7 +78,7 @@ such, if you are interested in deploying or shipping a driver as part of
 solution intended to be used for purposes other than development, please
 obtain a tested driver from Intel Customer Support at:
 
-http://www.intel.com/support/wireless/sb/CS-006408.htm
+https://www.intel.com/support/wireless/sb/CS-006408.htm
 
 1. Introduction
 ===============
diff --git a/Documentation/networking/device_drivers/intel/ipw2200.rst b/Documentation/networking/device_drivers/wifi/intel/ipw2200.rst
index 0cb42d2fd7e5..0cb42d2fd7e5 100644
--- a/Documentation/networking/device_drivers/intel/ipw2200.rst
+++ b/Documentation/networking/device_drivers/wifi/intel/ipw2200.rst
diff --git a/Documentation/networking/device_drivers/wwan/index.rst b/Documentation/networking/device_drivers/wwan/index.rst
new file mode 100644
index 000000000000..370d8264d5dc
--- /dev/null
+++ b/Documentation/networking/device_drivers/wwan/index.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+WWAN Device Drivers
+===================
+
+Contents:
+
+.. toctree::
+   :maxdepth: 2
+
+   iosm
+   t7xx
+
+.. only::  subproject and html
+
+   Indices
+   =======
+
+   * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/wwan/iosm.rst b/Documentation/networking/device_drivers/wwan/iosm.rst
new file mode 100644
index 000000000000..6f9e955af984
--- /dev/null
+++ b/Documentation/networking/device_drivers/wwan/iosm.rst
@@ -0,0 +1,96 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+.. Copyright (C) 2020-21 Intel Corporation
+
+.. _iosm_driver_doc:
+
+===========================================
+IOSM Driver for Intel M.2 PCIe based Modems
+===========================================
+The IOSM (IPC over Shared Memory) driver is a WWAN PCIe host driver developed
+for linux or chrome platform for data exchange over PCIe interface between
+Host platform & Intel M.2 Modem. The driver exposes interface conforming to the
+MBIM protocol [1]. Any front end application ( eg: Modem Manager) could easily
+manage the MBIM interface to enable data communication towards WWAN.
+
+Basic usage
+===========
+MBIM functions are inactive when unmanaged. The IOSM driver only provides a
+userspace interface MBIM "WWAN PORT" representing MBIM control channel and does
+not play any role in managing the functionality. It is the job of a userspace
+application to detect port enumeration and enable MBIM functionality.
+
+Examples of few such userspace application are:
+- mbimcli (included with the libmbim [2] library), and
+- Modem Manager [3]
+
+Management Applications to carry out below required actions for establishing
+MBIM IP session:
+- open the MBIM control channel
+- configure network connection settings
+- connect to network
+- configure IP network interface
+
+Management application development
+==================================
+The driver and userspace interfaces are described below. The MBIM protocol is
+described in [1] Mobile Broadband Interface Model v1.0 Errata-1.
+
+MBIM control channel userspace ABI
+----------------------------------
+
+/dev/wwan0mbim0 character device
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The driver exposes an MBIM interface to the MBIM function by implementing
+MBIM WWAN Port. The userspace end of the control channel pipe is a
+/dev/wwan0mbim0 character device. Application shall use this interface for
+MBIM protocol communication.
+
+Fragmentation
+~~~~~~~~~~~~~
+The userspace application is responsible for all control message fragmentation
+and defragmentation as per MBIM specification.
+
+/dev/wwan0mbim0 write()
+~~~~~~~~~~~~~~~~~~~~~~~
+The MBIM control messages from the management application must not exceed the
+negotiated control message size.
+
+/dev/wwan0mbim0 read()
+~~~~~~~~~~~~~~~~~~~~~~
+The management application must accept control messages of up the negotiated
+control message size.
+
+MBIM data channel userspace ABI
+-------------------------------
+
+wwan0-X network device
+~~~~~~~~~~~~~~~~~~~~~~
+The IOSM driver exposes IP link interface "wwan0-X" of type "wwan" for IP
+traffic. Iproute network utility is used for creating "wwan0-X" network
+interface and for associating it with MBIM IP session. The Driver supports
+up to 8 IP sessions for simultaneous IP communication.
+
+The userspace management application is responsible for creating new IP link
+prior to establishing MBIM IP session where the SessionId is greater than 0.
+
+For example, creating new IP link for a MBIM IP session with SessionId 1:
+
+  ip link add dev wwan0-1 parentdev-name wwan0 type wwan linkid 1
+
+The driver will automatically map the "wwan0-1" network device to MBIM IP
+session 1.
+
+References
+==========
+[1] "MBIM (Mobile Broadband Interface Model) Errata-1"
+      - https://www.usb.org/document-library/
+
+[2] libmbim - "a glib-based library for talking to WWAN modems and
+      devices which speak the Mobile Interface Broadband Model (MBIM)
+      protocol"
+      - http://www.freedesktop.org/wiki/Software/libmbim/
+
+[3] Modem Manager - "a DBus-activated daemon which controls mobile
+      broadband (2G/3G/4G) devices and connections"
+      - http://www.freedesktop.org/wiki/Software/ModemManager/
diff --git a/Documentation/networking/device_drivers/wwan/t7xx.rst b/Documentation/networking/device_drivers/wwan/t7xx.rst
new file mode 100644
index 000000000000..e07de7700dfc
--- /dev/null
+++ b/Documentation/networking/device_drivers/wwan/t7xx.rst
@@ -0,0 +1,214 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+.. Copyright (C) 2020-21 Intel Corporation
+
+.. _t7xx_driver_doc:
+
+============================================
+t7xx driver for MTK PCIe based T700 5G modem
+============================================
+The t7xx driver is a WWAN PCIe host driver developed for linux or Chrome OS
+platforms for data exchange over PCIe interface between Host platform &
+MediaTek's T700 5G modem.
+The driver exposes an interface conforming to the MBIM protocol [1]. Any front
+end application (e.g. Modem Manager) could easily manage the MBIM interface to
+enable data communication towards WWAN. The driver also provides an interface
+to interact with the MediaTek's modem via AT commands.
+
+Basic usage
+===========
+MBIM & AT functions are inactive when unmanaged. The t7xx driver provides
+WWAN port userspace interfaces representing MBIM & AT control channels and does
+not play any role in managing their functionality. It is the job of a userspace
+application to detect port enumeration and enable MBIM & AT functionalities.
+
+Examples of few such userspace applications are:
+
+- mbimcli (included with the libmbim [2] library), and
+- Modem Manager [3]
+
+Management Applications to carry out below required actions for establishing
+MBIM IP session:
+
+- open the MBIM control channel
+- configure network connection settings
+- connect to network
+- configure IP network interface
+
+Management Applications to carry out below required actions for send an AT
+command and receive response:
+
+- open the AT control channel using a UART tool or a special user tool
+
+Sysfs
+=====
+The driver provides sysfs interfaces to userspace.
+
+t7xx_mode
+---------
+The sysfs interface provides userspace with access to the device mode, this
+interface supports read and write operations.
+
+Device mode:
+
+- ``unknown`` represents that device in unknown status
+- ``ready`` represents that device in ready status
+- ``reset`` represents that device in reset status
+- ``fastboot_switching`` represents that device in fastboot switching status
+- ``fastboot_download`` represents that device in fastboot download status
+- ``fastboot_dump`` represents that device in fastboot dump status
+
+Read from userspace to get the current device mode.
+
+::
+  $ cat /sys/bus/pci/devices/${bdf}/t7xx_mode
+
+Write from userspace to set the device mode.
+
+::
+  $ echo fastboot_switching > /sys/bus/pci/devices/${bdf}/t7xx_mode
+
+t7xx_debug_ports
+----------------
+The sysfs interface provides userspace with access to enable/disable the debug
+ports, this interface supports read and write operations.
+
+Debug port status:
+
+- ``1`` represents enable debug ports
+- ``0`` represents disable debug ports
+
+Currently supported debug ports (ADB/MIPC).
+
+Read from userspace to get the current debug ports status.
+
+::
+  $ cat /sys/bus/pci/devices/${bdf}/t7xx_debug_ports
+
+Write from userspace to set the debug ports status.
+
+::
+  $ echo 1 > /sys/bus/pci/devices/${bdf}/t7xx_debug_ports
+
+Management application development
+==================================
+The driver and userspace interfaces are described below. The MBIM protocol is
+described in [1] Mobile Broadband Interface Model v1.0 Errata-1.
+
+MBIM control channel userspace ABI
+----------------------------------
+
+/dev/wwan0mbim0 character device
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The driver exposes an MBIM interface to the MBIM function by implementing
+MBIM WWAN Port. The userspace end of the control channel pipe is a
+/dev/wwan0mbim0 character device. Application shall use this interface for
+MBIM protocol communication.
+
+Fragmentation
+~~~~~~~~~~~~~
+The userspace application is responsible for all control message fragmentation
+and defragmentation as per MBIM specification.
+
+/dev/wwan0mbim0 write()
+~~~~~~~~~~~~~~~~~~~~~~~
+The MBIM control messages from the management application must not exceed the
+negotiated control message size.
+
+/dev/wwan0mbim0 read()
+~~~~~~~~~~~~~~~~~~~~~~
+The management application must accept control messages of up the negotiated
+control message size.
+
+MBIM data channel userspace ABI
+-------------------------------
+
+wwan0-X network device
+~~~~~~~~~~~~~~~~~~~~~~
+The t7xx driver exposes IP link interface "wwan0-X" of type "wwan" for IP
+traffic. Iproute network utility is used for creating "wwan0-X" network
+interface and for associating it with MBIM IP session.
+
+The userspace management application is responsible for creating new IP link
+prior to establishing MBIM IP session where the SessionId is greater than 0.
+
+For example, creating new IP link for a MBIM IP session with SessionId 1:
+
+  ip link add dev wwan0-1 parentdev wwan0 type wwan linkid 1
+
+The driver will automatically map the "wwan0-1" network device to MBIM IP
+session 1.
+
+AT port userspace ABI
+----------------------------------
+
+/dev/wwan0at0 character device
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The driver exposes an AT port by implementing AT WWAN Port.
+The userspace end of the control port is a /dev/wwan0at0 character
+device. Application shall use this interface to issue AT commands.
+
+fastboot port userspace ABI
+---------------------------
+
+/dev/wwan0fastboot0 character device
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The driver exposes a fastboot protocol interface by implementing
+fastboot WWAN Port. The userspace end of the fastboot channel pipe is a
+/dev/wwan0fastboot0 character device. Application shall use this interface for
+fastboot protocol communication.
+
+Please note that driver needs to be reloaded to export /dev/wwan0fastboot0
+port, because device needs a cold reset after enter ``fastboot_switching``
+mode.
+
+ADB port userspace ABI
+----------------------
+
+/dev/wwan0adb0 character device
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The driver exposes a ADB protocol interface by implementing ADB WWAN Port.
+The userspace end of the ADB channel pipe is a /dev/wwan0adb0 character device.
+Application shall use this interface for ADB protocol communication.
+
+MIPC port userspace ABI
+-----------------------
+
+/dev/wwan0mipc0 character device
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The driver exposes a diagnostic interface by implementing MIPC (Modem
+Information Process Center) WWAN Port. The userspace end of the MIPC channel
+pipe is a /dev/wwan0mipc0 character device.
+Application shall use this interface for MTK modem diagnostic communication.
+
+The MediaTek's T700 modem supports the 3GPP TS 27.007 [4] specification.
+
+References
+==========
+[1] *MBIM (Mobile Broadband Interface Model) Errata-1*
+
+- https://www.usb.org/document-library/
+
+[2] *libmbim "a glib-based library for talking to WWAN modems and devices which
+speak the Mobile Interface Broadband Model (MBIM) protocol"*
+
+- http://www.freedesktop.org/wiki/Software/libmbim/
+
+[3] *Modem Manager "a DBus-activated daemon which controls mobile broadband
+(2G/3G/4G/5G) devices and connections"*
+
+- http://www.freedesktop.org/wiki/Software/ModemManager/
+
+[4] *Specification # 27.007 - 3GPP*
+
+- https://www.3gpp.org/DynaReport/27007.htm
+
+[5] *fastboot "a mechanism for communicating with bootloaders"*
+
+- https://android.googlesource.com/platform/system/core/+/refs/heads/main/fastboot/README.md
+
+[6] *ADB (Android Debug Bridge) "a mechanism to keep track of Android devices
+and emulators instances connected to or running on a given host developer
+machine with ADB protocol"*
+
+- https://android.googlesource.com/platform/packages/modules/adb/+/refs/heads/main/README.md
diff --git a/Documentation/networking/devlink/am65-nuss-cpsw-switch.rst b/Documentation/networking/devlink/am65-nuss-cpsw-switch.rst
new file mode 100644
index 000000000000..1e589c26abff
--- /dev/null
+++ b/Documentation/networking/devlink/am65-nuss-cpsw-switch.rst
@@ -0,0 +1,26 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+am65-cpsw-nuss devlink support
+==============================
+
+This document describes the devlink features implemented by the ``am65-cpsw-nuss``
+device driver.
+
+Parameters
+==========
+
+The ``am65-cpsw-nuss`` driver implements the following driver-specific
+parameters.
+
+.. list-table:: Driver-specific parameters implemented
+   :widths: 5 5 5 85
+
+   * - Name
+     - Type
+     - Mode
+     - Description
+   * - ``switch_mode``
+     - Boolean
+     - runtime
+     - Enable switch mode
diff --git a/Documentation/networking/devlink/bnxt.rst b/Documentation/networking/devlink/bnxt.rst
index 3dfd84ccb1c7..9a8b3d76d11f 100644
--- a/Documentation/networking/devlink/bnxt.rst
+++ b/Documentation/networking/devlink/bnxt.rst
@@ -22,6 +22,10 @@ Parameters
      - Permanent
    * - ``msix_vec_per_pf_min``
      - Permanent
+   * - ``enable_remote_dev_reset``
+     - Runtime
+   * - ``enable_roce``
+     - Permanent
 
 The ``bnxt`` driver also implements the following driver-specific
 parameters.
diff --git a/Documentation/networking/devlink/devlink-dpipe.rst b/Documentation/networking/devlink/devlink-dpipe.rst
index 468fe1001b74..af37f250df43 100644
--- a/Documentation/networking/devlink/devlink-dpipe.rst
+++ b/Documentation/networking/devlink/devlink-dpipe.rst
@@ -52,7 +52,7 @@ purposes as a standard complementary tool. The system's view from
 ``devlink-dpipe`` should change according to the changes done by the
 standard configuration tools.
 
-For example, it’s quiet common to  implement Access Control Lists (ACL)
+For example, it’s quite common to  implement Access Control Lists (ACL)
 using Ternary Content Addressable Memory (TCAM). The TCAM memory can be
 divided into TCAM regions. Complex TC filters can have multiple rules with
 different priorities and different lookup keys. On the other hand hardware
diff --git a/Documentation/networking/devlink/devlink-eswitch-attr.rst b/Documentation/networking/devlink/devlink-eswitch-attr.rst
new file mode 100644
index 000000000000..08bb39ab1528
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-eswitch-attr.rst
@@ -0,0 +1,76 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+Devlink E-Switch Attribute
+==========================
+
+Devlink E-Switch supports two modes of operation: legacy and switchdev.
+Legacy mode operates based on traditional MAC/VLAN steering rules. Switching
+decisions are made based on MAC addresses, VLANs, etc. There is limited ability
+to offload switching rules to hardware.
+
+On the other hand, switchdev mode allows for more advanced offloading
+capabilities of the E-Switch to hardware. In switchdev mode, more switching
+rules and logic can be offloaded to the hardware switch ASIC. It enables
+representor netdevices that represent the slow path of virtual functions (VFs)
+or scalable-functions (SFs) of the device. See more information about
+:ref:`Documentation/networking/switchdev.rst <switchdev>` and
+:ref:`Documentation/networking/representors.rst <representors>`.
+
+In addition, the devlink E-Switch also comes with other attributes listed
+in the following section.
+
+Attributes Description
+======================
+
+The following is a list of E-Switch attributes.
+
+.. list-table:: E-Switch attributes
+   :widths: 8 5 45
+
+   * - Name
+     - Type
+     - Description
+   * - ``mode``
+     - enum
+     - The mode of the device. The mode can be one of the following:
+
+       * ``legacy`` operates based on traditional MAC/VLAN steering
+         rules.
+       * ``switchdev`` allows for more advanced offloading capabilities of
+         the E-Switch to hardware.
+   * - ``inline-mode``
+     - enum
+     - Some HWs need the VF driver to put part of the packet
+       headers on the TX descriptor so the e-switch can do proper
+       matching and steering. Support for both switchdev mode and legacy mode.
+
+       * ``none`` none.
+       * ``link`` L2 mode.
+       * ``network`` L3 mode.
+       * ``transport`` L4 mode.
+   * - ``encap-mode``
+     - enum
+     - The encapsulation mode of the device. Support for both switchdev mode
+       and legacy mode. The mode can be one of the following:
+
+       * ``none`` Disable encapsulation support.
+       * ``basic`` Enable encapsulation support.
+
+Example Usage
+=============
+
+.. code:: shell
+
+    # enable switchdev mode
+    $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev
+
+    # set inline-mode and encap-mode
+    $ devlink dev eswitch set pci/0000:08:00.0 inline-mode none encap-mode basic
+
+    # display devlink device eswitch attributes
+    $ devlink dev eswitch show pci/0000:08:00.0
+      pci/0000:08:00.0: mode switchdev inline-mode none encap-mode basic
+
+    # enable encap-mode with legacy mode
+    $ devlink dev eswitch set pci/0000:08:00.0 mode legacy inline-mode none encap-mode basic
diff --git a/Documentation/networking/devlink/devlink-flash.rst b/Documentation/networking/devlink/devlink-flash.rst
index 40a87c0222cb..603e732f00cc 100644
--- a/Documentation/networking/devlink/devlink-flash.rst
+++ b/Documentation/networking/devlink/devlink-flash.rst
@@ -16,6 +16,34 @@ Note that the file name is a path relative to the firmware loading path
 (usually ``/lib/firmware/``). Drivers may send status updates to inform
 user space about the progress of the update operation.
 
+Overwrite Mask
+==============
+
+The ``devlink-flash`` command allows optionally specifying a mask indicating
+how the device should handle subsections of flash components when updating.
+This mask indicates the set of sections which are allowed to be overwritten.
+
+.. list-table:: List of overwrite mask bits
+   :widths: 5 95
+
+   * - Name
+     - Description
+   * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS``
+     - Indicates that the device should overwrite settings in the components
+       being updated with the settings found in the provided image.
+   * - ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS``
+     - Indicates that the device should overwrite identifiers in the
+       components being updated with the identifiers found in the provided
+       image. This includes MAC addresses, serial IDs, and similar device
+       identifiers.
+
+Multiple overwrite bits may be combined and requested together. If no bits
+are provided, it is expected that the device only update firmware binaries
+in the components being updated. Settings and identifiers are expected to be
+preserved across the update. A device may not support every combination and
+the driver for such a device must reject any combination which cannot be
+faithfully implemented.
+
 Firmware Loading
 ================
 
diff --git a/Documentation/networking/devlink/devlink-health.rst b/Documentation/networking/devlink/devlink-health.rst
index 0c99b11f05f9..e0b8cfed610a 100644
--- a/Documentation/networking/devlink/devlink-health.rst
+++ b/Documentation/networking/devlink/devlink-health.rst
@@ -24,7 +24,7 @@ attributes of the health reporting and recovery procedures.
 
 The ``devlink`` health reporter:
 Device driver creates a "health reporter" per each error/health type.
-Error/Health type can be a known/generic (eg pci error, fw error, rx/tx error)
+Error/Health type can be a known/generic (e.g. PCI error, fw error, rx/tx error)
 or unknown (driver specific).
 For each registered health reporter a driver can issue error/health reports
 asynchronously. All health reports handling is done by ``devlink``.
@@ -33,7 +33,7 @@ Device driver can provide specific callbacks for each "health reporter", e.g.:
   * Recovery procedures
   * Diagnostics procedures
   * Object dump procedures
-  * OOB initial parameters
+  * Out Of Box initial parameters
 
 Different parts of the driver can register different types of health reporters
 with different handlers.
@@ -46,11 +46,31 @@ Once an error is reported, devlink health will perform the following actions:
   * A log is being send to the kernel trace events buffer
   * Health status and statistics are being updated for the reporter instance
   * Object dump is being taken and saved at the reporter instance (as long as
-    there is no other dump which is already stored)
+    auto-dump is set and there is no other dump which is already stored)
   * Auto recovery attempt is being done. Depends on:
+
     - Auto-recovery configuration
     - Grace period vs. time passed since last recover
 
+Devlink formatted message
+=========================
+
+To handle devlink health diagnose and health dump requests, devlink creates a
+formatted message structure ``devlink_fmsg`` and send it to the driver's callback
+to fill the data in using the devlink fmsg API.
+
+Devlink fmsg is a mechanism to pass descriptors between drivers and devlink, in
+json-like format. The API allows the driver to add nested attributes such as
+object, object pair and value array, in addition to attributes such as name and
+value.
+
+Driver should use this API to fill the fmsg context in a format which will be
+translated by the devlink to the netlink message later. When it needs to send
+the data using SKBs to the netlink layer, it fragments the data between
+different SKBs. In order to do this fragmentation, it uses virtual nests
+attributes, to avoid actual nesting use which cannot be divided between
+different SKBs.
+
 User Interface
 ==============
 
@@ -72,14 +92,18 @@ via ``devlink``, e.g per error type (per health reporter):
    * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
      - Allows reporter-related configuration setting.
    * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
-     - Triggers a reporter's recovery procedure.
+     - Triggers reporter's recovery procedure.
+   * - ``DEVLINK_CMD_HEALTH_REPORTER_TEST``
+     - Triggers a fake health event on the reporter. The effects of the test
+       event in terms of recovery flow should follow closely that of a real
+       event.
    * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
-     - Retrieves diagnostics data from a reporter on a device.
+     - Retrieves current device state related to the reporter.
    * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
      - Retrieves the last stored dump. Devlink health
-       saves a single dump. If an dump is not already stored by the devlink
+       saves a single dump. If an dump is not already stored by devlink
        for this reporter, devlink generates a new dump.
-       dump output is defined by the reporter.
+       Dump output is defined by the reporter.
    * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
      - Clears the last saved dump file for the specified reporter.
 
@@ -93,7 +117,7 @@ The following diagram provides a general overview of ``devlink-health``::
                                           +--------------------------+
                                                        |request for ops
                                                        |(diagnose,
-     mlx5_core                             devlink     |recover,
+      driver                               devlink     |recover,
                                                        |dump)
     +--------+                            +--------------------------+
     |        |                            |    reporter|             |
diff --git a/Documentation/networking/devlink/devlink-info.rst b/Documentation/networking/devlink/devlink-info.rst
index 3fe11401b838..dd6adc4d0559 100644
--- a/Documentation/networking/devlink/devlink-info.rst
+++ b/Documentation/networking/devlink/devlink-info.rst
@@ -44,9 +44,11 @@ versions is generally discouraged - here, and via any other Linux API.
        reported for two ports of the same device or on two hosts of
        a multi-host device should be identical.
 
-       .. note:: ``devlink-info`` API should be extended with a new field
-	  if devices want to report board/product serial number (often
-	  reported in PCI *Vital Product Data* capability).
+   * - ``board.serial_number``
+     - Board serial number of the device.
+
+       This is usually the serial number of the board, often available in
+       PCI *Vital Product Data*.
 
    * - ``fixed``
      - Group for hardware identifiers, and versions of components
@@ -84,6 +86,10 @@ In case software/firmware components are loaded from the disk (e.g.
 ``/lib/firmware``) only the running version should be reported via
 the kernel API.
 
+Please note that any security versions reported via devlink are purely
+informational. Devlink does not use a secure channel to communicate with
+the device.
+
 Generic Versions
 ================
 
@@ -144,6 +150,11 @@ board.manufacture
 
 An identifier of the company or the facility which produced the part.
 
+board.part_number
+-----------------
+
+Part number of the board and its components.
+
 fw
 --
 
@@ -196,15 +207,16 @@ fw.bundle_id
 
 Unique identifier of the entire firmware bundle.
 
+fw.bootloader
+-------------
+
+Version of the bootloader.
+
 Future work
 ===========
 
 The following extensions could be useful:
 
- - product serial number - NIC boards often get labeled with a board serial
-   number rather than ASIC serial number; it'd be useful to add board serial
-   numbers to the API if they can be retrieved from the device;
-
  - on-disk firmware file names - drivers list the file names of firmware they
    may need to load onto devices via the ``MODULE_FIRMWARE()`` macro. These,
    however, are per module, rather than per device. It'd be useful to list
diff --git a/Documentation/networking/devlink/devlink-linecard.rst b/Documentation/networking/devlink/devlink-linecard.rst
new file mode 100644
index 000000000000..6c0b8928bc13
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-linecard.rst
@@ -0,0 +1,122 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Devlink Line card
+=================
+
+Background
+==========
+
+The ``devlink-linecard`` mechanism is targeted for manipulation of
+line cards that serve as a detachable PHY modules for modular switch
+system. Following operations are provided:
+
+  * Get a list of supported line card types.
+  * Provision of a slot with specific line card type.
+  * Get and monitor of line card state and its change.
+
+Line card according to the type may contain one or more gearboxes
+to mux the lanes with certain speed to multiple ports with lanes
+of different speed. Line card ensures N:M mapping between
+the switch ASIC modules and physical front panel ports.
+
+Overview
+========
+
+Each line card devlink object is created by device driver,
+according to the physical line card slots available on the device.
+
+Similar to splitter cable, where the device might have no way
+of detection of the splitter cable geometry, the device
+might not have a way to detect line card type. For that devices,
+concept of provisioning is introduced. It allows the user to:
+
+  * Provision a line card slot with certain line card type
+
+    - Device driver would instruct the ASIC to prepare all
+      resources accordingly. The device driver would
+      create all instances, namely devlink port and netdevices
+      that reside on the line card, according to the line card type
+  * Manipulate of line card entities even without line card
+    being physically connected or powered-up
+  * Setup splitter cable on line card ports
+
+    - As on the ordinary ports, user may provision a splitter
+      cable of a certain type, without the need to
+      be physically connected to the port
+  * Configure devlink ports and netdevices
+
+Netdevice carrier is decided as follows:
+
+  * Line card is not inserted or powered-down
+
+    - The carrier is always down
+  * Line card is inserted and powered up
+
+    - The carrier is decided as for ordinary port netdevice
+
+Line card state
+===============
+
+The ``devlink-linecard`` mechanism supports the following line card states:
+
+  * ``unprovisioned``: Line card is not provisioned on the slot.
+  * ``unprovisioning``: Line card slot is currently being unprovisioned.
+  * ``provisioning``: Line card slot is currently in a process of being provisioned
+    with a line card type.
+  * ``provisioning_failed``: Provisioning was not successful.
+  * ``provisioned``: Line card slot is provisioned with a type.
+  * ``active``: Line card is powered-up and active.
+
+The following diagram provides a general overview of ``devlink-linecard``
+state transitions::
+
+                                          +-------------------------+
+                                          |                         |
+       +---------------------------------->      unprovisioned      |
+       |                                  |                         |
+       |                                  +--------|-------^--------+
+       |                                           |       |
+       |                                           |       |
+       |                                  +--------v-------|--------+
+       |                                  |                         |
+       |                                  |       provisioning      |
+       |                                  |                         |
+       |                                  +------------|------------+
+       |                                               |
+       |                 +-----------------------------+
+       |                 |                             |
+       |    +------------v------------+   +------------v------------+   +-------------------------+
+       |    |                         |   |                         ---->                         |
+       +-----   provisioning_failed   |   |       provisioned       |   |         active          |
+       |    |                         |   |                         <----                         |
+       |    +------------^------------+   +------------|------------+   +-------------------------+
+       |                 |                             |
+       |                 |                             |
+       |                 |                +------------v------------+
+       |                 |                |                         |
+       |                 |                |     unprovisioning      |
+       |                 |                |                         |
+       |                 |                +------------|------------+
+       |                 |                             |
+       |                 +-----------------------------+
+       |                                               |
+       +-----------------------------------------------+
+
+
+Example usage
+=============
+
+.. code:: shell
+
+    $ devlink lc show [ DEV [ lc LC_INDEX ] ]
+    $ devlink lc set DEV lc LC_INDEX [ { type LC_TYPE | notype } ]
+
+    # Show current line card configuration and status for all slots:
+    $ devlink lc
+
+    # Set slot 8 to be provisioned with type "16x100G":
+    $ devlink lc set pci/0000:01:00.0 lc 8 type 16x100G
+
+    # Set slot 8 to be unprovisioned:
+    $ devlink lc set pci/0000:01:00.0 lc 8 notype
diff --git a/Documentation/networking/devlink/devlink-params.rst b/Documentation/networking/devlink/devlink-params.rst
index d075fd090b3d..4e01dc32bc08 100644
--- a/Documentation/networking/devlink/devlink-params.rst
+++ b/Documentation/networking/devlink/devlink-params.rst
@@ -97,14 +97,43 @@ own name.
    * - ``enable_roce``
      - Boolean
      - Enable handling of RoCE traffic in the device.
+   * - ``enable_eth``
+     - Boolean
+     - When enabled, the device driver will instantiate Ethernet specific
+       auxiliary device of the devlink device.
+   * - ``enable_rdma``
+     - Boolean
+     - When enabled, the device driver will instantiate RDMA specific
+       auxiliary device of the devlink device.
+   * - ``enable_vnet``
+     - Boolean
+     - When enabled, the device driver will instantiate VDPA networking
+       specific auxiliary device of the devlink device.
+   * - ``enable_iwarp``
+     - Boolean
+     - Enable handling of iWARP traffic in the device.
    * - ``internal_err_reset``
      - Boolean
      - When enabled, the device driver will reset the device on internal
        errors.
    * - ``max_macs``
      - u32
-     - Specifies the maximum number of MAC addresses per ethernet port of
-       this device.
+     - Typically macvlan, vlan net devices mac are also programmed in their
+       parent netdevice's Function rx filter. This parameter limit the
+       maximum number of unicast mac address filters to receive traffic from
+       per ethernet port of this device.
    * - ``region_snapshot_enable``
      - Boolean
      - Enable capture of ``devlink-region`` snapshots.
+   * - ``enable_remote_dev_reset``
+     - Boolean
+     - Enable device reset by remote host. When cleared, the device driver
+       will NACK any attempt of other host to reset the device. This parameter
+       is useful for setups where a device is shared by different hosts, such
+       as multi-host setup.
+   * - ``io_eq_size``
+     - u32
+     - Control the size of I/O completion EQs.
+   * - ``event_eq_size``
+     - u32
+     - Control the size of asynchronous control events EQ.
diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst
new file mode 100644
index 000000000000..9d22d41a7cd1
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-port.rst
@@ -0,0 +1,476 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _devlink_port:
+
+============
+Devlink Port
+============
+
+``devlink-port`` is a port that exists on the device. It has a logically
+separate ingress/egress point of the device. A devlink port can be any one
+of many flavours. A devlink port flavour along with port attributes
+describe what a port represents.
+
+A device driver that intends to publish a devlink port sets the
+devlink port attributes and registers the devlink port.
+
+Devlink port flavours are described below.
+
+.. list-table:: List of devlink port flavours
+   :widths: 33 90
+
+   * - Flavour
+     - Description
+   * - ``DEVLINK_PORT_FLAVOUR_PHYSICAL``
+     - Any kind of physical port. This can be an eswitch physical port or any
+       other physical port on the device.
+   * - ``DEVLINK_PORT_FLAVOUR_DSA``
+     - This indicates a DSA interconnect port.
+   * - ``DEVLINK_PORT_FLAVOUR_CPU``
+     - This indicates a CPU port applicable only to DSA.
+   * - ``DEVLINK_PORT_FLAVOUR_PCI_PF``
+     - This indicates an eswitch port representing a port of PCI
+       physical function (PF).
+   * - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
+     - This indicates an eswitch port representing a port of PCI
+       virtual function (VF).
+   * - ``DEVLINK_PORT_FLAVOUR_PCI_SF``
+     - This indicates an eswitch port representing a port of PCI
+       subfunction (SF).
+   * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
+     - This indicates a virtual port for the PCI virtual function.
+
+Devlink port can have a different type based on the link layer described below.
+
+.. list-table:: List of devlink port types
+   :widths: 23 90
+
+   * - Type
+     - Description
+   * - ``DEVLINK_PORT_TYPE_ETH``
+     - Driver should set this port type when a link layer of the port is
+       Ethernet.
+   * - ``DEVLINK_PORT_TYPE_IB``
+     - Driver should set this port type when a link layer of the port is
+       InfiniBand.
+   * - ``DEVLINK_PORT_TYPE_AUTO``
+     - This type is indicated by the user when driver should detect the port
+       type automatically.
+
+PCI controllers
+---------------
+In most cases a PCI device has only one controller. A controller consists of
+potentially multiple physical, virtual functions and subfunctions. A function
+consists of one or more ports. This port is represented by the devlink eswitch
+port.
+
+A PCI device connected to multiple CPUs or multiple PCI root complexes or a
+SmartNIC, however, may have multiple controllers. For a device with multiple
+controllers, each controller is distinguished by a unique controller number.
+An eswitch is on the PCI device which supports ports of multiple controllers.
+
+An example view of a system with two controllers::
+
+                 ---------------------------------------------------------
+                 |                                                       |
+                 |           --------- ---------         ------- ------- |
+    -----------  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
+    | server  |  | -------   ----/---- ---/----- ------- ---/--- ---/--- |
+    | pci rc  |=== | pf0 |______/________/       | pf1 |___/_______/     |
+    | connect |  | -------                       -------                 |
+    -----------  |     | controller_num=1 (no eswitch)                   |
+                 ------|--------------------------------------------------
+                 (internal wire)
+                       |
+                 ---------------------------------------------------------
+                 | devlink eswitch ports and reps                        |
+                 | ----------------------------------------------------- |
+                 | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
+                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
+                 | ----------------------------------------------------- |
+                 | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
+                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
+                 | ----------------------------------------------------- |
+                 |                                                       |
+                 |                                                       |
+    -----------  |           --------- ---------         ------- ------- |
+    | smartNIC|  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
+    | pci rc  |==| -------   ----/---- ---/----- ------- ---/--- ---/--- |
+    | connect |  | | pf0 |______/________/       | pf1 |___/_______/     |
+    -----------  | -------                       -------                 |
+                 |                                                       |
+                 |  local controller_num=0 (eswitch)                     |
+                 ---------------------------------------------------------
+
+In the above example, the external controller (identified by controller number = 1)
+doesn't have the eswitch. Local controller (identified by controller number = 0)
+has the eswitch. The Devlink instance on the local controller has eswitch
+devlink ports for both the controllers.
+
+Function configuration
+======================
+
+Users can configure one or more function attributes before enumerating the PCI
+function. Usually it means, user should configure function attribute
+before a bus specific device for the function is created. However, when
+SRIOV is enabled, virtual function devices are created on the PCI bus.
+Hence, function attribute should be configured before binding virtual
+function device to the driver. For subfunctions, this means user should
+configure port function attribute before activating the port function.
+
+A user may set the hardware address of the function using
+`devlink port function set hw_addr` command. For Ethernet port function
+this means a MAC address.
+
+Users may also set the RoCE capability of the function using
+`devlink port function set roce` command.
+
+Users may also set the function as migratable using
+`devlink port function set migratable` command.
+
+Users may also set the IPsec crypto capability of the function using
+`devlink port function set ipsec_crypto` command.
+
+Users may also set the IPsec packet capability of the function using
+`devlink port function set ipsec_packet` command.
+
+Users may also set the maximum IO event queues of the function
+using `devlink port function set max_io_eqs` command.
+
+Function attributes
+===================
+
+MAC address setup
+-----------------
+The configured MAC address of the PCI VF/SF will be used by netdevice and rdma
+device created for the PCI VF/SF.
+
+- Get the MAC address of the VF identified by its unique devlink port index::
+
+    $ devlink port show pci/0000:06:00.0/2
+    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+      function:
+        hw_addr 00:00:00:00:00:00
+
+- Set the MAC address of the VF identified by its unique devlink port index::
+
+    $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55
+
+    $ devlink port show pci/0000:06:00.0/2
+    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+      function:
+        hw_addr 00:11:22:33:44:55
+
+- Get the MAC address of the SF identified by its unique devlink port index::
+
+    $ devlink port show pci/0000:06:00.0/32768
+    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
+      function:
+        hw_addr 00:00:00:00:00:00
+
+- Set the MAC address of the SF identified by its unique devlink port index::
+
+    $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88
+
+    $ devlink port show pci/0000:06:00.0/32768
+    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
+      function:
+        hw_addr 00:00:00:00:88:88
+
+RoCE capability setup
+---------------------
+Not all PCI VFs/SFs require RoCE capability.
+
+When RoCE capability is disabled, it saves system memory per PCI VF/SF.
+
+When user disables RoCE capability for a VF/SF, user application cannot send or
+receive any RoCE packets through this VF/SF and RoCE GID table for this PCI
+will be empty.
+
+When RoCE capability is disabled in the device using port function attribute,
+VF/SF driver cannot override it.
+
+- Get RoCE capability of the VF device::
+
+    $ devlink port show pci/0000:06:00.0/2
+    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+        function:
+            hw_addr 00:00:00:00:00:00 roce enable
+
+- Set RoCE capability of the VF device::
+
+    $ devlink port function set pci/0000:06:00.0/2 roce disable
+
+    $ devlink port show pci/0000:06:00.0/2
+    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+        function:
+            hw_addr 00:00:00:00:00:00 roce disable
+
+migratable capability setup
+---------------------------
+Live migration is the process of transferring a live virtual machine
+from one physical host to another without disrupting its normal
+operation.
+
+User who want PCI VFs to be able to perform live migration need to
+explicitly enable the VF migratable capability.
+
+When user enables migratable capability for a VF, and the HV binds the VF to VFIO driver
+with migration support, the user can migrate the VM with this VF from one HV to a
+different one.
+
+However, when migratable capability is enable, device will disable features which cannot
+be migrated. Thus migratable cap can impose limitations on a VF so let the user decide.
+
+Example of LM with migratable function configuration:
+- Get migratable capability of the VF device::
+
+    $ devlink port show pci/0000:06:00.0/2
+    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+        function:
+            hw_addr 00:00:00:00:00:00 migratable disable
+
+- Set migratable capability of the VF device::
+
+    $ devlink port function set pci/0000:06:00.0/2 migratable enable
+
+    $ devlink port show pci/0000:06:00.0/2
+    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+        function:
+            hw_addr 00:00:00:00:00:00 migratable enable
+
+- Bind VF to VFIO driver with migration support::
+
+    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
+    $ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
+    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind
+
+Attach VF to the VM.
+Start the VM.
+Perform live migration.
+
+IPsec crypto capability setup
+-----------------------------
+When user enables IPsec crypto capability for a VF, user application can offload
+XFRM state crypto operation (Encrypt/Decrypt) to this VF.
+
+When IPsec crypto capability is disabled (default) for a VF, the XFRM state is
+processed in software by the kernel.
+
+- Get IPsec crypto capability of the VF device::
+
+    $ devlink port show pci/0000:06:00.0/2
+    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+        function:
+            hw_addr 00:00:00:00:00:00 ipsec_crypto disabled
+
+- Set IPsec crypto capability of the VF device::
+
+    $ devlink port function set pci/0000:06:00.0/2 ipsec_crypto enable
+
+    $ devlink port show pci/0000:06:00.0/2
+    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+        function:
+            hw_addr 00:00:00:00:00:00 ipsec_crypto enabled
+
+IPsec packet capability setup
+-----------------------------
+When user enables IPsec packet capability for a VF, user application can offload
+XFRM state and policy crypto operation (Encrypt/Decrypt) to this VF, as well as
+IPsec encapsulation.
+
+When IPsec packet capability is disabled (default) for a VF, the XFRM state and
+policy is processed in software by the kernel.
+
+- Get IPsec packet capability of the VF device::
+
+    $ devlink port show pci/0000:06:00.0/2
+    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+        function:
+            hw_addr 00:00:00:00:00:00 ipsec_packet disabled
+
+- Set IPsec packet capability of the VF device::
+
+    $ devlink port function set pci/0000:06:00.0/2 ipsec_packet enable
+
+    $ devlink port show pci/0000:06:00.0/2
+    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+        function:
+            hw_addr 00:00:00:00:00:00 ipsec_packet enabled
+
+Maximum IO events queues setup
+------------------------------
+When user sets maximum number of IO event queues for a SF or
+a VF, such function driver is limited to consume only enforced
+number of IO event queues.
+
+IO event queues deliver events related to IO queues, including network
+device transmit and receive queues (txq and rxq) and RDMA Queue Pairs (QPs).
+For example, the number of netdevice channels and RDMA device completion
+vectors are derived from the function's IO event queues. Usually, the number
+of interrupt vectors consumed by the driver is limited by the number of IO
+event queues per device, as each of the IO event queues is connected to an
+interrupt vector.
+
+- Get maximum IO event queues of the VF device::
+
+    $ devlink port show pci/0000:06:00.0/2
+    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+        function:
+            hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 10
+
+- Set maximum IO event queues of the VF device::
+
+    $ devlink port function set pci/0000:06:00.0/2 max_io_eqs 32
+
+    $ devlink port show pci/0000:06:00.0/2
+    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+        function:
+            hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 32
+
+Subfunction
+============
+
+Subfunction is a lightweight function that has a parent PCI function on which
+it is deployed. Subfunction is created and deployed in unit of 1. Unlike
+SRIOV VFs, a subfunction doesn't require its own PCI virtual function.
+A subfunction communicates with the hardware through the parent PCI function.
+
+To use a subfunction, 3 steps setup sequence is followed:
+
+1) create - create a subfunction;
+2) configure - configure subfunction attributes;
+3) deploy - deploy the subfunction;
+
+Subfunction management is done using devlink port user interface.
+User performs setup on the subfunction management device.
+
+(1) Create
+----------
+A subfunction is created using a devlink port interface. A user adds the
+subfunction by adding a devlink port of subfunction flavour. The devlink
+kernel code calls down to subfunction management driver (devlink ops) and asks
+it to create a subfunction devlink port. Driver then instantiates the
+subfunction port and any associated objects such as health reporters and
+representor netdevice.
+
+(2) Configure
+-------------
+A subfunction devlink port is created but it is not active yet. That means the
+entities are created on devlink side, the e-switch port representor is created,
+but the subfunction device itself is not created. A user might use e-switch port
+representor to do settings, putting it into bridge, adding TC rules, etc. A user
+might as well configure the hardware address (such as MAC address) of the
+subfunction while subfunction is inactive.
+
+(3) Deploy
+----------
+Once a subfunction is configured, user must activate it to use it. Upon
+activation, subfunction management driver asks the subfunction management
+device to instantiate the subfunction device on particular PCI function.
+A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`.
+At this point a matching subfunction driver binds to the subfunction's auxiliary device.
+
+Rate object management
+======================
+
+Devlink provides API to manage tx rates of single devlink port or a group.
+This is done through rate objects, which can be one of the two types:
+
+``leaf``
+  Represents a single devlink port; created/destroyed by the driver. Since leaf
+  have 1to1 mapping to its devlink port, in user space it is referred as
+  ``pci/<bus_addr>/<port_index>``;
+
+``node``
+  Represents a group of rate objects (leafs and/or nodes); created/deleted by
+  request from the userspace; initially empty (no rate objects added). In
+  userspace it is referred as ``pci/<bus_addr>/<node_name>``, where
+  ``node_name`` can be any identifier, except decimal number, to avoid
+  collisions with leafs.
+
+API allows to configure following rate object's parameters:
+
+``tx_share``
+  Minimum TX rate value shared among all other rate objects, or rate objects
+  that parts of the parent group, if it is a part of the same group.
+
+``tx_max``
+  Maximum TX rate value.
+
+``tx_priority``
+  Allows for usage of strict priority arbiter among siblings. This
+  arbitration scheme attempts to schedule nodes based on their priority
+  as long as the nodes remain within their bandwidth limit. The higher the
+  priority the higher the probability that the node will get selected for
+  scheduling.
+
+``tx_weight``
+  Allows for usage of Weighted Fair Queuing arbitration scheme among
+  siblings. This arbitration scheme can be used simultaneously with the
+  strict priority. As a node is configured with a higher rate it gets more
+  BW relative to its siblings. Values are relative like a percentage
+  points, they basically tell how much BW should node take relative to
+  its siblings.
+
+``parent``
+  Parent node name. Parent node rate limits are considered as additional limits
+  to all node children limits. ``tx_max`` is an upper limit for children.
+  ``tx_share`` is a total bandwidth distributed among children.
+
+``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
+nodes with the same priority form a WFQ subgroup in the sibling group
+and arbitration among them is based on assigned weights.
+
+Arbitration flow from the high level:
+
+#. Choose a node, or group of nodes with the highest priority that stays
+   within the BW limit and are not blocked. Use ``tx_priority`` as a
+   parameter for this arbitration.
+
+#. If group of nodes have the same priority perform WFQ arbitration on
+   that subgroup. Use ``tx_weight`` as a parameter for this arbitration.
+
+#. Select the winner node, and continue arbitration flow among its children,
+   until leaf node is reached, and the winner is established.
+
+#. If all the nodes from the highest priority sub-group are satisfied, or
+   overused their assigned BW, move to the lower priority nodes.
+
+Driver implementations are allowed to support both or either rate object types
+and setting methods of their parameters. Additionally driver implementation
+may export nodes/leafs and their child-parent relationships.
+
+Terms and Definitions
+=====================
+
+.. list-table:: Terms and Definitions
+   :widths: 22 90
+
+   * - Term
+     - Definitions
+   * - ``PCI device``
+     - A physical PCI device having one or more PCI buses consists of one or
+       more PCI controllers.
+   * - ``PCI controller``
+     -  A controller consists of potentially multiple physical functions,
+        virtual functions and subfunctions.
+   * - ``Port function``
+     -  An object to manage the function of a port.
+   * - ``Subfunction``
+     -  A lightweight function that has parent PCI function on which it is
+        deployed.
+   * - ``Subfunction device``
+     -  A bus device of the subfunction, usually on a auxiliary bus.
+   * - ``Subfunction driver``
+     -  A device driver for the subfunction auxiliary device.
+   * - ``Subfunction management device``
+     -  A PCI physical function that supports subfunction management.
+   * - ``Subfunction management driver``
+     -  A device driver for PCI physical function that supports
+        subfunction management using devlink port interface.
+   * - ``Subfunction host driver``
+     -  A device driver for PCI physical function that hosts subfunction
+        devices. In most cases it is same as subfunction management driver. When
+        subfunction is used on external controller, subfunction management and
+        host drivers are different.
diff --git a/Documentation/networking/devlink/devlink-region.rst b/Documentation/networking/devlink/devlink-region.rst
index 3654c3e9658f..5d0b68f752c0 100644
--- a/Documentation/networking/devlink/devlink-region.rst
+++ b/Documentation/networking/devlink/devlink-region.rst
@@ -22,7 +22,7 @@ The major benefit to creating a region is to provide access to internal
 address regions that are otherwise inaccessible to the user.
 
 Regions may also be used to provide an additional way to debug complex error
-states, but see also :doc:`devlink-health`
+states, but see also Documentation/networking/devlink/devlink-health.rst
 
 Regions may optionally support capturing a snapshot on demand via the
 ``DEVLINK_CMD_REGION_NEW`` netlink message. A driver wishing to allow
@@ -31,6 +31,15 @@ in its ``devlink_region_ops`` structure. If snapshot id is not set in
 the ``DEVLINK_CMD_REGION_NEW`` request kernel will allocate one and send
 the snapshot information to user space.
 
+Regions may optionally allow directly reading from their contents without a
+snapshot. Direct read requests are not atomic. In particular a read request
+of size 256 bytes or larger will be split into multiple chunks. If atomic
+access is required, use a snapshot. A driver wishing to enable this for a
+region should implement the ``.read`` callback in the ``devlink_region_ops``
+structure. User space can request a direct read by using the
+``DEVLINK_ATTR_REGION_DIRECT`` attribute instead of specifying a snapshot
+id.
+
 example usage
 -------------
 
@@ -40,12 +49,12 @@ example usage
     $ devlink region show [ DEV/REGION ]
     $ devlink region del DEV/REGION snapshot SNAPSHOT_ID
     $ devlink region dump DEV/REGION [ snapshot SNAPSHOT_ID ]
-    $ devlink region read DEV/REGION [ snapshot SNAPSHOT_ID ] address ADDRESS length length
+    $ devlink region read DEV/REGION [ snapshot SNAPSHOT_ID ] address ADDRESS length LENGTH
 
     # Show all of the exposed regions with region sizes:
     $ devlink region show
-    pci/0000:00:05.0/cr-space: size 1048576 snapshot [1 2]
-    pci/0000:00:05.0/fw-health: size 64 snapshot [1 2]
+    pci/0000:00:05.0/cr-space: size 1048576 snapshot [1 2] max 8
+    pci/0000:00:05.0/fw-health: size 64 snapshot [1 2] max 8
 
     # Delete a snapshot using:
     $ devlink region del pci/0000:00:05.0/cr-space snapshot 1
@@ -65,6 +74,10 @@ example usage
     $ devlink region read pci/0000:00:05.0/fw-health snapshot 1 address 0 length 16
     0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
 
+    # Read from the region without a snapshot
+    $ devlink region read pci/0000:00:05.0/fw-health address 16 length 16
+    0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
+
 As regions are likely very device or driver specific, no generic regions are
 defined. See the driver-specific documentation files for information on the
 specific regions a driver supports.
diff --git a/Documentation/networking/devlink/devlink-reload.rst b/Documentation/networking/devlink/devlink-reload.rst
new file mode 100644
index 000000000000..2fb0269b2054
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-reload.rst
@@ -0,0 +1,90 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+Devlink Reload
+==============
+
+``devlink-reload`` provides mechanism to reinit driver entities, applying
+``devlink-params`` and ``devlink-resources`` new values. It also provides
+mechanism to activate firmware.
+
+Reload Actions
+==============
+
+User may select a reload action.
+By default ``driver_reinit`` action is selected.
+
+.. list-table:: Possible reload actions
+   :widths: 5 90
+
+   * - Name
+     - Description
+   * - ``driver-reinit``
+     - Devlink driver entities re-initialization, including applying
+       new values to devlink entities which are used during driver
+       load which are:
+
+       * ``devlink-params`` in configuration mode ``driverinit``
+       * ``devlink-resources``
+
+       Other devlink entities may stay over the re-initialization:
+
+       * ``devlink-health-reporter``
+       * ``devlink-region``
+
+       The rest of the devlink entities have to be removed and readded.
+   * - ``fw_activate``
+     - Firmware activate. Activates new firmware if such image is stored and
+       pending activation. If no limitation specified this action may involve
+       firmware reset. If no new image pending this action will reload current
+       firmware image.
+
+Note that even though user asks for a specific action, the driver
+implementation might require to perform another action alongside with
+it. For example, some driver do not support driver reinitialization
+being performed without fw activation. Therefore, the devlink reload
+command returns the list of actions which were actrually performed.
+
+Reload Limits
+=============
+
+By default reload actions are not limited and driver implementation may
+include reset or downtime as needed to perform the actions.
+
+However, some drivers support action limits, which limit the action
+implementation to specific constraints.
+
+.. list-table:: Possible reload limits
+   :widths: 5 90
+
+   * - Name
+     - Description
+   * - ``no_reset``
+     - No reset allowed, no down time allowed, no link flap and no
+       configuration is lost.
+
+Change Namespace
+================
+
+The netns option allows user to be able to move devlink instances into
+namespaces during devlink reload operation.
+By default all devlink instances are created in init_net and stay there.
+
+example usage
+-------------
+
+.. code:: shell
+
+    $ devlink dev reload help
+    $ devlink dev reload DEV [ netns { PID | NAME | ID } ] [ action { driver_reinit | fw_activate } ] [ limit no_reset ]
+
+    # Run reload command for devlink driver entities re-initialization:
+    $ devlink dev reload pci/0000:82:00.0 action driver_reinit
+    reload_actions_performed:
+      driver_reinit
+
+    # Run reload command to activate firmware:
+    # Note that mlx5 driver reloads the driver while activating firmware
+    $ devlink dev reload pci/0000:82:00.0 action fw_activate
+    reload_actions_performed:
+      driver_reinit fw_activate
diff --git a/Documentation/networking/devlink/devlink-resource.rst b/Documentation/networking/devlink/devlink-resource.rst
index 93e92d2f0752..3d5ae51e65a2 100644
--- a/Documentation/networking/devlink/devlink-resource.rst
+++ b/Documentation/networking/devlink/devlink-resource.rst
@@ -23,6 +23,20 @@ current size and related sub resources. To access a sub resource, you
 specify the path of the resource. For example ``/IPv4/fib`` is the id for
 the ``fib`` sub-resource under the ``IPv4`` resource.
 
+Generic Resources
+=================
+
+Generic resources are used to describe resources that can be shared by multiple
+device drivers and their description must be added to the following table:
+
+.. list-table:: List of Generic Resources
+   :widths: 10 90
+
+   * - Name
+     - Description
+   * - ``physical_ports``
+     - A limited capacity of physical ports that the switch ASIC can support
+
 example usage
 -------------
 
diff --git a/Documentation/networking/devlink/devlink-selftests.rst b/Documentation/networking/devlink/devlink-selftests.rst
new file mode 100644
index 000000000000..c0aa1f3aef0d
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-selftests.rst
@@ -0,0 +1,38 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+=================
+Devlink Selftests
+=================
+
+The ``devlink-selftests`` API allows executing selftests on the device.
+
+Tests Mask
+==========
+The ``devlink-selftests`` command should be run with a mask indicating
+the tests to be executed.
+
+Tests Description
+=================
+The following is a list of tests that drivers may execute.
+
+.. list-table:: List of tests
+   :widths: 5 90
+
+   * - Name
+     - Description
+   * - ``DEVLINK_SELFTEST_FLASH``
+     - Devices may have the firmware on non-volatile memory on the board, e.g.
+       flash. This particular test helps to run a flash selftest on the device.
+       Implementation of the test is left to the driver/firmware.
+
+example usage
+-------------
+
+.. code:: shell
+
+    # Query selftests supported on the devlink device
+    $ devlink dev selftests show DEV
+    # Query selftests supported on all devlink devices
+    $ devlink dev selftests show
+    # Executes selftests on the device
+    $ devlink dev selftests run DEV id flash
diff --git a/Documentation/networking/devlink/devlink-trap.rst b/Documentation/networking/devlink/devlink-trap.rst
index 1e3f3ffee248..5885e21e2212 100644
--- a/Documentation/networking/devlink/devlink-trap.rst
+++ b/Documentation/networking/devlink/devlink-trap.rst
@@ -405,6 +405,96 @@ be added to the following table:
      - ``control``
      - Traps packets logged during processing of flow action trap (e.g., via
        tc's trap action)
+   * - ``early_drop``
+     - ``drop``
+     - Traps packets dropped due to the RED (Random Early Detection) algorithm
+       (i.e., early drops)
+   * - ``vxlan_parsing``
+     - ``drop``
+     - Traps packets dropped due to an error in the VXLAN header parsing which
+       might be because of packet truncation or the I flag is not set.
+   * - ``llc_snap_parsing``
+     - ``drop``
+     - Traps packets dropped due to an error in the LLC+SNAP header parsing
+   * - ``vlan_parsing``
+     - ``drop``
+     - Traps packets dropped due to an error in the VLAN header parsing. Could
+       include unexpected packet truncation.
+   * - ``pppoe_ppp_parsing``
+     - ``drop``
+     - Traps packets dropped due to an error in the PPPoE+PPP header parsing.
+       This could include finding a session ID of 0xFFFF (which is reserved and
+       not for use), a PPPoE length which is larger than the frame received or
+       any common error on this type of header
+   * - ``mpls_parsing``
+     - ``drop``
+     - Traps packets dropped due to an error in the MPLS header parsing which
+       could include unexpected header truncation
+   * - ``arp_parsing``
+     - ``drop``
+     - Traps packets dropped due to an error in the ARP header parsing
+   * - ``ip_1_parsing``
+     - ``drop``
+     - Traps packets dropped due to an error in the first IP header parsing.
+       This packet trap could include packets which do not pass an IP checksum
+       check, a header length check (a minimum of 20 bytes), which might suffer
+       from packet truncation thus the total length field exceeds the received
+       packet length etc
+   * - ``ip_n_parsing``
+     - ``drop``
+     - Traps packets dropped due to an error in the parsing of the last IP
+       header (the inner one in case of an IP over IP tunnel). The same common
+       error checking is performed here as for the ip_1_parsing trap
+   * - ``gre_parsing``
+     - ``drop``
+     - Traps packets dropped due to an error in the GRE header parsing
+   * - ``udp_parsing``
+     - ``drop``
+     - Traps packets dropped due to an error in the UDP header parsing.
+       This packet trap could include checksum errors, an improper UDP
+       length detected (smaller than 8 bytes) or detection of header
+       truncation.
+   * - ``tcp_parsing``
+     - ``drop``
+     - Traps packets dropped due to an error in the TCP header parsing.
+       This could include TCP checksum errors, improper combination of SYN, FIN
+       and/or RESET etc.
+   * - ``ipsec_parsing``
+     - ``drop``
+     - Traps packets dropped due to an error in the IPSEC header parsing
+   * - ``sctp_parsing``
+     - ``drop``
+     - Traps packets dropped due to an error in the SCTP header parsing.
+       This would mean that port number 0 was used or that the header is
+       truncated.
+   * - ``dccp_parsing``
+     - ``drop``
+     - Traps packets dropped due to an error in the DCCP header parsing
+   * - ``gtp_parsing``
+     - ``drop``
+     - Traps packets dropped due to an error in the GTP header parsing
+   * - ``esp_parsing``
+     - ``drop``
+     - Traps packets dropped due to an error in the ESP header parsing
+   * - ``blackhole_nexthop``
+     - ``drop``
+     - Traps packets that the device decided to drop in case they hit a
+       blackhole nexthop
+   * - ``dmac_filter``
+     - ``drop``
+     - Traps incoming packets that the device decided to drop because
+       the destination MAC is not configured in the MAC table and
+       the interface is not in promiscuous mode
+   * - ``eapol``
+     - ``control``
+     - Traps "Extensible Authentication Protocol over LAN" (EAPOL) packets
+       specified in IEEE 802.1X
+   * - ``locked_port``
+     - ``drop``
+     - Traps packets that the device decided to drop because they failed the
+       locked bridge port check. That is, packets that were received via a
+       locked port and whose {SMAC, VID} does not correspond to an FDB entry
+       pointing to the port
 
 Driver-specific Packet Traps
 ============================
@@ -415,8 +505,9 @@ help debug packet drops caused by these exceptions. The following list includes
 links to the description of driver-specific traps registered by various device
 drivers:
 
-  * :doc:`netdevsim`
-  * :doc:`mlxsw`
+  * Documentation/networking/devlink/netdevsim.rst
+  * Documentation/networking/devlink/mlxsw.rst
+  * Documentation/networking/devlink/prestera.rst
 
 .. _Generic-Packet-Trap-Groups:
 
@@ -486,6 +577,10 @@ narrow. The description of these groups must be added to the following table:
      - Contains packet traps for packets that should be locally delivered after
        routing, but do not match more specific packet traps (e.g.,
        ``ipv4_bgp``)
+   * - ``external_delivery``
+     - Contains packet traps for packets that should be routed through an
+       external interface (e.g., management interface) that does not belong to
+       the same device (e.g., switch ASIC) as the ingress interface
    * - ``ipv6``
      - Contains packet traps for various IPv6 control packets (e.g., Router
        Advertisements)
@@ -501,6 +596,12 @@ narrow. The description of these groups must be added to the following table:
    * - ``acl_trap``
      - Contains packet traps for packets that were trapped (logged) by the
        device during ACL processing
+   * - ``parser_error_drops``
+     - Contains packet traps for packets that were marked by the device during
+       parsing as erroneous
+   * - ``eapol``
+     - Contains packet traps for "Extensible Authentication Protocol over LAN"
+       (EAPOL) packets specified in IEEE 802.1X
 
 Packet Trap Policers
 ====================
diff --git a/Documentation/networking/devlink/etas_es58x.rst b/Documentation/networking/devlink/etas_es58x.rst
new file mode 100644
index 000000000000..3b857d82a44c
--- /dev/null
+++ b/Documentation/networking/devlink/etas_es58x.rst
@@ -0,0 +1,36 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+etas_es58x devlink support
+==========================
+
+This document describes the devlink features implemented by the
+``etas_es58x`` device driver.
+
+Info versions
+=============
+
+The ``etas_es58x`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+   :widths: 5 5 90
+
+   * - Name
+     - Type
+     - Description
+   * - ``fw``
+     - running
+     - Version of the firmware running on the device. Also available
+       through ``ethtool -i`` as the first member of the
+       ``firmware-version``.
+   * - ``fw.bootloader``
+     - running
+     - Version of the bootloader running on the device. Also available
+       through ``ethtool -i`` as the second member of the
+       ``firmware-version``.
+   * - ``board.rev``
+     - fixed
+     - The hardware revision of the device.
+   * - ``serial_number``
+     - fixed
+     - The USB serial number. Also available through ``lsusb -v``.
diff --git a/Documentation/networking/devlink/hns3.rst b/Documentation/networking/devlink/hns3.rst
new file mode 100644
index 000000000000..72bc1b9f3785
--- /dev/null
+++ b/Documentation/networking/devlink/hns3.rst
@@ -0,0 +1,30 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+hns3 devlink support
+====================
+
+This document describes the devlink features implemented by the ``hns3``
+device driver.
+
+The ``hns3`` driver supports reloading via ``DEVLINK_CMD_RELOAD``.
+
+Info versions
+=============
+
+The ``hns3`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+   :widths: 10 10 80
+
+   * - Name
+     - Type
+     - Description
+   * - ``fw``
+     - running
+     - Used to represent the firmware version.
+   * - ``fw.scc``
+     - running
+     - Used to represent the Soft Congestion Control (SSC) firmware version.
+       SCC is a firmware component which provides multiple RDMA congestion
+       control algorithms, including DCQCN.
diff --git a/Documentation/networking/devlink/i40e.rst b/Documentation/networking/devlink/i40e.rst
new file mode 100644
index 000000000000..d3cb5bb5197e
--- /dev/null
+++ b/Documentation/networking/devlink/i40e.rst
@@ -0,0 +1,59 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+i40e devlink support
+====================
+
+This document describes the devlink features implemented by the ``i40e``
+device driver.
+
+Info versions
+=============
+
+The ``i40e`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+    :widths: 5 5 5 90
+
+    * - Name
+      - Type
+      - Example
+      - Description
+    * - ``board.id``
+      - fixed
+      - K15190-000
+      - The Product Board Assembly (PBA) identifier of the board.
+    * - ``fw.mgmt``
+      - running
+      - 9.130
+      - 2-digit version number of the management firmware that controls the
+        PHY, link, etc.
+    * - ``fw.mgmt.api``
+      - running
+      - 1.15
+      - 2-digit version number of the API exported over the AdminQ by the
+        management firmware. Used by the driver to identify what commands
+        are supported.
+    * - ``fw.mgmt.build``
+      - running
+      - 73618
+      - Build number of the source for the management firmware.
+    * - ``fw.undi``
+      - running
+      - 1.3429.0
+      - Version of the Option ROM containing the UEFI driver. The version is
+        reported in ``major.minor.patch`` format. The major version is
+        incremented whenever a major breaking change occurs, or when the
+        minor version would overflow. The minor version is incremented for
+        non-breaking changes and reset to 1 when the major version is
+        incremented. The patch version is normally 0 but is incremented when
+        a fix is delivered as a patch against an older base Option ROM.
+    * - ``fw.psid.api``
+      - running
+      - 9.30
+      - Version defining the format of the flash contents.
+    * - ``fw.bundle_id``
+      - running
+      - 0x8000e5f3
+      - Unique identifier of the firmware image file that was loaded onto
+        the device. Also referred to as the EETRACK identifier of the NVM.
diff --git a/Documentation/networking/devlink/ice.rst b/Documentation/networking/devlink/ice.rst
index 72ea8d295724..792e9f8c846a 100644
--- a/Documentation/networking/devlink/ice.rst
+++ b/Documentation/networking/devlink/ice.rst
@@ -7,6 +7,104 @@ ice devlink support
 This document describes the devlink features implemented by the ``ice``
 device driver.
 
+Parameters
+==========
+
+.. list-table:: Generic parameters implemented
+   :widths: 5 5 90
+
+   * - Name
+     - Mode
+     - Notes
+   * - ``enable_roce``
+     - runtime
+     - mutually exclusive with ``enable_iwarp``
+   * - ``enable_iwarp``
+     - runtime
+     - mutually exclusive with ``enable_roce``
+   * - ``tx_scheduling_layers``
+     - permanent
+     - The ice hardware uses hierarchical scheduling for Tx with a fixed
+       number of layers in the scheduling tree. Each of them are decision
+       points. Root node represents a port, while all the leaves represent
+       the queues. This way of configuring the Tx scheduler allows features
+       like DCB or devlink-rate (documented below) to configure how much
+       bandwidth is given to any given queue or group of queues, enabling
+       fine-grained control because scheduling parameters can be configured
+       at any given layer of the tree.
+
+       The default 9-layer tree topology was deemed best for most workloads,
+       as it gives an optimal ratio of performance to configurability. However,
+       for some specific cases, this 9-layer topology might not be desired.
+       One example would be sending traffic to queues that are not a multiple
+       of 8. Because the maximum radix is limited to 8 in 9-layer topology,
+       the 9th queue has a different parent than the rest, and it's given
+       more bandwidth credits. This causes a problem when the system is
+       sending traffic to 9 queues:
+
+       | tx_queue_0_packets: 24163396
+       | tx_queue_1_packets: 24164623
+       | tx_queue_2_packets: 24163188
+       | tx_queue_3_packets: 24163701
+       | tx_queue_4_packets: 24163683
+       | tx_queue_5_packets: 24164668
+       | tx_queue_6_packets: 23327200
+       | tx_queue_7_packets: 24163853
+       | tx_queue_8_packets: 91101417 < Too much traffic is sent from 9th
+
+       To address this need, you can switch to a 5-layer topology, which
+       changes the maximum topology radix to 512. With this enhancement,
+       the performance characteristic is equal as all queues can be assigned
+       to the same parent in the tree. The obvious drawback of this solution
+       is a lower configuration depth of the tree.
+
+       Use the ``tx_scheduling_layer`` parameter with the devlink command
+       to change the transmit scheduler topology. To use 5-layer topology,
+       use a value of 5. For example:
+       $ devlink dev param set pci/0000:16:00.0 name tx_scheduling_layers
+       value 5 cmode permanent
+       Use a value of 9 to set it back to the default value.
+
+       You must do PCI slot powercycle for the selected topology to take effect.
+
+       To verify that value has been set:
+       $ devlink dev param show pci/0000:16:00.0 name tx_scheduling_layers
+   * - ``msix_vec_per_pf_max``
+     - driverinit
+     - Set the max MSI-X that can be used by the PF, rest can be utilized for
+       SRIOV. The range is from min value set in msix_vec_per_pf_min to
+       2k/number of ports.
+   * - ``msix_vec_per_pf_min``
+     - driverinit
+     - Set the min MSI-X that will be used by the PF. This value inform how many
+       MSI-X will be allocated statically. The range is from 2 to value set
+       in msix_vec_per_pf_max.
+
+.. list-table:: Driver specific parameters implemented
+    :widths: 5 5 90
+
+    * - Name
+      - Mode
+      - Description
+    * - ``local_forwarding``
+      - runtime
+      - Controls loopback behavior by tuning scheduler bandwidth.
+        It impacts all kinds of functions: physical, virtual and
+        subfunctions.
+        Supported values are:
+
+        ``enabled`` - loopback traffic is allowed on port
+
+        ``disabled`` - loopback traffic is not allowed on this port
+
+        ``prioritized`` - loopback traffic is prioritized on this port
+
+        Default value of ``local_forwarding`` parameter is ``enabled``.
+        ``prioritized`` provides ability to adjust loopback traffic rate to increase
+        one port capacity at cost of the another. User needs to disable
+        local forwarding on one of the ports in order have increased capacity
+        on the ``prioritized`` port.
+
 Info versions
 =============
 
@@ -23,17 +121,24 @@ The ``ice`` driver reports the following versions
       - fixed
       - K65390-000
       - The Product Board Assembly (PBA) identifier of the board.
+    * - ``cgu.id``
+      - fixed
+      - 36
+      - The Clock Generation Unit (CGU) hardware revision identifier.
     * - ``fw.mgmt``
       - running
       - 2.1.7
-      - 3-digit version number of the management firmware that controls the
-        PHY, link, etc.
+      - 3-digit version number of the management firmware running on the
+        Embedded Management Processor of the device. It controls the PHY,
+        link, access to device resources, etc. Intel documentation refers to
+        this as the EMP firmware.
     * - ``fw.mgmt.api``
       - running
-      - 1.5
-      - 2-digit version number of the API exported over the AdminQ by the
-        management firmware. Used by the driver to identify what commands
-        are supported.
+      - 1.5.1
+      - 3-digit version number (major.minor.patch) of the API exported over
+        the AdminQ by the management firmware. Used by the driver to
+        identify what commands are supported. Historical versions of the
+        kernel only displayed a 2-digit version number (major.minor).
     * - ``fw.mgmt.build``
       - running
       - 0x305d955f
@@ -69,6 +174,12 @@ The ``ice`` driver reports the following versions
       - The version of the DDP package that is active in the device. Note
         that both the name (as reported by ``fw.app.name``) and version are
         required to uniquely identify the package.
+    * - ``fw.app.bundle_id``
+      - running
+      - 0xc0000001
+      - Unique identifier for the DDP package loaded in the device. Also
+        referred to as the DDP Track ID. Can be used to uniquely identify
+        the specific DDP package.
     * - ``fw.netlist``
       - running
       - 1.1.2000-6.7.0
@@ -80,18 +191,133 @@ The ``ice`` driver reports the following versions
       - running
       - 0xee16ced7
       - The first 4 bytes of the hash of the netlist module contents.
+    * - ``fw.cgu``
+      - running
+      - 8032.16973825.6021
+      - The version of Clock Generation Unit (CGU). Format:
+        <CGU type>.<configuration version>.<firmware version>.
+
+Flash Update
+============
+
+The ``ice`` driver implements support for flash update using the
+``devlink-flash`` interface. It supports updating the device flash using a
+combined flash image that contains the ``fw.mgmt``, ``fw.undi``, and
+``fw.netlist`` components.
+
+.. list-table:: List of supported overwrite modes
+   :widths: 5 95
+
+   * - Bits
+     - Behavior
+   * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS``
+     - Do not preserve settings stored in the flash components being
+       updated. This includes overwriting the port configuration that
+       determines the number of physical functions the device will
+       initialize with.
+   * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS`` and ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS``
+     - Do not preserve either settings or identifiers. Overwrite everything
+       in the flash with the contents from the provided image, without
+       performing any preservation. This includes overwriting device
+       identifying fields such as the MAC address, VPD area, and device
+       serial number. It is expected that this combination be used with an
+       image customized for the specific device.
+
+The ice hardware does not support overwriting only identifiers while
+preserving settings, and thus ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS`` on its
+own will be rejected. If no overwrite mask is provided, the firmware will be
+instructed to preserve all settings and identifying fields when updating.
+
+Reload
+======
+
+The ``ice`` driver supports activating new firmware after a flash update
+using ``DEVLINK_CMD_RELOAD`` with the ``DEVLINK_RELOAD_ACTION_FW_ACTIVATE``
+action.
+
+.. code:: shell
+
+    $ devlink dev reload pci/0000:01:00.0 reload action fw_activate
+
+The new firmware is activated by issuing a device specific Embedded
+Management Processor reset which requests the device to reset and reload the
+EMP firmware image.
+
+The driver does not currently support reloading the driver via
+``DEVLINK_RELOAD_ACTION_DRIVER_REINIT``.
+
+Port split
+==========
+
+The ``ice`` driver supports port splitting only for port 0, as the FW has
+a predefined set of available port split options for the whole device.
+
+A system reboot is required for port split to be applied.
+
+The following command will select the port split option with 4 ports:
+
+.. code:: shell
+
+    $ devlink port split pci/0000:16:00.0/0 count 4
+
+The list of all available port options will be printed to dynamic debug after
+each ``split`` and ``unsplit`` command. The first option is the default.
+
+.. code:: shell
+
+    ice 0000:16:00.0: Available port split options and max port speeds (Gbps):
+    ice 0000:16:00.0: Status  Split      Quad 0          Quad 1
+    ice 0000:16:00.0:         count  L0  L1  L2  L3  L4  L5  L6  L7
+    ice 0000:16:00.0: Active  2     100   -   -   - 100   -   -   -
+    ice 0000:16:00.0:         2      50   -  50   -   -   -   -   -
+    ice 0000:16:00.0: Pending 4      25  25  25  25   -   -   -   -
+    ice 0000:16:00.0:         4      25  25   -   -  25  25   -   -
+    ice 0000:16:00.0:         8      10  10  10  10  10  10  10  10
+    ice 0000:16:00.0:         1     100   -   -   -   -   -   -   -
+
+There could be multiple FW port options with the same port split count. When
+the same port split count request is issued again, the next FW port option with
+the same port split count will be selected.
+
+``devlink port unsplit`` will select the option with a split count of 1. If
+there is no FW option available with split count 1, you will receive an error.
 
 Regions
 =======
 
-The ``ice`` driver enables access to the contents of the Non Volatile Memory
-flash chip via the ``nvm-flash`` region.
+The ``ice`` driver implements the following regions for accessing internal
+device data.
+
+.. list-table:: regions implemented
+    :widths: 15 85
 
-Users can request an immediate capture of a snapshot via the
-``DEVLINK_CMD_REGION_NEW``
+    * - Name
+      - Description
+    * - ``nvm-flash``
+      - The contents of the entire flash chip, sometimes referred to as
+        the device's Non Volatile Memory.
+    * - ``shadow-ram``
+      - The contents of the Shadow RAM, which is loaded from the beginning
+        of the flash. Although the contents are primarily from the flash,
+        this area also contains data generated during device boot which is
+        not stored in flash.
+    * - ``device-caps``
+      - The contents of the device firmware's capabilities buffer. Useful to
+        determine the current state and configuration of the device.
+
+Both the ``nvm-flash`` and ``shadow-ram`` regions can be accessed without a
+snapshot. The ``device-caps`` region requires a snapshot as the contents are
+sent by firmware and can't be split into separate reads.
+
+Users can request an immediate capture of a snapshot for all three regions
+via the ``DEVLINK_CMD_REGION_NEW`` command.
 
 .. code:: shell
 
+    $ devlink region show
+    pci/0000:01:00.0/nvm-flash: size 10485760 snapshot [] max 1
+    pci/0000:01:00.0/device-caps: size 4096 snapshot [] max 10
+
     $ devlink region new pci/0000:01:00.0/nvm-flash snapshot 1
     $ devlink region dump pci/0000:01:00.0/nvm-flash snapshot 1
 
@@ -105,3 +331,157 @@ Users can request an immediate capture of a snapshot via the
     0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
 
     $ devlink region delete pci/0000:01:00.0/nvm-flash snapshot 1
+
+    $ devlink region new pci/0000:01:00.0/device-caps snapshot 1
+    $ devlink region dump pci/0000:01:00.0/device-caps snapshot 1
+    0000000000000000 01 00 01 00 00 00 00 00 01 00 00 00 00 00 00 00
+    0000000000000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+    0000000000000020 02 00 02 01 32 03 00 00 0a 00 00 00 25 00 00 00
+    0000000000000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+    0000000000000040 04 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+    0000000000000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+    0000000000000060 05 00 01 00 03 00 00 00 00 00 00 00 00 00 00 00
+    0000000000000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+    0000000000000080 06 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+    0000000000000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+    00000000000000a0 08 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
+    00000000000000b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+    00000000000000c0 12 00 01 00 01 00 00 00 01 00 01 00 00 00 00 00
+    00000000000000d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+    00000000000000e0 13 00 01 00 00 01 00 00 00 00 00 00 00 00 00 00
+    00000000000000f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+    0000000000000100 14 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+    0000000000000110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+    0000000000000120 15 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+    0000000000000130 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+    0000000000000140 16 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+    0000000000000150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+    0000000000000160 17 00 01 00 06 00 00 00 00 00 00 00 00 00 00 00
+    0000000000000170 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+    0000000000000180 18 00 01 00 01 00 00 00 01 00 00 00 08 00 00 00
+    0000000000000190 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+    00000000000001a0 22 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+    00000000000001b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+    00000000000001c0 40 00 01 00 00 08 00 00 08 00 00 00 00 00 00 00
+    00000000000001d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+    00000000000001e0 41 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
+    00000000000001f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+    0000000000000200 42 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
+    0000000000000210 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+
+    $ devlink region delete pci/0000:01:00.0/device-caps snapshot 1
+
+Devlink Rate
+============
+
+The ``ice`` driver implements devlink-rate API. It allows for offload of
+the Hierarchical QoS to the hardware. It enables user to group Virtual
+Functions in a tree structure and assign supported parameters: tx_share,
+tx_max, tx_priority and tx_weight to each node in a tree. So effectively
+user gains an ability to control how much bandwidth is allocated for each
+VF group. This is later enforced by the HW.
+
+It is assumed that this feature is mutually exclusive with DCB performed
+in FW and ADQ, or any driver feature that would trigger changes in QoS,
+for example creation of the new traffic class. The driver will prevent DCB
+or ADQ configuration if user started making any changes to the nodes using
+devlink-rate API. To configure those features a driver reload is necessary.
+Correspondingly if ADQ or DCB will get configured the driver won't export
+hierarchy at all, or will remove the untouched hierarchy if those
+features are enabled after the hierarchy is exported, but before any
+changes are made.
+
+This feature is also dependent on switchdev being enabled in the system.
+It's required because devlink-rate requires devlink-port objects to be
+present, and those objects are only created in switchdev mode.
+
+If the driver is set to the switchdev mode, it will export internal
+hierarchy the moment VF's are created. Root of the tree is always
+represented by the node_0. This node can't be deleted by the user. Leaf
+nodes and nodes with children also can't be deleted.
+
+.. list-table:: Attributes supported
+    :widths: 15 85
+
+    * - Name
+      - Description
+    * - ``tx_max``
+      - maximum bandwidth to be consumed by the tree Node. Rate Limit is
+        an absolute number specifying a maximum amount of bytes a Node may
+        consume during the course of one second. Rate limit guarantees
+        that a link will not oversaturate the receiver on the remote end
+        and also enforces an SLA between the subscriber and network
+        provider.
+    * - ``tx_share``
+      - minimum bandwidth allocated to a tree node when it is not blocked.
+        It specifies an absolute BW. While tx_max defines the maximum
+        bandwidth the node may consume, the tx_share marks committed BW
+        for the Node.
+    * - ``tx_priority``
+      - allows for usage of strict priority arbiter among siblings. This
+        arbitration scheme attempts to schedule nodes based on their
+        priority as long as the nodes remain within their bandwidth limit.
+        Range 0-7. Nodes with priority 7 have the highest priority and are
+        selected first, while nodes with priority 0 have the lowest
+        priority. Nodes that have the same priority are treated equally.
+    * - ``tx_weight``
+      - allows for usage of Weighted Fair Queuing arbitration scheme among
+        siblings. This arbitration scheme can be used simultaneously with
+        the strict priority. Range 1-200. Only relative values matter for
+        arbitration.
+
+``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
+nodes with the same priority form a WFQ subgroup in the sibling group
+and arbitration among them is based on assigned weights.
+
+.. code:: shell
+
+    # enable switchdev
+    $ devlink dev eswitch set pci/0000:4b:00.0 mode switchdev
+
+    # at this point driver should export internal hierarchy
+    $ echo 2 > /sys/class/net/ens785np0/device/sriov_numvfs
+
+    $ devlink port function rate show
+    pci/0000:4b:00.0/node_25: type node parent node_24
+    pci/0000:4b:00.0/node_24: type node parent node_0
+    pci/0000:4b:00.0/node_32: type node parent node_31
+    pci/0000:4b:00.0/node_31: type node parent node_30
+    pci/0000:4b:00.0/node_30: type node parent node_16
+    pci/0000:4b:00.0/node_19: type node parent node_18
+    pci/0000:4b:00.0/node_18: type node parent node_17
+    pci/0000:4b:00.0/node_17: type node parent node_16
+    pci/0000:4b:00.0/node_14: type node parent node_5
+    pci/0000:4b:00.0/node_5: type node parent node_3
+    pci/0000:4b:00.0/node_13: type node parent node_4
+    pci/0000:4b:00.0/node_12: type node parent node_4
+    pci/0000:4b:00.0/node_11: type node parent node_4
+    pci/0000:4b:00.0/node_10: type node parent node_4
+    pci/0000:4b:00.0/node_9: type node parent node_4
+    pci/0000:4b:00.0/node_8: type node parent node_4
+    pci/0000:4b:00.0/node_7: type node parent node_4
+    pci/0000:4b:00.0/node_6: type node parent node_4
+    pci/0000:4b:00.0/node_4: type node parent node_3
+    pci/0000:4b:00.0/node_3: type node parent node_16
+    pci/0000:4b:00.0/node_16: type node parent node_15
+    pci/0000:4b:00.0/node_15: type node parent node_0
+    pci/0000:4b:00.0/node_2: type node parent node_1
+    pci/0000:4b:00.0/node_1: type node parent node_0
+    pci/0000:4b:00.0/node_0: type node
+    pci/0000:4b:00.0/1: type leaf parent node_25
+    pci/0000:4b:00.0/2: type leaf parent node_25
+
+    # let's create some custom node
+    $ devlink port function rate add pci/0000:4b:00.0/node_custom parent node_0
+
+    # second custom node
+    $ devlink port function rate add pci/0000:4b:00.0/node_custom_1 parent node_custom
+
+    # reassign second VF to newly created branch
+    $ devlink port function rate set pci/0000:4b:00.0/2 parent node_custom_1
+
+    # assign tx_weight to the VF
+    $ devlink port function rate set pci/0000:4b:00.0/2 tx_weight 5
+
+    # assign tx_share to the VF
+    $ devlink port function rate set pci/0000:4b:00.0/2 tx_share 500Mbps
diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst
index 7684ae5c4a4a..8319f43b5933 100644
--- a/Documentation/networking/devlink/index.rst
+++ b/Documentation/networking/devlink/index.rst
@@ -4,6 +4,48 @@ Linux Devlink Documentation
 devlink is an API to expose device information and resources not directly
 related to any device class, such as chip-wide/switch-ASIC-wide configuration.
 
+Locking
+-------
+
+Driver facing APIs are currently transitioning to allow more explicit
+locking. Drivers can use the existing ``devlink_*`` set of APIs, or
+new APIs prefixed by ``devl_*``. The older APIs handle all the locking
+in devlink core, but don't allow registration of most sub-objects once
+the main devlink object is itself registered. The newer ``devl_*`` APIs assume
+the devlink instance lock is already held. Drivers can take the instance
+lock by calling ``devl_lock()``. It is also held all callbacks of devlink
+netlink commands.
+
+Drivers are encouraged to use the devlink instance lock for their own needs.
+
+Drivers need to be cautious when taking devlink instance lock and
+taking RTNL lock at the same time. Devlink instance lock needs to be taken
+first, only after that RTNL lock could be taken.
+
+Nested instances
+----------------
+
+Some objects, like linecards or port functions, could have another
+devlink instances created underneath. In that case, drivers should make
+sure to respect following rules:
+
+ - Lock ordering should be maintained. If driver needs to take instance
+   lock of both nested and parent instances at the same time, devlink
+   instance lock of the parent instance should be taken first, only then
+   instance lock of the nested instance could be taken.
+ - Driver should use object-specific helpers to setup the
+   nested relationship:
+
+   - ``devl_nested_devlink_set()`` - called to setup devlink -> nested
+     devlink relationship (could be user for multiple nested instances.
+   - ``devl_port_fn_devlink_set()`` - called to setup port function ->
+     nested devlink relationship.
+   - ``devlink_linecard_nested_dl_set()`` - called to setup linecard ->
+     nested devlink relationship.
+
+The nested devlink info is exposed to the userspace over object-specific
+attributes of devlink netlink.
+
 Interface documentation
 -----------------------
 
@@ -18,9 +60,14 @@ general.
    devlink-info
    devlink-flash
    devlink-params
+   devlink-port
    devlink-region
    devlink-resource
+   devlink-reload
+   devlink-selftests
    devlink-trap
+   devlink-linecard
+   devlink-eswitch-attr
 
 Driver-specific documentation
 -----------------------------
@@ -32,14 +79,22 @@ parameters, info versions, and other features it supports.
    :maxdepth: 1
 
    bnxt
+   etas_es58x
+   hns3
+   i40e
    ionic
    ice
+   ixgbe
    mlx4
    mlx5
    mlxsw
    mv88e6xxx
    netdevsim
    nfp
-   sja1105
    qed
    ti-cpsw-switch
+   am65-nuss-cpsw-switch
+   prestera
+   iosm
+   octeontx2
+   sfc
diff --git a/Documentation/networking/devlink/iosm.rst b/Documentation/networking/devlink/iosm.rst
new file mode 100644
index 000000000000..6136181339aa
--- /dev/null
+++ b/Documentation/networking/devlink/iosm.rst
@@ -0,0 +1,162 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+iosm devlink support
+====================
+
+This document describes the devlink features implemented by the ``iosm``
+device driver.
+
+Parameters
+==========
+
+The ``iosm`` driver implements the following driver-specific parameters.
+
+.. list-table:: Driver-specific parameters implemented
+   :widths: 5 5 5 85
+
+   * - Name
+     - Type
+     - Mode
+     - Description
+   * - ``erase_full_flash``
+     - u8
+     - runtime
+     - erase_full_flash parameter is used to check if full erase is required for
+       the device during firmware flashing.
+       If set, Full nand erase command will be sent to the device. By default,
+       only conditional erase support is enabled.
+
+
+Flash Update
+============
+
+The ``iosm`` driver implements support for flash update using the
+``devlink-flash`` interface.
+
+It supports updating the device flash using a combined flash image which contains
+the Bootloader images and other modem software images.
+
+The driver uses DEVLINK_SUPPORT_FLASH_UPDATE_COMPONENT to identify type of
+firmware image that need to be flashed as requested by user space application.
+Supported firmware image types.
+
+.. list-table:: Firmware Image types
+    :widths: 15 85
+
+    * - Name
+      - Description
+    * - ``PSI RAM``
+      - Primary Signed Image
+    * - ``EBL``
+      - External Bootloader
+    * - ``FLS``
+      - Modem Software Image
+
+PSI RAM and EBL are the RAM images which are injected to the device when the
+device is in BOOT ROM stage. Once this is successful, the actual modem firmware
+image is flashed to the device. The modem software image contains multiple files
+each having one secure bin file and at least one Loadmap/Region file. For flashing
+these files, appropriate commands are sent to the modem device along with the
+data required for flashing. The data like region count and address of each region
+has to be passed to the driver using the devlink param command.
+
+If the device has to be fully erased before firmware flashing, user application
+need to set the erase_full_flash parameter using devlink param command.
+By default, conditional erase feature is supported.
+
+Flash Commands:
+===============
+1) When modem is in Boot ROM stage, user can use below command to inject PSI RAM
+image using devlink flash command.
+
+$ devlink dev flash pci/0000:02:00.0 file <PSI_RAM_File_name>
+
+2) If user want to do a full erase, below command need to be issued to set the
+erase full flash param (To be set only if full erase required).
+
+$ devlink dev param set pci/0000:02:00.0 name erase_full_flash value true cmode runtime
+
+3) Inject EBL after the modem is in PSI stage.
+
+$ devlink dev flash pci/0000:02:00.0 file <EBL_File_name>
+
+4) Once EBL is injected successfully, then the actual firmware flashing takes
+place. Below is the sequence of commands used for each of the firmware images.
+
+a) Flash secure bin file.
+
+$ devlink dev flash pci/0000:02:00.0 file <Secure_bin_file_name>
+
+b) Flashing the Loadmap/Region file
+
+$ devlink dev flash pci/0000:02:00.0 file <Load_map_file_name>
+
+Regions
+=======
+
+The ``iosm`` driver supports dumping the coredump logs.
+
+In case a firmware encounters an exception, a snapshot will be taken by the
+driver. Following regions are accessed for device internal data.
+
+.. list-table:: Regions implemented
+    :widths: 15 85
+
+    * - Name
+      - Description
+    * - ``report.json``
+      - The summary of exception details logged as part of this region.
+    * - ``coredump.fcd``
+      - This region contains the details related to the exception occurred in the
+        device (RAM dump).
+    * - ``cdd.log``
+      - This region contains the logs related to the modem CDD driver.
+    * - ``eeprom.bin``
+      - This region contains the eeprom logs.
+    * - ``bootcore_trace.bin``
+      -  This region contains the current instance of bootloader logs.
+    * - ``bootcore_prev_trace.bin``
+      - This region contains the previous instance of bootloader logs.
+
+
+Region commands
+===============
+
+$ devlink region show
+
+$ devlink region new pci/0000:02:00.0/report.json
+
+$ devlink region dump pci/0000:02:00.0/report.json snapshot 0
+
+$ devlink region del pci/0000:02:00.0/report.json snapshot 0
+
+$ devlink region new pci/0000:02:00.0/coredump.fcd
+
+$ devlink region dump pci/0000:02:00.0/coredump.fcd snapshot 1
+
+$ devlink region del pci/0000:02:00.0/coredump.fcd snapshot 1
+
+$ devlink region new pci/0000:02:00.0/cdd.log
+
+$ devlink region dump pci/0000:02:00.0/cdd.log snapshot 2
+
+$ devlink region del pci/0000:02:00.0/cdd.log snapshot 2
+
+$ devlink region new pci/0000:02:00.0/eeprom.bin
+
+$ devlink region dump pci/0000:02:00.0/eeprom.bin snapshot 3
+
+$ devlink region del pci/0000:02:00.0/eeprom.bin snapshot 3
+
+$ devlink region new pci/0000:02:00.0/bootcore_trace.bin
+
+$ devlink region dump pci/0000:02:00.0/bootcore_trace.bin snapshot 4
+
+$ devlink region del pci/0000:02:00.0/bootcore_trace.bin snapshot 4
+
+$ devlink region new pci/0000:02:00.0/bootcore_prev_trace.bin
+
+$ devlink region dump pci/0000:02:00.0/bootcore_prev_trace.bin snapshot 5
+
+$ devlink region del pci/0000:02:00.0/bootcore_prev_trace.bin snapshot 5
diff --git a/Documentation/networking/devlink/ixgbe.rst b/Documentation/networking/devlink/ixgbe.rst
new file mode 100644
index 000000000000..c27d1436c70e
--- /dev/null
+++ b/Documentation/networking/devlink/ixgbe.rst
@@ -0,0 +1,171 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+ixgbe devlink support
+=====================
+
+This document describes the devlink features implemented by the ``ixgbe``
+device driver.
+
+Info versions
+=============
+
+Any of the versions dealing with the security presented by ``devlink-info``
+is purely informational. Devlink does not use a secure channel to communicate
+with the device.
+
+The ``ixgbe`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+    :widths: 5 5 5 90
+
+    * - Name
+      - Type
+      - Example
+      - Description
+    * - ``board.id``
+      - fixed
+      - H49289-000
+      - The Product Board Assembly (PBA) identifier of the board.
+    * - ``fw.undi``
+      - running
+      - 1.1937.0
+      - Version of the Option ROM containing the UEFI driver. The version is
+        reported in ``major.minor.patch`` format. The major version is
+        incremented whenever a major breaking change occurs, or when the
+        minor version would overflow. The minor version is incremented for
+        non-breaking changes and reset to 1 when the major version is
+        incremented. The patch version is normally 0 but is incremented when
+        a fix is delivered as a patch against an older base Option ROM.
+    * - ``fw.undi.srev``
+      - running
+      - 4
+      - Number indicating the security revision of the Option ROM.
+    * - ``fw.bundle_id``
+      - running
+      - 0x80000d0d
+      - Unique identifier of the firmware image file that was loaded onto
+        the device. Also referred to as the EETRACK identifier of the NVM.
+    * - ``fw.mgmt.api``
+      - running
+      - 1.5.1
+      - 3-digit version number (major.minor.patch) of the API exported over
+        the AdminQ by the management firmware. Used by the driver to
+        identify what commands are supported. Historical versions of the
+        kernel only displayed a 2-digit version number (major.minor).
+    * - ``fw.mgmt.build``
+      - running
+      - 0x305d955f
+      - Unique identifier of the source for the management firmware.
+    * - ``fw.mgmt.srev``
+      - running
+      - 3
+      - Number indicating the security revision of the firmware.
+    * - ``fw.psid.api``
+      - running
+      - 0.80
+      - Version defining the format of the flash contents.
+    * - ``fw.netlist``
+      - running
+      - 1.1.2000-6.7.0
+      - The version of the netlist module. This module defines the device's
+        Ethernet capabilities and default settings, and is used by the
+        management firmware as part of managing link and device
+        connectivity.
+    * - ``fw.netlist.build``
+      - running
+      - 0xee16ced7
+      - The first 4 bytes of the hash of the netlist module contents.
+
+Flash Update
+============
+
+The ``ixgbe`` driver implements support for flash update using the
+``devlink-flash`` interface. It supports updating the device flash using a
+combined flash image that contains the ``fw.mgmt``, ``fw.undi``, and
+``fw.netlist`` components.
+
+.. list-table:: List of supported overwrite modes
+   :widths: 5 95
+
+   * - Bits
+     - Behavior
+   * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS``
+     - Do not preserve settings stored in the flash components being
+       updated. This includes overwriting the port configuration that
+       determines the number of physical functions the device will
+       initialize with.
+   * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS`` and ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS``
+     - Do not preserve either settings or identifiers. Overwrite everything
+       in the flash with the contents from the provided image, without
+       performing any preservation. This includes overwriting device
+       identifying fields such as the MAC address, Vital product Data (VPD) area,
+       and device serial number. It is expected that this combination be used with an
+       image customized for the specific device.
+
+Reload
+======
+
+The ``ixgbe`` driver supports activating new firmware after a flash update
+using ``DEVLINK_CMD_RELOAD`` with the ``DEVLINK_RELOAD_ACTION_FW_ACTIVATE``
+action.
+
+.. code:: shell
+
+    $ devlink dev reload pci/0000:01:00.0 reload action fw_activate
+
+The new firmware is activated by issuing a device specific Embedded
+Management Processor reset which requests the device to reset and reload the
+EMP firmware image.
+
+The driver does not currently support reloading the driver via
+``DEVLINK_RELOAD_ACTION_DRIVER_REINIT``.
+
+Regions
+=======
+
+The ``ixgbe`` driver implements the following regions for accessing internal
+device data.
+
+.. list-table:: regions implemented
+    :widths: 15 85
+
+    * - Name
+      - Description
+    * - ``nvm-flash``
+      - The contents of the entire flash chip, sometimes referred to as
+        the device's Non Volatile Memory.
+    * - ``shadow-ram``
+      - The contents of the Shadow RAM, which is loaded from the beginning
+        of the flash. Although the contents are primarily from the flash,
+        this area also contains data generated during device boot which is
+        not stored in flash.
+    * - ``device-caps``
+      - The contents of the device firmware's capabilities buffer. Useful to
+        determine the current state and configuration of the device.
+
+Both the ``nvm-flash`` and ``shadow-ram`` regions can be accessed without a
+snapshot. The ``device-caps`` region requires a snapshot as the contents are
+sent by firmware and can't be split into separate reads.
+
+Users can request an immediate capture of a snapshot for all three regions
+via the ``DEVLINK_CMD_REGION_NEW`` command.
+
+.. code:: shell
+
+    $ devlink region show
+    pci/0000:01:00.0/nvm-flash: size 10485760 snapshot [] max 1
+    pci/0000:01:00.0/device-caps: size 4096 snapshot [] max 10
+
+    $ devlink region new pci/0000:01:00.0/nvm-flash snapshot 1
+
+    $ devlink region dump pci/0000:01:00.0/nvm-flash snapshot 1
+    0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
+    0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
+    0000000000000020 0016 0bb8 0016 1720 0000 0000 c00f 3ffc
+    0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5
+
+    $ devlink region read pci/0000:01:00.0/nvm-flash snapshot 1 address 0 length 16
+    0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
+
+    $ devlink region delete pci/0000:01:00.0/device-caps snapshot 1
diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst
index 4e4b97f7971a..7febe0aecd53 100644
--- a/Documentation/networking/devlink/mlx5.rst
+++ b/Documentation/networking/devlink/mlx5.rst
@@ -14,8 +14,24 @@ Parameters
 
    * - Name
      - Mode
+     - Validation
    * - ``enable_roce``
      - driverinit
+     - Type: Boolean
+
+       If the device supports RoCE disablement, RoCE enablement state controls
+       device support for RoCE capability. Otherwise, the control occurs in the
+       driver stack. When RoCE is disabled at the driver level, only raw
+       ethernet QPs are supported.
+   * - ``io_eq_size``
+     - driverinit
+     - The range is between 64 and 4096.
+   * - ``event_eq_size``
+     - driverinit
+     - The range is between 64 and 4096.
+   * - ``max_macs``
+     - driverinit
+     - The range is between 1 and 2^31. Only power of 2 values are supported.
 
 The ``mlx5`` driver also implements the following driver-specific
 parameters.
@@ -37,12 +53,69 @@ parameters.
        * ``smfs`` Software managed flow steering. In SMFS mode, the HW
          steering entities are created and manage through the driver without
          firmware intervention.
+       * ``hmfs`` Hardware managed flow steering. In HMFS mode, the driver
+         is configuring steering rules directly to the HW using Work Queues with
+         a special new type of WQE (Work Queue Element).
+
+       SMFS mode is faster and provides better rule insertion rate compared to
+       default DMFS mode.
    * - ``fdb_large_groups``
      - u32
      - driverinit
      - Control the number of large groups (size > 1) in the FDB table.
 
        * The default value is 15, and the range is between 1 and 1024.
+   * - ``esw_multiport``
+     - Boolean
+     - runtime
+     - Control MultiPort E-Switch shared fdb mode.
+
+       An experimental mode where a single E-Switch is used and all the vports
+       and physical ports on the NIC are connected to it.
+
+       An example is to send traffic from a VF that is created on PF0 to an
+       uplink that is natively associated with the uplink of PF1
+
+       Note: Future devices, ConnectX-8 and onward, will eventually have this
+       as the default to allow forwarding between all NIC ports in a single
+       E-switch environment and the dual E-switch mode will likely get
+       deprecated.
+
+       Default: disabled
+   * - ``esw_port_metadata``
+     - Boolean
+     - runtime
+     - When applicable, disabling eswitch metadata can increase packet rate up
+       to 20% depending on the use case and packet sizes.
+
+       Eswitch port metadata state controls whether to internally tag packets
+       with metadata. Metadata tagging must be enabled for multi-port RoCE,
+       failover between representors and stacked devices. By default metadata is
+       enabled on the supported devices in E-switch. Metadata is applicable only
+       for E-switch in switchdev mode and users may disable it when NONE of the
+       below use cases will be in use:
+       1. HCA is in Dual/multi-port RoCE mode.
+       2. VF/SF representor bonding (Usually used for Live migration)
+       3. Stacked devices
+
+       When metadata is disabled, the above use cases will fail to initialize if
+       users try to enable them.
+
+       Note: Setting this parameter does not take effect immediately. Setting
+       must happen in legacy mode and eswitch port metadata takes effect after
+       enabling switchdev mode.
+   * - ``hairpin_num_queues``
+     - u32
+     - driverinit
+     - We refer to a TC NIC rule that involves forwarding as "hairpin".
+       Hairpin queues are mlx5 hardware specific implementation for hardware
+       forwarding of such packets.
+
+       Control the number of hairpin queues.
+   * - ``hairpin_queue_size``
+     - u32
+     - driverinit
+     - Control the size (in packets) of the hairpin queues.
 
 The ``mlx5`` driver supports reloading via ``DEVLINK_CMD_RELOAD``
 
@@ -63,3 +136,165 @@ The ``mlx5`` driver reports the following versions
    * - ``fw.version``
      - stored, running
      - Three digit major.minor.subminor firmware version number.
+
+Health reporters
+================
+
+tx reporter
+-----------
+The tx reporter is responsible for reporting and recovering of the following three error scenarios:
+
+- tx timeout
+    Report on kernel tx timeout detection.
+    Recover by searching lost interrupts.
+- tx error completion
+    Report on error tx completion.
+    Recover by flushing the tx queue and reset it.
+- tx PTP port timestamping CQ unhealthy
+    Report too many CQEs never delivered on port ts CQ.
+    Recover by flushing and re-creating all PTP channels.
+
+tx reporter also support on demand diagnose callback, on which it provides
+real time information of its send queues status.
+
+User commands examples:
+
+- Diagnose send queues status::
+
+    $ devlink health diagnose pci/0000:82:00.0 reporter tx
+
+.. note::
+   This command has valid output only when interface is up, otherwise the command has empty output.
+
+- Show number of tx errors indicated, number of recover flows ended successfully,
+  is autorecover enabled and graceful period from last recover::
+
+    $ devlink health show pci/0000:82:00.0 reporter tx
+
+rx reporter
+-----------
+The rx reporter is responsible for reporting and recovering of the following two error scenarios:
+
+- rx queues' initialization (population) timeout
+    Population of rx queues' descriptors on ring initialization is done
+    in napi context via triggering an irq. In case of a failure to get
+    the minimum amount of descriptors, a timeout would occur, and
+    descriptors could be recovered by polling the EQ (Event Queue).
+- rx completions with errors (reported by HW on interrupt context)
+    Report on rx completion error.
+    Recover (if needed) by flushing the related queue and reset it.
+
+rx reporter also supports on demand diagnose callback, on which it
+provides real time information of its receive queues' status.
+
+- Diagnose rx queues' status and corresponding completion queue::
+
+    $ devlink health diagnose pci/0000:82:00.0 reporter rx
+
+.. note::
+   This command has valid output only when interface is up. Otherwise, the command has empty output.
+
+- Show number of rx errors indicated, number of recover flows ended successfully,
+  is autorecover enabled, and graceful period from last recover::
+
+    $ devlink health show pci/0000:82:00.0 reporter rx
+
+fw reporter
+-----------
+The fw reporter implements `diagnose` and `dump` callbacks.
+It follows symptoms of fw error such as fw syndrome by triggering
+fw core dump and storing it into the dump buffer.
+The fw reporter diagnose command can be triggered any time by the user to check
+current fw status.
+
+User commands examples:
+
+- Check fw heath status::
+
+    $ devlink health diagnose pci/0000:82:00.0 reporter fw
+
+- Read FW core dump if already stored or trigger new one::
+
+    $ devlink health dump show pci/0000:82:00.0 reporter fw
+
+.. note::
+   This command can run only on the PF which has fw tracer ownership,
+   running it on other PF or any VF will return "Operation not permitted".
+
+fw fatal reporter
+-----------------
+The fw fatal reporter implements `dump` and `recover` callbacks.
+It follows fatal errors indications by CR-space dump and recover flow.
+The CR-space dump uses vsc interface which is valid even if the FW command
+interface is not functional, which is the case in most FW fatal errors.
+The recover function runs recover flow which reloads the driver and triggers fw
+reset if needed.
+On firmware error, the health buffer is dumped into the dmesg. The log
+level is derived from the error's severity (given in health buffer).
+
+User commands examples:
+
+- Run fw recover flow manually::
+
+    $ devlink health recover pci/0000:82:00.0 reporter fw_fatal
+
+- Read FW CR-space dump if already stored or trigger new one::
+
+    $ devlink health dump show pci/0000:82:00.1 reporter fw_fatal
+
+.. note::
+   This command can run only on PF.
+
+vnic reporter
+-------------
+The vnic reporter implements only the `diagnose` callback.
+It is responsible for querying the vnic diagnostic counters from fw and displaying
+them in realtime.
+
+Description of the vnic counters:
+
+- total_error_queues
+        number of queues in an error state due to
+        an async error or errored command.
+- send_queue_priority_update_flow
+        number of QP/SQ priority/SL update events.
+- cq_overrun
+        number of times CQ entered an error state due to an overflow.
+- async_eq_overrun
+        number of times an EQ mapped to async events was overrun.
+- comp_eq_overrun
+        number of times an EQ mapped to completion events was
+        overrun.
+- quota_exceeded_command
+        number of commands issued and failed due to quota exceeded.
+- invalid_command
+        number of commands issued and failed dues to any reason other than quota
+        exceeded.
+- nic_receive_steering_discard
+        number of packets that completed RX flow
+        steering but were discarded due to a mismatch in flow table.
+- generated_pkt_steering_fail
+	number of packets generated by the VNIC experiencing unexpected steering
+	failure (at any point in steering flow).
+- handled_pkt_steering_fail
+	number of packets handled by the VNIC experiencing unexpected steering
+	failure (at any point in steering flow owned by the VNIC, including the FDB
+	for the eswitch owner).
+- icm_consumption
+        amount of Interconnect Host Memory (ICM) consumed by the vnic in
+        granularity of 4KB. ICM is host memory allocated by SW upon HCA request
+        and is used for storing data structures that control HCA operation.
+
+User commands examples:
+
+- Diagnose PF/VF vnic counters::
+
+        $ devlink health diagnose pci/0000:82:00.1 reporter vnic
+
+- Diagnose representor vnic counters (performed by supplying devlink port of the
+  representor, which can be obtained via devlink port command)::
+
+        $ devlink health diagnose pci/0000:82:00.1/65537 reporter vnic
+
+.. note::
+   This command can run over all interfaces such as PF/VF and representor ports.
diff --git a/Documentation/networking/devlink/mlxsw.rst b/Documentation/networking/devlink/mlxsw.rst
index cf857cb4ba8f..433962225bd4 100644
--- a/Documentation/networking/devlink/mlxsw.rst
+++ b/Documentation/networking/devlink/mlxsw.rst
@@ -58,6 +58,30 @@ The ``mlxsw`` driver reports the following versions
      - running
      - Three digit firmware version
 
+Line card auxiliary device info versions
+========================================
+
+The ``mlxsw`` driver reports the following versions for line card auxiliary device
+
+.. list-table:: devlink info versions implemented
+   :widths: 5 5 90
+
+   * - Name
+     - Type
+     - Description
+   * - ``hw.revision``
+     - fixed
+     - The hardware revision for this line card
+   * - ``ini.version``
+     - running
+     - Version of line card INI loaded
+   * - ``fw.psid``
+     - fixed
+     - Line card device PSID
+   * - ``fw.version``
+     - running
+     - Three digit firmware version of line card device
+
 Driver-specific Traps
 =====================
 
diff --git a/Documentation/networking/devlink/netdevsim.rst b/Documentation/networking/devlink/netdevsim.rst
index 2a266b7e7b38..88482725422c 100644
--- a/Documentation/networking/devlink/netdevsim.rst
+++ b/Documentation/networking/devlink/netdevsim.rst
@@ -46,7 +46,7 @@ Resources
 =========
 
 The ``netdevsim`` driver exposes resources to control the number of FIB
-entries and FIB rule entries that the driver will allow.
+entries, FIB rule entries and nexthops that the driver will allow.
 
 .. code:: shell
 
@@ -54,8 +54,35 @@ entries and FIB rule entries that the driver will allow.
     $ devlink resource set netdevsim/netdevsim0 path /IPv4/fib-rules size 16
     $ devlink resource set netdevsim/netdevsim0 path /IPv6/fib size 64
     $ devlink resource set netdevsim/netdevsim0 path /IPv6/fib-rules size 16
+    $ devlink resource set netdevsim/netdevsim0 path /nexthops size 16
     $ devlink dev reload netdevsim/netdevsim0
 
+Rate objects
+============
+
+The ``netdevsim`` driver supports rate objects management, which includes:
+
+- registerging/unregistering leaf rate objects per VF devlink port;
+- creation/deletion node rate objects;
+- setting tx_share and tx_max rate values for any rate object type;
+- setting parent node for any rate object type.
+
+Rate nodes and their parameters are exposed in ``netdevsim`` debugfs in RO mode.
+For example created rate node with name ``some_group``:
+
+.. code:: shell
+
+    $ ls /sys/kernel/debug/netdevsim/netdevsim0/rate_groups/some_group
+    rate_parent  tx_max  tx_share
+
+Same parameters are exposed for leaf objects in corresponding ports directories.
+For ex.:
+
+.. code:: shell
+
+    $ ls /sys/kernel/debug/netdevsim/netdevsim0/ports/1
+    dev  ethtool  rate_parent  tx_max  tx_share
+
 Driver-specific Traps
 =====================
 
@@ -68,5 +95,5 @@ Driver-specific Traps
    * - ``fid_miss``
      - ``exception``
      - When a packet enters the device it is classified to a filtering
-       indentifier (FID) based on the ingress port and VLAN. This trap is used
+       identifier (FID) based on the ingress port and VLAN. This trap is used
        to trap packets for which a FID could not be found
diff --git a/Documentation/networking/devlink/nfp.rst b/Documentation/networking/devlink/nfp.rst
index a1717db0dfcc..3093642bdae4 100644
--- a/Documentation/networking/devlink/nfp.rst
+++ b/Documentation/networking/devlink/nfp.rst
@@ -32,7 +32,7 @@ The ``nfp`` driver reports the following versions
      - Description
    * - ``board.id``
      - fixed
-     - Part number identifying the board design
+     - Identifier of the board design
    * - ``board.rev``
      - fixed
      - Revision of the board design
@@ -42,6 +42,9 @@ The ``nfp`` driver reports the following versions
    * - ``board.model``
      - fixed
      - Model name of the board design
+   * - ``board.part_number``
+     - fixed
+     - Part number of the board and its components
    * - ``fw.bundle_id``
      - stored, running
      - Firmware bundle id
diff --git a/Documentation/networking/devlink/octeontx2.rst b/Documentation/networking/devlink/octeontx2.rst
new file mode 100644
index 000000000000..84206537aedb
--- /dev/null
+++ b/Documentation/networking/devlink/octeontx2.rst
@@ -0,0 +1,79 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+octeontx2 devlink support
+=========================
+
+This document describes the devlink features implemented by the ``octeontx2 AF, PF and VF``
+device drivers.
+
+Parameters
+==========
+
+The ``octeontx2 PF and VF`` drivers implement the following driver-specific parameters.
+
+.. list-table:: Driver-specific parameters implemented
+   :widths: 5 5 5 85
+
+   * - Name
+     - Type
+     - Mode
+     - Description
+   * - ``mcam_count``
+     - u16
+     - runtime
+     - Select number of match CAM entries to be allocated for an interface.
+       The same is used for ntuple filters of the interface. Supported by
+       PF and VF drivers.
+
+The ``octeontx2 AF`` driver implements the following driver-specific parameters.
+
+.. list-table:: Driver-specific parameters implemented
+   :widths: 5 5 5 85
+
+   * - Name
+     - Type
+     - Mode
+     - Description
+   * - ``dwrr_mtu``
+     - u32
+     - runtime
+     - Use to set the quantum which hardware uses for scheduling among transmit queues.
+       Hardware uses weighted DWRR algorithm to schedule among all transmit queues.
+   * - ``npc_mcam_high_zone_percent``
+     - u8
+     - runtime
+     - Use to set the number of high priority zone entries in NPC MCAM that can be allocated
+       by a user, out of the three priority zone categories high, mid and low.
+   * - ``npc_def_rule_cntr``
+     - bool
+     - runtime
+     - Use to enable or disable hit counters for the default rules in NPC MCAM.
+       Its not guaranteed that counters gets enabled and mapped to all the default rules,
+       since the counters are scarce and driver follows a best effort approach.
+       The default rule serves as the primary packet steering rule for a specific PF or VF,
+       based on its DMAC address which is installed by AF driver as part of its initialization.
+       Sample command to read hit counters for default rule from debugfs is as follows,
+       cat /sys/kernel/debug/cn10k/npc/mcam_rules
+   * - ``nix_maxlf``
+     - u16
+     - runtime
+     - Use to set the maximum number of LFs in NIX hardware block. This would be useful
+       to increase the availability of default resources allocated to enabled LFs like
+       MCAM entries for example.
+
+The ``octeontx2 PF`` driver implements the following driver-specific parameters.
+
+.. list-table:: Driver-specific parameters implemented
+   :widths: 5 5 5 85
+
+   * - Name
+     - Type
+     - Mode
+     - Description
+   * - ``unicast_filter_count``
+     - u8
+     - runtime
+     - Set the maximum number of unicast filters that can be programmed for
+       the device. This can be used to achieve better device resource
+       utilization, avoiding over consumption of unused MCAM table entries.
diff --git a/Documentation/networking/devlink/prestera.rst b/Documentation/networking/devlink/prestera.rst
new file mode 100644
index 000000000000..96b1124e614b
--- /dev/null
+++ b/Documentation/networking/devlink/prestera.rst
@@ -0,0 +1,141 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================
+prestera devlink support
+========================
+
+This document describes the devlink features implemented by the ``prestera``
+device driver.
+
+Driver-specific Traps
+=====================
+
+.. list-table:: List of Driver-specific Traps Registered by ``prestera``
+   :widths: 5 5 90
+
+   * - Name
+     - Type
+     - Description
+.. list-table:: List of Driver-specific Traps Registered by ``prestera``
+   :widths: 5 5 90
+
+   * - Name
+     - Type
+     - Description
+   * - ``arp_bc``
+     - ``trap``
+     - Traps ARP broadcast packets (both requests/responses)
+   * - ``is_is``
+     - ``trap``
+     - Traps IS-IS packets
+   * - ``ospf``
+     - ``trap``
+     - Traps OSPF packets
+   * - ``ip_bc_mac``
+     - ``trap``
+     - Traps IPv4 packets with broadcast DA Mac address
+   * - ``stp``
+     - ``trap``
+     - Traps STP BPDU
+   * - ``lacp``
+     - ``trap``
+     - Traps LACP packets
+   * - ``lldp``
+     - ``trap``
+     - Traps LLDP packets
+   * - ``router_mc``
+     - ``trap``
+     - Traps multicast packets
+   * - ``vrrp``
+     - ``trap``
+     - Traps VRRP packets
+   * - ``dhcp``
+     - ``trap``
+     - Traps DHCP packets
+   * - ``mtu_error``
+     - ``trap``
+     - Traps (exception) packets that exceeded port's MTU
+   * - ``mac_to_me``
+     - ``trap``
+     -  Traps packets with switch-port's DA Mac address
+   * - ``ttl_error``
+     - ``trap``
+     - Traps (exception) IPv4 packets whose TTL exceeded
+   * - ``ipv4_options``
+     - ``trap``
+     - Traps (exception) packets due to the malformed IPV4 header options
+   * - ``ip_default_route``
+     - ``trap``
+     - Traps packets that have no specific IP interface (IP to me) and no forwarding prefix
+   * - ``local_route``
+     - ``trap``
+     - Traps packets that have been send to one of switch IP interfaces addresses
+   * - ``ipv4_icmp_redirect``
+     - ``trap``
+     - Traps (exception) IPV4 ICMP redirect packets
+   * - ``arp_response``
+     - ``trap``
+     - Traps ARP replies packets that have switch-port's DA Mac address
+   * - ``acl_code_0``
+     - ``trap``
+     - Traps packets that have ACL priority set to 0 (tc pref 0)
+   * - ``acl_code_1``
+     - ``trap``
+     - Traps packets that have ACL priority set to 1 (tc pref 1)
+   * - ``acl_code_2``
+     - ``trap``
+     - Traps packets that have ACL priority set to 2 (tc pref 2)
+   * - ``acl_code_3``
+     - ``trap``
+     - Traps packets that have ACL priority set to 3 (tc pref 3)
+   * - ``acl_code_4``
+     - ``trap``
+     - Traps packets that have ACL priority set to 4 (tc pref 4)
+   * - ``acl_code_5``
+     - ``trap``
+     - Traps packets that have ACL priority set to 5 (tc pref 5)
+   * - ``acl_code_6``
+     - ``trap``
+     - Traps packets that have ACL priority set to 6 (tc pref 6)
+   * - ``acl_code_7``
+     - ``trap``
+     - Traps packets that have ACL priority set to 7 (tc pref 7)
+   * - ``ipv4_bgp``
+     - ``trap``
+     - Traps IPv4 BGP packets
+   * - ``ssh``
+     - ``trap``
+     - Traps SSH packets
+   * - ``telnet``
+     - ``trap``
+     - Traps Telnet packets
+   * - ``icmp``
+     - ``trap``
+     - Traps ICMP packets
+   * - ``rxdma_drop``
+     - ``drop``
+     - Drops packets (RxDMA) due to the lack of ingress buffers etc.
+   * - ``port_no_vlan``
+     - ``drop``
+     - Drops packets due to faulty-configured network or due to internal bug (config issue).
+   * - ``local_port``
+     - ``drop``
+     - Drops packets whose decision (FDB entry) is to bridge packet back to the incoming port/trunk.
+   * - ``invalid_sa``
+     - ``drop``
+     - Drops packets with multicast source MAC address.
+   * - ``illegal_ip_addr``
+     - ``drop``
+     - Drops packets with illegal SIP/DIP multicast/unicast addresses.
+   * - ``illegal_ipv4_hdr``
+     - ``drop``
+     - Drops packets with illegal IPV4 header.
+   * - ``ip_uc_dip_da_mismatch``
+     - ``drop``
+     - Drops packets with destination MAC being unicast, but destination IP address being multicast.
+   * - ``ip_sip_is_zero``
+     - ``drop``
+     - Drops packets with zero (0) IPV4 source address.
+   * - ``met_red``
+     - ``drop``
+     - Drops non-conforming packets (dropped by Ingress policer, metering drop), e.g. packet rate exceeded configured bandwidth.
diff --git a/Documentation/networking/devlink/sfc.rst b/Documentation/networking/devlink/sfc.rst
new file mode 100644
index 000000000000..0398d59ea184
--- /dev/null
+++ b/Documentation/networking/devlink/sfc.rst
@@ -0,0 +1,71 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+sfc devlink support
+===================
+
+This document describes the devlink features implemented by the ``sfc``
+device driver for the ef10 and ef100 devices.
+
+Info versions
+=============
+
+The ``sfc`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+   :widths: 5 5 90
+
+   * - Name
+     - Type
+     - Description
+   * - ``fw.bundle_id``
+     - stored
+     - Version of the firmware "bundle" image that was last used to update
+       multiple components.
+   * - ``fw.mgmt.suc``
+     - running
+     - For boards where the management function is split between multiple
+       control units, this is the SUC control unit's firmware version.
+   * - ``fw.mgmt.cmc``
+     - running
+     - For boards where the management function is split between multiple
+       control units, this is the CMC control unit's firmware version.
+   * - ``fpga.rev``
+     - running
+     - FPGA design revision.
+   * - ``fpga.app``
+     - running
+     - Datapath programmable logic version.
+   * - ``fw.app``
+     - running
+     - Datapath software/microcode/firmware version.
+   * - ``coproc.boot``
+     - running
+     - SmartNIC application co-processor (APU) first stage boot loader version.
+   * - ``coproc.uboot``
+     - running
+     - SmartNIC application co-processor (APU) co-operating system loader version.
+   * - ``coproc.main``
+     - running
+     - SmartNIC application co-processor (APU) main operating system version.
+   * - ``coproc.recovery``
+     - running
+     - SmartNIC application co-processor (APU) recovery operating system version.
+   * - ``fw.exprom``
+     - running
+     - Expansion ROM version. For boards where the expansion ROM is split between
+       multiple images (e.g. PXE and UEFI), this is the specifically the PXE boot
+       ROM version.
+   * - ``fw.uefi``
+     - running
+     - UEFI driver version (No UNDI support).
+
+Flash Update
+============
+
+The ``sfc`` driver implements support for flash update using the
+``devlink-flash`` interface. It supports updating the device flash using a
+combined flash image ("bundle") that contains multiple components (on ef10,
+typically ``fw.mgmt``, ``fw.app``, ``fw.exprom`` and ``fw.uefi``).
+
+The driver does not support any overwrite mask flags.
diff --git a/Documentation/networking/devlink/sja1105.rst b/Documentation/networking/devlink/sja1105.rst
deleted file mode 100644
index e2679c274085..000000000000
--- a/Documentation/networking/devlink/sja1105.rst
+++ /dev/null
@@ -1,49 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=======================
-sja1105 devlink support
-=======================
-
-This document describes the devlink features implemented
-by the ``sja1105`` device driver.
-
-Parameters
-==========
-
-.. list-table:: Driver-specific parameters implemented
-  :widths: 5 5 5 85
-
-  * - Name
-    - Type
-    - Mode
-    - Description
-  * - ``best_effort_vlan_filtering``
-    - Boolean
-    - runtime
-    - Allow plain ETH_P_8021Q headers to be used as DSA tags.
-
-      Benefits:
-
-      - Can terminate untagged traffic over switch net
-        devices even when enslaved to a bridge with
-        vlan_filtering=1.
-      - Can terminate VLAN-tagged traffic over switch net
-        devices even when enslaved to a bridge with
-        vlan_filtering=1, with some constraints (no more than
-        7 non-pvid VLANs per user port).
-      - Can do QoS based on VLAN PCP and VLAN membership
-        admission control for autonomously forwarded frames
-        (regardless of whether they can be terminated on the
-        CPU or not).
-
-      Drawbacks:
-
-      - User cannot use VLANs in range 1024-3071. If the
-	switch receives frames with such VIDs, it will
-	misinterpret them as DSA tags.
-      - Switch uses Shared VLAN Learning (FDB lookup uses
-	only DMAC as key).
-      - When VLANs span cross-chip topologies, the total
-	number of permitted VLANs may be less than 7 per
-	port, due to a maximum number of 32 VLAN retagging
-	rules per switch.
diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst
new file mode 100644
index 000000000000..a6cd7236bfbd
--- /dev/null
+++ b/Documentation/networking/devmem.rst
@@ -0,0 +1,419 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Device Memory TCP
+=================
+
+
+Intro
+=====
+
+Device memory TCP (devmem TCP) enables receiving data directly into device
+memory (dmabuf). The feature is currently implemented for TCP sockets.
+
+
+Opportunity
+-----------
+
+A large number of data transfers have device memory as the source and/or
+destination. Accelerators drastically increased the prevalence of such
+transfers.  Some examples include:
+
+- Distributed training, where ML accelerators, such as GPUs on different hosts,
+  exchange data.
+
+- Distributed raw block storage applications transfer large amounts of data with
+  remote SSDs. Much of this data does not require host processing.
+
+Typically the Device-to-Device data transfers in the network are implemented as
+the following low-level operations: Device-to-Host copy, Host-to-Host network
+transfer, and Host-to-Device copy.
+
+The flow involving host copies is suboptimal, especially for bulk data transfers,
+and can put significant strains on system resources such as host memory
+bandwidth and PCIe bandwidth.
+
+Devmem TCP optimizes this use case by implementing socket APIs that enable
+the user to receive incoming network packets directly into device memory.
+
+Packet payloads go directly from the NIC to device memory.
+
+Packet headers go to host memory and are processed by the TCP/IP stack
+normally. The NIC must support header split to achieve this.
+
+Advantages:
+
+- Alleviate host memory bandwidth pressure, compared to existing
+  network-transfer + device-copy semantics.
+
+- Alleviate PCIe bandwidth pressure, by limiting data transfer to the lowest
+  level of the PCIe tree, compared to the traditional path which sends data
+  through the root complex.
+
+
+More Info
+---------
+
+  slides, video
+    https://netdevconf.org/0x17/sessions/talk/device-memory-tcp.html
+
+  patchset
+    [PATCH net-next v24 00/13] Device Memory TCP
+    https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/
+
+
+RX Interface
+============
+
+
+Example
+-------
+
+./tools/testing/selftests/drivers/net/hw/ncdevmem:do_server shows an example of
+setting up the RX path of this API.
+
+
+NIC Setup
+---------
+
+Header split, flow steering, & RSS are required features for devmem TCP.
+
+Header split is used to split incoming packets into a header buffer in host
+memory, and a payload buffer in device memory.
+
+Flow steering & RSS are used to ensure that only flows targeting devmem land on
+an RX queue bound to devmem.
+
+Enable header split & flow steering::
+
+	# enable header split
+	ethtool -G eth1 tcp-data-split on
+
+
+	# enable flow steering
+	ethtool -K eth1 ntuple on
+
+Configure RSS to steer all traffic away from the target RX queue (queue 15 in
+this example)::
+
+	ethtool --set-rxfh-indir eth1 equal 15
+
+
+The user must bind a dmabuf to any number of RX queues on a given NIC using
+the netlink API::
+
+	/* Bind dmabuf to NIC RX queue 15 */
+	struct netdev_queue *queues;
+	queues = malloc(sizeof(*queues) * 1);
+
+	queues[0]._present.type = 1;
+	queues[0]._present.idx = 1;
+	queues[0].type = NETDEV_RX_QUEUE_TYPE_RX;
+	queues[0].idx = 15;
+
+	*ys = ynl_sock_create(&ynl_netdev_family, &yerr);
+
+	req = netdev_bind_rx_req_alloc();
+	netdev_bind_rx_req_set_ifindex(req, 1 /* ifindex */);
+	netdev_bind_rx_req_set_dmabuf_fd(req, dmabuf_fd);
+	__netdev_bind_rx_req_set_queues(req, queues, n_queue_index);
+
+	rsp = netdev_bind_rx(*ys, req);
+
+	dmabuf_id = rsp->dmabuf_id;
+
+
+The netlink API returns a dmabuf_id: a unique ID that refers to this dmabuf
+that has been bound.
+
+The user can unbind the dmabuf from the netdevice by closing the netlink socket
+that established the binding. We do this so that the binding is automatically
+unbound even if the userspace process crashes.
+
+Note that any reasonably well-behaved dmabuf from any exporter should work with
+devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
+this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.
+
+
+Socket Setup
+------------
+
+The socket must be flow steered to the dmabuf bound RX queue::
+
+	ethtool -N eth1 flow-type tcp4 ... queue 15
+
+
+Receiving data
+--------------
+
+The user application must signal to the kernel that it is capable of receiving
+devmem data by passing the MSG_SOCK_DEVMEM flag to recvmsg::
+
+	ret = recvmsg(fd, &msg, MSG_SOCK_DEVMEM);
+
+Applications that do not specify the MSG_SOCK_DEVMEM flag will receive an EFAULT
+on devmem data.
+
+Devmem data is received directly into the dmabuf bound to the NIC in 'NIC
+Setup', and the kernel signals such to the user via the SCM_DEVMEM_* cmsgs::
+
+		for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
+			if (cm->cmsg_level != SOL_SOCKET ||
+				(cm->cmsg_type != SCM_DEVMEM_DMABUF &&
+				 cm->cmsg_type != SCM_DEVMEM_LINEAR))
+				continue;
+
+			dmabuf_cmsg = (struct dmabuf_cmsg *)CMSG_DATA(cm);
+
+			if (cm->cmsg_type == SCM_DEVMEM_DMABUF) {
+				/* Frag landed in dmabuf.
+				 *
+				 * dmabuf_cmsg->dmabuf_id is the dmabuf the
+				 * frag landed on.
+				 *
+				 * dmabuf_cmsg->frag_offset is the offset into
+				 * the dmabuf where the frag starts.
+				 *
+				 * dmabuf_cmsg->frag_size is the size of the
+				 * frag.
+				 *
+				 * dmabuf_cmsg->frag_token is a token used to
+				 * refer to this frag for later freeing.
+				 */
+
+				struct dmabuf_token token;
+				token.token_start = dmabuf_cmsg->frag_token;
+				token.token_count = 1;
+				continue;
+			}
+
+			if (cm->cmsg_type == SCM_DEVMEM_LINEAR)
+				/* Frag landed in linear buffer.
+				 *
+				 * dmabuf_cmsg->frag_size is the size of the
+				 * frag.
+				 */
+				continue;
+
+		}
+
+Applications may receive 2 cmsgs:
+
+- SCM_DEVMEM_DMABUF: this indicates the fragment landed in the dmabuf indicated
+  by dmabuf_id.
+
+- SCM_DEVMEM_LINEAR: this indicates the fragment landed in the linear buffer.
+  This typically happens when the NIC is unable to split the packet at the
+  header boundary, such that part (or all) of the payload landed in host
+  memory.
+
+Applications may receive no SO_DEVMEM_* cmsgs. That indicates non-devmem,
+regular TCP data that landed on an RX queue not bound to a dmabuf.
+
+
+Freeing frags
+-------------
+
+Frags received via SCM_DEVMEM_DMABUF are pinned by the kernel while the user
+processes the frag. The user must return the frag to the kernel via
+SO_DEVMEM_DONTNEED::
+
+	ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, &token,
+			 sizeof(token));
+
+The user must ensure the tokens are returned to the kernel in a timely manner.
+Failure to do so will exhaust the limited dmabuf that is bound to the RX queue
+and will lead to packet drops.
+
+The user must pass no more than 128 tokens, with no more than 1024 total frags
+among the token->token_count across all the tokens. If the user provides more
+than 1024 frags, the kernel will free up to 1024 frags and return early.
+
+The kernel returns the number of actual frags freed. The number of frags freed
+can be less than the tokens provided by the user in case of:
+
+(a) an internal kernel leak bug.
+(b) the user passed more than 1024 frags.
+
+TX Interface
+============
+
+
+Example
+-------
+
+./tools/testing/selftests/drivers/net/hw/ncdevmem:do_client shows an example of
+setting up the TX path of this API.
+
+
+NIC Setup
+---------
+
+The user must bind a TX dmabuf to a given NIC using the netlink API::
+
+        struct netdev_bind_tx_req *req = NULL;
+        struct netdev_bind_tx_rsp *rsp = NULL;
+        struct ynl_error yerr;
+
+        *ys = ynl_sock_create(&ynl_netdev_family, &yerr);
+
+        req = netdev_bind_tx_req_alloc();
+        netdev_bind_tx_req_set_ifindex(req, ifindex);
+        netdev_bind_tx_req_set_fd(req, dmabuf_fd);
+
+        rsp = netdev_bind_tx(*ys, req);
+
+        tx_dmabuf_id = rsp->id;
+
+
+The netlink API returns a dmabuf_id: a unique ID that refers to this dmabuf
+that has been bound.
+
+The user can unbind the dmabuf from the netdevice by closing the netlink socket
+that established the binding. We do this so that the binding is automatically
+unbound even if the userspace process crashes.
+
+Note that any reasonably well-behaved dmabuf from any exporter should work with
+devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
+this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.
+
+Socket Setup
+------------
+
+The user application must use MSG_ZEROCOPY flag when sending devmem TCP. Devmem
+cannot be copied by the kernel, so the semantics of the devmem TX are similar
+to the semantics of MSG_ZEROCOPY::
+
+	setsockopt(socket_fd, SOL_SOCKET, SO_ZEROCOPY, &opt, sizeof(opt));
+
+It is also recommended that the user binds the TX socket to the same interface
+the dma-buf has been bound to via SO_BINDTODEVICE::
+
+	setsockopt(socket_fd, SOL_SOCKET, SO_BINDTODEVICE, ifname, strlen(ifname) + 1);
+
+
+Sending data
+------------
+
+Devmem data is sent using the SCM_DEVMEM_DMABUF cmsg.
+
+The user should create a msghdr where,
+
+* iov_base is set to the offset into the dmabuf to start sending from
+* iov_len is set to the number of bytes to be sent from the dmabuf
+
+The user passes the dma-buf id to send from via the dmabuf_tx_cmsg.dmabuf_id.
+
+The example below sends 1024 bytes from offset 100 into the dmabuf, and 2048
+from offset 2000 into the dmabuf. The dmabuf to send from is tx_dmabuf_id::
+
+       char ctrl_data[CMSG_SPACE(sizeof(struct dmabuf_tx_cmsg))];
+       struct dmabuf_tx_cmsg ddmabuf;
+       struct msghdr msg = {};
+       struct cmsghdr *cmsg;
+       struct iovec iov[2];
+
+       iov[0].iov_base = (void*)100;
+       iov[0].iov_len = 1024;
+       iov[1].iov_base = (void*)2000;
+       iov[1].iov_len = 2048;
+
+       msg.msg_iov = iov;
+       msg.msg_iovlen = 2;
+
+       msg.msg_control = ctrl_data;
+       msg.msg_controllen = sizeof(ctrl_data);
+
+       cmsg = CMSG_FIRSTHDR(&msg);
+       cmsg->cmsg_level = SOL_SOCKET;
+       cmsg->cmsg_type = SCM_DEVMEM_DMABUF;
+       cmsg->cmsg_len = CMSG_LEN(sizeof(struct dmabuf_tx_cmsg));
+
+       ddmabuf.dmabuf_id = tx_dmabuf_id;
+
+       *((struct dmabuf_tx_cmsg *)CMSG_DATA(cmsg)) = ddmabuf;
+
+       sendmsg(socket_fd, &msg, MSG_ZEROCOPY);
+
+
+Reusing TX dmabufs
+------------------
+
+Similar to MSG_ZEROCOPY with regular memory, the user should not modify the
+contents of the dma-buf while a send operation is in progress. This is because
+the kernel does not keep a copy of the dmabuf contents. Instead, the kernel
+will pin and send data from the buffer available to the userspace.
+
+Just as in MSG_ZEROCOPY, the kernel notifies the userspace of send completions
+using MSG_ERRQUEUE::
+
+        int64_t tstop = gettimeofday_ms() + waittime_ms;
+        char control[CMSG_SPACE(100)] = {};
+        struct sock_extended_err *serr;
+        struct msghdr msg = {};
+        struct cmsghdr *cm;
+        int retries = 10;
+        __u32 hi, lo;
+
+        msg.msg_control = control;
+        msg.msg_controllen = sizeof(control);
+
+        while (gettimeofday_ms() < tstop) {
+                if (!do_poll(fd)) continue;
+
+                ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
+
+                for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
+                        serr = (void *)CMSG_DATA(cm);
+
+                        hi = serr->ee_data;
+                        lo = serr->ee_info;
+
+                        fprintf(stdout, "tx complete [%d,%d]\n", lo, hi);
+                }
+        }
+
+After the associated sendmsg has been completed, the dmabuf can be reused by
+the userspace.
+
+
+Implementation & Caveats
+========================
+
+Unreadable skbs
+---------------
+
+Devmem payloads are inaccessible to the kernel processing the packets. This
+results in a few quirks for payloads of devmem skbs:
+
+- Loopback is not functional. Loopback relies on copying the payload, which is
+  not possible with devmem skbs.
+
+- Software checksum calculation fails.
+
+- TCP Dump and bpf can't access devmem packet payloads.
+
+
+Testing
+=======
+
+More realistic example code can be found in the kernel source under
+``tools/testing/selftests/drivers/net/hw/ncdevmem.c``
+
+ncdevmem is a devmem TCP netcat. It works very similarly to netcat, but
+receives data directly into a udmabuf.
+
+To run ncdevmem, you need to run it on a server on the machine under test, and
+you need to run netcat on a peer to provide the TX data.
+
+ncdevmem has a validation mode as well that expects a repeating pattern of
+incoming data and validates it as such. For example, you can launch
+ncdevmem on the server by::
+
+	ncdevmem -s <server IP> -c <client IP> -f <ifname> -l -p 5201 -v 7
+
+On client side, use regular netcat to send TX data to ncdevmem process
+on the server::
+
+	yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | \
+		tr \\n \\0 | head -c 5G | nc <server IP> 5201 -p 5201
diff --git a/Documentation/networking/diagnostic/index.rst b/Documentation/networking/diagnostic/index.rst
new file mode 100644
index 000000000000..86488aa46b48
--- /dev/null
+++ b/Documentation/networking/diagnostic/index.rst
@@ -0,0 +1,17 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+Networking Diagnostics
+======================
+
+.. toctree::
+   :maxdepth: 2
+
+   twisted_pair_layer1_diagnostics.rst
+
+.. only::  subproject and html
+
+   Indices
+   =======
+
+   * :ref:`genindex`
diff --git a/Documentation/networking/diagnostic/twisted_pair_layer1_diagnostics.rst b/Documentation/networking/diagnostic/twisted_pair_layer1_diagnostics.rst
new file mode 100644
index 000000000000..079e17effadf
--- /dev/null
+++ b/Documentation/networking/diagnostic/twisted_pair_layer1_diagnostics.rst
@@ -0,0 +1,784 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Diagnostic Concept for Investigating Twisted Pair Ethernet Variants at OSI Layer 1
+==================================================================================
+
+Introduction
+------------
+
+This documentation is designed for two primary audiences:
+
+1. **Users and System Administrators**: For those dealing with real-world
+   Ethernet issues, this guide provides a practical, step-by-step
+   troubleshooting flow to help identify and resolve common problems in Twisted
+   Pair Ethernet at OSI Layer 1. If you're facing unstable links, speed drops,
+   or mysterious network issues, jump right into the step-by-step guide and
+   follow it through to find your solution.
+
+2. **Kernel Developers**: For developers working with network drivers and PHY
+   support, this documentation outlines the diagnostic process and highlights
+   areas where the Linux kernel’s diagnostic interfaces could be extended or
+   improved. By understanding the diagnostic flow, developers can better
+   prioritize future enhancements.
+
+Step-by-Step Diagnostic Guide from Linux (General Ethernet)
+-----------------------------------------------------------
+
+This diagnostic guide covers common Ethernet troubleshooting scenarios,
+focusing on **link stability and detection** across different Ethernet
+environments, including **Single-Pair Ethernet (SPE)** and **Multi-Pair
+Ethernet (MPE)**, as well as power delivery technologies like **PoDL** (Power
+over Data Line) and **PoE** (Clause 33 PSE).
+
+The guide is designed to help users diagnose physical layer (Layer 1) issues on
+systems running **Linux kernel version 6.11 or newer**, utilizing **ethtool
+version 6.10 or later** and **iproute2 version 6.4.0 or later**.
+
+In this guide, we assume that users may have **limited or no access to the link
+partner** and will focus on diagnosing issues locally.
+
+Diagnostic Scenarios
+~~~~~~~~~~~~~~~~~~~~
+
+- **Link is up and stable, but no data transfer**: If the link is stable but
+  there are issues with data transmission, refer to the **OSI Layer 2
+  Troubleshooting Guide**.
+
+- **Link is unstable**: Link resets, speed drops, or other fluctuations
+  indicate potential issues at the hardware or physical layer.
+
+- **No link detected**: The interface is up, but no link is established.
+
+Verify Interface Status
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Begin by verifying the status of the Ethernet interface to check if it is
+administratively up. Unlike `ethtool`, which provides information on the link
+and PHY status, it does not show the **administrative state** of the interface.
+To check this, you should use the `ip` command, which describes the interface
+state within the angle brackets `"<>"` in its output.
+
+For example, in the output `<NO-CARRIER,BROADCAST,MULTICAST,UP>`, the important
+keywords are:
+
+- **UP**: The interface is in the administrative "UP" state.
+- **NO-CARRIER**: The interface is administratively up, but no physical link is
+  detected.
+
+If the output shows `<BROADCAST,MULTICAST>`, this indicates the interface is in
+the administrative "DOWN" state.
+
+- **Command:** `ip link show dev <interface>`
+
+- **Expected Output:**
+
+  .. code-block:: bash
+
+     4: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 ...
+        link/ether 88:14:2b:00:96:f2 brd ff:ff:ff:ff:ff:ff
+
+- **Interpreting the Output:**
+
+  - **Administrative UP State**:
+
+    - If the output contains **"UP"**, the interface is administratively up,
+      and the system is trying to establish a physical link.
+
+    - If you also see **"NO-CARRIER"**, it means the physical link has not been
+      detected, indicating potential Layer 1 issues like a cable fault,
+      misconfiguration, or no connection at the link partner. In this case,
+      proceed to the **Inspect Link Status and PHY Configuration** section.
+
+  - **Administrative DOWN State**:
+
+    - If the output lacks **"UP"** and shows only states like
+      **"<BROADCAST,MULTICAST>"**, it means the interface is administratively
+      down. In this case, bring the interface up using the following command:
+
+      .. code-block:: bash
+
+         ip link set dev <interface> up
+
+- **Next Steps**:
+
+  - If the interface is **administratively up** but shows **NO-CARRIER**,
+    proceed to the **Inspect Link Status and PHY Configuration** section to
+    troubleshoot potential physical layer issues.
+
+  - If the interface was **administratively down** and you have brought it up,
+    ensure to **repeat this verification step** to confirm the new state of the
+    interface before proceeding
+
+  - **If the interface is up and the link is detected**:
+
+    - If the output shows **"UP"** and there is **no `NO-CARRIER`**, the
+      interface is administratively up, and the physical link has been
+      successfully established. If everything is working as expected, the Layer
+      1 diagnostics are complete, and no further action is needed.
+
+    - If the interface is up and the link is detected but **no data is being
+      transferred**, the issue is likely beyond Layer 1, and you should proceed
+      with diagnosing the higher layers of the OSI model. This may involve
+      checking Layer 2 configurations (such as VLANs or MAC address issues),
+      Layer 3 settings (like IP addresses, routing, or ARP), or Layer 4 and
+      above (firewalls, services, etc.).
+
+    - If the **link is unstable** or **frequently resetting or dropping**, this
+      may indicate a physical layer issue such as a faulty cable, interference,
+      or power delivery problems. In this case, proceed with the next step in
+      this guide.
+
+Inspect Link Status and PHY Configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Use `ethtool -I` to check the link status, PHY configuration, supported link
+modes, and additional statistics such as the **Link Down Events** counter. This
+step is essential for diagnosing Layer 1 problems such as speed mismatches,
+duplex issues, and link instability.
+
+For both **Single-Pair Ethernet (SPE)** and **Multi-Pair Ethernet (MPE)**
+devices, you will use this step to gather key details about the link. **SPE**
+links generally support a single speed and mode without autonegotiation (with
+the exception of **10BaseT1L**), while **MPE** devices typically support
+multiple link modes and autonegotiation.
+
+- **Command:** `ethtool -I <interface>`
+
+- **Example Output for SPE Interface (Non-autonegotiation)**:
+
+  .. code-block:: bash
+
+     Settings for spe4:
+         Supported ports: [ TP ]
+         Supported link modes:   100baseT1/Full
+         Supported pause frame use: No
+         Supports auto-negotiation: No
+         Supported FEC modes: Not reported
+         Advertised link modes: Not applicable
+         Advertised pause frame use: No
+         Advertised auto-negotiation: No
+         Advertised FEC modes: Not reported
+         Speed: 100Mb/s
+         Duplex: Full
+         Auto-negotiation: off
+         master-slave cfg: forced slave
+         master-slave status: slave
+         Port: Twisted Pair
+         PHYAD: 6
+         Transceiver: external
+         MDI-X: Unknown
+         Supports Wake-on: d
+         Wake-on: d
+         Link detected: yes
+         SQI: 7/7
+         Link Down Events: 2
+
+- **Example Output for MPE Interface (Autonegotiation)**:
+
+  .. code-block:: bash
+
+     Settings for eth1:
+         Supported ports: [ TP    MII ]
+         Supported link modes:   10baseT/Half 10baseT/Full
+                                 100baseT/Half 100baseT/Full
+         Supported pause frame use: Symmetric Receive-only
+         Supports auto-negotiation: Yes
+         Supported FEC modes: Not reported
+         Advertised link modes:  10baseT/Half 10baseT/Full
+                                 100baseT/Half 100baseT/Full
+         Advertised pause frame use: Symmetric Receive-only
+         Advertised auto-negotiation: Yes
+         Advertised FEC modes: Not reported
+         Link partner advertised link modes:  10baseT/Half 10baseT/Full
+                                              100baseT/Half 100baseT/Full
+         Link partner advertised pause frame use: Symmetric Receive-only
+         Link partner advertised auto-negotiation: Yes
+         Link partner advertised FEC modes: Not reported
+         Speed: 100Mb/s
+         Duplex: Full
+         Auto-negotiation: on
+         Port: Twisted Pair
+         PHYAD: 10
+         Transceiver: internal
+         MDI-X: Unknown
+         Supports Wake-on: pg
+         Wake-on: p
+         Link detected: yes
+         Link Down Events: 1
+
+- **Next Steps**:
+
+  - Record the output provided by `ethtool`, particularly noting the
+    **master-slave status**, **speed**, **duplex**, and other relevant fields.
+    This information will be useful for further analysis or troubleshooting.
+    Once the **ethtool** output has been collected and stored, move on to the
+    next diagnostic step.
+
+Check Power Delivery (PoDL or PoE)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If it is known that **PoDL** or **PoE** is **not implemented** on the system,
+or the **PSE** (Power Sourcing Equipment) is managed by proprietary user-space
+software or external tools, you can skip this step. In such cases, verify power
+delivery through alternative methods, such as checking hardware indicators
+(LEDs), using multimeters, or consulting vendor-specific software for
+monitoring power status.
+
+If **PoDL** or **PoE** is implemented and managed directly by Linux, follow
+these steps to ensure power is being delivered correctly:
+
+- **Command:** `ethtool --show-pse <interface>`
+
+- **Expected Output Examples**:
+
+  1. **PSE Not Supported**:
+
+     If no PSE is attached or the interface does not support PSE, the following
+     output is expected:
+
+     .. code-block:: bash
+
+        netlink error: No PSE is attached
+        netlink error: Operation not supported
+
+  2. **PoDL (Single-Pair Ethernet)**:
+
+     When PoDL is implemented, you might see the following attributes:
+
+     .. code-block:: bash
+
+        PSE attributes for eth1:
+        PoDL PSE Admin State: enabled
+        PoDL PSE Power Detection Status: delivering power
+
+  3. **PoE (Clause 33 PSE)**:
+
+     For standard PoE, the output may look like this:
+
+     .. code-block:: bash
+
+        PSE attributes for eth1:
+        Clause 33 PSE Admin State: enabled
+        Clause 33 PSE Power Detection Status: delivering power
+        Clause 33 PSE Available Power Limit: 18000
+
+- **Adjust Power Limit (if needed)**:
+
+  - Sometimes, the available power limit may not be sufficient for the link
+    partner. You can increase the power limit as needed.
+
+  - **Command:** `ethtool --set-pse <interface> c33-pse-avail-pw-limit <limit>`
+
+    Example:
+
+    .. code-block:: bash
+
+      ethtool --set-pse eth1 c33-pse-avail-pw-limit 18000
+      ethtool --show-pse eth1
+
+    **Expected Output** after adjusting the power limit:
+
+    .. code-block:: bash
+
+      Clause 33 PSE Available Power Limit: 18000
+
+
+- **Next Steps**:
+
+  - **PoE or PoDL Not Used**: If **PoE** or **PoDL** is not implemented or used
+    on the system, proceed to the next diagnostic step, as power delivery is
+    not relevant for this setup.
+
+  - **PoE or PoDL Controlled Externally**: If **PoE** or **PoDL** is used but
+    is not managed by the Linux kernel's **PSE-PD** framework (i.e., it is
+    controlled by proprietary user-space software or external tools), this part
+    is out of scope for this documentation. Please consult vendor-specific
+    documentation or external tools for monitoring and managing power delivery.
+
+  - **PSE Admin State Disabled**:
+
+    - If the `PSE Admin State:` is **disabled**, enable it by running one of
+      the following commands:
+
+      .. code-block:: bash
+
+         ethtool --set-pse <devname> podl-pse-admin-control enable
+
+      or, for Clause 33 PSE (PoE):
+
+         ethtool --set-pse <devname> c33-pse-admin-control enable
+
+    - After enabling the PSE Admin State, return to the start of the **Check
+      Power Delivery (PoDL or PoE)** step to recheck the power delivery status.
+
+  - **Power Not Delivered**: If the `Power Detection Status` shows something
+    other than "delivering power" (e.g., `over current`), troubleshoot the
+    **PSE**. Check for potential issues such as a short circuit in the cable,
+    insufficient power delivery, or a fault in the PSE itself.
+
+  - **Power Delivered but No Link**: If power is being delivered but no link is
+    established, proceed with further diagnostics by performing **Cable
+    Diagnostics** or reviewing the **Inspect Link Status and PHY
+    Configuration** steps to identify any underlying issues with the physical
+    link or settings.
+
+Cable Diagnostics
+~~~~~~~~~~~~~~~~~
+
+Use `ethtool` to test for physical layer issues such as cable faults. The test
+results can vary depending on the cable's condition, the technology in use, and
+the state of the link partner. The results from the cable test will help in
+diagnosing issues like open circuits, shorts, impedance mismatches, and
+noise-related problems.
+
+- **Command:** `ethtool --cable-test <interface>`
+
+The following are the typical outputs for **Single-Pair Ethernet (SPE)** and
+**Multi-Pair Ethernet (MPE)**:
+
+- **For Single-Pair Ethernet (SPE)**:
+  - **Expected Output (SPE)**:
+
+  .. code-block:: bash
+
+    Cable test completed for device eth1.
+    Pair A, fault length: 25.00m
+    Pair A code Open Circuit
+
+  This indicates an open circuit or cable fault at the reported distance, but
+  results can be influenced by the link partner's state. Refer to the
+  **"Troubleshooting Based on Cable Test Results"** section for further
+  interpretation of these results.
+
+- **For Multi-Pair Ethernet (MPE)**:
+  - **Expected Output (MPE)**:
+
+  .. code-block:: bash
+
+    Cable test completed for device eth0.
+    Pair A code OK
+    Pair B code OK
+    Pair C code Open Circuit
+
+  Here, Pair C is reported as having an open circuit, while Pairs A and B are
+  functioning correctly. However, if autonegotiation is in use on Pairs A and
+  B, the cable test may be disrupted. Refer to the **"Troubleshooting Based on
+  Cable Test Results"** section for a detailed explanation of these issues and
+  how to resolve them.
+
+For detailed descriptions of the different possible cable test results, please
+refer to the **"Troubleshooting Based on Cable Test Results"** section.
+
+Troubleshooting Based on Cable Test Results
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+After running the cable test, the results can help identify specific issues in
+the physical connection. However, it is important to note that **cable testing
+results heavily depend on the capabilities and characteristics of both the
+local hardware and the link partner**. The accuracy and reliability of the
+results can vary significantly between different hardware implementations.
+
+In some cases, this can introduce **blind spots** in the current cable testing
+implementation, where certain results may not accurately reflect the actual
+physical state of the cable. For example:
+
+- An **Open Circuit** result might not only indicate a damaged or disconnected
+  cable but also occur if the cable is properly attached to a powered-down link
+  partner.
+
+- Some PHYs may report a **Short within Pair** if the link partner is in
+  **forced slave mode**, even though there is no actual short in the cable.
+
+To help users interpret the results more effectively, it could be beneficial to
+extend the **kernel UAPI** (User API) to provide additional context or
+**possible variants** of issues based on the hardware’s characteristics. Since
+these quirks are often hardware-specific, the **kernel driver** would be an
+ideal source of such information. By providing flags or hints related to
+potential false positives for each test result, users would have a better
+understanding of what to verify and where to investigate further.
+
+Until such improvements are made, users should be aware of these limitations
+and manually verify cable issues as needed. Physical inspections may help
+resolve uncertainties related to false positive results.
+
+The results can be one of the following:
+
+- **OK**:
+
+  - The cable is functioning correctly, and no issues were detected.
+
+  - **Next Steps**: If you are still experiencing issues, it might be related
+    to higher-layer problems, such as duplex mismatches or speed negotiation,
+    which are not physical-layer issues.
+
+  - **Special Case for `BaseT1` (1000/100/10BaseT1)**: In `BaseT1` systems, an
+    "OK" result typically also means that the link is up and likely in **slave
+    mode**, since cable tests usually only pass in this mode. For some
+    **10BaseT1L** PHYs, an "OK" result may occur even if the cable is too long
+    for the PHY's configured range (for example, when the range is configured
+    for short-distance mode).
+
+- **Open Circuit**:
+
+  - An **Open Circuit** result typically indicates that the cable is damaged or
+    disconnected at the reported fault length. Consider these possibilities:
+
+    - If the link partner is in **admin down** state or powered off, you might
+      still get an "Open Circuit" result even if the cable is functional.
+
+    - **Next Steps**: Inspect the cable at the fault length for visible damage
+      or loose connections. Verify the link partner is powered on and in the
+      correct mode.
+
+- **Short within Pair**:
+
+  - A **Short within Pair** indicates an unintended connection within the same
+    pair of wires, typically caused by physical damage to the cable.
+
+    - **Next Steps**: Replace or repair the cable and check for any physical
+      damage or improperly crimped connectors.
+
+- **Short to Another Pair**:
+
+  - A **Short to Another Pair** means the wires from different pairs are
+    shorted, which could occur due to physical damage or incorrect wiring.
+
+    - **Next Steps**: Replace or repair the damaged cable. Inspect the cable for
+      incorrect terminations or pinched wiring.
+
+- **Impedance Mismatch**:
+
+  - **Impedance Mismatch** indicates a reflection caused by an impedance
+    discontinuity in the cable. This can happen when a part of the cable has
+    abnormal impedance (e.g., when different cable types are spliced together
+    or when there is a defect in the cable).
+
+    - **Next Steps**: Check the cable quality and ensure consistent impedance
+      throughout its length. Replace any sections of the cable that do not meet
+      specifications.
+
+- **Noise**:
+
+  - **Noise** means that the Time Domain Reflectometry (TDR) test could not
+    complete due to excessive noise on the cable, which can be caused by
+    interference from electromagnetic sources.
+
+    - **Next Steps**: Identify and eliminate sources of electromagnetic
+      interference (EMI) near the cable. Consider using shielded cables or
+      rerouting the cable away from noise sources.
+
+- **Resolution Not Possible**:
+
+  - **Resolution Not Possible** means that the TDR test could not detect the
+    issue due to the resolution limitations of the test or because the fault is
+    beyond the distance that the test can measure.
+
+    - **Next Steps**: Inspect the cable manually if possible, or use alternative
+      diagnostic tools that can handle greater distances or higher resolution.
+
+- **Unknown**:
+
+  - An **Unknown** result may occur when the test cannot classify the fault or
+    when a specific issue is outside the scope of the tool's detection
+    capabilities.
+
+    - **Next Steps**: Re-run the test, verify the link partner's state, and inspect
+      the cable manually if necessary.
+
+Verify Link Partner PHY Configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If the cable test passes but the link is still not functioning correctly, it’s
+essential to verify the configuration of the link partner’s PHY. Mismatches in
+speed, duplex settings, or master-slave roles can cause connection issues.
+
+Autonegotiation Mismatch
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+- If both link partners support autonegotiation, ensure that autonegotiation is
+  enabled on both sides and that all supported link modes are advertised. A
+  mismatch can lead to connectivity problems or sub optimal performance.
+
+- **Quick Fix:** Reset autonegotiation to the default settings, which will
+  advertise all default link modes:
+
+  .. code-block:: bash
+
+     ethtool -s <interface> autoneg on
+
+- **Command to check configuration:** `ethtool <interface>`
+
+- **Expected Output:** Ensure that both sides advertise compatible link modes.
+  If autonegotiation is off, verify that both link partners are configured for
+  the same speed and duplex.
+
+  The following example shows a case where the local PHY advertises fewer link
+  modes than it supports. This will reduce the number of overlapping link modes
+  with the link partner. In the worst case, there will be no common link modes,
+  and the link will not be created:
+
+  .. code-block:: bash
+
+     Settings for eth0:
+        Supported link modes:  1000baseT/Full, 100baseT/Full
+        Advertised link modes: 1000baseT/Full
+        Speed: 1000Mb/s
+        Duplex: Full
+        Auto-negotiation: on
+
+Combined Mode Mismatch (Autonegotiation on One Side, Forced on the Other)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+- One possible issue occurs when one side is using **autonegotiation** (as in
+  most modern systems), and the other side is set to a **forced link mode**
+  (e.g., older hardware with single-speed hubs). In such cases, modern PHYs
+  will attempt to detect the forced mode on the other side. If the link is
+  established, you may notice:
+
+  - **No or empty "Link partner advertised link modes"**.
+
+  - **"Link partner advertised auto-negotiation:"** will be **"no"** or not
+    present.
+
+- This type of detection does not always work reliably:
+
+  - Typically, the modern PHY will default to **Half Duplex**, even if the link
+    partner is actually configured for **Full Duplex**.
+
+  - Some PHYs may not work reliably if the link partner switches from one
+    forced mode to another. In this case, only a down/up cycle may help.
+
+- **Next Steps**: Set both sides to the same fixed speed and duplex mode to
+  avoid potential detection issues.
+
+  .. code-block:: bash
+
+     ethtool -s <interface> speed 1000 duplex full autoneg off
+
+Master/Slave Role Mismatch (BaseT1 and 1000BaseT PHYs)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+- In **BaseT1** systems (e.g., 1000BaseT1, 100BaseT1), link establishment
+  requires that one device is configured as **master** and the other as
+  **slave**. A mismatch in this master-slave configuration can prevent the link
+  from being established. However, **1000BaseT** also supports configurable
+  master/slave roles and can face similar issues.
+
+- **Role Preference in 1000BaseT**: The **1000BaseT** specification allows link
+  partners to negotiate master-slave roles or role preferences during
+  autonegotiation. Some PHYs have hardware limitations or bugs that prevent
+  them from functioning properly in certain roles. In such cases, drivers may
+  force these PHYs into a specific role (e.g., **forced master** or **forced
+  slave**) or try a weaker option by setting preferences. If both link partners
+  have the same issue and are forced into the same mode (e.g., both forced into
+  master mode), they will not be able to establish a link.
+
+- **Next Steps**: Ensure that one side is configured as **master** and the
+  other as **slave** to avoid this issue, particularly when hardware
+  limitations are involved, or try the weaker **preferred** option instead of
+  **forced**. Check for any driver-related restrictions or forced modes.
+
+- **Command to force master/slave mode**:
+
+  .. code-block:: bash
+
+     ethtool -s <interface> master-slave forced-master
+
+  or:
+
+  .. code-block:: bash
+
+     ethtool -s <interface> master-slave forced-master speed 1000 duplex full autoneg off
+
+
+- **Check the current master/slave status**:
+
+  .. code-block:: bash
+
+     ethtool <interface>
+
+  Example Output:
+
+  .. code-block:: bash
+
+     master-slave cfg: forced-master
+     master-slave status: master
+
+- **Hardware Bugs and Driver Forcing**: If a known hardware issue forces the
+  PHY into a specific mode, it’s essential to check the driver source code or
+  hardware documentation for details. Ensure that the roles are compatible
+  across both link partners, and if both PHYs are forced into the same mode,
+  adjust one side accordingly to resolve the mismatch.
+
+Monitor Link Resets and Speed Drops
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If the link is unstable, showing frequent resets or speed drops, this may
+indicate issues with the cable, PHY configuration, or environmental factors.
+While there is still no completely unified way in Linux to directly monitor
+downshift events or link speed changes via user space tools, both the Linux
+kernel logs and `ethtool` can provide valuable insights, especially if the
+driver supports reporting such events.
+
+- **Monitor Kernel Logs for Link Resets and Speed Drops**:
+
+  - The Linux kernel will print link status changes, including downshift
+    events, in the system logs. These messages typically include speed changes,
+    duplex mode, and downshifted link speed (if the driver supports it).
+
+  - **Command to monitor kernel logs in real-time:**
+
+    .. code-block:: bash
+
+      dmesg -w | grep "Link is Up\|Link is Down"
+
+  - Example Output (if a downshift occurs):
+
+    .. code-block:: bash
+
+      eth0: Link is Up - 100Mbps/Full (downshifted) - flow control rx/tx
+      eth0: Link is Down
+
+    This indicates that the link has been established but has downshifted from
+    a higher speed.
+
+  - **Note**: Not all drivers or PHYs support downshift reporting, so you may
+    not see this information for all devices.
+
+- **Monitor Link Down Events Using `ethtool`**:
+
+  - Starting with the latest kernel and `ethtool` versions, you can track
+    **Link Down Events** using the `ethtool -I` command. This will provide
+    counters for link drops, helping to diagnose link instability issues if
+    supported by the driver.
+
+  - **Command to monitor link down events:**
+
+    .. code-block:: bash
+
+      ethtool -I <interface>
+
+  - Example Output (if supported):
+
+    .. code-block:: bash
+
+      PSE attributes for eth1:
+      Link Down Events: 5
+
+    This indicates that the link has dropped 5 times. Frequent link down events
+    may indicate cable or environmental issues that require further
+    investigation.
+
+- **Check Link Status and Speed**:
+
+  - Even though downshift counts or events are not easily tracked, you can
+    still use `ethtool` to manually check the current link speed and status.
+
+  - **Command:** `ethtool <interface>`
+
+  - **Expected Output:**
+
+    .. code-block:: bash
+
+      Speed: 1000Mb/s
+      Duplex: Full
+      Auto-negotiation: on
+      Link detected: yes
+
+    Any inconsistencies in the expected speed or duplex setting could indicate
+    an issue.
+
+- **Disable Energy-Efficient Ethernet (EEE) for Diagnostics**:
+
+  - **EEE** (Energy-Efficient Ethernet) can be a source of link instability due
+    to transitions in and out of low-power states. For diagnostic purposes, it
+    may be useful to **temporarily** disable EEE to determine if it is
+    contributing to link instability. This is **not a generic recommendation**
+    for disabling power management.
+
+  - **Next Steps**: Disable EEE and monitor if the link becomes stable. If
+    disabling EEE resolves the issue, report the bug so that the driver can be
+    fixed.
+
+  - **Command:**
+
+    .. code-block:: bash
+
+      ethtool --set-eee <interface> eee off
+
+  - **Important**: If disabling EEE resolves the instability, the issue should
+    be reported to the maintainers as a bug, and the driver should be corrected
+    to handle EEE properly without causing instability. Disabling EEE
+    permanently should not be seen as a solution.
+
+- **Monitor Error Counters**:
+
+  - Use `ethtool -S <interface> --all-groups` to retrieve standardized interface
+    statistics if the driver supports the unified interface:
+
+  - **Command:** `ethtool -S <interface> --all-groups`
+
+  - **Example Output (if supported)**:
+
+    .. code-block:: bash
+
+      phydev-RxFrames: 100391
+      phydev-RxErrors: 0
+      phydev-TxFrames: 9
+      phydev-TxErrors: 0
+
+  - If the unified interface is not supported, use `ethtool -S <interface>` to
+    retrieve MAC and PHY counters. Note that non-standardized PHY counter names
+    vary by driver and must be interpreted accordingly:
+
+  - **Command:** `ethtool -S <interface>`
+
+  - **Example Output (if supported)**:
+
+    .. code-block:: bash
+
+      rx_crc_errors: 123
+      tx_errors: 45
+      rx_frame_errors: 78
+
+  - **Note**: If no meaningful error counters are available or if counters are
+    not supported, you may need to rely on physical inspections (e.g., cable
+    condition) or kernel log messages (e.g., link up/down events) to further
+    diagnose the issue.
+
+  - **Compare Counters**:
+
+    - Compare the egress and ingress frame counts reported by the PHY and MAC.
+
+    - A small difference may occur due to sampling rate differences between the
+      MAC and PHY drivers, or if the PHY and MAC are not always fully
+      synchronized in their UP or DOWN states.
+
+    - Significant discrepancies indicate potential issues in the data path
+      between the MAC and PHY.
+
+When All Else Fails...
+~~~~~~~~~~~~~~~~~~~~~~
+
+So you've checked the cables, monitored the logs, disabled EEE, and still...
+nothing? Don’t worry, you’re not alone. Sometimes, Ethernet gremlins just don’t
+want to cooperate.
+
+But before you throw in the towel (or the Ethernet cable), take a deep breath.
+It’s always possible that:
+
+1. Your PHY has a unique, undocumented personality.
+
+2. The problem is lying dormant, waiting for just the right moment to magically
+   resolve itself (hey, it happens!).
+
+3. Or, it could be that the ultimate solution simply hasn’t been invented yet.
+
+If none of the above bring you comfort, there’s one final step: contribute! If
+you've uncovered new or unusual issues, or have creative diagnostic methods,
+feel free to share your findings and extend this documentation. Together, we
+can hunt down every elusive network issue - one twisted pair at a time.
+
+Remember: sometimes the solution is just a reboot away, but if not, it’s time to
+dig deeper - or report that bug!
+
diff --git a/Documentation/networking/dns_resolver.rst b/Documentation/networking/dns_resolver.rst
index add4d59a99a5..c0364f7070af 100644
--- a/Documentation/networking/dns_resolver.rst
+++ b/Documentation/networking/dns_resolver.rst
@@ -118,7 +118,7 @@ Keys of dns_resolver type can be read from userspace using keyctl_read() or
 Mechanism
 =========
 
-The dnsresolver module registers a key type called "dns_resolver".  Keys of
+The dns_resolver module registers a key type called "dns_resolver".  Keys of
 this type are used to transport and cache DNS lookup results from userspace.
 
 When dns_query() is invoked, it calls request_key() to search the local
@@ -152,4 +152,4 @@ Debugging
 Debugging messages can be turned on dynamically by writing a 1 into the
 following file::
 
-	/sys/module/dnsresolver/parameters/debug
+	/sys/module/dns_resolver/parameters/debug
diff --git a/Documentation/networking/driver.rst b/Documentation/networking/driver.rst
index c8f59dbda46f..4f5dfa9c022e 100644
--- a/Documentation/networking/driver.rst
+++ b/Documentation/networking/driver.rst
@@ -4,94 +4,124 @@
 Softnet Driver Issues
 =====================
 
-Transmit path guidelines:
+Probing guidelines
+==================
 
-1) The ndo_start_xmit method must not return NETDEV_TX_BUSY under
-   any normal circumstances.  It is considered a hard error unless
-   there is no way your device can tell ahead of time when it's
-   transmit function will become busy.
+Address validation
+------------------
 
-   Instead it must maintain the queue properly.  For example,
-   for a driver implementing scatter-gather this means::
+Any hardware layer address you obtain for your device should
+be verified.  For example, for ethernet check it with
+linux/etherdevice.h:is_valid_ether_addr()
+
+Close/stop guidelines
+=====================
+
+Quiescence
+----------
+
+After the ndo_stop routine has been called, the hardware must
+not receive or transmit any data.  All in flight packets must
+be aborted. If necessary, poll or wait for completion of
+any reset commands.
+
+Auto-close
+----------
+
+The ndo_stop routine will be called by unregister_netdevice
+if device is still UP.
+
+Transmit path guidelines
+========================
+
+Stop queues in advance
+----------------------
+
+The ndo_start_xmit method must not return NETDEV_TX_BUSY under
+any normal circumstances.  It is considered a hard error unless
+there is no way your device can tell ahead of time when its
+transmit function will become busy.
+
+Instead it must maintain the queue properly.  For example,
+for a driver implementing scatter-gather this means:
+
+.. code-block:: c
+
+	static u32 drv_tx_avail(struct drv_ring *dr)
+	{
+		u32 used = READ_ONCE(dr->prod) - READ_ONCE(dr->cons);
+
+		return dr->tx_ring_size - (used & bp->tx_ring_mask);
+	}
 
 	static netdev_tx_t drv_hard_start_xmit(struct sk_buff *skb,
 					       struct net_device *dev)
 	{
 		struct drv *dp = netdev_priv(dev);
+		struct netdev_queue *txq;
+		struct drv_ring *dr;
+		int idx;
 
-		lock_tx(dp);
-		...
-		/* This is a hard error log it. */
-		if (TX_BUFFS_AVAIL(dp) <= (skb_shinfo(skb)->nr_frags + 1)) {
+		idx = skb_get_queue_mapping(skb);
+		dr = dp->tx_rings[idx];
+		txq = netdev_get_tx_queue(dev, idx);
+
+		//...
+		/* This should be a very rare race - log it. */
+		if (drv_tx_avail(dr) <= skb_shinfo(skb)->nr_frags + 1) {
 			netif_stop_queue(dev);
-			unlock_tx(dp);
-			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when queue awake!\n",
-			       dev->name);
+			netdev_warn(dev, "Tx Ring full when queue awake!\n");
 			return NETDEV_TX_BUSY;
 		}
 
-		... queue packet to card ...
-		... update tx consumer index ...
-
-		if (TX_BUFFS_AVAIL(dp) <= (MAX_SKB_FRAGS + 1))
-			netif_stop_queue(dev);
-
-		...
-		unlock_tx(dp);
-		...
-		return NETDEV_TX_OK;
-	}
-
-   And then at the end of your TX reclamation event handling::
+		//... queue packet to card ...
 
-	if (netif_queue_stopped(dp->dev) &&
-	    TX_BUFFS_AVAIL(dp) > (MAX_SKB_FRAGS + 1))
-		netif_wake_queue(dp->dev);
+		netdev_tx_sent_queue(txq, skb->len);
 
-   For a non-scatter-gather supporting card, the three tests simply become::
+		//... update tx producer index using WRITE_ONCE() ...
 
-		/* This is a hard error log it. */
-		if (TX_BUFFS_AVAIL(dp) <= 0)
+		if (!netif_txq_maybe_stop(txq, drv_tx_avail(dr),
+					  MAX_SKB_FRAGS + 1, 2 * MAX_SKB_FRAGS))
+			dr->stats.stopped++;
 
-   and::
+		//...
+		return NETDEV_TX_OK;
+	}
 
-		if (TX_BUFFS_AVAIL(dp) == 0)
+And then at the end of your TX reclamation event handling:
 
-   and::
+.. code-block:: c
 
-	if (netif_queue_stopped(dp->dev) &&
-	    TX_BUFFS_AVAIL(dp) > 0)
-		netif_wake_queue(dp->dev);
+	//... update tx consumer index using WRITE_ONCE() ...
 
-2) An ndo_start_xmit method must not modify the shared parts of a
-   cloned SKB.
+	netif_txq_completed_wake(txq, cmpl_pkts, cmpl_bytes,
+				 drv_tx_avail(dr), 2 * MAX_SKB_FRAGS);
 
-3) Do not forget that once you return NETDEV_TX_OK from your
-   ndo_start_xmit method, it is your driver's responsibility to free
-   up the SKB and in some finite amount of time.
+Lockless queue stop / wake helper macros
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-   For example, this means that it is not allowed for your TX
-   mitigation scheme to let TX packets "hang out" in the TX
-   ring unreclaimed forever if no new TX packets are sent.
-   This error can deadlock sockets waiting for send buffer room
-   to be freed up.
+.. kernel-doc:: include/net/netdev_queues.h
+   :doc: Lockless queue stopping / waking helpers.
 
-   If you return NETDEV_TX_BUSY from the ndo_start_xmit method, you
-   must not keep any reference to that SKB and you must not attempt
-   to free it up.
+No exclusive ownership
+----------------------
 
-Probing guidelines:
+An ndo_start_xmit method must not modify the shared parts of a
+cloned SKB.
 
-1) Any hardware layer address you obtain for your device should
-   be verified.  For example, for ethernet check it with
-   linux/etherdevice.h:is_valid_ether_addr()
+Timely completions
+------------------
 
-Close/stop guidelines:
+Do not forget that once you return NETDEV_TX_OK from your
+ndo_start_xmit method, it is your driver's responsibility to free
+up the SKB and in some finite amount of time.
 
-1) After the ndo_stop routine has been called, the hardware must
-   not receive or transmit any data.  All in flight packets must
-   be aborted. If necessary, poll or wait for completion of
-   any reset commands.
+For example, this means that it is not allowed for your TX
+mitigation scheme to let TX packets "hang out" in the TX
+ring unreclaimed forever if no new TX packets are sent.
+This error can deadlock sockets waiting for send buffer room
+to be freed up.
 
-2) The ndo_stop routine will be called by unregister_netdevice
-   if device is still UP.
+If you return NETDEV_TX_BUSY from the ndo_start_xmit method, you
+must not keep any reference to that SKB and you must not attempt
+to free it up.
diff --git a/Documentation/networking/dsa/b53.rst b/Documentation/networking/dsa/b53.rst
index b41637cdb82b..1cb3ff648f88 100644
--- a/Documentation/networking/dsa/b53.rst
+++ b/Documentation/networking/dsa/b53.rst
@@ -52,7 +52,7 @@ VLAN programming would basically change the CPU port's default PVID and make
 it untagged, undesirable.
 
 In difference to the configuration described in :ref:`dsa-vlan-configuration`
-the default VLAN 1 has to be removed from the slave interface configuration in
+the default VLAN 1 has to be removed from the user interface configuration in
 single port and gateway configuration, while there is no need to add an extra
 VLAN configuration in the bridge showcase.
 
@@ -68,13 +68,13 @@ By default packages are tagged with vid 1:
   ip link add link eth0 name eth0.2 type vlan id 2
   ip link add link eth0 name eth0.3 type vlan id 3
 
-  # The master interface needs to be brought up before the slave ports.
+  # The conduit interface needs to be brought up before the user ports.
   ip link set eth0 up
   ip link set eth0.1 up
   ip link set eth0.2 up
   ip link set eth0.3 up
 
-  # bring up the slave interfaces
+  # bring up the user interfaces
   ip link set wan up
   ip link set lan1 up
   ip link set lan2 up
@@ -113,11 +113,11 @@ bridge
   # tag traffic on CPU port
   ip link add link eth0 name eth0.1 type vlan id 1
 
-  # The master interface needs to be brought up before the slave ports.
+  # The conduit interface needs to be brought up before the user ports.
   ip link set eth0 up
   ip link set eth0.1 up
 
-  # bring up the slave interfaces
+  # bring up the user interfaces
   ip link set wan up
   ip link set lan1 up
   ip link set lan2 up
@@ -149,12 +149,12 @@ gateway
   ip link add link eth0 name eth0.1 type vlan id 1
   ip link add link eth0 name eth0.2 type vlan id 2
 
-  # The master interface needs to be brought up before the slave ports.
+  # The conduit interface needs to be brought up before the user ports.
   ip link set eth0 up
   ip link set eth0.1 up
   ip link set eth0.2 up
 
-  # bring up the slave interfaces
+  # bring up the user interfaces
   ip link set wan up
   ip link set lan1 up
   ip link set lan2 up
diff --git a/Documentation/networking/dsa/bcm_sf2.rst b/Documentation/networking/dsa/bcm_sf2.rst
index dee234039e1e..d2571435696f 100644
--- a/Documentation/networking/dsa/bcm_sf2.rst
+++ b/Documentation/networking/dsa/bcm_sf2.rst
@@ -67,7 +67,7 @@ MDIO indirect accesses
 ----------------------
 
 Due to a limitation in how Broadcom switches have been designed, external
-Broadcom switches connected to a SF2 require the use of the DSA slave MDIO bus
+Broadcom switches connected to a SF2 require the use of the DSA user MDIO bus
 in order to properly configure them. By default, the SF2 pseudo-PHY address, and
 an external switch pseudo-PHY address will both be snooping for incoming MDIO
 transactions, since they are at the same address (30), resulting in some kind of
diff --git a/Documentation/networking/dsa/configuration.rst b/Documentation/networking/dsa/configuration.rst
index af029b3ca2ab..6cc4ded3cc23 100644
--- a/Documentation/networking/dsa/configuration.rst
+++ b/Documentation/networking/dsa/configuration.rst
@@ -5,7 +5,7 @@ DSA switch configuration from userspace
 =======================================
 
 The DSA switch configuration is not integrated into the main userspace
-network configuration suites by now and has to be performed manualy.
+network configuration suites by now and has to be performed manually.
 
 .. _dsa-config-showcases:
 
@@ -31,28 +31,38 @@ at https://www.kernel.org/pub/linux/utils/net/iproute2/
 
 Through DSA every port of a switch is handled like a normal linux Ethernet
 interface. The CPU port is the switch port connected to an Ethernet MAC chip.
-The corresponding linux Ethernet interface is called the master interface.
-All other corresponding linux interfaces are called slave interfaces.
+The corresponding linux Ethernet interface is called the conduit interface.
+All other corresponding linux interfaces are called user interfaces.
 
-The slave interfaces depend on the master interface. They can only brought up,
-when the master interface is up.
+The user interfaces depend on the conduit interface being up in order for them
+to send or receive traffic. Prior to kernel v5.12, the state of the conduit
+interface had to be managed explicitly by the user. Starting with kernel v5.12,
+the behavior is as follows:
+
+- when a DSA user interface is brought up, the conduit interface is
+  automatically brought up.
+- when the conduit interface is brought down, all DSA user interfaces are
+  automatically brought down.
 
 In this documentation the following Ethernet interfaces are used:
 
 *eth0*
-  the master interface
+  the conduit interface
+
+*eth1*
+  another conduit interface
 
 *lan1*
-  a slave interface
+  a user interface
 
 *lan2*
-  another slave interface
+  another user interface
 
 *lan3*
-  a third slave interface
+  a third user interface
 
 *wan*
-  A slave interface dedicated for upstream traffic
+  A user interface dedicated for upstream traffic
 
 Further Ethernet interfaces can be configured similar.
 The configured IPs and networks are:
@@ -78,79 +88,76 @@ The tagging based configuration is desired and supported by the majority of
 DSA switches. These switches are capable to tag incoming and outgoing traffic
 without using a VLAN based configuration.
 
-single port
-~~~~~~~~~~~
-
-.. code-block:: sh
-
-  # configure each interface
-  ip addr add 192.0.2.1/30 dev lan1
-  ip addr add 192.0.2.5/30 dev lan2
-  ip addr add 192.0.2.9/30 dev lan3
-
-  # The master interface needs to be brought up before the slave ports.
-  ip link set eth0 up
+*single port*
+  .. code-block:: sh
 
-  # bring up the slave interfaces
-  ip link set lan1 up
-  ip link set lan2 up
-  ip link set lan3 up
+    # configure each interface
+    ip addr add 192.0.2.1/30 dev lan1
+    ip addr add 192.0.2.5/30 dev lan2
+    ip addr add 192.0.2.9/30 dev lan3
 
-bridge
-~~~~~~
+    # For kernels earlier than v5.12, the conduit interface needs to be
+    # brought up manually before the user ports.
+    ip link set eth0 up
 
-.. code-block:: sh
+    # bring up the user interfaces
+    ip link set lan1 up
+    ip link set lan2 up
+    ip link set lan3 up
 
-  # The master interface needs to be brought up before the slave ports.
-  ip link set eth0 up
+*bridge*
+  .. code-block:: sh
 
-  # bring up the slave interfaces
-  ip link set lan1 up
-  ip link set lan2 up
-  ip link set lan3 up
+    # For kernels earlier than v5.12, the conduit interface needs to be
+    # brought up manually before the user ports.
+    ip link set eth0 up
 
-  # create bridge
-  ip link add name br0 type bridge
+    # bring up the user interfaces
+    ip link set lan1 up
+    ip link set lan2 up
+    ip link set lan3 up
 
-  # add ports to bridge
-  ip link set dev lan1 master br0
-  ip link set dev lan2 master br0
-  ip link set dev lan3 master br0
+    # create bridge
+    ip link add name br0 type bridge
 
-  # configure the bridge
-  ip addr add 192.0.2.129/25 dev br0
+    # add ports to bridge
+    ip link set dev lan1 master br0
+    ip link set dev lan2 master br0
+    ip link set dev lan3 master br0
 
-  # bring up the bridge
-  ip link set dev br0 up
+    # configure the bridge
+    ip addr add 192.0.2.129/25 dev br0
 
-gateway
-~~~~~~~
+    # bring up the bridge
+    ip link set dev br0 up
 
-.. code-block:: sh
+*gateway*
+  .. code-block:: sh
 
-  # The master interface needs to be brought up before the slave ports.
-  ip link set eth0 up
+    # For kernels earlier than v5.12, the conduit interface needs to be
+    # brought up manually before the user ports.
+    ip link set eth0 up
 
-  # bring up the slave interfaces
-  ip link set wan up
-  ip link set lan1 up
-  ip link set lan2 up
+    # bring up the user interfaces
+    ip link set wan up
+    ip link set lan1 up
+    ip link set lan2 up
 
-  # configure the upstream port
-  ip addr add 192.0.2.1/30 dev wan
+    # configure the upstream port
+    ip addr add 192.0.2.1/30 dev wan
 
-  # create bridge
-  ip link add name br0 type bridge
+    # create bridge
+    ip link add name br0 type bridge
 
-  # add ports to bridge
-  ip link set dev lan1 master br0
-  ip link set dev lan2 master br0
+    # add ports to bridge
+    ip link set dev lan1 master br0
+    ip link set dev lan2 master br0
 
-  # configure the bridge
-  ip addr add 192.0.2.129/25 dev br0
+    # configure the bridge
+    ip addr add 192.0.2.129/25 dev br0
 
-  # bring up the bridge
-  ip link set dev br0 up
+    # bring up the bridge
+    ip link set dev br0 up
 
 .. _dsa-vlan-configuration:
 
@@ -161,132 +168,291 @@ A minority of switches are not capable to use a taging protocol
 (DSA_TAG_PROTO_NONE). These switches can be configured by a VLAN based
 configuration.
 
-single port
-~~~~~~~~~~~
-The configuration can only be set up via VLAN tagging and bridge setup.
-
-.. code-block:: sh
-
-  # tag traffic on CPU port
-  ip link add link eth0 name eth0.1 type vlan id 1
-  ip link add link eth0 name eth0.2 type vlan id 2
-  ip link add link eth0 name eth0.3 type vlan id 3
-
-  # The master interface needs to be brought up before the slave ports.
-  ip link set eth0 up
-  ip link set eth0.1 up
-  ip link set eth0.2 up
-  ip link set eth0.3 up
-
-  # bring up the slave interfaces
-  ip link set lan1 up
-  ip link set lan1 up
-  ip link set lan3 up
-
-  # create bridge
-  ip link add name br0 type bridge
-
-  # activate VLAN filtering
-  ip link set dev br0 type bridge vlan_filtering 1
-
-  # add ports to bridges
-  ip link set dev lan1 master br0
-  ip link set dev lan2 master br0
-  ip link set dev lan3 master br0
-
-  # tag traffic on ports
-  bridge vlan add dev lan1 vid 1 pvid untagged
-  bridge vlan add dev lan2 vid 2 pvid untagged
-  bridge vlan add dev lan3 vid 3 pvid untagged
-
-  # configure the VLANs
-  ip addr add 192.0.2.1/30 dev eth0.1
-  ip addr add 192.0.2.5/30 dev eth0.2
-  ip addr add 192.0.2.9/30 dev eth0.3
-
-  # bring up the bridge devices
-  ip link set br0 up
-
+*single port*
+  The configuration can only be set up via VLAN tagging and bridge setup.
 
-bridge
-~~~~~~
+  .. code-block:: sh
 
-.. code-block:: sh
+    # tag traffic on CPU port
+    ip link add link eth0 name eth0.1 type vlan id 1
+    ip link add link eth0 name eth0.2 type vlan id 2
+    ip link add link eth0 name eth0.3 type vlan id 3
 
-  # tag traffic on CPU port
-  ip link add link eth0 name eth0.1 type vlan id 1
+    # For kernels earlier than v5.12, the conduit interface needs to be
+    # brought up manually before the user ports.
+    ip link set eth0 up
+    ip link set eth0.1 up
+    ip link set eth0.2 up
+    ip link set eth0.3 up
 
-  # The master interface needs to be brought up before the slave ports.
-  ip link set eth0 up
-  ip link set eth0.1 up
+    # bring up the user interfaces
+    ip link set lan1 up
+    ip link set lan2 up
+    ip link set lan3 up
 
-  # bring up the slave interfaces
-  ip link set lan1 up
-  ip link set lan2 up
-  ip link set lan3 up
+    # create bridge
+    ip link add name br0 type bridge
 
-  # create bridge
-  ip link add name br0 type bridge
+    # activate VLAN filtering
+    ip link set dev br0 type bridge vlan_filtering 1
 
-  # activate VLAN filtering
-  ip link set dev br0 type bridge vlan_filtering 1
+    # add ports to bridges
+    ip link set dev lan1 master br0
+    ip link set dev lan2 master br0
+    ip link set dev lan3 master br0
 
-  # add ports to bridge
-  ip link set dev lan1 master br0
-  ip link set dev lan2 master br0
-  ip link set dev lan3 master br0
-  ip link set eth0.1 master br0
+    # tag traffic on ports
+    bridge vlan add dev lan1 vid 1 pvid untagged
+    bridge vlan add dev lan2 vid 2 pvid untagged
+    bridge vlan add dev lan3 vid 3 pvid untagged
 
-  # tag traffic on ports
-  bridge vlan add dev lan1 vid 1 pvid untagged
-  bridge vlan add dev lan2 vid 1 pvid untagged
-  bridge vlan add dev lan3 vid 1 pvid untagged
+    # configure the VLANs
+    ip addr add 192.0.2.1/30 dev eth0.1
+    ip addr add 192.0.2.5/30 dev eth0.2
+    ip addr add 192.0.2.9/30 dev eth0.3
 
-  # configure the bridge
-  ip addr add 192.0.2.129/25 dev br0
+    # bring up the bridge devices
+    ip link set br0 up
 
-  # bring up the bridge
-  ip link set dev br0 up
 
-gateway
-~~~~~~~
+*bridge*
+  .. code-block:: sh
 
-.. code-block:: sh
+    # tag traffic on CPU port
+    ip link add link eth0 name eth0.1 type vlan id 1
 
-  # tag traffic on CPU port
-  ip link add link eth0 name eth0.1 type vlan id 1
-  ip link add link eth0 name eth0.2 type vlan id 2
+    # For kernels earlier than v5.12, the conduit interface needs to be
+    # brought up manually before the user ports.
+    ip link set eth0 up
+    ip link set eth0.1 up
 
-  # The master interface needs to be brought up before the slave ports.
-  ip link set eth0 up
-  ip link set eth0.1 up
-  ip link set eth0.2 up
+    # bring up the user interfaces
+    ip link set lan1 up
+    ip link set lan2 up
+    ip link set lan3 up
 
-  # bring up the slave interfaces
-  ip link set wan up
-  ip link set lan1 up
-  ip link set lan2 up
+    # create bridge
+    ip link add name br0 type bridge
 
-  # create bridge
-  ip link add name br0 type bridge
+    # activate VLAN filtering
+    ip link set dev br0 type bridge vlan_filtering 1
 
-  # activate VLAN filtering
-  ip link set dev br0 type bridge vlan_filtering 1
+    # add ports to bridge
+    ip link set dev lan1 master br0
+    ip link set dev lan2 master br0
+    ip link set dev lan3 master br0
+    ip link set eth0.1 master br0
 
-  # add ports to bridges
-  ip link set dev wan master br0
-  ip link set eth0.1 master br0
-  ip link set dev lan1 master br0
-  ip link set dev lan2 master br0
+    # tag traffic on ports
+    bridge vlan add dev lan1 vid 1 pvid untagged
+    bridge vlan add dev lan2 vid 1 pvid untagged
+    bridge vlan add dev lan3 vid 1 pvid untagged
 
-  # tag traffic on ports
-  bridge vlan add dev lan1 vid 1 pvid untagged
-  bridge vlan add dev lan2 vid 1 pvid untagged
-  bridge vlan add dev wan vid 2 pvid untagged
+    # configure the bridge
+    ip addr add 192.0.2.129/25 dev br0
 
-  # configure the VLANs
-  ip addr add 192.0.2.1/30 dev eth0.2
-  ip addr add 192.0.2.129/25 dev br0
+    # bring up the bridge
+    ip link set dev br0 up
 
-  # bring up the bridge devices
-  ip link set br0 up
+*gateway*
+  .. code-block:: sh
+
+    # tag traffic on CPU port
+    ip link add link eth0 name eth0.1 type vlan id 1
+    ip link add link eth0 name eth0.2 type vlan id 2
+
+    # For kernels earlier than v5.12, the conduit interface needs to be
+    # brought up manually before the user ports.
+    ip link set eth0 up
+    ip link set eth0.1 up
+    ip link set eth0.2 up
+
+    # bring up the user interfaces
+    ip link set wan up
+    ip link set lan1 up
+    ip link set lan2 up
+
+    # create bridge
+    ip link add name br0 type bridge
+
+    # activate VLAN filtering
+    ip link set dev br0 type bridge vlan_filtering 1
+
+    # add ports to bridges
+    ip link set dev wan master br0
+    ip link set eth0.1 master br0
+    ip link set dev lan1 master br0
+    ip link set dev lan2 master br0
+
+    # tag traffic on ports
+    bridge vlan add dev lan1 vid 1 pvid untagged
+    bridge vlan add dev lan2 vid 1 pvid untagged
+    bridge vlan add dev wan vid 2 pvid untagged
+
+    # configure the VLANs
+    ip addr add 192.0.2.1/30 dev eth0.2
+    ip addr add 192.0.2.129/25 dev br0
+
+    # bring up the bridge devices
+    ip link set br0 up
+
+Forwarding database (FDB) management
+------------------------------------
+
+The existing DSA switches do not have the necessary hardware support to keep
+the software FDB of the bridge in sync with the hardware tables, so the two
+tables are managed separately (``bridge fdb show`` queries both, and depending
+on whether the ``self`` or ``master`` flags are being used, a ``bridge fdb
+add`` or ``bridge fdb del`` command acts upon entries from one or both tables).
+
+Up until kernel v4.14, DSA only supported user space management of bridge FDB
+entries using the bridge bypass operations (which do not update the software
+FDB, just the hardware one) using the ``self`` flag (which is optional and can
+be omitted).
+
+  .. code-block:: sh
+
+    bridge fdb add dev swp0 00:01:02:03:04:05 self static
+    # or shorthand
+    bridge fdb add dev swp0 00:01:02:03:04:05 static
+
+Due to a bug, the bridge bypass FDB implementation provided by DSA did not
+distinguish between ``static`` and ``local`` FDB entries (``static`` are meant
+to be forwarded, while ``local`` are meant to be locally terminated, i.e. sent
+to the host port). Instead, all FDB entries with the ``self`` flag (implicit or
+explicit) are treated by DSA as ``static`` even if they are ``local``.
+
+  .. code-block:: sh
+
+    # This command:
+    bridge fdb add dev swp0 00:01:02:03:04:05 static
+    # behaves the same for DSA as this command:
+    bridge fdb add dev swp0 00:01:02:03:04:05 local
+    # or shorthand, because the 'local' flag is implicit if 'static' is not
+    # specified, it also behaves the same as:
+    bridge fdb add dev swp0 00:01:02:03:04:05
+
+The last command is an incorrect way of adding a static bridge FDB entry to a
+DSA switch using the bridge bypass operations, and works by mistake. Other
+drivers will treat an FDB entry added by the same command as ``local`` and as
+such, will not forward it, as opposed to DSA.
+
+Between kernel v4.14 and v5.14, DSA has supported in parallel two modes of
+adding a bridge FDB entry to the switch: the bridge bypass discussed above, as
+well as a new mode using the ``master`` flag which installs FDB entries in the
+software bridge too.
+
+  .. code-block:: sh
+
+    bridge fdb add dev swp0 00:01:02:03:04:05 master static
+
+Since kernel v5.14, DSA has gained stronger integration with the bridge's
+software FDB, and the support for its bridge bypass FDB implementation (using
+the ``self`` flag) has been removed. This results in the following changes:
+
+  .. code-block:: sh
+
+    # This is the only valid way of adding an FDB entry that is supported,
+    # compatible with v4.14 kernels and later:
+    bridge fdb add dev swp0 00:01:02:03:04:05 master static
+    # This command is no longer buggy and the entry is properly treated as
+    # 'local' instead of being forwarded:
+    bridge fdb add dev swp0 00:01:02:03:04:05
+    # This command no longer installs a static FDB entry to hardware:
+    bridge fdb add dev swp0 00:01:02:03:04:05 static
+
+Script writers are therefore encouraged to use the ``master static`` set of
+flags when working with bridge FDB entries on DSA switch interfaces.
+
+Affinity of user ports to CPU ports
+-----------------------------------
+
+Typically, DSA switches are attached to the host via a single Ethernet
+interface, but in cases where the switch chip is discrete, the hardware design
+may permit the use of 2 or more ports connected to the host, for an increase in
+termination throughput.
+
+DSA can make use of multiple CPU ports in two ways. First, it is possible to
+statically assign the termination traffic associated with a certain user port
+to be processed by a certain CPU port. This way, user space can implement
+custom policies of static load balancing between user ports, by spreading the
+affinities according to the available CPU ports.
+
+Secondly, it is possible to perform load balancing between CPU ports on a per
+packet basis, rather than statically assigning user ports to CPU ports.
+This can be achieved by placing the DSA conduits under a LAG interface (bonding
+or team). DSA monitors this operation and creates a mirror of this software LAG
+on the CPU ports facing the physical DSA conduits that constitute the LAG slave
+devices.
+
+To make use of multiple CPU ports, the firmware (device tree) description of
+the switch must mark all the links between CPU ports and their DSA conduits
+using the ``ethernet`` reference/phandle. At startup, only a single CPU port
+and DSA conduit will be used - the numerically first port from the firmware
+description which has an ``ethernet`` property. It is up to the user to
+configure the system for the switch to use other conduits.
+
+DSA uses the ``rtnl_link_ops`` mechanism (with a "dsa" ``kind``) to allow
+changing the DSA conduit of a user port. The ``IFLA_DSA_CONDUIT`` u32 netlink
+attribute contains the ifindex of the conduit device that handles each user
+device. The DSA conduit must be a valid candidate based on firmware node
+information, or a LAG interface which contains only slaves which are valid
+candidates.
+
+Using iproute2, the following manipulations are possible:
+
+  .. code-block:: sh
+
+    # See the DSA conduit in current use
+    ip -d link show dev swp0
+        (...)
+        dsa master eth0
+
+    # Static CPU port distribution
+    ip link set swp0 type dsa master eth1
+    ip link set swp1 type dsa master eth0
+    ip link set swp2 type dsa master eth1
+    ip link set swp3 type dsa master eth0
+
+    # CPU ports in LAG, using explicit assignment of the DSA conduit
+    ip link add bond0 type bond mode balance-xor && ip link set bond0 up
+    ip link set eth1 down && ip link set eth1 master bond0
+    ip link set swp0 type dsa master bond0
+    ip link set swp1 type dsa master bond0
+    ip link set swp2 type dsa master bond0
+    ip link set swp3 type dsa master bond0
+    ip link set eth0 down && ip link set eth0 master bond0
+    ip -d link show dev swp0
+        (...)
+        dsa master bond0
+
+    # CPU ports in LAG, relying on implicit migration of the DSA conduit
+    ip link add bond0 type bond mode balance-xor && ip link set bond0 up
+    ip link set eth0 down && ip link set eth0 master bond0
+    ip link set eth1 down && ip link set eth1 master bond0
+    ip -d link show dev swp0
+        (...)
+        dsa master bond0
+
+Notice that in the case of CPU ports under a LAG, the use of the
+``IFLA_DSA_CONDUIT`` netlink attribute is not strictly needed, but rather, DSA
+reacts to the ``IFLA_MASTER`` attribute change of its present conduit (``eth0``)
+and migrates all user ports to the new upper of ``eth0``, ``bond0``. Similarly,
+when ``bond0`` is destroyed using ``RTM_DELLINK``, DSA migrates the user ports
+that were assigned to this interface to the first physical DSA conduit which is
+eligible, based on the firmware description (it effectively reverts to the
+startup configuration).
+
+In a setup with more than 2 physical CPU ports, it is therefore possible to mix
+static user to CPU port assignment with LAG between DSA conduits. It is not
+possible to statically assign a user port towards a DSA conduit that has any
+upper interfaces (this includes LAG devices - the conduit must always be the LAG
+in this case).
+
+Live changing of the DSA conduit (and thus CPU port) affinity of a user port is
+permitted, in order to allow dynamic redistribution in response to traffic.
+
+Physical DSA conduits are allowed to join and leave at any time a LAG interface
+used as a DSA conduit; however, DSA will reject a LAG interface as a valid
+candidate for being a DSA conduit unless it has at least one physical DSA conduit
+as a slave device.
diff --git a/Documentation/networking/dsa/dsa.rst b/Documentation/networking/dsa/dsa.rst
index a8d15dd2b42b..7b2e69cd7ef0 100644
--- a/Documentation/networking/dsa/dsa.rst
+++ b/Documentation/networking/dsa/dsa.rst
@@ -10,22 +10,22 @@ in joining the effort.
 Design principles
 =================
 
-The Distributed Switch Architecture is a subsystem which was primarily designed
-to support Marvell Ethernet switches (MV88E6xxx, a.k.a Linkstreet product line)
-using Linux, but has since evolved to support other vendors as well.
+The Distributed Switch Architecture subsystem was primarily designed to
+support Marvell Ethernet switches (MV88E6xxx, a.k.a. Link Street product
+line) using Linux, but has since evolved to support other vendors as well.
 
 The original philosophy behind this design was to be able to use unmodified
 Linux tools such as bridge, iproute2, ifconfig to work transparently whether
 they configured/queried a switch port network device or a regular network
 device.
 
-An Ethernet switch is typically comprised of multiple front-panel ports, and one
-or more CPU or management port. The DSA subsystem currently relies on the
+An Ethernet switch typically comprises multiple front-panel ports and one
+or more CPU or management ports. The DSA subsystem currently relies on the
 presence of a management port connected to an Ethernet controller capable of
 receiving Ethernet frames from the switch. This is a very common setup for all
 kinds of Ethernet switches found in Small Home and Office products: routers,
-gateways, or even top-of-the rack switches. This host Ethernet controller will
-be later referred to as "master" and "cpu" in DSA terminology and code.
+gateways, or even top-of-rack switches. This host Ethernet controller will
+be later referred to as "conduit" and "cpu" in DSA terminology and code.
 
 The D in DSA stands for Distributed, because the subsystem has been designed
 with the ability to configure and manage cascaded switches on top of each other
@@ -33,14 +33,14 @@ using upstream and downstream Ethernet links between switches. These specific
 ports are referred to as "dsa" ports in DSA terminology and code. A collection
 of multiple switches connected to each other is called a "switch tree".
 
-For each front-panel port, DSA will create specialized network devices which are
+For each front-panel port, DSA creates specialized network devices which are
 used as controlling and data-flowing endpoints for use by the Linux networking
-stack. These specialized network interfaces are referred to as "slave" network
+stack. These specialized network interfaces are referred to as "user" network
 interfaces in DSA terminology and code.
 
 The ideal case for using DSA is when an Ethernet switch supports a "switch tag"
 which is a hardware feature making the switch insert a specific tag for each
-Ethernet frames it received to/from specific ports to help the management
+Ethernet frame it receives to/from specific ports to help the management
 interface figure out:
 
 - what port is this frame coming from
@@ -56,23 +56,21 @@ Note that DSA does not currently create network interfaces for the "cpu" and
 
 - the "cpu" port is the Ethernet switch facing side of the management
   controller, and as such, would create a duplication of feature, since you
-  would get two interfaces for the same conduit: master netdev, and "cpu" netdev
+  would get two interfaces for the same conduit: conduit netdev, and "cpu" netdev
 
 - the "dsa" port(s) are just conduits between two or more switches, and as such
   cannot really be used as proper network interfaces either, only the
   downstream, or the top-most upstream interface makes sense with that model
 
+NB: for the past 15 years, the DSA subsystem had been making use of the terms
+"master" (rather than "conduit") and "slave" (rather than "user"). These terms
+have been removed from the DSA codebase and phased out of the uAPI.
+
 Switch tagging protocols
 ------------------------
 
-DSA currently supports 5 different tagging protocols, and a tag-less mode as
-well. The different protocols are implemented in:
-
-- ``net/dsa/tag_trailer.c``: Marvell's 4 trailer tag mode (legacy)
-- ``net/dsa/tag_dsa.c``: Marvell's original DSA tag
-- ``net/dsa/tag_edsa.c``: Marvell's enhanced DSA tag
-- ``net/dsa/tag_brcm.c``: Broadcom's 4 bytes tag
-- ``net/dsa/tag_qca.c``: Qualcomm's 2 bytes tag
+DSA supports many vendor-specific tagging protocols, one software-defined
+tagging protocol, and a tag-less mode as well (``DSA_TAG_PROTO_NONE``).
 
 The exact format of the tag protocol is vendor specific, but in general, they
 all contain something which:
@@ -80,10 +78,153 @@ all contain something which:
 - identifies which port the Ethernet frame came from/should be sent to
 - provides a reason why this frame was forwarded to the management interface
 
-Master network devices
-----------------------
-
-Master network devices are regular, unmodified Linux network device drivers for
+All tagging protocols are in ``net/dsa/tag_*.c`` files and implement the
+methods of the ``struct dsa_device_ops`` structure, which are detailed below.
+
+Tagging protocols generally fall in one of three categories:
+
+1. The switch-specific frame header is located before the Ethernet header,
+   shifting to the right (from the perspective of the DSA conduit's frame
+   parser) the MAC DA, MAC SA, EtherType and the entire L2 payload.
+2. The switch-specific frame header is located before the EtherType, keeping
+   the MAC DA and MAC SA in place from the DSA conduit's perspective, but
+   shifting the 'real' EtherType and L2 payload to the right.
+3. The switch-specific frame header is located at the tail of the packet,
+   keeping all frame headers in place and not altering the view of the packet
+   that the DSA conduit's frame parser has.
+
+A tagging protocol may tag all packets with switch tags of the same length, or
+the tag length might vary (for example packets with PTP timestamps might
+require an extended switch tag, or there might be one tag length on TX and a
+different one on RX). Either way, the tagging protocol driver must populate the
+``struct dsa_device_ops::needed_headroom`` and/or ``struct dsa_device_ops::needed_tailroom``
+with the length in octets of the longest switch frame header/trailer. The DSA
+framework will automatically adjust the MTU of the conduit interface to
+accommodate for this extra size in order for DSA user ports to support the
+standard MTU (L2 payload length) of 1500 octets. The ``needed_headroom`` and
+``needed_tailroom`` properties are also used to request from the network stack,
+on a best-effort basis, the allocation of packets with enough extra space such
+that the act of pushing the switch tag on transmission of a packet does not
+cause it to reallocate due to lack of memory.
+
+Even though applications are not expected to parse DSA-specific frame headers,
+the format on the wire of the tagging protocol represents an Application Binary
+Interface exposed by the kernel towards user space, for decoders such as
+``libpcap``. The tagging protocol driver must populate the ``proto`` member of
+``struct dsa_device_ops`` with a value that uniquely describes the
+characteristics of the interaction required between the switch hardware and the
+data path driver: the offset of each bit field within the frame header and any
+stateful processing required to deal with the frames (as may be required for
+PTP timestamping).
+
+From the perspective of the network stack, all switches within the same DSA
+switch tree use the same tagging protocol. In case of a packet transiting a
+fabric with more than one switch, the switch-specific frame header is inserted
+by the first switch in the fabric that the packet was received on. This header
+typically contains information regarding its type (whether it is a control
+frame that must be trapped to the CPU, or a data frame to be forwarded).
+Control frames should be decapsulated only by the software data path, whereas
+data frames might also be autonomously forwarded towards other user ports of
+other switches from the same fabric, and in this case, the outermost switch
+ports must decapsulate the packet.
+
+Note that in certain cases, it might be the case that the tagging format used
+by a leaf switch (not connected directly to the CPU) is not the same as what
+the network stack sees. This can be seen with Marvell switch trees, where the
+CPU port can be configured to use either the DSA or the Ethertype DSA (EDSA)
+format, but the DSA links are configured to use the shorter (without Ethertype)
+DSA frame header, in order to reduce the autonomous packet forwarding overhead.
+It still remains the case that, if the DSA switch tree is configured for the
+EDSA tagging protocol, the operating system sees EDSA-tagged packets from the
+leaf switches that tagged them with the shorter DSA header. This can be done
+because the Marvell switch connected directly to the CPU is configured to
+perform tag translation between DSA and EDSA (which is simply the operation of
+adding or removing the ``ETH_P_EDSA`` EtherType and some padding octets).
+
+It is possible to construct cascaded setups of DSA switches even if their
+tagging protocols are not compatible with one another. In this case, there are
+no DSA links in this fabric, and each switch constitutes a disjoint DSA switch
+tree. The DSA links are viewed as simply a pair of a DSA conduit (the out-facing
+port of the upstream DSA switch) and a CPU port (the in-facing port of the
+downstream DSA switch).
+
+The tagging protocol of the attached DSA switch tree can be viewed through the
+``dsa/tagging`` sysfs attribute of the DSA conduit::
+
+    cat /sys/class/net/eth0/dsa/tagging
+
+If the hardware and driver are capable, the tagging protocol of the DSA switch
+tree can be changed at runtime. This is done by writing the new tagging
+protocol name to the same sysfs device attribute as above (the DSA conduit and
+all attached switch ports must be down while doing this).
+
+It is desirable that all tagging protocols are testable with the ``dsa_loop``
+mockup driver, which can be attached to any network interface. The goal is that
+any network interface should be capable of transmitting the same packet in the
+same way, and the tagger should decode the same received packet in the same way
+regardless of the driver used for the switch control path, and the driver used
+for the DSA conduit.
+
+The transmission of a packet goes through the tagger's ``xmit`` function.
+The passed ``struct sk_buff *skb`` has ``skb->data`` pointing at
+``skb_mac_header(skb)``, i.e. at the destination MAC address, and the passed
+``struct net_device *dev`` represents the virtual DSA user network interface
+whose hardware counterpart the packet must be steered to (i.e. ``swp0``).
+The job of this method is to prepare the skb in a way that the switch will
+understand what egress port the packet is for (and not deliver it towards other
+ports). Typically this is fulfilled by pushing a frame header. Checking for
+insufficient size in the skb headroom or tailroom is unnecessary provided that
+the ``needed_headroom`` and ``needed_tailroom`` properties were filled out
+properly, because DSA ensures there is enough space before calling this method.
+
+The reception of a packet goes through the tagger's ``rcv`` function. The
+passed ``struct sk_buff *skb`` has ``skb->data`` pointing at
+``skb_mac_header(skb) + ETH_ALEN`` octets, i.e. to where the first octet after
+the EtherType would have been, were this frame not tagged. The role of this
+method is to consume the frame header, adjust ``skb->data`` to really point at
+the first octet after the EtherType, and to change ``skb->dev`` to point to the
+virtual DSA user network interface corresponding to the physical front-facing
+switch port that the packet was received on.
+
+Since tagging protocols in category 1 and 2 break software (and most often also
+hardware) packet dissection on the DSA conduit, features such as RPS (Receive
+Packet Steering) on the DSA conduit would be broken. The DSA framework deals
+with this by hooking into the flow dissector and shifting the offset at which
+the IP header is to be found in the tagged frame as seen by the DSA conduit.
+This behavior is automatic based on the ``overhead`` value of the tagging
+protocol. If not all packets are of equal size, the tagger can implement the
+``flow_dissect`` method of the ``struct dsa_device_ops`` and override this
+default behavior by specifying the correct offset incurred by each individual
+RX packet. Tail taggers do not cause issues to the flow dissector.
+
+Checksum offload should work with category 1 and 2 taggers when the DSA conduit
+driver declares NETIF_F_HW_CSUM in vlan_features and looks at csum_start and
+csum_offset. For those cases, DSA will shift the checksum start and offset by
+the tag size. If the DSA conduit driver still uses the legacy NETIF_F_IP_CSUM
+or NETIF_F_IPV6_CSUM in vlan_features, the offload might only work if the
+offload hardware already expects that specific tag (perhaps due to matching
+vendors). DSA user ports inherit those flags from the conduit, and it is up to
+the driver to correctly fall back to software checksum when the IP header is not
+where the hardware expects. If that check is ineffective, the packets might go
+to the network without a proper checksum (the checksum field will have the
+pseudo IP header sum). For category 3, when the offload hardware does not
+already expect the switch tag in use, the checksum must be calculated before any
+tag is inserted (i.e. inside the tagger). Otherwise, the DSA conduit would
+include the tail tag in the (software or hardware) checksum calculation. Then,
+when the tag gets stripped by the switch during transmission, it will leave an
+incorrect IP checksum in place.
+
+Due to various reasons (most common being category 1 taggers being associated
+with DSA-unaware conduits, mangling what the conduit perceives as MAC DA), the
+tagging protocol may require the DSA conduit to operate in promiscuous mode, to
+receive all frames regardless of the value of the MAC DA. This can be done by
+setting the ``promisc_on_conduit`` property of the ``struct dsa_device_ops``.
+Note that this assumes a DSA-unaware conduit driver, which is the norm.
+
+Conduit network devices
+-----------------------
+
+Conduit network devices are regular, unmodified Linux network device drivers for
 the CPU/management Ethernet interface. Such a driver might occasionally need to
 know whether DSA is enabled (e.g.: to enable/disable specific offload features),
 but the DSA subsystem has been proven to work with industry standard drivers:
@@ -95,14 +236,14 @@ Ethernet switch.
 Networking stack hooks
 ----------------------
 
-When a master netdev is used with DSA, a small hook is placed in the
+When a conduit netdev is used with DSA, a small hook is placed in the
 networking stack is in order to have the DSA subsystem process the Ethernet
 switch specific tagging protocol. DSA accomplishes this by registering a
 specific (and fake) Ethernet type (later becoming ``skb->protocol``) with the
 networking stack, this is also known as a ``ptype`` or ``packet_type``. A typical
 Ethernet Frame receive sequence looks like this:
 
-Master network device (e.g.: e1000e):
+Conduit network device (e.g.: e1000e):
 
 1. Receive interrupt fires:
 
@@ -132,16 +273,16 @@ Master network device (e.g.: e1000e):
 
         - inspect and strip switch tag protocol to determine originating port
         - locate per-port network device
-        - invoke ``eth_type_trans()`` with the DSA slave network device
+        - invoke ``eth_type_trans()`` with the DSA user network device
         - invoked ``netif_receive_skb()``
 
-Past this point, the DSA slave network devices get delivered regular Ethernet
+Past this point, the DSA user network devices get delivered regular Ethernet
 frames that can be processed by the networking stack.
 
-Slave network devices
----------------------
+User network devices
+--------------------
 
-Slave network devices created by DSA are stacked on top of their master network
+User network devices created by DSA are stacked on top of their conduit network
 device, each of these network interfaces will be responsible for being a
 controlling and data-flowing end-point for each front-panel port of the switch.
 These interfaces are specialized in order to:
@@ -150,21 +291,35 @@ These interfaces are specialized in order to:
   to/from specific switch ports
 - query the switch for ethtool operations: statistics, link state,
   Wake-on-LAN, register dumps...
-- external/internal PHY management: link, auto-negotiation etc.
+- manage external/internal PHY: link, auto-negotiation, etc.
 
-These slave network devices have custom net_device_ops and ethtool_ops function
+These user network devices have custom net_device_ops and ethtool_ops function
 pointers which allow DSA to introduce a level of layering between the networking
-stack/ethtool, and the switch driver implementation.
+stack/ethtool and the switch driver implementation.
 
-Upon frame transmission from these slave network devices, DSA will look up which
-switch tagging protocol is currently registered with these network devices, and
+Upon frame transmission from these user network devices, DSA will look up which
+switch tagging protocol is currently registered with these network devices and
 invoke a specific transmit routine which takes care of adding the relevant
 switch tag in the Ethernet frames.
 
-These frames are then queued for transmission using the master network device
-``ndo_start_xmit()`` function, since they contain the appropriate switch tag, the
+These frames are then queued for transmission using the conduit network device
+``ndo_start_xmit()`` function. Since they contain the appropriate switch tag, the
 Ethernet switch will be able to process these incoming frames from the
-management interface and delivers these frames to the physical switch port.
+management interface and deliver them to the physical switch port.
+
+When using multiple CPU ports, it is possible to stack a LAG (bonding/team)
+device between the DSA user devices and the physical DSA conduits. The LAG
+device is thus also a DSA conduit, but the LAG slave devices continue to be DSA
+conduits as well (just with no user port assigned to them; this is needed for
+recovery in case the LAG DSA conduit disappears). Thus, the data path of the LAG
+DSA conduit is used asymmetrically. On RX, the ``ETH_P_XDSA`` handler, which
+calls ``dsa_switch_rcv()``, is invoked early (on the physical DSA conduit;
+LAG slave). Therefore, the RX data path of the LAG DSA conduit is not used.
+On the other hand, TX takes place linearly: ``dsa_user_xmit`` calls
+``dsa_enqueue_skb``, which calls ``dev_queue_xmit`` towards the LAG DSA conduit.
+The latter calls ``dev_queue_xmit`` towards one physical DSA conduit or the
+other, and in both cases, the packet exits the system through a hardware path
+towards the switch.
 
 Graphical representation
 ------------------------
@@ -172,37 +327,48 @@ Graphical representation
 Summarized, this is basically how DSA looks like from a network device
 perspective::
 
-
-                |---------------------------
-                | CPU network device (eth0)|
-                ----------------------------
-                | <tag added by switch     |
-                |                          |
-                |                          |
-                |        tag added by CPU> |
-        |--------------------------------------------|
-        |            Switch driver                   |
-        |--------------------------------------------|
-                  ||        ||         ||
-              |-------|  |-------|  |-------|
-              | sw0p0 |  | sw0p1 |  | sw0p2 |
-              |-------|  |-------|  |-------|
-
-
-
-Slave MDIO bus
---------------
-
-In order to be able to read to/from a switch PHY built into it, DSA creates a
-slave MDIO bus which allows a specific switch driver to divert and intercept
+                Unaware application
+              opens and binds socket
+                       |  ^
+                       |  |
+           +-----------v--|--------------------+
+           |+------+ +------+ +------+ +------+|
+           || swp0 | | swp1 | | swp2 | | swp3 ||
+           |+------+-+------+-+------+-+------+|
+           |          DSA switch driver        |
+           +-----------------------------------+
+                         |        ^
+            Tag added by |        | Tag consumed by
+           switch driver |        | switch driver
+                         v        |
+           +-----------------------------------+
+           | Unmodified host interface driver  | Software
+   --------+-----------------------------------+------------
+           |       Host interface (eth0)       | Hardware
+           +-----------------------------------+
+                         |        ^
+         Tag consumed by |        | Tag added by
+         switch hardware |        | switch hardware
+                         v        |
+           +-----------------------------------+
+           |               Switch              |
+           |+------+ +------+ +------+ +------+|
+           || swp0 | | swp1 | | swp2 | | swp3 ||
+           ++------+-+------+-+------+-+------++
+
+User MDIO bus
+-------------
+
+In order to be able to read to/from a switch PHY built into it, DSA creates an
+user MDIO bus which allows a specific switch driver to divert and intercept
 MDIO reads/writes towards specific PHY addresses. In most MDIO-connected
 switches, these functions would utilize direct or indirect PHY addressing mode
 to return standard MII registers from the switch builtin PHYs, allowing the PHY
 library and/or to return link status, link partner pages, auto-negotiation
-results etc..
+results, etc.
 
-For Ethernet switches which have both external and internal MDIO busses, the
-slave MII bus can be utilized to mux/demux MDIO reads and writes towards either
+For Ethernet switches which have both external and internal MDIO buses, the
+user MII bus can be utilized to mux/demux MDIO reads and writes towards either
 internal or external MDIO devices this switch might be connected to: internal
 PHYs, external PHYs, or even external switches.
 
@@ -218,11 +384,11 @@ DSA data structures are defined in ``include/net/dsa.h`` as well as
   table indication (when cascading switches)
 
 - ``dsa_platform_data``: platform device configuration data which can reference
-  a collection of dsa_chip_data structure if multiples switches are cascaded,
-  the master network device this switch tree is attached to needs to be
+  a collection of dsa_chip_data structures if multiple switches are cascaded,
+  the conduit network device this switch tree is attached to needs to be
   referenced
 
-- ``dsa_switch_tree``: structure assigned to the master network device under
+- ``dsa_switch_tree``: structure assigned to the conduit network device under
   ``dsa_ptr``, this structure references a dsa_platform_data structure as well as
   the tagging protocol supported by the switch tree, and which receive/transmit
   function hooks should be invoked, information about the directly attached
@@ -230,7 +396,7 @@ DSA data structures are defined in ``include/net/dsa.h`` as well as
   referenced to address individual switches in the tree.
 
 - ``dsa_switch``: structure describing a switch device in the tree, referencing
-  a ``dsa_switch_tree`` as a backpointer, slave network devices, master network
+  a ``dsa_switch_tree`` as a backpointer, user network devices, conduit network
   device, and a reference to the backing``dsa_switch_ops``
 
 - ``dsa_switch_ops``: structure referencing function pointers, see below for a
@@ -239,18 +405,10 @@ DSA data structures are defined in ``include/net/dsa.h`` as well as
 Design limitations
 ==================
 
-Limits on the number of devices and ports
------------------------------------------
-
-DSA currently limits the number of maximum switches within a tree to 4
-(``DSA_MAX_SWITCHES``), and the number of ports per switch to 12 (``DSA_MAX_PORTS``).
-These limits could be extended to support larger configurations would this need
-arise.
-
 Lack of CPU/DSA network devices
 -------------------------------
 
-DSA does not currently create slave network devices for the CPU or DSA ports, as
+DSA does not currently create user network devices for the CPU or DSA ports, as
 described before. This might be an issue in the following cases:
 
 - inability to fetch switch CPU port statistics counters using ethtool, which
@@ -265,7 +423,7 @@ described before. This might be an issue in the following cases:
 Common pitfalls using DSA setups
 --------------------------------
 
-Once a master network device is configured to use DSA (dev->dsa_ptr becomes
+Once a conduit network device is configured to use DSA (dev->dsa_ptr becomes
 non-NULL), and the switch behind it expects a tagging protocol, this network
 interface can only exclusively be used as a conduit interface. Sending packets
 directly through this interface (e.g.: opening a socket using this interface)
@@ -273,10 +431,6 @@ will not make us go through the switch tagging protocol transmit function, so
 the Ethernet switch on the other end, expecting a tag will typically drop this
 frame.
 
-Slave network devices check that the master network device is UP before allowing
-you to administratively bring UP these slave network devices. A common
-configuration mistake is forgetting to bring UP the master network device first.
-
 Interactions with other subsystems
 ==================================
 
@@ -285,11 +439,12 @@ DSA currently leverages the following subsystems:
 - MDIO/PHY library: ``drivers/net/phy/phy.c``, ``mdio_bus.c``
 - Switchdev:``net/switchdev/*``
 - Device Tree for various of_* functions
+- Devlink: ``net/core/devlink.c``
 
 MDIO/PHY library
 ----------------
 
-Slave network devices exposed by DSA may or may not be interfacing with PHY
+User network devices exposed by DSA may or may not be interfacing with PHY
 devices (``struct phy_device`` as defined in ``include/linux/phy.h)``, but the DSA
 subsystem deals with all possible combinations:
 
@@ -299,20 +454,20 @@ subsystem deals with all possible combinations:
 - special, non-autonegotiated or non MDIO-managed PHY devices: SFPs, MoCA; a.k.a
   fixed PHYs
 
-The PHY configuration is done by the ``dsa_slave_phy_setup()`` function and the
+The PHY configuration is done by the ``dsa_user_phy_setup()`` function and the
 logic basically looks like this:
 
 - if Device Tree is used, the PHY device is looked up using the standard
   "phy-handle" property, if found, this PHY device is created and registered
   using ``of_phy_connect()``
 
-- if Device Tree is used, and the PHY device is "fixed", that is, conforms to
+- if Device Tree is used and the PHY device is "fixed", that is, conforms to
   the definition of a non-MDIO managed PHY as defined in
   ``Documentation/devicetree/bindings/net/fixed-link.txt``, the PHY is registered
   and connected transparently using the special fixed MDIO bus driver
 
 - finally, if the PHY is built into the switch, as is very common with
-  standalone switch packages, the PHY is probed using the slave MII bus created
+  standalone switch packages, the PHY is probed using the user MII bus created
   by DSA
 
 
@@ -321,14 +476,39 @@ SWITCHDEV
 
 DSA directly utilizes SWITCHDEV when interfacing with the bridge layer, and
 more specifically with its VLAN filtering portion when configuring VLANs on top
-of per-port slave network devices. Since DSA primarily deals with
-MDIO-connected switches, although not exclusively, SWITCHDEV's
-prepare/abort/commit phases are often simplified into a prepare phase which
-checks whether the operation is supported by the DSA switch driver, and a commit
-phase which applies the changes.
-
-As of today, the only SWITCHDEV objects supported by DSA are the FDB and VLAN
-objects.
+of per-port user network devices. As of today, the only SWITCHDEV objects
+supported by DSA are the FDB and VLAN objects.
+
+Devlink
+-------
+
+DSA registers one devlink device per physical switch in the fabric.
+For each devlink device, every physical port (i.e. user ports, CPU ports, DSA
+links or unused ports) is exposed as a devlink port.
+
+DSA drivers can make use of the following devlink features:
+
+- Regions: debugging feature which allows user space to dump driver-defined
+  areas of hardware information in a low-level, binary format. Both global
+  regions as well as per-port regions are supported. It is possible to export
+  devlink regions even for pieces of data that are already exposed in some way
+  to the standard iproute2 user space programs (ip-link, bridge), like address
+  tables and VLAN tables. For example, this might be useful if the tables
+  contain additional hardware-specific details which are not visible through
+  the iproute2 abstraction, or it might be useful to inspect these tables on
+  the non-user ports too, which are invisible to iproute2 because no network
+  interface is registered for them.
+- Params: a feature which enables user to configure certain low-level tunable
+  knobs pertaining to the device. Drivers may implement applicable generic
+  devlink params, or may add new device-specific devlink params.
+- Resources: a monitoring feature which enables users to see the degree of
+  utilization of certain hardware tables in the device, such as FDB, VLAN, etc.
+- Shared buffers: a QoS feature for adjusting and partitioning memory and frame
+  reservations per port and per traffic class, in the ingress and egress
+  directions, such that low-priority bulk traffic does not impede the
+  processing of high-priority critical traffic.
+
+For more details, consult ``Documentation/networking/devlink/``.
 
 Device Tree
 -----------
@@ -336,35 +516,117 @@ Device Tree
 DSA features a standardized binding which is documented in
 ``Documentation/devicetree/bindings/net/dsa/dsa.txt``. PHY/MDIO library helper
 functions such as ``of_get_phy_mode()``, ``of_phy_connect()`` are also used to query
-per-port PHY specific details: interface connection, MDIO bus location etc..
+per-port PHY specific details: interface connection, MDIO bus location, etc.
 
 Driver development
 ==================
 
-DSA switch drivers need to implement a dsa_switch_ops structure which will
+DSA switch drivers need to implement a ``dsa_switch_ops`` structure which will
 contain the various members described below.
 
-``register_switch_driver()`` registers this dsa_switch_ops in its internal list
-of drivers to probe for. ``unregister_switch_driver()`` does the exact opposite.
+Probing, registration and device lifetime
+-----------------------------------------
+
+DSA switches are regular ``device`` structures on buses (be they platform, SPI,
+I2C, MDIO or otherwise). The DSA framework is not involved in their probing
+with the device core.
+
+Switch registration from the perspective of a driver means passing a valid
+``struct dsa_switch`` pointer to ``dsa_register_switch()``, usually from the
+switch driver's probing function. The following members must be valid in the
+provided structure:
+
+- ``ds->dev``: will be used to parse the switch's OF node or platform data.
+
+- ``ds->num_ports``: will be used to create the port list for this switch, and
+  to validate the port indices provided in the OF node.
+
+- ``ds->ops``: a pointer to the ``dsa_switch_ops`` structure holding the DSA
+  method implementations.
+
+- ``ds->priv``: backpointer to a driver-private data structure which can be
+  retrieved in all further DSA method callbacks.
+
+In addition, the following flags in the ``dsa_switch`` structure may optionally
+be configured to obtain driver-specific behavior from the DSA core. Their
+behavior when set is documented through comments in ``include/net/dsa.h``.
+
+- ``ds->vlan_filtering_is_global``
+
+- ``ds->needs_standalone_vlan_filtering``
 
-Unless requested differently by setting the priv_size member accordingly, DSA
-does not allocate any driver private context space.
+- ``ds->configure_vlan_while_not_filtering``
+
+- ``ds->untag_bridge_pvid``
+
+- ``ds->assisted_learning_on_cpu_port``
+
+- ``ds->mtu_enforcement_ingress``
+
+- ``ds->fdb_isolation``
+
+Internally, DSA keeps an array of switch trees (group of switches) global to
+the kernel, and attaches a ``dsa_switch`` structure to a tree on registration.
+The tree ID to which the switch is attached is determined by the first u32
+number of the ``dsa,member`` property of the switch's OF node (0 if missing).
+The switch ID within the tree is determined by the second u32 number of the
+same OF property (0 if missing). Registering multiple switches with the same
+switch ID and tree ID is illegal and will cause an error. Using platform data,
+a single switch and a single switch tree is permitted.
+
+In case of a tree with multiple switches, probing takes place asymmetrically.
+The first N-1 callers of ``dsa_register_switch()`` only add their ports to the
+port list of the tree (``dst->ports``), each port having a backpointer to its
+associated switch (``dp->ds``). Then, these switches exit their
+``dsa_register_switch()`` call early, because ``dsa_tree_setup_routing_table()``
+has determined that the tree is not yet complete (not all ports referenced by
+DSA links are present in the tree's port list). The tree becomes complete when
+the last switch calls ``dsa_register_switch()``, and this triggers the effective
+continuation of initialization (including the call to ``ds->ops->setup()``) for
+all switches within that tree, all as part of the calling context of the last
+switch's probe function.
+
+The opposite of registration takes place when calling ``dsa_unregister_switch()``,
+which removes a switch's ports from the port list of the tree. The entire tree
+is torn down when the first switch unregisters.
+
+It is mandatory for DSA switch drivers to implement the ``shutdown()`` callback
+of their respective bus, and call ``dsa_switch_shutdown()`` from it (a minimal
+version of the full teardown performed by ``dsa_unregister_switch()``).
+The reason is that DSA keeps a reference on the conduit net device, and if the
+driver for the conduit device decides to unbind on shutdown, DSA's reference
+will block that operation from finalizing.
+
+Either ``dsa_switch_shutdown()`` or ``dsa_unregister_switch()`` must be called,
+but not both, and the device driver model permits the bus' ``remove()`` method
+to be called even if ``shutdown()`` was already called. Therefore, drivers are
+expected to implement a mutual exclusion method between ``remove()`` and
+``shutdown()`` by setting their drvdata to NULL after any of these has run, and
+checking whether the drvdata is NULL before proceeding to take any action.
+
+After ``dsa_switch_shutdown()`` or ``dsa_unregister_switch()`` was called, no
+further callbacks via the provided ``dsa_switch_ops`` may take place, and the
+driver may free the data structures associated with the ``dsa_switch``.
 
 Switch configuration
 --------------------
 
-- ``tag_protocol``: this is to indicate what kind of tagging protocol is supported,
-  should be a valid value from the ``dsa_tag_protocol`` enum
+- ``get_tag_protocol``: this is to indicate what kind of tagging protocol is
+  supported, should be a valid value from the ``dsa_tag_protocol`` enum.
+  The returned information does not have to be static; the driver is passed the
+  CPU port number, as well as the tagging protocol of a possibly stacked
+  upstream switch, in case there are hardware limitations in terms of supported
+  tag formats.
 
-- ``probe``: probe routine which will be invoked by the DSA platform device upon
-  registration to test for the presence/absence of a switch device. For MDIO
-  devices, it is recommended to issue a read towards internal registers using
-  the switch pseudo-PHY and return whether this is a supported device. For other
-  buses, return a non-NULL string
+- ``change_tag_protocol``: when the default tagging protocol has compatibility
+  problems with the conduit or other issues, the driver may support changing it
+  at runtime, either through a device tree property or through sysfs. In that
+  case, further calls to ``get_tag_protocol`` should report the protocol in
+  current use.
 
 - ``setup``: setup function for the switch, this function is responsible for setting
   up the ``dsa_switch_ops`` private structure with all it needs: register maps,
-  interrupts, mutexes, locks etc.. This function is also expected to properly
+  interrupts, mutexes, locks, etc. This function is also expected to properly
   configure the switch to separate all network interfaces from each other, that
   is, they should be isolated by the switch hardware itself, typically by creating
   a Port-based VLAN ID for each port and allowing only the CPU port and the
@@ -373,7 +635,35 @@ Switch configuration
   fully configured and ready to serve any kind of request. It is recommended
   to issue a software reset of the switch during this setup function in order to
   avoid relying on what a previous software agent such as a bootloader/firmware
-  may have previously configured.
+  may have previously configured. The method responsible for undoing any
+  applicable allocations or operations done here is ``teardown``.
+
+- ``port_setup`` and ``port_teardown``: methods for initialization and
+  destruction of per-port data structures. It is mandatory for some operations
+  such as registering and unregistering devlink port regions to be done from
+  these methods, otherwise they are optional. A port will be torn down only if
+  it has been previously set up. It is possible for a port to be set up during
+  probing only to be torn down immediately afterwards, for example in case its
+  PHY cannot be found. In this case, probing of the DSA switch continues
+  without that particular port.
+
+- ``port_change_conduit``: method through which the affinity (association used
+  for traffic termination purposes) between a user port and a CPU port can be
+  changed. By default all user ports from a tree are assigned to the first
+  available CPU port that makes sense for them (most of the times this means
+  the user ports of a tree are all assigned to the same CPU port, except for H
+  topologies as described in commit 2c0b03258b8b). The ``port`` argument
+  represents the index of the user port, and the ``conduit`` argument represents
+  the new DSA conduit ``net_device``. The CPU port associated with the new
+  conduit can be retrieved by looking at ``struct dsa_port *cpu_dp =
+  conduit->dsa_ptr``. Additionally, the conduit can also be a LAG device where
+  all the slave devices are physical DSA conduits. LAG DSA  also have a
+  valid ``conduit->dsa_ptr`` pointer, however this is not unique, but rather a
+  duplicate of the first physical DSA conduit's (LAG slave) ``dsa_ptr``. In case
+  of a LAG DSA conduit, a further call to ``port_lag_join`` will be emitted
+  separately for the physical CPU ports associated with the physical DSA
+  conduits, requesting them to create a hardware LAG associated with the LAG
+  interface.
 
 PHY devices and link management
 -------------------------------
@@ -381,19 +671,19 @@ PHY devices and link management
 - ``get_phy_flags``: Some switches are interfaced to various kinds of Ethernet PHYs,
   if the PHY library PHY driver needs to know about information it cannot obtain
   on its own (e.g.: coming from switch memory mapped registers), this function
-  should return a 32-bits bitmask of "flags", that is private between the switch
+  should return a 32-bit bitmask of "flags" that is private between the switch
   driver and the Ethernet PHY driver in ``drivers/net/phy/\*``.
 
-- ``phy_read``: Function invoked by the DSA slave MDIO bus when attempting to read
+- ``phy_read``: Function invoked by the DSA user MDIO bus when attempting to read
   the switch port MDIO registers. If unavailable, return 0xffff for each read.
   For builtin switch Ethernet PHYs, this function should allow reading the link
-  status, auto-negotiation results, link partner pages etc..
+  status, auto-negotiation results, link partner pages, etc.
 
-- ``phy_write``: Function invoked by the DSA slave MDIO bus when attempting to write
+- ``phy_write``: Function invoked by the DSA user MDIO bus when attempting to write
   to the switch port MDIO registers. If unavailable return a negative error
   code.
 
-- ``adjust_link``: Function invoked by the PHY library when a slave network device
+- ``adjust_link``: Function invoked by the PHY library when a user network device
   is attached to a PHY device. This function is responsible for appropriately
   configuring the switch port link parameters: speed, duplex, pause based on
   what the ``phy_device`` is providing.
@@ -409,17 +699,17 @@ Ethtool operations
 ------------------
 
 - ``get_strings``: ethtool function used to query the driver's strings, will
-  typically return statistics strings, private flags strings etc.
+  typically return statistics strings, private flags strings, etc.
 
 - ``get_ethtool_stats``: ethtool function used to query per-port statistics and
-  return their values. DSA overlays slave network devices general statistics:
+  return their values. DSA overlays user network devices general statistics:
   RX/TX counters from the network device, with switch driver specific statistics
   per port
 
 - ``get_sset_count``: ethtool function used to query the number of statistics items
 
 - ``get_wol``: ethtool function used to obtain Wake-on-LAN settings per-port, this
-  function may, for certain implementations also query the master network device
+  function may for certain implementations also query the conduit network device
   Wake-on-LAN settings if this interface needs to participate in Wake-on-LAN
 
 - ``set_wol``: ethtool function used to configure Wake-on-LAN settings per-port,
@@ -461,38 +751,227 @@ Power management
   should resume all Ethernet switch activities and re-configure the switch to be
   in a fully active state
 
-- ``port_enable``: function invoked by the DSA slave network device ndo_open
-  function when a port is administratively brought up, this function should be
-  fully enabling a given switch port. DSA takes care of marking the port with
+- ``port_enable``: function invoked by the DSA user network device ndo_open
+  function when a port is administratively brought up, this function should
+  fully enable a given switch port. DSA takes care of marking the port with
   ``BR_STATE_BLOCKING`` if the port is a bridge member, or ``BR_STATE_FORWARDING`` if it
   was not, and propagating these changes down to the hardware
 
-- ``port_disable``: function invoked by the DSA slave network device ndo_close
-  function when a port is administratively brought down, this function should be
-  fully disabling a given switch port. DSA takes care of marking the port with
+- ``port_disable``: function invoked by the DSA user network device ndo_close
+  function when a port is administratively brought down, this function should
+  fully disable a given switch port. DSA takes care of marking the port with
   ``BR_STATE_DISABLED`` and propagating changes to the hardware if this port is
   disabled while being a bridge member
 
+Address databases
+-----------------
+
+Switching hardware is expected to have a table for FDB entries, however not all
+of them are active at the same time. An address database is the subset (partition)
+of FDB entries that is active (can be matched by address learning on RX, or FDB
+lookup on TX) depending on the state of the port. An address database may
+occasionally be called "FID" (Filtering ID) in this document, although the
+underlying implementation may choose whatever is available to the hardware.
+
+For example, all ports that belong to a VLAN-unaware bridge (which is
+*currently* VLAN-unaware) are expected to learn source addresses in the
+database associated by the driver with that bridge (and not with other
+VLAN-unaware bridges). During forwarding and FDB lookup, a packet received on a
+VLAN-unaware bridge port should be able to find a VLAN-unaware FDB entry having
+the same MAC DA as the packet, which is present on another port member of the
+same bridge. At the same time, the FDB lookup process must be able to not find
+an FDB entry having the same MAC DA as the packet, if that entry points towards
+a port which is a member of a different VLAN-unaware bridge (and is therefore
+associated with a different address database).
+
+Similarly, each VLAN of each offloaded VLAN-aware bridge should have an
+associated address database, which is shared by all ports which are members of
+that VLAN, but not shared by ports belonging to different bridges that are
+members of the same VID.
+
+In this context, a VLAN-unaware database means that all packets are expected to
+match on it irrespective of VLAN ID (only MAC address lookup), whereas a
+VLAN-aware database means that packets are supposed to match based on the VLAN
+ID from the classified 802.1Q header (or the pvid if untagged).
+
+At the bridge layer, VLAN-unaware FDB entries have the special VID value of 0,
+whereas VLAN-aware FDB entries have non-zero VID values. Note that a
+VLAN-unaware bridge may have VLAN-aware (non-zero VID) FDB entries, and a
+VLAN-aware bridge may have VLAN-unaware FDB entries. As in hardware, the
+software bridge keeps separate address databases, and offloads to hardware the
+FDB entries belonging to these databases, through switchdev, asynchronously
+relative to the moment when the databases become active or inactive.
+
+When a user port operates in standalone mode, its driver should configure it to
+use a separate database called a port private database. This is different from
+the databases described above, and should impede operation as standalone port
+(packet in, packet out to the CPU port) as little as possible. For example,
+on ingress, it should not attempt to learn the MAC SA of ingress traffic, since
+learning is a bridging layer service and this is a standalone port, therefore
+it would consume useless space. With no address learning, the port private
+database should be empty in a naive implementation, and in this case, all
+received packets should be trivially flooded to the CPU port.
+
+DSA (cascade) and CPU ports are also called "shared" ports because they service
+multiple address databases, and the database that a packet should be associated
+to is usually embedded in the DSA tag. This means that the CPU port may
+simultaneously transport packets coming from a standalone port (which were
+classified by hardware in one address database), and from a bridge port (which
+were classified to a different address database).
+
+Switch drivers which satisfy certain criteria are able to optimize the naive
+configuration by removing the CPU port from the flooding domain of the switch,
+and just program the hardware with FDB entries pointing towards the CPU port
+for which it is known that software is interested in those MAC addresses.
+Packets which do not match a known FDB entry will not be delivered to the CPU,
+which will save CPU cycles required for creating an skb just to drop it.
+
+DSA is able to perform host address filtering for the following kinds of
+addresses:
+
+- Primary unicast MAC addresses of ports (``dev->dev_addr``). These are
+  associated with the port private database of the respective user port,
+  and the driver is notified to install them through ``port_fdb_add`` towards
+  the CPU port.
+
+- Secondary unicast and multicast MAC addresses of ports (addresses added
+  through ``dev_uc_add()`` and ``dev_mc_add()``). These are also associated
+  with the port private database of the respective user port.
+
+- Local/permanent bridge FDB entries (``BR_FDB_LOCAL``). These are the MAC
+  addresses of the bridge ports, for which packets must be terminated locally
+  and not forwarded. They are associated with the address database for that
+  bridge.
+
+- Static bridge FDB entries installed towards foreign (non-DSA) interfaces
+  present in the same bridge as some DSA switch ports. These are also
+  associated with the address database for that bridge.
+
+- Dynamically learned FDB entries on foreign interfaces present in the same
+  bridge as some DSA switch ports, only if ``ds->assisted_learning_on_cpu_port``
+  is set to true by the driver. These are associated with the address database
+  for that bridge.
+
+For various operations detailed below, DSA provides a ``dsa_db`` structure
+which can be of the following types:
+
+- ``DSA_DB_PORT``: the FDB (or MDB) entry to be installed or deleted belongs to
+  the port private database of user port ``db->dp``.
+- ``DSA_DB_BRIDGE``: the entry belongs to one of the address databases of bridge
+  ``db->bridge``. Separation between the VLAN-unaware database and the per-VID
+  databases of this bridge is expected to be done by the driver.
+- ``DSA_DB_LAG``: the entry belongs to the address database of LAG ``db->lag``.
+  Note: ``DSA_DB_LAG`` is currently unused and may be removed in the future.
+
+The drivers which act upon the ``dsa_db`` argument in ``port_fdb_add``,
+``port_mdb_add`` etc should declare ``ds->fdb_isolation`` as true.
+
+DSA associates each offloaded bridge and each offloaded LAG with a one-based ID
+(``struct dsa_bridge :: num``, ``struct dsa_lag :: id``) for the purposes of
+refcounting addresses on shared ports. Drivers may piggyback on DSA's numbering
+scheme (the ID is readable through ``db->bridge.num`` and ``db->lag.id`` or may
+implement their own.
+
+Only the drivers which declare support for FDB isolation are notified of FDB
+entries on the CPU port belonging to ``DSA_DB_PORT`` databases.
+For compatibility/legacy reasons, ``DSA_DB_BRIDGE`` addresses are notified to
+drivers even if they do not support FDB isolation. However, ``db->bridge.num``
+and ``db->lag.id`` are always set to 0 in that case (to denote the lack of
+isolation, for refcounting purposes).
+
+Note that it is not mandatory for a switch driver to implement physically
+separate address databases for each standalone user port. Since FDB entries in
+the port private databases will always point to the CPU port, there is no risk
+for incorrect forwarding decisions. In this case, all standalone ports may
+share the same database, but the reference counting of host-filtered addresses
+(not deleting the FDB entry for a port's MAC address if it's still in use by
+another port) becomes the responsibility of the driver, because DSA is unaware
+that the port databases are in fact shared. This can be achieved by calling
+``dsa_fdb_present_in_other_db()`` and ``dsa_mdb_present_in_other_db()``.
+The down side is that the RX filtering lists of each user port are in fact
+shared, which means that user port A may accept a packet with a MAC DA it
+shouldn't have, only because that MAC address was in the RX filtering list of
+user port B. These packets will still be dropped in software, however.
+
 Bridge layer
 ------------
 
+Offloading the bridge forwarding plane is optional and handled by the methods
+below. They may be absent, return -EOPNOTSUPP, or ``ds->max_num_bridges`` may
+be non-zero and exceeded, and in this case, joining a bridge port is still
+possible, but the packet forwarding will take place in software, and the ports
+under a software bridge must remain configured in the same way as for
+standalone operation, i.e. have all bridging service functions (address
+learning etc) disabled, and send all received packets to the CPU port only.
+
+Concretely, a port starts offloading the forwarding plane of a bridge once it
+returns success to the ``port_bridge_join`` method, and stops doing so after
+``port_bridge_leave`` has been called. Offloading the bridge means autonomously
+learning FDB entries in accordance with the software bridge port's state, and
+autonomously forwarding (or flooding) received packets without CPU intervention.
+This is optional even when offloading a bridge port. Tagging protocol drivers
+are expected to call ``dsa_default_offload_fwd_mark(skb)`` for packets which
+have already been autonomously forwarded in the forwarding domain of the
+ingress switch port. DSA, through ``dsa_port_devlink_setup()``, considers all
+switch ports part of the same tree ID to be part of the same bridge forwarding
+domain (capable of autonomous forwarding to each other).
+
+Offloading the TX forwarding process of a bridge is a distinct concept from
+simply offloading its forwarding plane, and refers to the ability of certain
+driver and tag protocol combinations to transmit a single skb coming from the
+bridge device's transmit function to potentially multiple egress ports (and
+thereby avoid its cloning in software).
+
+Packets for which the bridge requests this behavior are called data plane
+packets and have ``skb->offload_fwd_mark`` set to true in the tag protocol
+driver's ``xmit`` function. Data plane packets are subject to FDB lookup,
+hardware learning on the CPU port, and do not override the port STP state.
+Additionally, replication of data plane packets (multicast, flooding) is
+handled in hardware and the bridge driver will transmit a single skb for each
+packet that may or may not need replication.
+
+When the TX forwarding offload is enabled, the tag protocol driver is
+responsible to inject packets into the data plane of the hardware towards the
+correct bridging domain (FID) that the port is a part of. The port may be
+VLAN-unaware, and in this case the FID must be equal to the FID used by the
+driver for its VLAN-unaware address database associated with that bridge.
+Alternatively, the bridge may be VLAN-aware, and in that case, it is guaranteed
+that the packet is also VLAN-tagged with the VLAN ID that the bridge processed
+this packet in. It is the responsibility of the hardware to untag the VID on
+the egress-untagged ports, or keep the tag on the egress-tagged ones.
+
 - ``port_bridge_join``: bridge layer function invoked when a given switch port is
-  added to a bridge, this function should be doing the necessary at the switch
-  level to permit the joining port from being added to the relevant logical
+  added to a bridge, this function should do what's necessary at the switch
+  level to permit the joining port to be added to the relevant logical
   domain for it to ingress/egress traffic with other members of the bridge.
+  By setting the ``tx_fwd_offload`` argument to true, the TX forwarding process
+  of this bridge is also offloaded.
 
 - ``port_bridge_leave``: bridge layer function invoked when a given switch port is
-  removed from a bridge, this function should be doing the necessary at the
+  removed from a bridge, this function should do what's necessary at the
   switch level to deny the leaving port from ingress/egress traffic from the
-  remaining bridge members. When the port leaves the bridge, it should be aged
-  out at the switch hardware for the switch to (re) learn MAC addresses behind
-  this port.
+  remaining bridge members.
 
 - ``port_stp_state_set``: bridge layer function invoked when a given switch port STP
   state is computed by the bridge layer and should be propagated to switch
-  hardware to forward/block/learn traffic. The switch driver is responsible for
-  computing a STP state change based on current and asked parameters and perform
-  the relevant ageing based on the intersection results
+  hardware to forward/block/learn traffic.
+
+- ``port_bridge_flags``: bridge layer function invoked when a port must
+  configure its settings for e.g. flooding of unknown traffic or source address
+  learning. The switch driver is responsible for initial setup of the
+  standalone ports with address learning disabled and egress flooding of all
+  types of traffic, then the DSA core notifies of any change to the bridge port
+  flags when the port joins and leaves a bridge. DSA does not currently manage
+  the bridge port flags for the CPU port. The assumption is that address
+  learning should be statically enabled (if supported by the hardware) on the
+  CPU port, and flooding towards the CPU port should also be enabled, due to a
+  lack of an explicit address filtering mechanism in the DSA core.
+
+- ``port_fast_age``: bridge layer function invoked when flushing the
+  dynamically learned FDB entries on the port is necessary. This is called when
+  transitioning from an STP state where learning should take place to an STP
+  state where it shouldn't, or when leaving a bridge, or when address learning
+  is turned off via ``port_bridge_flags``.
 
 Bridge VLAN filtering
 ---------------------
@@ -507,63 +986,139 @@ Bridge VLAN filtering
   accept any 802.1Q frames irrespective of their VLAN ID, and untagged frames are
   allowed.
 
-- ``port_vlan_prepare``: bridge layer function invoked when the bridge prepares the
-  configuration of a VLAN on the given port. If the operation is not supported
-  by the hardware, this function should return ``-EOPNOTSUPP`` to inform the bridge
-  code to fallback to a software implementation. No hardware setup must be done
-  in this function. See port_vlan_add for this and details.
-
 - ``port_vlan_add``: bridge layer function invoked when a VLAN is configured
-  (tagged or untagged) for the given switch port
+  (tagged or untagged) for the given switch port. The CPU port becomes a member
+  of a VLAN only if a foreign bridge port is also a member of it (and
+  forwarding needs to take place in software), or the VLAN is installed to the
+  VLAN group of the bridge device itself, for termination purposes
+  (``bridge vlan add dev br0 vid 100 self``). VLANs on shared ports are
+  reference counted and removed when there is no user left. Drivers do not need
+  to manually install a VLAN on the CPU port.
 
 - ``port_vlan_del``: bridge layer function invoked when a VLAN is removed from the
   given switch port
 
-- ``port_vlan_dump``: bridge layer function invoked with a switchdev callback
-  function that the driver has to call for each VLAN the given port is a member
-  of. A switchdev object is used to carry the VID and bridge flags.
-
 - ``port_fdb_add``: bridge layer function invoked when the bridge wants to install a
   Forwarding Database entry, the switch hardware should be programmed with the
   specified address in the specified VLAN Id in the forwarding database
-  associated with this VLAN ID. If the operation is not supported, this
-  function should return ``-EOPNOTSUPP`` to inform the bridge code to fallback to
-  a software implementation.
-
-.. note:: VLAN ID 0 corresponds to the port private database, which, in the context
-        of DSA, would be its port-based VLAN, used by the associated bridge device.
+  associated with this VLAN ID.
 
 - ``port_fdb_del``: bridge layer function invoked when the bridge wants to remove a
   Forwarding Database entry, the switch hardware should be programmed to delete
   the specified MAC address from the specified VLAN ID if it was mapped into
   this port forwarding database
 
-- ``port_fdb_dump``: bridge layer function invoked with a switchdev callback
-  function that the driver has to call for each MAC address known to be behind
-  the given port. A switchdev object is used to carry the VID and FDB info.
-
-- ``port_mdb_prepare``: bridge layer function invoked when the bridge prepares the
-  installation of a multicast database entry. If the operation is not supported,
-  this function should return ``-EOPNOTSUPP`` to inform the bridge code to fallback
-  to a software implementation. No hardware setup must be done in this function.
-  See ``port_fdb_add`` for this and details.
+- ``port_fdb_dump``: bridge bypass function invoked by ``ndo_fdb_dump`` on the
+  physical DSA port interfaces. Since DSA does not attempt to keep in sync its
+  hardware FDB entries with the software bridge, this method is implemented as
+  a means to view the entries visible on user ports in the hardware database.
+  The entries reported by this function have the ``self`` flag in the output of
+  the ``bridge fdb show`` command.
 
 - ``port_mdb_add``: bridge layer function invoked when the bridge wants to install
-  a multicast database entry, the switch hardware should be programmed with the
+  a multicast database entry. The switch hardware should be programmed with the
   specified address in the specified VLAN ID in the forwarding database
   associated with this VLAN ID.
 
-.. note:: VLAN ID 0 corresponds to the port private database, which, in the context
-        of DSA, would be its port-based VLAN, used by the associated bridge device.
-
 - ``port_mdb_del``: bridge layer function invoked when the bridge wants to remove a
   multicast database entry, the switch hardware should be programmed to delete
   the specified MAC address from the specified VLAN ID if it was mapped into
   this port forwarding database.
 
-- ``port_mdb_dump``: bridge layer function invoked with a switchdev callback
-  function that the driver has to call for each MAC address known to be behind
-  the given port. A switchdev object is used to carry the VID and MDB info.
+Link aggregation
+----------------
+
+Link aggregation is implemented in the Linux networking stack by the bonding
+and team drivers, which are modeled as virtual, stackable network interfaces.
+DSA is capable of offloading a link aggregation group (LAG) to hardware that
+supports the feature, and supports bridging between physical ports and LAGs,
+as well as between LAGs. A bonding/team interface which holds multiple physical
+ports constitutes a logical port, although DSA has no explicit concept of a
+logical port at the moment. Due to this, events where a LAG joins/leaves a
+bridge are treated as if all individual physical ports that are members of that
+LAG join/leave the bridge. Switchdev port attributes (VLAN filtering, STP
+state, etc) and objects (VLANs, MDB entries) offloaded to a LAG as bridge port
+are treated similarly: DSA offloads the same switchdev object / port attribute
+on all members of the LAG. Static bridge FDB entries on a LAG are not yet
+supported, since the DSA driver API does not have the concept of a logical port
+ID.
+
+- ``port_lag_join``: function invoked when a given switch port is added to a
+  LAG. The driver may return ``-EOPNOTSUPP``, and in this case, DSA will fall
+  back to a software implementation where all traffic from this port is sent to
+  the CPU.
+- ``port_lag_leave``: function invoked when a given switch port leaves a LAG
+  and returns to operation as a standalone port.
+- ``port_lag_change``: function invoked when the link state of any member of
+  the LAG changes, and the hashing function needs rebalancing to only make use
+  of the subset of physical LAG member ports that are up.
+
+Drivers that benefit from having an ID associated with each offloaded LAG
+can optionally populate ``ds->num_lag_ids`` from the ``dsa_switch_ops::setup``
+method. The LAG ID associated with a bonding/team interface can then be
+retrieved by a DSA switch driver using the ``dsa_lag_id`` function.
+
+IEC 62439-2 (MRP)
+-----------------
+
+The Media Redundancy Protocol is a topology management protocol optimized for
+fast fault recovery time for ring networks, which has some components
+implemented as a function of the bridge driver. MRP uses management PDUs
+(Test, Topology, LinkDown/Up, Option) sent at a multicast destination MAC
+address range of 01:15:4e:00:00:0x and with an EtherType of 0x88e3.
+Depending on the node's role in the ring (MRM: Media Redundancy Manager,
+MRC: Media Redundancy Client, MRA: Media Redundancy Automanager), certain MRP
+PDUs might need to be terminated locally and others might need to be forwarded.
+An MRM might also benefit from offloading to hardware the creation and
+transmission of certain MRP PDUs (Test).
+
+Normally an MRP instance can be created on top of any network interface,
+however in the case of a device with an offloaded data path such as DSA, it is
+necessary for the hardware, even if it is not MRP-aware, to be able to extract
+the MRP PDUs from the fabric before the driver can proceed with the software
+implementation. DSA today has no driver which is MRP-aware, therefore it only
+listens for the bare minimum switchdev objects required for the software assist
+to work properly. The operations are detailed below.
+
+- ``port_mrp_add`` and ``port_mrp_del``: notifies driver when an MRP instance
+  with a certain ring ID, priority, primary port and secondary port is
+  created/deleted.
+- ``port_mrp_add_ring_role`` and ``port_mrp_del_ring_role``: function invoked
+  when an MRP instance changes ring roles between MRM or MRC. This affects
+  which MRP PDUs should be trapped to software and which should be autonomously
+  forwarded.
+
+IEC 62439-3 (HSR/PRP)
+---------------------
+
+The Parallel Redundancy Protocol (PRP) is a network redundancy protocol which
+works by duplicating and sequence numbering packets through two independent L2
+networks (which are unaware of the PRP tail tags carried in the packets), and
+eliminating the duplicates at the receiver. The High-availability Seamless
+Redundancy (HSR) protocol is similar in concept, except all nodes that carry
+the redundant traffic are aware of the fact that it is HSR-tagged (because HSR
+uses a header with an EtherType of 0x892f) and are physically connected in a
+ring topology. Both HSR and PRP use supervision frames for monitoring the
+health of the network and for discovery of other nodes.
+
+In Linux, both HSR and PRP are implemented in the hsr driver, which
+instantiates a virtual, stackable network interface with two member ports.
+The driver only implements the basic roles of DANH (Doubly Attached Node
+implementing HSR) and DANP (Doubly Attached Node implementing PRP); the roles
+of RedBox and QuadBox are not implemented (therefore, bridging a hsr network
+interface with a physical switch port does not produce the expected result).
+
+A driver which is able of offloading certain functions of a DANP or DANH should
+declare the corresponding netdev features as indicated by the documentation at
+``Documentation/networking/netdev-features.rst``. Additionally, the following
+methods must be implemented:
+
+- ``port_hsr_join``: function invoked when a given switch port is added to a
+  DANP/DANH. The driver may return ``-EOPNOTSUPP`` and in this case, DSA will
+  fall back to a software implementation where all traffic from this port is
+  sent to the CPU.
+- ``port_hsr_leave``: function invoked when a given switch port leaves a
+  DANP/DANH and returns to normal operation as a standalone port.
 
 TODO
 ====
@@ -576,12 +1131,3 @@ capable hardware, but does not enforce a strict switch device driver model. On
 the other DSA enforces a fairly strict device driver model, and deals with most
 of the switch specific. At some point we should envision a merger between these
 two subsystems and get the best of both worlds.
-
-Other hanging fruits
---------------------
-
-- making the number of ports fully dynamic and not dependent on ``DSA_MAX_PORTS``
-- allowing more than one CPU/management interface:
-  http://comments.gmane.org/gmane.linux.network/365657
-- porting more drivers from other vendors:
-  http://comments.gmane.org/gmane.linux.network/365510
diff --git a/Documentation/networking/dsa/lan9303.rst b/Documentation/networking/dsa/lan9303.rst
index e3c820db28ad..ab81b4e0139e 100644
--- a/Documentation/networking/dsa/lan9303.rst
+++ b/Documentation/networking/dsa/lan9303.rst
@@ -4,7 +4,7 @@ LAN9303 Ethernet switch driver
 
 The LAN9303 is a three port 10/100 Mbps ethernet switch with integrated phys for
 the two external ethernet ports. The third port is an RMII/MII interface to a
-host master network interface (e.g. fixed link).
+host conduit network interface (e.g. fixed link).
 
 
 Driver details
diff --git a/Documentation/networking/dsa/sja1105.rst b/Documentation/networking/dsa/sja1105.rst
index 7395a33baaf9..8ab60eef07d4 100644
--- a/Documentation/networking/dsa/sja1105.rst
+++ b/Documentation/networking/dsa/sja1105.rst
@@ -5,7 +5,7 @@ NXP SJA1105 switch driver
 Overview
 ========
 
-The NXP SJA1105 is a family of 6 devices:
+The NXP SJA1105 is a family of 10 SPI-managed automotive switches:
 
 - SJA1105E: First generation, no TTEthernet
 - SJA1105T: First generation, TTEthernet
@@ -13,9 +13,11 @@ The NXP SJA1105 is a family of 6 devices:
 - SJA1105Q: Second generation, TTEthernet, no SGMII
 - SJA1105R: Second generation, no TTEthernet, SGMII
 - SJA1105S: Second generation, TTEthernet, SGMII
-
-These are SPI-managed automotive switches, with all ports being gigabit
-capable, and supporting MII/RMII/RGMII and optionally SGMII on one port.
+- SJA1110A: Third generation, TTEthernet, SGMII, integrated 100base-T1 and
+  100base-TX PHYs
+- SJA1110B: Third generation, TTEthernet, SGMII, 100base-T1, 100base-TX
+- SJA1110C: Third generation, TTEthernet, SGMII, 100base-T1, 100base-TX
+- SJA1110D: Third generation, TTEthernet, SGMII, 100base-T1
 
 Being automotive parts, their configuration interface is geared towards
 set-and-forget use, with minimal dynamic interaction at runtime. They
@@ -63,199 +65,6 @@ If that changed setting can be transmitted to the switch through the dynamic
 reconfiguration interface, it is; otherwise the switch is reset and
 reprogrammed with the updated static configuration.
 
-Traffic support
-===============
-
-The switches do not have hardware support for DSA tags, except for "slow
-protocols" for switch control as STP and PTP. For these, the switches have two
-programmable filters for link-local destination MACs.
-These are used to trap BPDUs and PTP traffic to the master netdevice, and are
-further used to support STP and 1588 ordinary clock/boundary clock
-functionality. For frames trapped to the CPU, source port and switch ID
-information is encoded by the hardware into the frames.
-
-But by leveraging ``CONFIG_NET_DSA_TAG_8021Q`` (a software-defined DSA tagging
-format based on VLANs), general-purpose traffic termination through the network
-stack can be supported under certain circumstances.
-
-Depending on VLAN awareness state, the following operating modes are possible
-with the switch:
-
-- Mode 1 (VLAN-unaware): a port is in this mode when it is used as a standalone
-  net device, or when it is enslaved to a bridge with ``vlan_filtering=0``.
-- Mode 2 (fully VLAN-aware): a port is in this mode when it is enslaved to a
-  bridge with ``vlan_filtering=1``. Access to the entire VLAN range is given to
-  the user through ``bridge vlan`` commands, but general-purpose (anything
-  other than STP, PTP etc) traffic termination is not possible through the
-  switch net devices. The other packets can be still by user space processed
-  through the DSA master interface (similar to ``DSA_TAG_PROTO_NONE``).
-- Mode 3 (best-effort VLAN-aware): a port is in this mode when enslaved to a
-  bridge with ``vlan_filtering=1``, and the devlink property of its parent
-  switch named ``best_effort_vlan_filtering`` is set to ``true``. When
-  configured like this, the range of usable VIDs is reduced (0 to 1023 and 3072
-  to 4094), so is the number of usable VIDs (maximum of 7 non-pvid VLANs per
-  port*), and shared VLAN learning is performed (FDB lookup is done only by
-  DMAC, not also by VID).
-
-To summarize, in each mode, the following types of traffic are supported over
-the switch net devices:
-
-+-------------+-----------+--------------+------------+
-|             |   Mode 1  |    Mode 2    |   Mode 3   |
-+=============+===========+==============+============+
-|   Regular   |    Yes    | No           |     Yes    |
-|   traffic   |           | (use master) |            |
-+-------------+-----------+--------------+------------+
-| Management  |    Yes    |     Yes      |     Yes    |
-| traffic     |           |              |            |
-| (BPDU, PTP) |           |              |            |
-+-------------+-----------+--------------+------------+
-
-To configure the switch to operate in Mode 3, the following steps can be
-followed::
-
-  ip link add dev br0 type bridge
-  # swp2 operates in Mode 1 now
-  ip link set dev swp2 master br0
-  # swp2 temporarily moves to Mode 2
-  ip link set dev br0 type bridge vlan_filtering 1
-  [   61.204770] sja1105 spi0.1: Reset switch and programmed static config. Reason: VLAN filtering
-  [   61.239944] sja1105 spi0.1: Disabled switch tagging
-  # swp3 now operates in Mode 3
-  devlink dev param set spi/spi0.1 name best_effort_vlan_filtering value true cmode runtime
-  [   64.682927] sja1105 spi0.1: Reset switch and programmed static config. Reason: VLAN filtering
-  [   64.711925] sja1105 spi0.1: Enabled switch tagging
-  # Cannot use VLANs in range 1024-3071 while in Mode 3.
-  bridge vlan add dev swp2 vid 1025 untagged pvid
-  RTNETLINK answers: Operation not permitted
-  bridge vlan add dev swp2 vid 100
-  bridge vlan add dev swp2 vid 101 untagged
-  bridge vlan
-  port    vlan ids
-  swp5     1 PVID Egress Untagged
-
-  swp2     1 PVID Egress Untagged
-           100
-           101 Egress Untagged
-
-  swp3     1 PVID Egress Untagged
-
-  swp4     1 PVID Egress Untagged
-
-  br0      1 PVID Egress Untagged
-  bridge vlan add dev swp2 vid 102
-  bridge vlan add dev swp2 vid 103
-  bridge vlan add dev swp2 vid 104
-  bridge vlan add dev swp2 vid 105
-  bridge vlan add dev swp2 vid 106
-  bridge vlan add dev swp2 vid 107
-  # Cannot use mode than 7 VLANs per port while in Mode 3.
-  [ 3885.216832] sja1105 spi0.1: No more free subvlans
-
-\* "maximum of 7 non-pvid VLANs per port": Decoding VLAN-tagged packets on the
-CPU in mode 3 is possible through VLAN retagging of packets that go from the
-switch to the CPU. In cross-chip topologies, the port that goes to the CPU
-might also go to other switches. In that case, those other switches will see
-only a retagged packet (which only has meaning for the CPU). So if they are
-interested in this VLAN, they need to apply retagging in the reverse direction,
-to recover the original value from it. This consumes extra hardware resources
-for this switch. There is a maximum of 32 entries in the Retagging Table of
-each switch device.
-
-As an example, consider this cross-chip topology::
-
-  +-------------------------------------------------+
-  | Host SoC                                        |
-  |           +-------------------------+           |
-  |           | DSA master for embedded |           |
-  |           |   switch (non-sja1105)  |           |
-  |  +--------+-------------------------+--------+  |
-  |  |   embedded L2 switch                      |  |
-  |  |                                           |  |
-  |  |   +--------------+     +--------------+   |  |
-  |  |   |DSA master for|     |DSA master for|   |  |
-  |  |   |  SJA1105 1   |     |  SJA1105 2   |   |  |
-  +--+---+--------------+-----+--------------+---+--+
-
-  +-----------------------+ +-----------------------+
-  |   SJA1105 switch 1    | |   SJA1105 switch 2    |
-  +-----+-----+-----+-----+ +-----+-----+-----+-----+
-  |sw1p0|sw1p1|sw1p2|sw1p3| |sw2p0|sw2p1|sw2p2|sw2p3|
-  +-----+-----+-----+-----+ +-----+-----+-----+-----+
-
-To reach the CPU, SJA1105 switch 1 (spi/spi2.1) uses the same port as is uses
-to reach SJA1105 switch 2 (spi/spi2.2), which would be port 4 (not drawn).
-Similarly for SJA1105 switch 2.
-
-Also consider the following commands, that add VLAN 100 to every sja1105 user
-port::
-
-  devlink dev param set spi/spi2.1 name best_effort_vlan_filtering value true cmode runtime
-  devlink dev param set spi/spi2.2 name best_effort_vlan_filtering value true cmode runtime
-  ip link add dev br0 type bridge
-  for port in sw1p0 sw1p1 sw1p2 sw1p3 \
-              sw2p0 sw2p1 sw2p2 sw2p3; do
-      ip link set dev $port master br0
-  done
-  ip link set dev br0 type bridge vlan_filtering 1
-  for port in sw1p0 sw1p1 sw1p2 sw1p3 \
-              sw2p0 sw2p1 sw2p2; do
-      bridge vlan add dev $port vid 100
-  done
-  ip link add link br0 name br0.100 type vlan id 100 && ip link set dev br0.100 up
-  ip addr add 192.168.100.3/24 dev br0.100
-  bridge vlan add dev br0 vid 100 self
-
-  bridge vlan
-  port    vlan ids
-  sw1p0    1 PVID Egress Untagged
-           100
-
-  sw1p1    1 PVID Egress Untagged
-           100
-
-  sw1p2    1 PVID Egress Untagged
-           100
-
-  sw1p3    1 PVID Egress Untagged
-           100
-
-  sw2p0    1 PVID Egress Untagged
-           100
-
-  sw2p1    1 PVID Egress Untagged
-           100
-
-  sw2p2    1 PVID Egress Untagged
-           100
-
-  sw2p3    1 PVID Egress Untagged
-
-  br0      1 PVID Egress Untagged
-           100
-
-SJA1105 switch 1 consumes 1 retagging entry for each VLAN on each user port
-towards the CPU. It also consumes 1 retagging entry for each non-pvid VLAN that
-it is also interested in, which is configured on any port of any neighbor
-switch.
-
-In this case, SJA1105 switch 1 consumes a total of 11 retagging entries, as
-follows:
-
-- 8 retagging entries for VLANs 1 and 100 installed on its user ports
-  (``sw1p0`` - ``sw1p3``)
-- 3 retagging entries for VLAN 100 installed on the user ports of SJA1105
-  switch 2 (``sw2p0`` - ``sw2p2``), because it also has ports that are
-  interested in it. The VLAN 1 is a pvid on SJA1105 switch 2 and does not need
-  reverse retagging.
-
-SJA1105 switch 2 also consumes 11 retagging entries, but organized as follows:
-
-- 7 retagging entries for the bridge VLANs on its user ports (``sw2p0`` -
-  ``sw2p3``).
-- 4 retagging entries for VLAN 100 installed on the user ports of SJA1105
-  switch 1 (``sw1p0`` - ``sw1p3``).
-
 Switching features
 ==================
 
@@ -270,7 +79,7 @@ The hardware tags all traffic internally with a port-based VLAN (pvid), or it
 decodes the VLAN information from the 802.1Q tag. Advanced VLAN classification
 is not possible. Once attributed a VLAN tag, frames are checked against the
 port's membership rules and dropped at ingress if they don't match any VLAN.
-This behavior is available when switch ports are enslaved to a bridge with
+This behavior is available when switch ports join a bridge with
 ``vlan_filtering 1``.
 
 Normally the hardware is not configurable with respect to VLAN awareness, but
@@ -280,33 +89,10 @@ untagged), and therefore this mode is also supported.
 
 Segregating the switch ports in multiple bridges is supported (e.g. 2 + 2), but
 all bridges should have the same level of VLAN awareness (either both have
-``vlan_filtering`` 0, or both 1). Also an inevitable limitation of the fact
-that VLAN awareness is global at the switch level is that once a bridge with
-``vlan_filtering`` enslaves at least one switch port, the other un-bridged
-ports are no longer available for standalone traffic termination.
+``vlan_filtering`` 0, or both 1).
 
 Topology and loop detection through STP is supported.
 
-L2 FDB manipulation (add/delete/dump) is currently possible for the first
-generation devices. Aging time of FDB entries, as well as enabling fully static
-management (no address learning and no flooding of unknown traffic) is not yet
-configurable in the driver.
-
-A special comment about bridging with other netdevices (illustrated with an
-example):
-
-A board has eth0, eth1, swp0@eth1, swp1@eth1, swp2@eth1, swp3@eth1.
-The switch ports (swp0-3) are under br0.
-It is desired that eth0 is turned into another switched port that communicates
-with swp0-3.
-
-If br0 has vlan_filtering 0, then eth0 can simply be added to br0 with the
-intended results.
-If br0 has vlan_filtering 1, then a new br1 interface needs to be created that
-enslaves eth0 and eth1 (the DSA master of the switch ports). This is because in
-this mode, the switch ports beneath br0 are not capable of regular traffic, and
-are only used as a conduit for switchdev operations.
-
 Offloads
 ========
 
@@ -336,7 +122,7 @@ on egress. Using ``vlan_filtering=1``, the behavior is the other way around:
 offloaded flows can be steered to TX queues based on the VLAN PCP, but the DSA
 net devices are no longer able to do that. To inject frames into a hardware TX
 queue with VLAN awareness active, it is necessary to create a VLAN
-sub-interface on the DSA master port, and send normal (0x8100) VLAN-tagged
+sub-interface on the DSA conduit port, and send normal (0x8100) VLAN-tagged
 towards the switch, with the VLAN PCP bits set appropriately.
 
 Management traffic (having DMAC 01-80-C2-xx-xx-xx or 01-19-1B-xx-xx-xx) is the
@@ -507,10 +293,37 @@ of dropped frames, which is a sum of frames dropped due to timing violations,
 lack of destination ports and MTU enforcement checks). Byte-level counters are
 not available.
 
+Limitations
+===========
+
+The SJA1105 switch family always performs VLAN processing. When configured as
+VLAN-unaware, frames carry a different VLAN tag internally, depending on
+whether the port is standalone or under a VLAN-unaware bridge.
+
+The virtual link keys are always fixed at {MAC DA, VLAN ID, VLAN PCP}, but the
+driver asks for the VLAN ID and VLAN PCP when the port is under a VLAN-aware
+bridge. Otherwise, it fills in the VLAN ID and PCP automatically, based on
+whether the port is standalone or in a VLAN-unaware bridge, and accepts only
+"VLAN-unaware" tc-flower keys (MAC DA).
+
+The existing tc-flower keys that are offloaded using virtual links are no
+longer operational after one of the following happens:
+
+- port was standalone and joins a bridge (VLAN-aware or VLAN-unaware)
+- port is part of a bridge whose VLAN awareness state changes
+- port was part of a bridge and becomes standalone
+- port was standalone, but another port joins a VLAN-aware bridge and this
+  changes the global VLAN awareness state of the bridge
+
+The driver cannot veto all these operations, and it cannot update/remove the
+existing tc-flower filters either. So for proper operation, the tc-flower
+filters should be installed only after the forwarding configuration of the port
+has been made, and removed by user space before making any changes to it.
+
 Device Tree bindings and board design
 =====================================
 
-This section references ``Documentation/devicetree/bindings/net/dsa/sja1105.txt``
+This section references ``Documentation/devicetree/bindings/net/dsa/nxp,sja1105.yaml``
 and aims to showcase some potential switch caveats.
 
 RMII PHY role and out-of-band signaling
@@ -576,6 +389,57 @@ MDIO bus and PHY management
 The SJA1105 does not have an MDIO bus and does not perform in-band AN either.
 Therefore there is no link state notification coming from the switch device.
 A board would need to hook up the PHYs connected to the switch to any other
-MDIO bus available to Linux within the system (e.g. to the DSA master's MDIO
+MDIO bus available to Linux within the system (e.g. to the DSA conduit's MDIO
 bus). Link state management then works by the driver manually keeping in sync
 (over SPI commands) the MAC link speed with the settings negotiated by the PHY.
+
+By comparison, the SJA1110 supports an MDIO slave access point over which its
+internal 100base-T1 PHYs can be accessed from the host. This is, however, not
+used by the driver, instead the internal 100base-T1 and 100base-TX PHYs are
+accessed through SPI commands, modeled in Linux as virtual MDIO buses.
+
+The microcontroller attached to the SJA1110 port 0 also has an MDIO controller
+operating in master mode, however the driver does not support this either,
+since the microcontroller gets disabled when the Linux driver operates.
+Discrete PHYs connected to the switch ports should have their MDIO interface
+attached to an MDIO controller from the host system and not to the switch,
+similar to SJA1105.
+
+Port compatibility matrix
+-------------------------
+
+The SJA1105 port compatibility matrix is:
+
+===== ============== ============== ==============
+Port   SJA1105E/T     SJA1105P/Q     SJA1105R/S
+===== ============== ============== ==============
+0      xMII           xMII           xMII
+1      xMII           xMII           xMII
+2      xMII           xMII           xMII
+3      xMII           xMII           xMII
+4      xMII           xMII           SGMII
+===== ============== ============== ==============
+
+
+The SJA1110 port compatibility matrix is:
+
+===== ============== ============== ============== ==============
+Port   SJA1110A       SJA1110B       SJA1110C       SJA1110D
+===== ============== ============== ============== ==============
+0      RevMII (uC)    RevMII (uC)    RevMII (uC)    RevMII (uC)
+1      100base-TX     100base-TX     100base-TX
+       or SGMII                                     SGMII
+2      xMII           xMII           xMII           xMII
+       or SGMII                                     or SGMII
+3      xMII           xMII           xMII
+       or SGMII       or SGMII                      SGMII
+       or 2500base-X  or 2500base-X                 or 2500base-X
+4      SGMII          SGMII          SGMII          SGMII
+       or 2500base-X  or 2500base-X  or 2500base-X  or 2500base-X
+5      100base-T1     100base-T1     100base-T1     100base-T1
+6      100base-T1     100base-T1     100base-T1     100base-T1
+7      100base-T1     100base-T1     100base-T1     100base-T1
+8      100base-T1     100base-T1     n/a            n/a
+9      100base-T1     100base-T1     n/a            n/a
+10     100base-T1     n/a            n/a            n/a
+===== ============== ============== ============== ==============
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index 82470c36c27a..b6e9af4d0f1b 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -41,6 +41,11 @@ In the message structure descriptions below, if an attribute name is suffixed
 with "+", parent nest can contain multiple attributes of the same type. This
 implements an array of entries.
 
+Attributes that need to be filled-in by device drivers and that are dumped to
+user space based on whether they are valid or not should not use zero as a
+valid value. This avoids the need to explicitly signal the validity of the
+attribute in the device driver API.
+
 
 Request header
 ==============
@@ -52,6 +57,7 @@ Structure of this header is
   ``ETHTOOL_A_HEADER_DEV_INDEX``  u32     device ifindex
   ``ETHTOOL_A_HEADER_DEV_NAME``   string  device name
   ``ETHTOOL_A_HEADER_FLAGS``      u32     flags common for all requests
+  ``ETHTOOL_A_HEADER_PHY_INDEX``  u32     phy device index
   ==============================  ======  =============================
 
 ``ETHTOOL_A_HEADER_DEV_INDEX`` and ``ETHTOOL_A_HEADER_DEV_NAME`` identify the
@@ -68,6 +74,7 @@ the flags may not apply to requests. Recognized flags are:
   =================================  ===================================
   ``ETHTOOL_FLAG_COMPACT_BITSETS``   use compact format bitsets in reply
   ``ETHTOOL_FLAG_OMIT_REPLY``        omit optional reply (_SET and _ACT)
+  ``ETHTOOL_FLAG_STATS``             include optional device statistics
   =================================  ===================================
 
 New request flags should follow the general idea that if the flag is not set,
@@ -75,6 +82,12 @@ the behaviour is backward compatible, i.e. requests from old clients not aware
 of the flag should be interpreted the way the client expects. A client must
 not set flags it does not understand.
 
+``ETHTOOL_A_HEADER_PHY_INDEX`` identifies the Ethernet PHY the message relates to.
+As there are numerous commands that are related to PHY configuration, and because
+there may be more than one PHY on the link, the PHY index can be passed in the
+request for the commands that needs it. It is, however, not mandatory, and if it
+is not passed for commands that target a PHY, the net_device.phydev pointer
+is used.
 
 Bit sets
 ========
@@ -100,7 +113,7 @@ modifying a bitmap, the former changes the bit set in mask to values set in
 value and preserves the rest; the latter sets the bits set in the bitmap and
 clears the rest.
 
-Compact form: nested (bitset) atrribute contents:
+Compact form: nested (bitset) attribute contents:
 
   ============================  ======  ============================
   ``ETHTOOL_A_BITSET_NOMASK``   flag    no mask, only a list
@@ -178,7 +191,7 @@ according to message purpose:
 
 Userspace to kernel:
 
-  ===================================== ================================
+  ===================================== =================================
   ``ETHTOOL_MSG_STRSET_GET``            get string set
   ``ETHTOOL_MSG_LINKINFO_GET``          get link settings
   ``ETHTOOL_MSG_LINKINFO_SET``          set link settings
@@ -206,40 +219,78 @@ Userspace to kernel:
   ``ETHTOOL_MSG_TSINFO_GET``		get timestamping info
   ``ETHTOOL_MSG_CABLE_TEST_ACT``        action start cable test
   ``ETHTOOL_MSG_CABLE_TEST_TDR_ACT``    action start raw TDR cable test
-  ===================================== ================================
+  ``ETHTOOL_MSG_TUNNEL_INFO_GET``       get tunnel offload info
+  ``ETHTOOL_MSG_FEC_GET``               get FEC settings
+  ``ETHTOOL_MSG_FEC_SET``               set FEC settings
+  ``ETHTOOL_MSG_MODULE_EEPROM_GET``     read SFP module EEPROM
+  ``ETHTOOL_MSG_STATS_GET``             get standard statistics
+  ``ETHTOOL_MSG_PHC_VCLOCKS_GET``       get PHC virtual clocks info
+  ``ETHTOOL_MSG_MODULE_SET``            set transceiver module parameters
+  ``ETHTOOL_MSG_MODULE_GET``            get transceiver module parameters
+  ``ETHTOOL_MSG_PSE_SET``               set PSE parameters
+  ``ETHTOOL_MSG_PSE_GET``               get PSE parameters
+  ``ETHTOOL_MSG_RSS_GET``               get RSS settings
+  ``ETHTOOL_MSG_PLCA_GET_CFG``          get PLCA RS parameters
+  ``ETHTOOL_MSG_PLCA_SET_CFG``          set PLCA RS parameters
+  ``ETHTOOL_MSG_PLCA_GET_STATUS``       get PLCA RS status
+  ``ETHTOOL_MSG_MM_GET``                get MAC merge layer state
+  ``ETHTOOL_MSG_MM_SET``                set MAC merge layer parameters
+  ``ETHTOOL_MSG_MODULE_FW_FLASH_ACT``   flash transceiver module firmware
+  ``ETHTOOL_MSG_PHY_GET``               get Ethernet PHY information
+  ``ETHTOOL_MSG_TSCONFIG_GET``          get hw timestamping configuration
+  ``ETHTOOL_MSG_TSCONFIG_SET``          set hw timestamping configuration
+  ===================================== =================================
 
 Kernel to userspace:
 
-  ===================================== =================================
-  ``ETHTOOL_MSG_STRSET_GET_REPLY``      string set contents
-  ``ETHTOOL_MSG_LINKINFO_GET_REPLY``    link settings
-  ``ETHTOOL_MSG_LINKINFO_NTF``          link settings notification
-  ``ETHTOOL_MSG_LINKMODES_GET_REPLY``   link modes info
-  ``ETHTOOL_MSG_LINKMODES_NTF``         link modes notification
-  ``ETHTOOL_MSG_LINKSTATE_GET_REPLY``   link state info
-  ``ETHTOOL_MSG_DEBUG_GET_REPLY``       debugging settings
-  ``ETHTOOL_MSG_DEBUG_NTF``             debugging settings notification
-  ``ETHTOOL_MSG_WOL_GET_REPLY``         wake-on-lan settings
-  ``ETHTOOL_MSG_WOL_NTF``               wake-on-lan settings notification
-  ``ETHTOOL_MSG_FEATURES_GET_REPLY``    device features
-  ``ETHTOOL_MSG_FEATURES_SET_REPLY``    optional reply to FEATURES_SET
-  ``ETHTOOL_MSG_FEATURES_NTF``          netdev features notification
-  ``ETHTOOL_MSG_PRIVFLAGS_GET_REPLY``   private flags
-  ``ETHTOOL_MSG_PRIVFLAGS_NTF``         private flags
-  ``ETHTOOL_MSG_RINGS_GET_REPLY``       ring sizes
-  ``ETHTOOL_MSG_RINGS_NTF``             ring sizes
-  ``ETHTOOL_MSG_CHANNELS_GET_REPLY``    channel counts
-  ``ETHTOOL_MSG_CHANNELS_NTF``          channel counts
-  ``ETHTOOL_MSG_COALESCE_GET_REPLY``    coalescing parameters
-  ``ETHTOOL_MSG_COALESCE_NTF``          coalescing parameters
-  ``ETHTOOL_MSG_PAUSE_GET_REPLY``       pause parameters
-  ``ETHTOOL_MSG_PAUSE_NTF``             pause parameters
-  ``ETHTOOL_MSG_EEE_GET_REPLY``         EEE settings
-  ``ETHTOOL_MSG_EEE_NTF``               EEE settings
-  ``ETHTOOL_MSG_TSINFO_GET_REPLY``	timestamping info
-  ``ETHTOOL_MSG_CABLE_TEST_NTF``        Cable test results
-  ``ETHTOOL_MSG_CABLE_TEST_TDR_NTF``    Cable test TDR results
-  ===================================== =================================
+  ======================================== =================================
+  ``ETHTOOL_MSG_STRSET_GET_REPLY``         string set contents
+  ``ETHTOOL_MSG_LINKINFO_GET_REPLY``       link settings
+  ``ETHTOOL_MSG_LINKINFO_NTF``             link settings notification
+  ``ETHTOOL_MSG_LINKMODES_GET_REPLY``      link modes info
+  ``ETHTOOL_MSG_LINKMODES_NTF``            link modes notification
+  ``ETHTOOL_MSG_LINKSTATE_GET_REPLY``      link state info
+  ``ETHTOOL_MSG_DEBUG_GET_REPLY``          debugging settings
+  ``ETHTOOL_MSG_DEBUG_NTF``                debugging settings notification
+  ``ETHTOOL_MSG_WOL_GET_REPLY``            wake-on-lan settings
+  ``ETHTOOL_MSG_WOL_NTF``                  wake-on-lan settings notification
+  ``ETHTOOL_MSG_FEATURES_GET_REPLY``       device features
+  ``ETHTOOL_MSG_FEATURES_SET_REPLY``       optional reply to FEATURES_SET
+  ``ETHTOOL_MSG_FEATURES_NTF``             netdev features notification
+  ``ETHTOOL_MSG_PRIVFLAGS_GET_REPLY``      private flags
+  ``ETHTOOL_MSG_PRIVFLAGS_NTF``            private flags
+  ``ETHTOOL_MSG_RINGS_GET_REPLY``          ring sizes
+  ``ETHTOOL_MSG_RINGS_NTF``                ring sizes
+  ``ETHTOOL_MSG_CHANNELS_GET_REPLY``       channel counts
+  ``ETHTOOL_MSG_CHANNELS_NTF``             channel counts
+  ``ETHTOOL_MSG_COALESCE_GET_REPLY``       coalescing parameters
+  ``ETHTOOL_MSG_COALESCE_NTF``             coalescing parameters
+  ``ETHTOOL_MSG_PAUSE_GET_REPLY``          pause parameters
+  ``ETHTOOL_MSG_PAUSE_NTF``                pause parameters
+  ``ETHTOOL_MSG_EEE_GET_REPLY``            EEE settings
+  ``ETHTOOL_MSG_EEE_NTF``                  EEE settings
+  ``ETHTOOL_MSG_TSINFO_GET_REPLY``         timestamping info
+  ``ETHTOOL_MSG_CABLE_TEST_NTF``           Cable test results
+  ``ETHTOOL_MSG_CABLE_TEST_TDR_NTF``       Cable test TDR results
+  ``ETHTOOL_MSG_TUNNEL_INFO_GET_REPLY``    tunnel offload info
+  ``ETHTOOL_MSG_FEC_GET_REPLY``            FEC settings
+  ``ETHTOOL_MSG_FEC_NTF``                  FEC settings
+  ``ETHTOOL_MSG_MODULE_EEPROM_GET_REPLY``  read SFP module EEPROM
+  ``ETHTOOL_MSG_STATS_GET_REPLY``          standard statistics
+  ``ETHTOOL_MSG_PHC_VCLOCKS_GET_REPLY``    PHC virtual clocks info
+  ``ETHTOOL_MSG_MODULE_GET_REPLY``         transceiver module parameters
+  ``ETHTOOL_MSG_PSE_GET_REPLY``            PSE parameters
+  ``ETHTOOL_MSG_RSS_GET_REPLY``            RSS settings
+  ``ETHTOOL_MSG_PLCA_GET_CFG_REPLY``       PLCA RS parameters
+  ``ETHTOOL_MSG_PLCA_GET_STATUS_REPLY``    PLCA RS status
+  ``ETHTOOL_MSG_PLCA_NTF``                 PLCA RS parameters
+  ``ETHTOOL_MSG_MM_GET_REPLY``             MAC merge layer status
+  ``ETHTOOL_MSG_MODULE_FW_FLASH_NTF``      transceiver module flash updates
+  ``ETHTOOL_MSG_PHY_GET_REPLY``            Ethernet PHY information
+  ``ETHTOOL_MSG_PHY_NTF``                  Ethernet PHY information change
+  ``ETHTOOL_MSG_TSCONFIG_GET_REPLY``       hw timestamping configuration
+  ``ETHTOOL_MSG_TSCONFIG_SET_REPLY``       new hw timestamping configuration
+  ======================================== =================================
 
 ``GET`` requests are sent by userspace applications to retrieve device
 information. They usually do not contain any message specific attributes.
@@ -405,6 +456,7 @@ Kernel response contents:
   ``ETHTOOL_A_LINKMODES_DUPLEX``              u8      duplex mode
   ``ETHTOOL_A_LINKMODES_MASTER_SLAVE_CFG``    u8      Master/slave port mode
   ``ETHTOOL_A_LINKMODES_MASTER_SLAVE_STATE``  u8      Master/slave port state
+  ``ETHTOOL_A_LINKMODES_RATE_MATCHING``       u8      PHY rate matching
   ==========================================  ======  ==========================
 
 For ``ETHTOOL_A_LINKMODES_OURS``, value represents advertised modes and mask
@@ -428,25 +480,28 @@ Request contents:
   ``ETHTOOL_A_LINKMODES_SPEED``               u32     link speed (Mb/s)
   ``ETHTOOL_A_LINKMODES_DUPLEX``              u8      duplex mode
   ``ETHTOOL_A_LINKMODES_MASTER_SLAVE_CFG``    u8      Master/slave port mode
+  ``ETHTOOL_A_LINKMODES_RATE_MATCHING``       u8      PHY rate matching
+  ``ETHTOOL_A_LINKMODES_LANES``               u32     lanes
   ==========================================  ======  ==========================
 
 ``ETHTOOL_A_LINKMODES_OURS`` bit set allows setting advertised link modes. If
 autonegotiation is on (either set now or kept from before), advertised modes
 are not changed (no ``ETHTOOL_A_LINKMODES_OURS`` attribute) and at least one
-of speed and duplex is specified, kernel adjusts advertised modes to all
-supported modes matching speed, duplex or both (whatever is specified). This
-autoselection is done on ethtool side with ioctl interface, netlink interface
-is supposed to allow requesting changes without knowing what exactly kernel
-supports.
+of speed, duplex and lanes is specified, kernel adjusts advertised modes to all
+supported modes matching speed, duplex, lanes or all (whatever is specified).
+This autoselection is done on ethtool side with ioctl interface, netlink
+interface is supposed to allow requesting changes without knowing what exactly
+kernel supports.
 
 
 LINKSTATE_GET
 =============
 
-Requests link state information. At the moment, only link up/down flag (as
-provided by ``ETHTOOL_GLINK`` ioctl command) is provided but some future
-extensions are planned (e.g. link down reason). This request does not have any
-attributes.
+Requests link state information. Link up/down flag (as provided by
+``ETHTOOL_GLINK`` ioctl command) is provided. Optionally, extended state might
+be provided as well. In general, extended state describes reasons for why a port
+is down, or why it operates in some non-obvious mode. This request does not have
+any attributes.
 
 Request contents:
 
@@ -461,16 +516,154 @@ Kernel response contents:
   ``ETHTOOL_A_LINKSTATE_LINK``          bool    link state (up/down)
   ``ETHTOOL_A_LINKSTATE_SQI``           u32     Current Signal Quality Index
   ``ETHTOOL_A_LINKSTATE_SQI_MAX``       u32     Max support SQI value
+  ``ETHTOOL_A_LINKSTATE_EXT_STATE``     u8      link extended state
+  ``ETHTOOL_A_LINKSTATE_EXT_SUBSTATE``  u8      link extended substate
+  ``ETHTOOL_A_LINKSTATE_EXT_DOWN_CNT``  u32     count of link down events
   ====================================  ======  ============================
 
 For most NIC drivers, the value of ``ETHTOOL_A_LINKSTATE_LINK`` returns
 carrier flag provided by ``netif_carrier_ok()`` but there are drivers which
 define their own handler.
 
+``ETHTOOL_A_LINKSTATE_EXT_STATE`` and ``ETHTOOL_A_LINKSTATE_EXT_SUBSTATE`` are
+optional values. ethtool core can provide either both
+``ETHTOOL_A_LINKSTATE_EXT_STATE`` and ``ETHTOOL_A_LINKSTATE_EXT_SUBSTATE``,
+or only ``ETHTOOL_A_LINKSTATE_EXT_STATE``, or none of them.
+
 ``LINKSTATE_GET`` allows dump requests (kernel returns reply messages for all
 devices supporting the request).
 
 
+Link extended states:
+
+  ================================================      ============================================
+  ``ETHTOOL_LINK_EXT_STATE_AUTONEG``                    States relating to the autonegotiation or
+                                                        issues therein
+
+  ``ETHTOOL_LINK_EXT_STATE_LINK_TRAINING_FAILURE``      Failure during link training
+
+  ``ETHTOOL_LINK_EXT_STATE_LINK_LOGICAL_MISMATCH``      Logical mismatch in physical coding sublayer
+                                                        or forward error correction sublayer
+
+  ``ETHTOOL_LINK_EXT_STATE_BAD_SIGNAL_INTEGRITY``       Signal integrity issues
+
+  ``ETHTOOL_LINK_EXT_STATE_NO_CABLE``                   No cable connected
+
+  ``ETHTOOL_LINK_EXT_STATE_CABLE_ISSUE``                Failure is related to cable,
+                                                        e.g., unsupported cable
+
+  ``ETHTOOL_LINK_EXT_STATE_EEPROM_ISSUE``               Failure is related to EEPROM, e.g., failure
+                                                        during reading or parsing the data
+
+  ``ETHTOOL_LINK_EXT_STATE_CALIBRATION_FAILURE``        Failure during calibration algorithm
+
+  ``ETHTOOL_LINK_EXT_STATE_POWER_BUDGET_EXCEEDED``      The hardware is not able to provide the
+                                                        power required from cable or module
+
+  ``ETHTOOL_LINK_EXT_STATE_OVERHEAT``                   The module is overheated
+
+  ``ETHTOOL_LINK_EXT_STATE_MODULE``                     Transceiver module issue
+  ================================================      ============================================
+
+Link extended substates:
+
+  Autoneg substates:
+
+  ===============================================================   ================================
+  ``ETHTOOL_LINK_EXT_SUBSTATE_AN_NO_PARTNER_DETECTED``              Peer side is down
+
+  ``ETHTOOL_LINK_EXT_SUBSTATE_AN_ACK_NOT_RECEIVED``                 Ack not received from peer side
+
+  ``ETHTOOL_LINK_EXT_SUBSTATE_AN_NEXT_PAGE_EXCHANGE_FAILED``        Next page exchange failed
+
+  ``ETHTOOL_LINK_EXT_SUBSTATE_AN_NO_PARTNER_DETECTED_FORCE_MODE``   Peer side is down during force
+                                                                    mode or there is no agreement of
+                                                                    speed
+
+  ``ETHTOOL_LINK_EXT_SUBSTATE_AN_FEC_MISMATCH_DURING_OVERRIDE``     Forward error correction modes
+                                                                    in both sides are mismatched
+
+  ``ETHTOOL_LINK_EXT_SUBSTATE_AN_NO_HCD``                           No Highest Common Denominator
+  ===============================================================   ================================
+
+  Link training substates:
+
+  ===========================================================================   ====================
+  ``ETHTOOL_LINK_EXT_SUBSTATE_LT_KR_FRAME_LOCK_NOT_ACQUIRED``                    Frames were not
+                                                                                 recognized, the
+                                                                                 lock failed
+
+  ``ETHTOOL_LINK_EXT_SUBSTATE_LT_KR_LINK_INHIBIT_TIMEOUT``                       The lock did not
+                                                                                 occur before
+                                                                                 timeout
+
+  ``ETHTOOL_LINK_EXT_SUBSTATE_LT_KR_LINK_PARTNER_DID_NOT_SET_RECEIVER_READY``    Peer side did not
+                                                                                 send ready signal
+                                                                                 after training
+                                                                                 process
+
+  ``ETHTOOL_LINK_EXT_SUBSTATE_LT_REMOTE_FAULT``                                  Remote side is not
+                                                                                 ready yet
+  ===========================================================================   ====================
+
+  Link logical mismatch substates:
+
+  ================================================================   ===============================
+  ``ETHTOOL_LINK_EXT_SUBSTATE_LLM_PCS_DID_NOT_ACQUIRE_BLOCK_LOCK``   Physical coding sublayer was
+                                                                     not locked in first phase -
+                                                                     block lock
+
+  ``ETHTOOL_LINK_EXT_SUBSTATE_LLM_PCS_DID_NOT_ACQUIRE_AM_LOCK``      Physical coding sublayer was
+                                                                     not locked in second phase -
+                                                                     alignment markers lock
+
+  ``ETHTOOL_LINK_EXT_SUBSTATE_LLM_PCS_DID_NOT_GET_ALIGN_STATUS``     Physical coding sublayer did
+                                                                     not get align status
+
+  ``ETHTOOL_LINK_EXT_SUBSTATE_LLM_FC_FEC_IS_NOT_LOCKED``             FC forward error correction is
+                                                                     not locked
+
+  ``ETHTOOL_LINK_EXT_SUBSTATE_LLM_RS_FEC_IS_NOT_LOCKED``             RS forward error correction is
+                                                                     not locked
+  ================================================================   ===============================
+
+  Bad signal integrity substates:
+
+  =================================================================    =============================
+  ``ETHTOOL_LINK_EXT_SUBSTATE_BSI_LARGE_NUMBER_OF_PHYSICAL_ERRORS``    Large number of physical
+                                                                       errors
+
+  ``ETHTOOL_LINK_EXT_SUBSTATE_BSI_UNSUPPORTED_RATE``                   The system attempted to
+                                                                       operate the cable at a rate
+                                                                       that is not formally
+                                                                       supported, which led to
+                                                                       signal integrity issues
+
+  ``ETHTOOL_LINK_EXT_SUBSTATE_BSI_SERDES_REFERENCE_CLOCK_LOST``        The external clock signal for
+                                                                       SerDes is too weak or
+                                                                       unavailable.
+
+  ``ETHTOOL_LINK_EXT_SUBSTATE_BSI_SERDES_ALOS``                        The received signal for
+                                                                       SerDes is too weak because
+                                                                       analog loss of signal.
+  =================================================================    =============================
+
+  Cable issue substates:
+
+  ===================================================   ============================================
+  ``ETHTOOL_LINK_EXT_SUBSTATE_CI_UNSUPPORTED_CABLE``    Unsupported cable
+
+  ``ETHTOOL_LINK_EXT_SUBSTATE_CI_CABLE_TEST_FAILURE``   Cable test failure
+  ===================================================   ============================================
+
+  Transceiver module issue substates:
+
+  ===================================================   ============================================
+  ``ETHTOOL_LINK_EXT_SUBSTATE_MODULE_CMIS_NOT_READY``   The CMIS Module State Machine did not reach
+                                                        the ModuleReady state. For example, if the
+                                                        module is stuck at ModuleFault state
+  ===================================================   ============================================
+
 DEBUG_GET
 =========
 
@@ -612,7 +805,7 @@ Kernel response contents:
   ``ETHTOOL_A_FEATURES_ACTIVE``         bitset  diff old vs. new active
   ====================================  ======  ==========================
 
-Request constains only one bitset which can be either value/mask pair (request
+Request contains only one bitset which can be either value/mask pair (request
 to change specific feature bits and leave the rest) or only a value (request
 to set all features to specified set).
 
@@ -689,18 +882,54 @@ Request contents:
 
 Kernel response contents:
 
-  ====================================  ======  ==========================
-  ``ETHTOOL_A_RINGS_HEADER``            nested  reply header
-  ``ETHTOOL_A_RINGS_RX_MAX``            u32     max size of RX ring
-  ``ETHTOOL_A_RINGS_RX_MINI_MAX``       u32     max size of RX mini ring
-  ``ETHTOOL_A_RINGS_RX_JUMBO_MAX``      u32     max size of RX jumbo ring
-  ``ETHTOOL_A_RINGS_TX_MAX``            u32     max size of TX ring
-  ``ETHTOOL_A_RINGS_RX``                u32     size of RX ring
-  ``ETHTOOL_A_RINGS_RX_MINI``           u32     size of RX mini ring
-  ``ETHTOOL_A_RINGS_RX_JUMBO``          u32     size of RX jumbo ring
-  ``ETHTOOL_A_RINGS_TX``                u32     size of TX ring
-  ====================================  ======  ==========================
-
+  =======================================   ======  ===========================
+  ``ETHTOOL_A_RINGS_HEADER``                nested  reply header
+  ``ETHTOOL_A_RINGS_RX_MAX``                u32     max size of RX ring
+  ``ETHTOOL_A_RINGS_RX_MINI_MAX``           u32     max size of RX mini ring
+  ``ETHTOOL_A_RINGS_RX_JUMBO_MAX``          u32     max size of RX jumbo ring
+  ``ETHTOOL_A_RINGS_TX_MAX``                u32     max size of TX ring
+  ``ETHTOOL_A_RINGS_RX``                    u32     size of RX ring
+  ``ETHTOOL_A_RINGS_RX_MINI``               u32     size of RX mini ring
+  ``ETHTOOL_A_RINGS_RX_JUMBO``              u32     size of RX jumbo ring
+  ``ETHTOOL_A_RINGS_TX``                    u32     size of TX ring
+  ``ETHTOOL_A_RINGS_RX_BUF_LEN``            u32     size of buffers on the ring
+  ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT``        u8      TCP header / data split
+  ``ETHTOOL_A_RINGS_CQE_SIZE``              u32     Size of TX/RX CQE
+  ``ETHTOOL_A_RINGS_TX_PUSH``               u8      flag of TX Push mode
+  ``ETHTOOL_A_RINGS_RX_PUSH``               u8      flag of RX Push mode
+  ``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN``       u32     size of TX push buffer
+  ``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN_MAX``   u32     max size of TX push buffer
+  ``ETHTOOL_A_RINGS_HDS_THRESH``            u32     threshold of
+                                                    header / data split
+  ``ETHTOOL_A_RINGS_HDS_THRESH_MAX``        u32     max threshold of
+                                                    header / data split
+  =======================================   ======  ===========================
+
+``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` indicates whether the device is usable with
+page-flipping TCP zero-copy receive (``getsockopt(TCP_ZEROCOPY_RECEIVE)``).
+If enabled the device is configured to place frame headers and data into
+separate buffers. The device configuration must make it possible to receive
+full memory pages of data, for example because MTU is high enough or through
+HW-GRO.
+
+``ETHTOOL_A_RINGS_[RX|TX]_PUSH`` flag is used to enable descriptor fast
+path to send or receive packets. In ordinary path, driver fills descriptors in DRAM and
+notifies NIC hardware. In fast path, driver pushes descriptors to the device
+through MMIO writes, thus reducing the latency. However, enabling this feature
+may increase the CPU cost. Drivers may enforce additional per-packet
+eligibility checks (e.g. on packet size).
+
+``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN`` specifies the maximum number of bytes of a
+transmitted packet a driver can push directly to the underlying device
+('push' mode). Pushing some of the payload bytes to the device has the
+advantages of reducing latency for small packets by avoiding DMA mapping (same
+as ``ETHTOOL_A_RINGS_TX_PUSH`` parameter) as well as allowing the underlying
+device to process packet headers ahead of fetching its payload.
+This can help the device to make fast actions based on the packet's headers.
+This is similar to the "tx-copybreak" parameter, which copies the packet to a
+preallocated DMA memory area instead of mapping new memory. However,
+tx-push-buff parameter copies the packet directly to the device to allow the
+device to take faster actions on the packet.
 
 RINGS_SET
 =========
@@ -709,19 +938,39 @@ Sets ring sizes like ``ETHTOOL_SRINGPARAM`` ioctl request.
 
 Request contents:
 
-  ====================================  ======  ==========================
+  ====================================  ======  ===========================
   ``ETHTOOL_A_RINGS_HEADER``            nested  reply header
   ``ETHTOOL_A_RINGS_RX``                u32     size of RX ring
   ``ETHTOOL_A_RINGS_RX_MINI``           u32     size of RX mini ring
   ``ETHTOOL_A_RINGS_RX_JUMBO``          u32     size of RX jumbo ring
   ``ETHTOOL_A_RINGS_TX``                u32     size of TX ring
-  ====================================  ======  ==========================
+  ``ETHTOOL_A_RINGS_RX_BUF_LEN``        u32     size of buffers on the ring
+  ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT``    u8      TCP header / data split
+  ``ETHTOOL_A_RINGS_CQE_SIZE``          u32     Size of TX/RX CQE
+  ``ETHTOOL_A_RINGS_TX_PUSH``           u8      flag of TX Push mode
+  ``ETHTOOL_A_RINGS_RX_PUSH``           u8      flag of RX Push mode
+  ``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN``   u32     size of TX push buffer
+  ``ETHTOOL_A_RINGS_HDS_THRESH``        u32     threshold of header / data split
+  ====================================  ======  ===========================
 
 Kernel checks that requested ring sizes do not exceed limits reported by
-driver. Driver may impose additional constraints and may not suspport all
+driver. Driver may impose additional constraints and may not support all
 attributes.
 
 
+``ETHTOOL_A_RINGS_CQE_SIZE`` specifies the completion queue event size.
+Completion queue events (CQE) are the events posted by NIC to indicate the
+completion status of a packet when the packet is sent (like send success or
+error) or received (like pointers to packet fragments). The CQE size parameter
+enables to modify the CQE size other than default size if NIC supports it.
+A bigger CQE can have more receive buffer pointers, and in turn the NIC can
+transfer a bigger frame from wire. Based on the NIC hardware, the overall
+completion queue size can be adjusted in the driver if CQE size is modified.
+
+``ETHTOOL_A_RINGS_HDS_THRESH`` specifies the threshold value of
+header / data split feature. If a received packet size is larger than this
+threshold value, header and data will be split.
+
 CHANNELS_GET
 ============
 
@@ -764,7 +1013,7 @@ Request contents:
   =====================================  ======  ==========================
 
 Kernel checks that requested channel counts do not exceed limits reported by
-driver. Driver may impose additional constraints and may not suspport all
+driver. Driver may impose additional constraints and may not support all
 attributes.
 
 
@@ -805,12 +1054,45 @@ Kernel response contents:
   ``ETHTOOL_A_COALESCE_TX_USECS_HIGH``         u32     delay (us), high Tx
   ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES_HIGH``    u32     max packets, high Tx
   ``ETHTOOL_A_COALESCE_RATE_SAMPLE_INTERVAL``  u32     rate sampling interval
+  ``ETHTOOL_A_COALESCE_USE_CQE_TX``            bool    timer reset mode, Tx
+  ``ETHTOOL_A_COALESCE_USE_CQE_RX``            bool    timer reset mode, Rx
+  ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES``     u32     max aggr size, Tx
+  ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES``    u32     max aggr packets, Tx
+  ``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS``    u32     time (us), aggr, Tx
+  ``ETHTOOL_A_COALESCE_RX_PROFILE``            nested  profile of DIM, Rx
+  ``ETHTOOL_A_COALESCE_TX_PROFILE``            nested  profile of DIM, Tx
   ===========================================  ======  =======================
 
 Attributes are only included in reply if their value is not zero or the
 corresponding bit in ``ethtool_ops::supported_coalesce_params`` is set (i.e.
 they are declared as supported by driver).
 
+Timer reset mode (``ETHTOOL_A_COALESCE_USE_CQE_TX`` and
+``ETHTOOL_A_COALESCE_USE_CQE_RX``) controls the interaction between packet
+arrival and the various time based delay parameters. By default timers are
+expected to limit the max delay between any packet arrival/departure and a
+corresponding interrupt. In this mode timer should be started by packet
+arrival (sometimes delivery of previous interrupt) and reset when interrupt
+is delivered.
+Setting the appropriate attribute to 1 will enable ``CQE`` mode, where
+each packet event resets the timer. In this mode timer is used to force
+the interrupt if queue goes idle, while busy queues depend on the packet
+limit to trigger interrupts.
+
+Tx aggregation consists of copying frames into a contiguous buffer so that they
+can be submitted as a single IO operation. ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES``
+describes the maximum size in bytes for the submitted buffer.
+``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES`` describes the maximum number of frames
+that can be aggregated into a single buffer.
+``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` describes the amount of time in usecs,
+counted since the first packet arrival in an aggregated block, after which the
+block should be sent.
+This feature is mainly of interest for specific USB devices which does not cope
+well with frequent small-sized URBs transmissions.
+
+``ETHTOOL_A_COALESCE_RX_PROFILE`` and ``ETHTOOL_A_COALESCE_TX_PROFILE`` refer
+to DIM parameters, see `Generic Network Dynamic Interrupt Moderation (Net DIM)
+<https://www.kernel.org/doc/Documentation/networking/net_dim.rst>`_.
 
 COALESCE_SET
 ============
@@ -843,6 +1125,13 @@ Request contents:
   ``ETHTOOL_A_COALESCE_TX_USECS_HIGH``         u32     delay (us), high Tx
   ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES_HIGH``    u32     max packets, high Tx
   ``ETHTOOL_A_COALESCE_RATE_SAMPLE_INTERVAL``  u32     rate sampling interval
+  ``ETHTOOL_A_COALESCE_USE_CQE_TX``            bool    timer reset mode, Tx
+  ``ETHTOOL_A_COALESCE_USE_CQE_RX``            bool    timer reset mode, Rx
+  ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES``     u32     max aggr size, Tx
+  ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES``    u32     max aggr packets, Tx
+  ``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS``    u32     time (us), aggr, Tx
+  ``ETHTOOL_A_COALESCE_RX_PROFILE``            nested  profile of DIM, Rx
+  ``ETHTOOL_A_COALESCE_TX_PROFILE``            nested  profile of DIM, Tx
   ===========================================  ======  =======================
 
 Request is rejected if it attributes declared as unsupported by driver (i.e.
@@ -850,18 +1139,32 @@ such that the corresponding bit in ``ethtool_ops::supported_coalesce_params``
 is not set), regardless of their values. Driver may impose additional
 constraints on coalescing parameters and their values.
 
+Compared to requests issued via the ``ioctl()`` netlink version of this request
+will try harder to make sure that values specified by the user have been applied
+and may call the driver twice.
+
 
 PAUSE_GET
-============
+=========
 
-Gets channel counts like ``ETHTOOL_GPAUSE`` ioctl request.
+Gets pause frame settings like ``ETHTOOL_GPAUSEPARAM`` ioctl request.
 
 Request contents:
 
   =====================================  ======  ==========================
   ``ETHTOOL_A_PAUSE_HEADER``             nested  request header
+  ``ETHTOOL_A_PAUSE_STATS_SRC``          u32     source of statistics
   =====================================  ======  ==========================
 
+``ETHTOOL_A_PAUSE_STATS_SRC`` is optional. It takes values from:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+    :identifiers: ethtool_mac_stats_src
+
+If absent from the request, stats will be provided with
+an ``ETHTOOL_A_PAUSE_STATS_SRC`` attribute in the response equal to
+``ETHTOOL_MAC_STATS_SRC_AGGREGATE``.
+
 Kernel response contents:
 
   =====================================  ======  ==========================
@@ -869,11 +1172,21 @@ Kernel response contents:
   ``ETHTOOL_A_PAUSE_AUTONEG``            bool    pause autonegotiation
   ``ETHTOOL_A_PAUSE_RX``                 bool    receive pause frames
   ``ETHTOOL_A_PAUSE_TX``                 bool    transmit pause frames
+  ``ETHTOOL_A_PAUSE_STATS``              nested  pause statistics
   =====================================  ======  ==========================
 
+``ETHTOOL_A_PAUSE_STATS`` are reported if ``ETHTOOL_FLAG_STATS`` was set
+in ``ETHTOOL_A_HEADER_FLAGS``.
+It will be empty if driver did not report any statistics. Drivers fill in
+the statistics in the following structure:
+
+.. kernel-doc:: include/linux/ethtool.h
+    :identifiers: ethtool_pause_stats
+
+Each member has a corresponding attribute defined.
 
 PAUSE_SET
-============
+=========
 
 Sets pause parameters like ``ETHTOOL_GPAUSEPARAM`` ioctl request.
 
@@ -890,7 +1203,7 @@ Request contents:
 EEE_GET
 =======
 
-Gets channel counts like ``ETHTOOL_GEEE`` ioctl request.
+Gets Energy Efficient Ethernet settings like ``ETHTOOL_GEEE`` ioctl request.
 
 Request contents:
 
@@ -920,7 +1233,7 @@ first 32 are provided by the ``ethtool_ops`` callback.
 EEE_SET
 =======
 
-Sets pause parameters like ``ETHTOOL_GEEEPARAM`` ioctl request.
+Sets Energy Efficient Ethernet parameters like ``ETHTOOL_SEEE`` ioctl request.
 
 Request contents:
 
@@ -946,9 +1259,10 @@ Gets timestamping information like ``ETHTOOL_GET_TS_INFO`` ioctl request.
 
 Request contents:
 
-  =====================================  ======  ==========================
-  ``ETHTOOL_A_TSINFO_HEADER``            nested  request header
-  =====================================  ======  ==========================
+  ========================================  ======  ============================
+  ``ETHTOOL_A_TSINFO_HEADER``               nested  request header
+  ``ETHTOOL_A_TSINFO_HWTSTAMP_PROVIDER``    nested  PTP hw clock provider
+  ========================================  ======  ============================
 
 Kernel response contents:
 
@@ -958,12 +1272,27 @@ Kernel response contents:
   ``ETHTOOL_A_TSINFO_TX_TYPES``          bitset  supported Tx types
   ``ETHTOOL_A_TSINFO_RX_FILTERS``        bitset  supported Rx filters
   ``ETHTOOL_A_TSINFO_PHC_INDEX``         u32     PTP hw clock index
+  ``ETHTOOL_A_TSINFO_STATS``             nested  HW timestamping statistics
   =====================================  ======  ==========================
 
 ``ETHTOOL_A_TSINFO_PHC_INDEX`` is absent if there is no associated PHC (there
 is no special value for this case). The bitset attributes are omitted if they
 would be empty (no bit set).
 
+Additional hardware timestamping statistics response contents:
+
+  ==================================================  ======  =====================
+  ``ETHTOOL_A_TS_STAT_TX_PKTS``                       uint    Packets with Tx
+                                                              HW timestamps
+  ``ETHTOOL_A_TS_STAT_TX_LOST``                       uint    Tx HW timestamp
+                                                              not arrived count
+  ``ETHTOOL_A_TS_STAT_TX_ERR``                        uint    HW error request
+                                                              Tx timestamp count
+  ``ETHTOOL_A_TS_STAT_TX_ONESTEP_PKTS_UNCONFIRMED``   uint    Packets with one-step
+                                                              HW TX timestamps with
+                                                              unconfirmed delivery
+  ==================================================  ======  =====================
+
 CABLE_TEST
 ==========
 
@@ -1009,12 +1338,17 @@ information.
  +-+-+-----------------------------------------+--------+---------------------+
  | | | ``ETHTOOL_A_CABLE_RESULTS_CODE``        | u8     | result code         |
  +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_CABLE_RESULT_SRC``          | u32    | information source  |
+ +-+-+-----------------------------------------+--------+---------------------+
  | | ``ETHTOOL_A_CABLE_NEST_FAULT_LENGTH``     | nested | cable length        |
  +-+-+-----------------------------------------+--------+---------------------+
  | | | ``ETHTOOL_A_CABLE_FAULT_LENGTH_PAIR``   | u8     | pair number         |
  +-+-+-----------------------------------------+--------+---------------------+
  | | | ``ETHTOOL_A_CABLE_FAULT_LENGTH_CM``     | u32    | length in cm        |
  +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_CABLE_FAULT_LENGTH_SRC``    | u32    | information source  |
+ +-+-+-----------------------------------------+--------+---------------------+
+
 
 CABLE_TEST TDR
 ==============
@@ -1110,6 +1444,896 @@ used to report the amplitude of the reflection for a given pair.
  | | | ``ETHTOOL_A_CABLE_AMPLITUDE_mV``        | s16    | Reflection amplitude |
  +-+-+-----------------------------------------+--------+----------------------+
 
+TUNNEL_INFO
+===========
+
+Gets information about the tunnel state NIC is aware of.
+
+Request contents:
+
+  =====================================  ======  ==========================
+  ``ETHTOOL_A_TUNNEL_INFO_HEADER``       nested  request header
+  =====================================  ======  ==========================
+
+Kernel response contents:
+
+ +---------------------------------------------+--------+---------------------+
+ | ``ETHTOOL_A_TUNNEL_INFO_HEADER``            | nested | reply header        |
+ +---------------------------------------------+--------+---------------------+
+ | ``ETHTOOL_A_TUNNEL_INFO_UDP_PORTS``         | nested | all UDP port tables |
+ +-+-------------------------------------------+--------+---------------------+
+ | | ``ETHTOOL_A_TUNNEL_UDP_TABLE``            | nested | one UDP port table  |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_TUNNEL_UDP_TABLE_SIZE``     | u32    | max size of the     |
+ | | |                                         |        | table               |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_TUNNEL_UDP_TABLE_TYPES``    | bitset | tunnel types which  |
+ | | |                                         |        | table can hold      |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_TUNNEL_UDP_TABLE_ENTRY``    | nested | offloaded UDP port  |
+ +-+-+-+---------------------------------------+--------+---------------------+
+ | | | | ``ETHTOOL_A_TUNNEL_UDP_ENTRY_PORT``   | be16   | UDP port            |
+ +-+-+-+---------------------------------------+--------+---------------------+
+ | | | | ``ETHTOOL_A_TUNNEL_UDP_ENTRY_TYPE``   | u32    | tunnel type         |
+ +-+-+-+---------------------------------------+--------+---------------------+
+
+For UDP tunnel table empty ``ETHTOOL_A_TUNNEL_UDP_TABLE_TYPES`` indicates that
+the table contains static entries, hard-coded by the NIC.
+
+FEC_GET
+=======
+
+Gets FEC configuration and state like ``ETHTOOL_GFECPARAM`` ioctl request.
+
+Request contents:
+
+  =====================================  ======  ==========================
+  ``ETHTOOL_A_FEC_HEADER``               nested  request header
+  =====================================  ======  ==========================
+
+Kernel response contents:
+
+  =====================================  ======  ==========================
+  ``ETHTOOL_A_FEC_HEADER``               nested  request header
+  ``ETHTOOL_A_FEC_MODES``                bitset  configured modes
+  ``ETHTOOL_A_FEC_AUTO``                 bool    FEC mode auto selection
+  ``ETHTOOL_A_FEC_ACTIVE``               u32     index of active FEC mode
+  ``ETHTOOL_A_FEC_STATS``                nested  FEC statistics
+  =====================================  ======  ==========================
+
+``ETHTOOL_A_FEC_ACTIVE`` is the bit index of the FEC link mode currently
+active on the interface. This attribute may not be present if device does
+not support FEC.
+
+``ETHTOOL_A_FEC_MODES`` and ``ETHTOOL_A_FEC_AUTO`` are only meaningful when
+autonegotiation is disabled. If ``ETHTOOL_A_FEC_AUTO`` is non-zero driver will
+select the FEC mode automatically based on the parameters of the SFP module.
+This is equivalent to the ``ETHTOOL_FEC_AUTO`` bit of the ioctl interface.
+``ETHTOOL_A_FEC_MODES`` carry the current FEC configuration using link mode
+bits (rather than old ``ETHTOOL_FEC_*`` bits).
+
+``ETHTOOL_A_FEC_STATS`` are reported if ``ETHTOOL_FLAG_STATS`` was set in
+``ETHTOOL_A_HEADER_FLAGS``.
+Each attribute carries an array of 64bit statistics. First entry in the array
+contains the total number of events on the port, while the following entries
+are counters corresponding to lanes/PCS instances. The number of entries in
+the array will be:
+
++--------------+---------------------------------------------+
+| `0`          | device does not support FEC statistics      |
++--------------+---------------------------------------------+
+| `1`          | device does not support per-lane break down |
++--------------+---------------------------------------------+
+| `1 + #lanes` | device has full support for FEC stats       |
++--------------+---------------------------------------------+
+
+Drivers fill in the statistics in the following structure:
+
+.. kernel-doc:: include/linux/ethtool.h
+    :identifiers: ethtool_fec_stats
+
+FEC_SET
+=======
+
+Sets FEC parameters like ``ETHTOOL_SFECPARAM`` ioctl request.
+
+Request contents:
+
+  =====================================  ======  ==========================
+  ``ETHTOOL_A_FEC_HEADER``               nested  request header
+  ``ETHTOOL_A_FEC_MODES``                bitset  configured modes
+  ``ETHTOOL_A_FEC_AUTO``                 bool    FEC mode auto selection
+  =====================================  ======  ==========================
+
+``FEC_SET`` is only meaningful when autonegotiation is disabled. Otherwise
+FEC mode is selected as part of autonegotiation.
+
+``ETHTOOL_A_FEC_MODES`` selects which FEC mode should be used. It's recommended
+to set only one bit, if multiple bits are set driver may choose between them
+in an implementation specific way.
+
+``ETHTOOL_A_FEC_AUTO`` requests the driver to choose FEC mode based on SFP
+module parameters. This does not mean autonegotiation.
+
+MODULE_EEPROM_GET
+=================
+
+Fetch module EEPROM data dump.
+This interface is designed to allow dumps of at most 1/2 page at once. This
+means only dumps of 128 (or less) bytes are allowed, without crossing half page
+boundary located at offset 128. For pages other than 0 only high 128 bytes are
+accessible.
+
+Request contents:
+
+  =======================================  ======  ==========================
+  ``ETHTOOL_A_MODULE_EEPROM_HEADER``       nested  request header
+  ``ETHTOOL_A_MODULE_EEPROM_OFFSET``       u32     offset within a page
+  ``ETHTOOL_A_MODULE_EEPROM_LENGTH``       u32     amount of bytes to read
+  ``ETHTOOL_A_MODULE_EEPROM_PAGE``         u8      page number
+  ``ETHTOOL_A_MODULE_EEPROM_BANK``         u8      bank number
+  ``ETHTOOL_A_MODULE_EEPROM_I2C_ADDRESS``  u8      page I2C address
+  =======================================  ======  ==========================
+
+If ``ETHTOOL_A_MODULE_EEPROM_BANK`` is not specified, bank 0 is assumed.
+
+Kernel response contents:
+
+ +---------------------------------------------+--------+---------------------+
+ | ``ETHTOOL_A_MODULE_EEPROM_HEADER``          | nested | reply header        |
+ +---------------------------------------------+--------+---------------------+
+ | ``ETHTOOL_A_MODULE_EEPROM_DATA``            | binary | array of bytes from |
+ |                                             |        | module EEPROM       |
+ +---------------------------------------------+--------+---------------------+
+
+``ETHTOOL_A_MODULE_EEPROM_DATA`` has an attribute length equal to the amount of
+bytes driver actually read.
+
+STATS_GET
+=========
+
+Get standard statistics for the interface. Note that this is not
+a re-implementation of ``ETHTOOL_GSTATS`` which exposed driver-defined
+stats.
+
+Request contents:
+
+  =======================================  ======  ==========================
+  ``ETHTOOL_A_STATS_HEADER``               nested  request header
+  ``ETHTOOL_A_STATS_SRC``                  u32     source of statistics
+  ``ETHTOOL_A_STATS_GROUPS``               bitset  requested groups of stats
+  =======================================  ======  ==========================
+
+Kernel response contents:
+
+ +-----------------------------------+--------+--------------------------------+
+ | ``ETHTOOL_A_STATS_HEADER``        | nested | reply header                   |
+ +-----------------------------------+--------+--------------------------------+
+ | ``ETHTOOL_A_STATS_SRC``           | u32    | source of statistics           |
+ +-----------------------------------+--------+--------------------------------+
+ | ``ETHTOOL_A_STATS_GRP``           | nested | one or more group of stats     |
+ +-+---------------------------------+--------+--------------------------------+
+ | | ``ETHTOOL_A_STATS_GRP_ID``      | u32    | group ID - ``ETHTOOL_STATS_*`` |
+ +-+---------------------------------+--------+--------------------------------+
+ | | ``ETHTOOL_A_STATS_GRP_SS_ID``   | u32    | string set ID for names        |
+ +-+---------------------------------+--------+--------------------------------+
+ | | ``ETHTOOL_A_STATS_GRP_STAT``    | nested | nest containing a statistic    |
+ +-+---------------------------------+--------+--------------------------------+
+ | | ``ETHTOOL_A_STATS_GRP_HIST_RX`` | nested | histogram statistic (Rx)       |
+ +-+---------------------------------+--------+--------------------------------+
+ | | ``ETHTOOL_A_STATS_GRP_HIST_TX`` | nested | histogram statistic (Tx)       |
+ +-+---------------------------------+--------+--------------------------------+
+
+Users specify which groups of statistics they are requesting via
+the ``ETHTOOL_A_STATS_GROUPS`` bitset. Currently defined values are:
+
+ ====================== ======== ===============================================
+ ETHTOOL_STATS_ETH_MAC  eth-mac  Basic IEEE 802.3 MAC statistics (30.3.1.1.*)
+ ETHTOOL_STATS_ETH_PHY  eth-phy  Basic IEEE 802.3 PHY statistics (30.3.2.1.*)
+ ETHTOOL_STATS_ETH_CTRL eth-ctrl Basic IEEE 802.3 MAC Ctrl statistics (30.3.3.*)
+ ETHTOOL_STATS_RMON     rmon     RMON (RFC 2819) statistics
+ ETHTOOL_STATS_PHY      phy      Additional PHY statistics, not defined by IEEE
+ ====================== ======== ===============================================
+
+Each group should have a corresponding ``ETHTOOL_A_STATS_GRP`` in the reply.
+``ETHTOOL_A_STATS_GRP_ID`` identifies which group's statistics nest contains.
+``ETHTOOL_A_STATS_GRP_SS_ID`` identifies the string set ID for the names of
+the statistics in the group, if available.
+
+Statistics are added to the ``ETHTOOL_A_STATS_GRP`` nest under
+``ETHTOOL_A_STATS_GRP_STAT``. ``ETHTOOL_A_STATS_GRP_STAT`` should contain
+single 8 byte (u64) attribute inside - the type of that attribute is
+the statistic ID and the value is the value of the statistic.
+Each group has its own interpretation of statistic IDs.
+Attribute IDs correspond to strings from the string set identified
+by ``ETHTOOL_A_STATS_GRP_SS_ID``. Complex statistics (such as RMON histogram
+entries) are also listed inside ``ETHTOOL_A_STATS_GRP`` and do not have
+a string defined in the string set.
+
+RMON "histogram" counters count number of packets within given size range.
+Because RFC does not specify the ranges beyond the standard 1518 MTU devices
+differ in definition of buckets. For this reason the definition of packet ranges
+is left to each driver.
+
+``ETHTOOL_A_STATS_GRP_HIST_RX`` and ``ETHTOOL_A_STATS_GRP_HIST_TX`` nests
+contain the following attributes:
+
+ ================================= ====== ===================================
+ ETHTOOL_A_STATS_RMON_HIST_BKT_LOW u32    low bound of the packet size bucket
+ ETHTOOL_A_STATS_RMON_HIST_BKT_HI  u32    high bound of the bucket
+ ETHTOOL_A_STATS_RMON_HIST_VAL     u64    packet counter
+ ================================= ====== ===================================
+
+Low and high bounds are inclusive, for example:
+
+ ============================= ==== ====
+ RFC statistic                 low  high
+ ============================= ==== ====
+ etherStatsPkts64Octets          0    64
+ etherStatsPkts512to1023Octets 512  1023
+ ============================= ==== ====
+
+``ETHTOOL_A_STATS_SRC`` is optional. Similar to ``PAUSE_GET``, it takes values
+from ``enum ethtool_mac_stats_src``. If absent from the request, stats will be
+provided with an ``ETHTOOL_A_STATS_SRC`` attribute in the response equal to
+``ETHTOOL_MAC_STATS_SRC_AGGREGATE``.
+
+PHC_VCLOCKS_GET
+===============
+
+Query device PHC virtual clocks information.
+
+Request contents:
+
+  ====================================  ======  ==========================
+  ``ETHTOOL_A_PHC_VCLOCKS_HEADER``      nested  request header
+  ====================================  ======  ==========================
+
+Kernel response contents:
+
+  ====================================  ======  ==========================
+  ``ETHTOOL_A_PHC_VCLOCKS_HEADER``      nested  reply header
+  ``ETHTOOL_A_PHC_VCLOCKS_NUM``         u32     PHC virtual clocks number
+  ``ETHTOOL_A_PHC_VCLOCKS_INDEX``       s32     PHC index array
+  ====================================  ======  ==========================
+
+MODULE_GET
+==========
+
+Gets transceiver module parameters.
+
+Request contents:
+
+  =====================================  ======  ==========================
+  ``ETHTOOL_A_MODULE_HEADER``            nested  request header
+  =====================================  ======  ==========================
+
+Kernel response contents:
+
+  ======================================  ======  ==========================
+  ``ETHTOOL_A_MODULE_HEADER``             nested  reply header
+  ``ETHTOOL_A_MODULE_POWER_MODE_POLICY``  u8      power mode policy
+  ``ETHTOOL_A_MODULE_POWER_MODE``         u8      operational power mode
+  ======================================  ======  ==========================
+
+The optional ``ETHTOOL_A_MODULE_POWER_MODE_POLICY`` attribute encodes the
+transceiver module power mode policy enforced by the host. The default policy
+is driver-dependent, but "auto" is the recommended default and it should be
+implemented by new drivers and drivers where conformance to a legacy behavior
+is not critical.
+
+The optional ``ETHTHOOL_A_MODULE_POWER_MODE`` attribute encodes the operational
+power mode policy of the transceiver module. It is only reported when a module
+is plugged-in. Possible values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+    :identifiers: ethtool_module_power_mode
+
+MODULE_SET
+==========
+
+Sets transceiver module parameters.
+
+Request contents:
+
+  ======================================  ======  ==========================
+  ``ETHTOOL_A_MODULE_HEADER``             nested  request header
+  ``ETHTOOL_A_MODULE_POWER_MODE_POLICY``  u8      power mode policy
+  ======================================  ======  ==========================
+
+When set, the optional ``ETHTOOL_A_MODULE_POWER_MODE_POLICY`` attribute is used
+to set the transceiver module power policy enforced by the host. Possible
+values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+    :identifiers: ethtool_module_power_mode_policy
+
+For SFF-8636 modules, low power mode is forced by the host according to table
+6-10 in revision 2.10a of the specification.
+
+For CMIS modules, low power mode is forced by the host according to table 6-12
+in revision 5.0 of the specification.
+
+PSE_GET
+=======
+
+Gets PSE attributes.
+
+Request contents:
+
+  =====================================  ======  ==========================
+  ``ETHTOOL_A_PSE_HEADER``               nested  request header
+  =====================================  ======  ==========================
+
+Kernel response contents:
+
+  ==========================================  ======  =============================
+  ``ETHTOOL_A_PSE_HEADER``                    nested  reply header
+  ``ETHTOOL_A_PODL_PSE_ADMIN_STATE``             u32  Operational state of the PoDL
+                                                      PSE functions
+  ``ETHTOOL_A_PODL_PSE_PW_D_STATUS``             u32  power detection status of the
+                                                      PoDL PSE.
+  ``ETHTOOL_A_C33_PSE_ADMIN_STATE``              u32  Operational state of the PoE
+                                                      PSE functions.
+  ``ETHTOOL_A_C33_PSE_PW_D_STATUS``              u32  power detection status of the
+                                                      PoE PSE.
+  ``ETHTOOL_A_C33_PSE_PW_CLASS``                 u32  power class of the PoE PSE.
+  ``ETHTOOL_A_C33_PSE_ACTUAL_PW``                u32  actual power drawn on the
+                                                      PoE PSE.
+  ``ETHTOOL_A_C33_PSE_EXT_STATE``                u32  power extended state of the
+                                                      PoE PSE.
+  ``ETHTOOL_A_C33_PSE_EXT_SUBSTATE``             u32  power extended substatus of
+                                                      the PoE PSE.
+  ``ETHTOOL_A_C33_PSE_AVAIL_PW_LIMIT``           u32  currently configured power
+                                                      limit of the PoE PSE.
+  ``ETHTOOL_A_C33_PSE_PW_LIMIT_RANGES``       nested  Supported power limit
+                                                      configuration ranges.
+  ==========================================  ======  =============================
+
+When set, the optional ``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` attribute identifies
+the operational state of the PoDL PSE functions.  The operational state of the
+PSE function can be changed using the ``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL``
+action. This attribute corresponds to ``IEEE 802.3-2018`` 30.15.1.1.2
+aPoDLPSEAdminState. Possible values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+    :identifiers: ethtool_podl_pse_admin_state
+
+The same goes for ``ETHTOOL_A_C33_PSE_ADMIN_STATE`` implementing
+``IEEE 802.3-2022`` 30.9.1.1.2 aPSEAdminState.
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+    :identifiers: ethtool_c33_pse_admin_state
+
+When set, the optional ``ETHTOOL_A_PODL_PSE_PW_D_STATUS`` attribute identifies
+the power detection status of the PoDL PSE.  The status depend on internal PSE
+state machine and automatic PD classification support. This attribute
+corresponds to ``IEEE 802.3-2018`` 30.15.1.1.3 aPoDLPSEPowerDetectionStatus.
+Possible values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+    :identifiers: ethtool_podl_pse_pw_d_status
+
+The same goes for ``ETHTOOL_A_C33_PSE_ADMIN_PW_D_STATUS`` implementing
+``IEEE 802.3-2022`` 30.9.1.1.5 aPSEPowerDetectionStatus.
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+    :identifiers: ethtool_c33_pse_pw_d_status
+
+When set, the optional ``ETHTOOL_A_C33_PSE_PW_CLASS`` attribute identifies
+the power class of the C33 PSE. It depends on the class negotiated between
+the PSE and the PD. This attribute corresponds to ``IEEE 802.3-2022``
+30.9.1.1.8 aPSEPowerClassification.
+
+When set, the optional ``ETHTOOL_A_C33_PSE_ACTUAL_PW`` attribute identifies
+the actual power drawn by the C33 PSE. This attribute corresponds to
+``IEEE 802.3-2022`` 30.9.1.1.23 aPSEActualPower. Actual power is reported
+in mW.
+
+When set, the optional ``ETHTOOL_A_C33_PSE_EXT_STATE`` attribute identifies
+the extended error state of the C33 PSE. Possible values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+    :identifiers: ethtool_c33_pse_ext_state
+
+When set, the optional ``ETHTOOL_A_C33_PSE_EXT_SUBSTATE`` attribute identifies
+the extended error state of the C33 PSE. Possible values are:
+Possible values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+    :identifiers: ethtool_c33_pse_ext_substate_class_num_events
+		  ethtool_c33_pse_ext_substate_error_condition
+		  ethtool_c33_pse_ext_substate_mr_pse_enable
+		  ethtool_c33_pse_ext_substate_option_detect_ted
+		  ethtool_c33_pse_ext_substate_option_vport_lim
+		  ethtool_c33_pse_ext_substate_ovld_detected
+		  ethtool_c33_pse_ext_substate_pd_dll_power_type
+		  ethtool_c33_pse_ext_substate_power_not_available
+		  ethtool_c33_pse_ext_substate_short_detected
+
+When set, the optional ``ETHTOOL_A_C33_PSE_AVAIL_PW_LIMIT`` attribute
+identifies the C33 PSE power limit in mW.
+
+When set the optional ``ETHTOOL_A_C33_PSE_PW_LIMIT_RANGES`` nested attribute
+identifies the C33 PSE power limit ranges through
+``ETHTOOL_A_C33_PSE_PWR_VAL_LIMIT_RANGE_MIN`` and
+``ETHTOOL_A_C33_PSE_PWR_VAL_LIMIT_RANGE_MAX``.
+If the controller works with fixed classes, the min and max values will be
+equal.
+
+PSE_SET
+=======
+
+Sets PSE parameters.
+
+Request contents:
+
+  ======================================  ======  =============================
+  ``ETHTOOL_A_PSE_HEADER``                nested  request header
+  ``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL``       u32  Control PoDL PSE Admin state
+  ``ETHTOOL_A_C33_PSE_ADMIN_CONTROL``        u32  Control PSE Admin state
+  ``ETHTOOL_A_C33_PSE_AVAIL_PWR_LIMIT``      u32  Control PoE PSE available
+                                                  power limit
+  ======================================  ======  =============================
+
+When set, the optional ``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL`` attribute is used
+to control PoDL PSE Admin functions. This option implements
+``IEEE 802.3-2018`` 30.15.1.2.1 acPoDLPSEAdminControl. See
+``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` for supported values.
+
+The same goes for ``ETHTOOL_A_C33_PSE_ADMIN_CONTROL`` implementing
+``IEEE 802.3-2022`` 30.9.1.2.1 acPSEAdminControl.
+
+When set, the optional ``ETHTOOL_A_C33_PSE_AVAIL_PWR_LIMIT`` attribute is
+used to control the available power value limit for C33 PSE in milliwatts.
+This attribute corresponds  to the `pse_available_power` variable described in
+``IEEE 802.3-2022`` 33.2.4.4 Variables  and `pse_avail_pwr` in 145.2.5.4
+Variables, which are described in power classes.
+
+It was decided to use milliwatts for this interface to unify it with other
+power monitoring interfaces, which also use milliwatts, and to align with
+various existing products that document power consumption in watts rather than
+classes. If power limit configuration based on classes is needed, the
+conversion can be done in user space, for example by ethtool.
+
+RSS_GET
+=======
+
+Get indirection table, hash key and hash function info associated with a
+RSS context of an interface similar to ``ETHTOOL_GRSSH`` ioctl request.
+
+Request contents:
+
+=====================================  ======  ============================
+  ``ETHTOOL_A_RSS_HEADER``             nested  request header
+  ``ETHTOOL_A_RSS_CONTEXT``            u32     context number
+  ``ETHTOOL_A_RSS_START_CONTEXT``      u32     start context number (dumps)
+=====================================  ======  ============================
+
+``ETHTOOL_A_RSS_CONTEXT`` specifies which RSS context number to query,
+if not set context 0 (the main context) is queried. Dumps can be filtered
+by device (only listing contexts of a given netdev). Filtering single
+context number is not supported but ``ETHTOOL_A_RSS_START_CONTEXT``
+can be used to start dumping context from the given number (primarily
+used to ignore context 0s and only dump additional contexts).
+
+Kernel response contents:
+
+=====================================  ======  ==========================
+  ``ETHTOOL_A_RSS_HEADER``             nested  reply header
+  ``ETHTOOL_A_RSS_CONTEXT``            u32     context number
+  ``ETHTOOL_A_RSS_HFUNC``              u32     RSS hash func
+  ``ETHTOOL_A_RSS_INDIR``              binary  Indir table bytes
+  ``ETHTOOL_A_RSS_HKEY``               binary  Hash key bytes
+  ``ETHTOOL_A_RSS_INPUT_XFRM``         u32     RSS input data transformation
+=====================================  ======  ==========================
+
+ETHTOOL_A_RSS_HFUNC attribute is bitmap indicating the hash function
+being used. Current supported options are toeplitz, xor or crc32.
+ETHTOOL_A_RSS_INDIR attribute returns RSS indirection table where each byte
+indicates queue number.
+ETHTOOL_A_RSS_INPUT_XFRM attribute is a bitmap indicating the type of
+transformation applied to the input protocol fields before given to the RSS
+hfunc. Current supported options are symmetric-xor and symmetric-or-xor.
+
+PLCA_GET_CFG
+============
+
+Gets the IEEE 802.3cg-2019 Clause 148 Physical Layer Collision Avoidance
+(PLCA) Reconciliation Sublayer (RS) attributes.
+
+Request contents:
+
+  =====================================  ======  ==========================
+  ``ETHTOOL_A_PLCA_HEADER``              nested  request header
+  =====================================  ======  ==========================
+
+Kernel response contents:
+
+  ======================================  ======  =============================
+  ``ETHTOOL_A_PLCA_HEADER``               nested  reply header
+  ``ETHTOOL_A_PLCA_VERSION``              u16     Supported PLCA management
+                                                  interface standard/version
+  ``ETHTOOL_A_PLCA_ENABLED``              u8      PLCA Admin State
+  ``ETHTOOL_A_PLCA_NODE_ID``              u32     PLCA unique local node ID
+  ``ETHTOOL_A_PLCA_NODE_CNT``             u32     Number of PLCA nodes on the
+                                                  network, including the
+                                                  coordinator
+  ``ETHTOOL_A_PLCA_TO_TMR``               u32     Transmit Opportunity Timer
+                                                  value in bit-times (BT)
+  ``ETHTOOL_A_PLCA_BURST_CNT``            u32     Number of additional packets
+                                                  the node is allowed to send
+                                                  within a single TO
+  ``ETHTOOL_A_PLCA_BURST_TMR``            u32     Time to wait for the MAC to
+                                                  transmit a new frame before
+                                                  terminating the burst
+  ======================================  ======  =============================
+
+When set, the optional ``ETHTOOL_A_PLCA_VERSION`` attribute indicates which
+standard and version the PLCA management interface complies to. When not set,
+the interface is vendor-specific and (possibly) supplied by the driver.
+The OPEN Alliance SIG specifies a standard register map for 10BASE-T1S PHYs
+embedding the PLCA Reconciliation Sublayer. See "10BASE-T1S PLCA Management
+Registers" at https://www.opensig.org/about/specifications/.
+
+When set, the optional ``ETHTOOL_A_PLCA_ENABLED`` attribute indicates the
+administrative state of the PLCA RS. When not set, the node operates in "plain"
+CSMA/CD mode. This option is corresponding to ``IEEE 802.3cg-2019`` 30.16.1.1.1
+aPLCAAdminState / 30.16.1.2.1 acPLCAAdminControl.
+
+When set, the optional ``ETHTOOL_A_PLCA_NODE_ID`` attribute indicates the
+configured local node ID of the PHY. This ID determines which transmit
+opportunity (TO) is reserved for the node to transmit into. This option is
+corresponding to ``IEEE 802.3cg-2019`` 30.16.1.1.4 aPLCALocalNodeID. The valid
+range for this attribute is [0 .. 255] where 255 means "not configured".
+
+When set, the optional ``ETHTOOL_A_PLCA_NODE_CNT`` attribute indicates the
+configured maximum number of PLCA nodes on the mixing-segment. This number
+determines the total number of transmit opportunities generated during a
+PLCA cycle. This attribute is relevant only for the PLCA coordinator, which is
+the node with aPLCALocalNodeID set to 0. Follower nodes ignore this setting.
+This option is corresponding to ``IEEE 802.3cg-2019`` 30.16.1.1.3
+aPLCANodeCount. The valid range for this attribute is [1 .. 255].
+
+When set, the optional ``ETHTOOL_A_PLCA_TO_TMR`` attribute indicates the
+configured value of the transmit opportunity timer in bit-times. This value
+must be set equal across all nodes sharing the medium for PLCA to work
+correctly. This option is corresponding to ``IEEE 802.3cg-2019`` 30.16.1.1.5
+aPLCATransmitOpportunityTimer. The valid range for this attribute is
+[0 .. 255].
+
+When set, the optional ``ETHTOOL_A_PLCA_BURST_CNT`` attribute indicates the
+configured number of extra packets that the node is allowed to send during a
+single transmit opportunity. By default, this attribute is 0, meaning that
+the node can only send a single frame per TO. When greater than 0, the PLCA RS
+keeps the TO after any transmission, waiting for the MAC to send a new frame
+for up to aPLCABurstTimer BTs. This can only happen a number of times per PLCA
+cycle up to the value of this parameter. After that, the burst is over and the
+normal counting of TOs resumes. This option is corresponding to
+``IEEE 802.3cg-2019`` 30.16.1.1.6 aPLCAMaxBurstCount. The valid range for this
+attribute is [0 .. 255].
+
+When set, the optional ``ETHTOOL_A_PLCA_BURST_TMR`` attribute indicates how
+many bit-times the PLCA RS waits for the MAC to initiate a new transmission
+when aPLCAMaxBurstCount is greater than 0. If the MAC fails to send a new
+frame within this time, the burst ends and the counting of TOs resumes.
+Otherwise, the new frame is sent as part of the current burst. This option
+is corresponding to ``IEEE 802.3cg-2019`` 30.16.1.1.7 aPLCABurstTimer. The
+valid range for this attribute is [0 .. 255]. Although, the value should be
+set greater than the Inter-Frame-Gap (IFG) time of the MAC (plus some margin)
+for PLCA burst mode to work as intended.
+
+PLCA_SET_CFG
+============
+
+Sets PLCA RS parameters.
+
+Request contents:
+
+  ======================================  ======  =============================
+  ``ETHTOOL_A_PLCA_HEADER``               nested  request header
+  ``ETHTOOL_A_PLCA_ENABLED``              u8      PLCA Admin State
+  ``ETHTOOL_A_PLCA_NODE_ID``              u8      PLCA unique local node ID
+  ``ETHTOOL_A_PLCA_NODE_CNT``             u8      Number of PLCA nodes on the
+                                                  network, including the
+                                                  coordinator
+  ``ETHTOOL_A_PLCA_TO_TMR``               u8      Transmit Opportunity Timer
+                                                  value in bit-times (BT)
+  ``ETHTOOL_A_PLCA_BURST_CNT``            u8      Number of additional packets
+                                                  the node is allowed to send
+                                                  within a single TO
+  ``ETHTOOL_A_PLCA_BURST_TMR``            u8      Time to wait for the MAC to
+                                                  transmit a new frame before
+                                                  terminating the burst
+  ======================================  ======  =============================
+
+For a description of each attribute, see ``PLCA_GET_CFG``.
+
+PLCA_GET_STATUS
+===============
+
+Gets PLCA RS status information.
+
+Request contents:
+
+  =====================================  ======  ==========================
+  ``ETHTOOL_A_PLCA_HEADER``              nested  request header
+  =====================================  ======  ==========================
+
+Kernel response contents:
+
+  ======================================  ======  =============================
+  ``ETHTOOL_A_PLCA_HEADER``               nested  reply header
+  ``ETHTOOL_A_PLCA_STATUS``               u8      PLCA RS operational status
+  ======================================  ======  =============================
+
+When set, the ``ETHTOOL_A_PLCA_STATUS`` attribute indicates whether the node is
+detecting the presence of the BEACON on the network. This flag is
+corresponding to ``IEEE 802.3cg-2019`` 30.16.1.1.2 aPLCAStatus.
+
+MM_GET
+======
+
+Retrieve 802.3 MAC Merge parameters.
+
+Request contents:
+
+  ====================================  ======  ==========================
+  ``ETHTOOL_A_MM_HEADER``               nested  request header
+  ====================================  ======  ==========================
+
+Kernel response contents:
+
+  =================================  ======  ===================================
+  ``ETHTOOL_A_MM_HEADER``            nested  request header
+  ``ETHTOOL_A_MM_PMAC_ENABLED``      bool    set if RX of preemptible and SMD-V
+                                             frames is enabled
+  ``ETHTOOL_A_MM_TX_ENABLED``        bool    set if TX of preemptible frames is
+                                             administratively enabled (might be
+                                             inactive if verification failed)
+  ``ETHTOOL_A_MM_TX_ACTIVE``         bool    set if TX of preemptible frames is
+                                             operationally enabled
+  ``ETHTOOL_A_MM_TX_MIN_FRAG_SIZE``  u32     minimum size of transmitted
+                                             non-final fragments, in octets
+  ``ETHTOOL_A_MM_RX_MIN_FRAG_SIZE``  u32     minimum size of received non-final
+                                             fragments, in octets
+  ``ETHTOOL_A_MM_VERIFY_ENABLED``    bool    set if TX of SMD-V frames is
+                                             administratively enabled
+  ``ETHTOOL_A_MM_VERIFY_STATUS``     u8      state of the verification function
+  ``ETHTOOL_A_MM_VERIFY_TIME``       u32     delay between verification attempts
+  ``ETHTOOL_A_MM_MAX_VERIFY_TIME```  u32     maximum verification interval
+                                             supported by device
+  ``ETHTOOL_A_MM_STATS``             nested  IEEE 802.3-2018 subclause 30.14.1
+                                             oMACMergeEntity statistics counters
+  =================================  ======  ===================================
+
+The attributes are populated by the device driver through the following
+structure:
+
+.. kernel-doc:: include/linux/ethtool.h
+    :identifiers: ethtool_mm_state
+
+The ``ETHTOOL_A_MM_VERIFY_STATUS`` will report one of the values from
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+    :identifiers: ethtool_mm_verify_status
+
+If ``ETHTOOL_A_MM_VERIFY_ENABLED`` was passed as false in the ``MM_SET``
+command, ``ETHTOOL_A_MM_VERIFY_STATUS`` will report either
+``ETHTOOL_MM_VERIFY_STATUS_INITIAL`` or ``ETHTOOL_MM_VERIFY_STATUS_DISABLED``,
+otherwise it should report one of the other states.
+
+It is recommended that drivers start with the pMAC disabled, and enable it upon
+user space request. It is also recommended that user space does not depend upon
+the default values from ``ETHTOOL_MSG_MM_GET`` requests.
+
+``ETHTOOL_A_MM_STATS`` are reported if ``ETHTOOL_FLAG_STATS`` was set in
+``ETHTOOL_A_HEADER_FLAGS``. The attribute will be empty if driver did not
+report any statistics. Drivers fill in the statistics in the following
+structure:
+
+.. kernel-doc:: include/linux/ethtool.h
+    :identifiers: ethtool_mm_stats
+
+MM_SET
+======
+
+Modifies the configuration of the 802.3 MAC Merge layer.
+
+Request contents:
+
+  =================================  ======  ==========================
+  ``ETHTOOL_A_MM_VERIFY_TIME``       u32     see MM_GET description
+  ``ETHTOOL_A_MM_VERIFY_ENABLED``    bool    see MM_GET description
+  ``ETHTOOL_A_MM_TX_ENABLED``        bool    see MM_GET description
+  ``ETHTOOL_A_MM_PMAC_ENABLED``      bool    see MM_GET description
+  ``ETHTOOL_A_MM_TX_MIN_FRAG_SIZE``  u32     see MM_GET description
+  =================================  ======  ==========================
+
+The attributes are propagated to the driver through the following structure:
+
+.. kernel-doc:: include/linux/ethtool.h
+    :identifiers: ethtool_mm_cfg
+
+MODULE_FW_FLASH_ACT
+===================
+
+Flashes transceiver module firmware.
+
+Request contents:
+
+  =======================================  ======  ===========================
+  ``ETHTOOL_A_MODULE_FW_FLASH_HEADER``     nested  request header
+  ``ETHTOOL_A_MODULE_FW_FLASH_FILE_NAME``  string  firmware image file name
+  ``ETHTOOL_A_MODULE_FW_FLASH_PASSWORD``   u32     transceiver module password
+  =======================================  ======  ===========================
+
+The firmware update process consists of three logical steps:
+
+1. Downloading a firmware image to the transceiver module and validating it.
+2. Running the firmware image.
+3. Committing the firmware image so that it is run upon reset.
+
+When flash command is given, those three steps are taken in that order.
+
+This message merely schedules the update process and returns immediately
+without blocking. The process then runs asynchronously.
+Since it can take several minutes to complete, during the update process
+notifications are emitted from the kernel to user space updating it about
+the status and progress.
+
+The ``ETHTOOL_A_MODULE_FW_FLASH_FILE_NAME`` attribute encodes the firmware
+image file name. The firmware image is downloaded to the transceiver module,
+validated, run and committed.
+
+The optional ``ETHTOOL_A_MODULE_FW_FLASH_PASSWORD`` attribute encodes a password
+that might be required as part of the transceiver module firmware update
+process.
+
+The firmware update process can take several minutes to complete. Therefore,
+during the update process notifications are emitted from the kernel to user
+space updating it about the status and progress.
+
+
+
+Notification contents:
+
+ +---------------------------------------------------+--------+----------------+
+ | ``ETHTOOL_A_MODULE_FW_FLASH_HEADER``              | nested | reply header   |
+ +---------------------------------------------------+--------+----------------+
+ | ``ETHTOOL_A_MODULE_FW_FLASH_STATUS``              | u32    | status         |
+ +---------------------------------------------------+--------+----------------+
+ | ``ETHTOOL_A_MODULE_FW_FLASH_STATUS_MSG``          | string | status message |
+ +---------------------------------------------------+--------+----------------+
+ | ``ETHTOOL_A_MODULE_FW_FLASH_DONE``                | uint   | progress       |
+ +---------------------------------------------------+--------+----------------+
+ | ``ETHTOOL_A_MODULE_FW_FLASH_TOTAL``               | uint   | total          |
+ +---------------------------------------------------+--------+----------------+
+
+The ``ETHTOOL_A_MODULE_FW_FLASH_STATUS`` attribute encodes the current status
+of the firmware update process. Possible values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+    :identifiers: ethtool_module_fw_flash_status
+
+The ``ETHTOOL_A_MODULE_FW_FLASH_STATUS_MSG`` attribute encodes a status message
+string.
+
+The ``ETHTOOL_A_MODULE_FW_FLASH_DONE`` and ``ETHTOOL_A_MODULE_FW_FLASH_TOTAL``
+attributes encode the completed and total amount of work, respectively.
+
+PHY_GET
+=======
+
+Retrieve information about a given Ethernet PHY sitting on the link. The DO
+operation returns all available information about dev->phydev. User can also
+specify a PHY_INDEX, in which case the DO request returns information about that
+specific PHY.
+
+As there can be more than one PHY, the DUMP operation can be used to list the PHYs
+present on a given interface, by passing an interface index or name in
+the dump request.
+
+For more information, refer to :ref:`phy_link_topology`
+
+Request contents:
+
+  ====================================  ======  ==========================
+  ``ETHTOOL_A_PHY_HEADER``              nested  request header
+  ====================================  ======  ==========================
+
+Kernel response contents:
+
+  ===================================== ======  ===============================
+  ``ETHTOOL_A_PHY_HEADER``              nested  request header
+  ``ETHTOOL_A_PHY_INDEX``               u32     the phy's unique index, that can
+                                                be used for phy-specific
+                                                requests
+  ``ETHTOOL_A_PHY_DRVNAME``             string  the phy driver name
+  ``ETHTOOL_A_PHY_NAME``                string  the phy device name
+  ``ETHTOOL_A_PHY_UPSTREAM_TYPE``       u32     the type of device this phy is
+                                                connected to
+  ``ETHTOOL_A_PHY_UPSTREAM_INDEX``      u32     the PHY index of the upstream
+                                                PHY
+  ``ETHTOOL_A_PHY_UPSTREAM_SFP_NAME``   string  if this PHY is connected to
+                                                its parent PHY through an SFP
+                                                bus, the name of this sfp bus
+  ``ETHTOOL_A_PHY_DOWNSTREAM_SFP_NAME`` string  if the phy controls an sfp bus,
+                                                the name of the sfp bus
+  ===================================== ======  ===============================
+
+When ``ETHTOOL_A_PHY_UPSTREAM_TYPE`` is PHY_UPSTREAM_PHY, the PHY's parent is
+another PHY.
+
+TSCONFIG_GET
+============
+
+Retrieves the information about the current hardware timestamping source and
+configuration.
+
+It is similar to the deprecated ``SIOCGHWTSTAMP`` ioctl request.
+
+Request contents:
+
+  ====================================  ======  ==========================
+  ``ETHTOOL_A_TSCONFIG_HEADER``         nested  request header
+  ====================================  ======  ==========================
+
+Kernel response contents:
+
+  ======================================== ======  ============================
+  ``ETHTOOL_A_TSCONFIG_HEADER``            nested  request header
+  ``ETHTOOL_A_TSCONFIG_HWTSTAMP_PROVIDER`` nested  PTP hw clock provider
+  ``ETHTOOL_A_TSCONFIG_TX_TYPES``          bitset  hwtstamp Tx type
+  ``ETHTOOL_A_TSCONFIG_RX_FILTERS``        bitset  hwtstamp Rx filter
+  ``ETHTOOL_A_TSCONFIG_HWTSTAMP_FLAGS``	   u32     hwtstamp flags
+  ======================================== ======  ============================
+
+When set the ``ETHTOOL_A_TSCONFIG_HWTSTAMP_PROVIDER`` attribute identifies the
+source of the hw timestamping provider. It is composed by
+``ETHTOOL_A_TS_HWTSTAMP_PROVIDER_INDEX`` attribute which describe the index of
+the PTP device and ``ETHTOOL_A_TS_HWTSTAMP_PROVIDER_QUALIFIER`` which describe
+the qualifier of the timestamp.
+
+When set the ``ETHTOOL_A_TSCONFIG_TX_TYPES``, ``ETHTOOL_A_TSCONFIG_RX_FILTERS``
+and the ``ETHTOOL_A_TSCONFIG_HWTSTAMP_FLAGS`` attributes identify the Tx
+type, the Rx filter and the flags configured for the current hw timestamping
+provider. The attributes are propagated to the driver through the following
+structure:
+
+.. kernel-doc:: include/linux/net_tstamp.h
+    :identifiers: kernel_hwtstamp_config
+
+TSCONFIG_SET
+============
+
+Set the information about the current hardware timestamping source and
+configuration.
+
+It is similar to the deprecated ``SIOCSHWTSTAMP`` ioctl request.
+
+Request contents:
+
+  ======================================== ======  ============================
+  ``ETHTOOL_A_TSCONFIG_HEADER``            nested  request header
+  ``ETHTOOL_A_TSCONFIG_HWTSTAMP_PROVIDER`` nested  PTP hw clock provider
+  ``ETHTOOL_A_TSCONFIG_TX_TYPES``          bitset  hwtstamp Tx type
+  ``ETHTOOL_A_TSCONFIG_RX_FILTERS``        bitset  hwtstamp Rx filter
+  ``ETHTOOL_A_TSCONFIG_HWTSTAMP_FLAGS``	   u32     hwtstamp flags
+  ======================================== ======  ============================
+
+Kernel response contents:
+
+  ======================================== ======  ============================
+  ``ETHTOOL_A_TSCONFIG_HEADER``            nested  request header
+  ``ETHTOOL_A_TSCONFIG_HWTSTAMP_PROVIDER`` nested  PTP hw clock provider
+  ``ETHTOOL_A_TSCONFIG_TX_TYPES``          bitset  hwtstamp Tx type
+  ``ETHTOOL_A_TSCONFIG_RX_FILTERS``        bitset  hwtstamp Rx filter
+  ``ETHTOOL_A_TSCONFIG_HWTSTAMP_FLAGS``	   u32     hwtstamp flags
+  ======================================== ======  ============================
+
+For a description of each attribute, see ``TSCONFIG_GET``.
+
 Request translation
 ===================
 
@@ -1187,11 +2411,11 @@ are netlink only.
   ``ETHTOOL_GET_DUMP_FLAG``           n/a
   ``ETHTOOL_GET_DUMP_DATA``           n/a
   ``ETHTOOL_GET_TS_INFO``             ``ETHTOOL_MSG_TSINFO_GET``
-  ``ETHTOOL_GMODULEINFO``             n/a
-  ``ETHTOOL_GMODULEEEPROM``           n/a
+  ``ETHTOOL_GMODULEINFO``             ``ETHTOOL_MSG_MODULE_EEPROM_GET``
+  ``ETHTOOL_GMODULEEEPROM``           ``ETHTOOL_MSG_MODULE_EEPROM_GET``
   ``ETHTOOL_GEEE``                    ``ETHTOOL_MSG_EEE_GET``
   ``ETHTOOL_SEEE``                    ``ETHTOOL_MSG_EEE_SET``
-  ``ETHTOOL_GRSSH``                   n/a
+  ``ETHTOOL_GRSSH``                   ``ETHTOOL_MSG_RSS_GET``
   ``ETHTOOL_SRSSH``                   n/a
   ``ETHTOOL_GTUNABLE``                n/a
   ``ETHTOOL_STUNABLE``                n/a
@@ -1203,8 +2427,21 @@ are netlink only.
                                       ``ETHTOOL_MSG_LINKMODES_SET``
   ``ETHTOOL_PHY_GTUNABLE``            n/a
   ``ETHTOOL_PHY_STUNABLE``            n/a
-  ``ETHTOOL_GFECPARAM``               n/a
-  ``ETHTOOL_SFECPARAM``               n/a
-  n/a                                 ''ETHTOOL_MSG_CABLE_TEST_ACT''
-  n/a                                 ''ETHTOOL_MSG_CABLE_TEST_TDR_ACT''
+  ``ETHTOOL_GFECPARAM``               ``ETHTOOL_MSG_FEC_GET``
+  ``ETHTOOL_SFECPARAM``               ``ETHTOOL_MSG_FEC_SET``
+  n/a                                 ``ETHTOOL_MSG_CABLE_TEST_ACT``
+  n/a                                 ``ETHTOOL_MSG_CABLE_TEST_TDR_ACT``
+  n/a                                 ``ETHTOOL_MSG_TUNNEL_INFO_GET``
+  n/a                                 ``ETHTOOL_MSG_PHC_VCLOCKS_GET``
+  n/a                                 ``ETHTOOL_MSG_MODULE_GET``
+  n/a                                 ``ETHTOOL_MSG_MODULE_SET``
+  n/a                                 ``ETHTOOL_MSG_PLCA_GET_CFG``
+  n/a                                 ``ETHTOOL_MSG_PLCA_SET_CFG``
+  n/a                                 ``ETHTOOL_MSG_PLCA_GET_STATUS``
+  n/a                                 ``ETHTOOL_MSG_MM_GET``
+  n/a                                 ``ETHTOOL_MSG_MM_SET``
+  n/a                                 ``ETHTOOL_MSG_MODULE_FW_FLASH_ACT``
+  n/a                                 ``ETHTOOL_MSG_PHY_GET``
+  ``SIOCGHWTSTAMP``                   ``ETHTOOL_MSG_TSCONFIG_GET``
+  ``SIOCSHWTSTAMP``                   ``ETHTOOL_MSG_TSCONFIG_SET``
   =================================== =====================================
diff --git a/Documentation/networking/filter.rst b/Documentation/networking/filter.rst
index a1d3e192b9fa..8eb9a5d40f31 100644
--- a/Documentation/networking/filter.rst
+++ b/Documentation/networking/filter.rst
@@ -1,9 +1,18 @@
 .. SPDX-License-Identifier: GPL-2.0
 
+.. _networking-filter:
+
 =======================================================
 Linux Socket Filtering aka Berkeley Packet Filter (BPF)
 =======================================================
 
+Notice
+------
+
+This file used to document the eBPF format and mechanisms even when not
+related to socket filtering.  The ../bpf/index.rst has more details
+on eBPF.
+
 Introduction
 ------------
 
@@ -296,7 +305,7 @@ Possible BPF extensions are shown in the following table:
   vlan_tci                              skb_vlan_tag_get(skb)
   vlan_avail                            skb_vlan_tag_present(skb)
   vlan_tpid                             skb->vlan_proto
-  rand                                  prandom_u32()
+  rand                                  get_random_u32()
   ===================================   =================================================
 
 These extensions can also be prefixed with '#'.
@@ -318,14 +327,7 @@ Examples for low-level BPF:
   ret #-1
   drop: ret #0
 
-**(Accelerated) VLAN w/ id 10**::
-
-  ld vlan_tci
-  jneq #10, drop
-  ret #-1
-  drop: ret #0
-
-**icmp random packet sampling, 1 in 4**:
+**icmp random packet sampling, 1 in 4**::
 
   ldh [12]
   jne #0x800, drop
@@ -356,6 +358,22 @@ Examples for low-level BPF:
   bad: ret #0             /* SECCOMP_RET_KILL_THREAD */
   good: ret #0x7fff0000   /* SECCOMP_RET_ALLOW */
 
+Examples for low-level BPF extension:
+
+**Packet for interface index 13**::
+
+  ld ifidx
+  jneq #13, drop
+  ret #-1
+  drop: ret #0
+
+**(Accelerated) VLAN w/ id 10**::
+
+  ld vlan_tci
+  jneq #10, drop
+  ret #-1
+  drop: ret #0
+
 The above example code can be placed into a file (here called "foo"), and
 then be passed to the bpf_asm tool for generating opcodes, output that xt_bpf
 and cls_bpf understands and can directly be loaded with. Example with above
@@ -495,7 +513,7 @@ JIT compiler
 ------------
 
 The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC,
-PowerPC, ARM, ARM64, MIPS, RISC-V and s390 and can be enabled through
+PowerPC, ARM, ARM64, MIPS, RISC-V, s390, and ARC and can be enabled through
 CONFIG_BPF_JIT. The JIT compiler is transparently invoked for each
 attached filter from user space or for internal kernel users if it has
 been previously enabled by root::
@@ -606,15 +624,11 @@ format with similar underlying principles from BPF described in previous
 paragraphs is being used. However, the instruction set format is modelled
 closer to the underlying architecture to mimic native instruction sets, so
 that a better performance can be achieved (more details later). This new
-ISA is called 'eBPF' or 'internal BPF' interchangeably. (Note: eBPF which
+ISA is called eBPF.  See the ../bpf/index.rst for details.  (Note: eBPF which
 originates from [e]xtended BPF is not the same as BPF extensions! While
 eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading'
 of BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.)
 
-It is designed to be JITed with one to one mapping, which can also open up
-the possibility for GCC/LLVM compilers to generate optimized eBPF code through
-an eBPF backend that performs almost as fast as natively compiled code.
-
 The new instruction set was originally designed with the possible goal in
 mind to write programs in "restricted C" and compile into eBPF with a optional
 GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with
@@ -627,8 +641,8 @@ extension, PTP dissector/classifier, and much more. They are all internally
 converted by the kernel into the new instruction set representation and run
 in the eBPF interpreter. For in-kernel handlers, this all works transparently
 by using bpf_prog_create() for setting up the filter, resp.
-bpf_prog_destroy() for destroying it. The macro
-BPF_PROG_RUN(filter, ctx) transparently invokes eBPF interpreter or JITed
+bpf_prog_destroy() for destroying it. The function
+bpf_prog_run(filter, ctx) transparently invokes eBPF interpreter or JITed
 code to run the filter. 'filter' is a pointer to struct bpf_prog that we
 got from bpf_prog_create(), and 'ctx' the given context (e.g.
 skb pointer). All constraints and restrictions from bpf_check_classic() apply
@@ -636,994 +650,14 @@ before a conversion to the new layout is being done behind the scenes!
 
 Currently, the classic BPF format is being used for JITing on most
 32-bit architectures, whereas x86-64, aarch64, s390x, powerpc64,
-sparc64, arm32, riscv64, riscv32 perform JIT compilation from eBPF
-instruction set.
-
-Some core changes of the new internal format:
-
-- Number of registers increase from 2 to 10:
-
-  The old format had two registers A and X, and a hidden frame pointer. The
-  new layout extends this to be 10 internal registers and a read-only frame
-  pointer. Since 64-bit CPUs are passing arguments to functions via registers
-  the number of args from eBPF program to in-kernel function is restricted
-  to 5 and one register is used to accept return value from an in-kernel
-  function. Natively, x86_64 passes first 6 arguments in registers, aarch64/
-  sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved
-  registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers.
-
-  Therefore, eBPF calling convention is defined as:
-
-    * R0	- return value from in-kernel function, and exit value for eBPF program
-    * R1 - R5	- arguments from eBPF program to in-kernel function
-    * R6 - R9	- callee saved registers that in-kernel function will preserve
-    * R10	- read-only frame pointer to access stack
-
-  Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64,
-  etc, and eBPF calling convention maps directly to ABIs used by the kernel on
-  64-bit architectures.
-
-  On 32-bit architectures JIT may map programs that use only 32-bit arithmetic
-  and may let more complex programs to be interpreted.
-
-  R0 - R5 are scratch registers and eBPF program needs spill/fill them if
-  necessary across calls. Note that there is only one eBPF program (== one
-  eBPF main routine) and it cannot call other eBPF functions, it can only
-  call predefined in-kernel functions, though.
-
-- Register width increases from 32-bit to 64-bit:
-
-  Still, the semantics of the original 32-bit ALU operations are preserved
-  via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower
-  subregisters that zero-extend into 64-bit if they are being written to.
-  That behavior maps directly to x86_64 and arm64 subregister definition, but
-  makes other JITs more difficult.
-
-  32-bit architectures run 64-bit internal BPF programs via interpreter.
-  Their JITs may convert BPF programs that only use 32-bit subregisters into
-  native instruction set and let the rest being interpreted.
-
-  Operation is 64-bit, because on 64-bit architectures, pointers are also
-  64-bit wide, and we want to pass 64-bit values in/out of kernel functions,
-  so 32-bit eBPF registers would otherwise require to define register-pair
-  ABI, thus, there won't be able to use a direct eBPF register to HW register
-  mapping and JIT would need to do combine/split/move operations for every
-  register in and out of the function, which is complex, bug prone and slow.
-  Another reason is the use of atomic 64-bit counters.
-
-- Conditional jt/jf targets replaced with jt/fall-through:
-
-  While the original design has constructs such as ``if (cond) jump_true;
-  else jump_false;``, they are being replaced into alternative constructs like
-  ``if (cond) jump_true; /* else fall-through */``.
-
-- Introduces bpf_call insn and register passing convention for zero overhead
-  calls from/to other kernel functions:
-
-  Before an in-kernel function call, the internal BPF program needs to
-  place function arguments into R1 to R5 registers to satisfy calling
-  convention, then the interpreter will take them from registers and pass
-  to in-kernel function. If R1 - R5 registers are mapped to CPU registers
-  that are used for argument passing on given architecture, the JIT compiler
-  doesn't need to emit extra moves. Function arguments will be in the correct
-  registers and BPF_CALL instruction will be JITed as single 'call' HW
-  instruction. This calling convention was picked to cover common call
-  situations without performance penalty.
-
-  After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has
-  a return value of the function. Since R6 - R9 are callee saved, their state
-  is preserved across the call.
-
-  For example, consider three C functions::
-
-    u64 f1() { return (*_f2)(1); }
-    u64 f2(u64 a) { return f3(a + 1, a); }
-    u64 f3(u64 a, u64 b) { return a - b; }
-
-  GCC can compile f1, f3 into x86_64::
-
-    f1:
-	movl $1, %edi
-	movq _f2(%rip), %rax
-	jmp  *%rax
-    f3:
-	movq %rdi, %rax
-	subq %rsi, %rax
-	ret
-
-  Function f2 in eBPF may look like::
-
-    f2:
-	bpf_mov R2, R1
-	bpf_add R1, 1
-	bpf_call f3
-	bpf_exit
-
-  If f2 is JITed and the pointer stored to ``_f2``. The calls f1 -> f2 -> f3 and
-  returns will be seamless. Without JIT, __bpf_prog_run() interpreter needs to
-  be used to call into f2.
-
-  For practical reasons all eBPF programs have only one argument 'ctx' which is
-  already placed into R1 (e.g. on __bpf_prog_run() startup) and the programs
-  can call kernel functions with up to 5 arguments. Calls with 6 or more arguments
-  are currently not supported, but these restrictions can be lifted if necessary
-  in the future.
-
-  On 64-bit architectures all register map to HW registers one to one. For
-  example, x86_64 JIT compiler can map them as ...
-
-  ::
-
-    R0 - rax
-    R1 - rdi
-    R2 - rsi
-    R3 - rdx
-    R4 - rcx
-    R5 - r8
-    R6 - rbx
-    R7 - r13
-    R8 - r14
-    R9 - r15
-    R10 - rbp
-
-  ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing
-  and rbx, r12 - r15 are callee saved.
-
-  Then the following internal BPF pseudo-program::
-
-    bpf_mov R6, R1 /* save ctx */
-    bpf_mov R2, 2
-    bpf_mov R3, 3
-    bpf_mov R4, 4
-    bpf_mov R5, 5
-    bpf_call foo
-    bpf_mov R7, R0 /* save foo() return value */
-    bpf_mov R1, R6 /* restore ctx for next call */
-    bpf_mov R2, 6
-    bpf_mov R3, 7
-    bpf_mov R4, 8
-    bpf_mov R5, 9
-    bpf_call bar
-    bpf_add R0, R7
-    bpf_exit
-
-  After JIT to x86_64 may look like::
-
-    push %rbp
-    mov %rsp,%rbp
-    sub $0x228,%rsp
-    mov %rbx,-0x228(%rbp)
-    mov %r13,-0x220(%rbp)
-    mov %rdi,%rbx
-    mov $0x2,%esi
-    mov $0x3,%edx
-    mov $0x4,%ecx
-    mov $0x5,%r8d
-    callq foo
-    mov %rax,%r13
-    mov %rbx,%rdi
-    mov $0x6,%esi
-    mov $0x7,%edx
-    mov $0x8,%ecx
-    mov $0x9,%r8d
-    callq bar
-    add %r13,%rax
-    mov -0x228(%rbp),%rbx
-    mov -0x220(%rbp),%r13
-    leaveq
-    retq
-
-  Which is in this example equivalent in C to::
-
-    u64 bpf_filter(u64 ctx)
-    {
-	return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9);
-    }
-
-  In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64
-  arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper
-  registers and place their return value into ``%rax`` which is R0 in eBPF.
-  Prologue and epilogue are emitted by JIT and are implicit in the
-  interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve
-  them across the calls as defined by calling convention.
-
-  For example the following program is invalid::
-
-    bpf_mov R1, 1
-    bpf_call foo
-    bpf_mov R0, R1
-    bpf_exit
-
-  After the call the registers R1-R5 contain junk values and cannot be read.
-  An in-kernel eBPF verifier is used to validate internal BPF programs.
-
-Also in the new design, eBPF is limited to 4096 insns, which means that any
-program will terminate quickly and will only call a fixed number of kernel
-functions. Original BPF and the new format are two operand instructions,
-which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT.
-
-The input context pointer for invoking the interpreter function is generic,
-its content is defined by a specific use case. For seccomp register R1 points
-to seccomp_data, for converted BPF filters R1 points to a skb.
-
-A program, that is translated internally consists of the following elements::
-
-  op:16, jt:8, jf:8, k:32    ==>    op:8, dst_reg:4, src_reg:4, off:16, imm:32
-
-So far 87 internal BPF instructions were implemented. 8-bit 'op' opcode field
-has room for new instructions. Some of them may use 16/24/32 byte encoding. New
-instructions must be multiple of 8 bytes to preserve backward compatibility.
-
-Internal BPF is a general purpose RISC instruction set. Not every register and
-every instruction are used during translation from original BPF to new format.
-For example, socket filters are not using ``exclusive add`` instruction, but
-tracing filters may do to maintain counters of events, for example. Register R9
-is not used by socket filters either, but more complex filters may be running
-out of registers and would have to resort to spill/fill to stack.
-
-Internal BPF can be used as a generic assembler for last step performance
-optimizations, socket filters and seccomp are using it as assembler. Tracing
-filters may use it as assembler to generate code from kernel. In kernel usage
-may not be bounded by security considerations, since generated internal BPF code
-may be optimizing internal code path and not being exposed to the user space.
-Safety of internal BPF can come from a verifier (TBD). In such use cases as
-described, it may be used as safe instruction set.
-
-Just like the original BPF, the new format runs within a controlled environment,
-is deterministic and the kernel can easily prove that. The safety of the program
-can be determined in two steps: first step does depth-first-search to disallow
-loops and other CFG validation; second step starts from the first insn and
-descends all possible paths. It simulates execution of every insn and observes
-the state change of registers and stack.
-
-eBPF opcode encoding
---------------------
-
-eBPF is reusing most of the opcode encoding from classic to simplify conversion
-of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code'
-field is divided into three parts::
-
-  +----------------+--------+--------------------+
-  |   4 bits       |  1 bit |   3 bits           |
-  | operation code | source | instruction class  |
-  +----------------+--------+--------------------+
-  (MSB)                                      (LSB)
-
-Three LSB bits store instruction class which is one of:
-
-  ===================     ===============
-  Classic BPF classes     eBPF classes
-  ===================     ===============
-  BPF_LD    0x00          BPF_LD    0x00
-  BPF_LDX   0x01          BPF_LDX   0x01
-  BPF_ST    0x02          BPF_ST    0x02
-  BPF_STX   0x03          BPF_STX   0x03
-  BPF_ALU   0x04          BPF_ALU   0x04
-  BPF_JMP   0x05          BPF_JMP   0x05
-  BPF_RET   0x06          BPF_JMP32 0x06
-  BPF_MISC  0x07          BPF_ALU64 0x07
-  ===================     ===============
-
-When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ...
-
-    ::
-
-	BPF_K     0x00
-	BPF_X     0x08
-
- * in classic BPF, this means::
-
-	BPF_SRC(code) == BPF_X - use register X as source operand
-	BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
-
- * in eBPF, this means::
-
-	BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
-	BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
-
-... and four MSB bits store operation code.
-
-If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of::
-
-  BPF_ADD   0x00
-  BPF_SUB   0x10
-  BPF_MUL   0x20
-  BPF_DIV   0x30
-  BPF_OR    0x40
-  BPF_AND   0x50
-  BPF_LSH   0x60
-  BPF_RSH   0x70
-  BPF_NEG   0x80
-  BPF_MOD   0x90
-  BPF_XOR   0xa0
-  BPF_MOV   0xb0  /* eBPF only: mov reg to reg */
-  BPF_ARSH  0xc0  /* eBPF only: sign extending shift right */
-  BPF_END   0xd0  /* eBPF only: endianness conversion */
-
-If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of::
-
-  BPF_JA    0x00  /* BPF_JMP only */
-  BPF_JEQ   0x10
-  BPF_JGT   0x20
-  BPF_JGE   0x30
-  BPF_JSET  0x40
-  BPF_JNE   0x50  /* eBPF only: jump != */
-  BPF_JSGT  0x60  /* eBPF only: signed '>' */
-  BPF_JSGE  0x70  /* eBPF only: signed '>=' */
-  BPF_CALL  0x80  /* eBPF BPF_JMP only: function call */
-  BPF_EXIT  0x90  /* eBPF BPF_JMP only: function return */
-  BPF_JLT   0xa0  /* eBPF only: unsigned '<' */
-  BPF_JLE   0xb0  /* eBPF only: unsigned '<=' */
-  BPF_JSLT  0xc0  /* eBPF only: signed '<' */
-  BPF_JSLE  0xd0  /* eBPF only: signed '<=' */
-
-So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF
-and eBPF. There are only two registers in classic BPF, so it means A += X.
-In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly,
-BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous
-src_reg = (u32) src_reg ^ (u32) imm32 in eBPF.
-
-Classic BPF is using BPF_MISC class to represent A = X and X = A moves.
-eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no
-BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean
-exactly the same operations as BPF_ALU, but with 64-bit wide operands
-instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.:
-dst_reg = dst_reg + src_reg
-
-Classic BPF wastes the whole BPF_RET class to represent a single ``ret``
-operation. Classic BPF_RET | BPF_K means copy imm32 into return register
-and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT
-in eBPF means function exit only. The eBPF program needs to store return
-value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as
-BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide
-operands for the comparisons instead.
-
-For load and store instructions the 8-bit 'code' field is divided as::
-
-  +--------+--------+-------------------+
-  | 3 bits | 2 bits |   3 bits          |
-  |  mode  |  size  | instruction class |
-  +--------+--------+-------------------+
-  (MSB)                             (LSB)
-
-Size modifier is one of ...
-
-::
-
-  BPF_W   0x00    /* word */
-  BPF_H   0x08    /* half word */
-  BPF_B   0x10    /* byte */
-  BPF_DW  0x18    /* eBPF only, double word */
-
-... which encodes size of load/store operation::
-
- B  - 1 byte
- H  - 2 byte
- W  - 4 byte
- DW - 8 byte (eBPF only)
-
-Mode modifier is one of::
-
-  BPF_IMM  0x00  /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
-  BPF_ABS  0x20
-  BPF_IND  0x40
-  BPF_MEM  0x60
-  BPF_LEN  0x80  /* classic BPF only, reserved in eBPF */
-  BPF_MSH  0xa0  /* classic BPF only, reserved in eBPF */
-  BPF_XADD 0xc0  /* eBPF only, exclusive add */
-
-eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and
-(BPF_IND | <size> | BPF_LD) which are used to access packet data.
-
-They had to be carried over from classic to have strong performance of
-socket filters running in eBPF interpreter. These instructions can only
-be used when interpreter context is a pointer to ``struct sk_buff`` and
-have seven implicit operands. Register R6 is an implicit input that must
-contain pointer to sk_buff. Register R0 is an implicit output which contains
-the data fetched from the packet. Registers R1-R5 are scratch registers
-and must not be used to store the data across BPF_ABS | BPF_LD or
-BPF_IND | BPF_LD instructions.
-
-These instructions have implicit program exit condition as well. When
-eBPF program is trying to access the data beyond the packet boundary,
-the interpreter will abort the execution of the program. JIT compilers
-therefore must preserve this property. src_reg and imm32 fields are
-explicit inputs to these instructions.
-
-For example::
-
-  BPF_IND | BPF_W | BPF_LD means:
-
-    R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32))
-    and R1 - R5 were scratched.
-
-Unlike classic BPF instruction set, eBPF has generic load/store operations::
-
-    BPF_MEM | <size> | BPF_STX:  *(size *) (dst_reg + off) = src_reg
-    BPF_MEM | <size> | BPF_ST:   *(size *) (dst_reg + off) = imm32
-    BPF_MEM | <size> | BPF_LDX:  dst_reg = *(size *) (src_reg + off)
-    BPF_XADD | BPF_W  | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg
-    BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg
-
-Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
-2 byte atomic increments are not supported.
-
-eBPF has one 16-byte instruction: BPF_LD | BPF_DW | BPF_IMM which consists
-of two consecutive ``struct bpf_insn`` 8-byte blocks and interpreted as single
-instruction that loads 64-bit immediate value into a dst_reg.
-Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads
-32-bit immediate value into a register.
-
-eBPF verifier
--------------
-The safety of the eBPF program is determined in two steps.
-
-First step does DAG check to disallow loops and other CFG validation.
-In particular it will detect programs that have unreachable instructions.
-(though classic BPF checker allows them)
-
-Second step starts from the first insn and descends all possible paths.
-It simulates execution of every insn and observes the state change of
-registers and stack.
-
-At the start of the program the register R1 contains a pointer to context
-and has type PTR_TO_CTX.
-If verifier sees an insn that does R2=R1, then R2 has now type
-PTR_TO_CTX as well and can be used on the right hand side of expression.
-If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=SCALAR_VALUE,
-since addition of two valid pointers makes invalid pointer.
-(In 'secure' mode verifier will reject any type of pointer arithmetic to make
-sure that kernel addresses don't leak to unprivileged users)
-
-If register was never written to, it's not readable::
-
-  bpf_mov R0 = R2
-  bpf_exit
-
-will be rejected, since R2 is unreadable at the start of the program.
-
-After kernel function call, R1-R5 are reset to unreadable and
-R0 has a return type of the function.
-
-Since R6-R9 are callee saved, their state is preserved across the call.
-
-::
-
-  bpf_mov R6 = 1
-  bpf_call foo
-  bpf_mov R0 = R6
-  bpf_exit
-
-is a correct program. If there was R1 instead of R6, it would have
-been rejected.
-
-load/store instructions are allowed only with registers of valid types, which
-are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked.
-For example::
-
- bpf_mov R1 = 1
- bpf_mov R2 = 2
- bpf_xadd *(u32 *)(R1 + 3) += R2
- bpf_exit
-
-will be rejected, since R1 doesn't have a valid pointer type at the time of
-execution of instruction bpf_xadd.
-
-At the start R1 type is PTR_TO_CTX (a pointer to generic ``struct bpf_context``)
-A callback is used to customize verifier to restrict eBPF program access to only
-certain fields within ctx structure with specified size and alignment.
-
-For example, the following insn::
-
-  bpf_ld R0 = *(u32 *)(R6 + 8)
-
-intends to load a word from address R6 + 8 and store it into R0
-If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know
-that offset 8 of size 4 bytes can be accessed for reading, otherwise
-the verifier will reject the program.
-If R6=PTR_TO_STACK, then access should be aligned and be within
-stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8,
-so it will fail verification, since it's out of bounds.
-
-The verifier will allow eBPF program to read data from stack only after
-it wrote into it.
-
-Classic BPF verifier does similar check with M[0-15] memory slots.
-For example::
-
-  bpf_ld R0 = *(u32 *)(R10 - 4)
-  bpf_exit
-
-is invalid program.
-Though R10 is correct read-only register and has type PTR_TO_STACK
-and R10 - 4 is within stack bounds, there were no stores into that location.
-
-Pointer register spill/fill is tracked as well, since four (R6-R9)
-callee saved registers may not be enough for some programs.
-
-Allowed function calls are customized with bpf_verifier_ops->get_func_proto()
-The eBPF verifier will check that registers match argument constraints.
-After the call register R0 will be set to return type of the function.
-
-Function calls is a main mechanism to extend functionality of eBPF programs.
-Socket filters may let programs to call one set of functions, whereas tracing
-filters may allow completely different set.
-
-If a function made accessible to eBPF program, it needs to be thought through
-from safety point of view. The verifier will guarantee that the function is
-called with valid arguments.
-
-seccomp vs socket filters have different security restrictions for classic BPF.
-Seccomp solves this by two stage verifier: classic BPF verifier is followed
-by seccomp verifier. In case of eBPF one configurable verifier is shared for
-all use cases.
-
-See details of eBPF verifier in kernel/bpf/verifier.c
-
-Register value tracking
------------------------
-In order to determine the safety of an eBPF program, the verifier must track
-the range of possible values in each register and also in each stack slot.
-This is done with ``struct bpf_reg_state``, defined in include/linux/
-bpf_verifier.h, which unifies tracking of scalar and pointer values.  Each
-register state has a type, which is either NOT_INIT (the register has not been
-written to), SCALAR_VALUE (some value which is not usable as a pointer), or a
-pointer type.  The types of pointers describe their base, as follows:
-
-
-    PTR_TO_CTX
-			Pointer to bpf_context.
-    CONST_PTR_TO_MAP
-			Pointer to struct bpf_map.  "Const" because arithmetic
-			on these pointers is forbidden.
-    PTR_TO_MAP_VALUE
-			Pointer to the value stored in a map element.
-    PTR_TO_MAP_VALUE_OR_NULL
-			Either a pointer to a map value, or NULL; map accesses
-			(see section 'eBPF maps', below) return this type,
-			which becomes a PTR_TO_MAP_VALUE when checked != NULL.
-			Arithmetic on these pointers is forbidden.
-    PTR_TO_STACK
-			Frame pointer.
-    PTR_TO_PACKET
-			skb->data.
-    PTR_TO_PACKET_END
-			skb->data + headlen; arithmetic forbidden.
-    PTR_TO_SOCKET
-			Pointer to struct bpf_sock_ops, implicitly refcounted.
-    PTR_TO_SOCKET_OR_NULL
-			Either a pointer to a socket, or NULL; socket lookup
-			returns this type, which becomes a PTR_TO_SOCKET when
-			checked != NULL. PTR_TO_SOCKET is reference-counted,
-			so programs must release the reference through the
-			socket release function before the end of the program.
-			Arithmetic on these pointers is forbidden.
-
-However, a pointer may be offset from this base (as a result of pointer
-arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable
-offset'.  The former is used when an exactly-known value (e.g. an immediate
-operand) is added to a pointer, while the latter is used for values which are
-not exactly known.  The variable offset is also used in SCALAR_VALUEs, to track
-the range of possible values in the register.
-
-The verifier's knowledge about the variable offset consists of:
-
-* minimum and maximum values as unsigned
-* minimum and maximum values as signed
-
-* knowledge of the values of individual bits, in the form of a 'tnum': a u64
-  'mask' and a u64 'value'.  1s in the mask represent bits whose value is unknown;
-  1s in the value represent bits known to be 1.  Bits known to be 0 have 0 in both
-  mask and value; no bit should ever be 1 in both.  For example, if a byte is read
-  into a register from memory, the register's top 56 bits are known zero, while
-  the low 8 are unknown - which is represented as the tnum (0x0; 0xff).  If we
-  then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0;
-  0x1ff), because of potential carries.
-
-Besides arithmetic, the register state can also be updated by conditional
-branches.  For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch
-it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false'
-branch it will have a umax_value of 8.  A signed compare (with BPF_JSGT or
-BPF_JSGE) would instead update the signed minimum/maximum values.  Information
-from the signed and unsigned bounds can be combined; for instance if a value is
-first tested < 8 and then tested s> 4, the verifier will conclude that the value
-is also > 4 and s< 8, since the bounds prevent crossing the sign boundary.
-
-PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all
-pointers sharing that same variable offset.  This is important for packet range
-checks: after adding a variable to a packet pointer register A, if you then copy
-it to another register B and then add a constant 4 to A, both registers will
-share the same 'id' but the A will have a fixed offset of +4.  Then if A is
-bounds-checked and found to be less than a PTR_TO_PACKET_END, the register B is
-now known to have a safe range of at least 4 bytes.  See 'Direct packet access',
-below, for more on PTR_TO_PACKET ranges.
-
-The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of
-the pointer returned from a map lookup.  This means that when one copy is
-checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs.
-As well as range-checking, the tracked information is also used for enforcing
-alignment of pointer accesses.  For instance, on most systems the packet pointer
-is 2 bytes after a 4-byte alignment.  If a program adds 14 bytes to that to jump
-over the Ethernet header, then reads IHL and addes (IHL * 4), the resulting
-pointer will have a variable offset known to be 4n+2 for some n, so adding the 2
-bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through
-that pointer are safe.
-The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common
-to all copies of the pointer returned from a socket lookup. This has similar
-behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but
-it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly
-represents a reference to the corresponding ``struct sock``. To ensure that the
-reference is not leaked, it is imperative to NULL-check the reference and in
-the non-NULL case, and pass the valid reference to the socket release function.
-
-Direct packet access
---------------------
-In cls_bpf and act_bpf programs the verifier allows direct access to the packet
-data via skb->data and skb->data_end pointers.
-Ex::
-
-    1:  r4 = *(u32 *)(r1 +80)  /* load skb->data_end */
-    2:  r3 = *(u32 *)(r1 +76)  /* load skb->data */
-    3:  r5 = r3
-    4:  r5 += 14
-    5:  if r5 > r4 goto pc+16
-    R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
-    6:  r0 = *(u16 *)(r3 +12) /* access 12 and 13 bytes of the packet */
-
-this 2byte load from the packet is safe to do, since the program author
-did check ``if (skb->data + 14 > skb->data_end) goto err`` at insn #5 which
-means that in the fall-through case the register R3 (which points to skb->data)
-has at least 14 directly accessible bytes. The verifier marks it
-as R3=pkt(id=0,off=0,r=14).
-id=0 means that no additional variables were added to the register.
-off=0 means that no additional constants were added.
-r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok.
-Note that R5 is marked as R5=pkt(id=0,off=14,r=14). It also points
-to the packet data, but constant 14 was added to the register, so
-it now points to ``skb->data + 14`` and accessible range is [R5, R5 + 14 - 14)
-which is zero bytes.
-
-More complex packet access may look like::
-
-
-    R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
-    6:  r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */
-    7:  r4 = *(u8 *)(r3 +12)
-    8:  r4 *= 14
-    9:  r3 = *(u32 *)(r1 +76) /* load skb->data */
-    10:  r3 += r4
-    11:  r2 = r1
-    12:  r2 <<= 48
-    13:  r2 >>= 48
-    14:  r3 += r2
-    15:  r2 = r3
-    16:  r2 += 8
-    17:  r1 = *(u32 *)(r1 +80) /* load skb->data_end */
-    18:  if r2 > r1 goto pc+2
-    R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp
-    19:  r1 = *(u8 *)(r3 +4)
-
-The state of the register R3 is R3=pkt(id=2,off=0,r=8)
-id=2 means that two ``r3 += rX`` instructions were seen, so r3 points to some
-offset within a packet and since the program author did
-``if (r3 + 8 > r1) goto err`` at insn #18, the safe range is [R3, R3 + 8).
-The verifier only allows 'add'/'sub' operations on packet registers. Any other
-operation will set the register state to 'SCALAR_VALUE' and it won't be
-available for direct packet access.
-
-Operation ``r3 += rX`` may overflow and become less than original skb->data,
-therefore the verifier has to prevent that.  So when it sees ``r3 += rX``
-instruction and rX is more than 16-bit value, any subsequent bounds-check of r3
-against skb->data_end will not give us 'range' information, so attempts to read
-through the pointer will give "invalid access to packet" error.
-
-Ex. after insn ``r4 = *(u8 *)(r3 +12)`` (insn #7 above) the state of r4 is
-R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits
-of the register are guaranteed to be zero, and nothing is known about the lower
-8 bits. After insn ``r4 *= 14`` the state becomes
-R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit
-value by constant 14 will keep upper 52 bits as zero, also the least significant
-bit will be zero as 14 is even.  Similarly ``r2 >>= 48`` will make
-R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign
-extending.  This logic is implemented in adjust_reg_min_max_vals() function,
-which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice
-versa) and adjust_scalar_min_max_vals() for operations on two scalars.
-
-The end result is that bpf program author can access packet directly
-using normal C code as::
-
-  void *data = (void *)(long)skb->data;
-  void *data_end = (void *)(long)skb->data_end;
-  struct eth_hdr *eth = data;
-  struct iphdr *iph = data + sizeof(*eth);
-  struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph);
-
-  if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end)
-	  return 0;
-  if (eth->h_proto != htons(ETH_P_IP))
-	  return 0;
-  if (iph->protocol != IPPROTO_UDP || iph->ihl != 5)
-	  return 0;
-  if (udp->dest == 53 || udp->source == 9)
-	  ...;
-
-which makes such programs easier to write comparing to LD_ABS insn
-and significantly faster.
-
-eBPF maps
----------
-'maps' is a generic storage of different types for sharing data between kernel
-and userspace.
-
-The maps are accessed from user space via BPF syscall, which has commands:
-
-- create a map with given type and attributes
-  ``map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)``
-  using attr->map_type, attr->key_size, attr->value_size, attr->max_entries
-  returns process-local file descriptor or negative error
-
-- lookup key in a given map
-  ``err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)``
-  using attr->map_fd, attr->key, attr->value
-  returns zero and stores found elem into value or negative error
-
-- create or update key/value pair in a given map
-  ``err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)``
-  using attr->map_fd, attr->key, attr->value
-  returns zero or negative error
-
-- find and delete element by key in a given map
-  ``err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)``
-  using attr->map_fd, attr->key
-
-- to delete map: close(fd)
-  Exiting process will delete maps automatically
-
-userspace programs use this syscall to create/access maps that eBPF programs
-are concurrently updating.
-
-maps can have different types: hash, array, bloom filter, radix-tree, etc.
-
-The map is defined by:
-
-  - type
-  - max number of elements
-  - key size in bytes
-  - value size in bytes
-
-Pruning
--------
-The verifier does not actually walk all possible paths through the program.  For
-each new branch to analyse, the verifier looks at all the states it's previously
-been in when at this instruction.  If any of them contain the current state as a
-subset, the branch is 'pruned' - that is, the fact that the previous state was
-accepted implies the current state would be as well.  For instance, if in the
-previous state, r1 held a packet-pointer, and in the current state, r1 holds a
-packet-pointer with a range as long or longer and at least as strict an
-alignment, then r1 is safe.  Similarly, if r2 was NOT_INIT before then it can't
-have been used by any path from that point, so any value in r2 (including
-another NOT_INIT) is safe.  The implementation is in the function regsafe().
-Pruning considers not only the registers but also the stack (and any spilled
-registers it may hold).  They must all be safe for the branch to be pruned.
-This is implemented in states_equal().
-
-Understanding eBPF verifier messages
-------------------------------------
-
-The following are few examples of invalid eBPF programs and verifier error
-messages as seen in the log:
-
-Program with unreachable instructions::
-
-  static struct bpf_insn prog[] = {
-  BPF_EXIT_INSN(),
-  BPF_EXIT_INSN(),
-  };
-
-Error:
-
-  unreachable insn 1
-
-Program that reads uninitialized register::
-
-  BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
-  BPF_EXIT_INSN(),
-
-Error::
-
-  0: (bf) r0 = r2
-  R2 !read_ok
-
-Program that doesn't initialize R0 before exiting::
-
-  BPF_MOV64_REG(BPF_REG_2, BPF_REG_1),
-  BPF_EXIT_INSN(),
-
-Error::
-
-  0: (bf) r2 = r1
-  1: (95) exit
-  R0 !read_ok
-
-Program that accesses stack out of bounds::
-
-    BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0),
-    BPF_EXIT_INSN(),
-
-Error::
-
-    0: (7a) *(u64 *)(r10 +8) = 0
-    invalid stack off=8 size=8
-
-Program that doesn't initialize stack before passing its address into function::
-
-  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
-  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
-  BPF_LD_MAP_FD(BPF_REG_1, 0),
-  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
-  BPF_EXIT_INSN(),
-
-Error::
-
-  0: (bf) r2 = r10
-  1: (07) r2 += -8
-  2: (b7) r1 = 0x0
-  3: (85) call 1
-  invalid indirect read from stack off -8+0 size 8
-
-Program that uses invalid map_fd=0 while calling to map_lookup_elem() function::
-
-  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
-  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
-  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
-  BPF_LD_MAP_FD(BPF_REG_1, 0),
-  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
-  BPF_EXIT_INSN(),
-
-Error::
-
-  0: (7a) *(u64 *)(r10 -8) = 0
-  1: (bf) r2 = r10
-  2: (07) r2 += -8
-  3: (b7) r1 = 0x0
-  4: (85) call 1
-  fd 0 is not pointing to valid bpf_map
-
-Program that doesn't check return value of map_lookup_elem() before accessing
-map element::
-
-  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
-  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
-  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
-  BPF_LD_MAP_FD(BPF_REG_1, 0),
-  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
-  BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
-  BPF_EXIT_INSN(),
-
-Error::
-
-  0: (7a) *(u64 *)(r10 -8) = 0
-  1: (bf) r2 = r10
-  2: (07) r2 += -8
-  3: (b7) r1 = 0x0
-  4: (85) call 1
-  5: (7a) *(u64 *)(r0 +0) = 0
-  R0 invalid mem access 'map_value_or_null'
-
-Program that correctly checks map_lookup_elem() returned value for NULL, but
-accesses the memory with incorrect alignment::
-
-  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
-  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
-  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
-  BPF_LD_MAP_FD(BPF_REG_1, 0),
-  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
-  BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
-  BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
-  BPF_EXIT_INSN(),
-
-Error::
-
-  0: (7a) *(u64 *)(r10 -8) = 0
-  1: (bf) r2 = r10
-  2: (07) r2 += -8
-  3: (b7) r1 = 1
-  4: (85) call 1
-  5: (15) if r0 == 0x0 goto pc+1
-   R0=map_ptr R10=fp
-  6: (7a) *(u64 *)(r0 +4) = 0
-  misaligned access off 4 size 8
-
-Program that correctly checks map_lookup_elem() returned value for NULL and
-accesses memory with correct alignment in one side of 'if' branch, but fails
-to do so in the other side of 'if' branch::
-
-  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
-  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
-  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
-  BPF_LD_MAP_FD(BPF_REG_1, 0),
-  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
-  BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
-  BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
-  BPF_EXIT_INSN(),
-  BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1),
-  BPF_EXIT_INSN(),
-
-Error::
-
-  0: (7a) *(u64 *)(r10 -8) = 0
-  1: (bf) r2 = r10
-  2: (07) r2 += -8
-  3: (b7) r1 = 1
-  4: (85) call 1
-  5: (15) if r0 == 0x0 goto pc+2
-   R0=map_ptr R10=fp
-  6: (7a) *(u64 *)(r0 +0) = 0
-  7: (95) exit
-
-  from 5 to 8: R0=imm0 R10=fp
-  8: (7a) *(u64 *)(r0 +0) = 1
-  R0 invalid mem access 'imm'
-
-Program that performs a socket lookup then sets the pointer to NULL without
-checking it::
-
-  BPF_MOV64_IMM(BPF_REG_2, 0),
-  BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
-  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
-  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
-  BPF_MOV64_IMM(BPF_REG_3, 4),
-  BPF_MOV64_IMM(BPF_REG_4, 0),
-  BPF_MOV64_IMM(BPF_REG_5, 0),
-  BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
-  BPF_MOV64_IMM(BPF_REG_0, 0),
-  BPF_EXIT_INSN(),
-
-Error::
-
-  0: (b7) r2 = 0
-  1: (63) *(u32 *)(r10 -8) = r2
-  2: (bf) r2 = r10
-  3: (07) r2 += -8
-  4: (b7) r3 = 4
-  5: (b7) r4 = 0
-  6: (b7) r5 = 0
-  7: (85) call bpf_sk_lookup_tcp#65
-  8: (b7) r0 = 0
-  9: (95) exit
-  Unreleased reference id=1, alloc_insn=7
-
-Program that performs a socket lookup but does not NULL-check the returned
-value::
-
-  BPF_MOV64_IMM(BPF_REG_2, 0),
-  BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
-  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
-  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
-  BPF_MOV64_IMM(BPF_REG_3, 4),
-  BPF_MOV64_IMM(BPF_REG_4, 0),
-  BPF_MOV64_IMM(BPF_REG_5, 0),
-  BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
-  BPF_EXIT_INSN(),
-
-Error::
-
-  0: (b7) r2 = 0
-  1: (63) *(u32 *)(r10 -8) = r2
-  2: (bf) r2 = r10
-  3: (07) r2 += -8
-  4: (b7) r3 = 4
-  5: (b7) r4 = 0
-  6: (b7) r5 = 0
-  7: (85) call bpf_sk_lookup_tcp#65
-  8: (95) exit
-  Unreleased reference id=1, alloc_insn=7
+sparc64, arm32, riscv64, riscv32, loongarch64, arc perform JIT compilation
+from eBPF instruction set.
 
 Testing
 -------
 
 Next to the BPF toolchain, the kernel also ships a test module that contains
-various test cases for classic and internal BPF that can be executed against
+various test cases for classic and eBPF that can be executed against
 the BPF interpreter and JIT compiler. It can be found in lib/test_bpf.c and
 enabled via Kconfig::
 
diff --git a/Documentation/networking/framerelay.rst b/Documentation/networking/framerelay.rst
deleted file mode 100644
index 6d904399ec6d..000000000000
--- a/Documentation/networking/framerelay.rst
+++ /dev/null
@@ -1,44 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-================
-Frame Relay (FR)
-================
-
-Frame Relay (FR) support for linux is built into a two tiered system of device
-drivers.  The upper layer implements RFC1490 FR specification, and uses the
-Data Link Connection Identifier (DLCI) as its hardware address.  Usually these
-are assigned by your network supplier, they give you the number/numbers of
-the Virtual Connections (VC) assigned to you.
-
-Each DLCI is a point-to-point link between your machine and a remote one.
-As such, a separate device is needed to accommodate the routing.  Within the
-net-tools archives is 'dlcicfg'.  This program will communicate with the
-base "DLCI" device, and create new net devices named 'dlci00', 'dlci01'...
-The configuration script will ask you how many DLCIs you need, as well as
-how many DLCIs you want to assign to each Frame Relay Access Device (FRAD).
-
-The DLCI uses a number of function calls to communicate with the FRAD, all
-of which are stored in the FRAD's private data area.  assoc/deassoc,
-activate/deactivate and dlci_config.  The DLCI supplies a receive function
-to the FRAD to accept incoming packets.
-
-With this initial offering, only 1 FRAD driver is available.  With many thanks
-to Sangoma Technologies, David Mandelstam & Gene Kozin, the S502A, S502E &
-S508 are supported.  This driver is currently set up for only FR, but as
-Sangoma makes more firmware modules available, it can be updated to provide
-them as well.
-
-Configuration of the FRAD makes use of another net-tools program, 'fradcfg'.
-This program makes use of a configuration file (which dlcicfg can also read)
-to specify the types of boards to be configured as FRADs, as well as perform
-any board specific configuration.  The Sangoma module of fradcfg loads the
-FR firmware into the card, sets the irq/port/memory information, and provides
-an initial configuration.
-
-Additional FRAD device drivers can be added as hardware is available.
-
-At this time, the dlcicfg and fradcfg programs have not been incorporated into
-the net-tools distribution.  They can be found at ftp.invlogic.com, in
-/pub/linux.  Note that with OS/2 FTPD, you end up in /pub by default, so just
-use 'cd linux'.  v0.10 is for use on pre-2.0.3 and earlier, v0.15 is for
-pre-2.0.4 and later.
diff --git a/Documentation/networking/generic_netlink.rst b/Documentation/networking/generic_netlink.rst
index 59e04ccf80c1..d960dbd7e80e 100644
--- a/Documentation/networking/generic_netlink.rst
+++ b/Documentation/networking/generic_netlink.rst
@@ -6,4 +6,4 @@ Generic Netlink
 
 A wiki document on how to use Generic Netlink can be found here:
 
- * http://www.linuxfoundation.org/collaborate/workgroups/networking/generic_netlink_howto
+ * https://wiki.linuxfoundation.org/networking/generic_netlink_howto
diff --git a/Documentation/networking/gtp.rst b/Documentation/networking/gtp.rst
index 1563fb94b289..9a7835cc1437 100644
--- a/Documentation/networking/gtp.rst
+++ b/Documentation/networking/gtp.rst
@@ -162,7 +162,7 @@ Local GTP-U entity and tunnel identification
 GTP-U uses UDP for transporting PDU's. The receiving UDP port is 2152
 for GTPv1-U and 3386 for GTPv0-U.
 
-There is only one GTP-U entity (and therefor SGSN/GGSN/S-GW/PDN-GW
+There is only one GTP-U entity (and therefore SGSN/GGSN/S-GW/PDN-GW
 instance) per IP address. Tunnel Endpoint Identifier (TEID) are unique
 per GTP-U entity.
 
diff --git a/Documentation/networking/ieee802154.rst b/Documentation/networking/ieee802154.rst
index 6f4bf8447a21..743c0a80e309 100644
--- a/Documentation/networking/ieee802154.rst
+++ b/Documentation/networking/ieee802154.rst
@@ -26,7 +26,9 @@ The stack is composed of three main parts:
 Socket API
 ==========
 
-.. c:function:: int sd = socket(PF_IEEE802154, SOCK_DGRAM, 0);
+::
+
+    int sd = socket(PF_IEEE802154, SOCK_DGRAM, 0);
 
 The address family, socket addresses etc. are defined in the
 include/net/af_ieee802154.h header or in the special header
@@ -68,9 +70,10 @@ Like with WiFi, there are several types of devices implementing IEEE 802.15.4.
 exports a management (e.g. MLME) and data API.
 2) 'SoftMAC' or just radio. These types of devices are just radio transceivers
 possibly with some kinds of acceleration like automatic CRC computation and
-comparation, automagic ACK handling, address matching, etc.
+comparison, automagic ACK handling, address matching, etc.
 
-Those types of devices require different approach to be hooked into Linux kernel.
+Each type of device requires a different approach to be hooked into the Linux
+kernel.
 
 HardMAC
 -------
@@ -79,10 +82,10 @@ See the header include/net/ieee802154_netdev.h. You have to implement Linux
 net_device, with .type = ARPHRD_IEEE802154. Data is exchanged with socket family
 code via plain sk_buffs. On skb reception skb->cb must contain additional
 info as described in the struct ieee802154_mac_cb. During packet transmission
-the skb->cb is used to provide additional data to device's header_ops->create
-function. Be aware that this data can be overridden later (when socket code
-submits skb to qdisc), so if you need something from that cb later, you should
-store info in the skb->data on your own.
+the skb->cb is used to provide additional data to the device's
+header_ops->create function. Be aware that this data can be overridden later
+(when socket code submits skb to qdisc), so if you need something from that cb
+later, you should store info in the skb->data on your own.
 
 To hook the MLME interface you have to populate the ml_priv field of your
 net_device with a pointer to struct ieee802154_mlme_ops instance. The fields
@@ -92,8 +95,9 @@ All other fields are required.
 SoftMAC
 -------
 
-The MAC is the middle layer in the IEEE 802.15.4 Linux stack. This moment it
-provides interface for drivers registration and management of slave interfaces.
+The MAC is the middle layer in the IEEE 802.15.4 Linux stack. At the moment, it
+provides an interface for driver registration and management of slave
+interfaces.
 
 NOTE: Currently the only monitor device type is supported - it's IEEE 802.15.4
 stack interface for network sniffers (e.g. WireShark).
@@ -131,12 +135,12 @@ Register PHY in the system.
 
 Freeing registered PHY.
 
-.. c:function:: void ieee802154_rx_irqsafe(struct ieee802154_hw *hw, struct sk_buff *skb, u8 lqi):
+.. c:function:: void ieee802154_rx_irqsafe(struct ieee802154_hw *hw, struct sk_buff *skb, u8 lqi)
 
 Telling 802.15.4 module there is a new received frame in the skb with
 the RF Link Quality Indicator (LQI) from the hardware device.
 
-.. c:function:: void ieee802154_xmit_complete(struct ieee802154_hw *hw, struct sk_buff *skb, bool ifs_handling):
+.. c:function:: void ieee802154_xmit_complete(struct ieee802154_hw *hw, struct sk_buff *skb, bool ifs_handling)
 
 Telling 802.15.4 module the frame in the skb is or going to be
 transmitted through the hardware device
@@ -155,25 +159,25 @@ operations structure at least::
         ...
    };
 
-.. c:function:: int start(struct ieee802154_hw *hw):
+.. c:function:: int start(struct ieee802154_hw *hw)
 
 Handler that 802.15.4 module calls for the hardware device initialization.
 
-.. c:function:: void stop(struct ieee802154_hw *hw):
+.. c:function:: void stop(struct ieee802154_hw *hw)
 
 Handler that 802.15.4 module calls for the hardware device cleanup.
 
-.. c:function:: int xmit_async(struct ieee802154_hw *hw, struct sk_buff *skb):
+.. c:function:: int xmit_async(struct ieee802154_hw *hw, struct sk_buff *skb)
 
 Handler that 802.15.4 module calls for each frame in the skb going to be
 transmitted through the hardware device.
 
-.. c:function:: int ed(struct ieee802154_hw *hw, u8 *level):
+.. c:function:: int ed(struct ieee802154_hw *hw, u8 *level)
 
 Handler that 802.15.4 module calls for Energy Detection from the hardware
 device.
 
-.. c:function:: int set_channel(struct ieee802154_hw *hw, u8 page, u8 channel):
+.. c:function:: int set_channel(struct ieee802154_hw *hw, u8 page, u8 channel)
 
 Set radio for listening on specific channel of the hardware device.
 
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 0186e276690a..ac90b82f3ce9 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -1,26 +1,28 @@
-Linux Networking Documentation
-==============================
+Networking
+==========
+
+Refer to :ref:`netdev-FAQ` for a guide on netdev development process specifics.
 
 Contents:
 
 .. toctree::
    :maxdepth: 2
 
-   netdev-FAQ
    af_xdp
    bareudp
    batman-adv
    can
    can_ucan_protocol
    device_drivers/index
+   diagnostic/index
    dsa/index
    devlink/index
    caif/index
    ethtool-netlink
    ieee802154
+   iso15765-2
    j1939
    kapi
-   z8530book
    msg_zerocopy
    failover
    net_dim
@@ -36,39 +38,32 @@ Contents:
    scaling
    tls
    tls-offload
+   tls-handshake
    nfc
    6lowpan
    6pack
-   altera_tse
    arcnet-hardware
    arcnet
    atm
    ax25
-   baycom
    bonding
    cdc_mbim
-   cops
-   cxacru
-   dccp
    dctcp
-   decnet
-   defza
+   devmem
    dns_resolver
    driver
    eql
    fib_trie
    filter
-   fore200e
-   framerelay
    generic-hdlc
    generic_netlink
+   netlink_spec/index
    gen_stats
    gtp
-   hinic
    ila
-   ipddp
+   ioam6-sysctl
+   iou-zcrx
    ip_dynaddr
-   iphase
    ipsec
    ip-sysctl
    ipv6
@@ -77,52 +72,70 @@ Contents:
    kcm
    l2tp
    lapb-module
-   ltpc
    mac80211-injection
+   mctp
    mpls-sysctl
+   mptcp
+   mptcp-sysctl
    multiqueue
+   multi-pf-netdev
+   napi
+   net_cachelines/index
    netconsole
    netdev-features
    netdevices
    netfilter-sysctl
    netif-msg
+   netmem
+   nexthop-group-resilient
    nf_conntrack-sysctl
    nf_flowtable
+   oa-tc6-framework
    openvswitch
    operstates
    packet_mmap
    phonet
+   phy-link-topology
    pktgen
    plip
    ppp_generic
    proc_net_tcp
+   pse-pd/index
    radiotap-headers
-   ray_cs
    rds
    regulatory
+   representors
    rxrpc
    sctp
    secid
    seg6-sysctl
-   skfp
+   skbuff
+   smc-sysctl
+   sriov
+   statistics
    strparser
    switchdev
+   sysfs-tagging
    tc-actions-env-rules
+   tc-queue-filters
+   tcp_ao
    tcp-thin
    team
    timestamping
+   tipc
    tproxy
    tuntap
    udplite
    vrf
    vxlan
-   x25-iface
    x25
+   x25-iface
    xfrm_device
    xfrm_proc
    xfrm_sync
    xfrm_sysctl
-   z8530drv
+   xdp-rx-metadata
+   xsk-tx-metadata
 
 .. only::  subproject and html
 
diff --git a/Documentation/networking/ioam6-sysctl.rst b/Documentation/networking/ioam6-sysctl.rst
new file mode 100644
index 000000000000..c18cab2c481a
--- /dev/null
+++ b/Documentation/networking/ioam6-sysctl.rst
@@ -0,0 +1,26 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+IOAM6 Sysfs variables
+=====================
+
+
+/proc/sys/net/conf/<iface>/ioam6_* variables:
+=============================================
+
+ioam6_enabled - BOOL
+        Accept (= enabled) or ignore (= disabled) IPv6 IOAM options on ingress
+        for this interface.
+
+        * 0 - disabled (default)
+        * 1 - enabled
+
+ioam6_id - SHORT INTEGER
+        Define the IOAM id of this interface.
+
+        Default is ~0.
+
+ioam6_id_wide - INTEGER
+        Define the wide IOAM id of this interface.
+
+        Default is ~0.
diff --git a/Documentation/networking/iou-zcrx.rst b/Documentation/networking/iou-zcrx.rst
new file mode 100644
index 000000000000..0127319b30bb
--- /dev/null
+++ b/Documentation/networking/iou-zcrx.rst
@@ -0,0 +1,202 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+io_uring zero copy Rx
+=====================
+
+Introduction
+============
+
+io_uring zero copy Rx (ZC Rx) is a feature that removes kernel-to-user copy on
+the network receive path, allowing packet data to be received directly into
+userspace memory. This feature is different to TCP_ZEROCOPY_RECEIVE in that
+there are no strict alignment requirements and no need to mmap()/munmap().
+Compared to kernel bypass solutions such as e.g. DPDK, the packet headers are
+processed by the kernel TCP stack as normal.
+
+NIC HW Requirements
+===================
+
+Several NIC HW features are required for io_uring ZC Rx to work. For now the
+kernel API does not configure the NIC and it must be done by the user.
+
+Header/data split
+-----------------
+
+Required to split packets at the L4 boundary into a header and a payload.
+Headers are received into kernel memory as normal and processed by the TCP
+stack as normal. Payloads are received into userspace memory directly.
+
+Flow steering
+-------------
+
+Specific HW Rx queues are configured for this feature, but modern NICs
+typically distribute flows across all HW Rx queues. Flow steering is required
+to ensure that only desired flows are directed towards HW queues that are
+configured for io_uring ZC Rx.
+
+RSS
+---
+
+In addition to flow steering above, RSS is required to steer all other non-zero
+copy flows away from queues that are configured for io_uring ZC Rx.
+
+Usage
+=====
+
+Setup NIC
+---------
+
+Must be done out of band for now.
+
+Ensure there are at least two queues::
+
+  ethtool -L eth0 combined 2
+
+Enable header/data split::
+
+  ethtool -G eth0 tcp-data-split on
+
+Carve out half of the HW Rx queues for zero copy using RSS::
+
+  ethtool -X eth0 equal 1
+
+Set up flow steering, bearing in mind that queues are 0-indexed::
+
+  ethtool -N eth0 flow-type tcp6 ... action 1
+
+Setup io_uring
+--------------
+
+This section describes the low level io_uring kernel API. Please refer to
+liburing documentation for how to use the higher level API.
+
+Create an io_uring instance with the following required setup flags::
+
+  IORING_SETUP_SINGLE_ISSUER
+  IORING_SETUP_DEFER_TASKRUN
+  IORING_SETUP_CQE32
+
+Create memory area
+------------------
+
+Allocate userspace memory area for receiving zero copy data::
+
+  void *area_ptr = mmap(NULL, area_size,
+                        PROT_READ | PROT_WRITE,
+                        MAP_ANONYMOUS | MAP_PRIVATE,
+                        0, 0);
+
+Create refill ring
+------------------
+
+Allocate memory for a shared ringbuf used for returning consumed buffers::
+
+  void *ring_ptr = mmap(NULL, ring_size,
+                        PROT_READ | PROT_WRITE,
+                        MAP_ANONYMOUS | MAP_PRIVATE,
+                        0, 0);
+
+This refill ring consists of some space for the header, followed by an array of
+``struct io_uring_zcrx_rqe``::
+
+  size_t rq_entries = 4096;
+  size_t ring_size = rq_entries * sizeof(struct io_uring_zcrx_rqe) + PAGE_SIZE;
+  /* align to page size */
+  ring_size = (ring_size + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1);
+
+Register ZC Rx
+--------------
+
+Fill in registration structs::
+
+  struct io_uring_zcrx_area_reg area_reg = {
+    .addr = (__u64)(unsigned long)area_ptr,
+    .len = area_size,
+    .flags = 0,
+  };
+
+  struct io_uring_region_desc region_reg = {
+    .user_addr = (__u64)(unsigned long)ring_ptr,
+    .size = ring_size,
+    .flags = IORING_MEM_REGION_TYPE_USER,
+  };
+
+  struct io_uring_zcrx_ifq_reg reg = {
+    .if_idx = if_nametoindex("eth0"),
+    /* this is the HW queue with desired flow steered into it */
+    .if_rxq = 1,
+    .rq_entries = rq_entries,
+    .area_ptr = (__u64)(unsigned long)&area_reg,
+    .region_ptr = (__u64)(unsigned long)&region_reg,
+  };
+
+Register with kernel::
+
+  io_uring_register_ifq(ring, &reg);
+
+Map refill ring
+---------------
+
+The kernel fills in fields for the refill ring in the registration ``struct
+io_uring_zcrx_ifq_reg``. Map it into userspace::
+
+  struct io_uring_zcrx_rq refill_ring;
+
+  refill_ring.khead = (unsigned *)((char *)ring_ptr + reg.offsets.head);
+  refill_ring.khead = (unsigned *)((char *)ring_ptr + reg.offsets.tail);
+  refill_ring.rqes =
+    (struct io_uring_zcrx_rqe *)((char *)ring_ptr + reg.offsets.rqes);
+  refill_ring.rq_tail = 0;
+  refill_ring.ring_ptr = ring_ptr;
+
+Receiving data
+--------------
+
+Prepare a zero copy recv request::
+
+  struct io_uring_sqe *sqe;
+
+  sqe = io_uring_get_sqe(ring);
+  io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, fd, NULL, 0, 0);
+  sqe->ioprio |= IORING_RECV_MULTISHOT;
+
+Now, submit and wait::
+
+  io_uring_submit_and_wait(ring, 1);
+
+Finally, process completions::
+
+  struct io_uring_cqe *cqe;
+  unsigned int count = 0;
+  unsigned int head;
+
+  io_uring_for_each_cqe(ring, head, cqe) {
+    struct io_uring_zcrx_cqe *rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);
+
+    unsigned long mask = (1ULL << IORING_ZCRX_AREA_SHIFT) - 1;
+    unsigned char *data = area_ptr + (rcqe->off & mask);
+    /* do something with the data */
+
+    count++;
+  }
+  io_uring_cq_advance(ring, count);
+
+Recycling buffers
+-----------------
+
+Return buffers back to the kernel to be used again::
+
+  struct io_uring_zcrx_rqe *rqe;
+  unsigned mask = refill_ring.ring_entries - 1;
+  rqe = &refill_ring.rqes[refill_ring.rq_tail & mask];
+
+  unsigned long area_offset = rcqe->off & ~IORING_ZCRX_AREA_MASK;
+  rqe->off = area_offset | area_reg.rq_area_token;
+  rqe->len = cqe->res;
+  IO_URING_WRITE_ONCE(*refill_ring.ktail, ++refill_ring.rq_tail);
+
+Testing
+=======
+
+See ``tools/testing/selftests/drivers/net/hw/iou-zcrx.c``
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 837d51f9e1fa..0f1251cce314 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -25,7 +25,8 @@ ip_default_ttl - INTEGER
 ip_no_pmtu_disc - INTEGER
 	Disable Path MTU Discovery. If enabled in mode 1 and a
 	fragmentation-required ICMP is received, the PMTU to this
-	destination will be set to min_pmtu (see below). You will need
+	destination will be set to the smallest of the old MTU to
+	this destination and min_pmtu (see below). You will need
 	to raise min_pmtu to the smallest interface MTU on your system
 	manually if you want to avoid locally generated fragments.
 
@@ -36,8 +37,8 @@ ip_no_pmtu_disc - INTEGER
 	Mode 3 is a hardened pmtu discover mode. The kernel will only
 	accept fragmentation-needed errors if the underlying protocol
 	can verify them besides a plain socket lookup. Current
-	protocols for which pmtu events will be honored are TCP, SCTP
-	and DCCP as they verify e.g. the sequence number or the
+	protocols for which pmtu events will be honored are TCP and
+	SCTP as they verify e.g. the sequence number or the
 	association. This mode should not be enabled globally but is
 	only intended to secure e.g. name servers in namespaces where
 	TCP path mtu must still work but path MTU information of other
@@ -49,7 +50,8 @@ ip_no_pmtu_disc - INTEGER
 	Default: FALSE
 
 min_pmtu - INTEGER
-	default 552 - minimum discovered Path MTU
+	default 552 - minimum Path MTU. Unless this is changed manually,
+	each cached pmtu will never be lower than this setting.
 
 ip_forward_use_pmtu - BOOLEAN
 	By default we don't trust protocol path MTUs while forwarding
@@ -99,6 +101,49 @@ fib_multipath_hash_policy - INTEGER
 	- 0 - Layer 3
 	- 1 - Layer 4
 	- 2 - Layer 3 or inner Layer 3 if present
+	- 3 - Custom multipath hash. Fields used for multipath hash calculation
+	  are determined by fib_multipath_hash_fields sysctl
+
+fib_multipath_hash_fields - UNSIGNED INTEGER
+	When fib_multipath_hash_policy is set to 3 (custom multipath hash), the
+	fields used for multipath hash calculation are determined by this
+	sysctl.
+
+	This value is a bitmask which enables various fields for multipath hash
+	calculation.
+
+	Possible fields are:
+
+	====== ============================
+	0x0001 Source IP address
+	0x0002 Destination IP address
+	0x0004 IP protocol
+	0x0008 Unused (Flow Label)
+	0x0010 Source port
+	0x0020 Destination port
+	0x0040 Inner source IP address
+	0x0080 Inner destination IP address
+	0x0100 Inner IP protocol
+	0x0200 Inner Flow Label
+	0x0400 Inner source port
+	0x0800 Inner destination port
+	====== ============================
+
+	Default: 0x0007 (source IP, destination IP and IP protocol)
+
+fib_multipath_hash_seed - UNSIGNED INTEGER
+	The seed value used when calculating hash for multipath routes. Applies
+	to both IPv4 and IPv6 datapath. Only present for kernels built with
+	CONFIG_IP_ROUTE_MULTIPATH enabled.
+
+	When set to 0, the seed value used for multipath routing defaults to an
+	internal random-generated one.
+
+	The actual hashing algorithm is not specified -- there is no guarantee
+	that a next hop distribution effected by a given seed will keep stable
+	across kernel versions.
+
+	Default: 0 (random)
 
 fib_sync_mem - UNSIGNED INTEGER
 	Amount of dirty memory from fib entries that can be backlogged before
@@ -125,6 +170,9 @@ route/max_size - INTEGER
 	From linux kernel 3.6 onwards, this is deprecated for ipv4
 	as route cache is no longer used.
 
+	From linux kernel 6.3 onwards, this is deprecated for ipv6
+	as garbage collection manages cached route entries.
+
 neigh/default/gc_thresh1 - INTEGER
 	Minimum number of entries to keep.  Garbage collector will not
 	purge entries if there are fewer than this number.
@@ -171,6 +219,12 @@ neigh/default/unres_qlen - INTEGER
 
 	Default: 101
 
+neigh/default/interval_probe_time_ms - INTEGER
+	The probe interval for neighbor entries with NTF_MANAGED flag,
+	the min value is 1.
+
+	Default: 5000
+
 mtu_expires - INTEGER
 	Time, in seconds, that cached PMTU information is kept.
 
@@ -178,6 +232,27 @@ min_adv_mss - INTEGER
 	The advertised MSS depends on the first hop route MTU, but will
 	never be lower than this setting.
 
+fib_notify_on_flag_change - INTEGER
+        Whether to emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/
+        RTM_F_TRAP/RTM_F_OFFLOAD_FAILED flags are changed.
+
+        After installing a route to the kernel, user space receives an
+        acknowledgment, which means the route was installed in the kernel,
+        but not necessarily in hardware.
+        It is also possible for a route already installed in hardware to change
+        its action and therefore its flags. For example, a host route that is
+        trapping packets can be "promoted" to perform decapsulation following
+        the installation of an IPinIP/VXLAN tunnel.
+        The notifications will indicate to user-space the state of the route.
+
+        Default: 0 (Do not emit notifications.)
+
+        Possible values:
+
+        - 0 - Do not emit notifications.
+        - 1 - Emit notifications.
+        - 2 - Emit notifications only for RTM_F_OFFLOAD_FAILED flag change.
+
 IP Fragmentation:
 
 ipfrag_high_thresh - LONG INTEGER
@@ -215,6 +290,13 @@ ipfrag_max_dist - INTEGER
 	from different IP datagrams, which could result in data corruption.
 	Default: 64
 
+bc_forwarding - INTEGER
+	bc_forwarding enables the feature described in rfc1812#section-5.3.5.2
+	and rfc2644. It allows the router to forward directed broadcast.
+	To enable this feature, the 'all' entry and the input interface entry
+	should be set to 1.
+	Default: 0
+
 INET peer storage
 =================
 
@@ -253,6 +335,7 @@ tcp_abort_on_overflow - BOOLEAN
 	option can harm clients of your server.
 
 tcp_adv_win_scale - INTEGER
+	Obsolete since linux-6.6
 	Count buffering overhead as bytes/2^tcp_adv_win_scale
 	(if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale),
 	if it is <= 0.
@@ -272,6 +355,8 @@ tcp_app_win - INTEGER
 	Reserve max(window/2^tcp_app_win, mss) of window for application
 	buffer. Value 0 is special, it means that nothing is reserved.
 
+	Possible values are [0, 31], inclusive.
+
 	Default: 31
 
 tcp_autocorking - BOOLEAN
@@ -571,6 +656,16 @@ tcp_recovery - INTEGER
 
 	Default: 0x1
 
+tcp_reflect_tos - BOOLEAN
+	For listening sockets, reuse the DSCP value of the initial SYN message
+	for outgoing packets. This allows to have both directions of a TCP
+	stream to use the same DSCP value, assuming DSCP remains unchanged for
+	the lifetime of the connection.
+
+	This options affects both IPv4 and IPv6.
+
+	Default: 0 (disabled)
+
 tcp_reordering - INTEGER
 	Initial reordering level of packets in a TCP stream.
 	TCP stack can then dynamically adjust flow reordering level
@@ -610,6 +705,8 @@ tcp_retries2 - INTEGER
 	seconds and is a lower bound for the effective timeout.
 	TCP will effectively time out at the first RTO which exceeds the
 	hypothetical timeout.
+	If tcp_rto_max_ms is decreased, it is recommended to also
+	change tcp_retries2.
 
 	RFC 1122 recommends at least 100 seconds for the timeout,
 	which corresponds to a value of at least 8.
@@ -630,16 +727,15 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max
 
 	default: initial size of receive buffer used by TCP sockets.
 	This value overrides net.core.rmem_default used by other protocols.
-	Default: 87380 bytes. This value results in window of 65535 with
-	default setting of tcp_adv_win_scale and tcp_app_win:0 and a bit
-	less for default tcp_app_win. See below about these variables.
+	Default: 131072 bytes.
+	This value results in initial window of 65535.
 
 	max: maximal size of receive buffer allowed for automatically
 	selected receiver buffers for TCP socket. This value does not override
 	net.core.rmem_max.  Calling setsockopt() with SO_RCVBUF disables
 	automatic tuning of that socket's receive buffer size, in which
 	case this value is ignored.
-	Default: between 87380B and 6MB, depending on RAM size.
+	Default: between 131072 and 32MB, depending on RAM size.
 
 tcp_sack - BOOLEAN
 	Enable select acknowledgments (SACKS).
@@ -665,6 +761,13 @@ tcp_comp_sack_nr - INTEGER
 
 	Default : 44
 
+tcp_backlog_ack_defer - BOOLEAN
+	If set, user thread processing socket backlog tries sending
+	one ACK for the whole queue. This helps to avoid potential
+	long latencies at end of a TCP socket syscall.
+
+	Default : true
+
 tcp_slow_start_after_idle - BOOLEAN
 	If set, provide RFC2861 behavior and time out the congestion
 	window after an idle period.  An idle period is defined at
@@ -712,6 +815,31 @@ tcp_syncookies - INTEGER
 	network connections you can set this knob to 2 to enable
 	unconditionally generation of syncookies.
 
+tcp_migrate_req - BOOLEAN
+	The incoming connection is tied to a specific listening socket when
+	the initial SYN packet is received during the three-way handshake.
+	When a listener is closed, in-flight request sockets during the
+	handshake and established sockets in the accept queue are aborted.
+
+	If the listener has SO_REUSEPORT enabled, other listeners on the
+	same port should have been able to accept such connections. This
+	option makes it possible to migrate such child sockets to another
+	listener after close() or shutdown().
+
+	The BPF_SK_REUSEPORT_SELECT_OR_MIGRATE type of eBPF program should
+	usually be used to define the policy to pick an alive listener.
+	Otherwise, the kernel will randomly pick an alive listener only if
+	this option is enabled.
+
+	Note that migration between listeners with different settings may
+	crash applications. Let's say migration happens from listener A to
+	B, and only B has TCP_SAVE_SYN enabled. B cannot read SYN data from
+	the requests migrated from A. To avoid such a situation, cancel
+	migration by returning SK_DROP in the type of eBPF program, or
+	disable this option.
+
+	Default: 0
+
 tcp_fastopen - INTEGER
 	Enable TCP Fast Open (RFC7413) to send and accept data in the opening
 	SYN packet.
@@ -752,7 +880,7 @@ tcp_fastopen_blackhole_timeout_sec - INTEGER
 	initial value when the blackhole issue goes away.
 	0 to disable the blackhole detection.
 
-	By default, it is set to 1hr.
+	By default, it is set to 0 (feature is disabled).
 
 tcp_fastopen_key - list of comma separated 32-digit hexadecimal INTEGERs
 	The list consists of a primary key and an optional backup key. The
@@ -777,9 +905,10 @@ tcp_fastopen_key - list of comma separated 32-digit hexadecimal INTEGERs
 tcp_syn_retries - INTEGER
 	Number of times initial SYNs for an active TCP connection attempt
 	will be retransmitted. Should not be higher than 127. Default value
-	is 6, which corresponds to 63seconds till the last retransmission
-	with the current initial RTO of 1second. With this the final timeout
-	for an active TCP connection attempt will happen after 127seconds.
+	is 6, which corresponds to 67seconds (with tcp_syn_linear_timeouts = 4)
+	till the last retransmission with the current initial RTO of 1second.
+	With this the final timeout for an active TCP connection attempt
+	will happen after 131seconds.
 
 tcp_timestamps - INTEGER
 	Enable timestamps as defined in RFC1323.
@@ -802,6 +931,29 @@ tcp_min_tso_segs - INTEGER
 
 	Default: 2
 
+tcp_tso_rtt_log - INTEGER
+	Adjustment of TSO packet sizes based on min_rtt
+
+	Starting from linux-5.18, TCP autosizing can be tweaked
+	for flows having small RTT.
+
+	Old autosizing was splitting the pacing budget to send 1024 TSO
+	per second.
+
+	tso_packet_size = sk->sk_pacing_rate / 1024;
+
+	With the new mechanism, we increase this TSO sizing using:
+
+	distance = min_rtt_usec / (2^tcp_tso_rtt_log)
+	tso_packet_size += gso_max_size >> distance;
+
+	This means that flows between very close hosts can use bigger
+	TSO packets, reducing their cpu costs.
+
+	If you want to use the old autosizing, set this sysctl to 0.
+
+	Default: 9  (2^9 = 512 usec)
+
 tcp_pacing_ss_ratio - INTEGER
 	sk->sk_pacing_rate is set by TCP stack using a ratio applied
 	to current rate. (current_rate = cwnd * mss / srtt)
@@ -819,6 +971,16 @@ tcp_pacing_ca_ratio - INTEGER
 
 	Default: 120
 
+tcp_syn_linear_timeouts - INTEGER
+	The number of times for an active TCP connection to retransmit SYNs with
+	a linear backoff timeout before defaulting to an exponential backoff
+	timeout. This has no effect on SYNACK at the passive TCP side.
+
+	With an initial RTO of 1 and tcp_syn_linear_timeouts = 4 we would
+	expect SYN RTOs to be: 1, 1, 1, 1, 1, 2, 4, ... (4 linear timeouts,
+	and the first exponential backoff using 2^0 * initial_RTO).
+	Default: 4
+
 tcp_tso_win_divisor - INTEGER
 	This allows control over what percentage of the congestion window
 	can be consumed by a single TSO frame.
@@ -840,9 +1002,38 @@ tcp_tw_reuse - INTEGER
 
 	Default: 2
 
+tcp_tw_reuse_delay - UNSIGNED INTEGER
+        The delay in milliseconds before a TIME-WAIT socket can be reused by a
+        new connection, if TIME-WAIT socket reuse is enabled. The actual reuse
+        threshold is within [N, N+1] range, where N is the requested delay in
+        milliseconds, to ensure the delay interval is never shorter than the
+        configured value.
+
+        This setting contains an assumption about the other TCP timestamp clock
+        tick interval. It should not be set to a value lower than the peer's
+        clock tick for PAWS (Protection Against Wrapped Sequence numbers)
+        mechanism work correctly for the reused connection.
+
+        Default: 1000 (milliseconds)
+
 tcp_window_scaling - BOOLEAN
 	Enable window scaling as defined in RFC1323.
 
+tcp_shrink_window - BOOLEAN
+	This changes how the TCP receive window is calculated.
+
+	RFC 7323, section 2.4, says there are instances when a retracted
+	window can be offered, and that TCP implementations MUST ensure
+	that they handle a shrinking window, as specified in RFC 1122.
+
+	- 0 - Disabled.	The window is never shrunk.
+	- 1 - Enabled.	The window is shrunk when necessary to remain within
+			the memory limit set by autotuning (sk_rcvbuf).
+			This only occurs if a non-zero receive window
+			scaling factor is also in effect.
+
+	Default: 0
+
 tcp_wmem - vector of 3 INTEGERs: min, default, max
 	min: Amount of memory reserved for send buffers for TCP sockets.
 	Each TCP socket has rights to use it due to fact of its birth.
@@ -908,20 +1099,156 @@ tcp_limit_output_bytes - INTEGER
 	limits the number of bytes on qdisc or device to reduce artificial
 	RTT/cwnd and reduce bufferbloat.
 
-	Default: 1048576 (16 * 65536)
+	Default: 4194304 (4 MB)
 
 tcp_challenge_ack_limit - INTEGER
 	Limits number of Challenge ACK sent per second, as recommended
 	in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks)
-	Default: 1000
+	Note that this per netns rate limit can allow some side channel
+	attacks and probably should not be enabled.
+	TCP stack implements per TCP socket limits anyway.
+	Default: INT_MAX (unlimited)
 
-tcp_rx_skb_cache - BOOLEAN
-	Controls a per TCP socket cache of one skb, that might help
-	performance of some workloads. This might be dangerous
-	on systems with a lot of TCP sockets, since it increases
-	memory usage.
+tcp_ehash_entries - INTEGER
+	Show the number of hash buckets for TCP sockets in the current
+	networking namespace.
 
-	Default: 0 (disabled)
+	A negative value means the networking namespace does not own its
+	hash buckets and shares the initial networking namespace's one.
+
+tcp_child_ehash_entries - INTEGER
+	Control the number of hash buckets for TCP sockets in the child
+	networking namespace, which must be set before clone() or unshare().
+
+	If the value is not 0, the kernel uses a value rounded up to 2^n
+	as the actual hash bucket size.  0 is a special value, meaning
+	the child networking namespace will share the initial networking
+	namespace's hash buckets.
+
+	Note that the child will use the global one in case the kernel
+	fails to allocate enough memory.  In addition, the global hash
+	buckets are spread over available NUMA nodes, but the allocation
+	of the child hash table depends on the current process's NUMA
+	policy, which could result in performance differences.
+
+	Note also that the default value of tcp_max_tw_buckets and
+	tcp_max_syn_backlog depend on the hash bucket size.
+
+	Possible values: 0, 2^n (n: 0 - 24 (16Mi))
+
+	Default: 0
+
+tcp_plb_enabled - BOOLEAN
+	If set and the underlying congestion control (e.g. DCTCP) supports
+	and enables PLB feature, TCP PLB (Protective Load Balancing) is
+	enabled. PLB is described in the following paper:
+	https://doi.org/10.1145/3544216.3544226. Based on PLB parameters,
+	upon sensing sustained congestion, TCP triggers a change in
+	flow label field for outgoing IPv6 packets. A change in flow label
+	field potentially changes the path of outgoing packets for switches
+	that use ECMP/WCMP for routing.
+
+	PLB changes socket txhash which results in a change in IPv6 Flow Label
+	field, and currently no-op for IPv4 headers. It is possible
+	to apply PLB for IPv4 with other network header fields (e.g. TCP
+	or IPv4 options) or using encapsulation where outer header is used
+	by switches to determine next hop. In either case, further host
+	and switch side changes will be needed.
+
+	When set, PLB assumes that congestion signal (e.g. ECN) is made
+	available and used by congestion control module to estimate a
+	congestion measure (e.g. ce_ratio). PLB needs a congestion measure to
+	make repathing decisions.
+
+	Default: FALSE
+
+tcp_plb_idle_rehash_rounds - INTEGER
+	Number of consecutive congested rounds (RTT) seen after which
+	a rehash can be performed, given there are no packets in flight.
+	This is referred to as M in PLB paper:
+	https://doi.org/10.1145/3544216.3544226.
+
+	Possible Values: 0 - 31
+
+	Default: 3
+
+tcp_plb_rehash_rounds - INTEGER
+	Number of consecutive congested rounds (RTT) seen after which
+	a forced rehash can be performed. Be careful when setting this
+	parameter, as a small value increases the risk of retransmissions.
+	This is referred to as N in PLB paper:
+	https://doi.org/10.1145/3544216.3544226.
+
+	Possible Values: 0 - 31
+
+	Default: 12
+
+tcp_plb_suspend_rto_sec - INTEGER
+	Time, in seconds, to suspend PLB in event of an RTO. In order to avoid
+	having PLB repath onto a connectivity "black hole", after an RTO a TCP
+	connection suspends PLB repathing for a random duration between 1x and
+	2x of this parameter. Randomness is added to avoid concurrent rehashing
+	of multiple TCP connections. This should be set corresponding to the
+	amount of time it takes to repair a failed link.
+
+	Possible Values: 0 - 255
+
+	Default: 60
+
+tcp_plb_cong_thresh - INTEGER
+	Fraction of packets marked with congestion over a round (RTT) to
+	tag that round as congested. This is referred to as K in the PLB paper:
+	https://doi.org/10.1145/3544216.3544226.
+
+	The 0-1 fraction range is mapped to 0-256 range to avoid floating
+	point operations. For example, 128 means that if at least 50% of
+	the packets in a round were marked as congested then the round
+	will be tagged as congested.
+
+	Setting threshold to 0 means that PLB repaths every RTT regardless
+	of congestion. This is not intended behavior for PLB and should be
+	used only for experimentation purpose.
+
+	Possible Values: 0 - 256
+
+	Default: 128
+
+tcp_pingpong_thresh - INTEGER
+	The number of estimated data replies sent for estimated incoming data
+	requests that must happen before TCP considers that a connection is a
+	"ping-pong" (request-response) connection for which delayed
+	acknowledgments can provide benefits.
+
+	This threshold is 1 by default, but some applications may need a higher
+	threshold for optimal performance.
+
+	Possible Values: 1 - 255
+
+	Default: 1
+
+tcp_rto_min_us - INTEGER
+	Minimal TCP retransmission timeout (in microseconds). Note that the
+	rto_min route option has the highest precedence for configuring this
+	setting, followed by the TCP_BPF_RTO_MIN and TCP_RTO_MIN_US socket
+	options, followed by this tcp_rto_min_us sysctl.
+
+	The recommended practice is to use a value less or equal to 200000
+	microseconds.
+
+	Possible Values: 1 - INT_MAX
+
+	Default: 200000
+
+tcp_rto_max_ms - INTEGER
+	Maximal TCP retransmission timeout (in ms).
+	Note that TCP_RTO_MAX_MS socket option has higher precedence.
+
+	When changing tcp_rto_max_ms, it is important to understand
+	that tcp_retries2 might need a change.
+
+	Possible Values: 1000 - 120,000
+
+	Default: 120,000
 
 UDP variables
 =============
@@ -938,13 +1265,11 @@ udp_l3mdev_accept - BOOLEAN
 udp_mem - vector of 3 INTEGERs: min, pressure, max
 	Number of pages allowed for queueing by all UDP sockets.
 
-	min: Below this number of pages UDP is not bothered about its
-	memory appetite. When amount of memory allocated by UDP exceeds
-	this number, UDP starts to moderate memory usage.
+	min: Number of pages allowed for queueing by all UDP sockets.
 
 	pressure: This value was introduced to follow format of tcp_mem.
 
-	max: Number of pages allowed for queueing by all UDP sockets.
+	max: This value was introduced to follow format of tcp_mem.
 
 	Default is calculated at boot time from amount of available memory.
 
@@ -956,11 +1281,34 @@ udp_rmem_min - INTEGER
 	Default: 4K
 
 udp_wmem_min - INTEGER
-	Minimal size of send buffer used by UDP sockets in moderation.
-	Each UDP socket is able to use the size for sending data, even if
-	total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
+	UDP does not have tx memory accounting and this tunable has no effect.
+
+udp_hash_entries - INTEGER
+	Show the number of hash buckets for UDP sockets in the current
+	networking namespace.
+
+	A negative value means the networking namespace does not own its
+	hash buckets and shares the initial networking namespace's one.
+
+udp_child_ehash_entries - INTEGER
+	Control the number of hash buckets for UDP sockets in the child
+	networking namespace, which must be set before clone() or unshare().
+
+	If the value is not 0, the kernel uses a value rounded up to 2^n
+	as the actual hash bucket size.  0 is a special value, meaning
+	the child networking namespace will share the initial networking
+	namespace's hash buckets.
+
+	Note that the child will use the global one in case the kernel
+	fails to allocate enough memory.  In addition, the global hash
+	buckets are spread over available NUMA nodes, but the allocation
+	of the child hash table depends on the current process's NUMA
+	policy, which could result in performance differences.
+
+	Possible values: 0, 2^n (n: 7 (128) - 16 (64K))
+
+	Default: 0
 
-	Default: 4K
 
 RAW variables
 =============
@@ -989,7 +1337,7 @@ cipso_cache_enable - BOOLEAN
 cipso_cache_bucket_size - INTEGER
 	The CIPSO label cache consists of a fixed size hash table with each
 	hash bucket containing a number of cache entries.  This variable limits
-	the number of entries in each hash bucket; the larger the value the
+	the number of entries in each hash bucket; the larger the value is, the
 	more CIPSO label mappings that can be cached.  When the number of
 	entries in a given hash bucket reaches this limit adding new entries
 	causes the oldest entry in the bucket to be removed to make room.
@@ -1053,7 +1401,9 @@ ip_local_reserved_ports - list of comma separated ranges
 
 	although this is redundant. However such a setting is useful
 	if later the port range is changed to a value that will
-	include the reserved ports.
+	include the reserved ports. Also keep in mind, that overlapping
+	of these ranges may affect probability of selecting ephemeral
+	ports which are right after block of reserved ports.
 
 	Default: Empty
 
@@ -1081,7 +1431,7 @@ ip_autobind_reuse - BOOLEAN
 	option should only be set by experts.
 	Default: 0
 
-ip_dynaddr - BOOLEAN
+ip_dynaddr - INTEGER
 	If set non-zero, enables support for dynamic addresses.
 	If set to a non-zero value larger than 1, a kernel log
 	message will be printed when dynamic address rewriting
@@ -1103,8 +1453,8 @@ ping_group_range - 2 INTEGERS
 	Restrict ICMP_PROTO datagram sockets to users in the group range.
 	The default is "1 0", meaning, that nobody (not even root) may
 	create ping sockets.  Setting it to "100 100" would grant permissions
-	to the single group. "0 4294967295" would enable it for the world, "100
-	4294967295" would enable it for the users, but not daemons.
+	to the single group. "0 4294967294" would enable it for the world, "100
+	4294967294" would enable it for the users, but not daemons.
 
 tcp_early_demux - BOOLEAN
 	Enable early demux for established TCP sockets.
@@ -1123,6 +1473,12 @@ icmp_echo_ignore_all - BOOLEAN
 
 	Default: 0
 
+icmp_echo_enable_probe - BOOLEAN
+        If set to one, then the kernel will respond to RFC 8335 PROBE
+        requests sent to it.
+
+        Default: 0
+
 icmp_echo_ignore_broadcasts - BOOLEAN
 	If set non-zero, then the kernel will ignore all ICMP ECHO and
 	TIMESTAMP requests sent to it via broadcast/multicast.
@@ -1142,13 +1498,15 @@ icmp_ratelimit - INTEGER
 icmp_msgs_per_sec - INTEGER
 	Limit maximal number of ICMP packets sent per second from this host.
 	Only messages whose type matches icmp_ratemask (see below) are
-	controlled by this limit.
+	controlled by this limit. For security reasons, the precise count
+	of messages per second is randomized.
 
 	Default: 1000
 
 icmp_msgs_burst - INTEGER
 	icmp_msgs_per_sec controls number of ICMP packets sent per second,
 	while icmp_msgs_burst controls the burst size of these packets.
+	For security reasons, the precise burst size is randomized.
 
 	Default: 50
 
@@ -1194,7 +1552,7 @@ icmp_errors_use_inbound_ifaddr - BOOLEAN
 
 	If non-zero, the message will be sent with the primary address of
 	the interface that received the packet that caused the icmp error.
-	This is the behaviour network many administrators will expect from
+	This is the behaviour many network administrators will expect from
 	a router. And it can make debugging complicated network layouts
 	much easier.
 
@@ -1337,6 +1695,14 @@ proxy_arp_pvlan - BOOLEAN
 	  Hewlett-Packard call it Source-Port filtering or port-isolation.
 	  Ericsson call it MAC-Forced Forwarding (RFC Draft).
 
+proxy_delay - INTEGER
+	Delay proxy response.
+
+	Delay response to a neighbor solicitation when proxy_arp
+	or proxy_ndp is enabled. A random value between [0, proxy_delay)
+	will be chosen, setting to zero means reply with no delay.
+	Value in jiffies. Defaults to 80.
+
 shared_media - BOOLEAN
 	Send(router) or accept(host) RFC1620 shared media redirects.
 	Overrides secure_redirects.
@@ -1423,6 +1789,25 @@ rp_filter - INTEGER
 	Default value is 0. Note that some distributions enable it
 	in startup scripts.
 
+src_valid_mark - BOOLEAN
+	- 0 - The fwmark of the packet is not included in reverse path
+	  route lookup.  This allows for asymmetric routing configurations
+	  utilizing the fwmark in only one direction, e.g., transparent
+	  proxying.
+
+	- 1 - The fwmark of the packet is included in reverse path route
+	  lookup.  This permits rp_filter to function when the fwmark is
+	  used for routing traffic in both directions.
+
+	This setting also affects the utilization of fmwark when
+	performing source address selection for ICMP replies, or
+	determining addresses stored for the IPOPT_TS_TSANDADDR and
+	IPOPT_RR IP options.
+
+	The max value from conf/{all,interface}/src_valid_mark is used.
+
+	Default value is 0.
+
 arp_filter - BOOLEAN
 	- 1 - Allows you to have multiple network interfaces on the same
 	  subnet, and have the ARPs for each interface be answered
@@ -1502,12 +1887,15 @@ arp_notify - BOOLEAN
 	     or hardware address changes.
 	 ==  ==========================================================
 
-arp_accept - BOOLEAN
-	Define behavior for gratuitous ARP frames who's IP is not
-	already present in the ARP table:
+arp_accept - INTEGER
+	Define behavior for accepting gratuitous ARP (garp) frames from devices
+	that are not already present in the ARP table:
 
 	- 0 - don't create new entries in the ARP table
 	- 1 - create new entries in the ARP table
+	- 2 - create new entries only if the source IP address is in the same
+	  subnet as an address configured on the interface that received the
+	  garp message.
 
 	Both replies and requests type gratuitous arp will trigger the
 	ARP table to be updated, if this setting is on.
@@ -1516,6 +1904,15 @@ arp_accept - BOOLEAN
 	gratuitous arp frame, the arp table will be updated regardless
 	if this setting is on or off.
 
+arp_evict_nocarrier - BOOLEAN
+	Clears the ARP cache on NOCARRIER events. This option is important for
+	wireless devices where the ARP cache should not be cleared when roaming
+	between access points on the same network. In most cases this should
+	remain as the default (1).
+
+	- 1 - (default): Clear the ARP cache on NOCARRIER events
+	- 0 - Do not clear ARP cache on NOCARRIER events
+
 mcast_solicit - INTEGER
 	The maximum number of multicast probes in INCOMPLETE state,
 	when the associated hardware address is unknown.  Defaults
@@ -1552,6 +1949,9 @@ igmpv3_unsolicited_report_interval - INTEGER
 
 	Default: 1000 (1 seconds)
 
+ignore_routes_with_linkdown - BOOLEAN
+        Ignore routes whose link is down when performing a FIB lookup.
+
 promote_secondaries - BOOLEAN
 	When a primary IP address is removed from this interface
 	promote a corresponding secondary IP address instead of
@@ -1691,6 +2091,35 @@ fib_multipath_hash_policy - INTEGER
 	- 0 - Layer 3 (source and destination addresses plus flow label)
 	- 1 - Layer 4 (standard 5-tuple)
 	- 2 - Layer 3 or inner Layer 3 if present
+	- 3 - Custom multipath hash. Fields used for multipath hash calculation
+	  are determined by fib_multipath_hash_fields sysctl
+
+fib_multipath_hash_fields - UNSIGNED INTEGER
+	When fib_multipath_hash_policy is set to 3 (custom multipath hash), the
+	fields used for multipath hash calculation are determined by this
+	sysctl.
+
+	This value is a bitmask which enables various fields for multipath hash
+	calculation.
+
+	Possible fields are:
+
+	====== ============================
+	0x0001 Source IP address
+	0x0002 Destination IP address
+	0x0004 IP protocol
+	0x0008 Flow Label
+	0x0010 Source port
+	0x0020 Destination port
+	0x0040 Inner source IP address
+	0x0080 Inner destination IP address
+	0x0100 Inner IP protocol
+	0x0200 Inner Flow Label
+	0x0400 Inner source port
+	0x0800 Inner destination port
+	====== ============================
+
+	Default: 0x0007 (source IP, destination IP and IP protocol)
 
 anycast_src_echo_reply - BOOLEAN
 	Controls the use of anycast addresses as source addresses for ICMPv6
@@ -1760,7 +2189,7 @@ skip_notify_on_dev_down - BOOLEAN
 
 nexthop_compat_mode - BOOLEAN
 	New nexthop API provides a means for managing nexthops independent of
-	prefixes. Backwards compatibilty with old route format is enabled by
+	prefixes. Backwards compatibility with old route format is enabled by
 	default which means route dumps and notifications contain the new
 	nexthop attribute but also the full, expanded nexthop definition.
 	Further, updates or deletes of a nexthop configuration generate route
@@ -1768,8 +2197,52 @@ nexthop_compat_mode - BOOLEAN
 	understands the new API, this sysctl can be disabled to achieve full
 	performance benefits of the new API by disabling the nexthop expansion
 	and extraneous notifications.
+
+	Note that as a backward-compatible mode, dumping of modern features
+	might be incomplete or wrong. For example, resilient groups will not be
+	shown as such, but rather as just a list of next hops. Also weights that
+	do not fit into 8 bits will show incorrectly.
+
 	Default: true (backward compat mode)
 
+fib_notify_on_flag_change - INTEGER
+        Whether to emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/
+        RTM_F_TRAP/RTM_F_OFFLOAD_FAILED flags are changed.
+
+        After installing a route to the kernel, user space receives an
+        acknowledgment, which means the route was installed in the kernel,
+        but not necessarily in hardware.
+        It is also possible for a route already installed in hardware to change
+        its action and therefore its flags. For example, a host route that is
+        trapping packets can be "promoted" to perform decapsulation following
+        the installation of an IPinIP/VXLAN tunnel.
+        The notifications will indicate to user-space the state of the route.
+
+        Default: 0 (Do not emit notifications.)
+
+        Possible values:
+
+        - 0 - Do not emit notifications.
+        - 1 - Emit notifications.
+        - 2 - Emit notifications only for RTM_F_OFFLOAD_FAILED flag change.
+
+ioam6_id - INTEGER
+        Define the IOAM id of this node. Uses only 24 bits out of 32 in total.
+
+        Min: 0
+        Max: 0xFFFFFF
+
+        Default: 0xFFFFFF
+
+ioam6_id_wide - LONG INTEGER
+        Define the wide IOAM id of this node. Uses only 56 bits out of 64 in
+        total. Can be different from ioam6_id.
+
+        Min: 0
+        Max: 0xFFFFFFFFFFFFFF
+
+        Default: 0xFFFFFFFFFFFFFF
+
 IPv6 Fragmentation:
 
 ip6frag_high_thresh - INTEGER
@@ -1784,30 +2257,27 @@ ip6frag_low_thresh - INTEGER
 ip6frag_time - INTEGER
 	Time in seconds to keep an IPv6 fragment in memory.
 
-IPv6 Segment Routing:
-
-seg6_flowlabel - INTEGER
-	Controls the behaviour of computing the flowlabel of outer
-	IPv6 header in case of SR T.encaps
-
-	 == =======================================================
-	 -1  set flowlabel to zero.
-	  0  copy flowlabel from Inner packet in case of Inner IPv6
-	     (Set flowlabel to 0 in case IPv4/L2)
-	  1  Compute the flowlabel using seg6_make_flowlabel()
-	 == =======================================================
-
-	Default is 0.
-
 ``conf/default/*``:
 	Change the interface-specific default settings.
 
+	These settings would be used during creating new interfaces.
+
 
 ``conf/all/*``:
 	Change all the interface-specific settings.
 
 	[XXX:  Other special features than forwarding?]
 
+conf/all/disable_ipv6 - BOOLEAN
+	Changing this value is same as changing ``conf/default/disable_ipv6``
+	setting and also all per-interface ``disable_ipv6`` settings to the same
+	value.
+
+	Reading this value does not have any particular meaning. It does not say
+	whether IPv6 support is enabled or disabled. Returned value can be 1
+	also in the case when some interface has ``disable_ipv6`` set to 0 and
+	has configured IPv6 addresses.
+
 conf/all/forwarding - BOOLEAN
 	Enable global IPv6 forwarding between all interfaces.
 
@@ -1866,6 +2336,16 @@ accept_ra_defrtr - BOOLEAN
 		- enabled if accept_ra is enabled.
 		- disabled if accept_ra is disabled.
 
+ra_defrtr_metric - UNSIGNED INTEGER
+	Route metric for default route learned in Router Advertisement. This value
+	will be assigned as metric for the default route learned via IPv6 Router
+	Advertisement. Takes affect only if accept_ra_defrtr is enabled.
+
+	Possible values:
+		1 to 0xFFFFFFFF
+
+		Default: IP6_RT_PRIO_USER i.e. 1024.
+
 accept_ra_from_local - BOOLEAN
 	Accept RA with source-address that is found on local machine
 	if the RA is otherwise proper and able to be accepted.
@@ -1888,6 +2368,14 @@ accept_ra_min_hop_limit - INTEGER
 
 	Default: 1
 
+accept_ra_min_lft - INTEGER
+	Minimum acceptable lifetime value in Router Advertisement.
+
+	RA sections with a lifetime less than this value shall be
+	ignored. Zero lifetimes stay unaffected.
+
+	Default: 0
+
 accept_ra_pinfo - BOOLEAN
 	Learn Prefix Information in Router Advertisement.
 
@@ -1896,6 +2384,31 @@ accept_ra_pinfo - BOOLEAN
 		- enabled if accept_ra is enabled.
 		- disabled if accept_ra is disabled.
 
+ra_honor_pio_life - BOOLEAN
+	Whether to use RFC4862 Section 5.5.3e to determine the valid
+	lifetime of an address matching a prefix sent in a Router
+	Advertisement Prefix Information Option.
+
+	- If enabled, the PIO valid lifetime will always be honored.
+	- If disabled, RFC4862 section 5.5.3e is used to determine
+	  the valid lifetime of the address.
+
+	Default: 0 (disabled)
+
+ra_honor_pio_pflag - BOOLEAN
+	The Prefix Information Option P-flag indicates the network can
+	allocate a unique IPv6 prefix per client using DHCPv6-PD.
+	This sysctl can be enabled when a userspace DHCPv6-PD client
+	is running to cause the P-flag to take effect: i.e. the
+	P-flag suppresses any effects of the A-flag within the same
+	PIO. For a given PIO, P=1 and A=1 is treated as A=0.
+
+	- If disabled, the P-flag is ignored.
+	- If enabled, the P-flag will disable SLAAC autoconfiguration
+	  for the given Prefix Information Option.
+
+	Default: 0 (disabled)
+
 accept_ra_rt_info_min_plen - INTEGER
 	Minimum prefix length of Route Information in RA.
 
@@ -2063,12 +2576,18 @@ use_tempaddr - INTEGER
 		* -1 (for point-to-point devices and loopback devices)
 
 temp_valid_lft - INTEGER
-	valid lifetime (in seconds) for temporary addresses.
+	valid lifetime (in seconds) for temporary addresses. If less than the
+	minimum required lifetime (typically 5-7 seconds), temporary addresses
+	will not be created.
 
 	Default: 172800 (2 days)
 
 temp_prefered_lft - INTEGER
-	Preferred lifetime (in seconds) for temporary addresses.
+	Preferred lifetime (in seconds) for temporary addresses. If
+	temp_prefered_lft is less than the minimum required lifetime (typically
+	5-7 seconds), the preferred lifetime is the minimum required. If
+	temp_prefered_lft is greater than temp_valid_lft, the preferred lifetime
+	is temp_valid_lft.
 
 	Default: 86400 (1 day)
 
@@ -2090,6 +2609,16 @@ max_desync_factor - INTEGER
 
 	Default: 600
 
+regen_min_advance - INTEGER
+	How far in advance (in seconds), at minimum, to create a new temporary
+	address before the current one is deprecated. This value is added to
+	the amount of time that may be required for duplicate address detection
+	to determine when to create a new address. Linux permits setting this
+	value to less than the default of 2 seconds, but a value less than 2
+	does not conform to RFC 8981.
+
+	Default: 2
+
 regen_max_retry - INTEGER
 	Number of attempts before give up attempting to generate
 	valid temporary addresses.
@@ -2169,6 +2698,15 @@ ndisc_tclass - INTEGER
 
 	* 0 - (default)
 
+ndisc_evict_nocarrier - BOOLEAN
+	Clears the neighbor discovery table on NOCARRIER events. This option is
+	important for wireless devices where the neighbor discovery cache should
+	not be cleared when roaming between access points on the same network.
+	In most cases this should remain as the default (1).
+
+	- 1 - (default): Clear neighbor discover cache on NOCARRIER events.
+	- 0 - Do not clear neighbor discovery cache on NOCARRIER events.
+
 mldv1_unsolicited_report_interval - INTEGER
 	The interval in milliseconds in which the next unsolicited
 	MLDv1 report retransmit will take place.
@@ -2254,6 +2792,37 @@ drop_unsolicited_na - BOOLEAN
 
 	By default this is turned off.
 
+accept_untracked_na - INTEGER
+	Define behavior for accepting neighbor advertisements from devices that
+	are absent in the neighbor cache:
+
+	- 0 - (default) Do not accept unsolicited and untracked neighbor
+	  advertisements.
+
+	- 1 - Add a new neighbor cache entry in STALE state for routers on
+	  receiving a neighbor advertisement (either solicited or unsolicited)
+	  with target link-layer address option specified if no neighbor entry
+	  is already present for the advertised IPv6 address. Without this knob,
+	  NAs received for untracked addresses (absent in neighbor cache) are
+	  silently ignored.
+
+	  This is as per router-side behavior documented in RFC9131.
+
+	  This has lower precedence than drop_unsolicited_na.
+
+	  This will optimize the return path for the initial off-link
+	  communication that is initiated by a directly connected host, by
+	  ensuring that the first-hop router which turns on this setting doesn't
+	  have to buffer the initial return packets to do neighbor-solicitation.
+	  The prerequisite is that the host is configured to send unsolicited
+	  neighbor advertisements on interface bringup. This setting should be
+	  used in conjunction with the ndisc_notify setting on the host to
+	  satisfy this prerequisite.
+
+	- 2 - Extend option (1) to add a new neighbor cache entry only if the
+	  source IP address is in the same subnet as an address configured on
+	  the interface that received the neighbor advertisement.
+
 enhanced_dad - BOOLEAN
 	Include a nonce option in the IPv6 neighbor solicitation messages used for
 	duplicate address detection per RFC7527. A received DAD NS will only signal
@@ -2308,6 +2877,13 @@ echo_ignore_anycast - BOOLEAN
 
 	Default: 0
 
+error_anycast_as_unicast - BOOLEAN
+	If set to 1, then the kernel will respond with ICMP Errors
+	resulting from requests sent to it over the IPv6 protocol destined
+	to anycast address essentially treating anycast as unicast.
+
+	Default: 0
+
 xfrm6_gc_thresh - INTEGER
 	(Obsolete since linux-4.14)
 	The threshold at which we will start garbage collecting for IPv6
@@ -2408,7 +2984,7 @@ pf_expose - INTEGER
 	can be got via SCTP_GET_PEER_ADDR_INFO sockopt;  When it's enabled,
 	a SCTP_PEER_ADDR_CHANGE event will be sent for a transport becoming
 	SCTP_PF state and a SCTP_PF-state transport info can be got via
-	SCTP_GET_PEER_ADDR_INFO sockopt;  When it's diabled, no
+	SCTP_GET_PEER_ADDR_INFO sockopt;  When it's disabled, no
 	SCTP_PEER_ADDR_CHANGE event will be sent and it returns -EACCES when
 	trying to get a SCTP_PF-state transport info via SCTP_GET_PEER_ADDR_INFO
 	sockopt.
@@ -2628,7 +3204,14 @@ sctp_rmem - vector of 3 INTEGERs: min, default, max
 	Default: 4K
 
 sctp_wmem  - vector of 3 INTEGERs: min, default, max
-	Currently this tunable has no effect.
+	Only the first value ("min") is used, "default" and "max" are
+	ignored.
+
+	min: Minimum size of send buffer that can be used by SCTP sockets.
+	It is guaranteed to each SCTP socket (but not association) even
+	under moderate memory pressure.
+
+	Default: 4K
 
 addr_scope_policy - INTEGER
 	Control IPv4 address scoping - draft-stewart-tsvwg-sctp-ipv4-00
@@ -2640,6 +3223,95 @@ addr_scope_policy - INTEGER
 
 	Default: 1
 
+udp_port - INTEGER
+	The listening port for the local UDP tunneling sock. Normally it's
+	using the IANA-assigned UDP port number 9899 (sctp-tunneling).
+
+	This UDP sock is used for processing the incoming UDP-encapsulated
+	SCTP packets (from RFC6951), and shared by all applications in the
+	same net namespace. This UDP sock will be closed when the value is
+	set to 0.
+
+	The value will also be used to set the src port of the UDP header
+	for the outgoing UDP-encapsulated SCTP packets. For the dest port,
+	please refer to 'encap_port' below.
+
+	Default: 0
+
+encap_port - INTEGER
+	The default remote UDP encapsulation port.
+
+	This value is used to set the dest port of the UDP header for the
+	outgoing UDP-encapsulated SCTP packets by default. Users can also
+	change the value for each sock/asoc/transport by using setsockopt.
+	For further information, please refer to RFC6951.
+
+	Note that when connecting to a remote server, the client should set
+	this to the port that the UDP tunneling sock on the peer server is
+	listening to and the local UDP tunneling sock on the client also
+	must be started. On the server, it would get the encap_port from
+	the incoming packet's source port.
+
+	Default: 0
+
+plpmtud_probe_interval - INTEGER
+        The time interval (in milliseconds) for the PLPMTUD probe timer,
+        which is configured to expire after this period to receive an
+        acknowledgment to a probe packet. This is also the time interval
+        between the probes for the current pmtu when the probe search
+        is done.
+
+        PLPMTUD will be disabled when 0 is set, and other values for it
+        must be >= 5000.
+
+	Default: 0
+
+reconf_enable - BOOLEAN
+        Enable or disable extension of Stream Reconfiguration functionality
+        specified in RFC6525. This extension provides the ability to "reset"
+        a stream, and it includes the Parameters of "Outgoing/Incoming SSN
+        Reset", "SSN/TSN Reset" and "Add Outgoing/Incoming Streams".
+
+	- 1: Enable extension.
+	- 0: Disable extension.
+
+	Default: 0
+
+intl_enable - BOOLEAN
+        Enable or disable extension of User Message Interleaving functionality
+        specified in RFC8260. This extension allows the interleaving of user
+        messages sent on different streams. With this feature enabled, I-DATA
+        chunk will replace DATA chunk to carry user messages if also supported
+        by the peer. Note that to use this feature, one needs to set this option
+        to 1 and also needs to set socket options SCTP_FRAGMENT_INTERLEAVE to 2
+        and SCTP_INTERLEAVING_SUPPORTED to 1.
+
+	- 1: Enable extension.
+	- 0: Disable extension.
+
+	Default: 0
+
+ecn_enable - BOOLEAN
+        Control use of Explicit Congestion Notification (ECN) by SCTP.
+        Like in TCP, ECN is used only when both ends of the SCTP connection
+        indicate support for it. This feature is useful in avoiding losses
+        due to congestion by allowing supporting routers to signal congestion
+        before having to drop packets.
+
+        1: Enable ecn.
+        0: Disable ecn.
+
+        Default: 1
+
+l3mdev_accept - BOOLEAN
+	Enabling this option allows a "global" bound socket to work
+	across L3 master domains (e.g., VRFs) with packets capable of
+	being received regardless of the L3 domain in which they
+	originated. Only valid when the kernel was compiled with
+	CONFIG_NET_L3_MASTER_DEV.
+
+	Default: 1 (enabled)
+
 
 ``/proc/sys/net/core/*``
 ========================
diff --git a/Documentation/networking/ipddp.rst b/Documentation/networking/ipddp.rst
deleted file mode 100644
index be7091b77927..000000000000
--- a/Documentation/networking/ipddp.rst
+++ /dev/null
@@ -1,78 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=========================================================
-AppleTalk-IP Decapsulation and AppleTalk-IP Encapsulation
-=========================================================
-
-Documentation ipddp.c
-
-This file is written by Jay Schulist <jschlst@samba.org>
-
-Introduction
-------------
-
-AppleTalk-IP (IPDDP) is the method computers connected to AppleTalk
-networks can use to communicate via IP. AppleTalk-IP is simply IP datagrams
-inside AppleTalk packets.
-
-Through this driver you can either allow your Linux box to communicate
-IP over an AppleTalk network or you can provide IP gatewaying functions
-for your AppleTalk users.
-
-You can currently encapsulate or decapsulate AppleTalk-IP on LocalTalk,
-EtherTalk and PPPTalk. The only limit on the protocol is that of what
-kernel AppleTalk layer and drivers are available.
-
-Each mode requires its own user space software.
-
-Compiling AppleTalk-IP Decapsulation/Encapsulation
-==================================================
-
-AppleTalk-IP decapsulation needs to be compiled into your kernel. You
-will need to turn on AppleTalk-IP driver support. Then you will need to
-select ONE of the two options; IP to AppleTalk-IP encapsulation support or
-AppleTalk-IP to IP decapsulation support. If you compile the driver
-statically you will only be able to use the driver for the function you have
-enabled in the kernel. If you compile the driver as a module you can
-select what mode you want it to run in via a module loading param.
-ipddp_mode=1 for AppleTalk-IP encapsulation and ipddp_mode=2 for
-AppleTalk-IP to IP decapsulation.
-
-Basic instructions for user space tools
-=======================================
-
-I will briefly describe the operation of the tools, but you will
-need to consult the supporting documentation for each set of tools.
-
-Decapsulation - You will need to download a software package called
-MacGate. In this distribution there will be a tool called MacRoute
-which enables you to add routes to the kernel for your Macs by hand.
-Also the tool MacRegGateWay is included to register the
-proper IP Gateway and IP addresses for your machine. Included in this
-distribution is a patch to netatalk-1.4b2+asun2.0a17.2 (available from
-ftp.u.washington.edu/pub/user-supported/asun/) this patch is optional
-but it allows automatic adding and deleting of routes for Macs. (Handy
-for locations with large Mac installations)
-
-Encapsulation - You will need to download a software daemon called ipddpd.
-This software expects there to be an AppleTalk-IP gateway on the network.
-You will also need to add the proper routes to route your Linux box's IP
-traffic out the ipddp interface.
-
-Common Uses of ipddp.c
-----------------------
-Of course AppleTalk-IP decapsulation and encapsulation, but specifically
-decapsulation is being used most for connecting LocalTalk networks to
-IP networks. Although it has been used on EtherTalk networks to allow
-Macs that are only able to tunnel IP over EtherTalk.
-
-Encapsulation has been used to allow a Linux box stuck on a LocalTalk
-network to use IP. It should work equally well if you are stuck on an
-EtherTalk only network.
-
-Further Assistance
--------------------
-You can contact me (Jay Schulist <jschlst@samba.org>) with any
-questions regarding decapsulation or encapsulation. Bradford W. Johnson
-<johns393@maroon.tc.umn.edu> originally wrote the ipddp.c driver for IP
-encapsulation in AppleTalk.
diff --git a/Documentation/networking/ipvlan.rst b/Documentation/networking/ipvlan.rst
index 694adcba36b0..895d0ccfd596 100644
--- a/Documentation/networking/ipvlan.rst
+++ b/Documentation/networking/ipvlan.rst
@@ -11,7 +11,7 @@ Initial Release:
 ================
 This is conceptually very similar to the macvlan driver with one major
 exception of using L3 for mux-ing /demux-ing among slaves. This property makes
-the master device share the L2 with it's slave devices. I have developed this
+the master device share the L2 with its slave devices. I have developed this
 driver in conjunction with network namespaces and not sure if there is use case
 outside of it.
 
@@ -61,7 +61,7 @@ e.g.
 IPvlan has two modes of operation - L2 and L3. For a given master device,
 you can select one of these two modes and all slaves on that master will
 operate in the same (selected) mode. The RX mode is almost identical except
-that in L3 mode the slaves wont receive any multicast / broadcast traffic.
+that in L3 mode the slaves won't receive any multicast / broadcast traffic.
 L3 mode is more restrictive since routing is controlled from the other (mostly)
 default namespace.
 
diff --git a/Documentation/networking/ipvs-sysctl.rst b/Documentation/networking/ipvs-sysctl.rst
index 2afccc63856e..3fb5fa142eef 100644
--- a/Documentation/networking/ipvs-sysctl.rst
+++ b/Documentation/networking/ipvs-sysctl.rst
@@ -37,8 +37,7 @@ conn_reuse_mode - INTEGER
 
 	0: disable any special handling on port reuse. The new
 	connection will be delivered to the same real server that was
-	servicing the previous connection. This will effectively
-	disable expire_nodest_conn.
+	servicing the previous connection.
 
 	bit 1: enable rescheduling of new connections when it is safe.
 	That is, whenever expire_nodest_conn and for TCP sockets, when
@@ -130,6 +129,26 @@ drop_packet - INTEGER
 	threshold. When the mode 3 is set, the always mode drop rate
 	is controlled by the /proc/sys/net/ipv4/vs/am_droprate.
 
+est_cpulist - CPULIST
+	Allowed	CPUs for estimation kthreads
+
+	Syntax: standard cpulist format
+	empty list - stop kthread tasks and estimation
+	default - the system's housekeeping CPUs for kthreads
+
+	Example:
+	"all": all possible CPUs
+	"0-N": all possible CPUs, N denotes last CPU number
+	"0,1-N:1/2": first and all CPUs with odd number
+	"": empty list
+
+est_nice - INTEGER
+	default 0
+	Valid range: -20 (more favorable) .. 19 (less favorable)
+
+	Niceness value to use for the estimation kthreads (scheduling
+	priority)
+
 expire_nodest_conn - BOOLEAN
 	- 0 - disabled (default)
 	- not 0 - enabled
@@ -300,3 +319,14 @@ sync_version - INTEGER
 
 	Kernels with this sync_version entry are able to receive messages
 	of both version 1 and version 2 of the synchronisation protocol.
+
+run_estimation - BOOLEAN
+	0 - disabled
+	not 0 - enabled (default)
+
+	If disabled, the estimation will be suspended and kthread tasks
+	stopped.
+
+	You can always re-enable estimation by setting this value to 1.
+	But be careful, the first estimation after re-enable is not
+	accurate.
diff --git a/Documentation/networking/iso15765-2.rst b/Documentation/networking/iso15765-2.rst
new file mode 100644
index 000000000000..37ebb2c417cb
--- /dev/null
+++ b/Documentation/networking/iso15765-2.rst
@@ -0,0 +1,386 @@
+.. SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+
+====================
+ISO 15765-2 (ISO-TP)
+====================
+
+Overview
+========
+
+ISO 15765-2, also known as ISO-TP, is a transport protocol specifically defined
+for diagnostic communication on CAN. It is widely used in the automotive
+industry, for example as the transport protocol for UDSonCAN (ISO 14229-3) or
+emission-related diagnostic services (ISO 15031-5).
+
+ISO-TP can be used both on CAN CC (aka Classical CAN) and CAN FD (CAN with
+Flexible Datarate) based networks. It is also designed to be compatible with a
+CAN network using SAE J1939 as data link layer (however, this is not a
+requirement).
+
+Specifications used
+-------------------
+
+* ISO 15765-2:2024 : Road vehicles - Diagnostic communication over Controller
+  Area Network (DoCAN). Part 2: Transport protocol and network layer services.
+
+Addressing
+----------
+
+In its simplest form, ISO-TP is based on two kinds of addressing modes for the
+nodes connected to the same network:
+
+* physical addressing is implemented by two node-specific addresses and is used
+  in 1-to-1 communication.
+
+* functional addressing is implemented by one node-specific address and is used
+  in 1-to-N communication.
+
+Three different addressing formats can be employed:
+
+* "normal" : each address is represented simply by a CAN ID.
+
+* "extended": each address is represented by a CAN ID plus the first byte of
+  the CAN payload; both the CAN ID and the byte inside the payload shall be
+  different between two addresses.
+
+* "mixed": each address is represented by a CAN ID plus the first byte of
+  the CAN payload; the CAN ID is different between two addresses, but the
+  additional byte is the same.
+
+Transport protocol and associated frame types
+---------------------------------------------
+
+When transmitting data using the ISO-TP protocol, the payload can either fit
+inside one single CAN message or not, also considering the overhead the protocol
+is generating and the optional extended addressing. In the first case, the data
+is transmitted at once using a so-called Single Frame (SF). In the second case,
+ISO-TP defines a multi-frame protocol, in which the sender provides (through a
+First Frame - FF) the PDU length which is to be transmitted and also asks for a
+Flow Control (FC) frame, which provides the maximum supported size of a macro
+data block (``blocksize``) and the minimum time between the single CAN messages
+composing such block (``stmin``). Once this information has been received, the
+sender starts to send frames containing fragments of the data payload (called
+Consecutive Frames - CF), stopping after every ``blocksize``-sized block to wait
+confirmation from the receiver which should then send another Flow Control
+frame to inform the sender about its availability to receive more data.
+
+How to Use ISO-TP
+=================
+
+As with others CAN protocols, the ISO-TP stack support is built into the
+Linux network subsystem for the CAN bus, aka. Linux-CAN or SocketCAN, and
+thus follows the same socket API.
+
+Creation and basic usage of an ISO-TP socket
+--------------------------------------------
+
+To use the ISO-TP stack, ``#include <linux/can/isotp.h>`` shall be used. A
+socket can then be created using the ``PF_CAN`` protocol family, the
+``SOCK_DGRAM`` type (as the underlying protocol is datagram-based by design)
+and the ``CAN_ISOTP`` protocol:
+
+.. code-block:: C
+
+    s = socket(PF_CAN, SOCK_DGRAM, CAN_ISOTP);
+
+After the socket has been successfully created, ``bind(2)`` shall be called to
+bind the socket to the desired CAN interface; to do so:
+
+* a TX CAN ID shall be specified as part of the sockaddr supplied to the call
+  itself.
+
+* a RX CAN ID shall also be specified, unless broadcast flags have been set
+  through socket option (explained below).
+
+Once bound to an interface, the socket can be read from and written to using
+the usual ``read(2)`` and ``write(2)`` system calls, as well as ``send(2)``,
+``sendmsg(2)``, ``recv(2)`` and ``recvmsg(2)``.
+Unlike the CAN_RAW socket API, only the ISO-TP data field (the actual payload)
+is sent and received by the userspace application using these calls. The address
+information and the protocol information are automatically filled by the ISO-TP
+stack using the configuration supplied during socket creation. In the same way,
+the stack will use the transport mechanism when required (i.e., when the size
+of the data payload exceeds the MTU of the underlying CAN bus).
+
+The sockaddr structure used for SocketCAN has extensions for use with ISO-TP,
+as specified below:
+
+.. code-block:: C
+
+    struct sockaddr_can {
+        sa_family_t can_family;
+        int         can_ifindex;
+        union {
+            struct { canid_t rx_id, tx_id; } tp;
+        ...
+        } can_addr;
+    }
+
+* ``can_family`` and ``can_ifindex`` serve the same purpose as for other
+  SocketCAN sockets.
+
+* ``can_addr.tp.rx_id`` specifies the receive (RX) CAN ID and will be used as
+  a RX filter.
+
+* ``can_addr.tp.tx_id`` specifies the transmit (TX) CAN ID
+
+ISO-TP socket options
+---------------------
+
+When creating an ISO-TP socket, reasonable defaults are set. Some options can
+be modified with ``setsockopt(2)`` and/or read back with ``getsockopt(2)``.
+
+General options
+~~~~~~~~~~~~~~~
+
+General socket options can be passed using the ``CAN_ISOTP_OPTS`` optname:
+
+.. code-block:: C
+
+    struct can_isotp_options opts;
+    ret = setsockopt(s, SOL_CAN_ISOTP, CAN_ISOTP_OPTS, &opts, sizeof(opts))
+
+where the ``can_isotp_options`` structure has the following contents:
+
+.. code-block:: C
+
+    struct can_isotp_options {
+        u32 flags;
+        u32 frame_txtime;
+        u8  ext_address;
+        u8  txpad_content;
+        u8  rxpad_content;
+        u8  rx_ext_address;
+    };
+
+* ``flags``: modifiers to be applied to the default behaviour of the ISO-TP
+  stack. Following flags are available:
+
+  * ``CAN_ISOTP_LISTEN_MODE``: listen only (do not send FC frames); normally
+    used as a testing feature.
+
+  * ``CAN_ISOTP_EXTEND_ADDR``: use the byte specified in ``ext_address`` as an
+    additional address component. This enables the "mixed" addressing format if
+    used alone, or the "extended" addressing format if used in conjunction with
+    ``CAN_ISOTP_RX_EXT_ADDR``.
+
+  * ``CAN_ISOTP_TX_PADDING``: enable padding for transmitted frames, using
+    ``txpad_content`` as value for the padding bytes.
+
+  * ``CAN_ISOTP_RX_PADDING``: enable padding for the received frames, using
+    ``rxpad_content`` as value for the padding bytes.
+
+  * ``CAN_ISOTP_CHK_PAD_LEN``: check for correct padding length on the received
+    frames.
+
+  * ``CAN_ISOTP_CHK_PAD_DATA``: check padding bytes on the received frames
+    against ``rxpad_content``; if ``CAN_ISOTP_RX_PADDING`` is not specified,
+    this flag is ignored.
+
+  * ``CAN_ISOTP_HALF_DUPLEX``: force ISO-TP socket in half duplex mode
+    (that is, transport mechanism can only be incoming or outgoing at the same
+    time, not both).
+
+  * ``CAN_ISOTP_FORCE_TXSTMIN``: ignore stmin from received FC; normally
+    used as a testing feature.
+
+  * ``CAN_ISOTP_FORCE_RXSTMIN``: ignore CFs depending on rx stmin; normally
+    used as a testing feature.
+
+  * ``CAN_ISOTP_RX_EXT_ADDR``: use ``rx_ext_address`` instead of ``ext_address``
+    as extended addressing byte on the reception path. If used in conjunction
+    with ``CAN_ISOTP_EXTEND_ADDR``, this flag effectively enables the "extended"
+    addressing format.
+
+  * ``CAN_ISOTP_WAIT_TX_DONE``: wait until the frame is sent before returning
+    from ``write(2)`` and ``send(2)`` calls (i.e., blocking write operations).
+
+  * ``CAN_ISOTP_SF_BROADCAST``: use 1-to-N functional addressing (cannot be
+    specified alongside ``CAN_ISOTP_CF_BROADCAST``).
+
+  * ``CAN_ISOTP_CF_BROADCAST``: use 1-to-N transmission without flow control
+    (cannot be specified alongside ``CAN_ISOTP_SF_BROADCAST``).
+    NOTE: this is not covered by the ISO 15765-2 standard.
+
+  * ``CAN_ISOTP_DYN_FC_PARMS``: enable dynamic update of flow control
+    parameters.
+
+* ``frame_txtime``: frame transmission time (defined as N_As/N_Ar inside the
+  ISO standard); if ``0``, the default (or the last set value) is used.
+  To set the transmission time to ``0``, the ``CAN_ISOTP_FRAME_TXTIME_ZERO``
+  macro (equal to 0xFFFFFFFF) shall be used.
+
+* ``ext_address``: extended addressing byte, used if the
+  ``CAN_ISOTP_EXTEND_ADDR`` flag is specified.
+
+* ``txpad_content``: byte used as padding value for transmitted frames.
+
+* ``rxpad_content``: byte used as padding value for received frames.
+
+* ``rx_ext_address``: extended addressing byte for the reception path, used if
+  the ``CAN_ISOTP_RX_EXT_ADDR`` flag is specified.
+
+Flow Control options
+~~~~~~~~~~~~~~~~~~~~
+
+Flow Control (FC) options can be passed using the ``CAN_ISOTP_RECV_FC`` optname
+to provide the communication parameters for receiving ISO-TP PDUs.
+
+.. code-block:: C
+
+    struct can_isotp_fc_options fc_opts;
+    ret = setsockopt(s, SOL_CAN_ISOTP, CAN_ISOTP_RECV_FC, &fc_opts, sizeof(fc_opts));
+
+where the ``can_isotp_fc_options`` structure has the following contents:
+
+.. code-block:: C
+
+    struct can_isotp_options {
+        u8 bs;
+        u8 stmin;
+        u8 wftmax;
+    };
+
+* ``bs``: blocksize provided in flow control frames.
+
+* ``stmin``: minimum separation time provided in flow control frames; can
+  have the following values (others are reserved):
+
+  * 0x00 - 0x7F : 0 - 127 ms
+
+  * 0xF1 - 0xF9 : 100 us - 900 us
+
+* ``wftmax``: maximum number of wait frames provided in flow control frames.
+
+Link Layer options
+~~~~~~~~~~~~~~~~~~
+
+Link Layer (LL) options can be passed using the ``CAN_ISOTP_LL_OPTS`` optname:
+
+.. code-block:: C
+
+    struct can_isotp_ll_options ll_opts;
+    ret = setsockopt(s, SOL_CAN_ISOTP, CAN_ISOTP_LL_OPTS, &ll_opts, sizeof(ll_opts));
+
+where the ``can_isotp_ll_options`` structure has the following contents:
+
+.. code-block:: C
+
+    struct can_isotp_ll_options {
+        u8 mtu;
+        u8 tx_dl;
+        u8 tx_flags;
+    };
+
+* ``mtu``: generated and accepted CAN frame type, can be equal to ``CAN_MTU``
+  for classical CAN frames or ``CANFD_MTU`` for CAN FD frames.
+
+* ``tx_dl``: maximum payload length for transmitted frames, can have one value
+  among: 8, 12, 16, 20, 24, 32, 48, 64. Values above 8 only apply to CAN FD
+  traffic (i.e.: ``mtu = CANFD_MTU``).
+
+* ``tx_flags``: flags set into ``struct canfd_frame.flags`` at frame creation.
+  Only applies to CAN FD traffic (i.e.: ``mtu = CANFD_MTU``).
+
+Transmission stmin
+~~~~~~~~~~~~~~~~~~
+
+The transmission minimum separation time (stmin) can be forced using the
+``CAN_ISOTP_TX_STMIN`` optname and providing an stmin value in microseconds as
+a 32bit unsigned integer; this will overwrite the value sent by the receiver in
+flow control frames:
+
+.. code-block:: C
+
+    uint32_t stmin;
+    ret = setsockopt(s, SOL_CAN_ISOTP, CAN_ISOTP_TX_STMIN, &stmin, sizeof(stmin));
+
+Reception stmin
+~~~~~~~~~~~~~~~
+
+The reception minimum separation time (stmin) can be forced using the
+``CAN_ISOTP_RX_STMIN`` optname and providing an stmin value in microseconds as
+a 32bit unsigned integer; received Consecutive Frames (CF) which timestamps
+differ less than this value will be ignored:
+
+.. code-block:: C
+
+    uint32_t stmin;
+    ret = setsockopt(s, SOL_CAN_ISOTP, CAN_ISOTP_RX_STMIN, &stmin, sizeof(stmin));
+
+Multi-frame transport support
+-----------------------------
+
+The ISO-TP stack contained inside the Linux kernel supports the multi-frame
+transport mechanism defined by the standard, with the following constraints:
+
+* the maximum size of a PDU is defined by a module parameter, with an hard
+  limit imposed at build time.
+
+* when a transmission is in progress, subsequent calls to ``write(2)`` will
+  block, while calls to ``send(2)`` will either block or fail depending on the
+  presence of the ``MSG_DONTWAIT`` flag.
+
+* no support is present for sending "wait frames": whether a PDU can be fully
+  received or not is decided when the First Frame is received.
+
+Errors
+------
+
+Following errors are reported to userspace:
+
+RX path errors
+~~~~~~~~~~~~~~
+
+============ ===============================================================
+-ETIMEDOUT   timeout of data reception
+-EILSEQ      sequence number mismatch during a multi-frame reception
+-EBADMSG     data reception with wrong padding
+============ ===============================================================
+
+TX path errors
+~~~~~~~~~~~~~~
+
+========== =================================================================
+-ECOMM     flow control reception timeout
+-EMSGSIZE  flow control reception overflow
+-EBADMSG   flow control reception with wrong layout/padding
+========== =================================================================
+
+Examples
+========
+
+Basic node example
+------------------
+
+Following example implements a node using "normal" physical addressing, with
+RX ID equal to 0x18DAF142 and a TX ID equal to 0x18DA42F1. All options are left
+to their default.
+
+.. code-block:: C
+
+  int s;
+  struct sockaddr_can addr;
+  int ret;
+
+  s = socket(PF_CAN, SOCK_DGRAM, CAN_ISOTP);
+  if (s < 0)
+      exit(1);
+
+  addr.can_family = AF_CAN;
+  addr.can_ifindex = if_nametoindex("can0");
+  addr.can_addr.tp.tx_id = 0x18DA42F1 | CAN_EFF_FLAG;
+  addr.can_addr.tp.rx_id = 0x18DAF142 | CAN_EFF_FLAG;
+
+  ret = bind(s, (struct sockaddr *)&addr, sizeof(addr));
+  if (ret < 0)
+      exit(1);
+
+  /* Data can now be received using read(s, ...) and sent using write(s, ...) */
+
+Additional examples
+-------------------
+
+More complete (and complex) examples can be found inside the ``isotp*`` userland
+tools, distributed as part of the ``can-utils`` utilities at:
+https://github.com/linux-can/can-utils
diff --git a/Documentation/networking/j1939.rst b/Documentation/networking/j1939.rst
index f5be243d250a..45f02efe3df5 100644
--- a/Documentation/networking/j1939.rst
+++ b/Documentation/networking/j1939.rst
@@ -10,9 +10,9 @@ Overview / What Is J1939
 SAE J1939 defines a higher layer protocol on CAN. It implements a more
 sophisticated addressing scheme and extends the maximum packet size above 8
 bytes. Several derived specifications exist, which differ from the original
-J1939 on the application level, like MilCAN A, NMEA2000 and especially
+J1939 on the application level, like MilCAN A, NMEA2000, and especially
 ISO-11783 (ISOBUS). This last one specifies the so-called ETP (Extended
-Transport Protocol) which is has been included in this implementation. This
+Transport Protocol), which has been included in this implementation. This
 results in a maximum packet size of ((2 ^ 24) - 1) * 7 bytes == 111 MiB.
 
 Specifications used
@@ -32,15 +32,15 @@ sockets, we found some reasons to justify a kernel implementation for the
 addressing and transport methods used by J1939.
 
 * **Addressing:** when a process on an ECU communicates via J1939, it should
-  not necessarily know its source address. Although at least one process per
+  not necessarily know its source address. Although, at least one process per
   ECU should know the source address. Other processes should be able to reuse
   that address. This way, address parameters for different processes
   cooperating for the same ECU, are not duplicated. This way of working is
-  closely related to the UNIX concept where programs do just one thing, and do
+  closely related to the UNIX concept, where programs do just one thing and do
   it well.
 
 * **Dynamic addressing:** Address Claiming in J1939 is time critical.
-  Furthermore data transport should be handled properly during the address
+  Furthermore, data transport should be handled properly during the address
   negotiation. Putting this functionality in the kernel eliminates it as a
   requirement for _every_ user space process that communicates via J1939. This
   results in a consistent J1939 bus with proper addressing.
@@ -58,7 +58,7 @@ Therefore, these parts are left to user space.
 
 The J1939 sockets operate on CAN network devices (see SocketCAN). Any J1939
 user space library operating on CAN raw sockets will still operate properly.
-Since such library does not communicate with the in-kernel implementation, care
+Since such a library does not communicate with the in-kernel implementation, care
 must be taken that these two do not interfere. In practice, this means they
 cannot share ECU addresses. A single ECU (or virtual ECU) address is used by
 the library exclusively, or by the in-kernel system exclusively.
@@ -66,24 +66,146 @@ the library exclusively, or by the in-kernel system exclusively.
 J1939 concepts
 ==============
 
+Data Sent to the J1939 Stack
+----------------------------
+
+The data buffers sent to the J1939 stack from user space are not CAN frames
+themselves. Instead, they are payloads that the J1939 stack converts into
+proper CAN frames based on the size of the buffer and the type of transfer. The
+size of the buffer influences how the stack processes the data and determines
+the internal code path used for the transfer.
+
+**Handling of Different Buffer Sizes:**
+
+- **Buffers with a size of 8 bytes or less:**
+
+  - These are handled as simple sessions internally within the stack.
+
+  - The stack converts the buffer directly into a single CAN frame without
+    fragmentation.
+
+  - This type of transfer does not require an actual client (receiver) on the
+    receiving side.
+
+- **Buffers up to 1785 bytes:**
+
+  - These are automatically handled as J1939 Transport Protocol (TP) transfers.
+
+  - Internally, the stack splits the buffer into multiple 8-byte CAN frames.
+
+  - TP transfers can be unicast or broadcast.
+
+  - **Broadcast TP:** Does not require a receiver on the other side and can be
+    used in broadcast scenarios.
+
+  - **Unicast TP:** Requires an active receiver (client) on the other side to
+    acknowledge the transfer.
+
+- **Buffers from 1786 bytes up to 111 MiB:**
+
+  - These are handled as ISO 11783 Extended Transport Protocol (ETP) transfers.
+
+  - ETP transfers are used for larger payloads and are split into multiple CAN
+    frames internally.
+
+  - **ETP transfers (unicast):** Require a receiver on the other side to
+    process the incoming data and acknowledge each step of the transfer.
+
+  - ETP transfers cannot be broadcast like TP transfers, and always require a
+    receiver for operation.
+
+**Non-Blocking Operation with `MSG_DONTWAIT`:**
+
+The J1939 stack supports non-blocking operation when used in combination with
+the `MSG_DONTWAIT` flag. In this mode, the stack attempts to take as much data
+as the available memory for the socket allows. It returns the amount of data
+that was successfully taken, and it is the responsibility of user space to
+monitor this value and handle partial transfers.
+
+- If the stack cannot take the entire buffer, it returns the number of bytes
+  successfully taken, and user space should handle the remainder.
+
+- **Error handling:** When using `MSG_DONTWAIT`, the user must rely on the
+  error queue to detect transfer errors. See the **SO_J1939_ERRQUEUE** section
+  for details on how to subscribe to error notifications. Without the error
+  queue, there is no other way for user space to be notified of transfer errors
+  during non-blocking operations.
+
+**Behavior and Requirements:**
+
+- **Simple transfers (<= 8 bytes):** Do not require a receiver on the other
+  side, making them easy to send without needing address claiming or
+  coordination with a destination.
+
+- **Unicast TP/ETP:** Requires a receiver on the other side to complete the
+  transfer. The receiver must acknowledge the transfer for the session to
+  proceed successfully.
+
+- **Broadcast TP:** Allows sending data without a receiver, but only works for
+  TP transfers. ETP cannot be broadcast and always needs a receiving client.
+
+These different behaviors depend heavily on the size of the buffer provided to
+the stack, and the appropriate transport mechanism (TP or ETP) is selected
+based on the payload size. The stack automatically manages the fragmentation
+and reassembly of large payloads and ensures that the correct CAN frames are
+generated and transmitted for each session.
+
 PGN
 ---
 
+The J1939 protocol uses the 29-bit CAN identifier with the following structure:
+
+  ============  ==============  ====================
+  29 bit CAN-ID
+  --------------------------------------------------
+  Bit positions within the CAN-ID
+  --------------------------------------------------
+  28 ... 26     25 ... 8        7 ... 0
+  ============  ==============  ====================
+  Priority      PGN             SA (Source Address)
+  ============  ==============  ====================
+
 The PGN (Parameter Group Number) is a number to identify a packet. The PGN
 is composed as follows:
-1 bit  : Reserved Bit
-1 bit  : Data Page
-8 bits : PF (PDU Format)
-8 bits : PS (PDU Specific)
+
+  ============  ==============  =================  =================
+  PGN
+  ------------------------------------------------------------------
+  Bit positions within the CAN-ID
+  ------------------------------------------------------------------
+  25            24              23 ... 16          15 ... 8
+  ============  ==============  =================  =================
+  R (Reserved)  DP (Data Page)  PF (PDU Format)    PS (PDU Specific)
+  ============  ==============  =================  =================
 
 In J1939-21 distinction is made between PDU1 format (where PF < 240) and PDU2
-format (where PF >= 240). Furthermore, when using PDU2 format, the PS-field
+format (where PF >= 240). Furthermore, when using the PDU2 format, the PS-field
 contains a so-called Group Extension, which is part of the PGN. When using PDU2
 format, the Group Extension is set in the PS-field.
 
+  ==============  ========================
+  PDU1 Format (specific) (peer to peer)
+  ----------------------------------------
+  Bit positions within the CAN-ID
+  ----------------------------------------
+  23 ... 16       15 ... 8
+  ==============  ========================
+  00h ... EFh     DA (Destination address)
+  ==============  ========================
+
+  ==============  ========================
+  PDU2 Format (global) (broadcast)
+  ----------------------------------------
+  Bit positions within the CAN-ID
+  ----------------------------------------
+  23 ... 16       15 ... 8
+  ==============  ========================
+  F0h ... FFh     GE (Group Extension)
+  ==============  ========================
+
 On the other hand, when using PDU1 format, the PS-field contains a so-called
 Destination Address, which is _not_ part of the PGN. When communicating a PGN
-from user space to kernel (or visa versa) and PDU2 format is used, the PS-field
+from user space to kernel (or vice versa) and PDU1 format is used, the PS-field
 of the PGN shall be set to zero. The Destination Address shall be set
 elsewhere.
 
@@ -96,15 +218,15 @@ Addressing
 
 Both static and dynamic addressing methods can be used.
 
-For static addresses, no extra checks are made by the kernel, and provided
+For static addresses, no extra checks are made by the kernel and provided
 addresses are considered right. This responsibility is for the OEM or system
 integrator.
 
 For dynamic addressing, so-called Address Claiming, extra support is foreseen
-in the kernel. In J1939 any ECU is known by it's 64-bit NAME. At the moment of
+in the kernel. In J1939 any ECU is known by its 64-bit NAME. At the moment of
 a successful address claim, the kernel keeps track of both NAME and source
 address being claimed. This serves as a base for filter schemes. By default,
-packets with a destination that is not locally, will be rejected.
+packets with a destination that is not locally will be rejected.
 
 Mixed mode packets (from a static to a dynamic address or vice versa) are
 allowed. The BSD sockets define separate API calls for getting/setting the
@@ -131,31 +253,31 @@ API Calls
 ---------
 
 On CAN, you first need to open a socket for communicating over a CAN network.
-To use J1939, #include <linux/can/j1939.h>. From there, <linux/can.h> will be
+To use J1939, ``#include <linux/can/j1939.h>``. From there, ``<linux/can.h>`` will be
 included too. To open a socket, use:
 
 .. code-block:: C
 
     s = socket(PF_CAN, SOCK_DGRAM, CAN_J1939);
 
-J1939 does use SOCK_DGRAM sockets. In the J1939 specification, connections are
+J1939 does use ``SOCK_DGRAM`` sockets. In the J1939 specification, connections are
 mentioned in the context of transport protocol sessions. These still deliver
-packets to the other end (using several CAN packets). SOCK_STREAM is not
+packets to the other end (using several CAN packets). ``SOCK_STREAM`` is not
 supported.
 
-After the successful creation of the socket, you would normally use the bind(2)
-and/or connect(2) system call to bind the socket to a CAN interface.  After
-binding and/or connecting the socket, you can read(2) and write(2) from/to the
-socket or use send(2), sendto(2), sendmsg(2) and the recv*() counterpart
+After the successful creation of the socket, you would normally use the ``bind(2)``
+and/or ``connect(2)`` system call to bind the socket to a CAN interface. After
+binding and/or connecting the socket, you can ``read(2)`` and ``write(2)`` from/to the
+socket or use ``send(2)``, ``sendto(2)``, ``sendmsg(2)`` and the ``recv*()`` counterpart
 operations on the socket as usual. There are also J1939 specific socket options
 described below.
 
-In order to send data, a bind(2) must have been successful. bind(2) assigns a
+In order to send data, a ``bind(2)`` must have been successful. ``bind(2)`` assigns a
 local address to a socket.
 
-Different from CAN is that the payload data is just the data that get send,
-without it's header info. The header info is derived from the sockaddr supplied
-to bind(2), connect(2), sendto(2) and recvfrom(2). A write(2) with size 4 will
+Different from CAN is that the payload data is just the data that get sends,
+without its header info. The header info is derived from the sockaddr supplied
+to ``bind(2)``, ``connect(2)``, ``sendto(2)`` and ``recvfrom(2)``. A ``write(2)`` with size 4 will
 result in a packet with 4 bytes.
 
 The sockaddr structure has extensions for use with J1939 as specified below:
@@ -180,47 +302,47 @@ The sockaddr structure has extensions for use with J1939 as specified below:
          } can_addr;
       }
 
-can_family & can_ifindex serve the same purpose as for other SocketCAN sockets.
+``can_family`` & ``can_ifindex`` serve the same purpose as for other SocketCAN sockets.
 
-can_addr.j1939.pgn specifies the PGN (max 0x3ffff). Individual bits are
+``can_addr.j1939.pgn`` specifies the PGN (max 0x3ffff). Individual bits are
 specified above.
 
-can_addr.j1939.name contains the 64-bit J1939 NAME.
+``can_addr.j1939.name`` contains the 64-bit J1939 NAME.
 
-can_addr.j1939.addr contains the address.
+``can_addr.j1939.addr`` contains the address.
 
-The bind(2) system call assigns the local address, i.e. the source address when
-sending packages. If a PGN during bind(2) is set, it's used as a RX filter.
-I.e.  only packets with a matching PGN are received. If an ADDR or NAME is set
+The ``bind(2)`` system call assigns the local address, i.e. the source address when
+sending packages. If a PGN during ``bind(2)`` is set, it's used as a RX filter.
+I.e. only packets with a matching PGN are received. If an ADDR or NAME is set
 it is used as a receive filter, too. It will match the destination NAME or ADDR
 of the incoming packet. The NAME filter will work only if appropriate Address
 Claiming for this name was done on the CAN bus and registered/cached by the
 kernel.
 
-On the other hand connect(2) assigns the remote address, i.e. the destination
-address. The PGN from connect(2) is used as the default PGN when sending
+On the other hand ``connect(2)`` assigns the remote address, i.e. the destination
+address. The PGN from ``connect(2)`` is used as the default PGN when sending
 packets. If ADDR or NAME is set it will be used as the default destination ADDR
-or NAME. Further a set ADDR or NAME during connect(2) is used as a receive
+or NAME. Further a set ADDR or NAME during ``connect(2)`` is used as a receive
 filter. It will match the source NAME or ADDR of the incoming packet.
 
-Both write(2) and send(2) will send a packet with local address from bind(2) and
-the remote address from connect(2). Use sendto(2) to overwrite the destination
+Both ``write(2)`` and ``send(2)`` will send a packet with local address from ``bind(2)`` and the
+remote address from ``connect(2)``. Use ``sendto(2)`` to overwrite the destination
 address.
 
-If can_addr.j1939.name is set (!= 0) the NAME is looked up by the kernel and
-the corresponding ADDR is used. If can_addr.j1939.name is not set (== 0),
-can_addr.j1939.addr is used.
+If ``can_addr.j1939.name`` is set (!= 0) the NAME is looked up by the kernel and
+the corresponding ADDR is used. If ``can_addr.j1939.name`` is not set (== 0),
+``can_addr.j1939.addr`` is used.
 
 When creating a socket, reasonable defaults are set. Some options can be
-modified with setsockopt(2) & getsockopt(2).
+modified with ``setsockopt(2)`` & ``getsockopt(2)``.
 
 RX path related options:
 
-- SO_J1939_FILTER - configure array of filters
-- SO_J1939_PROMISC - disable filters set by bind(2) and connect(2)
+- ``SO_J1939_FILTER`` - configure array of filters
+- ``SO_J1939_PROMISC`` - disable filters set by ``bind(2)`` and ``connect(2)``
 
 By default no broadcast packets can be send or received. To enable sending or
-receiving broadcast packets use the socket option SO_BROADCAST:
+receiving broadcast packets use the socket option ``SO_BROADCAST``:
 
 .. code-block:: C
 
@@ -261,26 +383,26 @@ The following diagram illustrates the RX path:
      +---------------------------+
 
 TX path related options:
-SO_J1939_SEND_PRIO - change default send priority for the socket
+``SO_J1939_SEND_PRIO`` - change default send priority for the socket
 
 Message Flags during send() and Related System Calls
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-send(2), sendto(2) and sendmsg(2) take a 'flags' argument. Currently
+``send(2)``, ``sendto(2)`` and ``sendmsg(2)`` take a 'flags' argument. Currently
 supported flags are:
 
-* MSG_DONTWAIT, i.e. non-blocking operation.
+* ``MSG_DONTWAIT``, i.e. non-blocking operation.
 
 recvmsg(2)
 ^^^^^^^^^^
 
-In most cases recvmsg(2) is needed if you want to extract more information than
-recvfrom(2) can provide. For example package priority and timestamp. The
+In most cases ``recvmsg(2)`` is needed if you want to extract more information than
+``recvfrom(2)`` can provide. For example package priority and timestamp. The
 Destination Address, name and packet priority (if applicable) are attached to
-the msghdr in the recvmsg(2) call. They can be extracted using cmsg(3) macros,
-with cmsg_level == SOL_J1939 && cmsg_type == SCM_J1939_DEST_ADDR,
-SCM_J1939_DEST_NAME or SCM_J1939_PRIO. The returned data is a uint8_t for
-priority and dst_addr, and uint64_t for dst_name.
+the msghdr in the ``recvmsg(2)`` call. They can be extracted using ``cmsg(3)`` macros,
+with ``cmsg_level == SOL_J1939 && cmsg_type == SCM_J1939_DEST_ADDR``,
+``SCM_J1939_DEST_NAME`` or ``SCM_J1939_PRIO``. The returned data is a ``uint8_t`` for
+``priority`` and ``dst_addr``, and ``uint64_t`` for ``dst_name``.
 
 .. code-block:: C
 
@@ -300,17 +422,470 @@ priority and dst_addr, and uint64_t for dst_name.
 		}
 	}
 
+setsockopt(2)
+^^^^^^^^^^^^^
+
+The ``setsockopt(2)`` function is used to configure various socket-level
+options for J1939 communication. The following options are supported:
+
+``SO_J1939_FILTER``
+~~~~~~~~~~~~~~~~~~~
+
+The ``SO_J1939_FILTER`` option is essential when the default behavior of
+``bind(2)`` and ``connect(2)`` is insufficient for specific use cases. By
+default, ``bind(2)`` and ``connect(2)`` allow a socket to be associated with a
+single unicast or broadcast address. However, there are scenarios where finer
+control over the incoming messages is required, such as filtering by Parameter
+Group Number (PGN) rather than by addresses.
+
+For example, in a system where multiple types of J1939 messages are being
+transmitted, a process might only be interested in a subset of those messages,
+such as specific PGNs, and not want to receive all messages destined for its
+address or broadcast to the bus.
+
+By applying the ``SO_J1939_FILTER`` option, you can filter messages based on:
+
+- **Source Address (SA)**: Filter messages coming from specific source
+  addresses.
+
+- **Source Name**: Filter messages coming from ECUs with specific NAME
+  identifiers.
+
+- **Parameter Group Number (PGN)**: Focus on receiving messages with specific
+  PGNs, filtering out irrelevant ones.
+
+This filtering mechanism is particularly useful when:
+
+- You want to receive a subset of messages based on their PGNs, even if the
+  address is the same.
+
+- You need to handle both broadcast and unicast messages but only care about
+  certain message types or parameters.
+
+- The ``bind(2)`` and ``connect(2)`` functions only allow binding to a single
+  address, which might not be sufficient if the process needs to handle multiple
+  PGNs but does not want to open multiple sockets.
+
+To remove existing filters, you can pass ``optval == NULL`` or ``optlen == 0``
+to ``setsockopt(2)``. This will clear all currently set filters. If you want to
+**update** the set of filters, you must pass the updated filter set to
+``setsockopt(2)``, as the new filter set will **replace** the old one entirely.
+This behavior ensures that any previous filter configuration is discarded and
+only the new set is applied.
+
+Example of removing all filters:
+
+.. code-block:: c
+
+    setsockopt(sock, SOL_CAN_J1939, SO_J1939_FILTER, NULL, 0);
+
+**Maximum number of filters:** The maximum amount of filters that can be
+applied using ``SO_J1939_FILTER`` is defined by ``J1939_FILTER_MAX``, which is
+set to 512. This means you can configure up to 512 individual filters to match
+your specific filtering needs.
+
+Practical use case: **Monitoring Address Claiming**
+
+One practical use case is monitoring the J1939 address claiming process by
+filtering for specific PGNs related to address claiming. This allows a process
+to monitor and handle address claims without processing unrelated messages.
+
+Example:
+
+.. code-block:: c
+
+    struct j1939_filter filt[] = {
+        {
+            .pgn = J1939_PGN_ADDRESS_CLAIMED,
+            .pgn_mask = J1939_PGN_PDU1_MAX,
+        }, {
+            .pgn = J1939_PGN_REQUEST,
+            .pgn_mask = J1939_PGN_PDU1_MAX,
+        }, {
+            .pgn = J1939_PGN_ADDRESS_COMMANDED,
+            .pgn_mask = J1939_PGN_MAX,
+        },
+    };
+    setsockopt(sock, SOL_CAN_J1939, SO_J1939_FILTER, &filt, sizeof(filt));
+
+In this example, the socket will only receive messages with the PGNs related to
+address claiming: ``J1939_PGN_ADDRESS_CLAIMED``, ``J1939_PGN_REQUEST``, and
+``J1939_PGN_ADDRESS_COMMANDED``. This is particularly useful in scenarios where
+you want to monitor and process address claims without being overwhelmed by
+other traffic on the J1939 network.
+
+``SO_J1939_PROMISC``
+~~~~~~~~~~~~~~~~~~~~
+
+The ``SO_J1939_PROMISC`` option enables socket-level promiscuous mode. When
+this option is enabled, the socket will receive all J1939 traffic, regardless
+of any filters set by ``bind()`` or ``connect()``. This is analogous to
+enabling promiscuous mode for an Ethernet interface, where all traffic on the
+network segment is captured.
+
+However, **`SO_J1939_FILTER` has a higher priority** compared to
+``SO_J1939_PROMISC``. This means that even in promiscuous mode, you can reduce
+the number of packets received by applying specific filters with
+`SO_J1939_FILTER`. The filters will limit which packets are passed to the
+socket, allowing for more refined traffic selection while promiscuous mode is
+active.
+
+The acceptable value size for this option is ``sizeof(int)``, and the value is
+only differentiated between `0` and non-zero. A value of `0` disables
+promiscuous mode, while any non-zero value enables it.
+
+This combination can be useful for debugging or monitoring specific types of
+traffic while still capturing a broad set of messages.
+
+Example:
+
+.. code-block:: c
+
+    int value = 1;
+    setsockopt(sock, SOL_CAN_J1939, SO_J1939_PROMISC, &value, sizeof(value));
+
+In this example, setting ``value`` to any non-zero value (e.g., `1`) enables
+promiscuous mode, allowing the socket to receive all J1939 traffic on the
+network.
+
+``SO_BROADCAST``
+~~~~~~~~~~~~~~~~
+
+The ``SO_BROADCAST`` option enables the sending and receiving of broadcast
+messages. By default, broadcast messages are disabled for J1939 sockets. When
+this option is enabled, the socket will be allowed to send and receive
+broadcast packets on the J1939 network.
+
+Due to the nature of the CAN bus as a shared medium, all messages transmitted
+on the bus are visible to all participants. In the context of J1939,
+broadcasting refers to using a specific destination address field, where the
+destination address is set to a value that indicates the message is intended
+for all participants (usually a global address such as 0xFF). Enabling the
+broadcast option allows the socket to send and receive such broadcast messages.
+
+The acceptable value size for this option is ``sizeof(int)``, and the value is
+only differentiated between `0` and non-zero. A value of `0` disables the
+ability to send and receive broadcast messages, while any non-zero value
+enables it.
+
+Example:
+
+.. code-block:: c
+
+    int value = 1;
+    setsockopt(sock, SOL_SOCKET, SO_BROADCAST, &value, sizeof(value));
+
+In this example, setting ``value`` to any non-zero value (e.g., `1`) enables
+the socket to send and receive broadcast messages.
+
+``SO_J1939_SEND_PRIO``
+~~~~~~~~~~~~~~~~~~~~~~
+
+The ``SO_J1939_SEND_PRIO`` option sets the priority of outgoing J1939 messages
+for the socket. In J1939, messages can have different priorities, and lower
+numerical values indicate higher priority. This option allows the user to
+control the priority of messages sent from the socket by adjusting the priority
+bits in the CAN identifier.
+
+The acceptable value **size** for this option is ``sizeof(int)``, and the value
+is expected to be in the range of 0 to 7, where `0` is the highest priority,
+and `7` is the lowest. By default, the priority is set to `6` if this option is
+not explicitly configured.
+
+Note that the priority values `0` and `1` can only be set if the process has
+the `CAP_NET_ADMIN` capability. These are reserved for high-priority traffic
+and require administrative privileges.
+
+Example:
+
+.. code-block:: c
+
+    int prio = 3;  // Priority value between 0 (highest) and 7 (lowest)
+    setsockopt(sock, SOL_CAN_J1939, SO_J1939_SEND_PRIO, &prio, sizeof(prio));
+
+In this example, the priority is set to `3`, meaning the outgoing messages will
+be sent with a moderate priority level.
+
+``SO_J1939_ERRQUEUE``
+~~~~~~~~~~~~~~~~~~~~~
+
+The ``SO_J1939_ERRQUEUE`` option enables the socket to receive error messages
+from the error queue, providing diagnostic information about transmission
+failures, protocol violations, or other issues that occur during J1939
+communication. Once this option is set, user space is required to handle
+``MSG_ERRQUEUE`` messages.
+
+Setting ``SO_J1939_ERRQUEUE`` to ``0`` will purge any currently present error
+messages in the error queue. When enabled, error messages can be retrieved
+using the ``recvmsg(2)`` system call.
+
+When subscribing to the error queue, the following error events can be
+accessed:
+
+- **``J1939_EE_INFO_TX_ABORT``**: Transmission abort errors.
+- **``J1939_EE_INFO_RX_RTS``**: Reception of RTS (Request to Send) control
+  frames.
+- **``J1939_EE_INFO_RX_DPO``**: Reception of data packets with Data Page Offset
+  (DPO).
+- **``J1939_EE_INFO_RX_ABORT``**: Reception abort errors.
+
+The error queue can be used to correlate errors with specific message transfer
+sessions using the session ID (``tskey``). The session ID is assigned via the
+``SOF_TIMESTAMPING_OPT_ID`` flag, which is set by enabling the
+``SO_TIMESTAMPING`` option.
+
+If ``SO_J1939_ERRQUEUE`` is activated, the user is required to pull messages
+from the error queue, meaning that using plain ``recv(2)`` is not sufficient
+anymore. The user must use ``recvmsg(2)`` with appropriate flags to handle
+error messages. Failure to do so can result in the socket becoming blocked with
+unprocessed error messages in the queue.
+
+It is **recommended** that ``SO_J1939_ERRQUEUE`` be used in combination with
+``SO_TIMESTAMPING`` in most cases. This enables proper error handling along
+with session tracking and timestamping, providing a more detailed analysis of
+message transfers and errors.
+
+The acceptable value **size** for this option is ``sizeof(int)``, and the value
+is only differentiated between ``0`` and non-zero. A value of ``0`` disables
+error queue reception and purges any existing error messages, while any
+non-zero value enables it.
+
+Example:
+
+.. code-block:: c
+
+    int enable = 1;  // Enable error queue reception
+    setsockopt(sock, SOL_CAN_J1939, SO_J1939_ERRQUEUE, &enable, sizeof(enable));
+
+    // Enable timestamping with session tracking via tskey
+    int timestamping = SOF_TIMESTAMPING_OPT_ID | SOF_TIMESTAMPING_TX_ACK |
+                       SOF_TIMESTAMPING_TX_SCHED |
+                       SOF_TIMESTAMPING_RX_SOFTWARE | SOF_TIMESTAMPING_OPT_CMSG;
+    setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPING, &timestamping,
+               sizeof(timestamping));
+
+When enabled, error messages can be retrieved using ``recvmsg(2)``. By
+combining ``SO_J1939_ERRQUEUE`` with ``SO_TIMESTAMPING`` (with
+``SOF_TIMESTAMPING_OPT_ID`` and ``SOF_TIMESTAMPING_OPT_CMSG`` enabled), the
+user can track message transfers, retrieve precise timestamps, and correlate
+errors with specific sessions.
+
+For more information on enabling timestamps and session tracking, refer to the
+`SO_TIMESTAMPING` section.
+
+``SO_TIMESTAMPING``
+~~~~~~~~~~~~~~~~~~~
+
+The ``SO_TIMESTAMPING`` option allows the socket to receive timestamps for
+various events related to message transmissions and receptions in J1939. This
+option is often used in combination with ``SO_J1939_ERRQUEUE`` to provide
+detailed diagnostic information, session tracking, and precise timing data for
+message transfers.
+
+In J1939, all payloads provided by user space, regardless of size, are
+processed by the kernel as **sessions**. This includes both single-frame
+messages (up to 8 bytes) and multi-frame protocols such as the Transport
+Protocol (TP) and Extended Transport Protocol (ETP). Even for small,
+single-frame messages, the kernel creates a session to manage the transmission
+and reception. The concept of sessions allows the kernel to manage various
+aspects of the protocol, such as reassembling multi-frame messages and tracking
+the status of transmissions.
+
+When receiving extended error messages from the error queue, the error
+information is delivered through a `struct sock_extended_err`, accessible via
+the control message (``cmsg``) retrieved using the ``recvmsg(2)`` system call.
+
+There are two typical origins for the extended error messages in J1939:
+
+1. ``serr->ee_origin == SO_EE_ORIGIN_TIMESTAMPING``:
+
+   In this case, the `serr->ee_info` field will contain one of the following
+   timestamp types:
+
+   - ``SCM_TSTAMP_SCHED``: This timestamp is valid for Extended Transport
+     Protocol (ETP) transfers and simple transfers (8 bytes or less). It
+     indicates when a message or set of frames has been scheduled for
+     transmission.
+
+     - For simple transfers (8 bytes or less), it marks the point when the
+       message is queued and ready to be sent onto the CAN bus.
+
+     - For ETP transfers, it is sent after receiving a CTS (Clear to Send)
+       frame on the sender side, indicating that a new set of frames has been
+       scheduled for transmission.
+
+     - The Transport Protocol (TP) case is currently not implemented for this
+       timestamp.
+
+     - On the receiver side, the counterpart to this event for ETP is
+       represented by the ``J1939_EE_INFO_RX_DPO`` message, which indicates the
+       reception of a Data Page Offset (DPO) control frame.
+
+   - ``SCM_TSTAMP_ACK``: This timestamp indicates the acknowledgment of the
+     message or session.
+
+     - For simple transfers (8 bytes or less), it marks when the message has
+       been sent and an echo confirmation has been received from the CAN
+       controller, indicating that the frame was transmitted onto the bus.
+
+     - For multi-frame transfers (TP or ETP), it signifies that the entire
+       session has been acknowledged, typically after receiving the End of
+       Message Acknowledgment (EOMA) packet.
+
+2. ``serr->ee_origin == SO_EE_ORIGIN_LOCAL``:
+
+   In this case, the `serr->ee_info` field will contain one of the following
+   J1939 stack-specific message types:
+
+   - ``J1939_EE_INFO_TX_ABORT``: This message indicates that the transmission
+     of a message or session was aborted. The cause of the abort can come from
+     various sources:
+
+     - **CAN stack failure**: The J1939 stack was unable to pass the frame to
+       the CAN framework for transmission.
+
+     - **Echo failure**: The J1939 stack did not receive an echo confirmation
+       from the CAN controller, meaning the frame may not have been successfully
+       transmitted to the CAN bus.
+
+     - **Protocol-level issues**: For multi-frame transfers (TP/ETP), this
+       could include protocol-related errors, such as an abort signaled by the
+       receiver or a timeout at the protocol level, which causes the session to
+       terminate prematurely.
+
+     - The corresponding error code is stored in ``serr->ee_data``
+       (``session->err`` on kernel side), providing additional details about
+       the specific reason for the abort.
+
+   - ``J1939_EE_INFO_RX_RTS``: This message indicates that the J1939 stack has
+     received a Request to Send (RTS) control frame, signaling the start of a
+     multi-frame transfer using the Transport Protocol (TP) or Extended
+     Transport Protocol (ETP).
+
+     - It informs the receiver that the sender is ready to transmit a
+       multi-frame message and includes details about the total message size
+       and the number of frames to be sent.
+
+     - Statistics such as ``J1939_NLA_TOTAL_SIZE``, ``J1939_NLA_PGN``,
+       ``J1939_NLA_SRC_NAME``, and ``J1939_NLA_DEST_NAME`` are provided along
+       with the ``J1939_EE_INFO_RX_RTS`` message, giving detailed information
+       about the incoming transfer.
+
+   - ``J1939_EE_INFO_RX_DPO``: This message indicates that the J1939 stack has
+     received a Data Page Offset (DPO) control frame, which is part of the
+     Extended Transport Protocol (ETP).
+
+     - The DPO frame signals the continuation of an ETP multi-frame message by
+       indicating the offset position in the data being transferred. It helps
+       the receiver manage large data sets by identifying which portion of the
+       message is being received.
+
+     - It is typically paired with a corresponding ``SCM_TSTAMP_SCHED`` event
+       on the sender side, which indicates when the next set of frames is
+       scheduled for transmission.
+
+     - This event includes statistics such as ``J1939_NLA_BYTES_ACKED``, which
+       tracks the number of bytes acknowledged up to that point in the session.
+
+   - ``J1939_EE_INFO_RX_ABORT``: This message indicates that the reception of a
+     multi-frame message (Transport Protocol or Extended Transport Protocol) has
+     been aborted.
+
+     - The abort can be triggered by protocol-level errors such as timeouts, an
+       unexpected frame, or a specific abort request from the sender.
+
+     - This message signals that the receiver cannot continue processing the
+       transfer, and the session is terminated.
+
+     - The corresponding error code is stored in ``serr->ee_data``
+       (``session->err`` on kernel side ), providing further details about the
+       reason for the abort, such as protocol violations or timeouts.
+
+     - After receiving this message, the receiver discards the partially received
+       frames, and the multi-frame session is considered incomplete.
+
+In both cases, if ``SOF_TIMESTAMPING_OPT_ID`` is enabled, ``serr->ee_data``
+will be set to the session’s unique identifier (``session->tskey``). This
+allows user space to track message transfers by their session identifier across
+multiple frames or stages.
+
+In all other cases, ``serr->ee_errno`` will be set to ``ENOMSG``, except for
+the ``J1939_EE_INFO_TX_ABORT`` and ``J1939_EE_INFO_RX_ABORT`` cases, where the
+kernel sets ``serr->ee_data`` to the error stored in ``session->err``.  All
+protocol-specific errors are converted to standard kernel error values and
+stored in ``session->err``. These error values are unified across system calls
+and ``serr->ee_errno``.  Some of the known error values are described in the
+`Error Codes in the J1939 Stack` section.
+
+When the `J1939_EE_INFO_RX_RTS` message is provided, it will include the
+following statistics for multi-frame messages (TP and ETP):
+
+  - ``J1939_NLA_TOTAL_SIZE``: Total size of the message in the session.
+  - ``J1939_NLA_PGN``: Parameter Group Number (PGN) identifying the message type.
+  - ``J1939_NLA_SRC_NAME``: 64-bit name of the source ECU.
+  - ``J1939_NLA_DEST_NAME``: 64-bit name of the destination ECU.
+  - ``J1939_NLA_SRC_ADDR``: 8-bit source address of the sending ECU.
+  - ``J1939_NLA_DEST_ADDR``: 8-bit destination address of the receiving ECU.
+
+- For other messages (including single-frame messages), only the following
+  statistic is included:
+
+  - ``J1939_NLA_BYTES_ACKED``: Number of bytes successfully acknowledged in the
+    session.
+
+The key flags for ``SO_TIMESTAMPING`` include:
+
+- ``SOF_TIMESTAMPING_OPT_ID``: Enables the use of a unique session identifier
+  (``tskey``) for each transfer. This identifier helps track message transfers
+  and errors as distinct sessions in user space. When this option is enabled,
+  ``serr->ee_data`` will be set to ``session->tskey``.
+
+- ``SOF_TIMESTAMPING_OPT_CMSG``: Sends timestamp information through control
+  messages (``struct scm_timestamping``), allowing the application to retrieve
+  timestamps alongside the data.
+
+- ``SOF_TIMESTAMPING_TX_SCHED``: Provides the timestamp for when a message is
+  scheduled for transmission (``SCM_TSTAMP_SCHED``).
+
+- ``SOF_TIMESTAMPING_TX_ACK``: Provides the timestamp for when a message
+  transmission is fully acknowledged (``SCM_TSTAMP_ACK``).
+
+- ``SOF_TIMESTAMPING_RX_SOFTWARE``: Provides timestamps for reception-related
+  events (e.g., ``J1939_EE_INFO_RX_RTS``, ``J1939_EE_INFO_RX_DPO``,
+  ``J1939_EE_INFO_RX_ABORT``).
+
+These flags enable detailed monitoring of message lifecycles, including
+transmission scheduling, acknowledgments, reception timestamps, and gathering
+detailed statistics about the communication session, especially for multi-frame
+payloads like TP and ETP.
+
+Example:
+
+.. code-block:: c
+
+    // Enable timestamping with various options, including session tracking and
+    // statistics
+    int sock_opt = SOF_TIMESTAMPING_OPT_CMSG |
+                   SOF_TIMESTAMPING_TX_ACK |
+                   SOF_TIMESTAMPING_TX_SCHED |
+                   SOF_TIMESTAMPING_OPT_ID |
+                   SOF_TIMESTAMPING_RX_SOFTWARE;
+
+    setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPING, &sock_opt, sizeof(sock_opt));
+
+
+
 Dynamic Addressing
 ------------------
 
 Distinction has to be made between using the claimed address and doing an
 address claim. To use an already claimed address, one has to fill in the
-j1939.name member and provide it to bind(2). If the name had claimed an address
+``j1939.name`` member and provide it to ``bind(2)``. If the name had claimed an address
 earlier, all further messages being sent will use that address. And the
-j1939.addr member will be ignored.
+``j1939.addr`` member will be ignored.
 
 An exception on this is PGN 0x0ee00. This is the "Address Claim/Cannot Claim
-Address" message and the kernel will use the j1939.addr member for that PGN if
+Address" message and the kernel will use the ``j1939.addr`` member for that PGN if
 necessary.
 
 To claim an address following code example can be used:
@@ -371,12 +946,12 @@ NAME can send packets.
 
 If another ECU claims the address, the kernel will mark the NAME-SA expired.
 No socket bound to the NAME can send packets (other than address claims). To
-claim another address, some socket bound to NAME, must bind(2) again, but with
-only j1939.addr changed to the new SA, and must then send a valid address claim
+claim another address, some socket bound to NAME, must ``bind(2)`` again, but with
+only ``j1939.addr`` changed to the new SA, and must then send a valid address claim
 packet. This restarts the state machine in the kernel (and any other
 participant on the bus) for this NAME.
 
-can-utils also include the jacd tool, so it can be used as code example or as
+``can-utils`` also include the ``j1939acd`` tool, so it can be used as code example or as
 default Address Claiming daemon.
 
 Send Examples
@@ -403,8 +978,8 @@ Bind:
 
 	bind(sock, (struct sockaddr *)&baddr, sizeof(baddr));
 
-Now, the socket 'sock' is bound to the SA 0x20. Since no connect(2) was called,
-at this point we can use only sendto(2) or sendmsg(2).
+Now, the socket 'sock' is bound to the SA 0x20. Since no ``connect(2)`` was called,
+at this point we can use only ``sendto(2)`` or ``sendmsg(2)``.
 
 Send:
 
@@ -414,9 +989,147 @@ Send:
 		.can_family = AF_CAN,
 		.can_addr.j1939 = {
 			.name = J1939_NO_NAME;
-			.pgn = 0x30,
-			.addr = 0x12300,
+			.addr = 0x30,
+			.pgn = 0x12300,
 		},
 	};
 
 	sendto(sock, dat, sizeof(dat), 0, (const struct sockaddr *)&saddr, sizeof(saddr));
+
+
+Error Codes in the J1939 Stack
+------------------------------
+
+This section lists all potential kernel error codes that can be exposed to user
+space when interacting with the J1939 stack. It includes both standard error
+codes and those derived from protocol-specific abort codes.
+
+- ``EAGAIN``: Operation would block; retry may succeed. One common reason is
+  that an active TP or ETP session exists, and an attempt was made to start a
+  new overlapping TP or ETP session between the same peers.
+
+- ``ENETDOWN``: Network is down. This occurs when the CAN interface is switched
+  to the "down" state.
+
+- ``ENOBUFS``: No buffer space available. This error occurs when the CAN
+  interface's transmit (TX) queue is full, and no more messages can be queued.
+
+- ``EOVERFLOW``: Value too large for defined data type. In J1939, this can
+  happen if the requested data lies outside of the queued buffer. For example,
+  if a CTS (Clear to Send) requests an offset not available in the kernel buffer
+  because user space did not provide enough data.
+
+- ``EBUSY``: Device or resource is busy. For example, this occurs if an
+  identical session is already active and the stack is unable to recover from
+  the condition.
+
+- ``EACCES``: Permission denied. This error can occur, for example, when
+  attempting to send broadcast messages, but the socket is not configured with
+  ``SO_BROADCAST``.
+
+- ``EADDRNOTAVAIL``: Address not available. This error occurs in cases such as:
+
+  - When attempting to use ``getsockname(2)`` to retrieve the peer's address,
+    but the socket is not connected.
+
+  - When trying to send data to or from a NAME, but address claiming for the
+    NAME was not performed or detected by the stack.
+
+- ``EBADFD``: File descriptor in bad state. This error can occur if:
+
+  - Attempting to send data to an unbound socket.
+
+  - The socket is bound but has no source name, and the source address is
+    ``J1939_NO_ADDR``.
+
+  - The ``can_ifindex`` is incorrect.
+
+- ``EFAULT``: Bad address. Occurs mostly when the stack can't copy from or to a
+  sockptr, when there is insufficient data from user space, or when the buffer
+  provided by user space is not large enough for the requested data.
+
+- ``EINTR``: A signal occurred before any data was transmitted; see ``signal(7)``.
+
+- ``EINVAL``: Invalid argument passed. For example:
+
+  - ``msg->msg_namelen`` is less than ``J1939_MIN_NAMELEN``.
+
+  - ``addr->can_family`` is not equal to ``AF_CAN``.
+
+  - An incorrect PGN was provided.
+
+- ``ENODEV``: No such device. This happens when the CAN network device cannot
+  be found for the provided ``can_ifindex`` or if ``can_ifindex`` is 0.
+
+- ``ENOMEM``: Out of memory. Typically related to issues with memory allocation
+  in the stack.
+
+- ``ENOPROTOOPT``: Protocol not available. This can occur when using
+  ``getsockopt(2)`` or ``setsockopt(2)`` if the requested socket option is not
+  available.
+
+- ``EDESTADDRREQ``: Destination address required. This error occurs:
+
+  - In the case of ``connect(2)``, if the ``struct sockaddr *uaddr`` is ``NULL``.
+
+  - In the case of ``send*(2)``, if there is an attempt to send an ETP message
+    to a broadcast address.
+
+- ``EDOM``: Argument out of domain. This error may happen if attempting to send
+  a TP or ETP message to a PGN that is reserved for control PGNs for TP or ETP
+  operations.
+
+- ``EIO``: I/O error. This can occur if the amount of data provided to the
+  socket for a TP or ETP session does not match the announced amount of data for
+  the session.
+
+- ``ENOENT``: No such file or directory. This can happen when the stack
+  attempts to transfer CTS or EOMA but cannot find a matching receiving socket
+  anymore.
+
+- ``ENOIOCTLCMD``: No ioctls are available for the socket layer.
+
+- ``EPERM``: Operation not permitted. For example, this can occur if a
+  requested action requires ``CAP_NET_ADMIN`` privileges.
+
+- ``ENETUNREACH``: Network unreachable. Most likely, this occurs when frames
+  cannot be transmitted to the CAN bus.
+
+- ``ETIME``: Timer expired. This can happen if a timeout occurs while
+  attempting to send a simple message, for example, when an echo message from
+  the controller is not received.
+
+- ``EPROTO``: Protocol error.
+
+  - Used for various protocol-level errors in J1939, including:
+
+    - Duplicate sequence number.
+
+    - Unexpected EDPO or ECTS packet.
+
+    - Invalid PGN or offset in EDPO/ECTS.
+
+    - Number of EDPO packets exceeded CTS allowance.
+
+    - Any other protocol-level error.
+
+- ``EMSGSIZE``: Message too long.
+
+- ``ENOMSG``: No message available.
+
+- ``EALREADY``: The ECU is already engaged in one or more connection-managed
+  sessions and cannot support another.
+
+- ``EHOSTUNREACH``: A timeout occurred, and the session was aborted.
+
+- ``EBADMSG``: CTS (Clear to Send) messages were received during an active data
+  transfer, causing an abort.
+
+- ``ENOTRECOVERABLE``: The maximum retransmission request limit was reached,
+  and the session cannot recover.
+
+- ``ENOTCONN``: An unexpected data transfer packet was received.
+
+- ``EILSEQ``: A bad sequence number was received, and the software could not
+  recover.
+
diff --git a/Documentation/networking/kapi.rst b/Documentation/networking/kapi.rst
index f03ae64be8bc..98682b9a13ee 100644
--- a/Documentation/networking/kapi.rst
+++ b/Documentation/networking/kapi.rst
@@ -83,27 +83,6 @@ SUN RPC subsystem
 .. kernel-doc:: net/sunrpc/clnt.c
    :export:
 
-WiMAX
------
-
-.. kernel-doc:: net/wimax/op-msg.c
-   :export:
-
-.. kernel-doc:: net/wimax/op-reset.c
-   :export:
-
-.. kernel-doc:: net/wimax/op-rfkill.c
-   :export:
-
-.. kernel-doc:: net/wimax/stack.c
-   :export:
-
-.. kernel-doc:: include/net/wimax.h
-   :internal:
-
-.. kernel-doc:: include/uapi/linux/wimax.h
-   :internal:
-
 Network device support
 ======================
 
@@ -125,6 +104,9 @@ Driver Support
 .. kernel-doc:: include/linux/netdevice.h
    :internal:
 
+.. kernel-doc:: include/net/net_shaper.h
+   :internal:
+
 PHY Support
 -----------
 
@@ -134,6 +116,15 @@ PHY Support
 .. kernel-doc:: drivers/net/phy/phy.c
    :internal:
 
+.. kernel-doc:: drivers/net/phy/phy-core.c
+   :export:
+
+.. kernel-doc:: drivers/net/phy/phy-c45.c
+   :export:
+
+.. kernel-doc:: include/linux/phy.h
+   :internal:
+
 .. kernel-doc:: drivers/net/phy/phy_device.c
    :export:
 
diff --git a/Documentation/networking/kcm.rst b/Documentation/networking/kcm.rst
index db0f5560ac1c..71f44d0beaa3 100644
--- a/Documentation/networking/kcm.rst
+++ b/Documentation/networking/kcm.rst
@@ -200,7 +200,7 @@ while. Example use::
 
   setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val))
 
-BFP programs for message delineation
+BPF programs for message delineation
 ------------------------------------
 
 BPF programs can be compiled using the BPF LLVM backend. For example,
diff --git a/Documentation/networking/l2tp.rst b/Documentation/networking/l2tp.rst
index a48238a2ec09..e8cf8b3e60ac 100644
--- a/Documentation/networking/l2tp.rst
+++ b/Documentation/networking/l2tp.rst
@@ -4,124 +4,364 @@
 L2TP
 ====
 
-This document describes how to use the kernel's L2TP drivers to
-provide L2TP functionality. L2TP is a protocol that tunnels one or
-more sessions over an IP tunnel. It is commonly used for VPNs
-(L2TP/IPSec) and by ISPs to tunnel subscriber PPP sessions over an IP
-network infrastructure. With L2TPv3, it is also useful as a Layer-2
-tunneling infrastructure.
-
-Features
+Layer 2 Tunneling Protocol (L2TP) allows L2 frames to be tunneled over
+an IP network.
+
+This document covers the kernel's L2TP subsystem. It documents kernel
+APIs for application developers who want to use the L2TP subsystem and
+it provides some technical details about the internal implementation
+which may be useful to kernel developers and maintainers.
+
+Overview
 ========
 
-L2TPv2 (PPP over L2TP (UDP tunnels)).
-L2TPv3 ethernet pseudowires.
-L2TPv3 PPP pseudowires.
-L2TPv3 IP encapsulation.
-Netlink sockets for L2TPv3 configuration management.
-
-History
-=======
-
-The original pppol2tp driver was introduced in 2.6.23 and provided
-L2TPv2 functionality (rfc2661). L2TPv2 is used to tunnel one or more PPP
-sessions over a UDP tunnel.
-
-L2TPv3 (rfc3931) changes the protocol to allow different frame types
-to be passed over an L2TP tunnel by moving the PPP-specific parts of
-the protocol out of the core L2TP packet headers. Each frame type is
-known as a pseudowire type. Ethernet, PPP, HDLC, Frame Relay and ATM
-pseudowires for L2TP are defined in separate RFC standards. Another
-change for L2TPv3 is that it can be carried directly over IP with no
-UDP header (UDP is optional). It is also possible to create static
-unmanaged L2TPv3 tunnels manually without a control protocol
-(userspace daemon) to manage them.
-
-To support L2TPv3, the original pppol2tp driver was split up to
-separate the L2TP and PPP functionality. Existing L2TPv2 userspace
-apps should be unaffected as the original pppol2tp sockets API is
-retained. L2TPv3, however, uses netlink to manage L2TPv3 tunnels and
-sessions.
-
-Design
-======
-
-The L2TP protocol separates control and data frames.  The L2TP kernel
-drivers handle only L2TP data frames; control frames are always
-handled by userspace. L2TP control frames carry messages between L2TP
-clients/servers and are used to setup / teardown tunnels and
-sessions. An L2TP client or server is implemented in userspace.
-
-Each L2TP tunnel is implemented using a UDP or L2TPIP socket; L2TPIP
-provides L2TPv3 IP encapsulation (no UDP) and is implemented using a
-new l2tpip socket family. The tunnel socket is typically created by
-userspace, though for unmanaged L2TPv3 tunnels, the socket can also be
-created by the kernel. Each L2TP session (pseudowire) gets a network
-interface instance. In the case of PPP, these interfaces are created
-indirectly by pppd using a pppol2tp socket. In the case of ethernet,
-the netdevice is created upon a netlink request to create an L2TPv3
-ethernet pseudowire.
-
-For PPP, the PPPoL2TP driver, net/l2tp/l2tp_ppp.c, provides a
-mechanism by which PPP frames carried through an L2TP session are
-passed through the kernel's PPP subsystem. The standard PPP daemon,
-pppd, handles all PPP interaction with the peer. PPP network
-interfaces are created for each local PPP endpoint. The kernel's PPP
-subsystem arranges for PPP control frames to be delivered to pppd,
-while data frames are forwarded as usual.
-
-For ethernet, the L2TPETH driver, net/l2tp/l2tp_eth.c, implements a
-netdevice driver, managing virtual ethernet devices, one per
-pseudowire. These interfaces can be managed using standard Linux tools
-such as "ip" and "ifconfig". If only IP frames are passed over the
-tunnel, the interface can be given an IP addresses of itself and its
-peer. If non-IP frames are to be passed over the tunnel, the interface
-can be added to a bridge using brctl. All L2TP datapath protocol
-functions are handled by the L2TP core driver.
-
-Each tunnel and session within a tunnel is assigned a unique tunnel_id
-and session_id. These ids are carried in the L2TP header of every
-control and data packet. (Actually, in L2TPv3, the tunnel_id isn't
-present in data frames - it is inferred from the IP connection on
-which the packet was received.) The L2TP driver uses the ids to lookup
-internal tunnel and/or session contexts to determine how to handle the
-packet. Zero tunnel / session ids are treated specially - zero ids are
-never assigned to tunnels or sessions in the network. In the driver,
-the tunnel context keeps a reference to the tunnel UDP or L2TPIP
-socket. The session context holds data that lets the driver interface
-to the kernel's network frame type subsystems, i.e. PPP, ethernet.
-
-Userspace Programming
-=====================
-
-For L2TPv2, there are a number of requirements on the userspace L2TP
-daemon in order to use the pppol2tp driver.
-
-1. Use a UDP socket per tunnel.
-
-2. Create a single PPPoL2TP socket per tunnel bound to a special null
-   session id. This is used only for communicating with the driver but
-   must remain open while the tunnel is active. Opening this tunnel
-   management socket causes the driver to mark the tunnel socket as an
-   L2TP UDP encapsulation socket and flags it for use by the
-   referenced tunnel id. This hooks up the UDP receive path via
-   udp_encap_rcv() in net/ipv4/udp.c. PPP data frames are never passed
-   in this special PPPoX socket.
-
-3. Create a PPPoL2TP socket per L2TP session. This is typically done
-   by starting pppd with the pppol2tp plugin and appropriate
-   arguments. A PPPoL2TP tunnel management socket (Step 2) must be
-   created before the first PPPoL2TP session socket is created.
+The kernel's L2TP subsystem implements the datapath for L2TPv2 and
+L2TPv3. L2TPv2 is carried over UDP. L2TPv3 is carried over UDP or
+directly over IP (protocol 115).
+
+The L2TP RFCs define two basic kinds of L2TP packets: control packets
+(the "control plane"), and data packets (the "data plane"). The kernel
+deals only with data packets. The more complex control packets are
+handled by user space.
+
+An L2TP tunnel carries one or more L2TP sessions. Each tunnel is
+associated with a socket. Each session is associated with a virtual
+netdevice, e.g. ``pppN``, ``l2tpethN``, through which data frames pass
+to/from L2TP. Fields in the L2TP header identify the tunnel or session
+and whether it is a control or data packet. When tunnels and sessions
+are set up using the Linux kernel API, we're just setting up the L2TP
+data path. All aspects of the control protocol are to be handled by
+user space.
+
+This split in responsibilities leads to a natural sequence of
+operations when establishing tunnels and sessions. The procedure looks
+like this:
+
+    1) Create a tunnel socket. Exchange L2TP control protocol messages
+       with the peer over that socket in order to establish a tunnel.
+
+    2) Create a tunnel context in the kernel, using information
+       obtained from the peer using the control protocol messages.
+
+    3) Exchange L2TP control protocol messages with the peer over the
+       tunnel socket in order to establish a session.
+
+    4) Create a session context in the kernel using information
+       obtained from the peer using the control protocol messages.
+
+L2TP APIs
+=========
+
+This section documents each userspace API of the L2TP subsystem.
+
+Tunnel Sockets
+--------------
+
+L2TPv2 always uses UDP. L2TPv3 may use UDP or IP encapsulation.
+
+To create a tunnel socket for use by L2TP, the standard POSIX
+socket API is used.
+
+For example, for a tunnel using IPv4 addresses and UDP encapsulation::
+
+    int sockfd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
+
+Or for a tunnel using IPv6 addresses and IP encapsulation::
+
+    int sockfd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_L2TP);
+
+UDP socket programming doesn't need to be covered here.
+
+IPPROTO_L2TP is an IP protocol type implemented by the kernel's L2TP
+subsystem. The L2TPIP socket address is defined in struct
+sockaddr_l2tpip and struct sockaddr_l2tpip6 at
+`include/uapi/linux/l2tp.h`_. The address includes the L2TP tunnel
+(connection) id. To use L2TP IP encapsulation, an L2TPv3 application
+should bind the L2TPIP socket using the locally assigned
+tunnel id. When the peer's tunnel id and IP address is known, a
+connect must be done.
+
+If the L2TP application needs to handle L2TPv3 tunnel setup requests
+from peers using L2TPIP, it must open a dedicated L2TPIP
+socket to listen for those requests and bind the socket using tunnel
+id 0 since tunnel setup requests are addressed to tunnel id 0.
+
+An L2TP tunnel and all of its sessions are automatically closed when
+its tunnel socket is closed.
+
+Netlink API
+-----------
+
+L2TP applications use netlink to manage L2TP tunnel and session
+instances in the kernel. The L2TP netlink API is defined in
+`include/uapi/linux/l2tp.h`_.
+
+L2TP uses `Generic Netlink`_ (GENL). Several commands are defined:
+Create, Delete, Modify and Get for tunnel and session
+instances, e.g. ``L2TP_CMD_TUNNEL_CREATE``. The API header lists the
+netlink attribute types that can be used with each command.
+
+Tunnel and session instances are identified by a locally unique
+32-bit id.  L2TP tunnel ids are given by ``L2TP_ATTR_CONN_ID`` and
+``L2TP_ATTR_PEER_CONN_ID`` attributes and L2TP session ids are given
+by ``L2TP_ATTR_SESSION_ID`` and ``L2TP_ATTR_PEER_SESSION_ID``
+attributes. If netlink is used to manage L2TPv2 tunnel and session
+instances, the L2TPv2 16-bit tunnel/session id is cast to a 32-bit
+value in these attributes.
+
+In the ``L2TP_CMD_TUNNEL_CREATE`` command, ``L2TP_ATTR_FD`` tells the
+kernel the tunnel socket fd being used. If not specified, the kernel
+creates a kernel socket for the tunnel, using IP parameters set in
+``L2TP_ATTR_IP[6]_SADDR``, ``L2TP_ATTR_IP[6]_DADDR``,
+``L2TP_ATTR_UDP_SPORT``, ``L2TP_ATTR_UDP_DPORT`` attributes. Kernel
+sockets are used to implement unmanaged L2TPv3 tunnels (iproute2's "ip
+l2tp" commands). If ``L2TP_ATTR_FD`` is given, it must be a socket fd
+that is already bound and connected. There is more information about
+unmanaged tunnels later in this document.
+
+``L2TP_CMD_TUNNEL_CREATE`` attributes:-
+
+================== ======== ===
+Attribute          Required Use
+================== ======== ===
+CONN_ID            Y        Sets the tunnel (connection) id.
+PEER_CONN_ID       Y        Sets the peer tunnel (connection) id.
+PROTO_VERSION      Y        Protocol version. 2 or 3.
+ENCAP_TYPE         Y        Encapsulation type: UDP or IP.
+FD                 N        Tunnel socket file descriptor.
+UDP_CSUM           N        Enable IPv4 UDP checksums. Used only if FD is
+                            not set.
+UDP_ZERO_CSUM6_TX  N        Zero IPv6 UDP checksum on transmit. Used only
+                            if FD is not set.
+UDP_ZERO_CSUM6_RX  N        Zero IPv6 UDP checksum on receive. Used only if
+                            FD is not set.
+IP_SADDR           N        IPv4 source address. Used only if FD is not
+                            set.
+IP_DADDR           N        IPv4 destination address. Used only if FD is
+                            not set.
+UDP_SPORT          N        UDP source port. Used only if FD is not set.
+UDP_DPORT          N        UDP destination port. Used only if FD is not
+                            set.
+IP6_SADDR          N        IPv6 source address. Used only if FD is not
+                            set.
+IP6_DADDR          N        IPv6 destination address. Used only if FD is
+                            not set.
+DEBUG              N        Debug flags.
+================== ======== ===
+
+``L2TP_CMD_TUNNEL_DESTROY`` attributes:-
+
+================== ======== ===
+Attribute          Required Use
+================== ======== ===
+CONN_ID            Y        Identifies the tunnel id to be destroyed.
+================== ======== ===
+
+``L2TP_CMD_TUNNEL_MODIFY`` attributes:-
+
+================== ======== ===
+Attribute          Required Use
+================== ======== ===
+CONN_ID            Y        Identifies the tunnel id to be modified.
+DEBUG              N        Debug flags.
+================== ======== ===
+
+``L2TP_CMD_TUNNEL_GET`` attributes:-
+
+================== ======== ===
+Attribute          Required Use
+================== ======== ===
+CONN_ID            N        Identifies the tunnel id to be queried.
+                            Ignored in DUMP requests.
+================== ======== ===
+
+``L2TP_CMD_SESSION_CREATE`` attributes:-
+
+================== ======== ===
+Attribute          Required Use
+================== ======== ===
+CONN_ID            Y        The parent tunnel id.
+SESSION_ID         Y        Sets the session id.
+PEER_SESSION_ID    Y        Sets the parent session id.
+PW_TYPE            Y        Sets the pseudowire type.
+DEBUG              N        Debug flags.
+RECV_SEQ           N        Enable rx data sequence numbers.
+SEND_SEQ           N        Enable tx data sequence numbers.
+LNS_MODE           N        Enable LNS mode (auto-enable data sequence
+                            numbers).
+RECV_TIMEOUT       N        Timeout to wait when reordering received
+                            packets.
+L2SPEC_TYPE        N        Sets layer2-specific-sublayer type (L2TPv3
+                            only).
+COOKIE             N        Sets optional cookie (L2TPv3 only).
+PEER_COOKIE        N        Sets optional peer cookie (L2TPv3 only).
+IFNAME             N        Sets interface name (L2TPv3 only).
+================== ======== ===
+
+For Ethernet session types, this will create an l2tpeth virtual
+interface which can then be configured as required. For PPP session
+types, a PPPoL2TP socket must also be opened and connected, mapping it
+onto the new session. This is covered in "PPPoL2TP Sockets" later.
+
+``L2TP_CMD_SESSION_DESTROY`` attributes:-
+
+================== ======== ===
+Attribute          Required Use
+================== ======== ===
+CONN_ID            Y        Identifies the parent tunnel id of the session
+                            to be destroyed.
+SESSION_ID         Y        Identifies the session id to be destroyed.
+IFNAME             N        Identifies the session by interface name. If
+                            set, this overrides any CONN_ID and SESSION_ID
+                            attributes. Currently supported for L2TPv3
+                            Ethernet sessions only.
+================== ======== ===
+
+``L2TP_CMD_SESSION_MODIFY`` attributes:-
+
+================== ======== ===
+Attribute          Required Use
+================== ======== ===
+CONN_ID            Y        Identifies the parent tunnel id of the session
+                            to be modified.
+SESSION_ID         Y        Identifies the session id to be modified.
+IFNAME             N        Identifies the session by interface name. If
+                            set, this overrides any CONN_ID and SESSION_ID
+                            attributes. Currently supported for L2TPv3
+                            Ethernet sessions only.
+DEBUG              N        Debug flags.
+RECV_SEQ           N        Enable rx data sequence numbers.
+SEND_SEQ           N        Enable tx data sequence numbers.
+LNS_MODE           N        Enable LNS mode (auto-enable data sequence
+                            numbers).
+RECV_TIMEOUT       N        Timeout to wait when reordering received
+                            packets.
+================== ======== ===
+
+``L2TP_CMD_SESSION_GET`` attributes:-
+
+================== ======== ===
+Attribute          Required Use
+================== ======== ===
+CONN_ID            N        Identifies the tunnel id to be queried.
+                            Ignored for DUMP requests.
+SESSION_ID         N        Identifies the session id to be queried.
+                            Ignored for DUMP requests.
+IFNAME             N        Identifies the session by interface name.
+                            If set, this overrides any CONN_ID and
+                            SESSION_ID attributes. Ignored for DUMP
+                            requests. Currently supported for L2TPv3
+                            Ethernet sessions only.
+================== ======== ===
+
+Application developers should refer to `include/uapi/linux/l2tp.h`_ for
+netlink command and attribute definitions.
+
+Sample userspace code using libmnl_:
+
+  - Open L2TP netlink socket::
+
+        struct nl_sock *nl_sock;
+        int l2tp_nl_family_id;
+
+        nl_sock = nl_socket_alloc();
+        genl_connect(nl_sock);
+        genl_id = genl_ctrl_resolve(nl_sock, L2TP_GENL_NAME);
+
+  - Create a tunnel::
+
+        struct nlmsghdr *nlh;
+        struct genlmsghdr *gnlh;
+
+        nlh = mnl_nlmsg_put_header(buf);
+        nlh->nlmsg_type = genl_id; /* assigned to genl socket */
+        nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+        nlh->nlmsg_seq = seq;
+
+        gnlh = mnl_nlmsg_put_extra_header(nlh, sizeof(*gnlh));
+        gnlh->cmd = L2TP_CMD_TUNNEL_CREATE;
+        gnlh->version = L2TP_GENL_VERSION;
+        gnlh->reserved = 0;
+
+        mnl_attr_put_u32(nlh, L2TP_ATTR_FD, tunl_sock_fd);
+        mnl_attr_put_u32(nlh, L2TP_ATTR_CONN_ID, tid);
+        mnl_attr_put_u32(nlh, L2TP_ATTR_PEER_CONN_ID, peer_tid);
+        mnl_attr_put_u8(nlh, L2TP_ATTR_PROTO_VERSION, protocol_version);
+        mnl_attr_put_u16(nlh, L2TP_ATTR_ENCAP_TYPE, encap);
+
+  - Create a session::
+
+        struct nlmsghdr *nlh;
+        struct genlmsghdr *gnlh;
+
+        nlh = mnl_nlmsg_put_header(buf);
+        nlh->nlmsg_type = genl_id; /* assigned to genl socket */
+        nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+        nlh->nlmsg_seq = seq;
+
+        gnlh = mnl_nlmsg_put_extra_header(nlh, sizeof(*gnlh));
+        gnlh->cmd = L2TP_CMD_SESSION_CREATE;
+        gnlh->version = L2TP_GENL_VERSION;
+        gnlh->reserved = 0;
+
+        mnl_attr_put_u32(nlh, L2TP_ATTR_CONN_ID, tid);
+        mnl_attr_put_u32(nlh, L2TP_ATTR_PEER_CONN_ID, peer_tid);
+        mnl_attr_put_u32(nlh, L2TP_ATTR_SESSION_ID, sid);
+        mnl_attr_put_u32(nlh, L2TP_ATTR_PEER_SESSION_ID, peer_sid);
+        mnl_attr_put_u16(nlh, L2TP_ATTR_PW_TYPE, pwtype);
+        /* there are other session options which can be set using netlink
+         * attributes during session creation -- see l2tp.h
+         */
+
+  - Delete a session::
+
+        struct nlmsghdr *nlh;
+        struct genlmsghdr *gnlh;
+
+        nlh = mnl_nlmsg_put_header(buf);
+        nlh->nlmsg_type = genl_id; /* assigned to genl socket */
+        nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+        nlh->nlmsg_seq = seq;
+
+        gnlh = mnl_nlmsg_put_extra_header(nlh, sizeof(*gnlh));
+        gnlh->cmd = L2TP_CMD_SESSION_DELETE;
+        gnlh->version = L2TP_GENL_VERSION;
+        gnlh->reserved = 0;
+
+        mnl_attr_put_u32(nlh, L2TP_ATTR_CONN_ID, tid);
+        mnl_attr_put_u32(nlh, L2TP_ATTR_SESSION_ID, sid);
+
+  - Delete a tunnel and all of its sessions (if any)::
+
+        struct nlmsghdr *nlh;
+        struct genlmsghdr *gnlh;
+
+        nlh = mnl_nlmsg_put_header(buf);
+        nlh->nlmsg_type = genl_id; /* assigned to genl socket */
+        nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+        nlh->nlmsg_seq = seq;
+
+        gnlh = mnl_nlmsg_put_extra_header(nlh, sizeof(*gnlh));
+        gnlh->cmd = L2TP_CMD_TUNNEL_DELETE;
+        gnlh->version = L2TP_GENL_VERSION;
+        gnlh->reserved = 0;
+
+        mnl_attr_put_u32(nlh, L2TP_ATTR_CONN_ID, tid);
+
+PPPoL2TP Session Socket API
+---------------------------
+
+For PPP session types, a PPPoL2TP socket must be opened and connected
+to the L2TP session.
 
 When creating PPPoL2TP sockets, the application provides information
-to the driver about the socket in a socket connect() call. Source and
-destination tunnel and session ids are provided, as well as the file
-descriptor of a UDP socket. See struct pppol2tp_addr in
-include/linux/if_pppol2tp.h. Note that zero tunnel / session ids are
-treated specially. When creating the per-tunnel PPPoL2TP management
-socket in Step 2 above, zero source and destination session ids are
-specified, which tells the driver to prepare the supplied UDP file
-descriptor for use as an L2TP tunnel socket.
+to the kernel about the tunnel and session in a socket connect()
+call. Source and destination tunnel and session ids are provided, as
+well as the file descriptor of a UDP or L2TPIP socket. See struct
+pppol2tp_addr in `include/linux/if_pppol2tp.h`_. For historical reasons,
+there are unfortunately slightly different address structures for
+L2TPv2/L2TPv3 IPv4/IPv6 tunnels and userspace must use the appropriate
+structure that matches the tunnel socket type.
 
 Userspace may control behavior of the tunnel or session using
 setsockopt and ioctl on the PPPoX socket. The following socket
@@ -130,229 +370,417 @@ options are supported:-
 =========   ===========================================================
 DEBUG       bitmask of debug message categories. See below.
 SENDSEQ     - 0 => don't send packets with sequence numbers
-	    - 1 => send packets with sequence numbers
+            - 1 => send packets with sequence numbers
 RECVSEQ     - 0 => receive packet sequence numbers are optional
-	    - 1 => drop receive packets without sequence numbers
+            - 1 => drop receive packets without sequence numbers
 LNSMODE     - 0 => act as LAC.
-	    - 1 => act as LNS.
+            - 1 => act as LNS.
 REORDERTO   reorder timeout (in millisecs). If 0, don't try to reorder.
 =========   ===========================================================
 
-Only the DEBUG option is supported by the special tunnel management
-PPPoX socket.
-
 In addition to the standard PPP ioctls, a PPPIOCGL2TPSTATS is provided
 to retrieve tunnel and session statistics from the kernel using the
 PPPoX socket of the appropriate tunnel or session.
 
-For L2TPv3, userspace must use the netlink API defined in
-include/linux/l2tp.h to manage tunnel and session contexts. The
-general procedure to create a new L2TP tunnel with one session is:-
+Sample userspace code:
 
-1. Open a GENL socket using L2TP_GENL_NAME for configuring the kernel
-   using netlink.
+  - Create session PPPoX data socket::
 
-2. Create a UDP or L2TPIP socket for the tunnel.
+        /* Input: the L2TP tunnel UDP socket `tunnel_fd`, which needs to be
+         * bound already (both sockname and peername), otherwise it will not be
+         * ready.
+         */
 
-3. Create a new L2TP tunnel using a L2TP_CMD_TUNNEL_CREATE
-   request. Set attributes according to desired tunnel parameters,
-   referencing the UDP or L2TPIP socket created in the previous step.
+        struct sockaddr_pppol2tp sax;
+        int session_fd;
+        int ret;
 
-4. Create a new L2TP session in the tunnel using a
-   L2TP_CMD_SESSION_CREATE request.
+        session_fd = socket(AF_PPPOX, SOCK_DGRAM, PX_PROTO_OL2TP);
+        if (session_fd < 0)
+                return -errno;
 
-The tunnel and all of its sessions are closed when the tunnel socket
-is closed. The netlink API may also be used to delete sessions and
-tunnels. Configuration and status info may be set or read using netlink.
+        sax.sa_family = AF_PPPOX;
+        sax.sa_protocol = PX_PROTO_OL2TP;
+        sax.pppol2tp.fd = tunnel_fd;
+        sax.pppol2tp.addr.sin_addr.s_addr = addr->sin_addr.s_addr;
+        sax.pppol2tp.addr.sin_port = addr->sin_port;
+        sax.pppol2tp.addr.sin_family = AF_INET;
+        sax.pppol2tp.s_tunnel  = tunnel_id;
+        sax.pppol2tp.s_session = session_id;
+        sax.pppol2tp.d_tunnel  = peer_tunnel_id;
+        sax.pppol2tp.d_session = peer_session_id;
 
-The L2TP driver also supports static (unmanaged) L2TPv3 tunnels. These
-are where there is no L2TP control message exchange with the peer to
-setup the tunnel; the tunnel is configured manually at each end of the
-tunnel. There is no need for an L2TP userspace application in this
-case -- the tunnel socket is created by the kernel and configured
-using parameters sent in the L2TP_CMD_TUNNEL_CREATE netlink
-request. The "ip" utility of iproute2 has commands for managing static
-L2TPv3 tunnels; do "ip l2tp help" for more information.
+        /* session_fd is the fd of the session's PPPoL2TP socket.
+         * tunnel_fd is the fd of the tunnel UDP / L2TPIP socket.
+         */
+        ret = connect(session_fd, (struct sockaddr *)&sax, sizeof(sax));
+        if (ret < 0 ) {
+                close(session_fd);
+                return -errno;
+        }
 
-Debugging
-=========
+        return session_fd;
 
-The driver supports a flexible debug scheme where kernel trace
-messages may be optionally enabled per tunnel and per session. Care is
-needed when debugging a live system since the messages are not
-rate-limited and a busy system could be swamped. Userspace uses
-setsockopt on the PPPoX socket to set a debug mask.
+L2TP control packets will still be available for read on `tunnel_fd`.
 
-The following debug mask bits are available:
+  - Create PPP channel::
 
-================  ==============================
-L2TP_MSG_DEBUG    verbose debug (if compiled in)
-L2TP_MSG_CONTROL  userspace - kernel interface
-L2TP_MSG_SEQ      sequence numbers handling
-L2TP_MSG_DATA     data packets
-================  ==============================
+        /* Input: the session PPPoX data socket `session_fd` which was created
+         * as described above.
+         */
 
-If enabled, files under a l2tp debugfs directory can be used to dump
-kernel state about L2TP tunnels and sessions. To access it, the
-debugfs filesystem must first be mounted::
+        int ppp_chan_fd;
+        int chindx;
+        int ret;
 
-	# mount -t debugfs debugfs /debug
+        ret = ioctl(session_fd, PPPIOCGCHAN, &chindx);
+        if (ret < 0)
+                return -errno;
 
-Files under the l2tp directory can then be accessed::
+        ppp_chan_fd = open("/dev/ppp", O_RDWR);
+        if (ppp_chan_fd < 0)
+                return -errno;
 
-	# cat /debug/l2tp/tunnels
+        ret = ioctl(ppp_chan_fd, PPPIOCATTCHAN, &chindx);
+        if (ret < 0) {
+                close(ppp_chan_fd);
+                return -errno;
+        }
 
-The debugfs files should not be used by applications to obtain L2TP
-state information because the file format is subject to change. It is
-implemented to provide extra debug information to help diagnose
-problems.) Users should use the netlink API.
+        return ppp_chan_fd;
 
-/proc/net/pppol2tp is also provided for backwards compatibility with
-the original pppol2tp driver. It lists information about L2TPv2
-tunnels and sessions only. Its use is discouraged.
+LCP PPP frames will be available for read on `ppp_chan_fd`.
 
-Unmanaged L2TPv3 Tunnels
-========================
-
-Some commercial L2TP products support unmanaged L2TPv3 ethernet
-tunnels, where there is no L2TP control protocol; tunnels are
-configured at each side manually. New commands are available in
-iproute2's ip utility to support this.
-
-To create an L2TPv3 ethernet pseudowire between local host 192.168.1.1
-and peer 192.168.1.2, using IP addresses 10.5.1.1 and 10.5.1.2 for the
-tunnel endpoints::
-
-	# ip l2tp add tunnel tunnel_id 1 peer_tunnel_id 1 udp_sport 5000 \
-	  udp_dport 5000 encap udp local 192.168.1.1 remote 192.168.1.2
-	# ip l2tp add session tunnel_id 1 session_id 1 peer_session_id 1
-	# ip -s -d show dev l2tpeth0
-	# ip addr add 10.5.1.2/32 peer 10.5.1.1/32 dev l2tpeth0
-	# ip li set dev l2tpeth0 up
-
-Choose IP addresses to be the address of a local IP interface and that
-of the remote system. The IP addresses of the l2tpeth0 interface can be
-anything suitable.
-
-Repeat the above at the peer, with ports, tunnel/session ids and IP
-addresses reversed.  The tunnel and session IDs can be any non-zero
-32-bit number, but the values must be reversed at the peer.
-
-========================       ===================
-Host 1                         Host2
-========================       ===================
-udp_sport=5000                 udp_sport=5001
-udp_dport=5001                 udp_dport=5000
-tunnel_id=42                   tunnel_id=45
-peer_tunnel_id=45              peer_tunnel_id=42
-session_id=128                 session_id=5196755
-peer_session_id=5196755        peer_session_id=128
-========================       ===================
-
-When done at both ends of the tunnel, it should be possible to send
-data over the network. e.g.::
-
-	# ping 10.5.1.1
-
-
-Sample Userspace Code
-=====================
-
-1. Create tunnel management PPPoX socket::
-
-	kernel_fd = socket(AF_PPPOX, SOCK_DGRAM, PX_PROTO_OL2TP);
-	if (kernel_fd >= 0) {
-		struct sockaddr_pppol2tp sax;
-		struct sockaddr_in const *peer_addr;
-
-		peer_addr = l2tp_tunnel_get_peer_addr(tunnel);
-		memset(&sax, 0, sizeof(sax));
-		sax.sa_family = AF_PPPOX;
-		sax.sa_protocol = PX_PROTO_OL2TP;
-		sax.pppol2tp.fd = udp_fd;       /* fd of tunnel UDP socket */
-		sax.pppol2tp.addr.sin_addr.s_addr = peer_addr->sin_addr.s_addr;
-		sax.pppol2tp.addr.sin_port = peer_addr->sin_port;
-		sax.pppol2tp.addr.sin_family = AF_INET;
-		sax.pppol2tp.s_tunnel = tunnel_id;
-		sax.pppol2tp.s_session = 0;     /* special case: mgmt socket */
-		sax.pppol2tp.d_tunnel = 0;
-		sax.pppol2tp.d_session = 0;     /* special case: mgmt socket */
-
-		if(connect(kernel_fd, (struct sockaddr *)&sax, sizeof(sax) ) < 0 ) {
-			perror("connect failed");
-			result = -errno;
-			goto err;
-		}
-	}
-
-2. Create session PPPoX data socket::
-
-	struct sockaddr_pppol2tp sax;
-	int fd;
-
-	/* Note, the target socket must be bound already, else it will not be ready */
-	sax.sa_family = AF_PPPOX;
-	sax.sa_protocol = PX_PROTO_OL2TP;
-	sax.pppol2tp.fd = tunnel_fd;
-	sax.pppol2tp.addr.sin_addr.s_addr = addr->sin_addr.s_addr;
-	sax.pppol2tp.addr.sin_port = addr->sin_port;
-	sax.pppol2tp.addr.sin_family = AF_INET;
-	sax.pppol2tp.s_tunnel  = tunnel_id;
-	sax.pppol2tp.s_session = session_id;
-	sax.pppol2tp.d_tunnel  = peer_tunnel_id;
-	sax.pppol2tp.d_session = peer_session_id;
-
-	/* session_fd is the fd of the session's PPPoL2TP socket.
-	 * tunnel_fd is the fd of the tunnel UDP socket.
-	 */
-	fd = connect(session_fd, (struct sockaddr *)&sax, sizeof(sax));
-	if (fd < 0 )    {
-		return -errno;
-	}
-	return 0;
+  - Create PPP interface::
 
-Internal Implementation
-=======================
+        /* Input: the PPP channel `ppp_chan_fd` which was created as described
+         * above.
+         */
+
+        int ifunit = -1;
+        int ppp_if_fd;
+        int ret;
+
+        ppp_if_fd = open("/dev/ppp", O_RDWR);
+        if (ppp_if_fd < 0)
+                return -errno;
+
+        ret = ioctl(ppp_if_fd, PPPIOCNEWUNIT, &ifunit);
+        if (ret < 0) {
+                close(ppp_if_fd);
+                return -errno;
+        }
+
+        ret = ioctl(ppp_chan_fd, PPPIOCCONNECT, &ifunit);
+        if (ret < 0) {
+                close(ppp_if_fd);
+                return -errno;
+        }
+
+        return ppp_if_fd;
+
+IPCP/IPv6CP PPP frames will be available for read on `ppp_if_fd`.
+
+The ppp<ifunit> interface can then be configured as usual with netlink's
+RTM_NEWLINK, RTM_NEWADDR, RTM_NEWROUTE, or ioctl's SIOCSIFMTU, SIOCSIFADDR,
+SIOCSIFDSTADDR, SIOCSIFNETMASK, SIOCSIFFLAGS, or with the `ip` command.
+
+  - Bridging L2TP sessions which have PPP pseudowire types (this is also called
+    L2TP tunnel switching or L2TP multihop) is supported by bridging the PPP
+    channels of the two L2TP sessions to be bridged::
+
+        /* Input: the session PPPoX data sockets `session_fd1` and `session_fd2`
+         * which were created as described further above.
+         */
+
+        int ppp_chan_fd;
+        int chindx1;
+        int chindx2;
+        int ret;
+
+        ret = ioctl(session_fd1, PPPIOCGCHAN, &chindx1);
+        if (ret < 0)
+                return -errno;
+
+        ret = ioctl(session_fd2, PPPIOCGCHAN, &chindx2);
+        if (ret < 0)
+                return -errno;
+
+        ppp_chan_fd = open("/dev/ppp", O_RDWR);
+        if (ppp_chan_fd < 0)
+                return -errno;
 
-The driver keeps a struct l2tp_tunnel context per L2TP tunnel and a
-struct l2tp_session context for each session. The l2tp_tunnel is
-always associated with a UDP or L2TP/IP socket and keeps a list of
-sessions in the tunnel. The l2tp_session context keeps kernel state
-about the session. It has private data which is used for data specific
-to the session type. With L2TPv2, the session always carried PPP
-traffic. With L2TPv3, the session can also carry ethernet frames
-(ethernet pseudowire) or other data types such as ATM, HDLC or Frame
-Relay.
+        ret = ioctl(ppp_chan_fd, PPPIOCATTCHAN, &chindx1);
+        if (ret < 0) {
+                close(ppp_chan_fd);
+                return -errno;
+        }
 
-When a tunnel is first opened, the reference count on the socket is
-increased using sock_hold(). This ensures that the kernel socket
-cannot be removed while L2TP's data structures reference it.
+        ret = ioctl(ppp_chan_fd, PPPIOCBRIDGECHAN, &chindx2);
+        close(ppp_chan_fd);
+        if (ret < 0)
+                return -errno;
+
+        return 0;
+
+It can be noted that when bridging PPP channels, the PPP session is not locally
+terminated, and no local PPP interface is created.  PPP frames arriving on one
+channel are directly passed to the other channel, and vice versa.
+
+The PPP channel does not need to be kept open.  Only the session PPPoX data
+sockets need to be kept open.
+
+More generally, it is also possible in the same way to e.g. bridge a PPPoL2TP
+PPP channel with other types of PPP channels, such as PPPoE.
+
+See more details for the PPP side in ppp_generic.rst.
+
+Old L2TPv2-only API
+-------------------
+
+When L2TP was first added to the Linux kernel in 2.6.23, it
+implemented only L2TPv2 and did not include a netlink API. Instead,
+tunnel and session instances in the kernel were managed directly using
+only PPPoL2TP sockets. The PPPoL2TP socket is used as described in
+section "PPPoL2TP Session Socket API" but tunnel and session instances
+are automatically created on a connect() of the socket instead of
+being created by a separate netlink request:
+
+    - Tunnels are managed using a tunnel management socket which is a
+      dedicated PPPoL2TP socket, connected to (invalid) session
+      id 0. The L2TP tunnel instance is created when the PPPoL2TP
+      tunnel management socket is connected and is destroyed when the
+      socket is closed.
+
+    - Session instances are created in the kernel when a PPPoL2TP
+      socket is connected to a non-zero session id. Session parameters
+      are set using setsockopt. The L2TP session instance is destroyed
+      when the socket is closed.
+
+This API is still supported but its use is discouraged. Instead, new
+L2TPv2 applications should use netlink to first create the tunnel and
+session, then create a PPPoL2TP socket for the session.
+
+Unmanaged L2TPv3 tunnels
+------------------------
+
+The kernel L2TP subsystem also supports static (unmanaged) L2TPv3
+tunnels. Unmanaged tunnels have no userspace tunnel socket, and
+exchange no control messages with the peer to set up the tunnel; the
+tunnel is configured manually at each end of the tunnel. All
+configuration is done using netlink. There is no need for an L2TP
+userspace application in this case -- the tunnel socket is created by
+the kernel and configured using parameters sent in the
+``L2TP_CMD_TUNNEL_CREATE`` netlink request. The ``ip`` utility of
+``iproute2`` has commands for managing static L2TPv3 tunnels; do ``ip
+l2tp help`` for more information.
+
+Debugging
+---------
 
-Some L2TP sessions also have a socket (PPP pseudowires) while others
-do not (ethernet pseudowires). We can't use the socket reference count
-as the reference count for session contexts. The L2TP implementation
-therefore has its own internal reference counts on the session
-contexts.
+The L2TP subsystem offers a range of debugging interfaces through the
+debugfs filesystem.
 
-To Do
-=====
+To access these interfaces, the debugfs filesystem must first be mounted::
 
-Add L2TP tunnel switching support. This would route tunneled traffic
-from one L2TP tunnel into another. Specified in
-http://tools.ietf.org/html/draft-ietf-l2tpext-tunnel-switching-08
+    # mount -t debugfs debugfs /debug
 
-Add L2TPv3 VLAN pseudowire support.
+Files under the l2tp directory can then be accessed, providing a summary
+of the current population of tunnel and session contexts existing in the
+kernel::
 
-Add L2TPv3 IP pseudowire support.
+    # cat /debug/l2tp/tunnels
+
+The debugfs files should not be used by applications to obtain L2TP
+state information because the file format is subject to change. It is
+implemented to provide extra debug information to help diagnose
+problems. Applications should instead use the netlink API.
+
+In addition the L2TP subsystem implements tracepoints using the standard
+kernel event tracing API.  The available L2TP events can be reviewed as
+follows::
+
+    # find /debug/tracing/events/l2tp
+
+Finally, /proc/net/pppol2tp is also provided for backwards compatibility
+with the original pppol2tp code. It lists information about L2TPv2
+tunnels and sessions only. Its use is discouraged.
+
+Internal Implementation
+=======================
 
-Add L2TPv3 ATM pseudowire support.
+This section is for kernel developers and maintainers.
+
+Sockets
+-------
+
+UDP sockets are implemented by the networking core. When an L2TP
+tunnel is created using a UDP socket, the socket is set up as an
+encapsulated UDP socket by setting encap_rcv and encap_destroy
+callbacks on the UDP socket. l2tp_udp_encap_recv is called when
+packets are received on the socket. l2tp_udp_encap_destroy is called
+when userspace closes the socket.
+
+L2TPIP sockets are implemented in `net/l2tp/l2tp_ip.c`_ and
+`net/l2tp/l2tp_ip6.c`_.
+
+Tunnels
+-------
+
+The kernel keeps a struct l2tp_tunnel context per L2TP tunnel. The
+l2tp_tunnel is always associated with a UDP or L2TP/IP socket and
+keeps a list of sessions in the tunnel. When a tunnel is first
+registered with L2TP core, the reference count on the socket is
+increased. This ensures that the socket cannot be removed while L2TP's
+data structures reference it.
+
+Tunnels are identified by a unique tunnel id. The id is 16-bit for
+L2TPv2 and 32-bit for L2TPv3. Internally, the id is stored as a 32-bit
+value.
+
+Tunnels are kept in a per-net list, indexed by tunnel id. The
+tunnel id namespace is shared by L2TPv2 and L2TPv3.
+
+Handling tunnel socket close is perhaps the most tricky part of the
+L2TP implementation. If userspace closes a tunnel socket, the L2TP
+tunnel and all of its sessions must be closed and destroyed. Since the
+tunnel context holds a ref on the tunnel socket, the socket's
+sk_destruct won't be called until the tunnel sock_put's its
+socket. For UDP sockets, when userspace closes the tunnel socket, the
+socket's encap_destroy handler is invoked, which L2TP uses to initiate
+its tunnel close actions. For L2TPIP sockets, the socket's close
+handler initiates the same tunnel close actions. All sessions are
+first closed. Each session drops its tunnel ref. When the tunnel ref
+reaches zero, the tunnel drops its socket ref.
+
+Sessions
+--------
+
+The kernel keeps a struct l2tp_session context for each session.  Each
+session has private data which is used for data specific to the
+session type. With L2TPv2, the session always carries PPP
+traffic. With L2TPv3, the session can carry Ethernet frames (Ethernet
+pseudowire) or other data types such as PPP, ATM, HDLC or Frame
+Relay. Linux currently implements only Ethernet and PPP session types.
+
+Some L2TP session types also have a socket (PPP pseudowires) while
+others do not (Ethernet pseudowires).
+
+Like tunnels, L2TP sessions are identified by a unique
+session id. Just as with tunnel ids, the session id is 16-bit for
+L2TPv2 and 32-bit for L2TPv3. Internally, the id is stored as a 32-bit
+value.
+
+Sessions hold a ref on their parent tunnel to ensure that the tunnel
+stays extant while one or more sessions references it.
+
+Sessions are kept in a per-net list. L2TPv2 sessions and L2TPv3
+sessions are stored in separate lists. L2TPv2 sessions are keyed
+by a 32-bit key made up of the 16-bit tunnel ID and 16-bit
+session ID. L2TPv3 sessions are keyed by the 32-bit session ID, since
+L2TPv3 session ids are unique across all tunnels.
+
+Although the L2TPv3 RFC specifies that L2TPv3 session ids are not
+scoped by the tunnel, the Linux implementation has historically
+allowed this. Such session id collisions are supported using a per-net
+hash table keyed by sk and session ID. When looking up L2TPv3
+sessions, the list entry may link to multiple sessions with that
+session ID, in which case the session matching the given sk (tunnel)
+is used.
+
+PPP
+---
+
+`net/l2tp/l2tp_ppp.c`_ implements the PPPoL2TP socket family. Each PPP
+session has a PPPoL2TP socket.
+
+The PPPoL2TP socket's sk_user_data references the l2tp_session.
+
+Userspace sends and receives PPP packets over L2TP using a PPPoL2TP
+socket. Only PPP control frames pass over this socket: PPP data
+packets are handled entirely by the kernel, passing between the L2TP
+session and its associated ``pppN`` netdev through the PPP channel
+interface of the kernel PPP subsystem.
+
+The L2TP PPP implementation handles the closing of a PPPoL2TP socket
+by closing its corresponding L2TP session. This is complicated because
+it must consider racing with netlink session create/destroy requests
+and pppol2tp_connect trying to reconnect with a session that is in the
+process of being closed. PPP sessions hold a ref on their associated
+socket in order that the socket remains extants while the session
+references it.
+
+Ethernet
+--------
+
+`net/l2tp/l2tp_eth.c`_ implements L2TPv3 Ethernet pseudowires. It
+manages a netdev for each session.
+
+L2TP Ethernet sessions are created and destroyed by netlink request,
+or are destroyed when the tunnel is destroyed. Unlike PPP sessions,
+Ethernet sessions do not have an associated socket.
 
 Miscellaneous
 =============
 
-The L2TP drivers were developed as part of the OpenL2TP project by
-Katalix Systems Ltd. OpenL2TP is a full-featured L2TP client / server,
-designed from the ground up to have the L2TP datapath in the
-kernel. The project also implemented the pppol2tp plugin for pppd
-which allows pppd to use the kernel driver. Details can be found at
-http://www.openl2tp.org.
+RFCs
+----
+
+The kernel code implements the datapath features specified in the
+following RFCs:
+
+======= =============== ===================================
+RFC2661 L2TPv2          https://tools.ietf.org/html/rfc2661
+RFC3931 L2TPv3          https://tools.ietf.org/html/rfc3931
+RFC4719 L2TPv3 Ethernet https://tools.ietf.org/html/rfc4719
+======= =============== ===================================
+
+Implementations
+---------------
+
+A number of open source applications use the L2TP kernel subsystem:
+
+============ ==============================================
+iproute2     https://github.com/shemminger/iproute2
+go-l2tp      https://github.com/katalix/go-l2tp
+tunneldigger https://github.com/wlanslovenija/tunneldigger
+xl2tpd       https://github.com/xelerance/xl2tpd
+============ ==============================================
+
+Limitations
+-----------
+
+The current implementation has a number of limitations:
+
+  1) Interfacing with openvswitch is not yet implemented. It may be
+     useful to map OVS Ethernet and VLAN ports into L2TPv3 tunnels.
+
+  2) VLAN pseudowires are implemented using an ``l2tpethN`` interface
+     configured with a VLAN sub-interface. Since L2TPv3 VLAN
+     pseudowires carry one and only one VLAN, it may be better to use
+     a single netdevice rather than an ``l2tpethN`` and ``l2tpethN``:M
+     pair per VLAN session. The netlink attribute
+     ``L2TP_ATTR_VLAN_ID`` was added for this, but it was never
+     implemented.
+
+Testing
+-------
+
+Unmanaged L2TPv3 Ethernet features are tested by the kernel's built-in
+selftests. See `tools/testing/selftests/net/l2tp.sh`_.
+
+Another test suite, l2tp-ktest_, covers all
+of the L2TP APIs and tunnel/session types. This may be integrated into
+the kernel's built-in L2TP selftests in the future.
+
+.. Links
+.. _Generic Netlink: generic_netlink.html
+.. _libmnl: https://www.netfilter.org/projects/libmnl
+.. _include/uapi/linux/l2tp.h: ../../../include/uapi/linux/l2tp.h
+.. _include/linux/if_pppol2tp.h: ../../../include/linux/if_pppol2tp.h
+.. _net/l2tp/l2tp_ip.c: ../../../net/l2tp/l2tp_ip.c
+.. _net/l2tp/l2tp_ip6.c: ../../../net/l2tp/l2tp_ip6.c
+.. _net/l2tp/l2tp_ppp.c: ../../../net/l2tp/l2tp_ppp.c
+.. _net/l2tp/l2tp_eth.c: ../../../net/l2tp/l2tp_eth.c
+.. _tools/testing/selftests/net/l2tp.sh: ../../../tools/testing/selftests/net/l2tp.sh
+.. _l2tp-ktest: https://github.com/katalix/l2tp-ktest
diff --git a/Documentation/networking/ltpc.rst b/Documentation/networking/ltpc.rst
deleted file mode 100644
index 0ad197fd17ce..000000000000
--- a/Documentation/networking/ltpc.rst
+++ /dev/null
@@ -1,144 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===========
-LTPC Driver
-===========
-
-This is the ALPHA version of the ltpc driver.
-
-In order to use it, you will need at least version 1.3.3 of the
-netatalk package, and the Apple or Farallon LocalTalk PC card.
-There are a number of different LocalTalk cards for the PC; this
-driver applies only to the one with the 65c02 processor chip on it.
-
-To include it in the kernel, select the CONFIG_LTPC switch in the
-configuration dialog.  You can also compile it as a module.
-
-While the driver will attempt to autoprobe the I/O port address, IRQ
-line, and DMA channel of the card, this does not always work.  For
-this reason, you should be prepared to supply these parameters
-yourself.  (see "Card Configuration" below for how to determine or
-change the settings on your card)
-
-When the driver is compiled into the kernel, you can add a line such
-as the following to your /etc/lilo.conf::
-
- append="ltpc=0x240,9,1"
-
-where the parameters (in order) are the port address, IRQ, and DMA
-channel.  The second and third values can be omitted, in which case
-the driver will try to determine them itself.
-
-If you load the driver as a module, you can pass the parameters "io=",
-"irq=", and "dma=" on the command line with insmod or modprobe, or add
-them as options in a configuration file in /etc/modprobe.d/ directory::
-
- alias lt0 ltpc # autoload the module when the interface is configured
- options ltpc io=0x240 irq=9 dma=1
-
-Before starting up the netatalk demons (perhaps in rc.local), you
-need to add a line such as::
-
- /sbin/ifconfig lt0 127.0.0.42
-
-The address is unimportant - however, the card needs to be configured
-with ifconfig so that Netatalk can find it.
-
-The appropriate netatalk configuration depends on whether you are
-attached to a network that includes AppleTalk routers or not.  If,
-like me, you are simply connecting to your home Macintoshes and
-printers, you need to set up netatalk to "seed".  The way I do this
-is to have the lines::
-
- dummy -seed -phase 2 -net 2000 -addr 2000.26 -zone "1033"
- lt0 -seed -phase 1 -net 1033 -addr 1033.27 -zone "1033"
-
-in my atalkd.conf.  What is going on here is that I need to fool
-netatalk into thinking that there are two AppleTalk interfaces
-present; otherwise, it refuses to seed.  This is a hack, and a more
-permanent solution would be to alter the netatalk code.  Also, make
-sure you have the correct name for the dummy interface - If it's
-compiled as a module, you will need to refer to it as "dummy0" or some
-such.
-
-If you are attached to an extended AppleTalk network, with routers on
-it, then you don't need to fool around with this -- the appropriate
-line in atalkd.conf is::
-
- lt0 -phase 1
-
-
-Card Configuration
-==================
-
-The interrupts and so forth are configured via the dipswitch on the
-board.  Set the switches so as not to conflict with other hardware.
-
-       Interrupts -- set at most one.  If none are set, the driver uses
-       polled mode.  Because the card was developed in the XT era, the
-       original documentation refers to IRQ2.  Since you'll be running
-       this on an AT (or later) class machine, that really means IRQ9.
-
-       ===     ===========================================================
-       SW1     IRQ 4
-       SW2     IRQ 3
-       SW3     IRQ 9 (2 in original card documentation only applies to XT)
-       ===     ===========================================================
-
-
-       DMA -- choose DMA 1 or 3, and set both corresponding switches.
-
-       ===     =====
-       SW4     DMA 3
-       SW5     DMA 1
-       SW6     DMA 3
-       SW7     DMA 1
-       ===     =====
-
-
-       I/O address -- choose one.
-
-       ===     =========
-       SW8     220 / 240
-       ===     =========
-
-
-IP
-==
-
-Yes, it is possible to do IP over LocalTalk.  However, you can't just
-treat the LocalTalk device like an ordinary Ethernet device, even if
-that's what it looks like to Netatalk.
-
-Instead, you follow the same procedure as for doing IP in EtherTalk.
-See Documentation/networking/ipddp.rst for more information about the
-kernel driver and userspace tools needed.
-
-
-Bugs
-====
-
-IRQ autoprobing often doesn't work on a cold boot.  To get around
-this, either compile the driver as a module, or pass the parameters
-for the card to the kernel as described above.
-
-Also, as usual, autoprobing is not recommended when you use the driver
-as a module. (though it usually works at boot time, at least)
-
-Polled mode is *really* slow sometimes, but this seems to depend on
-the configuration of the network.
-
-It may theoretically be possible to use two LTPC cards in the same
-machine, but this is unsupported, so if you really want to do this,
-you'll probably have to hack the initialization code a bit.
-
-
-Thanks
-======
-
-Thanks to Alan Cox for helpful discussions early on in this
-work, and to Denis Hainsworth for doing the bleeding-edge testing.
-
-Bradford Johnson <bradford@math.umn.edu>
-
-Updated 11/09/1998 by David Huggins-Daines <dhd@debian.org>
diff --git a/Documentation/networking/mctp.rst b/Documentation/networking/mctp.rst
new file mode 100644
index 000000000000..c628cb5406d2
--- /dev/null
+++ b/Documentation/networking/mctp.rst
@@ -0,0 +1,320 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================================
+Management Component Transport Protocol (MCTP)
+==============================================
+
+net/mctp/ contains protocol support for MCTP, as defined by DMTF standard
+DSP0236. Physical interface drivers ("bindings" in the specification) are
+provided in drivers/net/mctp/.
+
+The core code provides a socket-based interface to send and receive MCTP
+messages, through an AF_MCTP, SOCK_DGRAM socket.
+
+Structure: interfaces & networks
+================================
+
+The kernel models the local MCTP topology through two items: interfaces and
+networks.
+
+An interface (or "link") is an instance of an MCTP physical transport binding
+(as defined by DSP0236, section 3.2.47), likely connected to a specific hardware
+device. This is represented as a ``struct netdevice``.
+
+A network defines a unique address space for MCTP endpoints by endpoint-ID
+(described by DSP0236, section 3.2.31). A network has a user-visible identifier
+to allow references from userspace. Route definitions are specific to one
+network.
+
+Interfaces are associated with one network. A network may be associated with one
+or more interfaces.
+
+If multiple networks are present, each may contain endpoint IDs (EIDs) that are
+also present on other networks.
+
+Sockets API
+===========
+
+Protocol definitions
+--------------------
+
+MCTP uses ``AF_MCTP`` / ``PF_MCTP`` for the address- and protocol- families.
+Since MCTP is message-based, only ``SOCK_DGRAM`` sockets are supported.
+
+.. code-block:: C
+
+    int sd = socket(AF_MCTP, SOCK_DGRAM, 0);
+
+The only (current) value for the ``protocol`` argument is 0.
+
+As with all socket address families, source and destination addresses are
+specified with a ``sockaddr`` type, with a single-byte endpoint address:
+
+.. code-block:: C
+
+    typedef __u8		mctp_eid_t;
+
+    struct mctp_addr {
+            mctp_eid_t		s_addr;
+    };
+
+    struct sockaddr_mctp {
+            __kernel_sa_family_t smctp_family;
+            unsigned int         smctp_network;
+            struct mctp_addr     smctp_addr;
+            __u8                 smctp_type;
+            __u8                 smctp_tag;
+    };
+
+    #define MCTP_NET_ANY	0x0
+    #define MCTP_ADDR_ANY	0xff
+
+
+Syscall behaviour
+-----------------
+
+The following sections describe the MCTP-specific behaviours of the standard
+socket system calls. These behaviours have been chosen to map closely to the
+existing sockets APIs.
+
+``bind()`` : set local socket address
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Sockets that receive incoming request packets will bind to a local address,
+using the ``bind()`` syscall.
+
+.. code-block:: C
+
+    struct sockaddr_mctp addr;
+
+    addr.smctp_family = AF_MCTP;
+    addr.smctp_network = MCTP_NET_ANY;
+    addr.smctp_addr.s_addr = MCTP_ADDR_ANY;
+    addr.smctp_type = MCTP_TYPE_PLDM;
+    addr.smctp_tag = MCTP_TAG_OWNER;
+
+    int rc = bind(sd, (struct sockaddr *)&addr, sizeof(addr));
+
+This establishes the local address of the socket. Incoming MCTP messages that
+match the network, address, and message type will be received by this socket.
+The reference to 'incoming' is important here; a bound socket will only receive
+messages with the TO bit set, to indicate an incoming request message, rather
+than a response.
+
+The ``smctp_tag`` value will configure the tags accepted from the remote side of
+this socket. Given the above, the only valid value is ``MCTP_TAG_OWNER``, which
+will result in remotely "owned" tags being routed to this socket. Since
+``MCTP_TAG_OWNER`` is set, the 3 least-significant bits of ``smctp_tag`` are not
+used; callers must set them to zero.
+
+A ``smctp_network`` value of ``MCTP_NET_ANY`` will configure the socket to
+receive incoming packets from any locally-connected network. A specific network
+value will cause the socket to only receive incoming messages from that network.
+
+The ``smctp_addr`` field specifies a local address to bind to. A value of
+``MCTP_ADDR_ANY`` configures the socket to receive messages addressed to any
+local destination EID.
+
+The ``smctp_type`` field specifies which message types to receive. Only the
+lower 7 bits of the type is matched on incoming messages (ie., the
+most-significant IC bit is not part of the match). This results in the socket
+receiving packets with and without a message integrity check footer.
+
+``sendto()``, ``sendmsg()``, ``send()`` : transmit an MCTP message
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+An MCTP message is transmitted using one of the ``sendto()``, ``sendmsg()`` or
+``send()`` syscalls. Using ``sendto()`` as the primary example:
+
+.. code-block:: C
+
+    struct sockaddr_mctp addr;
+    char buf[14];
+    ssize_t len;
+
+    /* set message destination */
+    addr.smctp_family = AF_MCTP;
+    addr.smctp_network = 0;
+    addr.smctp_addr.s_addr = 8;
+    addr.smctp_tag = MCTP_TAG_OWNER;
+    addr.smctp_type = MCTP_TYPE_ECHO;
+
+    /* arbitrary message to send, with message-type header */
+    buf[0] = MCTP_TYPE_ECHO;
+    memcpy(buf + 1, "hello, world!", sizeof(buf) - 1);
+
+    len = sendto(sd, buf, sizeof(buf), 0,
+                    (struct sockaddr_mctp *)&addr, sizeof(addr));
+
+The network and address fields of ``addr`` define the remote address to send to.
+If ``smctp_tag`` has the ``MCTP_TAG_OWNER``, the kernel will ignore any bits set
+in ``MCTP_TAG_VALUE``, and generate a tag value suitable for the destination
+EID. If ``MCTP_TAG_OWNER`` is not set, the message will be sent with the tag
+value as specified. If a tag value cannot be allocated, the system call will
+report an errno of ``EAGAIN``.
+
+The application must provide the message type byte as the first byte of the
+message buffer passed to ``sendto()``. If a message integrity check is to be
+included in the transmitted message, it must also be provided in the message
+buffer, and the most-significant bit of the message type byte must be 1.
+
+The ``sendmsg()`` system call allows a more compact argument interface, and the
+message buffer to be specified as a scatter-gather list. At present no ancillary
+message types (used for the ``msg_control`` data passed to ``sendmsg()``) are
+defined.
+
+Transmitting a message on an unconnected socket with ``MCTP_TAG_OWNER``
+specified will cause an allocation of a tag, if no valid tag is already
+allocated for that destination. The (destination-eid,tag) tuple acts as an
+implicit local socket address, to allow the socket to receive responses to this
+outgoing message. If any previous allocation has been performed (to for a
+different remote EID), that allocation is lost.
+
+Sockets will only receive responses to requests they have sent (with TO=1) and
+may only respond (with TO=0) to requests they have received.
+
+``recvfrom()``, ``recvmsg()``, ``recv()`` : receive an MCTP message
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+An MCTP message can be received by an application using one of the
+``recvfrom()``, ``recvmsg()``, or ``recv()`` system calls. Using ``recvfrom()``
+as the primary example:
+
+.. code-block:: C
+
+    struct sockaddr_mctp addr;
+    socklen_t addrlen;
+    char buf[14];
+    ssize_t len;
+
+    addrlen = sizeof(addr);
+
+    len = recvfrom(sd, buf, sizeof(buf), 0,
+                    (struct sockaddr_mctp *)&addr, &addrlen);
+
+    /* We can expect addr to describe an MCTP address */
+    assert(addrlen >= sizeof(buf));
+    assert(addr.smctp_family == AF_MCTP);
+
+    printf("received %zd bytes from remote EID %d\n", rc, addr.smctp_addr);
+
+The address argument to ``recvfrom`` and ``recvmsg`` is populated with the
+remote address of the incoming message, including tag value (this will be needed
+in order to reply to the message).
+
+The first byte of the message buffer will contain the message type byte. If an
+integrity check follows the message, it will be included in the received buffer.
+
+The ``recv()`` system call behaves in a similar way, but does not provide a
+remote address to the application. Therefore, these are only useful if the
+remote address is already known, or the message does not require a reply.
+
+Like the send calls, sockets will only receive responses to requests they have
+sent (TO=1) and may only respond (TO=0) to requests they have received.
+
+``ioctl(SIOCMCTPALLOCTAG)`` and ``ioctl(SIOCMCTPDROPTAG)``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+These tags give applications more control over MCTP message tags, by allocating
+(and dropping) tag values explicitly, rather than the kernel automatically
+allocating a per-message tag at ``sendmsg()`` time.
+
+In general, you will only need to use these ioctls if your MCTP protocol does
+not fit the usual request/response model. For example, if you need to persist
+tags across multiple requests, or a request may generate more than one response.
+In these cases, the ioctls allow you to decouple the tag allocation (and
+release) from individual message send and receive operations.
+
+Both ioctls are passed a pointer to a ``struct mctp_ioc_tag_ctl``:
+
+.. code-block:: C
+
+    struct mctp_ioc_tag_ctl {
+        mctp_eid_t      peer_addr;
+        __u8		tag;
+        __u16   	flags;
+    };
+
+``SIOCMCTPALLOCTAG`` allocates a tag for a specific peer, which an application
+can use in future ``sendmsg()`` calls. The application populates the
+``peer_addr`` member with the remote EID. Other fields must be zero.
+
+On return, the ``tag`` member will be populated with the allocated tag value.
+The allocated tag will have the following tag bits set:
+
+ - ``MCTP_TAG_OWNER``: it only makes sense to allocate tags if you're the tag
+   owner
+
+ - ``MCTP_TAG_PREALLOC``: to indicate to ``sendmsg()`` that this is a
+   preallocated tag.
+
+ - ... and the actual tag value, within the least-significant three bits
+   (``MCTP_TAG_MASK``). Note that zero is a valid tag value.
+
+The tag value should be used as-is for the ``smctp_tag`` member of ``struct
+sockaddr_mctp``.
+
+``SIOCMCTPDROPTAG`` releases a tag that has been previously allocated by a
+``SIOCMCTPALLOCTAG`` ioctl. The ``peer_addr`` must be the same as used for the
+allocation, and the ``tag`` value must match exactly the tag returned from the
+allocation (including the ``MCTP_TAG_OWNER`` and ``MCTP_TAG_PREALLOC`` bits).
+The ``flags`` field must be zero.
+
+Kernel internals
+================
+
+There are a few possible packet flows in the MCTP stack:
+
+1. local TX to remote endpoint, message <= MTU::
+
+	sendmsg()
+	 -> mctp_local_output()
+	    : route lookup
+	    -> rt->output() (== mctp_route_output)
+	       -> dev_queue_xmit()
+
+2. local TX to remote endpoint, message > MTU::
+
+	sendmsg()
+	-> mctp_local_output()
+	    -> mctp_do_fragment_route()
+	       : creates packet-sized skbs. For each new skb:
+	       -> rt->output() (== mctp_route_output)
+	          -> dev_queue_xmit()
+
+3. remote TX to local endpoint, single-packet message::
+
+	mctp_pkttype_receive()
+	: route lookup
+	-> rt->output() (== mctp_route_input)
+	   : sk_key lookup
+	   -> sock_queue_rcv_skb()
+
+4. remote TX to local endpoint, multiple-packet message::
+
+	mctp_pkttype_receive()
+	: route lookup
+	-> rt->output() (== mctp_route_input)
+	   : sk_key lookup
+	   : stores skb in struct sk_key->reasm_head
+
+	mctp_pkttype_receive()
+	: route lookup
+	-> rt->output() (== mctp_route_input)
+	   : sk_key lookup
+	   : finds existing reassembly in sk_key->reasm_head
+	   : appends new fragment
+	   -> sock_queue_rcv_skb()
+
+Key refcounts
+-------------
+
+ * keys are refed by:
+
+   - a skb: during route output, stored in ``skb->cb``.
+
+   - netns and sock lists.
+
+ * keys can be associated with a device, in which case they hold a
+   reference to the dev (set through ``key->dev``, counted through
+   ``dev->key_count``). Multiple keys can reference the device.
diff --git a/Documentation/networking/mptcp-sysctl.rst b/Documentation/networking/mptcp-sysctl.rst
new file mode 100644
index 000000000000..5bfab01eff5a
--- /dev/null
+++ b/Documentation/networking/mptcp-sysctl.rst
@@ -0,0 +1,149 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+MPTCP Sysfs variables
+=====================
+
+/proc/sys/net/mptcp/* Variables
+===============================
+
+add_addr_timeout - INTEGER (seconds)
+	Set the timeout after which an ADD_ADDR control message will be
+	resent to an MPTCP peer that has not acknowledged a previous
+	ADD_ADDR message.
+
+	The default value matches TCP_RTO_MAX. This is a per-namespace
+	sysctl.
+
+	Default: 120
+
+allow_join_initial_addr_port - BOOLEAN
+	Allow peers to send join requests to the IP address and port number used
+	by the initial subflow if the value is 1. This controls a flag that is
+	sent to the peer at connection time, and whether such join requests are
+	accepted or denied.
+
+	Joins to addresses advertised with ADD_ADDR are not affected by this
+	value.
+
+	This is a per-namespace sysctl.
+
+	Default: 1
+
+available_path_managers - STRING
+	Shows the available path managers choices that are registered. More
+	path managers may be available, but not loaded.
+
+available_schedulers - STRING
+	Shows the available schedulers choices that are registered. More packet
+	schedulers may be available, but not loaded.
+
+blackhole_timeout - INTEGER (seconds)
+	Initial time period in second to disable MPTCP on active MPTCP sockets
+	when a MPTCP firewall blackhole issue happens. This time period will
+	grow exponentially when more blackhole issues get detected right after
+	MPTCP is re-enabled and will reset to the initial value when the
+	blackhole issue goes away.
+
+	0 to disable the blackhole detection. This is a per-namespace sysctl.
+
+	Default: 3600
+
+checksum_enabled - BOOLEAN
+	Control whether DSS checksum can be enabled.
+
+	DSS checksum can be enabled if the value is nonzero. This is a
+	per-namespace sysctl.
+
+	Default: 0
+
+close_timeout - INTEGER (seconds)
+	Set the make-after-break timeout: in absence of any close or
+	shutdown syscall, MPTCP sockets will maintain the status
+	unchanged for such time, after the last subflow removal, before
+	moving to TCP_CLOSE.
+
+	The default value matches TCP_TIMEWAIT_LEN. This is a per-namespace
+	sysctl.
+
+	Default: 60
+
+enabled - BOOLEAN
+	Control whether MPTCP sockets can be created.
+
+	MPTCP sockets can be created if the value is 1. This is a
+	per-namespace sysctl.
+
+	Default: 1 (enabled)
+
+path_manager - STRING
+	Set the default path manager name to use for each new MPTCP
+	socket. In-kernel path management will control subflow
+	connections and address advertisements according to
+	per-namespace values configured over the MPTCP netlink
+	API. Userspace path management puts per-MPTCP-connection subflow
+	connection decisions and address advertisements under control of
+	a privileged userspace program, at the cost of more netlink
+	traffic to propagate all of the related events and commands.
+
+	This is a per-namespace sysctl.
+
+	* "kernel"          - In-kernel path manager
+	* "userspace"       - Userspace path manager
+
+	Default: "kernel"
+
+pm_type - INTEGER
+	Set the default path manager type to use for each new MPTCP
+	socket. In-kernel path management will control subflow
+	connections and address advertisements according to
+	per-namespace values configured over the MPTCP netlink
+	API. Userspace path management puts per-MPTCP-connection subflow
+	connection decisions and address advertisements under control of
+	a privileged userspace program, at the cost of more netlink
+	traffic to propagate all of the related events and commands.
+
+	This is a per-namespace sysctl.
+
+	Deprecated since v6.15, use path_manager instead.
+
+	* 0 - In-kernel path manager
+	* 1 - Userspace path manager
+
+	Default: 0
+
+scheduler - STRING
+	Select the scheduler of your choice.
+
+	Support for selection of different schedulers. This is a per-namespace
+	sysctl.
+
+	Default: "default"
+
+stale_loss_cnt - INTEGER
+	The number of MPTCP-level retransmission intervals with no traffic and
+	pending outstanding data on a given subflow required to declare it stale.
+	The packet scheduler ignores stale subflows.
+	A low stale_loss_cnt  value allows for fast active-backup switch-over,
+	an high value maximize links utilization on edge scenarios e.g. lossy
+	link with high BER or peer pausing the data processing.
+
+	This is a per-namespace sysctl.
+
+	Default: 4
+
+syn_retrans_before_tcp_fallback - INTEGER
+	The number of SYN + MP_CAPABLE retransmissions before falling back to
+	TCP, i.e. dropping the MPTCP options. In other words, if all the packets
+	are dropped on the way, there will be:
+
+	* The initial SYN with MPTCP support
+	* This number of SYN retransmitted with MPTCP support
+	* The next SYN retransmissions will be without MPTCP support
+
+	0 means the first retransmission will be done without MPTCP options.
+	>= 128 means that all SYN retransmissions will keep the MPTCP options. A
+	lower number might increase false-positive MPTCP blackholes detections.
+	This is a per-namespace sysctl.
+
+	Default: 2
diff --git a/Documentation/networking/mptcp.rst b/Documentation/networking/mptcp.rst
new file mode 100644
index 000000000000..17f2bab61164
--- /dev/null
+++ b/Documentation/networking/mptcp.rst
@@ -0,0 +1,156 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Multipath TCP (MPTCP)
+=====================
+
+Introduction
+============
+
+Multipath TCP or MPTCP is an extension to the standard TCP and is described in
+`RFC 8684 (MPTCPv1) <https://www.rfc-editor.org/rfc/rfc8684.html>`_. It allows a
+device to make use of multiple interfaces at once to send and receive TCP
+packets over a single MPTCP connection. MPTCP can aggregate the bandwidth of
+multiple interfaces or prefer the one with the lowest latency. It also allows a
+fail-over if one path is down, and the traffic is seamlessly reinjected on other
+paths.
+
+For more details about Multipath TCP in the Linux kernel, please see the
+official website: `mptcp.dev <https://www.mptcp.dev>`_.
+
+
+Use cases
+=========
+
+Thanks to MPTCP, being able to use multiple paths in parallel or simultaneously
+brings new use-cases, compared to TCP:
+
+- Seamless handovers: switching from one path to another while preserving
+  established connections, e.g. to be used in mobility use-cases, like on
+  smartphones.
+- Best network selection: using the "best" available path depending on some
+  conditions, e.g. latency, losses, cost, bandwidth, etc.
+- Network aggregation: using multiple paths at the same time to have a higher
+  throughput, e.g. to combine fixed and mobile networks to send files faster.
+
+
+Concepts
+========
+
+Technically, when a new socket is created with the ``IPPROTO_MPTCP`` protocol
+(Linux-specific), a *subflow* (or *path*) is created. This *subflow* consists of
+a regular TCP connection that is used to transmit data through one interface.
+Additional *subflows* can be negotiated later between the hosts. For the remote
+host to be able to detect the use of MPTCP, a new field is added to the TCP
+*option* field of the underlying TCP *subflow*. This field contains, amongst
+other things, a ``MP_CAPABLE`` option that tells the other host to use MPTCP if
+it is supported. If the remote host or any middlebox in between does not support
+it, the returned ``SYN+ACK`` packet will not contain MPTCP options in the TCP
+*option* field. In that case, the connection will be "downgraded" to plain TCP,
+and it will continue with a single path.
+
+This behavior is made possible by two internal components: the path manager, and
+the packet scheduler.
+
+Path Manager
+------------
+
+The Path Manager is in charge of *subflows*, from creation to deletion, and also
+address announcements. Typically, it is the client side that initiates subflows,
+and the server side that announces additional addresses via the ``ADD_ADDR`` and
+``REMOVE_ADDR`` options.
+
+Path managers are controlled by the ``net.mptcp.pm_type`` sysctl knob -- see
+mptcp-sysctl.rst. There are two types: the in-kernel one (type ``0``) where the
+same rules are applied for all the connections (see: ``ip mptcp``) ; and the
+userspace one (type ``1``), controlled by a userspace daemon (i.e. `mptcpd
+<https://mptcpd.mptcp.dev/>`_) where different rules can be applied for each
+connection. The path managers can be controlled via a Netlink API; see
+netlink_spec/mptcp_pm.rst.
+
+To be able to use multiple IP addresses on a host to create multiple *subflows*
+(paths), the default in-kernel MPTCP path-manager needs to know which IP
+addresses can be used. This can be configured with ``ip mptcp endpoint`` for
+example.
+
+Packet Scheduler
+----------------
+
+The Packet Scheduler is in charge of selecting which available *subflow(s)* to
+use to send the next data packet. It can decide to maximize the use of the
+available bandwidth, only to pick the path with the lower latency, or any other
+policy depending on the configuration.
+
+Packet schedulers are controlled by the ``net.mptcp.scheduler`` sysctl knob --
+see mptcp-sysctl.rst.
+
+
+Sockets API
+===========
+
+Creating MPTCP sockets
+----------------------
+
+On Linux, MPTCP can be used by selecting MPTCP instead of TCP when creating the
+``socket``:
+
+.. code-block:: C
+
+    int sd = socket(AF_INET(6), SOCK_STREAM, IPPROTO_MPTCP);
+
+Note that ``IPPROTO_MPTCP`` is defined as ``262``.
+
+If MPTCP is not supported, ``errno`` will be set to:
+
+- ``EINVAL``: (*Invalid argument*): MPTCP is not available, on kernels < 5.6.
+- ``EPROTONOSUPPORT`` (*Protocol not supported*): MPTCP has not been compiled,
+  on kernels >= v5.6.
+- ``ENOPROTOOPT`` (*Protocol not available*): MPTCP has been disabled using
+  ``net.mptcp.enabled`` sysctl knob; see mptcp-sysctl.rst.
+
+MPTCP is then opt-in: applications need to explicitly request it. Note that
+applications can be forced to use MPTCP with different techniques, e.g.
+``LD_PRELOAD`` (see ``mptcpize``), eBPF (see ``mptcpify``), SystemTAP,
+``GODEBUG`` (``GODEBUG=multipathtcp=1``), etc.
+
+Switching to ``IPPROTO_MPTCP`` instead of ``IPPROTO_TCP`` should be as
+transparent as possible for the userspace applications.
+
+Socket options
+--------------
+
+MPTCP supports most socket options handled by TCP. It is possible some less
+common options are not supported, but contributions are welcome.
+
+Generally, the same value is propagated to all subflows, including the ones
+created after the calls to ``setsockopt()``. eBPF can be used to set different
+values per subflow.
+
+There are some MPTCP specific socket options at the ``SOL_MPTCP`` (284) level to
+retrieve info. They fill the ``optval`` buffer of the ``getsockopt()`` system
+call:
+
+- ``MPTCP_INFO``: Uses ``struct mptcp_info``.
+- ``MPTCP_TCPINFO``: Uses ``struct mptcp_subflow_data``, followed by an array of
+  ``struct tcp_info``.
+- ``MPTCP_SUBFLOW_ADDRS``: Uses ``struct mptcp_subflow_data``, followed by an
+  array of ``mptcp_subflow_addrs``.
+- ``MPTCP_FULL_INFO``: Uses ``struct mptcp_full_info``, with one pointer to an
+  array of ``struct mptcp_subflow_info`` (including the
+  ``struct mptcp_subflow_addrs``), and one pointer to an array of
+  ``struct tcp_info``, followed by the content of ``struct mptcp_info``.
+
+Note that at the TCP level, ``TCP_IS_MPTCP`` socket option can be used to know
+if MPTCP is currently being used: the value will be set to 1 if it is.
+
+
+Design choices
+==============
+
+A new socket type has been added for MPTCP for the userspace-facing socket. The
+kernel is in charge of creating subflow sockets: they are TCP sockets where the
+behavior is modified using TCP-ULP.
+
+MPTCP listen sockets will create "plain" *accepted* TCP sockets if the
+connection request from the client didn't ask for MPTCP, making the performance
+impact minimal when MPTCP is enabled by default.
diff --git a/Documentation/networking/msg_zerocopy.rst b/Documentation/networking/msg_zerocopy.rst
index ace56204dd03..78fb70e748b7 100644
--- a/Documentation/networking/msg_zerocopy.rst
+++ b/Documentation/networking/msg_zerocopy.rst
@@ -7,7 +7,8 @@ Intro
 =====
 
 The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
-The feature is currently implemented for TCP and UDP sockets.
+The feature is currently implemented for TCP, UDP and VSOCK (with
+virtio transport) sockets.
 
 
 Opportunity and Caveats
@@ -15,7 +16,7 @@ Opportunity and Caveats
 
 Copying large buffers between user process and kernel can be
 expensive. Linux supports various interfaces that eschew copying,
-such as sendpage and splice. The MSG_ZEROCOPY flag extends the
+such as sendfile and splice. The MSG_ZEROCOPY flag extends the
 underlying copy avoidance mechanism to common socket send calls.
 
 Copy avoidance is not a free lunch. As implemented, with page pinning,
@@ -50,7 +51,7 @@ the excellent reporting over at LWN.net or read the original code.
 
   patchset
     [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
-    https://lkml.kernel.org/netdev/20170803202945.70750-1-willemdebruijn.kernel@gmail.com
+    https://lore.kernel.org/netdev/20170803202945.70750-1-willemdebruijn.kernel@gmail.com
 
 
 Interface
@@ -83,8 +84,8 @@ Pass the new flag.
 	ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);
 
 A zerocopy failure will return -1 with errno ENOBUFS. This happens if
-the socket option was not set, the socket exceeds its optmem limit or
-the user exceeds its ulimit on locked pages.
+the socket exceeds its optmem limit or the user exceeds their ulimit on
+locked pages.
 
 
 Mixing copy avoidance and copying
@@ -174,7 +175,9 @@ read_notification() call in the previous snippet. A notification
 is encoded in the standard error format, sock_extended_err.
 
 The level and type fields in the control data are protocol family
-specific, IP_RECVERR or IPV6_RECVERR.
+specific, IP_RECVERR or IPV6_RECVERR (for TCP or UDP socket).
+For VSOCK socket, cmsg_level will be SOL_VSOCK and cmsg_type will be
+VSOCK_RECVERR.
 
 Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
 as explained before, to avoid blocking read and write system calls on
@@ -235,12 +238,15 @@ Implementation
 Loopback
 --------
 
+For TCP and UDP:
 Data sent to local sockets can be queued indefinitely if the receive
 process does not read its socket. Unbound notification latency is not
 acceptable. For this reason all packets generated with MSG_ZEROCOPY
 that are looped to a local socket will incur a deferred copy. This
 includes looping onto packet sockets (e.g., tcpdump) and tun devices.
 
+For VSOCK:
+Data path sent to local sockets is the same as for non-local sockets.
 
 Testing
 =======
@@ -254,3 +260,6 @@ instance when run with msg_zerocopy.sh between a veth pair across
 namespaces, the test will not show any improvement. For testing, the
 loopback restriction can be temporarily relaxed by making
 skb_orphan_frags_rx identical to skb_orphan_frags.
+
+For VSOCK type of socket example can be found in
+tools/testing/vsock/vsock_test_zerocopy.c.
diff --git a/Documentation/networking/multi-pf-netdev.rst b/Documentation/networking/multi-pf-netdev.rst
new file mode 100644
index 000000000000..2f5a5bb3ca9a
--- /dev/null
+++ b/Documentation/networking/multi-pf-netdev.rst
@@ -0,0 +1,174 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===============
+Multi-PF Netdev
+===============
+
+Contents
+========
+
+- `Background`_
+- `Overview`_
+- `mlx5 implementation`_
+- `Channels distribution`_
+- `Observability`_
+- `Steering`_
+- `Mutually exclusive features`_
+
+Background
+==========
+
+The Multi-PF NIC technology enables several CPUs within a multi-socket server to connect directly to
+the network, each through its own dedicated PCIe interface. Through either a connection harness that
+splits the PCIe lanes between two cards or by bifurcating a PCIe slot for a single card. This
+results in eliminating the network traffic traversing over the internal bus between the sockets,
+significantly reducing overhead and latency, in addition to reducing CPU utilization and increasing
+network throughput.
+
+Overview
+========
+
+The feature adds support for combining multiple PFs of the same port in a Multi-PF environment under
+one netdev instance. It is implemented in the netdev layer. Lower-layer instances like pci func,
+sysfs entry, and devlink are kept separate.
+Passing traffic through different devices belonging to different NUMA sockets saves cross-NUMA
+traffic and allows apps running on the same netdev from different NUMAs to still feel a sense of
+proximity to the device and achieve improved performance.
+
+mlx5 implementation
+===================
+
+Multi-PF or Socket-direct in mlx5 is achieved by grouping PFs together which belong to the same
+NIC and has the socket-direct property enabled, once all PFs are probed, we create a single netdev
+to represent all of them, symmetrically, we destroy the netdev whenever any of the PFs is removed.
+
+The netdev network channels are distributed between all devices, a proper configuration would utilize
+the correct close NUMA node when working on a certain app/CPU.
+
+We pick one PF to be a primary (leader), and it fills a special role. The other devices
+(secondaries) are disconnected from the network at the chip level (set to silent mode). In silent
+mode, no south <-> north traffic flowing directly through a secondary PF. It needs the assistance of
+the leader PF (east <-> west traffic) to function. All Rx/Tx traffic is steered through the primary
+to/from the secondaries.
+
+Currently, we limit the support to PFs only, and up to two PFs (sockets).
+
+Channels distribution
+=====================
+
+We distribute the channels between the different PFs to achieve local NUMA node performance
+on multiple NUMA nodes.
+
+Each combined channel works against one specific PF, creating all its datapath queues against it. We
+distribute channels to PFs in a round-robin policy.
+
+::
+
+        Example for 2 PFs and 5 channels:
+        +--------+--------+
+        | ch idx | PF idx |
+        +--------+--------+
+        |    0   |    0   |
+        |    1   |    1   |
+        |    2   |    0   |
+        |    3   |    1   |
+        |    4   |    0   |
+        +--------+--------+
+
+
+The reason we prefer round-robin is, it is less influenced by changes in the number of channels. The
+mapping between a channel index and a PF is fixed, no matter how many channels the user configures.
+As the channel stats are persistent across channel's closure, changing the mapping every single time
+would turn the accumulative stats less representing of the channel's history.
+
+This is achieved by using the correct core device instance (mdev) in each channel, instead of them
+all using the same instance under "priv->mdev".
+
+Observability
+=============
+The relation between PF, irq, napi, and queue can be observed via netlink spec::
+
+  $ ./tools/net/ynl/pyynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump queue-get --json='{"ifindex": 13}'
+  [{'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'rx'},
+   {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'rx'},
+   {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'rx'},
+   {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'rx'},
+   {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'rx'},
+   {'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'tx'},
+   {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'tx'},
+   {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'tx'},
+   {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'tx'},
+   {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'tx'}]
+
+  $ ./tools/net/ynl/pyynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump napi-get --json='{"ifindex": 13}'
+  [{'id': 543, 'ifindex': 13, 'irq': 42},
+   {'id': 542, 'ifindex': 13, 'irq': 41},
+   {'id': 541, 'ifindex': 13, 'irq': 40},
+   {'id': 540, 'ifindex': 13, 'irq': 39},
+   {'id': 539, 'ifindex': 13, 'irq': 36}]
+
+Here you can clearly observe our channels distribution policy::
+
+  $ ls /proc/irq/{36,39,40,41,42}/mlx5* -d -1
+  /proc/irq/36/mlx5_comp0@pci:0000:08:00.0
+  /proc/irq/39/mlx5_comp0@pci:0000:09:00.0
+  /proc/irq/40/mlx5_comp1@pci:0000:08:00.0
+  /proc/irq/41/mlx5_comp1@pci:0000:09:00.0
+  /proc/irq/42/mlx5_comp2@pci:0000:08:00.0
+
+Steering
+========
+Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.
+
+In Rx, the steering tables belong to the primary PF only, and it is its role to distribute incoming
+traffic to other PFs, via cross-vhca steering capabilities. Still maintain a single default RSS table,
+that is capable of pointing to the receive queues of a different PF.
+
+In Tx, the primary PF creates a new Tx flow table, which is aliased by the secondaries, so they can
+go out to the network through it.
+
+In addition, we set default XPS configuration that, based on the CPU, selects an SQ belonging to the
+PF on the same node as the CPU.
+
+XPS default config example:
+
+NUMA node(s):          2
+NUMA node0 CPU(s):     0-11
+NUMA node1 CPU(s):     12-23
+
+PF0 on node0, PF1 on node1.
+
+- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001
+- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000
+- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002
+- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000
+- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004
+- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000
+- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008
+- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000
+- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010
+- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000
+- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020
+- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000
+- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040
+- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000
+- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080
+- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000
+- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100
+- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000
+- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200
+- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000
+- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400
+- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000
+- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800
+- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000
+
+Mutually exclusive features
+===========================
+
+The nature of Multi-PF, where different channels work with different PFs, conflicts with
+stateful features where the state is maintained in one of the PFs.
+For example, in the TLS device-offload feature, special context objects are created per connection
+and maintained in the PF.  Transitioning between different RQs/SQs would break the feature. Hence,
+we disable this combination for now.
diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst
new file mode 100644
index 000000000000..d0e3953cae6a
--- /dev/null
+++ b/Documentation/networking/napi.rst
@@ -0,0 +1,451 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+.. _napi:
+
+====
+NAPI
+====
+
+NAPI is the event handling mechanism used by the Linux networking stack.
+The name NAPI no longer stands for anything in particular [#]_.
+
+In basic operation the device notifies the host about new events
+via an interrupt.
+The host then schedules a NAPI instance to process the events.
+The device may also be polled for events via NAPI without receiving
+interrupts first (:ref:`busy polling<poll>`).
+
+NAPI processing usually happens in the software interrupt context,
+but there is an option to use :ref:`separate kernel threads<threaded>`
+for NAPI processing.
+
+All in all NAPI abstracts away from the drivers the context and configuration
+of event (packet Rx and Tx) processing.
+
+Driver API
+==========
+
+The two most important elements of NAPI are the struct napi_struct
+and the associated poll method. struct napi_struct holds the state
+of the NAPI instance while the method is the driver-specific event
+handler. The method will typically free Tx packets that have been
+transmitted and process newly received packets.
+
+.. _drv_ctrl:
+
+Control API
+-----------
+
+netif_napi_add() and netif_napi_del() add/remove a NAPI instance
+from the system. The instances are attached to the netdevice passed
+as argument (and will be deleted automatically when netdevice is
+unregistered). Instances are added in a disabled state.
+
+napi_enable() and napi_disable() manage the disabled state.
+A disabled NAPI can't be scheduled and its poll method is guaranteed
+to not be invoked. napi_disable() waits for ownership of the NAPI
+instance to be released.
+
+The control APIs are not idempotent. Control API calls are safe against
+concurrent use of datapath APIs but an incorrect sequence of control API
+calls may result in crashes, deadlocks, or race conditions. For example,
+calling napi_disable() multiple times in a row will deadlock.
+
+Datapath API
+------------
+
+napi_schedule() is the basic method of scheduling a NAPI poll.
+Drivers should call this function in their interrupt handler
+(see :ref:`drv_sched` for more info). A successful call to napi_schedule()
+will take ownership of the NAPI instance.
+
+Later, after NAPI is scheduled, the driver's poll method will be
+called to process the events/packets. The method takes a ``budget``
+argument - drivers can process completions for any number of Tx
+packets but should only process up to ``budget`` number of
+Rx packets. Rx processing is usually much more expensive.
+
+In other words for Rx processing the ``budget`` argument limits how many
+packets driver can process in a single poll. Rx specific APIs like page
+pool or XDP cannot be used at all when ``budget`` is 0.
+skb Tx processing should happen regardless of the ``budget``, but if
+the argument is 0 driver cannot call any XDP (or page pool) APIs.
+
+.. warning::
+
+   The ``budget`` argument may be 0 if core tries to only process
+   skb Tx completions and no Rx or XDP packets.
+
+The poll method returns the amount of work done. If the driver still
+has outstanding work to do (e.g. ``budget`` was exhausted)
+the poll method should return exactly ``budget``. In that case,
+the NAPI instance will be serviced/polled again (without the
+need to be scheduled).
+
+If event processing has been completed (all outstanding packets
+processed) the poll method should call napi_complete_done()
+before returning. napi_complete_done() releases the ownership
+of the instance.
+
+.. warning::
+
+   The case of finishing all events and using exactly ``budget``
+   must be handled carefully. There is no way to report this
+   (rare) condition to the stack, so the driver must either
+   not call napi_complete_done() and wait to be called again,
+   or return ``budget - 1``.
+
+   If the ``budget`` is 0 napi_complete_done() should never be called.
+
+Call sequence
+-------------
+
+Drivers should not make assumptions about the exact sequencing
+of calls. The poll method may be called without the driver scheduling
+the instance (unless the instance is disabled). Similarly,
+it's not guaranteed that the poll method will be called, even
+if napi_schedule() succeeded (e.g. if the instance gets disabled).
+
+As mentioned in the :ref:`drv_ctrl` section - napi_disable() and subsequent
+calls to the poll method only wait for the ownership of the instance
+to be released, not for the poll method to exit. This means that
+drivers should avoid accessing any data structures after calling
+napi_complete_done().
+
+.. _drv_sched:
+
+Scheduling and IRQ masking
+--------------------------
+
+Drivers should keep the interrupts masked after scheduling
+the NAPI instance - until NAPI polling finishes any further
+interrupts are unnecessary.
+
+Drivers which have to mask the interrupts explicitly (as opposed
+to IRQ being auto-masked by the device) should use the napi_schedule_prep()
+and __napi_schedule() calls:
+
+.. code-block:: c
+
+  if (napi_schedule_prep(&v->napi)) {
+      mydrv_mask_rxtx_irq(v->idx);
+      /* schedule after masking to avoid races */
+      __napi_schedule(&v->napi);
+  }
+
+IRQ should only be unmasked after a successful call to napi_complete_done():
+
+.. code-block:: c
+
+  if (budget && napi_complete_done(&v->napi, work_done)) {
+    mydrv_unmask_rxtx_irq(v->idx);
+    return min(work_done, budget - 1);
+  }
+
+napi_schedule_irqoff() is a variant of napi_schedule() which takes advantage
+of guarantees given by being invoked in IRQ context (no need to
+mask interrupts). napi_schedule_irqoff() will fall back to napi_schedule() if
+IRQs are threaded (such as if ``PREEMPT_RT`` is enabled).
+
+Instance to queue mapping
+-------------------------
+
+Modern devices have multiple NAPI instances (struct napi_struct) per
+interface. There is no strong requirement on how the instances are
+mapped to queues and interrupts. NAPI is primarily a polling/processing
+abstraction without specific user-facing semantics. That said, most networking
+devices end up using NAPI in fairly similar ways.
+
+NAPI instances most often correspond 1:1:1 to interrupts and queue pairs
+(queue pair is a set of a single Rx and single Tx queue).
+
+In less common cases a NAPI instance may be used for multiple queues
+or Rx and Tx queues can be serviced by separate NAPI instances on a single
+core. Regardless of the queue assignment, however, there is usually still
+a 1:1 mapping between NAPI instances and interrupts.
+
+It's worth noting that the ethtool API uses a "channel" terminology where
+each channel can be either ``rx``, ``tx`` or ``combined``. It's not clear
+what constitutes a channel; the recommended interpretation is to understand
+a channel as an IRQ/NAPI which services queues of a given type. For example,
+a configuration of 1 ``rx``, 1 ``tx`` and 1 ``combined`` channel is expected
+to utilize 3 interrupts, 2 Rx and 2 Tx queues.
+
+Persistent NAPI config
+----------------------
+
+Drivers often allocate and free NAPI instances dynamically. This leads to loss
+of NAPI-related user configuration each time NAPI instances are reallocated.
+The netif_napi_add_config() API prevents this loss of configuration by
+associating each NAPI instance with a persistent NAPI configuration based on
+a driver defined index value, like a queue number.
+
+Using this API allows for persistent NAPI IDs (among other settings), which can
+be beneficial to userspace programs using ``SO_INCOMING_NAPI_ID``. See the
+sections below for other NAPI configuration settings.
+
+Drivers should try to use netif_napi_add_config() whenever possible.
+
+User API
+========
+
+User interactions with NAPI depend on NAPI instance ID. The instance IDs
+are only visible to the user thru the ``SO_INCOMING_NAPI_ID`` socket option.
+
+Users can query NAPI IDs for a device or device queue using netlink. This can
+be done programmatically in a user application or by using a script included in
+the kernel source tree: ``tools/net/ynl/pyynl/cli.py``.
+
+For example, using the script to dump all of the queues for a device (which
+will reveal each queue's NAPI ID):
+
+.. code-block:: bash
+
+   $ kernel-source/tools/net/ynl/pyynl/cli.py \
+             --spec Documentation/netlink/specs/netdev.yaml \
+             --dump queue-get \
+             --json='{"ifindex": 2}'
+
+See ``Documentation/netlink/specs/netdev.yaml`` for more details on
+available operations and attributes.
+
+Software IRQ coalescing
+-----------------------
+
+NAPI does not perform any explicit event coalescing by default.
+In most scenarios batching happens due to IRQ coalescing which is done
+by the device. There are cases where software coalescing is helpful.
+
+NAPI can be configured to arm a repoll timer instead of unmasking
+the hardware interrupts as soon as all packets are processed.
+The ``gro_flush_timeout`` sysfs configuration of the netdevice
+is reused to control the delay of the timer, while
+``napi_defer_hard_irqs`` controls the number of consecutive empty polls
+before NAPI gives up and goes back to using hardware IRQs.
+
+The above parameters can also be set on a per-NAPI basis using netlink via
+netdev-genl. When used with netlink and configured on a per-NAPI basis, the
+parameters mentioned above use hyphens instead of underscores:
+``gro-flush-timeout`` and ``napi-defer-hard-irqs``.
+
+Per-NAPI configuration can be done programmatically in a user application
+or by using a script included in the kernel source tree:
+``tools/net/ynl/pyynl/cli.py``.
+
+For example, using the script:
+
+.. code-block:: bash
+
+  $ kernel-source/tools/net/ynl/pyynl/cli.py \
+            --spec Documentation/netlink/specs/netdev.yaml \
+            --do napi-set \
+            --json='{"id": 345,
+                     "defer-hard-irqs": 111,
+                     "gro-flush-timeout": 11111}'
+
+Similarly, the parameter ``irq-suspend-timeout`` can be set using netlink
+via netdev-genl. There is no global sysfs parameter for this value.
+
+``irq-suspend-timeout`` is used to determine how long an application can
+completely suspend IRQs. It is used in combination with SO_PREFER_BUSY_POLL,
+which can be set on a per-epoll context basis with ``EPIOCSPARAMS`` ioctl.
+
+.. _poll:
+
+Busy polling
+------------
+
+Busy polling allows a user process to check for incoming packets before
+the device interrupt fires. As is the case with any busy polling it trades
+off CPU cycles for lower latency (production uses of NAPI busy polling
+are not well known).
+
+Busy polling is enabled by either setting ``SO_BUSY_POLL`` on
+selected sockets or using the global ``net.core.busy_poll`` and
+``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling
+also exists.
+
+epoll-based busy polling
+------------------------
+
+It is possible to trigger packet processing directly from calls to
+``epoll_wait``. In order to use this feature, a user application must ensure
+all file descriptors which are added to an epoll context have the same NAPI ID.
+
+If the application uses a dedicated acceptor thread, the application can obtain
+the NAPI ID of the incoming connection using SO_INCOMING_NAPI_ID and then
+distribute that file descriptor to a worker thread. The worker thread would add
+the file descriptor to its epoll context. This would ensure each worker thread
+has an epoll context with FDs that have the same NAPI ID.
+
+Alternatively, if the application uses SO_REUSEPORT, a bpf or ebpf program can
+be inserted to distribute incoming connections to threads such that each thread
+is only given incoming connections with the same NAPI ID. Care must be taken to
+carefully handle cases where a system may have multiple NICs.
+
+In order to enable busy polling, there are two choices:
+
+1. ``/proc/sys/net/core/busy_poll`` can be set with a time in useconds to busy
+   loop waiting for events. This is a system-wide setting and will cause all
+   epoll-based applications to busy poll when they call epoll_wait. This may
+   not be desirable as many applications may not have the need to busy poll.
+
+2. Applications using recent kernels can issue an ioctl on the epoll context
+   file descriptor to set (``EPIOCSPARAMS``) or get (``EPIOCGPARAMS``) ``struct
+   epoll_params``:, which user programs can define as follows:
+
+.. code-block:: c
+
+  struct epoll_params {
+      uint32_t busy_poll_usecs;
+      uint16_t busy_poll_budget;
+      uint8_t prefer_busy_poll;
+
+      /* pad the struct to a multiple of 64bits */
+      uint8_t __pad;
+  };
+
+IRQ mitigation
+---------------
+
+While busy polling is supposed to be used by low latency applications,
+a similar mechanism can be used for IRQ mitigation.
+
+Very high request-per-second applications (especially routing/forwarding
+applications and especially applications using AF_XDP sockets) may not
+want to be interrupted until they finish processing a request or a batch
+of packets.
+
+Such applications can pledge to the kernel that they will perform a busy
+polling operation periodically, and the driver should keep the device IRQs
+permanently masked. This mode is enabled by using the ``SO_PREFER_BUSY_POLL``
+socket option. To avoid system misbehavior the pledge is revoked
+if ``gro_flush_timeout`` passes without any busy poll call. For epoll-based
+busy polling applications, the ``prefer_busy_poll`` field of ``struct
+epoll_params`` can be set to 1 and the ``EPIOCSPARAMS`` ioctl can be issued to
+enable this mode. See the above section for more details.
+
+The NAPI budget for busy polling is lower than the default (which makes
+sense given the low latency intention of normal busy polling). This is
+not the case with IRQ mitigation, however, so the budget can be adjusted
+with the ``SO_BUSY_POLL_BUDGET`` socket option. For epoll-based busy polling
+applications, the ``busy_poll_budget`` field can be adjusted to the desired value
+in ``struct epoll_params`` and set on a specific epoll context using the ``EPIOCSPARAMS``
+ioctl. See the above section for more details.
+
+It is important to note that choosing a large value for ``gro_flush_timeout``
+will defer IRQs to allow for better batch processing, but will induce latency
+when the system is not fully loaded. Choosing a small value for
+``gro_flush_timeout`` can cause interference of the user application which is
+attempting to busy poll by device IRQs and softirq processing. This value
+should be chosen carefully with these tradeoffs in mind. epoll-based busy
+polling applications may be able to mitigate how much user processing happens
+by choosing an appropriate value for ``maxevents``.
+
+Users may want to consider an alternate approach, IRQ suspension, to help deal
+with these tradeoffs.
+
+IRQ suspension
+--------------
+
+IRQ suspension is a mechanism wherein device IRQs are masked while epoll
+triggers NAPI packet processing.
+
+While application calls to epoll_wait successfully retrieve events, the kernel will
+defer the IRQ suspension timer. If the kernel does not retrieve any events
+while busy polling (for example, because network traffic levels subsided), IRQ
+suspension is disabled and the IRQ mitigation strategies described above are
+engaged.
+
+This allows users to balance CPU consumption with network processing
+efficiency.
+
+To use this mechanism:
+
+  1. The per-NAPI config parameter ``irq-suspend-timeout`` should be set to the
+     maximum time (in nanoseconds) the application can have its IRQs
+     suspended. This is done using netlink, as described above. This timeout
+     serves as a safety mechanism to restart IRQ driver interrupt processing if
+     the application has stalled. This value should be chosen so that it covers
+     the amount of time the user application needs to process data from its
+     call to epoll_wait, noting that applications can control how much data
+     they retrieve by setting ``max_events`` when calling epoll_wait.
+
+  2. The sysfs parameter or per-NAPI config parameters ``gro_flush_timeout``
+     and ``napi_defer_hard_irqs`` can be set to low values. They will be used
+     to defer IRQs after busy poll has found no data.
+
+  3. The ``prefer_busy_poll`` flag must be set to true. This can be done using
+     the ``EPIOCSPARAMS`` ioctl as described above.
+
+  4. The application uses epoll as described above to trigger NAPI packet
+     processing.
+
+As mentioned above, as long as subsequent calls to epoll_wait return events to
+userland, the ``irq-suspend-timeout`` is deferred and IRQs are disabled. This
+allows the application to process data without interference.
+
+Once a call to epoll_wait results in no events being found, IRQ suspension is
+automatically disabled and the ``gro_flush_timeout`` and
+``napi_defer_hard_irqs`` mitigation mechanisms take over.
+
+It is expected that ``irq-suspend-timeout`` will be set to a value much larger
+than ``gro_flush_timeout`` as ``irq-suspend-timeout`` should suspend IRQs for
+the duration of one userland processing cycle.
+
+While it is not strictly necessary to use ``napi_defer_hard_irqs`` and
+``gro_flush_timeout`` to use IRQ suspension, their use is strongly
+recommended.
+
+IRQ suspension causes the system to alternate between polling mode and
+irq-driven packet delivery. During busy periods, ``irq-suspend-timeout``
+overrides ``gro_flush_timeout`` and keeps the system busy polling, but when
+epoll finds no events, the setting of ``gro_flush_timeout`` and
+``napi_defer_hard_irqs`` determine the next step.
+
+There are essentially three possible loops for network processing and
+packet delivery:
+
+1) hardirq -> softirq -> napi poll; basic interrupt delivery
+2) timer -> softirq -> napi poll; deferred irq processing
+3) epoll -> busy-poll -> napi poll; busy looping
+
+Loop 2 can take control from Loop 1, if ``gro_flush_timeout`` and
+``napi_defer_hard_irqs`` are set.
+
+If ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` are set, Loops 2
+and 3 "wrestle" with each other for control.
+
+During busy periods, ``irq-suspend-timeout`` is used as timer in Loop 2,
+which essentially tilts network processing in favour of Loop 3.
+
+If ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` are not set, Loop 3
+cannot take control from Loop 1.
+
+Therefore, setting ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` is
+the recommended usage, because otherwise setting ``irq-suspend-timeout``
+might not have any discernible effect.
+
+.. _threaded:
+
+Threaded NAPI
+-------------
+
+Threaded NAPI is an operating mode that uses dedicated kernel
+threads rather than software IRQ context for NAPI processing.
+The configuration is per netdevice and will affect all
+NAPI instances of that device. Each NAPI instance will spawn a separate
+thread (called ``napi/${ifc-name}-${napi-id}``).
+
+It is recommended to pin each kernel thread to a single CPU, the same
+CPU as the CPU which services the interrupt. Note that the mapping
+between IRQs and NAPI instances may not be trivial (and is driver
+dependent). The NAPI instance IDs will be assigned in the opposite
+order than the process IDs of the kernel threads.
+
+Threaded NAPI is controlled by writing 0/1 to the ``threaded`` file in
+netdev's sysfs directory.
+
+.. rubric:: Footnotes
+
+.. [#] NAPI was originally referred to as New API in 2.4 Linux.
diff --git a/Documentation/networking/net_cachelines/index.rst b/Documentation/networking/net_cachelines/index.rst
new file mode 100644
index 000000000000..2669e4cda086
--- /dev/null
+++ b/Documentation/networking/net_cachelines/index.rst
@@ -0,0 +1,16 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. Copyright (C) 2023 Google LLC
+
+===================================
+Common Networking Struct Cachelines
+===================================
+
+.. toctree::
+   :maxdepth: 1
+
+   inet_connection_sock
+   inet_sock
+   net_device
+   netns_ipv4_sysctl
+   snmp
+   tcp_sock
diff --git a/Documentation/networking/net_cachelines/inet_connection_sock.rst b/Documentation/networking/net_cachelines/inet_connection_sock.rst
new file mode 100644
index 000000000000..8fae85ebb773
--- /dev/null
+++ b/Documentation/networking/net_cachelines/inet_connection_sock.rst
@@ -0,0 +1,51 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. Copyright (C) 2023 Google LLC
+
+=====================================================
+inet_connection_sock struct fast path usage breakdown
+=====================================================
+
+=================================== ====================== =================== =================== ========================================================================================================================================================
+Type                                Name                   fastpath_tx_access  fastpath_rx_access  comment
+=================================== ====================== =================== =================== ========================================================================================================================================================
+struct inet_sock                    icsk_inet              read_mostly         read_mostly         tcp_init_buffer_space,tcp_init_transfer,tcp_finish_connect,tcp_connect,tcp_send_rcvq,tcp_send_syn_data
+struct request_sock_queue           icsk_accept_queue
+struct inet_bind_bucket             icsk_bind_hash         read_mostly                             tcp_set_state
+struct inet_bind2_bucket            icsk_bind2_hash        read_mostly                             tcp_set_state,inet_put_port
+struct timer_list                   icsk_retransmit_timer  read_write                              inet_csk_reset_xmit_timer,tcp_connect
+struct timer_list                   icsk_delack_timer      read_mostly                             inet_csk_reset_xmit_timer,tcp_connect
+u32                                 icsk_rto               read_write                              tcp_cwnd_validate,tcp_schedule_loss_probe,tcp_connect_init,tcp_connect,tcp_write_xmit,tcp_push_one
+u32                                 icsk_rto_min
+u32                                 icsk_rto_max           read_mostly                             tcp_reset_xmit_timer
+u32                                 icsk_delack_max
+u32                                 icsk_pmtu_cookie       read_write                              tcp_sync_mss,tcp_current_mss,tcp_send_syn_data,tcp_connect_init,tcp_connect
+struct tcp_congestion_ops           icsk_ca_ops            read_write                              tcp_cwnd_validate,tcp_tso_segs,tcp_ca_dst_init,tcp_connect_init,tcp_connect,tcp_write_xmit
+struct inet_connection_sock_af_ops  icsk_af_ops            read_mostly                             tcp_finish_connect,tcp_send_syn_data,tcp_mtup_init,tcp_mtu_check_reprobe,tcp_mtu_probe,tcp_connect_init,tcp_connect,__tcp_transmit_skb
+struct tcp_ulp_ops*                 icsk_ulp_ops
+void*                               icsk_ulp_data
+u8:5                                icsk_ca_state          read_write                              tcp_cwnd_application_limited,tcp_set_ca_state,tcp_enter_cwr,tcp_tso_should_defer,tcp_mtu_probe,tcp_schedule_loss_probe,tcp_write_xmit,__tcp_transmit_skb
+u8:1                                icsk_ca_initialized    read_write                              tcp_init_transfer,tcp_init_congestion_control,tcp_init_transfer,tcp_finish_connect,tcp_connect
+u8:1                                icsk_ca_setsockopt
+u8:1                                icsk_ca_dst_locked     write_mostly                            tcp_ca_dst_init,tcp_connect_init,tcp_connect
+u8                                  icsk_retransmits       write_mostly                            tcp_connect_init,tcp_connect
+u8                                  icsk_pending           read_write                              inet_csk_reset_xmit_timer,tcp_connect,tcp_check_probe_timer,__tcp_push_pending_frames,tcp_rearm_rto,tcp_event_new_data_sent,tcp_event_new_data_sent
+u8                                  icsk_backoff           write_mostly                            tcp_write_queue_purge,tcp_connect_init
+u8                                  icsk_syn_retries
+u8                                  icsk_probes_out
+u16                                 icsk_ext_hdr_len       read_mostly                             __tcp_mtu_to_mss,tcp_mtu_to_rss,tcp_mtu_probe,tcp_write_xmit,tcp_mtu_to_mss,
+struct icsk_ack_u8                  pending                read_write          read_write          inet_csk_ack_scheduled,__tcp_cleanup_rbuf,tcp_cleanup_rbuf,inet_csk_clear_xmit_timer,tcp_event_ack-sent,inet_csk_reset_xmit_timer
+struct icsk_ack_u8                  quick                  read_write          write_mostly        tcp_dec_quickack_mode,tcp_event_ack_sent,__tcp_transmit_skb,__tcp_select_window,__tcp_cleanup_rbuf
+struct icsk_ack_u8                  pingpong
+struct icsk_ack_u8                  retry                  write_mostly        read_write          inet_csk_clear_xmit_timer,tcp_rearm_rto,tcp_event_new_data_sent,tcp_write_xmit,__tcp_send_ack,tcp_send_ack,
+struct icsk_ack_u8                  ato                    read_mostly         write_mostly        tcp_dec_quickack_mode,tcp_event_ack_sent,__tcp_transmit_skb,__tcp_send_ack,tcp_send_ack
+struct icsk_ack_u32                 lrcvtime               read_write                              tcp_finish_connect,tcp_connect,tcp_event_data_sent,__tcp_transmit_skb
+struct icsk_ack_u16                 rcv_mss                write_mostly        read_mostly         __tcp_select_window,__tcp_cleanup_rbuf,tcp_initialize_rcv_mss,tcp_connect_init
+struct icsk_mtup_int                search_high            read_write                              tcp_mtup_init,tcp_sync_mss,tcp_connect_init,tcp_mtu_check_reprobe,tcp_write_xmit
+struct icsk_mtup_int                search_low             read_write                              tcp_mtu_probe,tcp_mtu_check_reprobe,tcp_write_xmit,tcp_sync_mss,tcp_connect_init,tcp_mtup_init
+struct icsk_mtup_u32:31             probe_size             read_write                              tcp_mtup_init,tcp_connect_init,__tcp_transmit_skb
+struct icsk_mtup_u32:1              enabled                read_write                              tcp_mtup_init,tcp_sync_mss,tcp_connect_init,tcp_mtu_probe,tcp_write_xmit
+struct icsk_mtup_u32                probe_timestamp        read_write                              tcp_mtup_init,tcp_connect_init,tcp_mtu_check_reprobe,tcp_mtu_probe
+u32                                 icsk_probes_tstamp
+u32                                 icsk_user_timeout
+u64[104/sizeof(u64)]                icsk_ca_priv
+=================================== ====================== =================== =================== ========================================================================================================================================================
diff --git a/Documentation/networking/net_cachelines/inet_sock.rst b/Documentation/networking/net_cachelines/inet_sock.rst
new file mode 100644
index 000000000000..b11bf48fa2b3
--- /dev/null
+++ b/Documentation/networking/net_cachelines/inet_sock.rst
@@ -0,0 +1,46 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. Copyright (C) 2023 Google LLC
+
+==========================================
+inet_sock struct fast path usage breakdown
+==========================================
+
+======================= ===================== =================== =================== ======================================================================================================
+Type                    Name                  fastpath_tx_access  fastpath_rx_access  comment
+======================= ===================== =================== =================== ======================================================================================================
+struct sock             sk                    read_mostly         read_mostly         tcp_init_buffer_space,tcp_init_transfer,tcp_finish_connect,tcp_connect,tcp_send_rcvq,tcp_send_syn_data
+struct ipv6_pinfo*      pinet6
+be16                    inet_sport            read_mostly                             __tcp_transmit_skb
+be32                    inet_daddr            read_mostly                             ip_select_ident_segs
+be32                    inet_rcv_saddr
+be16                    inet_dport            read_mostly                             __tcp_transmit_skb
+u16                     inet_num
+be32                    inet_saddr
+s16                     uc_ttl                read_mostly                             __ip_queue_xmit/ip_select_ttl
+u16                     cmsg_flags
+struct ip_options_rcu*  inet_opt              read_mostly                             __ip_queue_xmit
+u16                     inet_id               read_mostly                             ip_select_ident_segs
+u8                      tos                   read_mostly                             ip_queue_xmit
+u8                      min_ttl
+u8                      mc_ttl
+u8                      pmtudisc
+u8:1                    recverr
+u8:1                    is_icsk
+u8:1                    freebind
+u8:1                    hdrincl
+u8:1                    mc_loop
+u8:1                    transparent
+u8:1                    mc_all
+u8:1                    nodefrag
+u8:1                    bind_address_no_port
+u8:1                    recverr_rfc4884
+u8:1                    defer_connect         read_mostly                             tcp_sendmsg_fastopen
+u8                      rcv_tos
+u8                      convert_csum
+int                     uc_index
+int                     mc_index
+be32                    mc_addr
+struct ip_mc_socklist*  mc_list
+struct inet_cork_full   cork                  read_mostly                             __tcp_transmit_skb
+struct                  local_port_range
+======================= ===================== =================== =================== ======================================================================================================
diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
new file mode 100644
index 000000000000..c69cc89c958e
--- /dev/null
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -0,0 +1,193 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. Copyright (C) 2023 Google LLC
+
+===========================================
+net_device struct fast path usage breakdown
+===========================================
+
+=================================== =========================== =================== =================== ===================================================================================
+Type                                Name                        fastpath_tx_access  fastpath_rx_access  Comments
+=================================== =========================== =================== =================== ===================================================================================
+unsigned_long:32                    priv_flags                  read_mostly                             __dev_queue_xmit(tx)
+unsigned_long:1                     lltx                        read_mostly                             HARD_TX_LOCK,HARD_TX_TRYLOCK,HARD_TX_UNLOCK(tx)
+unsigned long:1                     netmem_tx:1;                read_mostly
+char                                name[16]
+struct netdev_name_node*            name_node
+struct dev_ifalias*                 ifalias
+unsigned_long                       mem_end
+unsigned_long                       mem_start
+unsigned_long                       base_addr
+unsigned_long                       state                       read_mostly         read_mostly         netif_running(dev)
+struct list_head                    dev_list
+struct list_head                    napi_list
+struct list_head                    unreg_list
+struct list_head                    close_list
+struct list_head                    ptype_all                   read_mostly                             dev_nit_active(tx)
+struct list_head                    ptype_specific                                  read_mostly         deliver_ptype_list_skb/__netif_receive_skb_core(rx)
+struct                              adj_list
+unsigned_int                        flags                       read_mostly         read_mostly         __dev_queue_xmit,__dev_xmit_skb,ip6_output,__ip6_finish_output(tx);ip6_rcv_core(rx)
+xdp_features_t                      xdp_features
+struct net_device_ops*              netdev_ops                  read_mostly                             netdev_core_pick_tx,netdev_start_xmit(tx)
+struct xdp_metadata_ops*            xdp_metadata_ops
+int                                 ifindex                                         read_mostly         ip6_rcv_core
+unsigned_short                      gflags
+unsigned_short                      hard_header_len             read_mostly         read_mostly         ip6_xmit(tx);gro_list_prepare(rx)
+unsigned_int                        mtu                         read_mostly                             ip_finish_output2
+unsigned_short                      needed_headroom             read_mostly                             LL_RESERVED_SPACE/ip_finish_output2
+unsigned_short                      needed_tailroom
+netdev_features_t                   features                    read_mostly         read_mostly         HARD_TX_LOCK,netif_skb_features,sk_setup_caps(tx);netif_elide_gro(rx)
+netdev_features_t                   hw_features
+netdev_features_t                   wanted_features
+netdev_features_t                   vlan_features
+netdev_features_t                   hw_enc_features                                                     netif_skb_features
+netdev_features_t                   mpls_features
+netdev_features_t                   gso_partial_features        read_mostly                             gso_features_check
+unsigned_int                        min_mtu
+unsigned_int                        max_mtu
+unsigned_short                      type
+unsigned_char                       min_header_len
+unsigned_char                       name_assign_type
+int                                 group
+struct net_device_stats             stats
+struct net_device_core_stats*       core_stats
+atomic_t                            carrier_up_count
+atomic_t                            carrier_down_count
+struct iw_handler_def*              wireless_handlers
+struct ethtool_ops*                 ethtool_ops
+struct l3mdev_ops*                  l3mdev_ops
+struct ndisc_ops*                   ndisc_ops
+struct xfrmdev_ops*                 xfrmdev_ops
+struct tlsdev_ops*                  tlsdev_ops
+struct header_ops*                  header_ops                  read_mostly                             ip_finish_output2,ip6_finish_output2(tx)
+unsigned_char                       operstate
+unsigned_char                       link_mode
+unsigned_char                       if_port
+unsigned_char                       dma
+unsigned_char                       perm_addr[32]
+unsigned_char                       addr_assign_type
+unsigned_char                       addr_len
+unsigned_char                       upper_level
+unsigned_char                       lower_level
+unsigned_short                      neigh_priv_len
+unsigned_short                      padded
+unsigned_short                      dev_id
+unsigned_short                      dev_port
+spinlock_t                          addr_list_lock
+int                                 irq
+struct netdev_hw_addr_list          uc
+struct netdev_hw_addr_list          mc
+struct netdev_hw_addr_list          dev_addrs
+struct kset*                        queues_kset
+struct list_head                    unlink_list
+unsigned_int                        promiscuity
+unsigned_int                        allmulti
+bool                                uc_promisc
+unsigned_char                       nested_level
+struct in_device*                   ip_ptr                      read_mostly         read_mostly         __in_dev_get
+struct hlist_head                   fib_nh_head
+struct inet6_dev*                   ip6_ptr                     read_mostly         read_mostly         __in6_dev_get
+struct vlan_info*                   vlan_info
+struct dsa_port*                    dsa_ptr
+struct tipc_bearer*                 tipc_ptr
+void*                               atalk_ptr
+void*                               ax25_ptr
+struct wireless_dev*                ieee80211_ptr
+struct wpan_dev*                    ieee802154_ptr
+struct mpls_dev*                    mpls_ptr
+struct mctp_dev*                    mctp_ptr
+unsigned_char*                      dev_addr
+struct netdev_queue*                _rx                         read_mostly                             netdev_get_rx_queue(rx)
+unsigned_int                        num_rx_queues
+unsigned_int                        real_num_rx_queues                              read_mostly         get_rps_cpu
+struct bpf_prog*                    xdp_prog                                        read_mostly         netif_elide_gro()
+unsigned_long                       gro_flush_timeout                               read_mostly         napi_complete_done
+u32                                 napi_defer_hard_irqs                            read_mostly         napi_complete_done
+unsigned_int                        gro_max_size                                    read_mostly         skb_gro_receive
+unsigned_int                        gro_ipv4_max_size                               read_mostly         skb_gro_receive
+rx_handler_func_t*                  rx_handler                  read_mostly                             __netif_receive_skb_core
+void*                               rx_handler_data             read_mostly
+struct netdev_queue*                ingress_queue               read_mostly
+struct bpf_mprog_entry              tcx_ingress                                     read_mostly         sch_handle_ingress
+struct nf_hook_entries*             nf_hooks_ingress
+unsigned_char                       broadcast[32]
+struct cpu_rmap*                    rx_cpu_rmap
+struct hlist_node                   index_hlist
+struct netdev_queue*                _tx                         read_mostly                             netdev_get_tx_queue(tx)
+unsigned_int                        num_tx_queues
+unsigned_int                        real_num_tx_queues          read_mostly                             skb_tx_hash,netdev_core_pick_tx(tx)
+unsigned_int                        tx_queue_len
+spinlock_t                          tx_global_lock
+struct xdp_dev_bulk_queue__percpu*  xdp_bulkq
+struct xps_dev_maps*                xps_maps[2]                 read_mostly                             __netif_set_xps_queue
+struct bpf_mprog_entry              tcx_egress                  read_mostly                             sch_handle_egress
+struct nf_hook_entries*             nf_hooks_egress             read_mostly
+struct hlist_head                   qdisc_hash[16]
+struct timer_list                   watchdog_timer
+int                                 watchdog_timeo
+u32                                 proto_down_reason
+struct list_head                    todo_list
+int__percpu*                        pcpu_refcnt
+refcount_t                          dev_refcnt
+struct ref_tracker_dir              refcnt_tracker
+struct list_head                    link_watch_list
+enum:8                              reg_state
+bool                                dismantle
+bool                                rtnl_link_initilizing
+bool                                needs_free_netdev
+void*priv_destructor                struct net_device
+struct netpoll_info*                npinfo                                          read_mostly         napi_poll/napi_poll_lock
+possible_net_t                      nd_net                                          read_mostly         (dev_net)napi_busy_loop,tcp_v(4/6)_rcv,ip(v6)_rcv,ip(6)_input,ip(6)_input_finish
+void*                               ml_priv
+enum_netdev_ml_priv_type            ml_priv_type
+struct pcpu_lstats__percpu*         lstats                      read_mostly                             dev_lstats_add()
+struct pcpu_sw_netstats__percpu*    tstats                      read_mostly                             dev_sw_netstats_tx_add()
+struct pcpu_dstats__percpu*         dstats
+struct garp_port*                   garp_port
+struct mrp_port*                    mrp_port
+struct dm_hw_stat_delta*            dm_private
+struct device                       dev
+struct attribute_group*             sysfs_groups[4]
+struct attribute_group*             sysfs_rx_queue_group
+struct rtnl_link_ops*               rtnl_link_ops
+unsigned_int                        gso_max_size                read_mostly                             sk_dst_gso_max_size
+unsigned_int                        tso_max_size
+u16                                 gso_max_segs                read_mostly                             gso_max_segs
+u16                                 tso_max_segs
+unsigned_int                        gso_ipv4_max_size           read_mostly                             sk_dst_gso_max_size
+struct dcbnl_rtnl_ops*              dcbnl_ops
+s16                                 num_tc                      read_mostly                             skb_tx_hash
+struct netdev_tc_txq                tc_to_txq[16]               read_mostly                             skb_tx_hash
+u8                                  prio_tc_map[16]
+unsigned_int                        fcoe_ddp_xid
+struct netprio_map*                 priomap
+struct phy_device*                  phydev
+struct sfp_bus*                     sfp_bus
+struct lock_class_key*              qdisc_tx_busylock
+bool                                proto_down
+unsigned:1                          wol_enabled
+unsigned:1                          threaded                                                            napi_poll(napi_enable,dev_set_threaded)
+unsigned_long:1                     see_all_hwtstamp_requests
+unsigned_long:1                     change_proto_down
+unsigned_long:1                     netns_immutable
+unsigned_long:1                     fcoe_mtu
+struct list_head                    net_notifier_list
+struct macsec_ops*                  macsec_ops
+struct udp_tunnel_nic_info*         udp_tunnel_nic_info
+struct udp_tunnel_nic*              udp_tunnel_nic
+unsigned_int                        xdp_zc_max_segs
+struct bpf_xdp_entity               xdp_state[3]
+u8                                  dev_addr_shadow[32]
+netdevice_tracker                   linkwatch_dev_tracker
+netdevice_tracker                   watchdog_dev_tracker
+netdevice_tracker                   dev_registered_tracker
+struct rtnl_hw_stats64*             offload_xstats_l3
+struct devlink_port*                devlink_port
+struct dpll_pin*                    dpll_pin
+struct hlist_head                   page_pools
+struct dim_irq_moder*               irq_moder
+u64                                 max_pacing_offload_horizon
+struct_napi_config*                 napi_config
+unsigned_long                       gro_flush_timeout
+u32                                 napi_defer_hard_irqs
+struct hlist_head                   neighbours[2]
+=================================== =========================== =================== =================== ===================================================================================
diff --git a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
new file mode 100644
index 000000000000..6e7b20afd2d4
--- /dev/null
+++ b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
@@ -0,0 +1,162 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. Copyright (C) 2023 Google LLC
+
+===========================================
+netns_ipv4 struct fast path usage breakdown
+===========================================
+
+=============================== ============================================ =================== =================== =================================================
+Type                            Name                                         fastpath_tx_access  fastpath_rx_access  comment
+=============================== ============================================ =================== =================== =================================================
+struct_inet_timewait_death_row  tcp_death_row
+struct_udp_table*               udp_table
+struct_ctl_table_header*        forw_hdr
+struct_ctl_table_header*        frags_hdr
+struct_ctl_table_header*        ipv4_hdr
+struct_ctl_table_header*        route_hdr
+struct_ctl_table_header*        xfrm4_hdr
+struct_ipv4_devconf*            devconf_all
+struct_ipv4_devconf*            devconf_dflt
+struct_ip_ra_chain              ra_chain
+struct_mutex                    ra_mutex
+struct_fib_rules_ops*           rules_ops
+struct_fib_table                fib_main
+struct_fib_table                fib_default
+unsigned_int                    fib_rules_require_fldissect
+bool                            fib_has_custom_rules
+bool                            fib_has_custom_local_routes
+bool                            fib_offload_disabled
+atomic_t                        fib_num_tclassid_users
+struct_hlist_head*              fib_table_hash
+struct_sock*                    fibnl
+struct_sock*                    mc_autojoin_sk
+struct_inet_peer_base*          peers
+struct_fqdir*                   fqdir
+u8                              sysctl_icmp_echo_ignore_all
+u8                              sysctl_icmp_echo_enable_probe
+u8                              sysctl_icmp_echo_ignore_broadcasts
+u8                              sysctl_icmp_ignore_bogus_error_responses
+u8                              sysctl_icmp_errors_use_inbound_ifaddr
+int                             sysctl_icmp_ratelimit
+int                             sysctl_icmp_ratemask
+u32                             ip_rt_min_pmtu
+int                             ip_rt_mtu_expires
+int                             ip_rt_min_advmss
+struct_local_ports              ip_local_ports
+u8                              sysctl_tcp_ecn
+u8                              sysctl_tcp_ecn_fallback
+u8                              sysctl_ip_default_ttl                                                                ip4_dst_hoplimit/ip_select_ttl
+u8                              sysctl_ip_no_pmtu_disc
+u8                              sysctl_ip_fwd_use_pmtu                       read_mostly                             ip_dst_mtu_maybe_forward/ip_skb_dst_mtu
+u8                              sysctl_ip_fwd_update_priority                                                        ip_forward
+u8                              sysctl_ip_nonlocal_bind
+u8                              sysctl_ip_autobind_reuse
+u8                              sysctl_ip_dynaddr
+u8                              sysctl_ip_early_demux                                            read_mostly         ip(6)_rcv_finish_core
+u8                              sysctl_raw_l3mdev_accept
+u8                              sysctl_tcp_early_demux                                           read_mostly         ip(6)_rcv_finish_core
+u8                              sysctl_udp_early_demux
+u8                              sysctl_nexthop_compat_mode
+u8                              sysctl_fwmark_reflect
+u8                              sysctl_tcp_fwmark_accept
+u8                              sysctl_tcp_l3mdev_accept                                         read_mostly         __inet6_lookup_established/inet_request_bound_dev_if
+u8                              sysctl_tcp_mtu_probing
+int                             sysctl_tcp_mtu_probe_floor
+int                             sysctl_tcp_base_mss
+int                             sysctl_tcp_min_snd_mss                       read_mostly                             __tcp_mtu_to_mss(tcp_write_xmit)
+int                             sysctl_tcp_probe_threshold                                                           tcp_mtu_probe(tcp_write_xmit)
+u32                             sysctl_tcp_probe_interval                                                            tcp_mtu_check_reprobe(tcp_write_xmit)
+int                             sysctl_tcp_keepalive_time
+int                             sysctl_tcp_keepalive_intvl
+u8                              sysctl_tcp_keepalive_probes
+u8                              sysctl_tcp_syn_retries
+u8                              sysctl_tcp_synack_retries
+u8                              sysctl_tcp_syncookies                                                                generated_on_syn
+u8                              sysctl_tcp_migrate_req                                                               reuseport
+u8                              sysctl_tcp_comp_sack_nr                                                              __tcp_ack_snd_check
+int                             sysctl_tcp_reordering                                            read_mostly         tcp_may_raise_cwnd/tcp_cong_control
+u8                              sysctl_tcp_retries1
+u8                              sysctl_tcp_retries2
+u8                              sysctl_tcp_orphan_retries
+u8                              sysctl_tcp_tw_reuse                                                                  timewait_sock_ops
+unsigned_int                    sysctl_tcp_tw_reuse_delay                                                            timewait_sock_ops
+int                             sysctl_tcp_fin_timeout                                                               TCP_LAST_ACK/tcp_rcv_state_process
+unsigned_int                    sysctl_tcp_notsent_lowat                     read_mostly                             tcp_notsent_lowat/tcp_stream_memory_free
+u8                              sysctl_tcp_sack                                                                      tcp_syn_options
+u8                              sysctl_tcp_window_scaling                                                            tcp_syn_options,tcp_parse_options
+u8                              sysctl_tcp_timestamps
+u8                              sysctl_tcp_early_retrans                     read_mostly                             tcp_schedule_loss_probe(tcp_write_xmit)
+u32                             sysctl_tcp_rto_max_ms
+u8                              sysctl_tcp_recovery                                                                  tcp_fastretrans_alert
+u8                              sysctl_tcp_thin_linear_timeouts                                                      tcp_retrans_timer(on_thin_streams)
+u8                              sysctl_tcp_slow_start_after_idle                                                     unlikely(tcp_cwnd_validate-network-not-starved)
+u8                              sysctl_tcp_retrans_collapse
+u8                              sysctl_tcp_stdurg                                                                    unlikely(tcp_check_urg)
+u8                              sysctl_tcp_rfc1337
+u8                              sysctl_tcp_abort_on_overflow
+u8                              sysctl_tcp_fack
+int                             sysctl_tcp_max_reordering                                                            tcp_check_sack_reordering
+int                             sysctl_tcp_adv_win_scale                                                             tcp_init_buffer_space
+u8                              sysctl_tcp_dsack                                                                     partial_packet_or_retrans_in_tcp_data_queue
+u8                              sysctl_tcp_app_win                                                                   tcp_win_from_space
+u8                              sysctl_tcp_frto                                                                      tcp_enter_loss
+u8                              sysctl_tcp_nometrics_save                                                            TCP_LAST_ACK/tcp_update_metrics
+u8                              sysctl_tcp_no_ssthresh_metrics_save                                                  TCP_LAST_ACK/tcp_(update/init)_metrics
+u8                              sysctl_tcp_moderate_rcvbuf                   read_mostly         read_mostly         tcp_tso_should_defer(tx);tcp_rcv_space_adjust(rx)
+u8                              sysctl_tcp_tso_win_divisor                   read_mostly                             tcp_tso_should_defer(tcp_write_xmit)
+u8                              sysctl_tcp_workaround_signed_windows                                                 tcp_select_window
+int                             sysctl_tcp_limit_output_bytes                read_mostly                             tcp_small_queue_check(tcp_write_xmit)
+int                             sysctl_tcp_challenge_ack_limit
+int                             sysctl_tcp_min_rtt_wlen                      read_mostly                             tcp_ack_update_rtt
+u8                              sysctl_tcp_min_tso_segs                                                              unlikely(icsk_ca_ops-written)
+u8                              sysctl_tcp_tso_rtt_log                       read_mostly                             tcp_tso_autosize
+u8                              sysctl_tcp_autocorking                       read_mostly                             tcp_push/tcp_should_autocork
+u8                              sysctl_tcp_reflect_tos                                                               tcp_v(4/6)_send_synack
+int                             sysctl_tcp_invalid_ratelimit
+int                             sysctl_tcp_pacing_ss_ratio                                                           default_cong_cont(tcp_update_pacing_rate)
+int                             sysctl_tcp_pacing_ca_ratio                                                           default_cong_cont(tcp_update_pacing_rate)
+int                             sysctl_tcp_wmem[3]                           read_mostly                             tcp_wmem_schedule(sendmsg/sendpage)
+int                             sysctl_tcp_rmem[3]                                               read_mostly         __tcp_grow_window(tx),tcp_rcv_space_adjust(rx)
+unsigned_int                    sysctl_tcp_child_ehash_entries
+unsigned_long                   sysctl_tcp_comp_sack_delay_ns                                                        __tcp_ack_snd_check
+unsigned_long                   sysctl_tcp_comp_sack_slack_ns                                                        __tcp_ack_snd_check
+int                             sysctl_max_syn_backlog
+int                             sysctl_tcp_fastopen
+struct_tcp_congestion_ops       tcp_congestion_control                                                               init_cc
+struct_tcp_fastopen_context     tcp_fastopen_ctx
+unsigned_int                    sysctl_tcp_fastopen_blackhole_timeout
+atomic_t                        tfo_active_disable_times
+unsigned_long                   tfo_active_disable_stamp
+u32                             tcp_challenge_timestamp
+u32                             tcp_challenge_count
+u8                              sysctl_tcp_plb_enabled
+u8                              sysctl_tcp_plb_idle_rehash_rounds
+u8                              sysctl_tcp_plb_rehash_rounds
+u8                              sysctl_tcp_plb_suspend_rto_sec
+int                             sysctl_tcp_plb_cong_thresh
+int                             sysctl_udp_wmem_min
+int                             sysctl_udp_rmem_min
+u8                              sysctl_fib_notify_on_flag_change
+u8                              sysctl_udp_l3mdev_accept
+u8                              sysctl_igmp_llm_reports
+int                             sysctl_igmp_max_memberships
+int                             sysctl_igmp_max_msf
+int                             sysctl_igmp_qrv
+struct_ping_group_range         ping_group_range
+atomic_t                        dev_addr_genid
+unsigned_int                    sysctl_udp_child_hash_entries
+unsigned_long*                  sysctl_local_reserved_ports
+int                             sysctl_ip_prot_sock
+struct_mr_table*                mrt
+struct_list_head                mr_tables
+struct_fib_rules_ops*           mr_rules_ops
+u32                             sysctl_fib_multipath_hash_fields
+u8                              sysctl_fib_multipath_use_neigh
+u8                              sysctl_fib_multipath_hash_policy
+struct_fib_notifier_ops*        notifier_ops
+unsigned_int                    fib_seq
+struct_fib_notifier_ops*        ipmr_notifier_ops
+unsigned_int                    ipmr_seq
+atomic_t                        rt_genid
+siphash_key_t                   ip_id_key
+=============================== ============================================ =================== =================== =================================================
diff --git a/Documentation/networking/net_cachelines/snmp.rst b/Documentation/networking/net_cachelines/snmp.rst
new file mode 100644
index 000000000000..bd44b3eebbef
--- /dev/null
+++ b/Documentation/networking/net_cachelines/snmp.rst
@@ -0,0 +1,140 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. Copyright (C) 2023 Google LLC
+
+===========================================
+netns_ipv4 enum fast path usage breakdown
+===========================================
+
+============== ===================================== =================== =================== ==================================================
+Type           Name                                  fastpath_tx_access  fastpath_rx_access  comment
+============== ===================================== =================== =================== ==================================================
+unsigned_long  LINUX_MIB_TCPKEEPALIVE                write_mostly                            tcp_keepalive_timer
+unsigned_long  LINUX_MIB_DELAYEDACKS                 write_mostly                            tcp_delack_timer_handler,tcp_delack_timer
+unsigned_long  LINUX_MIB_DELAYEDACKLOCKED            write_mostly                            tcp_delack_timer_handler,tcp_delack_timer
+unsigned_long  LINUX_MIB_TCPAUTOCORKING              write_mostly                            tcp_push,tcp_sendmsg_locked
+unsigned_long  LINUX_MIB_TCPFROMZEROWINDOWADV        write_mostly                            tcp_select_window,tcp_transmit-skb
+unsigned_long  LINUX_MIB_TCPTOZEROWINDOWADV          write_mostly                            tcp_select_window,tcp_transmit-skb
+unsigned_long  LINUX_MIB_TCPWANTZEROWINDOWADV        write_mostly                            tcp_select_window,tcp_transmit-skb
+unsigned_long  LINUX_MIB_TCPORIGDATASENT             write_mostly                            tcp_write_xmit
+unsigned_long  LINUX_MIB_TCPHPHITS                                       write_mostly        tcp_rcv_established,tcp_v4_do_rcv,tcp_v6_do_rcv
+unsigned_long  LINUX_MIB_TCPRCVCOALESCE                                  write_mostly        tcp_try_coalesce,tcp_queue_rcv,tcp_rcv_established
+unsigned_long  LINUX_MIB_TCPPUREACKS                                     write_mostly        tcp_ack,tcp_rcv_established
+unsigned_long  LINUX_MIB_TCPHPACKS                                       write_mostly        tcp_ack,tcp_rcv_established
+unsigned_long  LINUX_MIB_TCPDELIVERED                                    write_mostly        tcp_newly_delivered,tcp_ack,tcp_rcv_established
+unsigned_long  LINUX_MIB_SYNCOOKIESSENT
+unsigned_long  LINUX_MIB_SYNCOOKIESRECV
+unsigned_long  LINUX_MIB_SYNCOOKIESFAILED
+unsigned_long  LINUX_MIB_EMBRYONICRSTS
+unsigned_long  LINUX_MIB_PRUNECALLED
+unsigned_long  LINUX_MIB_RCVPRUNED
+unsigned_long  LINUX_MIB_OFOPRUNED
+unsigned_long  LINUX_MIB_OUTOFWINDOWICMPS
+unsigned_long  LINUX_MIB_LOCKDROPPEDICMPS
+unsigned_long  LINUX_MIB_ARPFILTER
+unsigned_long  LINUX_MIB_TIMEWAITED
+unsigned_long  LINUX_MIB_TIMEWAITRECYCLED
+unsigned_long  LINUX_MIB_TIMEWAITKILLED
+unsigned_long  LINUX_MIB_PAWSACTIVEREJECTED
+unsigned_long  LINUX_MIB_PAWSESTABREJECTED
+unsigned_long  LINUX_MIB_TSECR_REJECTED
+unsigned_long  LINUX_MIB_PAWS_OLD_ACK
+unsigned_long  LINUX_MIB_PAWS_TW_REJECTED
+unsigned_long  LINUX_MIB_DELAYEDACKLOST
+unsigned_long  LINUX_MIB_LISTENOVERFLOWS
+unsigned_long  LINUX_MIB_LISTENDROPS
+unsigned_long  LINUX_MIB_TCPRENORECOVERY
+unsigned_long  LINUX_MIB_TCPSACKRECOVERY
+unsigned_long  LINUX_MIB_TCPSACKRENEGING
+unsigned_long  LINUX_MIB_TCPSACKREORDER
+unsigned_long  LINUX_MIB_TCPRENOREORDER
+unsigned_long  LINUX_MIB_TCPTSREORDER
+unsigned_long  LINUX_MIB_TCPFULLUNDO
+unsigned_long  LINUX_MIB_TCPPARTIALUNDO
+unsigned_long  LINUX_MIB_TCPDSACKUNDO
+unsigned_long  LINUX_MIB_TCPLOSSUNDO
+unsigned_long  LINUX_MIB_TCPLOSTRETRANSMIT
+unsigned_long  LINUX_MIB_TCPRENOFAILURES
+unsigned_long  LINUX_MIB_TCPSACKFAILURES
+unsigned_long  LINUX_MIB_TCPLOSSFAILURES
+unsigned_long  LINUX_MIB_TCPFASTRETRANS
+unsigned_long  LINUX_MIB_TCPSLOWSTARTRETRANS
+unsigned_long  LINUX_MIB_TCPTIMEOUTS
+unsigned_long  LINUX_MIB_TCPLOSSPROBES
+unsigned_long  LINUX_MIB_TCPLOSSPROBERECOVERY
+unsigned_long  LINUX_MIB_TCPRENORECOVERYFAIL
+unsigned_long  LINUX_MIB_TCPSACKRECOVERYFAIL
+unsigned_long  LINUX_MIB_TCPRCVCOLLAPSED
+unsigned_long  LINUX_MIB_TCPDSACKOLDSENT
+unsigned_long  LINUX_MIB_TCPDSACKOFOSENT
+unsigned_long  LINUX_MIB_TCPDSACKRECV
+unsigned_long  LINUX_MIB_TCPDSACKOFORECV
+unsigned_long  LINUX_MIB_TCPABORTONDATA
+unsigned_long  LINUX_MIB_TCPABORTONCLOSE
+unsigned_long  LINUX_MIB_TCPABORTONMEMORY
+unsigned_long  LINUX_MIB_TCPABORTONTIMEOUT
+unsigned_long  LINUX_MIB_TCPABORTONLINGER
+unsigned_long  LINUX_MIB_TCPABORTFAILED
+unsigned_long  LINUX_MIB_TCPMEMORYPRESSURES
+unsigned_long  LINUX_MIB_TCPMEMORYPRESSURESCHRONO
+unsigned_long  LINUX_MIB_TCPSACKDISCARD
+unsigned_long  LINUX_MIB_TCPDSACKIGNOREDOLD
+unsigned_long  LINUX_MIB_TCPDSACKIGNOREDNOUNDO
+unsigned_long  LINUX_MIB_TCPSPURIOUSRTOS
+unsigned_long  LINUX_MIB_TCPMD5NOTFOUND
+unsigned_long  LINUX_MIB_TCPMD5UNEXPECTED
+unsigned_long  LINUX_MIB_TCPMD5FAILURE
+unsigned_long  LINUX_MIB_SACKSHIFTED
+unsigned_long  LINUX_MIB_SACKMERGED
+unsigned_long  LINUX_MIB_SACKSHIFTFALLBACK
+unsigned_long  LINUX_MIB_TCPBACKLOGDROP
+unsigned_long  LINUX_MIB_PFMEMALLOCDROP
+unsigned_long  LINUX_MIB_TCPMINTTLDROP
+unsigned_long  LINUX_MIB_TCPDEFERACCEPTDROP
+unsigned_long  LINUX_MIB_IPRPFILTER
+unsigned_long  LINUX_MIB_TCPTIMEWAITOVERFLOW
+unsigned_long  LINUX_MIB_TCPREQQFULLDOCOOKIES
+unsigned_long  LINUX_MIB_TCPREQQFULLDROP
+unsigned_long  LINUX_MIB_TCPRETRANSFAIL
+unsigned_long  LINUX_MIB_TCPBACKLOGCOALESCE
+unsigned_long  LINUX_MIB_TCPOFOQUEUE
+unsigned_long  LINUX_MIB_TCPOFODROP
+unsigned_long  LINUX_MIB_TCPOFOMERGE
+unsigned_long  LINUX_MIB_TCPCHALLENGEACK
+unsigned_long  LINUX_MIB_TCPSYNCHALLENGE
+unsigned_long  LINUX_MIB_TCPFASTOPENACTIVE
+unsigned_long  LINUX_MIB_TCPFASTOPENACTIVEFAIL
+unsigned_long  LINUX_MIB_TCPFASTOPENPASSIVE
+unsigned_long  LINUX_MIB_TCPFASTOPENPASSIVEFAIL
+unsigned_long  LINUX_MIB_TCPFASTOPENLISTENOVERFLOW
+unsigned_long  LINUX_MIB_TCPFASTOPENCOOKIEREQD
+unsigned_long  LINUX_MIB_TCPFASTOPENBLACKHOLE
+unsigned_long  LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES
+unsigned_long  LINUX_MIB_BUSYPOLLRXPACKETS
+unsigned_long  LINUX_MIB_TCPSYNRETRANS
+unsigned_long  LINUX_MIB_TCPHYSTARTTRAINDETECT
+unsigned_long  LINUX_MIB_TCPHYSTARTTRAINCWND
+unsigned_long  LINUX_MIB_TCPHYSTARTDELAYDETECT
+unsigned_long  LINUX_MIB_TCPHYSTARTDELAYCWND
+unsigned_long  LINUX_MIB_TCPACKSKIPPEDSYNRECV
+unsigned_long  LINUX_MIB_TCPACKSKIPPEDPAWS
+unsigned_long  LINUX_MIB_TCPACKSKIPPEDSEQ
+unsigned_long  LINUX_MIB_TCPACKSKIPPEDFINWAIT2
+unsigned_long  LINUX_MIB_TCPACKSKIPPEDTIMEWAIT
+unsigned_long  LINUX_MIB_TCPACKSKIPPEDCHALLENGE
+unsigned_long  LINUX_MIB_TCPWINPROBE
+unsigned_long  LINUX_MIB_TCPMTUPFAIL
+unsigned_long  LINUX_MIB_TCPMTUPSUCCESS
+unsigned_long  LINUX_MIB_TCPDELIVEREDCE
+unsigned_long  LINUX_MIB_TCPACKCOMPRESSED
+unsigned_long  LINUX_MIB_TCPZEROWINDOWDROP
+unsigned_long  LINUX_MIB_TCPRCVQDROP
+unsigned_long  LINUX_MIB_TCPWQUEUETOOBIG
+unsigned_long  LINUX_MIB_TCPFASTOPENPASSIVEALTKEY
+unsigned_long  LINUX_MIB_TCPTIMEOUTREHASH
+unsigned_long  LINUX_MIB_TCPDUPLICATEDATAREHASH
+unsigned_long  LINUX_MIB_TCPDSACKRECVSEGS
+unsigned_long  LINUX_MIB_TCPDSACKIGNOREDDUBIOUS
+unsigned_long  LINUX_MIB_TCPMIGRATEREQSUCCESS
+unsigned_long  LINUX_MIB_TCPMIGRATEREQFAILURE
+unsigned_long  __LINUX_MIB_MAX
+============== ===================================== =================== =================== ==================================================
diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst
new file mode 100644
index 000000000000..bc9b2131bf7a
--- /dev/null
+++ b/Documentation/networking/net_cachelines/tcp_sock.rst
@@ -0,0 +1,160 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. Copyright (C) 2023 Google LLC
+
+=========================================
+tcp_sock struct fast path usage breakdown
+=========================================
+
+============================= ======================= =================== =================== ==================================================================================================================================================================================================================
+Type                          Name                    fastpath_tx_access  fastpath_rx_access  Comments
+============================= ======================= =================== =================== ==================================================================================================================================================================================================================
+struct inet_connection_sock   inet_conn
+u16                           tcp_header_len          read_mostly         read_mostly         tcp_bound_to_half_wnd,tcp_current_mss(tx);tcp_rcv_established(rx)
+u16                           gso_segs                read_mostly                             tcp_xmit_size_goal
+__be32                        pred_flags              read_write          read_mostly         tcp_select_window(tx);tcp_rcv_established(rx)
+u64                           bytes_received                              read_write          tcp_rcv_nxt_update(rx)
+u32                           segs_in                                     read_write          tcp_v6_rcv(rx)
+u32                           data_segs_in                                read_write          tcp_v6_rcv(rx)
+u32                           rcv_nxt                 read_mostly         read_write          tcp_cleanup_rbuf,tcp_send_ack,tcp_inq_hint,tcp_transmit_skb,tcp_receive_window(tx);tcp_v6_do_rcv,tcp_rcv_established,tcp_data_queue,tcp_receive_window,tcp_rcv_nxt_update(write)(rx)
+u32                           copied_seq                                  read_mostly         tcp_cleanup_rbuf,tcp_rcv_space_adjust,tcp_inq_hint
+u32                           rcv_wup                                     read_write          __tcp_cleanup_rbuf,tcp_receive_window,tcp_receive_established
+u32                           snd_nxt                 read_write          read_mostly         tcp_rate_check_app_limited,__tcp_transmit_skb,tcp_event_new_data_sent(write)(tx);tcp_rcv_established,tcp_ack,tcp_clean_rtx_queue(rx)
+u32                           segs_out                read_write                              __tcp_transmit_skb
+u32                           data_segs_out           read_write                              __tcp_transmit_skb,tcp_update_skb_after_send
+u64                           bytes_sent              read_write                              __tcp_transmit_skb
+u64                           bytes_acked                                 read_write          tcp_snd_una_update/tcp_ack
+u32                           dsack_dups
+u32                           snd_una                 read_mostly         read_write          tcp_wnd_end,tcp_urg_mode,tcp_minshall_check,tcp_cwnd_validate(tx);tcp_ack,tcp_may_update_window,tcp_clean_rtx_queue(write),tcp_ack_tstamp(rx)
+u32                           snd_sml                 read_write                              tcp_minshall_check,tcp_minshall_update
+u32                           rcv_tstamp                                  read_mostly         tcp_ack
+void *                        tcp_clean_acked                             read_mostly         tcp_ack
+u32                           lsndtime                read_write                              tcp_slow_start_after_idle_check,tcp_event_data_sent
+u32                           last_oow_ack_time
+u32                           compressed_ack_rcv_nxt
+u32                           tsoffset                read_mostly         read_mostly         tcp_established_options(tx);tcp_fast_parse_options(rx)
+struct list_head              tsq_node
+struct list_head              tsorted_sent_queue      read_write                              tcp_update_skb_after_send
+u32                           snd_wl1                                     read_mostly         tcp_may_update_window
+u32                           snd_wnd                 read_mostly         read_mostly         tcp_wnd_end,tcp_tso_should_defer(tx);tcp_fast_path_on(rx)
+u32                           max_window              read_mostly                             tcp_bound_to_half_wnd,forced_push
+u32                           mss_cache               read_mostly         read_mostly         tcp_rate_check_app_limited,tcp_current_mss,tcp_sync_mss,tcp_sndbuf_expand,tcp_tso_should_defer(tx);tcp_update_pacing_rate,tcp_clean_rtx_queue(rx)
+u32                           window_clamp            read_mostly         read_write          tcp_rcv_space_adjust,__tcp_select_window
+u32                           rcv_ssthresh            read_mostly                             __tcp_select_window
+u8                            scaling_ratio           read_mostly         read_mostly         tcp_win_from_space
+struct                        tcp_rack
+u16                           advmss                                      read_mostly         tcp_rcv_space_adjust
+u8                            compressed_ack
+u8:2                          dup_ack_counter
+u8:1                          tlp_retrans
+u8:1                          tcp_usec_ts             read_mostly         read_mostly
+u32                           chrono_start            read_write                              tcp_chrono_start/stop(tcp_write_xmit,tcp_cwnd_validate,tcp_send_syn_data)
+u32[3]                        chrono_stat             read_write                              tcp_chrono_start/stop(tcp_write_xmit,tcp_cwnd_validate,tcp_send_syn_data)
+u8:2                          chrono_type             read_write                              tcp_chrono_start/stop(tcp_write_xmit,tcp_cwnd_validate,tcp_send_syn_data)
+u8:1                          rate_app_limited                            read_write          tcp_rate_gen
+u8:1                          fastopen_connect
+u8:1                          fastopen_no_cookie
+u8:1                          is_sack_reneg                               read_mostly         tcp_skb_entail,tcp_ack
+u8:2                          fastopen_client_fail
+u8:4                          nonagle                 read_write                              tcp_skb_entail,tcp_push_pending_frames
+u8:1                          thin_lto
+u8:1                          recvmsg_inq
+u8:1                          repair                  read_mostly                             tcp_write_xmit
+u8:1                          frto
+u8                            repair_queue
+u8:2                          save_syn
+u8:1                          syn_data
+u8:1                          syn_fastopen
+u8:1                          syn_fastopen_exp
+u8:1                          syn_fastopen_ch
+u8:1                          syn_data_acked
+u8:1                          is_cwnd_limited         read_mostly                             tcp_cwnd_validate,tcp_is_cwnd_limited
+u32                           tlp_high_seq                                read_mostly         tcp_ack
+u32                           tcp_tx_delay
+u64                           tcp_wstamp_ns           read_write                              tcp_pacing_check,tcp_tso_should_defer,tcp_update_skb_after_send
+u64                           tcp_clock_cache         read_write          read_write          tcp_mstamp_refresh(tcp_write_xmit/tcp_rcv_space_adjust),__tcp_transmit_skb,tcp_tso_should_defer;timer
+u64                           tcp_mstamp              read_write          read_write          tcp_mstamp_refresh(tcp_write_xmit/tcp_rcv_space_adjust)(tx);tcp_rcv_space_adjust,tcp_rate_gen,tcp_clean_rtx_queue,tcp_ack_update_rtt/tcp_time_stamp(rx);timer
+u32                           srtt_us                 read_mostly         read_write          tcp_tso_should_defer(tx);tcp_update_pacing_rate,__tcp_set_rto,tcp_rtt_estimator(rx)
+u32                           mdev_us                 read_write                              tcp_rtt_estimator
+u32                           mdev_max_us
+u32                           rttvar_us                                   read_mostly         __tcp_set_rto
+u32                           rtt_seq                 read_write                              tcp_rtt_estimator
+struct minmax                 rtt_min                                     read_mostly         tcp_min_rtt/tcp_rate_gen,tcp_min_rtttcp_update_rtt_min
+u32                           packets_out             read_write          read_write          tcp_packets_in_flight(tx/rx);tcp_slow_start_after_idle_check,tcp_nagle_check,tcp_rate_skb_sent,tcp_event_new_data_sent,tcp_cwnd_validate,tcp_write_xmit(tx);tcp_ack,tcp_clean_rtx_queue,tcp_update_pacing_rate(rx)
+u32                           retrans_out                                 read_mostly         tcp_packets_in_flight,tcp_rate_check_app_limited
+u32                           max_packets_out                             read_write          tcp_cwnd_validate
+u32                           cwnd_usage_seq                              read_write          tcp_cwnd_validate
+u16                           urg_data                                    read_mostly         tcp_fast_path_check
+u8                            ecn_flags               read_write                              tcp_ecn_send
+u8                            keepalive_probes
+u32                           reordering              read_mostly                             tcp_sndbuf_expand
+u32                           reord_seen
+u32                           snd_up                  read_write          read_mostly         tcp_mark_urg,tcp_urg_mode,__tcp_transmit_skb(tx);tcp_clean_rtx_queue(rx)
+struct tcp_options_received   rx_opt                  read_mostly         read_write          tcp_established_options(tx);tcp_fast_path_on,tcp_ack_update_window,tcp_is_sack,tcp_data_queue,tcp_rcv_established,tcp_ack_update_rtt(rx)
+u32                           snd_ssthresh                                read_mostly         tcp_update_pacing_rate
+u32                           snd_cwnd                read_mostly         read_mostly         tcp_snd_cwnd,tcp_rate_check_app_limited,tcp_tso_should_defer(tx);tcp_update_pacing_rate
+u32                           snd_cwnd_cnt
+u32                           snd_cwnd_clamp
+u32                           snd_cwnd_used
+u32                           snd_cwnd_stamp
+u32                           prior_cwnd
+u32                           prr_delivered
+u32                           prr_out                 read_mostly         read_mostly         tcp_rate_skb_sent,tcp_newly_delivered(tx);tcp_ack,tcp_rate_gen,tcp_clean_rtx_queue(rx)
+u32                           delivered               read_mostly         read_write          tcp_rate_skb_sent, tcp_newly_delivered(tx);tcp_ack, tcp_rate_gen, tcp_clean_rtx_queue (rx)
+u32                           delivered_ce            read_mostly         read_write          tcp_rate_skb_sent(tx);tcp_rate_gen(rx)
+u32                           lost                                        read_mostly         tcp_ack
+u32                           app_limited             read_write          read_mostly         tcp_rate_check_app_limited,tcp_rate_skb_sent(tx);tcp_rate_gen(rx)
+u64                           first_tx_mstamp         read_write                              tcp_rate_skb_sent
+u64                           delivered_mstamp        read_write                              tcp_rate_skb_sent
+u32                           rate_delivered                              read_mostly         tcp_rate_gen
+u32                           rate_interval_us                            read_mostly         rate_delivered,rate_app_limited
+u32                           rcv_wnd                 read_write          read_mostly         tcp_select_window,tcp_receive_window,tcp_fast_path_check
+u32                           write_seq               read_write                              tcp_rate_check_app_limited,tcp_write_queue_empty,tcp_skb_entail,forced_push,tcp_mark_push
+u32                           notsent_lowat           read_mostly                             tcp_stream_memory_free
+u32                           pushed_seq              read_write                              tcp_mark_push,forced_push
+u32                           lost_out                read_mostly         read_mostly         tcp_left_out(tx);tcp_packets_in_flight(tx/rx);tcp_rate_check_app_limited(rx)
+u32                           sacked_out              read_mostly         read_mostly         tcp_left_out(tx);tcp_packets_in_flight(tx/rx);tcp_clean_rtx_queue(rx)
+struct hrtimer                pacing_timer
+struct hrtimer                compressed_ack_timer
+struct sk_buff*               lost_skb_hint           read_mostly                             tcp_clean_rtx_queue
+struct sk_buff*               retransmit_skb_hint     read_mostly                             tcp_clean_rtx_queue
+struct rb_root                out_of_order_queue                          read_mostly         tcp_data_queue,tcp_fast_path_check
+struct sk_buff*               ooo_last_skb
+struct tcp_sack_block[1]      duplicate_sack
+struct tcp_sack_block[4]      selective_acks
+struct tcp_sack_block[4]      recv_sack_cache
+struct sk_buff*               highest_sack            read_write                              tcp_event_new_data_sent
+int                           lost_cnt_hint
+u32                           prior_ssthresh
+u32                           high_seq
+u32                           retrans_stamp
+u32                           undo_marker
+int                           undo_retrans
+u64                           bytes_retrans
+u32                           total_retrans
+u32                           rto_stamp
+u16                           total_rto
+u16                           total_rto_recoveries
+u32                           total_rto_time
+u32                           urg_seq
+unsigned_int                  keepalive_time
+unsigned_int                  keepalive_intvl
+int                           linger2
+u8                            bpf_sock_ops_cb_flags
+u8:1                          bpf_chg_cc_inprogress
+u16                           timeout_rehash
+u32                           rcv_ooopack
+u32                           rcv_rtt_last_tsecr
+struct                        rcv_rtt_est                                 read_write          tcp_rcv_space_adjust,tcp_rcv_established
+struct                        rcvq_space                                  read_write          tcp_rcv_space_adjust
+struct                        mtu_probe
+u32                           plb_rehash
+u32                           mtu_info
+bool                          is_mptcp
+bool                          smc_hs_congested
+bool                          syn_smc
+struct tcp_sock_af_ops*       af_specific
+struct tcp_md5sig_info*       md5sig_info
+struct tcp_fastopen_request*  fastopen_req
+struct request_sock*          fastopen_rsk
+struct saved_syn*             saved_syn
+============================= ======================= =================== =================== ==================================================================================================================================================================================================================
diff --git a/Documentation/networking/net_dim.rst b/Documentation/networking/net_dim.rst
index 3bed9fd95336..4377998e6826 100644
--- a/Documentation/networking/net_dim.rst
+++ b/Documentation/networking/net_dim.rst
@@ -156,7 +156,7 @@ usage is not complete but it should make the outline of the usage clear.
 		          my_entity->bytes,
 		          &dim_sample);
 	/* Call net DIM */
-	net_dim(&my_entity->dim, dim_sample);
+	net_dim(&my_entity->dim, &dim_sample);
 	...
   }
 
@@ -169,6 +169,48 @@ usage is not complete but it should make the outline of the usage clear.
 	...
   }
 
+
+Tuning DIM
+==========
+
+Net DIM serves a range of network devices and delivers excellent acceleration
+benefits. Yet, it has been observed that some preset configurations of DIM may
+not align seamlessly with the varying specifications of network devices, and
+this discrepancy has been identified as a factor to the suboptimal performance
+outcomes of DIM-enabled network devices, related to a mismatch in profiles.
+
+To address this issue, Net DIM introduces a per-device control to modify and
+access a device's ``rx-profile`` and ``tx-profile`` parameters:
+Assume that the target network device is named ethx, and ethx only declares
+support for RX profile setting and supports modification of ``usec`` field
+and ``pkts`` field (See the data structure:
+:c:type:`struct dim_cq_moder <dim_cq_moder>`).
+
+You can use ethtool to modify the current RX DIM profile where all
+values are 64::
+
+    $ ethtool -C ethx rx-profile 1,1,n_2,2,n_3,n,n_n,4,n_n,n,n
+
+``n`` means do not modify this field, and ``_`` separates structure
+elements of the profile array.
+
+Querying the current profiles using::
+
+    $ ethtool -c ethx
+    ...
+    rx-profile:
+    {.usec =   1, .pkts =   1, .comps = n/a,},
+    {.usec =   2, .pkts =   2, .comps = n/a,},
+    {.usec =   3, .pkts =  64, .comps = n/a,},
+    {.usec =  64, .pkts =   4, .comps = n/a,},
+    {.usec =  64, .pkts =  64, .comps = n/a,}
+    tx-profile:   n/a
+
+If the network device does not support specific fields of DIM profiles,
+the corresponding ``n/a`` will display. If the ``n/a`` field is being
+modified, error messages will be reported.
+
+
 Dynamic Interrupt Moderation (DIM) library API
 ==============================================
 
diff --git a/Documentation/networking/net_failover.rst b/Documentation/networking/net_failover.rst
index e143ab79a960..f4e1b4e07adc 100644
--- a/Documentation/networking/net_failover.rst
+++ b/Documentation/networking/net_failover.rst
@@ -35,7 +35,7 @@ To support this, the hypervisor needs to enable VIRTIO_NET_F_STANDBY
 feature on the virtio-net interface and assign the same MAC address to both
 virtio-net and VF interfaces.
 
-Here is an example XML snippet that shows such configuration.
+Here is an example libvirt XML snippet that shows such configuration:
 ::
 
   <interface type='network'>
@@ -45,18 +45,32 @@ Here is an example XML snippet that shows such configuration.
     <model type='virtio'/>
     <driver name='vhost' queues='4'/>
     <link state='down'/>
-    <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
+    <teaming type='persistent'/>
+    <alias name='ua-backup0'/>
   </interface>
   <interface type='hostdev' managed='yes'>
     <mac address='52:54:00:00:12:53'/>
     <source>
       <address type='pci' domain='0x0000' bus='0x42' slot='0x02' function='0x5'/>
     </source>
-    <address type='pci' domain='0x0000' bus='0x00' slot='0x0b' function='0x0'/>
+    <teaming type='transient' persistent='ua-backup0'/>
   </interface>
 
+In this configuration, the first device definition is for the virtio-net
+interface and this acts as the 'persistent' device indicating that this
+interface will always be plugged in. This is specified by the 'teaming' tag with
+required attribute type having value 'persistent'. The link state for the
+virtio-net device is set to 'down' to ensure that the 'failover' netdev prefers
+the VF passthrough device for normal communication. The virtio-net device will
+be brought UP during live migration to allow uninterrupted communication.
+
+The second device definition is for the VF passthrough interface. Here the
+'teaming' tag is provided with type 'transient' indicating that this device may
+periodically be unplugged. A second attribute - 'persistent' is provided and
+points to the alias name declared for the virtio-net device.
+
 Booting a VM with the above configuration will result in the following 3
-netdevs created in the VM.
+interfaces created in the VM:
 ::
 
   4: ens10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
@@ -65,13 +79,36 @@ netdevs created in the VM.
          valid_lft 42482sec preferred_lft 42482sec
       inet6 fe80::97d8:db2:8c10:b6d6/64 scope link
          valid_lft forever preferred_lft forever
-  5: ens10nsby: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master ens10 state UP group default qlen 1000
+  5: ens10nsby: <BROADCAST,MULTICAST> mtu 1500 qdisc fq_codel master ens10 state DOWN group default qlen 1000
       link/ether 52:54:00:00:12:53 brd ff:ff:ff:ff:ff:ff
   7: ens11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ens10 state UP group default qlen 1000
       link/ether 52:54:00:00:12:53 brd ff:ff:ff:ff:ff:ff
 
-ens10 is the 'failover' master netdev, ens10nsby and ens11 are the slave
-'standby' and 'primary' netdevs respectively.
+Here, ens10 is the 'failover' master interface, ens10nsby is the slave 'standby'
+virtio-net interface, and ens11 is the slave 'primary' VF passthrough interface.
+
+One point to note here is that some user space network configuration daemons
+like systemd-networkd, ifupdown, etc, do not understand the 'net_failover'
+device; and on the first boot, the VM might end up with both 'failover' device
+and VF acquiring IP addresses (either same or different) from the DHCP server.
+This will result in lack of connectivity to the VM. So some tweaks might be
+needed to these network configuration daemons to make sure that an IP is
+received only on the 'failover' device.
+
+Below is the patch snippet used with 'cloud-ifupdown-helper' script found on
+Debian cloud images:
+
+::
+  @@ -27,6 +27,8 @@ do_setup() {
+       local working="$cfgdir/.$INTERFACE"
+       local final="$cfgdir/$INTERFACE"
+
+  +    if [ -d "/sys/class/net/${INTERFACE}/master" ]; then exit 0; fi
+  +
+       if ifup --no-act "$INTERFACE" > /dev/null 2>&1; then
+           # interface is already known to ifupdown, no need to generate cfg
+           log "Skipping configuration generation for $INTERFACE"
+
 
 Live Migration of a VM with SR-IOV VF & virtio-net in STANDBY mode
 ==================================================================
@@ -80,40 +117,68 @@ net_failover also enables hypervisor controlled live migration to be supported
 with VMs that have direct attached SR-IOV VF devices by automatic failover to
 the paravirtual datapath when the VF is unplugged.
 
-Here is a sample script that shows the steps to initiate live migration on
-the source hypervisor.
+Here is a sample script that shows the steps to initiate live migration from
+the source hypervisor. Note: It is assumed that the VM is connected to a
+software bridge 'br0' which has a single VF attached to it along with the vnet
+device to the VM. This is not the VF that was passthrough'd to the VM (seen in
+the vf.xml file).
 ::
 
-  # cat vf_xml
+  # cat vf.xml
   <interface type='hostdev' managed='yes'>
     <mac address='52:54:00:00:12:53'/>
     <source>
       <address type='pci' domain='0x0000' bus='0x42' slot='0x02' function='0x5'/>
     </source>
-    <address type='pci' domain='0x0000' bus='0x00' slot='0x0b' function='0x0'/>
+    <teaming type='transient' persistent='ua-backup0'/>
   </interface>
 
-  # Source Hypervisor
+  # Source Hypervisor migrate.sh
   #!/bin/bash
 
-  DOMAIN=fedora27-tap01
-  PF=enp66s0f0
-  VF_NUM=5
-  TAP_IF=tap01
-  VF_XML=
+  DOMAIN=vm-01
+  PF=ens6np0
+  VF=ens6v1             # VF attached to the bridge.
+  VF_NUM=1
+  TAP_IF=vmtap01        # virtio-net interface in the VM.
+  VF_XML=vf.xml
 
   MAC=52:54:00:00:12:53
   ZERO_MAC=00:00:00:00:00:00
 
+  # Set the virtio-net interface up.
   virsh domif-setlink $DOMAIN $TAP_IF up
-  bridge fdb del $MAC dev $PF master
-  virsh detach-device $DOMAIN $VF_XML
+
+  # Remove the VF that was passthrough'd to the VM.
+  virsh detach-device --live --config $DOMAIN $VF_XML
+
   ip link set $PF vf $VF_NUM mac $ZERO_MAC
 
-  virsh migrate --live $DOMAIN qemu+ssh://$REMOTE_HOST/system
+  # Add FDB entry for traffic to continue going to the VM via
+  # the VF -> br0 -> vnet interface path.
+  bridge fdb add $MAC dev $VF
+  bridge fdb add $MAC dev $TAP_IF master
+
+  # Migrate the VM
+  virsh migrate --live --persistent $DOMAIN qemu+ssh://$REMOTE_HOST/system
+
+  # Clean up FDB entries after migration completes.
+  bridge fdb del $MAC dev $VF
+  bridge fdb del $MAC dev $TAP_IF master
 
-  # Destination Hypervisor
+On the destination hypervisor, a shared bridge 'br0' is created before migration
+starts, and a VF from the destination PF is added to the bridge. Similarly an
+appropriate FDB entry is added.
+
+The following script is executed on the destination hypervisor once migration
+completes, and it reattaches the VF to the VM and brings down the virtio-net
+interface.
+
+::
+  # reattach-vf.sh
   #!/bin/bash
 
-  virsh attach-device $DOMAIN $VF_XML
-  virsh domif-setlink $DOMAIN $TAP_IF down
+  bridge fdb del 52:54:00:00:12:53 dev ens36v0
+  bridge fdb del 52:54:00:00:12:53 dev vmtap01 master
+  virsh attach-device --config --live vm01 vf.xml
+  virsh domif-setlink vm01 vmtap01 down
diff --git a/Documentation/networking/netconsole.rst b/Documentation/networking/netconsole.rst
index 1f5c4a04027c..a0076b542e9c 100644
--- a/Documentation/networking/netconsole.rst
+++ b/Documentation/networking/netconsole.rst
@@ -13,6 +13,12 @@ IPv6 support by Cong Wang <xiyou.wangcong@gmail.com>, Jan 1 2013
 
 Extended console support by Tejun Heo <tj@kernel.org>, May 1 2015
 
+Release prepend support by Breno Leitao <leitao@debian.org>, Jul 7 2023
+
+Userdata append support by Matthew Wood <thepacketgeek@gmail.com>, Jan 22 2024
+
+Sysdata append support by Breno Leitao <leitao@debian.org>, Jan 15 2025
+
 Please send bug reports to Matt Mackall <mpm@selenic.com>
 Satyam Sharma <satyam.sharma@gmail.com>, and Cong Wang <xiyou.wangcong@gmail.com>
 
@@ -34,13 +40,14 @@ Sender and receiver configuration:
 It takes a string configuration parameter "netconsole" in the
 following format::
 
- netconsole=[+][src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr]
+ netconsole=[+][r][src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr]
 
    where
 	+             if present, enable extended console support
+	r             if present, prepend kernel version (release) to the message
 	src-port      source for UDP packets (defaults to 6665)
 	src-ip        source IP to use (interface address)
-	dev           network interface (eth0)
+	dev           network interface name (eth0) or MAC address
 	tgt-port      port for logging agent (6666)
 	tgt-ip        IP address for logging agent
 	tgt-macaddr   ethernet MAC address for logging agent (broadcast)
@@ -57,6 +64,10 @@ or using IPv6::
 
  insmod netconsole netconsole=@/,@fd00:1:2:3::1/
 
+or using a MAC address to select the egress interface::
+
+   linux netconsole=4444@10.0.0.1/22:33:44:55:66:77,9353@10.0.0.2/12:34:56:78:9a:bc
+
 It also supports logging to multiple remote agents by specifying
 parameters for the multiple agents separated by semicolons and the
 complete string enclosed in "quotes", thusly::
@@ -96,9 +107,6 @@ Dynamic reconfiguration:
 Dynamic reconfigurability is a useful addition to netconsole that enables
 remote logging targets to be dynamically added, removed, or have their
 parameters reconfigured at runtime from a configfs-based userspace interface.
-[ Note that the parameters of netconsole targets that were specified/created
-from the boot/module option are not exposed via this interface, and hence
-cannot be modified dynamically. ]
 
 To include this feature, select CONFIG_NETCONSOLE_DYNAMIC when building the
 netconsole module (or kernel, if netconsole is built-in).
@@ -122,9 +130,10 @@ To remove a target::
 
 The interface exposes these parameters of a netconsole target to userspace:
 
-	==============  =================================       ============
+	=============== =================================       ============
 	enabled		Is this target currently enabled?	(read-write)
 	extended	Extended mode enabled			(read-write)
+	release		Prepend kernel release to message	(read-write)
 	dev_name	Local network interface name		(read-write)
 	local_port	Source UDP port to use			(read-write)
 	remote_port	Remote agent's UDP port			(read-write)
@@ -132,7 +141,8 @@ The interface exposes these parameters of a netconsole target to userspace:
 	remote_ip	Remote agent's IP address		(read-write)
 	local_mac	Local interface's MAC address		(read-only)
 	remote_mac	Remote agent's MAC address		(read-write)
-	==============  =================================       ============
+	transmit_errors	Number of packet send errors		(read-only)
+	=============== =================================       ============
 
 The "enabled" attribute is also used to control whether the parameters of
 a target can be updated or not -- you can modify the parameters of only
@@ -151,6 +161,185 @@ You can also update the local interface dynamically. This is especially
 useful if you want to use interfaces that have newly come up (and may not
 have existed when netconsole was loaded / initialized).
 
+Netconsole targets defined at boot time (or module load time) with the
+`netconsole=` param are assigned the name `cmdline<index>`.  For example, the
+first target in the parameter is named `cmdline0`.  You can control and modify
+these targets by creating configfs directories with the matching name.
+
+Let's suppose you have two netconsole targets defined at boot time::
+
+ netconsole=4444@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc;4444@10.0.0.1/eth1,9353@10.0.0.3/12:34:56:78:9a:bc
+
+You can modify these targets in runtime by creating the following targets::
+
+ mkdir cmdline0
+ cat cmdline0/remote_ip
+ 10.0.0.2
+
+ mkdir cmdline1
+ cat cmdline1/remote_ip
+ 10.0.0.3
+
+Append User Data
+----------------
+
+Custom user data can be appended to the end of messages with netconsole
+dynamic configuration enabled. User data entries can be modified without
+changing the "enabled" attribute of a target.
+
+Directories (keys) under `userdata` are limited to 53 character length, and
+data in `userdata/<key>/value` are limited to 200 bytes::
+
+ cd /sys/kernel/config/netconsole && mkdir cmdline0
+ cd cmdline0
+ mkdir userdata/foo
+ echo bar > userdata/foo/value
+ mkdir userdata/qux
+ echo baz > userdata/qux/value
+
+Messages will now include this additional user data::
+
+ echo "This is a message" > /dev/kmsg
+
+Sends::
+
+ 12,607,22085407756,-;This is a message
+  foo=bar
+  qux=baz
+
+Preview the userdata that will be appended with::
+
+ cd /sys/kernel/config/netconsole/cmdline0/userdata
+ for f in `ls userdata`; do echo $f=$(cat userdata/$f/value); done
+
+If a `userdata` entry is created but no data is written to the `value` file,
+the entry will be omitted from netconsole messages::
+
+ cd /sys/kernel/config/netconsole && mkdir cmdline0
+ cd cmdline0
+ mkdir userdata/foo
+ echo bar > userdata/foo/value
+ mkdir userdata/qux
+
+The `qux` key is omitted since it has no value::
+
+ echo "This is a message" > /dev/kmsg
+ 12,607,22085407756,-;This is a message
+  foo=bar
+
+Delete `userdata` entries with `rmdir`::
+
+ rmdir /sys/kernel/config/netconsole/cmdline0/userdata/qux
+
+.. warning::
+   When writing strings to user data values, input is broken up per line in
+   configfs store calls and this can cause confusing behavior::
+
+     mkdir userdata/testing
+     printf "val1\nval2" > userdata/testing/value
+     # userdata store value is called twice, first with "val1\n" then "val2"
+     # so "val2" is stored, being the last value stored
+     cat userdata/testing/value
+     val2
+
+   It is recommended to not write user data values with newlines.
+
+Task name auto population in userdata
+-------------------------------------
+
+Inside the netconsole configfs hierarchy, there is a file called
+`taskname_enabled` under the `userdata` directory. This file is used to enable
+or disable the automatic task name population feature. This feature
+automatically populates the current task name that is scheduled in the CPU
+sneding the message.
+
+To enable task name auto-population::
+
+  echo 1 > /sys/kernel/config/netconsole/target1/userdata/taskname_enabled
+
+When this option is enabled, the netconsole messages will include an additional
+line in the userdata field with the format `taskname=<task name>`. This allows
+the receiver of the netconsole messages to easily find which application was
+currently scheduled when that message was generated, providing extra context
+for kernel messages and helping to categorize them.
+
+Example::
+
+  echo "This is a message" > /dev/kmsg
+  12,607,22085407756,-;This is a message
+   taskname=echo
+
+In this example, the message was generated while "echo" was the current
+scheduled process.
+
+Kernel release auto population in userdata
+------------------------------------------
+
+Within the netconsole configfs hierarchy, there is a file named `release_enabled`
+located in the `userdata` directory. This file controls the kernel release
+(version) auto-population feature, which appends the kernel release information
+to userdata dictionary in every message sent.
+
+To enable the release auto-population::
+
+  echo 1 > /sys/kernel/config/netconsole/target1/userdata/release_enabled
+
+Example::
+
+  echo "This is a message" > /dev/kmsg
+  12,607,22085407756,-;This is a message
+   release=6.14.0-rc6-01219-g3c027fbd941d
+
+.. note::
+
+   This feature provides the same data as the "release prepend" feature.
+   However, in this case, the release information is appended to the userdata
+   dictionary rather than being included in the message header.
+
+
+CPU number auto population in userdata
+--------------------------------------
+
+Inside the netconsole configfs hierarchy, there is a file called
+`cpu_nr` under the `userdata` directory. This file is used to enable or disable
+the automatic CPU number population feature. This feature automatically
+populates the CPU number that is sending the message.
+
+To enable the CPU number auto-population::
+
+  echo 1 > /sys/kernel/config/netconsole/target1/userdata/cpu_nr
+
+When this option is enabled, the netconsole messages will include an additional
+line in the userdata field with the format `cpu=<cpu_number>`. This allows the
+receiver of the netconsole messages to easily differentiate and demultiplex
+messages originating from different CPUs, which is particularly useful when
+dealing with parallel log output.
+
+Example::
+
+  echo "This is a message" > /dev/kmsg
+  12,607,22085407756,-;This is a message
+   cpu=42
+
+In this example, the message was sent by CPU 42.
+
+.. note::
+
+   If the user has set a conflicting `cpu` key in the userdata dictionary,
+   both keys will be reported, with the kernel-populated entry appearing after
+   the user one. For example::
+
+     # User-defined CPU entry
+     mkdir -p /sys/kernel/config/netconsole/target1/userdata/cpu
+     echo "1" > /sys/kernel/config/netconsole/target1/userdata/cpu/value
+
+   Output might look like::
+
+     12,607,22085407756,-;This is a message
+      cpu=1
+      cpu=42    # kernel-populated value
+
+
 Extended console:
 =================
 
@@ -165,9 +354,14 @@ following format which is the same as /dev/kmsg::
 
  <level>,<sequnum>,<timestamp>,<contflag>;<message text>
 
+If 'r' (release) feature is enabled, the kernel release version is
+prepended to the start of the message. Example::
+
+ 6.4.0,6,444,501151268,-;netconsole: network logging started
+
 Non printable characters in <message text> are escaped using "\xff"
 notation. If the message contains optional dictionary, verbatim
-newline is used as the delimeter.
+newline is used as the delimiter.
 
 If a message doesn't fit in certain number of bytes (currently 1000),
 the message is split into multiple fragments by netconsole. These
diff --git a/Documentation/networking/netdev-FAQ.rst b/Documentation/networking/netdev-FAQ.rst
deleted file mode 100644
index d5c9320901c3..000000000000
--- a/Documentation/networking/netdev-FAQ.rst
+++ /dev/null
@@ -1,272 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-.. _netdev-FAQ:
-
-==========
-netdev FAQ
-==========
-
-Q: What is netdev?
-------------------
-A: It is a mailing list for all network-related Linux stuff.  This
-includes anything found under net/ (i.e. core code like IPv6) and
-drivers/net (i.e. hardware specific drivers) in the Linux source tree.
-
-Note that some subsystems (e.g. wireless drivers) which have a high
-volume of traffic have their own specific mailing lists.
-
-The netdev list is managed (like many other Linux mailing lists) through
-VGER (http://vger.kernel.org/) and archives can be found below:
-
--  http://marc.info/?l=linux-netdev
--  http://www.spinics.net/lists/netdev/
-
-Aside from subsystems like that mentioned above, all network-related
-Linux development (i.e. RFC, review, comments, etc.) takes place on
-netdev.
-
-Q: How do the changes posted to netdev make their way into Linux?
------------------------------------------------------------------
-A: There are always two trees (git repositories) in play.  Both are
-driven by David Miller, the main network maintainer.  There is the
-``net`` tree, and the ``net-next`` tree.  As you can probably guess from
-the names, the ``net`` tree is for fixes to existing code already in the
-mainline tree from Linus, and ``net-next`` is where the new code goes
-for the future release.  You can find the trees here:
-
-- https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git
-- https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git
-
-Q: How often do changes from these trees make it to the mainline Linus tree?
-----------------------------------------------------------------------------
-A: To understand this, you need to know a bit of background information on
-the cadence of Linux development.  Each new release starts off with a
-two week "merge window" where the main maintainers feed their new stuff
-to Linus for merging into the mainline tree.  After the two weeks, the
-merge window is closed, and it is called/tagged ``-rc1``.  No new
-features get mainlined after this -- only fixes to the rc1 content are
-expected.  After roughly a week of collecting fixes to the rc1 content,
-rc2 is released.  This repeats on a roughly weekly basis until rc7
-(typically; sometimes rc6 if things are quiet, or rc8 if things are in a
-state of churn), and a week after the last vX.Y-rcN was done, the
-official vX.Y is released.
-
-Relating that to netdev: At the beginning of the 2-week merge window,
-the ``net-next`` tree will be closed - no new changes/features.  The
-accumulated new content of the past ~10 weeks will be passed onto
-mainline/Linus via a pull request for vX.Y -- at the same time, the
-``net`` tree will start accumulating fixes for this pulled content
-relating to vX.Y
-
-An announcement indicating when ``net-next`` has been closed is usually
-sent to netdev, but knowing the above, you can predict that in advance.
-
-IMPORTANT: Do not send new ``net-next`` content to netdev during the
-period during which ``net-next`` tree is closed.
-
-Shortly after the two weeks have passed (and vX.Y-rc1 is released), the
-tree for ``net-next`` reopens to collect content for the next (vX.Y+1)
-release.
-
-If you aren't subscribed to netdev and/or are simply unsure if
-``net-next`` has re-opened yet, simply check the ``net-next`` git
-repository link above for any new networking-related commits.  You may
-also check the following website for the current status:
-
-  http://vger.kernel.org/~davem/net-next.html
-
-The ``net`` tree continues to collect fixes for the vX.Y content, and is
-fed back to Linus at regular (~weekly) intervals.  Meaning that the
-focus for ``net`` is on stabilization and bug fixes.
-
-Finally, the vX.Y gets released, and the whole cycle starts over.
-
-Q: So where are we now in this cycle?
-
-Load the mainline (Linus) page here:
-
-  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
-
-and note the top of the "tags" section.  If it is rc1, it is early in
-the dev cycle.  If it was tagged rc7 a week ago, then a release is
-probably imminent.
-
-Q: How do I indicate which tree (net vs. net-next) my patch should be in?
--------------------------------------------------------------------------
-A: Firstly, think whether you have a bug fix or new "next-like" content.
-Then once decided, assuming that you use git, use the prefix flag, i.e.
-::
-
-  git format-patch --subject-prefix='PATCH net-next' start..finish
-
-Use ``net`` instead of ``net-next`` (always lower case) in the above for
-bug-fix ``net`` content.  If you don't use git, then note the only magic
-in the above is just the subject text of the outgoing e-mail, and you
-can manually change it yourself with whatever MUA you are comfortable
-with.
-
-Q: I sent a patch and I'm wondering what happened to it?
---------------------------------------------------------
-Q: How can I tell whether it got merged?
-A: Start by looking at the main patchworks queue for netdev:
-
-  http://patchwork.ozlabs.org/project/netdev/list/
-
-The "State" field will tell you exactly where things are at with your
-patch.
-
-Q: The above only says "Under Review".  How can I find out more?
-----------------------------------------------------------------
-A: Generally speaking, the patches get triaged quickly (in less than
-48h).  So be patient.  Asking the maintainer for status updates on your
-patch is a good way to ensure your patch is ignored or pushed to the
-bottom of the priority list.
-
-Q: I submitted multiple versions of the patch series
-----------------------------------------------------
-Q: should I directly update patchwork for the previous versions of these
-patch series?
-A: No, please don't interfere with the patch status on patchwork, leave
-it to the maintainer to figure out what is the most recent and current
-version that should be applied. If there is any doubt, the maintainer
-will reply and ask what should be done.
-
-Q: I made changes to only a few patches in a patch series should I resend only those changed?
----------------------------------------------------------------------------------------------
-A: No, please resend the entire patch series and make sure you do number your
-patches such that it is clear this is the latest and greatest set of patches
-that can be applied.
-
-Q: I submitted multiple versions of a patch series and it looks like a version other than the last one has been accepted, what should I do?
--------------------------------------------------------------------------------------------------------------------------------------------
-A: There is no revert possible, once it is pushed out, it stays like that.
-Please send incremental versions on top of what has been merged in order to fix
-the patches the way they would look like if your latest patch series was to be
-merged.
-
-Q: How can I tell what patches are queued up for backporting to the various stable releases?
---------------------------------------------------------------------------------------------
-A: Normally Greg Kroah-Hartman collects stable commits himself, but for
-networking, Dave collects up patches he deems critical for the
-networking subsystem, and then hands them off to Greg.
-
-There is a patchworks queue that you can see here:
-
-  http://patchwork.ozlabs.org/bundle/davem/stable/?state=*
-
-It contains the patches which Dave has selected, but not yet handed off
-to Greg.  If Greg already has the patch, then it will be here:
-
-  https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git
-
-A quick way to find whether the patch is in this stable-queue is to
-simply clone the repo, and then git grep the mainline commit ID, e.g.
-::
-
-  stable-queue$ git grep -l 284041ef21fdf2e
-  releases/3.0.84/ipv6-fix-possible-crashes-in-ip6_cork_release.patch
-  releases/3.4.51/ipv6-fix-possible-crashes-in-ip6_cork_release.patch
-  releases/3.9.8/ipv6-fix-possible-crashes-in-ip6_cork_release.patch
-  stable/stable-queue$
-
-Q: I see a network patch and I think it should be backported to stable.
------------------------------------------------------------------------
-Q: Should I request it via stable@vger.kernel.org like the references in
-the kernel's Documentation/process/stable-kernel-rules.rst file say?
-A: No, not for networking.  Check the stable queues as per above first
-to see if it is already queued.  If not, then send a mail to netdev,
-listing the upstream commit ID and why you think it should be a stable
-candidate.
-
-Before you jump to go do the above, do note that the normal stable rules
-in :ref:`Documentation/process/stable-kernel-rules.rst <stable_kernel_rules>`
-still apply.  So you need to explicitly indicate why it is a critical
-fix and exactly what users are impacted.  In addition, you need to
-convince yourself that you *really* think it has been overlooked,
-vs. having been considered and rejected.
-
-Generally speaking, the longer it has had a chance to "soak" in
-mainline, the better the odds that it is an OK candidate for stable.  So
-scrambling to request a commit be added the day after it appears should
-be avoided.
-
-Q: I have created a network patch and I think it should be backported to stable.
---------------------------------------------------------------------------------
-Q: Should I add a Cc: stable@vger.kernel.org like the references in the
-kernel's Documentation/ directory say?
-A: No.  See above answer.  In short, if you think it really belongs in
-stable, then ensure you write a decent commit log that describes who
-gets impacted by the bug fix and how it manifests itself, and when the
-bug was introduced.  If you do that properly, then the commit will get
-handled appropriately and most likely get put in the patchworks stable
-queue if it really warrants it.
-
-If you think there is some valid information relating to it being in
-stable that does *not* belong in the commit log, then use the three dash
-marker line as described in
-:ref:`Documentation/process/submitting-patches.rst <the_canonical_patch_format>`
-to temporarily embed that information into the patch that you send.
-
-Q: Are all networking bug fixes backported to all stable releases?
-------------------------------------------------------------------
-A: Due to capacity, Dave could only take care of the backports for the
-last two stable releases. For earlier stable releases, each stable
-branch maintainer is supposed to take care of them. If you find any
-patch is missing from an earlier stable branch, please notify
-stable@vger.kernel.org with either a commit ID or a formal patch
-backported, and CC Dave and other relevant networking developers.
-
-Q: Is the comment style convention different for the networking content?
-------------------------------------------------------------------------
-A: Yes, in a largely trivial way.  Instead of this::
-
-  /*
-   * foobar blah blah blah
-   * another line of text
-   */
-
-it is requested that you make it look like this::
-
-  /* foobar blah blah blah
-   * another line of text
-   */
-
-Q: I am working in existing code that has the former comment style and not the latter.
---------------------------------------------------------------------------------------
-Q: Should I submit new code in the former style or the latter?
-A: Make it the latter style, so that eventually all code in the domain
-of netdev is of this format.
-
-Q: I found a bug that might have possible security implications or similar.
----------------------------------------------------------------------------
-Q: Should I mail the main netdev maintainer off-list?**
-A: No. The current netdev maintainer has consistently requested that
-people use the mailing lists and not reach out directly.  If you aren't
-OK with that, then perhaps consider mailing security@kernel.org or
-reading about http://oss-security.openwall.org/wiki/mailing-lists/distros
-as possible alternative mechanisms.
-
-Q: What level of testing is expected before I submit my change?
----------------------------------------------------------------
-A: If your changes are against ``net-next``, the expectation is that you
-have tested by layering your changes on top of ``net-next``.  Ideally
-you will have done run-time testing specific to your change, but at a
-minimum, your changes should survive an ``allyesconfig`` and an
-``allmodconfig`` build without new warnings or failures.
-
-Q: Any other tips to help ensure my net/net-next patch gets OK'd?
------------------------------------------------------------------
-A: Attention to detail.  Re-read your own work as if you were the
-reviewer.  You can start with using ``checkpatch.pl``, perhaps even with
-the ``--strict`` flag.  But do not be mindlessly robotic in doing so.
-If your change is a bug fix, make sure your commit log indicates the
-end-user visible symptom, the underlying reason as to why it happens,
-and then if necessary, explain why the fix proposed is the best way to
-get things done.  Don't mangle whitespace, and as is common, don't
-mis-indent function arguments that span multiple lines.  If it is your
-first patch, mail it to yourself so you can test apply it to an
-unpatched tree to confirm infrastructure didn't mangle it.
-
-Finally, go back and read
-:ref:`Documentation/process/submitting-patches.rst <submittingpatches>`
-to be sure you are not repeating some common mistake documented there.
diff --git a/Documentation/networking/netdev-features.rst b/Documentation/networking/netdev-features.rst
index a2d7d7160e39..02bd7536fc0c 100644
--- a/Documentation/networking/netdev-features.rst
+++ b/Documentation/networking/netdev-features.rst
@@ -139,21 +139,6 @@ chained skbs (skb->next/prev list).
 Features contained in NETIF_F_SOFT_FEATURES are features of networking
 stack. Driver should not change behaviour based on them.
 
- * LLTX driver (deprecated for hardware drivers)
-
-NETIF_F_LLTX is meant to be used by drivers that don't need locking at all,
-e.g. software tunnels.
-
-This is also used in a few legacy drivers that implement their
-own locking, don't use it for new (hardware) drivers.
-
- * netns-local device
-
-NETIF_F_NETNS_LOCAL is set for devices that are not allowed to move between
-network namespaces (e.g. loopback).
-
-Don't use it in drivers.
-
  * VLAN challenged
 
 NETIF_F_VLAN_CHALLENGED should be set for devices which can't cope with VLAN
@@ -182,3 +167,29 @@ stricter than Hardware LRO.  A packet stream merged by Hardware GRO must
 be re-segmentable by GSO or TSO back to the exact original packet stream.
 Hardware GRO is dependent on RXCSUM since every packet successfully merged
 by hardware must also have the checksum verified by hardware.
+
+* hsr-tag-ins-offload
+
+This should be set for devices which insert an HSR (High-availability Seamless
+Redundancy) or PRP (Parallel Redundancy Protocol) tag automatically.
+
+* hsr-tag-rm-offload
+
+This should be set for devices which remove HSR (High-availability Seamless
+Redundancy) or PRP (Parallel Redundancy Protocol) tags automatically.
+
+* hsr-fwd-offload
+
+This should be set for devices which forward HSR (High-availability Seamless
+Redundancy) frames from one port to another in hardware.
+
+* hsr-dup-offload
+
+This should be set for devices which duplicate outgoing HSR (High-availability
+Seamless Redundancy) or PRP (Parallel Redundancy Protocol) tags automatically
+frames in hardware.
+
+* netmem-tx
+
+This should be set for devices which support netmem TX. See
+Documentation/networking/netmem.rst
diff --git a/Documentation/networking/netdevices.rst b/Documentation/networking/netdevices.rst
index 5a85fcc80c76..7ebb6c36482d 100644
--- a/Documentation/networking/netdevices.rst
+++ b/Documentation/networking/netdevices.rst
@@ -8,20 +8,179 @@ Network Devices, the Kernel, and You!
 Introduction
 ============
 The following is a random collection of documentation regarding
-network devices.
+network devices. It is intended for driver developers.
 
-struct net_device allocation rules
-==================================
+struct net_device lifetime rules
+================================
 Network device structures need to persist even after module is unloaded and
 must be allocated with alloc_netdev_mqs() and friends.
 If device has registered successfully, it will be freed on last use
-by free_netdev(). This is required to handle the pathologic case cleanly
-(example: rmmod mydriver </sys/class/net/myeth/mtu )
+by free_netdev(). This is required to handle the pathological case cleanly
+(example: ``rmmod mydriver </sys/class/net/myeth/mtu``)
 
-alloc_netdev_mqs()/alloc_netdev() reserve extra space for driver
+alloc_netdev_mqs() / alloc_netdev() reserve extra space for driver
 private data which gets freed when the network device is freed. If
 separately allocated data is attached to the network device
-(netdev_priv(dev)) then it is up to the module exit handler to free that.
+(netdev_priv()) then it is up to the module exit handler to free that.
+
+There are two groups of APIs for registering struct net_device.
+First group can be used in normal contexts where ``rtnl_lock`` is not already
+held: register_netdev(), unregister_netdev().
+Second group can be used when ``rtnl_lock`` is already held:
+register_netdevice(), unregister_netdevice(), free_netdevice().
+
+Simple drivers
+--------------
+
+Most drivers (especially device drivers) handle lifetime of struct net_device
+in context where ``rtnl_lock`` is not held (e.g. driver probe and remove paths).
+
+In that case the struct net_device registration is done using
+the register_netdev(), and unregister_netdev() functions:
+
+.. code-block:: c
+
+  int probe()
+  {
+    struct my_device_priv *priv;
+    int err;
+
+    dev = alloc_netdev_mqs(...);
+    if (!dev)
+      return -ENOMEM;
+    priv = netdev_priv(dev);
+
+    /* ... do all device setup before calling register_netdev() ...
+     */
+
+    err = register_netdev(dev);
+    if (err)
+      goto err_undo;
+
+    /* net_device is visible to the user! */
+
+  err_undo:
+    /* ... undo the device setup ... */
+    free_netdev(dev);
+    return err;
+  }
+
+  void remove()
+  {
+    unregister_netdev(dev);
+    free_netdev(dev);
+  }
+
+Note that after calling register_netdev() the device is visible in the system.
+Users can open it and start sending / receiving traffic immediately,
+or run any other callback, so all initialization must be done prior to
+registration.
+
+unregister_netdev() closes the device and waits for all users to be done
+with it. The memory of struct net_device itself may still be referenced
+by sysfs but all operations on that device will fail.
+
+free_netdev() can be called after unregister_netdev() returns on when
+register_netdev() failed.
+
+Device management under RTNL
+----------------------------
+
+Registering struct net_device while in context which already holds
+the ``rtnl_lock`` requires extra care. In those scenarios most drivers
+will want to make use of struct net_device's ``needs_free_netdev``
+and ``priv_destructor`` members for freeing of state.
+
+Example flow of netdev handling under ``rtnl_lock``:
+
+.. code-block:: c
+
+  static void my_setup(struct net_device *dev)
+  {
+    dev->needs_free_netdev = true;
+  }
+
+  static void my_destructor(struct net_device *dev)
+  {
+    some_obj_destroy(priv->obj);
+    some_uninit(priv);
+  }
+
+  int create_link()
+  {
+    struct my_device_priv *priv;
+    int err;
+
+    ASSERT_RTNL();
+
+    dev = alloc_netdev(sizeof(*priv), "net%d", NET_NAME_UNKNOWN, my_setup);
+    if (!dev)
+      return -ENOMEM;
+    priv = netdev_priv(dev);
+
+    /* Implicit constructor */
+    err = some_init(priv);
+    if (err)
+      goto err_free_dev;
+
+    priv->obj = some_obj_create();
+    if (!priv->obj) {
+      err = -ENOMEM;
+      goto err_some_uninit;
+    }
+    /* End of constructor, set the destructor: */
+    dev->priv_destructor = my_destructor;
+
+    err = register_netdevice(dev);
+    if (err)
+      /* register_netdevice() calls destructor on failure */
+      goto err_free_dev;
+
+    /* If anything fails now unregister_netdevice() (or unregister_netdev())
+     * will take care of calling my_destructor and free_netdev().
+     */
+
+    return 0;
+
+  err_some_uninit:
+    some_uninit(priv);
+  err_free_dev:
+    free_netdev(dev);
+    return err;
+  }
+
+If struct net_device.priv_destructor is set it will be called by the core
+some time after unregister_netdevice(), it will also be called if
+register_netdevice() fails. The callback may be invoked with or without
+``rtnl_lock`` held.
+
+There is no explicit constructor callback, driver "constructs" the private
+netdev state after allocating it and before registration.
+
+Setting struct net_device.needs_free_netdev makes core call free_netdevice()
+automatically after unregister_netdevice() when all references to the device
+are gone. It only takes effect after a successful call to register_netdevice()
+so if register_netdevice() fails driver is responsible for calling
+free_netdev().
+
+free_netdev() is safe to call on error paths right after unregister_netdevice()
+or when register_netdevice() fails. Parts of netdev (de)registration process
+happen after ``rtnl_lock`` is released, therefore in those cases free_netdev()
+will defer some of the processing until ``rtnl_lock`` is released.
+
+Devices spawned from struct rtnl_link_ops should never free the
+struct net_device directly.
+
+.ndo_init and .ndo_uninit
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``.ndo_init`` and ``.ndo_uninit`` callbacks are called during net_device
+registration and de-registration, under ``rtnl_lock``. Drivers can use
+those e.g. when parts of their init process need to run under ``rtnl_lock``.
+
+``.ndo_init`` runs before device is visible in the system, ``.ndo_uninit``
+runs during de-registering after device is closed but other subsystems
+may still have outstanding references to the netdevice.
 
 MTU
 ===
@@ -51,30 +210,65 @@ packets is preferred.
 struct net_device synchronization rules
 =======================================
 ndo_open:
-	Synchronization: rtnl_lock() semaphore.
+	Synchronization: rtnl_lock() semaphore. In addition, netdev instance
+	lock if the driver implements queue management or shaper API.
 	Context: process
 
 ndo_stop:
-	Synchronization: rtnl_lock() semaphore.
+	Synchronization: rtnl_lock() semaphore. In addition, netdev instance
+	lock if the driver implements queue management or shaper API.
 	Context: process
 	Note: netif_running() is guaranteed false
 
 ndo_do_ioctl:
 	Synchronization: rtnl_lock() semaphore.
+
+	This is only called by network subsystems internally,
+	not by user space calling ioctl as it was in before
+	linux-5.14.
+
+ndo_siocbond:
+	Synchronization: rtnl_lock() semaphore. In addition, netdev instance
+	lock if the driver implements queue management or shaper API.
+        Context: process
+
+	Used by the bonding driver for the SIOCBOND family of
+	ioctl commands.
+
+ndo_siocwandev:
+	Synchronization: rtnl_lock() semaphore. In addition, netdev instance
+	lock if the driver implements queue management or shaper API.
+	Context: process
+
+	Used by the drivers/net/wan framework to handle
+	the SIOCWANDEV ioctl with the if_settings structure.
+
+ndo_siocdevprivate:
+	Synchronization: rtnl_lock() semaphore. In addition, netdev instance
+	lock if the driver implements queue management or shaper API.
+	Context: process
+
+	This is used to implement SIOCDEVPRIVATE ioctl helpers.
+	These should not be added to new drivers, so don't use.
+
+ndo_eth_ioctl:
+	Synchronization: rtnl_lock() semaphore. In addition, netdev instance
+	lock if the driver implements queue management or shaper API.
 	Context: process
 
 ndo_get_stats:
-	Synchronization: dev_base_lock rwlock.
-	Context: nominally process, but don't sleep inside an rwlock
+	Synchronization: RCU (can be called concurrently with the stats
+	update path).
+	Context: atomic (can't sleep under RCU)
 
 ndo_start_xmit:
 	Synchronization: __netif_tx_lock spinlock.
 
-	When the driver sets NETIF_F_LLTX in dev->features this will be
+	When the driver sets dev->lltx this will be
 	called without holding netif_tx_lock. In this case the driver
 	has to lock by itself when needed.
 	The locking there should also properly protect against
-	set_rx_mode. WARNING: use of NETIF_F_LLTX is deprecated.
+	set_rx_mode. WARNING: use of dev->lltx is deprecated.
 	Don't use it for new drivers.
 
 	Context: Process with BHs disabled or BH (timer),
@@ -96,6 +290,16 @@ ndo_set_rx_mode:
 	Synchronization: netif_addr_lock spinlock.
 	Context: BHs disabled
 
+ndo_setup_tc:
+	``TC_SETUP_BLOCK`` and ``TC_SETUP_FT`` are running under NFT locks
+	(i.e. no ``rtnl_lock`` and no device instance lock). The rest of
+	``tc_setup_type`` types run under netdev instance lock if the driver
+	implements queue management or shaper API.
+
+Most ndo callbacks not specified in the list above are running
+under ``rtnl_lock``. In addition, netdev instance lock is taken as well if
+the driver implements queue management or shaper API.
+
 struct napi_struct synchronization rules
 ========================================
 napi->poll:
@@ -109,3 +313,106 @@ napi->poll:
 	Context:
 		 softirq
 		 will be called with interrupts disabled by netconsole.
+
+netdev instance lock
+====================
+
+Historically, all networking control operations were protected by a single
+global lock known as ``rtnl_lock``. There is an ongoing effort to replace this
+global lock with separate locks for each network namespace. Additionally,
+properties of individual netdev are increasingly protected by per-netdev locks.
+
+For device drivers that implement shaping or queue management APIs, all control
+operations will be performed under the netdev instance lock.
+Drivers can also explicitly request instance lock to be held during ops
+by setting ``request_ops_lock`` to true. Code comments and docs refer
+to drivers which have ops called under the instance lock as "ops locked".
+See also the documentation of the ``lock`` member of struct net_device.
+
+In the future, there will be an option for individual
+drivers to opt out of using ``rtnl_lock`` and instead perform their control
+operations directly under the netdev instance lock.
+
+Devices drivers are encouraged to rely on the instance lock where possible.
+
+For the (mostly software) drivers that need to interact with the core stack,
+there are two sets of interfaces: ``dev_xxx``/``netdev_xxx`` and ``netif_xxx``
+(e.g., ``dev_set_mtu`` and ``netif_set_mtu``). The ``dev_xxx``/``netdev_xxx``
+functions handle acquiring the instance lock themselves, while the
+``netif_xxx`` functions assume that the driver has already acquired
+the instance lock.
+
+struct net_device_ops
+---------------------
+
+``ndos`` are called without holding the instance lock for most drivers.
+
+"Ops locked" drivers will have most of the ``ndos`` invoked under
+the instance lock.
+
+struct ethtool_ops
+------------------
+
+Similarly to ``ndos`` the instance lock is only held for select drivers.
+For "ops locked" drivers all ethtool ops without exceptions should
+be called under the instance lock.
+
+struct netdev_stat_ops
+----------------------
+
+"qstat" ops are invoked under the instance lock for "ops locked" drivers,
+and under rtnl_lock for all other drivers.
+
+struct net_shaper_ops
+---------------------
+
+All net shaper callbacks are invoked while holding the netdev instance
+lock. ``rtnl_lock`` may or may not be held.
+
+Note that supporting net shapers automatically enables "ops locking".
+
+struct netdev_queue_mgmt_ops
+----------------------------
+
+All queue management callbacks are invoked while holding the netdev instance
+lock. ``rtnl_lock`` may or may not be held.
+
+Note that supporting struct netdev_queue_mgmt_ops automatically enables
+"ops locking".
+
+Notifiers and netdev instance lock
+----------------------------------
+
+For device drivers that implement shaping or queue management APIs,
+some of the notifiers (``enum netdev_cmd``) are running under the netdev
+instance lock.
+
+The following netdev notifiers are always run under the instance lock:
+* ``NETDEV_XDP_FEAT_CHANGE``
+
+For devices with locked ops, currently only the following notifiers are
+running under the lock:
+* ``NETDEV_CHANGE``
+* ``NETDEV_REGISTER``
+* ``NETDEV_UP``
+
+The following notifiers are running without the lock:
+* ``NETDEV_UNREGISTER``
+
+There are no clear expectations for the remaining notifiers. Notifiers not on
+the list may run with or without the instance lock, potentially even invoking
+the same notifier type with and without the lock from different code paths.
+The goal is to eventually ensure that all (or most, with a few documented
+exceptions) notifiers run under the instance lock. Please extend this
+documentation whenever you make explicit assumption about lock being held
+from a notifier.
+
+NETDEV_INTERNAL symbol namespace
+================================
+
+Symbols exported as NETDEV_INTERNAL can only be used in networking
+core and drivers which exclusively flow via the main networking list and trees.
+Note that the inverse is not true, most symbols outside of NETDEV_INTERNAL
+are not expected to be used by random code outside netdev either.
+Symbols may lack the designation because they predate the namespaces,
+or simply due to an oversight.
diff --git a/Documentation/networking/netlink_spec/.gitignore b/Documentation/networking/netlink_spec/.gitignore
new file mode 100644
index 000000000000..30d85567b592
--- /dev/null
+++ b/Documentation/networking/netlink_spec/.gitignore
@@ -0,0 +1 @@
+*.rst
diff --git a/Documentation/networking/netlink_spec/readme.txt b/Documentation/networking/netlink_spec/readme.txt
new file mode 100644
index 000000000000..030b44aca4e6
--- /dev/null
+++ b/Documentation/networking/netlink_spec/readme.txt
@@ -0,0 +1,4 @@
+SPDX-License-Identifier: GPL-2.0
+
+This file is populated during the build of the documentation (htmldocs) by the
+tools/net/ynl/pyynl/ynl_gen_rst.py script.
diff --git a/Documentation/networking/netmem.rst b/Documentation/networking/netmem.rst
new file mode 100644
index 000000000000..b63aded46337
--- /dev/null
+++ b/Documentation/networking/netmem.rst
@@ -0,0 +1,98 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+Netmem Support for Network Drivers
+==================================
+
+This document outlines the requirements for network drivers to support netmem,
+an abstract memory type that enables features like device memory TCP. By
+supporting netmem, drivers can work with various underlying memory types
+with little to no modification.
+
+Benefits of Netmem :
+
+* Flexibility: Netmem can be backed by different memory types (e.g., struct
+  page, DMA-buf), allowing drivers to support various use cases such as device
+  memory TCP.
+* Future-proof: Drivers with netmem support are ready for upcoming
+  features that rely on it.
+* Simplified Development: Drivers interact with a consistent API,
+  regardless of the underlying memory implementation.
+
+Driver RX Requirements
+======================
+
+1. The driver must support page_pool.
+
+2. The driver must support the tcp-data-split ethtool option.
+
+3. The driver must use the page_pool netmem APIs for payload memory. The netmem
+   APIs currently 1-to-1 correspond with page APIs. Conversion to netmem should
+   be achievable by switching the page APIs to netmem APIs and tracking memory
+   via netmem_refs in the driver rather than struct page * :
+
+   - page_pool_alloc -> page_pool_alloc_netmem
+   - page_pool_get_dma_addr -> page_pool_get_dma_addr_netmem
+   - page_pool_put_page -> page_pool_put_netmem
+
+   Not all page APIs have netmem equivalents at the moment. If your driver
+   relies on a missing netmem API, feel free to add and propose to netdev@, or
+   reach out to the maintainers and/or almasrymina@google.com for help adding
+   the netmem API.
+
+4. The driver must use the following PP_FLAGS:
+
+   - PP_FLAG_DMA_MAP: netmem is not dma-mappable by the driver. The driver
+     must delegate the dma mapping to the page_pool, which knows when
+     dma-mapping is (or is not) appropriate.
+   - PP_FLAG_DMA_SYNC_DEV: netmem dma addr is not necessarily dma-syncable
+     by the driver. The driver must delegate the dma syncing to the page_pool,
+     which knows when dma-syncing is (or is not) appropriate.
+   - PP_FLAG_ALLOW_UNREADABLE_NETMEM. The driver must specify this flag iff
+     tcp-data-split is enabled.
+
+5. The driver must not assume the netmem is readable and/or backed by pages.
+   The netmem returned by the page_pool may be unreadable, in which case
+   netmem_address() will return NULL. The driver must correctly handle
+   unreadable netmem, i.e. don't attempt to handle its contents when
+   netmem_address() is NULL.
+
+   Ideally, drivers should not have to check the underlying netmem type via
+   helpers like netmem_is_net_iov() or convert the netmem to any of its
+   underlying types via netmem_to_page() or netmem_to_net_iov(). In most cases,
+   netmem or page_pool helpers that abstract this complexity are provided
+   (and more can be added).
+
+6. The driver must use page_pool_dma_sync_netmem_for_cpu() in lieu of
+   dma_sync_single_range_for_cpu(). For some memory providers, dma_syncing for
+   CPU will be done by the page_pool, for others (particularly dmabuf memory
+   provider), dma syncing for CPU is the responsibility of the userspace using
+   dmabuf APIs. The driver must delegate the entire dma-syncing operation to
+   the page_pool which will do it correctly.
+
+7. Avoid implementing driver-specific recycling on top of the page_pool. Drivers
+   cannot hold onto a struct page to do their own recycling as the netmem may
+   not be backed by a struct page. However, you may hold onto a page_pool
+   reference with page_pool_fragment_netmem() or page_pool_ref_netmem() for
+   that purpose, but be mindful that some netmem types might have longer
+   circulation times, such as when userspace holds a reference in zerocopy
+   scenarios.
+
+Driver TX Requirements
+======================
+
+1. The Driver must not pass the netmem dma_addr to any of the dma-mapping APIs
+   directly. This is because netmem dma_addrs may come from a source like
+   dma-buf that is not compatible with the dma-mapping APIs.
+
+   Helpers like netmem_dma_unmap_page_attrs() & netmem_dma_unmap_addr_set()
+   should be used in lieu of dma_unmap_page[_attrs](), dma_unmap_addr_set().
+   The netmem variants will handle netmem dma_addrs correctly regardless of the
+   source, delegating to the dma-mapping APIs when appropriate.
+
+   Not all dma-mapping APIs have netmem equivalents at the moment. If your
+   driver relies on a missing netmem API, feel free to add and propose to
+   netdev@, or reach out to the maintainers and/or almasrymina@google.com for
+   help adding the netmem API.
+
+2. Driver should declare support by setting `netdev->netmem_tx = true`
diff --git a/Documentation/networking/nexthop-group-resilient.rst b/Documentation/networking/nexthop-group-resilient.rst
new file mode 100644
index 000000000000..fabecee24d85
--- /dev/null
+++ b/Documentation/networking/nexthop-group-resilient.rst
@@ -0,0 +1,293 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+Resilient Next-hop Groups
+=========================
+
+Resilient groups are a type of next-hop group that is aimed at minimizing
+disruption in flow routing across changes to the group composition and
+weights of constituent next hops.
+
+The idea behind resilient hashing groups is best explained in contrast to
+the legacy multipath next-hop group, which uses the hash-threshold
+algorithm, described in RFC 2992.
+
+To select a next hop, hash-threshold algorithm first assigns a range of
+hashes to each next hop in the group, and then selects the next hop by
+comparing the SKB hash with the individual ranges. When a next hop is
+removed from the group, the ranges are recomputed, which leads to
+reassignment of parts of hash space from one next hop to another. RFC 2992
+illustrates it thus::
+
+             +-------+-------+-------+-------+-------+
+             |   1   |   2   |   3   |   4   |   5   |
+             +-------+-+-----+---+---+-----+-+-------+
+             |    1    |    2    |    4    |    5    |
+             +---------+---------+---------+---------+
+
+              Before and after deletion of next hop 3
+	      under the hash-threshold algorithm.
+
+Note how next hop 2 gave up part of the hash space in favor of next hop 1,
+and 4 in favor of 5. While there will usually be some overlap between the
+previous and the new distribution, some traffic flows change the next hop
+that they resolve to.
+
+If a multipath group is used for load-balancing between multiple servers,
+this hash space reassignment causes an issue that packets from a single
+flow suddenly end up arriving at a server that does not expect them. This
+can result in TCP connections being reset.
+
+If a multipath group is used for load-balancing among available paths to
+the same server, the issue is that different latencies and reordering along
+the way causes the packets to arrive in the wrong order, resulting in
+degraded application performance.
+
+To mitigate the above-mentioned flow redirection, resilient next-hop groups
+insert another layer of indirection between the hash space and its
+constituent next hops: a hash table. The selection algorithm uses SKB hash
+to choose a hash table bucket, then reads the next hop that this bucket
+contains, and forwards traffic there.
+
+This indirection brings an important feature. In the hash-threshold
+algorithm, the range of hashes associated with a next hop must be
+continuous. With a hash table, mapping between the hash table buckets and
+the individual next hops is arbitrary. Therefore when a next hop is deleted
+the buckets that held it are simply reassigned to other next hops::
+
+	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+	    |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
+	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+	                     v v v v
+	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+	    |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
+	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+	    Before and after deletion of next hop 3
+	    under the resilient hashing algorithm.
+
+When weights of next hops in a group are altered, it may be possible to
+choose a subset of buckets that are currently not used for forwarding
+traffic, and use those to satisfy the new next-hop distribution demands,
+keeping the "busy" buckets intact. This way, established flows are ideally
+kept being forwarded to the same endpoints through the same paths as before
+the next-hop group change.
+
+Algorithm
+---------
+
+In a nutshell, the algorithm works as follows. Each next hop deserves a
+certain number of buckets, according to its weight and the number of
+buckets in the hash table. In accordance with the source code, we will call
+this number a "wants count" of a next hop. In case of an event that might
+cause bucket allocation change, the wants counts for individual next hops
+are updated.
+
+Next hops that have fewer buckets than their wants count, are called
+"underweight". Those that have more are "overweight". If there are no
+overweight (and therefore no underweight) next hops in the group, it is
+said to be "balanced".
+
+Each bucket maintains a last-used timer. Every time a packet is forwarded
+through a bucket, this timer is updated to current jiffies value. One
+attribute of a resilient group is then the "idle timer", which is the
+amount of time that a bucket must not be hit by traffic in order for it to
+be considered "idle". Buckets that are not idle are busy.
+
+After assigning wants counts to next hops, an "upkeep" algorithm runs. For
+buckets:
+
+1) that have no assigned next hop, or
+2) whose next hop has been removed, or
+3) that are idle and their next hop is overweight,
+
+upkeep changes the next hop that the bucket references to one of the
+underweight next hops. If, after considering all buckets in this manner,
+there are still underweight next hops, another upkeep run is scheduled to a
+future time.
+
+There may not be enough "idle" buckets to satisfy the updated wants counts
+of all next hops. Another attribute of a resilient group is the "unbalanced
+timer". This timer can be set to 0, in which case the table will stay out
+of balance until idle buckets do appear, possibly never. If set to a
+non-zero value, the value represents the period of time that the table is
+permitted to stay out of balance.
+
+With this in mind, we update the above list of conditions with one more
+item. Thus buckets:
+
+4) whose next hop is overweight, and the amount of time that the table has
+   been out of balance exceeds the unbalanced timer, if that is non-zero,
+
+\... are migrated as well.
+
+Offloading & Driver Feedback
+----------------------------
+
+When offloading resilient groups, the algorithm that distributes buckets
+among next hops is still the one in SW. Drivers are notified of updates to
+next hop groups in the following three ways:
+
+- Full group notification with the type
+  ``NH_NOTIFIER_INFO_TYPE_RES_TABLE``. This is used just after the group is
+  created and buckets populated for the first time.
+
+- Single-bucket notifications of the type
+  ``NH_NOTIFIER_INFO_TYPE_RES_BUCKET``, which is used for notifications of
+  individual migrations within an already-established group.
+
+- Pre-replace notification, ``NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE``. This
+  is sent before the group is replaced, and is a way for the driver to veto
+  the group before committing anything to the HW.
+
+Some single-bucket notifications are forced, as indicated by the "force"
+flag in the notification. Those are used for the cases where e.g. the next
+hop associated with the bucket was removed, and the bucket really must be
+migrated.
+
+Non-forced notifications can be overridden by the driver by returning an
+error code. The use case for this is that the driver notifies the HW that a
+bucket should be migrated, but the HW discovers that the bucket has in fact
+been hit by traffic.
+
+A second way for the HW to report that a bucket is busy is through the
+``nexthop_res_grp_activity_update()`` API. The buckets identified this way
+as busy are treated as if traffic hit them.
+
+Offloaded buckets should be flagged as either "offload" or "trap". This is
+done through the ``nexthop_bucket_set_hw_flags()`` API.
+
+Netlink UAPI
+------------
+
+Resilient Group Replacement
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Resilient groups are configured using the ``RTM_NEWNEXTHOP`` message in the
+same manner as other multipath groups. The following changes apply to the
+attributes passed in the netlink message:
+
+  =================== =========================================================
+  ``NHA_GROUP_TYPE``  Should be ``NEXTHOP_GRP_TYPE_RES`` for resilient group.
+  ``NHA_RES_GROUP``   A nest that contains attributes specific to resilient
+                      groups.
+  =================== =========================================================
+
+``NHA_RES_GROUP`` payload:
+
+  =================================== =========================================
+  ``NHA_RES_GROUP_BUCKETS``           Number of buckets in the hash table.
+  ``NHA_RES_GROUP_IDLE_TIMER``        Idle timer in units of clock_t.
+  ``NHA_RES_GROUP_UNBALANCED_TIMER``  Unbalanced timer in units of clock_t.
+  =================================== =========================================
+
+Next Hop Get
+^^^^^^^^^^^^
+
+Requests to get resilient next-hop groups use the ``RTM_GETNEXTHOP``
+message in exactly the same way as other next hop get requests. The
+response attributes match the replacement attributes cited above, except
+``NHA_RES_GROUP`` payload will include the following attribute:
+
+  =================================== =========================================
+  ``NHA_RES_GROUP_UNBALANCED_TIME``   How long has the resilient group been out
+                                      of balance, in units of clock_t.
+  =================================== =========================================
+
+Bucket Get
+^^^^^^^^^^
+
+The message ``RTM_GETNEXTHOPBUCKET`` without the ``NLM_F_DUMP`` flag is
+used to request a single bucket. The attributes recognized at get requests
+are:
+
+  =================== =========================================================
+  ``NHA_ID``          ID of the next-hop group that the bucket belongs to.
+  ``NHA_RES_BUCKET``  A nest that contains attributes specific to bucket.
+  =================== =========================================================
+
+``NHA_RES_BUCKET`` payload:
+
+  ======================== ====================================================
+  ``NHA_RES_BUCKET_INDEX`` Index of bucket in the resilient table.
+  ======================== ====================================================
+
+Bucket Dumps
+^^^^^^^^^^^^
+
+The message ``RTM_GETNEXTHOPBUCKET`` with the ``NLM_F_DUMP`` flag is used
+to request a dump of matching buckets. The attributes recognized at dump
+requests are:
+
+  =================== =========================================================
+  ``NHA_ID``          If specified, limits the dump to just the next-hop group
+                      with this ID.
+  ``NHA_OIF``         If specified, limits the dump to buckets that contain
+                      next hops that use the device with this ifindex.
+  ``NHA_MASTER``      If specified, limits the dump to buckets that contain
+                      next hops that use a device in the VRF with this ifindex.
+  ``NHA_RES_BUCKET``  A nest that contains attributes specific to bucket.
+  =================== =========================================================
+
+``NHA_RES_BUCKET`` payload:
+
+  ======================== ====================================================
+  ``NHA_RES_BUCKET_NH_ID`` If specified, limits the dump to just the buckets
+                           that contain the next hop with this ID.
+  ======================== ====================================================
+
+Usage
+-----
+
+To illustrate the usage, consider the following commands::
+
+	# ip nexthop add id 1 via 192.0.2.2 dev eth0
+	# ip nexthop add id 2 via 192.0.2.3 dev eth0
+	# ip nexthop add id 10 group 1/2 type resilient \
+		buckets 8 idle_timer 60 unbalanced_timer 300
+
+The last command creates a resilient next-hop group. It will have 8 buckets
+(which is unusually low number, and used here for demonstration purposes
+only), each bucket will be considered idle when no traffic hits it for at
+least 60 seconds, and if the table remains out of balance for 300 seconds,
+it will be forcefully brought into balance.
+
+Changing next-hop weights leads to change in bucket allocation::
+
+	# ip nexthop replace id 10 group 1,3/2 type resilient
+
+This can be confirmed by looking at individual buckets::
+
+	# ip nexthop bucket show id 10
+	id 10 index 0 idle_time 5.59 nhid 1
+	id 10 index 1 idle_time 5.59 nhid 1
+	id 10 index 2 idle_time 8.74 nhid 2
+	id 10 index 3 idle_time 8.74 nhid 2
+	id 10 index 4 idle_time 8.74 nhid 1
+	id 10 index 5 idle_time 8.74 nhid 1
+	id 10 index 6 idle_time 8.74 nhid 1
+	id 10 index 7 idle_time 8.74 nhid 1
+
+Note the two buckets that have a shorter idle time. Those are the ones that
+were migrated after the next-hop replace command to satisfy the new demand
+that next hop 1 be given 6 buckets instead of 4.
+
+Netdevsim
+---------
+
+The netdevsim driver implements a mock offload of resilient groups, and
+exposes debugfs interface that allows marking individual buckets as busy.
+For example, the following will mark bucket 23 in next-hop group 10 as
+active::
+
+	# echo 10 23 > /sys/kernel/debug/netdevsim/netdevsim10/fib/nexthop_bucket_activity
+
+In addition, another debugfs interface can be used to configure that the
+next attempt to migrate a bucket should fail::
+
+	# echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace
+
+Besides serving as an example, the interfaces that netdevsim exposes are
+useful in automated testing, and
+``tools/testing/selftests/drivers/net/netdevsim/nexthop.sh`` makes use of
+them to test the algorithm.
diff --git a/Documentation/networking/nf_conntrack-sysctl.rst b/Documentation/networking/nf_conntrack-sysctl.rst
index 11a9b76786cb..238b66d0e059 100644
--- a/Documentation/networking/nf_conntrack-sysctl.rst
+++ b/Documentation/networking/nf_conntrack-sysctl.rst
@@ -17,9 +17,8 @@ nf_conntrack_acct - BOOLEAN
 nf_conntrack_buckets - INTEGER
 	Size of hash table. If not specified as parameter during module
 	loading, the default size is calculated by dividing total memory
-	by 16384 to determine the number of buckets but the hash table will
-	never have fewer than 32 and limited to 16384 buckets. For systems
-	with more than 4GB of memory it will be 65536 buckets.
+	by 16384 to determine the number of buckets. The hash table will
+	never have fewer than 1024 and never more than 262144 buckets.
 	This sysctl is only writeable in the initial net namespace.
 
 nf_conntrack_checksum - BOOLEAN
@@ -35,10 +34,13 @@ nf_conntrack_count - INTEGER (read-only)
 
 nf_conntrack_events - BOOLEAN
 	- 0 - disabled
-	- not 0 - enabled (default)
+	- 1 - enabled
+	- 2 - auto (default)
 
 	If this option is enabled, the connection tracking code will
 	provide userspace with connection tracking events via ctnetlink.
+	The default allocates the extension if a userspace program is
+	listening to ctnetlink events.
 
 nf_conntrack_expect_max - INTEGER
 	Maximum size of expectation table.  Default value is
@@ -68,15 +70,6 @@ nf_conntrack_generic_timeout - INTEGER (seconds)
 	Default for generic timeout.  This refers to layer 4 unknown/unsupported
 	protocols.
 
-nf_conntrack_helper - BOOLEAN
-	- 0 - disabled (default)
-	- not 0 - enabled
-
-	Enable automatic conntrack helper assignment.
-	If disabled it is required to set up iptables rules to assign
-	helpers to connections.  See the CT target description in the
-	iptables-extensions(8) man page for further information.
-
 nf_conntrack_icmp_timeout - INTEGER (seconds)
 	default 30
 
@@ -100,8 +93,12 @@ nf_conntrack_log_invalid - INTEGER
 	Log invalid packets of a type specified by value.
 
 nf_conntrack_max - INTEGER
-	Size of connection tracking table.  Default value is
-	nf_conntrack_buckets value * 4.
+        Maximum number of allowed connection tracking entries. This value is set
+        to nf_conntrack_buckets by default.
+        Note that connection tracking entries are added to the table twice -- once
+        for the original direction and once for the reply direction (i.e., with
+        the reversed address). This means that with default settings a maxed-out
+        table will have a average hash chain length of 2, not 1.
 
 nf_conntrack_tcp_be_liberal - BOOLEAN
 	- 0 - disabled (default)
@@ -110,6 +107,12 @@ nf_conntrack_tcp_be_liberal - BOOLEAN
 	Be conservative in what you do, be liberal in what you accept from others.
 	If it's non-zero, we mark only out of window RST segments as INVALID.
 
+nf_conntrack_tcp_ignore_invalid_rst - BOOLEAN
+	- 0 - disabled (default)
+	- 1 - enabled
+
+	If it's 1, we don't mark out of window RST segments as INVALID.
+
 nf_conntrack_tcp_loose - BOOLEAN
 	- 0 - disabled
 	- not 0 - enabled (default)
@@ -160,6 +163,35 @@ nf_conntrack_timestamp - BOOLEAN
 
 	Enable connection tracking flow timestamping.
 
+nf_conntrack_sctp_timeout_closed - INTEGER (seconds)
+	default 10
+
+nf_conntrack_sctp_timeout_cookie_wait - INTEGER (seconds)
+	default 3
+
+nf_conntrack_sctp_timeout_cookie_echoed - INTEGER (seconds)
+	default 3
+
+nf_conntrack_sctp_timeout_established - INTEGER (seconds)
+	default 210
+
+	Default is set to (hb_interval * path_max_retrans + rto_max)
+
+nf_conntrack_sctp_timeout_shutdown_sent - INTEGER (seconds)
+	default 3
+
+nf_conntrack_sctp_timeout_shutdown_recd - INTEGER (seconds)
+	default 3
+
+nf_conntrack_sctp_timeout_shutdown_ack_sent - INTEGER (seconds)
+	default 3
+
+nf_conntrack_sctp_timeout_heartbeat_sent - INTEGER (seconds)
+	default 30
+
+	This timeout is used to setup conntrack entry on secondary paths.
+	Default is set to hb_interval.
+
 nf_conntrack_udp_timeout - INTEGER (seconds)
 	default 30
 
@@ -177,3 +209,24 @@ nf_conntrack_gre_timeout_stream - INTEGER (seconds)
 
 	This extended timeout will be used in case there is an GRE stream
 	detected.
+
+nf_hooks_lwtunnel - BOOLEAN
+	- 0 - disabled (default)
+	- not 0 - enabled
+
+	If this option is enabled, the lightweight tunnel netfilter hooks are
+	enabled. This option cannot be disabled once it is enabled.
+
+nf_flowtable_tcp_timeout - INTEGER (seconds)
+        default 30
+
+        Control offload timeout for tcp connections.
+        TCP connections may be offloaded from nf conntrack to nf flow table.
+        Once aged, the connection is returned to nf conntrack.
+
+nf_flowtable_udp_timeout - INTEGER (seconds)
+        default 30
+
+        Control offload timeout for udp connections.
+        UDP connections may be offloaded from nf conntrack to nf flow table.
+        Once aged, the connection is returned to nf conntrack.
diff --git a/Documentation/networking/nf_flowtable.rst b/Documentation/networking/nf_flowtable.rst
index b6e1fa141aae..d757c21c10f2 100644
--- a/Documentation/networking/nf_flowtable.rst
+++ b/Documentation/networking/nf_flowtable.rst
@@ -4,35 +4,38 @@
 Netfilter's flowtable infrastructure
 ====================================
 
-This documentation describes the software flowtable infrastructure available in
-Netfilter since Linux kernel 4.16.
+This documentation describes the Netfilter flowtable infrastructure which allows
+you to define a fastpath through the flowtable datapath. This infrastructure
+also provides hardware offload support. The flowtable supports for the layer 3
+IPv4 and IPv6 and the layer 4 TCP and UDP protocols.
 
 Overview
 --------
 
-Initial packets follow the classic forwarding path, once the flow enters the
-established state according to the conntrack semantics (ie. we have seen traffic
-in both directions), then you can decide to offload the flow to the flowtable
-from the forward chain via the 'flow offload' action available in nftables.
+Once the first packet of the flow successfully goes through the IP forwarding
+path, from the second packet on, you might decide to offload the flow to the
+flowtable through your ruleset. The flowtable infrastructure provides a rule
+action that allows you to specify when to add a flow to the flowtable.
 
-Packets that find an entry in the flowtable (ie. flowtable hit) are sent to the
-output netdevice via neigh_xmit(), hence, they bypass the classic forwarding
-path (the visible effect is that you do not see these packets from any of the
-netfilter hooks coming after the ingress). In case of flowtable miss, the packet
-follows the classic forward path.
+A packet that finds a matching entry in the flowtable (ie. flowtable hit) is
+transmitted to the output netdevice via neigh_xmit(), hence, packets bypass the
+classic IP forwarding path (the visible effect is that you do not see these
+packets from any of the Netfilter hooks coming after ingress). In case that
+there is no matching entry in the flowtable (ie. flowtable miss), the packet
+follows the classic IP forwarding path.
 
-The flowtable uses a resizable hashtable, lookups are based on the following
-7-tuple selectors: source, destination, layer 3 and layer 4 protocols, source
-and destination ports and the input interface (useful in case there are several
-conntrack zones in place).
+The flowtable uses a resizable hashtable. Lookups are based on the following
+n-tuple selectors: layer 2 protocol encapsulation (VLAN and PPPoE), layer 3
+source and destination, layer 4 source and destination ports and the input
+interface (useful in case there are several conntrack zones in place).
 
-Flowtables are populated via the 'flow offload' nftables action, so the user can
-selectively specify what flows are placed into the flow table. Hence, packets
-follow the classic forwarding path unless the user explicitly instruct packets
-to use this new alternative forwarding path via nftables policy.
+The 'flow add' action allows you to populate the flowtable, the user selectively
+specifies what flows are placed into the flowtable. Hence, packets follow the
+classic IP forwarding path unless the user explicitly instruct flows to use this
+new alternative forwarding path via policy.
 
-This is represented in Fig.1, which describes the classic forwarding path
-including the Netfilter hooks and the flowtable fastpath bypass.
+The flowtable datapath is represented in Fig.1, which describes the classic IP
+forwarding path including the Netfilter hooks and the flowtable fastpath bypass.
 
 ::
 
@@ -67,11 +70,13 @@ including the Netfilter hooks and the flowtable fastpath bypass.
 	       Fig.1 Netfilter hooks and flowtable interactions
 
 The flowtable entry also stores the NAT configuration, so all packets are
-mangled according to the NAT policy that matches the initial packets that went
-through the classic forwarding path. The TTL is decremented before calling
-neigh_xmit(). Fragmented traffic is passed up to follow the classic forwarding
-path given that the transport selectors are missing, therefore flowtable lookup
-is not possible.
+mangled according to the NAT policy that is specified from the classic IP
+forwarding path. The TTL is decremented before calling neigh_xmit(). Fragmented
+traffic is passed up to follow the classic IP forwarding path given that the
+transport header is missing, in this case, flowtable lookups are not possible.
+TCP RST and FIN packets are also passed up to the classic IP forwarding path to
+release the flow gracefully. Packets that exceed the MTU are also passed up to
+the classic forwarding path to report packet-too-big ICMP errors to the sender.
 
 Example configuration
 ---------------------
@@ -85,7 +90,7 @@ flowtable and add one rule to your forward chain::
 		}
 		chain y {
 			type filter hook forward priority 0; policy accept;
-			ip protocol tcp flow offload @f
+			ip protocol tcp flow add @f
 			counter packets 0 bytes 0
 		}
 	}
@@ -103,13 +108,126 @@ flow is offloaded, you will observe that the counter rule in the example above
 does not get updated for the packets that are being forwarded through the
 forwarding bypass.
 
+You can identify offloaded flows through the [OFFLOAD] tag when listing your
+connection tracking table.
+
+::
+
+	# conntrack -L
+	tcp      6 src=10.141.10.2 dst=192.168.10.2 sport=52728 dport=5201 src=192.168.10.2 dst=192.168.10.1 sport=5201 dport=52728 [OFFLOAD] mark=0 use=2
+
+
+Layer 2 encapsulation
+---------------------
+
+Since Linux kernel 5.13, the flowtable infrastructure discovers the real
+netdevice behind VLAN and PPPoE netdevices. The flowtable software datapath
+parses the VLAN and PPPoE layer 2 headers to extract the ethertype and the
+VLAN ID / PPPoE session ID which are used for the flowtable lookups. The
+flowtable datapath also deals with layer 2 decapsulation.
+
+You do not need to add the PPPoE and the VLAN devices to your flowtable,
+instead the real device is sufficient for the flowtable to track your flows.
+
+Bridge and IP forwarding
+------------------------
+
+Since Linux kernel 5.13, you can add bridge ports to the flowtable. The
+flowtable infrastructure discovers the topology behind the bridge device. This
+allows the flowtable to define a fastpath bypass between the bridge ports
+(represented as eth1 and eth2 in the example figure below) and the gateway
+device (represented as eth0) in your switch/router.
+
+::
+
+                      fastpath bypass
+               .-------------------------.
+              /                           \
+              |           IP forwarding   |
+              |          /             \ \/
+              |       br0               eth0 ..... eth0
+              .       / \                          *host B*
+               -> eth1  eth2
+                   .           *switch/router*
+                   .
+                   .
+                 eth0
+               *host A*
+
+The flowtable infrastructure also supports for bridge VLAN filtering actions
+such as PVID and untagged. You can also stack a classic VLAN device on top of
+your bridge port.
+
+If you would like that your flowtable defines a fastpath between your bridge
+ports and your IP forwarding path, you have to add your bridge ports (as
+represented by the real netdevice) to your flowtable definition.
+
+Counters
+--------
+
+The flowtable can synchronize packet and byte counters with the existing
+connection tracking entry by specifying the counter statement in your flowtable
+definition, e.g.
+
+::
+
+	table inet x {
+		flowtable f {
+			hook ingress priority 0; devices = { eth0, eth1 };
+			counter
+		}
+	}
+
+Counter support is available since Linux kernel 5.7.
+
+Hardware offload
+----------------
+
+If your network device provides hardware offload support, you can turn it on by
+means of the 'offload' flag in your flowtable definition, e.g.
+
+::
+
+	table inet x {
+		flowtable f {
+			hook ingress priority 0; devices = { eth0, eth1 };
+			flags offload;
+		}
+	}
+
+There is a workqueue that adds the flows to the hardware. Note that a few
+packets might still run over the flowtable software path until the workqueue has
+a chance to offload the flow to the network device.
+
+You can identify hardware offloaded flows through the [HW_OFFLOAD] tag when
+listing your connection tracking table. Please, note that the [OFFLOAD] tag
+refers to the software offload mode, so there is a distinction between [OFFLOAD]
+which refers to the software flowtable fastpath and [HW_OFFLOAD] which refers
+to the hardware offload datapath being used by the flow.
+
+The flowtable hardware offload infrastructure also supports for the DSA
+(Distributed Switch Architecture).
+
+Limitations
+-----------
+
+The flowtable behaves like a cache. The flowtable entries might get stale if
+either the destination MAC address or the egress netdevice that is used for
+transmission changes.
+
+This might be a problem if:
+
+- You run the flowtable in software mode and you combine bridge and IP
+  forwarding in your setup.
+- Hardware offload is enabled.
+
 More reading
 ------------
 
 This documentation is based on the LWN.net articles [1]_\ [2]_. Rafal Milecki
 also made a very complete and comprehensive summary called "A state of network
 acceleration" that describes how things were before this infrastructure was
-mailined [3]_ and it also makes a rough summary of this work [4]_.
+mainlined [3]_ and it also makes a rough summary of this work [4]_.
 
 .. [1] https://lwn.net/Articles/738214/
 .. [2] https://lwn.net/Articles/742164/
diff --git a/Documentation/networking/oa-tc6-framework.rst b/Documentation/networking/oa-tc6-framework.rst
new file mode 100644
index 000000000000..fe2aabde923a
--- /dev/null
+++ b/Documentation/networking/oa-tc6-framework.rst
@@ -0,0 +1,497 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=========================================================================
+OPEN Alliance 10BASE-T1x MAC-PHY Serial Interface (TC6) Framework Support
+=========================================================================
+
+Introduction
+------------
+
+The IEEE 802.3cg project defines two 10 Mbit/s PHYs operating over a
+single pair of conductors. The 10BASE-T1L (Clause 146) is a long reach
+PHY supporting full duplex point-to-point operation over 1 km of single
+balanced pair of conductors. The 10BASE-T1S (Clause 147) is a short reach
+PHY supporting full / half duplex point-to-point operation over 15 m of
+single balanced pair of conductors, or half duplex multidrop bus
+operation over 25 m of single balanced pair of conductors.
+
+Furthermore, the IEEE 802.3cg project defines the new Physical Layer
+Collision Avoidance (PLCA) Reconciliation Sublayer (Clause 148) meant to
+provide improved determinism to the CSMA/CD media access method. PLCA
+works in conjunction with the 10BASE-T1S PHY operating in multidrop mode.
+
+The aforementioned PHYs are intended to cover the low-speed / low-cost
+applications in industrial and automotive environment. The large number
+of pins (16) required by the MII interface, which is specified by the
+IEEE 802.3 in Clause 22, is one of the major cost factors that need to be
+addressed to fulfil this objective.
+
+The MAC-PHY solution integrates an IEEE Clause 4 MAC and a 10BASE-T1x PHY
+exposing a low pin count Serial Peripheral Interface (SPI) to the host
+microcontroller. This also enables the addition of Ethernet functionality
+to existing low-end microcontrollers which do not integrate a MAC
+controller.
+
+Overview
+--------
+
+The MAC-PHY is specified to carry both data (Ethernet frames) and control
+(register access) transactions over a single full-duplex serial peripheral
+interface.
+
+Protocol Overview
+-----------------
+
+Two types of transactions are defined in the protocol: data transactions
+for Ethernet frame transfers and control transactions for register
+read/write transfers. A chunk is the basic element of data transactions
+and is composed of 4 bytes of overhead plus 64 bytes of payload size for
+each chunk. Ethernet frames are transferred over one or more data chunks.
+Control transactions consist of one or more register read/write control
+commands.
+
+SPI transactions are initiated by the SPI host with the assertion of CSn
+low to the MAC-PHY and ends with the deassertion of CSn high. In between
+each SPI transaction, the SPI host may need time for additional
+processing and to setup the next SPI data or control transaction.
+
+SPI data transactions consist of an equal number of transmit (TX) and
+receive (RX) chunks. Chunks in both transmit and receive directions may
+or may not contain valid frame data independent from each other, allowing
+for the simultaneous transmission and reception of different length
+frames.
+
+Each transmit data chunk begins with a 32-bit data header followed by a
+data chunk payload on MOSI. The data header indicates whether transmit
+frame data is present and provides the information to determine which
+bytes of the payload contain valid frame data.
+
+In parallel, receive data chunks are received on MISO. Each receive data
+chunk consists of a data chunk payload ending with a 32-bit data footer.
+The data footer indicates if there is receive frame data present within
+the payload or not and provides the information to determine which bytes
+of the payload contain valid frame data.
+
+Reference
+---------
+
+10BASE-T1x MAC-PHY Serial Interface Specification,
+
+Link: https://opensig.org/download/document/OPEN_Alliance_10BASET1x_MAC-PHY_Serial_Interface_V1.1.pdf
+
+Hardware Architecture
+---------------------
+
+.. code-block:: none
+
+  +----------+      +-------------------------------------+
+  |          |      |                MAC-PHY              |
+  |          |<---->| +-----------+  +-------+  +-------+ |
+  | SPI Host |      | | SPI Slave |  |  MAC  |  |  PHY  | |
+  |          |      | +-----------+  +-------+  +-------+ |
+  +----------+      +-------------------------------------+
+
+Software Architecture
+---------------------
+
+.. code-block:: none
+
+  +----------------------------------------------------------+
+  |                 Networking Subsystem                     |
+  +----------------------------------------------------------+
+            / \                             / \
+             |                               |
+             |                               |
+            \ /                              |
+  +----------------------+     +-----------------------------+
+  |     MAC Driver       |<--->| OPEN Alliance TC6 Framework |
+  +----------------------+     +-----------------------------+
+            / \                             / \
+             |                               |
+             |                               |
+             |                              \ /
+  +----------------------------------------------------------+
+  |                    SPI Subsystem                         |
+  +----------------------------------------------------------+
+                          / \
+                           |
+                           |
+                          \ /
+  +----------------------------------------------------------+
+  |                10BASE-T1x MAC-PHY Device                 |
+  +----------------------------------------------------------+
+
+Implementation
+--------------
+
+MAC Driver
+~~~~~~~~~~
+
+- Probed by SPI subsystem.
+
+- Initializes OA TC6 framework for the MAC-PHY.
+
+- Registers and configures the network device.
+
+- Sends the tx ethernet frames from n/w subsystem to OA TC6 framework.
+
+OPEN Alliance TC6 Framework
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- Initializes PHYLIB interface.
+
+- Registers mac-phy interrupt.
+
+- Performs mac-phy register read/write operation using the control
+  transaction protocol specified in the OPEN Alliance 10BASE-T1x MAC-PHY
+  Serial Interface specification.
+
+- Performs Ethernet frames transaction using the data transaction protocol
+  for Ethernet frames specified in the OPEN Alliance 10BASE-T1x MAC-PHY
+  Serial Interface specification.
+
+- Forwards the received Ethernet frame from 10Base-T1x MAC-PHY to n/w
+  subsystem.
+
+Data Transaction
+~~~~~~~~~~~~~~~~
+
+The Ethernet frames that are typically transferred from the SPI host to
+the MAC-PHY will be converted into multiple transmit data chunks. Each
+transmit data chunk will have a 4 bytes header which contains the
+information needed to determine the validity and the location of the
+transmit frame data within the 64 bytes data chunk payload.
+
+.. code-block:: none
+
+  +---------------------------------------------------+
+  |                     Tx Chunk                      |
+  | +---------------------------+  +----------------+ |   MOSI
+  | | 64 bytes chunk payload    |  | 4 bytes header | |------------>
+  | +---------------------------+  +----------------+ |
+  +---------------------------------------------------+
+
+4 bytes header contains the below fields,
+
+DNC (Bit 31) - Data-Not-Control flag. This flag specifies the type of SPI
+               transaction. For TX data chunks, this bit shall be ’1’.
+               0 - Control command
+               1 - Data chunk
+
+SEQ (Bit 30) - Data Chunk Sequence. This bit is used to indicate an
+               even/odd transmit data chunk sequence to the MAC-PHY.
+
+NORX (Bit 29) - No Receive flag. The SPI host may set this bit to prevent
+                the MAC-PHY from conveying RX data on the MISO for the
+                current chunk (DV = 0 in the footer), indicating that the
+                host would not process it. Typically, the SPI host should
+                set NORX = 0 indicating that it will accept and process
+                any receive frame data within the current chunk.
+
+RSVD (Bit 28..24) - Reserved: All reserved bits shall be ‘0’.
+
+VS (Bit 23..22) - Vendor Specific. These bits are implementation specific.
+                  If the MAC-PHY does not implement these bits, the host
+                  shall set them to ‘0’.
+
+DV (Bit 21) - Data Valid flag. The SPI host uses this bit to indicate
+              whether the current chunk contains valid transmit frame data
+              (DV = 1) or not (DV = 0). When ‘0’, the MAC-PHY ignores the
+              chunk payload. Note that the receive path is unaffected by
+              the setting of the DV bit in the data header.
+
+SV (Bit 20) - Start Valid flag. The SPI host shall set this bit when the
+              beginning of an Ethernet frame is present in the current
+              transmit data chunk payload. Otherwise, this bit shall be
+              zero. This bit is not to be confused with the Start-of-Frame
+              Delimiter (SFD) byte described in IEEE 802.3 [2].
+
+SWO (Bit 19..16) - Start Word Offset. When SV = 1, this field shall
+                   contain the 32-bit word offset into the transmit data
+                   chunk payload that points to the start of a new
+                   Ethernet frame to be transmitted. The host shall write
+                   this field as zero when SV = 0.
+
+RSVD (Bit 15) - Reserved: All reserved bits shall be ‘0’.
+
+EV (Bit 14) - End Valid flag. The SPI host shall set this bit when the end
+              of an Ethernet frame is present in the current transmit data
+              chunk payload. Otherwise, this bit shall be zero.
+
+EBO (Bit 13..8) - End Byte Offset. When EV = 1, this field shall contain
+                  the byte offset into the transmit data chunk payload
+                  that points to the last byte of the Ethernet frame to
+                  transmit. This field shall be zero when EV = 0.
+
+TSC (Bit 7..6) - Timestamp Capture. Request a timestamp capture when the
+                 frame is transmitted onto the network.
+                 00 - Do not capture a timestamp
+                 01 - Capture timestamp into timestamp capture register A
+                 10 - Capture timestamp into timestamp capture register B
+                 11 - Capture timestamp into timestamp capture register C
+
+RSVD (Bit 5..1) - Reserved: All reserved bits shall be ‘0’.
+
+P (Bit 0) - Parity. Parity bit calculated over the transmit data header.
+            Method used is odd parity.
+
+The number of buffers available in the MAC-PHY to store the incoming
+transmit data chunk payloads is represented as transmit credits. The
+available transmit credits in the MAC-PHY can be read either from the
+Buffer Status Register or footer (Refer below for the footer info)
+received from the MAC-PHY. The SPI host should not write more data chunks
+than the available transmit credits as this will lead to transmit buffer
+overflow error.
+
+In case the previous data footer had no transmit credits available and
+once the transmit credits become available for transmitting transmit data
+chunks, the MAC-PHY interrupt is asserted to SPI host. On reception of the
+first data header this interrupt will be deasserted and the received
+footer for the first data chunk will have the transmit credits available
+information.
+
+The Ethernet frames that are typically transferred from MAC-PHY to SPI
+host will be sent as multiple receive data chunks. Each receive data
+chunk will have 64 bytes of data chunk payload followed by 4 bytes footer
+which contains the information needed to determine the validity and the
+location of the receive frame data within the 64 bytes data chunk payload.
+
+.. code-block:: none
+
+  +---------------------------------------------------+
+  |                     Rx Chunk                      |
+  | +----------------+  +---------------------------+ |   MISO
+  | | 4 bytes footer |  | 64 bytes chunk payload    | |------------>
+  | +----------------+  +---------------------------+ |
+  +---------------------------------------------------+
+
+4 bytes footer contains the below fields,
+
+EXST (Bit 31) - Extended Status. This bit is set when any bit in the
+                STATUS0 or STATUS1 registers are set and not masked.
+
+HDRB (Bit 30) - Received Header Bad. When set, indicates that the MAC-PHY
+                received a control or data header with a parity error.
+
+SYNC (Bit 29) - Configuration Synchronized flag. This bit reflects the
+                state of the SYNC bit in the CONFIG0 configuration
+                register (see Table 12). A zero indicates that the MAC-PHY
+                configuration may not be as expected by the SPI host.
+                Following configuration, the SPI host sets the
+                corresponding bitin the configuration register which is
+                reflected in this field.
+
+RCA (Bit 28..24) - Receive Chunks Available. The RCA field indicates to
+                   the SPI host the minimum number of additional receive
+                   data chunks of frame data that are available for
+                   reading beyond the current receive data chunk. This
+                   field is zero when there is no receive frame data
+                   pending in the MAC-PHY’s buffer for reading.
+
+VS (Bit 23..22) - Vendor Specific. These bits are implementation specific.
+                  If not implemented, the MAC-PHY shall set these bits to
+                  ‘0’.
+
+DV (Bit 21) - Data Valid flag. The MAC-PHY uses this bit to indicate
+              whether the current receive data chunk contains valid
+              receive frame data (DV = 1) or not (DV = 0). When ‘0’, the
+              SPI host shall ignore the chunk payload.
+
+SV (Bit 20) - Start Valid flag. The MAC-PHY sets this bit when the current
+              chunk payload contains the start of an Ethernet frame.
+              Otherwise, this bit is zero. The SV bit is not to be
+              confused with the Start-of-Frame Delimiter (SFD) byte
+              described in IEEE 802.3 [2].
+
+SWO (Bit 19..16) - Start Word Offset. When SV = 1, this field contains the
+                   32-bit word offset into the receive data chunk payload
+                   containing the first byte of a new received Ethernet
+                   frame. When a receive timestamp has been added to the
+                   beginning of the received Ethernet frame (RTSA = 1)
+                   then SWO points to the most significant byte of the
+                   timestamp. This field will be zero when SV = 0.
+
+FD (Bit 15) - Frame Drop. When set, this bit indicates that the MAC has
+              detected a condition for which the SPI host should drop the
+              received Ethernet frame. This bit is only valid at the end
+              of a received Ethernet frame (EV = 1) and shall be zero at
+              all other times.
+
+EV (Bit 14) - End Valid flag. The MAC-PHY sets this bit when the end of a
+              received Ethernet frame is present in this receive data
+              chunk payload.
+
+EBO (Bit 13..8) - End Byte Offset: When EV = 1, this field contains the
+                  byte offset into the receive data chunk payload that
+                  locates the last byte of the received Ethernet frame.
+                  This field is zero when EV = 0.
+
+RTSA (Bit 7) - Receive Timestamp Added. This bit is set when a 32-bit or
+               64-bit timestamp has been added to the beginning of the
+               received Ethernet frame. The MAC-PHY shall set this bit to
+               zero when SV = 0.
+
+RTSP (Bit 6) - Receive Timestamp Parity. Parity bit calculated over the
+               32-bit/64-bit timestamp added to the beginning of the
+               received Ethernet frame. Method used is odd parity. The
+               MAC-PHY shall set this bit to zero when RTSA = 0.
+
+TXC (Bit 5..1) - Transmit Credits. This field contains the minimum number
+                 of transmit data chunks of frame data that the SPI host
+                 can write in a single transaction without incurring a
+                 transmit buffer overflow error.
+
+P (Bit 0) - Parity. Parity bit calculated over the receive data footer.
+            Method used is odd parity.
+
+SPI host will initiate the data receive transaction based on the receive
+chunks available in the MAC-PHY which is provided in the receive chunk
+footer (RCA - Receive Chunks Available). SPI host will create data invalid
+transmit data chunks (empty chunks) or data valid transmit data chunks in
+case there are valid Ethernet frames to transmit to the MAC-PHY. The
+receive chunks available in MAC-PHY can be read either from the Buffer
+Status Register or footer.
+
+In case the previous data footer had no receive data chunks available and
+once the receive data chunks become available again for reading, the
+MAC-PHY interrupt is asserted to SPI host. On reception of the first data
+header this interrupt will be deasserted and the received footer for the
+first data chunk will have the receive chunks available information.
+
+MAC-PHY Interrupt
+~~~~~~~~~~~~~~~~~
+
+The MAC-PHY interrupt is asserted when the following conditions are met.
+
+Receive chunks available - This interrupt is asserted when the previous
+data footer had no receive data chunks available and once the receive
+data chunks become available for reading. On reception of the first data
+header this interrupt will be deasserted.
+
+Transmit chunk credits available - This interrupt is asserted when the
+previous data footer indicated no transmit credits available and once the
+transmit credits become available for transmitting transmit data chunks.
+On reception of the first data header this interrupt will be deasserted.
+
+Extended status event - This interrupt is asserted when the previous data
+footer indicated no extended status and once the extended event become
+available. In this case the host should read status #0 register to know
+the corresponding error/event. On reception of the first data header this
+interrupt will be deasserted.
+
+Control Transaction
+~~~~~~~~~~~~~~~~~~~
+
+4 bytes control header contains the below fields,
+
+DNC (Bit 31) - Data-Not-Control flag. This flag specifies the type of SPI
+               transaction. For control commands, this bit shall be ‘0’.
+               0 - Control command
+               1 - Data chunk
+
+HDRB (Bit 30) - Received Header Bad. When set by the MAC-PHY, indicates
+                that a header was received with a parity error. The SPI
+                host should always clear this bit. The MAC-PHY ignores the
+                HDRB value sent by the SPI host on MOSI.
+
+WNR (Bit 29) - Write-Not-Read. This bit indicates if data is to be written
+               to registers (when set) or read from registers
+               (when clear).
+
+AID (Bit 28) - Address Increment Disable. When clear, the address will be
+               automatically post-incremented by one following each
+               register read or write. When set, address auto increment is
+               disabled allowing successive reads and writes to occur at
+               the same register address.
+
+MMS (Bit 27..24) - Memory Map Selector. This field selects the specific
+                   register memory map to access.
+
+ADDR (Bit 23..8) - Address. Address of the first register within the
+                   selected memory map to access.
+
+LEN (Bit 7..1) - Length. Specifies the number of registers to read/write.
+                 This field is interpreted as the number of registers
+                 minus 1 allowing for up to 128 consecutive registers read
+                 or written starting at the address specified in ADDR. A
+                 length of zero shall read or write a single register.
+
+P (Bit 0) - Parity. Parity bit calculated over the control command header.
+            Method used is odd parity.
+
+Control transactions consist of one or more control commands. Control
+commands are used by the SPI host to read and write registers within the
+MAC-PHY. Each control commands are composed of a 4 bytes control command
+header followed by register write data in case of control write command.
+
+The MAC-PHY ignores the final 4 bytes of data from the SPI host at the end
+of the control write command. The control write command is also echoed
+from the MAC-PHY back to the SPI host to identify which register write
+failed in case of any bus errors. The echoed Control write command will
+have the first 4 bytes unused value to be ignored by the SPI host
+followed by 4 bytes echoed control header followed by echoed register
+write data. Control write commands can write either a single register or
+multiple consecutive registers. When multiple consecutive registers are
+written, the address is automatically post-incremented by the MAC-PHY.
+Writing to any unimplemented or undefined registers shall be ignored and
+yield no effect.
+
+The MAC-PHY ignores all data from the SPI host following the control
+header for the remainder of the control read command. The control read
+command is also echoed from the MAC-PHY back to the SPI host to identify
+which register read is failed in case of any bus errors. The echoed
+Control read command will have the first 4 bytes of unused value to be
+ignored by the SPI host followed by 4 bytes echoed control header followed
+by register read data. Control read commands can read either a single
+register or multiple consecutive registers. When multiple consecutive
+registers are read, the address is automatically post-incremented by the
+MAC-PHY. Reading any unimplemented or undefined registers shall return
+zero.
+
+Device drivers API
+==================
+
+The include/linux/oa_tc6.h defines the following functions:
+
+.. c:function:: struct oa_tc6 *oa_tc6_init(struct spi_device *spi, \
+                                           struct net_device *netdev)
+
+Initialize OA TC6 lib.
+
+.. c:function:: void oa_tc6_exit(struct oa_tc6 *tc6)
+
+Free allocated OA TC6 lib.
+
+.. c:function:: int oa_tc6_write_register(struct oa_tc6 *tc6, u32 address, \
+                                          u32 value)
+
+Write a single register in the MAC-PHY.
+
+.. c:function:: int oa_tc6_write_registers(struct oa_tc6 *tc6, u32 address, \
+                                           u32 value[], u8 length)
+
+Writing multiple consecutive registers starting from @address in the MAC-PHY.
+Maximum of 128 consecutive registers can be written starting at @address.
+
+.. c:function:: int oa_tc6_read_register(struct oa_tc6 *tc6, u32 address, \
+                                         u32 *value)
+
+Read a single register in the MAC-PHY.
+
+.. c:function:: int oa_tc6_read_registers(struct oa_tc6 *tc6, u32 address, \
+                                          u32 value[], u8 length)
+
+Reading multiple consecutive registers starting from @address in the MAC-PHY.
+Maximum of 128 consecutive registers can be read starting at @address.
+
+.. c:function:: netdev_tx_t oa_tc6_start_xmit(struct oa_tc6 *tc6, \
+                                              struct sk_buff *skb);
+
+The transmit Ethernet frame in the skb is or going to be transmitted through
+the MAC-PHY.
+
+.. c:function:: int oa_tc6_zero_align_receive_frame_enable(struct oa_tc6 *tc6);
+
+Zero align receive frame feature can be enabled to align all receive ethernet
+frames data to start at the beginning of any receive data chunk payload with a
+start word offset (SWO) of zero.
diff --git a/Documentation/networking/operstates.rst b/Documentation/networking/operstates.rst
index 9c918f7cb0e8..1ee2141e8ef1 100644
--- a/Documentation/networking/operstates.rst
+++ b/Documentation/networking/operstates.rst
@@ -73,7 +73,9 @@ IF_OPER_LOWERLAYERDOWN (3):
  state (f.e. VLAN).
 
 IF_OPER_TESTING (4):
- Unused in current kernel.
+ Interface is in testing mode, for example executing driver self-tests
+ or media (cable) test. It can't be used for normal traffic until tests
+ complete.
 
 IF_OPER_DORMANT (5):
  Interface is L1 up, but waiting for an external event, f.e. for a
@@ -111,7 +113,7 @@ it as lower layer.
 
 Note that for certain kind of soft-devices, which are not managing any
 real hardware, it is possible to set this bit from userspace.  One
-should use TVL IFLA_CARRIER to do so.
+should use TLV IFLA_CARRIER to do so.
 
 netif_carrier_ok() can be used to query that bit.
 
diff --git a/Documentation/networking/packet_mmap.rst b/Documentation/networking/packet_mmap.rst
index 6c009ceb1183..02370786e77b 100644
--- a/Documentation/networking/packet_mmap.rst
+++ b/Documentation/networking/packet_mmap.rst
@@ -8,7 +8,7 @@ Abstract
 ========
 
 This file documents the mmap() facility available with the PACKET
-socket interface on 2.4/2.6/3.x kernels. This type of sockets is used for
+socket interface. This type of sockets is used for
 
 i) capture network traffic with utilities like tcpdump,
 ii) transmit network traffic, or any other that needs raw
@@ -16,7 +16,7 @@ ii) transmit network traffic, or any other that needs raw
 
 Howto can be found at:
 
-    https://sites.google.com/site/packetmmap/
+    https://web.archive.org/web/20220404160947/https://sites.google.com/site/packetmmap/
 
 Please send your comments to
     - Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
@@ -25,12 +25,12 @@ Please send your comments to
 Why use PACKET_MMAP
 ===================
 
-In Linux 2.4/2.6/3.x if PACKET_MMAP is not enabled, the capture process is very
+Non PACKET_MMAP capture process (plain AF_PACKET) is very
 inefficient. It uses very limited buffers and requires one system call to
 capture each packet, it requires two if you want to get packet's timestamp
 (like libpcap always does).
 
-In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size
+On the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size
 configurable circular buffer mapped in user space that can be used to either
 send or receive packets. This way reading packets just needs to wait for them,
 most of the time there is no need to issue a single system call. Concerning
@@ -153,7 +153,7 @@ As capture, each frame contains two parts::
     struct ifreq s_ifr;
     ...
 
-    strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));
+    strscpy_pad (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));
 
     /* get interface index of eth0 */
     ioctl(this->socket, SIOCGIFINDEX, &s_ifr);
@@ -166,7 +166,8 @@ As capture, each frame contains two parts::
     /* bind socket to eth0 */
     bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));
 
- A complete tutorial is available at: https://sites.google.com/site/packetmmap/
+ A complete tutorial is available at:
+ https://web.archive.org/web/20220404160947/https://sites.google.com/site/packetmmap/
 
 By default, the user should put data at::
 
@@ -252,8 +253,7 @@ PACKET_MMAP setting constraints
 
 In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch),
 the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or
-16384 in a 64 bit architecture. For information on these kernel versions
-see http://pusa.uv.es/~ulisses/packet_mmap/packet_mmap.pre-2.4.26_2.6.5.txt
+16384 in a 64 bit architecture.
 
 Block size limit
 ----------------
@@ -264,20 +264,20 @@ the name indicates, this function allocates pages of memory, and the second
 argument is "order" or a power of two number of pages, that is
 (for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes,
 order=2 ==> 16384 bytes, etc. The maximum size of a
-region allocated by __get_free_pages is determined by the MAX_ORDER macro. More
-precisely the limit can be calculated as::
+region allocated by __get_free_pages is determined by the MAX_PAGE_ORDER macro.
+More precisely the limit can be calculated as::
 
-   PAGE_SIZE << MAX_ORDER
+   PAGE_SIZE << MAX_PAGE_ORDER
 
    In a i386 architecture PAGE_SIZE is 4096 bytes
-   In a 2.4/i386 kernel MAX_ORDER is 10
-   In a 2.6/i386 kernel MAX_ORDER is 11
+   In a 2.4/i386 kernel MAX_PAGE_ORDER is 10
+   In a 2.6/i386 kernel MAX_PAGE_ORDER is 11
 
 So get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel
 respectively, with an i386 architecture.
 
 User space programs can include /usr/include/sys/user.h and
-/usr/include/linux/mmzone.h to get PAGE_SIZE MAX_ORDER declarations.
+/usr/include/linux/mmzone.h to get PAGE_SIZE MAX_PAGE_ORDER declarations.
 
 The pagesize can also be determined dynamically with the getpagesize (2)
 system call.
@@ -325,7 +325,7 @@ Definitions:
 		(see /proc/slabinfo)
 <pointer size>  depends on the architecture -- ``sizeof(void *)``
 <page size>     depends on the architecture -- PAGE_SIZE or getpagesize (2)
-<max-order>     is the value defined with MAX_ORDER
+<max-order>     is the value defined with MAX_PAGE_ORDER
 <frame size>    it's an upper bound of frame's capture size (more on this later)
 ==============  ================================================================
 
@@ -437,7 +437,7 @@ and the following flags apply:
 Capture process
 ^^^^^^^^^^^^^^^
 
-     from include/linux/if_packet.h
+From include/linux/if_packet.h::
 
      #define TP_STATUS_COPY          (1 << 1)
      #define TP_STATUS_LOSING        (1 << 2)
@@ -756,7 +756,7 @@ AF_PACKET TPACKET_V3 example
 ============================
 
 AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame
-sizes by doing it's own memory management. It is based on blocks where polling
+sizes by doing its own memory management. It is based on blocks where polling
 works on a per block basis instead of per ring as in TPACKET_V2 and predecessor.
 
 It is said that TPACKET_V3 brings the following benefits:
diff --git a/Documentation/networking/page_pool.rst b/Documentation/networking/page_pool.rst
index 43088ddf95e4..9d958128a57c 100644
--- a/Documentation/networking/page_pool.rst
+++ b/Documentation/networking/page_pool.rst
@@ -4,22 +4,8 @@
 Page Pool API
 =============
 
-The page_pool allocator is optimized for the XDP mode that uses one frame
-per-page, but it can fallback on the regular page allocator APIs.
-
-Basic use involves replacing alloc_pages() calls with the
-page_pool_alloc_pages() call.  Drivers should use page_pool_dev_alloc_pages()
-replacing dev_alloc_pages().
-
-API keeps track of inflight pages, in order to let API user know
-when it is safe to free a page_pool object.  Thus, API users
-must run page_pool_release_page() when a page is leaving the page_pool or
-call page_pool_put_page() where appropriate in order to maintain correct
-accounting.
-
-API user must call page_pool_put_page() once on a page, as it
-will either recycle the page, or in case of refcnt > 1, it will
-release the DMA mapping and inflight state accounting.
+.. kernel-doc:: include/net/page_pool/helpers.h
+   :doc: page_pool allocator
 
 Architecture overview
 =====================
@@ -55,6 +41,11 @@ Architecture overview
                           |   Fast cache    |     |  ptr-ring cache  |
                           +-----------------+     +------------------+
 
+Monitoring
+==========
+Information about page pools on the system can be accessed via the netdev
+genetlink family (see Documentation/netlink/specs/netdev.yaml).
+
 API interface
 =============
 The number of pools created **must** match the number of hardware queues
@@ -64,38 +55,71 @@ This lockless guarantee naturally comes from running under a NAPI softirq.
 The protection doesn't strictly have to be NAPI, any guarantee that allocating
 a page will cause no race conditions is enough.
 
-* page_pool_create(): Create a pool.
-    * flags:      PP_FLAG_DMA_MAP, PP_FLAG_DMA_SYNC_DEV
-    * order:      2^order pages on allocation
-    * pool_size:  size of the ptr_ring
-    * nid:        preferred NUMA node for allocation
-    * dev:        struct device. Used on DMA operations
-    * dma_dir:    DMA direction
-    * max_len:    max DMA sync memory size
-    * offset:     DMA address offset
-
-* page_pool_put_page(): The outcome of this depends on the page refcnt. If the
-  driver bumps the refcnt > 1 this will unmap the page. If the page refcnt is 1
-  the allocator owns the page and will try to recycle it in one of the pool
-  caches. If PP_FLAG_DMA_SYNC_DEV is set, the page will be synced for_device
-  using dma_sync_single_range_for_device().
-
-* page_pool_put_full_page(): Similar to page_pool_put_page(), but will DMA sync
-  for the entire memory area configured in area pool->max_len.
-
-* page_pool_recycle_direct(): Similar to page_pool_put_full_page() but caller
-  must guarantee safe context (e.g NAPI), since it will recycle the page
-  directly into the pool fast cache.
-
-* page_pool_release_page(): Unmap the page (if mapped) and account for it on
-  inflight counters.
-
-* page_pool_dev_alloc_pages(): Get a page from the page allocator or page_pool
-  caches.
-
-* page_pool_get_dma_addr(): Retrieve the stored DMA address.
-
-* page_pool_get_dma_dir(): Retrieve the stored DMA direction.
+.. kernel-doc:: net/core/page_pool.c
+   :identifiers: page_pool_create
+
+.. kernel-doc:: include/net/page_pool/types.h
+   :identifiers: struct page_pool_params
+
+.. kernel-doc:: include/net/page_pool/helpers.h
+   :identifiers: page_pool_put_page page_pool_put_full_page
+		 page_pool_recycle_direct page_pool_free_va
+		 page_pool_dev_alloc_pages page_pool_dev_alloc_frag
+		 page_pool_dev_alloc page_pool_dev_alloc_va
+		 page_pool_get_dma_addr page_pool_get_dma_dir
+
+.. kernel-doc:: net/core/page_pool.c
+   :identifiers: page_pool_put_page_bulk page_pool_get_stats
+
+DMA sync
+--------
+Driver is always responsible for syncing the pages for the CPU.
+Drivers may choose to take care of syncing for the device as well
+or set the ``PP_FLAG_DMA_SYNC_DEV`` flag to request that pages
+allocated from the page pool are already synced for the device.
+
+If ``PP_FLAG_DMA_SYNC_DEV`` is set, the driver must inform the core what portion
+of the buffer has to be synced. This allows the core to avoid syncing the entire
+page when the drivers knows that the device only accessed a portion of the page.
+
+Most drivers will reserve headroom in front of the frame. This part
+of the buffer is not touched by the device, so to avoid syncing
+it drivers can set the ``offset`` field in struct page_pool_params
+appropriately.
+
+For pages recycled on the XDP xmit and skb paths the page pool will
+use the ``max_len`` member of struct page_pool_params to decide how
+much of the page needs to be synced (starting at ``offset``).
+When directly freeing pages in the driver (page_pool_put_page())
+the ``dma_sync_size`` argument specifies how much of the buffer needs
+to be synced.
+
+If in doubt set ``offset`` to 0, ``max_len`` to ``PAGE_SIZE`` and
+pass -1 as ``dma_sync_size``. That combination of arguments is always
+correct.
+
+Note that the syncing parameters are for the entire page.
+This is important to remember when using fragments (``PP_FLAG_PAGE_FRAG``),
+where allocated buffers may be smaller than a full page.
+Unless the driver author really understands page pool internals
+it's recommended to always use ``offset = 0``, ``max_len = PAGE_SIZE``
+with fragmented page pools.
+
+Stats API and structures
+------------------------
+If the kernel is configured with ``CONFIG_PAGE_POOL_STATS=y``, the API
+page_pool_get_stats() and structures described below are available.
+It takes a  pointer to a ``struct page_pool`` and a pointer to a struct
+page_pool_stats allocated by the caller.
+
+Older drivers expose page pool statistics via ethtool or debugfs.
+The same statistics are accessible via the netlink netdev family
+in a driver-independent fashion.
+
+.. kernel-doc:: include/net/page_pool/types.h
+   :identifiers: struct page_pool_recycle_stats
+		 struct page_pool_alloc_stats
+		 struct page_pool_stats
 
 Coding examples
 ===============
@@ -116,6 +140,7 @@ Registration
     pp_params.pool_size = DESC_NUM;
     pp_params.nid = NUMA_NO_NODE;
     pp_params.dev = priv->dev;
+    pp_params.napi = napi; /* only if locking is tied to NAPI */
     pp_params.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
     page_pool = page_pool_create(&pp_params);
 
@@ -144,11 +169,26 @@ NAPI poller
             if XDP_DROP:
                 page_pool_recycle_direct(page_pool, page);
         } else (packet_is_skb) {
-            page_pool_release_page(page_pool, page);
+            skb_mark_for_recycle(skb);
             new_page = page_pool_dev_alloc_pages(page_pool);
         }
     }
 
+Stats
+-----
+
+.. code-block:: c
+
+	#ifdef CONFIG_PAGE_POOL_STATS
+	/* retrieve stats */
+	struct page_pool_stats stats = { 0 };
+	if (page_pool_get_stats(page_pool, &stats)) {
+		/* perhaps the driver reports statistics with ethool */
+		ethtool_print_allocation_stats(&stats.alloc_stats);
+		ethtool_print_recycle_stats(&stats.recycle_stats);
+	}
+	#endif
+
 Driver unload
 -------------
 
diff --git a/Documentation/networking/phonet.rst b/Documentation/networking/phonet.rst
index 8668dcbc5e6a..d705cc5b09fc 100644
--- a/Documentation/networking/phonet.rst
+++ b/Documentation/networking/phonet.rst
@@ -131,7 +131,7 @@ Phonet resources, as follow::
 Subscription is similarly cancelled using the SIOCPNDELRESOURCE I/O
 control request, or when the socket is closed.
 
-Note that no more than one socket can be subcribed to any given
+Note that no more than one socket can be subscribed to any given
 resource at a time. If not, ioctl() will return EBUSY.
 
 
diff --git a/Documentation/networking/phy-link-topology.rst b/Documentation/networking/phy-link-topology.rst
new file mode 100644
index 000000000000..4dec5d7d6513
--- /dev/null
+++ b/Documentation/networking/phy-link-topology.rst
@@ -0,0 +1,121 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _phy_link_topology:
+
+=================
+PHY link topology
+=================
+
+Overview
+========
+
+The PHY link topology representation in the networking stack aims at representing
+the hardware layout for any given Ethernet link.
+
+An Ethernet interface from userspace's point of view is nothing but a
+:c:type:`struct net_device <net_device>`, which exposes configuration options
+through the legacy ioctls and the ethtool netlink commands. The base assumption
+when designing these configuration APIs were that the link looks something like ::
+
+  +-----------------------+        +----------+      +--------------+
+  | Ethernet Controller / |        | Ethernet |      | Connector /  |
+  |       MAC             | ------ |   PHY    | ---- |    Port      | ---... to LP
+  +-----------------------+        +----------+      +--------------+
+  struct net_device               struct phy_device
+
+Commands that needs to configure the PHY will go through the net_device.phydev
+field to reach the PHY and perform the relevant configuration.
+
+This assumption falls apart in more complex topologies that can arise when,
+for example, using SFP transceivers (although that's not the only specific case).
+
+Here, we have 2 basic scenarios. Either the MAC is able to output a serialized
+interface, that can directly be fed to an SFP cage, such as SGMII, 1000BaseX,
+10GBaseR, etc.
+
+The link topology then looks like this (when an SFP module is inserted) ::
+
+  +-----+  SGMII  +------------+
+  | MAC | ------- | SFP Module |
+  +-----+         +------------+
+
+Knowing that some modules embed a PHY, the actual link is more like ::
+
+  +-----+  SGMII   +--------------+
+  | MAC | -------- | PHY (on SFP) |
+  +-----+          +--------------+
+
+In this case, the SFP PHY is handled by phylib, and registered by phylink through
+its SFP upstream ops.
+
+Now some Ethernet controllers aren't able to output a serialized interface, so
+we can't directly connect them to an SFP cage. However, some PHYs can be used
+as media-converters, to translate the non-serialized MAC MII interface to a
+serialized MII interface fed to the SFP ::
+
+  +-----+  RGMII  +-----------------------+  SGMII  +--------------+
+  | MAC | ------- | PHY (media converter) | ------- | PHY (on SFP) |
+  +-----+         +-----------------------+         +--------------+
+
+This is where the model of having a single net_device.phydev pointer shows its
+limitations, as we now have 2 PHYs on the link.
+
+The phy_link topology framework aims at providing a way to keep track of every
+PHY on the link, for use by both kernel drivers and subsystems, but also to
+report the topology to userspace, allowing to target individual PHYs in configuration
+commands.
+
+API
+===
+
+The :c:type:`struct phy_link_topology <phy_link_topology>` is a per-netdevice
+resource, that gets initialized at netdevice creation. Once it's initialized,
+it is then possible to register PHYs to the topology through :
+
+:c:func:`phy_link_topo_add_phy`
+
+Besides registering the PHY to the topology, this call will also assign a unique
+index to the PHY, which can then be reported to userspace to refer to this PHY
+(akin to the ifindex). This index is a u32, ranging from 1 to U32_MAX. The value
+0 is reserved to indicate the PHY doesn't belong to any topology yet.
+
+The PHY can then be removed from the topology through
+
+:c:func:`phy_link_topo_del_phy`
+
+These function are already hooked into the phylib subsystem, so all PHYs that
+are linked to a net_device through :c:func:`phy_attach_direct` will automatically
+join the netdev's topology.
+
+PHYs that are on a SFP module will also be automatically registered IF the SFP
+upstream is phylink (so, no media-converter).
+
+PHY drivers that can be used as SFP upstream need to call :c:func:`phy_sfp_attach_phy`
+and :c:func:`phy_sfp_detach_phy`, which can be used as a
+.attach_phy / .detach_phy implementation for the
+:c:type:`struct sfp_upstream_ops <sfp_upstream_ops>`.
+
+UAPI
+====
+
+There exist a set of netlink commands to query the link topology from userspace,
+see ``Documentation/networking/ethtool-netlink.rst``.
+
+The whole point of having a topology representation is to assign the phyindex
+field in :c:type:`struct phy_device <phy_device>`. This index is reported to
+userspace using the ``ETHTOOL_MSG_PHY_GET`` ethtnl command. Performing a DUMP operation
+will result in all PHYs from all net_device being listed. The DUMP command
+accepts either a ``ETHTOOL_A_HEADER_DEV_INDEX`` or ``ETHTOOL_A_HEADER_DEV_NAME``
+to be passed in the request to filter the DUMP to a single net_device.
+
+The retrieved index can then be passed as a request parameter using the
+``ETHTOOL_A_HEADER_PHY_INDEX`` field in the following ethnl commands :
+
+* ``ETHTOOL_MSG_STRSET_GET`` to get the stats string set from a given PHY
+* ``ETHTOOL_MSG_CABLE_TEST_ACT`` and ``ETHTOOL_MSG_CABLE_TEST_ACT``, to perform
+  cable testing on a given PHY on the link (most likely the outermost PHY)
+* ``ETHTOOL_MSG_PSE_SET`` and ``ETHTOOL_MSG_PSE_GET`` for PHY-controlled PoE and PSE settings
+* ``ETHTOOL_MSG_PLCA_GET_CFG``, ``ETHTOOL_MSG_PLCA_SET_CFG`` and ``ETHTOOL_MSG_PLCA_GET_STATUS``
+  to set the PLCA (Physical Layer Collision Avoidance) parameters
+
+Note that the PHY index can be passed to other requests, which will silently
+ignore it if present and irrelevant.
diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst
index 256106054c8c..f64641417c54 100644
--- a/Documentation/networking/phy.rst
+++ b/Documentation/networking/phy.rst
@@ -80,8 +80,8 @@ values of phy_interface_t must be understood from the perspective of the PHY
 device itself, leading to the following:
 
 * PHY_INTERFACE_MODE_RGMII: the PHY is not responsible for inserting any
-  internal delay by itself, it assumes that either the Ethernet MAC (if capable
-  or the PCB traces) insert the correct 1.5-2ns delay
+  internal delay by itself, it assumes that either the Ethernet MAC (if capable)
+  or the PCB traces insert the correct 1.5-2ns delay
 
 * PHY_INTERFACE_MODE_RGMII_TXID: the PHY should insert an internal delay
   for the transmit data lines (TXD[3:0]) processed by the PHY device
@@ -104,7 +104,7 @@ Whenever possible, use the PHY side RGMII delay for these reasons:
 
 * PHY device drivers in PHYLIB being reusable by nature, being able to
   configure correctly a specified delay enables more designs with similar delay
-  requirements to be operate correctly
+  requirements to be operated correctly
 
 For cases where the PHY is not capable of providing this delay, but the
 Ethernet MAC driver is capable of doing so, the correct phy_interface_t value
@@ -120,7 +120,7 @@ required delays, as defined per the RGMII standard, several options may be
 available:
 
 * Some SoCs may offer a pin pad/mux/controller capable of configuring a given
-  set of pins'strength, delays, and voltage; and it may be a suitable
+  set of pins' strength, delays, and voltage; and it may be a suitable
   option to insert the expected 2ns RGMII delay.
 
 * Modifying the PCB design to include a fixed delay (e.g: using a specifically
@@ -216,7 +216,7 @@ put into an unsupported state.
 Lastly, once the controller is ready to handle network traffic, you call
 phy_start(phydev).  This tells the PAL that you are ready, and configures the
 PHY to connect to the network. If the MAC interrupt of your network driver
-also handles PHY status changes, just set phydev->irq to PHY_IGNORE_INTERRUPT
+also handles PHY status changes, just set phydev->irq to PHY_MAC_INTERRUPT
 before you call phy_start and use phy_mac_interrupt() from the network
 driver. If you don't want to use interrupts, set phydev->irq to PHY_POLL.
 phy_start() enables the PHY interrupts (if applicable) and starts the
@@ -237,6 +237,11 @@ negotiation results.
 
 Some of the interface modes are described below:
 
+``PHY_INTERFACE_MODE_SMII``
+    This is serial MII, clocked at 125MHz, supporting 100M and 10M speeds.
+    Some details can be found in
+    https://opencores.org/ocsvn/smii/smii/trunk/doc/SMII.pdf
+
 ``PHY_INTERFACE_MODE_1000BASEX``
     This defines the 1000BASE-X single-lane serdes link as defined by the
     802.3 standard section 36.  The link operates at a fixed bit rate of
@@ -247,8 +252,8 @@ Some of the interface modes are described below:
     speeds (see below.)
 
 ``PHY_INTERFACE_MODE_2500BASEX``
-    This defines a variant of 1000BASE-X which is clocked 2.5 times faster,
-    than the 802.3 standard giving a fixed bit rate of 3.125Gbaud.
+    This defines a variant of 1000BASE-X which is clocked 2.5 times as fast
+    as the 802.3 standard, giving a fixed bit rate of 3.125Gbaud.
 
 ``PHY_INTERFACE_MODE_SGMII``
     This is used for Cisco SGMII, which is a modification of 1000BASE-X
@@ -267,6 +272,12 @@ Some of the interface modes are described below:
     duplex, pause or other settings.  This is dependent on the MAC and/or
     PHY behaviour.
 
+``PHY_INTERFACE_MODE_5GBASER``
+    This is the IEEE 802.3 Clause 129 defined 5GBASE-R protocol. It is
+    identical to the 10GBASE-R protocol defined in Clause 49, with the
+    exception that it operates at half the frequency. Please refer to the
+    IEEE standard for the definition.
+
 ``PHY_INTERFACE_MODE_10GBASER``
     This is the IEEE 802.3 Clause 49 defined 10GBASE-R protocol used with
     various different mediums. Please refer to the IEEE standard for a
@@ -286,6 +297,42 @@ Some of the interface modes are described below:
     Note: due to legacy usage, some 10GBASE-R usage incorrectly makes
     use of this definition.
 
+``PHY_INTERFACE_MODE_25GBASER``
+    This is the IEEE 802.3 PCS Clause 107 defined 25GBASE-R protocol.
+    The PCS is identical to 10GBASE-R, i.e. 64B/66B encoded
+    running 2.5 as fast, giving a fixed bit rate of 25.78125 Gbaud.
+    Please refer to the IEEE standard for further information.
+
+``PHY_INTERFACE_MODE_100BASEX``
+    This defines IEEE 802.3 Clause 24.  The link operates at a fixed data
+    rate of 125Mpbs using a 4B/5B encoding scheme, resulting in an underlying
+    data rate of 100Mpbs.
+
+``PHY_INTERFACE_MODE_QUSGMII``
+    This defines the Cisco the Quad USGMII mode, which is the Quad variant of
+    the USGMII (Universal SGMII) link. It's very similar to QSGMII, but uses
+    a Packet Control Header (PCH) instead of the 7 bytes preamble to carry not
+    only the port id, but also so-called "extensions". The only documented
+    extension so-far in the specification is the inclusion of timestamps, for
+    PTP-enabled PHYs. This mode isn't compatible with QSGMII, but offers the
+    same capabilities in terms of link speed and negotiation.
+
+``PHY_INTERFACE_MODE_1000BASEKX``
+    This is 1000BASE-X as defined by IEEE 802.3 Clause 36 with Clause 73
+    autonegotiation. Generally, it will be used with a Clause 70 PMD. To
+    contrast with the 1000BASE-X phy mode used for Clause 38 and 39 PMDs, this
+    interface mode has different autonegotiation and only supports full duplex.
+
+``PHY_INTERFACE_MODE_PSGMII``
+    This is the Penta SGMII mode, it is similar to QSGMII but it combines 5
+    SGMII lines into a single link compared to 4 on QSGMII.
+
+``PHY_INTERFACE_MODE_10G_QXGMII``
+    Represents the 10G-QXGMII PHY-MAC interface as defined by the Cisco USXGMII
+    Multiport Copper Interface document. It supports 4 ports over a 10.3125 GHz
+    SerDes lane, each port having speeds of 2.5G / 1G / 100M / 10M achieved
+    through symbol replication. The PCS expects the standard USXGMII code word.
+
 Pause frames / flow control
 ===========================
 
diff --git a/Documentation/networking/pktgen.rst b/Documentation/networking/pktgen.rst
index 7afa1c9f1183..c945218946e1 100644
--- a/Documentation/networking/pktgen.rst
+++ b/Documentation/networking/pktgen.rst
@@ -178,6 +178,7 @@ Examples::
 			      IPSEC # IPsec encapsulation (needs CONFIG_XFRM)
 			      NODE_ALLOC # node specific memory allocation
 			      NO_TIMESTAMP # disable timestamping
+			      SHARED # enable shared SKB
  pgset 'flag ![name]'    Clear a flag to determine behaviour.
 			 Note that you might need to use single quote in
 			 interactive mode, so that your shell wouldn't expand
@@ -248,26 +249,24 @@ Usage:::
 
   -i : ($DEV)       output interface/device (required)
   -s : ($PKT_SIZE)  packet size
-  -d : ($DEST_IP)   destination IP
+  -d : ($DEST_IP)   destination IP. CIDR (e.g. 198.18.0.0/15) is also allowed
   -m : ($DST_MAC)   destination MAC-addr
+  -p : ($DST_PORT)  destination PORT range (e.g. 433-444) is also allowed
   -t : ($THREADS)   threads to start
+  -f : ($F_THREAD)  index of first thread (zero indexed CPU number)
   -c : ($SKB_CLONE) SKB clones send before alloc new SKB
+  -n : ($COUNT)     num messages to send per thread, 0 means indefinitely
   -b : ($BURST)     HW level bursting of SKBs
   -v : ($VERBOSE)   verbose
   -x : ($DEBUG)     debug
+  -6 : ($IP6)       IPv6
+  -w : ($DELAY)     Tx Delay value (ns)
+  -a : ($APPEND)    Script will not reset generator's state, but will append its config
 
 The global variables being set are also listed.  E.g. the required
 interface/device parameter "-i" sets variable $DEV.  Copy the
 pktgen_sampleXX scripts and modify them to fit your own needs.
 
-The old scripts::
-
-    pktgen.conf-1-2                  # 1 CPU 2 dev
-    pktgen.conf-1-1-rdos             # 1 CPU 1 dev w. route DoS
-    pktgen.conf-1-1-ip6              # 1 CPU 1 dev ipv6
-    pktgen.conf-1-1-ip6-rdos         # 1 CPU 1 dev ipv6  w. route DoS
-    pktgen.conf-1-1-flows            # 1 CPU 1 dev multiple flows.
-
 
 Interrupt affinity
 ===================
@@ -290,6 +289,16 @@ To avoid breaking existing testbed scripts for using AH type and tunnel mode,
 you can use "pgset spi SPI_VALUE" to specify which transformation mode
 to employ.
 
+Disable shared SKB
+==================
+By default, SKBs sent by pktgen are shared (user count > 1).
+To test with non-shared SKBs, remove the "SHARED" flag by simply setting::
+
+	pg_set "flag !SHARED"
+
+However, if the "clone_skb" or "burst" parameters are configured, the skb
+still needs to be held by pktgen for further access. Hence the skb must be
+shared.
 
 Current commands and configuration options
 ==========================================
@@ -359,6 +368,7 @@ Current commands and configuration options
     IPSEC
     NODE_ALLOC
     NO_TIMESTAMP
+    SHARED
 
     spi (ipsec)
 
@@ -398,7 +408,7 @@ Current commands and configuration options
 References:
 
 - ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/
-- tp://robur.slu.se/pub/Linux/net-development/pktgen-testing/examples/
+- ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/examples/
 
 Paper from Linux-Kongress in Erlangen 2004.
 - ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/pktgen_paper.pdf
diff --git a/Documentation/networking/ppp_generic.rst b/Documentation/networking/ppp_generic.rst
index e60504377900..5a10abce5964 100644
--- a/Documentation/networking/ppp_generic.rst
+++ b/Documentation/networking/ppp_generic.rst
@@ -314,6 +314,22 @@ channel are:
   it is connected to.  It will return an EINVAL error if the channel
   is not connected to an interface.
 
+* PPPIOCBRIDGECHAN bridges a channel with another. The argument should
+  point to an int containing the channel number of the channel to bridge
+  to. Once two channels are bridged, frames presented to one channel by
+  ppp_input() are passed to the bridge instance for onward transmission.
+  This allows frames to be switched from one channel into another: for
+  example, to pass PPPoE frames into a PPPoL2TP session. Since channel
+  bridging interrupts the normal ppp_input() path, a given channel may
+  not be part of a bridge at the same time as being part of a unit.
+  This ioctl will return an EALREADY error if the channel is already
+  part of a bridge or unit, or ENXIO if the requested channel does not
+  exist.
+
+* PPPIOCUNBRIDGECHAN performs the inverse of PPPIOCBRIDGECHAN, unbridging
+  a channel pair.  This ioctl will return an EINVAL error if the channel
+  does not form part of a bridge.
+
 * All other ioctl commands are passed to the channel ioctl() function.
 
 The ioctl calls that are available on an instance that is attached to
diff --git a/Documentation/networking/pse-pd/index.rst b/Documentation/networking/pse-pd/index.rst
new file mode 100644
index 000000000000..de28a5aee316
--- /dev/null
+++ b/Documentation/networking/pse-pd/index.rst
@@ -0,0 +1,10 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Power Sourcing Equipment (PSE) Documentation
+============================================
+
+.. toctree::
+   :maxdepth: 2
+
+   introduction
+   pse-pi
diff --git a/Documentation/networking/pse-pd/introduction.rst b/Documentation/networking/pse-pd/introduction.rst
new file mode 100644
index 000000000000..e3d3faaef717
--- /dev/null
+++ b/Documentation/networking/pse-pd/introduction.rst
@@ -0,0 +1,73 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Power Sourcing Equipment (PSE) in IEEE 802.3 Standard
+=====================================================
+
+Overview
+--------
+
+Power Sourcing Equipment (PSE) is essential in networks for delivering power
+along with data over Ethernet cables. It usually refers to devices like
+switches and hubs that supply power to Powered Devices (PDs) such as IP
+cameras, VoIP phones, and wireless access points.
+
+PSE vs. PoDL PSE
+----------------
+
+PSE in the IEEE 802.3 standard generally refers to equipment that provides
+power alongside data over Ethernet cables, typically associated with Power over
+Ethernet (PoE).
+
+PoDL PSE, or Power over Data Lines PSE, specifically denotes PSEs operating
+with single balanced twisted-pair PHYs, as per Clause 104 of IEEE 802.3. PoDL
+is significant in contexts like automotive and industrial controls where power
+and data delivery over a single pair is advantageous.
+
+IEEE 802.3-2018 Addendums and Related Clauses
+---------------------------------------------
+
+Key addenda to the IEEE 802.3-2018 standard relevant to power delivery over
+Ethernet are as follows:
+
+- **802.3af (Approved in 2003-06-12)**: Known as PoE in the market, detailed in
+  Clause 33, delivering up to 15.4W of power.
+- **802.3at (Approved in 2009-09-11)**: Marketed as PoE+, enhancing PoE as
+  covered in Clause 33, increasing power delivery to up to 30W.
+- **802.3bt (Approved in 2018-09-27)**: Known as 4PPoE in the market, outlined
+  in Clause 33. Type 3 delivers up to 60W, and Type 4 up to 100W.
+- **802.3bu (Approved in 2016-12-07)**: Formerly referred to as PoDL, detailed
+  in Clause 104. Introduces Classes 0 - 9. Class 9 PoDL PSE delivers up to ~65W
+
+Kernel Naming Convention Recommendations
+----------------------------------------
+
+For clarity and consistency within the Linux kernel's networking subsystem, the
+following naming conventions are recommended:
+
+- For general PSE (PoE) code, use "c33_pse" key words. For example:
+  ``enum ethtool_c33_pse_admin_state c33_admin_control;``.
+  This aligns with Clause 33, encompassing various PoE forms.
+
+- For PoDL PSE - specific code, use "podl_pse". For example:
+  ``enum ethtool_podl_pse_admin_state podl_admin_control;`` to differentiate
+  PoDL PSE settings according to Clause 104.
+
+Summary of Clause 33: Data Terminal Equipment (DTE) Power via Media Dependent Interface (MDI)
+---------------------------------------------------------------------------------------------
+
+Clause 33 of the IEEE 802.3 standard defines the functional and electrical
+characteristics of Powered Device (PD) and Power Sourcing Equipment (PSE).
+These entities enable power delivery using the same generic cabling as for data
+transmission, integrating power with data communication for devices such as
+10BASE-T, 100BASE-TX, or 1000BASE-T.
+
+Summary of Clause 104: Power over Data Lines (PoDL) of Single Balanced Twisted-Pair Ethernet
+--------------------------------------------------------------------------------------------
+
+Clause 104 of the IEEE 802.3 standard delineates the functional and electrical
+characteristics of PoDL Powered Devices (PDs) and PoDL Power Sourcing Equipment
+(PSEs). These are designed for use with single balanced twisted-pair Ethernet
+Physical Layers. In this clause, 'PSE' refers specifically to PoDL PSE, and
+'PD' to PoDL PD. The key intent is to provide devices with a unified interface
+for both data and the power required to process this data over a single
+balanced twisted-pair Ethernet connection.
diff --git a/Documentation/networking/pse-pd/pse-pi.rst b/Documentation/networking/pse-pd/pse-pi.rst
new file mode 100644
index 000000000000..5cad14fedc13
--- /dev/null
+++ b/Documentation/networking/pse-pd/pse-pi.rst
@@ -0,0 +1,301 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+PSE Power Interface (PSE PI) Documentation
+==========================================
+
+The Power Sourcing Equipment Power Interface (PSE PI) plays a pivotal role in
+the architecture of Power over Ethernet (PoE) systems. It is essentially a
+blueprint that outlines how one or multiple power sources are connected to the
+eight-pin modular jack, commonly known as the Ethernet RJ45 port. This
+connection scheme is crucial for enabling the delivery of power alongside data
+over Ethernet cables.
+
+Documentation and Standards
+---------------------------
+
+The IEEE 802.3 standard provides detailed documentation on the PSE PI.
+Specifically:
+
+- Section "33.2.3 PI pin assignments" covers the pin assignments for PoE
+  systems that utilize two pairs for power delivery.
+- Section "145.2.4 PSE PI" addresses the configuration for PoE systems that
+  deliver power over all four pairs of an Ethernet cable.
+
+PSE PI and Single Pair Ethernet
+-------------------------------
+
+Single Pair Ethernet (SPE) represents a different approach to Ethernet
+connectivity, utilizing just one pair of conductors for both data and power
+transmission. Unlike the configurations detailed in the PSE PI for standard
+Ethernet, which can involve multiple power sourcing arrangements across four or
+two pairs of wires, SPE operates on a simpler model due to its single-pair
+design. As a result, the complexities of choosing between alternative pin
+assignments for power delivery, as described in the PSE PI for multi-pair
+Ethernet, are not applicable to SPE.
+
+Understanding PSE PI
+--------------------
+
+The Power Sourcing Equipment Power Interface (PSE PI) is a framework defining
+how Power Sourcing Equipment (PSE) delivers power to Powered Devices (PDs) over
+Ethernet cables. It details two main configurations for power delivery, known
+as Alternative A and Alternative B, which are distinguished not only by their
+method of power transmission but also by the implications for polarity and data
+transmission direction.
+
+Alternative A and B Overview
+----------------------------
+
+- **Alternative A:** Utilizes RJ45 conductors 1, 2, 3 and 6. In either case of
+  networks 10/100BaseT or 1G/2G/5G/10GBaseT, the pairs used are carrying data.
+  The power delivery's polarity in this alternative can vary based on the MDI
+  (Medium Dependent Interface) or MDI-X (Medium Dependent Interface Crossover)
+  configuration.
+
+- **Alternative B:** Utilizes RJ45 conductors 4, 5, 7 and 8. In case of
+  10/100BaseT network the pairs used are spare pairs without data and are less
+  influenced by data transmission direction. This is not the case for
+  1G/2G/5G/10GBaseT network. Alternative B includes two configurations with
+  different polarities, known as variant X and variant S, to accommodate
+  different network requirements and device specifications.
+
+Table 145-3 PSE Pinout Alternatives
+-----------------------------------
+
+The following table outlines the pin configurations for both Alternative A and
+Alternative B.
+
++------------+-------------------+-----------------+-----------------+-----------------+
+| Conductor  | Alternative A     | Alternative A   | Alternative B   | Alternative B   |
+|            |    (MDI-X)        |      (MDI)      |        (X)      |        (S)      |
++============+===================+=================+=================+=================+
+| 1          | Negative V        | Positive V      | -               | -               |
++------------+-------------------+-----------------+-----------------+-----------------+
+| 2          | Negative V        | Positive V      | -               | -               |
++------------+-------------------+-----------------+-----------------+-----------------+
+| 3          | Positive V        | Negative V      | -               | -               |
++------------+-------------------+-----------------+-----------------+-----------------+
+| 4          | -                 | -               | Negative V      | Positive V      |
++------------+-------------------+-----------------+-----------------+-----------------+
+| 5          | -                 | -               | Negative V      | Positive V      |
++------------+-------------------+-----------------+-----------------+-----------------+
+| 6          | Positive V        | Negative V      | -               | -               |
++------------+-------------------+-----------------+-----------------+-----------------+
+| 7          | -                 | -               | Positive V      | Negative V      |
++------------+-------------------+-----------------+-----------------+-----------------+
+| 8          | -                 | -               | Positive V      | Negative V      |
++------------+-------------------+-----------------+-----------------+-----------------+
+
+.. note::
+    - "Positive V" and "Negative V" indicate the voltage polarity for each pin.
+    - "-" indicates that the pin is not used for power delivery in that
+      specific configuration.
+
+PSE PI compatibilities
+----------------------
+
+The following table outlines the compatibility between the pinout alternative
+and the 1000/2.5G/5G/10GBaseT in the PSE 2 pairs connection.
+
++---------+---------------+---------------------+-----------------------+
+| Variant | Alternative   | Power Feeding Type  | Compatibility with    |
+|         | (A/B)         | (Direct/Phantom)    | 1000/2.5G/5G/10GBaseT |
++=========+===============+=====================+=======================+
+| 1       | A             | Phantom             | Yes                   |
++---------+---------------+---------------------+-----------------------+
+| 2       | B             | Phantom             | Yes                   |
++---------+---------------+---------------------+-----------------------+
+| 3       | B             | Direct              | No                    |
++---------+---------------+---------------------+-----------------------+
+
+.. note::
+    - "Direct" indicate a variant where the power is injected directly to pairs
+       without using magnetics in case of spare pairs.
+    - "Phantom" indicate power path over coils/magnetics as it is done for
+       Alternative A variant.
+
+In case of PSE 4 pairs, a PSE supporting only 10/100BaseT (which mean Direct
+Power on pinout Alternative B) is not compatible with a 4 pairs
+1000/2.5G/5G/10GBaseT.
+
+PSE Power Interface (PSE PI) Connection Diagram
+-----------------------------------------------
+
+The diagram below illustrates the connection architecture between the RJ45
+port, the Ethernet PHY (Physical Layer), and the PSE PI (Power Sourcing
+Equipment Power Interface), demonstrating how power and data are delivered
+simultaneously through an Ethernet cable. The RJ45 port serves as the physical
+interface for these connections, with each of its eight pins connected to both
+the Ethernet PHY for data transmission and the PSE PI for power delivery.
+
+.. code-block::
+
+    +--------------------------+
+    |                          |
+    |          RJ45 Port       |
+    |                          |
+    +--+--+--+--+--+--+--+--+--+                +-------------+
+      1| 2| 3| 4| 5| 6| 7| 8|                   |             |
+       |  |  |  |  |  |  |  o-------------------+             |
+       |  |  |  |  |  |  o--|-------------------+             +<--- PSE 1
+       |  |  |  |  |  o--|--|-------------------+             |
+       |  |  |  |  o--|--|--|-------------------+             |
+       |  |  |  o--|--|--|--|-------------------+  PSE PI     |
+       |  |  o--|--|--|--|--|-------------------+             |
+       |  o--|--|--|--|--|--|-------------------+             +<--- PSE 2 (optional)
+       o--|--|--|--|--|--|--|-------------------+             |
+       |  |  |  |  |  |  |  |                   |             |
+    +--+--+--+--+--+--+--+--+--+                +-------------+
+    |                          |
+    |       Ethernet PHY       |
+    |                          |
+    +--------------------------+
+
+Simple PSE PI Configuration for Alternative A
+---------------------------------------------
+
+The diagram below illustrates a straightforward PSE PI (Power Sourcing
+Equipment Power Interface) configuration designed to support the Alternative A
+setup for Power over Ethernet (PoE). This implementation is tailored to provide
+power delivery through the data-carrying pairs of an Ethernet cable, suitable
+for either MDI or MDI-X configurations, albeit supporting one variation at a
+time.
+
+.. code-block::
+
+         +-------------+
+         |    PSE PI   |
+ 8  -----+                             +-------------+
+ 7  -----+                    Rail 1   |
+ 6  -----+------+----------------------+
+ 5  -----+      |                      |
+ 4  -----+      |             Rail 2   |  PSE 1
+ 3  -----+------/         +------------+
+ 2  -----+--+-------------/            |
+ 1  -----+--/                          +-------------+
+         |
+         +-------------+
+
+In this configuration:
+
+- Pins 1 and 2, as well as pins 3 and 6, are utilized for power delivery in
+  addition to data transmission. This aligns with the standard wiring for
+  10/100BaseT Ethernet networks where these pairs are used for data.
+- Rail 1 and Rail 2 represent the positive and negative voltage rails, with
+  Rail 1 connected to pins 1 and 2, and Rail 2 connected to pins 3 and 6.
+  More advanced PSE PI configurations may include integrated or external
+  switches to change the polarity of the voltage rails, allowing for
+  compatibility with both MDI and MDI-X configurations.
+
+More complex PSE PI configurations may include additional components, to support
+Alternative B, or to provide additional features such as power management, or
+additional power delivery capabilities such as 2-pair or 4-pair power delivery.
+
+.. code-block::
+
+         +-------------+
+         |    PSE PI   |
+         |        +---+
+ 8  -----+--------+   |                 +-------------+
+ 7  -----+--------+   |       Rail 1   |
+ 6  -----+--------+   +-----------------+
+ 5  -----+--------+   |                |
+ 4  -----+--------+   |       Rail 2   |  PSE 1
+ 3  -----+--------+   +----------------+
+ 2  -----+--------+   |                |
+ 1  -----+--------+   |                 +-------------+
+         |        +---+
+         +-------------+
+
+Device Tree Configuration: Describing PSE PI Configurations
+-----------------------------------------------------------
+
+The necessity for a separate PSE PI node in the device tree is influenced by
+the intricacy of the Power over Ethernet (PoE) system's setup. Here are
+descriptions of both simple and complex PSE PI configurations to illustrate
+this decision-making process:
+
+**Simple PSE PI Configuration:**
+In a straightforward scenario, the PSE PI setup involves a direct, one-to-one
+connection between a single PSE controller and an Ethernet port. This setup
+typically supports basic PoE functionality without the need for dynamic
+configuration or management of multiple power delivery modes. For such simple
+configurations, detailing the PSE PI within the existing PSE controller's node
+may suffice, as the system does not encompass additional complexity that
+warrants a separate node. The primary focus here is on the clear and direct
+association of power delivery to a specific Ethernet port.
+
+**Complex PSE PI Configuration:**
+Contrastingly, a complex PSE PI setup may encompass multiple PSE controllers or
+auxiliary circuits that collectively manage power delivery to one Ethernet
+port. Such configurations might support a range of PoE standards and require
+the capability to dynamically configure power delivery based on the operational
+mode (e.g., PoE2 versus PoE4) or specific requirements of connected devices. In
+these instances, a dedicated PSE PI node becomes essential for accurately
+documenting the system architecture. This node would serve to detail the
+interactions between different PSE controllers, the support for various PoE
+modes, and any additional logic required to coordinate power delivery across
+the network infrastructure.
+
+**Guidance:**
+
+For simple PSE setups, including PSE PI information in the PSE controller node
+might suffice due to the straightforward nature of these systems. However,
+complex configurations, involving multiple components or advanced PoE features,
+benefit from a dedicated PSE PI node. This method adheres to IEEE 802.3
+specifications, improving documentation clarity and ensuring accurate
+representation of the PoE system's complexity.
+
+PSE PI Node: Essential Information
+----------------------------------
+
+The PSE PI (Power Sourcing Equipment Power Interface) node in a device tree can
+include several key pieces of information critical for defining the power
+delivery capabilities and configurations of a PoE (Power over Ethernet) system.
+Below is a list of such information, along with explanations for their
+necessity and reasons why they might not be found within a PSE controller node:
+
+1. **Powered Pairs Configuration**
+
+   - *Description:* Identifies the pairs used for power delivery in the
+     Ethernet cable.
+   - *Necessity:* Essential to ensure the correct pairs are powered according
+     to the board's design.
+   - *PSE Controller Node:* Typically lacks details on physical pair usage,
+     focusing on power regulation.
+
+2. **Polarity of Powered Pairs**
+
+   - *Description:* Specifies the polarity (positive or negative) for each
+     powered pair.
+   - *Necessity:* Critical for safe and effective power transmission to PDs.
+   - *PSE Controller Node:* Polarity management may exceed the standard
+     functionalities of PSE controllers.
+
+3. **PSE Cells Association**
+
+   - *Description:* Details the association of PSE cells with Ethernet ports or
+     pairs in multi-cell configurations.
+   - *Necessity:* Allows for optimized power resource allocation in complex
+     systems.
+   - *PSE Controller Node:* Controllers may not manage cell associations
+     directly, focusing instead on power flow regulation.
+
+4. **Support for PoE Standards**
+
+   - *Description:* Lists the PoE standards and configurations supported by the
+     system.
+   - *Necessity:* Ensures system compatibility with various PDs and adherence
+     to industry standards.
+   - *PSE Controller Node:* Specific capabilities may depend on the overall PSE
+     PI design rather than the controller alone. Multiple PSE cells per PI
+     do not necessarily imply support for multiple PoE standards.
+
+5. **Protection Mechanisms**
+
+   - *Description:* Outlines additional protection mechanisms, such as
+     overcurrent protection and thermal management.
+   - *Necessity:* Provides extra safety and stability, complementing PSE
+     controller protections.
+   - *PSE Controller Node:* Some protections may be implemented via
+     board-specific hardware or algorithms external to the controller.
diff --git a/Documentation/networking/ray_cs.rst b/Documentation/networking/ray_cs.rst
deleted file mode 100644
index 9a46d1ae8f20..000000000000
--- a/Documentation/networking/ray_cs.rst
+++ /dev/null
@@ -1,165 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-.. include:: <isonum.txt>
-
-=========================
-Raylink wireless LAN card
-=========================
-
-September 21, 1999
-
-Copyright |copy| 1998  Corey Thomas (corey@world.std.com)
-
-This file is the documentation for the Raylink Wireless LAN card driver for
-Linux.  The Raylink wireless LAN card is a PCMCIA card which provides IEEE
-802.11 compatible wireless network connectivity at 1 and 2 megabits/second.
-See http://www.raytheon.com/micro/raylink/ for more information on the Raylink
-card.  This driver is in early development and does have bugs.  See the known
-bugs and limitations at the end of this document for more information.
-This driver also works with WebGear's Aviator 2.4 and Aviator Pro
-wireless LAN cards.
-
-As of kernel 2.3.18, the ray_cs driver is part of the Linux kernel
-source.  My web page for the development of ray_cs is at
-http://web.ralinktech.com/ralink/Home/Support/Linux.html
-and I can be emailed at corey@world.std.com
-
-The kernel driver is based on ray_cs-1.62.tgz
-
-The driver at my web page is intended to be used as an add on to
-David Hinds pcmcia package.  All the command line parameters are
-available when compiled as a module.  When built into the kernel, only
-the essid= string parameter is available via the kernel command line.
-This will change after the method of sorting out parameters for all
-the PCMCIA drivers is agreed upon.  If you must have a built in driver
-with nondefault parameters, they can be edited in
-/usr/src/linux/drivers/net/pcmcia/ray_cs.c.  Searching for module_param
-will find them all.
-
-Information on card services is available at:
-
-	http://pcmcia-cs.sourceforge.net/
-
-
-Card services user programs are still required for PCMCIA devices.
-pcmcia-cs-3.1.1 or greater is required for the kernel version of
-the driver.
-
-Currently, ray_cs is not part of David Hinds card services package,
-so the following magic is required.
-
-At the end of the /etc/pcmcia/config.opts file, add the line:
-source ./ray_cs.opts
-This will make card services read the ray_cs.opts file
-when starting.  Create the file /etc/pcmcia/ray_cs.opts containing the
-following::
-
-  #### start of /etc/pcmcia/ray_cs.opts ###################
-  # Configuration options for Raylink Wireless LAN PCMCIA card
-  device "ray_cs"
-    class "network" module "misc/ray_cs"
-
-  card "RayLink PC Card WLAN Adapter"
-    manfid 0x01a6, 0x0000
-    bind "ray_cs"
-
-  module "misc/ray_cs" opts ""
-  #### end of /etc/pcmcia/ray_cs.opts #####################
-
-
-To join an existing network with
-different parameters, contact the network administrator for the
-configuration information, and edit /etc/pcmcia/ray_cs.opts.
-Add the parameters below between the empty quotes.
-
-Parameters for ray_cs driver which may be specified in ray_cs.opts:
-
-=============== =============== =============================================
-bc              integer         0 = normal mode (802.11 timing),
-				1 = slow down inter frame timing to allow
-				operation with older breezecom access
-				points.
-
-beacon_period	integer         beacon period in Kilo-microseconds,
-
-				legal values = must be integer multiple
-				of hop dwell
-
-				default = 256
-
-country         integer         1 = USA (default),
-				2 = Europe,
-				3 = Japan,
-				4 = Korea,
-				5 = Spain,
-				6 = France,
-				7 = Israel,
-				8 = Australia
-
-essid		string		ESS ID - network name to join
-
-				string with maximum length of 32 chars
-				default value = "ADHOC_ESSID"
-
-hop_dwell	integer         hop dwell time in Kilo-microseconds
-
-				legal values = 16,32,64,128(default),256
-
-irq_mask	integer         linux standard 16 bit value 1bit/IRQ
-
-				lsb is IRQ 0, bit 1 is IRQ 1 etc.
-				Used to restrict choice of IRQ's to use.
-				Recommended method for controlling
-				interrupts is in /etc/pcmcia/config.opts
-
-net_type	integer		0 (default) = adhoc network,
-				1 = infrastructure
-
-phy_addr	string          string containing new MAC address in
-				hex, must start with x eg
-				x00008f123456
-
-psm		integer         0 = continuously active,
-				1 = power save mode (not useful yet)
-
-pc_debug	integer		(0-5) larger values for more verbose
-				logging.  Replaces ray_debug.
-
-ray_debug	integer		Replaced with pc_debug
-
-ray_mem_speed   integer         defaults to 500
-
-sniffer         integer         0 = not sniffer (default),
-				1 = sniffer which can be used to record all
-				network traffic using tcpdump or similar,
-				but no normal network use is allowed.
-
-translate	integer		0 = no translation (encapsulate frames),
-				1 = translation    (RFC1042/802.1)
-=============== =============== =============================================
-
-More on sniffer mode:
-
-tcpdump does not understand 802.11 headers, so it can't
-interpret the contents, but it can record to a file.  This is only
-useful for debugging 802.11 lowlevel protocols that are not visible to
-linux.  If you want to watch ftp xfers, or do similar things, you
-don't need to use sniffer mode.  Also, some packet types are never
-sent up by the card, so you will never see them (ack, rts, cts, probe
-etc.)  There is a simple program (showcap) included in the ray_cs
-package which parses the 802.11 headers.
-
-Known Problems and missing features
-
-	Does not work with non x86
-
-	Does not work with SMP
-
-	Support for defragmenting frames is not yet debugged, and in
-	fact is known to not work.  I have never encountered a net set
-	up to fragment, but still, it should be fixed.
-
-	The ioctl support is incomplete.  The hardware address cannot be set
-	using ifconfig yet.  If a different hardware address is needed, it may
-	be set using the phy_addr parameter in ray_cs.opts.  This requires
-	a card insertion to take effect.
diff --git a/Documentation/networking/rds.rst b/Documentation/networking/rds.rst
index 44936c27ab3a..41b0a6182fe4 100644
--- a/Documentation/networking/rds.rst
+++ b/Documentation/networking/rds.rst
@@ -1,6 +1,6 @@
 .. SPDX-License-Identifier: GPL-2.0
 
-==
+===
 RDS
 ===
 
@@ -265,7 +265,7 @@ RDS Protocol
 
       The bitmaps are allocated as connections are brought up.  This
       avoids allocation in the interrupt handling path which queues
-      sages on sockets.  The dense bitmaps let transports send the
+      messages on sockets.  The dense bitmaps let transports send the
       entire bitmap on any bitmap change reasonably efficiently.  This
       is much easier to implement than some finer-grained
       communication of per-port congestion.  The sender does a very
@@ -373,7 +373,7 @@ The recv path
     - validate header checksum
     - copy header to rds_ib_incoming struct if start of a new datagram
     - add to ibinc's fraglist
-    - if competed datagram:
+    - if completed datagram:
 	 - update cong map if datagram was cong update
 	 - call rds_recv_incoming() otherwise
 	 - note if ack is required
@@ -415,7 +415,7 @@ Multipath RDS (mprds)
   I/O workqs and reconnect threads are driven from the rds_conn_path.
   Transports such as TCP that are multipath capable may then set up a
   TCP socket per rds_conn_path, and this is managed by the transport via
-  the transport privatee cp_transport_data pointer.
+  the transport private cp_transport_data pointer.
 
   Transports announce themselves as multipath capable by setting the
   t_mp_capable bit during registration with the rds core module. When the
@@ -430,7 +430,7 @@ Multipath RDS (mprds)
   This is done by sending out a control packet exchange before the
   first data packet. The control packet exchange must have completed
   prior to outgoing hash completion in rds_sendmsg() when the transport
-  is mutlipath capable.
+  is multipath capable.
 
   The control packet is an RDS ping packet (i.e., packet to rds dest
   port 0) with the ping packet having a rds extension header option  of
diff --git a/Documentation/networking/regulatory.rst b/Documentation/networking/regulatory.rst
index 16782a95b74a..5e42c8a175c3 100644
--- a/Documentation/networking/regulatory.rst
+++ b/Documentation/networking/regulatory.rst
@@ -66,7 +66,7 @@ An example::
   iw reg set CR
 
 This will request the kernel to set the regulatory domain to
-the specificied alpha2. The kernel in turn will then ask userspace
+the specified alpha2. The kernel in turn will then ask userspace
 to provide a regulatory domain for the alpha2 specified by the user
 by sending a uevent.
 
@@ -158,7 +158,7 @@ kmalloc() a structure big enough to hold your regulatory domain
 structure and you should then fill it with your data. Finally you simply
 call regulatory_hint() with the regulatory domain structure in it.
 
-Bellow is a simple example, with a regulatory domain cached using the stack.
+Below is a simple example, with a regulatory domain cached using the stack.
 Your implementation may vary (read EEPROM cache instead, for example).
 
 Example cache of some regulatory domain::
diff --git a/Documentation/networking/representors.rst b/Documentation/networking/representors.rst
new file mode 100644
index 000000000000..5e23386f6968
--- /dev/null
+++ b/Documentation/networking/representors.rst
@@ -0,0 +1,262 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _representors:
+
+=============================
+Network Function Representors
+=============================
+
+This document describes the semantics and usage of representor netdevices, as
+used to control internal switching on SmartNICs.  For the closely-related port
+representors on physical (multi-port) switches, see
+:ref:`Documentation/networking/switchdev.rst <switchdev>`.
+
+Motivation
+----------
+
+Since the mid-2010s, network cards have started offering more complex
+virtualisation capabilities than the legacy SR-IOV approach (with its simple
+MAC/VLAN-based switching model) can support.  This led to a desire to offload
+software-defined networks (such as OpenVSwitch) to these NICs to specify the
+network connectivity of each function.  The resulting designs are variously
+called SmartNICs or DPUs.
+
+Network function representors bring the standard Linux networking stack to
+virtual switches and IOV devices.  Just as each physical port of a Linux-
+controlled switch has a separate netdev, so does each virtual port of a virtual
+switch.
+When the system boots, and before any offload is configured, all packets from
+the virtual functions appear in the networking stack of the PF via the
+representors.  The PF can thus always communicate freely with the virtual
+functions.
+The PF can configure standard Linux forwarding between representors, the uplink
+or any other netdev (routing, bridging, TC classifiers).
+
+Thus, a representor is both a control plane object (representing the function in
+administrative commands) and a data plane object (one end of a virtual pipe).
+As a virtual link endpoint, the representor can be configured like any other
+netdevice; in some cases (e.g. link state) the representee will follow the
+representor's configuration, while in others there are separate APIs to
+configure the representee.
+
+Definitions
+-----------
+
+This document uses the term "switchdev function" to refer to the PCIe function
+which has administrative control over the virtual switch on the device.
+Typically, this will be a PF, but conceivably a NIC could be configured to grant
+these administrative privileges instead to a VF or SF (subfunction).
+Depending on NIC design, a multi-port NIC might have a single switchdev function
+for the whole device or might have a separate virtual switch, and hence
+switchdev function, for each physical network port.
+If the NIC supports nested switching, there might be separate switchdev
+functions for each nested switch, in which case each switchdev function should
+only create representors for the ports on the (sub-)switch it directly
+administers.
+
+A "representee" is the object that a representor represents.  So for example in
+the case of a VF representor, the representee is the corresponding VF.
+
+What does a representor do?
+---------------------------
+
+A representor has three main roles.
+
+1. It is used to configure the network connection the representee sees, e.g.
+   link up/down, MTU, etc.  For instance, bringing the representor
+   administratively UP should cause the representee to see a link up / carrier
+   on event.
+2. It provides the slow path for traffic which does not hit any offloaded
+   fast-path rules in the virtual switch.  Packets transmitted on the
+   representor netdevice should be delivered to the representee; packets
+   transmitted by the representee which fail to match any switching rule should
+   be received on the representor netdevice.  (That is, there is a virtual pipe
+   connecting the representor to the representee, similar in concept to a veth
+   pair.)
+   This allows software switch implementations (such as OpenVSwitch or a Linux
+   bridge) to forward packets between representees and the rest of the network.
+3. It acts as a handle by which switching rules (such as TC filters) can refer
+   to the representee, allowing these rules to be offloaded.
+
+The combination of 2) and 3) means that the behaviour (apart from performance)
+should be the same whether a TC filter is offloaded or not.  E.g. a TC rule
+on a VF representor applies in software to packets received on that representor
+netdevice, while in hardware offload it would apply to packets transmitted by
+the representee VF.  Conversely, a mirred egress redirect to a VF representor
+corresponds in hardware to delivery directly to the representee VF.
+
+What functions should have a representor?
+-----------------------------------------
+
+Essentially, for each virtual port on the device's internal switch, there
+should be a representor.
+Some vendors have chosen to omit representors for the uplink and the physical
+network port, which can simplify usage (the uplink netdev becomes in effect the
+physical port's representor) but does not generalise to devices with multiple
+ports or uplinks.
+
+Thus, the following should all have representors:
+
+ - VFs belonging to the switchdev function.
+ - Other PFs on the local PCIe controller, and any VFs belonging to them.
+ - PFs and VFs on external PCIe controllers on the device (e.g. for any embedded
+   System-on-Chip within the SmartNIC).
+ - PFs and VFs with other personalities, including network block devices (such
+   as a vDPA virtio-blk PF backed by remote/distributed storage), if (and only
+   if) their network access is implemented through a virtual switch port. [#]_
+   Note that such functions can require a representor despite the representee
+   not having a netdev.
+ - Subfunctions (SFs) belonging to any of the above PFs or VFs, if they have
+   their own port on the switch (as opposed to using their parent PF's port).
+ - Any accelerators or plugins on the device whose interface to the network is
+   through a virtual switch port, even if they do not have a corresponding PCIe
+   PF or VF.
+
+This allows the entire switching behaviour of the NIC to be controlled through
+representor TC rules.
+
+It is a common misunderstanding to conflate virtual ports with PCIe virtual
+functions or their netdevs.  While in simple cases there will be a 1:1
+correspondence between VF netdevices and VF representors, more advanced device
+configurations may not follow this.
+A PCIe function which does not have network access through the internal switch
+(not even indirectly through the hardware implementation of whatever services
+the function provides) should *not* have a representor (even if it has a
+netdev).
+Such a function has no switch virtual port for the representor to configure or
+to be the other end of the virtual pipe.
+The representor represents the virtual port, not the PCIe function nor the 'end
+user' netdevice.
+
+.. [#] The concept here is that a hardware IP stack in the device performs the
+   translation between block DMA requests and network packets, so that only
+   network packets pass through the virtual port onto the switch.  The network
+   access that the IP stack "sees" would then be configurable through tc rules;
+   e.g. its traffic might all be wrapped in a specific VLAN or VxLAN.  However,
+   any needed configuration of the block device *qua* block device, not being a
+   networking entity, would not be appropriate for the representor and would
+   thus use some other channel such as devlink.
+   Contrast this with the case of a virtio-blk implementation which forwards the
+   DMA requests unchanged to another PF whose driver then initiates and
+   terminates IP traffic in software; in that case the DMA traffic would *not*
+   run over the virtual switch and the virtio-blk PF should thus *not* have a
+   representor.
+
+How are representors created?
+-----------------------------
+
+The driver instance attached to the switchdev function should, for each virtual
+port on the switch, create a pure-software netdevice which has some form of
+in-kernel reference to the switchdev function's own netdevice or driver private
+data (``netdev_priv()``).
+This may be by enumerating ports at probe time, reacting dynamically to the
+creation and destruction of ports at run time, or a combination of the two.
+
+The operations of the representor netdevice will generally involve acting
+through the switchdev function.  For example, ``ndo_start_xmit()`` might send
+the packet through a hardware TX queue attached to the switchdev function, with
+either packet metadata or queue configuration marking it for delivery to the
+representee.
+
+How are representors identified?
+--------------------------------
+
+The representor netdevice should *not* directly refer to a PCIe device (e.g.
+through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of the
+representee or of the switchdev function.
+Instead, the driver should use the ``SET_NETDEV_DEVLINK_PORT`` macro to
+assign a devlink port instance to the netdevice before registering the
+netdevice; the kernel uses the devlink port to provide the ``phys_switch_id``
+and ``phys_port_name`` sysfs nodes.
+(Some legacy drivers implement ``ndo_get_port_parent_id()`` and
+``ndo_get_phys_port_name()`` directly, but this is deprecated.)  See
+:ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>` for the
+details of this API.
+
+It is expected that userland will use this information (e.g. through udev rules)
+to construct an appropriately informative name or alias for the netdevice.  For
+instance if the switchdev function is ``eth4`` then a representor with a
+``phys_port_name`` of ``p0pf1vf2`` might be renamed ``eth4pf1vf2rep``.
+
+There are as yet no established conventions for naming representors which do not
+correspond to PCIe functions (e.g. accelerators and plugins).
+
+How do representors interact with TC rules?
+-------------------------------------------
+
+Any TC rule on a representor applies (in software TC) to packets received by
+that representor netdevice.  Thus, if the delivery part of the rule corresponds
+to another port on the virtual switch, the driver may choose to offload it to
+hardware, applying it to packets transmitted by the representee.
+
+Similarly, since a TC mirred egress action targeting the representor would (in
+software) send the packet through the representor (and thus indirectly deliver
+it to the representee), hardware offload should interpret this as delivery to
+the representee.
+
+As a simple example, if ``PORT_DEV`` is the physical port representor and
+``REP_DEV`` is a VF representor, the following rules::
+
+    tc filter add dev $REP_DEV parent ffff: protocol ipv4 flower \
+        action mirred egress redirect dev $PORT_DEV
+    tc filter add dev $PORT_DEV parent ffff: protocol ipv4 flower skip_sw \
+        action mirred egress mirror dev $REP_DEV
+
+would mean that all IPv4 packets from the VF are sent out the physical port, and
+all IPv4 packets received on the physical port are delivered to the VF in
+addition to ``PORT_DEV``.  (Note that without ``skip_sw`` on the second rule,
+the VF would get two copies, as the packet reception on ``PORT_DEV`` would
+trigger the TC rule again and mirror the packet to ``REP_DEV``.)
+
+On devices without separate port and uplink representors, ``PORT_DEV`` would
+instead be the switchdev function's own uplink netdevice.
+
+Of course the rules can (if supported by the NIC) include packet-modifying
+actions (e.g. VLAN push/pop), which should be performed by the virtual switch.
+
+Tunnel encapsulation and decapsulation are rather more complicated, as they
+involve a third netdevice (a tunnel netdev operating in metadata mode, such as
+a VxLAN device created with ``ip link add vxlan0 type vxlan external``) and
+require an IP address to be bound to the underlay device (e.g. switchdev
+function uplink netdev or port representor).  TC rules such as::
+
+    tc filter add dev $REP_DEV parent ffff: flower \
+        action tunnel_key set id $VNI src_ip $LOCAL_IP dst_ip $REMOTE_IP \
+                              dst_port 4789 \
+        action mirred egress redirect dev vxlan0
+    tc filter add dev vxlan0 parent ffff: flower enc_src_ip $REMOTE_IP \
+        enc_dst_ip $LOCAL_IP enc_key_id $VNI enc_dst_port 4789 \
+        action tunnel_key unset action mirred egress redirect dev $REP_DEV
+
+where ``LOCAL_IP`` is an IP address bound to ``PORT_DEV``, and ``REMOTE_IP`` is
+another IP address on the same subnet, mean that packets sent by the VF should
+be VxLAN encapsulated and sent out the physical port (the driver has to deduce
+this by a route lookup of ``LOCAL_IP`` leading to ``PORT_DEV``, and also
+perform an ARP/neighbour table lookup to find the MAC addresses to use in the
+outer Ethernet frame), while UDP packets received on the physical port with UDP
+port 4789 should be parsed as VxLAN and, if their VSID matches ``$VNI``,
+decapsulated and forwarded to the VF.
+
+If this all seems complicated, just remember the 'golden rule' of TC offload:
+the hardware should ensure the same final results as if the packets were
+processed through the slow path, traversed software TC (except ignoring any
+``skip_hw`` rules and applying any ``skip_sw`` rules) and were transmitted or
+received through the representor netdevices.
+
+Configuring the representee's MAC
+---------------------------------
+
+The representee's link state is controlled through the representor.  Setting the
+representor administratively UP or DOWN should cause carrier ON or OFF at the
+representee.
+
+Setting an MTU on the representor should cause that same MTU to be reported to
+the representee.
+(On hardware that allows configuring separate and distinct MTU and MRU values,
+the representor MTU should correspond to the representee's MRU and vice-versa.)
+
+Currently there is no way to use the representor to set the station permanent
+MAC address of the representee; other methods available to do this include:
+
+ - legacy SR-IOV (``ip link set DEVICE vf NUM mac LLADDR``)
+ - devlink port function (see **devlink-port(8)** and
+   :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`)
diff --git a/Documentation/networking/rxrpc.rst b/Documentation/networking/rxrpc.rst
index 39c2249c7aa7..d63e3e27dd06 100644
--- a/Documentation/networking/rxrpc.rst
+++ b/Documentation/networking/rxrpc.rst
@@ -848,14 +848,21 @@ The kernel interface functions are as follows:
      returned.  The caller now holds a reference on this and it must be
      properly ended.
 
- (#) End a client call::
+ (#) Shut down a client call::
 
-	void rxrpc_kernel_end_call(struct socket *sock,
+	void rxrpc_kernel_shutdown_call(struct socket *sock,
+					struct rxrpc_call *call);
+
+     This is used to shut down a previously begun call.  The user_call_ID is
+     expunged from AF_RXRPC's knowledge and will not be seen again in
+     association with the specified call.
+
+ (#) Release the ref on a client call::
+
+	void rxrpc_kernel_put_call(struct socket *sock,
 				   struct rxrpc_call *call);
 
-     This is used to end a previously begun call.  The user_call_ID is expunged
-     from AF_RXRPC's knowledge and will not be seen again in association with
-     the specified call.
+     This is used to release the caller's ref on an rxrpc call.
 
  (#) Send data through a call::
 
@@ -880,8 +887,8 @@ The kernel interface functions are as follows:
 
      notify_end_rx can be NULL or it can be used to specify a function to be
      called when the call changes state to end the Tx phase.  This function is
-     called with the call-state spinlock held to prevent any reply or final ACK
-     from being delivered first.
+     called with a spinlock held to prevent the last DATA packet from being
+     transmitted until the function returns.
 
  (#) Receive data from a call::
 
@@ -1055,41 +1062,6 @@ The kernel interface functions are as follows:
      first function to change.  Note that this must be called in TASK_RUNNING
      state.
 
- (#) Get reply timestamp::
-
-	bool rxrpc_kernel_get_reply_time(struct socket *sock,
-					 struct rxrpc_call *call,
-					 ktime_t *_ts)
-
-     This allows the timestamp on the first DATA packet of the reply of a
-     client call to be queried, provided that it is still in the Rx ring.  If
-     successful, the timestamp will be stored into ``*_ts`` and true will be
-     returned; false will be returned otherwise.
-
- (#) Get remote client epoch::
-
-	u32 rxrpc_kernel_get_epoch(struct socket *sock,
-				   struct rxrpc_call *call)
-
-     This allows the epoch that's contained in packets of an incoming client
-     call to be queried.  This value is returned.  The function always
-     successful if the call is still in progress.  It shouldn't be called once
-     the call has expired.  Note that calling this on a local client call only
-     returns the local epoch.
-
-     This value can be used to determine if the remote client has been
-     restarted as it shouldn't change otherwise.
-
- (#) Set the maxmimum lifespan on a call::
-
-	void rxrpc_kernel_set_max_life(struct socket *sock,
-				       struct rxrpc_call *call,
-				       unsigned long hard_timeout)
-
-     This sets the maximum lifespan on a call to hard_timeout (which is in
-     jiffies).  In the event of the timeout occurring, the call will be
-     aborted and -ETIME or -ETIMEDOUT will be returned.
-
  (#) Apply the RXRPC_MIN_SECURITY_LEVEL sockopt to a socket from within in the
      kernel::
 
@@ -1176,3 +1148,18 @@ adjusted through sysctls in /proc/net/rxrpc/:
      header plus exactly 1412 bytes of data.  The terminal packet must contain
      a four byte header plus any amount of data.  In any event, a jumbo packet
      may not exceed rxrpc_rx_mtu in size.
+
+
+API Function Reference
+======================
+
+.. kernel-doc:: net/rxrpc/af_rxrpc.c
+.. kernel-doc:: net/rxrpc/call_object.c
+.. kernel-doc:: net/rxrpc/key.c
+.. kernel-doc:: net/rxrpc/oob.c
+.. kernel-doc:: net/rxrpc/peer_object.c
+.. kernel-doc:: net/rxrpc/recvmsg.c
+.. kernel-doc:: net/rxrpc/rxgk.c
+.. kernel-doc:: net/rxrpc/rxkad.c
+.. kernel-doc:: net/rxrpc/sendmsg.c
+.. kernel-doc:: net/rxrpc/server_key.c
diff --git a/Documentation/networking/scaling.rst b/Documentation/networking/scaling.rst
index 8f0347b9fb3d..99b6a61e5e31 100644
--- a/Documentation/networking/scaling.rst
+++ b/Documentation/networking/scaling.rst
@@ -44,6 +44,28 @@ by masking out the low order seven bits of the computed hash for the
 packet (usually a Toeplitz hash), taking this number as a key into the
 indirection table and reading the corresponding value.
 
+Some NICs support symmetric RSS hashing where, if the IP (source address,
+destination address) and TCP/UDP (source port, destination port) tuples
+are swapped, the computed hash is the same. This is beneficial in some
+applications that monitor TCP/IP flows (IDS, firewalls, ...etc) and need
+both directions of the flow to land on the same Rx queue (and CPU). The
+"Symmetric-XOR" and "Symmetric-OR-XOR" are types of RSS algorithms that
+achieve this hash symmetry by XOR/ORing the input source and destination
+fields of the IP and/or L4 protocols. This, however, results in reduced
+input entropy and could potentially be exploited.
+
+Specifically, the "Symmetric-XOR" algorithm XORs the input
+as follows::
+
+    # (SRC_IP ^ DST_IP, SRC_IP ^ DST_IP, SRC_PORT ^ DST_PORT, SRC_PORT ^ DST_PORT)
+
+The "Symmetric-OR-XOR" algorithm, on the other hand, transforms the input as
+follows::
+
+    # (SRC_IP | DST_IP, SRC_IP ^ DST_IP, SRC_PORT | DST_PORT, SRC_PORT ^ DST_PORT)
+
+The result is then fed to the underlying RSS algorithm.
+
 Some advanced NICs allow steering packets to queues based on
 programmable filters. For example, webserver bound TCP port 80 packets
 can be directed to their own receive queue. Such “n-tuple” filters can
@@ -105,6 +127,48 @@ a separate CPU. For interrupt handling, HT has shown no benefit in
 initial tests, so limit the number of queues to the number of CPU cores
 in the system.
 
+Dedicated RSS contexts
+~~~~~~~~~~~~~~~~~~~~~~
+
+Modern NICs support creating multiple co-existing RSS configurations
+which are selected based on explicit matching rules. This can be very
+useful when application wants to constrain the set of queues receiving
+traffic for e.g. a particular destination port or IP address.
+The example below shows how to direct all traffic to TCP port 22
+to queues 0 and 1.
+
+To create an additional RSS context use::
+
+  # ethtool -X eth0 hfunc toeplitz context new
+  New RSS context is 1
+
+Kernel reports back the ID of the allocated context (the default, always
+present RSS context has ID of 0). The new context can be queried and
+modified using the same APIs as the default context::
+
+  # ethtool -x eth0 context 1
+  RX flow hash indirection table for eth0 with 13 RX ring(s):
+    0:      0     1     2     3     4     5     6     7
+    8:      8     9    10    11    12     0     1     2
+  [...]
+  # ethtool -X eth0 equal 2 context 1
+  # ethtool -x eth0 context 1
+  RX flow hash indirection table for eth0 with 13 RX ring(s):
+    0:      0     1     0     1     0     1     0     1
+    8:      0     1     0     1     0     1     0     1
+  [...]
+
+To make use of the new context direct traffic to it using an n-tuple
+filter::
+
+  # ethtool -N eth0 flow-type tcp6 dst-port 22 context 1
+  Added rule with ID 1023
+
+When done, remove the context and the rule::
+
+  # ethtool -N eth0 delete 1023
+  # ethtool -X eth0 context 1 delete
+
 
 RPS: Receive Packet Steering
 ============================
@@ -269,8 +333,8 @@ a single application thread handles flows with many different flow hashes.
 rps_sock_flow_table is a global flow table that contains the *desired* CPU
 for flows: the CPU that is currently processing the flow in userspace.
 Each table value is a CPU index that is updated during calls to recvmsg
-and sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage()
-and tcp_splice_read()).
+and sendmsg (specifically, inet_recvmsg(), inet_sendmsg() and
+tcp_splice_read()).
 
 When the scheduler moves a thread to a new CPU while it has outstanding
 receive packets on the old CPU, packets may arrive out of order. To
@@ -370,8 +434,10 @@ rps_dev_flow_table. The stack consults a CPU to hardware queue map which
 is maintained by the NIC driver. This is an auto-generated reverse map of
 the IRQ affinity table shown by /proc/interrupts. Drivers can use
 functions in the cpu_rmap (“CPU affinity reverse map”) kernel library
-to populate the map. For each CPU, the corresponding queue in the map is
-set to be one whose processing CPU is closest in cache locality.
+to populate the map. Alternatively, drivers can delegate the cpu_rmap
+management to the Kernel by calling netif_enable_cpu_rmap(). For each CPU,
+the corresponding queue in the map is set to be one whose processing CPU is
+closest in cache locality.
 
 
 Accelerated RFS Configuration
@@ -465,9 +531,9 @@ XPS Configuration
 -----------------
 
 XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
-default for SMP). The functionality remains disabled until explicitly
-configured. To enable XPS, the bitmap of CPUs/receive-queues that may
-use a transmit queue is configured using the sysfs file entry:
+default for SMP). If compiled in, it is driver dependent whether, and
+how, XPS is configured at device init. The mapping of CPUs/receive-queues
+to transmit queue can be inspected and configured using sysfs:
 
 For selection based on CPUs map::
 
diff --git a/Documentation/networking/seg6-sysctl.rst b/Documentation/networking/seg6-sysctl.rst
index ec73e1445030..07c20e470baf 100644
--- a/Documentation/networking/seg6-sysctl.rst
+++ b/Documentation/networking/seg6-sysctl.rst
@@ -24,3 +24,16 @@ seg6_require_hmac - INTEGER
 	* 1 - Drop SR packets without HMAC, validate SR packets with HMAC
 
 	Default is 0.
+
+seg6_flowlabel - INTEGER
+	Controls the behaviour of computing the flowlabel of outer
+	IPv6 header in case of SR T.encaps
+
+	 == =======================================================
+	 -1  set flowlabel to zero.
+	  0  copy flowlabel from Inner packet in case of Inner IPv6
+	     (Set flowlabel to 0 in case IPv4/L2)
+	  1  Compute the flowlabel using seg6_make_flowlabel()
+	 == =======================================================
+
+	Default is 0.
diff --git a/Documentation/networking/sfp-phylink.rst b/Documentation/networking/sfp-phylink.rst
index 5aec7c8857d0..5bf285d73e8a 100644
--- a/Documentation/networking/sfp-phylink.rst
+++ b/Documentation/networking/sfp-phylink.rst
@@ -163,7 +163,7 @@ this documentation.
 	err = phylink_of_phy_connect(priv->phylink, node, flags);
 
    For the most part, ``flags`` can be zero; these flags are passed to
-   the of_phy_attach() inside this function call if a PHY is specified
+   the phy_attach_direct() inside this function call if a PHY is specified
    in the DT node ``node``.
 
    ``node`` should be the DT node which contains the network phy property,
@@ -200,10 +200,12 @@ this documentation.
    when the in-band link state changes - otherwise the link will never
    come up.
 
-   The :c:func:`validate` method should mask the supplied supported mask,
-   and ``state->advertising`` with the supported ethtool link modes.
-   These are the new ethtool link modes, so bitmask operations must be
-   used. For an example, see drivers/net/ethernet/marvell/mvneta.c.
+   The :c:func:`mac_get_caps` method is optional, and if provided should
+   return the phylink MAC capabilities that are supported for the passed
+   ``interface`` mode. In general, there is no need to implement this method.
+   Phylink will use these capabilities in combination with permissible
+   capabilities for ``interface`` to determine the allowable ethtool link
+   modes.
 
    The :c:func:`mac_link_state` method is used to read the link state
    from the MAC, and report back the settings that the MAC is currently
@@ -224,21 +226,141 @@ this documentation.
    function should modify the state and only take the link down when
    absolutely necessary to change the MAC configuration.  An example
    of how to do this can be found in :c:func:`mvneta_mac_config` in
-   drivers/net/ethernet/marvell/mvneta.c.
+   ``drivers/net/ethernet/marvell/mvneta.c``.
 
    For further information on these methods, please see the inline
    documentation in :c:type:`struct phylink_mac_ops <phylink_mac_ops>`.
 
-9. Remove calls to of_parse_phandle() for the PHY,
-   of_phy_register_fixed_link() for fixed links etc. from the probe
-   function, and replace with:
+9. Fill-in the :c:type:`struct phylink_config <phylink_config>` fields with
+   a reference to the :c:type:`struct device <device>` associated to your
+   :c:type:`struct net_device <net_device>`:
 
    .. code-block:: c
 
-	struct phylink *phylink;
 	priv->phylink_config.dev = &dev.dev;
 	priv->phylink_config.type = PHYLINK_NETDEV;
 
+   Fill-in the various speeds, pause and duplex modes your MAC can handle:
+
+   .. code-block:: c
+
+        priv->phylink_config.mac_capabilities = MAC_SYM_PAUSE | MAC_10 | MAC_100 | MAC_1000FD;
+
+10. Some Ethernet controllers work in pair with a PCS (Physical Coding Sublayer)
+    block, that can handle among other things the encoding/decoding, link
+    establishment detection and autonegotiation. While some MACs have internal
+    PCS whose operation is transparent, some other require dedicated PCS
+    configuration for the link to become functional. In that case, phylink
+    provides a PCS abstraction through :c:type:`struct phylink_pcs <phylink_pcs>`.
+
+    Identify if your driver has one or more internal PCS blocks, and/or if
+    your controller can use an external PCS block that might be internally
+    connected to your controller.
+
+    If your controller doesn't have any internal PCS, you can go to step 11.
+
+    If your Ethernet controller contains one or several PCS blocks, create
+    one :c:type:`struct phylink_pcs <phylink_pcs>` instance per PCS block within
+    your driver's private data structure:
+
+    .. code-block:: c
+
+        struct phylink_pcs pcs;
+
+    Populate the relevant :c:type:`struct phylink_pcs_ops <phylink_pcs_ops>` to
+    configure your PCS. Create a :c:func:`pcs_get_state` function that reports
+    the inband link state, a :c:func:`pcs_config` function to configure your
+    PCS according to phylink-provided parameters, and a :c:func:`pcs_validate`
+    function that report to phylink all accepted configuration parameters for
+    your PCS:
+
+    .. code-block:: c
+
+        struct phylink_pcs_ops foo_pcs_ops = {
+                .pcs_validate = foo_pcs_validate,
+                .pcs_get_state = foo_pcs_get_state,
+                .pcs_config = foo_pcs_config,
+        };
+
+    Arrange for PCS link state interrupts to be forwarded into
+    phylink, via:
+
+    .. code-block:: c
+
+        phylink_pcs_change(pcs, link_is_up);
+
+    where ``link_is_up`` is true if the link is currently up or false
+    otherwise. If a PCS is unable to provide these interrupts, then
+    it should set ``pcs->pcs_poll = true;`` when creating the PCS.
+
+11. If your controller relies on, or accepts the presence of an external PCS
+    controlled through its own driver, add a pointer to a phylink_pcs instance
+    in your driver private data structure:
+
+    .. code-block:: c
+
+        struct phylink_pcs *pcs;
+
+    The way of getting an instance of the actual PCS depends on the platform,
+    some PCS sit on an MDIO bus and are grabbed by passing a pointer to the
+    corresponding :c:type:`struct mii_bus <mii_bus>` and the PCS's address on
+    that bus. In this example, we assume the controller attaches to a Lynx PCS
+    instance:
+
+    .. code-block:: c
+
+        priv->pcs = lynx_pcs_create_mdiodev(bus, 0);
+
+    Some PCS can be recovered based on firmware information:
+
+    .. code-block:: c
+
+        priv->pcs = lynx_pcs_create_fwnode(of_fwnode_handle(node));
+
+12. Populate the :c:func:`mac_select_pcs` callback and add it to your
+    :c:type:`struct phylink_mac_ops <phylink_mac_ops>` set of ops. This function
+    must return a pointer to the relevant :c:type:`struct phylink_pcs <phylink_pcs>`
+    that will be used for the requested link configuration:
+
+    .. code-block:: c
+
+        static struct phylink_pcs *foo_select_pcs(struct phylink_config *config,
+                                                  phy_interface_t interface)
+        {
+                struct foo_priv *priv = container_of(config, struct foo_priv,
+                                                     phylink_config);
+
+                if ( /* 'interface' needs a PCS to function */ )
+                        return priv->pcs;
+
+                return NULL;
+        }
+
+    See :c:func:`mvpp2_select_pcs` for an example of a driver that has multiple
+    internal PCS.
+
+13. Fill-in all the :c:type:`phy_interface_t <phy_interface_t>` (i.e. all MAC to
+    PHY link modes) that your MAC can output. The following example shows a
+    configuration for a MAC that can handle all RGMII modes, SGMII and 1000BaseX.
+    You must adjust these according to what your MAC and all PCS associated
+    with this MAC are capable of, and not just the interface you wish to use:
+
+    .. code-block:: c
+
+       phy_interface_set_rgmii(priv->phylink_config.supported_interfaces);
+        __set_bit(PHY_INTERFACE_MODE_SGMII,
+                  priv->phylink_config.supported_interfaces);
+        __set_bit(PHY_INTERFACE_MODE_1000BASEX,
+                  priv->phylink_config.supported_interfaces);
+
+14. Remove calls to of_parse_phandle() for the PHY,
+    of_phy_register_fixed_link() for fixed links etc. from the probe
+    function, and replace with:
+
+    .. code-block:: c
+
+	struct phylink *phylink;
+
 	phylink = phylink_create(&priv->phylink_config, node, phy_mode, &phylink_ops);
 	if (IS_ERR(phylink)) {
 		err = PTR_ERR(phylink);
@@ -247,14 +369,14 @@ this documentation.
 
 	priv->phylink = phylink;
 
-   and arrange to destroy the phylink in the probe failure path as
-   appropriate and the removal path too by calling:
+    and arrange to destroy the phylink in the probe failure path as
+    appropriate and the removal path too by calling:
 
-   .. code-block:: c
+    .. code-block:: c
 
 	phylink_destroy(priv->phylink);
 
-10. Arrange for MAC link state interrupts to be forwarded into
+15. Arrange for MAC link state interrupts to be forwarded into
     phylink, via:
 
     .. code-block:: c
@@ -262,17 +384,16 @@ this documentation.
 	phylink_mac_change(priv->phylink, link_is_up);
 
     where ``link_is_up`` is true if the link is currently up or false
-    otherwise. If a MAC is unable to provide these interrupts, then
-    it should set ``priv->phylink_config.pcs_poll = true;`` in step 9.
+    otherwise.
 
-11. Verify that the driver does not call::
+16. Verify that the driver does not call::
 
 	netif_carrier_on()
 	netif_carrier_off()
 
-   as these will interfere with phylink's tracking of the link state,
-   and cause phylink to omit calls via the :c:func:`mac_link_up` and
-   :c:func:`mac_link_down` methods.
+    as these will interfere with phylink's tracking of the link state,
+    and cause phylink to omit calls via the :c:func:`mac_link_up` and
+    :c:func:`mac_link_down` methods.
 
 Network drivers should call phylink_stop() and phylink_start() via their
 suspend/resume paths, which ensures that the appropriate
@@ -281,4 +402,4 @@ as necessary.
 
 For information describing the SFP cage in DT, please see the binding
 documentation in the kernel source tree
-``Documentation/devicetree/bindings/net/sff,sfp.txt``
+``Documentation/devicetree/bindings/net/sff,sfp.yaml``.
diff --git a/Documentation/networking/skbuff.rst b/Documentation/networking/skbuff.rst
new file mode 100644
index 000000000000..5b74275a73a3
--- /dev/null
+++ b/Documentation/networking/skbuff.rst
@@ -0,0 +1,37 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+struct sk_buff
+==============
+
+:c:type:`sk_buff` is the main networking structure representing
+a packet.
+
+Basic sk_buff geometry
+----------------------
+
+.. kernel-doc:: include/linux/skbuff.h
+   :doc: Basic sk_buff geometry
+
+Shared skbs and skb clones
+--------------------------
+
+:c:member:`sk_buff.users` is a simple refcount allowing multiple entities
+to keep a struct sk_buff alive. skbs with a ``sk_buff.users != 1`` are referred
+to as shared skbs (see skb_shared()).
+
+skb_clone() allows for fast duplication of skbs. None of the data buffers
+get copied, but caller gets a new metadata struct (struct sk_buff).
+&skb_shared_info.refcount indicates the number of skbs pointing at the same
+packet data (i.e. clones).
+
+dataref and headerless skbs
+---------------------------
+
+.. kernel-doc:: include/linux/skbuff.h
+   :doc: dataref and headerless skbs
+
+Checksum information
+--------------------
+
+.. kernel-doc:: include/linux/skbuff.h
+   :doc: skb checksums
diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
new file mode 100644
index 000000000000..a874d007f2db
--- /dev/null
+++ b/Documentation/networking/smc-sysctl.rst
@@ -0,0 +1,73 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========
+SMC Sysctl
+==========
+
+/proc/sys/net/smc/* Variables
+=============================
+
+autocorking_size - INTEGER
+	Setting SMC auto corking size:
+	SMC auto corking is like TCP auto corking from the application's
+	perspective of view. When applications do consecutive small
+	write()/sendmsg() system calls, we try to coalesce these small writes
+	as much as possible, to lower total amount of CDC and RDMA Write been
+	sent.
+	autocorking_size limits the maximum corked bytes that can be sent to
+	the under device in 1 single sending. If set to 0, the SMC auto corking
+	is disabled.
+	Applications can still use TCP_CORK for optimal behavior when they
+	know how/when to uncork their sockets.
+
+	Default: 64K
+
+smcr_buf_type - INTEGER
+        Controls which type of sndbufs and RMBs to use in later newly created
+        SMC-R link group. Only for SMC-R.
+
+        Default: 0 (physically contiguous sndbufs and RMBs)
+
+        Possible values:
+
+        - 0 - Use physically contiguous buffers
+        - 1 - Use virtually contiguous buffers
+        - 2 - Mixed use of the two types. Try physically contiguous buffers first.
+          If not available, use virtually contiguous buffers then.
+
+smcr_testlink_time - INTEGER
+	How frequently SMC-R link sends out TEST_LINK LLC messages to confirm
+	viability, after the last activity of connections on it. Value 0 means
+	disabling TEST_LINK.
+
+	Default: 30 seconds.
+
+wmem - INTEGER
+	Initial size of send buffer used by SMC sockets.
+
+	The minimum value is 16KiB and there is no hard limit for max value, but
+	only allowed 512KiB for SMC-R and 1MiB for SMC-D.
+
+	Default: 64KiB
+
+rmem - INTEGER
+	Initial size of receive buffer (RMB) used by SMC sockets.
+
+	The minimum value is 16KiB and there is no hard limit for max value, but
+	only allowed 512KiB for SMC-R and 1MiB for SMC-D.
+
+	Default: 64KiB
+
+smcr_max_links_per_lgr - INTEGER
+	Controls the max number of links can be added to a SMC-R link group. Notice that
+	the actual number of the links added to a SMC-R link group depends on the number
+	of RDMA devices exist in the system. The acceptable value ranges from 1 to 2. Only
+	for SMC-R v2.1 and later.
+
+	Default: 2
+
+smcr_max_conns_per_lgr - INTEGER
+	Controls the max number of connections can be added to a SMC-R link group. The
+	acceptable value ranges from 16 to 255. Only for SMC-R v2.1 and later.
+
+	Default: 255
diff --git a/Documentation/networking/snmp_counter.rst b/Documentation/networking/snmp_counter.rst
index 4edd0d38779e..ff1e6a8ffe21 100644
--- a/Documentation/networking/snmp_counter.rst
+++ b/Documentation/networking/snmp_counter.rst
@@ -313,8 +313,8 @@ https://lwn.net/Articles/576263/
 
 * TcpExtTCPOrigDataSent
 
-This counter is explained by `kernel commit f19c29e3e391`_, I pasted the
-explaination below::
+This counter is explained by kernel commit f19c29e3e391, I pasted the
+explanation below::
 
   TCPOrigDataSent: number of outgoing packets with original data (excluding
   retransmission but including data-in-SYN). This counter is different from
@@ -323,22 +323,20 @@ explaination below::
 
 * TCPSynRetrans
 
-This counter is explained by `kernel commit f19c29e3e391`_, I pasted the
-explaination below::
+This counter is explained by kernel commit f19c29e3e391, I pasted the
+explanation below::
 
   TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down
   retransmissions into SYN, fast-retransmits, timeout retransmits, etc.
 
 * TCPFastOpenActiveFail
 
-This counter is explained by `kernel commit f19c29e3e391`_, I pasted the
-explaination below::
+This counter is explained by kernel commit f19c29e3e391, I pasted the
+explanation below::
 
   TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed because
   the remote does not accept it or the attempts timed out.
 
-.. _kernel commit f19c29e3e391: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f19c29e3e391a66a273e9afebaf01917245148cd
-
 * TcpExtListenOverflows and TcpExtListenDrops
 
 When kernel receives a SYN from a client, and if the TCP accept queue
@@ -382,7 +380,7 @@ Defined in `RFC1213 tcpAttemptFails`_.
 
 Defined in `RFC1213 tcpOutRsts`_. The RFC says this counter indicates
 the 'segments sent containing the RST flag', but in linux kernel, this
-couner indicates the segments kerenl tried to send. The sending
+counter indicates the segments kernel tried to send. The sending
 process might be failed due to some errors (e.g. memory alloc failed).
 
 .. _RFC1213 tcpOutRsts: https://tools.ietf.org/html/rfc1213#page-52
@@ -698,11 +696,9 @@ number of the SACK block. For more details, please refer the comment
 of the function tcp_is_sackblock_valid in the kernel source code. A
 SACK option could have up to 4 blocks, they are checked
 individually. E.g., if 3 blocks of a SACk is invalid, the
-corresponding counter would be updated 3 times. The comment of the
-`Add counters for discarded SACK blocks`_ patch has additional
-explaination:
-
-.. _Add counters for discarded SACK blocks: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=18f02545a9a16c9a89778b91a162ad16d510bb32
+corresponding counter would be updated 3 times. The comment of commit
+18f02545a9a1 ("[TCP] MIB: Add counters for discarded SACK blocks")
+has additional explanation:
 
 * TcpExtTCPSACKDiscard
 
@@ -829,7 +825,7 @@ PAWS check fails or the received sequence number is out of window.
 
 * TcpExtTCPACKSkippedTimeWait
 
-Tha ACK is skipped in Time-Wait status, the reason would be either
+The ACK is skipped in Time-Wait status, the reason would be either
 PAWS check failed or the received sequence number is out of window.
 
 * TcpExtTCPACKSkippedChallenge
@@ -980,11 +976,11 @@ How many reply packets of the SYN cookies the TCP stack receives.
 
 The MSS decoded from the SYN cookie is invalid. When this counter is
 updated, the received packet won't be treated as a SYN cookie and the
-TcpExtSyncookiesRecv counter wont be updated.
+TcpExtSyncookiesRecv counter won't be updated.
 
 Challenge ACK
 =============
-For details of challenge ACK, please refer the explaination of
+For details of challenge ACK, please refer the explanation of
 TcpExtTCPACKSkippedChallenge.
 
 * TcpExtTCPChallengeACK
@@ -1002,7 +998,7 @@ prune
 =====
 When a socket is under memory pressure, the TCP stack will try to
 reclaim memory from the receiving queue and out of order queue. One of
-the reclaiming method is 'collapse', which means allocate a big sbk,
+the reclaiming method is 'collapse', which means allocate a big skb,
 copy the contiguous skbs to the single big skb, and free these
 contiguous skbs.
 
@@ -1163,7 +1159,7 @@ The server side nstat output::
   IpExtOutOctets                  52                 0.0
   IpExtInNoECTPkts                1                  0.0
 
-Input a string in nc client side again ('world' in our exmaple)::
+Input a string in nc client side again ('world' in our example)::
 
   nstatuser@nstat-a:~$ nc -v nstat-b 9000
   Connection to nstat-b 9000 port [tcp/*] succeeded!
@@ -1211,7 +1207,7 @@ replied an ACK. But kernel handled them in different ways. When the
 TCP window scale option is not used, kernel will try to enable fast
 path immediately when the connection comes into the established state,
 but if the TCP window scale option is used, kernel will disable the
-fast path at first, and try to enable it after kerenl receives
+fast path at first, and try to enable it after kernel receives
 packets. We could use the 'ss' command to verify whether the window
 scale option is used. e.g. run below command on either server or
 client::
@@ -1343,7 +1339,7 @@ Check TcpExtTCPAbortOnMemory on client::
   nstatuser@nstat-a:~$ nstat | grep -i abort
   TcpExtTCPAbortOnMemory          54                 0.0
 
-Check orphane socket count on client::
+Check orphaned socket count on client::
 
   nstatuser@nstat-a:~$ ss -s
   Total: 131 (kernel 0)
@@ -1681,11 +1677,11 @@ RST to nstat-b::
 
   nstatuser@nstat-a:~$ sudo iptables -A INPUT -p tcp --sport 9000 -j DROP
 
-Send 3 SYN repeatly to nstat-b::
+Send 3 SYN repeatedly to nstat-b::
 
   nstatuser@nstat-a:~$ for i in {1..3}; do sudo tcpreplay -i ens3 /tmp/syn_fixcsum.pcap; done
 
-Check snmp cunter on nstat-b::
+Check snmp counter on nstat-b::
 
   nstatuser@nstat-b:~$ nstat | grep -i skip
   TcpExtTCPACKSkippedSynRecv      1                  0.0
@@ -1770,7 +1766,7 @@ string 'foo' in our example::
   Connection from nstat-a 42132 received!
   foo
 
-On nstat-a, the tcpdump should have caputred the ACK. We should check
+On nstat-a, the tcpdump should have captured the ACK. We should check
 the source port numbers of the two nc clients::
 
   nstatuser@nstat-a:~$ ss -ta '( dport = :9000 || dport = :9001 )' | tee
@@ -1778,7 +1774,7 @@ the source port numbers of the two nc clients::
   ESTAB  0        0            192.168.122.250:50208       192.168.122.251:9000
   ESTAB  0        0            192.168.122.250:42132       192.168.122.251:9001
 
-Run tcprewrite, change port 9001 to port 9000, chagne port 42132 to
+Run tcprewrite, change port 9001 to port 9000, change port 42132 to
 port 50208::
 
   nstatuser@nstat-a:~$ tcprewrite --infile /tmp/seq_pre.pcap --outfile /tmp/seq.pcap -r 9001:9000 -r 42132:50208 --fixcsum
diff --git a/Documentation/networking/sriov.rst b/Documentation/networking/sriov.rst
new file mode 100644
index 000000000000..5deb4ff3154f
--- /dev/null
+++ b/Documentation/networking/sriov.rst
@@ -0,0 +1,25 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+NIC SR-IOV APIs
+===============
+
+Modern NICs are strongly encouraged to focus on implementing the ``switchdev``
+model (see :ref:`switchdev`) to configure forwarding and security of SR-IOV
+functionality.
+
+Legacy API
+==========
+
+The old SR-IOV API is implemented in ``rtnetlink`` Netlink family as part of
+the ``RTM_GETLINK`` and ``RTM_SETLINK`` commands. On the driver side
+it consists of a number of ``ndo_set_vf_*`` and ``ndo_get_vf_*`` callbacks.
+
+Since the legacy APIs do not integrate well with the rest of the stack
+the API is considered frozen; no new functionality or extensions
+will be accepted. New drivers should not implement the uncommon callbacks;
+namely the following callbacks are off limits:
+
+ - ``ndo_get_vf_port``
+ - ``ndo_set_vf_port``
+ - ``ndo_set_vf_rss_query_en``
diff --git a/Documentation/networking/statistics.rst b/Documentation/networking/statistics.rst
new file mode 100644
index 000000000000..518284e287b0
--- /dev/null
+++ b/Documentation/networking/statistics.rst
@@ -0,0 +1,236 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+Interface statistics
+====================
+
+Overview
+========
+
+This document is a guide to Linux network interface statistics.
+
+There are three main sources of interface statistics in Linux:
+
+ - standard interface statistics based on
+   :c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>`;
+ - protocol-specific statistics; and
+ - driver-defined statistics available via ethtool.
+
+Standard interface statistics
+-----------------------------
+
+There are multiple interfaces to reach the standard statistics.
+Most commonly used is the `ip` command from `iproute2`::
+
+  $ ip -s -s link show dev ens4u1u1
+  6: ens4u1u1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
+    link/ether 48:2a:e3:4c:b1:d1 brd ff:ff:ff:ff:ff:ff
+    RX: bytes  packets  errors  dropped overrun mcast
+    74327665117 69016965 0       0       0       0
+    RX errors: length   crc     frame   fifo    missed
+               0        0       0       0       0
+    TX: bytes  packets  errors  dropped carrier collsns
+    21405556176 44608960 0       0       0       0
+    TX errors: aborted  fifo   window heartbeat transns
+               0        0       0       0       128
+    altname enp58s0u1u1
+
+Note that `-s` has been specified twice to see all members of
+:c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>`.
+If `-s` is specified once the detailed errors won't be shown.
+
+`ip` supports JSON formatting via the `-j` option.
+
+Queue statistics
+~~~~~~~~~~~~~~~~
+
+Queue statistics are accessible via the netdev netlink family.
+
+Currently no widely distributed CLI exists to access those statistics.
+Kernel development tools (ynl) can be used to experiment with them,
+see `Documentation/userspace-api/netlink/intro-specs.rst`.
+
+Protocol-specific statistics
+----------------------------
+
+Protocol-specific statistics are exposed via relevant interfaces,
+the same interfaces as are used to configure them.
+
+ethtool
+~~~~~~~
+
+Ethtool exposes common low-level statistics.
+All the standard statistics are expected to be maintained
+by the device, not the driver (as opposed to driver-defined stats
+described in the next section which mix software and hardware stats).
+For devices which contain unmanaged
+switches (e.g. legacy SR-IOV or multi-host NICs) the events counted
+may not pertain exclusively to the packets destined to
+the local host interface. In other words the events may
+be counted at the network port (MAC/PHY blocks) without separation
+for different host side (PCIe) devices. Such ambiguity must not
+be present when internal switch is managed by Linux (so called
+switchdev mode for NICs).
+
+Standard ethtool statistics can be accessed via the interfaces used
+for configuration. For example ethtool interface used
+to configure pause frames can report corresponding hardware counters::
+
+  $ ethtool --include-statistics -a eth0
+  Pause parameters for eth0:
+  Autonegotiate:	on
+  RX:			on
+  TX:			on
+  Statistics:
+    tx_pause_frames: 1
+    rx_pause_frames: 1
+
+General Ethernet statistics not associated with any particular
+functionality are exposed via ``ethtool -S $ifc`` by specifying
+the ``--groups`` parameter::
+
+  $ ethtool -S eth0 --groups eth-phy eth-mac eth-ctrl rmon
+  Stats for eth0:
+  eth-phy-SymbolErrorDuringCarrier: 0
+  eth-mac-FramesTransmittedOK: 1
+  eth-mac-FrameTooLongErrors: 1
+  eth-ctrl-MACControlFramesTransmitted: 1
+  eth-ctrl-MACControlFramesReceived: 0
+  eth-ctrl-UnsupportedOpcodesReceived: 1
+  rmon-etherStatsUndersizePkts: 1
+  rmon-etherStatsJabbers: 0
+  rmon-rx-etherStatsPkts64Octets: 1
+  rmon-rx-etherStatsPkts65to127Octets: 0
+  rmon-rx-etherStatsPkts128to255Octets: 0
+  rmon-tx-etherStatsPkts64Octets: 2
+  rmon-tx-etherStatsPkts65to127Octets: 3
+  rmon-tx-etherStatsPkts128to255Octets: 0
+
+Driver-defined statistics
+-------------------------
+
+Driver-defined ethtool statistics can be dumped using `ethtool -S $ifc`, e.g.::
+
+  $ ethtool -S ens4u1u1
+  NIC statistics:
+     tx_single_collisions: 0
+     tx_multi_collisions: 0
+
+uAPIs
+=====
+
+procfs
+------
+
+The historical `/proc/net/dev` text interface gives access to the list
+of interfaces as well as their statistics.
+
+Note that even though this interface is using
+:c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>`
+internally it combines some of the fields.
+
+sysfs
+-----
+
+Each device directory in sysfs contains a `statistics` directory (e.g.
+`/sys/class/net/lo/statistics/`) with files corresponding to
+members of :c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>`.
+
+This simple interface is convenient especially in constrained/embedded
+environments without access to tools. However, it's inefficient when
+reading multiple stats as it internally performs a full dump of
+:c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>`
+and reports only the stat corresponding to the accessed file.
+
+Sysfs files are documented in
+Documentation/ABI/testing/sysfs-class-net-statistics.
+
+
+netlink
+-------
+
+`rtnetlink` (`NETLINK_ROUTE`) is the preferred method of accessing
+:c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>` stats.
+
+Statistics are reported both in the responses to link information
+requests (`RTM_GETLINK`) and statistic requests (`RTM_GETSTATS`,
+when `IFLA_STATS_LINK_64` bit is set in the `.filter_mask` of the request).
+
+netdev (netlink)
+~~~~~~~~~~~~~~~~
+
+`netdev` generic netlink family allows accessing page pool and per queue
+statistics.
+
+ethtool
+-------
+
+Ethtool IOCTL interface allows drivers to report implementation
+specific statistics. Historically it has also been used to report
+statistics for which other APIs did not exist, like per-device-queue
+statistics, or standard-based statistics (e.g. RFC 2863).
+
+Statistics and their string identifiers are retrieved separately.
+Identifiers via `ETHTOOL_GSTRINGS` with `string_set` set to `ETH_SS_STATS`,
+and values via `ETHTOOL_GSTATS`. User space should use `ETHTOOL_GDRVINFO`
+to retrieve the number of statistics (`.n_stats`).
+
+ethtool-netlink
+---------------
+
+Ethtool netlink is a replacement for the older IOCTL interface.
+
+Protocol-related statistics can be requested in get commands by setting
+the `ETHTOOL_FLAG_STATS` flag in `ETHTOOL_A_HEADER_FLAGS`. Currently
+statistics are supported in the following commands:
+
+  - `ETHTOOL_MSG_PAUSE_GET`
+  - `ETHTOOL_MSG_FEC_GET`
+  - `ETHTOOL_MSG_MM_GET`
+
+debugfs
+-------
+
+Some drivers expose extra statistics via `debugfs`.
+
+struct rtnl_link_stats64
+========================
+
+.. kernel-doc:: include/uapi/linux/if_link.h
+    :identifiers: rtnl_link_stats64
+
+Notes for driver authors
+========================
+
+Drivers should report all statistics which have a matching member in
+:c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>` exclusively
+via `.ndo_get_stats64`. Reporting such standard stats via ethtool
+or debugfs will not be accepted.
+
+Drivers must ensure best possible compliance with
+:c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>`.
+Please note for example that detailed error statistics must be
+added into the general `rx_error` / `tx_error` counters.
+
+The `.ndo_get_stats64` callback can not sleep because of accesses
+via `/proc/net/dev`. If driver may sleep when retrieving the statistics
+from the device it should do so periodically asynchronously and only return
+a recent copy from `.ndo_get_stats64`. Ethtool interrupt coalescing interface
+allows setting the frequency of refreshing statistics, if needed.
+
+Retrieving ethtool statistics is a multi-syscall process, drivers are advised
+to keep the number of statistics constant to avoid race conditions with
+user space trying to read them.
+
+Statistics must persist across routine operations like bringing the interface
+down and up.
+
+Kernel-internal data structures
+-------------------------------
+
+The following structures are internal to the kernel, their members are
+translated to netlink attributes when dumped. Drivers must not overwrite
+the statistics they don't report with 0.
+
+- ethtool_pause_stats()
+- ethtool_fec_stats()
diff --git a/Documentation/networking/strparser.rst b/Documentation/networking/strparser.rst
index 6cab1f74ae05..8dc6bb04c710 100644
--- a/Documentation/networking/strparser.rst
+++ b/Documentation/networking/strparser.rst
@@ -112,7 +112,7 @@ Functions
 Callbacks
 =========
 
-There are six callbacks:
+There are seven callbacks:
 
     ::
 
@@ -180,10 +180,17 @@ There are six callbacks:
     struct contains two fields: offset and full_len. Offset is
     where the message starts in the skb, and full_len is the
     the length of the message. skb->len - offset may be greater
-    then full_len since strparser does not trim the skb.
+    than full_len since strparser does not trim the skb.
 
     ::
 
+	int (*read_sock)(struct strparser *strp, read_descriptor_t *desc,
+                     sk_read_actor_t recv_actor);
+
+    The read_sock callback is used by strparser instead of
+    sock->ops->read_sock, if provided.
+    ::
+
 	int (*read_sock_done)(struct strparser *strp, int err);
 
      read_sock_done is called when the stream parser is done reading
diff --git a/Documentation/networking/switchdev.rst b/Documentation/networking/switchdev.rst
index ddc3f35775dc..2966b7122f05 100644
--- a/Documentation/networking/switchdev.rst
+++ b/Documentation/networking/switchdev.rst
@@ -1,5 +1,6 @@
 .. SPDX-License-Identifier: GPL-2.0
 .. include:: <isonum.txt>
+.. _switchdev:
 
 ===============================================
 Ethernet switch device driver model (switchdev)
@@ -136,10 +137,10 @@ would be sub-port 0 on port 1 on switch 1.
 Port Features
 ^^^^^^^^^^^^^
 
-NETIF_F_NETNS_LOCAL
+dev->netns_immutable
 
 If the switchdev driver (and device) only supports offloading of the default
-network namespace (netns), the driver should set this feature flag to prevent
+network namespace (netns), the driver should set this private flag to prevent
 the port netdev from being moved out of the default netns.  A netns-aware
 driver/device would not set this flag and be responsible for partitioning
 hardware to preserve netns containment.  This means hardware cannot forward
@@ -159,7 +160,7 @@ tools such as iproute2.
 
 The switchdev driver can know a particular port's position in the topology by
 monitoring NETDEV_CHANGEUPPER notifications.  For example, a port moved into a
-bond will see it's upper master change.  If that bond is moved into a bridge,
+bond will see its upper master change.  If that bond is moved into a bridge,
 the bond's upper master will change.  And so on.  The driver will track such
 movements to know what position a port is in in the overall topology by
 registering for netdevice events and acting on NETDEV_CHANGEUPPER.
@@ -181,18 +182,41 @@ To offloading L2 bridging, the switchdev driver/device should support:
 Static FDB Entries
 ^^^^^^^^^^^^^^^^^^
 
-The switchdev driver should implement ndo_fdb_add, ndo_fdb_del and ndo_fdb_dump
-to support static FDB entries installed to the device.  Static bridge FDB
-entries are installed, for example, using iproute2 bridge cmd::
+A driver which implements the ``ndo_fdb_add``, ``ndo_fdb_del`` and
+``ndo_fdb_dump`` operations is able to support the command below, which adds a
+static bridge FDB entry::
 
-	bridge fdb add ADDR dev DEV [vlan VID] [self]
+        bridge fdb add dev DEV ADDRESS [vlan VID] [self] static
 
-The driver should use the helper switchdev_port_fdb_xxx ops for ndo_fdb_xxx
-ops, and handle add/delete/dump of SWITCHDEV_OBJ_ID_PORT_FDB object using
-switchdev_port_obj_xxx ops.
+(the "static" keyword is non-optional: if not specified, the entry defaults to
+being "local", which means that it should not be forwarded)
 
-XXX: what should be done if offloading this rule to hardware fails (for
-example, due to full capacity in hardware tables) ?
+The "self" keyword (optional because it is implicit) has the role of
+instructing the kernel to fulfill the operation through the ``ndo_fdb_add``
+implementation of the ``DEV`` device itself. If ``DEV`` is a bridge port, this
+will bypass the bridge and therefore leave the software database out of sync
+with the hardware one.
+
+To avoid this, the "master" keyword can be used::
+
+        bridge fdb add dev DEV ADDRESS [vlan VID] master static
+
+The above command instructs the kernel to search for a master interface of
+``DEV`` and fulfill the operation through the ``ndo_fdb_add`` method of that.
+This time, the bridge generates a ``SWITCHDEV_FDB_ADD_TO_DEVICE`` notification
+which the port driver can handle and use it to program its hardware table. This
+way, the software and the hardware database will both contain this static FDB
+entry.
+
+Note: for new switchdev drivers that offload the Linux bridge, implementing the
+``ndo_fdb_add`` and ``ndo_fdb_del`` bridge bypass methods is strongly
+discouraged: all static FDB entries should be added on a bridge port using the
+"master" flag. The ``ndo_fdb_dump`` is an exception and can be implemented to
+visualize the hardware tables, if the device does not have an interrupt for
+notifying the operating system of newly learned/forgotten dynamic FDB
+addresses. In that case, the hardware FDB might end up having entries that the
+software FDB does not, and implementing ``ndo_fdb_dump`` is the only way to see
+them.
 
 Note: by default, the bridge does not filter on VLAN and only bridges untagged
 traffic.  To enable VLAN support, turn on VLAN filtering::
@@ -385,3 +409,156 @@ The driver can monitor for updates to arp_tbl using the netevent notifier
 NETEVENT_NEIGH_UPDATE.  The device can be programmed with resolved nexthops
 for the routes as arp_tbl updates.  The driver implements ndo_neigh_destroy
 to know when arp_tbl neighbor entries are purged from the port.
+
+Device driver expected behavior
+-------------------------------
+
+Below is a set of defined behavior that switchdev enabled network devices must
+adhere to.
+
+Configuration-less state
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Upon driver bring up, the network devices must be fully operational, and the
+backing driver must configure the network device such that it is possible to
+send and receive traffic to this network device and it is properly separated
+from other network devices/ports (e.g.: as is frequent with a switch ASIC). How
+this is achieved is heavily hardware dependent, but a simple solution can be to
+use per-port VLAN identifiers unless a better mechanism is available
+(proprietary metadata for each network port for instance).
+
+The network device must be capable of running a full IP protocol stack
+including multicast, DHCP, IPv4/6, etc. If necessary, it should program the
+appropriate filters for VLAN, multicast, unicast etc. The underlying device
+driver must effectively be configured in a similar fashion to what it would do
+when IGMP snooping is enabled for IP multicast over these switchdev network
+devices and unsolicited multicast must be filtered as early as possible in
+the hardware.
+
+When configuring VLANs on top of the network device, all VLANs must be working,
+irrespective of the state of other network devices (e.g.: other ports being part
+of a VLAN-aware bridge doing ingress VID checking). See below for details.
+
+If the device implements e.g.: VLAN filtering, putting the interface in
+promiscuous mode should allow the reception of all VLAN tags (including those
+not present in the filter(s)).
+
+Bridged switch ports
+^^^^^^^^^^^^^^^^^^^^
+
+When a switchdev enabled network device is added as a bridge member, it should
+not disrupt any functionality of non-bridged network devices and they
+should continue to behave as normal network devices. Depending on the bridge
+configuration knobs below, the expected behavior is documented.
+
+Bridge VLAN filtering
+^^^^^^^^^^^^^^^^^^^^^
+
+The Linux bridge allows the configuration of a VLAN filtering mode (statically,
+at device creation time, and dynamically, during run time) which must be
+observed by the underlying switchdev network device/hardware:
+
+- with VLAN filtering turned off: the bridge is strictly VLAN unaware and its
+  data path will process all Ethernet frames as if they are VLAN-untagged.
+  The bridge VLAN database can still be modified, but the modifications should
+  have no effect while VLAN filtering is turned off. Frames ingressing the
+  device with a VID that is not programmed into the bridge/switch's VLAN table
+  must be forwarded and may be processed using a VLAN device (see below).
+
+- with VLAN filtering turned on: the bridge is VLAN-aware and frames ingressing
+  the device with a VID that is not programmed into the bridges/switch's VLAN
+  table must be dropped (strict VID checking).
+
+When there is a VLAN device (e.g: sw0p1.100) configured on top of a switchdev
+network device which is a bridge port member, the behavior of the software
+network stack must be preserved, or the configuration must be refused if that
+is not possible.
+
+- with VLAN filtering turned off, the bridge will process all ingress traffic
+  for the port, except for the traffic tagged with a VLAN ID destined for a
+  VLAN upper. The VLAN upper interface (which consumes the VLAN tag) can even
+  be added to a second bridge, which includes other switch ports or software
+  interfaces. Some approaches to ensure that the forwarding domain for traffic
+  belonging to the VLAN upper interfaces are managed properly:
+
+    * If forwarding destinations can be managed per VLAN, the hardware could be
+      configured to map all traffic, except the packets tagged with a VID
+      belonging to a VLAN upper interface, to an internal VID corresponding to
+      untagged packets. This internal VID spans all ports of the VLAN-unaware
+      bridge. The VID corresponding to the VLAN upper interface spans the
+      physical port of that VLAN interface, as well as the other ports that
+      might be bridged with it.
+    * Treat bridge ports with VLAN upper interfaces as standalone, and let
+      forwarding be handled in the software data path.
+
+- with VLAN filtering turned on, these VLAN devices can be created as long as
+  the bridge does not have an existing VLAN entry with the same VID on any
+  bridge port. These VLAN devices cannot be enslaved into the bridge since they
+  duplicate functionality/use case with the bridge's VLAN data path processing.
+
+Non-bridged network ports of the same switch fabric must not be disturbed in any
+way by the enabling of VLAN filtering on the bridge device(s). If the VLAN
+filtering setting is global to the entire chip, then the standalone ports
+should indicate to the network stack that VLAN filtering is required by setting
+'rx-vlan-filter: on [fixed]' in the ethtool features.
+
+Because VLAN filtering can be turned on/off at runtime, the switchdev driver
+must be able to reconfigure the underlying hardware on the fly to honor the
+toggling of that option and behave appropriately. If that is not possible, the
+switchdev driver can also refuse to support dynamic toggling of the VLAN
+filtering knob at runtime and require a destruction of the bridge device(s) and
+creation of new bridge device(s) with a different VLAN filtering value to
+ensure VLAN awareness is pushed down to the hardware.
+
+Even when VLAN filtering in the bridge is turned off, the underlying switch
+hardware and driver may still configure itself in a VLAN-aware mode provided
+that the behavior described above is observed.
+
+The VLAN protocol of the bridge plays a role in deciding whether a packet is
+treated as tagged or not: a bridge using the 802.1ad protocol must treat both
+VLAN-untagged packets, as well as packets tagged with 802.1Q headers, as
+untagged.
+
+The 802.1p (VID 0) tagged packets must be treated in the same way by the device
+as untagged packets, since the bridge device does not allow the manipulation of
+VID 0 in its database.
+
+When the bridge has VLAN filtering enabled and a PVID is not configured on the
+ingress port, untagged and 802.1p tagged packets must be dropped. When the bridge
+has VLAN filtering enabled and a PVID exists on the ingress port, untagged and
+priority-tagged packets must be accepted and forwarded according to the
+bridge's port membership of the PVID VLAN. When the bridge has VLAN filtering
+disabled, the presence/lack of a PVID should not influence the packet
+forwarding decision.
+
+Bridge IGMP snooping
+^^^^^^^^^^^^^^^^^^^^
+
+The Linux bridge allows the configuration of IGMP snooping (statically, at
+interface creation time, or dynamically, during runtime) which must be observed
+by the underlying switchdev network device/hardware in the following way:
+
+- when IGMP snooping is turned off, multicast traffic must be flooded to all
+  ports within the same bridge that have mcast_flood=true. The CPU/management
+  port should ideally not be flooded (unless the ingress interface has
+  IFF_ALLMULTI or IFF_PROMISC) and continue to learn multicast traffic through
+  the network stack notifications. If the hardware is not capable of doing that
+  then the CPU/management port must also be flooded and multicast filtering
+  happens in software.
+
+- when IGMP snooping is turned on, multicast traffic must selectively flow
+  to the appropriate network ports (including CPU/management port). Flooding of
+  unknown multicast should be only towards the ports connected to a multicast
+  router (the local device may also act as a multicast router).
+
+The switch must adhere to RFC 4541 and flood multicast traffic accordingly
+since that is what the Linux bridge implementation does.
+
+Because IGMP snooping can be turned on/off at runtime, the switchdev driver
+must be able to reconfigure the underlying hardware on the fly to honor the
+toggling of that option and behave appropriately.
+
+A switchdev driver can also refuse to support dynamic toggling of the multicast
+snooping knob at runtime and require the destruction of the bridge device(s)
+and creation of a new bridge device(s) with a different multicast snooping
+value.
diff --git a/Documentation/networking/sysfs-tagging.rst b/Documentation/networking/sysfs-tagging.rst
new file mode 100644
index 000000000000..65307130ab63
--- /dev/null
+++ b/Documentation/networking/sysfs-tagging.rst
@@ -0,0 +1,48 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Sysfs tagging
+=============
+
+(Taken almost verbatim from Eric Biederman's netns tagging patch
+commit msg)
+
+The problem.  Network devices show up in sysfs and with the network
+namespace active multiple devices with the same name can show up in
+the same directory, ouch!
+
+To avoid that problem and allow existing applications in network
+namespaces to see the same interface that is currently presented in
+sysfs, sysfs now has tagging directory support.
+
+By using the network namespace pointers as tags to separate out
+the sysfs directory entries we ensure that we don't have conflicts
+in the directories and applications only see a limited set of
+the network devices.
+
+Each sysfs directory entry may be tagged with a namespace via the
+``void *ns member`` of its ``kernfs_node``.  If a directory entry is tagged,
+then ``kernfs_node->flags`` will have a flag between KOBJ_NS_TYPE_NONE
+and KOBJ_NS_TYPES, and ns will point to the namespace to which it
+belongs.
+
+Each sysfs superblock's kernfs_super_info contains an array
+``void *ns[KOBJ_NS_TYPES]``.  When a task in a tagging namespace
+kobj_nstype first mounts sysfs, a new superblock is created.  It
+will be differentiated from other sysfs mounts by having its
+``s_fs_info->ns[kobj_nstype]`` set to the new namespace.  Note that
+through bind mounting and mounts propagation, a task can easily view
+the contents of other namespaces' sysfs mounts.  Therefore, when a
+namespace exits, it will call kobj_ns_exit() to invalidate any
+kernfs_node->ns pointers pointing to it.
+
+Users of this interface:
+
+- define a type in the ``kobj_ns_type`` enumeration.
+- call kobj_ns_type_register() with its ``kobj_ns_type_operations`` which has
+
+  - current_ns() which returns current's namespace
+  - netlink_ns() which returns a socket's namespace
+  - initial_ns() which returns the initial namespace
+
+- call kobj_ns_exit() when an individual tag is no longer valid
diff --git a/Documentation/networking/tc-queue-filters.rst b/Documentation/networking/tc-queue-filters.rst
new file mode 100644
index 000000000000..6b417092276f
--- /dev/null
+++ b/Documentation/networking/tc-queue-filters.rst
@@ -0,0 +1,37 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+TC queue based filtering
+=========================
+
+TC can be used for directing traffic to either a set of queues or
+to a single queue on both the transmit and receive side.
+
+On the transmit side:
+
+1) TC filter directing traffic to a set of queues is achieved
+   using the action skbedit priority for Tx priority selection,
+   the priority maps to a traffic class (set of queues) when
+   the queue-sets are configured using mqprio.
+
+2) TC filter directs traffic to a transmit queue with the action
+   skbedit queue_mapping $tx_qid. The action skbedit queue_mapping
+   for transmit queue is executed in software only and cannot be
+   offloaded.
+
+Likewise, on the receive side, the two filters for selecting set of
+queues and/or a single queue are supported as below:
+
+1) TC flower filter directs incoming traffic to a set of queues using
+   the 'hw_tc' option.
+   hw_tc $TCID - Specify a hardware traffic class to pass matching
+   packets on to. TCID is in the range 0 through 15.
+
+2) TC filter with action skbedit queue_mapping $rx_qid selects a
+   receive queue. The action skbedit queue_mapping for receive queue
+   is supported only in hardware. Multiple filters may compete in
+   the hardware for queue selection. In such case, the hardware
+   pipeline resolves conflicts based on priority. On Intel E810
+   devices, TC filter directing traffic to a queue have higher
+   priority over flow director filter assigning a queue. The hash
+   filter has lowest priority.
diff --git a/Documentation/networking/tcp_ao.rst b/Documentation/networking/tcp_ao.rst
new file mode 100644
index 000000000000..d5b6d0df63c3
--- /dev/null
+++ b/Documentation/networking/tcp_ao.rst
@@ -0,0 +1,453 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================================================
+TCP Authentication Option Linux implementation (RFC5925)
+========================================================
+
+TCP Authentication Option (TCP-AO) provides a TCP extension aimed at verifying
+segments between trusted peers. It adds a new TCP header option with
+a Message Authentication Code (MAC). MACs are produced from the content
+of a TCP segment using a hashing function with a password known to both peers.
+The intent of TCP-AO is to deprecate TCP-MD5 providing better security,
+key rotation and support for a variety of hashing algorithms.
+
+1. Introduction
+===============
+
+.. table:: Short and Limited Comparison of TCP-AO and TCP-MD5
+
+ +----------------------+------------------------+-----------------------+
+ |                      |       TCP-MD5          |         TCP-AO        |
+ +======================+========================+=======================+
+ |Supported hashing     |MD5                     |Must support HMAC-SHA1 |
+ |algorithms            |(cryptographically weak)|(chosen-prefix attacks)|
+ |                      |                        |and CMAC-AES-128 (only |
+ |                      |                        |side-channel attacks). |
+ |                      |                        |May support any hashing|
+ |                      |                        |algorithm.             |
+ +----------------------+------------------------+-----------------------+
+ |Length of MACs (bytes)|16                      |Typically 12-16.       |
+ |                      |                        |Other variants that fit|
+ |                      |                        |TCP header permitted.  |
+ +----------------------+------------------------+-----------------------+
+ |Number of keys per    |1                       |Many                   |
+ |TCP connection        |                        |                       |
+ +----------------------+------------------------+-----------------------+
+ |Possibility to change |Non-practical (both     |Supported by protocol  |
+ |an active key         |peers have to change    |                       |
+ |                      |them during MSL)        |                       |
+ +----------------------+------------------------+-----------------------+
+ |Protection against    |No                      |Yes: ignoring them     |
+ |ICMP 'hard errors'    |                        |by default on          |
+ |                      |                        |established connections|
+ +----------------------+------------------------+-----------------------+
+ |Protection against    |No                      |Yes: pseudo-header     |
+ |traffic-crossing      |                        |includes TCP ports.    |
+ |attack                |                        |                       |
+ +----------------------+------------------------+-----------------------+
+ |Protection against    |No                      |Sequence Number        |
+ |replayed TCP segments |                        |Extension (SNE) and    |
+ |                      |                        |Initial Sequence       |
+ |                      |                        |Numbers (ISNs)         |
+ +----------------------+------------------------+-----------------------+
+ |Supports              |Yes                     |No. ISNs+SNE are needed|
+ |Connectionless Resets |                        |to correctly sign RST. |
+ +----------------------+------------------------+-----------------------+
+ |Standards             |RFC 2385                |RFC 5925, RFC 5926     |
+ +----------------------+------------------------+-----------------------+
+
+
+1.1 Frequently Asked Questions (FAQ) with references to RFC 5925
+----------------------------------------------------------------
+
+Q: Can either SendID or RecvID be non-unique for the same 4-tuple
+(srcaddr, srcport, dstaddr, dstport)?
+
+A: No [3.1]::
+
+   >> The IDs of MKTs MUST NOT overlap where their TCP connection
+   identifiers overlap.
+
+Q: Can Master Key Tuple (MKT) for an active connection be removed?
+
+A: No, unless it's copied to Transport Control Block (TCB) [3.1]::
+
+   It is presumed that an MKT affecting a particular connection cannot
+   be destroyed during an active connection -- or, equivalently, that
+   its parameters are copied to an area local to the connection (i.e.,
+   instantiated) and so changes would affect only new connections.
+
+Q: If an old MKT needs to be deleted, how should it be done in order
+to not remove it for an active connection? (As it can be still in use
+at any moment later)
+
+A: Not specified by RFC 5925, seems to be a problem for key management
+to ensure that no one uses such MKT before trying to remove it.
+
+Q: Can an old MKT exist forever and be used by another peer?
+
+A: It can, it's a key management task to decide when to remove an old key [6.1]::
+
+   Deciding when to start using a key is a performance issue. Deciding
+   when to remove an MKT is a security issue. Invalid MKTs are expected
+   to be removed. TCP-AO provides no mechanism to coordinate their removal,
+   as we consider this a key management operation.
+
+also [6.1]::
+
+   The only way to avoid reuse of previously used MKTs is to remove the MKT
+   when it is no longer considered permitted.
+
+Linux TCP-AO will try its best to prevent you from removing a key that's
+being used, considering it a key management failure. But since keeping
+an outdated key may become a security issue and as a peer may
+unintentionally prevent the removal of an old key by always setting
+it as RNextKeyID - a forced key removal mechanism is provided, where
+userspace has to supply KeyID to use instead of the one that's being removed
+and the kernel will atomically delete the old key, even if the peer is
+still requesting it. There are no guarantees for force-delete as the peer
+may yet not have the new key - the TCP connection may just break.
+Alternatively, one may choose to shut down the socket.
+
+Q: What happens when a packet is received on a new connection with no known
+MKT's RecvID?
+
+A: RFC 5925 specifies that by default it is accepted with a warning logged, but
+the behaviour can be configured by the user [7.5.1.a]::
+
+   If the segment is a SYN, then this is the first segment of a new
+   connection. Find the matching MKT for this segment, using the segment's
+   socket pair and its TCP-AO KeyID, matched against the MKT's TCP connection
+   identifier and the MKT's RecvID.
+
+      i. If there is no matching MKT, remove TCP-AO from the segment.
+         Proceed with further TCP handling of the segment.
+         NOTE: this presumes that connections that do not match any MKT
+         should be silently accepted, as noted in Section 7.3.
+
+[7.3]::
+
+   >> A TCP-AO implementation MUST allow for configuration of the behavior
+   of segments with TCP-AO but that do not match an MKT. The initial default
+   of this configuration SHOULD be to silently accept such connections.
+   If this is not the desired case, an MKT can be included to match such
+   connections, or the connection can indicate that TCP-AO is required.
+   Alternately, the configuration can be changed to discard segments with
+   the AO option not matching an MKT.
+
+[10.2.b]::
+
+   Connections not matching any MKT do not require TCP-AO. Further, incoming
+   segments with TCP-AO are not discarded solely because they include
+   the option, provided they do not match any MKT.
+
+Note that Linux TCP-AO implementation differs in this aspect. Currently, TCP-AO
+segments with unknown key signatures are discarded with warnings logged.
+
+Q: Does the RFC imply centralized kernel key management in any way?
+(i.e. that a key on all connections MUST be rotated at the same time?)
+
+A: Not specified. MKTs can be managed in userspace, the only relevant part to
+key changes is [7.3]::
+
+   >> All TCP segments MUST be checked against the set of MKTs for matching
+   TCP connection identifiers.
+
+Q: What happens when RNextKeyID requested by a peer is unknown? Should
+the connection be reset?
+
+A: It should not, no action needs to be performed [7.5.2.e]::
+
+   ii. If they differ, determine whether the RNextKeyID MKT is ready.
+
+       1. If the MKT corresponding to the segment’s socket pair and RNextKeyID
+       is not available, no action is required (RNextKeyID of a received
+       segment needs to match the MKT’s SendID).
+
+Q: How is current_key set, and when does it change? Is it a user-triggered
+change, or is it triggered by a request from the remote peer? Is it set by the
+user explicitly, or by a matching rule?
+
+A: current_key is set by RNextKeyID [6.1]::
+
+   Rnext_key is changed only by manual user intervention or MKT management
+   protocol operation. It is not manipulated by TCP-AO. Current_key is updated
+   by TCP-AO when processing received TCP segments as discussed in the segment
+   processing description in Section 7.5. Note that the algorithm allows
+   the current_key to change to a new MKT, then change back to a previously
+   used MKT (known as "backing up"). This can occur during an MKT change when
+   segments are received out of order, and is considered a feature of TCP-AO,
+   because reordering does not result in drops.
+
+[7.5.2.e.ii]::
+
+   2. If the matching MKT corresponding to the segment’s socket pair and
+   RNextKeyID is available:
+
+      a. Set current_key to the RNextKeyID MKT.
+
+Q: If both peers have multiple MKTs matching the connection's socket pair
+(with different KeyIDs), how should the sender/receiver pick KeyID to use?
+
+A: Some mechanism should pick the "desired" MKT [3.3]::
+
+   Multiple MKTs may match a single outgoing segment, e.g., when MKTs
+   are being changed. Those MKTs cannot have conflicting IDs (as noted
+   elsewhere), and some mechanism must determine which MKT to use for each
+   given outgoing segment.
+
+   >> An outgoing TCP segment MUST match at most one desired MKT, indicated
+   by the segment’s socket pair. The segment MAY match multiple MKTs, provided
+   that exactly one MKT is indicated as desired. Other information in
+   the segment MAY be used to determine the desired MKT when multiple MKTs
+   match; such information MUST NOT include values in any TCP option fields.
+
+Q: Can TCP-MD5 connection migrate to TCP-AO (and vice-versa):
+
+A: No [1]::
+
+   TCP MD5-protected connections cannot be migrated to TCP-AO because TCP MD5
+   does not support any changes to a connection’s security algorithm
+   once established.
+
+Q: If all MKTs are removed on a connection, can it become a non-TCP-AO signed
+connection?
+
+A: [7.5.2] doesn't have the same choice as SYN packet handling in [7.5.1.i]
+that would allow accepting segments without a sign (which would be insecure).
+While switching to non-TCP-AO connection is not prohibited directly, it seems
+what the RFC means. Also, there's a requirement for TCP-AO connections to
+always have one current_key [3.3]::
+
+   TCP-AO requires that every protected TCP segment match exactly one MKT.
+
+[3.3]::
+
+   >> An incoming TCP segment including TCP-AO MUST match exactly one MKT,
+   indicated solely by the segment’s socket pair and its TCP-AO KeyID.
+
+[4.4]::
+
+   One or more MKTs. These are the MKTs that match this connection’s
+   socket pair.
+
+Q: Can a non-TCP-AO connection become a TCP-AO-enabled one?
+
+A: No: for an already established non-TCP-AO connection it would be impossible
+to switch to using TCP-AO, as the traffic key generation requires the initial
+sequence numbers. Paraphrasing, starting using TCP-AO would require
+re-establishing the TCP connection.
+
+2. In-kernel MKTs database vs database in userspace
+===================================================
+
+Linux TCP-AO support is implemented using ``setsockopt()s``, in a similar way
+to TCP-MD5. It means that a userspace application that wants to use TCP-AO
+should perform ``setsockopt()`` on a TCP socket when it wants to add,
+remove or rotate MKTs. This approach moves the key management responsibility
+to userspace as well as decisions on corner cases, i.e. what to do if
+the peer doesn't respect RNextKeyID; moving more code to userspace, especially
+responsible for the policy decisions. Besides, it's flexible and scales well
+(with less locking needed than in the case of an in-kernel database). One also
+should keep in mind that mainly intended users are BGP processes, not any
+random applications, which means that compared to IPsec tunnels,
+no transparency is really needed and modern BGP daemons already have
+``setsockopt()s`` for TCP-MD5 support.
+
+.. table:: Considered pros and cons of the approaches
+
+ +----------------------+------------------------+-----------------------+
+ |                      |    ``setsockopt()``    |      in-kernel DB     |
+ +======================+========================+=======================+
+ | Extendability        | ``setsockopt()``       | Netlink messages are  |
+ |                      | commands should be     | simple and extendable |
+ |                      | extendable syscalls    |                       |
+ +----------------------+------------------------+-----------------------+
+ | Required userspace   | BGP or any application | could be transparent  |
+ | changes              | that wants TCP-AO needs| as tunnels, providing |
+ |                      | to perform             | something like        |
+ |                      | ``setsockopt()s``      | ``ip tcpao add key``  |
+ |                      | and do key management  | (delete/show/rotate)  |
+ +----------------------+------------------------+-----------------------+
+ |MKTs removal or adding| harder for userspace   | harder for kernel     |
+ +----------------------+------------------------+-----------------------+
+ | Dump-ability         | ``getsockopt()``       | Netlink .dump()       |
+ |                      |                        | callback              |
+ +----------------------+------------------------+-----------------------+
+ | Limits on kernel     |                      equal                     |
+ | resources/memory     |                                                |
+ +----------------------+------------------------+-----------------------+
+ | Scalability          | contention on          | contention on         |
+ |                      | ``TCP_LISTEN`` sockets | the whole database    |
+ +----------------------+------------------------+-----------------------+
+ | Monitoring & warnings| ``TCP_DIAG``           | same Netlink socket   |
+ +----------------------+------------------------+-----------------------+
+ | Matching of MKTs     | half-problem: only     | hard                  |
+ |                      | listen sockets         |                       |
+ +----------------------+------------------------+-----------------------+
+
+
+3. uAPI
+=======
+
+Linux provides a set of ``setsockopt()s`` and ``getsockopt()s`` that let
+userspace manage TCP-AO on a per-socket basis. In order to add/delete MKTs
+``TCP_AO_ADD_KEY`` and ``TCP_AO_DEL_KEY`` TCP socket options must be used.
+It is not allowed to add a key on an established non-TCP-AO connection
+as well as to remove the last key from TCP-AO connection.
+
+``setsockopt(TCP_AO_DEL_KEY)`` command may specify ``tcp_ao_del::current_key``
++ ``tcp_ao_del::set_current`` and/or ``tcp_ao_del::rnext``
++ ``tcp_ao_del::set_rnext`` which makes such delete "forced": it
+provides userspace a way to delete a key that's being used and atomically set
+another one instead. This is not intended for normal use and should be used
+only when the peer ignores RNextKeyID and keeps requesting/using an old key.
+It provides a way to force-delete a key that's not trusted but may break
+the TCP-AO connection.
+
+The usual/normal key-rotation can be performed with ``setsockopt(TCP_AO_INFO)``.
+It also provides a uAPI to change per-socket TCP-AO settings, such as
+ignoring ICMPs, as well as clear per-socket TCP-AO packet counters.
+The corresponding ``getsockopt(TCP_AO_INFO)`` can be used to get those
+per-socket TCP-AO settings.
+
+Another useful command is ``getsockopt(TCP_AO_GET_KEYS)``. One can use it
+to list all MKTs on a TCP socket or use a filter to get keys for a specific
+peer and/or sndid/rcvid, VRF L3 interface or get current_key/rnext_key.
+
+To repair TCP-AO connections ``setsockopt(TCP_AO_REPAIR)`` is available,
+provided that the user previously has checkpointed/dumped the socket with
+``getsockopt(TCP_AO_REPAIR)``.
+
+A tip here for scaled TCP_LISTEN sockets, that may have some thousands TCP-AO
+keys, is: use filters in ``getsockopt(TCP_AO_GET_KEYS)`` and asynchronous
+delete with ``setsockopt(TCP_AO_DEL_KEY)``.
+
+Linux TCP-AO also provides a bunch of segment counters that can be helpful
+with troubleshooting/debugging issues. Every MKT has good/bad counters
+that reflect how many packets passed/failed verification.
+Each TCP-AO socket has the following counters:
+- for good segments (properly signed)
+- for bad segments (failed TCP-AO verification)
+- for segments with unknown keys
+- for segments where an AO signature was expected, but wasn't found
+- for the number of ignored ICMPs
+
+TCP-AO per-socket counters are also duplicated with per-netns counters,
+exposed with SNMP. Those are ``TCPAOGood``, ``TCPAOBad``, ``TCPAOKeyNotFound``,
+``TCPAORequired`` and ``TCPAODroppedIcmps``.
+
+For monitoring purposes, there are following TCP-AO trace events:
+``tcp_hash_bad_header``, ``tcp_hash_ao_required``, ``tcp_ao_handshake_failure``,
+``tcp_ao_wrong_maclen``, ``tcp_ao_wrong_maclen``, ``tcp_ao_key_not_found``,
+``tcp_ao_rnext_request``, ``tcp_ao_synack_no_key``, ``tcp_ao_snd_sne_update``,
+``tcp_ao_rcv_sne_update``. It's possible to separately enable any of them and
+one can filter them by net-namespace, 4-tuple, family, L3 index, and TCP header
+flags. If a segment has a TCP-AO header, the filters may also include
+keyid, rnext, and maclen. SNE updates include the rolled-over numbers.
+
+RFC 5925 very permissively specifies how TCP port matching can be done for
+MKTs::
+
+   TCP connection identifier. A TCP socket pair, i.e., a local IP
+   address, a remote IP address, a TCP local port, and a TCP remote port.
+   Values can be partially specified using ranges (e.g., 2-30), masks
+   (e.g., 0xF0), wildcards (e.g., "*"), or any other suitable indication.
+
+Currently Linux TCP-AO implementation doesn't provide any TCP port matching.
+Probably, port ranges are the most flexible for uAPI, but so far
+not implemented.
+
+4. ``setsockopt()`` vs ``accept()`` race
+========================================
+
+In contrast with an established TCP-MD5 connection which has just one key,
+TCP-AO connections may have many keys, which means that accepted connections
+on a listen socket may have any amount of keys as well. As copying all those
+keys on a first properly signed SYN would make the request socket bigger, that
+would be undesirable. Currently, the implementation doesn't copy keys
+to request sockets, but rather look them up on the "parent" listener socket.
+
+The result is that when userspace removes TCP-AO keys, that may break
+not-yet-established connections on request sockets as well as not removing
+keys from sockets that were already established, but not yet ``accept()``'ed,
+hanging in the accept queue.
+
+The reverse is valid as well: if userspace adds a new key for a peer on
+a listener socket, the established sockets in the accept queue won't
+have the new keys.
+
+At this moment, the resolution for the two races:
+``setsockopt(TCP_AO_ADD_KEY)`` vs ``accept()``
+and ``setsockopt(TCP_AO_DEL_KEY)`` vs ``accept()`` is delegated to userspace.
+This means that it's expected that userspace would check the MKTs on the socket
+that was returned by ``accept()`` to verify that any key rotation that
+happened on the listen socket is reflected on the newly established connection.
+
+This is a similar "do-nothing" approach to TCP-MD5 from the kernel side and
+may be changed later by introducing new flags to ``tcp_ao_add``
+and ``tcp_ao_del``.
+
+Note that this race is rare for it needs TCP-AO key rotation to happen
+during the 3-way handshake for the new TCP connection.
+
+5. Interaction with TCP-MD5
+===========================
+
+A TCP connection can not migrate between TCP-AO and TCP-MD5 options. The
+established sockets that have either AO or MD5 keys are restricted for
+adding keys of the other option.
+
+For listening sockets the picture is different: BGP server may want to receive
+both TCP-AO and (deprecated) TCP-MD5 clients. As a result, both types of keys
+may be added to TCP_CLOSED or TCP_LISTEN sockets. It's not allowed to add
+different types of keys for the same peer.
+
+6. SNE Linux implementation
+===========================
+
+RFC 5925 [6.2] describes the algorithm of how to extend TCP sequence numbers
+with SNE.  In short: TCP has to track the previous sequence numbers and set
+sne_flag when the current SEQ number rolls over. The flag is cleared when
+both current and previous SEQ numbers cross 0x7fff, which is 32Kb.
+
+In times when sne_flag is set, the algorithm compares SEQ for each packet with
+0x7fff and if it's higher than 32Kb, it assumes that the packet should be
+verified with SNE before the increment. As a result, there's
+this [0; 32Kb] window, when packets with (SNE - 1) can be accepted.
+
+Linux implementation simplifies this a bit: as the network stack already tracks
+the first SEQ byte that ACK is wanted for (snd_una) and the next SEQ byte that
+is wanted (rcv_nxt) - that's enough information for a rough estimation
+on where in the 4GB SEQ number space both sender and receiver are.
+When they roll over to zero, the corresponding SNE gets incremented.
+
+tcp_ao_compute_sne() is called for each TCP-AO segment. It compares SEQ numbers
+from the segment with snd_una or rcv_nxt and fits the result into a 2GB window around them,
+detecting SEQ numbers rolling over. That simplifies the code a lot and only
+requires SNE numbers to be stored on every TCP-AO socket.
+
+The 2GB window at first glance seems much more permissive compared to
+RFC 5926. But that is only used to pick the correct SNE before/after
+a rollover. It allows more TCP segment replays, but yet all regular
+TCP checks in tcp_sequence() are applied on the verified segment.
+So, it trades a bit more permissive acceptance of replayed/retransmitted
+segments for the simplicity of the algorithm and what seems better behaviour
+for large TCP windows.
+
+7. Links
+========
+
+RFC 5925 The TCP Authentication Option
+   https://www.rfc-editor.org/rfc/pdfrfc/rfc5925.txt.pdf
+
+RFC 5926 Cryptographic Algorithms for the TCP Authentication Option (TCP-AO)
+   https://www.rfc-editor.org/rfc/pdfrfc/rfc5926.txt.pdf
+
+Draft "SHA-2 Algorithm for the TCP Authentication Option (TCP-AO)"
+   https://datatracker.ietf.org/doc/html/draft-nayak-tcp-sha2-03
+
+RFC 2385 Protection of BGP Sessions via the TCP MD5 Signature Option
+   https://www.rfc-editor.org/rfc/pdfrfc/rfc2385.txt.pdf
+
+:Author: Dmitry Safonov <dima@arista.com>
diff --git a/Documentation/networking/timestamping.rst b/Documentation/networking/timestamping.rst
index 1adead6a4527..7aabead90648 100644
--- a/Documentation/networking/timestamping.rst
+++ b/Documentation/networking/timestamping.rst
@@ -55,7 +55,8 @@ struct __kernel_sock_timeval format.
 SO_TIMESTAMP_OLD returns incorrect timestamps after the year 2038
 on 32 bit machines.
 
-1.2 SO_TIMESTAMPNS (also SO_TIMESTAMPNS_OLD and SO_TIMESTAMPNS_NEW):
+1.2 SO_TIMESTAMPNS (also SO_TIMESTAMPNS_OLD and SO_TIMESTAMPNS_NEW)
+-------------------------------------------------------------------
 
 This option is identical to SO_TIMESTAMP except for the returned data type.
 Its struct timespec allows for higher resolution (ns) timestamps than the
@@ -139,6 +140,14 @@ SOF_TIMESTAMPING_TX_ACK:
   cumulative acknowledgment. The mechanism ignores SACK and FACK.
   This flag can be enabled via both socket options and control messages.
 
+SOF_TIMESTAMPING_TX_COMPLETION:
+  Request tx timestamps on packet tx completion.  The completion
+  timestamp is generated by the kernel when it receives packet a
+  completion report from the hardware. Hardware may report multiple
+  packets at once, and completion timestamps reflect the timing of the
+  report and not actual tx time. This flag can be enabled via both
+  socket options and control messages.
+
 
 1.3.2 Timestamp Reporting
 ^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -157,7 +166,8 @@ SOF_TIMESTAMPING_SYS_HARDWARE:
 
 SOF_TIMESTAMPING_RAW_HARDWARE:
   Report hardware timestamps as generated by
-  SOF_TIMESTAMPING_TX_HARDWARE when available.
+  SOF_TIMESTAMPING_TX_HARDWARE or SOF_TIMESTAMPING_RX_HARDWARE
+  when available.
 
 
 1.3.3 Timestamp Options
@@ -178,7 +188,8 @@ SOF_TIMESTAMPING_OPT_ID:
   identifier and returns that along with the timestamp. The identifier
   is derived from a per-socket u32 counter (that wraps). For datagram
   sockets, the counter increments with each sent packet. For stream
-  sockets, it increments with every byte.
+  sockets, it increments with every byte. For stream sockets, also set
+  SOF_TIMESTAMPING_OPT_ID_TCP, see the section below.
 
   The counter starts at zero. It is initialized the first time that
   the socket option is enabled. It is reset each time the option is
@@ -191,6 +202,49 @@ SOF_TIMESTAMPING_OPT_ID:
   among all possibly concurrently outstanding timestamp requests for
   that socket.
 
+  The process can optionally override the default generated ID, by
+  passing a specific ID with control message SCM_TS_OPT_ID (not
+  supported for TCP sockets)::
+
+    struct msghdr *msg;
+    ...
+    cmsg			 = CMSG_FIRSTHDR(msg);
+    cmsg->cmsg_level		 = SOL_SOCKET;
+    cmsg->cmsg_type		 = SCM_TS_OPT_ID;
+    cmsg->cmsg_len		 = CMSG_LEN(sizeof(__u32));
+    *((__u32 *) CMSG_DATA(cmsg)) = opt_id;
+    err = sendmsg(fd, msg, 0);
+
+
+SOF_TIMESTAMPING_OPT_ID_TCP:
+  Pass this modifier along with SOF_TIMESTAMPING_OPT_ID for new TCP
+  timestamping applications. SOF_TIMESTAMPING_OPT_ID defines how the
+  counter increments for stream sockets, but its starting point is
+  not entirely trivial. This option fixes that.
+
+  For stream sockets, if SOF_TIMESTAMPING_OPT_ID is set, this should
+  always be set too. On datagram sockets the option has no effect.
+
+  A reasonable expectation is that the counter is reset to zero with
+  the system call, so that a subsequent write() of N bytes generates
+  a timestamp with counter N-1. SOF_TIMESTAMPING_OPT_ID_TCP
+  implements this behavior under all conditions.
+
+  SOF_TIMESTAMPING_OPT_ID without modifier often reports the same,
+  especially when the socket option is set when no data is in
+  transmission. If data is being transmitted, it may be off by the
+  length of the output queue (SIOCOUTQ).
+
+  The difference is due to being based on snd_una versus write_seq.
+  snd_una is the offset in the stream acknowledged by the peer. This
+  depends on factors outside of process control, such as network RTT.
+  write_seq is the last byte written by the process. This offset is
+  not affected by external inputs.
+
+  The difference is subtle and unlikely to be noticed when configured
+  at initial socket creation, when no data is queued or sent. But
+  SOF_TIMESTAMPING_OPT_ID_TCP behavior is more robust regardless of
+  when the socket option is set.
 
 SOF_TIMESTAMPING_OPT_CMSG:
   Support recv() cmsg for all timestamped packets. Control messages
@@ -235,6 +289,23 @@ SOF_TIMESTAMPING_OPT_TX_SWHW:
   two separate messages will be looped to the socket's error queue,
   each containing just one timestamp.
 
+SOF_TIMESTAMPING_OPT_RX_FILTER:
+  Filter out spurious receive timestamps: report a receive timestamp
+  only if the matching timestamp generation flag is enabled.
+
+  Receive timestamps are generated early in the ingress path, before a
+  packet's destination socket is known. If any socket enables receive
+  timestamps, packets for all socket will receive timestamped packets.
+  Including those that request timestamp reporting with
+  SOF_TIMESTAMPING_SOFTWARE and/or SOF_TIMESTAMPING_RAW_HARDWARE, but
+  do not request receive timestamp generation. This can happen when
+  requesting transmit timestamps only.
+
+  Receiving spurious timestamps is generally benign. A process can
+  ignore the unexpected non-zero value. But it makes behavior subtly
+  dependent on other sockets. This flag isolates the socket for more
+  deterministic behavior.
+
 New applications are encouraged to pass SOF_TIMESTAMPING_OPT_ID to
 disambiguate timestamps and SOF_TIMESTAMPING_OPT_TSONLY to operate
 regardless of the setting of sysctl net.core.tstamp_allow_data.
@@ -326,7 +397,8 @@ enabling SOF_TIMESTAMPING_OPT_ID and comparing the byte offset at
 send time with the value returned for each timestamp. It can prevent
 the situation by always flushing the TCP stack in between requests,
 for instance by enabling TCP_NODELAY and disabling TCP_CORK and
-autocork.
+autocork. After linux-4.7, a better way to prevent coalescing is
+to use MSG_EOR flag at sendmsg() time.
 
 These precautions ensure that the timestamp is generated only when all
 bytes have passed a timestamp point, assuming that the network stack
@@ -461,8 +533,8 @@ implicitly defined. ts[0] holds a software timestamp if set, ts[1]
 is again deprecated and ts[2] holds a hardware timestamp if set.
 
 
-3. Hardware Timestamping configuration: SIOCSHWTSTAMP and SIOCGHWTSTAMP
-=======================================================================
+3. Hardware Timestamping configuration: ETHTOOL_MSG_TSCONFIG_SET/GET
+====================================================================
 
 Hardware time stamping must also be initialized for each device driver
 that is expected to do hardware time stamping. The parameter is defined in
@@ -475,18 +547,20 @@ include/uapi/linux/net_tstamp.h as::
 	};
 
 Desired behavior is passed into the kernel and to a specific device by
-calling ioctl(SIOCSHWTSTAMP) with a pointer to a struct ifreq whose
-ifr_data points to a struct hwtstamp_config. The tx_type and
-rx_filter are hints to the driver what it is expected to do. If
-the requested fine-grained filtering for incoming packets is not
-supported, the driver may time stamp more than just the requested types
-of packets.
+calling the tsconfig netlink socket ``ETHTOOL_MSG_TSCONFIG_SET``.
+The ``ETHTOOL_A_TSCONFIG_TX_TYPES``, ``ETHTOOL_A_TSCONFIG_RX_FILTERS`` and
+``ETHTOOL_A_TSCONFIG_HWTSTAMP_FLAGS`` netlink attributes are then used to set
+the struct hwtstamp_config accordingly.
+
+The ``ETHTOOL_A_TSCONFIG_HWTSTAMP_PROVIDER`` netlink nested attribute is used
+to select the source of the hardware time stamping. It is composed of an index
+for the device source and a qualifier for the type of time stamping.
 
 Drivers are free to use a more permissive configuration than the requested
 configuration. It is expected that drivers should only implement directly the
 most generic mode that can be supported. For example if the hardware can
-support HWTSTAMP_FILTER_V2_EVENT, then it should generally always upscale
-HWTSTAMP_FILTER_V2_L2_SYNC_MESSAGE, and so forth, as HWTSTAMP_FILTER_V2_EVENT
+support HWTSTAMP_FILTER_PTP_V2_EVENT, then it should generally always upscale
+HWTSTAMP_FILTER_PTP_V2_L2_SYNC, and so forth, as HWTSTAMP_FILTER_PTP_V2_EVENT
 is more generic (and more useful to applications).
 
 A driver which supports hardware time stamping shall update the struct
@@ -499,9 +573,16 @@ Only a processes with admin rights may change the configuration. User
 space is responsible to ensure that multiple processes don't interfere
 with each other and that the settings are reset.
 
-Any process can read the actual configuration by passing this
-structure to ioctl(SIOCGHWTSTAMP) in the same way.  However, this has
-not been implemented in all drivers.
+Any process can read the actual configuration by requesting tsconfig netlink
+socket ``ETHTOOL_MSG_TSCONFIG_GET``.
+
+The legacy configuration is the use of the ioctl(SIOCSHWTSTAMP) with a pointer
+to a struct ifreq whose ifr_data points to a struct hwtstamp_config.
+The tx_type and rx_filter are hints to the driver what it is expected to do.
+If the requested fine-grained filtering for incoming packets is not
+supported, the driver may time stamp more than just the requested types
+of packets. ioctl(SIOCGHWTSTAMP) is used in the same way as the
+ioctl(SIOCSHWTSTAMP). However, this has not been implemented in all drivers.
 
 ::
 
@@ -546,9 +627,10 @@ not been implemented in all drivers.
 --------------------------------------------------------
 
 A driver which supports hardware time stamping must support the
-SIOCSHWTSTAMP ioctl and update the supplied struct hwtstamp_config with
-the actual values as described in the section on SIOCSHWTSTAMP.  It
-should also support SIOCGHWTSTAMP.
+ndo_hwtstamp_set NDO or the legacy SIOCSHWTSTAMP ioctl and update the
+supplied struct hwtstamp_config with the actual values as described in
+the section on SIOCSHWTSTAMP. It should also support ndo_hwtstamp_get or
+the legacy SIOCGHWTSTAMP.
 
 Time stamps for received packets must be stored in the skb. To get a pointer
 to the shared time stamp structure of the skb call skb_hwtstamps(). Then
@@ -581,11 +663,189 @@ Time stamps for outgoing packets are to be generated as follows:
   and hardware timestamping is not possible (SKBTX_IN_PROGRESS not set).
 - As soon as the driver has sent the packet and/or obtained a
   hardware time stamp for it, it passes the time stamp back by
-  calling skb_hwtstamp_tx() with the original skb, the raw
-  hardware time stamp. skb_hwtstamp_tx() clones the original skb and
+  calling skb_tstamp_tx() with the original skb, the raw
+  hardware time stamp. skb_tstamp_tx() clones the original skb and
   adds the timestamps, therefore the original skb has to be freed now.
   If obtaining the hardware time stamp somehow fails, then the driver
   should not fall back to software time stamping. The rationale is that
   this would occur at a later time in the processing pipeline than other
   software time stamping and therefore could lead to unexpected deltas
   between time stamps.
+
+3.2 Special considerations for stacked PTP Hardware Clocks
+----------------------------------------------------------
+
+There are situations when there may be more than one PHC (PTP Hardware Clock)
+in the data path of a packet. The kernel has no explicit mechanism to allow the
+user to select which PHC to use for timestamping Ethernet frames. Instead, the
+assumption is that the outermost PHC is always the most preferable, and that
+kernel drivers collaborate towards achieving that goal. Currently there are 3
+cases of stacked PHCs, detailed below:
+
+3.2.1 DSA (Distributed Switch Architecture) switches
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+These are Ethernet switches which have one of their ports connected to an
+(otherwise completely unaware) host Ethernet interface, and perform the role of
+a port multiplier with optional forwarding acceleration features.  Each DSA
+switch port is visible to the user as a standalone (virtual) network interface,
+and its network I/O is performed, under the hood, indirectly through the host
+interface (redirecting to the host port on TX, and intercepting frames on RX).
+
+When a DSA switch is attached to a host port, PTP synchronization has to
+suffer, since the switch's variable queuing delay introduces a path delay
+jitter between the host port and its PTP partner. For this reason, some DSA
+switches include a timestamping clock of their own, and have the ability to
+perform network timestamping on their own MAC, such that path delays only
+measure wire and PHY propagation latencies. Timestamping DSA switches are
+supported in Linux and expose the same ABI as any other network interface (save
+for the fact that the DSA interfaces are in fact virtual in terms of network
+I/O, they do have their own PHC).  It is typical, but not mandatory, for all
+interfaces of a DSA switch to share the same PHC.
+
+By design, PTP timestamping with a DSA switch does not need any special
+handling in the driver for the host port it is attached to.  However, when the
+host port also supports PTP timestamping, DSA will take care of intercepting
+the ``.ndo_eth_ioctl`` calls towards the host port, and block attempts to enable
+hardware timestamping on it. This is because the SO_TIMESTAMPING API does not
+allow the delivery of multiple hardware timestamps for the same packet, so
+anybody else except for the DSA switch port must be prevented from doing so.
+
+In the generic layer, DSA provides the following infrastructure for PTP
+timestamping:
+
+- ``.port_txtstamp()``: a hook called prior to the transmission of
+  packets with a hardware TX timestamping request from user space.
+  This is required for two-step timestamping, since the hardware
+  timestamp becomes available after the actual MAC transmission, so the
+  driver must be prepared to correlate the timestamp with the original
+  packet so that it can re-enqueue the packet back into the socket's
+  error queue. To save the packet for when the timestamp becomes
+  available, the driver can call ``skb_clone_sk`` , save the clone pointer
+  in skb->cb and enqueue a tx skb queue. Typically, a switch will have a
+  PTP TX timestamp register (or sometimes a FIFO) where the timestamp
+  becomes available. In case of a FIFO, the hardware might store
+  key-value pairs of PTP sequence ID/message type/domain number and the
+  actual timestamp. To perform the correlation correctly between the
+  packets in a queue waiting for timestamping and the actual timestamps,
+  drivers can use a BPF classifier (``ptp_classify_raw``) to identify
+  the PTP transport type, and ``ptp_parse_header`` to interpret the PTP
+  header fields. There may be an IRQ that is raised upon this
+  timestamp's availability, or the driver might have to poll after
+  invoking ``dev_queue_xmit()`` towards the host interface.
+  One-step TX timestamping do not require packet cloning, since there is
+  no follow-up message required by the PTP protocol (because the
+  TX timestamp is embedded into the packet by the MAC), and therefore
+  user space does not expect the packet annotated with the TX timestamp
+  to be re-enqueued into its socket's error queue.
+
+- ``.port_rxtstamp()``: On RX, the BPF classifier is run by DSA to
+  identify PTP event messages (any other packets, including PTP general
+  messages, are not timestamped). The original (and only) timestampable
+  skb is provided to the driver, for it to annotate it with a timestamp,
+  if that is immediately available, or defer to later. On reception,
+  timestamps might either be available in-band (through metadata in the
+  DSA header, or attached in other ways to the packet), or out-of-band
+  (through another RX timestamping FIFO). Deferral on RX is typically
+  necessary when retrieving the timestamp needs a sleepable context. In
+  that case, it is the responsibility of the DSA driver to call
+  ``netif_rx()`` on the freshly timestamped skb.
+
+3.2.2 Ethernet PHYs
+^^^^^^^^^^^^^^^^^^^
+
+These are devices that typically fulfill a Layer 1 role in the network stack,
+hence they do not have a representation in terms of a network interface as DSA
+switches do. However, PHYs may be able to detect and timestamp PTP packets, for
+performance reasons: timestamps taken as close as possible to the wire have the
+potential to yield a more stable and precise synchronization.
+
+A PHY driver that supports PTP timestamping must create a ``struct
+mii_timestamper`` and add a pointer to it in ``phydev->mii_ts``. The presence
+of this pointer will be checked by the networking stack.
+
+Since PHYs do not have network interface representations, the timestamping and
+ethtool ioctl operations for them need to be mediated by their respective MAC
+driver.  Therefore, as opposed to DSA switches, modifications need to be done
+to each individual MAC driver for PHY timestamping support. This entails:
+
+- Checking, in ``.ndo_eth_ioctl``, whether ``phy_has_hwtstamp(netdev->phydev)``
+  is true or not. If it is, then the MAC driver should not process this request
+  but instead pass it on to the PHY using ``phy_mii_ioctl()``.
+
+- On RX, special intervention may or may not be needed, depending on the
+  function used to deliver skb's up the network stack. In the case of plain
+  ``netif_rx()`` and similar, MAC drivers must check whether
+  ``skb_defer_rx_timestamp(skb)`` is necessary or not - and if it is, don't
+  call ``netif_rx()`` at all.  If ``CONFIG_NETWORK_PHY_TIMESTAMPING`` is
+  enabled, and ``skb->dev->phydev->mii_ts`` exists, its ``.rxtstamp()`` hook
+  will be called now, to determine, using logic very similar to DSA, whether
+  deferral for RX timestamping is necessary.  Again like DSA, it becomes the
+  responsibility of the PHY driver to send the packet up the stack when the
+  timestamp is available.
+
+  For other skb receive functions, such as ``napi_gro_receive`` and
+  ``netif_receive_skb``, the stack automatically checks whether
+  ``skb_defer_rx_timestamp()`` is necessary, so this check is not needed inside
+  the driver.
+
+- On TX, again, special intervention might or might not be needed.  The
+  function that calls the ``mii_ts->txtstamp()`` hook is named
+  ``skb_clone_tx_timestamp()``. This function can either be called directly
+  (case in which explicit MAC driver support is indeed needed), but the
+  function also piggybacks from the ``skb_tx_timestamp()`` call, which many MAC
+  drivers already perform for software timestamping purposes. Therefore, if a
+  MAC supports software timestamping, it does not need to do anything further
+  at this stage.
+
+3.2.3 MII bus snooping devices
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+These perform the same role as timestamping Ethernet PHYs, save for the fact
+that they are discrete devices and can therefore be used in conjunction with
+any PHY even if it doesn't support timestamping. In Linux, they are
+discoverable and attachable to a ``struct phy_device`` through Device Tree, and
+for the rest, they use the same mii_ts infrastructure as those. See
+Documentation/devicetree/bindings/ptp/timestamper.txt for more details.
+
+3.2.4 Other caveats for MAC drivers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The use of stacked PHCs may uncover MAC driver bugs which were impossible to
+trigger without them. One example has to do with this line of code, already
+presented earlier::
+
+      skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
+
+Any TX timestamping logic, be it a plain MAC driver, a DSA switch driver, a PHY
+driver or a MII bus snooping device driver, should set this flag.
+But a MAC driver that is unaware of PHC stacking might get tripped up by
+somebody other than itself setting this flag, and deliver a duplicate
+timestamp.
+For example, a typical driver design for TX timestamping might be to split the
+transmission part into 2 portions:
+
+1. "TX": checks whether PTP timestamping has been previously enabled through
+   the ``.ndo_eth_ioctl`` ("``priv->hwtstamp_tx_enabled == true``") and the
+   current skb requires a TX timestamp ("``skb_shinfo(skb)->tx_flags &
+   SKBTX_HW_TSTAMP``"). If this is true, it sets the
+   "``skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS``" flag. Note: as
+   described above, in the case of a stacked PHC system, this condition should
+   never trigger, as this MAC is certainly not the outermost PHC. But this is
+   not where the typical issue is.  Transmission proceeds with this packet.
+
+2. "TX confirmation": Transmission has finished. The driver checks whether it
+   is necessary to collect any TX timestamp for it. Here is where the typical
+   issues are: the MAC driver takes a shortcut and only checks whether
+   "``skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS``" was set. With a stacked
+   PHC system, this is incorrect because this MAC driver is not the only entity
+   in the TX data path who could have enabled SKBTX_IN_PROGRESS in the first
+   place.
+
+The correct solution for this problem is for MAC drivers to have a compound
+check in their "TX confirmation" portion, not only for
+"``skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS``", but also for
+"``priv->hwtstamp_tx_enabled == true``". Because the rest of the system ensures
+that PTP timestamping is not enabled for anything other than the outermost PHC,
+this enhanced check will avoid delivering a duplicated TX timestamp to user
+space.
diff --git a/Documentation/networking/tipc.rst b/Documentation/networking/tipc.rst
new file mode 100644
index 000000000000..9b375b9b9981
--- /dev/null
+++ b/Documentation/networking/tipc.rst
@@ -0,0 +1,215 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Linux Kernel TIPC
+=================
+
+Introduction
+============
+
+TIPC (Transparent Inter Process Communication) is a protocol that is specially
+designed for intra-cluster communication. It can be configured to transmit
+messages either on UDP or directly across Ethernet. Message delivery is
+sequence guaranteed, loss free and flow controlled. Latency times are shorter
+than with any other known protocol, while maximal throughput is comparable to
+that of TCP.
+
+TIPC Features
+-------------
+
+- Cluster wide IPC service
+
+  Have you ever wished you had the convenience of Unix Domain Sockets even when
+  transmitting data between cluster nodes? Where you yourself determine the
+  addresses you want to bind to and use? Where you don't have to perform DNS
+  lookups and worry about IP addresses? Where you don't have to start timers
+  to monitor the continuous existence of peer sockets? And yet without the
+  downsides of that socket type, such as the risk of lingering inodes?
+
+  Welcome to the Transparent Inter Process Communication service, TIPC in short,
+  which gives you all of this, and a lot more.
+
+- Service Addressing
+
+  A fundamental concept in TIPC is that of Service Addressing which makes it
+  possible for a programmer to chose his own address, bind it to a server
+  socket and let client programs use only that address for sending messages.
+
+- Service Tracking
+
+  A client wanting to wait for the availability of a server, uses the Service
+  Tracking mechanism to subscribe for binding and unbinding/close events for
+  sockets with the associated service address.
+
+  The service tracking mechanism can also be used for Cluster Topology Tracking,
+  i.e., subscribing for availability/non-availability of cluster nodes.
+
+  Likewise, the service tracking mechanism can be used for Cluster Connectivity
+  Tracking, i.e., subscribing for up/down events for individual links between
+  cluster nodes.
+
+- Transmission Modes
+
+  Using a service address, a client can send datagram messages to a server socket.
+
+  Using the same address type, it can establish a connection towards an accepting
+  server socket.
+
+  It can also use a service address to create and join a Communication Group,
+  which is the TIPC manifestation of a brokerless message bus.
+
+  Multicast with very good performance and scalability is available both in
+  datagram mode and in communication group mode.
+
+- Inter Node Links
+
+  Communication between any two nodes in a cluster is maintained by one or two
+  Inter Node Links, which both guarantee data traffic integrity and monitor
+  the peer node's availability.
+
+- Cluster Scalability
+
+  By applying the Overlapping Ring Monitoring algorithm on the inter node links
+  it is possible to scale TIPC clusters up to 1000 nodes with a maintained
+  neighbor failure discovery time of 1-2 seconds. For smaller clusters this
+  time can be made much shorter.
+
+- Neighbor Discovery
+
+  Neighbor Node Discovery in the cluster is done by Ethernet broadcast or UDP
+  multicast, when any of those services are available. If not, configured peer
+  IP addresses can be used.
+
+- Configuration
+
+  When running TIPC in single node mode no configuration whatsoever is needed.
+  When running in cluster mode TIPC must as a minimum be given a node address
+  (before Linux 4.17) and told which interface to attach to. The "tipc"
+  configuration tool makes is possible to add and maintain many more
+  configuration parameters.
+
+- Performance
+
+  TIPC message transfer latency times are better than in any other known protocol.
+  Maximal byte throughput for inter-node connections is still somewhat lower than
+  for TCP, while they are superior for intra-node and inter-container throughput
+  on the same host.
+
+- Language Support
+
+  The TIPC user API has support for C, Python, Perl, Ruby, D and Go.
+
+More Information
+----------------
+
+- How to set up TIPC:
+
+  http://tipc.io/getting_started.html
+
+- How to program with TIPC:
+
+  http://tipc.io/programming.html
+
+- How to contribute to TIPC:
+
+  http://tipc.io/contacts.html
+
+- More details about TIPC specification:
+
+  http://tipc.io/protocol.html
+
+
+Implementation
+==============
+
+TIPC is implemented as a kernel module in net/tipc/ directory.
+
+TIPC Base Types
+---------------
+
+.. kernel-doc:: net/tipc/subscr.h
+   :internal:
+
+.. kernel-doc:: net/tipc/bearer.h
+   :internal:
+
+.. kernel-doc:: net/tipc/name_table.h
+   :internal:
+
+.. kernel-doc:: net/tipc/name_distr.h
+   :internal:
+
+.. kernel-doc:: net/tipc/bcast.c
+   :internal:
+
+TIPC Bearer Interfaces
+----------------------
+
+.. kernel-doc:: net/tipc/bearer.c
+   :internal:
+
+.. kernel-doc:: net/tipc/udp_media.c
+   :internal:
+
+TIPC Crypto Interfaces
+----------------------
+
+.. kernel-doc:: net/tipc/crypto.c
+   :internal:
+
+TIPC Discoverer Interfaces
+--------------------------
+
+.. kernel-doc:: net/tipc/discover.c
+   :internal:
+
+TIPC Link Interfaces
+--------------------
+
+.. kernel-doc:: net/tipc/link.c
+   :internal:
+
+TIPC msg Interfaces
+-------------------
+
+.. kernel-doc:: net/tipc/msg.c
+   :internal:
+
+TIPC Name Interfaces
+--------------------
+
+.. kernel-doc:: net/tipc/name_table.c
+   :internal:
+
+.. kernel-doc:: net/tipc/name_distr.c
+   :internal:
+
+TIPC Node Management Interfaces
+-------------------------------
+
+.. kernel-doc:: net/tipc/node.c
+   :internal:
+
+TIPC Socket Interfaces
+----------------------
+
+.. kernel-doc:: net/tipc/socket.c
+   :internal:
+
+TIPC Network Topology Interfaces
+--------------------------------
+
+.. kernel-doc:: net/tipc/subscr.c
+   :internal:
+
+TIPC Server Interfaces
+----------------------
+
+.. kernel-doc:: net/tipc/topsrv.c
+   :internal:
+
+TIPC Trace Interfaces
+---------------------
+
+.. kernel-doc:: net/tipc/trace.c
+   :internal:
diff --git a/Documentation/networking/tls-handshake.rst b/Documentation/networking/tls-handshake.rst
new file mode 100644
index 000000000000..6f5ea1646a47
--- /dev/null
+++ b/Documentation/networking/tls-handshake.rst
@@ -0,0 +1,222 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+In-Kernel TLS Handshake
+=======================
+
+Overview
+========
+
+Transport Layer Security (TLS) is a Upper Layer Protocol (ULP) that runs
+over TCP. TLS provides end-to-end data integrity and confidentiality in
+addition to peer authentication.
+
+The kernel's kTLS implementation handles the TLS record subprotocol, but
+does not handle the TLS handshake subprotocol which is used to establish
+a TLS session. Kernel consumers can use the API described here to
+request TLS session establishment.
+
+There are several possible ways to provide a handshake service in the
+kernel. The API described here is designed to hide the details of those
+implementations so that in-kernel TLS consumers do not need to be
+aware of how the handshake gets done.
+
+
+User handshake agent
+====================
+
+As of this writing, there is no TLS handshake implementation in the
+Linux kernel. To provide a handshake service, a handshake agent
+(typically in user space) is started in each network namespace where a
+kernel consumer might require a TLS handshake. Handshake agents listen
+for events sent from the kernel that indicate a handshake request is
+waiting.
+
+An open socket is passed to a handshake agent via a netlink operation,
+which creates a socket descriptor in the agent's file descriptor table.
+If the handshake completes successfully, the handshake agent promotes
+the socket to use the TLS ULP and sets the session information using the
+SOL_TLS socket options. The handshake agent returns the socket to the
+kernel via a second netlink operation.
+
+
+Kernel Handshake API
+====================
+
+A kernel TLS consumer initiates a client-side TLS handshake on an open
+socket by invoking one of the tls_client_hello() functions. First, it
+fills in a structure that contains the parameters of the request:
+
+.. code-block:: c
+
+  struct tls_handshake_args {
+        struct socket   *ta_sock;
+        tls_done_func_t ta_done;
+        void            *ta_data;
+        const char      *ta_peername;
+        unsigned int    ta_timeout_ms;
+        key_serial_t    ta_keyring;
+        key_serial_t    ta_my_cert;
+        key_serial_t    ta_my_privkey;
+        unsigned int    ta_num_peerids;
+        key_serial_t    ta_my_peerids[5];
+  };
+
+The @ta_sock field references an open and connected socket. The consumer
+must hold a reference on the socket to prevent it from being destroyed
+while the handshake is in progress. The consumer must also have
+instantiated a struct file in sock->file.
+
+
+@ta_done contains a callback function that is invoked when the handshake
+has completed. Further explanation of this function is in the "Handshake
+Completion" sesction below.
+
+The consumer can provide a NUL-terminated hostname in the @ta_peername
+field that is sent as part of ClientHello. If no peername is provided,
+the DNS hostname associated with the server's IP address is used instead.
+
+The consumer can fill in the @ta_timeout_ms field to force the servicing
+handshake agent to exit after a number of milliseconds. This enables the
+socket to be fully closed once both the kernel and the handshake agent
+have closed their endpoints.
+
+Authentication material such as x.509 certificates, private certificate
+keys, and pre-shared keys are provided to the handshake agent in keys
+that are instantiated by the consumer before making the handshake
+request. The consumer can provide a private keyring that is linked into
+the handshake agent's process keyring in the @ta_keyring field to prevent
+access of those keys by other subsystems.
+
+To request an x.509-authenticated TLS session, the consumer fills in
+the @ta_my_cert and @ta_my_privkey fields with the serial numbers of
+keys containing an x.509 certificate and the private key for that
+certificate. Then, it invokes this function:
+
+.. code-block:: c
+
+  ret = tls_client_hello_x509(args, gfp_flags);
+
+The function returns zero when the handshake request is under way. A
+zero return guarantees the callback function @ta_done will be invoked
+for this socket. The function returns a negative errno if the handshake
+could not be started. A negative errno guarantees the callback function
+@ta_done will not be invoked on this socket.
+
+
+To initiate a client-side TLS handshake with a pre-shared key, use:
+
+.. code-block:: c
+
+  ret = tls_client_hello_psk(args, gfp_flags);
+
+However, in this case, the consumer fills in the @ta_my_peerids array
+with serial numbers of keys containing the peer identities it wishes
+to offer, and the @ta_num_peerids field with the number of array
+entries it has filled in. The other fields are filled in as above.
+
+
+To initiate an anonymous client-side TLS handshake use:
+
+.. code-block:: c
+
+  ret = tls_client_hello_anon(args, gfp_flags);
+
+The handshake agent presents no peer identity information to the remote
+during this type of handshake. Only server authentication (ie the client
+verifies the server's identity) is performed during the handshake. Thus
+the established session uses encryption only.
+
+
+Consumers that are in-kernel servers use:
+
+.. code-block:: c
+
+  ret = tls_server_hello_x509(args, gfp_flags);
+
+or
+
+.. code-block:: c
+
+  ret = tls_server_hello_psk(args, gfp_flags);
+
+The argument structure is filled in as above.
+
+
+If the consumer needs to cancel the handshake request, say, due to a ^C
+or other exigent event, the consumer can invoke:
+
+.. code-block:: c
+
+  bool tls_handshake_cancel(sock);
+
+This function returns true if the handshake request associated with
+@sock has been canceled. The consumer's handshake completion callback
+will not be invoked. If this function returns false, then the consumer's
+completion callback has already been invoked.
+
+
+Handshake Completion
+====================
+
+When the handshake agent has completed processing, it notifies the
+kernel that the socket may be used by the consumer again. At this point,
+the consumer's handshake completion callback, provided in the @ta_done
+field in the tls_handshake_args structure, is invoked.
+
+The synopsis of this function is:
+
+.. code-block:: c
+
+  typedef void	(*tls_done_func_t)(void *data, int status,
+                                   key_serial_t peerid);
+
+The consumer provides a cookie in the @ta_data field of the
+tls_handshake_args structure that is returned in the @data parameter of
+this callback. The consumer uses the cookie to match the callback to the
+thread waiting for the handshake to complete.
+
+The success status of the handshake is returned via the @status
+parameter:
+
++------------+----------------------------------------------+
+|  status    |  meaning                                     |
++============+==============================================+
+|  0         |  TLS session established successfully        |
++------------+----------------------------------------------+
+|  -EACCESS  |  Remote peer rejected the handshake or       |
+|            |  authentication failed                       |
++------------+----------------------------------------------+
+|  -ENOMEM   |  Temporary resource allocation failure       |
++------------+----------------------------------------------+
+|  -EINVAL   |  Consumer provided an invalid argument       |
++------------+----------------------------------------------+
+|  -ENOKEY   |  Missing authentication material             |
++------------+----------------------------------------------+
+|  -EIO      |  An unexpected fault occurred                |
++------------+----------------------------------------------+
+
+The @peerid parameter contains the serial number of a key containing the
+remote peer's identity or the value TLS_NO_PEERID if the session is not
+authenticated.
+
+A best practice is to close and destroy the socket immediately if the
+handshake failed.
+
+
+Other considerations
+--------------------
+
+While a handshake is under way, the kernel consumer must alter the
+socket's sk_data_ready callback function to ignore all incoming data.
+Once the handshake completion callback function has been invoked, normal
+receive operation can be resumed.
+
+Once a TLS session is established, the consumer must provide a buffer
+for and then examine the control message (CMSG) that is part of every
+subsequent sock_recvmsg(). Each control message indicates whether the
+received message data is TLS record data or session metadata.
+
+See tls.rst for details on how a kTLS consumer recognizes incoming
+(decrypted) application data, alerts, and handshake packets once the
+socket has been promoted to use the TLS ULP.
diff --git a/Documentation/networking/tls-offload.rst b/Documentation/networking/tls-offload.rst
index f914e81fd3a6..7354d48cdf92 100644
--- a/Documentation/networking/tls-offload.rst
+++ b/Documentation/networking/tls-offload.rst
@@ -51,7 +51,7 @@ and send them to the device for encryption and transmission.
 RX
 --
 
-On the receive side if the device handled decryption and authentication
+On the receive side, if the device handled decryption and authentication
 successfully, the driver will set the decrypted bit in the associated
 :c:type:`struct sk_buff <sk_buff>`. The packets reach the TCP stack and
 are handled normally. ``ktls`` is informed when data is queued to the socket
@@ -120,8 +120,9 @@ before installing the connection state in the kernel.
 RX
 --
 
-In RX direction local networking stack has little control over the segmentation,
-so the initial records' TCP sequence number may be anywhere inside the segment.
+In the RX direction, the local networking stack has little control over
+segmentation, so the initial records' TCP sequence number may be anywhere
+inside the segment.
 
 Normal operation
 ================
@@ -138,8 +139,8 @@ There are no guarantees on record length or record segmentation. In particular
 segments may start at any point of a record and contain any number of records.
 Assuming segments are received in order, the device should be able to perform
 crypto operations and authentication regardless of segmentation. For this
-to be possible device has to keep small amount of segment-to-segment state.
-This includes at least:
+to be possible, the device has to keep a small amount of segment-to-segment
+state. This includes at least:
 
  * partial headers (if a segment carried only a part of the TLS header)
  * partial data block
@@ -175,12 +176,12 @@ and packet transformation functions) the device validates the Layer 4
 checksum and performs a 5-tuple lookup to find any TLS connection the packet
 may belong to (technically a 4-tuple
 lookup is sufficient - IP addresses and TCP port numbers, as the protocol
-is always TCP). If connection is matched device confirms if the TCP sequence
-number is the expected one and proceeds to TLS handling (record delineation,
-decryption, authentication for each record in the packet). The device leaves
-the record framing unmodified, the stack takes care of record decapsulation.
-Device indicates successful handling of TLS offload in the per-packet context
-(descriptor) passed to the host.
+is always TCP). If the packet is matched to a connection, the device confirms
+if the TCP sequence number is the expected one and proceeds to TLS handling
+(record delineation, decryption, authentication for each record in the packet).
+The device leaves the record framing unmodified, the stack takes care of record
+decapsulation. Device indicates successful handling of TLS offload in the
+per-packet context (descriptor) passed to the host.
 
 Upon reception of a TLS offloaded packet, the driver sets
 the :c:member:`decrypted` mark in :c:type:`struct sk_buff <sk_buff>`
@@ -428,6 +429,24 @@ by the driver:
    which were part of a TLS stream.
  * ``rx_tls_decrypted_bytes`` - number of TLS payload bytes in RX packets
    which were successfully decrypted.
+ * ``rx_tls_ctx`` - number of TLS RX HW offload contexts added to device for
+   decryption.
+ * ``rx_tls_del`` - number of TLS RX HW offload contexts deleted from device
+   (connection has finished).
+ * ``rx_tls_resync_req_pkt`` - number of received TLS packets with a resync
+    request.
+ * ``rx_tls_resync_req_start`` - number of times the TLS async resync request
+    was started.
+ * ``rx_tls_resync_req_end`` - number of times the TLS async resync request
+    properly ended with providing the HW tracked tcp-seq.
+ * ``rx_tls_resync_req_skip`` - number of times the TLS async resync request
+    procedure was started but not properly ended.
+ * ``rx_tls_resync_res_ok`` - number of times the TLS resync response call to
+    the driver was successfully handled.
+ * ``rx_tls_resync_res_skip`` - number of times the TLS resync response call to
+    the driver was terminated unsuccessfully.
+ * ``rx_tls_err`` - number of RX packets which were part of a TLS stream
+   but were not decrypted due to unexpected error in the state machine.
  * ``tx_tls_encrypted_packets`` - number of TX packets passed to the device
    for encryption of their TLS payload.
  * ``tx_tls_encrypted_bytes`` - number of TLS payload bytes in TX packets
@@ -489,8 +508,8 @@ in packets as seen on the wire.
 Transport layer transparency
 ----------------------------
 
-The device should not modify any packet headers for the purpose
-of the simplifying TLS offload.
+For the purpose of simplifying TLS offload, the device should not modify any
+packet headers.
 
 The device should not depend on any packet headers beyond what is strictly
 necessary for TLS offload.
@@ -506,7 +525,16 @@ on TCP retransmissions to handle corner cases is not acceptable.
 TLS device features
 -------------------
 
-Drivers should ignore the changes to TLS the device feature flags.
+Drivers should ignore the changes to the TLS device feature flags.
 These flags will be acted upon accordingly by the core ``ktls`` code.
 TLS device feature flags only control adding of new TLS connection
 offloads, old connections will remain active after flags are cleared.
+
+TLS encryption cannot be offloaded to devices without checksum calculation
+offload. Hence, TLS TX device feature flag requires TX csum offload being set.
+Disabling the latter implies clearing the former. Disabling TX checksum offload
+should not affect old connections, and drivers should make sure checksum
+calculation does not break for them.
+Similarly, device-offloaded TLS decryption implies doing RXCSUM. If the user
+does not want to enable RX csum offload, TLS RX device feature is disabled
+as well.
diff --git a/Documentation/networking/tls.rst b/Documentation/networking/tls.rst
index 8cb2cd4e2a80..36cc7afc2527 100644
--- a/Documentation/networking/tls.rst
+++ b/Documentation/networking/tls.rst
@@ -16,11 +16,13 @@ User interface
 Creating a TLS connection
 -------------------------
 
-First create a new TCP socket and set the TLS ULP.
+First create a new TCP socket and once the connection is established set the
+TLS ULP.
 
 .. code-block:: c
 
   sock = socket(AF_INET, SOCK_STREAM, 0);
+  connect(sock, addr, addrlen);
   setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));
 
 Setting the TLS ULP allows us to set/get TLS socket options. Currently
@@ -200,6 +202,32 @@ received without a cmsg buffer set.
 
 recv will never return data from mixed types of TLS records.
 
+TLS 1.3 Key Updates
+-------------------
+
+In TLS 1.3, KeyUpdate handshake messages signal that the sender is
+updating its TX key. Any message sent after a KeyUpdate will be
+encrypted using the new key. The userspace library can pass the new
+key to the kernel using the TLS_TX and TLS_RX socket options, as for
+the initial keys. TLS version and cipher cannot be changed.
+
+To prevent attempting to decrypt incoming records using the wrong key,
+decryption will be paused when a KeyUpdate message is received by the
+kernel, until the new key has been provided using the TLS_RX socket
+option. Any read occurring after the KeyUpdate has been read and
+before the new key is provided will fail with EKEYEXPIRED. poll() will
+not report any read events from the socket until the new key is
+provided. There is no pausing on the transmit side.
+
+Userspace should make sure that the crypto_info provided has been set
+properly. In particular, the kernel will not check for key/nonce
+reuse.
+
+The number of successful and failed key updates is tracked in the
+``TlsTxRekeyOk``, ``TlsRxRekeyOk``, ``TlsTxRekeyError``,
+``TlsRxRekeyError`` statistics. The ``TlsRxRekeyReceived`` statistic
+counts KeyUpdate handshake messages that have been received.
+
 Integrating in to userspace TLS library
 ---------------------------------------
 
@@ -214,6 +242,44 @@ of calling send directly after a handshake using gnutls.
 Since it doesn't implement a full record layer, control
 messages are not supported.
 
+Optional optimizations
+----------------------
+
+There are certain condition-specific optimizations the TLS ULP can make,
+if requested. Those optimizations are either not universally beneficial
+or may impact correctness, hence they require an opt-in.
+All options are set per-socket using setsockopt(), and their
+state can be checked using getsockopt() and via socket diag (``ss``).
+
+TLS_TX_ZEROCOPY_RO
+~~~~~~~~~~~~~~~~~~
+
+For device offload only. Allow sendfile() data to be transmitted directly
+to the NIC without making an in-kernel copy. This allows true zero-copy
+behavior when device offload is enabled.
+
+The application must make sure that the data is not modified between being
+submitted and transmission completing. In other words this is mostly
+applicable if the data sent on a socket via sendfile() is read-only.
+
+Modifying the data may result in different versions of the data being used
+for the original TCP transmission and TCP retransmissions. To the receiver
+this will look like TLS records had been tampered with and will result
+in record authentication failures.
+
+TLS_RX_EXPECT_NO_PAD
+~~~~~~~~~~~~~~~~~~~~
+
+TLS 1.3 only. Expect the sender to not pad records. This allows the data
+to be decrypted directly into user space buffers with TLS 1.3.
+
+This optimization is safe to enable only if the remote end is trusted,
+otherwise it is an attack vector to doubling the TLS processing cost.
+
+If the record decrypted turns out to had been padded or is not a data
+record it will be decrypted again into a kernel buffer without zero copy.
+Such events are counted in the ``TlsDecryptRetry`` statistic.
+
 Statistics
 ==========
 
@@ -239,3 +305,22 @@ TLS implementation exposes the following per-namespace statistics
 
 - ``TlsDeviceRxResync`` -
   number of RX resyncs sent to NICs handling cryptography
+
+- ``TlsDecryptRetry`` -
+  number of RX records which had to be re-decrypted due to
+  ``TLS_RX_EXPECT_NO_PAD`` mis-prediction. Note that this counter will
+  also increment for non-data records.
+
+- ``TlsRxNoPadViolation`` -
+  number of data RX records which had to be re-decrypted due to
+  ``TLS_RX_EXPECT_NO_PAD`` mis-prediction.
+
+- ``TlsTxRekeyOk``, ``TlsRxRekeyOk`` -
+  number of successful rekeys on existing sessions for TX and RX
+
+- ``TlsTxRekeyError``, ``TlsRxRekeyError`` -
+  number of failed rekeys on existing sessions for TX and RX
+
+- ``TlsRxRekeyReceived`` -
+  number of received KeyUpdate handshake messages, requiring userspace
+  to provide a new RX key
diff --git a/Documentation/networking/tproxy.rst b/Documentation/networking/tproxy.rst
index 00dc3a1a66b4..75e4990cc3db 100644
--- a/Documentation/networking/tproxy.rst
+++ b/Documentation/networking/tproxy.rst
@@ -17,7 +17,7 @@ The idea is that you identify packets with destination address matching a local
 socket on your box, set the packet mark to a certain value::
 
     # iptables -t mangle -N DIVERT
-    # iptables -t mangle -A PREROUTING -p tcp -m socket -j DIVERT
+    # iptables -t mangle -A PREROUTING -p tcp -m socket --transparent -j DIVERT
     # iptables -t mangle -A DIVERT -j MARK --set-mark 1
     # iptables -t mangle -A DIVERT -j ACCEPT
 
@@ -69,9 +69,9 @@ add rules like this to the iptables ruleset above::
     # iptables -t mangle -A PREROUTING -p tcp --dport 80 -j TPROXY \
       --tproxy-mark 0x1/0x1 --on-port 50080
 
-Or the following rule to nft:
+Or the following rule to nft::
 
-# nft add rule filter divert tcp dport 80 tproxy to :50080 meta mark set 1 accept
+    # nft add rule filter divert tcp dport 80 tproxy to :50080 meta mark set 1 accept
 
 Note that for this to work you'll have to modify the proxy to enable (SOL_IP,
 IP_TRANSPARENT) for the listening socket.
diff --git a/Documentation/networking/tuntap.rst b/Documentation/networking/tuntap.rst
index a59d1dd6fdcc..4d7087f727be 100644
--- a/Documentation/networking/tuntap.rst
+++ b/Documentation/networking/tuntap.rst
@@ -107,7 +107,7 @@ Note that the character pointer becomes overwritten with the real device name
        */
       ifr.ifr_flags = IFF_TUN;
       if( *dev )
-	 strncpy(ifr.ifr_name, dev, IFNAMSIZ);
+	 strscpy_pad(ifr.ifr_name, dev, IFNAMSIZ);
 
       if( (err = ioctl(fd, TUNSETIFF, (void *) &ifr)) < 0 ){
 	 close(fd);
diff --git a/Documentation/networking/vrf.rst b/Documentation/networking/vrf.rst
index 0dde145043bc..0a9a6f968cb9 100644
--- a/Documentation/networking/vrf.rst
+++ b/Documentation/networking/vrf.rst
@@ -144,6 +144,19 @@ default VRF are only handled by a socket not bound to any VRF::
 netfilter rules on the VRF device can be used to limit access to services
 running in the default VRF context as well.
 
+Using VRF-aware applications (applications which simultaneously create sockets
+outside and inside VRFs) in conjunction with ``net.ipv4.tcp_l3mdev_accept=1``
+is possible but may lead to problems in some situations. With that sysctl
+value, it is unspecified which listening socket will be selected to handle
+connections for VRF traffic; ie. either a socket bound to the VRF or an unbound
+socket may be used to accept new connections from a VRF. This somewhat
+unexpected behavior can lead to problems if sockets are configured with extra
+options (ex. TCP MD5 keys) with the expectation that VRF traffic will
+exclusively be handled by sockets bound to VRFs, as would be the case with
+``net.ipv4.tcp_l3mdev_accept=0``. Finally and as a reminder, regardless of
+which listening socket is selected, established sockets will be created in the
+VRF based on the ingress interface, as documented earlier.
+
 --------------------------------------------------------------------------------
 
 Using iproute2 for VRFs
diff --git a/Documentation/networking/vxlan.rst b/Documentation/networking/vxlan.rst
index ce239fa01848..2759dc1cc525 100644
--- a/Documentation/networking/vxlan.rst
+++ b/Documentation/networking/vxlan.rst
@@ -58,3 +58,31 @@ forwarding table using the new bridge command.
 3. Show forwarding table::
 
     # bridge fdb show dev vxlan0
+
+The following NIC features may indicate support for UDP tunnel-related
+offloads (most commonly VXLAN features, but support for a particular
+encapsulation protocol is NIC specific):
+
+ - `tx-udp_tnl-segmentation`
+ - `tx-udp_tnl-csum-segmentation`
+    ability to perform TCP segmentation offload of UDP encapsulated frames
+
+ - `rx-udp_tunnel-port-offload`
+    receive side parsing of UDP encapsulated frames which allows NICs to
+    perform protocol-aware offloads, like checksum validation offload of
+    inner frames (only needed by NICs without protocol-agnostic offloads)
+
+For devices supporting `rx-udp_tunnel-port-offload` the list of currently
+offloaded ports can be interrogated with `ethtool`::
+
+  $ ethtool --show-tunnels eth0
+  Tunnel information for eth0:
+    UDP port table 0:
+      Size: 4
+      Types: vxlan
+      No entries
+    UDP port table 1:
+      Size: 4
+      Types: geneve, vxlan-gpe
+      Entries (1):
+          port 1230, vxlan-gpe
diff --git a/Documentation/networking/x25-iface.rst b/Documentation/networking/x25-iface.rst
index df401891dce6..285cefcfce87 100644
--- a/Documentation/networking/x25-iface.rst
+++ b/Documentation/networking/x25-iface.rst
@@ -1,8 +1,7 @@
 .. SPDX-License-Identifier: GPL-2.0
 
-============================-
 X.25 Device Driver Interface
-============================-
+============================
 
 Version 1.1
 
@@ -70,60 +69,13 @@ First Byte = 0x03 (X25_IFACE_PARAMS)
 LAPB parameters. To be defined.
 
 
+Requirements for the device driver
+----------------------------------
 
-Possible Problems
-=================
-
-(Henner Eisen, 2000-10-28)
-
-The X.25 packet layer protocol depends on a reliable datalink service.
-The LAPB protocol provides such reliable service. But this reliability
-is not preserved by the Linux network device driver interface:
-
-- With Linux 2.4.x (and above) SMP kernels, packet ordering is not
-  preserved. Even if a device driver calls netif_rx(skb1) and later
-  netif_rx(skb2), skb2 might be delivered to the network layer
-  earlier that skb1.
-- Data passed upstream by means of netif_rx() might be dropped by the
-  kernel if the backlog queue is congested.
-
-The X.25 packet layer protocol will detect this and reset the virtual
-call in question. But many upper layer protocols are not designed to
-handle such N-Reset events gracefully. And frequent N-Reset events
-will always degrade performance.
-
-Thus, driver authors should make netif_rx() as reliable as possible:
-
-SMP re-ordering will not occur if the driver's interrupt handler is
-always executed on the same CPU. Thus,
-
-- Driver authors should use irq affinity for the interrupt handler.
-
-The probability of packet loss due to backlog congestion can be
-reduced by the following measures or a combination thereof:
-
-(1) Drivers for kernel versions 2.4.x and above should always check the
-    return value of netif_rx(). If it returns NET_RX_DROP, the
-    driver's LAPB protocol must not confirm reception of the frame
-    to the peer.
-    This will reliably suppress packet loss. The LAPB protocol will
-    automatically cause the peer to re-transmit the dropped packet
-    later.
-    The lapb module interface was modified to support this. Its
-    data_indication() method should now transparently pass the
-    netif_rx() return value to the (lapb module) caller.
-(2) Drivers for kernel versions 2.2.x should always check the global
-    variable netdev_dropping when a new frame is received. The driver
-    should only call netif_rx() if netdev_dropping is zero. Otherwise
-    the driver should not confirm delivery of the frame and drop it.
-    Alternatively, the driver can queue the frame internally and call
-    netif_rx() later when netif_dropping is 0 again. In that case, delivery
-    confirmation should also be deferred such that the internal queue
-    cannot grow to much.
-    This will not reliably avoid packet loss, but the probability
-    of packet loss in netif_rx() path will be significantly reduced.
-(3) Additionally, driver authors might consider to support
-    CONFIG_NET_HW_FLOWCONTROL. This allows the driver to be woken up
-    when a previously congested backlog queue becomes empty again.
-    The driver could uses this for flow-controlling the peer by means
-    of the LAPB protocol's flow-control service.
+Packets should not be reordered or dropped when delivering between the
+Packet Layer and the device driver.
+
+To avoid packets from being reordered or dropped when delivering from
+the device driver to the Packet Layer, the device driver should not
+call "netif_rx" to deliver the received packets. Instead, it should
+call "netif_receive_skb_core" from softirq context to deliver them.
diff --git a/Documentation/networking/x25.rst b/Documentation/networking/x25.rst
index 00e45d384ba0..e11d9ebdf9a3 100644
--- a/Documentation/networking/x25.rst
+++ b/Documentation/networking/x25.rst
@@ -19,13 +19,11 @@ implementation of LAPB. Therefore the LAPB modules would be called by
 unintelligent X.25 card drivers and not by intelligent ones, this would
 provide a uniform device driver interface, and simplify configuration.
 
-To confuse matters a little, an 802.2 LLC implementation for Linux is being
-written which will allow X.25 to be run over an Ethernet (or Token Ring) and
-conform with the JNT "Pink Book", this will have a different interface to
-the Packet Layer but there will be no confusion since the class of device
-being served by the LLC will be completely separate from LAPB. The LLC
-implementation is being done as part of another protocol project (SNA) and
-by a different author.
+To confuse matters a little, an 802.2 LLC implementation is also possible
+which could allow X.25 to be run over an Ethernet (or Token Ring) and
+conform with the JNT "Pink Book", this would have a different interface to
+the Packet Layer but there would be no confusion since the class of device
+being served by the LLC would be completely separate from LAPB.
 
 Just when you thought that it could not become more confusing, another
 option appeared, XOT. This allows X.25 Packet Layer frames to operate over
diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
new file mode 100644
index 000000000000..a6e0ece18be5
--- /dev/null
+++ b/Documentation/networking/xdp-rx-metadata.rst
@@ -0,0 +1,128 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+XDP RX Metadata
+===============
+
+This document describes how an eXpress Data Path (XDP) program can access
+hardware metadata related to a packet using a set of helper functions,
+and how it can pass that metadata on to other consumers.
+
+General Design
+==============
+
+XDP has access to a set of kfuncs to manipulate the metadata in an XDP frame.
+Every device driver that wishes to expose additional packet metadata can
+implement these kfuncs. The set of kfuncs is declared in ``include/net/xdp.h``
+via ``XDP_METADATA_KFUNC_xxx``.
+
+Currently, the following kfuncs are supported. In the future, as more
+metadata is supported, this set will grow:
+
+.. kernel-doc:: net/core/xdp.c
+   :identifiers: bpf_xdp_metadata_rx_timestamp
+
+.. kernel-doc:: net/core/xdp.c
+   :identifiers: bpf_xdp_metadata_rx_hash
+
+.. kernel-doc:: net/core/xdp.c
+   :identifiers: bpf_xdp_metadata_rx_vlan_tag
+
+An XDP program can use these kfuncs to read the metadata into stack
+variables for its own consumption. Or, to pass the metadata on to other
+consumers, an XDP program can store it into the metadata area carried
+ahead of the packet. Not all packets will necessary have the requested
+metadata available in which case the driver returns ``-ENODATA``.
+
+Not all kfuncs have to be implemented by the device driver; when not
+implemented, the default ones that return ``-EOPNOTSUPP`` will be used
+to indicate the device driver have not implemented this kfunc.
+
+
+Within an XDP frame, the metadata layout (accessed via ``xdp_buff``) is
+as follows::
+
+  +----------+-----------------+------+
+  | headroom | custom metadata | data |
+  +----------+-----------------+------+
+             ^                 ^
+             |                 |
+   xdp_buff->data_meta   xdp_buff->data
+
+An XDP program can store individual metadata items into this ``data_meta``
+area in whichever format it chooses. Later consumers of the metadata
+will have to agree on the format by some out of band contract (like for
+the AF_XDP use case, see below).
+
+AF_XDP
+======
+
+:doc:`af_xdp` use-case implies that there is a contract between the BPF
+program that redirects XDP frames into the ``AF_XDP`` socket (``XSK``) and
+the final consumer. Thus the BPF program manually allocates a fixed number of
+bytes out of metadata via ``bpf_xdp_adjust_meta`` and calls a subset
+of kfuncs to populate it. The userspace ``XSK`` consumer computes
+``xsk_umem__get_data() - METADATA_SIZE`` to locate that metadata.
+Note, ``xsk_umem__get_data`` is defined in ``libxdp`` and
+``METADATA_SIZE`` is an application-specific constant (``AF_XDP`` receive
+descriptor does _not_ explicitly carry the size of the metadata).
+
+Here is the ``AF_XDP`` consumer layout (note missing ``data_meta`` pointer)::
+
+  +----------+-----------------+------+
+  | headroom | custom metadata | data |
+  +----------+-----------------+------+
+                               ^
+                               |
+                        rx_desc->address
+
+XDP_PASS
+========
+
+This is the path where the packets processed by the XDP program are passed
+into the kernel. The kernel creates the ``skb`` out of the ``xdp_buff``
+contents. Currently, every driver has custom kernel code to parse
+the descriptors and populate ``skb`` metadata when doing this ``xdp_buff->skb``
+conversion, and the XDP metadata is not used by the kernel when building
+``skbs``. However, TC-BPF programs can access the XDP metadata area using
+the ``data_meta`` pointer.
+
+In the future, we'd like to support a case where an XDP program
+can override some of the metadata used for building ``skbs``.
+
+bpf_redirect_map
+================
+
+``bpf_redirect_map`` can redirect the frame to a different device.
+Some devices (like virtual ethernet links) support running a second XDP
+program after the redirect. However, the final consumer doesn't have
+access to the original hardware descriptor and can't access any of
+the original metadata. The same applies to XDP programs installed
+into devmaps and cpumaps.
+
+This means that for redirected packets only custom metadata is
+currently supported, which has to be prepared by the initial XDP program
+before redirect. If the frame is eventually passed to the kernel, the
+``skb`` created from such a frame won't have any hardware metadata populated
+in its ``skb``. If such a packet is later redirected into an ``XSK``,
+that will also only have access to the custom metadata.
+
+bpf_tail_call
+=============
+
+Adding programs that access metadata kfuncs to the ``BPF_MAP_TYPE_PROG_ARRAY``
+is currently not supported.
+
+Supported Devices
+=================
+
+It is possible to query which kfunc the particular netdev implements via
+netlink. See ``xdp-rx-metadata-features`` attribute set in
+``Documentation/netlink/specs/netdev.yaml``.
+
+Example
+=======
+
+See ``tools/testing/selftests/bpf/progs/xdp_metadata.c`` and
+``tools/testing/selftests/bpf/prog_tests/xdp_metadata.c`` for an example of
+BPF program that handles XDP metadata.
diff --git a/Documentation/networking/xfrm_device.rst b/Documentation/networking/xfrm_device.rst
index da1073acda96..122204da0fff 100644
--- a/Documentation/networking/xfrm_device.rst
+++ b/Documentation/networking/xfrm_device.rst
@@ -1,10 +1,12 @@
 .. SPDX-License-Identifier: GPL-2.0
+.. _xfrm_device:
 
 ===============================================
 XFRM device - offloading the IPsec computations
 ===============================================
 
 Shannon Nelson <shannon.nelson@oracle.com>
+Leon Romanovsky <leonro@nvidia.com>
 
 
 Overview
@@ -18,10 +20,21 @@ can radically increase throughput and decrease CPU utilization.  The XFRM
 Device interface allows NIC drivers to offer to the stack access to the
 hardware offload.
 
+Right now, there are two types of hardware offload that kernel supports.
+ * IPsec crypto offload:
+   * NIC performs encrypt/decrypt
+   * Kernel does everything else
+ * IPsec packet offload:
+   * NIC performs encrypt/decrypt
+   * NIC does encapsulation
+   * Kernel and NIC have SA and policy in-sync
+   * NIC handles the SA and policies states
+   * The Kernel talks to the keymanager
+
 Userland access to the offload is typically through a system such as
 libreswan or KAME/raccoon, but the iproute2 'ip xfrm' command set can
 be handy when experimenting.  An example command might look something
-like this::
+like this for crypto offload:
 
   ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \
      reqid 0x07 replay-window 32 \
@@ -29,6 +42,17 @@ like this::
      sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp \
      offload dev eth4 dir in
 
+and for packet offload
+
+  ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \
+     reqid 0x07 replay-window 32 \
+     aead 'rfc4106(gcm(aes))' 0x44434241343332312423222114131211f4f3f2f1 128 \
+     sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp \
+     offload packet dev eth4 dir in
+
+  ip x p add src 14.0.0.70 dst 14.0.0.52 offload packet dev eth4 dir in
+  tmpl src 14.0.0.70 dst 14.0.0.52 proto esp reqid 10000 mode transport
+
 Yes, that's ugly, but that's what shell scripts and/or libreswan are for.
 
 
@@ -40,17 +64,28 @@ Callbacks to implement
 
   /* from include/linux/netdevice.h */
   struct xfrmdev_ops {
-	int	(*xdo_dev_state_add) (struct xfrm_state *x);
-	void	(*xdo_dev_state_delete) (struct xfrm_state *x);
-	void	(*xdo_dev_state_free) (struct xfrm_state *x);
+        /* Crypto and Packet offload callbacks */
+	int	(*xdo_dev_state_add)(struct net_device *dev,
+                                     struct xfrm_state *x,
+                                     struct netlink_ext_ack *extack);
+	void	(*xdo_dev_state_delete)(struct net_device *dev,
+                                        struct xfrm_state *x);
+	void	(*xdo_dev_state_free)(struct net_device *dev,
+                                      struct xfrm_state *x);
 	bool	(*xdo_dev_offload_ok) (struct sk_buff *skb,
 				       struct xfrm_state *x);
 	void    (*xdo_dev_state_advance_esn) (struct xfrm_state *x);
+	void    (*xdo_dev_state_update_stats) (struct xfrm_state *x);
+
+        /* Solely packet offload callbacks */
+	int	(*xdo_dev_policy_add) (struct xfrm_policy *x, struct netlink_ext_ack *extack);
+	void	(*xdo_dev_policy_delete) (struct xfrm_policy *x);
+	void	(*xdo_dev_policy_free) (struct xfrm_policy *x);
   };
 
-The NIC driver offering ipsec offload will need to implement these
-callbacks to make the offload available to the network stack's
-XFRM subsytem.  Additionally, the feature bits NETIF_F_HW_ESP and
+The NIC driver offering ipsec offload will need to implement callbacks
+relevant to supported offload to make the offload available to the network
+stack's XFRM subsystem. Additionally, the feature bits NETIF_F_HW_ESP and
 NETIF_F_HW_ESP_TX_CSUM will signal the availability of the offload.
 
 
@@ -79,7 +114,8 @@ and an indication of whether it is for Rx or Tx.  The driver should
 
 		===========   ===================================
 		0             success
-		-EOPNETSUPP   offload not supported, try SW IPsec
+		-EOPNETSUPP   offload not supported, try SW IPsec,
+                              not applicable for packet offload mode
 		other         fail the request
 		===========   ===================================
 
@@ -94,8 +130,10 @@ been setup for offload, it first calls into xdo_dev_offload_ok() with
 the skb and the intended offload state to ask the driver if the offload
 will serviceable.  This can check the packet information to be sure the
 offload can be supported (e.g. IPv4 or IPv6, no IPv4 options, etc) and
-return true of false to signify its support.
+return true or false to signify its support. In case driver doesn't implement
+this callback, the stack provides reasonable defaults.
 
+Crypto offload mode:
 When ready to send, the driver needs to inspect the Tx packet for the
 offload information, including the opaque context, and set up the packet
 send accordingly::
@@ -136,16 +174,29 @@ the stack in xfrm_input().
 
 	hand the packet to napi_gro_receive() as usual
 
-In ESN mode, xdo_dev_state_advance_esn() is called from xfrm_replay_advance_esn().
+In ESN mode, xdo_dev_state_advance_esn() is called from
+xfrm_replay_advance_esn() for RX, and xfrm_replay_overflow_offload_esn for TX.
 Driver will check packet seq number and update HW ESN state machine if needed.
 
+Packet offload mode:
+HW adds and deletes XFRM headers. So in RX path, XFRM stack is bypassed if HW
+reported success. In TX path, the packet lefts kernel without extra header
+and not encrypted, the HW is responsible to perform it.
+
 When the SA is removed by the user, the driver's xdo_dev_state_delete()
-is asked to disable the offload.  Later, xdo_dev_state_free() is called
-from a garbage collection routine after all reference counts to the state
+and xdo_dev_policy_delete() are asked to disable the offload.  Later,
+xdo_dev_state_free() and xdo_dev_policy_free() are called from a garbage
+collection routine after all reference counts to the state and policy
 have been removed and any remaining resources can be cleared for the
 offload state.  How these are used by the driver will depend on specific
 hardware needs.
 
 As a netdev is set to DOWN the XFRM stack's netdev listener will call
-xdo_dev_state_delete() and xdo_dev_state_free() on any remaining offloaded
-states.
+xdo_dev_state_delete(), xdo_dev_policy_delete(), xdo_dev_state_free() and
+xdo_dev_policy_free() on any remaining offloaded states.
+
+Outcome of HW handling packets, the XFRM core can't count hard, soft limits.
+The HW/driver are responsible to perform it and provide accurate data when
+xdo_dev_state_update_stats() is called. In case of one of these limits
+occuried, the driver needs to call to xfrm_state_check_expire() to make sure
+that XFRM performs rekeying sequence.
diff --git a/Documentation/networking/xfrm_proc.rst b/Documentation/networking/xfrm_proc.rst
index 0a771c5a7399..973d1571acac 100644
--- a/Documentation/networking/xfrm_proc.rst
+++ b/Documentation/networking/xfrm_proc.rst
@@ -73,6 +73,9 @@ XfrmAcquireError:
 XfrmFwdHdrError:
 	Forward routing of a packet is not allowed
 
+XfrmInStateDirError:
+        State direction mismatch (lookup found an output state on the input path, expected input or no direction)
+
 Outbound errors
 ~~~~~~~~~~~~~~~
 XfrmOutError:
@@ -111,3 +114,6 @@ XfrmOutPolError:
 
 XfrmOutStateInvalid:
 	State is invalid, perhaps expired
+
+XfrmOutStateDirError:
+        State direction mismatch (lookup found an input state on the output path, expected output or no direction)
diff --git a/Documentation/networking/xsk-tx-metadata.rst b/Documentation/networking/xsk-tx-metadata.rst
new file mode 100644
index 000000000000..df53a10ccac3
--- /dev/null
+++ b/Documentation/networking/xsk-tx-metadata.rst
@@ -0,0 +1,147 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
+AF_XDP TX Metadata
+==================
+
+This document describes how to enable offloads when transmitting packets
+via :doc:`af_xdp`. Refer to :doc:`xdp-rx-metadata` on how to access similar
+metadata on the receive side.
+
+General Design
+==============
+
+The headroom for the metadata is reserved via ``tx_metadata_len`` and
+``XDP_UMEM_TX_METADATA_LEN`` flag in ``struct xdp_umem_reg``. The metadata
+length is therefore the same for every socket that shares the same umem.
+The metadata layout is a fixed UAPI, refer to ``union xsk_tx_metadata`` in
+``include/uapi/linux/if_xdp.h``. Thus, generally, the ``tx_metadata_len``
+field above should contain ``sizeof(union xsk_tx_metadata)``.
+
+Note that in the original implementation the ``XDP_UMEM_TX_METADATA_LEN``
+flag was not required. Applications might attempt to create a umem
+with a flag first and if it fails, do another attempt without a flag.
+
+The headroom and the metadata itself should be located right before
+``xdp_desc->addr`` in the umem frame. Within a frame, the metadata
+layout is as follows::
+
+           tx_metadata_len
+     /                         \
+    +-----------------+---------+----------------------------+
+    | xsk_tx_metadata | padding |          payload           |
+    +-----------------+---------+----------------------------+
+                                ^
+                                |
+                          xdp_desc->addr
+
+An AF_XDP application can request headrooms larger than ``sizeof(struct
+xsk_tx_metadata)``. The kernel will ignore the padding (and will still
+use ``xdp_desc->addr - tx_metadata_len`` to locate
+the ``xsk_tx_metadata``). For the frames that shouldn't carry
+any metadata (i.e., the ones that don't have ``XDP_TX_METADATA`` option),
+the metadata area is ignored by the kernel as well.
+
+The flags field enables the particular offload:
+
+- ``XDP_TXMD_FLAGS_TIMESTAMP``: requests the device to put transmission
+  timestamp into ``tx_timestamp`` field of ``union xsk_tx_metadata``.
+- ``XDP_TXMD_FLAGS_CHECKSUM``: requests the device to calculate L4
+  checksum. ``csum_start`` specifies byte offset of where the checksumming
+  should start and ``csum_offset`` specifies byte offset where the
+  device should store the computed checksum.
+- ``XDP_TXMD_FLAGS_LAUNCH_TIME``: requests the device to schedule the
+  packet for transmission at a pre-determined time called launch time. The
+  value of launch time is indicated by ``launch_time`` field of
+  ``union xsk_tx_metadata``.
+
+Besides the flags above, in order to trigger the offloads, the first
+packet's ``struct xdp_desc`` descriptor should set ``XDP_TX_METADATA``
+bit in the ``options`` field. Also note that in a multi-buffer packet
+only the first chunk should carry the metadata.
+
+Software TX Checksum
+====================
+
+For development and testing purposes its possible to pass
+``XDP_UMEM_TX_SW_CSUM`` flag to ``XDP_UMEM_REG`` UMEM registration call.
+In this case, when running in ``XDK_COPY`` mode, the TX checksum
+is calculated on the CPU. Do not enable this option in production because
+it will negatively affect performance.
+
+Launch Time
+===========
+
+The value of the requested launch time should be based on the device's PTP
+Hardware Clock (PHC) to ensure accuracy. AF_XDP takes a different data path
+compared to the ETF queuing discipline, which organizes packets and delays
+their transmission. Instead, AF_XDP immediately hands off the packets to
+the device driver without rearranging their order or holding them prior to
+transmission. Since the driver maintains FIFO behavior and does not perform
+packet reordering, a packet with a launch time request will block other
+packets in the same Tx Queue until it is sent. Therefore, it is recommended
+to allocate separate queue for scheduling traffic that is intended for
+future transmission.
+
+In scenarios where the launch time offload feature is disabled, the device
+driver is expected to disregard the launch time request. For correct
+interpretation and meaningful operation, the launch time should never be
+set to a value larger than the farthest programmable time in the future
+(the horizon). Different devices have different hardware limitations on the
+launch time offload feature.
+
+stmmac driver
+-------------
+
+For stmmac, TSO and launch time (TBS) features are mutually exclusive for
+each individual Tx Queue. By default, the driver configures Tx Queue 0 to
+support TSO and the rest of the Tx Queues to support TBS. The launch time
+hardware offload feature can be enabled or disabled by using the tc-etf
+command to call the driver's ndo_setup_tc() callback.
+
+The value of the launch time that is programmed in the Enhanced Normal
+Transmit Descriptors is a 32-bit value, where the most significant 8 bits
+represent the time in seconds and the remaining 24 bits represent the time
+in 256 ns increments. The programmed launch time is compared against the
+PTP time (bits[39:8]) and rolls over after 256 seconds. Therefore, the
+horizon of the launch time for dwmac4 and dwxlgmac2 is 128 seconds in the
+future.
+
+igc driver
+----------
+
+For igc, all four Tx Queues support the launch time feature. The launch
+time hardware offload feature can be enabled or disabled by using the
+tc-etf command to call the driver's ndo_setup_tc() callback. When entering
+TSN mode, the igc driver will reset the device and create a default Qbv
+schedule with a 1-second cycle time, with all Tx Queues open at all times.
+
+The value of the launch time that is programmed in the Advanced Transmit
+Context Descriptor is a relative offset to the starting time of the Qbv
+transmission window of the queue. The Frst flag of the descriptor can be
+set to schedule the packet for the next Qbv cycle. Therefore, the horizon
+of the launch time for i225 and i226 is the ending time of the next cycle
+of the Qbv transmission window of the queue. For example, when the Qbv
+cycle time is set to 1 second, the horizon of the launch time ranges
+from 1 second to 2 seconds, depending on where the Qbv cycle is currently
+running.
+
+Querying Device Capabilities
+============================
+
+Every devices exports its offloads capabilities via netlink netdev family.
+Refer to ``xsk-flags`` features bitmask in
+``Documentation/netlink/specs/netdev.yaml``.
+
+- ``tx-timestamp``: device supports ``XDP_TXMD_FLAGS_TIMESTAMP``
+- ``tx-checksum``: device supports ``XDP_TXMD_FLAGS_CHECKSUM``
+- ``tx-launch-time-fifo``: device supports ``XDP_TXMD_FLAGS_LAUNCH_TIME``
+
+See ``tools/net/ynl/samples/netdev.c`` on how to query this information.
+
+Example
+=======
+
+See ``tools/testing/selftests/bpf/xdp_hw_metadata.c`` for an example
+program that handles TX metadata. Also see https://github.com/fomichev/xskgen
+for a more bare-bones example.
diff --git a/Documentation/networking/z8530book.rst b/Documentation/networking/z8530book.rst
deleted file mode 100644
index fea2c40e7973..000000000000
--- a/Documentation/networking/z8530book.rst
+++ /dev/null
@@ -1,256 +0,0 @@
-=======================
-Z8530 Programming Guide
-=======================
-
-:Author: Alan Cox
-
-Introduction
-============
-
-The Z85x30 family synchronous/asynchronous controller chips are used on
-a large number of cheap network interface cards. The kernel provides a
-core interface layer that is designed to make it easy to provide WAN
-services using this chip.
-
-The current driver only support synchronous operation. Merging the
-asynchronous driver support into this code to allow any Z85x30 device to
-be used as both a tty interface and as a synchronous controller is a
-project for Linux post the 2.4 release
-
-Driver Modes
-============
-
-The Z85230 driver layer can drive Z8530, Z85C30 and Z85230 devices in
-three different modes. Each mode can be applied to an individual channel
-on the chip (each chip has two channels).
-
-The PIO synchronous mode supports the most common Z8530 wiring. Here the
-chip is interface to the I/O and interrupt facilities of the host
-machine but not to the DMA subsystem. When running PIO the Z8530 has
-extremely tight timing requirements. Doing high speeds, even with a
-Z85230 will be tricky. Typically you should expect to achieve at best
-9600 baud with a Z8C530 and 64Kbits with a Z85230.
-
-The DMA mode supports the chip when it is configured to use dual DMA
-channels on an ISA bus. The better cards tend to support this mode of
-operation for a single channel. With DMA running the Z85230 tops out
-when it starts to hit ISA DMA constraints at about 512Kbits. It is worth
-noting here that many PC machines hang or crash when the chip is driven
-fast enough to hold the ISA bus solid.
-
-Transmit DMA mode uses a single DMA channel. The DMA channel is used for
-transmission as the transmit FIFO is smaller than the receive FIFO. it
-gives better performance than pure PIO mode but is nowhere near as ideal
-as pure DMA mode.
-
-Using the Z85230 driver
-=======================
-
-The Z85230 driver provides the back end interface to your board. To
-configure a Z8530 interface you need to detect the board and to identify
-its ports and interrupt resources. It is also your problem to verify the
-resources are available.
-
-Having identified the chip you need to fill in a struct z8530_dev,
-which describes each chip. This object must exist until you finally
-shutdown the board. Firstly zero the active field. This ensures nothing
-goes off without you intending it. The irq field should be set to the
-interrupt number of the chip. (Each chip has a single interrupt source
-rather than each channel). You are responsible for allocating the
-interrupt line. The interrupt handler should be set to
-:c:func:`z8530_interrupt()`. The device id should be set to the
-z8530_dev structure pointer. Whether the interrupt can be shared or not
-is board dependent, and up to you to initialise.
-
-The structure holds two channel structures. Initialise chanA.ctrlio and
-chanA.dataio with the address of the control and data ports. You can or
-this with Z8530_PORT_SLEEP to indicate your interface needs the 5uS
-delay for chip settling done in software. The PORT_SLEEP option is
-architecture specific. Other flags may become available on future
-platforms, eg for MMIO. Initialise the chanA.irqs to &z8530_nop to
-start the chip up as disabled and discarding interrupt events. This
-ensures that stray interrupts will be mopped up and not hang the bus.
-Set chanA.dev to point to the device structure itself. The private and
-name field you may use as you wish. The private field is unused by the
-Z85230 layer. The name is used for error reporting and it may thus make
-sense to make it match the network name.
-
-Repeat the same operation with the B channel if your chip has both
-channels wired to something useful. This isn't always the case. If it is
-not wired then the I/O values do not matter, but you must initialise
-chanB.dev.
-
-If your board has DMA facilities then initialise the txdma and rxdma
-fields for the relevant channels. You must also allocate the ISA DMA
-channels and do any necessary board level initialisation to configure
-them. The low level driver will do the Z8530 and DMA controller
-programming but not board specific magic.
-
-Having initialised the device you can then call
-:c:func:`z8530_init()`. This will probe the chip and reset it into
-a known state. An identification sequence is then run to identify the
-chip type. If the checks fail to pass the function returns a non zero
-error code. Typically this indicates that the port given is not valid.
-After this call the type field of the z8530_dev structure is
-initialised to either Z8530, Z85C30 or Z85230 according to the chip
-found.
-
-Once you have called z8530_init you can also make use of the utility
-function :c:func:`z8530_describe()`. This provides a consistent
-reporting format for the Z8530 devices, and allows all the drivers to
-provide consistent reporting.
-
-Attaching Network Interfaces
-============================
-
-If you wish to use the network interface facilities of the driver, then
-you need to attach a network device to each channel that is present and
-in use. In addition to use the generic HDLC you need to follow some
-additional plumbing rules. They may seem complex but a look at the
-example hostess_sv11 driver should reassure you.
-
-The network device used for each channel should be pointed to by the
-netdevice field of each channel. The hdlc-> priv field of the network
-device points to your private data - you will need to be able to find
-your private data from this.
-
-The way most drivers approach this particular problem is to create a
-structure holding the Z8530 device definition and put that into the
-private field of the network device. The network device fields of the
-channels then point back to the network devices.
-
-If you wish to use the generic HDLC then you need to register the HDLC
-device.
-
-Before you register your network device you will also need to provide
-suitable handlers for most of the network device callbacks. See the
-network device documentation for more details on this.
-
-Configuring And Activating The Port
-===================================
-
-The Z85230 driver provides helper functions and tables to load the port
-registers on the Z8530 chips. When programming the register settings for
-a channel be aware that the documentation recommends initialisation
-orders. Strange things happen when these are not followed.
-
-:c:func:`z8530_channel_load()` takes an array of pairs of
-initialisation values in an array of u8 type. The first value is the
-Z8530 register number. Add 16 to indicate the alternate register bank on
-the later chips. The array is terminated by a 255.
-
-The driver provides a pair of public tables. The z8530_hdlc_kilostream
-table is for the UK 'Kilostream' service and also happens to cover most
-other end host configurations. The z8530_hdlc_kilostream_85230 table
-is the same configuration using the enhancements of the 85230 chip. The
-configuration loaded is standard NRZ encoded synchronous data with HDLC
-bitstuffing. All of the timing is taken from the other end of the link.
-
-When writing your own tables be aware that the driver internally tracks
-register values. It may need to reload values. You should therefore be
-sure to set registers 1-7, 9-11, 14 and 15 in all configurations. Where
-the register settings depend on DMA selection the driver will update the
-bits itself when you open or close. Loading a new table with the
-interface open is not recommended.
-
-There are three standard configurations supported by the core code. In
-PIO mode the interface is programmed up to use interrupt driven PIO.
-This places high demands on the host processor to avoid latency. The
-driver is written to take account of latency issues but it cannot avoid
-latencies caused by other drivers, notably IDE in PIO mode. Because the
-drivers allocate buffers you must also prevent MTU changes while the
-port is open.
-
-Once the port is open it will call the rx_function of each channel
-whenever a completed packet arrived. This is invoked from interrupt
-context and passes you the channel and a network buffer (struct
-sk_buff) holding the data. The data includes the CRC bytes so most
-users will want to trim the last two bytes before processing the data.
-This function is very timing critical. When you wish to simply discard
-data the support code provides the function
-:c:func:`z8530_null_rx()` to discard the data.
-
-To active PIO mode sending and receiving the ``z8530_sync_open`` is called.
-This expects to be passed the network device and the channel. Typically
-this is called from your network device open callback. On a failure a
-non zero error status is returned.
-The :c:func:`z8530_sync_close()` function shuts down a PIO
-channel. This must be done before the channel is opened again and before
-the driver shuts down and unloads.
-
-The ideal mode of operation is dual channel DMA mode. Here the kernel
-driver will configure the board for DMA in both directions. The driver
-also handles ISA DMA issues such as controller programming and the
-memory range limit for you. This mode is activated by calling the
-:c:func:`z8530_sync_dma_open()` function. On failure a non zero
-error value is returned. Once this mode is activated it can be shut down
-by calling the :c:func:`z8530_sync_dma_close()`. You must call
-the close function matching the open mode you used.
-
-The final supported mode uses a single DMA channel to drive the transmit
-side. As the Z85C30 has a larger FIFO on the receive channel this tends
-to increase the maximum speed a little. This is activated by calling the
-``z8530_sync_txdma_open``. This returns a non zero error code on failure. The
-:c:func:`z8530_sync_txdma_close()` function closes down the Z8530
-interface from this mode.
-
-Network Layer Functions
-=======================
-
-The Z8530 layer provides functions to queue packets for transmission.
-The driver internally buffers the frame currently being transmitted and
-one further frame (in order to keep back to back transmission running).
-Any further buffering is up to the caller.
-
-The function :c:func:`z8530_queue_xmit()` takes a network buffer
-in sk_buff format and queues it for transmission. The caller must
-provide the entire packet with the exception of the bitstuffing and CRC.
-This is normally done by the caller via the generic HDLC interface
-layer. It returns 0 if the buffer has been queued and non zero values
-for queue full. If the function accepts the buffer it becomes property
-of the Z8530 layer and the caller should not free it.
-
-The function :c:func:`z8530_get_stats()` returns a pointer to an
-internally maintained per interface statistics block. This provides most
-of the interface code needed to implement the network layer get_stats
-callback.
-
-Porting The Z8530 Driver
-========================
-
-The Z8530 driver is written to be portable. In DMA mode it makes
-assumptions about the use of ISA DMA. These are probably warranted in
-most cases as the Z85230 in particular was designed to glue to PC type
-machines. The PIO mode makes no real assumptions.
-
-Should you need to retarget the Z8530 driver to another architecture the
-only code that should need changing are the port I/O functions. At the
-moment these assume PC I/O port accesses. This may not be appropriate
-for all platforms. Replacing :c:func:`z8530_read_port()` and
-``z8530_write_port`` is intended to be all that is required to port
-this driver layer.
-
-Known Bugs And Assumptions
-==========================
-
-Interrupt Locking
-    The locking in the driver is done via the global cli/sti lock. This
-    makes for relatively poor SMP performance. Switching this to use a
-    per device spin lock would probably materially improve performance.
-
-Occasional Failures
-    We have reports of occasional failures when run for very long
-    periods of time and the driver starts to receive junk frames. At the
-    moment the cause of this is not clear.
-
-Public Functions Provided
-=========================
-
-.. kernel-doc:: drivers/net/wan/z85230.c
-   :export:
-
-Internal Functions
-==================
-
-.. kernel-doc:: drivers/net/wan/z85230.c
-   :internal: