Diffstat (limited to 'Documentation/networking')
-rw-r--r--Documentation/networking/af_xdp.rst335
-rw-r--r--Documentation/networking/arcnet-hardware.rst2
-rw-r--r--Documentation/networking/ax25.rst4
-rw-r--r--Documentation/networking/bareudp.rst24
-rw-r--r--Documentation/networking/batman-adv.rst12
-rw-r--r--Documentation/networking/bonding.rst131
-rw-r--r--Documentation/networking/bridge.rst334
-rw-r--r--Documentation/networking/caif/caif.rst5
-rw-r--r--Documentation/networking/caif/index.rst1
-rw-r--r--Documentation/networking/caif/spi_porting.rst229
-rw-r--r--Documentation/networking/can.rst143
-rw-r--r--Documentation/networking/can_ucan_protocol.rst2
-rw-r--r--Documentation/networking/cdc_mbim.rst2
-rw-r--r--Documentation/networking/cops.rst80
-rw-r--r--Documentation/networking/dccp.rst3
-rw-r--r--Documentation/networking/decnet.rst243
-rw-r--r--Documentation/networking/device_drivers/atm/cxacru-cf.py (renamed from Documentation/networking/cxacru-cf.py)0
-rw-r--r--Documentation/networking/device_drivers/atm/cxacru.rst (renamed from Documentation/networking/cxacru.rst)0
-rw-r--r--Documentation/networking/device_drivers/atm/fore200e.rst (renamed from Documentation/networking/fore200e.rst)0
-rw-r--r--Documentation/networking/device_drivers/atm/index.rst20
-rw-r--r--Documentation/networking/device_drivers/atm/iphase.rst (renamed from Documentation/networking/iphase.rst)2
-rw-r--r--Documentation/networking/device_drivers/cable/index.rst18
-rw-r--r--Documentation/networking/device_drivers/cable/sb1000.rst (renamed from Documentation/networking/device_drivers/sb1000.rst)0
-rw-r--r--Documentation/networking/device_drivers/can/can327.rst331
-rw-r--r--Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst638
-rw-r--r--Documentation/networking/device_drivers/can/ctu/fsm_txt_buffer_user.svg151
-rw-r--r--Documentation/networking/device_drivers/can/freescale/flexcan.rst54
-rw-r--r--Documentation/networking/device_drivers/can/index.rst22
-rw-r--r--Documentation/networking/device_drivers/cellular/index.rst18
-rw-r--r--Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst196
-rw-r--r--Documentation/networking/device_drivers/dec/de4x5.rst189
-rw-r--r--Documentation/networking/device_drivers/ethernet/3com/3c509.rst (renamed from Documentation/networking/device_drivers/3com/3c509.rst)0
-rw-r--r--Documentation/networking/device_drivers/ethernet/3com/vortex.rst (renamed from Documentation/networking/device_drivers/3com/vortex.rst)8
-rw-r--r--Documentation/networking/device_drivers/ethernet/altera/altera_tse.rst (renamed from Documentation/networking/altera_tse.rst)0
-rw-r--r--Documentation/networking/device_drivers/ethernet/amazon/ena.rst (renamed from Documentation/networking/device_drivers/amazon/ena.rst)224
-rw-r--r--Documentation/networking/device_drivers/ethernet/amd/pds_core.rst139
-rw-r--r--Documentation/networking/device_drivers/ethernet/amd/pds_vdpa.rst85
-rw-r--r--Documentation/networking/device_drivers/ethernet/amd/pds_vfio_pci.rst79
-rw-r--r--Documentation/networking/device_drivers/ethernet/aquantia/atlantic.rst (renamed from Documentation/networking/device_drivers/aquantia/atlantic.rst)6
-rw-r--r--Documentation/networking/device_drivers/ethernet/chelsio/cxgb.rst (renamed from Documentation/networking/device_drivers/chelsio/cxgb.rst)0
-rw-r--r--Documentation/networking/device_drivers/ethernet/cirrus/cs89x0.rst (renamed from Documentation/networking/device_drivers/cirrus/cs89x0.rst)0
-rw-r--r--Documentation/networking/device_drivers/ethernet/davicom/dm9000.rst (renamed from Documentation/networking/device_drivers/davicom/dm9000.rst)2
-rw-r--r--Documentation/networking/device_drivers/ethernet/dec/dmfe.rst (renamed from Documentation/networking/device_drivers/dec/dmfe.rst)0
-rw-r--r--Documentation/networking/device_drivers/ethernet/dlink/dl2k.rst (renamed from Documentation/networking/device_drivers/dlink/dl2k.rst)0
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa.rst (renamed from Documentation/networking/device_drivers/freescale/dpaa.rst)0
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa2/dpio-driver.rst (renamed from Documentation/networking/device_drivers/freescale/dpaa2/dpio-driver.rst)7
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa2/ethernet-driver.rst (renamed from Documentation/networking/device_drivers/freescale/dpaa2/ethernet-driver.rst)3
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa2/index.rst (renamed from Documentation/networking/device_drivers/freescale/dpaa2/index.rst)1
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst (renamed from Documentation/networking/device_drivers/freescale/dpaa2/mac-phy-support.rst)11
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst (renamed from Documentation/networking/device_drivers/freescale/dpaa2/overview.rst)1
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa2/switch-driver.rst217
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/gianfar.rst (renamed from Documentation/networking/device_drivers/freescale/gianfar.rst)0
-rw-r--r--Documentation/networking/device_drivers/ethernet/google/gve.rst (renamed from Documentation/networking/device_drivers/google/gve.rst)62
-rw-r--r--Documentation/networking/device_drivers/ethernet/huawei/hinic.rst (renamed from Documentation/networking/hinic.rst)0
-rw-r--r--Documentation/networking/device_drivers/ethernet/index.rst66
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/e100.rst (renamed from Documentation/networking/device_drivers/intel/e100.rst)11
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/e1000.rst (renamed from Documentation/networking/device_drivers/intel/e1000.rst)9
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/e1000e.rst (renamed from Documentation/networking/device_drivers/intel/e1000e.rst)7
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/fm10k.rst (renamed from Documentation/networking/device_drivers/intel/fm10k.rst)9
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/i40e.rst (renamed from Documentation/networking/device_drivers/intel/i40e.rst)21
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/iavf.rst (renamed from Documentation/networking/device_drivers/intel/iavf.rst)13
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/ice.rst1175
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/idpf.rst160
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/igb.rst (renamed from Documentation/networking/device_drivers/intel/igb.rst)9
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/igbvf.rst (renamed from Documentation/networking/device_drivers/intel/igbvf.rst)9
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst (renamed from Documentation/networking/device_drivers/intel/ixgbe.rst)23
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/ixgbevf.rst (renamed from Documentation/networking/device_drivers/intel/ixgbevf.rst)7
-rw-r--r--Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst41
-rw-r--r--Documentation/networking/device_drivers/ethernet/marvell/octeon_ep_vf.rst24
-rw-r--r--Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst (renamed from Documentation/networking/device_drivers/marvell/octeontx2.rst)185
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst1305
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst25
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst168
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst281
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/tracepoints.rst229
-rw-r--r--Documentation/networking/device_drivers/ethernet/microsoft/netvsc.rst (renamed from Documentation/networking/device_drivers/microsoft/netvsc.rst)14
-rw-r--r--Documentation/networking/device_drivers/ethernet/neterion/s2io.rst (renamed from Documentation/networking/device_drivers/neterion/s2io.rst)4
-rw-r--r--Documentation/networking/device_drivers/ethernet/netronome/nfp.rst (renamed from Documentation/networking/device_drivers/netronome/nfp.rst)165
-rw-r--r--Documentation/networking/device_drivers/ethernet/pensando/ionic.rst (renamed from Documentation/networking/device_drivers/pensando/ionic.rst)24
-rw-r--r--Documentation/networking/device_drivers/ethernet/smsc/smc9.rst (renamed from Documentation/networking/device_drivers/smsc/smc9.rst)0
-rw-r--r--Documentation/networking/device_drivers/ethernet/stmicro/stmmac.rst (renamed from Documentation/networking/device_drivers/stmicro/stmmac.rst)0
-rw-r--r--Documentation/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst143
-rw-r--r--Documentation/networking/device_drivers/ethernet/ti/cpsw.rst (renamed from Documentation/networking/device_drivers/ti/cpsw.rst)0
-rw-r--r--Documentation/networking/device_drivers/ethernet/ti/cpsw_switchdev.rst (renamed from Documentation/networking/device_drivers/ti/cpsw_switchdev.rst)2
-rw-r--r--Documentation/networking/device_drivers/ethernet/ti/tlan.rst (renamed from Documentation/networking/device_drivers/ti/tlan.rst)0
-rw-r--r--Documentation/networking/device_drivers/ethernet/toshiba/spider_net.rst (renamed from Documentation/networking/device_drivers/toshiba/spider_net.rst)0
-rw-r--r--Documentation/networking/device_drivers/ethernet/wangxun/ngbe.rst14
-rw-r--r--Documentation/networking/device_drivers/ethernet/wangxun/txgbe.rst20
-rw-r--r--Documentation/networking/device_drivers/fddi/defza.rst (renamed from Documentation/networking/defza.rst)2
-rw-r--r--Documentation/networking/device_drivers/fddi/index.rst19
-rw-r--r--Documentation/networking/device_drivers/fddi/skfp.rst (renamed from Documentation/networking/skfp.rst)0
-rw-r--r--Documentation/networking/device_drivers/hamradio/baycom.rst (renamed from Documentation/networking/baycom.rst)0
-rw-r--r--Documentation/networking/device_drivers/hamradio/index.rst19
-rw-r--r--Documentation/networking/device_drivers/hamradio/z8530drv.rst (renamed from Documentation/networking/z8530drv.rst)0
-rw-r--r--Documentation/networking/device_drivers/index.rst56
-rw-r--r--Documentation/networking/device_drivers/intel/ice.rst46
-rw-r--r--Documentation/networking/device_drivers/intel/ixgb.rst468
-rw-r--r--Documentation/networking/device_drivers/mellanox/mlx5.rst321
-rw-r--r--Documentation/networking/device_drivers/neterion/vxge.rst115
-rw-r--r--Documentation/networking/device_drivers/qlogic/LICENSE.qla3xxx46
-rw-r--r--Documentation/networking/device_drivers/qlogic/LICENSE.qlcnic288
-rw-r--r--Documentation/networking/device_drivers/qlogic/LICENSE.qlge288
-rw-r--r--Documentation/networking/device_drivers/qualcomm/rmnet.rst95
-rw-r--r--Documentation/networking/device_drivers/wifi/index.rst19
-rw-r--r--Documentation/networking/device_drivers/wifi/intel/ipw2100.rst (renamed from Documentation/networking/device_drivers/intel/ipw2100.rst)2
-rw-r--r--Documentation/networking/device_drivers/wifi/intel/ipw2200.rst (renamed from Documentation/networking/device_drivers/intel/ipw2200.rst)0
-rw-r--r--Documentation/networking/device_drivers/wwan/index.rst19
-rw-r--r--Documentation/networking/device_drivers/wwan/iosm.rst96
-rw-r--r--Documentation/networking/device_drivers/wwan/t7xx.rst166
-rw-r--r--Documentation/networking/devlink/am65-nuss-cpsw-switch.rst26
-rw-r--r--Documentation/networking/devlink/bnxt.rst2
-rw-r--r--Documentation/networking/devlink/devlink-dpipe.rst2
-rw-r--r--Documentation/networking/devlink/devlink-flash.rst28
-rw-r--r--Documentation/networking/devlink/devlink-health.rst40
-rw-r--r--Documentation/networking/devlink/devlink-info.rst17
-rw-r--r--Documentation/networking/devlink/devlink-linecard.rst122
-rw-r--r--Documentation/networking/devlink/devlink-params.rst33
-rw-r--r--Documentation/networking/devlink/devlink-port.rst443
-rw-r--r--Documentation/networking/devlink/devlink-region.rst19
-rw-r--r--Documentation/networking/devlink/devlink-reload.rst90
-rw-r--r--Documentation/networking/devlink/devlink-resource.rst14
-rw-r--r--Documentation/networking/devlink/devlink-selftests.rst38
-rw-r--r--Documentation/networking/devlink/devlink-trap.rst105
-rw-r--r--Documentation/networking/devlink/etas_es58x.rst36
-rw-r--r--Documentation/networking/devlink/hns3.rst25
-rw-r--r--Documentation/networking/devlink/i40e.rst59
-rw-r--r--Documentation/networking/devlink/ice.rst317
-rw-r--r--Documentation/networking/devlink/index.rst55
-rw-r--r--Documentation/networking/devlink/iosm.rst162
-rw-r--r--Documentation/networking/devlink/mlx5.rst228
-rw-r--r--Documentation/networking/devlink/mlxsw.rst24
-rw-r--r--Documentation/networking/devlink/netdevsim.rst31
-rw-r--r--Documentation/networking/devlink/octeontx2.rst42
-rw-r--r--Documentation/networking/devlink/prestera.rst141
-rw-r--r--Documentation/networking/devlink/sfc.rst57
-rw-r--r--Documentation/networking/devlink/sja1105.rst49
-rw-r--r--Documentation/networking/driver.rst156
-rw-r--r--Documentation/networking/dsa/b53.rst14
-rw-r--r--Documentation/networking/dsa/bcm_sf2.rst2
-rw-r--r--Documentation/networking/dsa/configuration.rst510
-rw-r--r--Documentation/networking/dsa/dsa.rst894
-rw-r--r--Documentation/networking/dsa/lan9303.rst2
-rw-r--r--Documentation/networking/dsa/sja1105.rst314
-rw-r--r--Documentation/networking/ethtool-netlink.rst1041
-rw-r--r--Documentation/networking/filter.rst1032
-rw-r--r--Documentation/networking/framerelay.rst44
-rw-r--r--Documentation/networking/generic_netlink.rst2
-rw-r--r--Documentation/networking/gtp.rst2
-rw-r--r--Documentation/networking/ieee802154.rst20
-rw-r--r--Documentation/networking/index.rst44
-rw-r--r--Documentation/networking/ioam6-sysctl.rst26
-rw-r--r--Documentation/networking/ip-sysctl.rst708
-rw-r--r--Documentation/networking/ipddp.rst78
-rw-r--r--Documentation/networking/ipvlan.rst4
-rw-r--r--Documentation/networking/ipvs-sysctl.rst34
-rw-r--r--Documentation/networking/j1939.rst166
-rw-r--r--Documentation/networking/kapi.rst30
-rw-r--r--Documentation/networking/l2tp.rst1044
-rw-r--r--Documentation/networking/ltpc.rst144
-rw-r--r--Documentation/networking/mctp.rst320
-rw-r--r--Documentation/networking/mptcp-sysctl.rst95
-rw-r--r--Documentation/networking/msg_zerocopy.rst21
-rw-r--r--Documentation/networking/multi-pf-netdev.rst174
-rw-r--r--Documentation/networking/napi.rst255
-rw-r--r--Documentation/networking/net_cachelines/index.rst16
-rw-r--r--Documentation/networking/net_cachelines/inet_connection_sock.rst50
-rw-r--r--Documentation/networking/net_cachelines/inet_sock.rst44
-rw-r--r--Documentation/networking/net_cachelines/net_device.rst178
-rw-r--r--Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst158
-rw-r--r--Documentation/networking/net_cachelines/snmp.rst135
-rw-r--r--Documentation/networking/net_cachelines/tcp_sock.rst157
-rw-r--r--Documentation/networking/net_failover.rst111
-rw-r--r--Documentation/networking/netconsole.rst101
-rw-r--r--Documentation/networking/netdev-FAQ.rst272
-rw-r--r--Documentation/networking/netdev-features.rst21
-rw-r--r--Documentation/networking/netdevices.rst204
-rw-r--r--Documentation/networking/netlink_spec/.gitignore1
-rw-r--r--Documentation/networking/netlink_spec/readme.txt4
-rw-r--r--Documentation/networking/nexthop-group-resilient.rst293
-rw-r--r--Documentation/networking/nf_conntrack-sysctl.rst83
-rw-r--r--Documentation/networking/nf_flowtable.rst174
-rw-r--r--Documentation/networking/operstates.rst6
-rw-r--r--Documentation/networking/packet_mmap.rst29
-rw-r--r--Documentation/networking/page_pool.rst138
-rw-r--r--Documentation/networking/phonet.rst2
-rw-r--r--Documentation/networking/phy.rst55
-rw-r--r--Documentation/networking/pktgen.rst30
-rw-r--r--Documentation/networking/ppp_generic.rst16
-rw-r--r--Documentation/networking/ray_cs.rst165
-rw-r--r--Documentation/networking/rds.rst2
-rw-r--r--Documentation/networking/regulatory.rst4
-rw-r--r--Documentation/networking/representors.rst261
-rw-r--r--Documentation/networking/rxrpc.rst34
-rw-r--r--Documentation/networking/scaling.rst67
-rw-r--r--Documentation/networking/seg6-sysctl.rst13
-rw-r--r--Documentation/networking/sfp-phylink.rst163
-rw-r--r--Documentation/networking/skbuff.rst37
-rw-r--r--Documentation/networking/smc-sysctl.rst73
-rw-r--r--Documentation/networking/snmp_counter.rst46
-rw-r--r--Documentation/networking/statistics.rst236
-rw-r--r--Documentation/networking/switchdev.rst197
-rw-r--r--Documentation/networking/sysfs-tagging.rst48
-rw-r--r--Documentation/networking/tc-queue-filters.rst37
-rw-r--r--Documentation/networking/tcp_ao.rst444
-rw-r--r--Documentation/networking/timestamping.rst226
-rw-r--r--Documentation/networking/tipc.rst215
-rw-r--r--Documentation/networking/tls-handshake.rst222
-rw-r--r--Documentation/networking/tls-offload.rst29
-rw-r--r--Documentation/networking/tls.rst47
-rw-r--r--Documentation/networking/tuntap.rst2
-rw-r--r--Documentation/networking/vrf.rst13
-rw-r--r--Documentation/networking/vxlan.rst28
-rw-r--r--Documentation/networking/x25-iface.rst68
-rw-r--r--Documentation/networking/x25.rst12
-rw-r--r--Documentation/networking/xdp-rx-metadata.rst128
-rw-r--r--Documentation/networking/xfrm_device.rst65
-rw-r--r--Documentation/networking/xsk-tx-metadata.rst81
-rw-r--r--Documentation/networking/z8530book.rst256
218 files changed, 18743 insertions, 6397 deletions
diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
index 5bc55a4e3bce..72da7057e4cf 100644
--- a/Documentation/networking/af_xdp.rst
+++ b/Documentation/networking/af_xdp.rst
@@ -243,8 +243,8 @@ Configuration Flags and Socket Options
These are the various configuration flags that can be used to control
and monitor the behavior of AF_XDP sockets.
-XDP_COPY and XDP_ZERO_COPY bind flags
--------------------------------------
+XDP_COPY and XDP_ZEROCOPY bind flags
+------------------------------------
When you bind to a socket, the kernel will first try to use zero-copy
mode. If zero-copy is not supported, it will fall back on using copy
@@ -252,20 +252,27 @@ mode, i.e. copying all packets out to user space. But if you would
like to force a certain mode, you can use the following flags. If you
pass the XDP_COPY flag to the bind call, the kernel will force the
socket into copy mode. If it cannot use copy mode, the bind call will
-fail with an error. Conversely, the XDP_ZERO_COPY flag will force the
+fail with an error. Conversely, the XDP_ZEROCOPY flag will force the
socket into zero-copy mode or fail.
XDP_SHARED_UMEM bind flag
-------------------------
-This flag enables you to bind multiple sockets to the same UMEM, but
-only if they share the same queue id. In this mode, each socket has
-their own RX and TX rings, but the UMEM (tied to the fist socket
-created) only has a single FILL ring and a single COMPLETION
-ring. To use this mode, create the first socket and bind it in the normal
-way. Create a second socket and create an RX and a TX ring, or at
-least one of them, but no FILL or COMPLETION rings as the ones from
-the first socket will be used. In the bind call, set he
+This flag enables you to bind multiple sockets to the same UMEM. It
+works on the same queue id, between queue ids and between
+netdevs/devices. In this mode, each socket has its own RX and TX
+rings as usual, but you are going to have one or more FILL and
+COMPLETION ring pairs. You have to create one of these pairs per
+unique netdev and queue id tuple that you bind to.
+
+Start with the case where we would like to share a UMEM between
+sockets bound to the same netdev and queue id. The UMEM (tied to the
+first socket created) will only have a single FILL ring and a single
+COMPLETION ring as there is only one unique netdev,queue_id tuple that
+we have bound to. To use this mode, create the first socket and bind
+it in the normal way. Create a second socket and create an RX and a TX
+ring, or at least one of them, but no FILL or COMPLETION rings as the
+ones from the first socket will be used. In the bind call, set the
XDP_SHARED_UMEM option and provide the initial socket's fd in the
sxdp_shared_umem_fd field. You can attach an arbitrary number of extra
sockets this way.
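As a rough sketch of the bind step described above, the helper below fills in
the bind address for a second socket that reuses the UMEM registered on the
first one. The helper name ``fill_shared_umem_addr`` is hypothetical;
``struct sockaddr_xdp`` and ``XDP_SHARED_UMEM`` are taken from
``linux/if_xdp.h``, and the ``AF_XDP`` fallback value is an assumption for
older libc headers.

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef AF_XDP
#define AF_XDP 44 /* assumed fallback for libc headers that lack it */
#endif

/* Hypothetical helper: prepare the bind address for a socket that
 * shares the UMEM already registered on first_fd.
 */
static void fill_shared_umem_addr(struct sockaddr_xdp *sxdp,
                                  unsigned int ifindex,
                                  unsigned int queue_id, int first_fd)
{
    memset(sxdp, 0, sizeof(*sxdp));
    sxdp->sxdp_family = AF_XDP;
    sxdp->sxdp_ifindex = ifindex;
    sxdp->sxdp_queue_id = queue_id;
    sxdp->sxdp_flags = XDP_SHARED_UMEM;
    sxdp->sxdp_shared_umem_fd = (__u32)first_fd;
}
```

The filled structure would then be passed to ``bind()`` on the second
socket, cast to ``struct sockaddr *`` with ``sizeof(struct sockaddr_xdp)``
as the length. The bind call itself is left out because it needs a real
interface and suitable privileges.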
@@ -283,19 +290,19 @@ round-robin example of distributing packets is shown below:
#define MAX_SOCKS 16
struct {
- __uint(type, BPF_MAP_TYPE_XSKMAP);
- __uint(max_entries, MAX_SOCKS);
- __uint(key_size, sizeof(int));
- __uint(value_size, sizeof(int));
+ __uint(type, BPF_MAP_TYPE_XSKMAP);
+ __uint(max_entries, MAX_SOCKS);
+ __uint(key_size, sizeof(int));
+ __uint(value_size, sizeof(int));
} xsks_map SEC(".maps");
static unsigned int rr;
SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
{
- rr = (rr + 1) & (MAX_SOCKS - 1);
+ rr = (rr + 1) & (MAX_SOCKS - 1);
- return bpf_redirect_map(&xsks_map, rr, XDP_DROP);
+ return bpf_redirect_map(&xsks_map, rr, XDP_DROP);
}
Note, that since there is only a single set of FILL and COMPLETION
@@ -305,11 +312,42 @@ concurrently. There are no synchronization primitives in the
libbpf code that protects multiple users at this point in time.
Libbpf uses this mode if you create more than one socket tied to the
-same umem. However, note that you need to supply the
+same UMEM. However, note that you need to supply the
XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD libbpf_flag with the
xsk_socket__create calls and load your own XDP program as there is no
built in one in libbpf that will route the traffic for you.
+The second case is when you share a UMEM between sockets that are
+bound to different queue ids and/or netdevs. In this case you have to
+create one FILL ring and one COMPLETION ring for each unique
+netdev,queue_id pair. Let us say you want to create two sockets bound
+to two different queue ids on the same netdev. Create the first socket
+and bind it in the normal way. Create a second socket and create an RX
+and a TX ring, or at least one of them, and then one FILL and
+COMPLETION ring for this socket. Then in the bind call, set the
+XDP_SHARED_UMEM option and provide the initial socket's fd in the
+sxdp_shared_umem_fd field as you registered the UMEM on that
+socket. These two sockets will now share one and the same UMEM.
+
+In this case, it is possible to use the NIC's packet steering
+capabilities to steer the packets to the right queue. This is not
+possible in the previous example as there is only one queue shared
+among sockets, so the NIC cannot do this steering as it can only steer
+between queues.
+
+In libxdp (or libbpf prior to version 1.0), you need to use the
+xsk_socket__create_shared() API as it takes a reference to a FILL ring
+and a COMPLETION ring that will be created for you and bound to the
+shared UMEM. You can use this function for all the sockets you create,
+or you can use it for the second and following ones and use
+xsk_socket__create() for the first one. Both methods yield the same
+result.
+
+Note that a UMEM can be shared between sockets on the same queue id
+and device, as well as between queues on the same device and between
+devices at the same time. It is also possible to redirect to any
+socket as long as it is bound to the same UMEM with XDP_SHARED_UMEM.
+
XDP_USE_NEED_WAKEUP bind flag
-----------------------------
@@ -342,7 +380,7 @@ would look like this for the TX path:
.. code-block:: c
if (xsk_ring_prod__needs_wakeup(&my_tx_ring))
- sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0);
+ sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0);
I.e., only use the syscall if the flag is set.
@@ -364,7 +402,7 @@ resources by only setting up one of them. Both the FILL ring and the
COMPLETION ring are mandatory as you need to have a UMEM tied to your
socket. But if the XDP_SHARED_UMEM flag is used, any socket after the
first one does not have a UMEM and should in that case not have any
-FILL or COMPLETION rings created as the ones from the shared umem will
+FILL or COMPLETION rings created as the ones from the shared UMEM will
be used. Note, that the rings are single-producer single-consumer, so
do not try to access them from multiple processes at the same
time. See the XDP_SHARED_UMEM section.
@@ -382,7 +420,7 @@ XDP_UMEM_REG setsockopt
-----------------------
This setsockopt registers a UMEM to a socket. This is the area that
-contain all the buffers that packet can recide in. The call takes a
+contains all the buffers that packets can reside in. The call takes a
pointer to the beginning of this area and the size of it. Moreover, it
also has a parameter called chunk_size that is the size that the UMEM is
divided into. It can only be 2K or 4K at the moment. If you have an
@@ -396,6 +434,15 @@ start N bytes into the buffer leaving the first N bytes for the
application to use. The final option is the flags field, but it will
be dealt with in separate sections for each UMEM flag.
+SO_BINDTODEVICE setsockopt
+--------------------------
+
+This is a generic SOL_SOCKET option that can be used to tie an AF_XDP
+socket to a particular network interface. It is useful when a socket
+is created by a privileged process and passed to a non-privileged one.
+Once the option is set, the kernel will refuse attempts to bind that socket
+to a different interface. Updating the value requires CAP_NET_RAW.
+
XDP_STATISTICS getsockopt
-------------------------
@@ -405,9 +452,9 @@ purposes. The supported statistics are shown below:
.. code-block:: c
struct xdp_statistics {
- __u64 rx_dropped; /* Dropped for reasons other than invalid desc */
- __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
- __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
+ __u64 rx_dropped; /* Dropped for reasons other than invalid desc */
+ __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
+ __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
};
XDP_OPTIONS getsockopt
@@ -416,8 +463,92 @@ XDP_OPTIONS getsockopt
Gets options from an XDP socket. The only one supported so far is
XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not.
+Multi-Buffer Support
+====================
+
+With multi-buffer support, programs using AF_XDP sockets can receive
+and transmit packets consisting of multiple buffers both in copy and
+zero-copy mode. For example, a packet can consist of two
+frames/buffers, one with the header and the other one with the data,
+or a 9K Ethernet jumbo frame can be constructed by chaining together
+three 4K frames.
+
+Some definitions:
+
+* A packet consists of one or more frames
+
+* A descriptor in one of the AF_XDP rings always refers to a single
+ frame. In the case the packet consists of a single frame, the
+ descriptor refers to the whole packet.
+
+To enable multi-buffer support for an AF_XDP socket, use the new bind
+flag XDP_USE_SG. If this is not provided, all multi-buffer packets
+will be dropped just as before. Note that the XDP program loaded also
+needs to be in multi-buffer mode. This can be accomplished by using
+"xdp.frags" as the section name of the XDP program used.
+
+To represent a packet consisting of multiple frames, a new flag called
+XDP_PKT_CONTD is introduced in the options field of the Rx and Tx
+descriptors. If it is true (1) the packet continues with the next
+descriptor and if it is false (0) it means this is the last descriptor
+of the packet. Why the reverse logic of the end-of-packet (eop) flag found
+in many NICs? Just to preserve compatibility with non-multi-buffer
+applications that have this bit set to false for all packets on Rx,
+and the apps set the options field to zero for Tx, as anything else
+will be treated as an invalid descriptor.
+
+These are the semantics for producing packets onto the AF_XDP Tx ring
+consisting of multiple frames:
+
+* When an invalid descriptor is found, all the other
+ descriptors/frames of this packet are marked as invalid and not
+ completed. The next descriptor is treated as the start of a new
+ packet, even if this was not the intent (because we cannot guess
+ the intent). As before, if your program is producing invalid
+ descriptors you have a bug that must be fixed.
+
+* Zero length descriptors are treated as invalid descriptors.
+
+* For copy mode, the maximum supported number of frames in a packet is
+ equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all
+ descriptors accumulated so far are dropped and treated as
+ invalid. To produce an application that will work on any system
+ regardless of this config setting, limit the number of frags to 18,
+ as the minimum value of the config is 17.
+
+* For zero-copy mode, the limit is whatever the NIC HW supports,
+ usually at least five frames on the NICs we have checked. We
+ consciously chose not to enforce a rigid limit (such as
+ CONFIG_MAX_SKB_FRAGS + 1) for zero-copy mode, as that would have
+ required copy actions under the hood to fit within the limit the
+ NIC supports, defeating the purpose of zero-copy mode. How to
+ probe for this limit is explained in the "Probing for Multi-Buffer
+ Support" section.
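The copy-mode limit from the list above can be checked before a packet is
queued. The following is only a sketch: the function names and the
``XSK_PORTABLE_MAX_FRAGS`` constant are hypothetical, and the value 18 is
the portable bound suggested in the text, not a kernel constant.

```c
#include <stdbool.h>
#include <stdint.h>

/* Portable copy-mode bound suggested above (hypothetical name). */
#define XSK_PORTABLE_MAX_FRAGS 18u

/* Number of frames needed to carry pkt_len bytes. Returns 0 when either
 * argument is 0, since zero-length descriptors are invalid.
 */
static uint32_t xsk_frames_needed(uint32_t pkt_len, uint32_t frame_size)
{
    if (pkt_len == 0 || frame_size == 0)
        return 0;
    return (pkt_len - 1) / frame_size + 1; /* ceiling division */
}

/* True if the packet fits within the portable copy-mode frame limit. */
static bool xsk_fits_copy_mode(uint32_t pkt_len, uint32_t frame_size)
{
    uint32_t n = xsk_frames_needed(pkt_len, frame_size);

    return n > 0 && n <= XSK_PORTABLE_MAX_FRAGS;
}
```

With 4K frames, a 9000-byte jumbo frame needs three frames, which matches
the example given at the start of this section.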
+
+On the Rx path in copy-mode, the xsk core copies the XDP data into
+multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as
+detailed before. Zero-copy mode works the same, though the data is not
+copied. When the application gets a descriptor with the XDP_PKT_CONTD
+flag set to one, it means that the packet consists of multiple buffers
+and it continues with the next buffer in the following
+descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it
+means that this is the last buffer of the packet. AF_XDP guarantees
+that only a complete packet (all frames in the packet) is sent to the
+application. If there is not enough space in the AF_XDP Rx ring, all
+frames of the packet will be dropped.
+
+If an application reads a batch of descriptors, using for example the libxdp
+interfaces, it is not guaranteed that the batch will end with a full
+packet. It might end in the middle of a packet and the rest of the
+buffers of that packet will arrive at the beginning of the next batch,
+since the libxdp interface does not read the whole ring (unless you
+have an enormous batch size or a very small ring size).
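Because a batch may end in the middle of a packet, the reassembly state has
to survive between batches. The sketch below illustrates this without
libxdp: ``struct desc`` and ``PKT_CONTD`` are local stand-ins assumed to
mirror ``struct xdp_desc`` and ``XDP_PKT_CONTD`` from ``linux/if_xdp.h``,
and ``struct rx_state`` and ``rx_batch`` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define PKT_CONTD (1u << 0) /* stand-in for XDP_PKT_CONTD */

/* Local stand-in for struct xdp_desc so the sketch builds on its own. */
struct desc {
    uint64_t addr;
    uint32_t len;
    uint32_t options;
};

/* Reassembly state that must be kept across batches. */
struct rx_state {
    bool in_packet;     /* true while a packet is only partly received */
    uint32_t pkt_len;   /* bytes accumulated for the current packet */
    uint32_t last_len;  /* length of the most recently completed packet */
    uint32_t completed; /* number of complete packets seen so far */
};

/* Consume one batch of n descriptors, updating the state in st. */
static void rx_batch(struct rx_state *st, const struct desc *d, int n)
{
    for (int i = 0; i < n; i++) {
        if (!st->in_packet)
            st->pkt_len = 0; /* first frame of a new packet */

        st->pkt_len += d[i].len;
        st->in_packet = (d[i].options & PKT_CONTD) != 0;

        if (!st->in_packet) { /* last frame of the packet */
            st->last_len = st->pkt_len;
            st->completed++;
        }
    }
}
```

If the first batch ends on a descriptor with the continuation flag set,
``in_packet`` stays true and the next call to ``rx_batch`` appends to the
same packet instead of starting a new one.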
+
+An example program each for Rx and Tx multi-buffer support can be found
+later in this document.
+
Usage
-=====
+-----
In order to use AF_XDP sockets two parts are needed. The
user-space application and the XDP program. For a complete setup and
@@ -446,15 +577,15 @@ like this:
.. code-block:: c
// struct xdp_rxtx_ring {
- // __u32 *producer;
- // __u32 *consumer;
- // struct xdp_desc *desc;
+ // __u32 *producer;
+ // __u32 *consumer;
+ // struct xdp_desc *desc;
// };
// struct xdp_umem_ring {
- // __u32 *producer;
- // __u32 *consumer;
- // __u64 *desc;
+ // __u32 *producer;
+ // __u32 *consumer;
+ // __u64 *desc;
// };
// typedef struct xdp_rxtx_ring RING;
@@ -495,6 +626,131 @@ like this:
But please use the libbpf functions as they are optimized and ready to
use. They will make your life easier.
+Usage Multi-Buffer Rx
+---------------------
+
+Here is a simple Rx path pseudo-code example (using libxdp interfaces
+for simplicity). Error paths have been excluded to keep it short:
+
+.. code-block:: c
+
+    void rx_packets(struct xsk_socket_info *xsk)
+    {
+        static bool new_packet = true;
+        u32 idx_rx = 0, idx_fq = 0;
+        static char *pkt;
+
+        int rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx);
+
+        xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
+
+        for (int i = 0; i < rcvd; i++) {
+            struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++);
+            char *frag = xsk_umem__get_data(xsk->umem->buffer, desc->addr);
+            bool eop = !(desc->options & XDP_PKT_CONTD);
+
+            if (new_packet)
+                pkt = frag;
+            else
+                add_frag_to_pkt(pkt, frag);
+
+            if (eop)
+                process_pkt(pkt);
+
+            new_packet = eop;
+
+            *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr;
+        }
+
+        xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
+        xsk_ring_cons__release(&xsk->rx, rcvd);
+    }
+
+Usage Multi-Buffer Tx
+---------------------
+
+Here is an example Tx path pseudo-code (using libxdp interfaces for
+simplicity) ignoring that the umem is finite in size, and that we
+eventually will run out of packets to send. Also assumes pkts.addr
+points to a valid location in the umem.
+
+.. code-block:: c
+
+    void tx_packets(struct xsk_socket_info *xsk, struct pkt *pkts,
+                    int batch_size)
+    {
+        u32 idx, i, pkt_nb = 0;
+
+        xsk_ring_prod__reserve(&xsk->tx, batch_size, &idx);
+
+        for (i = 0; i < batch_size;) {
+            u64 addr = pkts[pkt_nb].addr;
+            u32 len = pkts[pkt_nb].size;
+
+            do {
+                struct xdp_desc *tx_desc;
+
+                tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx + i++);
+                tx_desc->addr = addr;
+
+                if (len > xsk_frame_size) {
+                    tx_desc->len = xsk_frame_size;
+                    tx_desc->options = XDP_PKT_CONTD;
+                } else {
+                    tx_desc->len = len;
+                    tx_desc->options = 0;
+                    pkt_nb++;
+                }
+                len -= tx_desc->len;
+                addr += xsk_frame_size;
+
+                if (i == batch_size) {
+                    /* Remember len, addr, pkt_nb for next iteration.
+                     * Skipped for simplicity.
+                     */
+                    break;
+                }
+            } while (len);
+        }
+
+        xsk_ring_prod__submit(&xsk->tx, i);
+    }
+
+Probing for Multi-Buffer Support
+--------------------------------
+
+To discover if a driver supports multi-buffer AF_XDP in SKB or DRV
+mode, use the XDP_FEATURES feature of netlink in linux/netdev.h to
+query for NETDEV_XDP_ACT_RX_SG support. This is the same flag as for
+querying for XDP multi-buffer support. If XDP supports multi-buffer in
+a driver, then AF_XDP will also support that in SKB and DRV mode.
+
+To discover if a driver supports multi-buffer AF_XDP in zero-copy
+mode, use XDP_FEATURES and first check the NETDEV_XDP_ACT_XSK_ZEROCOPY
+flag. If it is set, it means that at least zero-copy is supported and
+you should go and check the netlink attribute
+NETDEV_A_DEV_XDP_ZC_MAX_SEGS in linux/netdev.h. An unsigned integer
+value will be returned stating the max number of frags that are
+supported by this device in zero-copy mode. These are the possible
+return values:
+
+1: Multi-buffer for zero-copy is not supported by this device, as max
+ one fragment supported means that multi-buffer is not possible.
+
+>=2: Multi-buffer is supported in zero-copy mode for this device. The
+ returned number signifies the max number of frags supported.
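
The decision logic described above can be sketched in plain C as follows. This is only an illustration: the ``EX_`` constants are stand-ins for ``NETDEV_XDP_ACT_XSK_ZEROCOPY`` and ``NETDEV_XDP_ACT_RX_SG`` (their numeric values are assumptions made so the sketch builds without kernel headers), and the helper names are hypothetical. Real code should take the flags from linux/netdev.h and obtain the feature mask and the max-segments value through netlink.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the flags in linux/netdev.h. The values below are
 * assumptions of this sketch; use the real header in actual code.
 */
#define EX_XDP_ACT_XSK_ZEROCOPY (1ULL << 3)
#define EX_XDP_ACT_RX_SG        (1ULL << 5)

/* SKB and DRV mode: multi-buffer AF_XDP is available whenever the
 * driver reports XDP multi-buffer (RX_SG) support.
 */
static bool ex_copy_mode_multibuf(uint64_t xdp_features)
{
	return (xdp_features & EX_XDP_ACT_RX_SG) != 0;
}

/* Zero-copy mode: zero-copy itself must be supported first, and then
 * the reported max number of segments must be at least 2. A value of
 * 1 means a packet can only occupy a single buffer.
 */
static bool ex_zc_multibuf(uint64_t xdp_features, uint32_t zc_max_segs)
{
	if (!(xdp_features & EX_XDP_ACT_XSK_ZEROCOPY))
		return false;

	return zc_max_segs >= 2;
}
```

An application would typically call one of these helpers once per interface before deciding whether to request multi-buffer operation on the socket.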
+
+For an example on how these are used through libbpf, please take a
+look at tools/testing/selftests/bpf/xskxceiver.c.
+
+Multi-Buffer Support for Zero-Copy Drivers
+------------------------------------------
+
+Zero-copy drivers usually use the batched APIs for Rx and Tx
+processing. Note that the Tx batch API guarantees that it will provide
+a batch of Tx descriptors that ends with a full packet. This is
+to facilitate extending a zero-copy driver with multi-buffer support.
+
Sample application
==================
@@ -555,7 +811,7 @@ A: When a netdev of a physical NIC is initialized, Linux usually
A number of other ways are possible all up to the capabilities of
the NIC you have.
-Q: Can I use the XSKMAP to implement a switch betwen different umems
+Q: Can I use the XSKMAP to implement a switch between different umems
in copy mode?
A: The short answer is no, that is not supported at the moment. The
@@ -567,6 +823,21 @@ A: The short answer is no, that is not supported at the moment. The
switch, or other distribution mechanism, in your NIC to direct
traffic to the correct queue id and socket.
+ Note that if you are using the XDP_SHARED_UMEM option, it is
+ possible to switch traffic between any socket bound to the same
+ umem.
+
+Q: My packets are sometimes corrupted. What is wrong?
+
+A: Care has to be taken not to feed the same buffer in the UMEM into
+ more than one ring at the same time. If you for example feed the
+ same buffer into the FILL ring and the TX ring at the same time, the
+ NIC might receive data into the buffer at the same time it is
+ sending it. This will cause some packets to become corrupted. Same
+ thing goes for feeding the same buffer into the FILL rings
+ belonging to different queue ids or netdevs bound with the
+ XDP_SHARED_UMEM flag.
+
Credits
=======
diff --git a/Documentation/networking/arcnet-hardware.rst b/Documentation/networking/arcnet-hardware.rst
index ac249ac8fcf2..982215723582 100644
--- a/Documentation/networking/arcnet-hardware.rst
+++ b/Documentation/networking/arcnet-hardware.rst
@@ -1902,7 +1902,7 @@ of 32 possible I/O Base addresses using the following tables::
6 | 10
The I/O address is sum of all switches set to "1". Remember that
-the I/O address space bellow 0x200 is RESERVED for mainboard, so
+the I/O address space below 0x200 is RESERVED for mainboard, so
switch 1 should be ALWAYS SET TO OFF.
diff --git a/Documentation/networking/ax25.rst b/Documentation/networking/ax25.rst
index f060cfb1445a..605e72c6c877 100644
--- a/Documentation/networking/ax25.rst
+++ b/Documentation/networking/ax25.rst
@@ -7,9 +7,9 @@ AX.25
To use the amateur radio protocols within Linux you will need to get a
suitable copy of the AX.25 Utilities. More detailed information about
AX.25, NET/ROM and ROSE, associated programs and utilities can be
-found on http://www.linux-ax25.org.
+found on https://linux-ax25.in-berlin.de.
-There is an active mailing list for discussing Linux amateur radio matters
+There is a mailing list for discussing Linux amateur radio matters
called linux-hams@vger.kernel.org. To subscribe to it, send a message to
majordomo@vger.kernel.org with the words "subscribe linux-hams" in the body
of the message, the subject field is ignored. You don't need to be
diff --git a/Documentation/networking/bareudp.rst b/Documentation/networking/bareudp.rst
index 465a8b251bfe..b9d04ee6dac1 100644
--- a/Documentation/networking/bareudp.rst
+++ b/Documentation/networking/bareudp.rst
@@ -8,9 +8,8 @@ There are various L3 encapsulation standards using UDP being discussed to
leverage the UDP based load balancing capability of different networks.
MPLSoUDP (__ https://tools.ietf.org/html/rfc7510) is one among them.
-The Bareudp tunnel module provides a generic L3 encapsulation tunnelling
-support for tunnelling different L3 protocols like MPLS, IP, NSH etc. inside
-a UDP tunnel.
+The Bareudp tunnel module provides a generic L3 encapsulation support for
+tunnelling different L3 protocols like MPLS, IP, NSH etc. inside a UDP tunnel.
Special Handling
----------------
@@ -26,7 +25,7 @@ Usage
1) Device creation & deletion
- a) ip link add dev bareudp0 type bareudp dstport 6635 ethertype 0x8847.
+ a) ip link add dev bareudp0 type bareudp dstport 6635 ethertype mpls_uc
This creates a bareudp tunnel device which tunnels L3 traffic with ethertype
0x8847 (MPLS traffic). The destination port of the UDP header will be set to
@@ -34,14 +33,21 @@ Usage
b) ip link delete bareudp0
-2) Device creation with multiple proto mode enabled
+2) Device creation with multiproto mode enabled
-There are two ways to create a bareudp device for MPLS & IP with multiproto mode
-enabled.
+The multiproto mode allows bareudp tunnels to handle several protocols of the
+same family. It is currently only available for IP and MPLS. This mode has to
+be enabled explicitly with the "multiproto" flag.
- a) ip link add dev bareudp0 type bareudp dstport 6635 ethertype 0x8847 multiproto
+ a) ip link add dev bareudp0 type bareudp dstport 6635 ethertype ipv4 multiproto
- b) ip link add dev bareudp0 type bareudp dstport 6635 ethertype mpls
+ For an IPv4 tunnel the multiproto mode allows the tunnel to also handle
+ IPv6.
+
+ b) ip link add dev bareudp0 type bareudp dstport 6635 ethertype mpls_uc multiproto
+
+ For MPLS, the multiproto mode allows the tunnel to handle both unicast
+ and multicast MPLS packets.
3) Device Usage
diff --git a/Documentation/networking/batman-adv.rst b/Documentation/networking/batman-adv.rst
index 18020943ba25..8a0dcb1894b4 100644
--- a/Documentation/networking/batman-adv.rst
+++ b/Documentation/networking/batman-adv.rst
@@ -73,7 +73,7 @@ lower value. This will make the mesh more responsive to topology changes, but
will also increase the overhead.
Information about the current state can be accessed via the batadv generic
-netlink family. batctl provides human readable version via its debug tables
+netlink family. batctl provides a human readable version via its debug tables
subcommands.
@@ -115,8 +115,8 @@ are prefixed with "batman-adv:" So to see just these messages try::
$ dmesg | grep batman-adv
When investigating problems with your mesh network, it is sometimes necessary to
-see more detail debug messages. This must be enabled when compiling the
-batman-adv module. When building batman-adv as part of kernel, use "make
+see more detailed debug messages. This must be enabled when compiling the
+batman-adv module. When building batman-adv as part of the kernel, use "make
menuconfig" and enable the option ``B.A.T.M.A.N. debugging``
(``CONFIG_BATMAN_ADV_DEBUG=y``).
@@ -157,10 +157,10 @@ Contact
Please send us comments, experiences, questions, anything :)
IRC:
- #batman on irc.freenode.org
+ #batadv on ircs://irc.hackint.org/
Mailing-list:
- b.a.t.m.a.n@open-mesh.org (optional subscription at
- https://lists.open-mesh.org/mm/listinfo/b.a.t.m.a.n)
+ b.a.t.m.a.n@lists.open-mesh.org (optional subscription at
+ https://lists.open-mesh.org/mailman3/postorius/lists/b.a.t.m.a.n.lists.open-mesh.org/)
You can also contact the Authors:
diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
index 24168b0d16bd..e774b48de9f5 100644
--- a/Documentation/networking/bonding.rst
+++ b/Documentation/networking/bonding.rst
@@ -196,11 +196,12 @@ ad_actor_sys_prio
ad_actor_system
In an AD system, this specifies the mac-address for the actor in
- protocol packet exchanges (LACPDUs). The value cannot be NULL or
- multicast. It is preferred to have the local-admin bit set for this
- mac but driver does not enforce it. If the value is not given then
- system defaults to using the masters' mac address as actors' system
- address.
+ protocol packet exchanges (LACPDUs). The value cannot be a multicast
+ address. If the all-zeroes MAC is specified, bonding will internally
+ use the MAC of the bond itself. It is preferred to have the
+ local-admin bit set for this MAC, but the driver does not enforce it. If
+ the value is not given, the system defaults to using the master's
+ MAC address as the actor's system address.
This parameter has effect only in 802.3ad mode and is available through
SysFs interface.
@@ -312,6 +313,17 @@ arp_ip_target
maximum number of targets that can be specified is 16. The
default value is no IP addresses.
+ns_ip6_target
+
+ Specifies the IPv6 addresses to use as IPv6 monitoring peers when
+ arp_interval is > 0. These are the targets of the NS request
+ sent to determine the health of the link to the targets.
+ Specify these values in ffff:ffff::ffff:ffff format. Multiple IPv6
+ addresses must be separated by a comma. At least one IPv6
+ address must be given for NS/NA monitoring to function. The
+ maximum number of targets that can be specified is 16. The
+ default value is no IPv6 addresses.
+
arp_validate
Specifies whether or not ARP probes and replies should be
@@ -421,6 +433,29 @@ arp_all_targets
consider the slave up only when all of the arp_ip_targets
are reachable
+arp_missed_max
+
+ Specifies the number of arp_interval monitor checks that must
+ fail in order for an interface to be marked down by the ARP monitor.
+
+ In order to provide orderly failover semantics, backup interfaces
+ are permitted an extra monitor check (i.e., they must fail
+ arp_missed_max + 1 times before being marked down).
+
+ The default value is 2, and the allowable range is 1 - 255.
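
As a rough illustration of the rule above, the threshold can be written as a small helper. This is a simplified sketch with a hypothetical function name, counting failed checks directly; it is not the bonding driver's actual code.

```c
#include <stdbool.h>

/* Returns true when an interface should be marked down by the ARP
 * monitor. Backup interfaces are allowed one extra failed check.
 */
static bool ex_arp_mark_down(unsigned int failed_checks,
			     unsigned int arp_missed_max,
			     bool is_backup)
{
	unsigned int threshold = arp_missed_max + (is_backup ? 1 : 0);

	return failed_checks >= threshold;
}
```

With the default arp_missed_max of 2, an active interface is marked down after two failed checks and a backup interface after three.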
+
+coupled_control
+
+ Specifies whether the LACP state machine's MUX in the 802.3ad mode
+ should have separate Collecting and Distributing states.
+
+ This is done by implementing the independent control state machine per
+ IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled control
+ state machine.
+
+ The default value is 1. This setting does not separate the Collecting
+ and Distributing states, maintaining the bond in coupled control.
+
downdelay
Specifies the time, in milliseconds, to wait before disabling
@@ -501,6 +536,18 @@ fail_over_mac
This option was added in bonding version 3.2.0. The "follow"
policy was added in bonding version 3.3.0.
+lacp_active
+ Option specifying whether to send LACPDU frames periodically.
+
+ off or 0
+ LACPDU frames act as "speak when spoken to".
+
+ on or 1
+ LACPDU frames are sent along the configured links
+ periodically. See lacp_rate for more details.
+
+ The default is on.
+
lacp_rate
Option specifying the rate in which we'll ask our link partner
@@ -531,7 +578,8 @@ miimon
link monitoring. A value of 100 is a good starting point.
The use_carrier option, below, affects how the link state is
determined. See the High Availability section for additional
- information. The default value is 0.
+ information. The default value is 100 if arp_interval is not
+ set.
min_links
@@ -740,10 +788,22 @@ peer_notif_delay
Specify the delay, in milliseconds, between each peer
notification (gratuitous ARP and unsolicited IPv6 Neighbor
Advertisement) when they are issued after a failover event.
- This delay should be a multiple of the link monitor interval
- (arp_interval or miimon, whichever is active). The default
- value is 0 which means to match the value of the link monitor
- interval.
+ This delay should be a multiple of the MII link monitor interval
+ (miimon).
+
+ The valid range is 0 - 300000. The default value is 0, which means
+ to match the value of the MII link monitor interval.
+
+prio
+ Slave priority. A higher number means higher priority.
+ The primary slave has the highest priority. This option also
+ follows the primary_reselect rules.
+
+ This option can only be configured via netlink, and is only valid
+ for active-backup (1), balance-tlb (5) and balance-alb (6) mode.
+ The valid value range is a signed 32 bit integer.
+
+ The default value is 0.
primary
@@ -800,7 +860,7 @@ primary_reselect
tlb_dynamic_lb
Specifies if dynamic shuffling of flows is enabled in tlb
- mode. The value has no effect on any other modes.
+ or alb mode. The value has no effect on any other modes.
The default behavior of tlb mode is to shuffle active flows across
slaves based on the load in that interval. This gives nice lb
@@ -859,7 +919,7 @@ xmit_hash_policy
Uses XOR of hardware MAC addresses and packet type ID
field to generate the hash. The formula is
- hash = source MAC XOR destination MAC XOR packet type ID
+ hash = source MAC[5] XOR destination MAC[5] XOR packet type ID
slave number = hash modulo slave count
This algorithm will place all traffic to a particular
@@ -875,7 +935,7 @@ xmit_hash_policy
Uses XOR of hardware MAC addresses and IP addresses to
generate the hash. The formula is
- hash = source MAC XOR destination MAC XOR packet type ID
+ hash = source MAC[5] XOR destination MAC[5] XOR packet type ID
hash = hash XOR source IP XOR destination IP
hash = hash XOR (hash RSHIFT 16)
hash = hash XOR (hash RSHIFT 8)
@@ -910,6 +970,7 @@ xmit_hash_policy
hash = hash XOR source IP XOR destination IP
hash = hash XOR (hash RSHIFT 16)
hash = hash XOR (hash RSHIFT 8)
+ hash = hash RSHIFT 1
And then hash is reduced modulo slave count.
If the protocol is IPv6 then the source and destination
@@ -951,6 +1012,19 @@ xmit_hash_policy
packets will be distributed according to the encapsulated
flows.
+ vlan+srcmac
+
+ This policy uses a very rudimentary vlan ID and source mac
+ hash to load-balance traffic per-vlan, with failover
+ should one leg fail. The intended use case is for a bond
+ shared by multiple virtual machines, all configured to
+ use their own vlan, to give lacp-like functionality
+ without requiring lacp-capable switching hardware.
+
+ The formula for the hash is simply
+
+ hash = (vlan ID) XOR (source MAC vendor) XOR (source MAC dev)
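
A minimal user-space sketch of this formula is shown below. It assumes that "source MAC vendor" means the first three octets of the source MAC address and "source MAC dev" the last three, each read as a big-endian number, and that the result is reduced modulo the slave count as for the other policies. The function names are hypothetical and this is not the driver's code.

```c
#include <stdint.h>

#define EX_ETH_ALEN 6

/* hash = (vlan ID) XOR (source MAC vendor) XOR (source MAC dev) */
static uint32_t ex_vlan_srcmac_hash(uint16_t vlan_id,
				    const uint8_t src_mac[EX_ETH_ALEN])
{
	uint32_t vendor = 0, dev = 0;
	int i;

	/* Assumption: vendor part is octets 0..2 of the MAC address. */
	for (i = 0; i < 3; i++)
		vendor = (vendor << 8) | src_mac[i];

	/* Assumption: device part is octets 3..5 of the MAC address. */
	for (i = 3; i < EX_ETH_ALEN; i++)
		dev = (dev << 8) | src_mac[i];

	return vlan_id ^ vendor ^ dev;
}

/* Map the hash onto one of slave_count slaves (slave_count > 0). */
static uint32_t ex_pick_slave(uint32_t hash, uint32_t slave_count)
{
	return hash % slave_count;
}
```

Because only the vlan ID and the source MAC enter the hash, all frames from one virtual machine on one vlan land on the same slave, which is the per-vlan balancing behaviour described above.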
+
The default value is layer2. This option was added in bonding
version 2.6.3. In earlier versions of bonding, this parameter
does not exist, and the layer2 policy is the only policy. The
@@ -1574,7 +1648,7 @@ your init script::
-----------------------------------------
This section applies to distros which use /etc/network/interfaces file
-to describe network interface configuration, most notably Debian and it's
+to describe network interface configuration, most notably Debian and its
derivatives.
The ifup and ifdown commands on Debian don't support bonding out of
@@ -1923,15 +1997,6 @@ uses the response as an indication that the link is operating. This
gives some assurance that traffic is actually flowing to and from one
or more peers on the local network.
-The ARP monitor relies on the device driver itself to verify
-that traffic is flowing. In particular, the driver must keep up to
-date the last receive time, dev->last_rx. Drivers that use NETIF_F_LLTX
-flag must also update netdev_queue->trans_start. If they do not, then the
-ARP monitor will immediately fail any slaves using that driver, and
-those slaves will stay down. If networking monitoring (tcpdump, etc)
-shows the ARP requests and replies on the network, then it may be that
-your device driver is not updating last_rx and trans_start.
-
7.2 Configuring Multiple ARP Targets
------------------------------------
@@ -1975,7 +2040,7 @@ netif_carrier.
If use_carrier is 0, then the MII monitor will first query the
device's (via ioctl) MII registers and check the link state. If that
request fails (not just that it returns carrier down), then the MII
-monitor will make an ethtool ETHOOL_GLINK request to attempt to obtain
+monitor will make an ethtool ETHTOOL_GLINK request to attempt to obtain
the same information. If both methods fail (i.e., the driver either
does not support or had some error in processing both the MII register
and ethtool requests), then the MII monitor will assume the link is
@@ -2860,17 +2925,6 @@ version of the linux kernel, found on http://kernel.org
The latest version of this document can be found in the latest kernel
source (named Documentation/networking/bonding.rst).
-Discussions regarding the usage of the bonding driver take place on the
-bonding-devel mailing list, hosted at sourceforge.net. If you have questions or
-problems, post them to the list. The list address is:
-
-bonding-devel@lists.sourceforge.net
-
-The administrative interface (to subscribe or unsubscribe) can
-be found at:
-
-https://lists.sourceforge.net/lists/listinfo/bonding-devel
-
Discussions regarding the development of the bonding driver take place
on the main Linux network mailing list, hosted at vger.kernel.org. The list
address is:
@@ -2881,10 +2935,3 @@ The administrative interface (to subscribe or unsubscribe) can
be found at:
http://vger.kernel.org/vger-lists.html#netdev
-
-Donald Becker's Ethernet Drivers and diag programs may be found at :
-
- - http://web.archive.org/web/%2E/http://www.scyld.com/network/
-
-You will also find a lot of information regarding Ethernet, NWay, MII,
-etc. at www.scyld.com.
diff --git a/Documentation/networking/bridge.rst b/Documentation/networking/bridge.rst
index 4aef9cddde2f..ef8b73e157b2 100644
--- a/Documentation/networking/bridge.rst
+++ b/Documentation/networking/bridge.rst
@@ -4,18 +4,332 @@
Ethernet Bridging
=================
-In order to use the Ethernet bridging functionality, you'll need the
-userspace tools.
+Introduction
+============
-Documentation for Linux bridging is on:
- http://www.linuxfoundation.org/collaborate/workgroups/networking/bridge
+The IEEE 802.1Q-2022 (Bridges and Bridged Networks) standard defines the
+operation of bridges in computer networks. A bridge, in the context of this
+standard, is a device that connects two or more network segments and operates
+at the data link layer (Layer 2) of the OSI (Open Systems Interconnection)
+model. The purpose of a bridge is to filter and forward frames between
+different segments based on the destination MAC (Media Access Control) address.
-The bridge-utilities are maintained at:
- git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/bridge-utils.git
+Bridge kAPI
+===========
-Additionally, the iproute2 utilities can be used to configure
-bridge devices.
+Here are some core structures of bridge code. Note that the kAPI is *unstable*,
+and can be changed at any time.
-If you still have questions, don't hesitate to post to the mailing list
-(more info https://lists.linux-foundation.org/mailman/listinfo/bridge).
+.. kernel-doc:: net/bridge/br_private.h
+ :identifiers: net_bridge_vlan
+Bridge uAPI
+===========
+
+The modern Linux bridge uAPI is accessed via the Netlink interface. The files
+below define the bridge and bridge port netlink attributes.
+
+Bridge netlink attributes
+-------------------------
+
+.. kernel-doc:: include/uapi/linux/if_link.h
+ :doc: Bridge enum definition
+
+Bridge port netlink attributes
+------------------------------
+
+.. kernel-doc:: include/uapi/linux/if_link.h
+ :doc: Bridge port enum definition
+
+Bridge sysfs
+------------
+
+The sysfs interface is deprecated and should not be extended if new
+options are added.
+
+STP
+===
+
+The STP (Spanning Tree Protocol) implementation in the Linux bridge driver
+is a critical feature that helps prevent loops and broadcast storms in
+Ethernet networks by identifying and disabling redundant links. In a Linux
+bridge context, STP is crucial for network stability and availability.
+
+STP is a Layer 2 protocol that operates at the Data Link Layer of the OSI
+model. It was originally developed as IEEE 802.1D and has since evolved into
+multiple versions, including Rapid Spanning Tree Protocol (RSTP) and
+`Multiple Spanning Tree Protocol (MSTP)
+<https://lore.kernel.org/netdev/20220316150857.2442916-1-tobias@waldekranz.com/>`_.
+
+The 802.1D-2004 revision removed the original Spanning Tree Protocol, instead
+incorporating the Rapid Spanning Tree Protocol (RSTP). By 2014, all the
+functionality defined by IEEE 802.1D had been incorporated into either
+IEEE 802.1Q (Bridges and Bridged Networks) or IEEE 802.1AC (MAC Service
+Definition). 802.1D was officially withdrawn in 2022.
+
+Bridge Ports and STP States
+---------------------------
+
+In the context of STP, bridge ports can be in one of the following states:
+ * Blocking: The port is disabled for data traffic and only listens for
+ BPDUs (Bridge Protocol Data Units) from other devices to determine the
+ network topology.
+ * Listening: The port begins to participate in the STP process and listens
+ for BPDUs.
+ * Learning: The port continues to listen for BPDUs and begins to learn MAC
+ addresses from incoming frames but does not forward data frames.
+ * Forwarding: The port is fully operational and forwards both BPDUs and
+ data frames.
+ * Disabled: The port is administratively disabled and does not participate
+ in the STP process. Data frame forwarding is also disabled.
+
+Root Bridge and Convergence
+---------------------------
+
+In the context of networking and Ethernet bridging in Linux, the root bridge
+is a designated switch in a bridged network that serves as a reference point
+for the spanning tree algorithm to create a loop-free topology.
+
+Here's how STP works and how the root bridge is chosen:
+ 1. Bridge Priority: Each bridge running a spanning tree protocol has a
+ configurable Bridge Priority value. The lower the value, the higher the
+ priority. By default, the Bridge Priority is set to a standard value
+ (e.g., 32768).
+ 2. Bridge ID: The Bridge ID is composed of two components: Bridge Priority
+ and the MAC address of the bridge. It uniquely identifies each bridge
+ in the network. The Bridge ID is used to compare the priorities of
+ different bridges.
+ 3. Bridge Election: When the network starts, all bridges initially assume
+ that they are the root bridge. They start advertising Bridge Protocol
+ Data Units (BPDU) to their neighbors, containing their Bridge ID and
+ other information.
+ 4. BPDU Comparison: Bridges exchange BPDUs to determine the root bridge.
+ Each bridge examines the received BPDUs, including the Bridge Priority
+ and Bridge ID, to determine if it should adjust its own priorities.
+ The bridge with the lowest Bridge ID will become the root bridge.
+ 5. Root Bridge Announcement: Once the root bridge is determined, it sends
+ BPDUs with information about the root bridge to all other bridges in the
+ network. This information is used by other bridges to calculate the
+ shortest path to the root bridge and, in doing so, create a loop-free
+ topology.
+ 6. Forwarding Ports: After the root bridge is selected and the spanning tree
+ topology is established, each bridge determines which of its ports should
+ be in the forwarding state (used for data traffic) and which should be in
+ the blocking state (used to prevent loops). The root bridge's ports are
+ all in the forwarding state, while other bridges have some ports in the
+ blocking state to avoid loops.
+ 7. Root Ports: After the root bridge is selected and the spanning tree
+ topology is established, each non-root bridge processes incoming
+ BPDUs and determines which of its ports provides the shortest path to the
+ root bridge based on the information in the received BPDUs. This port is
+ designated as the root port and is placed in the Forwarding state, allowing
+ it to actively forward network traffic.
+ 8. Designated ports: A designated port is the port through which the non-root
+ bridge will forward traffic towards the designated segment. Designated ports
+ are placed in the Forwarding state. All other ports on the non-root
+ bridge that are not designated for specific segments are placed in the
+ Blocking state to prevent network loops.
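
The comparison behind steps 2 to 4 can be sketched in a few lines of C. The structure and function names are hypothetical and this is not the bridge driver's code; it only shows that the Bridge ID is ordered by priority first and by MAC address second, and that the lowest Bridge ID wins the election.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bridge ID: configurable priority plus the bridge MAC address. */
struct ex_bridge_id {
	uint16_t priority;
	uint8_t mac[6];
};

/* Returns <0 if a is the better (lower) ID, >0 if b is, 0 if equal. */
static int ex_bridge_id_cmp(const struct ex_bridge_id *a,
			    const struct ex_bridge_id *b)
{
	if (a->priority != b->priority)
		return a->priority < b->priority ? -1 : 1;

	/* Same priority: the numerically lower MAC address wins. */
	return memcmp(a->mac, b->mac, sizeof(a->mac));
}

/* Returns the index of the root bridge among n candidates (n > 0). */
static size_t ex_elect_root(const struct ex_bridge_id *ids, size_t n)
{
	size_t best = 0, i;

	for (i = 1; i < n; i++)
		if (ex_bridge_id_cmp(&ids[i], &ids[best]) < 0)
			best = i;

	return best;
}
```

In a real network no single node sees all Bridge IDs at once; each bridge applies the same comparison to the BPDUs it receives, and the network converges on the same result.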
+
+STP ensures network convergence by calculating the shortest path and disabling
+redundant links. When network topology changes occur (e.g., a link failure),
+STP recalculates the network topology to restore connectivity while avoiding loops.
+
+Proper configuration of STP parameters, such as the bridge priority, can
+influence network performance, path selection and which bridge becomes the
+Root Bridge.
+
+User space STP helper
+---------------------
+
+The user space STP helper *bridge-stp* is a program to control whether to use
+user mode spanning tree. The ``/sbin/bridge-stp <bridge> <start|stop>`` is
+called by the kernel when STP is enabled/disabled on a bridge
+(via ``brctl stp <bridge> <on|off>`` or ``ip link set <bridge> type bridge
+stp_state <0|1>``). The kernel enables user_stp mode if that command returns
+0, or enables kernel_stp mode if that command returns any other value.
+
+VLAN
+====
+
+A LAN (Local Area Network) is a network that covers a small geographic area,
+typically within a single building or a campus. LANs are used to connect
+computers, servers, printers, and other networked devices within a localized
+area. LANs can be wired (using Ethernet cables) or wireless (using Wi-Fi).
+
+A VLAN (Virtual Local Area Network) is a logical segmentation of a physical
+network into multiple isolated broadcast domains. VLANs are used to divide
+a single physical LAN into multiple virtual LANs, allowing different groups of
+devices to communicate as if they were on separate physical networks.
+
+Typically there are two VLAN implementations, IEEE 802.1Q and IEEE 802.1ad
+(also known as QinQ). IEEE 802.1Q is a standard for VLAN tagging in Ethernet
+networks. It allows network administrators to create logical VLANs on a
+physical network and tag Ethernet frames with VLAN information, which is
+called *VLAN-tagged frames*. IEEE 802.1ad, commonly known as QinQ or Double
+VLAN, is an extension of the IEEE 802.1Q standard. QinQ allows for the
+stacking of multiple VLAN tags within a single Ethernet frame. The Linux
+bridge supports both the IEEE 802.1Q and `802.1ad
+<https://lore.kernel.org/netdev/1402401565-15423-1-git-send-email-makita.toshiaki@lab.ntt.co.jp/>`_
+protocols for VLAN tagging.
+
+`VLAN filtering <https://lore.kernel.org/netdev/1360792820-14116-1-git-send-email-vyasevic@redhat.com/>`_
+on a bridge is disabled by default. After enabling VLAN filtering on a bridge,
+it will start forwarding frames to appropriate destinations based on their
+destination MAC address and VLAN tag (both must match).
+
+Multicast
+=========
+
+The Linux bridge driver has multicast support allowing it to process Internet
+Group Management Protocol (IGMP) or Multicast Listener Discovery (MLD)
+messages, and to efficiently forward multicast data packets. The bridge
+driver supports IGMPv2/IGMPv3 and MLDv1/MLDv2.
+
+Multicast snooping
+------------------
+
+Multicast snooping is a networking technology that allows network switches
+to intelligently manage multicast traffic within a local area network (LAN).
+
+The switch maintains a multicast group table, which records the association
+between multicast group addresses and the ports where hosts have joined these
+groups. The group table is dynamically updated based on the IGMP/MLD messages
+received. With the multicast group information gathered through snooping, the
+switch optimizes the forwarding of multicast traffic. Instead of blindly
+broadcasting the multicast traffic to all ports, it sends the multicast
+traffic based on the destination MAC address only to ports which have
+subscribed to the respective destination multicast group.
+
+When created, the Linux bridge devices have multicast snooping enabled by
+default. It maintains a Multicast forwarding database (MDB) which keeps track
+of port and group relationships.
+
+IGMPv3/MLDv2 EHT support
+------------------------
+
+The Linux bridge supports IGMPv3/MLDv2 EHT (Explicit Host Tracking), which
+was added by `474ddb37fa3a ("net: bridge: multicast: add EHT allow/block handling")
+<https://lore.kernel.org/netdev/20210120145203.1109140-1-razor@blackwall.org/>`_.
+
+The explicit host tracking enables the device to keep track of each
+individual host that is joined to a particular group or channel. The main
+benefit of the explicit host tracking in IGMP is to allow minimal leave
+latencies when a host leaves a multicast group or channel.
+
+The length of time between a host wanting to leave and a device stopping
+traffic forwarding is called the IGMP leave latency. A device configured
+with IGMPv3 or MLDv2 and explicit tracking can immediately stop forwarding
+traffic if the last host to request to receive traffic from the device
+indicates that it no longer wants to receive traffic. The leave latency
+is thus bound only by the packet transmission latencies in the multiaccess
+network and the processing time in the device.
+
+Other multicast features
+------------------------
+
+The Linux bridge also supports `per-VLAN multicast snooping
+<https://lore.kernel.org/netdev/20210719170637.435541-1-razor@blackwall.org/>`_,
+which is disabled by default but can be enabled, and `Multicast Router Discovery
+<https://lore.kernel.org/netdev/20190121062628.2710-1-linus.luessing@c0d3.blue/>`_,
+which helps identify the location of multicast routers.
+
+Switchdev
+=========
+
+Linux Bridge Switchdev is a feature in the Linux kernel that extends the
+capabilities of the traditional Linux bridge to work more efficiently with
+hardware switches that support switchdev. With Linux Bridge Switchdev, certain
+networking functions like forwarding, filtering, and learning of Ethernet
+frames can be offloaded to a hardware switch. This offloading reduces the
+burden on the Linux kernel and CPU, leading to improved network performance
+and lower latency.
+
+To use Linux Bridge Switchdev, you need hardware switches that support the
+switchdev interface. This means that the switch hardware needs to have the
+necessary drivers and functionality to work in conjunction with the Linux
+kernel.
+
+Please see the :ref:`switchdev` document for more details.
+
+Netfilter
+=========
+
+The bridge netfilter module is a legacy feature that allows filtering bridged
+packets with iptables and ip6tables. Its use is discouraged. Users should
+consider using nftables for packet filtering.
+
+The older ebtables tool is more feature-limited than nftables, but,
+like nftables, it does not need this module to function.
+
+The br_netfilter module intercepts packets entering the bridge, performs
+minimal sanity tests on ipv4 and ipv6 packets and then pretends that
+these packets are being routed, not bridged. br_netfilter then calls
+the ip and ipv6 netfilter hooks from the bridge layer, i.e. ip(6)tables
+rulesets will also see these packets.
+
+br_netfilter is also the reason for the iptables *physdev* match:
+This match is the only way to reliably tell routed and bridged packets
+apart in an iptables ruleset.
+
+Note that ebtables and nftables will work fine without the br_netfilter module.
+iptables/ip6tables/arptables do not work for bridged traffic because they
+plug into the routing stack. nftables rules in ip/ip6/inet/arp families won't
+see traffic that is forwarded by a bridge either, but that's very much how it
+should be.
+
+Historically the feature set of ebtables was very limited (it still is), so
+this module was added to pretend packets are routed and invoke the ipv4/ipv6
+netfilter hooks from the bridge so users had access to the more feature-rich
+iptables matching capabilities (including conntrack). nftables doesn't have
+this limitation; pretty much all features work regardless of the protocol family.
+
+So, br_netfilter is only needed if users, for some reason, need to use
+ip(6)tables to filter packets forwarded by the bridge, or NAT bridged
+traffic. For pure link layer filtering, this module isn't needed.
+
+Other Features
+==============
+
+The Linux bridge also supports `IEEE 802.11 Proxy ARP
+<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=958501163ddd6ea22a98f94fa0e7ce6d4734e5c4>`_,
+`Media Redundancy Protocol (MRP)
+<https://lore.kernel.org/netdev/20200426132208.3232-1-horatiu.vultur@microchip.com/>`_,
+`Media Redundancy Protocol (MRP) LC mode
+<https://lore.kernel.org/r/20201124082525.273820-1-horatiu.vultur@microchip.com>`_,
+`IEEE 802.1X port authentication
+<https://lore.kernel.org/netdev/20220218155148.2329797-1-schultz.hans+netdev@gmail.com/>`_,
+and `MAC Authentication Bypass (MAB)
+<https://lore.kernel.org/netdev/20221101193922.2125323-2-idosch@nvidia.com/>`_.
+
+FAQ
+===
+
+What does a bridge do?
+----------------------
+
+A bridge transparently forwards traffic between multiple network interfaces.
+In plain English this means that a bridge connects two or more physical
+Ethernet networks, to form one larger (logical) Ethernet network.
+
+Is it L3 protocol independent?
+------------------------------
+
+Yes. The bridge sees all frames, but it *uses* only L2 headers/information.
+As such, the bridging functionality is protocol independent, and there should
+be no trouble forwarding IPX, NetBEUI, IP, IPv6, etc.
+
+Contact Info
+============
+
+The code is currently maintained by Roopa Prabhu <roopa@nvidia.com> and
+Nikolay Aleksandrov <razor@blackwall.org>. Bridge bugs and enhancements
+are discussed on the linux-netdev mailing list netdev@vger.kernel.org and
+bridge@lists.linux.dev.
+
+The list is open to anyone interested: http://vger.kernel.org/vger-lists.html#netdev
+
+External Links
+==============
+
+The old documentation for Linux bridging is at:
+https://wiki.linuxfoundation.org/networking/bridge
diff --git a/Documentation/networking/caif/caif.rst b/Documentation/networking/caif/caif.rst
index a07213030ccf..d922d419c513 100644
--- a/Documentation/networking/caif/caif.rst
+++ b/Documentation/networking/caif/caif.rst
@@ -68,11 +68,10 @@ There are debugfs parameters provided for serial communication.
* tty_status: Prints the bit-mask tty status information
- 0x01 - tty->warned is on.
- - 0x02 - tty->low_latency is on.
- 0x04 - tty->packed is on.
- - 0x08 - tty->flow_stopped is on.
+ - 0x08 - tty->flow.tco_stopped is on.
- 0x10 - tty->hw_stopped is on.
- - 0x20 - tty->stopped is on.
+ - 0x20 - tty->flow.stopped is on.
* last_tx_msg: Binary blob Prints the last transmitted frame.
diff --git a/Documentation/networking/caif/index.rst b/Documentation/networking/caif/index.rst
index 86e5b7832ec3..ec29b6f4bdb4 100644
--- a/Documentation/networking/caif/index.rst
+++ b/Documentation/networking/caif/index.rst
@@ -10,4 +10,3 @@ Contents:
linux_caif
caif
- spi_porting
diff --git a/Documentation/networking/caif/spi_porting.rst b/Documentation/networking/caif/spi_porting.rst
deleted file mode 100644
index d49f874b20ac..000000000000
--- a/Documentation/networking/caif/spi_porting.rst
+++ /dev/null
@@ -1,229 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-================
-CAIF SPI porting
-================
-
-CAIF SPI basics
-===============
-
-Running CAIF over SPI needs some extra setup, owing to the nature of SPI.
-Two extra GPIOs have been added in order to negotiate the transfers
-between the master and the slave. The minimum requirement for running
-CAIF over SPI is a SPI slave chip and two GPIOs (more details below).
-Please note that running as a slave implies that you need to keep up
-with the master clock. An overrun or underrun event is fatal.
-
-CAIF SPI framework
-==================
-
-To make porting as easy as possible, the CAIF SPI has been divided in
-two parts. The first part (called the interface part) deals with all
-generic functionality such as length framing, SPI frame negotiation
-and SPI frame delivery and transmission. The other part is the CAIF
-SPI slave device part, which is the module that you have to write if
-you want to run SPI CAIF on a new hardware. This part takes care of
-the physical hardware, both with regard to SPI and to GPIOs.
-
-- Implementing a CAIF SPI device:
-
- - Functionality provided by the CAIF SPI slave device:
-
- In order to implement a SPI device you will, as a minimum,
- need to implement the following
- functions:
-
- ::
-
- int (*init_xfer) (struct cfspi_xfer * xfer, struct cfspi_dev *dev):
-
- This function is called by the CAIF SPI interface to give
- you a chance to set up your hardware to be ready to receive
- a stream of data from the master. The xfer structure contains
- both physical and logical addresses, as well as the total length
- of the transfer in both directions.The dev parameter can be used
- to map to different CAIF SPI slave devices.
-
- ::
-
- void (*sig_xfer) (bool xfer, struct cfspi_dev *dev):
-
- This function is called by the CAIF SPI interface when the output
- (SPI_INT) GPIO needs to change state. The boolean value of the xfer
- variable indicates whether the GPIO should be asserted (HIGH) or
- deasserted (LOW). The dev parameter can be used to map to different CAIF
- SPI slave devices.
-
- - Functionality provided by the CAIF SPI interface:
-
- ::
-
- void (*ss_cb) (bool assert, struct cfspi_ifc *ifc);
-
- This function is called by the CAIF SPI slave device in order to
- signal a change of state of the input GPIO (SS) to the interface.
- Only active edges are mandatory to be reported.
- This function can be called from IRQ context (recommended in order
- not to introduce latency). The ifc parameter should be the pointer
- returned from the platform probe function in the SPI device structure.
-
- ::
-
- void (*xfer_done_cb) (struct cfspi_ifc *ifc);
-
- This function is called by the CAIF SPI slave device in order to
- report that a transfer is completed. This function should only be
- called once both the transmission and the reception are completed.
- This function can be called from IRQ context (recommended in order
- not to introduce latency). The ifc parameter should be the pointer
- returned from the platform probe function in the SPI device structure.
-
- - Connecting the bits and pieces:
-
- - Filling in the SPI slave device structure:
-
- Connect the necessary callback functions.
-
- Indicate clock speed (used to calculate toggle delays).
-
- Chose a suitable name (helps debugging if you use several CAIF
- SPI slave devices).
-
- Assign your private data (can be used to map to your
- structure).
-
- - Filling in the SPI slave platform device structure:
-
- Add name of driver to connect to ("cfspi_sspi").
-
- Assign the SPI slave device structure as platform data.
-
-Padding
-=======
-
-In order to optimize throughput, a number of SPI padding options are provided.
-Padding can be enabled independently for uplink and downlink transfers.
-Padding can be enabled for the head, the tail and for the total frame size.
-The padding needs to be correctly configured on both sides of the link.
-The padding can be changed via module parameters in cfspi_sspi.c or via
-the sysfs directory of the cfspi_sspi driver (before device registration).
-
-- CAIF SPI device template::
-
- /*
- * Copyright (C) ST-Ericsson AB 2010
- * Author: Daniel Martensson / Daniel.Martensson@stericsson.com
- * License terms: GNU General Public License (GPL), version 2.
- *
- */
-
- #include <linux/init.h>
- #include <linux/module.h>
- #include <linux/device.h>
- #include <linux/wait.h>
- #include <linux/interrupt.h>
- #include <linux/dma-mapping.h>
- #include <net/caif/caif_spi.h>
-
- MODULE_LICENSE("GPL");
-
- struct sspi_struct {
- struct cfspi_dev sdev;
- struct cfspi_xfer *xfer;
- };
-
- static struct sspi_struct slave;
- static struct platform_device slave_device;
-
- static irqreturn_t sspi_irq(int irq, void *arg)
- {
- /* You only need to trigger on an edge to the active state of the
- * SS signal. Once a edge is detected, the ss_cb() function should be
- * called with the parameter assert set to true. It is OK
- * (and even advised) to call the ss_cb() function in IRQ context in
- * order not to add any delay. */
-
- return IRQ_HANDLED;
- }
-
- static void sspi_complete(void *context)
- {
- /* Normally the DMA or the SPI framework will call you back
- * in something similar to this. The only thing you need to
- * do is to call the xfer_done_cb() function, providing the pointer
- * to the CAIF SPI interface. It is OK to call this function
- * from IRQ context. */
- }
-
- static int sspi_init_xfer(struct cfspi_xfer *xfer, struct cfspi_dev *dev)
- {
- /* Store transfer info. For a normal implementation you should
- * set up your DMA here and make sure that you are ready to
- * receive the data from the master SPI. */
-
- struct sspi_struct *sspi = (struct sspi_struct *)dev->priv;
-
- sspi->xfer = xfer;
-
- return 0;
- }
-
- void sspi_sig_xfer(bool xfer, struct cfspi_dev *dev)
- {
- /* If xfer is true then you should assert the SPI_INT to indicate to
- * the master that you are ready to receive the data from the master
- * SPI. If xfer is false then you should de-assert SPI_INT to indicate
- * that the transfer is done.
- */
-
- struct sspi_struct *sspi = (struct sspi_struct *)dev->priv;
- }
-
- static void sspi_release(struct device *dev)
- {
- /*
- * Here you should release your SPI device resources.
- */
- }
-
- static int __init sspi_init(void)
- {
- /* Here you should initialize your SPI device by providing the
- * necessary functions, clock speed, name and private data. Once
- * done, you can register your device with the
- * platform_device_register() function. This function will return
- * with the CAIF SPI interface initialized. This is probably also
- * the place where you should set up your GPIOs, interrupts and SPI
- * resources. */
-
- int res = 0;
-
- /* Initialize slave device. */
- slave.sdev.init_xfer = sspi_init_xfer;
- slave.sdev.sig_xfer = sspi_sig_xfer;
- slave.sdev.clk_mhz = 13;
- slave.sdev.priv = &slave;
- slave.sdev.name = "spi_sspi";
- slave_device.dev.release = sspi_release;
-
- /* Initialize platform device. */
- slave_device.name = "cfspi_sspi";
- slave_device.dev.platform_data = &slave.sdev;
-
- /* Register platform device. */
- res = platform_device_register(&slave_device);
- if (res) {
- printk(KERN_WARNING "sspi_init: failed to register dev.\n");
- return -ENODEV;
- }
-
- return res;
- }
-
- static void __exit sspi_exit(void)
- {
- platform_device_del(&slave_device);
- }
-
- module_init(sspi_init);
- module_exit(sspi_exit);
diff --git a/Documentation/networking/can.rst b/Documentation/networking/can.rst
index ff05cbd05e0d..62519d38c58b 100644
--- a/Documentation/networking/can.rst
+++ b/Documentation/networking/can.rst
@@ -168,7 +168,7 @@ reflect the correct [#f1]_ traffic on the node the loopback of the sent
data has to be performed right after a successful transmission. If
the CAN network interface is not capable of performing the loopback for
some reason the SocketCAN core can do this task as a fallback solution.
-See :ref:`socketcan-local-loopback1` for details (recommended).
+See :ref:`socketcan-local-loopback2` for details (recommended).
The loopback functionality is enabled by default to reflect standard
networking behaviour for CAN applications. Due to some requests from
@@ -228,20 +228,36 @@ send(2), sendto(2), sendmsg(2) and the recv* counterpart operations
on the socket as usual. There are also CAN specific socket options
described below.
-The basic CAN frame structure and the sockaddr structure are defined
-in include/linux/can.h:
+The Classical CAN frame structure (aka CAN 2.0B), the CAN FD frame structure
+and the sockaddr structure are defined in include/linux/can.h:
.. code-block:: C
struct can_frame {
canid_t can_id; /* 32 bit CAN_ID + EFF/RTR/ERR flags */
- __u8 can_dlc; /* frame payload length in byte (0 .. 8) */
+ union {
+ /* CAN frame payload length in byte (0 .. CAN_MAX_DLEN)
+ * was previously named can_dlc so we need to carry that
+ * name for legacy support
+ */
+ __u8 len;
+ __u8 can_dlc; /* deprecated */
+ };
__u8 __pad; /* padding */
__u8 __res0; /* reserved / padding */
- __u8 __res1; /* reserved / padding */
+ __u8 len8_dlc; /* optional DLC for 8 byte payload length (9 .. 15) */
__u8 data[8] __attribute__((aligned(8)));
};
+Remark: The len element contains the payload length in bytes and should be
+used instead of can_dlc. The deprecated can_dlc was misleadingly named as
+it always contained the plain payload length in bytes and not the so called
+'data length code' (DLC).
+
+To pass the raw DLC from/to a Classical CAN network device the len8_dlc
+element can contain values 9 .. 15 when the len element is 8 (the real
+payload length for all DLC values greater than or equal to 8).
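
The relationship between len and len8_dlc can be illustrated with a small
userspace sketch. The struct below is only a local stand-in that mirrors the
layout shown above (it is named cc_frame so that it does not clash with
include/linux/can.h), and the helpers cc_frame_get_dlc() and
cc_frame_set_dlc() are hypothetical names chosen for this example, not
kernel or UAPI functions. Clamping values above 15 is likewise an assumption
of the sketch.

```c
#include <stdint.h>

#define CC_MAX_DLEN    8   /* Classical CAN payload limit in bytes */
#define CC_MAX_RAW_DLC 15  /* largest value a 4 bit DLC field can hold */

/* Illustrative stand-in for struct can_frame; not the UAPI definition. */
struct cc_frame {
	uint32_t can_id;
	uint8_t len;       /* payload length in bytes (0 .. 8) */
	uint8_t pad;
	uint8_t res0;
	uint8_t len8_dlc;  /* raw DLC 9 .. 15, only meaningful when len == 8 */
	uint8_t data[8] __attribute__((aligned(8)));
};

/* Return the raw DLC (0 .. 15) that the frame represents on the wire. */
static uint8_t cc_frame_get_dlc(const struct cc_frame *cf)
{
	if (cf->len == CC_MAX_DLEN &&
	    cf->len8_dlc > CC_MAX_DLEN && cf->len8_dlc <= CC_MAX_RAW_DLC)
		return cf->len8_dlc;

	/* For 0 .. 8 the DLC and the payload length are identical. */
	return cf->len;
}

/* Store a raw DLC: len is capped at 8, values 9 .. 15 go to len8_dlc. */
static void cc_frame_set_dlc(struct cc_frame *cf, uint8_t dlc)
{
	if (dlc > CC_MAX_RAW_DLC)
		dlc = CC_MAX_RAW_DLC;  /* assumption: clamp out-of-range input */

	if (dlc > CC_MAX_DLEN) {
		cf->len = CC_MAX_DLEN;
		cf->len8_dlc = dlc;
	} else {
		cf->len = dlc;
		cf->len8_dlc = 0;
	}
}
```

In this sketch a len8_dlc value is ignored whenever len is smaller than 8,
which matches the rule that the element is only valid for an 8 byte payload.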
+
The alignment of the (linear) payload data[] to a 64bit boundary
allows the user to define their own structs and unions to easily access
the CAN payload. There is no given byteorder on the CAN bus by
@@ -260,6 +276,23 @@ PF_PACKET socket, that also binds to a specific interface:
/* transport protocol class address info (e.g. ISOTP) */
struct { canid_t rx_id, tx_id; } tp;
+ /* J1939 address information */
+ struct {
+ /* 8 byte name when using dynamic addressing */
+ __u64 name;
+
+ /* pgn:
+ * 8 bit: PS in PDU2 case, else 0
+ * 8 bit: PF
+ * 1 bit: DP
+ * 1 bit: reserved
+ */
+ __u32 pgn;
+
+ /* 1 byte address */
+ __u8 addr;
+ } j1939;
+
/* reserved for future CAN protocols address information */
} can_addr;
};
@@ -371,7 +404,7 @@ kernel interfaces (ABI) which heavily rely on the CAN frame with fixed eight
bytes of payload (struct can_frame) like the CAN_RAW socket. Therefore e.g.
the CAN_RAW socket supports a new socket option CAN_RAW_FD_FRAMES that
switches the socket into a mode that allows the handling of CAN FD frames
-and (legacy) CAN frames simultaneously (see :ref:`socketcan-rawfd`).
+and Classical CAN frames simultaneously (see :ref:`socketcan-rawfd`).
The struct canfd_frame is defined in include/linux/can.h:
@@ -397,7 +430,7 @@ code (DLC) of the struct can_frame was used as a length information as the
length and the DLC has a 1:1 mapping in the range of 0 .. 8. To preserve
the easy handling of the length information the canfd_frame.len element
contains a plain length value from 0 .. 64. So both canfd_frame.len and
-can_frame.can_dlc are equal and contain a length information and no DLC.
+can_frame.len are equal and contain a length information and no DLC.
For details about the distinction of CAN and CAN FD capable devices and
the mapping to the bus-relevant data length code (DLC), see :ref:`socketcan-can-fd-driver`.
@@ -407,10 +440,28 @@ definitions are specified for CAN specific MTUs in include/linux/can.h:
.. code-block:: C
- #define CAN_MTU (sizeof(struct can_frame)) == 16 => 'legacy' CAN frame
+ #define CAN_MTU (sizeof(struct can_frame)) == 16 => Classical CAN frame
#define CANFD_MTU (sizeof(struct canfd_frame)) == 72 => CAN FD frame
+Returned Message Flags
+----------------------
+
+When using the system call recvmsg(2) on a RAW or a BCM socket, the
+msg->msg_flags field may contain the following flags:
+
+MSG_DONTROUTE:
+ set when the received frame was created on the local host.
+
+MSG_CONFIRM:
+ set when the frame was sent via the socket it is received on.
+ This flag can be interpreted as a 'transmission confirmation' when the
+ CAN driver supports the echo of frames on driver level, see
+ :ref:`socketcan-local-loopback1` and :ref:`socketcan-local-loopback2`.
+ (Note: In order to receive such messages on a RAW socket,
+ CAN_RAW_RECV_OWN_MSGS must be set.)
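
As a rough illustration of how an application might interpret these flags,
the sketch below maps the msg_flags value returned by recvmsg(2) to the origin
of the frame. The enum and the function can_rx_classify() are hypothetical
names introduced for this example only. MSG_CONFIRM is tested first on the
assumption that a frame sent via the receiving socket was also created on the
local host and may therefore carry both flags.

```c
#include <sys/socket.h>

/* Hypothetical classification of a frame returned by recvmsg(2). */
enum can_rx_origin {
	CAN_RX_REMOTE,       /* received from another node on the bus */
	CAN_RX_LOCAL_OTHER,  /* created on this host by a different socket */
	CAN_RX_LOCAL_OWN,    /* echo of a frame sent via this very socket */
};

/* Map msg->msg_flags to the origin of the received CAN frame. */
static enum can_rx_origin can_rx_classify(int msg_flags)
{
	if (msg_flags & MSG_CONFIRM)
		return CAN_RX_LOCAL_OWN;
	if (msg_flags & MSG_DONTROUTE)
		return CAN_RX_LOCAL_OTHER;
	return CAN_RX_REMOTE;
}
```

A caller would pass msg.msg_flags after a successful recvmsg(2) on the RAW or
BCM socket and could, for example, treat CAN_RX_LOCAL_OWN as a transmission
confirmation.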
+
+
.. _socketcan-raw-sockets:
RAW Protocol Sockets with can_filters (SOCK_RAW)
@@ -575,6 +626,8 @@ demand:
setsockopt(s, SOL_CAN_RAW, CAN_RAW_RECV_OWN_MSGS,
&recv_own_msgs, sizeof(recv_own_msgs));
+Note that reception of a socket's own CAN frames is subject to the same
+filtering as other CAN frames (see :ref:`socketcan-rawfilter`).
.. _socketcan-rawfd:
@@ -609,7 +662,7 @@ Example:
printf("got CAN FD frame with length %d\n", cfd.len);
/* cfd.flags contains valid data */
} else if (nbytes == CAN_MTU) {
- printf("got legacy CAN frame with length %d\n", cfd.len);
+ printf("got Classical CAN frame with length %d\n", cfd.len);
/* cfd.flags is undefined */
} else {
fprintf(stderr, "read: invalid CAN(FD) frame\n");
@@ -623,7 +676,7 @@ Example:
printf("%02X ", cfd.data[i]);
When reading with size CANFD_MTU only returns CAN_MTU bytes that have
-been received from the socket a legacy CAN frame has been read into the
+been received from the socket, a Classical CAN frame has been read into the
provided CAN FD structure. Note that the canfd_frame.flags data field is
not specified in the struct can_frame and therefore it is only valid in
CANFD_MTU sized CAN FD frames.
@@ -633,7 +686,7 @@ Implementation hint for new CAN applications:
To build a CAN FD aware application use struct canfd_frame as basic CAN
data structure for CAN_RAW based applications. When the application is
executed on an older Linux kernel and switching the CAN_RAW_FD_FRAMES
-socket option returns an error: No problem. You'll get legacy CAN frames
+socket option returns an error: No problem. You'll get Classical CAN frames
or CAN FD frames and can process them the same way.
When sending to CAN devices make sure that the device is capable to handle
@@ -658,22 +711,6 @@ where the CAN_INV_FILTER flag is set in order to notch single CAN IDs or
CAN ID ranges from the incoming traffic.
-RAW Socket Returned Message Flags
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-When using recvmsg() call, the msg->msg_flags may contain following flags:
-
-MSG_DONTROUTE:
- set when the received frame was created on the local host.
-
-MSG_CONFIRM:
- set when the frame was sent via the socket it is received on.
- This flag can be interpreted as a 'transmission confirmation' when the
- CAN driver supports the echo of frames on driver level, see
- :ref:`socketcan-local-loopback1` and :ref:`socketcan-local-loopback2`.
- In order to receive such messages, CAN_RAW_RECV_OWN_MSGS must be set.
-
-
Broadcast Manager Protocol Sockets (SOCK_DGRAM)
-----------------------------------------------
@@ -842,6 +879,8 @@ TX_RESET_MULTI_IDX:
RX_RTR_FRAME:
Send reply for RTR-request (placed in op->frames[0]).
+CAN_FD_FRAME:
+ The CAN frames following the bcm_msg_head are of type struct canfd_frame.
Broadcast Manager Transmission Timers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -894,7 +933,7 @@ ival1:
ival2:
Throttle the received message rate down to the value of ival2. This
is useful to reduce messages for the application when the signal inside the
- CAN frame is stateless as state changes within the ival2 periode may get
+ CAN frame is stateless as state changes within the ival2 period may get
lost.
Broadcast Manager Multiplex Message Receive Filter
@@ -1026,7 +1065,7 @@ Additional procfs files in /proc/net/can::
stats - SocketCAN core statistics (rx/tx frames, match ratios, ...)
reset_stats - manual statistic reset
- version - prints the SocketCAN core version and the ABI version
+ version - prints SocketCAN core and ABI version (removed in Linux 5.10)
Writing Own CAN Protocol Modules
@@ -1070,7 +1109,7 @@ General Settings
dev->type = ARPHRD_CAN; /* the netdevice hardware type */
dev->flags = IFF_NOARP; /* CAN has no arp */
- dev->mtu = CAN_MTU; /* sizeof(struct can_frame) -> legacy CAN interface */
+ dev->mtu = CAN_MTU; /* sizeof(struct can_frame) -> Classical CAN interface */
or alternative, when the controller supports CAN with flexible data rate:
dev->mtu = CANFD_MTU; /* sizeof(struct canfd_frame) -> CAN FD interface */
@@ -1111,6 +1150,39 @@ tuning on deep embedded systems'. The author is running a MPC603e
load without any problems ...
+Switchable Termination Resistors
+--------------------------------
+
+The CAN bus requires a specific impedance across the differential pair,
+typically provided by two 120 Ohm resistors on the farthest nodes of
+the bus. Some CAN controllers support activating / deactivating one or
+more termination resistors to provide the correct impedance.
+
+Query the available resistances::
+
+ $ ip -details link show can0
+ ...
+ termination 120 [ 0, 120 ]
+
+Activate the terminating resistor::
+
+ $ ip link set dev can0 type can termination 120
+
+Deactivate the terminating resistor::
+
+ $ ip link set dev can0 type can termination 0
+
+To add termination resistor support to a CAN controller driver, either
+implement the following members of the controller's struct can_priv::
+
+ termination_const
+ termination_const_cnt
+ do_set_termination
+
+or add gpio control with the device tree entries from
+Documentation/devicetree/bindings/net/can/can-controller.yaml
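
How the three members fit together can be sketched in plain userspace C. The
code below is not driver code: struct term_ctrl is a stand-in that merely
reuses the member names listed above, and term_request(),
demo_set_termination() and demo_term_values are hypothetical names for this
example. It assumes that a requested value is accepted only if it appears in
the advertised list and that the hardware hook is invoked afterwards.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the members a driver would fill in. */
struct term_ctrl {
	const uint16_t *termination_const;   /* supported values in Ohm */
	unsigned int termination_const_cnt;  /* number of supported values */
	int (*do_set_termination)(struct term_ctrl *tc, uint16_t term);
	uint16_t termination;                /* currently active value */
};

/* Values advertised by the example device: off and 120 Ohm. */
static const uint16_t demo_term_values[] = { 0, 120 };

/* Hypothetical hardware hook; a real driver would toggle a GPIO or register. */
static int demo_set_termination(struct term_ctrl *tc, uint16_t term)
{
	tc->termination = term;
	return 0;
}

/* Accept a request only if the value is in the advertised list. */
static int term_request(struct term_ctrl *tc, uint16_t term)
{
	unsigned int i;

	if (!tc->do_set_termination)
		return -EOPNOTSUPP;

	for (i = 0; i < tc->termination_const_cnt; i++)
		if (tc->termination_const[i] == term)
			return tc->do_set_termination(tc, term);

	return -EINVAL;
}
```

With the list { 0, 120 } this corresponds to the "termination 120 [ 0, 120 ]"
output shown above: requesting 120 or 0 succeeds, any other value is rejected.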
+
+
The Virtual CAN Driver (vcan)
-----------------------------
@@ -1184,6 +1256,7 @@ Setting CAN device properties::
[ fd { on | off } ]
[ fd-non-iso { on | off } ]
[ presume-ack { on | off } ]
+ [ cc-len8-dlc { on | off } ]
[ restart-ms TIME-MS ]
[ restart ]
@@ -1326,22 +1399,22 @@ arbitration phase and the payload phase of the CAN FD frame. Therefore a
second bit timing has to be specified in order to enable the CAN FD bitrate.
Additionally CAN FD capable CAN controllers support up to 64 bytes of
-payload. The representation of this length in can_frame.can_dlc and
+payload. The representation of this length in can_frame.len and
canfd_frame.len for userspace applications and inside the Linux network
layer is a plain value from 0 .. 64 instead of the CAN 'data length code'.
-The data length code was a 1:1 mapping to the payload length in the legacy
+The data length code was a 1:1 mapping to the payload length in the Classical
CAN frames anyway. The payload length to the bus-relevant DLC mapping is
only performed inside the CAN drivers, preferably with the helper
-functions can_dlc2len() and can_len2dlc().
+functions can_fd_dlc2len() and can_fd_len2dlc().
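
The mapping between the CAN FD data length code and the payload length can be
sketched as a simple lookup table. This is a standalone illustration, not the
kernel implementation: demo_fd_dlc2len() and demo_fd_len2dlc() are
hypothetical names, and rounding a length up to the next representable size
is an assumption of this sketch.

```c
#include <stdint.h>

/* CAN FD DLC (0 .. 15) to payload length in bytes. */
static const uint8_t fd_dlc_to_len[16] = {
	0, 1, 2, 3, 4, 5, 6, 7, 8, 12, 16, 20, 24, 32, 48, 64
};

/* Translate a DLC into a payload length; only the low 4 bits are used. */
static uint8_t demo_fd_dlc2len(uint8_t dlc)
{
	return fd_dlc_to_len[dlc & 0x0F];
}

/* Return the smallest DLC whose length can hold len bytes.
 * Lengths above 64 map to DLC 15.
 */
static uint8_t demo_fd_len2dlc(uint8_t len)
{
	uint8_t dlc;

	for (dlc = 0; dlc < 15; dlc++)
		if (fd_dlc_to_len[dlc] >= len)
			break;

	return dlc;
}
```

For lengths 0 .. 8 both functions are the identity, which is the 1:1 mapping
mentioned above; only the values 9 .. 15 expand to 12 .. 64 bytes.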
The CAN netdevice driver capabilities can be distinguished by the network
devices maximum transfer unit (MTU)::
- MTU = 16 (CAN_MTU) => sizeof(struct can_frame) => 'legacy' CAN device
+ MTU = 16 (CAN_MTU) => sizeof(struct can_frame) => Classical CAN device
MTU = 72 (CANFD_MTU) => sizeof(struct canfd_frame) => CAN FD capable device
The CAN device MTU can be retrieved e.g. with a SIOCGIFMTU ioctl() syscall.
-N.B. CAN FD capable devices can also handle and send legacy CAN frames.
+N.B. CAN FD capable devices can also handle and send Classical CAN frames.
When configuring CAN FD capable CAN controllers an additional 'data' bitrate
has to be set. This bitrate for the data phase of the CAN FD frame has to be
diff --git a/Documentation/networking/can_ucan_protocol.rst b/Documentation/networking/can_ucan_protocol.rst
index 638ac1ee7914..935d872ae87c 100644
--- a/Documentation/networking/can_ucan_protocol.rst
+++ b/Documentation/networking/can_ucan_protocol.rst
@@ -50,7 +50,7 @@ Setup Packet
``wIndex`` USB Interface Index (0 for device commands)
``wLength`` * Host to Device - Number of bytes to transmit
* Device to Host - Maximum Number of bytes to
- receive. If the device send less. Commom ZLP
+ receive. If the device send less. Common ZLP
semantics are used.
================= =====================================================
diff --git a/Documentation/networking/cdc_mbim.rst b/Documentation/networking/cdc_mbim.rst
index 0048409c06b4..37f968acc473 100644
--- a/Documentation/networking/cdc_mbim.rst
+++ b/Documentation/networking/cdc_mbim.rst
@@ -93,7 +93,7 @@ MBIM function can be looked up using sysfs. For example::
USB configuration descriptors
-----------------------------
The wMaxControlMessage field of the CDC MBIM functional descriptor
-limits the maximum control message size. The managament application is
+limits the maximum control message size. The management application is
responsible for negotiating a control message size complying with the
requirements in section 9.3.1 of [1], taking this descriptor field
into consideration.
diff --git a/Documentation/networking/cops.rst b/Documentation/networking/cops.rst
deleted file mode 100644
index 964ba80599a9..000000000000
--- a/Documentation/networking/cops.rst
+++ /dev/null
@@ -1,80 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-========================================
-The COPS LocalTalk Linux driver (cops.c)
-========================================
-
-By Jay Schulist <jschlst@samba.org>
-
-This driver has two modes and they are: Dayna mode and Tangent mode.
-Each mode corresponds with the type of card. It has been found
-that there are 2 main types of cards and all other cards are
-the same and just have different names or only have minor differences
-such as more IO ports. As this driver is tested it will
-become more clear exactly what cards are supported.
-
-Right now these cards are known to work with the COPS driver. The
-LT-200 cards work in a somewhat more limited capacity than the
-DL200 cards, which work very well and are in use by many people.
-
-TANGENT driver mode:
- - Tangent ATB-II, Novell NL-1000, Daystar Digital LT-200
-
-DAYNA driver mode:
- - Dayna DL2000/DaynaTalk PC (Half Length), COPS LT-95,
- - Farallon PhoneNET PC III, Farallon PhoneNET PC II
-
-Other cards possibly supported mode unknown though:
- - Dayna DL2000 (Full length)
-
-The COPS driver defaults to using Dayna mode. To change the driver's
-mode if you built a driver with dual support use board_type=1 or
-board_type=2 for Dayna or Tangent with insmod.
-
-Operation/loading of the driver
-===============================
-
-Use modprobe like this: /sbin/modprobe cops.o (IO #) (IRQ #)
-If you do not specify any options the driver will try and use the IO = 0x240,
-IRQ = 5. As of right now I would only use IRQ 5 for the card, if autoprobing.
-
-To load multiple COPS driver Localtalk cards you can do one of the following::
-
- insmod cops io=0x240 irq=5
- insmod -o cops2 cops io=0x260 irq=3
-
-Or in lilo.conf put something like this::
-
- append="ether=5,0x240,lt0 ether=3,0x260,lt1"
-
-Then bring up the interface with ifconfig. It will look something like this::
-
- lt0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-F7-00-00-00-00-00-00-00-00
- inet addr:192.168.1.2 Bcast:192.168.1.255 Mask:255.255.255.0
- UP BROADCAST RUNNING NOARP MULTICAST MTU:600 Metric:1
- RX packets:0 errors:0 dropped:0 overruns:0 frame:0
- TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 coll:0
-
-Netatalk Configuration
-======================
-
-You will need to configure atalkd with something like the following to make
-it work with the cops.c driver.
-
-* For single LTalk card use::
-
- dummy -seed -phase 2 -net 2000 -addr 2000.10 -zone "1033"
- lt0 -seed -phase 1 -net 1000 -addr 1000.50 -zone "1033"
-
-* For multiple cards, Ethernet and LocalTalk::
-
- eth0 -seed -phase 2 -net 3000 -addr 3000.20 -zone "1033"
- lt0 -seed -phase 1 -net 1000 -addr 1000.50 -zone "1033"
-
-* For multiple LocalTalk cards, and an Ethernet card.
-
-* Order seems to matter here, Ethernet last::
-
- lt0 -seed -phase 1 -net 1000 -addr 1000.10 -zone "LocalTalk1"
- lt1 -seed -phase 1 -net 2000 -addr 2000.20 -zone "LocalTalk2"
- eth0 -seed -phase 2 -net 3000 -addr 3000.30 -zone "EtherTalk"
diff --git a/Documentation/networking/dccp.rst b/Documentation/networking/dccp.rst
index dde16be04456..91e5c33ba3ff 100644
--- a/Documentation/networking/dccp.rst
+++ b/Documentation/networking/dccp.rst
@@ -192,6 +192,9 @@ FIONREAD
Works as in udp(7): returns in the ``int`` argument pointer the size of
the next pending datagram in bytes, or 0 when no datagram is pending.
+SIOCOUTQ
+ Returns the number of unsent data bytes in the socket send queue as ``int``
+ into the buffer specified by the argument pointer.
Other tunables
==============
diff --git a/Documentation/networking/decnet.rst b/Documentation/networking/decnet.rst
deleted file mode 100644
index b8bc11ff8370..000000000000
--- a/Documentation/networking/decnet.rst
+++ /dev/null
@@ -1,243 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=========================================
-Linux DECnet Networking Layer Information
-=========================================
-
-1. Other documentation....
-==========================
-
- - Project Home Pages
- - http://www.chygwyn.com/ - Kernel info
- - http://linux-decnet.sourceforge.net/ - Userland tools
- - http://www.sourceforge.net/projects/linux-decnet/ - Status page
-
-2. Configuring the kernel
-=========================
-
-Be sure to turn on the following options:
-
- - CONFIG_DECNET (obviously)
- - CONFIG_PROC_FS (to see what's going on)
- - CONFIG_SYSCTL (for easy configuration)
-
-if you want to try out router support (not properly debugged yet)
-you'll need the following options as well...
-
- - CONFIG_DECNET_ROUTER (to be able to add/delete routes)
- - CONFIG_NETFILTER (will be required for the DECnet routing daemon)
-
-Don't turn on SIOCGIFCONF support for DECnet unless you are really sure
-that you need it, in general you won't and it can cause ifconfig to
-malfunction.
-
-Run time configuration has changed slightly from the 2.4 system. If you
-want to configure an endnode, then the simplified procedure is as follows:
-
- - Set the MAC address on your ethernet card before starting _any_ other
- network protocols.
-
-As soon as your network card is brought into the UP state, DECnet should
-start working. If you need something more complicated or are unsure how
-to set the MAC address, see the next section. Also all configurations which
-worked with 2.4 will work under 2.5 with no change.
-
-3. Command line options
-=======================
-
-You can set a DECnet address on the kernel command line for compatibility
-with the 2.4 configuration procedure, but in general it's not needed any more.
-If you do st a DECnet address on the command line, it has only one purpose
-which is that its added to the addresses on the loopback device.
-
-With 2.4 kernels, DECnet would only recognise addresses as local if they
-were added to the loopback device. In 2.5, any local interface address
-can be used to loop back to the local machine. Of course this does not
-prevent you adding further addresses to the loopback device if you
-want to.
-
-N.B. Since the address list of an interface determines the addresses for
-which "hello" messages are sent, if you don't set an address on the loopback
-interface then you won't see any entries in /proc/net/neigh for the local
-host until such time as you start a connection. This doesn't affect the
-operation of the local communications in any other way though.
-
-The kernel command line takes options looking like the following::
-
- decnet.addr=1,2
-
-the two numbers are the node address 1,2 = 1.2 For 2.2.xx kernels
-and early 2.3.xx kernels, you must use a comma when specifying the
-DECnet address like this. For more recent 2.3.xx kernels, you may
-use almost any character except space, although a `.` would be the most
-obvious choice :-)
-
-There used to be a third number specifying the node type. This option
-has gone away in favour of a per interface node type. This is now set
-using /proc/sys/net/decnet/conf/<dev>/forwarding. This file can be
-set with a single digit, 0=EndNode, 1=L1 Router and 2=L2 Router.
-
-There are also equivalent options for modules. The node address can
-also be set through the /proc/sys/net/decnet/ files, as can other system
-parameters.
-
-Currently the only supported devices are ethernet and ip_gre. The
-ethernet address of your ethernet card has to be set according to the DECnet
-address of the node in order for it to be autoconfigured (and then appear in
-/proc/net/decnet_dev). There is a utility available at the above
-FTP sites called dn2ethaddr which can compute the correct ethernet
-address to use. The address can be set by ifconfig either before or
-at the time the device is brought up. If you are using RedHat you can
-add the line::
-
- MACADDR=AA:00:04:00:03:04
-
-or something similar, to /etc/sysconfig/network-scripts/ifcfg-eth0 or
-wherever your network card's configuration lives. Setting the MAC address
-of your ethernet card to an address starting with "hi-ord" will cause a
-DECnet address which matches to be added to the interface (which you can
-verify with iproute2).
-
-The default device for routing can be set through the /proc filesystem
-by setting /proc/sys/net/decnet/default_device to the
-device you want DECnet to route packets out of when no specific route
-is available. Usually this will be eth0, for example::
-
- echo -n "eth0" >/proc/sys/net/decnet/default_device
-
-If you don't set the default device, then it will default to the first
-ethernet card which has been autoconfigured as described above. You can
-confirm that by looking in the default_device file of course.
-
-There is a list of what the other files under /proc/sys/net/decnet/ do
-on the kernel patch web site (shown above).
-
-4. Run time kernel configuration
-================================
-
-
-This is either done through the sysctl/proc interface (see the kernel web
-pages for details on what the various options do) or through the iproute2
-package in the same way as IPv4/6 configuration is performed.
-
-Documentation for iproute2 is included with the package, although there is
-as yet no specific section on DECnet, most of the features apply to both
-IP and DECnet, albeit with DECnet addresses instead of IP addresses and
-a reduced functionality.
-
-If you want to configure a DECnet router you'll need the iproute2 package
-since it's the _only_ way to add and delete routes currently. Eventually
-there will be a routing daemon to send and receive routing messages for
-each interface and update the kernel routing tables accordingly. The
-routing daemon will use netfilter to listen to routing packets, and
-rtnetlink to update the kernel's routing tables.
-
-The DECnet raw socket layer has been removed since it was there purely
-for use by the routing daemon which will now use netfilter (a much cleaner
-and more generic solution) instead.
-
-5. How can I tell if it's working?
-==================================
-
-Here is a quick guide of what to look for in order to know if your DECnet
-kernel subsystem is working.
-
- - Is the node address set (see /proc/sys/net/decnet/node_address)
- - Is the node of the correct type
- (see /proc/sys/net/decnet/conf/<dev>/forwarding)
- - Is the Ethernet MAC address of each Ethernet card set to match
- the DECnet address. If in doubt use the dn2ethaddr utility available
- at the ftp archive.
- - If the previous two steps are satisfied, and the Ethernet card is up,
- you should find that it is listed in /proc/net/decnet_dev and also
- that it appears as a directory in /proc/sys/net/decnet/conf/. The
- loopback device (lo) should also appear and is required to communicate
- within a node.
- - If you have any DECnet routers on your network, they should appear
- in /proc/net/decnet_neigh, otherwise this file will only contain the
- entry for the node itself (if it doesn't check to see if lo is up).
- - If you want to send to any node which is not listed in the
- /proc/net/decnet_neigh file, you'll need to set the default device
- to point to an Ethernet card with connection to a router. This is
- again done with the /proc/sys/net/decnet/default_device file.
- - Try starting a simple server and client, like the dnping/dnmirror
- over the loopback interface. With luck they should communicate.
- For this step and those after, you'll need the DECnet library
- which can be obtained from the above ftp sites as well as the
- actual utilities themselves.
- - If this seems to work, then try talking to a node on your local
- network, and see if you can obtain the same results.
- - At this point you are on your own... :-)
-
-6. How to send a bug report
-===========================
-
-If you've found a bug and want to report it, then there are several things
-you can do to help me work out exactly what it is that is wrong. Useful
-information (_most_ of which _is_ _essential_) includes:
-
- - What kernel version are you running ?
- - What version of the patch are you running ?
- - How far through the above set of tests can you get ?
- - What is in the /proc/decnet* files and /proc/sys/net/decnet/* files ?
- - Which services are you running ?
- - Which client caused the problem ?
- - How much data was being transferred ?
- - Was the network congested ?
- - How can the problem be reproduced ?
- - Can you use tcpdump to get a trace ? (N.B. Most (all?) versions of
- tcpdump don't understand how to dump DECnet properly, so including
- the hex listing of the packet contents is _essential_, usually the -x flag.
- You may also need to increase the length grabbed with the -s flag. The
- -e flag also provides very useful information (ethernet MAC addresses))
-
-7. MAC FAQ
-==========
-
-A quick FAQ on ethernet MAC addresses to explain how Linux and DECnet
-interact and how to get the best performance from your hardware.
-
-Ethernet cards are designed to normally only pass received network frames
-to a host computer when they are addressed to it, or to the broadcast address.
-
-Linux has an interface which allows the setting of extra addresses for
-an ethernet card to listen to. If the ethernet card supports it, the
-filtering operation will be done in hardware, if not the extra unwanted packets
-received will be discarded by the host computer. In the latter case,
-significant processor time and bus bandwidth can be used up on a busy
-network (see the NAPI documentation for a longer explanation of these
-effects).
-
-DECnet makes use of this interface to allow running DECnet on an ethernet
-card which has already been configured using TCP/IP (presumably using the
-built in MAC address of the card, as usual) and/or to allow multiple DECnet
-addresses on each physical interface. If you do this, be aware that if your
-ethernet card doesn't support perfect hashing in its MAC address filter
-then your computer will be doing more work than required. Some cards
-will simply set themselves into promiscuous mode in order to receive
-packets from the DECnet specified addresses. So if you have one of these
-cards it's better to set the MAC address of the card as described above
-to gain the best efficiency. Better still is to use a card which supports
-NAPI as well.
-
-
-8. Mailing list
-===============
-
-If you are keen to get involved in development, or want to ask questions
-about configuration, or even just report bugs, then there is a mailing
-list that you can join, details are at:
-
-http://sourceforge.net/mail/?group_id=4993
-
-9. Legal Info
-=============
-
-The Linux DECnet project team have placed their code under the GPL. The
-software is provided "as is" and without warranty express or implied.
-DECnet is a trademark of Compaq. This software is not a product of
-Compaq. We acknowledge the help of people at Compaq in providing extra
-documentation above and beyond what was previously publicly available.
-
-Steve Whitehouse <SteveW@ACM.org>
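The removed text points to the dn2ethaddr utility for deriving the Ethernet address from a DECnet address. As a rough sketch of that mapping, assuming the conventional DECnet Phase IV scheme (the "hi-ord" prefix AA:00:04:00 followed by the 16-bit value area * 1024 + node in little-endian byte order), it can be written as below. The function name `decnet_to_mac` is made up for illustration and is not taken from dn2ethaddr.

```python
def decnet_to_mac(area: int, node: int) -> str:
    """Return the Ethernet MAC address for DECnet address area.node.

    Assumes the DECnet Phase IV convention: areas 1..63, nodes 1..1023,
    16-bit address = area * 1024 + node, stored low byte first after
    the fixed AA:00:04:00 prefix.
    """
    if not (1 <= area <= 63 and 1 <= node <= 1023):
        raise ValueError("area must be 1..63 and node 1..1023")
    addr = (area << 10) | node           # 6 bits of area, 10 bits of node
    low, high = addr & 0xFF, addr >> 8   # little-endian byte order
    return f"AA:00:04:00:{low:02X}:{high:02X}"


if __name__ == "__main__":
    print(decnet_to_mac(1, 3))
```

Under these assumptions, DECnet address 1.3 yields the MACADDR value used in the ifcfg-eth0 example above.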
-
diff --git a/Documentation/networking/cxacru-cf.py b/Documentation/networking/device_drivers/atm/cxacru-cf.py
index b41d298398c8..b41d298398c8 100644
--- a/Documentation/networking/cxacru-cf.py
+++ b/Documentation/networking/device_drivers/atm/cxacru-cf.py
diff --git a/Documentation/networking/cxacru.rst b/Documentation/networking/device_drivers/atm/cxacru.rst
index 6088af2ffeda..6088af2ffeda 100644
--- a/Documentation/networking/cxacru.rst
+++ b/Documentation/networking/device_drivers/atm/cxacru.rst
diff --git a/Documentation/networking/fore200e.rst b/Documentation/networking/device_drivers/atm/fore200e.rst
index 55df9ec09ac8..55df9ec09ac8 100644
--- a/Documentation/networking/fore200e.rst
+++ b/Documentation/networking/device_drivers/atm/fore200e.rst
diff --git a/Documentation/networking/device_drivers/atm/index.rst b/Documentation/networking/device_drivers/atm/index.rst
new file mode 100644
index 000000000000..7b593f031a60
--- /dev/null
+++ b/Documentation/networking/device_drivers/atm/index.rst
@@ -0,0 +1,20 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Asynchronous Transfer Mode (ATM) Device Drivers
+===============================================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ cxacru
+ fore200e
+ iphase
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/iphase.rst b/Documentation/networking/device_drivers/atm/iphase.rst
index 92d9b757d75a..388c7101e2cb 100644
--- a/Documentation/networking/iphase.rst
+++ b/Documentation/networking/device_drivers/atm/iphase.rst
@@ -4,7 +4,7 @@
ATM (i)Chip IA Linux Driver Source
==================================
- READ ME FISRT
+ READ ME FIRST
--------------------------------------------------------------------------------
diff --git a/Documentation/networking/device_drivers/cable/index.rst b/Documentation/networking/device_drivers/cable/index.rst
new file mode 100644
index 000000000000..cce3c4392972
--- /dev/null
+++ b/Documentation/networking/device_drivers/cable/index.rst
@@ -0,0 +1,18 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Cable Modem Device Drivers
+==========================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ sb1000
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/sb1000.rst b/Documentation/networking/device_drivers/cable/sb1000.rst
index c8582ca4034d..c8582ca4034d 100644
--- a/Documentation/networking/device_drivers/sb1000.rst
+++ b/Documentation/networking/device_drivers/cable/sb1000.rst
diff --git a/Documentation/networking/device_drivers/can/can327.rst b/Documentation/networking/device_drivers/can/can327.rst
new file mode 100644
index 000000000000..b87bfbe5d51c
--- /dev/null
+++ b/Documentation/networking/device_drivers/can/can327.rst
@@ -0,0 +1,331 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-3-Clause)
+
+can327: ELM327 driver for Linux SocketCAN
+==========================================
+
+Authors
+--------
+
+Max Staudt <max@enpas.org>
+
+
+
+Motivation
+-----------
+
+This driver aims to lower the initial cost for hackers interested in
+working with CAN buses.
+
+CAN adapters are expensive, few, and far between.
+ELM327 interfaces are cheap and plentiful.
+Let's use ELM327s as CAN adapters.
+
+
+
+Introduction
+-------------
+
+This driver is an effort to turn abundant ELM327 based OBD interfaces
+into full fledged (as far as possible) CAN interfaces.
+
+Since the ELM327 was never meant to be a stand alone CAN controller,
+the driver has to switch between its modes as quickly as possible in
+order to fake full-duplex operation.
+
+As such, can327 is a best effort driver. However, this is more than
+enough to implement simple request-response protocols (such as OBD II),
+and to monitor broadcast messages on a bus (such as in a vehicle).
+
+Most ELM327s come as nondescript serial devices, attached via USB or
+Bluetooth. The driver cannot recognize them by itself, and as such it
+is up to the user to attach it in the form of a TTY line discipline
+(similar to PPP, SLIP, slcan, ...).
+
+This driver is meant for ELM327 versions 1.4b and up, see below for
+known limitations in older controllers and clones.
+
+
+
+Data sheet
+-----------
+
+The official data sheets can be found at ELM electronics' home page:
+
+ https://www.elmelectronics.com/
+
+
+
+How to attach the line discipline
+----------------------------------
+
+Every ELM327 chip is factory programmed to operate at a serial setting
+of 38400 baud, 8 data bits, no parity, 1 stop bit.
+
+If you have kept this default configuration, the line discipline can
+be attached on a command prompt as follows::
+
+ sudo ldattach \
+ --debug \
+ --speed 38400 \
+ --eightbits \
+ --noparity \
+ --onestopbit \
+ --iflag -ICRNL,INLCR,-IXOFF \
+ 30 \
+ /dev/ttyUSB0
+
+To change the ELM327's serial settings, please refer to its data
+sheet. This needs to be done before attaching the line discipline.
+
+Once the ldisc is attached, the CAN interface starts out unconfigured.
+Set the speed before starting it::
+
+ # The interface needs to be down to change parameters
+ sudo ip link set can0 down
+ sudo ip link set can0 type can bitrate 500000
+ sudo ip link set can0 up
+
+500000 bit/s is a common rate for OBD-II diagnostics.
+If you're connecting straight to a car's OBD port, this is the speed
+that most cars (but not all!) expect.
+
+After this, you can set out as usual with candump, cansniffer, etc.
+
+
+
+How to check the controller version
+------------------------------------
+
+Use a terminal program to attach to the controller.
+
+After issuing the "``AT WS``" command, the controller will respond with
+its version::
+
+ >AT WS
+
+
+ ELM327 v1.4b
+
+ >
+
+Note that clones may claim to be any version they like.
+It is not indicative of their actual feature set.
+
+
+
+
+Communication example
+----------------------
+
+This is a short and incomplete introduction on how to talk to an ELM327.
+It is here to guide understanding of the controller's and the driver's
+limitations (listed below) as well as manual testing.
+
+
+The ELM327 has two modes:
+
+- Command mode
+- Reception mode
+
+In command mode, it expects one command per line, terminated by CR.
+By default, the prompt is a "``>``", after which a command can be
+entered::
+
+ >ATE1
+ OK
+ >
+
+The init script in the driver switches off several configuration options
+that are only meaningful in the original OBD scenario the chip is meant
+for, and are actually a hindrance for can327.
+
+
+When a command is not recognized, such as by an older version of the
+ELM327, a question mark is printed as a response instead of OK::
+
+ >ATUNKNOWN
+ ?
+ >
+
+At present, can327 does not evaluate this response. See the section
+below on known limitations for details.
+
+
+When a CAN frame is to be sent, the target address is configured, after
+which the frame is sent as a command that consists of the data's hex
+dump::
+
+ >ATSH123
+ OK
+ >DEADBEEF12345678
+ OK
+ >
+
+The above interaction sends the SFF frame "``DE AD BE EF 12 34 56 78``"
+with (11 bit) CAN ID ``0x123``.
+For this to function, the controller must be configured for SFF sending
+mode (using "``AT PB``", see code or datasheet).
+
+
+Once a frame has been sent and wait-for-reply mode is on (``ATR1``,
+configured on ``listen-only=off``), or when the reply timeout expires
+and the driver sets the controller into monitoring mode (``ATMA``),
+the ELM327 will send one line for each received CAN frame, consisting
+of CAN ID, DLC, and data::
+
+ 123 8 DEADBEEF12345678
+
+For EFF (29 bit) CAN frames, the address format is slightly different,
+which can327 uses to tell the two apart::
+
+ 12 34 56 78 8 DEADBEEF12345678
+
+The ELM327 will receive both SFF and EFF frames - the current CAN
+config (``ATPB``) does not matter.
+
+
+If the ELM327's internal UART sending buffer runs full, it will abort
+the monitoring mode, print "BUFFER FULL" and drop back into command
+mode. Note that in this case, unlike with other error messages, the
+error message may appear on the same line as the last (usually
+incomplete) data frame::
+
+ 12 34 56 78 8 DEADBEEF123 BUFFER FULL
+
+
+
+Known limitations of the controller
+------------------------------------
+
+- Clone devices ("v1.5" and others)
+
+ Sending RTR frames is not supported; such frames are dropped silently.
+
+ Receiving RTR with DLC 8 will appear to be a regular frame with
+ the last received frame's DLC and payload.
+
+ "``AT CSM``" (CAN Silent Monitoring, i.e. don't send CAN ACKs) is
+ not supported, and is hard coded to ON. Thus, frames are not ACKed
+ while listening: "``AT MA``" (Monitor All) will always be "silent".
+ However, immediately after sending a frame, the ELM327 will be in
+ "receive reply" mode, in which it *does* ACK any received frames.
+ Once the bus goes silent, or an error occurs (such as BUFFER FULL),
+ or the receive reply timeout runs out, the ELM327 will end reply
+ reception mode on its own and can327 will fall back to "``AT MA``"
+ in order to keep monitoring the bus.
+
+ Other limitations may apply, depending on the clone and the quality
+ of its firmware.
+
+
+- All versions
+
+ No full duplex operation is supported. The driver will switch
+ between input/output mode as quickly as possible.
+
+ The length of outgoing RTR frames cannot be set. In fact, some
+ clones (tested with one identifying as "``v1.5``") are unable to
+ send RTR frames at all.
+
+ We don't have a way to get real-time notifications on CAN errors.
+ While there is a command (``AT CS``) to retrieve some basic stats,
+ we don't poll it as it would force us to interrupt reception mode.
+
+
+- Versions prior to 1.4b
+
+ These versions do not send CAN ACKs when in monitoring mode (AT MA).
+ However, they do send ACKs while waiting for a reply immediately
+ after sending a frame. The driver maximizes this time to make the
+ controller as useful as possible.
+
+ Starting with version 1.4b, the ELM327 supports the "``AT CSM``"
+ command, and the "listen-only" CAN option will take effect.
+
+
+- Versions prior to 1.4
+
+ These chips do not support the "``AT PB``" command, and thus cannot
+ change bitrate or SFF/EFF mode on-the-fly. This will have to be
+ programmed by the user before attaching the line discipline. See the
+ data sheet for details.
+
+
+- Versions prior to 1.3
+
+ These chips cannot be used at all with can327. They do not support
+ the "``AT D1``" command, which is necessary to avoid parsing conflicts
+ on incoming data, as well as distinction of RTR frame lengths.
+
+ Specifically, this allows for easy distinction of SFF and EFF
+ frames, and to check whether frames are complete. While it is possible
+ to deduce the type and length from the length of the line the ELM327
+ sends us, this method fails when the ELM327's UART output buffer
+ overruns. It may abort sending in the middle of the line, which will
+ then be mistaken for something else.
+
+
+
+Known limitations of the driver
+--------------------------------
+
+- No 8/7 timing.
+
+ ELM327 can only set CAN bitrates that are of the form 500000/n, where
+ n is an integer divisor.
+ However there is an exception: With a separate flag, it may set the
+ speed to be 8/7 of the speed indicated by the divisor.
+ This mode is not currently implemented.
+
+- No evaluation of command responses.
+
+ The ELM327 will reply with OK when a command is understood, and with ?
+ when it is not. The driver does not currently check this, and simply
+ assumes that the chip understands every command.
+ The driver is built such that functionality degrades gracefully
+ nevertheless. See the section on known limitations of the controller.
+
+- No use of hardware CAN ID filtering
+
+ An ELM327's UART sending buffer will easily overflow on heavy CAN bus
+ load, resulting in the "``BUFFER FULL``" message. Using the hardware
+ filters available through "``AT CF xxx``" and "``AT CM xxx``" would be
+ helpful here, however SocketCAN does not currently provide a facility
+ to make use of such hardware features.
+
+
+
+Rationale behind the chosen configuration
+------------------------------------------
+
+``AT E1``
+ Echo on
+
+ We need this to be able to get a prompt reliably.
+
+``AT S1``
+ Spaces on
+
+ We need this to distinguish 11/29 bit CAN addresses received.
+
+ Note:
+ We can usually do this using the line length (odd/even),
+ but this fails if the line is not transmitted fully to
+ the host (BUFFER FULL).
+
+``AT D1``
+ DLC on
+
+ We need this to tell the "length" of RTR frames.
+
+
+
+A note on CAN bus termination
+------------------------------
+
+Your adapter may have resistors soldered in which are meant to terminate
+the bus. This is correct when it is plugged into an OBD-II socket, but
+not helpful when trying to tap into the middle of an existing CAN bus.
+
+If communications don't work with the adapter connected, check for the
+termination resistors on its PCB and try removing them.
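The reception-mode line formats shown in the communication example of can327.rst can be told apart mechanically. The following is a minimal sketch based only on the sample lines in that document (3-digit ID for SFF, four 2-digit groups for EFF, then DLC and data). It is not the driver's actual parser, the name `parse_elm327_line` is hypothetical, and RTR frames are not handled.

```python
def parse_elm327_line(line: str):
    """Parse one reception-mode line into (can_id, is_extended, dlc, data).

    Raises ValueError for truncated or malformed lines, including lines
    cut short by the controller's "BUFFER FULL" message.
    """
    if "BUFFER FULL" in line:
        raise ValueError("controller aborted monitoring: BUFFER FULL")

    fields = line.split()
    if fields and len(fields[0]) == 3:
        id_fields = 1    # SFF: 11-bit ID printed as one 3-digit hex field
    elif len(fields) >= 4 and all(len(f) == 2 for f in fields[:4]):
        id_fields = 4    # EFF: 29-bit ID printed as four 2-digit hex fields
    else:
        raise ValueError("unrecognized CAN ID format")

    if len(fields) <= id_fields:
        raise ValueError("missing DLC field")

    can_id = int("".join(fields[:id_fields]), 16)
    dlc = int(fields[id_fields], 16)
    # Accept the payload either as one hex run or as spaced byte pairs.
    data = bytes.fromhex("".join(fields[id_fields + 1:]))
    if len(data) != dlc:
        raise ValueError("payload length does not match DLC")
    return can_id, id_fields == 4, dlc, data


if __name__ == "__main__":
    print(parse_elm327_line("123 8 DEADBEEF12345678"))
    print(parse_elm327_line("12 34 56 78 8 DEADBEEF12345678"))
```

Checking the payload length against the DLC is what makes the "DLC on" setting useful here: an incomplete line is rejected rather than mistaken for a shorter frame.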
diff --git a/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst b/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst
new file mode 100644
index 000000000000..1661d13174d5
--- /dev/null
+++ b/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst
@@ -0,0 +1,638 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+CTU CAN FD Driver
+=================
+
+Author: Martin Jerabek <martin.jerabek01@gmail.com>
+
+
+About CTU CAN FD IP Core
+------------------------
+
+`CTU CAN FD <https://gitlab.fel.cvut.cz/canbus/ctucanfd_ip_core>`_
+is an open source soft core written in VHDL.
+It originated in 2015 as Ondrej Ille's project
+at the `Department of Measurement <https://meas.fel.cvut.cz/>`_
+of `FEE <http://www.fel.cvut.cz/en/>`_ at `CTU <https://www.cvut.cz/en>`_.
+
+The SocketCAN driver for Xilinx Zynq SoC based MicroZed board
+`Vivado integration <https://gitlab.fel.cvut.cz/canbus/zynq/zynq-can-sja1000-top>`_
+and Intel Cyclone V 5CSEMA4U23C6 based DE0-Nano-SoC Terasic board
+`QSys integration <https://gitlab.fel.cvut.cz/canbus/intel-soc-ctucanfd>`_
+has been developed as well as support for
+`PCIe integration <https://gitlab.fel.cvut.cz/canbus/pcie-ctucanfd>`_ of the core.
+
+In the case of Zynq, the core is connected via the APB system bus, which does
+not have enumeration support, and the device must be specified in Device Tree.
+This kind of device is called a platform device in the kernel and is
+handled by a platform device driver.
+
+The basic functional model of the CTU CAN FD peripheral has been
+accepted into QEMU mainline. See QEMU `CAN emulation support <https://www.qemu.org/docs/master/system/devices/can.html>`_
+for CAN FD buses, host connection and CTU CAN FD core emulation. The development
+version of emulation support can be cloned from ctu-canfd branch of QEMU local
+development `repository <https://gitlab.fel.cvut.cz/canbus/qemu-canbus>`_.
+
+
+About SocketCAN
+---------------
+
+SocketCAN is a standard common interface for CAN devices in the Linux
+kernel. As the name suggests, the bus is accessed via sockets, similarly
+to common network devices. The reasoning behind this is described in
+depth in `Linux SocketCAN <https://www.kernel.org/doc/html/latest/networking/can.html>`_.
+In short, it offers a
+natural way to implement and work with higher layer protocols over CAN,
+in the same way as, e.g., UDP/IP over Ethernet.
+
+Device probe
+~~~~~~~~~~~~
+
+Before going into detail about the structure of a CAN bus device driver,
+let's reiterate how the kernel gets to know about the device at all.
+Some buses, like PCI or PCIe, support device enumeration. That is, when
+the system boots, it discovers all the devices on the bus and reads
+their configuration. The kernel identifies the device via its vendor ID
+and device ID, and if there is a driver registered for this identifier
+combination, its probe method is invoked to populate the driver's
+instance for the given hardware. The situation with USB is similar, except
+that it also allows for device hot-plug.
+
+The situation is different for peripherals which are directly embedded
+in the SoC and connected to an internal system bus (AXI, APB, Avalon,
+and others). These buses do not support enumeration, and thus the kernel
+has to learn about the devices from elsewhere. This is exactly what the
+Device Tree was made for.
+
+Device tree
+~~~~~~~~~~~
+
+An entry in device tree states that a device exists in the system, how
+it is reachable (on which bus it resides) and its configuration –
+register addresses, interrupts and so on. An example of such a device
+tree entry is given below.
+
+::
+
+ / {
+ /* ... */
+ amba: amba {
+ #address-cells = <1>;
+ #size-cells = <1>;
+ compatible = "simple-bus";
+
+ CTU_CAN_FD_0: CTU_CAN_FD@43c30000 {
+ compatible = "ctu,ctucanfd";
+ interrupt-parent = <&intc>;
+ interrupts = <0 30 4>;
+ clocks = <&clkc 15>;
+ reg = <0x43c30000 0x10000>;
+ };
+ };
+ };
+
+
+.. _sec:socketcan:drv:
+
+Driver structure
+~~~~~~~~~~~~~~~~
+
+The driver can be divided into two parts – platform-dependent device
+discovery and set up, and platform-independent CAN network device
+implementation.
+
+.. _sec:socketcan:platdev:
+
+Platform device driver
+^^^^^^^^^^^^^^^^^^^^^^
+
+In the case of Zynq, the core is connected via the AXI system bus, which
+does not have enumeration support, and the device must be specified in
+Device Tree. This kind of device is called a *platform device* in the
+kernel and is handled by a *platform device driver*\ [1]_.
+
+A platform device driver provides the following things:
+
+- A *probe* function
+
+- A *remove* function
+
+- A table of *compatible* devices that the driver can handle
+
+The *probe* function is called exactly once when the device appears (or
+the driver is loaded, whichever happens later). If there are more
+devices handled by the same driver, the *probe* function is called for
+each one of them. Its role is to allocate and initialize resources
+required for handling the device, as well as set up low-level functions
+for the platform-independent layer, e.g., *read_reg* and *write_reg*.
+After that, the driver registers the device to a higher layer, in our
+case as a *network device*.
+
+The *remove* function is called when the device disappears, or the
+driver is about to be unloaded. It serves to free the resources
+allocated in *probe* and to unregister the device from higher layers.
+
+Finally, the table of *compatible* devices states which devices the
+driver can handle. The Device Tree entry ``compatible`` is matched
+against the tables of all *platform drivers*.
+
+.. code:: c
+
+ /* Match table for OF platform binding */
+ static const struct of_device_id ctucan_of_match[] = {
+ { .compatible = "ctu,canfd-2", },
+ { .compatible = "ctu,ctucanfd", },
+ { /* end of list */ },
+ };
+ MODULE_DEVICE_TABLE(of, ctucan_of_match);
+
+ static int ctucan_probe(struct platform_device *pdev);
+ static int ctucan_remove(struct platform_device *pdev);
+
+ static struct platform_driver ctucanfd_driver = {
+ .probe = ctucan_probe,
+ .remove = ctucan_remove,
+ .driver = {
+ .name = DRIVER_NAME,
+ .of_match_table = ctucan_of_match,
+ },
+ };
+ module_platform_driver(ctucanfd_driver);
+
+
+.. _sec:socketcan:netdev:
+
+Network device driver
+^^^^^^^^^^^^^^^^^^^^^
+
+Each network device must support at least these operations:
+
+- Bring the device up: ``ndo_open``
+
+- Bring the device down: ``ndo_close``
+
+- Submit TX frames to the device: ``ndo_start_xmit``
+
+- Signal TX completion and errors to the network subsystem: ISR
+
+- Submit RX frames to the network subsystem: ISR and NAPI
+
+There are two possible event sources: the device and the network
+subsystem. Device events are usually signaled via an interrupt, handled
+in an Interrupt Service Routine (ISR). Handlers for the events
+originating in the network subsystem are then specified in
+``struct net_device_ops``.
+
+When the device is brought up, e.g., by calling ``ip link set can0 up``,
+the driver’s function ``ndo_open`` is called. It should validate the
+interface configuration and configure and enable the device. The
+analogous opposite is ``ndo_close``, called when the device is being
+brought down, be it explicitly or implicitly.
+
+When the system should transmit a frame, it does so by calling
+``ndo_start_xmit``, which enqueues the frame into the device. If the
+device HW queue (FIFO, mailboxes or whatever the implementation is)
+becomes full, the ``ndo_start_xmit`` implementation informs the network
+subsystem that it should stop the TX queue (via ``netif_stop_queue``).
+It is then re-enabled later in ISR when the device has some space
+available again and is able to enqueue another frame.
+
+All the device events are handled in ISR, namely:
+
+#. **TX completion**. When the device successfully finishes transmitting
+ a frame, the frame is echoed locally. On error, an informative error
+ frame [2]_ is sent to the network subsystem instead. In both cases,
+ the software TX queue is resumed so that more frames may be sent.
+
+#. **Error condition**. If something goes wrong (e.g., the device goes
+ bus-off or RX overrun happens), error counters are updated, and
+ informative error frames are enqueued to SW RX queue.
+
+#. **RX buffer not empty**. In this case, read the RX frames and enqueue
+ them to SW RX queue. Usually NAPI is used as a middle layer (see the NAPI section below).
+
+.. _sec:socketcan:napi:
+
+NAPI
+~~~~
+
+The frequency of incoming frames can be high and the overhead to invoke
+the interrupt service routine for each frame can cause significant
+system load. There are multiple mechanisms in the Linux kernel to deal
+with this situation. They evolved over the years of Linux kernel
+development and enhancements. For network devices, the current standard
+is NAPI – *the New API*. It is similar to classical top-half/bottom-half
+interrupt handling in that it only acknowledges the interrupt in the ISR
+and signals that the rest of the processing should be done in softirq
+context. On top of that, it offers the possibility to *poll* for new
+frames for a while. This has a potential to avoid the costly round of
+enabling interrupts, handling an incoming IRQ in ISR, re-enabling the
+softirq and switching context back to softirq.
+
+See :ref:`Documentation/networking/napi.rst <napi>` for more information.
+
+Integrating the core to Xilinx Zynq
+-----------------------------------
+
+The core interfaces a simple subset of the Avalon
+(search for Intel **Avalon Interface Specifications**)
+bus as it was originally used on
+Altera FPGA chips, yet Xilinx natively interfaces with AXI
+(search for ARM **AMBA AXI and ACE Protocol Specification AXI3,
+AXI4, and AXI4-Lite, ACE and ACE-Lite**).
+The most obvious solution would be to use
+an Avalon/AXI bridge or implement some simple conversion entity.
+However, the core’s interface is half-duplex with no handshake
+signaling, whereas AXI is full duplex with two-way signaling. Moreover,
+even AXI-Lite slave interface is quite resource-intensive, and the
+flexibility and speed of AXI are not required for a CAN core.
+
+Thus a much simpler bus was chosen – APB (Advanced Peripheral Bus)
+(search for ARM **AMBA APB Protocol Specification**).
+APB-AXI bridge is directly available in
+Xilinx Vivado, and the interface adaptor entity is just a few simple
+combinatorial assignments.
+
+Finally, to be able to include the core in a block diagram as a custom
+IP, the core, together with the APB interface, has been packaged as a
+Vivado component.
+
+CTU CAN FD Driver design
+------------------------
+
+The general structure of a CAN device driver has already been examined
+in . The next paragraphs provide a more detailed description of the CTU
+CAN FD core driver in particular.
+
+Low-level driver
+~~~~~~~~~~~~~~~~
+
+The core is not intended to be used solely with SocketCAN, and thus it
+is desirable to have an OS-independent low-level driver. This low-level
+driver can then be used in implementations of OS drivers or directly
+either on bare metal or in a user-space application. Another advantage
+is that if the hardware slightly changes, only the low-level driver
+needs to be modified.
+
+The code [3]_ is in part automatically generated and in part written
+manually by the core author, with contributions of the thesis’ author.
+The low-level driver supports operations such as: set bit timing, set
+controller mode, enable/disable, read RX frame, write TX frame, and so
+on.
+
+Configuring bit timing
+~~~~~~~~~~~~~~~~~~~~~~
+
+On CAN, each bit is divided into four segments: SYNC, PROP, PHASE1, and
+PHASE2. Their duration is expressed in multiples of a Time Quantum
+(details in `CAN Specification, Version 2.0 <http://esd.cs.ucr.edu/webres/can20.pdf>`_, chapter 8).
+When configuring
+bitrate, the durations of all the segments (and time quantum) must be
+computed from the bitrate and Sample Point. This is performed
+independently for both the Nominal bitrate and Data bitrate for CAN FD.
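+
+The relation between the segments, the time quantum and the resulting
+bitrate can be sketched as follows. This is only a worked example of the
+arithmetic, assuming that SYNC always lasts one time quantum and that the
+time quantum equals the prescaler divided by the controller clock;
+``bit_segments`` and the helper names are hypothetical and not part of
+SocketCAN.
+
```c
#include <stdint.h>

/* Hypothetical container for one bit timing configuration. */
struct bit_segments {
    uint32_t brp;         /* clock prescaler: one time quantum = brp clock cycles */
    uint32_t prop_seg;    /* PROP duration in time quanta */
    uint32_t phase_seg1;  /* PHASE1 duration in time quanta */
    uint32_t phase_seg2;  /* PHASE2 duration in time quanta */
};

/* Number of time quanta in one bit; SYNC is assumed to be 1 quantum. */
static uint32_t bit_tq_count(const struct bit_segments *s)
{
    return 1 + s->prop_seg + s->phase_seg1 + s->phase_seg2;
}

/* Resulting bitrate in bits per second for a given controller clock. */
static uint32_t bitrate_bps(uint32_t clk_hz, const struct bit_segments *s)
{
    return clk_hz / (s->brp * bit_tq_count(s));
}

/* Sample point in tenths of a percent; it lies between PHASE1 and PHASE2. */
static uint32_t sample_point_permille(const struct bit_segments *s)
{
    return 1000 * (1 + s->prop_seg + s->phase_seg1) / bit_tq_count(s);
}
```
+
+For an assumed 100 MHz clock with ``brp = 10``, ``prop_seg = 7``,
+``phase_seg1 = 8`` and ``phase_seg2 = 4``, one bit takes 20 time quanta,
+which gives 500 kbit/s with the sample point at 80 %.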
+
+SocketCAN is fairly flexible and offers either highly customized
+configuration by setting all the segment durations manually, or a
+convenient configuration by setting just the bitrate and sample point
+(and even that is chosen automatically per Bosch recommendation if not
+specified). However, each CAN controller may have different base clock
+frequency and different width of segment duration registers. The
+algorithm thus needs the minimum and maximum values for the durations
+(and clock prescaler) and tries to optimize the numbers to fit both the
+constraints and the requested parameters.
+
+.. code:: c
+
+    struct can_bittiming_const {
+        char name[16];      /* Name of the CAN controller hardware */
+        __u32 tseg1_min;    /* Time segment 1 = prop_seg + phase_seg1 */
+        __u32 tseg1_max;
+        __u32 tseg2_min;    /* Time segment 2 = phase_seg2 */
+        __u32 tseg2_max;
+        __u32 sjw_max;      /* Synchronisation jump width */
+        __u32 brp_min;      /* Bit-rate prescaler */
+        __u32 brp_max;
+        __u32 brp_inc;
+    };
+
+
+A curious reader will notice that the durations of the segments PROP_SEG
+and PHASE_SEG1 are not determined separately but rather combined and
+then, by default, the resulting TSEG1 is evenly divided between PROP_SEG
+and PHASE_SEG1. In practice, this has virtually no consequences as the
+sample point is between PHASE_SEG1 and PHASE_SEG2. In CTU CAN FD,
+however, the duration registers ``PROP`` and ``PH1`` have different
+widths (6 and 7 bits, respectively), so the auto-computed values might
+overflow the shorter register and must thus be redistributed among the
+two [4]_.
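+
+The redistribution can be sketched as below. This is not the code of the
+low-level driver; ``split_tseg1`` is a hypothetical helper, and the register
+limits are passed in as parameters, so the widths quoted above are only one
+possible input.
+
```c
#include <stdint.h>

/* Hypothetical result type: TSEG1 split into its two register values. */
struct tseg1_split {
    uint32_t prop_seg;
    uint32_t phase_seg1;
};

/*
 * Split TSEG1 evenly, then move whatever overflows one register into
 * the other one. Returns 0 on success, -1 if TSEG1 cannot fit at all.
 */
static int split_tseg1(uint32_t tseg1, uint32_t prop_max, uint32_t ph1_max,
                       struct tseg1_split *out)
{
    uint32_t prop = tseg1 / 2;
    uint32_t ph1 = tseg1 - prop;

    if (tseg1 > prop_max + ph1_max)
        return -1;

    if (prop > prop_max) {
        /* PROP overflows: give the excess to PHASE1. */
        ph1 += prop - prop_max;
        prop = prop_max;
    } else if (ph1 > ph1_max) {
        /* PHASE1 overflows: give the excess to PROP. */
        prop += ph1 - ph1_max;
        ph1 = ph1_max;
    }

    out->prop_seg = prop;
    out->phase_seg1 = ph1;
    return 0;
}
```
+
+Assuming limits of 63 and 127 (a 6-bit and a 7-bit field holding the raw
+value), a TSEG1 of 150 is first split into 75 + 75 and then redistributed to
+63 + 87. The sum, and therefore the sample point, does not change.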
+
+Handling RX
+~~~~~~~~~~~
+
+Frame reception is handled in a NAPI queue, which is enabled from the ISR when
+the RXNE (RX FIFO Not Empty) bit is set. Frames are read one by one
+until either no frame is left in the RX FIFO or the maximum work quota
+has been reached for the NAPI poll run (see :ref:`sec:socketcan:napi`).
+Each frame is then passed to the network interface RX queue.
+
+An incoming frame may be either a CAN 2.0 frame or a CAN FD frame. The
+way to distinguish between these two in the kernel is to allocate either
+``struct can_frame`` or ``struct canfd_frame``, the two having different
+sizes. In the controller, the information about the frame type is stored
+in the first word of RX FIFO.
+
+This brings us to a chicken-and-egg problem: we want to allocate the ``skb``
+for the frame, and only if it succeeds, fetch the frame from the FIFO;
+otherwise keep it there for later. But to be able to allocate the
+correct ``skb``, we have to fetch the first word of the FIFO. There are
+several possible solutions:
+
+#. Read the word, then allocate. If it fails, discard the rest of the
+   frame. When the system is low on memory, the situation is bad anyway.
+
+#. Always allocate an ``skb`` big enough for an FD frame beforehand. Then
+   tweak the ``skb`` internals to look like it has been allocated for
+   the smaller CAN 2.0 frame.
+
+#. Add an option to peek into the FIFO instead of consuming the word.
+
+#. If the allocation fails, store the read word into the driver’s data. On
+   the next try, use the stored word instead of reading it again.
+
+Option 1 is simple enough, but not very satisfying if we could do
+better. Option 2 is not acceptable, as it would require modifying the
+private state of an integral kernel structure. The slightly higher
+memory consumption is just a virtual cherry on top of the “cake”. Option
+3 requires non-trivial HW changes and is not ideal from the HW point of
+view.
+
+Option 4 seems like a good compromise, with its disadvantage being that
+a partial frame may stay in the FIFO for a prolonged time. Nonetheless,
+there may be just one owner of the RX FIFO, and thus no one else should
+see the partial frame (disregarding some exotic debugging scenarios).
+Besides, the driver resets the core on its initialization, so the
+partial frame cannot be “adopted” either. In the end, option 4 was
+selected [5]_.
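+
+Option 4 can be sketched in user space as follows. The FIFO, the allocator
+and the layout of the first word are all mocked: ``mock_rx``, ``rx_one``,
+the ``MOCK_FFW_FDF`` flag and the word-count field are assumptions made up
+for this example and do not reflect the real register layout.
+
```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_FFW_FDF 0x1u  /* hypothetical "this is a CAN FD frame" flag */

struct mock_rx {
    const uint32_t *fifo;  /* words still held by the mocked RX FIFO */
    size_t fifo_len;
    size_t rd;             /* index of the next word to be read */
    bool have_ffw;         /* a first word was read but not consumed yet */
    uint32_t ffw;          /* the stored first word */
    bool alloc_ok;         /* lets the caller force an allocation failure */
    bool last_is_fd;       /* type of the last delivered frame */
    unsigned int frames;   /* number of delivered frames */
};

/* Reading a word consumes it, as in a real FIFO. */
static uint32_t fifo_read(struct mock_rx *rx)
{
    return rx->fifo[rx->rd++];
}

/* Returns 1 if a frame was delivered, 0 if the FIFO is empty, -1 on allocation failure. */
static int rx_one(struct mock_rx *rx)
{
    uint32_t ffw;
    size_t nwords, i;

    if (rx->have_ffw) {
        ffw = rx->ffw;  /* reuse the word kept from the failed attempt */
    } else {
        if (rx->rd >= rx->fifo_len)
            return 0;
        ffw = fifo_read(rx);
    }

    if (!rx->alloc_ok) {
        /* Keep the first word; the rest of the frame stays in the FIFO. */
        rx->ffw = ffw;
        rx->have_ffw = true;
        return -1;
    }

    rx->have_ffw = false;
    rx->last_is_fd = (ffw & MOCK_FFW_FDF) != 0;  /* selects the frame type */
    nwords = (ffw >> 8) & 0xffu;                 /* assumed payload word count */
    for (i = 0; i < nwords; i++)
        (void)fifo_read(rx);
    rx->frames++;
    return 1;
}
```
+
+In this model a failed allocation consumes exactly one word from the FIFO,
+and repeated failures consume nothing more; once allocation succeeds, the
+stored word is used to pick the frame type and the rest of the frame is read.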
+
+.. _subsec:ctucanfd:rxtimestamp:
+
+Timestamping RX frames
+^^^^^^^^^^^^^^^^^^^^^^
+
+The CTU CAN FD core reports the exact timestamp when the frame has been
+received. The timestamp is by default captured at the sample point of
+the last bit of EOF but is configurable to be captured at the SOF bit.
+The timestamp source is external to the core and may be up to 64 bits
+wide. At the time of writing, passing the timestamp from kernel to
+userspace is not yet implemented, but is planned in the future.
+
+Handling TX
+~~~~~~~~~~~
+
+The CTU CAN FD core has 4 independent TX buffers, each with its own
+state and priority. When the core wants to transmit, a TX buffer in
+Ready state with the highest priority is selected.
+
+The priorities are 3-bit numbers in register TX_PRIORITY
+(nibble-aligned). This should be flexible enough for most use cases.
+SocketCAN, however, supports only one FIFO queue for outgoing
+frames [6]_. The buffer priorities may be used to simulate the FIFO
+behavior by assigning each buffer a distinct priority and *rotating* the
+priorities after a frame transmission is completed.
+
+In addition to priority rotation, the SW must maintain head and tail
+pointers into the FIFO formed by the TX buffers to be able to determine
+which buffer should be used for next frame (``txb_head``) and which
+should be the first completed one (``txb_tail``). The actual buffer
+indices are (obviously) modulo 4 (number of TX buffers), but the
+pointers must be at least one bit wider to be able to distinguish
+between FIFO full and FIFO empty – in this situation,
+:math:`txb\_head \equiv txb\_tail\ (\textrm{mod}\ 4)`. An example of how
+the FIFO is maintained, together with priority rotation, is depicted in
+the following tables.
+
+|
+
++------+---+---+---+---+
+| TXB# | 0 | 1 | 2 | 3 |
++======+===+===+===+===+
+| Seq  | A | B | C |   |
++------+---+---+---+---+
+| Prio | 7 | 6 | 5 | 4 |
++------+---+---+---+---+
+|      |   | T |   | H |
++------+---+---+---+---+
+
+|
+
++------+---+---+---+---+
+| TXB# | 0 | 1 | 2 | 3 |
++======+===+===+===+===+
+| Seq  |   | B | C |   |
++------+---+---+---+---+
+| Prio | 4 | 7 | 6 | 5 |
++------+---+---+---+---+
+|      |   | T |   | H |
++------+---+---+---+---+
+
+|
+
++------+---+---+---+---+----+
+| TXB# | 0 | 1 | 2 | 3 | 0’ |
++======+===+===+===+===+====+
+| Seq  | E | B | C | D |    |
++------+---+---+---+---+----+
+| Prio | 4 | 7 | 6 | 5 |    |
++------+---+---+---+---+----+
+|      |   | T |   |   | H  |
++------+---+---+---+---+----+
+
+|
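+
+The bookkeeping can be sketched in user space as follows. The sketch assumes
+that the four priorities live in one 16-bit value with TXB0 in the lowest
+nibble, so that rotating the value by one nibble rotates the priorities;
+``mock_tx`` and the helper functions are hypothetical names, and free-running
+32-bit counters stand in for the "one bit wider" pointers.
+
```c
#include <stdbool.h>
#include <stdint.h>

#define TXB_COUNT 4u

struct mock_tx {
    uint32_t txb_head;  /* free-running: next buffer to fill */
    uint32_t txb_tail;  /* free-running: oldest buffer not yet completed */
    uint16_t prio_reg;  /* four nibble-aligned priorities, TXB0 in bits 3..0 */
};

static void tx_init(struct mock_tx *tx)
{
    tx->txb_head = 0;
    tx->txb_tail = 0;
    tx->prio_reg = 0x4567;  /* TXB0 = 7, TXB1 = 6, TXB2 = 5, TXB3 = 4 */
}

/* Head and tail are congruent modulo 4 in both cases; the difference tells them apart. */
static bool tx_full(const struct mock_tx *tx)
{
    return tx->txb_head - tx->txb_tail == TXB_COUNT;
}

static bool tx_empty(const struct mock_tx *tx)
{
    return tx->txb_head == tx->txb_tail;
}

static unsigned int txb_prio(const struct mock_tx *tx, unsigned int txb)
{
    return (tx->prio_reg >> (4 * txb)) & 0x7u;
}

/* Returns the index of the buffer to fill, or -1 if all buffers are in use. */
static int tx_submit(struct mock_tx *tx)
{
    if (tx_full(tx))
        return -1;
    return (int)(tx->txb_head++ % TXB_COUNT);
}

/* Returns the index of the completed buffer, or -1 if none is pending. */
static int tx_complete(struct mock_tx *tx)
{
    int txb;

    if (tx_empty(tx))
        return -1;
    txb = (int)(tx->txb_tail++ % TXB_COUNT);
    /* Rotate by one nibble: the finished buffer drops to the lowest priority. */
    tx->prio_reg = (uint16_t)((tx->prio_reg << 4) | (tx->prio_reg >> 12));
    return txb;
}
```
+
+Submitting frames A, B and C, completing A, and then submitting D and E
+reproduces the priorities 4, 7, 6, 5 from the tables above and ends with a
+full FIFO in which head and tail refer to the same buffer index.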
+
+.. kernel-figure:: fsm_txt_buffer_user.svg
+
+ TX Buffer states with possible transitions
+
+.. _subsec:ctucanfd:txtimestamp:
+
+Timestamping TX frames
+^^^^^^^^^^^^^^^^^^^^^^
+
+When submitting a frame to a TX buffer, one may specify the timestamp at
+which the frame should be transmitted. The frame transmission may start
+later, but not sooner. Note that the timestamp does not participate in
+buffer prioritization – that is decided solely by the mechanism
+described above.
+
+Support for time-based packet transmission was merged in Linux
+v4.19 (see `Time-based packet transmission <https://lwn.net/Articles/748879/>`_),
+but it remains to be researched
+whether this functionality will be practical for CAN.
+
+Similarly to retrieving the timestamp of RX frames, the core also
+supports retrieving the timestamp of TX frames – that is the time when
+the frame was successfully delivered. The particulars are very similar
+to timestamping RX frames and are described in :ref:`subsec:ctucanfd:rxtimestamp`.
+
+Handling RX buffer overrun
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a received frame no longer fits into the hardware RX FIFO in its
+entirety, the RX FIFO overrun flag (STATUS[DOR]) is set and the Data Overrun
+Interrupt (DOI) is triggered. When servicing the interrupt, care must be
+taken first to clear the DOR flag (via COMMAND[CDO]) and only after that
+to clear the DOI interrupt flag. Otherwise, the interrupt would be
+immediately [7]_ rearmed.
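+
+The required ordering can be illustrated with a small mock. The model below
+is a simplification that assumes only what the paragraph above states, namely
+that DOI is asserted again for as long as DOR is set; ``mock_core`` and the
+function names are hypothetical.
+
```c
#include <stdbool.h>

/* Hypothetical two-flag model of the core. */
struct mock_core {
    bool dor;  /* models STATUS[DOR], the data overrun flag */
    bool doi;  /* models the pending Data Overrun Interrupt */
};

/* Models writing COMMAND[CDO]: clears the overrun flag. */
static void cmd_clear_data_overrun(struct mock_core *core)
{
    core->dor = false;
}

/* Models clearing DOI: it is rearmed at once while DOR is still set. */
static void int_clear_doi(struct mock_core *core)
{
    core->doi = core->dor;
}

/* Correct order: clear the flag first, the interrupt second. */
static void handle_overrun(struct mock_core *core)
{
    cmd_clear_data_overrun(core);
    int_clear_doi(core);
}
```
+
+Clearing only the interrupt leaves ``doi`` set in this model, whereas
+``handle_overrun`` leaves both flags cleared.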
+
+**Note**: During development, it was discussed whether the internal HW
+pipelining could disrupt this clear sequence and whether an additional
+dummy cycle is necessary between clearing the flag and the interrupt. On
+the Avalon interface, it indeed proved to be the case, but APB is
+safe because it uses 2-cycle transactions. Essentially, the DOR flag
+would be cleared, but the DOI register’s Preset input would still be high
+in the cycle when the DOI clear request would also be applied (by setting
+the register’s Reset input high). As Set had higher priority than Reset,
+the DOI flag would not be reset. This has already been fixed by swapping
+the Set/Reset priority (see issue #187).
+
+Reporting Error Passive and Bus Off conditions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It may be desirable to report when the node reaches *Error Passive*,
+*Error Warning*, and *Bus Off* conditions. The driver is notified about
+error state change by an interrupt (EPI, EWLI), and then proceeds to
+determine the core’s error state by reading its error counters.
+
+There is, however, a slight race condition here – there is a delay
+between the time when the state transition occurs (and the interrupt is
+triggered) and when the error counters are read. When EPI is received,
+the node may be either *Error Passive* or *Bus Off*. If the node goes
+*Bus Off*, it obviously remains in the state until it is reset.
+Otherwise, the node is *or was* *Error Passive*. However, it may happen
+that the read state is *Error Warning* or even *Error Active*. It may be
+unclear whether and what exactly to report in that case, but I
+personally entertain the idea that the past error condition should still
+be reported. Similarly, when EWLI is received but the state is later
+detected to be *Error Passive*, *Error Passive* should be reported.
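+
+The reporting rule suggested above can be sketched as follows. The counter
+thresholds are assumptions based on the usual CAN fault confinement limits
+(warning at 96, error passive above 127, bus off above 255); in the real core
+the warning limit may be configured differently, and ``err_state`` and both
+functions are hypothetical names.
+
```c
/* Ordered by severity so that states can be compared. */
enum err_state {
    ERR_ACTIVE = 0,
    ERR_WARNING,
    ERR_PASSIVE,
    ERR_BUS_OFF
};

/* Error state implied by the current TX and RX error counters. */
static enum err_state state_from_counters(unsigned int tec, unsigned int rec)
{
    if (tec > 255)
        return ERR_BUS_OFF;
    if (tec > 127 || rec > 127)
        return ERR_PASSIVE;
    if (tec >= 96 || rec >= 96)
        return ERR_WARNING;
    return ERR_ACTIVE;
}

/*
 * `irq_state` is the state implied by the interrupt that fired (Error
 * Passive for EPI, Error Warning for EWLI). The more severe of the two
 * states is reported, so a condition that has already passed is not lost.
 */
static enum err_state state_to_report(enum err_state irq_state,
                                      unsigned int tec, unsigned int rec)
{
    enum err_state now = state_from_counters(tec, rec);

    return now > irq_state ? now : irq_state;
}
```
+
+For example, if EPI fired but the counters have meanwhile dropped to 90,
+*Error Passive* is still reported; if EWLI fired and the counters read 130,
+*Error Passive* is reported instead of *Error Warning*.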
+
+
+CTU CAN FD Driver Sources Reference
+-----------------------------------
+
+.. kernel-doc:: drivers/net/can/ctucanfd/ctucanfd.h
+ :internal:
+
+.. kernel-doc:: drivers/net/can/ctucanfd/ctucanfd_base.c
+ :internal:
+
+.. kernel-doc:: drivers/net/can/ctucanfd/ctucanfd_pci.c
+ :internal:
+
+.. kernel-doc:: drivers/net/can/ctucanfd/ctucanfd_platform.c
+ :internal:
+
+CTU CAN FD IP Core and Driver Development Acknowledgment
+---------------------------------------------------------
+
+* Ondrej Ille <ondrej.ille@gmail.com>
+
+ * started the project as a student at Department of Measurement, FEE, CTU
+ * invested a great amount of personal time and enthusiasm in the project over the years
+ * worked on more funded tasks
+
+* `Department of Measurement <https://meas.fel.cvut.cz/>`_,
+ `Faculty of Electrical Engineering <http://www.fel.cvut.cz/en/>`_,
+ `Czech Technical University <https://www.cvut.cz/en>`_
+
+ * is the main investor into the project over many years
+ * uses the project in their CAN/CAN FD diagnostics framework for `Skoda Auto <https://www.skoda-auto.cz/>`_
+
+* `Digiteq Automotive <https://www.digiteqautomotive.com/en>`_
+
+ * funding of the project CAN FD Open Cores Support Linux Kernel Based Systems
+ * negotiated and paid CTU to allow public access to the project
+ * provided additional funding of the work
+
+* `Department of Control Engineering <https://control.fel.cvut.cz/en>`_,
+ `Faculty of Electrical Engineering <http://www.fel.cvut.cz/en/>`_,
+ `Czech Technical University <https://www.cvut.cz/en>`_
+
+ * solving the project CAN FD Open Cores Support Linux Kernel Based Systems
+ * providing GitLab management
+ * virtual servers and computational power for continuous integration
+ * providing hardware for HIL continuous integration tests
+
+* `PiKRON Ltd. <http://pikron.com/>`_
+
+ * minor funding to initiate preparation of the project open-sourcing
+
+* Petr Porazil <porazil@pikron.com>
+
+ * design of PCIe transceiver addon board and assembly of boards
+ * design and assembly of MZ_APO baseboard for MicroZed/Zynq based system
+
+* Martin Jerabek <martin.jerabek01@gmail.com>
+
+ * Linux driver development
+ * continuous integration platform architect and GHDL updates
+ * thesis `Open-source and Open-hardware CAN FD Protocol Support <https://dspace.cvut.cz/bitstream/handle/10467/80366/F3-DP-2019-Jerabek-Martin-Jerabek-thesis-2019-canfd.pdf>`_
+
+* Jiri Novak <jnovak@fel.cvut.cz>
+
+ * project initiation, management and use at Department of Measurement, FEE, CTU
+
+* Pavel Pisa <pisa@cmp.felk.cvut.cz>
+
+ * initiate open-sourcing, project coordination, management at Department of Control Engineering, FEE, CTU
+
+* Jaroslav Beran <jara.beran@gmail.com>
+
+ * system integration for Intel SoC, core and driver testing and updates
+
+* Carsten Emde (`OSADL <https://www.osadl.org/>`_)
+
+ * provided OSADL expertise to discuss IP core licensing
+ * pointed out a possible deadlock between the LGPL and a possible CAN bus patent case, which led to relicensing the IP core design under a BSD-like license
+
+* Reiner Zitzmann and Holger Zeltwanger (`CAN in Automation <https://www.can-cia.org/>`_)
+
+ * provided suggestions and help to inform community about the project and invited us to events focused on CAN bus future development directions
+
+* Jan Charvat
+
+ * implemented CTU CAN FD functional model for QEMU which has been integrated into QEMU mainline (`docs/system/devices/can.rst <https://www.qemu.org/docs/master/system/devices/can.html>`_)
+ * Bachelor thesis Model of CAN FD Communication Controller for QEMU Emulator
+
+Notes
+-----
+
+
+.. [1]
+ Other buses have their own specific driver interface to set up the
+ device.
+
+.. [2]
+ Not to be mistaken with CAN Error Frame. This is a ``can_frame`` with
+ ``CAN_ERR_FLAG`` set and some error info in its ``data`` field.
+
+.. [3]
+ Available in CTU CAN FD repository
+ `<https://gitlab.fel.cvut.cz/canbus/ctucanfd_ip_core>`_
+
+.. [4]
+ As is done in the low-level driver functions
+ ``ctucan_hw_set_nom_bittiming`` and
+ ``ctucan_hw_set_data_bittiming``.
+
+.. [5]
+ At the time of writing this thesis, option 1 is still being used and
+ the modification is queued in gitlab issue #222
+
+.. [6]
+ Strictly speaking, multiple CAN TX queues are supported since v4.19
+ `can: enable multi-queue for SocketCAN devices <https://lore.kernel.org/patchwork/patch/913526/>`_ but no mainline driver is using
+ them yet.
+
+.. [7]
+ Or rather in the next clock cycle
diff --git a/Documentation/networking/device_drivers/can/ctu/fsm_txt_buffer_user.svg b/Documentation/networking/device_drivers/can/ctu/fsm_txt_buffer_user.svg
new file mode 100644
index 000000000000..381323423b4c
--- /dev/null
+++ b/Documentation/networking/device_drivers/can/ctu/fsm_txt_buffer_user.svg
@@ -0,0 +1,151 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<svg width="113.611mm" height="86.6873mm" version="1.1" viewBox="0 0 113.611 86.6873" xmlns="http://www.w3.org/2000/svg" xmlns:cc="http://creativecommons.org/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
+ <defs>
+ <marker id="marker3667" overflow="visible" orient="auto">
+ <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker3517" overflow="visible" orient="auto">
+ <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker3373" overflow="visible" orient="auto">
+ <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker3199" overflow="visible" orient="auto">
+ <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker3037" overflow="visible" orient="auto">
+ <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker2779" overflow="visible" orient="auto">
+ <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker2477" overflow="visible" orient="auto">
+ <path transform="scale(.6) rotate(180) translate(0)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker2074" overflow="visible" orient="auto">
+ <path transform="scale(.6) rotate(180) translate(0)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker1964" overflow="visible" orient="auto">
+ <path transform="scale(.6) rotate(180) translate(0)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker1856" overflow="visible" orient="auto">
+ <path transform="scale(.6) rotate(180) translate(0)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="Arrow2Mend" overflow="visible" orient="auto">
+ <path transform="scale(.6) rotate(180) translate(0)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <filter id="filter1204" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ <marker id="marker2074-3" overflow="visible" orient="auto">
+ <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <filter id="filter1204-6" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ <filter id="filter1204-6-9" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ <filter id="filter1204-6-2" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ <filter id="filter1204-6-2-9" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ <filter id="filter1204-6-2-9-4" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ <filter id="filter1204-6-2-9-1" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ <filter id="filter1204-6-2-9-1-3" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ <filter id="filter1204-6-2-9-1-3-1" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ </defs>
+ <metadata>
+ <rdf:RDF>
+ <cc:Work rdf:about="">
+ <dc:format>image/svg+xml</dc:format>
+ <dc:type rdf:resource="http://purl.org/dc/dcmitype/StillImage"/>
+ <dc:title/>
+ </cc:Work>
+ </rdf:RDF>
+ </metadata>
+ <g transform="translate(-49.0277 -104.823)">
+ <g>
+ <path d="m130.534 165.429h-71.1816v-17.5315" fill="none" marker-end="url(#marker2477)" stroke="#28a4ff" stroke-width=".6"/>
+ <path d="m145.034 122.959v-11.5914h-43.1215" fill="none" marker-end="url(#marker3037)" stroke="#28a4ff" stroke-width=".6"/>
+ <rect x="130.679" y="122.933" width="28.2965" height="45.2319" rx="0" ry="0" fill="#e5e5e5" stroke="#717171" stroke-linecap="square" stroke-width=".499999"/>
+ <path d="m102.044 116.236h23.3126l-0.13388 18.8185h19.9383v3.66603" fill="none" marker-end="url(#marker3199)" stroke="#28a4ff" stroke-width=".6"/>
+ <path d="m59.5006 138.391v-24.2517h20.6338" fill="none" marker-end="url(#marker2779)" stroke="#28a4ff" stroke-width=".6"/>
+ <rect x="78.1389" y="126.411" width="28.0037" height="35.0443" rx="0" ry="0" fill="#e5e5e5" stroke="#717171" stroke-linecap="square" stroke-width=".5"/>
+ </g>
+ <g fill="#ffcb35" stroke="#000" stroke-linecap="square">
+ <ellipse cx="92.1408" cy="114.239" rx="10.8866" ry="4.39308" stroke-width=".5"/>
+ <ellipse cx="92.1408" cy="134.185" rx="10.8866" ry="4.39308" stroke-width=".499999"/>
+ <ellipse cx="92.1408" cy="152.199" rx="10.8866" ry="4.39308" stroke-width=".499999"/>
+ </g>
+ <g fill="#28a4ff" stroke="#000" stroke-linecap="square" stroke-width=".499999">
+ <ellipse cx="144.827" cy="143.316" rx="10.8866" ry="4.39308"/>
+ <ellipse cx="144.827" cy="159.143" rx="10.8866" ry="4.39308"/>
+ <ellipse cx="59.4364" cy="142.823" rx="7.36455" ry="4.39308"/>
+ <ellipse cx="144.827" cy="129.196" rx="10.8866" ry="4.39308"/>
+ <ellipse cx="143.077" cy="180.53" rx="10.8866" ry="4.39308"/>
+ </g>
+ <ellipse cx="110.386" cy="180.53" rx="10.8866" ry="4.39308" fill="#ffcb35" stroke="#000" stroke-linecap="square" stroke-width=".499999"/>
+ <text x="110.90907" y="179.42688" font-size="3.175px" xml:space="preserve"><tspan x="110.90907" y="179.42688" dy="0.60000002" text-align="center" text-anchor="middle">Accessible</tspan><tspan x="110.90907" y="183.39563"><tspan font-size="3.175px" text-align="center" text-anchor="middle">for S</tspan>W</tspan></text>
+ <text x="143.5869" y="179.52795" xml:space="preserve"><tspan x="143.5869" y="179.52795" dy="1 0 0 0 0 0" font-family="sans-serif" font-size="2.82222px" text-align="center" text-anchor="middle" style="font-variant-caps:normal;font-variant-east-asian:normal;font-variant-ligatures:normal;font-variant-numeric:normal">Inaccessible</tspan><tspan x="143.5869" y="183.36786" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">for S</tspan>W</tspan></text>
+ <g font-size="3.175px">
+ <text x="91.95018" y="115.29005" xml:space="preserve"><tspan x="91.95018" y="115.29005" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">Ready</tspan></tspan></text>
+ <text x="145.25127" y="130.49019" xml:space="preserve"><tspan x="145.25127" y="130.49019" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">TX OK</tspan></tspan></text>
+ <text x="145.31845" y="144.43121" xml:space="preserve"><tspan x="145.31845" y="144.43121" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">Aborted</tspan></tspan></text>
+ <text x="145.40399" y="160.36035" xml:space="preserve"><tspan x="145.40399" y="160.36035" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">TX failed</tspan></tspan></text>
+ <text x="91.823967" y="133.53941" text-align="center" text-anchor="middle" style="line-height:0.9" xml:space="preserve"><tspan x="91.823967" y="133.53941" text-align="center"><tspan font-size="3.175px" text-align="center" text-anchor="middle">TX in</tspan></tspan><tspan x="91.823967" y="136.39691" text-align="center">progress</tspan></text>
+ <text x="91.648918" y="151.84813" text-align="center" text-anchor="middle" style="line-height:0.9" xml:space="preserve"><tspan x="91.648918" y="151.84813" text-align="center"><tspan font-size="3.175px" text-align="center" text-anchor="middle">Abort in</tspan></tspan><tspan x="91.648918" y="154.70563" text-align="center">progress</tspan></text>
+ <text x="59.456043" y="143.91658" xml:space="preserve"><tspan x="59.456043" y="143.91658" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">Empty</tspan></tspan></text>
+ </g>
+ <g fill="none">
+ <g stroke="#000">
+ <rect x="52.3943" y="171.63" width="106.581" height="16.601" rx="0" ry="0" stroke-linecap="square" stroke-width=".499999"/>
+ <g stroke-width=".6">
+ <path d="m106.383 159.046h26.4967" marker-end="url(#Arrow2Mend)"/>
+ <path d="m103.138 152.268h41.5564v-3.92426" marker-end="url(#marker1856)"/>
+ <path d="m106.38 129.354h17.7785"/>
+ <path d="m125.818 129.359h7.2418" marker-end="url(#marker1964)"/>
+ </g>
+ <path d="m124.169 129.354a0.959514 0.97091 0 0 1 0.47587-0.84557 0.959514 0.97091 0 0 1 0.96164-3e-3 0.959514 0.97091 0 0 1 0.48149 0.84231" stroke-linecap="square" stroke-width=".600001"/>
+ <path d="m55.7026 180.832h34.8131" marker-end="url(#marker2074)" stroke-width=".6"/>
+ </g>
+ <g>
+ <path d="m55.6464 185.744h34.8131" marker-end="url(#marker2074-3)" stroke="#28a4ff" stroke-width=".600001"/>
+ <g stroke-width=".6">
+ <path d="m94.0487 129.889v-10.6493" marker-end="url(#marker3373)" stroke="#000"/>
+ <path d="m89.7534 118.621v10.662" marker-end="url(#marker3517)" stroke="#000"/>
+ <path d="m92.119 138.812v7.9718" marker-end="url(#marker3667)" stroke="#28a4ff"/>
+ </g>
+ </g>
+ </g>
+ <text transform="matrix(.264583 0 0 .264583 91.8919 139.964)" x="26.959213" y="9.11724" fill="#2aa1ff" filter="url(#filter1204-6-2-9-1-3-1)" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle" style="line-height:1.1" xml:space="preserve"><tspan x="26.959213" y="9.11724" text-align="center">Set</tspan><tspan x="26.959213" y="22.31724" text-align="center">abort</tspan></text>
+ <text transform="translate(49.0277 104.823)" x="57.620724" y="16.855087" filter="url(#filter1204)" font-size="3.175px" text-align="center" text-anchor="middle" style="line-height:1.1" xml:space="preserve"><tspan x="57.620724" y="16.855087" text-align="center">Transmission</tspan><tspan x="57.620724" y="20.347588" text-align="center">unsuccessful</tspan></text>
+ <g font-size="12px" stroke-width="3.77953" text-anchor="middle">
+ <text transform="matrix(.264583 0 0 .264583 68.5988 118.913)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">starts</tspan></text>
+ <text transform="matrix(.264583 0 0 .264583 106.802 130.509)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">successful</tspan></text>
+ <text transform="matrix(.264583 0 0 .264583 107.77 145.476)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">aborted</tspan></text>
+ </g>
+ <g stroke-width="3.77953" text-anchor="middle">
+ <text transform="matrix(.264583 0 0 .264583 107.574 155.948)" x="38.824219" y="9.1171875" filter="url(#filter1204)" font-size="10.6667px" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Retransmit</tspan><tspan x="38.824219" y="20.850557" text-align="center">limit reached or</tspan><tspan x="38.824219" y="32.583927" text-align="center">node went bus off</tspan><tspan x="38.824219" y="44.317299" text-align="center"/></text>
+ <text transform="matrix(.264583 0 0 .264583 60.7127 177.384)" x="38.824539" y="9.1173134" filter="url(#filter1204-6)" font-size="12px" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824539" y="9.1173134" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Transmission result</tspan></text>
+ <text transform="matrix(.264583 0 0 .264583 45.6885 173.226)" x="57.727047" y="9.11724" filter="url(#filter1204-6-9)" font-size="12px" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="57.727047" y="9.11724" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Legend:</tspan></text>
+ </g>
+ <g fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-anchor="middle">
+ <text transform="matrix(.264583 0 0 .264583 57.0045 182.079)" x="57.727047" y="9.11724" filter="url(#filter1204-6-2)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="57.727047" y="9.11724" fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">SW command</tspan></text>
+ <text transform="matrix(.264583 0 0 .264583 57.7865 110.104)" x="40.822609" y="9.11724" filter="url(#filter1204-6-2-9)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="40.822609" y="9.11724" fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Set ready</tspan></text>
+ <text transform="matrix(.264583 0 0 .264583 116.893 107.491)" x="28.049065" y="9.1172523" filter="url(#filter1204-6-2-9-4)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="28.049065" y="9.1172523" fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Set ready</tspan></text>
+ <text transform="matrix(.264583 0 0 .264583 87.5687 166.324)" x="28.049065" y="9.1172523" filter="url(#filter1204-6-2-9-1)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="28.049065" y="9.1172523" fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Set empty</tspan></text>
+ <text transform="matrix(.264583 0 0 .264583 106.53 113.074)" x="30.228771" y="8.9063139" filter="url(#filter1204-6-2-9-1-3)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="30.228771" y="8.9063139" fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Set abort</tspan></text>
+ </g>
+ </g>
+</svg>
diff --git a/Documentation/networking/device_drivers/can/freescale/flexcan.rst b/Documentation/networking/device_drivers/can/freescale/flexcan.rst
new file mode 100644
index 000000000000..106cd2890135
--- /dev/null
+++ b/Documentation/networking/device_drivers/can/freescale/flexcan.rst
@@ -0,0 +1,54 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=============================
+Flexcan CAN Controller driver
+=============================
+
+Authors: Marc Kleine-Budde <mkl@pengutronix.de>,
+Dario Binacchi <dario.binacchi@amarulasolutions.com>
+
+On/off RTR frames reception
+===========================
+
+For most flexcan IP cores the driver supports 2 RX modes:
+
+- FIFO
+- mailbox
+
+The older flexcan cores (integrated into the i.MX25, i.MX28, i.MX35
+and i.MX53 SOCs) only receive RTR frames if the controller is
+configured for RX-FIFO mode.
+
+The RX FIFO mode uses a hardware FIFO with a depth of 6 CAN frames,
+while the mailbox mode uses a software FIFO with a depth of up to 62
+CAN frames. With the help of the bigger buffer, the mailbox mode
+performs better under high system load situations.
+
+As reception of RTR frames is part of the CAN standard, all flexcan
+cores come up in a mode where RTR reception is possible.
+
+With the "rx-rtr" private flag, the ability to receive RTR frames can
+be waived in exchange for the better-performing RX mailbox mode.
+This trade-off is beneficial in certain use cases.
+
+"rx-rtr" on
+ Receive RTR frames. (default)
+
+ The CAN controller can and will receive RTR frames.
+
+ On some IP cores the controller cannot receive RTR frames in the
+ more performant "RX mailbox" mode and will use "RX FIFO" mode
+ instead.
+
+"rx-rtr" off
+
+ Waive ability to receive RTR frames. (not supported on all IP cores)
+
+ This mode activates the "RX mailbox mode" for better performance. On
+ some IP cores RTR frames can then no longer be received.
+
+The setting can only be changed if the interface is down::
+
+ ip link set dev can0 down
+ ethtool --set-priv-flags can0 rx-rtr {off|on}
+ ip link set dev can0 up
diff --git a/Documentation/networking/device_drivers/can/index.rst b/Documentation/networking/device_drivers/can/index.rst
new file mode 100644
index 000000000000..6a8a4f74fa26
--- /dev/null
+++ b/Documentation/networking/device_drivers/can/index.rst
@@ -0,0 +1,22 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Controller Area Network (CAN) Device Drivers
+============================================
+
+Device drivers for CAN devices.
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ can327
+ ctu/ctucanfd-driver
+ freescale/flexcan
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/cellular/index.rst b/Documentation/networking/device_drivers/cellular/index.rst
new file mode 100644
index 000000000000..fc1812d3fc70
--- /dev/null
+++ b/Documentation/networking/device_drivers/cellular/index.rst
@@ -0,0 +1,18 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Cellular Modem Device Drivers
+=============================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ qualcomm/rmnet
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst b/Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst
new file mode 100644
index 000000000000..289c146a8291
--- /dev/null
+++ b/Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst
@@ -0,0 +1,196 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+Rmnet Driver
+============
+
+1. Introduction
+===============
+
+The rmnet driver supports the Multiplexing and Aggregation Protocol (MAP).
+This protocol is used by all recent chipsets using Qualcomm
+Technologies, Inc. modems.
+
+This driver can be used to register onto any physical network device in
+IP mode. Physical transports include USB, HSIC, PCIe and IP accelerator.
+
+Multiplexing allows for creation of logical netdevices (rmnet devices) to
+handle multiple private data networks (PDN) like a default internet, tethering,
+multimedia messaging service (MMS) or IP media subsystem (IMS). Hardware sends
+packets with MAP headers to rmnet. Based on the multiplexer id, rmnet
+routes to the appropriate PDN after removing the MAP header.
+
+Aggregation is required to achieve high data rates. This involves hardware
+sending an aggregated bunch of MAP frames. The rmnet driver de-aggregates
+these MAP frames and sends them to the appropriate PDNs.
+
+2. Packet format
+================
+
+a. MAP packet v1 (data / control)
+
+MAP header fields are in big endian format.
+
+Packet format::
+
+  Bit       0               1         2-7  8-15            16-31
+  Function  Command / Data  Reserved  Pad  Multiplexer ID  Payload length
+
+  Bit       32-x
+  Function  Raw bytes
+
+The Command (1) / Data (0) bit indicates whether the packet is a MAP command
+or a data packet. Command packets are used for transport level flow control.
+Data packets are standard IP packets.
+
+Reserved bits must be zero when sent and ignored when received.
+
+Padding is the number of bytes to be appended to the payload to
+ensure 4 byte alignment.
+
+Multiplexer ID is to indicate the PDN on which data has to be sent.
+
+Payload length includes the padding length but does not include MAP header
+length.
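
As an illustration only, the v1 header layout above can be decoded in a few
lines of Python. This is a minimal userspace sketch, not the driver's code:
the names ``MapHeader`` and ``parse_map_header`` are hypothetical, and the
sketch assumes that bit 0 in the table is the most significant bit of the
first byte.

```python
import struct
from typing import NamedTuple

MAP_HEADER_LEN = 4  # bytes: flags, multiplexer ID, 16-bit payload length


class MapHeader(NamedTuple):
    is_command: bool  # bit 0: 1 = MAP command, 0 = data
    pad_len: int      # bits 2-7: padding bytes appended to the payload
    mux_id: int       # bits 8-15: selects the PDN
    payload_len: int  # bits 16-31: payload length, padding included


def parse_map_header(buf: bytes) -> MapHeader:
    """Decode the 4-byte MAP v1 header at the start of buf."""
    # "!BBH" reads two single bytes and one big endian 16-bit field.
    flags, mux_id, payload_len = struct.unpack_from("!BBH", buf, 0)
    return MapHeader(
        is_command=bool(flags & 0x80),  # assumption: bit 0 is the MSB
        pad_len=flags & 0x3F,           # low six bits of the first byte
        mux_id=mux_id,
        payload_len=payload_len,
    )


# Data packet on multiplexer ID 3: 6 payload bytes plus 2 bytes of padding.
packet = bytes([0x02, 0x03, 0x00, 0x08]) + b"\x45" * 6 + b"\x00" * 2
hdr = parse_map_header(packet)
```

Because the payload length includes the padding, the number of useful bytes
is ``hdr.payload_len - hdr.pad_len``.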
+
+b. MAP packet v4 (data / control)
+
+MAP header fields are in big endian format.
+
+Packet format::
+
+  Bit       0               1         2-7  8-15            16-31
+  Function  Command / Data  Reserved  Pad  Multiplexer ID  Payload length
+
+  Bit       32-(x-33)  (x-32)-x
+  Function  Raw bytes  Checksum offload header
+
+The Command (1) / Data (0) bit indicates whether the packet is a MAP command
+or a data packet. Command packets are used for transport level flow control.
+Data packets are standard IP packets.
+
+Reserved bits must be zero when sent and ignored when received.
+
+Padding is the number of bytes to be appended to the payload to
+ensure 4 byte alignment.
+
+Multiplexer ID is to indicate the PDN on which data has to be sent.
+
+Payload length includes the padding length but does not include MAP header
+length.
+
+The checksum offload header carries information about the checksum processing
+done by the hardware. Checksum offload header fields are in big endian format.
+
+Packet format::
+
+  Bit       0-14      15     16-31
+  Function  Reserved  Valid  Checksum start offset
+
+  Bit       32-47            48-63
+  Function  Checksum length  Checksum value
+
+Reserved bits must be zero when sent and ignored when received.
+
+Valid bit indicates whether the partial checksum is calculated and is valid.
+Set to 1 if it is valid. Set to 0 otherwise.
+
+Padding is the number of bytes to be appended to the payload to
+ensure 4 byte alignment.
+
+Checksum start offset indicates the offset in bytes from the beginning of the
+IP header from which the modem computed the checksum.
+
+Checksum length is the length in bytes, starting from CKSUM_START_OFFSET,
+over which the checksum is computed.
+
+Checksum value indicates the computed checksum.
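
The 8-byte checksum offload header described above can be unpacked in the
same way. The following is a hedged sketch rather than the driver's
implementation: ``CsumTrailer`` and ``parse_csum_trailer`` are made-up names,
the function is handed exactly the 8 header bytes, and bit 15 (Valid) is
assumed to be the least significant bit of the second byte.

```python
import struct
from typing import NamedTuple


class CsumTrailer(NamedTuple):
    valid: bool        # bit 15: partial checksum was calculated and is valid
    start_offset: int  # bits 16-31: offset in bytes from the IP header start
    length: int        # bits 32-47: number of bytes covered by the checksum
    value: int         # bits 48-63: the computed checksum


def parse_csum_trailer(trailer: bytes) -> CsumTrailer:
    """Decode the 8-byte v4 checksum offload header."""
    if len(trailer) != 8:
        raise ValueError("checksum offload header must be 8 bytes")
    # Two single bytes followed by three big endian 16-bit fields.
    _reserved, flags, start, length, value = struct.unpack("!BBHHH", trailer)
    # Assumption: with MSB-first bit numbering, bit 15 is the LSB of byte 1.
    return CsumTrailer(bool(flags & 0x01), start, length, value)


# Valid checksum 0xBEEF computed over 40 bytes starting 20 bytes into the packet.
trailer = parse_csum_trailer(bytes([0x00, 0x01, 0x00, 0x14, 0x00, 0x28, 0xBE, 0xEF]))
```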
+
+c. MAP packet v5 (data / control)
+
+MAP header fields are in big endian format.
+
+Packet format::
+
+  Bit       0               1            2-7  8-15            16-31
+  Function  Command / Data  Next header  Pad  Multiplexer ID  Payload length
+
+  Bit       32-x
+  Function  Raw bytes
+
+The Command (1) / Data (0) bit indicates whether the packet is a MAP command
+or a data packet. Command packets are used for transport level flow control.
+Data packets are standard IP packets.
+
+Next header indicates the presence of another header; currently this is
+limited to the checksum header.
+
+Padding is the number of bytes to be appended to the payload to
+ensure 4 byte alignment.
+
+Multiplexer ID is to indicate the PDN on which data has to be sent.
+
+Payload length includes the padding length but does not include MAP header
+length.
+
+d. Checksum offload header v5
+
+Checksum offload header fields are in big endian format::
+
+  Bit       0 - 6        7            8-15            16-31
+  Function  Header Type  Next Header  Checksum Valid  Reserved
+
+Header Type indicates the type of header; this is usually set to CHECKSUM.
+
+Header types
+
+= ==========================================
+0 Reserved
+1 Reserved
+2 checksum header
+= ==========================================
+
+Checksum Valid indicates whether the header checksum is valid. A value of 1
+means that the checksum was calculated on this packet and is valid; a value
+of 0 means that the calculated packet checksum is invalid.
+
+Reserved bits must be zero when sent and ignored when received.
+
+e. MAP packet v1/v5 (command specific)::
+
+  Bit       0             1         2-7           8 - 15          16 - 31
+  Function  Command       Reserved  Pad           Multiplexer ID  Payload length
+  Bit       32 - 39       40 - 45   46 - 47       48 - 63
+  Function  Command name  Reserved  Command Type  Reserved
+  Bit       64 - 95
+  Function  Transaction ID
+  Bit       96 - 127
+  Function  Command data
+
+Command name 1 indicates disabling flow, while 2 indicates enabling flow.
+
+Command types
+
+= ==========================================
+0 for MAP command request
+1 is to acknowledge the receipt of a command
+2 is for unsupported commands
+3 is for error during processing of commands
+= ==========================================
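
To make the command layout concrete, the sketch below decodes a 16-byte MAP
command packet. It is illustrative only: ``MapCommand`` and
``parse_map_command`` are hypothetical names, the lookup tables merely restate
the values listed above, and the bit positions assume MSB-first numbering (so
the command type occupies the two low bits of the byte after the command
name).

```python
import struct
from typing import NamedTuple

COMMAND_NAMES = {1: "flow disable", 2: "flow enable"}
COMMAND_TYPES = {0: "request", 1: "ack", 2: "unsupported", 3: "error"}


class MapCommand(NamedTuple):
    mux_id: int
    name: str
    cmd_type: str
    transaction_id: int
    data: int


def parse_map_command(buf: bytes) -> MapCommand:
    """Decode a MAP command: 4-byte MAP header plus 12 command bytes."""
    flags, mux_id, _payload_len = struct.unpack_from("!BBH", buf, 0)
    if not flags & 0x80:  # assumption: the command bit is the MSB of byte 0
        raise ValueError("not a MAP command packet")
    # Bits 32-39 name, 40-47 reserved + type, 48-63 reserved,
    # 64-95 transaction ID, 96-127 command data.
    name, type_byte, _reserved, transaction_id, data = struct.unpack_from(
        "!BBHII", buf, 4
    )
    return MapCommand(
        mux_id,
        COMMAND_NAMES.get(name, "unknown"),
        COMMAND_TYPES[type_byte & 0x03],
        transaction_id,
        data,
    )


# Flow-disable request for multiplexer ID 5 with transaction ID 7.
request = (
    bytes([0x80, 0x05, 0x00, 0x0C, 0x01, 0x00, 0x00, 0x00])
    + (7).to_bytes(4, "big")
    + bytes(4)
)
cmd = parse_map_command(request)
```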
+
+f. Aggregation
+
+Aggregation is multiple MAP packets (can be data or command) delivered to
+rmnet in a single linear skb. rmnet will process the individual
+packets and either ACK the MAP command or deliver the IP packet to the
+network stack as needed.
+
+MAP header|IP Packet|Optional padding|MAP header|IP Packet|Optional padding....
+
+MAP header|IP Packet|Optional padding|MAP header|Command Packet|Optional pad...
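
The de-aggregation step can be sketched as a loop that reads one MAP header,
slices out the payload, and jumps ahead by the payload length (which already
includes the padding). This is a simplified userspace model under stated
assumptions, not the rmnet implementation: ``deaggregate`` is a hypothetical
name, a plain ``bytes`` object stands in for the linear skb, and bit 0 of the
header is assumed to be the most significant bit of the first byte.

```python
import struct
from typing import Iterator, Tuple

MAP_HEADER_LEN = 4  # flags, multiplexer ID, 16-bit payload length


def deaggregate(buf: bytes) -> Iterator[Tuple[bool, int, bytes]]:
    """Yield (is_command, mux_id, payload) for each MAP packet in buf."""
    offset = 0
    # Stop once fewer than MAP_HEADER_LEN bytes remain in the buffer.
    while offset + MAP_HEADER_LEN <= len(buf):
        flags, mux_id, payload_len = struct.unpack_from("!BBH", buf, offset)
        pad_len = flags & 0x3F
        start = offset + MAP_HEADER_LEN
        end = start + payload_len  # payload length includes the padding
        if end > len(buf) or pad_len > payload_len:
            raise ValueError("truncated or malformed MAP packet")
        yield bool(flags & 0x80), mux_id, buf[start:end - pad_len]
        offset = end  # the next MAP header follows the padding


# Two data packets: 3 payload bytes + 1 pad byte, then 4 payload bytes.
frame_a = bytes([0x01, 0x01, 0x00, 0x04]) + b"abc" + b"\x00"
frame_b = bytes([0x00, 0x02, 0x00, 0x04]) + b"wxyz"
packets = list(deaggregate(frame_a + frame_b))
```

Each yielded tuple corresponds to one ``MAP header|IP Packet|Optional padding``
unit in the diagrams above, with the padding already stripped.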
+
+3. Userspace configuration
+==========================
+
+rmnet userspace configuration is done through netlink using iproute2
+https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/
+
+The driver uses rtnl_link_ops for communication.
diff --git a/Documentation/networking/device_drivers/dec/de4x5.rst b/Documentation/networking/device_drivers/dec/de4x5.rst
deleted file mode 100644
index e03e9c631879..000000000000
--- a/Documentation/networking/device_drivers/dec/de4x5.rst
+++ /dev/null
@@ -1,189 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===================================
-DEC EtherWORKS Ethernet De4x5 cards
-===================================
-
- Originally, this driver was written for the Digital Equipment
- Corporation series of EtherWORKS Ethernet cards:
-
- - DE425 TP/COAX EISA
- - DE434 TP PCI
- - DE435 TP/COAX/AUI PCI
- - DE450 TP/COAX/AUI PCI
- - DE500 10/100 PCI Fasternet
-
- but it will now attempt to support all cards which conform to the
- Digital Semiconductor SROM Specification. The driver currently
- recognises the following chips:
-
- - DC21040 (no SROM)
- - DC21041[A]
- - DC21140[A]
- - DC21142
- - DC21143
-
- So far the driver is known to work with the following cards:
-
- - KINGSTON
- - Linksys
- - ZNYX342
- - SMC8432
- - SMC9332 (w/new SROM)
- - ZNYX31[45]
- - ZNYX346 10/100 4 port (can act as a 10/100 bridge!)
-
- The driver has been tested on a relatively busy network using the DE425,
- DE434, DE435 and DE500 cards and benchmarked with 'ttcp': it transferred
- 16M of data to a DECstation 5000/200 as follows::
-
- TCP UDP
- TX RX TX RX
- DE425 1030k 997k 1170k 1128k
- DE434 1063k 995k 1170k 1125k
- DE435 1063k 995k 1170k 1125k
- DE500 1063k 998k 1170k 1125k in 10Mb/s mode
-
- All values are typical (in kBytes/sec) from a sample of 4 for each
- measurement. Their error is +/-20k on a quiet (private) network and also
- depend on what load the CPU has.
-
-----------------------------------------------------------------------------
-
- The ability to load this driver as a loadable module has been included
- and used extensively during the driver development (to save those long
- reboot sequences). Loadable module support under PCI and EISA has been
- achieved by letting the driver autoprobe as if it were compiled into the
- kernel. Do make sure you're not sharing interrupts with anything that
- cannot accommodate interrupt sharing!
-
- To utilise this ability, you have to do 8 things:
-
- 0) have a copy of the loadable modules code installed on your system.
- 1) copy de4x5.c from the /linux/drivers/net directory to your favourite
- temporary directory.
- 2) for fixed autoprobes (not recommended), edit the source code near
- line 5594 to reflect the I/O address you're using, or assign these when
- loading by::
-
- insmod de4x5 io=0xghh where g = bus number
- hh = device number
-
- .. note::
-
- autoprobing for modules is now supported by default. You may just
- use::
-
- insmod de4x5
-
- to load all available boards. For a specific board, still use
- the 'io=?' above.
- 3) compile de4x5.c, but include -DMODULE in the command line to ensure
- that the correct bits are compiled (see end of source code).
- 4) if you are wanting to add a new card, goto 5. Otherwise, recompile a
- kernel with the de4x5 configuration turned off and reboot.
- 5) insmod de4x5 [io=0xghh]
- 6) run the net startup bits for your new eth?? interface(s) manually
- (usually /etc/rc.inet[12] at boot time).
- 7) enjoy!
-
- To unload a module, turn off the associated interface(s)
- 'ifconfig eth?? down' then 'rmmod de4x5'.
-
- Automedia detection is included so that in principle you can disconnect
- from, e.g. TP, reconnect to BNC and things will still work (after a
- pause while the driver figures out where its media went). My tests
- using ping showed that it appears to work....
-
- By default, the driver will now autodetect any DECchip based card.
- Should you have a need to restrict the driver to DIGITAL only cards, you
- can compile with a DEC_ONLY define, or if loading as a module, use the
- 'dec_only=1' parameter.
-
- I've changed the timing routines to use the kernel timer and scheduling
- functions so that the hangs and other assorted problems that occurred
- while autosensing the media should be gone. A bonus for the DC21040
- auto media sense algorithm is that it can now use one that is more in
- line with the rest (the DC21040 chip doesn't have a hardware timer).
- The downside is the 1 'jiffies' (10ms) resolution.
-
- IEEE 802.3u MII interface code has been added in anticipation that some
- products may use it in the future.
-
- The SMC9332 card has a non-compliant SROM which needs fixing - I have
- patched this driver to detect it because the SROM format used complies
- to a previous DEC-STD format.
-
- I have removed the buffer copies needed for receive on Intels. I cannot
- remove them for Alphas since the Tulip hardware only does longword
- aligned DMA transfers and the Alphas get alignment traps with non
- longword aligned data copies (which makes them really slow). No comment.
-
- I have added SROM decoding routines to make this driver work with any
- card that supports the Digital Semiconductor SROM spec. This will help
- all cards running the dc2114x series chips in particular. Cards using
- the dc2104x chips should run correctly with the basic driver. I'm in
- debt to <mjacob@feral.com> for the testing and feedback that helped get
- this feature working. So far we have tested KINGSTON, SMC8432, SMC9332
- (with the latest SROM complying with the SROM spec V3: their first was
- broken), ZNYX342 and LinkSys. ZNYX314 (dual 21041 MAC) and ZNYX 315
- (quad 21041 MAC) cards also appear to work despite their incorrectly
- wired IRQs.
-
- I have added a temporary fix for interrupt problems when some SCSI cards
- share the same interrupt as the DECchip based cards. The problem occurs
- because the SCSI card wants to grab the interrupt as a fast interrupt
- (runs the service routine with interrupts turned off) vs. this card
- which really needs to run the service routine with interrupts turned on.
- This driver will now add the interrupt service routine as a fast
- interrupt if it is bounced from the slow interrupt. THIS IS NOT A
- RECOMMENDED WAY TO RUN THE DRIVER and has been done for a limited time
- until people sort out their compatibility issues and the kernel
- interrupt service code is fixed. YOU SHOULD SEPARATE OUT THE FAST
- INTERRUPT CARDS FROM THE SLOW INTERRUPT CARDS to ensure that they do not
- run on the same interrupt. PCMCIA/CardBus is another can of worms...
-
- Finally, I think I have really fixed the module loading problem with
- more than one DECchip based card. As a side effect, I don't mess with
- the device structure any more which means that if more than 1 card in
- 2.0.x is installed (4 in 2.1.x), the user will have to edit
- linux/drivers/net/Space.c to make room for them. Hence, module loading
- is the preferred way to use this driver, since it doesn't have this
- limitation.
-
- Where SROM media detection is used and full duplex is specified in the
- SROM, the feature is ignored unless lp->params.fdx is set at compile
- time OR during a module load (insmod de4x5 args='eth??:fdx' [see
- below]). This is because there is no way to automatically detect full
- duplex links except through autonegotiation. When I include the
- autonegotiation feature in the SROM autoconf code, this detection will
- occur automatically for that case.
-
- Command line arguments are now allowed, similar to passing arguments
- through LILO. This will allow a per adapter board set up of full duplex
- and media. The only lexical constraints are: the board name (dev->name)
- appears in the list before its parameters. The list of parameters ends
- either at the end of the parameter list or with another board name. The
- following parameters are allowed:
-
- ========= ===============================================
- fdx for full duplex
- autosense to set the media/speed; with the following
- sub-parameters:
- TP, TP_NW, BNC, AUI, BNC_AUI, 100Mb, 10Mb, AUTO
- ========= ===============================================
-
- Case sensitivity is important for the sub-parameters. They *must* be
- upper case. Examples::
-
- insmod de4x5 args='eth1:fdx autosense=BNC eth0:autosense=100Mb'.
-
- For a compiled in driver, in linux/drivers/net/CONFIG, place e.g.::
-
- DE4X5_OPTS = -DDE4X5_PARM='"eth0:fdx autosense=AUI eth2:autosense=TP"'
-
- Yes, I know full duplex isn't permissible on BNC or AUI; they're just
- examples. By default, full duplex is turned off and AUTO is the default
- autosense setting. In reality, I expect only the full duplex option to
- be used. Note the use of single quotes in the two examples above and the
- lack of commas to separate items.
diff --git a/Documentation/networking/device_drivers/3com/3c509.rst b/Documentation/networking/device_drivers/ethernet/3com/3c509.rst
index 47f706bacdd9..47f706bacdd9 100644
--- a/Documentation/networking/device_drivers/3com/3c509.rst
+++ b/Documentation/networking/device_drivers/ethernet/3com/3c509.rst
diff --git a/Documentation/networking/device_drivers/3com/vortex.rst b/Documentation/networking/device_drivers/ethernet/3com/vortex.rst
index 800add5be338..a060f84c4f96 100644
--- a/Documentation/networking/device_drivers/3com/vortex.rst
+++ b/Documentation/networking/device_drivers/ethernet/3com/vortex.rst
@@ -4,8 +4,6 @@
3Com Vortex device driver
=========================
-Documentation/networking/device_drivers/3com/vortex.rst
-
Andrew Morton
30 April 2000
@@ -256,7 +254,7 @@ Media selection
A number of the older NICs such as the 3c590 and 3c900 series have
10base2 and AUI interfaces.
-Prior to January, 2001 this driver would autoeselect the 10base2 or AUI
+Prior to January, 2001 this driver would autoselect the 10base2 or AUI
port if it didn't detect activity on the 10baseT port. It would then
get stuck on the 10base2 port and a driver reload was necessary to
switch back to 10baseT. This behaviour could not be prevented with a
@@ -376,8 +374,8 @@ steps you should take:
email address will be in the driver source or in the MAINTAINERS file.
- The contents of your report will vary a lot depending upon the
- problem. If it's a kernel crash then you should refer to the
- admin-guide/reporting-bugs.rst file.
+ problem. If it's a kernel crash then you should refer to
+ 'Documentation/admin-guide/reporting-issues.rst'.
But for most problems it is useful to provide the following:
diff --git a/Documentation/networking/altera_tse.rst b/Documentation/networking/device_drivers/ethernet/altera/altera_tse.rst
index 7a7040072e58..7a7040072e58 100644
--- a/Documentation/networking/altera_tse.rst
+++ b/Documentation/networking/device_drivers/ethernet/altera/altera_tse.rst
diff --git a/Documentation/networking/device_drivers/amazon/ena.rst b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
index 11af6388ea87..a4c7d0c65fd7 100644
--- a/Documentation/networking/device_drivers/amazon/ena.rst
+++ b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
@@ -11,12 +11,12 @@ ENA is a networking interface designed to make good use of modern CPU
features and system architectures.
The ENA device exposes a lightweight management interface with a
-minimal set of memory mapped registers and extendable command set
+minimal set of memory mapped registers and extendible command set
through an Admin Queue.
The driver supports a range of ENA devices, is link-speed independent
-(i.e., the same driver is used for 10GbE, 25GbE, 40GbE, etc.), and has
-a negotiated and extendable feature set.
+(i.e., the same driver is used for 10GbE, 25GbE, 40GbE, etc), and has
+a negotiated and extendible feature set.
Some ENA devices support SR-IOV. This driver is used for both the
SR-IOV Physical Function (PF) and Virtual Function (VF) devices.
@@ -27,9 +27,9 @@ is advertised by the device via the Admin Queue), a dedicated MSI-X
interrupt vector per Tx/Rx queue pair, adaptive interrupt moderation,
and CPU cacheline optimized data placement.
-The ENA driver supports industry standard TCP/IP offload features such
-as checksum offload and TCP transmit segmentation offload (TSO).
-Receive-side scaling (RSS) is supported for multi-core scaling.
+The ENA driver supports industry standard TCP/IP offload features such as
+checksum offload. Receive-side scaling (RSS) is supported for multi-core
+scaling.
The ENA driver and its corresponding devices implement health
monitoring mechanisms such as watchdog, enabling the device and driver
@@ -39,32 +39,22 @@ debug logs.
Some of the ENA devices support a working mode called Low-latency
Queue (LLQ), which saves several more microseconds.
-Supported PCI vendor ID/device IDs
-==================================
-
-========= =======================
-1d0f:0ec2 ENA PF
-1d0f:1ec2 ENA PF with LLQ support
-1d0f:ec20 ENA VF
-1d0f:ec21 ENA VF with LLQ support
-========= =======================
-
ENA Source Code Directory Structure
===================================
================= ======================================================
ena_com.[ch] Management communication layer. This layer is
- responsible for the handling all the management
- (admin) communication between the device and the
- driver.
+ responsible for the handling all the management
+ (admin) communication between the device and the
+ driver.
ena_eth_com.[ch] Tx/Rx data path.
ena_admin_defs.h Definition of ENA management interface.
ena_eth_io_defs.h Definition of ENA data path interface.
ena_common_defs.h Common definitions for ena_com layer.
ena_regs_defs.h Definition of ENA PCI memory-mapped (MMIO) registers.
ena_netdev.[ch] Main Linux kernel driver.
-ena_syfsfs.[ch] Sysfs files.
ena_ethtool.c ethtool callbacks.
+ena_xdp.[ch] XDP files
ena_pci_id_tbl.h Supported device IDs.
================= ======================================================
@@ -79,7 +69,7 @@ ENA management interface is exposed by means of:
- Asynchronous Event Notification Queue (AENQ)
ENA device MMIO Registers are accessed only during driver
-initialization and are not involved in further normal device
+initialization and are not used during further normal device
operation.
AQ is used for submitting management commands, and the
@@ -110,28 +100,27 @@ group may have multiple syndromes, as shown below
The events are:
- ==================== ===============
- Group Syndrome
- ==================== ===============
- Link state change **X**
- Fatal error **X**
- Notification Suspend traffic
- Notification Resume traffic
- Keep-Alive **X**
- ==================== ===============
+==================== ===============
+Group Syndrome
+==================== ===============
+Link state change **X**
+Fatal error **X**
+Notification Suspend traffic
+Notification Resume traffic
+Keep-Alive **X**
+==================== ===============
ACQ and AENQ share the same MSI-X vector.
-Keep-Alive is a special mechanism that allows monitoring of the
-device's health. The driver maintains a watchdog (WD) handler which,
-if fired, logs the current state and statistics then resets and
-restarts the ENA device and driver. A Keep-Alive event is delivered by
-the device every second. The driver re-arms the WD upon reception of a
-Keep-Alive event. A missed Keep-Alive event causes the WD handler to
-fire.
+Keep-Alive is a special mechanism that allows monitoring the device's health.
+A Keep-Alive event is delivered by the device every second.
+The driver maintains a watchdog (WD) handler which logs the current state and
+statistics. If the keep-alive events aren't delivered as expected the WD resets
+the device and the driver.
Data Path Interface
===================
+
I/O operations are based on Tx and Rx Submission Queues (Tx SQ and Rx
SQ correspondingly). Each SQ has a completion queue (CQ) associated
with it.
@@ -141,26 +130,24 @@ physical memory.
The ENA driver supports two Queue Operation modes for Tx SQs:
-- Regular mode
+- **Regular mode:**
+ In this mode the Tx SQs reside in the host's memory. The ENA
+ device fetches the ENA Tx descriptors and packet data from host
+ memory.
- * In this mode the Tx SQs reside in the host's memory. The ENA
- device fetches the ENA Tx descriptors and packet data from host
- memory.
+- **Low Latency Queue (LLQ) mode or "push-mode":**
+ In this mode the driver pushes the transmit descriptors and the
+ first 96 bytes of the packet directly to the ENA device memory
+ space. The rest of the packet payload is fetched by the
+ device. For this operation mode, the driver uses a dedicated PCI
+ device memory BAR, which is mapped with write-combine capability.
-- Low Latency Queue (LLQ) mode or "push-mode".
-
- * In this mode the driver pushes the transmit descriptors and the
- first 128 bytes of the packet directly to the ENA device memory
- space. The rest of the packet payload is fetched by the
- device. For this operation mode, the driver uses a dedicated PCI
- device memory BAR, which is mapped with write-combine capability.
+ **Note that** not all ENA devices support LLQ, and this feature is negotiated
+ with the device upon initialization. If the ENA device does not
+ support LLQ mode, the driver falls back to the regular mode.
The Rx SQs support only the regular mode.
-Note: Not all ENA devices support LLQ, and this feature is negotiated
- with the device upon initialization. If the ENA device does not
- support LLQ mode, the driver falls back to the regular mode.
-
The driver supports multi-queue for both Tx and Rx. This has various
benefits:
@@ -175,6 +162,7 @@ benefits:
Interrupt Modes
===============
+
The driver assigns a single MSI-X vector per queue pair (for both Tx
and Rx directions). The driver assigns an additional dedicated MSI-X vector
for management (for ACQ and AENQ).
@@ -200,50 +188,43 @@ unmasked by the driver after NAPI processing is complete.
Interrupt Moderation
====================
+
ENA driver and device can operate in conventional or adaptive interrupt
moderation mode.
-In conventional mode the driver instructs device to postpone interrupt
+**In conventional mode** the driver instructs device to postpone interrupt
posting according to static interrupt delay value. The interrupt delay
-value can be configured through ethtool(8). The following ethtool
-parameters are supported by the driver: tx-usecs, rx-usecs
+value can be configured through `ethtool(8)`. The following `ethtool`
+parameters are supported by the driver: ``tx-usecs``, ``rx-usecs``
-In adaptive interrupt moderation mode the interrupt delay value is
+**In adaptive interrupt moderation mode** the interrupt delay value is
updated by the driver dynamically and adjusted every NAPI cycle
according to the traffic nature.
-By default ENA driver applies adaptive coalescing on Rx traffic and
-conventional coalescing on Tx traffic.
-
-Adaptive coalescing can be switched on/off through ethtool(8)
-adaptive_rx on|off parameter.
+Adaptive coalescing can be switched on/off through `ethtool(8)`'s
+:code:`adaptive_rx on|off` parameter.
-The driver chooses interrupt delay value according to the number of
-bytes and packets received between interrupt unmasking and interrupt
-posting. The driver uses interrupt delay table that subdivides the
-range of received bytes/packets into 5 levels and assigns interrupt
-delay value to each level.
+More information about Dynamic Interrupt Moderation (DIM) can be found in
+Documentation/networking/net_dim.rst
-The user can enable/disable adaptive moderation, modify the interrupt
-delay table and restore its default values through sysfs.
+.. _`RX copybreak`:
RX copybreak
============
+
The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK
and can be configured by the ETHTOOL_STUNABLE command of the
SIOCETHTOOL ioctl.
-SKB
-===
-The driver-allocated SKB for frames received from Rx handling using
-NAPI context. The allocation method depends on the size of the packet.
-If the frame length is larger than rx_copybreak, napi_get_frags()
-is used, otherwise netdev_alloc_skb_ip_align() is used, the buffer
-content is copied (by CPU) to the SKB, and the buffer is recycled.
+This option controls the maximum packet length for which the RX
+descriptor it was received on would be recycled. When a packet smaller
+than RX copybreak bytes is received, it is copied into a new memory
+buffer and the RX descriptor is returned to HW.
Statistics
==========
-The user can obtain ENA device and driver statistics using ethtool.
+
+The user can obtain ENA device and driver statistics using `ethtool`.
The driver can collect regular or extended statistics (including
per-queue stats) from the device.
@@ -251,22 +232,23 @@ In addition the driver logs the stats to syslog upon device reset.
MTU
===
+
The driver supports an arbitrarily large MTU with a maximum that is
negotiated with the device. The driver configures MTU using the
SetFeature command (ENA_ADMIN_MTU property). The user can change MTU
-via ip(8) and similar legacy tools.
+via `ip(8)` and similar legacy tools.
Stateless Offloads
==================
+
The ENA driver supports:
-- TSO over IPv4/IPv6
-- TSO with ECN
- IPv4 header checksum offload
- TCP/UDP over IPv4/IPv6 checksum offloads
RSS
===
+
- The ENA device supports RSS that allows flexible Rx traffic
steering.
- Toeplitz and CRC32 hash functions are supported.
@@ -274,46 +256,47 @@ RSS
inputs for hash functions.
- The driver configures RSS settings using the AQ SetFeature command
(ENA_ADMIN_RSS_HASH_FUNCTION, ENA_ADMIN_RSS_HASH_INPUT and
- ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG properties).
+ ENA_ADMIN_RSS_INDIRECTION_TABLE_CONFIG properties).
- If the NETIF_F_RXHASH flag is set, the 32-bit result of the hash
function delivered in the Rx CQ descriptor is set in the received
SKB.
- The user can provide a hash key, hash function, and configure the
- indirection table through ethtool(8).
+ indirection table through `ethtool(8)`.
DATA PATH
=========
+
Tx
--
-end_start_xmit() is called by the stack. This function does the following:
+:code:`ena_start_xmit()` is called by the stack. This function does the following:
-- Maps data buffers (skb->data and frags).
-- Populates ena_buf for the push buffer (if the driver and device are
- in push mode.)
+- Maps data buffers (``skb->data`` and frags).
+- Populates ``ena_buf`` for the push buffer (if the driver and device are
+ in push mode).
- Prepares ENA bufs for the remaining frags.
-- Allocates a new request ID from the empty req_id ring. The request
+- Allocates a new request ID from the empty ``req_id`` ring. The request
ID is the index of the packet in the Tx info. This is used for
- out-of-order TX completions.
+ out-of-order Tx completions.
- Adds the packet to the proper place in the Tx ring.
-- Calls ena_com_prepare_tx(), an ENA communication layer that converts
- the ena_bufs to ENA descriptors (and adds meta ENA descriptors as
- needed.)
+- Calls :code:`ena_com_prepare_tx()`, an ENA communication layer that converts
+ the ``ena_bufs`` to ENA descriptors (and adds meta ENA descriptors as
+ needed).
* This function also copies the ENA descriptors and the push buffer
- to the Device memory space (if in push mode.)
+ to the Device memory space (if in push mode).
-- Writes doorbell to the ENA device.
+- Writes a doorbell to the ENA device.
- When the ENA device finishes sending the packet, a completion
interrupt is raised.
- The interrupt handler schedules NAPI.
-- The ena_clean_tx_irq() function is called. This function handles the
+- The :code:`ena_clean_tx_irq()` function is called. This function handles the
completion descriptors generated by the ENA, with a single
completion descriptor per completed packet.
- * req_id is retrieved from the completion descriptor. The tx_info of
- the packet is retrieved via the req_id. The data buffers are
- unmapped and req_id is returned to the empty req_id ring.
+ * ``req_id`` is retrieved from the completion descriptor. The ``tx_info`` of
+ the packet is retrieved via the ``req_id``. The data buffers are
+ unmapped and ``req_id`` is returned to the empty ``req_id`` ring.
* The function stops when the completion descriptors are completed or
the budget is reached.
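
The request ID bookkeeping in the Tx steps above can be modelled with a short Python sketch. It is a toy model only: the class and method names are hypothetical and merely echo the driver functions named in the text; no real driver structures are reproduced.

```python
from collections import deque


class TxRing:
    """Toy model of the empty req_id ring and the per-packet Tx info table."""

    def __init__(self, size):
        self.free_ids = deque(range(size))  # the "empty req_id ring"
        self.tx_info = [None] * size        # indexed by req_id

    def start_xmit(self, packet):
        """Claim a req_id for the packet; return None if none are free."""
        if not self.free_ids:
            return None
        req_id = self.free_ids.popleft()
        self.tx_info[req_id] = packet
        return req_id

    def clean_tx(self, completed_ids, budget):
        """Handle up to `budget` completions in the order they arrive."""
        done = []
        for req_id in completed_ids[:budget]:
            done.append(self.tx_info[req_id])
            self.tx_info[req_id] = None
            self.free_ids.append(req_id)    # return the ID to the free ring
        return done


ring = TxRing(4)
ids = [ring.start_xmit(p) for p in ("a", "b", "c")]
# Completions may arrive out of order; the req_id locates the right packet.
finished = ring.clean_tx([ids[2], ids[0]], budget=8)
```

Indexing the Tx info by request ID is what lets completions be handled in any order without searching the ring.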
@@ -322,12 +305,11 @@ Rx
- When a packet is received from the ENA device.
- The interrupt handler schedules NAPI.
-- The ena_clean_rx_irq() function is called. This function calls
- ena_rx_pkt(), an ENA communication layer function, which returns the
- number of descriptors used for a new unhandled packet, and zero if
+- The :code:`ena_clean_rx_irq()` function is called. This function calls
+ :code:`ena_com_rx_pkt()`, an ENA communication layer function, which returns the
+ number of descriptors used for a new packet, and zero if
no new packet is found.
-- Then it calls the ena_clean_rx_irq() function.
-- ena_eth_rx_skb() checks packet length:
+- :code:`ena_rx_skb()` checks packet length:
* If the packet is small (len < rx_copybreak), the driver allocates
a SKB for the new packet, and copies the packet payload into the
@@ -336,9 +318,41 @@ Rx
- In this way the original data buffer is not passed to the stack
and is reused for future Rx packets.
- * Otherwise the function unmaps the Rx buffer, then allocates the
- new SKB structure and hooks the Rx buffer to the SKB frags.
+ * Otherwise the function unmaps the Rx buffer, sets the first
+ descriptor as `skb`'s linear part and the other descriptors as the
+ `skb`'s frags.
- The new SKB is updated with the necessary information (protocol,
- checksum hw verify result, etc.), and then passed to the network
- stack, using the NAPI interface function napi_gro_receive().
+ checksum hw verify result, etc.), and then passed to the network
+ stack, using the NAPI interface function :code:`napi_gro_receive()`.
+
+Dynamic RX Buffers (DRB)
+------------------------
+
+Each RX descriptor in the RX ring is a single memory page (which is either 4KB
+or 16KB long depending on the system's configuration).
+To reduce the memory allocations required when dealing with a high rate of small
+packets, the driver tries to reuse the remaining RX descriptor's space if more
+than 2KB of this page remain unused.
+
+A simple example of this mechanism is the following sequence of events:
+
+::
+
+ 1. Driver allocates page-sized RX buffer and passes it to hardware
+ +----------------------+
+ |4KB RX Buffer |
+ +----------------------+
+
+ 2. A 300Bytes packet is received on this buffer
+
+ 3. The driver increases the ref count on this page and returns it back to
+ HW as an RX buffer of size 4KB - 300Bytes = 3796 Bytes
+ +----+--------------------+
+ |****|3796 Bytes RX Buffer|
+ +----+--------------------+
+
+This mechanism isn't used when an XDP program is loaded, or when the
+RX packet is less than rx_copybreak bytes (in which case the packet is
+copied out of the RX buffer into the linear part of a new skb allocated
+for it and the RX buffer remains the same size, see `RX copybreak`_).
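
The reuse rule described in the Dynamic RX Buffers section can be expressed as a small Python sketch. The 4KB page size and 2KB threshold come from the text; the function names and the ``rx_copybreak`` value of 256 are assumptions chosen for illustration, not values taken from the driver.

```python
PAGE_SIZE = 4096        # 4KB page, as in the example; 16KB is also possible
REUSE_THRESHOLD = 2048  # reuse only if more than 2KB of the page is unused


def remaining_after_rx(buf_size, pkt_len):
    """Bytes left in the RX buffer after a packet of pkt_len lands in it."""
    return buf_size - pkt_len


def can_reuse(buf_size, pkt_len, rx_copybreak=256, xdp_loaded=False):
    """Return True if the rest of the page may go back to HW as a buffer.

    rx_copybreak=256 is an assumed, illustrative value.
    """
    if xdp_loaded or pkt_len < rx_copybreak:
        # Mechanism is not used: XDP is loaded, or the packet is copied out
        # and the buffer keeps its original size.
        return False
    return remaining_after_rx(buf_size, pkt_len) > REUSE_THRESHOLD


left = remaining_after_rx(PAGE_SIZE, 300)   # the 300-byte packet from step 2
reuse = can_reuse(PAGE_SIZE, 300)
```

With the numbers from the example, 4096 - 300 leaves 3796 bytes, which is above the 2KB threshold, so the page is handed back to hardware as a smaller buffer.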
diff --git a/Documentation/networking/device_drivers/ethernet/amd/pds_core.rst b/Documentation/networking/device_drivers/ethernet/amd/pds_core.rst
new file mode 100644
index 000000000000..9e8a16c44102
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/amd/pds_core.rst
@@ -0,0 +1,139 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+========================================================
+Linux Driver for the AMD/Pensando(R) DSC adapter family
+========================================================
+
+Copyright(c) 2023 Advanced Micro Devices, Inc
+
+Identifying the Adapter
+=======================
+
+To find if one or more AMD/Pensando PCI Core devices are installed on the
+host, check for the PCI devices::
+
+ # lspci -d 1dd8:100c
+ b5:00.0 Processing accelerators: Pensando Systems Device 100c
+ b6:00.0 Processing accelerators: Pensando Systems Device 100c
+
+If such devices are listed as above, then the pds_core.ko driver should find
+and configure them for use. There should be log entries in the kernel
+messages such as these::
+
+ $ dmesg | grep pds_core
+ pds_core 0000:b5:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
+ pds_core 0000:b5:00.0: FW: 1.60.0-73
+ pds_core 0000:b6:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
+ pds_core 0000:b6:00.0: FW: 1.60.0-73
+
+Driver and firmware version information can be gathered with devlink::
+
+ $ devlink dev info pci/0000:b5:00.0
+ pci/0000:b5:00.0:
+ driver pds_core
+ serial_number FLM18420073
+ versions:
+ fixed:
+ asic.id 0x0
+ asic.rev 0x0
+ running:
+ fw 1.51.0-73
+ stored:
+ fw.goldfw 1.15.9-C-22
+ fw.mainfwa 1.60.0-73
+ fw.mainfwb 1.60.0-57
+
+Info versions
+=============
+
+The ``pds_core`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``fw``
+ - running
+ - Version of firmware running on the device
+ * - ``fw.goldfw``
+ - stored
+ - Version of firmware stored in the goldfw slot
+ * - ``fw.mainfwa``
+ - stored
+ - Version of firmware stored in the mainfwa slot
+ * - ``fw.mainfwb``
+ - stored
+ - Version of firmware stored in the mainfwb slot
+ * - ``asic.id``
+ - fixed
+ - The ASIC type for this device
+ * - ``asic.rev``
+ - fixed
+ - The revision of the ASIC for this device
+
+Parameters
+==========
+
+The ``pds_core`` driver implements the following generic
+parameters for controlling the functionality to be made available
+as auxiliary_bus devices.
+
+.. list-table:: Generic parameters implemented
+ :widths: 5 5 8 82
+
+ * - Name
+ - Mode
+ - Type
+ - Description
+ * - ``enable_vnet``
+ - runtime
+ - Boolean
+ - Enables vDPA functionality through an auxiliary_bus device
+
+Firmware Management
+===================
+
+The ``flash`` command can update the DSC firmware. The downloaded firmware
+will be saved into either firmware bank 1 or bank 2, whichever is not
+currently in use, and that bank will be used for the next boot::
+
+ # devlink dev flash pci/0000:b5:00.0 \
+ file pensando/dsc_fw_1.63.0-22.tar
+
+Health Reporters
+================
+
+The driver supports a devlink health reporter for FW status::
+
+ # devlink health show pci/0000:2b:00.0 reporter fw
+ pci/0000:2b:00.0:
+ reporter fw
+ state healthy error 0 recover 0
+ # devlink health diagnose pci/0000:2b:00.0 reporter fw
+ Status: healthy State: 1 Generation: 0 Recoveries: 0
+
+Enabling the driver
+===================
+
+The driver is enabled via the standard kernel configuration system,
+using the make command::
+
+ make oldconfig/menuconfig/etc.
+
+The driver is located in the menu structure at:
+
+ -> Device Drivers
+ -> Network device support (NETDEVICES [=y])
+ -> Ethernet driver support
+ -> AMD devices
+ -> AMD/Pensando Ethernet PDS_CORE Support
+
+Support
+=======
+
+For general Linux networking support, please use the netdev mailing
+list, which is monitored by AMD/Pensando personnel::
+
+ netdev@vger.kernel.org
diff --git a/Documentation/networking/device_drivers/ethernet/amd/pds_vdpa.rst b/Documentation/networking/device_drivers/ethernet/amd/pds_vdpa.rst
new file mode 100644
index 000000000000..587927d3de92
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/amd/pds_vdpa.rst
@@ -0,0 +1,85 @@
+.. SPDX-License-Identifier: GPL-2.0+
+.. note: can be edited and viewed with /usr/bin/formiko-vim
+
+==========================================================
+PCI vDPA driver for the AMD/Pensando(R) DSC adapter family
+==========================================================
+
+AMD/Pensando vDPA VF Device Driver
+
+Copyright(c) 2023 Advanced Micro Devices, Inc
+
+Overview
+========
+
+The ``pds_vdpa`` driver is an auxiliary bus driver that supplies
+a vDPA device for use by the virtio network stack. It is used with
+the Pensando Virtual Function devices that offer vDPA and virtio queue
+services. It depends on the ``pds_core`` driver and hardware for the PF
+and VF PCI handling as well as for device configuration services.
+
+Using the device
+================
+
+The ``pds_vdpa`` device is enabled via multiple configuration steps and
+depends on the ``pds_core`` driver to create and enable SR-IOV Virtual
+Function devices. After the VFs are enabled, we enable the vDPA service
+in the ``pds_core`` device to create the auxiliary devices used by pds_vdpa.
+
+Example steps:
+
+.. code-block:: bash
+
+ #!/bin/bash
+
+ modprobe pds_core
+ modprobe vdpa
+ modprobe pds_vdpa
+
+ PF_BDF=`ls /sys/module/pds_core/drivers/pci\:pds_core/*/sriov_numvfs | awk -F / '{print $7}'`
+
+ # Enable vDPA VF auxiliary device(s) in the PF
+ devlink dev param set pci/$PF_BDF name enable_vnet cmode runtime value true
+
+ # Create a VF for vDPA use
+ echo 1 > /sys/bus/pci/drivers/pds_core/$PF_BDF/sriov_numvfs
+
+ # Find the vDPA services/devices available
+ PDS_VDPA_MGMT=`vdpa mgmtdev show | grep vDPA | head -1 | cut -d: -f1`
+
+ # Create a vDPA device for use in virtio network configurations
+ vdpa dev add name vdpa1 mgmtdev $PDS_VDPA_MGMT mac 00:11:22:33:44:55
+
+ # Set up an ethernet interface on the vdpa device
+ modprobe virtio_vdpa
+
+
+
+Enabling the driver
+===================
+
+The driver is enabled via the standard kernel configuration system,
+using the make command::
+
+ make oldconfig/menuconfig/etc.
+
+The driver is located in the menu structure at:
+
+ -> Device Drivers
+ -> Network device support (NETDEVICES [=y])
+ -> Ethernet driver support
+ -> Pensando devices
+ -> Pensando Ethernet PDS_VDPA Support
+
+Support
+=======
+
+For general Linux networking support, please use the netdev mailing
+list, which is monitored by Pensando personnel::
+
+ netdev@vger.kernel.org
+
+For more specific support needs, please use the Pensando driver support
+email::
+
+ drivers@pensando.io
diff --git a/Documentation/networking/device_drivers/ethernet/amd/pds_vfio_pci.rst b/Documentation/networking/device_drivers/ethernet/amd/pds_vfio_pci.rst
new file mode 100644
index 000000000000..7a6bc848a2b2
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/amd/pds_vfio_pci.rst
@@ -0,0 +1,79 @@
+.. SPDX-License-Identifier: GPL-2.0+
+.. note: can be edited and viewed with /usr/bin/formiko-vim
+
+==========================================================
+PCI VFIO driver for the AMD/Pensando(R) DSC adapter family
+==========================================================
+
+AMD/Pensando Linux VFIO PCI Device Driver
+Copyright(c) 2023 Advanced Micro Devices, Inc.
+
+Overview
+========
+
+The ``pds-vfio-pci`` module is a PCI driver that supports Live Migration
+capable Virtual Function (VF) devices in the DSC hardware.
+
+Using the device
+================
+
+The pds-vfio-pci device is enabled via multiple configuration steps and
+depends on the ``pds_core`` driver to create and enable SR-IOV Virtual
+Function devices.
+
+Shown below are the steps to bind the driver to a VF and also to the
+associated auxiliary device created by the ``pds_core`` driver. This
+example assumes the pds_core and pds-vfio-pci modules are already
+loaded.
+
+.. code-block:: bash
+ :name: example-setup-script
+
+ #!/bin/bash
+
+ PF_BUS="0000:60"
+ PF_BDF="0000:60:00.0"
+ VF_BDF="0000:60:00.1"
+
+ # Prevent non-vfio VF driver from probing the VF device
+ echo 0 > /sys/class/pci_bus/$PF_BUS/device/$PF_BDF/sriov_drivers_autoprobe
+
+ # Create single VF for Live Migration via pds_core
+ echo 1 > /sys/bus/pci/drivers/pds_core/$PF_BDF/sriov_numvfs
+
+ # Allow the VF to be bound to the pds-vfio-pci driver
+ echo "pds-vfio-pci" > /sys/class/pci_bus/$PF_BUS/device/$VF_BDF/driver_override
+
+ # Bind the VF to the pds-vfio-pci driver
+ echo "$VF_BDF" > /sys/bus/pci/drivers/pds-vfio-pci/bind
+
+After performing the steps above, a file in /dev/vfio/<iommu_group>
+should have been created.
+
+
+Enabling the driver
+===================
+
+The driver is enabled via the standard kernel configuration system,
+using the make command::
+
+ make oldconfig/menuconfig/etc.
+
+The driver is located in the menu structure at:
+
+ -> Device Drivers
+ -> VFIO Non-Privileged userspace driver framework
+ -> VFIO support for PDS PCI devices
+
+Support
+=======
+
+For general Linux networking support, please use the netdev mailing
+list, which is monitored by Pensando personnel::
+
+ netdev@vger.kernel.org
+
+For more specific support needs, please use the Pensando driver support
+email::
+
+ drivers@pensando.io
diff --git a/Documentation/networking/device_drivers/aquantia/atlantic.rst b/Documentation/networking/device_drivers/ethernet/aquantia/atlantic.rst
index 595ddef1c8b3..099280a261be 100644
--- a/Documentation/networking/device_drivers/aquantia/atlantic.rst
+++ b/Documentation/networking/device_drivers/ethernet/aquantia/atlantic.rst
@@ -270,7 +270,7 @@ RX flow rules (ntuple filters)
ethtool -K ethX ntuple <on|off>
- When disabling ntuple filters, all the user programed filters are
+ When disabling ntuple filters, all the user programmed filters are
flushed from the driver cache and hardware. All needed filters must
be re-added when ntuple is re-enabled.
@@ -418,7 +418,7 @@ Default value: 0xFFFF
0 Disable interrupt throttling.
1 Enable interrupt throttling and use specified tx and rx rates.
0xFFFF Auto throttling mode. Driver will choose the best RX and TX
- interrupt throtting settings based on link speed.
+ interrupt throttling settings based on link speed.
====== ==============================================================
aq_itr_tx - TX interrupt throttle rate
@@ -456,7 +456,7 @@ AQ_CFG_RX_PAGEORDER
Default value: 0
-RX page order override. Thats a power of 2 number of RX pages allocated for
+RX page order override. That's a power of 2 number of RX pages allocated for
each descriptor. Received descriptor size is still limited by
AQ_CFG_RX_FRAME_MAX.
diff --git a/Documentation/networking/device_drivers/chelsio/cxgb.rst b/Documentation/networking/device_drivers/ethernet/chelsio/cxgb.rst
index 435dce5fa2c7..435dce5fa2c7 100644
--- a/Documentation/networking/device_drivers/chelsio/cxgb.rst
+++ b/Documentation/networking/device_drivers/ethernet/chelsio/cxgb.rst
diff --git a/Documentation/networking/device_drivers/cirrus/cs89x0.rst b/Documentation/networking/device_drivers/ethernet/cirrus/cs89x0.rst
index e5c283940ac5..e5c283940ac5 100644
--- a/Documentation/networking/device_drivers/cirrus/cs89x0.rst
+++ b/Documentation/networking/device_drivers/ethernet/cirrus/cs89x0.rst
diff --git a/Documentation/networking/device_drivers/davicom/dm9000.rst b/Documentation/networking/device_drivers/ethernet/davicom/dm9000.rst
index d5458da01083..14eb0a4d4e4e 100644
--- a/Documentation/networking/device_drivers/davicom/dm9000.rst
+++ b/Documentation/networking/device_drivers/ethernet/davicom/dm9000.rst
@@ -34,7 +34,7 @@ These resources should be specified in that order, as the ordering of the
two address regions is important (the driver expects these to be address
and then data).
-An example from arch/arm/mach-s3c2410/mach-bast.c is::
+An example from arch/arm/mach-s3c/mach-bast.c is::
static struct resource bast_dm9k_resource[] = {
[0] = {
diff --git a/Documentation/networking/device_drivers/dec/dmfe.rst b/Documentation/networking/device_drivers/ethernet/dec/dmfe.rst
index c4cf809cad84..c4cf809cad84 100644
--- a/Documentation/networking/device_drivers/dec/dmfe.rst
+++ b/Documentation/networking/device_drivers/ethernet/dec/dmfe.rst
diff --git a/Documentation/networking/device_drivers/dlink/dl2k.rst b/Documentation/networking/device_drivers/ethernet/dlink/dl2k.rst
index ccdb5d0d7460..ccdb5d0d7460 100644
--- a/Documentation/networking/device_drivers/dlink/dl2k.rst
+++ b/Documentation/networking/device_drivers/ethernet/dlink/dl2k.rst
diff --git a/Documentation/networking/device_drivers/freescale/dpaa.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa.rst
index 241c6c6f6e68..241c6c6f6e68 100644
--- a/Documentation/networking/device_drivers/freescale/dpaa.rst
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa.rst
diff --git a/Documentation/networking/device_drivers/freescale/dpaa2/dpio-driver.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/dpio-driver.rst
index 17dbee1ac53e..e4ebfe62a183 100644
--- a/Documentation/networking/device_drivers/freescale/dpaa2/dpio-driver.rst
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/dpio-driver.rst
@@ -1,5 +1,6 @@
.. include:: <isonum.txt>
+===================================
DPAA2 DPIO (Data Path I/O) Overview
===================================
@@ -19,8 +20,10 @@ pool management for network interfaces.
This document provides an overview of the Linux DPIO driver, its
subcomponents, and its APIs.
-See Documentation/networking/device_drivers/freescale/dpaa2/overview.rst for
-a general overview of DPAA2 and the general DPAA2 driver architecture in Linux.
+See
+Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
+for a general overview of DPAA2 and the general DPAA2 driver architecture
+in Linux.
Driver Overview
---------------
diff --git a/Documentation/networking/device_drivers/freescale/dpaa2/ethernet-driver.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/ethernet-driver.rst
index cb4c9a0c5a17..682f3986c15b 100644
--- a/Documentation/networking/device_drivers/freescale/dpaa2/ethernet-driver.rst
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/ethernet-driver.rst
@@ -33,7 +33,8 @@ hardware resources, like queues, do not have a corresponding MC object and
are treated as internal resources of other objects.
For a more detailed description of the DPAA2 architecture and its object
-abstractions see *Documentation/networking/device_drivers/freescale/dpaa2/overview.rst*.
+abstractions see
+*Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst*.
Each Linux net device is built on top of a Datapath Network Interface (DPNI)
object and uses Buffer Pools (DPBPs), I/O Portals (DPIOs) and Concentrators
diff --git a/Documentation/networking/device_drivers/freescale/dpaa2/index.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/index.rst
index ee40fcc5ddff..62f4a4aff6ec 100644
--- a/Documentation/networking/device_drivers/freescale/dpaa2/index.rst
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/index.rst
@@ -9,3 +9,4 @@ DPAA2 Documentation
dpio-driver
ethernet-driver
mac-phy-support
+ switch-driver
diff --git a/Documentation/networking/device_drivers/freescale/dpaa2/mac-phy-support.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst
index 51e6624fb774..e2a36d0d88ef 100644
--- a/Documentation/networking/device_drivers/freescale/dpaa2/mac-phy-support.rst
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst
@@ -11,7 +11,7 @@ Overview
--------
The DPAA2 MAC / PHY support consists of a set of APIs that help DPAA2 network
-drivers (dpaa2-eth, dpaa2-ethsw) interract with the PHY library.
+drivers (dpaa2-eth, dpaa2-ethsw) interact with the PHY library.
DPAA2 Software Architecture
---------------------------
@@ -181,10 +181,13 @@ when necessary using the below listed API::
- int dpaa2_mac_connect(struct dpaa2_mac *mac);
- void dpaa2_mac_disconnect(struct dpaa2_mac *mac);
-A phylink integration is necessary only when the partner DPMAC is not of TYPE_FIXED.
-One can check for this condition using the below API::
+A phylink integration is necessary only when the partner DPMAC is not of
+``TYPE_FIXED``. This means it is either of ``TYPE_PHY`` or of
+``TYPE_BACKPLANE`` (the difference between the two being that in
+``TYPE_BACKPLANE`` mode, the MC firmware does not access the PCS registers).
+One can check for this condition using the following helper::
- - bool dpaa2_mac_is_type_fixed(struct fsl_mc_device *dpmac_dev,struct fsl_mc_io *mc_io);
+ - static inline bool dpaa2_mac_is_type_phy(struct dpaa2_mac *mac);
Before connection to a MAC, the caller must allocate and populate the
dpaa2_mac structure with the associated net_device, a pointer to the MC portal
diff --git a/Documentation/networking/device_drivers/freescale/dpaa2/overview.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
index d638b5a8aadd..199647729251 100644
--- a/Documentation/networking/device_drivers/freescale/dpaa2/overview.rst
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
@@ -183,6 +183,7 @@ PHY and allows physical transmission and reception of Ethernet frames.
IRQ config, enable, reset
DPNI (Datapath Network Interface)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Contains TX/RX queues, network interface configuration, and RX buffer pool
configuration mechanisms. The TX/RX queues are in memory and are identified
by queue number.
diff --git a/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/switch-driver.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/switch-driver.rst
new file mode 100644
index 000000000000..8bf411b857d4
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/switch-driver.rst
@@ -0,0 +1,217 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===================
+DPAA2 Switch driver
+===================
+
+:Copyright: |copy| 2021 NXP
+
+The DPAA2 Switch driver probes on the Datapath Switch (DPSW) object which can
+be instantiated on the following DPAA2 SoCs and their variants: LS2088A and
+LX2160A.
+
+The driver uses the switch device driver model and exposes each switch port as
+a network interface, which can be included in a bridge or used as a standalone
+interface. Traffic switched between ports is offloaded into the hardware.
+
+The DPSW can have ports connected to DPNIs or to DPMACs for external access.
+::
+
+ [ethA] [ethB] [ethC] [ethD] [ethE] [ethF]
+ : : : : : :
+ : : : : : :
+ [dpaa2-eth] [dpaa2-eth] [ dpaa2-switch ]
+ : : : : : : kernel
+ =============================================================================
+ : : : : : : hardware
+ [DPNI] [DPNI] [============= DPSW =================]
+ | | | | | |
+ | ---------- | [DPMAC] [DPMAC]
+ ------------------------------- | |
+ | |
+ [PHY] [PHY]
+
+Creating an Ethernet Switch
+===========================
+
+The dpaa2-switch driver probes on DPSW devices found on the fsl-mc bus. These
+devices can be either created statically through the boot time configuration
+file - DataPath Layout (DPL) - or at runtime using the DPAA2 object APIs
+(incorporated already into the restool userspace tool).
+
+At the moment, the dpaa2-switch driver imposes the following restrictions on
+the DPSW object that it will probe:
+
+ * The minimum number of FDBs should be at least equal to the number of switch
+ interfaces. This is necessary so that separation of switch ports can be
+ done, i.e. when not under a bridge, each switch port will have its own FDB.
+ ::
+
+ fsl_dpaa2_switch dpsw.0: The number of FDBs is lower than the number of ports, cannot probe
+
+ * Both the broadcast and flooding configuration should be per FDB. This
+ enables the driver to restrict the broadcast and flooding domains of each
+ FDB depending on the switch ports that are sharing it (aka are under the
+ same bridge).
+ ::
+
+ fsl_dpaa2_switch dpsw.0: Flooding domain is not per FDB, cannot probe
+ fsl_dpaa2_switch dpsw.0: Broadcast domain is not per FDB, cannot probe
+
+ * The control interface of the switch should not be disabled
+ (DPSW_OPT_CTRL_IF_DIS not passed as a create time option). Without the
+ control interface, the driver cannot provide proper Rx/Tx traffic
+ support on the switch port netdevices.
+ ::
+
+ fsl_dpaa2_switch dpsw.0: Control Interface is disabled, cannot probe
+
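
The three probe-time restrictions listed above can be summarised in a short Python sketch. The error strings are the ones quoted in the text; the function and parameter names are hypothetical and do not correspond to driver symbols.

```python
def probe_errors(num_fdbs, num_ports, flooding_per_fdb,
                 broadcast_per_fdb, ctrl_if_enabled):
    """Return the probe failure messages a DPSW configuration would trigger."""
    errors = []
    if num_fdbs < num_ports:
        # Each standalone port needs its own FDB.
        errors.append("The number of FDBs is lower than the number of ports, "
                      "cannot probe")
    if not flooding_per_fdb:
        errors.append("Flooding domain is not per FDB, cannot probe")
    if not broadcast_per_fdb:
        errors.append("Broadcast domain is not per FDB, cannot probe")
    if not ctrl_if_enabled:
        errors.append("Control Interface is disabled, cannot probe")
    return errors


# A 4-port switch with only 2 FDBs and everything else configured correctly.
example = probe_errors(num_fdbs=2, num_ports=4, flooding_per_fdb=True,
                       broadcast_per_fdb=True, ctrl_if_enabled=True)
```

An empty list means the object satisfies all of the restrictions described in this section.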
+Besides the configuration of the actual DPSW object, the dpaa2-switch driver
+will need the following DPAA2 objects:
+
+ * 1 DPMCP - A Management Command Portal object is needed for any interaction
+ with the MC firmware.
+
+ * 1 DPBP - A Buffer Pool is used for seeding buffers intended for the Rx path
+ on the control interface.
+
+ * Access to at least one DPIO object (Software Portal) is needed for any
+ enqueue/dequeue operation to be performed on the control interface queues.
+ The DPIO object will be shared, no need for a private one.
+
+Switching features
+==================
+
+The driver supports the configuration of L2 forwarding rules in hardware for
+port bridging as well as standalone usage of the independent switch interfaces.
+
+The hardware is not configurable with respect to VLAN awareness, thus any DPAA2
+switch port should be used only in usecases with a VLAN aware bridge::
+
+ $ ip link add dev br0 type bridge vlan_filtering 1
+
+ $ ip link add dev br1 type bridge
+ $ ip link set dev ethX master br1
+ Error: fsl_dpaa2_switch: Cannot join a VLAN-unaware bridge
+
+Topology and loop detection through STP is supported when ``stp_state 1`` is
+used at bridge creation::
+
+ $ ip link add dev br0 type bridge vlan_filtering 1 stp_state 1
+
+L2 FDB manipulation (add/delete/dump) is supported.
+
+HW FDB learning can be configured on each switch port independently through
+bridge commands. When the HW learning is disabled, a fast age procedure will be
+run and any previously learnt addresses will be removed.
+::
+
+ $ bridge link set dev ethX learning off
+ $ bridge link set dev ethX learning on
+
+Restricting the unknown unicast and multicast flooding domain is supported, but
+not independently of each other::
+
+ $ ip link set dev ethX type bridge_slave flood off mcast_flood off
+ $ ip link set dev ethX type bridge_slave flood off mcast_flood on
+ Error: fsl_dpaa2_switch: Cannot configure multicast flooding independently of unicast.
+
+Broadcast flooding on a switch port can be disabled/enabled through the brport sysfs::
+
+ $ echo 0 > /sys/bus/fsl-mc/devices/dpsw.Y/net/ethX/brport/broadcast_flood
+
+Offloads
+========
+
+Routing actions (redirect, trap, drop)
+--------------------------------------
+
+The DPAA2 switch is able to offload flow-based redirection of packets making
+use of ACL tables. Shared filter blocks are supported by sharing a single ACL
+table between multiple ports.
+
+The following flow keys are supported:
+
+ * Ethernet: dst_mac/src_mac
+ * IPv4: dst_ip/src_ip/ip_proto/tos
+ * VLAN: vlan_id/vlan_prio/vlan_tpid/vlan_dei
+ * L4: dst_port/src_port
+
+Also, the matchall filter can be used to redirect the entire traffic received
+on a port.
+
+The following flow actions are supported:
+
+ * drop
+ * mirred egress redirect
+ * trap
+
+Each ACL entry (filter) can be set up with only one of the listed
+actions.
+
+Example 1: send frames received on eth4 with a SA of 00:01:02:03:04:05 to the
+CPU::
+
+ $ tc qdisc add dev eth4 clsact
+ $ tc filter add dev eth4 ingress flower src_mac 00:01:02:03:04:05 skip_sw action trap
+
+Example 2: drop frames received on eth4 with VID 100 and PCP of 3::
+
+ $ tc filter add dev eth4 ingress protocol 802.1q flower skip_sw vlan_id 100 vlan_prio 3 action drop
+
+Example 3: redirect all frames received on eth4 to eth1::
+
+ $ tc filter add dev eth4 ingress matchall action mirred egress redirect dev eth1
+
+Example 4: Use a single shared filter block on both eth5 and eth6::
+
+ $ tc qdisc add dev eth5 ingress_block 1 clsact
+ $ tc qdisc add dev eth6 ingress_block 1 clsact
+ $ tc filter add block 1 ingress flower dst_mac 00:01:02:03:04:04 skip_sw \
+ action trap
+ $ tc filter add block 1 ingress protocol ipv4 flower src_ip 192.168.1.1 skip_sw \
+ action mirred egress redirect dev eth3
+
+Mirroring
+~~~~~~~~~
+
+The DPAA2 switch supports only per port mirroring and per VLAN mirroring.
+Adding mirroring filters in shared blocks is also supported.
+
+When using the tc-flower classifier with the 802.1q protocol, only the
+``vlan_id`` key will be accepted. Mirroring based on any other fields from the
+802.1q protocol will be rejected::
+
+ $ tc qdisc add dev eth8 ingress_block 1 clsact
+ $ tc filter add block 1 ingress protocol 802.1q flower skip_sw vlan_prio 3 action mirred egress mirror dev eth6
+ Error: fsl_dpaa2_switch: Only matching on VLAN ID supported.
+ We have an error talking to the kernel
+
+If a mirroring VLAN filter is requested on a port, the VLAN must be
+installed on the switch port in question either using ``bridge`` or by creating
+a VLAN upper device if the switch port is used as a standalone interface::
+
+ $ tc qdisc add dev eth8 ingress_block 1 clsact
+ $ tc filter add block 1 ingress protocol 802.1q flower skip_sw vlan_id 200 action mirred egress mirror dev eth6
+ Error: VLAN must be installed on the switch port.
+ We have an error talking to the kernel
+
+ $ bridge vlan add vid 200 dev eth8
+ $ tc filter add block 1 ingress protocol 802.1q flower skip_sw vlan_id 200 action mirred egress mirror dev eth6
+
+ $ ip link add link eth8 name eth8.200 type vlan id 200
+ $ tc filter add block 1 ingress protocol 802.1q flower skip_sw vlan_id 200 action mirred egress mirror dev eth6
+
+Also, it should be noted that the mirrored traffic will be subject to the same
+egress restrictions as any other traffic. This means that when a mirrored
+packet reaches the mirror port, it is dropped if the VLAN found in the packet
+is not installed on that port.
+
+The DPAA2 switch supports only a single mirroring destination, thus multiple
+mirror rules can be installed but their ``to`` port has to be the same::
+
+ $ tc filter add block 1 ingress protocol 802.1q flower skip_sw vlan_id 200 action mirred egress mirror dev eth6
+ $ tc filter add block 1 ingress protocol 802.1q flower skip_sw vlan_id 100 action mirred egress mirror dev eth7
+ Error: fsl_dpaa2_switch: Multiple mirror ports not supported.
+ We have an error talking to the kernel
diff --git a/Documentation/networking/device_drivers/freescale/gianfar.rst b/Documentation/networking/device_drivers/ethernet/freescale/gianfar.rst
index 9c4a91d3824b..9c4a91d3824b 100644
--- a/Documentation/networking/device_drivers/freescale/gianfar.rst
+++ b/Documentation/networking/device_drivers/ethernet/freescale/gianfar.rst
diff --git a/Documentation/networking/device_drivers/google/gve.rst b/Documentation/networking/device_drivers/ethernet/google/gve.rst
index 793693cef6e3..31d621bca82e 100644
--- a/Documentation/networking/device_drivers/google/gve.rst
+++ b/Documentation/networking/device_drivers/ethernet/google/gve.rst
@@ -47,13 +47,33 @@ The driver interacts with the device in the following ways:
- Transmit and Receive Queues
- See description below
+Descriptor Formats
+------------------
+GVE supports two descriptor formats: GQI and DQO. These two formats have
+entirely different descriptors, which will be described below.
+
+Addressing Mode
+------------------
+GVE supports two addressing modes: QPL and RDA.
+QPL ("queue-page-list") mode communicates data through a set of
+pre-registered pages.
+
+For RDA ("raw DMA addressing") mode, the set of pages is dynamic.
+Therefore, the packet buffers can be anywhere in guest memory.
+
Registers
---------
-All registers are MMIO and big endian.
+All registers are MMIO.
The registers are used for initializing and configuring the device as well as
querying device status in response to management interrupts.
+Endianness
+----------
+- Admin Queue messages and registers are all Big Endian.
+- GQI descriptors and datapath registers are Big Endian.
+- DQO descriptors and datapath registers are Little Endian.
+
Admin Queue (AQ)
----------------
The Admin Queue is a PAGE_SIZE memory block, treated as an array of AQ
@@ -97,10 +117,10 @@ the queues associated with that interrupt.
The handler for these irqs schedules the napi for that block to run
and poll the queues.
-Traffic Queues
---------------
-gVNIC's queues are composed of a descriptor ring and a buffer and are
-assigned to a notification block.
+GQI Traffic Queues
+------------------
+GQI queues are composed of a descriptor ring and a buffer and are assigned to a
+notification block.
The descriptor rings are power-of-two-sized ring buffers consisting of
fixed-size descriptors. They advance their head pointer using a __be32
@@ -121,3 +141,35 @@ Receive
The buffers for receive rings are put into a data ring that is the same
length as the descriptor ring and the head and tail pointers advance over
the rings together.
+
+DQO Traffic Queues
+------------------
+- Every TX and RX queue is assigned a notification block.
+
+- TX and RX buffer queues, which send descriptors to the device, use MMIO
+ doorbells to notify the device of new descriptors.
+
+- RX and TX completion queues, which receive descriptors from the device, use a
+ "generation bit" to know when a descriptor was populated by the device. The
+ driver initializes all bits with the "current generation". The device will
+ populate received descriptors with the "next generation" which is inverted
+ from the current generation. When the ring wraps, the current/next generation
+ are swapped.
+
+- It's the driver's responsibility to ensure that the RX and TX completion
+ queues are not overrun. This can be accomplished by limiting the number of
+ descriptors posted to HW.
+
+- TX packets have a 16 bit completion_tag and RX buffers have a 16 bit
+ buffer_id. These will be returned on the TX completion and RX queues
+ respectively to let the driver know which packet/buffer was completed.
+
+Transmit
+~~~~~~~~
+A packet's buffers are DMA mapped for the device to access before transmission.
+After the packet has been successfully transmitted, the buffers are unmapped.
+
+Receive
+~~~~~~~
+The driver posts fixed-size buffers to HW on the RX buffer queue. The packet
+received on the associated RX queue may span multiple descriptors.
diff --git a/Documentation/networking/hinic.rst b/Documentation/networking/device_drivers/ethernet/huawei/hinic.rst
index 867ac8f4e04a..867ac8f4e04a 100644
--- a/Documentation/networking/hinic.rst
+++ b/Documentation/networking/device_drivers/ethernet/huawei/hinic.rst
diff --git a/Documentation/networking/device_drivers/ethernet/index.rst b/Documentation/networking/device_drivers/ethernet/index.rst
new file mode 100644
index 000000000000..6932d8c043c2
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/index.rst
@@ -0,0 +1,66 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Ethernet Device Drivers
+=======================
+
+Device drivers for Ethernet and Ethernet-based virtual function devices.
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ 3com/3c509
+ 3com/vortex
+ amazon/ena
+ altera/altera_tse
+ amd/pds_core
+ amd/pds_vdpa
+ amd/pds_vfio_pci
+ aquantia/atlantic
+ chelsio/cxgb
+ cirrus/cs89x0
+ dlink/dl2k
+ davicom/dm9000
+ dec/dmfe
+ freescale/dpaa
+ freescale/dpaa2/index
+ freescale/gianfar
+ google/gve
+ huawei/hinic
+ intel/e100
+ intel/e1000
+ intel/e1000e
+ intel/fm10k
+ intel/idpf
+ intel/igb
+ intel/igbvf
+ intel/ixgbe
+ intel/ixgbevf
+ intel/i40e
+ intel/iavf
+ intel/ice
+ marvell/octeontx2
+ marvell/octeon_ep
+ marvell/octeon_ep_vf
+ mellanox/mlx5/index
+ microsoft/netvsc
+ neterion/s2io
+ netronome/nfp
+ pensando/ionic
+ smsc/smc9
+ stmicro/stmmac
+ ti/cpsw
+ ti/cpsw_switchdev
+ ti/am65_nuss_cpsw_switchdev
+ ti/tlan
+ toshiba/spider_net
+ wangxun/txgbe
+ wangxun/ngbe
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/intel/e100.rst b/Documentation/networking/device_drivers/ethernet/intel/e100.rst
index 3ac21e7119a7..5dee1b53e977 100644
--- a/Documentation/networking/device_drivers/intel/e100.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/e100.rst
@@ -41,7 +41,7 @@ Identifying Your Adapter
For information on how to identify your adapter, and for the latest Intel
network drivers, refer to the Intel Support website:
-http://www.intel.com/support
+https://www.intel.com/support
Driver Configuration Parameters
===============================
@@ -151,8 +151,7 @@ NAPI
NAPI (Rx polling mode) is supported in the e100 driver.
-See https://wiki.linuxfoundation.org/networking/napi for more
-information on NAPI.
+See :ref:`Documentation/networking/napi.rst <napi>` for more information.
Multiple Interfaces on Same Ethernet Broadcast Network
------------------------------------------------------
@@ -179,10 +178,8 @@ filtering by
Support
=======
For general information, go to the Intel support website at:
-http://www.intel.com/support/
+https://www.intel.com/support/
-or the Intel Wired Networking project hosted by Sourceforge at:
-http://sourceforge.net/projects/e1000
If an issue is identified with the released source code on a supported kernel
with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/intel/e1000.rst b/Documentation/networking/device_drivers/ethernet/intel/e1000.rst
index 4aaae0f7d6ba..52a7fb9ce8d9 100644
--- a/Documentation/networking/device_drivers/intel/e1000.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/e1000.rst
@@ -451,13 +451,8 @@ Support
=======
For general information, go to the Intel support website at:
-
- http://support.intel.com
-
-or the Intel Wired Networking project hosted by Sourceforge at:
-
- http://sourceforge.net/projects/e1000
+http://support.intel.com
If an issue is identified with the released source code on the supported
kernel with a supported adapter, email the specific information related
-to the issue to e1000-devel@lists.sf.net
+to the issue to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/intel/e1000e.rst b/Documentation/networking/device_drivers/ethernet/intel/e1000e.rst
index f49cd370e7bf..d8f810afdd49 100644
--- a/Documentation/networking/device_drivers/intel/e1000e.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/e1000e.rst
@@ -371,13 +371,8 @@ NOTE: Wake on LAN is only supported on port A for the following devices:
Support
=======
For general information, go to the Intel support website at:
-
https://www.intel.com/support/
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
If an issue is identified with the released source code on a supported kernel
with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/intel/fm10k.rst b/Documentation/networking/device_drivers/ethernet/intel/fm10k.rst
index 4d279e64e221..396a2c8c3db1 100644
--- a/Documentation/networking/device_drivers/intel/fm10k.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/fm10k.rst
@@ -22,7 +22,7 @@ Ethernet Multi-host Controller.
For information on how to identify your adapter, and for the latest Intel
network drivers, refer to the Intel Support website:
-http://www.intel.com/support
+https://www.intel.com/support
Flow Control
@@ -130,13 +130,8 @@ the Intel Ethernet Controller XL710.
Support
=======
For general information, go to the Intel support website at:
-
https://www.intel.com/support/
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
If an issue is identified with the released source code on a supported kernel
with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/intel/i40e.rst b/Documentation/networking/device_drivers/ethernet/intel/i40e.rst
index 8a9b18573688..4fbaa1a2d674 100644
--- a/Documentation/networking/device_drivers/intel/i40e.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/i40e.rst
@@ -173,7 +173,7 @@ Director rule is added from ethtool (Sideband filter), ATR is turned off by the
driver. To re-enable ATR, the sideband can be disabled with the ethtool -K
option. For example::
- ethtool –K [adapter] ntuple [off|on]
+ ethtool -K [adapter] ntuple [off|on]
If sideband is re-enabled after ATR is re-enabled, ATR remains enabled until a
TCP-IP flow is added. When all TCP-IP sideband rules are deleted, ATR is
@@ -399,8 +399,8 @@ operate only in full duplex and only at their native speed.
NAPI
----
NAPI (Rx polling mode) is supported in the i40e driver.
-For more information on NAPI, see
-https://wiki.linuxfoundation.org/networking/napi
+
+See :ref:`Documentation/networking/napi.rst <napi>` for more information.
Flow Control
------------
@@ -466,7 +466,7 @@ network. PTP support varies among Intel devices that support this driver. Use
"ethtool -T <netdev name>" to get a definitive list of PTP capabilities
supported by the device.
-IEEE 802.1ad (QinQ) Support
+IEEE 802.1ad (QinQ) Support
---------------------------
The IEEE 802.1ad standard, informally known as QinQ, allows for multiple VLAN
IDs within a single Ethernet frame. VLAN IDs are sometimes referred to as
@@ -523,8 +523,8 @@ of a port's bandwidth (should it be available). The sum of all the values for
Maximum Bandwidth is not restricted, because no more than 100% of a port's
bandwidth can ever be used.
-NOTE: X710/XXV710 devices fail to enable Max VFs (64) when Multiple Functions
-per Port (MFP) and SR-IOV are enabled. An error from i40e is logged that says
+NOTE: X710/XXV710 devices fail to enable Max VFs (64) when Multiple Functions
+per Port (MFP) and SR-IOV are enabled. An error from i40e is logged that says
"add vsi failed for VF N, aq_err 16". To work around the issue, enable fewer than
64 virtual functions (VFs).
@@ -688,7 +688,7 @@ shaper bw_rlimit: for each tc, sets minimum and maximum bandwidth rates.
Totals must be equal or less than port speed.
For example: min_rate 1Gbit 3Gbit: Verify bandwidth limit using network
-monitoring tools such as ifstat or sar –n DEV [interval] [number of samples]
+monitoring tools such as ``ifstat`` or ``sar -n DEV [interval] [number of samples]``
2. Enable HW TC offload on interface::
@@ -759,13 +759,8 @@ enabled when setting up DCB on your switch.
Support
=======
For general information, go to the Intel support website at:
-
https://www.intel.com/support/
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
If an issue is identified with the released source code on a supported kernel
with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/intel/iavf.rst b/Documentation/networking/device_drivers/ethernet/intel/iavf.rst
index 84ac7e75f363..eb926c3bd4cd 100644
--- a/Documentation/networking/device_drivers/intel/iavf.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/iavf.rst
@@ -43,7 +43,7 @@ device.
For information on how to identify your adapter, and for the latest NVM/FW
images and Intel network drivers, refer to the Intel Support website:
-http://www.intel.com/support
+https://www.intel.com/support
Additional Features and Configurations
@@ -113,7 +113,7 @@ which the AVF is associated. The following are base mode features:
- AVF device ID
- HW mailbox is used for VF to PF communications (including on Windows)
-IEEE 802.1ad (QinQ) Support
+IEEE 802.1ad (QinQ) Support
---------------------------
The IEEE 802.1ad standard, informally known as QinQ, allows for multiple VLAN
IDs within a single Ethernet frame. VLAN IDs are sometimes referred to as
@@ -179,7 +179,7 @@ shaper bw_rlimit: for each tc, sets minimum and maximum bandwidth rates.
Totals must be equal or less than port speed.
For example: min_rate 1Gbit 3Gbit: Verify bandwidth limit using network
-monitoring tools such as ifstat or sar –n DEV [interval] [number of samples]
+monitoring tools such as ``ifstat`` or ``sar -n DEV [interval] [number of samples]``
NOTE:
Setting up channels via ethtool (ethtool -L) is not supported when the
@@ -319,13 +319,8 @@ This is caused by the way the Linux kernel reports this stressed condition.
Support
=======
For general information, go to the Intel support website at:
-
https://support.intel.com
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
If an issue is identified with the released source code on the supported kernel
with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/ice.rst b/Documentation/networking/device_drivers/ethernet/intel/ice.rst
new file mode 100644
index 000000000000..934752f675ba
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/intel/ice.rst
@@ -0,0 +1,1175 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=================================================================
+Linux Base Driver for the Intel(R) Ethernet Controller 800 Series
+=================================================================
+
+Intel ice Linux driver.
+Copyright(c) 2018-2021 Intel Corporation.
+
+Contents
+========
+
+- Overview
+- Identifying Your Adapter
+- Important Notes
+- Additional Features and Configurations
+- Performance Optimization
+
+
+The associated Virtual Function (VF) driver for this driver is iavf.
+
+Driver information can be obtained using ethtool and lspci.
+
+For questions related to hardware requirements, refer to the documentation
+supplied with your Intel adapter. All hardware requirements listed apply to use
+with Linux.
+
+This driver supports XDP (Express Data Path) and AF_XDP zero-copy. Note that
+XDP is blocked for frame sizes larger than 3KB.
+
+
+Identifying Your Adapter
+========================
+For information on how to identify your adapter, and for the latest Intel
+network drivers, refer to the Intel Support website:
+https://www.intel.com/support
+
+
+Important Notes
+===============
+
+Packet drops may occur under receive stress
+-------------------------------------------
+Devices based on the Intel(R) Ethernet Controller 800 Series are designed to
+tolerate a limited amount of system latency during PCIe and DMA transactions.
+If these transactions take longer than the tolerated latency, it can impact the
+length of time the packets are buffered in the device and associated memory,
+which may result in dropped packets. These packet drops typically do not have
+a noticeable impact on throughput and performance under standard workloads.
+
+If these packet drops appear to affect your workload, the following may improve
+the situation:
+
+1) Make sure that your system's physical memory is in a high-performance
+ configuration, as recommended by the platform vendor. A common
+ recommendation is for all channels to be populated with a single DIMM
+ module.
+2) In your system's BIOS/UEFI settings, select the "Performance" profile.
+3) Your distribution may provide tools like "tuned," which can help tweak
+ kernel settings to achieve better standard settings for different workloads.
+
+
+Configuring SR-IOV for improved network security
+------------------------------------------------
+In a virtualized environment, on Intel(R) Ethernet Network Adapters that
+support SR-IOV, the virtual function (VF) may be subject to malicious behavior.
+Software-generated layer two frames, like IEEE 802.3x (link flow control), IEEE
+802.1Qbb (priority based flow-control), and others of this type, are not
+expected and can throttle traffic between the host and the virtual switch,
+reducing performance. To resolve this issue, and to ensure isolation from
+unintended traffic streams, configure all SR-IOV enabled ports for VLAN tagging
+from the administrative interface on the PF. This configuration allows
+unexpected, and potentially malicious, frames to be dropped.
+
+See "Configuring VLAN Tagging on SR-IOV Enabled Adapter Ports" later in this
+document for configuration instructions.
+
+
+Do not unload port driver if VF with active VM is bound to it
+-------------------------------------------------------------
+Do not unload a port's driver if a Virtual Function (VF) with an active Virtual
+Machine (VM) is bound to it. Doing so will cause the port to appear to hang.
+Once the VM shuts down, or otherwise releases the VF, the command will
+complete.
+
+
+Additional Features and Configurations
+======================================
+
+ethtool
+-------
+The driver utilizes the ethtool interface for driver configuration and
+diagnostics, as well as displaying statistical information. The latest ethtool
+version is required for this functionality. Download it at:
+https://kernel.org/pub/software/network/ethtool/
+
+NOTE: The rx_bytes value of ethtool does not match the rx_bytes value of
+Netdev, due to the 4-byte CRC being stripped by the device. The difference
+between the two rx_bytes values will be 4 x the number of Rx packets. For
+example, if Rx packets are 10 and Netdev (software statistics) displays
+rx_bytes as "X", then ethtool (hardware statistics) will display rx_bytes as
+"X+40" (4 bytes CRC x 10 packets).
+
+
+Viewing Link Messages
+---------------------
+Link messages will not be displayed to the console if the distribution is
+restricting system messages. In order to see network driver link messages on
+your console, set the console log level to eight by entering the following::
+
+ # dmesg -n 8
+
+NOTE: This setting is not saved across reboots.
+
+
+Dynamic Device Personalization
+------------------------------
+Dynamic Device Personalization (DDP) allows you to change the packet processing
+pipeline of a device by applying a profile package to the device at runtime.
+Profiles can be used to, for example, add support for new protocols, change
+existing protocols, or change default settings. DDP profiles can also be rolled
+back without rebooting the system.
+
+The DDP package loads during device initialization. The driver looks for
+``intel/ice/ddp/ice.pkg`` in your firmware root (typically ``/lib/firmware/``
+or ``/lib/firmware/updates/``) and checks that it contains a valid DDP package
+file.
+
+NOTE: Your distribution should likely have provided the latest DDP file, but if
+ice.pkg is missing, you can find it in the linux-firmware repository or from
+intel.com.
+
+If the driver is unable to load the DDP package, the device will enter Safe
+Mode. Safe Mode disables advanced and performance features and supports only
+basic traffic and minimal functionality, such as updating the NVM or
+downloading a new driver or DDP package. Safe Mode only applies to the affected
+physical function and does not impact any other PFs. See the "Intel(R) Ethernet
+Adapters and Devices User Guide" for more details on DDP and Safe Mode.
+
+NOTES:
+
+- If you encounter issues with the DDP package file, you may need to download
+ an updated driver or DDP package file. See the log messages for more
+ information.
+
+- The ice.pkg file is a symbolic link to the default DDP package file.
+
+- You cannot update the DDP package if any PF drivers are already loaded. To
+ overwrite a package, unload all PFs and then reload the driver with the new
+ package.
+
+- Only the first loaded PF per device can download a package for that device.
+
+You can install specific DDP package files for different physical devices in
+the same system. To install a specific DDP package file:
+
+1. Download the DDP package file you want for your device.
+
+2. Rename the file ice-xxxxxxxxxxxxxxxx.pkg, where 'xxxxxxxxxxxxxxxx' is the
+ unique 64-bit PCI Express device serial number (in hex) of the device you
+ want the package downloaded on. The filename must include the complete
+ serial number (including leading zeros) and be all lowercase. For example,
+ if the 64-bit serial number is b887a3ffffca0568, then the file name would be
+ ice-b887a3ffffca0568.pkg.
+
+ To find the serial number from the PCI bus address, you can use the
+ following command::
+
+ # lspci -vv -s af:00.0 | grep -i Serial
+ Capabilities: [150 v1] Device Serial Number b8-87-a3-ff-ff-ca-05-68
+
+ You can use the following command to format the serial number without the
+ dashes::
+
+ # lspci -vv -s af:00.0 | grep -i Serial | awk '{print $7}' | sed s/-//g
+ b887a3ffffca0568
+
+3. Copy the renamed DDP package file to
+ ``/lib/firmware/updates/intel/ice/ddp/``. If the directory does not yet
+ exist, create it before copying the file.
+
+4. Unload all of the PFs on the device.
+
+5. Reload the driver with the new package.
+
+NOTE: The presence of a device-specific DDP package file overrides the loading
+of the default DDP package file (ice.pkg).
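+
+The renaming rule in step 2 can be sketched in shell. This is a minimal
+sketch, not part of the driver or of any Intel tool: ``ddp_pkg_name`` is a
+hypothetical helper name, and it only assumes that the serial number is given
+in the dashed form printed by ``lspci``:
+
```shell
#!/bin/sh
# ddp_pkg_name is a hypothetical helper, not part of any existing tool.
# It turns the dashed serial number printed by lspci into the per-device
# DDP package file name described in step 2.
ddp_pkg_name() {
    # Drop the dashes and force lowercase, as the file name must be lowercase.
    serial=$(printf '%s' "$1" | tr -d '-' | tr 'A-F' 'a-f')
    # Reject anything that is not purely hexadecimal.
    case "$serial" in
        ""|*[!0-9a-f]*)
            echo "invalid serial number: $1" >&2
            return 1
            ;;
    esac
    # A 64-bit serial number is exactly 16 hex digits, leading zeros included.
    if [ "${#serial}" -ne 16 ]; then
        echo "expected 16 hex digits, got ${#serial}" >&2
        return 1
    fi
    printf 'ice-%s.pkg\n' "$serial"
}

# Serial number from the lspci example above.
ddp_pkg_name b8-87-a3-ff-ff-ca-05-68   # prints ice-b887a3ffffca0568.pkg
```
+
+The length check is there because the file name must contain the complete
+serial number, including leading zeros.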
+
+
+Intel(R) Ethernet Flow Director
+-------------------------------
+The Intel Ethernet Flow Director performs the following tasks:
+
+- Directs receive packets according to their flows to different queues
+- Enables tight control on routing a flow in the platform
+- Matches flows and CPU cores for flow affinity
+
+NOTE: This driver supports the following flow types:
+
+- IPv4
+- TCPv4
+- UDPv4
+- SCTPv4
+- IPv6
+- TCPv6
+- UDPv6
+- SCTPv6
+
+Each flow type supports valid combinations of IP addresses (source or
+destination) and UDP/TCP/SCTP ports (source and destination). You can supply
+only a source IP address, a source IP address and a destination port, or any
+combination of one or more of these four parameters.
+
+NOTE: This driver allows you to filter traffic based on a user-defined flexible
+two-byte pattern and offset by using the ethtool user-def and mask fields. Only
+L3 and L4 flow types are supported for user-defined flexible filters. For a
+given flow type, you must clear all Intel Ethernet Flow Director filters before
+changing the input set (for that flow type).
+
+
+Flow Director Filters
+---------------------
+Flow Director filters are used to direct traffic that matches specified
+characteristics. They are enabled through ethtool's ntuple interface. To enable
+or disable the Intel Ethernet Flow Director and these filters::
+
+ # ethtool -K <ethX> ntuple <off|on>
+
+NOTE: When you disable ntuple filters, all the user programmed filters are
+flushed from the driver cache and hardware. All needed filters must be re-added
+when ntuple is re-enabled.
+
+To display all of the active filters::
+
+ # ethtool -u <ethX>
+
+To add a new filter::
+
+ # ethtool -U <ethX> flow-type <type> src-ip <ip> [m <ip_mask>] dst-ip <ip>
+ [m <ip_mask>] src-port <port> [m <port_mask>] dst-port <port> [m <port_mask>]
+ action <queue>
+
+ Where:
+ <ethX> - the Ethernet device to program
+ <type> - can be ip4, tcp4, udp4, sctp4, ip6, tcp6, udp6, sctp6
+ <ip> - the IP address to match on
+ <ip_mask> - the IPv4 address to mask on
+ NOTE: These filters use inverted masks.
+ <port> - the port number to match on
+ <port_mask> - the 16-bit integer for masking
+ NOTE: These filters use inverted masks.
+ <queue> - the queue to direct traffic toward (-1 discards the
+ matched traffic)
+
+To delete a filter::
+
+ # ethtool -U <ethX> delete <N>
+
+ Where <N> is the filter ID displayed when printing all the active filters,
+ and may also have been specified using "loc <N>" when adding the filter.
+
+EXAMPLES:
+
+To add a filter that directs packets to queue 2::
+
+ # ethtool -U <ethX> flow-type tcp4 src-ip 192.168.10.1 dst-ip \
+ 192.168.10.2 src-port 2000 dst-port 2001 action 2 [loc 1]
+
+To set a filter using only the source and destination IP address::
+
+ # ethtool -U <ethX> flow-type tcp4 src-ip 192.168.10.1 dst-ip \
+ 192.168.10.2 action 2 [loc 1]
+
+To set a filter based on a user-defined pattern and offset::
+
+ # ethtool -U <ethX> flow-type tcp4 src-ip 192.168.10.1 dst-ip \
+ 192.168.10.2 user-def 0x4FFFF action 2 [loc 1]
+
+ where the value of the user-def field contains the offset (4 bytes) and
+ the pattern (0xffff).
+
+To match TCP traffic sent from 192.168.0.1, port 5300, directed to 192.168.0.5,
+port 80, and then send it to queue 7::
+
+ # ethtool -U enp130s0 flow-type tcp4 src-ip 192.168.0.1 dst-ip 192.168.0.5
+ src-port 5300 dst-port 80 action 7
+
+To add a TCPv4 filter with a partial mask for a source IP subnet::
+
+ # ethtool -U <ethX> flow-type tcp4 src-ip 192.168.0.0 m 0.255.255.255 dst-ip
+ 192.168.5.12 src-port 12600 dst-port 31 action 12
+
+NOTES:
+
+For each flow-type, the programmed filters must all have the same matching
+input set. For example, issuing the following two commands is acceptable::
+
+ # ethtool -U enp130s0 flow-type ip4 src-ip 192.168.0.1 src-port 5300 action 7
+ # ethtool -U enp130s0 flow-type ip4 src-ip 192.168.0.5 src-port 55 action 10
+
+Issuing the next two commands, however, is not acceptable, since the first
+specifies src-ip and the second specifies dst-ip::
+
+ # ethtool -U enp130s0 flow-type ip4 src-ip 192.168.0.1 src-port 5300 action 7
+ # ethtool -U enp130s0 flow-type ip4 dst-ip 192.168.0.5 src-port 55 action 10
+
+The second command will fail with an error. You may program multiple filters
+with the same fields, using different values, but, on one device, you may not
+program two tcp4 filters with different matching fields.
+
+The ice driver does not support matching on a subportion of a field, thus
+partial mask fields are not supported.
+
+
+Flex Byte Flow Director Filters
+-------------------------------
+The driver also supports matching user-defined data within the packet payload.
+This flexible data is specified using the "user-def" field of the ethtool
+command in the following way:
+
+.. table::
+
+ ============================== ============================
+ ``31 28 24 20 16`` ``15 12 8 4 0``
+ ``offset into packet payload`` ``2 bytes of flexible data``
+ ============================== ============================
+
+For example,
+
+::
+
+ ... user-def 0x4FFFF ...
+
+tells the filter to look 4 bytes into the payload and match that value against
+0xFFFF. The offset is based on the beginning of the payload, and not the
+beginning of the packet. Thus
+
+::
+
+ flow-type tcp4 ... user-def 0x8BEAF ...
+
+would match TCP/IPv4 packets which have the value 0xBEAF 8 bytes into the
+TCP/IPv4 payload.
+
+Note that ICMP headers are parsed as 4 bytes of header and 4 bytes of payload.
+Thus to match the first byte of the payload, you must actually add 4 bytes to
+the offset. Also note that ip4 filters match both ICMP frames and raw
+(unknown) ip4 frames, where the payload will be the L3 payload of the IPv4
+frame.
+
+The maximum offset is 64. The hardware will only read up to 64 bytes of data
+from the payload. The offset must be even because the flexible data is 2 bytes
+long and must be aligned to byte 0 of the packet payload.
+
+The user-defined flexible offset is also considered part of the input set and
+cannot be programmed separately for multiple filters of the same type. However,
+the flexible data is not part of the input set and multiple filters may use the
+same offset but match against different data.
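+
+The layout of the ``user-def`` value can be checked with a few lines of shell
+arithmetic. This is a minimal sketch based only on the bit layout and the
+offset rules given above; ``flex_user_def`` is a hypothetical helper name and
+is not an ethtool feature:
+
```shell
#!/bin/sh
# flex_user_def is a hypothetical helper. It packs a payload offset
# (bits 31..16) and 2 bytes of flexible data (bits 15..0) into one value
# suitable for the ethtool "user-def" field.
flex_user_def() {
    offset=$(($1))
    pattern=$(($2))
    # The text above requires an even offset of at most 64 bytes.
    if [ "$offset" -lt 0 ] || [ "$offset" -gt 64 ] || [ $((offset % 2)) -ne 0 ]; then
        echo "offset must be even and between 0 and 64" >&2
        return 1
    fi
    # The flexible data is exactly 2 bytes wide.
    if [ "$pattern" -lt 0 ] || [ "$pattern" -gt 65535 ]; then
        echo "pattern must fit in 16 bits" >&2
        return 1
    fi
    printf '0x%X\n' $(( (offset << 16) | pattern ))
}

flex_user_def 4 0xFFFF   # prints 0x4FFFF, the first example above
flex_user_def 8 0xBEAF   # prints 0x8BEAF, the tcp4 example above
```
+
+Because the offset is part of the input set, filters of the same flow type
+would pass the same first argument and vary only the second.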
+
+
+RSS Hash Flow
+-------------
+Allows you to set the hash bytes per flow type and any combination of one or
+more options for Receive Side Scaling (RSS) hash byte configuration.
+
+::
+
+ # ethtool -N <ethX> rx-flow-hash <type> <option>
+
+ Where <type> is:
+ tcp4 signifying TCP over IPv4
+ udp4 signifying UDP over IPv4
+ gtpc4 signifying GTP-C over IPv4
+ gtpc4t signifying GTP-C (include TEID) over IPv4
+ gtpu4 signifying GTP-U over IPv4
+ gtpu4e signifying GTP-U and Extension Header over IPv4
+ gtpu4u signifying GTP-U PSC Uplink over IPv4
+ gtpu4d signifying GTP-U PSC Downlink over IPv4
+ tcp6 signifying TCP over IPv6
+ udp6 signifying UDP over IPv6
+ gtpc6 signifying GTP-C over IPv6
+ gtpc6t signifying GTP-C (include TEID) over IPv6
+ gtpu6 signifying GTP-U over IPv6
+ gtpu6e signifying GTP-U and Extension Header over IPv6
+ gtpu6u signifying GTP-U PSC Uplink over IPv6
+ gtpu6d signifying GTP-U PSC Downlink over IPv6
+ And <option> is one or more of:
+ s Hash on the IP source address of the Rx packet.
+ d Hash on the IP destination address of the Rx packet.
+ f Hash on bytes 0 and 1 of the Layer 4 header of the Rx packet.
+ n Hash on bytes 2 and 3 of the Layer 4 header of the Rx packet.
+ e Hash on the TEID (4 bytes) of the GTP Rx packet.
+
+
+Accelerated Receive Flow Steering (aRFS)
+----------------------------------------
+Devices based on the Intel(R) Ethernet Controller 800 Series support
+Accelerated Receive Flow Steering (aRFS) on the PF. aRFS is a load-balancing
+mechanism that allows you to direct packets to the same CPU where an
+application is running or consuming the packets in that flow.
+
+NOTES:
+
+- aRFS requires that ntuple filtering is enabled via ethtool.
+- aRFS support is limited to the following packet types:
+
+ - TCP over IPv4 and IPv6
+ - UDP over IPv4 and IPv6
+ - Nonfragmented packets
+
+- aRFS only supports Flow Director filters, which consist of the
+ source/destination IP addresses and source/destination ports.
+- aRFS and ethtool's ntuple interface both use the device's Flow Director. aRFS
+ and ntuple features can coexist, but you may encounter unexpected results if
+ there's a conflict between aRFS and ntuple requests. See "Intel(R) Ethernet
+ Flow Director" for additional information.
+
+To set up aRFS:
+
+1. Enable the Intel Ethernet Flow Director and ntuple filters using ethtool.
+
+::
+
+ # ethtool -K <ethX> ntuple on
+
+2. Set up the number of entries in the global flow table. For example:
+
+::
+
+ # NUM_RPS_ENTRIES=16384
+ # echo $NUM_RPS_ENTRIES > /proc/sys/net/core/rps_sock_flow_entries
+
+3. Set up the number of entries in the per-queue flow table. For example:
+
+::
+
+ # NUM_RX_QUEUES=64
+ # for file in /sys/class/net/$IFACE/queues/rx-*/rps_flow_cnt; do
+ # echo $(($NUM_RPS_ENTRIES/$NUM_RX_QUEUES)) > $file;
+ # done
+
+4. Disable the IRQ balance daemon (this is only a temporary stop of the service
+ until the next reboot).
+
+::
+
+ # systemctl stop irqbalance
+
+5. Configure the interrupt affinity.
+
+ See ``/Documentation/core-api/irq/irq-affinity.rst``
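+
+The arithmetic behind steps 2 and 3 can be sketched as follows. This is a
+minimal sketch, not part of the driver or of ethtool: ``count_rx_queues`` and
+``rps_flow_cnt_per_queue`` are hypothetical helper names, and the queues
+directory is passed in as an argument so that nothing is written to sysfs:
+
```shell
#!/bin/sh
# count_rx_queues is a hypothetical helper. $1 is a queues directory such
# as /sys/class/net/<ethX>/queues; the function prints how many rx-*
# entries it contains.
count_rx_queues() {
    set -- "$1"/rx-*
    # If the glob matched nothing, $1 is still the literal pattern.
    [ -e "$1" ] || { echo 0; return 0; }
    echo "$#"
}

# rps_flow_cnt_per_queue is a hypothetical helper that reproduces the
# division used in step 3: global entries / number of Rx queues.
rps_flow_cnt_per_queue() {
    entries=$1
    queues=$2
    if [ "$queues" -le 0 ]; then
        echo "number of Rx queues must be positive" >&2
        return 1
    fi
    echo $((entries / queues))
}

# Values from steps 2 and 3 above.
rps_flow_cnt_per_queue 16384 64   # prints 256
```
+
+Counting the ``rx-*`` entries instead of hard-coding ``NUM_RX_QUEUES`` keeps
+the per-queue value in step with the number of queues actually configured.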
+
+
+To disable aRFS using ethtool::
+
+ # ethtool -K <ethX> ntuple off
+
+NOTE: This command will disable ntuple filters and clear any aRFS filters in
+software and hardware.
+
+Example Use Case:
+
+1. Set the server application on the desired CPU (e.g., CPU 4).
+
+::
+
+ # taskset -c 4 netserver
+
+2. Use netperf to route traffic from the client to CPU 4 on the server with
+ aRFS configured. This example uses TCP over IPv4.
+
+::
+
+ # netperf -H <Host IPv4 Address> -t TCP_STREAM
+
+
+Enabling Virtual Functions (VFs)
+--------------------------------
+Use sysfs to enable virtual functions (VF).
+
+For example, you can create 4 VFs as follows::
+
+ # echo 4 > /sys/class/net/<ethX>/device/sriov_numvfs
+
+To disable VFs, write 0 to the same file::
+
+ # echo 0 > /sys/class/net/<ethX>/device/sriov_numvfs
+
+The maximum number of VFs for the ice driver is 256 total (all ports). To check
+how many VFs each PF supports, use the following command::
+
+ # cat /sys/class/net/<ethX>/device/sriov_totalvfs
+
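The checks above can be combined into one helper. This is a sketch under stated assumptions: ``set_numvfs`` is a hypothetical function name, and writing 0 before the new count assumes the PCI core refuses to change one nonzero VF count directly to another.

```shell
# Hypothetical helper: set the VF count for a device directory such as
# /sys/class/net/<ethX>/device, after checking the supported maximum.
set_numvfs() {
    devdir=$1   # directory containing sriov_totalvfs and sriov_numvfs
    want=$2     # requested number of VFs
    total=$(cat "$devdir/sriov_totalvfs") || return 1
    if [ "$want" -gt "$total" ]; then
        echo "requested $want VFs, but only $total are supported" >&2
        return 1
    fi
    # Assumption: disable existing VFs first, then request the new count.
    echo 0 > "$devdir/sriov_numvfs"
    echo "$want" > "$devdir/sriov_numvfs"
}
```

For example, ``set_numvfs /sys/class/net/<ethX>/device 4`` run as root would request four VFs.
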
+Note: You cannot use SR-IOV when link aggregation (LAG)/bonding is active, and
+vice versa. To enforce this, the driver checks for this mutual exclusion.
+
+
+Displaying VF Statistics on the PF
+----------------------------------
+Use the following command to display the statistics for the PF and its VFs::
+
+ # ip -s link show dev <ethX>
+
+NOTE: The output of this command can be very large due to the maximum number of
+possible VFs.
+
+The PF driver will display a subset of the statistics for the PF and for all
+VFs that are configured. The PF will always print a statistics block for each
+of the possible VFs, and it will show zero for all unconfigured VFs.
+
+
+Configuring VLAN Tagging on SR-IOV Enabled Adapter Ports
+--------------------------------------------------------
+To configure VLAN tagging for the ports on an SR-IOV enabled adapter, use the
+following command. The VLAN configuration should be done before the VF driver
+is loaded or the VM is booted. The VF is not aware of the VLAN tag being
+inserted on transmit and removed on received frames (sometimes called "port
+VLAN" mode).
+
+::
+
+ # ip link set dev <ethX> vf <id> vlan <vlan id>
+
+For example, the following will configure PF eth0 and the first VF on VLAN 10::
+
+ # ip link set dev eth0 vf 0 vlan 10
+
+
+Enabling a VF link if the port is disconnected
+----------------------------------------------
+If the physical function (PF) link is down, you can force link up (from the
+host PF) on any virtual functions (VF) bound to the PF.
+
+For example, to force link up on VF 0 bound to PF eth0::
+
+ # ip link set eth0 vf 0 state enable
+
+Note: If the command does not work, it may not be supported by your system.
+
+
+Setting the MAC Address for a VF
+--------------------------------
+To change the MAC address for the specified VF::
+
+ # ip link set <ethX> vf 0 mac <address>
+
+For example::
+
+ # ip link set <ethX> vf 0 mac 00:01:02:03:04:05
+
+This setting lasts until the PF is reloaded.
+
+NOTE: Assigning a MAC address for a VF from the host will disable any
+subsequent requests to change the MAC address from within the VM. This is a
+security feature. The VM is not aware of this restriction, so if this is
+attempted in the VM, it will trigger MDD events.
+
+
+Trusted VFs and VF Promiscuous Mode
+-----------------------------------
+This feature allows you to designate a particular VF as trusted and allows that
+trusted VF to request selective promiscuous mode on the Physical Function (PF).
+
+To set a VF as trusted or untrusted, enter the following command in the
+Hypervisor::
+
+ # ip link set dev <ethX> vf 1 trust [on|off]
+
+NOTE: It's important to set the VF to trusted before setting promiscuous mode.
+If the VF is not trusted, the PF will ignore promiscuous mode requests from the
+VF. If the VF becomes trusted after the VF driver is loaded, you must make a
+new request to set the VF to promiscuous.
+
+Once the VF is designated as trusted, use the following commands in the VM to
+set the VF to promiscuous mode.
+
+For promiscuous all::
+
+ # ip link set <ethX> promisc on
+ Where <ethX> is a VF interface in the VM
+
+For promiscuous Multicast::
+
+ # ip link set <ethX> allmulticast on
+ Where <ethX> is a VF interface in the VM
+
+NOTE: By default, the ethtool private flag vf-true-promisc-support is set to
+"off," meaning that promiscuous mode for the VF will be limited. To set the
+promiscuous mode for the VF to true promiscuous and allow the VF to see all
+ingress traffic, use the following command::
+
+ # ethtool --set-priv-flags <ethX> vf-true-promisc-support on
+
+The vf-true-promisc-support private flag does not enable promiscuous mode;
+rather, it designates which type of promiscuous mode (limited or true) you will
+get when you enable promiscuous mode using the ip link commands above. Note
+that this is a global setting that affects the entire device. However, the
+vf-true-promisc-support private flag is only exposed to the first PF of the
+device. The PF remains in limited promiscuous mode regardless of the
+vf-true-promisc-support setting.
+
+Next, add a VLAN interface on the VF interface. For example::
+
+ # ip link add link eth2 name eth2.100 type vlan id 100
+
+Note that the order in which you set the VF to promiscuous mode and add the
+VLAN interface does not matter (you can do either first). The result in this
+example is that the VF will get all traffic that is tagged with VLAN 100.
+
+
+Malicious Driver Detection (MDD) for VFs
+----------------------------------------
+Some Intel Ethernet devices use Malicious Driver Detection (MDD) to detect
+malicious traffic from the VF and disable Tx/Rx queues or drop the offending
+packet until a VF driver reset occurs. You can view MDD messages in the PF's
+system log using the dmesg command.
+
+- If the PF driver logs MDD events from the VF, confirm that the correct VF
+ driver is installed.
+- To restore functionality, you can manually reload the VF or VM or enable
+ automatic VF resets.
+- When automatic VF resets are enabled, the PF driver will immediately reset
+ the VF and reenable queues when it detects MDD events on the receive path.
+- If automatic VF resets are disabled, the PF will not automatically reset the
+ VF when it detects MDD events.
+
+To enable or disable automatic VF resets, use the following command::
+
+ # ethtool --set-priv-flags <ethX> mdd-auto-reset-vf on|off
+
+
+MAC and VLAN Anti-Spoofing Feature for VFs
+------------------------------------------
+When a malicious driver on a Virtual Function (VF) interface attempts to send a
+spoofed packet, it is dropped by the hardware and not transmitted.
+
+NOTE: This feature can be disabled for a specific VF::
+
+ # ip link set <ethX> vf <vf id> spoofchk {off|on}
+
+
+Jumbo Frames
+------------
+Jumbo Frames support is enabled by changing the Maximum Transmission Unit (MTU)
+to a value larger than the default value of 1500.
+
+Use the ifconfig command to increase the MTU size. For example, enter the
+following where <ethX> is the interface name::
+
+ # ifconfig <ethX> mtu 9000 up
+
+Alternatively, you can use the ip command as follows::
+
+ # ip link set mtu 9000 dev <ethX>
+ # ip link set up dev <ethX>
+
+This setting is not saved across reboots.
+
+
+NOTE: The maximum MTU setting for jumbo frames is 9702. This corresponds to the
+maximum jumbo frame size of 9728 bytes.
+
+NOTE: This driver will attempt to use multiple page sized buffers to receive
+each jumbo packet. This should help to avoid buffer starvation issues when
+allocating receive packets.
+
+NOTE: Packet loss may have a greater impact on throughput when you use jumbo
+frames. If you observe a drop in performance after enabling jumbo frames,
+enabling flow control may mitigate the issue.
+
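The 9702 and 9728 figures above differ by 26 bytes. The sketch below assumes that this gap is the Ethernet header, the FCS, and two VLAN tags. The breakdown and the helper name ``mtu_to_frame_size`` are illustrative assumptions, not something this document states.

```shell
# Assumed per-frame overhead for ice: 14-byte Ethernet header + 4-byte FCS
# + two 4-byte VLAN tags = 26 bytes. Only the 9702/9728 figures come from
# the text above; the breakdown is an assumption for illustration.
ICE_FRAME_OVERHEAD=26

# Hypothetical helper: on-wire frame size for a given MTU.
mtu_to_frame_size() {
    echo $(($1 + ICE_FRAME_OVERHEAD))
}

mtu_to_frame_size 9702   # maximum MTU
mtu_to_frame_size 1500   # default MTU
```

Under that assumption, the maximum MTU of 9702 maps to the 9728-byte maximum frame size.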
+
+Speed and Duplex Configuration
+------------------------------
+In addressing speed and duplex configuration issues, you need to distinguish
+between copper-based adapters and fiber-based adapters.
+
+In the default mode, an Intel(R) Ethernet Network Adapter using copper
+connections will attempt to auto-negotiate with its link partner to determine
+the best setting. If the adapter cannot establish link with the link partner
+using auto-negotiation, you may need to manually configure the adapter and link
+partner to identical settings to establish link and pass packets. This should
+only be needed when attempting to link with an older switch that does not
+support auto-negotiation or one that has been forced to a specific speed or
+duplex mode. Your link partner must match the setting you choose. 1 Gbps speeds
+and higher cannot be forced. Use the autonegotiation advertising setting to
+manually set devices for 1 Gbps and higher.
+
+Speed, duplex, and autonegotiation advertising are configured through the
+ethtool utility. For the latest version, download and install ethtool from the
+following website:
+
+ https://kernel.org/pub/software/network/ethtool/
+
+To see the speed configurations your device supports, run the following::
+
+ # ethtool <ethX>
+
+Caution: Only experienced network administrators should force speed and duplex
+or change autonegotiation advertising manually. The settings at the switch must
+always match the adapter settings. Adapter performance may suffer or your
+adapter may not operate if you configure the adapter differently from your
+switch.
+
+
+Data Center Bridging (DCB)
+--------------------------
+NOTE: The kernel assumes that TC0 is available, and will disable Priority Flow
+Control (PFC) on the device if TC0 is not available. To fix this, ensure TC0 is
+enabled when setting up DCB on your switch.
+
+DCB is a configurable Quality of Service implementation in hardware. It uses
+the VLAN priority tag (802.1p) to filter traffic, which means that traffic can
+be sorted into 8 different priorities. It also enables priority flow control
+(802.1Qbb), which can reduce or eliminate dropped packets during network
+stress. Bandwidth can be allocated to each of these priorities, and the
+allocation is enforced at the hardware level (802.1Qaz).
+
+DCB is normally configured on the network using the DCBX protocol (802.1Qaz), a
+specialization of LLDP (802.1AB). The ice driver supports the following
+mutually exclusive variants of DCBX support:
+
+1) Firmware-based LLDP Agent
+2) Software-based LLDP Agent
+
+In firmware-based mode, firmware intercepts all LLDP traffic and handles DCBX
+negotiation transparently for the user. In this mode, the adapter operates in
+"willing" DCBX mode, receiving DCB settings from the link partner (typically a
+switch). The local user can only query the negotiated DCB configuration. For
+information on configuring DCBX parameters on a switch, please consult the
+switch manufacturer's documentation.
+
+In software-based mode, LLDP traffic is forwarded to the network stack and user
+space, where a software agent can handle it. In this mode, the adapter can
+operate in either "willing" or "nonwilling" DCBX mode and DCB configuration can
+be both queried and set locally. This mode requires the FW-based LLDP Agent to
+be disabled.
+
+NOTE:
+
+- You can enable and disable the firmware-based LLDP Agent using an ethtool
+ private flag. Refer to the "FW-LLDP (Firmware Link Layer Discovery Protocol)"
+ section in this README for more information.
+- In software-based DCBX mode, you can configure DCB parameters using software
+ LLDP/DCBX agents that interface with the Linux kernel's DCB Netlink API. We
+ recommend using OpenLLDP as the DCBX agent when running in software mode. For
+ more information, see the OpenLLDP man pages and
+ https://github.com/intel/openlldp.
+- The driver implements the DCB netlink interface layer to allow the user space
+ to communicate with the driver and query DCB configuration for the port.
+- iSCSI with DCB is not supported.
+
+
+FW-LLDP (Firmware Link Layer Discovery Protocol)
+------------------------------------------------
+Use ethtool to change FW-LLDP settings. The FW-LLDP setting is per port and
+persists across boots.
+
+To enable LLDP::
+
+ # ethtool --set-priv-flags <ethX> fw-lldp-agent on
+
+To disable LLDP::
+
+ # ethtool --set-priv-flags <ethX> fw-lldp-agent off
+
+To check the current LLDP setting::
+
+ # ethtool --show-priv-flags <ethX>
+
+NOTE: You must enable the UEFI HII "LLDP Agent" attribute for this setting to
+take effect. If "LLDP Agent" is set to disabled, you cannot enable it from the
+OS.
+
+
+Flow Control
+------------
+Ethernet Flow Control (IEEE 802.3x) can be configured with ethtool to enable
+receiving and transmitting pause frames for ice. When transmit is enabled,
+pause frames are generated when the receive packet buffer crosses a predefined
+threshold. When receive is enabled, the transmit unit will halt for the time
+delay specified when a pause frame is received.
+
+NOTE: You must have a flow control capable link partner.
+
+Flow Control is disabled by default.
+
+Use ethtool to change the flow control settings.
+
+To enable or disable Rx or Tx Flow Control::
+
+ # ethtool -A <ethX> rx <on|off> tx <on|off>
+
+Note: This command only enables or disables Flow Control if auto-negotiation is
+disabled. If auto-negotiation is enabled, this command changes the parameters
+used for auto-negotiation with the link partner.
+
+Note: Flow Control auto-negotiation is part of link auto-negotiation. Depending
+on your device, you may not be able to change the auto-negotiation setting.
+
+NOTE:
+
+- The ice driver requires flow control on both the port and link partner. If
+ flow control is disabled on either side, the port may appear to hang under
+ heavy traffic.
+- You may encounter issues with link-level flow control (LFC) after disabling
+ DCB. The LFC status may show as enabled but traffic is not paused. To resolve
+ this issue, disable and reenable LFC using ethtool::
+
+ # ethtool -A <ethX> rx off tx off
+ # ethtool -A <ethX> rx on tx on
+
+
+NAPI
+----
+
+This driver supports NAPI (Rx polling mode).
+
+See :ref:`Documentation/networking/napi.rst <napi>` for more information.
+
+MACVLAN
+-------
+This driver supports MACVLAN. Kernel support for MACVLAN can be tested by
+checking if the MACVLAN driver is loaded. You can run 'lsmod | grep macvlan' to
+see if the MACVLAN driver is loaded or run 'modprobe macvlan' to try to load
+the MACVLAN driver.
+
+NOTE:
+
+- In passthru mode, you can only set up one MACVLAN device. It will inherit the
+ MAC address of the underlying PF (Physical Function) device.
+
+
+IEEE 802.1ad (QinQ) Support
+---------------------------
+The IEEE 802.1ad standard, informally known as QinQ, allows for multiple VLAN
+IDs within a single Ethernet frame. VLAN IDs are sometimes referred to as
+"tags," and multiple VLAN IDs are thus referred to as a "tag stack." Tag stacks
+allow L2 tunneling and the ability to segregate traffic within a particular
+VLAN ID, among other uses.
+
+NOTES:
+
+- Receive checksum offloads and VLAN acceleration are not supported for 802.1ad
+ (QinQ) packets.
+
+- 0x88A8 traffic will not be received unless VLAN stripping is disabled with
+ the following command::
+
+ # ethtool -K <ethX> rxvlan off
+
+- 0x88A8/0x8100 double VLANs cannot be used with 0x8100 or 0x8100/0x8100 VLANs
+ configured on the same port. 0x88A8/0x8100 traffic will not be received if
+ 0x8100 VLANs are configured.
+
+- The VF can only transmit 0x88A8/0x8100 (i.e., 802.1ad/802.1Q) traffic if:
+
+ 1) The VF is not assigned a port VLAN.
+ 2) spoofchk is disabled from the PF. If you enable spoofchk, the VF will
+ not transmit 0x88A8/0x8100 traffic.
+
+- The VF may not receive all network traffic based on the Inner VLAN header
+ when VF true promiscuous mode (vf-true-promisc-support) and double VLANs are
+ enabled in SR-IOV mode.
+
+The following are examples of how to configure 802.1ad (QinQ)::
+
+ # ip link add link eth0 eth0.24 type vlan proto 802.1ad id 24
+ # ip link add link eth0.24 eth0.24.371 type vlan proto 802.1Q id 371
+
+ Where "24" and "371" are example VLAN IDs.
+
+
+Tunnel/Overlay Stateless Offloads
+---------------------------------
+Supported tunnels and overlays include VXLAN, GENEVE, and others depending on
+hardware and software configuration. Stateless offloads are enabled by default.
+
+To view the current state of all offloads::
+
+ # ethtool -k <ethX>
+
+
+UDP Segmentation Offload
+------------------------
+UDP Segmentation Offload allows the adapter to offload transmit segmentation of
+UDP packets with payloads up to 64K into valid Ethernet frames. Because the
+adapter hardware can complete data segmentation much faster than operating
+system software, this feature may improve transmission performance.
+In addition, the adapter may use fewer CPU resources.
+
+NOTE:
+
+- The application sending UDP packets must support UDP segmentation offload.
+
+To enable/disable UDP Segmentation Offload, issue the following command::
+
+ # ethtool -K <ethX> tx-udp-segmentation [off|on]
+
+
+GNSS module
+-----------
+The GNSS module requires a kernel compiled with CONFIG_GNSS=y or CONFIG_GNSS=m.
+It allows the user to read messages from the GNSS hardware module and write
+supported commands. If the module is physically present, a GNSS device is
+spawned: ``/dev/gnss<id>``.
+The protocol of the write commands depends on the GNSS hardware module, because
+the driver passes the raw bytes written to the GNSS device on to the receiver
+over i2c. Please refer to the hardware GNSS module documentation for
+configuration details.
+
+
+Firmware (FW) logging
+---------------------
+The driver supports FW logging via the debugfs interface on PF 0 only. The FW
+running on the NIC must support FW logging; if it does not, the 'fwlog' file
+is not created in the ice debugfs directory.
+
+Module configuration
+~~~~~~~~~~~~~~~~~~~~
+Firmware logging is configured on a per module basis. Each module can be set to
+a value independent of the other modules (unless the module 'all' is specified).
+The modules will be instantiated under the 'fwlog/modules' directory.
+
+The user can set the log level for a module by writing to the module file like
+this::
+
+ # echo <log_level> > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/modules/<module>
+
+where
+
+* log_level is a name as described below. Each level includes the
+ messages from the previous/lower level
+
+ * none
+ * error
+ * warning
+ * normal
+ * verbose
+
+* module is a name that represents the module to receive events for. The
+ module names are
+
+ * general
+ * ctrl
+ * link
+ * link_topo
+ * dnl
+ * i2c
+ * sdp
+ * mdio
+ * adminq
+ * hdma
+ * lldp
+ * dcbx
+ * dcb
+ * xlr
+ * nvm
+ * auth
+ * vpd
+ * iosf
+ * parser
+ * sw
+ * scheduler
+ * txq
+ * rsvd
+ * post
+ * watchdog
+ * task_dispatch
+ * mng
+ * synce
+ * health
+ * tsdrv
+ * pfreg
+ * mdlver
+ * all
+
+The name 'all' is special and allows the user to set all of the modules to the
+specified log_level or to read the log_level of all of the modules.
+
+Example usage to configure the modules
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To set a single module to 'verbose'::
+
+ # echo verbose > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/modules/link
+
+To set multiple modules, issue the command once per module::
+
+ # echo verbose > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/modules/link
+ # echo warning > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/modules/ctrl
+ # echo none > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/modules/dcb
+
+To set all the modules to the same value::
+
+ # echo normal > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/modules/all
+
+To read the log_level of a specific module (e.g. module 'general')::
+
+ # cat /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/modules/general
+
+To read the log_level of all the modules::
+
+ # cat /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/modules/all
+
+Enabling FW log
+~~~~~~~~~~~~~~~
+Configuring the modules indicates to the FW that the configured modules should
+generate events that the driver is interested in, but it **does not** send the
+events to the driver until the enable message is sent to the FW. To do this
+the user can write a 1 (enable) or 0 (disable) to 'fwlog/enable'. An example
+is::
+
+ # echo 1 > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/enable
+
+Retrieving FW log data
+~~~~~~~~~~~~~~~~~~~~~~
+The FW log data can be retrieved by reading from 'fwlog/data'. The user can
+write any value to 'fwlog/data' to clear the data. The data can only be cleared
+when FW logging is disabled. The FW log data is a binary file that is sent to
+Intel and used to help debug user issues.
+
+An example to read the data is::
+
+ # cat /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/data > fwlog.bin
+
+An example to clear the data is::
+
+ # echo 0 > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/data
+
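The ordering constraints above (configure, enable, disable before clearing) can be laid out as one capture pass. The sketch below only prints the commands rather than running them. ``fwlog_capture_plan`` and ``FWLOG_DIR`` are hypothetical names, and the PCI address is the example one used in this document, written unescaped inside quotes.

```shell
# Example debugfs path from this document; adjust the PCI address as needed.
FWLOG_DIR="${FWLOG_DIR:-/sys/kernel/debug/ice/0000:18:00.0/fwlog}"

# Hypothetical helper: print (do not execute) the commands for one capture.
fwlog_capture_plan() {
    module=$1    # e.g. link
    level=$2     # e.g. verbose
    outfile=$3   # where to save the binary log
    echo "echo $level > $FWLOG_DIR/modules/$module"
    echo "echo 1 > $FWLOG_DIR/enable"
    echo "# ... reproduce the issue ..."
    echo "echo 0 > $FWLOG_DIR/enable"
    echo "cat $FWLOG_DIR/data > $outfile"
    echo "echo 0 > $FWLOG_DIR/data"
}

fwlog_capture_plan link verbose fwlog.bin
```

Logging is disabled before the data is read and cleared, because the data can only be cleared while FW logging is off.
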
+Changing how often the log events are sent to the driver
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The driver receives FW log data from the Admin Receive Queue (ARQ). How often
+the FW sends ARQ events can be configured by writing to 'fwlog/nr_messages'.
+The range is 1-128 (1 means push every log message, 128 means push only when
+the max AQ command buffer is full). The suggested value is 10. The user can
+see the currently configured value by reading 'fwlog/nr_messages'. An example
+to set the value is::
+
+ # echo 50 > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/nr_messages
+
+Configuring the amount of memory used to store FW log data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The driver stores FW log data within the driver. The default size of the memory
+used to store the data is 1MB. Some use cases may require more or less data, so
+the user can change the amount of memory that is allocated for FW log data.
+To change the amount of memory, write to 'fwlog/log_size'. The value must be
+one of: 128K, 256K, 512K, 1M, or 2M. FW logging must be disabled to change the
+value. An example of changing the value is::
+
+ # echo 128K > /sys/kernel/debug/ice/0000\:18\:00.0/fwlog/log_size
+
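Both 'fwlog/nr_messages' and 'fwlog/log_size' accept only the values listed above, so a script can check its inputs before writing to debugfs. A minimal sketch, with hypothetical helper names:

```shell
# Hypothetical helper: succeed only for the sizes accepted by fwlog/log_size.
valid_log_size() {
    case "$1" in
        128K|256K|512K|1M|2M) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: succeed only for integers in the range 1-128,
# the range accepted by fwlog/nr_messages.
valid_nr_messages() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -ge 1 ] && [ "$1" -le 128 ]
}

valid_log_size 128K && echo "128K is an accepted log size"
valid_nr_messages 50 && echo "50 is an accepted nr_messages value"
```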
+
+Performance Optimization
+========================
+Driver defaults are meant to fit a wide variety of workloads, but if further
+optimization is required, we recommend experimenting with the following
+settings.
+
+
+Rx Descriptor Ring Size
+-----------------------
+To reduce the number of Rx packet discards, increase the number of Rx
+descriptors for each Rx ring using ethtool.
+
+ Check if the interface is dropping Rx packets due to buffers being full
+ (rx_dropped.nic can mean that there is no PCIe bandwidth)::
+
+ # ethtool -S <ethX> | grep "rx_dropped"
+
+ If the previous command shows drops on queues, it may help to increase
+ the number of descriptors using 'ethtool -G'::
+
+ # ethtool -G <ethX> rx <N>
+ Where <N> is the desired number of ring entries/descriptors
+
+ This can provide temporary buffering for issues that create latency while
+ the CPUs process descriptors.
+
+
+Interrupt Rate Limiting
+-----------------------
+This driver supports an adaptive interrupt throttle rate (ITR) mechanism that
+is tuned for general workloads. The user can customize the interrupt rate
+control for specific workloads, via ethtool, adjusting the number of
+microseconds between interrupts.
+
+To set the interrupt rate manually, you must disable adaptive mode::
+
+ # ethtool -C <ethX> adaptive-rx off adaptive-tx off
+
+For lower CPU utilization:
+
+ Disable adaptive ITR and lower Rx and Tx interrupts. The examples below
+ affect every queue of the specified interface.
+
+ Setting rx-usecs and tx-usecs to 80 will limit interrupts to about
+ 12,500 interrupts per second per queue::
+
+ # ethtool -C <ethX> adaptive-rx off adaptive-tx off rx-usecs 80 tx-usecs 80
+
+For reduced latency:
+
+ Disable adaptive ITR and ITR by setting rx-usecs and tx-usecs to 0
+ using ethtool::
+
+ # ethtool -C <ethX> adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0
+
+Per-queue interrupt rate settings:
+
+ The following examples are for queues 1 and 3, but you can adjust other
+ queues.
+
+ To disable Rx adaptive ITR and set static Rx ITR to 10 microseconds or
+ about 100,000 interrupts/second, for queues 1 and 3::
+
+ # ethtool --per-queue <ethX> queue_mask 0xa --coalesce adaptive-rx off \
+ rx-usecs 10
+
+ To show the current coalesce settings for queues 1 and 3::
+
+ # ethtool --per-queue <ethX> queue_mask 0xa --show-coalesce
+
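The ``queue_mask`` value is a bitmap with one bit per queue, which is why queues 1 and 3 give 0xa (binary 1010). A small sketch for building the mask from a list of queue numbers; ``queue_mask`` here is a hypothetical shell helper, not an ethtool feature.

```shell
# Hypothetical helper: build an ethtool queue_mask from queue indexes.
queue_mask() {
    mask=0
    for q in "$@"; do
        mask=$((mask | (1 << q)))   # set bit q for queue q
    done
    printf '0x%x\n' "$mask"
}

queue_mask 1 3     # queues 1 and 3
queue_mask 0 1 2   # queues 0 through 2
```

The first call yields 0xa, matching the examples above. The second yields 0x7.
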
+Bounding interrupt rates using rx-usecs-high:
+
+ :Valid Range: 0-236 (0=no limit)
+
+ The range of 0-236 microseconds provides an effective range of 4,237 to
+ 250,000 interrupts per second. The value of rx-usecs-high can be set
+ independently of rx-usecs and tx-usecs in the same ethtool command, and is
+ also independent of the adaptive interrupt moderation algorithm. The
+ underlying hardware supports granularity in 4-microsecond intervals, so
+ adjacent values may result in the same interrupt rate.
+
+ The following command would disable adaptive interrupt moderation, and allow
+ a maximum of 5 microseconds before indicating a receive or transmit was
+ complete. However, instead of resulting in as many as 200,000 interrupts per
+ second, it limits total interrupts per second to 50,000 via the rx-usecs-high
+ parameter.
+
+ ::
+
+ # ethtool -C <ethX> adaptive-rx off adaptive-tx off rx-usecs-high 20 \
+ rx-usecs 5 tx-usecs 5
+
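The interrupt rates quoted in this section follow from dividing one second (1,000,000 microseconds) by the configured interval. A short sketch of that conversion, with ``usecs_to_rate`` as a hypothetical helper name:

```shell
# Hypothetical helper: convert an interval in microseconds to the
# approximate maximum number of interrupts per second (integer division).
usecs_to_rate() {
    usecs=$1
    if [ "$usecs" -eq 0 ]; then
        echo "unlimited"            # 0 disables the limit
    else
        echo $((1000000 / usecs))
    fi
}

for u in 80 10 20 5 236; do
    printf '%s usecs -> %s interrupts/s\n' "$u" "$(usecs_to_rate "$u")"
done
```

This reproduces the figures above: 80 gives 12,500, 10 gives 100,000, 20 gives 50,000, 5 gives 200,000, and 236 gives about 4,237.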
+
+Virtualized Environments
+------------------------
+In addition to the other suggestions in this section, the following may be
+helpful to optimize performance in VMs.
+
+ Using the appropriate mechanism (vcpupin) in the VM, pin the CPUs to
+ individual LCPUs, making sure to use a set of CPUs included in the
+ device's local_cpulist: ``/sys/class/net/<ethX>/device/local_cpulist``.
+
+ Configure as many Rx/Tx queues in the VM as available. (See the iavf driver
+ documentation for the number of queues supported.) For example::
+
+ # ethtool -L <virt_interface> rx <max> tx <max>
+
+
+Support
+=======
+For general information, go to the Intel support website at:
+https://www.intel.com/support/
+
+If an issue is identified with the released source code on a supported kernel
+with a supported adapter, email the specific information related to the issue
+to intel-wired-lan@lists.osuosl.org.
+
+
+Trademarks
+==========
+Intel is a trademark or registered trademark of Intel Corporation or its
+subsidiaries in the United States and/or other countries.
+
+* Other names and brands may be claimed as the property of others.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/idpf.rst b/Documentation/networking/device_drivers/ethernet/intel/idpf.rst
new file mode 100644
index 000000000000..adb16e2abd21
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/intel/idpf.rst
@@ -0,0 +1,160 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+==========================================================================
+idpf Linux* Base Driver for the Intel(R) Infrastructure Data Path Function
+==========================================================================
+
+Intel idpf Linux driver.
+Copyright(C) 2023 Intel Corporation.
+
+.. contents::
+
+The idpf driver serves as both the Physical Function (PF) and Virtual Function
+(VF) driver for the Intel(R) Infrastructure Data Path Function.
+
+Driver information can be obtained using ethtool, lspci, and ip.
+
+For questions related to hardware requirements, refer to the documentation
+supplied with your Intel adapter. All hardware requirements listed apply to use
+with Linux.
+
+
+Identifying Your Adapter
+========================
+For information on how to identify your adapter, and for the latest Intel
+network drivers, refer to the Intel Support website:
+http://www.intel.com/support
+
+
+Additional Features and Configurations
+======================================
+
+ethtool
+-------
+The driver utilizes the ethtool interface for driver configuration and
+diagnostics, as well as displaying statistical information. The latest ethtool
+version is required for this functionality. If you don't have one yet, you can
+obtain it at:
+https://kernel.org/pub/software/network/ethtool/
+
+
+Viewing Link Messages
+---------------------
+Link messages will not be displayed to the console if the distribution is
+restricting system messages. In order to see network driver link messages on
+your console, set the console log level to eight by entering the following::
+
+ # dmesg -n 8
+
+.. note::
+ This setting is not saved across reboots.
+
+
+Jumbo Frames
+------------
+Jumbo Frames support is enabled by changing the Maximum Transmission Unit (MTU)
+to a value larger than the default value of 1500.
+
+Use the ip command to increase the MTU size. For example, enter the following
+where <ethX> is the interface name::
+
+ # ip link set mtu 9000 dev <ethX>
+ # ip link set up dev <ethX>
+
+.. note::
+ The maximum MTU setting for jumbo frames is 9706. This corresponds to the
+ maximum jumbo frame size of 9728 bytes.
+
+.. note::
+ This driver will attempt to use multiple page sized buffers to receive
+ each jumbo packet. This should help to avoid buffer starvation issues when
+ allocating receive packets.
+
+.. note::
+ Packet loss may have a greater impact on throughput when you use jumbo
+ frames. If you observe a drop in performance after enabling jumbo frames,
+ enabling flow control may mitigate the issue.
+
+
+Performance Optimization
+========================
+Driver defaults are meant to fit a wide variety of workloads, but if further
+optimization is required, we recommend experimenting with the following
+settings.
+
+
+Interrupt Rate Limiting
+-----------------------
+This driver supports an adaptive interrupt throttle rate (ITR) mechanism that
+is tuned for general workloads. The user can customize the interrupt rate
+control for specific workloads, via ethtool, adjusting the number of
+microseconds between interrupts.
+
+To set the interrupt rate manually, you must disable adaptive mode::
+
+ # ethtool -C <ethX> adaptive-rx off adaptive-tx off
+
+For lower CPU utilization:
+ - Disable adaptive ITR and lower Rx and Tx interrupts. The examples below
+ affect every queue of the specified interface.
+
+ - Setting rx-usecs and tx-usecs to 80 will limit interrupts to about
+ 12,500 interrupts per second per queue::
+
+ # ethtool -C <ethX> adaptive-rx off adaptive-tx off rx-usecs 80 \
+ tx-usecs 80
+
+For reduced latency:
+ - Disable adaptive ITR and ITR by setting rx-usecs and tx-usecs to 0
+ using ethtool::
+
+ # ethtool -C <ethX> adaptive-rx off adaptive-tx off rx-usecs 0 \
+ tx-usecs 0
+
+Per-queue interrupt rate settings:
+ - The following examples are for queues 1 and 3, but you can adjust other
+ queues.
+
+ - To disable Rx adaptive ITR and set static Rx ITR to 10 microseconds or
+ about 100,000 interrupts/second, for queues 1 and 3::
+
+ # ethtool --per-queue <ethX> queue_mask 0xa --coalesce adaptive-rx off \
+ rx-usecs 10
+
+ - To show the current coalesce settings for queues 1 and 3::
+
+ # ethtool --per-queue <ethX> queue_mask 0xa --show-coalesce
+
+
+
+Virtualized Environments
+------------------------
+In addition to the other suggestions in this section, the following may be
+helpful to optimize performance in VMs.
+
+ - Using the appropriate mechanism (vcpupin) in the VM, pin the CPUs to
+ individual LCPUs, making sure to use a set of CPUs included in the
+ device's local_cpulist: /sys/class/net/<ethX>/device/local_cpulist.
+
+ - Configure as many Rx/Tx queues in the VM as available. (See the idpf driver
+ documentation for the number of queues supported.) For example::
+
+ # ethtool -L <virt_interface> rx <max> tx <max>
+
+
+Support
+=======
+For general information, go to the Intel support website at:
+http://www.intel.com/support/
+
+If an issue is identified with the released source code on a supported kernel
+with a supported adapter, email the specific information related to the issue
+to intel-wired-lan@lists.osuosl.org.
+
+
+Trademarks
+==========
+Intel is a trademark or registered trademark of Intel Corporation or its
+subsidiaries in the United States and/or other countries.
+
+* Other names and brands may be claimed as the property of others.
diff --git a/Documentation/networking/device_drivers/intel/igb.rst b/Documentation/networking/device_drivers/ethernet/intel/igb.rst
index 87e560fe5eaa..fbd590b6a0d6 100644
--- a/Documentation/networking/device_drivers/intel/igb.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/igb.rst
@@ -20,7 +20,7 @@ Identifying Your Adapter
========================
For information on how to identify your adapter, and for the latest Intel
network drivers, refer to the Intel Support website:
-http://www.intel.com/support
+https://www.intel.com/support
Command Line Parameters
@@ -201,13 +201,8 @@ NOTE: This feature is exclusive to i210 models.
Support
=======
For general information, go to the Intel support website at:
-
https://www.intel.com/support/
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
If an issue is identified with the released source code on a supported kernel
with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/intel/igbvf.rst b/Documentation/networking/device_drivers/ethernet/intel/igbvf.rst
index 557fc020ef31..11a9017f3069 100644
--- a/Documentation/networking/device_drivers/intel/igbvf.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/igbvf.rst
@@ -35,7 +35,7 @@ Identifying Your Adapter
========================
For information on how to identify your adapter, and for the latest Intel
network drivers, refer to the Intel Support website:
-http://www.intel.com/support
+https://www.intel.com/support
Additional Features and Configurations
@@ -53,13 +53,8 @@ https://www.kernel.org/pub/software/network/ethtool/
Support
=======
For general information, go to the Intel support website at:
-
https://www.intel.com/support/
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
If an issue is identified with the released source code on a supported kernel
with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/intel/ixgbe.rst b/Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst
index f1d5233e5e51..1e5f16993f69 100644
--- a/Documentation/networking/device_drivers/intel/ixgbe.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst
@@ -440,6 +440,22 @@ NOTE: For 82599-based network connections, if you are enabling jumbo frames in
a virtual function (VF), jumbo frames must first be enabled in the physical
function (PF). The VF MTU setting cannot be larger than the PF MTU.
+NBASE-T Support
+---------------
+The ixgbe driver supports NBASE-T on some devices. However, the advertisement
+of NBASE-T speeds is suppressed by default, to accommodate broken network
+switches which cannot cope with advertised NBASE-T speeds. Use the ethtool
+command to enable advertising NBASE-T speeds on devices which support it::
+
+ ethtool -s eth? advertise 0x1800000001028
+
+On Linux systems with INTERFACES(5), this can be specified as a pre-up command
+in /etc/network/interfaces so that the interface is always brought up with
+NBASE-T support, e.g.::
+
+ iface eth? inet dhcp
+ pre-up ethtool -s eth? advertise 0x1800000001028 || true
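+
+The advertise value is a bitmask of link modes. The sketch below decodes it,
+assuming the bit positions of the ethtool link-mode numbering (bit 3 for
+100baseT/Full, bit 5 for 1000baseT/Full, bit 12 for 10000baseT/Full, bit 47 for
+2500baseT/Full and bit 48 for 5000baseT/Full). The helper name is hypothetical,
+and the table should be checked against the ethtool man page for the version in
+use.

```python
# Assumed subset of the ethtool link-mode bit numbering; not taken from this document.
LINK_MODES = {
    3: "100baseT/Full",
    5: "1000baseT/Full",
    12: "10000baseT/Full",
    47: "2500baseT/Full",
    48: "5000baseT/Full",
}


def decode_advertise(mask):
    """Return the names of the link modes whose bits are set in mask."""
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            names.append(LINK_MODES.get(bit, f"unknown bit {bit}"))
        bit += 1
    return names


for name in decode_advertise(0x1800000001028):
    print(name)
```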
+
Generic Receive Offload, aka GRO
--------------------------------
The driver supports the in-kernel software implementation of GRO. GRO has
@@ -529,13 +545,8 @@ on the Intel Ethernet Controller XL710.
Support
=======
For general information, go to the Intel support website at:
-
https://www.intel.com/support/
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
If an issue is identified with the released source code on a supported kernel
with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/intel/ixgbevf.rst b/Documentation/networking/device_drivers/ethernet/intel/ixgbevf.rst
index 76bbde736f21..08dc0d368a48 100644
--- a/Documentation/networking/device_drivers/intel/ixgbevf.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/ixgbevf.rst
@@ -55,13 +55,8 @@ VLANs: There is a limit of a total of 64 shared VLANs to 1 or more VFs.
Support
=======
For general information, go to the Intel support website at:
-
https://www.intel.com/support/
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
If an issue is identified with the released source code on a supported kernel
with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst b/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst
new file mode 100644
index 000000000000..c96d262b30be
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst
@@ -0,0 +1,41 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+====================================================================
+Linux kernel networking driver for Marvell's Octeon PCI Endpoint NIC
+====================================================================
+
+Network driver for Marvell's Octeon PCI EndPoint NIC.
+Copyright (c) 2020 Marvell International Ltd.
+
+Contents
+========
+
+- `Overview`_
+- `Supported Devices`_
+- `Interface Control`_
+
+Overview
+========
+This driver implements networking functionality of Marvell's Octeon PCI
+EndPoint NIC.
+
+Supported Devices
+=================
+Currently, this driver supports the following devices:
+ * Network controller: Cavium, Inc. Device b100
+ * Network controller: Cavium, Inc. Device b200
+ * Network controller: Cavium, Inc. Device b400
+ * Network controller: Cavium, Inc. Device b900
+ * Network controller: Cavium, Inc. Device ba00
+ * Network controller: Cavium, Inc. Device bc00
+ * Network controller: Cavium, Inc. Device bd00
+
+Interface Control
+=================
+Network interface control operations, such as changing the MTU or link speed and
+bringing the link down or up, are performed by writing commands to the mailbox
+command queue, a mailbox interface implemented through a reserved region in BAR4.
+The driver writes the commands into the mailbox and the firmware on the
+Octeon device processes them. The firmware also sends unsolicited notifications
+to the driver for events such as link changes, through a notification queue
+implemented as part of the mailbox interface.
diff --git a/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep_vf.rst b/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep_vf.rst
new file mode 100644
index 000000000000..603133d0b92f
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep_vf.rst
@@ -0,0 +1,24 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=======================================================================
+Linux kernel networking driver for Marvell's Octeon PCI Endpoint NIC VF
+=======================================================================
+
+Network driver for Marvell's Octeon PCI EndPoint NIC VF.
+Copyright (c) 2020 Marvell International Ltd.
+
+Overview
+========
+This driver implements networking functionality of Marvell's Octeon PCI
+EndPoint NIC VF.
+
+Supported Devices
+=================
+Currently, this driver supports the following devices:
+ * Network controller: Cavium, Inc. Device b203
+ * Network controller: Cavium, Inc. Device b403
+ * Network controller: Cavium, Inc. Device b103
+ * Network controller: Cavium, Inc. Device b903
+ * Network controller: Cavium, Inc. Device ba03
+ * Network controller: Cavium, Inc. Device bc03
+ * Network controller: Cavium, Inc. Device bd03
diff --git a/Documentation/networking/device_drivers/marvell/octeontx2.rst b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst
index 88f508338c5f..1e196cb9ce25 100644
--- a/Documentation/networking/device_drivers/marvell/octeontx2.rst
+++ b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst
@@ -12,6 +12,8 @@ Contents
- `Overview`_
- `Drivers`_
- `Basic packet flow`_
+- `Devlink health reporters`_
+- `Quality of service`_
Overview
========
@@ -126,7 +128,7 @@ Type1:
Type2:
- RVU PF0 ie admin function creates these VFs and maps them to loopback block's channels.
- A set of two VFs (VF0 & VF1, VF2 & VF3 .. so on) works as a pair ie pkts sent out of
- VF0 will be received by VF1 and viceversa.
+ VF0 will be received by VF1 and vice versa.
- These VFs can be used by applications or virtual machines to communicate between them
without sending traffic outside. There is no switch present in HW, hence the support
for loopback VFs.
@@ -157,3 +159,184 @@ Egress
3. The SQ descriptor ring is maintained in buffers allocated from SQ mapped pool of NPA block LF.
4. NIX block transmits the pkt on the designated channel.
5. NPC MCAM entries can be installed to divert pkt onto a different channel.
+
+Devlink health reporters
+========================
+
+NPA Reporters
+-------------
+The NPA reporters are responsible for reporting and recovering from the following groups of errors:
+
+1. GENERAL events
+
+ - Error due to operation of unmapped PF.
+ - Error due to disabled alloc/free for other HW blocks (NIX, SSO, TIM, DPI and AURA).
+
+2. ERROR events
+
+ - Fault due to NPA_AQ_INST_S read or NPA_AQ_RES_S write.
+ - AQ Doorbell Error.
+
+3. RAS events
+
+ - RAS Error Reporting for NPA_AQ_INST_S/NPA_AQ_RES_S.
+
+4. RVU events
+
+ - Error due to unmapped slot.
+
+Sample Output::
+
+ ~# devlink health
+ pci/0002:01:00.0:
+ reporter hw_npa_intr
+ state healthy error 2872 recover 2872 last_dump_date 2020-12-10 last_dump_time 09:39:09 grace_period 0 auto_recover true auto_dump true
+ reporter hw_npa_gen
+ state healthy error 2872 recover 2872 last_dump_date 2020-12-11 last_dump_time 04:43:04 grace_period 0 auto_recover true auto_dump true
+ reporter hw_npa_err
+ state healthy error 2871 recover 2871 last_dump_date 2020-12-10 last_dump_time 09:39:17 grace_period 0 auto_recover true auto_dump true
+ reporter hw_npa_ras
+ state healthy error 0 recover 0 last_dump_date 2020-12-10 last_dump_time 09:32:40 grace_period 0 auto_recover true auto_dump true
+
+Each reporter dumps the
+
+ - Error Type
+ - Error Register value
+ - Reason in words
+
+For example::
+
+ ~# devlink health dump show pci/0002:01:00.0 reporter hw_npa_gen
+ NPA_AF_GENERAL:
+ NPA General Interrupt Reg : 1
+ NIX0: free disabled RX
+ ~# devlink health dump show pci/0002:01:00.0 reporter hw_npa_intr
+ NPA_AF_RVU:
+ NPA RVU Interrupt Reg : 1
+ Unmap Slot Error
+ ~# devlink health dump show pci/0002:01:00.0 reporter hw_npa_err
+ NPA_AF_ERR:
+ NPA Error Interrupt Reg : 4096
+ AQ Doorbell Error
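+
+The reporter status lines in the sample output above are flat key/value lists,
+so they are easy to post-process. The following is a minimal sketch, assuming
+the text layout shown in the sample output; the function name is hypothetical
+and the parser is not part of devlink.

```python
def parse_health(text):
    """Map each reporter name to a dict of the fields on its 'state' line."""
    reporters = {}
    current = None
    for raw in text.splitlines():
        tokens = raw.split()
        if not tokens:
            continue
        if tokens[0] == "reporter" and len(tokens) == 2:
            current = tokens[1]
        elif tokens[0] == "state" and current is not None:
            # The line alternates key and value: state healthy error 0 ...
            reporters[current] = dict(zip(tokens[0::2], tokens[1::2]))
    return reporters


SAMPLE = """\
pci/0002:01:00.0:
  reporter hw_npa_intr
    state healthy error 2872 recover 2872 last_dump_date 2020-12-10 last_dump_time 09:39:09 grace_period 0 auto_recover true auto_dump true
  reporter hw_npa_ras
    state healthy error 0 recover 0 last_dump_date 2020-12-10 last_dump_time 09:32:40 grace_period 0 auto_recover true auto_dump true
"""

for name, fields in parse_health(SAMPLE).items():
    print(name, fields["state"], fields["error"], fields["recover"])
```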
+
+
+NIX Reporters
+-------------
+The NIX reporters are responsible for reporting and recovering from the following groups of errors:
+
+1. GENERAL events
+
+ - Receive mirror/multicast packet drop due to insufficient buffer.
+ - SMQ Flush operation.
+
+2. ERROR events
+
+ - Memory Fault due to WQE read/write from multicast/mirror buffer.
+ - Receive multicast/mirror replication list error.
+ - Receive packet on an unmapped PF.
+ - Fault due to NIX_AQ_INST_S read or NIX_AQ_RES_S write.
+ - AQ Doorbell Error.
+
+3. RAS events
+
+ - RAS Error Reporting for NIX Receive Multicast/Mirror Entry Structure.
+ - RAS Error Reporting for WQE/Packet Data read from Multicast/Mirror Buffer.
+ - RAS Error Reporting for NIX_AQ_INST_S/NIX_AQ_RES_S.
+
+4. RVU events
+
+ - Error due to unmapped slot.
+
+Sample Output::
+
+ ~# ./devlink health
+ pci/0002:01:00.0:
+ reporter hw_npa_intr
+ state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
+ reporter hw_npa_gen
+ state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
+ reporter hw_npa_err
+ state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
+ reporter hw_npa_ras
+ state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
+ reporter hw_nix_intr
+ state healthy error 1121 recover 1121 last_dump_date 2021-01-19 last_dump_time 05:42:26 grace_period 0 auto_recover true auto_dump true
+ reporter hw_nix_gen
+ state healthy error 949 recover 949 last_dump_date 2021-01-19 last_dump_time 05:42:43 grace_period 0 auto_recover true auto_dump true
+ reporter hw_nix_err
+ state healthy error 1147 recover 1147 last_dump_date 2021-01-19 last_dump_time 05:42:59 grace_period 0 auto_recover true auto_dump true
+ reporter hw_nix_ras
+ state healthy error 409 recover 409 last_dump_date 2021-01-19 last_dump_time 05:43:16 grace_period 0 auto_recover true auto_dump true
+
+Each reporter dumps the
+
+ - Error Type
+ - Error Register value
+ - Reason in words
+
+For example::
+
+ ~# devlink health dump show pci/0002:01:00.0 reporter hw_nix_intr
+ NIX_AF_RVU:
+ NIX RVU Interrupt Reg : 1
+ Unmap Slot Error
+ ~# devlink health dump show pci/0002:01:00.0 reporter hw_nix_gen
+ NIX_AF_GENERAL:
+ NIX General Interrupt Reg : 1
+ Rx multicast pkt drop
+ ~# devlink health dump show pci/0002:01:00.0 reporter hw_nix_err
+ NIX_AF_ERR:
+ NIX Error Interrupt Reg : 64
+ Rx on unmapped PF_FUNC
+
+
+Quality of service
+==================
+
+
+Hardware algorithms used in scheduling
+--------------------------------------
+
+The transmit interface of octeontx2 silicon and CN10K consists of five transmit
+levels, from SMQ/MDQ through TL4 down to TL1. Each packet traverses the MDQ and
+TL4 to TL1 levels. Each level contains an array of queues to support scheduling
+and shaping. The hardware uses the algorithms below depending on the priority of
+the scheduler queues. Once the user creates tc classes with different priorities,
+the driver configures the schedulers allocated to each class with the specified
+priority along with the rate-limiting configuration.
+
+1. Strict Priority
+
+ - Once packets are submitted to the MDQs, the hardware chooses among active MDQs
+ that have different priorities using strict priority.
+
+2. Round Robin
+
+ - Active MDQs having the same priority level are chosen using round robin.
+
+
+Setup HTB offload
+-----------------
+
+1. Enable HW TC offload on the interface::
+
+ # ethtool -K <interface> hw-tc-offload on
+
+2. Create the HTB root::
+
+ # tc qdisc add dev <interface> clsact
+ # tc qdisc replace dev <interface> root handle 1: htb offload
+
+3. Create tc classes with different priorities::
+
+ # tc class add dev <interface> parent 1: classid 1:1 htb rate 10Gbit prio 1
+
+ # tc class add dev <interface> parent 1: classid 1:2 htb rate 10Gbit prio 7
+
+4. Create tc classes with the same priority and different quantum values::
+
+ # tc class add dev <interface> parent 1: classid 1:1 htb rate 10Gbit prio 2 quantum 409600
+
+ # tc class add dev <interface> parent 1: classid 1:2 htb rate 10Gbit prio 2 quantum 188416
+
+ # tc class add dev <interface> parent 1: classid 1:3 htb rate 10Gbit prio 2 quantum 32768
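+
+When many classes are needed, the commands from the steps above can be
+generated rather than typed by hand. The sketch below only builds the command
+strings used in those steps; the function names and the ``eth0`` interface name
+are hypothetical, and nothing is executed.

```python
def htb_class_cmd(dev, classid, rate, prio, quantum=None, parent="1:"):
    """Build one 'tc class add' command in the form used in the steps above."""
    cmd = (f"tc class add dev {dev} parent {parent} classid {classid} "
           f"htb rate {rate} prio {prio}")
    if quantum is not None:
        cmd += f" quantum {quantum}"
    return cmd


def htb_setup_cmds(dev, classes):
    """Return the offload, qdisc and class commands for one interface."""
    cmds = [
        f"ethtool -K {dev} hw-tc-offload on",
        f"tc qdisc add dev {dev} clsact",
        f"tc qdisc replace dev {dev} root handle 1: htb offload",
    ]
    cmds += [htb_class_cmd(dev, **c) for c in classes]
    return cmds


# Three classes with the same priority and different quantum values.
CLASSES = [
    {"classid": "1:1", "rate": "10Gbit", "prio": 2, "quantum": 409600},
    {"classid": "1:2", "rate": "10Gbit", "prio": 2, "quantum": 188416},
    {"classid": "1:3", "rate": "10Gbit", "prio": 2, "quantum": 32768},
]

for line in htb_setup_cmds("eth0", CLASSES):
    print(line)
```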
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
new file mode 100644
index 000000000000..f69ee1ebee01
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
@@ -0,0 +1,1305 @@
+.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+.. include:: <isonum.txt>
+
+================
+Ethtool counters
+================
+
+:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+Contents
+========
+
+- `Overview`_
+- `Groups`_
+- `Types`_
+- `Descriptions`_
+
+Overview
+========
+
+There are several counter groups based on where the counter is being counted. In
+addition, each group of counters may have different counter types.
+
+Each counter group corresponds to the component of the networking setup,
+illustrated below, that its counters describe::
+
+ ----------------------------------------
+ | |
+ ---------------------------------------- ---------------------------------------- |
+ | Hypervisor | | VM | |
+ | | | | |
+ | ------------------- --------------- | | ------------------- --------------- | |
+ | | Ethernet driver | | RDMA driver | | | | Ethernet driver | | RDMA driver | | |
+ | ------------------- --------------- | | ------------------- --------------- | |
+ | | | | | | | | |
+ | ------------------- | | ------------------- | |
+ | | | | | |--
+ ---------------------------------------- ----------------------------------------
+ | |
+ ------------- -----------------------------
+ | |
+ ------ ------ ------ ------ ------ ------ ------
+ -----| PF |----------------------| VF |-| VF |-| VF |----- --| PF |--- --| PF |--- --| PF |---
+ | ------ ------ ------ ------ | | ------ | | ------ | | ------ |
+ | | | | | | | |
+ | | | | | | | |
+ | | | | | | | |
+ | eSwitch | | eSwitch | | eSwitch | | eSwitch |
+ ---------------------------------------------------------- ----------- ----------- -----------
+ -------------------------------------------------------------------------------
+ | |
+ | |
+ | Uplink (no counters) |
+ -------------------------------------------------------------------------------
+ ---------------------------------------------------------------
+ | |
+ | |
+ | MPFS (no counters) |
+ ---------------------------------------------------------------
+ |
+ |
+ | Port
+
+Groups
+======
+
+Ring
+ Software counters populated by the driver stack.
+
+Netdev
+ An aggregation of software ring counters.
+
+vPort counters
+ Traffic counters and drops due to steering or no buffers. May indicate issues
+ with NIC. These counters include Ethernet traffic counters (including Raw
+ Ethernet) and RDMA/RoCE traffic counters.
+
+Physical port counters
+ Counters that collect statistics about the PFs and VFs. May indicate issues
+ with NIC, link, or network. This measuring point holds information on
+ standardized counters like IEEE 802.3, RFC2863, RFC 2819, RFC 3635 and
+ additional counters like flow control, FEC and more. Physical port counters
+ are not exposed to virtual machines.
+
+Priority Port Counters
+ A set of the physical port counters, per priority per port.
+
+Types
+=====
+
+Counters are divided into three types.
+
+Traffic Informative Counters
+ Counters which count traffic. These counters can be used for load estimation
+ or for general debug.
+
+Traffic Acceleration Counters
+ Counters which count traffic that was accelerated by Mellanox driver or by
+ hardware. The counters are an additional layer to the informative counter set,
+ and the same traffic is counted in both informative and acceleration counters.
+
+.. [#accel] Traffic acceleration counter.
+
+Error Counters
+ Increment of these counters might indicate a problem. Each of these counters
+ has an explanation and correction action.
+
+Statistics can be fetched via the `ip link` or `ethtool` commands. `ethtool`
+provides more detailed information::
+
+ ip -s link show <if-name>
+ ethtool -S <if-name>
+
+Descriptions
+============
+
+XSK, PTP, and QoS counters that are similar to counters defined previously will
+not be separately listed. For example, `ptp_tx[i]_packets` will not be
+explicitly documented since `tx[i]_packets` describes the behavior of both
+counters, except `ptp_tx[i]_packets` is only counted when precision time
+protocol is used.
+
+Ring / Netdev Counter
+----------------------------
+The following counters are available per ring or software port.
+
+These counters provide information on the amount of traffic that was accelerated
+by the NIC. They count the accelerated traffic in addition to the standard
+counters that count it (i.e. accelerated traffic is counted twice).
+
+The counter names in the table below refer to both ring and port counters. The
+notation for ring counters includes the [i] index without the brackets. The
+notation for port counters doesn't include the [i]. A counter name
+`rx[i]_packets` will be printed as `rx0_packets` for ring 0 and `rx_packets` for
+the software port.
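+
+The naming rule above can be expressed in a few lines. This is a minimal sketch
+with hypothetical helper names; the ``stats`` dictionary stands in for values
+parsed from ``ethtool -S`` output and contains made-up numbers.

```python
import re


def ring_counter(template, ring):
    """Expand a table name such as 'rx[i]_packets' for one ring."""
    return template.replace("[i]", str(ring))


def port_counter(template):
    """Expand a table name for the software port (no ring index)."""
    return template.replace("[i]", "")


def sum_rings(stats, template):
    """Sum the per-ring values in stats that match the template."""
    prefix, suffix = template.split("[i]")
    pattern = re.compile(re.escape(prefix) + r"\d+" + re.escape(suffix) + r"$")
    return sum(value for name, value in stats.items() if pattern.match(name))


# Made-up values for illustration only.
stats = {"rx0_packets": 10, "rx1_packets": 5, "rx_packets": 15, "rx0_bytes": 100}

print(ring_counter("rx[i]_packets", 0))
print(port_counter("rx[i]_packets"))
print(sum_rings(stats, "rx[i]_packets"))
```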
+
+.. flat-table:: Ring / Software Port Counter Table
+ :widths: 2 3 1
+
+ * - Counter
+ - Description
+ - Type
+
+ * - `rx[i]_packets`
+ - The number of packets received on ring i.
+ - Informative
+
+ * - `rx[i]_bytes`
+ - The number of bytes received on ring i.
+ - Informative
+
+ * - `tx[i]_packets`
+ - The number of packets transmitted on ring i.
+ - Informative
+
+ * - `tx[i]_bytes`
+ - The number of bytes transmitted on ring i.
+ - Informative
+
+ * - `tx[i]_recover`
+ - The number of times the SQ was recovered.
+ - Error
+
+ * - `tx[i]_cqes`
+ - The number of CQE events issued on the SQ of ring i.
+ - Informative
+
+ * - `tx[i]_cqe_err`
+ - The number of error CQEs encountered on the SQ for ring i.
+ - Error
+
+ * - `tx[i]_tso_packets`
+ - The number of TSO packets transmitted on ring i [#accel]_.
+ - Acceleration
+
+ * - `tx[i]_tso_bytes`
+ - The number of TSO bytes transmitted on ring i [#accel]_.
+ - Acceleration
+
+ * - `tx[i]_tso_inner_packets`
+ - The number of TSO packets indicated to carry inner encapsulation that
+ were transmitted on ring i [#accel]_.
+ - Acceleration
+
+ * - `tx[i]_tso_inner_bytes`
+ - The number of TSO bytes indicated to carry inner encapsulation that
+ were transmitted on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_gro_packets`
+ - Number of received packets processed using hardware-accelerated GRO. The
+ number of hardware GRO offloaded packets received on ring i.
+ - Acceleration
+
+ * - `rx[i]_gro_bytes`
+ - Number of received bytes processed using hardware-accelerated GRO. The
+ number of hardware GRO offloaded bytes received on ring i.
+ - Acceleration
+
+ * - `rx[i]_gro_skbs`
+ - The number of receive SKBs constructed while performing
+ hardware-accelerated GRO.
+ - Informative
+
+ * - `rx[i]_gro_match_packets`
+ - Number of received packets processed using hardware-accelerated GRO that
+ met the flow table match criteria.
+ - Informative
+
+ * - `rx[i]_gro_large_hds`
+ - Number of receive packets using hardware-accelerated GRO that have large
+ headers that require additional memory to be allocated.
+ - Informative
+
+ * - `rx[i]_lro_packets`
+ - The number of LRO packets received on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_lro_bytes`
+ - The number of LRO bytes received on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_ecn_mark`
+ - The number of received packets where the ECN mark was turned on.
+ - Informative
+
+ * - `rx_oversize_pkts_buffer`
+ - The number of received packets that arrived at the RQ and were dropped
+ because their length exceeds the software buffer size allocated by the
+ device for incoming traffic. It might imply that the device MTU is larger
+ than the software buffer size.
+ - Error
+
+ * - `rx_oversize_pkts_sw_drop`
+ - Number of received packets dropped in software because the CQE data is
+ larger than the MTU size.
+ - Error
+
+ * - `rx[i]_csum_unnecessary`
+ - Packets received with a `CHECKSUM_UNNECESSARY` on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_csum_unnecessary_inner`
+ - Packets received with inner encapsulation with a `CHECKSUM_UNNECESSARY`
+ on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_csum_none`
+ - Packets received with a `CHECKSUM_NONE` on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_csum_complete`
+ - Packets received with a `CHECKSUM_COMPLETE` on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_csum_complete_tail`
+ - Number of received packets that had checksum calculation computed,
+ potentially needed padding, and were able to do so with
+ `CHECKSUM_PARTIAL`.
+ - Informative
+
+ * - `rx[i]_csum_complete_tail_slow`
+ - Number of received packets that need padding larger than eight bytes for
+ the checksum.
+ - Informative
+
+ * - `tx[i]_csum_partial`
+ - Packets transmitted with a `CHECKSUM_PARTIAL` on ring i [#accel]_.
+ - Acceleration
+
+ * - `tx[i]_csum_partial_inner`
+ - Packets transmitted with inner encapsulation with a `CHECKSUM_PARTIAL` on
+ ring i [#accel]_.
+ - Acceleration
+
+ * - `tx[i]_csum_none`
+ - Packets transmitted with no hardware checksum acceleration on ring i.
+ - Informative
+
+ * - `tx[i]_stopped` / `tx_queue_stopped` [#ring_global]_
+ - Events where SQ was full on ring i. If this counter is increased, check
+ the amount of buffers allocated for transmission.
+ - Informative
+
+ * - `tx[i]_wake` / `tx_queue_wake` [#ring_global]_
+ - Events where SQ was full and has become not full on ring i.
+ - Informative
+
+ * - `tx[i]_dropped` / `tx_queue_dropped` [#ring_global]_
+ - Packets transmitted that were dropped due to DMA mapping failure on
+ ring i. If this counter is increased, check the amount of buffers
+ allocated for transmission.
+ - Error
+
+ * - `tx[i]_nop`
+ - The number of nop WQEs (empty WQEs) inserted into the SQ (related to
+ ring i) because the end of the cyclic buffer was reached. When nearing
+ the end of the cyclic buffer, the driver may add these empty WQEs to
+ avoid handling a state where a WQE starts at the end of the queue and
+ ends at the beginning of the queue. This is a normal condition.
+ - Informative
+
+ * - `tx[i]_added_vlan_packets`
+ - The number of packets sent where vlan tag insertion was offloaded to the
+ hardware.
+ - Acceleration
+
+ * - `rx[i]_removed_vlan_packets`
+ - The number of packets received where vlan tag stripping was offloaded to
+ the hardware.
+ - Acceleration
+
+ * - `rx[i]_wqe_err`
+ - The number of wrong opcodes received on ring i.
+ - Error
+
+ * - `rx[i]_mpwqe_frag`
+ - The number of WQEs that failed to allocate a compound page, so that
+ fragmented MPWQEs (Multi Packet WQEs) were used on ring i. If this
+ counter rises, it may suggest that there is not enough memory for large
+ pages and the driver allocated fragmented pages. This is not an abnormal
+ condition.
+ - Informative
+
+ * - `rx[i]_mpwqe_filler_cqes`
+ - The number of filler CQE events that were issued on ring i.
+ - Informative
+
+ * - `rx[i]_mpwqe_filler_strides`
+ - The number of strides consumed by filler CQEs on ring i.
+ - Informative
+
+ * - `tx[i]_mpwqe_blks`
+ - The number of send blocks processed from Multi-Packet WQEs (mpwqe).
+ - Informative
+
+ * - `tx[i]_mpwqe_pkts`
+ - The number of send packets processed from Multi-Packet WQEs (mpwqe).
+ - Informative
+
+ * - `rx[i]_cqe_compress_blks`
+ - The number of receive blocks with CQE compression on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_cqe_compress_pkts`
+ - The number of receive packets with CQE compression on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_arfs_add`
+ - The number of aRFS flow rules added to the device for direct RQ steering
+ on ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_arfs_request_in`
+ - Number of flow rules that have been requested to move into ring i for
+ direct RQ steering [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_arfs_request_out`
+ - Number of flow rules that have been requested to move out of ring i [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_arfs_expired`
+ - Number of flow rules that have expired and been removed [#accel]_.
+ - Acceleration
+
+ * - `rx[i]_arfs_err`
+ - Number of flow rules that failed to be added to the flow table.
+ - Error
+
+ * - `rx[i]_recover`
+ - The number of times the RQ was recovered.
+ - Error
+
+ * - `tx[i]_xmit_more`
+ - The number of packets sent with `xmit_more` indication set on the skbuff
+ (no doorbell).
+ - Acceleration
+
+ * - `ch[i]_poll`
+ - The number of invocations of NAPI poll of channel i.
+ - Informative
+
+ * - `ch[i]_arm`
+ - The number of times the NAPI poll function completed and armed the
+ completion queues on channel i.
+ - Informative
+
+ * - `ch[i]_aff_change`
+ - The number of times the NAPI poll function explicitly stopped execution
+ on a CPU due to a change in affinity, on channel i.
+ - Informative
+
+ * - `ch[i]_events`
+ - The number of hard interrupt events on the completion queues of channel i.
+ - Informative
+
+ * - `ch[i]_eq_rearm`
+ - The number of times the EQ was recovered.
+ - Error
+
+ * - `ch[i]_force_irq`
+ - Number of times NAPI is triggered by XSK wakeups by posting a NOP to
+ ICOSQ.
+ - Acceleration
+
+ * - `rx[i]_congst_umr`
+ - The number of times an outstanding UMR request is delayed due to
+ congestion, on ring i.
+ - Informative
+
+ * - `rx_pp_alloc_fast`
+ - Number of successful fast path allocations.
+ - Informative
+
+ * - `rx_pp_alloc_slow`
+ - Number of slow path order-0 allocations.
+ - Informative
+
+ * - `rx_pp_alloc_slow_high_order`
+ - Number of slow path high order allocations.
+ - Informative
+
+ * - `rx_pp_alloc_empty`
+ - Counter is incremented when ptr ring is empty, so a slow path allocation
+ was forced.
+ - Informative
+
+ * - `rx_pp_alloc_refill`
+ - Counter is incremented when an allocation triggers a refill of the
+ cache.
+ - Informative
+
+ * - `rx_pp_alloc_waive`
+ - Counter is incremented when pages obtained from the ptr ring cannot
+ be added to the cache due to a NUMA mismatch.
+ - Informative
+
+ * - `rx_pp_recycle_cached`
+ - Counter is incremented when recycling places a page in the page pool cache.
+ - Informative
+
+ * - `rx_pp_recycle_cache_full`
+ - Counter is incremented when page pool cache was full.
+ - Informative
+
+ * - `rx_pp_recycle_ring`
+ - Counter is incremented when a page is placed into the ptr ring.
+ - Informative
+
+ * - `rx_pp_recycle_ring_full`
+ - Counter is incremented when a page is released from the page pool because
+ the ptr ring was full.
+ - Informative
+
+ * - `rx_pp_recycle_released_ref`
+ - Counter is incremented when a page is released (and not recycled) because
+ refcnt > 1.
+ - Informative
+
+ * - `rx[i]_xsk_buff_alloc_err`
+ - The number of times allocating an skb or XSK buffer failed in the XSK RQ
+ context.
+ - Error
+
+ * - `rx[i]_xdp_tx_xmit`
+ - The number of packets forwarded back to the port due to the XDP program
+ `XDP_TX` action (bouncing). These packets are not counted by other
+ software counters. These packets are counted by physical port and vPort
+ counters.
+ - Informative
+
+ * - `rx[i]_xdp_tx_mpwqe`
+ - Number of multi-packet WQEs transmitted by the netdev and `XDP_TX`-ed by
+ the netdev during the RQ context.
+ - Acceleration
+
+ * - `rx[i]_xdp_tx_inlnw`
+ - Number of WQE data segments transmitted where the data could be inlined
+ in the WQE and then `XDP_TX`-ed during the RQ context.
+ - Acceleration
+
+ * - `rx[i]_xdp_tx_nops`
+ - Number of NOP WQEBBs (WQE building blocks) posted to the XDP SQ.
+ - Acceleration
+
+ * - `rx[i]_xdp_tx_full`
+ - The number of packets that should have been forwarded back to the port
+ due to `XDP_TX` action but were dropped due to full tx queue. These packets
+ are not counted by other software counters. These packets are counted by
+ physical port and vPort counters. You may open more rx queues and spread
+ traffic rx over all queues and/or increase rx ring size.
+ - Error
+
+ * - `rx[i]_xdp_tx_err`
+ - The number of times an `XDP_TX` error, such as frame too long or frame
+ too short, occurred on the `XDP_TX` ring of the RX ring.
+ - Error
+
+ * - `rx[i]_xdp_tx_cqes` / `rx_xdp_tx_cqe` [#ring_global]_
+ - The number of completions received on the CQ of the `XDP_TX` ring.
+ - Informative
+
+ * - `rx[i]_xdp_drop`
+ - The number of packets dropped due to the XDP program `XDP_DROP` action.
+ These packets are not counted by other software counters. These packets
+ are counted by physical port and vPort counters.
+ - Informative
+
+ * - `rx[i]_xdp_redirect`
+ - The number of times an XDP redirect action was triggered on ring i.
+ - Acceleration
+
+ * - `tx[i]_xdp_xmit`
+ - The number of packets redirected to the interface (due to XDP redirect).
+ These packets are not counted by other software counters. These packets
+ are counted by physical port and vPort counters.
+ - Informative
+
+ * - `tx[i]_xdp_full`
+ - The number of packets redirected to the interface (due to XDP redirect)
+ but dropped due to a full tx queue. These packets are not counted by
+ other software counters. You may enlarge the tx queues.
+ - Informative
+
+ * - `tx[i]_xdp_mpwqe`
+ - Number of multi-packet WQEs offloaded onto the NIC that were
+ `XDP_REDIRECT`-ed from other netdevs.
+ - Acceleration
+
+ * - `tx[i]_xdp_inlnw`
+ - Number of WQE data segments where the data could be inlined in the WQE
+ where the data segments were `XDP_REDIRECT`-ed from other netdevs.
+ - Acceleration
+
+ * - `tx[i]_xdp_nops`
+ - Number of NOP WQEBBs (WQE building blocks) posted to the SQ that were
+ `XDP_REDIRECT`-ed from other netdevs.
+ - Acceleration
+
+ * - `tx[i]_xdp_err`
+ - The number of packets redirected to the interface (due to XDP redirect)
+ but dropped due to an error such as frame too long or frame too short.
+ - Error
+
+ * - `tx[i]_xdp_cqes`
+ - The number of completions received for packets redirected to the
+ interface (due to XDP redirect) on the CQ.
+ - Informative
+
+ * - `tx[i]_xsk_xmit`
+ - The number of packets transmitted using XSK zerocopy functionality.
+ - Acceleration
+
+ * - `tx[i]_xsk_mpwqe`
+ - Number of multi-packet WQEs offloaded onto the NIC that were
+ `XDP_REDIRECT`-ed from other netdevs.
+ - Acceleration
+
+ * - `tx[i]_xsk_inlnw`
+ - Number of WQE data segments where the data could be inlined in the WQE
+ that are transmitted using XSK zerocopy.
+ - Acceleration
+
+ * - `tx[i]_xsk_full`
+ - Number of times the doorbell is rung in XSK zerocopy mode when the SQ is full.
+ - Error
+
+ * - `tx[i]_xsk_err`
+ - Number of errors that occurred in XSK zerocopy mode such as if the data
+ size is larger than the MTU size.
+ - Error
+
+ * - `tx[i]_xsk_cqes`
+ - Number of CQEs processed in XSK zerocopy mode.
+ - Acceleration
+
+ * - `tx_tls_ctx`
+ - Number of TLS TX HW offload contexts added to device for encryption.
+ - Acceleration
+
+ * - `tx_tls_del`
+ - Number of TLS TX HW offload contexts removed from device (connection
+ closed).
+ - Acceleration
+
+ * - `tx_tls_pool_alloc`
+ - Number of times a unit of work is successfully allocated in the TLS HW
+ offload pool.
+ - Acceleration
+
+ * - `tx_tls_pool_free`
+ - Number of times a unit of work is freed in the TLS HW offload pool.
+ - Acceleration
+
+ * - `rx_tls_ctx`
+ - Number of TLS RX HW offload contexts added to device for decryption.
+ - Acceleration
+
+ * - `rx_tls_del`
+ - Number of TLS RX HW offload contexts deleted from device (connection has
+ finished).
+ - Acceleration
+
+ * - `rx[i]_tls_decrypted_packets`
+ - Number of successfully decrypted RX packets which were part of a TLS
+ stream.
+ - Acceleration
+
+ * - `rx[i]_tls_decrypted_bytes`
+ - Number of TLS payload bytes in RX packets which were successfully
+ decrypted.
+ - Acceleration
+
+ * - `rx[i]_tls_resync_req_pkt`
+ - Number of received TLS packets with a resync request.
+ - Acceleration
+
+ * - `rx[i]_tls_resync_req_start`
+ - Number of times the TLS async resync request was started.
+ - Acceleration
+
+ * - `rx[i]_tls_resync_req_end`
+ - Number of times the TLS async resync request properly ended with
+ providing the HW tracked tcp-seq.
+ - Acceleration
+
+ * - `rx[i]_tls_resync_req_skip`
+ - Number of times the TLS async resync request procedure was started but
+ not properly ended.
+ - Error
+
+ * - `rx[i]_tls_resync_res_ok`
+ - Number of times the TLS resync response call to the driver was
+ successfully handled.
+ - Acceleration
+
+ * - `rx[i]_tls_resync_res_retry`
+ - Number of times the TLS resync response call to the driver was
+ reattempted when ICOSQ is full.
+ - Error
+
+ * - `rx[i]_tls_resync_res_skip`
+ - Number of times the TLS resync response call to the driver was terminated
+ unsuccessfully.
+ - Error
+
+ * - `rx[i]_tls_err`
+ - Number of times when CQE TLS offload was problematic.
+ - Error
+
+ * - `tx[i]_tls_encrypted_packets`
+ - The number of send packets that are TLS encrypted by the kernel.
+ - Acceleration
+
+ * - `tx[i]_tls_encrypted_bytes`
+ - The number of send bytes that are TLS encrypted by the kernel.
+ - Acceleration
+
+ * - `tx[i]_tls_ooo`
+ - Number of times out-of-order TLS SQE fragments were handled on ring i.
+ - Acceleration
+
+ * - `tx[i]_tls_dump_packets`
+ - Number of TLS decrypted packets copied over from NIC over DMA.
+ - Acceleration
+
+ * - `tx[i]_tls_dump_bytes`
+ - Number of TLS decrypted bytes copied over from NIC over DMA.
+ - Acceleration
+
+ * - `tx[i]_tls_resync_bytes`
+ - Number of TLS bytes requested to be resynchronized in order to be
+ decrypted.
+ - Acceleration
+
+ * - `tx[i]_tls_skip_no_sync_data`
+ - Number of TLS send data that can safely be skipped / do not need to be
+ decrypted.
+ - Acceleration
+
+ * - `tx[i]_tls_drop_no_sync_data`
+ - Number of TLS send data that were dropped due to retransmission of TLS
+ data.
+ - Acceleration
+
+ * - `ptp_cq[i]_abort`
+ - Number of times a CQE has to be skipped in precision time protocol due to
+ a skew between the port timestamp and CQE timestamp being greater than
+ 128 seconds.
+ - Error
+
+ * - `ptp_cq[i]_abort_abs_diff_ns`
+ - Accumulation of time differences between the port timestamp and CQE
+ timestamp when the difference is greater than 128 seconds in precision
+ time protocol.
+ - Error
+
+ * - `ptp_cq[i]_late_cqe`
+ - Number of times a CQE has been delivered on the PTP timestamping CQ when
+ the CQE was not expected since a certain amount of time had elapsed where
+ the device typically ensures not posting the CQE.
+ - Error
+
+.. [#ring_global] The corresponding ring and global counters do not share the
+ same name (i.e. do not follow the common naming scheme).
+
+vPort Counters
+--------------
+Counters on the NIC port that is connected to an eSwitch.
+
+.. flat-table:: vPort Counter Table
+ :widths: 2 3 1
+
+ * - Counter
+ - Description
+ - Type
+
+ * - `rx_vport_unicast_packets`
+ - Unicast packets received, steered to a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `rx_vport_unicast_bytes`
+ - Unicast bytes received, steered to a port including Raw Ethernet QP/DPDK
+ traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `tx_vport_unicast_packets`
+ - Unicast packets transmitted, steered from a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `tx_vport_unicast_bytes`
+ - Unicast bytes transmitted, steered from a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `rx_vport_multicast_packets`
+ - Multicast packets received, steered to a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `rx_vport_multicast_bytes`
+ - Multicast bytes received, steered to a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `tx_vport_multicast_packets`
+ - Multicast packets transmitted, steered from a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `tx_vport_multicast_bytes`
+ - Multicast bytes transmitted, steered from a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `rx_vport_broadcast_packets`
+ - Broadcast packets received, steered to a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `rx_vport_broadcast_bytes`
+ - Broadcast bytes received, steered to a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `tx_vport_broadcast_packets`
+ - Broadcast packets transmitted, steered from a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `tx_vport_broadcast_bytes`
+ - Broadcast bytes transmitted, steered from a port including Raw Ethernet
+ QP/DPDK traffic, excluding RDMA traffic.
+ - Informative
+
+ * - `rx_vport_rdma_unicast_packets`
+ - RDMA unicast packets received, steered to a port (counter counts
+ RoCE/UD/RC traffic) [#accel]_.
+ - Acceleration
+
+ * - `rx_vport_rdma_unicast_bytes`
+ - RDMA unicast bytes received, steered to a port (counter counts
+ RoCE/UD/RC traffic) [#accel]_.
+ - Acceleration
+
+ * - `tx_vport_rdma_unicast_packets`
+ - RDMA unicast packets transmitted, steered from a port (counter counts
+ RoCE/UD/RC traffic) [#accel]_.
+ - Acceleration
+
+ * - `tx_vport_rdma_unicast_bytes`
+ - RDMA unicast bytes transmitted, steered from a port (counter counts
+ RoCE/UD/RC traffic) [#accel]_.
+ - Acceleration
+
+ * - `rx_vport_rdma_multicast_packets`
+ - RDMA multicast packets received, steered to a port (counter counts
+ RoCE/UD/RC traffic) [#accel]_.
+ - Acceleration
+
+ * - `rx_vport_rdma_multicast_bytes`
+ - RDMA multicast bytes received, steered to a port (counter counts
+ RoCE/UD/RC traffic) [#accel]_.
+ - Acceleration
+
+ * - `tx_vport_rdma_multicast_packets`
+ - RDMA multicast packets transmitted, steered from a port (counter counts
+ RoCE/UD/RC traffic) [#accel]_.
+ - Acceleration
+
+ * - `tx_vport_rdma_multicast_bytes`
+ - RDMA multicast bytes transmitted, steered from a port (counter counts
+ RoCE/UD/RC traffic) [#accel]_.
+ - Acceleration
+
+ * - `vport_loopback_packets`
+ - Unicast, multicast and broadcast packets that were looped back (received
+ and transmitted), IB/Eth [#accel]_.
+ - Acceleration
+
+ * - `vport_loopback_bytes`
+ - Unicast, multicast and broadcast bytes that were looped back (received
+ and transmitted), IB/Eth [#accel]_.
+ - Acceleration
+
+ * - `rx_steer_missed_packets`
+ - Number of packets that were received by the NIC but were discarded
+ because they did not match any flow in the NIC flow table.
+ - Error
+
+ * - `rx_packets`
+ - Representor only: packets received, that were handled by the hypervisor.
+ - Informative
+
+ * - `rx_bytes`
+ - Representor only: bytes received, that were handled by the hypervisor.
+ - Informative
+
+ * - `tx_packets`
+ - Representor only: packets transmitted, that were handled by the
+ hypervisor.
+ - Informative
+
+ * - `tx_bytes`
+ - Representor only: bytes transmitted, that were handled by the hypervisor.
+ - Informative
+
+ * - `dev_internal_queue_oob`
+ - The number of dropped packets due to lack of receive WQEs for an internal
+ device RQ.
+ - Error
+
+Physical Port Counters
+----------------------
+The physical port counters are the counters on the external port connecting the
+adapter to the network. This measuring point holds information on standardized
+counters like IEEE 802.3, RFC 2863, RFC 2819, RFC 3635 and additional counters
+like flow control, FEC and more.
+
+.. flat-table:: Physical Port Counter Table
+ :widths: 2 3 1
+
+ * - Counter
+ - Description
+ - Type
+
+ * - `rx_packets_phy`
+ - The number of packets received on the physical port. This counter doesn’t
+ include packets that were discarded due to FCS, frame size and similar
+ errors.
+ - Informative
+
+ * - `tx_packets_phy`
+ - The number of packets transmitted on the physical port.
+ - Informative
+
+ * - `rx_bytes_phy`
+ - The number of bytes received on the physical port, including Ethernet
+ header and FCS.
+ - Informative
+
+ * - `tx_bytes_phy`
+ - The number of bytes transmitted on the physical port.
+ - Informative
+
+ * - `rx_multicast_phy`
+ - The number of multicast packets received on the physical port.
+ - Informative
+
+ * - `tx_multicast_phy`
+ - The number of multicast packets transmitted on the physical port.
+ - Informative
+
+ * - `rx_broadcast_phy`
+ - The number of broadcast packets received on the physical port.
+ - Informative
+
+ * - `tx_broadcast_phy`
+ - The number of broadcast packets transmitted on the physical port.
+ - Informative
+
+ * - `rx_crc_errors_phy`
+ - The number of dropped received packets due to FCS (Frame Check Sequence)
+ error on the physical port. If this counter increases at a high rate,
+ check the link quality using the `rx_symbol_err_phy` and
+ `rx_corrected_bits_phy` counters below.
+ - Error
+
+ * - `rx_in_range_len_errors_phy`
+ - The number of received packets dropped due to length/type errors on a
+ physical port.
+ - Error
+
+ * - `rx_out_of_range_len_phy`
+ - The number of received packets dropped due to length greater than allowed
+ on a physical port. If this counter is increasing, it implies that the
+ peer connected to the adapter has a larger MTU configured. Using the same
+ MTU configuration should resolve this issue.
+ - Error
+
+ * - `rx_oversize_pkts_phy`
+ - The number of dropped received packets due to a length which exceeds the
+ MTU size on a physical port. If this counter is increasing, it implies that
+ the peer connected to the adapter has a larger MTU configured. Using the
+ same MTU configuration should resolve this issue.
+ - Error
+
+ * - `rx_symbol_err_phy`
+ - The number of received packets dropped due to physical coding errors
+ (symbol errors) on a physical port.
+ - Error
+
+ * - `rx_mac_control_phy`
+ - The number of MAC control packets received on the physical port.
+ - Informative
+
+ * - `tx_mac_control_phy`
+ - The number of MAC control packets transmitted on the physical port.
+ - Informative
+
+ * - `rx_pause_ctrl_phy`
+ - The number of link layer pause packets received on a physical port. If
+ this counter is increasing, it implies that the network is congested and
+ cannot absorb the traffic coming from the adapter.
+ - Informative
+
+ * - `tx_pause_ctrl_phy`
+ - The number of link layer pause packets transmitted on a physical port. If
+ this counter is increasing, it implies that the NIC is congested and
+ cannot absorb the traffic coming from the network.
+ - Informative
+
+ * - `rx_unsupported_op_phy`
+ - The number of MAC control packets received with unsupported opcode on a
+ physical port.
+ - Error
+
+ * - `rx_discards_phy`
+ - The number of received packets dropped due to lack of buffers on a
+ physical port. If this counter is increasing, it implies that the adapter
+ is congested and cannot absorb the traffic coming from the network.
+ - Error
+
+ * - `tx_discards_phy`
+ - The number of packets which were discarded on transmission, even though
+ no errors were detected. The drop might occur due to the link being down,
+ head of line drop, pause from the network, etc.
+ - Error
+
+ * - `tx_errors_phy`
+ - The number of transmitted packets dropped due to a length which exceeds
+ the MTU size on a physical port.
+ - Error
+
+ * - `rx_undersize_pkts_phy`
+ - The number of received packets dropped due to length which is shorter
+ than 64 bytes on a physical port. If this counter is increasing, it
+ implies that the peer connected to the adapter has a non-standard MTU
+ configured or that a malformed packet has arrived.
+ - Error
+
+ * - `rx_fragments_phy`
+ - The number of received packets dropped due to a length which is shorter
+ than 64 bytes and has FCS error on a physical port. If this counter is
+ increasing, it implies that the peer connected to the adapter has a
+ non-standard MTU configured.
+ - Error
+
+ * - `rx_jabbers_phy`
+ - The number of received packets dropped due to a length which is longer
+ than 64 bytes and had FCS error on a physical port.
+ - Error
+
+ * - `rx_64_bytes_phy`
+ - The number of packets received on the physical port with size of 64 bytes.
+ - Informative
+
+ * - `rx_65_to_127_bytes_phy`
+ - The number of packets received on the physical port with size of 65 to
+ 127 bytes.
+ - Informative
+
+ * - `rx_128_to_255_bytes_phy`
+ - The number of packets received on the physical port with size of 128 to
+ 255 bytes.
+ - Informative
+
+ * - `rx_256_to_511_bytes_phy`
+ - The number of packets received on the physical port with size of 256 to
+ 511 bytes.
+ - Informative
+
+ * - `rx_512_to_1023_bytes_phy`
+ - The number of packets received on the physical port with size of 512 to
+ 1023 bytes.
+ - Informative
+
+ * - `rx_1024_to_1518_bytes_phy`
+ - The number of packets received on the physical port with size of 1024 to
+ 1518 bytes.
+ - Informative
+
+ * - `rx_1519_to_2047_bytes_phy`
+ - The number of packets received on the physical port with size of 1519 to
+ 2047 bytes.
+ - Informative
+
+ * - `rx_2048_to_4095_bytes_phy`
+ - The number of packets received on the physical port with size of 2048 to
+ 4095 bytes.
+ - Informative
+
+ * - `rx_4096_to_8191_bytes_phy`
+ - The number of packets received on the physical port with size of 4096 to
+ 8191 bytes.
+ - Informative
+
+ * - `rx_8192_to_10239_bytes_phy`
+ - The number of packets received on the physical port with size of 8192 to
+ 10239 bytes.
+ - Informative
+
+ * - `link_down_events_phy`
+ - The number of times the link operative state changed to down. If this
+ counter is increasing, it may indicate port flapping. You may
+ need to replace the cable/transceiver.
+ - Error
+
+ * - `rx_out_of_buffer`
+ - Number of times receive queue had no software buffers allocated for the
+ adapter's incoming traffic.
+ - Error
+
+ * - `module_bus_stuck`
+ - The number of times that module's I\ :sup:`2`\C bus (data or clock)
+ short-wire was detected. You may need to replace the cable/transceiver.
+ - Error
+
+ * - `module_high_temp`
+ - The number of times that the module temperature was too high. If this
+ issue persists, you may need to check the ambient temperature or replace
+ the cable/transceiver module.
+ - Error
+
+ * - `module_bad_shorted`
+ - The number of times that the module cables were shorted. You may need to
+ replace the cable/transceiver module.
+ - Error
+
+ * - `module_unplug`
+ - The number of times that module was ejected.
+ - Informative
+
+ * - `rx_buffer_passed_thres_phy`
+ - The number of events where the port receive buffer was over 85% full.
+ - Informative
+
+ * - `tx_pause_storm_warning_events`
+ - The number of times the device was sending pauses for a long period of
+ time.
+ - Informative
+
+ * - `tx_pause_storm_error_events`
+ - The number of times the device was sending pauses for a long period of
+ time, reaching time out and disabling transmission of pause frames. During
+ the period in which pause frames were disabled, drops could have
+ occurred.
+ - Error
+
+ * - `rx[i]_buff_alloc_err`
+ - Number of times a buffer could not be allocated for a received packet (or SKB) on ring i.
+ - Error
+
+ * - `rx_bits_phy`
+ - This counter provides information on the total amount of traffic that
+ could have been received and can be used as a guideline to measure the
+ ratio of errored traffic in `rx_pcs_symbol_err_phy` and
+ `rx_corrected_bits_phy`.
+ - Informative
+
+ * - `rx_pcs_symbol_err_phy`
+ - This counter counts the number of symbol errors that were not corrected by
+ the FEC correction algorithm, or that occurred while FEC was not active on
+ this interface. If this counter is increasing, it implies that the link
+ between the NIC and the network is suffering from high BER, and that
+ traffic is lost. You may need to replace the cable/transceiver. The error
+ rate is the number of `rx_pcs_symbol_err_phy` divided by the number of
+ `rx_bits_phy` over a specific time frame.
+ - Error
+
+ * - `rx_corrected_bits_phy`
+ - The number of corrected bits on this port according to active FEC
+ (RS/FC). If this counter is increasing, it implies that the link between
+ the NIC and the network is suffering from high BER. The corrected bit
+ rate is the number of `rx_corrected_bits_phy` divided by the number of
+ `rx_bits_phy` over a specific time frame.
+ - Error
+
+ * - `rx_err_lane_[l]_phy`
+ - This counter counts the number of physical raw errors per lane l index.
+ The counter counts errors before FEC corrections. If this counter is
+ increasing, it implies that the link between the NIC and the network is
+ suffering from high BER, and that traffic might be lost. You may need to
+ replace the cable/transceiver. Please check in accordance with
+ `rx_corrected_bits_phy`.
+ - Error
+
+ * - `rx_global_pause`
+ - The number of pause packets received on the physical port. If this
+ counter is increasing, it implies that the network is congested and
+ cannot absorb the traffic coming from the adapter. Note: This counter is
+ only enabled when global pause mode is enabled.
+ - Informative
+
+ * - `rx_global_pause_duration`
+ - The duration of pause received (in microseconds) on the physical port. The
+ counter represents the time the port did not send any traffic. If this
+ counter is increasing, it implies that the network is congested and
+ cannot absorb the traffic coming from the adapter. Note: This counter is
+ only enabled when global pause mode is enabled.
+ - Informative
+
+ * - `tx_global_pause`
+ - The number of pause packets transmitted on a physical port. If this
+ counter is increasing, it implies that the adapter is congested and
+ cannot absorb the traffic coming from the network. Note: This counter is
+ only enabled when global pause mode is enabled.
+ - Informative
+
+ * - `tx_global_pause_duration`
+ - The duration of pause transmitted (in microseconds) on the physical port.
+ Note: This counter is only enabled when global pause mode is enabled.
+ - Informative
+
+ * - `rx_global_pause_transition`
+ - The number of times a transition from Xoff to Xon on the physical port
+ has occurred. Note: This counter is only enabled when global pause mode
+ is enabled.
+ - Informative
+
+ * - `rx_if_down_packets`
+ - The number of received packets that were dropped due to interface down.
+ - Informative
+
+Priority Port Counters
+----------------------
+The following counters are physical port counters that are counted per L2
+priority (0-7).
+
+**Note:** `p` in the counter name represents the priority.
+
+.. flat-table:: Priority Port Counter Table
+ :widths: 2 3 1
+
+ * - Counter
+ - Description
+ - Type
+
+ * - `rx_prio[p]_bytes`
+ - The number of bytes received with priority p on the physical port.
+ - Informative
+
+ * - `rx_prio[p]_packets`
+ - The number of packets received with priority p on the physical port.
+ - Informative
+
+ * - `tx_prio[p]_bytes`
+ - The number of bytes transmitted on priority p on the physical port.
+ - Informative
+
+ * - `tx_prio[p]_packets`
+ - The number of packets transmitted on priority p on the physical port.
+ - Informative
+
+ * - `rx_prio[p]_pause`
+ - The number of pause packets received with priority p on a physical port.
+ If this counter is increasing, it implies that the network is congested
+ and cannot absorb the traffic coming from the adapter. Note: This counter
+ is available only if PFC was enabled on priority p.
+ - Informative
+
+ * - `rx_prio[p]_pause_duration`
+ - The duration of pause received (in microseconds) on priority p on the
+ physical port. The counter represents the time the port did not send any
+ traffic on this priority. If this counter is increasing, it implies that
+ the network is congested and cannot absorb the traffic coming from the
+ adapter. Note: This counter is available only if PFC was enabled on
+ priority p.
+ - Informative
+
+ * - `rx_prio[p]_pause_transition`
+ - The number of times a transition from Xoff to Xon on priority p on the
+ physical port has occurred. Note: This counter is available only if PFC
+ was enabled on priority p.
+ - Informative
+
+ * - `tx_prio[p]_pause`
+ - The number of pause packets transmitted on priority p on a physical port.
+ If this counter is increasing, it implies that the adapter is congested
+ and cannot absorb the traffic coming from the network. Note: This counter
+ is available only if PFC was enabled on priority p.
+ - Informative
+
+ * - `tx_prio[p]_pause_duration`
+ - The duration of pause transmitted (in microseconds) on priority p on the
+ physical port. Note: This counter is available only if PFC was enabled on
+ priority p.
+ - Informative
+
+ * - `rx_prio[p]_buf_discard`
+ - The number of packets discarded by device due to lack of per host receive
+ buffers.
+ - Informative
+
+ * - `rx_prio[p]_cong_discard`
+ - The number of packets discarded by device due to per host congestion.
+ - Informative
+
+ * - `rx_prio[p]_marked`
+ - The number of packets ECN-marked by device due to per host congestion.
+ - Informative
+
+ * - `rx_prio[p]_discards`
+ - The number of packets discarded by device due to lack of receive buffers.
+ - Informative
+
+Device Counters
+---------------
+.. flat-table:: Device Counter Table
+ :widths: 2 3 1
+
+ * - Counter
+ - Description
+ - Type
+
+ * - `rx_pci_signal_integrity`
+ - Counts physical layer PCIe signal integrity errors, the number of
+ transitions to recovery due to Framing errors and CRC (dlp and tlp). If
+ this counter is rising, try moving the adapter card to a different slot
+ to rule out a bad PCI slot. Validate that you are running with the latest
+ firmware available and latest server BIOS version.
+ - Error
+
+ * - `tx_pci_signal_integrity`
+ - Counts physical layer PCIe signal integrity errors, the number of
+ transitions to recovery initiated by the other side (moving to recovery
+ due to getting TS/EIEOS). If this counter is rising, try moving the
+ adapter card to a different slot to rule out a bad PCI slot. Validate
+ that you are running with the latest firmware available and latest server
+ BIOS version.
+ - Error
+
+ * - `outbound_pci_buffer_overflow`
+ - The number of packets dropped due to PCI buffer overflow. If this counter
+ is rising at a high rate, it might indicate that the receive traffic rate
+ for a host exceeds what the PCIe bus can carry, so congestion occurs.
+ - Informative
+
+ * - `outbound_pci_stalled_rd`
+ - The percentage (in the range 0...100) of time within the last second that
+ the NIC had outbound non-posted read requests but could not perform the
+ operation due to insufficient posted credits.
+ - Informative
+
+ * - `outbound_pci_stalled_wr`
+ - The percentage (in the range 0...100) of time within the last second that
+ the NIC had outbound posted write requests but could not perform the
+ operation due to insufficient posted credits.
+ - Informative
+
+ * - `outbound_pci_stalled_rd_events`
+ - The number of seconds where `outbound_pci_stalled_rd` was above 30%.
+ - Informative
+
+ * - `outbound_pci_stalled_wr_events`
+ - The number of seconds where `outbound_pci_stalled_wr` was above 30%.
+ - Informative
+
+ * - `dev_out_of_buffer`
+ - The number of times the device-owned queue did not have enough buffers
+ allocated.
+ - Error
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst
new file mode 100644
index 000000000000..581a91caa579
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst
@@ -0,0 +1,25 @@
+.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+.. include:: <isonum.txt>
+
+Mellanox ConnectX(R) mlx5 core VPI Network Driver
+=================================================
+
+:Copyright: |copy| 2019, Mellanox Technologies LTD.
+:Copyright: |copy| 2020-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ kconfig
+ switchdev
+ tracepoints
+ counters
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst
new file mode 100644
index 000000000000..20d3b7e87049
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst
@@ -0,0 +1,168 @@
+.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+.. include:: <isonum.txt>
+
+=======================================
+Enabling the driver and kconfig options
+=======================================
+
+:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+| mlx5 core is modular and most of the major mlx5 core driver features can be selected (compiled in/out)
+| at build time via kernel Kconfig flags.
+| Basic features, ethernet net device rx/tx offloads and XDP, are available with the most basic flags
+| CONFIG_MLX5_CORE=y/m and CONFIG_MLX5_CORE_EN=y.
+| For the list of advanced features, please see below.
+
+**CONFIG_MLX5_BRIDGE=(y/n)**
+
+| Enable :ref:`Ethernet Bridging (BRIDGE) offloading support <mlx5_bridge_offload>`.
+| This will provide the ability to add representors of mlx5 uplink and VF
+| ports to Bridge and offloading rules for traffic between such ports.
+| Supports VLANs (trunk and access modes).
+
+
+**CONFIG_MLX5_CORE=(y/m/n)** (module mlx5_core.ko)
+
+| The driver can be enabled by choosing CONFIG_MLX5_CORE=y/m in kernel config.
+| This will provide mlx5 core driver for mlx5 ulps to interface with (mlx5e, mlx5_ib).
+
+
+**CONFIG_MLX5_CORE_EN=(y/n)**
+
+| Choosing this option will allow basic ethernet netdevice support with all of the standard rx/tx offloads.
+| mlx5e is the mlx5 ulp driver which provides the netdevice kernel interface. When chosen, mlx5e will be
+| built into mlx5_core.ko.
+
+
+**CONFIG_MLX5_CORE_EN_DCB=(y/n)**
+
+| Enables `Data Center Bridging (DCB) Support <https://enterprise-support.nvidia.com/s/article/howto-auto-config-pfc-and-ets-on-connectx-4-via-lldp-dcbx>`_.
+
+
+**CONFIG_MLX5_CORE_IPOIB=(y/n)**
+
+| IPoIB offloads & acceleration support.
+| Requires CONFIG_MLX5_CORE_EN to provide an accelerated interface for the rdma
+| IPoIB ulp netdevice.
+
+
+**CONFIG_MLX5_CLS_ACT=(y/n)**
+
+| Enables offload support for TC classifier action (NET_CLS_ACT).
+| Works in both native NIC mode and Switchdev SRIOV mode.
+| Flow-based classifiers, such as those registered through
+| `tc-flower(8)`, are processed by the device, rather than the
+| host. Actions that overwrite matching classification
+| results then take effect immediately due to the offload.
+
+
+**CONFIG_MLX5_EN_ARFS=(y/n)**
+
+| Enables Hardware-accelerated receive flow steering (arfs) support, and ntuple filtering.
+| https://enterprise-support.nvidia.com/s/article/howto-configure-arfs-on-connectx-4
+
+
+**CONFIG_MLX5_EN_IPSEC=(y/n)**
+
+| Enables :ref:`IPSec XFRM cryptography-offload acceleration <xfrm_device>`.
+
+
+**CONFIG_MLX5_MACSEC=(y/n)**
+
+| Build support for MACsec cryptography-offload acceleration in the NIC.
+
+
+**CONFIG_MLX5_EN_RXNFC=(y/n)**
+
+| Enables ethtool receive network flow classification, which allows user defined
+| flow rules to direct traffic into arbitrary rx queue via ethtool set/get_rxnfc API.
+
+
+**CONFIG_MLX5_EN_TLS=(y/n)**
+
+| TLS cryptography-offload acceleration.
+
+
+**CONFIG_MLX5_ESWITCH=(y/n)**
+
+| Ethernet SRIOV E-Switch support in ConnectX NIC. E-Switch provides internal SRIOV packet steering
+| and switching for the enabled VFs and PF in two available modes:
+| 1) `Legacy SRIOV mode (L2 mac vlan steering based) <https://enterprise-support.nvidia.com/s/article/HowTo-Configure-SR-IOV-for-ConnectX-4-ConnectX-5-ConnectX-6-with-KVM-Ethernet>`_.
+| 2) :ref:`Switchdev mode (eswitch offloads) <switchdev>`.
+
+
+**CONFIG_MLX5_FPGA=(y/n)**
+
+| Build support for the Innova family of network cards by Mellanox Technologies.
+| Innova network cards are comprised of a ConnectX chip and an FPGA chip on one board.
+| If you select this option, the mlx5_core driver will include the Innova FPGA core and allow
+| building sandbox-specific client drivers.
+
+
+**CONFIG_MLX5_INFINIBAND=(y/n/m)** (module mlx5_ib.ko)
+
+| Provides low-level InfiniBand/RDMA and `RoCE <https://enterprise-support.nvidia.com/s/article/recommended-network-configuration-examples-for-roce-deployment>`_ support.
+
+
+**CONFIG_MLX5_MPFS=(y/n)**
+
+| Ethernet Multi-Physical Function Switch (MPFS) support in ConnectX NIC.
+| MPFS is required when `Multi-Host <https://www.nvidia.com/en-us/networking/multi-host/>`_ configuration is enabled, to allow passing
+| user configured unicast MAC addresses to the requesting PF.
+
+
+**CONFIG_MLX5_SF=(y/n)**
+
+| Build support for subfunction.
+| Subfunctions are more lightweight than PCI SRIOV VFs. Choosing this option
+| will enable support for creating subfunction devices.
+
+
+**CONFIG_MLX5_SF_MANAGER=(y/n)**
+
+| Build support for subfunction port in the NIC. A Mellanox subfunction
+| port is managed through devlink. A subfunction supports RDMA, netdevice
+| and vdpa device. It is similar to a SRIOV VF but it doesn't require
+| SRIOV support.
+
+
+**CONFIG_MLX5_SW_STEERING=(y/n)**
+
+| Build support for software-managed steering in the NIC.
+
+
+**CONFIG_MLX5_TC_CT=(y/n)**
+
+| Support offloading connection tracking rules via tc ct action.
+
+
+**CONFIG_MLX5_TC_SAMPLE=(y/n)**
+
+| Support offloading sample rules via tc sample action.
+
+
+**CONFIG_MLX5_VDPA=(y/n)**
+
+| Support library for Mellanox VDPA drivers. Provides code that is
+| common for all types of VDPA drivers. The following drivers are planned:
+| net, block.
+
+
+**CONFIG_MLX5_VDPA_NET=(y/n)**
+
+| VDPA network driver for ConnectX6 and newer. Provides offloading
+| of virtio net datapath such that descriptors put on the ring will
+| be executed by the hardware. It also supports a variety of stateless
+| offloads depending on the actual device used and firmware version.
+
+
+**CONFIG_MLX5_VFIO_PCI=(y/n)**
+
+| This provides migration support for MLX5 devices using the VFIO framework.
+
+
+**External options** ( Choose if the corresponding mlx5 feature is required )
+
+- CONFIG_MLXFW: When chosen, mlx5 firmware flashing support will be enabled (via devlink and ethtool).
+- CONFIG_PTP_1588_CLOCK: When chosen, mlx5 ptp support will be enabled.
+- CONFIG_VXLAN: When chosen, mlx5 vxlan support will be enabled.
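
To see which of these options a given kernel was built with, the build
configuration can be searched. The following is a minimal sketch, assuming the
configuration is installed as a plain-text file such as
``/boot/config-$(uname -r)`` (the location varies by distribution and is an
assumption here). ``mlx5_config_options`` is a hypothetical helper name, not
part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: print every CONFIG_MLX5_* option that is set
# (=y, =m or a value) in a kernel config file.
# The default path is an assumption; pass the real path as $1 if it differs.
mlx5_config_options() {
    config_file="${1:-/boot/config-$(uname -r)}"
    grep -E '^CONFIG_MLX5_[A-Z0-9_]+=' "$config_file"
}

# Example:
#   mlx5_config_options /boot/config-"$(uname -r)"
```

Options that are disabled appear in the config file as ``# CONFIG_... is not
set`` comments and are therefore not printed by this sketch.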
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst
new file mode 100644
index 000000000000..b617e93d7c2c
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst
@@ -0,0 +1,281 @@
+.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+.. include:: <isonum.txt>
+
+=========
+Switchdev
+=========
+
+:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+.. _mlx5_bridge_offload:
+
+Bridge offload
+==============
+
+The mlx5 driver implements support for offloading bridge rules when in switchdev
+mode. Linux bridge FDBs are automatically offloaded when an mlx5 switchdev
+representor is attached to a bridge.
+
+- Change device to switchdev mode::
+
+ $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
+
+- Attach mlx5 switchdev representor 'enp8s0f0' to bridge netdev 'bridge1'::
+
+ $ ip link set enp8s0f0 master bridge1
+
+VLANs
+-----
+
+The following bridge VLAN functions are supported by mlx5:
+
+- VLAN filtering (including multiple VLANs per port)::
+
+ $ ip link set bridge1 type bridge vlan_filtering 1
+ $ bridge vlan add dev enp8s0f0 vid 2-3
+
+- VLAN push on bridge ingress::
+
+ $ bridge vlan add dev enp8s0f0 vid 3 pvid
+
+- VLAN pop on bridge egress::
+
+ $ bridge vlan add dev enp8s0f0 vid 3 untagged
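
The steps above can be collected into one script. This is a sketch under
stated assumptions: it only replays commands already shown in this section,
the bridge netdev is assumed to exist already, and ``run`` and
``mlx5_bridge_offload_setup`` are hypothetical helper names. By default the
commands are printed rather than executed; set ``DRY_RUN=0`` to run them as
root.

```shell
#!/bin/sh
# Sketch only: replays the commands from this section in order.
# DRY_RUN defaults to 1, so commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper. Arguments: PCI address, representor, bridge netdev.
mlx5_bridge_offload_setup() {
    pci_dev="$1"
    rep="$2"
    bridge="$3"
    # Switch the eswitch to switchdev mode.
    run devlink dev eswitch set "pci/$pci_dev" mode switchdev
    # Attach the representor to the (already existing) bridge.
    run ip link set "$rep" master "$bridge"
    # Enable VLAN filtering and allow VLANs 2-3 on the representor.
    run ip link set "$bridge" type bridge vlan_filtering 1
    run bridge vlan add dev "$rep" vid 2-3
}

mlx5_bridge_offload_setup 0000:06:00.0 enp8s0f0 bridge1
```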
+
+Subfunction
+===========
+
+Subfunctions which are spawned over the E-switch are created only with a devlink
+device, and by default all the SF auxiliary devices are disabled.
+This allows the user to configure the SF before it has been fully probed,
+which saves time.
+
+Usage example:
+
+- Create SF::
+
+ $ devlink port add pci/0000:08:00.0 flavour pcisf pfnum 0 sfnum 11
+ $ devlink port function set pci/0000:08:00.0/32768 hw_addr 00:00:00:00:00:11 state active
+
+- Enable ETH auxiliary device::
+
+ $ devlink dev param set auxiliary/mlx5_core.sf.1 name enable_eth value true cmode driverinit
+
+- Now, in order to fully probe the SF, use devlink reload::
+
+ $ devlink dev reload auxiliary/mlx5_core.sf.1
+
+mlx5 supports ETH, rdma and vdpa (vnet) auxiliary device devlink params (see :ref:`Documentation/networking/devlink/devlink-params.rst <devlink_params_generic>`).
+
+mlx5 supports subfunction management using devlink port (see :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`) interface.
+
+A subfunction has its own function capabilities and its own resources. This
+means a subfunction has its own dedicated queues (txq, rxq, cq, eq). These
+queues are neither shared nor stolen from the parent PCI function.
+
+When a subfunction is RDMA capable, it has its own QP1, GID table, and RDMA
+resources neither shared nor stolen from the parent PCI function.
+
+A subfunction has a dedicated window in PCI BAR space that is not shared
+with the other subfunctions or the parent PCI function. This ensures that all
+devices (netdev, rdma, vdpa, etc.) of the subfunction access only the assigned
+PCI BAR space.
+
+A subfunction supports eswitch representation through which it supports tc
+offloads. The user configures eswitch to send/receive packets from/to
+the subfunction port.
+
+Subfunctions share PCI level resources such as PCI MSI-X IRQs with
+other subfunctions and/or with their parent PCI function.
+
+Example mlx5 software, system, and device view::
+
+        _______
+       | admin |
+       | user  |----------
+       |_______|         |
+           |             |
+       ____|____       __|______           _________________
+      |         |     |         |         |                 |
+      | devlink |     | tc tool |         |    user         |
+      | tool    |     |_________|         | applications    |
+      |_________|        |                |_________________|
+           |             |                   |          |
+           |             |                   |          |         Userspace
+ +---------|-------------|-------------------|----------|--------------------+
+           |             |           +----------+   +----------+   Kernel
+           |             |           |  netdev  |   | rdma dev |
+           |             |           +----------+   +----------+
+   (devlink port add/del |              ^               ^
+    port function set)   |              |               |
+           |             |              +---------------|
+      _____|___          |              |        _______|_______
+     |         |         |              |       | mlx5 class    |
+     | devlink |   +------------+       |       |   drivers     |
+     | kernel  |   | rep netdev |       |       |(mlx5_core,ib) |
+     |_________|   +------------+       |       |_______________|
+           |             |              |               ^
+   (devlink ops)         |              |        (probe/remove)
+  _________|________     |              |           ____|________
+ | subfunction      |    |     +---------------+   | subfunction |
+ | management driver|-----     | subfunction   |---|  driver     |
+ | (mlx5_core)      |          | auxiliary dev |   | (mlx5_core) |
+ |__________________|          +---------------+   |_____________|
+           |                                            ^
+  (sf add/del, vhca events)                             |
+           |                                    (device add/del)
+      _____|____                                    ____|________
+     |          |                                  | subfunction |
+     |  PCI NIC |--- activate/deactivate events--->| host driver |
+     |__________|                                  | (mlx5_core) |
+                                                   |_____________|
+
+A subfunction is created using the devlink port interface.
+
+- Change device to switchdev mode::
+
+ $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
+
+- Add a devlink port of subfunction flavour::
+
+ $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
+ pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+ function:
+ hw_addr 00:00:00:00:00:00 state inactive opstate detached
+
+- Show a devlink port of the subfunction::
+
+ $ devlink port show pci/0000:06:00.0/32768
+ pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
+ function:
+ hw_addr 00:00:00:00:00:00 state inactive opstate detached
+
+- Delete a devlink port of subfunction after use::
+
+ $ devlink port del pci/0000:06:00.0/32768
+
+Function attributes
+===================
+
+The mlx5 driver provides a mechanism to set up PCI VF/SF function attributes in
+a unified way for SmartNIC and non-SmartNIC.
+
+This is supported only when the eswitch mode is set to switchdev. Port function
+configuration of the PCI VF/SF is supported through devlink eswitch port.
+
+Port function attributes should be set before PCI VF/SF is enumerated by the
+driver.
+
+MAC address setup
+-----------------
+
+The mlx5 driver supports the devlink port function attr mechanism to set up the
+MAC address (refer to Documentation/networking/devlink/devlink-port.rst).
+
+RoCE capability setup
+~~~~~~~~~~~~~~~~~~~~~
+Not all mlx5 PCI devices/SFs require RoCE capability.
+
+When RoCE capability is disabled, it saves 1 Mbyte of system memory per
+PCI device/SF.
+
+The mlx5 driver supports the devlink port function attr mechanism to set up
+RoCE capability (refer to Documentation/networking/devlink/devlink-port.rst).
+
+Migratable capability setup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Users who want mlx5 PCI VFs to be able to perform live migration need to
+explicitly enable the VF migratable capability.
+
+The mlx5 driver supports the devlink port function attr mechanism to set up
+migratable capability (refer to Documentation/networking/devlink/devlink-port.rst).
+
+IPsec crypto capability setup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Users who want mlx5 PCI VFs to be able to perform IPsec crypto offloading need
+to explicitly enable the VF ipsec_crypto capability. Enabling IPsec capability
+for VFs is supported on ConnectX6dx devices and above. When a VF has
+IPsec capability enabled, any IPsec offloading is blocked on the PF.
+
+The mlx5 driver supports the devlink port function attr mechanism to set up
+ipsec_crypto capability (refer to Documentation/networking/devlink/devlink-port.rst).
+
+IPsec packet capability setup
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Users who want mlx5 PCI VFs to be able to perform IPsec packet offloading need
+to explicitly enable the VF ipsec_packet capability. Enabling IPsec capability
+for VFs is supported on ConnectX6dx devices and above. When a VF has
+IPsec capability enabled, any IPsec offloading is blocked on the PF.
+
+The mlx5 driver supports the devlink port function attr mechanism to set up
+ipsec_packet capability (refer to Documentation/networking/devlink/devlink-port.rst).
+
+SF state setup
+--------------
+
+To use the SF, the user must activate the SF using the SF function state
+attribute.
+
+- Get the state of the SF identified by its unique devlink port index::
+
+ $ devlink port show ens2f0npf0sf88
+ pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+ function:
+ hw_addr 00:00:00:00:88:88 state inactive opstate detached
+
+- Activate the function and verify its state is active::
+
+ $ devlink port function set ens2f0npf0sf88 state active
+
+ $ devlink port show ens2f0npf0sf88
+ pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+ function:
+ hw_addr 00:00:00:00:88:88 state active opstate detached
+
+Upon function activation, the PF driver instance gets the event from the device
+that a particular SF was activated. This is the cue to put the device on the
+bus, probe it, and instantiate the devlink instance and class-specific
+auxiliary devices for it.
+
+- Show the auxiliary device and port of the subfunction::
+
+ $ devlink dev show
+ devlink dev show auxiliary/mlx5_core.sf.4
+
+ $ devlink port show auxiliary/mlx5_core.sf.4/1
+ auxiliary/mlx5_core.sf.4/1: type eth netdev p0sf88 flavour virtual port 0 splittable false
+
+ $ rdma link show mlx5_0/1
+ link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev p0sf88
+
+ $ rdma dev show
+ 8: rocep6s0f1: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d113 sys_image_guid 248a:0703:00b3:d112
+ 13: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112
+
+- Subfunction auxiliary device and class device hierarchy::
+
+                 mlx5_core.sf.4
+         (subfunction auxiliary device)
+                       /\
+                      /  \
+                     /    \
+                    /      \
+                   /        \
+   mlx5_core.eth.4            mlx5_core.rdma.4
+  (sf eth aux dev)            (sf rdma aux dev)
+          |                           |
+          |                           |
+       p0sf88                      mlx5_0
+     (sf netdev)              (sf rdma device)
+
+Additionally, the SF port also gets the event when the driver attaches to the
+auxiliary device of the subfunction. This results in changing the operational
+state of the function. This gives the user the visibility needed to decide when
+it is safe to delete the SF port for graceful termination of the subfunction.
+
+- Show the SF port operational state::
+
+ $ devlink port show ens2f0npf0sf88
+ pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+ function:
+ hw_addr 00:00:00:00:88:88 state active opstate attached
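
A script that needs to act on the function state can extract the ``state`` and
``opstate`` values from the text output shown above. The following is a
minimal sketch that assumes the output keeps the ``keyword value`` layout of
the examples in this section. ``sf_port_field`` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read `devlink port show` text on stdin and print the
# word that follows the given keyword (for example "state" or "opstate").
# The match is on whole words, so "state" does not match "opstate".
sf_port_field() {
    awk -v key="$1" '{
        for (i = 1; i < NF; i++)
            if ($i == key) { print $(i + 1); exit }
    }'
}

# Example:
#   devlink port show ens2f0npf0sf88 | sf_port_field opstate
```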
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/tracepoints.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/tracepoints.rst
new file mode 100644
index 000000000000..da8e53cebb6c
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/tracepoints.rst
@@ -0,0 +1,229 @@
+.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+.. include:: <isonum.txt>
+
+===========
+Tracepoints
+===========
+
+:Copyright: |copy| 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+
+The mlx5 driver provides internal tracepoints for tracking and debugging using
+the kernel tracepoint interfaces (refer to Documentation/trace/ftrace.rst).
+
+For the list of supported mlx5 events, check /sys/kernel/tracing/events/mlx5/.
+
+tc and eswitch offloads tracepoints:
+
+- mlx5e_configure_flower: trace flower filter actions and cookies offloaded to mlx5::
+
+ $ echo mlx5:mlx5e_configure_flower >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ tc-6535 [019] ...1 2672.404466: mlx5e_configure_flower: cookie=0000000067874a55 actions= REDIRECT
+
+- mlx5e_delete_flower: trace flower filter actions and cookies deleted from mlx5::
+
+ $ echo mlx5:mlx5e_delete_flower >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ tc-6569 [010] .N.1 2686.379075: mlx5e_delete_flower: cookie=0000000067874a55 actions= NULL
+
+- mlx5e_stats_flower: trace flower stats request::
+
+ $ echo mlx5:mlx5e_stats_flower >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ tc-6546 [010] ...1 2679.704889: mlx5e_stats_flower: cookie=0000000060eb3d6a bytes=0 packets=0 lastused=4295560217
+
+- mlx5e_tc_update_neigh_used_value: trace tunnel rule neigh update value offloaded to mlx5::
+
+ $ echo mlx5:mlx5e_tc_update_neigh_used_value >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u48:4-8806 [009] ...1 55117.882428: mlx5e_tc_update_neigh_used_value: netdev: ens1f0 IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_used=1
+
+- mlx5e_rep_neigh_update: trace neigh update tasks scheduled due to neigh state change events::
+
+ $ echo mlx5:mlx5e_rep_neigh_update >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u48:7-2221 [009] ...1 1475.387435: mlx5e_rep_neigh_update: netdev: ens1f0 MAC: 24:8a:07:9a:17:9a IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_connected=1
+
+Bridge offloads tracepoints:
+
+- mlx5_esw_bridge_fdb_entry_init: trace bridge FDB entry offloaded to mlx5::
+
+ $ echo mlx5:mlx5_esw_bridge_fdb_entry_init >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u20:9-2217 [003] ...1 318.582243: mlx5_esw_bridge_fdb_entry_init: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=0 flags=0 used=0
+
+- mlx5_esw_bridge_fdb_entry_cleanup: trace bridge FDB entry deleted from mlx5::
+
+ $ echo mlx5:mlx5_esw_bridge_fdb_entry_cleanup >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ ip-2581 [005] ...1 318.629871: mlx5_esw_bridge_fdb_entry_cleanup: net_device=enp8s0f0_1 addr=e4:fd:05:08:00:03 vid=0 flags=0 used=16
+
+- mlx5_esw_bridge_fdb_entry_refresh: trace bridge FDB entry offload refreshed in
+ mlx5::
+
+ $ echo mlx5:mlx5_esw_bridge_fdb_entry_refresh >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u20:8-3849 [003] ...1 466716: mlx5_esw_bridge_fdb_entry_refresh: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=3 flags=0 used=0
+
+- mlx5_esw_bridge_vlan_create: trace bridge VLAN object add on mlx5
+ representor::
+
+ $ echo mlx5:mlx5_esw_bridge_vlan_create >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ ip-2560 [007] ...1 318.460258: mlx5_esw_bridge_vlan_create: vid=1 flags=6
+
+- mlx5_esw_bridge_vlan_cleanup: trace bridge VLAN object delete from mlx5
+ representor::
+
+ $ echo mlx5:mlx5_esw_bridge_vlan_cleanup >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ bridge-2582 [007] ...1 318.653496: mlx5_esw_bridge_vlan_cleanup: vid=2 flags=8
+
+- mlx5_esw_bridge_vport_init: trace mlx5 vport assigned with bridge upper
+ device::
+
+ $ echo mlx5:mlx5_esw_bridge_vport_init >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ ip-2560 [007] ...1 318.458915: mlx5_esw_bridge_vport_init: vport_num=1
+
+- mlx5_esw_bridge_vport_cleanup: trace mlx5 vport removed from bridge upper
+ device::
+
+ $ echo mlx5:mlx5_esw_bridge_vport_cleanup >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ ip-5387 [000] ...1 573713: mlx5_esw_bridge_vport_cleanup: vport_num=1
+
+Eswitch QoS tracepoints:
+
+- mlx5_esw_vport_qos_create: trace creation of transmit scheduler arbiter for vport::
+
+ $ echo mlx5:mlx5_esw_vport_qos_create >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ <...>-23496 [018] .... 73136.838831: mlx5_esw_vport_qos_create: (0000:82:00.0) vport=2 tsar_ix=4 bw_share=0, max_rate=0 group=000000007b576bb3
+
+- mlx5_esw_vport_qos_config: trace configuration of transmit scheduler arbiter for vport::
+
+ $ echo mlx5:mlx5_esw_vport_qos_config >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ <...>-26548 [023] .... 75754.223823: mlx5_esw_vport_qos_config: (0000:82:00.0) vport=1 tsar_ix=3 bw_share=34, max_rate=10000 group=000000007b576bb3
+
+- mlx5_esw_vport_qos_destroy: trace deletion of transmit scheduler arbiter for vport::
+
+ $ echo mlx5:mlx5_esw_vport_qos_destroy >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ <...>-27418 [004] .... 76546.680901: mlx5_esw_vport_qos_destroy: (0000:82:00.0) vport=1 tsar_ix=3
+
+- mlx5_esw_group_qos_create: trace creation of transmit scheduler arbiter for rate group::
+
+ $ echo mlx5:mlx5_esw_group_qos_create >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ <...>-26578 [008] .... 75776.022112: mlx5_esw_group_qos_create: (0000:82:00.0) group=000000008dac63ea tsar_ix=5
+
+- mlx5_esw_group_qos_config: trace configuration of transmit scheduler arbiter for rate group::
+
+ $ echo mlx5:mlx5_esw_group_qos_config >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ <...>-27303 [020] .... 76461.455356: mlx5_esw_group_qos_config: (0000:82:00.0) group=000000008dac63ea tsar_ix=5 bw_share=100 max_rate=20000
+
+- mlx5_esw_group_qos_destroy: trace deletion of transmit scheduler arbiter for group::
+
+ $ echo mlx5:mlx5_esw_group_qos_destroy >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ <...>-27418 [006] .... 76547.187258: mlx5_esw_group_qos_destroy: (0000:82:00.0) group=000000007b576bb3 tsar_ix=1
+
+SF tracepoints:
+
+- mlx5_sf_add: trace addition of the SF port::
+
+ $ echo mlx5:mlx5_sf_add >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ devlink-9363 [031] ..... 24610.188722: mlx5_sf_add: (0000:06:00.0) port_index=32768 controller=0 hw_id=0x8000 sfnum=88
+
+- mlx5_sf_free: trace freeing of the SF port::
+
+ $ echo mlx5:mlx5_sf_free >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ devlink-9830 [038] ..... 26300.404749: mlx5_sf_free: (0000:06:00.0) port_index=32768 controller=0 hw_id=0x8000
+
+- mlx5_sf_activate: trace activation of the SF port::
+
+ $ echo mlx5:mlx5_sf_activate >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ devlink-29841 [008] ..... 3669.635095: mlx5_sf_activate: (0000:08:00.0) port_index=32768 controller=0 hw_id=0x8000
+
+- mlx5_sf_deactivate: trace deactivation of the SF port::
+
+ $ echo mlx5:mlx5_sf_deactivate >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ devlink-29994 [008] ..... 4015.969467: mlx5_sf_deactivate: (0000:08:00.0) port_index=32768 controller=0 hw_id=0x8000
+
+- mlx5_sf_hwc_alloc: trace allocating of the hardware SF context::
+
+ $ echo mlx5:mlx5_sf_hwc_alloc >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ devlink-9775 [031] ..... 26296.385259: mlx5_sf_hwc_alloc: (0000:06:00.0) controller=0 hw_id=0x8000 sfnum=88
+
+- mlx5_sf_hwc_free: trace freeing of the hardware SF context::
+
+ $ echo mlx5:mlx5_sf_hwc_free >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u128:3-9093 [046] ..... 24625.365771: mlx5_sf_hwc_free: (0000:06:00.0) hw_id=0x8000
+
+- mlx5_sf_hwc_deferred_free: trace deferred freeing of the hardware SF context::
+
+ $ echo mlx5:mlx5_sf_hwc_deferred_free >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ devlink-9519 [046] ..... 24624.400271: mlx5_sf_hwc_deferred_free: (0000:06:00.0) hw_id=0x8000
+
+- mlx5_sf_update_state: trace state updates for SF contexts::
+
+ $ echo mlx5:mlx5_sf_update_state >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u20:3-29490 [009] ..... 4141.453530: mlx5_sf_update_state: (0000:08:00.0) port_index=32768 controller=0 hw_id=0x8000 state=2
+
+- mlx5_sf_vhca_event: trace SF vhca event and state::
+
+ $ echo mlx5:mlx5_sf_vhca_event >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u128:3-9093 [046] ..... 24625.365525: mlx5_sf_vhca_event: (0000:06:00.0) hw_id=0x8000 sfnum=88 vhca_state=1
+
+- mlx5_sf_dev_add: trace SF device add event::
+
+ $ echo mlx5:mlx5_sf_dev_add >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u128:3-9093 [000] ..... 24616.524495: mlx5_sf_dev_add: (0000:06:00.0) sfdev=00000000fc5d96fd aux_id=4 hw_id=0x8000 sfnum=88
+
+- mlx5_sf_dev_del: trace SF device delete event::
+
+ $ echo mlx5:mlx5_sf_dev_del >> /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace
+ ...
+ kworker/u128:3-9093 [044] ..... 24624.400749: mlx5_sf_dev_del: (0000:06:00.0) sfdev=00000000fc5d96fd aux_id=4 hw_id=0x8000 sfnum=88
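
Every example in this document follows the same pattern: append
``mlx5:<event>`` to ``set_event``, then read ``trace``. That pattern can be
wrapped in two small helpers. This is a sketch only; ``mlx5_trace_enable`` and
``mlx5_trace_list`` are hypothetical names, and the ``TRACEFS`` variable is an
assumption added so that the tracefs mount point can be overridden.

```shell
#!/bin/sh
# Mount point of tracefs. /sys/kernel/tracing is the path used in this
# document; override TRACEFS if tracefs is mounted elsewhere.
TRACEFS="${TRACEFS:-/sys/kernel/tracing}"

# Hypothetical helper: enable one mlx5 event by appending it to set_event.
mlx5_trace_enable() {
    echo "mlx5:$1" >> "$TRACEFS/set_event"
}

# Hypothetical helper: list the entries under events/mlx5/.
mlx5_trace_list() {
    ls "$TRACEFS/events/mlx5/"
}

# Example (as root):
#   mlx5_trace_enable mlx5_sf_add
#   cat "$TRACEFS/trace"
```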
diff --git a/Documentation/networking/device_drivers/microsoft/netvsc.rst b/Documentation/networking/device_drivers/ethernet/microsoft/netvsc.rst
index c3f51c672a68..fc5acd427a5d 100644
--- a/Documentation/networking/device_drivers/microsoft/netvsc.rst
+++ b/Documentation/networking/device_drivers/ethernet/microsoft/netvsc.rst
@@ -87,11 +87,15 @@ Receive Buffer
contain one or more packets. The number of receive sections may be changed
via ethtool Rx ring parameters.
- There is a similar send buffer which is used to aggregate packets for sending.
- The send area is broken into chunks of 6144 bytes, each of section may
- contain one or more packets. The send buffer is an optimization, the driver
- will use slower method to handle very large packets or if the send buffer
- area is exhausted.
+ There is a similar send buffer which is used to aggregate packets
+ for sending. The send area is broken into chunks, typically of 6144
+ bytes, each of which may contain one or more packets. Small
+ packets are usually transmitted via copy to the send buffer. However,
+ if the buffer is temporarily exhausted, or the packet to be transmitted is
+ an LSO packet, the driver will provide the host with pointers to the data
+ from the SKB. This attempts to achieve a balance between the overhead of
+ data copy and the impact of remapping VM memory to be accessible by the
+ host.
XDP support
-----------
diff --git a/Documentation/networking/device_drivers/neterion/s2io.rst b/Documentation/networking/device_drivers/ethernet/neterion/s2io.rst
index c5673ec4559b..d731b5a98561 100644
--- a/Documentation/networking/device_drivers/neterion/s2io.rst
+++ b/Documentation/networking/device_drivers/ethernet/neterion/s2io.rst
@@ -64,8 +64,8 @@ c. Multi-buffer receive mode. Scattering of packet across multiple
IBM xSeries).
d. MSI/MSI-X. Can be enabled on platforms which support this feature
- (IA64, Xeon) resulting in noticeable performance improvement(up to 7%
- on certain platforms).
+ resulting in noticeable performance improvement (up to 7% on certain
+ platforms).
e. Statistics. Comprehensive MAC-level and software statistics displayed
using "ethtool -S" option.
diff --git a/Documentation/networking/device_drivers/netronome/nfp.rst b/Documentation/networking/device_drivers/ethernet/netronome/nfp.rst
index ada611fb427c..650b57742d51 100644
--- a/Documentation/networking/device_drivers/netronome/nfp.rst
+++ b/Documentation/networking/device_drivers/ethernet/netronome/nfp.rst
@@ -1,50 +1,57 @@
.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+.. include:: <isonum.txt>
-=============================================
-Netronome Flow Processor (NFP) Kernel Drivers
-=============================================
+===========================================
+Network Flow Processor (NFP) Kernel Drivers
+===========================================
-Copyright (c) 2019, Netronome Systems, Inc.
+:Copyright: |copy| 2019, Netronome Systems, Inc.
+:Copyright: |copy| 2022, Corigine, Inc.
Contents
========
- `Overview`_
- `Acquiring Firmware`_
+- `Devlink Info`_
+- `Configure Device`_
+- `Statistics`_
Overview
========
-This driver supports Netronome's line of Flow Processor devices,
-including the NFP4000, NFP5000, and NFP6000 models, which are also
-incorporated in the company's family of Agilio SmartNICs. The SR-IOV
-physical and virtual functions for these devices are supported by
-the driver.
+This driver supports Netronome and Corigine's line of Network Flow Processor
+devices, including the NFP3800, NFP4000, NFP5000, and NFP6000 models, which
+are also incorporated in the companies' family of Agilio SmartNICs. The SR-IOV
+physical and virtual functions for these devices are supported by the driver.
Acquiring Firmware
==================
-The NFP4000 and NFP6000 devices require application specific firmware
-to function. Application firmware can be located either on the host file system
+The NFP3800, NFP4000 and NFP6000 devices require application specific firmware
+to function. Application firmware can be located either on the host file system
or in the device flash (if supported by management firmware).
Firmware files on the host filesystem contain card type (`AMDA-*` string), media
-config etc. They should be placed in `/lib/firmware/netronome` directory to
+config etc. They should be placed in `/lib/firmware/netronome` directory to
load firmware from the host file system.
Firmware for basic NIC operation is available in the upstream
`linux-firmware.git` repository.
+A more comprehensive list of firmware can be downloaded from the
+`Corigine Support site <https://www.corigine.com/DPUDownload.html>`_.
+
Firmware in NVRAM
-----------------
Recent versions of management firmware supports loading application
-firmware from flash when the host driver gets probed. The firmware loading
+firmware from flash when the host driver gets probed. The firmware loading
policy configuration may be used to configure this feature appropriately.
Devlink or ethtool can be used to update the application firmware on the device
flash by providing the appropriate `nic_AMDA*.nffw` file to the respective
-command. Users need to take care to write the correct firmware image for the
+command. Users need to take care to write the correct firmware image for the
card and media configuration to flash.
Available storage space in flash depends on the card being used.
@@ -79,9 +86,9 @@ You may need to use hard instead of symbolic links on distributions
which use old `mkinitrd` command instead of `dracut` (e.g. Ubuntu).
After changing firmware files you may need to regenerate the initramfs
-image. Initramfs contains drivers and firmware files your system may
-need to boot. Refer to the documentation of your distribution to find
-out how to update initramfs. Good indication of stale initramfs
+image. Initramfs contains drivers and firmware files your system may
+need to boot. Refer to the documentation of your distribution to find
+out how to update initramfs. Good indication of stale initramfs
is system loading wrong driver or firmware on boot, but when driver is
later reloaded manually everything works correctly.
@@ -89,9 +96,9 @@ Selecting firmware per device
-----------------------------
Most commonly all cards on the system use the same type of firmware.
-If you want to load specific firmware image for a specific card, you
-can use either the PCI bus address or serial number. Driver will print
-which files it's looking for when it recognizes a NFP device::
+If you want to load a specific firmware image for a specific card, you
+can use either the PCI bus address or serial number. The driver will
+print which files it's looking for when it recognizes a NFP device::
nfp: Looking for firmware file in order of priority:
nfp: netronome/serial-00-12-34-aa-bb-cc-10-ff.nffw: not found
@@ -106,6 +113,15 @@ Note that `serial-*` and `pci-*` files are **not** automatically included
in initramfs, you will have to refer to documentation of appropriate tools
to find out how to include them.
+Running firmware version
+------------------------
+
+The version of the loaded firmware for a particular <netdev> interface
+(e.g. enp4s0), or an interface's port <netdev port> (e.g. enp4s0np0), can
+be displayed with the ethtool command::
+
+ $ ethtool -i <netdev>
+
Firmware loading policy
-----------------------
@@ -132,6 +148,115 @@ abi_drv_load_ifc
Defines a list of PF devices allowed to load FW on the device.
This variable is not currently user configurable.
+Devlink Info
+============
+
+The devlink info command displays the running and stored firmware versions
+on the device, serial number and board information.
+
+Devlink info command example (replace PCI address)::
+
+ $ devlink dev info pci/0000:03:00.0
+ pci/0000:03:00.0:
+ driver nfp
+ serial_number CSAAMDA2001-1003000111
+ versions:
+ fixed:
+ board.id AMDA2001-1003
+ board.rev 01
+ board.manufacture CSA
+ board.model mozart
+ running:
+ fw.mgmt 22.10.0-rc3
+ fw.cpld 0x1000003
+ fw.app nic-22.09.0
+ chip.init AMDA-2001-1003 1003000111
+ stored:
+ fw.bundle_id bspbundle_1003000111
+ fw.mgmt 22.10.0-rc3
+ fw.cpld 0x0
+ chip.init AMDA-2001-1003 1003000111
+
+Configure Device
+================
+
+This section explains how to use Agilio SmartNICs running basic NIC firmware.
+
+Configure interface link-speed
+------------------------------
+The following steps explain how to change between 10G mode and 25G mode on
+Agilio CX 2x25GbE cards. The port speed must be changed in order:
+port 0 (p0) must be set to 10G before port 1 (p1) may be set to 10G.
+
+Bring down the respective interface(s)::
+
+ $ ip link set dev <netdev port 0> down
+ $ ip link set dev <netdev port 1> down
+
+Set interface link-speed to 10G::
+
+ $ ethtool -s <netdev port 0> speed 10000
+ $ ethtool -s <netdev port 1> speed 10000
+
+Set interface link-speed to 25G::
+
+ $ ethtool -s <netdev port 0> speed 25000
+ $ ethtool -s <netdev port 1> speed 25000
+
+Reload driver for changes to take effect::
+
+ $ rmmod nfp; modprobe nfp
+
+Configure interface Maximum Transmission Unit (MTU)
+---------------------------------------------------
+
+The MTU of interfaces can be set temporarily using the iproute2 ``ip link``
+command or ifconfig. Note that this change will not persist. Setting the MTU
+via Network Manager, or another appropriate OS configuration tool, is
+recommended, as changes made that way can be made to persist.
+
+Set interface MTU to 9000 bytes::
+
+ $ ip link set dev <netdev port> mtu 9000
+
+It is the responsibility of the user or the orchestration layer to set
+appropriate MTU values when handling jumbo frames or utilizing tunnels. For
+example, if packets sent from a VM are to be encapsulated on the card and
+egress a physical port, then the MTU of the VF should be set to lower than
+that of the physical port to account for the extra bytes added by the
+additional header. If a setup is expected to see fallback traffic between
+the SmartNIC and the kernel then the user should also ensure that the PF MTU
+is appropriately set to avoid unexpected drops on this path.
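
As a worked example of the arithmetic, the VF MTU is the physical port MTU
minus the encapsulation overhead. The sketch below assumes VXLAN over IPv4,
which adds 50 bytes (outer IPv4 20 + UDP 8 + VXLAN 8 + inner Ethernet header
14). Other tunnel types add a different number of bytes. ``vf_mtu`` is a
hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: largest VF MTU that still fits within the physical
# port MTU once the encapsulation headers have been added.
# Arguments: physical port MTU, encapsulation overhead in bytes.
vf_mtu() {
    phys_mtu="$1"
    overhead="$2"
    echo $((phys_mtu - overhead))
}

# VXLAN over IPv4 on a 9000-byte physical port: prints 8950.
vf_mtu 9000 50
```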
+
+Configure Forward Error Correction (FEC) modes
+----------------------------------------------
+
+Agilio SmartNICs support FEC mode configuration, e.g. Auto, Firecode Base-R,
+Reed-Solomon and Off modes. Each physical port's FEC mode can be set
+independently using ethtool. The supported FEC modes for an interface can
+be viewed using::
+
+ $ ethtool <netdev>
+
+The currently configured FEC mode can be viewed using::
+
+ $ ethtool --show-fec <netdev>
+
+To force the FEC mode for a particular port, auto-negotiation must be disabled
+(see the `Auto-negotiation`_ section). An example of how to set the FEC mode
+to Reed-Solomon is::
+
+ $ ethtool --set-fec <netdev> encoding rs
+
+Auto-negotiation
+----------------
+
+To change auto-negotiation settings, the link must first be brought down. After
+the link is down, auto-negotiation can be enabled or disabled using::
+
+ ethtool -s <netdev> autoneg <on|off>
+
Statistics
==========
diff --git a/Documentation/networking/device_drivers/pensando/ionic.rst b/Documentation/networking/device_drivers/ethernet/pensando/ionic.rst
index 0eabbc347d6c..05fe2b11bb18 100644
--- a/Documentation/networking/device_drivers/pensando/ionic.rst
+++ b/Documentation/networking/device_drivers/ethernet/pensando/ionic.rst
@@ -83,7 +83,7 @@ Configuring the Driver
MTU
---
-Jumbo frame support is available with a maximim size of 9194 bytes.
+Jumbo frame support is available with a maximum size of 9194 bytes.
Interrupt coalescing
--------------------
@@ -99,6 +99,12 @@ Minimal SR-IOV support is currently offered and can be enabled by setting
the sysfs 'sriov_numvfs' value, if supported by your particular firmware
configuration.
+XDP
+---
+
+Support for XDP includes the basics, plus Jumbo frames, Redirect and
+ndo_xdp_xmit. There is currently no support for zero-copy sockets or HW offload.
+
Statistics
==========
@@ -138,6 +144,12 @@ Driver port specific::
rx_csum_none: 0
rx_csum_complete: 3
rx_csum_error: 0
+ xdp_drop: 0
+ xdp_aborted: 0
+ xdp_pass: 0
+ xdp_tx: 0
+ xdp_redirect: 0
+ xdp_frames: 0
Driver queue specific::
@@ -149,9 +161,12 @@ Driver queue specific::
tx_0_frags: 0
tx_0_tso: 0
tx_0_tso_bytes: 0
+ tx_0_hwstamp_valid: 0
+ tx_0_hwstamp_invalid: 0
tx_0_csum_none: 3
tx_0_csum: 0
tx_0_vlan_inserted: 0
+ tx_0_xdp_frames: 0
rx_0_pkts: 2
rx_0_bytes: 120
rx_0_dma_map_err: 0
@@ -159,8 +174,15 @@ Driver queue specific::
rx_0_csum_none: 0
rx_0_csum_complete: 0
rx_0_csum_error: 0
+ rx_0_hwstamp_valid: 0
+ rx_0_hwstamp_invalid: 0
rx_0_dropped: 0
rx_0_vlan_stripped: 0
+ rx_0_xdp_drop: 0
+ rx_0_xdp_aborted: 0
+ rx_0_xdp_pass: 0
+ rx_0_xdp_tx: 0
+ rx_0_xdp_redirect: 0
Firmware port specific::
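The per-queue XDP counters added in this hunk can be totalled across queues with a few lines of parsing. The sketch below is only an illustration: it assumes the counters arrive as "name: value" text in the form shown in the listing (for example as printed by ethtool -S), the sample values are invented, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Invented sample in the "name: value" form shown in the listing above.
SAMPLE = """\
     rx_0_xdp_drop: 2
     rx_0_xdp_pass: 5
     rx_1_xdp_drop: 1
     rx_1_xdp_pass: 0
     tx_0_xdp_frames: 4
     rx_0_pkts: 9
"""

# Matches per-queue XDP counters such as rx_0_xdp_drop or tx_3_xdp_frames.
QUEUE_XDP = re.compile(r"^\s*(rx|tx)_(\d+)_(xdp_\w+):\s*(\d+)\s*$")


def sum_queue_xdp(text: str) -> dict:
    """Sum per-queue XDP counters across queues, keyed by '<dir>_<counter>'."""
    totals = defaultdict(int)
    for line in text.splitlines():
        match = QUEUE_XDP.match(line)
        if match:
            direction, _queue, counter, value = match.groups()
            totals[f"{direction}_{counter}"] += int(value)
    return dict(totals)


if __name__ == "__main__":
    for name, value in sorted(sum_queue_xdp(SAMPLE).items()):
        print(f"{name}: {value}")
```

Lines that are not per-queue XDP counters, such as rx_0_pkts or the port-level xdp_drop, are skipped by the pattern.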
diff --git a/Documentation/networking/device_drivers/smsc/smc9.rst b/Documentation/networking/device_drivers/ethernet/smsc/smc9.rst
index e5eac896a631..e5eac896a631 100644
--- a/Documentation/networking/device_drivers/smsc/smc9.rst
+++ b/Documentation/networking/device_drivers/ethernet/smsc/smc9.rst
diff --git a/Documentation/networking/device_drivers/stmicro/stmmac.rst b/Documentation/networking/device_drivers/ethernet/stmicro/stmmac.rst
index 5d46e5036129..5d46e5036129 100644
--- a/Documentation/networking/device_drivers/stmicro/stmmac.rst
+++ b/Documentation/networking/device_drivers/ethernet/stmicro/stmmac.rst
diff --git a/Documentation/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst b/Documentation/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst
new file mode 100644
index 000000000000..25fd9aa284e2
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst
@@ -0,0 +1,143 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================================================
+Texas Instruments K3 AM65 CPSW NUSS switchdev based ethernet driver
+===================================================================
+
+:Version: 1.0
+
+Port renaming
+=============
+
+In order to rename via udev::
+
+ ip -d link show dev sw0p1 | grep switchid
+
+ SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}==<switchid>, \
+ ATTR{phys_port_name}!="", NAME="sw0$attr{phys_port_name}"
+
+
+Multi mac mode
+==============
+
+- The driver operates in multi-mac mode by default, and so
+ works as N individual network interfaces.
+
+Devlink configuration parameters
+================================
+
+See Documentation/networking/devlink/am65-nuss-cpsw-switch.rst
+
+Enabling "switch"
+=================
+
+Switch mode can be enabled by setting the devlink driver parameter
+"switch_mode" to 1/true::
+
+ devlink dev param set platform/c000000.ethernet \
+ name switch_mode value true cmode runtime
+
+This can be done regardless of the state (UP/DOWN) of the ports' netdev devices,
+but the ports' netdev devices have to be UP before joining the bridge. This
+avoids overwriting the bridge configuration, because the CPSW switch driver
+completely reloads its configuration when the first port changes state to UP.
+
+When both interfaces have joined the bridge, the CPSW switch driver enables
+marking packets with the offload_fwd_mark flag.
+
+All configuration is implemented via switchdev API.
+
+Bridge setup
+============
+
+::
+
+ devlink dev param set platform/c000000.ethernet \
+ name switch_mode value true cmode runtime
+
+ ip link add name br0 type bridge
+ ip link set dev br0 type bridge ageing_time 1000
+ ip link set dev sw0p1 up
+ ip link set dev sw0p2 up
+ ip link set dev sw0p1 master br0
+ ip link set dev sw0p2 master br0
+
+ [*] bridge vlan add dev br0 vid 1 pvid untagged self
+
+ [*] if vlan_filtering=1. where default_pvid=1
+
+ Note. Steps [*] are mandatory.
+
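The ordering constraint described above (ports UP before they join the bridge) can be made explicit with a small dry-run helper that only builds the command strings from the "Bridge setup" listing. This is a sketch, not part of the driver documentation: the function name and parameters are hypothetical, the ageing_time step is left out for brevity, and nothing is executed.

```python
def bridge_setup_commands(ports, bridge="br0",
                          devlink_dev="platform/c000000.ethernet",
                          vlan_filtering=False):
    """Return the setup commands in the order used in the "Bridge setup" listing."""
    cmds = [
        f"devlink dev param set {devlink_dev} "
        "name switch_mode value true cmode runtime",
        f"ip link add name {bridge} type bridge",
    ]
    # Ports are brought UP before they are enslaved, as the text requires.
    cmds += [f"ip link set dev {port} up" for port in ports]
    cmds += [f"ip link set dev {port} master {bridge}" for port in ports]
    if vlan_filtering:
        # Step marked [*] in the listing: needed when vlan_filtering=1.
        cmds.append(f"bridge vlan add dev {bridge} vid 1 pvid untagged self")
    return cmds


if __name__ == "__main__":
    for cmd in bridge_setup_commands(["sw0p1", "sw0p2"], vlan_filtering=True):
        print(cmd)
```

Printing the commands rather than running them keeps the helper safe to try on any machine; the output can be reviewed and then pasted into a shell on the target board.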
+
+On/off STP
+==========
+
+::
+
+ ip link set dev BRDEV type bridge stp_state 1/0
+
+VLAN configuration
+==================
+
+::
+
+ bridge vlan add dev br0 vid 1 pvid untagged self <---- add cpu port to VLAN 1
+
+Note. This step is mandatory for bridge/default_pvid.
+
+Add extra VLANs
+===============
+
+ 1. untagged::
+
+ bridge vlan add dev sw0p1 vid 100 pvid untagged master
+ bridge vlan add dev sw0p2 vid 100 pvid untagged master
+ bridge vlan add dev br0 vid 100 pvid untagged self <---- Add cpu port to VLAN100
+
+ 2. tagged::
+
+ bridge vlan add dev sw0p1 vid 100 master
+ bridge vlan add dev sw0p2 vid 100 master
+ bridge vlan add dev br0 vid 100 pvid tagged self <---- Add cpu port to VLAN100
+
+FDBs
+----
+
+FDBs are automatically added on the appropriate switch port upon detection.
+
+Manually adding FDBs::
+
+ bridge fdb add aa:bb:cc:dd:ee:ff dev sw0p1 master vlan 100
+ bridge fdb add aa:bb:cc:dd:ee:fe dev sw0p2 master <---- Add on all VLANs
+
+MDBs
+----
+
+MDBs are automatically added on the appropriate switch port upon detection.
+
+Manually adding MDBs::
+
+ bridge mdb add dev br0 port sw0p1 grp 239.1.1.1 permanent vid 100
+ bridge mdb add dev br0 port sw0p1 grp 239.1.1.1 permanent <---- Add on all VLANs
+
+Multicast flooding
+==================
+CPU port mcast_flooding is always on. Flooding on switch ports can be
+turned on/off using::
+
+ bridge link set dev sw0p1 mcast_flood on/off
+
+Access and Trunk port
+=====================
+
+::
+
+ bridge vlan add dev sw0p1 vid 100 pvid untagged master
+ bridge vlan add dev sw0p2 vid 100 master
+
+
+ bridge vlan add dev br0 vid 100 self
+ ip link add link br0 name br0.100 type vlan id 100
+
+Note: Setting the PVID on the bridge device itself works only for the
+default VLAN (default_pvid).
diff --git a/Documentation/networking/device_drivers/ti/cpsw.rst b/Documentation/networking/device_drivers/ethernet/ti/cpsw.rst
index a88946bd188b..a88946bd188b 100644
--- a/Documentation/networking/device_drivers/ti/cpsw.rst
+++ b/Documentation/networking/device_drivers/ethernet/ti/cpsw.rst
diff --git a/Documentation/networking/device_drivers/ti/cpsw_switchdev.rst b/Documentation/networking/device_drivers/ethernet/ti/cpsw_switchdev.rst
index 1241ecac73bd..464dce938ed1 100644
--- a/Documentation/networking/device_drivers/ti/cpsw_switchdev.rst
+++ b/Documentation/networking/device_drivers/ethernet/ti/cpsw_switchdev.rst
@@ -174,7 +174,7 @@ Multicast flooding
==================
CPU port mcast_flooding is always on
-Turning flooding on/off on swithch ports:
+Turning flooding on/off on switch ports:
bridge link set dev sw0p1 mcast_flood on/off
Access and Trunk port
diff --git a/Documentation/networking/device_drivers/ti/tlan.rst b/Documentation/networking/device_drivers/ethernet/ti/tlan.rst
index 4fdc0907f4fc..4fdc0907f4fc 100644
--- a/Documentation/networking/device_drivers/ti/tlan.rst
+++ b/Documentation/networking/device_drivers/ethernet/ti/tlan.rst
diff --git a/Documentation/networking/device_drivers/toshiba/spider_net.rst b/Documentation/networking/device_drivers/ethernet/toshiba/spider_net.rst
index fe5b32be15cd..fe5b32be15cd 100644
--- a/Documentation/networking/device_drivers/toshiba/spider_net.rst
+++ b/Documentation/networking/device_drivers/ethernet/toshiba/spider_net.rst
diff --git a/Documentation/networking/device_drivers/ethernet/wangxun/ngbe.rst b/Documentation/networking/device_drivers/ethernet/wangxun/ngbe.rst
new file mode 100644
index 000000000000..43a02f9943e1
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/wangxun/ngbe.rst
@@ -0,0 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================================================
+Linux Base Driver for WangXun(R) Gigabit PCI Express Adapters
+=============================================================
+
+WangXun Gigabit Linux driver.
+Copyright (c) 2019 - 2022 Beijing WangXun Technology Co., Ltd.
+
+Support
+=======
+ If you have problems with the software or hardware, please contact our
+ customer support team via email at nic-support@net-swift.com or check our website
+ at https://www.net-swift.com
diff --git a/Documentation/networking/device_drivers/ethernet/wangxun/txgbe.rst b/Documentation/networking/device_drivers/ethernet/wangxun/txgbe.rst
new file mode 100644
index 000000000000..d052ef40fe36
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/wangxun/txgbe.rst
@@ -0,0 +1,20 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================================================
+Linux Base Driver for WangXun(R) 10 Gigabit PCI Express Adapters
+================================================================
+
+WangXun 10 Gigabit Linux driver.
+Copyright (c) 2015 - 2022 Beijing WangXun Technology Co., Ltd.
+
+
+Contents
+========
+
+- Support
+
+
+Support
+=======
+If you have any problems, contact the WangXun support team via
+nic-support@net-swift.com and Cc: netdev.
diff --git a/Documentation/networking/defza.rst b/Documentation/networking/device_drivers/fddi/defza.rst
index 73c2f793ea26..7393f33ea705 100644
--- a/Documentation/networking/defza.rst
+++ b/Documentation/networking/device_drivers/fddi/defza.rst
@@ -60,4 +60,4 @@ To do:
Both success and failure reports are welcome.
-Maciej W. Rozycki <macro@linux-mips.org>
+Maciej W. Rozycki <macro@orcam.me.uk>
diff --git a/Documentation/networking/device_drivers/fddi/index.rst b/Documentation/networking/device_drivers/fddi/index.rst
new file mode 100644
index 000000000000..0b75294e6c8b
--- /dev/null
+++ b/Documentation/networking/device_drivers/fddi/index.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Fiber Distributed Data Interface (FDDI) Device Drivers
+======================================================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ defza
+ skfp
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/skfp.rst b/Documentation/networking/device_drivers/fddi/skfp.rst
index 58f548105c1d..58f548105c1d 100644
--- a/Documentation/networking/skfp.rst
+++ b/Documentation/networking/device_drivers/fddi/skfp.rst
diff --git a/Documentation/networking/baycom.rst b/Documentation/networking/device_drivers/hamradio/baycom.rst
index fe2d010f0e86..fe2d010f0e86 100644
--- a/Documentation/networking/baycom.rst
+++ b/Documentation/networking/device_drivers/hamradio/baycom.rst
diff --git a/Documentation/networking/device_drivers/hamradio/index.rst b/Documentation/networking/device_drivers/hamradio/index.rst
new file mode 100644
index 000000000000..7e731732057b
--- /dev/null
+++ b/Documentation/networking/device_drivers/hamradio/index.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Amateur Radio Device Drivers
+============================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ baycom
+ z8530drv
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/z8530drv.rst b/Documentation/networking/device_drivers/hamradio/z8530drv.rst
index d2942760f167..d2942760f167 100644
--- a/Documentation/networking/z8530drv.rst
+++ b/Documentation/networking/device_drivers/hamradio/z8530drv.rst
diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst
index e18dad11bc72..0dd30a84ce25 100644
--- a/Documentation/networking/device_drivers/index.rst
+++ b/Documentation/networking/device_drivers/index.rst
@@ -1,56 +1,22 @@
.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
-Vendor Device Drivers
-=====================
+Hardware Device Drivers
+=======================
Contents:
.. toctree::
:maxdepth: 2
- freescale/dpaa2/index
- intel/e100
- intel/e1000
- intel/e1000e
- intel/fm10k
- intel/igb
- intel/igbvf
- intel/ixgb
- intel/ixgbe
- intel/ixgbevf
- intel/i40e
- intel/iavf
- intel/ice
- google/gve
- marvell/octeontx2
- mellanox/mlx5
- netronome/nfp
- pensando/ionic
- stmicro/stmmac
- 3com/3c509
- 3com/vortex
- amazon/ena
- aquantia/atlantic
- chelsio/cxgb
- cirrus/cs89x0
- davicom/dm9000
- dec/de4x5
- dec/dmfe
- dlink/dl2k
- freescale/dpaa
- freescale/gianfar
- intel/ipw2100
- intel/ipw2200
- microsoft/netvsc
- neterion/s2io
- neterion/vxge
- qualcomm/rmnet
- sb1000
- smsc/smc9
- ti/cpsw_switchdev
- ti/cpsw
- ti/tlan
- toshiba/spider_net
+ atm/index
+ cable/index
+ can/index
+ cellular/index
+ ethernet/index
+ fddi/index
+ hamradio/index
+ wifi/index
+ wwan/index
.. only:: subproject and html
diff --git a/Documentation/networking/device_drivers/intel/ice.rst b/Documentation/networking/device_drivers/intel/ice.rst
deleted file mode 100644
index ee43ea57d443..000000000000
--- a/Documentation/networking/device_drivers/intel/ice.rst
+++ /dev/null
@@ -1,46 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0+
-
-==================================================================
-Linux Base Driver for the Intel(R) Ethernet Connection E800 Series
-==================================================================
-
-Intel ice Linux driver.
-Copyright(c) 2018 Intel Corporation.
-
-Contents
-========
-
-- Enabling the driver
-- Support
-
-The driver in this release supports Intel's E800 Series of products. For
-more information, visit Intel's support page at https://support.intel.com.
-
-Enabling the driver
-===================
-The driver is enabled via the standard kernel configuration system,
-using the make command::
-
- make oldconfig/menuconfig/etc.
-
-The driver is located in the menu structure at:
-
- -> Device Drivers
- -> Network device support (NETDEVICES [=y])
- -> Ethernet driver support
- -> Intel devices
- -> Intel(R) Ethernet Connection E800 Series Support
-
-Support
-=======
-For general information, go to the Intel support website at:
-
-https://www.intel.com/support/
-
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
-If an issue is identified with the released source code on a supported kernel
-with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
diff --git a/Documentation/networking/device_drivers/intel/ixgb.rst b/Documentation/networking/device_drivers/intel/ixgb.rst
deleted file mode 100644
index ab624f1a44a8..000000000000
--- a/Documentation/networking/device_drivers/intel/ixgb.rst
+++ /dev/null
@@ -1,468 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0+
-
-=====================================================================
-Linux Base Driver for 10 Gigabit Intel(R) Ethernet Network Connection
-=====================================================================
-
-October 1, 2018
-
-
-Contents
-========
-
-- In This Release
-- Identifying Your Adapter
-- Command Line Parameters
-- Improving Performance
-- Additional Configurations
-- Known Issues/Troubleshooting
-- Support
-
-
-
-In This Release
-===============
-
-This file describes the ixgb Linux Base Driver for the 10 Gigabit Intel(R)
-Network Connection. This driver includes support for Itanium(R)2-based
-systems.
-
-For questions related to hardware requirements, refer to the documentation
-supplied with your 10 Gigabit adapter. All hardware requirements listed apply
-to use with Linux.
-
-The following features are available in this kernel:
- - Native VLANs
- - Channel Bonding (teaming)
- - SNMP
-
-Channel Bonding documentation can be found in the Linux kernel source:
-/Documentation/networking/bonding.rst
-
-The driver information previously displayed in the /proc filesystem is not
-supported in this release. Alternatively, you can use ethtool (version 1.6
-or later), lspci, and iproute2 to obtain the same information.
-
-Instructions on updating ethtool can be found in the section "Additional
-Configurations" later in this document.
-
-
-Identifying Your Adapter
-========================
-
-The following Intel network adapters are compatible with the drivers in this
-release:
-
-+------------+------------------------------+----------------------------------+
-| Controller | Adapter Name | Physical Layer |
-+============+==============================+==================================+
-| 82597EX | Intel(R) PRO/10GbE LR/SR/CX4 | - 10G Base-LR (fiber) |
-| | Server Adapters | - 10G Base-SR (fiber) |
-| | | - 10G Base-CX4 (copper) |
-+------------+------------------------------+----------------------------------+
-
-For more information on how to identify your adapter, go to the Adapter &
-Driver ID Guide at:
-
- https://support.intel.com
-
-
-Command Line Parameters
-=======================
-
-If the driver is built as a module, the following optional parameters are
-used by entering them on the command line with the modprobe command using
-this syntax::
-
- modprobe ixgb [<option>=<VAL1>,<VAL2>,...]
-
-For example, with two 10GbE PCI adapters, entering::
-
- modprobe ixgb TxDescriptors=80,128
-
-loads the ixgb driver with 80 TX resources for the first adapter and 128 TX
-resources for the second adapter.
-
-The default value for each parameter is generally the recommended setting,
-unless otherwise noted.
-
-Copybreak
----------
-:Valid Range: 0-XXXX
-:Default Value: 256
-
- This is the maximum size of packet that is copied to a new buffer on
- receive.
-
-Debug
------
-:Valid Range: 0-16 (0=none,...,16=all)
-:Default Value: 0
-
- This parameter adjusts the level of debug messages displayed in the
- system logs.
-
-FlowControl
------------
-:Valid Range: 0-3 (0=none, 1=Rx only, 2=Tx only, 3=Rx&Tx)
-:Default Value: 1 if no EEPROM, otherwise read from EEPROM
-
- This parameter controls the automatic generation(Tx) and response(Rx) to
- Ethernet PAUSE frames. There are hardware bugs associated with enabling
- Tx flow control so beware.
-
-RxDescriptors
--------------
-:Valid Range: 64-4096
-:Default Value: 1024
-
- This value is the number of receive descriptors allocated by the driver.
- Increasing this value allows the driver to buffer more incoming packets.
- Each descriptor is 16 bytes. A receive buffer is also allocated for
- each descriptor and can be either 2048, 4056, 8192, or 16384 bytes,
- depending on the MTU setting. When the MTU size is 1500 or less, the
- receive buffer size is 2048 bytes. When the MTU is greater than 1500 the
- receive buffer size will be either 4056, 8192, or 16384 bytes. The
- maximum MTU size is 16114.
-
-TxDescriptors
--------------
-:Valid Range: 64-4096
-:Default Value: 256
-
- This value is the number of transmit descriptors allocated by the driver.
- Increasing this value allows the driver to queue more transmits. Each
- descriptor is 16 bytes.
-
-RxIntDelay
-----------
-:Valid Range: 0-65535 (0=off)
-:Default Value: 72
-
- This value delays the generation of receive interrupts in units of
- 0.8192 microseconds. Receive interrupt reduction can improve CPU
- efficiency if properly tuned for specific network traffic. Increasing
- this value adds extra latency to frame reception and can end up
- decreasing the throughput of TCP traffic. If the system is reporting
- dropped receives, this value may be set too high, causing the driver to
- run out of available receive descriptors.
-
-TxIntDelay
-----------
-:Valid Range: 0-65535 (0=off)
-:Default Value: 32
-
- This value delays the generation of transmit interrupts in units of
- 0.8192 microseconds. Transmit interrupt reduction can improve CPU
- efficiency if properly tuned for specific network traffic. Increasing
- this value adds extra latency to frame transmission and can end up
- decreasing the throughput of TCP traffic. If this value is set too high,
- it will cause the driver to run out of available transmit descriptors.
-
-XsumRX
-------
-:Valid Range: 0-1
-:Default Value: 1
-
- A value of '1' indicates that the driver should enable IP checksum
- offload for received packets (both UDP and TCP) to the adapter hardware.
-
-RxFCHighThresh
---------------
-:Valid Range: 1,536-262,136 (0x600 - 0x3FFF8, 8 byte granularity)
-:Default Value: 196,608 (0x30000)
-
- Receive Flow control high threshold (when we send a pause frame)
-
-RxFCLowThresh
--------------
-:Valid Range: 64-262,136 (0x40 - 0x3FFF8, 8 byte granularity)
-:Default Value: 163,840 (0x28000)
-
- Receive Flow control low threshold (when we send a resume frame)
-
-FCReqTimeout
-------------
-:Valid Range: 1-65535
-:Default Value: 65535
-
- Flow control request timeout (how long to pause the link partner's tx)
-
-IntDelayEnable
---------------
-:Value Range: 0,1
-:Default Value: 1
-
- Interrupt Delay, 0 disables transmit interrupt delay and 1 enables it.
-
-
-Improving Performance
-=====================
-
-With the 10 Gigabit server adapters, the default Linux configuration will
-very likely limit the total available throughput artificially. There is a set
-of configuration changes that, when applied together, will increase the ability
-of Linux to transmit and receive data. The following enhancements were
-originally acquired from settings published at http://www.spec.org/web99/ for
-various submitted results using Linux.
-
-NOTE:
- These changes are only suggestions, and serve as a starting point for
- tuning your network performance.
-
-The changes are made in three major ways, listed in order of greatest effect:
-
-- Use ip link to modify the mtu (maximum transmission unit) and the txqueuelen
- parameter.
-- Use sysctl to modify /proc parameters (essentially kernel tuning)
-- Use setpci to modify the MMRBC field in PCI-X configuration space to increase
- transmit burst lengths on the bus.
-
-NOTE:
- setpci modifies the adapter's configuration registers to allow it to read
- up to 4k bytes at a time (for transmits). However, for some systems the
- behavior after modifying this register may be undefined (possibly errors of
- some kind). A power-cycle, hard reset or explicitly setting the e6 register
- back to 22 (setpci -d 8086:1a48 e6.b=22) may be required to get back to a
- stable configuration.
-
-- COPY these lines and paste them into ixgb_perf.sh:
-
-::
-
- #!/bin/bash
- echo "configuring network performance , edit this file to change the interface
- or device ID of 10GbE card"
- # set mmrbc to 4k reads, modify only Intel 10GbE device IDs
- # replace 1a48 with appropriate 10GbE device's ID installed on the system,
- # if needed.
- setpci -d 8086:1a48 e6.b=2e
- # set the MTU (max transmission unit) - it requires your switch and clients
- # to change as well.
- # set the txqueuelen
- # your ixgb adapter should be loaded as eth1 for this to work, change if needed
- ip li set dev eth1 mtu 9000 txqueuelen 1000 up
- # call the sysctl utility to modify /proc/sys entries
- sysctl -p ./sysctl_ixgb.conf
-
-- COPY these lines and paste them into sysctl_ixgb.conf:
-
-::
-
- # some of the defaults may be different for your kernel
- # call this file with sysctl -p <this file>
- # these are just suggested values that worked well to increase throughput in
- # several network benchmark tests, your mileage may vary
-
- ### IPV4 specific settings
- # turn TCP timestamp support off, default 1, reduces CPU use
- net.ipv4.tcp_timestamps = 0
- # turn SACK support off, default on
- # on systems with a VERY fast bus -> memory interface this is the big gainer
- net.ipv4.tcp_sack = 0
- # set min/default/max TCP read buffer, default 4096 87380 174760
- net.ipv4.tcp_rmem = 10000000 10000000 10000000
- # set min/pressure/max TCP write buffer, default 4096 16384 131072
- net.ipv4.tcp_wmem = 10000000 10000000 10000000
- # set min/pressure/max TCP buffer space, default 31744 32256 32768
- net.ipv4.tcp_mem = 10000000 10000000 10000000
-
- ### CORE settings (mostly for socket and UDP effect)
- # set maximum receive socket buffer size, default 131071
- net.core.rmem_max = 524287
- # set maximum send socket buffer size, default 131071
- net.core.wmem_max = 524287
- # set default receive socket buffer size, default 65535
- net.core.rmem_default = 524287
- # set default send socket buffer size, default 65535
- net.core.wmem_default = 524287
- # set maximum amount of option memory buffers, default 10240
- net.core.optmem_max = 524287
- # set number of unprocessed input packets before kernel starts dropping them; default 300
- net.core.netdev_max_backlog = 300000
-
-Edit the ixgb_perf.sh script if necessary to change eth1 to whatever interface
-your ixgb driver is using and/or replace '1a48' with appropriate 10GbE device's
-ID installed on the system.
-
-NOTE:
- Unless these scripts are added to the boot process, these changes will
- only last only until the next system reboot.
-
-
-Resolving Slow UDP Traffic
---------------------------
-If your server does not seem to be able to receive UDP traffic as fast as it
-can receive TCP traffic, it could be because Linux, by default, does not set
-the network stack buffers as large as they need to be to support high UDP
-transfer rates. One way to alleviate this problem is to allow more memory to
-be used by the IP stack to store incoming data.
-
-For instance, use the commands::
-
- sysctl -w net.core.rmem_max=262143
-
-and::
-
- sysctl -w net.core.rmem_default=262143
-
-to increase the read buffer memory max and default to 262143 (256k - 1) from
-defaults of max=131071 (128k - 1) and default=65535 (64k - 1). These variables
-will increase the amount of memory used by the network stack for receives, and
-can be increased significantly more if necessary for your application.
-
-
-Additional Configurations
-=========================
-
-Configuring the Driver on Different Distributions
--------------------------------------------------
-Configuring a network driver to load properly when the system is started is
-distribution dependent. Typically, the configuration process involves adding
-an alias line to /etc/modprobe.conf as well as editing other system startup
-scripts and/or configuration files. Many popular Linux distributions ship
-with tools to make these changes for you. To learn the proper way to
-configure a network device for your system, refer to your distribution
-documentation. If during this process you are asked for the driver or module
-name, the name for the Linux Base Driver for the Intel 10GbE Family of
-Adapters is ixgb.
-
-Viewing Link Messages
----------------------
-Link messages will not be displayed to the console if the distribution is
-restricting system messages. In order to see network driver link messages on
-your console, set dmesg to eight by entering the following::
-
- dmesg -n 8
-
-NOTE: This setting is not saved across reboots.
-
-Jumbo Frames
-------------
-The driver supports Jumbo Frames for all adapters. Jumbo Frames support is
-enabled by changing the MTU to a value larger than the default of 1500.
-The maximum value for the MTU is 16114. Use the ip command to
-increase the MTU size. For example::
-
- ip li set dev ethx mtu 9000
-
-The maximum MTU setting for Jumbo Frames is 16114. This value coincides
-with the maximum Jumbo Frames size of 16128.
-
-Ethtool
--------
-The driver utilizes the ethtool interface for driver configuration and
-diagnostics, as well as displaying statistical information. The ethtool
-version 1.6 or later is required for this functionality.
-
-The latest release of ethtool can be found from
-https://www.kernel.org/pub/software/network/ethtool/
-
-NOTE:
- The ethtool version 1.6 only supports a limited set of ethtool options.
- Support for a more complete ethtool feature set can be enabled by
- upgrading to the latest version.
-
-NAPI
-----
-NAPI (Rx polling mode) is supported in the ixgb driver.
-
-See https://wiki.linuxfoundation.org/networking/napi for more information on
-NAPI.
-
-
-Known Issues/Troubleshooting
-============================
-
-NOTE:
- After installing the driver, if your Intel Network Connection is not
- working, verify in the "In This Release" section of the readme that you have
- installed the correct driver.
-
-Cable Interoperability Issue with Fujitsu XENPAK Module in SmartBits Chassis
-----------------------------------------------------------------------------
-Excessive CRC errors may be observed if the Intel(R) PRO/10GbE CX4
-Server adapter is connected to a Fujitsu XENPAK CX4 module in a SmartBits
-chassis using 15 m/24AWG cable assemblies manufactured by Fujitsu or Leoni.
-The CRC errors may be received either by the Intel(R) PRO/10GbE CX4
-Server adapter or the SmartBits. If this situation occurs using a different
-cable assembly may resolve the issue.
-
-Cable Interoperability Issues with HP Procurve 3400cl Switch Port
------------------------------------------------------------------
-Excessive CRC errors may be observed if the Intel(R) PRO/10GbE CX4 Server
-adapter is connected to an HP Procurve 3400cl switch port using short cables
-(1 m or shorter). If this situation occurs, using a longer cable may resolve
-the issue.
-
-Excessive CRC errors may be observed using Fujitsu 24AWG cable assemblies that
-Are 10 m or longer or where using a Leoni 15 m/24AWG cable assembly. The CRC
-errors may be received either by the CX4 Server adapter or at the switch. If
-this situation occurs, using a different cable assembly may resolve the issue.
-
-Jumbo Frames System Requirement
--------------------------------
-Memory allocation failures have been observed on Linux systems with 64 MB
-of RAM or less that are running Jumbo Frames. If you are using Jumbo
-Frames, your system may require more than the advertised minimum
-requirement of 64 MB of system memory.
-
-Performance Degradation with Jumbo Frames
------------------------------------------
-Degradation in throughput performance may be observed in some Jumbo frames
-environments. If this is observed, increasing the application's socket buffer
-size and/or increasing the /proc/sys/net/ipv4/tcp_*mem entry values may help.
-See the specific application manual and /usr/src/linux*/Documentation/
-networking/ip-sysctl.txt for more details.
-
-Allocating Rx Buffers when Using Jumbo Frames
----------------------------------------------
-Allocating Rx buffers when using Jumbo Frames on 2.6.x kernels may fail if
-the available memory is heavily fragmented. This issue may be seen with PCI-X
-adapters or with packet split disabled. This can be reduced or eliminated
-by changing the amount of available memory for receive buffer allocation, by
-increasing /proc/sys/vm/min_free_kbytes.
-
-Multiple Interfaces on Same Ethernet Broadcast Network
-------------------------------------------------------
-Due to the default ARP behavior on Linux, it is not possible to have
-one system on two IP networks in the same Ethernet broadcast domain
-(non-partitioned switch) behave as expected. All Ethernet interfaces
-will respond to IP traffic for any IP address assigned to the system.
-This results in unbalanced receive traffic.
-
-If you have multiple interfaces in a server, do either of the following:
-
- - Turn on ARP filtering by entering::
-
- echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
-
- - Install the interfaces in separate broadcast domains - either in
- different switches or in a switch partitioned to VLANs.
-
-UDP Stress Test Dropped Packet Issue
---------------------------------------
-Under a small-packet UDP stress test with the 10GbE driver, the Linux system
-may drop UDP packets because the socket buffers are full. You may want
-to change the driver's Flow Control variables to the minimum value for
-controlling packet reception.
-
-Tx Hangs Possible Under Stress
-------------------------------
-Under stress conditions, if TX hangs occur, turning off TSO with
-"ethtool -K eth0 tso off" may resolve the problem.
-
-
-Support
-=======
-For general information, go to the Intel support website at:
-
-https://www.intel.com/support/
-
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
-If an issue is identified with the released source code on a supported kernel
-with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net
diff --git a/Documentation/networking/device_drivers/mellanox/mlx5.rst b/Documentation/networking/device_drivers/mellanox/mlx5.rst
deleted file mode 100644
index e9b65035cd47..000000000000
--- a/Documentation/networking/device_drivers/mellanox/mlx5.rst
+++ /dev/null
@@ -1,321 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
-
-=================================================
-Mellanox ConnectX(R) mlx5 core VPI Network Driver
-=================================================
-
-Copyright (c) 2019, Mellanox Technologies LTD.
-
-Contents
-========
-
-- `Enabling the driver and kconfig options`_
-- `Devlink info`_
-- `Devlink parameters`_
-- `Devlink health reporters`_
-- `mlx5 tracepoints`_
-
-Enabling the driver and kconfig options
-================================================
-
-| mlx5 core is modular and most of the major mlx5 core driver features can be selected (compiled in/out)
-| at build time via kernel Kconfig flags.
-| Basic features, ethernet net device rx/tx offloads and XDP, are available with the most basic flags
-| CONFIG_MLX5_CORE=y/m and CONFIG_MLX5_CORE_EN=y.
-| For the list of advanced features please see below.
-
-**CONFIG_MLX5_CORE=(y/m/n)** (module mlx5_core.ko)
-
-| The driver can be enabled by choosing CONFIG_MLX5_CORE=y/m in kernel config.
-| This will provide mlx5 core driver for mlx5 ulps to interface with (mlx5e, mlx5_ib).
-
-
-**CONFIG_MLX5_CORE_EN=(y/n)**
-
-| Choosing this option will allow basic ethernet netdevice support with all of the standard rx/tx offloads.
-| mlx5e is the mlx5 ulp driver which provides the netdevice kernel interface. When chosen, mlx5e will be
-| built into mlx5_core.ko.
-
-
-**CONFIG_MLX5_EN_ARFS=(y/n)**
-
-| Enables Hardware-accelerated receive flow steering (arfs) support, and ntuple filtering.
-| https://community.mellanox.com/s/article/howto-configure-arfs-on-connectx-4
-
-
-**CONFIG_MLX5_EN_RXNFC=(y/n)**
-
-| Enables ethtool receive network flow classification, which allows user defined
-| flow rules to direct traffic into arbitrary rx queue via ethtool set/get_rxnfc API.
-
-
-**CONFIG_MLX5_CORE_EN_DCB=(y/n)**
-
-| Enables `Data Center Bridging (DCB) Support <https://community.mellanox.com/s/article/howto-auto-config-pfc-and-ets-on-connectx-4-via-lldp-dcbx>`_.
-
-
-**CONFIG_MLX5_MPFS=(y/n)**
-
-| Ethernet Multi-Physical Function Switch (MPFS) support in ConnectX NIC.
-| MPFS is required when `Multi-Host <http://www.mellanox.com/page/multihost>`_ configuration is enabled to allow passing
-| user configured unicast MAC addresses to the requesting PF.
-
-
-**CONFIG_MLX5_ESWITCH=(y/n)**
-
-| Ethernet SRIOV E-Switch support in ConnectX NIC. E-Switch provides internal SRIOV packet steering
-| and switching for the enabled VFs and PF in two available modes:
-| 1) `Legacy SRIOV mode (L2 mac vlan steering based) <https://community.mellanox.com/s/article/howto-configure-sr-iov-for-connectx-4-connectx-5-with-kvm--ethernet-x>`_.
-| 2) `Switchdev mode (eswitch offloads) <https://www.mellanox.com/related-docs/prod_software/ASAP2_Hardware_Offloading_for_vSwitches_User_Manual_v4.4.pdf>`_.
-
-
-**CONFIG_MLX5_CORE_IPOIB=(y/n)**
-
-| IPoIB offloads & acceleration support.
-| Requires CONFIG_MLX5_CORE_EN to provide an accelerated interface for the rdma
-| IPoIB ulp netdevice.
-
-
-**CONFIG_MLX5_FPGA=(y/n)**
-
-| Build support for the Innova family of network cards by Mellanox Technologies.
-| Innova network cards are comprised of a ConnectX chip and an FPGA chip on one board.
-| If you select this option, the mlx5_core driver will include the Innova FPGA core and allow
-| building sandbox-specific client drivers.
-
-
-**CONFIG_MLX5_EN_IPSEC=(y/n)**
-
-| Enables `IPSec XFRM cryptography-offload acceleration <http://www.mellanox.com/related-docs/prod_software/Mellanox_Innova_IPsec_Ethernet_Adapter_Card_User_Manual.pdf>`_.
-
-**CONFIG_MLX5_EN_TLS=(y/n)**
-
-| TLS cryptography-offload acceleration.
-
-
-**CONFIG_MLX5_INFINIBAND=(y/n/m)** (module mlx5_ib.ko)
-
-| Provides low-level InfiniBand/RDMA and `RoCE <https://community.mellanox.com/s/article/recommended-network-configuration-examples-for-roce-deployment>`_ support.
-
-
-**External options** ( Choose if the corresponding mlx5 feature is required )
-
-- CONFIG_PTP_1588_CLOCK: When chosen, mlx5 ptp support will be enabled
-- CONFIG_VXLAN: When chosen, mlx5 vxlan support will be enabled.
-- CONFIG_MLXFW: When chosen, mlx5 firmware flashing support will be enabled (via devlink and ethtool).
-
-Devlink info
-============
-
-The devlink info reports the running and stored firmware versions on device.
-It also prints the device PSID which represents the HCA board type ID.
-
-User command example::
-
- $ devlink dev info pci/0000:00:06.0
- pci/0000:00:06.0:
- driver mlx5_core
- versions:
- fixed:
- fw.psid MT_0000000009
- running:
- fw.version 16.26.0100
- stored:
- fw.version 16.26.0100
-
-Devlink parameters
-==================
-
-flow_steering_mode: Device flow steering mode
----------------------------------------------
-The flow steering mode parameter controls the flow steering mode of the driver.
-Two modes are supported:
-1. 'dmfs' - Device managed flow steering.
-2. 'smfs' - Software/Driver managed flow steering.
-
-In DMFS mode, the HW steering entities are created and managed through the
-Firmware.
-In SMFS mode, the HW steering entities are created and managed by
-the driver directly in hardware, without firmware intervention.
-
-SMFS mode is faster and provides a better rule insertion rate than the default DMFS mode.
-
-User command examples:
-
-- Set SMFS flow steering mode::
-
- $ devlink dev param set pci/0000:06:00.0 name flow_steering_mode value "smfs" cmode runtime
-
-- Read device flow steering mode::
-
- $ devlink dev param show pci/0000:06:00.0 name flow_steering_mode
- pci/0000:06:00.0:
- name flow_steering_mode type driver-specific
- values:
- cmode runtime value smfs
-
-enable_roce: RoCE enablement state
-----------------------------------
-RoCE enablement state controls driver support for RoCE traffic.
-When RoCE is disabled, there is no gid table, only raw ethernet QPs are supported and traffic on the well known UDP RoCE port is handled as raw ethernet traffic.
-
-To change RoCE enablement state a user must change the driverinit cmode value and run devlink reload.
-
-User command examples:
-
-- Disable RoCE::
-
- $ devlink dev param set pci/0000:06:00.0 name enable_roce value false cmode driverinit
- $ devlink dev reload pci/0000:06:00.0
-
-- Read RoCE enablement state::
-
- $ devlink dev param show pci/0000:06:00.0 name enable_roce
- pci/0000:06:00.0:
- name enable_roce type generic
- values:
- cmode driverinit value true
-
-Devlink health reporters
-========================
-
-tx reporter
------------
-The tx reporter is responsible for reporting the following two error scenarios and recovering from them:
-
-- TX timeout
- Report on kernel tx timeout detection.
- Recover by searching for lost interrupts.
-- TX error completion
- Report on error tx completion.
- Recover by flushing the TX queue and resetting it.
-
-The TX reporter also supports an on-demand diagnose callback, which provides
-real-time information about the status of its send queues.
-
-User command examples:
-
-- Diagnose send queues status::
-
- $ devlink health diagnose pci/0000:82:00.0 reporter tx
-
-NOTE: This command has valid output only when the interface is up; otherwise the output is empty.
-
-- Show the number of tx errors indicated, the number of recover flows that ended successfully,
- whether autorecover is enabled, and the graceful period since the last recover::
-
- $ devlink health show pci/0000:82:00.0 reporter tx
-
-rx reporter
------------
-The rx reporter is responsible for reporting the following two error scenarios and recovering from them:
-
-- RX queues initialization (population) timeout
- RX queue descriptor population on ring initialization is done in
- napi context by triggering an irq. If the minimum number of
- descriptors is not obtained, a timeout occurs, which may be
- recoverable by polling the EQ (Event Queue).
-- RX completions with errors (reported by HW on interrupt context)
- Report on rx completion error.
- Recover (if needed) by flushing the related queue and resetting it.
-
-The RX reporter also supports an on-demand diagnose callback, which
-provides real-time information about the status of its receive queues.
-
-- Diagnose rx queues status, and corresponding completion queue::
-
- $ devlink health diagnose pci/0000:82:00.0 reporter rx
-
-NOTE: This command has valid output only when the interface is up; otherwise the output is empty.
-
-- Show the number of rx errors indicated, the number of recover flows that ended successfully,
- whether autorecover is enabled, and the graceful period since the last recover::
-
- $ devlink health show pci/0000:82:00.0 reporter rx
-
-fw reporter
------------
-The fw reporter implements diagnose and dump callbacks.
-It follows symptoms of fw error such as fw syndrome by triggering
-fw core dump and storing it into the dump buffer.
-The fw reporter diagnose command can be triggered any time by the user to check
-current fw status.
-
-User command examples:
-
-- Check fw health status::
-
- $ devlink health diagnose pci/0000:82:00.0 reporter fw
-
-- Read FW core dump if already stored or trigger new one::
-
- $ devlink health dump show pci/0000:82:00.0 reporter fw
-
-NOTE: This command can run only on the PF which has fw tracer ownership,
-running it on other PF or any VF will return "Operation not permitted".
-
-fw fatal reporter
------------------
-The fw fatal reporter implements dump and recover callbacks.
-It follows fatal error indications with a CR-space dump and a recover flow.
-The CR-space dump uses vsc interface which is valid even if the FW command
-interface is not functional, which is the case in most FW fatal errors.
-The recover function runs recover flow which reloads the driver and triggers fw
-reset if needed.
-
-User command examples:
-
-- Run fw recover flow manually::
-
- $ devlink health recover pci/0000:82:00.0 reporter fw_fatal
-
-- Read FW CR-space dump if already stored or trigger new one::
-
- $ devlink health dump show pci/0000:82:00.1 reporter fw_fatal
-
-NOTE: This command can run only on PF.
-
-mlx5 tracepoints
-================
-
-mlx5 driver provides internal trace points for tracking and debugging using
-kernel tracepoints interfaces (refer to Documentation/trace/ftrace.rst).
-
-For the list of supported mlx5 events check /sys/kernel/debug/tracing/events/mlx5/
-
-tc and eswitch offloads tracepoints:
-
-- mlx5e_configure_flower: trace flower filter actions and cookies offloaded to mlx5::
-
- $ echo mlx5:mlx5e_configure_flower >> /sys/kernel/debug/tracing/set_event
- $ cat /sys/kernel/debug/tracing/trace
- ...
- tc-6535 [019] ...1 2672.404466: mlx5e_configure_flower: cookie=0000000067874a55 actions= REDIRECT
-
-- mlx5e_delete_flower: trace flower filter actions and cookies deleted from mlx5::
-
- $ echo mlx5:mlx5e_delete_flower >> /sys/kernel/debug/tracing/set_event
- $ cat /sys/kernel/debug/tracing/trace
- ...
- tc-6569 [010] .N.1 2686.379075: mlx5e_delete_flower: cookie=0000000067874a55 actions= NULL
-
-- mlx5e_stats_flower: trace flower stats request::
-
- $ echo mlx5:mlx5e_stats_flower >> /sys/kernel/debug/tracing/set_event
- $ cat /sys/kernel/debug/tracing/trace
- ...
- tc-6546 [010] ...1 2679.704889: mlx5e_stats_flower: cookie=0000000060eb3d6a bytes=0 packets=0 lastused=4295560217
-
-- mlx5e_tc_update_neigh_used_value: trace tunnel rule neigh update value offloaded to mlx5::
-
- $ echo mlx5:mlx5e_tc_update_neigh_used_value >> /sys/kernel/debug/tracing/set_event
- $ cat /sys/kernel/debug/tracing/trace
- ...
- kworker/u48:4-8806 [009] ...1 55117.882428: mlx5e_tc_update_neigh_used_value: netdev: ens1f0 IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_used=1
-
-- mlx5e_rep_neigh_update: trace neigh update tasks scheduled due to neigh state change events::
-
- $ echo mlx5:mlx5e_rep_neigh_update >> /sys/kernel/debug/tracing/set_event
- $ cat /sys/kernel/debug/tracing/trace
- ...
- kworker/u48:7-2221 [009] ...1 1475.387435: mlx5e_rep_neigh_update: netdev: ens1f0 MAC: 24:8a:07:9a:17:9a IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_connected=1
diff --git a/Documentation/networking/device_drivers/neterion/vxge.rst b/Documentation/networking/device_drivers/neterion/vxge.rst
deleted file mode 100644
index 589c6b15c63d..000000000000
--- a/Documentation/networking/device_drivers/neterion/vxge.rst
+++ /dev/null
@@ -1,115 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-==============================================================================
-Neterion's (Formerly S2io) X3100 Series 10GbE PCIe Server Adapter Linux driver
-==============================================================================
-
-.. Contents
-
- 1) Introduction
- 2) Features supported
- 3) Configurable driver parameters
- 4) Troubleshooting
-
-1. Introduction
-===============
-
-This Linux driver supports all Neterion's X3100 series 10 GbE PCIe I/O
-Virtualized Server adapters.
-
-The X3100 series supports four modes of operation, configurable via
-firmware:
-
- - Single function mode
- - Multi function mode
- - SRIOV mode
- - MRIOV mode
-
-The functions share a 10GbE link and the pci-e bus, but hardly anything else
-inside the ASIC. Features like independent hw reset, statistics, bandwidth/
-priority allocation and guarantees, GRO, TSO, interrupt moderation etc are
-supported independently on each function.
-
-(See below for a complete list of features supported for both IPv4 and IPv6)
-
-2. Features supported
-=====================
-
-i) Single function mode (up to 17 queues)
-
-ii) Multi function mode (up to 17 functions)
-
-iii) PCI-SIG's I/O Virtualization
-
- - Single Root mode: v1.0 (up to 17 functions)
- - Multi-Root mode: v1.0 (up to 17 functions)
-
-iv) Jumbo frames
-
- X3100 Series supports MTU up to 9600 bytes, modifiable using
- ip command.
-
-v) Offloads supported: (Enabled by default)
-
- - Checksum offload (TCP/UDP/IP) on transmit and receive paths
- - TCP Segmentation Offload (TSO) on transmit path
- - Generic Receive Offload (GRO) on receive path
-
-vi) MSI-X: (Enabled by default)
-
- Resulting in noticeable performance improvement (up to 7% on certain
- platforms).
-
-vii) NAPI: (Enabled by default)
-
- For better Rx interrupt moderation.
-
-viii) RTH (Receive Traffic Hash): (Enabled by default)
-
- Receive side steering for better scaling.
-
-ix) Statistics
-
- Comprehensive MAC-level and software statistics displayed using
- "ethtool -S" option.
-
-x) Multiple hardware queues: (Enabled by default)
-
- Up to 17 hardware based transmit and receive data channels, with
- multiple steering options (transmit multiqueue enabled by default).
-
-3) Configurable driver parameters:
-----------------------------------
-
-i) max_config_dev
- Specifies maximum device functions to be enabled.
-
- Valid range: 1-8
-
-ii) max_config_port
- Specifies number of ports to be enabled.
-
- Valid range: 1,2
-
- Default: 1
-
-iii) max_config_vpath
- Specifies maximum VPATH(s) configured for each device function.
-
- Valid range: 1-17
-
-iv) vlan_tag_strip
- Enables/disables vlan tag stripping from all received tagged frames that
- are not replicated at the internal L2 switch.
-
- Valid range: 0,1 (disabled, enabled respectively)
-
- Default: 1
-
-v) addr_learn_en
- Enable learning the mac address of the guest OS interface in
- virtualization environment.
-
- Valid range: 0,1 (disabled, enabled respectively)
-
- Default: 0
diff --git a/Documentation/networking/device_drivers/qlogic/LICENSE.qla3xxx b/Documentation/networking/device_drivers/qlogic/LICENSE.qla3xxx
deleted file mode 100644
index 2f2077e34d81..000000000000
--- a/Documentation/networking/device_drivers/qlogic/LICENSE.qla3xxx
+++ /dev/null
@@ -1,46 +0,0 @@
-Copyright (c) 2003-2006 QLogic Corporation
-QLogic Linux Networking HBA Driver
-
-This program includes a device driver for Linux 2.6 that may be
-distributed with QLogic hardware specific firmware binary file.
-You may modify and redistribute the device driver code under the
-GNU General Public License as published by the Free Software
-Foundation (version 2 or a later version).
-
-You may redistribute the hardware specific firmware binary file
-under the following terms:
-
- 1. Redistribution of source code (only if applicable),
- must retain the above copyright notice, this list of
- conditions and the following disclaimer.
-
- 2. Redistribution in binary form must reproduce the above
- copyright notice, this list of conditions and the
- following disclaimer in the documentation and/or other
- materials provided with the distribution.
-
- 3. The name of QLogic Corporation may not be used to
- endorse or promote products derived from this software
- without specific prior written permission
-
-REGARDLESS OF WHAT LICENSING MECHANISM IS USED OR APPLICABLE,
-THIS PROGRAM IS PROVIDED BY QLOGIC CORPORATION "AS IS'' AND ANY
-EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
-IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR
-BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
-TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
-DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
-ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
-OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
-OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
-POSSIBILITY OF SUCH DAMAGE.
-
-USER ACKNOWLEDGES AND AGREES THAT USE OF THIS PROGRAM WILL NOT
-CREATE OR GIVE GROUNDS FOR A LICENSE BY IMPLICATION, ESTOPPEL, OR
-OTHERWISE IN ANY INTELLECTUAL PROPERTY RIGHTS (PATENT, COPYRIGHT,
-TRADE SECRET, MASK WORK, OR OTHER PROPRIETARY RIGHT) EMBODIED IN
-ANY OTHER QLOGIC HARDWARE OR SOFTWARE EITHER SOLELY OR IN
-COMBINATION WITH THIS PROGRAM.
-
diff --git a/Documentation/networking/device_drivers/qlogic/LICENSE.qlcnic b/Documentation/networking/device_drivers/qlogic/LICENSE.qlcnic
deleted file mode 100644
index 2ae3b64983ab..000000000000
--- a/Documentation/networking/device_drivers/qlogic/LICENSE.qlcnic
+++ /dev/null
@@ -1,288 +0,0 @@
-Copyright (c) 2009-2013 QLogic Corporation
-QLogic Linux qlcnic NIC Driver
-
-You may modify and redistribute the device driver code under the
-GNU General Public License (a copy of which is attached hereto as
-Exhibit A) published by the Free Software Foundation (version 2).
-
-
-EXHIBIT A
-
- GNU GENERAL PUBLIC LICENSE
- Version 2, June 1991
-
- Copyright (C) 1989, 1991 Free Software Foundation, Inc.
- 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
- Everyone is permitted to copy and distribute verbatim copies
- of this license document, but changing it is not allowed.
-
- Preamble
-
- The licenses for most software are designed to take away your
-freedom to share and change it. By contrast, the GNU General Public
-License is intended to guarantee your freedom to share and change free
-software--to make sure the software is free for all its users. This
-General Public License applies to most of the Free Software
-Foundation's software and to any other program whose authors commit to
-using it. (Some other Free Software Foundation software is covered by
-the GNU Lesser General Public License instead.) You can apply it to
-your programs, too.
-
- When we speak of free software, we are referring to freedom, not
-price. Our General Public Licenses are designed to make sure that you
-have the freedom to distribute copies of free software (and charge for
-this service if you wish), that you receive source code or can get it
-if you want it, that you can change the software or use pieces of it
-in new free programs; and that you know you can do these things.
-
- To protect your rights, we need to make restrictions that forbid
-anyone to deny you these rights or to ask you to surrender the rights.
-These restrictions translate to certain responsibilities for you if you
-distribute copies of the software, or if you modify it.
-
- For example, if you distribute copies of such a program, whether
-gratis or for a fee, you must give the recipients all the rights that
-you have. You must make sure that they, too, receive or can get the
-source code. And you must show them these terms so they know their
-rights.
-
- We protect your rights with two steps: (1) copyright the software, and
-(2) offer you this license which gives you legal permission to copy,
-distribute and/or modify the software.
-
- Also, for each author's protection and ours, we want to make certain
-that everyone understands that there is no warranty for this free
-software. If the software is modified by someone else and passed on, we
-want its recipients to know that what they have is not the original, so
-that any problems introduced by others will not reflect on the original
-authors' reputations.
-
- Finally, any free program is threatened constantly by software
-patents. We wish to avoid the danger that redistributors of a free
-program will individually obtain patent licenses, in effect making the
-program proprietary. To prevent this, we have made it clear that any
-patent must be licensed for everyone's free use or not licensed at all.
-
- The precise terms and conditions for copying, distribution and
-modification follow.
-
- GNU GENERAL PUBLIC LICENSE
- TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
-
- 0. This License applies to any program or other work which contains
-a notice placed by the copyright holder saying it may be distributed
-under the terms of this General Public License. The "Program", below,
-refers to any such program or work, and a "work based on the Program"
-means either the Program or any derivative work under copyright law:
-that is to say, a work containing the Program or a portion of it,
-either verbatim or with modifications and/or translated into another
-language. (Hereinafter, translation is included without limitation in
-the term "modification".) Each licensee is addressed as "you".
-
-Activities other than copying, distribution and modification are not
-covered by this License; they are outside its scope. The act of
-running the Program is not restricted, and the output from the Program
-is covered only if its contents constitute a work based on the
-Program (independent of having been made by running the Program).
-Whether that is true depends on what the Program does.
-
- 1. You may copy and distribute verbatim copies of the Program's
-source code as you receive it, in any medium, provided that you
-conspicuously and appropriately publish on each copy an appropriate
-copyright notice and disclaimer of warranty; keep intact all the
-notices that refer to this License and to the absence of any warranty;
-and give any other recipients of the Program a copy of this License
-along with the Program.
-
-You may charge a fee for the physical act of transferring a copy, and
-you may at your option offer warranty protection in exchange for a fee.
-
- 2. You may modify your copy or copies of the Program or any portion
-of it, thus forming a work based on the Program, and copy and
-distribute such modifications or work under the terms of Section 1
-above, provided that you also meet all of these conditions:
-
- a) You must cause the modified files to carry prominent notices
- stating that you changed the files and the date of any change.
-
- b) You must cause any work that you distribute or publish, that in
- whole or in part contains or is derived from the Program or any
- part thereof, to be licensed as a whole at no charge to all third
- parties under the terms of this License.
-
- c) If the modified program normally reads commands interactively
- when run, you must cause it, when started running for such
- interactive use in the most ordinary way, to print or display an
- announcement including an appropriate copyright notice and a
- notice that there is no warranty (or else, saying that you provide
- a warranty) and that users may redistribute the program under
- these conditions, and telling the user how to view a copy of this
- License. (Exception: if the Program itself is interactive but
- does not normally print such an announcement, your work based on
- the Program is not required to print an announcement.)
-
-These requirements apply to the modified work as a whole. If
-identifiable sections of that work are not derived from the Program,
-and can be reasonably considered independent and separate works in
-themselves, then this License, and its terms, do not apply to those
-sections when you distribute them as separate works. But when you
-distribute the same sections as part of a whole which is a work based
-on the Program, the distribution of the whole must be on the terms of
-this License, whose permissions for other licensees extend to the
-entire whole, and thus to each and every part regardless of who wrote it.
-
-Thus, it is not the intent of this section to claim rights or contest
-your rights to work written entirely by you; rather, the intent is to
-exercise the right to control the distribution of derivative or
-collective works based on the Program.
-
-In addition, mere aggregation of another work not based on the Program
-with the Program (or with a work based on the Program) on a volume of
-a storage or distribution medium does not bring the other work under
-the scope of this License.
-
- 3. You may copy and distribute the Program (or a work based on it,
-under Section 2) in object code or executable form under the terms of
-Sections 1 and 2 above provided that you also do one of the following:
-
- a) Accompany it with the complete corresponding machine-readable
- source code, which must be distributed under the terms of Sections
- 1 and 2 above on a medium customarily used for software interchange; or,
-
- b) Accompany it with a written offer, valid for at least three
- years, to give any third party, for a charge no more than your
- cost of physically performing source distribution, a complete
- machine-readable copy of the corresponding source code, to be
- distributed under the terms of Sections 1 and 2 above on a medium
- customarily used for software interchange; or,
-
- c) Accompany it with the information you received as to the offer
- to distribute corresponding source code. (This alternative is
- allowed only for noncommercial distribution and only if you
- received the program in object code or executable form with such
- an offer, in accord with Subsection b above.)
-
-The source code for a work means the preferred form of the work for
-making modifications to it. For an executable work, complete source
-code means all the source code for all modules it contains, plus any
-associated interface definition files, plus the scripts used to
-control compilation and installation of the executable. However, as a
-special exception, the source code distributed need not include
-anything that is normally distributed (in either source or binary
-form) with the major components (compiler, kernel, and so on) of the
-operating system on which the executable runs, unless that component
-itself accompanies the executable.
-
-If distribution of executable or object code is made by offering
-access to copy from a designated place, then offering equivalent
-access to copy the source code from the same place counts as
-distribution of the source code, even though third parties are not
-compelled to copy the source along with the object code.
-
- 4. You may not copy, modify, sublicense, or distribute the Program
-except as expressly provided under this License. Any attempt
-otherwise to copy, modify, sublicense or distribute the Program is
-void, and will automatically terminate your rights under this License.
-However, parties who have received copies, or rights, from you under
-this License will not have their licenses terminated so long as such
-parties remain in full compliance.
-
- 5. You are not required to accept this License, since you have not
-signed it. However, nothing else grants you permission to modify or
-distribute the Program or its derivative works. These actions are
-prohibited by law if you do not accept this License. Therefore, by
-modifying or distributing the Program (or any work based on the
-Program), you indicate your acceptance of this License to do so, and
-all its terms and conditions for copying, distributing or modifying
-the Program or works based on it.
-
- 6. Each time you redistribute the Program (or any work based on the
-Program), the recipient automatically receives a license from the
-original licensor to copy, distribute or modify the Program subject to
-these terms and conditions. You may not impose any further
-restrictions on the recipients' exercise of the rights granted herein.
-You are not responsible for enforcing compliance by third parties to
-this License.
-
- 7. If, as a consequence of a court judgment or allegation of patent
-infringement or for any other reason (not limited to patent issues),
-conditions are imposed on you (whether by court order, agreement or
-otherwise) that contradict the conditions of this License, they do not
-excuse you from the conditions of this License. If you cannot
-distribute so as to satisfy simultaneously your obligations under this
-License and any other pertinent obligations, then as a consequence you
-may not distribute the Program at all. For example, if a patent
-license would not permit royalty-free redistribution of the Program by
-all those who receive copies directly or indirectly through you, then
-the only way you could satisfy both it and this License would be to
-refrain entirely from distribution of the Program.
-
-If any portion of this section is held invalid or unenforceable under
-any particular circumstance, the balance of the section is intended to
-apply and the section as a whole is intended to apply in other
-circumstances.
-
-It is not the purpose of this section to induce you to infringe any
-patents or other property right claims or to contest validity of any
-such claims; this section has the sole purpose of protecting the
-integrity of the free software distribution system, which is
-implemented by public license practices. Many people have made
-generous contributions to the wide range of software distributed
-through that system in reliance on consistent application of that
-system; it is up to the author/donor to decide if he or she is willing
-to distribute software through any other system and a licensee cannot
-impose that choice.
-
-This section is intended to make thoroughly clear what is believed to
-be a consequence of the rest of this License.
-
- 8. If the distribution and/or use of the Program is restricted in
-certain countries either by patents or by copyrighted interfaces, the
-original copyright holder who places the Program under this License
-may add an explicit geographical distribution limitation excluding
-those countries, so that distribution is permitted only in or among
-countries not thus excluded. In such case, this License incorporates
-the limitation as if written in the body of this License.
-
- 9. The Free Software Foundation may publish revised and/or new versions
-of the General Public License from time to time. Such new versions will
-be similar in spirit to the present version, but may differ in detail to
-address new problems or concerns.
-
-Each version is given a distinguishing version number. If the Program
-specifies a version number of this License which applies to it and "any
-later version", you have the option of following the terms and conditions
-either of that version or of any later version published by the Free
-Software Foundation. If the Program does not specify a version number of
-this License, you may choose any version ever published by the Free Software
-Foundation.
-
- 10. If you wish to incorporate parts of the Program into other free
-programs whose distribution conditions are different, write to the author
-to ask for permission. For software which is copyrighted by the Free
-Software Foundation, write to the Free Software Foundation; we sometimes
-make exceptions for this. Our decision will be guided by the two goals
-of preserving the free status of all derivatives of our free software and
-of promoting the sharing and reuse of software generally.
-
- NO WARRANTY
-
- 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
-FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
-OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
-PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
-OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
-MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
-TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
-PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
-REPAIR OR CORRECTION.
-
- 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
-WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
-REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
-INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
-OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
-TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
-YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
-PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
-POSSIBILITY OF SUCH DAMAGES.
diff --git a/Documentation/networking/device_drivers/qlogic/LICENSE.qlge b/Documentation/networking/device_drivers/qlogic/LICENSE.qlge
deleted file mode 100644
index ce64e4d15b21..000000000000
--- a/Documentation/networking/device_drivers/qlogic/LICENSE.qlge
+++ /dev/null
@@ -1,288 +0,0 @@
-Copyright (c) 2003-2011 QLogic Corporation
-QLogic Linux qlge NIC Driver
-
-You may modify and redistribute the device driver code under the
-GNU General Public License (a copy of which is attached hereto as
-Exhibit A) published by the Free Software Foundation (version 2).
-
-
-EXHIBIT A
-
- GNU GENERAL PUBLIC LICENSE
- Version 2, June 1991
-
- Copyright (C) 1989, 1991 Free Software Foundation, Inc.
- 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
- Everyone is permitted to copy and distribute verbatim copies
- of this license document, but changing it is not allowed.
-
- Preamble
-
- The licenses for most software are designed to take away your
-freedom to share and change it. By contrast, the GNU General Public
-License is intended to guarantee your freedom to share and change free
-software--to make sure the software is free for all its users. This
-General Public License applies to most of the Free Software
-Foundation's software and to any other program whose authors commit to
-using it. (Some other Free Software Foundation software is covered by
-the GNU Lesser General Public License instead.) You can apply it to
-your programs, too.
-
- When we speak of free software, we are referring to freedom, not
-price. Our General Public Licenses are designed to make sure that you
-have the freedom to distribute copies of free software (and charge for
-this service if you wish), that you receive source code or can get it
-if you want it, that you can change the software or use pieces of it
-in new free programs; and that you know you can do these things.
-
- To protect your rights, we need to make restrictions that forbid
-anyone to deny you these rights or to ask you to surrender the rights.
-These restrictions translate to certain responsibilities for you if you
-distribute copies of the software, or if you modify it.
-
- For example, if you distribute copies of such a program, whether
-gratis or for a fee, you must give the recipients all the rights that
-you have. You must make sure that they, too, receive or can get the
-source code. And you must show them these terms so they know their
-rights.
-
- We protect your rights with two steps: (1) copyright the software, and
-(2) offer you this license which gives you legal permission to copy,
-distribute and/or modify the software.
-
- Also, for each author's protection and ours, we want to make certain
-that everyone understands that there is no warranty for this free
-software. If the software is modified by someone else and passed on, we
-want its recipients to know that what they have is not the original, so
-that any problems introduced by others will not reflect on the original
-authors' reputations.
-
- Finally, any free program is threatened constantly by software
-patents. We wish to avoid the danger that redistributors of a free
-program will individually obtain patent licenses, in effect making the
-program proprietary. To prevent this, we have made it clear that any
-patent must be licensed for everyone's free use or not licensed at all.
-
- The precise terms and conditions for copying, distribution and
-modification follow.
-
- GNU GENERAL PUBLIC LICENSE
- TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
-
- 0. This License applies to any program or other work which contains
-a notice placed by the copyright holder saying it may be distributed
-under the terms of this General Public License. The "Program", below,
-refers to any such program or work, and a "work based on the Program"
-means either the Program or any derivative work under copyright law:
-that is to say, a work containing the Program or a portion of it,
-either verbatim or with modifications and/or translated into another
-language. (Hereinafter, translation is included without limitation in
-the term "modification".) Each licensee is addressed as "you".
-
-Activities other than copying, distribution and modification are not
-covered by this License; they are outside its scope. The act of
-running the Program is not restricted, and the output from the Program
-is covered only if its contents constitute a work based on the
-Program (independent of having been made by running the Program).
-Whether that is true depends on what the Program does.
-
- 1. You may copy and distribute verbatim copies of the Program's
-source code as you receive it, in any medium, provided that you
-conspicuously and appropriately publish on each copy an appropriate
-copyright notice and disclaimer of warranty; keep intact all the
-notices that refer to this License and to the absence of any warranty;
-and give any other recipients of the Program a copy of this License
-along with the Program.
-
-You may charge a fee for the physical act of transferring a copy, and
-you may at your option offer warranty protection in exchange for a fee.
-
- 2. You may modify your copy or copies of the Program or any portion
-of it, thus forming a work based on the Program, and copy and
-distribute such modifications or work under the terms of Section 1
-above, provided that you also meet all of these conditions:
-
- a) You must cause the modified files to carry prominent notices
- stating that you changed the files and the date of any change.
-
- b) You must cause any work that you distribute or publish, that in
- whole or in part contains or is derived from the Program or any
- part thereof, to be licensed as a whole at no charge to all third
- parties under the terms of this License.
-
- c) If the modified program normally reads commands interactively
- when run, you must cause it, when started running for such
- interactive use in the most ordinary way, to print or display an
- announcement including an appropriate copyright notice and a
- notice that there is no warranty (or else, saying that you provide
- a warranty) and that users may redistribute the program under
- these conditions, and telling the user how to view a copy of this
- License. (Exception: if the Program itself is interactive but
- does not normally print such an announcement, your work based on
- the Program is not required to print an announcement.)
-
-These requirements apply to the modified work as a whole. If
-identifiable sections of that work are not derived from the Program,
-and can be reasonably considered independent and separate works in
-themselves, then this License, and its terms, do not apply to those
-sections when you distribute them as separate works. But when you
-distribute the same sections as part of a whole which is a work based
-on the Program, the distribution of the whole must be on the terms of
-this License, whose permissions for other licensees extend to the
-entire whole, and thus to each and every part regardless of who wrote it.
-
-Thus, it is not the intent of this section to claim rights or contest
-your rights to work written entirely by you; rather, the intent is to
-exercise the right to control the distribution of derivative or
-collective works based on the Program.
-
-In addition, mere aggregation of another work not based on the Program
-with the Program (or with a work based on the Program) on a volume of
-a storage or distribution medium does not bring the other work under
-the scope of this License.
-
- 3. You may copy and distribute the Program (or a work based on it,
-under Section 2) in object code or executable form under the terms of
-Sections 1 and 2 above provided that you also do one of the following:
-
- a) Accompany it with the complete corresponding machine-readable
- source code, which must be distributed under the terms of Sections
- 1 and 2 above on a medium customarily used for software interchange; or,
-
- b) Accompany it with a written offer, valid for at least three
- years, to give any third party, for a charge no more than your
- cost of physically performing source distribution, a complete
- machine-readable copy of the corresponding source code, to be
- distributed under the terms of Sections 1 and 2 above on a medium
- customarily used for software interchange; or,
-
- c) Accompany it with the information you received as to the offer
- to distribute corresponding source code. (This alternative is
- allowed only for noncommercial distribution and only if you
- received the program in object code or executable form with such
- an offer, in accord with Subsection b above.)
-
-The source code for a work means the preferred form of the work for
-making modifications to it. For an executable work, complete source
-code means all the source code for all modules it contains, plus any
-associated interface definition files, plus the scripts used to
-control compilation and installation of the executable. However, as a
-special exception, the source code distributed need not include
-anything that is normally distributed (in either source or binary
-form) with the major components (compiler, kernel, and so on) of the
-operating system on which the executable runs, unless that component
-itself accompanies the executable.
-
-If distribution of executable or object code is made by offering
-access to copy from a designated place, then offering equivalent
-access to copy the source code from the same place counts as
-distribution of the source code, even though third parties are not
-compelled to copy the source along with the object code.
-
- 4. You may not copy, modify, sublicense, or distribute the Program
-except as expressly provided under this License. Any attempt
-otherwise to copy, modify, sublicense or distribute the Program is
-void, and will automatically terminate your rights under this License.
-However, parties who have received copies, or rights, from you under
-this License will not have their licenses terminated so long as such
-parties remain in full compliance.
-
- 5. You are not required to accept this License, since you have not
-signed it. However, nothing else grants you permission to modify or
-distribute the Program or its derivative works. These actions are
-prohibited by law if you do not accept this License. Therefore, by
-modifying or distributing the Program (or any work based on the
-Program), you indicate your acceptance of this License to do so, and
-all its terms and conditions for copying, distributing or modifying
-the Program or works based on it.
-
- 6. Each time you redistribute the Program (or any work based on the
-Program), the recipient automatically receives a license from the
-original licensor to copy, distribute or modify the Program subject to
-these terms and conditions. You may not impose any further
-restrictions on the recipients' exercise of the rights granted herein.
-You are not responsible for enforcing compliance by third parties to
-this License.
-
- 7. If, as a consequence of a court judgment or allegation of patent
-infringement or for any other reason (not limited to patent issues),
-conditions are imposed on you (whether by court order, agreement or
-otherwise) that contradict the conditions of this License, they do not
-excuse you from the conditions of this License. If you cannot
-distribute so as to satisfy simultaneously your obligations under this
-License and any other pertinent obligations, then as a consequence you
-may not distribute the Program at all. For example, if a patent
-license would not permit royalty-free redistribution of the Program by
-all those who receive copies directly or indirectly through you, then
-the only way you could satisfy both it and this License would be to
-refrain entirely from distribution of the Program.
-
-If any portion of this section is held invalid or unenforceable under
-any particular circumstance, the balance of the section is intended to
-apply and the section as a whole is intended to apply in other
-circumstances.
-
-It is not the purpose of this section to induce you to infringe any
-patents or other property right claims or to contest validity of any
-such claims; this section has the sole purpose of protecting the
-integrity of the free software distribution system, which is
-implemented by public license practices. Many people have made
-generous contributions to the wide range of software distributed
-through that system in reliance on consistent application of that
-system; it is up to the author/donor to decide if he or she is willing
-to distribute software through any other system and a licensee cannot
-impose that choice.
-
-This section is intended to make thoroughly clear what is believed to
-be a consequence of the rest of this License.
-
- 8. If the distribution and/or use of the Program is restricted in
-certain countries either by patents or by copyrighted interfaces, the
-original copyright holder who places the Program under this License
-may add an explicit geographical distribution limitation excluding
-those countries, so that distribution is permitted only in or among
-countries not thus excluded. In such case, this License incorporates
-the limitation as if written in the body of this License.
-
- 9. The Free Software Foundation may publish revised and/or new versions
-of the General Public License from time to time. Such new versions will
-be similar in spirit to the present version, but may differ in detail to
-address new problems or concerns.
-
-Each version is given a distinguishing version number. If the Program
-specifies a version number of this License which applies to it and "any
-later version", you have the option of following the terms and conditions
-either of that version or of any later version published by the Free
-Software Foundation. If the Program does not specify a version number of
-this License, you may choose any version ever published by the Free Software
-Foundation.
-
- 10. If you wish to incorporate parts of the Program into other free
-programs whose distribution conditions are different, write to the author
-to ask for permission. For software which is copyrighted by the Free
-Software Foundation, write to the Free Software Foundation; we sometimes
-make exceptions for this. Our decision will be guided by the two goals
-of preserving the free status of all derivatives of our free software and
-of promoting the sharing and reuse of software generally.
-
- NO WARRANTY
-
- 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
-FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
-OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
-PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
-OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
-MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
-TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
-PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
-REPAIR OR CORRECTION.
-
- 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
-WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
-REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
-INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
-OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
-TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
-YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
-PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
-POSSIBILITY OF SUCH DAMAGES.
diff --git a/Documentation/networking/device_drivers/qualcomm/rmnet.rst b/Documentation/networking/device_drivers/qualcomm/rmnet.rst
deleted file mode 100644
index 70643b58de05..000000000000
--- a/Documentation/networking/device_drivers/qualcomm/rmnet.rst
+++ /dev/null
@@ -1,95 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-============
-Rmnet Driver
-============
-
-1. Introduction
-===============
-
-rmnet driver is used for supporting the Multiplexing and aggregation
-Protocol (MAP). This protocol is used by all recent chipsets using Qualcomm
-Technologies, Inc. modems.
-
-This driver can be used to register onto any physical network device in
-IP mode. Physical transports include USB, HSIC, PCIe and IP accelerator.
-
-Multiplexing allows for creation of logical netdevices (rmnet devices) to
-handle multiple private data networks (PDN) like a default internet, tethering,
-multimedia messaging service (MMS) or IP media subsystem (IMS). Hardware sends
-packets with MAP headers to rmnet. Based on the multiplexer id, rmnet
-routes to the appropriate PDN after removing the MAP header.
-
-Aggregation is required to achieve high data rates. This involves hardware
-sending aggregated bunch of MAP frames. rmnet driver will de-aggregate
-these MAP frames and send them to appropriate PDN's.
-
-2. Packet format
-================
-
-a. MAP packet (data / control)
-
-MAP header has the same endianness of the IP packet.
-
-Packet format::
-
- Bit 0 1 2-7 8 - 15 16 - 31
- Function Command / Data Reserved Pad Multiplexer ID Payload length
- Bit 32 - x
- Function Raw Bytes
-
-Command (1)/ Data (0) bit value is to indicate if the packet is a MAP command
-or data packet. Control packet is used for transport level flow control. Data
-packets are standard IP packets.
-
-Reserved bits are usually zeroed out and to be ignored by receiver.
-
-Padding is number of bytes to be added for 4 byte alignment if required by
-hardware.
-
-Multiplexer ID is to indicate the PDN on which data has to be sent.
-
-Payload length includes the padding length but does not include MAP header
-length.
-
-b. MAP packet (command specific)::
-
- Bit 0 1 2-7 8 - 15 16 - 31
- Function Command Reserved Pad Multiplexer ID Payload length
- Bit 32 - 39 40 - 45 46 - 47 48 - 63
- Function Command name Reserved Command Type Reserved
- Bit 64 - 95
- Function Transaction ID
- Bit 96 - 127
- Function Command data
-
-Command 1 indicates disabling flow while 2 is enabling flow
-
-Command types
-
-= ==========================================
-0 for MAP command request
-1 is to acknowledge the receipt of a command
-2 is for unsupported commands
-3 is for error during processing of commands
-= ==========================================
-
-c. Aggregation
-
-Aggregation is multiple MAP packets (can be data or command) delivered to
-rmnet in a single linear skb. rmnet will process the individual
-packets and either ACK the MAP command or deliver the IP packet to the
-network stack as needed
-
-MAP header|IP Packet|Optional padding|MAP header|IP Packet|Optional padding....
-
-MAP header|IP Packet|Optional padding|MAP header|Command Packet|Optional pad...
-
-3. Userspace configuration
-==========================
-
-rmnet userspace configuration is done through netlink library librmnetctl
-and command line utility rmnetcli. Utility is hosted in codeaurora forum git.
-The driver uses rtnl_link_ops for communication.
-
-https://source.codeaurora.org/quic/la/platform/vendor/qcom-opensource/dataservices/tree/rmnetctl
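A minimal Python sketch of the MAP header layout and the aggregation format described in the removed rmnet.rst text above. It assumes that "bit 0" is the most significant bit of the first byte and that the 16-bit payload length is in network byte order; neither is stated explicitly in the text. The names `MapHeader`, `parse_map_header` and `split_aggregate` are illustrative and not part of the driver.

```python
import struct
from typing import Iterator, NamedTuple, Tuple


class MapHeader(NamedTuple):
    is_command: bool   # bit 0: 1 = MAP command, 0 = data
    pad_len: int       # bits 2-7: padding bytes appended to the payload
    mux_id: int        # bits 8-15: multiplexer ID selecting the PDN
    payload_len: int   # bits 16-31: payload length, padding included


def parse_map_header(frame: bytes) -> MapHeader:
    """Decode the 4-byte MAP header at the start of ``frame``."""
    if len(frame) < 4:
        raise ValueError("MAP header needs 4 bytes")
    flags, mux_id, payload_len = struct.unpack_from("!BBH", frame, 0)
    return MapHeader(
        is_command=bool(flags & 0x80),  # assumption: bit 0 is the MSB
        pad_len=flags & 0x3F,
        mux_id=mux_id,
        payload_len=payload_len,
    )


def split_aggregate(buf: bytes) -> Iterator[Tuple[MapHeader, bytes]]:
    """Yield (header, payload without padding) for each MAP packet."""
    offset = 0
    while offset < len(buf):
        hdr = parse_map_header(buf[offset:offset + 4])
        end = offset + 4 + hdr.payload_len
        if end > len(buf) or hdr.pad_len > hdr.payload_len:
            raise ValueError("truncated or malformed MAP packet")
        yield hdr, buf[offset + 4:end - hdr.pad_len]
        offset = end
```

Because the payload length includes the padding, the sketch strips `pad_len` bytes from the end of each payload before handing it on.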
diff --git a/Documentation/networking/device_drivers/wifi/index.rst b/Documentation/networking/device_drivers/wifi/index.rst
new file mode 100644
index 000000000000..fb394f5de4a9
--- /dev/null
+++ b/Documentation/networking/device_drivers/wifi/index.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Wi-Fi Device Drivers
+====================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ intel/ipw2100
+ intel/ipw2200
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/intel/ipw2100.rst b/Documentation/networking/device_drivers/wifi/intel/ipw2100.rst
index d54ad522f937..883e96355799 100644
--- a/Documentation/networking/device_drivers/intel/ipw2100.rst
+++ b/Documentation/networking/device_drivers/wifi/intel/ipw2100.rst
@@ -78,7 +78,7 @@ such, if you are interested in deploying or shipping a driver as part of
solution intended to be used for purposes other than development, please
obtain a tested driver from Intel Customer Support at:
-http://www.intel.com/support/wireless/sb/CS-006408.htm
+https://www.intel.com/support/wireless/sb/CS-006408.htm
1. Introduction
===============
diff --git a/Documentation/networking/device_drivers/intel/ipw2200.rst b/Documentation/networking/device_drivers/wifi/intel/ipw2200.rst
index 0cb42d2fd7e5..0cb42d2fd7e5 100644
--- a/Documentation/networking/device_drivers/intel/ipw2200.rst
+++ b/Documentation/networking/device_drivers/wifi/intel/ipw2200.rst
diff --git a/Documentation/networking/device_drivers/wwan/index.rst b/Documentation/networking/device_drivers/wwan/index.rst
new file mode 100644
index 000000000000..370d8264d5dc
--- /dev/null
+++ b/Documentation/networking/device_drivers/wwan/index.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+WWAN Device Drivers
+===================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ iosm
+ t7xx
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/wwan/iosm.rst b/Documentation/networking/device_drivers/wwan/iosm.rst
new file mode 100644
index 000000000000..6f9e955af984
--- /dev/null
+++ b/Documentation/networking/device_drivers/wwan/iosm.rst
@@ -0,0 +1,96 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+.. Copyright (C) 2020-21 Intel Corporation
+
+.. _iosm_driver_doc:
+
+===========================================
+IOSM Driver for Intel M.2 PCIe based Modems
+===========================================
+The IOSM (IPC over Shared Memory) driver is a WWAN PCIe host driver developed
+for Linux or Chrome OS platforms for data exchange over the PCIe interface between
+the host platform and an Intel M.2 modem. The driver exposes an interface conforming
+to the MBIM protocol [1]. Any front end application (e.g. Modem Manager) can
+manage the MBIM interface to enable data communication towards WWAN.
+
+Basic usage
+===========
+MBIM functions are inactive when unmanaged. The IOSM driver only provides a
+userspace interface MBIM "WWAN PORT" representing MBIM control channel and does
+not play any role in managing the functionality. It is the job of a userspace
+application to detect port enumeration and enable MBIM functionality.
+
+Examples of such userspace applications are mbimcli (included with the
+libmbim [2] library) and Modem Manager [3].
+
+A management application must carry out the following actions to establish
+an MBIM IP session:
+
+- open the MBIM control channel
+- configure network connection settings
+- connect to network
+- configure IP network interface
+
+Management application development
+==================================
+The driver and userspace interfaces are described below. The MBIM protocol is
+described in [1] Mobile Broadband Interface Model v1.0 Errata-1.
+
+MBIM control channel userspace ABI
+----------------------------------
+
+/dev/wwan0mbim0 character device
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The driver exposes an MBIM interface to the MBIM function by implementing
+MBIM WWAN Port. The userspace end of the control channel pipe is a
+/dev/wwan0mbim0 character device. Application shall use this interface for
+MBIM protocol communication.
+
+Fragmentation
+~~~~~~~~~~~~~
+The userspace application is responsible for all control message fragmentation
+and defragmentation as per MBIM specification.
+
+/dev/wwan0mbim0 write()
+~~~~~~~~~~~~~~~~~~~~~~~
+The MBIM control messages from the management application must not exceed the
+negotiated control message size.
+
+/dev/wwan0mbim0 read()
+~~~~~~~~~~~~~~~~~~~~~~
+The management application must accept control messages of up to the negotiated
+control message size.
+
+MBIM data channel userspace ABI
+-------------------------------
+
+wwan0-X network device
+~~~~~~~~~~~~~~~~~~~~~~
+The IOSM driver exposes an IP link interface "wwan0-X" of type "wwan" for IP
+traffic. The iproute2 utility is used to create the "wwan0-X" network
+interface and to associate it with an MBIM IP session. The driver supports
+up to 8 IP sessions for simultaneous IP communication.
+
+The userspace management application is responsible for creating a new IP link
+before establishing an MBIM IP session whose SessionId is greater than 0.
+
+For example, creating a new IP link for an MBIM IP session with SessionId 1::
+
+ ip link add dev wwan0-1 parentdev wwan0 type wwan linkid 1
+
+The driver will automatically map the "wwan0-1" network device to MBIM IP
+session 1.
+
+References
+==========
+[1] "MBIM (Mobile Broadband Interface Model) Errata-1"
+ - https://www.usb.org/document-library/
+
+[2] libmbim - "a glib-based library for talking to WWAN modems and
+ devices which speak the Mobile Interface Broadband Model (MBIM)
+ protocol"
+ - http://www.freedesktop.org/wiki/Software/libmbim/
+
+[3] Modem Manager - "a DBus-activated daemon which controls mobile
+ broadband (2G/3G/4G) devices and connections"
+ - http://www.freedesktop.org/wiki/Software/ModemManager/
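A minimal Python sketch of the first step listed in iosm.rst above, opening the MBIM control channel, under stated assumptions. The message type value and the 16-byte little-endian layout follow my reading of the MBIM specification [1]; the 4096-byte default for the control message size and the function names are assumptions made for illustration, not values taken from the driver. `open_control_channel` needs a real modem and is shown only to illustrate the write() and read() size rules.

```python
import os
import struct

MBIM_OPEN_MSG = 0x00000001           # message type per the MBIM specification
DEFAULT_MAX_CONTROL_TRANSFER = 4096  # assumed default; negotiate per device


def build_open_msg(transaction_id: int,
                   max_control_transfer: int = DEFAULT_MAX_CONTROL_TRANSFER) -> bytes:
    """Pack an MBIM_OPEN_MSG: 12-byte message header plus MaxControlTransfer."""
    length = 16  # header (type, length, transaction id) + one 32-bit field
    return struct.pack("<IIII", MBIM_OPEN_MSG, length,
                       transaction_id, max_control_transfer)


def open_control_channel(path: str = "/dev/wwan0mbim0",
                         max_control_transfer: int = DEFAULT_MAX_CONTROL_TRANSFER) -> bytes:
    """Send MBIM_OPEN_MSG and return the raw reply (requires real hardware)."""
    msg = build_open_msg(transaction_id=1,
                         max_control_transfer=max_control_transfer)
    # write(): a message must not exceed the negotiated control message size.
    if len(msg) > max_control_transfer:
        raise ValueError("message exceeds the negotiated control message size")
    fd = os.open(path, os.O_RDWR)
    try:
        os.write(fd, msg)
        # read(): accept replies of up to the negotiated control message size.
        return os.read(fd, max_control_transfer)
    finally:
        os.close(fd)
```

In practice a library such as libmbim [2] handles this exchange, including fragmentation; the sketch only shows where the size limit applies.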
diff --git a/Documentation/networking/device_drivers/wwan/t7xx.rst b/Documentation/networking/device_drivers/wwan/t7xx.rst
new file mode 100644
index 000000000000..f346f5f85f15
--- /dev/null
+++ b/Documentation/networking/device_drivers/wwan/t7xx.rst
@@ -0,0 +1,166 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+.. Copyright (C) 2020-21 Intel Corporation
+
+.. _t7xx_driver_doc:
+
+============================================
+t7xx driver for MTK PCIe based T700 5G modem
+============================================
+The t7xx driver is a WWAN PCIe host driver developed for Linux or Chrome OS platforms
+for data exchange over the PCIe interface between the host platform and MediaTek's T700 5G modem.
+The driver exposes an interface conforming to the MBIM protocol [1]. Any front end
+application (e.g. Modem Manager) can manage the MBIM interface to enable
+data communication towards WWAN. The driver also provides an interface to interact
+with the MediaTek modem via AT commands.
+
+Basic usage
+===========
+MBIM & AT functions are inactive when unmanaged. The t7xx driver provides
+WWAN port userspace interfaces representing MBIM & AT control channels and does
+not play any role in managing their functionality. It is the job of a userspace
+application to detect port enumeration and enable MBIM & AT functionalities.
+
+Examples of such userspace applications are:
+
+- mbimcli (included with the libmbim [2] library), and
+- Modem Manager [3]
+
+A management application must carry out the following actions to establish an
+MBIM IP session:
+
+- open the MBIM control channel
+- configure network connection settings
+- connect to network
+- configure IP network interface
+
+A management application must carry out the following action to send an AT
+command and receive a response:
+
+- open the AT control channel using a UART tool or a special user tool
+
+Sysfs
+=====
+The driver provides sysfs interfaces to userspace.
+
+t7xx_mode
+---------
+The sysfs interface provides userspace with access to the device mode. It
+supports read and write operations.
+
+Device mode:
+
+- ``unknown``: the device is in an unknown state
+- ``ready``: the device is in the ready state
+- ``reset``: the device is in the reset state
+- ``fastboot_switching``: the device is switching to fastboot
+- ``fastboot_download``: the device is in the fastboot download state
+- ``fastboot_dump``: the device is in the fastboot dump state
+
+Read from userspace to get the current
+device mode::
+
+ $ cat /sys/bus/pci/devices/${bdf}/t7xx_mode
+
+Write from userspace to set the device
+mode::
+
+ $ echo fastboot_switching > /sys/bus/pci/devices/${bdf}/t7xx_mode
+
+Management application development
+==================================
+The driver and userspace interfaces are described below. The MBIM protocol is
+described in [1] Mobile Broadband Interface Model v1.0 Errata-1.
+
+MBIM control channel userspace ABI
+----------------------------------
+
+/dev/wwan0mbim0 character device
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The driver exposes an MBIM interface to the MBIM function by implementing
+MBIM WWAN Port. The userspace end of the control channel pipe is a
+/dev/wwan0mbim0 character device. Application shall use this interface for
+MBIM protocol communication.
+
+Fragmentation
+~~~~~~~~~~~~~
+The userspace application is responsible for all control message fragmentation
+and defragmentation as per MBIM specification.
+
+/dev/wwan0mbim0 write()
+~~~~~~~~~~~~~~~~~~~~~~~
+The MBIM control messages from the management application must not exceed the
+negotiated control message size.
+
+/dev/wwan0mbim0 read()
+~~~~~~~~~~~~~~~~~~~~~~
+The management application must accept control messages of up to the negotiated
+control message size.
+
+MBIM data channel userspace ABI
+-------------------------------
+
+wwan0-X network device
+~~~~~~~~~~~~~~~~~~~~~~
+The t7xx driver exposes an IP link interface "wwan0-X" of type "wwan" for IP
+traffic. The iproute2 utility is used to create the "wwan0-X" network
+interface and to associate it with an MBIM IP session.
+
+The userspace management application is responsible for creating a new IP link
+before establishing an MBIM IP session whose SessionId is greater than 0.
+
+For example, creating new IP link for a MBIM IP session with SessionId 1:
+
+ ip link add dev wwan0-1 parentdev wwan0 type wwan linkid 1
+
+The driver will automatically map the "wwan0-1" network device to MBIM IP
+session 1.
+
+AT port userspace ABI
+---------------------
+
+/dev/wwan0at0 character device
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The driver exposes an AT port by implementing an AT WWAN Port.
+The userspace end of the control port is a /dev/wwan0at0 character
+device. Applications shall use this interface to issue AT commands.
+
+fastboot port userspace ABI
+---------------------------
+
+/dev/wwan0fastboot0 character device
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The driver exposes a fastboot protocol interface by implementing a
+fastboot WWAN Port. The userspace end of the fastboot channel pipe is a
+/dev/wwan0fastboot0 character device. Applications shall use this interface for
+fastboot protocol communication.
+
+Note that the driver needs to be reloaded to export the /dev/wwan0fastboot0
+port, because the device needs a cold reset after entering
+``fastboot_switching`` mode.
+
+MediaTek's T700 modem supports the 3GPP TS 27.007 [4] specification.
+
+References
+==========
+[1] *MBIM (Mobile Broadband Interface Model) Errata-1*
+
+- https://www.usb.org/document-library/
+
+[2] *libmbim "a glib-based library for talking to WWAN modems and devices which
+speak the Mobile Interface Broadband Model (MBIM) protocol"*
+
+- http://www.freedesktop.org/wiki/Software/libmbim/
+
+[3] *Modem Manager "a DBus-activated daemon which controls mobile broadband
+(2G/3G/4G/5G) devices and connections"*
+
+- http://www.freedesktop.org/wiki/Software/ModemManager/
+
+[4] *Specification # 27.007 - 3GPP*
+
+- https://www.3gpp.org/DynaReport/27007.htm
+
+[5] *fastboot "a mechanism for communicating with bootloaders"*
+
+- https://android.googlesource.com/platform/system/core/+/refs/heads/main/fastboot/README.md
diff --git a/Documentation/networking/devlink/am65-nuss-cpsw-switch.rst b/Documentation/networking/devlink/am65-nuss-cpsw-switch.rst
new file mode 100644
index 000000000000..1e589c26abff
--- /dev/null
+++ b/Documentation/networking/devlink/am65-nuss-cpsw-switch.rst
@@ -0,0 +1,26 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+am65-cpsw-nuss devlink support
+==============================
+
+This document describes the devlink features implemented by the ``am65-cpsw-nuss``
+device driver.
+
+Parameters
+==========
+
+The ``am65-cpsw-nuss`` driver implements the following driver-specific
+parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``switch_mode``
+ - Boolean
+ - runtime
+ - Enable switch mode
diff --git a/Documentation/networking/devlink/bnxt.rst b/Documentation/networking/devlink/bnxt.rst
index 3dfd84ccb1c7..a4fb27663cd6 100644
--- a/Documentation/networking/devlink/bnxt.rst
+++ b/Documentation/networking/devlink/bnxt.rst
@@ -22,6 +22,8 @@ Parameters
- Permanent
* - ``msix_vec_per_pf_min``
- Permanent
+ * - ``enable_remote_dev_reset``
+ - Runtime
The ``bnxt`` driver also implements the following driver-specific
parameters.
diff --git a/Documentation/networking/devlink/devlink-dpipe.rst b/Documentation/networking/devlink/devlink-dpipe.rst
index 468fe1001b74..af37f250df43 100644
--- a/Documentation/networking/devlink/devlink-dpipe.rst
+++ b/Documentation/networking/devlink/devlink-dpipe.rst
@@ -52,7 +52,7 @@ purposes as a standard complementary tool. The system's view from
``devlink-dpipe`` should change according to the changes done by the
standard configuration tools.
-For example, it’s quiet common to implement Access Control Lists (ACL)
+For example, it’s quite common to implement Access Control Lists (ACL)
using Ternary Content Addressable Memory (TCAM). The TCAM memory can be
divided into TCAM regions. Complex TC filters can have multiple rules with
different priorities and different lookup keys. On the other hand hardware
diff --git a/Documentation/networking/devlink/devlink-flash.rst b/Documentation/networking/devlink/devlink-flash.rst
index 40a87c0222cb..603e732f00cc 100644
--- a/Documentation/networking/devlink/devlink-flash.rst
+++ b/Documentation/networking/devlink/devlink-flash.rst
@@ -16,6 +16,34 @@ Note that the file name is a path relative to the firmware loading path
(usually ``/lib/firmware/``). Drivers may send status updates to inform
user space about the progress of the update operation.
+Overwrite Mask
+==============
+
+The ``devlink-flash`` command allows optionally specifying a mask indicating
+how the device should handle subsections of flash components when updating.
+This mask indicates the set of sections which are allowed to be overwritten.
+
+.. list-table:: List of overwrite mask bits
+ :widths: 5 95
+
+ * - Name
+ - Description
+ * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS``
+ - Indicates that the device should overwrite settings in the components
+ being updated with the settings found in the provided image.
+ * - ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS``
+ - Indicates that the device should overwrite identifiers in the
+ components being updated with the identifiers found in the provided
+ image. This includes MAC addresses, serial IDs, and similar device
+ identifiers.
+
+Multiple overwrite bits may be combined and requested together. If no bits
+are provided, it is expected that the device only update firmware binaries
+in the components being updated. Settings and identifiers are expected to be
+preserved across the update. A device may not support every combination and
+the driver for such a device must reject any combination which cannot be
+faithfully implemented.
+
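The combination and rejection rules above can be illustrated with a small model. This is a Python sketch rather than kernel code: the bit positions, the ``Overwrite`` class, the ``check_overwrite_mask()`` helper and the example set of supported combinations are all assumptions made for illustration.

```python
from enum import IntFlag


class Overwrite(IntFlag):
    """Model of the overwrite mask bits (bit positions assumed for illustration)."""
    SETTINGS = 1 << 0      # stands for DEVLINK_FLASH_OVERWRITE_SETTINGS
    IDENTIFIERS = 1 << 1   # stands for DEVLINK_FLASH_OVERWRITE_IDENTIFIERS


def check_overwrite_mask(requested: Overwrite, supported: set) -> Overwrite:
    """Reject any combination the (hypothetical) device cannot implement faithfully."""
    if requested not in supported:
        raise ValueError(f"unsupported overwrite mask: {requested!r}")
    return requested


# A hypothetical device that can preserve everything, overwrite settings only,
# or overwrite settings together with identifiers.
SUPPORTED = {
    Overwrite(0),
    Overwrite.SETTINGS,
    Overwrite.SETTINGS | Overwrite.IDENTIFIERS,
}
```

With this set, an empty mask means that only firmware binaries are updated, while a request for ``Overwrite.IDENTIFIERS`` alone is rejected instead of being silently widened or narrowed.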
Firmware Loading
================
diff --git a/Documentation/networking/devlink/devlink-health.rst b/Documentation/networking/devlink/devlink-health.rst
index 0c99b11f05f9..e0b8cfed610a 100644
--- a/Documentation/networking/devlink/devlink-health.rst
+++ b/Documentation/networking/devlink/devlink-health.rst
@@ -24,7 +24,7 @@ attributes of the health reporting and recovery procedures.
The ``devlink`` health reporter:
Device driver creates a "health reporter" per each error/health type.
-Error/Health type can be a known/generic (eg pci error, fw error, rx/tx error)
+Error/Health type can be a known/generic (e.g. PCI error, fw error, rx/tx error)
or unknown (driver specific).
For each registered health reporter a driver can issue error/health reports
asynchronously. All health reports handling is done by ``devlink``.
@@ -33,7 +33,7 @@ Device driver can provide specific callbacks for each "health reporter", e.g.:
* Recovery procedures
* Diagnostics procedures
* Object dump procedures
- * OOB initial parameters
+ * Out Of Box initial parameters
Different parts of the driver can register different types of health reporters
with different handlers.
@@ -46,11 +46,31 @@ Once an error is reported, devlink health will perform the following actions:
* A log is being send to the kernel trace events buffer
* Health status and statistics are being updated for the reporter instance
* Object dump is being taken and saved at the reporter instance (as long as
- there is no other dump which is already stored)
+ auto-dump is set and there is no other dump which is already stored)
* Auto recovery attempt is being done. Depends on:
+
- Auto-recovery configuration
- Grace period vs. time passed since last recover
+Devlink formatted message
+=========================
+
+To handle devlink health diagnose and health dump requests, devlink creates a
+formatted message structure ``devlink_fmsg`` and sends it to the driver's
+callback, which fills in the data using the devlink fmsg API.
+
+Devlink fmsg is a mechanism to pass descriptors between drivers and devlink, in
+a JSON-like format. The API allows the driver to add nested attributes such as
+object, object pair and value array, in addition to attributes such as name and
+value.
+
+The driver should use this API to fill the fmsg context in a format which
+devlink later translates to a netlink message. When devlink needs to send
+the data to the netlink layer using SKBs, it fragments the data between
+different SKBs. To make this fragmentation possible, it uses virtual nest
+attributes instead of actual netlink nesting, which cannot be divided between
+different SKBs.
+
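The JSON-like nesting described above can be pictured with a toy model. This is a Python sketch for illustration only: ``FmsgModel`` and its method names are hypothetical and do not reproduce the kernel's devlink fmsg API. It only shows how nest start/end calls and name/value pairs build up nested objects and arrays.

```python
class FmsgModel:
    """Toy model of a formatted message built from nest start/end calls."""

    def __init__(self):
        self._stack = [[]]   # the synthetic top-level list collects finished items

    def _add(self, value, name=None):
        top = self._stack[-1]
        if isinstance(top, dict):
            if name is None:
                raise ValueError("values inside an object need a name")
            top[name] = value
        else:
            if name is not None:
                raise ValueError("named pairs belong inside an object")
            top.append(value)

    def obj_nest_start(self):
        obj = {}
        self._add(obj)
        self._stack.append(obj)

    def arr_pair_nest_start(self, name):
        arr = []
        self._add(arr, name)
        self._stack.append(arr)

    def nest_end(self):
        if len(self._stack) == 1:
            raise ValueError("no open nest")
        self._stack.pop()

    def pair_put(self, name, value):
        self._add(value, name)

    def result(self):
        if len(self._stack) != 1:
            raise ValueError("unbalanced nests")
        return self._stack[0]


# Build {"reporter": "tx", "queues": [{"index": 0, "stalled": true}]}.
msg = FmsgModel()
msg.obj_nest_start()
msg.pair_put("reporter", "tx")
msg.arr_pair_nest_start("queues")
msg.obj_nest_start()
msg.pair_put("index", 0)
msg.pair_put("stalled", True)
msg.nest_end()   # closes the queue object
msg.nest_end()   # closes the "queues" array
msg.nest_end()   # closes the outer object
```

Because every nest is opened and closed explicitly, a stream of such calls can be cut at any point and resumed in the next buffer. That is the property the virtual nest attributes provide.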
User Interface
==============
@@ -72,14 +92,18 @@ via ``devlink``, e.g per error type (per health reporter):
* - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
- Allows reporter-related configuration setting.
* - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
- - Triggers a reporter's recovery procedure.
+ - Triggers the reporter's recovery procedure.
+ * - ``DEVLINK_CMD_HEALTH_REPORTER_TEST``
+ - Triggers a fake health event on the reporter. The effects of the test
+ event in terms of recovery flow should follow closely that of a real
+ event.
* - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
- - Retrieves diagnostics data from a reporter on a device.
+ - Retrieves current device state related to the reporter.
* - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
- Retrieves the last stored dump. Devlink health
- saves a single dump. If an dump is not already stored by the devlink
+ saves a single dump. If a dump is not already stored by devlink
for this reporter, devlink generates a new dump.
- dump output is defined by the reporter.
+ Dump output is defined by the reporter.
* - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
- Clears the last saved dump file for the specified reporter.
@@ -93,7 +117,7 @@ The following diagram provides a general overview of ``devlink-health``::
+--------------------------+
|request for ops
|(diagnose,
- mlx5_core devlink |recover,
+ driver devlink |recover,
|dump)
+--------+ +--------------------------+
| | | reporter| |
diff --git a/Documentation/networking/devlink/devlink-info.rst b/Documentation/networking/devlink/devlink-info.rst
index 3fe11401b838..1242b0e6826b 100644
--- a/Documentation/networking/devlink/devlink-info.rst
+++ b/Documentation/networking/devlink/devlink-info.rst
@@ -44,9 +44,11 @@ versions is generally discouraged - here, and via any other Linux API.
reported for two ports of the same device or on two hosts of
a multi-host device should be identical.
- .. note:: ``devlink-info`` API should be extended with a new field
- if devices want to report board/product serial number (often
- reported in PCI *Vital Product Data* capability).
+ * - ``board.serial_number``
+ - Board serial number of the device.
+
+ This is usually the serial number of the board, often available in
+ PCI *Vital Product Data*.
* - ``fixed``
- Group for hardware identifiers, and versions of components
@@ -196,15 +198,16 @@ fw.bundle_id
Unique identifier of the entire firmware bundle.
+fw.bootloader
+-------------
+
+Version of the bootloader.
+
Future work
===========
The following extensions could be useful:
- - product serial number - NIC boards often get labeled with a board serial
- number rather than ASIC serial number; it'd be useful to add board serial
- numbers to the API if they can be retrieved from the device;
-
- on-disk firmware file names - drivers list the file names of firmware they
may need to load onto devices via the ``MODULE_FIRMWARE()`` macro. These,
however, are per module, rather than per device. It'd be useful to list
diff --git a/Documentation/networking/devlink/devlink-linecard.rst b/Documentation/networking/devlink/devlink-linecard.rst
new file mode 100644
index 000000000000..6c0b8928bc13
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-linecard.rst
@@ -0,0 +1,122 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Devlink Line card
+=================
+
+Background
+==========
+
+The ``devlink-linecard`` mechanism is targeted for manipulation of
+line cards that serve as detachable PHY modules for a modular switch
+system. The following operations are provided:
+
+ * Get a list of supported line card types.
+ * Provision a slot with a specific line card type.
+ * Get and monitor the line card state and its changes.
+
+Depending on its type, a line card may contain one or more gearboxes
+to mux the lanes with a certain speed to multiple ports with lanes
+of a different speed. The line card ensures N:M mapping between
+the switch ASIC modules and physical front panel ports.
+
+Overview
+========
+
+Each line card devlink object is created by the device driver,
+according to the physical line card slots available on the device.
+
+Similar to a splitter cable, where the device might have no way
+of detecting the splitter cable geometry, the device
+might not have a way to detect the line card type. For such devices,
+the concept of provisioning is introduced. It allows the user to:
+
+ * Provision a line card slot with a certain line card type
+
+ - The device driver instructs the ASIC to prepare all
+ resources accordingly. The device driver creates
+ all instances, namely devlink ports and netdevices
+ that reside on the line card, according to the line card type
+ * Manipulate line card entities even without the line card
+ being physically connected or powered up
+ * Set up a splitter cable on line card ports
+
+ - As on ordinary ports, the user may provision a splitter
+ cable of a certain type, without the cable needing to
+ be physically connected to the port
+ * Configure devlink ports and netdevices
+ * Configure devlink ports and netdevices
+
+Netdevice carrier is decided as follows:
+
+ * Line card is not inserted or powered-down
+
+ - The carrier is always down
+ * Line card is inserted and powered up
+
+ - The carrier is decided as for ordinary port netdevice
+
+Line card state
+===============
+
+The ``devlink-linecard`` mechanism supports the following line card states:
+
+ * ``unprovisioned``: Line card is not provisioned on the slot.
+ * ``unprovisioning``: Line card slot is currently being unprovisioned.
+ * ``provisioning``: Line card slot is currently in a process of being provisioned
+ with a line card type.
+ * ``provisioning_failed``: Provisioning was not successful.
+ * ``provisioned``: Line card slot is provisioned with a type.
+ * ``active``: Line card is powered-up and active.
+
+The following diagram provides a general overview of ``devlink-linecard``
+state transitions::
+
+                                          +-------------------------+
+                                          |                         |
+       +---------------------------------->      unprovisioned      |
+       |                                  |                         |
+       |                                  +--------|-------^--------+
+       |                                           |       |
+       |                                           |       |
+       |                                  +--------v-------|--------+
+       |                                  |                         |
+       |                                  |       provisioning      |
+       |                                  |                         |
+       |                                  +------------|------------+
+       |                                               |
+       |                 +-----------------------------+
+       |                 |                             |
+       |    +------------v------------+   +------------v------------+   +-------------------------+
+       |    |                         |   |                         ---->                         |
+       +-----   provisioning_failed   |   |       provisioned       |   |         active          |
+       |    |                         |   |                         <----                         |
+       |    +------------^------------+   +------------|------------+   +-------------------------+
+       |                 |                             |
+       |                 |                             |
+       |                 |                +------------v------------+
+       |                 |                |                         |
+       |                 |                |     unprovisioning      |
+       |                 |                |                         |
+       |                 |                +------------|------------+
+       |                 |                             |
+       |                 +-----------------------------+
+       |                                               |
+       +-----------------------------------------------+
+
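The state transitions in the diagram above can also be written down as a table. This is a Python sketch that encodes one reading of the diagram. The ``TRANSITIONS`` table and the helper names are assumptions for illustration and are not taken from the devlink core.

```python
# Allowed next states for each line card state, as read from the diagram.
TRANSITIONS = {
    "unprovisioned": {"provisioning"},
    "provisioning": {"provisioned", "provisioning_failed", "unprovisioned"},
    "provisioning_failed": {"unprovisioned"},
    "provisioned": {"active", "unprovisioning"},
    "active": {"provisioned"},
    "unprovisioning": {"unprovisioned", "provisioning_failed"},
}


def can_transition(src: str, dst: str) -> bool:
    """Return True if the diagram has an arrow from src to dst."""
    return dst in TRANSITIONS.get(src, set())


def is_valid_path(states: list) -> bool:
    """Check that every consecutive pair of states is an allowed transition."""
    return all(can_transition(a, b) for a, b in zip(states, states[1:]))
```

A monitoring application could use such a table to flag state notifications that arrive in an unexpected order.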
+
+Example usage
+=============
+
+.. code:: shell
+
+ $ devlink lc show [ DEV [ lc LC_INDEX ] ]
+ $ devlink lc set DEV lc LC_INDEX [ { type LC_TYPE | notype } ]
+
+ # Show current line card configuration and status for all slots:
+ $ devlink lc
+
+ # Set slot 8 to be provisioned with type "16x100G":
+ $ devlink lc set pci/0000:01:00.0 lc 8 type 16x100G
+
+ # Set slot 8 to be unprovisioned:
+ $ devlink lc set pci/0000:01:00.0 lc 8 notype
diff --git a/Documentation/networking/devlink/devlink-params.rst b/Documentation/networking/devlink/devlink-params.rst
index d075fd090b3d..4e01dc32bc08 100644
--- a/Documentation/networking/devlink/devlink-params.rst
+++ b/Documentation/networking/devlink/devlink-params.rst
@@ -97,14 +97,43 @@ own name.
* - ``enable_roce``
- Boolean
- Enable handling of RoCE traffic in the device.
+ * - ``enable_eth``
+ - Boolean
+ - When enabled, the device driver will instantiate an Ethernet-specific
+ auxiliary device of the devlink device.
+ * - ``enable_rdma``
+ - Boolean
+ - When enabled, the device driver will instantiate an RDMA-specific
+ auxiliary device of the devlink device.
+ * - ``enable_vnet``
+ - Boolean
+ - When enabled, the device driver will instantiate a VDPA
+ networking-specific auxiliary device of the devlink device.
+ * - ``enable_iwarp``
+ - Boolean
+ - Enable handling of iWARP traffic in the device.
* - ``internal_err_reset``
- Boolean
- When enabled, the device driver will reset the device on internal
errors.
* - ``max_macs``
- u32
- - Specifies the maximum number of MAC addresses per ethernet port of
- this device.
+ - Typically, the MAC addresses of macvlan and vlan net devices are also
+ programmed in their parent netdevice's function rx filter. This parameter
+ limits the maximum number of unicast MAC address filters used to receive
+ traffic per Ethernet port of this device.
* - ``region_snapshot_enable``
- Boolean
- Enable capture of ``devlink-region`` snapshots.
+ * - ``enable_remote_dev_reset``
+ - Boolean
+ - Enable device reset by remote host. When cleared, the device driver
+ will NACK any attempt by another host to reset the device. This parameter
+ is useful for setups where a device is shared by different hosts, such
+ as a multi-host setup.
+ * - ``io_eq_size``
+ - u32
+ - Control the size of I/O completion EQs.
+ * - ``event_eq_size``
+ - u32
+ - Control the size of asynchronous control events EQ.
diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst
new file mode 100644
index 000000000000..562f46b41274
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-port.rst
@@ -0,0 +1,443 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _devlink_port:
+
+============
+Devlink Port
+============
+
+``devlink-port`` is a port that exists on the device. It has a logically
+separate ingress/egress point of the device. A devlink port can be any one
+of many flavours. A devlink port flavour along with port attributes
+describe what a port represents.
+
+A device driver that intends to publish a devlink port sets the
+devlink port attributes and registers the devlink port.
+
+Devlink port flavours are described below.
+
+.. list-table:: List of devlink port flavours
+ :widths: 33 90
+
+ * - Flavour
+ - Description
+ * - ``DEVLINK_PORT_FLAVOUR_PHYSICAL``
+ - Any kind of physical port. This can be an eswitch physical port or any
+ other physical port on the device.
+ * - ``DEVLINK_PORT_FLAVOUR_DSA``
+ - This indicates a DSA interconnect port.
+ * - ``DEVLINK_PORT_FLAVOUR_CPU``
+ - This indicates a CPU port applicable only to DSA.
+ * - ``DEVLINK_PORT_FLAVOUR_PCI_PF``
+ - This indicates an eswitch port representing a port of PCI
+ physical function (PF).
+ * - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
+ - This indicates an eswitch port representing a port of PCI
+ virtual function (VF).
+ * - ``DEVLINK_PORT_FLAVOUR_PCI_SF``
+ - This indicates an eswitch port representing a port of PCI
+ subfunction (SF).
+ * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
+ - This indicates a virtual port for the PCI virtual function.
+
+A devlink port can have one of the types described below, depending on its link layer.
+
+.. list-table:: List of devlink port types
+ :widths: 23 90
+
+ * - Type
+ - Description
+ * - ``DEVLINK_PORT_TYPE_ETH``
+ - Driver should set this port type when a link layer of the port is
+ Ethernet.
+ * - ``DEVLINK_PORT_TYPE_IB``
+ - Driver should set this port type when a link layer of the port is
+ InfiniBand.
+ * - ``DEVLINK_PORT_TYPE_AUTO``
+ - This type is indicated by the user when driver should detect the port
+ type automatically.
+
+PCI controllers
+---------------
+In most cases a PCI device has only one controller. A controller consists of
+potentially multiple physical functions, virtual functions and subfunctions. A
+function consists of one or more ports. Each such port is represented by a
+devlink eswitch port.
+
+A PCI device connected to multiple CPUs or multiple PCI root complexes or a
+SmartNIC, however, may have multiple controllers. For a device with multiple
+controllers, each controller is distinguished by a unique controller number.
+An eswitch on the PCI device supports ports of multiple controllers.
+
+An example view of a system with two controllers::
+
+                 ---------------------------------------------------------
+                 |                                                       |
+                 |           --------- ---------         ------- ------- |
+    -----------  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
+    | server  |  | -------   ----/---- ---/----- ------- ---/--- ---/--- |
+    | pci rc  |=== | pf0 |______/________/       | pf1 |___/_______/     |
+    | connect |  | -------                       -------                 |
+    -----------  |     | controller_num=1 (no eswitch)                   |
+                 ------|--------------------------------------------------
+                 (internal wire)
+                       |
+                 ---------------------------------------------------------
+                 | devlink eswitch ports and reps                        |
+                 | ----------------------------------------------------- |
+                 | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
+                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
+                 | ----------------------------------------------------- |
+                 | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
+                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
+                 | ----------------------------------------------------- |
+                 |                                                       |
+                 |                                                       |
+    -----------  |           --------- ---------         ------- ------- |
+    | smartNIC|  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
+    | pci rc  |==| -------   ----/---- ---/----- ------- ---/--- ---/--- |
+    | connect |  | | pf0 |______/________/       | pf1 |___/_______/     |
+    -----------  | -------                       -------                 |
+                 |                                                       |
+                 |  local controller_num=0 (eswitch)                     |
+                 ---------------------------------------------------------
+
+In the above example, the external controller (identified by controller number = 1)
+doesn't have the eswitch. Local controller (identified by controller number = 0)
+has the eswitch. The Devlink instance on the local controller has eswitch
+devlink ports for both the controllers.
+
+Function configuration
+======================
+
+Users can configure one or more function attributes before enumerating the PCI
+function. Usually this means that the user should configure a function
+attribute before a bus-specific device for the function is created. However,
+when SRIOV is enabled, virtual function devices are created on the PCI bus.
+Hence, a function attribute should be configured before binding the virtual
+function device to the driver. For subfunctions, this means the user should
+configure the port function attribute before activating the port function.
+
+A user may set the hardware address of the function using the
+``devlink port function set hw_addr`` command. For an Ethernet port function,
+this means a MAC address.
+
+Users may also set the RoCE capability of the function using the
+``devlink port function set roce`` command.
+
+Users may also set the function as migratable using the
+``devlink port function set migratable`` command.
+
+Users may also set the IPsec crypto capability of the function using the
+``devlink port function set ipsec_crypto`` command.
+
+Users may also set the IPsec packet capability of the function using the
+``devlink port function set ipsec_packet`` command.
+
+Function attributes
+===================
+
+MAC address setup
+-----------------
+The configured MAC address of the PCI VF/SF will be used by netdevice and rdma
+device created for the PCI VF/SF.
+
+- Get the MAC address of the VF identified by its unique devlink port index::
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00
+
+- Set the MAC address of the VF identified by its unique devlink port index::
+
+ $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:11:22:33:44:55
+
+- Get the MAC address of the SF identified by its unique devlink port index::
+
+ $ devlink port show pci/0000:06:00.0/32768
+ pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
+ function:
+ hw_addr 00:00:00:00:00:00
+
+- Set the MAC address of the SF identified by its unique devlink port index::
+
+ $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88
+
+ $ devlink port show pci/0000:06:00.0/32768
+ pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
+ function:
+ hw_addr 00:00:00:00:88:88
+
+RoCE capability setup
+---------------------
+Not all PCI VFs/SFs require RoCE capability.
+
+Disabling RoCE capability saves system memory per PCI VF/SF.
+
+When the user disables RoCE capability for a VF/SF, user applications cannot
+send or receive any RoCE packets through this VF/SF and the RoCE GID table for
+this PCI function will be empty.
+
+When RoCE capability is disabled in the device using the port function
+attribute, the VF/SF driver cannot override it.
+
+- Get RoCE capability of the VF device::
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 roce enable
+
+- Set RoCE capability of the VF device::
+
+ $ devlink port function set pci/0000:06:00.0/2 roce disable
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 roce disable
+
+Migratable capability setup
+---------------------------
+Live migration is the process of transferring a live virtual machine
+from one physical host to another without disrupting its normal
+operation.
+
+Users who want PCI VFs to be able to perform live migration need to
+explicitly enable the VF migratable capability.
+
+When the user enables migratable capability for a VF, and the hypervisor (HV)
+binds the VF to a VFIO driver with migration support, the user can migrate the
+VM with this VF from one HV to a different one.
+
+However, when migratable capability is enabled, the device disables features
+which cannot be migrated. Because the migratable capability can thus impose
+limitations on a VF, the decision is left to the user.
+
+Example of live migration (LM) with migratable function configuration:
+
+- Get migratable capability of the VF device::
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 migratable disable
+
+- Set migratable capability of the VF device::
+
+ $ devlink port function set pci/0000:06:00.0/2 migratable enable
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 migratable enable
+
+- Bind VF to VFIO driver with migration support::
+
+ $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
+ $ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
+ $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind
+
+Attach VF to the VM.
+Start the VM.
+Perform live migration.
+
+IPsec crypto capability setup
+-----------------------------
+When the user enables IPsec crypto capability for a VF, user applications can
+offload XFRM state crypto operations (encrypt/decrypt) to this VF.
+
+When IPsec crypto capability is disabled (default) for a VF, the XFRM state is
+processed in software by the kernel.
+
+- Get IPsec crypto capability of the VF device::
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 ipsec_crypto disabled
+
+- Set IPsec crypto capability of the VF device::
+
+ $ devlink port function set pci/0000:06:00.0/2 ipsec_crypto enable
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 ipsec_crypto enabled
+
+IPsec packet capability setup
+-----------------------------
+When the user enables IPsec packet capability for a VF, user applications can
+offload XFRM state and policy crypto operations (encrypt/decrypt) to this VF,
+as well as IPsec encapsulation.
+
+When IPsec packet capability is disabled (default) for a VF, the XFRM state and
+policy are processed in software by the kernel.
+
+- Get IPsec packet capability of the VF device::
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 ipsec_packet disabled
+
+- Set IPsec packet capability of the VF device::
+
+ $ devlink port function set pci/0000:06:00.0/2 ipsec_packet enable
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 ipsec_packet enabled
+
+Subfunction
+============
+
+A subfunction is a lightweight function that has a parent PCI function on which
+it is deployed. Subfunctions are created and deployed in units of one. Unlike
+SRIOV VFs, a subfunction doesn't require its own PCI virtual function.
+A subfunction communicates with the hardware through the parent PCI function.
+
+To use a subfunction, a three-step setup sequence is followed:
+
+1) create - create a subfunction;
+2) configure - configure subfunction attributes;
+3) deploy - deploy the subfunction;
+
+Subfunction management is done using the devlink port user interface.
+The user performs setup on the subfunction management device.
+
+(1) Create
+----------
+A subfunction is created using the devlink port interface. A user adds the
+subfunction by adding a devlink port of subfunction flavour. The devlink
+kernel code calls down to the subfunction management driver (devlink ops) and
+asks it to create a subfunction devlink port. The driver then instantiates the
+subfunction port and any associated objects such as health reporters and
+the representor netdevice.
+
+(2) Configure
+-------------
+A subfunction devlink port is created but it is not active yet. That means the
+entities are created on the devlink side and the e-switch port representor is
+created, but the subfunction device itself is not created. A user might use the
+e-switch port representor to apply settings, such as putting it into a bridge
+or adding TC rules. A user might as well configure the hardware address (such
+as the MAC address) of the subfunction while the subfunction is inactive.
+
+(3) Deploy
+----------
+Once a subfunction is configured, the user must activate it to use it. Upon
+activation, the subfunction management driver asks the subfunction management
+device to instantiate the subfunction device on a particular PCI function.
+A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`.
+At this point a matching subfunction driver binds to the subfunction's auxiliary device.
+
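The three steps above map onto three ``devlink`` invocations. This is a Python sketch that only builds the command lines. It assumes the ``devlink port add ... flavour pcisf`` and ``devlink port function set ... state active`` syntax of the iproute2 ``devlink-port`` manual, and it takes the port index as a parameter because the kernel assigns the index when the port is created. The function name is hypothetical and the values in the usage note reuse the addresses from the examples in this document.

```python
import shlex


def sf_setup_commands(dev: str, pfnum: int, sfnum: int,
                      port_index: int, hw_addr: str) -> list:
    """Return argv lists for the create, configure and deploy steps."""
    port = f"{dev}/{port_index}"
    return [
        # (1) Create: add a devlink port of subfunction flavour.
        ["devlink", "port", "add", dev, "flavour", "pcisf",
         "pfnum", str(pfnum), "sfnum", str(sfnum)],
        # (2) Configure: set the hardware address while the SF is inactive.
        ["devlink", "port", "function", "set", port, "hw_addr", hw_addr],
        # (3) Deploy: activate the subfunction.
        ["devlink", "port", "function", "set", port, "state", "active"],
    ]


def as_shell(commands: list) -> list:
    """Render the argv lists as shell command lines for display."""
    return [shlex.join(argv) for argv in commands]
```

For example, ``as_shell(sf_setup_commands("pci/0000:06:00.0", 0, 88, 32768, "00:00:00:00:88:88"))`` yields the three commands in order. In practice the port index has to be read from the output of the first command before the other two can be issued.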
+Rate object management
+======================
+
+Devlink provides an API to manage the tx rates of a single devlink port or a
+group. This is done through rate objects, which can be one of two types:
+
+``leaf``
+ Represents a single devlink port; created/destroyed by the driver. Since a
+ leaf has a 1:1 mapping to its devlink port, in userspace it is referred to as
+ ``pci/<bus_addr>/<port_index>``;
+
+``node``
+ Represents a group of rate objects (leafs and/or nodes); created/deleted by
+ request from userspace; initially empty (no rate objects added). In
+ userspace it is referred to as ``pci/<bus_addr>/<node_name>``, where
+ ``node_name`` can be any identifier except a decimal number, to avoid
+ collisions with leafs.
+
+The API allows configuring the following rate object parameters:
+
+``tx_share``
+ Minimum TX rate value shared among all other rate objects, or among the rate
+ objects that are part of the parent group, if it is a part of the same group.
+
+``tx_max``
+ Maximum TX rate value.
+
+``tx_priority``
+ Allows for usage of strict priority arbiter among siblings. This
+ arbitration scheme attempts to schedule nodes based on their priority
+ as long as the nodes remain within their bandwidth limit. The higher the
+ priority the higher the probability that the node will get selected for
+ scheduling.
+
+``tx_weight``
+ Allows for usage of Weighted Fair Queuing arbitration scheme among
+ siblings. This arbitration scheme can be used simultaneously with the
+ strict priority. As a node is configured with a higher rate it gets more
+ BW relative to its siblings. Values are relative, like percentage
+ points: they tell how much BW a node should take relative to
+ its siblings.
+
+``parent``
+ Parent node name. Parent node rate limits are considered as additional limits
+ to all node children limits. ``tx_max`` is an upper limit for children.
+ ``tx_share`` is a total bandwidth distributed among children.
+
+``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
+nodes with the same priority form a WFQ subgroup in the sibling group
+and arbitration among them is based on assigned weights.
+
+Arbitration flow from the high level:
+
+#. Choose the node, or group of nodes, with the highest priority that stays
+ within the BW limit and is not blocked. Use ``tx_priority`` as a
+ parameter for this arbitration.
+
+#. If a group of nodes has the same priority, perform WFQ arbitration on
+ that subgroup. Use ``tx_weight`` as a parameter for this arbitration.
+
+#. Select the winner node, and continue the arbitration flow among its
+ children until a leaf node is reached and the winner is established.
+
+#. If all the nodes from the highest priority sub-group are satisfied, or
+ have overused their assigned BW, move to the lower priority nodes.
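
The flow above can be sketched in a few lines. This is a simplified model
under stated assumptions, not the scheduler of any real device and not part of
the patched file: ``RateNode``, ``pick_sibling`` and ``arbitrate`` are
hypothetical names, the ``eligible`` flag stands in for "within the BW limit
and not blocked", and WFQ is approximated by picking the node with the least
service received per unit of weight.

```python
from dataclasses import dataclass, field
from fractions import Fraction


@dataclass
class RateNode:
    name: str
    tx_priority: int = 0
    tx_weight: int = 1
    eligible: bool = True  # stands in for "within BW limit and not blocked"
    served: int = 0        # sketch-only bookkeeping used to approximate WFQ
    children: list = field(default_factory=list)


def pick_sibling(siblings):
    """Strict priority first, then weighted choice inside the top subgroup."""
    candidates = [n for n in siblings if n.eligible]
    if not candidates:
        return None
    top = max(n.tx_priority for n in candidates)
    subgroup = [n for n in candidates if n.tx_priority == top]
    # Least service per unit of weight wins; the name breaks ties.
    return min(subgroup, key=lambda n: (Fraction(n.served, n.tx_weight), n.name))


def arbitrate(siblings):
    """Walk down the tree from a sibling group until a leaf wins."""
    path = []
    level = siblings
    while level:
        node = pick_sibling(level)
        if node is None:
            return None
        path.append(node)
        level = node.children
    for node in path:
        node.served += 1
    return path[-1] if path else None
```

With two siblings at the same priority and weights 1 and 3, repeated calls to
``arbitrate`` pick the heavier node three times as often. A lower-priority
sibling is picked only once the higher-priority ones are marked ineligible.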
+
+Driver implementations are allowed to support both or either rate object types
+and setting methods of their parameters. Additionally driver implementation
+may export nodes/leafs and their child-parent relationships.
+
+Terms and Definitions
+=====================
+
+.. list-table:: Terms and Definitions
+ :widths: 22 90
+
+ * - Term
+ - Definitions
+ * - ``PCI device``
+ - A physical PCI device that has one or more PCI buses and consists of one
+ or more PCI controllers.
+ * - ``PCI controller``
+ - A controller consists of potentially multiple physical functions,
+ virtual functions and subfunctions.
+ * - ``Port function``
+ - An object to manage the function of a port.
+ * - ``Subfunction``
+ - A lightweight function that has a parent PCI function on which it is
+ deployed.
+ * - ``Subfunction device``
+ - A bus device of the subfunction, usually on an auxiliary bus.
+ * - ``Subfunction driver``
+ - A device driver for the subfunction auxiliary device.
+ * - ``Subfunction management device``
+ - A PCI physical function that supports subfunction management.
+ * - ``Subfunction management driver``
+ - A device driver for a PCI physical function that supports
+ subfunction management using the devlink port interface.
+ * - ``Subfunction host driver``
+ - A device driver for a PCI physical function that hosts subfunction
+ devices. In most cases it is the same as the subfunction management
+ driver. When a subfunction is used on an external controller, the
+ subfunction management and host drivers are different.
diff --git a/Documentation/networking/devlink/devlink-region.rst b/Documentation/networking/devlink/devlink-region.rst
index 3654c3e9658f..9232cd7da301 100644
--- a/Documentation/networking/devlink/devlink-region.rst
+++ b/Documentation/networking/devlink/devlink-region.rst
@@ -22,7 +22,7 @@ The major benefit to creating a region is to provide access to internal
address regions that are otherwise inaccessible to the user.
Regions may also be used to provide an additional way to debug complex error
-states, but see also :doc:`devlink-health`
+states, but see also Documentation/networking/devlink/devlink-health.rst
Regions may optionally support capturing a snapshot on demand via the
``DEVLINK_CMD_REGION_NEW`` netlink message. A driver wishing to allow
@@ -31,6 +31,15 @@ in its ``devlink_region_ops`` structure. If snapshot id is not set in
the ``DEVLINK_CMD_REGION_NEW`` request kernel will allocate one and send
the snapshot information to user space.
+Regions may optionally allow directly reading from their contents without a
+snapshot. Direct read requests are not atomic. In particular a read request
+of size 256 bytes or larger will be split into multiple chunks. If atomic
+access is required, use a snapshot. A driver wishing to enable this for a
+region should implement the ``.read`` callback in the ``devlink_region_ops``
+structure. User space can request a direct read by using the
+``DEVLINK_ATTR_REGION_DIRECT`` attribute instead of specifying a snapshot
+id.
+
example usage
-------------
@@ -44,8 +53,8 @@ example usage
# Show all of the exposed regions with region sizes:
$ devlink region show
- pci/0000:00:05.0/cr-space: size 1048576 snapshot [1 2]
- pci/0000:00:05.0/fw-health: size 64 snapshot [1 2]
+ pci/0000:00:05.0/cr-space: size 1048576 snapshot [1 2] max 8
+ pci/0000:00:05.0/fw-health: size 64 snapshot [1 2] max 8
# Delete a snapshot using:
$ devlink region del pci/0000:00:05.0/cr-space snapshot 1
@@ -65,6 +74,10 @@ example usage
$ devlink region read pci/0000:00:05.0/fw-health snapshot 1 address 0 length 16
0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
+ # Read from the region without a snapshot
+ $ devlink region read pci/0000:00:05.0/fw-health address 16 length 16
+ 0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
+
As regions are likely very device or driver specific, no generic regions are
defined. See the driver-specific documentation files for information on the
specific regions a driver supports.
diff --git a/Documentation/networking/devlink/devlink-reload.rst b/Documentation/networking/devlink/devlink-reload.rst
new file mode 100644
index 000000000000..2fb0269b2054
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-reload.rst
@@ -0,0 +1,90 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+Devlink Reload
+==============
+
+``devlink-reload`` provides a mechanism to reinitialize driver entities,
+applying new ``devlink-params`` and ``devlink-resources`` values. It also
+provides a mechanism to activate firmware.
+
+Reload Actions
+==============
+
+The user may select a reload action.
+By default, the ``driver_reinit`` action is selected.
+
+.. list-table:: Possible reload actions
+ :widths: 5 90
+
+ * - Name
+ - Description
+ * - ``driver_reinit``
+ - Devlink driver entities re-initialization, including applying
+ new values to the devlink entities that are used during driver
+ load:
+
+ * ``devlink-params`` in configuration mode ``driverinit``
+ * ``devlink-resources``
+
+ Other devlink entities may persist across the re-initialization:
+
+ * ``devlink-health-reporter``
+ * ``devlink-region``
+
+ The rest of the devlink entities have to be removed and re-added.
+ * - ``fw_activate``
+ - Firmware activate. Activates new firmware if such an image is stored and
+ pending activation. If no limit is specified, this action may involve a
+ firmware reset. If no new image is pending, this action reloads the
+ current firmware image.
+
+Note that even though the user asks for a specific action, the driver
+implementation might need to perform another action alongside
+it. For example, some drivers do not support driver reinitialization
+being performed without fw activation. Therefore, the devlink reload
+command returns the list of actions which were actually performed.
+
+Reload Limits
+=============
+
+By default reload actions are not limited and driver implementation may
+include reset or downtime as needed to perform the actions.
+
+However, some drivers support action limits, which limit the action
+implementation to specific constraints.
+
+.. list-table:: Possible reload limits
+ :widths: 5 90
+
+ * - Name
+ - Description
+ * - ``no_reset``
+ - No reset allowed, no down time allowed, no link flap and no
+ configuration is lost.
+
+Change Namespace
+================
+
+The netns option allows the user to move devlink instances into
+namespaces during a devlink reload operation.
+By default all devlink instances are created in init_net and stay there.
+
+example usage
+-------------
+
+.. code:: shell
+
+ $ devlink dev reload help
+ $ devlink dev reload DEV [ netns { PID | NAME | ID } ] [ action { driver_reinit | fw_activate } ] [ limit no_reset ]
+
+ # Run reload command for devlink driver entities re-initialization:
+ $ devlink dev reload pci/0000:82:00.0 action driver_reinit
+ reload_actions_performed:
+ driver_reinit
+
+ # Run reload command to activate firmware:
+ # Note that mlx5 driver reloads the driver while activating firmware
+ $ devlink dev reload pci/0000:82:00.0 action fw_activate
+ reload_actions_performed:
+ driver_reinit fw_activate
diff --git a/Documentation/networking/devlink/devlink-resource.rst b/Documentation/networking/devlink/devlink-resource.rst
index 93e92d2f0752..3d5ae51e65a2 100644
--- a/Documentation/networking/devlink/devlink-resource.rst
+++ b/Documentation/networking/devlink/devlink-resource.rst
@@ -23,6 +23,20 @@ current size and related sub resources. To access a sub resource, you
specify the path of the resource. For example ``/IPv4/fib`` is the id for
the ``fib`` sub-resource under the ``IPv4`` resource.
+Generic Resources
+=================
+
+Generic resources are used to describe resources that can be shared by multiple
+device drivers and their description must be added to the following table:
+
+.. list-table:: List of Generic Resources
+ :widths: 10 90
+
+ * - Name
+ - Description
+ * - ``physical_ports``
+ - A limited capacity of physical ports that the switch ASIC can support
+
example usage
-------------
diff --git a/Documentation/networking/devlink/devlink-selftests.rst b/Documentation/networking/devlink/devlink-selftests.rst
new file mode 100644
index 000000000000..c0aa1f3aef0d
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-selftests.rst
@@ -0,0 +1,38 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+=================
+Devlink Selftests
+=================
+
+The ``devlink-selftests`` API allows executing selftests on the device.
+
+Tests Mask
+==========
+The ``devlink-selftests`` command should be run with a mask indicating
+the tests to be executed.
+
+Tests Description
+=================
+The following is a list of tests that drivers may execute.
+
+.. list-table:: List of tests
+ :widths: 5 90
+
+ * - Name
+ - Description
+ * - ``DEVLINK_SELFTEST_FLASH``
+ - Devices may have the firmware on non-volatile memory on the board, e.g.
+ flash. This particular test helps to run a flash selftest on the device.
+ Implementation of the test is left to the driver/firmware.
+
+example usage
+-------------
+
+.. code:: shell
+
+ # Query selftests supported on the devlink device
+ $ devlink dev selftests show DEV
+ # Query selftests supported on all devlink devices
+ $ devlink dev selftests show
+ # Executes selftests on the device
+ $ devlink dev selftests run DEV id flash
diff --git a/Documentation/networking/devlink/devlink-trap.rst b/Documentation/networking/devlink/devlink-trap.rst
index 1e3f3ffee248..2c14dfe69b3a 100644
--- a/Documentation/networking/devlink/devlink-trap.rst
+++ b/Documentation/networking/devlink/devlink-trap.rst
@@ -405,6 +405,96 @@ be added to the following table:
- ``control``
- Traps packets logged during processing of flow action trap (e.g., via
tc's trap action)
+ * - ``early_drop``
+ - ``drop``
+ - Traps packets dropped due to the RED (Random Early Detection) algorithm
+ (i.e., early drops)
+ * - ``vxlan_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the VXLAN header parsing which
+ might be because of packet truncation or because the I flag is not set.
+ * - ``llc_snap_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the LLC+SNAP header parsing
+ * - ``vlan_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the VLAN header parsing. Could
+ include unexpected packet truncation.
+ * - ``pppoe_ppp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the PPPoE+PPP header parsing.
+ This could include finding a session ID of 0xFFFF (which is reserved and
+ not for use), a PPPoE length which is larger than the frame received or
+ any common error on this type of header
+ * - ``mpls_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the MPLS header parsing which
+ could include unexpected header truncation
+ * - ``arp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the ARP header parsing
+ * - ``ip_1_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the first IP header parsing.
+ This packet trap could include packets which do not pass an IP checksum
+ check or a header length check (a minimum of 20 bytes), or which suffer
+ from packet truncation so that the total length field exceeds the
+ received packet length, etc.
+ * - ``ip_n_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the parsing of the last IP
+ header (the inner one in case of an IP over IP tunnel). The same common
+ error checking is performed here as for the ip_1_parsing trap
+ * - ``gre_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the GRE header parsing
+ * - ``udp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the UDP header parsing.
+ This packet trap could include checksum errors, an improper UDP
+ length detected (smaller than 8 bytes) or detection of header
+ truncation.
+ * - ``tcp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the TCP header parsing.
+ This could include TCP checksum errors, improper combination of SYN, FIN
+ and/or RESET etc.
+ * - ``ipsec_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the IPSEC header parsing
+ * - ``sctp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the SCTP header parsing.
+ This would mean that port number 0 was used or that the header is
+ truncated.
+ * - ``dccp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the DCCP header parsing
+ * - ``gtp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the GTP header parsing
+ * - ``esp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the ESP header parsing
+ * - ``blackhole_nexthop``
+ - ``drop``
+ - Traps packets that the device decided to drop in case they hit a
+ blackhole nexthop
+ * - ``dmac_filter``
+ - ``drop``
+ - Traps incoming packets that the device decided to drop because
+ the destination MAC is not configured in the MAC table and
+ the interface is not in promiscuous mode
+ * - ``eapol``
+ - ``control``
+ - Traps "Extensible Authentication Protocol over LAN" (EAPOL) packets
+ specified in IEEE 802.1X
+ * - ``locked_port``
+ - ``drop``
+ - Traps packets that the device decided to drop because they failed the
+ locked bridge port check. That is, packets that were received via a
+ locked port and whose {SMAC, VID} does not correspond to an FDB entry
+ pointing to the port
Driver-specific Packet Traps
============================
@@ -415,8 +505,9 @@ help debug packet drops caused by these exceptions. The following list includes
links to the description of driver-specific traps registered by various device
drivers:
- * :doc:`netdevsim`
- * :doc:`mlxsw`
+ * Documentation/networking/devlink/netdevsim.rst
+ * Documentation/networking/devlink/mlxsw.rst
+ * Documentation/networking/devlink/prestera.rst
.. _Generic-Packet-Trap-Groups:
@@ -486,6 +577,10 @@ narrow. The description of these groups must be added to the following table:
- Contains packet traps for packets that should be locally delivered after
routing, but do not match more specific packet traps (e.g.,
``ipv4_bgp``)
+ * - ``external_delivery``
+ - Contains packet traps for packets that should be routed through an
+ external interface (e.g., management interface) that does not belong to
+ the same device (e.g., switch ASIC) as the ingress interface
* - ``ipv6``
- Contains packet traps for various IPv6 control packets (e.g., Router
Advertisements)
@@ -501,6 +596,12 @@ narrow. The description of these groups must be added to the following table:
* - ``acl_trap``
- Contains packet traps for packets that were trapped (logged) by the
device during ACL processing
+ * - ``parser_error_drops``
+ - Contains packet traps for packets that were marked by the device during
+ parsing as erroneous
+ * - ``eapol``
+ - Contains packet traps for "Extensible Authentication Protocol over LAN"
+ (EAPOL) packets specified in IEEE 802.1X
Packet Trap Policers
====================
diff --git a/Documentation/networking/devlink/etas_es58x.rst b/Documentation/networking/devlink/etas_es58x.rst
new file mode 100644
index 000000000000..3b857d82a44c
--- /dev/null
+++ b/Documentation/networking/devlink/etas_es58x.rst
@@ -0,0 +1,36 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+etas_es58x devlink support
+==========================
+
+This document describes the devlink features implemented by the
+``etas_es58x`` device driver.
+
+Info versions
+=============
+
+The ``etas_es58x`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``fw``
+ - running
+ - Version of the firmware running on the device. Also available
+ through ``ethtool -i`` as the first member of the
+ ``firmware-version``.
+ * - ``fw.bootloader``
+ - running
+ - Version of the bootloader running on the device. Also available
+ through ``ethtool -i`` as the second member of the
+ ``firmware-version``.
+ * - ``board.rev``
+ - fixed
+ - The hardware revision of the device.
+ * - ``serial_number``
+ - fixed
+ - The USB serial number. Also available through ``lsusb -v``.
diff --git a/Documentation/networking/devlink/hns3.rst b/Documentation/networking/devlink/hns3.rst
new file mode 100644
index 000000000000..4562a6e4782f
--- /dev/null
+++ b/Documentation/networking/devlink/hns3.rst
@@ -0,0 +1,25 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+hns3 devlink support
+====================
+
+This document describes the devlink features implemented by the ``hns3``
+device driver.
+
+The ``hns3`` driver supports reloading via ``DEVLINK_CMD_RELOAD``.
+
+Info versions
+=============
+
+The ``hns3`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 10 10 80
+
+ * - Name
+ - Type
+ - Description
+ * - ``fw``
+ - running
+ - Used to represent the firmware version.
diff --git a/Documentation/networking/devlink/i40e.rst b/Documentation/networking/devlink/i40e.rst
new file mode 100644
index 000000000000..d3cb5bb5197e
--- /dev/null
+++ b/Documentation/networking/devlink/i40e.rst
@@ -0,0 +1,59 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+i40e devlink support
+====================
+
+This document describes the devlink features implemented by the ``i40e``
+device driver.
+
+Info versions
+=============
+
+The ``i40e`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 5 90
+
+ * - Name
+ - Type
+ - Example
+ - Description
+ * - ``board.id``
+ - fixed
+ - K15190-000
+ - The Product Board Assembly (PBA) identifier of the board.
+ * - ``fw.mgmt``
+ - running
+ - 9.130
+ - 2-digit version number of the management firmware that controls the
+ PHY, link, etc.
+ * - ``fw.mgmt.api``
+ - running
+ - 1.15
+ - 2-digit version number of the API exported over the AdminQ by the
+ management firmware. Used by the driver to identify what commands
+ are supported.
+ * - ``fw.mgmt.build``
+ - running
+ - 73618
+ - Build number of the source for the management firmware.
+ * - ``fw.undi``
+ - running
+ - 1.3429.0
+ - Version of the Option ROM containing the UEFI driver. The version is
+ reported in ``major.minor.patch`` format. The major version is
+ incremented whenever a major breaking change occurs, or when the
+ minor version would overflow. The minor version is incremented for
+ non-breaking changes and reset to 1 when the major version is
+ incremented. The patch version is normally 0 but is incremented when
+ a fix is delivered as a patch against an older base Option ROM.
+ * - ``fw.psid.api``
+ - running
+ - 9.30
+ - Version defining the format of the flash contents.
+ * - ``fw.bundle_id``
+ - running
+ - 0x8000e5f3
+ - Unique identifier of the firmware image file that was loaded onto
+ the device. Also referred to as the EETRACK identifier of the NVM.
diff --git a/Documentation/networking/devlink/ice.rst b/Documentation/networking/devlink/ice.rst
index 72ea8d295724..7f30ebd5debb 100644
--- a/Documentation/networking/devlink/ice.rst
+++ b/Documentation/networking/devlink/ice.rst
@@ -7,6 +7,21 @@ ice devlink support
This document describes the devlink features implemented by the ``ice``
device driver.
+Parameters
+==========
+
+.. list-table:: Generic parameters implemented
+
+ * - Name
+ - Mode
+ - Notes
+ * - ``enable_roce``
+ - runtime
+ - mutually exclusive with ``enable_iwarp``
+ * - ``enable_iwarp``
+ - runtime
+ - mutually exclusive with ``enable_roce``
+
Info versions
=============
@@ -23,17 +38,24 @@ The ``ice`` driver reports the following versions
- fixed
- K65390-000
- The Product Board Assembly (PBA) identifier of the board.
+ * - ``cgu.id``
+ - fixed
+ - 36
+ - The Clock Generation Unit (CGU) hardware revision identifier.
* - ``fw.mgmt``
- running
- 2.1.7
- - 3-digit version number of the management firmware that controls the
- PHY, link, etc.
+ - 3-digit version number of the management firmware running on the
+ Embedded Management Processor of the device. It controls the PHY,
+ link, access to device resources, etc. Intel documentation refers to
+ this as the EMP firmware.
* - ``fw.mgmt.api``
- running
- - 1.5
- - 2-digit version number of the API exported over the AdminQ by the
- management firmware. Used by the driver to identify what commands
- are supported.
+ - 1.5.1
+ - 3-digit version number (major.minor.patch) of the API exported over
+ the AdminQ by the management firmware. Used by the driver to
+ identify what commands are supported. Historical versions of the
+ kernel only displayed a 2-digit version number (major.minor).
* - ``fw.mgmt.build``
- running
- 0x305d955f
@@ -69,6 +91,12 @@ The ``ice`` driver reports the following versions
- The version of the DDP package that is active in the device. Note
that both the name (as reported by ``fw.app.name``) and version are
required to uniquely identify the package.
+ * - ``fw.app.bundle_id``
+ - running
+ - 0xc0000001
+ - Unique identifier for the DDP package loaded in the device. Also
+ referred to as the DDP Track ID. Can be used to uniquely identify
+ the specific DDP package.
* - ``fw.netlist``
- running
- 1.1.2000-6.7.0
@@ -80,18 +108,133 @@ The ``ice`` driver reports the following versions
- running
- 0xee16ced7
- The first 4 bytes of the hash of the netlist module contents.
+ * - ``fw.cgu``
+ - running
+ - 8032.16973825.6021
+ - The version of Clock Generation Unit (CGU). Format:
+ <CGU type>.<configuration version>.<firmware version>.
+
+Flash Update
+============
+
+The ``ice`` driver implements support for flash update using the
+``devlink-flash`` interface. It supports updating the device flash using a
+combined flash image that contains the ``fw.mgmt``, ``fw.undi``, and
+``fw.netlist`` components.
+
+.. list-table:: List of supported overwrite modes
+ :widths: 5 95
+
+ * - Bits
+ - Behavior
+ * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS``
+ - Do not preserve settings stored in the flash components being
+ updated. This includes overwriting the port configuration that
+ determines the number of physical functions the device will
+ initialize with.
+ * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS`` and ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS``
+ - Do not preserve either settings or identifiers. Overwrite everything
+ in the flash with the contents from the provided image, without
+ performing any preservation. This includes overwriting device
+ identifying fields such as the MAC address, VPD area, and device
+ serial number. It is expected that this combination be used with an
+ image customized for the specific device.
+
+The ice hardware does not support overwriting only identifiers while
+preserving settings, and thus ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS`` on its
+own will be rejected. If no overwrite mask is provided, the firmware will be
+instructed to preserve all settings and identifying fields when updating.
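
The accepted and rejected combinations can be summarized in a small sketch.
This is only an illustration of the rules stated above, not driver code and
not part of the patched file: ``describe_overwrite_mask`` is a hypothetical
helper, and the bit positions used for the two flags are an assumption made
for the example.

```python
# Assumed bit positions, for illustration only.
DEVLINK_FLASH_OVERWRITE_SETTINGS = 1 << 0
DEVLINK_FLASH_OVERWRITE_IDENTIFIERS = 1 << 1


def describe_overwrite_mask(mask: int) -> str:
    """Map an overwrite mask to the behavior described in the text."""
    both = DEVLINK_FLASH_OVERWRITE_SETTINGS | DEVLINK_FLASH_OVERWRITE_IDENTIFIERS
    if mask == 0:
        return "preserve settings and identifiers"
    if mask == DEVLINK_FLASH_OVERWRITE_SETTINGS:
        return "overwrite settings, preserve identifiers"
    if mask == both:
        return "overwrite settings and identifiers"
    # Identifiers-only (or any unknown bit) is not a supported combination.
    raise ValueError(f"unsupported overwrite mask: {mask:#x}")
```

Passing ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS`` on its own raises
``ValueError`` in this sketch, mirroring the rejection described above.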
+
+Reload
+======
+
+The ``ice`` driver supports activating new firmware after a flash update
+using ``DEVLINK_CMD_RELOAD`` with the ``DEVLINK_RELOAD_ACTION_FW_ACTIVATE``
+action.
+
+.. code:: shell
+
+ $ devlink dev reload pci/0000:01:00.0 action fw_activate
+
+The new firmware is activated by issuing a device specific Embedded
+Management Processor reset which requests the device to reset and reload the
+EMP firmware image.
+
+The driver does not currently support reloading the driver via
+``DEVLINK_RELOAD_ACTION_DRIVER_REINIT``.
+
+Port split
+==========
+
+The ``ice`` driver supports port splitting only for port 0, as the FW has
+a predefined set of available port split options for the whole device.
+
+A system reboot is required for port split to be applied.
+
+The following command will select the port split option with 4 ports:
+
+.. code:: shell
+
+ $ devlink port split pci/0000:16:00.0/0 count 4
+
+The list of all available port options will be printed to dynamic debug after
+each ``split`` and ``unsplit`` command. The first option is the default.
+
+.. code:: shell
+
+ ice 0000:16:00.0: Available port split options and max port speeds (Gbps):
+ ice 0000:16:00.0: Status Split Quad 0 Quad 1
+ ice 0000:16:00.0: count L0 L1 L2 L3 L4 L5 L6 L7
+ ice 0000:16:00.0: Active 2 100 - - - 100 - - -
+ ice 0000:16:00.0: 2 50 - 50 - - - - -
+ ice 0000:16:00.0: Pending 4 25 25 25 25 - - - -
+ ice 0000:16:00.0: 4 25 25 - - 25 25 - -
+ ice 0000:16:00.0: 8 10 10 10 10 10 10 10 10
+ ice 0000:16:00.0: 1 100 - - - - - - -
+
+There could be multiple FW port options with the same port split count. When
+the same port split count request is issued again, the next FW port option with
+the same port split count will be selected.
+
+``devlink port unsplit`` will select the option with a split count of 1. If
+there is no FW option available with split count 1, you will receive an error.
Regions
=======
-The ``ice`` driver enables access to the contents of the Non Volatile Memory
-flash chip via the ``nvm-flash`` region.
+The ``ice`` driver implements the following regions for accessing internal
+device data.
+
+.. list-table:: regions implemented
+ :widths: 15 85
+
+ * - Name
+ - Description
+ * - ``nvm-flash``
+ - The contents of the entire flash chip, sometimes referred to as
+ the device's Non Volatile Memory.
+ * - ``shadow-ram``
+ - The contents of the Shadow RAM, which is loaded from the beginning
+ of the flash. Although the contents are primarily from the flash,
+ this area also contains data generated during device boot which is
+ not stored in flash.
+ * - ``device-caps``
+ - The contents of the device firmware's capabilities buffer. Useful to
+ determine the current state and configuration of the device.
+
+Both the ``nvm-flash`` and ``shadow-ram`` regions can be accessed without a
+snapshot. The ``device-caps`` region requires a snapshot as the contents are
+sent by firmware and can't be split into separate reads.
-Users can request an immediate capture of a snapshot via the
-``DEVLINK_CMD_REGION_NEW``
+Users can request an immediate capture of a snapshot for all three regions
+via the ``DEVLINK_CMD_REGION_NEW`` command.
.. code:: shell
+ $ devlink region show
+ pci/0000:01:00.0/nvm-flash: size 10485760 snapshot [] max 1
+ pci/0000:01:00.0/device-caps: size 4096 snapshot [] max 10
+
$ devlink region new pci/0000:01:00.0/nvm-flash snapshot 1
$ devlink region dump pci/0000:01:00.0/nvm-flash snapshot 1
@@ -105,3 +248,157 @@ Users can request an immediate capture of a snapshot via the
0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
$ devlink region delete pci/0000:01:00.0/nvm-flash snapshot 1
+
+ $ devlink region new pci/0000:01:00.0/device-caps snapshot 1
+ $ devlink region dump pci/0000:01:00.0/device-caps snapshot 1
+ 0000000000000000 01 00 01 00 00 00 00 00 01 00 00 00 00 00 00 00
+ 0000000000000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000020 02 00 02 01 32 03 00 00 0a 00 00 00 25 00 00 00
+ 0000000000000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000040 04 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000060 05 00 01 00 03 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000080 06 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000000a0 08 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000000b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000000c0 12 00 01 00 01 00 00 00 01 00 01 00 00 00 00 00
+ 00000000000000d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000000e0 13 00 01 00 00 01 00 00 00 00 00 00 00 00 00 00
+ 00000000000000f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000100 14 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000120 15 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000130 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000140 16 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000160 17 00 01 00 06 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000170 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000180 18 00 01 00 01 00 00 00 01 00 00 00 08 00 00 00
+ 0000000000000190 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000001a0 22 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000001b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000001c0 40 00 01 00 00 08 00 00 08 00 00 00 00 00 00 00
+ 00000000000001d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000001e0 41 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
+ 00000000000001f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000200 42 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
+ 0000000000000210 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+
+ $ devlink region delete pci/0000:01:00.0/device-caps snapshot 1
+
+Devlink Rate
+============
+
+The ``ice`` driver implements the devlink-rate API. It allows offloading
+Hierarchical QoS to the hardware. It enables the user to group Virtual
+Functions in a tree structure and assign the supported parameters (tx_share,
+tx_max, tx_priority and tx_weight) to each node in the tree. Effectively,
+the user gains the ability to control how much bandwidth is allocated to
+each VF group. This is later enforced by the HW.
+
+It is assumed that this feature is mutually exclusive with DCB performed
+in FW and ADQ, or any driver feature that would trigger changes in QoS,
+for example creation of the new traffic class. The driver will prevent DCB
+or ADQ configuration if user started making any changes to the nodes using
+devlink-rate API. To configure those features a driver reload is necessary.
+Correspondingly, if ADQ or DCB is configured, the driver won't export the
+hierarchy at all, or will remove the untouched hierarchy if those
+features are enabled after the hierarchy is exported, but before any
+changes are made.
+
+This feature is also dependent on switchdev being enabled in the system.
+It's required because devlink-rate requires devlink-port objects to be
+present, and those objects are only created in switchdev mode.
+
+If the driver is set to switchdev mode, it will export the internal
+hierarchy the moment VFs are created. The root of the tree is always
+represented by node_0. This node can't be deleted by the user. Leaf
+nodes and nodes with children also can't be deleted.
+
+.. list-table:: Attributes supported
+ :widths: 15 85
+
+ * - Name
+ - Description
+ * - ``tx_max``
+ - maximum bandwidth to be consumed by the tree Node. Rate Limit is
+ an absolute number specifying a maximum amount of bytes a Node may
+ consume during the course of one second. Rate limit guarantees
+ that a link will not oversaturate the receiver on the remote end
+ and also enforces an SLA between the subscriber and network
+ provider.
+ * - ``tx_share``
+ - minimum bandwidth allocated to a tree node when it is not blocked.
+ It specifies an absolute BW. While tx_max defines the maximum
+ bandwidth the node may consume, the tx_share marks committed BW
+ for the Node.
+ * - ``tx_priority``
+ - allows for usage of strict priority arbiter among siblings. This
+ arbitration scheme attempts to schedule nodes based on their
+ priority as long as the nodes remain within their bandwidth limit.
+ Range 0-7. Nodes with priority 7 have the highest priority and are
+ selected first, while nodes with priority 0 have the lowest
+ priority. Nodes that have the same priority are treated equally.
+ * - ``tx_weight``
+ - allows for usage of Weighted Fair Queuing arbitration scheme among
+ siblings. This arbitration scheme can be used simultaneously with
+ the strict priority. Range 1-200. Only relative values matter for
+ arbitration.
+
+``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
+nodes with the same priority form a WFQ subgroup in the sibling group
+and arbitration among them is based on assigned weights.
+
+.. code:: shell
+
+ # enable switchdev
+ $ devlink dev eswitch set pci/0000:4b:00.0 mode switchdev
+
+ # at this point driver should export internal hierarchy
+ $ echo 2 > /sys/class/net/ens785np0/device/sriov_numvfs
+
+ $ devlink port function rate show
+ pci/0000:4b:00.0/node_25: type node parent node_24
+ pci/0000:4b:00.0/node_24: type node parent node_0
+ pci/0000:4b:00.0/node_32: type node parent node_31
+ pci/0000:4b:00.0/node_31: type node parent node_30
+ pci/0000:4b:00.0/node_30: type node parent node_16
+ pci/0000:4b:00.0/node_19: type node parent node_18
+ pci/0000:4b:00.0/node_18: type node parent node_17
+ pci/0000:4b:00.0/node_17: type node parent node_16
+ pci/0000:4b:00.0/node_14: type node parent node_5
+ pci/0000:4b:00.0/node_5: type node parent node_3
+ pci/0000:4b:00.0/node_13: type node parent node_4
+ pci/0000:4b:00.0/node_12: type node parent node_4
+ pci/0000:4b:00.0/node_11: type node parent node_4
+ pci/0000:4b:00.0/node_10: type node parent node_4
+ pci/0000:4b:00.0/node_9: type node parent node_4
+ pci/0000:4b:00.0/node_8: type node parent node_4
+ pci/0000:4b:00.0/node_7: type node parent node_4
+ pci/0000:4b:00.0/node_6: type node parent node_4
+ pci/0000:4b:00.0/node_4: type node parent node_3
+ pci/0000:4b:00.0/node_3: type node parent node_16
+ pci/0000:4b:00.0/node_16: type node parent node_15
+ pci/0000:4b:00.0/node_15: type node parent node_0
+ pci/0000:4b:00.0/node_2: type node parent node_1
+ pci/0000:4b:00.0/node_1: type node parent node_0
+ pci/0000:4b:00.0/node_0: type node
+ pci/0000:4b:00.0/1: type leaf parent node_25
+ pci/0000:4b:00.0/2: type leaf parent node_25
+
+ # let's create some custom node
+ $ devlink port function rate add pci/0000:4b:00.0/node_custom parent node_0
+
+ # second custom node
+ $ devlink port function rate add pci/0000:4b:00.0/node_custom_1 parent node_custom
+
+ # reassign second VF to newly created branch
+ $ devlink port function rate set pci/0000:4b:00.0/2 parent node_custom_1
+
+ # assign tx_weight to the VF
+ $ devlink port function rate set pci/0000:4b:00.0/2 tx_weight 5
+
+ # assign tx_share to the VF
+ $ devlink port function rate set pci/0000:4b:00.0/2 tx_share 500Mbps
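+
+The ranges given in the attribute table (``tx_priority`` 0-7, ``tx_weight``
+1-200) can be checked before a value is handed to devlink. The following is
+a minimal sketch, not part of the driver: the helper names ``rate_attr_ok``
+and ``rate_set_cmd`` are hypothetical, and the script only prints the
+command instead of running it.
+

```shell
#!/bin/sh
# Sketch only: rate_attr_ok and rate_set_cmd are hypothetical helper names.

# rate_attr_ok NAME VALUE -> status 0 if VALUE is in the documented range
rate_attr_ok() {
    case "$2" in
        ''|*[!0-9]*) return 1 ;;          # reject empty or non-numeric input
    esac
    case "$1" in
        tx_priority) [ "$2" -ge 0 ] && [ "$2" -le 7 ] ;;
        tx_weight)   [ "$2" -ge 1 ] && [ "$2" -le 200 ] ;;
        *)           return 1 ;;          # attribute not covered by this sketch
    esac
}

# rate_set_cmd OBJECT NAME VALUE -> print the devlink command if VALUE is valid
rate_set_cmd() {
    if rate_attr_ok "$2" "$3"; then
        printf 'devlink port function rate set %s %s %s\n' "$1" "$2" "$3"
    else
        printf 'invalid %s value: %s\n' "$2" "$3" >&2
        return 1
    fi
}

rate_set_cmd pci/0000:4b:00.0/node_custom tx_priority 7
rate_set_cmd pci/0000:4b:00.0/2 tx_weight 5
```

+Piping the output to ``sh`` would run the printed commands on a system
+where the hierarchy above has been exported.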
diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst
index 7684ae5c4a4a..e14d7a701b72 100644
--- a/Documentation/networking/devlink/index.rst
+++ b/Documentation/networking/devlink/index.rst
@@ -4,6 +4,48 @@ Linux Devlink Documentation
devlink is an API to expose device information and resources not directly
related to any device class, such as chip-wide/switch-ASIC-wide configuration.
+Locking
+-------
+
+Driver facing APIs are currently transitioning to allow more explicit
+locking. Drivers can use the existing ``devlink_*`` set of APIs, or
+new APIs prefixed by ``devl_*``. The older APIs handle all the locking
+in devlink core, but don't allow registration of most sub-objects once
+the main devlink object is itself registered. The newer ``devl_*`` APIs assume
+the devlink instance lock is already held. Drivers can take the instance
+lock by calling ``devl_lock()``. It is also held in all callbacks of devlink
+netlink commands.
+
+Drivers are encouraged to use the devlink instance lock for their own needs.
+
+Drivers need to be cautious when holding the devlink instance lock and
+the RTNL lock at the same time. The devlink instance lock must be taken
+first; only then may the RTNL lock be taken.
+
+Nested instances
+----------------
+
+Some objects, like linecards or port functions, can have other
+devlink instances created underneath. In that case, drivers should make
+sure to respect the following rules:
+
+ - Lock ordering should be maintained. If a driver needs to hold the instance
+ locks of both the nested and the parent instance at the same time, the
+ instance lock of the parent must be taken first; only then may the
+ instance lock of the nested instance be taken.
+ - Drivers should use object-specific helpers to set up the
+ nested relationship:
+
+ - ``devl_nested_devlink_set()`` - called to set up the devlink -> nested
+ devlink relationship (can be used for multiple nested instances).
+ - ``devl_port_fn_devlink_set()`` - called to set up the port function ->
+ nested devlink relationship.
+ - ``devlink_linecard_nested_dl_set()`` - called to set up the linecard ->
+ nested devlink relationship.
+
+The nested devlink info is exposed to the userspace over object-specific
+attributes of devlink netlink.
+
Interface documentation
-----------------------
@@ -18,9 +60,13 @@ general.
devlink-info
devlink-flash
devlink-params
+ devlink-port
devlink-region
devlink-resource
+ devlink-reload
+ devlink-selftests
devlink-trap
+ devlink-linecard
Driver-specific documentation
-----------------------------
@@ -32,6 +78,9 @@ parameters, info versions, and other features it supports.
:maxdepth: 1
bnxt
+ etas_es58x
+ hns3
+ i40e
ionic
ice
mlx4
@@ -40,6 +89,10 @@ parameters, info versions, and other features it supports.
mv88e6xxx
netdevsim
nfp
- sja1105
qed
ti-cpsw-switch
+ am65-nuss-cpsw-switch
+ prestera
+ iosm
+ octeontx2
+ sfc
diff --git a/Documentation/networking/devlink/iosm.rst b/Documentation/networking/devlink/iosm.rst
new file mode 100644
index 000000000000..6136181339aa
--- /dev/null
+++ b/Documentation/networking/devlink/iosm.rst
@@ -0,0 +1,162 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+iosm devlink support
+====================
+
+This document describes the devlink features implemented by the ``iosm``
+device driver.
+
+Parameters
+==========
+
+The ``iosm`` driver implements the following driver-specific parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``erase_full_flash``
+ - u8
+ - runtime
+ - Indicates whether a full erase is required for the device during
+ firmware flashing.
+ If set, a full NAND erase command is sent to the device. By default,
+ only conditional erase support is enabled.
+
+
+Flash Update
+============
+
+The ``iosm`` driver implements support for flash update using the
+``devlink-flash`` interface.
+
+It supports updating the device flash using a combined flash image which contains
+the Bootloader images and other modem software images.
+
+The driver uses DEVLINK_SUPPORT_FLASH_UPDATE_COMPONENT to identify the type of
+firmware image that the user space application asks to be flashed.
+The supported firmware image types are listed below.
+
+.. list-table:: Firmware Image types
+ :widths: 15 85
+
+ * - Name
+ - Description
+ * - ``PSI RAM``
+ - Primary Signed Image
+ * - ``EBL``
+ - External Bootloader
+ * - ``FLS``
+ - Modem Software Image
+
+PSI RAM and EBL are the RAM images which are injected into the device when the
+device is in the BOOT ROM stage. Once this is successful, the actual modem firmware
+image is flashed to the device. The modem software image contains multiple files,
+each having one secure bin file and at least one Loadmap/Region file. To flash
+these files, appropriate commands are sent to the modem device along with the
+data required for flashing. Data such as the region count and the address of each
+region has to be passed to the driver using the devlink param command.
+
+If the device has to be fully erased before firmware flashing, the user application
+needs to set the erase_full_flash parameter using the devlink param command.
+By default, only conditional erase is enabled.
+
+Flash Commands:
+===============
+1) When the modem is in the Boot ROM stage, use the devlink flash command
+below to inject the PSI RAM image.
+
+$ devlink dev flash pci/0000:02:00.0 file <PSI_RAM_File_name>
+
+2) To do a full erase, issue the command below to set the
+erase_full_flash param (set it only if a full erase is required).
+
+$ devlink dev param set pci/0000:02:00.0 name erase_full_flash value true cmode runtime
+
+3) Inject EBL after the modem is in PSI stage.
+
+$ devlink dev flash pci/0000:02:00.0 file <EBL_File_name>
+
+4) Once EBL is injected successfully, the actual firmware flashing takes
+place. Below is the sequence of commands used for each of the firmware images.
+
+a) Flash secure bin file.
+
+$ devlink dev flash pci/0000:02:00.0 file <Secure_bin_file_name>
+
+b) Flash the Loadmap/Region file.
+
+$ devlink dev flash pci/0000:02:00.0 file <Load_map_file_name>
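+
+The ordering of the steps above can be sketched as a small script. This is
+an illustration under assumptions, not a tool shipped with the driver: the
+function name ``iosm_flash_plan`` and all file names are placeholders, and
+the script prints the commands rather than running them.
+

```shell
#!/bin/sh
# Sketch only: iosm_flash_plan is a hypothetical helper; file names are
# placeholders supplied by the caller.

# iosm_flash_plan DEV FULL_ERASE PSI_FILE EBL_FILE [IMAGE_FILE...]
# FULL_ERASE is "yes" to request a full erase before the EBL is injected.
iosm_flash_plan() {
    dev=$1
    full_erase=$2
    psi=$3
    ebl=$4
    shift 4                      # remaining args: secure bin / loadmap files

    # 1) Inject PSI RAM while the modem is in the Boot ROM stage.
    echo "devlink dev flash $dev file $psi"

    # 2) Optionally request a full erase.
    if [ "$full_erase" = yes ]; then
        echo "devlink dev param set $dev name erase_full_flash value true cmode runtime"
    fi

    # 3) Inject EBL once the modem is in the PSI stage.
    echo "devlink dev flash $dev file $ebl"

    # 4) Flash each secure bin and Loadmap/Region file in the given order.
    for image in "$@"; do
        echo "devlink dev flash $dev file $image"
    done
}

iosm_flash_plan pci/0000:02:00.0 yes psi_ram.bin ebl.bin secure.bin loadmap.bin
```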
+
+Regions
+=======
+
+The ``iosm`` driver supports dumping the coredump logs.
+
+If the firmware encounters an exception, the driver takes a snapshot.
+The following regions are accessed for device internal data.
+
+.. list-table:: Regions implemented
+ :widths: 15 85
+
+ * - Name
+ - Description
+ * - ``report.json``
+ - The summary of exception details is logged as part of this region.
+ * - ``coredump.fcd``
+ - This region contains the details related to the exception that occurred in the
+ device (RAM dump).
+ * - ``cdd.log``
+ - This region contains the logs related to the modem CDD driver.
+ * - ``eeprom.bin``
+ - This region contains the eeprom logs.
+ * - ``bootcore_trace.bin``
+ - This region contains the current instance of bootloader logs.
+ * - ``bootcore_prev_trace.bin``
+ - This region contains the previous instance of bootloader logs.
+
+
+Region commands
+===============
+
+$ devlink region show
+
+$ devlink region new pci/0000:02:00.0/report.json
+
+$ devlink region dump pci/0000:02:00.0/report.json snapshot 0
+
+$ devlink region del pci/0000:02:00.0/report.json snapshot 0
+
+$ devlink region new pci/0000:02:00.0/coredump.fcd
+
+$ devlink region dump pci/0000:02:00.0/coredump.fcd snapshot 1
+
+$ devlink region del pci/0000:02:00.0/coredump.fcd snapshot 1
+
+$ devlink region new pci/0000:02:00.0/cdd.log
+
+$ devlink region dump pci/0000:02:00.0/cdd.log snapshot 2
+
+$ devlink region del pci/0000:02:00.0/cdd.log snapshot 2
+
+$ devlink region new pci/0000:02:00.0/eeprom.bin
+
+$ devlink region dump pci/0000:02:00.0/eeprom.bin snapshot 3
+
+$ devlink region del pci/0000:02:00.0/eeprom.bin snapshot 3
+
+$ devlink region new pci/0000:02:00.0/bootcore_trace.bin
+
+$ devlink region dump pci/0000:02:00.0/bootcore_trace.bin snapshot 4
+
+$ devlink region del pci/0000:02:00.0/bootcore_trace.bin snapshot 4
+
+$ devlink region new pci/0000:02:00.0/bootcore_prev_trace.bin
+
+$ devlink region dump pci/0000:02:00.0/bootcore_prev_trace.bin snapshot 5
+
+$ devlink region del pci/0000:02:00.0/bootcore_prev_trace.bin snapshot 5
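+
+The per-region command triplets above follow one pattern, which a loop can
+generate. The sketch below is an assumption-laden illustration: the
+function name ``iosm_region_plan`` is hypothetical, it passes an explicit
+``snapshot ID`` to ``region new`` so that the IDs are predictable, and it
+prints the commands rather than running them.
+

```shell
#!/bin/sh
# Sketch only: iosm_region_plan is a hypothetical helper name.

# iosm_region_plan DEV -> print new/dump/del commands for each iosm region
iosm_region_plan() {
    dev=$1
    id=0
    for region in report.json coredump.fcd cdd.log eeprom.bin \
                  bootcore_trace.bin bootcore_prev_trace.bin; do
        echo "devlink region new $dev/$region snapshot $id"
        echo "devlink region dump $dev/$region snapshot $id"
        echo "devlink region del $dev/$region snapshot $id"
        id=$((id + 1))           # one snapshot ID per region
    done
}

iosm_region_plan pci/0000:02:00.0
```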
diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst
index 4e4b97f7971a..456985407475 100644
--- a/Documentation/networking/devlink/mlx5.rst
+++ b/Documentation/networking/devlink/mlx5.rst
@@ -14,8 +14,24 @@ Parameters
* - Name
- Mode
+ - Validation
* - ``enable_roce``
- driverinit
+ - Type: Boolean
+
+ If the device supports RoCE disablement, RoCE enablement state controls
+ device support for RoCE capability. Otherwise, the control occurs in the
+ driver stack. When RoCE is disabled at the driver level, only raw
+ ethernet QPs are supported.
+ * - ``io_eq_size``
+ - driverinit
+ - The range is between 64 and 4096.
+ * - ``event_eq_size``
+ - driverinit
+ - The range is between 64 and 4096.
+ * - ``max_macs``
+ - driverinit
+ - The range is between 1 and 2^31. Only power of 2 values are supported.
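+
+The ``max_macs`` constraint (a power of 2 between 1 and 2^31) can be checked
+before the value is set. This is a minimal sketch: ``max_macs_ok`` and
+``max_macs_plan`` are hypothetical helper names, the PCI address is only an
+example, and the script prints the commands instead of running them. The
+reload is included on the assumption that ``driverinit`` values are applied
+when the driver re-initializes.
+

```shell
#!/bin/sh
# Sketch only: max_macs_ok and max_macs_plan are hypothetical helper names.
# Assumes the shell uses 64-bit arithmetic, as is usual on 64-bit systems.

# max_macs_ok N -> status 0 if N is a power of two in [1, 2^31]
max_macs_ok() {
    case "$1" in
        ''|0*|*[!0-9]*) return 1 ;;     # empty, leading zero, or non-digit
    esac
    [ "${#1}" -le 10 ] || return 1       # 2^31 has 10 digits; avoids overflow
    [ "$1" -le 2147483648 ] || return 1  # upper bound 2^31
    [ $(( $1 & ($1 - 1) )) -eq 0 ]       # exactly one bit set
}

# max_macs_plan DEV N -> print the set and reload commands if N is valid
max_macs_plan() {
    if ! max_macs_ok "$2"; then
        echo "max_macs: $2 is not a power of 2 in 1..2^31" >&2
        return 1
    fi
    echo "devlink dev param set $1 name max_macs value $2 cmode driverinit"
    echo "devlink dev reload $1"
}

max_macs_plan pci/0000:82:00.0 256
```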
The ``mlx5`` driver also implements the following driver-specific
parameters.
@@ -37,12 +53,66 @@ parameters.
* ``smfs`` Software managed flow steering. In SMFS mode, the HW
steering entities are created and manage through the driver without
firmware intervention.
+
+ SMFS mode is faster and provides better rule insertion rate compared to
+ default DMFS mode.
* - ``fdb_large_groups``
- u32
- driverinit
- Control the number of large groups (size > 1) in the FDB table.
* The default value is 15, and the range is between 1 and 1024.
+ * - ``esw_multiport``
+ - Boolean
+ - runtime
+ - Control MultiPort E-Switch shared fdb mode.
+
+ An experimental mode where a single E-Switch is used and all the vports
+ and physical ports on the NIC are connected to it.
+
+ An example is to send traffic from a VF that is created on PF0 to an
+ uplink that is natively associated with the uplink of PF1
+
+ Note: Future devices, ConnectX-8 and onward, will eventually have this
+ as the default to allow forwarding between all NIC ports in a single
+ E-switch environment and the dual E-switch mode will likely get
+ deprecated.
+
+ Default: disabled
+ * - ``esw_port_metadata``
+ - Boolean
+ - runtime
+ - When applicable, disabling eswitch metadata can increase packet rate up
+ to 20% depending on the use case and packet sizes.
+
+ Eswitch port metadata state controls whether to internally tag packets
+ with metadata. Metadata tagging must be enabled for multi-port RoCE,
+ failover between representors and stacked devices. By default metadata is
+ enabled on the supported devices in E-switch. Metadata is applicable only
+ for E-switch in switchdev mode and users may disable it when NONE of the
+ below use cases will be in use:
+ 1. HCA is in Dual/multi-port RoCE mode.
+ 2. VF/SF representor bonding (Usually used for Live migration)
+ 3. Stacked devices
+
+ When metadata is disabled, the above use cases will fail to initialize if
+ users try to enable them.
+
+ Note: Setting this parameter does not take effect immediately. Setting
+ must happen in legacy mode and eswitch port metadata takes effect after
+ enabling switchdev mode.
+ * - ``hairpin_num_queues``
+ - u32
+ - driverinit
+ - We refer to a TC NIC rule that involves forwarding as "hairpin".
+ Hairpin queues are mlx5 hardware specific implementation for hardware
+ forwarding of such packets.
+
+ Control the number of hairpin queues.
+ * - ``hairpin_queue_size``
+ - u32
+ - driverinit
+ - Control the size (in packets) of the hairpin queues.
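+
+The ordering described for ``esw_port_metadata`` (set the parameter in
+legacy mode, then enable switchdev mode) can be sketched as follows. The
+function name ``esw_metadata_plan`` is hypothetical and the PCI address is
+only an example; the script prints the commands rather than running them.
+

```shell
#!/bin/sh
# Sketch only: esw_metadata_plan is a hypothetical helper name.

# esw_metadata_plan DEV STATE -> print the commands in the documented order
# STATE is "true" or "false".
esw_metadata_plan() {
    case "$2" in
        true|false) ;;
        *) echo "state must be true or false" >&2; return 1 ;;
    esac
    # The parameter must be set while the eswitch is in legacy mode ...
    echo "devlink dev eswitch set $1 mode legacy"
    echo "devlink dev param set $1 name esw_port_metadata value $2 cmode runtime"
    # ... and takes effect once switchdev mode is enabled.
    echo "devlink dev eswitch set $1 mode switchdev"
}

esw_metadata_plan pci/0000:82:00.0 false
```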
The ``mlx5`` driver supports reloading via ``DEVLINK_CMD_RELOAD``
@@ -63,3 +133,161 @@ The ``mlx5`` driver reports the following versions
* - ``fw.version``
- stored, running
- Three digit major.minor.subminor firmware version number.
+
+Health reporters
+================
+
+tx reporter
+-----------
+The tx reporter is responsible for reporting and recovering the following three error scenarios:
+
+- tx timeout
+ Report on kernel tx timeout detection.
+ Recover by searching lost interrupts.
+- tx error completion
+ Report on error tx completion.
+ Recover by flushing the tx queue and resetting it.
+- tx PTP port timestamping CQ unhealthy
+ Report too many CQEs never delivered on port ts CQ.
+ Recover by flushing and re-creating all PTP channels.
+
+The tx reporter also supports an on-demand diagnose callback, which provides
+real-time information about its send queues' status.
+
+User commands examples:
+
+- Diagnose send queues status::
+
+ $ devlink health diagnose pci/0000:82:00.0 reporter tx
+
+.. note::
+ This command has valid output only when interface is up, otherwise the command has empty output.
+
+- Show the number of tx errors indicated, the number of recover flows that
+ ended successfully, whether autorecover is enabled, and the grace period
+ since the last recover::
+
+ $ devlink health show pci/0000:82:00.0 reporter tx
+
+rx reporter
+-----------
+The rx reporter is responsible for reporting and recovering the following two error scenarios:
+
+- rx queues' initialization (population) timeout
+ Population of rx queues' descriptors on ring initialization is done
+ in napi context via triggering an irq. In case of a failure to get
+ the minimum amount of descriptors, a timeout would occur, and
+ descriptors could be recovered by polling the EQ (Event Queue).
+- rx completions with errors (reported by HW on interrupt context)
+ Report on rx completion error.
+ Recover (if needed) by flushing the related queue and resetting it.
+
+The rx reporter also supports an on-demand diagnose callback, which
+provides real-time information about its receive queues' status.
+
+- Diagnose rx queues' status and corresponding completion queue::
+
+ $ devlink health diagnose pci/0000:82:00.0 reporter rx
+
+.. note::
+ This command has valid output only when interface is up. Otherwise, the command has empty output.
+
+- Show the number of rx errors indicated, the number of recover flows that
+ ended successfully, whether autorecover is enabled, and the grace period
+ since the last recover::
+
+ $ devlink health show pci/0000:82:00.0 reporter rx
+
+fw reporter
+-----------
+The fw reporter implements `diagnose` and `dump` callbacks.
+It follows symptoms of fw error such as fw syndrome by triggering
+fw core dump and storing it into the dump buffer.
+The fw reporter diagnose command can be triggered any time by the user to check
+current fw status.
+
+User commands examples:
+
+- Check fw health status::
+
+ $ devlink health diagnose pci/0000:82:00.0 reporter fw
+
+- Read FW core dump if already stored or trigger new one::
+
+ $ devlink health dump show pci/0000:82:00.0 reporter fw
+
+.. note::
+ This command can run only on the PF which has fw tracer ownership,
+ running it on other PF or any VF will return "Operation not permitted".
+
+fw fatal reporter
+-----------------
+The fw fatal reporter implements `dump` and `recover` callbacks.
+It follows up fatal error indications with a CR-space dump and a recover flow.
+The CR-space dump uses vsc interface which is valid even if the FW command
+interface is not functional, which is the case in most FW fatal errors.
+The recover function runs recover flow which reloads the driver and triggers fw
+reset if needed.
+On firmware error, the health buffer is dumped into the dmesg. The log
+level is derived from the error's severity (given in health buffer).
+
+User commands examples:
+
+- Run fw recover flow manually::
+
+ $ devlink health recover pci/0000:82:00.0 reporter fw_fatal
+
+- Read FW CR-space dump if already stored or trigger new one::
+
+ $ devlink health dump show pci/0000:82:00.1 reporter fw_fatal
+
+.. note::
+ This command can run only on PF.
+
+vnic reporter
+-------------
+The vnic reporter implements only the `diagnose` callback.
+It is responsible for querying the vnic diagnostic counters from fw and displaying
+them in realtime.
+
+Description of the vnic counters:
+
+- total_error_queues
+ number of queues in an error state due to
+ an async error or errored command.
+- send_queue_priority_update_flow
+ number of QP/SQ priority/SL update events.
+- cq_overrun
+ number of times CQ entered an error state due to an overflow.
+- async_eq_overrun
+ number of times an EQ mapped to async events was overrun.
+- comp_eq_overrun
+ number of times an EQ mapped to completion events was
+ overrun.
+- quota_exceeded_command
+ number of commands issued and failed due to quota exceeded.
+- invalid_command
+ number of commands issued and failed due to any reason other than quota
+ exceeded.
+- nic_receive_steering_discard
+ number of packets that completed RX flow
+ steering but were discarded due to a mismatch in flow table.
+- generated_pkt_steering_fail
+ number of packets generated by the VNIC experiencing unexpected steering
+ failure (at any point in steering flow).
+- handled_pkt_steering_fail
+ number of packets handled by the VNIC experiencing unexpected steering
+ failure (at any point in steering flow owned by the VNIC, including the FDB
+ for the eswitch owner).
+
+User commands examples:
+
+- Diagnose PF/VF vnic counters::
+
+ $ devlink health diagnose pci/0000:82:00.1 reporter vnic
+
+- Diagnose representor vnic counters (performed by supplying devlink port of the
+ representor, which can be obtained via devlink port command)::
+
+ $ devlink health diagnose pci/0000:82:00.1/65537 reporter vnic
+
+.. note::
+ This command can run over all interfaces such as PF/VF and representor ports.
diff --git a/Documentation/networking/devlink/mlxsw.rst b/Documentation/networking/devlink/mlxsw.rst
index cf857cb4ba8f..433962225bd4 100644
--- a/Documentation/networking/devlink/mlxsw.rst
+++ b/Documentation/networking/devlink/mlxsw.rst
@@ -58,6 +58,30 @@ The ``mlxsw`` driver reports the following versions
- running
- Three digit firmware version
+Line card auxiliary device info versions
+========================================
+
+The ``mlxsw`` driver reports the following versions for line card auxiliary device
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``hw.revision``
+ - fixed
+ - The hardware revision for this line card
+ * - ``ini.version``
+ - running
+ - Version of line card INI loaded
+ * - ``fw.psid``
+ - fixed
+ - Line card device PSID
+ * - ``fw.version``
+ - running
+ - Three digit firmware version of line card device
+
Driver-specific Traps
=====================
diff --git a/Documentation/networking/devlink/netdevsim.rst b/Documentation/networking/devlink/netdevsim.rst
index 2a266b7e7b38..88482725422c 100644
--- a/Documentation/networking/devlink/netdevsim.rst
+++ b/Documentation/networking/devlink/netdevsim.rst
@@ -46,7 +46,7 @@ Resources
=========
The ``netdevsim`` driver exposes resources to control the number of FIB
-entries and FIB rule entries that the driver will allow.
+entries, FIB rule entries and nexthops that the driver will allow.
.. code:: shell
@@ -54,8 +54,35 @@ entries and FIB rule entries that the driver will allow.
$ devlink resource set netdevsim/netdevsim0 path /IPv4/fib-rules size 16
$ devlink resource set netdevsim/netdevsim0 path /IPv6/fib size 64
$ devlink resource set netdevsim/netdevsim0 path /IPv6/fib-rules size 16
+ $ devlink resource set netdevsim/netdevsim0 path /nexthops size 16
$ devlink dev reload netdevsim/netdevsim0
+Rate objects
+============
+
+The ``netdevsim`` driver supports rate objects management, which includes:
+
+- registering/unregistering leaf rate objects per VF devlink port;
+- creation/deletion of node rate objects;
+- setting tx_share and tx_max rate values for any rate object type;
+- setting parent node for any rate object type.
+
+Rate nodes and their parameters are exposed in ``netdevsim`` debugfs in RO mode.
+For example, for a rate node created with the name ``some_group``:
+
+.. code:: shell
+
+ $ ls /sys/kernel/debug/netdevsim/netdevsim0/rate_groups/some_group
+ rate_parent tx_max tx_share
+
+The same parameters are exposed for leaf objects in the corresponding port
+directories. For example:
+
+.. code:: shell
+
+ $ ls /sys/kernel/debug/netdevsim/netdevsim0/ports/1
+ dev ethtool rate_parent tx_max tx_share
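+
+A rate node's life cycle on ``netdevsim`` can be sketched end to end:
+create the node, read its values back from debugfs, then delete it. The
+function name ``nsim_rate_plan`` and the rate values are illustrative
+assumptions; the debugfs layout is the one shown above. The script prints
+the commands rather than running them.
+

```shell
#!/bin/sh
# Sketch only: nsim_rate_plan is a hypothetical helper name.

# nsim_rate_plan DEV GROUP TX_SHARE TX_MAX
# DEV has the form netdevsim/netdevsimN.
nsim_rate_plan() {
    dev=$1
    group=$2
    # debugfs uses the instance name without the "netdevsim/" bus prefix
    dbg=/sys/kernel/debug/netdevsim/${dev#netdevsim/}/rate_groups/$group

    echo "devlink port function rate add $dev/$group tx_share $3 tx_max $4"
    echo "cat $dbg/tx_share"
    echo "cat $dbg/tx_max"
    echo "devlink port function rate del $dev/$group"
}

nsim_rate_plan netdevsim/netdevsim0 some_group 10mbit 100mbit
```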
+
Driver-specific Traps
=====================
@@ -68,5 +95,5 @@ Driver-specific Traps
* - ``fid_miss``
- ``exception``
- When a packet enters the device it is classified to a filtering
- indentifier (FID) based on the ingress port and VLAN. This trap is used
+ identifier (FID) based on the ingress port and VLAN. This trap is used
to trap packets for which a FID could not be found
diff --git a/Documentation/networking/devlink/octeontx2.rst b/Documentation/networking/devlink/octeontx2.rst
new file mode 100644
index 000000000000..610de99b728a
--- /dev/null
+++ b/Documentation/networking/devlink/octeontx2.rst
@@ -0,0 +1,42 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+octeontx2 devlink support
+=========================
+
+This document describes the devlink features implemented by the ``octeontx2 AF, PF and VF``
+device drivers.
+
+Parameters
+==========
+
+The ``octeontx2 PF and VF`` drivers implement the following driver-specific parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``mcam_count``
+ - u16
+ - runtime
+ - Select number of match CAM entries to be allocated for an interface.
+ The same is used for ntuple filters of the interface. Supported by
+ PF and VF drivers.
+
+The ``octeontx2 AF`` driver implements the following driver-specific parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``dwrr_mtu``
+ - u32
+ - runtime
+ - Sets the quantum that the hardware uses for scheduling among transmit queues.
+ The hardware uses a weighted DWRR algorithm to schedule among all transmit queues.
diff --git a/Documentation/networking/devlink/prestera.rst b/Documentation/networking/devlink/prestera.rst
new file mode 100644
index 000000000000..96b1124e614b
--- /dev/null
+++ b/Documentation/networking/devlink/prestera.rst
@@ -0,0 +1,141 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================
+prestera devlink support
+========================
+
+This document describes the devlink features implemented by the ``prestera``
+device driver.
+
+Driver-specific Traps
+=====================
+
+.. list-table:: List of Driver-specific Traps Registered by ``prestera``
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``arp_bc``
+ - ``trap``
+ - Traps ARP broadcast packets (both requests/responses)
+ * - ``is_is``
+ - ``trap``
+ - Traps IS-IS packets
+ * - ``ospf``
+ - ``trap``
+ - Traps OSPF packets
+ * - ``ip_bc_mac``
+ - ``trap``
+ - Traps IPv4 packets with broadcast DA Mac address
+ * - ``stp``
+ - ``trap``
+ - Traps STP BPDU
+ * - ``lacp``
+ - ``trap``
+ - Traps LACP packets
+ * - ``lldp``
+ - ``trap``
+ - Traps LLDP packets
+ * - ``router_mc``
+ - ``trap``
+ - Traps multicast packets
+ * - ``vrrp``
+ - ``trap``
+ - Traps VRRP packets
+ * - ``dhcp``
+ - ``trap``
+ - Traps DHCP packets
+ * - ``mtu_error``
+ - ``trap``
+ - Traps (exception) packets that exceeded port's MTU
+ * - ``mac_to_me``
+ - ``trap``
+ - Traps packets with switch-port's DA Mac address
+ * - ``ttl_error``
+ - ``trap``
+ - Traps (exception) IPv4 packets whose TTL exceeded
+ * - ``ipv4_options``
+ - ``trap``
+ - Traps (exception) packets due to the malformed IPV4 header options
+ * - ``ip_default_route``
+ - ``trap``
+ - Traps packets that have no specific IP interface (IP to me) and no forwarding prefix
+ * - ``local_route``
+ - ``trap``
+ - Traps packets that have been sent to one of the switch's IP interface addresses
+ * - ``ipv4_icmp_redirect``
+ - ``trap``
+ - Traps (exception) IPV4 ICMP redirect packets
+ * - ``arp_response``
+ - ``trap``
+ - Traps ARP reply packets that have switch-port's DA Mac address
+ * - ``acl_code_0``
+ - ``trap``
+ - Traps packets that have ACL priority set to 0 (tc pref 0)
+ * - ``acl_code_1``
+ - ``trap``
+ - Traps packets that have ACL priority set to 1 (tc pref 1)
+ * - ``acl_code_2``
+ - ``trap``
+ - Traps packets that have ACL priority set to 2 (tc pref 2)
+ * - ``acl_code_3``
+ - ``trap``
+ - Traps packets that have ACL priority set to 3 (tc pref 3)
+ * - ``acl_code_4``
+ - ``trap``
+ - Traps packets that have ACL priority set to 4 (tc pref 4)
+ * - ``acl_code_5``
+ - ``trap``
+ - Traps packets that have ACL priority set to 5 (tc pref 5)
+ * - ``acl_code_6``
+ - ``trap``
+ - Traps packets that have ACL priority set to 6 (tc pref 6)
+ * - ``acl_code_7``
+ - ``trap``
+ - Traps packets that have ACL priority set to 7 (tc pref 7)
+ * - ``ipv4_bgp``
+ - ``trap``
+ - Traps IPv4 BGP packets
+ * - ``ssh``
+ - ``trap``
+ - Traps SSH packets
+ * - ``telnet``
+ - ``trap``
+ - Traps Telnet packets
+ * - ``icmp``
+ - ``trap``
+ - Traps ICMP packets
+ * - ``rxdma_drop``
+ - ``drop``
+ - Drops packets (RxDMA) due to the lack of ingress buffers etc.
+ * - ``port_no_vlan``
+ - ``drop``
+ - Drops packets due to faulty-configured network or due to internal bug (config issue).
+ * - ``local_port``
+ - ``drop``
+ - Drops packets whose decision (FDB entry) is to bridge packet back to the incoming port/trunk.
+ * - ``invalid_sa``
+ - ``drop``
+ - Drops packets with multicast source MAC address.
+ * - ``illegal_ip_addr``
+ - ``drop``
+ - Drops packets with illegal SIP/DIP multicast/unicast addresses.
+ * - ``illegal_ipv4_hdr``
+ - ``drop``
+ - Drops packets with illegal IPV4 header.
+ * - ``ip_uc_dip_da_mismatch``
+ - ``drop``
+ - Drops packets with destination MAC being unicast, but destination IP address being multicast.
+ * - ``ip_sip_is_zero``
+ - ``drop``
+ - Drops packets with zero (0) IPV4 source address.
+ * - ``met_red``
+ - ``drop``
+ - Drops non-conforming packets (dropped by Ingress policer, metering drop), e.g. packet rate exceeded configured bandwidth.
diff --git a/Documentation/networking/devlink/sfc.rst b/Documentation/networking/devlink/sfc.rst
new file mode 100644
index 000000000000..db64a1bd9733
--- /dev/null
+++ b/Documentation/networking/devlink/sfc.rst
@@ -0,0 +1,57 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+sfc devlink support
+===================
+
+This document describes the devlink features implemented by the ``sfc``
+device driver for the ef100 device.
+
+Info versions
+=============
+
+The ``sfc`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``fw.mgmt.suc``
+ - running
+ - For boards where the management function is split between multiple
+ control units, this is the SUC control unit's firmware version.
+ * - ``fw.mgmt.cmc``
+ - running
+ - For boards where the management function is split between multiple
+ control units, this is the CMC control unit's firmware version.
+ * - ``fpga.rev``
+ - running
+ - FPGA design revision.
+ * - ``fpga.app``
+ - running
+ - Datapath programmable logic version.
+ * - ``fw.app``
+ - running
+ - Datapath software/microcode/firmware version.
+ * - ``coproc.boot``
+ - running
+ - SmartNIC application co-processor (APU) first stage boot loader version.
+ * - ``coproc.uboot``
+ - running
+ - SmartNIC application co-processor (APU) co-operating system loader version.
+ * - ``coproc.main``
+ - running
+ - SmartNIC application co-processor (APU) main operating system version.
+ * - ``coproc.recovery``
+ - running
+ - SmartNIC application co-processor (APU) recovery operating system version.
+ * - ``fw.exprom``
+ - running
+ - Expansion ROM version. For boards where the expansion ROM is split between
+ multiple images (e.g. PXE and UEFI), this is specifically the PXE boot
+ ROM version.
+ * - ``fw.uefi``
+ - running
+ - UEFI driver version (No UNDI support).
diff --git a/Documentation/networking/devlink/sja1105.rst b/Documentation/networking/devlink/sja1105.rst
deleted file mode 100644
index e2679c274085..000000000000
--- a/Documentation/networking/devlink/sja1105.rst
+++ /dev/null
@@ -1,49 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=======================
-sja1105 devlink support
-=======================
-
-This document describes the devlink features implemented
-by the ``sja1105`` device driver.
-
-Parameters
-==========
-
-.. list-table:: Driver-specific parameters implemented
- :widths: 5 5 5 85
-
- * - Name
- - Type
- - Mode
- - Description
- * - ``best_effort_vlan_filtering``
- - Boolean
- - runtime
- - Allow plain ETH_P_8021Q headers to be used as DSA tags.
-
- Benefits:
-
- - Can terminate untagged traffic over switch net
- devices even when enslaved to a bridge with
- vlan_filtering=1.
- - Can terminate VLAN-tagged traffic over switch net
- devices even when enslaved to a bridge with
- vlan_filtering=1, with some constraints (no more than
- 7 non-pvid VLANs per user port).
- - Can do QoS based on VLAN PCP and VLAN membership
- admission control for autonomously forwarded frames
- (regardless of whether they can be terminated on the
- CPU or not).
-
- Drawbacks:
-
- - User cannot use VLANs in range 1024-3071. If the
- switch receives frames with such VIDs, it will
- misinterpret them as DSA tags.
- - Switch uses Shared VLAN Learning (FDB lookup uses
- only DMAC as key).
- - When VLANs span cross-chip topologies, the total
- number of permitted VLANs may be less than 7 per
- port, due to a maximum number of 32 VLAN retagging
- rules per switch.
diff --git a/Documentation/networking/driver.rst b/Documentation/networking/driver.rst
index c8f59dbda46f..4f5dfa9c022e 100644
--- a/Documentation/networking/driver.rst
+++ b/Documentation/networking/driver.rst
@@ -4,94 +4,124 @@
Softnet Driver Issues
=====================
-Transmit path guidelines:
+Probing guidelines
+==================
-1) The ndo_start_xmit method must not return NETDEV_TX_BUSY under
- any normal circumstances. It is considered a hard error unless
- there is no way your device can tell ahead of time when it's
- transmit function will become busy.
+Address validation
+------------------
- Instead it must maintain the queue properly. For example,
- for a driver implementing scatter-gather this means::
+Any hardware layer address you obtain for your device should
+be verified. For example, for Ethernet, check it with
+linux/etherdevice.h:is_valid_ether_addr().
+
+Close/stop guidelines
+=====================
+
+Quiescence
+----------
+
+After the ndo_stop routine has been called, the hardware must
+not receive or transmit any data. All in-flight packets must
+be aborted. If necessary, poll or wait for completion of
+any reset commands.
+
+Auto-close
+----------
+
+The ndo_stop routine will be called by unregister_netdevice
+if the device is still UP.
+
+Transmit path guidelines
+========================
+
+Stop queues in advance
+----------------------
+
+The ndo_start_xmit method must not return NETDEV_TX_BUSY under
+any normal circumstances. It is considered a hard error unless
+there is no way your device can tell ahead of time when its
+transmit function will become busy.
+
+Instead it must maintain the queue properly. For example,
+for a driver implementing scatter-gather this means:
+
+.. code-block:: c
+
+ static u32 drv_tx_avail(struct drv_ring *dr)
+ {
+ u32 used = READ_ONCE(dr->prod) - READ_ONCE(dr->cons);
+
+ return dr->tx_ring_size - (used & dr->tx_ring_mask);
+ }
static netdev_tx_t drv_hard_start_xmit(struct sk_buff *skb,
struct net_device *dev)
{
struct drv *dp = netdev_priv(dev);
+ struct netdev_queue *txq;
+ struct drv_ring *dr;
+ int idx;
- lock_tx(dp);
- ...
- /* This is a hard error log it. */
- if (TX_BUFFS_AVAIL(dp) <= (skb_shinfo(skb)->nr_frags + 1)) {
+ idx = skb_get_queue_mapping(skb);
+ dr = dp->tx_rings[idx];
+ txq = netdev_get_tx_queue(dev, idx);
+
+ //...
+ /* This should be a very rare race - log it. */
+ if (drv_tx_avail(dr) <= skb_shinfo(skb)->nr_frags + 1) {
netif_stop_queue(dev);
- unlock_tx(dp);
- printk(KERN_ERR PFX "%s: BUG! Tx Ring full when queue awake!\n",
- dev->name);
+ netdev_warn(dev, "Tx Ring full when queue awake!\n");
return NETDEV_TX_BUSY;
}
- ... queue packet to card ...
- ... update tx consumer index ...
-
- if (TX_BUFFS_AVAIL(dp) <= (MAX_SKB_FRAGS + 1))
- netif_stop_queue(dev);
-
- ...
- unlock_tx(dp);
- ...
- return NETDEV_TX_OK;
- }
-
- And then at the end of your TX reclamation event handling::
+ //... queue packet to card ...
- if (netif_queue_stopped(dp->dev) &&
- TX_BUFFS_AVAIL(dp) > (MAX_SKB_FRAGS + 1))
- netif_wake_queue(dp->dev);
+ netdev_tx_sent_queue(txq, skb->len);
- For a non-scatter-gather supporting card, the three tests simply become::
+ //... update tx producer index using WRITE_ONCE() ...
- /* This is a hard error log it. */
- if (TX_BUFFS_AVAIL(dp) <= 0)
+ if (!netif_txq_maybe_stop(txq, drv_tx_avail(dr),
+ MAX_SKB_FRAGS + 1, 2 * MAX_SKB_FRAGS))
+ dr->stats.stopped++;
- and::
+ //...
+ return NETDEV_TX_OK;
+ }
- if (TX_BUFFS_AVAIL(dp) == 0)
+And then at the end of your TX reclamation event handling:
- and::
+.. code-block:: c
- if (netif_queue_stopped(dp->dev) &&
- TX_BUFFS_AVAIL(dp) > 0)
- netif_wake_queue(dp->dev);
+ //... update tx consumer index using WRITE_ONCE() ...
-2) An ndo_start_xmit method must not modify the shared parts of a
- cloned SKB.
+ netif_txq_completed_wake(txq, cmpl_pkts, cmpl_bytes,
+ drv_tx_avail(dr), 2 * MAX_SKB_FRAGS);
-3) Do not forget that once you return NETDEV_TX_OK from your
- ndo_start_xmit method, it is your driver's responsibility to free
- up the SKB and in some finite amount of time.
+Lockless queue stop / wake helper macros
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- For example, this means that it is not allowed for your TX
- mitigation scheme to let TX packets "hang out" in the TX
- ring unreclaimed forever if no new TX packets are sent.
- This error can deadlock sockets waiting for send buffer room
- to be freed up.
+.. kernel-doc:: include/net/netdev_queues.h
+ :doc: Lockless queue stopping / waking helpers.
- If you return NETDEV_TX_BUSY from the ndo_start_xmit method, you
- must not keep any reference to that SKB and you must not attempt
- to free it up.
+No exclusive ownership
+----------------------
-Probing guidelines:
+An ndo_start_xmit method must not modify the shared parts of a
+cloned SKB.
-1) Any hardware layer address you obtain for your device should
- be verified. For example, for ethernet check it with
- linux/etherdevice.h:is_valid_ether_addr()
+Timely completions
+------------------
-Close/stop guidelines:
+Do not forget that once you return NETDEV_TX_OK from your
+ndo_start_xmit method, it is your driver's responsibility to free
+up the SKB, and to do so within a finite amount of time.
-1) After the ndo_stop routine has been called, the hardware must
- not receive or transmit any data. All in flight packets must
- be aborted. If necessary, poll or wait for completion of
- any reset commands.
+For example, this means that it is not allowed for your TX
+mitigation scheme to let TX packets "hang out" in the TX
+ring unreclaimed forever if no new TX packets are sent.
+This error can deadlock sockets waiting for send buffer room
+to be freed up.
-2) The ndo_stop routine will be called by unregister_netdevice
- if device is still UP.
+If you return NETDEV_TX_BUSY from the ndo_start_xmit method, you
+must not keep any reference to that SKB and you must not attempt
+to free it up.
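
Reviewer note on the driver.rst hunk above: the "stop queues in advance" rule can be modeled in plain userspace C. The sketch below is a single-threaded illustration only. The names `ring_model`, `model_xmit()`, `model_complete()` and the constant `MODEL_MAX_FRAGS` are hypothetical and are not kernel APIs. The threshold comparisons are assumptions modeled on the `MAX_SKB_FRAGS + 1` and `2 * MAX_SKB_FRAGS` values used in the documented example, and the ring mask is omitted because free-running unsigned indices already give the right difference. The real helpers in include/net/netdev_queues.h also handle races between the xmit and completion paths, which this model ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for MAX_SKB_FRAGS; the value is an assumption. */
#define MODEL_MAX_FRAGS 17u
#define MODEL_STOP_THRS (MODEL_MAX_FRAGS + 1u)  /* room for one worst-case packet */
#define MODEL_WAKE_THRS (2u * MODEL_MAX_FRAGS)  /* hysteresis before waking */

enum model_tx_result { MODEL_TX_OK, MODEL_TX_BUSY };

struct ring_model {
    uint32_t prod;   /* free-running producer index */
    uint32_t cons;   /* free-running consumer index */
    uint32_t size;   /* number of usable descriptors */
    bool stopped;    /* models the stopped state of the Tx queue */
};

/* Free descriptors; unsigned subtraction stays correct across index wrap. */
static uint32_t model_avail(const struct ring_model *r)
{
    uint32_t used = r->prod - r->cons;

    assert(used <= r->size);
    return r->size - used;
}

/* Queue a packet needing nr_desc descriptors (nr_frags + 1 in the doc). */
static enum model_tx_result model_xmit(struct ring_model *r, uint32_t nr_desc)
{
    /* Should be rare: the stack called us although there is no room. */
    if (model_avail(r) <= nr_desc) {
        r->stopped = true;
        return MODEL_TX_BUSY;
    }

    r->prod += nr_desc;

    /* Stop in advance if the next worst-case packet might not fit. */
    if (model_avail(r) < MODEL_STOP_THRS)
        r->stopped = true;

    return MODEL_TX_OK;
}

/* Reclaim nr_desc completed descriptors and wake the queue if possible. */
static void model_complete(struct ring_model *r, uint32_t nr_desc)
{
    r->cons += nr_desc;

    if (r->stopped && model_avail(r) >= MODEL_WAKE_THRS)
        r->stopped = false;
}
```

With a 64-entry ring and packets of 18 descriptors each, the third packet leaves 10 free descriptors, so the model stops the queue while still returning success. The queue is woken only once completions raise the free count to at least 34, which is the hysteresis the two thresholds are meant to provide.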
diff --git a/Documentation/networking/dsa/b53.rst b/Documentation/networking/dsa/b53.rst
index b41637cdb82b..1cb3ff648f88 100644
--- a/Documentation/networking/dsa/b53.rst
+++ b/Documentation/networking/dsa/b53.rst
@@ -52,7 +52,7 @@ VLAN programming would basically change the CPU port's default PVID and make
it untagged, undesirable.
In difference to the configuration described in :ref:`dsa-vlan-configuration`
-the default VLAN 1 has to be removed from the slave interface configuration in
+the default VLAN 1 has to be removed from the user interface configuration in
single port and gateway configuration, while there is no need to add an extra
VLAN configuration in the bridge showcase.
@@ -68,13 +68,13 @@ By default packages are tagged with vid 1:
ip link add link eth0 name eth0.2 type vlan id 2
ip link add link eth0 name eth0.3 type vlan id 3
- # The master interface needs to be brought up before the slave ports.
+ # The conduit interface needs to be brought up before the user ports.
ip link set eth0 up
ip link set eth0.1 up
ip link set eth0.2 up
ip link set eth0.3 up
- # bring up the slave interfaces
+ # bring up the user interfaces
ip link set wan up
ip link set lan1 up
ip link set lan2 up
@@ -113,11 +113,11 @@ bridge
# tag traffic on CPU port
ip link add link eth0 name eth0.1 type vlan id 1
- # The master interface needs to be brought up before the slave ports.
+ # The conduit interface needs to be brought up before the user ports.
ip link set eth0 up
ip link set eth0.1 up
- # bring up the slave interfaces
+ # bring up the user interfaces
ip link set wan up
ip link set lan1 up
ip link set lan2 up
@@ -149,12 +149,12 @@ gateway
ip link add link eth0 name eth0.1 type vlan id 1
ip link add link eth0 name eth0.2 type vlan id 2
- # The master interface needs to be brought up before the slave ports.
+ # The conduit interface needs to be brought up before the user ports.
ip link set eth0 up
ip link set eth0.1 up
ip link set eth0.2 up
- # bring up the slave interfaces
+ # bring up the user interfaces
ip link set wan up
ip link set lan1 up
ip link set lan2 up
diff --git a/Documentation/networking/dsa/bcm_sf2.rst b/Documentation/networking/dsa/bcm_sf2.rst
index dee234039e1e..d2571435696f 100644
--- a/Documentation/networking/dsa/bcm_sf2.rst
+++ b/Documentation/networking/dsa/bcm_sf2.rst
@@ -67,7 +67,7 @@ MDIO indirect accesses
----------------------
Due to a limitation in how Broadcom switches have been designed, external
-Broadcom switches connected to a SF2 require the use of the DSA slave MDIO bus
+Broadcom switches connected to a SF2 require the use of the DSA user MDIO bus
in order to properly configure them. By default, the SF2 pseudo-PHY address, and
an external switch pseudo-PHY address will both be snooping for incoming MDIO
transactions, since they are at the same address (30), resulting in some kind of
diff --git a/Documentation/networking/dsa/configuration.rst b/Documentation/networking/dsa/configuration.rst
index af029b3ca2ab..6cc4ded3cc23 100644
--- a/Documentation/networking/dsa/configuration.rst
+++ b/Documentation/networking/dsa/configuration.rst
@@ -5,7 +5,7 @@ DSA switch configuration from userspace
=======================================
The DSA switch configuration is not integrated into the main userspace
-network configuration suites by now and has to be performed manualy.
+network configuration suites by now and has to be performed manually.
.. _dsa-config-showcases:
@@ -31,28 +31,38 @@ at https://www.kernel.org/pub/linux/utils/net/iproute2/
Through DSA every port of a switch is handled like a normal linux Ethernet
interface. The CPU port is the switch port connected to an Ethernet MAC chip.
-The corresponding linux Ethernet interface is called the master interface.
-All other corresponding linux interfaces are called slave interfaces.
+The corresponding linux Ethernet interface is called the conduit interface.
+All other corresponding linux interfaces are called user interfaces.
-The slave interfaces depend on the master interface. They can only brought up,
-when the master interface is up.
+The user interfaces depend on the conduit interface being up in order for them
+to send or receive traffic. Prior to kernel v5.12, the state of the conduit
+interface had to be managed explicitly by the user. Starting with kernel v5.12,
+the behavior is as follows:
+
+- when a DSA user interface is brought up, the conduit interface is
+ automatically brought up.
+- when the conduit interface is brought down, all DSA user interfaces are
+ automatically brought down.
In this documentation the following Ethernet interfaces are used:
*eth0*
- the master interface
+ the conduit interface
+
+*eth1*
+ another conduit interface
*lan1*
- a slave interface
+ a user interface
*lan2*
- another slave interface
+ another user interface
*lan3*
- a third slave interface
+ a third user interface
*wan*
- A slave interface dedicated for upstream traffic
+ A user interface dedicated for upstream traffic
Further Ethernet interfaces can be configured similar.
The configured IPs and networks are:
@@ -78,79 +88,76 @@ The tagging based configuration is desired and supported by the majority of
DSA switches. These switches are capable to tag incoming and outgoing traffic
without using a VLAN based configuration.
-single port
-~~~~~~~~~~~
-
-.. code-block:: sh
-
- # configure each interface
- ip addr add 192.0.2.1/30 dev lan1
- ip addr add 192.0.2.5/30 dev lan2
- ip addr add 192.0.2.9/30 dev lan3
-
- # The master interface needs to be brought up before the slave ports.
- ip link set eth0 up
+*single port*
+ .. code-block:: sh
- # bring up the slave interfaces
- ip link set lan1 up
- ip link set lan2 up
- ip link set lan3 up
+ # configure each interface
+ ip addr add 192.0.2.1/30 dev lan1
+ ip addr add 192.0.2.5/30 dev lan2
+ ip addr add 192.0.2.9/30 dev lan3
-bridge
-~~~~~~
+ # For kernels earlier than v5.12, the conduit interface needs to be
+ # brought up manually before the user ports.
+ ip link set eth0 up
-.. code-block:: sh
+ # bring up the user interfaces
+ ip link set lan1 up
+ ip link set lan2 up
+ ip link set lan3 up
- # The master interface needs to be brought up before the slave ports.
- ip link set eth0 up
+*bridge*
+ .. code-block:: sh
- # bring up the slave interfaces
- ip link set lan1 up
- ip link set lan2 up
- ip link set lan3 up
+ # For kernels earlier than v5.12, the conduit interface needs to be
+ # brought up manually before the user ports.
+ ip link set eth0 up
- # create bridge
- ip link add name br0 type bridge
+ # bring up the user interfaces
+ ip link set lan1 up
+ ip link set lan2 up
+ ip link set lan3 up
- # add ports to bridge
- ip link set dev lan1 master br0
- ip link set dev lan2 master br0
- ip link set dev lan3 master br0
+ # create bridge
+ ip link add name br0 type bridge
- # configure the bridge
- ip addr add 192.0.2.129/25 dev br0
+ # add ports to bridge
+ ip link set dev lan1 master br0
+ ip link set dev lan2 master br0
+ ip link set dev lan3 master br0
- # bring up the bridge
- ip link set dev br0 up
+ # configure the bridge
+ ip addr add 192.0.2.129/25 dev br0
-gateway
-~~~~~~~
+ # bring up the bridge
+ ip link set dev br0 up
-.. code-block:: sh
+*gateway*
+ .. code-block:: sh
- # The master interface needs to be brought up before the slave ports.
- ip link set eth0 up
+ # For kernels earlier than v5.12, the conduit interface needs to be
+ # brought up manually before the user ports.
+ ip link set eth0 up
- # bring up the slave interfaces
- ip link set wan up
- ip link set lan1 up
- ip link set lan2 up
+ # bring up the user interfaces
+ ip link set wan up
+ ip link set lan1 up
+ ip link set lan2 up
- # configure the upstream port
- ip addr add 192.0.2.1/30 dev wan
+ # configure the upstream port
+ ip addr add 192.0.2.1/30 dev wan
- # create bridge
- ip link add name br0 type bridge
+ # create bridge
+ ip link add name br0 type bridge
- # add ports to bridge
- ip link set dev lan1 master br0
- ip link set dev lan2 master br0
+ # add ports to bridge
+ ip link set dev lan1 master br0
+ ip link set dev lan2 master br0
- # configure the bridge
- ip addr add 192.0.2.129/25 dev br0
+ # configure the bridge
+ ip addr add 192.0.2.129/25 dev br0
- # bring up the bridge
- ip link set dev br0 up
+ # bring up the bridge
+ ip link set dev br0 up
.. _dsa-vlan-configuration:
@@ -161,132 +168,291 @@ A minority of switches are not capable to use a taging protocol
(DSA_TAG_PROTO_NONE). These switches can be configured by a VLAN based
configuration.
-single port
-~~~~~~~~~~~
-The configuration can only be set up via VLAN tagging and bridge setup.
-
-.. code-block:: sh
-
- # tag traffic on CPU port
- ip link add link eth0 name eth0.1 type vlan id 1
- ip link add link eth0 name eth0.2 type vlan id 2
- ip link add link eth0 name eth0.3 type vlan id 3
-
- # The master interface needs to be brought up before the slave ports.
- ip link set eth0 up
- ip link set eth0.1 up
- ip link set eth0.2 up
- ip link set eth0.3 up
-
- # bring up the slave interfaces
- ip link set lan1 up
- ip link set lan1 up
- ip link set lan3 up
-
- # create bridge
- ip link add name br0 type bridge
-
- # activate VLAN filtering
- ip link set dev br0 type bridge vlan_filtering 1
-
- # add ports to bridges
- ip link set dev lan1 master br0
- ip link set dev lan2 master br0
- ip link set dev lan3 master br0
-
- # tag traffic on ports
- bridge vlan add dev lan1 vid 1 pvid untagged
- bridge vlan add dev lan2 vid 2 pvid untagged
- bridge vlan add dev lan3 vid 3 pvid untagged
-
- # configure the VLANs
- ip addr add 192.0.2.1/30 dev eth0.1
- ip addr add 192.0.2.5/30 dev eth0.2
- ip addr add 192.0.2.9/30 dev eth0.3
-
- # bring up the bridge devices
- ip link set br0 up
-
+*single port*
+ The configuration can only be set up via VLAN tagging and bridge setup.
-bridge
-~~~~~~
+ .. code-block:: sh
-.. code-block:: sh
+ # tag traffic on CPU port
+ ip link add link eth0 name eth0.1 type vlan id 1
+ ip link add link eth0 name eth0.2 type vlan id 2
+ ip link add link eth0 name eth0.3 type vlan id 3
- # tag traffic on CPU port
- ip link add link eth0 name eth0.1 type vlan id 1
+ # For kernels earlier than v5.12, the conduit interface needs to be
+ # brought up manually before the user ports.
+ ip link set eth0 up
+ ip link set eth0.1 up
+ ip link set eth0.2 up
+ ip link set eth0.3 up
- # The master interface needs to be brought up before the slave ports.
- ip link set eth0 up
- ip link set eth0.1 up
+ # bring up the user interfaces
+ ip link set lan1 up
+ ip link set lan2 up
+ ip link set lan3 up
- # bring up the slave interfaces
- ip link set lan1 up
- ip link set lan2 up
- ip link set lan3 up
+ # create bridge
+ ip link add name br0 type bridge
- # create bridge
- ip link add name br0 type bridge
+ # activate VLAN filtering
+ ip link set dev br0 type bridge vlan_filtering 1
- # activate VLAN filtering
- ip link set dev br0 type bridge vlan_filtering 1
+ # add ports to bridges
+ ip link set dev lan1 master br0
+ ip link set dev lan2 master br0
+ ip link set dev lan3 master br0
- # add ports to bridge
- ip link set dev lan1 master br0
- ip link set dev lan2 master br0
- ip link set dev lan3 master br0
- ip link set eth0.1 master br0
+ # tag traffic on ports
+ bridge vlan add dev lan1 vid 1 pvid untagged
+ bridge vlan add dev lan2 vid 2 pvid untagged
+ bridge vlan add dev lan3 vid 3 pvid untagged
- # tag traffic on ports
- bridge vlan add dev lan1 vid 1 pvid untagged
- bridge vlan add dev lan2 vid 1 pvid untagged
- bridge vlan add dev lan3 vid 1 pvid untagged
+ # configure the VLANs
+ ip addr add 192.0.2.1/30 dev eth0.1
+ ip addr add 192.0.2.5/30 dev eth0.2
+ ip addr add 192.0.2.9/30 dev eth0.3
- # configure the bridge
- ip addr add 192.0.2.129/25 dev br0
+ # bring up the bridge devices
+ ip link set br0 up
- # bring up the bridge
- ip link set dev br0 up
-gateway
-~~~~~~~
+*bridge*
+ .. code-block:: sh
-.. code-block:: sh
+ # tag traffic on CPU port
+ ip link add link eth0 name eth0.1 type vlan id 1
- # tag traffic on CPU port
- ip link add link eth0 name eth0.1 type vlan id 1
- ip link add link eth0 name eth0.2 type vlan id 2
+ # For kernels earlier than v5.12, the conduit interface needs to be
+ # brought up manually before the user ports.
+ ip link set eth0 up
+ ip link set eth0.1 up
- # The master interface needs to be brought up before the slave ports.
- ip link set eth0 up
- ip link set eth0.1 up
- ip link set eth0.2 up
+ # bring up the user interfaces
+ ip link set lan1 up
+ ip link set lan2 up
+ ip link set lan3 up
- # bring up the slave interfaces
- ip link set wan up
- ip link set lan1 up
- ip link set lan2 up
+ # create bridge
+ ip link add name br0 type bridge
- # create bridge
- ip link add name br0 type bridge
+ # activate VLAN filtering
+ ip link set dev br0 type bridge vlan_filtering 1
- # activate VLAN filtering
- ip link set dev br0 type bridge vlan_filtering 1
+ # add ports to bridge
+ ip link set dev lan1 master br0
+ ip link set dev lan2 master br0
+ ip link set dev lan3 master br0
+ ip link set eth0.1 master br0
- # add ports to bridges
- ip link set dev wan master br0
- ip link set eth0.1 master br0
- ip link set dev lan1 master br0
- ip link set dev lan2 master br0
+ # tag traffic on ports
+ bridge vlan add dev lan1 vid 1 pvid untagged
+ bridge vlan add dev lan2 vid 1 pvid untagged
+ bridge vlan add dev lan3 vid 1 pvid untagged
- # tag traffic on ports
- bridge vlan add dev lan1 vid 1 pvid untagged
- bridge vlan add dev lan2 vid 1 pvid untagged
- bridge vlan add dev wan vid 2 pvid untagged
+ # configure the bridge
+ ip addr add 192.0.2.129/25 dev br0
- # configure the VLANs
- ip addr add 192.0.2.1/30 dev eth0.2
- ip addr add 192.0.2.129/25 dev br0
+ # bring up the bridge
+ ip link set dev br0 up
- # bring up the bridge devices
- ip link set br0 up
+*gateway*
+ .. code-block:: sh
+
+ # tag traffic on CPU port
+ ip link add link eth0 name eth0.1 type vlan id 1
+ ip link add link eth0 name eth0.2 type vlan id 2
+
+ # For kernels earlier than v5.12, the conduit interface needs to be
+ # brought up manually before the user ports.
+ ip link set eth0 up
+ ip link set eth0.1 up
+ ip link set eth0.2 up
+
+ # bring up the user interfaces
+ ip link set wan up
+ ip link set lan1 up
+ ip link set lan2 up
+
+ # create bridge
+ ip link add name br0 type bridge
+
+ # activate VLAN filtering
+ ip link set dev br0 type bridge vlan_filtering 1
+
+ # add ports to bridges
+ ip link set dev wan master br0
+ ip link set eth0.1 master br0
+ ip link set dev lan1 master br0
+ ip link set dev lan2 master br0
+
+ # tag traffic on ports
+ bridge vlan add dev lan1 vid 1 pvid untagged
+ bridge vlan add dev lan2 vid 1 pvid untagged
+ bridge vlan add dev wan vid 2 pvid untagged
+
+ # configure the VLANs
+ ip addr add 192.0.2.1/30 dev eth0.2
+ ip addr add 192.0.2.129/25 dev br0
+
+ # bring up the bridge devices
+ ip link set br0 up
+
+Forwarding database (FDB) management
+------------------------------------
+
+The existing DSA switches do not have the necessary hardware support to keep
+the software FDB of the bridge in sync with the hardware tables, so the two
+tables are managed separately (``bridge fdb show`` queries both, and depending
+on whether the ``self`` or ``master`` flags are being used, a ``bridge fdb
+add`` or ``bridge fdb del`` command acts upon entries from one or both tables).
+
+Up until kernel v4.14, DSA only supported user space management of bridge FDB
+entries using the bridge bypass operations (which do not update the software
+FDB, just the hardware one) using the ``self`` flag (which is optional and can
+be omitted).
+
+ .. code-block:: sh
+
+ bridge fdb add dev swp0 00:01:02:03:04:05 self static
+ # or shorthand
+ bridge fdb add dev swp0 00:01:02:03:04:05 static
+
+Due to a bug, the bridge bypass FDB implementation provided by DSA did not
+distinguish between ``static`` and ``local`` FDB entries (``static`` are meant
+to be forwarded, while ``local`` are meant to be locally terminated, i.e. sent
+to the host port). Instead, all FDB entries with the ``self`` flag (implicit or
+explicit) are treated by DSA as ``static`` even if they are ``local``.
+
+ .. code-block:: sh
+
+ # This command:
+ bridge fdb add dev swp0 00:01:02:03:04:05 static
+ # behaves the same for DSA as this command:
+ bridge fdb add dev swp0 00:01:02:03:04:05 local
+ # or shorthand, because the 'local' flag is implicit if 'static' is not
+ # specified, it also behaves the same as:
+ bridge fdb add dev swp0 00:01:02:03:04:05
+
+The last command is an incorrect way of adding a static bridge FDB entry to a
+DSA switch using the bridge bypass operations, and works by mistake. Other
+drivers will treat an FDB entry added by the same command as ``local`` and as
+such, will not forward it, as opposed to DSA.
+
+Between kernel v4.14 and v5.14, DSA has supported in parallel two modes of
+adding a bridge FDB entry to the switch: the bridge bypass discussed above, as
+well as a new mode using the ``master`` flag which installs FDB entries in the
+software bridge too.
+
+ .. code-block:: sh
+
+ bridge fdb add dev swp0 00:01:02:03:04:05 master static
+
+Since kernel v5.14, DSA has gained stronger integration with the bridge's
+software FDB, and the support for its bridge bypass FDB implementation (using
+the ``self`` flag) has been removed. This results in the following changes:
+
+ .. code-block:: sh
+
+ # This is the only valid way of adding an FDB entry that is supported,
+ # compatible with v4.14 kernels and later:
+ bridge fdb add dev swp0 00:01:02:03:04:05 master static
+ # This command is no longer buggy and the entry is properly treated as
+ # 'local' instead of being forwarded:
+ bridge fdb add dev swp0 00:01:02:03:04:05
+ # This command no longer installs a static FDB entry to hardware:
+ bridge fdb add dev swp0 00:01:02:03:04:05 static
+
+Script writers are therefore encouraged to use the ``master static`` set of
+flags when working with bridge FDB entries on DSA switch interfaces.
+
+Affinity of user ports to CPU ports
+-----------------------------------
+
+Typically, DSA switches are attached to the host via a single Ethernet
+interface, but in cases where the switch chip is discrete, the hardware design
+may permit the use of 2 or more ports connected to the host, for an increase in
+termination throughput.
+
+DSA can make use of multiple CPU ports in two ways. First, it is possible to
+statically assign the termination traffic associated with a certain user port
+to be processed by a certain CPU port. This way, user space can implement
+custom policies of static load balancing between user ports, by spreading the
+affinities according to the available CPU ports.
+
+Secondly, it is possible to perform load balancing between CPU ports on a per
+packet basis, rather than statically assigning user ports to CPU ports.
+This can be achieved by placing the DSA conduits under a LAG interface (bonding
+or team). DSA monitors this operation and creates a mirror of this software LAG
+on the CPU ports facing the physical DSA conduits that constitute the LAG slave
+devices.
+
+To make use of multiple CPU ports, the firmware (device tree) description of
+the switch must mark all the links between CPU ports and their DSA conduits
+using the ``ethernet`` reference/phandle. At startup, only a single CPU port
+and DSA conduit will be used - the numerically first port from the firmware
+description which has an ``ethernet`` property. It is up to the user to
+configure the system for the switch to use other conduits.
+
+DSA uses the ``rtnl_link_ops`` mechanism (with a "dsa" ``kind``) to allow
+changing the DSA conduit of a user port. The ``IFLA_DSA_CONDUIT`` u32 netlink
+attribute contains the ifindex of the conduit device that handles each user
+device. The DSA conduit must be a valid candidate based on firmware node
+information, or a LAG interface which contains only slaves which are valid
+candidates.
+
+Using iproute2, the following manipulations are possible:
+
+ .. code-block:: sh
+
+ # See the DSA conduit in current use
+ ip -d link show dev swp0
+ (...)
+ dsa master eth0
+
+ # Static CPU port distribution
+ ip link set swp0 type dsa master eth1
+ ip link set swp1 type dsa master eth0
+ ip link set swp2 type dsa master eth1
+ ip link set swp3 type dsa master eth0
+
+ # CPU ports in LAG, using explicit assignment of the DSA conduit
+ ip link add bond0 type bond mode balance-xor && ip link set bond0 up
+ ip link set eth1 down && ip link set eth1 master bond0
+ ip link set swp0 type dsa master bond0
+ ip link set swp1 type dsa master bond0
+ ip link set swp2 type dsa master bond0
+ ip link set swp3 type dsa master bond0
+ ip link set eth0 down && ip link set eth0 master bond0
+ ip -d link show dev swp0
+ (...)
+ dsa master bond0
+
+ # CPU ports in LAG, relying on implicit migration of the DSA conduit
+ ip link add bond0 type bond mode balance-xor && ip link set bond0 up
+ ip link set eth0 down && ip link set eth0 master bond0
+ ip link set eth1 down && ip link set eth1 master bond0
+ ip -d link show dev swp0
+ (...)
+ dsa master bond0
+
+Notice that in the case of CPU ports under a LAG, the use of the
+``IFLA_DSA_CONDUIT`` netlink attribute is not strictly needed, but rather, DSA
+reacts to the ``IFLA_MASTER`` attribute change of its present conduit (``eth0``)
+and migrates all user ports to the new upper of ``eth0``, ``bond0``. Similarly,
+when ``bond0`` is destroyed using ``RTM_DELLINK``, DSA migrates the user ports
+that were assigned to this interface to the first physical DSA conduit which is
+eligible, based on the firmware description (it effectively reverts to the
+startup configuration).
+
+In a setup with more than 2 physical CPU ports, it is therefore possible to mix
+static user to CPU port assignment with LAG between DSA conduits. It is not
+possible to statically assign a user port towards a DSA conduit that has any
+upper interfaces (this includes LAG devices - the conduit must always be the LAG
+in this case).
+
+Live changing of the DSA conduit (and thus CPU port) affinity of a user port is
+permitted, in order to allow dynamic redistribution in response to traffic.
+
+Physical DSA conduits are allowed to join and leave a LAG interface used as a
+DSA conduit at any time; however, DSA will reject a LAG interface as a valid
+candidate for being a DSA conduit unless it has at least one physical DSA conduit
+as a slave device.
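
Reviewer note on the configuration.rst hunk above: the implicit conduit migration described for CPU ports under a LAG can be pictured with a toy userspace model. This is a sketch under stated assumptions, not how the kernel implements it. The `user_port` structure and the functions `conduit_joins_lag()` and `lag_destroyed()` are hypothetical names, and the "first eligible physical conduit" is simply passed in by the caller rather than derived from a firmware description.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a DSA user port and the conduit it is bound to. */
struct user_port {
    const char *name;
    const char *conduit;
};

/* Rebind every port currently using 'from' to 'to'; returns the count. */
static int rebind_ports(struct user_port *ports, size_t n,
                        const char *from, const char *to)
{
    int moved = 0;

    assert(ports && from && to);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(ports[i].conduit, from) == 0) {
            ports[i].conduit = to;
            moved++;
        }
    }
    return moved;
}

/*
 * Models the reaction to a conduit gaining a LAG upper: user ports
 * bound to that conduit follow it to the LAG.
 */
static int conduit_joins_lag(struct user_port *ports, size_t n,
                             const char *conduit, const char *lag)
{
    return rebind_ports(ports, n, conduit, lag);
}

/*
 * Models the LAG being deleted: user ports bound to it fall back to the
 * first eligible physical conduit, supplied here by the caller.
 */
static int lag_destroyed(struct user_port *ports, size_t n,
                         const char *lag, const char *first_conduit)
{
    return rebind_ports(ports, n, lag, first_conduit);
}
```

Starting with four user ports on ``eth0``, enslaving ``eth0`` to ``bond0`` moves all four to ``bond0``, enslaving ``eth1`` afterwards moves none, and destroying ``bond0`` returns all four to ``eth0``, matching the "implicit migration" sequence in the shell example.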
diff --git a/Documentation/networking/dsa/dsa.rst b/Documentation/networking/dsa/dsa.rst
index a8d15dd2b42b..7b2e69cd7ef0 100644
--- a/Documentation/networking/dsa/dsa.rst
+++ b/Documentation/networking/dsa/dsa.rst
@@ -10,22 +10,22 @@ in joining the effort.
Design principles
=================
-The Distributed Switch Architecture is a subsystem which was primarily designed
-to support Marvell Ethernet switches (MV88E6xxx, a.k.a Linkstreet product line)
-using Linux, but has since evolved to support other vendors as well.
+The Distributed Switch Architecture subsystem was primarily designed to
+support Marvell Ethernet switches (MV88E6xxx, a.k.a. Link Street product
+line) using Linux, but has since evolved to support other vendors as well.
The original philosophy behind this design was to be able to use unmodified
Linux tools such as bridge, iproute2, ifconfig to work transparently whether
they configured/queried a switch port network device or a regular network
device.
-An Ethernet switch is typically comprised of multiple front-panel ports, and one
-or more CPU or management port. The DSA subsystem currently relies on the
+An Ethernet switch typically comprises multiple front-panel ports and one
+or more CPU or management ports. The DSA subsystem currently relies on the
presence of a management port connected to an Ethernet controller capable of
receiving Ethernet frames from the switch. This is a very common setup for all
kinds of Ethernet switches found in Small Home and Office products: routers,
-gateways, or even top-of-the rack switches. This host Ethernet controller will
-be later referred to as "master" and "cpu" in DSA terminology and code.
+gateways, or even top-of-rack switches. This host Ethernet controller will
+be later referred to as "conduit" and "cpu" in DSA terminology and code.
The D in DSA stands for Distributed, because the subsystem has been designed
with the ability to configure and manage cascaded switches on top of each other
@@ -33,14 +33,14 @@ using upstream and downstream Ethernet links between switches. These specific
ports are referred to as "dsa" ports in DSA terminology and code. A collection
of multiple switches connected to each other is called a "switch tree".
-For each front-panel port, DSA will create specialized network devices which are
+For each front-panel port, DSA creates specialized network devices which are
used as controlling and data-flowing endpoints for use by the Linux networking
-stack. These specialized network interfaces are referred to as "slave" network
+stack. These specialized network interfaces are referred to as "user" network
interfaces in DSA terminology and code.
The ideal case for using DSA is when an Ethernet switch supports a "switch tag"
which is a hardware feature making the switch insert a specific tag for each
-Ethernet frames it received to/from specific ports to help the management
+Ethernet frame it receives to/from specific ports to help the management
interface figure out:
- what port is this frame coming from
@@ -56,23 +56,21 @@ Note that DSA does not currently create network interfaces for the "cpu" and
- the "cpu" port is the Ethernet switch facing side of the management
controller, and as such, would create a duplication of feature, since you
- would get two interfaces for the same conduit: master netdev, and "cpu" netdev
+ would get two interfaces for the same conduit: conduit netdev, and "cpu" netdev
- the "dsa" port(s) are just conduits between two or more switches, and as such
cannot really be used as proper network interfaces either, only the
downstream, or the top-most upstream interface makes sense with that model
+NB: for the past 15 years, the DSA subsystem had been making use of the terms
+"master" (rather than "conduit") and "slave" (rather than "user"). These terms
+have been removed from the DSA codebase and phased out of the uAPI.
+
Switch tagging protocols
------------------------
-DSA currently supports 5 different tagging protocols, and a tag-less mode as
-well. The different protocols are implemented in:
-
-- ``net/dsa/tag_trailer.c``: Marvell's 4 trailer tag mode (legacy)
-- ``net/dsa/tag_dsa.c``: Marvell's original DSA tag
-- ``net/dsa/tag_edsa.c``: Marvell's enhanced DSA tag
-- ``net/dsa/tag_brcm.c``: Broadcom's 4 bytes tag
-- ``net/dsa/tag_qca.c``: Qualcomm's 2 bytes tag
+DSA supports many vendor-specific tagging protocols, one software-defined
+tagging protocol, and a tag-less mode as well (``DSA_TAG_PROTO_NONE``).
The exact format of the tag protocol is vendor specific, but in general, they
all contain something which:
@@ -80,10 +78,153 @@ all contain something which:
- identifies which port the Ethernet frame came from/should be sent to
- provides a reason why this frame was forwarded to the management interface
-Master network devices
-----------------------
-
-Master network devices are regular, unmodified Linux network device drivers for
+All tagging protocols are in ``net/dsa/tag_*.c`` files and implement the
+methods of the ``struct dsa_device_ops`` structure, which are detailed below.
+
+Tagging protocols generally fall in one of three categories:
+
+1. The switch-specific frame header is located before the Ethernet header,
+ shifting to the right (from the perspective of the DSA conduit's frame
+ parser) the MAC DA, MAC SA, EtherType and the entire L2 payload.
+2. The switch-specific frame header is located before the EtherType, keeping
+ the MAC DA and MAC SA in place from the DSA conduit's perspective, but
+ shifting the 'real' EtherType and L2 payload to the right.
+3. The switch-specific frame header is located at the tail of the packet,
+ keeping all frame headers in place and not altering the view of the packet
+ that the DSA conduit's frame parser has.
+
+A tagging protocol may tag all packets with switch tags of the same length, or
+the tag length might vary (for example packets with PTP timestamps might
+require an extended switch tag, or there might be one tag length on TX and a
+different one on RX). Either way, the tagging protocol driver must populate the
+``struct dsa_device_ops::needed_headroom`` and/or ``struct dsa_device_ops::needed_tailroom``
+with the length in octets of the longest switch frame header/trailer. The DSA
+framework will automatically adjust the MTU of the conduit interface to
+accommodate this extra size in order for DSA user ports to support the
+standard MTU (L2 payload length) of 1500 octets. The ``needed_headroom`` and
+``needed_tailroom`` properties are also used to request from the network stack,
+on a best-effort basis, the allocation of packets with enough extra space such
+that the act of pushing the switch tag on transmission of a packet does not
+cause it to reallocate due to lack of memory.
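
The MTU adjustment described above amounts to simple arithmetic. The following is a minimal userspace sketch, not the kernel implementation: ``struct tagger_room`` and ``conduit_mtu_for()`` are hypothetical names introduced only for illustration, and only the two field names come from the text.

```c
#include <assert.h>
#include <stddef.h>

#define ETH_DATA_LEN 1500 /* standard L2 payload length, in octets */

/* Hypothetical stand-in for the two dsa_device_ops fields discussed above. */
struct tagger_room {
	size_t needed_headroom; /* longest switch frame header, in octets */
	size_t needed_tailroom; /* longest switch frame trailer, in octets */
};

/*
 * MTU that the conduit must support so that a user port can still carry
 * user_mtu octets of L2 payload once the switch tag has been added.
 */
static size_t conduit_mtu_for(const struct tagger_room *t, size_t user_mtu)
{
	assert(t != NULL);
	return user_mtu + t->needed_headroom + t->needed_tailroom;
}
```

Under these assumptions, a tagger with an 8-octet header and no trailer would need the conduit to accept 1508 octets for user ports to keep the standard 1500-octet MTU.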
+
+Even though applications are not expected to parse DSA-specific frame headers,
+the format on the wire of the tagging protocol represents an Application Binary
+Interface exposed by the kernel towards user space, for decoders such as
+``libpcap``. The tagging protocol driver must populate the ``proto`` member of
+``struct dsa_device_ops`` with a value that uniquely describes the
+characteristics of the interaction required between the switch hardware and the
+data path driver: the offset of each bit field within the frame header and any
+stateful processing required to deal with the frames (as may be required for
+PTP timestamping).
+
+From the perspective of the network stack, all switches within the same DSA
+switch tree use the same tagging protocol. In case of a packet transiting a
+fabric with more than one switch, the switch-specific frame header is inserted
+by the first switch in the fabric that the packet was received on. This header
+typically contains information regarding its type (whether it is a control
+frame that must be trapped to the CPU, or a data frame to be forwarded).
+Control frames should be decapsulated only by the software data path, whereas
+data frames might also be autonomously forwarded towards other user ports of
+other switches from the same fabric, and in this case, the outermost switch
+ports must decapsulate the packet.
+
+Note that in certain cases, the tagging format used by a leaf switch (not
+connected directly to the CPU) might not be the same as what the network stack
+sees. This can be seen with Marvell switch trees, where the
+CPU port can be configured to use either the DSA or the Ethertype DSA (EDSA)
+format, but the DSA links are configured to use the shorter (without Ethertype)
+DSA frame header, in order to reduce the autonomous packet forwarding overhead.
+It still remains the case that, if the DSA switch tree is configured for the
+EDSA tagging protocol, the operating system sees EDSA-tagged packets from the
+leaf switches that tagged them with the shorter DSA header. This can be done
+because the Marvell switch connected directly to the CPU is configured to
+perform tag translation between DSA and EDSA (which is simply the operation of
+adding or removing the ``ETH_P_EDSA`` EtherType and some padding octets).
+
+It is possible to construct cascaded setups of DSA switches even if their
+tagging protocols are not compatible with one another. In this case, there are
+no DSA links in this fabric, and each switch constitutes a disjoint DSA switch
+tree. The DSA links are viewed as simply a pair of a DSA conduit (the out-facing
+port of the upstream DSA switch) and a CPU port (the in-facing port of the
+downstream DSA switch).
+
+The tagging protocol of the attached DSA switch tree can be viewed through the
+``dsa/tagging`` sysfs attribute of the DSA conduit::
+
+ cat /sys/class/net/eth0/dsa/tagging
+
+If the hardware and driver are capable, the tagging protocol of the DSA switch
+tree can be changed at runtime. This is done by writing the new tagging
+protocol name to the same sysfs device attribute as above (the DSA conduit and
+all attached switch ports must be down while doing this).
+
+It is desirable that all tagging protocols are testable with the ``dsa_loop``
+mockup driver, which can be attached to any network interface. The goal is that
+any network interface should be capable of transmitting the same packet in the
+same way, and the tagger should decode the same received packet in the same way
+regardless of the driver used for the switch control path, and the driver used
+for the DSA conduit.
+
+The transmission of a packet goes through the tagger's ``xmit`` function.
+The passed ``struct sk_buff *skb`` has ``skb->data`` pointing at
+``skb_mac_header(skb)``, i.e. at the destination MAC address, and the passed
+``struct net_device *dev`` represents the virtual DSA user network interface
+whose hardware counterpart the packet must be steered to (e.g. ``swp0``).
+The job of this method is to prepare the skb in a way that the switch will
+understand what egress port the packet is for (and not deliver it towards other
+ports). Typically this is fulfilled by pushing a frame header. Checking for
+insufficient size in the skb headroom or tailroom is unnecessary provided that
+the ``needed_headroom`` and ``needed_tailroom`` properties were filled out
+properly, because DSA ensures there is enough space before calling this method.
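
To illustrate what "pushing a frame header" means for a category 2 tagger, here is a userspace sketch that models the packet as a flat buffer. Everything in it is an assumption made for illustration: ``struct pkt``, ``tag_xmit()``, the 4-octet tag length and the layout with the egress port in the last tag octet do not correspond to any real tagging protocol or to the kernel's ``struct sk_buff`` API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6 /* octets in one MAC address */
#define TAG_LEN  4 /* hypothetical switch tag length */

/* Toy packet: 'data' models skb->data as an offset into buf. */
struct pkt {
	uint8_t buf[256];
	size_t data; /* offset of the first valid octet (MAC DA) */
	size_t len;  /* number of valid octets starting at 'data' */
};

/* Insert a hypothetical tag between the MAC SA and the EtherType. */
static void tag_xmit(struct pkt *p, uint8_t port)
{
	uint8_t *old_hdr, *new_hdr, *tag;

	assert(p->data >= TAG_LEN);      /* headroom was reserved beforehand */
	assert(p->len >= 2 * ETH_ALEN);  /* at least MAC DA and MAC SA */

	old_hdr = p->buf + p->data;
	p->data -= TAG_LEN;              /* grow the packet towards the head */
	p->len += TAG_LEN;
	new_hdr = p->buf + p->data;

	/* Move MAC DA and MAC SA to the new start, opening a gap for the tag. */
	memmove(new_hdr, old_hdr, 2 * ETH_ALEN);

	tag = new_hdr + 2 * ETH_ALEN;
	tag[0] = 0;
	tag[1] = 0;
	tag[2] = 0;
	tag[3] = port;                   /* egress port, per the assumed layout */
}
```

The original EtherType and payload stay where they were in memory; only the two MAC addresses move, which is why this category keeps the MAC DA and MAC SA at the start of the frame from the conduit's perspective.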
+
+The reception of a packet goes through the tagger's ``rcv`` function. The
+passed ``struct sk_buff *skb`` has ``skb->data`` pointing at
+``skb_mac_header(skb) + ETH_HLEN`` octets, i.e. to where the first octet after
+the EtherType would have been, were this frame not tagged. The role of this
+method is to consume the frame header, adjust ``skb->data`` to really point at
+the first octet after the EtherType, and to change ``skb->dev`` to point to the
+virtual DSA user network interface corresponding to the physical front-facing
+switch port that the packet was received on.
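
The receive side can be sketched in the same spirit. This userspace model assumes that the data offset has already been advanced past the 14-octet Ethernet header, as ``eth_type_trans()`` does, and it reuses a made-up 4-octet category 2 tag whose last octet carries the source port. ``struct pkt`` and ``tag_rcv()`` are hypothetical; a real tagger would also look up the user network device and assign ``skb->dev``, which is only represented here by the returned port number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6  /* octets in one MAC address */
#define ETH_HLEN 14 /* MAC DA + MAC SA + EtherType */
#define TAG_LEN  4  /* hypothetical switch tag length */

/* Toy packet: 'data' models skb->data as an offset into buf. */
struct pkt {
	uint8_t buf[256];
	size_t data; /* offset of the first unconsumed octet */
	size_t len;  /* number of valid octets starting at 'data' */
};

/* Consume the hypothetical tag and return the source port it carries. */
static int tag_rcv(struct pkt *p)
{
	const uint8_t *tag;
	int port;

	assert(p->data >= ETH_HLEN);
	assert(p->len >= TAG_LEN);

	/* The tag sits where the EtherType would be, 2 octets before data. */
	tag = p->buf + p->data - 2;
	port = tag[3];

	/* Skip the tag so that data points just after the real EtherType. */
	p->data += TAG_LEN;
	p->len -= TAG_LEN;

	/* Close the gap: slide MAC DA and MAC SA up against the EtherType. */
	memmove(p->buf + p->data - ETH_HLEN,
		p->buf + p->data - ETH_HLEN - TAG_LEN, 2 * ETH_ALEN);

	return port;
}
```

After the call, the 14 octets preceding the data offset form an ordinary, untagged Ethernet header again.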
+
+Since tagging protocols in category 1 and 2 break software (and most often also
+hardware) packet dissection on the DSA conduit, features such as RPS (Receive
+Packet Steering) on the DSA conduit would be broken. The DSA framework deals
+with this by hooking into the flow dissector and shifting the offset at which
+the IP header is to be found in the tagged frame as seen by the DSA conduit.
+This behavior is automatic based on the ``needed_headroom`` value of the tagging
+protocol. If the tag length is not the same for all packets, the tagger can implement the
+``flow_dissect`` method of the ``struct dsa_device_ops`` and override this
+default behavior by specifying the correct offset incurred by each individual
+RX packet. Tail taggers do not cause issues to the flow dissector.
+
+Checksum offload should work with category 1 and 2 taggers when the DSA conduit
+driver declares NETIF_F_HW_CSUM in vlan_features and looks at csum_start and
+csum_offset. For those cases, DSA will shift the checksum start and offset by
+the tag size. If the DSA conduit driver still uses the legacy NETIF_F_IP_CSUM
+or NETIF_F_IPV6_CSUM in vlan_features, the offload might only work if the
+offload hardware already expects that specific tag (perhaps due to matching
+vendors). DSA user ports inherit those flags from the conduit, and it is up to
+the driver to correctly fall back to software checksum when the IP header is not
+where the hardware expects. If that check is ineffective, the packets might go
+to the network without a proper checksum (the checksum field will have the
+pseudo IP header sum). For category 3, when the offload hardware does not
+already expect the switch tag in use, the checksum must be calculated before any
+tag is inserted (i.e. inside the tagger). Otherwise, the DSA conduit would
+include the tail tag in the (software or hardware) checksum calculation. Then,
+when the tag gets stripped by the switch during transmission, it will leave an
+incorrect IP checksum in place.
+
+Due to various reasons (most common being category 1 taggers being associated
+with DSA-unaware conduits, mangling what the conduit perceives as MAC DA), the
+tagging protocol may require the DSA conduit to operate in promiscuous mode, to
+receive all frames regardless of the value of the MAC DA. This can be done by
+setting the ``promisc_on_conduit`` property of the ``struct dsa_device_ops``.
+Note that this assumes a DSA-unaware conduit driver, which is the norm.
+
+Conduit network devices
+-----------------------
+
+Conduit network devices are regular, unmodified Linux network device drivers for
the CPU/management Ethernet interface. Such a driver might occasionally need to
know whether DSA is enabled (e.g.: to enable/disable specific offload features),
but the DSA subsystem has been proven to work with industry standard drivers:
@@ -95,14 +236,14 @@ Ethernet switch.
Networking stack hooks
----------------------
-When a master netdev is used with DSA, a small hook is placed in the
+When a conduit netdev is used with DSA, a small hook is placed in the
networking stack in order to have the DSA subsystem process the Ethernet
switch specific tagging protocol. DSA accomplishes this by registering a
specific (and fake) Ethernet type (later becoming ``skb->protocol``) with the
networking stack; this is also known as a ``ptype`` or ``packet_type``. A typical
Ethernet Frame receive sequence looks like this:
-Master network device (e.g.: e1000e):
+Conduit network device (e.g.: e1000e):
1. Receive interrupt fires:
@@ -132,16 +273,16 @@ Master network device (e.g.: e1000e):
- inspect and strip switch tag protocol to determine originating port
- locate per-port network device
- - invoke ``eth_type_trans()`` with the DSA slave network device
+ - invoke ``eth_type_trans()`` with the DSA user network device
- invoke ``netif_receive_skb()``
-Past this point, the DSA slave network devices get delivered regular Ethernet
+Past this point, the DSA user network devices get delivered regular Ethernet
frames that can be processed by the networking stack.
-Slave network devices
----------------------
+User network devices
+--------------------
-Slave network devices created by DSA are stacked on top of their master network
+User network devices created by DSA are stacked on top of their conduit network
device, each of these network interfaces will be responsible for being a
controlling and data-flowing end-point for each front-panel port of the switch.
These interfaces are specialized in order to:
@@ -150,21 +291,35 @@ These interfaces are specialized in order to:
to/from specific switch ports
- query the switch for ethtool operations: statistics, link state,
Wake-on-LAN, register dumps...
-- external/internal PHY management: link, auto-negotiation etc.
+- manage external/internal PHY: link, auto-negotiation, etc.
-These slave network devices have custom net_device_ops and ethtool_ops function
+These user network devices have custom net_device_ops and ethtool_ops function
pointers which allow DSA to introduce a level of layering between the networking
-stack/ethtool, and the switch driver implementation.
+stack/ethtool and the switch driver implementation.
-Upon frame transmission from these slave network devices, DSA will look up which
-switch tagging protocol is currently registered with these network devices, and
+Upon frame transmission from these user network devices, DSA will look up which
+switch tagging protocol is currently registered with these network devices and
invoke a specific transmit routine which takes care of adding the relevant
switch tag in the Ethernet frames.
-These frames are then queued for transmission using the master network device
-``ndo_start_xmit()`` function, since they contain the appropriate switch tag, the
+These frames are then queued for transmission using the conduit network device
+``ndo_start_xmit()`` function. Since they contain the appropriate switch tag, the
Ethernet switch will be able to process these incoming frames from the
-management interface and delivers these frames to the physical switch port.
+management interface and deliver them to the physical switch port.
+
+When using multiple CPU ports, it is possible to stack a LAG (bonding/team)
+device between the DSA user devices and the physical DSA conduits. The LAG
+device is thus also a DSA conduit, but the LAG slave devices continue to be DSA
+conduits as well (just with no user port assigned to them; this is needed for
+recovery in case the LAG DSA conduit disappears). Thus, the data path of the LAG
+DSA conduit is used asymmetrically. On RX, the ``ETH_P_XDSA`` handler, which
+calls ``dsa_switch_rcv()``, is invoked early (on the physical DSA conduit;
+LAG slave). Therefore, the RX data path of the LAG DSA conduit is not used.
+On the other hand, TX takes place linearly: ``dsa_user_xmit`` calls
+``dsa_enqueue_skb``, which calls ``dev_queue_xmit`` towards the LAG DSA conduit.
+The latter calls ``dev_queue_xmit`` towards one physical DSA conduit or the
+other, and in both cases, the packet exits the system through a hardware path
+towards the switch.
Graphical representation
------------------------
@@ -172,37 +327,48 @@ Graphical representation
Summarized, this is basically what DSA looks like from a network device
perspective::
-
- |---------------------------
- | CPU network device (eth0)|
- ----------------------------
- | <tag added by switch |
- | |
- | |
- | tag added by CPU> |
- |--------------------------------------------|
- | Switch driver |
- |--------------------------------------------|
- || || ||
- |-------| |-------| |-------|
- | sw0p0 | | sw0p1 | | sw0p2 |
- |-------| |-------| |-------|
-
-
-
-Slave MDIO bus
---------------
-
-In order to be able to read to/from a switch PHY built into it, DSA creates a
-slave MDIO bus which allows a specific switch driver to divert and intercept
+ Unaware application
+ opens and binds socket
+ | ^
+ | |
+ +-----------v--|--------------------+
+ |+------+ +------+ +------+ +------+|
+ || swp0 | | swp1 | | swp2 | | swp3 ||
+ |+------+-+------+-+------+-+------+|
+ | DSA switch driver |
+ +-----------------------------------+
+ | ^
+ Tag added by | | Tag consumed by
+ switch driver | | switch driver
+ v |
+ +-----------------------------------+
+ | Unmodified host interface driver | Software
+ --------+-----------------------------------+------------
+ | Host interface (eth0) | Hardware
+ +-----------------------------------+
+ | ^
+ Tag consumed by | | Tag added by
+ switch hardware | | switch hardware
+ v |
+ +-----------------------------------+
+ | Switch |
+ |+------+ +------+ +------+ +------+|
+ || swp0 | | swp1 | | swp2 | | swp3 ||
+ ++------+-+------+-+------+-+------++
+
+User MDIO bus
+-------------
+
+In order to be able to read from/write to a PHY built into the switch, DSA
+creates a user MDIO bus which allows a specific switch driver to divert and
+intercept
MDIO reads/writes towards specific PHY addresses. In most MDIO-connected
switches, these functions would utilize direct or indirect PHY addressing mode
to return standard MII registers from the switch builtin PHYs, allowing the PHY
library to retrieve link status, link partner pages, auto-negotiation
-results etc..
+results, etc.
-For Ethernet switches which have both external and internal MDIO busses, the
-slave MII bus can be utilized to mux/demux MDIO reads and writes towards either
+For Ethernet switches which have both external and internal MDIO buses, the
+user MII bus can be utilized to mux/demux MDIO reads and writes towards either
internal or external MDIO devices this switch might be connected to: internal
PHYs, external PHYs, or even external switches.
@@ -218,11 +384,11 @@ DSA data structures are defined in ``include/net/dsa.h`` as well as
table indication (when cascading switches)
- ``dsa_platform_data``: platform device configuration data which can reference
- a collection of dsa_chip_data structure if multiples switches are cascaded,
- the master network device this switch tree is attached to needs to be
+ a collection of dsa_chip_data structures if multiple switches are cascaded;
+ the conduit network device this switch tree is attached to needs to be
referenced
-- ``dsa_switch_tree``: structure assigned to the master network device under
+- ``dsa_switch_tree``: structure assigned to the conduit network device under
``dsa_ptr``, this structure references a dsa_platform_data structure as well as
the tagging protocol supported by the switch tree, and which receive/transmit
function hooks should be invoked, information about the directly attached
@@ -230,7 +396,7 @@ DSA data structures are defined in ``include/net/dsa.h`` as well as
referenced to address individual switches in the tree.
- ``dsa_switch``: structure describing a switch device in the tree, referencing
- a ``dsa_switch_tree`` as a backpointer, slave network devices, master network
+ a ``dsa_switch_tree`` as a backpointer, user network devices, conduit network
device, and a reference to the backing ``dsa_switch_ops``
- ``dsa_switch_ops``: structure referencing function pointers, see below for a
@@ -239,18 +405,10 @@ DSA data structures are defined in ``include/net/dsa.h`` as well as
Design limitations
==================
-Limits on the number of devices and ports
------------------------------------------
-
-DSA currently limits the number of maximum switches within a tree to 4
-(``DSA_MAX_SWITCHES``), and the number of ports per switch to 12 (``DSA_MAX_PORTS``).
-These limits could be extended to support larger configurations would this need
-arise.
-
Lack of CPU/DSA network devices
-------------------------------
-DSA does not currently create slave network devices for the CPU or DSA ports, as
+DSA does not currently create user network devices for the CPU or DSA ports, as
described before. This might be an issue in the following cases:
- inability to fetch switch CPU port statistics counters using ethtool, which
@@ -265,7 +423,7 @@ described before. This might be an issue in the following cases:
Common pitfalls using DSA setups
--------------------------------
-Once a master network device is configured to use DSA (dev->dsa_ptr becomes
+Once a conduit network device is configured to use DSA (dev->dsa_ptr becomes
non-NULL), and the switch behind it expects a tagging protocol, this network
interface can only exclusively be used as a conduit interface. Sending packets
directly through this interface (e.g.: opening a socket using this interface)
@@ -273,10 +431,6 @@ will not make us go through the switch tagging protocol transmit function, so
the Ethernet switch on the other end, expecting a tag, will typically drop this
frame.
-Slave network devices check that the master network device is UP before allowing
-you to administratively bring UP these slave network devices. A common
-configuration mistake is forgetting to bring UP the master network device first.
-
Interactions with other subsystems
==================================
@@ -285,11 +439,12 @@ DSA currently leverages the following subsystems:
- MDIO/PHY library: ``drivers/net/phy/phy.c``, ``mdio_bus.c``
- Switchdev: ``net/switchdev/*``
- Device Tree for various of_* functions
+- Devlink: ``net/core/devlink.c``
MDIO/PHY library
----------------
-Slave network devices exposed by DSA may or may not be interfacing with PHY
+User network devices exposed by DSA may or may not be interfacing with PHY
devices (``struct phy_device`` as defined in ``include/linux/phy.h``), but the DSA
subsystem deals with all possible combinations:
@@ -299,20 +454,20 @@ subsystem deals with all possible combinations:
- special, non-autonegotiated or non MDIO-managed PHY devices: SFPs, MoCA; a.k.a
fixed PHYs
-The PHY configuration is done by the ``dsa_slave_phy_setup()`` function and the
+The PHY configuration is done by the ``dsa_user_phy_setup()`` function and the
logic basically looks like this:
- if Device Tree is used, the PHY device is looked up using the standard
"phy-handle" property, if found, this PHY device is created and registered
using ``of_phy_connect()``
-- if Device Tree is used, and the PHY device is "fixed", that is, conforms to
+- if Device Tree is used and the PHY device is "fixed", that is, conforms to
the definition of a non-MDIO managed PHY as defined in
``Documentation/devicetree/bindings/net/fixed-link.txt``, the PHY is registered
and connected transparently using the special fixed MDIO bus driver
- finally, if the PHY is built into the switch, as is very common with
- standalone switch packages, the PHY is probed using the slave MII bus created
+ standalone switch packages, the PHY is probed using the user MII bus created
by DSA
@@ -321,14 +476,39 @@ SWITCHDEV
DSA directly utilizes SWITCHDEV when interfacing with the bridge layer, and
more specifically with its VLAN filtering portion when configuring VLANs on top
-of per-port slave network devices. Since DSA primarily deals with
-MDIO-connected switches, although not exclusively, SWITCHDEV's
-prepare/abort/commit phases are often simplified into a prepare phase which
-checks whether the operation is supported by the DSA switch driver, and a commit
-phase which applies the changes.
-
-As of today, the only SWITCHDEV objects supported by DSA are the FDB and VLAN
-objects.
+of per-port user network devices. As of today, the only SWITCHDEV objects
+supported by DSA are the FDB and VLAN objects.
+
+Devlink
+-------
+
+DSA registers one devlink device per physical switch in the fabric.
+For each devlink device, every physical port (i.e. user ports, CPU ports, DSA
+links or unused ports) is exposed as a devlink port.
+
+DSA drivers can make use of the following devlink features:
+
+- Regions: debugging feature which allows user space to dump driver-defined
+ areas of hardware information in a low-level, binary format. Both global
+ regions as well as per-port regions are supported. It is possible to export
+ devlink regions even for pieces of data that are already exposed in some way
+ to the standard iproute2 user space programs (ip-link, bridge), like address
+ tables and VLAN tables. For example, this might be useful if the tables
+ contain additional hardware-specific details which are not visible through
+ the iproute2 abstraction, or it might be useful to inspect these tables on
+ the non-user ports too, which are invisible to iproute2 because no network
+ interface is registered for them.
+- Params: a feature which enables users to configure certain low-level tunable
+ knobs pertaining to the device. Drivers may implement applicable generic
+ devlink params, or may add new device-specific devlink params.
+- Resources: a monitoring feature which enables users to see the degree of
+ utilization of certain hardware tables in the device, such as FDB, VLAN, etc.
+- Shared buffers: a QoS feature for adjusting and partitioning memory and frame
+ reservations per port and per traffic class, in the ingress and egress
+ directions, such that low-priority bulk traffic does not impede the
+ processing of high-priority critical traffic.
+
+For more details, consult ``Documentation/networking/devlink/``.
Device Tree
-----------
@@ -336,35 +516,117 @@ Device Tree
DSA features a standardized binding which is documented in
``Documentation/devicetree/bindings/net/dsa/dsa.txt``. PHY/MDIO library helper
functions such as ``of_get_phy_mode()``, ``of_phy_connect()`` are also used to query
-per-port PHY specific details: interface connection, MDIO bus location etc..
+per-port PHY specific details: interface connection, MDIO bus location, etc.
Driver development
==================
-DSA switch drivers need to implement a dsa_switch_ops structure which will
+DSA switch drivers need to implement a ``dsa_switch_ops`` structure which will
contain the various members described below.
-``register_switch_driver()`` registers this dsa_switch_ops in its internal list
-of drivers to probe for. ``unregister_switch_driver()`` does the exact opposite.
+Probing, registration and device lifetime
+-----------------------------------------
+
+DSA switches are regular ``device`` structures on buses (be they platform, SPI,
+I2C, MDIO or otherwise). The DSA framework is not involved in their probing
+with the device core.
+
+Switch registration from the perspective of a driver means passing a valid
+``struct dsa_switch`` pointer to ``dsa_register_switch()``, usually from the
+switch driver's probing function. The following members must be valid in the
+provided structure:
+
+- ``ds->dev``: will be used to parse the switch's OF node or platform data.
+
+- ``ds->num_ports``: will be used to create the port list for this switch, and
+ to validate the port indices provided in the OF node.
+
+- ``ds->ops``: a pointer to the ``dsa_switch_ops`` structure holding the DSA
+ method implementations.
+
+- ``ds->priv``: backpointer to a driver-private data structure which can be
+ retrieved in all further DSA method callbacks.
+
+In addition, the following flags in the ``dsa_switch`` structure may optionally
+be configured to obtain driver-specific behavior from the DSA core. Their
+behavior when set is documented through comments in ``include/net/dsa.h``.
+
+- ``ds->vlan_filtering_is_global``
+
+- ``ds->needs_standalone_vlan_filtering``
-Unless requested differently by setting the priv_size member accordingly, DSA
-does not allocate any driver private context space.
+- ``ds->configure_vlan_while_not_filtering``
+
+- ``ds->untag_bridge_pvid``
+
+- ``ds->assisted_learning_on_cpu_port``
+
+- ``ds->mtu_enforcement_ingress``
+
+- ``ds->fdb_isolation``
+
+Internally, DSA keeps an array of switch trees (group of switches) global to
+the kernel, and attaches a ``dsa_switch`` structure to a tree on registration.
+The tree ID to which the switch is attached is determined by the first u32
+number of the ``dsa,member`` property of the switch's OF node (0 if missing).
+The switch ID within the tree is determined by the second u32 number of the
+same OF property (0 if missing). Registering multiple switches with the same
+switch ID and tree ID is illegal and will cause an error. When using platform
+data, only a single switch and a single switch tree are permitted.
+
+In case of a tree with multiple switches, probing takes place asymmetrically.
+The first N-1 callers of ``dsa_register_switch()`` only add their ports to the
+port list of the tree (``dst->ports``), each port having a backpointer to its
+associated switch (``dp->ds``). Then, these switches exit their
+``dsa_register_switch()`` call early, because ``dsa_tree_setup_routing_table()``
+has determined that the tree is not yet complete (not all ports referenced by
+DSA links are present in the tree's port list). The tree becomes complete when
+the last switch calls ``dsa_register_switch()``, and this triggers the effective
+continuation of initialization (including the call to ``ds->ops->setup()``) for
+all switches within that tree, all as part of the calling context of the last
+switch's probe function.
+
+The opposite of registration takes place when calling ``dsa_unregister_switch()``,
+which removes a switch's ports from the port list of the tree. The entire tree
+is torn down when the first switch unregisters.
+
+It is mandatory for DSA switch drivers to implement the ``shutdown()`` callback
+of their respective bus, and call ``dsa_switch_shutdown()`` from it (a minimal
+version of the full teardown performed by ``dsa_unregister_switch()``).
+The reason is that DSA keeps a reference on the conduit net device, and if the
+driver for the conduit device decides to unbind on shutdown, DSA's reference
+will block that operation from finalizing.
+
+Either ``dsa_switch_shutdown()`` or ``dsa_unregister_switch()`` must be called,
+but not both, and the device driver model permits the bus' ``remove()`` method
+to be called even if ``shutdown()`` was already called. Therefore, drivers are
+expected to implement a mutual exclusion method between ``remove()`` and
+``shutdown()`` by setting their drvdata to NULL after any of these has run, and
+checking whether the drvdata is NULL before proceeding to take any action.
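
The mutual exclusion described above can be sketched as follows. This is a userspace mock rather than driver code: ``struct device``, ``struct dsa_switch`` and the two DSA functions are reduced to stand-ins that only count calls, and ``foo_priv``, ``foo_remove()`` and ``foo_shutdown()`` are hypothetical driver names. A real driver would use its bus-specific callbacks together with ``dev_get_drvdata()`` and ``dev_set_drvdata()``.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for kernel structures, reduced to what the sketch needs. */
struct device { void *drvdata; };
struct dsa_switch { int unregistered; int shut_down; };

/* Hypothetical driver-private data. */
struct foo_priv { struct dsa_switch ds; };

/* Mock DSA core entry points: they only count how often they are called. */
static void dsa_unregister_switch(struct dsa_switch *ds)
{
	assert(ds != NULL);
	ds->unregistered++;
}

static void dsa_switch_shutdown(struct dsa_switch *ds)
{
	assert(ds != NULL);
	ds->shut_down++;
}

static void foo_remove(struct device *dev)
{
	struct foo_priv *priv = dev->drvdata;

	if (!priv) /* shutdown() already ran: nothing left to do */
		return;

	dsa_unregister_switch(&priv->ds);
	dev->drvdata = NULL;
}

static void foo_shutdown(struct device *dev)
{
	struct foo_priv *priv = dev->drvdata;

	if (!priv) /* remove() already ran: nothing left to do */
		return;

	dsa_switch_shutdown(&priv->ds);
	dev->drvdata = NULL;
}
```

Whichever callback runs first performs its teardown and clears the drvdata, so the other one returns without touching the switch.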
+
+After ``dsa_switch_shutdown()`` or ``dsa_unregister_switch()`` was called, no
+further callbacks via the provided ``dsa_switch_ops`` may take place, and the
+driver may free the data structures associated with the ``dsa_switch``.
Switch configuration
--------------------
-- ``tag_protocol``: this is to indicate what kind of tagging protocol is supported,
- should be a valid value from the ``dsa_tag_protocol`` enum
+- ``get_tag_protocol``: this is to indicate what kind of tagging protocol is
+ supported, should be a valid value from the ``dsa_tag_protocol`` enum.
+ The returned information does not have to be static; the driver is passed the
+ CPU port number, as well as the tagging protocol of a possibly stacked
+ upstream switch, in case there are hardware limitations in terms of supported
+ tag formats.
-- ``probe``: probe routine which will be invoked by the DSA platform device upon
- registration to test for the presence/absence of a switch device. For MDIO
- devices, it is recommended to issue a read towards internal registers using
- the switch pseudo-PHY and return whether this is a supported device. For other
- buses, return a non-NULL string
+- ``change_tag_protocol``: when the default tagging protocol has compatibility
+ problems with the conduit or other issues, the driver may support changing it
+ at runtime, either through a device tree property or through sysfs. In that
+ case, further calls to ``get_tag_protocol`` should report the protocol in
+ current use.
- ``setup``: setup function for the switch, this function is responsible for setting
up the ``dsa_switch_ops`` private structure with all it needs: register maps,
- interrupts, mutexes, locks etc.. This function is also expected to properly
+ interrupts, mutexes, locks, etc. This function is also expected to properly
configure the switch to separate all network interfaces from each other, that
is, they should be isolated by the switch hardware itself, typically by creating
a Port-based VLAN ID for each port and allowing only the CPU port and the
@@ -373,7 +635,35 @@ Switch configuration
fully configured and ready to serve any kind of request. It is recommended
to issue a software reset of the switch during this setup function in order to
avoid relying on what a previous software agent such as a bootloader/firmware
- may have previously configured.
+ may have previously configured. The method responsible for undoing any
+ applicable allocations or operations done here is ``teardown``.
+
+- ``port_setup`` and ``port_teardown``: methods for initialization and
+ destruction of per-port data structures. It is mandatory for some operations
+ such as registering and unregistering devlink port regions to be done from
+ these methods, otherwise they are optional. A port will be torn down only if
+ it has been previously set up. It is possible for a port to be set up during
+ probing only to be torn down immediately afterwards, for example in case its
+ PHY cannot be found. In this case, probing of the DSA switch continues
+ without that particular port.
+
+- ``port_change_conduit``: method through which the affinity (association used
+ for traffic termination purposes) between a user port and a CPU port can be
+ changed. By default all user ports from a tree are assigned to the first
+ available CPU port that makes sense for them (most of the time this means
+ the user ports of a tree are all assigned to the same CPU port, except for H
+ topologies as described in commit 2c0b03258b8b). The ``port`` argument
+ represents the index of the user port, and the ``conduit`` argument represents
+ the new DSA conduit ``net_device``. The CPU port associated with the new
+ conduit can be retrieved by looking at ``struct dsa_port *cpu_dp =
+ conduit->dsa_ptr``. Additionally, the conduit can also be a LAG device where
+ all the slave devices are physical DSA conduits. LAG DSA conduits also have a
+ valid ``conduit->dsa_ptr`` pointer, however this is not unique, but rather a
+ duplicate of the first physical DSA conduit's (LAG slave) ``dsa_ptr``. In case
+ of a LAG DSA conduit, a further call to ``port_lag_join`` will be emitted
+ separately for the physical CPU ports associated with the physical DSA
+ conduits, requesting them to create a hardware LAG associated with the LAG
+ interface.
PHY devices and link management
-------------------------------
@@ -381,19 +671,19 @@ PHY devices and link management
- ``get_phy_flags``: Some switches are interfaced to various kinds of Ethernet PHYs,
if the PHY library PHY driver needs to know about information it cannot obtain
on its own (e.g.: coming from switch memory mapped registers), this function
- should return a 32-bits bitmask of "flags", that is private between the switch
+ should return a 32-bit bitmask of "flags" that is private between the switch
driver and the Ethernet PHY driver in ``drivers/net/phy/\*``.
-- ``phy_read``: Function invoked by the DSA slave MDIO bus when attempting to read
+- ``phy_read``: Function invoked by the DSA user MDIO bus when attempting to read
the switch port MDIO registers. If unavailable, return 0xffff for each read.
For builtin switch Ethernet PHYs, this function should allow reading the link
- status, auto-negotiation results, link partner pages etc..
+ status, auto-negotiation results, link partner pages, etc.
-- ``phy_write``: Function invoked by the DSA slave MDIO bus when attempting to write
+- ``phy_write``: Function invoked by the DSA user MDIO bus when attempting to write
to the switch port MDIO registers. If unavailable return a negative error
code.
-- ``adjust_link``: Function invoked by the PHY library when a slave network device
+- ``adjust_link``: Function invoked by the PHY library when a user network device
is attached to a PHY device. This function is responsible for appropriately
configuring the switch port link parameters: speed, duplex, pause based on
what the ``phy_device`` is providing.
@@ -409,17 +699,17 @@ Ethtool operations
------------------
- ``get_strings``: ethtool function used to query the driver's strings, will
- typically return statistics strings, private flags strings etc.
+ typically return statistics strings, private flags strings, etc.
- ``get_ethtool_stats``: ethtool function used to query per-port statistics and
- return their values. DSA overlays slave network devices general statistics:
+ return their values. DSA overlays user network devices general statistics:
RX/TX counters from the network device, with switch driver specific statistics
per port
- ``get_sset_count``: ethtool function used to query the number of statistics items
- ``get_wol``: ethtool function used to obtain Wake-on-LAN settings per-port, this
- function may, for certain implementations also query the master network device
+ function may for certain implementations also query the conduit network device
Wake-on-LAN settings if this interface needs to participate in Wake-on-LAN
- ``set_wol``: ethtool function used to configure Wake-on-LAN settings per-port,
@@ -461,38 +751,227 @@ Power management
should resume all Ethernet switch activities and re-configure the switch to be
in a fully active state
-- ``port_enable``: function invoked by the DSA slave network device ndo_open
- function when a port is administratively brought up, this function should be
- fully enabling a given switch port. DSA takes care of marking the port with
+- ``port_enable``: function invoked by the DSA user network device ndo_open
+ function when a port is administratively brought up, this function should
+ fully enable a given switch port. DSA takes care of marking the port with
``BR_STATE_BLOCKING`` if the port is a bridge member, or ``BR_STATE_FORWARDING`` if it
was not, and propagating these changes down to the hardware
-- ``port_disable``: function invoked by the DSA slave network device ndo_close
- function when a port is administratively brought down, this function should be
- fully disabling a given switch port. DSA takes care of marking the port with
+- ``port_disable``: function invoked by the DSA user network device ndo_close
+ function when a port is administratively brought down, this function should
+ fully disable a given switch port. DSA takes care of marking the port with
``BR_STATE_DISABLED`` and propagating changes to the hardware if this port is
disabled while being a bridge member
+Address databases
+-----------------
+
+Switching hardware is expected to have a table for FDB entries, however not all
+of them are active at the same time. An address database is the subset (partition)
+of FDB entries that is active (can be matched by address learning on RX, or FDB
+lookup on TX) depending on the state of the port. An address database may
+occasionally be called "FID" (Filtering ID) in this document, although the
+underlying implementation may choose whatever is available to the hardware.
+
+For example, all ports that belong to a VLAN-unaware bridge (which is
+*currently* VLAN-unaware) are expected to learn source addresses in the
+database associated by the driver with that bridge (and not with other
+VLAN-unaware bridges). During forwarding and FDB lookup, a packet received on a
+VLAN-unaware bridge port should be able to find a VLAN-unaware FDB entry having
+the same MAC DA as the packet, which is present on another port member of the
+same bridge. At the same time, the FDB lookup process must be able to not find
+an FDB entry having the same MAC DA as the packet, if that entry points towards
+a port which is a member of a different VLAN-unaware bridge (and is therefore
+associated with a different address database).
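
As a rough illustration of the lookup behavior described above, the sketch below models an FDB as a flat table in which every entry is tagged with the address database (FID) it belongs to. The structure layout, the function name and the linear search are assumptions made for the example; real hardware typically uses hashed or TCAM-based tables.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical FDB entry: a MAC address valid only within one database. */
struct example_fdb_entry {
	unsigned int fid;	/* address database the entry belongs to */
	uint8_t mac[6];		/* destination MAC address */
	int port;		/* egress port the entry points to */
};

/*
 * Look up a MAC DA in the database 'fid' only. Returns the egress port,
 * or -1 if no entry matches (the packet would then be flooded).
 */
static int example_fdb_lookup(const struct example_fdb_entry *table,
			      size_t num_entries, unsigned int fid,
			      const uint8_t mac[6])
{
	size_t i;

	for (i = 0; i < num_entries; i++) {
		/* Entries of other databases must never match. */
		if (table[i].fid != fid)
			continue;
		if (!memcmp(table[i].mac, mac, 6))
			return table[i].port;
	}

	return -1;
}
```

With one FID per VLAN-unaware bridge, an entry learned in one bridge is invisible to lookups performed on behalf of ports of another bridge, even when the MAC DA is identical.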
+
+Similarly, each VLAN of each offloaded VLAN-aware bridge should have an
+associated address database, which is shared by all ports which are members of
+that VLAN, but not shared by ports belonging to different bridges that are
+members of the same VID.
+
+In this context, a VLAN-unaware database means that all packets are expected to
+match on it irrespective of VLAN ID (only MAC address lookup), whereas a
+VLAN-aware database means that packets are supposed to match based on the VLAN
+ID from the classified 802.1Q header (or the pvid if untagged).
+
+At the bridge layer, VLAN-unaware FDB entries have the special VID value of 0,
+whereas VLAN-aware FDB entries have non-zero VID values. Note that a
+VLAN-unaware bridge may have VLAN-aware (non-zero VID) FDB entries, and a
+VLAN-aware bridge may have VLAN-unaware FDB entries. As in hardware, the
+software bridge keeps separate address databases, and offloads to hardware the
+FDB entries belonging to these databases, through switchdev, asynchronously
+relative to the moment when the databases become active or inactive.
+
+When a user port operates in standalone mode, its driver should configure it to
+use a separate database called a port private database. This is different from
+the databases described above, and should impede operation as standalone port
+(packet in, packet out to the CPU port) as little as possible. For example,
+on ingress, it should not attempt to learn the MAC SA of ingress traffic, since
+learning is a bridging layer service and this is a standalone port, therefore
+it would consume useless space. With no address learning, the port private
+database should be empty in a naive implementation, and in this case, all
+received packets should be trivially flooded to the CPU port.
+
+DSA (cascade) and CPU ports are also called "shared" ports because they service
+multiple address databases, and the database that a packet should be associated
+to is usually embedded in the DSA tag. This means that the CPU port may
+simultaneously transport packets coming from a standalone port (which were
+classified by hardware in one address database), and from a bridge port (which
+were classified to a different address database).
+
+Switch drivers which satisfy certain criteria are able to optimize the naive
+configuration by removing the CPU port from the flooding domain of the switch,
+and just program the hardware with FDB entries pointing towards the CPU port
+for which it is known that software is interested in those MAC addresses.
+Packets which do not match a known FDB entry will not be delivered to the CPU,
+which will save CPU cycles required for creating an skb just to drop it.
+
+DSA is able to perform host address filtering for the following kinds of
+addresses:
+
+- Primary unicast MAC addresses of ports (``dev->dev_addr``). These are
+ associated with the port private database of the respective user port,
+ and the driver is notified to install them through ``port_fdb_add`` towards
+ the CPU port.
+
+- Secondary unicast and multicast MAC addresses of ports (addresses added
+ through ``dev_uc_add()`` and ``dev_mc_add()``). These are also associated
+ with the port private database of the respective user port.
+
+- Local/permanent bridge FDB entries (``BR_FDB_LOCAL``). These are the MAC
+ addresses of the bridge ports, for which packets must be terminated locally
+ and not forwarded. They are associated with the address database for that
+ bridge.
+
+- Static bridge FDB entries installed towards foreign (non-DSA) interfaces
+ present in the same bridge as some DSA switch ports. These are also
+ associated with the address database for that bridge.
+
+- Dynamically learned FDB entries on foreign interfaces present in the same
+ bridge as some DSA switch ports, only if ``ds->assisted_learning_on_cpu_port``
+ is set to true by the driver. These are associated with the address database
+ for that bridge.
+
+For various operations detailed below, DSA provides a ``dsa_db`` structure
+which can be of the following types:
+
+- ``DSA_DB_PORT``: the FDB (or MDB) entry to be installed or deleted belongs to
+ the port private database of user port ``db->dp``.
+- ``DSA_DB_BRIDGE``: the entry belongs to one of the address databases of bridge
+ ``db->bridge``. Separation between the VLAN-unaware database and the per-VID
+ databases of this bridge is expected to be done by the driver.
+- ``DSA_DB_LAG``: the entry belongs to the address database of LAG ``db->lag``.
+ Note: ``DSA_DB_LAG`` is currently unused and may be removed in the future.
+
+The drivers which act upon the ``dsa_db`` argument in ``port_fdb_add``,
+``port_mdb_add`` etc. should declare ``ds->fdb_isolation`` as true.
+
+DSA associates each offloaded bridge and each offloaded LAG with a one-based ID
+(``struct dsa_bridge :: num``, ``struct dsa_lag :: id``) for the purposes of
+refcounting addresses on shared ports. Drivers may piggyback on DSA's numbering
+scheme (the ID is readable through ``db->bridge.num`` and ``db->lag.id``) or may
+implement their own.
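
One way a driver might piggyback on the DSA numbering scheme is sketched below. The types here are simplified, hypothetical mirrors of ``dsa_db`` (the real structure differs), and the FID layout (one FID per user port, followed by one FID per bridge) is purely an assumption for illustration.

```c
#include <errno.h>

#define EXAMPLE_NUM_PORTS	8	/* hypothetical number of switch ports */

/* Simplified, hypothetical mirror of the database types described above. */
enum example_db_type {
	EXAMPLE_DB_PORT,
	EXAMPLE_DB_BRIDGE,
	EXAMPLE_DB_LAG,
};

struct example_db {
	enum example_db_type type;
	int port_index;			/* valid for EXAMPLE_DB_PORT */
	unsigned int bridge_num;	/* one-based, valid for EXAMPLE_DB_BRIDGE */
};

/*
 * Map an address database to a hardware FID. FIDs 0..EXAMPLE_NUM_PORTS-1
 * hold the port private databases, the following FIDs hold the bridges.
 */
static int example_db_to_fid(const struct example_db *db)
{
	switch (db->type) {
	case EXAMPLE_DB_PORT:
		return db->port_index;
	case EXAMPLE_DB_BRIDGE:
		/* A bridge number of 0 denotes the lack of isolation. */
		if (!db->bridge_num)
			return -EOPNOTSUPP;
		return EXAMPLE_NUM_PORTS + (int)db->bridge_num - 1;
	default:
		/* The LAG database type is currently unused. */
		return -EOPNOTSUPP;
	}
}
```

A VLAN-aware bridge would additionally need the VID to select among its per-VID databases; that dimension is left out of the sketch.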
+
+Only the drivers which declare support for FDB isolation are notified of FDB
+entries on the CPU port belonging to ``DSA_DB_PORT`` databases.
+For compatibility/legacy reasons, ``DSA_DB_BRIDGE`` addresses are notified to
+drivers even if they do not support FDB isolation. However, ``db->bridge.num``
+and ``db->lag.id`` are always set to 0 in that case (to denote the lack of
+isolation, for refcounting purposes).
+
+Note that it is not mandatory for a switch driver to implement physically
+separate address databases for each standalone user port. Since FDB entries in
+the port private databases will always point to the CPU port, there is no risk
+for incorrect forwarding decisions. In this case, all standalone ports may
+share the same database, but the reference counting of host-filtered addresses
+(not deleting the FDB entry for a port's MAC address if it's still in use by
+another port) becomes the responsibility of the driver, because DSA is unaware
+that the port databases are in fact shared. This can be achieved by calling
+``dsa_fdb_present_in_other_db()`` and ``dsa_mdb_present_in_other_db()``.
+The downside is that the RX filtering lists of each user port are in fact
+shared, which means that user port A may accept a packet with a MAC DA it
+shouldn't have, only because that MAC address was in the RX filtering list of
+user port B. These packets will still be dropped in software, however.
+
Bridge layer
------------
+Offloading the bridge forwarding plane is optional and handled by the methods
+below. They may be absent, return -EOPNOTSUPP, or ``ds->max_num_bridges`` may
+be non-zero and exceeded, and in this case, joining a bridge port is still
+possible, but the packet forwarding will take place in software, and the ports
+under a software bridge must remain configured in the same way as for
+standalone operation, i.e. have all bridging service functions (address
+learning etc.) disabled, and send all received packets to the CPU port only.
+
+Concretely, a port starts offloading the forwarding plane of a bridge once it
+returns success to the ``port_bridge_join`` method, and stops doing so after
+``port_bridge_leave`` has been called. Offloading the bridge means autonomously
+learning FDB entries in accordance with the software bridge port's state, and
+autonomously forwarding (or flooding) received packets without CPU intervention.
+This is optional even when offloading a bridge port. Tagging protocol drivers
+are expected to call ``dsa_default_offload_fwd_mark(skb)`` for packets which
+have already been autonomously forwarded in the forwarding domain of the
+ingress switch port. DSA, through ``dsa_port_devlink_setup()``, considers all
+switch ports part of the same tree ID to be part of the same bridge forwarding
+domain (capable of autonomous forwarding to each other).
+
+Offloading the TX forwarding process of a bridge is a distinct concept from
+simply offloading its forwarding plane, and refers to the ability of certain
+driver and tag protocol combinations to transmit a single skb coming from the
+bridge device's transmit function to potentially multiple egress ports (and
+thereby avoid its cloning in software).
+
+Packets for which the bridge requests this behavior are called data plane
+packets and have ``skb->offload_fwd_mark`` set to true in the tag protocol
+driver's ``xmit`` function. Data plane packets are subject to FDB lookup,
+hardware learning on the CPU port, and do not override the port STP state.
+Additionally, replication of data plane packets (multicast, flooding) is
+handled in hardware and the bridge driver will transmit a single skb for each
+packet that may or may not need replication.
+
+When the TX forwarding offload is enabled, the tag protocol driver is
+responsible for injecting packets into the data plane of the hardware towards the
+correct bridging domain (FID) that the port is a part of. The port may be
+VLAN-unaware, and in this case the FID must be equal to the FID used by the
+driver for its VLAN-unaware address database associated with that bridge.
+Alternatively, the bridge may be VLAN-aware, and in that case, it is guaranteed
+that the packet is also VLAN-tagged with the VLAN ID that the bridge processed
+this packet in. It is the responsibility of the hardware to untag the VID on
+the egress-untagged ports, or keep the tag on the egress-tagged ones.
+
- ``port_bridge_join``: bridge layer function invoked when a given switch port is
- added to a bridge, this function should be doing the necessary at the switch
- level to permit the joining port from being added to the relevant logical
+ added to a bridge, this function should do what's necessary at the switch
+ level to permit the joining port to be added to the relevant logical
domain for it to ingress/egress traffic with other members of the bridge.
+ By setting the ``tx_fwd_offload`` argument to true, the TX forwarding process
+ of this bridge is also offloaded.
- ``port_bridge_leave``: bridge layer function invoked when a given switch port is
- removed from a bridge, this function should be doing the necessary at the
+ removed from a bridge, this function should do what's necessary at the
switch level to deny the leaving port from ingress/egress traffic from the
- remaining bridge members. When the port leaves the bridge, it should be aged
- out at the switch hardware for the switch to (re) learn MAC addresses behind
- this port.
+ remaining bridge members.
- ``port_stp_state_set``: bridge layer function invoked when a given switch port STP
state is computed by the bridge layer and should be propagated to switch
- hardware to forward/block/learn traffic. The switch driver is responsible for
- computing a STP state change based on current and asked parameters and perform
- the relevant ageing based on the intersection results
+ hardware to forward/block/learn traffic.
+
+- ``port_bridge_flags``: bridge layer function invoked when a port must
+ configure its settings for e.g. flooding of unknown traffic or source address
+ learning. The switch driver is responsible for initial setup of the
+ standalone ports with address learning disabled and egress flooding of all
+ types of traffic, then the DSA core notifies of any change to the bridge port
+ flags when the port joins and leaves a bridge. DSA does not currently manage
+ the bridge port flags for the CPU port. The assumption is that address
+ learning should be statically enabled (if supported by the hardware) on the
+ CPU port, and flooding towards the CPU port should also be enabled, due to a
+ lack of an explicit address filtering mechanism in the DSA core.
+
+- ``port_fast_age``: bridge layer function invoked when flushing the
+ dynamically learned FDB entries on the port is necessary. This is called when
+ transitioning from an STP state where learning should take place to an STP
+ state where it shouldn't, or when leaving a bridge, or when address learning
+ is turned off via ``port_bridge_flags``.
Bridge VLAN filtering
---------------------
@@ -507,63 +986,139 @@ Bridge VLAN filtering
accept any 802.1Q frames irrespective of their VLAN ID, and untagged frames are
allowed.
-- ``port_vlan_prepare``: bridge layer function invoked when the bridge prepares the
- configuration of a VLAN on the given port. If the operation is not supported
- by the hardware, this function should return ``-EOPNOTSUPP`` to inform the bridge
- code to fallback to a software implementation. No hardware setup must be done
- in this function. See port_vlan_add for this and details.
-
- ``port_vlan_add``: bridge layer function invoked when a VLAN is configured
- (tagged or untagged) for the given switch port
+ (tagged or untagged) for the given switch port. The CPU port becomes a member
+ of a VLAN only if a foreign bridge port is also a member of it (and
+ forwarding needs to take place in software), or the VLAN is installed to the
+ VLAN group of the bridge device itself, for termination purposes
+ (``bridge vlan add dev br0 vid 100 self``). VLANs on shared ports are
+ reference counted and removed when there is no user left. Drivers do not need
+ to manually install a VLAN on the CPU port.
- ``port_vlan_del``: bridge layer function invoked when a VLAN is removed from the
given switch port
-- ``port_vlan_dump``: bridge layer function invoked with a switchdev callback
- function that the driver has to call for each VLAN the given port is a member
- of. A switchdev object is used to carry the VID and bridge flags.
-
- ``port_fdb_add``: bridge layer function invoked when the bridge wants to install a
Forwarding Database entry, the switch hardware should be programmed with the
specified address in the specified VLAN Id in the forwarding database
- associated with this VLAN ID. If the operation is not supported, this
- function should return ``-EOPNOTSUPP`` to inform the bridge code to fallback to
- a software implementation.
-
-.. note:: VLAN ID 0 corresponds to the port private database, which, in the context
- of DSA, would be its port-based VLAN, used by the associated bridge device.
+ associated with this VLAN ID.
- ``port_fdb_del``: bridge layer function invoked when the bridge wants to remove a
Forwarding Database entry, the switch hardware should be programmed to delete
the specified MAC address from the specified VLAN ID if it was mapped into
this port forwarding database
-- ``port_fdb_dump``: bridge layer function invoked with a switchdev callback
- function that the driver has to call for each MAC address known to be behind
- the given port. A switchdev object is used to carry the VID and FDB info.
-
-- ``port_mdb_prepare``: bridge layer function invoked when the bridge prepares the
- installation of a multicast database entry. If the operation is not supported,
- this function should return ``-EOPNOTSUPP`` to inform the bridge code to fallback
- to a software implementation. No hardware setup must be done in this function.
- See ``port_fdb_add`` for this and details.
+- ``port_fdb_dump``: bridge bypass function invoked by ``ndo_fdb_dump`` on the
+ physical DSA port interfaces. Since DSA does not attempt to keep in sync its
+ hardware FDB entries with the software bridge, this method is implemented as
+ a means to view the entries visible on user ports in the hardware database.
+ The entries reported by this function have the ``self`` flag in the output of
+ the ``bridge fdb show`` command.
- ``port_mdb_add``: bridge layer function invoked when the bridge wants to install
- a multicast database entry, the switch hardware should be programmed with the
+ a multicast database entry. The switch hardware should be programmed with the
specified address in the specified VLAN ID in the forwarding database
associated with this VLAN ID.
-.. note:: VLAN ID 0 corresponds to the port private database, which, in the context
- of DSA, would be its port-based VLAN, used by the associated bridge device.
-
- ``port_mdb_del``: bridge layer function invoked when the bridge wants to remove a
multicast database entry, the switch hardware should be programmed to delete
the specified MAC address from the specified VLAN ID if it was mapped into
this port forwarding database.
-- ``port_mdb_dump``: bridge layer function invoked with a switchdev callback
- function that the driver has to call for each MAC address known to be behind
- the given port. A switchdev object is used to carry the VID and MDB info.
+Link aggregation
+----------------
+
+Link aggregation is implemented in the Linux networking stack by the bonding
+and team drivers, which are modeled as virtual, stackable network interfaces.
+DSA is capable of offloading a link aggregation group (LAG) to hardware that
+supports the feature, and supports bridging between physical ports and LAGs,
+as well as between LAGs. A bonding/team interface which holds multiple physical
+ports constitutes a logical port, although DSA has no explicit concept of a
+logical port at the moment. Due to this, events where a LAG joins/leaves a
+bridge are treated as if all individual physical ports that are members of that
+LAG join/leave the bridge. Switchdev port attributes (VLAN filtering, STP
+state, etc.) and objects (VLANs, MDB entries) offloaded to a LAG as bridge port
+are treated similarly: DSA offloads the same switchdev object / port attribute
+on all members of the LAG. Static bridge FDB entries on a LAG are not yet
+supported, since the DSA driver API does not have the concept of a logical port
+ID.
+
+- ``port_lag_join``: function invoked when a given switch port is added to a
+ LAG. The driver may return ``-EOPNOTSUPP``, and in this case, DSA will fall
+ back to a software implementation where all traffic from this port is sent to
+ the CPU.
+- ``port_lag_leave``: function invoked when a given switch port leaves a LAG
+ and returns to operation as a standalone port.
+- ``port_lag_change``: function invoked when the link state of any member of
+ the LAG changes, and the hashing function needs rebalancing to only make use
+ of the subset of physical LAG member ports that are up.
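
The rebalancing requested through ``port_lag_change`` could look roughly like the sketch below, assuming hardware that distributes traffic over a fixed number of hash buckets, each pointing to one LAG member port. The bucket count, the function name and the round-robin policy are all assumptions made for the example, not a description of any particular switch.

```c
#include <stdbool.h>
#include <stddef.h>

#define EXAMPLE_NUM_BUCKETS	8	/* hypothetical hash table size */

/*
 * Spread the hash buckets round-robin over the LAG members whose link is
 * up. Buckets are set to -1 when no member is up. Returns the number of
 * active members.
 */
static size_t example_lag_rebalance(const int *ports, const bool *link_up,
				    size_t num_members,
				    int buckets[EXAMPLE_NUM_BUCKETS])
{
	size_t num_active = 0, next = 0, b, i;

	for (i = 0; i < num_members; i++)
		if (link_up[i])
			num_active++;

	for (b = 0; b < EXAMPLE_NUM_BUCKETS; b++) {
		if (!num_active) {
			buckets[b] = -1;
			continue;
		}

		/* Skip members that are down; at least one member is up. */
		while (!link_up[next % num_members])
			next++;

		buckets[b] = ports[next % num_members];
		next++;
	}

	return num_active;
}
```

When a member's link goes down, rerunning the function leaves no bucket pointing at it, so the traffic previously hashed to that port moves to the remaining members.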
+
+Drivers that benefit from having an ID associated with each offloaded LAG
+can optionally populate ``ds->num_lag_ids`` from the ``dsa_switch_ops::setup``
+method. The LAG ID associated with a bonding/team interface can then be
+retrieved by a DSA switch driver using the ``dsa_lag_id`` function.
+
+IEC 62439-2 (MRP)
+-----------------
+
+The Media Redundancy Protocol is a topology management protocol optimized for
+fast fault recovery time for ring networks, which has some components
+implemented as a function of the bridge driver. MRP uses management PDUs
+(Test, Topology, LinkDown/Up, Option) sent at a multicast destination MAC
+address range of 01:15:4e:00:00:0x and with an EtherType of 0x88e3.
+Depending on the node's role in the ring (MRM: Media Redundancy Manager,
+MRC: Media Redundancy Client, MRA: Media Redundancy Automanager), certain MRP
+PDUs might need to be terminated locally and others might need to be forwarded.
+An MRM might also benefit from offloading to hardware the creation and
+transmission of certain MRP PDUs (Test).
+
+Normally an MRP instance can be created on top of any network interface,
+however in the case of a device with an offloaded data path such as DSA, it is
+necessary for the hardware, even if it is not MRP-aware, to be able to extract
+the MRP PDUs from the fabric before the driver can proceed with the software
+implementation. DSA today has no driver which is MRP-aware, therefore it only
+listens for the bare minimum switchdev objects required for the software assist
+to work properly. The operations are detailed below.
+
+- ``port_mrp_add`` and ``port_mrp_del``: notifies driver when an MRP instance
+ with a certain ring ID, priority, primary port and secondary port is
+ created/deleted.
+- ``port_mrp_add_ring_role`` and ``port_mrp_del_ring_role``: function invoked
+ when an MRP instance changes ring roles between MRM or MRC. This affects
+ which MRP PDUs should be trapped to software and which should be autonomously
+ forwarded.
+
+IEC 62439-3 (HSR/PRP)
+---------------------
+
+The Parallel Redundancy Protocol (PRP) is a network redundancy protocol which
+works by duplicating and sequence numbering packets through two independent L2
+networks (which are unaware of the PRP tail tags carried in the packets), and
+eliminating the duplicates at the receiver. The High-availability Seamless
+Redundancy (HSR) protocol is similar in concept, except all nodes that carry
+the redundant traffic are aware of the fact that it is HSR-tagged (because HSR
+uses a header with an EtherType of 0x892f) and are physically connected in a
+ring topology. Both HSR and PRP use supervision frames for monitoring the
+health of the network and for discovery of other nodes.
+
+In Linux, both HSR and PRP are implemented in the hsr driver, which
+instantiates a virtual, stackable network interface with two member ports.
+The driver only implements the basic roles of DANH (Doubly Attached Node
+implementing HSR) and DANP (Doubly Attached Node implementing PRP); the roles
+of RedBox and QuadBox are not implemented (therefore, bridging a hsr network
+interface with a physical switch port does not produce the expected result).
+
+A driver which is capable of offloading certain functions of a DANP or DANH should
+declare the corresponding netdev features as indicated by the documentation at
+``Documentation/networking/netdev-features.rst``. Additionally, the following
+methods must be implemented:
+
+- ``port_hsr_join``: function invoked when a given switch port is added to a
+ DANP/DANH. The driver may return ``-EOPNOTSUPP`` and in this case, DSA will
+ fall back to a software implementation where all traffic from this port is
+ sent to the CPU.
+- ``port_hsr_leave``: function invoked when a given switch port leaves a
+ DANP/DANH and returns to normal operation as a standalone port.
TODO
====
@@ -576,12 +1131,3 @@ capable hardware, but does not enforce a strict switch device driver model. On
the other hand, DSA enforces a fairly strict device driver model, and deals with most
of the switch specifics. At some point we should envision a merger between these
two subsystems and get the best of both worlds.
-
-Other hanging fruits
---------------------
-
-- making the number of ports fully dynamic and not dependent on ``DSA_MAX_PORTS``
-- allowing more than one CPU/management interface:
- http://comments.gmane.org/gmane.linux.network/365657
-- porting more drivers from other vendors:
- http://comments.gmane.org/gmane.linux.network/365510
diff --git a/Documentation/networking/dsa/lan9303.rst b/Documentation/networking/dsa/lan9303.rst
index e3c820db28ad..ab81b4e0139e 100644
--- a/Documentation/networking/dsa/lan9303.rst
+++ b/Documentation/networking/dsa/lan9303.rst
@@ -4,7 +4,7 @@ LAN9303 Ethernet switch driver
The LAN9303 is a three port 10/100 Mbps ethernet switch with integrated phys for
the two external ethernet ports. The third port is an RMII/MII interface to a
-host master network interface (e.g. fixed link).
+host conduit network interface (e.g. fixed link).
Driver details
diff --git a/Documentation/networking/dsa/sja1105.rst b/Documentation/networking/dsa/sja1105.rst
index 7395a33baaf9..8ab60eef07d4 100644
--- a/Documentation/networking/dsa/sja1105.rst
+++ b/Documentation/networking/dsa/sja1105.rst
@@ -5,7 +5,7 @@ NXP SJA1105 switch driver
Overview
========
-The NXP SJA1105 is a family of 6 devices:
+The NXP SJA1105 is a family of 10 SPI-managed automotive switches:
- SJA1105E: First generation, no TTEthernet
- SJA1105T: First generation, TTEthernet
@@ -13,9 +13,11 @@ The NXP SJA1105 is a family of 6 devices:
- SJA1105Q: Second generation, TTEthernet, no SGMII
- SJA1105R: Second generation, no TTEthernet, SGMII
- SJA1105S: Second generation, TTEthernet, SGMII
-
-These are SPI-managed automotive switches, with all ports being gigabit
-capable, and supporting MII/RMII/RGMII and optionally SGMII on one port.
+- SJA1110A: Third generation, TTEthernet, SGMII, integrated 100base-T1 and
+ 100base-TX PHYs
+- SJA1110B: Third generation, TTEthernet, SGMII, 100base-T1, 100base-TX
+- SJA1110C: Third generation, TTEthernet, SGMII, 100base-T1, 100base-TX
+- SJA1110D: Third generation, TTEthernet, SGMII, 100base-T1
Being automotive parts, their configuration interface is geared towards
set-and-forget use, with minimal dynamic interaction at runtime. They
@@ -63,199 +65,6 @@ If that changed setting can be transmitted to the switch through the dynamic
reconfiguration interface, it is; otherwise the switch is reset and
reprogrammed with the updated static configuration.
-Traffic support
-===============
-
-The switches do not have hardware support for DSA tags, except for "slow
-protocols" for switch control as STP and PTP. For these, the switches have two
-programmable filters for link-local destination MACs.
-These are used to trap BPDUs and PTP traffic to the master netdevice, and are
-further used to support STP and 1588 ordinary clock/boundary clock
-functionality. For frames trapped to the CPU, source port and switch ID
-information is encoded by the hardware into the frames.
-
-But by leveraging ``CONFIG_NET_DSA_TAG_8021Q`` (a software-defined DSA tagging
-format based on VLANs), general-purpose traffic termination through the network
-stack can be supported under certain circumstances.
-
-Depending on VLAN awareness state, the following operating modes are possible
-with the switch:
-
-- Mode 1 (VLAN-unaware): a port is in this mode when it is used as a standalone
- net device, or when it is enslaved to a bridge with ``vlan_filtering=0``.
-- Mode 2 (fully VLAN-aware): a port is in this mode when it is enslaved to a
- bridge with ``vlan_filtering=1``. Access to the entire VLAN range is given to
- the user through ``bridge vlan`` commands, but general-purpose (anything
- other than STP, PTP etc) traffic termination is not possible through the
- switch net devices. The other packets can be still by user space processed
- through the DSA master interface (similar to ``DSA_TAG_PROTO_NONE``).
-- Mode 3 (best-effort VLAN-aware): a port is in this mode when enslaved to a
- bridge with ``vlan_filtering=1``, and the devlink property of its parent
- switch named ``best_effort_vlan_filtering`` is set to ``true``. When
- configured like this, the range of usable VIDs is reduced (0 to 1023 and 3072
- to 4094), so is the number of usable VIDs (maximum of 7 non-pvid VLANs per
- port*), and shared VLAN learning is performed (FDB lookup is done only by
- DMAC, not also by VID).
-
-To summarize, in each mode, the following types of traffic are supported over
-the switch net devices:
-
-+-------------+-----------+--------------+------------+
-| | Mode 1 | Mode 2 | Mode 3 |
-+=============+===========+==============+============+
-| Regular | Yes | No | Yes |
-| traffic | | (use master) | |
-+-------------+-----------+--------------+------------+
-| Management | Yes | Yes | Yes |
-| traffic | | | |
-| (BPDU, PTP) | | | |
-+-------------+-----------+--------------+------------+
-
-To configure the switch to operate in Mode 3, the following steps can be
-followed::
-
- ip link add dev br0 type bridge
- # swp2 operates in Mode 1 now
- ip link set dev swp2 master br0
- # swp2 temporarily moves to Mode 2
- ip link set dev br0 type bridge vlan_filtering 1
- [ 61.204770] sja1105 spi0.1: Reset switch and programmed static config. Reason: VLAN filtering
- [ 61.239944] sja1105 spi0.1: Disabled switch tagging
- # swp3 now operates in Mode 3
- devlink dev param set spi/spi0.1 name best_effort_vlan_filtering value true cmode runtime
- [ 64.682927] sja1105 spi0.1: Reset switch and programmed static config. Reason: VLAN filtering
- [ 64.711925] sja1105 spi0.1: Enabled switch tagging
- # Cannot use VLANs in range 1024-3071 while in Mode 3.
- bridge vlan add dev swp2 vid 1025 untagged pvid
- RTNETLINK answers: Operation not permitted
- bridge vlan add dev swp2 vid 100
- bridge vlan add dev swp2 vid 101 untagged
- bridge vlan
- port vlan ids
- swp5 1 PVID Egress Untagged
-
- swp2 1 PVID Egress Untagged
- 100
- 101 Egress Untagged
-
- swp3 1 PVID Egress Untagged
-
- swp4 1 PVID Egress Untagged
-
- br0 1 PVID Egress Untagged
- bridge vlan add dev swp2 vid 102
- bridge vlan add dev swp2 vid 103
- bridge vlan add dev swp2 vid 104
- bridge vlan add dev swp2 vid 105
- bridge vlan add dev swp2 vid 106
- bridge vlan add dev swp2 vid 107
- # Cannot use mode than 7 VLANs per port while in Mode 3.
- [ 3885.216832] sja1105 spi0.1: No more free subvlans
-
-\* "maximum of 7 non-pvid VLANs per port": Decoding VLAN-tagged packets on the
-CPU in mode 3 is possible through VLAN retagging of packets that go from the
-switch to the CPU. In cross-chip topologies, the port that goes to the CPU
-might also go to other switches. In that case, those other switches will see
-only a retagged packet (which only has meaning for the CPU). So if they are
-interested in this VLAN, they need to apply retagging in the reverse direction,
-to recover the original value from it. This consumes extra hardware resources
-for this switch. There is a maximum of 32 entries in the Retagging Table of
-each switch device.
-
-As an example, consider this cross-chip topology::
-
- +-------------------------------------------------+
- | Host SoC |
- | +-------------------------+ |
- | | DSA master for embedded | |
- | | switch (non-sja1105) | |
- | +--------+-------------------------+--------+ |
- | | embedded L2 switch | |
- | | | |
- | | +--------------+ +--------------+ | |
- | | |DSA master for| |DSA master for| | |
- | | | SJA1105 1 | | SJA1105 2 | | |
- +--+---+--------------+-----+--------------+---+--+
-
- +-----------------------+ +-----------------------+
- | SJA1105 switch 1 | | SJA1105 switch 2 |
- +-----+-----+-----+-----+ +-----+-----+-----+-----+
- |sw1p0|sw1p1|sw1p2|sw1p3| |sw2p0|sw2p1|sw2p2|sw2p3|
- +-----+-----+-----+-----+ +-----+-----+-----+-----+
-
-To reach the CPU, SJA1105 switch 1 (spi/spi2.1) uses the same port as is uses
-to reach SJA1105 switch 2 (spi/spi2.2), which would be port 4 (not drawn).
-Similarly for SJA1105 switch 2.
-
-Also consider the following commands, that add VLAN 100 to every sja1105 user
-port::
-
- devlink dev param set spi/spi2.1 name best_effort_vlan_filtering value true cmode runtime
- devlink dev param set spi/spi2.2 name best_effort_vlan_filtering value true cmode runtime
- ip link add dev br0 type bridge
- for port in sw1p0 sw1p1 sw1p2 sw1p3 \
- sw2p0 sw2p1 sw2p2 sw2p3; do
- ip link set dev $port master br0
- done
- ip link set dev br0 type bridge vlan_filtering 1
- for port in sw1p0 sw1p1 sw1p2 sw1p3 \
- sw2p0 sw2p1 sw2p2; do
- bridge vlan add dev $port vid 100
- done
- ip link add link br0 name br0.100 type vlan id 100 && ip link set dev br0.100 up
- ip addr add 192.168.100.3/24 dev br0.100
- bridge vlan add dev br0 vid 100 self
-
- bridge vlan
- port vlan ids
- sw1p0 1 PVID Egress Untagged
- 100
-
- sw1p1 1 PVID Egress Untagged
- 100
-
- sw1p2 1 PVID Egress Untagged
- 100
-
- sw1p3 1 PVID Egress Untagged
- 100
-
- sw2p0 1 PVID Egress Untagged
- 100
-
- sw2p1 1 PVID Egress Untagged
- 100
-
- sw2p2 1 PVID Egress Untagged
- 100
-
- sw2p3 1 PVID Egress Untagged
-
- br0 1 PVID Egress Untagged
- 100
-
-SJA1105 switch 1 consumes 1 retagging entry for each VLAN on each user port
-towards the CPU. It also consumes 1 retagging entry for each non-pvid VLAN that
-it is also interested in, which is configured on any port of any neighbor
-switch.
-
-In this case, SJA1105 switch 1 consumes a total of 11 retagging entries, as
-follows:
-
-- 8 retagging entries for VLANs 1 and 100 installed on its user ports
- (``sw1p0`` - ``sw1p3``)
-- 3 retagging entries for VLAN 100 installed on the user ports of SJA1105
- switch 2 (``sw2p0`` - ``sw2p2``), because it also has ports that are
- interested in it. The VLAN 1 is a pvid on SJA1105 switch 2 and does not need
- reverse retagging.
-
-SJA1105 switch 2 also consumes 11 retagging entries, but organized as follows:
-
-- 7 retagging entries for the bridge VLANs on its user ports (``sw2p0`` -
- ``sw2p3``).
-- 4 retagging entries for VLAN 100 installed on the user ports of SJA1105
- switch 1 (``sw1p0`` - ``sw1p3``).
-
Switching features
==================
@@ -270,7 +79,7 @@ The hardware tags all traffic internally with a port-based VLAN (pvid), or it
decodes the VLAN information from the 802.1Q tag. Advanced VLAN classification
is not possible. Once attributed a VLAN tag, frames are checked against the
port's membership rules and dropped at ingress if they don't match any VLAN.
-This behavior is available when switch ports are enslaved to a bridge with
+This behavior is available when switch ports join a bridge with
``vlan_filtering 1``.
Normally the hardware is not configurable with respect to VLAN awareness, but
@@ -280,33 +89,10 @@ untagged), and therefore this mode is also supported.
Segregating the switch ports in multiple bridges is supported (e.g. 2 + 2), but
all bridges should have the same level of VLAN awareness (either both have
-``vlan_filtering`` 0, or both 1). Also an inevitable limitation of the fact
-that VLAN awareness is global at the switch level is that once a bridge with
-``vlan_filtering`` enslaves at least one switch port, the other un-bridged
-ports are no longer available for standalone traffic termination.
+``vlan_filtering`` 0, or both 1).
Topology and loop detection through STP is supported.
-L2 FDB manipulation (add/delete/dump) is currently possible for the first
-generation devices. Aging time of FDB entries, as well as enabling fully static
-management (no address learning and no flooding of unknown traffic) is not yet
-configurable in the driver.
-
-A special comment about bridging with other netdevices (illustrated with an
-example):
-
-A board has eth0, eth1, swp0@eth1, swp1@eth1, swp2@eth1, swp3@eth1.
-The switch ports (swp0-3) are under br0.
-It is desired that eth0 is turned into another switched port that communicates
-with swp0-3.
-
-If br0 has vlan_filtering 0, then eth0 can simply be added to br0 with the
-intended results.
-If br0 has vlan_filtering 1, then a new br1 interface needs to be created that
-enslaves eth0 and eth1 (the DSA master of the switch ports). This is because in
-this mode, the switch ports beneath br0 are not capable of regular traffic, and
-are only used as a conduit for switchdev operations.
-
Offloads
========
@@ -336,7 +122,7 @@ on egress. Using ``vlan_filtering=1``, the behavior is the other way around:
offloaded flows can be steered to TX queues based on the VLAN PCP, but the DSA
net devices are no longer able to do that. To inject frames into a hardware TX
queue with VLAN awareness active, it is necessary to create a VLAN
-sub-interface on the DSA master port, and send normal (0x8100) VLAN-tagged
+sub-interface on the DSA conduit port, and send normal (0x8100) VLAN-tagged
frames towards the switch, with the VLAN PCP bits set appropriately.
Management traffic (having DMAC 01-80-C2-xx-xx-xx or 01-19-1B-xx-xx-xx) is the
@@ -507,10 +293,37 @@ of dropped frames, which is a sum of frames dropped due to timing violations,
lack of destination ports and MTU enforcement checks). Byte-level counters are
not available.
+Limitations
+===========
+
+The SJA1105 switch family always performs VLAN processing. When configured as
+VLAN-unaware, frames carry a different VLAN tag internally, depending on
+whether the port is standalone or under a VLAN-unaware bridge.
+
+The virtual link keys are always fixed at {MAC DA, VLAN ID, VLAN PCP}, but the
+driver asks for the VLAN ID and VLAN PCP when the port is under a VLAN-aware
+bridge. Otherwise, it fills in the VLAN ID and PCP automatically, based on
+whether the port is standalone or in a VLAN-unaware bridge, and accepts only
+"VLAN-unaware" tc-flower keys (MAC DA).
+
+The existing tc-flower keys that are offloaded using virtual links are no
+longer operational after one of the following happens:
+
+- port was standalone and joins a bridge (VLAN-aware or VLAN-unaware)
+- port is part of a bridge whose VLAN awareness state changes
+- port was part of a bridge and becomes standalone
+- port was standalone, but another port joins a VLAN-aware bridge and this
+ changes the global VLAN awareness state of the bridge
+
+The driver cannot veto all these operations, and it cannot update/remove the
+existing tc-flower filters either. So for proper operation, the tc-flower
+filters should be installed only after the forwarding configuration of the port
+has been made, and removed by user space before making any changes to it.
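The dependency between the port state and the virtual link key can be sketched as follows. This is only an illustration in Python, not the driver's code: the names ``FlowerKey`` and ``virtual_link_key``, the port state strings, the two placeholder VLAN IDs and the automatically filled PCP of 0 are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder VIDs for illustration only; the values the driver really uses
# internally are not stated in this document.
STANDALONE_VID = 0xFFE
UNAWARE_BRIDGE_VID = 0xFFD


@dataclass(frozen=True)
class FlowerKey:
    dmac: str
    vid: Optional[int] = None
    pcp: Optional[int] = None


def virtual_link_key(key, port_state):
    """Return the (MAC DA, VLAN ID, VLAN PCP) tuple or raise ValueError."""
    if port_state == "vlan_aware_bridge":
        # Under a VLAN-aware bridge the user must supply VID and PCP.
        if key.vid is None or key.pcp is None:
            raise ValueError("VLAN ID and PCP are required")
        return (key.dmac, key.vid, key.pcp)
    # Standalone or VLAN-unaware bridge: only the MAC DA key is accepted.
    if key.vid is not None or key.pcp is not None:
        raise ValueError("only the MAC DA key is accepted")
    if port_state == "standalone":
        vid = STANDALONE_VID
    else:
        vid = UNAWARE_BRIDGE_VID
    return (key.dmac, vid, 0)
```

Because the VLAN ID part of the key is derived from the port state at the time the filter is installed, a later change of that state leaves the stored key stale, which is why the filters stop matching.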
+
Device Tree bindings and board design
=====================================
-This section references ``Documentation/devicetree/bindings/net/dsa/sja1105.txt``
+This section references ``Documentation/devicetree/bindings/net/dsa/nxp,sja1105.yaml``
and aims to showcase some potential switch caveats.
RMII PHY role and out-of-band signaling
@@ -576,6 +389,57 @@ MDIO bus and PHY management
The SJA1105 does not have an MDIO bus and does not perform in-band AN either.
Therefore there is no link state notification coming from the switch device.
A board would need to hook up the PHYs connected to the switch to any other
-MDIO bus available to Linux within the system (e.g. to the DSA master's MDIO
+MDIO bus available to Linux within the system (e.g. to the DSA conduit's MDIO
bus). Link state management then works by the driver manually keeping in sync
(over SPI commands) the MAC link speed with the settings negotiated by the PHY.
+
+By comparison, the SJA1110 supports an MDIO slave access point over which its
+internal 100base-T1 PHYs can be accessed from the host. This is, however, not
+used by the driver; instead, the internal 100base-T1 and 100base-TX PHYs are
+accessed through SPI commands, modeled in Linux as virtual MDIO buses.
+
+The microcontroller attached to the SJA1110 port 0 also has an MDIO controller
+operating in master mode; however, the driver does not support this either,
+since the microcontroller gets disabled when the Linux driver operates.
+Discrete PHYs connected to the switch ports should have their MDIO interface
+attached to an MDIO controller from the host system and not to the switch,
+similar to SJA1105.
+
+Port compatibility matrix
+-------------------------
+
+The SJA1105 port compatibility matrix is:
+
+===== ============== ============== ==============
+Port SJA1105E/T SJA1105P/Q SJA1105R/S
+===== ============== ============== ==============
+0 xMII xMII xMII
+1 xMII xMII xMII
+2 xMII xMII xMII
+3 xMII xMII xMII
+4 xMII xMII SGMII
+===== ============== ============== ==============
+
+
+The SJA1110 port compatibility matrix is:
+
+===== ============== ============== ============== ==============
+Port SJA1110A SJA1110B SJA1110C SJA1110D
+===== ============== ============== ============== ==============
+0 RevMII (uC) RevMII (uC) RevMII (uC) RevMII (uC)
+1 100base-TX 100base-TX 100base-TX
+ or SGMII SGMII
+2 xMII xMII xMII xMII
+ or SGMII or SGMII
+3 xMII xMII xMII
+ or SGMII or SGMII SGMII
+ or 2500base-X or 2500base-X or 2500base-X
+4 SGMII SGMII SGMII SGMII
+ or 2500base-X or 2500base-X or 2500base-X or 2500base-X
+5 100base-T1 100base-T1 100base-T1 100base-T1
+6 100base-T1 100base-T1 100base-T1 100base-T1
+7 100base-T1 100base-T1 100base-T1 100base-T1
+8 100base-T1 100base-T1 n/a n/a
+9 100base-T1 100base-T1 n/a n/a
+10 100base-T1 n/a n/a n/a
+===== ============== ============== ============== ==============
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index 82470c36c27a..d583d9abf2f8 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -41,6 +41,11 @@ In the message structure descriptions below, if an attribute name is suffixed
with "+", parent nest can contain multiple attributes of the same type. This
implements an array of entries.
+Attributes that need to be filled in by device drivers and that are dumped to
+user space based on whether they are valid or not should not use zero as a
+valid value. This avoids the need to explicitly signal the validity of the
+attribute in the device driver API.
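A minimal sketch of this convention, written in Python rather than kernel C: the structure is zero-initialized, the driver overwrites what it knows, and only non-zero fields are dumped. The function name ``dump_valid_attrs`` and the field names are hypothetical.

```python
def dump_valid_attrs(driver_values):
    """Return only the attributes the driver actually filled in (non-zero)."""
    return {name: value for name, value in driver_values.items() if value != 0}


# The structure starts out zeroed before the driver is called.
params = {"rx_max": 0, "tx_max": 0, "rx": 0, "tx": 0}
# This hypothetical driver only reports its TX ring.
params.update(tx_max=4096, tx=512)
reply = dump_valid_attrs(params)
```

With zero reserved as "not set", no separate validity flag per field is needed in the driver API.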
+
Request header
==============
@@ -68,6 +73,7 @@ the flags may not apply to requests. Recognized flags are:
================================= ===================================
``ETHTOOL_FLAG_COMPACT_BITSETS`` use compact format bitsets in reply
``ETHTOOL_FLAG_OMIT_REPLY`` omit optional reply (_SET and _ACT)
+ ``ETHTOOL_FLAG_STATS`` include optional device statistics
================================= ===================================
New request flags should follow the general idea that if the flag is not set,
@@ -100,7 +106,7 @@ modifying a bitmap, the former changes the bit set in mask to values set in
value and preserves the rest; the latter sets the bits set in the bitmap and
clears the rest.
-Compact form: nested (bitset) atrribute contents:
+Compact form: nested (bitset) attribute contents:
============================ ====== ============================
``ETHTOOL_A_BITSET_NOMASK`` flag no mask, only a list
@@ -178,7 +184,7 @@ according to message purpose:
Userspace to kernel:
- ===================================== ================================
+ ===================================== =================================
``ETHTOOL_MSG_STRSET_GET`` get string set
``ETHTOOL_MSG_LINKINFO_GET`` get link settings
``ETHTOOL_MSG_LINKINFO_SET`` set link settings
@@ -206,40 +212,69 @@ Userspace to kernel:
``ETHTOOL_MSG_TSINFO_GET`` get timestamping info
``ETHTOOL_MSG_CABLE_TEST_ACT`` action start cable test
``ETHTOOL_MSG_CABLE_TEST_TDR_ACT`` action start raw TDR cable test
- ===================================== ================================
+ ``ETHTOOL_MSG_TUNNEL_INFO_GET`` get tunnel offload info
+ ``ETHTOOL_MSG_FEC_GET`` get FEC settings
+ ``ETHTOOL_MSG_FEC_SET`` set FEC settings
+ ``ETHTOOL_MSG_MODULE_EEPROM_GET`` read SFP module EEPROM
+ ``ETHTOOL_MSG_STATS_GET`` get standard statistics
+ ``ETHTOOL_MSG_PHC_VCLOCKS_GET`` get PHC virtual clocks info
+ ``ETHTOOL_MSG_MODULE_SET`` set transceiver module parameters
+ ``ETHTOOL_MSG_MODULE_GET`` get transceiver module parameters
+ ``ETHTOOL_MSG_PSE_SET`` set PSE parameters
+ ``ETHTOOL_MSG_PSE_GET`` get PSE parameters
+ ``ETHTOOL_MSG_RSS_GET`` get RSS settings
+ ``ETHTOOL_MSG_PLCA_GET_CFG`` get PLCA RS parameters
+ ``ETHTOOL_MSG_PLCA_SET_CFG`` set PLCA RS parameters
+ ``ETHTOOL_MSG_PLCA_GET_STATUS`` get PLCA RS status
+ ``ETHTOOL_MSG_MM_GET`` get MAC merge layer state
+ ``ETHTOOL_MSG_MM_SET`` set MAC merge layer parameters
+ ===================================== =================================
Kernel to userspace:
- ===================================== =================================
- ``ETHTOOL_MSG_STRSET_GET_REPLY`` string set contents
- ``ETHTOOL_MSG_LINKINFO_GET_REPLY`` link settings
- ``ETHTOOL_MSG_LINKINFO_NTF`` link settings notification
- ``ETHTOOL_MSG_LINKMODES_GET_REPLY`` link modes info
- ``ETHTOOL_MSG_LINKMODES_NTF`` link modes notification
- ``ETHTOOL_MSG_LINKSTATE_GET_REPLY`` link state info
- ``ETHTOOL_MSG_DEBUG_GET_REPLY`` debugging settings
- ``ETHTOOL_MSG_DEBUG_NTF`` debugging settings notification
- ``ETHTOOL_MSG_WOL_GET_REPLY`` wake-on-lan settings
- ``ETHTOOL_MSG_WOL_NTF`` wake-on-lan settings notification
- ``ETHTOOL_MSG_FEATURES_GET_REPLY`` device features
- ``ETHTOOL_MSG_FEATURES_SET_REPLY`` optional reply to FEATURES_SET
- ``ETHTOOL_MSG_FEATURES_NTF`` netdev features notification
- ``ETHTOOL_MSG_PRIVFLAGS_GET_REPLY`` private flags
- ``ETHTOOL_MSG_PRIVFLAGS_NTF`` private flags
- ``ETHTOOL_MSG_RINGS_GET_REPLY`` ring sizes
- ``ETHTOOL_MSG_RINGS_NTF`` ring sizes
- ``ETHTOOL_MSG_CHANNELS_GET_REPLY`` channel counts
- ``ETHTOOL_MSG_CHANNELS_NTF`` channel counts
- ``ETHTOOL_MSG_COALESCE_GET_REPLY`` coalescing parameters
- ``ETHTOOL_MSG_COALESCE_NTF`` coalescing parameters
- ``ETHTOOL_MSG_PAUSE_GET_REPLY`` pause parameters
- ``ETHTOOL_MSG_PAUSE_NTF`` pause parameters
- ``ETHTOOL_MSG_EEE_GET_REPLY`` EEE settings
- ``ETHTOOL_MSG_EEE_NTF`` EEE settings
- ``ETHTOOL_MSG_TSINFO_GET_REPLY`` timestamping info
- ``ETHTOOL_MSG_CABLE_TEST_NTF`` Cable test results
- ``ETHTOOL_MSG_CABLE_TEST_TDR_NTF`` Cable test TDR results
- ===================================== =================================
+ ======================================== =================================
+ ``ETHTOOL_MSG_STRSET_GET_REPLY`` string set contents
+ ``ETHTOOL_MSG_LINKINFO_GET_REPLY`` link settings
+ ``ETHTOOL_MSG_LINKINFO_NTF`` link settings notification
+ ``ETHTOOL_MSG_LINKMODES_GET_REPLY`` link modes info
+ ``ETHTOOL_MSG_LINKMODES_NTF`` link modes notification
+ ``ETHTOOL_MSG_LINKSTATE_GET_REPLY`` link state info
+ ``ETHTOOL_MSG_DEBUG_GET_REPLY`` debugging settings
+ ``ETHTOOL_MSG_DEBUG_NTF`` debugging settings notification
+ ``ETHTOOL_MSG_WOL_GET_REPLY`` wake-on-lan settings
+ ``ETHTOOL_MSG_WOL_NTF`` wake-on-lan settings notification
+ ``ETHTOOL_MSG_FEATURES_GET_REPLY`` device features
+ ``ETHTOOL_MSG_FEATURES_SET_REPLY`` optional reply to FEATURES_SET
+ ``ETHTOOL_MSG_FEATURES_NTF`` netdev features notification
+ ``ETHTOOL_MSG_PRIVFLAGS_GET_REPLY`` private flags
+ ``ETHTOOL_MSG_PRIVFLAGS_NTF`` private flags
+ ``ETHTOOL_MSG_RINGS_GET_REPLY`` ring sizes
+ ``ETHTOOL_MSG_RINGS_NTF`` ring sizes
+ ``ETHTOOL_MSG_CHANNELS_GET_REPLY`` channel counts
+ ``ETHTOOL_MSG_CHANNELS_NTF`` channel counts
+ ``ETHTOOL_MSG_COALESCE_GET_REPLY`` coalescing parameters
+ ``ETHTOOL_MSG_COALESCE_NTF`` coalescing parameters
+ ``ETHTOOL_MSG_PAUSE_GET_REPLY`` pause parameters
+ ``ETHTOOL_MSG_PAUSE_NTF`` pause parameters
+ ``ETHTOOL_MSG_EEE_GET_REPLY`` EEE settings
+ ``ETHTOOL_MSG_EEE_NTF`` EEE settings
+ ``ETHTOOL_MSG_TSINFO_GET_REPLY`` timestamping info
+ ``ETHTOOL_MSG_CABLE_TEST_NTF`` Cable test results
+ ``ETHTOOL_MSG_CABLE_TEST_TDR_NTF`` Cable test TDR results
+ ``ETHTOOL_MSG_TUNNEL_INFO_GET_REPLY`` tunnel offload info
+ ``ETHTOOL_MSG_FEC_GET_REPLY`` FEC settings
+ ``ETHTOOL_MSG_FEC_NTF`` FEC settings
+ ``ETHTOOL_MSG_MODULE_EEPROM_GET_REPLY`` read SFP module EEPROM
+ ``ETHTOOL_MSG_STATS_GET_REPLY`` standard statistics
+ ``ETHTOOL_MSG_PHC_VCLOCKS_GET_REPLY`` PHC virtual clocks info
+ ``ETHTOOL_MSG_MODULE_GET_REPLY`` transceiver module parameters
+ ``ETHTOOL_MSG_PSE_GET_REPLY`` PSE parameters
+ ``ETHTOOL_MSG_RSS_GET_REPLY`` RSS settings
+ ``ETHTOOL_MSG_PLCA_GET_CFG_REPLY`` PLCA RS parameters
+ ``ETHTOOL_MSG_PLCA_GET_STATUS_REPLY`` PLCA RS status
+ ``ETHTOOL_MSG_PLCA_NTF`` PLCA RS parameters
+ ``ETHTOOL_MSG_MM_GET_REPLY`` MAC merge layer status
+ ======================================== =================================
``GET`` requests are sent by userspace applications to retrieve device
information. They usually do not contain any message specific attributes.
@@ -405,6 +440,7 @@ Kernel response contents:
``ETHTOOL_A_LINKMODES_DUPLEX`` u8 duplex mode
``ETHTOOL_A_LINKMODES_MASTER_SLAVE_CFG`` u8 Master/slave port mode
``ETHTOOL_A_LINKMODES_MASTER_SLAVE_STATE`` u8 Master/slave port state
+ ``ETHTOOL_A_LINKMODES_RATE_MATCHING`` u8 PHY rate matching
========================================== ====== ==========================
For ``ETHTOOL_A_LINKMODES_OURS``, value represents advertised modes and mask
@@ -428,25 +464,28 @@ Request contents:
``ETHTOOL_A_LINKMODES_SPEED`` u32 link speed (Mb/s)
``ETHTOOL_A_LINKMODES_DUPLEX`` u8 duplex mode
``ETHTOOL_A_LINKMODES_MASTER_SLAVE_CFG`` u8 Master/slave port mode
+ ``ETHTOOL_A_LINKMODES_RATE_MATCHING`` u8 PHY rate matching
+ ``ETHTOOL_A_LINKMODES_LANES`` u32 lanes
========================================== ====== ==========================
``ETHTOOL_A_LINKMODES_OURS`` bit set allows setting advertised link modes. If
autonegotiation is on (either set now or kept from before), advertised modes
are not changed (no ``ETHTOOL_A_LINKMODES_OURS`` attribute) and at least one
-of speed and duplex is specified, kernel adjusts advertised modes to all
-supported modes matching speed, duplex or both (whatever is specified). This
-autoselection is done on ethtool side with ioctl interface, netlink interface
-is supposed to allow requesting changes without knowing what exactly kernel
-supports.
+of speed, duplex and lanes is specified, kernel adjusts advertised modes to all
+supported modes matching speed, duplex, lanes or all (whatever is specified).
+This autoselection is done on ethtool side with ioctl interface; netlink
+interface is supposed to allow requesting changes without knowing what exactly
+kernel supports.
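The autoselection described above amounts to filtering the supported modes by whichever of speed, duplex and lanes was given. A rough Python sketch, assuming a small illustrative table of link modes (the ``LinkMode`` tuple, the ``SUPPORTED`` list and ``autoselect`` are names invented for this example, not kernel symbols):

```python
from typing import NamedTuple


class LinkMode(NamedTuple):
    name: str
    speed: int   # Mb/s
    duplex: str  # "full" or "half"
    lanes: int


# Illustrative subset of modes a device might support.
SUPPORTED = [
    LinkMode("1000baseT/Full", 1000, "full", 1),
    LinkMode("1000baseT/Half", 1000, "half", 1),
    LinkMode("100000baseKR4/Full", 100000, "full", 4),
    LinkMode("100000baseKR2/Full", 100000, "full", 2),
]


def autoselect(supported, speed=None, duplex=None, lanes=None):
    """Return names of supported modes matching every specified parameter."""
    return [
        mode.name
        for mode in supported
        if (speed is None or mode.speed == speed)
        and (duplex is None or mode.duplex == duplex)
        and (lanes is None or mode.lanes == lanes)
    ]
```

For instance, ``autoselect(SUPPORTED, speed=100000, lanes=2)`` keeps only the two-lane 100G mode, while ``autoselect(SUPPORTED, speed=1000)`` keeps both 1000baseT modes.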
LINKSTATE_GET
=============
-Requests link state information. At the moment, only link up/down flag (as
-provided by ``ETHTOOL_GLINK`` ioctl command) is provided but some future
-extensions are planned (e.g. link down reason). This request does not have any
-attributes.
+Requests link state information. Link up/down flag (as provided by
+``ETHTOOL_GLINK`` ioctl command) is provided. Optionally, extended state might
+be provided as well. In general, extended state describes reasons for why a port
+is down, or why it operates in some non-obvious mode. This request does not have
+any attributes.
Request contents:
@@ -461,16 +500,154 @@ Kernel response contents:
``ETHTOOL_A_LINKSTATE_LINK`` bool link state (up/down)
``ETHTOOL_A_LINKSTATE_SQI`` u32 Current Signal Quality Index
``ETHTOOL_A_LINKSTATE_SQI_MAX`` u32 Max supported SQI value
+ ``ETHTOOL_A_LINKSTATE_EXT_STATE`` u8 link extended state
+ ``ETHTOOL_A_LINKSTATE_EXT_SUBSTATE`` u8 link extended substate
+ ``ETHTOOL_A_LINKSTATE_EXT_DOWN_CNT`` u32 count of link down events
==================================== ====== ============================
For most NIC drivers, the value of ``ETHTOOL_A_LINKSTATE_LINK`` returns
carrier flag provided by ``netif_carrier_ok()`` but there are drivers which
define their own handler.
+``ETHTOOL_A_LINKSTATE_EXT_STATE`` and ``ETHTOOL_A_LINKSTATE_EXT_SUBSTATE`` are
+optional values. ethtool core can provide either both
+``ETHTOOL_A_LINKSTATE_EXT_STATE`` and ``ETHTOOL_A_LINKSTATE_EXT_SUBSTATE``,
+or only ``ETHTOOL_A_LINKSTATE_EXT_STATE``, or none of them.
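The three permitted combinations can be illustrated with a short Python sketch. The helper ``linkstate_reply`` is hypothetical, and the state and substate are passed as symbolic names instead of the u8 values carried on the wire.

```python
def linkstate_reply(link_up, ext_state=None, ext_substate=None):
    """Build a reply dict with both, only the state, or neither attribute."""
    if ext_substate is not None and ext_state is None:
        # A substate without its parent state is not one of the allowed forms.
        raise ValueError("substate requires an extended state")
    reply = {"ETHTOOL_A_LINKSTATE_LINK": link_up}
    if ext_state is not None:
        reply["ETHTOOL_A_LINKSTATE_EXT_STATE"] = ext_state
    if ext_substate is not None:
        reply["ETHTOOL_A_LINKSTATE_EXT_SUBSTATE"] = ext_substate
    return reply
```

A user space consumer should therefore treat both attributes as optional and never assume that a substate is present just because a state is.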
+
``LINKSTATE_GET`` allows dump requests (kernel returns reply messages for all
devices supporting the request).
+Link extended states:
+
+ ================================================ ============================================
+ ``ETHTOOL_LINK_EXT_STATE_AUTONEG`` States relating to the autonegotiation or
+ issues therein
+
+ ``ETHTOOL_LINK_EXT_STATE_LINK_TRAINING_FAILURE`` Failure during link training
+
+ ``ETHTOOL_LINK_EXT_STATE_LINK_LOGICAL_MISMATCH`` Logical mismatch in physical coding sublayer
+ or forward error correction sublayer
+
+ ``ETHTOOL_LINK_EXT_STATE_BAD_SIGNAL_INTEGRITY`` Signal integrity issues
+
+ ``ETHTOOL_LINK_EXT_STATE_NO_CABLE`` No cable connected
+
+ ``ETHTOOL_LINK_EXT_STATE_CABLE_ISSUE`` Failure is related to cable,
+ e.g., unsupported cable
+
+ ``ETHTOOL_LINK_EXT_STATE_EEPROM_ISSUE`` Failure is related to EEPROM, e.g., failure
+ during reading or parsing the data
+
+ ``ETHTOOL_LINK_EXT_STATE_CALIBRATION_FAILURE`` Failure during calibration algorithm
+
+ ``ETHTOOL_LINK_EXT_STATE_POWER_BUDGET_EXCEEDED`` The hardware is not able to provide the
+ power required from cable or module
+
+ ``ETHTOOL_LINK_EXT_STATE_OVERHEAT`` The module is overheated
+
+ ``ETHTOOL_LINK_EXT_STATE_MODULE`` Transceiver module issue
+ ================================================ ============================================
+
+Link extended substates:
+
+ Autoneg substates:
+
+ =============================================================== ================================
+ ``ETHTOOL_LINK_EXT_SUBSTATE_AN_NO_PARTNER_DETECTED`` Peer side is down
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_AN_ACK_NOT_RECEIVED`` Ack not received from peer side
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_AN_NEXT_PAGE_EXCHANGE_FAILED`` Next page exchange failed
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_AN_NO_PARTNER_DETECTED_FORCE_MODE`` Peer side is down during force
+ mode or there is no agreement of
+ speed
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_AN_FEC_MISMATCH_DURING_OVERRIDE`` Forward error correction modes
+ in both sides are mismatched
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_AN_NO_HCD`` No Highest Common Denominator
+ =============================================================== ================================
+
+ Link training substates:
+
+ =========================================================================== ====================
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LT_KR_FRAME_LOCK_NOT_ACQUIRED`` Frames were not
+ recognized, the
+ lock failed
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LT_KR_LINK_INHIBIT_TIMEOUT`` The lock did not
+ occur before
+ timeout
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LT_KR_LINK_PARTNER_DID_NOT_SET_RECEIVER_READY`` Peer side did not
+ send ready signal
+ after training
+ process
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LT_REMOTE_FAULT`` Remote side is not
+ ready yet
+ =========================================================================== ====================
+
+ Link logical mismatch substates:
+
+ ================================================================ ===============================
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LLM_PCS_DID_NOT_ACQUIRE_BLOCK_LOCK`` Physical coding sublayer was
+ not locked in first phase -
+ block lock
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LLM_PCS_DID_NOT_ACQUIRE_AM_LOCK`` Physical coding sublayer was
+ not locked in second phase -
+ alignment markers lock
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LLM_PCS_DID_NOT_GET_ALIGN_STATUS`` Physical coding sublayer did
+ not get align status
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LLM_FC_FEC_IS_NOT_LOCKED`` FC forward error correction is
+ not locked
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LLM_RS_FEC_IS_NOT_LOCKED`` RS forward error correction is
+ not locked
+ ================================================================ ===============================
+
+ Bad signal integrity substates:
+
+ ================================================================= =============================
+ ``ETHTOOL_LINK_EXT_SUBSTATE_BSI_LARGE_NUMBER_OF_PHYSICAL_ERRORS`` Large number of physical
+ errors
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_BSI_UNSUPPORTED_RATE`` The system attempted to
+ operate the cable at a rate
+ that is not formally
+ supported, which led to
+ signal integrity issues
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_BSI_SERDES_REFERENCE_CLOCK_LOST`` The external clock signal for
+ SerDes is too weak or
+ unavailable.
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_BSI_SERDES_ALOS`` The received signal for
+ SerDes is too weak because of
+ analog loss of signal.
+ ================================================================= =============================
+
+ Cable issue substates:
+
+ =================================================== ============================================
+ ``ETHTOOL_LINK_EXT_SUBSTATE_CI_UNSUPPORTED_CABLE`` Unsupported cable
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_CI_CABLE_TEST_FAILURE`` Cable test failure
+ =================================================== ============================================
+
+ Transceiver module issue substates:
+
+ =================================================== ============================================
+ ``ETHTOOL_LINK_EXT_SUBSTATE_MODULE_CMIS_NOT_READY`` The CMIS Module State Machine did not reach
+ the ModuleReady state. For example, if the
+ module is stuck at ModuleFault state
+ =================================================== ============================================
+
DEBUG_GET
=========
@@ -612,7 +789,7 @@ Kernel response contents:
``ETHTOOL_A_FEATURES_ACTIVE`` bitset diff old vs. new active
==================================== ====== ==========================
-Request constains only one bitset which can be either value/mask pair (request
+Request contains only one bitset which can be either value/mask pair (request
to change specific feature bits and leave the rest) or only a value (request
to set all features to specified set).
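The two request forms described above can be sketched as follows. This is an illustrative model only: ``apply_bitset`` is a hypothetical helper, not a kernel or ethtool function, and feature sets are modelled as plain integers.

```python
def apply_bitset(old, value, mask=None):
    """Return the feature bits that result from a bitset request.

    With a mask, only the masked bits are changed and the rest are kept.
    Without a mask, the value replaces the whole feature set.
    """
    if mask is None:
        return value
    return (old & ~mask) | (value & mask)
```

For example, ``apply_bitset(0b1010, 0b0001, 0b0011)`` touches only the two low bits, while ``apply_bitset(0b1010, 0b0100)`` replaces the whole set.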
@@ -689,18 +866,50 @@ Request contents:
Kernel response contents:
- ==================================== ====== ==========================
- ``ETHTOOL_A_RINGS_HEADER`` nested reply header
- ``ETHTOOL_A_RINGS_RX_MAX`` u32 max size of RX ring
- ``ETHTOOL_A_RINGS_RX_MINI_MAX`` u32 max size of RX mini ring
- ``ETHTOOL_A_RINGS_RX_JUMBO_MAX`` u32 max size of RX jumbo ring
- ``ETHTOOL_A_RINGS_TX_MAX`` u32 max size of TX ring
- ``ETHTOOL_A_RINGS_RX`` u32 size of RX ring
- ``ETHTOOL_A_RINGS_RX_MINI`` u32 size of RX mini ring
- ``ETHTOOL_A_RINGS_RX_JUMBO`` u32 size of RX jumbo ring
- ``ETHTOOL_A_RINGS_TX`` u32 size of TX ring
- ==================================== ====== ==========================
-
+ ======================================= ====== ===========================
+ ``ETHTOOL_A_RINGS_HEADER`` nested reply header
+ ``ETHTOOL_A_RINGS_RX_MAX`` u32 max size of RX ring
+ ``ETHTOOL_A_RINGS_RX_MINI_MAX`` u32 max size of RX mini ring
+ ``ETHTOOL_A_RINGS_RX_JUMBO_MAX`` u32 max size of RX jumbo ring
+ ``ETHTOOL_A_RINGS_TX_MAX`` u32 max size of TX ring
+ ``ETHTOOL_A_RINGS_RX`` u32 size of RX ring
+ ``ETHTOOL_A_RINGS_RX_MINI`` u32 size of RX mini ring
+ ``ETHTOOL_A_RINGS_RX_JUMBO`` u32 size of RX jumbo ring
+ ``ETHTOOL_A_RINGS_TX`` u32 size of TX ring
+ ``ETHTOOL_A_RINGS_RX_BUF_LEN`` u32 size of buffers on the ring
+ ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` u8 TCP header / data split
+ ``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE
+ ``ETHTOOL_A_RINGS_TX_PUSH`` u8 flag of TX Push mode
+ ``ETHTOOL_A_RINGS_RX_PUSH`` u8 flag of RX Push mode
+ ``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN`` u32 size of TX push buffer
+ ``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN_MAX`` u32 max size of TX push buffer
+ ======================================= ====== ===========================
+
+``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` indicates whether the device is usable with
+page-flipping TCP zero-copy receive (``getsockopt(TCP_ZEROCOPY_RECEIVE)``).
+If enabled the device is configured to place frame headers and data into
+separate buffers. The device configuration must make it possible to receive
+full memory pages of data, for example because MTU is high enough or through
+HW-GRO.
+
+``ETHTOOL_A_RINGS_[RX|TX]_PUSH`` flag is used to enable descriptor fast
+path to send or receive packets. In ordinary path, driver fills descriptors in DRAM and
+notifies NIC hardware. In fast path, driver pushes descriptors to the device
+through MMIO writes, thus reducing the latency. However, enabling this feature
+may increase the CPU cost. Drivers may enforce additional per-packet
+eligibility checks (e.g. on packet size).
+
+``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN`` specifies the maximum number of bytes of a
+transmitted packet a driver can push directly to the underlying device
+('push' mode). Pushing some of the payload bytes to the device has the
+advantages of reducing latency for small packets by avoiding DMA mapping (same
+as ``ETHTOOL_A_RINGS_TX_PUSH`` parameter) as well as allowing the underlying
+device to process packet headers ahead of fetching its payload.
+This can help the device to make fast actions based on the packet's headers.
+This is similar to the "tx-copybreak" parameter, which copies the packet to a
+preallocated DMA memory area instead of mapping new memory. However,
+tx-push-buff parameter copies the packet directly to the device to allow the
+device to take faster actions on the packet.
RINGS_SET
=========
@@ -709,19 +918,33 @@ Sets ring sizes like ``ETHTOOL_SRINGPARAM`` ioctl request.
Request contents:
- ==================================== ====== ==========================
+ ==================================== ====== ===========================
``ETHTOOL_A_RINGS_HEADER`` nested request header
``ETHTOOL_A_RINGS_RX`` u32 size of RX ring
``ETHTOOL_A_RINGS_RX_MINI`` u32 size of RX mini ring
``ETHTOOL_A_RINGS_RX_JUMBO`` u32 size of RX jumbo ring
``ETHTOOL_A_RINGS_TX`` u32 size of TX ring
- ==================================== ====== ==========================
+ ``ETHTOOL_A_RINGS_RX_BUF_LEN`` u32 size of buffers on the ring
+ ``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE
+ ``ETHTOOL_A_RINGS_TX_PUSH`` u8 flag of TX Push mode
+ ``ETHTOOL_A_RINGS_RX_PUSH`` u8 flag of RX Push mode
+ ``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN`` u32 size of TX push buffer
+ ==================================== ====== ===========================
Kernel checks that requested ring sizes do not exceed limits reported by
driver. Driver may impose additional constraints and may not support all
attributes.
+``ETHTOOL_A_RINGS_CQE_SIZE`` specifies the completion queue event size.
+Completion queue events (CQE) are the events posted by the NIC to indicate the
+completion status of a packet when the packet is sent (like send success or
+error) or received (like pointers to packet fragments). The CQE size parameter
+allows changing the CQE size from its default if the NIC supports it.
+A bigger CQE can hold more receive buffer pointers, which in turn lets the NIC
+transfer a bigger frame from the wire. Depending on the NIC hardware, the overall
+completion queue size can be adjusted in the driver if the CQE size is modified.
+
CHANNELS_GET
============
@@ -805,12 +1028,39 @@ Kernel response contents:
``ETHTOOL_A_COALESCE_TX_USECS_HIGH`` u32 delay (us), high Tx
``ETHTOOL_A_COALESCE_TX_MAX_FRAMES_HIGH`` u32 max packets, high Tx
``ETHTOOL_A_COALESCE_RATE_SAMPLE_INTERVAL`` u32 rate sampling interval
+ ``ETHTOOL_A_COALESCE_USE_CQE_TX`` bool timer reset mode, Tx
+ ``ETHTOOL_A_COALESCE_USE_CQE_RX`` bool timer reset mode, Rx
+ ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES`` u32 max aggr size, Tx
+ ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES`` u32 max aggr packets, Tx
+ ``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` u32 time (us), aggr, Tx
=========================================== ====== =======================
Attributes are only included in reply if their value is not zero or the
corresponding bit in ``ethtool_ops::supported_coalesce_params`` is set (i.e.
they are declared as supported by driver).
+Timer reset mode (``ETHTOOL_A_COALESCE_USE_CQE_TX`` and
+``ETHTOOL_A_COALESCE_USE_CQE_RX``) controls the interaction between packet
+arrival and the various time based delay parameters. By default timers are
+expected to limit the max delay between any packet arrival/departure and a
+corresponding interrupt. In this mode timer should be started by packet
+arrival (sometimes delivery of previous interrupt) and reset when interrupt
+is delivered.
+Setting the appropriate attribute to 1 will enable ``CQE`` mode, where
+each packet event resets the timer. In this mode timer is used to force
+the interrupt if queue goes idle, while busy queues depend on the packet
+limit to trigger interrupts.
+
+Tx aggregation consists of copying frames into a contiguous buffer so that they
+can be submitted as a single IO operation. ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES``
+describes the maximum size in bytes for the submitted buffer.
+``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES`` describes the maximum number of frames
+that can be aggregated into a single buffer.
+``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` describes the amount of time in usecs,
+counted since the first packet arrival in an aggregated block, after which the
+block should be sent.
+This feature is mainly of interest for specific USB devices which do not cope
+well with frequent small URB transmissions.
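One possible reading of the three Tx aggregation limits is sketched below. The class ``TxAggregator`` and its methods are hypothetical and only model the description above; real drivers may order the checks differently or handle oversized frames in another way.

```python
class TxAggregator:
    """Toy model of Tx aggregation driven by the three limits."""

    def __init__(self, max_bytes, max_frames, time_usecs):
        self.max_bytes = max_bytes      # models TX_AGGR_MAX_BYTES
        self.max_frames = max_frames    # models TX_AGGR_MAX_FRAMES
        self.time_usecs = time_usecs    # models TX_AGGR_TIME_USECS
        self.frames = []
        self.first_arrival = None

    def _flush(self):
        # Submit the pending frames as one contiguous buffer.
        block = b"".join(self.frames)
        self.frames = []
        self.first_arrival = None
        return block

    def poll(self, now_usecs):
        """Flush the pending block if its timer has expired."""
        if self.frames and now_usecs - self.first_arrival >= self.time_usecs:
            return [self._flush()]
        return []

    def add(self, frame, now_usecs):
        """Queue a frame and return any blocks that must be sent now."""
        out = self.poll(now_usecs)
        pending = sum(len(f) for f in self.frames)
        if self.frames and pending + len(frame) > self.max_bytes:
            out.append(self._flush())
        if not self.frames:
            # The timer counts from the first packet of the block.
            self.first_arrival = now_usecs
        self.frames.append(frame)
        if len(self.frames) >= self.max_frames:
            out.append(self._flush())
        return out
```

In this model a block is sent when adding a frame would exceed the byte limit, when the frame limit is reached, or when the timer started by the first frame of the block expires.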
COALESCE_SET
============
@@ -843,6 +1093,11 @@ Request contents:
``ETHTOOL_A_COALESCE_TX_USECS_HIGH`` u32 delay (us), high Tx
``ETHTOOL_A_COALESCE_TX_MAX_FRAMES_HIGH`` u32 max packets, high Tx
``ETHTOOL_A_COALESCE_RATE_SAMPLE_INTERVAL`` u32 rate sampling interval
+ ``ETHTOOL_A_COALESCE_USE_CQE_TX`` bool timer reset mode, Tx
+ ``ETHTOOL_A_COALESCE_USE_CQE_RX`` bool timer reset mode, Rx
+ ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES`` u32 max aggr size, Tx
+ ``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES`` u32 max aggr packets, Tx
+ ``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` u32 time (us), aggr, Tx
=========================================== ====== =======================
Request is rejected if it contains attributes declared as unsupported by driver (i.e.
@@ -850,18 +1105,32 @@ such that the corresponding bit in ``ethtool_ops::supported_coalesce_params``
is not set), regardless of their values. Driver may impose additional
constraints on coalescing parameters and their values.
+Compared to requests issued via ``ioctl()``, the netlink version of this request
+will try harder to make sure that values specified by the user have been applied
+and may call the driver twice.
+
PAUSE_GET
-============
+=========
-Gets channel counts like ``ETHTOOL_GPAUSE`` ioctl request.
+Gets pause frame settings like ``ETHTOOL_GPAUSEPARAM`` ioctl request.
Request contents:
===================================== ====== ==========================
``ETHTOOL_A_PAUSE_HEADER`` nested request header
+ ``ETHTOOL_A_PAUSE_STATS_SRC`` u32 source of statistics
===================================== ====== ==========================
+``ETHTOOL_A_PAUSE_STATS_SRC`` is optional. It takes values from:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_mac_stats_src
+
+If absent from the request, stats will be provided with
+an ``ETHTOOL_A_PAUSE_STATS_SRC`` attribute in the response equal to
+``ETHTOOL_MAC_STATS_SRC_AGGREGATE``.
+
Kernel response contents:
===================================== ====== ==========================
@@ -869,11 +1138,21 @@ Kernel response contents:
``ETHTOOL_A_PAUSE_AUTONEG`` bool pause autonegotiation
``ETHTOOL_A_PAUSE_RX`` bool receive pause frames
``ETHTOOL_A_PAUSE_TX`` bool transmit pause frames
+ ``ETHTOOL_A_PAUSE_STATS`` nested pause statistics
===================================== ====== ==========================
+``ETHTOOL_A_PAUSE_STATS`` are reported if ``ETHTOOL_FLAG_STATS`` was set
+in ``ETHTOOL_A_HEADER_FLAGS``.
+It will be empty if driver did not report any statistics. Drivers fill in
+the statistics in the following structure:
+
+.. kernel-doc:: include/linux/ethtool.h
+ :identifiers: ethtool_pause_stats
+
+Each member has a corresponding attribute defined.
PAUSE_SET
-============
+=========
Sets pause parameters like ``ETHTOOL_SPAUSEPARAM`` ioctl request.
@@ -890,7 +1169,7 @@ Request contents:
EEE_GET
=======
-Gets channel counts like ``ETHTOOL_GEEE`` ioctl request.
+Gets Energy Efficient Ethernet settings like ``ETHTOOL_GEEE`` ioctl request.
Request contents:
@@ -920,7 +1199,7 @@ first 32 are provided by the ``ethtool_ops`` callback.
EEE_SET
=======
-Sets pause parameters like ``ETHTOOL_GEEEPARAM`` ioctl request.
+Sets Energy Efficient Ethernet parameters like ``ETHTOOL_SEEE`` ioctl request.
Request contents:
@@ -1110,6 +1389,621 @@ used to report the amplitude of the reflection for a given pair.
| | | ``ETHTOOL_A_CABLE_AMPLITUDE_mV`` | s16 | Reflection amplitude |
+-+-+-----------------------------------------+--------+----------------------+
+TUNNEL_INFO
+===========
+
+Gets information about the tunnel state NIC is aware of.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_TUNNEL_INFO_HEADER`` nested request header
+ ===================================== ====== ==========================
+
+Kernel response contents:
+
+ +---------------------------------------------+--------+---------------------+
+ | ``ETHTOOL_A_TUNNEL_INFO_HEADER`` | nested | reply header |
+ +---------------------------------------------+--------+---------------------+
+ | ``ETHTOOL_A_TUNNEL_INFO_UDP_PORTS`` | nested | all UDP port tables |
+ +-+-------------------------------------------+--------+---------------------+
+ | | ``ETHTOOL_A_TUNNEL_UDP_TABLE`` | nested | one UDP port table |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_TUNNEL_UDP_TABLE_SIZE`` | u32 | max size of the |
+ | | | | | table |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_TUNNEL_UDP_TABLE_TYPES`` | bitset | tunnel types which |
+ | | | | | table can hold |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_TUNNEL_UDP_TABLE_ENTRY`` | nested | offloaded UDP port |
+ +-+-+-+---------------------------------------+--------+---------------------+
+ | | | | ``ETHTOOL_A_TUNNEL_UDP_ENTRY_PORT`` | be16 | UDP port |
+ +-+-+-+---------------------------------------+--------+---------------------+
+ | | | | ``ETHTOOL_A_TUNNEL_UDP_ENTRY_TYPE`` | u32 | tunnel type |
+ +-+-+-+---------------------------------------+--------+---------------------+
+
+For a UDP tunnel table, an empty ``ETHTOOL_A_TUNNEL_UDP_TABLE_TYPES`` indicates that
+the table contains static entries, hard-coded by the NIC.
+
+FEC_GET
+=======
+
+Gets FEC configuration and state like ``ETHTOOL_GFECPARAM`` ioctl request.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_FEC_HEADER`` nested request header
+ ===================================== ====== ==========================
+
+Kernel response contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_FEC_HEADER`` nested reply header
+ ``ETHTOOL_A_FEC_MODES`` bitset configured modes
+ ``ETHTOOL_A_FEC_AUTO`` bool FEC mode auto selection
+ ``ETHTOOL_A_FEC_ACTIVE`` u32 index of active FEC mode
+ ``ETHTOOL_A_FEC_STATS`` nested FEC statistics
+ ===================================== ====== ==========================
+
+``ETHTOOL_A_FEC_ACTIVE`` is the bit index of the FEC link mode currently
+active on the interface. This attribute may not be present if device does
+not support FEC.
+
+``ETHTOOL_A_FEC_MODES`` and ``ETHTOOL_A_FEC_AUTO`` are only meaningful when
+autonegotiation is disabled. If ``ETHTOOL_A_FEC_AUTO`` is non-zero driver will
+select the FEC mode automatically based on the parameters of the SFP module.
+This is equivalent to the ``ETHTOOL_FEC_AUTO`` bit of the ioctl interface.
+``ETHTOOL_A_FEC_MODES`` carries the current FEC configuration using link mode
+bits (rather than old ``ETHTOOL_FEC_*`` bits).
+
+``ETHTOOL_A_FEC_STATS`` are reported if ``ETHTOOL_FLAG_STATS`` was set in
+``ETHTOOL_A_HEADER_FLAGS``.
+Each attribute carries an array of 64bit statistics. First entry in the array
+contains the total number of events on the port, while the following entries
+are counters corresponding to lanes/PCS instances. The number of entries in
+the array will be:
+
++--------------+---------------------------------------------+
+| `0` | device does not support FEC statistics |
++--------------+---------------------------------------------+
+| `1` | device does not support per-lane break down |
++--------------+---------------------------------------------+
+| `1 + #lanes` | device has full support for FEC stats |
++--------------+---------------------------------------------+
+
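The array layout in the table above can be read with a small helper like the one below. It is only a sketch: ``split_fec_stat`` is a hypothetical name, and the array is assumed to have already been decoded from the attribute into a list of integers.

```python
def split_fec_stat(values):
    """Split one FEC statistic array into (total, per_lane).

    An empty array means the device does not support FEC statistics,
    a single entry means there is no per-lane breakdown, and any
    further entries are the per-lane (or per-PCS-instance) counters.
    """
    if not values:
        return None, []
    return values[0], list(values[1:])
```

With this helper ``split_fec_stat([10, 4, 6])`` yields a port total of 10 and two lane counters, while ``split_fec_stat([7])`` yields only the total.
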
+Drivers fill in the statistics in the following structure:
+
+.. kernel-doc:: include/linux/ethtool.h
+ :identifiers: ethtool_fec_stats
+
+FEC_SET
+=======
+
+Sets FEC parameters like ``ETHTOOL_SFECPARAM`` ioctl request.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_FEC_HEADER`` nested request header
+ ``ETHTOOL_A_FEC_MODES`` bitset configured modes
+ ``ETHTOOL_A_FEC_AUTO`` bool FEC mode auto selection
+ ===================================== ====== ==========================
+
+``FEC_SET`` is only meaningful when autonegotiation is disabled. Otherwise
+FEC mode is selected as part of autonegotiation.
+
+``ETHTOOL_A_FEC_MODES`` selects which FEC mode should be used. It's recommended
+to set only one bit, if multiple bits are set driver may choose between them
+in an implementation specific way.
+
+``ETHTOOL_A_FEC_AUTO`` requests the driver to choose FEC mode based on SFP
+module parameters. This does not mean autonegotiation.
+
+MODULE_EEPROM_GET
+=================
+
+Fetch module EEPROM data dump.
+This interface is designed to allow dumps of at most 1/2 page at once. This
+means only dumps of 128 (or fewer) bytes are allowed, without crossing the half page
+boundary located at offset 128. For pages other than 0 only the high 128 bytes are
+accessible.
+
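The offset and length constraints above can be expressed as a short check. This is a sketch of the rules as stated in this section, not the kernel's validation code; ``eeprom_request_ok`` and the two constants are names chosen for illustration.

```python
HALF_PAGE = 128   # half page boundary mentioned in the text
PAGE_LEN = 256    # assumed full page length (two half pages)


def eeprom_request_ok(offset, length, page=0):
    """Return True if (offset, length, page) respects the half page rules."""
    if length < 1 or length > HALF_PAGE:
        return False              # at most half a page per request
    if offset < 0 or offset + length > PAGE_LEN:
        return False              # must stay inside the page
    if page != 0 and offset < HALF_PAGE:
        return False              # only the high half of pages > 0
    if offset < HALF_PAGE and offset + length > HALF_PAGE:
        return False              # must not cross offset 128
    return True
```

For instance, reading 128 bytes at offset 0 of page 0 is accepted, while reading 64 bytes at offset 100 is rejected because it crosses the boundary at offset 128.
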
+Request contents:
+
+ ======================================= ====== ==========================
+ ``ETHTOOL_A_MODULE_EEPROM_HEADER`` nested request header
+ ``ETHTOOL_A_MODULE_EEPROM_OFFSET`` u32 offset within a page
+ ``ETHTOOL_A_MODULE_EEPROM_LENGTH`` u32 amount of bytes to read
+ ``ETHTOOL_A_MODULE_EEPROM_PAGE`` u8 page number
+ ``ETHTOOL_A_MODULE_EEPROM_BANK`` u8 bank number
+ ``ETHTOOL_A_MODULE_EEPROM_I2C_ADDRESS`` u8 page I2C address
+ ======================================= ====== ==========================
+
+If ``ETHTOOL_A_MODULE_EEPROM_BANK`` is not specified, bank 0 is assumed.
+
+Kernel response contents:
+
+ +---------------------------------------------+--------+---------------------+
+ | ``ETHTOOL_A_MODULE_EEPROM_HEADER`` | nested | reply header |
+ +---------------------------------------------+--------+---------------------+
+ | ``ETHTOOL_A_MODULE_EEPROM_DATA`` | binary | array of bytes from |
+ | | | module EEPROM |
+ +---------------------------------------------+--------+---------------------+
+
+``ETHTOOL_A_MODULE_EEPROM_DATA`` has an attribute length equal to the number of
+bytes the driver actually read.
+
+STATS_GET
+=========
+
+Get standard statistics for the interface. Note that this is not
+a re-implementation of ``ETHTOOL_GSTATS`` which exposed driver-defined
+stats.
+
+Request contents:
+
+ ======================================= ====== ==========================
+ ``ETHTOOL_A_STATS_HEADER`` nested request header
+ ``ETHTOOL_A_STATS_SRC`` u32 source of statistics
+ ``ETHTOOL_A_STATS_GROUPS`` bitset requested groups of stats
+ ======================================= ====== ==========================
+
+Kernel response contents:
+
+ +-----------------------------------+--------+--------------------------------+
+ | ``ETHTOOL_A_STATS_HEADER`` | nested | reply header |
+ +-----------------------------------+--------+--------------------------------+
+ | ``ETHTOOL_A_STATS_SRC`` | u32 | source of statistics |
+ +-----------------------------------+--------+--------------------------------+
+ | ``ETHTOOL_A_STATS_GRP`` | nested | one or more groups of stats |
+ +-+---------------------------------+--------+--------------------------------+
+ | | ``ETHTOOL_A_STATS_GRP_ID`` | u32 | group ID - ``ETHTOOL_STATS_*`` |
+ +-+---------------------------------+--------+--------------------------------+
+ | | ``ETHTOOL_A_STATS_GRP_SS_ID`` | u32 | string set ID for names |
+ +-+---------------------------------+--------+--------------------------------+
+ | | ``ETHTOOL_A_STATS_GRP_STAT`` | nested | nest containing a statistic |
+ +-+---------------------------------+--------+--------------------------------+
+ | | ``ETHTOOL_A_STATS_GRP_HIST_RX`` | nested | histogram statistic (Rx) |
+ +-+---------------------------------+--------+--------------------------------+
+ | | ``ETHTOOL_A_STATS_GRP_HIST_TX`` | nested | histogram statistic (Tx) |
+ +-+---------------------------------+--------+--------------------------------+
+
+Users specify which groups of statistics they are requesting via
+the ``ETHTOOL_A_STATS_GROUPS`` bitset. Currently defined values are:
+
+ ====================== ======== ===============================================
+ ETHTOOL_STATS_ETH_MAC eth-mac Basic IEEE 802.3 MAC statistics (30.3.1.1.*)
+ ETHTOOL_STATS_ETH_PHY eth-phy Basic IEEE 802.3 PHY statistics (30.3.2.1.*)
+ ETHTOOL_STATS_ETH_CTRL eth-ctrl Basic IEEE 802.3 MAC Ctrl statistics (30.3.3.*)
+ ETHTOOL_STATS_RMON rmon RMON (RFC 2819) statistics
+ ====================== ======== ===============================================
+
+Each group should have a corresponding ``ETHTOOL_A_STATS_GRP`` in the reply.
+``ETHTOOL_A_STATS_GRP_ID`` identifies which group's statistics the nest contains.
+``ETHTOOL_A_STATS_GRP_SS_ID`` identifies the string set ID for the names of
+the statistics in the group, if available.
+
+Statistics are added to the ``ETHTOOL_A_STATS_GRP`` nest under
+``ETHTOOL_A_STATS_GRP_STAT``. ``ETHTOOL_A_STATS_GRP_STAT`` should contain
+a single 8 byte (u64) attribute inside - the type of that attribute is
+the statistic ID and the value is the value of the statistic.
+Each group has its own interpretation of statistic IDs.
+Attribute IDs correspond to strings from the string set identified
+by ``ETHTOOL_A_STATS_GRP_SS_ID``. Complex statistics (such as RMON histogram
+entries) are also listed inside ``ETHTOOL_A_STATS_GRP`` and do not have
+a string defined in the string set.
+
+RMON "histogram" counters count the number of packets within a given size range.
+Because the RFC does not specify the ranges beyond the standard 1518 MTU, devices
+differ in the definition of buckets. For this reason the definition of packet ranges
+is left to each driver.
+
+``ETHTOOL_A_STATS_GRP_HIST_RX`` and ``ETHTOOL_A_STATS_GRP_HIST_TX`` nests
+contain the following attributes:
+
+ ================================= ====== ===================================
+ ETHTOOL_A_STATS_RMON_HIST_BKT_LOW u32 low bound of the packet size bucket
+ ETHTOOL_A_STATS_RMON_HIST_BKT_HI u32 high bound of the bucket
+ ETHTOOL_A_STATS_RMON_HIST_VAL u64 packet counter
+ ================================= ====== ===================================
+
+Low and high bounds are inclusive, for example:
+
+ ============================= ==== ====
+ RFC statistic low high
+ ============================= ==== ====
+ etherStatsPkts64Octets 0 64
+ etherStatsPkts512to1023Octets 512 1023
+ ============================= ==== ====
+
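Because both bounds are inclusive, a packet size maps to a bucket as in the sketch below. ``find_bucket`` is a hypothetical helper and the bucket list is assumed to be already decoded into ``(low, high, count)`` tuples; the actual bucket ranges are driver-defined.

```python
def find_bucket(buckets, size):
    """Return the (low, high) bounds of the bucket counting packets of `size`.

    `buckets` is a list of (low, high, count) tuples with inclusive bounds.
    Returns None if no bucket covers the size.
    """
    for low, high, _count in buckets:
        if low <= size <= high:
            return (low, high)
    return None
```

With buckets ``(0, 64)`` and ``(65, 127)``, a 64 byte packet falls in the first bucket and a 65 byte packet in the second.
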
+``ETHTOOL_A_STATS_SRC`` is optional. Similar to ``PAUSE_GET``, it takes values
+from ``enum ethtool_mac_stats_src``. If absent from the request, stats will be
+provided with an ``ETHTOOL_A_STATS_SRC`` attribute in the response equal to
+``ETHTOOL_MAC_STATS_SRC_AGGREGATE``.
+
+PHC_VCLOCKS_GET
+===============
+
+Query device PHC virtual clocks information.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_PHC_VCLOCKS_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_PHC_VCLOCKS_HEADER`` nested reply header
+ ``ETHTOOL_A_PHC_VCLOCKS_NUM`` u32 PHC virtual clocks number
+ ``ETHTOOL_A_PHC_VCLOCKS_INDEX`` s32 PHC index array
+ ==================================== ====== ==========================
+
+MODULE_GET
+==========
+
+Gets transceiver module parameters.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_MODULE_HEADER`` nested request header
+ ===================================== ====== ==========================
+
+Kernel response contents:
+
+ ====================================== ====== ==========================
+ ``ETHTOOL_A_MODULE_HEADER`` nested reply header
+ ``ETHTOOL_A_MODULE_POWER_MODE_POLICY`` u8 power mode policy
+ ``ETHTOOL_A_MODULE_POWER_MODE`` u8 operational power mode
+ ====================================== ====== ==========================
+
+The optional ``ETHTOOL_A_MODULE_POWER_MODE_POLICY`` attribute encodes the
+transceiver module power mode policy enforced by the host. The default policy
+is driver-dependent, but "auto" is the recommended default and it should be
+implemented by new drivers and drivers where conformance to a legacy behavior
+is not critical.
+
+The optional ``ETHTOOL_A_MODULE_POWER_MODE`` attribute encodes the operational
+power mode of the transceiver module. It is only reported when a module
+is plugged-in. Possible values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_module_power_mode
+
+MODULE_SET
+==========
+
+Sets transceiver module parameters.
+
+Request contents:
+
+ ====================================== ====== ==========================
+ ``ETHTOOL_A_MODULE_HEADER`` nested request header
+ ``ETHTOOL_A_MODULE_POWER_MODE_POLICY`` u8 power mode policy
+ ====================================== ====== ==========================
+
+When set, the optional ``ETHTOOL_A_MODULE_POWER_MODE_POLICY`` attribute is used
+to set the transceiver module power policy enforced by the host. Possible
+values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_module_power_mode_policy
+
+For SFF-8636 modules, low power mode is forced by the host according to table
+6-10 in revision 2.10a of the specification.
+
+For CMIS modules, low power mode is forced by the host according to table 6-12
+in revision 5.0 of the specification.
+
+PSE_GET
+=======
+
+Gets PSE attributes.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_PSE_HEADER`` nested request header
+ ===================================== ====== ==========================
+
+Kernel response contents:
+
+ ====================================== ====== =============================
+ ``ETHTOOL_A_PSE_HEADER`` nested reply header
+ ``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` u32 Operational state of the PoDL
+ PSE functions
+ ``ETHTOOL_A_PODL_PSE_PW_D_STATUS`` u32 power detection status of the
+ PoDL PSE.
+ ====================================== ====== =============================
+
+When set, the optional ``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` attribute identifies
+the operational state of the PoDL PSE functions. The operational state of the
+PSE function can be changed using the ``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL``
+action. This option corresponds to ``IEEE 802.3-2018`` 30.15.1.1.2
+aPoDLPSEAdminState. Possible values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_podl_pse_admin_state
+
+When set, the optional ``ETHTOOL_A_PODL_PSE_PW_D_STATUS`` attribute identifies
+the power detection status of the PoDL PSE. The status depends on the internal PSE
+state machine and automatic PD classification support. This option
+corresponds to ``IEEE 802.3-2018`` 30.15.1.1.3 aPoDLPSEPowerDetectionStatus.
+Possible values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_podl_pse_pw_d_status
+
+PSE_SET
+=======
+
+Sets PSE parameters.
+
+Request contents:
+
+ ====================================== ====== =============================
+ ``ETHTOOL_A_PSE_HEADER`` nested request header
+ ``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL`` u32 Control PoDL PSE Admin state
+ ====================================== ====== =============================
+
+When set, the optional ``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL`` attribute is used
+to control PoDL PSE Admin functions. This option implements
+``IEEE 802.3-2018`` 30.15.1.2.1 acPoDLPSEAdminControl. See
+``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` for supported values.
+
+RSS_GET
+=======
+
+Get indirection table, hash key and hash function info associated with an
+RSS context of an interface, similar to ``ETHTOOL_GRSSH`` ioctl request.
+
+Request contents:
+
+===================================== ====== ==========================
+ ``ETHTOOL_A_RSS_HEADER`` nested request header
+ ``ETHTOOL_A_RSS_CONTEXT`` u32 context number
+===================================== ====== ==========================
+
+Kernel response contents:
+
+===================================== ====== ==========================
+ ``ETHTOOL_A_RSS_HEADER`` nested reply header
+ ``ETHTOOL_A_RSS_HFUNC`` u32 RSS hash func
+ ``ETHTOOL_A_RSS_INDIR`` binary Indir table bytes
+ ``ETHTOOL_A_RSS_HKEY`` binary Hash key bytes
+ ``ETHTOOL_A_RSS_INPUT_XFRM`` u32 RSS input data transformation
+===================================== ====== ==========================
+
+The ``ETHTOOL_A_RSS_HFUNC`` attribute is a bitmap indicating the hash function
+being used. Currently supported options are toeplitz, xor or crc32.
+The ``ETHTOOL_A_RSS_INDIR`` attribute returns the RSS indirection table, where each byte
+indicates a queue number.
+The ``ETHTOOL_A_RSS_INPUT_XFRM`` attribute is a bitmap indicating the type of
+transformation applied to the input protocol fields before they are given to the RSS
+hfunc. The currently supported option is symmetric-xor.
+
+PLCA_GET_CFG
+============
+
+Gets the IEEE 802.3cg-2019 Clause 148 Physical Layer Collision Avoidance
+(PLCA) Reconciliation Sublayer (RS) attributes.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_PLCA_HEADER`` nested request header
+ ===================================== ====== ==========================
+
+Kernel response contents:
+
+ ====================================== ====== =============================
+ ``ETHTOOL_A_PLCA_HEADER`` nested reply header
+ ``ETHTOOL_A_PLCA_VERSION`` u16 Supported PLCA management
+ interface standard/version
+ ``ETHTOOL_A_PLCA_ENABLED`` u8 PLCA Admin State
+ ``ETHTOOL_A_PLCA_NODE_ID`` u32 PLCA unique local node ID
+ ``ETHTOOL_A_PLCA_NODE_CNT`` u32 Number of PLCA nodes on the
+ network, including the
+ coordinator
+ ``ETHTOOL_A_PLCA_TO_TMR`` u32 Transmit Opportunity Timer
+ value in bit-times (BT)
+ ``ETHTOOL_A_PLCA_BURST_CNT`` u32 Number of additional packets
+ the node is allowed to send
+ within a single TO
+ ``ETHTOOL_A_PLCA_BURST_TMR`` u32 Time to wait for the MAC to
+ transmit a new frame before
+ terminating the burst
+ ====================================== ====== =============================
+
+When set, the optional ``ETHTOOL_A_PLCA_VERSION`` attribute indicates which
+standard and version the PLCA management interface complies with. When not set,
+the interface is vendor-specific and (possibly) supplied by the driver.
+The OPEN Alliance SIG specifies a standard register map for 10BASE-T1S PHYs
+embedding the PLCA Reconciliation Sublayer. See "10BASE-T1S PLCA Management
+Registers" at https://www.opensig.org/about/specifications/.
+
+When set, the optional ``ETHTOOL_A_PLCA_ENABLED`` attribute indicates the
+administrative state of the PLCA RS. When not set, the node operates in "plain"
+CSMA/CD mode. This option corresponds to ``IEEE 802.3cg-2019`` 30.16.1.1.1
+aPLCAAdminState / 30.16.1.2.1 acPLCAAdminControl.
+
+When set, the optional ``ETHTOOL_A_PLCA_NODE_ID`` attribute indicates the
+configured local node ID of the PHY. This ID determines which transmit
+opportunity (TO) is reserved for the node to transmit into. This option
+corresponds to ``IEEE 802.3cg-2019`` 30.16.1.1.4 aPLCALocalNodeID. The valid
+range for this attribute is [0 .. 255] where 255 means "not configured".
+
+When set, the optional ``ETHTOOL_A_PLCA_NODE_CNT`` attribute indicates the
+configured maximum number of PLCA nodes on the mixing-segment. This number
+determines the total number of transmit opportunities generated during a
+PLCA cycle. This attribute is relevant only for the PLCA coordinator, which is
+the node with aPLCALocalNodeID set to 0. Follower nodes ignore this setting.
+This option corresponds to ``IEEE 802.3cg-2019`` 30.16.1.1.3
+aPLCANodeCount. The valid range for this attribute is [1 .. 255].
+
+When set, the optional ``ETHTOOL_A_PLCA_TO_TMR`` attribute indicates the
+configured value of the transmit opportunity timer in bit-times. This value
+must be set equal across all nodes sharing the medium for PLCA to work
+correctly. This option corresponds to ``IEEE 802.3cg-2019`` 30.16.1.1.5
+aPLCATransmitOpportunityTimer. The valid range for this attribute is
+[0 .. 255].
+
+When set, the optional ``ETHTOOL_A_PLCA_BURST_CNT`` attribute indicates the
+configured number of extra packets that the node is allowed to send during a
+single transmit opportunity. By default, this attribute is 0, meaning that
+the node can only send a single frame per TO. When greater than 0, the PLCA RS
+keeps the TO after any transmission, waiting for the MAC to send a new frame
+for up to aPLCABurstTimer BTs. This can only happen a number of times per PLCA
+cycle up to the value of this parameter. After that, the burst is over and the
+normal counting of TOs resumes. This option corresponds to
+``IEEE 802.3cg-2019`` 30.16.1.1.6 aPLCAMaxBurstCount. The valid range for this
+attribute is [0 .. 255].
+
+When set, the optional ``ETHTOOL_A_PLCA_BURST_TMR`` attribute indicates how
+many bit-times the PLCA RS waits for the MAC to initiate a new transmission
+when aPLCAMaxBurstCount is greater than 0. If the MAC fails to send a new
+frame within this time, the burst ends and the counting of TOs resumes.
+Otherwise, the new frame is sent as part of the current burst. This option
+corresponds to ``IEEE 802.3cg-2019`` 30.16.1.1.7 aPLCABurstTimer. The
+valid range for this attribute is [0 .. 255]. However, the value should be
+set greater than the Inter-Frame-Gap (IFG) time of the MAC (plus some margin)
+for PLCA burst mode to work as intended.
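+
The valid ranges listed above can be collected into a small user-space check.
The following is a minimal sketch, not part of the ethtool netlink API: the
dictionary keys (``node_id``, ``node_cnt`` and so on) and both helper functions
are hypothetical names chosen for illustration, and only the numeric ranges
come from the attribute descriptions.

```python
# Hypothetical helper: check a PLCA configuration against the ranges
# documented for PLCA_GET_CFG. The keys are illustrative names only,
# they are not identifiers from the ethtool netlink API.
PLCA_RANGES = {
    "node_id": (0, 255),    # 255 means "not configured"
    "node_cnt": (1, 255),   # only relevant on the coordinator (node ID 0)
    "to_tmr": (0, 255),     # bit-times; must be equal on all nodes
    "burst_cnt": (0, 255),  # 0 means a single frame per transmit opportunity
    "burst_tmr": (0, 255),  # bit-times
}


def validate_plca_cfg(cfg):
    """Return a list of problems found in cfg (empty if there are none)."""
    problems = []
    for name, value in cfg.items():
        if name not in PLCA_RANGES:
            problems.append(f"unknown attribute: {name}")
            continue
        low, high = PLCA_RANGES[name]
        if not low <= value <= high:
            problems.append(f"{name}={value} outside [{low} .. {high}]")
    return problems


def is_coordinator(cfg):
    """The coordinator is the node whose local node ID is 0."""
    return cfg.get("node_id") == 0
```

A caller might run ``validate_plca_cfg()`` before building a request and only
set the node count when ``is_coordinator()`` is true, since follower nodes
ignore that setting.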
+
+PLCA_SET_CFG
+============
+
+Sets PLCA RS parameters.
+
+Request contents:
+
+  ======================================  ======  =============================
+  ``ETHTOOL_A_PLCA_HEADER``               nested  request header
+  ``ETHTOOL_A_PLCA_ENABLED``              u8      PLCA Admin State
+  ``ETHTOOL_A_PLCA_NODE_ID``              u32     PLCA unique local node ID
+  ``ETHTOOL_A_PLCA_NODE_CNT``             u32     Number of PLCA nodes on the
+                                                  network, including the
+                                                  coordinator
+  ``ETHTOOL_A_PLCA_TO_TMR``               u32     Transmit Opportunity Timer
+                                                  value in bit-times (BT)
+  ``ETHTOOL_A_PLCA_BURST_CNT``            u32     Number of additional packets
+                                                  the node is allowed to send
+                                                  within a single TO
+  ``ETHTOOL_A_PLCA_BURST_TMR``            u32     Time to wait for the MAC to
+                                                  transmit a new frame before
+                                                  terminating the burst
+  ======================================  ======  =============================
+
+For a description of each attribute, see ``PLCA_GET_CFG``.
+
+PLCA_GET_STATUS
+===============
+
+Gets PLCA RS status information.
+
+Request contents:
+
+  =====================================  ======  ==========================
+  ``ETHTOOL_A_PLCA_HEADER``              nested  request header
+  =====================================  ======  ==========================
+
+Kernel response contents:
+
+  ======================================  ======  =============================
+  ``ETHTOOL_A_PLCA_HEADER``               nested  reply header
+  ``ETHTOOL_A_PLCA_STATUS``               u8      PLCA RS operational status
+  ======================================  ======  =============================
+
+When set, the ``ETHTOOL_A_PLCA_STATUS`` attribute indicates whether the node is
+detecting the presence of the BEACON on the network. This flag
+corresponds to ``IEEE 802.3cg-2019`` 30.16.1.1.2 aPLCAStatus.
+
+MM_GET
+======
+
+Retrieve 802.3 MAC Merge parameters.
+
+Request contents:
+
+  ====================================  ======  ==========================
+  ``ETHTOOL_A_MM_HEADER``               nested  request header
+  ====================================  ======  ==========================
+
+Kernel response contents:
+
+  =================================  ======  ===================================
+  ``ETHTOOL_A_MM_HEADER``            nested  reply header
+  ``ETHTOOL_A_MM_PMAC_ENABLED``      bool    set if RX of preemptible and SMD-V
+                                             frames is enabled
+  ``ETHTOOL_A_MM_TX_ENABLED``        bool    set if TX of preemptible frames is
+                                             administratively enabled (might be
+                                             inactive if verification failed)
+  ``ETHTOOL_A_MM_TX_ACTIVE``         bool    set if TX of preemptible frames is
+                                             operationally enabled
+  ``ETHTOOL_A_MM_TX_MIN_FRAG_SIZE``  u32     minimum size of transmitted
+                                             non-final fragments, in octets
+  ``ETHTOOL_A_MM_RX_MIN_FRAG_SIZE``  u32     minimum size of received non-final
+                                             fragments, in octets
+  ``ETHTOOL_A_MM_VERIFY_ENABLED``    bool    set if TX of SMD-V frames is
+                                             administratively enabled
+  ``ETHTOOL_A_MM_VERIFY_STATUS``     u8      state of the verification function
+  ``ETHTOOL_A_MM_VERIFY_TIME``       u32     delay between verification attempts
+  ``ETHTOOL_A_MM_MAX_VERIFY_TIME``   u32     maximum verification interval
+                                             supported by device
+  ``ETHTOOL_A_MM_STATS``             nested  IEEE 802.3-2018 subclause 30.14.1
+                                             oMACMergeEntity statistics counters
+  =================================  ======  ===================================
+
+The attributes are populated by the device driver through the following
+structure:
+
+.. kernel-doc:: include/linux/ethtool.h
+ :identifiers: ethtool_mm_state
+
+The ``ETHTOOL_A_MM_VERIFY_STATUS`` will report one of the values from
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_mm_verify_status
+
+If ``ETHTOOL_A_MM_VERIFY_ENABLED`` was passed as false in the ``MM_SET``
+command, ``ETHTOOL_A_MM_VERIFY_STATUS`` will report either
+``ETHTOOL_MM_VERIFY_STATUS_INITIAL`` or ``ETHTOOL_MM_VERIFY_STATUS_DISABLED``,
+otherwise it should report one of the other states.
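+
The relationship between ``ETHTOOL_A_MM_VERIFY_ENABLED`` and the reported
verification status can be expressed as a short consistency check. This is a
sketch under assumptions: the member names are modelled on
``enum ethtool_mm_verify_status``, but the members other than ``INITIAL`` and
``DISABLED``, as well as all numeric values, are assumed here and should be
checked against include/uapi/linux/ethtool.h. ``status_consistent()`` is a
hypothetical helper.

```python
from enum import IntEnum


class MmVerifyStatus(IntEnum):
    # Modelled on enum ethtool_mm_verify_status; the numeric values and
    # the members other than INITIAL and DISABLED are assumptions.
    UNKNOWN = 0
    INITIAL = 1
    VERIFYING = 2
    SUCCEEDED = 3
    FAILED = 4
    DISABLED = 5


# States expected when verification was disabled through MM_SET.
VERIFY_OFF_STATES = {MmVerifyStatus.INITIAL, MmVerifyStatus.DISABLED}


def status_consistent(verify_enabled, status):
    """Check a reported status against the rule described above."""
    if verify_enabled:
        # With verification enabled, one of the other states is expected.
        return status not in VERIFY_OFF_STATES
    return status in VERIFY_OFF_STATES
```

User space could use such a check to flag a driver that reports, for example,
``SUCCEEDED`` although verification was never enabled.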
+
+It is recommended that drivers start with the pMAC disabled, and enable it upon
+user space request. It is also recommended that user space does not depend upon
+the default values from ``ETHTOOL_MSG_MM_GET`` requests.
+
+``ETHTOOL_A_MM_STATS`` are reported if ``ETHTOOL_FLAG_STATS`` was set in
+``ETHTOOL_A_HEADER_FLAGS``. The attribute will be empty if the driver did not
+report any statistics. Drivers fill in the statistics in the following
+structure:
+
+.. kernel-doc:: include/linux/ethtool.h
+ :identifiers: ethtool_mm_stats
+
+MM_SET
+======
+
+Modifies the configuration of the 802.3 MAC Merge layer.
+
+Request contents:
+
+  =================================  ======  ==========================
+  ``ETHTOOL_A_MM_VERIFY_TIME``       u32     see MM_GET description
+  ``ETHTOOL_A_MM_VERIFY_ENABLED``    bool    see MM_GET description
+  ``ETHTOOL_A_MM_TX_ENABLED``        bool    see MM_GET description
+  ``ETHTOOL_A_MM_PMAC_ENABLED``      bool    see MM_GET description
+  ``ETHTOOL_A_MM_TX_MIN_FRAG_SIZE``  u32     see MM_GET description
+  =================================  ======  ==========================
+
+The attributes are propagated to the driver through the following structure:
+
+.. kernel-doc:: include/linux/ethtool.h
+ :identifiers: ethtool_mm_cfg
+
Request translation
===================
@@ -1187,11 +2081,11 @@ are netlink only.
``ETHTOOL_GET_DUMP_FLAG`` n/a
``ETHTOOL_GET_DUMP_DATA`` n/a
``ETHTOOL_GET_TS_INFO`` ``ETHTOOL_MSG_TSINFO_GET``
- ``ETHTOOL_GMODULEINFO`` n/a
- ``ETHTOOL_GMODULEEEPROM`` n/a
+ ``ETHTOOL_GMODULEINFO`` ``ETHTOOL_MSG_MODULE_EEPROM_GET``
+ ``ETHTOOL_GMODULEEEPROM`` ``ETHTOOL_MSG_MODULE_EEPROM_GET``
``ETHTOOL_GEEE`` ``ETHTOOL_MSG_EEE_GET``
``ETHTOOL_SEEE`` ``ETHTOOL_MSG_EEE_SET``
- ``ETHTOOL_GRSSH`` n/a
+ ``ETHTOOL_GRSSH`` ``ETHTOOL_MSG_RSS_GET``
``ETHTOOL_SRSSH`` n/a
``ETHTOOL_GTUNABLE`` n/a
``ETHTOOL_STUNABLE`` n/a
@@ -1203,8 +2097,17 @@ are netlink only.
``ETHTOOL_MSG_LINKMODES_SET``
``ETHTOOL_PHY_GTUNABLE`` n/a
``ETHTOOL_PHY_STUNABLE`` n/a
- ``ETHTOOL_GFECPARAM`` n/a
- ``ETHTOOL_SFECPARAM`` n/a
- n/a ''ETHTOOL_MSG_CABLE_TEST_ACT''
- n/a ''ETHTOOL_MSG_CABLE_TEST_TDR_ACT''
+ ``ETHTOOL_GFECPARAM`` ``ETHTOOL_MSG_FEC_GET``
+ ``ETHTOOL_SFECPARAM`` ``ETHTOOL_MSG_FEC_SET``
+ n/a ``ETHTOOL_MSG_CABLE_TEST_ACT``
+ n/a ``ETHTOOL_MSG_CABLE_TEST_TDR_ACT``
+ n/a ``ETHTOOL_MSG_TUNNEL_INFO_GET``
+ n/a ``ETHTOOL_MSG_PHC_VCLOCKS_GET``
+ n/a ``ETHTOOL_MSG_MODULE_GET``
+ n/a ``ETHTOOL_MSG_MODULE_SET``
+ n/a ``ETHTOOL_MSG_PLCA_GET_CFG``
+ n/a ``ETHTOOL_MSG_PLCA_SET_CFG``
+ n/a ``ETHTOOL_MSG_PLCA_GET_STATUS``
+ n/a ``ETHTOOL_MSG_MM_GET``
+ n/a ``ETHTOOL_MSG_MM_SET``
=================================== =====================================
diff --git a/Documentation/networking/filter.rst b/Documentation/networking/filter.rst
index a1d3e192b9fa..7d8c5380492f 100644
--- a/Documentation/networking/filter.rst
+++ b/Documentation/networking/filter.rst
@@ -1,9 +1,18 @@
.. SPDX-License-Identifier: GPL-2.0
+.. _networking-filter:
+
=======================================================
Linux Socket Filtering aka Berkeley Packet Filter (BPF)
=======================================================
+Notice
+------
+
+This file used to document the eBPF format and mechanisms even when not
+related to socket filtering. See ../bpf/index.rst for more details
+on eBPF.
+
Introduction
------------
@@ -296,7 +305,7 @@ Possible BPF extensions are shown in the following table:
vlan_tci skb_vlan_tag_get(skb)
vlan_avail skb_vlan_tag_present(skb)
vlan_tpid skb->vlan_proto
- rand prandom_u32()
+ rand get_random_u32()
=================================== =================================================
These extensions can also be prefixed with '#'.
@@ -318,14 +327,7 @@ Examples for low-level BPF:
ret #-1
drop: ret #0
-**(Accelerated) VLAN w/ id 10**::
-
- ld vlan_tci
- jneq #10, drop
- ret #-1
- drop: ret #0
-
-**icmp random packet sampling, 1 in 4**:
+**icmp random packet sampling, 1 in 4**::
ldh [12]
jne #0x800, drop
@@ -356,6 +358,22 @@ Examples for low-level BPF:
bad: ret #0 /* SECCOMP_RET_KILL_THREAD */
good: ret #0x7fff0000 /* SECCOMP_RET_ALLOW */
+Examples for low-level BPF extensions:
+
+**Packet for interface index 13**::
+
+ ld ifidx
+ jneq #13, drop
+ ret #-1
+ drop: ret #0
+
+**(Accelerated) VLAN w/ id 10**::
+
+ ld vlan_tci
+ jneq #10, drop
+ ret #-1
+ drop: ret #0
+
The above example code can be placed into a file (here called "foo"), and
then be passed to the bpf_asm tool for generating opcodes, output that xt_bpf
and cls_bpf understands and can directly be loaded with. Example with above
@@ -606,15 +624,11 @@ format with similar underlying principles from BPF described in previous
paragraphs is being used. However, the instruction set format is modelled
closer to the underlying architecture to mimic native instruction sets, so
that a better performance can be achieved (more details later). This new
-ISA is called 'eBPF' or 'internal BPF' interchangeably. (Note: eBPF which
+ISA is called eBPF. See ../bpf/index.rst for details. (Note: eBPF which
originates from [e]xtended BPF is not the same as BPF extensions! While
eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading'
of BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.)
-It is designed to be JITed with one to one mapping, which can also open up
-the possibility for GCC/LLVM compilers to generate optimized eBPF code through
-an eBPF backend that performs almost as fast as natively compiled code.
-
The new instruction set was originally designed with the possible goal in
mind to write programs in "restricted C" and compile into eBPF with a optional
GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with
@@ -627,8 +641,8 @@ extension, PTP dissector/classifier, and much more. They are all internally
converted by the kernel into the new instruction set representation and run
in the eBPF interpreter. For in-kernel handlers, this all works transparently
by using bpf_prog_create() for setting up the filter, resp.
-bpf_prog_destroy() for destroying it. The macro
-BPF_PROG_RUN(filter, ctx) transparently invokes eBPF interpreter or JITed
+bpf_prog_destroy() for destroying it. The function
+bpf_prog_run(filter, ctx) transparently invokes eBPF interpreter or JITed
code to run the filter. 'filter' is a pointer to struct bpf_prog that we
got from bpf_prog_create(), and 'ctx' the given context (e.g.
skb pointer). All constraints and restrictions from bpf_check_classic() apply
@@ -636,994 +650,14 @@ before a conversion to the new layout is being done behind the scenes!
Currently, the classic BPF format is being used for JITing on most
32-bit architectures, whereas x86-64, aarch64, s390x, powerpc64,
-sparc64, arm32, riscv64, riscv32 perform JIT compilation from eBPF
-instruction set.
-
-Some core changes of the new internal format:
-
-- Number of registers increase from 2 to 10:
-
- The old format had two registers A and X, and a hidden frame pointer. The
- new layout extends this to be 10 internal registers and a read-only frame
- pointer. Since 64-bit CPUs are passing arguments to functions via registers
- the number of args from eBPF program to in-kernel function is restricted
- to 5 and one register is used to accept return value from an in-kernel
- function. Natively, x86_64 passes first 6 arguments in registers, aarch64/
- sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved
- registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers.
-
- Therefore, eBPF calling convention is defined as:
-
- * R0 - return value from in-kernel function, and exit value for eBPF program
- * R1 - R5 - arguments from eBPF program to in-kernel function
- * R6 - R9 - callee saved registers that in-kernel function will preserve
- * R10 - read-only frame pointer to access stack
-
- Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64,
- etc, and eBPF calling convention maps directly to ABIs used by the kernel on
- 64-bit architectures.
-
- On 32-bit architectures JIT may map programs that use only 32-bit arithmetic
- and may let more complex programs to be interpreted.
-
- R0 - R5 are scratch registers and eBPF program needs spill/fill them if
- necessary across calls. Note that there is only one eBPF program (== one
- eBPF main routine) and it cannot call other eBPF functions, it can only
- call predefined in-kernel functions, though.
-
-- Register width increases from 32-bit to 64-bit:
-
- Still, the semantics of the original 32-bit ALU operations are preserved
- via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower
- subregisters that zero-extend into 64-bit if they are being written to.
- That behavior maps directly to x86_64 and arm64 subregister definition, but
- makes other JITs more difficult.
-
- 32-bit architectures run 64-bit internal BPF programs via interpreter.
- Their JITs may convert BPF programs that only use 32-bit subregisters into
- native instruction set and let the rest being interpreted.
-
- Operation is 64-bit, because on 64-bit architectures, pointers are also
- 64-bit wide, and we want to pass 64-bit values in/out of kernel functions,
- so 32-bit eBPF registers would otherwise require to define register-pair
- ABI, thus, there won't be able to use a direct eBPF register to HW register
- mapping and JIT would need to do combine/split/move operations for every
- register in and out of the function, which is complex, bug prone and slow.
- Another reason is the use of atomic 64-bit counters.
-
-- Conditional jt/jf targets replaced with jt/fall-through:
-
- While the original design has constructs such as ``if (cond) jump_true;
- else jump_false;``, they are being replaced into alternative constructs like
- ``if (cond) jump_true; /* else fall-through */``.
-
-- Introduces bpf_call insn and register passing convention for zero overhead
- calls from/to other kernel functions:
-
- Before an in-kernel function call, the internal BPF program needs to
- place function arguments into R1 to R5 registers to satisfy calling
- convention, then the interpreter will take them from registers and pass
- to in-kernel function. If R1 - R5 registers are mapped to CPU registers
- that are used for argument passing on given architecture, the JIT compiler
- doesn't need to emit extra moves. Function arguments will be in the correct
- registers and BPF_CALL instruction will be JITed as single 'call' HW
- instruction. This calling convention was picked to cover common call
- situations without performance penalty.
-
- After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has
- a return value of the function. Since R6 - R9 are callee saved, their state
- is preserved across the call.
-
- For example, consider three C functions::
-
- u64 f1() { return (*_f2)(1); }
- u64 f2(u64 a) { return f3(a + 1, a); }
- u64 f3(u64 a, u64 b) { return a - b; }
-
- GCC can compile f1, f3 into x86_64::
-
- f1:
- movl $1, %edi
- movq _f2(%rip), %rax
- jmp *%rax
- f3:
- movq %rdi, %rax
- subq %rsi, %rax
- ret
-
- Function f2 in eBPF may look like::
-
- f2:
- bpf_mov R2, R1
- bpf_add R1, 1
- bpf_call f3
- bpf_exit
-
- If f2 is JITed and the pointer stored to ``_f2``. The calls f1 -> f2 -> f3 and
- returns will be seamless. Without JIT, __bpf_prog_run() interpreter needs to
- be used to call into f2.
-
- For practical reasons all eBPF programs have only one argument 'ctx' which is
- already placed into R1 (e.g. on __bpf_prog_run() startup) and the programs
- can call kernel functions with up to 5 arguments. Calls with 6 or more arguments
- are currently not supported, but these restrictions can be lifted if necessary
- in the future.
-
- On 64-bit architectures all register map to HW registers one to one. For
- example, x86_64 JIT compiler can map them as ...
-
- ::
-
- R0 - rax
- R1 - rdi
- R2 - rsi
- R3 - rdx
- R4 - rcx
- R5 - r8
- R6 - rbx
- R7 - r13
- R8 - r14
- R9 - r15
- R10 - rbp
-
- ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing
- and rbx, r12 - r15 are callee saved.
-
- Then the following internal BPF pseudo-program::
-
- bpf_mov R6, R1 /* save ctx */
- bpf_mov R2, 2
- bpf_mov R3, 3
- bpf_mov R4, 4
- bpf_mov R5, 5
- bpf_call foo
- bpf_mov R7, R0 /* save foo() return value */
- bpf_mov R1, R6 /* restore ctx for next call */
- bpf_mov R2, 6
- bpf_mov R3, 7
- bpf_mov R4, 8
- bpf_mov R5, 9
- bpf_call bar
- bpf_add R0, R7
- bpf_exit
-
- After JIT to x86_64 may look like::
-
- push %rbp
- mov %rsp,%rbp
- sub $0x228,%rsp
- mov %rbx,-0x228(%rbp)
- mov %r13,-0x220(%rbp)
- mov %rdi,%rbx
- mov $0x2,%esi
- mov $0x3,%edx
- mov $0x4,%ecx
- mov $0x5,%r8d
- callq foo
- mov %rax,%r13
- mov %rbx,%rdi
- mov $0x6,%esi
- mov $0x7,%edx
- mov $0x8,%ecx
- mov $0x9,%r8d
- callq bar
- add %r13,%rax
- mov -0x228(%rbp),%rbx
- mov -0x220(%rbp),%r13
- leaveq
- retq
-
- Which is in this example equivalent in C to::
-
- u64 bpf_filter(u64 ctx)
- {
- return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9);
- }
-
- In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64
- arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper
- registers and place their return value into ``%rax`` which is R0 in eBPF.
- Prologue and epilogue are emitted by JIT and are implicit in the
- interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve
- them across the calls as defined by calling convention.
-
- For example the following program is invalid::
-
- bpf_mov R1, 1
- bpf_call foo
- bpf_mov R0, R1
- bpf_exit
-
- After the call the registers R1-R5 contain junk values and cannot be read.
- An in-kernel eBPF verifier is used to validate internal BPF programs.
-
-Also in the new design, eBPF is limited to 4096 insns, which means that any
-program will terminate quickly and will only call a fixed number of kernel
-functions. Original BPF and the new format are two operand instructions,
-which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT.
-
-The input context pointer for invoking the interpreter function is generic,
-its content is defined by a specific use case. For seccomp register R1 points
-to seccomp_data, for converted BPF filters R1 points to a skb.
-
-A program, that is translated internally consists of the following elements::
-
- op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32
-
-So far 87 internal BPF instructions were implemented. 8-bit 'op' opcode field
-has room for new instructions. Some of them may use 16/24/32 byte encoding. New
-instructions must be multiple of 8 bytes to preserve backward compatibility.
-
-Internal BPF is a general purpose RISC instruction set. Not every register and
-every instruction are used during translation from original BPF to new format.
-For example, socket filters are not using ``exclusive add`` instruction, but
-tracing filters may do to maintain counters of events, for example. Register R9
-is not used by socket filters either, but more complex filters may be running
-out of registers and would have to resort to spill/fill to stack.
-
-Internal BPF can be used as a generic assembler for last step performance
-optimizations, socket filters and seccomp are using it as assembler. Tracing
-filters may use it as assembler to generate code from kernel. In kernel usage
-may not be bounded by security considerations, since generated internal BPF code
-may be optimizing internal code path and not being exposed to the user space.
-Safety of internal BPF can come from a verifier (TBD). In such use cases as
-described, it may be used as safe instruction set.
-
-Just like the original BPF, the new format runs within a controlled environment,
-is deterministic and the kernel can easily prove that. The safety of the program
-can be determined in two steps: first step does depth-first-search to disallow
-loops and other CFG validation; second step starts from the first insn and
-descends all possible paths. It simulates execution of every insn and observes
-the state change of registers and stack.
-
-eBPF opcode encoding
---------------------
-
-eBPF is reusing most of the opcode encoding from classic to simplify conversion
-of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code'
-field is divided into three parts::
-
- +----------------+--------+--------------------+
- | 4 bits | 1 bit | 3 bits |
- | operation code | source | instruction class |
- +----------------+--------+--------------------+
- (MSB) (LSB)
-
-Three LSB bits store instruction class which is one of:
-
- =================== ===============
- Classic BPF classes eBPF classes
- =================== ===============
- BPF_LD 0x00 BPF_LD 0x00
- BPF_LDX 0x01 BPF_LDX 0x01
- BPF_ST 0x02 BPF_ST 0x02
- BPF_STX 0x03 BPF_STX 0x03
- BPF_ALU 0x04 BPF_ALU 0x04
- BPF_JMP 0x05 BPF_JMP 0x05
- BPF_RET 0x06 BPF_JMP32 0x06
- BPF_MISC 0x07 BPF_ALU64 0x07
- =================== ===============
-
-When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ...
-
- ::
-
- BPF_K 0x00
- BPF_X 0x08
-
- * in classic BPF, this means::
-
- BPF_SRC(code) == BPF_X - use register X as source operand
- BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
-
- * in eBPF, this means::
-
- BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
- BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
-
-... and four MSB bits store operation code.
-
-If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of::
-
- BPF_ADD 0x00
- BPF_SUB 0x10
- BPF_MUL 0x20
- BPF_DIV 0x30
- BPF_OR 0x40
- BPF_AND 0x50
- BPF_LSH 0x60
- BPF_RSH 0x70
- BPF_NEG 0x80
- BPF_MOD 0x90
- BPF_XOR 0xa0
- BPF_MOV 0xb0 /* eBPF only: mov reg to reg */
- BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */
- BPF_END 0xd0 /* eBPF only: endianness conversion */
-
-If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of::
-
- BPF_JA 0x00 /* BPF_JMP only */
- BPF_JEQ 0x10
- BPF_JGT 0x20
- BPF_JGE 0x30
- BPF_JSET 0x40
- BPF_JNE 0x50 /* eBPF only: jump != */
- BPF_JSGT 0x60 /* eBPF only: signed '>' */
- BPF_JSGE 0x70 /* eBPF only: signed '>=' */
- BPF_CALL 0x80 /* eBPF BPF_JMP only: function call */
- BPF_EXIT 0x90 /* eBPF BPF_JMP only: function return */
- BPF_JLT 0xa0 /* eBPF only: unsigned '<' */
- BPF_JLE 0xb0 /* eBPF only: unsigned '<=' */
- BPF_JSLT 0xc0 /* eBPF only: signed '<' */
- BPF_JSLE 0xd0 /* eBPF only: signed '<=' */
-
-So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF
-and eBPF. There are only two registers in classic BPF, so it means A += X.
-In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly,
-BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous
-src_reg = (u32) src_reg ^ (u32) imm32 in eBPF.
-
-Classic BPF is using BPF_MISC class to represent A = X and X = A moves.
-eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no
-BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean
-exactly the same operations as BPF_ALU, but with 64-bit wide operands
-instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.:
-dst_reg = dst_reg + src_reg
-
-Classic BPF wastes the whole BPF_RET class to represent a single ``ret``
-operation. Classic BPF_RET | BPF_K means copy imm32 into return register
-and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT
-in eBPF means function exit only. The eBPF program needs to store return
-value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as
-BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide
-operands for the comparisons instead.
-
-For load and store instructions the 8-bit 'code' field is divided as::
-
- +--------+--------+-------------------+
- | 3 bits | 2 bits | 3 bits |
- | mode | size | instruction class |
- +--------+--------+-------------------+
- (MSB) (LSB)
-
-Size modifier is one of ...
-
-::
-
- BPF_W 0x00 /* word */
- BPF_H 0x08 /* half word */
- BPF_B 0x10 /* byte */
- BPF_DW 0x18 /* eBPF only, double word */
-
-... which encodes size of load/store operation::
-
- B - 1 byte
- H - 2 byte
- W - 4 byte
- DW - 8 byte (eBPF only)
-
-Mode modifier is one of::
-
- BPF_IMM 0x00 /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
- BPF_ABS 0x20
- BPF_IND 0x40
- BPF_MEM 0x60
- BPF_LEN 0x80 /* classic BPF only, reserved in eBPF */
- BPF_MSH 0xa0 /* classic BPF only, reserved in eBPF */
- BPF_XADD 0xc0 /* eBPF only, exclusive add */
-
-eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and
-(BPF_IND | <size> | BPF_LD) which are used to access packet data.
-
-They had to be carried over from classic to have strong performance of
-socket filters running in eBPF interpreter. These instructions can only
-be used when interpreter context is a pointer to ``struct sk_buff`` and
-have seven implicit operands. Register R6 is an implicit input that must
-contain pointer to sk_buff. Register R0 is an implicit output which contains
-the data fetched from the packet. Registers R1-R5 are scratch registers
-and must not be used to store the data across BPF_ABS | BPF_LD or
-BPF_IND | BPF_LD instructions.
-
-These instructions have implicit program exit condition as well. When
-eBPF program is trying to access the data beyond the packet boundary,
-the interpreter will abort the execution of the program. JIT compilers
-therefore must preserve this property. src_reg and imm32 fields are
-explicit inputs to these instructions.
-
-For example::
-
- BPF_IND | BPF_W | BPF_LD means:
-
- R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32))
- and R1 - R5 were scratched.
-
-Unlike classic BPF instruction set, eBPF has generic load/store operations::
-
- BPF_MEM | <size> | BPF_STX: *(size *) (dst_reg + off) = src_reg
- BPF_MEM | <size> | BPF_ST: *(size *) (dst_reg + off) = imm32
- BPF_MEM | <size> | BPF_LDX: dst_reg = *(size *) (src_reg + off)
- BPF_XADD | BPF_W | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg
- BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg
-
-Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
-2 byte atomic increments are not supported.
-
-eBPF has one 16-byte instruction: BPF_LD | BPF_DW | BPF_IMM which consists
-of two consecutive ``struct bpf_insn`` 8-byte blocks and interpreted as single
-instruction that loads 64-bit immediate value into a dst_reg.
-Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads
-32-bit immediate value into a register.
-
-eBPF verifier
--------------
-The safety of the eBPF program is determined in two steps.
-
-First step does DAG check to disallow loops and other CFG validation.
-In particular it will detect programs that have unreachable instructions.
-(though classic BPF checker allows them)
-
-Second step starts from the first insn and descends all possible paths.
-It simulates execution of every insn and observes the state change of
-registers and stack.
-
-At the start of the program the register R1 contains a pointer to context
-and has type PTR_TO_CTX.
-If verifier sees an insn that does R2=R1, then R2 has now type
-PTR_TO_CTX as well and can be used on the right hand side of expression.
-If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=SCALAR_VALUE,
-since addition of two valid pointers makes invalid pointer.
-(In 'secure' mode verifier will reject any type of pointer arithmetic to make
-sure that kernel addresses don't leak to unprivileged users)
-
-If register was never written to, it's not readable::
-
- bpf_mov R0 = R2
- bpf_exit
-
-will be rejected, since R2 is unreadable at the start of the program.
-
-After kernel function call, R1-R5 are reset to unreadable and
-R0 has a return type of the function.
-
-Since R6-R9 are callee saved, their state is preserved across the call.
-
-::
-
- bpf_mov R6 = 1
- bpf_call foo
- bpf_mov R0 = R6
- bpf_exit
-
-is a correct program. If there was R1 instead of R6, it would have
-been rejected.
-
-load/store instructions are allowed only with registers of valid types, which
-are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked.
-For example::
-
- bpf_mov R1 = 1
- bpf_mov R2 = 2
- bpf_xadd *(u32 *)(R1 + 3) += R2
- bpf_exit
-
-will be rejected, since R1 doesn't have a valid pointer type at the time of
-execution of instruction bpf_xadd.
-
-At the start R1 type is PTR_TO_CTX (a pointer to generic ``struct bpf_context``)
-A callback is used to customize verifier to restrict eBPF program access to only
-certain fields within ctx structure with specified size and alignment.
-
-For example, the following insn::
-
- bpf_ld R0 = *(u32 *)(R6 + 8)
-
-intends to load a word from address R6 + 8 and store it into R0
-If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know
-that offset 8 of size 4 bytes can be accessed for reading, otherwise
-the verifier will reject the program.
-If R6=PTR_TO_STACK, then access should be aligned and be within
-stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8,
-so it will fail verification, since it's out of bounds.
-
-The verifier will allow eBPF program to read data from stack only after
-it wrote into it.
-
-Classic BPF verifier does similar check with M[0-15] memory slots.
-For example::
-
- bpf_ld R0 = *(u32 *)(R10 - 4)
- bpf_exit
-
-is invalid program.
-Though R10 is correct read-only register and has type PTR_TO_STACK
-and R10 - 4 is within stack bounds, there were no stores into that location.
-
-Pointer register spill/fill is tracked as well, since four (R6-R9)
-callee saved registers may not be enough for some programs.
-
-Allowed function calls are customized with bpf_verifier_ops->get_func_proto()
-The eBPF verifier will check that registers match argument constraints.
-After the call register R0 will be set to return type of the function.
-
-Function calls is a main mechanism to extend functionality of eBPF programs.
-Socket filters may let programs to call one set of functions, whereas tracing
-filters may allow completely different set.
-
-If a function made accessible to eBPF program, it needs to be thought through
-from safety point of view. The verifier will guarantee that the function is
-called with valid arguments.
-
-seccomp vs socket filters have different security restrictions for classic BPF.
-Seccomp solves this by two stage verifier: classic BPF verifier is followed
-by seccomp verifier. In case of eBPF one configurable verifier is shared for
-all use cases.
-
-See details of eBPF verifier in kernel/bpf/verifier.c
-
-Register value tracking
------------------------
-In order to determine the safety of an eBPF program, the verifier must track
-the range of possible values in each register and also in each stack slot.
-This is done with ``struct bpf_reg_state``, defined in include/linux/
-bpf_verifier.h, which unifies tracking of scalar and pointer values. Each
-register state has a type, which is either NOT_INIT (the register has not been
-written to), SCALAR_VALUE (some value which is not usable as a pointer), or a
-pointer type. The types of pointers describe their base, as follows:
-
-
- PTR_TO_CTX
- Pointer to bpf_context.
- CONST_PTR_TO_MAP
- Pointer to struct bpf_map. "Const" because arithmetic
- on these pointers is forbidden.
- PTR_TO_MAP_VALUE
- Pointer to the value stored in a map element.
- PTR_TO_MAP_VALUE_OR_NULL
- Either a pointer to a map value, or NULL; map accesses
- (see section 'eBPF maps', below) return this type,
- which becomes a PTR_TO_MAP_VALUE when checked != NULL.
- Arithmetic on these pointers is forbidden.
- PTR_TO_STACK
- Frame pointer.
- PTR_TO_PACKET
- skb->data.
- PTR_TO_PACKET_END
- skb->data + headlen; arithmetic forbidden.
- PTR_TO_SOCKET
- Pointer to struct bpf_sock_ops, implicitly refcounted.
- PTR_TO_SOCKET_OR_NULL
- Either a pointer to a socket, or NULL; socket lookup
- returns this type, which becomes a PTR_TO_SOCKET when
- checked != NULL. PTR_TO_SOCKET is reference-counted,
- so programs must release the reference through the
- socket release function before the end of the program.
- Arithmetic on these pointers is forbidden.
-
-However, a pointer may be offset from this base (as a result of pointer
-arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable
-offset'. The former is used when an exactly-known value (e.g. an immediate
-operand) is added to a pointer, while the latter is used for values which are
-not exactly known. The variable offset is also used in SCALAR_VALUEs, to track
-the range of possible values in the register.
-
-The verifier's knowledge about the variable offset consists of:
-
-* minimum and maximum values as unsigned
-* minimum and maximum values as signed
-
-* knowledge of the values of individual bits, in the form of a 'tnum': a u64
- 'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown;
- 1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both
- mask and value; no bit should ever be 1 in both. For example, if a byte is read
- into a register from memory, the register's top 56 bits are known zero, while
- the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we
- then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0;
- 0x1ff), because of potential carries.
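
The tnum example above can be reproduced with a small stand-alone model. This is a minimal sketch under stated assumptions, not the kernel implementation: the ``struct tnum`` layout follows the description above (a 'value' and a 'mask'), while the helper names ``tnum_const``, ``tnum_or`` and ``tnum_add`` and their bodies are written here for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model: 'value' holds bits known to be 1, 'mask' holds unknown bits. */
struct tnum {
	uint64_t value;
	uint64_t mask;
};

/* Invariant from the text: no bit may be 1 in both value and mask. */
static void tnum_check(struct tnum t)
{
	assert((t.value & t.mask) == 0);
}

/* A fully known constant has an empty mask. */
static struct tnum tnum_const(uint64_t c)
{
	return (struct tnum){ .value = c, .mask = 0 };
}

/* OR: a bit known to be 1 in either input is known to be 1 in the result;
 * any other bit is unknown if it is unknown in either input. */
static struct tnum tnum_or(struct tnum a, struct tnum b)
{
	uint64_t v, mu;

	tnum_check(a);
	tnum_check(b);
	v = a.value | b.value;
	mu = a.mask | b.mask;
	return (struct tnum){ .value = v, .mask = mu & ~v };
}

/* ADD: besides the unknown input bits, every bit that a carry out of an
 * unknown bit could reach must also be treated as unknown. */
static struct tnum tnum_add(struct tnum a, struct tnum b)
{
	uint64_t sm, sv, sigma, chi, mu;

	tnum_check(a);
	tnum_check(b);
	sm = a.mask + b.mask;       /* sum of the unknown parts */
	sv = a.value + b.value;     /* sum of the known parts */
	sigma = sm + sv;            /* largest possible sum */
	chi = sigma ^ sv;           /* bits that differ between min and max sum */
	mu = chi | a.mask | b.mask; /* all bits that may vary */
	return (struct tnum){ .value = sv & ~mu, .mask = mu };
}
```

Starting from a byte load, (0x0; 0xff), ``tnum_or()`` with the constant 0x40 yields (0x40; 0xbf), and ``tnum_add()`` with the constant 1 then yields (0x0; 0x1ff), matching the values quoted in the text.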
-
-Besides arithmetic, the register state can also be updated by conditional
-branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch
-it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false'
-branch it will have a umax_value of 8. A signed compare (with BPF_JSGT or
-BPF_JSGE) would instead update the signed minimum/maximum values. Information
-from the signed and unsigned bounds can be combined; for instance if a value is
-first tested < 8 and then tested s> 4, the verifier will conclude that the value
-is also > 4 and s< 8, since the bounds prevent crossing the sign boundary.
-
-PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all
-pointers sharing that same variable offset. This is important for packet range
-checks: after adding a variable to a packet pointer register A, if you then copy
-it to another register B and then add a constant 4 to A, both registers will
-share the same 'id' but the A will have a fixed offset of +4. Then if A is
-bounds-checked and found to be less than a PTR_TO_PACKET_END, the register B is
-now known to have a safe range of at least 4 bytes. See 'Direct packet access',
-below, for more on PTR_TO_PACKET ranges.
-
-The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of
-the pointer returned from a map lookup. This means that when one copy is
-checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs.
-
-As well as range-checking, the tracked information is also used for enforcing
-alignment of pointer accesses. For instance, on most systems the packet pointer
-is 2 bytes after a 4-byte alignment. If a program adds 14 bytes to that to jump
-over the Ethernet header, then reads IHL and adds (IHL * 4), the resulting
-pointer will have a variable offset known to be 4n+2 for some n, so adding the 2
-bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through
-that pointer are safe.
-
-The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common
-to all copies of the pointer returned from a socket lookup. This has similar
-behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but
-it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly
-represents a reference to the corresponding ``struct sock``. To ensure that the
-reference is not leaked, it is imperative to NULL-check the reference and, in
-the non-NULL case, pass the valid reference to the socket release function.
-
-Direct packet access
---------------------
-In cls_bpf and act_bpf programs the verifier allows direct access to the packet
-data via skb->data and skb->data_end pointers.
-Ex::
-
- 1: r4 = *(u32 *)(r1 +80) /* load skb->data_end */
- 2: r3 = *(u32 *)(r1 +76) /* load skb->data */
- 3: r5 = r3
- 4: r5 += 14
- 5: if r5 > r4 goto pc+16
- R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
- 6: r0 = *(u16 *)(r3 +12) /* access 12 and 13 bytes of the packet */
-
-This 2-byte load from the packet is safe to do, since the program author
-did check ``if (skb->data + 14 > skb->data_end) goto err`` at insn #5 which
-means that in the fall-through case the register R3 (which points to skb->data)
-has at least 14 directly accessible bytes. The verifier marks it
-as R3=pkt(id=0,off=0,r=14).
-id=0 means that no additional variables were added to the register.
-off=0 means that no additional constants were added.
-r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok.
-Note that R5 is marked as R5=pkt(id=0,off=14,r=14). It also points
-to the packet data, but constant 14 was added to the register, so
-it now points to ``skb->data + 14`` and accessible range is [R5, R5 + 14 - 14)
-which is zero bytes.
-
-More complex packet access may look like::
-
-
- R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
- 6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */
- 7: r4 = *(u8 *)(r3 +12)
- 8: r4 *= 14
- 9: r3 = *(u32 *)(r1 +76) /* load skb->data */
- 10: r3 += r4
- 11: r2 = r1
- 12: r2 <<= 48
- 13: r2 >>= 48
- 14: r3 += r2
- 15: r2 = r3
- 16: r2 += 8
- 17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */
- 18: if r2 > r1 goto pc+2
- R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp
- 19: r1 = *(u8 *)(r3 +4)
-
-The state of the register R3 is R3=pkt(id=2,off=0,r=8).
-id=2 means that two ``r3 += rX`` instructions were seen, so r3 points to some
-offset within a packet and since the program author did
-``if (r3 + 8 > r1) goto err`` at insn #18, the safe range is [R3, R3 + 8).
-The verifier only allows 'add'/'sub' operations on packet registers. Any other
-operation will set the register state to 'SCALAR_VALUE' and it won't be
-available for direct packet access.
-
-Operation ``r3 += rX`` may overflow and become less than original skb->data,
-therefore the verifier has to prevent that. So when it sees ``r3 += rX``
-instruction and rX is more than 16-bit value, any subsequent bounds-check of r3
-against skb->data_end will not give us 'range' information, so attempts to read
-through the pointer will give "invalid access to packet" error.
-
-Ex. after insn ``r4 = *(u8 *)(r3 +12)`` (insn #7 above) the state of r4 is
-R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits
-of the register are guaranteed to be zero, and nothing is known about the lower
-8 bits. After insn ``r4 *= 14`` the state becomes
-R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit
-value by constant 14 will keep upper 52 bits as zero, also the least significant
-bit will be zero as 14 is even. Similarly ``r2 >>= 48`` will make
-R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign
-extending. This logic is implemented in adjust_reg_min_max_vals() function,
-which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice
-versa) and adjust_scalar_min_max_vals() for operations on two scalars.
-
-The end result is that bpf program author can access packet directly
-using normal C code as::
-
- void *data = (void *)(long)skb->data;
- void *data_end = (void *)(long)skb->data_end;
- struct eth_hdr *eth = data;
- struct iphdr *iph = data + sizeof(*eth);
- struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph);
-
- if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end)
- return 0;
- if (eth->h_proto != htons(ETH_P_IP))
- return 0;
- if (iph->protocol != IPPROTO_UDP || iph->ihl != 5)
- return 0;
- if (udp->dest == 53 || udp->source == 9)
- ...;
-
-which makes such programs easier to write compared to the LD_ABS insn
-and significantly faster.
-
-eBPF maps
----------
-'maps' is a generic storage of different types for sharing data between kernel
-and userspace.
-
-The maps are accessed from user space via BPF syscall, which has commands:
-
-- create a map with given type and attributes
- ``map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)``
- using attr->map_type, attr->key_size, attr->value_size, attr->max_entries
- returns process-local file descriptor or negative error
-
-- lookup key in a given map
- ``err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)``
- using attr->map_fd, attr->key, attr->value
- returns zero and stores found elem into value or negative error
-
-- create or update key/value pair in a given map
- ``err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)``
- using attr->map_fd, attr->key, attr->value
- returns zero or negative error
-
-- find and delete element by key in a given map
- ``err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)``
- using attr->map_fd, attr->key
-
-- to delete map: close(fd)
- Exiting process will delete maps automatically
-
-userspace programs use this syscall to create/access maps that eBPF programs
-are concurrently updating.
-
-maps can have different types: hash, array, bloom filter, radix-tree, etc.
-
-The map is defined by:
-
- - type
- - max number of elements
- - key size in bytes
- - value size in bytes
-
-Pruning
--------
-The verifier does not actually walk all possible paths through the program. For
-each new branch to analyse, the verifier looks at all the states it's previously
-been in when at this instruction. If any of them contain the current state as a
-subset, the branch is 'pruned' - that is, the fact that the previous state was
-accepted implies the current state would be as well. For instance, if in the
-previous state, r1 held a packet-pointer, and in the current state, r1 holds a
-packet-pointer with a range as long or longer and at least as strict an
-alignment, then r1 is safe. Similarly, if r2 was NOT_INIT before then it can't
-have been used by any path from that point, so any value in r2 (including
-another NOT_INIT) is safe. The implementation is in the function regsafe().
-Pruning considers not only the registers but also the stack (and any spilled
-registers it may hold). They must all be safe for the branch to be pruned.
-This is implemented in states_equal().
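
The subset test described above can be illustrated with a deliberately simplified model. This is a sketch under stated assumptions, not the logic of regsafe() or states_equal(): the ``reg_model`` type and both function names are hypothetical, only unsigned scalar bounds and packet ranges are modelled, and ids, offsets, alignment (tnum) information and the stack are ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced register state used only for this illustration. */
enum reg_model_type { MODEL_NOT_INIT, MODEL_SCALAR, MODEL_PKT_PTR };

struct reg_model {
	enum reg_model_type type;
	uint64_t umin, umax; /* unsigned bounds, meaningful for MODEL_SCALAR */
	uint32_t range;      /* verified safe range, meaningful for MODEL_PKT_PTR */
};

#define MODEL_NR_REGS 11 /* R0 through R10 */

/* Is 'cur' covered by the previously accepted state 'old'? */
static bool reg_model_safe(const struct reg_model *old,
			   const struct reg_model *cur)
{
	/* The old path never read this register, so any current value is fine. */
	if (old->type == MODEL_NOT_INIT)
		return true;
	if (cur->type != old->type)
		return false;

	switch (old->type) {
	case MODEL_SCALAR:
		/* Current values must be a subset of the old ones. */
		return old->umin <= cur->umin && cur->umax <= old->umax;
	case MODEL_PKT_PTR:
		/* A range as long or longer is at least as safe. */
		return cur->range >= old->range;
	default:
		return false;
	}
}

/* A branch can be pruned only if every register is covered. */
static bool state_model_prunable(const struct reg_model old[MODEL_NR_REGS],
				 const struct reg_model cur[MODEL_NR_REGS])
{
	for (int i = 0; i < MODEL_NR_REGS; i++)
		if (!reg_model_safe(&old[i], &cur[i]))
			return false;
	return true;
}
```

In this model a packet pointer with r=14 is covered by an earlier state with r=8, but not the other way round, which mirrors the r1 example in the text.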
-
-Understanding eBPF verifier messages
-------------------------------------
-
-The following are a few examples of invalid eBPF programs and verifier error
-messages as seen in the log:
-
-Program with unreachable instructions::
-
- static struct bpf_insn prog[] = {
- BPF_EXIT_INSN(),
- BPF_EXIT_INSN(),
- };
-
-Error::
-
- unreachable insn 1
-
-Program that reads uninitialized register::
-
- BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
- BPF_EXIT_INSN(),
-
-Error::
-
- 0: (bf) r0 = r2
- R2 !read_ok
-
-Program that doesn't initialize R0 before exiting::
-
- BPF_MOV64_REG(BPF_REG_2, BPF_REG_1),
- BPF_EXIT_INSN(),
-
-Error::
-
- 0: (bf) r2 = r1
- 1: (95) exit
- R0 !read_ok
-
-Program that accesses stack out of bounds::
-
- BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0),
- BPF_EXIT_INSN(),
-
-Error::
-
- 0: (7a) *(u64 *)(r10 +8) = 0
- invalid stack off=8 size=8
-
-Program that doesn't initialize stack before passing its address into function::
-
- BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
- BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
- BPF_LD_MAP_FD(BPF_REG_1, 0),
- BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
- BPF_EXIT_INSN(),
-
-Error::
-
- 0: (bf) r2 = r10
- 1: (07) r2 += -8
- 2: (b7) r1 = 0x0
- 3: (85) call 1
- invalid indirect read from stack off -8+0 size 8
-
-Program that uses invalid map_fd=0 while calling to map_lookup_elem() function::
-
- BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
- BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
- BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
- BPF_LD_MAP_FD(BPF_REG_1, 0),
- BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
- BPF_EXIT_INSN(),
-
-Error::
-
- 0: (7a) *(u64 *)(r10 -8) = 0
- 1: (bf) r2 = r10
- 2: (07) r2 += -8
- 3: (b7) r1 = 0x0
- 4: (85) call 1
- fd 0 is not pointing to valid bpf_map
-
-Program that doesn't check return value of map_lookup_elem() before accessing
-map element::
-
- BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
- BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
- BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
- BPF_LD_MAP_FD(BPF_REG_1, 0),
- BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
- BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
- BPF_EXIT_INSN(),
-
-Error::
-
- 0: (7a) *(u64 *)(r10 -8) = 0
- 1: (bf) r2 = r10
- 2: (07) r2 += -8
- 3: (b7) r1 = 0x0
- 4: (85) call 1
- 5: (7a) *(u64 *)(r0 +0) = 0
- R0 invalid mem access 'map_value_or_null'
-
-Program that correctly checks map_lookup_elem() returned value for NULL, but
-accesses the memory with incorrect alignment::
-
- BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
- BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
- BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
- BPF_LD_MAP_FD(BPF_REG_1, 0),
- BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
- BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
- BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
- BPF_EXIT_INSN(),
-
-Error::
-
- 0: (7a) *(u64 *)(r10 -8) = 0
- 1: (bf) r2 = r10
- 2: (07) r2 += -8
- 3: (b7) r1 = 1
- 4: (85) call 1
- 5: (15) if r0 == 0x0 goto pc+1
- R0=map_ptr R10=fp
- 6: (7a) *(u64 *)(r0 +4) = 0
- misaligned access off 4 size 8
-
-Program that correctly checks map_lookup_elem() returned value for NULL and
-accesses memory with correct alignment in one side of 'if' branch, but fails
-to do so in the other side of 'if' branch::
-
- BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
- BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
- BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
- BPF_LD_MAP_FD(BPF_REG_1, 0),
- BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
- BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
- BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
- BPF_EXIT_INSN(),
- BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1),
- BPF_EXIT_INSN(),
-
-Error::
-
- 0: (7a) *(u64 *)(r10 -8) = 0
- 1: (bf) r2 = r10
- 2: (07) r2 += -8
- 3: (b7) r1 = 1
- 4: (85) call 1
- 5: (15) if r0 == 0x0 goto pc+2
- R0=map_ptr R10=fp
- 6: (7a) *(u64 *)(r0 +0) = 0
- 7: (95) exit
-
- from 5 to 8: R0=imm0 R10=fp
- 8: (7a) *(u64 *)(r0 +0) = 1
- R0 invalid mem access 'imm'
-
-Program that performs a socket lookup then sets the pointer to NULL without
-checking it::
-
- BPF_MOV64_IMM(BPF_REG_2, 0),
- BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
- BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
- BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
- BPF_MOV64_IMM(BPF_REG_3, 4),
- BPF_MOV64_IMM(BPF_REG_4, 0),
- BPF_MOV64_IMM(BPF_REG_5, 0),
- BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
- BPF_MOV64_IMM(BPF_REG_0, 0),
- BPF_EXIT_INSN(),
-
-Error::
-
- 0: (b7) r2 = 0
- 1: (63) *(u32 *)(r10 -8) = r2
- 2: (bf) r2 = r10
- 3: (07) r2 += -8
- 4: (b7) r3 = 4
- 5: (b7) r4 = 0
- 6: (b7) r5 = 0
- 7: (85) call bpf_sk_lookup_tcp#65
- 8: (b7) r0 = 0
- 9: (95) exit
- Unreleased reference id=1, alloc_insn=7
-
-Program that performs a socket lookup but does not NULL-check the returned
-value::
-
- BPF_MOV64_IMM(BPF_REG_2, 0),
- BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
- BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
- BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
- BPF_MOV64_IMM(BPF_REG_3, 4),
- BPF_MOV64_IMM(BPF_REG_4, 0),
- BPF_MOV64_IMM(BPF_REG_5, 0),
- BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
- BPF_EXIT_INSN(),
-
-Error::
-
- 0: (b7) r2 = 0
- 1: (63) *(u32 *)(r10 -8) = r2
- 2: (bf) r2 = r10
- 3: (07) r2 += -8
- 4: (b7) r3 = 4
- 5: (b7) r4 = 0
- 6: (b7) r5 = 0
- 7: (85) call bpf_sk_lookup_tcp#65
- 8: (95) exit
- Unreleased reference id=1, alloc_insn=7
+sparc64, arm32, riscv64, riscv32, loongarch64 perform JIT compilation
+from eBPF instruction set.
Testing
-------
Next to the BPF toolchain, the kernel also ships a test module that contains
-various test cases for classic and internal BPF that can be executed against
+various test cases for classic and eBPF that can be executed against
the BPF interpreter and JIT compiler. It can be found in lib/test_bpf.c and
enabled via Kconfig::
diff --git a/Documentation/networking/framerelay.rst b/Documentation/networking/framerelay.rst
deleted file mode 100644
index 6d904399ec6d..000000000000
--- a/Documentation/networking/framerelay.rst
+++ /dev/null
@@ -1,44 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-================
-Frame Relay (FR)
-================
-
-Frame Relay (FR) support for linux is built into a two tiered system of device
-drivers. The upper layer implements RFC1490 FR specification, and uses the
-Data Link Connection Identifier (DLCI) as its hardware address. Usually these
-are assigned by your network supplier, they give you the number/numbers of
-the Virtual Connections (VC) assigned to you.
-
-Each DLCI is a point-to-point link between your machine and a remote one.
-As such, a separate device is needed to accommodate the routing. Within the
-net-tools archives is 'dlcicfg'. This program will communicate with the
-base "DLCI" device, and create new net devices named 'dlci00', 'dlci01'...
-The configuration script will ask you how many DLCIs you need, as well as
-how many DLCIs you want to assign to each Frame Relay Access Device (FRAD).
-
-The DLCI uses a number of function calls to communicate with the FRAD, all
-of which are stored in the FRAD's private data area. assoc/deassoc,
-activate/deactivate and dlci_config. The DLCI supplies a receive function
-to the FRAD to accept incoming packets.
-
-With this initial offering, only 1 FRAD driver is available. With many thanks
-to Sangoma Technologies, David Mandelstam & Gene Kozin, the S502A, S502E &
-S508 are supported. This driver is currently set up for only FR, but as
-Sangoma makes more firmware modules available, it can be updated to provide
-them as well.
-
-Configuration of the FRAD makes use of another net-tools program, 'fradcfg'.
-This program makes use of a configuration file (which dlcicfg can also read)
-to specify the types of boards to be configured as FRADs, as well as perform
-any board specific configuration. The Sangoma module of fradcfg loads the
-FR firmware into the card, sets the irq/port/memory information, and provides
-an initial configuration.
-
-Additional FRAD device drivers can be added as hardware is available.
-
-At this time, the dlcicfg and fradcfg programs have not been incorporated into
-the net-tools distribution. They can be found at ftp.invlogic.com, in
-/pub/linux. Note that with OS/2 FTPD, you end up in /pub by default, so just
-use 'cd linux'. v0.10 is for use on pre-2.0.3 and earlier, v0.15 is for
-pre-2.0.4 and later.
diff --git a/Documentation/networking/generic_netlink.rst b/Documentation/networking/generic_netlink.rst
index 59e04ccf80c1..d960dbd7e80e 100644
--- a/Documentation/networking/generic_netlink.rst
+++ b/Documentation/networking/generic_netlink.rst
@@ -6,4 +6,4 @@ Generic Netlink
A wiki document on how to use Generic Netlink can be found here:
- * http://www.linuxfoundation.org/collaborate/workgroups/networking/generic_netlink_howto
+ * https://wiki.linuxfoundation.org/networking/generic_netlink_howto
diff --git a/Documentation/networking/gtp.rst b/Documentation/networking/gtp.rst
index 1563fb94b289..9a7835cc1437 100644
--- a/Documentation/networking/gtp.rst
+++ b/Documentation/networking/gtp.rst
@@ -162,7 +162,7 @@ Local GTP-U entity and tunnel identification
GTP-U uses UDP for transporting PDU's. The receiving UDP port is 2152
for GTPv1-U and 3386 for GTPv0-U.
-There is only one GTP-U entity (and therefor SGSN/GGSN/S-GW/PDN-GW
+There is only one GTP-U entity (and therefore SGSN/GGSN/S-GW/PDN-GW
instance) per IP address. Tunnel Endpoint Identifier (TEID) are unique
per GTP-U entity.
diff --git a/Documentation/networking/ieee802154.rst b/Documentation/networking/ieee802154.rst
index 6f4bf8447a21..c652d383fe10 100644
--- a/Documentation/networking/ieee802154.rst
+++ b/Documentation/networking/ieee802154.rst
@@ -26,7 +26,9 @@ The stack is composed of three main parts:
Socket API
==========
-.. c:function:: int sd = socket(PF_IEEE802154, SOCK_DGRAM, 0);
+::
+
+ int sd = socket(PF_IEEE802154, SOCK_DGRAM, 0);
The address family, socket addresses etc. are defined in the
include/net/af_ieee802154.h header or in the special header
@@ -68,7 +70,7 @@ Like with WiFi, there are several types of devices implementing IEEE 802.15.4.
exports a management (e.g. MLME) and data API.
2) 'SoftMAC' or just radio. These types of devices are just radio transceivers
possibly with some kinds of acceleration like automatic CRC computation and
-comparation, automagic ACK handling, address matching, etc.
+comparison, automagic ACK handling, address matching, etc.
Those types of devices require different approach to be hooked into Linux kernel.
@@ -131,12 +133,12 @@ Register PHY in the system.
Freeing registered PHY.
-.. c:function:: void ieee802154_rx_irqsafe(struct ieee802154_hw *hw, struct sk_buff *skb, u8 lqi):
+.. c:function:: void ieee802154_rx_irqsafe(struct ieee802154_hw *hw, struct sk_buff *skb, u8 lqi)
Telling 802.15.4 module there is a new received frame in the skb with
the RF Link Quality Indicator (LQI) from the hardware device.
-.. c:function:: void ieee802154_xmit_complete(struct ieee802154_hw *hw, struct sk_buff *skb, bool ifs_handling):
+.. c:function:: void ieee802154_xmit_complete(struct ieee802154_hw *hw, struct sk_buff *skb, bool ifs_handling)
Telling 802.15.4 module the frame in the skb is or is going to be
transmitted through the hardware device
@@ -155,25 +157,25 @@ operations structure at least::
...
};
-.. c:function:: int start(struct ieee802154_hw *hw):
+.. c:function:: int start(struct ieee802154_hw *hw)
Handler that 802.15.4 module calls for the hardware device initialization.
-.. c:function:: void stop(struct ieee802154_hw *hw):
+.. c:function:: void stop(struct ieee802154_hw *hw)
Handler that 802.15.4 module calls for the hardware device cleanup.
-.. c:function:: int xmit_async(struct ieee802154_hw *hw, struct sk_buff *skb):
+.. c:function:: int xmit_async(struct ieee802154_hw *hw, struct sk_buff *skb)
Handler that 802.15.4 module calls for each frame in the skb going to be
transmitted through the hardware device.
-.. c:function:: int ed(struct ieee802154_hw *hw, u8 *level):
+.. c:function:: int ed(struct ieee802154_hw *hw, u8 *level)
Handler that 802.15.4 module calls for Energy Detection from the hardware
device.
-.. c:function:: int set_channel(struct ieee802154_hw *hw, u8 page, u8 channel):
+.. c:function:: int set_channel(struct ieee802154_hw *hw, u8 page, u8 channel)
Set radio for listening on specific channel of the hardware device.
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 0186e276690a..473d72c36d61 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -1,12 +1,13 @@
-Linux Networking Documentation
-==============================
+Networking
+==========
+
+Refer to :ref:`netdev-FAQ` for a guide on netdev development process specifics.
Contents:
.. toctree::
:maxdepth: 2
- netdev-FAQ
af_xdp
bareudp
batman-adv
@@ -20,7 +21,6 @@ Contents:
ieee802154
j1939
kapi
- z8530book
msg_zerocopy
failover
net_dim
@@ -36,39 +36,31 @@ Contents:
scaling
tls
tls-offload
+ tls-handshake
nfc
6lowpan
6pack
- altera_tse
arcnet-hardware
arcnet
atm
ax25
- baycom
bonding
cdc_mbim
- cops
- cxacru
dccp
dctcp
- decnet
- defza
dns_resolver
driver
eql
fib_trie
filter
- fore200e
- framerelay
generic-hdlc
generic_netlink
+ netlink_spec/index
gen_stats
gtp
- hinic
ila
- ipddp
+ ioam6-sysctl
ip_dynaddr
- iphase
ipsec
ip-sysctl
ipv6
@@ -77,15 +69,20 @@ Contents:
kcm
l2tp
lapb-module
- ltpc
mac80211-injection
+ mctp
mpls-sysctl
+ mptcp-sysctl
multiqueue
+ multi-pf-netdev
+ napi
+ net_cachelines/index
netconsole
netdev-features
netdevices
netfilter-sysctl
netif-msg
+ nexthop-group-resilient
nf_conntrack-sysctl
nf_flowtable
openvswitch
@@ -97,32 +94,39 @@ Contents:
ppp_generic
proc_net_tcp
radiotap-headers
- ray_cs
rds
regulatory
+ representors
rxrpc
sctp
secid
seg6-sysctl
- skfp
+ skbuff
+ smc-sysctl
+ statistics
strparser
switchdev
+ sysfs-tagging
tc-actions-env-rules
+ tc-queue-filters
+ tcp_ao
tcp-thin
team
timestamping
+ tipc
tproxy
tuntap
udplite
vrf
vxlan
- x25-iface
x25
+ x25-iface
xfrm_device
xfrm_proc
xfrm_sync
xfrm_sysctl
- z8530drv
+ xdp-rx-metadata
+ xsk-tx-metadata
.. only:: subproject and html
diff --git a/Documentation/networking/ioam6-sysctl.rst b/Documentation/networking/ioam6-sysctl.rst
new file mode 100644
index 000000000000..c18cab2c481a
--- /dev/null
+++ b/Documentation/networking/ioam6-sysctl.rst
@@ -0,0 +1,26 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+IOAM6 Sysctl variables
+======================
+
+
+/proc/sys/net/ipv6/conf/<iface>/ioam6_* variables:
+==================================================
+
+ioam6_enabled - BOOL
+ Accept (= enabled) or ignore (= disabled) IPv6 IOAM options on ingress
+ for this interface.
+
+ * 0 - disabled (default)
+ * 1 - enabled
+
+ioam6_id - SHORT INTEGER
+ Define the IOAM id of this interface.
+
+ Default is ~0.
+
+ioam6_id_wide - INTEGER
+ Define the wide IOAM id of this interface.
+
+ Default is ~0.
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 837d51f9e1fa..bd50df6a5a42 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -25,7 +25,8 @@ ip_default_ttl - INTEGER
ip_no_pmtu_disc - INTEGER
Disable Path MTU Discovery. If enabled in mode 1 and a
fragmentation-required ICMP is received, the PMTU to this
- destination will be set to min_pmtu (see below). You will need
+ destination will be set to the smallest of the old MTU to
+ this destination and min_pmtu (see below). You will need
to raise min_pmtu to the smallest interface MTU on your system
manually if you want to avoid locally generated fragments.
@@ -49,7 +50,8 @@ ip_no_pmtu_disc - INTEGER
Default: FALSE
min_pmtu - INTEGER
- default 552 - minimum discovered Path MTU
+ default 552 - minimum Path MTU. Unless this is changed manually,
+ each cached pmtu will never be lower than this setting.
ip_forward_use_pmtu - BOOLEAN
By default we don't trust protocol path MTUs while forwarding
@@ -99,6 +101,35 @@ fib_multipath_hash_policy - INTEGER
- 0 - Layer 3
- 1 - Layer 4
- 2 - Layer 3 or inner Layer 3 if present
+ - 3 - Custom multipath hash. Fields used for multipath hash calculation
+ are determined by fib_multipath_hash_fields sysctl
+
+fib_multipath_hash_fields - UNSIGNED INTEGER
+ When fib_multipath_hash_policy is set to 3 (custom multipath hash), the
+ fields used for multipath hash calculation are determined by this
+ sysctl.
+
+ This value is a bitmask which enables various fields for multipath hash
+ calculation.
+
+ Possible fields are:
+
+ ====== ============================
+ 0x0001 Source IP address
+ 0x0002 Destination IP address
+ 0x0004 IP protocol
+ 0x0008 Unused (Flow Label)
+ 0x0010 Source port
+ 0x0020 Destination port
+ 0x0040 Inner source IP address
+ 0x0080 Inner destination IP address
+ 0x0100 Inner IP protocol
+ 0x0200 Inner Flow Label
+ 0x0400 Inner source port
+ 0x0800 Inner destination port
+ ====== ============================
+
+ Default: 0x0007 (source IP, destination IP and IP protocol)
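
As an illustration of how the bitmask in the table above is composed, here is a minimal sketch. The enumerator and function names are hypothetical and chosen for this example; only the bit values and the default are taken from the table.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; the bit values follow the table above. */
enum mp_hash_field {
	MP_HASH_SRC_IP           = 0x0001,
	MP_HASH_DST_IP           = 0x0002,
	MP_HASH_IP_PROTO         = 0x0004,
	MP_HASH_FLOWLABEL_UNUSED = 0x0008,
	MP_HASH_SRC_PORT         = 0x0010,
	MP_HASH_DST_PORT         = 0x0020,
	MP_HASH_INNER_SRC_IP     = 0x0040,
	MP_HASH_INNER_DST_IP     = 0x0080,
	MP_HASH_INNER_IP_PROTO   = 0x0100,
	MP_HASH_INNER_FLOWLABEL  = 0x0200,
	MP_HASH_INNER_SRC_PORT   = 0x0400,
	MP_HASH_INNER_DST_PORT   = 0x0800,
};

/* Union of every bit listed in the table. */
#define MP_HASH_ALL_LISTED 0x0fffu

/* The documented default: source IP, destination IP and IP protocol. */
static uint32_t mp_hash_fields_default(void)
{
	return MP_HASH_SRC_IP | MP_HASH_DST_IP | MP_HASH_IP_PROTO;
}

/* Example: the default fields plus the outer source and destination ports. */
static uint32_t mp_hash_fields_outer_5tuple(void)
{
	return mp_hash_fields_default() | MP_HASH_SRC_PORT | MP_HASH_DST_PORT;
}

/* True if the mask sets only bits that appear in the table. */
static bool mp_hash_fields_listed(uint32_t mask)
{
	return (mask & ~MP_HASH_ALL_LISTED) == 0;
}
```

With these definitions the outer 5-tuple mask evaluates to 0x0037. That value only takes effect when fib_multipath_hash_policy is set to 3.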
fib_sync_mem - UNSIGNED INTEGER
Amount of dirty memory from fib entries that can be backlogged before
@@ -125,6 +156,9 @@ route/max_size - INTEGER
From linux kernel 3.6 onwards, this is deprecated for ipv4
as route cache is no longer used.
+ From linux kernel 6.3 onwards, this is deprecated for ipv6
+ as garbage collection manages cached route entries.
+
neigh/default/gc_thresh1 - INTEGER
Minimum number of entries to keep. Garbage collector will not
purge entries if there are fewer than this number.
@@ -171,6 +205,12 @@ neigh/default/unres_qlen - INTEGER
Default: 101
+neigh/default/interval_probe_time_ms - INTEGER
+ The probe interval for neighbor entries with the NTF_MANAGED flag.
+ The minimum value is 1.
+
+ Default: 5000
+
mtu_expires - INTEGER
Time, in seconds, that cached PMTU information is kept.
@@ -178,6 +218,27 @@ min_adv_mss - INTEGER
The advertised MSS depends on the first hop route MTU, but will
never be lower than this setting.
+fib_notify_on_flag_change - INTEGER
+ Whether to emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/
+ RTM_F_TRAP/RTM_F_OFFLOAD_FAILED flags are changed.
+
+ After installing a route to the kernel, user space receives an
+ acknowledgment, which means the route was installed in the kernel,
+ but not necessarily in hardware.
+ It is also possible for a route already installed in hardware to change
+ its action and therefore its flags. For example, a host route that is
+ trapping packets can be "promoted" to perform decapsulation following
+ the installation of an IPinIP/VXLAN tunnel.
+ The notifications will indicate to user-space the state of the route.
+
+ Default: 0 (Do not emit notifications.)
+
+ Possible values:
+
+ - 0 - Do not emit notifications.
+ - 1 - Emit notifications.
+ - 2 - Emit notifications only for RTM_F_OFFLOAD_FAILED flag change.
+
IP Fragmentation:
ipfrag_high_thresh - LONG INTEGER
@@ -215,6 +276,13 @@ ipfrag_max_dist - INTEGER
from different IP datagrams, which could result in data corruption.
Default: 64
+bc_forwarding - INTEGER
+ bc_forwarding enables the feature described in rfc1812#section-5.3.5.2
+ and rfc2644. It allows the router to forward directed broadcast.
+ To enable this feature, the 'all' entry and the input interface entry
+ should be set to 1.
+ Default: 0
+
INET peer storage
=================
@@ -253,6 +321,7 @@ tcp_abort_on_overflow - BOOLEAN
option can harm clients of your server.
tcp_adv_win_scale - INTEGER
+ Obsolete since linux-6.6
Count buffering overhead as bytes/2^tcp_adv_win_scale
(if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale),
if it is <= 0.
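
The overhead formula above can be written out as a short sketch. The function names are hypothetical and the code only restates the documented arithmetic; it assumes a non-negative byte count and a scale whose magnitude is smaller than the width of the integer type. It is not taken from the kernel sources.

```c
#include <stdint.h>

/* Buffering overhead as described in the text:
 *   scale > 0:  bytes / 2^scale
 *   scale <= 0: bytes - bytes / 2^(-scale)
 */
static uint64_t adv_win_overhead(uint64_t bytes, int scale)
{
	if (scale > 0)
		return bytes >> scale;
	return bytes - (bytes >> -scale);
}

/* Space left for the advertised window once the overhead is subtracted. */
static uint64_t adv_win_from_space(uint64_t bytes, int scale)
{
	return bytes - adv_win_overhead(bytes, scale);
}
```

For example, a scale of 2 applied to 87380 bytes gives an overhead of 21845 bytes and leaves 65535 bytes, while a scale of 1 applied to 131072 bytes leaves 65536 bytes.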
@@ -272,6 +341,8 @@ tcp_app_win - INTEGER
Reserve max(window/2^tcp_app_win, mss) of window for application
buffer. Value 0 is special, it means that nothing is reserved.
+ Possible values are [0, 31], inclusive.
+
Default: 31
tcp_autocorking - BOOLEAN
@@ -571,6 +642,16 @@ tcp_recovery - INTEGER
Default: 0x1
+tcp_reflect_tos - BOOLEAN
+ For listening sockets, reuse the DSCP value of the initial SYN message
+ for outgoing packets. This allows both directions of a TCP stream
+ to use the same DSCP value, assuming DSCP remains unchanged for
+ the lifetime of the connection.
+
+ This option affects both IPv4 and IPv6.
+
+ Default: 0 (disabled)
+
tcp_reordering - INTEGER
Initial reordering level of packets in a TCP stream.
TCP stack can then dynamically adjust flow reordering level
@@ -630,16 +711,15 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max
default: initial size of receive buffer used by TCP sockets.
This value overrides net.core.rmem_default used by other protocols.
- Default: 87380 bytes. This value results in window of 65535 with
- default setting of tcp_adv_win_scale and tcp_app_win:0 and a bit
- less for default tcp_app_win. See below about these variables.
+ Default: 131072 bytes.
+ This value results in initial window of 65535.
max: maximal size of receive buffer allowed for automatically
selected receiver buffers for TCP socket. This value does not override
net.core.rmem_max. Calling setsockopt() with SO_RCVBUF disables
automatic tuning of that socket's receive buffer size, in which
case this value is ignored.
- Default: between 87380B and 6MB, depending on RAM size.
+ Default: between 131072 and 6MB, depending on RAM size.
tcp_sack - BOOLEAN
Enable select acknowledgments (SACKS).
@@ -665,6 +745,13 @@ tcp_comp_sack_nr - INTEGER
Default : 44
+tcp_backlog_ack_defer - BOOLEAN
+ If set, user thread processing socket backlog tries sending
+ one ACK for the whole queue. This helps to avoid potential
+ long latencies at end of a TCP socket syscall.
+
+ Default : true
+
tcp_slow_start_after_idle - BOOLEAN
If set, provide RFC2861 behavior and time out the congestion
window after an idle period. An idle period is defined at
@@ -712,6 +799,31 @@ tcp_syncookies - INTEGER
 network connections you can set this knob to 2 to unconditionally
 enable generation of syncookies.
+tcp_migrate_req - BOOLEAN
+ The incoming connection is tied to a specific listening socket when
+ the initial SYN packet is received during the three-way handshake.
+ When a listener is closed, in-flight request sockets during the
+ handshake and established sockets in the accept queue are aborted.
+
+ If the listener has SO_REUSEPORT enabled, other listeners on the
+ same port should have been able to accept such connections. This
+ option makes it possible to migrate such child sockets to another
+ listener after close() or shutdown().
+
+ The BPF_SK_REUSEPORT_SELECT_OR_MIGRATE type of eBPF program should
+ usually be used to define the policy to pick an alive listener.
+ Otherwise, the kernel will randomly pick an alive listener only if
+ this option is enabled.
+
+ Note that migration between listeners with different settings may
+ crash applications. Let's say migration happens from listener A to
+ B, and only B has TCP_SAVE_SYN enabled. B cannot read SYN data from
+ the requests migrated from A. To avoid such a situation, cancel
+ migration by returning SK_DROP in the type of eBPF program, or
+ disable this option.
+
+ Default: 0
+
tcp_fastopen - INTEGER
Enable TCP Fast Open (RFC7413) to send and accept data in the opening
SYN packet.
@@ -752,7 +864,7 @@ tcp_fastopen_blackhole_timeout_sec - INTEGER
initial value when the blackhole issue goes away.
0 to disable the blackhole detection.
- By default, it is set to 1hr.
+ By default, it is set to 0 (feature is disabled).
tcp_fastopen_key - list of comma separated 32-digit hexadecimal INTEGERs
The list consists of a primary key and an optional backup key. The
@@ -777,9 +889,10 @@ tcp_fastopen_key - list of comma separated 32-digit hexadecimal INTEGERs
tcp_syn_retries - INTEGER
Number of times initial SYNs for an active TCP connection attempt
will be retransmitted. Should not be higher than 127. Default value
- is 6, which corresponds to 63seconds till the last retransmission
- with the current initial RTO of 1second. With this the final timeout
- for an active TCP connection attempt will happen after 127seconds.
+ is 6, which corresponds to 67 seconds (with tcp_syn_linear_timeouts = 4)
+ till the last retransmission with the current initial RTO of 1 second.
+ With this the final timeout for an active TCP connection attempt
+ will happen after 131 seconds.
tcp_timestamps - INTEGER
Enable timestamps as defined in RFC1323.
@@ -802,6 +915,29 @@ tcp_min_tso_segs - INTEGER
Default: 2
+tcp_tso_rtt_log - INTEGER
+ Adjustment of TSO packet sizes based on min_rtt
+
+ Starting from linux-5.18, TCP autosizing can be tweaked
+ for flows having small RTT.
+
+ Old autosizing was splitting the pacing budget to send 1024 TSO
+ per second.
+
+ tso_packet_size = sk->sk_pacing_rate / 1024;
+
+ With the new mechanism, we increase this TSO sizing using:
+
+ distance = min_rtt_usec / (2^tcp_tso_rtt_log)
+ tso_packet_size += gso_max_size >> distance;
+
+ This means that flows between very close hosts can use bigger
+ TSO packets, reducing their cpu costs.
+
+ If you want to use the old autosizing, set this sysctl to 0.
+
+ Default: 9 (2^9 = 512 usec)
+
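
The two formulas for tcp_tso_rtt_log can be combined into a small sketch. It follows only the arithmetic written above; the function name is hypothetical, and any clamping the kernel may apply to the result is deliberately left out.

```python
def tso_packet_size(pacing_rate, min_rtt_usec, gso_max_size, tcp_tso_rtt_log=9):
    """Sketch of the documented TSO autosizing arithmetic (not kernel code)."""
    # Old autosizing: split the pacing budget to send 1024 TSO per second.
    size = pacing_rate // 1024
    if tcp_tso_rtt_log == 0:
        # Setting the sysctl to 0 selects the old autosizing only.
        return size
    # distance = min_rtt_usec / (2^tcp_tso_rtt_log)
    distance = min_rtt_usec >> tcp_tso_rtt_log
    # A smaller min_rtt gives a smaller distance and a larger bonus.
    return size + (gso_max_size >> distance)
```

For instance, with the default of 9, a flow with a min_rtt below 512 usec has distance 0 and gains the full gso_max_size, while a flow with a min_rtt of 10 ms gains nothing for a 64K gso_max_size.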
tcp_pacing_ss_ratio - INTEGER
sk->sk_pacing_rate is set by TCP stack using a ratio applied
to current rate. (current_rate = cwnd * mss / srtt)
@@ -819,6 +955,16 @@ tcp_pacing_ca_ratio - INTEGER
Default: 120
+tcp_syn_linear_timeouts - INTEGER
+ The number of times for an active TCP connection to retransmit SYNs with
+ a linear backoff timeout before defaulting to an exponential backoff
+ timeout. This has no effect on SYNACK at the passive TCP side.
+
+ With an initial RTO of 1 and tcp_syn_linear_timeouts = 4 we would
+ expect SYN RTOs to be: 1, 1, 1, 1, 1, 2, 4, ... (4 linear timeouts,
+ and the first exponential backoff using 2^0 * initial_RTO).
+ Default: 4
+
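
The RTO sequence quoted for tcp_syn_linear_timeouts can be reproduced with a minimal sketch. The helper name is hypothetical, and the sketch ignores any upper bound on the RTO; it only mirrors the sequence described above.

```python
def syn_rto_sequence(count, initial_rto=1, tcp_syn_linear_timeouts=4):
    """Return the first `count` SYN RTO values as described in the text."""
    # Linear phase: the RTO stays at its initial value.
    linear = min(count, tcp_syn_linear_timeouts)
    rtos = [initial_rto] * linear
    # Exponential phase: starts at 2^0 * initial_rto, then doubles.
    rtos += [initial_rto * 2 ** k for k in range(count - linear)]
    return rtos
```

Calling syn_rto_sequence(7) yields the sequence given in the text, and passing tcp_syn_linear_timeouts=0 gives a purely exponential backoff.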
tcp_tso_win_divisor - INTEGER
This allows control over what percentage of the congestion window
can be consumed by a single TSO frame.
@@ -843,6 +989,21 @@ tcp_tw_reuse - INTEGER
tcp_window_scaling - BOOLEAN
Enable window scaling as defined in RFC1323.
+tcp_shrink_window - BOOLEAN
+ This changes how the TCP receive window is calculated.
+
+ RFC 7323, section 2.4, says there are instances when a retracted
+ window can be offered, and that TCP implementations MUST ensure
+ that they handle a shrinking window, as specified in RFC 1122.
+
+ - 0 - Disabled. The window is never shrunk.
+ - 1 - Enabled. The window is shrunk when necessary to remain within
+ the memory limit set by autotuning (sk_rcvbuf).
+ This only occurs if a non-zero receive window
+ scaling factor is also in effect.
+
+ Default: 0
+
tcp_wmem - vector of 3 INTEGERs: min, default, max
min: Amount of memory reserved for send buffers for TCP sockets.
Each TCP socket has rights to use it due to fact of its birth.
@@ -913,15 +1074,127 @@ tcp_limit_output_bytes - INTEGER
tcp_challenge_ack_limit - INTEGER
Limits number of Challenge ACK sent per second, as recommended
in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks)
- Default: 1000
+ Note that this per netns rate limit can allow some side channel
+ attacks and probably should not be enabled.
+ TCP stack implements per TCP socket limits anyway.
+ Default: INT_MAX (unlimited)
-tcp_rx_skb_cache - BOOLEAN
- Controls a per TCP socket cache of one skb, that might help
- performance of some workloads. This might be dangerous
- on systems with a lot of TCP sockets, since it increases
- memory usage.
+tcp_ehash_entries - INTEGER
+ Show the number of hash buckets for TCP sockets in the current
+ networking namespace.
- Default: 0 (disabled)
+ A negative value means the networking namespace does not own its
+ hash buckets and shares the initial networking namespace's one.
+
+tcp_child_ehash_entries - INTEGER
+ Control the number of hash buckets for TCP sockets in the child
+ networking namespace, which must be set before clone() or unshare().
+
+ If the value is not 0, the kernel uses a value rounded up to 2^n
+ as the actual hash bucket size. 0 is a special value, meaning
+ the child networking namespace will share the initial networking
+ namespace's hash buckets.
+
+ Note that the child will use the global one in case the kernel
+ fails to allocate enough memory. In addition, the global hash
+ buckets are spread over available NUMA nodes, but the allocation
+ of the child hash table depends on the current process's NUMA
+ policy, which could result in performance differences.
+
+ Note also that the default values of tcp_max_tw_buckets and
+ tcp_max_syn_backlog depend on the hash bucket size.
+
+ Possible values: 0, 2^n (n: 0 - 24 (16Mi))
+
+ Default: 0
+
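
The "rounded up to 2^n" rule for tcp_child_ehash_entries can be sketched as follows. The function name and the error handling are assumptions for illustration; only the rounding and the documented range come from the text.

```python
MAX_EHASH_ENTRIES = 1 << 24  # documented upper bound: 2^24 (16Mi)

def actual_hash_buckets(requested):
    """Hash bucket count a child netns would get for a requested value."""
    if not 0 <= requested <= MAX_EHASH_ENTRIES:
        raise ValueError("value outside the documented range")
    if requested == 0:
        # 0 is special: share the initial namespace's hash buckets.
        return 0
    # Round up to the next power of two (exact powers are unchanged).
    return 1 << (requested - 1).bit_length()
```

A request for 1000 buckets would therefore be treated as 1024, while 1024 stays 1024.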
+tcp_plb_enabled - BOOLEAN
+ If set and the underlying congestion control (e.g. DCTCP) supports
+ and enables PLB feature, TCP PLB (Protective Load Balancing) is
+ enabled. PLB is described in the following paper:
+ https://doi.org/10.1145/3544216.3544226. Based on PLB parameters,
+ upon sensing sustained congestion, TCP triggers a change in
+ flow label field for outgoing IPv6 packets. A change in flow label
+ field potentially changes the path of outgoing packets for switches
+ that use ECMP/WCMP for routing.
+
+ PLB changes socket txhash which results in a change in IPv6 Flow Label
+ field, and is currently a no-op for IPv4 headers. It is possible
+ to apply PLB for IPv4 with other network header fields (e.g. TCP
+ or IPv4 options) or using encapsulation where outer header is used
+ by switches to determine next hop. In either case, further host
+ and switch side changes will be needed.
+
+ When set, PLB assumes that congestion signal (e.g. ECN) is made
+ available and used by congestion control module to estimate a
+ congestion measure (e.g. ce_ratio). PLB needs a congestion measure to
+ make repathing decisions.
+
+ Default: FALSE
+
+tcp_plb_idle_rehash_rounds - INTEGER
+ Number of consecutive congested rounds (RTT) seen after which
+ a rehash can be performed, given there are no packets in flight.
+ This is referred to as M in the PLB paper:
+ https://doi.org/10.1145/3544216.3544226.
+
+ Possible Values: 0 - 31
+
+ Default: 3
+
+tcp_plb_rehash_rounds - INTEGER
+ Number of consecutive congested rounds (RTT) seen after which
+ a forced rehash can be performed. Be careful when setting this
+ parameter, as a small value increases the risk of retransmissions.
+ This is referred to as N in the PLB paper:
+ https://doi.org/10.1145/3544216.3544226.
+
+ Possible Values: 0 - 31
+
+ Default: 12
+
+tcp_plb_suspend_rto_sec - INTEGER
+ Time, in seconds, to suspend PLB in event of an RTO. In order to avoid
+ having PLB repath onto a connectivity "black hole", after an RTO a TCP
+ connection suspends PLB repathing for a random duration between 1x and
+ 2x of this parameter. Randomness is added to avoid concurrent rehashing
+ of multiple TCP connections. This should be set corresponding to the
+ amount of time it takes to repair a failed link.
+
+ Possible Values: 0 - 255
+
+ Default: 60
+
+tcp_plb_cong_thresh - INTEGER
+ Fraction of packets marked with congestion over a round (RTT) to
+ tag that round as congested. This is referred to as K in the PLB paper:
+ https://doi.org/10.1145/3544216.3544226.
+
+ The 0-1 fraction range is mapped to 0-256 range to avoid floating
+ point operations. For example, 128 means that if at least 50% of
+ the packets in a round were marked as congested then the round
+ will be tagged as congested.
+
+ Setting threshold to 0 means that PLB repaths every RTT regardless
+ of congestion. This is not intended behavior for PLB and should be
+ used only for experimentation purposes.
+
+ Possible Values: 0 - 256
+
+ Default: 128
+
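
The 0-256 mapping used by tcp_plb_cong_thresh can be shown with a small sketch. The function name, the integer rounding, and the exact comparison are assumptions; the text only states the mapping and the 50% example.

```python
def round_is_congested(marked_packets, total_packets, tcp_plb_cong_thresh=128):
    """Decide whether a round (RTT) would be tagged as congested."""
    if total_packets <= 0:
        raise ValueError("a round needs at least one packet")
    # Map the 0-1 marked fraction onto 0-256 using integer math only.
    ce_ratio = (marked_packets * 256) // total_packets
    return ce_ratio >= tcp_plb_cong_thresh
```

With the default of 128, a round with 5 of 10 packets marked is tagged as congested and a round with 4 of 10 is not. A threshold of 0 tags every round, which matches the warning above.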
+tcp_pingpong_thresh - INTEGER
+ The number of estimated data replies sent for estimated incoming data
+ requests that must happen before TCP considers that a connection is a
+ "ping-pong" (request-response) connection for which delayed
+ acknowledgments can provide benefits.
+
+ This threshold is 1 by default, but some applications may need a higher
+ threshold for optimal performance.
+
+ Possible Values: 1 - 255
+
+ Default: 1
UDP variables
=============
@@ -938,13 +1211,11 @@ udp_l3mdev_accept - BOOLEAN
udp_mem - vector of 3 INTEGERs: min, pressure, max
Number of pages allowed for queueing by all UDP sockets.
- min: Below this number of pages UDP is not bothered about its
- memory appetite. When amount of memory allocated by UDP exceeds
- this number, UDP starts to moderate memory usage.
+ min: Number of pages allowed for queueing by all UDP sockets.
pressure: This value was introduced to follow format of tcp_mem.
- max: Number of pages allowed for queueing by all UDP sockets.
+ max: This value was introduced to follow format of tcp_mem.
Default is calculated at boot time from amount of available memory.
@@ -956,11 +1227,34 @@ udp_rmem_min - INTEGER
Default: 4K
udp_wmem_min - INTEGER
- Minimal size of send buffer used by UDP sockets in moderation.
- Each UDP socket is able to use the size for sending data, even if
- total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
+ UDP does not have tx memory accounting and this tunable has no effect.
+
+udp_hash_entries - INTEGER
+ Show the number of hash buckets for UDP sockets in the current
+ networking namespace.
+
+ A negative value means the networking namespace does not own its
+ hash buckets and shares the initial networking namespace's one.
+
+udp_child_ehash_entries - INTEGER
+ Control the number of hash buckets for UDP sockets in the child
+ networking namespace, which must be set before clone() or unshare().
+
+ If the value is not 0, the kernel uses a value rounded up to 2^n
+ as the actual hash bucket size. 0 is a special value, meaning
+ the child networking namespace will share the initial networking
+ namespace's hash buckets.
+
+ Note that the child will use the global one in case the kernel
+ fails to allocate enough memory. In addition, the global hash
+ buckets are spread over available NUMA nodes, but the allocation
+ of the child hash table depends on the current process's NUMA
+ policy, which could result in performance differences.
+
+ Possible values: 0, 2^n (n: 7 (128) - 16 (64K))
+
+ Default: 0
- Default: 4K
RAW variables
=============
@@ -989,7 +1283,7 @@ cipso_cache_enable - BOOLEAN
cipso_cache_bucket_size - INTEGER
The CIPSO label cache consists of a fixed size hash table with each
hash bucket containing a number of cache entries. This variable limits
- the number of entries in each hash bucket; the larger the value the
+ the number of entries in each hash bucket; the larger the value is, the
more CIPSO label mappings that can be cached. When the number of
entries in a given hash bucket reaches this limit adding new entries
causes the oldest entry in the bucket to be removed to make room.
@@ -1053,7 +1347,9 @@ ip_local_reserved_ports - list of comma separated ranges
although this is redundant. However such a setting is useful
if later the port range is changed to a value that will
- include the reserved ports.
+ include the reserved ports. Also keep in mind that overlapping
+ of these ranges may affect the probability of selecting ephemeral
+ ports that come right after a block of reserved ports.
Default: Empty
@@ -1081,7 +1377,7 @@ ip_autobind_reuse - BOOLEAN
option should only be set by experts.
Default: 0
-ip_dynaddr - BOOLEAN
+ip_dynaddr - INTEGER
If set non-zero, enables support for dynamic addresses.
If set to a non-zero value larger than 1, a kernel log
message will be printed when dynamic address rewriting
@@ -1103,8 +1399,8 @@ ping_group_range - 2 INTEGERS
Restrict ICMP_PROTO datagram sockets to users in the group range.
The default is "1 0", meaning, that nobody (not even root) may
create ping sockets. Setting it to "100 100" would grant permissions
- to the single group. "0 4294967295" would enable it for the world, "100
- 4294967295" would enable it for the users, but not daemons.
+ to the single group. "0 4294967294" would enable it for the world, "100
+ 4294967294" would enable it for the users, but not daemons.
tcp_early_demux - BOOLEAN
Enable early demux for established TCP sockets.
@@ -1123,6 +1419,12 @@ icmp_echo_ignore_all - BOOLEAN
Default: 0
+icmp_echo_enable_probe - BOOLEAN
+ If set to one, then the kernel will respond to RFC 8335 PROBE
+ requests sent to it.
+
+ Default: 0
+
icmp_echo_ignore_broadcasts - BOOLEAN
If set non-zero, then the kernel will ignore all ICMP ECHO and
TIMESTAMP requests sent to it via broadcast/multicast.
@@ -1142,13 +1444,15 @@ icmp_ratelimit - INTEGER
icmp_msgs_per_sec - INTEGER
Limit maximal number of ICMP packets sent per second from this host.
Only messages whose type matches icmp_ratemask (see below) are
- controlled by this limit.
+ controlled by this limit. For security reasons, the precise count
+ of messages per second is randomized.
Default: 1000
icmp_msgs_burst - INTEGER
icmp_msgs_per_sec controls number of ICMP packets sent per second,
while icmp_msgs_burst controls the burst size of these packets.
+ For security reasons, the precise burst size is randomized.
Default: 50
@@ -1194,7 +1498,7 @@ icmp_errors_use_inbound_ifaddr - BOOLEAN
If non-zero, the message will be sent with the primary address of
the interface that received the packet that caused the icmp error.
- This is the behaviour network many administrators will expect from
+ This is the behaviour many network administrators will expect from
a router. And it can make debugging complicated network layouts
much easier.
@@ -1337,6 +1641,14 @@ proxy_arp_pvlan - BOOLEAN
Hewlett-Packard call it Source-Port filtering or port-isolation.
Ericsson call it MAC-Forced Forwarding (RFC Draft).
+proxy_delay - INTEGER
+ Delay proxy response.
+
+ Delay response to a neighbor solicitation when proxy_arp
+ or proxy_ndp is enabled. A random value between [0, proxy_delay)
+ will be chosen, setting to zero means reply with no delay.
+ Value in jiffies. Defaults to 80.
+
shared_media - BOOLEAN
Send(router) or accept(host) RFC1620 shared media redirects.
Overrides secure_redirects.
@@ -1423,6 +1735,25 @@ rp_filter - INTEGER
Default value is 0. Note that some distributions enable it
in startup scripts.
+src_valid_mark - BOOLEAN
+ - 0 - The fwmark of the packet is not included in reverse path
+ route lookup. This allows for asymmetric routing configurations
+ utilizing the fwmark in only one direction, e.g., transparent
+ proxying.
+
+ - 1 - The fwmark of the packet is included in reverse path route
+ lookup. This permits rp_filter to function when the fwmark is
+ used for routing traffic in both directions.
+
+ This setting also affects the utilization of fwmark when
+ performing source address selection for ICMP replies, or
+ determining addresses stored for the IPOPT_TS_TSANDADDR and
+ IPOPT_RR IP options.
+
+ The max value from conf/{all,interface}/src_valid_mark is used.
+
+ Default value is 0.
+
arp_filter - BOOLEAN
- 1 - Allows you to have multiple network interfaces on the same
subnet, and have the ARPs for each interface be answered
@@ -1502,12 +1833,15 @@ arp_notify - BOOLEAN
or hardware address changes.
== ==========================================================
-arp_accept - BOOLEAN
- Define behavior for gratuitous ARP frames who's IP is not
- already present in the ARP table:
+arp_accept - INTEGER
+ Define behavior for accepting gratuitous ARP (garp) frames from devices
+ that are not already present in the ARP table:
- 0 - don't create new entries in the ARP table
- 1 - create new entries in the ARP table
+ - 2 - create new entries only if the source IP address is in the same
+ subnet as an address configured on the interface that received the
+ garp message.
Both replies and requests type gratuitous arp will trigger the
ARP table to be updated, if this setting is on.
@@ -1516,6 +1850,15 @@ arp_accept - BOOLEAN
gratuitous arp frame, the arp table will be updated regardless
if this setting is on or off.
+arp_evict_nocarrier - BOOLEAN
+ Clears the ARP cache on NOCARRIER events. This option is important for
+ wireless devices where the ARP cache should not be cleared when roaming
+ between access points on the same network. In most cases this should
+ remain as the default (1).
+
+ - 1 - (default): Clear the ARP cache on NOCARRIER events
+ - 0 - Do not clear ARP cache on NOCARRIER events
+
mcast_solicit - INTEGER
The maximum number of multicast probes in INCOMPLETE state,
when the associated hardware address is unknown. Defaults
@@ -1552,6 +1895,9 @@ igmpv3_unsolicited_report_interval - INTEGER
Default: 1000 (1 seconds)
+ignore_routes_with_linkdown - BOOLEAN
+ Ignore routes whose link is down when performing a FIB lookup.
+
promote_secondaries - BOOLEAN
When a primary IP address is removed from this interface
promote a corresponding secondary IP address instead of
@@ -1691,6 +2037,35 @@ fib_multipath_hash_policy - INTEGER
- 0 - Layer 3 (source and destination addresses plus flow label)
- 1 - Layer 4 (standard 5-tuple)
- 2 - Layer 3 or inner Layer 3 if present
+ - 3 - Custom multipath hash. Fields used for multipath hash calculation
+ are determined by fib_multipath_hash_fields sysctl
+
+fib_multipath_hash_fields - UNSIGNED INTEGER
+ When fib_multipath_hash_policy is set to 3 (custom multipath hash), the
+ fields used for multipath hash calculation are determined by this
+ sysctl.
+
+ This value is a bitmask which enables various fields for multipath hash
+ calculation.
+
+ Possible fields are:
+
+ ====== ============================
+ 0x0001 Source IP address
+ 0x0002 Destination IP address
+ 0x0004 IP protocol
+ 0x0008 Flow Label
+ 0x0010 Source port
+ 0x0020 Destination port
+ 0x0040 Inner source IP address
+ 0x0080 Inner destination IP address
+ 0x0100 Inner IP protocol
+ 0x0200 Inner Flow Label
+ 0x0400 Inner source port
+ 0x0800 Inner destination port
+ ====== ============================
+
+ Default: 0x0007 (source IP, destination IP and IP protocol)
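
The bitmask for fib_multipath_hash_fields can be composed from the table above. The sketch below uses the documented bit values; the short field names and helper functions are hypothetical and exist only for this example.

```python
# Bit values taken from the table above; the key names are illustrative.
HASH_FIELDS = {
    "src_ip": 0x0001,
    "dst_ip": 0x0002,
    "ip_proto": 0x0004,
    "flow_label": 0x0008,
    "src_port": 0x0010,
    "dst_port": 0x0020,
    "inner_src_ip": 0x0040,
    "inner_dst_ip": 0x0080,
    "inner_ip_proto": 0x0100,
    "inner_flow_label": 0x0200,
    "inner_src_port": 0x0400,
    "inner_dst_port": 0x0800,
}

def hash_fields_mask(*names):
    """OR together the bits for the named fields."""
    mask = 0
    for name in names:
        mask |= HASH_FIELDS[name]
    return mask

def decode_hash_fields(mask):
    """List the field names enabled in a mask, in table order."""
    return [name for name, bit in HASH_FIELDS.items() if mask & bit]
```

For example, hash_fields_mask("src_ip", "dst_ip", "ip_proto") gives the default value, and adding both ports to the two outer addresses gives 0x0033.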
anycast_src_echo_reply - BOOLEAN
Controls the use of anycast addresses as source addresses for ICMPv6
@@ -1760,7 +2135,7 @@ skip_notify_on_dev_down - BOOLEAN
nexthop_compat_mode - BOOLEAN
New nexthop API provides a means for managing nexthops independent of
- prefixes. Backwards compatibilty with old route format is enabled by
+ prefixes. Backwards compatibility with old route format is enabled by
default which means route dumps and notifications contain the new
nexthop attribute but also the full, expanded nexthop definition.
Further, updates or deletes of a nexthop configuration generate route
@@ -1770,6 +2145,44 @@ nexthop_compat_mode - BOOLEAN
and extraneous notifications.
Default: true (backward compat mode)
+fib_notify_on_flag_change - INTEGER
+ Whether to emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/
+ RTM_F_TRAP/RTM_F_OFFLOAD_FAILED flags are changed.
+
+ After installing a route to the kernel, user space receives an
+ acknowledgment, which means the route was installed in the kernel,
+ but not necessarily in hardware.
+ It is also possible for a route already installed in hardware to change
+ its action and therefore its flags. For example, a host route that is
+ trapping packets can be "promoted" to perform decapsulation following
+ the installation of an IPinIP/VXLAN tunnel.
+ The notifications will indicate to user-space the state of the route.
+
+ Default: 0 (Do not emit notifications.)
+
+ Possible values:
+
+ - 0 - Do not emit notifications.
+ - 1 - Emit notifications.
+ - 2 - Emit notifications only for RTM_F_OFFLOAD_FAILED flag change.
+
+ioam6_id - INTEGER
+ Define the IOAM id of this node. Uses only 24 bits out of 32 in total.
+
+ Min: 0
+ Max: 0xFFFFFF
+
+ Default: 0xFFFFFF
+
+ioam6_id_wide - LONG INTEGER
+ Define the wide IOAM id of this node. Uses only 56 bits out of 64 in
+ total. Can be different from ioam6_id.
+
+ Min: 0
+ Max: 0xFFFFFFFFFFFFFF
+
+ Default: 0xFFFFFFFFFFFFFF
+
IPv6 Fragmentation:
ip6frag_high_thresh - INTEGER
@@ -1784,30 +2197,27 @@ ip6frag_low_thresh - INTEGER
ip6frag_time - INTEGER
Time in seconds to keep an IPv6 fragment in memory.
-IPv6 Segment Routing:
-
-seg6_flowlabel - INTEGER
- Controls the behaviour of computing the flowlabel of outer
- IPv6 header in case of SR T.encaps
-
- == =======================================================
- -1 set flowlabel to zero.
- 0 copy flowlabel from Inner packet in case of Inner IPv6
- (Set flowlabel to 0 in case IPv4/L2)
- 1 Compute the flowlabel using seg6_make_flowlabel()
- == =======================================================
-
- Default is 0.
-
``conf/default/*``:
Change the interface-specific default settings.
+ These settings are used when creating new interfaces.
+
``conf/all/*``:
Change all the interface-specific settings.
[XXX: Other special features than forwarding?]
+conf/all/disable_ipv6 - BOOLEAN
+ Changing this value is the same as changing ``conf/default/disable_ipv6``
+ setting and also all per-interface ``disable_ipv6`` settings to the same
+ value.
+
+ Reading this value does not have any particular meaning. It does not say
+ whether IPv6 support is enabled or disabled. Returned value can be 1
+ also in the case when some interface has ``disable_ipv6`` set to 0 and
+ has configured IPv6 addresses.
+
conf/all/forwarding - BOOLEAN
Enable global IPv6 forwarding between all interfaces.
@@ -1866,6 +2276,16 @@ accept_ra_defrtr - BOOLEAN
- enabled if accept_ra is enabled.
- disabled if accept_ra is disabled.
+ra_defrtr_metric - UNSIGNED INTEGER
+ Route metric for default route learned in Router Advertisement. This value
+ will be assigned as metric for the default route learned via IPv6 Router
+ Advertisement. Takes effect only if accept_ra_defrtr is enabled.
+
+ Possible values:
+ 1 to 0xFFFFFFFF
+
+ Default: IP6_RT_PRIO_USER i.e. 1024.
+
accept_ra_from_local - BOOLEAN
Accept RA with source-address that is found on local machine
if the RA is otherwise proper and able to be accepted.
@@ -1888,6 +2308,14 @@ accept_ra_min_hop_limit - INTEGER
Default: 1
+accept_ra_min_lft - INTEGER
+ Minimum acceptable lifetime value in Router Advertisement.
+
+ RA sections with a lifetime less than this value shall be
+ ignored. Zero lifetimes stay unaffected.
+
+ Default: 0
+
accept_ra_pinfo - BOOLEAN
Learn Prefix Information in Router Advertisement.
@@ -1896,6 +2324,17 @@ accept_ra_pinfo - BOOLEAN
- enabled if accept_ra is enabled.
- disabled if accept_ra is disabled.
+ra_honor_pio_life - BOOLEAN
+ Whether to use RFC4862 Section 5.5.3e to determine the valid
+ lifetime of an address matching a prefix sent in a Router
+ Advertisement Prefix Information Option.
+
+ - If enabled, the PIO valid lifetime will always be honored.
+ - If disabled, RFC4862 section 5.5.3e is used to determine
+ the valid lifetime of the address.
+
+ Default: 0 (disabled)
+
accept_ra_rt_info_min_plen - INTEGER
Minimum prefix length of Route Information in RA.
@@ -2063,12 +2502,18 @@ use_tempaddr - INTEGER
* -1 (for point-to-point devices and loopback devices)
temp_valid_lft - INTEGER
- valid lifetime (in seconds) for temporary addresses.
+ valid lifetime (in seconds) for temporary addresses. If less than the
+ minimum required lifetime (typically 5-7 seconds), temporary addresses
+ will not be created.
Default: 172800 (2 days)
temp_prefered_lft - INTEGER
- Preferred lifetime (in seconds) for temporary addresses.
+ Preferred lifetime (in seconds) for temporary addresses. If
+ temp_prefered_lft is less than the minimum required lifetime (typically
+ 5-7 seconds), the preferred lifetime is the minimum required. If
+ temp_prefered_lft is greater than temp_valid_lft, the preferred lifetime
+ is temp_valid_lft.
Default: 86400 (1 day)
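
The clamping described for temp_prefered_lft can be sketched as below. The function name is hypothetical, and the minimum required lifetime is passed in as a parameter because the text only says it is typically 5-7 seconds.

```python
def effective_preferred_lft(temp_prefered_lft, temp_valid_lft, min_required_lft):
    """Preferred lifetime actually used for a temporary address (illustrative)."""
    if temp_prefered_lft < min_required_lft:
        # Too small: the minimum required lifetime is used instead.
        return min_required_lft
    if temp_prefered_lft > temp_valid_lft:
        # Too large: the preferred lifetime is capped at temp_valid_lft.
        return temp_valid_lft
    return temp_prefered_lft
```

With the defaults (86400 and 172800) the configured value is used unchanged; the clamps only matter for unusually small or large settings.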
@@ -2090,6 +2535,16 @@ max_desync_factor - INTEGER
Default: 600
+regen_min_advance - INTEGER
+ How far in advance (in seconds), at minimum, to create a new temporary
+ address before the current one is deprecated. This value is added to
+ the amount of time that may be required for duplicate address detection
+ to determine when to create a new address. Linux permits setting this
+ value to less than the default of 2 seconds, but a value less than 2
+ does not conform to RFC 8981.
+
+ Default: 2
+
regen_max_retry - INTEGER
Number of attempts before give up attempting to generate
valid temporary addresses.
@@ -2169,6 +2624,15 @@ ndisc_tclass - INTEGER
* 0 - (default)
+ndisc_evict_nocarrier - BOOLEAN
+ Clears the neighbor discovery table on NOCARRIER events. This option is
+ important for wireless devices where the neighbor discovery cache should
+ not be cleared when roaming between access points on the same network.
+ In most cases this should remain as the default (1).
+
+ - 1 - (default): Clear neighbor discovery cache on NOCARRIER events.
+ - 0 - Do not clear neighbor discovery cache on NOCARRIER events.
+
mldv1_unsolicited_report_interval - INTEGER
The interval in milliseconds in which the next unsolicited
MLDv1 report retransmit will take place.
@@ -2254,6 +2718,37 @@ drop_unsolicited_na - BOOLEAN
By default this is turned off.
+accept_untracked_na - INTEGER
+ Define behavior for accepting neighbor advertisements from devices that
+ are absent in the neighbor cache:
+
+ - 0 - (default) Do not accept unsolicited and untracked neighbor
+ advertisements.
+
+ - 1 - Add a new neighbor cache entry in STALE state for routers on
+ receiving a neighbor advertisement (either solicited or unsolicited)
+ with target link-layer address option specified if no neighbor entry
+ is already present for the advertised IPv6 address. Without this knob,
+ NAs received for untracked addresses (absent in neighbor cache) are
+ silently ignored.
+
+ This is as per router-side behavior documented in RFC9131.
+
+ This has lower precedence than drop_unsolicited_na.
+
+ This will optimize the return path for the initial off-link
+ communication that is initiated by a directly connected host, by
+ ensuring that the first-hop router which turns on this setting doesn't
+ have to buffer the initial return packets to do neighbor-solicitation.
+ The prerequisite is that the host is configured to send unsolicited
+ neighbor advertisements on interface bringup. This setting should be
+ used in conjunction with the ndisc_notify setting on the host to
+ satisfy this prerequisite.
+
+ - 2 - Extend option (1) to add a new neighbor cache entry only if the
+ source IP address is in the same subnet as an address configured on
+ the interface that received the neighbor advertisement.
+
enhanced_dad - BOOLEAN
Include a nonce option in the IPv6 neighbor solicitation messages used for
duplicate address detection per RFC7527. A received DAD NS will only signal
@@ -2308,6 +2803,13 @@ echo_ignore_anycast - BOOLEAN
Default: 0
+error_anycast_as_unicast - BOOLEAN
+ If set to 1, then the kernel will respond with ICMP Errors
+ resulting from requests sent to it over the IPv6 protocol destined
+ to an anycast address, essentially treating anycast as unicast.
+
+ Default: 0
+
xfrm6_gc_thresh - INTEGER
(Obsolete since linux-4.14)
The threshold at which we will start garbage collecting for IPv6
@@ -2408,7 +2910,7 @@ pf_expose - INTEGER
can be got via SCTP_GET_PEER_ADDR_INFO sockopt; When it's enabled,
a SCTP_PEER_ADDR_CHANGE event will be sent for a transport becoming
SCTP_PF state and a SCTP_PF-state transport info can be got via
- SCTP_GET_PEER_ADDR_INFO sockopt; When it's diabled, no
+ SCTP_GET_PEER_ADDR_INFO sockopt; When it's disabled, no
SCTP_PEER_ADDR_CHANGE event will be sent and it returns -EACCES when
trying to get a SCTP_PF-state transport info via SCTP_GET_PEER_ADDR_INFO
sockopt.
@@ -2628,7 +3130,14 @@ sctp_rmem - vector of 3 INTEGERs: min, default, max
Default: 4K
sctp_wmem - vector of 3 INTEGERs: min, default, max
- Currently this tunable has no effect.
+ Only the first value ("min") is used, "default" and "max" are
+ ignored.
+
+ min: Minimum size of send buffer that can be used by SCTP sockets.
+ It is guaranteed to each SCTP socket (but not association) even
+ under moderate memory pressure.
+
+ Default: 4K
addr_scope_policy - INTEGER
Control IPv4 address scoping - draft-stewart-tsvwg-sctp-ipv4-00
@@ -2640,6 +3149,95 @@ addr_scope_policy - INTEGER
Default: 1
+udp_port - INTEGER
+ The listening port for the local UDP tunneling sock. Normally it
+ uses the IANA-assigned UDP port number 9899 (sctp-tunneling).
+
+ This UDP sock is used for processing the incoming UDP-encapsulated
+ SCTP packets (from RFC6951), and is shared by all applications in the
+ same net namespace. This UDP sock will be closed when the value is
+ set to 0.
+
+ The value will also be used to set the src port of the UDP header
+ for the outgoing UDP-encapsulated SCTP packets. For the dest port,
+ please refer to 'encap_port' below.
+
+ Default: 0
+
+encap_port - INTEGER
+ The default remote UDP encapsulation port.
+
+ This value is used to set the dest port of the UDP header for the
+ outgoing UDP-encapsulated SCTP packets by default. Users can also
+ change the value for each sock/asoc/transport by using setsockopt.
+ For further information, please refer to RFC6951.
+
+ Note that when connecting to a remote server, the client should set
+ this to the port that the UDP tunneling sock on the peer server is
+ listening on, and the local UDP tunneling sock on the client must
+ also be started. The server obtains the encap_port from the source
+ port of the incoming packet.
+
+ Default: 0
+
+plpmtud_probe_interval - INTEGER
+ The time interval (in milliseconds) for the PLPMTUD probe timer,
+ which is configured to expire after this period to receive an
+ acknowledgment to a probe packet. This is also the time interval
+ between the probes for the current pmtu when the probe search
+ is done.
+
+ PLPMTUD will be disabled when 0 is set, and other values for it
+ must be >= 5000.
+
+ Default: 0
+
+reconf_enable - BOOLEAN
+ Enable or disable extension of Stream Reconfiguration functionality
+ specified in RFC6525. This extension provides the ability to "reset"
+ a stream, and it includes the Parameters of "Outgoing/Incoming SSN
+ Reset", "SSN/TSN Reset" and "Add Outgoing/Incoming Streams".
+
+ - 1: Enable extension.
+ - 0: Disable extension.
+
+ Default: 0
+
+intl_enable - BOOLEAN
+ Enable or disable extension of User Message Interleaving functionality
+ specified in RFC8260. This extension allows the interleaving of user
+ messages sent on different streams. With this feature enabled, I-DATA
+ chunk will replace DATA chunk to carry user messages if also supported
+ by the peer. Note that to use this feature, one needs to set this option
+ to 1 and also needs to set socket options SCTP_FRAGMENT_INTERLEAVE to 2
+ and SCTP_INTERLEAVING_SUPPORTED to 1.
+
+ - 1: Enable extension.
+ - 0: Disable extension.
+
+ Default: 0
+
+ecn_enable - BOOLEAN
+ Control use of Explicit Congestion Notification (ECN) by SCTP.
+ Like in TCP, ECN is used only when both ends of the SCTP connection
+ indicate support for it. This feature is useful in avoiding losses
+ due to congestion by allowing supporting routers to signal congestion
+ before having to drop packets.
+
+ 1: Enable ecn.
+ 0: Disable ecn.
+
+ Default: 1
+
+l3mdev_accept - BOOLEAN
+ Enabling this option allows a "global" bound socket to work
+ across L3 master domains (e.g., VRFs) with packets capable of
+ being received regardless of the L3 domain in which they
+ originated. Only valid when the kernel was compiled with
+ CONFIG_NET_L3_MASTER_DEV.
+
+ Default: 1 (enabled)
+
``/proc/sys/net/core/*``
========================
diff --git a/Documentation/networking/ipddp.rst b/Documentation/networking/ipddp.rst
deleted file mode 100644
index be7091b77927..000000000000
--- a/Documentation/networking/ipddp.rst
+++ /dev/null
@@ -1,78 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=========================================================
-AppleTalk-IP Decapsulation and AppleTalk-IP Encapsulation
-=========================================================
-
-Documentation ipddp.c
-
-This file is written by Jay Schulist <jschlst@samba.org>
-
-Introduction
-------------
-
-AppleTalk-IP (IPDDP) is the method computers connected to AppleTalk
-networks can use to communicate via IP. AppleTalk-IP is simply IP datagrams
-inside AppleTalk packets.
-
-Through this driver you can either allow your Linux box to communicate
-IP over an AppleTalk network or you can provide IP gatewaying functions
-for your AppleTalk users.
-
-You can currently encapsulate or decapsulate AppleTalk-IP on LocalTalk,
-EtherTalk and PPPTalk. The only limit on the protocol is that of what
-kernel AppleTalk layer and drivers are available.
-
-Each mode requires its own user space software.
-
-Compiling AppleTalk-IP Decapsulation/Encapsulation
-==================================================
-
-AppleTalk-IP decapsulation needs to be compiled into your kernel. You
-will need to turn on AppleTalk-IP driver support. Then you will need to
-select ONE of the two options; IP to AppleTalk-IP encapsulation support or
-AppleTalk-IP to IP decapsulation support. If you compile the driver
-statically you will only be able to use the driver for the function you have
-enabled in the kernel. If you compile the driver as a module you can
-select what mode you want it to run in via a module loading param.
-ipddp_mode=1 for AppleTalk-IP encapsulation and ipddp_mode=2 for
-AppleTalk-IP to IP decapsulation.
-
-Basic instructions for user space tools
-=======================================
-
-I will briefly describe the operation of the tools, but you will
-need to consult the supporting documentation for each set of tools.
-
-Decapsulation - You will need to download a software package called
-MacGate. In this distribution there will be a tool called MacRoute
-which enables you to add routes to the kernel for your Macs by hand.
-Also the tool MacRegGateWay is included to register the
-proper IP Gateway and IP addresses for your machine. Included in this
-distribution is a patch to netatalk-1.4b2+asun2.0a17.2 (available from
-ftp.u.washington.edu/pub/user-supported/asun/) this patch is optional
-but it allows automatic adding and deleting of routes for Macs. (Handy
-for locations with large Mac installations)
-
-Encapsulation - You will need to download a software daemon called ipddpd.
-This software expects there to be an AppleTalk-IP gateway on the network.
-You will also need to add the proper routes to route your Linux box's IP
-traffic out the ipddp interface.
-
-Common Uses of ipddp.c
-----------------------
-Of course AppleTalk-IP decapsulation and encapsulation, but specifically
-decapsulation is being used most for connecting LocalTalk networks to
-IP networks. Although it has been used on EtherTalk networks to allow
-Macs that are only able to tunnel IP over EtherTalk.
-
-Encapsulation has been used to allow a Linux box stuck on a LocalTalk
-network to use IP. It should work equally well if you are stuck on an
-EtherTalk only network.
-
-Further Assistance
--------------------
-You can contact me (Jay Schulist <jschlst@samba.org>) with any
-questions regarding decapsulation or encapsulation. Bradford W. Johnson
-<johns393@maroon.tc.umn.edu> originally wrote the ipddp.c driver for IP
-encapsulation in AppleTalk.
diff --git a/Documentation/networking/ipvlan.rst b/Documentation/networking/ipvlan.rst
index 694adcba36b0..895d0ccfd596 100644
--- a/Documentation/networking/ipvlan.rst
+++ b/Documentation/networking/ipvlan.rst
@@ -11,7 +11,7 @@ Initial Release:
================
This is conceptually very similar to the macvlan driver with one major
exception of using L3 for mux-ing /demux-ing among slaves. This property makes
-the master device share the L2 with it's slave devices. I have developed this
+the master device share the L2 with its slave devices. I have developed this
driver in conjunction with network namespaces and not sure if there is use case
outside of it.
@@ -61,7 +61,7 @@ e.g.
IPvlan has two modes of operation - L2 and L3. For a given master device,
you can select one of these two modes and all slaves on that master will
operate in the same (selected) mode. The RX mode is almost identical except
-that in L3 mode the slaves wont receive any multicast / broadcast traffic.
+that in L3 mode the slaves won't receive any multicast / broadcast traffic.
L3 mode is more restrictive since routing is controlled from the other (mostly)
default namespace.
diff --git a/Documentation/networking/ipvs-sysctl.rst b/Documentation/networking/ipvs-sysctl.rst
index 2afccc63856e..3fb5fa142eef 100644
--- a/Documentation/networking/ipvs-sysctl.rst
+++ b/Documentation/networking/ipvs-sysctl.rst
@@ -37,8 +37,7 @@ conn_reuse_mode - INTEGER
0: disable any special handling on port reuse. The new
connection will be delivered to the same real server that was
- servicing the previous connection. This will effectively
- disable expire_nodest_conn.
+ servicing the previous connection.
bit 1: enable rescheduling of new connections when it is safe.
That is, whenever expire_nodest_conn and for TCP sockets, when
@@ -130,6 +129,26 @@ drop_packet - INTEGER
threshold. When the mode 3 is set, the always mode drop rate
is controlled by the /proc/sys/net/ipv4/vs/am_droprate.
+est_cpulist - CPULIST
+ Allowed CPUs for estimation kthreads
+
+ Syntax: standard cpulist format
+ empty list - stop kthread tasks and estimation
+ default - the system's housekeeping CPUs for kthreads
+
+ Example:
+ "all": all possible CPUs
+ "0-N": all possible CPUs, N denotes last CPU number
+ "0,1-N:1/2": first and all CPUs with odd number
+ "": empty list
+
+est_nice - INTEGER
+ default 0
+ Valid range: -20 (more favorable) .. 19 (less favorable)
+
+ Niceness value to use for the estimation kthreads (scheduling
+ priority)
+
expire_nodest_conn - BOOLEAN
- 0 - disabled (default)
- not 0 - enabled
@@ -300,3 +319,14 @@ sync_version - INTEGER
Kernels with this sync_version entry are able to receive messages
of both version 1 and version 2 of the synchronisation protocol.
+
+run_estimation - BOOLEAN
+ 0 - disabled
+ not 0 - enabled (default)
+
+ If disabled, the estimation will be suspended and kthread tasks
+ stopped.
+
+ You can always re-enable estimation by setting this value to 1.
+ But be careful, the first estimation after re-enable is not
+ accurate.
diff --git a/Documentation/networking/j1939.rst b/Documentation/networking/j1939.rst
index f5be243d250a..e4bd7aa1f5aa 100644
--- a/Documentation/networking/j1939.rst
+++ b/Documentation/networking/j1939.rst
@@ -10,9 +10,9 @@ Overview / What Is J1939
SAE J1939 defines a higher layer protocol on CAN. It implements a more
sophisticated addressing scheme and extends the maximum packet size above 8
bytes. Several derived specifications exist, which differ from the original
-J1939 on the application level, like MilCAN A, NMEA2000 and especially
+J1939 on the application level, like MilCAN A, NMEA2000, and especially
ISO-11783 (ISOBUS). This last one specifies the so-called ETP (Extended
-Transport Protocol) which is has been included in this implementation. This
+Transport Protocol), which has been included in this implementation. This
results in a maximum packet size of ((2 ^ 24) - 1) * 7 bytes == 111 MiB.
Specifications used
@@ -32,15 +32,15 @@ sockets, we found some reasons to justify a kernel implementation for the
addressing and transport methods used by J1939.
* **Addressing:** when a process on an ECU communicates via J1939, it should
- not necessarily know its source address. Although at least one process per
+ not necessarily know its source address. However, at least one process per
ECU should know the source address. Other processes should be able to reuse
that address. This way, address parameters for different processes
cooperating for the same ECU, are not duplicated. This way of working is
- closely related to the UNIX concept where programs do just one thing, and do
+ closely related to the UNIX concept, where programs do just one thing and do
it well.
* **Dynamic addressing:** Address Claiming in J1939 is time critical.
- Furthermore data transport should be handled properly during the address
+ Furthermore, data transport should be handled properly during the address
negotiation. Putting this functionality in the kernel eliminates it as a
requirement for _every_ user space process that communicates via J1939. This
results in a consistent J1939 bus with proper addressing.
@@ -58,7 +58,7 @@ Therefore, these parts are left to user space.
The J1939 sockets operate on CAN network devices (see SocketCAN). Any J1939
user space library operating on CAN raw sockets will still operate properly.
-Since such library does not communicate with the in-kernel implementation, care
+Since such a library does not communicate with the in-kernel implementation, care
must be taken that these two do not interfere. In practice, this means they
cannot share ECU addresses. A single ECU (or virtual ECU) address is used by
the library exclusively, or by the in-kernel system exclusively.
@@ -69,21 +69,59 @@ J1939 concepts
PGN
---
+The J1939 protocol uses the 29-bit CAN identifier with the following structure:
+
+ ============ ============== ====================
+ 29 bit CAN-ID
+ --------------------------------------------------
+ Bit positions within the CAN-ID
+ --------------------------------------------------
+ 28 ... 26 25 ... 8 7 ... 0
+ ============ ============== ====================
+ Priority PGN SA (Source Address)
+ ============ ============== ====================
+
The PGN (Parameter Group Number) is a number to identify a packet. The PGN
is composed as follows:
-1 bit : Reserved Bit
-1 bit : Data Page
-8 bits : PF (PDU Format)
-8 bits : PS (PDU Specific)
+
+ ============ ============== ================= =================
+ PGN
+ ------------------------------------------------------------------
+ Bit positions within the CAN-ID
+ ------------------------------------------------------------------
+ 25 24 23 ... 16 15 ... 8
+ ============ ============== ================= =================
+ R (Reserved) DP (Data Page) PF (PDU Format) PS (PDU Specific)
+ ============ ============== ================= =================
In J1939-21 distinction is made between PDU1 format (where PF < 240) and PDU2
-format (where PF >= 240). Furthermore, when using PDU2 format, the PS-field
+format (where PF >= 240). Furthermore, when using the PDU2 format, the PS-field
contains a so-called Group Extension, which is part of the PGN. When using PDU2
format, the Group Extension is set in the PS-field.
+ ============== ========================
+ PDU1 Format (specific) (peer to peer)
+ ----------------------------------------
+ Bit positions within the CAN-ID
+ ----------------------------------------
+ 23 ... 16 15 ... 8
+ ============== ========================
+ 00h ... EFh DA (Destination address)
+ ============== ========================
+
+ ============== ========================
+ PDU2 Format (global) (broadcast)
+ ----------------------------------------
+ Bit positions within the CAN-ID
+ ----------------------------------------
+ 23 ... 16 15 ... 8
+ ============== ========================
+ F0h ... FFh GE (Group Extension)
+ ============== ========================
+
On the other hand, when using PDU1 format, the PS-field contains a so-called
Destination Address, which is _not_ part of the PGN. When communicating a PGN
-from user space to kernel (or visa versa) and PDU2 format is used, the PS-field
+from user space to kernel (or vice versa) and PDU1 format is used, the PS-field
of the PGN shall be set to zero. The Destination Address shall be set
elsewhere.
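The bit layout described above can be illustrated with a minimal sketch that
splits a 29-bit CAN identifier into priority, PGN, source address and
destination address. The type ``j1939_id_fields`` and the function
``j1939_split_can_id()`` are hypothetical names chosen for this illustration
and are not part of ``<linux/can/j1939.h>``. Reporting 0xff as the destination
for PDU2 frames is likewise an assumption of the sketch, used here to mark
"no specific destination".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type for illustration only. */
struct j1939_id_fields {
    uint8_t  priority;  /* CAN-ID bits 28..26 */
    uint32_t pgn;       /* CAN-ID bits 25..8, PS cleared for PDU1 */
    uint8_t  sa;        /* CAN-ID bits 7..0 (Source Address) */
    uint8_t  da;        /* PDU1: PS field, PDU2: 0xff (assumed marker) */
};

/* Split a 29-bit J1939 CAN identifier into its fields. */
static struct j1939_id_fields j1939_split_can_id(uint32_t can_id)
{
    struct j1939_id_fields f;
    uint8_t pf, ps;

    /* Only the 29 identifier bits may be set (no EFF/RTR/ERR flags). */
    assert(can_id <= 0x1fffffffu);

    pf = (can_id >> 16) & 0xff;     /* PDU Format, bits 23..16 */
    ps = (can_id >> 8) & 0xff;      /* PDU Specific, bits 15..8 */

    f.priority = (can_id >> 26) & 0x7;
    f.pgn = (can_id >> 8) & 0x3ffff; /* R, DP, PF and PS */
    f.sa = can_id & 0xff;

    if (pf < 240) {
        /* PDU1: PS is the Destination Address, not part of the PGN. */
        f.pgn &= 0x3ff00;
        f.da = ps;
    } else {
        /* PDU2: PS is the Group Extension and stays in the PGN. */
        f.da = 0xff;
    }

    return f;
}
```

For example, under these assumptions the identifier 0x0CEA3012 would yield
priority 3, PGN 0x0ea00, source address 0x12 and destination address 0x30,
while 0x18FEF100 (PF >= 240) keeps its Group Extension and yields PGN 0x0fef1.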
@@ -96,15 +134,15 @@ Addressing
Both static and dynamic addressing methods can be used.
-For static addresses, no extra checks are made by the kernel, and provided
+For static addresses, no extra checks are made by the kernel and provided
addresses are considered right. This responsibility is for the OEM or system
integrator.
For dynamic addressing, so-called Address Claiming, extra support is foreseen
-in the kernel. In J1939 any ECU is known by it's 64-bit NAME. At the moment of
+in the kernel. In J1939 any ECU is known by its 64-bit NAME. At the moment of
a successful address claim, the kernel keeps track of both NAME and source
address being claimed. This serves as a base for filter schemes. By default,
-packets with a destination that is not locally, will be rejected.
+packets with a destination that is not local will be rejected.
Mixed mode packets (from a static to a dynamic address or vice versa) are
allowed. The BSD sockets define separate API calls for getting/setting the
@@ -131,31 +169,31 @@ API Calls
---------
On CAN, you first need to open a socket for communicating over a CAN network.
-To use J1939, #include <linux/can/j1939.h>. From there, <linux/can.h> will be
+To use J1939, ``#include <linux/can/j1939.h>``. From there, ``<linux/can.h>`` will be
included too. To open a socket, use:
.. code-block:: C
s = socket(PF_CAN, SOCK_DGRAM, CAN_J1939);
-J1939 does use SOCK_DGRAM sockets. In the J1939 specification, connections are
+J1939 does use ``SOCK_DGRAM`` sockets. In the J1939 specification, connections are
mentioned in the context of transport protocol sessions. These still deliver
-packets to the other end (using several CAN packets). SOCK_STREAM is not
+packets to the other end (using several CAN packets). ``SOCK_STREAM`` is not
supported.
-After the successful creation of the socket, you would normally use the bind(2)
-and/or connect(2) system call to bind the socket to a CAN interface. After
-binding and/or connecting the socket, you can read(2) and write(2) from/to the
-socket or use send(2), sendto(2), sendmsg(2) and the recv*() counterpart
+After the successful creation of the socket, you would normally use the ``bind(2)``
+and/or ``connect(2)`` system call to bind the socket to a CAN interface. After
+binding and/or connecting the socket, you can ``read(2)`` and ``write(2)`` from/to the
+socket or use ``send(2)``, ``sendto(2)``, ``sendmsg(2)`` and the ``recv*()`` counterpart
operations on the socket as usual. There are also J1939 specific socket options
described below.
-In order to send data, a bind(2) must have been successful. bind(2) assigns a
+In order to send data, a ``bind(2)`` must have been successful. ``bind(2)`` assigns a
local address to a socket.
-Different from CAN is that the payload data is just the data that get send,
-without it's header info. The header info is derived from the sockaddr supplied
-to bind(2), connect(2), sendto(2) and recvfrom(2). A write(2) with size 4 will
+Different from CAN is that the payload data is just the data that gets sent,
+without its header info. The header info is derived from the sockaddr supplied
+to ``bind(2)``, ``connect(2)``, ``sendto(2)`` and ``recvfrom(2)``. A ``write(2)`` with size 4 will
result in a packet with 4 bytes.
The sockaddr structure has extensions for use with J1939 as specified below:
@@ -180,47 +218,47 @@ The sockaddr structure has extensions for use with J1939 as specified below:
} can_addr;
}
-can_family & can_ifindex serve the same purpose as for other SocketCAN sockets.
+``can_family`` & ``can_ifindex`` serve the same purpose as for other SocketCAN sockets.
-can_addr.j1939.pgn specifies the PGN (max 0x3ffff). Individual bits are
+``can_addr.j1939.pgn`` specifies the PGN (max 0x3ffff). Individual bits are
specified above.
-can_addr.j1939.name contains the 64-bit J1939 NAME.
+``can_addr.j1939.name`` contains the 64-bit J1939 NAME.
-can_addr.j1939.addr contains the address.
+``can_addr.j1939.addr`` contains the address.
-The bind(2) system call assigns the local address, i.e. the source address when
-sending packages. If a PGN during bind(2) is set, it's used as a RX filter.
-I.e. only packets with a matching PGN are received. If an ADDR or NAME is set
+The ``bind(2)`` system call assigns the local address, i.e. the source address when
+sending packets. If a PGN is set during ``bind(2)``, it is used as an RX filter.
+I.e. only packets with a matching PGN are received. If an ADDR or NAME is set
it is used as a receive filter, too. It will match the destination NAME or ADDR
of the incoming packet. The NAME filter will work only if appropriate Address
Claiming for this name was done on the CAN bus and registered/cached by the
kernel.
-On the other hand connect(2) assigns the remote address, i.e. the destination
-address. The PGN from connect(2) is used as the default PGN when sending
+On the other hand, ``connect(2)`` assigns the remote address, i.e. the destination
+address. The PGN from ``connect(2)`` is used as the default PGN when sending
packets. If ADDR or NAME is set it will be used as the default destination ADDR
-or NAME. Further a set ADDR or NAME during connect(2) is used as a receive
+or NAME. Furthermore, an ADDR or NAME set during ``connect(2)`` is used as a receive
filter. It will match the source NAME or ADDR of the incoming packet.
-Both write(2) and send(2) will send a packet with local address from bind(2) and
-the remote address from connect(2). Use sendto(2) to overwrite the destination
+Both ``write(2)`` and ``send(2)`` will send a packet with local address from ``bind(2)`` and the
+remote address from ``connect(2)``. Use ``sendto(2)`` to override the destination
address.
-If can_addr.j1939.name is set (!= 0) the NAME is looked up by the kernel and
-the corresponding ADDR is used. If can_addr.j1939.name is not set (== 0),
-can_addr.j1939.addr is used.
+If ``can_addr.j1939.name`` is set (!= 0) the NAME is looked up by the kernel and
+the corresponding ADDR is used. If ``can_addr.j1939.name`` is not set (== 0),
+``can_addr.j1939.addr`` is used.
When creating a socket, reasonable defaults are set. Some options can be
-modified with setsockopt(2) & getsockopt(2).
+modified with ``setsockopt(2)`` & ``getsockopt(2)``.
RX path related options:
-- SO_J1939_FILTER - configure array of filters
-- SO_J1939_PROMISC - disable filters set by bind(2) and connect(2)
+- ``SO_J1939_FILTER`` - configure array of filters
+- ``SO_J1939_PROMISC`` - disable filters set by ``bind(2)`` and ``connect(2)``
By default no broadcast packets can be sent or received. To enable sending or
-receiving broadcast packets use the socket option SO_BROADCAST:
+receiving broadcast packets use the socket option ``SO_BROADCAST``:
.. code-block:: C
@@ -261,26 +299,26 @@ The following diagram illustrates the RX path:
+---------------------------+
TX path related options:
-SO_J1939_SEND_PRIO - change default send priority for the socket
+``SO_J1939_SEND_PRIO`` - change default send priority for the socket
Message Flags during send() and Related System Calls
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-send(2), sendto(2) and sendmsg(2) take a 'flags' argument. Currently
+``send(2)``, ``sendto(2)`` and ``sendmsg(2)`` take a 'flags' argument. Currently
supported flags are:
-* MSG_DONTWAIT, i.e. non-blocking operation.
+* ``MSG_DONTWAIT``, i.e. non-blocking operation.
recvmsg(2)
^^^^^^^^^^
-In most cases recvmsg(2) is needed if you want to extract more information than
-recvfrom(2) can provide. For example package priority and timestamp. The
+In most cases ``recvmsg(2)`` is needed if you want to extract more information than
+``recvfrom(2)`` can provide, for example packet priority and timestamp. The
Destination Address, name and packet priority (if applicable) are attached to
-the msghdr in the recvmsg(2) call. They can be extracted using cmsg(3) macros,
-with cmsg_level == SOL_J1939 && cmsg_type == SCM_J1939_DEST_ADDR,
-SCM_J1939_DEST_NAME or SCM_J1939_PRIO. The returned data is a uint8_t for
-priority and dst_addr, and uint64_t for dst_name.
+the msghdr in the ``recvmsg(2)`` call. They can be extracted using ``cmsg(3)`` macros,
+with ``cmsg_level == SOL_J1939 && cmsg_type == SCM_J1939_DEST_ADDR``,
+``SCM_J1939_DEST_NAME`` or ``SCM_J1939_PRIO``. The returned data is a ``uint8_t`` for
+``priority`` and ``dst_addr``, and ``uint64_t`` for ``dst_name``.
.. code-block:: C
@@ -305,12 +343,12 @@ Dynamic Addressing
Distinction has to be made between using the claimed address and doing an
address claim. To use an already claimed address, one has to fill in the
-j1939.name member and provide it to bind(2). If the name had claimed an address
+``j1939.name`` member and provide it to ``bind(2)``. If the name had claimed an address
earlier, all further messages being sent will use that address. And the
-j1939.addr member will be ignored.
+``j1939.addr`` member will be ignored.
An exception on this is PGN 0x0ee00. This is the "Address Claim/Cannot Claim
-Address" message and the kernel will use the j1939.addr member for that PGN if
+Address" message and the kernel will use the ``j1939.addr`` member for that PGN if
necessary.
To claim an address, the following code example can be used:
@@ -371,12 +409,12 @@ NAME can send packets.
If another ECU claims the address, the kernel will mark the NAME-SA expired.
No socket bound to the NAME can send packets (other than address claims). To
-claim another address, some socket bound to NAME, must bind(2) again, but with
-only j1939.addr changed to the new SA, and must then send a valid address claim
+claim another address, some socket bound to NAME must ``bind(2)`` again, but with
+only ``j1939.addr`` changed to the new SA, and must then send a valid address claim
packet. This restarts the state machine in the kernel (and any other
participant on the bus) for this NAME.
-can-utils also include the jacd tool, so it can be used as code example or as
+``can-utils`` also includes the ``j1939acd`` tool, which can be used as a code example or as a
default Address Claiming daemon.
Send Examples
@@ -403,8 +441,8 @@ Bind:
bind(sock, (struct sockaddr *)&baddr, sizeof(baddr));
-Now, the socket 'sock' is bound to the SA 0x20. Since no connect(2) was called,
-at this point we can use only sendto(2) or sendmsg(2).
+Now, the socket 'sock' is bound to the SA 0x20. Since no ``connect(2)`` was called,
+at this point we can use only ``sendto(2)`` or ``sendmsg(2)``.
Send:
@@ -414,8 +452,8 @@ Send:
.can_family = AF_CAN,
.can_addr.j1939 = {
 .name = J1939_NO_NAME,
- .pgn = 0x30,
- .addr = 0x12300,
+ .addr = 0x30,
+ .pgn = 0x12300,
},
};
diff --git a/Documentation/networking/kapi.rst b/Documentation/networking/kapi.rst
index f03ae64be8bc..ea55f462cefa 100644
--- a/Documentation/networking/kapi.rst
+++ b/Documentation/networking/kapi.rst
@@ -83,27 +83,6 @@ SUN RPC subsystem
.. kernel-doc:: net/sunrpc/clnt.c
:export:
-WiMAX
------
-
-.. kernel-doc:: net/wimax/op-msg.c
- :export:
-
-.. kernel-doc:: net/wimax/op-reset.c
- :export:
-
-.. kernel-doc:: net/wimax/op-rfkill.c
- :export:
-
-.. kernel-doc:: net/wimax/stack.c
- :export:
-
-.. kernel-doc:: include/net/wimax.h
- :internal:
-
-.. kernel-doc:: include/uapi/linux/wimax.h
- :internal:
-
Network device support
======================
@@ -134,6 +113,15 @@ PHY Support
.. kernel-doc:: drivers/net/phy/phy.c
:internal:
+.. kernel-doc:: drivers/net/phy/phy-core.c
+ :export:
+
+.. kernel-doc:: drivers/net/phy/phy-c45.c
+ :export:
+
+.. kernel-doc:: include/linux/phy.h
+ :internal:
+
.. kernel-doc:: drivers/net/phy/phy_device.c
:export:
diff --git a/Documentation/networking/l2tp.rst b/Documentation/networking/l2tp.rst
index a48238a2ec09..8496b467dea4 100644
--- a/Documentation/networking/l2tp.rst
+++ b/Documentation/networking/l2tp.rst
@@ -4,124 +4,364 @@
L2TP
====
-This document describes how to use the kernel's L2TP drivers to
-provide L2TP functionality. L2TP is a protocol that tunnels one or
-more sessions over an IP tunnel. It is commonly used for VPNs
-(L2TP/IPSec) and by ISPs to tunnel subscriber PPP sessions over an IP
-network infrastructure. With L2TPv3, it is also useful as a Layer-2
-tunneling infrastructure.
-
-Features
+Layer 2 Tunneling Protocol (L2TP) allows L2 frames to be tunneled over
+an IP network.
+
+This document covers the kernel's L2TP subsystem. It documents kernel
+APIs for application developers who want to use the L2TP subsystem and
+it provides some technical details about the internal implementation
+which may be useful to kernel developers and maintainers.
+
+Overview
========
-L2TPv2 (PPP over L2TP (UDP tunnels)).
-L2TPv3 ethernet pseudowires.
-L2TPv3 PPP pseudowires.
-L2TPv3 IP encapsulation.
-Netlink sockets for L2TPv3 configuration management.
-
-History
-=======
-
-The original pppol2tp driver was introduced in 2.6.23 and provided
-L2TPv2 functionality (rfc2661). L2TPv2 is used to tunnel one or more PPP
-sessions over a UDP tunnel.
-
-L2TPv3 (rfc3931) changes the protocol to allow different frame types
-to be passed over an L2TP tunnel by moving the PPP-specific parts of
-the protocol out of the core L2TP packet headers. Each frame type is
-known as a pseudowire type. Ethernet, PPP, HDLC, Frame Relay and ATM
-pseudowires for L2TP are defined in separate RFC standards. Another
-change for L2TPv3 is that it can be carried directly over IP with no
-UDP header (UDP is optional). It is also possible to create static
-unmanaged L2TPv3 tunnels manually without a control protocol
-(userspace daemon) to manage them.
-
-To support L2TPv3, the original pppol2tp driver was split up to
-separate the L2TP and PPP functionality. Existing L2TPv2 userspace
-apps should be unaffected as the original pppol2tp sockets API is
-retained. L2TPv3, however, uses netlink to manage L2TPv3 tunnels and
-sessions.
-
-Design
-======
-
-The L2TP protocol separates control and data frames. The L2TP kernel
-drivers handle only L2TP data frames; control frames are always
-handled by userspace. L2TP control frames carry messages between L2TP
-clients/servers and are used to setup / teardown tunnels and
-sessions. An L2TP client or server is implemented in userspace.
-
-Each L2TP tunnel is implemented using a UDP or L2TPIP socket; L2TPIP
-provides L2TPv3 IP encapsulation (no UDP) and is implemented using a
-new l2tpip socket family. The tunnel socket is typically created by
-userspace, though for unmanaged L2TPv3 tunnels, the socket can also be
-created by the kernel. Each L2TP session (pseudowire) gets a network
-interface instance. In the case of PPP, these interfaces are created
-indirectly by pppd using a pppol2tp socket. In the case of ethernet,
-the netdevice is created upon a netlink request to create an L2TPv3
-ethernet pseudowire.
-
-For PPP, the PPPoL2TP driver, net/l2tp/l2tp_ppp.c, provides a
-mechanism by which PPP frames carried through an L2TP session are
-passed through the kernel's PPP subsystem. The standard PPP daemon,
-pppd, handles all PPP interaction with the peer. PPP network
-interfaces are created for each local PPP endpoint. The kernel's PPP
-subsystem arranges for PPP control frames to be delivered to pppd,
-while data frames are forwarded as usual.
-
-For ethernet, the L2TPETH driver, net/l2tp/l2tp_eth.c, implements a
-netdevice driver, managing virtual ethernet devices, one per
-pseudowire. These interfaces can be managed using standard Linux tools
-such as "ip" and "ifconfig". If only IP frames are passed over the
-tunnel, the interface can be given an IP addresses of itself and its
-peer. If non-IP frames are to be passed over the tunnel, the interface
-can be added to a bridge using brctl. All L2TP datapath protocol
-functions are handled by the L2TP core driver.
-
-Each tunnel and session within a tunnel is assigned a unique tunnel_id
-and session_id. These ids are carried in the L2TP header of every
-control and data packet. (Actually, in L2TPv3, the tunnel_id isn't
-present in data frames - it is inferred from the IP connection on
-which the packet was received.) The L2TP driver uses the ids to lookup
-internal tunnel and/or session contexts to determine how to handle the
-packet. Zero tunnel / session ids are treated specially - zero ids are
-never assigned to tunnels or sessions in the network. In the driver,
-the tunnel context keeps a reference to the tunnel UDP or L2TPIP
-socket. The session context holds data that lets the driver interface
-to the kernel's network frame type subsystems, i.e. PPP, ethernet.
-
-Userspace Programming
-=====================
-
-For L2TPv2, there are a number of requirements on the userspace L2TP
-daemon in order to use the pppol2tp driver.
-
-1. Use a UDP socket per tunnel.
-
-2. Create a single PPPoL2TP socket per tunnel bound to a special null
- session id. This is used only for communicating with the driver but
- must remain open while the tunnel is active. Opening this tunnel
- management socket causes the driver to mark the tunnel socket as an
- L2TP UDP encapsulation socket and flags it for use by the
- referenced tunnel id. This hooks up the UDP receive path via
- udp_encap_rcv() in net/ipv4/udp.c. PPP data frames are never passed
- in this special PPPoX socket.
-
-3. Create a PPPoL2TP socket per L2TP session. This is typically done
- by starting pppd with the pppol2tp plugin and appropriate
- arguments. A PPPoL2TP tunnel management socket (Step 2) must be
- created before the first PPPoL2TP session socket is created.
+The kernel's L2TP subsystem implements the datapath for L2TPv2 and
+L2TPv3. L2TPv2 is carried over UDP. L2TPv3 is carried over UDP or
+directly over IP (protocol 115).
+
+The L2TP RFCs define two basic kinds of L2TP packets: control packets
+(the "control plane"), and data packets (the "data plane"). The kernel
+deals only with data packets. The more complex control packets are
+handled by user space.
+
+An L2TP tunnel carries one or more L2TP sessions. Each tunnel is
+associated with a socket. Each session is associated with a virtual
+netdevice, e.g. ``pppN``, ``l2tpethN``, through which data frames pass
+to/from L2TP. Fields in the L2TP header identify the tunnel or session
+and whether it is a control or data packet. When tunnels and sessions
+are set up using the Linux kernel API, we're just setting up the L2TP
+data path. All aspects of the control protocol are to be handled by
+user space.
+
+This split in responsibilities leads to a natural sequence of
+operations when establishing tunnels and sessions. The procedure looks
+like this:
+
+ 1) Create a tunnel socket. Exchange L2TP control protocol messages
+ with the peer over that socket in order to establish a tunnel.
+
+ 2) Create a tunnel context in the kernel, using information
+ obtained from the peer using the control protocol messages.
+
+ 3) Exchange L2TP control protocol messages with the peer over the
+ tunnel socket in order to establish a session.
+
+ 4) Create a session context in the kernel using information
+ obtained from the peer using the control protocol messages.
+
+L2TP APIs
+=========
+
+This section documents each userspace API of the L2TP subsystem.
+
+Tunnel Sockets
+--------------
+
+L2TPv2 always uses UDP. L2TPv3 may use UDP or IP encapsulation.
+
+To create a tunnel socket for use by L2TP, the standard POSIX
+socket API is used.
+
+For example, for a tunnel using IPv4 addresses and UDP encapsulation::
+
+ int sockfd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
+
+Or for a tunnel using IPv6 addresses and IP encapsulation::
+
+ int sockfd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_L2TP);
+
+UDP socket programming doesn't need to be covered here.
+
+IPPROTO_L2TP is an IP protocol type implemented by the kernel's L2TP
+subsystem. The L2TPIP socket address is defined in struct
+sockaddr_l2tpip and struct sockaddr_l2tpip6 at
+`include/uapi/linux/l2tp.h`_. The address includes the L2TP tunnel
+(connection) id. To use L2TP IP encapsulation, an L2TPv3 application
+should bind the L2TPIP socket using the locally assigned
+tunnel id. When the peer's tunnel id and IP address are known, a
+connect must be done.
+
+If the L2TP application needs to handle L2TPv3 tunnel setup requests
+from peers using L2TPIP, it must open a dedicated L2TPIP
+socket to listen for those requests and bind the socket using tunnel
+id 0 since tunnel setup requests are addressed to tunnel id 0.
+
+An L2TP tunnel and all of its sessions are automatically closed when
+its tunnel socket is closed.
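
The L2TPIP addressing described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the helper name ``l2tpip_fill_addr`` is hypothetical, only the IPv4 ``struct sockaddr_l2tpip`` from the uapi header is covered, and the conn id is assumed to be given in host byte order.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/l2tp.h>

/*
 * Hypothetical helper: fill an IPv4 L2TPIP socket address.
 * conn_id is the tunnel (connection) id; use 0 for a socket that
 * listens for tunnel setup requests.
 * Returns 0 on success, -1 if ip is not a valid IPv4 address string.
 */
int l2tpip_fill_addr(struct sockaddr_l2tpip *sa, const char *ip,
                     uint32_t conn_id)
{
    memset(sa, 0, sizeof(*sa));
    sa->l2tp_family = AF_INET;
    sa->l2tp_conn_id = conn_id;
    if (inet_pton(AF_INET, ip, &sa->l2tp_addr) != 1)
        return -1;
    return 0;
}
```

An application would fill one address with its local IP and locally assigned tunnel id and pass it to bind(), then fill a second one with the peer's IP and tunnel id and pass it to connect() once those are known.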
+
+Netlink API
+-----------
+
+L2TP applications use netlink to manage L2TP tunnel and session
+instances in the kernel. The L2TP netlink API is defined in
+`include/uapi/linux/l2tp.h`_.
+
+L2TP uses `Generic Netlink`_ (GENL). Several commands are defined:
+Create, Delete, Modify and Get for tunnel and session
+instances, e.g. ``L2TP_CMD_TUNNEL_CREATE``. The API header lists the
+netlink attribute types that can be used with each command.
+
+Tunnel and session instances are identified by a locally unique
+32-bit id. L2TP tunnel ids are given by ``L2TP_ATTR_CONN_ID`` and
+``L2TP_ATTR_PEER_CONN_ID`` attributes and L2TP session ids are given
+by ``L2TP_ATTR_SESSION_ID`` and ``L2TP_ATTR_PEER_SESSION_ID``
+attributes. If netlink is used to manage L2TPv2 tunnel and session
+instances, the L2TPv2 16-bit tunnel/session id is cast to a 32-bit
+value in these attributes.
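
As a small sketch of the id handling described above, the helpers below widen an L2TPv2 id for use in a 32-bit netlink attribute and check that a value is representable for a given protocol version. Both function names are hypothetical and are not part of any kernel or library API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Widen an L2TPv2 16-bit id for use in a 32-bit netlink attribute. */
uint32_t l2tp_v2_id_to_attr(uint16_t id)
{
    return (uint32_t)id;
}

/*
 * Check that a 32-bit attribute value can be carried on the wire by
 * the given L2TP protocol version (2 or 3).
 */
bool l2tp_id_fits(unsigned int proto_version, uint32_t id)
{
    if (proto_version == 2)
        return id <= UINT16_MAX;
    return proto_version == 3;
}
```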
+
+In the ``L2TP_CMD_TUNNEL_CREATE`` command, ``L2TP_ATTR_FD`` tells the
+kernel the tunnel socket fd being used. If not specified, the kernel
+creates a kernel socket for the tunnel, using IP parameters set in
+``L2TP_ATTR_IP[6]_SADDR``, ``L2TP_ATTR_IP[6]_DADDR``,
+``L2TP_ATTR_UDP_SPORT``, ``L2TP_ATTR_UDP_DPORT`` attributes. Kernel
+sockets are used to implement unmanaged L2TPv3 tunnels (iproute2's "ip
+l2tp" commands). If ``L2TP_ATTR_FD`` is given, it must be a socket fd
+that is already bound and connected. There is more information about
+unmanaged tunnels later in this document.
+
+``L2TP_CMD_TUNNEL_CREATE`` attributes:-
+
+================== ======== ===
+Attribute Required Use
+================== ======== ===
+CONN_ID Y Sets the tunnel (connection) id.
+PEER_CONN_ID Y Sets the peer tunnel (connection) id.
+PROTO_VERSION Y Protocol version. 2 or 3.
+ENCAP_TYPE Y Encapsulation type: UDP or IP.
+FD N Tunnel socket file descriptor.
+UDP_CSUM N Enable IPv4 UDP checksums. Used only if FD is
+ not set.
+UDP_ZERO_CSUM6_TX N Zero IPv6 UDP checksum on transmit. Used only
+ if FD is not set.
+UDP_ZERO_CSUM6_RX N Zero IPv6 UDP checksum on receive. Used only if
+ FD is not set.
+IP_SADDR N IPv4 source address. Used only if FD is not
+ set.
+IP_DADDR N IPv4 destination address. Used only if FD is
+ not set.
+UDP_SPORT N UDP source port. Used only if FD is not set.
+UDP_DPORT N UDP destination port. Used only if FD is not
+ set.
+IP6_SADDR N IPv6 source address. Used only if FD is not
+ set.
+IP6_DADDR N IPv6 destination address. Used only if FD is
+ not set.
+DEBUG N Debug flags.
+================== ======== ===
+
+``L2TP_CMD_TUNNEL_DELETE`` attributes:-
+
+================== ======== ===
+Attribute Required Use
+================== ======== ===
+CONN_ID Y Identifies the tunnel id to be destroyed.
+================== ======== ===
+
+``L2TP_CMD_TUNNEL_MODIFY`` attributes:-
+
+================== ======== ===
+Attribute Required Use
+================== ======== ===
+CONN_ID Y Identifies the tunnel id to be modified.
+DEBUG N Debug flags.
+================== ======== ===
+
+``L2TP_CMD_TUNNEL_GET`` attributes:-
+
+================== ======== ===
+Attribute Required Use
+================== ======== ===
+CONN_ID N Identifies the tunnel id to be queried.
+ Ignored in DUMP requests.
+================== ======== ===
+
+``L2TP_CMD_SESSION_CREATE`` attributes:-
+
+================== ======== ===
+Attribute Required Use
+================== ======== ===
+CONN_ID Y The parent tunnel id.
+SESSION_ID Y Sets the session id.
+PEER_SESSION_ID Y Sets the peer session id.
+PW_TYPE Y Sets the pseudowire type.
+DEBUG N Debug flags.
+RECV_SEQ N Enable rx data sequence numbers.
+SEND_SEQ N Enable tx data sequence numbers.
+LNS_MODE N Enable LNS mode (auto-enable data sequence
+ numbers).
+RECV_TIMEOUT N Timeout to wait when reordering received
+ packets.
+L2SPEC_TYPE N Sets layer2-specific-sublayer type (L2TPv3
+ only).
+COOKIE N Sets optional cookie (L2TPv3 only).
+PEER_COOKIE N Sets optional peer cookie (L2TPv3 only).
+IFNAME N Sets interface name (L2TPv3 only).
+================== ======== ===
+
+For Ethernet session types, this will create an l2tpeth virtual
+interface which can then be configured as required. For PPP session
+types, a PPPoL2TP socket must also be opened and connected, mapping it
+onto the new session. This is covered in "PPPoL2TP Sockets" later.
+
+``L2TP_CMD_SESSION_DELETE`` attributes:-
+
+================== ======== ===
+Attribute Required Use
+================== ======== ===
+CONN_ID Y Identifies the parent tunnel id of the session
+ to be destroyed.
+SESSION_ID Y Identifies the session id to be destroyed.
+IFNAME N Identifies the session by interface name. If
+ set, this overrides any CONN_ID and SESSION_ID
+ attributes. Currently supported for L2TPv3
+ Ethernet sessions only.
+================== ======== ===
+
+``L2TP_CMD_SESSION_MODIFY`` attributes:-
+
+================== ======== ===
+Attribute Required Use
+================== ======== ===
+CONN_ID Y Identifies the parent tunnel id of the session
+ to be modified.
+SESSION_ID Y Identifies the session id to be modified.
+IFNAME N Identifies the session by interface name. If
+ set, this overrides any CONN_ID and SESSION_ID
+ attributes. Currently supported for L2TPv3
+ Ethernet sessions only.
+DEBUG N Debug flags.
+RECV_SEQ N Enable rx data sequence numbers.
+SEND_SEQ N Enable tx data sequence numbers.
+LNS_MODE N Enable LNS mode (auto-enable data sequence
+ numbers).
+RECV_TIMEOUT N Timeout to wait when reordering received
+ packets.
+================== ======== ===
+
+``L2TP_CMD_SESSION_GET`` attributes:-
+
+================== ======== ===
+Attribute Required Use
+================== ======== ===
+CONN_ID N Identifies the tunnel id to be queried.
+ Ignored for DUMP requests.
+SESSION_ID N Identifies the session id to be queried.
+ Ignored for DUMP requests.
+IFNAME N Identifies the session by interface name.
+ If set, this overrides any CONN_ID and
+ SESSION_ID attributes. Ignored for DUMP
+ requests. Currently supported for L2TPv3
+ Ethernet sessions only.
+================== ======== ===
+
+Application developers should refer to `include/uapi/linux/l2tp.h`_ for
+netlink command and attribute definitions.
+
+Sample userspace code using libmnl_:
+
+ - Open L2TP netlink socket::
+
+ struct nl_sock *nl_sock;
+ int genl_id;
+
+ /* Note: this step uses libnl's generic netlink helpers. */
+ nl_sock = nl_socket_alloc();
+ genl_connect(nl_sock);
+ genl_id = genl_ctrl_resolve(nl_sock, L2TP_GENL_NAME);
+
+ - Create a tunnel::
+
+ struct nlmsghdr *nlh;
+ struct genlmsghdr *gnlh;
+
+ nlh = mnl_nlmsg_put_header(buf);
+ nlh->nlmsg_type = genl_id; /* assigned to genl socket */
+ nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+ nlh->nlmsg_seq = seq;
+
+ gnlh = mnl_nlmsg_put_extra_header(nlh, sizeof(*gnlh));
+ gnlh->cmd = L2TP_CMD_TUNNEL_CREATE;
+ gnlh->version = L2TP_GENL_VERSION;
+ gnlh->reserved = 0;
+
+ mnl_attr_put_u32(nlh, L2TP_ATTR_FD, tunl_sock_fd);
+ mnl_attr_put_u32(nlh, L2TP_ATTR_CONN_ID, tid);
+ mnl_attr_put_u32(nlh, L2TP_ATTR_PEER_CONN_ID, peer_tid);
+ mnl_attr_put_u8(nlh, L2TP_ATTR_PROTO_VERSION, protocol_version);
+ mnl_attr_put_u16(nlh, L2TP_ATTR_ENCAP_TYPE, encap);
+
+ - Create a session::
+
+ struct nlmsghdr *nlh;
+ struct genlmsghdr *gnlh;
+
+ nlh = mnl_nlmsg_put_header(buf);
+ nlh->nlmsg_type = genl_id; /* assigned to genl socket */
+ nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+ nlh->nlmsg_seq = seq;
+
+ gnlh = mnl_nlmsg_put_extra_header(nlh, sizeof(*gnlh));
+ gnlh->cmd = L2TP_CMD_SESSION_CREATE;
+ gnlh->version = L2TP_GENL_VERSION;
+ gnlh->reserved = 0;
+
+ mnl_attr_put_u32(nlh, L2TP_ATTR_CONN_ID, tid);
+ mnl_attr_put_u32(nlh, L2TP_ATTR_PEER_CONN_ID, peer_tid);
+ mnl_attr_put_u32(nlh, L2TP_ATTR_SESSION_ID, sid);
+ mnl_attr_put_u32(nlh, L2TP_ATTR_PEER_SESSION_ID, peer_sid);
+ mnl_attr_put_u16(nlh, L2TP_ATTR_PW_TYPE, pwtype);
+ /* there are other session options which can be set using netlink
+ * attributes during session creation -- see l2tp.h
+ */
+
+ - Delete a session::
+
+ struct nlmsghdr *nlh;
+ struct genlmsghdr *gnlh;
+
+ nlh = mnl_nlmsg_put_header(buf);
+ nlh->nlmsg_type = genl_id; /* assigned to genl socket */
+ nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+ nlh->nlmsg_seq = seq;
+
+ gnlh = mnl_nlmsg_put_extra_header(nlh, sizeof(*gnlh));
+ gnlh->cmd = L2TP_CMD_SESSION_DELETE;
+ gnlh->version = L2TP_GENL_VERSION;
+ gnlh->reserved = 0;
+
+ mnl_attr_put_u32(nlh, L2TP_ATTR_CONN_ID, tid);
+ mnl_attr_put_u32(nlh, L2TP_ATTR_SESSION_ID, sid);
+
+ - Delete a tunnel and all of its sessions (if any)::
+
+ struct nlmsghdr *nlh;
+ struct genlmsghdr *gnlh;
+
+ nlh = mnl_nlmsg_put_header(buf);
+ nlh->nlmsg_type = genl_id; /* assigned to genl socket */
+ nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+ nlh->nlmsg_seq = seq;
+
+ gnlh = mnl_nlmsg_put_extra_header(nlh, sizeof(*gnlh));
+ gnlh->cmd = L2TP_CMD_TUNNEL_DELETE;
+ gnlh->version = L2TP_GENL_VERSION;
+ gnlh->reserved = 0;
+
+ mnl_attr_put_u32(nlh, L2TP_ATTR_CONN_ID, tid);
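
The samples above cover create and delete requests. For completeness, here is a hedged sketch of a ``L2TP_CMD_TUNNEL_GET`` request built by hand, using only the uapi netlink headers rather than libmnl. The helper names ``put_u32_attr`` and ``build_tunnel_get`` are hypothetical, the caller is assumed to supply a suitably aligned buffer, and ``genl_id`` is assumed to have been resolved already. Sending the message and parsing the reply are left out.

```c
#include <stdint.h>
#include <string.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>
#include <linux/l2tp.h>

/* Append one u32 attribute; returns 0, or -1 if it would not fit. */
int put_u32_attr(struct nlmsghdr *nlh, size_t bufsz, uint16_t type,
                 uint32_t val)
{
    size_t off = NLMSG_ALIGN(nlh->nlmsg_len);
    struct nlattr *nla;

    if (off + NLA_HDRLEN + sizeof(val) > bufsz)
        return -1;
    nla = (struct nlattr *)((char *)nlh + off);
    nla->nla_type = type;
    nla->nla_len = NLA_HDRLEN + sizeof(val);
    memcpy((char *)nla + NLA_HDRLEN, &val, sizeof(val));
    nlh->nlmsg_len = off + nla->nla_len;
    return 0;
}

/* Build a TUNNEL_GET request for tunnel id tid into buf. */
int build_tunnel_get(void *buf, size_t bufsz, uint16_t genl_id,
                     uint32_t seq, uint32_t tid)
{
    struct nlmsghdr *nlh = buf;
    struct genlmsghdr *gnlh;

    if (bufsz < NLMSG_HDRLEN + GENL_HDRLEN)
        return -1;
    memset(buf, 0, bufsz);
    nlh->nlmsg_len = NLMSG_HDRLEN + GENL_HDRLEN;
    nlh->nlmsg_type = genl_id;      /* L2TP generic netlink family id */
    nlh->nlmsg_flags = NLM_F_REQUEST;
    nlh->nlmsg_seq = seq;

    gnlh = (struct genlmsghdr *)((char *)nlh + NLMSG_HDRLEN);
    gnlh->cmd = L2TP_CMD_TUNNEL_GET;
    gnlh->version = L2TP_GENL_VERSION;

    return put_u32_attr(nlh, bufsz, L2TP_ATTR_CONN_ID, tid);
}
```

A dump of all tunnels would presumably set ``NLM_F_DUMP`` in ``nlmsg_flags`` and omit the ``CONN_ID`` attribute, since the attribute tables above say it is ignored in DUMP requests.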
+
+PPPoL2TP Session Socket API
+---------------------------
+
+For PPP session types, a PPPoL2TP socket must be opened and connected
+to the L2TP session.
When creating PPPoL2TP sockets, the application provides information
-to the driver about the socket in a socket connect() call. Source and
-destination tunnel and session ids are provided, as well as the file
-descriptor of a UDP socket. See struct pppol2tp_addr in
-include/linux/if_pppol2tp.h. Note that zero tunnel / session ids are
-treated specially. When creating the per-tunnel PPPoL2TP management
-socket in Step 2 above, zero source and destination session ids are
-specified, which tells the driver to prepare the supplied UDP file
-descriptor for use as an L2TP tunnel socket.
+to the kernel about the tunnel and session in a socket connect()
+call. Source and destination tunnel and session ids are provided, as
+well as the file descriptor of a UDP or L2TPIP socket. See struct
+pppol2tp_addr in `include/linux/if_pppol2tp.h`_. For historical reasons,
+there are unfortunately slightly different address structures for
+L2TPv2/L2TPv3 IPv4/IPv6 tunnels and userspace must use the appropriate
+structure that matches the tunnel socket type.
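
One way to keep the address structure choice in a single place is a small lookup helper. The sketch below assumes the four ``sockaddr_pppol2tp*`` structures from the uapi PPPoX headers; the function name ``pppol2tp_addr_len`` is hypothetical.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/if_pppox.h>

/*
 * Return the size of the PPPoL2TP socket address matching the L2TP
 * version and the tunnel socket's address family, or 0 if the
 * combination is not recognised.
 */
socklen_t pppol2tp_addr_len(int l2tp_version, int tunnel_family)
{
    if (l2tp_version == 2 && tunnel_family == AF_INET)
        return sizeof(struct sockaddr_pppol2tp);
    if (l2tp_version == 2 && tunnel_family == AF_INET6)
        return sizeof(struct sockaddr_pppol2tpin6);
    if (l2tp_version == 3 && tunnel_family == AF_INET)
        return sizeof(struct sockaddr_pppol2tpv3);
    if (l2tp_version == 3 && tunnel_family == AF_INET6)
        return sizeof(struct sockaddr_pppol2tpv3in6);
    return 0;
}
```

The returned length is what would be passed as the third argument of connect() together with the matching structure.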
Userspace may control behavior of the tunnel or session using
setsockopt and ioctl on the PPPoX socket. The following socket
@@ -130,229 +370,431 @@ options are supported:-
========= ===========================================================
DEBUG bitmask of debug message categories. See below.
SENDSEQ - 0 => don't send packets with sequence numbers
- - 1 => send packets with sequence numbers
+ - 1 => send packets with sequence numbers
RECVSEQ - 0 => receive packet sequence numbers are optional
- - 1 => drop receive packets without sequence numbers
+ - 1 => drop receive packets without sequence numbers
LNSMODE - 0 => act as LAC.
- - 1 => act as LNS.
+ - 1 => act as LNS.
REORDERTO reorder timeout (in millisecs). If 0, don't try to reorder.
========= ===========================================================
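
As an illustration of the socket options in the table above, the sketch below sets the reorder timeout on a session's PPPoX socket. The wrapper name is hypothetical, and the fallback definition of ``SOL_PPPOL2TP`` is an assumption for C libraries whose headers do not provide it.

```c
#include <errno.h>
#include <sys/socket.h>
#include <linux/if_pppol2tp.h>

#ifndef SOL_PPPOL2TP
#define SOL_PPPOL2TP 273 /* assumed fallback value */
#endif

/*
 * Set the reorder timeout (in milliseconds) on a PPPoL2TP session
 * socket. A timeout of 0 disables reordering.
 * Returns 0 on success or a negative errno value.
 */
int pppol2tp_set_reorder_timeout(int session_fd, int timeout_ms)
{
    if (setsockopt(session_fd, SOL_PPPOL2TP, PPPOL2TP_SO_REORDERTO,
                   &timeout_ms, sizeof(timeout_ms)) < 0)
        return -errno;
    return 0;
}
```

The other options (``PPPOL2TP_SO_DEBUG``, ``PPPOL2TP_SO_SENDSEQ``, ``PPPOL2TP_SO_RECVSEQ``, ``PPPOL2TP_SO_LNSMODE``) take an int value in the same way.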
-Only the DEBUG option is supported by the special tunnel management
-PPPoX socket.
-
In addition to the standard PPP ioctls, a PPPIOCGL2TPSTATS is provided
to retrieve tunnel and session statistics from the kernel using the
PPPoX socket of the appropriate tunnel or session.
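
A minimal sketch of the statistics ioctl follows. It assumes ``struct pppol2tp_ioc_stats`` and ``PPPIOCGL2TPSTATS`` are available from ``linux/ppp-ioctl.h``; the wrapper name is hypothetical.

```c
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/ppp-ioctl.h>

/*
 * Read L2TP statistics through a PPPoX socket.
 * Returns 0 on success or a negative errno value.
 */
int pppol2tp_get_stats(int pppox_fd, struct pppol2tp_ioc_stats *stats)
{
    memset(stats, 0, sizeof(*stats));
    if (ioctl(pppox_fd, PPPIOCGL2TPSTATS, stats) < 0)
        return -errno;
    return 0;
}
```

On success the caller can read counters such as ``stats->rx_packets`` and ``stats->tx_packets``.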
-For L2TPv3, userspace must use the netlink API defined in
-include/linux/l2tp.h to manage tunnel and session contexts. The
-general procedure to create a new L2TP tunnel with one session is:-
+Sample userspace code:
-1. Open a GENL socket using L2TP_GENL_NAME for configuring the kernel
- using netlink.
+ - Create session PPPoX data socket::
-2. Create a UDP or L2TPIP socket for the tunnel.
+ /* Input: the L2TP tunnel UDP socket `tunnel_fd`, which needs to be
+ * bound already (both sockname and peername), otherwise it will not be
+ * ready.
+ */
-3. Create a new L2TP tunnel using a L2TP_CMD_TUNNEL_CREATE
- request. Set attributes according to desired tunnel parameters,
- referencing the UDP or L2TPIP socket created in the previous step.
+ struct sockaddr_pppol2tp sax = { 0 };
+ int session_fd;
+ int ret;
-4. Create a new L2TP session in the tunnel using a
- L2TP_CMD_SESSION_CREATE request.
+ session_fd = socket(AF_PPPOX, SOCK_DGRAM, PX_PROTO_OL2TP);
+ if (session_fd < 0)
+ return -errno;
-The tunnel and all of its sessions are closed when the tunnel socket
-is closed. The netlink API may also be used to delete sessions and
-tunnels. Configuration and status info may be set or read using netlink.
+ sax.sa_family = AF_PPPOX;
+ sax.sa_protocol = PX_PROTO_OL2TP;
+ sax.pppol2tp.fd = tunnel_fd;
+ sax.pppol2tp.addr.sin_addr.s_addr = addr->sin_addr.s_addr;
+ sax.pppol2tp.addr.sin_port = addr->sin_port;
+ sax.pppol2tp.addr.sin_family = AF_INET;
+ sax.pppol2tp.s_tunnel = tunnel_id;
+ sax.pppol2tp.s_session = session_id;
+ sax.pppol2tp.d_tunnel = peer_tunnel_id;
+ sax.pppol2tp.d_session = peer_session_id;
-The L2TP driver also supports static (unmanaged) L2TPv3 tunnels. These
-are where there is no L2TP control message exchange with the peer to
-setup the tunnel; the tunnel is configured manually at each end of the
-tunnel. There is no need for an L2TP userspace application in this
-case -- the tunnel socket is created by the kernel and configured
-using parameters sent in the L2TP_CMD_TUNNEL_CREATE netlink
-request. The "ip" utility of iproute2 has commands for managing static
-L2TPv3 tunnels; do "ip l2tp help" for more information.
+ /* session_fd is the fd of the session's PPPoL2TP socket.
+ * tunnel_fd is the fd of the tunnel UDP / L2TPIP socket.
+ */
+ ret = connect(session_fd, (struct sockaddr *)&sax, sizeof(sax));
+ if (ret < 0) {
+ close(session_fd);
+ return -errno;
+ }
-Debugging
-=========
+ return session_fd;
-The driver supports a flexible debug scheme where kernel trace
-messages may be optionally enabled per tunnel and per session. Care is
-needed when debugging a live system since the messages are not
-rate-limited and a busy system could be swamped. Userspace uses
-setsockopt on the PPPoX socket to set a debug mask.
+L2TP control packets will still be available for read on `tunnel_fd`.
-The following debug mask bits are available:
+ - Create PPP channel::
-================ ==============================
-L2TP_MSG_DEBUG verbose debug (if compiled in)
-L2TP_MSG_CONTROL userspace - kernel interface
-L2TP_MSG_SEQ sequence numbers handling
-L2TP_MSG_DATA data packets
-================ ==============================
+ /* Input: the session PPPoX data socket `session_fd` which was created
+ * as described above.
+ */
-If enabled, files under a l2tp debugfs directory can be used to dump
-kernel state about L2TP tunnels and sessions. To access it, the
-debugfs filesystem must first be mounted::
+ int ppp_chan_fd;
+ int chindx;
+ int ret;
- # mount -t debugfs debugfs /debug
+ ret = ioctl(session_fd, PPPIOCGCHAN, &chindx);
+ if (ret < 0)
+ return -errno;
-Files under the l2tp directory can then be accessed::
+ ppp_chan_fd = open("/dev/ppp", O_RDWR);
+ if (ppp_chan_fd < 0)
+ return -errno;
- # cat /debug/l2tp/tunnels
+ ret = ioctl(ppp_chan_fd, PPPIOCATTCHAN, &chindx);
+ if (ret < 0) {
+ close(ppp_chan_fd);
+ return -errno;
+ }
-The debugfs files should not be used by applications to obtain L2TP
-state information because the file format is subject to change. It is
-implemented to provide extra debug information to help diagnose
-problems.) Users should use the netlink API.
+ return ppp_chan_fd;
-/proc/net/pppol2tp is also provided for backwards compatibility with
-the original pppol2tp driver. It lists information about L2TPv2
-tunnels and sessions only. Its use is discouraged.
+LCP PPP frames will be available for read on `ppp_chan_fd`.
-Unmanaged L2TPv3 Tunnels
-========================
-
-Some commercial L2TP products support unmanaged L2TPv3 ethernet
-tunnels, where there is no L2TP control protocol; tunnels are
-configured at each side manually. New commands are available in
-iproute2's ip utility to support this.
-
-To create an L2TPv3 ethernet pseudowire between local host 192.168.1.1
-and peer 192.168.1.2, using IP addresses 10.5.1.1 and 10.5.1.2 for the
-tunnel endpoints::
-
- # ip l2tp add tunnel tunnel_id 1 peer_tunnel_id 1 udp_sport 5000 \
- udp_dport 5000 encap udp local 192.168.1.1 remote 192.168.1.2
- # ip l2tp add session tunnel_id 1 session_id 1 peer_session_id 1
- # ip -s -d show dev l2tpeth0
- # ip addr add 10.5.1.2/32 peer 10.5.1.1/32 dev l2tpeth0
- # ip li set dev l2tpeth0 up
-
-Choose IP addresses to be the address of a local IP interface and that
-of the remote system. The IP addresses of the l2tpeth0 interface can be
-anything suitable.
-
-Repeat the above at the peer, with ports, tunnel/session ids and IP
-addresses reversed. The tunnel and session IDs can be any non-zero
-32-bit number, but the values must be reversed at the peer.
-
-======================== ===================
-Host 1 Host2
-======================== ===================
-udp_sport=5000 udp_sport=5001
-udp_dport=5001 udp_dport=5000
-tunnel_id=42 tunnel_id=45
-peer_tunnel_id=45 peer_tunnel_id=42
-session_id=128 session_id=5196755
-peer_session_id=5196755 peer_session_id=128
-======================== ===================
-
-When done at both ends of the tunnel, it should be possible to send
-data over the network. e.g.::
-
- # ping 10.5.1.1
-
-
-Sample Userspace Code
-=====================
-
-1. Create tunnel management PPPoX socket::
-
- kernel_fd = socket(AF_PPPOX, SOCK_DGRAM, PX_PROTO_OL2TP);
- if (kernel_fd >= 0) {
- struct sockaddr_pppol2tp sax;
- struct sockaddr_in const *peer_addr;
-
- peer_addr = l2tp_tunnel_get_peer_addr(tunnel);
- memset(&sax, 0, sizeof(sax));
- sax.sa_family = AF_PPPOX;
- sax.sa_protocol = PX_PROTO_OL2TP;
- sax.pppol2tp.fd = udp_fd; /* fd of tunnel UDP socket */
- sax.pppol2tp.addr.sin_addr.s_addr = peer_addr->sin_addr.s_addr;
- sax.pppol2tp.addr.sin_port = peer_addr->sin_port;
- sax.pppol2tp.addr.sin_family = AF_INET;
- sax.pppol2tp.s_tunnel = tunnel_id;
- sax.pppol2tp.s_session = 0; /* special case: mgmt socket */
- sax.pppol2tp.d_tunnel = 0;
- sax.pppol2tp.d_session = 0; /* special case: mgmt socket */
-
- if(connect(kernel_fd, (struct sockaddr *)&sax, sizeof(sax) ) < 0 ) {
- perror("connect failed");
- result = -errno;
- goto err;
- }
- }
-
-2. Create session PPPoX data socket::
-
- struct sockaddr_pppol2tp sax;
- int fd;
-
- /* Note, the target socket must be bound already, else it will not be ready */
- sax.sa_family = AF_PPPOX;
- sax.sa_protocol = PX_PROTO_OL2TP;
- sax.pppol2tp.fd = tunnel_fd;
- sax.pppol2tp.addr.sin_addr.s_addr = addr->sin_addr.s_addr;
- sax.pppol2tp.addr.sin_port = addr->sin_port;
- sax.pppol2tp.addr.sin_family = AF_INET;
- sax.pppol2tp.s_tunnel = tunnel_id;
- sax.pppol2tp.s_session = session_id;
- sax.pppol2tp.d_tunnel = peer_tunnel_id;
- sax.pppol2tp.d_session = peer_session_id;
-
- /* session_fd is the fd of the session's PPPoL2TP socket.
- * tunnel_fd is the fd of the tunnel UDP socket.
- */
- fd = connect(session_fd, (struct sockaddr *)&sax, sizeof(sax));
- if (fd < 0 ) {
- return -errno;
- }
- return 0;
+ - Create PPP interface::
-Internal Implementation
-=======================
+ /* Input: the PPP channel `ppp_chan_fd` which was created as described
+ * above.
+ */
+
+ int ifunit = -1;
+ int ppp_if_fd;
+ int ret;
+
+ ppp_if_fd = open("/dev/ppp", O_RDWR);
+ if (ppp_if_fd < 0)
+ return -errno;
+
+ ret = ioctl(ppp_if_fd, PPPIOCNEWUNIT, &ifunit);
+ if (ret < 0) {
+ close(ppp_if_fd);
+ return -errno;
+ }
+
+ ret = ioctl(ppp_chan_fd, PPPIOCCONNECT, &ifunit);
+ if (ret < 0) {
+ close(ppp_if_fd);
+ return -errno;
+ }
+
+ return ppp_if_fd;
+
+IPCP/IPv6CP PPP frames will be available for read on `ppp_if_fd`.
+
+The ppp<ifunit> interface can then be configured as usual with netlink's
+RTM_NEWLINK, RTM_NEWADDR, RTM_NEWROUTE, or ioctl's SIOCSIFMTU, SIOCSIFADDR,
+SIOCSIFDSTADDR, SIOCSIFNETMASK, SIOCSIFFLAGS, or with the `ip` command.
+
+ - Bridging L2TP sessions which have PPP pseudowire types (this is also called
+ L2TP tunnel switching or L2TP multihop) is supported by bridging the PPP
+ channels of the two L2TP sessions to be bridged::
+
+ /* Input: the session PPPoX data sockets `session_fd1` and `session_fd2`
+ * which were created as described further above.
+ */
+
+ int ppp_chan_fd;
+ int chindx1;
+ int chindx2;
+ int ret;
+
+ ret = ioctl(session_fd1, PPPIOCGCHAN, &chindx1);
+ if (ret < 0)
+ return -errno;
+
+ ret = ioctl(session_fd2, PPPIOCGCHAN, &chindx2);
+ if (ret < 0)
+ return -errno;
+
+ ppp_chan_fd = open("/dev/ppp", O_RDWR);
+ if (ppp_chan_fd < 0)
+ return -errno;
-The driver keeps a struct l2tp_tunnel context per L2TP tunnel and a
-struct l2tp_session context for each session. The l2tp_tunnel is
-always associated with a UDP or L2TP/IP socket and keeps a list of
-sessions in the tunnel. The l2tp_session context keeps kernel state
-about the session. It has private data which is used for data specific
-to the session type. With L2TPv2, the session always carried PPP
-traffic. With L2TPv3, the session can also carry ethernet frames
-(ethernet pseudowire) or other data types such as ATM, HDLC or Frame
-Relay.
+ ret = ioctl(ppp_chan_fd, PPPIOCATTCHAN, &chindx1);
+ if (ret < 0) {
+ close(ppp_chan_fd);
+ return -errno;
+ }
-When a tunnel is first opened, the reference count on the socket is
-increased using sock_hold(). This ensures that the kernel socket
-cannot be removed while L2TP's data structures reference it.
+ ret = ioctl(ppp_chan_fd, PPPIOCBRIDGECHAN, &chindx2);
+ close(ppp_chan_fd);
+ if (ret < 0)
+ return -errno;
+
+ return 0;
+
+Note that when bridging PPP channels, the PPP session is not locally
+terminated, and no local PPP interface is created. PPP frames arriving on one
+channel are directly passed to the other channel, and vice versa.
+
+The PPP channel does not need to be kept open. Only the session PPPoX data
+sockets need to be kept open.
+
+More generally, it is also possible in the same way to e.g. bridge a PPPoL2TP
+PPP channel with other types of PPP channels, such as PPPoE.
+
+See more details for the PPP side in ppp_generic.rst.
+
+Old L2TPv2-only API
+-------------------
+
+When L2TP was first added to the Linux kernel in 2.6.23, it
+implemented only L2TPv2 and did not include a netlink API. Instead,
+tunnel and session instances in the kernel were managed directly using
+only PPPoL2TP sockets. The PPPoL2TP socket is used as described in
+section "PPPoL2TP Session Socket API" but tunnel and session instances
+are automatically created on a connect() of the socket instead of
+being created by a separate netlink request:
+
+ - Tunnels are managed using a tunnel management socket which is a
+ dedicated PPPoL2TP socket, connected to (invalid) session
+ id 0. The L2TP tunnel instance is created when the PPPoL2TP
+ tunnel management socket is connected and is destroyed when the
+ socket is closed.
+
+ - Session instances are created in the kernel when a PPPoL2TP
+ socket is connected to a non-zero session id. Session parameters
+ are set using setsockopt. The L2TP session instance is destroyed
+ when the socket is closed.
+
+This API is still supported but its use is discouraged. Instead, new
+L2TPv2 applications should use netlink to first create the tunnel and
+session, then create a PPPoL2TP socket for the session.
+
+Unmanaged L2TPv3 tunnels
+------------------------
+
+The kernel L2TP subsystem also supports static (unmanaged) L2TPv3
+tunnels. Unmanaged tunnels have no userspace tunnel socket, and
+exchange no control messages with the peer to set up the tunnel; the
+tunnel is configured manually at each end of the tunnel. All
+configuration is done using netlink. There is no need for an L2TP
+userspace application in this case -- the tunnel socket is created by
+the kernel and configured using parameters sent in the
+``L2TP_CMD_TUNNEL_CREATE`` netlink request. The ``ip`` utility of
+``iproute2`` has commands for managing static L2TPv3 tunnels; do ``ip
+l2tp help`` for more information.
+
+Debugging
+---------
-Some L2TP sessions also have a socket (PPP pseudowires) while others
-do not (ethernet pseudowires). We can't use the socket reference count
-as the reference count for session contexts. The L2TP implementation
-therefore has its own internal reference counts on the session
-contexts.
+The L2TP subsystem offers a range of debugging interfaces through the
+debugfs filesystem.
-To Do
-=====
+To access these interfaces, the debugfs filesystem must first be mounted::
-Add L2TP tunnel switching support. This would route tunneled traffic
-from one L2TP tunnel into another. Specified in
-http://tools.ietf.org/html/draft-ietf-l2tpext-tunnel-switching-08
+ # mount -t debugfs debugfs /debug
-Add L2TPv3 VLAN pseudowire support.
+Files under the l2tp directory can then be accessed, providing a summary
+of the current population of tunnel and session contexts existing in the
+kernel::
-Add L2TPv3 IP pseudowire support.
+ # cat /debug/l2tp/tunnels
+
+The debugfs files should not be used by applications to obtain L2TP
+state information because the file format is subject to change. It is
+implemented to provide extra debug information to help diagnose
+problems. Applications should instead use the netlink API.
+
+In addition the L2TP subsystem implements tracepoints using the standard
+kernel event tracing API. The available L2TP events can be reviewed as
+follows::
+
+ # find /debug/tracing/events/l2tp
+
+Finally, /proc/net/pppol2tp is also provided for backwards compatibility
+with the original pppol2tp code. It lists information about L2TPv2
+tunnels and sessions only. Its use is discouraged.
+
+Internal Implementation
+=======================
-Add L2TPv3 ATM pseudowire support.
+This section is for kernel developers and maintainers.
+
+Sockets
+-------
+
+UDP sockets are implemented by the networking core. When an L2TP
+tunnel is created using a UDP socket, the socket is set up as an
+encapsulated UDP socket by setting encap_rcv and encap_destroy
+callbacks on the UDP socket. l2tp_udp_encap_recv is called when
+packets are received on the socket. l2tp_udp_encap_destroy is called
+when userspace closes the socket.
+
+L2TPIP sockets are implemented in `net/l2tp/l2tp_ip.c`_ and
+`net/l2tp/l2tp_ip6.c`_.
+
+Tunnels
+-------
+
+The kernel keeps a struct l2tp_tunnel context per L2TP tunnel. The
+l2tp_tunnel is always associated with a UDP or L2TP/IP socket and
+keeps a list of sessions in the tunnel. When a tunnel is first
+registered with L2TP core, the reference count on the socket is
+increased. This ensures that the socket cannot be removed while L2TP's
+data structures reference it.
+
+Tunnels are identified by a unique tunnel id. The id is 16-bit for
+L2TPv2 and 32-bit for L2TPv3. Internally, the id is stored as a 32-bit
+value.
+
+Tunnels are kept in a per-net list, indexed by tunnel id. The tunnel
+id namespace is shared by L2TPv2 and L2TPv3. The tunnel context can be
+derived from the socket's sk_user_data.
+
+Handling tunnel socket close is perhaps the trickiest part of the
+L2TP implementation. If userspace closes a tunnel socket, the L2TP
+tunnel and all of its sessions must be closed and destroyed. Since the
+tunnel context holds a ref on the tunnel socket, the socket's
+sk_destruct won't be called until the tunnel sock_put's its
+socket. For UDP sockets, when userspace closes the tunnel socket, the
+socket's encap_destroy handler is invoked, which L2TP uses to initiate
+its tunnel close actions. For L2TPIP sockets, the socket's close
+handler initiates the same tunnel close actions. All sessions are
+first closed. Each session drops its tunnel ref. When the tunnel ref
+reaches zero, the tunnel puts its socket ref. When the socket is
+eventually destroyed, its sk_destruct finally frees the L2TP tunnel
+context.
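
The order of events in the close sequence can be modelled in userspace with plain counters. The sketch below is a toy model only, not kernel code: all ``toy_*`` names are hypothetical, and it assumes the tunnel starts with one reference of its own plus one per session, and the socket with one reference for userspace plus one held by the tunnel.

```c
#include <stdbool.h>

struct toy_sock {
    int refs;
    bool destroyed;      /* set when the last socket ref is dropped */
    bool tunnel_freed;   /* stands in for sk_destruct freeing the tunnel */
};

struct toy_tunnel {
    int refs;
    int nsessions;
    struct toy_sock *sk;
};

void toy_tunnel_init(struct toy_tunnel *t, struct toy_sock *sk, int nsessions)
{
    sk->refs = 2;               /* userspace + tunnel context */
    sk->destroyed = false;
    sk->tunnel_freed = false;
    t->refs = 1 + nsessions;    /* tunnel's own ref + one per session */
    t->nsessions = nsessions;
    t->sk = sk;
}

static void toy_sock_put(struct toy_sock *sk)
{
    if (--sk->refs == 0) {
        sk->destroyed = true;
        sk->tunnel_freed = true;
    }
}

static void toy_tunnel_put(struct toy_tunnel *t)
{
    if (--t->refs == 0)
        toy_sock_put(t->sk);    /* tunnel puts its socket ref */
}

/* Model of userspace closing the tunnel socket. */
void toy_tunnel_socket_close(struct toy_tunnel *t)
{
    while (t->nsessions > 0) {  /* all sessions are closed first */
        t->nsessions--;
        toy_tunnel_put(t);      /* each session drops its tunnel ref */
    }
    toy_tunnel_put(t);          /* tunnel's own ref */
    toy_sock_put(t->sk);        /* userspace's ref on the socket */
}
```

In this model the tunnel context is only released after both the tunnel and userspace have dropped their socket references, which mirrors the ordering described above.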
+
+Sessions
+--------
+
+The kernel keeps a struct l2tp_session context for each session. Each
+session has private data which is used for data specific to the
+session type. With L2TPv2, the session always carries PPP
+traffic. With L2TPv3, the session can carry Ethernet frames (Ethernet
+pseudowire) or other data types such as PPP, ATM, HDLC or Frame
+Relay. Linux currently implements only Ethernet and PPP session types.
+
+Some L2TP session types also have a socket (PPP pseudowires) while
+others do not (Ethernet pseudowires). The socket reference count
+therefore cannot serve as the reference count for session contexts,
+so the L2TP implementation keeps its own internal reference counts
+on the session contexts.
+
+Like tunnels, L2TP sessions are identified by a unique
+session id. Just as with tunnel ids, the session id is 16-bit for
+L2TPv2 and 32-bit for L2TPv3. Internally, the id is stored as a 32-bit
+value.
+
+Sessions hold a ref on their parent tunnel to ensure that the tunnel
+stays extant while one or more sessions reference it.
+
+Sessions are kept in a per-tunnel list, indexed by session id. L2TPv3
+sessions are also kept in a per-net list indexed by session id,
+because L2TPv3 session ids are unique across all tunnels and L2TPv3
+data packets do not contain a tunnel id in the header. This list is
+therefore needed to find the session context associated with a
+received data packet when the tunnel context cannot be derived from
+the tunnel socket.
+
+Although the L2TPv3 RFC specifies that L2TPv3 session ids are not
+scoped by the tunnel, the kernel does not police this for L2TPv3 UDP
+tunnels and does not add sessions of L2TPv3 UDP tunnels into the
+per-net session list. In the UDP receive code, we must trust that the
+tunnel can be identified using the tunnel socket's sk_user_data and
+look up the session in the tunnel's session list instead of the per-net
+session list.
+
+PPP
+---
+
+`net/l2tp/l2tp_ppp.c`_ implements the PPPoL2TP socket family. Each PPP
+session has a PPPoL2TP socket.
+
+The PPPoL2TP socket's sk_user_data references the l2tp_session.
+
+Userspace sends and receives PPP packets over L2TP using a PPPoL2TP
+socket. Only PPP control frames pass over this socket: PPP data
+packets are handled entirely by the kernel, passing between the L2TP
+session and its associated ``pppN`` netdev through the PPP channel
+interface of the kernel PPP subsystem.
+
+The L2TP PPP implementation handles the closing of a PPPoL2TP socket
+by closing its corresponding L2TP session. This is complicated because
+it must consider racing with netlink session create/destroy requests
+and pppol2tp_connect trying to reconnect with a session that is in the
+process of being closed. Unlike tunnels, PPP sessions do not hold a
+ref on their associated socket, so code must be careful to sock_hold
+the socket where necessary. For all the details, see commit
+3d609342cc04129ff7568e19316ce3d7451a27e8.
+
+Ethernet
+--------
+
+`net/l2tp/l2tp_eth.c`_ implements L2TPv3 Ethernet pseudowires. It
+manages a netdev for each session.
+
+L2TP Ethernet sessions are created and destroyed by netlink request,
+or are destroyed when the tunnel is destroyed. Unlike PPP sessions,
+Ethernet sessions do not have an associated socket.
Miscellaneous
=============
-The L2TP drivers were developed as part of the OpenL2TP project by
-Katalix Systems Ltd. OpenL2TP is a full-featured L2TP client / server,
-designed from the ground up to have the L2TP datapath in the
-kernel. The project also implemented the pppol2tp plugin for pppd
-which allows pppd to use the kernel driver. Details can be found at
-http://www.openl2tp.org.
+RFCs
+----
+
+The kernel code implements the datapath features specified in the
+following RFCs:
+
+======= =============== ===================================
+RFC2661 L2TPv2 https://tools.ietf.org/html/rfc2661
+RFC3931 L2TPv3 https://tools.ietf.org/html/rfc3931
+RFC4719 L2TPv3 Ethernet https://tools.ietf.org/html/rfc4719
+======= =============== ===================================
+
+Implementations
+---------------
+
+A number of open source applications use the L2TP kernel subsystem:
+
+============ ==============================================
+iproute2 https://github.com/shemminger/iproute2
+go-l2tp https://github.com/katalix/go-l2tp
+tunneldigger https://github.com/wlanslovenija/tunneldigger
+xl2tpd https://github.com/xelerance/xl2tpd
+============ ==============================================
+
+Limitations
+-----------
+
+The current implementation has a number of limitations:
+
+ 1) Multiple UDP sockets with the same 5-tuple address cannot be
+ used. The kernel's tunnel context is identified using private
+ data associated with the socket so it is important that each
+ socket is uniquely identified by its address.
+
+ 2) Interfacing with openvswitch is not yet implemented. It may be
+ useful to map OVS Ethernet and VLAN ports into L2TPv3 tunnels.
+
+ 3) VLAN pseudowires are implemented using an ``l2tpethN`` interface
+ configured with a VLAN sub-interface. Since L2TPv3 VLAN
+ pseudowires carry one and only one VLAN, it may be better to use
+ a single netdevice rather than an ``l2tpethN`` and ``l2tpethN``:M
+ pair per VLAN session. The netlink attribute
+ ``L2TP_ATTR_VLAN_ID`` was added for this, but it was never
+ implemented.
+
+Testing
+-------
+
+Unmanaged L2TPv3 Ethernet features are tested by the kernel's built-in
+selftests. See `tools/testing/selftests/net/l2tp.sh`_.
+
+Another test suite, l2tp-ktest_, covers all
+of the L2TP APIs and tunnel/session types. This may be integrated into
+the kernel's built-in L2TP selftests in the future.
+
+.. Links
+.. _Generic Netlink: generic_netlink.html
+.. _libmnl: https://www.netfilter.org/projects/libmnl
+.. _include/uapi/linux/l2tp.h: ../../../include/uapi/linux/l2tp.h
+.. _include/linux/if_pppol2tp.h: ../../../include/linux/if_pppol2tp.h
+.. _net/l2tp/l2tp_ip.c: ../../../net/l2tp/l2tp_ip.c
+.. _net/l2tp/l2tp_ip6.c: ../../../net/l2tp/l2tp_ip6.c
+.. _net/l2tp/l2tp_ppp.c: ../../../net/l2tp/l2tp_ppp.c
+.. _net/l2tp/l2tp_eth.c: ../../../net/l2tp/l2tp_eth.c
+.. _tools/testing/selftests/net/l2tp.sh: ../../../tools/testing/selftests/net/l2tp.sh
+.. _l2tp-ktest: https://github.com/katalix/l2tp-ktest
diff --git a/Documentation/networking/ltpc.rst b/Documentation/networking/ltpc.rst
deleted file mode 100644
index 0ad197fd17ce..000000000000
--- a/Documentation/networking/ltpc.rst
+++ /dev/null
@@ -1,144 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===========
-LTPC Driver
-===========
-
-This is the ALPHA version of the ltpc driver.
-
-In order to use it, you will need at least version 1.3.3 of the
-netatalk package, and the Apple or Farallon LocalTalk PC card.
-There are a number of different LocalTalk cards for the PC; this
-driver applies only to the one with the 65c02 processor chip on it.
-
-To include it in the kernel, select the CONFIG_LTPC switch in the
-configuration dialog. You can also compile it as a module.
-
-While the driver will attempt to autoprobe the I/O port address, IRQ
-line, and DMA channel of the card, this does not always work. For
-this reason, you should be prepared to supply these parameters
-yourself. (see "Card Configuration" below for how to determine or
-change the settings on your card)
-
-When the driver is compiled into the kernel, you can add a line such
-as the following to your /etc/lilo.conf::
-
- append="ltpc=0x240,9,1"
-
-where the parameters (in order) are the port address, IRQ, and DMA
-channel. The second and third values can be omitted, in which case
-the driver will try to determine them itself.
-
-If you load the driver as a module, you can pass the parameters "io=",
-"irq=", and "dma=" on the command line with insmod or modprobe, or add
-them as options in a configuration file in /etc/modprobe.d/ directory::
-
- alias lt0 ltpc # autoload the module when the interface is configured
- options ltpc io=0x240 irq=9 dma=1
-
-Before starting up the netatalk demons (perhaps in rc.local), you
-need to add a line such as::
-
- /sbin/ifconfig lt0 127.0.0.42
-
-The address is unimportant - however, the card needs to be configured
-with ifconfig so that Netatalk can find it.
-
-The appropriate netatalk configuration depends on whether you are
-attached to a network that includes AppleTalk routers or not. If,
-like me, you are simply connecting to your home Macintoshes and
-printers, you need to set up netatalk to "seed". The way I do this
-is to have the lines::
-
- dummy -seed -phase 2 -net 2000 -addr 2000.26 -zone "1033"
- lt0 -seed -phase 1 -net 1033 -addr 1033.27 -zone "1033"
-
-in my atalkd.conf. What is going on here is that I need to fool
-netatalk into thinking that there are two AppleTalk interfaces
-present; otherwise, it refuses to seed. This is a hack, and a more
-permanent solution would be to alter the netatalk code. Also, make
-sure you have the correct name for the dummy interface - If it's
-compiled as a module, you will need to refer to it as "dummy0" or some
-such.
-
-If you are attached to an extended AppleTalk network, with routers on
-it, then you don't need to fool around with this -- the appropriate
-line in atalkd.conf is::
-
- lt0 -phase 1
-
-
-Card Configuration
-==================
-
-The interrupts and so forth are configured via the dipswitch on the
-board. Set the switches so as not to conflict with other hardware.
-
- Interrupts -- set at most one. If none are set, the driver uses
- polled mode. Because the card was developed in the XT era, the
- original documentation refers to IRQ2. Since you'll be running
- this on an AT (or later) class machine, that really means IRQ9.
-
- === ===========================================================
- SW1 IRQ 4
- SW2 IRQ 3
- SW3 IRQ 9 (2 in original card documentation only applies to XT)
- === ===========================================================
-
-
- DMA -- choose DMA 1 or 3, and set both corresponding switches.
-
- === =====
- SW4 DMA 3
- SW5 DMA 1
- SW6 DMA 3
- SW7 DMA 1
- === =====
-
-
- I/O address -- choose one.
-
- === =========
- SW8 220 / 240
- === =========
-
-
-IP
-==
-
-Yes, it is possible to do IP over LocalTalk. However, you can't just
-treat the LocalTalk device like an ordinary Ethernet device, even if
-that's what it looks like to Netatalk.
-
-Instead, you follow the same procedure as for doing IP in EtherTalk.
-See Documentation/networking/ipddp.rst for more information about the
-kernel driver and userspace tools needed.
-
-
-Bugs
-====
-
-IRQ autoprobing often doesn't work on a cold boot. To get around
-this, either compile the driver as a module, or pass the parameters
-for the card to the kernel as described above.
-
-Also, as usual, autoprobing is not recommended when you use the driver
-as a module. (though it usually works at boot time, at least)
-
-Polled mode is *really* slow sometimes, but this seems to depend on
-the configuration of the network.
-
-It may theoretically be possible to use two LTPC cards in the same
-machine, but this is unsupported, so if you really want to do this,
-you'll probably have to hack the initialization code a bit.
-
-
-Thanks
-======
-
-Thanks to Alan Cox for helpful discussions early on in this
-work, and to Denis Hainsworth for doing the bleeding-edge testing.
-
-Bradford Johnson <bradford@math.umn.edu>
-
-Updated 11/09/1998 by David Huggins-Daines <dhd@debian.org>
diff --git a/Documentation/networking/mctp.rst b/Documentation/networking/mctp.rst
new file mode 100644
index 000000000000..c628cb5406d2
--- /dev/null
+++ b/Documentation/networking/mctp.rst
@@ -0,0 +1,320 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================================
+Management Component Transport Protocol (MCTP)
+==============================================
+
+net/mctp/ contains protocol support for MCTP, as defined by DMTF standard
+DSP0236. Physical interface drivers ("bindings" in the specification) are
+provided in drivers/net/mctp/.
+
+The core code provides a socket-based interface to send and receive MCTP
+messages, through an AF_MCTP, SOCK_DGRAM socket.
+
+Structure: interfaces & networks
+================================
+
+The kernel models the local MCTP topology through two items: interfaces and
+networks.
+
+An interface (or "link") is an instance of an MCTP physical transport binding
+(as defined by DSP0236, section 3.2.47), likely connected to a specific hardware
+device. This is represented as a ``struct net_device``.
+
+A network defines a unique address space for MCTP endpoints by endpoint-ID
+(described by DSP0236, section 3.2.31). A network has a user-visible identifier
+to allow references from userspace. Route definitions are specific to one
+network.
+
+Interfaces are associated with one network. A network may be associated with one
+or more interfaces.
+
+If multiple networks are present, each may contain endpoint IDs (EIDs) that are
+also present on other networks.
+
+Sockets API
+===========
+
+Protocol definitions
+--------------------
+
+MCTP uses ``AF_MCTP`` / ``PF_MCTP`` for the address and protocol families.
+Since MCTP is message-based, only ``SOCK_DGRAM`` sockets are supported.
+
+.. code-block:: C
+
+ int sd = socket(AF_MCTP, SOCK_DGRAM, 0);
+
+The only (current) value for the ``protocol`` argument is 0.
+
+As with all socket address families, source and destination addresses are
+specified with a ``sockaddr`` type, with a single-byte endpoint address:
+
+.. code-block:: C
+
+ typedef __u8 mctp_eid_t;
+
+ struct mctp_addr {
+ mctp_eid_t s_addr;
+ };
+
+ struct sockaddr_mctp {
+ __kernel_sa_family_t smctp_family;
+ unsigned int smctp_network;
+ struct mctp_addr smctp_addr;
+ __u8 smctp_type;
+ __u8 smctp_tag;
+ };
+
+ #define MCTP_NET_ANY 0x0
+ #define MCTP_ADDR_ANY 0xff
+
+
+Syscall behaviour
+-----------------
+
+The following sections describe the MCTP-specific behaviours of the standard
+socket system calls. These behaviours have been chosen to map closely to the
+existing sockets APIs.
+
+``bind()`` : set local socket address
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Sockets that receive incoming request packets will bind to a local address,
+using the ``bind()`` syscall.
+
+.. code-block:: C
+
+ struct sockaddr_mctp addr;
+
+ addr.smctp_family = AF_MCTP;
+ addr.smctp_network = MCTP_NET_ANY;
+ addr.smctp_addr.s_addr = MCTP_ADDR_ANY;
+ addr.smctp_type = MCTP_TYPE_PLDM;
+ addr.smctp_tag = MCTP_TAG_OWNER;
+
+ int rc = bind(sd, (struct sockaddr *)&addr, sizeof(addr));
+
+This establishes the local address of the socket. Incoming MCTP messages that
+match the network, address, and message type will be received by this socket.
+The reference to 'incoming' is important here; a bound socket will only receive
+messages with the TO bit set, to indicate an incoming request message, rather
+than a response.
+
+The ``smctp_tag`` value will configure the tags accepted from the remote side of
+this socket. Given the above, the only valid value is ``MCTP_TAG_OWNER``, which
+will result in remotely "owned" tags being routed to this socket. Since
+``MCTP_TAG_OWNER`` is set, the 3 least-significant bits of ``smctp_tag`` are not
+used; callers must set them to zero.
+
+A ``smctp_network`` value of ``MCTP_NET_ANY`` will configure the socket to
+receive incoming packets from any locally-connected network. A specific network
+value will cause the socket to only receive incoming messages from that network.
+
+The ``smctp_addr`` field specifies a local address to bind to. A value of
+``MCTP_ADDR_ANY`` configures the socket to receive messages addressed to any
+local destination EID.
+
+The ``smctp_type`` field specifies which message types to receive. Only the
+lower 7 bits of the type are matched on incoming messages (i.e., the
+most-significant IC bit is not part of the match). This results in the socket
+receiving packets with and without a message integrity check footer.
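
The type matching described above can be sketched as a comparison that masks off the IC bit. The macro and function names below are hypothetical and chosen for illustration; they are not taken from the kernel sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
#define EX_MCTP_TYPE_MASK 0x7f /* lower 7 bits: message type */
#define EX_MCTP_IC_BIT    0x80 /* most-significant bit: integrity check present */

/*
 * Return true if an incoming message type byte matches the type a socket
 * was bound to. The IC bit is ignored on both sides, so messages with and
 * without an integrity check footer match the same binding.
 */
static bool ex_mctp_type_matches(uint8_t bound_type, uint8_t incoming_type)
{
    return (bound_type & EX_MCTP_TYPE_MASK) ==
           (incoming_type & EX_MCTP_TYPE_MASK);
}
```

Under this sketch, a socket bound to type 0x01 would match incoming type bytes 0x01 and 0x81, but not 0x02.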
+
+``sendto()``, ``sendmsg()``, ``send()`` : transmit an MCTP message
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+An MCTP message is transmitted using one of the ``sendto()``, ``sendmsg()`` or
+``send()`` syscalls. Using ``sendto()`` as the primary example:
+
+.. code-block:: C
+
+ struct sockaddr_mctp addr;
+ char buf[14];
+ ssize_t len;
+
+ /* set message destination */
+ addr.smctp_family = AF_MCTP;
+ addr.smctp_network = 0;
+ addr.smctp_addr.s_addr = 8;
+ addr.smctp_tag = MCTP_TAG_OWNER;
+ addr.smctp_type = MCTP_TYPE_ECHO;
+
+ /* arbitrary message to send, with message-type header */
+ buf[0] = MCTP_TYPE_ECHO;
+ memcpy(buf + 1, "hello, world!", sizeof(buf) - 1);
+
+ len = sendto(sd, buf, sizeof(buf), 0,
+ (struct sockaddr *)&addr, sizeof(addr));
+
+The network and address fields of ``addr`` define the remote address to send to.
+If ``smctp_tag`` has the ``MCTP_TAG_OWNER`` bit set, the kernel will ignore any
+bits set in the tag value (the three least-significant bits,
+``MCTP_TAG_MASK``), and generate a tag value suitable for the destination
+EID. If ``MCTP_TAG_OWNER`` is not set, the message will be sent with the tag
+value as specified. If a tag value cannot be allocated, the system call will
+report an errno of ``EAGAIN``.
+
+The application must provide the message type byte as the first byte of the
+message buffer passed to ``sendto()``. If a message integrity check is to be
+included in the transmitted message, it must also be provided in the message
+buffer, and the most-significant bit of the message type byte must be 1.
+
+The ``sendmsg()`` system call allows a more compact argument interface, and the
+message buffer to be specified as a scatter-gather list. At present no ancillary
+message types (used for the ``msg_control`` data passed to ``sendmsg()``) are
+defined.
+
+Transmitting a message on an unconnected socket with ``MCTP_TAG_OWNER``
+specified will cause an allocation of a tag, if no valid tag is already
+allocated for that destination. The (destination-eid,tag) tuple acts as an
+implicit local socket address, to allow the socket to receive responses to this
+outgoing message. If any previous allocation has been performed (for a
+different remote EID), that allocation is lost.
+
+Sockets will only receive responses to requests they have sent (with TO=1) and
+may only respond (with TO=0) to requests they have received.
+
+``recvfrom()``, ``recvmsg()``, ``recv()`` : receive an MCTP message
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+An MCTP message can be received by an application using one of the
+``recvfrom()``, ``recvmsg()``, or ``recv()`` system calls. Using ``recvfrom()``
+as the primary example:
+
+.. code-block:: C
+
+ struct sockaddr_mctp addr;
+ socklen_t addrlen;
+ char buf[14];
+ ssize_t len;
+
+ addrlen = sizeof(addr);
+
+ len = recvfrom(sd, buf, sizeof(buf), 0,
+ (struct sockaddr *)&addr, &addrlen);
+
+ /* We can expect addr to describe an MCTP address */
+ assert(addrlen >= sizeof(addr));
+ assert(addr.smctp_family == AF_MCTP);
+
+ printf("received %zd bytes from remote EID %d\n", len, addr.smctp_addr.s_addr);
+
+The address argument to ``recvfrom`` and ``recvmsg`` is populated with the
+remote address of the incoming message, including tag value (this will be needed
+in order to reply to the message).
+
+The first byte of the message buffer will contain the message type byte. If an
+integrity check follows the message, it will be included in the received buffer.
+
+The ``recv()`` system call behaves in a similar way, but does not provide a
+remote address to the application. Therefore, it is only useful if the
+remote address is already known, or the message does not require a reply.
+
+Like the send calls, sockets will only receive responses to requests they have
+sent (TO=1) and may only respond (TO=0) to requests they have received.
+
+``ioctl(SIOCMCTPALLOCTAG)`` and ``ioctl(SIOCMCTPDROPTAG)``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+These ioctls give applications more control over MCTP message tags, by allocating
+(and dropping) tag values explicitly, rather than the kernel automatically
+allocating a per-message tag at ``sendmsg()`` time.
+
+In general, you will only need to use these ioctls if your MCTP protocol does
+not fit the usual request/response model. For example, if you need to persist
+tags across multiple requests, or a request may generate more than one response.
+In these cases, the ioctls allow you to decouple the tag allocation (and
+release) from individual message send and receive operations.
+
+Both ioctls are passed a pointer to a ``struct mctp_ioc_tag_ctl``:
+
+.. code-block:: C
+
+ struct mctp_ioc_tag_ctl {
+ mctp_eid_t peer_addr;
+ __u8 tag;
+ __u16 flags;
+ };
+
+``SIOCMCTPALLOCTAG`` allocates a tag for a specific peer, which an application
+can use in future ``sendmsg()`` calls. The application populates the
+``peer_addr`` member with the remote EID. Other fields must be zero.
+
+On return, the ``tag`` member will be populated with the allocated tag value.
+The allocated tag will have the following tag bits set:
+
+ - ``MCTP_TAG_OWNER``: it only makes sense to allocate tags if you're the tag
+ owner
+
+ - ``MCTP_TAG_PREALLOC``: to indicate to ``sendmsg()`` that this is a
+ preallocated tag.
+
+ - ... and the actual tag value, within the least-significant three bits
+ (``MCTP_TAG_MASK``). Note that zero is a valid tag value.
+
+The tag value should be used as-is for the ``smctp_tag`` member of ``struct
+sockaddr_mctp``.
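
To illustrate how the bits of an allocated tag fit together, here is a minimal decoding sketch. The numeric values are an assumption believed to mirror ``include/uapi/linux/mctp.h``; real code should include that header rather than redefine them. The ``EX_`` names and the ``ex_decode_tag`` helper are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the uapi definitions; check <linux/mctp.h> before use. */
#define EX_MCTP_TAG_MASK     0x07 /* actual tag value, 0..7 */
#define EX_MCTP_TAG_OWNER    0x08 /* local side owns the tag */
#define EX_MCTP_TAG_PREALLOC 0x10 /* tag came from SIOCMCTPALLOCTAG */

struct ex_tag_info {
    uint8_t value;  /* tag value in the least-significant three bits */
    bool owner;     /* owner bit set */
    bool prealloc;  /* preallocated bit set */
};

/* Split a tag byte, as returned in mctp_ioc_tag_ctl.tag, into its parts. */
static struct ex_tag_info ex_decode_tag(uint8_t tag)
{
    struct ex_tag_info info = {
        .value = tag & EX_MCTP_TAG_MASK,
        .owner = (tag & EX_MCTP_TAG_OWNER) != 0,
        .prealloc = (tag & EX_MCTP_TAG_PREALLOC) != 0,
    };

    return info;
}
```

Decoding is only useful for logging or debugging; as noted above, the tag byte itself should be passed unchanged in ``smctp_tag`` and to ``SIOCMCTPDROPTAG``.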
+
+``SIOCMCTPDROPTAG`` releases a tag that has been previously allocated by a
+``SIOCMCTPALLOCTAG`` ioctl. The ``peer_addr`` must be the same as used for the
+allocation, and the ``tag`` value must match exactly the tag returned from the
+allocation (including the ``MCTP_TAG_OWNER`` and ``MCTP_TAG_PREALLOC`` bits).
+The ``flags`` field must be zero.
+
+Kernel internals
+================
+
+There are a few possible packet flows in the MCTP stack:
+
+1. local TX to remote endpoint, message <= MTU::
+
+ sendmsg()
+ -> mctp_local_output()
+ : route lookup
+ -> rt->output() (== mctp_route_output)
+ -> dev_queue_xmit()
+
+2. local TX to remote endpoint, message > MTU::
+
+ sendmsg()
+ -> mctp_local_output()
+ -> mctp_do_fragment_route()
+ : creates packet-sized skbs. For each new skb:
+ -> rt->output() (== mctp_route_output)
+ -> dev_queue_xmit()
+
+3. remote TX to local endpoint, single-packet message::
+
+ mctp_pkttype_receive()
+ : route lookup
+ -> rt->output() (== mctp_route_input)
+ : sk_key lookup
+ -> sock_queue_rcv_skb()
+
+4. remote TX to local endpoint, multiple-packet message::
+
+ mctp_pkttype_receive()
+ : route lookup
+ -> rt->output() (== mctp_route_input)
+ : sk_key lookup
+ : stores skb in struct sk_key->reasm_head
+
+ mctp_pkttype_receive()
+ : route lookup
+ -> rt->output() (== mctp_route_input)
+ : sk_key lookup
+ : finds existing reassembly in sk_key->reasm_head
+ : appends new fragment
+ -> sock_queue_rcv_skb()
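
For flow 2 above, the number of packet-sized skbs can be estimated with a short calculation. This is a sketch under stated assumptions, not the kernel's code: it assumes the MTU includes a 4-byte MCTP transport header on every packet and that each fragment carries as much payload as fits. The names ``EX_MCTP_HDR_LEN`` and ``ex_mctp_fragment_count`` are hypothetical.

```c
#include <stddef.h>

/* Assumption: every packet carries a 4-byte MCTP transport header. */
#define EX_MCTP_HDR_LEN 4

/*
 * Estimate how many packets a message of msg_len bytes is split into for
 * a given MTU. Returns 0 for an empty message or when the MTU leaves no
 * room for payload.
 */
static size_t ex_mctp_fragment_count(size_t msg_len, size_t mtu)
{
    size_t per_pkt;

    if (mtu <= EX_MCTP_HDR_LEN)
        return 0;

    per_pkt = mtu - EX_MCTP_HDR_LEN;

    /* Round up: a partial final fragment still needs its own packet. */
    return (msg_len + per_pkt - 1) / per_pkt;
}
```

With an MTU of 68 bytes, each packet would carry up to 64 bytes of message data, so a 200-byte message would need four packets under these assumptions.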
+
+Key refcounts
+-------------
+
+ * keys are refed by:
+
+ - a skb: during route output, stored in ``skb->cb``.
+
+ - netns and sock lists.
+
+ * keys can be associated with a device, in which case they hold a
+ reference to the dev (set through ``key->dev``, counted through
+ ``dev->key_count``). Multiple keys can reference the device.
diff --git a/Documentation/networking/mptcp-sysctl.rst b/Documentation/networking/mptcp-sysctl.rst
new file mode 100644
index 000000000000..69975ce25a02
--- /dev/null
+++ b/Documentation/networking/mptcp-sysctl.rst
@@ -0,0 +1,95 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+MPTCP Sysctl variables
+======================
+
+/proc/sys/net/mptcp/* Variables
+===============================
+
+enabled - BOOLEAN
+ Control whether MPTCP sockets can be created.
+
+ MPTCP sockets can be created if the value is 1. This is a
+ per-namespace sysctl.
+
+ Default: 1 (enabled)
+
+add_addr_timeout - INTEGER (seconds)
+ Set the timeout after which an ADD_ADDR control message will be
+ resent to an MPTCP peer that has not acknowledged a previous
+ ADD_ADDR message.
+
+ The default value matches TCP_RTO_MAX. This is a per-namespace
+ sysctl.
+
+ Default: 120
+
+close_timeout - INTEGER (seconds)
+ Set the make-after-break timeout: in the absence of any close or
+ shutdown syscall, MPTCP sockets will maintain the status
+ unchanged for such time, after the last subflow removal, before
+ moving to TCP_CLOSE.
+
+ The default value matches TCP_TIMEWAIT_LEN. This is a per-namespace
+ sysctl.
+
+ Default: 60
+
+checksum_enabled - BOOLEAN
+ Control whether DSS checksum can be enabled.
+
+ DSS checksum can be enabled if the value is nonzero. This is a
+ per-namespace sysctl.
+
+ Default: 0
+
+allow_join_initial_addr_port - BOOLEAN
+ Allow peers to send join requests to the IP address and port number used
+ by the initial subflow if the value is 1. This controls a flag that is
+ sent to the peer at connection time, and whether such join requests are
+ accepted or denied.
+
+ Joins to addresses advertised with ADD_ADDR are not affected by this
+ value.
+
+ This is a per-namespace sysctl.
+
+ Default: 1
+
+pm_type - INTEGER
+ Set the default path manager type to use for each new MPTCP
+ socket. In-kernel path management will control subflow
+ connections and address advertisements according to
+ per-namespace values configured over the MPTCP netlink
+ API. Userspace path management puts per-MPTCP-connection subflow
+ connection decisions and address advertisements under control of
+ a privileged userspace program, at the cost of more netlink
+ traffic to propagate all of the related events and commands.
+
+ This is a per-namespace sysctl.
+
+ * 0 - In-kernel path manager
+ * 1 - Userspace path manager
+
+ Default: 0
+
+stale_loss_cnt - INTEGER
+ The number of MPTCP-level retransmission intervals with no traffic and
+ pending outstanding data on a given subflow required to declare it stale.
+ The packet scheduler ignores stale subflows.
+ A low stale_loss_cnt value allows for fast active-backup switch-over;
+ a high value maximizes link utilization in edge scenarios, e.g. a lossy
+ link with a high BER or a peer pausing data processing.
+
+ This is a per-namespace sysctl.
+
+ Default: 4
+
+scheduler - STRING
+ Select the scheduler of your choice.
+
+ Support for selection of different schedulers. This is a per-namespace
+ sysctl.
+
+ Default: "default"
diff --git a/Documentation/networking/msg_zerocopy.rst b/Documentation/networking/msg_zerocopy.rst
index ace56204dd03..78fb70e748b7 100644
--- a/Documentation/networking/msg_zerocopy.rst
+++ b/Documentation/networking/msg_zerocopy.rst
@@ -7,7 +7,8 @@ Intro
=====
The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
-The feature is currently implemented for TCP and UDP sockets.
+The feature is currently implemented for TCP, UDP and VSOCK (with
+virtio transport) sockets.
Opportunity and Caveats
@@ -15,7 +16,7 @@ Opportunity and Caveats
Copying large buffers between user process and kernel can be
expensive. Linux supports various interfaces that eschew copying,
-such as sendpage and splice. The MSG_ZEROCOPY flag extends the
+such as sendfile and splice. The MSG_ZEROCOPY flag extends the
underlying copy avoidance mechanism to common socket send calls.
Copy avoidance is not a free lunch. As implemented, with page pinning,
@@ -50,7 +51,7 @@ the excellent reporting over at LWN.net or read the original code.
patchset
[PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
- https://lkml.kernel.org/netdev/20170803202945.70750-1-willemdebruijn.kernel@gmail.com
+ https://lore.kernel.org/netdev/20170803202945.70750-1-willemdebruijn.kernel@gmail.com
Interface
@@ -83,8 +84,8 @@ Pass the new flag.
ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);
A zerocopy failure will return -1 with errno ENOBUFS. This happens if
-the socket option was not set, the socket exceeds its optmem limit or
-the user exceeds its ulimit on locked pages.
+the socket exceeds its optmem limit or the user exceeds their ulimit on
+locked pages.
Mixing copy avoidance and copying
@@ -174,7 +175,9 @@ read_notification() call in the previous snippet. A notification
is encoded in the standard error format, sock_extended_err.
The level and type fields in the control data are protocol family
-specific, IP_RECVERR or IPV6_RECVERR.
+specific, IP_RECVERR or IPV6_RECVERR (for TCP or UDP sockets).
+For VSOCK sockets, cmsg_level will be SOL_VSOCK and cmsg_type will be
+VSOCK_RECVERR.
Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
as explained before, to avoid blocking read and write system calls on
@@ -235,12 +238,15 @@ Implementation
Loopback
--------
+For TCP and UDP:
Data sent to local sockets can be queued indefinitely if the receive
process does not read its socket. Unbound notification latency is not
acceptable. For this reason all packets generated with MSG_ZEROCOPY
that are looped to a local socket will incur a deferred copy. This
includes looping onto packet sockets (e.g., tcpdump) and tun devices.
+For VSOCK:
+The data path for local sockets is the same as for non-local sockets.
Testing
=======
@@ -254,3 +260,6 @@ instance when run with msg_zerocopy.sh between a veth pair across
namespaces, the test will not show any improvement. For testing, the
loopback restriction can be temporarily relaxed by making
skb_orphan_frags_rx identical to skb_orphan_frags.
+
+For VSOCK sockets, an example can be found in
+tools/testing/vsock/vsock_test_zerocopy.c.
diff --git a/Documentation/networking/multi-pf-netdev.rst b/Documentation/networking/multi-pf-netdev.rst
new file mode 100644
index 000000000000..268819225866
--- /dev/null
+++ b/Documentation/networking/multi-pf-netdev.rst
@@ -0,0 +1,174 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===============
+Multi-PF Netdev
+===============
+
+Contents
+========
+
+- `Background`_
+- `Overview`_
+- `mlx5 implementation`_
+- `Channels distribution`_
+- `Observability`_
+- `Steering`_
+- `Mutually exclusive features`_
+
+Background
+==========
+
+The Multi-PF NIC technology enables several CPUs within a multi-socket server to connect directly to
+the network, each through its own dedicated PCIe interface. This is done either through a connection
+harness that splits the PCIe lanes between two cards or by bifurcating a PCIe slot for a single card.
+This eliminates network traffic traversing the internal bus between the sockets, significantly
+reducing overhead and latency, in addition to reducing CPU utilization and increasing network
+throughput.
+
+Overview
+========
+
+The feature adds support for combining multiple PFs of the same port in a Multi-PF environment under
+one netdev instance. It is implemented in the netdev layer. Lower-layer instances like pci func,
+sysfs entry, and devlink are kept separate.
+Passing traffic through different devices belonging to different NUMA sockets saves cross-NUMA
+traffic and allows apps running on the same netdev from different NUMAs to still feel a sense of
+proximity to the device and achieve improved performance.
+
+mlx5 implementation
+===================
+
+Multi-PF or Socket-direct in mlx5 is achieved by grouping together PFs that belong to the same
+NIC and have the socket-direct property enabled. Once all PFs are probed, we create a single netdev
+to represent all of them. Symmetrically, we destroy the netdev whenever any of the PFs is removed.
+
+The netdev network channels are distributed between all devices. A proper configuration would utilize
+the correct close NUMA node when working on a certain app/CPU.
+
+We pick one PF to be a primary (leader), and it fills a special role. The other devices
+(secondaries) are disconnected from the network at the chip level (set to silent mode). In silent
+mode, no south <-> north traffic flows directly through a secondary PF. It needs the assistance of
+the leader PF (east <-> west traffic) to function. All Rx/Tx traffic is steered through the primary
+to/from the secondaries.
+
+Currently, we limit the support to PFs only, and up to two PFs (sockets).
+
+Channels distribution
+=====================
+
+We distribute the channels between the different PFs to achieve local NUMA node performance
+on multiple NUMA nodes.
+
+Each combined channel works against one specific PF, creating all its datapath queues against it. We
+distribute channels to PFs in a round-robin policy.
+
+::
+
+ Example for 2 PFs and 5 channels:
+ +--------+--------+
+ | ch idx | PF idx |
+ +--------+--------+
+ | 0 | 0 |
+ | 1 | 1 |
+ | 2 | 0 |
+ | 3 | 1 |
+ | 4 | 0 |
+ +--------+--------+
+
+
+We prefer round-robin because it is less influenced by changes in the number of channels. The
+mapping between a channel index and a PF is fixed, no matter how many channels the user configures.
+As the channel stats persist across a channel's closure, changing the mapping every time
+would make the accumulated stats less representative of the channel's history.
+
+This is achieved by using the correct core device instance (mdev) in each channel, instead of them
+all using the same instance under "priv->mdev".
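
The round-robin policy described above can be sketched as a pure function of the channel index. This is a minimal illustration, not the mlx5 driver's code: the helper name `channel_to_pf` is hypothetical, and it assumes PFs are numbered from 0.

```c
#include <assert.h>

/*
 * Hypothetical helper (not taken from the mlx5 driver): map a combined
 * channel index to the index of the PF that backs it, round-robin.
 */
static unsigned int channel_to_pf(unsigned int ch_idx, unsigned int num_pfs)
{
	assert(num_pfs > 0);
	/* The result depends only on ch_idx, never on the channel count. */
	return ch_idx % num_pfs;
}
```

Because the result does not depend on how many channels are configured, channel 3 maps to PF 1 whether the netdev has 4 channels or 40, which is what keeps per-channel stats meaningful across reconfiguration.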
+
+Observability
+=============
+The relation between PF, irq, napi, and queue can be observed via netlink spec::
+
+ $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump queue-get --json='{"ifindex": 13}'
+ [{'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'rx'},
+ {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'rx'},
+ {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'rx'},
+ {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'rx'},
+ {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'rx'},
+ {'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'tx'},
+ {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'tx'},
+ {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'tx'},
+ {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'tx'},
+ {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'tx'}]
+
+ $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump napi-get --json='{"ifindex": 13}'
+ [{'id': 543, 'ifindex': 13, 'irq': 42},
+ {'id': 542, 'ifindex': 13, 'irq': 41},
+ {'id': 541, 'ifindex': 13, 'irq': 40},
+ {'id': 540, 'ifindex': 13, 'irq': 39},
+ {'id': 539, 'ifindex': 13, 'irq': 36}]
+
+Here you can clearly observe our channels distribution policy::
+
+ $ ls /proc/irq/{36,39,40,41,42}/mlx5* -d -1
+ /proc/irq/36/mlx5_comp1@pci:0000:08:00.0
+ /proc/irq/39/mlx5_comp1@pci:0000:09:00.0
+ /proc/irq/40/mlx5_comp2@pci:0000:08:00.0
+ /proc/irq/41/mlx5_comp2@pci:0000:09:00.0
+ /proc/irq/42/mlx5_comp3@pci:0000:08:00.0
+
+Steering
+========
+Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.
+
+In Rx, the steering tables belong to the primary PF only, and it is its role to distribute incoming
+traffic to other PFs, via cross-vhca steering capabilities. We still maintain a single default RSS
+table, which is capable of pointing to the receive queues of a different PF.
+
+In Tx, the primary PF creates a new Tx flow table, which is aliased by the secondaries, so they can
+go out to the network through it.
+
+In addition, we set default XPS configuration that, based on the CPU, selects an SQ belonging to the
+PF on the same node as the CPU.
+
+XPS default config example:
+
+| NUMA node(s): 2
+| NUMA node0 CPU(s): 0-11
+| NUMA node1 CPU(s): 12-23
+
+PF0 on node0, PF1 on node1.
+
+- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001
+- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000
+- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002
+- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000
+- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004
+- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000
+- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008
+- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000
+- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010
+- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000
+- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020
+- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000
+- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040
+- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000
+- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080
+- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000
+- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100
+- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000
+- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200
+- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000
+- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400
+- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000
+- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800
+- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000
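
The listing above is consistent with a simple rule: Tx queue `i` belongs to PF `i % 2`, and its XPS mask holds the `(i / 2)`-th CPU of that PF's NUMA node. The sketch below models that rule only. It is not the driver's implementation: `default_xps_mask` is a hypothetical name, and it assumes that PF `n` sits on NUMA node `n` and that each node owns a contiguous range of CPU numbers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of the example listing (not driver code).
 * Assumes PF n is on NUMA node n and node n owns the CPUs
 * [n * cpus_per_node, (n + 1) * cpus_per_node).
 */
static uint64_t default_xps_mask(unsigned int txq, unsigned int num_pfs,
				 unsigned int cpus_per_node)
{
	unsigned int pf, cpu;

	assert(num_pfs > 0 && cpus_per_node > 0);
	assert(txq / num_pfs < cpus_per_node);

	pf = txq % num_pfs;                       /* round-robin queue -> PF */
	cpu = pf * cpus_per_node + txq / num_pfs; /* next CPU on that node */
	assert(cpu < 64);

	return UINT64_C(1) << cpu;
}
```

With 2 PFs and 12 CPUs per node, queue 1 lands on CPU 12 (mask `001000`) and queue 23 on CPU 23 (mask `800000`), matching the sysfs values shown.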
+
+Mutually exclusive features
+===========================
+
+The nature of Multi-PF, where different channels work with different PFs, conflicts with
+stateful features where the state is maintained in one of the PFs.
+For example, in the TLS device-offload feature, special context objects are created per connection
+and maintained in the PF. Transitioning between different RQs/SQs would break the feature. Hence,
+we disable this combination for now.
diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst
new file mode 100644
index 000000000000..7bf7b95c4f7a
--- /dev/null
+++ b/Documentation/networking/napi.rst
@@ -0,0 +1,255 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+.. _napi:
+
+====
+NAPI
+====
+
+NAPI is the event handling mechanism used by the Linux networking stack.
+The name NAPI no longer stands for anything in particular [#]_.
+
+In basic operation the device notifies the host about new events
+via an interrupt.
+The host then schedules a NAPI instance to process the events.
+The device may also be polled for events via NAPI without receiving
+interrupts first (:ref:`busy polling<poll>`).
+
+NAPI processing usually happens in the software interrupt context,
+but there is an option to use :ref:`separate kernel threads<threaded>`
+for NAPI processing.
+
+All in all, NAPI abstracts away from the drivers the context and configuration
+of event (packet Rx and Tx) processing.
+
+Driver API
+==========
+
+The two most important elements of NAPI are the struct napi_struct
+and the associated poll method. struct napi_struct holds the state
+of the NAPI instance while the method is the driver-specific event
+handler. The method will typically free Tx packets that have been
+transmitted and process newly received packets.
+
+.. _drv_ctrl:
+
+Control API
+-----------
+
+netif_napi_add() and netif_napi_del() add/remove a NAPI instance
+from the system. The instances are attached to the netdevice passed
+as argument (and will be deleted automatically when netdevice is
+unregistered). Instances are added in a disabled state.
+
+napi_enable() and napi_disable() manage the disabled state.
+A disabled NAPI can't be scheduled and its poll method is guaranteed
+to not be invoked. napi_disable() waits for ownership of the NAPI
+instance to be released.
+
+The control APIs are not idempotent. Control API calls are safe against
+concurrent use of datapath APIs but an incorrect sequence of control API
+calls may result in crashes, deadlocks, or race conditions. For example,
+calling napi_disable() multiple times in a row will deadlock.
+
+Datapath API
+------------
+
+napi_schedule() is the basic method of scheduling a NAPI poll.
+Drivers should call this function in their interrupt handler
+(see :ref:`drv_sched` for more info). A successful call to napi_schedule()
+will take ownership of the NAPI instance.
+
+Later, after NAPI is scheduled, the driver's poll method will be
+called to process the events/packets. The method takes a ``budget``
+argument - drivers can process completions for any number of Tx
+packets but should only process up to ``budget`` number of
+Rx packets. Rx processing is usually much more expensive.
+
+In other words, for Rx processing the ``budget`` argument limits how many
+packets the driver can process in a single poll. Rx-specific APIs like page
+pool or XDP cannot be used at all when ``budget`` is 0.
+skb Tx processing should happen regardless of the ``budget``, but if
+the argument is 0 the driver cannot call any XDP (or page pool) APIs.
+
+.. warning::
+
+ The ``budget`` argument may be 0 if the core tries to only process
+ skb Tx completions and no Rx or XDP packets.
+
+The poll method returns the amount of work done. If the driver still
+has outstanding work to do (e.g. ``budget`` was exhausted)
+the poll method should return exactly ``budget``. In that case,
+the NAPI instance will be serviced/polled again (without the
+need to be scheduled).
+
+If event processing has been completed (all outstanding packets
+processed) the poll method should call napi_complete_done()
+before returning. napi_complete_done() releases the ownership
+of the instance.
+
+.. warning::
+
+ The case of finishing all events and using exactly ``budget``
+ must be handled carefully. There is no way to report this
+ (rare) condition to the stack, so the driver must either
+ not call napi_complete_done() and wait to be called again,
+ or return ``budget - 1``.
+
+ If the ``budget`` is 0 napi_complete_done() should never be called.
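
The return-value rules above can be modelled in plain userspace C. This is only a sketch of the decision logic, not driver code: `poll_model` and `struct poll_outcome` are hypothetical names, Tx completion handling is left out, and the model picks the "return ``budget - 1``" option for the rare case where all events finish on exactly ``budget`` packets.

```c
#include <assert.h>
#include <stdbool.h>

struct poll_outcome {
	int ret;        /* value the poll method would return to the core */
	bool complete;  /* whether napi_complete_done() would be called */
};

/* Hypothetical model: pending_rx packets are waiting when poll runs. */
static struct poll_outcome poll_model(int pending_rx, int budget)
{
	struct poll_outcome out = { .ret = 0, .complete = false };
	int work_done;

	assert(pending_rx >= 0 && budget >= 0);

	/* budget == 0: no Rx processing and no napi_complete_done(). */
	if (budget == 0)
		return out;

	if (pending_rx > budget) {
		/* Work is left over: return exactly budget, keep ownership. */
		out.ret = budget;
		return out;
	}

	/* Everything was processed: complete, and report less than budget. */
	work_done = pending_rx;
	out.complete = true;
	out.ret = work_done < budget - 1 ? work_done : budget - 1;
	return out;
}
```

For a budget of 64, a backlog of 100 packets yields a return of 64 without completion, while a backlog of exactly 64 completes and returns 63.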
+
+Call sequence
+-------------
+
+Drivers should not make assumptions about the exact sequencing
+of calls. The poll method may be called without the driver scheduling
+the instance (unless the instance is disabled). Similarly,
+it's not guaranteed that the poll method will be called, even
+if napi_schedule() succeeded (e.g. if the instance gets disabled).
+
+As mentioned in the :ref:`drv_ctrl` section - napi_disable() and subsequent
+calls to the poll method only wait for the ownership of the instance
+to be released, not for the poll method to exit. This means that
+drivers should avoid accessing any data structures after calling
+napi_complete_done().
+
+.. _drv_sched:
+
+Scheduling and IRQ masking
+--------------------------
+
+Drivers should keep the interrupts masked after scheduling
+the NAPI instance - until NAPI polling finishes any further
+interrupts are unnecessary.
+
+Drivers which have to mask the interrupts explicitly (as opposed
+to IRQ being auto-masked by the device) should use the napi_schedule_prep()
+and __napi_schedule() calls:
+
+.. code-block:: c
+
+  if (napi_schedule_prep(&v->napi)) {
+      mydrv_mask_rxtx_irq(v->idx);
+      /* schedule after masking to avoid races */
+      __napi_schedule(&v->napi);
+  }
+
+IRQ should only be unmasked after a successful call to napi_complete_done():
+
+.. code-block:: c
+
+  if (budget && napi_complete_done(&v->napi, work_done)) {
+      mydrv_unmask_rxtx_irq(v->idx);
+      return min(work_done, budget - 1);
+  }
+
+napi_schedule_irqoff() is a variant of napi_schedule() which takes advantage
+of guarantees given by being invoked in IRQ context (no need to
+mask interrupts). Note that PREEMPT_RT forces all interrupts
+to be threaded so the interrupt may need to be marked ``IRQF_NO_THREAD``
+to avoid issues on real-time kernel configurations.
+
+Instance to queue mapping
+-------------------------
+
+Modern devices have multiple NAPI instances (struct napi_struct) per
+interface. There is no strong requirement on how the instances are
+mapped to queues and interrupts. NAPI is primarily a polling/processing
+abstraction without specific user-facing semantics. That said, most networking
+devices end up using NAPI in fairly similar ways.
+
+NAPI instances most often correspond 1:1:1 to interrupts and queue pairs
+(queue pair is a set of a single Rx and single Tx queue).
+
+In less common cases a NAPI instance may be used for multiple queues
+or Rx and Tx queues can be serviced by separate NAPI instances on a single
+core. Regardless of the queue assignment, however, there is usually still
+a 1:1 mapping between NAPI instances and interrupts.
+
+It's worth noting that the ethtool API uses a "channel" terminology where
+each channel can be either ``rx``, ``tx`` or ``combined``. It's not clear
+what constitutes a channel; the recommended interpretation is to understand
+a channel as an IRQ/NAPI which services queues of a given type. For example,
+a configuration of 1 ``rx``, 1 ``tx`` and 1 ``combined`` channel is expected
+to utilize 3 interrupts, 2 Rx and 2 Tx queues.
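
The recommended interpretation of channels amounts to simple arithmetic, sketched below. The names `channel_usage` and `channels_to_usage` are hypothetical, and the sketch assumes every channel gets exactly one IRQ/NAPI, as the paragraph above suggests; real drivers may differ.

```c
#include <assert.h>

struct channel_usage {
	unsigned int irqs;
	unsigned int rx_queues;
	unsigned int tx_queues;
};

/* Hypothetical helper: resources implied by an ethtool channel config. */
static struct channel_usage channels_to_usage(unsigned int rx,
					      unsigned int tx,
					      unsigned int combined)
{
	struct channel_usage u;

	assert(rx + tx + combined > 0);    /* at least one channel */

	u.irqs = rx + tx + combined;       /* one IRQ/NAPI per channel */
	u.rx_queues = rx + combined;       /* combined channels carry an Rx queue */
	u.tx_queues = tx + combined;       /* ... and a Tx queue */
	return u;
}
```

Calling `channels_to_usage(1, 1, 1)` gives 3 interrupts, 2 Rx queues and 2 Tx queues, the figures quoted in the text.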
+
+User API
+========
+
+User interactions with NAPI depend on NAPI instance ID. The instance IDs
+are only visible to the user through the ``SO_INCOMING_NAPI_ID`` socket option.
+It's not currently possible to query IDs used by a given device.
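
Reading the option from userspace might look like the sketch below. The wrapper name `get_incoming_napi_id` is hypothetical, and the fallback value of the constant is the asm-generic one, which is an assumption for toolchains with older headers (a few architectures use different numbers).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

#ifndef SO_INCOMING_NAPI_ID
#define SO_INCOMING_NAPI_ID 56	/* asm-generic value, assumed for old headers */
#endif

/*
 * Hypothetical wrapper: store the NAPI ID recorded for the socket in
 * *napi_id and return 0, or return -errno on failure.
 */
static int get_incoming_napi_id(int fd, unsigned int *napi_id)
{
	socklen_t len = sizeof(*napi_id);

	assert(napi_id != NULL);
	if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID, napi_id, &len) < 0)
		return -errno;
	return 0;
}
```

A socket that has not yet received a packet through a NAPI instance is expected to report 0, so callers would typically query the ID only after the first receive.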
+
+Software IRQ coalescing
+-----------------------
+
+NAPI does not perform any explicit event coalescing by default.
+In most scenarios batching happens due to IRQ coalescing which is done
+by the device. There are cases where software coalescing is helpful.
+
+NAPI can be configured to arm a repoll timer instead of unmasking
+the hardware interrupts as soon as all packets are processed.
+The ``gro_flush_timeout`` sysfs configuration of the netdevice
+is reused to control the delay of the timer, while
+``napi_defer_hard_irqs`` controls the number of consecutive empty polls
+before NAPI gives up and goes back to using hardware IRQs.
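
Both knobs are per-netdevice sysfs attributes, so they can be set with an ordinary file write. The sketch below assumes the conventional `/sys/class/net/<ifname>/` location; `write_netdev_attr` is a hypothetical helper, and writing usually requires root.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical helper: write a decimal value to
 * /sys/class/net/<ifname>/<attr>. Returns 0 or a negative errno.
 */
static int write_netdev_attr(const char *ifname, const char *attr, long value)
{
	char path[256];
	FILE *f;
	int n, ret = 0;

	assert(ifname != NULL && attr != NULL);

	n = snprintf(path, sizeof(path), "/sys/class/net/%s/%s", ifname, attr);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -ENAMETOOLONG;

	f = fopen(path, "w");
	if (!f)
		return -errno;
	if (fprintf(f, "%ld\n", value) < 0)
		ret = -EIO;
	/* sysfs reports most write errors when the stream is flushed */
	if (fclose(f) != 0 && ret == 0)
		ret = -errno;
	return ret;
}
```

For instance, `write_netdev_attr("eth0", "gro_flush_timeout", 20000)` followed by `write_netdev_attr("eth0", "napi_defer_hard_irqs", 2)` would arm the repoll timer; the interface name and both values are placeholders, not recommendations.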
+
+.. _poll:
+
+Busy polling
+------------
+
+Busy polling allows a user process to check for incoming packets before
+the device interrupt fires. As is the case with any busy polling it trades
+off CPU cycles for lower latency (production uses of NAPI busy polling
+are not well known).
+
+Busy polling is enabled by either setting ``SO_BUSY_POLL`` on
+selected sockets or using the global ``net.core.busy_poll`` and
+``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling
+also exists.
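
Enabling busy polling on a single socket could be sketched as follows. `set_busy_poll_usecs` is a hypothetical wrapper, the fallback constant is the asm-generic value (an assumption for older headers), and raising the value above the current setting may require ``CAP_NET_ADMIN``.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46	/* asm-generic value, assumed for old headers */
#endif

/*
 * Hypothetical wrapper: ask the kernel to busy poll for up to `usecs`
 * microseconds on blocking receives for this socket.
 * Returns 0 or a negative errno.
 */
static int set_busy_poll_usecs(int fd, int usecs)
{
	assert(usecs >= 0);
	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs)) < 0)
		return -errno;
	return 0;
}
```

A caller would apply this right after creating the socket, for example `set_busy_poll_usecs(fd, 50)`, where 50 is an arbitrary illustrative value.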
+
+IRQ mitigation
+--------------
+
+While busy polling is supposed to be used by low latency applications,
+a similar mechanism can be used for IRQ mitigation.
+
+Very high request-per-second applications (especially routing/forwarding
+applications and especially applications using AF_XDP sockets) may not
+want to be interrupted until they finish processing a request or a batch
+of packets.
+
+Such applications can pledge to the kernel that they will perform a busy
+polling operation periodically, and the driver should keep the device IRQs
+permanently masked. This mode is enabled by using the ``SO_PREFER_BUSY_POLL``
+socket option. To avoid system misbehavior the pledge is revoked
+if ``gro_flush_timeout`` passes without any busy poll call.
+
+The NAPI budget for busy polling is lower than the default (which makes
+sense given the low latency intention of normal busy polling). This is
+not the case with IRQ mitigation, however, so the budget can be adjusted
+with the ``SO_BUSY_POLL_BUDGET`` socket option.
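
The two socket options from this section could be combined as in the sketch below. `prefer_busy_poll` is a hypothetical wrapper, the fallback constants are the asm-generic values (an assumption for older headers), the 16-bit budget limit reflects my reading of the kernel's range check, and both options may require ``CAP_NET_ADMIN``.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>

#ifndef SO_PREFER_BUSY_POLL
#define SO_PREFER_BUSY_POLL 69	/* asm-generic value, assumed for old headers */
#endif
#ifndef SO_BUSY_POLL_BUDGET
#define SO_BUSY_POLL_BUDGET 70	/* asm-generic value, assumed for old headers */
#endif

/*
 * Hypothetical wrapper: pledge to busy poll periodically and set the
 * NAPI budget used for those polls. Returns 0 or a negative errno.
 */
static int prefer_busy_poll(int fd, int budget)
{
	int one = 1;

	assert(budget > 0 && budget <= 65535);

	if (setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
		       &one, sizeof(one)) < 0)
		return -errno;
	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET,
		       &budget, sizeof(budget)) < 0)
		return -errno;
	return 0;
}
```

The application still has to keep polling more often than ``gro_flush_timeout``, otherwise the pledge is revoked as described above.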
+
+.. _threaded:
+
+Threaded NAPI
+-------------
+
+Threaded NAPI is an operating mode that uses dedicated kernel
+threads rather than software IRQ context for NAPI processing.
+The configuration is per netdevice and will affect all
+NAPI instances of that device. Each NAPI instance will spawn a separate
+thread (called ``napi/${ifc-name}-${napi-id}``).
+
+It is recommended to pin each kernel thread to a single CPU, the same
+CPU as the CPU which services the interrupt. Note that the mapping
+between IRQs and NAPI instances may not be trivial (and is driver
+dependent). The NAPI instance IDs will be assigned in the opposite
+order to the process IDs of the kernel threads.
+
+Threaded NAPI is controlled by writing 0/1 to the ``threaded`` file in
+netdev's sysfs directory.
+
+.. rubric:: Footnotes
+
+.. [#] NAPI was originally referred to as New API in Linux 2.4.
diff --git a/Documentation/networking/net_cachelines/index.rst b/Documentation/networking/net_cachelines/index.rst
new file mode 100644
index 000000000000..2669e4cda086
--- /dev/null
+++ b/Documentation/networking/net_cachelines/index.rst
@@ -0,0 +1,16 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. Copyright (C) 2023 Google LLC
+
+===================================
+Common Networking Struct Cachelines
+===================================
+
+.. toctree::
+ :maxdepth: 1
+
+ inet_connection_sock
+ inet_sock
+ net_device
+ netns_ipv4_sysctl
+ snmp
+ tcp_sock
diff --git a/Documentation/networking/net_cachelines/inet_connection_sock.rst b/Documentation/networking/net_cachelines/inet_connection_sock.rst
new file mode 100644
index 000000000000..7a911dc95652
--- /dev/null
+++ b/Documentation/networking/net_cachelines/inet_connection_sock.rst
@@ -0,0 +1,50 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. Copyright (C) 2023 Google LLC
+
+=====================================================
+inet_connection_sock struct fast path usage breakdown
+=====================================================
+
+================================== ====================== ================== ================== ==============================
+Type                               Name                   fastpath_tx_access fastpath_rx_access comment
+================================== ====================== ================== ================== ==============================
+..struct                           ..inet_connection_sock
+struct_inet_sock                   icsk_inet              read_mostly        read_mostly        tcp_init_buffer_space,tcp_init_transfer,tcp_finish_connect,tcp_connect,tcp_send_rcvq,tcp_send_syn_data
+struct_request_sock_queue          icsk_accept_queue      \-                 \-
+struct_inet_bind_bucket            icsk_bind_hash         read_mostly        \-                 tcp_set_state
+struct_inet_bind2_bucket           icsk_bind2_hash        read_mostly        \-                 tcp_set_state,inet_put_port
+unsigned_long                      icsk_timeout           read_mostly        \-                 inet_csk_reset_xmit_timer,tcp_connect
+struct_timer_list                  icsk_retransmit_timer  read_mostly        \-                 inet_csk_reset_xmit_timer,tcp_connect
+struct_timer_list                  icsk_delack_timer      read_mostly        \-                 inet_csk_reset_xmit_timer,tcp_connect
+u32                                icsk_rto               read_write         \-                 tcp_cwnd_validate,tcp_schedule_loss_probe,tcp_connect_init,tcp_connect,tcp_write_xmit,tcp_push_one
+u32                                icsk_rto_min           \-                 \-
+u32                                icsk_delack_max        \-                 \-
+u32                                icsk_pmtu_cookie       read_write         \-                 tcp_sync_mss,tcp_current_mss,tcp_send_syn_data,tcp_connect_init,tcp_connect
+struct_tcp_congestion_ops          icsk_ca_ops            read_write         \-                 tcp_cwnd_validate,tcp_tso_segs,tcp_ca_dst_init,tcp_connect_init,tcp_connect,tcp_write_xmit
+struct_inet_connection_sock_af_ops icsk_af_ops            read_mostly        \-                 tcp_finish_connect,tcp_send_syn_data,tcp_mtup_init,tcp_mtu_check_reprobe,tcp_mtu_probe,tcp_connect_init,tcp_connect,__tcp_transmit_skb
+struct_tcp_ulp_ops*                icsk_ulp_ops           \-                 \-
+void*                              icsk_ulp_data          \-                 \-
+u8:5                               icsk_ca_state          read_write         \-                 tcp_cwnd_application_limited,tcp_set_ca_state,tcp_enter_cwr,tcp_tso_should_defer,tcp_mtu_probe,tcp_schedule_loss_probe,tcp_write_xmit,__tcp_transmit_skb
+u8:1                               icsk_ca_initialized    read_write         \-                 tcp_init_transfer,tcp_init_congestion_control,tcp_finish_connect,tcp_connect
+u8:1                               icsk_ca_setsockopt     \-                 \-
+u8:1                               icsk_ca_dst_locked     write_mostly       \-                 tcp_ca_dst_init,tcp_connect_init,tcp_connect
+u8                                 icsk_retransmits       write_mostly       \-                 tcp_connect_init,tcp_connect
+u8                                 icsk_pending           read_write         \-                 inet_csk_reset_xmit_timer,tcp_connect,tcp_check_probe_timer,__tcp_push_pending_frames,tcp_rearm_rto,tcp_event_new_data_sent
+u8                                 icsk_backoff           write_mostly       \-                 tcp_write_queue_purge,tcp_connect_init
+u8                                 icsk_syn_retries       \-                 \-
+u8                                 icsk_probes_out        \-                 \-
+u16                                icsk_ext_hdr_len       read_mostly        \-                 __tcp_mtu_to_mss,tcp_mtu_to_rss,tcp_mtu_probe,tcp_write_xmit,tcp_mtu_to_mss
+struct_icsk_ack_u8                 pending                read_write         read_write         inet_csk_ack_scheduled,__tcp_cleanup_rbuf,tcp_cleanup_rbuf,inet_csk_clear_xmit_timer,tcp_event_ack_sent,inet_csk_reset_xmit_timer
+struct_icsk_ack_u8                 quick                  read_write         write_mostly       tcp_dec_quickack_mode,tcp_event_ack_sent,__tcp_transmit_skb,__tcp_select_window,__tcp_cleanup_rbuf
+struct_icsk_ack_u8                 pingpong               \-                 \-
+struct_icsk_ack_u8                 retry                  write_mostly       read_write         inet_csk_clear_xmit_timer,tcp_rearm_rto,tcp_event_new_data_sent,tcp_write_xmit,__tcp_send_ack,tcp_send_ack
+struct_icsk_ack_u8                 ato                    read_mostly        write_mostly       tcp_dec_quickack_mode,tcp_event_ack_sent,__tcp_transmit_skb,__tcp_send_ack,tcp_send_ack
+struct_icsk_ack_unsigned_long      timeout                read_write         read_write         inet_csk_reset_xmit_timer,tcp_connect
+struct_icsk_ack_u32                lrcvtime               read_write         \-                 tcp_finish_connect,tcp_connect,tcp_event_data_sent,__tcp_transmit_skb
+struct_icsk_ack_u16                rcv_mss                write_mostly       read_mostly        __tcp_select_window,__tcp_cleanup_rbuf,tcp_initialize_rcv_mss,tcp_connect_init
+struct_icsk_mtup_int               search_high            read_write         \-                 tcp_mtup_init,tcp_sync_mss,tcp_connect_init,tcp_mtu_check_reprobe,tcp_write_xmit
+struct_icsk_mtup_int               search_low             read_write         \-                 tcp_mtu_probe,tcp_mtu_check_reprobe,tcp_write_xmit,tcp_sync_mss,tcp_connect_init,tcp_mtup_init
+struct_icsk_mtup_u32:31            probe_size             read_write         \-                 tcp_mtup_init,tcp_connect_init,__tcp_transmit_skb
+struct_icsk_mtup_u32:1             enabled                read_write         \-                 tcp_mtup_init,tcp_sync_mss,tcp_connect_init,tcp_mtu_probe,tcp_write_xmit
+struct_icsk_mtup_u32               probe_timestamp        read_write         \-                 tcp_mtup_init,tcp_connect_init,tcp_mtu_check_reprobe,tcp_mtu_probe
+u32                                icsk_probes_tstamp     \-                 \-
+u32                                icsk_user_timeout      \-                 \-
+u64[104/sizeof(u64)]               icsk_ca_priv           \-                 \-
+================================== ====================== ================== ================== ==============================
diff --git a/Documentation/networking/net_cachelines/inet_sock.rst b/Documentation/networking/net_cachelines/inet_sock.rst
new file mode 100644
index 000000000000..595d7ef5fc8b
--- /dev/null
+++ b/Documentation/networking/net_cachelines/inet_sock.rst
@@ -0,0 +1,44 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. Copyright (C) 2023 Google LLC
+
+==========================================
+inet_sock struct fast path usage breakdown
+==========================================
+
+====================== ==================== ================== ================== ==============================
+Type                   Name                 fastpath_tx_access fastpath_rx_access comment
+====================== ==================== ================== ================== ==============================
+..struct               ..inet_sock
+struct_sock            sk                   read_mostly        read_mostly        tcp_init_buffer_space,tcp_init_transfer,tcp_finish_connect,tcp_connect,tcp_send_rcvq,tcp_send_syn_data
+struct_ipv6_pinfo*     pinet6               \-                 \-
+be16                   inet_sport           read_mostly        \-                 __tcp_transmit_skb
+be32                   inet_daddr           read_mostly        \-                 ip_select_ident_segs
+be32                   inet_rcv_saddr       \-                 \-
+be16                   inet_dport           read_mostly        \-                 __tcp_transmit_skb
+u16                    inet_num             \-                 \-
+be32                   inet_saddr           \-                 \-
+s16                    uc_ttl               read_mostly        \-                 __ip_queue_xmit/ip_select_ttl
+u16                    cmsg_flags           \-                 \-
+struct_ip_options_rcu* inet_opt             read_mostly        \-                 __ip_queue_xmit
+u16                    inet_id              read_mostly        \-                 ip_select_ident_segs
+u8                     tos                  read_mostly        \-                 ip_queue_xmit
+u8                     min_ttl              \-                 \-
+u8                     mc_ttl               \-                 \-
+u8                     pmtudisc             \-                 \-
+u8:1                   recverr              \-                 \-
+u8:1                   is_icsk              \-                 \-
+u8:1                   freebind             \-                 \-
+u8:1                   hdrincl              \-                 \-
+u8:1                   mc_loop              \-                 \-
+u8:1                   transparent          \-                 \-
+u8:1                   mc_all               \-                 \-
+u8:1                   nodefrag             \-                 \-
+u8:1                   bind_address_no_port \-                 \-
+u8:1                   recverr_rfc4884      \-                 \-
+u8:1                   defer_connect        read_mostly        \-                 tcp_sendmsg_fastopen
+u8                     rcv_tos              \-                 \-
+u8                     convert_csum         \-                 \-
+int                    uc_index             \-                 \-
+int                    mc_index             \-                 \-
+be32                   mc_addr              \-                 \-
+struct_ip_mc_socklist* mc_list              \-                 \-
+struct_inet_cork_full  cork                 read_mostly        \-                 __tcp_transmit_skb
+struct                 local_port_range     \-                 \-
+====================== ==================== ================== ================== ==============================
diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
new file mode 100644
index 000000000000..70c4fb9d4e5c
--- /dev/null
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -0,0 +1,178 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. Copyright (C) 2023 Google LLC
+
+===========================================
+net_device struct fast path usage breakdown
+===========================================
+
+Type Name fastpath_tx_access fastpath_rx_access Comments
+..struct ..net_device
+char name[16] - -
+struct_netdev_name_node* name_node
+struct_dev_ifalias* ifalias
+unsigned_long mem_end
+unsigned_long mem_start
+unsigned_long base_addr
+unsigned_long state read_mostly read_mostly netif_running(dev)
+struct_list_head dev_list
+struct_list_head napi_list
+struct_list_head unreg_list
+struct_list_head close_list
+struct_list_head ptype_all read_mostly - dev_nit_active(tx)
+struct_list_head ptype_specific read_mostly deliver_ptype_list_skb/__netif_receive_skb_core(rx)
+struct adj_list
+unsigned_int flags read_mostly read_mostly __dev_queue_xmit,__dev_xmit_skb,ip6_output,__ip6_finish_output(tx);ip6_rcv_core(rx)
+xdp_features_t xdp_features
+unsigned_long_long priv_flags read_mostly - __dev_queue_xmit(tx)
+struct_net_device_ops* netdev_ops read_mostly - netdev_core_pick_tx,netdev_start_xmit(tx)
+struct_xdp_metadata_ops* xdp_metadata_ops
+int ifindex - read_mostly ip6_rcv_core
+unsigned_short gflags
+unsigned_short hard_header_len read_mostly read_mostly ip6_xmit(tx);gro_list_prepare(rx)
+unsigned_int mtu read_mostly - ip_finish_output2
+unsigned_short needed_headroom read_mostly - LL_RESERVED_SPACE/ip_finish_output2
+unsigned_short needed_tailroom
+netdev_features_t features read_mostly read_mostly HARD_TX_LOCK,netif_skb_features,sk_setup_caps(tx);netif_elide_gro(rx)
+netdev_features_t hw_features
+netdev_features_t wanted_features
+netdev_features_t vlan_features
+netdev_features_t hw_enc_features - - netif_skb_features
+netdev_features_t mpls_features
+netdev_features_t gso_partial_features read_mostly gso_features_check
+unsigned_int min_mtu
+unsigned_int max_mtu
+unsigned_short type
+unsigned_char min_header_len
+unsigned_char name_assign_type
+int group
+struct_net_device_stats stats
+struct_net_device_core_stats* core_stats
+atomic_t carrier_up_count
+atomic_t carrier_down_count
+struct_iw_handler_def* wireless_handlers
+struct_iw_public_data* wireless_data
+struct_ethtool_ops* ethtool_ops
+struct_l3mdev_ops* l3mdev_ops
+struct_ndisc_ops* ndisc_ops
+struct_xfrmdev_ops* xfrmdev_ops
+struct_tlsdev_ops* tlsdev_ops
+struct_header_ops* header_ops read_mostly - ip_finish_output2,ip6_finish_output2(tx)
+unsigned_char operstate
+unsigned_char link_mode
+unsigned_char if_port
+unsigned_char dma
+unsigned_char perm_addr[32]
+unsigned_char addr_assign_type
+unsigned_char addr_len
+unsigned_char upper_level
+unsigned_char lower_level
+unsigned_short neigh_priv_len
+unsigned_short padded
+unsigned_short dev_id
+unsigned_short dev_port
+spinlock_t addr_list_lock
+int irq
+struct_netdev_hw_addr_list uc
+struct_netdev_hw_addr_list mc
+struct_netdev_hw_addr_list dev_addrs
+struct_kset* queues_kset
+struct_list_head unlink_list
+unsigned_int promiscuity
+unsigned_int allmulti
+bool uc_promisc
+unsigned_char nested_level
+struct_in_device* ip_ptr read_mostly read_mostly __in_dev_get
+struct_inet6_dev* ip6_ptr read_mostly read_mostly __in6_dev_get
+struct_vlan_info* vlan_info
+struct_dsa_port* dsa_ptr
+struct_tipc_bearer* tipc_ptr
+void* atalk_ptr
+void* ax25_ptr
+struct_wireless_dev* ieee80211_ptr
+struct_wpan_dev* ieee802154_ptr
+struct_mpls_dev* mpls_ptr
+struct_mctp_dev* mctp_ptr
+unsigned_char* dev_addr
+struct_netdev_queue* _rx read_mostly - netdev_get_rx_queue(rx)
+unsigned_int num_rx_queues
+unsigned_int real_num_rx_queues - read_mostly get_rps_cpu
+struct_bpf_prog* xdp_prog - read_mostly netif_elide_gro()
+unsigned_long gro_flush_timeout - read_mostly napi_complete_done
+int napi_defer_hard_irqs - read_mostly napi_complete_done
+unsigned_int gro_max_size - read_mostly skb_gro_receive
+unsigned_int gro_ipv4_max_size - read_mostly skb_gro_receive
+rx_handler_func_t* rx_handler read_mostly - __netif_receive_skb_core
+void* rx_handler_data read_mostly -
+struct_netdev_queue* ingress_queue read_mostly -
+struct_bpf_mprog_entry tcx_ingress - read_mostly sch_handle_ingress
+struct_nf_hook_entries* nf_hooks_ingress
+unsigned_char broadcast[32]
+struct_cpu_rmap* rx_cpu_rmap
+struct_hlist_node index_hlist
+struct_netdev_queue* _tx read_mostly - netdev_get_tx_queue(tx)
+unsigned_int num_tx_queues - -
+unsigned_int real_num_tx_queues read_mostly - skb_tx_hash,netdev_core_pick_tx(tx)
+unsigned_int tx_queue_len
+spinlock_t tx_global_lock
+struct_xdp_dev_bulk_queue__percpu* xdp_bulkq
+struct_xps_dev_maps* xps_maps[2] read_mostly - __netif_set_xps_queue
+struct_bpf_mprog_entry tcx_egress read_mostly - sch_handle_egress
+struct_nf_hook_entries* nf_hooks_egress read_mostly -
+struct_hlist_head qdisc_hash[16]
+struct_timer_list watchdog_timer
+int watchdog_timeo
+u32 proto_down_reason
+struct_list_head todo_list
+int__percpu* pcpu_refcnt
+refcount_t dev_refcnt
+struct_ref_tracker_dir refcnt_tracker
+struct_list_head link_watch_list
+enum:8 reg_state
+bool dismantle
+enum:16 rtnl_link_state
+bool needs_free_netdev
+void*priv_destructor struct_net_device
+struct_netpoll_info* npinfo - read_mostly napi_poll/napi_poll_lock
+possible_net_t nd_net - read_mostly (dev_net)napi_busy_loop,tcp_v(4/6)_rcv,ip(v6)_rcv,ip(6)_input,ip(6)_input_finish
+void* ml_priv
+enum_netdev_ml_priv_type ml_priv_type
+struct_pcpu_lstats__percpu* lstats read_mostly dev_lstats_add()
+struct_pcpu_sw_netstats__percpu* tstats read_mostly dev_sw_netstats_tx_add()
+struct_pcpu_dstats__percpu* dstats
+struct_garp_port* garp_port
+struct_mrp_port* mrp_port
+struct_dm_hw_stat_delta* dm_private
+struct_device dev - -
+struct_attribute_group* sysfs_groups[4]
+struct_attribute_group* sysfs_rx_queue_group
+struct_rtnl_link_ops* rtnl_link_ops
+unsigned_int gso_max_size read_mostly - sk_dst_gso_max_size
+unsigned_int tso_max_size
+u16 gso_max_segs read_mostly - gso_max_segs
+u16 tso_max_segs
+unsigned_int gso_ipv4_max_size read_mostly - sk_dst_gso_max_size
+struct_dcbnl_rtnl_ops* dcbnl_ops
+s16 num_tc read_mostly - skb_tx_hash
+struct_netdev_tc_txq tc_to_txq[16] read_mostly - skb_tx_hash
+u8 prio_tc_map[16]
+unsigned_int fcoe_ddp_xid
+struct_netprio_map* priomap
+struct_phy_device* phydev
+struct_sfp_bus* sfp_bus
+struct_lock_class_key* qdisc_tx_busylock
+bool proto_down
+unsigned:1 wol_enabled
+unsigned:1 threaded - - napi_poll(napi_enable,dev_set_threaded)
+struct_list_head net_notifier_list
+struct_macsec_ops* macsec_ops
+struct_udp_tunnel_nic_info* udp_tunnel_nic_info
+struct_udp_tunnel_nic* udp_tunnel_nic
+unsigned_int xdp_zc_max_segs
+struct_bpf_xdp_entity xdp_state[3]
+u8 dev_addr_shadow[32]
+netdevice_tracker linkwatch_dev_tracker
+netdevice_tracker watchdog_dev_tracker
+netdevice_tracker dev_registered_tracker
+struct_rtnl_hw_stats64* offload_xstats_l3
+struct_devlink_port* devlink_port
+struct_dpll_pin* dpll_pin
diff --git a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
new file mode 100644
index 000000000000..9b87089a84c6
--- /dev/null
+++ b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
@@ -0,0 +1,158 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. Copyright (C) 2023 Google LLC
+
+===========================================
+netns_ipv4 struct fast path usage breakdown
+===========================================
+
+Type Name fastpath_tx_access fastpath_rx_access comment
+..struct ..netns_ipv4
+struct_inet_timewait_death_row tcp_death_row
+struct_udp_table* udp_table
+struct_ctl_table_header* forw_hdr
+struct_ctl_table_header* frags_hdr
+struct_ctl_table_header* ipv4_hdr
+struct_ctl_table_header* route_hdr
+struct_ctl_table_header* xfrm4_hdr
+struct_ipv4_devconf* devconf_all
+struct_ipv4_devconf* devconf_dflt
+struct_ip_ra_chain ra_chain
+struct_mutex ra_mutex
+struct_fib_rules_ops* rules_ops
+struct_fib_table fib_main
+struct_fib_table fib_default
+unsigned_int fib_rules_require_fldissect
+bool fib_has_custom_rules
+bool fib_has_custom_local_routes
+bool fib_offload_disabled
+atomic_t fib_num_tclassid_users
+struct_hlist_head* fib_table_hash
+struct_sock* fibnl
+struct_sock* mc_autojoin_sk
+struct_inet_peer_base* peers
+struct_fqdir* fqdir
+u8 sysctl_icmp_echo_ignore_all
+u8 sysctl_icmp_echo_enable_probe
+u8 sysctl_icmp_echo_ignore_broadcasts
+u8 sysctl_icmp_ignore_bogus_error_responses
+u8 sysctl_icmp_errors_use_inbound_ifaddr
+int sysctl_icmp_ratelimit
+int sysctl_icmp_ratemask
+u32 ip_rt_min_pmtu - -
+int ip_rt_mtu_expires - -
+int ip_rt_min_advmss - -
+struct_local_ports ip_local_ports - -
+u8 sysctl_tcp_ecn - -
+u8 sysctl_tcp_ecn_fallback - -
+u8 sysctl_ip_default_ttl - - ip4_dst_hoplimit/ip_select_ttl
+u8 sysctl_ip_no_pmtu_disc - -
+u8 sysctl_ip_fwd_use_pmtu read_mostly - ip_dst_mtu_maybe_forward/ip_skb_dst_mtu
+u8 sysctl_ip_fwd_update_priority - - ip_forward
+u8 sysctl_ip_nonlocal_bind - -
+u8 sysctl_ip_autobind_reuse - -
+u8 sysctl_ip_dynaddr - -
+u8 sysctl_ip_early_demux - read_mostly ip(6)_rcv_finish_core
+u8 sysctl_raw_l3mdev_accept - -
+u8 sysctl_tcp_early_demux - read_mostly ip(6)_rcv_finish_core
+u8 sysctl_udp_early_demux
+u8 sysctl_nexthop_compat_mode - -
+u8 sysctl_fwmark_reflect - -
+u8 sysctl_tcp_fwmark_accept - -
+u8 sysctl_tcp_l3mdev_accept - -
+u8 sysctl_tcp_mtu_probing - -
+int sysctl_tcp_mtu_probe_floor - -
+int sysctl_tcp_base_mss - -
+int sysctl_tcp_min_snd_mss read_mostly - __tcp_mtu_to_mss(tcp_write_xmit)
+int sysctl_tcp_probe_threshold - - tcp_mtu_probe(tcp_write_xmit)
+u32 sysctl_tcp_probe_interval - - tcp_mtu_check_reprobe(tcp_write_xmit)
+int sysctl_tcp_keepalive_time - -
+int sysctl_tcp_keepalive_intvl - -
+u8 sysctl_tcp_keepalive_probes - -
+u8 sysctl_tcp_syn_retries - -
+u8 sysctl_tcp_synack_retries - -
+u8 sysctl_tcp_syncookies - - generated_on_syn
+u8 sysctl_tcp_migrate_req - - reuseport
+u8 sysctl_tcp_comp_sack_nr - - __tcp_ack_snd_check
+int sysctl_tcp_reordering - read_mostly tcp_may_raise_cwnd/tcp_cong_control
+u8 sysctl_tcp_retries1 - -
+u8 sysctl_tcp_retries2 - -
+u8 sysctl_tcp_orphan_retries - -
+u8 sysctl_tcp_tw_reuse - - timewait_sock_ops
+int sysctl_tcp_fin_timeout - - TCP_LAST_ACK/tcp_rcv_state_process
+unsigned_int sysctl_tcp_notsent_lowat read_mostly - tcp_notsent_lowat/tcp_stream_memory_free
+u8 sysctl_tcp_sack - - tcp_syn_options
+u8 sysctl_tcp_window_scaling - - tcp_syn_options,tcp_parse_options
+u8 sysctl_tcp_timestamps
+u8 sysctl_tcp_early_retrans read_mostly - tcp_schedule_loss_probe(tcp_write_xmit)
+u8 sysctl_tcp_recovery - - tcp_fastretrans_alert
+u8 sysctl_tcp_thin_linear_timeouts - - tcp_retrans_timer(on_thin_streams)
+u8 sysctl_tcp_slow_start_after_idle - - unlikely(tcp_cwnd_validate-network-not-starved)
+u8 sysctl_tcp_retrans_collapse - -
+u8 sysctl_tcp_stdurg - - unlikely(tcp_check_urg)
+u8 sysctl_tcp_rfc1337 - -
+u8 sysctl_tcp_abort_on_overflow - -
+u8 sysctl_tcp_fack - -
+int sysctl_tcp_max_reordering - - tcp_check_sack_reordering
+int sysctl_tcp_adv_win_scale - - tcp_init_buffer_space
+u8 sysctl_tcp_dsack - - partial_packet_or_retrans_in_tcp_data_queue
+u8 sysctl_tcp_app_win - - tcp_win_from_space
+u8 sysctl_tcp_frto - - tcp_enter_loss
+u8 sysctl_tcp_nometrics_save - - TCP_LAST_ACK/tcp_update_metrics
+u8 sysctl_tcp_no_ssthresh_metrics_save - - TCP_LAST_ACK/tcp_(update/init)_metrics
+u8 sysctl_tcp_moderate_rcvbuf read_mostly read_mostly tcp_tso_should_defer(tx);tcp_rcv_space_adjust(rx)
+u8 sysctl_tcp_tso_win_divisor read_mostly - tcp_tso_should_defer(tcp_write_xmit)
+u8 sysctl_tcp_workaround_signed_windows - - tcp_select_window
+int sysctl_tcp_limit_output_bytes read_mostly - tcp_small_queue_check(tcp_write_xmit)
+int sysctl_tcp_challenge_ack_limit - -
+int sysctl_tcp_min_rtt_wlen read_mostly - tcp_ack_update_rtt
+u8 sysctl_tcp_min_tso_segs - - unlikely(icsk_ca_ops-written)
+u8 sysctl_tcp_tso_rtt_log read_mostly - tcp_tso_autosize
+u8 sysctl_tcp_autocorking read_mostly - tcp_push/tcp_should_autocork
+u8 sysctl_tcp_reflect_tos - - tcp_v(4/6)_send_synack
+int sysctl_tcp_invalid_ratelimit - -
+int sysctl_tcp_pacing_ss_ratio - - default_cong_cont(tcp_update_pacing_rate)
+int sysctl_tcp_pacing_ca_ratio - - default_cong_cont(tcp_update_pacing_rate)
+int sysctl_tcp_wmem[3] read_mostly - tcp_wmem_schedule(sendmsg/sendpage)
+int sysctl_tcp_rmem[3] - read_mostly __tcp_grow_window(tx),tcp_rcv_space_adjust(rx)
+unsigned_int sysctl_tcp_child_ehash_entries
+unsigned_long sysctl_tcp_comp_sack_delay_ns - - __tcp_ack_snd_check
+unsigned_long sysctl_tcp_comp_sack_slack_ns - - __tcp_ack_snd_check
+int sysctl_max_syn_backlog - -
+int sysctl_tcp_fastopen - -
+struct_tcp_congestion_ops tcp_congestion_control - - init_cc
+struct_tcp_fastopen_context tcp_fastopen_ctx - -
+unsigned_int sysctl_tcp_fastopen_blackhole_timeout - -
+atomic_t tfo_active_disable_times - -
+unsigned_long tfo_active_disable_stamp - -
+u32 tcp_challenge_timestamp - -
+u32 tcp_challenge_count - -
+u8 sysctl_tcp_plb_enabled - -
+u8 sysctl_tcp_plb_idle_rehash_rounds - -
+u8 sysctl_tcp_plb_rehash_rounds - -
+u8 sysctl_tcp_plb_suspend_rto_sec - -
+int sysctl_tcp_plb_cong_thresh - -
+int sysctl_udp_wmem_min
+int sysctl_udp_rmem_min
+u8 sysctl_fib_notify_on_flag_change
+u8 sysctl_udp_l3mdev_accept
+u8 sysctl_igmp_llm_reports
+int sysctl_igmp_max_memberships
+int sysctl_igmp_max_msf
+int sysctl_igmp_qrv
+struct_ping_group_range ping_group_range
+atomic_t dev_addr_genid
+unsigned_int sysctl_udp_child_hash_entries
+unsigned_long* sysctl_local_reserved_ports
+int sysctl_ip_prot_sock
+struct_mr_table* mrt
+struct_list_head mr_tables
+struct_fib_rules_ops* mr_rules_ops
+u32 sysctl_fib_multipath_hash_fields
+u8 sysctl_fib_multipath_use_neigh
+u8 sysctl_fib_multipath_hash_policy
+struct_fib_notifier_ops* notifier_ops
+unsigned_int fib_seq
+struct_fib_notifier_ops* ipmr_notifier_ops
+unsigned_int ipmr_seq
+atomic_t rt_genid
+siphash_key_t ip_id_key
diff --git a/Documentation/networking/net_cachelines/snmp.rst b/Documentation/networking/net_cachelines/snmp.rst
new file mode 100644
index 000000000000..6a071538566c
--- /dev/null
+++ b/Documentation/networking/net_cachelines/snmp.rst
@@ -0,0 +1,135 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. Copyright (C) 2023 Google LLC
+
+===========================================
+netns_ipv4 enum fast path usage breakdown
+===========================================
+
+Type Name fastpath_tx_access fastpath_rx_access comment
+..enum
+unsigned_long LINUX_MIB_TCPKEEPALIVE write_mostly - tcp_keepalive_timer
+unsigned_long LINUX_MIB_DELAYEDACKS write_mostly - tcp_delack_timer_handler,tcp_delack_timer
+unsigned_long LINUX_MIB_DELAYEDACKLOCKED write_mostly - tcp_delack_timer_handler,tcp_delack_timer
+unsigned_long LINUX_MIB_TCPAUTOCORKING write_mostly - tcp_push,tcp_sendmsg_locked
+unsigned_long LINUX_MIB_TCPFROMZEROWINDOWADV write_mostly - tcp_select_window,tcp_transmit_skb
+unsigned_long LINUX_MIB_TCPTOZEROWINDOWADV write_mostly - tcp_select_window,tcp_transmit_skb
+unsigned_long LINUX_MIB_TCPWANTZEROWINDOWADV write_mostly - tcp_select_window,tcp_transmit_skb
+unsigned_long LINUX_MIB_TCPORIGDATASENT write_mostly - tcp_write_xmit
+unsigned_long LINUX_MIB_TCPHPHITS - write_mostly tcp_rcv_established,tcp_v4_do_rcv,tcp_v6_do_rcv
+unsigned_long LINUX_MIB_TCPRCVCOALESCE - write_mostly tcp_try_coalesce,tcp_queue_rcv,tcp_rcv_established
+unsigned_long LINUX_MIB_TCPPUREACKS - write_mostly tcp_ack,tcp_rcv_established
+unsigned_long LINUX_MIB_TCPHPACKS - write_mostly tcp_ack,tcp_rcv_established
+unsigned_long LINUX_MIB_TCPDELIVERED - write_mostly tcp_newly_delivered,tcp_ack,tcp_rcv_established
+unsigned_long LINUX_MIB_SYNCOOKIESSENT
+unsigned_long LINUX_MIB_SYNCOOKIESRECV
+unsigned_long LINUX_MIB_SYNCOOKIESFAILED
+unsigned_long LINUX_MIB_EMBRYONICRSTS
+unsigned_long LINUX_MIB_PRUNECALLED
+unsigned_long LINUX_MIB_RCVPRUNED
+unsigned_long LINUX_MIB_OFOPRUNED
+unsigned_long LINUX_MIB_OUTOFWINDOWICMPS
+unsigned_long LINUX_MIB_LOCKDROPPEDICMPS
+unsigned_long LINUX_MIB_ARPFILTER
+unsigned_long LINUX_MIB_TIMEWAITED
+unsigned_long LINUX_MIB_TIMEWAITRECYCLED
+unsigned_long LINUX_MIB_TIMEWAITKILLED
+unsigned_long LINUX_MIB_PAWSACTIVEREJECTED
+unsigned_long LINUX_MIB_PAWSESTABREJECTED
+unsigned_long LINUX_MIB_DELAYEDACKLOST
+unsigned_long LINUX_MIB_LISTENOVERFLOWS
+unsigned_long LINUX_MIB_LISTENDROPS
+unsigned_long LINUX_MIB_TCPRENORECOVERY
+unsigned_long LINUX_MIB_TCPSACKRECOVERY
+unsigned_long LINUX_MIB_TCPSACKRENEGING
+unsigned_long LINUX_MIB_TCPSACKREORDER
+unsigned_long LINUX_MIB_TCPRENOREORDER
+unsigned_long LINUX_MIB_TCPTSREORDER
+unsigned_long LINUX_MIB_TCPFULLUNDO
+unsigned_long LINUX_MIB_TCPPARTIALUNDO
+unsigned_long LINUX_MIB_TCPDSACKUNDO
+unsigned_long LINUX_MIB_TCPLOSSUNDO
+unsigned_long LINUX_MIB_TCPLOSTRETRANSMIT
+unsigned_long LINUX_MIB_TCPRENOFAILURES
+unsigned_long LINUX_MIB_TCPSACKFAILURES
+unsigned_long LINUX_MIB_TCPLOSSFAILURES
+unsigned_long LINUX_MIB_TCPFASTRETRANS
+unsigned_long LINUX_MIB_TCPSLOWSTARTRETRANS
+unsigned_long LINUX_MIB_TCPTIMEOUTS
+unsigned_long LINUX_MIB_TCPLOSSPROBES
+unsigned_long LINUX_MIB_TCPLOSSPROBERECOVERY
+unsigned_long LINUX_MIB_TCPRENORECOVERYFAIL
+unsigned_long LINUX_MIB_TCPSACKRECOVERYFAIL
+unsigned_long LINUX_MIB_TCPRCVCOLLAPSED
+unsigned_long LINUX_MIB_TCPDSACKOLDSENT
+unsigned_long LINUX_MIB_TCPDSACKOFOSENT
+unsigned_long LINUX_MIB_TCPDSACKRECV
+unsigned_long LINUX_MIB_TCPDSACKOFORECV
+unsigned_long LINUX_MIB_TCPABORTONDATA
+unsigned_long LINUX_MIB_TCPABORTONCLOSE
+unsigned_long LINUX_MIB_TCPABORTONMEMORY
+unsigned_long LINUX_MIB_TCPABORTONTIMEOUT
+unsigned_long LINUX_MIB_TCPABORTONLINGER
+unsigned_long LINUX_MIB_TCPABORTFAILED
+unsigned_long LINUX_MIB_TCPMEMORYPRESSURES
+unsigned_long LINUX_MIB_TCPMEMORYPRESSURESCHRONO
+unsigned_long LINUX_MIB_TCPSACKDISCARD
+unsigned_long LINUX_MIB_TCPDSACKIGNOREDOLD
+unsigned_long LINUX_MIB_TCPDSACKIGNOREDNOUNDO
+unsigned_long LINUX_MIB_TCPSPURIOUSRTOS
+unsigned_long LINUX_MIB_TCPMD5NOTFOUND
+unsigned_long LINUX_MIB_TCPMD5UNEXPECTED
+unsigned_long LINUX_MIB_TCPMD5FAILURE
+unsigned_long LINUX_MIB_SACKSHIFTED
+unsigned_long LINUX_MIB_SACKMERGED
+unsigned_long LINUX_MIB_SACKSHIFTFALLBACK
+unsigned_long LINUX_MIB_TCPBACKLOGDROP
+unsigned_long LINUX_MIB_PFMEMALLOCDROP
+unsigned_long LINUX_MIB_TCPMINTTLDROP
+unsigned_long LINUX_MIB_TCPDEFERACCEPTDROP
+unsigned_long LINUX_MIB_IPRPFILTER
+unsigned_long LINUX_MIB_TCPTIMEWAITOVERFLOW
+unsigned_long LINUX_MIB_TCPREQQFULLDOCOOKIES
+unsigned_long LINUX_MIB_TCPREQQFULLDROP
+unsigned_long LINUX_MIB_TCPRETRANSFAIL
+unsigned_long LINUX_MIB_TCPBACKLOGCOALESCE
+unsigned_long LINUX_MIB_TCPOFOQUEUE
+unsigned_long LINUX_MIB_TCPOFODROP
+unsigned_long LINUX_MIB_TCPOFOMERGE
+unsigned_long LINUX_MIB_TCPCHALLENGEACK
+unsigned_long LINUX_MIB_TCPSYNCHALLENGE
+unsigned_long LINUX_MIB_TCPFASTOPENACTIVE
+unsigned_long LINUX_MIB_TCPFASTOPENACTIVEFAIL
+unsigned_long LINUX_MIB_TCPFASTOPENPASSIVE
+unsigned_long LINUX_MIB_TCPFASTOPENPASSIVEFAIL
+unsigned_long LINUX_MIB_TCPFASTOPENLISTENOVERFLOW
+unsigned_long LINUX_MIB_TCPFASTOPENCOOKIEREQD
+unsigned_long LINUX_MIB_TCPFASTOPENBLACKHOLE
+unsigned_long LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES
+unsigned_long LINUX_MIB_BUSYPOLLRXPACKETS
+unsigned_long LINUX_MIB_TCPSYNRETRANS
+unsigned_long LINUX_MIB_TCPHYSTARTTRAINDETECT
+unsigned_long LINUX_MIB_TCPHYSTARTTRAINCWND
+unsigned_long LINUX_MIB_TCPHYSTARTDELAYDETECT
+unsigned_long LINUX_MIB_TCPHYSTARTDELAYCWND
+unsigned_long LINUX_MIB_TCPACKSKIPPEDSYNRECV
+unsigned_long LINUX_MIB_TCPACKSKIPPEDPAWS
+unsigned_long LINUX_MIB_TCPACKSKIPPEDSEQ
+unsigned_long LINUX_MIB_TCPACKSKIPPEDFINWAIT2
+unsigned_long LINUX_MIB_TCPACKSKIPPEDTIMEWAIT
+unsigned_long LINUX_MIB_TCPACKSKIPPEDCHALLENGE
+unsigned_long LINUX_MIB_TCPWINPROBE
+unsigned_long LINUX_MIB_TCPMTUPFAIL
+unsigned_long LINUX_MIB_TCPMTUPSUCCESS
+unsigned_long LINUX_MIB_TCPDELIVEREDCE
+unsigned_long LINUX_MIB_TCPACKCOMPRESSED
+unsigned_long LINUX_MIB_TCPZEROWINDOWDROP
+unsigned_long LINUX_MIB_TCPRCVQDROP
+unsigned_long LINUX_MIB_TCPWQUEUETOOBIG
+unsigned_long LINUX_MIB_TCPFASTOPENPASSIVEALTKEY
+unsigned_long LINUX_MIB_TCPTIMEOUTREHASH
+unsigned_long LINUX_MIB_TCPDUPLICATEDATAREHASH
+unsigned_long LINUX_MIB_TCPDSACKRECVSEGS
+unsigned_long LINUX_MIB_TCPDSACKIGNOREDDUBIOUS
+unsigned_long LINUX_MIB_TCPMIGRATEREQSUCCESS
+unsigned_long LINUX_MIB_TCPMIGRATEREQFAILURE
+unsigned_long __LINUX_MIB_MAX
diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst
new file mode 100644
index 000000000000..1c154cbd1848
--- /dev/null
+++ b/Documentation/networking/net_cachelines/tcp_sock.rst
@@ -0,0 +1,157 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. Copyright (C) 2023 Google LLC
+
+=========================================
+tcp_sock struct fast path usage breakdown
+=========================================
+
+Type Name fastpath_tx_access fastpath_rx_access Comments
+..struct ..tcp_sock
+struct_inet_connection_sock inet_conn
+u16 tcp_header_len read_mostly read_mostly tcp_bound_to_half_wnd,tcp_current_mss(tx);tcp_rcv_established(rx)
+u16 gso_segs read_mostly - tcp_xmit_size_goal
+__be32 pred_flags read_write read_mostly tcp_select_window(tx);tcp_rcv_established(rx)
+u64 bytes_received - read_write tcp_rcv_nxt_update(rx)
+u32 segs_in - read_write tcp_v6_rcv(rx)
+u32 data_segs_in - read_write tcp_v6_rcv(rx)
+u32 rcv_nxt read_mostly read_write tcp_cleanup_rbuf,tcp_send_ack,tcp_inq_hint,tcp_transmit_skb,tcp_receive_window(tx);tcp_v6_do_rcv,tcp_rcv_established,tcp_data_queue,tcp_receive_window,tcp_rcv_nxt_update(write)(rx)
+u32 copied_seq - read_mostly tcp_cleanup_rbuf,tcp_rcv_space_adjust,tcp_inq_hint
+u32 rcv_wup - read_write __tcp_cleanup_rbuf,tcp_receive_window,tcp_rcv_established
+u32 snd_nxt read_write read_mostly tcp_rate_check_app_limited,__tcp_transmit_skb,tcp_event_new_data_sent(write)(tx);tcp_rcv_established,tcp_ack,tcp_clean_rtx_queue(rx)
+u32 segs_out read_write - __tcp_transmit_skb
+u32 data_segs_out read_write - __tcp_transmit_skb,tcp_update_skb_after_send
+u64 bytes_sent read_write - __tcp_transmit_skb
+u64 bytes_acked - read_write tcp_snd_una_update/tcp_ack
+u32 dsack_dups
+u32 snd_una read_mostly read_write tcp_wnd_end,tcp_urg_mode,tcp_minshall_check,tcp_cwnd_validate(tx);tcp_ack,tcp_may_update_window,tcp_clean_rtx_queue(write),tcp_ack_tstamp(rx)
+u32 snd_sml read_write - tcp_minshall_check,tcp_minshall_update
+u32 rcv_tstamp - read_mostly tcp_ack
+u32 lsndtime read_write - tcp_slow_start_after_idle_check,tcp_event_data_sent
+u32 last_oow_ack_time
+u32 compressed_ack_rcv_nxt
+u32 tsoffset read_mostly read_mostly tcp_established_options(tx);tcp_fast_parse_options(rx)
+struct_list_head tsq_node - -
+struct_list_head tsorted_sent_queue read_write - tcp_update_skb_after_send
+u32 snd_wl1 - read_mostly tcp_may_update_window
+u32 snd_wnd read_mostly read_mostly tcp_wnd_end,tcp_tso_should_defer(tx);tcp_fast_path_on(rx)
+u32 max_window read_mostly - tcp_bound_to_half_wnd,forced_push
+u32 mss_cache read_mostly read_mostly tcp_rate_check_app_limited,tcp_current_mss,tcp_sync_mss,tcp_sndbuf_expand,tcp_tso_should_defer(tx);tcp_update_pacing_rate,tcp_clean_rtx_queue(rx)
+u32 window_clamp read_mostly read_write tcp_rcv_space_adjust,__tcp_select_window
+u32 rcv_ssthresh read_mostly - __tcp_select_window
+u8 scaling_ratio read_mostly read_mostly tcp_win_from_space
+struct tcp_rack
+u16 advmss - read_mostly tcp_rcv_space_adjust
+u8 compressed_ack
+u8:2 dup_ack_counter
+u8:1 tlp_retrans
+u8:1 tcp_usec_ts read_mostly read_mostly
+u32 chrono_start read_write - tcp_chrono_start/stop(tcp_write_xmit,tcp_cwnd_validate,tcp_send_syn_data)
+u32[3] chrono_stat read_write - tcp_chrono_start/stop(tcp_write_xmit,tcp_cwnd_validate,tcp_send_syn_data)
+u8:2 chrono_type read_write - tcp_chrono_start/stop(tcp_write_xmit,tcp_cwnd_validate,tcp_send_syn_data)
+u8:1 rate_app_limited - read_write tcp_rate_gen
+u8:1 fastopen_connect
+u8:1 fastopen_no_cookie
+u8:1 is_sack_reneg - read_mostly tcp_skb_entail,tcp_ack
+u8:2 fastopen_client_fail
+u8:4 nonagle read_write - tcp_skb_entail,tcp_push_pending_frames
+u8:1 thin_lto
+u8:1 recvmsg_inq
+u8:1 repair read_mostly - tcp_write_xmit
+u8:1 frto
+u8 repair_queue - -
+u8:2 save_syn
+u8:1 syn_data
+u8:1 syn_fastopen
+u8:1 syn_fastopen_exp
+u8:1 syn_fastopen_ch
+u8:1 syn_data_acked
+u8:1 is_cwnd_limited read_mostly - tcp_cwnd_validate,tcp_is_cwnd_limited
+u32 tlp_high_seq - read_mostly tcp_ack
+u32 tcp_tx_delay
+u64 tcp_wstamp_ns read_write - tcp_pacing_check,tcp_tso_should_defer,tcp_update_skb_after_send
+u64 tcp_clock_cache read_write read_write tcp_mstamp_refresh(tcp_write_xmit/tcp_rcv_space_adjust),__tcp_transmit_skb,tcp_tso_should_defer;timer
+u64 tcp_mstamp read_write read_write tcp_mstamp_refresh(tcp_write_xmit/tcp_rcv_space_adjust)(tx);tcp_rcv_space_adjust,tcp_rate_gen,tcp_clean_rtx_queue,tcp_ack_update_rtt/tcp_time_stamp(rx);timer
+u32 srtt_us read_mostly read_write tcp_tso_should_defer(tx);tcp_update_pacing_rate,__tcp_set_rto,tcp_rtt_estimator(rx)
+u32 mdev_us read_write - tcp_rtt_estimator
+u32 mdev_max_us
+u32 rttvar_us - read_mostly __tcp_set_rto
+u32 rtt_seq read_write tcp_rtt_estimator
+struct_minmax rtt_min - read_mostly tcp_min_rtt/tcp_rate_gen,tcp_min_rtt,tcp_update_rtt_min
+u32 packets_out read_write read_write tcp_packets_in_flight(tx/rx);tcp_slow_start_after_idle_check,tcp_nagle_check,tcp_rate_skb_sent,tcp_event_new_data_sent,tcp_cwnd_validate,tcp_write_xmit(tx);tcp_ack,tcp_clean_rtx_queue,tcp_update_pacing_rate(rx)
+u32 retrans_out - read_mostly tcp_packets_in_flight,tcp_rate_check_app_limited
+u32 max_packets_out - read_write tcp_cwnd_validate
+u32 cwnd_usage_seq - read_write tcp_cwnd_validate
+u16 urg_data - read_mostly tcp_fast_path_check
+u8 ecn_flags read_write - tcp_ecn_send
+u8 keepalive_probes
+u32 reordering read_mostly - tcp_sndbuf_expand
+u32 reord_seen
+u32 snd_up read_write read_mostly tcp_mark_urg,tcp_urg_mode,__tcp_transmit_skb(tx);tcp_clean_rtx_queue(rx)
+struct_tcp_options_received rx_opt read_mostly read_write tcp_established_options(tx);tcp_fast_path_on,tcp_ack_update_window,tcp_is_sack,tcp_data_queue,tcp_rcv_established,tcp_ack_update_rtt(rx)
+u32 snd_ssthresh - read_mostly tcp_update_pacing_rate
+u32 snd_cwnd read_mostly read_mostly tcp_snd_cwnd,tcp_rate_check_app_limited,tcp_tso_should_defer(tx);tcp_update_pacing_rate
+u32 snd_cwnd_cnt
+u32 snd_cwnd_clamp
+u32 snd_cwnd_used
+u32 snd_cwnd_stamp
+u32 prior_cwnd
+u32 prr_delivered
+u32 prr_out read_mostly read_mostly tcp_rate_skb_sent,tcp_newly_delivered(tx);tcp_ack,tcp_rate_gen,tcp_clean_rtx_queue(rx)
+u32 delivered read_mostly read_write tcp_rate_skb_sent,tcp_newly_delivered(tx);tcp_ack,tcp_rate_gen,tcp_clean_rtx_queue(rx)
+u32 delivered_ce read_mostly read_write tcp_rate_skb_sent(tx);tcp_rate_gen(rx)
+u32 lost - read_mostly tcp_ack
+u32 app_limited read_write read_mostly tcp_rate_check_app_limited,tcp_rate_skb_sent(tx);tcp_rate_gen(rx)
+u64 first_tx_mstamp read_write - tcp_rate_skb_sent
+u64 delivered_mstamp read_write - tcp_rate_skb_sent
+u32 rate_delivered - read_mostly tcp_rate_gen
+u32 rate_interval_us - read_mostly rate_delivered,rate_app_limited
+u32 rcv_wnd read_write read_mostly tcp_select_window,tcp_receive_window,tcp_fast_path_check
+u32 write_seq read_write - tcp_rate_check_app_limited,tcp_write_queue_empty,tcp_skb_entail,forced_push,tcp_mark_push
+u32 notsent_lowat read_mostly - tcp_stream_memory_free
+u32 pushed_seq read_write - tcp_mark_push,forced_push
+u32 lost_out read_mostly read_mostly tcp_left_out(tx);tcp_packets_in_flight(tx/rx);tcp_rate_check_app_limited(rx)
+u32 sacked_out read_mostly read_mostly tcp_left_out(tx);tcp_packets_in_flight(tx/rx);tcp_clean_rtx_queue(rx)
+struct_hrtimer pacing_timer
+struct_hrtimer compressed_ack_timer
+struct_sk_buff* lost_skb_hint read_mostly tcp_clean_rtx_queue
+struct_sk_buff* retransmit_skb_hint read_mostly - tcp_clean_rtx_queue
+struct_rb_root out_of_order_queue - read_mostly tcp_data_queue,tcp_fast_path_check
+struct_sk_buff* ooo_last_skb
+struct_tcp_sack_block[1] duplicate_sack
+struct_tcp_sack_block[4] selective_acks
+struct_tcp_sack_block[4] recv_sack_cache
+struct_sk_buff* highest_sack read_write - tcp_event_new_data_sent
+int lost_cnt_hint
+u32 prior_ssthresh
+u32 high_seq
+u32 retrans_stamp
+u32 undo_marker
+int undo_retrans
+u64 bytes_retrans
+u32 total_retrans
+u32 rto_stamp
+u16 total_rto
+u16 total_rto_recoveries
+u32 total_rto_time
+u32 urg_seq - -
+unsigned_int keepalive_time
+unsigned_int keepalive_intvl
+int linger2
+u8 bpf_sock_ops_cb_flags
+u8:1 bpf_chg_cc_inprogress
+u16 timeout_rehash
+u32 rcv_ooopack
+u32 rcv_rtt_last_tsecr
+struct rcv_rtt_est - read_write tcp_rcv_space_adjust,tcp_rcv_established
+struct rcvq_space - read_write tcp_rcv_space_adjust
+struct mtu_probe
+u32 plb_rehash
+u32 mtu_info
+bool is_mptcp
+bool smc_hs_congested
+bool syn_smc
+struct_tcp_sock_af_ops* af_specific
+struct_tcp_md5sig_info* md5sig_info
+struct_tcp_fastopen_request* fastopen_req
+struct_request_sock* fastopen_rsk
+struct_saved_syn* saved_syn
\ No newline at end of file
diff --git a/Documentation/networking/net_failover.rst b/Documentation/networking/net_failover.rst
index e143ab79a960..f4e1b4e07adc 100644
--- a/Documentation/networking/net_failover.rst
+++ b/Documentation/networking/net_failover.rst
@@ -35,7 +35,7 @@ To support this, the hypervisor needs to enable VIRTIO_NET_F_STANDBY
feature on the virtio-net interface and assign the same MAC address to both
virtio-net and VF interfaces.
-Here is an example XML snippet that shows such configuration.
+Here is an example libvirt XML snippet that shows such a configuration:
::
<interface type='network'>
@@ -45,18 +45,32 @@ Here is an example XML snippet that shows such configuration.
<model type='virtio'/>
<driver name='vhost' queues='4'/>
<link state='down'/>
- <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
+ <teaming type='persistent'/>
+ <alias name='ua-backup0'/>
</interface>
<interface type='hostdev' managed='yes'>
<mac address='52:54:00:00:12:53'/>
<source>
<address type='pci' domain='0x0000' bus='0x42' slot='0x02' function='0x5'/>
</source>
- <address type='pci' domain='0x0000' bus='0x00' slot='0x0b' function='0x0'/>
+ <teaming type='transient' persistent='ua-backup0'/>
</interface>
+In this configuration, the first device definition is for the virtio-net
+interface, which acts as the 'persistent' device, meaning that it is always
+plugged in. This is specified by the 'teaming' tag with the required attribute
+'type' set to 'persistent'. The link state of the virtio-net device is set to
+'down' so that the 'failover' netdev prefers the VF passthrough device for
+normal communication. The virtio-net device is brought UP during live
+migration to allow uninterrupted communication.
+
+The second device definition is for the VF passthrough interface. Here the
+'teaming' tag has type 'transient', indicating that this device may
+periodically be unplugged. A second attribute, 'persistent', points to the
+alias name declared for the virtio-net device.
+
Booting a VM with the above configuration will result in the following 3
-netdevs created in the VM.
+interfaces created in the VM:
::
4: ens10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
@@ -65,13 +79,36 @@ netdevs created in the VM.
valid_lft 42482sec preferred_lft 42482sec
inet6 fe80::97d8:db2:8c10:b6d6/64 scope link
valid_lft forever preferred_lft forever
- 5: ens10nsby: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master ens10 state UP group default qlen 1000
+ 5: ens10nsby: <BROADCAST,MULTICAST> mtu 1500 qdisc fq_codel master ens10 state DOWN group default qlen 1000
link/ether 52:54:00:00:12:53 brd ff:ff:ff:ff:ff:ff
7: ens11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ens10 state UP group default qlen 1000
link/ether 52:54:00:00:12:53 brd ff:ff:ff:ff:ff:ff
-ens10 is the 'failover' master netdev, ens10nsby and ens11 are the slave
-'standby' and 'primary' netdevs respectively.
+Here, ens10 is the 'failover' master interface, ens10nsby is the slave 'standby'
+virtio-net interface, and ens11 is the slave 'primary' VF passthrough interface.
+
+One point to note here is that some user space network configuration daemons,
+such as systemd-networkd and ifupdown, do not understand the 'net_failover'
+device. On the first boot, the VM might end up with both the 'failover' device
+and the VF acquiring IP addresses (either the same or different) from the DHCP
+server, which results in a lack of connectivity to the VM. Some tweaks to these
+network configuration daemons might therefore be needed to make sure that an IP
+address is received only on the 'failover' device.
+
+Below is the patch snippet used with the 'cloud-ifupdown-helper' script found on
+Debian cloud images:
+
+::
+
+ @@ -27,6 +27,8 @@ do_setup() {
+ local working="$cfgdir/.$INTERFACE"
+ local final="$cfgdir/$INTERFACE"
+
+ + if [ -d "/sys/class/net/${INTERFACE}/master" ]; then exit 0; fi
+ +
+ if ifup --no-act "$INTERFACE" > /dev/null 2>&1; then
+ # interface is already known to ifupdown, no need to generate cfg
+ log "Skipping configuration generation for $INTERFACE"
+
Live Migration of a VM with SR-IOV VF & virtio-net in STANDBY mode
==================================================================
@@ -80,40 +117,68 @@ net_failover also enables hypervisor controlled live migration to be supported
with VMs that have direct attached SR-IOV VF devices by automatic failover to
the paravirtual datapath when the VF is unplugged.
-Here is a sample script that shows the steps to initiate live migration on
-the source hypervisor.
+Here is a sample script that shows the steps to initiate live migration from
+the source hypervisor. Note: It is assumed that the VM is connected to a
+software bridge 'br0' which has a single VF attached to it along with the vnet
+device to the VM. This is not the VF that was passed through to the VM (seen in
+the vf.xml file).
::
- # cat vf_xml
+ # cat vf.xml
<interface type='hostdev' managed='yes'>
<mac address='52:54:00:00:12:53'/>
<source>
<address type='pci' domain='0x0000' bus='0x42' slot='0x02' function='0x5'/>
</source>
- <address type='pci' domain='0x0000' bus='0x00' slot='0x0b' function='0x0'/>
+ <teaming type='transient' persistent='ua-backup0'/>
</interface>
- # Source Hypervisor
+ # Source Hypervisor migrate.sh
#!/bin/bash
- DOMAIN=fedora27-tap01
- PF=enp66s0f0
- VF_NUM=5
- TAP_IF=tap01
- VF_XML=
+ DOMAIN=vm-01
+ PF=ens6np0
+ VF=ens6v1 # VF attached to the bridge.
+ VF_NUM=1
+ TAP_IF=vmtap01 # virtio-net interface in the VM.
+ VF_XML=vf.xml
MAC=52:54:00:00:12:53
ZERO_MAC=00:00:00:00:00:00
+ # Set the virtio-net interface up.
virsh domif-setlink $DOMAIN $TAP_IF up
- bridge fdb del $MAC dev $PF master
- virsh detach-device $DOMAIN $VF_XML
+
+ # Remove the VF that was passthrough'd to the VM.
+ virsh detach-device --live --config $DOMAIN $VF_XML
+
ip link set $PF vf $VF_NUM mac $ZERO_MAC
- virsh migrate --live $DOMAIN qemu+ssh://$REMOTE_HOST/system
+ # Add FDB entry for traffic to continue going to the VM via
+ # the VF -> br0 -> vnet interface path.
+ bridge fdb add $MAC dev $VF
+ bridge fdb add $MAC dev $TAP_IF master
+
+ # Migrate the VM
+ virsh migrate --live --persistent $DOMAIN qemu+ssh://$REMOTE_HOST/system
+
+ # Clean up FDB entries after migration completes.
+ bridge fdb del $MAC dev $VF
+ bridge fdb del $MAC dev $TAP_IF master
- # Destination Hypervisor
+On the destination hypervisor, a shared bridge 'br0' is created before migration
+starts, and a VF from the destination PF is added to the bridge. Similarly an
+appropriate FDB entry is added.
+
+The following script is executed on the destination hypervisor once migration
+completes, and it reattaches the VF to the VM and brings down the virtio-net
+interface.
+
+::
+
+ # reattach-vf.sh
#!/bin/bash
- virsh attach-device $DOMAIN $VF_XML
- virsh domif-setlink $DOMAIN $TAP_IF down
+ bridge fdb del 52:54:00:00:12:53 dev ens36v0
+ bridge fdb del 52:54:00:00:12:53 dev vmtap01 master
+ virsh attach-device --config --live vm-01 vf.xml
+ virsh domif-setlink vm-01 vmtap01 down
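The one-line check added to 'cloud-ifupdown-helper' in the net_failover.rst hunk above can be sketched as a standalone helper. This is a minimal sketch, not the Debian script itself: the function names `has_master` and `should_configure` and the `SYSFS_NET` override are hypothetical, introduced only so the check can be tried against a fake directory tree.

```shell
#!/bin/sh
# Hypothetical helper illustrating the "skip enslaved interfaces" check.
# SYSFS_NET is an assumption: it defaults to the real sysfs path but can
# be pointed at a scratch directory for experimentation.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

has_master() {
    # An enslaved interface (for example the 'standby' or 'primary' slave
    # of a failover device) exposes a 'master' entry in its sysfs directory.
    [ -d "$SYSFS_NET/$1/master" ]
}

should_configure() {
    # Only interfaces without a master should request an address, so that
    # DHCP runs on the 'failover' device alone.
    if has_master "$1"; then
        echo "skip"
    else
        echo "configure"
    fi
}
```

With the interface names from the example output, `should_configure ens11` would be expected to print `skip` and `should_configure ens10` to print `configure`, assuming ens11 is enslaved to ens10 as shown.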
diff --git a/Documentation/networking/netconsole.rst b/Documentation/networking/netconsole.rst
index 1f5c4a04027c..d55c2a22ec7a 100644
--- a/Documentation/networking/netconsole.rst
+++ b/Documentation/networking/netconsole.rst
@@ -13,6 +13,10 @@ IPv6 support by Cong Wang <xiyou.wangcong@gmail.com>, Jan 1 2013
Extended console support by Tejun Heo <tj@kernel.org>, May 1 2015
+Release prepend support by Breno Leitao <leitao@debian.org>, Jul 7 2023
+
+Userdata append support by Matthew Wood <thepacketgeek@gmail.com>, Jan 22 2024
+
Please send bug reports to Matt Mackall <mpm@selenic.com>
Satyam Sharma <satyam.sharma@gmail.com>, and Cong Wang <xiyou.wangcong@gmail.com>
@@ -34,10 +38,11 @@ Sender and receiver configuration:
It takes a string configuration parameter "netconsole" in the
following format::
- netconsole=[+][src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr]
+ netconsole=[+][r][src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr]
where
+ if present, enable extended console support
+ r if present, prepend kernel version (release) to the message
src-port source for UDP packets (defaults to 6665)
src-ip source IP to use (interface address)
dev network interface (eth0)
@@ -96,9 +101,6 @@ Dynamic reconfiguration:
Dynamic reconfigurability is a useful addition to netconsole that enables
remote logging targets to be dynamically added, removed, or have their
parameters reconfigured at runtime from a configfs-based userspace interface.
-[ Note that the parameters of netconsole targets that were specified/created
-from the boot/module option are not exposed via this interface, and hence
-cannot be modified dynamically. ]
To include this feature, select CONFIG_NETCONSOLE_DYNAMIC when building the
netconsole module (or kernel, if netconsole is built-in).
@@ -125,6 +127,7 @@ The interface exposes these parameters of a netconsole target to userspace:
============== ================================= ============
enabled Is this target currently enabled? (read-write)
extended Extended mode enabled (read-write)
+ release Prepend kernel release to message (read-write)
dev_name Local network interface name (read-write)
local_port Source UDP port to use (read-write)
remote_port Remote agent's UDP port (read-write)
@@ -151,6 +154,89 @@ You can also update the local interface dynamically. This is especially
useful if you want to use interfaces that have newly come up (and may not
have existed when netconsole was loaded / initialized).
+Netconsole targets defined at boot time (or module load time) with the
+`netconsole=` param are assigned the name `cmdline<index>`. For example, the
+first target in the parameter is named `cmdline0`. You can control and modify
+these targets by creating configfs directories with the matching name.
+
+Let's suppose you have two netconsole targets defined at boot time::
+
+ netconsole=4444@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc;4444@10.0.0.1/eth1,9353@10.0.0.3/12:34:56:78:9a:bc
+
+You can modify these targets at runtime by creating the following targets::
+
+ mkdir cmdline0
+ cat cmdline0/remote_ip
+ 10.0.0.2
+
+ mkdir cmdline1
+ cat cmdline1/remote_ip
+ 10.0.0.3
+
+Append User Data
+----------------
+
+Custom user data can be appended to the end of messages with netconsole
+dynamic configuration enabled. User data entries can be modified without
+changing the "enabled" attribute of a target.
+
+Directories (keys) under `userdata` are limited to 53 characters in length, and
+data in `userdata/<key>/value` are limited to 200 bytes::
+
+ cd /sys/kernel/config/netconsole && mkdir cmdline0
+ cd cmdline0
+ mkdir userdata/foo
+ echo bar > userdata/foo/value
+ mkdir userdata/qux
+ echo baz > userdata/qux/value
+
+Messages will now include this additional user data::
+
+ echo "This is a message" > /dev/kmsg
+
+Sends::
+
+ 12,607,22085407756,-;This is a message
+ foo=bar
+ qux=baz
+
+Preview the userdata that will be appended with::
+
+ cd /sys/kernel/config/netconsole/cmdline0
+ for f in `ls userdata`; do echo $f=$(cat userdata/$f/value); done
+
+If a `userdata` entry is created but no data is written to the `value` file,
+the entry will be omitted from netconsole messages::
+
+ cd /sys/kernel/config/netconsole && mkdir cmdline0
+ cd cmdline0
+ mkdir userdata/foo
+ echo bar > userdata/foo/value
+ mkdir userdata/qux
+
+The `qux` key is omitted since it has no value::
+
+ echo "This is a message" > /dev/kmsg
+ 12,607,22085407756,-;This is a message
+ foo=bar
+
+Delete `userdata` entries with `rmdir`::
+
+ rmdir /sys/kernel/config/netconsole/cmdline0/userdata/qux
+
+.. warning::
+ When writing strings to user data values, input is broken up per line in
+ configfs store calls and this can cause confusing behavior::
+
+ mkdir userdata/testing
+ printf "val1\nval2" > userdata/testing/value
+ # userdata store value is called twice, first with "val1\n" then "val2"
+ # so "val2" is stored, being the last value stored
+ cat userdata/testing/value
+ val2
+
+ It is recommended to not write user data values with newlines.
+
Extended console:
=================
@@ -165,9 +251,14 @@ following format which is the same as /dev/kmsg::
<level>,<sequnum>,<timestamp>,<contflag>;<message text>
+If the 'r' (release) feature is enabled, the kernel release version is
+prepended to the start of the message. Example::
+
+ 6.4.0,6,444,501151268,-;netconsole: network logging started
+
Non printable characters in <message text> are escaped using "\xff"
notation. If the message contains optional dictionary, verbatim
-newline is used as the delimeter.
+newline is used as the delimiter.
If a message doesn't fit in certain number of bytes (currently 1000),
the message is split into multiple fragments by netconsole. These
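A receiver can split the extended-format header described in the netconsole.rst hunk above with plain shell string operations. This is a minimal sketch under stated assumptions: the function name `parse_ext_header` is hypothetical, and the release prefix is detected by a heuristic (a first field that is not purely numeric), which is not something the documentation specifies.

```shell
#!/bin/sh
# Hypothetical parser for the first line of an extended netconsole message:
#   [<release>,]<level>,<sequnum>,<timestamp>,<contflag>;<message text>
parse_ext_header() {
    line=$1
    header=${line%%;*}   # everything before the first ';'
    text=${line#*;}      # everything after the first ';'
    first=${header%%,*}

    # Assumption: <level> is always numeric, so a non-numeric first field
    # (for example "6.4.0") is treated as the prepended kernel release.
    case $first in
        ''|*[!0-9]*) release=$first; header=${header#*,} ;;
        *) release= ;;
    esac

    # Split the remaining header on commas into positional parameters,
    # with globbing disabled so field contents are taken literally.
    old_ifs=$IFS
    IFS=,
    set -f
    set -- $header
    set +f
    IFS=$old_ifs

    echo "release=${release:-none} level=$1 seq=$2 text=$text"
}
```

Any userdata lines (`key=value`) follow the first line of the message and are not handled by this sketch.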
diff --git a/Documentation/networking/netdev-FAQ.rst b/Documentation/networking/netdev-FAQ.rst
deleted file mode 100644
index d5c9320901c3..000000000000
--- a/Documentation/networking/netdev-FAQ.rst
+++ /dev/null
@@ -1,272 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-.. _netdev-FAQ:
-
-==========
-netdev FAQ
-==========
-
-Q: What is netdev?
-------------------
-A: It is a mailing list for all network-related Linux stuff. This
-includes anything found under net/ (i.e. core code like IPv6) and
-drivers/net (i.e. hardware specific drivers) in the Linux source tree.
-
-Note that some subsystems (e.g. wireless drivers) which have a high
-volume of traffic have their own specific mailing lists.
-
-The netdev list is managed (like many other Linux mailing lists) through
-VGER (http://vger.kernel.org/) and archives can be found below:
-
-- http://marc.info/?l=linux-netdev
-- http://www.spinics.net/lists/netdev/
-
-Aside from subsystems like that mentioned above, all network-related
-Linux development (i.e. RFC, review, comments, etc.) takes place on
-netdev.
-
-Q: How do the changes posted to netdev make their way into Linux?
------------------------------------------------------------------
-A: There are always two trees (git repositories) in play. Both are
-driven by David Miller, the main network maintainer. There is the
-``net`` tree, and the ``net-next`` tree. As you can probably guess from
-the names, the ``net`` tree is for fixes to existing code already in the
-mainline tree from Linus, and ``net-next`` is where the new code goes
-for the future release. You can find the trees here:
-
-- https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git
-- https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git
-
-Q: How often do changes from these trees make it to the mainline Linus tree?
-----------------------------------------------------------------------------
-A: To understand this, you need to know a bit of background information on
-the cadence of Linux development. Each new release starts off with a
-two week "merge window" where the main maintainers feed their new stuff
-to Linus for merging into the mainline tree. After the two weeks, the
-merge window is closed, and it is called/tagged ``-rc1``. No new
-features get mainlined after this -- only fixes to the rc1 content are
-expected. After roughly a week of collecting fixes to the rc1 content,
-rc2 is released. This repeats on a roughly weekly basis until rc7
-(typically; sometimes rc6 if things are quiet, or rc8 if things are in a
-state of churn), and a week after the last vX.Y-rcN was done, the
-official vX.Y is released.
-
-Relating that to netdev: At the beginning of the 2-week merge window,
-the ``net-next`` tree will be closed - no new changes/features. The
-accumulated new content of the past ~10 weeks will be passed onto
-mainline/Linus via a pull request for vX.Y -- at the same time, the
-``net`` tree will start accumulating fixes for this pulled content
-relating to vX.Y
-
-An announcement indicating when ``net-next`` has been closed is usually
-sent to netdev, but knowing the above, you can predict that in advance.
-
-IMPORTANT: Do not send new ``net-next`` content to netdev during the
-period during which ``net-next`` tree is closed.
-
-Shortly after the two weeks have passed (and vX.Y-rc1 is released), the
-tree for ``net-next`` reopens to collect content for the next (vX.Y+1)
-release.
-
-If you aren't subscribed to netdev and/or are simply unsure if
-``net-next`` has re-opened yet, simply check the ``net-next`` git
-repository link above for any new networking-related commits. You may
-also check the following website for the current status:
-
- http://vger.kernel.org/~davem/net-next.html
-
-The ``net`` tree continues to collect fixes for the vX.Y content, and is
-fed back to Linus at regular (~weekly) intervals. Meaning that the
-focus for ``net`` is on stabilization and bug fixes.
-
-Finally, the vX.Y gets released, and the whole cycle starts over.
-
-Q: So where are we now in this cycle?
-
-Load the mainline (Linus) page here:
-
- https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
-
-and note the top of the "tags" section. If it is rc1, it is early in
-the dev cycle. If it was tagged rc7 a week ago, then a release is
-probably imminent.
-
-Q: How do I indicate which tree (net vs. net-next) my patch should be in?
--------------------------------------------------------------------------
-A: Firstly, think whether you have a bug fix or new "next-like" content.
-Then once decided, assuming that you use git, use the prefix flag, i.e.
-::
-
- git format-patch --subject-prefix='PATCH net-next' start..finish
-
-Use ``net`` instead of ``net-next`` (always lower case) in the above for
-bug-fix ``net`` content. If you don't use git, then note the only magic
-in the above is just the subject text of the outgoing e-mail, and you
-can manually change it yourself with whatever MUA you are comfortable
-with.
-
-Q: I sent a patch and I'm wondering what happened to it?
---------------------------------------------------------
-Q: How can I tell whether it got merged?
-A: Start by looking at the main patchworks queue for netdev:
-
- http://patchwork.ozlabs.org/project/netdev/list/
-
-The "State" field will tell you exactly where things are at with your
-patch.
-
-Q: The above only says "Under Review". How can I find out more?
-----------------------------------------------------------------
-A: Generally speaking, the patches get triaged quickly (in less than
-48h). So be patient. Asking the maintainer for status updates on your
-patch is a good way to ensure your patch is ignored or pushed to the
-bottom of the priority list.
-
-Q: I submitted multiple versions of the patch series
-----------------------------------------------------
-Q: should I directly update patchwork for the previous versions of these
-patch series?
-A: No, please don't interfere with the patch status on patchwork, leave
-it to the maintainer to figure out what is the most recent and current
-version that should be applied. If there is any doubt, the maintainer
-will reply and ask what should be done.
-
-Q: I made changes to only a few patches in a patch series should I resend only those changed?
----------------------------------------------------------------------------------------------
-A: No, please resend the entire patch series and make sure you do number your
-patches such that it is clear this is the latest and greatest set of patches
-that can be applied.
-
-Q: I submitted multiple versions of a patch series and it looks like a version other than the last one has been accepted, what should I do?
--------------------------------------------------------------------------------------------------------------------------------------------
-A: There is no revert possible, once it is pushed out, it stays like that.
-Please send incremental versions on top of what has been merged in order to fix
-the patches the way they would look like if your latest patch series was to be
-merged.
-
-Q: How can I tell what patches are queued up for backporting to the various stable releases?
---------------------------------------------------------------------------------------------
-A: Normally Greg Kroah-Hartman collects stable commits himself, but for
-networking, Dave collects up patches he deems critical for the
-networking subsystem, and then hands them off to Greg.
-
-There is a patchworks queue that you can see here:
-
- http://patchwork.ozlabs.org/bundle/davem/stable/?state=*
-
-It contains the patches which Dave has selected, but not yet handed off
-to Greg. If Greg already has the patch, then it will be here:
-
- https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git
-
-A quick way to find whether the patch is in this stable-queue is to
-simply clone the repo, and then git grep the mainline commit ID, e.g.
-::
-
- stable-queue$ git grep -l 284041ef21fdf2e
- releases/3.0.84/ipv6-fix-possible-crashes-in-ip6_cork_release.patch
- releases/3.4.51/ipv6-fix-possible-crashes-in-ip6_cork_release.patch
- releases/3.9.8/ipv6-fix-possible-crashes-in-ip6_cork_release.patch
- stable/stable-queue$
-
-Q: I see a network patch and I think it should be backported to stable.
------------------------------------------------------------------------
-Q: Should I request it via stable@vger.kernel.org like the references in
-the kernel's Documentation/process/stable-kernel-rules.rst file say?
-A: No, not for networking. Check the stable queues as per above first
-to see if it is already queued. If not, then send a mail to netdev,
-listing the upstream commit ID and why you think it should be a stable
-candidate.
-
-Before you jump to go do the above, do note that the normal stable rules
-in :ref:`Documentation/process/stable-kernel-rules.rst <stable_kernel_rules>`
-still apply. So you need to explicitly indicate why it is a critical
-fix and exactly what users are impacted. In addition, you need to
-convince yourself that you *really* think it has been overlooked,
-vs. having been considered and rejected.
-
-Generally speaking, the longer it has had a chance to "soak" in
-mainline, the better the odds that it is an OK candidate for stable. So
-scrambling to request a commit be added the day after it appears should
-be avoided.
-
-Q: I have created a network patch and I think it should be backported to stable.
---------------------------------------------------------------------------------
-Q: Should I add a Cc: stable@vger.kernel.org like the references in the
-kernel's Documentation/ directory say?
-A: No. See above answer. In short, if you think it really belongs in
-stable, then ensure you write a decent commit log that describes who
-gets impacted by the bug fix and how it manifests itself, and when the
-bug was introduced. If you do that properly, then the commit will get
-handled appropriately and most likely get put in the patchworks stable
-queue if it really warrants it.
-
-If you think there is some valid information relating to it being in
-stable that does *not* belong in the commit log, then use the three dash
-marker line as described in
-:ref:`Documentation/process/submitting-patches.rst <the_canonical_patch_format>`
-to temporarily embed that information into the patch that you send.
-
-Q: Are all networking bug fixes backported to all stable releases?
-------------------------------------------------------------------
-A: Due to capacity, Dave could only take care of the backports for the
-last two stable releases. For earlier stable releases, each stable
-branch maintainer is supposed to take care of them. If you find any
-patch is missing from an earlier stable branch, please notify
-stable@vger.kernel.org with either a commit ID or a formal patch
-backported, and CC Dave and other relevant networking developers.
-
-Q: Is the comment style convention different for the networking content?
-------------------------------------------------------------------------
-A: Yes, in a largely trivial way. Instead of this::
-
- /*
- * foobar blah blah blah
- * another line of text
- */
-
-it is requested that you make it look like this::
-
- /* foobar blah blah blah
- * another line of text
- */
-
-Q: I am working in existing code that has the former comment style and not the latter.
---------------------------------------------------------------------------------------
-Q: Should I submit new code in the former style or the latter?
-A: Make it the latter style, so that eventually all code in the domain
-of netdev is of this format.
-
-Q: I found a bug that might have possible security implications or similar.
----------------------------------------------------------------------------
-Q: Should I mail the main netdev maintainer off-list?**
-A: No. The current netdev maintainer has consistently requested that
-people use the mailing lists and not reach out directly. If you aren't
-OK with that, then perhaps consider mailing security@kernel.org or
-reading about http://oss-security.openwall.org/wiki/mailing-lists/distros
-as possible alternative mechanisms.
-
-Q: What level of testing is expected before I submit my change?
----------------------------------------------------------------
-A: If your changes are against ``net-next``, the expectation is that you
-have tested by layering your changes on top of ``net-next``. Ideally
-you will have done run-time testing specific to your change, but at a
-minimum, your changes should survive an ``allyesconfig`` and an
-``allmodconfig`` build without new warnings or failures.
-
-Q: Any other tips to help ensure my net/net-next patch gets OK'd?
------------------------------------------------------------------
-A: Attention to detail. Re-read your own work as if you were the
-reviewer. You can start with using ``checkpatch.pl``, perhaps even with
-the ``--strict`` flag. But do not be mindlessly robotic in doing so.
-If your change is a bug fix, make sure your commit log indicates the
-end-user visible symptom, the underlying reason as to why it happens,
-and then if necessary, explain why the fix proposed is the best way to
-get things done. Don't mangle whitespace, and as is common, don't
-mis-indent function arguments that span multiple lines. If it is your
-first patch, mail it to yourself so you can test apply it to an
-unpatched tree to confirm infrastructure didn't mangle it.
-
-Finally, go back and read
-:ref:`Documentation/process/submitting-patches.rst <submittingpatches>`
-to be sure you are not repeating some common mistake documented there.
diff --git a/Documentation/networking/netdev-features.rst b/Documentation/networking/netdev-features.rst
index a2d7d7160e39..d7b15bb64deb 100644
--- a/Documentation/networking/netdev-features.rst
+++ b/Documentation/networking/netdev-features.rst
@@ -182,3 +182,24 @@ stricter than Hardware LRO. A packet stream merged by Hardware GRO must
be re-segmentable by GSO or TSO back to the exact original packet stream.
Hardware GRO is dependent on RXCSUM since every packet successfully merged
by hardware must also have the checksum verified by hardware.
+
+* hsr-tag-ins-offload
+
+This should be set for devices which insert an HSR (High-availability Seamless
+Redundancy) or PRP (Parallel Redundancy Protocol) tag automatically.
+
+* hsr-tag-rm-offload
+
+This should be set for devices which remove HSR (High-availability Seamless
+Redundancy) or PRP (Parallel Redundancy Protocol) tags automatically.
+
+* hsr-fwd-offload
+
+This should be set for devices which forward HSR (High-availability Seamless
+Redundancy) frames from one port to another in hardware.
+
+* hsr-dup-offload
+
+This should be set for devices which duplicate outgoing HSR (High-availability
+Seamless Redundancy) or PRP (Parallel Redundancy Protocol) frames
+automatically in hardware.
diff --git a/Documentation/networking/netdevices.rst b/Documentation/networking/netdevices.rst
index 5a85fcc80c76..c2476917a6c3 100644
--- a/Documentation/networking/netdevices.rst
+++ b/Documentation/networking/netdevices.rst
@@ -10,18 +10,177 @@ Introduction
The following is a random collection of documentation regarding
network devices.
-struct net_device allocation rules
-==================================
+struct net_device lifetime rules
+================================
Network device structures need to persist even after module is unloaded and
must be allocated with alloc_netdev_mqs() and friends.
If device has registered successfully, it will be freed on last use
-by free_netdev(). This is required to handle the pathologic case cleanly
-(example: rmmod mydriver </sys/class/net/myeth/mtu )
+by free_netdev(). This is required to handle the pathological case cleanly
+(example: ``rmmod mydriver </sys/class/net/myeth/mtu``)
-alloc_netdev_mqs()/alloc_netdev() reserve extra space for driver
+alloc_netdev_mqs() / alloc_netdev() reserve extra space for driver
private data which gets freed when the network device is freed. If
separately allocated data is attached to the network device
-(netdev_priv(dev)) then it is up to the module exit handler to free that.
+(netdev_priv()) then it is up to the module exit handler to free that.
+
+There are two groups of APIs for registering struct net_device.
+The first group can be used in normal contexts where ``rtnl_lock`` is not
+already held: register_netdev(), unregister_netdev().
+The second group can be used when ``rtnl_lock`` is already held:
+register_netdevice(), unregister_netdevice(), free_netdev().
+
+Simple drivers
+--------------
+
+Most drivers (especially device drivers) handle lifetime of struct net_device
+in context where ``rtnl_lock`` is not held (e.g. driver probe and remove paths).
+
+In that case the struct net_device registration is done using
+the register_netdev() and unregister_netdev() functions:
+
+.. code-block:: c
+
+ int probe()
+ {
+ struct my_device_priv *priv;
+ int err;
+
+ dev = alloc_netdev_mqs(...);
+ if (!dev)
+ return -ENOMEM;
+ priv = netdev_priv(dev);
+
+ /* ... do all device setup before calling register_netdev() ...
+ */
+
+ err = register_netdev(dev);
+ if (err)
+ goto err_undo;
+
+ /* net_device is visible to the user! */
+ return 0;
+
+ err_undo:
+ /* ... undo the device setup ... */
+ free_netdev(dev);
+ return err;
+ }
+
+ void remove()
+ {
+ unregister_netdev(dev);
+ free_netdev(dev);
+ }
+
+Note that after calling register_netdev() the device is visible in the system.
+Users can open it and start sending / receiving traffic immediately,
+or run any other callback, so all initialization must be done prior to
+registration.
+
+unregister_netdev() closes the device and waits for all users to be done
+with it. The memory of struct net_device itself may still be referenced
+by sysfs but all operations on that device will fail.
+
+free_netdev() can be called after unregister_netdev() returns or when
+register_netdev() failed.
+
+Device management under RTNL
+----------------------------
+
+Registering struct net_device while in context which already holds
+the ``rtnl_lock`` requires extra care. In those scenarios most drivers
+will want to make use of struct net_device's ``needs_free_netdev``
+and ``priv_destructor`` members for freeing of state.
+
+Example flow of netdev handling under ``rtnl_lock``:
+
+.. code-block:: c
+
+ static void my_setup(struct net_device *dev)
+ {
+ dev->needs_free_netdev = true;
+ }
+
+ static void my_destructor(struct net_device *dev)
+ {
+ struct my_device_priv *priv = netdev_priv(dev);
+
+ some_obj_destroy(priv->obj);
+ some_uninit(priv);
+ }
+
+ int create_link()
+ {
+ struct my_device_priv *priv;
+ int err;
+
+ ASSERT_RTNL();
+
+ dev = alloc_netdev(sizeof(*priv), "net%d", NET_NAME_UNKNOWN, my_setup);
+ if (!dev)
+ return -ENOMEM;
+ priv = netdev_priv(dev);
+
+ /* Implicit constructor */
+ err = some_init(priv);
+ if (err)
+ goto err_free_dev;
+
+ priv->obj = some_obj_create();
+ if (!priv->obj) {
+ err = -ENOMEM;
+ goto err_some_uninit;
+ }
+ /* End of constructor, set the destructor: */
+ dev->priv_destructor = my_destructor;
+
+ err = register_netdevice(dev);
+ if (err)
+ /* register_netdevice() calls destructor on failure */
+ goto err_free_dev;
+
+ /* If anything fails now unregister_netdevice() (or unregister_netdev())
+ * will take care of calling my_destructor and free_netdev().
+ */
+
+ return 0;
+
+ err_some_uninit:
+ some_uninit(priv);
+ err_free_dev:
+ free_netdev(dev);
+ return err;
+ }
+
+If struct net_device.priv_destructor is set, the core calls it some time
+after unregister_netdevice(); it is also called if register_netdevice()
+fails. The callback may be invoked with or without ``rtnl_lock`` held.
+
+There is no explicit constructor callback, driver "constructs" the private
+netdev state after allocating it and before registration.
+
+Setting struct net_device.needs_free_netdev makes the core call free_netdev()
+automatically after unregister_netdevice() when all references to the device
+are gone. It only takes effect after a successful call to register_netdevice(),
+so if register_netdevice() fails the driver is responsible for calling
+free_netdev().
+
+free_netdev() is safe to call on error paths right after unregister_netdevice()
+or when register_netdevice() fails. Parts of netdev (de)registration process
+happen after ``rtnl_lock`` is released, therefore in those cases free_netdev()
+will defer some of the processing until ``rtnl_lock`` is released.
+
+Devices spawned from struct rtnl_link_ops should never free the
+struct net_device directly.
+
+.ndo_init and .ndo_uninit
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``.ndo_init`` and ``.ndo_uninit`` callbacks are called during net_device
+registration and de-registration, under ``rtnl_lock``. Drivers can use
+those e.g. when parts of their init process need to run under ``rtnl_lock``.
+
+``.ndo_init`` runs before device is visible in the system, ``.ndo_uninit``
+runs during de-registering after device is closed but other subsystems
+may still have outstanding references to the netdevice.
MTU
===
@@ -63,9 +222,38 @@ ndo_do_ioctl:
Synchronization: rtnl_lock() semaphore.
Context: process
+ This is only called by network subsystems internally,
+ not by user space calling ioctl as it was before
+ linux-5.14.
+
+ndo_siocbond:
+ Synchronization: rtnl_lock() semaphore.
+ Context: process
+
+ Used by the bonding driver for the SIOCBOND family of
+ ioctl commands.
+
+ndo_siocwandev:
+ Synchronization: rtnl_lock() semaphore.
+ Context: process
+
+ Used by the drivers/net/wan framework to handle
+ the SIOCWANDEV ioctl with the if_settings structure.
+
+ndo_siocdevprivate:
+ Synchronization: rtnl_lock() semaphore.
+ Context: process
+
+ This is used to implement SIOCDEVPRIVATE ioctl helpers.
+ These should not be added to new drivers.
+
+ndo_eth_ioctl:
+ Synchronization: rtnl_lock() semaphore.
+ Context: process
+
ndo_get_stats:
- Synchronization: dev_base_lock rwlock.
- Context: nominally process, but don't sleep inside an rwlock
+ Synchronization: rtnl_lock() semaphore, or RCU.
+ Context: atomic (can't sleep under RCU)
ndo_start_xmit:
Synchronization: __netif_tx_lock spinlock.
diff --git a/Documentation/networking/netlink_spec/.gitignore b/Documentation/networking/netlink_spec/.gitignore
new file mode 100644
index 000000000000..30d85567b592
--- /dev/null
+++ b/Documentation/networking/netlink_spec/.gitignore
@@ -0,0 +1 @@
+*.rst
diff --git a/Documentation/networking/netlink_spec/readme.txt b/Documentation/networking/netlink_spec/readme.txt
new file mode 100644
index 000000000000..6763f99d216c
--- /dev/null
+++ b/Documentation/networking/netlink_spec/readme.txt
@@ -0,0 +1,4 @@
+SPDX-License-Identifier: GPL-2.0
+
+This file is populated during the build of the documentation (htmldocs) by the
+tools/net/ynl/ynl-gen-rst.py script.
diff --git a/Documentation/networking/nexthop-group-resilient.rst b/Documentation/networking/nexthop-group-resilient.rst
new file mode 100644
index 000000000000..fabecee24d85
--- /dev/null
+++ b/Documentation/networking/nexthop-group-resilient.rst
@@ -0,0 +1,293 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+Resilient Next-hop Groups
+=========================
+
+Resilient groups are a type of next-hop group that is aimed at minimizing
+disruption in flow routing across changes to the group composition and
+weights of constituent next hops.
+
+The idea behind resilient hashing groups is best explained in contrast to
+the legacy multipath next-hop group, which uses the hash-threshold
+algorithm, described in RFC 2992.
+
+To select a next hop, the hash-threshold algorithm first assigns a range of
+hashes to each next hop in the group, and then selects the next hop by
+comparing the SKB hash with the individual ranges. When a next hop is
+removed from the group, the ranges are recomputed, which leads to
+reassignment of parts of hash space from one next hop to another. RFC 2992
+illustrates it thus::
+
+ +-------+-------+-------+-------+-------+
+ | 1 | 2 | 3 | 4 | 5 |
+ +-------+-+-----+---+---+-----+-+-------+
+ | 1 | 2 | 4 | 5 |
+ +---------+---------+---------+---------+
+
+ Before and after deletion of next hop 3
+ under the hash-threshold algorithm.
+
+Note how next hop 2 gave up part of the hash space in favor of next hop 1,
+and 4 in favor of 5. While there will usually be some overlap between the
+previous and the new distribution, some traffic flows change the next hop
+that they resolve to.
+
+If a multipath group is used for load-balancing between multiple servers,
+this hash space reassignment causes an issue that packets from a single
+flow suddenly end up arriving at a server that does not expect them. This
+can result in TCP connections being reset.
+
+If a multipath group is used for load-balancing among available paths to
+the same server, the issue is that different latencies and reordering along
+the way cause the packets to arrive in the wrong order, resulting in
+degraded application performance.
+
+To mitigate the above-mentioned flow redirection, resilient next-hop groups
+insert another layer of indirection between the hash space and its
+constituent next hops: a hash table. The selection algorithm uses SKB hash
+to choose a hash table bucket, then reads the next hop that this bucket
+contains, and forwards traffic there.
+
+This indirection brings an important feature. In the hash-threshold
+algorithm, the range of hashes associated with a next hop must be
+continuous. With a hash table, mapping between the hash table buckets and
+the individual next hops is arbitrary. Therefore when a next hop is deleted
+the buckets that held it are simply reassigned to other next hops::
+
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ v v v v
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+ Before and after deletion of next hop 3
+ under the resilient hashing algorithm.
+
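The two diagrams above can be reproduced with a small Python sketch. It is only an illustration under simplifying assumptions: the hash space has 20 values, all next hops have equal weight, and freed buckets are handed out round-robin. The helper names ``hash_threshold_owner`` and ``remove_from_table`` are hypothetical and do not correspond to kernel functions.

```python
def hash_threshold_owner(h, hops, space):
    """Hash-threshold: each next hop owns one contiguous, equal-sized range."""
    return hops[h * len(hops) // space]


def remove_from_table(table, gone, remaining):
    """Resilient table: rewrite only the buckets that pointed at `gone`."""
    out = list(table)
    handed_out = 0
    for idx, nh in enumerate(out):
        if nh == gone:
            out[idx] = remaining[handed_out % len(remaining)]
            handed_out += 1
    return out


SPACE = 20
before = [1, 2, 3, 4, 5]
after = [1, 2, 4, 5]

# Hash values whose next hop changes when hop 3 is deleted (hash-threshold).
ht_moved = sum(
    1
    for h in range(SPACE)
    if hash_threshold_owner(h, before, SPACE)
    != hash_threshold_owner(h, after, SPACE)
)

# Same deletion with a 20-bucket resilient table, 4 buckets per next hop.
table = [nh for nh in before for _ in range(4)]
new_table = remove_from_table(table, 3, after)
res_moved = sum(1 for old, new in zip(table, new_table) if old != new)
```

In this toy setup six hash values change owner under hash-threshold (four that belonged to next hop 3, plus two that moved between surviving next hops), while the resilient table rewrites only the four buckets that referenced next hop 3.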
+When weights of next hops in a group are altered, it may be possible to
+choose a subset of buckets that are currently not used for forwarding
+traffic, and use those to satisfy the new next-hop distribution demands,
+keeping the "busy" buckets intact. This way, established flows are ideally
+kept being forwarded to the same endpoints through the same paths as before
+the next-hop group change.
+
+Algorithm
+---------
+
+In a nutshell, the algorithm works as follows. Each next hop deserves a
+certain number of buckets, according to its weight and the number of
+buckets in the hash table. In accordance with the source code, we will call
+this number a "wants count" of a next hop. In case of an event that might
+cause bucket allocation change, the wants counts for individual next hops
+are updated.
+
+Next hops that have fewer buckets than their wants count are called
+"underweight". Those that have more are "overweight". If there are no
+overweight (and therefore no underweight) next hops in the group, it is
+said to be "balanced".
+
+Each bucket maintains a last-used timer. Every time a packet is forwarded
+through a bucket, this timer is updated to current jiffies value. One
+attribute of a resilient group is then the "idle timer", which is the
+amount of time that a bucket must not be hit by traffic in order for it to
+be considered "idle". Buckets that are not idle are busy.
+
+After assigning wants counts to next hops, an "upkeep" algorithm runs. For
+buckets:
+
+1) that have no assigned next hop, or
+2) whose next hop has been removed, or
+3) that are idle and their next hop is overweight,
+
+upkeep changes the next hop that the bucket references to one of the
+underweight next hops. If, after considering all buckets in this manner,
+there are still underweight next hops, another upkeep run is scheduled to a
+future time.
+
+There may not be enough "idle" buckets to satisfy the updated wants counts
+of all next hops. Another attribute of a resilient group is the "unbalanced
+timer". This timer can be set to 0, in which case the table will stay out
+of balance until idle buckets do appear, possibly never. If set to a
+non-zero value, the value represents the period of time that the table is
+permitted to stay out of balance.
+
+With this in mind, we update the above list of conditions with one more
+item. Thus buckets:
+
+4) whose next hop is overweight, and the amount of time that the table has
+ been out of balance exceeds the unbalanced timer, if that is non-zero,
+
+\... are migrated as well.
+
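The four migration conditions can be summarized in a short Python sketch. This is a simplified model, not the kernel implementation: the rounding used in ``wants_counts``, the dictionary layout of a bucket, and the ``force`` argument (standing in for "the unbalanced timer is non-zero and has expired") are all assumptions made for illustration.

```python
def wants_counts(weights, num_buckets):
    """Split num_buckets among next hops in proportion to their weights.

    Cumulative boundaries are rounded to nearest, so the per-hop counts
    always add up to num_buckets.
    """
    total = sum(weights.values())
    wants, cumulative, prev_upper = {}, 0, 0
    for nh, weight in weights.items():
        cumulative += weight
        upper = (2 * num_buckets * cumulative + total) // (2 * total)
        wants[nh] = upper - prev_upper
        prev_upper = upper
    return wants


def upkeep(buckets, wants, now, idle_timer, force=False):
    """Run one upkeep pass; return True if underweight next hops remain."""
    have = {nh: 0 for nh in wants}
    for bucket in buckets:
        if bucket["nh"] in have:
            have[bucket["nh"]] += 1

    for bucket in buckets:
        underweight = [nh for nh in wants if have[nh] < wants[nh]]
        if not underweight:
            break
        nh = bucket["nh"]
        removed = nh is None or nh not in wants          # conditions 1, 2
        overweight = not removed and have[nh] > wants[nh]
        idle = now - bucket["last_used"] >= idle_timer
        if removed or (overweight and (idle or force)):  # conditions 3, 4
            if not removed:
                have[nh] -= 1
            bucket["nh"] = underweight[0]
            have[underweight[0]] += 1

    return any(have[nh] < wants[nh] for nh in wants)


# Eight buckets, weights changed from 1:1 to 3:1. Two of next hop 2's
# buckets have been idle for 100 time units, the other two are busy.
wants = wants_counts({1: 3, 2: 1}, 8)
buckets = [{"nh": 1, "last_used": 95} for _ in range(4)]
buckets += [{"nh": 2, "last_used": t} for t in (0, 0, 95, 95)]
still_unbalanced = upkeep(buckets, wants, now=100, idle_timer=60)
```

Here next hop 1 wants six buckets and next hop 2 wants two, so the pass moves the two idle buckets of next hop 2 over to next hop 1 and leaves the busy ones alone. If all four had been busy, nothing would move until ``force`` became true.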
+Offloading & Driver Feedback
+----------------------------
+
+When offloading resilient groups, the algorithm that distributes buckets
+among next hops is still the one in SW. Drivers are notified of updates to
+next hop groups in the following three ways:
+
+- Full group notification with the type
+ ``NH_NOTIFIER_INFO_TYPE_RES_TABLE``. This is used just after the group is
+ created and buckets populated for the first time.
+
+- Single-bucket notifications of the type
+ ``NH_NOTIFIER_INFO_TYPE_RES_BUCKET``, which is used for notifications of
+ individual migrations within an already-established group.
+
+- Pre-replace notification, ``NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE``. This
+ is sent before the group is replaced, and is a way for the driver to veto
+ the group before committing anything to the HW.
+
+Some single-bucket notifications are forced, as indicated by the "force"
+flag in the notification. Those are used for the cases where e.g. the next
+hop associated with the bucket was removed, and the bucket really must be
+migrated.
+
+Non-forced notifications can be overridden by the driver by returning an
+error code. The use case for this is that the driver notifies the HW that a
+bucket should be migrated, but the HW discovers that the bucket has in fact
+been hit by traffic.
+
+A second way for the HW to report that a bucket is busy is through the
+``nexthop_res_grp_activity_update()`` API. The buckets identified this way
+as busy are treated as if traffic hit them.
+
+Offloaded buckets should be flagged as either "offload" or "trap". This is
+done through the ``nexthop_bucket_set_hw_flags()`` API.
+
+Netlink UAPI
+------------
+
+Resilient Group Replacement
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Resilient groups are configured using the ``RTM_NEWNEXTHOP`` message in the
+same manner as other multipath groups. The following changes apply to the
+attributes passed in the netlink message:
+
+ =================== =========================================================
+ ``NHA_GROUP_TYPE`` Should be ``NEXTHOP_GRP_TYPE_RES`` for resilient group.
+ ``NHA_RES_GROUP`` A nest that contains attributes specific to resilient
+ groups.
+ =================== =========================================================
+
+``NHA_RES_GROUP`` payload:
+
+ =================================== =========================================
+ ``NHA_RES_GROUP_BUCKETS`` Number of buckets in the hash table.
+ ``NHA_RES_GROUP_IDLE_TIMER`` Idle timer in units of clock_t.
+ ``NHA_RES_GROUP_UNBALANCED_TIMER`` Unbalanced timer in units of clock_t.
+ =================================== =========================================
+
+Next Hop Get
+^^^^^^^^^^^^
+
+Requests to get resilient next-hop groups use the ``RTM_GETNEXTHOP``
+message in exactly the same way as other next hop get requests. The
+response attributes match the replacement attributes cited above, except
+``NHA_RES_GROUP`` payload will include the following attribute:
+
+ =================================== =========================================
+ ``NHA_RES_GROUP_UNBALANCED_TIME`` How long has the resilient group been out
+ of balance, in units of clock_t.
+ =================================== =========================================
+
+Bucket Get
+^^^^^^^^^^
+
+The message ``RTM_GETNEXTHOPBUCKET`` without the ``NLM_F_DUMP`` flag is
+used to request a single bucket. The attributes recognized at get requests
+are:
+
+ =================== =========================================================
+ ``NHA_ID`` ID of the next-hop group that the bucket belongs to.
+ ``NHA_RES_BUCKET`` A nest that contains attributes specific to bucket.
+ =================== =========================================================
+
+``NHA_RES_BUCKET`` payload:
+
+ ======================== ====================================================
+ ``NHA_RES_BUCKET_INDEX`` Index of bucket in the resilient table.
+ ======================== ====================================================
+
+Bucket Dumps
+^^^^^^^^^^^^
+
+The message ``RTM_GETNEXTHOPBUCKET`` with the ``NLM_F_DUMP`` flag is used
+to request a dump of matching buckets. The attributes recognized at dump
+requests are:
+
+ =================== =========================================================
+ ``NHA_ID`` If specified, limits the dump to just the next-hop group
+ with this ID.
+ ``NHA_OIF`` If specified, limits the dump to buckets that contain
+ next hops that use the device with this ifindex.
+ ``NHA_MASTER`` If specified, limits the dump to buckets that contain
+ next hops that use a device in the VRF with this ifindex.
+ ``NHA_RES_BUCKET`` A nest that contains attributes specific to bucket.
+ =================== =========================================================
+
+``NHA_RES_BUCKET`` payload:
+
+ ======================== ====================================================
+ ``NHA_RES_BUCKET_NH_ID`` If specified, limits the dump to just the buckets
+ that contain the next hop with this ID.
+ ======================== ====================================================
+
+Usage
+-----
+
+To illustrate the usage, consider the following commands::
+
+ # ip nexthop add id 1 via 192.0.2.2 dev eth0
+ # ip nexthop add id 2 via 192.0.2.3 dev eth0
+ # ip nexthop add id 10 group 1/2 type resilient \
+ buckets 8 idle_timer 60 unbalanced_timer 300
+
+The last command creates a resilient next-hop group. It will have 8 buckets
+(an unusually low number, used here for demonstration purposes only). Each
+bucket will be considered idle when no traffic hits it for at least 60
+seconds, and if the table remains out of balance for 300 seconds, it will be
+forcefully brought into balance.
+
+Changing next-hop weights leads to a change in bucket allocation::
+
+ # ip nexthop replace id 10 group 1,3/2 type resilient
+
+This can be confirmed by looking at individual buckets::
+
+ # ip nexthop bucket show id 10
+ id 10 index 0 idle_time 5.59 nhid 1
+ id 10 index 1 idle_time 5.59 nhid 1
+ id 10 index 2 idle_time 8.74 nhid 2
+ id 10 index 3 idle_time 8.74 nhid 2
+ id 10 index 4 idle_time 8.74 nhid 1
+ id 10 index 5 idle_time 8.74 nhid 1
+ id 10 index 6 idle_time 8.74 nhid 1
+ id 10 index 7 idle_time 8.74 nhid 1
+
+Note the two buckets that have a shorter idle time. Those are the ones that
+were migrated after the next-hop replace command to satisfy the new demand
+that next hop 1 be given 6 buckets instead of 4.
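
The 6/2 split above follows from applying the 3:1 weights to 8 buckets. A minimal sketch of that arithmetic in Python, assuming buckets are apportioned by rounding each cumulative weight share to the nearest integer. The helper name ``wanted_buckets`` and the rounding rule are illustrative assumptions, not taken from this document:

```python
def wanted_buckets(weights, num_buckets):
    """Split num_buckets among next hops in proportion to their weights.

    Assumption: the upper bound of each next hop's share is the cumulative
    weight fraction rounded to the nearest integer, so the shares always
    sum to num_buckets.
    """
    total = sum(weights)
    wants = []
    cumulative = 0
    prev_upper = 0
    for weight in weights:
        cumulative += weight
        # Integer round-to-nearest of num_buckets * cumulative / total.
        upper = (num_buckets * cumulative + total // 2) // total
        wants.append(upper - prev_upper)
        prev_upper = upper
    return wants


if __name__ == "__main__":
    print(wanted_buckets([1, 1], 8))  # group 1/2: equal weights
    print(wanted_buckets([3, 1], 8))  # group 1,3/2: next hop 1 has weight 3
```

With equal weights each next hop wants 4 buckets; with weights 3 and 1 the shares become 6 and 2, which matches the bucket listing shown above.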
+
+Netdevsim
+---------
+
+The netdevsim driver implements a mock offload of resilient groups, and
+exposes a debugfs interface that allows marking individual buckets as busy.
+For example, the following will mark bucket 23 in next-hop group 10 as
+active::
+
+ # echo 10 23 > /sys/kernel/debug/netdevsim/netdevsim10/fib/nexthop_bucket_activity
+
+In addition, another debugfs interface can be used to make the next
+attempt to migrate a bucket fail::
+
+ # echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace
+
+Besides serving as an example, the interfaces that netdevsim exposes are
+useful in automated testing, and
+``tools/testing/selftests/drivers/net/netdevsim/nexthop.sh`` makes use of
+them to test the algorithm.
diff --git a/Documentation/networking/nf_conntrack-sysctl.rst b/Documentation/networking/nf_conntrack-sysctl.rst
index 11a9b76786cb..c383a394c665 100644
--- a/Documentation/networking/nf_conntrack-sysctl.rst
+++ b/Documentation/networking/nf_conntrack-sysctl.rst
@@ -17,9 +17,8 @@ nf_conntrack_acct - BOOLEAN
nf_conntrack_buckets - INTEGER
Size of hash table. If not specified as parameter during module
loading, the default size is calculated by dividing total memory
- by 16384 to determine the number of buckets but the hash table will
- never have fewer than 32 and limited to 16384 buckets. For systems
- with more than 4GB of memory it will be 65536 buckets.
+ by 16384 to determine the number of buckets. The hash table will
+ never have fewer than 1024 and never more than 262144 buckets.
This sysctl is only writeable in the initial net namespace.
nf_conntrack_checksum - BOOLEAN
@@ -35,10 +34,13 @@ nf_conntrack_count - INTEGER (read-only)
nf_conntrack_events - BOOLEAN
- 0 - disabled
- - not 0 - enabled (default)
+ - 1 - enabled
+ - 2 - auto (default)
If this option is enabled, the connection tracking code will
provide userspace with connection tracking events via ctnetlink.
+ The default allocates the extension if a userspace program is
+ listening to ctnetlink events.
nf_conntrack_expect_max - INTEGER
Maximum size of expectation table. Default value is
@@ -68,15 +70,6 @@ nf_conntrack_generic_timeout - INTEGER (seconds)
Default for generic timeout. This refers to layer 4 unknown/unsupported
protocols.
-nf_conntrack_helper - BOOLEAN
- - 0 - disabled (default)
- - not 0 - enabled
-
- Enable automatic conntrack helper assignment.
- If disabled it is required to set up iptables rules to assign
- helpers to connections. See the CT target description in the
- iptables-extensions(8) man page for further information.
-
nf_conntrack_icmp_timeout - INTEGER (seconds)
default 30
@@ -100,8 +93,12 @@ nf_conntrack_log_invalid - INTEGER
Log invalid packets of a type specified by value.
nf_conntrack_max - INTEGER
- Size of connection tracking table. Default value is
- nf_conntrack_buckets value * 4.
+ Maximum number of allowed connection tracking entries. This value is set
+ to nf_conntrack_buckets by default.
+ Note that connection tracking entries are added to the table twice -- once
+ for the original direction and once for the reply direction (i.e., with
+ the reversed address). This means that with default settings a maxed-out
+ table will have an average hash chain length of 2, not 1.
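
As a rough illustration of the sizing rules described above for nf_conntrack_buckets and nf_conntrack_max, the following Python sketch applies the documented division and limits literally. The function names are hypothetical, and the kernel's actual computation may differ in detail:

```python
def default_buckets(total_memory_bytes):
    """Default nf_conntrack_buckets as described in the text:
    total memory divided by 16384, never below 1024, never above 262144."""
    return max(1024, min(262144, total_memory_bytes // 16384))


def average_chain_length(entries, buckets):
    """Each connection is inserted twice: once for the original direction
    and once for the reply direction."""
    return 2 * entries / buckets


if __name__ == "__main__":
    for mem in (8 << 20, 1 << 30, 16 << 30):  # 8 MiB, 1 GiB, 16 GiB
        buckets = default_buckets(mem)
        # nf_conntrack_max defaults to the value of nf_conntrack_buckets.
        print(mem, buckets, average_chain_length(buckets, buckets))
```

Under these assumptions a full table with default settings always yields an average chain length of 2, regardless of the amount of memory.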
nf_conntrack_tcp_be_liberal - BOOLEAN
- 0 - disabled (default)
@@ -110,6 +107,12 @@ nf_conntrack_tcp_be_liberal - BOOLEAN
Be conservative in what you do, be liberal in what you accept from others.
If it's non-zero, we mark only out of window RST segments as INVALID.
+nf_conntrack_tcp_ignore_invalid_rst - BOOLEAN
+ - 0 - disabled (default)
+ - 1 - enabled
+
+ If it's 1, we don't mark out of window RST segments as INVALID.
+
nf_conntrack_tcp_loose - BOOLEAN
- 0 - disabled
- not 0 - enabled (default)
@@ -160,6 +163,35 @@ nf_conntrack_timestamp - BOOLEAN
Enable connection tracking flow timestamping.
+nf_conntrack_sctp_timeout_closed - INTEGER (seconds)
+ default 10
+
+nf_conntrack_sctp_timeout_cookie_wait - INTEGER (seconds)
+ default 3
+
+nf_conntrack_sctp_timeout_cookie_echoed - INTEGER (seconds)
+ default 3
+
+nf_conntrack_sctp_timeout_established - INTEGER (seconds)
+ default 210
+
+ Default is set to (hb_interval * path_max_retrans + rto_max)
+
+nf_conntrack_sctp_timeout_shutdown_sent - INTEGER (seconds)
+ default 3
+
+nf_conntrack_sctp_timeout_shutdown_recd - INTEGER (seconds)
+ default 3
+
+nf_conntrack_sctp_timeout_shutdown_ack_sent - INTEGER (seconds)
+ default 3
+
+nf_conntrack_sctp_timeout_heartbeat_sent - INTEGER (seconds)
+ default 30
+
+ This timeout is used to set up conntrack entries on secondary paths.
+ Default is set to hb_interval.
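
The two derived SCTP defaults above can be checked with a short calculation. This Python sketch assumes the usual Linux SCTP defaults of hb_interval = 30 s, path_max_retrans = 5 and rto_max = 60 s. Those values are assumptions here, since this document does not state them:

```python
# Assumed SCTP defaults, in seconds (not stated in this document).
HB_INTERVAL = 30
PATH_MAX_RETRANS = 5
RTO_MAX = 60


def sctp_established_timeout(hb_interval, path_max_retrans, rto_max):
    """Formula given for nf_conntrack_sctp_timeout_established."""
    return hb_interval * path_max_retrans + rto_max


def sctp_heartbeat_sent_timeout(hb_interval):
    """nf_conntrack_sctp_timeout_heartbeat_sent defaults to hb_interval."""
    return hb_interval


if __name__ == "__main__":
    print(sctp_established_timeout(HB_INTERVAL, PATH_MAX_RETRANS, RTO_MAX))
    print(sctp_heartbeat_sent_timeout(HB_INTERVAL))
```

With the assumed inputs the results are 210 and 30 seconds, matching the defaults listed above.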
+
nf_conntrack_udp_timeout - INTEGER (seconds)
default 30
@@ -177,3 +209,24 @@ nf_conntrack_gre_timeout_stream - INTEGER (seconds)
This extended timeout will be used in case there is an GRE stream
detected.
+
+nf_hooks_lwtunnel - BOOLEAN
+ - 0 - disabled (default)
+ - not 0 - enabled
+
+ If this option is enabled, the lightweight tunnel netfilter hooks are
+ enabled. This option cannot be disabled once it is enabled.
+
+nf_flowtable_tcp_timeout - INTEGER (seconds)
+ default 30
+
+ Control offload timeout for tcp connections.
+ TCP connections may be offloaded from nf conntrack to nf flow table.
+ Once aged, the connection is returned to nf conntrack with tcp pickup timeout.
+
+nf_flowtable_udp_timeout - INTEGER (seconds)
+ default 30
+
+ Control offload timeout for udp connections.
+ UDP connections may be offloaded from nf conntrack to nf flow table.
+ Once aged, the connection is returned to nf conntrack with udp pickup timeout.
diff --git a/Documentation/networking/nf_flowtable.rst b/Documentation/networking/nf_flowtable.rst
index b6e1fa141aae..d757c21c10f2 100644
--- a/Documentation/networking/nf_flowtable.rst
+++ b/Documentation/networking/nf_flowtable.rst
@@ -4,35 +4,38 @@
Netfilter's flowtable infrastructure
====================================
-This documentation describes the software flowtable infrastructure available in
-Netfilter since Linux kernel 4.16.
+This documentation describes the Netfilter flowtable infrastructure which allows
+you to define a fastpath through the flowtable datapath. This infrastructure
+also provides hardware offload support. The flowtable supports the layer 3
+IPv4 and IPv6 protocols and the layer 4 TCP and UDP protocols.
Overview
--------
-Initial packets follow the classic forwarding path, once the flow enters the
-established state according to the conntrack semantics (ie. we have seen traffic
-in both directions), then you can decide to offload the flow to the flowtable
-from the forward chain via the 'flow offload' action available in nftables.
+Once the first packet of the flow successfully goes through the IP forwarding
+path, from the second packet on, you might decide to offload the flow to the
+flowtable through your ruleset. The flowtable infrastructure provides a rule
+action that allows you to specify when to add a flow to the flowtable.
-Packets that find an entry in the flowtable (ie. flowtable hit) are sent to the
-output netdevice via neigh_xmit(), hence, they bypass the classic forwarding
-path (the visible effect is that you do not see these packets from any of the
-netfilter hooks coming after the ingress). In case of flowtable miss, the packet
-follows the classic forward path.
+A packet that finds a matching entry in the flowtable (i.e. a flowtable hit) is
+transmitted to the output netdevice via neigh_xmit(), so it bypasses the
+classic IP forwarding path (the visible effect is that you do not see these
+packets from any of the Netfilter hooks coming after ingress). If there is no
+matching entry in the flowtable (i.e. a flowtable miss), the packet follows the
+classic IP forwarding path.
-The flowtable uses a resizable hashtable, lookups are based on the following
-7-tuple selectors: source, destination, layer 3 and layer 4 protocols, source
-and destination ports and the input interface (useful in case there are several
-conntrack zones in place).
+The flowtable uses a resizable hashtable. Lookups are based on the following
+n-tuple selectors: layer 2 protocol encapsulation (VLAN and PPPoE), layer 3
+source and destination, layer 4 source and destination ports and the input
+interface (useful in case there are several conntrack zones in place).
-Flowtables are populated via the 'flow offload' nftables action, so the user can
-selectively specify what flows are placed into the flow table. Hence, packets
-follow the classic forwarding path unless the user explicitly instruct packets
-to use this new alternative forwarding path via nftables policy.
+The 'flow add' action allows you to populate the flowtable: the user selectively
+specifies which flows are placed into the flowtable. Hence, packets follow the
+classic IP forwarding path unless the user explicitly instructs flows to use
+this new alternative forwarding path via policy.
-This is represented in Fig.1, which describes the classic forwarding path
-including the Netfilter hooks and the flowtable fastpath bypass.
+The flowtable datapath is represented in Fig.1, which describes the classic IP
+forwarding path including the Netfilter hooks and the flowtable fastpath bypass.
::
@@ -67,11 +70,13 @@ including the Netfilter hooks and the flowtable fastpath bypass.
Fig.1 Netfilter hooks and flowtable interactions
The flowtable entry also stores the NAT configuration, so all packets are
-mangled according to the NAT policy that matches the initial packets that went
-through the classic forwarding path. The TTL is decremented before calling
-neigh_xmit(). Fragmented traffic is passed up to follow the classic forwarding
-path given that the transport selectors are missing, therefore flowtable lookup
-is not possible.
+mangled according to the NAT policy that is specified from the classic IP
+forwarding path. The TTL is decremented before calling neigh_xmit(). Fragmented
+traffic is passed up to follow the classic IP forwarding path because the
+transport header is missing, which makes flowtable lookups impossible.
+TCP RST and FIN packets are also passed up to the classic IP forwarding path to
+release the flow gracefully. Packets that exceed the MTU are also passed up to
+the classic forwarding path to report packet-too-big ICMP errors to the sender.
Example configuration
---------------------
@@ -85,7 +90,7 @@ flowtable and add one rule to your forward chain::
}
chain y {
type filter hook forward priority 0; policy accept;
- ip protocol tcp flow offload @f
+ ip protocol tcp flow add @f
counter packets 0 bytes 0
}
}
@@ -103,13 +108,126 @@ flow is offloaded, you will observe that the counter rule in the example above
does not get updated for the packets that are being forwarded through the
forwarding bypass.
+You can identify offloaded flows through the [OFFLOAD] tag when listing your
+connection tracking table.
+
+::
+
+ # conntrack -L
+ tcp 6 src=10.141.10.2 dst=192.168.10.2 sport=52728 dport=5201 src=192.168.10.2 dst=192.168.10.1 sport=5201 dport=52728 [OFFLOAD] mark=0 use=2
+
+
+Layer 2 encapsulation
+---------------------
+
+Since Linux kernel 5.13, the flowtable infrastructure discovers the real
+netdevice behind VLAN and PPPoE netdevices. The flowtable software datapath
+parses the VLAN and PPPoE layer 2 headers to extract the ethertype and the
+VLAN ID / PPPoE session ID which are used for the flowtable lookups. The
+flowtable datapath also deals with layer 2 decapsulation.
+
+You do not need to add the PPPoE and the VLAN devices to your flowtable,
+instead the real device is sufficient for the flowtable to track your flows.
+
+Bridge and IP forwarding
+------------------------
+
+Since Linux kernel 5.13, you can add bridge ports to the flowtable. The
+flowtable infrastructure discovers the topology behind the bridge device. This
+allows the flowtable to define a fastpath bypass between the bridge ports
+(represented as eth1 and eth2 in the example figure below) and the gateway
+device (represented as eth0) in your switch/router.
+
+::
+
+ fastpath bypass
+ .-------------------------.
+ / \
+ | IP forwarding |
+ | / \ \/
+ | br0 eth0 ..... eth0
+ . / \ *host B*
+ -> eth1 eth2
+ . *switch/router*
+ .
+ .
+ eth0
+ *host A*
+
+The flowtable infrastructure also supports bridge VLAN filtering actions
+such as PVID and untagged. You can also stack a classic VLAN device on top of
+your bridge port.
+
+If you would like your flowtable to define a fastpath between your bridge
+ports and your IP forwarding path, you have to add your bridge ports (as
+represented by the real netdevice) to your flowtable definition.
+
+Counters
+--------
+
+The flowtable can synchronize packet and byte counters with the existing
+connection tracking entry by specifying the counter statement in your flowtable
+definition, e.g.
+
+::
+
+ table inet x {
+ flowtable f {
+ hook ingress priority 0; devices = { eth0, eth1 };
+ counter
+ }
+ }
+
+Counter support is available since Linux kernel 5.7.
+
+Hardware offload
+----------------
+
+If your network device provides hardware offload support, you can turn it on by
+means of the 'offload' flag in your flowtable definition, e.g.
+
+::
+
+ table inet x {
+ flowtable f {
+ hook ingress priority 0; devices = { eth0, eth1 };
+ flags offload;
+ }
+ }
+
+There is a workqueue that adds the flows to the hardware. Note that a few
+packets might still run over the flowtable software path until the workqueue has
+a chance to offload the flow to the network device.
+
+You can identify hardware offloaded flows through the [HW_OFFLOAD] tag when
+listing your connection tracking table. Note the distinction between the two
+tags: [OFFLOAD] refers to the software flowtable fastpath, whereas [HW_OFFLOAD]
+refers to the hardware offload datapath being used by the flow.
+
+The flowtable hardware offload infrastructure also supports DSA
+(Distributed Switch Architecture).
+
+Limitations
+-----------
+
+The flowtable behaves like a cache. The flowtable entries might get stale if
+either the destination MAC address or the egress netdevice that is used for
+transmission changes.
+
+This might be a problem if:
+
+- You run the flowtable in software mode and you combine bridge and IP
+ forwarding in your setup.
+- Hardware offload is enabled.
+
More reading
------------
This documentation is based on the LWN.net articles [1]_\ [2]_. Rafal Milecki
also made a very complete and comprehensive summary called "A state of network
acceleration" that describes how things were before this infrastructure was
-mailined [3]_ and it also makes a rough summary of this work [4]_.
+mainlined [3]_ and it also makes a rough summary of this work [4]_.
.. [1] https://lwn.net/Articles/738214/
.. [2] https://lwn.net/Articles/742164/
diff --git a/Documentation/networking/operstates.rst b/Documentation/networking/operstates.rst
index 9c918f7cb0e8..1ee2141e8ef1 100644
--- a/Documentation/networking/operstates.rst
+++ b/Documentation/networking/operstates.rst
@@ -73,7 +73,9 @@ IF_OPER_LOWERLAYERDOWN (3):
state (f.e. VLAN).
IF_OPER_TESTING (4):
- Unused in current kernel.
+ Interface is in testing mode, for example executing driver self-tests
+ or media (cable) test. It can't be used for normal traffic until tests
+ complete.
IF_OPER_DORMANT (5):
Interface is L1 up, but waiting for an external event, f.e. for a
@@ -111,7 +113,7 @@ it as lower layer.
Note that for certain kind of soft-devices, which are not managing any
real hardware, it is possible to set this bit from userspace. One
-should use TVL IFLA_CARRIER to do so.
+should use TLV IFLA_CARRIER to do so.
netif_carrier_ok() can be used to query that bit.
diff --git a/Documentation/networking/packet_mmap.rst b/Documentation/networking/packet_mmap.rst
index 6c009ceb1183..dca15d15feaf 100644
--- a/Documentation/networking/packet_mmap.rst
+++ b/Documentation/networking/packet_mmap.rst
@@ -8,7 +8,7 @@ Abstract
========
This file documents the mmap() facility available with the PACKET
-socket interface on 2.4/2.6/3.x kernels. This type of sockets is used for
+socket interface. This type of socket is used for
i) capture network traffic with utilities like tcpdump,
ii) transmit network traffic, or any other that needs raw
@@ -25,12 +25,12 @@ Please send your comments to
Why use PACKET_MMAP
===================
-In Linux 2.4/2.6/3.x if PACKET_MMAP is not enabled, the capture process is very
+The non-PACKET_MMAP capture process (plain AF_PACKET) is very
inefficient. It uses very limited buffers and requires one system call to
capture each packet, it requires two if you want to get packet's timestamp
(like libpcap always does).
-In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size
+On the other hand, PACKET_MMAP is very efficient. PACKET_MMAP provides a size
configurable circular buffer mapped in user space that can be used to either
send or receive packets. This way reading packets just needs to wait for them,
most of the time there is no need to issue a single system call. Concerning
@@ -153,7 +153,7 @@ As capture, each frame contains two parts::
struct ifreq s_ifr;
...
- strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));
+ strscpy_pad (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));
/* get interface index of eth0 */
ioctl(this->socket, SIOCGIFINDEX, &s_ifr);
@@ -252,8 +252,7 @@ PACKET_MMAP setting constraints
In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch),
the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or
-16384 in a 64 bit architecture. For information on these kernel versions
-see http://pusa.uv.es/~ulisses/packet_mmap/packet_mmap.pre-2.4.26_2.6.5.txt
+16384 in a 64 bit architecture.
Block size limit
----------------
@@ -264,20 +263,20 @@ the name indicates, this function allocates pages of memory, and the second
argument is "order" or a power of two number of pages, that is
(for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes,
order=2 ==> 16384 bytes, etc. The maximum size of a
-region allocated by __get_free_pages is determined by the MAX_ORDER macro. More
-precisely the limit can be calculated as::
+region allocated by __get_free_pages is determined by the MAX_PAGE_ORDER macro.
+More precisely the limit can be calculated as::
- PAGE_SIZE << MAX_ORDER
+ PAGE_SIZE << MAX_PAGE_ORDER
In a i386 architecture PAGE_SIZE is 4096 bytes
- In a 2.4/i386 kernel MAX_ORDER is 10
- In a 2.6/i386 kernel MAX_ORDER is 11
+ In a 2.4/i386 kernel MAX_PAGE_ORDER is 10
+ In a 2.6/i386 kernel MAX_PAGE_ORDER is 11
So get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel
respectively, with an i386 architecture.
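
The 4MB and 8MB figures follow directly from the shift expression above. A minimal Python sketch of the arithmetic, using the i386 values quoted in the text (the function name ``max_block_size`` is illustrative only):

```python
def max_block_size(page_size, max_page_order):
    """Largest contiguous region, in bytes: PAGE_SIZE << MAX_PAGE_ORDER."""
    return page_size << max_page_order


if __name__ == "__main__":
    mib = 1024 * 1024
    # Values quoted in the text for i386 (PAGE_SIZE == 4096).
    print(max_block_size(4096, 10) // mib, "MiB")  # 2.4/i386
    print(max_block_size(4096, 11) // mib, "MiB")  # 2.6/i386
```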
User space programs can include /usr/include/sys/user.h and
-/usr/include/linux/mmzone.h to get PAGE_SIZE MAX_ORDER declarations.
+/usr/include/linux/mmzone.h to get the PAGE_SIZE and MAX_PAGE_ORDER declarations.
The pagesize can also be determined dynamically with the getpagesize (2)
system call.
@@ -325,7 +324,7 @@ Definitions:
(see /proc/slabinfo)
<pointer size> depends on the architecture -- ``sizeof(void *)``
<page size> depends on the architecture -- PAGE_SIZE or getpagesize (2)
-<max-order> is the value defined with MAX_ORDER
+<max-order> is the value defined with MAX_PAGE_ORDER
<frame size> it's an upper bound of frame's capture size (more on this later)
============== ================================================================
@@ -437,7 +436,7 @@ and the following flags apply:
Capture process
^^^^^^^^^^^^^^^
- from include/linux/if_packet.h
+From include/linux/if_packet.h::
#define TP_STATUS_COPY (1 << 1)
#define TP_STATUS_LOSING (1 << 2)
@@ -756,7 +755,7 @@ AF_PACKET TPACKET_V3 example
============================
AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame
-sizes by doing it's own memory management. It is based on blocks where polling
+sizes by doing its own memory management. It is based on blocks where polling
works on a per block basis instead of per ring as in TPACKET_V2 and predecessor.
It is said that TPACKET_V3 brings the following benefits:
diff --git a/Documentation/networking/page_pool.rst b/Documentation/networking/page_pool.rst
index 43088ddf95e4..9d958128a57c 100644
--- a/Documentation/networking/page_pool.rst
+++ b/Documentation/networking/page_pool.rst
@@ -4,22 +4,8 @@
Page Pool API
=============
-The page_pool allocator is optimized for the XDP mode that uses one frame
-per-page, but it can fallback on the regular page allocator APIs.
-
-Basic use involves replacing alloc_pages() calls with the
-page_pool_alloc_pages() call. Drivers should use page_pool_dev_alloc_pages()
-replacing dev_alloc_pages().
-
-API keeps track of inflight pages, in order to let API user know
-when it is safe to free a page_pool object. Thus, API users
-must run page_pool_release_page() when a page is leaving the page_pool or
-call page_pool_put_page() where appropriate in order to maintain correct
-accounting.
-
-API user must call page_pool_put_page() once on a page, as it
-will either recycle the page, or in case of refcnt > 1, it will
-release the DMA mapping and inflight state accounting.
+.. kernel-doc:: include/net/page_pool/helpers.h
+ :doc: page_pool allocator
Architecture overview
=====================
@@ -55,6 +41,11 @@ Architecture overview
| Fast cache | | ptr-ring cache |
+-----------------+ +------------------+
+Monitoring
+==========
+Information about page pools on the system can be accessed via the netdev
+genetlink family (see Documentation/netlink/specs/netdev.yaml).
+
API interface
=============
The number of pools created **must** match the number of hardware queues
@@ -64,38 +55,71 @@ This lockless guarantee naturally comes from running under a NAPI softirq.
The protection doesn't strictly have to be NAPI, any guarantee that allocating
a page will cause no race conditions is enough.
-* page_pool_create(): Create a pool.
- * flags: PP_FLAG_DMA_MAP, PP_FLAG_DMA_SYNC_DEV
- * order: 2^order pages on allocation
- * pool_size: size of the ptr_ring
- * nid: preferred NUMA node for allocation
- * dev: struct device. Used on DMA operations
- * dma_dir: DMA direction
- * max_len: max DMA sync memory size
- * offset: DMA address offset
-
-* page_pool_put_page(): The outcome of this depends on the page refcnt. If the
- driver bumps the refcnt > 1 this will unmap the page. If the page refcnt is 1
- the allocator owns the page and will try to recycle it in one of the pool
- caches. If PP_FLAG_DMA_SYNC_DEV is set, the page will be synced for_device
- using dma_sync_single_range_for_device().
-
-* page_pool_put_full_page(): Similar to page_pool_put_page(), but will DMA sync
- for the entire memory area configured in area pool->max_len.
-
-* page_pool_recycle_direct(): Similar to page_pool_put_full_page() but caller
- must guarantee safe context (e.g NAPI), since it will recycle the page
- directly into the pool fast cache.
-
-* page_pool_release_page(): Unmap the page (if mapped) and account for it on
- inflight counters.
-
-* page_pool_dev_alloc_pages(): Get a page from the page allocator or page_pool
- caches.
-
-* page_pool_get_dma_addr(): Retrieve the stored DMA address.
-
-* page_pool_get_dma_dir(): Retrieve the stored DMA direction.
+.. kernel-doc:: net/core/page_pool.c
+ :identifiers: page_pool_create
+
+.. kernel-doc:: include/net/page_pool/types.h
+ :identifiers: struct page_pool_params
+
+.. kernel-doc:: include/net/page_pool/helpers.h
+ :identifiers: page_pool_put_page page_pool_put_full_page
+ page_pool_recycle_direct page_pool_free_va
+ page_pool_dev_alloc_pages page_pool_dev_alloc_frag
+ page_pool_dev_alloc page_pool_dev_alloc_va
+ page_pool_get_dma_addr page_pool_get_dma_dir
+
+.. kernel-doc:: net/core/page_pool.c
+ :identifiers: page_pool_put_page_bulk page_pool_get_stats
+
+DMA sync
+--------
+The driver is always responsible for syncing the pages for the CPU.
+Drivers may choose to take care of syncing for the device as well
+or set the ``PP_FLAG_DMA_SYNC_DEV`` flag to request that pages
+allocated from the page pool are already synced for the device.
+
+If ``PP_FLAG_DMA_SYNC_DEV`` is set, the driver must inform the core what portion
+of the buffer has to be synced. This allows the core to avoid syncing the entire
+page when the driver knows that the device only accessed a portion of the page.
+
+Most drivers will reserve headroom in front of the frame. This part
+of the buffer is not touched by the device, so to avoid syncing
+it drivers can set the ``offset`` field in struct page_pool_params
+appropriately.
+
+For pages recycled on the XDP xmit and skb paths the page pool will
+use the ``max_len`` member of struct page_pool_params to decide how
+much of the page needs to be synced (starting at ``offset``).
+When directly freeing pages in the driver (page_pool_put_page())
+the ``dma_sync_size`` argument specifies how much of the buffer needs
+to be synced.
+
+If in doubt set ``offset`` to 0, ``max_len`` to ``PAGE_SIZE`` and
+pass -1 as ``dma_sync_size``. That combination of arguments is always
+correct.
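
The interplay of ``offset``, ``max_len`` and ``dma_sync_size`` can be modelled in a few lines. This Python sketch only illustrates the behaviour described above and is not kernel code. The function name ``device_sync_range`` is hypothetical, and both the clamping to ``max_len`` and the reading of -1 as "sync everything up to ``max_len``" are assumptions:

```python
def device_sync_range(offset, max_len, dma_sync_size):
    """Return (start, length) of the region synced for the device.

    offset and max_len model the struct page_pool_params fields;
    dma_sync_size models the argument passed when freeing a page.
    """
    if dma_sync_size < 0:
        # Assumption: -1 means "no better information", so sync max_len.
        dma_sync_size = max_len
    # Assumption: never sync more than max_len bytes.
    length = min(dma_sync_size, max_len)
    return offset, length


if __name__ == "__main__":
    page_size = 4096
    # The always-correct combination recommended in the text.
    print(device_sync_range(0, page_size, -1))
    # Headroom of 256 bytes skipped; the device wrote a 1500-byte frame.
    print(device_sync_range(256, page_size - 256, 1500))
```

The second call shows the intended saving: only the part of the page that the device may have touched is synced, starting after the headroom.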
+
+Note that the syncing parameters are for the entire page.
+This is important to remember when using fragments (``PP_FLAG_PAGE_FRAG``),
+where allocated buffers may be smaller than a full page.
+Unless the driver author really understands page pool internals
+it's recommended to always use ``offset = 0``, ``max_len = PAGE_SIZE``
+with fragmented page pools.
+
+Stats API and structures
+------------------------
+If the kernel is configured with ``CONFIG_PAGE_POOL_STATS=y``, the API
+page_pool_get_stats() and structures described below are available.
+It takes a pointer to a ``struct page_pool`` and a pointer to a struct
+page_pool_stats allocated by the caller.
+
+Older drivers expose page pool statistics via ethtool or debugfs.
+The same statistics are accessible via the netlink netdev family
+in a driver-independent fashion.
+
+.. kernel-doc:: include/net/page_pool/types.h
+ :identifiers: struct page_pool_recycle_stats
+ struct page_pool_alloc_stats
+ struct page_pool_stats
Coding examples
===============
@@ -116,6 +140,7 @@ Registration
pp_params.pool_size = DESC_NUM;
pp_params.nid = NUMA_NO_NODE;
pp_params.dev = priv->dev;
+ pp_params.napi = napi; /* only if locking is tied to NAPI */
pp_params.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
page_pool = page_pool_create(&pp_params);
@@ -144,11 +169,26 @@ NAPI poller
if XDP_DROP:
page_pool_recycle_direct(page_pool, page);
} else (packet_is_skb) {
- page_pool_release_page(page_pool, page);
+ skb_mark_for_recycle(skb);
new_page = page_pool_dev_alloc_pages(page_pool);
}
}
+Stats
+-----
+
+.. code-block:: c
+
+ #ifdef CONFIG_PAGE_POOL_STATS
+ /* retrieve stats */
+ struct page_pool_stats stats = { 0 };
+ if (page_pool_get_stats(page_pool, &stats)) {
+ /* perhaps the driver reports statistics with ethtool */
+ ethtool_print_allocation_stats(&stats.alloc_stats);
+ ethtool_print_recycle_stats(&stats.recycle_stats);
+ }
+ #endif
+
Driver unload
-------------
diff --git a/Documentation/networking/phonet.rst b/Documentation/networking/phonet.rst
index 8668dcbc5e6a..d705cc5b09fc 100644
--- a/Documentation/networking/phonet.rst
+++ b/Documentation/networking/phonet.rst
@@ -131,7 +131,7 @@ Phonet resources, as follow::
Subscription is similarly cancelled using the SIOCPNDELRESOURCE I/O
control request, or when the socket is closed.
-Note that no more than one socket can be subcribed to any given
+Note that no more than one socket can be subscribed to any given
resource at a time. If not, ioctl() will return EBUSY.
diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst
index 256106054c8c..1283240d7620 100644
--- a/Documentation/networking/phy.rst
+++ b/Documentation/networking/phy.rst
@@ -80,8 +80,8 @@ values of phy_interface_t must be understood from the perspective of the PHY
device itself, leading to the following:
* PHY_INTERFACE_MODE_RGMII: the PHY is not responsible for inserting any
- internal delay by itself, it assumes that either the Ethernet MAC (if capable
- or the PCB traces) insert the correct 1.5-2ns delay
+ internal delay by itself, it assumes that either the Ethernet MAC (if capable)
+ or the PCB traces insert the correct 1.5-2ns delay
* PHY_INTERFACE_MODE_RGMII_TXID: the PHY should insert an internal delay
for the transmit data lines (TXD[3:0]) processed by the PHY device
@@ -104,7 +104,7 @@ Whenever possible, use the PHY side RGMII delay for these reasons:
* PHY device drivers in PHYLIB being reusable by nature, being able to
configure correctly a specified delay enables more designs with similar delay
- requirements to be operate correctly
+ requirements to be operated correctly
For cases where the PHY is not capable of providing this delay, but the
Ethernet MAC driver is capable of doing so, the correct phy_interface_t value
@@ -120,7 +120,7 @@ required delays, as defined per the RGMII standard, several options may be
available:
* Some SoCs may offer a pin pad/mux/controller capable of configuring a given
- set of pins'strength, delays, and voltage; and it may be a suitable
+ set of pins' strength, delays, and voltage; and it may be a suitable
option to insert the expected 2ns RGMII delay.
* Modifying the PCB design to include a fixed delay (e.g: using a specifically
@@ -216,7 +216,7 @@ put into an unsupported state.
Lastly, once the controller is ready to handle network traffic, you call
phy_start(phydev). This tells the PAL that you are ready, and configures the
PHY to connect to the network. If the MAC interrupt of your network driver
-also handles PHY status changes, just set phydev->irq to PHY_IGNORE_INTERRUPT
+also handles PHY status changes, just set phydev->irq to PHY_MAC_INTERRUPT
before you call phy_start and use phy_mac_interrupt() from the network
driver. If you don't want to use interrupts, set phydev->irq to PHY_POLL.
phy_start() enables the PHY interrupts (if applicable) and starts the
@@ -237,6 +237,11 @@ negotiation results.
Some of the interface modes are described below:
+``PHY_INTERFACE_MODE_SMII``
+ This is serial MII, clocked at 125MHz, supporting 100M and 10M speeds.
+ Some details can be found in
+ https://opencores.org/ocsvn/smii/smii/trunk/doc/SMII.pdf
+
``PHY_INTERFACE_MODE_1000BASEX``
This defines the 1000BASE-X single-lane serdes link as defined by the
802.3 standard section 36. The link operates at a fixed bit rate of
@@ -247,8 +252,8 @@ Some of the interface modes are described below:
speeds (see below.)
``PHY_INTERFACE_MODE_2500BASEX``
- This defines a variant of 1000BASE-X which is clocked 2.5 times faster,
- than the 802.3 standard giving a fixed bit rate of 3.125Gbaud.
+ This defines a variant of 1000BASE-X which is clocked 2.5 times as fast
+ as the 802.3 standard, giving a fixed bit rate of 3.125Gbaud.
``PHY_INTERFACE_MODE_SGMII``
This is used for Cisco SGMII, which is a modification of 1000BASE-X
@@ -267,6 +272,12 @@ Some of the interface modes are described below:
duplex, pause or other settings. This is dependent on the MAC and/or
PHY behaviour.
+``PHY_INTERFACE_MODE_5GBASER``
+ This is the IEEE 802.3 Clause 129 defined 5GBASE-R protocol. It is
+ identical to the 10GBASE-R protocol defined in Clause 49, with the
+ exception that it operates at half the frequency. Please refer to the
+ IEEE standard for the definition.
+
``PHY_INTERFACE_MODE_10GBASER``
This is the IEEE 802.3 Clause 49 defined 10GBASE-R protocol used with
various different mediums. Please refer to the IEEE standard for a
@@ -286,6 +297,36 @@ Some of the interface modes are described below:
Note: due to legacy usage, some 10GBASE-R usage incorrectly makes
use of this definition.
+``PHY_INTERFACE_MODE_25GBASER``
+ This is the IEEE 802.3 PCS Clause 107 defined 25GBASE-R protocol.
+ The PCS is identical to 10GBASE-R, i.e. 64B/66B encoded
+ running 2.5 times as fast, giving a fixed bit rate of 25.78125 Gbaud.
+ Please refer to the IEEE standard for further information.
+
+``PHY_INTERFACE_MODE_100BASEX``
+ This defines IEEE 802.3 Clause 24. The link operates at a fixed line
+ rate of 125Mbps using a 4B/5B encoding scheme, resulting in an underlying
+ data rate of 100Mbps.
+
+``PHY_INTERFACE_MODE_QUSGMII``
+ This defines the Cisco Quad USGMII mode, which is the Quad variant of
+ the USGMII (Universal SGMII) link. It's very similar to QSGMII, but uses
+ a Packet Control Header (PCH) instead of the 7-byte preamble to carry not
+ only the port id, but also so-called "extensions". The only documented
+ extension so far in the specification is the inclusion of timestamps, for
+ PTP-enabled PHYs. This mode isn't compatible with QSGMII, but offers the
+ same capabilities in terms of link speed and negotiation.
+
+``PHY_INTERFACE_MODE_1000BASEKX``
+ This is 1000BASE-X as defined by IEEE 802.3 Clause 36 with Clause 73
+ autonegotiation. Generally, it will be used with a Clause 70 PMD. In
+ contrast with the 1000BASE-X phy mode used for Clause 38 and 39 PMDs, this
+ interface mode has different autonegotiation and only supports full duplex.
+
+``PHY_INTERFACE_MODE_PSGMII``
+ This is the Penta SGMII mode. It is similar to QSGMII, but combines 5
+ SGMII lines into a single link, compared to 4 on QSGMII.
+
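The fixed rates quoted for the serdes modes above follow from the payload rate and the overhead of the line code. The sketch below is illustrative only: the helper name ``line_rate_mbaud`` is made up for this example and is not a PHYLIB function, and the 8B/10B ratio used for the 1000BASE-X family is the usual encoding of those links rather than something stated in the entries above.

```c
/*
 * Illustrative helper (hypothetical name, not part of PHYLIB).
 *
 * Returns the line rate in Mbaud needed to carry data_mbps of payload
 * when every data_bits payload bits are sent as code_bits bits on the
 * wire, e.g. 4B/5B -> (4, 5), 8B/10B -> (8, 10), 64B/66B -> (64, 66).
 */
static double line_rate_mbaud(double data_mbps, int data_bits, int code_bits)
{
	return data_mbps * code_bits / data_bits;
}
```

For example, ``line_rate_mbaud(100, 4, 5)`` gives the 125 figure quoted for 100BASE-X, ``line_rate_mbaud(2500, 8, 10)`` gives 3125 (3.125 Gbaud) for 2500BASE-X, and ``line_rate_mbaud(25000, 64, 66)`` gives 25781.25 (25.78125 Gbaud) for 25GBASE-R.
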
Pause frames / flow control
===========================
diff --git a/Documentation/networking/pktgen.rst b/Documentation/networking/pktgen.rst
index 7afa1c9f1183..c945218946e1 100644
--- a/Documentation/networking/pktgen.rst
+++ b/Documentation/networking/pktgen.rst
@@ -178,6 +178,7 @@ Examples::
IPSEC # IPsec encapsulation (needs CONFIG_XFRM)
NODE_ALLOC # node specific memory allocation
NO_TIMESTAMP # disable timestamping
+ SHARED # enable shared SKB
pgset 'flag ![name]' Clear a flag to determine behaviour.
Note that you might need to use single quote in
interactive mode, so that your shell wouldn't expand
@@ -248,26 +249,24 @@ Usage:::
-i : ($DEV) output interface/device (required)
-s : ($PKT_SIZE) packet size
- -d : ($DEST_IP) destination IP
+ -d : ($DEST_IP) destination IP. CIDR (e.g. 198.18.0.0/15) is also allowed
-m : ($DST_MAC) destination MAC-addr
+ -p : ($DST_PORT) destination PORT. A range (e.g. 433-444) is also allowed
-t : ($THREADS) threads to start
+ -f : ($F_THREAD) index of first thread (zero indexed CPU number)
-c : ($SKB_CLONE) SKB clones send before alloc new SKB
+ -n : ($COUNT) num messages to send per thread, 0 means indefinitely
-b : ($BURST) HW level bursting of SKBs
-v : ($VERBOSE) verbose
-x : ($DEBUG) debug
+ -6 : ($IP6) IPv6
+ -w : ($DELAY) Tx Delay value (ns)
+ -a : ($APPEND) Script will not reset generator's state, but will append its config
The global variables being set are also listed. E.g. the required
interface/device parameter "-i" sets variable $DEV. Copy the
pktgen_sampleXX scripts and modify them to fit your own needs.
-The old scripts::
-
- pktgen.conf-1-2 # 1 CPU 2 dev
- pktgen.conf-1-1-rdos # 1 CPU 1 dev w. route DoS
- pktgen.conf-1-1-ip6 # 1 CPU 1 dev ipv6
- pktgen.conf-1-1-ip6-rdos # 1 CPU 1 dev ipv6 w. route DoS
- pktgen.conf-1-1-flows # 1 CPU 1 dev multiple flows.
-
Interrupt affinity
===================
@@ -290,6 +289,16 @@ To avoid breaking existing testbed scripts for using AH type and tunnel mode,
you can use "pgset spi SPI_VALUE" to specify which transformation mode
to employ.
+Disable shared SKB
+==================
+By default, SKBs sent by pktgen are shared (user count > 1).
+To test with non-shared SKBs, remove the "SHARED" flag by simply setting::
+
+ pg_set "flag !SHARED"
+
+However, if the "clone_skb" or "burst" parameters are configured, the skb
+still needs to be held by pktgen for further access. Hence the skb must be
+shared.
Current commands and configuration options
==========================================
@@ -359,6 +368,7 @@ Current commands and configuration options
IPSEC
NODE_ALLOC
NO_TIMESTAMP
+ SHARED
spi (ipsec)
@@ -398,7 +408,7 @@ Current commands and configuration options
References:
- ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/
-- tp://robur.slu.se/pub/Linux/net-development/pktgen-testing/examples/
+- ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/examples/
Paper from Linux-Kongress in Erlangen 2004.
- ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/pktgen_paper.pdf
diff --git a/Documentation/networking/ppp_generic.rst b/Documentation/networking/ppp_generic.rst
index e60504377900..5a10abce5964 100644
--- a/Documentation/networking/ppp_generic.rst
+++ b/Documentation/networking/ppp_generic.rst
@@ -314,6 +314,22 @@ channel are:
it is connected to. It will return an EINVAL error if the channel
is not connected to an interface.
+* PPPIOCBRIDGECHAN bridges a channel with another. The argument should
+ point to an int containing the channel number of the channel to bridge
+ to. Once two channels are bridged, frames presented to one channel by
+ ppp_input() are passed to the bridge instance for onward transmission.
+ This allows frames to be switched from one channel into another: for
+ example, to pass PPPoE frames into a PPPoL2TP session. Since channel
+ bridging interrupts the normal ppp_input() path, a given channel may
+ not be part of a bridge at the same time as being part of a unit.
+ This ioctl will return an EALREADY error if the channel is already
+ part of a bridge or unit, or ENXIO if the requested channel does not
+ exist.
+
+* PPPIOCUNBRIDGECHAN performs the inverse of PPPIOCBRIDGECHAN, unbridging
+ a channel pair. This ioctl will return an EINVAL error if the channel
+ does not form part of a bridge.
+
* All other ioctl commands are passed to the channel ioctl() function.
The ioctl calls that are available on an instance that is attached to
diff --git a/Documentation/networking/ray_cs.rst b/Documentation/networking/ray_cs.rst
deleted file mode 100644
index 9a46d1ae8f20..000000000000
--- a/Documentation/networking/ray_cs.rst
+++ /dev/null
@@ -1,165 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-.. include:: <isonum.txt>
-
-=========================
-Raylink wireless LAN card
-=========================
-
-September 21, 1999
-
-Copyright |copy| 1998 Corey Thomas (corey@world.std.com)
-
-This file is the documentation for the Raylink Wireless LAN card driver for
-Linux. The Raylink wireless LAN card is a PCMCIA card which provides IEEE
-802.11 compatible wireless network connectivity at 1 and 2 megabits/second.
-See http://www.raytheon.com/micro/raylink/ for more information on the Raylink
-card. This driver is in early development and does have bugs. See the known
-bugs and limitations at the end of this document for more information.
-This driver also works with WebGear's Aviator 2.4 and Aviator Pro
-wireless LAN cards.
-
-As of kernel 2.3.18, the ray_cs driver is part of the Linux kernel
-source. My web page for the development of ray_cs is at
-http://web.ralinktech.com/ralink/Home/Support/Linux.html
-and I can be emailed at corey@world.std.com
-
-The kernel driver is based on ray_cs-1.62.tgz
-
-The driver at my web page is intended to be used as an add on to
-David Hinds pcmcia package. All the command line parameters are
-available when compiled as a module. When built into the kernel, only
-the essid= string parameter is available via the kernel command line.
-This will change after the method of sorting out parameters for all
-the PCMCIA drivers is agreed upon. If you must have a built in driver
-with nondefault parameters, they can be edited in
-/usr/src/linux/drivers/net/pcmcia/ray_cs.c. Searching for module_param
-will find them all.
-
-Information on card services is available at:
-
- http://pcmcia-cs.sourceforge.net/
-
-
-Card services user programs are still required for PCMCIA devices.
-pcmcia-cs-3.1.1 or greater is required for the kernel version of
-the driver.
-
-Currently, ray_cs is not part of David Hinds card services package,
-so the following magic is required.
-
-At the end of the /etc/pcmcia/config.opts file, add the line:
-source ./ray_cs.opts
-This will make card services read the ray_cs.opts file
-when starting. Create the file /etc/pcmcia/ray_cs.opts containing the
-following::
-
- #### start of /etc/pcmcia/ray_cs.opts ###################
- # Configuration options for Raylink Wireless LAN PCMCIA card
- device "ray_cs"
- class "network" module "misc/ray_cs"
-
- card "RayLink PC Card WLAN Adapter"
- manfid 0x01a6, 0x0000
- bind "ray_cs"
-
- module "misc/ray_cs" opts ""
- #### end of /etc/pcmcia/ray_cs.opts #####################
-
-
-To join an existing network with
-different parameters, contact the network administrator for the
-configuration information, and edit /etc/pcmcia/ray_cs.opts.
-Add the parameters below between the empty quotes.
-
-Parameters for ray_cs driver which may be specified in ray_cs.opts:
-
-=============== =============== =============================================
-bc integer 0 = normal mode (802.11 timing),
- 1 = slow down inter frame timing to allow
- operation with older breezecom access
- points.
-
-beacon_period integer beacon period in Kilo-microseconds,
-
- legal values = must be integer multiple
- of hop dwell
-
- default = 256
-
-country integer 1 = USA (default),
- 2 = Europe,
- 3 = Japan,
- 4 = Korea,
- 5 = Spain,
- 6 = France,
- 7 = Israel,
- 8 = Australia
-
-essid string ESS ID - network name to join
-
- string with maximum length of 32 chars
- default value = "ADHOC_ESSID"
-
-hop_dwell integer hop dwell time in Kilo-microseconds
-
- legal values = 16,32,64,128(default),256
-
-irq_mask integer linux standard 16 bit value 1bit/IRQ
-
- lsb is IRQ 0, bit 1 is IRQ 1 etc.
- Used to restrict choice of IRQ's to use.
- Recommended method for controlling
- interrupts is in /etc/pcmcia/config.opts
-
-net_type integer 0 (default) = adhoc network,
- 1 = infrastructure
-
-phy_addr string string containing new MAC address in
- hex, must start with x eg
- x00008f123456
-
-psm integer 0 = continuously active,
- 1 = power save mode (not useful yet)
-
-pc_debug integer (0-5) larger values for more verbose
- logging. Replaces ray_debug.
-
-ray_debug integer Replaced with pc_debug
-
-ray_mem_speed integer defaults to 500
-
-sniffer integer 0 = not sniffer (default),
- 1 = sniffer which can be used to record all
- network traffic using tcpdump or similar,
- but no normal network use is allowed.
-
-translate integer 0 = no translation (encapsulate frames),
- 1 = translation (RFC1042/802.1)
-=============== =============== =============================================
-
-More on sniffer mode:
-
-tcpdump does not understand 802.11 headers, so it can't
-interpret the contents, but it can record to a file. This is only
-useful for debugging 802.11 lowlevel protocols that are not visible to
-linux. If you want to watch ftp xfers, or do similar things, you
-don't need to use sniffer mode. Also, some packet types are never
-sent up by the card, so you will never see them (ack, rts, cts, probe
-etc.) There is a simple program (showcap) included in the ray_cs
-package which parses the 802.11 headers.
-
-Known Problems and missing features
-
- Does not work with non x86
-
- Does not work with SMP
-
- Support for defragmenting frames is not yet debugged, and in
- fact is known to not work. I have never encountered a net set
- up to fragment, but still, it should be fixed.
-
- The ioctl support is incomplete. The hardware address cannot be set
- using ifconfig yet. If a different hardware address is needed, it may
- be set using the phy_addr parameter in ray_cs.opts. This requires
- a card insertion to take effect.
diff --git a/Documentation/networking/rds.rst b/Documentation/networking/rds.rst
index 44936c27ab3a..498395f5fbcb 100644
--- a/Documentation/networking/rds.rst
+++ b/Documentation/networking/rds.rst
@@ -1,6 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0
-==
+===
RDS
===
diff --git a/Documentation/networking/regulatory.rst b/Documentation/networking/regulatory.rst
index 16782a95b74a..5e42c8a175c3 100644
--- a/Documentation/networking/regulatory.rst
+++ b/Documentation/networking/regulatory.rst
@@ -66,7 +66,7 @@ An example::
iw reg set CR
This will request the kernel to set the regulatory domain to
-the specificied alpha2. The kernel in turn will then ask userspace
+the specified alpha2. The kernel in turn will then ask userspace
to provide a regulatory domain for the alpha2 specified by the user
by sending a uevent.
@@ -158,7 +158,7 @@ kmalloc() a structure big enough to hold your regulatory domain
structure and you should then fill it with your data. Finally you simply
call regulatory_hint() with the regulatory domain structure in it.
-Bellow is a simple example, with a regulatory domain cached using the stack.
+Below is a simple example, with a regulatory domain cached using the stack.
Your implementation may vary (read EEPROM cache instead, for example).
Example cache of some regulatory domain::
diff --git a/Documentation/networking/representors.rst b/Documentation/networking/representors.rst
new file mode 100644
index 000000000000..decb39c19b9e
--- /dev/null
+++ b/Documentation/networking/representors.rst
@@ -0,0 +1,261 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================
+Network Function Representors
+=============================
+
+This document describes the semantics and usage of representor netdevices, as
+used to control internal switching on SmartNICs. For the closely-related port
+representors on physical (multi-port) switches, see
+:ref:`Documentation/networking/switchdev.rst <switchdev>`.
+
+Motivation
+----------
+
+Since the mid-2010s, network cards have started offering more complex
+virtualisation capabilities than the legacy SR-IOV approach (with its simple
+MAC/VLAN-based switching model) can support. This led to a desire to offload
+software-defined networks (such as OpenVSwitch) to these NICs to specify the
+network connectivity of each function. The resulting designs are variously
+called SmartNICs or DPUs.
+
+Network function representors bring the standard Linux networking stack to
+virtual switches and IOV devices. Just as each physical port of a Linux-
+controlled switch has a separate netdev, so does each virtual port of a virtual
+switch.
+When the system boots, and before any offload is configured, all packets from
+the virtual functions appear in the networking stack of the PF via the
+representors. The PF can thus always communicate freely with the virtual
+functions.
+The PF can configure standard Linux forwarding between representors, the uplink
+or any other netdev (routing, bridging, TC classifiers).
+
+Thus, a representor is both a control plane object (representing the function in
+administrative commands) and a data plane object (one end of a virtual pipe).
+As a virtual link endpoint, the representor can be configured like any other
+netdevice; in some cases (e.g. link state) the representee will follow the
+representor's configuration, while in others there are separate APIs to
+configure the representee.
+
+Definitions
+-----------
+
+This document uses the term "switchdev function" to refer to the PCIe function
+which has administrative control over the virtual switch on the device.
+Typically, this will be a PF, but conceivably a NIC could be configured to grant
+these administrative privileges instead to a VF or SF (subfunction).
+Depending on NIC design, a multi-port NIC might have a single switchdev function
+for the whole device or might have a separate virtual switch, and hence
+switchdev function, for each physical network port.
+If the NIC supports nested switching, there might be separate switchdev
+functions for each nested switch, in which case each switchdev function should
+only create representors for the ports on the (sub-)switch it directly
+administers.
+
+A "representee" is the object that a representor represents. So for example in
+the case of a VF representor, the representee is the corresponding VF.
+
+What does a representor do?
+---------------------------
+
+A representor has three main roles.
+
+1. It is used to configure the network connection the representee sees, e.g.
+ link up/down, MTU, etc. For instance, bringing the representor
+ administratively UP should cause the representee to see a link up / carrier
+ on event.
+2. It provides the slow path for traffic which does not hit any offloaded
+ fast-path rules in the virtual switch. Packets transmitted on the
+ representor netdevice should be delivered to the representee; packets
+ transmitted by the representee which fail to match any switching rule should
+ be received on the representor netdevice. (That is, there is a virtual pipe
+ connecting the representor to the representee, similar in concept to a veth
+ pair.)
+ This allows software switch implementations (such as OpenVSwitch or a Linux
+ bridge) to forward packets between representees and the rest of the network.
+3. It acts as a handle by which switching rules (such as TC filters) can refer
+ to the representee, allowing these rules to be offloaded.
+
+The combination of 2) and 3) means that the behaviour (apart from performance)
+should be the same whether a TC filter is offloaded or not. E.g. a TC rule
+on a VF representor applies in software to packets received on that representor
+netdevice, while in hardware offload it would apply to packets transmitted by
+the representee VF. Conversely, a mirred egress redirect to a VF representor
+corresponds in hardware to delivery directly to the representee VF.
+
+What functions should have a representor?
+-----------------------------------------
+
+Essentially, for each virtual port on the device's internal switch, there
+should be a representor.
+Some vendors have chosen to omit representors for the uplink and the physical
+network port, which can simplify usage (the uplink netdev becomes in effect the
+physical port's representor) but does not generalise to devices with multiple
+ports or uplinks.
+
+Thus, the following should all have representors:
+
+ - VFs belonging to the switchdev function.
+ - Other PFs on the local PCIe controller, and any VFs belonging to them.
+ - PFs and VFs on external PCIe controllers on the device (e.g. for any embedded
+ System-on-Chip within the SmartNIC).
+ - PFs and VFs with other personalities, including network block devices (such
+ as a vDPA virtio-blk PF backed by remote/distributed storage), if (and only
+ if) their network access is implemented through a virtual switch port. [#]_
+ Note that such functions can require a representor despite the representee
+ not having a netdev.
+ - Subfunctions (SFs) belonging to any of the above PFs or VFs, if they have
+ their own port on the switch (as opposed to using their parent PF's port).
+ - Any accelerators or plugins on the device whose interface to the network is
+ through a virtual switch port, even if they do not have a corresponding PCIe
+ PF or VF.
+
+This allows the entire switching behaviour of the NIC to be controlled through
+representor TC rules.
+
+It is a common misunderstanding to conflate virtual ports with PCIe virtual
+functions or their netdevs. While in simple cases there will be a 1:1
+correspondence between VF netdevices and VF representors, more advanced device
+configurations may not follow this.
+A PCIe function which does not have network access through the internal switch
+(not even indirectly through the hardware implementation of whatever services
+the function provides) should *not* have a representor (even if it has a
+netdev).
+Such a function has no switch virtual port for the representor to configure or
+to be the other end of the virtual pipe.
+The representor represents the virtual port, not the PCIe function nor the 'end
+user' netdevice.
+
+.. [#] The concept here is that a hardware IP stack in the device performs the
+ translation between block DMA requests and network packets, so that only
+ network packets pass through the virtual port onto the switch. The network
+ access that the IP stack "sees" would then be configurable through tc rules;
+ e.g. its traffic might all be wrapped in a specific VLAN or VxLAN. However,
+ any needed configuration of the block device *qua* block device, not being a
+ networking entity, would not be appropriate for the representor and would
+ thus use some other channel such as devlink.
+ Contrast this with the case of a virtio-blk implementation which forwards the
+ DMA requests unchanged to another PF whose driver then initiates and
+ terminates IP traffic in software; in that case the DMA traffic would *not*
+ run over the virtual switch and the virtio-blk PF should thus *not* have a
+ representor.
+
+How are representors created?
+-----------------------------
+
+The driver instance attached to the switchdev function should, for each virtual
+port on the switch, create a pure-software netdevice which has some form of
+in-kernel reference to the switchdev function's own netdevice or driver private
+data (``netdev_priv()``).
+This may be by enumerating ports at probe time, reacting dynamically to the
+creation and destruction of ports at run time, or a combination of the two.
+
+The operations of the representor netdevice will generally involve acting
+through the switchdev function. For example, ``ndo_start_xmit()`` might send
+the packet through a hardware TX queue attached to the switchdev function, with
+either packet metadata or queue configuration marking it for delivery to the
+representee.
+
+How are representors identified?
+--------------------------------
+
+The representor netdevice should *not* directly refer to a PCIe device (e.g.
+through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of the
+representee or of the switchdev function.
+Instead, the driver should use the ``SET_NETDEV_DEVLINK_PORT`` macro to
+assign a devlink port instance to the netdevice before registering the
+netdevice; the kernel uses the devlink port to provide the ``phys_switch_id``
+and ``phys_port_name`` sysfs nodes.
+(Some legacy drivers implement ``ndo_get_port_parent_id()`` and
+``ndo_get_phys_port_name()`` directly, but this is deprecated.) See
+:ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>` for the
+details of this API.
+
+It is expected that userland will use this information (e.g. through udev rules)
+to construct an appropriately informative name or alias for the netdevice. For
+instance if the switchdev function is ``eth4`` then a representor with a
+``phys_port_name`` of ``p0pf1vf2`` might be renamed ``eth4pf1vf2rep``.
+
+There are as yet no established conventions for naming representors which do not
+correspond to PCIe functions (e.g. accelerators and plugins).
+
+How do representors interact with TC rules?
+-------------------------------------------
+
+Any TC rule on a representor applies (in software TC) to packets received by
+that representor netdevice. Thus, if the delivery part of the rule corresponds
+to another port on the virtual switch, the driver may choose to offload it to
+hardware, applying it to packets transmitted by the representee.
+
+Similarly, since a TC mirred egress action targeting the representor would (in
+software) send the packet through the representor (and thus indirectly deliver
+it to the representee), hardware offload should interpret this as delivery to
+the representee.
+
+As a simple example, if ``PORT_DEV`` is the physical port representor and
+``REP_DEV`` is a VF representor, the following rules::
+
+ tc filter add dev $REP_DEV parent ffff: protocol ipv4 flower \
+ action mirred egress redirect dev $PORT_DEV
+ tc filter add dev $PORT_DEV parent ffff: protocol ipv4 flower skip_sw \
+ action mirred egress mirror dev $REP_DEV
+
+would mean that all IPv4 packets from the VF are sent out the physical port, and
+all IPv4 packets received on the physical port are delivered to the VF in
+addition to ``PORT_DEV``. (Note that without ``skip_sw`` on the second rule,
+the VF would get two copies, as the packet reception on ``PORT_DEV`` would
+trigger the TC rule again and mirror the packet to ``REP_DEV``.)
+
+On devices without separate port and uplink representors, ``PORT_DEV`` would
+instead be the switchdev function's own uplink netdevice.
+
+Of course the rules can (if supported by the NIC) include packet-modifying
+actions (e.g. VLAN push/pop), which should be performed by the virtual switch.
+
+Tunnel encapsulation and decapsulation are rather more complicated, as they
+involve a third netdevice (a tunnel netdev operating in metadata mode, such as
+a VxLAN device created with ``ip link add vxlan0 type vxlan external``) and
+require an IP address to be bound to the underlay device (e.g. switchdev
+function uplink netdev or port representor). TC rules such as::
+
+ tc filter add dev $REP_DEV parent ffff: flower \
+ action tunnel_key set id $VNI src_ip $LOCAL_IP dst_ip $REMOTE_IP \
+ dst_port 4789 \
+ action mirred egress redirect dev vxlan0
+ tc filter add dev vxlan0 parent ffff: flower enc_src_ip $REMOTE_IP \
+ enc_dst_ip $LOCAL_IP enc_key_id $VNI enc_dst_port 4789 \
+ action tunnel_key unset action mirred egress redirect dev $REP_DEV
+
+where ``LOCAL_IP`` is an IP address bound to ``PORT_DEV``, and ``REMOTE_IP`` is
+another IP address on the same subnet, mean that packets sent by the VF should
+be VxLAN encapsulated and sent out the physical port (the driver has to deduce
+this by a route lookup of ``LOCAL_IP`` leading to ``PORT_DEV``, and also
+perform an ARP/neighbour table lookup to find the MAC addresses to use in the
+outer Ethernet frame), while UDP packets received on the physical port with UDP
+port 4789 should be parsed as VxLAN and, if their VNI matches ``$VNI``,
+decapsulated and forwarded to the VF.
+
+If this all seems complicated, just remember the 'golden rule' of TC offload:
+the hardware should ensure the same final results as if the packets were
+processed through the slow path, traversed software TC (except ignoring any
+``skip_hw`` rules and applying any ``skip_sw`` rules) and were transmitted or
+received through the representor netdevices.
+
+Configuring the representee's MAC
+---------------------------------
+
+The representee's link state is controlled through the representor. Setting the
+representor administratively UP or DOWN should cause carrier ON or OFF at the
+representee.
+
+Setting an MTU on the representor should cause that same MTU to be reported to
+the representee.
+(On hardware that allows configuring separate and distinct MTU and MRU values,
+the representor MTU should correspond to the representee's MRU and vice-versa.)
+
+Currently there is no way to use the representor to set the station permanent
+MAC address of the representee; other methods available to do this include:
+
+ - legacy SR-IOV (``ip link set DEVICE vf NUM mac LLADDR``)
+ - devlink port function (see **devlink-port(8)** and
+ :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`)
diff --git a/Documentation/networking/rxrpc.rst b/Documentation/networking/rxrpc.rst
index 39c2249c7aa7..e807e18ba32a 100644
--- a/Documentation/networking/rxrpc.rst
+++ b/Documentation/networking/rxrpc.rst
@@ -848,14 +848,21 @@ The kernel interface functions are as follows:
returned. The caller now holds a reference on this and it must be
properly ended.
- (#) End a client call::
+ (#) Shut down a client call::
- void rxrpc_kernel_end_call(struct socket *sock,
+ void rxrpc_kernel_shutdown_call(struct socket *sock,
+ struct rxrpc_call *call);
+
+ This is used to shut down a previously begun call. The user_call_ID is
+ expunged from AF_RXRPC's knowledge and will not be seen again in
+ association with the specified call.
+
+ (#) Release the ref on a client call::
+
+ void rxrpc_kernel_put_call(struct socket *sock,
struct rxrpc_call *call);
- This is used to end a previously begun call. The user_call_ID is expunged
- from AF_RXRPC's knowledge and will not be seen again in association with
- the specified call.
+ This is used to release the caller's ref on an rxrpc call.
(#) Send data through a call::
@@ -880,8 +887,8 @@ The kernel interface functions are as follows:
notify_end_rx can be NULL or it can be used to specify a function to be
called when the call changes state to end the Tx phase. This function is
- called with the call-state spinlock held to prevent any reply or final ACK
- from being delivered first.
+ called with a spinlock held to prevent the last DATA packet from being
+ transmitted until the function returns.
(#) Receive data from a call::
@@ -1055,17 +1062,6 @@ The kernel interface functions are as follows:
first function to change. Note that this must be called in TASK_RUNNING
state.
- (#) Get reply timestamp::
-
- bool rxrpc_kernel_get_reply_time(struct socket *sock,
- struct rxrpc_call *call,
- ktime_t *_ts)
-
- This allows the timestamp on the first DATA packet of the reply of a
- client call to be queried, provided that it is still in the Rx ring. If
- successful, the timestamp will be stored into ``*_ts`` and true will be
- returned; false will be returned otherwise.
-
(#) Get remote client epoch::
u32 rxrpc_kernel_get_epoch(struct socket *sock,
@@ -1080,7 +1076,7 @@ The kernel interface functions are as follows:
This value can be used to determine if the remote client has been
restarted as it shouldn't change otherwise.
- (#) Set the maxmimum lifespan on a call::
+ (#) Set the maximum lifespan on a call::
void rxrpc_kernel_set_max_life(struct socket *sock,
struct rxrpc_call *call,
diff --git a/Documentation/networking/scaling.rst b/Documentation/networking/scaling.rst
index 8f0347b9fb3d..4eb50bcb9d42 100644
--- a/Documentation/networking/scaling.rst
+++ b/Documentation/networking/scaling.rst
@@ -44,6 +44,21 @@ by masking out the low order seven bits of the computed hash for the
packet (usually a Toeplitz hash), taking this number as a key into the
indirection table and reading the corresponding value.
+Some NICs support symmetric RSS hashing where, if the IP (source address,
+destination address) and TCP/UDP (source port, destination port) tuples
+are swapped, the computed hash is the same. This is beneficial in some
+applications that monitor TCP/IP flows (IDS, firewalls, etc.) and need
+both directions of the flow to land on the same Rx queue (and CPU).
+"Symmetric-XOR" is a type of RSS algorithm that achieves this hash
+symmetry by XORing the input source and destination fields of the IP
+and/or L4 protocols. This, however, results in reduced input entropy and
+could potentially be exploited. Specifically, the algorithm XORs the input
+as follows::
+
+ # (SRC_IP ^ DST_IP, SRC_IP ^ DST_IP, SRC_PORT ^ DST_PORT, SRC_PORT ^ DST_PORT)
+
+The result is then fed to the underlying RSS algorithm.
+
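The XOR step above can be illustrated with a short user-space sketch. This is
not a driver or NIC implementation: the key ``DEMO_KEY`` is an arbitrary
placeholder, the helper names are hypothetical, and only the IPv4 case is
shown. It assumes the underlying RSS algorithm is a Toeplitz hash, as
mentioned earlier in this document.

```python
import ipaddress
import struct

# Arbitrary 40-byte key for illustration only (assumption); a real NIC uses
# the RSS key configured by the driver or by the administrator.
DEMO_KEY = bytes(range(1, 41))


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Return the 32-bit Toeplitz hash of data under key."""
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    if key_bits < len(data) * 8 + 32:
        raise ValueError("key too short for this input")
    result = 0
    for i in range(len(data) * 8):
        if data[i // 8] & (0x80 >> (i % 8)):
            # XOR in the 32-bit window of the key that starts at bit i.
            result ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return result


def symmetric_xor_input(src_ip: str, dst_ip: str,
                        src_port: int, dst_port: int) -> bytes:
    """Build (SRC^DST, SRC^DST, SPORT^DPORT, SPORT^DPORT) for IPv4."""
    ip_xor = int(ipaddress.IPv4Address(src_ip)) ^ int(ipaddress.IPv4Address(dst_ip))
    port_xor = src_port ^ dst_port
    return struct.pack("!IIHH", ip_xor, ip_xor, port_xor, port_xor)


def symmetric_rss_hash(src_ip: str, dst_ip: str,
                       src_port: int, dst_port: int,
                       key: bytes = DEMO_KEY) -> int:
    """Feed the XORed tuple to the underlying hash."""
    return toeplitz_hash(key, symmetric_xor_input(src_ip, dst_ip, src_port, dst_port))


forward = symmetric_rss_hash("10.0.0.1", "10.0.0.2", 40000, 443)
reverse = symmetric_rss_hash("10.0.0.2", "10.0.0.1", 443, 40000)
# Both directions of the flow hash to the same value.
print(forward == reverse)

# Reduced entropy: a different flow whose addresses XOR to the same value
# produces the same hash input, and therefore the same hash.
other = symmetric_rss_hash("10.0.0.5", "10.0.0.6", 40000, 443)
print(forward == other)
```

The second comparison shows the entropy loss mentioned above: any two flows
whose source and destination fields XOR to the same values are
indistinguishable to the hash.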
Some advanced NICs allow steering packets to queues based on
programmable filters. For example, webserver bound TCP port 80 packets
can be directed to their own receive queue. Such “n-tuple” filters can
@@ -105,6 +120,48 @@ a separate CPU. For interrupt handling, HT has shown no benefit in
initial tests, so limit the number of queues to the number of CPU cores
in the system.
+Dedicated RSS contexts
+~~~~~~~~~~~~~~~~~~~~~~
+
+Modern NICs support creating multiple co-existing RSS configurations
+which are selected based on explicit matching rules. This can be very
+useful when an application wants to constrain the set of queues receiving
+traffic for e.g. a particular destination port or IP address.
+The example below shows how to direct all traffic to TCP port 22
+to queues 0 and 1.
+
+To create an additional RSS context use::
+
+ # ethtool -X eth0 hfunc toeplitz context new
+ New RSS context is 1
+
+The kernel reports back the ID of the allocated context (the default, always
+present RSS context has ID of 0). The new context can be queried and
+modified using the same APIs as the default context::
+
+ # ethtool -x eth0 context 1
+ RX flow hash indirection table for eth0 with 13 RX ring(s):
+ 0: 0 1 2 3 4 5 6 7
+ 8: 8 9 10 11 12 0 1 2
+ [...]
+ # ethtool -X eth0 equal 2 context 1
+ # ethtool -x eth0 context 1
+ RX flow hash indirection table for eth0 with 13 RX ring(s):
+ 0: 0 1 0 1 0 1 0 1
+ 8: 0 1 0 1 0 1 0 1
+ [...]
+
+To make use of the new context, direct traffic to it using an n-tuple
+filter::
+
+ # ethtool -N eth0 flow-type tcp6 dst-port 22 context 1
+ Added rule with ID 1023
+
+When done, remove the context and the rule::
+
+ # ethtool -N eth0 delete 1023
+ # ethtool -X eth0 context 1 delete
+
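The indirection table contents shown in the example output can be reproduced
with a small sketch. It only mirrors the pattern visible above (entries cycle
through the first N queues); the 128-entry table size and the helper names are
assumptions for illustration, and real table sizes are device dependent.

```python
from typing import List


def equal_indirection_table(num_queues: int, table_size: int = 128) -> List[int]:
    """Spread table entries evenly over queues 0..num_queues-1."""
    if num_queues < 1:
        raise ValueError("need at least one queue")
    return [i % num_queues for i in range(table_size)]


def format_table(table: List[int], per_row: int = 8) -> str:
    """Render rows in a layout similar to the ``ethtool -x`` output."""
    rows = []
    for start in range(0, len(table), per_row):
        cells = " ".join(str(q) for q in table[start:start + per_row])
        rows.append(f"{start}: {cells}")
    return "\n".join(rows)


# Default-like layout for 13 rings, then the layout after "equal 2".
print(format_table(equal_indirection_table(13)[:16]))
print(format_table(equal_indirection_table(2)[:16]))
```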
RPS: Receive Packet Steering
============================
@@ -269,8 +326,8 @@ a single application thread handles flows with many different flow hashes.
rps_sock_flow_table is a global flow table that contains the *desired* CPU
for flows: the CPU that is currently processing the flow in userspace.
Each table value is a CPU index that is updated during calls to recvmsg
-and sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage()
-and tcp_splice_read()).
+and sendmsg (specifically, inet_recvmsg(), inet_sendmsg() and
+tcp_splice_read()).
When the scheduler moves a thread to a new CPU while it has outstanding
receive packets on the old CPU, packets may arrive out of order. To
@@ -465,9 +522,9 @@ XPS Configuration
-----------------
XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
-default for SMP). The functionality remains disabled until explicitly
-configured. To enable XPS, the bitmap of CPUs/receive-queues that may
-use a transmit queue is configured using the sysfs file entry:
+default for SMP). If compiled in, it is driver dependent whether, and
+how, XPS is configured at device init. The mapping of CPUs/receive-queues
+to transmit queue can be inspected and configured using sysfs:
For selection based on CPUs map::
diff --git a/Documentation/networking/seg6-sysctl.rst b/Documentation/networking/seg6-sysctl.rst
index ec73e1445030..07c20e470baf 100644
--- a/Documentation/networking/seg6-sysctl.rst
+++ b/Documentation/networking/seg6-sysctl.rst
@@ -24,3 +24,16 @@ seg6_require_hmac - INTEGER
* 1 - Drop SR packets without HMAC, validate SR packets with HMAC
Default is 0.
+
+seg6_flowlabel - INTEGER
+ Controls the behaviour of computing the flowlabel of outer
+ IPv6 header in case of SR T.encaps
+
+ == =======================================================
+ -1 set flowlabel to zero.
+ 0 copy flowlabel from Inner packet in case of Inner IPv6
+ (Set flowlabel to 0 in case IPv4/L2)
+ 1 Compute the flowlabel using seg6_make_flowlabel()
+ == =======================================================
+
+ Default is 0.
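The table above can be read as the following decision sketch. It is a
user-space illustration only: ``outer_flowlabel`` and its parameters are
hypothetical names, and the ``compute`` callable merely stands in for
seg6_make_flowlabel(), whose internals are not modelled here.

```python
from typing import Callable, Optional

FLOWLABEL_MASK = 0xFFFFF  # the IPv6 flow label is a 20-bit field


def outer_flowlabel(mode: int,
                    inner_ipv6_flowlabel: Optional[int],
                    compute: Callable[[], int]) -> int:
    """Pick the outer flow label for a given seg6_flowlabel setting.

    inner_ipv6_flowlabel is None when the inner packet is IPv4 or L2.
    """
    if mode == -1:
        return 0
    if mode == 0:
        if inner_ipv6_flowlabel is None:
            return 0
        return inner_ipv6_flowlabel & FLOWLABEL_MASK
    if mode == 1:
        return compute() & FLOWLABEL_MASK
    raise ValueError("seg6_flowlabel must be -1, 0 or 1")


# Inner IPv6 packet with flow label 0x12345, under each setting.
for mode in (-1, 0, 1):
    print(mode, hex(outer_flowlabel(mode, 0x12345, lambda: 0xABCDE)))
```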
diff --git a/Documentation/networking/sfp-phylink.rst b/Documentation/networking/sfp-phylink.rst
index 5aec7c8857d0..5bf285d73e8a 100644
--- a/Documentation/networking/sfp-phylink.rst
+++ b/Documentation/networking/sfp-phylink.rst
@@ -163,7 +163,7 @@ this documentation.
err = phylink_of_phy_connect(priv->phylink, node, flags);
For the most part, ``flags`` can be zero; these flags are passed to
- the of_phy_attach() inside this function call if a PHY is specified
+ the phy_attach_direct() inside this function call if a PHY is specified
in the DT node ``node``.
``node`` should be the DT node which contains the network phy property,
@@ -200,10 +200,12 @@ this documentation.
when the in-band link state changes - otherwise the link will never
come up.
- The :c:func:`validate` method should mask the supplied supported mask,
- and ``state->advertising`` with the supported ethtool link modes.
- These are the new ethtool link modes, so bitmask operations must be
- used. For an example, see drivers/net/ethernet/marvell/mvneta.c.
+ The :c:func:`mac_get_caps` method is optional, and if provided should
+ return the phylink MAC capabilities that are supported for the passed
+ ``interface`` mode. In general, there is no need to implement this method.
+ Phylink will use these capabilities in combination with permissible
+ capabilities for ``interface`` to determine the allowable ethtool link
+ modes.
The :c:func:`mac_link_state` method is used to read the link state
from the MAC, and report back the settings that the MAC is currently
@@ -224,21 +226,141 @@ this documentation.
function should modify the state and only take the link down when
absolutely necessary to change the MAC configuration. An example
of how to do this can be found in :c:func:`mvneta_mac_config` in
- drivers/net/ethernet/marvell/mvneta.c.
+ ``drivers/net/ethernet/marvell/mvneta.c``.
For further information on these methods, please see the inline
documentation in :c:type:`struct phylink_mac_ops <phylink_mac_ops>`.
-9. Remove calls to of_parse_phandle() for the PHY,
- of_phy_register_fixed_link() for fixed links etc. from the probe
- function, and replace with:
+9. Fill in the :c:type:`struct phylink_config <phylink_config>` fields with
+ a reference to the :c:type:`struct device <device>` associated with your
+ :c:type:`struct net_device <net_device>`:
.. code-block:: c
- struct phylink *phylink;
priv->phylink_config.dev = &dev.dev;
priv->phylink_config.type = PHYLINK_NETDEV;
+ Fill in the various speeds, pause and duplex modes your MAC can handle:
+
+ .. code-block:: c
+
+ priv->phylink_config.mac_capabilities = MAC_SYM_PAUSE | MAC_10 | MAC_100 | MAC_1000FD;
+
+10. Some Ethernet controllers work in tandem with a PCS (Physical Coding Sublayer)
+ block, which can handle, among other things, encoding/decoding, link
+ establishment detection and autonegotiation. While some MACs have an internal
+ PCS whose operation is transparent, others require dedicated PCS
+ configuration for the link to become functional. In that case, phylink
+ provides a PCS abstraction through :c:type:`struct phylink_pcs <phylink_pcs>`.
+
+ Identify if your driver has one or more internal PCS blocks, and/or if
+ your controller can use an external PCS block that might be internally
+ connected to your controller.
+
+ If your controller doesn't have any internal PCS, you can go to step 11.
+
+ If your Ethernet controller contains one or several PCS blocks, create
+ one :c:type:`struct phylink_pcs <phylink_pcs>` instance per PCS block within
+ your driver's private data structure:
+
+ .. code-block:: c
+
+ struct phylink_pcs pcs;
+
+ Populate the relevant :c:type:`struct phylink_pcs_ops <phylink_pcs_ops>` to
+ configure your PCS. Create a :c:func:`pcs_get_state` function that reports
+ the inband link state, a :c:func:`pcs_config` function to configure your
+ PCS according to phylink-provided parameters, and a :c:func:`pcs_validate`
+ function that reports to phylink all accepted configuration parameters for
+ your PCS:
+
+ .. code-block:: c
+
+ struct phylink_pcs_ops foo_pcs_ops = {
+ .pcs_validate = foo_pcs_validate,
+ .pcs_get_state = foo_pcs_get_state,
+ .pcs_config = foo_pcs_config,
+ };
+
+ Arrange for PCS link state interrupts to be forwarded into
+ phylink, via:
+
+ .. code-block:: c
+
+ phylink_pcs_change(pcs, link_is_up);
+
+ where ``link_is_up`` is true if the link is currently up or false
+ otherwise. If a PCS is unable to provide these interrupts, then
+ it should set ``pcs->pcs_poll = true;`` when creating the PCS.
+
+11. If your controller relies on, or accepts the presence of an external PCS
+ controlled through its own driver, add a pointer to a phylink_pcs instance
+ in your driver private data structure:
+
+ .. code-block:: c
+
+ struct phylink_pcs *pcs;
+
+ The way of getting an instance of the actual PCS depends on the platform.
+ Some PCS sit on an MDIO bus and are grabbed by passing a pointer to the
+ corresponding :c:type:`struct mii_bus <mii_bus>` and the PCS's address on
+ that bus. In this example, we assume the controller attaches to a Lynx PCS
+ instance:
+
+ .. code-block:: c
+
+ priv->pcs = lynx_pcs_create_mdiodev(bus, 0);
+
+ Some PCS can be recovered based on firmware information:
+
+ .. code-block:: c
+
+ priv->pcs = lynx_pcs_create_fwnode(of_fwnode_handle(node));
+
+12. Populate the :c:func:`mac_select_pcs` callback and add it to your
+ :c:type:`struct phylink_mac_ops <phylink_mac_ops>` set of ops. This function
+ must return a pointer to the relevant :c:type:`struct phylink_pcs <phylink_pcs>`
+ that will be used for the requested link configuration:
+
+ .. code-block:: c
+
+ static struct phylink_pcs *foo_select_pcs(struct phylink_config *config,
+ phy_interface_t interface)
+ {
+ struct foo_priv *priv = container_of(config, struct foo_priv,
+ phylink_config);
+
+ if ( /* 'interface' needs a PCS to function */ )
+ return priv->pcs;
+
+ return NULL;
+ }
+
+ See :c:func:`mvpp2_select_pcs` for an example of a driver that has multiple
+ internal PCS.
+
+13. Fill in all the :c:type:`phy_interface_t <phy_interface_t>` (i.e. all MAC to
+ PHY link modes) that your MAC can output. The following example shows a
+ configuration for a MAC that can handle all RGMII modes, SGMII and 1000BaseX.
+ You must adjust these according to what your MAC and all PCS associated
+ with this MAC are capable of, and not just the interface you wish to use:
+
+ .. code-block:: c
+
+ phy_interface_set_rgmii(priv->phylink_config.supported_interfaces);
+ __set_bit(PHY_INTERFACE_MODE_SGMII,
+ priv->phylink_config.supported_interfaces);
+ __set_bit(PHY_INTERFACE_MODE_1000BASEX,
+ priv->phylink_config.supported_interfaces);
+
+14. Remove calls to of_parse_phandle() for the PHY,
+ of_phy_register_fixed_link() for fixed links etc. from the probe
+ function, and replace with:
+
+ .. code-block:: c
+
+ struct phylink *phylink;
+
phylink = phylink_create(&priv->phylink_config, node, phy_mode, &phylink_ops);
if (IS_ERR(phylink)) {
err = PTR_ERR(phylink);
@@ -247,14 +369,14 @@ this documentation.
priv->phylink = phylink;
- and arrange to destroy the phylink in the probe failure path as
- appropriate and the removal path too by calling:
+ and arrange to destroy the phylink in the probe failure path as
+ appropriate and the removal path too by calling:
- .. code-block:: c
+ .. code-block:: c
phylink_destroy(priv->phylink);
-10. Arrange for MAC link state interrupts to be forwarded into
+15. Arrange for MAC link state interrupts to be forwarded into
phylink, via:
.. code-block:: c
@@ -262,17 +384,16 @@ this documentation.
phylink_mac_change(priv->phylink, link_is_up);
where ``link_is_up`` is true if the link is currently up or false
- otherwise. If a MAC is unable to provide these interrupts, then
- it should set ``priv->phylink_config.pcs_poll = true;`` in step 9.
+ otherwise.
-11. Verify that the driver does not call::
+16. Verify that the driver does not call::
netif_carrier_on()
netif_carrier_off()
- as these will interfere with phylink's tracking of the link state,
- and cause phylink to omit calls via the :c:func:`mac_link_up` and
- :c:func:`mac_link_down` methods.
+ as these will interfere with phylink's tracking of the link state,
+ and cause phylink to omit calls via the :c:func:`mac_link_up` and
+ :c:func:`mac_link_down` methods.
Network drivers should call phylink_stop() and phylink_start() via their
suspend/resume paths, which ensures that the appropriate
@@ -281,4 +402,4 @@ as necessary.
For information describing the SFP cage in DT, please see the binding
documentation in the kernel source tree
-``Documentation/devicetree/bindings/net/sff,sfp.txt``
+``Documentation/devicetree/bindings/net/sff,sfp.yaml``.
diff --git a/Documentation/networking/skbuff.rst b/Documentation/networking/skbuff.rst
new file mode 100644
index 000000000000..5b74275a73a3
--- /dev/null
+++ b/Documentation/networking/skbuff.rst
@@ -0,0 +1,37 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+struct sk_buff
+==============
+
+:c:type:`sk_buff` is the main networking structure representing
+a packet.
+
+Basic sk_buff geometry
+----------------------
+
+.. kernel-doc:: include/linux/skbuff.h
+ :doc: Basic sk_buff geometry
+
+Shared skbs and skb clones
+--------------------------
+
+:c:member:`sk_buff.users` is a simple refcount allowing multiple entities
+to keep a struct sk_buff alive. skbs with a ``sk_buff.users != 1`` are referred
+to as shared skbs (see skb_shared()).
+
+skb_clone() allows for fast duplication of skbs. None of the data buffers
+get copied, but the caller gets a new metadata struct (struct sk_buff).
+&skb_shared_info.refcount indicates the number of skbs pointing at the same
+packet data (i.e. clones).
+
+dataref and headerless skbs
+---------------------------
+
+.. kernel-doc:: include/linux/skbuff.h
+ :doc: dataref and headerless skbs
+
+Checksum information
+--------------------
+
+.. kernel-doc:: include/linux/skbuff.h
+ :doc: skb checksums
diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
new file mode 100644
index 000000000000..a874d007f2db
--- /dev/null
+++ b/Documentation/networking/smc-sysctl.rst
@@ -0,0 +1,73 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========
+SMC Sysctl
+==========
+
+/proc/sys/net/smc/* Variables
+=============================
+
+autocorking_size - INTEGER
+ Setting SMC auto corking size:
+ SMC auto corking is like TCP auto corking from the application's
+ point of view. When applications do consecutive small
+ write()/sendmsg() system calls, we try to coalesce these small writes
+ as much as possible, to lower the total number of CDC and RDMA Write
+ messages being sent.
+ autocorking_size limits the maximum corked bytes that can be sent to
+ the underlying device in a single send. If set to 0, SMC auto corking
+ is disabled.
+ Applications can still use TCP_CORK for optimal behavior when they
+ know how/when to uncork their sockets.
+
+ Default: 64K
+
+smcr_buf_type - INTEGER
+ Controls which type of sndbufs and RMBs to use in subsequently created
+ SMC-R link groups. Only for SMC-R.
+
+ Default: 0 (physically contiguous sndbufs and RMBs)
+
+ Possible values:
+
+ - 0 - Use physically contiguous buffers
+ - 1 - Use virtually contiguous buffers
+ - 2 - Mixed use of the two types. Try physically contiguous buffers first.
+ If not available, use virtually contiguous buffers then.
+
+smcr_testlink_time - INTEGER
+ How frequently an SMC-R link sends out TEST_LINK LLC messages to confirm
+ viability, after the last activity of connections on it. A value of 0
+ disables TEST_LINK.
+
+ Default: 30 seconds.
+
+wmem - INTEGER
+ Initial size of send buffer used by SMC sockets.
+
+ The minimum value is 16KiB. There is no hard limit on the maximum value,
+ but only up to 512KiB is allowed for SMC-R and 1MiB for SMC-D.
+
+ Default: 64KiB
+
+rmem - INTEGER
+ Initial size of receive buffer (RMB) used by SMC sockets.
+
+ The minimum value is 16KiB. There is no hard limit on the maximum value,
+ but only up to 512KiB is allowed for SMC-R and 1MiB for SMC-D.
+
+ Default: 64KiB
+
+smcr_max_links_per_lgr - INTEGER
+ Controls the max number of links that can be added to an SMC-R link group. Note that
+ the actual number of links added to an SMC-R link group depends on the number
+ of RDMA devices that exist in the system. The acceptable value ranges from 1 to 2. Only
+ for SMC-R v2.1 and later.
+
+ Default: 2
+
+smcr_max_conns_per_lgr - INTEGER
+ Controls the max number of connections that can be added to an SMC-R link group. The
+ acceptable value ranges from 16 to 255. Only for SMC-R v2.1 and later.
+
+ Default: 255
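A minimal sketch for inspecting these variables from user space follows. It
assumes the SMC module is loaded so that ``/proc/sys/net/smc`` exists and that
every entry holds a single integer; ``read_smc_sysctls`` is a hypothetical
helper name, not part of any SMC tooling.

```python
from pathlib import Path
from typing import Dict


def read_smc_sysctls(base: str = "/proc/sys/net/smc") -> Dict[str, int]:
    """Return {sysctl name: integer value} for every readable entry."""
    values: Dict[str, int] = {}
    root = Path(base)
    if not root.is_dir():
        # SMC support is not available (or not loaded) on this system.
        return values
    for entry in sorted(root.iterdir()):
        if not entry.is_file():
            continue
        try:
            values[entry.name] = int(entry.read_text().strip())
        except (OSError, ValueError):
            # Skip entries that cannot be read or are not plain integers.
            continue
    return values


for name, value in read_smc_sysctls().items():
    print(f"{name} = {value}")
```

On a system without SMC support the loop prints nothing, since the helper
returns an empty dictionary rather than raising.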
diff --git a/Documentation/networking/snmp_counter.rst b/Documentation/networking/snmp_counter.rst
index 4edd0d38779e..ff1e6a8ffe21 100644
--- a/Documentation/networking/snmp_counter.rst
+++ b/Documentation/networking/snmp_counter.rst
@@ -313,8 +313,8 @@ https://lwn.net/Articles/576263/
* TcpExtTCPOrigDataSent
-This counter is explained by `kernel commit f19c29e3e391`_, I pasted the
-explaination below::
+This counter is explained by kernel commit f19c29e3e391, I pasted the
+explanation below::
TCPOrigDataSent: number of outgoing packets with original data (excluding
retransmission but including data-in-SYN). This counter is different from
@@ -323,22 +323,20 @@ explaination below::
* TCPSynRetrans
-This counter is explained by `kernel commit f19c29e3e391`_, I pasted the
-explaination below::
+This counter is explained by kernel commit f19c29e3e391, I pasted the
+explanation below::
TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down
retransmissions into SYN, fast-retransmits, timeout retransmits, etc.
* TCPFastOpenActiveFail
-This counter is explained by `kernel commit f19c29e3e391`_, I pasted the
-explaination below::
+This counter is explained by kernel commit f19c29e3e391, I pasted the
+explanation below::
TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed because
the remote does not accept it or the attempts timed out.
-.. _kernel commit f19c29e3e391: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f19c29e3e391a66a273e9afebaf01917245148cd
-
* TcpExtListenOverflows and TcpExtListenDrops
When kernel receives a SYN from a client, and if the TCP accept queue
@@ -382,7 +380,7 @@ Defined in `RFC1213 tcpAttemptFails`_.
Defined in `RFC1213 tcpOutRsts`_. The RFC says this counter indicates
the 'segments sent containing the RST flag', but in linux kernel, this
-couner indicates the segments kerenl tried to send. The sending
+counter indicates the segments kernel tried to send. The sending
process might be failed due to some errors (e.g. memory alloc failed).
.. _RFC1213 tcpOutRsts: https://tools.ietf.org/html/rfc1213#page-52
@@ -698,11 +696,9 @@ number of the SACK block. For more details, please refer the comment
of the function tcp_is_sackblock_valid in the kernel source code. A
SACK option could have up to 4 blocks, they are checked
individually. E.g., if 3 blocks of a SACk is invalid, the
-corresponding counter would be updated 3 times. The comment of the
-`Add counters for discarded SACK blocks`_ patch has additional
-explaination:
-
-.. _Add counters for discarded SACK blocks: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=18f02545a9a16c9a89778b91a162ad16d510bb32
+corresponding counter would be updated 3 times. The comment of commit
+18f02545a9a1 ("[TCP] MIB: Add counters for discarded SACK blocks")
+has additional explanation:
* TcpExtTCPSACKDiscard
@@ -829,7 +825,7 @@ PAWS check fails or the received sequence number is out of window.
* TcpExtTCPACKSkippedTimeWait
-Tha ACK is skipped in Time-Wait status, the reason would be either
+The ACK is skipped in Time-Wait status, the reason would be either
PAWS check failed or the received sequence number is out of window.
* TcpExtTCPACKSkippedChallenge
@@ -980,11 +976,11 @@ How many reply packets of the SYN cookies the TCP stack receives.
The MSS decoded from the SYN cookie is invalid. When this counter is
updated, the received packet won't be treated as a SYN cookie and the
-TcpExtSyncookiesRecv counter wont be updated.
+TcpExtSyncookiesRecv counter won't be updated.
Challenge ACK
=============
-For details of challenge ACK, please refer the explaination of
+For details of challenge ACK, please refer the explanation of
TcpExtTCPACKSkippedChallenge.
* TcpExtTCPChallengeACK
@@ -1002,7 +998,7 @@ prune
=====
When a socket is under memory pressure, the TCP stack will try to
reclaim memory from the receiving queue and out of order queue. One of
-the reclaiming method is 'collapse', which means allocate a big sbk,
+the reclaiming method is 'collapse', which means allocate a big skb,
copy the contiguous skbs to the single big skb, and free these
contiguous skbs.
@@ -1163,7 +1159,7 @@ The server side nstat output::
IpExtOutOctets 52 0.0
IpExtInNoECTPkts 1 0.0
-Input a string in nc client side again ('world' in our exmaple)::
+Input a string in nc client side again ('world' in our example)::
nstatuser@nstat-a:~$ nc -v nstat-b 9000
Connection to nstat-b 9000 port [tcp/*] succeeded!
@@ -1211,7 +1207,7 @@ replied an ACK. But kernel handled them in different ways. When the
TCP window scale option is not used, kernel will try to enable fast
path immediately when the connection comes into the established state,
but if the TCP window scale option is used, kernel will disable the
-fast path at first, and try to enable it after kerenl receives
+fast path at first, and try to enable it after kernel receives
packets. We could use the 'ss' command to verify whether the window
scale option is used. e.g. run below command on either server or
client::
@@ -1343,7 +1339,7 @@ Check TcpExtTCPAbortOnMemory on client::
nstatuser@nstat-a:~$ nstat | grep -i abort
TcpExtTCPAbortOnMemory 54 0.0
-Check orphane socket count on client::
+Check orphaned socket count on client::
nstatuser@nstat-a:~$ ss -s
Total: 131 (kernel 0)
@@ -1681,11 +1677,11 @@ RST to nstat-b::
nstatuser@nstat-a:~$ sudo iptables -A INPUT -p tcp --sport 9000 -j DROP
-Send 3 SYN repeatly to nstat-b::
+Send 3 SYN repeatedly to nstat-b::
nstatuser@nstat-a:~$ for i in {1..3}; do sudo tcpreplay -i ens3 /tmp/syn_fixcsum.pcap; done
-Check snmp cunter on nstat-b::
+Check snmp counter on nstat-b::
nstatuser@nstat-b:~$ nstat | grep -i skip
TcpExtTCPACKSkippedSynRecv 1 0.0
@@ -1770,7 +1766,7 @@ string 'foo' in our example::
Connection from nstat-a 42132 received!
foo
-On nstat-a, the tcpdump should have caputred the ACK. We should check
+On nstat-a, the tcpdump should have captured the ACK. We should check
the source port numbers of the two nc clients::
nstatuser@nstat-a:~$ ss -ta '( dport = :9000 || dport = :9001 )' | tee
@@ -1778,7 +1774,7 @@ the source port numbers of the two nc clients::
ESTAB 0 0 192.168.122.250:50208 192.168.122.251:9000
ESTAB 0 0 192.168.122.250:42132 192.168.122.251:9001
-Run tcprewrite, change port 9001 to port 9000, chagne port 42132 to
+Run tcprewrite, change port 9001 to port 9000, change port 42132 to
port 50208::
nstatuser@nstat-a:~$ tcprewrite --infile /tmp/seq_pre.pcap --outfile /tmp/seq.pcap -r 9001:9000 -r 42132:50208 --fixcsum
diff --git a/Documentation/networking/statistics.rst b/Documentation/networking/statistics.rst
new file mode 100644
index 000000000000..75e017dfa825
--- /dev/null
+++ b/Documentation/networking/statistics.rst
@@ -0,0 +1,236 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+Interface statistics
+====================
+
+Overview
+========
+
+This document is a guide to Linux network interface statistics.
+
+There are three main sources of interface statistics in Linux:
+
+ - standard interface statistics based on
+ :c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>`;
+ - protocol-specific statistics; and
+ - driver-defined statistics available via ethtool.
+
+Standard interface statistics
+-----------------------------
+
+There are multiple interfaces to reach the standard statistics.
+Most commonly used is the `ip` command from `iproute2`::
+
+ $ ip -s -s link show dev ens4u1u1
+ 6: ens4u1u1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
+ link/ether 48:2a:e3:4c:b1:d1 brd ff:ff:ff:ff:ff:ff
+ RX: bytes packets errors dropped overrun mcast
+ 74327665117 69016965 0 0 0 0
+ RX errors: length crc frame fifo missed
+ 0 0 0 0 0
+ TX: bytes packets errors dropped carrier collsns
+ 21405556176 44608960 0 0 0 0
+ TX errors: aborted fifo window heartbeat transns
+ 0 0 0 0 128
+ altname enp58s0u1u1
+
+Note that `-s` has been specified twice to see all members of
+:c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>`.
+If `-s` is specified once, the detailed errors won't be shown.
+
+`ip` supports JSON formatting via the `-j` option.
+
+Queue statistics
+~~~~~~~~~~~~~~~~
+
+Queue statistics are accessible via the netdev netlink family.
+
+Currently no widely distributed CLI exists to access those statistics.
+Kernel development tools (ynl) can be used to experiment with them,
+see `Documentation/userspace-api/netlink/intro-specs.rst`.
+
+Protocol-specific statistics
+----------------------------
+
+Protocol-specific statistics are exposed via relevant interfaces,
+the same interfaces as are used to configure them.
+
+ethtool
+~~~~~~~
+
+Ethtool exposes common low-level statistics.
+All the standard statistics are expected to be maintained
+by the device, not the driver (as opposed to driver-defined stats
+described in the next section which mix software and hardware stats).
+For devices which contain unmanaged
+switches (e.g. legacy SR-IOV or multi-host NICs) the events counted
+may not pertain exclusively to the packets destined to
+the local host interface. In other words the events may
+be counted at the network port (MAC/PHY blocks) without separation
+for different host side (PCIe) devices. Such ambiguity must not
+be present when the internal switch is managed by Linux (so-called
+switchdev mode for NICs).
+
+Standard ethtool statistics can be accessed via the interfaces used
+for configuration. For example, the ethtool interface used
+to configure pause frames can report corresponding hardware counters::
+
+ $ ethtool --include-statistics -a eth0
+ Pause parameters for eth0:
+ Autonegotiate: on
+ RX: on
+ TX: on
+ Statistics:
+ tx_pause_frames: 1
+ rx_pause_frames: 1
+
+General Ethernet statistics not associated with any particular
+functionality are exposed via ``ethtool -S $ifc`` by specifying
+the ``--groups`` parameter::
+
+ $ ethtool -S eth0 --groups eth-phy eth-mac eth-ctrl rmon
+ Stats for eth0:
+ eth-phy-SymbolErrorDuringCarrier: 0
+ eth-mac-FramesTransmittedOK: 1
+ eth-mac-FrameTooLongErrors: 1
+ eth-ctrl-MACControlFramesTransmitted: 1
+ eth-ctrl-MACControlFramesReceived: 0
+ eth-ctrl-UnsupportedOpcodesReceived: 1
+ rmon-etherStatsUndersizePkts: 1
+ rmon-etherStatsJabbers: 0
+ rmon-rx-etherStatsPkts64Octets: 1
+ rmon-rx-etherStatsPkts65to127Octets: 0
+ rmon-rx-etherStatsPkts128to255Octets: 0
+ rmon-tx-etherStatsPkts64Octets: 2
+ rmon-tx-etherStatsPkts65to127Octets: 3
+ rmon-tx-etherStatsPkts128to255Octets: 0
+
+Driver-defined statistics
+-------------------------
+
+Driver-defined ethtool statistics can be dumped using `ethtool -S $ifc`, e.g.::
+
+ $ ethtool -S ens4u1u1
+ NIC statistics:
+ tx_single_collisions: 0
+ tx_multi_collisions: 0
+
+uAPIs
+=====
+
+procfs
+------
+
+The historical `/proc/net/dev` text interface gives access to the list
+of interfaces as well as their statistics.
+
+Note that even though this interface is using
+:c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>`
+internally it combines some of the fields.
+
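As an illustration, the text interface can be parsed with a few lines of
user-space code. The sketch below assumes the usual layout of two header
lines followed by one line per interface with 16 counters (8 receive, 8
transmit); the field names are taken from the column headers of the file, and
``parse_proc_net_dev`` is a hypothetical helper name.

```python
from typing import Dict

RX_FIELDS = ("bytes", "packets", "errs", "drop",
             "fifo", "frame", "compressed", "multicast")
TX_FIELDS = ("bytes", "packets", "errs", "drop",
             "fifo", "colls", "carrier", "compressed")


def parse_proc_net_dev(text: str) -> Dict[str, Dict[str, int]]:
    """Parse the contents of /proc/net/dev into per-interface counters."""
    stats: Dict[str, Dict[str, int]] = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        name, sep, counters = line.rpartition(":")
        if not sep:
            continue
        numbers = [int(field) for field in counters.split()]
        if len(numbers) != len(RX_FIELDS) + len(TX_FIELDS):
            continue
        entry = {f"rx_{key}": val for key, val in zip(RX_FIELDS, numbers[:8])}
        entry.update({f"tx_{key}": val for key, val in zip(TX_FIELDS, numbers[8:])})
        stats[name.strip()] = entry
    return stats


SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     2000      20    0    0    0     0       0          0
"""

print(parse_proc_net_dev(SAMPLE)["lo"]["rx_bytes"])
```

On a live system the same function can be applied to the text read from
`/proc/net/dev` instead of ``SAMPLE``.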
+sysfs
+-----
+
+Each device directory in sysfs contains a `statistics` directory (e.g.
+`/sys/class/net/lo/statistics/`) with files corresponding to
+members of :c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>`.
+
+This simple interface is convenient especially in constrained/embedded
+environments without access to tools. However, it's inefficient when
+reading multiple stats as it internally performs a full dump of
+:c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>`
+and reports only the stat corresponding to the accessed file.
+
+Sysfs files are documented in
+`Documentation/ABI/testing/sysfs-class-net-statistics`.
+
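A minimal sketch of reading the sysfs files follows. It assumes the default
sysfs mount point and that each file holds a single integer;
``read_sysfs_stats`` is a hypothetical helper name. Note that, as explained
above, each file read triggers a full statistics dump in the kernel, so
netlink remains preferable for frequent polling of many counters.

```python
from pathlib import Path
from typing import Dict


def read_sysfs_stats(ifname: str,
                     sysfs_root: str = "/sys/class/net") -> Dict[str, int]:
    """Return {counter name: value} from <sysfs_root>/<ifname>/statistics."""
    stats_dir = Path(sysfs_root) / ifname / "statistics"
    stats: Dict[str, int] = {}
    if not stats_dir.is_dir():
        return stats
    for entry in sorted(stats_dir.iterdir()):
        if entry.is_file():
            # One read per file; the kernel builds the full stats each time.
            stats[entry.name] = int(entry.read_text().strip())
    return stats


for name, value in read_sysfs_stats("lo").items():
    print(f"{name}: {value}")
```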
+
+netlink
+-------
+
+`rtnetlink` (`NETLINK_ROUTE`) is the preferred method of accessing
+:c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>` stats.
+
+Statistics are reported both in the responses to link information
+requests (`RTM_GETLINK`) and statistic requests (`RTM_GETSTATS`,
+when `IFLA_STATS_LINK_64` bit is set in the `.filter_mask` of the request).
+
+netdev (netlink)
+~~~~~~~~~~~~~~~~
+
+The `netdev` generic netlink family allows accessing page pool and per-queue
+statistics.
+
+ethtool
+-------
+
+The ethtool IOCTL interface allows drivers to report
+implementation-specific statistics. Historically it has also been used
+to report statistics for which other APIs did not exist, like
+per-device-queue statistics, or standard-based statistics (e.g. RFC 2863).
+
+Statistics and their string identifiers are retrieved separately.
+Identifiers via `ETHTOOL_GSTRINGS` with `string_set` set to `ETH_SS_STATS`,
+and values via `ETHTOOL_GSTATS`. User space should use `ETHTOOL_GDRVINFO`
+to retrieve the number of statistics (`.n_stats`).
+
+ethtool-netlink
+---------------
+
+Ethtool netlink is a replacement for the older IOCTL interface.
+
+Protocol-related statistics can be requested in get commands by setting
+the `ETHTOOL_FLAG_STATS` flag in `ETHTOOL_A_HEADER_FLAGS`. Currently
+statistics are supported in the following commands:
+
+ - `ETHTOOL_MSG_PAUSE_GET`
+ - `ETHTOOL_MSG_FEC_GET`
+ - `ETHTOOL_MSG_MM_GET`
+
+debugfs
+-------
+
+Some drivers expose extra statistics via `debugfs`.
+
+struct rtnl_link_stats64
+========================
+
+.. kernel-doc:: include/uapi/linux/if_link.h
+ :identifiers: rtnl_link_stats64
+
+Notes for driver authors
+========================
+
+Drivers should report all statistics which have a matching member in
+:c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>` exclusively
+via `.ndo_get_stats64`. Reporting such standard stats via ethtool
+or debugfs will not be accepted.
+
+Drivers must ensure best possible compliance with
+:c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>`.
+Please note for example that detailed error statistics must be
+added into the general `rx_errors` / `tx_errors` counters.
+
+The `.ndo_get_stats64` callback cannot sleep because of accesses
+via `/proc/net/dev`. If the driver may sleep when retrieving the statistics
+from the device, it should do so periodically and asynchronously, and only return
+a recent copy from `.ndo_get_stats64`. The ethtool interrupt coalescing interface
+allows setting the frequency of refreshing statistics, if needed.
+
+Retrieving ethtool statistics is a multi-syscall process; drivers are advised
+to keep the number of statistics constant to avoid race conditions with
+user space trying to read them.
+
+Statistics must persist across routine operations like bringing the interface
+down and up.
+
+Kernel-internal data structures
+-------------------------------
+
+The following structures are internal to the kernel; their members are
+translated to netlink attributes when dumped. Drivers must not overwrite
+the statistics they don't report with 0.
+
+- ethtool_pause_stats()
+- ethtool_fec_stats()
diff --git a/Documentation/networking/switchdev.rst b/Documentation/networking/switchdev.rst
index ddc3f35775dc..758f1dae3fce 100644
--- a/Documentation/networking/switchdev.rst
+++ b/Documentation/networking/switchdev.rst
@@ -1,5 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
+.. _switchdev:
===============================================
Ethernet switch device driver model (switchdev)
@@ -159,7 +160,7 @@ tools such as iproute2.
The switchdev driver can know a particular port's position in the topology by
monitoring NETDEV_CHANGEUPPER notifications. For example, a port moved into a
-bond will see it's upper master change. If that bond is moved into a bridge,
+bond will see its upper master change. If that bond is moved into a bridge,
the bond's upper master will change. And so on. The driver will track such
movements to know what position a port is in in the overall topology by
registering for netdevice events and acting on NETDEV_CHANGEUPPER.
@@ -181,18 +182,41 @@ To offloading L2 bridging, the switchdev driver/device should support:
Static FDB Entries
^^^^^^^^^^^^^^^^^^
-The switchdev driver should implement ndo_fdb_add, ndo_fdb_del and ndo_fdb_dump
-to support static FDB entries installed to the device. Static bridge FDB
-entries are installed, for example, using iproute2 bridge cmd::
+A driver which implements the ``ndo_fdb_add``, ``ndo_fdb_del`` and
+``ndo_fdb_dump`` operations is able to support the command below, which adds a
+static bridge FDB entry::
- bridge fdb add ADDR dev DEV [vlan VID] [self]
+ bridge fdb add dev DEV ADDRESS [vlan VID] [self] static
-The driver should use the helper switchdev_port_fdb_xxx ops for ndo_fdb_xxx
-ops, and handle add/delete/dump of SWITCHDEV_OBJ_ID_PORT_FDB object using
-switchdev_port_obj_xxx ops.
+(the "static" keyword is non-optional: if not specified, the entry defaults to
+being "local", which means that it should not be forwarded)
-XXX: what should be done if offloading this rule to hardware fails (for
-example, due to full capacity in hardware tables) ?
+The "self" keyword (optional because it is implicit) has the role of
+instructing the kernel to fulfill the operation through the ``ndo_fdb_add``
+implementation of the ``DEV`` device itself. If ``DEV`` is a bridge port, this
+will bypass the bridge and therefore leave the software database out of sync
+with the hardware one.
+
+To avoid this, the "master" keyword can be used::
+
+ bridge fdb add dev DEV ADDRESS [vlan VID] master static
+
+The above command instructs the kernel to search for a master interface of
+``DEV`` and fulfill the operation through the ``ndo_fdb_add`` method of that.
+This time, the bridge generates a ``SWITCHDEV_FDB_ADD_TO_DEVICE`` notification
+which the port driver can handle and use to program its hardware table. This
+way, the software and the hardware database will both contain this static FDB
+entry.
+
+Note: for new switchdev drivers that offload the Linux bridge, implementing the
+``ndo_fdb_add`` and ``ndo_fdb_del`` bridge bypass methods is strongly
+discouraged: all static FDB entries should be added on a bridge port using the
+"master" flag. The ``ndo_fdb_dump`` is an exception and can be implemented to
+visualize the hardware tables, if the device does not have an interrupt for
+notifying the operating system of newly learned/forgotten dynamic FDB
+addresses. In that case, the hardware FDB might end up having entries that the
+software FDB does not, and implementing ``ndo_fdb_dump`` is the only way to see
+them.
Note: by default, the bridge does not filter on VLAN and only bridges untagged
traffic. To enable VLAN support, turn on VLAN filtering::
@@ -385,3 +409,156 @@ The driver can monitor for updates to arp_tbl using the netevent notifier
NETEVENT_NEIGH_UPDATE. The device can be programmed with resolved nexthops
for the routes as arp_tbl updates. The driver implements ndo_neigh_destroy
to know when arp_tbl neighbor entries are purged from the port.
+
+Device driver expected behavior
+-------------------------------
+
+Below is a set of defined behavior that switchdev enabled network devices must
+adhere to.
+
+Configuration-less state
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Upon driver bring up, the network devices must be fully operational, and the
+backing driver must configure the network device such that it is possible to
+send and receive traffic to this network device and it is properly separated
+from other network devices/ports (e.g.: as is frequent with a switch ASIC). How
+this is achieved is heavily hardware dependent, but a simple solution can be to
+use per-port VLAN identifiers unless a better mechanism is available
+(proprietary metadata for each network port for instance).
+
+The network device must be capable of running a full IP protocol stack
+including multicast, DHCP, IPv4/6, etc. If necessary, it should program the
+appropriate filters for VLAN, multicast, unicast etc. The underlying device
+driver must effectively be configured in a similar fashion to what it would do
+when IGMP snooping is enabled for IP multicast over these switchdev network
+devices and unsolicited multicast must be filtered as early as possible in
+the hardware.
+
+When configuring VLANs on top of the network device, all VLANs must be working,
+irrespective of the state of other network devices (e.g.: other ports being part
+of a VLAN-aware bridge doing ingress VID checking). See below for details.
+
+If the device implements e.g.: VLAN filtering, putting the interface in
+promiscuous mode should allow the reception of all VLAN tags (including those
+not present in the filter(s)).
+
+Bridged switch ports
+^^^^^^^^^^^^^^^^^^^^
+
+When a switchdev enabled network device is added as a bridge member, it should
+not disrupt any functionality of non-bridged network devices and they
+should continue to behave as normal network devices. The expected behavior,
+depending on the bridge configuration knobs, is documented below.
+
+Bridge VLAN filtering
+^^^^^^^^^^^^^^^^^^^^^
+
+The Linux bridge allows the configuration of a VLAN filtering mode (statically,
+at device creation time, and dynamically, during run time) which must be
+observed by the underlying switchdev network device/hardware:
+
+- with VLAN filtering turned off: the bridge is strictly VLAN unaware and its
+ data path will process all Ethernet frames as if they are VLAN-untagged.
+ The bridge VLAN database can still be modified, but the modifications should
+ have no effect while VLAN filtering is turned off. Frames ingressing the
+ device with a VID that is not programmed into the bridge/switch's VLAN table
+ must be forwarded and may be processed using a VLAN device (see below).
+
+- with VLAN filtering turned on: the bridge is VLAN-aware and frames ingressing
+ the device with a VID that is not programmed into the bridge/switch's VLAN
+ table must be dropped (strict VID checking).
+
+When there is a VLAN device (e.g.: sw0p1.100) configured on top of a switchdev
+network device which is a bridge port member, the behavior of the software
+network stack must be preserved, or the configuration must be refused if that
+is not possible.
+
+- with VLAN filtering turned off, the bridge will process all ingress traffic
+ for the port, except for the traffic tagged with a VLAN ID destined for a
+ VLAN upper. The VLAN upper interface (which consumes the VLAN tag) can even
+ be added to a second bridge, which includes other switch ports or software
+ interfaces. Some approaches to ensure that the forwarding domain for traffic
+ belonging to the VLAN upper interfaces is managed properly:
+
+ * If forwarding destinations can be managed per VLAN, the hardware could be
+ configured to map all traffic, except the packets tagged with a VID
+ belonging to a VLAN upper interface, to an internal VID corresponding to
+ untagged packets. This internal VID spans all ports of the VLAN-unaware
+ bridge. The VID corresponding to the VLAN upper interface spans the
+ physical port of that VLAN interface, as well as the other ports that
+ might be bridged with it.
+ * Treat bridge ports with VLAN upper interfaces as standalone, and let
+ forwarding be handled in the software data path.
+
+- with VLAN filtering turned on, these VLAN devices can be created as long as
+ the bridge does not have an existing VLAN entry with the same VID on any
+ bridge port. These VLAN devices cannot be enslaved into the bridge since they
+ duplicate functionality/use case with the bridge's VLAN data path processing.
+
+Non-bridged network ports of the same switch fabric must not be disturbed in any
+way by the enabling of VLAN filtering on the bridge device(s). If the VLAN
+filtering setting is global to the entire chip, then the standalone ports
+should indicate to the network stack that VLAN filtering is required by setting
+'rx-vlan-filter: on [fixed]' in the ethtool features.
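+
+As a quick check of the rule above, the feature flag can be read back with
+ethtool. This is a minimal sketch; the interface name ``sw0p1`` is only an
+example, and the exact output depends on the driver::
+
+  # A standalone port of a chip with a global VLAN filtering setting is
+  # expected to list rx-vlan-filter as "on [fixed]".
+  ethtool -k sw0p1 | grep rx-vlan-filter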
+
+Because VLAN filtering can be turned on/off at runtime, the switchdev driver
+must be able to reconfigure the underlying hardware on the fly to honor the
+toggling of that option and behave appropriately. If that is not possible, the
+switchdev driver can also refuse to support dynamic toggling of the VLAN
+filtering knob at runtime and require a destruction of the bridge device(s) and
+creation of new bridge device(s) with a different VLAN filtering value to
+ensure VLAN awareness is pushed down to the hardware.
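+
+For illustration, the two ways of setting the option look as follows with
+iproute2 (a sketch; the bridge name ``br0`` is an assumption). A driver that
+refuses dynamic toggling would be expected to reject the second form while its
+ports are members of the bridge::
+
+  # VLAN filtering chosen statically, at bridge creation time
+  ip link add name br0 type bridge vlan_filtering 1
+  # VLAN filtering toggled dynamically, at run time
+  ip link set dev br0 type bridge vlan_filtering 0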
+
+Even when VLAN filtering in the bridge is turned off, the underlying switch
+hardware and driver may still configure itself in a VLAN-aware mode provided
+that the behavior described above is observed.
+
+The VLAN protocol of the bridge plays a role in deciding whether a packet is
+treated as tagged or not: a bridge using the 802.1ad protocol must treat both
+VLAN-untagged packets, as well as packets tagged with 802.1Q headers, as
+untagged.
+
+The 802.1p (VID 0) tagged packets must be treated in the same way by the device
+as untagged packets, since the bridge device does not allow the manipulation of
+VID 0 in its database.
+
+When the bridge has VLAN filtering enabled and a PVID is not configured on the
+ingress port, untagged and 802.1p tagged packets must be dropped. When the bridge
+has VLAN filtering enabled and a PVID exists on the ingress port, untagged and
+priority-tagged packets must be accepted and forwarded according to the
+bridge's port membership of the PVID VLAN. When the bridge has VLAN filtering
+disabled, the presence/lack of a PVID should not influence the packet
+forwarding decision.
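+
+The PVID cases above can be exercised from user space roughly as follows. This
+is a sketch only: ``br0`` and ``sw0p1`` are example names, and it assumes the
+port still carries the bridge's default PVID of 1::
+
+  ip link set dev br0 type bridge vlan_filtering 1
+  # Remove the PVID: untagged and 802.1p tagged packets must now be dropped
+  bridge vlan del dev sw0p1 vid 1
+  # Configure VID 10 as PVID: untagged and priority-tagged packets are
+  # accepted again and forwarded within VLAN 10
+  bridge vlan add dev sw0p1 vid 10 pvid untagged
+  bridge vlan show dev sw0p1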
+
+Bridge IGMP snooping
+^^^^^^^^^^^^^^^^^^^^
+
+The Linux bridge allows the configuration of IGMP snooping (statically, at
+interface creation time, or dynamically, during runtime) which must be observed
+by the underlying switchdev network device/hardware in the following way:
+
+- when IGMP snooping is turned off, multicast traffic must be flooded to all
+ ports within the same bridge that have mcast_flood=true. The CPU/management
+ port should ideally not be flooded (unless the ingress interface has
+ IFF_ALLMULTI or IFF_PROMISC) and continue to learn multicast traffic through
+ the network stack notifications. If the hardware is not capable of doing that
+ then the CPU/management port must also be flooded and multicast filtering
+ happens in software.
+
+- when IGMP snooping is turned on, multicast traffic must selectively flow
+ to the appropriate network ports (including CPU/management port). Flooding of
+ unknown multicast should be only towards the ports connected to a multicast
+ router (the local device may also act as a multicast router).
+
+The switch must adhere to RFC 4541 and flood multicast traffic accordingly
+since that is what the Linux bridge implementation does.
+
+Because IGMP snooping can be turned on/off at runtime, the switchdev driver
+must be able to reconfigure the underlying hardware on the fly to honor the
+toggling of that option and behave appropriately.
+
+A switchdev driver can also refuse to support dynamic toggling of the multicast
+snooping knob at runtime and require the destruction of the bridge device(s)
+and creation of new bridge device(s) with a different multicast snooping
+value.
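+
+A sketch of the corresponding user space knobs, assuming a bridge named
+``br0`` with a member port ``sw0p1`` (both names are examples)::
+
+  # Snooping off: multicast is flooded to ports with mcast_flood enabled
+  ip link set dev br0 type bridge mcast_snooping 0
+  # Exclude one port from multicast flooding
+  bridge link set dev sw0p1 mcast_flood off
+  # Snooping on again, toggled at run time
+  ip link set dev br0 type bridge mcast_snooping 1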
diff --git a/Documentation/networking/sysfs-tagging.rst b/Documentation/networking/sysfs-tagging.rst
new file mode 100644
index 000000000000..65307130ab63
--- /dev/null
+++ b/Documentation/networking/sysfs-tagging.rst
@@ -0,0 +1,48 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Sysfs tagging
+=============
+
+(Taken almost verbatim from Eric Biederman's netns tagging patch
+commit msg)
+
+The problem. Network devices show up in sysfs and with the network
+namespace active multiple devices with the same name can show up in
+the same directory, ouch!
+
+To avoid that problem and allow existing applications in network
+namespaces to see the same interface that is currently presented in
+sysfs, sysfs now has tagging directory support.
+
+By using the network namespace pointers as tags to separate out
+the sysfs directory entries we ensure that we don't have conflicts
+in the directories and applications only see a limited set of
+the network devices.
+
+Each sysfs directory entry may be tagged with a namespace via the
+``void *ns`` member of its ``kernfs_node``. If a directory entry is tagged,
+then ``kernfs_node->flags`` will have a flag between KOBJ_NS_TYPE_NONE
+and KOBJ_NS_TYPES, and ns will point to the namespace to which it
+belongs.
+
+Each sysfs superblock's kernfs_super_info contains an array
+``void *ns[KOBJ_NS_TYPES]``. When a task in a tagging namespace
+kobj_nstype first mounts sysfs, a new superblock is created. It
+will be differentiated from other sysfs mounts by having its
+``s_fs_info->ns[kobj_nstype]`` set to the new namespace. Note that
+through bind mounting and mounts propagation, a task can easily view
+the contents of other namespaces' sysfs mounts. Therefore, when a
+namespace exits, it will call kobj_ns_exit() to invalidate any
+kernfs_node->ns pointers pointing to it.
+
+Users of this interface:
+
+- define a type in the ``kobj_ns_type`` enumeration.
+- call kobj_ns_type_register() with its ``kobj_ns_type_operations`` which has
+
+ - current_ns() which returns current's namespace
+ - netlink_ns() which returns a socket's namespace
+ - initial_ns() which returns the initial namespace
+
+- call kobj_ns_exit() when an individual tag is no longer valid
diff --git a/Documentation/networking/tc-queue-filters.rst b/Documentation/networking/tc-queue-filters.rst
new file mode 100644
index 000000000000..6b417092276f
--- /dev/null
+++ b/Documentation/networking/tc-queue-filters.rst
@@ -0,0 +1,37 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+TC queue based filtering
+=========================
+
+TC can be used for directing traffic to either a set of queues or
+to a single queue on both the transmit and receive side.
+
+On the transmit side:
+
+1) TC filter directing traffic to a set of queues is achieved
+ using the action skbedit priority for Tx priority selection;
+ the priority maps to a traffic class (set of queues) when
+ the queue-sets are configured using mqprio.
+
+2) TC filter directs traffic to a transmit queue with the action
+ skbedit queue_mapping $tx_qid. The action skbedit queue_mapping
+ for transmit queue is executed in software only and cannot be
+ offloaded.
+
+Likewise, on the receive side, the two filters for selecting a set of
+queues and/or a single queue are supported as below:
+
+1) TC flower filter directs incoming traffic to a set of queues using
+ the 'hw_tc' option.
+ hw_tc $TCID - Specify a hardware traffic class to pass matching
+ packets on to. TCID is in the range 0 through 15.
+
+2) TC filter with action skbedit queue_mapping $rx_qid selects a
+ receive queue. The action skbedit queue_mapping for receive queue
+ is supported only in hardware. Multiple filters may compete in
+ the hardware for queue selection. In such a case, the hardware
+ pipeline resolves conflicts based on priority. On Intel E810
+ devices, a TC filter directing traffic to a queue has higher
+ priority than a flow director filter assigning a queue. The hash
+ filter has the lowest priority.
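+
+The cases above might look as follows on the command line. This is a hedged
+sketch rather than a tested recipe: the device name ``eth0``, the addresses,
+ports, queue layout and queue numbers are all assumptions, and hardware
+offload of the receive-side filters depends on the driver::
+
+  # Two traffic classes (queue sets): TC0 = queues 0-3, TC1 = queues 4-7
+  tc qdisc add dev eth0 root mqprio num_tc 2 map 0 0 0 0 1 1 1 1 \
+      queues 4@0 4@4 hw 1 mode channel
+  tc qdisc add dev eth0 clsact
+
+  # Transmit, set of queues: priority 4 maps to TC1 in the map above
+  tc filter add dev eth0 egress protocol ip prio 1 flower \
+      dst_ip 192.0.2.10/32 ip_proto tcp dst_port 5001 \
+      action skbedit priority 4
+
+  # Transmit, single queue (software only)
+  tc filter add dev eth0 egress protocol ip prio 2 flower \
+      dst_ip 192.0.2.11/32 ip_proto tcp dst_port 5002 \
+      action skbedit queue_mapping 2
+
+  # Receive, set of queues: hardware traffic class 1
+  tc filter add dev eth0 ingress protocol ip prio 1 flower \
+      dst_ip 192.0.2.20/32 ip_proto tcp dst_port 5001 \
+      skip_sw hw_tc 1
+
+  # Receive, single queue (hardware only)
+  tc filter add dev eth0 ingress protocol ip prio 2 flower \
+      dst_ip 192.0.2.20/32 ip_proto tcp dst_port 5002 \
+      skip_sw action skbedit queue_mapping 5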
diff --git a/Documentation/networking/tcp_ao.rst b/Documentation/networking/tcp_ao.rst
new file mode 100644
index 000000000000..8a58321acce7
--- /dev/null
+++ b/Documentation/networking/tcp_ao.rst
@@ -0,0 +1,444 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================================================
+TCP Authentication Option Linux implementation (RFC5925)
+========================================================
+
+TCP Authentication Option (TCP-AO) provides a TCP extension aimed at verifying
+segments between trusted peers. It adds a new TCP header option with
+a Message Authentication Code (MAC). MACs are produced from the content
+of a TCP segment using a hashing function with a password known to both peers.
+The intent of TCP-AO is to deprecate TCP-MD5, providing better security,
+key rotation and support for a variety of hashing algorithms.
+
+1. Introduction
+===============
+
+.. table:: Short and Limited Comparison of TCP-AO and TCP-MD5
+
+ +----------------------+------------------------+-----------------------+
+ | | TCP-MD5 | TCP-AO |
+ +======================+========================+=======================+
+ |Supported hashing |MD5 |Must support HMAC-SHA1 |
+ |algorithms |(cryptographically weak)|(chosen-prefix attacks)|
+ | | |and CMAC-AES-128 (only |
+ | | |side-channel attacks). |
+ | | |May support any hashing|
+ | | |algorithm. |
+ +----------------------+------------------------+-----------------------+
+ |Length of MACs (bytes)|16 |Typically 12-16. |
+ | | |Other variants that fit|
+ | | |TCP header permitted. |
+ +----------------------+------------------------+-----------------------+
+ |Number of keys per |1 |Many |
+ |TCP connection | | |
+ +----------------------+------------------------+-----------------------+
+ |Possibility to change |Non-practical (both |Supported by protocol |
+ |an active key |peers have to change | |
+ | |them during MSL) | |
+ +----------------------+------------------------+-----------------------+
+ |Protection against |No |Yes: ignoring them |
+ |ICMP 'hard errors' | |by default on |
+ | | |established connections|
+ +----------------------+------------------------+-----------------------+
+ |Protection against |No |Yes: pseudo-header |
+ |traffic-crossing | |includes TCP ports. |
+ |attack | | |
+ +----------------------+------------------------+-----------------------+
+ |Protection against |No |Sequence Number |
+ |replayed TCP segments | |Extension (SNE) and |
+ | | |Initial Sequence |
+ | | |Numbers (ISNs) |
+ +----------------------+------------------------+-----------------------+
+ |Supports |Yes |No. ISNs+SNE are needed|
+ |Connectionless Resets | |to correctly sign RST. |
+ +----------------------+------------------------+-----------------------+
+ |Standards |RFC 2385 |RFC 5925, RFC 5926 |
+ +----------------------+------------------------+-----------------------+
+
+
+1.1 Frequently Asked Questions (FAQ) with references to RFC 5925
+----------------------------------------------------------------
+
+Q: Can either SendID or RecvID be non-unique for the same 4-tuple
+(srcaddr, srcport, dstaddr, dstport)?
+
+A: No [3.1]::
+
+ >> The IDs of MKTs MUST NOT overlap where their TCP connection
+ identifiers overlap.
+
+Q: Can Master Key Tuple (MKT) for an active connection be removed?
+
+A: No, unless it's copied to Transport Control Block (TCB) [3.1]::
+
+ It is presumed that an MKT affecting a particular connection cannot
+ be destroyed during an active connection -- or, equivalently, that
+ its parameters are copied to an area local to the connection (i.e.,
+ instantiated) and so changes would affect only new connections.
+
+Q: If an old MKT needs to be deleted, how should it be done in order
+to not remove it for an active connection? (As it can be still in use
+at any moment later)
+
+A: Not specified by RFC 5925, seems to be a problem for key management
+to ensure that no one uses such MKT before trying to remove it.
+
+Q: Can an old MKT exist forever and be used by another peer?
+
+A: It can; it's a key management task to decide when to remove an old key [6.1]::
+
+ Deciding when to start using a key is a performance issue. Deciding
+ when to remove an MKT is a security issue. Invalid MKTs are expected
+ to be removed. TCP-AO provides no mechanism to coordinate their removal,
+ as we consider this a key management operation.
+
+also [6.1]::
+
+ The only way to avoid reuse of previously used MKTs is to remove the MKT
+ when it is no longer considered permitted.
+
+Linux TCP-AO will try its best to prevent you from removing a key that's
+being used, considering it a key management failure. But since keeping
+an outdated key may become a security issue and as a peer may
+unintentionally prevent the removal of an old key by always setting
+it as RNextKeyID - a forced key removal mechanism is provided, where
+userspace has to supply KeyID to use instead of the one that's being removed
+and the kernel will atomically delete the old key, even if the peer is
+still requesting it. There are no guarantees for force-delete as the peer
+may not yet have the new key - the TCP connection may just break.
+Alternatively, one may choose to shut down the socket.
+
+Q: What happens when a packet is received on a new connection with no known
+MKT's RecvID?
+
+A: RFC 5925 specifies that by default it is accepted with a warning logged, but
+the behaviour can be configured by the user [7.5.1.a]::
+
+ If the segment is a SYN, then this is the first segment of a new
+ connection. Find the matching MKT for this segment, using the segment's
+ socket pair and its TCP-AO KeyID, matched against the MKT's TCP connection
+ identifier and the MKT's RecvID.
+
+ i. If there is no matching MKT, remove TCP-AO from the segment.
+ Proceed with further TCP handling of the segment.
+ NOTE: this presumes that connections that do not match any MKT
+ should be silently accepted, as noted in Section 7.3.
+
+[7.3]::
+
+ >> A TCP-AO implementation MUST allow for configuration of the behavior
+ of segments with TCP-AO but that do not match an MKT. The initial default
+ of this configuration SHOULD be to silently accept such connections.
+ If this is not the desired case, an MKT can be included to match such
+ connections, or the connection can indicate that TCP-AO is required.
+ Alternately, the configuration can be changed to discard segments with
+ the AO option not matching an MKT.
+
+[10.2.b]::
+
+ Connections not matching any MKT do not require TCP-AO. Further, incoming
+ segments with TCP-AO are not discarded solely because they include
+ the option, provided they do not match any MKT.
+
+Note that the Linux TCP-AO implementation differs in this aspect. Currently, TCP-AO
+segments with unknown key signatures are discarded with warnings logged.
+
+Q: Does the RFC imply centralized kernel key management in any way?
+(i.e. that a key on all connections MUST be rotated at the same time?)
+
+A: Not specified. MKTs can be managed in userspace, the only relevant part to
+key changes is [7.3]::
+
+ >> All TCP segments MUST be checked against the set of MKTs for matching
+ TCP connection identifiers.
+
+Q: What happens when RNextKeyID requested by a peer is unknown? Should
+the connection be reset?
+
+A: It should not; no action needs to be performed [7.5.2.e]::
+
+ ii. If they differ, determine whether the RNextKeyID MKT is ready.
+
+ 1. If the MKT corresponding to the segment’s socket pair and RNextKeyID
+ is not available, no action is required (RNextKeyID of a received
+ segment needs to match the MKT’s SendID).
+
+Q: How is current_key set and when does it change? Is it a user-triggered
+change, or is it by a request from the remote peer? Is it set by the user
+explicitly, or by a matching rule?
+
+A: current_key is set by RNextKeyID [6.1]::
+
+ Rnext_key is changed only by manual user intervention or MKT management
+ protocol operation. It is not manipulated by TCP-AO. Current_key is updated
+ by TCP-AO when processing received TCP segments as discussed in the segment
+ processing description in Section 7.5. Note that the algorithm allows
+ the current_key to change to a new MKT, then change back to a previously
+ used MKT (known as "backing up"). This can occur during an MKT change when
+ segments are received out of order, and is considered a feature of TCP-AO,
+ because reordering does not result in drops.
+
+[7.5.2.e.ii]::
+
+ 2. If the matching MKT corresponding to the segment’s socket pair and
+ RNextKeyID is available:
+
+ a. Set current_key to the RNextKeyID MKT.
+
+Q: If both peers have multiple MKTs matching the connection's socket pair
+(with different KeyIDs), how should the sender/receiver pick KeyID to use?
+
+A: Some mechanism should pick the "desired" MKT [3.3]::
+
+ Multiple MKTs may match a single outgoing segment, e.g., when MKTs
+ are being changed. Those MKTs cannot have conflicting IDs (as noted
+ elsewhere), and some mechanism must determine which MKT to use for each
+ given outgoing segment.
+
+ >> An outgoing TCP segment MUST match at most one desired MKT, indicated
+ by the segment’s socket pair. The segment MAY match multiple MKTs, provided
+ that exactly one MKT is indicated as desired. Other information in
+ the segment MAY be used to determine the desired MKT when multiple MKTs
+ match; such information MUST NOT include values in any TCP option fields.
+
+Q: Can a TCP-MD5 connection migrate to TCP-AO (and vice versa)?
+
+A: No [1]::
+
+ TCP MD5-protected connections cannot be migrated to TCP-AO because TCP MD5
+ does not support any changes to a connection’s security algorithm
+ once established.
+
+Q: If all MKTs are removed on a connection, can it become a non-TCP-AO signed
+connection?
+
+A: [7.5.2] doesn't have the same choice as SYN packet handling in [7.5.1.i]
+that would allow accepting segments without a signature (which would be
+insecure). While switching to a non-TCP-AO connection is not prohibited
+directly, that seems to be what the RFC means. Also, there's a requirement for
+TCP-AO connections to always have one current_key [3.3]::
+
+ TCP-AO requires that every protected TCP segment match exactly one MKT.
+
+[3.3]::
+
+ >> An incoming TCP segment including TCP-AO MUST match exactly one MKT,
+ indicated solely by the segment’s socket pair and its TCP-AO KeyID.
+
+[4.4]::
+
+ One or more MKTs. These are the MKTs that match this connection’s
+ socket pair.
+
+Q: Can a non-TCP-AO connection become a TCP-AO-enabled one?
+
+A: No: for an already established non-TCP-AO connection it would be impossible
+to switch to using TCP-AO, as the traffic key generation requires the initial
+sequence numbers. In other words, starting to use TCP-AO would require
+re-establishing the TCP connection.
+
+2. In-kernel MKTs database vs database in userspace
+===================================================
+
+Linux TCP-AO support is implemented using ``setsockopt()s``, in a similar way
+to TCP-MD5. It means that a userspace application that wants to use TCP-AO
+should perform ``setsockopt()`` on a TCP socket when it wants to add,
+remove or rotate MKTs. This approach moves the key management responsibility
+to userspace, as well as decisions on corner cases, e.g. what to do if
+the peer doesn't respect RNextKeyID. More code thus lives in userspace,
+especially the code responsible for policy decisions. Besides, it's flexible
+and scales well (with less locking needed than in the case of an in-kernel
+database). One should also keep in mind that the main intended users are BGP
+processes, not any random applications, which means that compared to IPsec
+tunnels, no transparency is really needed and modern BGP daemons already have
+``setsockopt()s`` for TCP-MD5 support.
+
+.. table:: Considered pros and cons of the approaches
+
+ +----------------------+------------------------+-----------------------+
+ | | ``setsockopt()`` | in-kernel DB |
+ +======================+========================+=======================+
+ | Extendability | ``setsockopt()`` | Netlink messages are |
+ | | commands should be | simple and extendable |
+ | | extendable syscalls | |
+ +----------------------+------------------------+-----------------------+
+ | Required userspace | BGP or any application | could be transparent |
+ | changes | that wants TCP-AO needs| as tunnels, providing |
+ | | to perform | something like |
+ | | ``setsockopt()s`` | ``ip tcpao add key`` |
+ | | and do key management | (delete/show/rotate) |
+ +----------------------+------------------------+-----------------------+
+ |MKTs removal or adding| harder for userspace | harder for kernel |
+ +----------------------+------------------------+-----------------------+
+ | Dump-ability | ``getsockopt()`` | Netlink .dump() |
+ | | | callback |
+ +----------------------+------------------------+-----------------------+
+ | Limits on kernel | equal |
+ | resources/memory | |
+ +----------------------+------------------------+-----------------------+
+ | Scalability | contention on | contention on |
+ | | ``TCP_LISTEN`` sockets | the whole database |
+ +----------------------+------------------------+-----------------------+
+ | Monitoring & warnings| ``TCP_DIAG`` | same Netlink socket |
+ +----------------------+------------------------+-----------------------+
+ | Matching of MKTs | half-problem: only | hard |
+ | | listen sockets | |
+ +----------------------+------------------------+-----------------------+
+
+
+3. uAPI
+=======
+
+Linux provides a set of ``setsockopt()s`` and ``getsockopt()s`` that let
+userspace manage TCP-AO on a per-socket basis. In order to add/delete MKTs,
+the ``TCP_AO_ADD_KEY`` and ``TCP_AO_DEL_KEY`` TCP socket options must be used.
+It is not allowed to add a key on an established non-TCP-AO connection
+or to remove the last key from a TCP-AO connection.
+
+The ``setsockopt(TCP_AO_DEL_KEY)`` command may specify ``tcp_ao_del::current_key``
+together with ``tcp_ao_del::set_current`` and/or ``tcp_ao_del::rnext`` together
+with ``tcp_ao_del::set_rnext``, which makes such a delete "forced": it
+provides userspace a way to delete a key that's being used and atomically set
+another one instead. This is not intended for normal use and should be used
+only when the peer ignores RNextKeyID and keeps requesting/using an old key.
+It provides a way to force-delete a key that's not trusted but may break
+the TCP-AO connection.
+
+The usual/normal key-rotation can be performed with ``setsockopt(TCP_AO_INFO)``.
+It also provides a uAPI to change per-socket TCP-AO settings, such as
+ignoring ICMPs, as well as clear per-socket TCP-AO packet counters.
+The corresponding ``getsockopt(TCP_AO_INFO)`` can be used to get those
+per-socket TCP-AO settings.
+
+Another useful command is ``getsockopt(TCP_AO_GET_KEYS)``. One can use it
+to list all MKTs on a TCP socket or use a filter to get keys for a specific
+peer and/or sndid/rcvid, VRF L3 interface or get current_key/rnext_key.
+
+To repair TCP-AO connections ``setsockopt(TCP_AO_REPAIR)`` is available,
+provided that the user previously has checkpointed/dumped the socket with
+``getsockopt(TCP_AO_REPAIR)``.
+
+A tip here for scaled TCP_LISTEN sockets, that may have thousands of TCP-AO
+keys, is: use filters in ``getsockopt(TCP_AO_GET_KEYS)`` and asynchronous
+delete with ``setsockopt(TCP_AO_DEL_KEY)``.
+
+Linux TCP-AO also provides a bunch of segment counters that can be helpful
+with troubleshooting/debugging issues. Every MKT has good/bad counters
+that reflect how many packets passed/failed verification.
+Each TCP-AO socket has the following counters:
+
+- for good segments (properly signed)
+- for bad segments (failed TCP-AO verification)
+- for segments with unknown keys
+- for segments where an AO signature was expected, but wasn't found
+- for the number of ignored ICMPs
+
+TCP-AO per-socket counters are also duplicated with per-netns counters,
+exposed with SNMP. Those are ``TCPAOGood``, ``TCPAOBad``, ``TCPAOKeyNotFound``,
+``TCPAORequired`` and ``TCPAODroppedIcmps``.
+
+RFC 5925 very permissively specifies how TCP port matching can be done for
+MKTs::
+
+ TCP connection identifier. A TCP socket pair, i.e., a local IP
+ address, a remote IP address, a TCP local port, and a TCP remote port.
+ Values can be partially specified using ranges (e.g., 2-30), masks
+ (e.g., 0xF0), wildcards (e.g., "*"), or any other suitable indication.
+
+Currently, the Linux TCP-AO implementation doesn't provide any TCP port
+matching. Port ranges are probably the most flexible option for uAPI, but
+they are not implemented so far.
+
+4. ``setsockopt()`` vs ``accept()`` race
+========================================
+
+In contrast with an established TCP-MD5 connection, which has just one key,
+TCP-AO connections may have many keys, which means that accepted connections
+on a listen socket may have any number of keys as well. Copying all those
+keys on the first properly signed SYN would make the request socket bigger,
+which is undesirable. Currently, the implementation doesn't copy keys
+to request sockets, but rather looks them up on the "parent" listener socket.
+
+As a result, when userspace removes TCP-AO keys, that may break
+not-yet-established connections on request sockets, while the keys are not
+removed from sockets that are already established but not yet
+``accept()``'ed and are still hanging in the accept queue.
+
+The reverse is valid as well: if userspace adds a new key for a peer on
+a listener socket, the established sockets in accept queue won't
+have the new keys.
+
+At this moment, the resolution for the two races:
+``setsockopt(TCP_AO_ADD_KEY)`` vs ``accept()``
+and ``setsockopt(TCP_AO_DEL_KEY)`` vs ``accept()`` is delegated to userspace.
+This means that it's expected that userspace would check the MKTs on the socket
+that was returned by ``accept()`` to verify that any key rotation that
+happened on listen socket is reflected on the newly established connection.
+
+This is a similar "do-nothing" approach to TCP-MD5 from the kernel side and
+may be changed later by introducing new flags to ``tcp_ao_add``
+and ``tcp_ao_del``.
+
+Note that this race is rare for it needs TCP-AO key rotation to happen
+during the 3-way handshake for the new TCP connection.
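+
+The userspace check described above can be sketched as a comparison of two
+key lists. This is only an illustration: ``struct mkt_id`` and
+``mkt_first_missing()`` are hypothetical names, and how the lists are
+obtained is left out (on Linux they would come from
+``getsockopt(TCP_AO_GET_KEYS)`` on the accepted socket).
+

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced view of an MKT: only its two key IDs. */
struct mkt_id {
	uint8_t sndid;
	uint8_t rcvid;
};

/* Index of the first key in expected[] that is absent from accepted[],
 * or -1 when the accepted socket carries every expected key. */
static int mkt_first_missing(const struct mkt_id *expected, size_t n_expected,
			     const struct mkt_id *accepted, size_t n_accepted)
{
	for (size_t i = 0; i < n_expected; i++) {
		size_t j;

		for (j = 0; j < n_accepted; j++) {
			if (expected[i].sndid == accepted[j].sndid &&
			    expected[i].rcvid == accepted[j].rcvid)
				break;
		}
		if (j == n_accepted)
			return (int)i;
	}
	return -1;
}
```

+
+A non-negative result means a key rotation raced with the handshake, and
+the application has to reconcile the keys on the accepted socket itself.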
+
+5. Interaction with TCP-MD5
+===========================
+
+A TCP connection cannot migrate between TCP-AO and TCP-MD5 options.
+Established sockets that have either AO or MD5 keys are not allowed to
+add keys of the other option.
+
+For listening sockets the picture is different: a BGP server may want to accept
+both TCP-AO and (deprecated) TCP-MD5 clients. As a result, both types of keys
+may be added to TCP_CLOSED or TCP_LISTEN sockets. It's not allowed to add
+different types of keys for the same peer.
+
+6. SNE Linux implementation
+===========================
+
+RFC 5925 [6.2] describes the algorithm of how to extend TCP sequence numbers
+with SNE. In short: TCP has to track the previous sequence numbers and set
+sne_flag when the current SEQ number rolls over. The flag is cleared when
+both current and previous SEQ numbers cross 0x7fff, which is 32Kb.
+
+In times when sne_flag is set, the algorithm compares SEQ for each packet with
+0x7fff and if it's higher than 32Kb, it assumes that the packet should be
+verified with SNE before the increment. As a result, there's
+this [0; 32Kb] window, when packets with (SNE - 1) can be accepted.
+
+The Linux implementation simplifies this a bit: as the network stack already tracks
+the first SEQ byte that ACK is wanted for (snd_una) and the next SEQ byte that
+is wanted (rcv_nxt) - that's enough information for a rough estimation
+on where in the 4GB SEQ number space both sender and receiver are.
+When they roll over to zero, the corresponding SNE gets incremented.
+
+tcp_ao_compute_sne() is called for each TCP-AO segment. It compares SEQ numbers
+from the segment with snd_una or rcv_nxt and fits the result into a 2GB window around them,
+detecting SEQ numbers rolling over. That simplifies the code a lot and only
+requires SNE numbers to be stored on every TCP-AO socket.
+
+The 2GB window at first glance seems much more permissive compared to
+RFC 5925. But that is only used to pick the correct SNE before/after
+a rollover. It allows more TCP segment replays, yet all regular
+TCP checks in tcp_sequence() are applied on the verified segment.
+So, it trades a bit more permissive acceptance of replayed/retransmitted
+segments for the simplicity of the algorithm and what seems better behaviour
+for large TCP windows.
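+
+The rollover detection described above can be modelled in a few lines of
+userspace C. This is a minimal sketch and not the kernel source:
+``compute_sne()`` and ``seq_before()`` are illustrative names, and the caller
+is assumed to pass the SNE and SEQ of the reference point (snd_una or
+rcv_nxt).
+

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative helper: true if a precedes b in 32-bit sequence space,
 * i.e. a lies within the 2GB window behind b. */
static int seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Hypothetical model: pick the SNE for a segment's SEQ, given the SNE
 * and SEQ of the reference point (snd_una or rcv_nxt). */
static uint32_t compute_sne(uint32_t ref_sne, uint32_t ref_seq, uint32_t seq)
{
	uint32_t sne = ref_sne;

	if (seq_before(seq, ref_seq)) {
		/* Older segment that is numerically larger: it was sent
		 * before the reference point rolled over. */
		if (seq > ref_seq)
			sne--;
	} else {
		/* Newer segment that is numerically smaller: its SEQ has
		 * already rolled over past zero. */
		if (seq < ref_seq)
			sne++;
	}
	return sne;
}
```

+
+Away from a rollover the reference SNE is returned unchanged; only segments
+on the other side of zero from the reference point get SNE + 1 or SNE - 1.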
+
+7. Links
+========
+
+RFC 5925 The TCP Authentication Option
+ https://www.rfc-editor.org/rfc/pdfrfc/rfc5925.txt.pdf
+
+RFC 5926 Cryptographic Algorithms for the TCP Authentication Option (TCP-AO)
+ https://www.rfc-editor.org/rfc/pdfrfc/rfc5926.txt.pdf
+
+Draft "SHA-2 Algorithm for the TCP Authentication Option (TCP-AO)"
+ https://datatracker.ietf.org/doc/html/draft-nayak-tcp-sha2-03
+
+RFC 2385 Protection of BGP Sessions via the TCP MD5 Signature Option
+ https://www.rfc-editor.org/rfc/pdfrfc/rfc2385.txt.pdf
+
+:Author: Dmitry Safonov <dima@arista.com>
diff --git a/Documentation/networking/timestamping.rst b/Documentation/networking/timestamping.rst
index 1adead6a4527..5e93cd71f99f 100644
--- a/Documentation/networking/timestamping.rst
+++ b/Documentation/networking/timestamping.rst
@@ -55,7 +55,8 @@ struct __kernel_sock_timeval format.
SO_TIMESTAMP_OLD returns incorrect timestamps after the year 2038
on 32 bit machines.
-1.2 SO_TIMESTAMPNS (also SO_TIMESTAMPNS_OLD and SO_TIMESTAMPNS_NEW):
+1.2 SO_TIMESTAMPNS (also SO_TIMESTAMPNS_OLD and SO_TIMESTAMPNS_NEW)
+-------------------------------------------------------------------
This option is identical to SO_TIMESTAMP except for the returned data type.
Its struct timespec allows for higher resolution (ns) timestamps than the
@@ -178,7 +179,8 @@ SOF_TIMESTAMPING_OPT_ID:
identifier and returns that along with the timestamp. The identifier
is derived from a per-socket u32 counter (that wraps). For datagram
sockets, the counter increments with each sent packet. For stream
- sockets, it increments with every byte.
+ sockets, it increments with every byte. For stream sockets, also set
+ SOF_TIMESTAMPING_OPT_ID_TCP, see the section below.
The counter starts at zero. It is initialized the first time that
the socket option is enabled. It is reset each time the option is
@@ -191,6 +193,35 @@ SOF_TIMESTAMPING_OPT_ID:
among all possibly concurrently outstanding timestamp requests for
that socket.
+SOF_TIMESTAMPING_OPT_ID_TCP:
+ Pass this modifier along with SOF_TIMESTAMPING_OPT_ID for new TCP
+ timestamping applications. SOF_TIMESTAMPING_OPT_ID defines how the
+ counter increments for stream sockets, but its starting point is
+ not entirely trivial. This option fixes that.
+
+ For stream sockets, if SOF_TIMESTAMPING_OPT_ID is set, this should
+ always be set too. On datagram sockets the option has no effect.
+
+ A reasonable expectation is that the counter is reset to zero with
+ the system call, so that a subsequent write() of N bytes generates
+ a timestamp with counter N-1. SOF_TIMESTAMPING_OPT_ID_TCP
+ implements this behavior under all conditions.
+
+ SOF_TIMESTAMPING_OPT_ID without modifier often reports the same,
+ especially when the socket option is set when no data is in
+ transmission. If data is being transmitted, it may be off by the
+ length of the output queue (SIOCOUTQ).
+
+ The difference is due to being based on snd_una versus write_seq.
+ snd_una is the offset in the stream acknowledged by the peer. This
+ depends on factors outside of process control, such as network RTT.
+ write_seq is the last byte written by the process. This offset is
+ not affected by external inputs.
+
+ The difference is subtle and unlikely to be noticed when configured
+ at initial socket creation, when no data is queued or sent. But
+ SOF_TIMESTAMPING_OPT_ID_TCP behavior is more robust regardless of
+ when the socket option is set.
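+
+ The arithmetic behind the two behaviors can be sketched as follows. This
+ is a simplified model, not kernel code: ``struct stream_state`` and
+ ``tskey_after_write()`` are hypothetical names, and write_seq is taken
+ here to be the offset at which the next written byte is placed.
+

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot of a stream socket at the moment the socket
 * option is enabled. Offsets are 32-bit and wrap. */
struct stream_state {
	uint32_t snd_una;   /* first byte not yet acknowledged by the peer */
	uint32_t write_seq; /* offset of the next byte the process writes */
};

/* ID reported for a write() of len bytes (len >= 1) issued right after
 * the option was enabled in state s. With opt_id_tcp set the counter is
 * based on write_seq, otherwise on snd_una. */
static uint32_t tskey_after_write(struct stream_state s, uint32_t len,
				  int opt_id_tcp)
{
	uint32_t base = opt_id_tcp ? s.write_seq : s.snd_una;
	uint32_t last_byte = s.write_seq + len - 1;

	return last_byte - base;
}
```

+
+ With an empty output queue both variants report N-1 for a write of N
+ bytes. With unacknowledged data queued, the snd_una-based variant is off
+ by the queue length.
+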
SOF_TIMESTAMPING_OPT_CMSG:
Support recv() cmsg for all timestamped packets. Control messages
@@ -326,7 +357,8 @@ enabling SOF_TIMESTAMPING_OPT_ID and comparing the byte offset at
send time with the value returned for each timestamp. It can prevent
the situation by always flushing the TCP stack in between requests,
for instance by enabling TCP_NODELAY and disabling TCP_CORK and
-autocork.
+autocork. After linux-4.7, a better way to prevent coalescing is
+to use MSG_EOR flag at sendmsg() time.
These precautions ensure that the timestamp is generated only when all
bytes have passed a timestamp point, assuming that the network stack
@@ -485,8 +517,8 @@ of packets.
Drivers are free to use a more permissive configuration than the requested
configuration. It is expected that drivers should only implement directly the
most generic mode that can be supported. For example if the hardware can
-support HWTSTAMP_FILTER_V2_EVENT, then it should generally always upscale
-HWTSTAMP_FILTER_V2_L2_SYNC_MESSAGE, and so forth, as HWTSTAMP_FILTER_V2_EVENT
+support HWTSTAMP_FILTER_PTP_V2_EVENT, then it should generally always upscale
+HWTSTAMP_FILTER_PTP_V2_L2_SYNC, and so forth, as HWTSTAMP_FILTER_PTP_V2_EVENT
is more generic (and more useful to applications).
A driver which supports hardware time stamping shall update the struct
@@ -581,11 +613,191 @@ Time stamps for outgoing packets are to be generated as follows:
and hardware timestamping is not possible (SKBTX_IN_PROGRESS not set).
- As soon as the driver has sent the packet and/or obtained a
hardware time stamp for it, it passes the time stamp back by
- calling skb_hwtstamp_tx() with the original skb, the raw
- hardware time stamp. skb_hwtstamp_tx() clones the original skb and
+ calling skb_tstamp_tx() with the original skb, the raw
+ hardware time stamp. skb_tstamp_tx() clones the original skb and
adds the timestamps, therefore the original skb has to be freed now.
If obtaining the hardware time stamp somehow fails, then the driver
should not fall back to software time stamping. The rationale is that
this would occur at a later time in the processing pipeline than other
software time stamping and therefore could lead to unexpected deltas
between time stamps.
+
+3.2 Special considerations for stacked PTP Hardware Clocks
+----------------------------------------------------------
+
+There are situations when there may be more than one PHC (PTP Hardware Clock)
+in the data path of a packet. The kernel has no explicit mechanism to allow the
+user to select which PHC to use for timestamping Ethernet frames. Instead, the
+assumption is that the outermost PHC is always the most preferable, and that
+kernel drivers collaborate towards achieving that goal. Currently there are 3
+cases of stacked PHCs, detailed below:
+
+3.2.1 DSA (Distributed Switch Architecture) switches
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+These are Ethernet switches which have one of their ports connected to an
+(otherwise completely unaware) host Ethernet interface, and perform the role of
+a port multiplier with optional forwarding acceleration features. Each DSA
+switch port is visible to the user as a standalone (virtual) network interface,
+and its network I/O is performed, under the hood, indirectly through the host
+interface (redirecting to the host port on TX, and intercepting frames on RX).
+
+When a DSA switch is attached to a host port, PTP synchronization has to
+suffer, since the switch's variable queuing delay introduces a path delay
+jitter between the host port and its PTP partner. For this reason, some DSA
+switches include a timestamping clock of their own, and have the ability to
+perform network timestamping on their own MAC, such that path delays only
+measure wire and PHY propagation latencies. Timestamping DSA switches are
+supported in Linux and expose the same ABI as any other network interface (save
+for the fact that the DSA interfaces are in fact virtual in terms of network
+I/O, they do have their own PHC). It is typical, but not mandatory, for all
+interfaces of a DSA switch to share the same PHC.
+
+By design, PTP timestamping with a DSA switch does not need any special
+handling in the driver for the host port it is attached to. However, when the
+host port also supports PTP timestamping, DSA will take care of intercepting
+the ``.ndo_eth_ioctl`` calls towards the host port, and block attempts to enable
+hardware timestamping on it. This is because the SO_TIMESTAMPING API does not
+allow the delivery of multiple hardware timestamps for the same packet, so
+anybody else except for the DSA switch port must be prevented from doing so.
+
+In the generic layer, DSA provides the following infrastructure for PTP
+timestamping:
+
+- ``.port_txtstamp()``: a hook called prior to the transmission of
+ packets with a hardware TX timestamping request from user space.
+ This is required for two-step timestamping, since the hardware
+ timestamp becomes available after the actual MAC transmission, so the
+ driver must be prepared to correlate the timestamp with the original
+ packet so that it can re-enqueue the packet back into the socket's
+ error queue. To save the packet for when the timestamp becomes
+ available, the driver can call ``skb_clone_sk``, save the clone pointer
+ in skb->cb and enqueue it on a TX skb queue. Typically, a switch will have a
+ PTP TX timestamp register (or sometimes a FIFO) where the timestamp
+ becomes available. In case of a FIFO, the hardware might store
+ key-value pairs of PTP sequence ID/message type/domain number and the
+ actual timestamp. To perform the correlation correctly between the
+ packets in a queue waiting for timestamping and the actual timestamps,
+ drivers can use a BPF classifier (``ptp_classify_raw``) to identify
+ the PTP transport type, and ``ptp_parse_header`` to interpret the PTP
+ header fields. There may be an IRQ that is raised upon this
+ timestamp's availability, or the driver might have to poll after
+ invoking ``dev_queue_xmit()`` towards the host interface.
+ One-step TX timestamping does not require packet cloning, since there is
+ no follow-up message required by the PTP protocol (because the
+ TX timestamp is embedded into the packet by the MAC), and therefore
+ user space does not expect the packet annotated with the TX timestamp
+ to be re-enqueued into its socket's error queue.
+
+- ``.port_rxtstamp()``: On RX, the BPF classifier is run by DSA to
+ identify PTP event messages (any other packets, including PTP general
+ messages, are not timestamped). The original (and only) timestampable
+ skb is provided to the driver, for it to annotate it with a timestamp,
+ if that is immediately available, or defer to later. On reception,
+ timestamps might either be available in-band (through metadata in the
+ DSA header, or attached in other ways to the packet), or out-of-band
+ (through another RX timestamping FIFO). Deferral on RX is typically
+ necessary when retrieving the timestamp needs a sleepable context. In
+ that case, it is the responsibility of the DSA driver to call
+ ``netif_rx()`` on the freshly timestamped skb.
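+
+The TX correlation step for a timestamp FIFO can be sketched in plain C.
+This is a userspace model under stated assumptions: ``struct ptp_key``,
+``struct pending_tx`` and ``match_tx_tstamp()`` are hypothetical names, and a
+real driver would keep cloned skbs in an skb queue and fill the key from
+``ptp_parse_header()`` instead of using a flat array.
+

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key the hardware stores next to each FIFO timestamp. */
struct ptp_key {
	uint16_t seqid;
	uint8_t msgtype;
	uint8_t domain;
};

/* Hypothetical slot for a cloned packet waiting for its TX timestamp. */
struct pending_tx {
	struct ptp_key key;
	int in_use;
	uint64_t tstamp_ns;
};

static int ptp_key_equal(struct ptp_key a, struct ptp_key b)
{
	return a.seqid == b.seqid && a.msgtype == b.msgtype &&
	       a.domain == b.domain;
}

/* Match one FIFO entry against the waiting packets. On a hit, record the
 * timestamp, release the slot and return its index; return -1 when no
 * waiting packet has this key. */
static int match_tx_tstamp(struct pending_tx *q, size_t n,
			   struct ptp_key key, uint64_t ns)
{
	for (size_t i = 0; i < n; i++) {
		if (q[i].in_use && ptp_key_equal(q[i].key, key)) {
			q[i].tstamp_ns = ns;
			q[i].in_use = 0;
			return (int)i;
		}
	}
	return -1;
}
```

+
+Matching on the full key rather than on arrival order keeps the correlation
+correct when timestamps are read back later than the packets were queued.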
+
+3.2.2 Ethernet PHYs
+^^^^^^^^^^^^^^^^^^^
+
+These are devices that typically fulfill a Layer 1 role in the network stack,
+hence they do not have a representation in terms of a network interface as DSA
+switches do. However, PHYs may be able to detect and timestamp PTP packets, for
+performance reasons: timestamps taken as close as possible to the wire have the
+potential to yield a more stable and precise synchronization.
+
+A PHY driver that supports PTP timestamping must create a ``struct
+mii_timestamper`` and add a pointer to it in ``phydev->mii_ts``. The presence
+of this pointer will be checked by the networking stack.
+
+Since PHYs do not have network interface representations, the timestamping and
+ethtool ioctl operations for them need to be mediated by their respective MAC
+driver. Therefore, as opposed to DSA switches, modifications need to be done
+to each individual MAC driver for PHY timestamping support. This entails:
+
+- Checking, in ``.ndo_eth_ioctl``, whether ``phy_has_hwtstamp(netdev->phydev)``
+ is true or not. If it is, then the MAC driver should not process this request
+ but instead pass it on to the PHY using ``phy_mii_ioctl()``.
+
+- On RX, special intervention may or may not be needed, depending on the
+ function used to deliver skb's up the network stack. In the case of plain
+ ``netif_rx()`` and similar, MAC drivers must check whether
+ ``skb_defer_rx_timestamp(skb)`` is necessary or not - and if it is, don't
+ call ``netif_rx()`` at all. If ``CONFIG_NETWORK_PHY_TIMESTAMPING`` is
+ enabled, and ``skb->dev->phydev->mii_ts`` exists, its ``.rxtstamp()`` hook
+ will be called now, to determine, using logic very similar to DSA, whether
+ deferral for RX timestamping is necessary. Again like DSA, it becomes the
+ responsibility of the PHY driver to send the packet up the stack when the
+ timestamp is available.
+
+ For other skb receive functions, such as ``napi_gro_receive`` and
+ ``netif_receive_skb``, the stack automatically checks whether
+ ``skb_defer_rx_timestamp()`` is necessary, so this check is not needed inside
+ the driver.
+
+- On TX, again, special intervention might or might not be needed. The
+ function that calls the ``mii_ts->txtstamp()`` hook is named
+ ``skb_clone_tx_timestamp()``. This function can be called directly
+ (in which case explicit MAC driver support is indeed needed), but it
+ also piggybacks on the ``skb_tx_timestamp()`` call, which many MAC
+ drivers already perform for software timestamping purposes. Therefore, if a
+ MAC supports software timestamping, it does not need to do anything further
+ at this stage.
+
+3.2.3 MII bus snooping devices
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+These perform the same role as timestamping Ethernet PHYs, save for the fact
+that they are discrete devices and can therefore be used in conjunction with
+any PHY even if it doesn't support timestamping. In Linux, they are
+discoverable and attachable to a ``struct phy_device`` through Device Tree, and
+for the rest, they use the same mii_ts infrastructure as those. See
+Documentation/devicetree/bindings/ptp/timestamper.txt for more details.
+
+3.2.4 Other caveats for MAC drivers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The caveat with stacked PHCs, especially DSA (but not only), is that they
+uncover bugs which were impossible to trigger before the existence of stacked
+PTP clocks. DSA is the most affected because it doesn't require any
+modification to MAC drivers, so it is more difficult to ensure correctness of
+all possible code paths. One example has to do with this line of code, already
+presented earlier::
+
+ skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
+
+Any TX timestamping logic, be it a plain MAC driver, a DSA switch driver, a PHY
+driver or a MII bus snooping device driver, should set this flag.
+But a MAC driver that is unaware of PHC stacking might get tripped up by
+somebody other than itself setting this flag, and deliver a duplicate
+timestamp.
+For example, a typical driver design for TX timestamping might be to split the
+transmission part into 2 portions:
+
+1. "TX": checks whether PTP timestamping has been previously enabled through
+ the ``.ndo_eth_ioctl`` ("``priv->hwtstamp_tx_enabled == true``") and the
+ current skb requires a TX timestamp ("``skb_shinfo(skb)->tx_flags &
+ SKBTX_HW_TSTAMP``"). If this is true, it sets the
+ "``skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS``" flag. Note: as
+ described above, in the case of a stacked PHC system, this condition should
+ never trigger, as this MAC is certainly not the outermost PHC. But this is
+ not where the typical issue is. Transmission proceeds with this packet.
+
+2. "TX confirmation": Transmission has finished. The driver checks whether it
+ is necessary to collect any TX timestamp for it. Here is where the typical
+ issues are: the MAC driver takes a shortcut and only checks whether
+ "``skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS``" was set. With a stacked
+ PHC system, this is incorrect because this MAC driver is not the only entity
+ in the TX data path who could have enabled SKBTX_IN_PROGRESS in the first
+ place.
+
+The correct solution for this problem is for MAC drivers to have a compound
+check in their "TX confirmation" portion, not only for
+"``skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS``", but also for
+"``priv->hwtstamp_tx_enabled == true``". Because the rest of the system ensures
+that PTP timestamping is not enabled for anything other than the outermost PHC,
+this enhanced check will avoid delivering a duplicated TX timestamp to user
+space.
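+
+The difference between the shortcut and the compound check can be shown with
+a reduced userspace model. Everything here is a stand-in for illustration:
+``struct mac_priv``, ``struct fake_skb`` and both ``txconf_*()`` helpers are
+hypothetical, and the flag values are placeholders for the kernel's own
+definitions.
+

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder values for this sketch; a real driver uses the kernel's
 * SKBTX_* definitions. */
#define SKBTX_HW_TSTAMP   (1u << 0)
#define SKBTX_IN_PROGRESS (1u << 2)

/* Hypothetical, reduced driver state and packet. */
struct mac_priv {
	bool hwtstamp_tx_enabled;
};

struct fake_skb {
	uint8_t tx_flags;
};

/* "TX confirmation" shortcut: trusts SKBTX_IN_PROGRESS alone, so it also
 * fires when another PHC in the data path set the flag. */
static bool txconf_naive(const struct mac_priv *priv,
			 const struct fake_skb *skb)
{
	(void)priv;
	return (skb->tx_flags & SKBTX_IN_PROGRESS) != 0;
}

/* Compound check: collect a timestamp only if this MAC itself has
 * timestamping enabled and the flag is set. */
static bool txconf_compound(const struct mac_priv *priv,
			    const struct fake_skb *skb)
{
	return priv->hwtstamp_tx_enabled &&
	       (skb->tx_flags & SKBTX_IN_PROGRESS) != 0;
}
```

+
+For a packet flagged by a DSA switch while the MAC's own timestamping is
+disabled, the shortcut would deliver a duplicate timestamp and the compound
+check would not.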
diff --git a/Documentation/networking/tipc.rst b/Documentation/networking/tipc.rst
new file mode 100644
index 000000000000..ab63d298cca2
--- /dev/null
+++ b/Documentation/networking/tipc.rst
@@ -0,0 +1,215 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Linux Kernel TIPC
+=================
+
+Introduction
+============
+
+TIPC (Transparent Inter Process Communication) is a protocol that is specially
+designed for intra-cluster communication. It can be configured to transmit
+messages either on UDP or directly across Ethernet. Message delivery is
+sequence guaranteed, loss free and flow controlled. Latency times are shorter
+than with any other known protocol, while maximal throughput is comparable to
+that of TCP.
+
+TIPC Features
+-------------
+
+- Cluster wide IPC service
+
+ Have you ever wished you had the convenience of Unix Domain Sockets even when
+ transmitting data between cluster nodes? Where you yourself determine the
+ addresses you want to bind to and use? Where you don't have to perform DNS
+ lookups and worry about IP addresses? Where you don't have to start timers
+ to monitor the continuous existence of peer sockets? And yet without the
+ downsides of that socket type, such as the risk of lingering inodes?
+
+ Welcome to the Transparent Inter Process Communication service, TIPC in short,
+ which gives you all of this, and a lot more.
+
+- Service Addressing
+
+ A fundamental concept in TIPC is that of Service Addressing which makes it
+ possible for a programmer to choose his own address, bind it to a server
+ socket and let client programs use only that address for sending messages.
+
+- Service Tracking
+
+ A client wanting to wait for the availability of a server uses the Service
+ Tracking mechanism to subscribe to binding and unbinding/close events for
+ sockets with the associated service address.
+
+ The service tracking mechanism can also be used for Cluster Topology Tracking,
+ i.e., subscribing for availability/non-availability of cluster nodes.
+
+ Likewise, the service tracking mechanism can be used for Cluster Connectivity
+ Tracking, i.e., subscribing for up/down events for individual links between
+ cluster nodes.
+
+- Transmission Modes
+
+ Using a service address, a client can send datagram messages to a server socket.
+
+ Using the same address type, it can establish a connection towards an accepting
+ server socket.
+
+ It can also use a service address to create and join a Communication Group,
+ which is the TIPC manifestation of a brokerless message bus.
+
+ Multicast with very good performance and scalability is available both in
+ datagram mode and in communication group mode.
+
+- Inter Node Links
+
+ Communication between any two nodes in a cluster is maintained by one or two
+ Inter Node Links, which both guarantee data traffic integrity and monitor
+ the peer node's availability.
+
+- Cluster Scalability
+
+ By applying the Overlapping Ring Monitoring algorithm on the inter node links
+ it is possible to scale TIPC clusters up to 1000 nodes with a maintained
+ neighbor failure discovery time of 1-2 seconds. For smaller clusters this
+ time can be made much shorter.
+
+- Neighbor Discovery
+
+ Neighbor Node Discovery in the cluster is done by Ethernet broadcast or UDP
+ multicast, when any of those services are available. If not, configured peer
+ IP addresses can be used.
+
+- Configuration
+
+ When running TIPC in single node mode no configuration whatsoever is needed.
+ When running in cluster mode TIPC must as a minimum be given a node address
+ (before Linux 4.17) and told which interface to attach to. The "tipc"
+ configuration tool makes it possible to add and maintain many more
+ configuration parameters.
+
+- Performance
+
+ TIPC message transfer latency times are better than in any other known protocol.
+ Maximal byte throughput for inter-node connections is still somewhat lower than
+ for TCP, while it is superior for intra-node and inter-container traffic
+ on the same host.
+
+- Language Support
+
+ The TIPC user API has support for C, Python, Perl, Ruby, D and Go.
+
+More Information
+----------------
+
+- How to set up TIPC:
+
+ http://tipc.io/getting_started.html
+
+- How to program with TIPC:
+
+ http://tipc.io/programming.html
+
+- How to contribute to TIPC:
+
+ http://tipc.io/contacts.html
+
+- More details about TIPC specification:
+
+ http://tipc.io/protocol.html
+
+
+Implementation
+==============
+
+TIPC is implemented as a kernel module in the net/tipc/ directory.
+
+TIPC Base Types
+---------------
+
+.. kernel-doc:: net/tipc/subscr.h
+ :internal:
+
+.. kernel-doc:: net/tipc/bearer.h
+ :internal:
+
+.. kernel-doc:: net/tipc/name_table.h
+ :internal:
+
+.. kernel-doc:: net/tipc/name_distr.h
+ :internal:
+
+.. kernel-doc:: net/tipc/bcast.c
+ :internal:
+
+TIPC Bearer Interfaces
+----------------------
+
+.. kernel-doc:: net/tipc/bearer.c
+ :internal:
+
+.. kernel-doc:: net/tipc/udp_media.c
+ :internal:
+
+TIPC Crypto Interfaces
+----------------------
+
+.. kernel-doc:: net/tipc/crypto.c
+ :internal:
+
+TIPC Discoverer Interfaces
+--------------------------
+
+.. kernel-doc:: net/tipc/discover.c
+ :internal:
+
+TIPC Link Interfaces
+--------------------
+
+.. kernel-doc:: net/tipc/link.c
+ :internal:
+
+TIPC msg Interfaces
+-------------------
+
+.. kernel-doc:: net/tipc/msg.c
+ :internal:
+
+TIPC Name Interfaces
+--------------------
+
+.. kernel-doc:: net/tipc/name_table.c
+ :internal:
+
+.. kernel-doc:: net/tipc/name_distr.c
+ :internal:
+
+TIPC Node Management Interfaces
+-------------------------------
+
+.. kernel-doc:: net/tipc/node.c
+ :internal:
+
+TIPC Socket Interfaces
+----------------------
+
+.. kernel-doc:: net/tipc/socket.c
+ :internal:
+
+TIPC Network Topology Interfaces
+--------------------------------
+
+.. kernel-doc:: net/tipc/subscr.c
+ :internal:
+
+TIPC Server Interfaces
+----------------------
+
+.. kernel-doc:: net/tipc/topsrv.c
+ :internal:
+
+TIPC Trace Interfaces
+---------------------
+
+.. kernel-doc:: net/tipc/trace.c
+ :internal:
diff --git a/Documentation/networking/tls-handshake.rst b/Documentation/networking/tls-handshake.rst
new file mode 100644
index 000000000000..6f5ea1646a47
--- /dev/null
+++ b/Documentation/networking/tls-handshake.rst
@@ -0,0 +1,222 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+In-Kernel TLS Handshake
+=======================
+
+Overview
+========
+
+Transport Layer Security (TLS) is an Upper Layer Protocol (ULP) that runs
+over TCP. TLS provides end-to-end data integrity and confidentiality in
+addition to peer authentication.
+
+The kernel's kTLS implementation handles the TLS record subprotocol, but
+does not handle the TLS handshake subprotocol which is used to establish
+a TLS session. Kernel consumers can use the API described here to
+request TLS session establishment.
+
+There are several possible ways to provide a handshake service in the
+kernel. The API described here is designed to hide the details of those
+implementations so that in-kernel TLS consumers do not need to be
+aware of how the handshake gets done.
+
+
+User handshake agent
+====================
+
+As of this writing, there is no TLS handshake implementation in the
+Linux kernel. To provide a handshake service, a handshake agent
+(typically in user space) is started in each network namespace where a
+kernel consumer might require a TLS handshake. Handshake agents listen
+for events sent from the kernel that indicate a handshake request is
+waiting.
+
+An open socket is passed to a handshake agent via a netlink operation,
+which creates a socket descriptor in the agent's file descriptor table.
+If the handshake completes successfully, the handshake agent promotes
+the socket to use the TLS ULP and sets the session information using the
+SOL_TLS socket options. The handshake agent returns the socket to the
+kernel via a second netlink operation.
+
+
+Kernel Handshake API
+====================
+
+A kernel TLS consumer initiates a client-side TLS handshake on an open
+socket by invoking one of the tls_client_hello() functions. First, it
+fills in a structure that contains the parameters of the request:
+
+.. code-block:: c
+
+ struct tls_handshake_args {
+ struct socket *ta_sock;
+ tls_done_func_t ta_done;
+ void *ta_data;
+ const char *ta_peername;
+ unsigned int ta_timeout_ms;
+ key_serial_t ta_keyring;
+ key_serial_t ta_my_cert;
+ key_serial_t ta_my_privkey;
+ unsigned int ta_num_peerids;
+ key_serial_t ta_my_peerids[5];
+ };
+
+The @ta_sock field references an open and connected socket. The consumer
+must hold a reference on the socket to prevent it from being destroyed
+while the handshake is in progress. The consumer must also have
+instantiated a struct file in sock->file.
+
+
+@ta_done contains a callback function that is invoked when the handshake
+has completed. Further explanation of this function is in the "Handshake
+Completion" section below.
+
+The consumer can provide a NUL-terminated hostname in the @ta_peername
+field that is sent as part of ClientHello. If no peername is provided,
+the DNS hostname associated with the server's IP address is used instead.
+
+The consumer can fill in the @ta_timeout_ms field to force the servicing
+handshake agent to exit after a number of milliseconds. This enables the
+socket to be fully closed once both the kernel and the handshake agent
+have closed their endpoints.
+
+Authentication material such as x.509 certificates, private certificate
+keys, and pre-shared keys are provided to the handshake agent in keys
+that are instantiated by the consumer before making the handshake
+request. The consumer can provide a private keyring that is linked into
+the handshake agent's process keyring in the @ta_keyring field to prevent
+access of those keys by other subsystems.
+
+To request an x.509-authenticated TLS session, the consumer fills in
+the @ta_my_cert and @ta_my_privkey fields with the serial numbers of
+keys containing an x.509 certificate and the private key for that
+certificate. Then, it invokes this function:
+
+.. code-block:: c
+
+ ret = tls_client_hello_x509(args, gfp_flags);
+
+The function returns zero when the handshake request is under way. A
+zero return guarantees the callback function @ta_done will be invoked
+for this socket. The function returns a negative errno if the handshake
+could not be started. A negative errno guarantees the callback function
+@ta_done will not be invoked on this socket.
+
+
+To initiate a client-side TLS handshake with a pre-shared key, use:
+
+.. code-block:: c
+
+ ret = tls_client_hello_psk(args, gfp_flags);
+
+However, in this case, the consumer fills in the @ta_my_peerids array
+with serial numbers of keys containing the peer identities it wishes
+to offer, and the @ta_num_peerids field with the number of array
+entries it has filled in. The other fields are filled in as above.
+
+
+To initiate an anonymous client-side TLS handshake use:
+
+.. code-block:: c
+
+ ret = tls_client_hello_anon(args, gfp_flags);
+
+The handshake agent presents no peer identity information to the remote
+during this type of handshake. Only server authentication (i.e., the client
+verifies the server's identity) is performed during the handshake. Thus
+the established session uses encryption only.
+
+
+Consumers that are in-kernel servers use:
+
+.. code-block:: c
+
+ ret = tls_server_hello_x509(args, gfp_flags);
+
+or
+
+.. code-block:: c
+
+ ret = tls_server_hello_psk(args, gfp_flags);
+
+The argument structure is filled in as above.
+
+
+If the consumer needs to cancel the handshake request, say, due to a ^C
+or other exigent event, the consumer can invoke:
+
+.. code-block:: c
+
+ bool tls_handshake_cancel(sock);
+
+This function returns true if the handshake request associated with
+@sock has been canceled. The consumer's handshake completion callback
+will not be invoked. If this function returns false, then the consumer's
+completion callback has already been invoked.
+
+
+Handshake Completion
+====================
+
+When the handshake agent has completed processing, it notifies the
+kernel that the socket may be used by the consumer again. At this point,
+the consumer's handshake completion callback, provided in the @ta_done
+field in the tls_handshake_args structure, is invoked.
+
+The synopsis of this function is:
+
+.. code-block:: c
+
+ typedef void (*tls_done_func_t)(void *data, int status,
+ key_serial_t peerid);
+
+The consumer provides a cookie in the @ta_data field of the
+tls_handshake_args structure that is returned in the @data parameter of
+this callback. The consumer uses the cookie to match the callback to the
+thread waiting for the handshake to complete.
+
+The success status of the handshake is returned via the @status
+parameter:
+
++------------+----------------------------------------------+
+|  status    |  meaning                                     |
++============+==============================================+
+|  0         |  TLS session established successfully        |
++------------+----------------------------------------------+
+|  -EACCES   |  Remote peer rejected the handshake or       |
+|            |  authentication failed                       |
++------------+----------------------------------------------+
+|  -ENOMEM   |  Temporary resource allocation failure       |
++------------+----------------------------------------------+
+|  -EINVAL   |  Consumer provided an invalid argument       |
++------------+----------------------------------------------+
+|  -ENOKEY   |  Missing authentication material             |
++------------+----------------------------------------------+
+|  -EIO      |  An unexpected fault occurred                |
++------------+----------------------------------------------+
+
+The @peerid parameter contains the serial number of a key containing the
+remote peer's identity or the value TLS_NO_PEERID if the session is not
+authenticated.
+
+A best practice is to close and destroy the socket immediately if the
+handshake failed.
+
+
+Other considerations
+--------------------
+
+While a handshake is under way, the kernel consumer must alter the
+socket's sk_data_ready callback function to ignore all incoming data.
+Once the handshake completion callback function has been invoked, normal
+receive operation can be resumed.
+
+Once a TLS session is established, the consumer must provide a buffer
+for and then examine the control message (CMSG) that is part of every
+subsequent sock_recvmsg(). Each control message indicates whether the
+received message data is TLS record data or session metadata.
+
+See tls.rst for details on how a kTLS consumer recognizes incoming
+(decrypted) application data, alerts, and handshake packets once the
+socket has been promoted to use the TLS ULP.
diff --git a/Documentation/networking/tls-offload.rst b/Documentation/networking/tls-offload.rst
index f914e81fd3a6..5f0dea3d571e 100644
--- a/Documentation/networking/tls-offload.rst
+++ b/Documentation/networking/tls-offload.rst
@@ -428,6 +428,24 @@ by the driver:
which were part of a TLS stream.
* ``rx_tls_decrypted_bytes`` - number of TLS payload bytes in RX packets
which were successfully decrypted.
+ * ``rx_tls_ctx`` - number of TLS RX HW offload contexts added to device for
+ decryption.
+ * ``rx_tls_del`` - number of TLS RX HW offload contexts deleted from device
+ (connection has finished).
+ * ``rx_tls_resync_req_pkt`` - number of received TLS packets with a resync
+ request.
+ * ``rx_tls_resync_req_start`` - number of times the TLS async resync request
+ was started.
+ * ``rx_tls_resync_req_end`` - number of times the TLS async resync request
+ ended properly by providing the HW-tracked tcp-seq.
+ * ``rx_tls_resync_req_skip`` - number of times the TLS async resync request
+ procedure was started but not properly ended.
+ * ``rx_tls_resync_res_ok`` - number of times the TLS resync response call to
+ the driver was successfully handled.
+ * ``rx_tls_resync_res_skip`` - number of times the TLS resync response call to
+ the driver was terminated unsuccessfully.
+ * ``rx_tls_err`` - number of RX packets which were part of a TLS stream
+ but were not decrypted due to unexpected error in the state machine.
* ``tx_tls_encrypted_packets`` - number of TX packets passed to the device
for encryption of their TLS payload.
* ``tx_tls_encrypted_bytes`` - number of TLS payload bytes in TX packets
@@ -506,7 +524,16 @@ on TCP retransmissions to handle corner cases is not acceptable.
TLS device features
-------------------
-Drivers should ignore the changes to TLS the device feature flags.
+Drivers should ignore the changes to the TLS device feature flags.
These flags will be acted upon accordingly by the core ``ktls`` code.
TLS device feature flags only control adding of new TLS connection
offloads, old connections will remain active after flags are cleared.
+
+TLS encryption cannot be offloaded to devices without checksum calculation
+offload. Hence, the TLS TX device feature flag requires TX csum offload to be set.
+Disabling the latter implies clearing the former. Disabling TX checksum offload
+should not affect old connections, and drivers should make sure checksum
+calculation does not break for them.
+Similarly, device-offloaded TLS decryption implies doing RXCSUM. If the user
+does not want to enable RX csum offload, TLS RX device feature is disabled
+as well.
diff --git a/Documentation/networking/tls.rst b/Documentation/networking/tls.rst
index 8cb2cd4e2a80..658ed3a71e1b 100644
--- a/Documentation/networking/tls.rst
+++ b/Documentation/networking/tls.rst
@@ -214,6 +214,44 @@ of calling send directly after a handshake using gnutls.
Since it doesn't implement a full record layer, control
messages are not supported.
+Optional optimizations
+----------------------
+
+There are certain condition-specific optimizations the TLS ULP can make,
+if requested. Those optimizations are either not universally beneficial
+or may impact correctness, hence they require an opt-in.
+All options are set per-socket using setsockopt(), and their
+state can be checked using getsockopt() and via socket diag (``ss``).
+
+TLS_TX_ZEROCOPY_RO
+~~~~~~~~~~~~~~~~~~
+
+For device offload only. Allow sendfile() data to be transmitted directly
+to the NIC without making an in-kernel copy. This allows true zero-copy
+behavior when device offload is enabled.
+
+The application must make sure that the data is not modified between being
+submitted and transmission completing. In other words this is mostly
+applicable if the data sent on a socket via sendfile() is read-only.
+
+Modifying the data may result in different versions of the data being used
+for the original TCP transmission and TCP retransmissions. To the receiver
+this will look like TLS records had been tampered with and will result
+in record authentication failures.
+
+TLS_RX_EXPECT_NO_PAD
+~~~~~~~~~~~~~~~~~~~~
+
+TLS 1.3 only. Expect the sender to not pad records. This allows the data
+to be decrypted directly into user space buffers with TLS 1.3.
+
+This optimization is safe to enable only if the remote end is trusted,
+otherwise it is an attack vector for doubling the TLS processing cost.
+
+If the decrypted record turns out to have been padded or is not a data
+record, it will be decrypted again into a kernel buffer without zero copy.
+Such events are counted in the ``TlsDecryptRetry`` statistic.
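
A minimal sketch of opting in from user space, assuming the socket already has the TLS ULP attached and TLS 1.3 RX keys installed. The helper name ``tls_rx_expect_no_pad()`` is hypothetical, and the fallback values for ``SOL_TLS`` and ``TLS_RX_EXPECT_NO_PAD`` are assumptions mirrored from the kernel UAPI headers for toolchains that do not define them.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>

/* Fallbacks for older headers; values assumed to match the kernel UAPI. */
#ifndef SOL_TLS
#define SOL_TLS 282
#endif
#ifndef TLS_RX_EXPECT_NO_PAD
#define TLS_RX_EXPECT_NO_PAD 4
#endif

/* Enable (1) or disable (0) the no-padding expectation on a kTLS socket.
 * Returns 0 on success or a negative errno value on failure. */
static int tls_rx_expect_no_pad(int fd, int enable)
{
	assert(enable == 0 || enable == 1);

	if (setsockopt(fd, SOL_TLS, TLS_RX_EXPECT_NO_PAD,
		       &enable, sizeof(enable)) < 0)
		return -errno;
	return 0;
}
```

After enabling the option, watching ``TlsDecryptRetry`` is a reasonable way to check whether the peer actually sends unpadded records.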
+
Statistics
==========
@@ -239,3 +277,12 @@ TLS implementation exposes the following per-namespace statistics
- ``TlsDeviceRxResync`` -
number of RX resyncs sent to NICs handling cryptography
+
+- ``TlsDecryptRetry`` -
+ number of RX records which had to be re-decrypted due to
+ ``TLS_RX_EXPECT_NO_PAD`` mis-prediction. Note that this counter will
+ also increment for non-data records.
+
+- ``TlsRxNoPadViolation`` -
+ number of data RX records which had to be re-decrypted due to
+ ``TLS_RX_EXPECT_NO_PAD`` mis-prediction.
diff --git a/Documentation/networking/tuntap.rst b/Documentation/networking/tuntap.rst
index a59d1dd6fdcc..4d7087f727be 100644
--- a/Documentation/networking/tuntap.rst
+++ b/Documentation/networking/tuntap.rst
@@ -107,7 +107,7 @@ Note that the character pointer becomes overwritten with the real device name
*/
ifr.ifr_flags = IFF_TUN;
if( *dev )
- strncpy(ifr.ifr_name, dev, IFNAMSIZ);
+ strscpy_pad(ifr.ifr_name, dev, IFNAMSIZ);
if( (err = ioctl(fd, TUNSETIFF, (void *) &ifr)) < 0 ){
close(fd);
diff --git a/Documentation/networking/vrf.rst b/Documentation/networking/vrf.rst
index 0dde145043bc..0a9a6f968cb9 100644
--- a/Documentation/networking/vrf.rst
+++ b/Documentation/networking/vrf.rst
@@ -144,6 +144,19 @@ default VRF are only handled by a socket not bound to any VRF::
netfilter rules on the VRF device can be used to limit access to services
running in the default VRF context as well.
+Using VRF-aware applications (applications which simultaneously create sockets
+outside and inside VRFs) in conjunction with ``net.ipv4.tcp_l3mdev_accept=1``
+is possible but may lead to problems in some situations. With that sysctl
+value, it is unspecified which listening socket will be selected to handle
+connections for VRF traffic; i.e., either a socket bound to the VRF or an unbound
+socket may be used to accept new connections from a VRF. This somewhat
+unexpected behavior can lead to problems if sockets are configured with extra
+options (e.g., TCP MD5 keys) with the expectation that VRF traffic will
+exclusively be handled by sockets bound to VRFs, as would be the case with
+``net.ipv4.tcp_l3mdev_accept=0``. Finally and as a reminder, regardless of
+which listening socket is selected, established sockets will be created in the
+VRF based on the ingress interface, as documented earlier.
+
--------------------------------------------------------------------------------
Using iproute2 for VRFs
diff --git a/Documentation/networking/vxlan.rst b/Documentation/networking/vxlan.rst
index ce239fa01848..2759dc1cc525 100644
--- a/Documentation/networking/vxlan.rst
+++ b/Documentation/networking/vxlan.rst
@@ -58,3 +58,31 @@ forwarding table using the new bridge command.
3. Show forwarding table::
# bridge fdb show dev vxlan0
+
+The following NIC features may indicate support for UDP tunnel-related
+offloads (most commonly VXLAN features, but support for a particular
+encapsulation protocol is NIC specific):
+
+ - `tx-udp_tnl-segmentation`
+ - `tx-udp_tnl-csum-segmentation`
+ ability to perform TCP segmentation offload of UDP encapsulated frames
+
+ - `rx-udp_tunnel-port-offload`
+ receive side parsing of UDP encapsulated frames which allows NICs to
+ perform protocol-aware offloads, like checksum validation offload of
+ inner frames (only needed by NICs without protocol-agnostic offloads)
+
+For devices supporting `rx-udp_tunnel-port-offload` the list of currently
+offloaded ports can be interrogated with `ethtool`::
+
+ $ ethtool --show-tunnels eth0
+ Tunnel information for eth0:
+ UDP port table 0:
+ Size: 4
+ Types: vxlan
+ No entries
+ UDP port table 1:
+ Size: 4
+ Types: geneve, vxlan-gpe
+ Entries (1):
+ port 1230, vxlan-gpe
diff --git a/Documentation/networking/x25-iface.rst b/Documentation/networking/x25-iface.rst
index df401891dce6..285cefcfce87 100644
--- a/Documentation/networking/x25-iface.rst
+++ b/Documentation/networking/x25-iface.rst
@@ -1,8 +1,7 @@
.. SPDX-License-Identifier: GPL-2.0
-============================-
X.25 Device Driver Interface
-============================-
+============================
Version 1.1
@@ -70,60 +69,13 @@ First Byte = 0x03 (X25_IFACE_PARAMS)
LAPB parameters. To be defined.
+Requirements for the device driver
+----------------------------------
-Possible Problems
-=================
-
-(Henner Eisen, 2000-10-28)
-
-The X.25 packet layer protocol depends on a reliable datalink service.
-The LAPB protocol provides such reliable service. But this reliability
-is not preserved by the Linux network device driver interface:
-
-- With Linux 2.4.x (and above) SMP kernels, packet ordering is not
- preserved. Even if a device driver calls netif_rx(skb1) and later
- netif_rx(skb2), skb2 might be delivered to the network layer
- earlier that skb1.
-- Data passed upstream by means of netif_rx() might be dropped by the
- kernel if the backlog queue is congested.
-
-The X.25 packet layer protocol will detect this and reset the virtual
-call in question. But many upper layer protocols are not designed to
-handle such N-Reset events gracefully. And frequent N-Reset events
-will always degrade performance.
-
-Thus, driver authors should make netif_rx() as reliable as possible:
-
-SMP re-ordering will not occur if the driver's interrupt handler is
-always executed on the same CPU. Thus,
-
-- Driver authors should use irq affinity for the interrupt handler.
-
-The probability of packet loss due to backlog congestion can be
-reduced by the following measures or a combination thereof:
-
-(1) Drivers for kernel versions 2.4.x and above should always check the
- return value of netif_rx(). If it returns NET_RX_DROP, the
- driver's LAPB protocol must not confirm reception of the frame
- to the peer.
- This will reliably suppress packet loss. The LAPB protocol will
- automatically cause the peer to re-transmit the dropped packet
- later.
- The lapb module interface was modified to support this. Its
- data_indication() method should now transparently pass the
- netif_rx() return value to the (lapb module) caller.
-(2) Drivers for kernel versions 2.2.x should always check the global
- variable netdev_dropping when a new frame is received. The driver
- should only call netif_rx() if netdev_dropping is zero. Otherwise
- the driver should not confirm delivery of the frame and drop it.
- Alternatively, the driver can queue the frame internally and call
- netif_rx() later when netif_dropping is 0 again. In that case, delivery
- confirmation should also be deferred such that the internal queue
- cannot grow to much.
- This will not reliably avoid packet loss, but the probability
- of packet loss in netif_rx() path will be significantly reduced.
-(3) Additionally, driver authors might consider to support
- CONFIG_NET_HW_FLOWCONTROL. This allows the driver to be woken up
- when a previously congested backlog queue becomes empty again.
- The driver could uses this for flow-controlling the peer by means
- of the LAPB protocol's flow-control service.
+Packets should not be reordered or dropped when delivering between the
+Packet Layer and the device driver.
+
+To prevent packets from being reordered or dropped when delivering from
+the device driver to the Packet Layer, the device driver should not
+call "netif_rx" to deliver the received packets. Instead, it should
+call "netif_receive_skb_core" from softirq context to deliver them.
diff --git a/Documentation/networking/x25.rst b/Documentation/networking/x25.rst
index 00e45d384ba0..e11d9ebdf9a3 100644
--- a/Documentation/networking/x25.rst
+++ b/Documentation/networking/x25.rst
@@ -19,13 +19,11 @@ implementation of LAPB. Therefore the LAPB modules would be called by
unintelligent X.25 card drivers and not by intelligent ones, this would
provide a uniform device driver interface, and simplify configuration.
-To confuse matters a little, an 802.2 LLC implementation for Linux is being
-written which will allow X.25 to be run over an Ethernet (or Token Ring) and
-conform with the JNT "Pink Book", this will have a different interface to
-the Packet Layer but there will be no confusion since the class of device
-being served by the LLC will be completely separate from LAPB. The LLC
-implementation is being done as part of another protocol project (SNA) and
-by a different author.
+To confuse matters a little, an 802.2 LLC implementation is also possible
+which could allow X.25 to be run over an Ethernet (or Token Ring) and
+conform with the JNT "Pink Book". This would have a different interface to
+the Packet Layer but there would be no confusion since the class of device
+being served by the LLC would be completely separate from LAPB.
Just when you thought that it could not become more confusing, another
option appeared, XOT. This allows X.25 Packet Layer frames to operate over
diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
new file mode 100644
index 000000000000..a6e0ece18be5
--- /dev/null
+++ b/Documentation/networking/xdp-rx-metadata.rst
@@ -0,0 +1,128 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+XDP RX Metadata
+===============
+
+This document describes how an eXpress Data Path (XDP) program can access
+hardware metadata related to a packet using a set of helper functions,
+and how it can pass that metadata on to other consumers.
+
+General Design
+==============
+
+XDP has access to a set of kfuncs to manipulate the metadata in an XDP frame.
+Every device driver that wishes to expose additional packet metadata can
+implement these kfuncs. The set of kfuncs is declared in ``include/net/xdp.h``
+via ``XDP_METADATA_KFUNC_xxx``.
+
+Currently, the following kfuncs are supported. In the future, as more
+metadata is supported, this set will grow:
+
+.. kernel-doc:: net/core/xdp.c
+ :identifiers: bpf_xdp_metadata_rx_timestamp
+
+.. kernel-doc:: net/core/xdp.c
+ :identifiers: bpf_xdp_metadata_rx_hash
+
+.. kernel-doc:: net/core/xdp.c
+ :identifiers: bpf_xdp_metadata_rx_vlan_tag
+
+An XDP program can use these kfuncs to read the metadata into stack
+variables for its own consumption. Or, to pass the metadata on to other
+consumers, an XDP program can store it into the metadata area carried
+ahead of the packet. Not all packets will necessarily have the requested
+metadata available, in which case the driver returns ``-ENODATA``.
+
+Not all kfuncs have to be implemented by the device driver; when not
+implemented, the default ones that return ``-EOPNOTSUPP`` will be used
+to indicate the device driver has not implemented this kfunc.
+
+
+Within an XDP frame, the metadata layout (accessed via ``xdp_buff``) is
+as follows::
+
+ +----------+-----------------+------+
+ | headroom | custom metadata | data |
+ +----------+-----------------+------+
+ ^ ^
+ | |
+ xdp_buff->data_meta xdp_buff->data
+
+An XDP program can store individual metadata items into this ``data_meta``
+area in whichever format it chooses. Later consumers of the metadata
+will have to agree on the format by some out of band contract (like for
+the AF_XDP use case, see below).
+
+AF_XDP
+======
+
+The :doc:`af_xdp` use case implies that there is a contract between the BPF
+program that redirects XDP frames into the ``AF_XDP`` socket (``XSK``) and
+the final consumer. Thus the BPF program manually allocates a fixed number of
+bytes out of metadata via ``bpf_xdp_adjust_meta`` and calls a subset
+of kfuncs to populate it. The userspace ``XSK`` consumer computes
+``xsk_umem__get_data() - METADATA_SIZE`` to locate that metadata.
+Note, ``xsk_umem__get_data`` is defined in ``libxdp`` and
+``METADATA_SIZE`` is an application-specific constant (``AF_XDP`` receive
+descriptor does _not_ explicitly carry the size of the metadata).
+
+Here is the ``AF_XDP`` consumer layout (note missing ``data_meta`` pointer)::
+
+ +----------+-----------------+------+
+ | headroom | custom metadata | data |
+ +----------+-----------------+------+
+ ^
+ |
+ rx_desc->address
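
The pointer arithmetic on the consumer side can be sketched in plain C without any libxdp dependency. ``struct app_meta``, ``write_meta()`` and ``read_meta()`` are hypothetical names for an application-specific contract; a real consumer would obtain ``data`` from ``xsk_umem__get_data()``, and the writer side would be the BPF program after ``bpf_xdp_adjust_meta``.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout agreed between the BPF program and the consumer. */
struct app_meta {
	uint64_t rx_timestamp;
	uint32_t rx_hash;
	uint32_t pad;
};

#define METADATA_SIZE sizeof(struct app_meta)

/* Stand-in for the BPF side: store metadata right before the packet. */
static void write_meta(uint8_t *data, const struct app_meta *in)
{
	assert(data != NULL && in != NULL);
	memcpy(data - METADATA_SIZE, in, METADATA_SIZE);
}

/* Consumer side: the metadata starts METADATA_SIZE bytes before data. */
static void read_meta(const uint8_t *data, struct app_meta *out)
{
	assert(data != NULL && out != NULL);
	memcpy(out, data - METADATA_SIZE, METADATA_SIZE);
}
```

Because the receive descriptor does not carry the metadata size, both sides must compile against the same ``struct app_meta`` definition.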
+
+XDP_PASS
+========
+
+This is the path where the packets processed by the XDP program are passed
+into the kernel. The kernel creates the ``skb`` out of the ``xdp_buff``
+contents. Currently, every driver has custom kernel code to parse
+the descriptors and populate ``skb`` metadata when doing this ``xdp_buff->skb``
+conversion, and the XDP metadata is not used by the kernel when building
+``skbs``. However, TC-BPF programs can access the XDP metadata area using
+the ``data_meta`` pointer.
+
+In the future, we'd like to support a case where an XDP program
+can override some of the metadata used for building ``skbs``.
+
+bpf_redirect_map
+================
+
+``bpf_redirect_map`` can redirect the frame to a different device.
+Some devices (like virtual ethernet links) support running a second XDP
+program after the redirect. However, the final consumer doesn't have
+access to the original hardware descriptor and can't access any of
+the original metadata. The same applies to XDP programs installed
+into devmaps and cpumaps.
+
+This means that for redirected packets only custom metadata is
+currently supported, which has to be prepared by the initial XDP program
+before redirect. If the frame is eventually passed to the kernel, the
+``skb`` created from such a frame won't have any hardware metadata
+populated. If such a packet is later redirected into an ``XSK``,
+that will also only have access to the custom metadata.
+
+bpf_tail_call
+=============
+
+Adding programs that access metadata kfuncs to the ``BPF_MAP_TYPE_PROG_ARRAY``
+is currently not supported.
+
+Supported Devices
+=================
+
+It is possible to query which kfuncs a particular netdev implements via
+netlink. See ``xdp-rx-metadata-features`` attribute set in
+``Documentation/netlink/specs/netdev.yaml``.
+
+Example
+=======
+
+See ``tools/testing/selftests/bpf/progs/xdp_metadata.c`` and
+``tools/testing/selftests/bpf/prog_tests/xdp_metadata.c`` for an example of
+a BPF program that handles XDP metadata.
diff --git a/Documentation/networking/xfrm_device.rst b/Documentation/networking/xfrm_device.rst
index da1073acda96..bfea9d8579ed 100644
--- a/Documentation/networking/xfrm_device.rst
+++ b/Documentation/networking/xfrm_device.rst
@@ -1,10 +1,12 @@
.. SPDX-License-Identifier: GPL-2.0
+.. _xfrm_device:
===============================================
XFRM device - offloading the IPsec computations
===============================================
Shannon Nelson <shannon.nelson@oracle.com>
+Leon Romanovsky <leonro@nvidia.com>
Overview
@@ -18,10 +20,21 @@ can radically increase throughput and decrease CPU utilization. The XFRM
Device interface allows NIC drivers to offer to the stack access to the
hardware offload.
+Right now, there are two types of hardware offload that the kernel supports.
+ * IPsec crypto offload:
+ * NIC performs encrypt/decrypt
+ * Kernel does everything else
+ * IPsec packet offload:
+ * NIC performs encrypt/decrypt
+ * NIC does encapsulation
+ * Kernel and NIC have SA and policy in-sync
+ * NIC handles the SA and policies states
+ * The Kernel talks to the keymanager
+
Userland access to the offload is typically through a system such as
libreswan or KAME/raccoon, but the iproute2 'ip xfrm' command set can
be handy when experimenting. An example command might look something
-like this::
+like this for crypto offload::
ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \
reqid 0x07 replay-window 32 \
@@ -29,6 +42,17 @@ like this::
sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp \
offload dev eth4 dir in
+and for packet offload::
+
+ ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \
+ reqid 0x07 replay-window 32 \
+ aead 'rfc4106(gcm(aes))' 0x44434241343332312423222114131211f4f3f2f1 128 \
+ sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp \
+ offload packet dev eth4 dir in
+
+ ip x p add src 14.0.0.70 dst 14.0.0.52 offload packet dev eth4 dir in \
+ tmpl src 14.0.0.70 dst 14.0.0.52 proto esp reqid 10000 mode transport
+
Yes, that's ugly, but that's what shell scripts and/or libreswan are for.
@@ -40,17 +64,24 @@ Callbacks to implement
/* from include/linux/netdevice.h */
struct xfrmdev_ops {
- int (*xdo_dev_state_add) (struct xfrm_state *x);
+ /* Crypto and Packet offload callbacks */
+ int (*xdo_dev_state_add) (struct xfrm_state *x, struct netlink_ext_ack *extack);
void (*xdo_dev_state_delete) (struct xfrm_state *x);
void (*xdo_dev_state_free) (struct xfrm_state *x);
bool (*xdo_dev_offload_ok) (struct sk_buff *skb,
struct xfrm_state *x);
void (*xdo_dev_state_advance_esn) (struct xfrm_state *x);
+ void (*xdo_dev_state_update_stats) (struct xfrm_state *x);
+
+ /* Solely packet offload callbacks */
+ int (*xdo_dev_policy_add) (struct xfrm_policy *x, struct netlink_ext_ack *extack);
+ void (*xdo_dev_policy_delete) (struct xfrm_policy *x);
+ void (*xdo_dev_policy_free) (struct xfrm_policy *x);
};
-The NIC driver offering ipsec offload will need to implement these
-callbacks to make the offload available to the network stack's
-XFRM subsytem. Additionally, the feature bits NETIF_F_HW_ESP and
+The NIC driver offering ipsec offload will need to implement callbacks
+relevant to supported offload to make the offload available to the network
+stack's XFRM subsystem. Additionally, the feature bits NETIF_F_HW_ESP and
NETIF_F_HW_ESP_TX_CSUM will signal the availability of the offload.
@@ -79,7 +110,8 @@ and an indication of whether it is for Rx or Tx. The driver should
=========== ===================================
0 success
- -EOPNETSUPP offload not supported, try SW IPsec
+ -EOPNOTSUPP offload not supported, try SW IPsec,
+ not applicable for packet offload mode
other fail the request
=========== ===================================
@@ -96,6 +128,7 @@ will serviceable. This can check the packet information to be sure the
offload can be supported (e.g. IPv4 or IPv6, no IPv4 options, etc) and
return true of false to signify its support.
+Crypto offload mode:
When ready to send, the driver needs to inspect the Tx packet for the
offload information, including the opaque context, and set up the packet
send accordingly::
@@ -139,13 +172,25 @@ the stack in xfrm_input().
In ESN mode, xdo_dev_state_advance_esn() is called from xfrm_replay_advance_esn().
Driver will check packet seq number and update HW ESN state machine if needed.
+Packet offload mode:
+HW adds and deletes XFRM headers. So in the RX path, the XFRM stack is bypassed
+if HW reported success. In the TX path, the packet leaves the kernel without the
+extra header and unencrypted; the HW is responsible for adding both.
+
When the SA is removed by the user, the driver's xdo_dev_state_delete()
-is asked to disable the offload. Later, xdo_dev_state_free() is called
-from a garbage collection routine after all reference counts to the state
+and xdo_dev_policy_delete() are asked to disable the offload. Later,
+xdo_dev_state_free() and xdo_dev_policy_free() are called from a garbage
+collection routine after all reference counts to the state and policy
have been removed and any remaining resources can be cleared for the
offload state. How these are used by the driver will depend on specific
hardware needs.
As a netdev is set to DOWN the XFRM stack's netdev listener will call
-xdo_dev_state_delete() and xdo_dev_state_free() on any remaining offloaded
-states.
+xdo_dev_state_delete(), xdo_dev_policy_delete(), xdo_dev_state_free() and
+xdo_dev_policy_free() on any remaining offloaded states.
+
+Because the HW handles the packets, the XFRM core can't count hard and soft
+limits. The HW/driver is responsible for counting them and for providing
+accurate data when xdo_dev_state_update_stats() is called. If one of these
+limits is reached, the driver needs to call xfrm_state_check_expire() to make
+sure that XFRM performs the rekeying sequence.
diff --git a/Documentation/networking/xsk-tx-metadata.rst b/Documentation/networking/xsk-tx-metadata.rst
new file mode 100644
index 000000000000..bd033fe95cca
--- /dev/null
+++ b/Documentation/networking/xsk-tx-metadata.rst
@@ -0,0 +1,81 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
+AF_XDP TX Metadata
+==================
+
+This document describes how to enable offloads when transmitting packets
+via :doc:`af_xdp`. Refer to :doc:`xdp-rx-metadata` on how to access similar
+metadata on the receive side.
+
+General Design
+==============
+
+The headroom for the metadata is reserved via ``tx_metadata_len`` in
+``struct xdp_umem_reg``. The metadata length is therefore the same for
+every socket that shares the same umem. The metadata layout is a fixed UAPI,
+refer to ``union xsk_tx_metadata`` in ``include/uapi/linux/if_xdp.h``.
+Thus, generally, the ``tx_metadata_len`` field above should contain
+``sizeof(union xsk_tx_metadata)``.
+
+The headroom and the metadata itself should be located right before
+``xdp_desc->addr`` in the umem frame. Within a frame, the metadata
+layout is as follows::
+
+ tx_metadata_len
+ / \
+ +-----------------+---------+----------------------------+
+ | xsk_tx_metadata | padding | payload |
+ +-----------------+---------+----------------------------+
+ ^
+ |
+ xdp_desc->addr
+
+An AF_XDP application can request headrooms larger than ``sizeof(struct
+xsk_tx_metadata)``. The kernel will ignore the padding (and will still
+use ``xdp_desc->addr - tx_metadata_len`` to locate
+the ``xsk_tx_metadata``). For the frames that shouldn't carry
+any metadata (i.e., the ones that don't have ``XDP_TX_METADATA`` option),
+the metadata area is ignored by the kernel as well.
+
+The flags field enables the particular offload:
+
+- ``XDP_TXMD_FLAGS_TIMESTAMP``: requests the device to put transmission
+ timestamp into ``tx_timestamp`` field of ``union xsk_tx_metadata``.
+- ``XDP_TXMD_FLAGS_CHECKSUM``: requests the device to calculate L4
+ checksum. ``csum_start`` specifies byte offset of where the checksumming
+ should start and ``csum_offset`` specifies byte offset where the
+ device should store the computed checksum.
+
+Besides the flags above, in order to trigger the offloads, the first
+packet's ``struct xdp_desc`` descriptor should set ``XDP_TX_METADATA``
+bit in the ``options`` field. Also note that in a multi-buffer packet
+only the first chunk should carry the metadata.
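
The layout above can be sketched as follows. This is an illustration only: ``struct tx_meta_sketch`` is a local stand-in whose field layout is an assumption, ``SKETCH_FLAG_CHECKSUM`` carries an illustrative value, and the helper names are hypothetical. Real applications should use the definitions from ``include/uapi/linux/if_xdp.h``.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in for the UAPI metadata; the layout is an assumption. */
struct tx_meta_sketch {
	uint64_t flags;
	uint16_t csum_start;
	uint16_t csum_offset;
	uint32_t pad;
};

#define SKETCH_FLAG_CHECKSUM (1ULL << 1) /* illustrative value only */

/* The metadata is located at xdp_desc->addr - tx_metadata_len. */
static uint8_t *tx_meta_ptr(uint8_t *umem, uint64_t desc_addr,
			    uint32_t tx_metadata_len)
{
	assert(umem != NULL && desc_addr >= tx_metadata_len);
	return umem + desc_addr - tx_metadata_len;
}

/* Fill in a checksum request ahead of the payload at desc_addr. Any
 * bytes between the metadata and the payload are padding. */
static void request_l4_csum(uint8_t *umem, uint64_t desc_addr,
			    uint32_t tx_metadata_len,
			    uint16_t csum_start, uint16_t csum_offset)
{
	struct tx_meta_sketch meta = {
		.flags = SKETCH_FLAG_CHECKSUM,
		.csum_start = csum_start,
		.csum_offset = csum_offset,
	};

	assert(tx_metadata_len >= sizeof(meta));
	memcpy(tx_meta_ptr(umem, desc_addr, tx_metadata_len),
	       &meta, sizeof(meta));
}
```

The descriptor for such a frame would additionally need ``XDP_TX_METADATA`` set in its ``options`` field, which this sketch does not model.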
+
+Software TX Checksum
+====================
+
+For development and testing purposes it's possible to pass the
+``XDP_UMEM_TX_SW_CSUM`` flag to the ``XDP_UMEM_REG`` UMEM registration call.
+In this case, when running in ``XDP_COPY`` mode, the TX checksum
+is calculated on the CPU. Do not enable this option in production because
+it will negatively affect performance.
+
+Querying Device Capabilities
+============================
+
+Every device exports its offload capabilities via the netlink netdev family.
+Refer to the ``xsk-flags`` features bitmask in
+``Documentation/netlink/specs/netdev.yaml``.
+
+- ``tx-timestamp``: device supports ``XDP_TXMD_FLAGS_TIMESTAMP``
+- ``tx-checksum``: device supports ``XDP_TXMD_FLAGS_CHECKSUM``
+
+See ``tools/net/ynl/samples/netdev.c`` on how to query this information.
+
+Example
+=======
+
+See ``tools/testing/selftests/bpf/xdp_hw_metadata.c`` for an example
+program that handles TX metadata. Also see https://github.com/fomichev/xskgen
+for a more bare-bones example.
diff --git a/Documentation/networking/z8530book.rst b/Documentation/networking/z8530book.rst
deleted file mode 100644
index fea2c40e7973..000000000000
--- a/Documentation/networking/z8530book.rst
+++ /dev/null
@@ -1,256 +0,0 @@
-=======================
-Z8530 Programming Guide
-=======================
-
-:Author: Alan Cox
-
-Introduction
-============
-
-The Z85x30 family synchronous/asynchronous controller chips are used on
-a large number of cheap network interface cards. The kernel provides a
-core interface layer that is designed to make it easy to provide WAN
-services using this chip.
-
-The current driver only support synchronous operation. Merging the
-asynchronous driver support into this code to allow any Z85x30 device to
-be used as both a tty interface and as a synchronous controller is a
-project for Linux post the 2.4 release
-
-Driver Modes
-============
-
-The Z85230 driver layer can drive Z8530, Z85C30 and Z85230 devices in
-three different modes. Each mode can be applied to an individual channel
-on the chip (each chip has two channels).
-
-The PIO synchronous mode supports the most common Z8530 wiring. Here the
-chip is interface to the I/O and interrupt facilities of the host
-machine but not to the DMA subsystem. When running PIO the Z8530 has
-extremely tight timing requirements. Doing high speeds, even with a
-Z85230 will be tricky. Typically you should expect to achieve at best
-9600 baud with a Z8C530 and 64Kbits with a Z85230.
-
-The DMA mode supports the chip when it is configured to use dual DMA
-channels on an ISA bus. The better cards tend to support this mode of
-operation for a single channel. With DMA running the Z85230 tops out
-when it starts to hit ISA DMA constraints at about 512Kbits. It is worth
-noting here that many PC machines hang or crash when the chip is driven
-fast enough to hold the ISA bus solid.
-
-Transmit DMA mode uses a single DMA channel. The DMA channel is used for
-transmission as the transmit FIFO is smaller than the receive FIFO. it
-gives better performance than pure PIO mode but is nowhere near as ideal
-as pure DMA mode.
-
-Using the Z85230 driver
-=======================
-
-The Z85230 driver provides the back end interface to your board. To
-configure a Z8530 interface you need to detect the board and to identify
-its ports and interrupt resources. It is also your problem to verify the
-resources are available.
-
-Having identified the chip you need to fill in a struct z8530_dev,
-which describes each chip. This object must exist until you finally
-shutdown the board. Firstly zero the active field. This ensures nothing
-goes off without you intending it. The irq field should be set to the
-interrupt number of the chip. (Each chip has a single interrupt source
-rather than each channel). You are responsible for allocating the
-interrupt line. The interrupt handler should be set to
-:c:func:`z8530_interrupt()`. The device id should be set to the
-z8530_dev structure pointer. Whether the interrupt can be shared or not
-is board dependent, and up to you to initialise.
-
-The structure holds two channel structures. Initialise chanA.ctrlio and
-chanA.dataio with the address of the control and data ports. You can or
-this with Z8530_PORT_SLEEP to indicate your interface needs the 5uS
-delay for chip settling done in software. The PORT_SLEEP option is
-architecture specific. Other flags may become available on future
-platforms, eg for MMIO. Initialise the chanA.irqs to &z8530_nop to
-start the chip up as disabled and discarding interrupt events. This
-ensures that stray interrupts will be mopped up and not hang the bus.
-Set chanA.dev to point to the device structure itself. The private and
-name field you may use as you wish. The private field is unused by the
-Z85230 layer. The name is used for error reporting and it may thus make
-sense to make it match the network name.
-
-Repeat the same operation with the B channel if your chip has both
-channels wired to something useful. This isn't always the case. If it is
-not wired then the I/O values do not matter, but you must initialise
-chanB.dev.
-
-If your board has DMA facilities then initialise the txdma and rxdma
-fields for the relevant channels. You must also allocate the ISA DMA
-channels and do any necessary board level initialisation to configure
-them. The low level driver will do the Z8530 and DMA controller
-programming but not board specific magic.
-
-Having initialised the device you can then call
-:c:func:`z8530_init()`. This will probe the chip and reset it into
-a known state. An identification sequence is then run to identify the
-chip type. If the checks fail to pass the function returns a non zero
-error code. Typically this indicates that the port given is not valid.
-After this call the type field of the z8530_dev structure is
-initialised to either Z8530, Z85C30 or Z85230 according to the chip
-found.
-
-Once you have called z8530_init you can also make use of the utility
-function :c:func:`z8530_describe()`. This provides a consistent
-reporting format for the Z8530 devices, and allows all the drivers to
-provide consistent reporting.
-
-Attaching Network Interfaces
-============================
-
-If you wish to use the network interface facilities of the driver, then
-you need to attach a network device to each channel that is present and
-in use. In addition to use the generic HDLC you need to follow some
-additional plumbing rules. They may seem complex but a look at the
-example hostess_sv11 driver should reassure you.
-
-The network device used for each channel should be pointed to by the
-netdevice field of each channel. The hdlc-> priv field of the network
-device points to your private data - you will need to be able to find
-your private data from this.
-
-The way most drivers approach this particular problem is to create a
-structure holding the Z8530 device definition and put that into the
-private field of the network device. The network device fields of the
-channels then point back to the network devices.
-
-If you wish to use the generic HDLC then you need to register the HDLC
-device.
-
-Before you register your network device you will also need to provide
-suitable handlers for most of the network device callbacks. See the
-network device documentation for more details on this.
-
-Configuring And Activating The Port
-===================================
-
-The Z85230 driver provides helper functions and tables to load the port
-registers on the Z8530 chips. When programming the register settings for
-a channel be aware that the documentation recommends initialisation
-orders. Strange things happen when these are not followed.
-
-:c:func:`z8530_channel_load()` takes an array of pairs of
-initialisation values in an array of u8 type. The first value is the
-Z8530 register number. Add 16 to indicate the alternate register bank on
-the later chips. The array is terminated by a 255.
-
-The driver provides a pair of public tables. The z8530_hdlc_kilostream
-table is for the UK 'Kilostream' service and also happens to cover most
-other end host configurations. The z8530_hdlc_kilostream_85230 table
-is the same configuration using the enhancements of the 85230 chip. The
-configuration loaded is standard NRZ encoded synchronous data with HDLC
-bitstuffing. All of the timing is taken from the other end of the link.
-
-When writing your own tables be aware that the driver internally tracks
-register values. It may need to reload values. You should therefore be
-sure to set registers 1-7, 9-11, 14 and 15 in all configurations. Where
-the register settings depend on DMA selection the driver will update the
-bits itself when you open or close. Loading a new table with the
-interface open is not recommended.
-
-There are three standard configurations supported by the core code. In
-PIO mode the interface is programmed up to use interrupt driven PIO.
-This places high demands on the host processor to avoid latency. The
-driver is written to take account of latency issues but it cannot avoid
-latencies caused by other drivers, notably IDE in PIO mode. Because the
-drivers allocate buffers you must also prevent MTU changes while the
-port is open.
-
-Once the port is open it will call the rx_function of each channel
-whenever a completed packet arrived. This is invoked from interrupt
-context and passes you the channel and a network buffer (struct
-sk_buff) holding the data. The data includes the CRC bytes so most
-users will want to trim the last two bytes before processing the data.
-This function is very timing critical. When you wish to simply discard
-data the support code provides the function
-:c:func:`z8530_null_rx()` to discard the data.
-
-To active PIO mode sending and receiving the ``z8530_sync_open`` is called.
-This expects to be passed the network device and the channel. Typically
-this is called from your network device open callback. On a failure a
-non zero error status is returned.
-The :c:func:`z8530_sync_close()` function shuts down a PIO
-channel. This must be done before the channel is opened again and before
-the driver shuts down and unloads.
-
-The ideal mode of operation is dual channel DMA mode. Here the kernel
-driver will configure the board for DMA in both directions. The driver
-also handles ISA DMA issues such as controller programming and the
-memory range limit for you. This mode is activated by calling the
-:c:func:`z8530_sync_dma_open()` function. On failure a non zero
-error value is returned. Once this mode is activated it can be shut down
-by calling the :c:func:`z8530_sync_dma_close()`. You must call
-the close function matching the open mode you used.
-
-The final supported mode uses a single DMA channel to drive the transmit
-side. As the Z85C30 has a larger FIFO on the receive channel this tends
-to increase the maximum speed a little. This is activated by calling the
-``z8530_sync_txdma_open``. This returns a non zero error code on failure. The
-:c:func:`z8530_sync_txdma_close()` function closes down the Z8530
-interface from this mode.
-
-Network Layer Functions
-=======================
-
-The Z8530 layer provides functions to queue packets for transmission.
-The driver internally buffers the frame currently being transmitted and
-one further frame (in order to keep back to back transmission running).
-Any further buffering is up to the caller.
-
-The function :c:func:`z8530_queue_xmit()` takes a network buffer
-in sk_buff format and queues it for transmission. The caller must
-provide the entire packet with the exception of the bitstuffing and CRC.
-This is normally done by the caller via the generic HDLC interface
-layer. It returns 0 if the buffer has been queued and non zero values
-for queue full. If the function accepts the buffer it becomes property
-of the Z8530 layer and the caller should not free it.
-
-The function :c:func:`z8530_get_stats()` returns a pointer to an
-internally maintained per interface statistics block. This provides most
-of the interface code needed to implement the network layer get_stats
-callback.
-
-Porting The Z8530 Driver
-========================
-
-The Z8530 driver is written to be portable. In DMA mode it makes
-assumptions about the use of ISA DMA. These are probably warranted in
-most cases as the Z85230 in particular was designed to glue to PC type
-machines. The PIO mode makes no real assumptions.
-
-Should you need to retarget the Z8530 driver to another architecture the
-only code that should need changing are the port I/O functions. At the
-moment these assume PC I/O port accesses. This may not be appropriate
-for all platforms. Replacing :c:func:`z8530_read_port()` and
-``z8530_write_port`` is intended to be all that is required to port
-this driver layer.
-
-Known Bugs And Assumptions
-==========================
-
-Interrupt Locking
- The locking in the driver is done via the global cli/sti lock. This
- makes for relatively poor SMP performance. Switching this to use a
- per device spin lock would probably materially improve performance.
-
-Occasional Failures
- We have reports of occasional failures when run for very long
- periods of time and the driver starts to receive junk frames. At the
- moment the cause of this is not clear.
-
-Public Functions Provided
-=========================
-
-.. kernel-doc:: drivers/net/wan/z85230.c
- :export:
-
-Internal Functions
-==================
-
-.. kernel-doc:: drivers/net/wan/z85230.c
- :internal: