Diffstat (limited to 'Documentation/networking')
-rw-r--r--Documentation/networking/6lowpan.rst (renamed from Documentation/networking/6lowpan.txt)29
-rw-r--r--Documentation/networking/6pack.rst (renamed from Documentation/networking/6pack.txt)46
-rw-r--r--Documentation/networking/af_xdp.rst106
-rw-r--r--Documentation/networking/arcnet-hardware.rst (renamed from Documentation/networking/arcnet-hardware.txt)2227
-rw-r--r--Documentation/networking/arcnet.rst (renamed from Documentation/networking/arcnet.txt)350
-rw-r--r--Documentation/networking/atm.rst (renamed from Documentation/networking/atm.txt)6
-rw-r--r--Documentation/networking/ax25.rst (renamed from Documentation/networking/ax25.txt)8
-rw-r--r--Documentation/networking/bareudp.rst58
-rw-r--r--Documentation/networking/batman-adv.rst10
-rw-r--r--Documentation/networking/bonding.rst (renamed from Documentation/networking/bonding.txt)1411
-rw-r--r--Documentation/networking/caif/caif.rst7
-rw-r--r--Documentation/networking/caif/index.rst12
-rw-r--r--Documentation/networking/caif/linux_caif.rst (renamed from Documentation/networking/caif/Linux-CAIF.txt)54
-rw-r--r--Documentation/networking/caif/spi_porting.txt208
-rw-r--r--Documentation/networking/can.rst76
-rw-r--r--Documentation/networking/can_ucan_protocol.rst4
-rw-r--r--Documentation/networking/cdc_mbim.rst (renamed from Documentation/networking/cdc_mbim.txt)76
-rw-r--r--Documentation/networking/checksum-offloads.rst2
-rw-r--r--Documentation/networking/cops.txt63
-rw-r--r--Documentation/networking/dccp.rst (renamed from Documentation/networking/dccp.txt)42
-rw-r--r--Documentation/networking/dctcp.rst (renamed from Documentation/networking/dctcp.txt)14
-rw-r--r--Documentation/networking/decnet.txt230
-rw-r--r--Documentation/networking/device_drivers/appletalk/cops.rst80
-rw-r--r--Documentation/networking/device_drivers/appletalk/index.rst18
-rw-r--r--Documentation/networking/device_drivers/atm/cxacru-cf.py (renamed from Documentation/networking/cxacru-cf.py)0
-rw-r--r--Documentation/networking/device_drivers/atm/cxacru.rst (renamed from Documentation/networking/cxacru.txt)86
-rw-r--r--Documentation/networking/device_drivers/atm/fore200e.rst (renamed from Documentation/networking/fore200e.txt)8
-rw-r--r--Documentation/networking/device_drivers/atm/index.rst20
-rw-r--r--Documentation/networking/device_drivers/atm/iphase.rst (renamed from Documentation/networking/iphase.txt)187
-rw-r--r--Documentation/networking/device_drivers/cable/index.rst18
-rw-r--r--Documentation/networking/device_drivers/cable/sb1000.rst222
-rw-r--r--Documentation/networking/device_drivers/can/can327.rst331
-rw-r--r--Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst639
-rw-r--r--Documentation/networking/device_drivers/can/ctu/fsm_txt_buffer_user.svg151
-rw-r--r--Documentation/networking/device_drivers/can/freescale/flexcan.rst54
-rw-r--r--Documentation/networking/device_drivers/can/index.rst22
-rw-r--r--Documentation/networking/device_drivers/cellular/index.rst18
-rw-r--r--Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst197
-rw-r--r--Documentation/networking/device_drivers/dec/de4x5.txt178
-rw-r--r--Documentation/networking/device_drivers/ethernet/3com/3c509.rst (renamed from Documentation/networking/device_drivers/3com/3c509.txt)162
-rw-r--r--Documentation/networking/device_drivers/ethernet/3com/vortex.rst (renamed from Documentation/networking/device_drivers/3com/vortex.txt)229
-rw-r--r--Documentation/networking/device_drivers/ethernet/altera/altera_tse.rst (renamed from Documentation/networking/altera_tse.txt)89
-rw-r--r--Documentation/networking/device_drivers/ethernet/amazon/ena.rst (renamed from Documentation/networking/device_drivers/amazon/ena.txt)281
-rw-r--r--Documentation/networking/device_drivers/ethernet/aquantia/atlantic.rst (renamed from Documentation/networking/device_drivers/aquantia/atlantic.txt)369
-rw-r--r--Documentation/networking/device_drivers/ethernet/chelsio/cxgb.rst (renamed from Documentation/networking/device_drivers/chelsio/cxgb.txt)183
-rw-r--r--Documentation/networking/device_drivers/ethernet/cirrus/cs89x0.rst (renamed from Documentation/networking/device_drivers/cirrus/cs89x0.txt)559
-rw-r--r--Documentation/networking/device_drivers/ethernet/davicom/dm9000.rst (renamed from Documentation/networking/device_drivers/davicom/dm9000.txt)24
-rw-r--r--Documentation/networking/device_drivers/ethernet/dec/dmfe.rst (renamed from Documentation/networking/device_drivers/dec/dmfe.txt)35
-rw-r--r--Documentation/networking/device_drivers/ethernet/dlink/dl2k.rst (renamed from Documentation/networking/device_drivers/dlink/dl2k.txt)228
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa.rst (renamed from Documentation/networking/device_drivers/freescale/dpaa.txt)141
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa2/dpio-driver.rst (renamed from Documentation/networking/device_drivers/freescale/dpaa2/dpio-driver.rst)7
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa2/ethernet-driver.rst (renamed from Documentation/networking/device_drivers/freescale/dpaa2/ethernet-driver.rst)3
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa2/index.rst (renamed from Documentation/networking/device_drivers/freescale/dpaa2/index.rst)1
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst (renamed from Documentation/networking/device_drivers/freescale/dpaa2/mac-phy-support.rst)0
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst (renamed from Documentation/networking/device_drivers/freescale/dpaa2/overview.rst)1
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa2/switch-driver.rst217
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/gianfar.rst (renamed from Documentation/networking/device_drivers/freescale/gianfar.txt)21
-rw-r--r--Documentation/networking/device_drivers/ethernet/google/gve.rst (renamed from Documentation/networking/device_drivers/google/gve.rst)53
-rw-r--r--Documentation/networking/device_drivers/ethernet/huawei/hinic.rst (renamed from Documentation/networking/hinic.txt)5
-rw-r--r--Documentation/networking/device_drivers/ethernet/index.rst62
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/e100.rst (renamed from Documentation/networking/device_drivers/intel/e100.rst)6
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/e1000.rst (renamed from Documentation/networking/device_drivers/intel/e1000.rst)0
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/e1000e.rst (renamed from Documentation/networking/device_drivers/intel/e1000e.rst)0
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/fm10k.rst (renamed from Documentation/networking/device_drivers/intel/fm10k.rst)2
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/i40e.rst (renamed from Documentation/networking/device_drivers/intel/i40e.rst)10
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/iavf.rst (renamed from Documentation/networking/device_drivers/intel/iavf.rst)6
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/ice.rst1040
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/igb.rst (renamed from Documentation/networking/device_drivers/intel/igb.rst)2
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/igbvf.rst (renamed from Documentation/networking/device_drivers/intel/igbvf.rst)2
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/ixgb.rst (renamed from Documentation/networking/device_drivers/intel/ixgb.rst)4
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst (renamed from Documentation/networking/device_drivers/intel/ixgbe.rst)16
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/ixgbevf.rst (renamed from Documentation/networking/device_drivers/intel/ixgbevf.rst)0
-rw-r--r--Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst35
-rw-r--r--Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst (renamed from Documentation/networking/device_drivers/marvell/octeontx2.rst)130
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst762
-rw-r--r--Documentation/networking/device_drivers/ethernet/microsoft/netvsc.rst (renamed from Documentation/networking/device_drivers/microsoft/netvsc.txt)73
-rw-r--r--Documentation/networking/device_drivers/ethernet/neterion/s2io.rst196
-rw-r--r--Documentation/networking/device_drivers/ethernet/netronome/nfp.rst (renamed from Documentation/networking/device_drivers/netronome/nfp.rst)0
-rw-r--r--Documentation/networking/device_drivers/ethernet/pensando/ionic.rst274
-rw-r--r--Documentation/networking/device_drivers/ethernet/smsc/smc9.rst48
-rw-r--r--Documentation/networking/device_drivers/ethernet/stmicro/stmmac.rst (renamed from Documentation/networking/device_drivers/stmicro/stmmac.rst)7
-rw-r--r--Documentation/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst143
-rw-r--r--Documentation/networking/device_drivers/ethernet/ti/cpsw.rst587
-rw-r--r--Documentation/networking/device_drivers/ethernet/ti/cpsw_switchdev.rst (renamed from Documentation/networking/device_drivers/ti/cpsw_switchdev.txt)243
-rw-r--r--Documentation/networking/device_drivers/ethernet/ti/tlan.rst (renamed from Documentation/networking/device_drivers/ti/tlan.txt)73
-rw-r--r--Documentation/networking/device_drivers/ethernet/toshiba/spider_net.rst (renamed from Documentation/networking/device_drivers/toshiba/spider_net.txt)60
-rw-r--r--Documentation/networking/device_drivers/ethernet/wangxun/ngbe.rst14
-rw-r--r--Documentation/networking/device_drivers/ethernet/wangxun/txgbe.rst20
-rw-r--r--Documentation/networking/device_drivers/fddi/defza.rst (renamed from Documentation/networking/defza.txt)10
-rw-r--r--Documentation/networking/device_drivers/fddi/index.rst19
-rw-r--r--Documentation/networking/device_drivers/fddi/skfp.rst (renamed from Documentation/networking/skfp.txt)153
-rw-r--r--Documentation/networking/device_drivers/hamradio/baycom.rst (renamed from Documentation/networking/baycom.txt)120
-rw-r--r--Documentation/networking/device_drivers/hamradio/index.rst19
-rw-r--r--Documentation/networking/device_drivers/hamradio/z8530drv.rst (renamed from Documentation/networking/z8530drv.txt)629
-rw-r--r--Documentation/networking/device_drivers/index.rst34
-rw-r--r--Documentation/networking/device_drivers/intel/ice.rst46
-rw-r--r--Documentation/networking/device_drivers/mellanox/mlx5.rst321
-rw-r--r--Documentation/networking/device_drivers/neterion/s2io.txt141
-rw-r--r--Documentation/networking/device_drivers/neterion/vxge.txt93
-rw-r--r--Documentation/networking/device_drivers/pensando/ionic.rst45
-rw-r--r--Documentation/networking/device_drivers/qlogic/LICENSE.qla3xxx46
-rw-r--r--Documentation/networking/device_drivers/qlogic/LICENSE.qlcnic288
-rw-r--r--Documentation/networking/device_drivers/qlogic/LICENSE.qlge288
-rw-r--r--Documentation/networking/device_drivers/qlogic/index.rst18
-rw-r--r--Documentation/networking/device_drivers/qlogic/qlge.rst118
-rw-r--r--Documentation/networking/device_drivers/qualcomm/rmnet.txt82
-rw-r--r--Documentation/networking/device_drivers/sb1000.txt207
-rw-r--r--Documentation/networking/device_drivers/smsc/smc9.txt42
-rw-r--r--Documentation/networking/device_drivers/ti/cpsw.txt541
-rw-r--r--Documentation/networking/device_drivers/wifi/index.rst20
-rw-r--r--Documentation/networking/device_drivers/wifi/intel/ipw2100.rst (renamed from Documentation/networking/device_drivers/intel/ipw2100.txt)242
-rw-r--r--Documentation/networking/device_drivers/wifi/intel/ipw2200.rst (renamed from Documentation/networking/device_drivers/intel/ipw2200.txt)414
-rw-r--r--Documentation/networking/device_drivers/wifi/ray_cs.rst (renamed from Documentation/networking/ray_cs.txt)105
-rw-r--r--Documentation/networking/device_drivers/wwan/index.rst19
-rw-r--r--Documentation/networking/device_drivers/wwan/iosm.rst96
-rw-r--r--Documentation/networking/device_drivers/wwan/t7xx.rst120
-rw-r--r--Documentation/networking/devlink/am65-nuss-cpsw-switch.rst26
-rw-r--r--Documentation/networking/devlink/bnxt.rst16
-rw-r--r--Documentation/networking/devlink/devlink-dpipe.rst2
-rw-r--r--Documentation/networking/devlink/devlink-flash.rst121
-rw-r--r--Documentation/networking/devlink/devlink-health.rst17
-rw-r--r--Documentation/networking/devlink/devlink-info.rst142
-rw-r--r--Documentation/networking/devlink/devlink-linecard.rst122
-rw-r--r--Documentation/networking/devlink/devlink-params.rst35
-rw-r--r--Documentation/networking/devlink/devlink-port.rst234
-rw-r--r--Documentation/networking/devlink/devlink-region.rst27
-rw-r--r--Documentation/networking/devlink/devlink-reload.rst81
-rw-r--r--Documentation/networking/devlink/devlink-resource.rst14
-rw-r--r--Documentation/networking/devlink/devlink-selftests.rst38
-rw-r--r--Documentation/networking/devlink/devlink-trap.rst348
-rw-r--r--Documentation/networking/devlink/hns3.rst25
-rw-r--r--Documentation/networking/devlink/ice.rst256
-rw-r--r--Documentation/networking/devlink/index.rst25
-rw-r--r--Documentation/networking/devlink/iosm.rst162
-rw-r--r--Documentation/networking/devlink/mlx5.rst17
-rw-r--r--Documentation/networking/devlink/mlxsw.rst24
-rw-r--r--Documentation/networking/devlink/netdevsim.rst29
-rw-r--r--Documentation/networking/devlink/octeontx2.rst42
-rw-r--r--Documentation/networking/devlink/prestera.rst141
-rw-r--r--Documentation/networking/dns_resolver.rst (renamed from Documentation/networking/dns_resolver.txt)52
-rw-r--r--Documentation/networking/driver.rst (renamed from Documentation/networking/driver.txt)24
-rw-r--r--Documentation/networking/dsa/configuration.rst494
-rw-r--r--Documentation/networking/dsa/dsa.rst816
-rw-r--r--Documentation/networking/dsa/sja1105.rst263
-rw-r--r--Documentation/networking/eql.rst (renamed from Documentation/networking/eql.txt)443
-rw-r--r--Documentation/networking/ethtool-netlink.rst1326
-rw-r--r--Documentation/networking/fib_trie.rst (renamed from Documentation/networking/fib_trie.txt)16
-rw-r--r--Documentation/networking/filter.rst685
-rw-r--r--Documentation/networking/filter.txt1545
-rw-r--r--Documentation/networking/framerelay.txt39
-rw-r--r--Documentation/networking/gen_stats.rst (renamed from Documentation/networking/gen_stats.txt)98
-rw-r--r--Documentation/networking/generic-hdlc.rst (renamed from Documentation/networking/generic-hdlc.txt)86
-rw-r--r--Documentation/networking/generic_netlink.rst (renamed from Documentation/networking/generic_netlink.txt)6
-rw-r--r--Documentation/networking/gtp.rst (renamed from Documentation/networking/gtp.txt)97
-rw-r--r--Documentation/networking/ieee802154.rst22
-rw-r--r--Documentation/networking/ila.rst (renamed from Documentation/networking/ila.txt)89
-rw-r--r--Documentation/networking/index.rst94
-rw-r--r--Documentation/networking/ioam6-sysctl.rst26
-rw-r--r--Documentation/networking/ip-sysctl.rst (renamed from Documentation/networking/ip-sysctl.txt)1338
-rw-r--r--Documentation/networking/ip_dynaddr.rst (renamed from Documentation/networking/ip_dynaddr.txt)29
-rw-r--r--Documentation/networking/ipddp.rst (renamed from Documentation/networking/ipddp.txt)13
-rw-r--r--Documentation/networking/ipsec.rst (renamed from Documentation/networking/ipsec.txt)14
-rw-r--r--Documentation/networking/ipv6.rst (renamed from Documentation/networking/ipv6.txt)8
-rw-r--r--Documentation/networking/ipvlan.rst (renamed from Documentation/networking/ipvlan.txt)161
-rw-r--r--Documentation/networking/ipvs-sysctl.rst (renamed from Documentation/networking/ipvs-sysctl.txt)202
-rw-r--r--Documentation/networking/j1939.rst166
-rw-r--r--Documentation/networking/kapi.rst30
-rw-r--r--Documentation/networking/kcm.rst (renamed from Documentation/networking/kcm.txt)85
-rw-r--r--Documentation/networking/l2tp.rst677
-rw-r--r--Documentation/networking/l2tp.txt345
-rw-r--r--Documentation/networking/lapb-module.rst (renamed from Documentation/networking/lapb-module.txt)122
-rw-r--r--Documentation/networking/ltpc.txt131
-rw-r--r--Documentation/networking/mac80211-injection.rst (renamed from Documentation/networking/mac80211-injection.txt)43
-rw-r--r--Documentation/networking/mctp.rst320
-rw-r--r--Documentation/networking/mpls-sysctl.rst (renamed from Documentation/networking/mpls-sysctl.txt)17
-rw-r--r--Documentation/networking/mptcp-sysctl.rst76
-rw-r--r--Documentation/networking/msg_zerocopy.rst2
-rw-r--r--Documentation/networking/multiqueue.rst (renamed from Documentation/networking/multiqueue.txt)41
-rw-r--r--Documentation/networking/net_dim.rst (renamed from Documentation/networking/net_dim.txt)96
-rw-r--r--Documentation/networking/net_failover.rst111
-rw-r--r--Documentation/networking/netconsole.rst (renamed from Documentation/networking/netconsole.txt)125
-rw-r--r--Documentation/networking/netdev-FAQ.rst272
-rw-r--r--Documentation/networking/netdev-features.rst (renamed from Documentation/networking/netdev-features.txt)40
-rw-r--r--Documentation/networking/netdevices.rst299
-rw-r--r--Documentation/networking/netdevices.txt104
-rw-r--r--Documentation/networking/netfilter-sysctl.rst (renamed from Documentation/networking/netfilter-sysctl.txt)11
-rw-r--r--Documentation/networking/netif-msg.rst95
-rw-r--r--Documentation/networking/netif-msg.txt79
-rw-r--r--Documentation/networking/nexthop-group-resilient.rst293
-rw-r--r--Documentation/networking/nf_conntrack-sysctl.rst (renamed from Documentation/networking/nf_conntrack-sysctl.txt)99
-rw-r--r--Documentation/networking/nf_flowtable.rst235
-rw-r--r--Documentation/networking/nf_flowtable.txt112
-rw-r--r--Documentation/networking/openvswitch.rst (renamed from Documentation/networking/openvswitch.txt)23
-rw-r--r--Documentation/networking/operstates.rst (renamed from Documentation/networking/operstates.txt)51
-rw-r--r--Documentation/networking/packet_mmap.rst1083
-rw-r--r--Documentation/networking/packet_mmap.txt1061
-rw-r--r--Documentation/networking/page_pool.rst223
-rw-r--r--Documentation/networking/phonet.rst (renamed from Documentation/networking/phonet.txt)56
-rw-r--r--Documentation/networking/phy.rst51
-rw-r--r--Documentation/networking/pktgen.rst (renamed from Documentation/networking/pktgen.txt)324
-rw-r--r--Documentation/networking/plip.rst (renamed from Documentation/networking/PLIP.txt)43
-rw-r--r--Documentation/networking/ppp_generic.rst (renamed from Documentation/networking/ppp_generic.txt)68
-rw-r--r--Documentation/networking/proc_net_tcp.rst (renamed from Documentation/networking/proc_net_tcp.txt)23
-rw-r--r--Documentation/networking/radiotap-headers.rst (renamed from Documentation/networking/radiotap-headers.txt)99
-rw-r--r--Documentation/networking/rds.rst (renamed from Documentation/networking/rds.txt)305
-rw-r--r--Documentation/networking/regulatory.rst (renamed from Documentation/networking/regulatory.txt)35
-rw-r--r--Documentation/networking/representors.rst259
-rw-r--r--Documentation/networking/rxrpc.rst (renamed from Documentation/networking/rxrpc.txt)326
-rw-r--r--Documentation/networking/scaling.rst10
-rw-r--r--Documentation/networking/sctp.rst (renamed from Documentation/networking/sctp.txt)37
-rw-r--r--Documentation/networking/secid.rst (renamed from Documentation/networking/secid.txt)6
-rw-r--r--Documentation/networking/seg6-sysctl.rst39
-rw-r--r--Documentation/networking/seg6-sysctl.txt18
-rw-r--r--Documentation/networking/sfp-phylink.rst57
-rw-r--r--Documentation/networking/skbuff.rst37
-rw-r--r--Documentation/networking/smc-sysctl.rst61
-rw-r--r--Documentation/networking/snmp_counter.rst34
-rw-r--r--Documentation/networking/statistics.rst220
-rw-r--r--Documentation/networking/strparser.rst (renamed from Documentation/networking/strparser.txt)85
-rw-r--r--Documentation/networking/switchdev.rst (renamed from Documentation/networking/switchdev.txt)311
-rw-r--r--Documentation/networking/sysfs-tagging.rst48
-rw-r--r--Documentation/networking/tc-actions-env-rules.rst29
-rw-r--r--Documentation/networking/tc-actions-env-rules.txt24
-rw-r--r--Documentation/networking/tcp-thin.rst (renamed from Documentation/networking/tcp-thin.txt)5
-rw-r--r--Documentation/networking/team.rst (renamed from Documentation/networking/team.txt)6
-rw-r--r--Documentation/networking/timestamping.rst (renamed from Documentation/networking/timestamping.txt)357
-rw-r--r--Documentation/networking/tipc.rst215
-rw-r--r--Documentation/networking/tls-offload.rst29
-rw-r--r--Documentation/networking/tls.rst47
-rw-r--r--Documentation/networking/tproxy.rst (renamed from Documentation/networking/tproxy.txt)57
-rw-r--r--Documentation/networking/tuntap.rst (renamed from Documentation/networking/tuntap.txt)200
-rw-r--r--Documentation/networking/udplite.rst (renamed from Documentation/networking/udplite.txt)175
-rw-r--r--Documentation/networking/vrf.rst464
-rw-r--r--Documentation/networking/vrf.txt418
-rw-r--r--Documentation/networking/vxlan.rst (renamed from Documentation/networking/vxlan.txt)61
-rw-r--r--Documentation/networking/x25-iface.rst82
-rw-r--r--Documentation/networking/x25-iface.txt123
-rw-r--r--Documentation/networking/x25.rst (renamed from Documentation/networking/x25.txt)16
-rw-r--r--Documentation/networking/xfrm_device.rst (renamed from Documentation/networking/xfrm_device.txt)35
-rw-r--r--Documentation/networking/xfrm_proc.rst (renamed from Documentation/networking/xfrm_proc.txt)31
-rw-r--r--Documentation/networking/xfrm_sync.rst (renamed from Documentation/networking/xfrm_sync.txt)66
-rw-r--r--Documentation/networking/xfrm_sysctl.rst (renamed from Documentation/networking/xfrm_sysctl.txt)7
-rw-r--r--Documentation/networking/z8530book.rst256
243 files changed, 25198 insertions, 14421 deletions
diff --git a/Documentation/networking/6lowpan.txt b/Documentation/networking/6lowpan.rst
index 2e5a939d7e6f..e70a6520cc33 100644
--- a/Documentation/networking/6lowpan.txt
+++ b/Documentation/networking/6lowpan.rst
@@ -1,37 +1,40 @@
+.. SPDX-License-Identifier: GPL-2.0
-Netdev private dataroom for 6lowpan interfaces:
+==============================================
+Netdev private dataroom for 6lowpan interfaces
+==============================================
All 6lowpan able net devices, means all interfaces with ARPHRD_6LOWPAN,
must have "struct lowpan_priv" placed at beginning of netdev_priv.
-The priv_size of each interface should be calculate by:
+The priv_size of each interface should be calculate by::
dev->priv_size = LOWPAN_PRIV_SIZE(LL_6LOWPAN_PRIV_DATA);
Where LL_PRIV_6LOWPAN_DATA is sizeof linklayer 6lowpan private data struct.
-To access the LL_PRIV_6LOWPAN_DATA structure you can cast:
+To access the LL_PRIV_6LOWPAN_DATA structure you can cast::
lowpan_priv(dev)-priv;
to your LL_6LOWPAN_PRIV_DATA structure.
-Before registering the lowpan netdev interface you must run:
+Before registering the lowpan netdev interface you must run::
lowpan_netdev_setup(dev, LOWPAN_LLTYPE_FOOBAR);
wheres LOWPAN_LLTYPE_FOOBAR is a define for your 6LoWPAN linklayer type of
enum lowpan_lltypes.
-Example to evaluate the private usually you can do:
+Example to evaluate the private usually you can do::
-static inline struct lowpan_priv_foobar *
-lowpan_foobar_priv(struct net_device *dev)
-{
+ static inline struct lowpan_priv_foobar *
+ lowpan_foobar_priv(struct net_device *dev)
+ {
return (struct lowpan_priv_foobar *)lowpan_priv(dev)->priv;
-}
+ }
-switch (dev->type) {
-case ARPHRD_6LOWPAN:
+ switch (dev->type) {
+ case ARPHRD_6LOWPAN:
lowpan_priv = lowpan_priv(dev);
/* do great stuff which is ARPHRD_6LOWPAN related */
switch (lowpan_priv->lltype) {
@@ -42,8 +45,8 @@ case ARPHRD_6LOWPAN:
...
}
break;
-...
-}
+ ...
+ }
In case of generic 6lowpan branch ("net/6lowpan") you can remove the check
on ARPHRD_6LOWPAN, because you can be sure that these function are called
diff --git a/Documentation/networking/6pack.txt b/Documentation/networking/6pack.rst
index 8f339428fdf4..bc5bf1f1a98f 100644
--- a/Documentation/networking/6pack.txt
+++ b/Documentation/networking/6pack.rst
@@ -1,27 +1,36 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+6pack Protocol
+==============
+
This is the 6pack-mini-HOWTO, written by
Andreas Könsgen DG3KQ
-Internet: ajk@comnets.uni-bremen.de
-AMPR-net: dg3kq@db0pra.ampr.org
-AX.25: dg3kq@db0ach.#nrw.deu.eu
+
+:Internet: ajk@comnets.uni-bremen.de
+:AMPR-net: dg3kq@db0pra.ampr.org
+:AX.25: dg3kq@db0ach.#nrw.deu.eu
Last update: April 7, 1998
1. What is 6pack, and what are the advantages to KISS?
+======================================================
6pack is a transmission protocol for data exchange between the PC and
the TNC over a serial line. It can be used as an alternative to KISS.
6pack has two major advantages:
+
- The PC is given full control over the radio
channel. Special control data is exchanged between the PC and the TNC so
that the PC knows at any time if the TNC is receiving data, if a TNC
buffer underrun or overrun has occurred, if the PTT is
set and so on. This control data is processed at a higher priority than
normal data, so a data stream can be interrupted at any time to issue an
- important event. This helps to improve the channel access and timing
- algorithms as everything is computed in the PC. It would even be possible
- to experiment with something completely different from the known CSMA and
+ important event. This helps to improve the channel access and timing
+ algorithms as everything is computed in the PC. It would even be possible
+ to experiment with something completely different from the known CSMA and
DAMA channel access methods.
This kind of real-time control is especially important to supply several
TNCs that are connected between each other and the PC by a daisy chain
@@ -36,6 +45,7 @@ More details about 6pack are described in the file 6pack.ps that is located
in the doc directory of the AX.25 utilities package.
2. Who has developed the 6pack protocol?
+========================================
The 6pack protocol has been developed by Ekki Plicht DF4OR, Henning Rech
DF9IC and Gunter Jost DK7WJ. A driver for 6pack, written by Gunter Jost and
@@ -44,12 +54,14 @@ They have also written a firmware for TNCs to perform the 6pack
protocol (see section 4 below).
3. Where can I get the latest version of 6pack for LinuX?
+=========================================================
At the moment, the 6pack stuff can obtained via anonymous ftp from
db0bm.automation.fh-aachen.de. In the directory /incoming/dg3kq,
there is a file named 6pack.tgz.
4. Preparing the TNC for 6pack operation
+========================================
To be able to use 6pack, a special firmware for the TNC is needed. The EPROM
of a newly bought TNC does not contain 6pack, so you will have to
@@ -75,12 +87,14 @@ and the status LED are lit for about a second if the firmware initialises
the TNC correctly.
5. Building and installing the 6pack driver
+===========================================
The driver has been tested with kernel version 2.1.90. Use with older
kernels may lead to a compilation error because the interface to a kernel
function has been changed in the 2.1.8x kernels.
How to turn on 6pack support:
+=============================
- In the linux kernel configuration program, select the code maturity level
options menu and turn on the prompting for development drivers.
@@ -94,27 +108,28 @@ To use the driver, the kissattach program delivered with the AX.25 utilities
has to be modified.
- Do a cd to the directory that holds the kissattach sources. Edit the
- kissattach.c file. At the top, insert the following lines:
+ kissattach.c file. At the top, insert the following lines::
+
+ #ifndef N_6PACK
+ #define N_6PACK (N_AX25+1)
+ #endif
- #ifndef N_6PACK
- #define N_6PACK (N_AX25+1)
- #endif
+ Then find the line:
- Then find the line
-
- int disc = N_AX25;
+ int disc = N_AX25;
and replace N_AX25 by N_6PACK.
- Recompile kissattach. Rename it to spattach to avoid confusions.
Installing the driver:
+----------------------
-- Do an insmod 6pack. Look at your /var/log/messages file to check if the
+- Do an insmod 6pack. Look at your /var/log/messages file to check if the
module has printed its initialization message.
- Do a spattach as you would launch kissattach when starting a KISS port.
- Check if the kernel prints the message '6pack: TNC found'.
+ Check if the kernel prints the message '6pack: TNC found'.
- From here, everything should work as if you were setting up a KISS port.
The only difference is that the network device that represents
@@ -138,6 +153,7 @@ from the PC to the TNC over the serial line, the status LED if data is
sent to the PC.
6. Known problems
+=================
When testing the driver with 2.0.3x kernels and
operating with data rates on the radio channel of 9600 Baud or higher,
diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
index 5bc55a4e3bce..60b217b436be 100644
--- a/Documentation/networking/af_xdp.rst
+++ b/Documentation/networking/af_xdp.rst
@@ -243,8 +243,8 @@ Configuration Flags and Socket Options
These are the various configuration flags that can be used to control
and monitor the behavior of AF_XDP sockets.
-XDP_COPY and XDP_ZERO_COPY bind flags
--------------------------------------
+XDP_COPY and XDP_ZEROCOPY bind flags
+------------------------------------
When you bind to a socket, the kernel will first try to use zero-copy
copy. If zero-copy is not supported, it will fall back on using copy
@@ -252,20 +252,27 @@ mode, i.e. copying all packets out to user space. But if you would
like to force a certain mode, you can use the following flags. If you
pass the XDP_COPY flag to the bind call, the kernel will force the
socket into copy mode. If it cannot use copy mode, the bind call will
-fail with an error. Conversely, the XDP_ZERO_COPY flag will force the
+fail with an error. Conversely, the XDP_ZEROCOPY flag will force the
socket into zero-copy mode or fail.
XDP_SHARED_UMEM bind flag
-------------------------
-This flag enables you to bind multiple sockets to the same UMEM, but
-only if they share the same queue id. In this mode, each socket has
-their own RX and TX rings, but the UMEM (tied to the fist socket
-created) only has a single FILL ring and a single COMPLETION
-ring. To use this mode, create the first socket and bind it in the normal
-way. Create a second socket and create an RX and a TX ring, or at
-least one of them, but no FILL or COMPLETION rings as the ones from
-the first socket will be used. In the bind call, set he
+This flag enables you to bind multiple sockets to the same UMEM. It
+works on the same queue id, between queue ids and between
+netdevs/devices. In this mode, each socket has their own RX and TX
+rings as usual, but you are going to have one or more FILL and
+COMPLETION ring pairs. You have to create one of these pairs per
+unique netdev and queue id tuple that you bind to.
+
+Starting with the case where we would like to share a UMEM between
+sockets bound to the same netdev and queue id. The UMEM (tied to the
+first socket created) will only have a single FILL ring and a single
+COMPLETION ring as there is only one unique netdev,queue_id tuple that
+we have bound to. To use this mode, create the first socket and bind
+it in the normal way. Create a second socket and create an RX and a TX
+ring, or at least one of them, but no FILL or COMPLETION rings as the
+ones from the first socket will be used. In the bind call, set the
XDP_SHARED_UMEM option and provide the initial socket's fd in the
sxdp_shared_umem_fd field. You can attach an arbitrary number of extra
sockets this way.
@@ -283,19 +290,19 @@ round-robin example of distributing packets is shown below:
#define MAX_SOCKS 16
struct {
- __uint(type, BPF_MAP_TYPE_XSKMAP);
- __uint(max_entries, MAX_SOCKS);
- __uint(key_size, sizeof(int));
- __uint(value_size, sizeof(int));
+ __uint(type, BPF_MAP_TYPE_XSKMAP);
+ __uint(max_entries, MAX_SOCKS);
+ __uint(key_size, sizeof(int));
+ __uint(value_size, sizeof(int));
} xsks_map SEC(".maps");
static unsigned int rr;
SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
{
- rr = (rr + 1) & (MAX_SOCKS - 1);
+ rr = (rr + 1) & (MAX_SOCKS - 1);
- return bpf_redirect_map(&xsks_map, rr, XDP_DROP);
+ return bpf_redirect_map(&xsks_map, rr, XDP_DROP);
}
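The masking used in the round-robin program above acts as a modulo only when MAX_SOCKS is a power of two. The following plain user-space sketch (not BPF code; `next_slot` is a hypothetical helper) makes that assumption explicit at compile time.

```c
#define MAX_SOCKS 16

/* The mask below only behaves like a modulo for power-of-two sizes. */
_Static_assert((MAX_SOCKS & (MAX_SOCKS - 1)) == 0,
	       "MAX_SOCKS must be a power of two");

/* Advance a round-robin index the same way the XDP program does. */
static unsigned int next_slot(unsigned int rr)
{
	return (rr + 1) & (MAX_SOCKS - 1);
}
```

If MAX_SOCKS were, say, 12, the static assertion would fail at build time instead of silently skipping some map slots.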
Note, that since there is only a single set of FILL and COMPLETION
@@ -305,11 +312,41 @@ concurrently. There are no synchronization primitives in the
libbpf code that protects multiple users at this point in time.
Libbpf uses this mode if you create more than one socket tied to the
-same umem. However, note that you need to supply the
+same UMEM. However, note that you need to supply the
XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD libbpf_flag with the
xsk_socket__create calls and load your own XDP program as there is no
built in one in libbpf that will route the traffic for you.
+The second case is when you share a UMEM between sockets that are
+bound to different queue ids and/or netdevs. In this case you have to
+create one FILL ring and one COMPLETION ring for each unique
+netdev,queue_id pair. Let us say you want to create two sockets bound
+to two different queue ids on the same netdev. Create the first socket
+and bind it in the normal way. Create a second socket and create an RX
+and a TX ring, or at least one of them, and then one FILL and
+COMPLETION ring for this socket. Then in the bind call, set the
+XDP_SHARED_UMEM option and provide the initial socket's fd in the
+sxdp_shared_umem_fd field as you registered the UMEM on that
+socket. These two sockets will now share one and the same UMEM.
+
+There is no need to supply an XDP program like the one in the previous
+case where sockets were bound to the same queue id and
+device. Instead, use the NIC's packet steering capabilities to steer
+the packets to the right queue. In the previous example, there is only
+one queue shared among sockets, so the NIC cannot do this steering. It
+can only steer between queues.
+
+In libbpf, you need to use the xsk_socket__create_shared() API as it
+takes a reference to a FILL ring and a COMPLETION ring that will be
+created for you and bound to the shared UMEM. You can use this
+function for all the sockets you create, or you can use it for the
+second and following ones and use xsk_socket__create() for the first
+one. Both methods yield the same result.
+
+Note that a UMEM can be shared between sockets on the same queue id
+and device, as well as between queues on the same device and between
+devices at the same time.
+
XDP_USE_NEED_WAKEUP bind flag
-----------------------------
@@ -342,7 +379,7 @@ would look like this for the TX path:
.. code-block:: c
if (xsk_ring_prod__needs_wakeup(&my_tx_ring))
- sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0);
+ sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0);
I.e., only use the syscall if the flag is set.
@@ -364,7 +401,7 @@ resources by only setting up one of them. Both the FILL ring and the
COMPLETION ring are mandatory as you need to have a UMEM tied to your
socket. But if the XDP_SHARED_UMEM flag is used, any socket after the
first one does not have a UMEM and should in that case not have any
-FILL or COMPLETION rings created as the ones from the shared umem will
+FILL or COMPLETION rings created as the ones from the shared UMEM will
be used. Note, that the rings are single-producer single-consumer, so
do not try to access them from multiple processes at the same
time. See the XDP_SHARED_UMEM section.
@@ -405,9 +442,9 @@ purposes. The supported statistics are shown below:
.. code-block:: c
struct xdp_statistics {
- __u64 rx_dropped; /* Dropped for reasons other than invalid desc */
- __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
- __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
+ __u64 rx_dropped; /* Dropped for reasons other than invalid desc */
+ __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
+ __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
};
XDP_OPTIONS getsockopt
@@ -446,15 +483,15 @@ like this:
.. code-block:: c
// struct xdp_rxtx_ring {
- // __u32 *producer;
- // __u32 *consumer;
- // struct xdp_desc *desc;
+ // __u32 *producer;
+ // __u32 *consumer;
+ // struct xdp_desc *desc;
// };
// struct xdp_umem_ring {
- // __u32 *producer;
- // __u32 *consumer;
- // __u64 *desc;
+ // __u32 *producer;
+ // __u32 *consumer;
+ // __u64 *desc;
// };
// typedef struct xdp_rxtx_ring RING;
@@ -567,6 +604,17 @@ A: The short answer is no, that is not supported at the moment. The
switch, or other distribution mechanism, in your NIC to direct
traffic to the correct queue id and socket.
+Q: My packets are sometimes corrupted. What is wrong?
+
+A: Care has to be taken not to feed the same buffer in the UMEM into
+ more than one ring at the same time. If you for example feed the
+ same buffer into the FILL ring and the TX ring at the same time, the
+ NIC might receive data into the buffer at the same time it is
+ sending it. This will cause some packets to become corrupted. Same
+ thing goes for feeding the same buffer into the FILL rings
+ belonging to different queue ids or netdevs bound with the
+ XDP_SHARED_UMEM flag.
+
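One possible way to guard against the double-submission problem described in the answer above is to track, in the application, which ring currently owns each UMEM frame. This is only a sketch under that assumption; `NUM_FRAMES`, `frame_submit` and `frame_release` are hypothetical names and not part of the AF_XDP or libbpf API.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_FRAMES 4096	/* hypothetical number of UMEM frames */

enum frame_owner { OWNER_APP = 0, OWNER_FILL, OWNER_TX };

/* Zero-initialized, so every frame starts out owned by the application. */
static enum frame_owner owner[NUM_FRAMES];

/*
 * Hand a frame to a ring. Returns false, and changes nothing, if the
 * index is out of range or the frame is already owned by a ring.
 */
static bool frame_submit(size_t idx, enum frame_owner ring)
{
	if (idx >= NUM_FRAMES || ring == OWNER_APP || owner[idx] != OWNER_APP)
		return false;
	owner[idx] = ring;
	return true;
}

/* Call when the frame comes back via the RX or COMPLETION ring. */
static void frame_release(size_t idx)
{
	if (idx < NUM_FRAMES)
		owner[idx] = OWNER_APP;
}
```

A real application would call frame_submit() before writing the frame address to the FILL or TX ring and treat a false return as a bug.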
Credits
=======
diff --git a/Documentation/networking/arcnet-hardware.txt b/Documentation/networking/arcnet-hardware.rst
index 731de411513c..ac249ac8fcf2 100644
--- a/Documentation/networking/arcnet-hardware.txt
+++ b/Documentation/networking/arcnet-hardware.rst
@@ -1,11 +1,15 @@
-
------------------------------------------------------------------------------
-1) This file is a supplement to arcnet.txt. Please read that for general
- driver configuration help.
------------------------------------------------------------------------------
-2) This file is no longer Linux-specific. It should probably be moved out of
- the kernel sources. Ideas?
------------------------------------------------------------------------------
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+ARCnet Hardware
+===============
+
+.. note::
+
+ 1) This file is a supplement to arcnet.txt. Please read that for general
+ driver configuration help.
+ 2) This file is no longer Linux-specific. It should probably be moved out
+ of the kernel sources. Ideas?
Because so many people (myself included) seem to have obtained ARCnet cards
without manuals, this file contains a quick introduction to ARCnet hardware,
@@ -14,8 +18,8 @@ e-mail apenwarr@worldvisions.ca with any settings for your particular card,
or any other information you have!
-INTRODUCTION TO ARCNET
-----------------------
+Introduction to ARCnet
+======================
ARCnet is a network type which works in a way similar to popular Ethernet
networks but which is also different in some very important ways.
@@ -30,7 +34,7 @@ since I only have the 2.5 Mbps variety. It is probably not going to saturate
your 100 Mbps card. Stop complaining. :)
You also cannot connect an ARCnet card to any kind of Ethernet card and
-expect it to work.
+expect it to work.
There are two "types" of ARCnet - STAR topology and BUS topology. This
refers to how the cards are meant to be wired together. According to most
@@ -71,19 +75,24 @@ although they are generally kept down to the Ethernet-style 1500 bytes.
For more information on the advantages and disadvantages (mostly the
advantages) of ARCnet networks, you might try the "ARCnet Trade Association"
WWW page:
+
http://www.arcnet.com
-CABLING ARCNET NETWORKS
------------------------
+Cabling ARCnet Networks
+=======================
+
+This section was rewritten by
+
+ Vojtech Pavlik <vojtech@suse.cz>
-This section was rewritten by
- Vojtech Pavlik <vojtech@suse.cz>
using information from several people, including:
- Avery Pennraun <apenwarr@worldvisions.ca>
- Stephen A. Wood <saw@hallc1.cebaf.gov>
- John Paul Morrison <jmorriso@bogomips.ee.ubc.ca>
- Joachim Koenig <jojo@repas.de>
+
+ - Avery Pennarun <apenwarr@worldvisions.ca>
+ - Stephen A. Wood <saw@hallc1.cebaf.gov>
+ - John Paul Morrison <jmorriso@bogomips.ee.ubc.ca>
+ - Joachim Koenig <jojo@repas.de>
+
and Avery touched it up a bit, at Vojtech's request.
ARCnet (the classic 2.5 Mbps version) can be connected by two different
@@ -103,13 +112,13 @@ equal to a high impedance one with a terminator installed.
Usually, the ARCnet networks are built up from STAR cards and hubs. There
are two types of hubs - active and passive. Passive hubs are small boxes
-with four BNC connectors containing four 47 Ohm resistors:
+with four BNC connectors containing four 47 Ohm resistors::
- | | wires
- R + junction
--R-+-R- R 47 Ohm resistors
- R
- |
+ | | wires
+ R + junction
+ -R-+-R- R 47 Ohm resistors
+ R
+ |
The shielding is connected together. Active hubs are much more complicated;
they are powered and contain electronics to amplify the signal and send it
@@ -127,14 +136,15 @@ And now to the cabling. What you can connect together:
2. A card to a passive hub. Remember that all unused connectors on the hub
must be properly terminated with 93 Ohm (or something else if you don't
have the right ones) terminators.
- (Avery's note: oops, I didn't know that. Mine (TV cable) works
+
+ (Avery's note: oops, I didn't know that. Mine (TV cable) works
anyway, though.)
3. A card to an active hub. There is no need to terminate the unused
connectors, other than for aesthetic reasons. But, there may not be
more than eleven active hubs between any two computers. That of course
doesn't limit the number of active hubs on the network.
-
+
4. An active hub to another.
5. An active hub to passive hub.
@@ -142,22 +152,22 @@ And now to the cabling. What you can connect together:
Remember that you cannot connect two passive hubs together. The power loss
implied by such a connection is too high for the net to operate reliably.
-An example of a typical ARCnet network:
+An example of a typical ARCnet network::
- R S - STAR type card
+ R S - STAR type card
S------H--------A-------S R - Terminator
- | | H - Hub
- | | A - Active hub
- | S----H----S
- S |
- |
- S
-
+ | | H - Hub
+ | | A - Active hub
+ | S----H----S
+ S |
+ |
+ S
+
The BUS topology is very similar to the one used by Ethernet. The only
difference is in cable and terminators: they should be 93 Ohm. Ethernet
uses 50 Ohm impedance. You use T connectors to put the computers on a single
line of cable, the bus. You have to put terminators at both ends of the
-cable. A typical BUS ARCnet network looks like:
+cable. A typical BUS ARCnet network looks like::
RT----T------T------T------T------TR
B B B B B B
@@ -168,63 +178,63 @@ cable. A typical BUS ARCnet network looks like:
But that is not all! The two types can be connected together. According to
the official documentation the only way of connecting them is using an active
-hub:
+hub::
- A------T------T------TR
- | B B B
+ A------T------T------TR
+ | B B B
S---H---S
- |
- S
+ |
+ S
The official docs also state that you can use STAR cards at the ends of
-BUS network in place of a BUS card and a terminator:
+BUS network in place of a BUS card and a terminator::
S------T------T------S
- B B
+ B B
But, according to my own experiments, you can simply hang a BUS type card
anywhere in middle of a cable in a STAR topology network. And more - you
can use the bus card in place of any star card if you use a terminator. Then
you can build very complicated networks fulfilling all your needs! An
-example:
-
- S
- |
- RT------T-------T------H------S
- B B B |
- | R
- S------A------T-------T-------A-------H------TR
- | B B | | B
- | S BT |
- | | | S----A-----S
- S------H---A----S | |
- | | S------T----H---S |
- S S B R S
-
+example::
+
+ S
+ |
+ RT------T-------T------H------S
+ B B B |
+ | R
+ S------A------T-------T-------A-------H------TR
+ | B B | | B
+ | S BT |
+ | | | S----A-----S
+ S------H---A----S | |
+ | | S------T----H---S |
+ S S B R S
+
A basically different cabling scheme is used with Twisted Pair cabling. Each
of the TP cards has two RJ (phone-cord style) connectors. The cards are
then daisy-chained together using a cable connecting every two neighboring
cards. The ends are terminated with RJ 93 Ohm terminators which plug into
-the empty connectors of cards on the ends of the chain. An example:
+the empty connectors of cards on the ends of the chain. An example::
- ___________ ___________
- _R_|_ _|_|_ _|_R_
- | | | | | |
- |Card | |Card | |Card |
- |_____| |_____| |_____|
+ ___________ ___________
+ _R_|_ _|_|_ _|_R_
+ | | | | | |
+ |Card | |Card | |Card |
+ |_____| |_____| |_____|
There are also hubs for the TP topology. There is nothing difficult
involved in using them; you just connect a TP chain to a hub on any end or
-even at both. This way you can create almost any network configuration.
+even at both. This way you can create almost any network configuration.
The maximum of 11 hubs between any two computers on the net applies here as
-well. An example:
+well. An example::
RP-------P--------P--------H-----P------P-----PR
- |
+ |
RP-----H--------P--------H-----P------PR
- | |
- PR PR
+ | |
+ PR PR
R - RJ Terminator
P - TP Card
@@ -234,11 +244,13 @@ Like any network, ARCnet has a limited cable length. These are the maximum
cable lengths between two active ends (an active end being an active hub or
a STAR card).
+ ========== ======= ===========
RG-62 93 Ohm up to 650 m
RG-59/U 75 Ohm up to 457 m
RG-11/U 75 Ohm up to 533 m
IBM Type 1 150 Ohm up to 200 m
IBM Type 3 100 Ohm up to 100 m
+ ========== ======= ===========
The maximum length of all cables connected to a passive hub is limited to 65
meters for RG-62 cabling; less for others. You can see that using passive
@@ -248,8 +260,8 @@ most distant points of the net is limited to 3000 meters. The maximum length
of a TP cable between two cards/hubs is 650 meters.
-SETTING THE JUMPERS
--------------------
+Setting the Jumpers
+===================
All ARCnet cards should have a total of four or five different settings:
@@ -261,43 +273,51 @@ All ARCnet cards should have a total of four or five different settings:
eating net connections on my system (at least) otherwise. My guess is
this may be because, if your card is at 0x2E0, probing for a serial port
at 0x2E8 will reset the card and probably mess things up royally.
+
- Avery's favourite: 0x300.
- the IRQ: on 8-bit cards, it might be 2 (9), 3, 4, 5, or 7.
- on 16-bit cards, it might be 2 (9), 3, 4, 5, 7, or 10-15.
-
+ on 16-bit cards, it might be 2 (9), 3, 4, 5, 7, or 10-15.
+
Make sure this is different from any other card on your system. Note
that IRQ2 is the same as IRQ9, as far as Linux is concerned. You can
"cat /proc/interrupts" for a somewhat complete list of which ones are in
use at any given time. Here is a list of common usages from Vojtech
Pavlik <vojtech@suse.cz>:
- ("Not on bus" means there is no way for a card to generate this
+
+ ("Not on bus" means there is no way for a card to generate this
interrupt)
- IRQ 0 - Timer 0 (Not on bus)
- IRQ 1 - Keyboard (Not on bus)
- IRQ 2 - IRQ Controller 2 (Not on bus, nor does interrupt the CPU)
- IRQ 3 - COM2
- IRQ 4 - COM1
- IRQ 5 - FREE (LPT2 if you have it; sometimes COM3; maybe PLIP)
- IRQ 6 - Floppy disk controller
- IRQ 7 - FREE (LPT1 if you don't use the polling driver; PLIP)
- IRQ 8 - Realtime Clock Interrupt (Not on bus)
- IRQ 9 - FREE (VGA vertical sync interrupt if enabled)
- IRQ 10 - FREE
- IRQ 11 - FREE
- IRQ 12 - FREE
- IRQ 13 - Numeric Coprocessor (Not on bus)
- IRQ 14 - Fixed Disk Controller
- IRQ 15 - FREE (Fixed Disk Controller 2 if you have it)
-
- Note: IRQ 9 is used on some video cards for the "vertical retrace"
- interrupt. This interrupt would have been handy for things like
- video games, as it occurs exactly once per screen refresh, but
- unfortunately IBM cancelled this feature starting with the original
- VGA and thus many VGA/SVGA cards do not support it. For this
- reason, no modern software uses this interrupt and it can almost
- always be safely disabled, if your video card supports it at all.
-
+
+ ====== =========================================================
+ IRQ 0 Timer 0 (Not on bus)
+ IRQ 1 Keyboard (Not on bus)
+ IRQ 2 IRQ Controller 2 (Not on bus, nor does interrupt the CPU)
+ IRQ 3 COM2
+ IRQ 4 COM1
+ IRQ 5 FREE (LPT2 if you have it; sometimes COM3; maybe PLIP)
+ IRQ 6 Floppy disk controller
+ IRQ 7 FREE (LPT1 if you don't use the polling driver; PLIP)
+ IRQ 8 Realtime Clock Interrupt (Not on bus)
+ IRQ 9 FREE (VGA vertical sync interrupt if enabled)
+ IRQ 10 FREE
+ IRQ 11 FREE
+ IRQ 12 FREE
+ IRQ 13 Numeric Coprocessor (Not on bus)
+ IRQ 14 Fixed Disk Controller
+ IRQ 15 FREE (Fixed Disk Controller 2 if you have it)
+ ====== =========================================================
+
+
+ .. note::
+
+ IRQ 9 is used on some video cards for the "vertical retrace"
+ interrupt. This interrupt would have been handy for things like
+ video games, as it occurs exactly once per screen refresh, but
+ unfortunately IBM cancelled this feature starting with the original
+ VGA and thus many VGA/SVGA cards do not support it. For this
+ reason, no modern software uses this interrupt and it can almost
+ always be safely disabled, if your video card supports it at all.
+
If your card for some reason CANNOT disable this IRQ (usually there
is a jumper), one solution would be to clip the printed circuit
contact on the board: it's the fourth contact from the left on the
@@ -308,14 +328,18 @@ All ARCnet cards should have a total of four or five different settings:
- the memory address: Unlike most cards, ARCnets use "shared memory" for
copying buffers around. Make SURE it doesn't conflict with any other
used memory in your system!
+
+ ::
+
A0000 - VGA graphics memory (ok if you don't have VGA)
- B0000 - Monochrome text mode
- C0000 \ One of these is your VGA BIOS - usually C0000.
- E0000 /
- F0000 - System BIOS
+ B0000 - Monochrome text mode
+ C0000 \ One of these is your VGA BIOS - usually C0000.
+ E0000 /
+ F0000 - System BIOS
Anything less than 0xA0000 is, well, a BAD idea since it isn't above
640k.
+
- Avery's favourite: 0xD0000
- the station address: Every ARCnet card has its own "unique" network
@@ -326,6 +350,7 @@ All ARCnet cards should have a total of four or five different settings:
neat stuff will probably happen if you DO use them). By the way, if you
haven't already guessed, don't set this the same as any other ARCnet on
your network!
+
- Avery's favourite: 3 and 4. Not that it matters.
- There may be ETS1 and ETS2 settings. These may or may not make a
@@ -336,28 +361,34 @@ All ARCnet cards should have a total of four or five different settings:
requirement here is that all cards on the network with ETS1 and ETS2
jumpers have them in the same position. Chris Hindy <chrish@io.org>
sent in a chart with actual values for this:
+
+ ======= ======= =============== ====================
ET1 ET2 Response Time Reconfiguration Time
- --- --- ------------- --------------------
+ ======= ======= =============== ====================
open open 74.7us 840us
open closed 283.4us 1680us
closed open 561.8us 1680us
closed closed 1118.6us 1680us
-
+ ======= ======= =============== ====================
+
Make sure you set ETS1 and ETS2 to the SAME VALUE for all cards on your
network.
-
-Also, on many cards (not mine, though) there are red and green LED's.
+
+Also, on many cards (not mine, though) there are red and green LEDs.
Vojtech Pavlik <vojtech@suse.cz> tells me this is what they mean:
+
+ =============== =============== =====================================
GREEN RED Status
- ----- --- ------
+ =============== =============== =====================================
OFF OFF Power off
OFF Short flashes Cabling problems (broken cable or not
- terminated)
+ terminated)
OFF (short) ON Card init
ON ON Normal state - everything OK, nothing
- happens
+ happens
ON Long flashes Data transfer
ON OFF Never happens (maybe when wrong ID)
+ =============== =============== =====================================
The following is all the specific information people have sent me about
@@ -366,7 +397,7 @@ huge amounts of duplicated information. I have no time to fix it. If you
want to, PLEASE DO! Just send me a 'diff -u' of all your changes.
The model # is listed right above specifics for that card, so you should be
-able to use your text viewer's "search" function to find the entry you want.
+able to use your text viewer's "search" function to find the entry you want.
If you don't KNOW what kind of card you have, try looking through the
various diagrams to see if you can tell.
@@ -378,8 +409,9 @@ model that is, please e-mail me to say so.
Cards Listed in this file (in this order, mostly):
+ =============== ======================= ====
Manufacturer Model # Bits
- ------------ ------- ----
+ =============== ======================= ====
SMC PC100 8
SMC PC110 8
SMC PC120 8
@@ -404,17 +436,19 @@ Cards Listed in this file (in this order, mostly):
No Name Taiwan R.O.C? 8
No Name Model 9058 8
Tiara Tiara Lancard? 8
-
+ =============== ======================= ====
-** SMC = Standard Microsystems Corp.
-** CNet Tech = CNet Technology, Inc.
+* SMC = Standard Microsystems Corp.
+* CNet Tech = CNet Technology, Inc.
Unclassified Stuff
-------------------
+==================
+
- Please send any other information you can find.
-
- - And some other stuff (more info is welcome!):
+
+ - And some other stuff (more info is welcome!)::
+
From: root@ultraworld.xs4all.nl (Timo Hilbrink)
To: apenwarr@foxnet.net (Avery Pennarun)
Date: Wed, 26 Oct 1994 02:10:32 +0000 (GMT)
@@ -423,7 +457,7 @@ Unclassified Stuff
[...parts deleted...]
About the jumpers: On my PC130 there is one more jumper, located near the
- cable-connector and it's for changing to star or bus topology;
+ cable-connector and it's for changing to star or bus topology;
closed: star - open: bus
On the PC500 are some more jumper-pins, one block labeled with RX,PDN,TXI
and another with ALE,LA17,LA18,LA19 these are undocumented..
@@ -432,136 +466,130 @@ Unclassified Stuff
--- CUT ---
+Standard Microsystems Corp (SMC)
+================================
+
+PC100, PC110, PC120, PC130 (8-bit cards) and PC500, PC600 (16-bit cards)
+------------------------------------------------------------------------
-** Standard Microsystems Corp (SMC) **
-PC100, PC110, PC120, PC130 (8-bit cards)
-PC500, PC600 (16-bit cards)
----------------------------------
- mainly from Avery Pennarun <apenwarr@worldvisions.ca>. Values depicted
are from Avery's setup.
- special thanks to Timo Hilbrink <timoh@xs4all.nl> for noting that PC120,
- 130, 500, and 600 all have the same switches as Avery's PC100.
+ 130, 500, and 600 all have the same switches as Avery's PC100.
PC500/600 have several extra, undocumented pins though. (?)
- PC110 settings were verified by Stephen A. Wood <saw@cebaf.gov>
- Also, the JP- and S-numbers probably don't match your card exactly. Try
to find jumpers/switches with the same number of settings - it's
probably more reliable.
-
-
- JP5 [|] : : : :
-(IRQ Setting) IRQ2 IRQ3 IRQ4 IRQ5 IRQ7
- Put exactly one jumper on exactly one set of pins.
-
-
- 1 2 3 4 5 6 7 8 9 10
- S1 /----------------------------------\
-(I/O and Memory | 1 1 * 0 0 0 0 * 1 1 0 1 |
- addresses) \----------------------------------/
- |--| |--------| |--------|
- (a) (b) (m)
-
- WARNING. It's very important when setting these which way
- you're holding the card, and which way you think is '1'!
-
- If you suspect that your settings are not being made
- correctly, try reversing the direction or inverting the
- switch positions.
-
- a: The first digit of the I/O address.
- Setting Value
- ------- -----
- 00 0
- 01 1
- 10 2
- 11 3
-
- b: The second digit of the I/O address.
- Setting Value
- ------- -----
- 0000 0
- 0001 1
- 0010 2
- ... ...
- 1110 E
- 1111 F
-
- The I/O address is in the form ab0. For example, if
- a is 0x2 and b is 0xE, the address will be 0x2E0.
-
- DO NOT SET THIS LESS THAN 0x200!!!!!
-
-
- m: The first digit of the memory address.
- Setting Value
- ------- -----
- 0000 0
- 0001 1
- 0010 2
- ... ...
- 1110 E
- 1111 F
-
- The memory address is in the form m0000. For example, if
- m is D, the address will be 0xD0000.
-
- DO NOT SET THIS TO C0000, F0000, OR LESS THAN A0000!
-
- 1 2 3 4 5 6 7 8
- S2 /--------------------------\
-(Station Address) | 1 1 0 0 0 0 0 0 |
- \--------------------------/
-
- Setting Value
- ------- -----
- 00000000 00
- 10000000 01
- 01000000 02
- ...
- 01111111 FE
- 11111111 FF
-
- Note that this is binary with the digits reversed!
-
- DO NOT SET THIS TO 0 OR 255 (0xFF)!
+::
+
+ JP5 [|] : : : :
+ (IRQ Setting) IRQ2 IRQ3 IRQ4 IRQ5 IRQ7
+ Put exactly one jumper on exactly one set of pins.
+
+
+ 1 2 3 4 5 6 7 8 9 10
+ S1 /----------------------------------\
+ (I/O and Memory | 1 1 * 0 0 0 0 * 1 1 0 1 |
+ addresses) \----------------------------------/
+ |--| |--------| |--------|
+ (a) (b) (m)
+
+ WARNING. It's very important when setting these which way
+ you're holding the card, and which way you think is '1'!
+
+ If you suspect that your settings are not being made
+ correctly, try reversing the direction or inverting the
+ switch positions.
+
+ a: The first digit of the I/O address.
+ Setting Value
+ ------- -----
+ 00 0
+ 01 1
+ 10 2
+ 11 3
+
+ b: The second digit of the I/O address.
+ Setting Value
+ ------- -----
+ 0000 0
+ 0001 1
+ 0010 2
+ ... ...
+ 1110 E
+ 1111 F
+
+ The I/O address is in the form ab0. For example, if
+ a is 0x2 and b is 0xE, the address will be 0x2E0.
+
+ DO NOT SET THIS LESS THAN 0x200!!!!!
+
+
+ m: The first digit of the memory address.
+ Setting Value
+ ------- -----
+ 0000 0
+ 0001 1
+ 0010 2
+ ... ...
+ 1110 E
+ 1111 F
+
+ The memory address is in the form m0000. For example, if
+ m is D, the address will be 0xD0000.
+
+ DO NOT SET THIS TO C0000, F0000, OR LESS THAN A0000!
+
+ 1 2 3 4 5 6 7 8
+ S2 /--------------------------\
+ (Station Address) | 1 1 0 0 0 0 0 0 |
+ \--------------------------/
+
+ Setting Value
+ ------- -----
+ 00000000 00
+ 10000000 01
+ 01000000 02
+ ...
+ 01111111 FE
+ 11111111 FF
+
+ Note that this is binary with the digits reversed!
+
+ DO NOT SET THIS TO 0 OR 255 (0xFF)!
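The S1 and S2 decoding rules above can be sketched in C as follows. The function names are hypothetical, and the sketch assumes the switch values have already been read as the logical 0/1 digits shown in the tables (see the warning about card orientation).

```c
/* I/O address has the form ab0 (hex): a is 2 bits, b is 4 bits. */
static unsigned int smc_io_addr(unsigned int a, unsigned int b)
{
	return ((a & 0x3) << 8) | ((b & 0xF) << 4);
}

/* Memory address has the form m0000 (hex): m is 4 bits. */
static unsigned long smc_mem_addr(unsigned int m)
{
	return (unsigned long)(m & 0xF) << 16;
}

/*
 * Station address from the S2 switches written as in the table,
 * e.g. "10000000". Switch 1 (the leftmost character) is the least
 * significant bit, so the digits are read in reverse.
 */
static unsigned int smc_station_id(const char *sw)
{
	unsigned int id = 0;

	for (int i = 0; i < 8; i++)
		if (sw[i] == '1')
			id |= 1u << i;
	return id;
}
```

With the settings drawn in the diagrams, smc_io_addr(3, 0) gives 0x300, smc_mem_addr(0xD) gives 0xD0000, and smc_station_id("11000000") gives 3.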
-*****************************************************************************
-** Standard Microsystems Corp (SMC) **
PC130E/PC270E (8-bit cards)
---------------------------
- - from Juergen Seifert <seifert@htwm.de>
-
-STANDARD MICROSYSTEMS CORPORATION (SMC) ARCNET(R)-PC130E/PC270E
-===============================================================
+ - from Juergen Seifert <seifert@htwm.de>
This description has been written by Juergen Seifert <seifert@htwm.de>
-using information from the following Original SMC Manual
+using information from the following Original SMC Manual
- "Configuration Guide for
- ARCNET(R)-PC130E/PC270
- Network Controller Boards
- Pub. # 900.044A
- June, 1989"
+ "Configuration Guide for ARCNET(R)-PC130E/PC270 Network
+ Controller Boards Pub. # 900.044A June, 1989"
ARCNET is a registered trademark of the Datapoint Corporation
-SMC is a registered trademark of the Standard Microsystems Corporation
+SMC is a registered trademark of the Standard Microsystems Corporation
-The PC130E is an enhanced version of the PC130 board, is equipped with a
+The PC130E is an enhanced version of the PC130 board and is equipped with a
standard BNC female connector for connection to RG-62/U coax cable.
Since this board is designed both for point-to-point connection in star
-networks and for connection to bus networks, it is downwardly compatible
+networks and for connection to bus networks, it is downwardly compatible
with all the other standard boards designed for coax networks (that is,
-the PC120, PC110 and PC100 star topology boards and the PC220, PC210 and
+the PC120, PC110 and PC100 star topology boards and the PC220, PC210 and
PC200 bus topology boards).
-The PC270E is an enhanced version of the PC260 board, is equipped with two
+The PC270E is an enhanced version of the PC260 board and is equipped with two
modular RJ11-type jacks for connection to twisted pair wiring.
It can be used in a star or a daisy-chained network.
+::
- 8 7 6 5 4 3 2 1
+ 8 7 6 5 4 3 2 1
________________________________________________________________
| | S1 | |
| |_________________| |
@@ -587,27 +615,27 @@ It can be used in a star or a daisy-chained network.
| |
|_____________________________________________|
-Legend:
+Legend::
-SMC 90C63 ARCNET Controller / Transceiver /Logic
-S1 1-3: I/O Base Address Select
+ SMC 90C63 ARCNET Controller / Transceiver /Logic
+ S1 1-3: I/O Base Address Select
4-6: Memory Base Address Select
7-8: RAM Offset Select
-S2 1-8: Node ID Select
-EXT Extended Timeout Select
-ROM ROM Enable Select
-STAR Selected - Star Topology (PC130E only)
+ S2 1-8: Node ID Select
+ EXT Extended Timeout Select
+ ROM ROM Enable Select
+ STAR Selected - Star Topology (PC130E only)
Deselected - Bus Topology (PC130E only)
-CR3/CR4 Diagnostic LEDs
-J1 BNC RG62/U Connector (PC130E only)
-J1 6-position Telephone Jack (PC270E only)
-J2 6-position Telephone Jack (PC270E only)
+ CR3/CR4 Diagnostic LEDs
+ J1 BNC RG62/U Connector (PC130E only)
+ J1 6-position Telephone Jack (PC270E only)
+ J2 6-position Telephone Jack (PC270E only)
Setting one of the switches to Off/Open means "1", On/Closed means "0".
Setting the Node ID
--------------------
+^^^^^^^^^^^^^^^^^^^
The eight switches in group S2 are used to set the node ID.
These switches work in a way similar to the PC100-series cards; see that
@@ -615,10 +643,10 @@ entry for more information.
Setting the I/O Base Address
-----------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The first three switches in switch group S1 are used to select one
-of eight possible I/O Base addresses using the following table
+of eight possible I/O Base addresses using the following table::
Switch | Hex I/O
@@ -635,14 +663,16 @@ of eight possible I/O Base addresses using the following table
Setting the Base Memory (RAM) buffer Address
---------------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The memory buffer requires 2K of a 16K block of RAM. The base of this
16K block can be located in any of eight positions.
Switches 4-6 of switch group S1 select the Base of the 16K block.
-Within that 16K address space, the buffer may be assigned any one of four
+Within that 16K address space, the buffer may be assigned any one of four
positions, determined by the offset, switches 7 and 8 of group S1.
+::
+
Switch | Hex RAM | Hex ROM
4 5 6 7 8 | Address | Address *)
-----------|---------|-----------
@@ -650,115 +680,111 @@ positions, determined by the offset, switches 7 and 8 of group S1.
0 0 0 0 1 | C0800 | C2000
0 0 0 1 0 | C1000 | C2000
0 0 0 1 1 | C1800 | C2000
- | |
+ | |
0 0 1 0 0 | C4000 | C6000
0 0 1 0 1 | C4800 | C6000
0 0 1 1 0 | C5000 | C6000
0 0 1 1 1 | C5800 | C6000
- | |
+ | |
0 1 0 0 0 | CC000 | CE000
0 1 0 0 1 | CC800 | CE000
0 1 0 1 0 | CD000 | CE000
0 1 0 1 1 | CD800 | CE000
- | |
+ | |
0 1 1 0 0 | D0000 | D2000 (Manufacturer's default)
0 1 1 0 1 | D0800 | D2000
0 1 1 1 0 | D1000 | D2000
0 1 1 1 1 | D1800 | D2000
- | |
+ | |
1 0 0 0 0 | D4000 | D6000
1 0 0 0 1 | D4800 | D6000
1 0 0 1 0 | D5000 | D6000
1 0 0 1 1 | D5800 | D6000
- | |
+ | |
1 0 1 0 0 | D8000 | DA000
1 0 1 0 1 | D8800 | DA000
1 0 1 1 0 | D9000 | DA000
1 0 1 1 1 | D9800 | DA000
- | |
+ | |
1 1 0 0 0 | DC000 | DE000
1 1 0 0 1 | DC800 | DE000
1 1 0 1 0 | DD000 | DE000
1 1 0 1 1 | DD800 | DE000
- | |
+ | |
1 1 1 0 0 | E0000 | E2000
1 1 1 0 1 | E0800 | E2000
1 1 1 1 0 | E1000 | E2000
1 1 1 1 1 | E1800 | E2000
-
-*) To enable the 8K Boot PROM install the jumper ROM.
- The default is jumper ROM not installed.
+
+ *) To enable the 8K Boot PROM install the jumper ROM.
+ The default is jumper ROM not installed.
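The table above follows a regular pattern, which the sketch below encodes: switches 4-6 pick the 16K block, switches 7-8 pick a 2K offset inside it, and the ROM address is always 8K above the block base. The function names are hypothetical, and the base for the all-zero setting (C0000) is an assumption inferred from the rows that are visible here.

```c
/* Base of the 16K block, indexed by S1 switches 4-6 read as a 3-bit number. */
static const unsigned long pc130e_base[8] = {
	0xC0000, 0xC4000, 0xCC000, 0xD0000,
	0xD4000, 0xD8000, 0xDC000, 0xE0000,
};

/* Combine the three block-select switches into an index 0..7. */
static unsigned int pc130e_block(int sw4, int sw5, int sw6)
{
	return (unsigned int)((!!sw4 << 2) | (!!sw5 << 1) | !!sw6);
}

/* sw4..sw8 are the logical values from the table (Off/Open = 1). */
static unsigned long pc130e_ram_addr(int sw4, int sw5, int sw6,
				     int sw7, int sw8)
{
	unsigned int offset = (unsigned int)((!!sw7 << 1) | !!sw8);

	return pc130e_base[pc130e_block(sw4, sw5, sw6)] + offset * 0x800UL;
}

/* The boot PROM address is 8K above the base of the selected block. */
static unsigned long pc130e_rom_addr(int sw4, int sw5, int sw6)
{
	return pc130e_base[pc130e_block(sw4, sw5, sw6)] + 0x2000UL;
}
```

For the manufacturer's default (0 1 1 0 0) this yields RAM at 0xD0000 and ROM at 0xD2000, matching the table.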
Setting the Timeouts and Interrupt
-----------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-The jumpers labeled EXT1 and EXT2 are used to determine the timeout
+The jumpers labeled EXT1 and EXT2 are used to determine the timeout
parameters. These two jumpers are normally left open.
To select a hardware interrupt level set one (only one!) of the jumpers
IRQ2, IRQ3, IRQ4, IRQ5, IRQ7. The Manufacturer's default is IRQ2.
-
+
Configuring the PC130E for Star or Bus Topology
------------------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-The single jumper labeled STAR is used to configure the PC130E board for
+The single jumper labeled STAR is used to configure the PC130E board for
star or bus topology.
-When the jumper is installed, the board may be used in a star network, when
+When the jumper is installed, the board may be used in a star network, when
it is removed, the board can be used in a bus topology.
Diagnostic LEDs
----------------
+^^^^^^^^^^^^^^^
Two diagnostic LEDs are visible on the rear bracket of the board.
The green LED monitors the network activity; the red one shows the
-board activity:
+board activity::
Green | Status Red | Status
-------|------------------- ---------|-------------------
on | normal activity flash/on | data transfer
blink | reconfiguration off | no data transfer;
off | defective board or | incorrect memory or
- | node ID is zero | I/O address
+ | node ID is zero | I/O address
-*****************************************************************************
-
-** Standard Microsystems Corp (SMC) **
PC500/PC550 Longboard (16-bit cards)
--------------------------------------
+------------------------------------
+
- from Juergen Seifert <seifert@htwm.de>
-STANDARD MICROSYSTEMS CORPORATION (SMC) ARCNET-PC500/PC550 Long Board
-=====================================================================
+ .. note::
-Note: There is another Version of the PC500 called Short Version, which
+ There is another Version of the PC500 called Short Version, which
is different in hard- and software! The most important differences
are:
+
- The long board has no Shared memory.
- On the long board the selection of the interrupt is done by binary
- coded switch, on the short board directly by jumper.
-
+ coded switch, on the short board directly by jumper.
+
[Avery's note: pay special attention to that: the long board HAS NO SHARED
-MEMORY. This means the current Linux-ARCnet driver can't use these cards.
+MEMORY. This means the current Linux-ARCnet driver can't use these cards.
I have obtained a PC500Longboard and will be doing some experiments on it in
the future, but don't hold your breath. Thanks again to Juergen Seifert for
his advice about this!]
This description has been written by Juergen Seifert <seifert@htwm.de>
-using information from the following Original SMC Manual
+using information from the following Original SMC Manual
- "Configuration Guide for
- SMC ARCNET-PC500/PC550
- Series Network Controller Boards
- Pub. # 900.033 Rev. A
- November, 1989"
+ "Configuration Guide for SMC ARCNET-PC500/PC550
+ Series Network Controller Boards Pub. # 900.033 Rev. A
+ November, 1989"
ARCNET is a registered trademark of the Datapoint Corporation
-SMC is a registered trademark of the Standard Microsystems Corporation
+SMC is a registered trademark of the Standard Microsystems Corporation
The PC500 is equipped with a standard BNC female connector for connection
to RG-62/U coax cable.
@@ -769,7 +795,9 @@ The PC550 is equipped with two modular RJ11-type jacks for connection
to twisted pair wiring.
It can be used in a star or a daisy-chained (BUS) network.
- 1
+::
+
+ 1
0 9 8 7 6 5 4 3 2 1 6 5 4 3 2 1
____________________________________________________________________
< | SW1 | | SW2 | |
@@ -796,34 +824,34 @@ It can be used in a star or a daisy-chained (BUS) network.
> | | |
<____| |_____________________________________________|
-Legend:
+Legend::
-SW1 1-6: I/O Base Address Select
+ SW1 1-6: I/O Base Address Select
7-10: Interrupt Select
-SW2 1-6: Reserved for Future Use
-SW3 1-8: Node ID Select
-JP2 1-4: Extended Timeout Select
-JP6 Selected - Star Topology (PC500 only)
+ SW2 1-6: Reserved for Future Use
+ SW3 1-8: Node ID Select
+ JP2 1-4: Extended Timeout Select
+ JP6 Selected - Star Topology (PC500 only)
Deselected - Bus Topology (PC500 only)
-CR3 Green Monitors Network Activity
-CR4 Red Monitors Board Activity
-J1 BNC RG62/U Connector (PC500 only)
-J1 6-position Telephone Jack (PC550 only)
-J2 6-position Telephone Jack (PC550 only)
+ CR3 Green Monitors Network Activity
+ CR4 Red Monitors Board Activity
+ J1 BNC RG62/U Connector (PC500 only)
+ J1 6-position Telephone Jack (PC550 only)
+ J2 6-position Telephone Jack (PC550 only)
Setting one of the switches to Off/Open means "1", On/Closed means "0".
Setting the Node ID
--------------------
+^^^^^^^^^^^^^^^^^^^
The eight switches in group SW3 are used to set the node ID. Each node
-attached to the network must have an unique node ID which must be
+attached to the network must have a unique node ID which must be
different from 0.
Switch 1 serves as the least significant bit (LSB).
-The node ID is the sum of the values of all switches set to "1"
-These values are:
+The node ID is the sum of the values of all switches set to "1"
+These values are::
Switch | Value
-------|-------
@@ -836,30 +864,30 @@ These values are:
7 | 64
8 | 128
-Some Examples:
+Some Examples::
- Switch | Hex | Decimal
+ Switch | Hex | Decimal
8 7 6 5 4 3 2 1 | Node ID | Node ID
----------------|---------|---------
0 0 0 0 0 0 0 0 | not allowed
- 0 0 0 0 0 0 0 1 | 1 | 1
+ 0 0 0 0 0 0 0 1 | 1 | 1
0 0 0 0 0 0 1 0 | 2 | 2
0 0 0 0 0 0 1 1 | 3 | 3
. . . | |
0 1 0 1 0 1 0 1 | 55 | 85
. . . | |
1 0 1 0 1 0 1 0 | AA | 170
- . . . | |
+ . . . | |
1 1 1 1 1 1 0 1 | FD | 253
1 1 1 1 1 1 1 0 | FE | 254
- 1 1 1 1 1 1 1 1 | FF | 255
+ 1 1 1 1 1 1 1 1 | FF | 255
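Since the node ID is just the sum of the values of the switches set to "1", a row of the examples table can be read as an 8-bit binary number with switch 8 on the left. The following Python sketch shows this under that assumption; the function name `node_id` is hypothetical and not part of any driver.

```python
def node_id(row):
    """Convert a switch row written as in the table ("8 7 6 5 4 3 2 1",
    where 1 = Off/Open and 0 = On/Closed) into a node ID."""
    bits = row.split()
    if len(bits) != 8 or any(b not in ("0", "1") for b in bits):
        raise ValueError("expected eight switch values, each 0 or 1")
    value = int("".join(bits), 2)   # switch 8 is the most significant bit
    if value == 0:
        raise ValueError("node ID 0 is not allowed")
    return value


if __name__ == "__main__":
    for row in ("0 1 0 1 0 1 0 1", "1 0 1 0 1 0 1 0"):
        nid = node_id(row)
        print(f"{row} -> hex {nid:02X}, decimal {nid}")
```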
Setting the I/O Base Address
-----------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The first six switches in switch group SW1 are used to select one
-of 32 possible I/O Base addresses using the following table
+of 32 possible I/O Base addresses using the following table::
Switch | Hex I/O
6 5 4 3 2 1 | Address
@@ -899,16 +927,18 @@ of 32 possible I/O Base addresses using the following table
Setting the Interrupt
----------------------
+^^^^^^^^^^^^^^^^^^^^^
-Switches seven through ten of switch group SW1 are used to select the
-interrupt level. The interrupt level is binary coded, so selections
+Switches seven through ten of switch group SW1 are used to select the
+interrupt level. The interrupt level is binary coded, so selections
from 0 to 15 would be possible, but only the following eight values will
be supported: 3, 4, 5, 7, 9, 10, 11, 12.
+::
+
Switch | IRQ
- 10 9 8 7 |
- ---------|--------
+ 10 9 8 7 |
+ ---------|--------
0 0 1 1 | 3
0 1 0 0 | 4
0 1 0 1 | 5
@@ -919,52 +949,50 @@ be supported: 3, 4, 5, 7, 9, 10, 11, 12.
1 1 0 0 | 12
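The binary coding described above can be sketched in Python as follows. Switch 10 is taken as the most significant bit, as the table's column order suggests; the function name `pc500_irq` is illustrative, and the supported set is copied from the text.

```python
# IRQ levels the text lists as supported on the long board.
SUPPORTED_IRQS = {3, 4, 5, 7, 9, 10, 11, 12}


def pc500_irq(s10, s9, s8, s7):
    """Decode SW1 switches 10..7 (each 0 or 1) into an IRQ number."""
    for bit in (s10, s9, s8, s7):
        if bit not in (0, 1):
            raise ValueError("switch values must be 0 or 1")
    irq = (s10 << 3) | (s9 << 2) | (s8 << 1) | s7
    if irq not in SUPPORTED_IRQS:
        raise ValueError(f"IRQ {irq} is not a supported setting")
    return irq


if __name__ == "__main__":
    print(pc500_irq(0, 1, 0, 1))
```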
-Setting the Timeouts
---------------------
+Setting the Timeouts
+^^^^^^^^^^^^^^^^^^^^
-The two jumpers JP2 (1-4) are used to determine the timeout parameters.
+The two jumpers JP2 (1-4) are used to determine the timeout parameters.
These two jumpers are normally left open.
Refer to the COM9026 Data Sheet for alternate configurations.
Configuring the PC500 for Star or Bus Topology
-----------------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-The single jumper labeled JP6 is used to configure the PC500 board for
+The single jumper labeled JP6 is used to configure the PC500 board for
star or bus topology.
-When the jumper is installed, the board may be used in a star network, when
+When the jumper is installed, the board may be used in a star network, when
it is removed, the board can be used in a bus topology.
Diagnostic LEDs
----------------
+^^^^^^^^^^^^^^^
Two diagnostic LEDs are visible on the rear bracket of the board.
The green LED monitors the network activity; the red one shows the
-board activity:
+board activity::
Green | Status Red | Status
-------|------------------- ---------|-------------------
on | normal activity flash/on | data transfer
blink | reconfiguration off | no data transfer;
off | defective board or | incorrect memory or
- | node ID is zero | I/O address
+ | node ID is zero | I/O address
-*****************************************************************************
-
-** SMC **
PC710 (8-bit card)
------------------
+
- from J.S. van Oosten <jvoosten@compiler.tdcnet.nl>
-
+
Note: this data is gathered by experimenting and looking at info of other
cards. However, I'm sure I got 99% of the settings right.
The SMC710 card resembles the PC270 card, but is much more basic (i.e. no
-LEDs, RJ11 jacks, etc.) and 8 bit. Here's a little drawing:
+LEDs, RJ11 jacks, etc.) and 8 bit. Here's a little drawing::
- _______________________________________
+ _______________________________________
| +---------+ +---------+ |____
| | S2 | | S1 | |
| +---------+ +---------+ |
@@ -976,12 +1004,12 @@ LEDs, RJ11 jacks, etc.) and 8 bit. Here's a little drawing:
| +===+ |
| |
| .. JP1 +----------+ |
- | .. | big chip | |
+ | .. | big chip | |
| .. | 90C63 | |
| .. | | |
| .. +----------+ |
------- -----------
- |||||||||||||||||||||
+ |||||||||||||||||||||
The row of jumpers at JP1 actually consists of 8 jumpers, (sometimes
labelled) the same as on the PC270, from top to bottom: EXT2, EXT1, ROM,
@@ -992,71 +1020,76 @@ are swapped (S1 is the nodeaddress, S2 sets IO- and RAM-address).
I know it works when connected to a PC110 type ARCnet board.
-
+
*****************************************************************************
-** Possibly SMC **
+Possibly SMC
+============
+
LCS-8830(-T) (8 and 16-bit cards)
---------------------------------
+
- from Mathias Katzer <mkatzer@HRZ.Uni-Bielefeld.DE>
- Marek Michalkiewicz <marekm@i17linuxb.ists.pwr.wroc.pl> says the
LCS-8830 is slightly different from LCS-8830-T. These are 8 bit, BUS
only (the JP0 jumper is hardwired), and BNC only.
-
+
This is a LCS-8830-T made by SMC, I think ('SMC' only appears on one PLCC,
nowhere else, not even on the few Xeroxed sheets from the manual).
-SMC ARCnet Board Type LCS-8830-T
+SMC ARCnet Board Type LCS-8830-T::
- ------------------------------------
- | |
- | JP3 88 8 JP2 |
- | ##### | \ |
- | ##### ET1 ET2 ###|
- | 8 ###|
- | U3 SW 1 JP0 ###| Phone Jacks
- | -- ###|
- | | | |
- | | | SW2 |
- | | | |
- | | | ##### |
- | -- ##### #### BNC Connector
- | ####
- | 888888 JP1 |
- | 234567 |
- -- -------
- |||||||||||||||||||||||||||
- --------------------------
-
-
-SW1: DIP-Switches for Station Address
-SW2: DIP-Switches for Memory Base and I/O Base addresses
-
-JP0: If closed, internal termination on (default open)
-JP1: IRQ Jumpers
-JP2: Boot-ROM enabled if closed
-JP3: Jumpers for response timeout
-
-U3: Boot-ROM Socket
-
-
-ET1 ET2 Response Time Idle Time Reconfiguration Time
-
- 78 86 840
- X 285 316 1680
- X 563 624 1680
- X X 1130 1237 1680
-
-(X means closed jumper)
-
-(DIP-Switch downwards means "0")
+ ------------------------------------
+ | |
+ | JP3 88 8 JP2 |
+ | ##### | \ |
+ | ##### ET1 ET2 ###|
+ | 8 ###|
+ | U3 SW 1 JP0 ###| Phone Jacks
+ | -- ###|
+ | | | |
+ | | | SW2 |
+ | | | |
+ | | | ##### |
+ | -- ##### #### BNC Connector
+ | ####
+ | 888888 JP1 |
+ | 234567 |
+ -- -------
+ |||||||||||||||||||||||||||
+ --------------------------
+
+
+ SW1: DIP-Switches for Station Address
+ SW2: DIP-Switches for Memory Base and I/O Base addresses
+
+ JP0: If closed, internal termination on (default open)
+ JP1: IRQ Jumpers
+ JP2: Boot-ROM enabled if closed
+ JP3: Jumpers for response timeout
+
+ U3: Boot-ROM Socket
+
+
+ ET1 ET2 Response Time Idle Time Reconfiguration Time
+
+ 78 86 840
+ X 285 316 1680
+ X 563 624 1680
+ X X 1130 1237 1680
+
+ (X means closed jumper)
+
+ (DIP-Switch downwards means "0")
The station address is binary-coded with SW1.
The I/O base address is coded with DIP-Switches 6, 7 and 8 of SW2:
+======== ========
Switches Base
678 Address
+======== ========
000 260-26f
100 290-29f
010 2e0-2ef
@@ -1065,19 +1098,22 @@ Switches Base
101 350-35f
011 380-38f
111 3e0-3ef
+======== ========
DIP Switches 1-5 of SW2 encode the RAM and ROM Address Range:
+======== ============= ================
Switches RAM ROM
12345 Address Range Address Range
+======== ============= ================
00000 C:0000-C:07ff C:2000-C:3fff
10000 C:0800-C:0fff
01000 C:1000-C:17ff
11000 C:1800-C:1fff
00100 C:4000-C:47ff C:6000-C:7fff
10100 C:4800-C:4fff
-01100 C:5000-C:57ff
+01100 C:5000-C:57ff
11100 C:5800-C:5fff
00010 C:C000-C:C7ff C:E000-C:ffff
10010 C:C800-C:Cfff
@@ -1094,7 +1130,7 @@ Switches RAM ROM
00101 D:8000-D:87ff D:A000-D:bfff
10101 D:8800-D:8fff
01101 D:9000-D:97ff
-11101 D:9800-D:9fff
+11101 D:9800-D:9fff
00011 D:C000-D:c7ff D:E000-D:ffff
10011 D:C800-D:cfff
01011 D:D000-D:d7ff
@@ -1103,34 +1139,37 @@ Switches RAM ROM
10111 E:0800-E:0fff
01111 E:1000-E:17ff
11111 E:1800-E:1fff
+======== ============= ================
-*****************************************************************************
+PureData Corp
+=============
-** PureData Corp **
PDI507 (8-bit card)
--------------------
+
- from Mark Rejhon <mdrejhon@magi.com> (slight modifications by Avery)
- Avery's note: I think PDI508 cards (but definitely NOT PDI508Plus cards)
are mostly the same as this. PDI508Plus cards appear to be mainly
software-configured.
Jumpers:
+
There is a jumper array at the bottom of the card, near the edge
- connector. This array is labelled J1. They control the IRQs and
- something else. Put only one jumper on the IRQ pins.
+ connector. This array is labelled J1. They control the IRQs and
+ something else. Put only one jumper on the IRQ pins.
ETS1, ETS2 are for timing on very long distance networks. See the
more general information near the top of this file.
There is a J2 jumper on two pins. A jumper should be put on them,
- since it was already there when I got the card. I don't know what
- this jumper is for though.
+ since it was already there when I got the card. I don't know what
+ this jumper is for though.
There is a two-jumper array for J3. I don't know what it is for,
- but there were already two jumpers on it when I got the card. It's
- a six pin grid in a two-by-three fashion. The jumpers were
- configured as follows:
+ but there were already two jumpers on it when I got the card. It's
+ a six pin grid in a two-by-three fashion. The jumpers were
+ configured as follows::
.-------.
o | o o |
@@ -1140,28 +1179,28 @@ Jumpers:
Carl de Billy <CARL@carainfo.com> explains J3 and J4:
- J3 Diagram:
+ J3 Diagram::
- .-------.
- o | o o |
- :-------: TWIST Technology
- o | o o |
- `-------'
- .-------.
- | o o | o
- :-------: COAX Technology
- | o o | o
- `-------'
+ .-------.
+ o | o o |
+ :-------: TWIST Technology
+ o | o o |
+ `-------'
+ .-------.
+ | o o | o
+ :-------: COAX Technology
+ | o o | o
+ `-------'
- If using coax cable in a bus topology the J4 jumper must be removed;
place it on one pin.
- - If using bus topology with twisted pair wiring move the J3
+ - If using bus topology with twisted pair wiring move the J3
jumpers so they connect the middle pin and the pins closest to the RJ11
Connectors. Also the J4 jumper must be removed; place it on one pin of
J4 jumper for storage.
- - If using star topology with twisted pair wiring move the J3
+ - If using star topology with twisted pair wiring move the J3
jumpers so they connect the middle pin and the pins closest to the RJ11
connectors.
@@ -1169,40 +1208,43 @@ Carl de Billy <CARL@carainfo.com> explains J3 and J4:
DIP Switches:
The DIP switches accessible on the accessible end of the card while
- it is installed, is used to set the ARCnet address. There are 8
- switches. Use an address from 1 to 254.
+ it is installed, are used to set the ARCnet address. There are 8
+ switches. Use an address from 1 to 254.
- Switch No.
- 12345678 ARCnet address
- -----------------------------------------
+ ========== =========================
+ Switch No. ARCnet address
+ 12345678
+ ========== =========================
00000000 FF (Don't use this!)
00000001 FE
00000010 FD
- ....
- 11111101 2
+ ...
+ 11111101 2
11111110 1
11111111 0 (Don't use this!)
+ ========== =========================
There is another array of eight DIP switches at the top of the
- card. There are five labelled MS0-MS4 which seem to control the
- memory address, and another three labelled IO0-IO2 which seem to
- control the base I/O address of the card.
+ card. There are five labelled MS0-MS4 which seem to control the
+ memory address, and another three labelled IO0-IO2 which seem to
+ control the base I/O address of the card.
This was difficult to test by trial and error, and the I/O addresses
- are in a weird order. This was tested by setting the DIP switches,
- rebooting the computer, and attempting to load ARCETHER at various
- addresses (mostly between 0x200 and 0x400). The address that caused
- the red transmit LED to blink, is the one that I thought works.
+ are in a weird order. This was tested by setting the DIP switches,
+ rebooting the computer, and attempting to load ARCETHER at various
+ addresses (mostly between 0x200 and 0x400). The address that caused
+ the red transmit LED to blink, is the one that I thought works.
Also, the address 0x3D0 seems to have a special meaning, since the
- ARCETHER packet driver loaded fine, but without the red LED
- blinking. I don't know what 0x3D0 is for though. I recommend using
- an address of 0x300 since Windows may not like addresses below
- 0x300.
-
- IO Switch No.
- 210 I/O address
- -------------------------------
+ ARCETHER packet driver loaded fine, but without the red LED
+ blinking. I don't know what 0x3D0 is for though. I recommend using
+ an address of 0x300 since Windows may not like addresses below
+ 0x300.
+
+ ============= ===========
+ IO Switch No. I/O address
+ 210
+ ============= ===========
111 0x260
110 0x290
101 0x2E0
@@ -1211,29 +1253,31 @@ DIP Switches:
010 0x350
001 0x380
000 0x3E0
+ ============= ===========
The memory switches set a reserved address space of 0x1000 bytes
- (0x100 segment units, or 4k). For example if I set an address of
- 0xD000, it will use up addresses 0xD000 to 0xD100.
+ (0x100 segment units, or 4k). For example if I set an address of
+ 0xD000, it will use up addresses 0xD000 to 0xD100.
The memory switches were tested by booting using QEMM386 stealth,
- and using LOADHI to see what address automatically became excluded
- from the upper memory regions, and then attempting to load ARCETHER
- using these addresses.
+ and using LOADHI to see what address automatically became excluded
+ from the upper memory regions, and then attempting to load ARCETHER
+ using these addresses.
I recommend using an ARCnet memory address of 0xD000, and putting
- the EMS page frame at 0xC000 while using QEMM stealth mode. That
- way, you get contiguous high memory from 0xD100 almost all the way
- the end of the megabyte.
+ the EMS page frame at 0xC000 while using QEMM stealth mode. That
+ way, you get contiguous high memory from 0xD100 almost all the way
+ to the end of the megabyte.
Memory Switch 0 (MS0) didn't seem to work properly when set to OFF
- on my card. It could be malfunctioning on my card. Experiment with
- it ON first, and if it doesn't work, set it to OFF. (It may be a
- modifier for the 0x200 bit?)
+ on my card. It could be malfunctioning on my card. Experiment with
+ it ON first, and if it doesn't work, set it to OFF. (It may be a
+ modifier for the 0x200 bit?)
+ ============= ============================================
MS Switch No.
43210 Memory address
- --------------------------------
+ ============= ============================================
00001 0xE100 (guessed - was not detected by QEMM)
00011 0xE000 (guessed - was not detected by QEMM)
00101 0xDD00
@@ -1250,40 +1294,36 @@ DIP Switches:
11011 0xC800 (guessed - crashes tested system)
11101 0xC500 (guessed - crashes tested system)
11111 0xC400 (guessed - crashes tested system)
-
-
-*****************************************************************************
+ ============= ============================================
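Two pieces of arithmetic in the PDI507 notes above can be sketched in Python. First, the ARCnet address table is the bitwise inverse of the switch pattern (00000001 gives FE, 11111110 gives 1). Second, the memory addresses are real-mode segment values, so a 0x1000-byte window starting at segment 0xD000 ends just before segment 0xD100. The function names are hypothetical, and the sketch only restates what the tables show.

```python
def pdi507_address(switches):
    """Decode the 8-character switch string ("12345678" order, as in the
    table) into the ARCnet address; the card inverts every bit."""
    if len(switches) != 8 or set(switches) - {"0", "1"}:
        raise ValueError("expected eight characters, each 0 or 1")
    address = ~int(switches, 2) & 0xFF
    if address in (0x00, 0xFF):
        raise ValueError("addresses 0 and FF must not be used")
    return address


def segment_window(segment, size=0x1000):
    """Return (first, last) physical byte addresses of a window that
    starts at the given real-mode segment (physical = segment * 16)."""
    first = segment << 4
    return first, first + size - 1


if __name__ == "__main__":
    print(f"{pdi507_address('11111101'):X}")
    first, last = segment_window(0xD000)
    print(f"{first:05X}-{last:05X}, next free segment {(last + 1) >> 4:04X}")
```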
+
+CNet Technology Inc. (8-bit cards)
+==================================
-** CNet Technology Inc. **
120 Series (8-bit cards)
------------------------
- from Juergen Seifert <seifert@htwm.de>
-
-CNET TECHNOLOGY INC. (CNet) ARCNET 120A SERIES
-==============================================
-
This description has been written by Juergen Seifert <seifert@htwm.de>
-using information from the following Original CNet Manual
-
- "ARCNET
- USER'S MANUAL
- for
- CN120A
- CN120AB
- CN120TP
- CN120ST
- CN120SBT
- P/N:12-01-0007
- Revision 3.00"
+using information from the following Original CNet Manual
+
+ "ARCNET USER'S MANUAL for
+ CN120A
+ CN120AB
+ CN120TP
+ CN120ST
+ CN120SBT
+ P/N:12-01-0007
+ Revision 3.00"
ARCNET is a registered trademark of the Datapoint Corporation
-P/N 120A ARCNET 8 bit XT/AT Star
-P/N 120AB ARCNET 8 bit XT/AT Bus
-P/N 120TP ARCNET 8 bit XT/AT Twisted Pair
-P/N 120ST ARCNET 8 bit XT/AT Star, Twisted Pair
-P/N 120SBT ARCNET 8 bit XT/AT Star, Bus, Twisted Pair
+- P/N 120A ARCNET 8 bit XT/AT Star
+- P/N 120AB ARCNET 8 bit XT/AT Bus
+- P/N 120TP ARCNET 8 bit XT/AT Twisted Pair
+- P/N 120ST ARCNET 8 bit XT/AT Star, Twisted Pair
+- P/N 120SBT ARCNET 8 bit XT/AT Star, Bus, Twisted Pair
+
+::
__________________________________________________________________
| |
@@ -1307,75 +1347,77 @@ P/N 120SBT ARCNET 8 bit XT/AT Star, Bus, Twisted Pair
| > SOCKET | JP 6 5 4 3 2 |o|o|o| | J1 |
| |______________| |o|o|o|o|o| |o|o|o| |_____|
|_____ |o|o|o|o|o| ______________|
- | |
- |_____________________________________________|
-
-Legend:
-
-90C65 ARCNET Probe
-S1 1-5: Base Memory Address Select
- 6-8: Base I/O Address Select
-S2 1-8: Node ID Select (ID0-ID7)
-JP1 ROM Enable Select
-JP2 IRQ2
-JP3 IRQ3
-JP4 IRQ4
-JP5 IRQ5
-JP6 IRQ7
-JP7/JP8 ET1, ET2 Timeout Parameters
-JP10/JP11 Coax / Twisted Pair Select (CN120ST/SBT only)
-JP12 Terminator Select (CN120AB/ST/SBT only)
-J1 BNC RG62/U Connector (all except CN120TP)
-J2 Two 6-position Telephone Jack (CN120TP/ST/SBT only)
+ | |
+ |_____________________________________________|
+
+Legend::
+
+ 90C65 ARCNET Probe
+ S1 1-5: Base Memory Address Select
+ 6-8: Base I/O Address Select
+ S2 1-8: Node ID Select (ID0-ID7)
+ JP1 ROM Enable Select
+ JP2 IRQ2
+ JP3 IRQ3
+ JP4 IRQ4
+ JP5 IRQ5
+ JP6 IRQ7
+ JP7/JP8 ET1, ET2 Timeout Parameters
+ JP10/JP11 Coax / Twisted Pair Select (CN120ST/SBT only)
+ JP12 Terminator Select (CN120AB/ST/SBT only)
+ J1 BNC RG62/U Connector (all except CN120TP)
+ J2 Two 6-position Telephone Jack (CN120TP/ST/SBT only)
Setting one of the switches to Off means "1", On means "0".
Setting the Node ID
--------------------
+^^^^^^^^^^^^^^^^^^^
The eight switches in SW2 are used to set the node ID. Each node attached
to the network must have a unique node ID which must be different from 0.
Switch 1 (ID0) serves as the least significant bit (LSB).
-The node ID is the sum of the values of all switches set to "1"
+The node ID is the sum of the values of all switches set to "1"
These values are:
- Switch | Label | Value
- -------|-------|-------
- 1 | ID0 | 1
- 2 | ID1 | 2
- 3 | ID2 | 4
- 4 | ID3 | 8
- 5 | ID4 | 16
- 6 | ID5 | 32
- 7 | ID6 | 64
- 8 | ID7 | 128
-
-Some Examples:
-
- Switch | Hex | Decimal
+ ======= ====== =====
+ Switch Label Value
+ ======= ====== =====
+ 1 ID0 1
+ 2 ID1 2
+ 3 ID2 4
+ 4 ID3 8
+ 5 ID4 16
+ 6 ID5 32
+ 7 ID6 64
+ 8 ID7 128
+ ======= ====== =====
+
+Some Examples::
+
+ Switch | Hex | Decimal
8 7 6 5 4 3 2 1 | Node ID | Node ID
----------------|---------|---------
0 0 0 0 0 0 0 0 | not allowed
- 0 0 0 0 0 0 0 1 | 1 | 1
+ 0 0 0 0 0 0 0 1 | 1 | 1
0 0 0 0 0 0 1 0 | 2 | 2
0 0 0 0 0 0 1 1 | 3 | 3
. . . | |
0 1 0 1 0 1 0 1 | 55 | 85
. . . | |
1 0 1 0 1 0 1 0 | AA | 170
- . . . | |
+ . . . | |
1 1 1 1 1 1 0 1 | FD | 253
1 1 1 1 1 1 1 0 | FE | 254
1 1 1 1 1 1 1 1 | FF | 255
Setting the I/O Base Address
-----------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The last three switches in switch block SW1 are used to select one
-of eight possible I/O Base addresses using the following table
+of eight possible I/O Base addresses using the following table::
Switch | Hex I/O
@@ -1392,13 +1434,15 @@ of eight possible I/O Base addresses using the following table
Setting the Base Memory (RAM) buffer Address
---------------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-The memory buffer (RAM) requires 2K. The base of this buffer can be
+The memory buffer (RAM) requires 2K. The base of this buffer can be
located in any of eight positions. The address of the Boot Prom is
memory base + 8K or memory base + 0x2000.
Switches 1-5 of switch block SW1 select the Memory Base address.
+::
+
Switch | Hex RAM | Hex ROM
1 2 3 4 5 | Address | Address *)
--------------------|---------|-----------
@@ -1410,22 +1454,24 @@ Switches 1-5 of switch block SW1 select the Memory Base address.
ON ON OFF ON OFF | D8000 | DA000
ON ON ON OFF OFF | DC000 | DE000
ON ON OFF OFF OFF | E0000 | E2000
-
-*) To enable the Boot ROM install the jumper JP1
-Note: Since the switches 1 and 2 are always set to ON it may be possible
+ *) To enable the Boot ROM install the jumper JP1
+
+.. note::
+
+ Since the switches 1 and 2 are always set to ON it may be possible
that they can be used to add an offset of 2K, 4K or 6K to the base
address, but this feature is not documented in the manual and I
haven't tested it yet.
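The relation stated above (Boot PROM address = memory base + 0x2000) can be checked against the rows of the table with a short Python sketch. The names `BOOT_PROM_OFFSET` and `cn120_boot_prom` are illustrative, and the 2K alignment check is an assumption based on the 2K buffer size rather than something the manual is quoted as requiring.

```python
BOOT_PROM_OFFSET = 0x2000   # 8K above the RAM buffer base, per the text


def cn120_boot_prom(ram_base):
    """Return the Boot PROM address for a given RAM buffer base address."""
    if ram_base % 0x800:
        # Assumption: the 2K buffer sits on a 2K boundary.
        raise ValueError("RAM base is expected to be 2K-aligned")
    return ram_base + BOOT_PROM_OFFSET


if __name__ == "__main__":
    for base in (0xD8000, 0xDC000, 0xE0000):
        print(f"RAM {base:05X} -> ROM {cn120_boot_prom(base):05X}")
```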
Setting the Interrupt Line
---------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^
To select a hardware interrupt level install one (only one!) of the jumpers
-JP2, JP3, JP4, JP5, JP6. JP2 is the default.
+JP2, JP3, JP4, JP5, JP6. JP2 is the default::
- Jumper | IRQ
+ Jumper | IRQ
-------|-----
2 | 2
3 | 3
@@ -1435,71 +1481,66 @@ JP2, JP3, JP4, JP5, JP6. JP2 is the default.
Setting the Internal Terminator on CN120AB/TP/SBT
---------------------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-The jumper JP12 is used to enable the internal terminator.
+The jumper JP12 is used to enable the internal terminator::
- -----
- 0 | 0 |
+ -----
+ 0 | 0 |
----- ON | | ON
| 0 | | 0 |
| | OFF ----- OFF
| 0 | 0
-----
- Terminator Terminator
+ Terminator Terminator
disabled enabled
-
+
Selecting the Connector Type on CN120ST/SBT
--------------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+::
JP10 JP11 JP10 JP11
- ----- -----
- 0 0 | 0 | | 0 |
+ ----- -----
+ 0 0 | 0 | | 0 |
----- ----- | | | |
| 0 | | 0 | | 0 | | 0 |
| | | | ----- -----
- | 0 | | 0 | 0 0
+ | 0 | | 0 | 0 0
----- -----
- Coaxial Cable Twisted Pair Cable
+ Coaxial Cable Twisted Pair Cable
(Default)
Setting the Timeout Parameters
-------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-The jumpers labeled EXT1 and EXT2 are used to determine the timeout
+The jumpers labeled EXT1 and EXT2 are used to determine the timeout
parameters. These two jumpers are normally left open.
+CNet Technology Inc. (16-bit cards)
+===================================
-*****************************************************************************
-
-** CNet Technology Inc. **
160 Series (16-bit cards)
-------------------------
- from Juergen Seifert <seifert@htwm.de>
-CNET TECHNOLOGY INC. (CNet) ARCNET 160A SERIES
-==============================================
-
This description has been written by Juergen Seifert <seifert@htwm.de>
-using information from the following Original CNet Manual
+using information from the following Original CNet Manual
- "ARCNET
- USER'S MANUAL
- for
- CN160A
- CN160AB
- CN160TP
- P/N:12-01-0006
- Revision 3.00"
+ "ARCNET USER'S MANUAL for
+ CN160A CN160AB CN160TP
+ P/N:12-01-0006 Revision 3.00"
ARCNET is a registered trademark of the Datapoint Corporation
-P/N 160A ARCNET 16 bit XT/AT Star
-P/N 160AB ARCNET 16 bit XT/AT Bus
-P/N 160TP ARCNET 16 bit XT/AT Twisted Pair
+- P/N 160A ARCNET 16 bit XT/AT Star
+- P/N 160AB ARCNET 16 bit XT/AT Bus
+- P/N 160TP ARCNET 16 bit XT/AT Twisted Pair
+
+::
___________________________________________________________________
< _________________________ ___|
@@ -1526,30 +1567,30 @@ P/N 160TP ARCNET 16 bit XT/AT Twisted Pair
> | | |
<____________| |_______________________________________|
-Legend:
+Legend::
-9026 ARCNET Probe
-SW1 1-6: Base I/O Address Select
- 7-10: Base Memory Address Select
-SW2 1-8: Node ID Select (ID0-ID7)
-JP1/JP2 ET1, ET2 Timeout Parameters
-JP3-JP13 Interrupt Select
-J1 BNC RG62/U Connector (CN160A/AB only)
-J1 Two 6-position Telephone Jack (CN160TP only)
-LED
+ 9026 ARCNET Probe
+ SW1 1-6: Base I/O Address Select
+ 7-10: Base Memory Address Select
+ SW2 1-8: Node ID Select (ID0-ID7)
+ JP1/JP2 ET1, ET2 Timeout Parameters
+ JP3-JP13 Interrupt Select
+ J1 BNC RG62/U Connector (CN160A/AB only)
+ J1 Two 6-position Telephone Jack (CN160TP only)
+ LED
Setting one of the switches to Off means "1", On means "0".
Setting the Node ID
--------------------
+^^^^^^^^^^^^^^^^^^^
The eight switches in SW2 are used to set the node ID. Each node attached
to the network must have a unique node ID which must be different from 0.
Switch 1 (ID0) serves as the least significant bit (LSB).
-The node ID is the sum of the values of all switches set to "1"
-These values are:
+The node ID is the sum of the values of all switches set to "1"
+These values are::
Switch | Label | Value
-------|-------|-------
@@ -1562,32 +1603,32 @@ These values are:
7 | ID6 | 64
8 | ID7 | 128
-Some Examples:
+Some Examples::
- Switch | Hex | Decimal
+ Switch | Hex | Decimal
8 7 6 5 4 3 2 1 | Node ID | Node ID
----------------|---------|---------
0 0 0 0 0 0 0 0 | not allowed
- 0 0 0 0 0 0 0 1 | 1 | 1
+ 0 0 0 0 0 0 0 1 | 1 | 1
0 0 0 0 0 0 1 0 | 2 | 2
0 0 0 0 0 0 1 1 | 3 | 3
. . . | |
0 1 0 1 0 1 0 1 | 55 | 85
. . . | |
1 0 1 0 1 0 1 0 | AA | 170
- . . . | |
+ . . . | |
1 1 1 1 1 1 0 1 | FD | 253
1 1 1 1 1 1 1 0 | FE | 254
1 1 1 1 1 1 1 1 | FF | 255
Setting the I/O Base Address
-----------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The first six switches in switch block SW1 are used to select the I/O Base
-address using the following table:
+address using the following table::
- Switch | Hex I/O
+ Switch | Hex I/O
1 2 3 4 5 6 | Address
------------------------|--------
OFF ON ON OFF OFF ON | 260
@@ -1604,10 +1645,10 @@ Note: Other IO-Base addresses seem to be selectable, but only the above
Setting the Base Memory (RAM) buffer Address
---------------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The switches 7-10 of switch block SW1 are used to select the Memory
-Base address of the RAM (2K) and the PROM.
+Base address of the RAM (2K) and the PROM::
Switch | Hex RAM | Hex ROM
7 8 9 10 | Address | Address
@@ -1616,17 +1657,19 @@ Base address of the RAM (2K) and the PROM.
OFF OFF ON OFF | D0000 | D8000 (Default)
OFF OFF OFF ON | E0000 | E8000
-Note: Other MEM-Base addresses seem to be selectable, but only the above
+.. note::
+
+ Other MEM-Base addresses seem to be selectable, but only the above
combinations are documented.
Setting the Interrupt Line
---------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^
To select a hardware interrupt level install one (only one!) of the jumpers
-JP3 through JP13 using the following table:
+JP3 through JP13 using the following table::
- Jumper | IRQ
+ Jumper | IRQ
-------|-----------------
3 | 14
4 | 15
@@ -1640,10 +1683,12 @@ JP3 through JP13 using the following table:
12 | 7
13 | 2 (=9) Default!
-Note: - Do not use JP11=IRQ6, it may conflict with your Floppy Disk
- Controller
+.. note::
+
+ - Do not use JP11=IRQ6, it may conflict with your Floppy Disk
+ Controller
- Use JP3=IRQ14 only, if you don't have an IDE-, MFM-, or RLL-
- Hard Disk, it may conflict with their controllers
+ Hard Disk, it may conflict with their controllers
Setting the Timeout Parameters
@@ -1653,14 +1698,16 @@ The jumpers labeled JP1 and JP2 are used to determine the timeout
parameters. These two jumpers are normally left open.
-*****************************************************************************
+Lantech
+=======
-** Lantech **
8-bit card, unknown model
-------------------------
- from Vlad Lungu <vlungu@ugal.ro> - his e-mail address seemed broken at
the time I tried to reach him. Sorry Vlad, if you didn't get my reply.
+::
+
________________________________________________________________
| 1 8 |
| ___________ __|
@@ -1683,25 +1730,27 @@ parameters. These two jumpers are normally left open.
| | PROM | |ooooo| JP6 |
| |____________| |ooooo| |
|_____________ _ _|
- |____________________________________________| |__|
+ |____________________________________________| |__|
UM9065L : ARCnet Controller
SW 1 : Shared Memory Address and I/O Base
- ON=0
+::
- 12345|Memory Address
- -----|--------------
- 00001| D4000
- 00010| CC000
- 00110| D0000
- 01110| D1000
- 01101| D9000
- 10010| CC800
- 10011| DC800
- 11110| D1800
+ ON=0
+
+ 12345|Memory Address
+ -----|--------------
+ 00001| D4000
+ 00010| CC000
+ 00110| D0000
+ 01110| D1000
+ 01101| D9000
+ 10010| CC800
+ 10011| DC800
+ 11110| D1800
It seems that the bits are considered in reverse order. Also, you must
observe that some of those addresses are unusual and I didn't probe them; I
@@ -1710,43 +1759,48 @@ some others that I didn't write here the card seems to conflict with the
video card (an S3 GENDAC). I leave the full decoding of those addresses to
you.
- 678| I/O Address
- ---|------------
- 000| 260
- 001| failed probe
- 010| 2E0
- 011| 380
- 100| 290
- 101| 350
- 110| failed probe
- 111| 3E0
+::
-SW 2 : Node ID (binary coded)
+ 678| I/O Address
+ ---|------------
+ 000| 260
+ 001| failed probe
+ 010| 2E0
+ 011| 380
+ 100| 290
+ 101| 350
+ 110| failed probe
+ 111| 3E0
-JP 4 : Boot PROM enable CLOSE - enabled
- OPEN - disabled
+ SW 2 : Node ID (binary coded)
-JP 6 : IRQ set (ONLY ONE jumper on 1-5 for IRQ 2-6)
+ JP 4 : Boot PROM enable CLOSE - enabled
+ OPEN - disabled
+ JP 6 : IRQ set (ONLY ONE jumper on 1-5 for IRQ 2-6)
-*****************************************************************************
-** Acer **
+Acer
+====
+
8-bit card, Model 5210-003
--------------------------
+
- from Vojtech Pavlik <vojtech@suse.cz> using portions of the existing
arcnet-hardware file.
This is a 90C26 based card. Its configuration seems similar to the SMC
PC100, but has some additional jumpers I don't know the meaning of.
- __
- | |
+::
+
+ __
+ | |
___________|__|_________________________
| | | |
| | BNC | |
| |______| ___|
- | _____________________ |___
+ | _____________________ |___
| | | |
| | Hybrid IC | |
| | | o|o J1 |
@@ -1762,51 +1816,51 @@ PC100, but has some additional jumpers I don't know the meaning of.
| _____ |
| | | _____ |
| | | | | ___|
- | | | | | |
- | _____ | ROM | | UFS | |
- | | | | | | | |
- | | | ___ | | | | |
- | | | | | |__.__| |__.__| |
- | | NCR | |XTL| _____ _____ |
- | | | |___| | | | | |
- | |90C26| | | | | |
- | | | | RAM | | UFS | |
- | | | J17 o|o | | | | |
- | | | J16 o|o | | | | |
- | |__.__| |__.__| |__.__| |
- | ___ |
- | | |8 |
- | |SW2| |
- | | | |
- | |___|1 |
- | ___ |
- | | |10 J18 o|o |
- | | | o|o |
- | |SW1| o|o |
- | | | J21 o|o |
- | |___|1 |
- | |
- |____________________________________|
-
-
-Legend:
-
-90C26 ARCNET Chip
-XTL 20 MHz Crystal
-SW1 1-6 Base I/O Address Select
- 7-10 Memory Address Select
-SW2 1-8 Node ID Select (ID0-ID7)
-J1-J5 IRQ Select
-J6-J21 Unknown (Probably extra timeouts & ROM enable ...)
-LED1 Activity LED
-BNC Coax connector (STAR ARCnet)
-RAM 2k of SRAM
-ROM Boot ROM socket
-UFS Unidentified Flying Sockets
+ | | | | | |
+ | _____ | ROM | | UFS | |
+ | | | | | | | |
+ | | | ___ | | | | |
+ | | | | | |__.__| |__.__| |
+ | | NCR | |XTL| _____ _____ |
+ | | | |___| | | | | |
+ | |90C26| | | | | |
+ | | | | RAM | | UFS | |
+ | | | J17 o|o | | | | |
+ | | | J16 o|o | | | | |
+ | |__.__| |__.__| |__.__| |
+ | ___ |
+ | | |8 |
+ | |SW2| |
+ | | | |
+ | |___|1 |
+ | ___ |
+ | | |10 J18 o|o |
+ | | | o|o |
+ | |SW1| o|o |
+ | | | J21 o|o |
+ | |___|1 |
+ | |
+ |____________________________________|
+
+
+Legend::
+
+ 90C26 ARCNET Chip
+ XTL 20 MHz Crystal
+ SW1 1-6 Base I/O Address Select
+ 7-10 Memory Address Select
+ SW2 1-8 Node ID Select (ID0-ID7)
+ J1-J5 IRQ Select
+ J6-J21 Unknown (Probably extra timeouts & ROM enable ...)
+ LED1 Activity LED
+ BNC Coax connector (STAR ARCnet)
+ RAM 2k of SRAM
+ ROM Boot ROM socket
+ UFS Unidentified Flying Sockets
Setting the Node ID
--------------------
+^^^^^^^^^^^^^^^^^^^
The eight switches in SW2 are used to set the node ID. Each node attached
to the network must have an unique node ID which must not be 0.
@@ -1815,7 +1869,7 @@ Switch 1 (ID0) serves as the least significant bit (LSB).
Setting one of the switches to OFF means "1", ON means "0".
The node ID is the sum of the values of all switches set to "1"
-These values are:
+These values are::
Switch | Value
-------|-------
@@ -1832,40 +1886,40 @@ Don't set this to 0 or 255; these values are reserved.
Setting the I/O Base Address
-----------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The switches 1 to 6 of switch block SW1 are used to select one
-of 32 possible I/O Base addresses using the following tables
-
- | Hex
+of 32 possible I/O Base addresses using the following tables::
+
+ | Hex
Switch | Value
-------|-------
- 1 | 200
- 2 | 100
- 3 | 80
- 4 | 40
- 5 | 20
- 6 | 10
+ 1 | 200
+ 2 | 100
+ 3 | 80
+ 4 | 40
+ 5 | 20
+ 6 | 10
The I/O address is sum of all switches set to "1". Remember that
the I/O address space bellow 0x200 is RESERVED for mainboard, so
-switch 1 should be ALWAYS SET TO OFF.
+switch 1 should be ALWAYS SET TO OFF.
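The sum described above can be sketched in a few lines of Python. This is only an illustration: the function name and the set-of-switches input are hypothetical, and it assumes that OFF counts as "1" for SW1, as the text states for SW2 on this card.

```python
# Hex weights of SW1 switches 1-6, copied from the table above.
SW1_IO_WEIGHTS = {1: 0x200, 2: 0x100, 3: 0x80, 4: 0x40, 5: 0x20, 6: 0x10}


def io_base_from_switches(off_switches):
    """Return the I/O base address for the SW1 switches (1-6) that are OFF.

    Assumption: a switch in the OFF position contributes its weight ("1").
    """
    switches = set(off_switches)
    unknown = switches - set(SW1_IO_WEIGHTS)
    if unknown:
        raise ValueError(f"not an I/O select switch: {sorted(unknown)}")
    address = sum(SW1_IO_WEIGHTS[s] for s in switches)
    # I/O space below 0x200 is reserved for the mainboard, so switch 1
    # (weight 0x200) has to be OFF.
    if address < 0x200:
        raise ValueError("switch 1 must be OFF (address below 0x200)")
    return address


if __name__ == "__main__":
    print(hex(io_base_from_switches({1, 4, 5})))
```

With switch 1 fixed to OFF, the five remaining switches give the 32 possible addresses mentioned above.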
Setting the Base Memory (RAM) buffer Address
---------------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The memory buffer (RAM) requires 2K. The base of this buffer can be
located in any of sixteen positions. However, the addresses below
A0000 are likely to cause system hang because there's main RAM.
-Jumpers 7-10 of switch block SW1 select the Memory Base address.
+Jumpers 7-10 of switch block SW1 select the Memory Base address::
Switch | Hex RAM
7 8 9 10 | Address
----------------|---------
OFF OFF OFF OFF | F0000 (conflicts with main BIOS)
- OFF OFF OFF ON | E0000
+ OFF OFF OFF ON | E0000
OFF OFF ON OFF | D0000
OFF OFF ON ON | C0000 (conflicts with video BIOS)
OFF ON OFF OFF | B0000 (conflicts with mono video)
@@ -1873,10 +1927,10 @@ Jumpers 7-10 of switch block SW1 select the Memory Base address.
Setting the Interrupt Line
---------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^
-Jumpers 1-5 of the jumper block J1 control the IRQ level. ON means
-shorted, OFF means open.
+Jumpers 1-5 of the jumper block J1 control the IRQ level. ON means
+shorted, OFF means open::
Jumper | IRQ
1 2 3 4 5 |
@@ -1889,65 +1943,67 @@ shorted, OFF means open.
Unknown jumpers & sockets
--------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^
I know nothing about these. I just guess that J16&J17 are timeout
jumpers and maybe one of J18-J21 selects ROM. Also J6-J10 and
J11-J15 are connecting IRQ2-7 to some pins on the UFSs. I can't
guess the purpose.
+Datapoint?
+==========
-*****************************************************************************
-
-** Datapoint? **
LAN-ARC-8, an 8-bit card
------------------------
+
- from Vojtech Pavlik <vojtech@suse.cz>
This is another SMC 90C65-based ARCnet card. I couldn't identify the
manufacturer, but it might be DataPoint, because the card has the
original arcNet logo in its upper right corner.
- _______________________________________________________
- | _________ |
- | | SW2 | ON arcNet |
- | |_________| OFF ___|
- | _____________ 1 ______ 8 | | 8
- | | | SW1 | XTAL | ____________ | S |
- | > RAM (2k) | |______|| | | W |
- | |_____________| | H | | 3 |
- | _________|_____ y | |___| 1
- | _________ | | |b | |
- | |_________| | | |r | |
- | | SMC | |i | |
- | | 90C65| |d | |
- | _________ | | | | |
- | | SW1 | ON | | |I | |
- | |_________| OFF |_________|_____/C | _____|
- | 1 8 | | | |___
- | ______________ | | | BNC |___|
- | | | |____________| |_____|
- | > EPROM SOCKET | _____________ |
- | |______________| |_____________| |
- | ______________|
- | |
- |________________________________________|
-
-Legend:
-
-90C65 ARCNET Chip
-SW1 1-5: Base Memory Address Select
- 6-8: Base I/O Address Select
-SW2 1-8: Node ID Select
-SW3 1-5: IRQ Select
- 6-7: Extra Timeout
- 8 : ROM Enable
-BNC Coax connector
-XTAL 20 MHz Crystal
+::
+
+ _______________________________________________________
+ | _________ |
+ | | SW2 | ON arcNet |
+ | |_________| OFF ___|
+ | _____________ 1 ______ 8 | | 8
+ | | | SW1 | XTAL | ____________ | S |
+ | > RAM (2k) | |______|| | | W |
+ | |_____________| | H | | 3 |
+ | _________|_____ y | |___| 1
+ | _________ | | |b | |
+ | |_________| | | |r | |
+ | | SMC | |i | |
+ | | 90C65| |d | |
+ | _________ | | | | |
+ | | SW1 | ON | | |I | |
+ | |_________| OFF |_________|_____/C | _____|
+ | 1 8 | | | |___
+ | ______________ | | | BNC |___|
+ | | | |____________| |_____|
+ | > EPROM SOCKET | _____________ |
+ | |______________| |_____________| |
+ | ______________|
+ | |
+ |________________________________________|
+
+Legend::
+
+ 90C65 ARCNET Chip
+ SW1 1-5: Base Memory Address Select
+ 6-8: Base I/O Address Select
+ SW2 1-8: Node ID Select
+ SW3 1-5: IRQ Select
+ 6-7: Extra Timeout
+ 8 : ROM Enable
+ BNC Coax connector
+ XTAL 20 MHz Crystal
Setting the Node ID
--------------------
+^^^^^^^^^^^^^^^^^^^
The eight switches in SW3 are used to set the node ID. Each node attached
to the network must have an unique node ID which must not be 0.
@@ -1955,8 +2011,8 @@ Switch 1 serves as the least significant bit (LSB).
Setting one of the switches to Off means "1", On means "0".
-The node ID is the sum of the values of all switches set to "1"
-These values are:
+The node ID is the sum of the values of all switches set to "1"
+These values are::
Switch | Value
-------|-------
@@ -1971,10 +2027,10 @@ These values are:
Setting the I/O Base Address
-----------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The last three switches in switch block SW1 are used to select one
-of eight possible I/O Base addresses using the following table
+of eight possible I/O Base addresses using the following table::
Switch | Hex I/O
@@ -1991,13 +2047,16 @@ of eight possible I/O Base addresses using the following table
Setting the Base Memory (RAM) buffer Address
---------------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-The memory buffer (RAM) requires 2K. The base of this buffer can be
+The memory buffer (RAM) requires 2K. The base of this buffer can be
located in any of eight positions. The address of the Boot Prom is
memory base + 0x2000.
+
Jumpers 3-5 of switch block SW1 select the Memory Base address.
+::
+
Switch | Hex RAM | Hex ROM
1 2 3 4 5 | Address | Address *)
--------------------|---------|-----------
@@ -2009,16 +2068,16 @@ Jumpers 3-5 of switch block SW1 select the Memory Base address.
ON ON OFF ON OFF | D8000 | DA000
ON ON ON OFF OFF | DC000 | DE000
ON ON OFF OFF OFF | E0000 | E2000
-
-*) To enable the Boot ROM set the switch 8 of switch block SW3 to position ON.
+
+ *) To enable the Boot ROM set the switch 8 of switch block SW3 to position ON.
The switches 1 and 2 probably add 0x0800 and 0x1000 to RAM base address.
Setting the Interrupt Line
---------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^
-Switches 1-5 of the switch block SW3 control the IRQ level.
+Switches 1-5 of the switch block SW3 control the IRQ level::
Jumper | IRQ
1 2 3 4 5 |
@@ -2031,64 +2090,67 @@ Switches 1-5 of the switch block SW3 control the IRQ level.
Setting the Timeout Parameters
-------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The switches 6-7 of the switch block SW3 are used to determine the timeout
parameters. These two switches are normally left in the OFF position.
-*****************************************************************************
+Topware
+=======
-** Topware **
8-bit card, TA-ARC/10
--------------------------
+---------------------
+
- from Vojtech Pavlik <vojtech@suse.cz>
This is another very similar 90C65 card. Most of the switches and jumpers
are the same as on other clones.
- _____________________________________________________________________
-| ___________ | | ______ |
-| |SW2 NODE ID| | | | XTAL | |
-| |___________| | Hybrid IC | |______| |
-| ___________ | | __|
-| |SW1 MEM+I/O| |_________________________| LED1|__|)
-| |___________| 1 2 |
-| J3 |o|o| TIMEOUT ______|
-| ______________ |o|o| | |
-| | | ___________________ | RJ |
-| > EPROM SOCKET | | \ |------|
-|J2 |______________| | | | |
-||o| | | |______|
-||o| ROM ENABLE | SMC | _________ |
-| _____________ | 90C65 | |_________| _____|
-| | | | | | |___
-| > RAM (2k) | | | | BNC |___|
-| |_____________| | | |_____|
-| |____________________| |
-| ________ IRQ 2 3 4 5 7 ___________ |
-||________| |o|o|o|o|o| |___________| |
-|________ J1|o|o|o|o|o| ______________|
- | |
- |_____________________________________________|
-
-Legend:
-
-90C65 ARCNET Chip
-XTAL 20 MHz Crystal
-SW1 1-5 Base Memory Address Select
- 6-8 Base I/O Address Select
-SW2 1-8 Node ID Select (ID0-ID7)
-J1 IRQ Select
-J2 ROM Enable
-J3 Extra Timeout
-LED1 Activity LED
-BNC Coax connector (BUS ARCnet)
-RJ Twisted Pair Connector (daisy chain)
+::
+
+ _____________________________________________________________________
+ | ___________ | | ______ |
+ | |SW2 NODE ID| | | | XTAL | |
+ | |___________| | Hybrid IC | |______| |
+ | ___________ | | __|
+ | |SW1 MEM+I/O| |_________________________| LED1|__|)
+ | |___________| 1 2 |
+ | J3 |o|o| TIMEOUT ______|
+ | ______________ |o|o| | |
+ | | | ___________________ | RJ |
+ | > EPROM SOCKET | | \ |------|
+ |J2 |______________| | | | |
+ ||o| | | |______|
+ ||o| ROM ENABLE | SMC | _________ |
+ | _____________ | 90C65 | |_________| _____|
+ | | | | | | |___
+ | > RAM (2k) | | | | BNC |___|
+ | |_____________| | | |_____|
+ | |____________________| |
+ | ________ IRQ 2 3 4 5 7 ___________ |
+ ||________| |o|o|o|o|o| |___________| |
+ |________ J1|o|o|o|o|o| ______________|
+ | |
+ |_____________________________________________|
+
+Legend::
+
+ 90C65 ARCNET Chip
+ XTAL 20 MHz Crystal
+ SW1 1-5 Base Memory Address Select
+ 6-8 Base I/O Address Select
+ SW2 1-8 Node ID Select (ID0-ID7)
+ J1 IRQ Select
+ J2 ROM Enable
+ J3 Extra Timeout
+ LED1 Activity LED
+ BNC Coax connector (BUS ARCnet)
+ RJ Twisted Pair Connector (daisy chain)
Setting the Node ID
--------------------
+^^^^^^^^^^^^^^^^^^^
The eight switches in SW2 are used to set the node ID. Each node attached to
the network must have an unique node ID which must not be 0. Switch 1 (ID0)
@@ -2097,7 +2159,7 @@ serves as the least significant bit (LSB).
Setting one of the switches to Off means "1", On means "0".
The node ID is the sum of the values of all switches set to "1"
-These values are:
+These values are::
Switch | Label | Value
-------|-------|-------
@@ -2111,10 +2173,10 @@ These values are:
8 | ID7 | 128
Setting the I/O Base Address
-----------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The last three switches in switch block SW1 are used to select one
-of eight possible I/O Base addresses using the following table:
+of eight possible I/O Base addresses using the following table::
Switch | Hex I/O
@@ -2122,7 +2184,7 @@ of eight possible I/O Base addresses using the following table:
------------|--------
ON ON ON | 260 (Manufacturer's default)
OFF ON ON | 290
- ON OFF ON | 2E0
+ ON OFF ON | 2E0
OFF OFF ON | 2F0
ON ON OFF | 300
OFF ON OFF | 350
@@ -2131,35 +2193,38 @@ of eight possible I/O Base addresses using the following table:
Setting the Base Memory (RAM) buffer Address
---------------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The memory buffer (RAM) requires 2K. The base of this buffer can be
located in any of eight positions. The address of the Boot Prom is
memory base + 0x2000.
+
Jumpers 3-5 of switch block SW1 select the Memory Base address.
+::
+
Switch | Hex RAM | Hex ROM
1 2 3 4 5 | Address | Address *)
--------------------|---------|-----------
ON ON ON ON ON | C0000 | C2000
- ON ON OFF ON ON | C4000 | C6000 (Manufacturer's default)
+ ON ON OFF ON ON | C4000 | C6000 (Manufacturer's default)
ON ON ON OFF ON | CC000 | CE000
- ON ON OFF OFF ON | D0000 | D2000
+ ON ON OFF OFF ON | D0000 | D2000
ON ON ON ON OFF | D4000 | D6000
ON ON OFF ON OFF | D8000 | DA000
ON ON ON OFF OFF | DC000 | DE000
ON ON OFF OFF OFF | E0000 | E2000
-*) To enable the Boot ROM short the jumper J2.
+ *) To enable the Boot ROM short the jumper J2.
The jumpers 1 and 2 probably add 0x0800 and 0x1000 to RAM address.
Setting the Interrupt Line
---------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^
Jumpers 1-5 of the jumper block J1 control the IRQ level. ON means
-shorted, OFF means open.
+shorted, OFF means open::
Jumper | IRQ
1 2 3 4 5 |
@@ -2172,19 +2237,21 @@ shorted, OFF means open.
Setting the Timeout Parameters
-------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-The jumpers J3 are used to set the timeout parameters. These two
+The jumpers J3 are used to set the timeout parameters. These two
jumpers are normally left open.
-
-*****************************************************************************
+Thomas-Conrad
+=============
-** Thomas-Conrad **
Model #500-6242-0097 REV A (8-bit card)
---------------------------------------
+
- from Lars Karlsson <100617.3473@compuserve.com>
+::
+
________________________________________________________
| ________ ________ |_____
| |........| |........| |
@@ -2194,11 +2261,11 @@ Model #500-6242-0097 REV A (8-bit card)
| address | |
| ______ switch | |
| | | | |
- | | | |___|
+ | | | |___|
| | | ______ |___._
| |______| |______| ____| BNC
| Jumper- _____| Connector
- | Main chip block _ __| '
+ | Main chip block _ __| '
| | | | RJ Connector
| |_| | with 110 Ohm
| |__ Terminator
@@ -2208,46 +2275,49 @@ Model #500-6242-0097 REV A (8-bit card)
| |___________| |_____| |__
| Boot PROM socket IRQ-jumpers |_ Diagnostic
|________ __ _| LED (red)
- | | | | | | | | | | | | | | | | | | | | | |
- | | | | | | | | | | | | | | | | | | | | |________|
- |
- |
+ | | | | | | | | | | | | | | | | | | | | | |
+ | | | | | | | | | | | | | | | | | | | | |________|
+ |
+ |
And here are the settings for some of the switches and jumpers on the cards.
+::
- I/O
+ I/O
- 1 2 3 4 5 6 7 8
+ 1 2 3 4 5 6 7 8
-2E0----- 0 0 0 1 0 0 0 1
-2F0----- 0 0 0 1 0 0 0 0
-300----- 0 0 0 0 1 1 1 1
-350----- 0 0 0 0 1 1 1 0
+ 2E0----- 0 0 0 1 0 0 0 1
+ 2F0----- 0 0 0 1 0 0 0 0
+ 300----- 0 0 0 0 1 1 1 1
+ 350----- 0 0 0 0 1 1 1 0
"0" in the above example means switch is off "1" means that it is on.
+::
- ShMem address.
+ ShMem address.
- 1 2 3 4 5 6 7 8
+ 1 2 3 4 5 6 7 8
-CX00--0 0 1 1 | | |
-DX00--0 0 1 0 |
-X000--------- 1 1 |
-X400--------- 1 0 |
-X800--------- 0 1 |
-XC00--------- 0 0
-ENHANCED----------- 1
-COMPATIBLE--------- 0
+ CX00--0 0 1 1 | | |
+ DX00--0 0 1 0 |
+ X000--------- 1 1 |
+ X400--------- 1 0 |
+ X800--------- 0 1 |
+ XC00--------- 0 0
+ ENHANCED----------- 1
+ COMPATIBLE--------- 0
+::
- IRQ
+ IRQ
- 3 4 5 7 2
- . . . . .
- . . . . .
+ 3 4 5 7 2
+ . . . . .
+ . . . . .
There is a DIP-switch with 8 switches, used to set the shared memory address
@@ -2266,10 +2336,9 @@ varies by the type of card involved. I fail to see how either of these
enhance anything. Send me more detailed information about this mode, or
just use "compatible" mode instead.]
+Waterloo Microsystems Inc. ??
+=============================
-*****************************************************************************
-
-** Waterloo Microsystems Inc. ?? **
8-bit card (C) 1985
-------------------
- from Robert Michael Best <rmb117@cs.usask.ca>
@@ -2283,103 +2352,104 @@ e-mail me.]
The probe has not been able to detect the card on any of the J2 settings,
and I tried them again with the "Waterloo" chip removed.
-
- _____________________________________________________________________
-| \/ \/ ___ __ __ |
-| C4 C4 |^| | M || ^ ||^| |
-| -- -- |_| | 5 || || | C3 |
-| \/ \/ C10 |___|| ||_| |
-| C4 C4 _ _ | | ?? |
-| -- -- | \/ || | |
-| | || | |
-| | || C1 | |
-| | || | \/ _____|
-| | C6 || | C9 | |___
-| | || | -- | BNC |___|
-| | || | >C7| |_____|
-| | || | |
-| __ __ |____||_____| 1 2 3 6 |
-|| ^ | >C4| |o|o|o|o|o|o| J2 >C4| |
-|| | |o|o|o|o|o|o| |
-|| C2 | >C4| >C4| |
-|| | >C8| |
-|| | 2 3 4 5 6 7 IRQ >C4| |
-||_____| |o|o|o|o|o|o| J3 |
-|_______ |o|o|o|o|o|o| _______________|
- | |
- |_____________________________________________|
-
-C1 -- "COM9026
- SMC 8638"
- In a chip socket.
-
-C2 -- "@Copyright
- Waterloo Microsystems Inc.
- 1985"
- In a chip Socket with info printed on a label covering a round window
- showing the circuit inside. (The window indicates it is an EPROM chip.)
-
-C3 -- "COM9032
- SMC 8643"
- In a chip socket.
-
-C4 -- "74LS"
- 9 total no sockets.
-
-M5 -- "50006-136
- 20.000000 MHZ
- MTQ-T1-S3
- 0 M-TRON 86-40"
- Metallic case with 4 pins, no socket.
-
-C6 -- "MOSTEK@TC8643
- MK6116N-20
- MALAYSIA"
- No socket.
-
-C7 -- No stamp or label but in a 20 pin chip socket.
-
-C8 -- "PAL10L8CN
- 8623"
- In a 20 pin socket.
-
-C9 -- "PAl16R4A-2CN
- 8641"
- In a 20 pin socket.
-
-C10 -- "M8640
- NMC
- 9306N"
- In an 8 pin socket.
-
-?? -- Some components on a smaller board and attached with 20 pins all
- along the side closest to the BNC connector. The are coated in a dark
- resin.
-
-On the board there are two jumper banks labeled J2 and J3. The
-manufacturer didn't put a J1 on the board. The two boards I have both
+
+::
+
+ _____________________________________________________________________
+ | \/ \/ ___ __ __ |
+ | C4 C4 |^| | M || ^ ||^| |
+ | -- -- |_| | 5 || || | C3 |
+ | \/ \/ C10 |___|| ||_| |
+ | C4 C4 _ _ | | ?? |
+ | -- -- | \/ || | |
+ | | || | |
+ | | || C1 | |
+ | | || | \/ _____|
+ | | C6 || | C9 | |___
+ | | || | -- | BNC |___|
+ | | || | >C7| |_____|
+ | | || | |
+ | __ __ |____||_____| 1 2 3 6 |
+ || ^ | >C4| |o|o|o|o|o|o| J2 >C4| |
+ || | |o|o|o|o|o|o| |
+ || C2 | >C4| >C4| |
+ || | >C8| |
+ || | 2 3 4 5 6 7 IRQ >C4| |
+ ||_____| |o|o|o|o|o|o| J3 |
+ |_______ |o|o|o|o|o|o| _______________|
+ | |
+ |_____________________________________________|
+
+ C1 -- "COM9026
+ SMC 8638"
+ In a chip socket.
+
+ C2 -- "@Copyright
+ Waterloo Microsystems Inc.
+ 1985"
+ In a chip Socket with info printed on a label covering a round window
+ showing the circuit inside. (The window indicates it is an EPROM chip.)
+
+ C3 -- "COM9032
+ SMC 8643"
+ In a chip socket.
+
+ C4 -- "74LS"
+ 9 total no sockets.
+
+ M5 -- "50006-136
+ 20.000000 MHZ
+ MTQ-T1-S3
+ 0 M-TRON 86-40"
+ Metallic case with 4 pins, no socket.
+
+ C6 -- "MOSTEK@TC8643
+ MK6116N-20
+ MALAYSIA"
+ No socket.
+
+ C7 -- No stamp or label but in a 20 pin chip socket.
+
+ C8 -- "PAL10L8CN
+ 8623"
+ In a 20 pin socket.
+
+ C9 -- "PAl16R4A-2CN
+ 8641"
+ In a 20 pin socket.
+
+ C10 -- "M8640
+ NMC
+ 9306N"
+ In an 8 pin socket.
+
+ ?? -- Some components on a smaller board and attached with 20 pins all
+ along the side closest to the BNC connector. The are coated in a dark
+ resin.
+
+On the board there are two jumper banks labeled J2 and J3. The
+manufacturer didn't put a J1 on the board. The two boards I have both
came with a jumper box for each bank.
-J2 -- Numbered 1 2 3 4 5 6.
- 4 and 5 are not stamped due to solder points.
-
-J3 -- IRQ 2 3 4 5 6 7
+::
+
+ J2 -- Numbered 1 2 3 4 5 6.
+ 4 and 5 are not stamped due to solder points.
+
+ J3 -- IRQ 2 3 4 5 6 7
-The board itself has a maple leaf stamped just above the irq jumpers
-and "-2 46-86" beside C2. Between C1 and C6 "ASS 'Y 300163" and "@1986
+The board itself has a maple leaf stamped just above the irq jumpers
+and "-2 46-86" beside C2. Between C1 and C6 "ASS 'Y 300163" and "@1986
CORMAN CUSTOM ELECTRONICS CORP." stamped just below the BNC connector.
Below that "MADE IN CANADA"
-
-*****************************************************************************
+No Name
+=======
-** No Name **
8-bit cards, 16-bit cards
-------------------------
+
- from Juergen Seifert <seifert@htwm.de>
-
-NONAME 8-BIT ARCNET
-===================
I have named this ARCnet card "NONAME", since there is no name of any
manufacturer on the Installation manual nor on the shipping box. The only
@@ -2388,8 +2458,10 @@ it is "Made in Taiwan"
This description has been written by Juergen Seifert <seifert@htwm.de>
using information from the Original
- "ARCnet Installation Manual"
+ "ARCnet Installation Manual"
+
+::
________________________________________________________________
| |STAR| BUS| T/P| |
@@ -2416,32 +2488,32 @@ using information from the Original
| \ IRQ / T T O |
|__________________1_2_M______________________|
-Legend:
+Legend::
-COM90C65: ARCnet Probe
-S1 1-8: Node ID Select
-S2 1-3: I/O Base Address Select
- 4-6: Memory Base Address Select
- 7-8: RAM Offset Select
-ET1, ET2 Extended Timeout Select
-ROM ROM Enable Select
-CN RG62 Coax Connector
-STAR| BUS | T/P Three fields for placing a sign (colored circle)
- indicating the topology of the card
+ COM90C65: ARCnet Probe
+ S1 1-8: Node ID Select
+ S2 1-3: I/O Base Address Select
+ 4-6: Memory Base Address Select
+ 7-8: RAM Offset Select
+ ET1, ET2 Extended Timeout Select
+ ROM ROM Enable Select
+ CN RG62 Coax Connector
+ STAR| BUS | T/P Three fields for placing a sign (colored circle)
+ indicating the topology of the card
Setting one of the switches to Off means "1", On means "0".
Setting the Node ID
--------------------
+^^^^^^^^^^^^^^^^^^^
The eight switches in group SW1 are used to set the node ID.
Each node attached to the network must have an unique node ID which
must be different from 0.
Switch 8 serves as the least significant bit (LSB).
-The node ID is the sum of the values of all switches set to "1"
-These values are:
+The node ID is the sum of the values of all switches set to "1"
+These values are::
Switch | Value
-------|-------
@@ -2454,30 +2526,30 @@ These values are:
2 | 64
1 | 128
-Some Examples:
+Some Examples::
- Switch | Hex | Decimal
+ Switch | Hex | Decimal
1 2 3 4 5 6 7 8 | Node ID | Node ID
----------------|---------|---------
0 0 0 0 0 0 0 0 | not allowed
- 0 0 0 0 0 0 0 1 | 1 | 1
+ 0 0 0 0 0 0 0 1 | 1 | 1
0 0 0 0 0 0 1 0 | 2 | 2
0 0 0 0 0 0 1 1 | 3 | 3
. . . | |
0 1 0 1 0 1 0 1 | 55 | 85
. . . | |
1 0 1 0 1 0 1 0 | AA | 170
- . . . | |
+ . . . | |
1 1 1 1 1 1 0 1 | FD | 253
1 1 1 1 1 1 1 0 | FE | 254
1 1 1 1 1 1 1 1 | FF | 255
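The examples in the table can be reproduced with a short Python sketch. The function name and the string input are hypothetical; the sketch assumes the switch states are written as in the table, switch 1 first, with "1" standing for Off.

```python
def node_id_from_switches(bits):
    """Return the node ID for eight switch states given as a string.

    bits lists switches 1..8 in that order; '1' means Off, '0' means On.
    Switch 8 is the least significant bit, so switch 1 carries 128.
    """
    if len(bits) != 8 or set(bits) - {"0", "1"}:
        raise ValueError("expected eight characters, each '0' or '1'")
    node_id = int(bits, 2)
    if node_id == 0:
        raise ValueError("node ID 0 is not allowed")
    return node_id


if __name__ == "__main__":
    for row in ("01010101", "10101010", "11111101"):
        node_id = node_id_from_switches(row)
        print(row, format(node_id, "X"), node_id)
```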
Setting the I/O Base Address
-----------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The first three switches in switch group SW2 are used to select one
-of eight possible I/O Base addresses using the following table
+of eight possible I/O Base addresses using the following table::
Switch | Hex I/O
1 2 3 | Address
@@ -2493,7 +2565,7 @@ of eight possible I/O Base addresses using the following table
Setting the Base Memory (RAM) buffer Address
---------------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The memory buffer requires 2K of a 16K block of RAM. The base of this
16K block can be located in any of eight positions.
@@ -2501,6 +2573,8 @@ Switches 4-6 of switch group SW2 select the Base of the 16K block.
Within that 16K address space, the buffer may be assigned any one of four
positions, determined by the offset, switches 7 and 8 of group SW2.
+::
+
Switch | Hex RAM | Hex ROM
4 5 6 7 8 | Address | Address *)
-----------|---------|-----------
@@ -2508,60 +2582,62 @@ positions, determined by the offset, switches 7 and 8 of group SW2.
0 0 0 0 1 | C0800 | C2000
0 0 0 1 0 | C1000 | C2000
0 0 0 1 1 | C1800 | C2000
- | |
+ | |
0 0 1 0 0 | C4000 | C6000
0 0 1 0 1 | C4800 | C6000
0 0 1 1 0 | C5000 | C6000
0 0 1 1 1 | C5800 | C6000
- | |
+ | |
0 1 0 0 0 | CC000 | CE000
0 1 0 0 1 | CC800 | CE000
0 1 0 1 0 | CD000 | CE000
0 1 0 1 1 | CD800 | CE000
- | |
+ | |
0 1 1 0 0 | D0000 | D2000 (Manufacturer's default)
0 1 1 0 1 | D0800 | D2000
0 1 1 1 0 | D1000 | D2000
0 1 1 1 1 | D1800 | D2000
- | |
+ | |
1 0 0 0 0 | D4000 | D6000
1 0 0 0 1 | D4800 | D6000
1 0 0 1 0 | D5000 | D6000
1 0 0 1 1 | D5800 | D6000
- | |
+ | |
1 0 1 0 0 | D8000 | DA000
1 0 1 0 1 | D8800 | DA000
1 0 1 1 0 | D9000 | DA000
1 0 1 1 1 | D9800 | DA000
- | |
+ | |
1 1 0 0 0 | DC000 | DE000
1 1 0 0 1 | DC800 | DE000
1 1 0 1 0 | DD000 | DE000
1 1 0 1 1 | DD800 | DE000
- | |
+ | |
1 1 1 0 0 | E0000 | E2000
1 1 1 0 1 | E0800 | E2000
1 1 1 1 0 | E1000 | E2000
1 1 1 1 1 | E1800 | E2000
-
-*) To enable the 8K Boot PROM install the jumper ROM.
- The default is jumper ROM not installed.
+
+ *) To enable the 8K Boot PROM install the jumper ROM.
+ The default is jumper ROM not installed.
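The table follows a simple pattern that can be sketched in Python: switches 4-6 pick a 16K block, switches 7-8 pick a 2K-aligned offset inside it, and the ROM address is always the block base plus 0x2000. The names below are hypothetical. The block bases are read from the table rows and are not contiguous (C8000 is skipped); the C0000 base for the all-zero row is inferred from the neighbouring C0800 row, since that row is not shown in this hunk.

```python
# Base of the 16K block selected by switches 4-6, indexed by their
# binary value (switch 4 is the most significant bit).
# 0xC0000 is inferred; the remaining values are taken from the table.
BLOCK_BASES = (0xC0000, 0xC4000, 0xCC000, 0xD0000,
               0xD4000, 0xD8000, 0xDC000, 0xE0000)


def ram_and_rom_address(sw4, sw5, sw6, sw7, sw8):
    """Return (RAM address, ROM address) for SW2 switches 4-8.

    Each argument is 0 or 1 as written in the table (Off means 1).
    """
    for value in (sw4, sw5, sw6, sw7, sw8):
        if value not in (0, 1):
            raise ValueError("switch states must be 0 or 1")
    block = BLOCK_BASES[sw4 * 4 + sw5 * 2 + sw6]
    ram = block + (sw7 * 2 + sw8) * 0x800  # four 2K-aligned positions
    rom = block + 0x2000                   # boot PROM address
    return ram, rom


if __name__ == "__main__":
    ram, rom = ram_and_rom_address(0, 1, 1, 0, 0)
    print(format(ram, "X"), format(rom, "X"))
```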
Setting Interrupt Request Lines (IRQ)
--------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To select a hardware interrupt level set one (only one!) of the jumpers
IRQ2, IRQ3, IRQ4, IRQ5 or IRQ7. The manufacturer's default is IRQ2.
-
+
Setting the Timeouts
---------------------
+^^^^^^^^^^^^^^^^^^^^
The two jumpers labeled ET1 and ET2 are used to determine the timeout
parameters (response and reconfiguration time). Every node in a network
must be set to the same timeout values.
+::
+
ET1 ET2 | Response Time (us) | Reconfiguration Time (ms)
--------|--------------------|--------------------------
Off Off | 78 | 840 (Default)
@@ -2572,8 +2648,8 @@ must be set to the same timeout values.
On means jumper installed, Off means jumper not installed
-NONAME 16-BIT ARCNET
-====================
+16-BIT ARCNET
+-------------
The manual of my 8-Bit NONAME ARCnet Card contains another description
of a 16-Bit Coax / Twisted Pair Card. This description is incomplete,
@@ -2584,13 +2660,16 @@ the booklet there is a different way of counting ... 2-9, 2-10, A-1,
Also the picture of the board layout is not as good as the picture of
8-Bit card, because there isn't any letter like "SW1" written to the
picture.
+
Should somebody have such a board, please feel free to complete this
description or to send a mail to me!
This description has been written by Juergen Seifert <seifert@htwm.de>
using information from the Original
- "ARCnet Installation Manual"
+ "ARCnet Installation Manual"
+
+::
___________________________________________________________________
< _________________ _________________ |
@@ -2622,15 +2701,15 @@ Setting one of the switches to Off means "1", On means "0".
Setting the Node ID
--------------------
+^^^^^^^^^^^^^^^^^^^
The eight switches in group SW2 are used to set the node ID.
Each node attached to the network must have an unique node ID which
must be different from 0.
Switch 8 serves as the least significant bit (LSB).
-The node ID is the sum of the values of all switches set to "1"
-These values are:
+The node ID is the sum of the values of all switches set to "1"
+These values are::
Switch | Value
-------|-------
@@ -2643,30 +2722,30 @@ These values are:
2 | 64
1 | 128
-Some Examples:
+Some Examples::
- Switch | Hex | Decimal
+ Switch | Hex | Decimal
1 2 3 4 5 6 7 8 | Node ID | Node ID
----------------|---------|---------
0 0 0 0 0 0 0 0 | not allowed
- 0 0 0 0 0 0 0 1 | 1 | 1
+ 0 0 0 0 0 0 0 1 | 1 | 1
0 0 0 0 0 0 1 0 | 2 | 2
0 0 0 0 0 0 1 1 | 3 | 3
. . . | |
0 1 0 1 0 1 0 1 | 55 | 85
. . . | |
1 0 1 0 1 0 1 0 | AA | 170
- . . . | |
+ . . . | |
1 1 1 1 1 1 0 1 | FD | 253
1 1 1 1 1 1 1 0 | FE | 254
1 1 1 1 1 1 1 1 | FF | 255
Setting the I/O Base Address
-----------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The first three switches in switch group SW1 are used to select one
-of eight possible I/O Base addresses using the following table
+of eight possible I/O Base addresses using the following table::
Switch | Hex I/O
3 2 1 | Address
@@ -2682,13 +2761,13 @@ of eight possible I/O Base addresses using the following table
Setting the Base Memory (RAM) buffer Address
---------------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The memory buffer requires 2K of a 16K block of RAM. The base of this
16K block can be located in any of eight positions.
Switches 6-8 of switch group SW1 select the Base of the 16K block.
Within that 16K address space, the buffer may be assigned any one of four
-positions, determined by the offset, switches 4 and 5 of group SW1.
+positions, determined by the offset, switches 4 and 5 of group SW1::
Switch | Hex RAM | Hex ROM
8 7 6 5 4 | Address | Address
@@ -2697,111 +2776,111 @@ positions, determined by the offset, switches 4 and 5 of group SW1.
0 0 0 0 1 | C0800 | C2000
0 0 0 1 0 | C1000 | C2000
0 0 0 1 1 | C1800 | C2000
- | |
+ | |
0 0 1 0 0 | C4000 | C6000
0 0 1 0 1 | C4800 | C6000
0 0 1 1 0 | C5000 | C6000
0 0 1 1 1 | C5800 | C6000
- | |
+ | |
0 1 0 0 0 | CC000 | CE000
0 1 0 0 1 | CC800 | CE000
0 1 0 1 0 | CD000 | CE000
0 1 0 1 1 | CD800 | CE000
- | |
+ | |
0 1 1 0 0 | D0000 | D2000 (Manufacturer's default)
0 1 1 0 1 | D0800 | D2000
0 1 1 1 0 | D1000 | D2000
0 1 1 1 1 | D1800 | D2000
- | |
+ | |
1 0 0 0 0 | D4000 | D6000
1 0 0 0 1 | D4800 | D6000
1 0 0 1 0 | D5000 | D6000
1 0 0 1 1 | D5800 | D6000
- | |
+ | |
1 0 1 0 0 | D8000 | DA000
1 0 1 0 1 | D8800 | DA000
1 0 1 1 0 | D9000 | DA000
1 0 1 1 1 | D9800 | DA000
- | |
+ | |
1 1 0 0 0 | DC000 | DE000
1 1 0 0 1 | DC800 | DE000
1 1 0 1 0 | DD000 | DE000
1 1 0 1 1 | DD800 | DE000
- | |
+ | |
1 1 1 0 0 | E0000 | E2000
1 1 1 0 1 | E0800 | E2000
1 1 1 1 0 | E1000 | E2000
1 1 1 1 1 | E1800 | E2000
-
+
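
The address table above follows a regular pattern, which the following Python sketch tries to capture. It is an illustration under stated assumptions: the function name `buffer_addresses` is hypothetical, the block base `C0000` for switches 8 7 6 = `0 0 0` is inferred from the visible `C0800`/`C1000`/`C1800` rows, and the ROM address is taken to be the block base plus 0x2000, as every visible row suggests.

```python
# Base of each 16K block, indexed by the value of switches 8 7 6.
# The bases are not evenly spaced (C8000 is skipped), so a lookup
# table is used instead of a multiplication.
BLOCK_BASES = [0xC0000, 0xC4000, 0xCC000, 0xD0000,
               0xD4000, 0xD8000, 0xDC000, 0xE0000]


def buffer_addresses(sw8, sw7, sw6, sw5, sw4):
    """Return (ram_address, rom_address) for the given SW1 bit values."""
    if any(s not in (0, 1) for s in (sw8, sw7, sw6, sw5, sw4)):
        raise ValueError("each switch value must be 0 or 1")
    block = BLOCK_BASES[(sw8 << 2) | (sw7 << 1) | sw6]
    # Switches 5 and 4 pick one of four 2K positions inside the block.
    ram = block + ((sw5 << 1) | sw4) * 0x800
    rom = block + 0x2000
    return ram, rom


ram, rom = buffer_addresses(0, 1, 1, 0, 0)
print(f"{ram:05X} {rom:05X}")  # D0000 D2000
```

The call shown uses the row marked as the manufacturer's default.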
Setting Interrupt Request Lines (IRQ)
--------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
??????????????????????????????????????
Setting the Timeouts
---------------------
+^^^^^^^^^^^^^^^^^^^^
??????????????????????????????????????
-*****************************************************************************
-
-** No Name **
8-bit cards ("Made in Taiwan R.O.C.")
------------
+-------------------------------------
+
- from Vojtech Pavlik <vojtech@suse.cz>
I have named this ARCnet card "NONAME", since I got only the card with
-no manual at all and the only text identifying the manufacturer is
+no manual at all and the only text identifying the manufacturer is
"MADE IN TAIWAN R.O.C" printed on the card.
- ____________________________________________________________
- | 1 2 3 4 5 6 7 8 |
- | |o|o| JP1 o|o|o|o|o|o|o|o| ON |
- | + o|o|o|o|o|o|o|o| ___|
- | _____________ o|o|o|o|o|o|o|o| OFF _____ | | ID7
- | | | SW1 | | | | ID6
- | > RAM (2k) | ____________________ | H | | S | ID5
- | |_____________| | || y | | W | ID4
- | | || b | | 2 | ID3
- | | || r | | | ID2
- | | || i | | | ID1
- | | 90C65 || d | |___| ID0
- | SW3 | || | |
- | |o|o|o|o|o|o|o|o| ON | || I | |
- | |o|o|o|o|o|o|o|o| | || C | |
- | |o|o|o|o|o|o|o|o| OFF |____________________|| | _____|
- | 1 2 3 4 5 6 7 8 | | | |___
- | ______________ | | | BNC |___|
- | | | |_____| |_____|
- | > EPROM SOCKET | |
- | |______________| |
- | ______________|
- | |
- |_____________________________________________|
-
-Legend:
-
-90C65 ARCNET Chip
-SW1 1-5: Base Memory Address Select
- 6-8: Base I/O Address Select
-SW2 1-8: Node ID Select (ID0-ID7)
-SW3 1-5: IRQ Select
- 6-7: Extra Timeout
- 8 : ROM Enable
-JP1 Led connector
-BNC Coax connector
-
-Although the jumpers SW1 and SW3 are marked SW, not JP, they are jumpers, not
+::
+
+ ____________________________________________________________
+ | 1 2 3 4 5 6 7 8 |
+ | |o|o| JP1 o|o|o|o|o|o|o|o| ON |
+ | + o|o|o|o|o|o|o|o| ___|
+ | _____________ o|o|o|o|o|o|o|o| OFF _____ | | ID7
+ | | | SW1 | | | | ID6
+ | > RAM (2k) | ____________________ | H | | S | ID5
+ | |_____________| | || y | | W | ID4
+ | | || b | | 2 | ID3
+ | | || r | | | ID2
+ | | || i | | | ID1
+ | | 90C65 || d | |___| ID0
+ | SW3 | || | |
+ | |o|o|o|o|o|o|o|o| ON | || I | |
+ | |o|o|o|o|o|o|o|o| | || C | |
+ | |o|o|o|o|o|o|o|o| OFF |____________________|| | _____|
+ | 1 2 3 4 5 6 7 8 | | | |___
+ | ______________ | | | BNC |___|
+ | | | |_____| |_____|
+ | > EPROM SOCKET | |
+ | |______________| |
+ | ______________|
+ | |
+ |_____________________________________________|
+
+Legend::
+
+ 90C65 ARCNET Chip
+ SW1 1-5: Base Memory Address Select
+ 6-8: Base I/O Address Select
+ SW2 1-8: Node ID Select (ID0-ID7)
+ SW3 1-5: IRQ Select
+ 6-7: Extra Timeout
+ 8 : ROM Enable
+ JP1 Led connector
+ BNC Coax connector
+
+Although the jumpers SW1 and SW3 are marked SW, not JP, they are jumpers, not
switches.
-Setting the jumpers to ON means connecting the upper two pins, off the bottom
+Setting the jumpers to ON means connecting the upper two pins, off the bottom
two - or - in case of IRQ setting, connecting none of them at all.
Setting the Node ID
--------------------
+^^^^^^^^^^^^^^^^^^^
The eight switches in SW2 are used to set the node ID. Each node attached
to the network must have an unique node ID which must not be 0.
@@ -2809,8 +2888,8 @@ Switch 1 (ID0) serves as the least significant bit (LSB).
Setting one of the switches to Off means "1", On means "0".
-The node ID is the sum of the values of all switches set to "1"
-These values are:
+The node ID is the sum of the values of all switches set to "1"
+These values are::
Switch | Label | Value
-------|-------|-------
@@ -2823,30 +2902,30 @@ These values are:
7 | ID6 | 64
8 | ID7 | 128
-Some Examples:
+Some Examples::
- Switch | Hex | Decimal
+ Switch | Hex | Decimal
8 7 6 5 4 3 2 1 | Node ID | Node ID
----------------|---------|---------
0 0 0 0 0 0 0 0 | not allowed
- 0 0 0 0 0 0 0 1 | 1 | 1
+ 0 0 0 0 0 0 0 1 | 1 | 1
0 0 0 0 0 0 1 0 | 2 | 2
0 0 0 0 0 0 1 1 | 3 | 3
. . . | |
0 1 0 1 0 1 0 1 | 55 | 85
. . . | |
1 0 1 0 1 0 1 0 | AA | 170
- . . . | |
+ . . . | |
1 1 1 1 1 1 0 1 | FD | 253
1 1 1 1 1 1 1 0 | FE | 254
1 1 1 1 1 1 1 1 | FF | 255
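
Going the other way, from a desired node ID to jumper positions, can be sketched as below. This is a hypothetical helper, not something the driver provides; it assumes the conventions stated for this card, namely that switch 1 (ID0) is the LSB and that Off means "1" while On means "0".

```python
def sw2_settings(node_id):
    """Return the ON/OFF positions of SW2 switches 1..8 for a node ID."""
    if not 1 <= node_id <= 255:
        raise ValueError("node ID must be in the range 1..255")
    # Switch n carries bit ID(n-1); a "1" bit means OFF, a "0" bit means ON.
    return ["OFF" if (node_id >> bit) & 1 else "ON" for bit in range(8)]


for number, position in enumerate(sw2_settings(0x55), start=1):
    print(f"switch {number}: {position}")
```

For node ID 0x55 this lists switches 1, 3, 5 and 7 as OFF and the rest as ON, which agrees with the 55 / 85 row once the table's 8-to-1 column order is taken into account.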
Setting the I/O Base Address
-----------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The last three switches in switch block SW1 are used to select one
-of eight possible I/O Base addresses using the following table
+of eight possible I/O Base addresses using the following table::
Switch | Hex I/O
@@ -2863,13 +2942,16 @@ of eight possible I/O Base addresses using the following table
Setting the Base Memory (RAM) buffer Address
---------------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-The memory buffer (RAM) requires 2K. The base of this buffer can be
+The memory buffer (RAM) requires 2K. The base of this buffer can be
located in any of eight positions. The address of the Boot Prom is
memory base + 0x2000.
+
Jumpers 3-5 of jumper block SW1 select the Memory Base address.
+::
+
Switch | Hex RAM | Hex ROM
1 2 3 4 5 | Address | Address *)
--------------------|---------|-----------
@@ -2881,15 +2963,15 @@ Jumpers 3-5 of jumper block SW1 select the Memory Base address.
ON ON OFF ON OFF | D8000 | DA000
ON ON ON OFF OFF | DC000 | DE000
ON ON OFF OFF OFF | E0000 | E2000
-
-*) To enable the Boot ROM set the jumper 8 of jumper block SW3 to position ON.
+
+ *) To enable the Boot ROM set the jumper 8 of jumper block SW3 to position ON.
The jumpers 1 and 2 probably add 0x0800, 0x1000 and 0x1800 to the RAM address.
Setting the Interrupt Line
---------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^
-Jumpers 1-5 of the jumper block SW3 control the IRQ level.
+Jumpers 1-5 of the jumper block SW3 control the IRQ level::
Jumper | IRQ
1 2 3 4 5 |
@@ -2902,23 +2984,24 @@ Jumpers 1-5 of the jumper block SW3 control the IRQ level.
Setting the Timeout Parameters
-------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-The jumpers 6-7 of the jumper block SW3 are used to determine the timeout
+The jumpers 6-7 of the jumper block SW3 are used to determine the timeout
parameters. These two jumpers are normally left in the OFF position.
-*****************************************************************************
-** No Name **
(Generic Model 9058)
--------------------
- from Andrew J. Kroll <ag784@freenet.buffalo.edu>
- Sorry this sat in my to-do box for so long, Andrew! (yikes - over a
year!)
- _____
- | <
- | .---'
+
+::
+
+ _____
+ | <
+ | .---'
________________________________________________________________ | |
| | SW2 | | |
| ___________ |_____________| | |
@@ -2936,7 +3019,7 @@ parameters. These two jumpers are normally left in the OFF position.
| |________________| | | : B |- | |
| 1 2 3 4 5 6 7 8 | | : O |- | |
| |_________o____|..../ A |- _______| |
- | ____________________ | R |- | |------,
+ | ____________________ | R |- | |------,
| | | | D |- | BNC | # |
| > 2764 PROM SOCKET | |__________|- |_______|------'
| |____________________| _________ | |
@@ -2945,23 +3028,24 @@ parameters. These two jumpers are normally left in the OFF position.
|___ ______________| |
|H H H H H H H H H H H H H H H H H H H H H H H| | |
|U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U| | |
- \|
-Legend:
+ \|
+
+Legend::
-SL90C65 ARCNET Controller / Transceiver /Logic
-SW1 1-5: IRQ Select
+ SL90C65 ARCNET Controller / Transceiver /Logic
+ SW1 1-5: IRQ Select
6: ET1
7: ET2
- 8: ROM ENABLE
-SW2 1-3: Memory Buffer/PROM Address
+ 8: ROM ENABLE
+ SW2 1-3: Memory Buffer/PROM Address
3-6: I/O Address Map
-SW3 1-8: Node ID Select
-BNC BNC RG62/U Connection
+ SW3 1-8: Node ID Select
+ BNC BNC RG62/U Connection
*I* have had success using RG59B/U with *NO* terminators!
What gives?!
SW1: Timeouts, Interrupt and ROM
----------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To select a hardware interrupt level set one (only one!) of the dip switches
up (on) SW1...(switches 1-5)
@@ -2976,10 +3060,10 @@ are normally left off (down).
Setting the I/O Base Address
-----------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The last three switches in switch group SW2 are used to select one
-of eight possible I/O Base addresses using the following table
+of eight possible I/O Base addresses using the following table::
Switch | Hex I/O
@@ -2996,7 +3080,7 @@ of eight possible I/O Base addresses using the following table
Setting the Base Memory Address (RAM & ROM)
--------------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The memory buffer requires 2K of a 16K block of RAM. The base of this
16K block can be located in any of eight positions.
@@ -3004,13 +3088,16 @@ Switches 1-3 of switch group SW2 select the Base of the 16K block.
(0 = DOWN, 1 = UP)
I could, however, only verify two settings...
+
+::
+
Switch| Hex RAM | Hex ROM
1 2 3 | Address | Address
------|---------|-----------
0 0 0 | E0000 | E2000
0 0 1 | D0000 | D2000 (Manufacturer's default)
0 1 0 | ????? | ?????
- 0 1 1 | ????? | ?????
+ 0 1 1 | ????? | ?????
1 0 0 | ????? | ?????
1 0 1 | ????? | ?????
1 1 0 | ????? | ?????
@@ -3018,7 +3105,7 @@ I could, however, only verify two settings...
Setting the Node ID
--------------------
+^^^^^^^^^^^^^^^^^^^
The eight switches in group SW3 are used to set the node ID.
Each node attached to the network must have an unique node ID which
@@ -3026,8 +3113,9 @@ must be different from 0.
Switch 1 serves as the least significant bit (LSB).
switches in the DOWN position are OFF (0) and in the UP position are ON (1)
-The node ID is the sum of the values of all switches set to "1"
-These values are:
+The node ID is the sum of the values of all switches set to "1"
+These values are::
+
Switch | Value
-------|-------
1 | 1
@@ -3039,70 +3127,80 @@ These values are:
7 | 64
8 | 128
-Some Examples:
-
- Switch# | Hex | Decimal
-8 7 6 5 4 3 2 1 | Node ID | Node ID
-----------------|---------|---------
-0 0 0 0 0 0 0 0 | not allowed <-.
-0 0 0 0 0 0 0 1 | 1 | 1 |
-0 0 0 0 0 0 1 0 | 2 | 2 |
-0 0 0 0 0 0 1 1 | 3 | 3 |
- . . . | | |
-0 1 0 1 0 1 0 1 | 55 | 85 |
- . . . | | + Don't use 0 or 255!
-1 0 1 0 1 0 1 0 | AA | 170 |
- . . . | | |
-1 1 1 1 1 1 0 1 | FD | 253 |
-1 1 1 1 1 1 1 0 | FE | 254 |
-1 1 1 1 1 1 1 1 | FF | 255 <-'
-
+Some Examples::
-*****************************************************************************
+ Switch# | Hex | Decimal
+ 8 7 6 5 4 3 2 1 | Node ID | Node ID
+ ----------------|---------|---------
+ 0 0 0 0 0 0 0 0 | not allowed <-.
+ 0 0 0 0 0 0 0 1 | 1 | 1 |
+ 0 0 0 0 0 0 1 0 | 2 | 2 |
+ 0 0 0 0 0 0 1 1 | 3 | 3 |
+ . . . | | |
+ 0 1 0 1 0 1 0 1 | 55 | 85 |
+ . . . | | + Don't use 0 or 255!
+ 1 0 1 0 1 0 1 0 | AA | 170 |
+ . . . | | |
+ 1 1 1 1 1 1 0 1 | FD | 253 |
+ 1 1 1 1 1 1 1 0 | FE | 254 |
+ 1 1 1 1 1 1 1 1 | FF | 255 <-'
+
+
+Tiara
+=====
-** Tiara **
(model unknown)
--------------------------
+---------------
+
- from Christoph Lameter <christoph@lameter.com>
-
-
-Here is information about my card as far as I could figure it out:
------------------------------------------------ tiara
-Tiara LanCard of Tiara Computer Systems.
-
-+----------------------------------------------+
-! ! Transmitter Unit ! !
-! +------------------+ -------
-! MEM Coax Connector
-! ROM 7654321 <- I/O -------
-! : : +--------+ !
-! : : ! 90C66LJ! +++
-! : : ! ! !D Switch to set
-! : : ! ! !I the Nodenumber
-! : : +--------+ !P
-! !++
-! 234567 <- IRQ !
-+------------!!!!!!!!!!!!!!!!!!!!!!!!--------+
- !!!!!!!!!!!!!!!!!!!!!!!!
-
-0 = Jumper Installed
-1 = Open
+
+
+Here is information about my card as far as I could figure it out::
+
+
+ ----------------------------------------------- tiara
+ Tiara LanCard of Tiara Computer Systems.
+
+ +----------------------------------------------+
+ ! ! Transmitter Unit ! !
+ ! +------------------+ -------
+ ! MEM Coax Connector
+ ! ROM 7654321 <- I/O -------
+ ! : : +--------+ !
+ ! : : ! 90C66LJ! +++
+ ! : : ! ! !D Switch to set
+ ! : : ! ! !I the Nodenumber
+ ! : : +--------+ !P
+ ! !++
+ ! 234567 <- IRQ !
+ +------------!!!!!!!!!!!!!!!!!!!!!!!!--------+
+ !!!!!!!!!!!!!!!!!!!!!!!!
+
+- 0 = Jumper Installed
+- 1 = Open
Top Jumper line Bit 7 = ROM Enable 654=Memory location 321=I/O
Settings for Memory Location (Top Jumper Line)
+
+=== ================
456 Address selected
+=== ================
000 C0000
001 C4000
010 CC000
011 D0000
100 D4000
101 D8000
-110 DC000
+110 DC000
111 E0000
+=== ================
Settings for I/O Address (Top Jumper Line)
+
+=== ====
123 Port
+=== ====
000 260
001 290
010 2E0
@@ -3111,23 +3209,26 @@ Settings for I/O Address (Top Jumper Line)
101 350
110 380
111 3E0
+=== ====
Settings for IRQ Selection (Lower Jumper Line)
+
+====== =====
234567
+====== =====
011111 IRQ 2
101111 IRQ 3
110111 IRQ 4
111011 IRQ 5
111110 IRQ 7
-
-*****************************************************************************
-
+====== =====
Other Cards
------------
+===========
I have no information on other models of ARCnet cards at the moment. Please
send any and all info to:
+
apenwarr@worldvisions.ca
Thanks.
diff --git a/Documentation/networking/arcnet.txt b/Documentation/networking/arcnet.rst
index aff97f47c05c..82fce606c0f0 100644
--- a/Documentation/networking/arcnet.txt
+++ b/Documentation/networking/arcnet.rst
@@ -1,11 +1,18 @@
-----------------------------------------------------------------------------
-NOTE: See also arcnet-hardware.txt in this directory for jumper-setting
-and cabling information if you're like many of us and didn't happen to get a
-manual with your ARCnet card.
-----------------------------------------------------------------------------
+.. SPDX-License-Identifier: GPL-2.0
+
+======
+ARCnet
+======
+
+.. note::
+
+ See also arcnet-hardware.rst in this directory for jumper-setting
+ and cabling information if you're like many of us and didn't happen to get a
+ manual with your ARCnet card.
Since no one seems to listen to me otherwise, perhaps a poem will get your
-attention:
+attention::
+
This driver's getting fat and beefy,
But my cat is still named Fifi.
@@ -24,28 +31,21 @@ Come on, be a sport! Send me a success report!
(hey, that was even better than my original poem... this is getting bad!)
---------
-WARNING:
---------
-
-If you don't e-mail me about your success/failure soon, I may be forced to
-start SINGING. And we don't want that, do we?
+.. warning::
-(You know, it might be argued that I'm pushing this point a little too much.
-If you think so, why not flame me in a quick little e-mail? Please also
-include the type of card(s) you're using, software, size of network, and
-whether it's working or not.)
+ If you don't e-mail me about your success/failure soon, I may be forced to
+ start SINGING. And we don't want that, do we?
-My e-mail address is: apenwarr@worldvisions.ca
+ (You know, it might be argued that I'm pushing this point a little too much.
+ If you think so, why not flame me in a quick little e-mail? Please also
+ include the type of card(s) you're using, software, size of network, and
+ whether it's working or not.)
+ My e-mail address is: apenwarr@worldvisions.ca
----------------------------------------------------------------------------
-
-
These are the ARCnet drivers for Linux.
-
-This new release (2.91) has been put together by David Woodhouse
+This new release (2.91) has been put together by David Woodhouse
<dwmw2@infradead.org>, in an attempt to tidy up the driver after adding support
for yet another chipset. Now the generic support has been separated from the
individual chipset drivers, and the source files aren't quite so packed with
@@ -62,12 +62,13 @@ included and seems to be working fine!
Where do I discuss these drivers?
---------------------------------
-Tomasz has been so kind as to set up a new and improved mailing list.
+Tomasz has been so kind as to set up a new and improved mailing list.
Subscribe by sending a message with the BODY "subscribe linux-arcnet YOUR
REAL NAME" to listserv@tichy.ch.uj.edu.pl. Then, to submit messages to the
list, mail to linux-arcnet@tichy.ch.uj.edu.pl.
There are archives of the mailing list at:
+
http://epistolary.org/mailman/listinfo.cgi/arcnet
The people on linux-net@vger.kernel.org (now defunct, replaced by
@@ -80,17 +81,20 @@ Other Drivers and Info
----------------------
You can try my ARCNET page on the World Wide Web at:
- http://www.qis.net/~jschmitz/arcnet/
+
+ http://www.qis.net/~jschmitz/arcnet/
Also, SMC (one of the companies that makes ARCnet cards) has a WWW site you
might be interested in, which includes several drivers for various cards
including ARCnet. Try:
+
http://www.smc.com/
-
+
Performance Technologies makes various network software that supports
ARCnet:
+
http://www.perftech.com/ or ftp to ftp.perftech.com.
-
+
Novell makes a networking stack for DOS which includes ARCnet drivers. Try
FTPing to ftp.novell.com.
@@ -99,19 +103,20 @@ one you'll want to use with ARCnet cards) from
oak.oakland.edu:/simtel/msdos/pktdrvr. It won't work perfectly on a 386+
without patches, though, and also doesn't like several cards. Fixed
versions are available on my WWW page, or via e-mail if you don't have WWW
-access.
+access.
Installing the Driver
---------------------
-All you will need to do in order to install the driver is:
+All you will need to do in order to install the driver is::
+
make config
- (be sure to choose ARCnet in the network devices
+ (be sure to choose ARCnet in the network devices
and at least one chipset driver.)
make clean
make zImage
-
+
If you obtained this ARCnet package as an upgrade to the ARCnet driver in
your current kernel, you will need to first copy arcnet.c over the one in
the linux/drivers/net directory.
@@ -125,10 +130,12 @@ There are four chipset options:
This is the normal ARCnet card, which you've probably got. This is the only
chipset driver which will autoprobe if not told where the card is.
-It following options on the command line:
+It takes the following options on the command line::
+
com90xx=[<io>[,<irq>[,<shmem>]]][,<name>] | <name>
-If you load the chipset support as a module, the options are:
+If you load the chipset support as a module, the options are::
+
io=<io> irq=<irq> shmem=<shmem> device=<name>
To disable the autoprobe, just specify "com90xx=" on the kernel command line.
@@ -136,14 +143,17 @@ To specify the name alone, but allow autoprobe, just put "com90xx=<name>"
2. ARCnet COM20020 chipset.
-This is the new chipset from SMC with support for promiscuous mode (packet
+This is the new chipset from SMC with support for promiscuous mode (packet
sniffing), extra diagnostic information, etc. Unfortunately, there is no
sensible method of autoprobing for these cards. You must specify the I/O
address on the kernel command line.
-The command line options are:
+
+The command line options are::
+
com20020=<io>[,<irq>[,<node_ID>[,backplane[,CKP[,timeout]]]]][,name]
-If you load the chipset support as a module, the options are:
+If you load the chipset support as a module, the options are::
+
io=<io> irq=<irq> node=<node_ID> backplane=<backplane> clock=<CKP>
timeout=<timeout> device=<name>
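
The nested brackets in the `com20020=` syntax mean that a later positional value can only be given when all earlier ones are present. The Python sketch below builds such a string. It is an illustration only: the helper `com20020_option` is hypothetical, and writing the I/O address in hex and the other values in decimal is an assumption, since the text does not say which number formats the kernel parser accepts.

```python
def com20020_option(io, irq=None, node_id=None, backplane=None,
                    clock=None, timeout=None, name=None):
    """Build a com20020= kernel command line option string."""
    positional = [irq, node_id, backplane, clock, timeout]
    # Trailing values may simply be left out.
    while positional and positional[-1] is None:
        positional.pop()
    # A gap in the middle cannot be expressed in this syntax.
    if any(value is None for value in positional):
        raise ValueError("an option cannot be given without those before it")
    parts = [hex(io)] + [str(value) for value in positional]
    if name is not None:
        parts.append(name)
    return "com20020=" + ",".join(parts)


print(com20020_option(0x2e0))
print(com20020_option(0x380, irq=5, node_id=42))
```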
@@ -160,8 +170,10 @@ you have a card which doesn't support shared memory, or (strangely) in case
you have so many ARCnet cards in your machine that you run out of shmem slots.
If you don't give the IO address on the kernel command line, then the driver
will not find the card.
-The command line options are:
- com90io=<io>[,<irq>][,<name>]
+
+The command line options are::
+
+ com90io=<io>[,<irq>][,<name>]
If you load the chipset support as a module, the options are:
io=<io> irq=<irq> device=<name>
@@ -169,44 +181,49 @@ If you load the chipset support as a module, the options are:
4. ARCnet RIM I cards.
These are COM90xx chips which are _completely_ memory mapped. The support for
-these is not tested. If you have one, please mail the author with a success
+these is not tested. If you have one, please mail the author with a success
report. All options must be specified, except the device name.
-Command line options:
+Command line options::
+
arcrimi=<shmem>,<irq>,<node_ID>[,<name>]
-If you load the chipset support as a module, the options are:
+If you load the chipset support as a module, the options are::
+
shmem=<shmem> irq=<irq> node=<node_ID> device=<name>
Loadable Module Support
-----------------------
-Configure and rebuild Linux. When asked, answer 'm' to "Generic ARCnet
+Configure and rebuild Linux. When asked, answer 'm' to "Generic ARCnet
support" and to support for your ARCnet chipset if you want to use the
-loadable module. You can also say 'y' to "Generic ARCnet support" and 'm'
+loadable module. You can also say 'y' to "Generic ARCnet support" and 'm'
to the chipset support if you wish.
+::
+
make config
- make clean
+ make clean
make zImage
make modules
-
+
If you're using a loadable module, you need to use insmod to load it, and
you can specify various characteristics of your card on the command
line. (In recent versions of the driver, autoprobing is much more reliable
and works as a module, so most of this is now unnecessary.)
-For example:
+For example::
+
cd /usr/src/linux/modules
insmod arcnet.o
insmod com90xx.o
insmod com20020.o io=0x2e0 device=eth1
-
+
Using the Driver
----------------
-If you build your kernel with ARCnet COM90xx support included, it should
+If you build your kernel with ARCnet COM90xx support included, it should
probe for your card automatically when you boot. If you use a different
chipset driver compiled into the kernel, you must give the necessary options
on the kernel command line, as detailed above.
@@ -224,69 +241,78 @@ Multiple Cards in One Computer
------------------------------
Linux has pretty good support for this now, but since I've been busy, the
-ARCnet driver has somewhat suffered in this respect. COM90xx support, if
-compiled into the kernel, will (try to) autodetect all the installed cards.
+ARCnet driver has somewhat suffered in this respect. COM90xx support, if
+compiled into the kernel, will (try to) autodetect all the installed cards.
+
+If you have other cards, with support compiled into the kernel, then you can
+just repeat the options on the kernel command line, e.g.::
+
+ LILO: linux com20020=0x2e0 com20020=0x380 com90io=0x260
-If you have other cards, with support compiled into the kernel, then you can
-just repeat the options on the kernel command line, e.g.:
-LILO: linux com20020=0x2e0 com20020=0x380 com90io=0x260
+If you have the chipset support built as a loadable module, then you need to
+do something like this::
-If you have the chipset support built as a loadable module, then you need to
-do something like this:
insmod -o arc0 com90xx
insmod -o arc1 com20020 io=0x2e0
insmod -o arc2 com90xx
+
The ARCnet drivers will now sort out their names automatically.
How do I get it to work with...?
--------------------------------
-NFS: Should be fine linux->linux, just pretend you're using Ethernet cards.
- oak.oakland.edu:/simtel/msdos/nfs has some nice DOS clients. There
- is also a DOS-based NFS server called SOSS. It doesn't multitask
- quite the way Linux does (actually, it doesn't multitask AT ALL) but
- you never know what you might need.
-
- With AmiTCP (and possibly others), you may need to set the following
- options in your Amiga nfstab: MD 1024 MR 1024 MW 1024
- (Thanks to Christian Gottschling <ferksy@indigo.tng.oche.de>
+NFS:
+ Should be fine linux->linux, just pretend you're using Ethernet cards.
+ oak.oakland.edu:/simtel/msdos/nfs has some nice DOS clients. There
+ is also a DOS-based NFS server called SOSS. It doesn't multitask
+ quite the way Linux does (actually, it doesn't multitask AT ALL) but
+ you never know what you might need.
+
+ With AmiTCP (and possibly others), you may need to set the following
+ options in your Amiga nfstab: MD 1024 MR 1024 MW 1024
+ (Thanks to Christian Gottschling <ferksy@indigo.tng.oche.de>
for this.)
-
+
Probably these refer to maximum NFS data/read/write block sizes. I
don't know why the defaults on the Amiga didn't work; write to me if
you know more.
-DOS: If you're using the freeware arcether.com, you might want to install
- the driver patch from my web page. It helps with PC/TCP, and also
- can get arcether to load if it timed out too quickly during
- initialization. In fact, if you use it on a 386+ you REALLY need
- the patch, really.
-
-Windows: See DOS :) Trumpet Winsock works fine with either the Novell or
+DOS:
+ If you're using the freeware arcether.com, you might want to install
+ the driver patch from my web page. It helps with PC/TCP, and also
+ can get arcether to load if it timed out too quickly during
+ initialization. In fact, if you use it on a 386+ you REALLY need
+ the patch, really.
+
+Windows:
+ See DOS :) Trumpet Winsock works fine with either the Novell or
Arcether client, assuming you remember to load winpkt of course.
-LAN Manager and Windows for Workgroups: These programs use protocols that
- are incompatible with the Internet standard. They try to pretend
- the cards are Ethernet, and confuse everyone else on the network.
-
- However, v2.00 and higher of the Linux ARCnet driver supports this
- protocol via the 'arc0e' device. See the section on "Multiprotocol
- Support" for more information.
+LAN Manager and Windows for Workgroups:
+ These programs use protocols that
+ are incompatible with the Internet standard. They try to pretend
+ the cards are Ethernet, and confuse everyone else on the network.
+
+ However, v2.00 and higher of the Linux ARCnet driver supports this
+ protocol via the 'arc0e' device. See the section on "Multiprotocol
+ Support" for more information.
Using the freeware Samba server and clients for Linux, you can now
interface quite nicely with TCP/IP-based WfWg or Lan Manager
networks.
-
-Windows 95: Tools are included with Win95 that let you use either the LANMAN
+
+Windows 95:
+ Tools are included with Win95 that let you use either the LANMAN
style network drivers (NDIS) or Novell drivers (ODI) to handle your
ARCnet packets. If you use ODI, you'll need to use the 'arc0'
- device with Linux. If you use NDIS, then try the 'arc0e' device.
+ device with Linux. If you use NDIS, then try the 'arc0e' device.
See the "Multiprotocol Support" section below if you need arc0e,
you're completely insane, and/or you need to build some kind of
hybrid network that uses both encapsulation types.
-OS/2: I've been told it works under Warp Connect with an ARCnet driver from
+OS/2:
+ I've been told it works under Warp Connect with an ARCnet driver from
SMC. You need to use the 'arc0e' interface for this. If you get
the SMC driver to work with the TCP/IP stuff included in the
"normal" Warp Bonus Pack, let me know.
@@ -295,7 +321,8 @@ OS/2: I've been told it works under Warp Connect with an ARCnet driver from
which should use the same protocol as WfWg does. I had no luck
installing it under Warp, however. Please mail me with any results.
-NetBSD/AmiTCP: These use an old version of the Internet standard ARCnet
+NetBSD/AmiTCP:
+ These use an old version of the Internet standard ARCnet
protocol (RFC1051) which is compatible with the Linux driver v2.10
ALPHA and above using the arc0s device. (See "Multiprotocol ARCnet"
below.) ** Newer versions of NetBSD apparently support RFC1201.
@@ -307,16 +334,17 @@ Using Multiprotocol ARCnet
The ARCnet driver v2.10 ALPHA supports three protocols, each on its own
"virtual network device":
- arc0 - RFC1201 protocol, the official Internet standard which just
- happens to be 100% compatible with Novell's TRXNET driver.
+ ====== ===============================================================
+ arc0 RFC1201 protocol, the official Internet standard which just
+ happens to be 100% compatible with Novell's TRXNET driver.
Version 1.00 of the ARCnet driver supported _only_ this
protocol. arc0 is the fastest of the three protocols (for
whatever reason), and allows larger packets to be used
- because it supports RFC1201 "packet splitting" operations.
+ because it supports RFC1201 "packet splitting" operations.
Unless you have a specific need to use a different protocol,
I strongly suggest that you stick with this one.
-
- arc0e - "Ethernet-Encapsulation" which sends packets over ARCnet
+
+ arc0e "Ethernet-Encapsulation" which sends packets over ARCnet
that are actually a lot like Ethernet packets, including the
6-byte hardware addresses. This protocol is compatible with
Microsoft's NDIS ARCnet driver, like the one in WfWg and
@@ -328,8 +356,8 @@ The ARCnet driver v2.10 ALPHA supports three protocols, each on its own
fit. arc0e also works slightly more slowly than arc0, for
reasons yet to be determined. (Probably it's the smaller
MTU that does it.)
-
- arc0s - The "[s]imple" RFC1051 protocol is the "previous" Internet
+
+ arc0s The "[s]imple" RFC1051 protocol is the "previous" Internet
standard that is completely incompatible with the new
standard. Some software today, however, continues to
support the old standard (and only the old standard)
@@ -338,9 +366,10 @@ The ARCnet driver v2.10 ALPHA supports three protocols, each on its own
smaller than the Internet "requirement," so it's quite
possible that you may run into problems. It's also slower
than RFC1201 by about 25%, for the same reason as arc0e.
-
+
The arc0s support was contributed by Tomasz Motylewski
and modified somewhat by me. Bugs are probably my fault.
+ ====== ===============================================================
You can choose not to compile arc0e and arc0s into the driver if you want -
this will save you a bit of memory and avoid confusion when eg. trying to
@@ -358,19 +387,21 @@ can set up your network then:
two available protocols. As mentioned above, it's a good idea to use
only arc0 unless you have a good reason (like some other software, ie.
WfWg, that only works with arc0e).
-
- If you need only arc0, then the following commands should get you going:
- ifconfig arc0 MY.IP.ADD.RESS
- route add MY.IP.ADD.RESS arc0
- route add -net SUB.NET.ADD.RESS arc0
- [add other local routes here]
-
- If you need arc0e (and only arc0e), it's a little different:
- ifconfig arc0 MY.IP.ADD.RESS
- ifconfig arc0e MY.IP.ADD.RESS
- route add MY.IP.ADD.RESS arc0e
- route add -net SUB.NET.ADD.RESS arc0e
-
+
+ If you need only arc0, then the following commands should get you going::
+
+ ifconfig arc0 MY.IP.ADD.RESS
+ route add MY.IP.ADD.RESS arc0
+ route add -net SUB.NET.ADD.RESS arc0
+ [add other local routes here]
+
+ If you need arc0e (and only arc0e), it's a little different::
+
+ ifconfig arc0 MY.IP.ADD.RESS
+ ifconfig arc0e MY.IP.ADD.RESS
+ route add MY.IP.ADD.RESS arc0e
+ route add -net SUB.NET.ADD.RESS arc0e
+
arc0s works much the same way as arc0e.
@@ -391,29 +422,32 @@ can set up your network then:
XT (patience), however, does not have its own Internet IP address and so
I assigned it one on a "private subnet" (as defined by RFC1597).
- To start with, take a simple network with just insight and freedom.
+ To start with, take a simple network with just insight and freedom.
Insight needs to:
- - talk to freedom via RFC1201 (arc0) protocol, because I like it
+
+ - talk to freedom via RFC1201 (arc0) protocol, because I like it
more and it's faster.
- use freedom as its Internet gateway.
-
- That's pretty easy to do. Set up insight like this:
- ifconfig arc0 insight
- route add insight arc0
- route add freedom arc0 /* I would use the subnet here (like I said
- to to in "single protocol" above),
- but the rest of the subnet
- unfortunately lies across the PPP
- link on freedom, which confuses
- things. */
- route add default gw freedom
-
- And freedom gets configured like so:
- ifconfig arc0 freedom
- route add freedom arc0
- route add insight arc0
- /* and default gateway is configured by pppd */
-
+
+ That's pretty easy to do. Set up insight like this::
+
+ ifconfig arc0 insight
+ route add insight arc0
+ route add freedom arc0 /* I would use the subnet here (like I said
+ to in "single protocol" above),
+ but the rest of the subnet
+ unfortunately lies across the PPP
+ link on freedom, which confuses
+ things. */
+ route add default gw freedom
+
+ And freedom gets configured like so::
+
+ ifconfig arc0 freedom
+ route add freedom arc0
+ route add insight arc0
+ /* and default gateway is configured by pppd */
+
Great, now insight talks to freedom directly on arc0, and sends packets
to the Internet through freedom. If you didn't know how to do the above,
you should probably stop reading this section now because it only gets
@@ -425,7 +459,7 @@ can set up your network then:
Internet. (Recall that patience has a "private IP address" which won't
work on the Internet; that's okay, I configured Linux IP masquerading on
freedom for this subnet).
-
+
So patience (necessarily; I don't have another IP number from my
provider) has an IP address on a different subnet than freedom and
insight, but needs to use freedom as an Internet gateway. Worse, most
@@ -435,53 +469,54 @@ can set up your network then:
insight, patience WILL send through its default gateway, regardless of
the fact that both freedom and insight (courtesy of the arc0e device)
could understand a direct transmission.
-
- I compensate by giving freedom an extra IP address - aliased 'gatekeeper'
- - that is on my private subnet, the same subnet that patience is on. I
+
+ I compensate by giving freedom an extra IP address - aliased 'gatekeeper' -
+ that is on my private subnet, the same subnet that patience is on. I
then define gatekeeper to be the default gateway for patience.
-
- To configure freedom (in addition to the commands above):
- ifconfig arc0e gatekeeper
- route add gatekeeper arc0e
- route add patience arc0e
-
+
+ To configure freedom (in addition to the commands above)::
+
+ ifconfig arc0e gatekeeper
+ route add gatekeeper arc0e
+ route add patience arc0e
+
This way, freedom will send all packets for patience through arc0e,
giving its IP address as gatekeeper (on the private subnet). When it
talks to insight or the Internet, it will use its "freedom" Internet IP
address.
-
- You will notice that we haven't configured the arc0e device on insight.
+
+ You will notice that we haven't configured the arc0e device on insight.
This would work, but is not really necessary, and would require me to
assign insight another special IP number from my private subnet. Since
both insight and patience are using freedom as their default gateway, the
two can already talk to each other.
-
+
It's quite fortunate that I set things up like this the first time (cough
cough) because it's really handy when I boot insight into DOS. There, it
- runs the Novell ODI protocol stack, which only works with RFC1201 ARCnet.
+ runs the Novell ODI protocol stack, which only works with RFC1201 ARCnet.
In this mode it would be impossible for insight to communicate directly
with patience, since the Novell stack is incompatible with Microsoft's
Ethernet-Encap. Without changing any settings on freedom or patience, I
simply set freedom as the default gateway for insight (now in DOS,
remember) and all the forwarding happens "automagically" between the two
hosts that would normally not be able to communicate at all.
-
+
For those who like diagrams, I have created two "virtual subnets" on the
- same physical ARCnet wire. You can picture it like this:
-
-
- [RFC1201 NETWORK] [ETHER-ENCAP NETWORK]
+ same physical ARCnet wire. You can picture it like this::
+
+
+ [RFC1201 NETWORK] [ETHER-ENCAP NETWORK]
(registered Internet subnet) (RFC1597 private subnet)
-
- (IP Masquerade)
- /---------------\ * /---------------\
- | | * | |
- | +-Freedom-*-Gatekeeper-+ |
- | | | * | |
- \-------+-------/ | * \-------+-------/
- | | |
- Insight | Patience
- (Internet)
+
+ (IP Masquerade)
+ /---------------\ * /---------------\
+ | | * | |
+ | +-Freedom-*-Gatekeeper-+ |
+ | | | * | |
+ \-------+-------/ | * \-------+-------/
+ | | |
+ Insight | Patience
+ (Internet)
@@ -491,6 +526,7 @@ It works: what now?
Send mail describing your setup, preferably including driver version, kernel
version, ARCnet card model, CPU type, number of systems on your network, and
list of software in use to me at the following address:
+
apenwarr@worldvisions.ca
I do send (sometimes automated) replies to all messages I receive. My email
@@ -525,7 +561,7 @@ this, you should grab the pertinent RFCs. (some are listed near the top of
arcnet.c). arcdump assumes your card is at 0xD0000. If it isn't, edit the
script.
-Buffers 0 and 1 are used for receiving, and Buffers 2 and 3 are for sending.
+Buffers 0 and 1 are used for receiving, and Buffers 2 and 3 are for sending.
Ping-pong buffers are implemented both ways.
If your debug level includes D_DURING and you did NOT define SLOW_XMIT_COPY,
@@ -535,9 +571,11 @@ decides that the driver is broken). During a transmit, unused parts of the
buffer will be cleared to 0x42 as well. This is to make it easier to figure
out which bytes are being used by a packet.
-You can change the debug level without recompiling the kernel by typing:
+You can change the debug level without recompiling the kernel by typing::
+
ifconfig arc0 down metric 1xxx
/etc/rc.d/rc.inet1
+
where "xxx" is the debug level you want. For example, "metric 1015" would put
you at debug level 15. Debug level 7 is currently the default.
@@ -546,7 +584,7 @@ combination of different debug flags; so debug level 7 is really 1+2+4 or
D_NORMAL+D_EXTRA+D_INIT. To include D_DURING, you would add 16 to this,
resulting in debug level 23.
-If you don't understand that, you probably don't want to know anyway.
+If you don't understand that, you probably don't want to know anyway.
E-mail me about your problem.
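The debug-level arithmetic described in the arcnet text above can be sketched as follows. This is only an illustration of the bitmask addition and of the "metric 1xxx" convention. The text gives the sum 1+2+4 and the value 16 for D_DURING; which bit belongs to which of the first three flag names is assumed from the order in which they are listed, and the helper names are hypothetical.

```python
# Flag values as implied by the text: level 7 = 1 + 2 + 4, and D_DURING adds 16.
# The per-name assignment of 1, 2 and 4 is an assumption based on listing order.
D_NORMAL = 1
D_EXTRA = 2
D_INIT = 4
D_DURING = 16


def debug_level(*flags):
    """Combine individual debug flags into one debug level (bitwise OR)."""
    level = 0
    for flag in flags:
        level |= flag
    return level


def metric_for(level):
    """Return the ifconfig metric that selects a debug level ("metric 1xxx")."""
    return 1000 + level


default_level = debug_level(D_NORMAL, D_EXTRA, D_INIT)
with_during = debug_level(D_NORMAL, D_EXTRA, D_INIT, D_DURING)
print(default_level, with_during, metric_for(15))
```

The last call corresponds to the "metric 1015" example, which selects debug level 15.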
diff --git a/Documentation/networking/atm.txt b/Documentation/networking/atm.rst
index 82921cee77fe..c1df8c038525 100644
--- a/Documentation/networking/atm.txt
+++ b/Documentation/networking/atm.rst
@@ -1,3 +1,9 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===
+ATM
+===
+
In order to use anything but the most primitive functions of ATM,
several user-mode programs are required to assist the kernel. These
programs and related material can be found via the ATM on Linux Web
diff --git a/Documentation/networking/ax25.txt b/Documentation/networking/ax25.rst
index 8257dbf9be57..f060cfb1445a 100644
--- a/Documentation/networking/ax25.txt
+++ b/Documentation/networking/ax25.rst
@@ -1,6 +1,12 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====
+AX.25
+=====
+
To use the amateur radio protocols within Linux you will need to get a
suitable copy of the AX.25 Utilities. More detailed information about
-AX.25, NET/ROM and ROSE, associated programs and and utilities can be
+AX.25, NET/ROM and ROSE, associated programs and utilities can be
found on http://www.linux-ax25.org.
There is an active mailing list for discussing Linux amateur radio matters
diff --git a/Documentation/networking/bareudp.rst b/Documentation/networking/bareudp.rst
new file mode 100644
index 000000000000..b9d04ee6dac1
--- /dev/null
+++ b/Documentation/networking/bareudp.rst
@@ -0,0 +1,58 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================================
+Bare UDP Tunnelling Module Documentation
+========================================
+
+There are various L3 encapsulation standards using UDP being discussed to
+leverage the UDP based load balancing capability of different networks.
+MPLSoUDP (__ https://tools.ietf.org/html/rfc7510) is one among them.
+
+The Bareudp tunnel module provides a generic L3 encapsulation support for
+tunnelling different L3 protocols like MPLS, IP, NSH etc. inside a UDP tunnel.
+
+Special Handling
+----------------
+The bareudp device supports special handling for MPLS & IP as they can have
+multiple ethertypes.
+MPLS protocol can have ethertypes ETH_P_MPLS_UC (unicast) & ETH_P_MPLS_MC (multicast).
+IP protocol can have ethertypes ETH_P_IP (v4) & ETH_P_IPV6 (v6).
+This special handling can be enabled only for ethertypes ETH_P_IP & ETH_P_MPLS_UC
+with a flag called multiproto mode.
+
+Usage
+------
+
+1) Device creation & deletion
+
+ a) ip link add dev bareudp0 type bareudp dstport 6635 ethertype mpls_uc
+
+ This creates a bareudp tunnel device which tunnels L3 traffic with ethertype
+ 0x8847 (MPLS traffic). The destination port of the UDP header will be set to
+ 6635. The device will listen on UDP port 6635 to receive traffic.
+
+ b) ip link delete bareudp0
+
+2) Device creation with multiproto mode enabled
+
+The multiproto mode allows bareudp tunnels to handle several protocols of the
+same family. It is currently only available for IP and MPLS. This mode has to
+be enabled explicitly with the "multiproto" flag.
+
+ a) ip link add dev bareudp0 type bareudp dstport 6635 ethertype ipv4 multiproto
+
+ For an IPv4 tunnel the multiproto mode allows the tunnel to also handle
+ IPv6.
+
+ b) ip link add dev bareudp0 type bareudp dstport 6635 ethertype mpls_uc multiproto
+
+ For MPLS, the multiproto mode allows the tunnel to handle both unicast
+ and multicast MPLS packets.
+
+3) Device Usage
+
+The bareudp device could be used along with OVS or flower filter in TC.
+The OVS or TC flower layer must set the tunnel information in SKB dst field before
+sending packet buffer to the bareudp device for transmission. On reception the
+bareudp device extracts and stores the tunnel information in SKB dst field before
+passing the packet buffer to the network stack.
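The multiproto rules described in bareudp.rst above can be modelled roughly as below. This is a sketch of the documented behaviour, not the driver's code: the function and table names are hypothetical, and the numeric ethertype values are the standard ones from linux/if_ether.h (only 0x8847 is quoted in the text itself).

```python
# Standard ethertype values (linux/if_ether.h).
ETH_P_IP = 0x0800
ETH_P_IPV6 = 0x86DD
ETH_P_MPLS_UC = 0x8847
ETH_P_MPLS_MC = 0x8848

# Extra ethertypes a tunnel accepts when "multiproto" is set, keyed by the
# ethertype the tunnel was created with. Only ipv4 and mpls_uc qualify.
MULTIPROTO_EXTRA = {
    ETH_P_IP: {ETH_P_IPV6},
    ETH_P_MPLS_UC: {ETH_P_MPLS_MC},
}


def accepted_ethertypes(configured, multiproto=False):
    """Return the set of ethertypes a bareudp device would carry."""
    accepted = {configured}
    if multiproto:
        accepted |= MULTIPROTO_EXTRA.get(configured, set())
    return accepted


print(sorted(hex(e) for e in accepted_ethertypes(ETH_P_MPLS_UC, multiproto=True)))
```

Without the flag, a device created with "ethertype ipv4" carries IPv4 only; with it, the same device also carries IPv6.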
diff --git a/Documentation/networking/batman-adv.rst b/Documentation/networking/batman-adv.rst
index 18020943ba25..b85563ea3682 100644
--- a/Documentation/networking/batman-adv.rst
+++ b/Documentation/networking/batman-adv.rst
@@ -73,7 +73,7 @@ lower value. This will make the mesh more responsive to topology changes, but
will also increase the overhead.
Information about the current state can be accessed via the batadv generic
-netlink family. batctl provides human readable version via its debug tables
+netlink family. batctl provides a human readable version via its debug tables
subcommands.
@@ -115,8 +115,8 @@ are prefixed with "batman-adv:" So to see just these messages try::
$ dmesg | grep batman-adv
When investigating problems with your mesh network, it is sometimes necessary to
-see more detail debug messages. This must be enabled when compiling the
-batman-adv module. When building batman-adv as part of kernel, use "make
+see more detailed debug messages. This must be enabled when compiling the
+batman-adv module. When building batman-adv as part of the kernel, use "make
menuconfig" and enable the option ``B.A.T.M.A.N. debugging``
(``CONFIG_BATMAN_ADV_DEBUG=y``).
@@ -157,10 +157,10 @@ Contact
Please send us comments, experiences, questions, anything :)
IRC:
- #batman on irc.freenode.org
+ #batadv on ircs://irc.hackint.org/
Mailing-list:
b.a.t.m.a.n@open-mesh.org (optional subscription at
- https://lists.open-mesh.org/mm/listinfo/b.a.t.m.a.n)
+ https://lists.open-mesh.org/mailman3/postorius/lists/b.a.t.m.a.n.lists.open-mesh.org/)
You can also contact the Authors:
diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.rst
index e3abfbd32f71..96cd7a26f3d9 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.rst
@@ -1,10 +1,15 @@
+.. SPDX-License-Identifier: GPL-2.0
- Linux Ethernet Bonding Driver HOWTO
+===================================
+Linux Ethernet Bonding Driver HOWTO
+===================================
- Latest update: 27 April 2011
+Latest update: 27 April 2011
+
+Initial release: Thomas Davis <tadavis at lbl.gov>
+
+Corrections, HA extensions: 2000/10/03-15:
-Initial release : Thomas Davis <tadavis at lbl.gov>
-Corrections, HA extensions : 2000/10/03-15 :
- Willy Tarreau <willy at meta-x.org>
- Constantine Gavrilov <const-g at xpert.com>
- Chad N. Tindel <ctindel at ieee dot org>
@@ -13,98 +18,98 @@ Corrections, HA extensions : 2000/10/03-15 :
Reorganized and updated Feb 2005 by Jay Vosburgh
Added Sysfs information: 2006/04/24
+
- Mitch Williams <mitch.a.williams at intel.com>
Introduction
============
- The Linux bonding driver provides a method for aggregating
+The Linux bonding driver provides a method for aggregating
multiple network interfaces into a single logical "bonded" interface.
The behavior of the bonded interfaces depends upon the mode; generally
speaking, modes provide either hot standby or load balancing services.
Additionally, link integrity monitoring may be performed.
-
- The bonding driver originally came from Donald Becker's
+
+The bonding driver originally came from Donald Becker's
beowulf patches for kernel 2.0. It has changed quite a bit since, and
the original tools from extreme-linux and beowulf sites will not work
with this version of the driver.
- For new versions of the driver, updated userspace tools, and
+For new versions of the driver, updated userspace tools, and
who to ask for help, please follow the links at the end of this file.
-Table of Contents
-=================
+.. Table of Contents
-1. Bonding Driver Installation
+ 1. Bonding Driver Installation
-2. Bonding Driver Options
+ 2. Bonding Driver Options
-3. Configuring Bonding Devices
-3.1 Configuration with Sysconfig Support
-3.1.1 Using DHCP with Sysconfig
-3.1.2 Configuring Multiple Bonds with Sysconfig
-3.2 Configuration with Initscripts Support
-3.2.1 Using DHCP with Initscripts
-3.2.2 Configuring Multiple Bonds with Initscripts
-3.3 Configuring Bonding Manually with Ifenslave
-3.3.1 Configuring Multiple Bonds Manually
-3.4 Configuring Bonding Manually via Sysfs
-3.5 Configuration with Interfaces Support
-3.6 Overriding Configuration for Special Cases
-3.7 Configuring LACP for 802.3ad mode in a more secure way
+ 3. Configuring Bonding Devices
+ 3.1 Configuration with Sysconfig Support
+ 3.1.1 Using DHCP with Sysconfig
+ 3.1.2 Configuring Multiple Bonds with Sysconfig
+ 3.2 Configuration with Initscripts Support
+ 3.2.1 Using DHCP with Initscripts
+ 3.2.2 Configuring Multiple Bonds with Initscripts
+ 3.3 Configuring Bonding Manually with Ifenslave
+ 3.3.1 Configuring Multiple Bonds Manually
+ 3.4 Configuring Bonding Manually via Sysfs
+ 3.5 Configuration with Interfaces Support
+ 3.6 Overriding Configuration for Special Cases
+ 3.7 Configuring LACP for 802.3ad mode in a more secure way
-4. Querying Bonding Configuration
-4.1 Bonding Configuration
-4.2 Network Configuration
+ 4. Querying Bonding Configuration
+ 4.1 Bonding Configuration
+ 4.2 Network Configuration
-5. Switch Configuration
+ 5. Switch Configuration
-6. 802.1q VLAN Support
+ 6. 802.1q VLAN Support
-7. Link Monitoring
-7.1 ARP Monitor Operation
-7.2 Configuring Multiple ARP Targets
-7.3 MII Monitor Operation
+ 7. Link Monitoring
+ 7.1 ARP Monitor Operation
+ 7.2 Configuring Multiple ARP Targets
+ 7.3 MII Monitor Operation
-8. Potential Trouble Sources
-8.1 Adventures in Routing
-8.2 Ethernet Device Renaming
-8.3 Painfully Slow Or No Failed Link Detection By Miimon
+ 8. Potential Trouble Sources
+ 8.1 Adventures in Routing
+ 8.2 Ethernet Device Renaming
+ 8.3 Painfully Slow Or No Failed Link Detection By Miimon
-9. SNMP agents
+ 9. SNMP agents
-10. Promiscuous mode
+ 10. Promiscuous mode
-11. Configuring Bonding for High Availability
-11.1 High Availability in a Single Switch Topology
-11.2 High Availability in a Multiple Switch Topology
-11.2.1 HA Bonding Mode Selection for Multiple Switch Topology
-11.2.2 HA Link Monitoring for Multiple Switch Topology
+ 11. Configuring Bonding for High Availability
+ 11.1 High Availability in a Single Switch Topology
+ 11.2 High Availability in a Multiple Switch Topology
+ 11.2.1 HA Bonding Mode Selection for Multiple Switch Topology
+ 11.2.2 HA Link Monitoring for Multiple Switch Topology
-12. Configuring Bonding for Maximum Throughput
-12.1 Maximum Throughput in a Single Switch Topology
-12.1.1 MT Bonding Mode Selection for Single Switch Topology
-12.1.2 MT Link Monitoring for Single Switch Topology
-12.2 Maximum Throughput in a Multiple Switch Topology
-12.2.1 MT Bonding Mode Selection for Multiple Switch Topology
-12.2.2 MT Link Monitoring for Multiple Switch Topology
+ 12. Configuring Bonding for Maximum Throughput
+ 12.1 Maximum Throughput in a Single Switch Topology
+ 12.1.1 MT Bonding Mode Selection for Single Switch Topology
+ 12.1.2 MT Link Monitoring for Single Switch Topology
+ 12.2 Maximum Throughput in a Multiple Switch Topology
+ 12.2.1 MT Bonding Mode Selection for Multiple Switch Topology
+ 12.2.2 MT Link Monitoring for Multiple Switch Topology
-13. Switch Behavior Issues
-13.1 Link Establishment and Failover Delays
-13.2 Duplicated Incoming Packets
+ 13. Switch Behavior Issues
+ 13.1 Link Establishment and Failover Delays
+ 13.2 Duplicated Incoming Packets
-14. Hardware Specific Considerations
-14.1 IBM BladeCenter
+ 14. Hardware Specific Considerations
+ 14.1 IBM BladeCenter
-15. Frequently Asked Questions
+ 15. Frequently Asked Questions
-16. Resources and Links
+ 16. Resources and Links
1. Bonding Driver Installation
==============================
- Most popular distro kernels ship with the bonding driver
+Most popular distro kernels ship with the bonding driver
already available as a module. If your distro does not, or you
have need to compile bonding from source (e.g., configuring and
installing a mainline kernel from kernel.org), you'll need to perform
@@ -113,54 +118,54 @@ the following steps:
1.1 Configure and build the kernel with bonding
-----------------------------------------------
- The current version of the bonding driver is available in the
+The current version of the bonding driver is available in the
drivers/net/bonding subdirectory of the most recent kernel source
(which is available on http://kernel.org). Most users "rolling their
own" will want to use the most recent kernel from kernel.org.
- Configure kernel with "make menuconfig" (or "make xconfig" or
+Configure kernel with "make menuconfig" (or "make xconfig" or
"make config"), then select "Bonding driver support" in the "Network
device support" section. It is recommended that you configure the
driver as module since it is currently the only way to pass parameters
to the driver or configure more than one bonding device.
- Build and install the new kernel and modules.
+Build and install the new kernel and modules.
1.2 Bonding Control Utility
--------------------------------------
+---------------------------
- It is recommended to configure bonding via iproute2 (netlink)
+It is recommended to configure bonding via iproute2 (netlink)
or sysfs, the old ifenslave control utility is obsolete.
2. Bonding Driver Options
=========================
- Options for the bonding driver are supplied as parameters to the
+Options for the bonding driver are supplied as parameters to the
bonding module at load time, or are specified via sysfs.
- Module options may be given as command line arguments to the
+Module options may be given as command line arguments to the
insmod or modprobe command, but are usually specified in either the
-/etc/modprobe.d/*.conf configuration files, or in a distro-specific
+``/etc/modprobe.d/*.conf`` configuration files, or in a distro-specific
configuration file (some of which are detailed in the next section).
- Details on bonding support for sysfs is provided in the
+Details on bonding support for sysfs is provided in the
"Configuring Bonding Manually via Sysfs" section, below.
- The available bonding driver parameters are listed below. If a
+The available bonding driver parameters are listed below. If a
parameter is not specified the default value is used. When initially
configuring a bond, it is recommended "tail -f /var/log/messages" be
run in a separate window to watch for bonding driver error messages.
- It is critical that either the miimon or arp_interval and
+It is critical that either the miimon or arp_interval and
arp_ip_target parameters be specified, otherwise serious network
degradation will occur during link failures. Very few devices do not
support at least miimon, so there is really no reason not to use it.
- Options with textual values will accept either the text name
+Options with textual values will accept either the text name
or, for backwards compatibility, the option value. E.g.,
"mode=802.3ad" and "mode=4" set the same mode.
- The parameters are as follows:
+The parameters are as follows:
active_slave
@@ -191,11 +196,12 @@ ad_actor_sys_prio
ad_actor_system
In an AD system, this specifies the mac-address for the actor in
- protocol packet exchanges (LACPDUs). The value cannot be NULL or
- multicast. It is preferred to have the local-admin bit set for this
- mac but driver does not enforce it. If the value is not given then
- system defaults to using the masters' mac address as actors' system
- address.
+ protocol packet exchanges (LACPDUs). The value cannot be a multicast
+ address. If the all-zeroes MAC is specified, bonding will internally
+ use the MAC of the bond itself. It is preferred to have the
+ local-admin bit set for this mac but driver does not enforce it. If
+ the value is not given then system defaults to using the masters'
+ mac address as actors' system address.
This parameter has effect only in 802.3ad mode and is available through
SysFs interface.
@@ -246,10 +252,13 @@ ad_user_port_key
In an AD system, the port-key has three parts as shown below -
+ ===== ============
Bits Use
+ ===== ============
00 Duplex
01-05 Speed
06-15 User-defined
+ ===== ============
This defines the upper 10 bits of the port key. The values can be
from 0 - 1023. If not given, the system defaults to 0.
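The port-key layout in the table above (bit 0 duplex, bits 1-5 speed, bits 6-15 user-defined) can be illustrated with a small packing helper. This is a sketch only: the function name is hypothetical, and the speed field is treated as an opaque 5-bit number because its encoding is not described in this section.

```python
def compose_port_key(user_key, speed_bits, duplex_bit):
    """Pack the three port-key fields into a single 16-bit value.

    user_key   -> bits 6-15 (the ad_user_port_key value, 0 - 1023)
    speed_bits -> bits 1-5  (opaque 5-bit field, 0 - 31)
    duplex_bit -> bit 0     (0 or 1)
    """
    if not 0 <= user_key <= 1023:
        raise ValueError("user_key must be in the range 0 - 1023")
    if not 0 <= speed_bits <= 31:
        raise ValueError("speed_bits must fit in 5 bits")
    if duplex_bit not in (0, 1):
        raise ValueError("duplex_bit must be 0 or 1")
    return (user_key << 6) | (speed_bits << 1) | duplex_bit


print(hex(compose_port_key(1023, 0, 0)))
```

The largest user-defined value, 1023, fills exactly the upper 10 bits, which is why the documented range stops there.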
@@ -304,6 +313,17 @@ arp_ip_target
maximum number of targets that can be specified is 16. The
default value is no IP addresses.
+ns_ip6_target
+
+ Specifies the IPv6 addresses to use as IPv6 monitoring peers when
+ arp_interval is > 0. These are the targets of the NS request
+ sent to determine the health of the link to the targets.
+ Specify these values in ffff:ffff::ffff:ffff format. Multiple IPv6
+ addresses must be separated by a comma. At least one IPv6
+ address must be given for NS/NA monitoring to function. The
+ maximum number of targets that can be specified is 16. The
+ default value is no IPv6 addresses.
+
arp_validate
Specifies whether or not ARP probes and replies should be
@@ -413,6 +433,17 @@ arp_all_targets
consider the slave up only when all of the arp_ip_targets
are reachable
+arp_missed_max
+
+ Specifies the number of arp_interval monitor checks that must
+ fail in order for an interface to be marked down by the ARP monitor.
+
+ In order to provide orderly failover semantics, backup interfaces
+ are permitted an extra monitor check (i.e., they must fail
+ arp_missed_max + 1 times before being marked down).
+
+ The default value is 2, and the allowable range is 1 - 255.
+
downdelay
Specifies the time, in milliseconds, to wait before disabling
@@ -493,6 +524,18 @@ fail_over_mac
This option was added in bonding version 3.2.0. The "follow"
policy was added in bonding version 3.3.0.
+lacp_active
+ Option specifying whether to send LACPDU frames periodically.
+
+ off or 0
+ LACPDU frames act as "speak when spoken to".
+
+ on or 1
+ LACPDU frames are sent along the configured links
+ periodically. See lacp_rate for more details.
+
+ The default is on.
+
lacp_rate
Option specifying the rate in which we'll ask our link partner
@@ -699,7 +742,7 @@ mode
swapped with the new curr_active_slave that was
chosen.
-num_grat_arp
+num_grat_arp,
num_unsol_na
Specify the number of peer notifications (gratuitous ARPs and
@@ -729,13 +772,24 @@ packets_per_slave
peer_notif_delay
- Specify the delay, in milliseconds, between each peer
- notification (gratuitous ARP and unsolicited IPv6 Neighbor
- Advertisement) when they are issued after a failover event.
- This delay should be a multiple of the link monitor interval
- (arp_interval or miimon, whichever is active). The default
- value is 0 which means to match the value of the link monitor
- interval.
+ Specify the delay, in milliseconds, between each peer
+ notification (gratuitous ARP and unsolicited IPv6 Neighbor
+ Advertisement) when they are issued after a failover event.
+ This delay should be a multiple of the link monitor interval
+ (arp_interval or miimon, whichever is active). The default
+ value is 0 which means to match the value of the link monitor
+ interval.
+
+prio
+ Slave priority. A higher number means higher priority.
+ The primary slave has the highest priority. This option also
+ follows the primary_reselect rules.
+
+ This option can only be configured via netlink, and is only valid
+ for active-backup (1), balance-tlb (5) and balance-alb (6) mode.
+ The valid value range is a signed 32 bit integer.
+
+ The default value is 0.
primary
@@ -792,7 +846,7 @@ primary_reselect
tlb_dynamic_lb
Specifies if dynamic shuffling of flows is enabled in tlb
- mode. The value has no effect on any other modes.
+ or alb mode. The value has no effect on any other modes.
The default behavior of tlb mode is to shuffle active flows across
slaves based on the load in that interval. This gives nice lb
@@ -851,7 +905,7 @@ xmit_hash_policy
Uses XOR of hardware MAC addresses and packet type ID
field to generate the hash. The formula is
- hash = source MAC XOR destination MAC XOR packet type ID
+ hash = source MAC[5] XOR destination MAC[5] XOR packet type ID
slave number = hash modulo slave count
This algorithm will place all traffic to a particular
@@ -867,7 +921,7 @@ xmit_hash_policy
Uses XOR of hardware MAC addresses and IP addresses to
generate the hash. The formula is
- hash = source MAC XOR destination MAC XOR packet type ID
+ hash = source MAC[5] XOR destination MAC[5] XOR packet type ID
hash = hash XOR source IP XOR destination IP
hash = hash XOR (hash RSHIFT 16)
hash = hash XOR (hash RSHIFT 8)
@@ -943,6 +997,19 @@ xmit_hash_policy
packets will be distributed according to the encapsulated
flows.
+ vlan+srcmac
+
+ This policy uses a very rudimentary vlan ID and source mac
+ hash to load-balance traffic per-vlan, with failover
+ should one leg fail. The intended use case is for a bond
+ shared by multiple virtual machines, all configured to
+ use their own vlan, to give lacp-like functionality
+ without requiring lacp-capable switching hardware.
+
+ The formula for the hash is simply
+
+ hash = (vlan ID) XOR (source MAC vendor) XOR (source MAC dev)
+
The default value is layer2. This option was added in bonding
version 2.6.3. In earlier versions of bonding, this parameter
does not exist, and the layer2 policy is the only policy. The
@@ -977,88 +1044,88 @@ lp_interval
3. Configuring Bonding Devices
==============================
- You can configure bonding using either your distro's network
+You can configure bonding using either your distro's network
initialization scripts, or manually using either iproute2 or the
sysfs interface. Distros generally use one of three packages for the
network initialization scripts: initscripts, sysconfig or interfaces.
Recent versions of these packages have support for bonding, while older
versions do not.
- We will first describe the options for configuring bonding for
+We will first describe the options for configuring bonding for
distros using versions of initscripts, sysconfig and interfaces with full
or partial support for bonding, then provide information on enabling
bonding without support from the network initialization scripts (i.e.,
older versions of initscripts or sysconfig).
- If you're unsure whether your distro uses sysconfig,
+If you're unsure whether your distro uses sysconfig,
initscripts or interfaces, or don't know if it's new enough, have no fear.
Determining this is fairly straightforward.
- First, look for a file called interfaces in /etc/network directory.
+First, look for a file called interfaces in /etc/network directory.
If this file is present in your system, then your system uses interfaces. See
Configuration with Interfaces Support.
- Else, issue the command:
+Else, issue the command::
-$ rpm -qf /sbin/ifup
+ $ rpm -qf /sbin/ifup
- It will respond with a line of text starting with either
+It will respond with a line of text starting with either
"initscripts" or "sysconfig," followed by some numbers. This is the
package that provides your network initialization scripts.
- Next, to determine if your installation supports bonding,
-issue the command:
+Next, to determine if your installation supports bonding,
+issue the command::
-$ grep ifenslave /sbin/ifup
+ $ grep ifenslave /sbin/ifup
- If this returns any matches, then your initscripts or
+If this returns any matches, then your initscripts or
sysconfig has support for bonding.
3.1 Configuration with Sysconfig Support
----------------------------------------
- This section applies to distros using a version of sysconfig
+This section applies to distros using a version of sysconfig
with bonding support, for example, SuSE Linux Enterprise Server 9.
- SuSE SLES 9's networking configuration system does support
+SuSE SLES 9's networking configuration system does support
bonding, however, at this writing, the YaST system configuration
front end does not provide any means to work with bonding devices.
Bonding devices can be managed by hand, however, as follows.
- First, if they have not already been configured, configure the
+First, if they have not already been configured, configure the
slave devices. On SLES 9, this is most easily done by running the
yast2 sysconfig configuration utility. The goal is to create an
ifcfg-id file for each slave device. The simplest way to accomplish
this is to configure the devices for DHCP (this is only to get the
ifcfg-id file created; see below for some issues with DHCP). The
-name of the configuration file for each device will be of the form:
+name of the configuration file for each device will be of the form::
-ifcfg-id-xx:xx:xx:xx:xx:xx
+ ifcfg-id-xx:xx:xx:xx:xx:xx
- Where the "xx" portion will be replaced with the digits from
+Where the "xx" portion will be replaced with the digits from
the device's permanent MAC address.
- Once the set of ifcfg-id-xx:xx:xx:xx:xx:xx files has been
+Once the set of ifcfg-id-xx:xx:xx:xx:xx:xx files has been
created, it is necessary to edit the configuration files for the slave
devices (the MAC addresses correspond to those of the slave devices).
Before editing, the file will contain multiple lines, and will look
-something like this:
+something like this::
-BOOTPROTO='dhcp'
-STARTMODE='on'
-USERCTL='no'
-UNIQUE='XNzu.WeZGOGF+4wE'
-_nm_name='bus-pci-0001:61:01.0'
+ BOOTPROTO='dhcp'
+ STARTMODE='on'
+ USERCTL='no'
+ UNIQUE='XNzu.WeZGOGF+4wE'
+ _nm_name='bus-pci-0001:61:01.0'
- Change the BOOTPROTO and STARTMODE lines to the following:
+Change the BOOTPROTO and STARTMODE lines to the following::
-BOOTPROTO='none'
-STARTMODE='off'
+ BOOTPROTO='none'
+ STARTMODE='off'
- Do not alter the UNIQUE or _nm_name lines. Remove any other
+Do not alter the UNIQUE or _nm_name lines. Remove any other
lines (USERCTL, etc).
- Once the ifcfg-id-xx:xx:xx:xx:xx:xx files have been modified,
+Once the ifcfg-id-xx:xx:xx:xx:xx:xx files have been modified,
it's time to create the configuration file for the bonding device
itself. This file is named ifcfg-bondX, where X is the number of the
bonding device to create, starting at 0. The first such file is
@@ -1066,49 +1133,52 @@ ifcfg-bond0, the second is ifcfg-bond1, and so on. The sysconfig
network configuration system will correctly start multiple instances
of bonding.
- The contents of the ifcfg-bondX file is as follows:
-
-BOOTPROTO="static"
-BROADCAST="10.0.2.255"
-IPADDR="10.0.2.10"
-NETMASK="255.255.0.0"
-NETWORK="10.0.2.0"
-REMOTE_IPADDR=""
-STARTMODE="onboot"
-BONDING_MASTER="yes"
-BONDING_MODULE_OPTS="mode=active-backup miimon=100"
-BONDING_SLAVE0="eth0"
-BONDING_SLAVE1="bus-pci-0000:06:08.1"
-
- Replace the sample BROADCAST, IPADDR, NETMASK and NETWORK
+The contents of the ifcfg-bondX file are as follows::
+
+ BOOTPROTO="static"
+ BROADCAST="10.0.2.255"
+ IPADDR="10.0.2.10"
+ NETMASK="255.255.0.0"
+ NETWORK="10.0.2.0"
+ REMOTE_IPADDR=""
+ STARTMODE="onboot"
+ BONDING_MASTER="yes"
+ BONDING_MODULE_OPTS="mode=active-backup miimon=100"
+ BONDING_SLAVE0="eth0"
+ BONDING_SLAVE1="bus-pci-0000:06:08.1"
+
+Replace the sample BROADCAST, IPADDR, NETMASK and NETWORK
values with the appropriate values for your network.
- The STARTMODE specifies when the device is brought online.
+The STARTMODE specifies when the device is brought online.
The possible values are:
- onboot: The device is started at boot time. If you're not
+ ======== ======================================================
+ onboot The device is started at boot time. If you're not
sure, this is probably what you want.
- manual: The device is started only when ifup is called
+ manual The device is started only when ifup is called
manually. Bonding devices may be configured this
way if you do not wish them to start automatically
at boot for some reason.
- hotplug: The device is started by a hotplug event. This is not
+ hotplug The device is started by a hotplug event. This is not
a valid choice for a bonding device.
- off or ignore: The device configuration is ignored.
+ off or The device configuration is ignored.
+ ignore
+ ======== ======================================================
- The line BONDING_MASTER='yes' indicates that the device is a
+The line BONDING_MASTER='yes' indicates that the device is a
bonding master device. The only useful value is "yes."
- The contents of BONDING_MODULE_OPTS are supplied to the
+The contents of BONDING_MODULE_OPTS are supplied to the
instance of the bonding module for this device. Specify the options
for the bonding mode, link monitoring, and so on here. Do not include
the max_bonds bonding parameter; this will confuse the configuration
system if you have multiple bonding devices.
- Finally, supply one BONDING_SLAVEn="slave device" for each
+Finally, supply one BONDING_SLAVEn="slave device" for each
slave, where "n" is an increasing value, one for each slave. The
"slave device" is either an interface name, e.g., "eth0", or a device
specifier for the network device. The interface name is easier to
@@ -1120,34 +1190,34 @@ changes (for example, it is moved from one PCI slot to another). The
example above uses one of each type for demonstration purposes; most
configurations will choose one or the other for all slave devices.
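
The ifcfg-bondX layout described above can be sketched as a small generator. This is only an illustration, not part of the sysconfig tooling: the function name ``render_ifcfg_bond`` and its parameters are hypothetical, and the checks simply mirror the rules stated in the text (no max_bonds in BONDING_MODULE_OPTS, hotplug is not valid for a bonding device, one BONDING_SLAVEn line per slave).

```python
def render_ifcfg_bond(ipaddr, netmask, network, broadcast, module_opts, slaves,
                      startmode="onboot"):
    """Return the text of an ifcfg-bondX file in the format shown above.

    Hypothetical helper for illustration; sysconfig itself ships no such API.
    """
    # STARTMODE values listed in the text; "hotplug" is documented as
    # not being a valid choice for a bonding device.
    if startmode not in ("onboot", "manual", "off", "ignore"):
        raise ValueError("unsupported STARTMODE for a bond: %r" % startmode)
    # The text warns that max_bonds confuses the configuration system.
    if "max_bonds" in module_opts:
        raise ValueError("do not pass max_bonds in BONDING_MODULE_OPTS")

    lines = [
        'BOOTPROTO="static"',
        'BROADCAST="%s"' % broadcast,
        'IPADDR="%s"' % ipaddr,
        'NETMASK="%s"' % netmask,
        'NETWORK="%s"' % network,
        'REMOTE_IPADDR=""',
        'STARTMODE="%s"' % startmode,
        'BONDING_MASTER="yes"',
        'BONDING_MODULE_OPTS="%s"' % module_opts,
    ]
    # One BONDING_SLAVEn entry per slave, with n increasing from 0.
    for n, slave in enumerate(slaves):
        lines.append('BONDING_SLAVE%d="%s"' % (n, slave))
    return "\n".join(lines) + "\n"
```

Calling it with the sample values from the example file and the slave list ``["eth0", "bus-pci-0000:06:08.1"]`` reproduces that file's contents line for line.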
- When all configuration files have been modified or created,
+When all configuration files have been modified or created,
networking must be restarted for the configuration changes to take
-effect. This can be accomplished via the following:
+effect. This can be accomplished via the following::
-# /etc/init.d/network restart
+ # /etc/init.d/network restart
- Note that the network control script (/sbin/ifdown) will
+Note that the network control script (/sbin/ifdown) will
remove the bonding module as part of the network shutdown processing,
so it is not necessary to remove the module by hand if, e.g., the
module parameters have changed.
- Also, at this writing, YaST/YaST2 will not manage bonding
+Also, at this writing, YaST/YaST2 will not manage bonding
devices (it does not show bonding interfaces in its list of network
devices). It is necessary to edit the configuration file by hand to
change the bonding configuration.
- Additional general options and details of the ifcfg file
-format can be found in an example ifcfg template file:
+Additional general options and details of the ifcfg file
+format can be found in an example ifcfg template file::
-/etc/sysconfig/network/ifcfg.template
+ /etc/sysconfig/network/ifcfg.template
- Note that the template does not document the various BONDING_
+Note that the template does not document the various ``BONDING_*``
settings described above, but does describe many of the other options.
3.1.1 Using DHCP with Sysconfig
-------------------------------
- Under sysconfig, configuring a device with BOOTPROTO='dhcp'
+Under sysconfig, configuring a device with BOOTPROTO='dhcp'
will cause it to query DHCP for its IP address information. At this
writing, this does not function for bonding devices; the scripts
attempt to obtain the device address from DHCP prior to adding any of
@@ -1157,7 +1227,7 @@ sent to the network.
3.1.2 Configuring Multiple Bonds with Sysconfig
-----------------------------------------------
- The sysconfig network initialization system is capable of
+The sysconfig network initialization system is capable of
handling multiple bonding devices. All that is necessary is for each
bonding instance to have an appropriately configured ifcfg-bondX file
(as described above). Do not specify the "max_bonds" parameter to any
@@ -1165,14 +1235,14 @@ instance of bonding, as this will confuse sysconfig. If you require
multiple bonding devices with identical parameters, create multiple
ifcfg-bondX files.
- Because the sysconfig scripts supply the bonding module
+Because the sysconfig scripts supply the bonding module
options in the ifcfg-bondX file, it is not necessary to add them to
-the system /etc/modules.d/*.conf configuration files.
+the system ``/etc/modules.d/*.conf`` configuration files.
3.2 Configuration with Initscripts Support
------------------------------------------
- This section applies to distros using a recent version of
+This section applies to distros using a recent version of
initscripts with bonding support, for example, Red Hat Enterprise Linux
version 3 or later, Fedora, etc. On these systems, the network
initialization scripts have knowledge of bonding, and can be configured to
@@ -1180,7 +1250,7 @@ control bonding devices. Note that older versions of the initscripts
package have lower levels of support for bonding; this will be noted where
applicable.
- These distros will not automatically load the network adapter
+These distros will not automatically load the network adapter
driver unless the ethX device is configured with an IP address.
Because of this constraint, users must manually configure a
network-script file for all physical adapters that will be members of
@@ -1188,19 +1258,19 @@ a bondX link. Network script files are located in the directory:
/etc/sysconfig/network-scripts
- The file name must be prefixed with "ifcfg-eth" and suffixed
+The file name must be prefixed with "ifcfg-eth" and suffixed
with the adapter's physical adapter number. For example, the script
for eth0 would be named /etc/sysconfig/network-scripts/ifcfg-eth0.
-Place the following text in the file:
+Place the following text in the file::
-DEVICE=eth0
-USERCTL=no
-ONBOOT=yes
-MASTER=bond0
-SLAVE=yes
-BOOTPROTO=none
+ DEVICE=eth0
+ USERCTL=no
+ ONBOOT=yes
+ MASTER=bond0
+ SLAVE=yes
+ BOOTPROTO=none
- The DEVICE= line will be different for every ethX device and
+The DEVICE= line will be different for every ethX device and
must correspond with the name of the file, i.e., ifcfg-eth1 must have
a device line of DEVICE=eth1. The setting of the MASTER= line will
also depend on the final bonding interface name chosen for your bond.
@@ -1208,69 +1278,70 @@ As with other network devices, these typically start at 0, and go up
one for each device, i.e., the first bonding instance is bond0, the
second is bond1, and so on.
- Next, create a bond network script. The file name for this
+Next, create a bond network script. The file name for this
script will be /etc/sysconfig/network-scripts/ifcfg-bondX where X is
the number of the bond. For bond0 the file is named "ifcfg-bond0",
for bond1 it is named "ifcfg-bond1", and so on. Within that file,
-place the following text:
-
-DEVICE=bond0
-IPADDR=192.168.1.1
-NETMASK=255.255.255.0
-NETWORK=192.168.1.0
-BROADCAST=192.168.1.255
-ONBOOT=yes
-BOOTPROTO=none
-USERCTL=no
-
- Be sure to change the networking specific lines (IPADDR,
+place the following text::
+
+ DEVICE=bond0
+ IPADDR=192.168.1.1
+ NETMASK=255.255.255.0
+ NETWORK=192.168.1.0
+ BROADCAST=192.168.1.255
+ ONBOOT=yes
+ BOOTPROTO=none
+ USERCTL=no
+
+Be sure to change the networking specific lines (IPADDR,
NETMASK, NETWORK and BROADCAST) to match your network configuration.
- For later versions of initscripts, such as that found with Fedora
+For later versions of initscripts, such as that found with Fedora
7 (or later) and Red Hat Enterprise Linux version 5 (or later), it is possible,
and, indeed, preferable, to specify the bonding options in the ifcfg-bond0
-file, e.g. a line of the format:
+file, e.g. a line of the format::
-BONDING_OPTS="mode=active-backup arp_interval=60 arp_ip_target=192.168.1.254"
+ BONDING_OPTS="mode=active-backup arp_interval=60 arp_ip_target=192.168.1.254"
- will configure the bond with the specified options. The options
+will configure the bond with the specified options. The options
specified in BONDING_OPTS are identical to the bonding module parameters
except for the arp_ip_target field when using versions of initscripts older
than 8.57 (Fedora 8) and 8.45.19 (Red Hat Enterprise Linux 5.2). When
using older versions each target should be included as a separate option and
should be preceded by a '+' to indicate it should be added to the list of
-queried targets, e.g.,
+queried targets, e.g.,::
- arp_ip_target=+192.168.1.1 arp_ip_target=+192.168.1.2
+ arp_ip_target=+192.168.1.1 arp_ip_target=+192.168.1.2
- is the proper syntax to specify multiple targets. When specifying
-options via BONDING_OPTS, it is not necessary to edit /etc/modprobe.d/*.conf.
+is the proper syntax to specify multiple targets. When specifying
+options via BONDING_OPTS, it is not necessary to edit
+``/etc/modprobe.d/*.conf``.
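
The difference between the two arp_ip_target syntaxes can be sketched as follows. The helper name ``bonding_opts`` and its ``legacy`` flag are hypothetical. The non-legacy branch assumes the usual bonding module parameter form, in which multiple targets are joined with commas; that form is not shown in this section, so treat it as an assumption.

```python
def bonding_opts(mode, arp_interval, targets, legacy=False):
    """Build a BONDING_OPTS line for an ifcfg-bondX file.

    legacy=True emits the syntax described above for older initscripts:
    one arp_ip_target option per target, each preceded by '+'.
    legacy=False assumes the module parameter syntax (comma-separated).
    """
    parts = ["mode=%s" % mode, "arp_interval=%d" % arp_interval]
    if legacy:
        parts.extend("arp_ip_target=+%s" % t for t in targets)
    else:
        parts.append("arp_ip_target=" + ",".join(targets))
    return 'BONDING_OPTS="%s"' % " ".join(parts)
```

With a single target and ``legacy=False`` this yields the same line as the example above; with ``legacy=True`` and two targets it yields the ``+``-prefixed form.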
- For even older versions of initscripts that do not support
+For even older versions of initscripts that do not support
BONDING_OPTS, it is necessary to edit /etc/modprobe.d/*.conf (depending upon
your distro) to load the bonding module with your desired options when the
bond0 interface is brought up. The following lines in /etc/modprobe.d/*.conf
will load the bonding module, and select its options:
-alias bond0 bonding
-options bond0 mode=balance-alb miimon=100
+ alias bond0 bonding
+ options bond0 mode=balance-alb miimon=100
- Replace the sample parameters with the appropriate set of
+Replace the sample parameters with the appropriate set of
options for your configuration.
- Finally run "/etc/rc.d/init.d/network restart" as root. This
+Finally run "/etc/rc.d/init.d/network restart" as root. This
will restart the networking subsystem and your bond link should be now
up and running.
3.2.1 Using DHCP with Initscripts
---------------------------------
- Recent versions of initscripts (the versions supplied with Fedora
+Recent versions of initscripts (the versions supplied with Fedora
Core 3 and Red Hat Enterprise Linux 4, or later versions, are reported to
work) have support for assigning IP information to bonding devices via
DHCP.
- To configure bonding for DHCP, configure it as described
+To configure bonding for DHCP, configure it as described
above, except replace the line "BOOTPROTO=none" with "BOOTPROTO=dhcp"
and add a line consisting of "TYPE=Bonding". Note that the TYPE value
is case sensitive.
@@ -1278,7 +1349,7 @@ is case sensitive.
3.2.2 Configuring Multiple Bonds with Initscripts
-------------------------------------------------
- Initscripts packages that are included with Fedora 7 and Red Hat
+Initscripts packages that are included with Fedora 7 and Red Hat
Enterprise Linux 5 support multiple bonding interfaces by simply
specifying the appropriate BONDING_OPTS= in ifcfg-bondX where X is the
number of the bond. This support requires sysfs support in the kernel,
@@ -1290,77 +1361,77 @@ below.
3.3 Configuring Bonding Manually with iproute2
-----------------------------------------------
- This section applies to distros whose network initialization
+This section applies to distros whose network initialization
scripts (the sysconfig or initscripts package) do not have specific
knowledge of bonding. One such distro is SuSE Linux Enterprise Server
version 8.
- The general method for these systems is to place the bonding
+The general method for these systems is to place the bonding
module parameters into a config file in /etc/modprobe.d/ (as
appropriate for the installed distro), then add modprobe and/or
`ip link` commands to the system's global init script. The name of
the global init script differs; for sysconfig, it is
/etc/init.d/boot.local and for initscripts it is /etc/rc.d/rc.local.
- For example, if you wanted to make a simple bond of two e100
+For example, if you wanted to make a simple bond of two e100
devices (presumed to be eth0 and eth1), and have it persist across
reboots, edit the appropriate file (/etc/init.d/boot.local or
-/etc/rc.d/rc.local), and add the following:
+/etc/rc.d/rc.local), and add the following::
-modprobe bonding mode=balance-alb miimon=100
-modprobe e100
-ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up
-ip link set eth0 master bond0
-ip link set eth1 master bond0
+ modprobe bonding mode=balance-alb miimon=100
+ modprobe e100
+ ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up
+ ip link set eth0 master bond0
+ ip link set eth1 master bond0
- Replace the example bonding module parameters and bond0
+Replace the example bonding module parameters and bond0
network configuration (IP address, netmask, etc) with the appropriate
values for your configuration.
- Unfortunately, this method will not provide support for the
+Unfortunately, this method will not provide support for the
ifup and ifdown scripts on the bond devices. To reload the bonding
-configuration, it is necessary to run the initialization script, e.g.,
+configuration, it is necessary to run the initialization script, e.g.,::
-# /etc/init.d/boot.local
+ # /etc/init.d/boot.local
- or
+or::
-# /etc/rc.d/rc.local
+ # /etc/rc.d/rc.local
- It may be desirable in such a case to create a separate script
+It may be desirable in such a case to create a separate script
which only initializes the bonding configuration, then call that
separate script from within boot.local. This allows for bonding to be
enabled without re-running the entire global init script.
- To shut down the bonding devices, it is necessary to first
+To shut down the bonding devices, it is necessary to first
mark the bonding device itself as being down, then remove the
appropriate device driver modules. For our example above, you can do
-the following:
+the following::
-# ifconfig bond0 down
-# rmmod bonding
-# rmmod e100
+ # ifconfig bond0 down
+ # rmmod bonding
+ # rmmod e100
- Again, for convenience, it may be desirable to create a script
+Again, for convenience, it may be desirable to create a script
with these commands.
3.3.1 Configuring Multiple Bonds Manually
-----------------------------------------
- This section contains information on configuring multiple
+This section contains information on configuring multiple
bonding devices with differing options for those systems whose network
initialization scripts lack support for configuring multiple bonds.
- If you require multiple bonding devices, but all with the same
+If you require multiple bonding devices, but all with the same
options, you may wish to use the "max_bonds" module parameter,
documented above.
- To create multiple bonding devices with differing options, it is
+To create multiple bonding devices with differing options, it is
preferable to use bonding parameters exported by sysfs, documented in the
section below.
- For versions of bonding without sysfs support, the only means to
+For versions of bonding without sysfs support, the only means to
provide multiple instances of bonding with differing options is to load
the bonding driver multiple times. Note that current versions of the
sysconfig network initialization scripts handle this automatically; if
@@ -1368,35 +1439,35 @@ your distro uses these scripts, no special action is needed. See the
section Configuring Bonding Devices, above, if you're not sure about your
network initialization scripts.
- To load multiple instances of the module, it is necessary to
+To load multiple instances of the module, it is necessary to
specify a different name for each instance (the module loading system
requires that every loaded module, even multiple instances of the same
module, have a unique name). This is accomplished by supplying multiple
-sets of bonding options in /etc/modprobe.d/*.conf, for example:
+sets of bonding options in ``/etc/modprobe.d/*.conf``, for example::
-alias bond0 bonding
-options bond0 -o bond0 mode=balance-rr miimon=100
+ alias bond0 bonding
+ options bond0 -o bond0 mode=balance-rr miimon=100
-alias bond1 bonding
-options bond1 -o bond1 mode=balance-alb miimon=50
+ alias bond1 bonding
+ options bond1 -o bond1 mode=balance-alb miimon=50
- will load the bonding module two times. The first instance is
+will load the bonding module two times. The first instance is
named "bond0" and creates the bond0 device in balance-rr mode with an
miimon of 100. The second instance is named "bond1" and creates the
bond1 device in balance-alb mode with an miimon of 50.
- In some circumstances (typically with older distributions),
+In some circumstances (typically with older distributions),
the above does not work, and the second bonding instance never sees
its options. In that case, the second options line can be substituted
-as follows:
+as follows::
-install bond1 /sbin/modprobe --ignore-install bonding -o bond1 \
- mode=balance-alb miimon=50
+ install bond1 /sbin/modprobe --ignore-install bonding -o bond1 \
+ mode=balance-alb miimon=50
- This may be repeated any number of times, specifying a new and
+This may be repeated any number of times, specifying a new and
unique name in place of bond1 for each subsequent instance.
- It has been observed that some Red Hat supplied kernels are unable
+It has been observed that some Red Hat supplied kernels are unable
to rename modules at load time (the "-o bond1" part). Attempts to pass
that option to modprobe will produce an "Operation not permitted" error.
This has been reported on some Fedora Core kernels, and has been seen on
@@ -1407,18 +1478,18 @@ kernels, and also lack sysfs support).
3.4 Configuring Bonding Manually via Sysfs
------------------------------------------
- Starting with version 3.0.0, Channel Bonding may be configured
+Starting with version 3.0.0, Channel Bonding may be configured
via the sysfs interface. This interface allows dynamic configuration
of all bonds in the system without unloading the module. It also
allows for adding and removing bonds at runtime. Ifenslave is no
longer required, though it is still supported.
- Use of the sysfs interface allows you to use multiple bonds
+Use of the sysfs interface allows you to use multiple bonds
with different configurations without having to reload the module.
It also allows you to use multiple, differently configured bonds when
bonding is compiled into the kernel.
- You must have the sysfs filesystem mounted to configure
+You must have the sysfs filesystem mounted to configure
bonding this way. The examples in this document assume that you
are using the standard mount point for sysfs, e.g. /sys. If your
sysfs filesystem is mounted elsewhere, you will need to adjust the
@@ -1426,38 +1497,45 @@ example paths accordingly.
Creating and Destroying Bonds
-----------------------------
-To add a new bond foo:
-# echo +foo > /sys/class/net/bonding_masters
+To add a new bond foo::
+
+ # echo +foo > /sys/class/net/bonding_masters
+
+To remove an existing bond bar::
-To remove an existing bond bar:
-# echo -bar > /sys/class/net/bonding_masters
+ # echo -bar > /sys/class/net/bonding_masters
-To show all existing bonds:
-# cat /sys/class/net/bonding_masters
+To show all existing bonds::
-NOTE: due to 4K size limitation of sysfs files, this list may be
-truncated if you have more than a few hundred bonds. This is unlikely
-to occur under normal operating conditions.
+ # cat /sys/class/net/bonding_masters
+
+.. note::
+
+ Due to the 4K size limitation of sysfs files, this list may be
+ truncated if you have more than a few hundred bonds. This is unlikely
+ to occur under normal operating conditions.
Adding and Removing Slaves
--------------------------
- Interfaces may be enslaved to a bond using the file
+Interfaces may be enslaved to a bond using the file
/sys/class/net/<bond>/bonding/slaves. The semantics for this file
are the same as for the bonding_masters file.
-To enslave interface eth0 to bond bond0:
-# ifconfig bond0 up
-# echo +eth0 > /sys/class/net/bond0/bonding/slaves
+To enslave interface eth0 to bond bond0::
+
+ # ifconfig bond0 up
+ # echo +eth0 > /sys/class/net/bond0/bonding/slaves
-To free slave eth0 from bond bond0:
-# echo -eth0 > /sys/class/net/bond0/bonding/slaves
+To free slave eth0 from bond bond0::
- When an interface is enslaved to a bond, symlinks between the
+ # echo -eth0 > /sys/class/net/bond0/bonding/slaves
+
+When an interface is enslaved to a bond, symlinks between the
two are created in the sysfs filesystem. In this case, you would get
/sys/class/net/bond0/slave_eth0 pointing to /sys/class/net/eth0, and
/sys/class/net/eth0/master pointing to /sys/class/net/bond0.
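
The symlinks described above can also be inspected from a script. The following is a minimal sketch, assuming the sysfs layout stated in the text. The function name ``bond_master`` and the ``sysfs_root`` parameter are hypothetical; the parameter exists only so the helper can be pointed at a test directory or a non-standard mount point.

```python
import os


def bond_master(iface, sysfs_root="/sys"):
    """Return the name of the bond that iface is enslaved to, or None."""
    link = os.path.join(sysfs_root, "class", "net", iface, "master")
    # An interface that is not enslaved has no "master" symlink.
    if not os.path.islink(link):
        return None
    # The link target may be relative (e.g. "../bond0"); keep the last part.
    return os.path.basename(os.readlink(link))
```

For the example in the text, ``bond_master("eth0")`` would be expected to return ``"bond0"`` while eth0 is enslaved, and ``None`` after it has been freed.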
- This means that you can tell quickly whether or not an
+This means that you can tell quickly whether or not an
interface is enslaved by looking for the master symlink. Thus:
# echo -eth0 > /sys/class/net/eth0/master/bonding/slaves
will free eth0 from whatever bond it is enslaved to, regardless of
@@ -1465,127 +1543,143 @@ the name of the bond interface.
Changing a Bond's Configuration
-------------------------------
- Each bond may be configured individually by manipulating the
+Each bond may be configured individually by manipulating the
files located in /sys/class/net/<bond name>/bonding
- The names of these files correspond directly with the command-
+The names of these files correspond directly with the command-
line parameters described elsewhere in this file, and, with the
exception of arp_ip_target, they accept the same values. To see the
current setting, simply cat the appropriate file.
- A few examples will be given here; for specific usage
+A few examples will be given here; for specific usage
guidelines for each parameter, see the appropriate section in this
document.
-To configure bond0 for balance-alb mode:
-# ifconfig bond0 down
-# echo 6 > /sys/class/net/bond0/bonding/mode
- - or -
-# echo balance-alb > /sys/class/net/bond0/bonding/mode
- NOTE: The bond interface must be down before the mode can be
-changed.
-
-To enable MII monitoring on bond0 with a 1 second interval:
-# echo 1000 > /sys/class/net/bond0/bonding/miimon
- NOTE: If ARP monitoring is enabled, it will disabled when MII
-monitoring is enabled, and vice-versa.
-
-To add ARP targets:
-# echo +192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target
-# echo +192.168.0.101 > /sys/class/net/bond0/bonding/arp_ip_target
- NOTE: up to 16 target addresses may be specified.
-
-To remove an ARP target:
-# echo -192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target
-
-To configure the interval between learning packet transmits:
-# echo 12 > /sys/class/net/bond0/bonding/lp_interval
- NOTE: the lp_interval is the number of seconds between instances where
-the bonding driver sends learning packets to each slaves peer switch. The
-default interval is 1 second.
+To configure bond0 for balance-alb mode::
+
+ # ifconfig bond0 down
+ # echo 6 > /sys/class/net/bond0/bonding/mode
+ - or -
+ # echo balance-alb > /sys/class/net/bond0/bonding/mode
+
+.. note::
+
+ The bond interface must be down before the mode can be changed.
+
+To enable MII monitoring on bond0 with a 1 second interval::
+
+ # echo 1000 > /sys/class/net/bond0/bonding/miimon
+
+.. note::
+
+ If ARP monitoring is enabled, it will be disabled when MII
+ monitoring is enabled, and vice-versa.
+
+To add ARP targets::
+
+ # echo +192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target
+ # echo +192.168.0.101 > /sys/class/net/bond0/bonding/arp_ip_target
+
+.. note::
+
+ up to 16 target addresses may be specified.
+
+To remove an ARP target::
+
+ # echo -192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target
+
+To configure the interval between learning packet transmits::
+
+ # echo 12 > /sys/class/net/bond0/bonding/lp_interval
+
+.. note::
+
+ The lp_interval is the number of seconds between instances where
+ the bonding driver sends learning packets to each slave's peer switch. The
+ default interval is 1 second.
Example Configuration
---------------------
- We begin with the same example that is shown in section 3.3,
+We begin with the same example that is shown in section 3.3,
executed with sysfs, and without using ifenslave.
- To make a simple bond of two e100 devices (presumed to be eth0
+To make a simple bond of two e100 devices (presumed to be eth0
and eth1), and have it persist across reboots, edit the appropriate
file (/etc/init.d/boot.local or /etc/rc.d/rc.local), and add the
-following:
+following::
-modprobe bonding
-modprobe e100
-echo balance-alb > /sys/class/net/bond0/bonding/mode
-ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up
-echo 100 > /sys/class/net/bond0/bonding/miimon
-echo +eth0 > /sys/class/net/bond0/bonding/slaves
-echo +eth1 > /sys/class/net/bond0/bonding/slaves
+ modprobe bonding
+ modprobe e100
+ echo balance-alb > /sys/class/net/bond0/bonding/mode
+ ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up
+ echo 100 > /sys/class/net/bond0/bonding/miimon
+ echo +eth0 > /sys/class/net/bond0/bonding/slaves
+ echo +eth1 > /sys/class/net/bond0/bonding/slaves
- To add a second bond, with two e1000 interfaces in
+To add a second bond, with two e1000 interfaces in
active-backup mode, using ARP monitoring, add the following lines to
-your init script:
+your init script::
-modprobe e1000
-echo +bond1 > /sys/class/net/bonding_masters
-echo active-backup > /sys/class/net/bond1/bonding/mode
-ifconfig bond1 192.168.2.1 netmask 255.255.255.0 up
-echo +192.168.2.100 /sys/class/net/bond1/bonding/arp_ip_target
-echo 2000 > /sys/class/net/bond1/bonding/arp_interval
-echo +eth2 > /sys/class/net/bond1/bonding/slaves
-echo +eth3 > /sys/class/net/bond1/bonding/slaves
+ modprobe e1000
+ echo +bond1 > /sys/class/net/bonding_masters
+ echo active-backup > /sys/class/net/bond1/bonding/mode
+ ifconfig bond1 192.168.2.1 netmask 255.255.255.0 up
+ echo +192.168.2.100 > /sys/class/net/bond1/bonding/arp_ip_target
+ echo 2000 > /sys/class/net/bond1/bonding/arp_interval
+ echo +eth2 > /sys/class/net/bond1/bonding/slaves
+ echo +eth3 > /sys/class/net/bond1/bonding/slaves
3.5 Configuration with Interfaces Support
-----------------------------------------
- This section applies to distros which use /etc/network/interfaces file
+This section applies to distros which use /etc/network/interfaces file
to describe network interface configuration, most notably Debian and its
derivatives.
- The ifup and ifdown commands on Debian don't support bonding out of
+The ifup and ifdown commands on Debian don't support bonding out of
the box. The ifenslave-2.6 package should be installed to provide bonding
-support. Once installed, this package will provide bond-* options to be used
-into /etc/network/interfaces.
+support. Once installed, this package will provide ``bond-*`` options
+to be used in /etc/network/interfaces.
- Note that ifenslave-2.6 package will load the bonding module and use
+Note that ifenslave-2.6 package will load the bonding module and use
the ifenslave command when appropriate.
Example Configurations
----------------------
In /etc/network/interfaces, the following stanza will configure bond0, in
-active-backup mode, with eth0 and eth1 as slaves.
+active-backup mode, with eth0 and eth1 as slaves::
-auto bond0
-iface bond0 inet dhcp
- bond-slaves eth0 eth1
- bond-mode active-backup
- bond-miimon 100
- bond-primary eth0 eth1
+ auto bond0
+ iface bond0 inet dhcp
+ bond-slaves eth0 eth1
+ bond-mode active-backup
+ bond-miimon 100
+ bond-primary eth0 eth1
If the above configuration doesn't work, you might have a system using
upstart for system startup. This is most notably true for recent
Ubuntu versions. The following stanza in /etc/network/interfaces will
-produce the same result on those systems.
-
-auto bond0
-iface bond0 inet dhcp
- bond-slaves none
- bond-mode active-backup
- bond-miimon 100
-
-auto eth0
-iface eth0 inet manual
- bond-master bond0
- bond-primary eth0 eth1
-
-auto eth1
-iface eth1 inet manual
- bond-master bond0
- bond-primary eth0 eth1
-
-For a full list of bond-* supported options in /etc/network/interfaces and some
-more advanced examples tailored to you particular distros, see the files in
+produce the same result on those systems::
+
+ auto bond0
+ iface bond0 inet dhcp
+ bond-slaves none
+ bond-mode active-backup
+ bond-miimon 100
+
+ auto eth0
+ iface eth0 inet manual
+ bond-master bond0
+ bond-primary eth0 eth1
+
+ auto eth1
+ iface eth1 inet manual
+ bond-master bond0
+ bond-primary eth0 eth1
+
+For a full list of ``bond-*`` supported options in /etc/network/interfaces and
+some more advanced examples tailored to your particular distro, see the files in
/usr/share/doc/ifenslave-2.6.
3.6 Overriding Configuration for Special Cases
@@ -1604,37 +1698,37 @@ can safely be sent over either interface. Such configurations may be achieved
using the traffic control utilities inherent in linux.
By default the bonding driver is multiqueue aware and 16 queues are created
-when the driver initializes (see Documentation/networking/multiqueue.txt
+when the driver initializes (see Documentation/networking/multiqueue.rst
for details). If more or fewer queues are desired, the module parameter
tx_queues can be used to change this value. There is no sysfs parameter
available as the allocation is done at module init time.
The output of the file /proc/net/bonding/bondX has changed so the output Queue
-ID is now printed for each slave:
+ID is now printed for each slave::
-Bonding Mode: fault-tolerance (active-backup)
-Primary Slave: None
-Currently Active Slave: eth0
-MII Status: up
-MII Polling Interval (ms): 0
-Up Delay (ms): 0
-Down Delay (ms): 0
+ Bonding Mode: fault-tolerance (active-backup)
+ Primary Slave: None
+ Currently Active Slave: eth0
+ MII Status: up
+ MII Polling Interval (ms): 0
+ Up Delay (ms): 0
+ Down Delay (ms): 0
-Slave Interface: eth0
-MII Status: up
-Link Failure Count: 0
-Permanent HW addr: 00:1a:a0:12:8f:cb
-Slave queue ID: 0
+ Slave Interface: eth0
+ MII Status: up
+ Link Failure Count: 0
+ Permanent HW addr: 00:1a:a0:12:8f:cb
+ Slave queue ID: 0
-Slave Interface: eth1
-MII Status: up
-Link Failure Count: 0
-Permanent HW addr: 00:1a:a0:12:8f:cc
-Slave queue ID: 2
+ Slave Interface: eth1
+ MII Status: up
+ Link Failure Count: 0
+ Permanent HW addr: 00:1a:a0:12:8f:cc
+ Slave queue ID: 2
-The queue_id for a slave can be set using the command:
+The queue_id for a slave can be set using the command::
-# echo "eth1:2" > /sys/class/net/bond0/bonding/queue_id
+ # echo "eth1:2" > /sys/class/net/bond0/bonding/queue_id
Any interface that needs a queue_id set should set it with multiple calls
like the one above until proper priorities are set for all interfaces. On
@@ -1645,12 +1739,12 @@ These queue id's can be used in conjunction with the tc utility to configure
a multiqueue qdisc and filters to bias certain traffic to transmit on certain
slave devices. For instance, say we wanted, in the above configuration to
force all traffic bound to 192.168.1.100 to use eth1 in the bond as its output
-device. The following commands would accomplish this:
+device. The following commands would accomplish this::
-# tc qdisc add dev bond0 handle 1 root multiq
+ # tc qdisc add dev bond0 handle 1 root multiq
-# tc filter add dev bond0 protocol ip parent 1: prio 1 u32 match ip dst \
- 192.168.1.100 action skbedit queue_mapping 2
+ # tc filter add dev bond0 protocol ip parent 1: prio 1 u32 match ip \
+ dst 192.168.1.100 action skbedit queue_mapping 2
These commands tell the kernel to attach a multiqueue queue discipline to the
bond0 interface and filter traffic enqueued to it, such that packets with a dst
@@ -1663,7 +1757,7 @@ that normal output policy selection should take place. One benefit to simply
leaving the qid for a slave to 0 is the multiqueue awareness in the bonding
driver that is now present. This awareness allows tc filters to be placed on
slave devices as well as bond devices and the bonding driver will simply act as
-a pass-through for selecting output queues on the slave device rather than
+a pass-through for selecting output queues on the slave device rather than
output port selection.
This feature first appeared in bonding driver version 3.7.0 and support for
@@ -1689,31 +1783,31 @@ few bonding parameters:
(a) ad_actor_system : You can set a random mac-address that can be used for
these LACPDU exchanges. The value can not be either NULL or Multicast.
Also it's preferable to set the local-admin bit. Following shell code
- generates a random mac-address as described above.
+ generates a random mac-address as described above::
- # sys_mac_addr=$(printf '%02x:%02x:%02x:%02x:%02x:%02x' \
- $(( (RANDOM & 0xFE) | 0x02 )) \
- $(( RANDOM & 0xFF )) \
- $(( RANDOM & 0xFF )) \
- $(( RANDOM & 0xFF )) \
- $(( RANDOM & 0xFF )) \
- $(( RANDOM & 0xFF )))
- # echo $sys_mac_addr > /sys/class/net/bond0/bonding/ad_actor_system
+ # sys_mac_addr=$(printf '%02x:%02x:%02x:%02x:%02x:%02x' \
+ $(( (RANDOM & 0xFE) | 0x02 )) \
+ $(( RANDOM & 0xFF )) \
+ $(( RANDOM & 0xFF )) \
+ $(( RANDOM & 0xFF )) \
+ $(( RANDOM & 0xFF )) \
+ $(( RANDOM & 0xFF )))
+ # echo $sys_mac_addr > /sys/class/net/bond0/bonding/ad_actor_system
(b) ad_actor_sys_prio : Randomize the system priority. The default value
is 65535, but system can take the value from 1 - 65535. Following shell
- code generates random priority and sets it.
+ code generates random priority and sets it::
- # sys_prio=$(( 1 + RANDOM + RANDOM ))
- # echo $sys_prio > /sys/class/net/bond0/bonding/ad_actor_sys_prio
+ # sys_prio=$(( 1 + RANDOM + RANDOM ))
+ # echo $sys_prio > /sys/class/net/bond0/bonding/ad_actor_sys_prio
(c) ad_user_port_key : Use the user portion of the port-key. The default
keeps this empty. These are the upper 10 bits of the port-key and value
ranges from 0 - 1023. Following shell code generates these 10 bits and
- sets it.
+ sets it::
- # usr_port_key=$(( RANDOM & 0x3FF ))
- # echo $usr_port_key > /sys/class/net/bond0/bonding/ad_user_port_key
+ # usr_port_key=$(( RANDOM & 0x3FF ))
+ # echo $usr_port_key > /sys/class/net/bond0/bonding/ad_user_port_key
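The constraints on the three parameters above can be checked before the
values are written to sysfs. The following is a minimal sketch, not part of
the bonding driver or its tools: the helper names is_local_unicast_mac and
in_range are hypothetical, and the checks only mirror the rules stated above
(multicast bit clear, local-admin bit set, priority 1 - 65535, port key
0 - 1023).

```shell
# Hypothetical helpers for illustration only.

# Succeeds if the first octet of the MAC address in $1 has the multicast
# bit (0x01) clear and the locally administered bit (0x02) set.
is_local_unicast_mac() {
    first_octet=${1%%:*}
    value=$(( 0x$first_octet ))
    [ $(( value & 0x01 )) -eq 0 ] && [ $(( value & 0x02 )) -ne 0 ]
}

# Succeeds if the integer $1 lies in the inclusive range $2..$3.
in_range() {
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}
```

A caller might run ``is_local_unicast_mac "$sys_mac_addr"``,
``in_range "$sys_prio" 1 65535`` and ``in_range "$usr_port_key" 0 1023``
and skip the corresponding echo if a check fails.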
4 Querying Bonding Configuration
@@ -1722,81 +1816,81 @@ few bonding parameters:
4.1 Bonding Configuration
-------------------------
- Each bonding device has a read-only file residing in the
+Each bonding device has a read-only file residing in the
/proc/net/bonding directory. The file contents include information
about the bonding configuration, options and state of each slave.
- For example, the contents of /proc/net/bonding/bond0 after the
+For example, the contents of /proc/net/bonding/bond0 after the
driver is loaded with parameters of mode=0 and miimon=1000 is
-generally as follows:
+generally as follows::
Ethernet Channel Bonding Driver: 2.6.1 (October 29, 2004)
- Bonding Mode: load balancing (round-robin)
- Currently Active Slave: eth0
- MII Status: up
- MII Polling Interval (ms): 1000
- Up Delay (ms): 0
- Down Delay (ms): 0
-
- Slave Interface: eth1
- MII Status: up
- Link Failure Count: 1
-
- Slave Interface: eth0
- MII Status: up
- Link Failure Count: 1
-
- The precise format and contents will change depending upon the
+ Bonding Mode: load balancing (round-robin)
+ Currently Active Slave: eth0
+ MII Status: up
+ MII Polling Interval (ms): 1000
+ Up Delay (ms): 0
+ Down Delay (ms): 0
+
+ Slave Interface: eth1
+ MII Status: up
+ Link Failure Count: 1
+
+ Slave Interface: eth0
+ MII Status: up
+ Link Failure Count: 1
+
+The precise format and contents will change depending upon the
bonding configuration, state, and version of the bonding driver.
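As a rough illustration of reading this file from a script, the sketch below
prints each slave together with its MII status. The function name
summarize_bond is hypothetical, and the field labels are assumed to match the
sample output above; since the format varies with driver version and
configuration, the patterns may need adjusting.

```shell
# Hypothetical helper: reads /proc/net/bonding/bondX style text on stdin
# and prints "<slave> <MII status>" for each slave section.
summarize_bond() {
    awk -F': ' '
        # Remember the slave whose section we are in.
        /^Slave Interface:/ { slave = $2 }
        # The bond-level "MII Status" line appears before any slave
        # section, so it is skipped while slave is still empty.
        /^MII Status:/ && slave != "" { print slave, $2; slave = "" }
    '
}
```

It would typically be invoked as ``summarize_bond < /proc/net/bonding/bond0``.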
4.2 Network configuration
-------------------------
- The network configuration can be inspected using the ifconfig
+The network configuration can be inspected using the ifconfig
command. Bonding devices will have the MASTER flag set; Bonding slave
devices will have the SLAVE flag set. The ifconfig output does not
contain information on which slaves are associated with which masters.
- In the example below, the bond0 interface is the master
+In the example below, the bond0 interface is the master
(MASTER) while eth0 and eth1 are slaves (SLAVE). Notice all slaves of
bond0 have the same MAC address (HWaddr) as bond0 for all modes except
-TLB and ALB that require a unique MAC address for each slave.
-
-# /sbin/ifconfig
-bond0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4
- inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0
- UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
- RX packets:7224794 errors:0 dropped:0 overruns:0 frame:0
- TX packets:3286647 errors:1 dropped:0 overruns:1 carrier:0
- collisions:0 txqueuelen:0
-
-eth0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4
- UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
- RX packets:3573025 errors:0 dropped:0 overruns:0 frame:0
- TX packets:1643167 errors:1 dropped:0 overruns:1 carrier:0
- collisions:0 txqueuelen:100
- Interrupt:10 Base address:0x1080
-
-eth1 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4
- UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
- RX packets:3651769 errors:0 dropped:0 overruns:0 frame:0
- TX packets:1643480 errors:0 dropped:0 overruns:0 carrier:0
- collisions:0 txqueuelen:100
- Interrupt:9 Base address:0x1400
+TLB and ALB that require a unique MAC address for each slave::
+
+ # /sbin/ifconfig
+ bond0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4
+ inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0
+ UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
+ RX packets:7224794 errors:0 dropped:0 overruns:0 frame:0
+ TX packets:3286647 errors:1 dropped:0 overruns:1 carrier:0
+ collisions:0 txqueuelen:0
+
+ eth0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4
+ UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
+ RX packets:3573025 errors:0 dropped:0 overruns:0 frame:0
+ TX packets:1643167 errors:1 dropped:0 overruns:1 carrier:0
+ collisions:0 txqueuelen:100
+ Interrupt:10 Base address:0x1080
+
+ eth1 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4
+ UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
+ RX packets:3651769 errors:0 dropped:0 overruns:0 frame:0
+ TX packets:1643480 errors:0 dropped:0 overruns:0 carrier:0
+ collisions:0 txqueuelen:100
+ Interrupt:9 Base address:0x1400
5. Switch Configuration
=======================
- For this section, "switch" refers to whatever system the
+For this section, "switch" refers to whatever system the
bonded devices are directly connected to (i.e., where the other end of
the cable plugs into). This may be an actual dedicated switch device,
or it may be another regular system (e.g., another computer running
Linux).
- The active-backup, balance-tlb and balance-alb modes do not
+The active-backup, balance-tlb and balance-alb modes do not
require any specific configuration of the switch.
- The 802.3ad mode requires that the switch have the appropriate
+The 802.3ad mode requires that the switch have the appropriate
ports configured as an 802.3ad aggregation. The precise method used
to configure this varies from switch to switch, but, for example, a
Cisco 3550 series switch requires that the appropriate ports first be
@@ -1804,7 +1898,7 @@ grouped together in a single etherchannel instance, then that
etherchannel is set to mode "lacp" to enable 802.3ad (instead of
standard EtherChannel).
- The balance-rr, balance-xor and broadcast modes generally
+The balance-rr, balance-xor and broadcast modes generally
require that the switch have the appropriate ports grouped together.
The nomenclature for such a group differs between switches, it may be
called an "etherchannel" (as in the Cisco example, above), a "trunk
@@ -1820,7 +1914,7 @@ with another EtherChannel group.
6. 802.1q VLAN Support
======================
- It is possible to configure VLAN devices over a bond interface
+It is possible to configure VLAN devices over a bond interface
using the 8021q driver. However, only packets coming from the 8021q
driver and passing through bonding will be tagged by default. Self
generated packets, for example, bonding's learning packets or ARP
@@ -1829,7 +1923,7 @@ tagged internally by bonding itself. As a result, bonding must
"learn" the VLAN IDs configured above it, and use those IDs to tag
self generated packets.
- For reasons of simplicity, and to support the use of adapters
+For reasons of simplicity, and to support the use of adapters
that can do VLAN hardware acceleration offloading, the bonding
interface declares itself as fully hardware offloading capable, it gets
the add_vid/kill_vid notifications to gather the necessary
@@ -1839,7 +1933,7 @@ should go through an adapter that is not offloading capable are
"un-accelerated" by the bonding driver so the VLAN tag sits in the
regular location.
- VLAN interfaces *must* be added on top of a bonding interface
+VLAN interfaces *must* be added on top of a bonding interface
only after enslaving at least one slave. The bonding interface has a
hardware address of 00:00:00:00:00:00 until the first slave is added.
If the VLAN interface is created prior to the first enslavement, it
@@ -1847,23 +1941,23 @@ would pick up the all-zeroes hardware address. Once the first slave
is attached to the bond, the bond device itself will pick up the
slave's hardware address, which is then available for the VLAN device.
- Also, be aware that a similar problem can occur if all slaves
+Also, be aware that a similar problem can occur if all slaves
are released from a bond that still has one or more VLAN interfaces on
top of it. When a new slave is added, the bonding interface will
obtain its hardware address from the first slave, which might not
match the hardware address of the VLAN interfaces (which was
ultimately copied from an earlier slave).
- There are two methods to insure that the VLAN device operates
+There are two methods to insure that the VLAN device operates
with the correct hardware address if all slaves are removed from a
bond interface:
- 1. Remove all VLAN interfaces then recreate them
+1. Remove all VLAN interfaces then recreate them
- 2. Set the bonding interface's hardware address so that it
+2. Set the bonding interface's hardware address so that it
matches the hardware address of the VLAN interfaces.
- Note that changing a VLAN interface's HW address would set the
+Note that changing a VLAN interface's HW address would set the
underlying device -- i.e. the bonding interface -- to promiscuous
mode, which might not be what you want.
@@ -1871,65 +1965,56 @@ mode, which might not be what you want.
7. Link Monitoring
==================
- The bonding driver at present supports two schemes for
+The bonding driver at present supports two schemes for
monitoring a slave device's link state: the ARP monitor and the MII
monitor.
- At the present time, due to implementation restrictions in the
+At the present time, due to implementation restrictions in the
bonding driver itself, it is not possible to enable both ARP and MII
monitoring simultaneously.
7.1 ARP Monitor Operation
-------------------------
- The ARP monitor operates as its name suggests: it sends ARP
+The ARP monitor operates as its name suggests: it sends ARP
queries to one or more designated peer systems on the network, and
uses the response as an indication that the link is operating. This
gives some assurance that traffic is actually flowing to and from one
or more peers on the local network.
- The ARP monitor relies on the device driver itself to verify
-that traffic is flowing. In particular, the driver must keep up to
-date the last receive time, dev->last_rx. Drivers that use NETIF_F_LLTX
-flag must also update netdev_queue->trans_start. If they do not, then the
-ARP monitor will immediately fail any slaves using that driver, and
-those slaves will stay down. If networking monitoring (tcpdump, etc)
-shows the ARP requests and replies on the network, then it may be that
-your device driver is not updating last_rx and trans_start.
-
7.2 Configuring Multiple ARP Targets
------------------------------------
- While ARP monitoring can be done with just one target, it can
+While ARP monitoring can be done with just one target, it can
be useful in a High Availability setup to have several targets to
monitor. In the case of just one target, the target itself may go
down or have a problem making it unresponsive to ARP requests. Having
an additional target (or several) increases the reliability of the ARP
monitoring.
- Multiple ARP targets must be separated by commas as follows:
+Multiple ARP targets must be separated by commas as follows::
-# example options for ARP monitoring with three targets
-alias bond0 bonding
-options bond0 arp_interval=60 arp_ip_target=192.168.0.1,192.168.0.3,192.168.0.9
+ # example options for ARP monitoring with three targets
+ alias bond0 bonding
+ options bond0 arp_interval=60 arp_ip_target=192.168.0.1,192.168.0.3,192.168.0.9
- For just a single target the options would resemble:
+For just a single target the options would resemble::
-# example options for ARP monitoring with one target
-alias bond0 bonding
-options bond0 arp_interval=60 arp_ip_target=192.168.0.100
+ # example options for ARP monitoring with one target
+ alias bond0 bonding
+ options bond0 arp_interval=60 arp_ip_target=192.168.0.100
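When the target list is assembled by a script, the comma-separated form can
be produced as sketched below. The helper name join_arp_targets is
hypothetical; it only joins its arguments and does not check that they are
valid IP addresses.

```shell
# Hypothetical helper: join all arguments with commas.
# The function body runs in a subshell so the IFS change stays local.
join_arp_targets() (
    IFS=,
    printf '%s\n' "$*"
)
```

For example, ``join_arp_targets 192.168.0.1 192.168.0.3 192.168.0.9`` yields
the value used for arp_ip_target in the three-target example above.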
7.3 MII Monitor Operation
-------------------------
- The MII monitor monitors only the carrier state of the local
+The MII monitor monitors only the carrier state of the local
network interface. It accomplishes this in one of three ways: by
depending upon the device driver to maintain its carrier state, by
querying the device's MII registers, or by making an ethtool query to
the device.
- If the use_carrier module parameter is 1 (the default value),
+If the use_carrier module parameter is 1 (the default value),
then the MII monitor will rely on the driver for carrier state
information (via the netif_carrier subsystem). As explained in the
use_carrier parameter information, above, if the MII monitor fails to
@@ -1937,10 +2022,10 @@ detect carrier loss on the device (e.g., when the cable is physically
disconnected), it may be that the driver does not support
netif_carrier.
- If use_carrier is 0, then the MII monitor will first query the
+If use_carrier is 0, then the MII monitor will first query the
device's (via ioctl) MII registers and check the link state. If that
request fails (not just that it returns carrier down), then the MII
-monitor will make an ethtool ETHOOL_GLINK request to attempt to obtain
+monitor will make an ethtool ETHTOOL_GLINK request to attempt to obtain
the same information. If both methods fail (i.e., the driver either
does not support or had some error in processing both the MII register
and ethtool requests), then the MII monitor will assume the link is
@@ -1952,25 +2037,25 @@ up.
8.1 Adventures in Routing
-------------------------
- When bonding is configured, it is important that the slave
+When bonding is configured, it is important that the slave
devices not have routes that supersede routes of the master (or,
generally, not have routes at all). For example, suppose the bonding
device bond0 has two slaves, eth0 and eth1, and the routing table is
-as follows:
+as follows::
-Kernel IP routing table
-Destination Gateway Genmask Flags MSS Window irtt Iface
-10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 eth0
-10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 eth1
-10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 bond0
-127.0.0.0 0.0.0.0 255.0.0.0 U 40 0 0 lo
+ Kernel IP routing table
+ Destination Gateway Genmask Flags MSS Window irtt Iface
+ 10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 eth0
+ 10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 eth1
+ 10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 bond0
+ 127.0.0.0 0.0.0.0 255.0.0.0 U 40 0 0 lo
- This routing configuration will likely still update the
+This routing configuration will likely still update the
receive/transmit times in the driver (needed by the ARP monitor), but
may bypass the bonding driver (because outgoing traffic to, in this
case, another host on network 10 would use eth0 or eth1 before bond0).
- The ARP monitor (and ARP itself) may become confused by this
+The ARP monitor (and ARP itself) may become confused by this
configuration, because ARP requests (generated by the ARP monitor)
will be sent on one interface (bond0), but the corresponding reply
will arrive on a different interface (eth0). This reply looks to ARP
@@ -1978,7 +2063,7 @@ as an unsolicited ARP reply (because ARP matches replies on an
interface basis), and is discarded. The MII monitor is not affected
by the state of the routing table.
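A configuration like the one above can be spotted by scanning the routing
table for entries whose output interface is a slave. The following is a
minimal sketch under stated assumptions: the function name routes_on_slaves
is hypothetical, and the input is assumed to use the column layout shown
above, with the interface name in the last column.

```shell
# Hypothetical helper: $1 is a space-separated list of slave names; the
# routing table is read from stdin. Prints "<destination> <iface>" for
# every route that points at a slave device.
routes_on_slaves() {
    awk -v slaves="$1" '
        BEGIN {
            n = split(slaves, names, " ")
            for (i = 1; i <= n; i++)
                is_slave[names[i]] = 1
        }
        # Title and header lines end in words that are not slave names,
        # so they fall through without output.
        ($NF in is_slave) { print $1, $NF }
    '
}
```

An empty result suggests that no route bypasses the bond; any line printed
identifies a route that may need to be removed.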
- The solution here is simply to insure that slaves do not have
+The solution here is simply to insure that slaves do not have
routes of their own, and if for some reason they must, those routes do
not supersede routes of their master. This should generally be the
case, but unusual configurations or errant manual or automatic static
@@ -1987,22 +2072,22 @@ route additions may cause trouble.
8.2 Ethernet Device Renaming
----------------------------
- On systems with network configuration scripts that do not
+On systems with network configuration scripts that do not
associate physical devices directly with network interface names (so
that the same physical device always has the same "ethX" name), it may
be necessary to add some special logic to config files in
/etc/modprobe.d/.
- For example, given a modules.conf containing the following:
+For example, given a modules.conf containing the following::
-alias bond0 bonding
-options bond0 mode=some-mode miimon=50
-alias eth0 tg3
-alias eth1 tg3
-alias eth2 e1000
-alias eth3 e1000
+ alias bond0 bonding
+ options bond0 mode=some-mode miimon=50
+ alias eth0 tg3
+ alias eth1 tg3
+ alias eth2 e1000
+ alias eth3 e1000
- If neither eth0 and eth1 are slaves to bond0, then when the
+If neither eth0 nor eth1 is a slave to bond0, then when the
bond0 interface comes up, the devices may end up reordered. This
happens because bonding is loaded first, then its slave device's
drivers are loaded next. Since no other drivers have been loaded,
@@ -2010,36 +2095,36 @@ when the e1000 driver loads, it will receive eth0 and eth1 for its
devices, but the bonding configuration tries to enslave eth2 and eth3
(which may later be assigned to the tg3 devices).
- Adding the following:
+Adding the following::
-add above bonding e1000 tg3
+ add above bonding e1000 tg3
- causes modprobe to load e1000 then tg3, in that order, when
+causes modprobe to load e1000 then tg3, in that order, when
bonding is loaded. This command is fully documented in the
modules.conf manual page.
- On systems utilizing modprobe an equivalent problem can occur.
+On systems utilizing modprobe an equivalent problem can occur.
In this case, the following can be added to config files in
-/etc/modprobe.d/ as:
+/etc/modprobe.d/ as::
-softdep bonding pre: tg3 e1000
+ softdep bonding pre: tg3 e1000
- This will load tg3 and e1000 modules before loading the bonding one.
+This will load tg3 and e1000 modules before loading the bonding one.
Full documentation on this can be found in the modprobe.d and modprobe
manual pages.
8.3. Painfully Slow Or No Failed Link Detection By Miimon
---------------------------------------------------------
- By default, bonding enables the use_carrier option, which
+By default, bonding enables the use_carrier option, which
instructs bonding to trust the driver to maintain carrier state.
- As discussed in the options section, above, some drivers do
+As discussed in the options section, above, some drivers do
not support the netif_carrier_on/_off link state tracking system.
With use_carrier enabled, bonding will always see these links as up,
regardless of their actual state.
- Additionally, other drivers do support netif_carrier, but do
+Additionally, other drivers do support netif_carrier, but do
not maintain it in real time, e.g., only polling the link state at
some fixed interval. In this case, miimon will detect failures, but
only after some long period of time has expired. If it appears that
@@ -2051,7 +2136,7 @@ use_carrier=0 method of querying the registers directly works). If
use_carrier=0 does not improve the failover, then the driver may cache
the registers, or the problem may be elsewhere.
- Also, remember that miimon only checks for the device's
+Also, remember that miimon only checks for the device's
carrier state. It has no way to determine the state of devices on or
beyond other ports of a switch, or if a switch is refusing to pass
traffic while still maintaining carrier on.
@@ -2059,7 +2144,7 @@ traffic while still maintaining carrier on.
9. SNMP agents
===============
- If running SNMP agents, the bonding driver should be loaded
+If running SNMP agents, the bonding driver should be loaded
before any network drivers participating in a bond. This requirement
is due to the interface index (ipAdEntIfIndex) being associated to
the first interface found with a given IP address. That is, there is
@@ -2070,6 +2155,8 @@ with the eth0 interface. This configuration is shown below, the IP
address 192.168.1.1 has an interface index of 2 which indexes to eth0
in the ifDescr table (ifDescr.2).
+::
+
interfaces.ifTable.ifEntry.ifDescr.1 = lo
interfaces.ifTable.ifEntry.ifDescr.2 = eth0
interfaces.ifTable.ifEntry.ifDescr.3 = eth1
@@ -2081,7 +2168,7 @@ in the ifDescr table (ifDescr.2).
ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 4
ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1
- This problem is avoided by loading the bonding driver before
+This problem is avoided by loading the bonding driver before
any network drivers participating in a bond. Below is an example of
loading the bonding driver first, the IP address 192.168.1.1 is
correctly associated with ifDescr.2.
@@ -2097,7 +2184,7 @@ correctly associated with ifDescr.2.
ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 5
ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1
- While some distributions may not report the interface name in
+While some distributions may not report the interface name in
ifDescr, the association between the IP address and IfIndex remains
and SNMP functions such as Interface_Scan_Next will report that
association.
@@ -2105,34 +2192,34 @@ association.
10. Promiscuous mode
====================
- When running network monitoring tools, e.g., tcpdump, it is
+When running network monitoring tools, e.g., tcpdump, it is
common to enable promiscuous mode on the device, so that all traffic
is seen (instead of seeing only traffic destined for the local host).
The bonding driver handles promiscuous mode changes to the bonding
master device (e.g., bond0), and propagates the setting to the slave
devices.
- For the balance-rr, balance-xor, broadcast, and 802.3ad modes,
+For the balance-rr, balance-xor, broadcast, and 802.3ad modes,
the promiscuous mode setting is propagated to all slaves.
- For the active-backup, balance-tlb and balance-alb modes, the
+For the active-backup, balance-tlb and balance-alb modes, the
promiscuous mode setting is propagated only to the active slave.
- For balance-tlb mode, the active slave is the slave currently
+For balance-tlb mode, the active slave is the slave currently
receiving inbound traffic.
- For balance-alb mode, the active slave is the slave used as a
+For balance-alb mode, the active slave is the slave used as a
"primary." This slave is used for mode-specific control traffic, for
sending to peers that are unassigned or if the load is unbalanced.
- For the active-backup, balance-tlb and balance-alb modes, when
+For the active-backup, balance-tlb and balance-alb modes, when
the active slave changes (e.g., due to a link failure), the
promiscuous setting will be propagated to the new active slave.
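The per-mode behaviour described above can be summarized as a lookup. This
sketch only restates the rules from this section; the function name
promisc_targets is hypothetical and nothing here queries the driver.

```shell
# Hypothetical helper: given a bonding mode name in $1, print which
# slaves receive the promiscuous mode setting.
promisc_targets() {
    case "$1" in
        balance-rr|balance-xor|broadcast|802.3ad)
            echo "all slaves"
            ;;
        active-backup|balance-tlb|balance-alb)
            echo "active slave only"
            ;;
        *)
            echo "unknown mode: $1" >&2
            return 1
            ;;
    esac
}
```

For instance, ``promisc_targets active-backup`` reports that only the active
slave is affected, which is why a capture on a backup slave may see less
traffic than expected.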
11. Configuring Bonding for High Availability
=============================================
- High Availability refers to configurations that provide
+High Availability refers to configurations that provide
maximum network availability by having redundant or backup devices,
links or switches between the host and the rest of the world. The
goal is to provide the maximum availability of network connectivity
@@ -2142,7 +2229,7 @@ could provide higher throughput.
11.1 High Availability in a Single Switch Topology
--------------------------------------------------
- If two hosts (or a host and a single switch) are directly
+If two hosts (or a host and a single switch) are directly
connected via multiple physical links, then there is no availability
penalty to optimizing for maximum bandwidth. In this case, there is
only one switch (or peer), so if it fails, there is no alternative
@@ -2150,32 +2237,32 @@ access to fail over to. Additionally, the bonding load balance modes
support link monitoring of their members, so if individual links fail,
the load will be rebalanced across the remaining devices.
- See Section 12, "Configuring Bonding for Maximum Throughput"
+See Section 12, "Configuring Bonding for Maximum Throughput"
for information on configuring bonding with one peer device.
11.2 High Availability in a Multiple Switch Topology
----------------------------------------------------
- With multiple switches, the configuration of bonding and the
+With multiple switches, the configuration of bonding and the
network changes dramatically. In multiple switch topologies, there is
a trade off between network availability and usable bandwidth.
- Below is a sample network, configured to maximize the
-availability of the network:
-
- | |
- |port3 port3|
- +-----+----+ +-----+----+
- | |port2 ISL port2| |
- | switch A +--------------------------+ switch B |
- | | | |
- +-----+----+ +-----++---+
- |port1 port1|
- | +-------+ |
- +-------------+ host1 +---------------+
- eth0 +-------+ eth1
-
- In this configuration, there is a link between the two
+Below is a sample network, configured to maximize the
+availability of the network::
+
+ | |
+ |port3 port3|
+ +-----+----+ +-----+----+
+ | |port2 ISL port2| |
+ | switch A +--------------------------+ switch B |
+ | | | |
+ +-----+----+ +-----++---+
+ |port1 port1|
+ | +-------+ |
+ +-------------+ host1 +---------------+
+ eth0 +-------+ eth1
+
+In this configuration, there is a link between the two
switches (ISL, or inter switch link), and multiple ports connecting to
the outside world ("port3" on each switch). There is no technical
reason that this could not be extended to a third switch.
@@ -2183,19 +2270,21 @@ reason that this could not be extended to a third switch.
11.2.1 HA Bonding Mode Selection for Multiple Switch Topology
-------------------------------------------------------------
- In a topology such as the example above, the active-backup and
+In a topology such as the example above, the active-backup and
broadcast modes are the only useful bonding modes when optimizing for
availability; the other modes require all links to terminate on the
same peer for them to behave rationally.
-active-backup: This is generally the preferred mode, particularly if
+active-backup:
+ This is generally the preferred mode, particularly if
the switches have an ISL and play together well. If the
network configuration is such that one switch is specifically
a backup switch (e.g., has lower capacity, higher cost, etc),
then the primary option can be used to insure that the
preferred link is always used when it is available.
-broadcast: This mode is really a special purpose mode, and is suitable
+broadcast:
+ This mode is really a special purpose mode, and is suitable
only for very specific needs. For example, if the two
switches are not connected (no ISL), and the networks beyond
them are totally independent. In this case, if it is
@@ -2205,7 +2294,7 @@ broadcast: This mode is really a special purpose mode, and is suitable
11.2.2 HA Link Monitoring Selection for Multiple Switch Topology
----------------------------------------------------------------
- The choice of link monitoring ultimately depends upon your
+The choice of link monitoring ultimately depends upon your
switch. If the switch can reliably fail ports in response to other
failures, then either the MII or ARP monitors should work. For
example, in the above example, if the "port3" link fails at the remote
@@ -2213,7 +2302,7 @@ end, the MII monitor has no direct means to detect this. The ARP
monitor could be configured with a target at the remote end of port3,
thus detecting that failure without switch support.
- In general, however, in a multiple switch topology, the ARP
+In general, however, in a multiple switch topology, the ARP
monitor can provide a higher level of reliability in detecting end to
end connectivity failures (which may be caused by the failure of any
individual component to pass traffic for any reason). Additionally,
@@ -2222,7 +2311,7 @@ one for each switch in the network). This will insure that,
regardless of which switch is active, the ARP monitor has a suitable
target to query.
- Note, also, that of late many switches now support a functionality
+Note, also, that of late many switches now support a functionality
generally referred to as "trunk failover." This is a feature of the
switch that causes the link state of a particular switch port to be set
down (or up) when the state of another switch port goes down (or up).
@@ -2238,18 +2327,18 @@ suitable switches.
12.1 Maximizing Throughput in a Single Switch Topology
------------------------------------------------------
- In a single switch configuration, the best method to maximize
+In a single switch configuration, the best method to maximize
throughput depends upon the application and network environment. The
various load balancing modes each have strengths and weaknesses in
different environments, as detailed below.
- For this discussion, we will break down the topologies into
+For this discussion, we will break down the topologies into
two categories. Depending upon the destination of most traffic, we
categorize them into either "gatewayed" or "local" configurations.
- In a gatewayed configuration, the "switch" is acting primarily
+In a gatewayed configuration, the "switch" is acting primarily
as a router, and the majority of traffic passes through this router to
-other networks. An example would be the following:
+other networks. An example would be the following::
+----------+ +----------+
@@ -2259,25 +2348,25 @@ other networks. An example would be the following:
| |eth1 port2| | here somewhere
+----------+ +----------+
- The router may be a dedicated router device, or another host
+The router may be a dedicated router device, or another host
acting as a gateway. For our discussion, the important point is that
the majority of traffic from Host A will pass through the router to
some other network before reaching its final destination.
- In a gatewayed network configuration, although Host A may
+In a gatewayed network configuration, although Host A may
communicate with many other systems, all of its traffic will be sent
and received via one other peer on the local network, the router.
- Note that the case of two systems connected directly via
+Note that the case of two systems connected directly via
multiple physical links is, for purposes of configuring bonding, the
same as a gatewayed configuration. In that case, it happens that all
traffic is destined for the "gateway" itself, not some other network
beyond the gateway.
- In a local configuration, the "switch" is acting primarily as
+In a local configuration, the "switch" is acting primarily as
a switch, and the majority of traffic passes through this switch to
reach other stations on the same network. An example would be the
-following:
+following::
+----------+ +----------+ +--------+
| |eth0 port1| +-------+ Host B |
@@ -2287,19 +2376,19 @@ following:
+----------+ +----------+port4 +--------+
- Again, the switch may be a dedicated switch device, or another
+Again, the switch may be a dedicated switch device, or another
host acting as a gateway. For our discussion, the important point is
that the majority of traffic from Host A is destined for other hosts
on the same local network (Hosts B and C in the above example).
- In summary, in a gatewayed configuration, traffic to and from
+In summary, in a gatewayed configuration, traffic to and from
the bonded device will be to the same MAC level peer on the network
(the gateway itself, i.e., the router), regardless of its final
destination. In a local configuration, traffic flows directly to and
from the final destinations, thus, each destination (Host B, Host C)
will be addressed directly by their individual MAC addresses.
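
As a rough illustration of the difference, the following Python sketch models a simplified MAC-based XOR policy of the kind the load balancing modes use. The function names, the MAC addresses, and the choice to hash only the last octet of each address are assumptions made for this example; the driver's actual hash may mix in further fields.

```python
def last_octet(mac: str) -> int:
    """Return the final octet of a colon-separated MAC address as an integer."""
    return int(mac.split(":")[-1], 16)


def xor_slave(src_mac: str, dst_mac: str, slave_count: int) -> int:
    """Pick a slave index from the XOR of the two MAC addresses (simplified)."""
    return (last_octet(src_mac) ^ last_octet(dst_mac)) % slave_count


# Hypothetical addresses, chosen only for illustration.
HOST_A = "00:11:22:33:44:10"
ROUTER = "00:11:22:33:44:01"
HOST_B = "00:11:22:33:44:20"
HOST_C = "00:11:22:33:44:31"

# Gatewayed: every frame is addressed to the router at the MAC level,
# no matter how many remote systems Host A talks to.
gatewayed_slaves = {xor_slave(HOST_A, ROUTER, 2) for _remote in range(100)}

# Local: frames are addressed directly to each destination host.
local_slaves = {xor_slave(HOST_A, dst, 2) for dst in (HOST_B, HOST_C)}

print("gatewayed uses slaves:", sorted(gatewayed_slaves))
print("local uses slaves:", sorted(local_slaves))
```

With these made-up addresses the gatewayed case settles on a single slave, while the local case spreads across both.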
- This distinction between a gatewayed and a local network
+This distinction between a gatewayed and a local network
configuration is important because many of the load balancing modes
available use the MAC addresses of the local network source and
destination to make load balancing decisions. The behavior of each
@@ -2309,11 +2398,12 @@ mode is described below.
12.1.1 MT Bonding Mode Selection for Single Switch Topology
-----------------------------------------------------------
- This configuration is the easiest to set up and to understand,
+This configuration is the easiest to set up and to understand,
although you will have to decide which bonding mode best suits your
needs. The trade offs for each mode are detailed below:
-balance-rr: This mode is the only mode that will permit a single
+balance-rr:
+ This mode is the only mode that will permit a single
TCP/IP connection to stripe traffic across multiple
interfaces. It is therefore the only mode that will allow a
single TCP/IP stream to utilize more than one interface's
@@ -2351,7 +2441,8 @@ balance-rr: This mode is the only mode that will permit a single
This mode requires the switch to have the appropriate ports
configured for "etherchannel" or "trunking."
-active-backup: There is not much advantage in this network topology to
+active-backup:
+ There is not much advantage in this network topology to
the active-backup mode, as the inactive backup devices are all
connected to the same peer as the primary. In this case, a
load balancing mode (with link monitoring) will provide the
@@ -2361,7 +2452,8 @@ active-backup: There is not much advantage in this network topology to
have value if the hardware available does not support any of
the load balance modes.
-balance-xor: This mode will limit traffic such that packets destined
+balance-xor:
+ This mode will limit traffic such that packets destined
for specific peers will always be sent over the same
interface. Since the destination is determined by the MAC
addresses involved, this mode works best in a "local" network
@@ -2373,10 +2465,12 @@ balance-xor: This mode will limit traffic such that packets destined
As with balance-rr, the switch ports need to be configured for
"etherchannel" or "trunking."
-broadcast: Like active-backup, there is not much advantage to this
+broadcast:
+ Like active-backup, there is not much advantage to this
mode in this type of network topology.
-802.3ad: This mode can be a good choice for this type of network
+802.3ad:
+ This mode can be a good choice for this type of network
topology. The 802.3ad mode is an IEEE standard, so all peers
that implement 802.3ad should interoperate well. The 802.3ad
protocol includes automatic configuration of the aggregates,
@@ -2390,7 +2484,7 @@ broadcast: Like active-backup, there is not much advantage to this
the same speed and duplex. Also, as with all bonding load
balance modes other than balance-rr, no single connection will
be able to utilize more than a single interface's worth of
- bandwidth.
+ bandwidth.
Additionally, the linux bonding 802.3ad implementation
distributes traffic by peer (using an XOR of MAC addresses
@@ -2404,7 +2498,8 @@ broadcast: Like active-backup, there is not much advantage to this
Finally, the 802.3ad mode mandates the use of the MII monitor,
therefore, the ARP monitor is not available in this mode.
-balance-tlb: The balance-tlb mode balances outgoing traffic by peer.
+balance-tlb:
+ The balance-tlb mode balances outgoing traffic by peer.
Since the balancing is done according to MAC address, in a
"gatewayed" configuration (as described above), this mode will
send all traffic across a single device. However, in a
@@ -2422,7 +2517,8 @@ balance-tlb: The balance-tlb mode balances outgoing traffic by peer.
network device driver of the slave interfaces, and the ARP
monitor is not available.
-balance-alb: This mode is everything that balance-tlb is, and more.
+balance-alb:
+ This mode is everything that balance-tlb is, and more.
It has all of the features (and restrictions) of balance-tlb,
and will also balance incoming traffic from local network
peers (as described in the Bonding Module Options section,
@@ -2435,7 +2531,7 @@ balance-alb: This mode is everything that balance-tlb is, and more.
12.1.2 MT Link Monitoring for Single Switch Topology
----------------------------------------------------
- The choice of link monitoring may largely depend upon which
+The choice of link monitoring may largely depend upon which
mode you choose to use. The more advanced load balancing modes do not
support the use of the ARP monitor, and are thus restricted to using
the MII monitor (which does not provide as high a level of end to end
@@ -2444,27 +2540,27 @@ assurance as the ARP monitor).
12.2 Maximum Throughput in a Multiple Switch Topology
-----------------------------------------------------
- Multiple switches may be utilized to optimize for throughput
+Multiple switches may be utilized to optimize for throughput
when they are configured in parallel as part of an isolated network
-between two or more systems, for example:
-
- +-----------+
- | Host A |
- +-+---+---+-+
- | | |
- +--------+ | +---------+
- | | |
- +------+---+ +-----+----+ +-----+----+
- | Switch A | | Switch B | | Switch C |
- +------+---+ +-----+----+ +-----+----+
- | | |
- +--------+ | +---------+
- | | |
- +-+---+---+-+
- | Host B |
- +-----------+
-
- In this configuration, the switches are isolated from one
+between two or more systems, for example::
+
+ +-----------+
+ | Host A |
+ +-+---+---+-+
+ | | |
+ +--------+ | +---------+
+ | | |
+ +------+---+ +-----+----+ +-----+----+
+ | Switch A | | Switch B | | Switch C |
+ +------+---+ +-----+----+ +-----+----+
+ | | |
+ +--------+ | +---------+
+ | | |
+ +-+---+---+-+
+ | Host B |
+ +-----------+
+
+In this configuration, the switches are isolated from one
another. One reason to employ a topology such as this is for an
isolated network with many hosts (a cluster configured for high
performance, for example), using multiple smaller switches can be more
@@ -2472,14 +2568,14 @@ cost effective than a single larger switch, e.g., on a network with 24
hosts, three 24 port switches can be significantly less expensive than
a single 72 port switch.
- If access beyond the network is required, an individual host
+If access beyond the network is required, an individual host
can be equipped with an additional network device connected to an
external network; this host then additionally acts as a gateway.
12.2.1 MT Bonding Mode Selection for Multiple Switch Topology
-------------------------------------------------------------
- In actual practice, the bonding mode typically employed in
+In actual practice, the bonding mode typically employed in
configurations of this type is balance-rr. Historically, in this
network configuration, the usual caveats about out of order packet
delivery are mitigated by the use of network adapters that do not do
@@ -2492,7 +2588,7 @@ utilize greater than one interface's bandwidth.
12.2.2 MT Link Monitoring for Multiple Switch Topology
------------------------------------------------------
- Again, in actual practice, the MII monitor is most often used
+Again, in actual practice, the MII monitor is most often used
in this configuration, as performance is given preference over
availability. The ARP monitor will function in this topology, but its
advantages over the MII monitor are mitigated by the volume of probes
@@ -2505,10 +2601,10 @@ host in the network is configured with bonding).
13.1 Link Establishment and Failover Delays
-------------------------------------------
- Some switches exhibit undesirable behavior with regard to the
+Some switches exhibit undesirable behavior with regard to the
timing of link up and down reporting by the switch.
- First, when a link comes up, some switches may indicate that
+First, when a link comes up, some switches may indicate that
the link is up (carrier available), but not pass traffic over the
interface for some period of time. This delay is typically due to
some type of autonegotiation or routing protocol, but may also occur
@@ -2517,12 +2613,12 @@ failure). If you find this to be a problem, specify an appropriate
value to the updelay bonding module option to delay the use of the
relevant interface(s).
- Second, some switches may "bounce" the link state one or more
+Second, some switches may "bounce" the link state one or more
times while a link is changing state. This occurs most commonly while
the switch is initializing. Again, an appropriate updelay value may
help.
- Note that when a bonding interface has no active links, the
+Note that when a bonding interface has no active links, the
driver will immediately reuse the first link that goes up, even if the
updelay parameter has been specified (the updelay is ignored in this
case). If there are slave interfaces waiting for the updelay timeout
@@ -2532,7 +2628,7 @@ value of updelay has been overestimated, and since this occurs only in
cases with no connectivity, there is no additional penalty for
ignoring the updelay.
- In addition to the concerns about switch timings, if your
+In addition to the concerns about switch timings, if your
switches take a long time to go into backup mode, it may be desirable
to not activate a backup interface immediately after a link goes down.
Failover may be delayed via the downdelay bonding module option.
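
The updelay behavior described above can be modeled in a few lines of Python. This is only a sketch under stated assumptions: the function names are hypothetical, times are in milliseconds, and the rounding of the delay down to a multiple of the miimon interval reflects the documented module option behavior rather than the driver's code.

```python
def effective_delay(delay_ms: int, miimon_ms: int) -> int:
    """Round a delay down to the nearest multiple of the miimon interval."""
    return (delay_ms // miimon_ms) * miimon_ms


def link_usable_at(carrier_up_ms: int, updelay_ms: int, miimon_ms: int,
                   bond_has_active_link: bool) -> int:
    """Return the time at which a link that just came up may carry traffic."""
    if not bond_has_active_link:
        # With no active links, the first link to come up is reused
        # immediately and the updelay is ignored.
        return carrier_up_ms
    return carrier_up_ms + effective_delay(updelay_ms, miimon_ms)


# A link reports carrier at t=1000 ms with updelay=250 and miimon=100.
print(link_usable_at(1000, 250, 100, bond_has_active_link=True))
print(link_usable_at(1000, 250, 100, bond_has_active_link=False))
```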
@@ -2540,31 +2636,31 @@ Failover may be delayed via the downdelay bonding module option.
13.2 Duplicated Incoming Packets
--------------------------------
- NOTE: Starting with version 3.0.2, the bonding driver has logic to
+NOTE: Starting with version 3.0.2, the bonding driver has logic to
suppress duplicate packets, which should largely eliminate this problem.
The following description is kept for reference.
- It is not uncommon to observe a short burst of duplicated
+It is not uncommon to observe a short burst of duplicated
traffic when the bonding device is first used, or after it has been
idle for some period of time. This is most easily observed by issuing
a "ping" to some other host on the network, and noticing that the
output from ping flags duplicates (typically one per slave).
- For example, on a bond in active-backup mode with five slaves
-all connected to one switch, the output may appear as follows:
-
-# ping -n 10.0.4.2
-PING 10.0.4.2 (10.0.4.2) from 10.0.3.10 : 56(84) bytes of data.
-64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.7 ms
-64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
-64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
-64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
-64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
-64 bytes from 10.0.4.2: icmp_seq=2 ttl=64 time=0.216 ms
-64 bytes from 10.0.4.2: icmp_seq=3 ttl=64 time=0.267 ms
-64 bytes from 10.0.4.2: icmp_seq=4 ttl=64 time=0.222 ms
-
- This is not due to an error in the bonding driver, rather, it
+For example, on a bond in active-backup mode with five slaves
+all connected to one switch, the output may appear as follows::
+
+ # ping -n 10.0.4.2
+ PING 10.0.4.2 (10.0.4.2) from 10.0.3.10 : 56(84) bytes of data.
+ 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.7 ms
+ 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
+ 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
+ 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
+ 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
+ 64 bytes from 10.0.4.2: icmp_seq=2 ttl=64 time=0.216 ms
+ 64 bytes from 10.0.4.2: icmp_seq=3 ttl=64 time=0.267 ms
+ 64 bytes from 10.0.4.2: icmp_seq=4 ttl=64 time=0.222 ms
+
+This is not due to an error in the bonding driver, rather, it
is a side effect of how many switches update their MAC forwarding
tables. Initially, the switch does not associate the MAC address in
the packet with a particular switch port, and so it may send the
@@ -2574,7 +2670,7 @@ single switch, when the switch (temporarily) floods the traffic to all
ports, the bond device receives multiple copies of the same packet
(one per slave device).
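
To check whether the duplicates follow this one-per-slave pattern, the ping output can be tallied per sequence number. The Python sketch below is a minimal example; the function name is hypothetical and the sample text is the output shown above.

```python
import re
from collections import Counter

PING_OUTPUT = """\
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.7 ms
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
64 bytes from 10.0.4.2: icmp_seq=2 ttl=64 time=0.216 ms
64 bytes from 10.0.4.2: icmp_seq=3 ttl=64 time=0.267 ms
64 bytes from 10.0.4.2: icmp_seq=4 ttl=64 time=0.222 ms
"""


def count_duplicates(text: str) -> dict:
    """Map each icmp_seq to the number of replies flagged (DUP!)."""
    dups = Counter()
    for line in text.splitlines():
        match = re.search(r"icmp_seq=(\d+)", line)
        if match and "(DUP!)" in line:
            dups[int(match.group(1))] += 1
    return dict(dups)


print(count_duplicates(PING_OUTPUT))
```

For the five-slave example, the first sequence number carries four duplicates (one original reply plus one extra copy per additional slave), and later replies carry none.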
- The duplicated packet behavior is switch dependent, some
+The duplicated packet behavior is switch dependent, some
switches exhibit this, and some do not. On switches that display this
behavior, it can be induced by clearing the MAC forwarding table (on
most Cisco switches, the privileged command "clear mac address-table
@@ -2583,16 +2679,16 @@ dynamic" will accomplish this).
14. Hardware Specific Considerations
====================================
- This section contains additional information for configuring
+This section contains additional information for configuring
bonding on specific hardware platforms, or for interfacing bonding
with particular switches or other devices.
14.1 IBM BladeCenter
--------------------
- This applies to the JS20 and similar systems.
+This applies to the JS20 and similar systems.
- On the JS20 blades, the bonding driver supports only
+On the JS20 blades, the bonding driver supports only
balance-rr, active-backup, balance-tlb and balance-alb modes. This is
largely due to the network topology inside the BladeCenter, detailed
below.
@@ -2600,7 +2696,7 @@ below.
JS20 network adapter information
--------------------------------
- All JS20s come with two Broadcom Gigabit Ethernet ports
+All JS20s come with two Broadcom Gigabit Ethernet ports
integrated on the planar (that's "motherboard" in IBM-speak). In the
BladeCenter chassis, the eth0 port of all JS20 blades is hard wired to
I/O Module #1; similarly, all eth1 ports are wired to I/O Module #2.
@@ -2608,36 +2704,36 @@ An add-on Broadcom daughter card can be installed on a JS20 to provide
two more Gigabit Ethernet ports. These ports, eth2 and eth3, are
wired to I/O Modules 3 and 4, respectively.
- Each I/O Module may contain either a switch or a passthrough
+Each I/O Module may contain either a switch or a passthrough
module (which allows ports to be directly connected to an external
switch). Some bonding modes require a specific BladeCenter internal
network topology in order to function; these are detailed below.
- Additional BladeCenter-specific networking information can be
+Additional BladeCenter-specific networking information can be
found in two IBM Redbooks (www.ibm.com/redbooks):
-"IBM eServer BladeCenter Networking Options"
-"IBM eServer BladeCenter Layer 2-7 Network Switching"
+- "IBM eServer BladeCenter Networking Options"
+- "IBM eServer BladeCenter Layer 2-7 Network Switching"
BladeCenter networking configuration
------------------------------------
- Because a BladeCenter can be configured in a very large number
+Because a BladeCenter can be configured in a very large number
of ways, this discussion will be confined to describing basic
configurations.
- Normally, Ethernet Switch Modules (ESMs) are used in I/O
+Normally, Ethernet Switch Modules (ESMs) are used in I/O
modules 1 and 2. In this configuration, the eth0 and eth1 ports of a
JS20 will be connected to different internal switches (in the
respective I/O modules).
- A passthrough module (OPM or CPM, optical or copper,
+A passthrough module (OPM or CPM, optical or copper,
passthrough module) connects the I/O module directly to an external
switch. By using PMs in I/O module #1 and #2, the eth0 and eth1
interfaces of a JS20 can be redirected to the outside world and
connected to a common external switch.
- Depending upon the mix of ESMs and PMs, the network will
+Depending upon the mix of ESMs and PMs, the network will
appear to bonding as either a single switch topology (all PMs) or as a
multiple switch topology (one or more ESMs, zero or more PMs). It is
also possible to connect ESMs together, resulting in a configuration
@@ -2647,24 +2743,24 @@ Topology," above.
Requirements for specific modes
-------------------------------
- The balance-rr mode requires the use of passthrough modules
+The balance-rr mode requires the use of passthrough modules
for devices in the bond, all connected to a common external switch.
That switch must be configured for "etherchannel" or "trunking" on the
appropriate ports, as is usual for balance-rr.
- The balance-alb and balance-tlb modes will function with
+The balance-alb and balance-tlb modes will function with
either switch modules or passthrough modules (or a mix). The only
specific requirement for these modes is that all network interfaces
must be able to reach all destinations for traffic sent over the
bonding device (i.e., the network must converge at some point outside
the BladeCenter).
- The active-backup mode has no additional requirements.
+The active-backup mode has no additional requirements.
Link monitoring issues
----------------------
- When an Ethernet Switch Module is in place, only the ARP
+When an Ethernet Switch Module is in place, only the ARP
monitor will reliably detect link loss to an external switch. This is
nothing unusual, but examination of the BladeCenter cabinet would
suggest that the "external" network ports are the ethernet ports for
@@ -2672,166 +2768,155 @@ the system, when it fact there is a switch between these "external"
ports and the devices on the JS20 system itself. The MII monitor is
only able to detect link failures between the ESM and the JS20 system.
- When a passthrough module is in place, the MII monitor does
+When a passthrough module is in place, the MII monitor does
detect failures to the "external" port, which is then directly
connected to the JS20 system.
Other concerns
--------------
- The Serial Over LAN (SoL) link is established over the primary
+The Serial Over LAN (SoL) link is established over the primary
ethernet (eth0) only, therefore, any loss of link to eth0 will result
in losing your SoL connection. It will not fail over with other
network traffic, as the SoL system is beyond the control of the
bonding driver.
- It may be desirable to disable spanning tree on the switch
+It may be desirable to disable spanning tree on the switch
(either the internal Ethernet Switch Module, or an external switch) to
avoid fail-over delay issues when using bonding.
-
+
15. Frequently Asked Questions
==============================
1. Is it SMP safe?
+-------------------
- Yes. The old 2.0.xx channel bonding patch was not SMP safe.
+Yes. The old 2.0.xx channel bonding patch was not SMP safe.
The new driver was designed to be SMP safe from the start.
2. What type of cards will work with it?
+-----------------------------------------
- Any Ethernet type cards (you can even mix cards - a Intel
+Any Ethernet type cards (you can even mix cards - an Intel
EtherExpress PRO/100 and a 3com 3c905b, for example). For most modes,
devices need not be of the same speed.
- Starting with version 3.2.1, bonding also supports Infiniband
+Starting with version 3.2.1, bonding also supports Infiniband
slaves in active-backup mode.
3. How many bonding devices can I have?
+----------------------------------------
- There is no limit.
+There is no limit.
4. How many slaves can a bonding device have?
+----------------------------------------------
- This is limited only by the number of network interfaces Linux
+This is limited only by the number of network interfaces Linux
supports and/or the number of network cards you can place in your
system.
5. What happens when a slave link dies?
+----------------------------------------
- If link monitoring is enabled, then the failing device will be
+If link monitoring is enabled, then the failing device will be
disabled. The active-backup mode will fail over to a backup link, and
other modes will ignore the failed link. The link will continue to be
monitored, and should it recover, it will rejoin the bond (in whatever
manner is appropriate for the mode). See the sections on High
Availability and the documentation for each mode for additional
information.
-
- Link monitoring can be enabled via either the miimon or
+
+Link monitoring can be enabled via either the miimon or
arp_interval parameters (described in the module parameters section,
above). In general, miimon monitors the carrier state as sensed by
the underlying network device, and the arp monitor (arp_interval)
monitors connectivity to another host on the local network.
- If no link monitoring is configured, the bonding driver will
+If no link monitoring is configured, the bonding driver will
be unable to detect link failures, and will assume that all links are
always available. This will likely result in lost packets, and a
resulting degradation of performance. The precise performance loss
depends upon the bonding mode and network configuration.
6. Can bonding be used for High Availability?
+----------------------------------------------
- Yes. See the section on High Availability for details.
+Yes. See the section on High Availability for details.
7. Which switches/systems does it work with?
+---------------------------------------------
- The full answer to this depends upon the desired mode.
+The full answer to this depends upon the desired mode.
- In the basic balance modes (balance-rr and balance-xor), it
+In the basic balance modes (balance-rr and balance-xor), it
works with any system that supports etherchannel (also called
trunking). Most managed switches currently available have such
support, and many unmanaged switches as well.
- The advanced balance modes (balance-tlb and balance-alb) do
+The advanced balance modes (balance-tlb and balance-alb) do
not have special switch requirements, but do need device drivers that
support specific features (described in the appropriate section under
module parameters, above).
- In 802.3ad mode, it works with systems that support IEEE
+In 802.3ad mode, it works with systems that support IEEE
802.3ad Dynamic Link Aggregation. Most managed and many unmanaged
switches currently available support 802.3ad.
- The active-backup mode should work with any Layer-II switch.
+The active-backup mode should work with any Layer-II switch.
8. Where does a bonding device get its MAC address from?
+---------------------------------------------------------
- When using slave devices that have fixed MAC addresses, or when
+When using slave devices that have fixed MAC addresses, or when
the fail_over_mac option is enabled, the bonding device's MAC address is
the MAC address of the active slave.
- For other configurations, if not explicitly configured (with
+For other configurations, if not explicitly configured (with
ifconfig or ip link), the MAC address of the bonding device is taken from
its first slave device. This MAC address is then passed to all following
slaves and remains persistent (even if the first slave is removed) until
the bonding device is brought down or reconfigured.
- If you wish to change the MAC address, you can set it with
-ifconfig or ip link:
+If you wish to change the MAC address, you can set it with
+ifconfig or ip link::
-# ifconfig bond0 hw ether 00:11:22:33:44:55
+ # ifconfig bond0 hw ether 00:11:22:33:44:55
-# ip link set bond0 address 66:77:88:99:aa:bb
+ # ip link set bond0 address 66:77:88:99:aa:bb
- The MAC address can be also changed by bringing down/up the
-device and then changing its slaves (or their order):
+The MAC address can also be changed by bringing down/up the
+device and then changing its slaves (or their order)::
-# ifconfig bond0 down ; modprobe -r bonding
-# ifconfig bond0 .... up
-# ifenslave bond0 eth...
+ # ifconfig bond0 down ; modprobe -r bonding
+ # ifconfig bond0 .... up
+ # ifenslave bond0 eth...
- This method will automatically take the address from the next
+This method will automatically take the address from the next
slave that is added.
- To restore your slaves' MAC addresses, you need to detach them
-from the bond (`ifenslave -d bond0 eth0'). The bonding driver will
+To restore your slaves' MAC addresses, you need to detach them
+from the bond (``ifenslave -d bond0 eth0``). The bonding driver will
then restore the MAC addresses that the slaves had before they were
enslaved.
16. Resources and Links
=======================
- The latest version of the bonding driver can be found in the latest
+The latest version of the bonding driver can be found in the latest
version of the linux kernel, found on http://kernel.org
- The latest version of this document can be found in the latest kernel
-source (named Documentation/networking/bonding.txt).
-
- Discussions regarding the usage of the bonding driver take place on the
-bonding-devel mailing list, hosted at sourceforge.net. If you have questions or
-problems, post them to the list. The list address is:
-
-bonding-devel@lists.sourceforge.net
-
- The administrative interface (to subscribe or unsubscribe) can
-be found at:
+The latest version of this document can be found in the latest kernel
+source (named Documentation/networking/bonding.rst).
-https://lists.sourceforge.net/lists/listinfo/bonding-devel
-
- Discussions regarding the development of the bonding driver take place
+Discussions regarding the development of the bonding driver take place
on the main Linux network mailing list, hosted at vger.kernel.org. The list
address is:
netdev@vger.kernel.org
- The administrative interface (to subscribe or unsubscribe) can
+The administrative interface (to subscribe or unsubscribe) can
be found at:
http://vger.kernel.org/vger-lists.html#netdev
-
-Donald Becker's Ethernet Drivers and diag programs may be found at :
- - http://web.archive.org/web/*/http://www.scyld.com/network/
-
-You will also find a lot of information regarding Ethernet, NWay, MII,
-etc. at www.scyld.com.
-
--- END --
diff --git a/Documentation/networking/caif/caif.rst b/Documentation/networking/caif/caif.rst
index 07afc8063d4d..d922d419c513 100644
--- a/Documentation/networking/caif/caif.rst
+++ b/Documentation/networking/caif/caif.rst
@@ -1,5 +1,3 @@
-:orphan:
-
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
@@ -70,11 +68,10 @@ There are debugfs parameters provided for serial communication.
* tty_status: Prints the bit-mask tty status information
- 0x01 - tty->warned is on.
- - 0x02 - tty->low_latency is on.
- 0x04 - tty->packed is on.
- - 0x08 - tty->flow_stopped is on.
+ - 0x08 - tty->flow.tco_stopped is on.
- 0x10 - tty->hw_stopped is on.
- - 0x20 - tty->stopped is on.
+ - 0x20 - tty->flow.stopped is on.
* last_tx_msg: Binary blob Prints the last transmitted frame.
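
A tty_status value can be decoded against the bit list above with a small helper. The Python sketch below is an illustration only: the function name is hypothetical, the table covers just the bits that remain listed after this change, and how the value is read from debugfs is not shown.

```python
# Bit meanings as listed for tty_status after this change.
TTY_STATUS_BITS = {
    0x01: "tty->warned",
    0x04: "tty->packed",
    0x08: "tty->flow.tco_stopped",
    0x10: "tty->hw_stopped",
    0x20: "tty->flow.stopped",
}


def decode_tty_status(mask: int) -> list:
    """Return the names of all flags set in the bit-mask, lowest bit first."""
    return [name for bit, name in sorted(TTY_STATUS_BITS.items()) if mask & bit]


print(decode_tty_status(0x18))
```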
diff --git a/Documentation/networking/caif/index.rst b/Documentation/networking/caif/index.rst
new file mode 100644
index 000000000000..ec29b6f4bdb4
--- /dev/null
+++ b/Documentation/networking/caif/index.rst
@@ -0,0 +1,12 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+CAIF
+====
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ linux_caif
+ caif
diff --git a/Documentation/networking/caif/Linux-CAIF.txt b/Documentation/networking/caif/linux_caif.rst
index 0aa4bd381bec..a0480862ab8c 100644
--- a/Documentation/networking/caif/Linux-CAIF.txt
+++ b/Documentation/networking/caif/linux_caif.rst
@@ -1,12 +1,19 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+==========
Linux CAIF
-===========
-copyright (C) ST-Ericsson AB 2010
-Author: Sjur Brendeland/ sjur.brandeland@stericsson.com
-License terms: GNU General Public License (GPL) version 2
+==========
+
+Copyright |copy| ST-Ericsson AB 2010
+
+:Author: Sjur Brendeland/ sjur.brandeland@stericsson.com
+:License terms: GNU General Public License (GPL) version 2
Introduction
-------------
+============
+
CAIF is a MUX protocol used by ST-Ericsson cellular modems for
communication between Modem and host. The host processes can open virtual AT
channels, initiate GPRS Data connections, Video channels and Utility Channels.
@@ -16,13 +23,16 @@ ST-Ericsson modems support a number of transports between modem
and host. Currently, UART and Loopback are available for Linux.
-Architecture:
-------------
+Architecture
+============
+
The implementation of CAIF is divided into:
+
* CAIF Socket Layer and GPRS IP Interface.
* CAIF Core Protocol Implementation
* CAIF Link Layer, implemented as NET devices.
+::
RTNL
!
@@ -46,12 +56,12 @@ The implementation of CAIF is divided into:
-I M P L E M E N T A T I O N
-===========================
+Implementation
+==============
CAIF Core Protocol Layer
-=========================================
+------------------------
CAIF Core layer implements the CAIF protocol as defined by ST-Ericsson.
It implements the CAIF protocol stack in a layered approach, where
@@ -59,8 +69,11 @@ each layer described in the specification is implemented as a separate layer.
The architecture is inspired by the design patterns "Protocol Layer" and
"Protocol Packet".
-== CAIF structure ==
+CAIF structure
+^^^^^^^^^^^^^^
+
The Core CAIF implementation contains:
+
- Simple implementation of CAIF.
- Layered architecture (a la Streams), each layer in the CAIF
specification is implemented in a separate c-file.
@@ -73,7 +86,8 @@ The Core CAIF implementation contains:
to the called function (except for framing layers' receive function)
Layered Architecture
---------------------
+====================
+
The CAIF protocol can be divided into two parts: Support functions and Protocol
Implementation. The support functions include:
@@ -112,7 +126,7 @@ The CAIF Protocol implementation contains:
- CFSERL CAIF Serial layer. Handles concatenation/split of frames
into CAIF Frames with correct length.
-
+::
+---------+
| Config |
@@ -143,18 +157,24 @@ The CAIF Protocol implementation contains:
In this layered approach the following "rules" apply.
+
- All layers embed the same structure "struct cflayer"
- A layer does not depend on any other layer's private data.
- - Layers are stacked by setting the pointers
+ - Layers are stacked by setting the pointers::
+
layer->up , layer->dn
- - In order to send data upwards, each layer should do
+
+ - In order to send data upwards, each layer should do::
+
layer->up->receive(layer->up, packet);
- - In order to send data downwards, each layer should do
+
+ - In order to send data downwards, each layer should do::
+
layer->dn->transmit(layer->dn, packet);
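
The stacking rules above can be illustrated with a small user-space sketch. It is not the kernel's code: `struct sketch_layer`, `struct sketch_pkt` and every `sketch_*` function are hypothetical stand-ins that model only the `up`/`dn` pointers and the `receive`/`transmit` callbacks named in the rules.

```c
#include <stddef.h>

/* Hypothetical packet type: it only counts how many layers it visited. */
struct sketch_pkt {
	int hops_up;
	int hops_dn;
};

/* Hypothetical stand-in for "struct cflayer": every layer embeds the
 * same structure and never looks at another layer's private data. */
struct sketch_layer {
	struct sketch_layer *up;
	struct sketch_layer *dn;
	int (*receive)(struct sketch_layer *layer, struct sketch_pkt *pkt);
	int (*transmit)(struct sketch_layer *layer, struct sketch_pkt *pkt);
};

/* Send data upwards: layer->up->receive(layer->up, packet). */
static int sketch_receive(struct sketch_layer *layer, struct sketch_pkt *pkt)
{
	pkt->hops_up++;
	if (layer->up)
		return layer->up->receive(layer->up, pkt);
	return 0; /* top of the stack reached */
}

/* Send data downwards: layer->dn->transmit(layer->dn, packet). */
static int sketch_transmit(struct sketch_layer *layer, struct sketch_pkt *pkt)
{
	pkt->hops_dn++;
	if (layer->dn)
		return layer->dn->transmit(layer->dn, pkt);
	return 0; /* bottom of the stack reached */
}

/* Layers are stacked by setting the two pointers. */
static void sketch_stack(struct sketch_layer *lower, struct sketch_layer *upper)
{
	lower->up = upper;
	upper->dn = lower;
}

/* Build a three-layer stack, pass one packet up from the bottom and
 * back down from the top, and encode both hop counts in the result. */
static int sketch_demo(void)
{
	struct sketch_layer phy = { NULL, NULL, sketch_receive, sketch_transmit };
	struct sketch_layer mux = phy, app = phy;
	struct sketch_pkt pkt = { 0, 0 };

	sketch_stack(&phy, &mux);
	sketch_stack(&mux, &app);

	phy.receive(&phy, &pkt);
	app.transmit(&app, &pkt);
	return pkt.hops_up * 10 + pkt.hops_dn;
}
```

Because each layer only calls through `up` and `dn`, a layer can be inserted or removed by changing two pointers, without touching its neighbours' code.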
CAIF Socket and IP interface
-===========================
+============================
The IP interface and CAIF socket API are implemented on top of the
CAIF Core protocol. The IP Interface and CAIF socket have an instance of
diff --git a/Documentation/networking/caif/spi_porting.txt b/Documentation/networking/caif/spi_porting.txt
deleted file mode 100644
index 9efd0687dc4c..000000000000
--- a/Documentation/networking/caif/spi_porting.txt
+++ /dev/null
@@ -1,208 +0,0 @@
-- CAIF SPI porting -
-
-- CAIF SPI basics:
-
-Running CAIF over SPI needs some extra setup, owing to the nature of SPI.
-Two extra GPIOs have been added in order to negotiate the transfers
- between the master and the slave. The minimum requirement for running
-CAIF over SPI is a SPI slave chip and two GPIOs (more details below).
-Please note that running as a slave implies that you need to keep up
-with the master clock. An overrun or underrun event is fatal.
-
-- CAIF SPI framework:
-
-To make porting as easy as possible, the CAIF SPI has been divided in
-two parts. The first part (called the interface part) deals with all
-generic functionality such as length framing, SPI frame negotiation
-and SPI frame delivery and transmission. The other part is the CAIF
-SPI slave device part, which is the module that you have to write if
-you want to run SPI CAIF on a new hardware. This part takes care of
-the physical hardware, both with regard to SPI and to GPIOs.
-
-- Implementing a CAIF SPI device:
-
- - Functionality provided by the CAIF SPI slave device:
-
- In order to implement a SPI device you will, as a minimum,
- need to implement the following
- functions:
-
- int (*init_xfer) (struct cfspi_xfer * xfer, struct cfspi_dev *dev):
-
- This function is called by the CAIF SPI interface to give
- you a chance to set up your hardware to be ready to receive
- a stream of data from the master. The xfer structure contains
- both physical and logical addresses, as well as the total length
- of the transfer in both directions.The dev parameter can be used
- to map to different CAIF SPI slave devices.
-
- void (*sig_xfer) (bool xfer, struct cfspi_dev *dev):
-
- This function is called by the CAIF SPI interface when the output
- (SPI_INT) GPIO needs to change state. The boolean value of the xfer
- variable indicates whether the GPIO should be asserted (HIGH) or
- deasserted (LOW). The dev parameter can be used to map to different CAIF
- SPI slave devices.
-
- - Functionality provided by the CAIF SPI interface:
-
- void (*ss_cb) (bool assert, struct cfspi_ifc *ifc);
-
- This function is called by the CAIF SPI slave device in order to
- signal a change of state of the input GPIO (SS) to the interface.
- Only active edges are mandatory to be reported.
- This function can be called from IRQ context (recommended in order
- not to introduce latency). The ifc parameter should be the pointer
- returned from the platform probe function in the SPI device structure.
-
- void (*xfer_done_cb) (struct cfspi_ifc *ifc);
-
- This function is called by the CAIF SPI slave device in order to
- report that a transfer is completed. This function should only be
- called once both the transmission and the reception are completed.
- This function can be called from IRQ context (recommended in order
- not to introduce latency). The ifc parameter should be the pointer
- returned from the platform probe function in the SPI device structure.
-
- - Connecting the bits and pieces:
-
- - Filling in the SPI slave device structure:
-
- Connect the necessary callback functions.
- Indicate clock speed (used to calculate toggle delays).
- Chose a suitable name (helps debugging if you use several CAIF
- SPI slave devices).
- Assign your private data (can be used to map to your structure).
-
- - Filling in the SPI slave platform device structure:
- Add name of driver to connect to ("cfspi_sspi").
- Assign the SPI slave device structure as platform data.
-
-- Padding:
-
-In order to optimize throughput, a number of SPI padding options are provided.
-Padding can be enabled independently for uplink and downlink transfers.
-Padding can be enabled for the head, the tail and for the total frame size.
-The padding needs to be correctly configured on both sides of the link.
-The padding can be changed via module parameters in cfspi_sspi.c or via
-the sysfs directory of the cfspi_sspi driver (before device registration).
-
-- CAIF SPI device template:
-
-/*
- * Copyright (C) ST-Ericsson AB 2010
- * Author: Daniel Martensson / Daniel.Martensson@stericsson.com
- * License terms: GNU General Public License (GPL), version 2.
- *
- */
-
-#include <linux/init.h>
-#include <linux/module.h>
-#include <linux/device.h>
-#include <linux/wait.h>
-#include <linux/interrupt.h>
-#include <linux/dma-mapping.h>
-#include <net/caif/caif_spi.h>
-
-MODULE_LICENSE("GPL");
-
-struct sspi_struct {
- struct cfspi_dev sdev;
- struct cfspi_xfer *xfer;
-};
-
-static struct sspi_struct slave;
-static struct platform_device slave_device;
-
-static irqreturn_t sspi_irq(int irq, void *arg)
-{
- /* You only need to trigger on an edge to the active state of the
- * SS signal. Once a edge is detected, the ss_cb() function should be
- * called with the parameter assert set to true. It is OK
- * (and even advised) to call the ss_cb() function in IRQ context in
- * order not to add any delay. */
-
- return IRQ_HANDLED;
-}
-
-static void sspi_complete(void *context)
-{
- /* Normally the DMA or the SPI framework will call you back
- * in something similar to this. The only thing you need to
- * do is to call the xfer_done_cb() function, providing the pointer
- * to the CAIF SPI interface. It is OK to call this function
- * from IRQ context. */
-}
-
-static int sspi_init_xfer(struct cfspi_xfer *xfer, struct cfspi_dev *dev)
-{
- /* Store transfer info. For a normal implementation you should
- * set up your DMA here and make sure that you are ready to
- * receive the data from the master SPI. */
-
- struct sspi_struct *sspi = (struct sspi_struct *)dev->priv;
-
- sspi->xfer = xfer;
-
- return 0;
-}
-
-void sspi_sig_xfer(bool xfer, struct cfspi_dev *dev)
-{
- /* If xfer is true then you should assert the SPI_INT to indicate to
- * the master that you are ready to receive the data from the master
- * SPI. If xfer is false then you should de-assert SPI_INT to indicate
- * that the transfer is done.
- */
-
- struct sspi_struct *sspi = (struct sspi_struct *)dev->priv;
-}
-
-static void sspi_release(struct device *dev)
-{
- /*
- * Here you should release your SPI device resources.
- */
-}
-
-static int __init sspi_init(void)
-{
- /* Here you should initialize your SPI device by providing the
- * necessary functions, clock speed, name and private data. Once
- * done, you can register your device with the
- * platform_device_register() function. This function will return
- * with the CAIF SPI interface initialized. This is probably also
- * the place where you should set up your GPIOs, interrupts and SPI
- * resources. */
-
- int res = 0;
-
- /* Initialize slave device. */
- slave.sdev.init_xfer = sspi_init_xfer;
- slave.sdev.sig_xfer = sspi_sig_xfer;
- slave.sdev.clk_mhz = 13;
- slave.sdev.priv = &slave;
- slave.sdev.name = "spi_sspi";
- slave_device.dev.release = sspi_release;
-
- /* Initialize platform device. */
- slave_device.name = "cfspi_sspi";
- slave_device.dev.platform_data = &slave.sdev;
-
- /* Register platform device. */
- res = platform_device_register(&slave_device);
- if (res) {
- printk(KERN_WARNING "sspi_init: failed to register dev.\n");
- return -ENODEV;
- }
-
- return res;
-}
-
-static void __exit sspi_exit(void)
-{
- platform_device_del(&slave_device);
-}
-
-module_init(sspi_init);
-module_exit(sspi_exit);
diff --git a/Documentation/networking/can.rst b/Documentation/networking/can.rst
index 2fd0b51a8c52..ebc822e605f5 100644
--- a/Documentation/networking/can.rst
+++ b/Documentation/networking/can.rst
@@ -168,7 +168,7 @@ reflect the correct [#f1]_ traffic on the node the loopback of the sent
data has to be performed right after a successful transmission. If
the CAN network interface is not capable of performing the loopback for
some reason the SocketCAN core can do this task as a fallback solution.
-See :ref:`socketcan-local-loopback1` for details (recommended).
+See :ref:`socketcan-local-loopback2` for details (recommended).
The loopback functionality is enabled by default to reflect standard
networking behaviour for CAN applications. Due to some requests from
@@ -228,20 +228,36 @@ send(2), sendto(2), sendmsg(2) and the recv* counterpart operations
on the socket as usual. There are also CAN specific socket options
described below.
-The basic CAN frame structure and the sockaddr structure are defined
-in include/linux/can.h:
+The Classical CAN frame structure (aka CAN 2.0B), the CAN FD frame structure
+and the sockaddr structure are defined in include/linux/can.h:
.. code-block:: C
struct can_frame {
canid_t can_id; /* 32 bit CAN_ID + EFF/RTR/ERR flags */
- __u8 can_dlc; /* frame payload length in byte (0 .. 8) */
+ union {
+ /* CAN frame payload length in byte (0 .. CAN_MAX_DLEN)
+ * was previously named can_dlc so we need to carry that
+ * name for legacy support
+ */
+ __u8 len;
+ __u8 can_dlc; /* deprecated */
+ };
__u8 __pad; /* padding */
__u8 __res0; /* reserved / padding */
- __u8 __res1; /* reserved / padding */
+ __u8 len8_dlc; /* optional DLC for 8 byte payload length (9 .. 15) */
__u8 data[8] __attribute__((aligned(8)));
};
+Remark: The len element contains the payload length in bytes and should be
+used instead of can_dlc. The deprecated can_dlc was misleadingly named as
+it always contained the plain payload length in bytes and not the so called
+'data length code' (DLC).
+
+To pass the raw DLC from/to a Classical CAN network device the len8_dlc
+element can contain values 9 .. 15 when the len element is 8 (the real
+payload length for all DLC values greater than or equal to 8).
+
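
The relation between len and len8_dlc described in the remark above can be sketched as follows. This is an illustration only: `struct sketch_can_frame` models just the two relevant fields, and the `sketch_*` helpers are hypothetical names, not the helpers provided by the kernel.

```c
#include <stdint.h>

#define SKETCH_CAN_MAX_DLEN    8   /* Classical CAN payload limit in bytes */
#define SKETCH_CAN_MAX_RAW_DLC 15  /* largest raw DLC value on the bus */

/* Hypothetical reduced frame: only the fields discussed above. */
struct sketch_can_frame {
	uint8_t len;      /* payload length in bytes, 0 .. 8 */
	uint8_t len8_dlc; /* raw DLC 9 .. 15, meaningful only when len == 8 */
};

/* Store a raw DLC (0 .. 15): the payload length is capped at 8 and
 * values above 8 are remembered in len8_dlc. */
static void sketch_set_raw_dlc(struct sketch_can_frame *cf, uint8_t dlc)
{
	if (dlc > SKETCH_CAN_MAX_RAW_DLC)
		dlc = SKETCH_CAN_MAX_RAW_DLC;
	cf->len = dlc > SKETCH_CAN_MAX_DLEN ? SKETCH_CAN_MAX_DLEN : dlc;
	cf->len8_dlc = dlc > SKETCH_CAN_MAX_DLEN ? dlc : 0;
}

/* Recover the raw DLC: len8_dlc is honoured only for 8 byte payloads
 * and only if it lies in the range 9 .. 15. */
static uint8_t sketch_get_raw_dlc(const struct sketch_can_frame *cf)
{
	if (cf->len == SKETCH_CAN_MAX_DLEN &&
	    cf->len8_dlc > SKETCH_CAN_MAX_DLEN &&
	    cf->len8_dlc <= SKETCH_CAN_MAX_RAW_DLC)
		return cf->len8_dlc;
	return cf->len;
}
```

Applications that do not care about the raw DLC can ignore len8_dlc and use len alone.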
The alignment of the (linear) payload data[] to a 64bit boundary
allows the user to define their own structs and unions to easily access
the CAN payload. There is no given byteorder on the CAN bus by
@@ -260,6 +276,23 @@ PF_PACKET socket, that also binds to a specific interface:
/* transport protocol class address info (e.g. ISOTP) */
struct { canid_t rx_id, tx_id; } tp;
+ /* J1939 address information */
+ struct {
+ /* 8 byte name when using dynamic addressing */
+ __u64 name;
+
+ /* pgn:
+ * 8 bit: PS in PDU2 case, else 0
+ * 8 bit: PF
+ * 1 bit: DP
+ * 1 bit: reserved
+ */
+ __u32 pgn;
+
+ /* 1 byte address */
+ __u8 addr;
+ } j1939;
+
/* reserved for future CAN protocols address information */
} can_addr;
};
@@ -371,7 +404,7 @@ kernel interfaces (ABI) which heavily rely on the CAN frame with fixed eight
bytes of payload (struct can_frame) like the CAN_RAW socket. Therefore e.g.
the CAN_RAW socket supports a new socket option CAN_RAW_FD_FRAMES that
switches the socket into a mode that allows the handling of CAN FD frames
-and (legacy) CAN frames simultaneously (see :ref:`socketcan-rawfd`).
+and Classical CAN frames simultaneously (see :ref:`socketcan-rawfd`).
The struct canfd_frame is defined in include/linux/can.h:
@@ -397,7 +430,7 @@ code (DLC) of the struct can_frame was used as a length information as the
length and the DLC has a 1:1 mapping in the range of 0 .. 8. To preserve
the easy handling of the length information the canfd_frame.len element
contains a plain length value from 0 .. 64. So both canfd_frame.len and
-can_frame.can_dlc are equal and contain a length information and no DLC.
+can_frame.len are equal and contain a length information and no DLC.
For details about the distinction of CAN and CAN FD capable devices and
the mapping to the bus-relevant data length code (DLC), see :ref:`socketcan-can-fd-driver`.
@@ -407,7 +440,7 @@ definitions are specified for CAN specific MTUs in include/linux/can.h:
.. code-block:: C
- #define CAN_MTU (sizeof(struct can_frame)) == 16 => 'legacy' CAN frame
+ #define CAN_MTU (sizeof(struct can_frame)) == 16 => Classical CAN frame
#define CANFD_MTU (sizeof(struct canfd_frame)) == 72 => CAN FD frame
@@ -575,6 +608,8 @@ demand:
setsockopt(s, SOL_CAN_RAW, CAN_RAW_RECV_OWN_MSGS,
&recv_own_msgs, sizeof(recv_own_msgs));
+Note that reception of a socket's own CAN frames is subject to the same
+filtering as other CAN frames (see :ref:`socketcan-rawfilter`).
.. _socketcan-rawfd:
@@ -609,7 +644,7 @@ Example:
printf("got CAN FD frame with length %d\n", cfd.len);
/* cfd.flags contains valid data */
} else if (nbytes == CAN_MTU) {
- printf("got legacy CAN frame with length %d\n", cfd.len);
+ printf("got Classical CAN frame with length %d\n", cfd.len);
/* cfd.flags is undefined */
} else {
fprintf(stderr, "read: invalid CAN(FD) frame\n");
@@ -623,7 +658,7 @@ Example:
printf("%02X ", cfd.data[i]);
When reading with size CANFD_MTU only returns CAN_MTU bytes that have
-been received from the socket a legacy CAN frame has been read into the
+been received from the socket a Classical CAN frame has been read into the
provided CAN FD structure. Note that the canfd_frame.flags data field is
not specified in the struct can_frame and therefore it is only valid in
CANFD_MTU sized CAN FD frames.
@@ -633,7 +668,7 @@ Implementation hint for new CAN applications:
To build a CAN FD aware application use struct canfd_frame as basic CAN
data structure for CAN_RAW based applications. When the application is
executed on an older Linux kernel and switching the CAN_RAW_FD_FRAMES
-socket option returns an error: No problem. You'll get legacy CAN frames
+socket option returns an error: No problem. You'll get Classical CAN frames
or CAN FD frames and can process them the same way.
When sending to CAN devices make sure that the device is capable to handle
@@ -842,6 +877,8 @@ TX_RESET_MULTI_IDX:
RX_RTR_FRAME:
Send reply for RTR-request (placed in op->frames[0]).
+CAN_FD_FRAME:
+ The CAN frames following the bcm_msg_head are struct canfd_frame structures.
Broadcast Manager Transmission Timers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -1026,7 +1063,7 @@ Additional procfs files in /proc/net/can::
stats - SocketCAN core statistics (rx/tx frames, match ratios, ...)
reset_stats - manual statistic reset
- version - prints the SocketCAN core version and the ABI version
+ version - prints SocketCAN core and ABI version (removed in Linux 5.10)
Writing Own CAN Protocol Modules
@@ -1058,7 +1095,7 @@ drivers you mainly have to deal with:
- TX: Put the CAN frame from the socket buffer to the CAN controller.
- RX: Put the CAN frame from the CAN controller to the socket buffer.
-See e.g. at Documentation/networking/netdevices.txt . The differences
+See e.g. at Documentation/networking/netdevices.rst . The differences
for writing CAN network device driver are described below:
@@ -1070,7 +1107,7 @@ General Settings
dev->type = ARPHRD_CAN; /* the netdevice hardware type */
dev->flags = IFF_NOARP; /* CAN has no arp */
- dev->mtu = CAN_MTU; /* sizeof(struct can_frame) -> legacy CAN interface */
+ dev->mtu = CAN_MTU; /* sizeof(struct can_frame) -> Classical CAN interface */
or alternative, when the controller supports CAN with flexible data rate:
dev->mtu = CANFD_MTU; /* sizeof(struct canfd_frame) -> CAN FD interface */
@@ -1184,6 +1221,7 @@ Setting CAN device properties::
[ fd { on | off } ]
[ fd-non-iso { on | off } ]
[ presume-ack { on | off } ]
+ [ cc-len8-dlc { on | off } ]
[ restart-ms TIME-MS ]
[ restart ]
@@ -1326,22 +1364,22 @@ arbitration phase and the payload phase of the CAN FD frame. Therefore a
second bit timing has to be specified in order to enable the CAN FD bitrate.
Additionally CAN FD capable CAN controllers support up to 64 bytes of
-payload. The representation of this length in can_frame.can_dlc and
+payload. The representation of this length in can_frame.len and
canfd_frame.len for userspace applications and inside the Linux network
layer is a plain value from 0 .. 64 instead of the CAN 'data length code'.
-The data length code was a 1:1 mapping to the payload length in the legacy
+The data length code was a 1:1 mapping to the payload length in the Classical
CAN frames anyway. The payload length to the bus-relevant DLC mapping is
only performed inside the CAN drivers, preferably with the helper
-functions can_dlc2len() and can_len2dlc().
+functions can_fd_dlc2len() and can_fd_len2dlc().
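
The mapping between the plain payload length and the bus-relevant DLC can be sketched with a lookup table. The functions below are hypothetical user-space stand-ins (note the `sketch_` prefix), not the kernel helpers named above; the table values are the standard CAN FD DLC steps.

```c
#include <stdint.h>

/* CAN FD payload length for each of the 16 DLC values. */
static const uint8_t sketch_dlc2len_tab[16] = {
	0, 1, 2, 3, 4, 5, 6, 7, 8, 12, 16, 20, 24, 32, 48, 64
};

/* DLC -> payload length in bytes; only the low four bits are used. */
static uint8_t sketch_fd_dlc2len(uint8_t dlc)
{
	return sketch_dlc2len_tab[dlc & 0x0F];
}

/* Payload length -> smallest DLC whose length can hold 'len' bytes.
 * Lengths above 64 are clamped to the maximum DLC of 15. */
static uint8_t sketch_fd_len2dlc(uint8_t len)
{
	uint8_t dlc;

	if (len > 64)
		return 15;
	for (dlc = 0; dlc < 15; dlc++)
		if (sketch_dlc2len_tab[dlc] >= len)
			break;
	return dlc;
}
```

In the range 0 .. 8 the two values are identical, which is why Classical CAN code could treat the DLC as a length; above 8 a length such as 13 has to be rounded up to the next available step (DLC 10, 16 bytes).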
The CAN netdevice driver capabilities can be distinguished by the network
devices maximum transfer unit (MTU)::
- MTU = 16 (CAN_MTU) => sizeof(struct can_frame) => 'legacy' CAN device
+ MTU = 16 (CAN_MTU) => sizeof(struct can_frame) => Classical CAN device
MTU = 72 (CANFD_MTU) => sizeof(struct canfd_frame) => CAN FD capable device
The CAN device MTU can be retrieved e.g. with a SIOCGIFMTU ioctl() syscall.
-N.B. CAN FD capable devices can also handle and send legacy CAN frames.
+N.B. CAN FD capable devices can also handle and send Classical CAN frames.
When configuring CAN FD capable CAN controllers an additional 'data' bitrate
has to be set. This bitrate for the data phase of the CAN FD frame has to be
diff --git a/Documentation/networking/can_ucan_protocol.rst b/Documentation/networking/can_ucan_protocol.rst
index 4cef88d24fc7..638ac1ee7914 100644
--- a/Documentation/networking/can_ucan_protocol.rst
+++ b/Documentation/networking/can_ucan_protocol.rst
@@ -144,7 +144,7 @@ UCAN_COMMAND_SET_BITTIMING
*Host2Dev; mandatory*
-Setup bittiming by sending the the structure
+Setup bittiming by sending the structure
``ucan_ctl_payload_t.cmd_set_bittiming`` (see ``struct bittiming`` for
details)
@@ -232,7 +232,7 @@ UCAN_IN_TX_COMPLETE
zero
The CAN device has sent a message to the CAN bus. It answers with a
-list of of tuples <echo-ids, flags>.
+list of tuples <echo-ids, flags>.
The echo-id identifies the frame from (echos the id from a previous
UCAN_OUT_TX message). The flag indicates the result of the
diff --git a/Documentation/networking/cdc_mbim.txt b/Documentation/networking/cdc_mbim.rst
index 4e68f0bc5dba..0048409c06b4 100644
--- a/Documentation/networking/cdc_mbim.txt
+++ b/Documentation/networking/cdc_mbim.rst
@@ -1,5 +1,8 @@
- cdc_mbim - Driver for CDC MBIM Mobile Broadband modems
- ========================================================
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================================
+cdc_mbim - Driver for CDC MBIM Mobile Broadband modems
+======================================================
The cdc_mbim driver supports USB devices conforming to the "Universal
Serial Bus Communications Class Subclass Specification for Mobile
@@ -19,9 +22,9 @@ by a cdc_ncm driver parameter:
prefer_mbim
-----------
-Type: Boolean
-Valid Range: N/Y (0-1)
-Default Value: Y (MBIM is preferred)
+:Type: Boolean
+:Valid Range: N/Y (0-1)
+:Default Value: Y (MBIM is preferred)
This parameter sets the system policy for NCM/MBIM functions. Such
functions will be handled by either the cdc_ncm driver or the cdc_mbim
@@ -44,11 +47,13 @@ userspace MBIM management application always is required to enable a
MBIM function.
Such userspace applications include, but are not limited to:
+
- mbimcli (included with the libmbim [3] library), and
- ModemManager [4]
Establishing a MBIM IP session requires at least these actions by the
management application:
+
- open the control channel
- configure network connection settings
- connect to network
@@ -76,7 +81,7 @@ complies with all the control channel requirements in [1].
The cdc-wdmX device is created as a child of the MBIM control
interface USB device. The character device associated with a specific
-MBIM function can be looked up using sysfs. For example:
+MBIM function can be looked up using sysfs. For example::
bjorn@nemi:~$ ls /sys/bus/usb/drivers/cdc_mbim/2-4:2.12/usbmisc
cdc-wdm0
@@ -119,13 +124,15 @@ negotiated control message size.
/dev/cdc-wdmX ioctl()
---------------------
+---------------------
IOCTL_WDM_MAX_COMMAND: Get Maximum Command Size
This ioctl returns the wMaxControlMessage field of the CDC MBIM
functional descriptor for MBIM devices. This is intended as a
convenience, eliminating the need to parse the USB descriptors from
userspace.
+::
+
#include <stdio.h>
#include <fcntl.h>
#include <sys/ioctl.h>
@@ -178,7 +185,7 @@ VLAN links prior to establishing MBIM IP sessions where the SessionId
is greater than 0. These links can be added by using the normal VLAN
kernel interfaces, either ioctl or netlink.
-For example, adding a link for a MBIM IP session with SessionId 3:
+For example, adding a link for a MBIM IP session with SessionId 3::
ip link add link wwan0 name wwan0.3 type vlan id 3
@@ -207,6 +214,7 @@ the stream to the end user in an appropriate way for the stream type.
The network device ABI requires a dummy ethernet header for every DSS
data frame being transported. The contents of this header is
arbitrary, with the following exceptions:
+
- TX frames using an IP protocol (0x0800 or 0x86dd) will be dropped
- RX frames will have the protocol field set to ETH_P_802_3 (but will
not be properly formatted 802.3 frames)
@@ -218,7 +226,7 @@ adding the dummy ethernet header on TX and stripping it on RX.
This is a simple example using tools commonly available, exporting
DssSessionId 5 as a pty character device pointed to by a /dev/nmea
-symlink:
+symlink::
ip link add link wwan0 name wwan0.dss5 type vlan id 261
ip link set dev wwan0.dss5 up
@@ -236,7 +244,7 @@ map frames to the correct DSS session and adding 18 byte VLAN ethernet
headers with the appropriate tag on TX. In this case using a socket
filter is recommended, matching only the DSS VLAN subset. This avoids
unnecessary copying of unrelated IP session data to userspace. For
-example:
+example::
static struct sock_filter dssfilter[] = {
/* use special negative offsets to get VLAN tag */
@@ -249,11 +257,11 @@ example:
BPF_JUMP(BPF_JMP|BPF_JGE|BPF_K, 512, 3, 0), /* 511 is last DSS VLAN */
/* verify ethertype */
- BPF_STMT(BPF_LD|BPF_H|BPF_ABS, 2 * ETH_ALEN),
- BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, ETH_P_802_3, 0, 1),
+ BPF_STMT(BPF_LD|BPF_H|BPF_ABS, 2 * ETH_ALEN),
+ BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, ETH_P_802_3, 0, 1),
- BPF_STMT(BPF_RET|BPF_K, (u_int)-1), /* accept */
- BPF_STMT(BPF_RET|BPF_K, 0), /* ignore */
+ BPF_STMT(BPF_RET|BPF_K, (u_int)-1), /* accept */
+ BPF_STMT(BPF_RET|BPF_K, 0), /* ignore */
};
@@ -266,6 +274,7 @@ network device.
This mapping implies a few restrictions on multiplexed IPS and DSS
sessions, which may not always be practical:
+
- no IPS or DSS session can use a frame size greater than the MTU on
IP session 0
- no IPS or DSS session can be in the up state unless the network
@@ -280,7 +289,7 @@ device.
Tip: It might be less confusing to the end user to name this VLAN
subdevice after the MBIM SessionID instead of the VLAN ID. For
-example:
+example::
ip link add link wwan0 name wwan0.0 type vlan id 4094
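
The VLAN IDs used in the examples in this file (3 for IP session 3, 261 for DssSessionId 5, 4094 for IP session 0) follow a simple rule, which the following sketch makes explicit. The function and enum names are hypothetical. The offset of 256 for DSS sessions and the 0 .. 255 session range are inferred from the examples and from the "511 is last DSS VLAN" comment in the socket filter, so treat them as assumptions.

```c
/* Hypothetical session types for this sketch. */
enum sketch_mbim_type {
	SKETCH_MBIM_IPS, /* IP stream session */
	SKETCH_MBIM_DSS  /* Device Service Stream session */
};

/* Return the VLAN ID to use on the wwanY device for a session,
 * or -1 if the session ID is outside the assumed 0 .. 255 range.
 *
 * IP session 0 is normally carried untagged; 4094 is the optional
 * tagged alias shown in the tip above. */
static int sketch_mbim_vlan_id(enum sketch_mbim_type type,
			       unsigned int session_id)
{
	if (session_id > 255)
		return -1;
	if (type == SKETCH_MBIM_DSS)
		return 256 + (int)session_id;
	return session_id == 0 ? 4094 : (int)session_id;
}
```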
@@ -290,7 +299,7 @@ VLAN mapping
Summarizing the cdc_mbim driver mapping described above, we have this
relationship between VLAN tags on the wwanY network device and MBIM
-sessions on the shared USB data channel:
+sessions on the shared USB data channel::
VLAN ID MBIM type MBIM SessionID Notes
---------------------------------------------------------
@@ -310,30 +319,37 @@ sessions on the shared USB data channel:
References
==========
-[1] USB Implementers Forum, Inc. - "Universal Serial Bus
- Communications Class Subclass Specification for Mobile Broadband
- Interface Model", Revision 1.0 (Errata 1), May 1, 2013
+ 1) USB Implementers Forum, Inc. - "Universal Serial Bus
+ Communications Class Subclass Specification for Mobile Broadband
+ Interface Model", Revision 1.0 (Errata 1), May 1, 2013
+
- http://www.usb.org/developers/docs/devclass_docs/
-[2] USB Implementers Forum, Inc. - "Universal Serial Bus
- Communications Class Subclass Specifications for Network Control
- Model Devices", Revision 1.0 (Errata 1), November 24, 2010
+ 2) USB Implementers Forum, Inc. - "Universal Serial Bus
+ Communications Class Subclass Specifications for Network Control
+ Model Devices", Revision 1.0 (Errata 1), November 24, 2010
+
- http://www.usb.org/developers/docs/devclass_docs/
-[3] libmbim - "a glib-based library for talking to WWAN modems and
- devices which speak the Mobile Interface Broadband Model (MBIM)
- protocol"
+ 3) libmbim - "a glib-based library for talking to WWAN modems and
+ devices which speak the Mobile Interface Broadband Model (MBIM)
+ protocol"
+
- http://www.freedesktop.org/wiki/Software/libmbim/
-[4] ModemManager - "a DBus-activated daemon which controls mobile
- broadband (2G/3G/4G) devices and connections"
+ 4) ModemManager - "a DBus-activated daemon which controls mobile
+ broadband (2G/3G/4G) devices and connections"
+
- http://www.freedesktop.org/wiki/Software/ModemManager/
-[5] "MBIM (Mobile Broadband Interface Model) Registry"
+ 5) "MBIM (Mobile Broadband Interface Model) Registry"
+
- http://compliance.usb.org/mbim/
-[6] "/sys/kernel/debug/usb/devices output format"
+ 6) "/sys/kernel/debug/usb/devices output format"
+
- Documentation/driver-api/usb/usb.rst
-[7] "/sys/bus/usb/devices/.../descriptors"
+ 7) "/sys/bus/usb/devices/.../descriptors"
+
- Documentation/ABI/stable/sysfs-bus-usb
diff --git a/Documentation/networking/checksum-offloads.rst b/Documentation/networking/checksum-offloads.rst
index 905c8a84b103..69b23cf6879e 100644
--- a/Documentation/networking/checksum-offloads.rst
+++ b/Documentation/networking/checksum-offloads.rst
@@ -59,7 +59,7 @@ recomputed for each resulting segment. See the skbuff.h comment (section 'E')
for more details.
A driver declares its offload capabilities in netdev->hw_features; see
-Documentation/networking/netdev-features.txt for more. Note that a device
+Documentation/networking/netdev-features.rst for more. Note that a device
which only advertises NETIF_F_IP[V6]_CSUM must still obey the csum_start and
csum_offset given in the SKB; if it tries to deduce these itself in hardware
(as some NICs do) the driver should check that the values in the SKB match
diff --git a/Documentation/networking/cops.txt b/Documentation/networking/cops.txt
deleted file mode 100644
index 3e344b448e07..000000000000
--- a/Documentation/networking/cops.txt
+++ /dev/null
@@ -1,63 +0,0 @@
-Text File for the COPS LocalTalk Linux driver (cops.c).
- By Jay Schulist <jschlst@samba.org>
-
-This driver has two modes and they are: Dayna mode and Tangent mode.
-Each mode corresponds with the type of card. It has been found
-that there are 2 main types of cards and all other cards are
-the same and just have different names or only have minor differences
-such as more IO ports. As this driver is tested it will
-become more clear exactly what cards are supported.
-
-Right now these cards are known to work with the COPS driver. The
-LT-200 cards work in a somewhat more limited capacity than the
-DL200 cards, which work very well and are in use by many people.
-
-TANGENT driver mode:
- Tangent ATB-II, Novell NL-1000, Daystar Digital LT-200
-DAYNA driver mode:
- Dayna DL2000/DaynaTalk PC (Half Length), COPS LT-95,
- Farallon PhoneNET PC III, Farallon PhoneNET PC II
-Other cards possibly supported mode unknown though:
- Dayna DL2000 (Full length)
-
-The COPS driver defaults to using Dayna mode. To change the driver's
-mode if you built a driver with dual support use board_type=1 or
-board_type=2 for Dayna or Tangent with insmod.
-
-** Operation/loading of the driver.
-Use modprobe like this: /sbin/modprobe cops.o (IO #) (IRQ #)
-If you do not specify any options the driver will try and use the IO = 0x240,
-IRQ = 5. As of right now I would only use IRQ 5 for the card, if autoprobing.
-
-To load multiple COPS driver Localtalk cards you can do one of the following.
-
-insmod cops io=0x240 irq=5
-insmod -o cops2 cops io=0x260 irq=3
-
-Or in lilo.conf put something like this:
- append="ether=5,0x240,lt0 ether=3,0x260,lt1"
-
-Then bring up the interface with ifconfig. It will look something like this:
-lt0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-F7-00-00-00-00-00-00-00-00
- inet addr:192.168.1.2 Bcast:192.168.1.255 Mask:255.255.255.0
- UP BROADCAST RUNNING NOARP MULTICAST MTU:600 Metric:1
- RX packets:0 errors:0 dropped:0 overruns:0 frame:0
- TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 coll:0
-
-** Netatalk Configuration
-You will need to configure atalkd with something like the following to make
-it work with the cops.c driver.
-
-* For single LTalk card use.
-dummy -seed -phase 2 -net 2000 -addr 2000.10 -zone "1033"
-lt0 -seed -phase 1 -net 1000 -addr 1000.50 -zone "1033"
-
-* For multiple cards, Ethernet and LocalTalk.
-eth0 -seed -phase 2 -net 3000 -addr 3000.20 -zone "1033"
-lt0 -seed -phase 1 -net 1000 -addr 1000.50 -zone "1033"
-
-* For multiple LocalTalk cards, and an Ethernet card.
-* Order seems to matter here, Ethernet last.
-lt0 -seed -phase 1 -net 1000 -addr 1000.10 -zone "LocalTalk1"
-lt1 -seed -phase 1 -net 2000 -addr 2000.20 -zone "LocalTalk2"
-eth0 -seed -phase 2 -net 3000 -addr 3000.30 -zone "EtherTalk"
diff --git a/Documentation/networking/dccp.txt b/Documentation/networking/dccp.rst
index 55c575fcaf17..91e5c33ba3ff 100644
--- a/Documentation/networking/dccp.txt
+++ b/Documentation/networking/dccp.rst
@@ -1,16 +1,18 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
DCCP protocol
=============
-Contents
-========
-- Introduction
-- Missing features
-- Socket options
-- Sysctl variables
-- IOCTLs
-- Other tunables
-- Notes
+.. Contents
+ - Introduction
+ - Missing features
+ - Socket options
+ - Sysctl variables
+ - IOCTLs
+ - Other tunables
+ - Notes
Introduction
@@ -38,6 +40,7 @@ The Linux DCCP implementation does not currently support all the features that a
specified in RFCs 4340...42.
The known bugs are at:
+
http://www.linuxfoundation.org/collaborate/workgroups/networking/todo#DCCP
For more up-to-date versions of the DCCP implementation, please consider using
@@ -54,7 +57,8 @@ defined: the "simple" policy (DCCPQ_POLICY_SIMPLE), which does nothing special,
and a priority-based variant (DCCPQ_POLICY_PRIO). The latter allows to pass an
u32 priority value as ancillary data to sendmsg(), where higher numbers indicate
a higher packet priority (similar to SO_PRIORITY). This ancillary data needs to
-be formatted using a cmsg(3) message header filled in as follows:
+be formatted using a cmsg(3) message header filled in as follows::
+
cmsg->cmsg_level = SOL_DCCP;
cmsg->cmsg_type = DCCP_SCM_PRIORITY;
cmsg->cmsg_len = CMSG_LEN(sizeof(uint32_t)); /* or CMSG_LEN(4) */
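
A fuller sketch of building this control message with the standard cmsg(3) macros might look as follows. The helper name `sketch_fill_prio` is hypothetical. The numeric values used for SOL_DCCP (as a fallback) and for the priority message type are assumptions mirrored from the Linux headers; real code should take both from the system headers instead.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SOL_DCCP
#define SOL_DCCP 269 /* assumption: fallback if the libc does not define it */
#endif

/* Assumption: mirrors DCCP_SCM_PRIORITY from <linux/dccp.h>. */
#define SKETCH_DCCP_SCM_PRIORITY 1

/* Attach a u32 priority as ancillary data to 'msg', using 'buf' as
 * control buffer. Returns the control length, or 0 if 'buf' is too
 * small. 'buf' must be suitably aligned for struct cmsghdr. */
static size_t sketch_fill_prio(struct msghdr *msg, void *buf, size_t buflen,
			       uint32_t prio)
{
	struct cmsghdr *cmsg;

	if (buflen < CMSG_SPACE(sizeof(uint32_t)))
		return 0;

	memset(buf, 0, buflen);
	msg->msg_control = buf;
	msg->msg_controllen = CMSG_SPACE(sizeof(uint32_t));

	cmsg = CMSG_FIRSTHDR(msg);
	cmsg->cmsg_level = SOL_DCCP;
	cmsg->cmsg_type = SKETCH_DCCP_SCM_PRIORITY;
	cmsg->cmsg_len = CMSG_LEN(sizeof(uint32_t));
	memcpy(CMSG_DATA(cmsg), &prio, sizeof(prio));

	return msg->msg_controllen;
}
```

The filled-in msghdr would then be passed to sendmsg() together with the payload iovec; the sketch leaves out the socket setup and the selection of the priority-based queueing policy.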
@@ -94,7 +98,7 @@ must be registered on the socket before calling connect() or listen().
DCCP_SOCKOPT_TX_CCID is read/write. It returns the current CCID (if set) or sets
the preference list for the TX CCID, using the same format as DCCP_SOCKOPT_CCID.
-Please note that the getsockopt argument type here is `int', not uint8_t.
+Please note that the getsockopt argument type here is ``int``, not uint8_t.
DCCP_SOCKOPT_RX_CCID is analogous to DCCP_SOCKOPT_TX_CCID, but for the RX CCID.
@@ -113,6 +117,7 @@ be enabled at the receiver, too with suitable choice of CsCov.
DCCP_SOCKOPT_SEND_CSCOV sets the sender checksum coverage. Values in the
range 0..15 are acceptable. The default setting is 0 (full coverage),
values between 1..15 indicate partial coverage.
+
DCCP_SOCKOPT_RECV_CSCOV is for the receiver and has a different meaning: it
sets a threshold, where again values 0..15 are acceptable. The default
of 0 means that all packets with a partial coverage will be discarded.
@@ -123,11 +128,13 @@ DCCP_SOCKOPT_RECV_CSCOV is for the receiver and has a different meaning: it
The following two options apply to CCID 3 exclusively and are getsockopt()-only.
In either case, a TFRC info struct (defined in <linux/tfrc.h>) is returned.
+
DCCP_SOCKOPT_CCID_RX_INFO
- Returns a `struct tfrc_rx_info' in optval; the buffer for optval and
+ Returns a ``struct tfrc_rx_info`` in optval; the buffer for optval and
optlen must be set to at least sizeof(struct tfrc_rx_info).
+
DCCP_SOCKOPT_CCID_TX_INFO
- Returns a `struct tfrc_tx_info' in optval; the buffer for optval and
+ Returns a ``struct tfrc_tx_info`` in optval; the buffer for optval and
optlen must be set to at least sizeof(struct tfrc_tx_info).
On unidirectional connections it is useful to close the unused half-connection
@@ -182,19 +189,24 @@ sync_ratelimit = 125 ms
IOCTLS
======
FIONREAD
- Works as in udp(7): returns in the `int' argument pointer the size of
+ Works as in udp(7): returns in the ``int`` argument pointer the size of
the next pending datagram in bytes, or 0 when no datagram is pending.
+SIOCOUTQ
+ Returns the number of unsent data bytes in the socket send queue as ``int``
+ into the buffer specified by the argument pointer.
Other tunables
==============
Per-route rto_min support
CCID-2 supports the RTAX_RTO_MIN per-route setting for the minimum value
of the RTO timer. This setting can be modified via the 'rto_min' option
- of iproute2; for example:
+ of iproute2; for example::
+
> ip route change 10.0.0.0/24 rto_min 250j dev wlan0
> ip route add 10.0.0.254/32 rto_min 800j dev wlan0
> ip route show dev wlan0
+
CCID-3 also supports the rto_min setting: it is used to define the lower
bound for the expiry of the nofeedback timer. This can be useful on LANs
with very low RTTs (e.g., loopback, Gbit ethernet).
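Review note on the DCCP_SCM_PRIORITY ancillary data described in dccp.rst above: a minimal sketch of how the same cmsg fields could be built from Python, whose `socket.sendmsg()` accepts ancillary data as `(level, type, data)` tuples. The numeric values of `SOL_DCCP` and `DCCP_SCM_PRIORITY` are not part of this patch; they are assumed here from the Linux headers and should be checked against the headers on the target system. The helper name `priority_cmsg` is hypothetical.

```python
import struct

# Assumed values, taken from the Linux headers rather than from this patch:
# SOL_DCCP from <linux/socket.h>, DCCP_SCM_PRIORITY from <linux/dccp.h>.
SOL_DCCP = 269
DCCP_SCM_PRIORITY = 1


def priority_cmsg(priority):
    """Return one ancillary-data item carrying a u32 packet priority."""
    if not 0 <= priority <= 0xFFFFFFFF:
        raise ValueError("priority must fit in an unsigned 32-bit integer")
    # "=I" packs a native-endian uint32_t without padding, i.e. 4 bytes,
    # which corresponds to cmsg_len = CMSG_LEN(sizeof(uint32_t)).
    return (SOL_DCCP, DCCP_SCM_PRIORITY, struct.pack("=I", priority))


level, ctype, data = priority_cmsg(7)
print(level, ctype, len(data))
```

On a connected DCCP socket that has the priority queuing policy selected, the item would be passed as `sock.sendmsg([payload], [priority_cmsg(7)])`; higher numbers indicate a higher packet priority, as the text states.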
diff --git a/Documentation/networking/dctcp.txt b/Documentation/networking/dctcp.rst
index 13a857753208..4cc8bb2dad50 100644
--- a/Documentation/networking/dctcp.txt
+++ b/Documentation/networking/dctcp.rst
@@ -1,11 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
DCTCP (DataCenter TCP)
-----------------------
+======================
DCTCP is an enhancement to the TCP congestion control algorithm for data
center networks and leverages Explicit Congestion Notification (ECN) in
the data center network to provide multi-bit feedback to the end hosts.
-To enable it on end hosts:
+To enable it on end hosts::
sysctl -w net.ipv4.tcp_congestion_control=dctcp
sysctl -w net.ipv4.tcp_ecn_fallback=0 (optional)
@@ -25,14 +28,19 @@ SIGCOMM/SIGMETRICS papers:
i) Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye,
Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan:
- "Data Center TCP (DCTCP)", Data Center Networks session
+
+ "Data Center TCP (DCTCP)", Data Center Networks session
+
Proc. ACM SIGCOMM, New Delhi, 2010.
+
http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
http://www.sigcomm.org/ccr/papers/2010/October/1851275.1851192
ii) Mohammad Alizadeh, Adel Javanmard, and Balaji Prabhakar:
+
"Analysis of DCTCP: Stability, Convergence, and Fairness"
Proc. ACM SIGMETRICS, San Jose, 2011.
+
http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf
IETF informational draft:
diff --git a/Documentation/networking/decnet.txt b/Documentation/networking/decnet.txt
deleted file mode 100644
index d192f8b9948b..000000000000
--- a/Documentation/networking/decnet.txt
+++ /dev/null
@@ -1,230 +0,0 @@
- Linux DECnet Networking Layer Information
- ===========================================
-
-1) Other documentation....
-
- o Project Home Pages
- http://www.chygwyn.com/ - Kernel info
- http://linux-decnet.sourceforge.net/ - Userland tools
- http://www.sourceforge.net/projects/linux-decnet/ - Status page
-
-2) Configuring the kernel
-
-Be sure to turn on the following options:
-
- CONFIG_DECNET (obviously)
- CONFIG_PROC_FS (to see what's going on)
- CONFIG_SYSCTL (for easy configuration)
-
-if you want to try out router support (not properly debugged yet)
-you'll need the following options as well...
-
- CONFIG_DECNET_ROUTER (to be able to add/delete routes)
- CONFIG_NETFILTER (will be required for the DECnet routing daemon)
-
-Don't turn on SIOCGIFCONF support for DECnet unless you are really sure
-that you need it, in general you won't and it can cause ifconfig to
-malfunction.
-
-Run time configuration has changed slightly from the 2.4 system. If you
-want to configure an endnode, then the simplified procedure is as follows:
-
- o Set the MAC address on your ethernet card before starting _any_ other
- network protocols.
-
-As soon as your network card is brought into the UP state, DECnet should
-start working. If you need something more complicated or are unsure how
-to set the MAC address, see the next section. Also all configurations which
-worked with 2.4 will work under 2.5 with no change.
-
-3) Command line options
-
-You can set a DECnet address on the kernel command line for compatibility
-with the 2.4 configuration procedure, but in general it's not needed any more.
-If you do st a DECnet address on the command line, it has only one purpose
-which is that its added to the addresses on the loopback device.
-
-With 2.4 kernels, DECnet would only recognise addresses as local if they
-were added to the loopback device. In 2.5, any local interface address
-can be used to loop back to the local machine. Of course this does not
-prevent you adding further addresses to the loopback device if you
-want to.
-
-N.B. Since the address list of an interface determines the addresses for
-which "hello" messages are sent, if you don't set an address on the loopback
-interface then you won't see any entries in /proc/net/neigh for the local
-host until such time as you start a connection. This doesn't affect the
-operation of the local communications in any other way though.
-
-The kernel command line takes options looking like the following:
-
- decnet.addr=1,2
-
-the two numbers are the node address 1,2 = 1.2 For 2.2.xx kernels
-and early 2.3.xx kernels, you must use a comma when specifying the
-DECnet address like this. For more recent 2.3.xx kernels, you may
-use almost any character except space, although a `.` would be the most
-obvious choice :-)
-
-There used to be a third number specifying the node type. This option
-has gone away in favour of a per interface node type. This is now set
-using /proc/sys/net/decnet/conf/<dev>/forwarding. This file can be
-set with a single digit, 0=EndNode, 1=L1 Router and 2=L2 Router.
-
-There are also equivalent options for modules. The node address can
-also be set through the /proc/sys/net/decnet/ files, as can other system
-parameters.
-
-Currently the only supported devices are ethernet and ip_gre. The
-ethernet address of your ethernet card has to be set according to the DECnet
-address of the node in order for it to be autoconfigured (and then appear in
-/proc/net/decnet_dev). There is a utility available at the above
-FTP sites called dn2ethaddr which can compute the correct ethernet
-address to use. The address can be set by ifconfig either before or
-at the time the device is brought up. If you are using RedHat you can
-add the line:
-
- MACADDR=AA:00:04:00:03:04
-
-or something similar, to /etc/sysconfig/network-scripts/ifcfg-eth0 or
-wherever your network card's configuration lives. Setting the MAC address
-of your ethernet card to an address starting with "hi-ord" will cause a
-DECnet address which matches to be added to the interface (which you can
-verify with iproute2).
-
-The default device for routing can be set through the /proc filesystem
-by setting /proc/sys/net/decnet/default_device to the
-device you want DECnet to route packets out of when no specific route
-is available. Usually this will be eth0, for example:
-
- echo -n "eth0" >/proc/sys/net/decnet/default_device
-
-If you don't set the default device, then it will default to the first
-ethernet card which has been autoconfigured as described above. You can
-confirm that by looking in the default_device file of course.
-
-There is a list of what the other files under /proc/sys/net/decnet/ do
-on the kernel patch web site (shown above).
-
-4) Run time kernel configuration
-
-This is either done through the sysctl/proc interface (see the kernel web
-pages for details on what the various options do) or through the iproute2
-package in the same way as IPv4/6 configuration is performed.
-
-Documentation for iproute2 is included with the package, although there is
-as yet no specific section on DECnet, most of the features apply to both
-IP and DECnet, albeit with DECnet addresses instead of IP addresses and
-a reduced functionality.
-
-If you want to configure a DECnet router you'll need the iproute2 package
-since its the _only_ way to add and delete routes currently. Eventually
-there will be a routing daemon to send and receive routing messages for
-each interface and update the kernel routing tables accordingly. The
-routing daemon will use netfilter to listen to routing packets, and
-rtnetlink to update the kernels routing tables.
-
-The DECnet raw socket layer has been removed since it was there purely
-for use by the routing daemon which will now use netfilter (a much cleaner
-and more generic solution) instead.
-
-5) How can I tell if its working ?
-
-Here is a quick guide of what to look for in order to know if your DECnet
-kernel subsystem is working.
-
- - Is the node address set (see /proc/sys/net/decnet/node_address)
- - Is the node of the correct type
- (see /proc/sys/net/decnet/conf/<dev>/forwarding)
- - Is the Ethernet MAC address of each Ethernet card set to match
- the DECnet address. If in doubt use the dn2ethaddr utility available
- at the ftp archive.
- - If the previous two steps are satisfied, and the Ethernet card is up,
- you should find that it is listed in /proc/net/decnet_dev and also
- that it appears as a directory in /proc/sys/net/decnet/conf/. The
- loopback device (lo) should also appear and is required to communicate
- within a node.
- - If you have any DECnet routers on your network, they should appear
- in /proc/net/decnet_neigh, otherwise this file will only contain the
- entry for the node itself (if it doesn't check to see if lo is up).
- - If you want to send to any node which is not listed in the
- /proc/net/decnet_neigh file, you'll need to set the default device
- to point to an Ethernet card with connection to a router. This is
- again done with the /proc/sys/net/decnet/default_device file.
- - Try starting a simple server and client, like the dnping/dnmirror
- over the loopback interface. With luck they should communicate.
- For this step and those after, you'll need the DECnet library
- which can be obtained from the above ftp sites as well as the
- actual utilities themselves.
- - If this seems to work, then try talking to a node on your local
- network, and see if you can obtain the same results.
- - At this point you are on your own... :-)
-
-6) How to send a bug report
-
-If you've found a bug and want to report it, then there are several things
-you can do to help me work out exactly what it is that is wrong. Useful
-information (_most_ of which _is_ _essential_) includes:
-
- - What kernel version are you running ?
- - What version of the patch are you running ?
- - How far though the above set of tests can you get ?
- - What is in the /proc/decnet* files and /proc/sys/net/decnet/* files ?
- - Which services are you running ?
- - Which client caused the problem ?
- - How much data was being transferred ?
- - Was the network congested ?
- - How can the problem be reproduced ?
- - Can you use tcpdump to get a trace ? (N.B. Most (all?) versions of
- tcpdump don't understand how to dump DECnet properly, so including
- the hex listing of the packet contents is _essential_, usually the -x flag.
- You may also need to increase the length grabbed with the -s flag. The
- -e flag also provides very useful information (ethernet MAC addresses))
-
-7) MAC FAQ
-
-A quick FAQ on ethernet MAC addresses to explain how Linux and DECnet
-interact and how to get the best performance from your hardware.
-
-Ethernet cards are designed to normally only pass received network frames
-to a host computer when they are addressed to it, or to the broadcast address.
-
-Linux has an interface which allows the setting of extra addresses for
-an ethernet card to listen to. If the ethernet card supports it, the
-filtering operation will be done in hardware, if not the extra unwanted packets
-received will be discarded by the host computer. In the latter case,
-significant processor time and bus bandwidth can be used up on a busy
-network (see the NAPI documentation for a longer explanation of these
-effects).
-
-DECnet makes use of this interface to allow running DECnet on an ethernet
-card which has already been configured using TCP/IP (presumably using the
-built in MAC address of the card, as usual) and/or to allow multiple DECnet
-addresses on each physical interface. If you do this, be aware that if your
-ethernet card doesn't support perfect hashing in its MAC address filter
-then your computer will be doing more work than required. Some cards
-will simply set themselves into promiscuous mode in order to receive
-packets from the DECnet specified addresses. So if you have one of these
-cards its better to set the MAC address of the card as described above
-to gain the best efficiency. Better still is to use a card which supports
-NAPI as well.
-
-
-8) Mailing list
-
-If you are keen to get involved in development, or want to ask questions
-about configuration, or even just report bugs, then there is a mailing
-list that you can join, details are at:
-
-http://sourceforge.net/mail/?group_id=4993
-
-9) Legal Info
-
-The Linux DECnet project team have placed their code under the GPL. The
-software is provided "as is" and without warranty express or implied.
-DECnet is a trademark of Compaq. This software is not a product of
-Compaq. We acknowledge the help of people at Compaq in providing extra
-documentation above and beyond what was previously publicly available.
-
-Steve Whitehouse <SteveW@ACM.org>
-
diff --git a/Documentation/networking/device_drivers/appletalk/cops.rst b/Documentation/networking/device_drivers/appletalk/cops.rst
new file mode 100644
index 000000000000..964ba80599a9
--- /dev/null
+++ b/Documentation/networking/device_drivers/appletalk/cops.rst
@@ -0,0 +1,80 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================================
+The COPS LocalTalk Linux driver (cops.c)
+========================================
+
+By Jay Schulist <jschlst@samba.org>
+
+This driver has two modes and they are: Dayna mode and Tangent mode.
+Each mode corresponds with the type of card. It has been found
+that there are 2 main types of cards and all other cards are
+the same and just have different names or only have minor differences
+such as more IO ports. As this driver is tested it will
+become more clear exactly what cards are supported.
+
+Right now these cards are known to work with the COPS driver. The
+LT-200 cards work in a somewhat more limited capacity than the
+DL200 cards, which work very well and are in use by many people.
+
+TANGENT driver mode:
+ - Tangent ATB-II, Novell NL-1000, Daystar Digital LT-200
+
+DAYNA driver mode:
+ - Dayna DL2000/DaynaTalk PC (Half Length), COPS LT-95,
+ - Farallon PhoneNET PC III, Farallon PhoneNET PC II
+
+Other cards possibly supported, mode unknown though:
+ - Dayna DL2000 (Full length)
+
+The COPS driver defaults to using Dayna mode. To change the driver's
+mode if you built a driver with dual support use board_type=1 or
+board_type=2 for Dayna or Tangent with insmod.
+
+Operation/loading of the driver
+===============================
+
+Use modprobe like this: /sbin/modprobe cops.o (IO #) (IRQ #)
+If you do not specify any options the driver will try and use the IO = 0x240,
+IRQ = 5. As of right now I would only use IRQ 5 for the card, if autoprobing.
+
+To load multiple COPS driver Localtalk cards you can do one of the following::
+
+ insmod cops io=0x240 irq=5
+ insmod -o cops2 cops io=0x260 irq=3
+
+Or in lilo.conf put something like this::
+
+ append="ether=5,0x240,lt0 ether=3,0x260,lt1"
+
+Then bring up the interface with ifconfig. It will look something like this::
+
+ lt0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-F7-00-00-00-00-00-00-00-00
+ inet addr:192.168.1.2 Bcast:192.168.1.255 Mask:255.255.255.0
+ UP BROADCAST RUNNING NOARP MULTICAST MTU:600 Metric:1
+ RX packets:0 errors:0 dropped:0 overruns:0 frame:0
+ TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 coll:0
+
+Netatalk Configuration
+======================
+
+You will need to configure atalkd with something like the following to make
+it work with the cops.c driver.
+
+* For single LTalk card use::
+
+ dummy -seed -phase 2 -net 2000 -addr 2000.10 -zone "1033"
+ lt0 -seed -phase 1 -net 1000 -addr 1000.50 -zone "1033"
+
+* For multiple cards, Ethernet and LocalTalk::
+
+ eth0 -seed -phase 2 -net 3000 -addr 3000.20 -zone "1033"
+ lt0 -seed -phase 1 -net 1000 -addr 1000.50 -zone "1033"
+
+* For multiple LocalTalk cards, and an Ethernet card.
+
+* Order seems to matter here, Ethernet last::
+
+ lt0 -seed -phase 1 -net 1000 -addr 1000.10 -zone "LocalTalk1"
+ lt1 -seed -phase 1 -net 2000 -addr 2000.20 -zone "LocalTalk2"
+ eth0 -seed -phase 2 -net 3000 -addr 3000.30 -zone "EtherTalk"
diff --git a/Documentation/networking/device_drivers/appletalk/index.rst b/Documentation/networking/device_drivers/appletalk/index.rst
new file mode 100644
index 000000000000..c196baeb0856
--- /dev/null
+++ b/Documentation/networking/device_drivers/appletalk/index.rst
@@ -0,0 +1,18 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+AppleTalk Device Drivers
+========================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ cops
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/cxacru-cf.py b/Documentation/networking/device_drivers/atm/cxacru-cf.py
index b41d298398c8..b41d298398c8 100644
--- a/Documentation/networking/cxacru-cf.py
+++ b/Documentation/networking/device_drivers/atm/cxacru-cf.py
diff --git a/Documentation/networking/cxacru.txt b/Documentation/networking/device_drivers/atm/cxacru.rst
index 2cce04457b4d..6088af2ffeda 100644
--- a/Documentation/networking/cxacru.txt
+++ b/Documentation/networking/device_drivers/atm/cxacru.rst
@@ -1,3 +1,9 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================
+ATM cxacru device driver
+========================
+
Firmware is required for this device: http://accessrunner.sourceforge.net/
While it is capable of managing/maintaining the ADSL connection without the
@@ -19,29 +25,35 @@ several sysfs attribute files for retrieving device statistics:
* adsl_headend
* adsl_headend_environment
- Information about the remote headend.
+
+ - Information about the remote headend.
* adsl_config
- Configuration writing interface.
- Write parameters in hexadecimal format <index>=<value>,
- separated by whitespace, e.g.:
+
+ - Configuration writing interface.
+ - Write parameters in hexadecimal format <index>=<value>,
+ separated by whitespace, e.g.:
+
"1=0 a=5"
- Up to 7 parameters at a time will be sent and the modem will restart
- the ADSL connection when any value is set. These are logged for future
- reference.
+
+ - Up to 7 parameters at a time will be sent and the modem will restart
+ the ADSL connection when any value is set. These are logged for future
+ reference.
* downstream_attenuation (dB)
* downstream_bits_per_frame
* downstream_rate (kbps)
* downstream_snr_margin (dB)
- Downstream stats.
+
+ - Downstream stats.
* upstream_attenuation (dB)
* upstream_bits_per_frame
* upstream_rate (kbps)
* upstream_snr_margin (dB)
* transmitter_power (dBm/Hz)
- Upstream stats.
+
+ - Upstream stats.
* downstream_crc_errors
* downstream_fec_errors
@@ -49,48 +61,56 @@ several sysfs attribute files for retrieving device statistics:
* upstream_crc_errors
* upstream_fec_errors
* upstream_hec_errors
- Error counts.
+
+ - Error counts.
* line_startable
- Indicates that ADSL support on the device
- is/can be enabled, see adsl_start.
+
+ - Indicates that ADSL support on the device
+ is/can be enabled, see adsl_start.
* line_status
- "initialising"
- "down"
- "attempting to activate"
- "training"
- "channel analysis"
- "exchange"
- "waiting"
- "up"
+
+ - "initialising"
+ - "down"
+ - "attempting to activate"
+ - "training"
+ - "channel analysis"
+ - "exchange"
+ - "waiting"
+ - "up"
Changes between "down" and "attempting to activate"
if there is no signal.
* link_status
- "not connected"
- "connected"
- "lost"
+
+ - "not connected"
+ - "connected"
+ - "lost"
* mac_address
* modulation
- "" (when not connected)
- "ANSI T1.413"
- "ITU-T G.992.1 (G.DMT)"
- "ITU-T G.992.2 (G.LITE)"
+
+ - "" (when not connected)
+ - "ANSI T1.413"
+ - "ITU-T G.992.1 (G.DMT)"
+ - "ITU-T G.992.2 (G.LITE)"
* startup_attempts
- Count of total attempts to initialise ADSL.
+
+ - Count of total attempts to initialise ADSL.
To enable/disable ADSL, the following can be written to the adsl_state file:
- "start"
- "stop
- "restart" (stops, waits 1.5s, then starts)
- "poll" (used to resume status polling if it was disabled due to failure)
-Changes in adsl/line state are reported via kernel log messages:
+ - "start"
+ - "stop"
+ - "restart" (stops, waits 1.5s, then starts)
+ - "poll" (used to resume status polling if it was disabled due to failure)
+
+Changes in adsl/line state are reported via kernel log messages::
+
[4942145.150704] ATM dev 0: ADSL state: running
[4942243.663766] ATM dev 0: ADSL line: down
[4942249.665075] ATM dev 0: ADSL line: attempting to activate
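Review note on the adsl_config format described in cxacru.rst above: a small sketch of how the whitespace-separated, hexadecimal `<index>=<value>` string could be generated. The helper name `format_adsl_config` is hypothetical, and the sketch only builds the string; where the adsl_config attribute lives in sysfs depends on the device and is not assumed here.

```python
def format_adsl_config(params):
    """Render (index, value) pairs as '<index>=<value>' in hexadecimal."""
    parts = []
    for index, value in params:
        if index < 0 or value < 0:
            raise ValueError("index and value must be non-negative")
        # Both fields are written in hexadecimal without a 0x prefix.
        parts.append(f"{index:x}={value:x}")
    # Parameters are separated by whitespace.
    return " ".join(parts)


print(format_adsl_config([(0x1, 0x0), (0xA, 0x5)]))  # prints "1=0 a=5"
```

The printed string matches the "1=0 a=5" example in the documentation. Since the modem restarts the ADSL connection whenever a value is set, it seems sensible to build the full string first and write it once.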
diff --git a/Documentation/networking/fore200e.txt b/Documentation/networking/device_drivers/atm/fore200e.rst
index 1f98f62b4370..55df9ec09ac8 100644
--- a/Documentation/networking/fore200e.txt
+++ b/Documentation/networking/device_drivers/atm/fore200e.rst
@@ -1,6 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
+=============================================
FORE Systems PCA-200E/SBA-200E ATM NIC driver
----------------------------------------------
+=============================================
This driver adds support for the FORE Systems 200E-series ATM adapters
to the Linux operating system. It is based on the earlier PCA-200E driver
@@ -27,8 +29,8 @@ in the linux/drivers/atm directory for details and restrictions.
Firmware Updates
----------------
-The FORE Systems 200E-series driver is shipped with firmware data being
-uploaded to the ATM adapters at system boot time or at module loading time.
+The FORE Systems 200E-series driver is shipped with firmware data being
+uploaded to the ATM adapters at system boot time or at module loading time.
The supplied firmware images should work with all adapters.
However, if you encounter problems (the firmware doesn't start or the driver
diff --git a/Documentation/networking/device_drivers/atm/index.rst b/Documentation/networking/device_drivers/atm/index.rst
new file mode 100644
index 000000000000..7b593f031a60
--- /dev/null
+++ b/Documentation/networking/device_drivers/atm/index.rst
@@ -0,0 +1,20 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Asynchronous Transfer Mode (ATM) Device Drivers
+===============================================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ cxacru
+ fore200e
+ iphase
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/iphase.txt b/Documentation/networking/device_drivers/atm/iphase.rst
index 670b72f16585..92d9b757d75a 100644
--- a/Documentation/networking/iphase.txt
+++ b/Documentation/networking/device_drivers/atm/iphase.rst
@@ -1,27 +1,35 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+ATM (i)Chip IA Linux Driver Source
+==================================
+
+ READ ME FIRST
- READ ME FISRT
- ATM (i)Chip IA Linux Driver Source
--------------------------------------------------------------------------------
- Read This Before You Begin!
+
+ Read This Before You Begin!
+
--------------------------------------------------------------------------------
Description
------------
+===========
-This is the README file for the Interphase PCI ATM (i)Chip IA Linux driver
+This is the README file for the Interphase PCI ATM (i)Chip IA Linux driver
source release.
The features and limitations of this driver are as follows:
+
- A single VPI (VPI value of 0) is supported.
- - Supports 4K VCs for the server board (with 512K control memory) and 1K
+ - Supports 4K VCs for the server board (with 512K control memory) and 1K
VCs for the client board (with 128K control memory).
- UBR, ABR and CBR service categories are supported.
- - Only AAL5 is supported.
- - Supports setting of PCR on the VCs.
+ - Only AAL5 is supported.
+ - Supports setting of PCR on the VCs.
- Multiple adapters in a system are supported.
- - All variants of Interphase ATM PCI (i)Chip adapter cards are supported,
- including x575 (OC3, control memory 128K , 512K and packet memory 128K,
- 512K and 1M), x525 (UTP25) and x531 (DS3 and E3). See
+ - All variants of Interphase ATM PCI (i)Chip adapter cards are supported,
+ including x575 (OC3, control memory 128K , 512K and packet memory 128K,
+ 512K and 1M), x525 (UTP25) and x531 (DS3 and E3). See
http://www.iphase.com/
for details.
- Only x86 platforms are supported.
@@ -29,128 +37,155 @@ The features and limitations of this driver are as follows:
Before You Start
-----------------
+================
Installation
------------
1. Installing the adapters in the system
+
To install the ATM adapters in the system, follow the steps below.
+
a. Login as root.
b. Shut down the system and power off the system.
c. Install one or more ATM adapters in the system.
- d. Connect each adapter to a port on an ATM switch. The green 'Link'
- LED on the front panel of the adapter will be on if the adapter is
- connected to the switch properly when the system is powered up.
+ d. Connect each adapter to a port on an ATM switch. The green 'Link'
+ LED on the front panel of the adapter will be on if the adapter is
+ connected to the switch properly when the system is powered up.
e. Power on and boot the system.
2. [ Removed ]
3. Rebuild kernel with ABR support
+
[ a. and b. removed ]
- c. Reconfigure the kernel, choose the Interphase ia driver through "make
+
+ c. Reconfigure the kernel, choose the Interphase ia driver through "make
menuconfig" or "make xconfig".
- d. Rebuild the kernel, loadable modules and the atm tools.
+ d. Rebuild the kernel, loadable modules and the atm tools.
e. Install the new built kernel and modules and reboot.
4. Load the adapter hardware driver (ia driver) if it is built as a module
+
a. Login as root.
b. Change directory to /lib/modules/<kernel-version>/atm.
c. Run "insmod suni.o;insmod iphase.o"
- The yellow 'status' LED on the front panel of the adapter will blink
- while the driver is loaded in the system.
- d. To verify that the 'ia' driver is loaded successfully, run the
- following command:
+ The yellow 'status' LED on the front panel of the adapter will blink
+ while the driver is loaded in the system.
+ d. To verify that the 'ia' driver is loaded successfully, run the
+ following command::
- cat /proc/atm/devices
+ cat /proc/atm/devices
- If the driver is loaded successfully, the output of the command will
- be similar to the following lines:
+ If the driver is loaded successfully, the output of the command will
+ be similar to the following lines::
- Itf Type ESI/"MAC"addr AAL(TX,err,RX,err,drop) ...
- 0 ia xxxxxxxxx 0 ( 0 0 0 0 0 ) 5 ( 0 0 0 0 0 )
+ Itf Type ESI/"MAC"addr AAL(TX,err,RX,err,drop) ...
+ 0 ia xxxxxxxxx 0 ( 0 0 0 0 0 ) 5 ( 0 0 0 0 0 )
- You can also check the system log file /var/log/messages for messages
- related to the ATM driver.
+ You can also check the system log file /var/log/messages for messages
+ related to the ATM driver.
-5. Ia Driver Configuration
+5. Ia Driver Configuration
5.1 Configuration of adapter buffers
The (i)Chip boards have 3 different packet RAM size variants: 128K, 512K and
- 1M. The RAM size decides the number of buffers and buffer size. The default
- size and number of buffers are set as following:
-
- Total Rx RAM Tx RAM Rx Buf Tx Buf Rx buf Tx buf
- RAM size size size size size cnt cnt
- -------- ------ ------ ------ ------ ------ ------
- 128K 64K 64K 10K 10K 6 6
- 512K 256K 256K 10K 10K 25 25
- 1M 512K 512K 10K 10K 51 51
+ 1M. The RAM size decides the number of buffers and buffer size. The default
+ size and number of buffers are set as follows:
+
+ ========= ======= ====== ====== ====== ====== ======
+ Total Rx RAM Tx RAM Rx Buf Tx Buf Rx buf Tx buf
+ RAM size size size size size cnt cnt
+ ========= ======= ====== ====== ====== ====== ======
+ 128K 64K 64K 10K 10K 6 6
+ 512K 256K 256K 10K 10K 25 25
+ 1M 512K 512K 10K 10K 51 51
+ ========= ======= ====== ====== ====== ====== ======
These setting should work well in most environments, but can be
- changed by typing the following command:
-
- insmod <IA_DIR>/ia.o IA_RX_BUF=<RX_CNT> IA_RX_BUF_SZ=<RX_SIZE> \
- IA_TX_BUF=<TX_CNT> IA_TX_BUF_SZ=<TX_SIZE>
+ changed by typing the following command::
+
+ insmod <IA_DIR>/ia.o IA_RX_BUF=<RX_CNT> IA_RX_BUF_SZ=<RX_SIZE> \
+ IA_TX_BUF=<TX_CNT> IA_TX_BUF_SZ=<TX_SIZE>
+
Where:
- RX_CNT = number of receive buffers in the range (1-128)
- RX_SIZE = size of receive buffers in the range (48-64K)
- TX_CNT = number of transmit buffers in the range (1-128)
- TX_SIZE = size of transmit buffers in the range (48-64K)
- 1. Transmit and receive buffer size must be a multiple of 4.
- 2. Care should be taken so that the memory required for the
- transmit and receive buffers is less than or equal to the
- total adapter packet memory.
+ - RX_CNT = number of receive buffers in the range (1-128)
+ - RX_SIZE = size of receive buffers in the range (48-64K)
+ - TX_CNT = number of transmit buffers in the range (1-128)
+ - TX_SIZE = size of transmit buffers in the range (48-64K)
+
+ 1. Transmit and receive buffer size must be a multiple of 4.
+ 2. Care should be taken so that the memory required for the
+ transmit and receive buffers is less than or equal to the
+ total adapter packet memory.
5.2 Turn on ia debug trace
- When the ia driver is built with the CONFIG_ATM_IA_DEBUG flag, the driver
- can provide more debug trace if needed. There is a bit mask variable,
- IADebugFlag, which controls the output of the traces. You can find the bit
- map of the IADebugFlag in iphase.h.
- The debug trace can be turn on through the insmod command line option, for
- example, "insmod iphase.o IADebugFlag=0xffffffff" can turn on all the debug
+ When the ia driver is built with the CONFIG_ATM_IA_DEBUG flag, the driver
+ can provide more debug trace if needed. There is a bit mask variable,
+ IADebugFlag, which controls the output of the traces. You can find the bit
+ map of the IADebugFlag in iphase.h.
+ The debug trace can be turned on through the insmod command line option, for
+ example, "insmod iphase.o IADebugFlag=0xffffffff" can turn on all the debug
traces together with loading the driver.
6. Ia Driver Test Using ttcp_atm and PVC
- For the PVC setup, the test machines can either be connected back-to-back or
- through a switch. If connected through the switch, the switch must be
+ For the PVC setup, the test machines can either be connected back-to-back or
+ through a switch. If connected through the switch, the switch must be
configured for the PVC(s).
a. For UBR test:
- At the test machine intended to receive data, type:
- ttcp_atm -r -a -s 0.100
- At the other test machine, type:
- ttcp_atm -t -a -s 0.100 -n 10000
+
+ At the test machine intended to receive data, type::
+
+ ttcp_atm -r -a -s 0.100
+
+ At the other test machine, type::
+
+ ttcp_atm -t -a -s 0.100 -n 10000
+
Run "ttcp_atm -h" to display more options of the ttcp_atm tool.
b. For ABR test:
- It is the same as the UBR testing, but with an extra command option:
- -Pabr:max_pcr=<xxx>
- where:
- xxx = the maximum peak cell rate, from 170 - 353207.
- This option must be set on both the machines.
+
+ It is the same as the UBR testing, but with an extra command option::
+
+ -Pabr:max_pcr=<xxx>
+
+ where:
+
+ xxx = the maximum peak cell rate, from 170 - 353207.
+
+ This option must be set on both the machines.
+
c. For CBR test:
- It is the same as the UBR testing, but with an extra command option:
- -Pcbr:max_pcr=<xxx>
- where:
- xxx = the maximum peak cell rate, from 170 - 353207.
- This option may only be set on the transmit machine.
+ It is the same as the UBR testing, but with an extra command option::
+
+ -Pcbr:max_pcr=<xxx>
+
+ where:
+
+ xxx = the maximum peak cell rate, from 170 - 353207.
-OUTSTANDING ISSUES
-------------------
+ This option may only be set on the transmit machine.
+
+
+Outstanding Issues
+==================
Contact Information
-------------------
+::
+
Customer Support:
- United States: Telephone: (214) 654-5555
- Fax: (214) 654-5500
+ United States: Telephone: (214) 654-5555
+ Fax: (214) 654-5500
E-Mail: intouch@iphase.com
Europe: Telephone: 33 (0)1 41 15 44 00
Fax: 33 (0)1 41 15 12 13
diff --git a/Documentation/networking/device_drivers/cable/index.rst b/Documentation/networking/device_drivers/cable/index.rst
new file mode 100644
index 000000000000..cce3c4392972
--- /dev/null
+++ b/Documentation/networking/device_drivers/cable/index.rst
@@ -0,0 +1,18 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Cable Modem Device Drivers
+==========================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ sb1000
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/cable/sb1000.rst b/Documentation/networking/device_drivers/cable/sb1000.rst
new file mode 100644
index 000000000000..c8582ca4034d
--- /dev/null
+++ b/Documentation/networking/device_drivers/cable/sb1000.rst
@@ -0,0 +1,222 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+SB1000 device driver
+====================
+
+sb1000 is a modular network device driver for the General Instrument (also known
+as NextLevel) SURFboard1000 internal cable modem board. This is an ISA card
+which is used by a number of cable TV companies to provide cable modem access.
+It's a one-way downstream-only cable modem, meaning that your upstream net link
+is provided by your regular phone modem.
+
+This driver was written by Franco Venturi <fventuri@mediaone.net>. He deserves
+a great deal of thanks for this wonderful piece of code!
+
+Needed tools
+============
+
+Support for this device is now a part of the standard Linux kernel. The
+driver source code file is drivers/net/sb1000.c. In addition to this
+you will need:
+
+1. The "cmconfig" program. This is a utility which supplements "ifconfig"
+ to configure the cable modem and network interface (usually called "cm0");
+
+2. Several PPP scripts which live in /etc/ppp to make connecting via your
+ cable modem easy.
+
+ These utilities can be obtained from:
+
+ http://www.jacksonville.net/~fventuri/
+
+ in Franco's original source code distribution .tar.gz file. Support for
+ the sb1000 driver can be found at:
+
+ - http://web.archive.org/web/%2E/http://home.adelphia.net/~siglercm/sb1000.html
+ - http://web.archive.org/web/%2E/http://linuxpower.cx/~cable/
+
+ along with these utilities.
+
+3. The standard isapnp tools. These are necessary to configure your SB1000
+ card at boot time (or afterwards by hand) since it's a PnP card.
+
+ If you don't have these installed as a standard part of your Linux
+ distribution, you can find them at:
+
+ http://www.roestock.demon.co.uk/isapnptools/
+
+ or check your Linux distribution binary CD or their web site. For help with
+ isapnp, pnpdump, or /etc/isapnp.conf, go to:
+
+ http://www.roestock.demon.co.uk/isapnptools/isapnpfaq.html
+
+Using the driver
+================
+
+To make the SB1000 card work, follow these steps:
+
+1. Run ``make config``, or ``make menuconfig``, or ``make xconfig``, whichever
+ you prefer, in the top kernel tree directory to set up your kernel
+ configuration. Make sure to say "Y" to "Prompt for development drivers"
+ and to say "M" to the sb1000 driver. Also say "Y" or "M" to all the standard
+ networking questions to get TCP/IP and PPP networking support.
+
+2. **BEFORE** you build the kernel, edit drivers/net/sb1000.c. Make sure
+ to redefine the value of READ_DATA_PORT to match the I/O address used
+ by isapnp to access your PnP cards. This is the value of READPORT in
+ /etc/isapnp.conf or given by the output of pnpdump.
+
+3. Build and install the kernel and modules as usual.
+
+4. Boot your new kernel following the usual procedures.
+
+5. Set up to configure the new SB1000 PnP card by capturing the output
+ of "pnpdump" to a file and editing this file to set the correct I/O ports,
+ IRQ, and DMA settings for all your PnP cards. Make sure none of the settings
+ conflict with one another. Then test this configuration by running the
+ "isapnp" command with your new config file as the input. Check for
+ errors and fix as necessary. (As an aside, I use I/O ports 0x110 and
+ 0x310 and IRQ 11 for my SB1000 card and these work well for me. YMMV.)
+ Then save the finished config file as /etc/isapnp.conf for proper
+ configuration on subsequent reboots.
+
+6. Download the original file sb1000-1.1.2.tar.gz from Franco's site or one of
+ the others referenced above. As root, unpack it into a temporary directory
+ and do a ``make cmconfig`` and then ``install -c cmconfig /usr/local/sbin``.
+ Don't do ``make install`` because it expects to find all the utilities built
+ and ready for installation, not just cmconfig.
+
+7. As root, copy all the files under the ppp/ subdirectory in Franco's
+ tar file into /etc/ppp, being careful not to overwrite any files that are
+ already in there. Then modify ppp@gi-on to set the correct login name,
+ phone number, and frequency for the cable modem. Also edit pap-secrets
+ to specify your login name and password and any site-specific information
+ you need.
+
+8. Be sure to modify /etc/ppp/firewall to use ipchains instead of
+ the older ipfwadm commands from the 2.0.x kernels. There's a neat utility to
+ convert ipfwadm commands to ipchains commands:
+
+ http://users.dhp.com/~whisper/ipfwadm2ipchains/
+
+ You may also wish to modify the firewall script to implement a different
+ firewalling scheme.
+
+9. Start the PPP connection via the script /etc/ppp/ppp@gi-on. You must be
+ root to do this. It's better to use a utility like sudo to execute
+ frequently used commands like this with root permissions if possible. If you
+ connect successfully the cable modem interface will come up and you'll see a
+ driver message like this at the console::
+
+ cm0: sb1000 at (0x110,0x310), csn 1, S/N 0x2a0d16d8, IRQ 11.
+ sb1000.c:v1.1.2 6/01/98 (fventuri@mediaone.net)
+
+ The "ifconfig" command should show two new interfaces, ppp0 and cm0.
+
+ The command "cmconfig cm0" will give you information about the cable modem
+ interface.
+
+10. Try pinging a site via ``ping -c 5 www.yahoo.com``, for example. You should
+ see packets received.
+
+11. If you can't get site names (like www.yahoo.com) to resolve into
+ IP addresses (like 204.71.200.67), be sure your /etc/resolv.conf file
+ has no syntax errors and has the right nameserver IP addresses in it.
+ If this doesn't help, try something like ``ping -c 5 204.71.200.67`` to
+ see if the networking is running but the DNS resolution is where the
+ problem lies.
+
+12. If you still have problems, go to the support web sites mentioned above
+ and read the information and documentation there.
+
+Common problems
+===============
+
+1. Packets go out on the ppp0 interface but don't come back on the cm0
+ interface. It looks like I'm connected but I can't even ping any
+ numerical IP addresses. (This happens predominantly on Debian systems due
+ to a default boot-time configuration script.)
+
+Solution
+ As root ``echo 0 > /proc/sys/net/ipv4/conf/cm0/rp_filter`` so it
+ can share the same IP address as the ppp0 interface. Note that this
+ command should probably be added to the /etc/ppp/cablemodem script
+ *right between* the "/sbin/ifconfig" and "/sbin/cmconfig" commands.
+ You may need to do this to /proc/sys/net/ipv4/conf/ppp0/rp_filter as well.
+ If you do this to /proc/sys/net/ipv4/conf/default/rp_filter on each reboot
+ (in rc.local or some such) then any interfaces can share the same IP
+ addresses.
+
+2. I get "unresolved symbol" error messages on executing ``insmod sb1000.o``.
+
+Solution
+ You probably have a non-matching kernel source tree and
+ /usr/include/linux and /usr/include/asm header files. Make sure you
+ install the correct versions of the header files in these two directories.
+ Then rebuild and reinstall the kernel.
+
+3. When isapnp runs it reports an error, and my SB1000 card isn't working.
+
+Solution
+ There's a problem with later versions of isapnp using the "(CHECK)"
+ option in the lines that allocate the two I/O addresses for the SB1000 card.
+ This first popped up on RH 6.0. Delete "(CHECK)" for the SB1000 I/O addresses.
+ Make sure they don't conflict with any other pieces of hardware first! Then
+ rerun isapnp and go from there.
+
+4. I can't execute the /etc/ppp/ppp@gi-on file.
+
+Solution
+ As root do ``chmod ug+x /etc/ppp/ppp@gi-on``.
+
+5. The firewall script isn't working (with 2.2.x and higher kernels).
+
+Solution
+ Use the ipfwadm2ipchains script referenced above to convert the
+ /etc/ppp/firewall script from the deprecated ipfwadm commands to ipchains.
+
+6. I'm getting *tons* of firewall deny messages in the /var/kern.log,
+ /var/messages, and/or /var/syslog files, and they're filling up my /var
+ partition!!!
+
+Solution
+ First, tell your ISP that you're receiving DoS (Denial of Service)
+ and/or portscanning (UDP connection attempts) attacks! Look over the deny
+ messages to figure out what the attack is and where it's coming from. Next,
+ edit /etc/ppp/cablemodem and make sure the ",nobroadcast" option is turned on
+ to the "cmconfig" command (uncomment that line). If you're not receiving these
+ denied packets on your broadcast interface (IP address xxx.yyy.zzz.255
+ typically), then someone is attacking your machine in particular. Be careful
+ out there....
+
+7. Everything seems to work fine but my computer locks up after a while
+ (and typically during a lengthy download through the cable modem)!
+
+Solution
+ You may need to add a short delay in the driver to 'slow down' the
+ SURFboard because your PC might not be able to keep up with the transfer rate
+ of the SB1000. To do this, it's probably best to download Franco's
+ sb1000-1.1.2.tar.gz archive and build and install sb1000.o manually. You'll
+ want to edit the 'Makefile' and look for the 'SB1000_DELAY'
+ define. Uncomment those 'CFLAGS' lines (and comment out the default ones)
+ and try setting the delay to something like 60 microseconds with:
+ '-DSB1000_DELAY=60'. Then do ``make`` and as root ``make install`` and try
+ it out. If it still doesn't work or you like playing with the driver, you may
+ try other numbers. Remember though that the higher the delay, the slower the
+ driver (which slows down the rest of the PC too when it is actively
+ used). Thanks to Ed Daiga for this tip!
+
+Credits
+=======
+
+This README came from Franco Venturi's original README file which is
+still supplied with his driver .tar.gz archive. I and all other sb1000 users
+owe Franco a tremendous "Thank you!" Additional thanks go to Carl Patten
+and Ralph Bonnell who are now managing the Linux SB1000 web site, and to
+the SB1000 users who reported and helped debug the common problems listed
+above.
+
+
+ Clemmitt Sigler
+ csigler@vt.edu
diff --git a/Documentation/networking/device_drivers/can/can327.rst b/Documentation/networking/device_drivers/can/can327.rst
new file mode 100644
index 000000000000..b87bfbe5d51c
--- /dev/null
+++ b/Documentation/networking/device_drivers/can/can327.rst
@@ -0,0 +1,331 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-3-Clause)
+
+can327: ELM327 driver for Linux SocketCAN
+==========================================
+
+Authors
+--------
+
+Max Staudt <max@enpas.org>
+
+
+
+Motivation
+-----------
+
+This driver aims to lower the initial cost for hackers interested in
+working with CAN buses.
+
+CAN adapters are expensive, few, and far between.
+ELM327 interfaces are cheap and plentiful.
+Let's use ELM327s as CAN adapters.
+
+
+
+Introduction
+-------------
+
+This driver is an effort to turn abundant ELM327 based OBD interfaces
+into full-fledged (as far as possible) CAN interfaces.
+
+Since the ELM327 was never meant to be a stand-alone CAN controller,
+the driver has to switch between its modes as quickly as possible in
+order to fake full-duplex operation.
+
+As such, can327 is a best effort driver. However, this is more than
+enough to implement simple request-response protocols (such as OBD II),
+and to monitor broadcast messages on a bus (such as in a vehicle).
+
+Most ELM327s come as nondescript serial devices, attached via USB or
+Bluetooth. The driver cannot recognize them by itself, and as such it
+is up to the user to attach it in the form of a TTY line discipline
+(similar to PPP, SLIP, slcan, ...).
+
+This driver is meant for ELM327 versions 1.4b and up, see below for
+known limitations in older controllers and clones.
+
+
+
+Data sheet
+-----------
+
+The official data sheets can be found at ELM electronics' home page:
+
+ https://www.elmelectronics.com/
+
+
+
+How to attach the line discipline
+----------------------------------
+
+Every ELM327 chip is factory programmed to operate at a serial setting
+of 38400 baud, 8 data bits, no parity, 1 stop bit.
+
+If you have kept this default configuration, the line discipline can
+be attached on a command prompt as follows::
+
+ sudo ldattach \
+ --debug \
+ --speed 38400 \
+ --eightbits \
+ --noparity \
+ --onestopbit \
+ --iflag -ICRNL,INLCR,-IXOFF \
+ 30 \
+ /dev/ttyUSB0
+
+To change the ELM327's serial settings, please refer to its data
+sheet. This needs to be done before attaching the line discipline.
+
+Once the ldisc is attached, the CAN interface starts out unconfigured.
+Set the speed before starting it::
+
+ # The interface needs to be down to change parameters
+ sudo ip link set can0 down
+ sudo ip link set can0 type can bitrate 500000
+ sudo ip link set can0 up
+
+500000 bit/s is a common rate for OBD-II diagnostics.
+If you're connecting straight to a car's OBD port, this is the speed
+that most cars (but not all!) expect.
+
+After this, you can set out as usual with candump, cansniffer, etc.
+
+
+
+How to check the controller version
+------------------------------------
+
+Use a terminal program to attach to the controller.
+
+After issuing the "``AT WS``" command, the controller will respond with
+its version::
+
+ >AT WS
+
+
+ ELM327 v1.4b
+
+ >
+
+Note that clones may claim to be any version they like.
+It is not indicative of their actual feature set.
+
+
+
+
+Communication example
+----------------------
+
+This is a short and incomplete introduction on how to talk to an ELM327.
+It is here to guide understanding of the controller's and the driver's
+limitations (listed below) as well as manual testing.
+
+
+The ELM327 has two modes:
+
+- Command mode
+- Reception mode
+
+In command mode, it expects one command per line, terminated by CR.
+By default, the prompt is a "``>``", after which a command can be
+entered::
+
+ >ATE1
+ OK
+ >
+
+The init script in the driver switches off several configuration options
+that are only meaningful in the original OBD scenario the chip is meant
+for, and are actually a hindrance for can327.
+
+
+When a command is not recognized, such as by an older version of the
+ELM327, a question mark is printed as a response instead of OK::
+
+ >ATUNKNOWN
+ ?
+ >
+
+At present, can327 does not evaluate this response. See the section
+below on known limitations for details.
+
+
+When a CAN frame is to be sent, the target address is configured, after
+which the frame is sent as a command that consists of the data's hex
+dump::
+
+ >ATSH123
+ OK
+ >DEADBEEF12345678
+ OK
+ >
+
+The above interaction sends the SFF frame "``DE AD BE EF 12 34 56 78``"
+with (11 bit) CAN ID ``0x123``.
+For this to function, the controller must be configured for SFF sending
+mode (using "``AT PB``", see code or datasheet).
+
+
+Once a frame has been sent and wait-for-reply mode is on (``ATR1``,
+configured on ``listen-only=off``), or when the reply timeout expires
+and the driver sets the controller into monitoring mode (``ATMA``),
+the ELM327 will send one line for each received CAN frame, consisting
+of CAN ID, DLC, and data::
+
+ 123 8 DEADBEEF12345678
+
+For EFF (29 bit) CAN frames, the address format is slightly different,
+which can327 uses to tell the two apart::
+
+ 12 34 56 78 8 DEADBEEF12345678
+
+The ELM327 will receive both SFF and EFF frames - the current CAN
+config (``ATPB``) does not matter.
+
+
+If the ELM327's internal UART sending buffer runs full, it will abort
+the monitoring mode, print "BUFFER FULL" and drop back into command
+mode. Note that in this case, unlike with other error messages, the
+error message may appear on the same line as the last (usually
+incomplete) data frame::
+
+ 12 34 56 78 8 DEADBEEF123 BUFFER FULL
+
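The line formats shown above can be turned into a small parser. The following is only a rough Python sketch for manual experiments, not the driver's actual parsing code; the names `Frame` and `parse_line` are hypothetical, RTR frames are not handled, and any line carrying "BUFFER FULL" is simply discarded.

```python
from typing import NamedTuple, Optional


class Frame(NamedTuple):
    can_id: int
    extended: bool  # True for EFF (29 bit), False for SFF (11 bit)
    data: bytes


def parse_line(line: str) -> Optional[Frame]:
    """Parse one reception-mode line; return None if it is unusable."""
    if "BUFFER FULL" in line:
        # The error text may share a line with a truncated frame.
        return None
    tokens = line.split()
    try:
        if len(tokens) >= 2 and len(tokens[0]) == 3:
            # SFF, e.g. "123 8 DEADBEEF12345678"
            can_id = int(tokens[0], 16)
            rest = tokens[1:]
            extended = False
        elif len(tokens) >= 5 and all(len(t) == 2 for t in tokens[:4]):
            # EFF, e.g. "12 34 56 78 8 DEADBEEF12345678"
            can_id = int("".join(tokens[:4]), 16)
            rest = tokens[4:]
            extended = True
        else:
            return None
        dlc = int(rest[0], 16)
        # Join the remaining tokens so spaced and unspaced payloads both work.
        data = bytes.fromhex("".join(rest[1:]))
    except ValueError:
        return None
    if len(data) != dlc:
        # Payload shorter or longer than announced: treat as incomplete.
        return None
    return Frame(can_id, extended, data)
```

Comparing the payload length against the DLC is what lets the sketch reject truncated lines instead of guessing at their contents.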
+
+
+Known limitations of the controller
+------------------------------------
+
+- Clone devices ("v1.5" and others)
+
+ Sending RTR frames is not supported; such frames are dropped silently.
+
+ Receiving RTR with DLC 8 will appear to be a regular frame with
+ the last received frame's DLC and payload.
+
+ "``AT CSM``" (CAN Silent Monitoring, i.e. don't send CAN ACKs) is
+ not supported, and is hard coded to ON. Thus, frames are not ACKed
+ while listening: "``AT MA``" (Monitor All) will always be "silent".
+ However, immediately after sending a frame, the ELM327 will be in
+ "receive reply" mode, in which it *does* ACK any received frames.
+ Once the bus goes silent, or an error occurs (such as BUFFER FULL),
+ or the receive reply timeout runs out, the ELM327 will end reply
+ reception mode on its own and can327 will fall back to "``AT MA``"
+ in order to keep monitoring the bus.
+
+ Other limitations may apply, depending on the clone and the quality
+ of its firmware.
+
+
+- All versions
+
+ No full duplex operation is supported. The driver will switch
+ between input/output mode as quickly as possible.
+
+ The length of outgoing RTR frames cannot be set. In fact, some
+ clones (tested with one identifying as "``v1.5``") are unable to
+ send RTR frames at all.
+
+ We don't have a way to get real-time notifications on CAN errors.
+ While there is a command (``AT CS``) to retrieve some basic stats,
+ we don't poll it as it would force us to interrupt reception mode.
+
+
+- Versions prior to 1.4b
+
+ These versions do not send CAN ACKs when in monitoring mode (AT MA).
+ However, they do send ACKs while waiting for a reply immediately
+ after sending a frame. The driver maximizes this time to make the
+ controller as useful as possible.
+
+ Starting with version 1.4b, the ELM327 supports the "``AT CSM``"
+ command, and the "listen-only" CAN option will take effect.
+
+
+- Versions prior to 1.4
+
+ These chips do not support the "``AT PB``" command, and thus cannot
+ change bitrate or SFF/EFF mode on-the-fly. This will have to be
+ programmed by the user before attaching the line discipline. See the
+ data sheet for details.
+
+
+- Versions prior to 1.3
+
+ These chips cannot be used at all with can327. They do not support
+ the "``AT D1``" command, which is necessary to avoid parsing conflicts
+ on incoming data, as well as distinction of RTR frame lengths.
+
+ Specifically, this allows for easy distinction of SFF and EFF
+ frames, and to check whether frames are complete. While it is possible
+ to deduce the type and length from the length of the line the ELM327
+ sends us, this method fails when the ELM327's UART output buffer
+ overruns. It may abort sending in the middle of the line, which will
+ then be mistaken for something else.
+
+
+
+Known limitations of the driver
+--------------------------------
+
+- No 8/7 timing.
+
+ ELM327 can only set CAN bitrates that are of the form 500000/n, where
+ n is an integer divisor.
+ However there is an exception: With a separate flag, it may set the
+ speed to be 8/7 of the speed indicated by the divisor.
+ This mode is not currently implemented.
+
+- No evaluation of command responses.
+
+ The ELM327 will reply with OK when a command is understood, and with ?
+ when it is not. The driver does not currently check this, and simply
+ assumes that the chip understands every command.
+ The driver is built such that functionality degrades gracefully
+ nevertheless. See the section on known limitations of the controller.
+
+- No use of hardware CAN ID filtering
+
+ An ELM327's UART sending buffer will easily overflow on heavy CAN bus
+ load, resulting in the "``BUFFER FULL``" message. Using the hardware
+ filters available through "``AT CF xxx``" and "``AT CM xxx``" would be
+ helpful here, however SocketCAN does not currently provide a facility
+ to make use of such hardware features.
+
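The 500000/n rule described under "No 8/7 timing" above can be checked with a few lines of Python. This is a minimal sketch under the assumption that only the integer-divisor form matters; the function name `bitrate_divisor` is hypothetical, the 8/7 mode is ignored, and no upper bound on n is modelled because the text does not state one.

```python
from typing import Optional

ELM327_BASE_BITRATE = 500_000  # bit/s, from the 500000/n rule


def bitrate_divisor(bitrate: int) -> Optional[int]:
    """Return n such that 500000 / n == bitrate, or None if no integer n exists."""
    if bitrate <= 0 or ELM327_BASE_BITRATE % bitrate != 0:
        return None
    return ELM327_BASE_BITRATE // bitrate
```

A result of None means the requested bitrate cannot be expressed as 500000/n, for example 300000 bit/s.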
+
+
+Rationale behind the chosen configuration
+------------------------------------------
+
+``AT E1``
+ Echo on
+
+ We need this to be able to get a prompt reliably.
+
+``AT S1``
+ Spaces on
+
+ We need this to distinguish 11/29 bit CAN addresses received.
+
+ Note:
+ We can usually do this using the line length (odd/even),
+ but this fails if the line is not transmitted fully to
+ the host (BUFFER FULL).
+
+``AT D1``
+ DLC on
+
+ We need this to tell the "length" of RTR frames.
+
+
+
+A note on CAN bus termination
+------------------------------
+
+Your adapter may have resistors soldered in which are meant to terminate
+the bus. This is correct when it is plugged into an OBD-II socket, but
+not helpful when trying to tap into the middle of an existing CAN bus.
+
+If communications don't work with the adapter connected, check for the
+termination resistors on its PCB and try removing them.
diff --git a/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst b/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst
new file mode 100644
index 000000000000..40c92ea272af
--- /dev/null
+++ b/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst
@@ -0,0 +1,639 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+CTU CAN FD Driver
+=================
+
+Author: Martin Jerabek <martin.jerabek01@gmail.com>
+
+
+About CTU CAN FD IP Core
+------------------------
+
+`CTU CAN FD <https://gitlab.fel.cvut.cz/canbus/ctucanfd_ip_core>`_
+is an open source soft core written in VHDL.
+It originated in 2015 as Ondrej Ille's project
+at the `Department of Measurement <https://meas.fel.cvut.cz/>`_
+of `FEE <http://www.fel.cvut.cz/en/>`_ at `CTU <https://www.cvut.cz/en>`_.
+
+The SocketCAN driver for Xilinx Zynq SoC based MicroZed board
+`Vivado integration <https://gitlab.fel.cvut.cz/canbus/zynq/zynq-can-sja1000-top>`_
+and Intel Cyclone V 5CSEMA4U23C6 based DE0-Nano-SoC Terasic board
+`QSys integration <https://gitlab.fel.cvut.cz/canbus/intel-soc-ctucanfd>`_
+has been developed as well as support for
+`PCIe integration <https://gitlab.fel.cvut.cz/canbus/pcie-ctucanfd>`_ of the core.
+
+In the case of Zynq, the core is connected via the APB system bus, which does
+not have enumeration support, and the device must be specified in Device Tree.
+This kind of device is called a platform device in the kernel and is
+handled by a platform device driver.
+
+The basic functional model of the CTU CAN FD peripheral has been
+accepted into QEMU mainline. See QEMU `CAN emulation support <https://www.qemu.org/docs/master/system/devices/can.html>`_
+for CAN FD buses, host connection and CTU CAN FD core emulation. The development
+version of emulation support can be cloned from ctu-canfd branch of QEMU local
+development `repository <https://gitlab.fel.cvut.cz/canbus/qemu-canbus>`_.
+
+
+About SocketCAN
+---------------
+
+SocketCAN is a standard common interface for CAN devices in the Linux
+kernel. As the name suggests, the bus is accessed via sockets, similarly
+to common network devices. The reasoning behind this is described in depth
+in `Linux SocketCAN <https://www.kernel.org/doc/html/latest/networking/can.html>`_.
+In short, it offers a
+natural way to implement and work with higher layer protocols over CAN,
+in the same way as, e.g., UDP/IP over Ethernet.
+
+Device probe
+~~~~~~~~~~~~
+
+Before going into detail about the structure of a CAN bus device driver,
+let's reiterate how the kernel gets to know about the device at all.
+Some buses, like PCI or PCIe, support device enumeration. That is, when
+the system boots, it discovers all the devices on the bus and reads
+their configuration. The kernel identifies the device via its vendor ID
+and device ID, and if there is a driver registered for this identifier
+combination, its probe method is invoked to populate the driver's
+instance for the given hardware. A similar situation goes with USB, only
+it allows for device hot-plug.
+
+The situation is different for peripherals which are directly embedded
+in the SoC and connected to an internal system bus (AXI, APB, Avalon,
+and others). These buses do not support enumeration, and thus the kernel
+has to learn about the devices from elsewhere. This is exactly what the
+Device Tree was made for.
+
+Device tree
+~~~~~~~~~~~
+
+An entry in device tree states that a device exists in the system, how
+it is reachable (on which bus it resides) and its configuration –
+register addresses, interrupts and so on. An example of such a device
+tree is given below.
+
+::
+
+ / {
+ /* ... */
+ amba: amba {
+ #address-cells = <1>;
+ #size-cells = <1>;
+ compatible = "simple-bus";
+
+ CTU_CAN_FD_0: CTU_CAN_FD@43c30000 {
+ compatible = "ctu,ctucanfd";
+ interrupt-parent = <&intc>;
+ interrupts = <0 30 4>;
+ clocks = <&clkc 15>;
+ reg = <0x43c30000 0x10000>;
+ };
+ };
+ };
+
+
+.. _sec:socketcan:drv:
+
+Driver structure
+~~~~~~~~~~~~~~~~
+
+The driver can be divided into two parts – platform-dependent device
+discovery and set up, and platform-independent CAN network device
+implementation.
+
+.. _sec:socketcan:platdev:
+
+Platform device driver
+^^^^^^^^^^^^^^^^^^^^^^
+
+In the case of Zynq, the core is connected via the AXI system bus, which
+does not have enumeration support, and the device must be specified in
+Device Tree. This kind of device is called a *platform device* in the
+kernel and is handled by a *platform device driver*\ [1]_.
+
+A platform device driver provides the following things:
+
+- A *probe* function
+
+- A *remove* function
+
+- A table of *compatible* devices that the driver can handle
+
+The *probe* function is called exactly once when the device appears (or
+the driver is loaded, whichever happens later). If there are more
+devices handled by the same driver, the *probe* function is called for
+each one of them. Its role is to allocate and initialize resources
+required for handling the device, as well as set up low-level functions
+for the platform-independent layer, e.g., *read_reg* and *write_reg*.
+After that, the driver registers the device to a higher layer, in our
+case as a *network device*.
+
+The *remove* function is called when the device disappears, or the
+driver is about to be unloaded. It serves to free the resources
+allocated in *probe* and to unregister the device from higher layers.
+
+Finally, the table of *compatible* devices states which devices the
+driver can handle. The Device Tree entry ``compatible`` is matched
+against the tables of all *platform drivers*.
+
+.. code:: c
+
+ /* Match table for OF platform binding */
+ static const struct of_device_id ctucan_of_match[] = {
+ { .compatible = "ctu,canfd-2", },
+ { .compatible = "ctu,ctucanfd", },
+ { /* end of list */ },
+ };
+ MODULE_DEVICE_TABLE(of, ctucan_of_match);
+
+ static int ctucan_probe(struct platform_device *pdev);
+ static int ctucan_remove(struct platform_device *pdev);
+
+ static struct platform_driver ctucanfd_driver = {
+ .probe = ctucan_probe,
+ .remove = ctucan_remove,
+ .driver = {
+ .name = DRIVER_NAME,
+ .of_match_table = ctucan_of_match,
+ },
+ };
+ module_platform_driver(ctucanfd_driver);
+
+
+.. _sec:socketcan:netdev:
+
+Network device driver
+^^^^^^^^^^^^^^^^^^^^^
+
+Each network device must support at least these operations:
+
+- Bring the device up: ``ndo_open``
+
+- Bring the device down: ``ndo_close``
+
+- Submit TX frames to the device: ``ndo_start_xmit``
+
+- Signal TX completion and errors to the network subsystem: ISR
+
+- Submit RX frames to the network subsystem: ISR and NAPI
+
+There are two possible event sources: the device and the network
+subsystem. Device events are usually signaled via an interrupt, handled
+in an Interrupt Service Routine (ISR). Handlers for the events
+originating in the network subsystem are then specified in
+``struct net_device_ops``.
+
+When the device is brought up, e.g., by calling ``ip link set can0 up``,
+the driver’s function ``ndo_open`` is called. It should validate the
+interface configuration and configure and enable the device. The
+analogous opposite is ``ndo_close``, called when the device is being
+brought down, be it explicitly or implicitly.
+
+When the system should transmit a frame, it does so by calling
+``ndo_start_xmit``, which enqueues the frame into the device. If the
+device HW queue (FIFO, mailboxes or whatever the implementation is)
+becomes full, the ``ndo_start_xmit`` implementation informs the network
+subsystem that it should stop the TX queue (via ``netif_stop_queue``).
+It is then re-enabled later in ISR when the device has some space
+available again and is able to enqueue another frame.
+
+All the device events are handled in ISR, namely:
+
+#. **TX completion**. When the device successfully finishes transmitting
+ a frame, the frame is echoed locally. On error, an informative error
+ frame [2]_ is sent to the network subsystem instead. In both cases,
+ the software TX queue is resumed so that more frames may be sent.
+
+#. **Error condition**. If something goes wrong (e.g., the device goes
+ bus-off or RX overrun happens), error counters are updated, and
+ informative error frames are enqueued to SW RX queue.
+
+#. **RX buffer not empty**. In this case, read the RX frames and enqueue
+ them to SW RX queue. Usually NAPI is used as a middle layer (see NAPI below).
+
+.. _sec:socketcan:napi:
+
+NAPI
+~~~~
+
+The frequency of incoming frames can be high and the overhead to invoke
+the interrupt service routine for each frame can cause significant
+system load. There are multiple mechanisms in the Linux kernel to deal
+with this situation. They evolved over the years of Linux kernel
+development and enhancements. For network devices, the current standard
+is NAPI – *the New API*. It is similar to classical top-half/bottom-half
+interrupt handling in that it only acknowledges the interrupt in the ISR
+and signals that the rest of the processing should be done in softirq
+context. On top of that, it offers the possibility to *poll* for new
+frames for a while. This has a potential to avoid the costly round of
+enabling interrupts, handling an incoming IRQ in ISR, re-enabling the
+softirq and switching context back to softirq.
+
+More detailed documentation of NAPI may be found on the pages of Linux
+Foundation `<https://wiki.linuxfoundation.org/networking/napi>`_.
+
+Integrating the core to Xilinx Zynq
+-----------------------------------
+
+The core interfaces a simple subset of the Avalon
+(search for Intel **Avalon Interface Specifications**)
+bus as it was originally used on
+Altera FPGA chips, yet Xilinx natively interfaces with AXI
+(search for ARM **AMBA AXI and ACE Protocol Specification AXI3,
+AXI4, and AXI4-Lite, ACE and ACE-Lite**).
+The most obvious solution would be to use
+an Avalon/AXI bridge or implement some simple conversion entity.
+However, the core’s interface is half-duplex with no handshake
+signaling, whereas AXI is full duplex with two-way signaling. Moreover,
+even AXI-Lite slave interface is quite resource-intensive, and the
+flexibility and speed of AXI are not required for a CAN core.
+
+Thus a much simpler bus was chosen – APB (Advanced Peripheral Bus)
+(search for ARM **AMBA APB Protocol Specification**).
+An APB-AXI bridge is directly available in
+Xilinx Vivado, and the interface adaptor entity is just a few simple
+combinatorial assignments.
+
+Finally, to be able to include the core in a block diagram as a custom
+IP, the core, together with the APB interface, has been packaged as a
+Vivado component.
+
+CTU CAN FD Driver design
+------------------------
+
+The general structure of a CAN device driver has already been examined
+above. The next paragraphs provide a more detailed description of the CTU
+CAN FD core driver in particular.
+
+Low-level driver
+~~~~~~~~~~~~~~~~
+
+The core is not intended to be used solely with SocketCAN, and thus it
+is desirable to have an OS-independent low-level driver. This low-level
+driver can then be used in OS driver implementations or directly,
+either on bare metal or in a user-space application. Another advantage
+is that if the hardware slightly changes, only the low-level driver
+needs to be modified.
+
+The code [3]_ is in part automatically generated and in part written
+manually by the core author, with contributions of the thesis’ author.
+The low-level driver supports operations such as: set bit timing, set
+controller mode, enable/disable, read RX frame, write TX frame, and so
+on.
+
+Configuring bit timing
+~~~~~~~~~~~~~~~~~~~~~~
+
+On CAN, each bit is divided into four segments: SYNC, PROP, PHASE1, and
+PHASE2. Their duration is expressed in multiples of a Time Quantum
+(details in `CAN Specification, Version 2.0 <http://esd.cs.ucr.edu/webres/can20.pdf>`_, chapter 8).
+When configuring
+bitrate, the durations of all the segments (and time quantum) must be
+computed from the bitrate and Sample Point. This is performed
+independently for both the Nominal bitrate and Data bitrate for CAN FD.
+
+SocketCAN is fairly flexible and offers either highly customized
+configuration by setting all the segment durations manually, or a
+convenient configuration by setting just the bitrate and sample point
+(and even that is chosen automatically per Bosch recommendation if not
+specified). However, each CAN controller may have different base clock
+frequency and different width of segment duration registers. The
+algorithm thus needs the minimum and maximum values for the durations
+(and clock prescaler) and tries to optimize the numbers to fit both the
+constraints and the requested parameters.
+
+.. code:: c
+
+ struct can_bittiming_const {
+ char name[16]; /* Name of the CAN controller hardware */
+ __u32 tseg1_min; /* Time segment 1 = prop_seg + phase_seg1 */
+ __u32 tseg1_max;
+ __u32 tseg2_min; /* Time segment 2 = phase_seg2 */
+ __u32 tseg2_max;
+ __u32 sjw_max; /* Synchronisation jump width */
+ __u32 brp_min; /* Bit-rate prescaler */
+ __u32 brp_max;
+ __u32 brp_inc;
+ };
+
+
+A curious reader will notice that the durations of the segments PROP_SEG
+and PHASE_SEG1 are not determined separately but rather combined and
+then, by default, the resulting TSEG1 is evenly divided between PROP_SEG
+and PHASE_SEG1. In practice, this has virtually no consequences as the
+sample point is between PHASE_SEG1 and PHASE_SEG2. In CTU CAN FD,
+however, the duration registers ``PROP`` and ``PH1`` have different
+widths (6 and 7 bits, respectively), so the auto-computed values might
+overflow the shorter register and must thus be redistributed among the
+two [4]_.
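+
+One possible redistribution can be sketched as follows. ``split_tseg1`` and
+``struct seg_split`` are hypothetical names, and the register limits are
+passed in as parameters rather than read from the hardware; the actual
+low-level driver functions may differ in detail. Because the sample point
+lies after PHASE_SEG1, moving time quanta between the two segments leaves
+it unchanged.
+

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: how TSEG1 is divided between the two segments. */
struct seg_split {
    uint32_t prop_seg;
    uint32_t phase_seg1;
};

/*
 * Split tseg1 evenly, then move whatever does not fit into one register
 * over to the other one. Returns false if tseg1 cannot be represented even
 * with both registers at their maximum.
 */
bool split_tseg1(uint32_t tseg1, uint32_t prop_max, uint32_t ph1_max,
                 struct seg_split *out)
{
    uint32_t prop, ph1;

    assert(out);
    if ((uint64_t)tseg1 > (uint64_t)prop_max + ph1_max)
        return false;

    /* Default: even split, the odd time quantum goes to PHASE_SEG1. */
    prop = tseg1 / 2;
    ph1 = tseg1 - prop;

    /* At most one of the two can overflow, given the check above. */
    if (prop > prop_max) {
        ph1 += prop - prop_max;
        prop = prop_max;
    } else if (ph1 > ph1_max) {
        prop += ph1 - ph1_max;
        ph1 = ph1_max;
    }

    out->prop_seg = prop;
    out->phase_seg1 = ph1;
    return true;
}
```

+
+For example, with limits of 63 and 127 (6-bit and 7-bit registers), a
+``tseg1`` of 150 is first split into 75 + 75 and then adjusted to
+``prop_seg = 63`` and ``phase_seg1 = 87``.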
+
+Handling RX
+~~~~~~~~~~~
+
+Frame reception is handled in the NAPI queue, which is enabled from the ISR when
+the RXNE (RX FIFO Not Empty) bit is set. Frames are read one by one
+until either no frame is left in the RX FIFO or the maximum work quota
+has been reached for the NAPI poll run (see :ref:`sec:socketcan:napi`). Each frame is then passed
+to the network interface RX queue.
+
+An incoming frame may be either a CAN 2.0 frame or a CAN FD frame. The
+way to distinguish between these two in the kernel is to allocate either
+``struct can_frame`` or ``struct canfd_frame``, the two having different
+sizes. In the controller, the information about the frame type is stored
+in the first word of RX FIFO.
+
+This brings us to a chicken-and-egg problem: we want to allocate the ``skb``
+for the frame, and only if it succeeds, fetch the frame from the FIFO;
+otherwise keep it there for later. But to be able to allocate the
+correct ``skb``, we have to fetch the first word of the FIFO. There are
+several possible solutions:
+
+#. Read the word, then allocate. If it fails, discard the rest of the
+ frame. When the system is low on memory, the situation is bad anyway.
+
+#. Always allocate ``skb`` big enough for an FD frame beforehand. Then
+ tweak the ``skb`` internals to look like it has been allocated for
+ the smaller CAN 2.0 frame.
+
+#. Add option to peek into the FIFO instead of consuming the word.
+
+#. If the allocation fails, store the read word into driver’s data. On
+ the next try, use the stored word instead of reading it again.
+
+Option 1 is simple enough, but not very satisfying if we could do
+better. Option 2 is not acceptable, as it would require modifying the
+private state of an integral kernel structure. The slightly higher
+memory consumption is just a virtual cherry on top of the “cake”. Option
+3 requires non-trivial HW changes and is not ideal from the HW point of
+view.
+
+Option 4 seems like a good compromise, with its disadvantage being that
+a partial frame may stay in the FIFO for a prolonged time. Nonetheless,
+there may be just one owner of the RX FIFO, and thus no one else should
+see the partial frame (disregarding some exotic debugging scenarios).
+Besides, the driver resets the core on its initialization, so the
+partial frame cannot be “adopted” either. In the end, option 4 was
+selected [5]_.
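+
+The idea behind option 4 can be sketched in a small user-space model. All
+names here (``model_rx``, ``model_rx_begin``) are hypothetical, the RX FIFO
+is simulated by an array, and the ``alloc_ok`` parameter stands in for the
+outcome of the ``skb`` allocation; this is an illustration of the idea, not
+the driver's code.
+

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_FIFO_WORDS 16

/* Hypothetical model of the RX path state; not the driver's structures. */
struct model_rx {
    uint32_t fifo[MODEL_FIFO_WORDS]; /* simulated hardware RX FIFO */
    size_t rd;                       /* index of the next word to read */
    size_t wr;                       /* one past the last valid word */
    bool have_saved_word;            /* a first word was consumed earlier */
    uint32_t saved_word;             /* ... and is kept here for the retry */
};

/*
 * Start receiving one frame. Returns true and stores the frame's first word
 * in *first_word if processing may continue. Returns false if the FIFO is
 * empty or if the allocation failed; in the latter case the consumed word
 * is kept in the driver data and reused on the next attempt.
 */
bool model_rx_begin(struct model_rx *rx, bool alloc_ok, uint32_t *first_word)
{
    assert(rx && first_word);

    if (!rx->have_saved_word) {
        if (rx->rd == rx->wr)
            return false;                   /* nothing to receive */
        rx->saved_word = rx->fifo[rx->rd++]; /* consumes the word */
        rx->have_saved_word = true;
    }

    if (!alloc_ok)
        return false;                       /* retry later with saved word */

    *first_word = rx->saved_word;
    rx->have_saved_word = false;
    return true;
}
```

+
+After a failed attempt, the next call does not touch the FIFO again and
+returns the word that was saved the first time.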
+
+.. _subsec:ctucanfd:rxtimestamp:
+
+Timestamping RX frames
+^^^^^^^^^^^^^^^^^^^^^^
+
+The CTU CAN FD core reports the exact timestamp when the frame has been
+received. The timestamp is by default captured at the sample point of
+the last bit of EOF but is configurable to be captured at the SOF bit.
+The timestamp source is external to the core and may be up to 64 bits
+wide. At the time of writing, passing the timestamp from kernel to
+userspace is not yet implemented, but is planned in the future.
+
+Handling TX
+~~~~~~~~~~~
+
+The CTU CAN FD core has 4 independent TX buffers, each with its own
+state and priority. When the core wants to transmit, a TX buffer in
+Ready state with the highest priority is selected.
+
+The priorities are 3-bit numbers in register TX_PRIORITY
+(nibble-aligned). This should be flexible enough for most use cases.
+SocketCAN, however, supports only one FIFO queue for outgoing
+frames [6]_. The buffer priorities may be used to simulate the FIFO
+behavior by assigning each buffer a distinct priority and *rotating* the
+priorities after a frame transmission is completed.
+
+In addition to priority rotation, the SW must maintain head and tail
+pointers into the FIFO formed by the TX buffers to be able to determine
+which buffer should be used for next frame (``txb_head``) and which
+should be the first completed one (``txb_tail``). The actual buffer
+indices are (obviously) modulo 4 (number of TX buffers), but the
+pointers must be at least one bit wider to be able to distinguish
+between FIFO full and FIFO empty – in both situations,
+:math:`txb\_head \equiv txb\_tail\ (\textrm{mod}\ 4)`. An example of how
+the FIFO is maintained, together with priority rotation, is depicted in
+the tables below.
+
+|
+
++------+---+---+---+---+
+| TXB# | 0 | 1 | 2 | 3 |
++======+===+===+===+===+
+| Seq | A | B | C | |
++------+---+---+---+---+
+| Prio | 7 | 6 | 5 | 4 |
++------+---+---+---+---+
+| | | T | | H |
++------+---+---+---+---+
+
+|
+
++------+---+---+---+---+
+| TXB# | 0 | 1 | 2 | 3 |
++======+===+===+===+===+
+| Seq | | B | C | |
++------+---+---+---+---+
+| Prio | 4 | 7 | 6 | 5 |
++------+---+---+---+---+
+| | | T | | H |
++------+---+---+---+---+
+
+|
+
++------+---+---+---+---+----+
+| TXB# | 0 | 1 | 2 | 3 | 0’ |
++======+===+===+===+===+====+
+| Seq | E | B | C | D | |
++------+---+---+---+---+----+
+| Prio | 4 | 7 | 6 | 5 | |
++------+---+---+---+---+----+
+| | | T | | | H |
++------+---+---+---+---+----+
+
+|
+
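+The bookkeeping described above can be modelled in user space as follows.
+The names (``model_tx``, ``model_tx_queue``, ``model_tx_complete``) are
+hypothetical and the priorities are kept in a plain array instead of the
+TX_PRIORITY register; the initial priorities 7, 6, 5, 4 are taken from the
+first table.
+

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NTXB 4u

/* Hypothetical model of the TX bookkeeping; not the driver's structures. */
struct model_tx {
    uint32_t head;             /* free-running count of queued frames */
    uint32_t tail;             /* free-running count of completed frames */
    uint8_t prio[MODEL_NTXB];  /* per-buffer priority, higher wins */
};

void model_tx_init(struct model_tx *tx)
{
    assert(tx);
    tx->head = 0;
    tx->tail = 0;
    for (uint32_t i = 0; i < MODEL_NTXB; i++)
        tx->prio[i] = (uint8_t)(7 - i);   /* 7, 6, 5, 4 */
}

/* Head and tail are wider than the buffer index, so full != empty. */
bool model_tx_full(const struct model_tx *tx)
{
    return tx->head - tx->tail == MODEL_NTXB;
}

bool model_tx_empty(const struct model_tx *tx)
{
    return tx->head == tx->tail;
}

/* Pick the buffer for the next frame; returns its index or -1 if full. */
int model_tx_queue(struct model_tx *tx)
{
    if (model_tx_full(tx))
        return -1;
    return (int)(tx->head++ % MODEL_NTXB);
}

/*
 * Retire the oldest frame and rotate the priorities: the buffer that just
 * finished drops to the lowest priority, all others move up by one.
 * Returns the index of the retired buffer or -1 if the FIFO is empty.
 */
int model_tx_complete(struct model_tx *tx)
{
    uint8_t last;

    if (model_tx_empty(tx))
        return -1;

    last = tx->prio[MODEL_NTXB - 1];
    for (uint32_t i = MODEL_NTXB - 1; i > 0; i--)
        tx->prio[i] = tx->prio[i - 1];
    tx->prio[0] = last;

    return (int)(tx->tail++ % MODEL_NTXB);
}
```

+
+Queuing three frames and completing the first one changes the priorities
+from 7, 6, 5, 4 to 4, 7, 6, 5, as in the first two tables. Queuing two more
+frames then fills the FIFO, at which point head and tail are again equal
+modulo 4 even though the FIFO is not empty.
+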
+.. kernel-figure:: fsm_txt_buffer_user.svg
+
+ TX Buffer states with possible transitions
+
+.. _subsec:ctucanfd:txtimestamp:
+
+Timestamping TX frames
+^^^^^^^^^^^^^^^^^^^^^^
+
+When submitting a frame to a TX buffer, one may specify the timestamp at
+which the frame should be transmitted. The frame transmission may start
+later, but not sooner. Note that the timestamp does not participate in
+buffer prioritization – that is decided solely by the mechanism
+described above.
+
+Support for time-based packet transmission was merged in Linux
+v4.19 (see `Time-based packet transmission <https://lwn.net/Articles/748879/>`_),
+but it remains to be researched
+whether this functionality will be practical for CAN.
+
+Similarly to retrieving the timestamp of RX frames, the core also
+supports retrieving the timestamp of TX frames – that is the time when
+the frame was successfully delivered. The particulars are very similar
+to timestamping RX frames and are described in :ref:`subsec:ctucanfd:rxtimestamp`.
+
+Handling RX buffer overrun
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a received frame no longer fits into the hardware RX FIFO in its
+entirety, the RX FIFO overrun flag (STATUS[DOR]) is set and the Data Overrun
+Interrupt (DOI) is triggered. When servicing the interrupt, care must be
+taken first to clear the DOR flag (via COMMAND[CDO]) and after that
+clear the DOI interrupt flag. Otherwise, the interrupt would be
+immediately [7]_ rearmed.
+
+**Note**: During development, it was discussed whether the internal HW
+pipelining could disrupt this clear sequence and whether an additional
+dummy cycle is necessary between clearing the flag and the interrupt. On
+the Avalon interface, this indeed proved to be the case, but APB is
+safe because it uses 2-cycle transactions. Essentially, the DOR flag
+would be cleared, but DOI register’s Preset input would still be high
+the cycle when the DOI clear request would also be applied (by setting
+the register’s Reset input high). As Set had higher priority than Reset,
+the DOI flag would not be reset. This has been already fixed by swapping
+the Set/Reset priority (see issue #187).
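+
+Why the order of the two clears matters can be illustrated with a
+deliberately simplified model. ``model_core`` and the functions below are
+hypothetical and reduce the hardware to two flags; they do not reflect the
+real register map or timing.
+

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical two-flag model of the core; not the real register map. */
struct model_core {
    bool dor;  /* STATUS[DOR]: the data overrun condition */
    bool doi;  /* Data Overrun Interrupt flag */
};

/* Models COMMAND[CDO]: clears the overrun condition itself. */
void model_clear_dor(struct model_core *core)
{
    assert(core);
    core->dor = false;
}

/*
 * Models clearing the interrupt flag: while the overrun condition is still
 * present, the flag is set again right away.
 */
void model_clear_doi(struct model_core *core)
{
    assert(core);
    core->doi = core->dor;
}

/* Correct servicing order: the condition first, the interrupt second. */
void model_service_overrun(struct model_core *core)
{
    model_clear_dor(core);
    model_clear_doi(core);
}
```

+
+In this model, calling ``model_clear_doi`` alone while DOR is still set
+leaves the interrupt flag raised, whereas ``model_service_overrun`` clears
+both flags.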
+
+Reporting Error Passive and Bus Off conditions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It may be desirable to report when the node reaches *Error Passive*,
+*Error Warning*, and *Bus Off* conditions. The driver is notified about
+error state change by an interrupt (EPI, EWLI), and then proceeds to
+determine the core’s error state by reading its error counters.
+
+There is, however, a slight race condition here – there is a delay
+between the time when the state transition occurs (and the interrupt is
+triggered) and when the error counters are read. When EPI is received,
+the node may be either *Error Passive* or *Bus Off*. If the node goes
+*Bus Off*, it obviously remains in the state until it is reset.
+Otherwise, the node is *or was* *Error Passive*. However, it may happen
+that the read state is *Error Warning* or even *Error Active*. It may be
+unclear whether and what exactly to report in that case, but I
+personally entertain the idea that the past error condition should still
+be reported. Similarly, when EWLI is received but the state is later
+detected to be *Error Passive*, *Error Passive* should be reported.
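+
+The reporting rule suggested above, as opposed to reporting only the state
+read from the counters, can be sketched as follows. The enums and
+``model_state_to_report`` are hypothetical names for illustration; the
+function reports the more severe of the state implied by the interrupt and
+the state actually read.
+

```c
#include <assert.h>

/* Hypothetical state and interrupt names, ordered by increasing severity. */
enum model_state {
    MODEL_ERROR_ACTIVE,
    MODEL_ERROR_WARNING,
    MODEL_ERROR_PASSIVE,
    MODEL_BUS_OFF
};

enum model_irq {
    MODEL_IRQ_EWLI,  /* error warning limit reached */
    MODEL_IRQ_EPI    /* error passive (or bus off) reached */
};

/*
 * The interrupt implies a minimum state that the node is in or was in;
 * the counters read later may indicate something milder or more severe.
 */
enum model_state model_state_to_report(enum model_irq irq,
                                       enum model_state read_state)
{
    enum model_state implied;

    assert(read_state <= MODEL_BUS_OFF);
    implied = (irq == MODEL_IRQ_EPI) ? MODEL_ERROR_PASSIVE
                                     : MODEL_ERROR_WARNING;
    return read_state > implied ? read_state : implied;
}
```

+
+With this rule, EPI followed by a read of *Error Active* still reports
+*Error Passive*, and EWLI followed by a read of *Error Passive* reports
+*Error Passive*.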
+
+
+CTU CAN FD Driver Sources Reference
+-----------------------------------
+
+.. kernel-doc:: drivers/net/can/ctucanfd/ctucanfd.h
+ :internal:
+
+.. kernel-doc:: drivers/net/can/ctucanfd/ctucanfd_base.c
+ :internal:
+
+.. kernel-doc:: drivers/net/can/ctucanfd/ctucanfd_pci.c
+ :internal:
+
+.. kernel-doc:: drivers/net/can/ctucanfd/ctucanfd_platform.c
+ :internal:
+
+CTU CAN FD IP Core and Driver Development Acknowledgment
+---------------------------------------------------------
+
+* Ondrej Ille <ondrej.ille@gmail.com>
+
+ * started the project as a student at the Department of Measurement, FEE, CTU
+ * invested a great amount of personal time and enthusiasm in the project over the years
+ * worked on more funded tasks
+
+* `Department of Measurement <https://meas.fel.cvut.cz/>`_,
+ `Faculty of Electrical Engineering <http://www.fel.cvut.cz/en/>`_,
+ `Czech Technical University <https://www.cvut.cz/en>`_
+
+ * has been the main investor in the project over many years
+ * uses project in their CAN/CAN FD diagnostics framework for `Skoda Auto <https://www.skoda-auto.cz/>`_
+
+* `Digiteq Automotive <https://www.digiteqautomotive.com/en>`_
+
+ * funding of the project CAN FD Open Cores Support Linux Kernel Based Systems
+ * negotiated and paid CTU to allow public access to the project
+ * provided additional funding of the work
+
+* `Department of Control Engineering <https://control.fel.cvut.cz/en>`_,
+ `Faculty of Electrical Engineering <http://www.fel.cvut.cz/en/>`_,
+ `Czech Technical University <https://www.cvut.cz/en>`_
+
+ * solving the project CAN FD Open Cores Support Linux Kernel Based Systems
+ * providing GitLab management
+ * virtual servers and computational power for continuous integration
+ * providing hardware for HIL continuous integration tests
+
+* `PiKRON Ltd. <http://pikron.com/>`_
+
+ * minor funding to initiate preparation of the project open-sourcing
+
+* Petr Porazil <porazil@pikron.com>
+
+ * design of PCIe transceiver addon board and assembly of boards
+ * design and assembly of MZ_APO baseboard for MicroZed/Zynq based system
+
+* Martin Jerabek <martin.jerabek01@gmail.com>
+
+ * Linux driver development
+ * continuous integration platform architect and GHDL updates
+ * thesis `Open-source and Open-hardware CAN FD Protocol Support <https://dspace.cvut.cz/bitstream/handle/10467/80366/F3-DP-2019-Jerabek-Martin-Jerabek-thesis-2019-canfd.pdf>`_
+
+* Jiri Novak <jnovak@fel.cvut.cz>
+
+ * project initiation, management and use at Department of Measurement, FEE, CTU
+
+* Pavel Pisa <pisa@cmp.felk.cvut.cz>
+
+ * initiate open-sourcing, project coordination, management at Department of Control Engineering, FEE, CTU
+
+* Jaroslav Beran <jara.beran@gmail.com>
+
+ * system integration for Intel SoC, core and driver testing and updates
+
+* Carsten Emde (`OSADL <https://www.osadl.org/>`_)
+
+ * provided OSADL expertise to discuss IP core licensing
+ * pointed to a possible deadlock for LGPL and a possible CAN bus patent case, which led to relicensing the IP core design under a BSD-like license
+
+* Reiner Zitzmann and Holger Zeltwanger (`CAN in Automation <https://www.can-cia.org/>`_)
+
+ * provided suggestions and help to inform community about the project and invited us to events focused on CAN bus future development directions
+
+* Jan Charvat
+
+ * implemented CTU CAN FD functional model for QEMU which has been integrated into QEMU mainline (`docs/system/devices/can.rst <https://www.qemu.org/docs/master/system/devices/can.html>`_)
+ * Bachelor thesis Model of CAN FD Communication Controller for QEMU Emulator
+
+Notes
+-----
+
+
+.. [1]
+ Other buses have their own specific driver interface to set up the
+ device.
+
+.. [2]
+ Not to be mistaken with CAN Error Frame. This is a ``can_frame`` with
+ ``CAN_ERR_FLAG`` set and some error info in its ``data`` field.
+
+.. [3]
+ Available in CTU CAN FD repository
+ `<https://gitlab.fel.cvut.cz/canbus/ctucanfd_ip_core>`_
+
+.. [4]
+ As is done in the low-level driver functions
+ ``ctucan_hw_set_nom_bittiming`` and
+ ``ctucan_hw_set_data_bittiming``.
+
+.. [5]
+ At the time of writing this thesis, option 1 is still being used and
+ the modification is queued in gitlab issue #222
+
+.. [6]
+ Strictly speaking, multiple CAN TX queues are supported since v4.19
+ `can: enable multi-queue for SocketCAN devices <https://lore.kernel.org/patchwork/patch/913526/>`_ but no mainline driver is using
+ them yet.
+
+.. [7]
+ Or rather in the next clock cycle
diff --git a/Documentation/networking/device_drivers/can/ctu/fsm_txt_buffer_user.svg b/Documentation/networking/device_drivers/can/ctu/fsm_txt_buffer_user.svg
new file mode 100644
index 000000000000..b371650788f4
--- /dev/null
+++ b/Documentation/networking/device_drivers/can/ctu/fsm_txt_buffer_user.svg
@@ -0,0 +1,151 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<svg width="113.611mm" height="86.6873mm" version="1.1" viewBox="0 0 113.611 86.6873" xmlns="http://www.w3.org/2000/svg" xmlns:cc="http://creativecommons.org/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
+ <defs>
+ <marker id="marker3667" overflow="visible" orient="auto">
+ <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker3517" overflow="visible" orient="auto">
+ <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker3373" overflow="visible" orient="auto">
+ <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker3199" overflow="visible" orient="auto">
+ <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker3037" overflow="visible" orient="auto">
+ <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker2779" overflow="visible" orient="auto">
+ <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker2477" overflow="visible" orient="auto">
+ <path transform="scale(.6) rotate(180) translate(0)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker2074" overflow="visible" orient="auto">
+ <path transform="scale(.6) rotate(180) translate(0)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker1964" overflow="visible" orient="auto">
+ <path transform="scale(.6) rotate(180) translate(0)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="marker1856" overflow="visible" orient="auto">
+ <path transform="scale(.6) rotate(180) translate(0)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <marker id="Arrow2Mend" overflow="visible" orient="auto">
+ <path transform="scale(.6) rotate(180) translate(0)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <filter id="filter1204" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ <marker id="marker2074-3" overflow="visible" orient="auto">
+ <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/>
+ </marker>
+ <filter id="filter1204-6" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ <filter id="filter1204-6-9" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ <filter id="filter1204-6-2" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ <filter id="filter1204-6-2-9" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ <filter id="filter1204-6-2-9-4" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ <filter id="filter1204-6-2-9-1" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ <filter id="filter1204-6-2-9-1-3" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ <filter id="filter1204-6-2-9-1-3-1" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB">
+ <feGaussianBlur stdDeviation="0.00018829868"/>
+ </filter>
+ </defs>
+ <metadata>
+ <rdf:RDF>
+ <cc:Work rdf:about="">
+ <dc:format>image/svg+xml</dc:format>
+ <dc:type rdf:resource="http://purl.org/dc/dcmitype/StillImage"/>
+ <dc:title/>
+ </cc:Work>
+ </rdf:RDF>
+ </metadata>
+ <g transform="translate(-49.0277 -104.823)">
+ <g>
+ <path d="m130.534 165.429h-71.1816v-17.5315" fill="none" marker-end="url(#marker2477)" stroke="#28a4ff" stroke-width=".6"/>
+ <path d="m145.034 122.959v-11.5914h-43.1215" fill="none" marker-end="url(#marker3037)" stroke="#28a4ff" stroke-width=".6"/>
+ <rect x="130.679" y="122.933" width="28.2965" height="45.2319" rx="0" ry="0" fill="#e5e5e5" stroke="#717171" stroke-linecap="square" stroke-width=".499999"/>
+ <path d="m102.044 116.236h23.3126l-0.13388 18.8185h19.9383v3.66603" fill="none" marker-end="url(#marker3199)" stroke="#28a4ff" stroke-width=".6"/>
+ <path d="m59.5006 138.391v-24.2517h20.6338" fill="none" marker-end="url(#marker2779)" stroke="#28a4ff" stroke-width=".6"/>
+ <rect x="78.1389" y="126.411" width="28.0037" height="35.0443" rx="0" ry="0" fill="#e5e5e5" stroke="#717171" stroke-linecap="square" stroke-width=".5"/>
+ </g>
+ <g fill="#ffcb35" stroke="#000" stroke-linecap="square">
+ <ellipse cx="92.1408" cy="114.239" rx="10.8866" ry="4.39308" stroke-width=".5"/>
+ <ellipse cx="92.1408" cy="134.185" rx="10.8866" ry="4.39308" stroke-width=".499999"/>
+ <ellipse cx="92.1408" cy="152.199" rx="10.8866" ry="4.39308" stroke-width=".499999"/>
+ </g>
+ <g fill="#28a4ff" stroke="#000" stroke-linecap="square" stroke-width=".499999">
+ <ellipse cx="144.827" cy="143.316" rx="10.8866" ry="4.39308"/>
+ <ellipse cx="144.827" cy="159.143" rx="10.8866" ry="4.39308"/>
+ <ellipse cx="59.4364" cy="142.823" rx="7.36455" ry="4.39308"/>
+ <ellipse cx="144.827" cy="129.196" rx="10.8866" ry="4.39308"/>
+ <ellipse cx="143.077" cy="180.53" rx="10.8866" ry="4.39308"/>
+ </g>
+ <ellipse cx="110.386" cy="180.53" rx="10.8866" ry="4.39308" fill="#ffcb35" stroke="#000" stroke-linecap="square" stroke-width=".499999"/>
+ <text x="110.90907" y="179.42688" font-size="3.175px" xml:space="preserve"><tspan x="110.90907" y="179.42688" dy="0.60000002" text-align="center" text-anchor="middle">Accessible</tspan><tspan x="110.90907" y="183.39563"><tspan font-size="3.175px" text-align="center" text-anchor="middle">for S</tspan>W</tspan></text>
+ <text x="143.5869" y="179.52795" xml:space="preserve"><tspan x="143.5869" y="179.52795" dy="1 0 0 0 0 0" font-family="sans-serif" font-size="2.82222px" text-align="center" text-anchor="middle" style="font-variant-caps:normal;font-variant-east-asian:normal;font-variant-ligatures:normal;font-variant-numeric:normal">Inaccessible</tspan><tspan x="143.5869" y="183.36786" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">for S</tspan>W</tspan></text>
+ <g font-size="3.175px">
+ <text x="91.95018" y="115.29005" xml:space="preserve"><tspan x="91.95018" y="115.29005" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">Ready</tspan></tspan></text>
+ <text x="145.25127" y="130.49019" xml:space="preserve"><tspan x="145.25127" y="130.49019" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">TX OK</tspan></tspan></text>
+ <text x="145.31845" y="144.43121" xml:space="preserve"><tspan x="145.31845" y="144.43121" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">Aborted</tspan></tspan></text>
+ <text x="145.40399" y="160.36035" xml:space="preserve"><tspan x="145.40399" y="160.36035" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">TX failed</tspan></tspan></text>
+ <text x="91.823967" y="133.53941" text-align="center" text-anchor="middle" style="line-height:0.9" xml:space="preserve"><tspan x="91.823967" y="133.53941" text-align="center"><tspan font-size="3.175px" text-align="center" text-anchor="middle">TX in</tspan></tspan><tspan x="91.823967" y="136.39691" text-align="center">progress</tspan></text>
+ <text x="91.648918" y="151.84813" text-align="center" text-anchor="middle" style="line-height:0.9" xml:space="preserve"><tspan x="91.648918" y="151.84813" text-align="center"><tspan font-size="3.175px" text-align="center" text-anchor="middle">Abort in</tspan></tspan><tspan x="91.648918" y="154.70563" text-align="center">progress</tspan></text>
+ <text x="59.456043" y="143.91658" xml:space="preserve"><tspan x="59.456043" y="143.91658" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">Empty</tspan></tspan></text>
+ </g>
+ <g fill="none">
+ <g stroke="#000">
+ <rect x="52.3943" y="171.63" width="106.581" height="16.601" rx="0" ry="0" stroke-linecap="square" stroke-width=".499999"/>
+ <g stroke-width=".6">
+ <path d="m106.383 159.046h26.4967" marker-end="url(#Arrow2Mend)"/>
+ <path d="m103.138 152.268h41.5564v-3.92426" marker-end="url(#marker1856)"/>
+ <path d="m106.38 129.354h17.7785"/>
+ <path d="m125.818 129.359h7.2418" marker-end="url(#marker1964)"/>
+ </g>
+ <path d="m124.169 129.354a0.959514 0.97091 0 0 1 0.47587-0.84557 0.959514 0.97091 0 0 1 0.96164-3e-3 0.959514 0.97091 0 0 1 0.48149 0.84231" stroke-linecap="square" stroke-width=".600001"/>
+ <path d="m55.7026 180.832h34.8131" marker-end="url(#marker2074)" stroke-width=".6"/>
+ </g>
+ <g>
+ <path d="m55.6464 185.744h34.8131" marker-end="url(#marker2074-3)" stroke="#28a4ff" stroke-width=".600001"/>
+ <g stroke-width=".6">
+ <path d="m94.0487 129.889v-10.6493" marker-end="url(#marker3373)" stroke="#000"/>
+ <path d="m89.7534 118.621v10.662" marker-end="url(#marker3517)" stroke="#000"/>
+ <path d="m92.119 138.812v7.9718" marker-end="url(#marker3667)" stroke="#28a4ff"/>
+ </g>
+ </g>
+ </g>
+ <text transform="matrix(.264583 0 0 .264583 91.8919 139.964)" x="26.959213" y="9.11724" fill="#2aa1ff" filter="url(#filter1204-6-2-9-1-3-1)" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle" style="line-height:1.1" xml:space="preserve"><tspan x="26.959213" y="9.11724" text-align="center">Set</tspan><tspan x="26.959213" y="22.31724" text-align="center">abort</tspan></text>
+ <text transform="translate(49.0277 104.823)" x="57.620724" y="16.855087" filter="url(#filter1204)" font-size="3.175px" text-align="center" text-anchor="middle" style="line-height:1.1" xml:space="preserve"><tspan x="57.620724" y="16.855087" text-align="center">Transmission</tspan><tspan x="57.620724" y="20.347588" text-align="center">unsuccessful</tspan></text>
+ <g font-size="12px" stroke-width="3.77953" text-anchor="middle">
+ <text transform="matrix(.264583 0 0 .264583 68.5988 118.913)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">starts</tspan></text>
+ <text transform="matrix(.264583 0 0 .264583 106.802 130.509)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">successful</tspan></text>
+ <text transform="matrix(.264583 0 0 .264583 107.77 145.476)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">aborted</tspan></text>
+ </g>
+ <g stroke-width="3.77953" text-anchor="middle">
+ <text transform="matrix(.264583 0 0 .264583 107.574 155.948)" x="38.824219" y="9.1171875" filter="url(#filter1204)" font-size="10.6667px" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Retransmit</tspan><tspan x="38.824219" y="20.850557" text-align="center">limit reached or</tspan><tspan x="38.824219" y="32.583927" text-align="center">node went bus off</tspan><tspan x="38.824219" y="44.317299" text-align="center"/></text>
+ <text transform="matrix(.264583 0 0 .264583 60.7127 177.384)" x="38.824539" y="9.1173134" filter="url(#filter1204-6)" font-size="12px" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824539" y="9.1173134" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Transmission result</tspan></text>
+ <text transform="matrix(.264583 0 0 .264583 45.6885 173.226)" x="57.727047" y="9.11724" filter="url(#filter1204-6-9)" font-size="12px" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="57.727047" y="9.11724" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Legend:</tspan></text>
+ </g>
+ <g fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-anchor="middle">
+ <text transform="matrix(.264583 0 0 .264583 57.0045 182.079)" x="57.727047" y="9.11724" filter="url(#filter1204-6-2)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="57.727047" y="9.11724" fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">SW command</tspan></text>
+ <text transform="matrix(.264583 0 0 .264583 57.7865 110.104)" x="40.822609" y="9.11724" filter="url(#filter1204-6-2-9)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="40.822609" y="9.11724" fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Set ready</tspan></text>
+ <text transform="matrix(.264583 0 0 .264583 116.893 107.491)" x="28.049065" y="9.1172523" filter="url(#filter1204-6-2-9-4)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="28.049065" y="9.1172523" fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Set ready</tspan></text>
+ <text transform="matrix(.264583 0 0 .264583 87.5687 166.324)" x="28.049065" y="9.1172523" filter="url(#filter1204-6-2-9-1)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="28.049065" y="9.1172523" fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Set empty</tspan></text>
+ <text transform="matrix(.264583 0 0 .264583 106.53 113.074)" x="30.228771" y="8.9063139" filter="url(#filter1204-6-2-9-1-3)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="30.228771" y="8.9063139" fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Set abort</tspan></text>
+ </g>
+ </g>
+</svg>
diff --git a/Documentation/networking/device_drivers/can/freescale/flexcan.rst b/Documentation/networking/device_drivers/can/freescale/flexcan.rst
new file mode 100644
index 000000000000..106cd2890135
--- /dev/null
+++ b/Documentation/networking/device_drivers/can/freescale/flexcan.rst
@@ -0,0 +1,54 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=============================
+Flexcan CAN Controller driver
+=============================
+
+Authors: Marc Kleine-Budde <mkl@pengutronix.de>,
+Dario Binacchi <dario.binacchi@amarulasolutions.com>
+
+On/off RTR frames reception
+===========================
+
+For most flexcan IP cores the driver supports 2 RX modes:
+
+- FIFO
+- mailbox
+
+The older flexcan cores (integrated into the i.MX25, i.MX28, i.MX35
+and i.MX53 SOCs) only receive RTR frames if the controller is
+configured for RX-FIFO mode.
+
+The RX FIFO mode uses a hardware FIFO with a depth of 6 CAN frames,
+while the mailbox mode uses a software FIFO with a depth of up to 62
+CAN frames. With the help of the bigger buffer, the mailbox mode
+performs better under high system load situations.
+
+As reception of RTR frames is part of the CAN standard, all flexcan
+cores come up in a mode where RTR reception is possible.
+
+With the "rx-rtr" private flag the ability to receive RTR frames can
+be waived in exchange for the better-performing RX mailbox mode. This
+trade-off is beneficial in certain use cases.
+
+"rx-rtr" on
+ Receive RTR frames. (default)
+
+ The CAN controller can and will receive RTR frames.
+
+ On some IP cores the controller cannot receive RTR frames in the
+ more performant "RX mailbox" mode and will use "RX FIFO" mode
+ instead.
+
+"rx-rtr" off
+ Waive ability to receive RTR frames. (not supported on all IP cores)
+
+ This mode activates the "RX mailbox mode" for better performance. On
+ some IP cores RTR frames can then no longer be received.
+
+The setting can only be changed if the interface is down::
+
+ ip link set dev can0 down
+ ethtool --set-priv-flags can0 rx-rtr {off|on}
+ ip link set dev can0 up
diff --git a/Documentation/networking/device_drivers/can/index.rst b/Documentation/networking/device_drivers/can/index.rst
new file mode 100644
index 000000000000..6a8a4f74fa26
--- /dev/null
+++ b/Documentation/networking/device_drivers/can/index.rst
@@ -0,0 +1,22 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Controller Area Network (CAN) Device Drivers
+============================================
+
+Device drivers for CAN devices.
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ can327
+ ctu/ctucanfd-driver
+ freescale/flexcan
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/cellular/index.rst b/Documentation/networking/device_drivers/cellular/index.rst
new file mode 100644
index 000000000000..fc1812d3fc70
--- /dev/null
+++ b/Documentation/networking/device_drivers/cellular/index.rst
@@ -0,0 +1,18 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Cellular Modem Device Drivers
+=============================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ qualcomm/rmnet
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst b/Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst
new file mode 100644
index 000000000000..4118384cf8eb
--- /dev/null
+++ b/Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst
@@ -0,0 +1,197 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+Rmnet Driver
+============
+
+1. Introduction
+===============
+
+The rmnet driver supports the Multiplexing and Aggregation Protocol (MAP).
+This protocol is used by all recent chipsets using Qualcomm
+Technologies, Inc. modems.
+
+This driver can be used to register onto any physical network device in
+IP mode. Physical transports include USB, HSIC, PCIe and IP accelerator.
+
+Multiplexing allows for creation of logical netdevices (rmnet devices) to
+handle multiple private data networks (PDN) like a default internet, tethering,
+multimedia messaging service (MMS) or IP media subsystem (IMS). Hardware sends
+packets with MAP headers to rmnet. Based on the multiplexer id, rmnet
+routes to the appropriate PDN after removing the MAP header.
+
+Aggregation is required to achieve high data rates. This involves hardware
+sending an aggregated bunch of MAP frames. The rmnet driver de-aggregates
+these MAP frames and sends them to the appropriate PDNs.
+
+2. Packet format
+================
+
+a. MAP packet v1 (data / control)
+
+MAP header fields are in big endian format.
+
+Packet format::
+
+ Bit 0 1 2-7 8-15 16-31
+ Function Command / Data Reserved Pad Multiplexer ID Payload length
+
+ Bit 32-x
+ Function Raw bytes
+
+Command (1)/ Data (0) bit value is to indicate if the packet is a MAP command
+or data packet. Command packet is used for transport level flow control. Data
+packets are standard IP packets.
+
+Reserved bits must be zero when sent and ignored when received.
+
+Padding is the number of bytes to be appended to the payload to
+ensure 4 byte alignment.
+
+Multiplexer ID is to indicate the PDN on which data has to be sent.
+
+Payload length includes the padding length but does not include MAP header
+length.
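
A minimal sketch of decoding the 4-byte MAP v1 header described above, in Python for illustration only (the driver itself is written in C). It assumes that bit 0 in the layout is the most significant bit of the first byte; the names `MapHeader` and `parse_map_header` are hypothetical and not part of the driver.

```python
import struct
from typing import NamedTuple

MAP_HEADER_LEN = 4  # Command/Data + Reserved + Pad, Multiplexer ID, Payload length


class MapHeader(NamedTuple):
    """Hypothetical container for the fields of a MAP v1 header."""
    is_command: bool   # bit 0: 1 = MAP command, 0 = data
    pad_len: int       # bits 2-7: number of padding bytes
    mux_id: int        # bits 8-15: multiplexer ID (selects the PDN)
    payload_len: int   # bits 16-31: payload length, including padding


def parse_map_header(buf: bytes) -> MapHeader:
    """Decode a MAP v1 header, assuming bit 0 is the MSB of byte 0."""
    if len(buf) < MAP_HEADER_LEN:
        raise ValueError("buffer shorter than a MAP header")
    # "!" selects network (big endian) byte order, as the text requires.
    flags, mux_id, payload_len = struct.unpack_from("!BBH", buf)
    return MapHeader(
        is_command=bool(flags & 0x80),
        pad_len=flags & 0x3F,
        mux_id=mux_id,
        payload_len=payload_len,
    )
```

Because the payload length includes the padding, the number of useful payload bytes would be `payload_len - pad_len` under these assumptions.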
+
+b. MAP packet v4 (data / control)
+
+MAP header fields are in big endian format.
+
+Packet format::
+
+ Bit 0 1 2-7 8-15 16-31
+ Function Command / Data Reserved Pad Multiplexer ID Payload length
+
+ Bit 32-(x-33) (x-32)-x
+ Function Raw bytes Checksum offload header
+
+Command (1)/ Data (0) bit value is to indicate if the packet is a MAP command
+or data packet. Command packet is used for transport level flow control. Data
+packets are standard IP packets.
+
+Reserved bits must be zero when sent and ignored when received.
+
+Padding is the number of bytes to be appended to the payload to
+ensure 4 byte alignment.
+
+Multiplexer ID is to indicate the PDN on which data has to be sent.
+
+Payload length includes the padding length but does not include MAP header
+length.
+
+The checksum offload header carries information about the checksum processing
+done by the hardware. Checksum offload header fields are in big endian format.
+
+Packet format::
+
+ Bit 0-14 15 16-31
+ Function Reserved Valid Checksum start offset
+
+ Bit 32-47 48-63
+ Function Checksum length Checksum value
+
+Reserved bits must be zero when sent and ignored when received.
+
+Valid bit indicates whether the partial checksum is calculated and is valid.
+Set to 1 if it is valid. Set to 0 otherwise.
+
+Padding is the number of bytes to be appended to the payload to
+ensure 4 byte alignment.
+
+Checksum start offset indicates the offset in bytes from the beginning of the
+IP header from which the modem computed the checksum.
+
+Checksum length is the length in bytes, starting from CKSUM_START_OFFSET,
+over which the checksum is computed.
+
+Checksum value indicates the computed checksum.
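
A minimal sketch of decoding the v4 checksum offload header, in Python for illustration only. It assumes the header is four consecutive 16-bit big endian words (reserved bits plus the valid bit, start offset, length, value) and that the valid bit is the least significant bit of the first word; `CsumTrailer` and `parse_csum_trailer` are hypothetical names, not driver symbols.

```python
import struct
from typing import NamedTuple

CSUM_TRAILER_LEN = 8  # assumed size: four 16-bit words


class CsumTrailer(NamedTuple):
    """Hypothetical container for the v4 checksum offload fields."""
    valid: bool         # partial checksum was calculated and is valid
    start_offset: int   # bytes from the start of the IP header
    length: int         # bytes covered by the checksum
    value: int          # checksum computed by the hardware


def parse_csum_trailer(trailer: bytes) -> CsumTrailer:
    """Decode the checksum offload header from exactly 8 bytes."""
    if len(trailer) != CSUM_TRAILER_LEN:
        raise ValueError("checksum offload header must be 8 bytes")
    # "!" selects big endian byte order for all four words.
    word0, start_offset, length, value = struct.unpack("!HHHH", trailer)
    # Assumption: the valid bit is the last bit of the first 16-bit word.
    return CsumTrailer(bool(word0 & 0x0001), start_offset, length, value)
```

A receiver following this sketch would only trust `value` when `valid` is true and would otherwise fall back to checksumming in software.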
+
+c. MAP packet v5 (data / control)
+
+MAP header fields are in big endian format.
+
+Packet format::
+
+ Bit 0 1 2-7 8-15 16-31
+ Function Command / Data Next header Pad Multiplexer ID Payload length
+
+ Bit 32-x
+ Function Raw bytes
+
+Command (1)/ Data (0) bit value is to indicate if the packet is a MAP command
+or data packet. Command packet is used for transport level flow control. Data
+packets are standard IP packets.
+
+Next header indicates the presence of another header; currently this is
+limited to the checksum header.
+
+Padding is the number of bytes to be appended to the payload to
+ensure 4 byte alignment.
+
+Multiplexer ID is to indicate the PDN on which data has to be sent.
+
+Payload length includes the padding length but does not include MAP header
+length.
+
+d. Checksum offload header v5
+
+Checksum offload header fields are in big endian format.
+
+Packet format::
+
+ Bit 0 - 6 7 8-15 16-31
+ Function Header Type Next Header Checksum Valid Reserved
+
+Header Type indicates the type of header; this is usually set to CHECKSUM.
+
+Header types
+
+= ==========================================
+0 Reserved
+1 Reserved
+2 checksum header
+= ==========================================
+
+Checksum Valid is to indicate whether the header checksum is valid. Value of 1
+implies that checksum is calculated on this packet and is valid, value of 0
+indicates that the calculated packet checksum is invalid.
+
+Reserved bits must be zero when sent and ignored when received.
+
+e. MAP packet v1/v5 (command specific)::
+
+ Bit 0 1 2-7 8 - 15 16 - 31
+ Function Command Reserved Pad Multiplexer ID Payload length
+ Bit 32 - 39 40 - 45 46 - 47 48 - 63
+ Function Command name Reserved Command Type Reserved
+ Bit 64 - 95
+ Function Transaction ID
+ Bit 96 - 127
+ Function Command data
+
+Command name 1 indicates disabling flow, while 2 indicates enabling flow.
+
+Command types
+
+= ==========================================
+0 for MAP command request
+1 is to acknowledge the receipt of a command
+2 is for unsupported commands
+3 is for error during processing of commands
+= ==========================================
+
+f. Aggregation
+
+Aggregation means that multiple MAP packets (data or command) are delivered to
+rmnet in a single linear skb. rmnet will process the individual
+packets and either ACK the MAP command or deliver the IP packet to the
+network stack as needed.
+
+MAP header|IP Packet|Optional padding|MAP header|IP Packet|Optional padding....
+
+MAP header|IP Packet|Optional padding|MAP header|Command Packet|Optional pad...
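
The de-aggregation of such a buffer can be sketched as follows, in Python for illustration only. The sketch assumes 4-byte MAP v1 headers with bit 0 as the most significant bit of the first byte; `split_map_frames` is a hypothetical helper, not a driver function.

```python
import struct
from typing import Iterator, Tuple

MAP_HEADER_LEN = 4


def split_map_frames(data: bytes) -> Iterator[Tuple[bool, int, bytes]]:
    """Yield (is_command, mux_id, payload) for each MAP frame in data.

    The payload length field includes the padding, so the padding is
    stripped from the end of each frame before the payload is returned.
    """
    offset = 0
    while offset + MAP_HEADER_LEN <= len(data):
        flags, mux_id, payload_len = struct.unpack_from("!BBH", data, offset)
        pad_len = flags & 0x3F
        start = offset + MAP_HEADER_LEN
        end = start + payload_len
        if payload_len < pad_len or end > len(data):
            raise ValueError("truncated or malformed MAP frame")
        yield bool(flags & 0x80), mux_id, data[start:end - pad_len]
        offset = end  # the next MAP header follows the padding directly
```

Each yielded tuple corresponds to one `MAP header|packet|padding` group; a consumer would ACK command frames and hand data frames to the device selected by the multiplexer ID.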
+
+3. Userspace configuration
+==========================
+
+rmnet userspace configuration is done through the netlink library librmnetctl
+and the command line utility rmnetcli. The utility is hosted in the codeaurora
+forum git. The driver uses rtnl_link_ops for communication.
+
+https://source.codeaurora.org/quic/la/platform/vendor/qcom-opensource/dataservices/tree/rmnetctl
diff --git a/Documentation/networking/device_drivers/dec/de4x5.txt b/Documentation/networking/device_drivers/dec/de4x5.txt
deleted file mode 100644
index 452aac58341d..000000000000
--- a/Documentation/networking/device_drivers/dec/de4x5.txt
+++ /dev/null
@@ -1,178 +0,0 @@
- Originally, this driver was written for the Digital Equipment
- Corporation series of EtherWORKS Ethernet cards:
-
- DE425 TP/COAX EISA
- DE434 TP PCI
- DE435 TP/COAX/AUI PCI
- DE450 TP/COAX/AUI PCI
- DE500 10/100 PCI Fasternet
-
- but it will now attempt to support all cards which conform to the
- Digital Semiconductor SROM Specification. The driver currently
- recognises the following chips:
-
- DC21040 (no SROM)
- DC21041[A]
- DC21140[A]
- DC21142
- DC21143
-
- So far the driver is known to work with the following cards:
-
- KINGSTON
- Linksys
- ZNYX342
- SMC8432
- SMC9332 (w/new SROM)
- ZNYX31[45]
- ZNYX346 10/100 4 port (can act as a 10/100 bridge!)
-
- The driver has been tested on a relatively busy network using the DE425,
- DE434, DE435 and DE500 cards and benchmarked with 'ttcp': it transferred
- 16M of data to a DECstation 5000/200 as follows:
-
- TCP UDP
- TX RX TX RX
- DE425 1030k 997k 1170k 1128k
- DE434 1063k 995k 1170k 1125k
- DE435 1063k 995k 1170k 1125k
- DE500 1063k 998k 1170k 1125k in 10Mb/s mode
-
- All values are typical (in kBytes/sec) from a sample of 4 for each
- measurement. Their error is +/-20k on a quiet (private) network and also
- depend on what load the CPU has.
-
- =========================================================================
-
- The ability to load this driver as a loadable module has been included
- and used extensively during the driver development (to save those long
- reboot sequences). Loadable module support under PCI and EISA has been
- achieved by letting the driver autoprobe as if it were compiled into the
- kernel. Do make sure you're not sharing interrupts with anything that
- cannot accommodate interrupt sharing!
-
- To utilise this ability, you have to do 8 things:
-
- 0) have a copy of the loadable modules code installed on your system.
- 1) copy de4x5.c from the /linux/drivers/net directory to your favourite
- temporary directory.
- 2) for fixed autoprobes (not recommended), edit the source code near
- line 5594 to reflect the I/O address you're using, or assign these when
- loading by:
-
- insmod de4x5 io=0xghh where g = bus number
- hh = device number
-
- NB: autoprobing for modules is now supported by default. You may just
- use:
-
- insmod de4x5
-
- to load all available boards. For a specific board, still use
- the 'io=?' above.
- 3) compile de4x5.c, but include -DMODULE in the command line to ensure
- that the correct bits are compiled (see end of source code).
- 4) if you are wanting to add a new card, goto 5. Otherwise, recompile a
- kernel with the de4x5 configuration turned off and reboot.
- 5) insmod de4x5 [io=0xghh]
- 6) run the net startup bits for your new eth?? interface(s) manually
- (usually /etc/rc.inet[12] at boot time).
- 7) enjoy!
-
- To unload a module, turn off the associated interface(s)
- 'ifconfig eth?? down' then 'rmmod de4x5'.
-
- Automedia detection is included so that in principle you can disconnect
- from, e.g. TP, reconnect to BNC and things will still work (after a
- pause while the driver figures out where its media went). My tests
- using ping showed that it appears to work....
-
- By default, the driver will now autodetect any DECchip based card.
- Should you have a need to restrict the driver to DIGITAL only cards, you
- can compile with a DEC_ONLY define, or if loading as a module, use the
- 'dec_only=1' parameter.
-
- I've changed the timing routines to use the kernel timer and scheduling
- functions so that the hangs and other assorted problems that occurred
- while autosensing the media should be gone. A bonus for the DC21040
- auto media sense algorithm is that it can now use one that is more in
- line with the rest (the DC21040 chip doesn't have a hardware timer).
- The downside is the 1 'jiffies' (10ms) resolution.
-
- IEEE 802.3u MII interface code has been added in anticipation that some
- products may use it in the future.
-
- The SMC9332 card has a non-compliant SROM which needs fixing - I have
- patched this driver to detect it because the SROM format used complies
- to a previous DEC-STD format.
-
- I have removed the buffer copies needed for receive on Intels. I cannot
- remove them for Alphas since the Tulip hardware only does longword
- aligned DMA transfers and the Alphas get alignment traps with non
- longword aligned data copies (which makes them really slow). No comment.
-
- I have added SROM decoding routines to make this driver work with any
- card that supports the Digital Semiconductor SROM spec. This will help
- all cards running the dc2114x series chips in particular. Cards using
- the dc2104x chips should run correctly with the basic driver. I'm in
- debt to <mjacob@feral.com> for the testing and feedback that helped get
- this feature working. So far we have tested KINGSTON, SMC8432, SMC9332
- (with the latest SROM complying with the SROM spec V3: their first was
- broken), ZNYX342 and LinkSys. ZNYX314 (dual 21041 MAC) and ZNYX 315
- (quad 21041 MAC) cards also appear to work despite their incorrectly
- wired IRQs.
-
- I have added a temporary fix for interrupt problems when some SCSI cards
- share the same interrupt as the DECchip based cards. The problem occurs
- because the SCSI card wants to grab the interrupt as a fast interrupt
- (runs the service routine with interrupts turned off) vs. this card
- which really needs to run the service routine with interrupts turned on.
- This driver will now add the interrupt service routine as a fast
- interrupt if it is bounced from the slow interrupt. THIS IS NOT A
- RECOMMENDED WAY TO RUN THE DRIVER and has been done for a limited time
- until people sort out their compatibility issues and the kernel
- interrupt service code is fixed. YOU SHOULD SEPARATE OUT THE FAST
- INTERRUPT CARDS FROM THE SLOW INTERRUPT CARDS to ensure that they do not
- run on the same interrupt. PCMCIA/CardBus is another can of worms...
-
- Finally, I think I have really fixed the module loading problem with
- more than one DECchip based card. As a side effect, I don't mess with
- the device structure any more which means that if more than 1 card in
- 2.0.x is installed (4 in 2.1.x), the user will have to edit
- linux/drivers/net/Space.c to make room for them. Hence, module loading
- is the preferred way to use this driver, since it doesn't have this
- limitation.
-
- Where SROM media detection is used and full duplex is specified in the
- SROM, the feature is ignored unless lp->params.fdx is set at compile
- time OR during a module load (insmod de4x5 args='eth??:fdx' [see
- below]). This is because there is no way to automatically detect full
- duplex links except through autonegotiation. When I include the
- autonegotiation feature in the SROM autoconf code, this detection will
- occur automatically for that case.
-
- Command line arguments are now allowed, similar to passing arguments
- through LILO. This will allow a per adapter board set up of full duplex
- and media. The only lexical constraints are: the board name (dev->name)
- appears in the list before its parameters. The list of parameters ends
- either at the end of the parameter list or with another board name. The
- following parameters are allowed:
-
- fdx for full duplex
- autosense to set the media/speed; with the following
- sub-parameters:
- TP, TP_NW, BNC, AUI, BNC_AUI, 100Mb, 10Mb, AUTO
-
- Case sensitivity is important for the sub-parameters. They *must* be
- upper case. Examples:
-
- insmod de4x5 args='eth1:fdx autosense=BNC eth0:autosense=100Mb'.
-
- For a compiled in driver, in linux/drivers/net/CONFIG, place e.g.
- DE4X5_OPTS = -DDE4X5_PARM='"eth0:fdx autosense=AUI eth2:autosense=TP"'
-
- Yes, I know full duplex isn't permissible on BNC or AUI; they're just
- examples. By default, full duplex is turned off and AUTO is the default
- autosense setting. In reality, I expect only the full duplex option to
- be used. Note the use of single quotes in the two examples above and the
- lack of commas to separate items.
diff --git a/Documentation/networking/device_drivers/3com/3c509.txt b/Documentation/networking/device_drivers/ethernet/3com/3c509.rst
index fbf722e15ac3..47f706bacdd9 100644
--- a/Documentation/networking/device_drivers/3com/3c509.txt
+++ b/Documentation/networking/device_drivers/ethernet/3com/3c509.rst
@@ -1,17 +1,21 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================================================================
Linux and the 3Com EtherLink III Series Ethercards (driver v1.18c and higher)
-----------------------------------------------------------------------------
+=============================================================================
This file contains the instructions and caveats for v1.18c and higher versions
of the 3c509 driver. You should not use the driver without reading this file.
release 1.0
+
28 February 2002
+
Current maintainer (corrections to):
David Ruggiero <jdr@farfalle.com>
-----------------------------------------------------------------------------
-
-(0) Introduction
+Introduction
+============
The following are notes and information on using the 3Com EtherLink III series
ethercards in Linux. These cards are commonly known by the most widely-used
@@ -21,11 +25,11 @@ be (but sometimes are) confused with the similarly-numbered PCI-bus "3c905"
provided by the module 3c509.c, which has code to support all of the following
models:
- 3c509 (original ISA card)
- 3c509B (later revision of the ISA card; supports full-duplex)
- 3c589 (PCMCIA)
- 3c589B (later revision of the 3c589; supports full-duplex)
- 3c579 (EISA)
+ - 3c509 (original ISA card)
+ - 3c509B (later revision of the ISA card; supports full-duplex)
+ - 3c589 (PCMCIA)
+ - 3c589B (later revision of the 3c589; supports full-duplex)
+ - 3c579 (EISA)
Large portions of this documentation were heavily borrowed from the guide
written the original author of the 3c509 driver, Donald Becker. The master
@@ -33,32 +37,34 @@ copy of that document, which contains notes on older versions of the driver,
currently resides on Scyld web server: http://www.scyld.com/.
-(1) Special Driver Features
+Special Driver Features
+=======================
Overriding card settings
The driver allows boot- or load-time overriding of the card's detected IOADDR,
IRQ, and transceiver settings, although this capability shouldn't generally be
needed except to enable full-duplex mode (see below). An example of the syntax
-for LILO parameters for doing this:
+for LILO parameters for doing this::
- ether=10,0x310,3,0x3c509,eth0
+ ether=10,0x310,3,0x3c509,eth0
This configures the first found 3c509 card for IRQ 10, base I/O 0x310, and
transceiver type 3 (10base2). The flag "0x3c509" must be set to avoid conflicts
with other card types when overriding the I/O address. When the driver is
loaded as a module, only the IRQ may be overridden. For example,
setting two cards to IRQ10 and IRQ11 is done by using the irq module
-option:
+option::
options 3c509 irq=10,11
-(2) Full-duplex mode
+Full-duplex mode
+================
The v1.18c driver added support for the 3c509B's full-duplex capabilities.
In order to enable and successfully use full-duplex mode, three conditions
-must be met:
+must be met:
(a) You must have a Etherlink III card model whose hardware supports full-
duplex operations. Currently, the only members of the 3c509 family that are
@@ -78,27 +84,32 @@ duplex-capable Ethernet switch (*not* a hub), or a full-duplex-capable NIC on
another system that's connected directly to the 3c509B via a crossover cable.
Full-duplex mode can be enabled using 'ethtool'.
-
-/////Extremely important caution concerning full-duplex mode/////
-Understand that the 3c509B's hardware's full-duplex support is much more
-limited than that provide by more modern network interface cards. Although
-at the physical layer of the network it fully supports full-duplex operation,
-the card was designed before the current Ethernet auto-negotiation (N-way)
-spec was written. This means that the 3c509B family ***cannot and will not
-auto-negotiate a full-duplex connection with its link partner under any
-circumstances, no matter how it is initialized***. If the full-duplex mode
-of the 3c509B is enabled, its link partner will very likely need to be
-independently _forced_ into full-duplex mode as well; otherwise various nasty
-failures will occur - at the very least, you'll see massive numbers of packet
-collisions. This is one of very rare circumstances where disabling auto-
-negotiation and forcing the duplex mode of a network interface card or switch
-would ever be necessary or desirable.
-
-
-(3) Available Transceiver Types
+
+.. warning::
+
+ Extremely important caution concerning full-duplex mode
+
+ Understand that the 3c509B's hardware's full-duplex support is much more
+ limited than that provided by more modern network interface cards. Although
+ at the physical layer of the network it fully supports full-duplex operation,
+ the card was designed before the current Ethernet auto-negotiation (N-way)
+ spec was written. This means that the 3c509B family ***cannot and will not
+ auto-negotiate a full-duplex connection with its link partner under any
+ circumstances, no matter how it is initialized***. If the full-duplex mode
+ of the 3c509B is enabled, its link partner will very likely need to be
+ independently _forced_ into full-duplex mode as well; otherwise various nasty
+ failures will occur - at the very least, you'll see massive numbers of packet
+ collisions. This is one of very rare circumstances where disabling auto-
+ negotiation and forcing the duplex mode of a network interface card or switch
+ would ever be necessary or desirable.
+
+
+Available Transceiver Types
+===========================
For versions of the driver v1.18c and above, the available transceiver types are:
-
+
+== =========================================================================
0 transceiver type from EEPROM config (normally 10baseT); force half-duplex
1 AUI (thick-net / DB15 connector)
2 (undefined)
@@ -106,6 +117,7 @@ For versions of the driver v1.18c and above, the available transceiver types are
4 10baseT (RJ-45 connector); force half-duplex mode
8 transceiver type and duplex mode taken from card's EEPROM config settings
12 10baseT (RJ-45 connector); force full-duplex mode
+== =========================================================================
Prior to driver version 1.18c, only transceiver codes 0-4 were supported. Note
that the new transceiver codes 8 and 12 are the *only* ones that will enable
@@ -116,26 +128,30 @@ it must always be explicitly enabled via one of these code in order to be
activated.
The transceiver type can be changed using 'ethtool'.
-
-(4a) Interpretation of error messages and common problems
+
+Interpretation of error messages and common problems
+----------------------------------------------------
Error Messages
+^^^^^^^^^^^^^^
-eth0: Infinite loop in interrupt, status 2011.
+eth0: Infinite loop in interrupt, status 2011.
These are "mostly harmless" message indicating that the driver had too much
work during that interrupt cycle. With a status of 0x2011 you are receiving
packets faster than they can be removed from the card. This should be rare
or impossible in normal operation. Possible causes of this error report are:
-
+
- a "green" mode enabled that slows the processor down when there is no
- keyboard activity.
+ keyboard activity.
- some other device or device driver hogging the bus or disabling interrupts.
Check /proc/interrupts for excessive interrupt counts. The timer tick
- interrupt should always be incrementing faster than the others.
+ interrupt should always be incrementing faster than the others.
+
+No received packets
+^^^^^^^^^^^^^^^^^^^
-No received packets
If a 3c509, 3c562 or 3c589 can successfully transmit packets, but never
receives packets (as reported by /proc/net/dev or 'ifconfig') you likely
have an interrupt line problem. Check /proc/interrupts to verify that the
@@ -146,26 +162,37 @@ or IRQ5, and the easiest solution is to move the 3c509 to a different
interrupt line. If the device is receiving packets but 'ping' doesn't work,
you have a routing problem.
-Tx Carrier Errors Reported in /proc/net/dev
+Tx Carrier Errors Reported in /proc/net/dev
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+
If an EtherLink III appears to transmit packets, but the "Tx carrier errors"
field in /proc/net/dev increments as quickly as the Tx packet count, you
-likely have an unterminated network or the incorrect media transceiver selected.
+likely have an unterminated network or the incorrect media transceiver selected.
+
+3c509B card is not detected on machines with an ISA PnP BIOS.
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-3c509B card is not detected on machines with an ISA PnP BIOS.
While the updated driver works with most PnP BIOS programs, it does not work
with all. This can be fixed by disabling PnP support using the 3Com-supplied
-setup program.
+setup program.
+
+3c509 card is not detected on overclocked machines
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-3c509 card is not detected on overclocked machines
Increase the delay time in id_read_eeprom() from the current value, 500,
-to an absurdly high value, such as 5000.
+to an absurdly high value, such as 5000.
+
+Decoding Status and Error Messages
+----------------------------------
-(4b) Decoding Status and Error Messages
-The bits in the main status register are:
+The bits in the main status register are:
+===== ======================================
value description
+===== ======================================
0x01 Interrupt latch
0x02 Tx overrun, or Rx underrun
0x04 Tx complete
@@ -174,30 +201,38 @@ value description
0x20 A Rx packet has started to arrive
0x40 The driver has requested an interrupt
0x80 Statistics counter nearly full
+===== ======================================
-The bits in the transmit (Tx) status word are:
+The bits in the transmit (Tx) status word are:
-value description
-0x02 Out-of-window collision.
-0x04 Status stack overflow (normally impossible).
-0x08 16 collisions.
-0x10 Tx underrun (not enough PCI bus bandwidth).
-0x20 Tx jabber.
-0x40 Tx interrupt requested.
-0x80 Status is valid (this should always be set).
+===== ============================================
+value description
+===== ============================================
+0x02 Out-of-window collision.
+0x04 Status stack overflow (normally impossible).
+0x08 16 collisions.
+0x10 Tx underrun (not enough PCI bus bandwidth).
+0x20 Tx jabber.
+0x40 Tx interrupt requested.
+0x80 Status is valid (this should always be set).
+===== ============================================
-When a transmit error occurs the driver produces a status message such as
+When a transmit error occurs the driver produces a status message such as::
eth0: Transmit error, Tx status register 82
The two values typically seen here are:
-0x82
+0x82
+^^^^
+
Out of window collision. This typically occurs when some other Ethernet
-host is incorrectly set to full duplex on a half duplex network.
+host is incorrectly set to full duplex on a half duplex network.
+
+0x88
+^^^^
-0x88
16 collisions. This typically occurs when the network is exceptionally busy
or when another host doesn't correctly back off after a collision. If this
error is mixed with 0x82 errors it is the result of a host incorrectly set
@@ -207,7 +242,8 @@ Both of these errors are the result of network problems that should be
corrected. They do not represent driver malfunction.
-(5) Revision history (this file)
+Revision history (this file)
+============================
28Feb02 v1.0 DR New; major portions based on Becker original 3c509 docs
diff --git a/Documentation/networking/device_drivers/3com/vortex.txt b/Documentation/networking/device_drivers/ethernet/3com/vortex.rst
index 587f3fcfbcae..e89e4192af88 100644
--- a/Documentation/networking/device_drivers/3com/vortex.txt
+++ b/Documentation/networking/device_drivers/ethernet/3com/vortex.rst
@@ -1,5 +1,11 @@
-Documentation/networking/device_drivers/3com/vortex.txt
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+3Com Vortex device driver
+=========================
+
Andrew Morton
+
30 April 2000
@@ -8,12 +14,12 @@ driver for Linux, 3c59x.c.
The driver was written by Donald Becker <becker@scyld.com>
-Don is no longer the prime maintainer of this version of the driver.
+Don is no longer the prime maintainer of this version of the driver.
Please report problems to one or more of:
- Andrew Morton
- Netdev mailing list <netdev@vger.kernel.org>
- Linux kernel mailing list <linux-kernel@vger.kernel.org>
+- Andrew Morton
+- Netdev mailing list <netdev@vger.kernel.org>
+- Linux kernel mailing list <linux-kernel@vger.kernel.org>
Please note the 'Reporting and Diagnosing Problems' section at the end
of this file.
@@ -24,58 +30,58 @@ Since kernel 2.3.99-pre6, this driver incorporates the support for the
This driver supports the following hardware:
- 3c590 Vortex 10Mbps
- 3c592 EISA 10Mbps Demon/Vortex
- 3c597 EISA Fast Demon/Vortex
- 3c595 Vortex 100baseTx
- 3c595 Vortex 100baseT4
- 3c595 Vortex 100base-MII
- 3c900 Boomerang 10baseT
- 3c900 Boomerang 10Mbps Combo
- 3c900 Cyclone 10Mbps TPO
- 3c900 Cyclone 10Mbps Combo
- 3c900 Cyclone 10Mbps TPC
- 3c900B-FL Cyclone 10base-FL
- 3c905 Boomerang 100baseTx
- 3c905 Boomerang 100baseT4
- 3c905B Cyclone 100baseTx
- 3c905B Cyclone 10/100/BNC
- 3c905B-FX Cyclone 100baseFx
- 3c905C Tornado
- 3c920B-EMB-WNM (ATI Radeon 9100 IGP)
- 3c980 Cyclone
- 3c980C Python-T
- 3cSOHO100-TX Hurricane
- 3c555 Laptop Hurricane
- 3c556 Laptop Tornado
- 3c556B Laptop Hurricane
- 3c575 [Megahertz] 10/100 LAN CardBus
- 3c575 Boomerang CardBus
- 3CCFE575BT Cyclone CardBus
- 3CCFE575CT Tornado CardBus
- 3CCFE656 Cyclone CardBus
- 3CCFEM656B Cyclone+Winmodem CardBus
- 3CXFEM656C Tornado+Winmodem CardBus
- 3c450 HomePNA Tornado
- 3c920 Tornado
- 3c982 Hydra Dual Port A
- 3c982 Hydra Dual Port B
- 3c905B-T4
- 3c920B-EMB-WNM Tornado
+ - 3c590 Vortex 10Mbps
+ - 3c592 EISA 10Mbps Demon/Vortex
+ - 3c597 EISA Fast Demon/Vortex
+ - 3c595 Vortex 100baseTx
+ - 3c595 Vortex 100baseT4
+ - 3c595 Vortex 100base-MII
+ - 3c900 Boomerang 10baseT
+ - 3c900 Boomerang 10Mbps Combo
+ - 3c900 Cyclone 10Mbps TPO
+ - 3c900 Cyclone 10Mbps Combo
+ - 3c900 Cyclone 10Mbps TPC
+ - 3c900B-FL Cyclone 10base-FL
+ - 3c905 Boomerang 100baseTx
+ - 3c905 Boomerang 100baseT4
+ - 3c905B Cyclone 100baseTx
+ - 3c905B Cyclone 10/100/BNC
+ - 3c905B-FX Cyclone 100baseFx
+ - 3c905C Tornado
+ - 3c920B-EMB-WNM (ATI Radeon 9100 IGP)
+ - 3c980 Cyclone
+ - 3c980C Python-T
+ - 3cSOHO100-TX Hurricane
+ - 3c555 Laptop Hurricane
+ - 3c556 Laptop Tornado
+ - 3c556B Laptop Hurricane
+ - 3c575 [Megahertz] 10/100 LAN CardBus
+ - 3c575 Boomerang CardBus
+ - 3CCFE575BT Cyclone CardBus
+ - 3CCFE575CT Tornado CardBus
+ - 3CCFE656 Cyclone CardBus
+ - 3CCFEM656B Cyclone+Winmodem CardBus
+ - 3CXFEM656C Tornado+Winmodem CardBus
+ - 3c450 HomePNA Tornado
+ - 3c920 Tornado
+ - 3c982 Hydra Dual Port A
+ - 3c982 Hydra Dual Port B
+ - 3c905B-T4
+ - 3c920B-EMB-WNM Tornado
Module parameters
=================
There are several parameters which may be provided to the driver when
-its module is loaded. These are usually placed in /etc/modprobe.d/*.conf
-configuration files. Example:
+its module is loaded. These are usually placed in ``/etc/modprobe.d/*.conf``
+configuration files. Example::
-options 3c59x debug=3 rx_copybreak=300
+ options 3c59x debug=3 rx_copybreak=300
If you are using the PCMCIA tools (cardmgr) then the options may be
-placed in /etc/pcmcia/config.opts:
+placed in /etc/pcmcia/config.opts::
-module "3c59x" opts "debug=3 rx_copybreak=300"
+ module "3c59x" opts "debug=3 rx_copybreak=300"
The supported parameters are:
@@ -89,7 +95,7 @@ options=N1,N2,N3,...
Each number in the list provides an option to the corresponding
network card. So if you have two 3c905's and you wish to provide
- them with option 0x204 you would use:
+ them with option 0x204 you would use::
options=0x204,0x204
@@ -97,6 +103,8 @@ options=N1,N2,N3,...
have the following meanings:
Possible media type settings
+
+ == =================================
0 10baseT
1 10Mbs AUI
2 undefined
@@ -108,17 +116,20 @@ options=N1,N2,N3,...
8 Autonegotiate
9 External MII
10 Use default setting from EEPROM
+ == =================================
When generating a value for the 'options' setting, the above media
selection values may be OR'ed (or added to) the following:
+ ====== =============================================
0x8000 Set driver debugging level to 7
0x4000 Set driver debugging level to 2
0x0400 Enable Wake-on-LAN
0x0200 Force full duplex mode.
0x0010 Bus-master enable bit (Old Vortex cards only)
+ ====== =============================================
- For example:
+ For example::
insmod 3c59x options=0x204
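As a minimal sketch of how such a value is put together, the snippet below ORs a media-type code with flag bits taken from the two tables above. The constant and function names are illustrative assumptions, not identifiers from the driver.

```python
# Flag bits copied from the table above. The names are hypothetical and
# exist only for this illustration.
FLAG_DEBUG_7 = 0x8000      # set driver debugging level to 7
FLAG_DEBUG_2 = 0x4000      # set driver debugging level to 2
FLAG_WOL = 0x0400          # enable Wake-on-LAN
FLAG_FULL_DUPLEX = 0x0200  # force full duplex mode
FLAG_BUS_MASTER = 0x0010   # bus-master enable bit (old Vortex cards only)

MEDIA_AUTONEGOTIATE = 8    # media type 8 from the media table


def make_options(media, *flags):
    """OR a media-type code together with any number of flag bits."""
    value = media
    for flag in flags:
        value |= flag
    return value


if __name__ == "__main__":
    # Media type 4 with forced full duplex gives the 0x204 used above.
    print(hex(make_options(4, FLAG_FULL_DUPLEX)))
    print(hex(make_options(MEDIA_AUTONEGOTIATE, FLAG_WOL)))
```

Because the media codes occupy only the low bits and the flags sit above them, OR'ing and adding give the same result as long as each flag is used once.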
@@ -127,14 +138,14 @@ options=N1,N2,N3,...
global_options=N
- Sets the `options' parameter for all 3c59x NICs in the machine.
- Entries in the `options' array above will override any setting of
+ Sets the ``options`` parameter for all 3c59x NICs in the machine.
+ Entries in the ``options`` array above will override any setting of
this.
full_duplex=N1,N2,N3...
Similar to bit 9 of 'options'. Forces the corresponding card into
- full-duplex mode. Please use this in preference to the `options'
+ full-duplex mode. Please use this in preference to the ``options``
parameter.
In fact, please don't use this at all! You're better off getting
@@ -143,13 +154,13 @@ full_duplex=N1,N2,N3...
global_full_duplex=N1
Sets full duplex mode for all 3c59x NICs in the machine. Entries
- in the `full_duplex' array above will override any setting of this.
+ in the ``full_duplex`` array above will override any setting of this.
flow_ctrl=N1,N2,N3...
Use 802.3x MAC-layer flow control. The 3com cards only support the
PAUSE command, which means that they will stop sending packets for a
- short period if they receive a PAUSE frame from the link partner.
+ short period if they receive a PAUSE frame from the link partner.
The driver only allows flow control on a link which is operating in
full duplex mode.
@@ -170,14 +181,14 @@ rx_copybreak=M
This is a speed/space tradeoff.
- The value of rx_copybreak is used to decide when to make the copy.
- If the packet size is less than rx_copybreak, the packet is copied.
+ The value of rx_copybreak is used to decide when to make the copy.
+ If the packet size is less than rx_copybreak, the packet is copied.
The default value for rx_copybreak is 200 bytes.
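The copy decision described above can be sketched as a single comparison. This is an illustration of the stated rule only; the function name is an assumption and the driver's real receive path is written in C.

```python
DEFAULT_RX_COPYBREAK = 200  # bytes, the default stated above


def should_copy(packet_len, rx_copybreak=DEFAULT_RX_COPYBREAK):
    """Return True when a received packet is small enough to be copied."""
    return packet_len < rx_copybreak


if __name__ == "__main__":
    for size in (64, 199, 200, 1500):
        print(size, should_copy(size))
```

Raising rx_copybreak (for example to 300, as in the modprobe example earlier) means more packets take the copy path.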
max_interrupt_work=N
The driver's interrupt service routine can handle many receive and
- transmit packets in a single invocation. It does this in a loop.
+ transmit packets in a single invocation. It does this in a loop.
The value of max_interrupt_work governs how many times the interrupt
service routine will loop. The default value is 32 loops. If this
is exceeded the interrupt service routine gives up and generates a
@@ -186,7 +197,7 @@ max_interrupt_work=N
hw_checksums=N1,N2,N3,...
Recent 3com NICs are able to generate IPv4, TCP and UDP checksums
- in hardware. Linux has used the Rx checksumming for a long time.
+ in hardware. Linux has used the Rx checksumming for a long time.
The "zero copy" patch which is planned for the 2.4 kernel series
allows you to make use of the NIC's DMA scatter/gather and transmit
checksumming as well.
@@ -196,11 +207,11 @@ hw_checksums=N1,N2,N3,...
This module parameter has been provided so you can override this
decision. If you think that Tx checksums are causing a problem, you
- may disable the feature with `hw_checksums=0'.
+ may disable the feature with ``hw_checksums=0``.
If you think your NIC should be performing Tx checksumming and the
driver isn't enabling it, you can force the use of hardware Tx
- checksumming with `hw_checksums=1'.
+ checksumming with ``hw_checksums=1``.
The driver drops a message in the logfiles to indicate whether or
not it is using hardware scatter/gather and hardware Tx checksums.
@@ -210,8 +221,8 @@ hw_checksums=N1,N2,N3,...
decrease in throughput for send(). There is no effect upon receive
efficiency.
-compaq_ioaddr=N
-compaq_irq=N
+compaq_ioaddr=N,
+compaq_irq=N,
compaq_device_id=N
"Variables to work-around the Compaq PCI BIOS32 problem"....
@@ -219,7 +230,7 @@ compaq_device_id=N
watchdog=N
Sets the time duration (in milliseconds) after which the kernel
- decides that the transmitter has become stuck and needs to be reset.
+ decides that the transmitter has become stuck and needs to be reset.
This is mainly for debugging purposes, although it may be advantageous
to increase this value on LANs which have very high collision rates.
The default value is 5000 (5.0 seconds).
@@ -227,7 +238,7 @@ watchdog=N
enable_wol=N1,N2,N3,...
Enable Wake-on-LAN support for the relevant interface. Donald
- Becker's `ether-wake' application may be used to wake suspended
+ Becker's ``ether-wake`` application may be used to wake suspended
machines.
Also enables the NIC's power management support.
@@ -235,7 +246,7 @@ enable_wol=N1,N2,N3,...
global_enable_wol=N
Sets enable_wol mode for all 3c59x NICs in the machine. Entries in
- the `enable_wol' array above will override any setting of this.
+ the ``enable_wol`` array above will override any setting of this.
Media selection
---------------
@@ -325,12 +336,12 @@ Autonegotiation notes
Cisco switches (Jeff Busch <jbusch@deja.com>)
- My "standard config" for ports to which PC's/servers connect directly:
+ My "standard config" for ports to which PC's/servers connect directly::
- interface FastEthernet0/N
- description machinename
- load-interval 30
- spanning-tree portfast
+ interface FastEthernet0/N
+ description machinename
+ load-interval 30
+ spanning-tree portfast
If autonegotiation is a problem, you may need to specify "speed
100" and "duplex full" as well (or "speed 10" and "duplex half").
@@ -363,14 +374,14 @@ steps you should take:
email address will be in the driver source or in the MAINTAINERS file.
- The contents of your report will vary a lot depending upon the
- problem. If it's a kernel crash then you should refer to the
- admin-guide/reporting-bugs.rst file.
+ problem. If it's a kernel crash then you should refer to
+ 'Documentation/admin-guide/reporting-issues.rst'.
But for most problems it is useful to provide the following:
- o Kernel version, driver version
+ - Kernel version, driver version
- o A copy of the banner message which the driver generates when
+ - A copy of the banner message which the driver generates when
it is initialised. For example:
eth0: 3Com PCI 3c905C Tornado at 0xa400, 00:50:da:6a:88:f0, IRQ 19
@@ -378,68 +389,68 @@ steps you should take:
MII transceiver found at address 24, status 782d.
Enabling bus-master transmits and whole-frame receives.
- NOTE: You must provide the `debug=2' modprobe option to generate
- a full detection message. Please do this:
+ NOTE: You must provide the ``debug=2`` modprobe option to generate
+ a full detection message. Please do this::
modprobe 3c59x debug=2
- o If it is a PCI device, the relevant output from 'lspci -vx', eg:
-
- 00:09.0 Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 74)
- Subsystem: 3Com Corporation: Unknown device 9200
- Flags: bus master, medium devsel, latency 32, IRQ 19
- I/O ports at a400 [size=128]
- Memory at db000000 (32-bit, non-prefetchable) [size=128]
- Expansion ROM at <unassigned> [disabled] [size=128K]
- Capabilities: [dc] Power Management version 2
- 00: b7 10 00 92 07 00 10 02 74 00 00 02 08 20 00 00
- 10: 01 a4 00 00 00 00 00 db 00 00 00 00 00 00 00 00
- 20: 00 00 00 00 00 00 00 00 00 00 00 00 b7 10 00 10
- 30: 00 00 00 00 dc 00 00 00 00 00 00 00 05 01 0a 0a
-
- o A description of the environment: 10baseT? 100baseT?
+ - If it is a PCI device, the relevant output from 'lspci -vx', eg::
+
+ 00:09.0 Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 74)
+ Subsystem: 3Com Corporation: Unknown device 9200
+ Flags: bus master, medium devsel, latency 32, IRQ 19
+ I/O ports at a400 [size=128]
+ Memory at db000000 (32-bit, non-prefetchable) [size=128]
+ Expansion ROM at <unassigned> [disabled] [size=128K]
+ Capabilities: [dc] Power Management version 2
+ 00: b7 10 00 92 07 00 10 02 74 00 00 02 08 20 00 00
+ 10: 01 a4 00 00 00 00 00 db 00 00 00 00 00 00 00 00
+ 20: 00 00 00 00 00 00 00 00 00 00 00 00 b7 10 00 10
+ 30: 00 00 00 00 dc 00 00 00 00 00 00 00 05 01 0a 0a
+
+ - A description of the environment: 10baseT? 100baseT?
full/half duplex? switched or hubbed?
- o Any additional module parameters which you may be providing to the driver.
+ - Any additional module parameters which you may be providing to the driver.
- o Any kernel logs which are produced. The more the merrier.
+ - Any kernel logs which are produced. The more the merrier.
If this is a large file and you are sending your report to a
mailing list, mention that you have the logfile, but don't send
it. If you're reporting direct to the maintainer then just send
it.
To ensure that all kernel logs are available, add the
- following line to /etc/syslog.conf:
+ following line to /etc/syslog.conf::
- kern.* /var/log/messages
+ kern.* /var/log/messages
- Then restart syslogd with:
+ Then restart syslogd with::
- /etc/rc.d/init.d/syslog restart
+ /etc/rc.d/init.d/syslog restart
(The above may vary, depending upon which Linux distribution you use).
- o If your problem is reproducible then that's great. Try the
+ - If your problem is reproducible then that's great. Try the
following:
1) Increase the debug level. Usually this is done via:
- a) modprobe driver debug=7
- b) In /etc/modprobe.d/driver.conf:
- options driver debug=7
+ a) modprobe driver debug=7
+ b) In /etc/modprobe.d/driver.conf:
+ options driver debug=7
2) Recreate the problem with the higher debug level,
- send all logs to the maintainer.
+ send all logs to the maintainer.
3) Download your card's diagnostic tool from Donald
- Becker's website <http://www.scyld.com/ethercard_diag.html>.
- Download mii-diag.c as well. Build these.
+ Becker's website <http://www.scyld.com/ethercard_diag.html>.
+ Download mii-diag.c as well. Build these.
- a) Run 'vortex-diag -aaee' and 'mii-diag -v' when the card is
- working correctly. Save the output.
+ a) Run 'vortex-diag -aaee' and 'mii-diag -v' when the card is
+ working correctly. Save the output.
- b) Run the above commands when the card is malfunctioning. Send
- both sets of output.
+ b) Run the above commands when the card is malfunctioning. Send
+ both sets of output.
Finally, please be patient and be prepared to do some work. You may
end up working on this problem for a week or more as the maintainer
diff --git a/Documentation/networking/altera_tse.txt b/Documentation/networking/device_drivers/ethernet/altera/altera_tse.rst
index 50b8589d12fd..7a7040072e58 100644
--- a/Documentation/networking/altera_tse.txt
+++ b/Documentation/networking/device_drivers/ethernet/altera/altera_tse.rst
@@ -1,6 +1,12 @@
- Altera Triple-Speed Ethernet MAC driver
+.. SPDX-License-Identifier: GPL-2.0
-Copyright (C) 2008-2014 Altera Corporation
+.. include:: <isonum.txt>
+
+=======================================
+Altera Triple-Speed Ethernet MAC driver
+=======================================
+
+Copyright |copy| 2008-2014 Altera Corporation
This is the driver for the Altera Triple-Speed Ethernet (TSE) controllers
using the SGDMA and MSGDMA soft DMA IP components. The driver uses the
@@ -46,23 +52,33 @@ Jumbo frames are not supported at this time.
The driver limits PHY operations to 10/100Mbps, and has not yet been fully
tested for 1Gbps. This support will be added in a future maintenance update.
-1) Kernel Configuration
+1. Kernel Configuration
+=======================
+
The kernel configuration option is ALTERA_TSE:
+
Device Drivers ---> Network device support ---> Ethernet driver support --->
Altera Triple-Speed Ethernet MAC support (ALTERA_TSE)
-2) Driver parameters list:
- debug: message level (0: no output, 16: all);
- dma_rx_num: Number of descriptors in the RX list (default is 64);
- dma_tx_num: Number of descriptors in the TX list (default is 64).
+2. Driver parameters list
+=========================
+
+ - debug: message level (0: no output, 16: all);
+ - dma_rx_num: Number of descriptors in the RX list (default is 64);
+ - dma_tx_num: Number of descriptors in the TX list (default is 64).
+
+3. Command line options
+=======================
+
+Driver parameters can also be passed on the command line using::
-3) Command line options
-Driver parameters can be also passed in command line by using:
altera_tse=dma_rx_num:128,dma_tx_num:512
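To make the syntax of that option string concrete, here is a small sketch that splits it into name/value pairs. It mirrors only the format shown above; the function name is hypothetical and this is not how the kernel itself parses the argument.

```python
def parse_altera_tse(arg):
    """Split 'altera_tse=name:value,...' into a dict of integer values."""
    prefix, sep, body = arg.partition("=")
    if prefix != "altera_tse" or not sep:
        raise ValueError("expected an argument of the form altera_tse=...")
    params = {}
    for item in body.split(","):
        name, colon, value = item.partition(":")
        if not colon:
            raise ValueError("expected name:value, got %r" % item)
        params[name] = int(value)
    return params


if __name__ == "__main__":
    print(parse_altera_tse("altera_tse=dma_rx_num:128,dma_tx_num:512"))
```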
-4) Driver information and notes
+4. Driver information and notes
+===============================
-4.1) Transmit process
+4.1. Transmit process
+---------------------
When the driver's transmit routine is called by the kernel, it sets up a
transmit descriptor by calling the underlying DMA transmit routine (SGDMA or
MSGDMA), and initiates a transmit operation. Once the transmit is complete, an
@@ -70,7 +86,8 @@ interrupt is driven by the transmit DMA logic. The driver handles the transmit
completion in the context of the interrupt handling chain by recycling
resources required to send and track the requested transmit operation.
-4.2) Receive process
+4.2. Receive process
+--------------------
The driver will post receive buffers to the receive DMA logic during driver
initialization. Receive buffers may or may not be queued depending upon the
underlying DMA logic (MSGDMA is able to queue receive buffers, SGDMA is not able
@@ -79,34 +96,39 @@ received, the DMA logic generates an interrupt. The driver handles a receive
interrupt by obtaining the DMA receive logic status, reaping receive
completions until no more receive completions are available.
-4.3) Interrupt Mitigation
+4.3. Interrupt Mitigation
+-------------------------
The driver is able to mitigate the number of its DMA interrupts
using NAPI for receive operations. Interrupt mitigation is not yet supported
for transmit operations, but will be added in a future maintenance release.
4.4) Ethtool support
+--------------------
Ethtool is supported. Driver statistics and internal errors can be taken using:
ethtool -S ethX command. It is possible to dump registers etc.
4.5) PHY Support
+----------------
The driver is compatible with PAL to work with PHY and GPHY devices.
4.7) List of source files:
- o Kconfig
- o Makefile
- o altera_tse_main.c: main network device driver
- o altera_tse_ethtool.c: ethtool support
- o altera_tse.h: private driver structure and common definitions
- o altera_msgdma.h: MSGDMA implementation function definitions
- o altera_sgdma.h: SGDMA implementation function definitions
- o altera_msgdma.c: MSGDMA implementation
- o altera_sgdma.c: SGDMA implementation
- o altera_sgdmahw.h: SGDMA register and descriptor definitions
- o altera_msgdmahw.h: MSGDMA register and descriptor definitions
- o altera_utils.c: Driver utility functions
- o altera_utils.h: Driver utility function definitions
-
-5) Debug Information
+--------------------------
+ - Kconfig
+ - Makefile
+ - altera_tse_main.c: main network device driver
+ - altera_tse_ethtool.c: ethtool support
+ - altera_tse.h: private driver structure and common definitions
+ - altera_msgdma.h: MSGDMA implementation function definitions
+ - altera_sgdma.h: SGDMA implementation function definitions
+ - altera_msgdma.c: MSGDMA implementation
+ - altera_sgdma.c: SGDMA implementation
+ - altera_sgdmahw.h: SGDMA register and descriptor definitions
+ - altera_msgdmahw.h: MSGDMA register and descriptor definitions
+ - altera_utils.c: Driver utility functions
+ - altera_utils.h: Driver utility function definitions
+
+5. Debug Information
+====================
The driver exports debug information such as internal statistics,
debug information, MAC and DMA registers etc.
@@ -118,17 +140,18 @@ or sees the MAC registers: e.g. using: ethtool -d ethX
The developer can also use the "debug" module parameter to get
further debug information.
-6) Statistics Support
+6. Statistics Support
+=====================
The controller and driver support a mix of IEEE standard defined statistics,
RFC defined statistics, and driver or Altera defined statistics. The four
specifications containing the standard definitions for these statistics are
as follows:
- o IEEE 802.3-2012 - IEEE Standard for Ethernet.
- o RFC 2863 found at http://www.rfc-editor.org/rfc/rfc2863.txt.
- o RFC 2819 found at http://www.rfc-editor.org/rfc/rfc2819.txt.
- o Altera Triple Speed Ethernet User Guide, found at http://www.altera.com
+ - IEEE 802.3-2012 - IEEE Standard for Ethernet.
+ - RFC 2863 found at http://www.rfc-editor.org/rfc/rfc2863.txt.
+ - RFC 2819 found at http://www.rfc-editor.org/rfc/rfc2819.txt.
+ - Altera Triple Speed Ethernet User Guide, found at http://www.altera.com
The statistics supported by the TSE and the device driver are as follows:
diff --git a/Documentation/networking/device_drivers/amazon/ena.txt b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
index 1bb55c7b604c..8bcb173e0353 100644
--- a/Documentation/networking/device_drivers/amazon/ena.txt
+++ b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
@@ -1,18 +1,22 @@
-Linux kernel driver for Elastic Network Adapter (ENA) family:
-=============================================================
+.. SPDX-License-Identifier: GPL-2.0
+
+============================================================
+Linux kernel driver for Elastic Network Adapter (ENA) family
+============================================================
+
+Overview
+========
-Overview:
-=========
ENA is a networking interface designed to make good use of modern CPU
features and system architectures.
The ENA device exposes a lightweight management interface with a
-minimal set of memory mapped registers and extendable command set
+minimal set of memory mapped registers and extendible command set
through an Admin Queue.
The driver supports a range of ENA devices, is link-speed independent
-(i.e., the same driver is used for 10GbE, 25GbE, 40GbE, etc.), and has
-a negotiated and extendable feature set.
+(i.e., the same driver is used for 10GbE, 25GbE, 40GbE, etc), and has
+a negotiated and extendible feature set.
Some ENA devices support SR-IOV. This driver is used for both the
SR-IOV Physical Function (PF) and Virtual Function (VF) devices.
@@ -23,9 +27,9 @@ is advertised by the device via the Admin Queue), a dedicated MSI-X
interrupt vector per Tx/Rx queue pair, adaptive interrupt moderation,
and CPU cacheline optimized data placement.
-The ENA driver supports industry standard TCP/IP offload features such
-as checksum offload and TCP transmit segmentation offload (TSO).
-Receive-side scaling (RSS) is supported for multi-core scaling.
+The ENA driver supports industry standard TCP/IP offload features such as
+checksum offload. Receive-side scaling (RSS) is supported for multi-core
+scaling.
The ENA driver and its corresponding devices implement health
monitoring mechanisms such as watchdog, enabling the device and driver
@@ -34,40 +38,36 @@ debug logs.
Some of the ENA devices support a working mode called Low-latency
Queue (LLQ), which saves several more microseconds.
-
-Supported PCI vendor ID/device IDs:
+ENA Source Code Directory Structure
===================================
-1d0f:0ec2 - ENA PF
-1d0f:1ec2 - ENA PF with LLQ support
-1d0f:ec20 - ENA VF
-1d0f:ec21 - ENA VF with LLQ support
-
-ENA Source Code Directory Structure:
-====================================
-ena_com.[ch] - Management communication layer. This layer is
+
+================= ======================================================
+ena_com.[ch] Management communication layer. This layer is
responsible for handling all the management
(admin) communication between the device and the
driver.
-ena_eth_com.[ch] - Tx/Rx data path.
-ena_admin_defs.h - Definition of ENA management interface.
-ena_eth_io_defs.h - Definition of ENA data path interface.
-ena_common_defs.h - Common definitions for ena_com layer.
-ena_regs_defs.h - Definition of ENA PCI memory-mapped (MMIO) registers.
-ena_netdev.[ch] - Main Linux kernel driver.
-ena_syfsfs.[ch] - Sysfs files.
-ena_ethtool.c - ethtool callbacks.
-ena_pci_id_tbl.h - Supported device IDs.
+ena_eth_com.[ch] Tx/Rx data path.
+ena_admin_defs.h Definition of ENA management interface.
+ena_eth_io_defs.h Definition of ENA data path interface.
+ena_common_defs.h Common definitions for ena_com layer.
+ena_regs_defs.h Definition of ENA PCI memory-mapped (MMIO) registers.
+ena_netdev.[ch] Main Linux kernel driver.
+ena_ethtool.c ethtool callbacks.
+ena_pci_id_tbl.h Supported device IDs.
+================= ======================================================
Management Interface:
=====================
+
ENA management interface is exposed by means of:
+
- PCIe Configuration Space
- Device Registers
- Admin Queue (AQ) and Admin Completion Queue (ACQ)
- Asynchronous Event Notification Queue (AENQ)
ENA device MMIO Registers are accessed only during driver
-initialization and are not involved in further normal device
+initialization and are not used during further normal device
operation.
AQ is used for submitting management commands, and the
@@ -78,6 +78,7 @@ vendor-specific extensions. Most of the management operations are
framed in a generic Get/Set feature command.
The following admin queue commands are supported:
+
- Create I/O submission queue
- Create I/O completion queue
- Destroy I/O submission queue
@@ -96,25 +97,28 @@ be reported using ACQ. AENQ events are subdivided into groups. Each
group may have multiple syndromes, as shown below
The events are:
- Group Syndrome
- Link state change - X -
- Fatal error - X -
- Notification Suspend traffic
- Notification Resume traffic
- Keep-Alive - X -
+
+==================== ===============
+Group Syndrome
+==================== ===============
+Link state change **X**
+Fatal error **X**
+Notification Suspend traffic
+Notification Resume traffic
+Keep-Alive **X**
+==================== ===============
ACQ and AENQ share the same MSI-X vector.
-Keep-Alive is a special mechanism that allows monitoring of the
-device's health. The driver maintains a watchdog (WD) handler which,
-if fired, logs the current state and statistics then resets and
-restarts the ENA device and driver. A Keep-Alive event is delivered by
-the device every second. The driver re-arms the WD upon reception of a
-Keep-Alive event. A missed Keep-Alive event causes the WD handler to
-fire.
+Keep-Alive is a special mechanism that allows monitoring the device's health.
+A Keep-Alive event is delivered by the device every second.
+The driver maintains a watchdog (WD) handler which logs the current state and
+statistics. If the keep-alive events aren't delivered as expected the WD resets
+the device and the driver.
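The Keep-Alive behaviour can be sketched as a timestamp that each event refreshes and a periodic check against a timeout. The class name, the timeout value and the explicit `now` arguments are assumptions made for this illustration; the text above does not say how long the driver waits before resetting.

```python
class KeepAliveWatchdog:
    """Toy model: reset when no Keep-Alive event arrives within timeout_s."""

    def __init__(self, timeout_s, now):
        self.timeout_s = timeout_s  # hypothetical value, not from the driver
        self.last_seen = now
        self.resets = 0

    def on_keep_alive(self, now):
        # Called for every Keep-Alive event delivered by the device.
        self.last_seen = now

    def check(self, now):
        # Called periodically. Returns True when a reset is triggered.
        if now - self.last_seen > self.timeout_s:
            self.resets += 1
            self.last_seen = now  # start a fresh window after the reset
            return True
        return False


if __name__ == "__main__":
    wd = KeepAliveWatchdog(timeout_s=3.0, now=0.0)
    wd.on_keep_alive(1.0)
    print(wd.check(2.0))  # events are arriving, no reset
    print(wd.check(5.0))  # 4 seconds of silence, reset
```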
+
+Data Path Interface
+===================
-Data Path Interface:
-====================
I/O operations are based on Tx and Rx Submission Queues (Tx SQ and Rx
SQ correspondingly). Each SQ has a completion queue (CQ) associated
with it.
@@ -123,25 +127,28 @@ The SQs and CQs are implemented as descriptor rings in contiguous
physical memory.
The ENA driver supports two Queue Operation modes for Tx SQs:
-- Regular mode
- * In this mode the Tx SQs reside in the host's memory. The ENA
- device fetches the ENA Tx descriptors and packet data from host
- memory.
-- Low Latency Queue (LLQ) mode or "push-mode".
- * In this mode the driver pushes the transmit descriptors and the
- first 128 bytes of the packet directly to the ENA device memory
- space. The rest of the packet payload is fetched by the
- device. For this operation mode, the driver uses a dedicated PCI
- device memory BAR, which is mapped with write-combine capability.
-The Rx SQs support only the regular mode.
+- **Regular mode:**
+ In this mode the Tx SQs reside in the host's memory. The ENA
+ device fetches the ENA Tx descriptors and packet data from host
+ memory.
+
+- **Low Latency Queue (LLQ) mode or "push-mode":**
+ In this mode the driver pushes the transmit descriptors and the
+ first 96 bytes of the packet directly to the ENA device memory
+ space. The rest of the packet payload is fetched by the
+ device. For this operation mode, the driver uses a dedicated PCI
+ device memory BAR, which is mapped with write-combine capability.
+
+ **Note that** not all ENA devices support LLQ, and this feature is negotiated
+ with the device upon initialization. If the ENA device does not
+ support LLQ mode, the driver falls back to the regular mode.
-Note: Not all ENA devices support LLQ, and this feature is negotiated
- with the device upon initialization. If the ENA device does not
- support LLQ mode, the driver falls back to the regular mode.
+The Rx SQs support only the regular mode.
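As a rough illustration of the LLQ split, the sketch below separates a packet into the prefix that is pushed to device memory and the remainder that the device fetches. The 96-byte figure comes from the text above; the function name is an assumption, and descriptors are left out entirely.

```python
LLQ_PUSH_BYTES = 96  # size of the pushed prefix, as stated above


def split_for_llq(packet):
    """Return (pushed, fetched) parts of a packet given as bytes."""
    return packet[:LLQ_PUSH_BYTES], packet[LLQ_PUSH_BYTES:]


if __name__ == "__main__":
    pushed, fetched = split_for_llq(bytes(150))
    print(len(pushed), len(fetched))
```

A packet shorter than the push size ends up entirely in the pushed part, leaving nothing for the device to fetch.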
The driver supports multi-queue for both Tx and Rx. This has various
benefits:
+
- Reduced CPU/thread/process contention on a given Ethernet interface.
- Cache miss rate on completion is reduced, particularly for data
cache lines that hold the sk_buff structures.
@@ -151,8 +158,9 @@ benefits:
packet is running.
- In hardware interrupt re-direction.
-Interrupt Modes:
-================
+Interrupt Modes
+===============
+
The driver assigns a single MSI-X vector per queue pair (for both Tx
and Rx directions). The driver assigns an additional dedicated MSI-X vector
for management (for ACQ and AENQ).
@@ -163,9 +171,12 @@ removed. I/O queue interrupt registration is performed when the Linux
interface of the adapter is opened, and it is de-registered when the
interface is closed.
-The management interrupt is named:
+The management interrupt is named::
+
ena-mgmnt@pci:<PCI domain:bus:slot.function>
-and for each queue pair, an interrupt is named:
+
+and for each queue pair, an interrupt is named::
+
<interface name>-Tx-Rx-<queue index>
The ENA device operates in auto-mask and auto-clear interrupt
@@ -173,74 +184,61 @@ modes. That is, once MSI-X is delivered to the host, its Cause bit is
automatically cleared and the interrupt is masked. The interrupt is
unmasked by the driver after NAPI processing is complete.
-Interrupt Moderation:
-=====================
+Interrupt Moderation
+====================
+
ENA driver and device can operate in conventional or adaptive interrupt
moderation mode.
-In conventional mode the driver instructs device to postpone interrupt
+**In conventional mode** the driver instructs the device to postpone interrupt
posting according to a static interrupt delay value. The interrupt delay
-value can be configured through ethtool(8). The following ethtool
-parameters are supported by the driver: tx-usecs, rx-usecs
+value can be configured through `ethtool(8)`. The following `ethtool`
+parameters are supported by the driver: ``tx-usecs``, ``rx-usecs``
-In adaptive interrupt moderation mode the interrupt delay value is
+**In adaptive interrupt** moderation mode the interrupt delay value is
updated by the driver dynamically and adjusted every NAPI cycle
according to the traffic nature.
-By default ENA driver applies adaptive coalescing on Rx traffic and
-conventional coalescing on Tx traffic.
-
-Adaptive coalescing can be switched on/off through ethtool(8)
-adaptive_rx on|off parameter.
-
-The driver chooses interrupt delay value according to the number of
-bytes and packets received between interrupt unmasking and interrupt
-posting. The driver uses interrupt delay table that subdivides the
-range of received bytes/packets into 5 levels and assigns interrupt
-delay value to each level.
+Adaptive coalescing can be switched on/off through `ethtool(8)`'s
+:code:`adaptive_rx on|off` parameter.
-The user can enable/disable adaptive moderation, modify the interrupt
-delay table and restore its default values through sysfs.
+More information about Dynamic Interrupt Moderation (DIM) can be found in
+Documentation/networking/net_dim.rst.
-RX copybreak:
-=============
+RX copybreak
+============
The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK
and can be configured by the ETHTOOL_STUNABLE command of the
SIOCETHTOOL ioctl.
-SKB:
-====
-The driver-allocated SKB for frames received from Rx handling using
-NAPI context. The allocation method depends on the size of the packet.
-If the frame length is larger than rx_copybreak, napi_get_frags()
-is used, otherwise netdev_alloc_skb_ip_align() is used, the buffer
-content is copied (by CPU) to the SKB, and the buffer is recycled.
-
-Statistics:
-===========
-The user can obtain ENA device and driver statistics using ethtool.
+Statistics
+==========
+
+The user can obtain ENA device and driver statistics using `ethtool`.
The driver can collect regular or extended statistics (including
per-queue stats) from the device.
In addition the driver logs the stats to syslog upon device reset.
-MTU:
-====
+MTU
+===
+
The driver supports an arbitrarily large MTU with a maximum that is
negotiated with the device. The driver configures MTU using the
SetFeature command (ENA_ADMIN_MTU property). The user can change MTU
-via ip(8) and similar legacy tools.
+via `ip(8)` and similar legacy tools.
+
+Stateless Offloads
+==================
-Stateless Offloads:
-===================
The ENA driver supports:
-- TSO over IPv4/IPv6
-- TSO with ECN
+
- IPv4 header checksum offload
- TCP/UDP over IPv4/IPv6 checksum offloads
-RSS:
-====
+RSS
+===
+
- The ENA device supports RSS that allows flexible Rx traffic
steering.
- Toeplitz and CRC32 hash functions are supported.
@@ -248,61 +246,72 @@ RSS:
inputs for hash functions.
- The driver configures RSS settings using the AQ SetFeature command
(ENA_ADMIN_RSS_HASH_FUNCTION, ENA_ADMIN_RSS_HASH_INPUT and
- ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG properties).
+ ENA_ADMIN_RSS_INDIRECTION_TABLE_CONFIG properties).
- If the NETIF_F_RXHASH flag is set, the 32-bit result of the hash
function delivered in the Rx CQ descriptor is set in the received
SKB.
- The user can provide a hash key, hash function, and configure the
- indirection table through ethtool(8).
+ indirection table through `ethtool(8)`.
-DATA PATH:
-==========
-Tx:
----
-end_start_xmit() is called by the stack. This function does the following:
-- Maps data buffers (skb->data and frags).
-- Populates ena_buf for the push buffer (if the driver and device are
- in push mode.)
+DATA PATH
+=========
+
+Tx
+--
+
+:code:`ena_start_xmit()` is called by the stack. This function does the following:
+
+- Maps data buffers (``skb->data`` and frags).
+- Populates ``ena_buf`` for the push buffer (if the driver and device are
+ in push mode).
- Prepares ENA bufs for the remaining frags.
-- Allocates a new request ID from the empty req_id ring. The request
+- Allocates a new request ID from the empty ``req_id`` ring. The request
ID is the index of the packet in the Tx info. This is used for
- out-of-order TX completions.
+ out-of-order Tx completions.
- Adds the packet to the proper place in the Tx ring.
-- Calls ena_com_prepare_tx(), an ENA communication layer that converts
- the ena_bufs to ENA descriptors (and adds meta ENA descriptors as
- needed.)
+- Calls :code:`ena_com_prepare_tx()`, an ENA communication layer function that converts
+ the ``ena_bufs`` to ENA descriptors (and adds meta ENA descriptors as
+ needed).
+
* This function also copies the ENA descriptors and the push buffer
- to the Device memory space (if in push mode.)
-- Writes doorbell to the ENA device.
+ to the Device memory space (if in push mode).
+
+- Writes a doorbell to the ENA device.
- When the ENA device finishes sending the packet, a completion
interrupt is raised.
- The interrupt handler schedules NAPI.
-- The ena_clean_tx_irq() function is called. This function handles the
+- The :code:`ena_clean_tx_irq()` function is called. This function handles the
completion descriptors generated by the ENA, with a single
completion descriptor per completed packet.
- * req_id is retrieved from the completion descriptor. The tx_info of
- the packet is retrieved via the req_id. The data buffers are
- unmapped and req_id is returned to the empty req_id ring.
+
+ * ``req_id`` is retrieved from the completion descriptor. The ``tx_info`` of
+ the packet is retrieved via the ``req_id``. The data buffers are
+ unmapped and ``req_id`` is returned to the empty ``req_id`` ring.
* The function stops when the completion descriptors are completed or
the budget is reached.
-Rx:
----
+Rx
+--
+
- When a packet is received from the ENA device, an interrupt is raised.
- The interrupt handler schedules NAPI.
-- The ena_clean_rx_irq() function is called. This function calls
- ena_rx_pkt(), an ENA communication layer function, which returns the
- number of descriptors used for a new unhandled packet, and zero if
+- The :code:`ena_clean_rx_irq()` function is called. This function calls
+ :code:`ena_com_rx_pkt()`, an ENA communication layer function, which returns the
+ number of descriptors used for a new packet, and zero if
no new packet is found.
-- Then it calls the ena_clean_rx_irq() function.
-- ena_eth_rx_skb() checks packet length:
+- :code:`ena_rx_skb()` checks packet length:
+
* If the packet is small (len < rx_copybreak), the driver allocates
a SKB for the new packet, and copies the packet payload into the
SKB data buffer.
+
- In this way the original data buffer is not passed to the stack
and is reused for future Rx packets.
- * Otherwise the function unmaps the Rx buffer, then allocates the
- new SKB structure and hooks the Rx buffer to the SKB frags.
+
+ * Otherwise the function unmaps the Rx buffer, sets the first
+ descriptor as `skb`'s linear part and the other descriptors as the
+ `skb`'s frags.
+
- The new SKB is updated with the necessary information (protocol,
- checksum hw verify result, etc.), and then passed to the network
- stack, using the NAPI interface function napi_gro_receive().
+ checksum hw verify result, etc.), and then passed to the network
+ stack, using the NAPI interface function :code:`napi_gro_receive()`.
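The copybreak decision in the Rx path above can be sketched as follows. This is a minimal illustration rather than driver code: the function name ``choose_rx_path``, the returned labels and the 256-byte threshold used in the usage note are assumptions made for the example. The real default is whatever ENA_DEFAULT_RX_COPYBREAK is defined as.

```python
def choose_rx_path(pkt_len: int, rx_copybreak: int) -> str:
    """Pick the Rx handling described in the text for one packet.

    "copy"   : small packet, payload is copied into a new SKB and the
               original Rx buffer is reused for future packets.
    "attach" : the Rx buffer is unmapped and handed to the SKB
               (first descriptor as linear part, the rest as frags).
    """
    if pkt_len < 0 or rx_copybreak < 0:
        raise ValueError("lengths must be non-negative")
    if pkt_len < rx_copybreak:
        return "copy"
    return "attach"
```

With a hypothetical threshold of 256 bytes, a 64-byte packet is copied, while a 256-byte or 1500-byte packet is attached, because the comparison in the text is strict (len < rx_copybreak).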
diff --git a/Documentation/networking/device_drivers/aquantia/atlantic.txt b/Documentation/networking/device_drivers/ethernet/aquantia/atlantic.rst
index 2013fcedc2da..595ddef1c8b3 100644
--- a/Documentation/networking/device_drivers/aquantia/atlantic.txt
+++ b/Documentation/networking/device_drivers/ethernet/aquantia/atlantic.rst
@@ -1,83 +1,96 @@
-Marvell(Aquantia) AQtion Driver for the aQuantia Multi-Gigabit PCI Express
-Family of Ethernet Adapters
-=============================================================================
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
-Contents
-========
+===============================
+Marvell(Aquantia) AQtion Driver
+===============================
-- Identifying Your Adapter
-- Configuration
-- Supported ethtool options
-- Command Line Parameters
-- Config file parameters
-- Support
-- License
+For the aQuantia Multi-Gigabit PCI Express Family of Ethernet Adapters
+
+.. Contents
+
+ - Identifying Your Adapter
+ - Configuration
+ - Supported ethtool options
+ - Command Line Parameters
+ - Config file parameters
+ - Support
+ - License
Identifying Your Adapter
========================
-The driver in this release is compatible with AQC-100, AQC-107, AQC-108 based ethernet adapters.
+The driver in this release is compatible with AQC-100, AQC-107, AQC-108
+based ethernet adapters.
SFP+ Devices (for AQC-100 based adapters)
-----------------------------------
+-----------------------------------------
-This release tested with passive Direct Attach Cables (DAC) and SFP+/LC Optical Transceiver.
+This release was tested with passive Direct Attach Cables (DAC) and SFP+/LC
+Optical Transceivers.
Configuration
-=========================
- Viewing Link Messages
- ---------------------
+=============
+
+Viewing Link Messages
+---------------------
Link messages will not be displayed to the console if the distribution is
restricting system messages. In order to see network driver link messages on
- your console, set dmesg to eight by entering the following:
+ your console, set dmesg to eight by entering the following::
dmesg -n 8
- NOTE: This setting is not saved across reboots.
+ .. note::
- Jumbo Frames
- ------------
+ This setting is not saved across reboots.
+
+Jumbo Frames
+------------
The driver supports Jumbo Frames for all adapters. Jumbo Frames support is
enabled by changing the MTU to a value larger than the default of 1500.
The maximum value for the MTU is 16000. Use the `ip` command to
- increase the MTU size. For example:
+ increase the MTU size. For example::
- ip link set mtu 16000 dev enp1s0
+ ip link set mtu 16000 dev enp1s0
- ethtool
- -------
+ethtool
+-------
The driver utilizes the ethtool interface for driver configuration and
diagnostics, as well as displaying statistical information. The latest
ethtool version is required for this functionality.
- NAPI
- ----
+NAPI
+----
NAPI (Rx polling mode) is supported in the atlantic driver.
Supported ethtool options
-============================
- Viewing adapter settings
- ---------------------
- ethtool <ethX>
+=========================
+
+Viewing adapter settings
+------------------------
+
+ ::
- Output example:
+ ethtool <ethX>
+
+ Output example::
Settings for enp1s0:
Supported ports: [ TP ]
Supported link modes: 100baseT/Full
- 1000baseT/Full
- 10000baseT/Full
- 2500baseT/Full
- 5000baseT/Full
+ 1000baseT/Full
+ 10000baseT/Full
+ 2500baseT/Full
+ 5000baseT/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes: 100baseT/Full
- 1000baseT/Full
- 10000baseT/Full
- 2500baseT/Full
- 5000baseT/Full
+ 1000baseT/Full
+ 10000baseT/Full
+ 2500baseT/Full
+ 5000baseT/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
@@ -92,16 +105,22 @@ Supported ethtool options
Wake-on: d
Link detected: yes
- ---
- Note: AQrate speeds (2.5/5 Gb/s) will be displayed only with linux kernels > 4.10.
- But you can still use these speeds:
+
+ .. note::
+
+ AQrate speeds (2.5/5 Gb/s) will be displayed only with Linux kernels > 4.10.
+ But you can still use these speeds::
+
ethtool -s eth0 autoneg off speed 2500
- Viewing adapter information
- ---------------------
- ethtool -i <ethX>
+Viewing adapter information
+---------------------------
- Output example:
+ ::
+
+ ethtool -i <ethX>
+
+ Output example::
driver: atlantic
version: 5.2.0-050200rc5-generic-kern
@@ -115,12 +134,16 @@ Supported ethtool options
supports-priv-flags: no
- Viewing Ethernet adapter statistics:
- ---------------------
- ethtool -S <ethX>
+Viewing Ethernet adapter statistics
+-----------------------------------
+
+ ::
- Output example:
- NIC statistics:
+ ethtool -S <ethX>
+
+ Output example::
+
+ NIC statistics:
InPackets: 13238607
InUCast: 13293852
InMCast: 52
@@ -164,85 +187,95 @@ Supported ethtool options
Queue[3] InLroPackets: 0
Queue[3] InErrors: 0
- Interrupt coalescing support
- ---------------------------------
- ITR mode, TX/RX coalescing timings could be viewed with:
+Interrupt coalescing support
+----------------------------
- ethtool -c <ethX>
+ ITR mode, TX/RX coalescing timings could be viewed with::
- and changed with:
+ ethtool -c <ethX>
- ethtool -C <ethX> tx-usecs <usecs> rx-usecs <usecs>
+ and changed with::
- To disable coalescing:
+ ethtool -C <ethX> tx-usecs <usecs> rx-usecs <usecs>
- ethtool -C <ethX> tx-usecs 0 rx-usecs 0 tx-max-frames 1 tx-max-frames 1
+ To disable coalescing::
- Wake on LAN support
- ---------------------------------
+ ethtool -C <ethX> tx-usecs 0 rx-usecs 0 tx-frames 1 rx-frames 1
- WOL support by magic packet:
+Wake on LAN support
+-------------------
- ethtool -s <ethX> wol g
+ WOL support by magic packet::
- To disable WOL:
+ ethtool -s <ethX> wol g
- ethtool -s <ethX> wol d
+ To disable WOL::
- Set and check the driver message level
- ---------------------------------
+ ethtool -s <ethX> wol d
+
+Set and check the driver message level
+--------------------------------------
Set message level
- ethtool -s <ethX> msglvl <level>
+ ::
+
+ ethtool -s <ethX> msglvl <level>
Level values:
- 0x0001 - general driver status.
- 0x0002 - hardware probing.
- 0x0004 - link state.
- 0x0008 - periodic status check.
- 0x0010 - interface being brought down.
- 0x0020 - interface being brought up.
- 0x0040 - receive error.
- 0x0080 - transmit error.
- 0x0200 - interrupt handling.
- 0x0400 - transmit completion.
- 0x0800 - receive completion.
- 0x1000 - packet contents.
- 0x2000 - hardware status.
- 0x4000 - Wake-on-LAN status.
+ ====== =============================
+ 0x0001 general driver status.
+ 0x0002 hardware probing.
+ 0x0004 link state.
+ 0x0008 periodic status check.
+ 0x0010 interface being brought down.
+ 0x0020 interface being brought up.
+ 0x0040 receive error.
+ 0x0080 transmit error.
+ 0x0200 interrupt handling.
+ 0x0400 transmit completion.
+ 0x0800 receive completion.
+ 0x1000 packet contents.
+ 0x2000 hardware status.
+ 0x4000 Wake-on-LAN status.
+ ====== =============================
By default, the level of debugging messages is set to 0x0001 (general driver status).
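Because each level value is a single bit, several levels can be enabled at once by OR-ing them together. The sketch below only does the arithmetic on the values from the table above. The helper names ``combine_levels`` and ``describe_mask`` are made up for this example and are not part of ethtool or the driver.

```python
# Level values and descriptions copied from the table above.
MSG_LEVELS = {
    0x0001: "general driver status",
    0x0002: "hardware probing",
    0x0004: "link state",
    0x0008: "periodic status check",
    0x0010: "interface being brought down",
    0x0020: "interface being brought up",
    0x0040: "receive error",
    0x0080: "transmit error",
    0x0200: "interrupt handling",
    0x0400: "transmit completion",
    0x0800: "receive completion",
    0x1000: "packet contents",
    0x2000: "hardware status",
    0x4000: "Wake-on-LAN status",
}


def combine_levels(*bits: int) -> int:
    """OR individual level bits into one msglvl mask."""
    mask = 0
    for bit in bits:
        if bit not in MSG_LEVELS:
            raise ValueError(f"unknown level bit: {bit:#06x}")
        mask |= bit
    return mask


def describe_mask(mask: int) -> list:
    """List the descriptions of all level bits set in mask."""
    return [name for bit, name in sorted(MSG_LEVELS.items()) if mask & bit]
```

For instance, link state plus receive and transmit errors combine to ``0x00c4``, which could then be passed as the ``<level>`` argument of ``ethtool -s <ethX> msglvl``.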
Check message level
- ethtool <ethX> | grep "Current message level"
+ ::
- If you want to disable the output of messages
+ ethtool <ethX> | grep "Current message level"
- ethtool -s <ethX> msglvl 0
+ If you want to disable the output of messages::
+
+ ethtool -s <ethX> msglvl 0
+
+RX flow rules (ntuple filters)
+------------------------------
- RX flow rules (ntuple filters)
- ---------------------------------
Separate rule types are supported, and they are applied in this order:
+
1. 16 VLAN ID rules
2. 16 L2 EtherType rules
3. 8 L3/L4 5-Tuple rules
The driver utilizes the ethtool interface for configuring ntuple filters,
- via "ethtool -N <device> <filter>".
+ via ``ethtool -N <device> <filter>``.
- To enable or disable the RX flow rules:
+ To enable or disable the RX flow rules::
- ethtool -K ethX ntuple <on|off>
+ ethtool -K ethX ntuple <on|off>
When disabling ntuple filters, all the user-programmed filters are
flushed from the driver cache and hardware. All needed filters must
be re-added when ntuple is re-enabled.
Because of the fixed order of the rules, the location of filters is also fixed:
+
- Locations 0 - 15 for VLAN ID filters
- Locations 16 - 31 for L2 EtherType filters
- Locations 32 - 39 for L3/L4 5-tuple filters (locations 32, 36 for IPv6)
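The fixed location layout above can be checked before a rule is passed to ethtool. The following is a small sketch of that lookup. The names ``filter_type_for_location`` and ``is_valid_l3l4_location`` are invented for this illustration, and the code only encodes the ranges listed above.

```python
# Location ranges as listed above (end of range() is exclusive).
FILTER_RANGES = (
    (range(0, 16), "VLAN ID"),
    (range(16, 32), "L2 EtherType"),
    (range(32, 40), "L3/L4 5-tuple"),
)

# Only these two L3/L4 locations are usable for IPv6 rules.
IPV6_LOCATIONS = (32, 36)


def filter_type_for_location(loc: int) -> str:
    """Return the filter type that owns rule location loc."""
    for locations, kind in FILTER_RANGES:
        if loc in locations:
            return kind
    raise ValueError(f"location {loc} is outside 0 - 39")


def is_valid_l3l4_location(loc: int, ipv6: bool = False) -> bool:
    """Check whether loc may hold an L3/L4 rule of the given family."""
    if ipv6:
        return loc in IPV6_LOCATIONS
    return loc in range(32, 40)
```

A rule with ``loc 36`` is therefore acceptable for both address families, while ``loc 33`` is only usable for IPv4.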
@@ -253,32 +286,34 @@ Supported ethtool options
addresses can be supported. Source and destination ports are only compared for
TCP/UDP/SCTP packets.
- To add a filter that directs packet to queue 5, use <-N|-U|--config-nfc|--config-ntuple> switch:
+ To add a filter that directs packets to queue 5, use the
+ ``<-N|-U|--config-nfc|--config-ntuple>`` switch::
- ethtool -N <ethX> flow-type udp4 src-ip 10.0.0.1 dst-ip 10.0.0.2 src-port 2000 dst-port 2001 action 5 <loc 32>
+ ethtool -N <ethX> flow-type udp4 src-ip 10.0.0.1 dst-ip 10.0.0.2 src-port 2000 dst-port 2001 action 5 <loc 32>
- action is the queue number.
- loc is the rule number.
- For "flow-type ip4|udp4|tcp4|sctp4|ip6|udp6|tcp6|sctp6" you must set the loc
+ For ``flow-type ip4|udp4|tcp4|sctp4|ip6|udp6|tcp6|sctp6`` you must set the loc
number within 32 - 39.
- For "flow-type ip4|udp4|tcp4|sctp4|ip6|udp6|tcp6|sctp6" you can set 8 rules
+ For ``flow-type ip4|udp4|tcp4|sctp4|ip6|udp6|tcp6|sctp6`` you can set 8 rules
for IPv4 traffic or 2 rules for IPv6 traffic. The loc numbers for IPv6
traffic are 32 and 36.
At the moment you cannot use IPv4 and IPv6 filters at the same time.
- Example filter for IPv6 filter traffic:
+ Example filters for IPv6 traffic::
- sudo ethtool -N <ethX> flow-type tcp6 src-ip 2001:db8:0:f101::1 dst-ip 2001:db8:0:f101::2 action 1 loc 32
- sudo ethtool -N <ethX> flow-type ip6 src-ip 2001:db8:0:f101::2 dst-ip 2001:db8:0:f101::5 action -1 loc 36
+ sudo ethtool -N <ethX> flow-type tcp6 src-ip 2001:db8:0:f101::1 dst-ip 2001:db8:0:f101::2 action 1 loc 32
+ sudo ethtool -N <ethX> flow-type ip6 src-ip 2001:db8:0:f101::2 dst-ip 2001:db8:0:f101::5 action -1 loc 36
- Example filter for IPv4 filter traffic:
+ Example filters for IPv4 traffic::
- sudo ethtool -N <ethX> flow-type udp4 src-ip 10.0.0.4 dst-ip 10.0.0.7 src-port 2000 dst-port 2001 loc 32
- sudo ethtool -N <ethX> flow-type tcp4 src-ip 10.0.0.3 dst-ip 10.0.0.9 src-port 2000 dst-port 2001 loc 33
- sudo ethtool -N <ethX> flow-type ip4 src-ip 10.0.0.6 dst-ip 10.0.0.4 loc 34
+ sudo ethtool -N <ethX> flow-type udp4 src-ip 10.0.0.4 dst-ip 10.0.0.7 src-port 2000 dst-port 2001 loc 32
+ sudo ethtool -N <ethX> flow-type tcp4 src-ip 10.0.0.3 dst-ip 10.0.0.9 src-port 2000 dst-port 2001 loc 33
+ sudo ethtool -N <ethX> flow-type ip4 src-ip 10.0.0.6 dst-ip 10.0.0.4 loc 34
If you set action -1, then all traffic corresponding to the filter will be discarded.
+
The maximum action value is 31.
@@ -287,8 +322,9 @@ Supported ethtool options
from L2 Ethertype filter with UserPriority since both User Priority and VLAN ID
are passed in the same 'vlan' parameter.
- To add a filter that directs packets from VLAN 2001 to queue 5:
- ethtool -N <ethX> flow-type ip4 vlan 2001 m 0xF000 action 1 loc 0
+ To add a filter that directs packets from VLAN 2001 to queue 5::
+
+ ethtool -N <ethX> flow-type ip4 vlan 2001 m 0xF000 action 1 loc 0
L2 EtherType filters allow filtering packets by EtherType field or both EtherType
@@ -297,17 +333,17 @@ Supported ethtool options
distinguish VLAN filter from L2 Ethertype filter with UserPriority since both
User Priority and VLAN ID are passed in the same 'vlan' parameter.
- To add a filter that directs IP4 packess of priority 3 to queue 3:
- ethtool -N <ethX> flow-type ether proto 0x800 vlan 0x600 m 0x1FFF action 3 loc 16
+ To add a filter that directs IPv4 packets of priority 3 to queue 3::
+
+ ethtool -N <ethX> flow-type ether proto 0x800 vlan 0x600 m 0x1FFF action 3 loc 16
- To see the list of filters currently present:
+ To see the list of filters currently present::
- ethtool <-u|-n|--show-nfc|--show-ntuple> <ethX>
+ ethtool <-u|-n|--show-nfc|--show-ntuple> <ethX>
- Rules may be deleted from the table itself. This is done using:
+ Rules may be deleted from the table itself. This is done using::
- sudo ethtool <-N|-U|--config-nfc|--config-ntuple> <ethX> delete <loc>
+ sudo ethtool <-N|-U|--config-nfc|--config-ntuple> <ethX> delete <loc>
- loc is the rule number to be deleted.
@@ -316,34 +352,37 @@ Supported ethtool options
case, any flow that matches the filter criteria will be directed to the
appropriate queue. RX filters are supported on all kernels 2.6.30 and later.
- RSS for UDP
- ---------------------------------
+RSS for UDP
+-----------
+
Currently, the NIC does not support RSS for fragmented IP packets, so RSS
does not work correctly for fragmented UDP traffic. To disable RSS for UDP, the
RX Flow L3/L4 rule may be used.
- Example:
- ethtool -N eth0 flow-type udp4 action 0 loc 32
+ Example::
+
+ ethtool -N eth0 flow-type udp4 action 0 loc 32
+
+UDP GSO hardware offload
+------------------------
- UDP GSO hardware offload
- ---------------------------------
UDP GSO boosts UDP Tx rates by offloading UDP header allocation
into hardware. A special userspace socket option is required for this,
- could be validated with /kernel/tools/testing/selftests/net/
+ which can be validated with /kernel/tools/testing/selftests/net/::
udpgso_bench_tx -u -4 -D 10.0.1.1 -s 6300 -S 100
This sends 100-byte UDP packets formed from a single
6300-byte user buffer.
- UDP GSO is configured by:
+ UDP GSO is configured by::
ethtool -K eth0 tx-udp-segmentation on
- Private flags (testing)
- ---------------------------------
+Private flags (testing)
+-----------------------
- Atlantic driver supports private flags for hardware custom features:
+ Atlantic driver supports private flags for hardware custom features::
$ ethtool --show-priv-flags ethX
@@ -354,7 +393,7 @@ Supported ethtool options
PHYInternalLoopback: off
PHYExternalLoopback: off
- Example:
+ Example::
$ ethtool --set-priv-flags ethX DMASystemLoopback on
@@ -370,93 +409,130 @@ Command Line Parameters
The following command line parameters are available on atlantic driver:
aq_itr -Interrupt throttling mode
-----------------------------------------
+---------------------------------
Accepted values: 0, 1, 0xFFFF
+
Default value: 0xFFFF
-0 - Disable interrupt throttling.
-1 - Enable interrupt throttling and use specified tx and rx rates.
-0xFFFF - Auto throttling mode. Driver will choose the best RX and TX
- interrupt throtting settings based on link speed.
+
+====== ==============================================================
+0 Disable interrupt throttling.
+1 Enable interrupt throttling and use specified tx and rx rates.
+0xFFFF Auto throttling mode. Driver will choose the best RX and TX
+ interrupt throttling settings based on link speed.
+====== ==============================================================
aq_itr_tx - TX interrupt throttle rate
-----------------------------------------
+--------------------------------------
+
Accepted values: 0 - 0x1FF
+
Default value: 0
+
TX side throttling in microseconds. The adapter sets the maximum interrupt delay
to this value. The minimum interrupt delay is half of this value.
aq_itr_rx - RX interrupt throttle rate
-----------------------------------------
+--------------------------------------
+
Accepted values: 0 - 0x1FF
+
Default value: 0
+
RX side throttling in microseconds. The adapter sets the maximum interrupt delay
to this value. The minimum interrupt delay is half of this value.
-Note: ITR settings could be changed in runtime by ethtool -c means (see below)
+.. note::
+
+ ITR settings can be changed at runtime with ``ethtool -C`` (see above).
Config file parameters
-=======================
+======================
+
For some fine tuning and performance optimizations,
some parameters can be changed in the {source_dir}/aq_cfg.h file.
AQ_CFG_RX_PAGEORDER
-----------------------------------------
+-------------------
+
Default value: 0
+
RX page order override. This is the power-of-2 number of RX pages allocated for
-each descriptor. Received descriptor size is still limited by AQ_CFG_RX_FRAME_MAX.
+each descriptor. Received descriptor size is still limited by
+AQ_CFG_RX_FRAME_MAX.
+
Increasing the page order improves page reuse (relevant on IOMMU-enabled systems).
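As a quick illustration of what a page order means, the sketch below converts an order into a page count and a byte size. The 4096-byte page size is an assumption for the example (it depends on the architecture), and the helper names are hypothetical.

```python
ASSUMED_PAGE_SIZE = 4096  # bytes; architecture dependent, assumed here


def pages_for_order(order: int) -> int:
    """Number of pages allocated per descriptor for a given page order."""
    if order < 0:
        raise ValueError("page order must be non-negative")
    return 1 << order


def bytes_for_order(order: int, page_size: int = ASSUMED_PAGE_SIZE) -> int:
    """Total bytes allocated per descriptor for a given page order."""
    return pages_for_order(order) * page_size
```

With the default order of 0 each descriptor gets a single page. An order of 2 would give 4 pages, or 16384 bytes under the assumed page size.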
AQ_CFG_RX_REFILL_THRES
-----------------------------------------
+----------------------
+
Default value: 32
+
RX refill threshold. RX path will not refill freed descriptors until the
specified number of free descriptors is observed. Larger values may improve
page reuse but may also lead to packet drops.
AQ_CFG_VECS_DEF
-------------------------------------------------------------
+---------------
+
Number of queues
+
Valid Range: 0 - 8 (up to AQ_CFG_VECS_MAX)
+
Default value: 8
+
Notice this value will be capped by the number of cores available on the system.
AQ_CFG_IS_RSS_DEF
-------------------------------------------------------------
+-----------------
+
Enable/disable Receive Side Scaling
This feature allows the adapter to distribute receive processing
across multiple CPU cores and to avoid overloading a single CPU core.
Valid values
-0 - disabled
-1 - enabled
+
+== ========
+0 disabled
+1 enabled
+== ========
Default value: 1
AQ_CFG_NUM_RSS_QUEUES_DEF
-------------------------------------------------------------
+-------------------------
+
Number of queues for Receive Side Scaling
+
Valid Range: 0 - 8 (up to AQ_CFG_VECS_DEF)
Default value: AQ_CFG_VECS_DEF
AQ_CFG_IS_LRO_DEF
-------------------------------------------------------------
+-----------------
+
Enable/disable Large Receive Offload
This offload enables the adapter to coalesce multiple TCP segments and indicate
them as a single coalesced unit to the OS networking subsystem.
-The system consumes less energy but it also introduces more latency in packets processing.
+
+The system consumes less energy, but this also introduces more latency in packet
+processing.
Valid values
-0 - disabled
-1 - enabled
+
+== ========
+0 disabled
+1 enabled
+== ========
Default value: 1
AQ_CFG_TX_CLEAN_BUDGET
-----------------------------------------
+----------------------
+
Maximum descriptors to cleanup on TX at once.
+
Default value: 256
After the aq_cfg.h file is changed, the driver must be rebuilt for the change to take effect.
@@ -472,7 +548,8 @@ License
=======
aQuantia Corporation Network Driver
-Copyright(c) 2014 - 2019 aQuantia Corporation.
+
+Copyright |copy| 2014 - 2019 aQuantia Corporation.
This program is free software; you can redistribute it and/or modify it
under the terms and conditions of the GNU General Public License,
diff --git a/Documentation/networking/device_drivers/chelsio/cxgb.txt b/Documentation/networking/device_drivers/ethernet/chelsio/cxgb.rst
index 20a887615c4a..435dce5fa2c7 100644
--- a/Documentation/networking/device_drivers/chelsio/cxgb.txt
+++ b/Documentation/networking/device_drivers/ethernet/chelsio/cxgb.rst
@@ -1,13 +1,18 @@
- Chelsio N210 10Gb Ethernet Network Controller
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
- Driver Release Notes for Linux
+=============================================
+Chelsio N210 10Gb Ethernet Network Controller
+=============================================
- Version 2.1.1
+Driver Release Notes for Linux
- June 20, 2005
+Version 2.1.1
+
+June 20, 2005
+
+.. Contents
-CONTENTS
-========
INTRODUCTION
FEATURES
PERFORMANCE
@@ -16,7 +21,7 @@ CONTENTS
SUPPORT
-INTRODUCTION
+Introduction
============
This document describes the Linux driver for Chelsio 10Gb Ethernet Network
@@ -24,11 +29,11 @@ INTRODUCTION
compatible with the Chelsio N110 model 10Gb NICs.
-FEATURES
+Features
========
- Adaptive Interrupts (adaptive-rx)
- ---------------------------------
+Adaptive Interrupts (adaptive-rx)
+---------------------------------
This feature provides an adaptive algorithm that adjusts the interrupt
coalescing parameters, allowing the driver to dynamically adapt the latency
@@ -39,24 +44,24 @@ FEATURES
ethtool manpage for additional usage information.
By default, adaptive-rx is disabled.
- To enable adaptive-rx:
+ To enable adaptive-rx::
ethtool -C <interface> adaptive-rx on
- To disable adaptive-rx, use ethtool:
+ To disable adaptive-rx, use ethtool::
ethtool -C <interface> adaptive-rx off
After disabling adaptive-rx, the timer latency value will be set to 50us.
- You may set the timer latency after disabling adaptive-rx:
+ You may set the timer latency after disabling adaptive-rx::
ethtool -C <interface> rx-usecs <microseconds>
- An example to set the timer latency value to 100us on eth0:
+ An example to set the timer latency value to 100us on eth0::
ethtool -C eth0 rx-usecs 100
- You may also provide a timer latency value while disabling adaptive-rx:
+ You may also provide a timer latency value while disabling adaptive-rx::
ethtool -C <interface> adaptive-rx off rx-usecs <microseconds>
@@ -64,13 +69,13 @@ FEATURES
will be set to the specified value until changed by the user or until
adaptive-rx is enabled.
- To view the status of the adaptive-rx and timer latency values:
+ To view the status of the adaptive-rx and timer latency values::
ethtool -c <interface>
- TCP Segmentation Offloading (TSO) Support
- -----------------------------------------
+TCP Segmentation Offloading (TSO) Support
+-----------------------------------------
This feature, also known as "large send", enables a system's protocol stack
to offload portions of outbound TCP processing to a network interface card
@@ -80,20 +85,20 @@ FEATURES
Please see the ethtool manpage for additional usage information.
By default, TSO is enabled.
- To disable TSO:
+ To disable TSO::
ethtool -K <interface> tso off
- To enable TSO:
+ To enable TSO::
ethtool -K <interface> tso on
- To view the status of TSO:
+ To view the status of TSO::
ethtool -k <interface>
-PERFORMANCE
+Performance
===========
The following information is provided as an example of how to change system
@@ -111,59 +116,81 @@ PERFORMANCE
your system. You may want to write a script that runs at boot-up which
includes the optimal settings for your system.
- Setting PCI Latency Timer:
- setpci -d 1425:* 0x0c.l=0x0000F800
+ Setting PCI Latency Timer::
+
+ setpci -d 1425:* 0x0c.l=0x0000F800
+
+ Disabling TCP timestamp::
- Disabling TCP timestamp:
sysctl -w net.ipv4.tcp_timestamps=0
- Disabling SACK:
+ Disabling SACK::
+
sysctl -w net.ipv4.tcp_sack=0
- Setting large number of incoming connection requests:
+ Setting large number of incoming connection requests::
+
sysctl -w net.ipv4.tcp_max_syn_backlog=3000
- Setting maximum receive socket buffer size:
+ Setting maximum receive socket buffer size::
+
sysctl -w net.core.rmem_max=1024000
- Setting maximum send socket buffer size:
+ Setting maximum send socket buffer size::
+
sysctl -w net.core.wmem_max=1024000
- Set smp_affinity (on a multiprocessor system) to a single CPU:
+ Set smp_affinity (on a multiprocessor system) to a single CPU::
+
echo 1 > /proc/irq/<interrupt_number>/smp_affinity
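The value written to ``smp_affinity`` is a hexadecimal CPU bitmask, so the ``1`` above selects CPU 0. A minimal sketch of how the mask for another single CPU could be derived is shown below. The function name ``affinity_mask_for_cpu`` is made up for this example.

```python
def affinity_mask_for_cpu(cpu: int) -> str:
    """Return the hex smp_affinity mask that selects exactly one CPU."""
    if cpu < 0:
        raise ValueError("CPU index must be non-negative")
    # Bit N of the mask corresponds to CPU N.
    return format(1 << cpu, "x")
```

CPU 0 gives ``1`` as in the command above, CPU 3 gives ``8`` and CPU 4 gives ``10``.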
- Setting default receive socket buffer size:
+ Setting default receive socket buffer size::
+
sysctl -w net.core.rmem_default=524287
- Setting default send socket buffer size:
+ Setting default send socket buffer size::
+
sysctl -w net.core.wmem_default=524287
- Setting maximum option memory buffers:
+ Setting maximum option memory buffers::
+
sysctl -w net.core.optmem_max=524287
- Setting maximum backlog (# of unprocessed packets before kernel drops):
+ Setting maximum backlog (# of unprocessed packets before kernel drops)::
+
sysctl -w net.core.netdev_max_backlog=300000
- Setting TCP read buffers (min/default/max):
+ Setting TCP read buffers (min/default/max)::
+
sysctl -w net.ipv4.tcp_rmem="10000000 10000000 10000000"
- Setting TCP write buffers (min/pressure/max):
+ Setting TCP write buffers (min/default/max)::
+
sysctl -w net.ipv4.tcp_wmem="10000000 10000000 10000000"
- Setting TCP buffer space (min/pressure/max):
+ Setting TCP buffer space (min/pressure/max)::
+
sysctl -w net.ipv4.tcp_mem="10000000 10000000 10000000"
TCP window size for single connections:
+
The receive buffer (RX_WINDOW) size must be at least as large as the
Bandwidth-Delay Product of the communication link between the sender and
receiver. Due to the variations of RTT, you may want to increase the buffer
size up to 2 times the Bandwidth-Delay Product. Reference page 289 of
"TCP/IP Illustrated, Volume 1, The Protocols" by W. Richard Stevens.
- At 10Gb speeds, use the following formula:
+
+ At 10Gb speeds, use the following formula::
+
RX_WINDOW >= 1.25MBytes * RTT(in milliseconds)
Example for RTT with 100us: RX_WINDOW = (1,250,000 * 0.1) = 125,000
+
RX_WINDOW sizes of 256KB - 512KB should be sufficient.
- Setting the min, max, and default receive buffer (RX_WINDOW) size:
+
+ Setting the min, max, and default receive buffer (RX_WINDOW) size::
+
sysctl -w net.ipv4.tcp_rmem="<min> <default> <max>"
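The single-connection sizing rule above is a Bandwidth-Delay Product calculation, and it can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function names are invented, the RTT is taken in microseconds to keep the arithmetic in integers, and the 10 Gb/s default mirrors the formula in the text.

```python
TEN_GBPS = 10_000_000_000  # link speed in bits per second


def min_rx_window_bytes(rtt_us: int, link_bps: int = TEN_GBPS) -> int:
    """Bandwidth-Delay Product in bytes for a given round-trip time."""
    if rtt_us < 0 or link_bps < 0:
        raise ValueError("RTT and link speed must be non-negative")
    # bits/s * seconds / 8 bits per byte, with RTT given in microseconds
    return link_bps * rtt_us // (8 * 1_000_000)


def rx_window_range_bytes(rtt_us: int, link_bps: int = TEN_GBPS) -> tuple:
    """Lower bound (1x BDP) and upper suggestion (2x BDP) from the text."""
    low = min_rx_window_bytes(rtt_us, link_bps)
    return low, 2 * low
```

For the 100us example this yields 125,000 bytes as the minimum and 250,000 bytes when doubling to allow for RTT variation. A 1 ms RTT gives the 1.25 MBytes per millisecond figure used in the formula.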
TCP window size for multiple connections:
@@ -174,30 +201,35 @@ PERFORMANCE
not supported on the machine. Experimentation may be necessary to attain
the correct value. This method is provided as a starting point for the
correct receive buffer size.
+
Setting the min, max, and default receive buffer (RX_WINDOW) size is
performed in the same manner as single connection.
-DRIVER MESSAGES
+Driver Messages
===============
The following messages are the most common messages logged by syslog. These
may be found in /var/log/messages.
- Driver up:
+ Driver up::
+
Chelsio Network Driver - version 2.1.1
- NIC detected:
+ NIC detected::
+
eth#: Chelsio N210 1x10GBaseX NIC (rev #), PCIX 133MHz/64-bit
- Link up:
+ Link up::
+
eth#: link is up at 10 Gbps, full duplex
- Link down:
+ Link down::
+
eth#: link is down
-KNOWN ISSUES
+Known Issues
============
These issues have been identified during testing. The following information
@@ -214,27 +246,33 @@ KNOWN ISSUES
To eliminate the TCP retransmits, set smp_affinity on the particular
interrupt to a single CPU. You can locate the interrupt (IRQ) used on
- the N110/N210 by using ifconfig:
- ifconfig <dev_name> | grep Interrupt
- Set the smp_affinity to a single CPU:
- echo 1 > /proc/irq/<interrupt_number>/smp_affinity
+ the N110/N210 by using ifconfig::
+
+ ifconfig <dev_name> | grep Interrupt
+
+ Set the smp_affinity to a single CPU::
+
+ echo 1 > /proc/irq/<interrupt_number>/smp_affinity
It is highly suggested that you do not run the irqbalance daemon on your
system, as this will change any smp_affinity setting you have applied.
The irqbalance daemon runs on a 10 second interval and binds interrupts
- to the least loaded CPU determined by the daemon. To disable this daemon:
- chkconfig --level 2345 irqbalance off
+ to the least loaded CPU determined by the daemon. To disable this daemon::
+
+ chkconfig --level 2345 irqbalance off
By default, some Linux distributions enable the kernel feature,
irqbalance, which performs the same function as the daemon. To disable
- this feature, add the following line to your bootloader:
- noirqbalance
+ this feature, add the following line to your bootloader::
+
+ noirqbalance
+
+ Example using the Grub bootloader::
- Example using the Grub bootloader:
- title Red Hat Enterprise Linux AS (2.4.21-27.ELsmp)
- root (hd0,0)
- kernel /vmlinuz-2.4.21-27.ELsmp ro root=/dev/hda3 noirqbalance
- initrd /initrd-2.4.21-27.ELsmp.img
+ title Red Hat Enterprise Linux AS (2.4.21-27.ELsmp)
+ root (hd0,0)
+ kernel /vmlinuz-2.4.21-27.ELsmp ro root=/dev/hda3 noirqbalance
+ initrd /initrd-2.4.21-27.ELsmp.img
2. After running insmod, the driver is loaded and the incorrect network
interface is brought up without running ifup.
@@ -277,12 +315,13 @@ KNOWN ISSUES
AMD provides three workarounds for this problem; however, Chelsio
recommends the first option for best performance with this bug:
- For 133Mhz secondary bus operation, limit the transaction length and
- the number of outstanding transactions, via BIOS configuration
- programming of the PCI-X card, to the following:
+ For 133MHz secondary bus operation, limit the transaction length and
+ the number of outstanding transactions, via BIOS configuration
+ programming of the PCI-X card, to the following:
- Data Length (bytes): 1k
- Total allowed outstanding transactions: 2
+ Data Length (bytes): 1k
+
+ Total allowed outstanding transactions: 2
Please refer to AMD 8131-HT/PCI-X Errata 26310 Rev 3.08 August 2004,
section 56, "133-MHz Mode Split Completion Data Corruption" for more
@@ -293,8 +332,10 @@ KNOWN ISSUES
have issues with these settings, please revert to the "safe" settings
and duplicate the problem before submitting a bug or asking for support.
- NOTE: The default setting on most systems is 8 outstanding transactions
- and 2k bytes data length.
+ .. note::
+
+ The default setting on most systems is 8 outstanding transactions
+ and 2k bytes data length.
4. On multiprocessor systems, it has been noted that an application which
is handling 10Gb networking can switch between CPUs causing degraded
@@ -320,14 +361,16 @@ KNOWN ISSUES
particular CPU: runon 0 ifup eth0
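The smp_affinity value written under known issue 1 is a hexadecimal CPU bitmask, which is why ``echo 1`` pins the interrupt to CPU 0. As a minimal sketch (the helper names ``affinity_mask`` and ``affinity_path`` are hypothetical and not part of the driver or its documentation), the mask for other CPUs could be computed like this:

```python
def affinity_mask(cpus):
    """Return the hex bitmask string for an iterable of CPU indices.

    Bit N of the mask selects CPU N, so CPU 0 alone gives "1".
    """
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU index must be non-negative")
        mask |= 1 << cpu
    return format(mask, "x")


def affinity_path(irq):
    """Build the procfs path that accepts the mask for a given IRQ number."""
    return f"/proc/irq/{irq}/smp_affinity"


if __name__ == "__main__":
    # Hypothetical IRQ number, for illustration only.
    print(f"echo {affinity_mask([0])} > {affinity_path(24)}")
```

For the single-CPU binding the text recommends, pass exactly one CPU index.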
-SUPPORT
+Support
=======
If you have problems with the software or hardware, please contact our
customer support team via email at support@chelsio.com or check our website
at http://www.chelsio.com
-===============================================================================
+-------------------------------------------------------------------------------
+
+::
Chelsio Communications
370 San Aleso Ave.
@@ -343,10 +386,8 @@ You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
-THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED
+THIS SOFTWARE IS PROVIDED ``AS IS`` AND WITHOUT ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
- Copyright (c) 2003-2005 Chelsio Communications. All rights reserved.
-
-===============================================================================
+Copyright |copy| 2003-2005 Chelsio Communications. All rights reserved.
diff --git a/Documentation/networking/device_drivers/cirrus/cs89x0.txt b/Documentation/networking/device_drivers/ethernet/cirrus/cs89x0.rst
index 0e190180eec8..e5c283940ac5 100644
--- a/Documentation/networking/device_drivers/cirrus/cs89x0.txt
+++ b/Documentation/networking/device_drivers/ethernet/cirrus/cs89x0.rst
@@ -1,79 +1,84 @@
+.. SPDX-License-Identifier: GPL-2.0
-NOTE
-----
+================================================
+Cirrus Logic LAN CS8900/CS8920 Ethernet Adapters
+================================================
-This document was contributed by Cirrus Logic for kernel 2.2.5. This version
-has been updated for 2.3.48 by Andrew Morton.
+.. note::
+
+ This document was contributed by Cirrus Logic for kernel 2.2.5. This version
+ has been updated for 2.3.48 by Andrew Morton.
+
+ Still, this is too outdated! A major cleanup is needed here.
Cirrus make a copy of this driver available at their website, as
described below. In general, you should use the driver version which
comes with your Linux distribution.
-
-CIRRUS LOGIC LAN CS8900/CS8920 ETHERNET ADAPTERS
Linux Network Interface Driver ver. 2.00 <kernel 2.3.48>
-===============================================================================
-
-
-TABLE OF CONTENTS
-
-1.0 CIRRUS LOGIC LAN CS8900/CS8920 ETHERNET ADAPTERS
- 1.1 Product Overview
- 1.2 Driver Description
- 1.2.1 Driver Name
- 1.2.2 File in the Driver Package
- 1.3 System Requirements
- 1.4 Licensing Information
-
-2.0 ADAPTER INSTALLATION and CONFIGURATION
- 2.1 CS8900-based Adapter Configuration
- 2.2 CS8920-based Adapter Configuration
-
-3.0 LOADING THE DRIVER AS A MODULE
-
-4.0 COMPILING THE DRIVER
- 4.1 Compiling the Driver as a Loadable Module
- 4.2 Compiling the driver to support memory mode
- 4.3 Compiling the driver to support Rx DMA
-
-5.0 TESTING AND TROUBLESHOOTING
- 5.1 Known Defects and Limitations
- 5.2 Testing the Adapter
- 5.2.1 Diagnostic Self-Test
- 5.2.2 Diagnostic Network Test
- 5.3 Using the Adapter's LEDs
- 5.4 Resolving I/O Conflicts
-
-6.0 TECHNICAL SUPPORT
- 6.1 Contacting Cirrus Logic's Technical Support
- 6.2 Information Required Before Contacting Technical Support
- 6.3 Obtaining the Latest Driver Version
- 6.4 Current maintainer
- 6.5 Kernel boot parameters
-
-
-1.0 CIRRUS LOGIC LAN CS8900/CS8920 ETHERNET ADAPTERS
-===============================================================================
-
-
-1.1 PRODUCT OVERVIEW
-
-The CS8900-based ISA Ethernet Adapters from Cirrus Logic follow
-IEEE 802.3 standards and support half or full-duplex operation in ISA bus
-computers on 10 Mbps Ethernet networks. The adapters are designed for operation
-in 16-bit ISA or EISA bus expansion slots and are available in
-10BaseT-only or 3-media configurations (10BaseT, 10Base2, and AUI for 10Base-5
-or fiber networks).
-
-CS8920-based adapters are similar to the CS8900-based adapter with additional
-features for Plug and Play (PnP) support and Wakeup Frame recognition. As
-such, the configuration procedures differ somewhat between the two types of
-adapters. Refer to the "Adapter Configuration" section for details on
+
+
+.. TABLE OF CONTENTS
+
+ 1.0 CIRRUS LOGIC LAN CS8900/CS8920 ETHERNET ADAPTERS
+ 1.1 Product Overview
+ 1.2 Driver Description
+ 1.2.1 Driver Name
+ 1.2.2 File in the Driver Package
+ 1.3 System Requirements
+ 1.4 Licensing Information
+
+ 2.0 ADAPTER INSTALLATION and CONFIGURATION
+ 2.1 CS8900-based Adapter Configuration
+ 2.2 CS8920-based Adapter Configuration
+
+ 3.0 LOADING THE DRIVER AS A MODULE
+
+ 4.0 COMPILING THE DRIVER
+ 4.1 Compiling the Driver as a Loadable Module
+ 4.2 Compiling the driver to support memory mode
+ 4.3 Compiling the driver to support Rx DMA
+
+ 5.0 TESTING AND TROUBLESHOOTING
+ 5.1 Known Defects and Limitations
+ 5.2 Testing the Adapter
+ 5.2.1 Diagnostic Self-Test
+ 5.2.2 Diagnostic Network Test
+ 5.3 Using the Adapter's LEDs
+ 5.4 Resolving I/O Conflicts
+
+ 6.0 TECHNICAL SUPPORT
+ 6.1 Contacting Cirrus Logic's Technical Support
+ 6.2 Information Required Before Contacting Technical Support
+ 6.3 Obtaining the Latest Driver Version
+ 6.4 Current maintainer
+ 6.5 Kernel boot parameters
+
+
+1. Cirrus Logic LAN CS8900/CS8920 Ethernet Adapters
+===================================================
+
+
+1.1. Product Overview
+---------------------
+
+The CS8900-based ISA Ethernet Adapters from Cirrus Logic follow
+IEEE 802.3 standards and support half or full-duplex operation in ISA bus
+computers on 10 Mbps Ethernet networks. The adapters are designed for operation
+in 16-bit ISA or EISA bus expansion slots and are available in
+10BaseT-only or 3-media configurations (10BaseT, 10Base2, and AUI for 10Base-5
+or fiber networks).
+
+CS8920-based adapters are similar to the CS8900-based adapter with additional
+features for Plug and Play (PnP) support and Wakeup Frame recognition. As
+such, the configuration procedures differ somewhat between the two types of
+adapters. Refer to the "Adapter Configuration" section for details on
configuring both types of adapters.
-1.2 DRIVER DESCRIPTION
+1.2. Driver Description
+-----------------------
The CS8900/CS8920 Ethernet Adapter driver for Linux supports the Linux
v2.3.48 or greater kernel. It can be compiled directly into the kernel
@@ -85,22 +90,25 @@ or loaded at run-time as a device driver module.
The files in the driver at Cirrus' website include:
- readme.txt - this file
- build - batch file to compile cs89x0.c.
- cs89x0.c - driver C code
- cs89x0.h - driver header file
- cs89x0.o - pre-compiled module (for v2.2.5 kernel)
- config/Config.in - sample file to include cs89x0 driver in the kernel.
- config/Makefile - sample file to include cs89x0 driver in the kernel.
- config/Space.c - sample file to include cs89x0 driver in the kernel.
+ =================== ====================================================
+ readme.txt this file
+ build batch file to compile cs89x0.c.
+ cs89x0.c driver C code
+ cs89x0.h driver header file
+ cs89x0.o pre-compiled module (for v2.2.5 kernel)
+ config/Config.in sample file to include cs89x0 driver in the kernel.
+ config/Makefile sample file to include cs89x0 driver in the kernel.
+ config/Space.c sample file to include cs89x0 driver in the kernel.
+ =================== ====================================================
-1.3 SYSTEM REQUIREMENTS
+1.3. System Requirements
+------------------------
The following hardware is required:
- * Cirrus Logic LAN (CS8900/20-based) Ethernet ISA Adapter
+ * Cirrus Logic LAN (CS8900/20-based) Ethernet ISA Adapter
* IBM or IBM-compatible PC with:
* An 80386 or higher processor
@@ -118,20 +126,21 @@ The following software is required:
* LINUX kernel sources for your kernel (if compiling into kernel)
- * GNU Toolkit (gcc and make) v2.6 or above (if compiling into kernel
- or a module)
+ * GNU Toolkit (gcc and make) v2.6 or above (if compiling into kernel
+ or a module)
-1.4 LICENSING INFORMATION
+1.4. Licensing Information
+--------------------------
This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation, version 1.
This program is distributed in the hope that it will be useful, but WITHOUT
-ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
-FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.
For a full copy of the GNU General Public License, write to the Free Software
@@ -139,28 +148,29 @@ Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
-2.0 ADAPTER INSTALLATION and CONFIGURATION
-===============================================================================
+2. Adapter Installation and Configuration
+=========================================
-Both the CS8900 and CS8920-based adapters can be configured using parameters
-stored in an on-board EEPROM. You must use the DOS-based CS8900/20 Setup
-Utility if you want to change the adapter's configuration in EEPROM.
+Both the CS8900 and CS8920-based adapters can be configured using parameters
+stored in an on-board EEPROM. You must use the DOS-based CS8900/20 Setup
+Utility if you want to change the adapter's configuration in EEPROM.
-When loading the driver as a module, you can specify many of the adapter's
-configuration parameters on the command-line to override the EEPROM's settings
-or for interface configuration when an EEPROM is not used. (CS8920-based
+When loading the driver as a module, you can specify many of the adapter's
+configuration parameters on the command-line to override the EEPROM's settings
+or for interface configuration when an EEPROM is not used. (CS8920-based
adapters must use an EEPROM.) See Section 3.0 LOADING THE DRIVER AS A MODULE.
-Since the CS8900/20 Setup Utility is a DOS-based application, you must install
-and configure the adapter in a DOS-based system using the CS8900/20 Setup
-Utility before installation in the target LINUX system. (Not required if
+Since the CS8900/20 Setup Utility is a DOS-based application, you must install
+and configure the adapter in a DOS-based system using the CS8900/20 Setup
+Utility before installation in the target LINUX system. (Not required if
installing a CS8900-based adapter and the default configuration is acceptable.)
-
-2.1 CS8900-BASED ADAPTER CONFIGURATION
-CS8900-based adapters shipped from Cirrus Logic have been configured
-with the following "default" settings:
+2.1. CS8900-based Adapter Configuration
+---------------------------------------
+
+CS8900-based adapters shipped from Cirrus Logic have been configured
+with the following "default" settings::
Operation Mode: Memory Mode
IRQ: 10
@@ -169,15 +179,16 @@ with the following "default" settings:
Optimization: DOS Client
Transmission Mode: Half-duplex
BootProm: None
- Media Type: Autodetect (3-media cards) or
- 10BASE-T (10BASE-T only adapter)
+ Media Type: Autodetect (3-media cards) or
+ 10BASE-T (10BASE-T only adapter)
-You should only change the default configuration settings if conflicts with
-another adapter exists. To change the adapter's configuration, run the
-CS8900/20 Setup Utility.
+You should only change the default configuration settings if a conflict with
+another adapter exists. To change the adapter's configuration, run the
+CS8900/20 Setup Utility.
-2.2 CS8920-BASED ADAPTER CONFIGURATION
+2.2. CS8920-based Adapter Configuration
+---------------------------------------
CS8920-based adapters are shipped from Cirrus Logic configured as Plug
and Play (PnP) enabled. However, since the cs89x0 driver does NOT
@@ -185,82 +196,83 @@ support PnP, you must install the CS8920 adapter in a DOS-based PC and
run the CS8900/20 Setup Utility to disable PnP and configure the
adapter before installation in the target Linux system. Failure to do
this will leave the adapter inactive and the driver will be unable to
-communicate with the adapter.
+communicate with the adapter.
+::
- ****************************************************************
- * CS8920-BASED ADAPTERS: *
- * *
- * CS8920-BASED ADAPTERS ARE PLUG and PLAY ENABLED BY DEFAULT. *
- * THE CS89X0 DRIVER DOES NOT SUPPORT PnP. THEREFORE, YOU MUST *
- * RUN THE CS8900/20 SETUP UTILITY TO DISABLE PnP SUPPORT AND *
- * TO ACTIVATE THE ADAPTER. *
- ****************************************************************
+ ****************************************************************
+ * CS8920-BASED ADAPTERS: *
+ * *
+ * CS8920-BASED ADAPTERS ARE PLUG and PLAY ENABLED BY DEFAULT. *
+ * THE CS89X0 DRIVER DOES NOT SUPPORT PnP. THEREFORE, YOU MUST *
+ * RUN THE CS8900/20 SETUP UTILITY TO DISABLE PnP SUPPORT AND *
+ * TO ACTIVATE THE ADAPTER. *
+ ****************************************************************
-3.0 LOADING THE DRIVER AS A MODULE
-===============================================================================
+3. Loading the Driver as a Module
+=================================
If the driver is compiled as a loadable module, you can load the driver module
-with the 'modprobe' command. Many of the adapter's configuration parameters can
-be specified as command-line arguments to the load command. This facility
-provides a means to override the EEPROM's settings or for interface
+with the 'modprobe' command. Many of the adapter's configuration parameters can
+be specified as command-line arguments to the load command. This facility
+provides a means to override the EEPROM's settings or for interface
configuration when an EEPROM is not used.
-Example:
+Example::
insmod cs89x0.o io=0x200 irq=0xA media=aui
This example loads the module and configures the adapter to use an IO port base
address of 200h, interrupt 10, and use the AUI media connection. The following
-configuration options are available on the command line:
-
-* io=### - specify IO address (200h-360h)
-* irq=## - specify interrupt level
-* use_dma=1 - Enable DMA
-* dma=# - specify dma channel (Driver is compiled to support
- Rx DMA only)
-* dmasize=# (16 or 64) - DMA size 16K or 64K. Default value is set to 16.
-* media=rj45 - specify media type
+configuration options are available on the command line::
+
+ io=### - specify IO address (200h-360h)
+ irq=## - specify interrupt level
+ use_dma=1 - Enable DMA
+ dma=# - specify dma channel (Driver is compiled to support
+ Rx DMA only)
+ dmasize=# (16 or 64) - DMA size 16K or 64K. Default value is set to 16.
+ media=rj45 - specify media type
or media=bnc
or media=aui
or media=auto
-* duplex=full - specify forced half/full/autonegotiate duplex
+ duplex=full - specify forced half/full/autonegotiate duplex
or duplex=half
or duplex=auto
-* debug=# - debug level (only available if the driver was compiled
- for debugging)
+ debug=# - debug level (only available if the driver was compiled
+ for debugging)
-NOTES:
+**Notes:**
a) If an EEPROM is present, any specified command-line parameter
will override the corresponding configuration value stored in
EEPROM.
-b) The "io" parameter must be specified on the command-line.
+b) The "io" parameter must be specified on the command-line.
c) The driver's hardware probe routine is designed to avoid
writing to I/O space until it knows that there is a cs89x0
card at the written addresses. This could cause problems
with device probing. To avoid this behaviour, add one
- to the `io=' module parameter. This doesn't actually change
+ to the ``io=`` module parameter. This doesn't actually change
the I/O address, but it is a flag to tell the driver
to partially initialise the hardware before trying to
identify the card. This could be dangerous if you are
not sure that there is a cs89x0 card at the provided address.
For example, to scan for an adapter located at IO base 0x300,
- specify an IO address of 0x301.
+ specify an IO address of 0x301.
d) The "duplex=auto" parameter is only supported for the CS8920.
e) The minimum command-line configuration required if an EEPROM is
not present is:
- io
- irq
+ io
+ irq
media type (no autodetect)
f) The following additional parameters are CS89XX defaults (values
@@ -282,13 +294,13 @@ h) Many Linux distributions use the 'modprobe' command to load
module when it is loaded. All the configuration options which are
described above may be placed within /etc/conf.modules.
- For example:
+ For example::
- > cat /etc/conf.modules
- ...
- alias eth0 cs89x0
- options cs89x0 io=0x0200 dma=5 use_dma=1
- ...
+ > cat /etc/conf.modules
+ ...
+ alias eth0 cs89x0
+ options cs89x0 io=0x0200 dma=5 use_dma=1
+ ...
In this example we are telling the module system that the
ethernet driver for this machine should use the cs89x0 driver. We
@@ -305,9 +317,9 @@ j) The cs89x0 supports DMA for receiving only. DMA mode is
k) If your Linux kernel was compiled with inbuilt plug-and-play
support you will be able to find information about the cs89x0 card
- with the command
+ with the command::
- cat /proc/isapnp
+ cat /proc/isapnp
l) If during DMA operation you find erratic behavior or network data
corruption you should use your PC's BIOS to slow the EISA bus clock.
@@ -321,11 +333,11 @@ n) If the cs89x0 driver is compiled directly into the kernel, DMA
mode may be selected by providing the kernel with a boot option
'cs89x0_dma=N' where 'N' is the desired DMA channel number (5, 6 or 7).
- Kernel boot options may be provided on the LILO command line:
+ Kernel boot options may be provided on the LILO command line::
LILO boot: linux cs89x0_dma=5
- or they may be placed in /etc/lilo.conf:
+ or they may be placed in /etc/lilo.conf::
image=/boot/bzImage-2.3.48
append="cs89x0_dma=5"
@@ -337,237 +349,246 @@ n) If the cs89x0 driver is compiled directly into the kernel, DMA
(64k mode is not available).
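Note c) above describes adding one to the ``io=`` value as a flag rather than a change of address. A minimal sketch of that reading, assuming the flag is simply the lowest bit of the value (the function name ``split_io_param`` is hypothetical and this is not the driver's actual code):

```python
def split_io_param(io):
    """Split an io= value into (base address, early-init flag).

    Assumption: the flag from note c) is the low bit, so 0x301 means
    "probe base 0x300 and partially initialise the hardware first".
    """
    base = io & ~1          # clear the flag bit to recover the base address
    early_init = bool(io & 1)
    return base, early_init


if __name__ == "__main__":
    base, early_init = split_io_param(0x301)
    print(hex(base), early_init)
```

This matches the example in the note, where an adapter at IO base 0x300 is scanned for by specifying 0x301.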
-4.0 COMPILING THE DRIVER
-===============================================================================
+4. Compiling the Driver
+=======================
The cs89x0 driver can be compiled directly into the kernel or compiled into
a loadable device driver module.
+Use the standard kernel configuration and build procedure to enable and compile the driver.
-4.1 COMPILING THE DRIVER AS A LOADABLE MODULE
-
-To compile the driver into a loadable module, use the following command
-(single command line, without quotes):
-
-"gcc -D__KERNEL__ -I/usr/src/linux/include -I/usr/src/linux/net/inet -Wall
--Wstrict-prototypes -O2 -fomit-frame-pointer -DMODULE -DCONFIG_MODVERSIONS
--c cs89x0.c"
-
-4.2 COMPILING THE DRIVER TO SUPPORT MEMORY MODE
-
-Support for memory mode was not carried over into the 2.3 series kernels.
-4.3 COMPILING THE DRIVER TO SUPPORT Rx DMA
+4.1. Compiling the Driver to Support Rx DMA
+-------------------------------------------
The compile-time optionality for DMA was removed in the 2.3 kernel
series. DMA support is now unconditionally part of the driver. It is
enabled by the 'use_dma=1' module option.
-5.0 TESTING AND TROUBLESHOOTING
-===============================================================================
+5. Testing and Troubleshooting
+==============================
-5.1 KNOWN DEFECTS and LIMITATIONS
+5.1. Known Defects and Limitations
+----------------------------------
-Refer to the RELEASE.TXT file distributed as part of this archive for a list of
+Refer to the RELEASE.TXT file distributed as part of this archive for a list of
known defects, driver limitations, and workarounds.
-5.2 TESTING THE ADAPTER
+5.2. Testing the Adapter
+------------------------
-Once the adapter has been installed and configured, the diagnostic option of
-the CS8900/20 Setup Utility can be used to test the functionality of the
+Once the adapter has been installed and configured, the diagnostic option of
+the CS8900/20 Setup Utility can be used to test the functionality of the
adapter and its network connection. Use the diagnostics 'Self Test' option to
test the functionality of the adapter with the hardware configuration you have
assigned. You can use the diagnostics 'Network Test' to test the ability of the
-adapter to communicate across the Ethernet with another PC equipped with a
-CS8900/20-based adapter card (it must also be running the CS8900/20 Setup
+adapter to communicate across the Ethernet with another PC equipped with a
+CS8900/20-based adapter card (it must also be running the CS8900/20 Setup
Utility).
- NOTE: The Setup Utility's diagnostics are designed to run in a
- DOS-only operating system environment. DO NOT run the diagnostics
- from a DOS or command prompt session under Windows 95, Windows NT,
- OS/2, or other operating system.
+.. note::
+
+ The Setup Utility's diagnostics are designed to run in a
+ DOS-only operating system environment. DO NOT run the diagnostics
+ from a DOS or command prompt session under Windows 95, Windows NT,
+ OS/2, or other operating system.
To run the diagnostics tests on the CS8900/20 adapter:
- 1.) Boot DOS on the PC and start the CS8900/20 Setup Utility.
+ 1. Boot DOS on the PC and start the CS8900/20 Setup Utility.
- 2.) The adapter's current configuration is displayed. Hit the ENTER key to
+ 2. The adapter's current configuration is displayed. Hit the ENTER key to
get to the main menu.
- 4.) Select 'Diagnostics' (ALT-G) from the main menu.
+ 3. Select 'Diagnostics' (ALT-G) from the main menu.
* Select 'Self-Test' to test the adapter's basic functionality.
* Select 'Network Test' to test the network connection and cabling.
-5.2.1 DIAGNOSTIC SELF-TEST
+5.2.1. Diagnostic Self-test
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
-The diagnostic self-test checks the adapter's basic functionality as well as
-its ability to communicate across the ISA bus based on the system resources
+The diagnostic self-test checks the adapter's basic functionality as well as
+its ability to communicate across the ISA bus based on the system resources
assigned during hardware configuration. The following tests are performed:
* IO Register Read/Write Test
- The IO Register Read/Write test insures that the CS8900/20 can be
+
+ The IO Register Read/Write test ensures that the CS8900/20 can be
accessed in IO mode, and that the IO base address is correct.
* Shared Memory Test
- The Shared Memory test insures the CS8900/20 can be accessed in memory
- mode and that the range of memory addresses assigned does not conflict
+
+ The Shared Memory test ensures the CS8900/20 can be accessed in memory
+ mode and that the range of memory addresses assigned does not conflict
with other devices in the system.
* Interrupt Test
+
The Interrupt test ensures there are no conflicts with the assigned IRQ
signal.
* EEPROM Test
+
The EEPROM test ensures the EEPROM can be read.
* Chip RAM Test
+
The Chip RAM test ensures the 4K of memory internal to the CS8900/20 is
working properly.
* Internal Loop-back Test
- The Internal Loop Back test insures the adapter's transmitter and
- receiver are operating properly. If this test fails, make sure the
- adapter's cable is connected to the network (check for LED activity for
+
+ The Internal Loop Back test ensures the adapter's transmitter and
+ receiver are operating properly. If this test fails, make sure the
+ adapter's cable is connected to the network (check for LED activity for
example).
* Boot PROM Test
+
The Boot PROM test ensures the Boot PROM is present and can be read.
Failure indicates the Boot PROM was not successfully read due to a
hardware problem or due to a conflict on the Boot PROM address
assignment. (Test only applies if the adapter is configured to use the
Boot PROM option.)
-Failure of a test item indicates a possible system resource conflict with
-another device on the ISA bus. In this case, you should use the Manual Setup
+Failure of a test item indicates a possible system resource conflict with
+another device on the ISA bus. In this case, you should use the Manual Setup
option to reconfigure the adapter by selecting a different value for the system
resource that failed.
-5.2.2 DIAGNOSTIC NETWORK TEST
+5.2.2. Diagnostic Network Test
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-The Diagnostic Network Test verifies a working network connection by
-transferring data between two CS8900/20 adapters installed in different PCs
-on the same network. (Note: the diagnostic network test should not be run
-between two nodes across a router.)
+The Diagnostic Network Test verifies a working network connection by
+transferring data between two CS8900/20 adapters installed in different PCs
+on the same network. (Note: the diagnostic network test should not be run
+between two nodes across a router.)
This test requires that each of the two PCs have a CS8900/20-based adapter
-installed and have the CS8900/20 Setup Utility running. The first PC is
-configured as a Responder and the other PC is configured as an Initiator.
-Once the Initiator is started, it sends data frames to the Responder which
+installed and have the CS8900/20 Setup Utility running. The first PC is
+configured as a Responder and the other PC is configured as an Initiator.
+Once the Initiator is started, it sends data frames to the Responder which
returns the frames to the Initiator.
-The total number of frames received and transmitted are displayed on the
-Initiator's display, along with a count of the number of frames received and
-transmitted OK or in error. The test can be terminated anytime by the user at
+The total number of frames received and transmitted are displayed on the
+Initiator's display, along with a count of the number of frames received and
+transmitted OK or in error. The test can be terminated anytime by the user at
either PC.
To set up the Diagnostic Network Test:
- 1.) Select a PC with a CS8900/20-based adapter and a known working network
- connection to act as the Responder. Run the CS8900/20 Setup Utility
- and select 'Diagnostics -> Network Test -> Responder' from the main
- menu. Hit ENTER to start the Responder.
+ 1. Select a PC with a CS8900/20-based adapter and a known working network
+ connection to act as the Responder. Run the CS8900/20 Setup Utility
+ and select 'Diagnostics -> Network Test -> Responder' from the main
+ menu. Hit ENTER to start the Responder.
- 2.) Return to the PC with the CS8900/20-based adapter you want to test and
- start the CS8900/20 Setup Utility.
+ 2. Return to the PC with the CS8900/20-based adapter you want to test and
+ start the CS8900/20 Setup Utility.
+
+ 3. From the main menu, Select 'Diagnostic -> Network Test -> Initiator'.
+ Hit ENTER to start the test.
- 3.) From the main menu, Select 'Diagnostic -> Network Test -> Initiator'.
- Hit ENTER to start the test.
-
You may stop the test on the Initiator at any time while allowing the Responder
-to continue running. In this manner, you can move to additional PCs and test
-them by starting the Initiator on another PC without having to stop/start the
+to continue running. In this manner, you can move to additional PCs and test
+them by starting the Initiator on another PC without having to stop/start the
Responder.
-
-5.3 USING THE ADAPTER'S LEDs
-The 2 and 3-media adapters have two LEDs visible on the back end of the board
-located near the 10Base-T connector.
+5.3. Using the Adapter's LEDs
+-----------------------------
+
+The 2 and 3-media adapters have two LEDs visible on the back end of the board
+located near the 10Base-T connector.
-Link Integrity LED: A "steady" ON of the green LED indicates a valid 10Base-T
+Link Integrity LED: A "steady" ON of the green LED indicates a valid 10Base-T
connection. (Only applies to 10Base-T. The green LED has no significance for
a 10Base-2 or AUI connection.)
-TX/RX LED: The yellow LED lights briefly each time the adapter transmits or
+TX/RX LED: The yellow LED lights briefly each time the adapter transmits or
receives data. (The yellow LED will appear to "flicker" on a typical network.)
-5.4 RESOLVING I/O CONFLICTS
+5.4. Resolving I/O Conflicts
+----------------------------
-An IO conflict occurs when two or more adapter use the same ISA resource (IO
-address, memory address or IRQ). You can usually detect an IO conflict in one
+An IO conflict occurs when two or more adapters use the same ISA resource (IO
+address, memory address or IRQ). You can usually detect an IO conflict in one
of four ways after installing and/or configuring the CS8900/20-based adapter:
- 1.) The system does not boot properly (or at all).
+ 1. The system does not boot properly (or at all).
- 2.) The driver cannot communicate with the adapter, reporting an "Adapter
- not found" error message.
+ 2. The driver cannot communicate with the adapter, reporting an "Adapter
+ not found" error message.
- 3.) You cannot connect to the network or the driver will not load.
+ 3. You cannot connect to the network or the driver will not load.
- 4.) If you have configured the adapter to run in memory mode but the driver
- reports it is using IO mode when loading, this is an indication of a
- memory address conflict.
+ 4. If you have configured the adapter to run in memory mode but the driver
+ reports it is using IO mode when loading, this is an indication of a
+ memory address conflict.
-If an IO conflict occurs, run the CS8900/20 Setup Utility and perform a
-diagnostic self-test. Normally, the ISA resource in conflict will fail the
-self-test. If so, reconfigure the adapter selecting another choice for the
-resource in conflict. Run the diagnostics again to check for further IO
+If an IO conflict occurs, run the CS8900/20 Setup Utility and perform a
+diagnostic self-test. Normally, the ISA resource in conflict will fail the
+self-test. If so, reconfigure the adapter selecting another choice for the
+resource in conflict. Run the diagnostics again to check for further IO
conflicts.
In some cases, such as when the PC will not boot, it may be necessary to remove
-the adapter and reconfigure it by installing it in another PC to run the
-CS8900/20 Setup Utility. Once reinstalled in the target system, run the
-diagnostics self-test to ensure the new configuration is free of conflicts
+the adapter and reconfigure it by installing it in another PC to run the
+CS8900/20 Setup Utility. Once reinstalled in the target system, run the
+diagnostics self-test to ensure the new configuration is free of conflicts
before loading the driver again.
-When manually configuring the adapter, keep in mind the typical ISA system
+When manually configuring the adapter, keep in mind the typical ISA system
resource usage as indicated in the tables below.
-I/O Address Device IRQ Device
------------ -------- --- --------
- 200-20F Game I/O adapter 3 COM2, Bus Mouse
- 230-23F Bus Mouse 4 COM1
- 270-27F LPT3: third parallel port 5 LPT2
- 2F0-2FF COM2: second serial port 6 Floppy Disk controller
- 320-32F Fixed disk controller 7 LPT1
- 8 Real-time Clock
- 9 EGA/VGA display adapter
- 12 Mouse (PS/2)
-Memory Address Device 13 Math Coprocessor
--------------- --------------------- 14 Hard Disk controller
-A000-BFFF EGA Graphics Adapter
-A000-C7FF VGA Graphics Adapter
-B000-BFFF Mono Graphics Adapter
-B800-BFFF Color Graphics Adapter
-E000-FFFF AT BIOS
+::
+ I/O Address Device IRQ Device
+ ----------- -------- --- --------
+ 200-20F Game I/O adapter 3 COM2, Bus Mouse
+ 230-23F Bus Mouse 4 COM1
+ 270-27F LPT3: third parallel port 5 LPT2
+ 2F0-2FF COM2: second serial port 6 Floppy Disk controller
+ 320-32F Fixed disk controller 7 LPT1
+ 8 Real-time Clock
+ 9 EGA/VGA display adapter
+ 12 Mouse (PS/2)
+ Memory Address Device 13 Math Coprocessor
+ -------------- --------------------- 14 Hard Disk controller
+ A000-BFFF EGA Graphics Adapter
+ A000-C7FF VGA Graphics Adapter
+ B000-BFFF Mono Graphics Adapter
+ B800-BFFF Color Graphics Adapter
+ E000-FFFF AT BIOS
-6.0 TECHNICAL SUPPORT
-===============================================================================
-6.1 CONTACTING CIRRUS LOGIC'S TECHNICAL SUPPORT
+6. Technical Support
+====================
-Cirrus Logic's CS89XX Technical Application Support can be reached at:
+6.1. Contacting Cirrus Logic's Technical Support
+------------------------------------------------
-Telephone :(800) 888-5016 (from inside U.S. and Canada)
- :(512) 442-7555 (from outside the U.S. and Canada)
-Fax :(512) 912-3871
-Email :ethernet@crystal.cirrus.com
-WWW :http://www.cirrus.com
+Cirrus Logic's CS89XX Technical Application Support can be reached at::
+ Telephone :(800) 888-5016 (from inside U.S. and Canada)
+ :(512) 442-7555 (from outside the U.S. and Canada)
+ Fax :(512) 912-3871
+ Email :ethernet@crystal.cirrus.com
+ WWW :http://www.cirrus.com
-6.2 INFORMATION REQUIRED BEFORE CONTACTING TECHNICAL SUPPORT
-Before contacting Cirrus Logic for technical support, be prepared to provide as
-Much of the following information as possible.
+6.2. Information Required before Contacting Technical Support
+-------------------------------------------------------------
+
+Before contacting Cirrus Logic for technical support, be prepared to provide as
+much of the following information as possible.
1.) Adapter type (CRD8900, CDB8900, CDB8920, etc.)
@@ -575,7 +596,7 @@ Much of the following information as possible.
* IO Base, Memory Base, IO or memory mode enabled, IRQ, DMA channel
* Plug and Play enabled/disabled (CS8920-based adapters only)
- * Configured for media auto-detect or specific media type (which type).
+ * Configured for media auto-detect or specific media type (which type).
3.) PC System's Configuration
@@ -590,35 +611,37 @@ Much of the following information as possible.
* CS89XX driver and version
* Your network operating system and version
- * Your system's OS version
+ * Your system's OS version
* Version of all protocol support files
5.) Any Error Message displayed.
-6.3 OBTAINING THE LATEST DRIVER VERSION
+6.3. Obtaining the Latest Driver Version
+----------------------------------------
-You can obtain the latest CS89XX drivers and support software from Cirrus Logic's
+You can obtain the latest CS89XX drivers and support software from Cirrus Logic's
Web site. You can also contact Cirrus Logic's Technical Support (email:
-ethernet@crystal.cirrus.com) and request that you be registered for automatic
+ethernet@crystal.cirrus.com) and request that you be registered for automatic
software-update notification.
Cirrus Logic maintains a web page at http://www.cirrus.com with the
latest drivers and technical publications.
-6.4 Current maintainer
+6.4. Current maintainer
+-----------------------
In February 2000 the maintenance of this driver was assumed by Andrew
Morton.
6.5 Kernel module parameters
+----------------------------
For use in embedded environments with no cs89x0 EEPROM, the kernel boot
-parameter `cs89x0_media=' has been implemented. Usage is:
+parameter ``cs89x0_media=`` has been implemented. Usage is::
cs89x0_media=rj45 or
cs89x0_media=aui or
cs89x0_media=bnc
-
diff --git a/Documentation/networking/device_drivers/davicom/dm9000.txt b/Documentation/networking/device_drivers/ethernet/davicom/dm9000.rst
index 5552e2e575c5..14eb0a4d4e4e 100644
--- a/Documentation/networking/device_drivers/davicom/dm9000.txt
+++ b/Documentation/networking/device_drivers/ethernet/davicom/dm9000.rst
@@ -1,7 +1,11 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
DM9000 Network driver
=====================
Copyright 2008 Simtec Electronics,
+
Ben Dooks <ben@simtec.co.uk> <ben-linux@fluff.org>
@@ -30,9 +34,9 @@ These resources should be specified in that order, as the ordering of the
two address regions is important (the driver expects these to be address
and then data).
-An example from arch/arm/mach-s3c2410/mach-bast.c is:
+An example from arch/arm/mach-s3c/mach-bast.c is::
-static struct resource bast_dm9k_resource[] = {
+ static struct resource bast_dm9k_resource[] = {
[0] = {
.start = S3C2410_CS5 + BAST_PA_DM9000,
.end = S3C2410_CS5 + BAST_PA_DM9000 + 3,
@@ -48,14 +52,14 @@ static struct resource bast_dm9k_resource[] = {
.end = IRQ_DM9000,
.flags = IORESOURCE_IRQ | IORESOURCE_IRQ_HIGHLEVEL,
}
-};
+ };
-static struct platform_device bast_device_dm9k = {
+ static struct platform_device bast_device_dm9k = {
.name = "dm9000",
.id = 0,
.num_resources = ARRAY_SIZE(bast_dm9k_resource),
.resource = bast_dm9k_resource,
-};
+ };
Note the setting of the IRQ trigger flag in bast_dm9k_resource[2].flags,
as this will generate a warning if it is not present. The trigger from
@@ -64,13 +68,13 @@ handler to ensure that the IRQ is setup correctly.
This shows a typical platform device, without the optional configuration
platform data supplied. The next example uses the same resources, but adds
-the optional platform data to pass extra configuration data:
+the optional platform data to pass extra configuration data::
-static struct dm9000_plat_data bast_dm9k_platdata = {
+ static struct dm9000_plat_data bast_dm9k_platdata = {
.flags = DM9000_PLATF_16BITONLY,
-};
+ };
-static struct platform_device bast_device_dm9k = {
+ static struct platform_device bast_device_dm9k = {
.name = "dm9000",
.id = 0,
.num_resources = ARRAY_SIZE(bast_dm9k_resource),
@@ -78,7 +82,7 @@ static struct platform_device bast_device_dm9k = {
.dev = {
.platform_data = &bast_dm9k_platdata,
}
-};
+ };
The platform data is defined in include/linux/dm9000.h and described below.
diff --git a/Documentation/networking/device_drivers/dec/dmfe.txt b/Documentation/networking/device_drivers/ethernet/dec/dmfe.rst
index 25320bf19c86..c4cf809cad84 100644
--- a/Documentation/networking/device_drivers/dec/dmfe.txt
+++ b/Documentation/networking/device_drivers/ethernet/dec/dmfe.rst
@@ -1,6 +1,11 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================================================
+Davicom DM9102(A)/DM9132/DM9801 fast ethernet driver for Linux
+==============================================================
+
Note: This driver doesn't have a maintainer.
-Davicom DM9102(A)/DM9132/DM9801 fast ethernet driver for Linux.
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
@@ -16,29 +21,29 @@ GNU General Public License for more details.
This driver provides kernel support for Davicom DM9102(A)/DM9132/DM9801 ethernet cards ( CNET
10/100 ethernet cards uses Davicom chipset too, so this driver supports CNET cards too ).If you
didn't compile this driver as a module, it will automatically load itself on boot and print a
-line similar to :
+line similar to::
dmfe: Davicom DM9xxx net driver, version 1.36.4 (2002-01-17)
-If you compiled this driver as a module, you have to load it on boot.You can load it with command :
+If you compiled this driver as a module, you have to load it on boot. You can load it with the command::
insmod dmfe
This way it will autodetect the device mode.This is the suggested way to load the module.Or you can pass
-a mode= setting to module while loading, like :
+a mode= setting to the module while loading, like::
insmod dmfe mode=0 # Force 10M Half Duplex
insmod dmfe mode=1 # Force 100M Half Duplex
insmod dmfe mode=4 # Force 10M Full Duplex
insmod dmfe mode=5 # Force 100M Full Duplex
-Next you should configure your network interface with a command similar to :
+Next you should configure your network interface with a command similar to::
ifconfig eth0 172.22.3.18
- ^^^^^^^^^^^
+ ^^^^^^^^^^^
Your IP Address
-Then you may have to modify the default routing table with command :
+Then you may have to modify the default routing table with the command::
route add default eth0
@@ -48,10 +53,10 @@ Now your ethernet card should be up and running.
TODO:
-Implement pci_driver::suspend() and pci_driver::resume() power management methods.
-Check on 64 bit boxes.
-Check and fix on big endian boxes.
-Test and make sure PCI latency is now correct for all cases.
+- Implement pci_driver::suspend() and pci_driver::resume() power management methods.
+- Check on 64 bit boxes.
+- Check and fix on big endian boxes.
+- Test and make sure PCI latency is now correct for all cases.
Authors:
@@ -60,7 +65,7 @@ Sten Wang <sten_wang@davicom.com.tw > : Original Author
Contributors:
-Marcelo Tosatti <marcelo@conectiva.com.br>
-Alan Cox <alan@lxorguk.ukuu.org.uk>
-Jeff Garzik <jgarzik@pobox.com>
-Vojtech Pavlik <vojtech@suse.cz>
+- Marcelo Tosatti <marcelo@conectiva.com.br>
+- Alan Cox <alan@lxorguk.ukuu.org.uk>
+- Jeff Garzik <jgarzik@pobox.com>
+- Vojtech Pavlik <vojtech@suse.cz>
diff --git a/Documentation/networking/device_drivers/dlink/dl2k.txt b/Documentation/networking/device_drivers/ethernet/dlink/dl2k.rst
index cba74f7a3abc..ccdb5d0d7460 100644
--- a/Documentation/networking/device_drivers/dlink/dl2k.txt
+++ b/Documentation/networking/device_drivers/ethernet/dlink/dl2k.rst
@@ -1,10 +1,13 @@
+.. SPDX-License-Identifier: GPL-2.0
- D-Link DL2000-based Gigabit Ethernet Adapter Installation
- for Linux
- May 23, 2002
+=========================================================
+D-Link DL2000-based Gigabit Ethernet Adapter Installation
+=========================================================
+
+May 23, 2002
+
+.. Contents
-Contents
-========
- Compatibility List
- Quick Install
- Compiling the Driver
@@ -15,12 +18,13 @@ Contents
Compatibility List
-=================
+==================
+
Adapter Support:
-D-Link DGE-550T Gigabit Ethernet Adapter.
-D-Link DGE-550SX Gigabit Ethernet Adapter.
-D-Link DL2000-based Gigabit Ethernet Adapter.
+- D-Link DGE-550T Gigabit Ethernet Adapter.
+- D-Link DGE-550SX Gigabit Ethernet Adapter.
+- D-Link DL2000-based Gigabit Ethernet Adapter.
The driver support Linux kernel 2.4.7 later. We had tested it
@@ -34,28 +38,32 @@ on the environments below.
Quick Install
=============
-Install linux driver as following command:
+Install the Linux driver with the following commands::
+
+ 1. make all
+ 2. insmod dl2k.ko
+ 3. ifconfig eth0 up 10.xxx.xxx.xxx netmask 255.0.0.0
+ ^^^^^^^^^^^^^^^\ ^^^^^^^^\
+ IP NETMASK
-1. make all
-2. insmod dl2k.ko
-3. ifconfig eth0 up 10.xxx.xxx.xxx netmask 255.0.0.0
- ^^^^^^^^^^^^^^^\ ^^^^^^^^\
- IP NETMASK
Now eth0 should active, you can test it by "ping" or get more information by
"ifconfig". If tested ok, continue the next step.
-4. cp dl2k.ko /lib/modules/`uname -r`/kernel/drivers/net
-5. Add the following line to /etc/modprobe.d/dl2k.conf:
+4. ``cp dl2k.ko /lib/modules/`uname -r`/kernel/drivers/net``
+5. Add the following line to /etc/modprobe.d/dl2k.conf::
+
alias eth0 dl2k
-6. Run depmod to updated module indexes.
-7. Run "netconfig" or "netconf" to create configuration script ifcfg-eth0
+
+6. Run ``depmod`` to update module indexes.
+7. Run ``netconfig`` or ``netconf`` to create configuration script ifcfg-eth0
located at /etc/sysconfig/network-scripts or create it manually.
+
[see - Configuration Script Sample]
8. Driver will automatically load and configure at next boot time.
Compiling the Driver
====================
- In Linux, NIC drivers are most commonly configured as loadable modules.
+In Linux, NIC drivers are most commonly configured as loadable modules.
The approach of building a monolithic kernel has become obsolete. The driver
can be compiled as part of a monolithic kernel, but is strongly discouraged.
The remainder of this section assumes the driver is built as a loadable module.
@@ -73,93 +81,108 @@ to compile and link the driver:
CD-ROM drive
------------
-[root@XXX /] mkdir cdrom
-[root@XXX /] mount -r -t iso9660 -o conv=auto /dev/cdrom /cdrom
-[root@XXX /] cd root
-[root@XXX /root] mkdir dl2k
-[root@XXX /root] cd dl2k
-[root@XXX dl2k] cp /cdrom/linux/dl2k.tgz /root/dl2k
-[root@XXX dl2k] tar xfvz dl2k.tgz
-[root@XXX dl2k] make all
+::
+
+ [root@XXX /] mkdir cdrom
+ [root@XXX /] mount -r -t iso9660 -o conv=auto /dev/cdrom /cdrom
+ [root@XXX /] cd root
+ [root@XXX /root] mkdir dl2k
+ [root@XXX /root] cd dl2k
+ [root@XXX dl2k] cp /cdrom/linux/dl2k.tgz /root/dl2k
+ [root@XXX dl2k] tar xfvz dl2k.tgz
+ [root@XXX dl2k] make all
Floppy disc drive
-----------------
-[root@XXX /] cd root
-[root@XXX /root] mkdir dl2k
-[root@XXX /root] cd dl2k
-[root@XXX dl2k] mcopy a:/linux/dl2k.tgz /root/dl2k
-[root@XXX dl2k] tar xfvz dl2k.tgz
-[root@XXX dl2k] make all
+::
+
+ [root@XXX /] cd root
+ [root@XXX /root] mkdir dl2k
+ [root@XXX /root] cd dl2k
+ [root@XXX dl2k] mcopy a:/linux/dl2k.tgz /root/dl2k
+ [root@XXX dl2k] tar xfvz dl2k.tgz
+ [root@XXX dl2k] make all
Installing the Driver
=====================
- Manual Installation
- -------------------
+Manual Installation
+-------------------
+
Once the driver has been compiled, it must be loaded, enabled, and bound
to a protocol stack in order to establish network connectivity. To load a
- module enter the command:
+ module enter the command::
+
+ insmod dl2k.o
+
+ or::
+
+ insmod dl2k.o <optional parameter> ; add parameter
- insmod dl2k.o
+---------------------------------------------------------
- or
+ example::
- insmod dl2k.o <optional parameter> ; add parameter
+ insmod dl2k.o media=100mbps_hd
- ===============================================================
- example: insmod dl2k.o media=100mbps_hd
- or insmod dl2k.o media=3
- or insmod dl2k.o media=3,2 ; for 2 cards
- ===============================================================
+ or::
+
+ insmod dl2k.o media=3
+
+ or::
+
+ insmod dl2k.o media=3,2 ; for 2 cards
+
+---------------------------------------------------------
Please reference the list of the command line parameters supported by
the Linux device driver below.
The insmod command only loads the driver and gives it a name of the form
eth0, eth1, etc. To bring the NIC into an operational state,
- it is necessary to issue the following command:
+ it is necessary to issue the following command::
- ifconfig eth0 up
+ ifconfig eth0 up
Finally, to bind the driver to the active protocol (e.g., TCP/IP with
- Linux), enter the following command:
+ Linux), enter the following command::
- ifup eth0
+ ifup eth0
Note that this is meaningful only if the system can find a configuration
script that contains the necessary network information. A sample will be
given in the next paragraph.
- The commands to unload a driver are as follows:
+ The commands to unload a driver are as follows::
- ifdown eth0
- ifconfig eth0 down
- rmmod dl2k.o
+ ifdown eth0
+ ifconfig eth0 down
+ rmmod dl2k.o
The following are the commands to list the currently loaded modules and
- to see the current network configuration.
+ to see the current network configuration::
- lsmod
- ifconfig
+ lsmod
+ ifconfig
- Automated Installation
- ----------------------
+Automated Installation
+----------------------
This section describes how to install the driver such that it is
automatically loaded and configured at boot time. The following description
is based on a Red Hat 6.0/7.0 distribution, but it can easily be ported to
other distributions as well.
- Red Hat v6.x/v7.x
- -----------------
+Red Hat v6.x/v7.x
+-----------------
1. Copy dl2k.o to the network modules directory, typically
/lib/modules/2.x.x-xx/net or /lib/modules/2.x.x/kernel/drivers/net.
2. Locate the boot module configuration file, most commonly in the
- /etc/modprobe.d/ directory. Add the following lines:
+ /etc/modprobe.d/ directory. Add the following lines::
- alias ethx dl2k
- options dl2k <optional parameters>
+ alias ethx dl2k
+ options dl2k <optional parameters>
where ethx will be eth0 if the NIC is the only ethernet adapter, eth1 if
one other ethernet adapter is installed, etc. Refer to the table in the
@@ -180,11 +203,15 @@ parameter. Below is a list of the command line parameters supported by the
Linux device
driver.
-mtu=packet_size - Specifies the maximum packet size. default
+
+=============================== ==============================================
+mtu=packet_size Specifies the maximum packet size. default
is 1500.
-media=media_type - Specifies the media type the NIC operates at.
+media=media_type Specifies the media type the NIC operates at.
autosense Autosensing active media.
+
+ =========== =========================
10mbps_hd 10Mbps half duplex.
10mbps_fd 10Mbps full duplex.
100mbps_hd 100Mbps half duplex.
@@ -198,85 +225,90 @@ media=media_type - Specifies the media type the NIC operates at.
4 100Mbps full duplex.
5 1000Mbps half duplex.
6 1000Mbps full duplex.
+ =========== =========================
By default, the NIC operates at autosense.
1000mbps_fd and 1000mbps_hd types are only
available for fiber adapter.
-vlan=n - Specifies the VLAN ID. If vlan=0, the
+vlan=n Specifies the VLAN ID. If vlan=0, the
Virtual Local Area Network (VLAN) function is
disable.
-jumbo=[0|1] - Specifies the jumbo frame support. If jumbo=1,
+jumbo=[0|1] Specifies the jumbo frame support. If jumbo=1,
the NIC accept jumbo frames. By default, this
function is disabled.
Jumbo frame usually improve the performance
int gigabit.
- This feature need jumbo frame compatible
+ This feature needs a jumbo frame compatible
remote.
-
-rx_coalesce=m - Number of rx frame handled each interrupt.
-rx_timeout=n - Rx DMA wait time for an interrupt.
- If set rx_coalesce > 0, hardware only assert
- an interrupt for m frames. Hardware won't
+
+rx_coalesce=m Number of rx frames handled per interrupt.
+rx_timeout=n Rx DMA wait time for an interrupt.
+ If rx_coalesce > 0, hardware only asserts
+ an interrupt for m frames. Hardware won't
assert rx interrupt until m frames received or
- reach timeout of n * 640 nano seconds.
- Set proper rx_coalesce and rx_timeout can
+ reach timeout of n * 640 nanoseconds.
+ Setting proper rx_coalesce and rx_timeout can
reduce congestion collapse and overload which
has been a bottleneck for high speed network.
-
+
For example, rx_coalesce=10 rx_timeout=800.
- that is, hardware assert only 1 interrupt
- for 10 frames received or timeout of 512 us.
+ That is, hardware asserts only 1 interrupt
+ for 10 frames received or a timeout of 512 us.
-tx_coalesce=n - Number of tx frame handled each interrupt.
- Set n > 1 can reduce the interrupts
+tx_coalesce=n Number of tx frames handled per interrupt.
+ Setting n > 1 can reduce the interrupt
congestion usually lower performance of
high speed network card. Default is 16.
-
-tx_flow=[1|0] - Specifies the Tx flow control. If tx_flow=0,
+
+tx_flow=[1|0] Specifies the Tx flow control. If tx_flow=0,
the Tx flow control disable else driver
autodetect.
-rx_flow=[1|0] - Specifies the Rx flow control. If rx_flow=0,
+rx_flow=[1|0] Specifies the Rx flow control. If rx_flow=0,
the Rx flow control enable else driver
autodetect.
+=============================== ==============================================
Configuration Script Sample
===========================
-Here is a sample of a simple configuration script:
+Here is a sample of a simple configuration script::
-DEVICE=eth0
-USERCTL=no
-ONBOOT=yes
-POOTPROTO=none
-BROADCAST=207.200.5.255
-NETWORK=207.200.5.0
-NETMASK=255.255.255.0
-IPADDR=207.200.5.2
+ DEVICE=eth0
+ USERCTL=no
+ ONBOOT=yes
+ BOOTPROTO=none
+ BROADCAST=207.200.5.255
+ NETWORK=207.200.5.0
+ NETMASK=255.255.255.0
+ IPADDR=207.200.5.2
Troubleshooting
===============
Q1. Source files contain ^ M behind every line.
- Make sure all files are Unix file format (no LF). Try the following
- shell command to convert files.
+
+ Make sure all files are in Unix file format (no CR). Try the following
+ shell command to convert files::
cat dl2k.c | col -b > dl2k.tmp
mv dl2k.tmp dl2k.c
- OR
+ OR::
cat dl2k.c | tr -d "\r" > dl2k.tmp
mv dl2k.tmp dl2k.c
-Q2: Could not find header files (*.h) ?
- To compile the driver, you need kernel header files. After
+Q2: Could not find header files (``*.h``)?
+
+ To compile the driver, you need kernel header files. After
installing the kernel source, the header files are usually located in
/usr/src/linux/include, which is the default include directory configured
in Makefile. For some distributions, there is a copy of header files in
/usr/src/include/linux and /usr/src/include/asm, that you can change the
INCLUDEDIR in Makefile to /usr/include without installing kernel source.
- Note that RH 7.0 didn't provide correct header files in /usr/include,
+
+ Note that RH 7.0 didn't provide correct header files in /usr/include,
including those files will make a wrong version driver.
diff --git a/Documentation/networking/device_drivers/freescale/dpaa.txt b/Documentation/networking/device_drivers/ethernet/freescale/dpaa.rst
index b06601ff9200..241c6c6f6e68 100644
--- a/Documentation/networking/device_drivers/freescale/dpaa.txt
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa.rst
@@ -1,12 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
The QorIQ DPAA Ethernet Driver
==============================
Authors:
-Madalin Bucur <madalin.bucur@nxp.com>
-Camelia Groza <camelia.groza@nxp.com>
+- Madalin Bucur <madalin.bucur@nxp.com>
+- Camelia Groza <camelia.groza@nxp.com>
-Contents
-========
+.. Contents
- DPAA Ethernet Overview
- DPAA Ethernet Supported SoCs
@@ -34,7 +36,7 @@ following drivers in the Linux kernel:
- Queue Manager (QMan), Buffer Manager (BMan)
drivers/soc/fsl/qbman
-A simplified view of the dpaa_eth interfaces mapped to FMan MACs:
+A simplified view of the dpaa_eth interfaces mapped to FMan MACs::
dpaa_eth /eth0\ ... /ethN\
driver | | | |
@@ -42,89 +44,93 @@ A simplified view of the dpaa_eth interfaces mapped to FMan MACs:
-Ports / Tx Rx \ ... / Tx Rx \
FMan | | | |
-MACs | MAC0 | | MACN |
- / dtsec0 \ ... / dtsecN \ (or tgec)
- / \ / \(or memac)
+ / dtsec0 \ ... / dtsecN \ (or tgec)
+ / \ / \(or memac)
--------- -------------- --- -------------- ---------
FMan, FMan Port, FMan SP, FMan MURAM drivers
---------------------------------------------------------
FMan HW blocks: MURAM, MACs, Ports, SP
---------------------------------------------------------
-The dpaa_eth relation to the QMan, BMan and FMan:
- ________________________________
+The dpaa_eth relation to the QMan, BMan and FMan::
+
+ ________________________________
dpaa_eth / eth0 \
driver / \
--------- -^- -^- -^- --- ---------
QMan driver / \ / \ / \ \ / | BMan |
- |Rx | |Rx | |Tx | |Tx | | driver |
+ |Rx | |Rx | |Tx | |Tx | | driver |
--------- |Dfl| |Err| |Cnf| |FQs| | |
QMan HW |FQ | |FQ | |FQs| | | | |
- / \ / \ / \ \ / | |
+ / \ / \ / \ \ / | |
--------- --- --- --- -v- ---------
- | FMan QMI | |
- | FMan HW FMan BMI | BMan HW |
- ----------------------- --------
+ | FMan QMI | |
+ | FMan HW FMan BMI | BMan HW |
+ ----------------------- --------
where the acronyms used above (and in the code) are:
-DPAA = Data Path Acceleration Architecture
-FMan = DPAA Frame Manager
-QMan = DPAA Queue Manager
-BMan = DPAA Buffers Manager
-QMI = QMan interface in FMan
-BMI = BMan interface in FMan
-FMan SP = FMan Storage Profiles
-MURAM = Multi-user RAM in FMan
-FQ = QMan Frame Queue
-Rx Dfl FQ = default reception FQ
-Rx Err FQ = Rx error frames FQ
-Tx Cnf FQ = Tx confirmation FQs
-Tx FQs = transmission frame queues
-dtsec = datapath three speed Ethernet controller (10/100/1000 Mbps)
-tgec = ten gigabit Ethernet controller (10 Gbps)
-memac = multirate Ethernet MAC (10/100/1000/10000)
+
+=============== ===========================================================
+DPAA Data Path Acceleration Architecture
+FMan DPAA Frame Manager
+QMan DPAA Queue Manager
+BMan DPAA Buffers Manager
+QMI QMan interface in FMan
+BMI BMan interface in FMan
+FMan SP FMan Storage Profiles
+MURAM Multi-user RAM in FMan
+FQ QMan Frame Queue
+Rx Dfl FQ default reception FQ
+Rx Err FQ Rx error frames FQ
+Tx Cnf FQ Tx confirmation FQs
+Tx FQs transmission frame queues
+dtsec datapath three speed Ethernet controller (10/100/1000 Mbps)
+tgec ten gigabit Ethernet controller (10 Gbps)
+memac multirate Ethernet MAC (10/100/1000/10000)
+=============== ===========================================================
DPAA Ethernet Supported SoCs
============================
The DPAA drivers enable the Ethernet controllers present on the following SoCs:
-# PPC
-P1023
-P2041
-P3041
-P4080
-P5020
-P5040
-T1023
-T1024
-T1040
-T1042
-T2080
-T4240
-B4860
-
-# ARM
-LS1043A
-LS1046A
+PPC
+- P1023
+- P2041
+- P3041
+- P4080
+- P5020
+- P5040
+- T1023
+- T1024
+- T1040
+- T1042
+- T2080
+- T4240
+- B4860
+
+ARM
+- LS1043A
+- LS1046A
Configuring DPAA Ethernet in your kernel
========================================
-To enable the DPAA Ethernet driver, the following Kconfig options are required:
+To enable the DPAA Ethernet driver, the following Kconfig options are required::
-# common for arch/arm64 and arch/powerpc platforms
-CONFIG_FSL_DPAA=y
-CONFIG_FSL_FMAN=y
-CONFIG_FSL_DPAA_ETH=y
-CONFIG_FSL_XGMAC_MDIO=y
+ # common for arch/arm64 and arch/powerpc platforms
+ CONFIG_FSL_DPAA=y
+ CONFIG_FSL_FMAN=y
+ CONFIG_FSL_DPAA_ETH=y
+ CONFIG_FSL_XGMAC_MDIO=y
-# for arch/powerpc only
-CONFIG_FSL_PAMU=y
+ # for arch/powerpc only
+ CONFIG_FSL_PAMU=y
-# common options needed for the PHYs used on the RDBs
-CONFIG_VITESSE_PHY=y
-CONFIG_REALTEK_PHY=y
-CONFIG_AQUANTIA_PHY=y
+ # common options needed for the PHYs used on the RDBs
+ CONFIG_VITESSE_PHY=y
+ CONFIG_REALTEK_PHY=y
+ CONFIG_AQUANTIA_PHY=y
DPAA Ethernet Frame Processing
==============================
@@ -167,7 +173,9 @@ classes as follows:
* priorities 8 to 11 - traffic class 2 (medium-high priority)
* priorities 12 to 15 - traffic class 3 (high priority)
-tc qdisc add dev <int> root handle 1: \
+::
+
+ tc qdisc add dev <int> root handle 1: \
mqprio num_tc 4 map 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3 hw 1
DPAA IRQ Affinity and Receive Side Scaling
@@ -201,11 +209,11 @@ of these frame queues will arrive at the same portal and will always
be processed by the same CPU. This ensures intra-flow order preservation
and workload distribution for multiple traffic flows.
-RSS can be turned off for a certain interface using ethtool, i.e.
+RSS can be turned off for a certain interface using ethtool, i.e.::
# ethtool -N fm1-mac9 rx-flow-hash tcp4 ""
-To turn it back on, one needs to set rx-flow-hash for tcp4/6 or udp4/6:
+To turn it back on, one needs to set rx-flow-hash for tcp4/6 or udp4/6::
# ethtool -N fm1-mac9 rx-flow-hash udp4 sfdn
@@ -216,7 +224,7 @@ going to control the rx-flow-hashing for all protocols on that interface.
Besides using the FMan Keygen computed hash for spreading traffic on the
128 Rx FQs, the DPAA Ethernet driver also sets the skb hash value when
the NETIF_F_RXHASH feature is on (active by default). This can be turned
-on or off through ethtool, i.e.:
+on or off through ethtool, i.e.::
# ethtool -K fm1-mac9 rx-hashing off
# ethtool -k fm1-mac9 | grep hash
@@ -246,6 +254,7 @@ The following statistics are exported for each interface through ethtool:
- Rx error count per CPU
- Rx error count per type
- congestion related statistics:
+
- congestion status
- time spent in congestion
- number of time the device entered congestion
@@ -254,7 +263,7 @@ The following statistics are exported for each interface through ethtool:
The driver also exports the following information in sysfs:
- the FQ IDs for each FQ type
- /sys/devices/platform/soc/<addr>.fman/<addr>.ethernet/dpaa-ethernet.<id>/net/fm<nr>-mac<nr>/fqids
+ /sys/devices/platform/soc/<addr>.fman/<addr>.ethernet/dpaa-ethernet.<id>/net/fm<nr>-mac<nr>/fqids
- the ID of the buffer pool in use
- /sys/devices/platform/soc/<addr>.fman/<addr>.ethernet/dpaa-ethernet.<id>/net/fm<nr>-mac<nr>/bpids
+ /sys/devices/platform/soc/<addr>.fman/<addr>.ethernet/dpaa-ethernet.<id>/net/fm<nr>-mac<nr>/bpids
diff --git a/Documentation/networking/device_drivers/freescale/dpaa2/dpio-driver.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/dpio-driver.rst
index 17dbee1ac53e..e4ebfe62a183 100644
--- a/Documentation/networking/device_drivers/freescale/dpaa2/dpio-driver.rst
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/dpio-driver.rst
@@ -1,5 +1,6 @@
.. include:: <isonum.txt>
+===================================
DPAA2 DPIO (Data Path I/O) Overview
===================================
@@ -19,8 +20,10 @@ pool management for network interfaces.
This document provides an overview the Linux DPIO driver, its
subcomponents, and its APIs.
-See Documentation/networking/device_drivers/freescale/dpaa2/overview.rst for
-a general overview of DPAA2 and the general DPAA2 driver architecture in Linux.
+See
+Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
+for a general overview of DPAA2 and the general DPAA2 driver architecture
+in Linux.
Driver Overview
---------------
diff --git a/Documentation/networking/device_drivers/freescale/dpaa2/ethernet-driver.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/ethernet-driver.rst
index cb4c9a0c5a17..682f3986c15b 100644
--- a/Documentation/networking/device_drivers/freescale/dpaa2/ethernet-driver.rst
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/ethernet-driver.rst
@@ -33,7 +33,8 @@ hardware resources, like queues, do not have a corresponding MC object and
are treated as internal resources of other objects.
For a more detailed description of the DPAA2 architecture and its object
-abstractions see *Documentation/networking/device_drivers/freescale/dpaa2/overview.rst*.
+abstractions see
+*Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst*.
Each Linux net device is built on top of a Datapath Network Interface (DPNI)
object and uses Buffer Pools (DPBPs), I/O Portals (DPIOs) and Concentrators
diff --git a/Documentation/networking/device_drivers/freescale/dpaa2/index.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/index.rst
index ee40fcc5ddff..62f4a4aff6ec 100644
--- a/Documentation/networking/device_drivers/freescale/dpaa2/index.rst
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/index.rst
@@ -9,3 +9,4 @@ DPAA2 Documentation
dpio-driver
ethernet-driver
mac-phy-support
+ switch-driver
diff --git a/Documentation/networking/device_drivers/freescale/dpaa2/mac-phy-support.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst
index 51e6624fb774..51e6624fb774 100644
--- a/Documentation/networking/device_drivers/freescale/dpaa2/mac-phy-support.rst
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst
diff --git a/Documentation/networking/device_drivers/freescale/dpaa2/overview.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
index d638b5a8aadd..199647729251 100644
--- a/Documentation/networking/device_drivers/freescale/dpaa2/overview.rst
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
@@ -183,6 +183,7 @@ PHY and allows physical transmission and reception of Ethernet frames.
IRQ config, enable, reset
DPNI (Datapath Network Interface)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Contains TX/RX queues, network interface configuration, and RX buffer pool
configuration mechanisms. The TX/RX queues are in memory and are identified
by queue number.
diff --git a/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/switch-driver.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/switch-driver.rst
new file mode 100644
index 000000000000..8bf411b857d4
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/switch-driver.rst
@@ -0,0 +1,217 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===================
+DPAA2 Switch driver
+===================
+
+:Copyright: |copy| 2021 NXP
+
+The DPAA2 Switch driver probes on the Datapath Switch (DPSW) object which can
+be instantiated on the following DPAA2 SoCs and their variants: LS2088A and
+LX2160A.
+
+The driver uses the switch device driver model and exposes each switch port as
+a network interface, which can be included in a bridge or used as a standalone
+interface. Traffic switched between ports is offloaded into the hardware.
+
+The DPSW can have ports connected to DPNIs or to DPMACs for external access.
+::
+
+ [ethA] [ethB] [ethC] [ethD] [ethE] [ethF]
+ : : : : : :
+ : : : : : :
+ [dpaa2-eth] [dpaa2-eth] [ dpaa2-switch ]
+ : : : : : : kernel
+ =============================================================================
+ : : : : : : hardware
+ [DPNI] [DPNI] [============= DPSW =================]
+ | | | | | |
+ | ---------- | [DPMAC] [DPMAC]
+ ------------------------------- | |
+ | |
+ [PHY] [PHY]
+
+Creating an Ethernet Switch
+===========================
+
+The dpaa2-switch driver probes on DPSW devices found on the fsl-mc bus. These
+devices can be either created statically through the boot time configuration
+file - DataPath Layout (DPL) - or at runtime using the DPAA2 object APIs
+(incorporated already into the restool userspace tool).
+
+At the moment, the dpaa2-switch driver imposes the following restrictions on
+the DPSW object that it will probe:
+
+ * The minimum number of FDBs should be at least equal to the number of switch
+ interfaces. This is necessary so that separation of switch ports can be
+ done, i.e. when not under a bridge, each switch port will have its own FDB.
+ ::
+
+ fsl_dpaa2_switch dpsw.0: The number of FDBs is lower than the number of ports, cannot probe
+
+ * Both the broadcast and flooding configuration should be per FDB. This
+ enables the driver to restrict the broadcast and flooding domains of each
+ FDB depending on the switch ports that are sharing it (aka are under the
+ same bridge).
+ ::
+
+ fsl_dpaa2_switch dpsw.0: Flooding domain is not per FDB, cannot probe
+ fsl_dpaa2_switch dpsw.0: Broadcast domain is not per FDB, cannot probe
+
+ * The control interface of the switch should not be disabled
+ (DPSW_OPT_CTRL_IF_DIS not passed as a create time option). Without the
+ control interface, the driver cannot provide proper Rx/Tx traffic
+ support on the switch port netdevices.
+ ::
+
+ fsl_dpaa2_switch dpsw.0: Control Interface is disabled, cannot probe
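The probe restrictions listed above can be summarized as a small check. This is a minimal sketch, not the driver's code: the `DpswAttributes` class and its field names are hypothetical stand-ins for the attributes the driver reads from the MC firmware, and only the message strings are taken from the text above.

```python
from dataclasses import dataclass


@dataclass
class DpswAttributes:
    # Hypothetical field names, chosen for illustration only.
    num_ifs: int              # number of switch interfaces (ports)
    num_fdbs: int             # number of FDBs the DPSW was created with
    flooding_per_fdb: bool    # flooding configuration is per FDB
    broadcast_per_fdb: bool   # broadcast configuration is per FDB
    ctrl_if_disabled: bool    # DPSW_OPT_CTRL_IF_DIS passed at create time


def probe_errors(attr: DpswAttributes) -> list[str]:
    """Return the probe-failure messages that would apply to this DPSW."""
    errors = []
    if attr.num_fdbs < attr.num_ifs:
        errors.append("The number of FDBs is lower than the number of ports, cannot probe")
    if not attr.flooding_per_fdb:
        errors.append("Flooding domain is not per FDB, cannot probe")
    if not attr.broadcast_per_fdb:
        errors.append("Broadcast domain is not per FDB, cannot probe")
    if attr.ctrl_if_disabled:
        errors.append("Control Interface is disabled, cannot probe")
    return errors
```

An empty list corresponds to a DPSW object that satisfies every restriction named in this section.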
+
+Besides the configuration of the actual DPSW object, the dpaa2-switch driver
+will need the following DPAA2 objects:
+
+ * 1 DPMCP - A Management Command Portal object is needed for any interaction
+ with the MC firmware.
+
+ * 1 DPBP - A Buffer Pool is used for seeding buffers intended for the Rx path
+ on the control interface.
+
+ * Access to at least one DPIO object (Software Portal) is needed for any
+ enqueue/dequeue operation to be performed on the control interface queues.
+ The DPIO object will be shared, no need for a private one.
+
+Switching features
+==================
+
+The driver supports the configuration of L2 forwarding rules in hardware for
+port bridging as well as standalone usage of the independent switch interfaces.
+
+The hardware is not configurable with respect to VLAN awareness, thus any DPAA2
+switch port should be used only in use cases with a VLAN aware bridge::
+
+ $ ip link add dev br0 type bridge vlan_filtering 1
+
+ $ ip link add dev br1 type bridge
+ $ ip link set dev ethX master br1
+ Error: fsl_dpaa2_switch: Cannot join a VLAN-unaware bridge
+
+Topology and loop detection through STP is supported when ``stp_state 1`` is
+used at bridge creation::
+
+ $ ip link add dev br0 type bridge vlan_filtering 1 stp_state 1
+
+L2 FDB manipulation (add/delete/dump) is supported.
+
+HW FDB learning can be configured on each switch port independently through
+bridge commands. When the HW learning is disabled, a fast age procedure will be
+run and any previously learnt addresses will be removed.
+::
+
+ $ bridge link set dev ethX learning off
+ $ bridge link set dev ethX learning on
+
+Restricting the unknown unicast and multicast flooding domain is supported, but
+not independently of each other::
+
+ $ ip link set dev ethX type bridge_slave flood off mcast_flood off
+ $ ip link set dev ethX type bridge_slave flood off mcast_flood on
+ Error: fsl_dpaa2_switch: Cannot configure multicast flooding independently of unicast.
+
+Broadcast flooding on a switch port can be disabled/enabled through the brport sysfs::
+
+ $ echo 0 > /sys/bus/fsl-mc/devices/dpsw.Y/net/ethX/brport/broadcast_flood
+
+Offloads
+========
+
+Routing actions (redirect, trap, drop)
+--------------------------------------
+
+The DPAA2 switch is able to offload flow-based redirection of packets making
+use of ACL tables. Shared filter blocks are supported by sharing a single ACL
+table between multiple ports.
+
+The following flow keys are supported:
+
+ * Ethernet: dst_mac/src_mac
+ * IPv4: dst_ip/src_ip/ip_proto/tos
+ * VLAN: vlan_id/vlan_prio/vlan_tpid/vlan_dei
+ * L4: dst_port/src_port
+
+Also, the matchall filter can be used to redirect the entire traffic received
+on a port.
+
+The following per-flow actions are supported:
+
+ * drop
+ * mirred egress redirect
+ * trap
+
+Each ACL entry (filter) can be set up with only one of the listed
+actions.
+
+Example 1: send frames received on eth4 with a SA of 00:01:02:03:04:05 to the
+CPU::
+
+ $ tc qdisc add dev eth4 clsact
+ $ tc filter add dev eth4 ingress flower src_mac 00:01:02:03:04:05 skip_sw action trap
+
+Example 2: drop frames received on eth4 with VID 100 and PCP of 3::
+
+ $ tc filter add dev eth4 ingress protocol 802.1q flower skip_sw vlan_id 100 vlan_prio 3 action drop
+
+Example 3: redirect all frames received on eth4 to eth1::
+
+ $ tc filter add dev eth4 ingress matchall action mirred egress redirect dev eth1
+
+Example 4: Use a single shared filter block on both eth5 and eth6::
+
+ $ tc qdisc add dev eth5 ingress_block 1 clsact
+ $ tc qdisc add dev eth6 ingress_block 1 clsact
+ $ tc filter add block 1 ingress flower dst_mac 00:01:02:03:04:04 skip_sw \
+ action trap
+ $ tc filter add block 1 ingress protocol ipv4 flower src_ip 192.168.1.1 skip_sw \
+ action mirred egress redirect dev eth3
+
+Mirroring
+~~~~~~~~~
+
+The DPAA2 switch supports only per port mirroring and per VLAN mirroring.
+Adding mirroring filters in shared blocks is also supported.
+
+When using the tc-flower classifier with the 802.1q protocol, only the
+``vlan_id`` key will be accepted. Mirroring based on any other fields from the
+802.1q protocol will be rejected::
+
+ $ tc qdisc add dev eth8 ingress_block 1 clsact
+ $ tc filter add block 1 ingress protocol 802.1q flower skip_sw vlan_prio 3 action mirred egress mirror dev eth6
+ Error: fsl_dpaa2_switch: Only matching on VLAN ID supported.
+ We have an error talking to the kernel
+
+If a mirroring VLAN filter is requested on a port, the VLAN must be
+installed on the switch port in question either using ``bridge`` or by creating
+a VLAN upper device if the switch port is used as a standalone interface::
+
+ $ tc qdisc add dev eth8 ingress_block 1 clsact
+ $ tc filter add block 1 ingress protocol 802.1q flower skip_sw vlan_id 200 action mirred egress mirror dev eth6
+ Error: VLAN must be installed on the switch port.
+ We have an error talking to the kernel
+
+ $ bridge vlan add vid 200 dev eth8
+ $ tc filter add block 1 ingress protocol 802.1q flower skip_sw vlan_id 200 action mirred egress mirror dev eth6
+
+ $ ip link add link eth8 name eth8.200 type vlan id 200
+ $ tc filter add block 1 ingress protocol 802.1q flower skip_sw vlan_id 200 action mirred egress mirror dev eth6
+
+Also, it should be noted that the mirrored traffic will be subject to the same
+egress restrictions as any other traffic. This means that when a mirrored
+packet reaches the mirror port, it is dropped if the VLAN found in the packet
+is not installed on that port.
+
+The DPAA2 switch supports only a single mirroring destination, thus multiple
+mirror rules can be installed but their ``to`` port has to be the same::
+
+ $ tc filter add block 1 ingress protocol 802.1q flower skip_sw vlan_id 200 action mirred egress mirror dev eth6
+ $ tc filter add block 1 ingress protocol 802.1q flower skip_sw vlan_id 100 action mirred egress mirror dev eth7
+ Error: fsl_dpaa2_switch: Multiple mirror ports not supported.
+ We have an error talking to the kernel
diff --git a/Documentation/networking/device_drivers/freescale/gianfar.txt b/Documentation/networking/device_drivers/ethernet/freescale/gianfar.rst
index ba1daea7f2e4..9c4a91d3824b 100644
--- a/Documentation/networking/device_drivers/freescale/gianfar.txt
+++ b/Documentation/networking/device_drivers/ethernet/freescale/gianfar.rst
@@ -1,10 +1,15 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================
The Gianfar Ethernet Driver
+===========================
-Author: Andy Fleming <afleming@freescale.com>
-Updated: 2005-07-28
+:Author: Andy Fleming <afleming@freescale.com>
+:Updated: 2005-07-28
-CHECKSUM OFFLOADING
+Checksum Offloading
+===================
The eTSEC controller (first included in parts from late 2005 like
the 8548) has the ability to perform TCP, UDP, and IP checksums
@@ -15,13 +20,15 @@ packets. Use ethtool to enable or disable this feature for RX
and TX.
VLAN
+====
In order to use VLAN, please consult Linux documentation on
configuring VLANs. The gianfar driver supports hardware insertion and
extraction of VLAN headers, but not filtering. Filtering will be
done by the kernel.
-MULTICASTING
+Multicasting
+============
The gianfar driver supports using the group hash table on the
TSEC (and the extended hash table on the eTSEC) for multicast
@@ -29,13 +36,15 @@ filtering. On the eTSEC, the exact-match MAC registers are used
before the hash tables. See Linux documentation on how to join
multicast groups.
-PADDING
+Padding
+=======
The gianfar driver supports padding received frames with 2 bytes
to align the IP header to a 16-byte boundary, when supported by
hardware.
-ETHTOOL
+Ethtool
+=======
The gianfar driver supports the use of ethtool for many
configuration options. You must run ethtool only on currently
diff --git a/Documentation/networking/device_drivers/google/gve.rst b/Documentation/networking/device_drivers/ethernet/google/gve.rst
index 793693cef6e3..6d73ee78f3d7 100644
--- a/Documentation/networking/device_drivers/google/gve.rst
+++ b/Documentation/networking/device_drivers/ethernet/google/gve.rst
@@ -47,13 +47,24 @@ The driver interacts with the device in the following ways:
- Transmit and Receive Queues
- See description below
+Descriptor Formats
+------------------
+GVE supports two descriptor formats: GQI and DQO. These two formats have
+entirely different descriptors, which will be described below.
+
Registers
---------
-All registers are MMIO and big endian.
+All registers are MMIO.
The registers are used for initializing and configuring the device as well as
querying device status in response to management interrupts.
+Endianness
+----------
+- Admin Queue messages and registers are all Big Endian.
+- GQI descriptors and datapath registers are Big Endian.
+- DQO descriptors and datapath registers are Little Endian.
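The practical effect of the two byte orders can be illustrated with Python's `struct` module. This is only an illustration of big-endian versus little-endian layout for a 32-bit value; the helper names are hypothetical and do not correspond to anything in the driver.

```python
import struct


def encode_be32(value: int) -> bytes:
    """Encode a 32-bit value big endian (Admin Queue, registers, GQI)."""
    return struct.pack(">I", value)


def encode_le32(value: int) -> bytes:
    """Encode a 32-bit value little endian (DQO descriptors and datapath)."""
    return struct.pack("<I", value)


if __name__ == "__main__":
    # The same value produces reversed byte sequences in the two formats.
    print(encode_be32(0x12345678).hex())
    print(encode_le32(0x12345678).hex())
```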
+
Admin Queue (AQ)
----------------
The Admin Queue is a PAGE_SIZE memory block, treated as an array of AQ
@@ -97,10 +108,10 @@ the queues associated with that interrupt.
The handler for these irqs schedule the napi for that block to run
and poll the queues.
-Traffic Queues
---------------
-gVNIC's queues are composed of a descriptor ring and a buffer and are
-assigned to a notification block.
+GQI Traffic Queues
+------------------
+GQI queues are composed of a descriptor ring and a buffer and are assigned to a
+notification block.
The descriptor rings are power-of-two-sized ring buffers consisting of
fixed-size descriptors. They advance their head pointer using a __be32
@@ -121,3 +132,35 @@ Receive
The buffers for receive rings are put into a data ring that is the same
length as the descriptor ring and the head and tail pointers advance over
the rings together.
+
+DQO Traffic Queues
+------------------
+- Every TX and RX queue is assigned a notification block.
+
+- TX and RX buffer queues, which send descriptors to the device, use MMIO
+ doorbells to notify the device of new descriptors.
+
+- RX and TX completion queues, which receive descriptors from the device, use a
+ "generation bit" to know when a descriptor was populated by the device. The
+ driver initializes all bits with the "current generation". The device will
+ populate received descriptors with the "next generation" which is inverted
+ from the current generation. When the ring wraps, the current/next generation
+ are swapped.
+
+- It's the driver's responsibility to ensure that the RX and TX completion
+ queues are not overrun. This can be accomplished by limiting the number of
+ descriptors posted to HW.
+
+- TX packets have a 16 bit completion_tag and RX buffers have a 16 bit
+ buffer_id. These will be returned on the TX completion and RX queues
+ respectively to let the driver know which packet/buffer was completed.
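The generation-bit scheme for completion queues described above can be sketched as follows. This is a simplified model under stated assumptions, not the gve driver's implementation: the `CompletionRing` class, its one-bit generation values, and the `device_write` helper (which stands in for the device populating a descriptor) are all hypothetical.

```python
class CompletionRing:
    """Toy model of a completion queue that uses a generation bit."""

    def __init__(self, size: int):
        self.size = size
        self.cur_gen = 0  # the "current generation" written by the driver
        # Each slot is (generation bit, payload); all start at cur_gen.
        self.slots = [(self.cur_gen, None)] * size
        self.head = 0

    def device_write(self, index: int, gen: int, payload) -> None:
        """Simulate the device populating a descriptor (test helper)."""
        self.slots[index] = (gen, payload)

    def poll(self) -> list:
        """Collect every descriptor the device has populated so far."""
        done = []
        while True:
            gen, payload = self.slots[self.head]
            if gen == self.cur_gen:
                break  # still the current generation: not populated yet
            done.append(payload)
            self.head += 1
            if self.head == self.size:
                # Ring wrapped: current and next generation swap roles.
                self.head = 0
                self.cur_gen ^= 1
        return done
```

Because a populated slot is recognized only by its inverted generation bit, the driver does not need a separate producer index from the device. The loop terminates even on a full ring, since after the wrap the already-consumed slots carry what is now the current generation.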
+
+Transmit
+~~~~~~~~
+A packet's buffers are DMA mapped for the device to access before transmission.
+After the packet has been successfully transmitted, the buffers are unmapped.
+
+Receive
+~~~~~~~
+The driver posts fixed sized buffers to HW on the RX buffer queue. The packet
+received on the associated RX queue may span multiple descriptors.
diff --git a/Documentation/networking/hinic.txt b/Documentation/networking/device_drivers/ethernet/huawei/hinic.rst
index 989366a4039c..867ac8f4e04a 100644
--- a/Documentation/networking/hinic.txt
+++ b/Documentation/networking/device_drivers/ethernet/huawei/hinic.rst
@@ -1,3 +1,6 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================================================
Linux Kernel Driver for Huawei Intelligent NIC(HiNIC) family
============================================================
@@ -110,7 +113,7 @@ hinic_dev - de/constructs the Logical Tx and Rx Queues.
(hinic_main.c, hinic_dev.h)
-Miscellaneous:
+Miscellaneous
=============
Common functions that are used by HW and Logical Device.
diff --git a/Documentation/networking/device_drivers/ethernet/index.rst b/Documentation/networking/device_drivers/ethernet/index.rst
new file mode 100644
index 000000000000..5196905582c5
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/index.rst
@@ -0,0 +1,62 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Ethernet Device Drivers
+=======================
+
+Device drivers for Ethernet and Ethernet-based virtual function devices.
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ 3com/3c509
+ 3com/vortex
+ amazon/ena
+ altera/altera_tse
+ aquantia/atlantic
+ chelsio/cxgb
+ cirrus/cs89x0
+ dlink/dl2k
+ davicom/dm9000
+ dec/dmfe
+ freescale/dpaa
+ freescale/dpaa2/index
+ freescale/gianfar
+ google/gve
+ huawei/hinic
+ intel/e100
+ intel/e1000
+ intel/e1000e
+ intel/fm10k
+ intel/igb
+ intel/igbvf
+ intel/ixgb
+ intel/ixgbe
+ intel/ixgbevf
+ intel/i40e
+ intel/iavf
+ intel/ice
+ marvell/octeontx2
+ marvell/octeon_ep
+ mellanox/mlx5
+ microsoft/netvsc
+ neterion/s2io
+ netronome/nfp
+ pensando/ionic
+ smsc/smc9
+ stmicro/stmmac
+ ti/cpsw
+ ti/cpsw_switchdev
+ ti/am65_nuss_cpsw_switchdev
+ ti/tlan
+ toshiba/spider_net
+ wangxun/txgbe
+ wangxun/ngbe
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/intel/e100.rst b/Documentation/networking/device_drivers/ethernet/intel/e100.rst
index caf023cc88de..3d4a9ba21946 100644
--- a/Documentation/networking/device_drivers/intel/e100.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/e100.rst
@@ -33,7 +33,7 @@ The following features are now available in supported kernels:
- SNMP
Channel Bonding documentation can be found in the Linux kernel source:
-/Documentation/networking/bonding.txt
+/Documentation/networking/bonding.rst
Identifying Your Adapter
@@ -41,7 +41,7 @@ Identifying Your Adapter
For information on how to identify your adapter, and for the latest Intel
network drivers, refer to the Intel Support website:
-http://www.intel.com/support
+https://www.intel.com/support
Driver Configuration Parameters
===============================
@@ -179,7 +179,7 @@ filtering by
Support
=======
For general information, go to the Intel support website at:
-http://www.intel.com/support/
+https://www.intel.com/support/
or the Intel Wired Networking project hosted by Sourceforge at:
http://sourceforge.net/projects/e1000
diff --git a/Documentation/networking/device_drivers/intel/e1000.rst b/Documentation/networking/device_drivers/ethernet/intel/e1000.rst
index 4aaae0f7d6ba..4aaae0f7d6ba 100644
--- a/Documentation/networking/device_drivers/intel/e1000.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/e1000.rst
diff --git a/Documentation/networking/device_drivers/intel/e1000e.rst b/Documentation/networking/device_drivers/ethernet/intel/e1000e.rst
index f49cd370e7bf..f49cd370e7bf 100644
--- a/Documentation/networking/device_drivers/intel/e1000e.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/e1000e.rst
diff --git a/Documentation/networking/device_drivers/intel/fm10k.rst b/Documentation/networking/device_drivers/ethernet/intel/fm10k.rst
index 4d279e64e221..9258ef6f515c 100644
--- a/Documentation/networking/device_drivers/intel/fm10k.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/fm10k.rst
@@ -22,7 +22,7 @@ Ethernet Multi-host Controller.
For information on how to identify your adapter, and for the latest Intel
network drivers, refer to the Intel Support website:
-http://www.intel.com/support
+https://www.intel.com/support
Flow Control
diff --git a/Documentation/networking/device_drivers/intel/i40e.rst b/Documentation/networking/device_drivers/ethernet/intel/i40e.rst
index 8a9b18573688..ac35bd472bdc 100644
--- a/Documentation/networking/device_drivers/intel/i40e.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/i40e.rst
@@ -173,7 +173,7 @@ Director rule is added from ethtool (Sideband filter), ATR is turned off by the
driver. To re-enable ATR, the sideband can be disabled with the ethtool -K
option. For example::
- ethtool –K [adapter] ntuple [off|on]
+ ethtool -K [adapter] ntuple [off|on]
If sideband is re-enabled after ATR is re-enabled, ATR remains enabled until a
TCP-IP flow is added. When all TCP-IP sideband rules are deleted, ATR is
@@ -466,7 +466,7 @@ network. PTP support varies among Intel devices that support this driver. Use
"ethtool -T <netdev name>" to get a definitive list of PTP capabilities
supported by the device.
-IEEE 802.1ad (QinQ) Support
+IEEE 802.1ad (QinQ) Support
---------------------------
The IEEE 802.1ad standard, informally known as QinQ, allows for multiple VLAN
IDs within a single Ethernet frame. VLAN IDs are sometimes referred to as
@@ -523,8 +523,8 @@ of a port's bandwidth (should it be available). The sum of all the values for
Maximum Bandwidth is not restricted, because no more than 100% of a port's
bandwidth can ever be used.
-NOTE: X710/XXV710 devices fail to enable Max VFs (64) when Multiple Functions
-per Port (MFP) and SR-IOV are enabled. An error from i40e is logged that says
+NOTE: X710/XXV710 devices fail to enable Max VFs (64) when Multiple Functions
+per Port (MFP) and SR-IOV are enabled. An error from i40e is logged that says
"add vsi failed for VF N, aq_err 16". To workaround the issue, enable less than
64 virtual functions (VFs).
@@ -688,7 +688,7 @@ shaper bw_rlimit: for each tc, sets minimum and maximum bandwidth rates.
Totals must be equal or less than port speed.
For example: min_rate 1Gbit 3Gbit: Verify bandwidth limit using network
-monitoring tools such as ifstat or sar –n DEV [interval] [number of samples]
+monitoring tools such as ``ifstat`` or ``sar -n DEV [interval] [number of samples]``
2. Enable HW TC offload on interface::
diff --git a/Documentation/networking/device_drivers/intel/iavf.rst b/Documentation/networking/device_drivers/ethernet/intel/iavf.rst
index 84ac7e75f363..151af0a8da9c 100644
--- a/Documentation/networking/device_drivers/intel/iavf.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/iavf.rst
@@ -43,7 +43,7 @@ device.
For information on how to identify your adapter, and for the latest NVM/FW
images and Intel network drivers, refer to the Intel Support website:
-http://www.intel.com/support
+https://www.intel.com/support
Additional Features and Configurations
@@ -113,7 +113,7 @@ which the AVF is associated. The following are base mode features:
- AVF device ID
- HW mailbox is used for VF to PF communications (including on Windows)
-IEEE 802.1ad (QinQ) Support
+IEEE 802.1ad (QinQ) Support
---------------------------
The IEEE 802.1ad standard, informally known as QinQ, allows for multiple VLAN
IDs within a single Ethernet frame. VLAN IDs are sometimes referred to as
@@ -179,7 +179,7 @@ shaper bw_rlimit: for each tc, sets minimum and maximum bandwidth rates.
Totals must be equal or less than port speed.
For example: min_rate 1Gbit 3Gbit: Verify bandwidth limit using network
-monitoring tools such as ifstat or sar –n DEV [interval] [number of samples]
+monitoring tools such as ``ifstat`` or ``sar -n DEV [interval] [number of samples]``
NOTE:
Setting up channels via ethtool (ethtool -L) is not supported when the
diff --git a/Documentation/networking/device_drivers/ethernet/intel/ice.rst b/Documentation/networking/device_drivers/ethernet/intel/ice.rst
new file mode 100644
index 000000000000..dc2e60ced927
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/intel/ice.rst
@@ -0,0 +1,1040 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=================================================================
+Linux Base Driver for the Intel(R) Ethernet Controller 800 Series
+=================================================================
+
+Intel ice Linux driver.
+Copyright(c) 2018-2021 Intel Corporation.
+
+Contents
+========
+
+- Overview
+- Identifying Your Adapter
+- Important Notes
+- Additional Features and Configurations
+- Performance Optimization
+
+
+The associated Virtual Function (VF) driver for this driver is iavf.
+
+Driver information can be obtained using ethtool and lspci.
+
+For questions related to hardware requirements, refer to the documentation
+supplied with your Intel adapter. All hardware requirements listed apply to use
+with Linux.
+
+This driver supports XDP (Express Data Path) and AF_XDP zero-copy. Note that
+XDP is blocked for frame sizes larger than 3KB.
+
+
+Identifying Your Adapter
+========================
+For information on how to identify your adapter, and for the latest Intel
+network drivers, refer to the Intel Support website:
+https://www.intel.com/support
+
+
+Important Notes
+===============
+
+Packet drops may occur under receive stress
+-------------------------------------------
+Devices based on the Intel(R) Ethernet Controller 800 Series are designed to
+tolerate a limited amount of system latency during PCIe and DMA transactions.
+If these transactions take longer than the tolerated latency, it can impact the
+length of time the packets are buffered in the device and associated memory,
+which may result in dropped packets. These packet drops typically do not have
+a noticeable impact on throughput and performance under standard workloads.
+
+If these packet drops appear to affect your workload, the following may improve
+the situation:
+
+1) Make sure that your system's physical memory is in a high-performance
+ configuration, as recommended by the platform vendor. A common
+ recommendation is for all channels to be populated with a single DIMM
+ module.
+2) In your system's BIOS/UEFI settings, select the "Performance" profile.
+3) Your distribution may provide tools like "tuned," which can help tweak
+ kernel settings to achieve better standard settings for different workloads.
+
+
+Configuring SR-IOV for improved network security
+------------------------------------------------
+In a virtualized environment, on Intel(R) Ethernet Network Adapters that
+support SR-IOV, the virtual function (VF) may be subject to malicious behavior.
+Software-generated layer two frames, like IEEE 802.3x (link flow control), IEEE
+802.1Qbb (priority based flow-control), and others of this type, are not
+expected and can throttle traffic between the host and the virtual switch,
+reducing performance. To resolve this issue, and to ensure isolation from
+unintended traffic streams, configure all SR-IOV enabled ports for VLAN tagging
+from the administrative interface on the PF. This configuration allows
+unexpected, and potentially malicious, frames to be dropped.
+
+See "Configuring VLAN Tagging on SR-IOV Enabled Adapter Ports" later in this
+README for configuration instructions.
+
+
+Do not unload port driver if VF with active VM is bound to it
+-------------------------------------------------------------
+Do not unload a port's driver if a Virtual Function (VF) with an active Virtual
+Machine (VM) is bound to it. Doing so will cause the port to appear to hang.
+Once the VM shuts down, or otherwise releases the VF, the command will
+complete.
+
+
+Important notes for SR-IOV and Link Aggregation
+-----------------------------------------------
+Link Aggregation is mutually exclusive with SR-IOV.
+
+- If Link Aggregation is active, SR-IOV VFs cannot be created on the PF.
+- If SR-IOV is active, you cannot set up Link Aggregation on the interface.
+
+Bridging and MACVLAN are also affected by this. If you wish to use bridging or
+MACVLAN with SR-IOV, you must set up bridging or MACVLAN before enabling
+SR-IOV. If you are using bridging or MACVLAN in conjunction with SR-IOV, and
+you want to remove the interface from the bridge or MACVLAN, you must follow
+these steps:
+
+1. Destroy SR-IOV VFs if they exist
+2. Remove the interface from the bridge or MACVLAN
+3. Recreate SR-IOV VFs as needed
+
+
+Additional Features and Configurations
+======================================
+
+ethtool
+-------
+The driver utilizes the ethtool interface for driver configuration and
+diagnostics, as well as displaying statistical information. The latest ethtool
+version is required for this functionality. Download it at:
+https://kernel.org/pub/software/network/ethtool/
+
+NOTE: The rx_bytes value of ethtool does not match the rx_bytes value of
+Netdev, due to the 4-byte CRC being stripped by the device. The difference
+between the two rx_bytes values will be 4 x the number of Rx packets. For
+example, if Rx packets are 10 and Netdev (software statistics) displays
+rx_bytes as "X", then ethtool (hardware statistics) will display rx_bytes as
+"X+40" (4 bytes CRC x 10 packets).
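The relationship between the two counters in the note above is plain arithmetic. A minimal sketch, with a hypothetical helper name, that reproduces the "X+40" example:

```python
CRC_LEN = 4  # bytes of Ethernet CRC stripped by the device


def ethtool_rx_bytes(netdev_rx_bytes: int, rx_packets: int) -> int:
    """Estimate the ethtool (hardware) rx_bytes from the Netdev value."""
    return netdev_rx_bytes + CRC_LEN * rx_packets


if __name__ == "__main__":
    # With 10 received packets the hardware counter is 40 bytes higher.
    print(ethtool_rx_bytes(1000, 10) - 1000)
```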
+
+
+Viewing Link Messages
+---------------------
+Link messages will not be displayed to the console if the distribution is
+restricting system messages. In order to see network driver link messages on
+your console, set dmesg to eight by entering the following::
+
+ # dmesg -n 8
+
+NOTE: This setting is not saved across reboots.
+
+
+Dynamic Device Personalization
+------------------------------
+Dynamic Device Personalization (DDP) allows you to change the packet processing
+pipeline of a device by applying a profile package to the device at runtime.
+Profiles can be used to, for example, add support for new protocols, change
+existing protocols, or change default settings. DDP profiles can also be rolled
+back without rebooting the system.
+
+The DDP package loads during device initialization. The driver looks for
+``intel/ice/ddp/ice.pkg`` in your firmware root (typically ``/lib/firmware/``
+or ``/lib/firmware/updates/``) and checks that it contains a valid DDP package
+file.
+
+NOTE: Your distribution should likely have provided the latest DDP file, but if
+ice.pkg is missing, you can find it in the linux-firmware repository or from
+intel.com.
+
+If the driver is unable to load the DDP package, the device will enter Safe
+Mode. Safe Mode disables advanced and performance features and supports only
+basic traffic and minimal functionality, such as updating the NVM or
+downloading a new driver or DDP package. Safe Mode only applies to the affected
+physical function and does not impact any other PFs. See the "Intel(R) Ethernet
+Adapters and Devices User Guide" for more details on DDP and Safe Mode.
+
+NOTES:
+
+- If you encounter issues with the DDP package file, you may need to download
+ an updated driver or DDP package file. See the log messages for more
+ information.
+
+- The ice.pkg file is a symbolic link to the default DDP package file.
+
+- You cannot update the DDP package if any PF drivers are already loaded. To
+ overwrite a package, unload all PFs and then reload the driver with the new
+ package.
+
+- Only the first loaded PF per device can download a package for that device.
+
+You can install specific DDP package files for different physical devices in
+the same system. To install a specific DDP package file:
+
+1. Download the DDP package file you want for your device.
+
+2. Rename the file ice-xxxxxxxxxxxxxxxx.pkg, where 'xxxxxxxxxxxxxxxx' is the
+ unique 64-bit PCI Express device serial number (in hex) of the device you
+ want the package downloaded on. The filename must include the complete
+ serial number (including leading zeros) and be all lowercase. For example,
+ if the 64-bit serial number is b887a3ffffca0568, then the file name would be
+ ice-b887a3ffffca0568.pkg.
+
+ To find the serial number from the PCI bus address, you can use the
+ following command::
+
+ # lspci -vv -s af:00.0 | grep -i Serial
+ Capabilities: [150 v1] Device Serial Number b8-87-a3-ff-ff-ca-05-68
+
+ You can use the following command to format the serial number without the
+ dashes::
+
+ # lspci -vv -s af:00.0 | grep -i Serial | awk '{print $7}' | sed s/-//g
+ b887a3ffffca0568
+
+3. Copy the renamed DDP package file to
+ ``/lib/firmware/updates/intel/ice/ddp/``. If the directory does not yet
+ exist, create it before copying the file.
+
+4. Unload all of the PFs on the device.
+
+5. Reload the driver with the new package.
+
+NOTE: The presence of a device-specific DDP package file overrides the loading
+of the default DDP package file (ice.pkg).
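The renaming rule from step 2 can be sketched in Python as an alternative to the `awk`/`sed` pipeline. The function name is hypothetical; the input is assumed to be the dash-separated serial number exactly as `lspci` prints it.

```python
import re


def ddp_package_name(lspci_serial: str) -> str:
    """Build the device-specific DDP file name from an lspci serial number."""
    digits = lspci_serial.replace("-", "").lower()
    # The name must contain the complete 64-bit serial: 16 hex digits,
    # leading zeros included, all lowercase.
    if not re.fullmatch(r"[0-9a-f]{16}", digits):
        raise ValueError(f"not a 64-bit serial number: {lspci_serial!r}")
    return f"ice-{digits}.pkg"


if __name__ == "__main__":
    print(ddp_package_name("b8-87-a3-ff-ff-ca-05-68"))
```

Validating the length guards against dropping leading zeros, which the text above says must be kept.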
+
+
+Intel(R) Ethernet Flow Director
+-------------------------------
+The Intel Ethernet Flow Director performs the following tasks:
+
+- Directs receive packets according to their flows to different queues
+- Enables tight control on routing a flow in the platform
+- Matches flows and CPU cores for flow affinity
+
+NOTE: This driver supports the following flow types:
+
+- IPv4
+- TCPv4
+- UDPv4
+- SCTPv4
+- IPv6
+- TCPv6
+- UDPv6
+- SCTPv6
+
+Each flow type supports valid combinations of IP addresses (source or
+destination) and UDP/TCP/SCTP ports (source and destination). You can supply
+only a source IP address, a source IP address and a destination port, or any
+combination of one or more of these four parameters.
+
+NOTE: This driver allows you to filter traffic based on a user-defined flexible
+two-byte pattern and offset by using the ethtool user-def and mask fields. Only
+L3 and L4 flow types are supported for user-defined flexible filters. For a
+given flow type, you must clear all Intel Ethernet Flow Director filters before
+changing the input set (for that flow type).
+
+
+Flow Director Filters
+---------------------
+Flow Director filters are used to direct traffic that matches specified
+characteristics. They are enabled through ethtool's ntuple interface. To enable
+or disable the Intel Ethernet Flow Director and these filters::
+
+ # ethtool -K <ethX> ntuple <off|on>
+
+NOTE: When you disable ntuple filters, all the user programmed filters are
+flushed from the driver cache and hardware. All needed filters must be re-added
+when ntuple is re-enabled.
+
+To display all of the active filters::
+
+ # ethtool -u <ethX>
+
+To add a new filter::
+
+ # ethtool -U <ethX> flow-type <type> src-ip <ip> [m <ip_mask>] dst-ip <ip>
+ [m <ip_mask>] src-port <port> [m <port_mask>] dst-port <port> [m <port_mask>]
+ action <queue>
+
+ Where:
+ <ethX> - the Ethernet device to program
+ <type> - can be ip4, tcp4, udp4, sctp4, ip6, tcp6, udp6, sctp6
+ <ip> - the IP address to match on
+ <ip_mask> - the IPv4 address to mask on
+ NOTE: These filters use inverted masks.
+ <port> - the port number to match on
+ <port_mask> - the 16-bit integer for masking
+ NOTE: These filters use inverted masks.
+ <queue> - the queue to direct traffic toward (-1 discards the
+ matched traffic)
+
+To delete a filter::
+
+ # ethtool -U <ethX> delete <N>
+
+ Where <N> is the filter ID displayed when printing all the active filters,
+ and may also have been specified using "loc <N>" when adding the filter.
+
+EXAMPLES:
+
+To add a filter that directs packets to queue 2::
+
+ # ethtool -U <ethX> flow-type tcp4 src-ip 192.168.10.1 dst-ip \
+ 192.168.10.2 src-port 2000 dst-port 2001 action 2 [loc 1]
+
+To set a filter using only the source and destination IP address::
+
+ # ethtool -U <ethX> flow-type tcp4 src-ip 192.168.10.1 dst-ip \
+ 192.168.10.2 action 2 [loc 1]
+
+To set a filter based on a user-defined pattern and offset::
+
+ # ethtool -U <ethX> flow-type tcp4 src-ip 192.168.10.1 dst-ip \
+ 192.168.10.2 user-def 0x4FFFF action 2 [loc 1]
+
+ where the value of the user-def field contains the offset (4 bytes) and
+ the pattern (0xffff).
+
+To match TCP traffic sent from 192.168.0.1, port 5300, directed to 192.168.0.5,
+port 80, and then send it to queue 7::
+
+ # ethtool -U enp130s0 flow-type tcp4 src-ip 192.168.0.1 dst-ip 192.168.0.5 \
+ src-port 5300 dst-port 80 action 7
+
+To add a TCPv4 filter with a partial mask for a source IP subnet::
+
+ # ethtool -U <ethX> flow-type tcp4 src-ip 192.168.0.0 m 0.255.255.255 dst-ip \
+ 192.168.5.12 src-port 12600 dst-port 31 action 12
+
+NOTES:
+
+For each flow-type, the programmed filters must all have the same matching
+input set. For example, issuing the following two commands is acceptable::
+
+ # ethtool -U enp130s0 flow-type ip4 src-ip 192.168.0.1 src-port 5300 action 7
+ # ethtool -U enp130s0 flow-type ip4 src-ip 192.168.0.5 src-port 55 action 10
+
+Issuing the next two commands, however, is not acceptable, since the first
+specifies src-ip and the second specifies dst-ip::
+
+ # ethtool -U enp130s0 flow-type ip4 src-ip 192.168.0.1 src-port 5300 action 7
+ # ethtool -U enp130s0 flow-type ip4 dst-ip 192.168.0.5 src-port 55 action 10
+
+The second command will fail with an error. You may program multiple filters
+with the same fields, using different values, but, on one device, you may not
+program two tcp4 filters with different matching fields.
+
+The ice driver does not support matching on a subportion of a field, thus
+partial mask fields are not supported.
+
+
+Flex Byte Flow Director Filters
+-------------------------------
+The driver also supports matching user-defined data within the packet payload.
+This flexible data is specified using the "user-def" field of the ethtool
+command in the following way:
+
+.. table::
+
+ ============================== ============================
+ ``31 28 24 20 16`` ``15 12 8 4 0``
+ ``offset into packet payload`` ``2 bytes of flexible data``
+ ============================== ============================
+
+For example,
+
+::
+
+ ... user-def 0x4FFFF ...
+
+tells the filter to look 4 bytes into the payload and match that value against
+0xFFFF. The offset is based on the beginning of the payload, and not the
+beginning of the packet. Thus
+
+::
+
+ flow-type tcp4 ... user-def 0x8BEAF ...
+
+would match TCP/IPv4 packets which have the value 0xBEAF 8 bytes into the
+TCP/IPv4 payload.
+
+Note that ICMP headers are parsed as 4 bytes of header and 4 bytes of payload.
+Thus to match the first byte of the payload, you must actually add 4 bytes to
+the offset. Also note that ip4 filters match both ICMP frames and raw
+(unknown) ip4 frames, where the payload will be the L3 payload of the IP4
+frame.
+
+The maximum offset is 64. The hardware will only read up to 64 bytes of data
+from the payload. The offset must be even because the flexible data is 2 bytes
+long and must be aligned to byte 0 of the packet payload.
+
+The user-defined flexible offset is also considered part of the input set and
+cannot be programmed separately for multiple filters of the same type. However,
+the flexible data is not part of the input set and multiple filters may use the
+same offset but match against different data.
+
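As a rough illustration of the user-def layout described in this section, the sketch below packs an offset and a 2-byte pattern into a single value and unpacks it again. The function names are hypothetical. The checks follow the constraints stated above (even offset, at most 64); treating 64 itself as a valid offset is an assumption.

```python
MAX_OFFSET = 64  # upper bound quoted in the text; inclusive bound is an assumption


def encode_user_def(offset: int, pattern: int) -> int:
    """Pack a payload offset and 2 bytes of flexible data into a user-def value."""
    if not 0 <= offset <= MAX_OFFSET:
        raise ValueError("offset must be between 0 and 64")
    if offset % 2:
        raise ValueError("offset must be even")
    if not 0 <= pattern <= 0xFFFF:
        raise ValueError("pattern must fit in 2 bytes")
    # Bits 31..16 carry the offset, bits 15..0 carry the flexible data.
    return (offset << 16) | pattern


def decode_user_def(value: int) -> tuple[int, int]:
    """Split a user-def value back into (offset, pattern)."""
    return value >> 16, value & 0xFFFF


print(hex(encode_user_def(4, 0xFFFF)))  # 0x4ffff
print(hex(encode_user_def(8, 0xBEAF)))  # 0x8beaf
```

Both printed values correspond to the ``user-def`` examples given earlier in this section.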
+
+RSS Hash Flow
+-------------
+Allows you to set the hash bytes per flow type and any combination of one or
+more options for Receive Side Scaling (RSS) hash byte configuration.
+
+::
+
+ # ethtool -N <ethX> rx-flow-hash <type> <option>
+
+ Where <type> is:
+ tcp4 signifying TCP over IPv4
+ udp4 signifying UDP over IPv4
+ tcp6 signifying TCP over IPv6
+ udp6 signifying UDP over IPv6
+ And <option> is one or more of:
+ s Hash on the IP source address of the Rx packet.
+ d Hash on the IP destination address of the Rx packet.
+ f Hash on bytes 0 and 1 of the Layer 4 header of the Rx packet.
+ n Hash on bytes 2 and 3 of the Layer 4 header of the Rx packet.
+
+
+Accelerated Receive Flow Steering (aRFS)
+----------------------------------------
+Devices based on the Intel(R) Ethernet Controller 800 Series support
+Accelerated Receive Flow Steering (aRFS) on the PF. aRFS is a load-balancing
+mechanism that allows you to direct packets to the same CPU where an
+application is running or consuming the packets in that flow.
+
+NOTES:
+
+- aRFS requires that ntuple filtering is enabled via ethtool.
+- aRFS support is limited to the following packet types:
+
+ - TCP over IPv4 and IPv6
+ - UDP over IPv4 and IPv6
+ - Nonfragmented packets
+
+- aRFS only supports Flow Director filters, which consist of the
+ source/destination IP addresses and source/destination ports.
+- aRFS and ethtool's ntuple interface both use the device's Flow Director. aRFS
+ and ntuple features can coexist, but you may encounter unexpected results if
+ there's a conflict between aRFS and ntuple requests. See "Intel(R) Ethernet
+ Flow Director" for additional information.
+
+To set up aRFS:
+
+1. Enable the Intel Ethernet Flow Director and ntuple filters using ethtool.
+
+::
+
+ # ethtool -K <ethX> ntuple on
+
+2. Set up the number of entries in the global flow table. For example:
+
+::
+
+ # NUM_RPS_ENTRIES=16384
+ # echo $NUM_RPS_ENTRIES > /proc/sys/net/core/rps_sock_flow_entries
+
+3. Set up the number of entries in the per-queue flow table. For example:
+
+::
+
+ # NUM_RX_QUEUES=64
+ # for file in /sys/class/net/$IFACE/queues/rx-*/rps_flow_cnt; do
+ # echo $(($NUM_RPS_ENTRIES/$NUM_RX_QUEUES)) > $file;
+ # done
+
+4. Disable the IRQ balance daemon (this is only a temporary stop of the service
+ until the next reboot).
+
+::
+
+ # systemctl stop irqbalance
+
+5. Configure the interrupt affinity.
+
+ See ``/Documentation/core-api/irq/irq-affinity.rst``
+
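The arithmetic in step 3 divides the global flow table evenly across the Rx queues. A minimal sketch of that calculation, using the example values from steps 2 and 3 (the function name is hypothetical):

```python
def per_queue_flow_entries(total_entries: int, rx_queues: int) -> int:
    """Return the rps_flow_cnt value to write for each Rx queue."""
    if rx_queues <= 0:
        raise ValueError("rx_queues must be positive")
    # Integer division, matching the shell's $((a / b)) for positive values.
    return total_entries // rx_queues


print(per_queue_flow_entries(16384, 64))  # 256
```

With 16384 global entries and 64 Rx queues, each ``rps_flow_cnt`` file receives 256.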
+
+To disable aRFS using ethtool::
+
+ # ethtool -K <ethX> ntuple off
+
+NOTE: This command will disable ntuple filters and clear any aRFS filters in
+software and hardware.
+
+Example Use Case:
+
+1. Set the server application on the desired CPU (e.g., CPU 4).
+
+::
+
+ # taskset -c 4 netserver
+
+2. Use netperf to route traffic from the client to CPU 4 on the server with
+ aRFS configured. This example uses TCP over IPv4.
+
+::
+
+ # netperf -H <Host IPv4 Address> -t TCP_STREAM
+
+
+Enabling Virtual Functions (VFs)
+--------------------------------
+Use sysfs to enable virtual functions (VF).
+
+For example, you can create 4 VFs as follows::
+
+ # echo 4 > /sys/class/net/<ethX>/device/sriov_numvfs
+
+To disable VFs, write 0 to the same file::
+
+ # echo 0 > /sys/class/net/<ethX>/device/sriov_numvfs
+
+The maximum number of VFs for the ice driver is 256 total (all ports). To check
+how many VFs each PF supports, use the following command::
+
+ # cat /sys/class/net/<ethX>/device/sriov_totalvfs
+
+Note: You cannot use SR-IOV when link aggregation (LAG)/bonding is active, and
+vice versa. To enforce this, the driver checks for this mutual exclusion.
+
+
+Displaying VF Statistics on the PF
+----------------------------------
+Use the following command to display the statistics for the PF and its VFs::
+
+ # ip -s link show dev <ethX>
+
+NOTE: The output of this command can be very large due to the maximum number of
+possible VFs.
+
+The PF driver will display a subset of the statistics for the PF and for all
+VFs that are configured. The PF will always print a statistics block for each
+of the possible VFs, and it will show zero for all unconfigured VFs.
+
+
+Configuring VLAN Tagging on SR-IOV Enabled Adapter Ports
+--------------------------------------------------------
+To configure VLAN tagging for the ports on an SR-IOV enabled adapter, use the
+following command. The VLAN configuration should be done before the VF driver
+is loaded or the VM is booted. The VF is not aware of the VLAN tag being
+inserted on transmit and removed on received frames (sometimes called "port
+VLAN" mode).
+
+::
+
+ # ip link set dev <ethX> vf <id> vlan <vlan id>
+
+For example, the following will configure PF eth0 and the first VF on VLAN 10::
+
+ # ip link set dev eth0 vf 0 vlan 10
+
+
+Enabling a VF link if the port is disconnected
+----------------------------------------------
+If the physical function (PF) link is down, you can force link up (from the
+host PF) on any virtual functions (VF) bound to the PF.
+
+For example, to force link up on VF 0 bound to PF eth0::
+
+ # ip link set eth0 vf 0 state enable
+
+Note: If the command does not work, it may not be supported by your system.
+
+
+Setting the MAC Address for a VF
+--------------------------------
+To change the MAC address for the specified VF::
+
+ # ip link set <ethX> vf 0 mac <address>
+
+For example::
+
+ # ip link set <ethX> vf 0 mac 00:01:02:03:04:05
+
+This setting lasts until the PF is reloaded.
+
+NOTE: Assigning a MAC address for a VF from the host will disable any
+subsequent requests to change the MAC address from within the VM. This is a
+security feature. The VM is not aware of this restriction, so if this is
+attempted in the VM, it will trigger MDD events.
+
+
+Trusted VFs and VF Promiscuous Mode
+-----------------------------------
+This feature allows you to designate a particular VF as trusted and allows that
+trusted VF to request selective promiscuous mode on the Physical Function (PF).
+
+To set a VF as trusted or untrusted, enter the following command in the
+Hypervisor::
+
+ # ip link set dev <ethX> vf 1 trust [on|off]
+
+NOTE: It's important to set the VF to trusted before setting promiscuous mode.
+If the VF is not trusted, the PF will ignore promiscuous mode requests from the
+VF. If the VF becomes trusted after the VF driver is loaded, you must make a
+new request to set the VF to promiscuous.
+
+Once the VF is designated as trusted, use the following commands in the VM to
+set the VF to promiscuous mode.
+
+For promiscuous all::
+
+ # ip link set <ethX> promisc on
+ Where <ethX> is a VF interface in the VM
+
+For promiscuous Multicast::
+
+ # ip link set <ethX> allmulticast on
+ Where <ethX> is a VF interface in the VM
+
+NOTE: By default, the ethtool private flag vf-true-promisc-support is set to
+"off," meaning that promiscuous mode for the VF will be limited. To set the
+promiscuous mode for the VF to true promiscuous and allow the VF to see all
+ingress traffic, use the following command::
+
+ # ethtool --set-priv-flags <ethX> vf-true-promisc-support on
+
+The vf-true-promisc-support private flag does not enable promiscuous mode;
+rather, it designates which type of promiscuous mode (limited or true) you will
+get when you enable promiscuous mode using the ip link commands above. Note
+that this is a global setting that affects the entire device. However, the
+vf-true-promisc-support private flag is only exposed to the first PF of the
+device. The PF remains in limited promiscuous mode regardless of the
+vf-true-promisc-support setting.
+
+Next, add a VLAN interface on the VF interface. For example::
+
+ # ip link add link eth2 name eth2.100 type vlan id 100
+
+Note that the order in which you set the VF to promiscuous mode and add the
+VLAN interface does not matter (you can do either first). The result in this
+example is that the VF will get all traffic that is tagged with VLAN 100.
+
+
+Malicious Driver Detection (MDD) for VFs
+----------------------------------------
+Some Intel Ethernet devices use Malicious Driver Detection (MDD) to detect
+malicious traffic from the VF and disable Tx/Rx queues or drop the offending
+packet until a VF driver reset occurs. You can view MDD messages in the PF's
+system log using the dmesg command.
+
+- If the PF driver logs MDD events from the VF, confirm that the correct VF
+ driver is installed.
+- To restore functionality, you can manually reload the VF or VM or enable
+ automatic VF resets.
+- When automatic VF resets are enabled, the PF driver will immediately reset
+ the VF and reenable queues when it detects MDD events on the receive path.
+- If automatic VF resets are disabled, the PF will not automatically reset the
+ VF when it detects MDD events.
+
+To enable or disable automatic VF resets, use the following command::
+
+ # ethtool --set-priv-flags <ethX> mdd-auto-reset-vf on|off
+
+
+MAC and VLAN Anti-Spoofing Feature for VFs
+------------------------------------------
+When a malicious driver on a Virtual Function (VF) interface attempts to send a
+spoofed packet, it is dropped by the hardware and not transmitted.
+
+NOTE: This feature can be disabled for a specific VF::
+
+ # ip link set <ethX> vf <vf id> spoofchk {off|on}
+
+
+Jumbo Frames
+------------
+Jumbo Frames support is enabled by changing the Maximum Transmission Unit (MTU)
+to a value larger than the default value of 1500.
+
+Use the ifconfig command to increase the MTU size. For example, enter the
+following where <ethX> is the interface number::
+
+ # ifconfig <ethX> mtu 9000 up
+
+Alternatively, you can use the ip command as follows::
+
+ # ip link set mtu 9000 dev <ethX>
+ # ip link set up dev <ethX>
+
+This setting is not saved across reboots.
+
+
+NOTE: The maximum MTU setting for jumbo frames is 9702. This corresponds to the
+maximum jumbo frame size of 9728 bytes.
+
+NOTE: This driver will attempt to use multiple page sized buffers to receive
+each jumbo packet. This should help to avoid buffer starvation issues when
+allocating receive packets.
+
+NOTE: Packet loss may have a greater impact on throughput when you use jumbo
+frames. If you observe a drop in performance after enabling jumbo frames,
+enabling flow control may mitigate the issue.
+
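The MTU and frame-size figures in the jumbo frame note above differ by 26 bytes. The sketch below shows one breakdown that is consistent with that difference: an Ethernet header, a frame check sequence, and room for two VLAN tags. This breakdown is an assumption made for illustration; the document itself only states the two totals.

```python
ETH_HLEN = 14     # destination MAC + source MAC + EtherType
ETH_FCS_LEN = 4   # frame check sequence
VLAN_HLEN = 4     # one VLAN tag

# Assumed per-frame overhead: header, FCS, and two stacked VLAN tags.
FRAME_OVERHEAD = ETH_HLEN + ETH_FCS_LEN + 2 * VLAN_HLEN


def frame_size(mtu: int) -> int:
    """Return the on-wire frame size implied by an MTU under the assumption above."""
    return mtu + FRAME_OVERHEAD


print(frame_size(9702))  # 9728
```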
+
+Speed and Duplex Configuration
+------------------------------
+In addressing speed and duplex configuration issues, you need to distinguish
+between copper-based adapters and fiber-based adapters.
+
+In the default mode, an Intel(R) Ethernet Network Adapter using copper
+connections will attempt to auto-negotiate with its link partner to determine
+the best setting. If the adapter cannot establish link with the link partner
+using auto-negotiation, you may need to manually configure the adapter and link
+partner to identical settings to establish link and pass packets. This should
+only be needed when attempting to link with an older switch that does not
+support auto-negotiation or one that has been forced to a specific speed or
+duplex mode. Your link partner must match the setting you choose. 1 Gbps speeds
+and higher cannot be forced. Use the autonegotiation advertising setting to
+manually set devices for 1 Gbps and higher.
+
+Speed, duplex, and autonegotiation advertising are configured through the
+ethtool utility. For the latest version, download and install ethtool from the
+following website:
+
+ https://kernel.org/pub/software/network/ethtool/
+
+To see the speed configurations your device supports, run the following::
+
+ # ethtool <ethX>
+
+Caution: Only experienced network administrators should force speed and duplex
+or change autonegotiation advertising manually. The settings at the switch must
+always match the adapter settings. Adapter performance may suffer or your
+adapter may not operate if you configure the adapter differently from your
+switch.
+
+
+Data Center Bridging (DCB)
+--------------------------
+NOTE: The kernel assumes that TC0 is available, and will disable Priority Flow
+Control (PFC) on the device if TC0 is not available. To fix this, ensure TC0 is
+enabled when setting up DCB on your switch.
+
+DCB is a configurable Quality of Service implementation in hardware. It uses
+the VLAN priority tag (802.1p) to filter traffic. That means that there are 8
+different priorities that traffic can be filtered into. It also enables
+priority flow control (802.1Qbb), which can reduce or eliminate dropped
+packets during network stress. Bandwidth can be allocated to each of
+these priorities, which is enforced at the hardware level (802.1Qaz).
+
+DCB is normally configured on the network using the DCBX protocol (802.1Qaz), a
+specialization of LLDP (802.1AB). The ice driver supports the following
+mutually exclusive variants of DCBX support:
+
+1) Firmware-based LLDP Agent
+2) Software-based LLDP Agent
+
+In firmware-based mode, firmware intercepts all LLDP traffic and handles DCBX
+negotiation transparently for the user. In this mode, the adapter operates in
+"willing" DCBX mode, receiving DCB settings from the link partner (typically a
+switch). The local user can only query the negotiated DCB configuration. For
+information on configuring DCBX parameters on a switch, please consult the
+switch manufacturer's documentation.
+
+In software-based mode, LLDP traffic is forwarded to the network stack and user
+space, where a software agent can handle it. In this mode, the adapter can
+operate in either "willing" or "nonwilling" DCBX mode and DCB configuration can
+be both queried and set locally. This mode requires the FW-based LLDP Agent to
+be disabled.
+
+NOTE:
+
+- You can enable and disable the firmware-based LLDP Agent using an ethtool
+ private flag. Refer to the "FW-LLDP (Firmware Link Layer Discovery Protocol)"
+ section in this README for more information.
+- In software-based DCBX mode, you can configure DCB parameters using software
+ LLDP/DCBX agents that interface with the Linux kernel's DCB Netlink API. We
+ recommend using OpenLLDP as the DCBX agent when running in software mode. For
+ more information, see the OpenLLDP man pages and
+ https://github.com/intel/openlldp.
+- The driver implements the DCB netlink interface layer to allow the user space
+ to communicate with the driver and query DCB configuration for the port.
+- iSCSI with DCB is not supported.
+
+
+FW-LLDP (Firmware Link Layer Discovery Protocol)
+------------------------------------------------
+Use ethtool to change FW-LLDP settings. The FW-LLDP setting is per port and
+persists across boots.
+
+To enable LLDP::
+
+ # ethtool --set-priv-flags <ethX> fw-lldp-agent on
+
+To disable LLDP::
+
+ # ethtool --set-priv-flags <ethX> fw-lldp-agent off
+
+To check the current LLDP setting::
+
+ # ethtool --show-priv-flags <ethX>
+
+NOTE: You must enable the UEFI HII "LLDP Agent" attribute for this setting to
+take effect. If "LLDP Agent" is set to disabled, you cannot enable it from the
+OS.
+
+
+Flow Control
+------------
+Ethernet Flow Control (IEEE 802.3x) can be configured with ethtool to enable
+receiving and transmitting pause frames for ice. When transmit is enabled,
+pause frames are generated when the receive packet buffer crosses a predefined
+threshold. When receive is enabled, the transmit unit will halt for the time
+delay specified when a pause frame is received.
+
+NOTE: You must have a flow control capable link partner.
+
+Flow Control is disabled by default.
+
+Use ethtool to change the flow control settings.
+
+To enable or disable Rx or Tx Flow Control::
+
+ # ethtool -A <ethX> rx <on|off> tx <on|off>
+
+Note: This command only enables or disables Flow Control if auto-negotiation is
+disabled. If auto-negotiation is enabled, this command changes the parameters
+used for auto-negotiation with the link partner.
+
+Note: Flow Control auto-negotiation is part of link auto-negotiation. Depending
+on your device, you may not be able to change the auto-negotiation setting.
+
+NOTE:
+
+- The ice driver requires flow control on both the port and link partner. If
+ flow control is disabled on one of the sides, the port may appear to hang on
+ heavy traffic.
+- You may encounter issues with link-level flow control (LFC) after disabling
+ DCB. The LFC status may show as enabled but traffic is not paused. To resolve
+ this issue, disable and reenable LFC using ethtool::
+
+ # ethtool -A <ethX> rx off tx off
+ # ethtool -A <ethX> rx on tx on
+
+
+NAPI
+----
+This driver supports NAPI (Rx polling mode).
+For more information on NAPI, see
+https://www.linuxfoundation.org/collaborate/workgroups/networking/napi
+
+
+MACVLAN
+-------
+This driver supports MACVLAN. Kernel support for MACVLAN can be tested by
+checking if the MACVLAN driver is loaded. You can run 'lsmod | grep macvlan' to
+see if the MACVLAN driver is loaded or run 'modprobe macvlan' to try to load
+the MACVLAN driver.
+
+NOTE:
+
+- In passthru mode, you can only set up one MACVLAN device. It will inherit the
+ MAC address of the underlying PF (Physical Function) device.
+
+
+IEEE 802.1ad (QinQ) Support
+---------------------------
+The IEEE 802.1ad standard, informally known as QinQ, allows for multiple VLAN
+IDs within a single Ethernet frame. VLAN IDs are sometimes referred to as
+"tags," and multiple VLAN IDs are thus referred to as a "tag stack." Tag stacks
+allow L2 tunneling and the ability to segregate traffic within a particular
+VLAN ID, among other uses.
+
+NOTES:
+
+- Receive checksum offloads and VLAN acceleration are not supported for 802.1ad
+ (QinQ) packets.
+
+- 0x88A8 traffic will not be received unless VLAN stripping is disabled with
+ the following command::
+
+ # ethtool -K <ethX> rxvlan off
+
+- 0x88A8/0x8100 double VLANs cannot be used with 0x8100 or 0x8100/0x8100 VLANs
+ configured on the same port. 0x88A8/0x8100 traffic will not be received if
+ 0x8100 VLANs are configured.
+
+- The VF can only transmit 0x88A8/0x8100 (i.e., 802.1ad/802.1Q) traffic if:
+
+ 1) The VF is not assigned a port VLAN.
+ 2) spoofchk is disabled from the PF. If you enable spoofchk, the VF will
+ not transmit 0x88A8/0x8100 traffic.
+
+- The VF may not receive all network traffic based on the Inner VLAN header
+ when VF true promiscuous mode (vf-true-promisc-support) and double VLANs are
+ enabled in SR-IOV mode.
+
+The following are examples of how to configure 802.1ad (QinQ)::
+
+ # ip link add link eth0 eth0.24 type vlan proto 802.1ad id 24
+ # ip link add link eth0.24 eth0.24.371 type vlan proto 802.1Q id 371
+
+ Where "24" and "371" are example VLAN IDs.
+
+
+Tunnel/Overlay Stateless Offloads
+---------------------------------
+Supported tunnels and overlays include VXLAN, GENEVE, and others depending on
+hardware and software configuration. Stateless offloads are enabled by default.
+
+To view the current state of all offloads::
+
+ # ethtool -k <ethX>
+
+
+UDP Segmentation Offload
+------------------------
+Allows the adapter to offload transmit segmentation of UDP packets with
+payloads up to 64K into valid Ethernet frames. Because the adapter hardware is
+able to complete data segmentation much faster than operating system software,
+this feature may improve transmission performance.
+In addition, the adapter may use fewer CPU resources.
+
+NOTE:
+
+- The application sending UDP packets must support UDP segmentation offload.
+
+To enable/disable UDP Segmentation Offload, issue the following command::
+
+ # ethtool -K <ethX> tx-udp-segmentation [off|on]
+
+GNSS module
+-----------
+Allows the user to read messages from the GNSS module and write supported
+commands. If the module is physically present, the driver creates 2 TTYs for
+each supported device in /dev: ttyGNSS_<device>:<function>_0 and _1. The first
+one (_0) is read-write and the second one (_1) is read-only.
+The protocol of write commands is dependent on the GNSS module as the driver
+writes raw bytes from the TTY to the GNSS i2c. Please refer to the module
+documentation for details.
+
+Performance Optimization
+========================
+Driver defaults are meant to fit a wide variety of workloads, but if further
+optimization is required, we recommend experimenting with the following
+settings.
+
+
+Rx Descriptor Ring Size
+-----------------------
+To reduce the number of Rx packet discards, increase the number of Rx
+descriptors for each Rx ring using ethtool.
+
+ Check if the interface is dropping Rx packets due to buffers being full
+ (rx_dropped.nic can mean that there is no PCIe bandwidth)::
+
+ # ethtool -S <ethX> | grep "rx_dropped"
+
+ If the previous command shows drops on queues, it may help to increase
+ the number of descriptors using 'ethtool -G'::
+
+ # ethtool -G <ethX> rx <N>
+ Where <N> is the desired number of ring entries/descriptors
+
+ This can provide temporary buffering for issues that create latency while
+ the CPUs process descriptors.
+
+
+Interrupt Rate Limiting
+-----------------------
+This driver supports an adaptive interrupt throttle rate (ITR) mechanism that
+is tuned for general workloads. The user can customize the interrupt rate
+control for specific workloads, via ethtool, adjusting the number of
+microseconds between interrupts.
+
+To set the interrupt rate manually, you must disable adaptive mode::
+
+ # ethtool -C <ethX> adaptive-rx off adaptive-tx off
+
+For lower CPU utilization:
+
+ Disable adaptive ITR and lower Rx and Tx interrupts. The examples below
+ affect every queue of the specified interface.
+
+ Setting rx-usecs and tx-usecs to 80 will limit interrupts to about
+ 12,500 interrupts per second per queue::
+
+ # ethtool -C <ethX> adaptive-rx off adaptive-tx off rx-usecs 80 tx-usecs 80
+
+For reduced latency:
+
+ Disable adaptive ITR and ITR by setting rx-usecs and tx-usecs to 0
+ using ethtool::
+
+ # ethtool -C <ethX> adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0
+
+Per-queue interrupt rate settings:
+
+ The following examples are for queues 1 and 3, but you can adjust other
+ queues.
+
+ To disable Rx adaptive ITR and set static Rx ITR to 10 microseconds or
+ about 100,000 interrupts/second, for queues 1 and 3::
+
+ # ethtool --per-queue <ethX> queue_mask 0xa --coalesce adaptive-rx off \
+ rx-usecs 10
+
+ To show the current coalesce settings for queues 1 and 3::
+
+ # ethtool --per-queue <ethX> queue_mask 0xa --show-coalesce
+
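The ``queue_mask`` value in the per-queue examples is a bitmask with one bit per queue index, so queues 1 and 3 give 0xa. A small sketch of how such a mask can be built (the helper name is hypothetical):

```python
def queue_mask(queues: list[int]) -> int:
    """Return a bitmask with bit N set for every queue index N in `queues`."""
    mask = 0
    for queue in queues:
        if queue < 0:
            raise ValueError("queue indexes must be non-negative")
        mask |= 1 << queue
    return mask


print(hex(queue_mask([1, 3])))  # 0xa
```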
+Bounding interrupt rates using rx-usecs-high:
+
+ :Valid Range: 0-236 (0=no limit)
+
+ The range of 0-236 microseconds provides an effective range of 4,237 to
+ 250,000 interrupts per second. The value of rx-usecs-high can be set
+ independently of rx-usecs and tx-usecs in the same ethtool command, and is
+ also independent of the adaptive interrupt moderation algorithm. The
+ underlying hardware supports granularity in 4-microsecond intervals, so
+ adjacent values may result in the same interrupt rate.
+
+ The following command would disable adaptive interrupt moderation, and allow
+ a maximum of 5 microseconds before indicating a receive or transmit was
+ complete. However, instead of resulting in as many as 200,000 interrupts per
+ second, it limits total interrupts per second to 50,000 via the rx-usecs-high
+ parameter.
+
+ ::
+
+ # ethtool -C <ethX> adaptive-rx off adaptive-tx off rx-usecs-high 20 \
+ rx-usecs 5 tx-usecs 5
+
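The interrupt rates quoted in this section all follow from one relationship: a delay of N microseconds between interrupts allows at most 1,000,000 / N interrupts per second. The sketch below reproduces the figures given above. The function name is hypothetical, and the calculation ignores the 4-microsecond hardware granularity.

```python
def interrupts_per_second(usecs: int) -> float:
    """Upper bound on interrupts per second for a given interrupt delay."""
    if usecs <= 0:
        raise ValueError("a delay of 0 means no limit; pass a positive value")
    return 1_000_000 / usecs


print(interrupts_per_second(80))   # 12500.0  (rx-usecs 80)
print(interrupts_per_second(10))   # 100000.0 (rx-usecs 10)
print(interrupts_per_second(5))    # 200000.0 (rx-usecs 5)
print(interrupts_per_second(20))   # 50000.0  (rx-usecs-high 20)
```

In the last ethtool example, the 5-microsecond delay alone would permit 200,000 interrupts per second, so the 50,000 ceiling from rx-usecs-high 20 is the effective limit.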
+
+Virtualized Environments
+------------------------
+In addition to the other suggestions in this section, the following may be
+helpful to optimize performance in VMs.
+
+ Using the appropriate mechanism (vcpupin) in the VM, pin the CPUs to
+ individual LCPUs, making sure to use a set of CPUs included in the
+ device's local_cpulist: ``/sys/class/net/<ethX>/device/local_cpulist``.
+
+ Configure as many Rx/Tx queues in the VM as available. (See the iavf driver
+ documentation for the number of queues supported.) For example::
+
+ # ethtool -L <virt_interface> rx <max> tx <max>
+
+
+Support
+=======
+For general information, go to the Intel support website at:
+https://www.intel.com/support/
+
+or the Intel Wired Networking project hosted by Sourceforge at:
+https://sourceforge.net/projects/e1000
+
+If an issue is identified with the released source code on a supported kernel
+with a supported adapter, email the specific information related to the issue
+to e1000-devel@lists.sf.net.
+
+
+Trademarks
+==========
+Intel is a trademark or registered trademark of Intel Corporation or its
+subsidiaries in the United States and/or other countries.
+
+* Other names and brands may be claimed as the property of others.
diff --git a/Documentation/networking/device_drivers/intel/igb.rst b/Documentation/networking/device_drivers/ethernet/intel/igb.rst
index 87e560fe5eaa..d46289e182cf 100644
--- a/Documentation/networking/device_drivers/intel/igb.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/igb.rst
@@ -20,7 +20,7 @@ Identifying Your Adapter
========================
For information on how to identify your adapter, and for the latest Intel
network drivers, refer to the Intel Support website:
-http://www.intel.com/support
+https://www.intel.com/support
Command Line Parameters
diff --git a/Documentation/networking/device_drivers/intel/igbvf.rst b/Documentation/networking/device_drivers/ethernet/intel/igbvf.rst
index 557fc020ef31..40fa210c5e14 100644
--- a/Documentation/networking/device_drivers/intel/igbvf.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/igbvf.rst
@@ -35,7 +35,7 @@ Identifying Your Adapter
========================
For information on how to identify your adapter, and for the latest Intel
network drivers, refer to the Intel Support website:
-http://www.intel.com/support
+https://www.intel.com/support
Additional Features and Configurations
diff --git a/Documentation/networking/device_drivers/intel/ixgb.rst b/Documentation/networking/device_drivers/ethernet/intel/ixgb.rst
index 945018207a92..c6a233e68ad6 100644
--- a/Documentation/networking/device_drivers/intel/ixgb.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/ixgb.rst
@@ -37,7 +37,7 @@ The following features are available in this kernel:
- SNMP
Channel Bonding documentation can be found in the Linux kernel source:
-/Documentation/networking/bonding.txt
+/Documentation/networking/bonding.rst
The driver information previously displayed in the /proc filesystem is not
supported in this release. Alternatively, you can use ethtool (version 1.6
@@ -203,7 +203,7 @@ With the 10 Gigabit server adapters, the default Linux configuration will
very likely limit the total available throughput artificially. There is a set
of configuration changes that, when applied together, will increase the ability
of Linux to transmit and receive data. The following enhancements were
-originally acquired from settings published at http://www.spec.org/web99/ for
+originally acquired from settings published at https://www.spec.org/web99/ for
various submitted results using Linux.
NOTE:
diff --git a/Documentation/networking/device_drivers/intel/ixgbe.rst b/Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst
index f1d5233e5e51..0a233b17c664 100644
--- a/Documentation/networking/device_drivers/intel/ixgbe.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst
@@ -440,6 +440,22 @@ NOTE: For 82599-based network connections, if you are enabling jumbo frames in
a virtual function (VF), jumbo frames must first be enabled in the physical
function (PF). The VF MTU setting cannot be larger than the PF MTU.
+NBASE-T Support
+---------------
+The ixgbe driver supports NBASE-T on some devices. However, the advertisement
+of NBASE-T speeds is suppressed by default, to accommodate broken network
+switches which cannot cope with advertised NBASE-T speeds. Use the ethtool
+command to enable advertising NBASE-T speeds on devices which support it::
+
+ ethtool -s eth? advertise 0x1800000001028
+
+On Linux systems with INTERFACES(5), this can be specified as a pre-up command
+in /etc/network/interfaces so that the interface is always brought up with
+NBASE-T support, e.g.::
+
+ iface eth? inet dhcp
+ pre-up ethtool -s eth? advertise 0x1800000001028 || true
+
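The advertise value above is a bitmask of ethtool link modes. As a minimal sketch, assuming the bit positions listed in the ethtool(8) man page (100baseT/Full = bit 3, 1000baseT/Full = bit 5, 10000baseT/Full = bit 12, 2500baseT/Full = bit 47, 5000baseT/Full = bit 48), the mask can be rebuilt with shell arithmetic. The function name nbase_t_mask is illustrative only.

```shell
#!/bin/sh
# Rebuild the NBASE-T advertise mask from individual link-mode bits.
# Bit positions are an assumption taken from the ethtool(8) "advertise" table.
nbase_t_mask() {
    mask=$(( (1 << 3) | (1 << 5) | (1 << 12) | (1 << 47) | (1 << 48) ))
    printf '0x%x\n' "$mask"
}

nbase_t_mask
```

Dropping a bit from the expression removes that speed from the advertised set, which may help when isolating a switch that misbehaves at one particular speed.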
Generic Receive Offload, aka GRO
--------------------------------
The driver supports the in-kernel software implementation of GRO. GRO has
diff --git a/Documentation/networking/device_drivers/intel/ixgbevf.rst b/Documentation/networking/device_drivers/ethernet/intel/ixgbevf.rst
index 76bbde736f21..76bbde736f21 100644
--- a/Documentation/networking/device_drivers/intel/ixgbevf.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/ixgbevf.rst
diff --git a/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst b/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst
new file mode 100644
index 000000000000..bc562c49011b
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst
@@ -0,0 +1,35 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+====================================================================
+Linux kernel networking driver for Marvell's Octeon PCI Endpoint NIC
+====================================================================
+
+Network driver for Marvell's Octeon PCI EndPoint NIC.
+Copyright (c) 2020 Marvell International Ltd.
+
+Contents
+========
+
+- `Overview`_
+- `Supported Devices`_
+- `Interface Control`_
+
+Overview
+========
+This driver implements networking functionality of Marvell's Octeon PCI
+EndPoint NIC.
+
+Supported Devices
+=================
+Currently, this driver supports the following devices:
+ * Network controller: Cavium, Inc. Device b200
+
+Interface Control
+=================
+Network interface control, such as changing the MTU or link speed or bringing
+the link down/up, is done by writing commands to the mailbox command queue, a
+mailbox interface implemented through a reserved region in BAR4.
+This driver writes the commands into the mailbox and the firmware on the
+Octeon device processes them. The firmware also sends unsolicited notifications
+to the driver for events such as link change, through a notification queue
+implemented as part of the mailbox interface.
diff --git a/Documentation/networking/device_drivers/marvell/octeontx2.rst b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst
index 88f508338c5f..dd5cd69467be 100644
--- a/Documentation/networking/device_drivers/marvell/octeontx2.rst
+++ b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst
@@ -12,6 +12,7 @@ Contents
- `Overview`_
- `Drivers`_
- `Basic packet flow`_
+- `Devlink health reporters`_
Overview
========
@@ -157,3 +158,132 @@ Egress
3. The SQ descriptor ring is maintained in buffers allocated from SQ mapped pool of NPA block LF.
4. NIX block transmits the pkt on the designated channel.
5. NPC MCAM entries can be installed to divert pkt onto a different channel.
+
+Devlink health reporters
+========================
+
+NPA Reporters
+-------------
+The NPA reporters are responsible for reporting and recovering the following group of errors:
+
+1. GENERAL events
+
+ - Error due to operation of unmapped PF.
+ - Error due to disabled alloc/free for other HW blocks (NIX, SSO, TIM, DPI and AURA).
+
+2. ERROR events
+
+ - Fault due to NPA_AQ_INST_S read or NPA_AQ_RES_S write.
+ - AQ Doorbell Error.
+
+3. RAS events
+
+ - RAS Error Reporting for NPA_AQ_INST_S/NPA_AQ_RES_S.
+
+4. RVU events
+
+ - Error due to unmapped slot.
+
+Sample Output::
+
+ ~# devlink health
+ pci/0002:01:00.0:
+ reporter hw_npa_intr
+ state healthy error 2872 recover 2872 last_dump_date 2020-12-10 last_dump_time 09:39:09 grace_period 0 auto_recover true auto_dump true
+ reporter hw_npa_gen
+ state healthy error 2872 recover 2872 last_dump_date 2020-12-11 last_dump_time 04:43:04 grace_period 0 auto_recover true auto_dump true
+ reporter hw_npa_err
+ state healthy error 2871 recover 2871 last_dump_date 2020-12-10 last_dump_time 09:39:17 grace_period 0 auto_recover true auto_dump true
+ reporter hw_npa_ras
+ state healthy error 0 recover 0 last_dump_date 2020-12-10 last_dump_time 09:32:40 grace_period 0 auto_recover true auto_dump true
+
+Each reporter dumps the
+
+ - Error Type
+ - Error Register value
+ - Reason in words
+
+For example::
+
+ ~# devlink health dump show pci/0002:01:00.0 reporter hw_npa_gen
+ NPA_AF_GENERAL:
+ NPA General Interrupt Reg : 1
+ NIX0: free disabled RX
+ ~# devlink health dump show pci/0002:01:00.0 reporter hw_npa_intr
+ NPA_AF_RVU:
+ NPA RVU Interrupt Reg : 1
+ Unmap Slot Error
+ ~# devlink health dump show pci/0002:01:00.0 reporter hw_npa_err
+ NPA_AF_ERR:
+ NPA Error Interrupt Reg : 4096
+ AQ Doorbell Error
+
+
+NIX Reporters
+-------------
+The NIX reporters are responsible for reporting and recovering the following group of errors:
+
+1. GENERAL events
+
+ - Receive mirror/multicast packet drop due to insufficient buffer.
+ - SMQ Flush operation.
+
+2. ERROR events
+
+ - Memory Fault due to WQE read/write from multicast/mirror buffer.
+ - Receive multicast/mirror replication list error.
+ - Receive packet on an unmapped PF.
+ - Fault due to NIX_AQ_INST_S read or NIX_AQ_RES_S write.
+ - AQ Doorbell Error.
+
+3. RAS events
+
+ - RAS Error Reporting for NIX Receive Multicast/Mirror Entry Structure.
+ - RAS Error Reporting for WQE/Packet Data read from Multicast/Mirror Buffer.
+ - RAS Error Reporting for NIX_AQ_INST_S/NIX_AQ_RES_S.
+
+4. RVU events
+
+ - Error due to unmapped slot.
+
+Sample Output::
+
+ ~# ./devlink health
+ pci/0002:01:00.0:
+ reporter hw_npa_intr
+ state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
+ reporter hw_npa_gen
+ state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
+ reporter hw_npa_err
+ state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
+ reporter hw_npa_ras
+ state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
+ reporter hw_nix_intr
+ state healthy error 1121 recover 1121 last_dump_date 2021-01-19 last_dump_time 05:42:26 grace_period 0 auto_recover true auto_dump true
+ reporter hw_nix_gen
+ state healthy error 949 recover 949 last_dump_date 2021-01-19 last_dump_time 05:42:43 grace_period 0 auto_recover true auto_dump true
+ reporter hw_nix_err
+ state healthy error 1147 recover 1147 last_dump_date 2021-01-19 last_dump_time 05:42:59 grace_period 0 auto_recover true auto_dump true
+ reporter hw_nix_ras
+ state healthy error 409 recover 409 last_dump_date 2021-01-19 last_dump_time 05:43:16 grace_period 0 auto_recover true auto_dump true
+
+Each reporter dumps the
+
+ - Error Type
+ - Error Register value
+ - Reason in words
+
+For example::
+
+ ~# devlink health dump show pci/0002:01:00.0 reporter hw_nix_intr
+ NIX_AF_RVU:
+ NIX RVU Interrupt Reg : 1
+ Unmap Slot Error
+ ~# devlink health dump show pci/0002:01:00.0 reporter hw_nix_gen
+ NIX_AF_GENERAL:
+ NIX General Interrupt Reg : 1
+ Rx multicast pkt drop
+ ~# devlink health dump show pci/0002:01:00.0 reporter hw_nix_err
+ NIX_AF_ERR:
+ NIX Error Interrupt Reg : 64
+ Rx on unmapped PF_FUNC
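The reporter listings above can be reduced to the reporters that have actually logged errors. The following is a sketch only: it assumes the plain-text layout shown in the sample output (a "reporter NAME" line followed by a "state ..." line), and the helper name health_errors is hypothetical.

```shell
#!/bin/sh
# Print "<reporter> <error count> <recover count>" for every reporter whose
# error counter is non-zero. Reads `devlink health` text output on stdin.
health_errors() {
    awk '
        $1 == "reporter" { name = $2; next }
        $1 == "state" {
            err = 0; rec = 0
            # Exact field match, so "auto_recover" is not mistaken for "recover".
            for (i = 1; i < NF; i++) {
                if ($i == "error")   err = $(i + 1)
                if ($i == "recover") rec = $(i + 1)
            }
            if (err > 0) print name, err, rec
        }
    '
}

# Sample input copied from the output format shown above.
health_errors <<'EOF'
pci/0002:01:00.0:
  reporter hw_nix_gen
    state healthy error 949 recover 949 grace_period 0 auto_recover true auto_dump true
  reporter hw_npa_ras
    state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
EOF
```

On a live system the same function could be fed with `devlink health | health_errors`.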
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
new file mode 100644
index 000000000000..5edf50d7dbd5
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
@@ -0,0 +1,762 @@
+.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+
+=================================================
+Mellanox ConnectX(R) mlx5 core VPI Network Driver
+=================================================
+
+Copyright (c) 2019, Mellanox Technologies LTD.
+
+Contents
+========
+
+- `Enabling the driver and kconfig options`_
+- `Devlink info`_
+- `Devlink parameters`_
+- `Bridge offload`_
+- `mlx5 subfunction`_
+- `mlx5 function attributes`_
+- `Devlink health reporters`_
+- `mlx5 tracepoints`_
+
+Enabling the driver and kconfig options
+=======================================
+
+| mlx5 core is modular and most of the major mlx5 core driver features can be selected (compiled in/out)
+| at build time via kernel Kconfig flags.
+| Basic features, ethernet net device rx/tx offloads and XDP, are available with the most basic flags
+| CONFIG_MLX5_CORE=y/m and CONFIG_MLX5_CORE_EN=y.
+| For the list of advanced features please see below.
+
+**CONFIG_MLX5_CORE=(y/m/n)** (module mlx5_core.ko)
+
+| The driver can be enabled by choosing CONFIG_MLX5_CORE=y/m in kernel config.
+| This will provide mlx5 core driver for mlx5 ulps to interface with (mlx5e, mlx5_ib).
+
+
+**CONFIG_MLX5_CORE_EN=(y/n)**
+
+| Choosing this option will allow basic ethernet netdevice support with all of the standard rx/tx offloads.
+| mlx5e is the mlx5 ulp driver which provides the netdevice kernel interface. When chosen, mlx5e will be
+| built into mlx5_core.ko.
+
+
+**CONFIG_MLX5_EN_ARFS=(y/n)**
+
+| Enables Hardware-accelerated receive flow steering (arfs) support, and ntuple filtering.
+| https://community.mellanox.com/s/article/howto-configure-arfs-on-connectx-4
+
+
+**CONFIG_MLX5_EN_RXNFC=(y/n)**
+
+| Enables ethtool receive network flow classification, which allows user defined
+| flow rules to direct traffic into arbitrary rx queue via ethtool set/get_rxnfc API.
+
+
+**CONFIG_MLX5_CORE_EN_DCB=(y/n)**:
+
+| Enables `Data Center Bridging (DCB) Support <https://community.mellanox.com/s/article/howto-auto-config-pfc-and-ets-on-connectx-4-via-lldp-dcbx>`_.
+
+
+**CONFIG_MLX5_MPFS=(y/n)**
+
+| Ethernet Multi-Physical Function Switch (MPFS) support in ConnectX NIC.
+| MPFS is required when `Multi-Host <http://www.mellanox.com/page/multihost>`_ configuration is enabled to allow passing
+| user configured unicast MAC addresses to the requesting PF.
+
+
+**CONFIG_MLX5_ESWITCH=(y/n)**
+
+| Ethernet SRIOV E-Switch support in ConnectX NIC. E-Switch provides internal SRIOV packet steering
+| and switching for the enabled VFs and PF in two available modes:
+| 1) `Legacy SRIOV mode (L2 mac vlan steering based) <https://community.mellanox.com/s/article/howto-configure-sr-iov-for-connectx-4-connectx-5-with-kvm--ethernet-x>`_.
+| 2) `Switchdev mode (eswitch offloads) <https://www.mellanox.com/related-docs/prod_software/ASAP2_Hardware_Offloading_for_vSwitches_User_Manual_v4.4.pdf>`_.
+
+
+**CONFIG_MLX5_CORE_IPOIB=(y/n)**
+
+| IPoIB offloads & acceleration support.
+| Requires CONFIG_MLX5_CORE_EN to provide an accelerated interface for the rdma
+| IPoIB ulp netdevice.
+
+
+**CONFIG_MLX5_FPGA=(y/n)**
+
+| Build support for the Innova family of network cards by Mellanox Technologies.
+| Innova network cards are comprised of a ConnectX chip and an FPGA chip on one board.
+| If you select this option, the mlx5_core driver will include the Innova FPGA core and allow
+| building sandbox-specific client drivers.
+
+
+**CONFIG_MLX5_EN_IPSEC=(y/n)**
+
+| Enables `IPSec XFRM cryptography-offload acceleration <http://www.mellanox.com/related-docs/prod_software/Mellanox_Innova_IPsec_Ethernet_Adapter_Card_User_Manual.pdf>`_.
+
+**CONFIG_MLX5_EN_TLS=(y/n)**
+
+| TLS cryptography-offload acceleration.
+
+
+**CONFIG_MLX5_INFINIBAND=(y/n/m)** (module mlx5_ib.ko)
+
+| Provides low-level InfiniBand/RDMA and `RoCE <https://community.mellanox.com/s/article/recommended-network-configuration-examples-for-roce-deployment>`_ support.
+
+**CONFIG_MLX5_SF=(y/n)**
+
+| Build support for subfunction.
+| Subfunctions are more lightweight than PCI SRIOV VFs. Choosing this option
+| will enable support for creating subfunction devices.
+
+**External options** ( Choose if the corresponding mlx5 feature is required )
+
+- CONFIG_PTP_1588_CLOCK: When chosen, mlx5 ptp support will be enabled.
+- CONFIG_VXLAN: When chosen, mlx5 vxlan support will be enabled.
+- CONFIG_MLXFW: When chosen, mlx5 firmware flashing support will be enabled (via devlink and ethtool).
+
+Devlink info
+============
+
+The devlink info reports the running and stored firmware versions on device.
+It also prints the device PSID which represents the HCA board type ID.
+
+User command example::
+
+ $ devlink dev info pci/0000:00:06.0
+ pci/0000:00:06.0:
+ driver mlx5_core
+ versions:
+ fixed:
+ fw.psid MT_0000000009
+ running:
+ fw.version 16.26.0100
+ stored:
+ fw.version 16.26.0100
+
+Devlink parameters
+==================
+
+flow_steering_mode: Device flow steering mode
+---------------------------------------------
+The flow steering mode parameter controls the flow steering mode of the driver.
+Two modes are supported:
+1. 'dmfs' - Device managed flow steering.
+2. 'smfs' - Software/Driver managed flow steering.
+
+In DMFS mode, the HW steering entities are created and managed through the
+Firmware.
+In SMFS mode, the HW steering entities are created and managed by the driver
+directly in hardware, without firmware intervention.
+
+SMFS mode is faster and provides a better rule insertion rate than the default DMFS mode.
+
+User command examples:
+
+- Set SMFS flow steering mode::
+
+ $ devlink dev param set pci/0000:06:00.0 name flow_steering_mode value "smfs" cmode runtime
+
+- Read device flow steering mode::
+
+ $ devlink dev param show pci/0000:06:00.0 name flow_steering_mode
+ pci/0000:06:00.0:
+ name flow_steering_mode type driver-specific
+ values:
+ cmode runtime value smfs
+
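Because only 'dmfs' and 'smfs' are accepted, a small wrapper can reject other values before calling devlink. This is a sketch under stated assumptions: the function name and the DEVLINK override variable are hypothetical conveniences of this example, not part of devlink; the devlink command line itself is the one shown above.

```shell
#!/bin/sh
# Set flow_steering_mode after validating the requested mode.
# DEVLINK may be overridden (for example DEVLINK=echo) to print the command
# instead of running it; that override exists only in this sketch.
set_flow_steering_mode() {
    dev=$1      # devlink device handle, e.g. pci/0000:06:00.0
    mode=$2     # dmfs or smfs
    case $mode in
        dmfs|smfs) ;;
        *) echo "unsupported flow steering mode: $mode" >&2; return 2 ;;
    esac
    ${DEVLINK:-devlink} dev param set "$dev" name flow_steering_mode \
        value "$mode" cmode runtime
}
```

For example, `set_flow_steering_mode pci/0000:06:00.0 smfs` issues the same command as the "Set SMFS flow steering mode" example, while an unknown mode returns a non-zero status without touching the device.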
+enable_roce: RoCE enablement state
+----------------------------------
+RoCE enablement state controls driver support for RoCE traffic.
+When RoCE is disabled, there is no gid table, only raw ethernet QPs are supported and traffic on the well known UDP RoCE port is handled as raw ethernet traffic.
+
+To change RoCE enablement state a user must change the driverinit cmode value and run devlink reload.
+
+User command examples:
+
+- Disable RoCE::
+
+ $ devlink dev param set pci/0000:06:00.0 name enable_roce value false cmode driverinit
+ $ devlink dev reload pci/0000:06:00.0
+
+- Read RoCE enablement state::
+
+ $ devlink dev param show pci/0000:06:00.0 name enable_roce
+ pci/0000:06:00.0:
+ name enable_roce type generic
+ values:
+ cmode driverinit value true
+
+esw_port_metadata: Eswitch port metadata state
+----------------------------------------------
+When applicable, disabling Eswitch metadata can increase packet rate
+up to 20% depending on the use case and packet sizes.
+
+Eswitch port metadata state controls whether to internally tag packets with
+metadata. Metadata tagging must be enabled for multi-port RoCE, failover
+between representors and stacked devices.
+By default metadata is enabled on the supported devices in E-switch.
+Metadata is applicable only for E-switch in switchdev mode and
+users may disable it when NONE of the below use cases will be in use:
+1. HCA is in Dual/multi-port RoCE mode.
+2. VF/SF representor bonding (Usually used for Live migration)
+3. Stacked devices
+
+When metadata is disabled, the above use cases will fail to initialize if
+users try to enable them.
+
+- Show eswitch port metadata::
+
+ $ devlink dev param show pci/0000:06:00.0 name esw_port_metadata
+ pci/0000:06:00.0:
+ name esw_port_metadata type driver-specific
+ values:
+ cmode runtime value true
+
+- Disable eswitch port metadata::
+
+ $ devlink dev param set pci/0000:06:00.0 name esw_port_metadata value false cmode runtime
+
+- Change eswitch mode to switchdev mode after choosing the metadata value::
+
+ $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
+
+Bridge offload
+==============
+The mlx5 driver implements support for offloading bridge rules when in switchdev
+mode. Linux bridge FDBs are automatically offloaded when mlx5 switchdev
+representor is attached to bridge.
+
+- Change device to switchdev mode::
+
+ $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
+
+- Attach mlx5 switchdev representor 'enp8s0f0' to bridge netdev 'bridge1'::
+
+ $ ip link set enp8s0f0 master bridge1
+
+VLANs
+-----
+Following bridge VLAN functions are supported by mlx5:
+
+- VLAN filtering (including multiple VLANs per port)::
+
+ $ ip link set bridge1 type bridge vlan_filtering 1
+ $ bridge vlan add dev enp8s0f0 vid 2-3
+
+- VLAN push on bridge ingress::
+
+ $ bridge vlan add dev enp8s0f0 vid 3 pvid
+
+- VLAN pop on bridge egress::
+
+ $ bridge vlan add dev enp8s0f0 vid 3 untagged
+
+mlx5 subfunction
+================
+mlx5 supports subfunction management using devlink port (see :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`) interface.
+
+A Subfunction has its own function capabilities and its own resources. This
+means a subfunction has its own dedicated queues (txq, rxq, cq, eq). These
+queues are neither shared nor stolen from the parent PCI function.
+
+When a subfunction is RDMA capable, it has its own QP1, GID table and rdma
+resources neither shared nor stolen from the parent PCI function.
+
+A subfunction has a dedicated window in PCI BAR space that is not shared
+with the other subfunctions or the parent PCI function. This ensures that all
+devices (netdev, rdma, vdpa etc.) of the subfunction access only the assigned
+PCI BAR space.
+
+A Subfunction supports eswitch representation through which it supports tc
+offloads. The user configures eswitch to send/receive packets from/to
+the subfunction port.
+
+Subfunctions share PCI level resources such as PCI MSI-X IRQs with
+other subfunctions and/or with its parent PCI function.
+
+Example mlx5 software, system and device view::
+
+ _______
+ | admin |
+ | user |----------
+ |_______| |
+ | |
+ ____|____ __|______ _________________
+ | | | | | |
+ | devlink | | tc tool | | user |
+ | tool | |_________| | applications |
+ |_________| | |_________________|
+ | | | |
+ | | | | Userspace
+ +---------|-------------|-------------------|----------|--------------------+
+ | | +----------+ +----------+ Kernel
+ | | | netdev | | rdma dev |
+ | | +----------+ +----------+
+ (devlink port add/del | ^ ^
+ port function set) | | |
+ | | +---------------|
+ _____|___ | | _______|_______
+ | | | | | mlx5 class |
+ | devlink | +------------+ | | drivers |
+ | kernel | | rep netdev | | |(mlx5_core,ib) |
+ |_________| +------------+ | |_______________|
+ | | | ^
+ (devlink ops) | | (probe/remove)
+ _________|________ | | ____|________
+ | subfunction | | +---------------+ | subfunction |
+ | management driver|----- | subfunction |---| driver |
+ | (mlx5_core) | | auxiliary dev | | (mlx5_core) |
+ |__________________| +---------------+ |_____________|
+ | ^
+ (sf add/del, vhca events) |
+ | (device add/del)
+ _____|____ ____|________
+ | | | subfunction |
+ | PCI NIC |---- activate/deactive events---->| host driver |
+ |__________| | (mlx5_core) |
+ |_____________|
+
+A subfunction is created using the devlink port interface.
+
+- Change device to switchdev mode::
+
+ $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
+
+- Add a devlink port of subfunction flavour::
+
+ $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
+ pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+ function:
+ hw_addr 00:00:00:00:00:00 state inactive opstate detached
+
+- Show a devlink port of the subfunction::
+
+ $ devlink port show pci/0000:06:00.0/32768
+ pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
+ function:
+ hw_addr 00:00:00:00:00:00 state inactive opstate detached
+
+- Delete a devlink port of subfunction after use::
+
+ $ devlink port del pci/0000:06:00.0/32768
+
+mlx5 function attributes
+========================
+The mlx5 driver provides a mechanism to setup PCI VF/SF function attributes in
+a unified way for SmartNIC and non-SmartNIC.
+
+This is supported only when the eswitch mode is set to switchdev. Port function
+configuration of the PCI VF/SF is supported through devlink eswitch port.
+
+Port function attributes should be set before PCI VF/SF is enumerated by the
+driver.
+
+MAC address setup
+-----------------
+The mlx5 driver provides a mechanism to set up the MAC address of the PCI VF/SF.
+
+The configured MAC address of the PCI VF/SF will be used by netdevice and rdma
+device created for the PCI VF/SF.
+
+- Get the MAC address of the VF identified by its unique devlink port index::
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00
+
+- Set the MAC address of the VF identified by its unique devlink port index::
+
+ $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:11:22:33:44:55
+
+- Get the MAC address of the SF identified by its unique devlink port index::
+
+ $ devlink port show pci/0000:06:00.0/32768
+ pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
+ function:
+ hw_addr 00:00:00:00:00:00
+
+- Set the MAC address of the SF identified by its unique devlink port index::
+
+ $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88
+
+ $ devlink port show pci/0000:06:00.0/32768
+ pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
+ function:
+ hw_addr 00:00:00:00:88:88
+
+SF state setup
+--------------
+To use the SF, the user must activate the SF using the SF function state
+attribute.
+
+- Get the state of the SF identified by its unique devlink port index::
+
+ $ devlink port show ens2f0npf0sf88
+ pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+ function:
+ hw_addr 00:00:00:00:88:88 state inactive opstate detached
+
+- Activate the function and verify its state is active::
+
+ $ devlink port function set ens2f0npf0sf88 state active
+
+ $ devlink port show ens2f0npf0sf88
+ pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+ function:
+ hw_addr 00:00:00:00:88:88 state active opstate detached
+
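The steps covered so far (switchdev mode, port add, MAC address setup, activation) can be chained into one helper. This is a sketch only: the function name, its argument order, the fixed pfnum 0, and the DEVLINK override are assumptions of this example; each devlink invocation is one already shown in this document. The port index must be taken from the output of "devlink port add".

```shell
#!/bin/sh
# Create, configure, and activate one subfunction.
# DEVLINK may be overridden (for example DEVLINK=echo) for a dry run; that
# override exists only in this sketch.
sf_bring_up() {
    pci_dev=$1   # e.g. pci/0000:06:00.0
    sfnum=$2     # e.g. 88
    port=$3      # port index reported by "port add", e.g. pci/0000:06:00.0/32768
    mac=$4       # e.g. 00:00:00:00:88:88
    dl=${DEVLINK:-devlink}

    # Stop at the first failing step; pfnum 0 is assumed for illustration.
    $dl dev eswitch set "$pci_dev" mode switchdev &&
    $dl port add "$pci_dev" flavour pcisf pfnum 0 sfnum "$sfnum" &&
    $dl port function set "$port" hw_addr "$mac" &&
    $dl port function set "$port" state active
}
```

The MAC address is set before activation because port function attributes should be in place before the function is enumerated by the driver.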
+Upon function activation, the PF driver instance gets the event from the device
+that a particular SF was activated. It's the cue to put the device on bus, probe
+it and instantiate the devlink instance and class specific auxiliary devices
+for it.
+
+- Show the auxiliary device and port of the subfunction::
+
+ $ devlink dev show
+ devlink dev show auxiliary/mlx5_core.sf.4
+
+ $ devlink port show auxiliary/mlx5_core.sf.4/1
+ auxiliary/mlx5_core.sf.4/1: type eth netdev p0sf88 flavour virtual port 0 splittable false
+
+ $ rdma link show mlx5_0/1
+ link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev p0sf88
+
+ $ rdma dev show
+ 8: rocep6s0f1: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d113 sys_image_guid 248a:0703:00b3:d112
+ 13: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112
+
+- Subfunction auxiliary device and class device hierarchy::
+
+ mlx5_core.sf.4
+ (subfunction auxiliary device)
+ /\
+ / \
+ / \
+ / \
+ / \
+ mlx5_core.eth.4 mlx5_core.rdma.4
+ (sf eth aux dev) (sf rdma aux dev)
+ | |
+ | |
+ p0sf88 mlx5_0
+ (sf netdev) (sf rdma device)
+
+Additionally, the SF port also gets the event when the driver attaches to the
+auxiliary device of the subfunction. This results in changing the operational
+state of the function. This provides visibility to the user to decide when it is
+safe to delete the SF port for graceful termination of the subfunction.
+
+- Show the SF port operational state::
+
+ $ devlink port show ens2f0npf0sf88
+ pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
+ function:
+ hw_addr 00:00:00:00:88:88 state active opstate attached
+
+Devlink health reporters
+========================
+
+tx reporter
+-----------
+The tx reporter is responsible for reporting and recovering from the following two error scenarios:
+
+- TX timeout
+ Report on kernel tx timeout detection.
+ Recover by searching for lost interrupts.
+- TX error completion
+ Report on error tx completion.
+ Recover by flushing the TX queue and resetting it.
+
+The TX reporter also supports an on-demand diagnose callback, which provides
+real-time information about the status of its send queues.
+
+User commands examples:
+
+- Diagnose send queues status::
+
+ $ devlink health diagnose pci/0000:82:00.0 reporter tx
+
+NOTE: This command has valid output only when interface is up, otherwise the command has empty output.
+
+- Show the number of tx errors indicated, the number of recover flows that ended
+ successfully, whether autorecover is enabled, and the grace period since the last recover::
+
+ $ devlink health show pci/0000:82:00.0 reporter tx
+
+rx reporter
+-----------
+The rx reporter is responsible for reporting and recovering from the following two error scenarios:
+
+- RX queues initialization (population) timeout
+ RX queues descriptors population on ring initialization is done in
+ napi context via triggering an irq, in case of a failure to get
+ the minimum amount of descriptors, a timeout would occur and it
+ could be recoverable by polling the EQ (Event Queue).
+- RX completions with errors (reported by HW on interrupt context)
+ Report on rx completion error.
+ Recover (if needed) by flushing the related queue and resetting it.
+
+The RX reporter also supports an on-demand diagnose callback, which
+provides real-time information about the status of its receive queues.
+
+- Diagnose rx queues status, and corresponding completion queue::
+
+ $ devlink health diagnose pci/0000:82:00.0 reporter rx
+
+NOTE: This command has valid output only when interface is up, otherwise the command has empty output.
+
+- Show the number of rx errors indicated, the number of recover flows that ended
+ successfully, whether autorecover is enabled, and the grace period since the last recover::
+
+ $ devlink health show pci/0000:82:00.0 reporter rx
+
+fw reporter
+-----------
+The fw reporter implements diagnose and dump callbacks.
+It follows symptoms of fw error such as fw syndrome by triggering
+fw core dump and storing it into the dump buffer.
+The fw reporter diagnose command can be triggered any time by the user to check
+current fw status.
+
+User commands examples:
+
+- Check fw health status::
+
+ $ devlink health diagnose pci/0000:82:00.0 reporter fw
+
+- Read FW core dump if already stored or trigger new one::
+
+ $ devlink health dump show pci/0000:82:00.0 reporter fw
+
+NOTE: This command can run only on the PF which has fw tracer ownership,
+running it on other PF or any VF will return "Operation not permitted".
+
+fw fatal reporter
+-----------------
+The fw fatal reporter implements dump and recover callbacks.
+It follows fatal error indications with a CR-space dump and recover flow.
+The CR-space dump uses vsc interface which is valid even if the FW command
+interface is not functional, which is the case in most FW fatal errors.
+The recover function runs recover flow which reloads the driver and triggers fw
+reset if needed.
+On firmware error, the health buffer is dumped into the dmesg. The log
+level is derived from the error's severity (given in health buffer).
+
+User commands examples:
+
+- Run fw recover flow manually::
+
+ $ devlink health recover pci/0000:82:00.0 reporter fw_fatal
+
+- Read FW CR-space dump if already stored or trigger new one::
+
+ $ devlink health dump show pci/0000:82:00.1 reporter fw_fatal
+
+NOTE: This command can run only on PF.
+
+mlx5 tracepoints
+================
+
+mlx5 driver provides internal trace points for tracking and debugging using
+kernel tracepoints interfaces (refer to Documentation/trace/ftrace.rst).
+
+For the list of supported mlx5 events, check /sys/kernel/debug/tracing/events/mlx5/
+
+tc and eswitch offloads tracepoints:
+
+- mlx5e_configure_flower: trace flower filter actions and cookies offloaded to mlx5::
+
+ $ echo mlx5:mlx5e_configure_flower >> /sys/kernel/debug/tracing/set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ tc-6535 [019] ...1 2672.404466: mlx5e_configure_flower: cookie=0000000067874a55 actions= REDIRECT
+
+- mlx5e_delete_flower: trace flower filter actions and cookies deleted from mlx5::
+
+ $ echo mlx5:mlx5e_delete_flower >> /sys/kernel/debug/tracing/set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ tc-6569 [010] .N.1 2686.379075: mlx5e_delete_flower: cookie=0000000067874a55 actions= NULL
+
+- mlx5e_stats_flower: trace flower stats request::
+
+ $ echo mlx5:mlx5e_stats_flower >> /sys/kernel/debug/tracing/set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ tc-6546 [010] ...1 2679.704889: mlx5e_stats_flower: cookie=0000000060eb3d6a bytes=0 packets=0 lastused=4295560217
+
+- mlx5e_tc_update_neigh_used_value: trace tunnel rule neigh update value offloaded to mlx5::
+
+ $ echo mlx5:mlx5e_tc_update_neigh_used_value >> /sys/kernel/debug/tracing/set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ kworker/u48:4-8806 [009] ...1 55117.882428: mlx5e_tc_update_neigh_used_value: netdev: ens1f0 IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_used=1
+
+- mlx5e_rep_neigh_update: trace neigh update tasks scheduled due to neigh state change events::
+
+ $ echo mlx5:mlx5e_rep_neigh_update >> /sys/kernel/debug/tracing/set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ kworker/u48:7-2221 [009] ...1 1475.387435: mlx5e_rep_neigh_update: netdev: ens1f0 MAC: 24:8a:07:9a:17:9a IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_connected=1
+
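Several of the events above are often wanted together. As a minimal sketch, the helper below appends each named event to set_event in turn; the function name and the TRACING override are assumptions of this example, and the default path is the tracefs location used throughout this section.

```shell
#!/bin/sh
# Enable a list of mlx5 trace events by appending them to set_event.
# TRACING defaults to the path used in this document; overriding it is only
# meant for trying the helper out without root access.
enable_mlx5_events() {
    tracing=${TRACING:-/sys/kernel/debug/tracing}
    for ev in "$@"; do
        # ">>" appends, so previously enabled events stay enabled.
        echo "mlx5:$ev" >> "$tracing/set_event" || return 1
    done
}
```

For example, `enable_mlx5_events mlx5e_configure_flower mlx5e_delete_flower` enables both flower events, after which the trace file can be read as shown above.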
+Bridge offloads tracepoints:
+
+- mlx5_esw_bridge_fdb_entry_init: trace bridge FDB entry offloaded to mlx5::
+
+ $ echo mlx5:mlx5_esw_bridge_fdb_entry_init >> set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ kworker/u20:9-2217 [003] ...1 318.582243: mlx5_esw_bridge_fdb_entry_init: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=0 flags=0 used=0
+
+- mlx5_esw_bridge_fdb_entry_cleanup: trace bridge FDB entry deleted from mlx5::
+
+ $ echo mlx5:mlx5_esw_bridge_fdb_entry_cleanup >> set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ ip-2581 [005] ...1 318.629871: mlx5_esw_bridge_fdb_entry_cleanup: net_device=enp8s0f0_1 addr=e4:fd:05:08:00:03 vid=0 flags=0 used=16
+
+- mlx5_esw_bridge_fdb_entry_refresh: trace bridge FDB entry offload refreshed in
+ mlx5::
+
+ $ echo mlx5:mlx5_esw_bridge_fdb_entry_refresh >> set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ kworker/u20:8-3849 [003] ...1 466716: mlx5_esw_bridge_fdb_entry_refresh: net_device=enp8s0f0_0 addr=e4:fd:05:08:00:02 vid=3 flags=0 used=0
+
+- mlx5_esw_bridge_vlan_create: trace bridge VLAN object add on mlx5
+ representor::
+
+ $ echo mlx5:mlx5_esw_bridge_vlan_create >> set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ ip-2560 [007] ...1 318.460258: mlx5_esw_bridge_vlan_create: vid=1 flags=6
+
+- mlx5_esw_bridge_vlan_cleanup: trace bridge VLAN object delete from mlx5
+ representor::
+
+ $ echo mlx5:mlx5_esw_bridge_vlan_cleanup >> set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ bridge-2582 [007] ...1 318.653496: mlx5_esw_bridge_vlan_cleanup: vid=2 flags=8
+
+- mlx5_esw_bridge_vport_init: trace mlx5 vport assigned with bridge upper
+ device::
+
+ $ echo mlx5:mlx5_esw_bridge_vport_init >> set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ ip-2560 [007] ...1 318.458915: mlx5_esw_bridge_vport_init: vport_num=1
+
+- mlx5_esw_bridge_vport_cleanup: trace mlx5 vport removed from bridge upper
+ device::
+
+ $ echo mlx5:mlx5_esw_bridge_vport_cleanup >> set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ ip-5387 [000] ...1 573713: mlx5_esw_bridge_vport_cleanup: vport_num=1
+
+Eswitch QoS tracepoints:
+
+- mlx5_esw_vport_qos_create: trace creation of transmit scheduler arbiter for vport::
+
+ $ echo mlx5:mlx5_esw_vport_qos_create >> /sys/kernel/debug/tracing/set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ <...>-23496 [018] .... 73136.838831: mlx5_esw_vport_qos_create: (0000:82:00.0) vport=2 tsar_ix=4 bw_share=0, max_rate=0 group=000000007b576bb3
+
+- mlx5_esw_vport_qos_config: trace configuration of transmit scheduler arbiter for vport::
+
+ $ echo mlx5:mlx5_esw_vport_qos_config >> /sys/kernel/debug/tracing/set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ <...>-26548 [023] .... 75754.223823: mlx5_esw_vport_qos_config: (0000:82:00.0) vport=1 tsar_ix=3 bw_share=34, max_rate=10000 group=000000007b576bb3
+
+- mlx5_esw_vport_qos_destroy: trace deletion of transmit scheduler arbiter for vport::
+
+ $ echo mlx5:mlx5_esw_vport_qos_destroy >> /sys/kernel/debug/tracing/set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ <...>-27418 [004] .... 76546.680901: mlx5_esw_vport_qos_destroy: (0000:82:00.0) vport=1 tsar_ix=3
+
+- mlx5_esw_group_qos_create: trace creation of transmit scheduler arbiter for rate group::
+
+ $ echo mlx5:mlx5_esw_group_qos_create >> /sys/kernel/debug/tracing/set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ <...>-26578 [008] .... 75776.022112: mlx5_esw_group_qos_create: (0000:82:00.0) group=000000008dac63ea tsar_ix=5
+
+- mlx5_esw_group_qos_config: trace configuration of transmit scheduler arbiter for rate group::
+
+ $ echo mlx5:mlx5_esw_group_qos_config >> /sys/kernel/debug/tracing/set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ <...>-27303 [020] .... 76461.455356: mlx5_esw_group_qos_config: (0000:82:00.0) group=000000008dac63ea tsar_ix=5 bw_share=100 max_rate=20000
+
+- mlx5_esw_group_qos_destroy: trace deletion of transmit scheduler arbiter for group::
+
+ $ echo mlx5:mlx5_esw_group_qos_destroy >> /sys/kernel/debug/tracing/set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ <...>-27418 [006] .... 76547.187258: mlx5_esw_group_qos_destroy: (0000:82:00.0) group=000000007b576bb3 tsar_ix=1
+
+SF tracepoints:
+
+- mlx5_sf_add: trace addition of the SF port::
+
+ $ echo mlx5:mlx5_sf_add >> /sys/kernel/debug/tracing/set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ devlink-9363 [031] ..... 24610.188722: mlx5_sf_add: (0000:06:00.0) port_index=32768 controller=0 hw_id=0x8000 sfnum=88
+
+- mlx5_sf_free: trace freeing of the SF port::
+
+ $ echo mlx5:mlx5_sf_free >> /sys/kernel/debug/tracing/set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ devlink-9830 [038] ..... 26300.404749: mlx5_sf_free: (0000:06:00.0) port_index=32768 controller=0 hw_id=0x8000
+
+- mlx5_sf_hwc_alloc: trace allocating of the hardware SF context::
+
+ $ echo mlx5:mlx5_sf_hwc_alloc >> /sys/kernel/debug/tracing/set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ devlink-9775 [031] ..... 26296.385259: mlx5_sf_hwc_alloc: (0000:06:00.0) controller=0 hw_id=0x8000 sfnum=88
+
+- mlx5_sf_hwc_free: trace freeing of the hardware SF context::
+
+ $ echo mlx5:mlx5_sf_hwc_free >> /sys/kernel/debug/tracing/set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ kworker/u128:3-9093 [046] ..... 24625.365771: mlx5_sf_hwc_free: (0000:06:00.0) hw_id=0x8000
+
+- mlx5_sf_hwc_deferred_free: trace deferred freeing of the hardware SF context::
+
+ $ echo mlx5:mlx5_sf_hwc_deferred_free >> /sys/kernel/debug/tracing/set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ devlink-9519 [046] ..... 24624.400271: mlx5_sf_hwc_deferred_free: (0000:06:00.0) hw_id=0x8000
+
+- mlx5_sf_vhca_event: trace SF vhca event and state::
+
+ $ echo mlx5:mlx5_sf_vhca_event >> /sys/kernel/debug/tracing/set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ kworker/u128:3-9093 [046] ..... 24625.365525: mlx5_sf_vhca_event: (0000:06:00.0) hw_id=0x8000 sfnum=88 vhca_state=1
+
+- mlx5_sf_dev_add: trace SF device add event::
+
+ $ echo mlx5:mlx5_sf_dev_add >> /sys/kernel/debug/tracing/set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ kworker/u128:3-9093 [000] ..... 24616.524495: mlx5_sf_dev_add: (0000:06:00.0) sfdev=00000000fc5d96fd aux_id=4 hw_id=0x8000 sfnum=88
+
+- mlx5_sf_dev_del: trace SF device delete event::
+
+ $ echo mlx5:mlx5_sf_dev_del >> /sys/kernel/debug/tracing/set_event
+ $ cat /sys/kernel/debug/tracing/trace
+ ...
+ kworker/u128:3-9093 [044] ..... 24624.400749: mlx5_sf_dev_del: (0000:06:00.0) sfdev=00000000fc5d96fd aux_id=4 hw_id=0x8000 sfnum=88
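
The enable-then-read pattern repeated for each tracepoint above can be wrapped in a small helper. This is a minimal sketch, not part of the driver documentation: the function name ``mlx5_trace_enable`` is hypothetical, and it assumes tracefs is mounted at the directory passed as the first argument (for example /sys/kernel/debug/tracing).

```shell
#!/bin/sh
# Hypothetical helper (not provided by the kernel or the mlx5 driver).
# $1 is the tracing directory; the remaining arguments are mlx5 event
# names without the "mlx5:" prefix.
mlx5_trace_enable() {
    tracing_dir=$1
    shift
    for ev in "$@"; do
        # ">>" appends, so events that are already enabled stay enabled.
        echo "mlx5:$ev" >> "$tracing_dir/set_event" || return 1
    done
}
```

A possible invocation is ``mlx5_trace_enable /sys/kernel/debug/tracing mlx5_sf_add mlx5_sf_free``, followed by ``cat /sys/kernel/debug/tracing/trace`` to read the recorded events.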
diff --git a/Documentation/networking/device_drivers/microsoft/netvsc.txt b/Documentation/networking/device_drivers/ethernet/microsoft/netvsc.rst
index cd63556b27a0..fc5acd427a5d 100644
--- a/Documentation/networking/device_drivers/microsoft/netvsc.txt
+++ b/Documentation/networking/device_drivers/ethernet/microsoft/netvsc.rst
@@ -1,3 +1,6 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
Hyper-V network driver
======================
@@ -10,15 +13,15 @@ Windows 10.
Features
========
- Checksum offload
- ----------------
+Checksum offload
+----------------
The netvsc driver supports checksum offload as long as the
Hyper-V host version does. Windows Server 2016 and Azure
support checksum offload for TCP and UDP for both IPv4 and
IPv6. Windows Server 2012 only supports checksum offload for TCP.
- Receive Side Scaling
- --------------------
+Receive Side Scaling
+--------------------
Hyper-V supports receive side scaling. For TCP & UDP, packets can
be distributed among available queues based on IP address and port
number.
@@ -32,30 +35,37 @@ Features
hashing. Using L3 hashing is recommended in this case.
For example, for UDP over IPv4 on eth0:
- To include UDP port numbers in hashing:
- ethtool -N eth0 rx-flow-hash udp4 sdfn
- To exclude UDP port numbers in hashing:
- ethtool -N eth0 rx-flow-hash udp4 sd
- To show UDP hash level:
- ethtool -n eth0 rx-flow-hash udp4
-
- Generic Receive Offload, aka GRO
- --------------------------------
+
+ To include UDP port numbers in hashing::
+
+ ethtool -N eth0 rx-flow-hash udp4 sdfn
+
+ To exclude UDP port numbers in hashing::
+
+ ethtool -N eth0 rx-flow-hash udp4 sd
+
+ To show UDP hash level::
+
+ ethtool -n eth0 rx-flow-hash udp4
+
+Generic Receive Offload, aka GRO
+--------------------------------
The driver supports GRO and it is enabled by default. GRO coalesces
like packets and significantly reduces CPU usage under heavy Rx
load.
- Large Receive Offload (LRO), or Receive Side Coalescing (RSC)
- -------------------------------------------------------------
+Large Receive Offload (LRO), or Receive Side Coalescing (RSC)
+-------------------------------------------------------------
The driver supports LRO/RSC in the vSwitch feature. It reduces the per packet
processing overhead by coalescing multiple TCP segments when possible. The
feature is enabled by default on VMs running on Windows Server 2019 and
- later. It may be changed by ethtool command:
+ later. It may be changed by ethtool command::
+
ethtool -K eth0 lro on
ethtool -K eth0 lro off
- SR-IOV support
- --------------
+SR-IOV support
+--------------
Hyper-V supports SR-IOV as a hardware acceleration option. If SR-IOV
is enabled in both the vSwitch and the guest configuration, then the
Virtual Function (VF) device is passed to the guest as a PCI
@@ -70,21 +80,25 @@ Features
flow direction is desired, these should be applied directly to the
VF slave device.
- Receive Buffer
- --------------
+Receive Buffer
+--------------
Packets are received into a receive area which is created when device
is probed. The receive area is broken into MTU sized chunks and each may
contain one or more packets. The number of receive sections may be changed
via ethtool Rx ring parameters.
- There is a similar send buffer which is used to aggregate packets for sending.
- The send area is broken into chunks of 6144 bytes, each of section may
- contain one or more packets. The send buffer is an optimization, the driver
- will use slower method to handle very large packets or if the send buffer
- area is exhausted.
-
- XDP support
- -----------
+ There is a similar send buffer which is used to aggregate packets
+ for sending. The send area is broken into chunks, typically of 6144
+ bytes, each of which may contain one or more packets. Small
+ packets are usually transmitted via copy to the send buffer. However,
+ if the buffer is temporarily exhausted, or the packet to be transmitted is
+ an LSO packet, the driver will provide the host with pointers to the data
+ from the SKB. This attempts to achieve a balance between the overhead of
+ data copy and the impact of remapping VM memory to be accessible by the
+ host.
+
+XDP support
+-----------
XDP (eXpress Data Path) is a feature that runs eBPF bytecode at the early
stage when packets arrive at a NIC card. The goal is to increase performance
for packet processing, reducing the overhead of SKB allocation and other
@@ -99,7 +113,8 @@ Features
overwritten by setting of synthetic NIC.
XDP program cannot run with LRO (RSC) enabled, so you need to disable LRO
- before running XDP:
+ before running XDP::
+
ethtool -K eth0 lro off
XDP_REDIRECT action is not yet supported.
diff --git a/Documentation/networking/device_drivers/ethernet/neterion/s2io.rst b/Documentation/networking/device_drivers/ethernet/neterion/s2io.rst
new file mode 100644
index 000000000000..c5673ec4559b
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/neterion/s2io.rst
@@ -0,0 +1,196 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================================================
+Neterion's (Formerly S2io) Xframe I/II PCI-X 10GbE driver
+=========================================================
+
+Release notes for Neterion's (Formerly S2io) Xframe I/II PCI-X 10GbE driver.
+
+.. Contents
+ - 1. Introduction
+ - 2. Identifying the adapter/interface
+ - 3. Features supported
+ - 4. Command line parameters
+ - 5. Performance suggestions
+ - 6. Support
+
+
+1. Introduction
+===============
+This Linux driver supports Neterion's Xframe I PCI-X 1.0 and
+Xframe II PCI-X 2.0 adapters. It supports several features
+such as jumbo frames, MSI/MSI-X, checksum offloads, TSO, UFO and so on.
+See below for complete list of features.
+
+All features are supported for both IPv4 and IPv6.
+
+2. Identifying the adapter/interface
+====================================
+
+a. Insert the adapter(s) in your system.
+b. Build and load driver::
+
+ # insmod s2io.ko
+
+c. View log messages::
+
+ # dmesg | tail -40
+
+You will see messages similar to::
+
+ eth3: Neterion Xframe I 10GbE adapter (rev 3), Version 2.0.9.1, Intr type INTA
+ eth4: Neterion Xframe II 10GbE adapter (rev 2), Version 2.0.9.1, Intr type INTA
+ eth4: Device is on 64 bit 133MHz PCIX(M1) bus
+
+The above messages identify the adapter type (Xframe I/II), adapter revision,
+driver version, interface name (eth3, eth4), and interrupt type (INTA, MSI, MSI-X).
+In case of Xframe II, the PCI/PCI-X bus width and frequency are displayed
+as well.
+
+To associate an interface with a physical adapter use "ethtool -p <ethX>".
+The corresponding adapter's LED will blink multiple times.
+
+3. Features supported
+=====================
+a. Jumbo frames. Xframe I/II supports MTU up to 9600 bytes,
+ modifiable using ip command.
+
+b. Offloads. Supports checksum offload(TCP/UDP/IP) on transmit
+ and receive, TSO.
+
+c. Multi-buffer receive mode. Scattering of packet across multiple
+ buffers. Currently driver supports 2-buffer mode which yields
+ significant performance improvement on certain platforms(SGI Altix,
+ IBM xSeries).
+
+d. MSI/MSI-X. Can be enabled on platforms which support this feature
+ (IA64, Xeon) resulting in noticeable performance improvement(up to 7%
+ on certain platforms).
+
+e. Statistics. Comprehensive MAC-level and software statistics displayed
+ using "ethtool -S" option.
+
+f. Multi-FIFO/Ring. Supports up to 8 transmit queues and receive rings,
+ with multiple steering options.
+
+4. Command line parameters
+==========================
+
+a. tx_fifo_num
+ Number of transmit queues
+
+Valid range: 1-8
+
+Default: 1
+
+b. rx_ring_num
+ Number of receive rings
+
+Valid range: 1-8
+
+Default: 1
+
+c. tx_fifo_len
+ Size of each transmit queue
+
+Valid range: Total length of all queues should not exceed 8192
+
+Default: 4096
+
+d. rx_ring_sz
+ Size of each receive ring(in 4K blocks)
+
+Valid range: Limited by memory on system
+
+Default: 30
+
+e. intr_type
+ Specifies interrupt type. Possible values 0(INTA), 2(MSI-X)
+
+Valid values: 0, 2
+
+Default: 2
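
The ranges listed above can be checked before the module is loaded. The following is a minimal sketch only: the function name ``s2io_insmod_cmd`` is hypothetical, it covers just three of the parameters, and it prints the ``insmod`` command line instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: validate a few s2io module parameters against the
# ranges documented above and print (not run) the insmod command line.
# Arguments are assumed to be integers: tx_fifo_num rx_ring_num intr_type
s2io_insmod_cmd() {
    tx_fifo_num=$1
    rx_ring_num=$2
    intr_type=$3

    if [ "$tx_fifo_num" -lt 1 ] || [ "$tx_fifo_num" -gt 8 ]; then
        echo "tx_fifo_num must be in the range 1-8" >&2
        return 1
    fi
    if [ "$rx_ring_num" -lt 1 ] || [ "$rx_ring_num" -gt 8 ]; then
        echo "rx_ring_num must be in the range 1-8" >&2
        return 1
    fi
    case $intr_type in
        0|2) ;;  # 0 = INTA, 2 = MSI-X
        *)
            echo "intr_type must be 0 or 2" >&2
            return 1
            ;;
    esac

    echo "insmod s2io.ko tx_fifo_num=$tx_fifo_num rx_ring_num=$rx_ring_num intr_type=$intr_type"
}
```

For example, ``s2io_insmod_cmd 4 4 2`` prints a command line requesting four transmit queues, four receive rings and MSI-X interrupts.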
+
+5. Performance suggestions
+==========================
+
+General:
+
+a. Set MTU to maximum(9000 for switch setup, 9600 in back-to-back configuration)
+b. Set TCP window size to optimal value.
+
+For instance, for MTU=1500 a value of 210K has been observed to result in
+good performance::
+
+ # sysctl -w net.ipv4.tcp_rmem="210000 210000 210000"
+ # sysctl -w net.ipv4.tcp_wmem="210000 210000 210000"
+
+For MTU=9000, TCP window size of 10 MB is recommended::
+
+ # sysctl -w net.ipv4.tcp_rmem="10000000 10000000 10000000"
+ # sysctl -w net.ipv4.tcp_wmem="10000000 10000000 10000000"
+
+Transmit performance:
+
+a. By default, the driver respects BIOS settings for PCI bus parameters.
+ However, you may want to experiment with PCI bus parameters
+ max-split-transactions(MOST) and MMRBC (use setpci command).
+
+ A MOST value of 2 has been found optimal for Opterons and 3 for Itanium.
+
+ It could be different for your hardware.
+
+ Set MMRBC to 4K (see the note on the AMD 8131 chipset at the end of this
+ section).
+
+ For example, you can set the following.
+
+ For Opteron::
+
+ #setpci -d 17d5:* 62=1d
+
+ For Itanium::
+
+ #setpci -d 17d5:* 62=3d
+
+ For detailed description of the PCI registers, please see Xframe User Guide.
+
+b. Ensure Transmit Checksum offload is enabled. Use ethtool to set/verify this
+ parameter.
+
+c. Turn on TSO(using "ethtool -K")::
+
+ # ethtool -K <ethX> tso on
+
+Receive performance:
+
+a. By default, the driver respects BIOS settings for PCI bus parameters.
+ However, you may want to set PCI latency timer to 248::
+
+ #setpci -d 17d5:* LATENCY_TIMER=f8
+
+ For detailed description of the PCI registers, please see Xframe User Guide.
+
+b. Use 2-buffer mode. This results in large performance boost on
+ certain platforms(eg. SGI Altix, IBM xSeries).
+
+c. Ensure Receive Checksum offload is enabled. Use "ethtool -K ethX" command to
+ set/verify this option.
+
+d. Enable NAPI feature(in kernel configuration Device Drivers ---> Network
+ device support ---> Ethernet (10000 Mbit) ---> S2IO 10Gbe Xframe NIC) to
+ bring down CPU utilization.
+
+.. note::
+
+ For AMD Opteron platforms with 8131 chipset, MMRBC=1 and MOST=1 are
+ recommended as safe parameters.
+
+For more information, please review the AMD8131 errata at
+http://vip.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/
+26310_AMD-8131_HyperTransport_PCI-X_Tunnel_Revision_Guide_rev_3_18.pdf
+
+6. Support
+==========
+
+For further support, please contact your 10GbE Xframe NIC vendor (IBM,
+HP, SGI, etc.).
diff --git a/Documentation/networking/device_drivers/netronome/nfp.rst b/Documentation/networking/device_drivers/ethernet/netronome/nfp.rst
index ada611fb427c..ada611fb427c 100644
--- a/Documentation/networking/device_drivers/netronome/nfp.rst
+++ b/Documentation/networking/device_drivers/ethernet/netronome/nfp.rst
diff --git a/Documentation/networking/device_drivers/ethernet/pensando/ionic.rst b/Documentation/networking/device_drivers/ethernet/pensando/ionic.rst
new file mode 100644
index 000000000000..0eabbc347d6c
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/pensando/ionic.rst
@@ -0,0 +1,274 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+========================================================
+Linux Driver for the Pensando(R) Ethernet adapter family
+========================================================
+
+Pensando Linux Ethernet driver.
+Copyright(c) 2019 Pensando Systems, Inc
+
+Contents
+========
+
+- Identifying the Adapter
+- Enabling the driver
+- Configuring the driver
+- Statistics
+- Support
+
+Identifying the Adapter
+=======================
+
+To find if one or more Pensando PCI Ethernet devices are installed on the
+host, check for the PCI devices::
+
+ $ lspci -d 1dd8:
+ b5:00.0 Ethernet controller: Device 1dd8:1002
+ b6:00.0 Ethernet controller: Device 1dd8:1002
+
+If such devices are listed as above, then the ionic.ko driver should find
+and configure them for use. There should be log entries in the kernel
+messages such as these::
+
+ $ dmesg | grep ionic
+ ionic 0000:b5:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
+ ionic 0000:b5:00.0 enp181s0: renamed from eth0
+ ionic 0000:b5:00.0 enp181s0: Link up - 100 Gbps
+ ionic 0000:b6:00.0: 126.016 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x16 link)
+ ionic 0000:b6:00.0 enp182s0: renamed from eth0
+ ionic 0000:b6:00.0 enp182s0: Link up - 100 Gbps
+
+Driver and firmware version information can be gathered with either the
+ethtool or the devlink tool::
+
+ $ ethtool -i enp181s0
+ driver: ionic
+ version: 5.7.0
+ firmware-version: 1.8.0-28
+ ...
+
+ $ devlink dev info pci/0000:b5:00.0
+ pci/0000:b5:00.0:
+ driver ionic
+ serial_number FLM18420073
+ versions:
+ fixed:
+ asic.id 0x0
+ asic.rev 0x0
+ running:
+ fw 1.8.0-28
+
+See Documentation/networking/devlink/ionic.rst for more information
+on the devlink dev info data.
+
+Enabling the driver
+===================
+
+The driver is enabled via the standard kernel configuration system,
+using the make command::
+
+ make oldconfig/menuconfig/etc.
+
+The driver is located in the menu structure at:
+
+ -> Device Drivers
+ -> Network device support (NETDEVICES [=y])
+ -> Ethernet driver support
+ -> Pensando devices
+ -> Pensando Ethernet IONIC Support
+
+Configuring the Driver
+======================
+
+MTU
+---
+
+Jumbo frame support is available with a maximum size of 9194 bytes.
+
+Interrupt coalescing
+--------------------
+
+Interrupt coalescing can be configured by changing the rx-usecs value with
+the "ethtool -C" command. The rx-usecs range is 0-190. The tx-usecs value
+reflects the rx-usecs value as they are tied together on the same interrupt.
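
As an illustration of the coalescing setting described above, the sketch below builds the ``ethtool -C`` command and rejects values outside the documented 0-190 range. The function name ``ionic_coalesce_cmd`` is hypothetical, and the command is only printed, not executed.

```shell
#!/bin/sh
# Hypothetical helper: print the ethtool command that sets rx-usecs on an
# interface, refusing values outside the documented 0-190 range.
# Arguments: interface name, rx-usecs value (assumed to be an integer).
ionic_coalesce_cmd() {
    dev=$1
    usecs=$2
    if [ "$usecs" -lt 0 ] || [ "$usecs" -gt 190 ]; then
        echo "rx-usecs must be in the range 0-190" >&2
        return 1
    fi
    echo "ethtool -C $dev rx-usecs $usecs"
}
```

For example, ``ionic_coalesce_cmd enp181s0 64`` prints ``ethtool -C enp181s0 rx-usecs 64``. Because tx-usecs is tied to the same interrupt, no separate tx-usecs setting is generated.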
+
+SR-IOV
+------
+
+Minimal SR-IOV support is currently offered and can be enabled by setting
+the sysfs 'sriov_numvfs' value, if supported by your particular firmware
+configuration.
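
A minimal sketch of setting ``sriov_numvfs`` follows. The function name ``ionic_set_numvfs`` is hypothetical, and it assumes the attribute is reached through the interface's ``device`` link under the sysfs net class directory passed as the first argument (normally /sys/class/net).

```shell
#!/bin/sh
# Hypothetical helper: request a number of virtual functions for an interface.
# $1 = sysfs net class directory (normally /sys/class/net)
# $2 = interface name
# $3 = number of VFs to request
ionic_set_numvfs() {
    attr="$1/$2/device/sriov_numvfs"
    if [ ! -w "$attr" ]; then
        # A missing or read-only attribute suggests SR-IOV is not available
        # with this firmware configuration, or that root access is needed.
        echo "$attr is not writable" >&2
        return 1
    fi
    echo "$3" > "$attr"
}
```

A possible invocation, run as root, is ``ionic_set_numvfs /sys/class/net enp181s0 4``.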
+
+Statistics
+==========
+
+Basic hardware stats
+--------------------
+
+The commands ``netstat -i``, ``ip -s link show``, and ``ifconfig`` show
+a limited set of statistics taken directly from firmware. For example::
+
+ $ ip -s link show enp181s0
+ 7: enp181s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
+ link/ether 00:ae:cd:00:07:68 brd ff:ff:ff:ff:ff:ff
+ RX: bytes packets errors dropped overrun mcast
+ 414 5 0 0 0 0
+ TX: bytes packets errors dropped carrier collsns
+ 1384 18 0 0 0 0
+
+ethtool -S
+----------
+
+The statistics shown by the ``ethtool -S`` command include a combination of
+driver counters and firmware counters, including port and queue specific values.
+The driver values are counters computed by the driver, and the firmware values
+are gathered by the firmware from the port hardware and passed through the
+driver with no further interpretation.
+
+Driver port specific::
+
+ tx_packets: 12
+ tx_bytes: 964
+ rx_packets: 5
+ rx_bytes: 414
+ tx_tso: 0
+ tx_tso_bytes: 0
+ tx_csum_none: 12
+ tx_csum: 0
+ rx_csum_none: 0
+ rx_csum_complete: 3
+ rx_csum_error: 0
+
+Driver queue specific::
+
+ tx_0_pkts: 3
+ tx_0_bytes: 294
+ tx_0_clean: 3
+ tx_0_dma_map_err: 0
+ tx_0_linearize: 0
+ tx_0_frags: 0
+ tx_0_tso: 0
+ tx_0_tso_bytes: 0
+ tx_0_csum_none: 3
+ tx_0_csum: 0
+ tx_0_vlan_inserted: 0
+ rx_0_pkts: 2
+ rx_0_bytes: 120
+ rx_0_dma_map_err: 0
+ rx_0_alloc_err: 0
+ rx_0_csum_none: 0
+ rx_0_csum_complete: 0
+ rx_0_csum_error: 0
+ rx_0_dropped: 0
+ rx_0_vlan_stripped: 0
+
+Firmware port specific::
+
+ hw_tx_dropped: 0
+ hw_rx_dropped: 0
+ hw_rx_over_errors: 0
+ hw_rx_missed_errors: 0
+ hw_tx_aborted_errors: 0
+ frames_rx_ok: 15
+ frames_rx_all: 15
+ frames_rx_bad_fcs: 0
+ frames_rx_bad_all: 0
+ octets_rx_ok: 1290
+ octets_rx_all: 1290
+ frames_rx_unicast: 10
+ frames_rx_multicast: 5
+ frames_rx_broadcast: 0
+ frames_rx_pause: 0
+ frames_rx_bad_length: 0
+ frames_rx_undersized: 0
+ frames_rx_oversized: 0
+ frames_rx_fragments: 0
+ frames_rx_jabber: 0
+ frames_rx_pripause: 0
+ frames_rx_stomped_crc: 0
+ frames_rx_too_long: 0
+ frames_rx_vlan_good: 3
+ frames_rx_dropped: 0
+ frames_rx_less_than_64b: 0
+ frames_rx_64b: 4
+ frames_rx_65b_127b: 11
+ frames_rx_128b_255b: 0
+ frames_rx_256b_511b: 0
+ frames_rx_512b_1023b: 0
+ frames_rx_1024b_1518b: 0
+ frames_rx_1519b_2047b: 0
+ frames_rx_2048b_4095b: 0
+ frames_rx_4096b_8191b: 0
+ frames_rx_8192b_9215b: 0
+ frames_rx_other: 0
+ frames_tx_ok: 31
+ frames_tx_all: 31
+ frames_tx_bad: 0
+ octets_tx_ok: 2614
+ octets_tx_total: 2614
+ frames_tx_unicast: 8
+ frames_tx_multicast: 21
+ frames_tx_broadcast: 2
+ frames_tx_pause: 0
+ frames_tx_pripause: 0
+ frames_tx_vlan: 0
+ frames_tx_less_than_64b: 0
+ frames_tx_64b: 4
+ frames_tx_65b_127b: 27
+ frames_tx_128b_255b: 0
+ frames_tx_256b_511b: 0
+ frames_tx_512b_1023b: 0
+ frames_tx_1024b_1518b: 0
+ frames_tx_1519b_2047b: 0
+ frames_tx_2048b_4095b: 0
+ frames_tx_4096b_8191b: 0
+ frames_tx_8192b_9215b: 0
+ frames_tx_other: 0
+ frames_tx_pri_0: 0
+ frames_tx_pri_1: 0
+ frames_tx_pri_2: 0
+ frames_tx_pri_3: 0
+ frames_tx_pri_4: 0
+ frames_tx_pri_5: 0
+ frames_tx_pri_6: 0
+ frames_tx_pri_7: 0
+ frames_rx_pri_0: 0
+ frames_rx_pri_1: 0
+ frames_rx_pri_2: 0
+ frames_rx_pri_3: 0
+ frames_rx_pri_4: 0
+ frames_rx_pri_5: 0
+ frames_rx_pri_6: 0
+ frames_rx_pri_7: 0
+ tx_pripause_0_1us_count: 0
+ tx_pripause_1_1us_count: 0
+ tx_pripause_2_1us_count: 0
+ tx_pripause_3_1us_count: 0
+ tx_pripause_4_1us_count: 0
+ tx_pripause_5_1us_count: 0
+ tx_pripause_6_1us_count: 0
+ tx_pripause_7_1us_count: 0
+ rx_pripause_0_1us_count: 0
+ rx_pripause_1_1us_count: 0
+ rx_pripause_2_1us_count: 0
+ rx_pripause_3_1us_count: 0
+ rx_pripause_4_1us_count: 0
+ rx_pripause_5_1us_count: 0
+ rx_pripause_6_1us_count: 0
+ rx_pripause_7_1us_count: 0
+ rx_pause_1us_count: 0
+ frames_tx_truncated: 0
+
+
+Support
+=======
+
+For general Linux networking support, please use the netdev mailing
+list, which is monitored by Pensando personnel::
+
+ netdev@vger.kernel.org
+
+For more specific support needs, please use the Pensando driver support
+email::
+
+ drivers@pensando.io
diff --git a/Documentation/networking/device_drivers/ethernet/smsc/smc9.rst b/Documentation/networking/device_drivers/ethernet/smsc/smc9.rst
new file mode 100644
index 000000000000..e5eac896a631
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/smsc/smc9.rst
@@ -0,0 +1,48 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================
+SMC 9xxxx Driver
+================
+
+Revision 0.12
+
+3/5/96
+
+Copyright 1996 Erik Stahlman
+
+Released under terms of the GNU General Public License.
+
+This file contains the instructions and caveats for my SMC9xxx driver. You
+should not be using the driver without reading this file.
+
+Things to note about installation:
+
+ 1. The driver should work on all kernels from 1.2.13 until 1.3.71.
+ (A kernel patch is supplied for 1.3.71 )
+
+ 2. If you include this into the kernel, you might need to change some
+ options, such as for forcing IRQ.
+
+
+ 3. To compile as a module, run 'make'.
+ Make will give you the appropriate options for various kernel support.
+
+ 4. Loading the driver as a module::
+
+ use: insmod smc9194.o
+ optional parameters:
+ io=xxxx : your base address
+ irq=xx : your irq
+ ifport=x : 0 for whatever is default
+ 1 for twisted pair
+ 2 for AUI ( or BNC on some cards )
+
+How to obtain the latest version?
+
+FTP:
+ ftp://fenris.campus.vt.edu/smc9/smc9-12.tar.gz
+ ftp://sfbox.vt.edu/filebox/F/fenris/smc9/smc9-12.tar.gz
+
+
+Contacting me:
+ erik@mail.vt.edu
diff --git a/Documentation/networking/device_drivers/stmicro/stmmac.rst b/Documentation/networking/device_drivers/ethernet/stmicro/stmmac.rst
index c34bab3d2df0..5d46e5036129 100644
--- a/Documentation/networking/device_drivers/stmicro/stmmac.rst
+++ b/Documentation/networking/device_drivers/ethernet/stmicro/stmmac.rst
@@ -32,7 +32,8 @@ is also supported.
DesignWare(R) Cores Ethernet MAC 10/100/1000 Universal version 3.70a
(and older) and DesignWare(R) Cores Ethernet Quality-of-Service version 4.0
(and upper) have been used for developing this driver as well as
-DesignWare(R) Cores XGMAC - 10G Ethernet MAC.
+DesignWare(R) Cores XGMAC - 10G Ethernet MAC and DesignWare(R) Cores
+Enterprise MAC - 100G Ethernet MAC.
This driver supports both the platform bus and PCI.
@@ -48,6 +49,8 @@ Cores Ethernet Controllers and corresponding minimum and maximum versions:
+-------------------------------+--------------+--------------+--------------+
| XGMAC - 10G Ethernet MAC | 2.10a | N/A | XGMAC2+ |
+-------------------------------+--------------+--------------+--------------+
+| XLGMAC - 100G Ethernet MAC | 2.00a | N/A | XLGMAC2+ |
++-------------------------------+--------------+--------------+--------------+
For questions related to hardware requirements, refer to the documentation
supplied with your Ethernet adapter. All hardware requirements listed apply
@@ -57,7 +60,7 @@ Feature List
============
The following features are available in this driver:
- - GMII/MII/RGMII/SGMII/RMII/XGMII Interface
+ - GMII/MII/RGMII/SGMII/RMII/XGMII/XLGMII Interface
- Half-Duplex / Full-Duplex Operation
- Energy Efficient Ethernet (EEE)
- IEEE 802.3x PAUSE Packets (Flow Control)
diff --git a/Documentation/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst b/Documentation/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst
new file mode 100644
index 000000000000..f24adfab6a1b
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst
@@ -0,0 +1,143 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================================================
+Texas Instruments K3 AM65 CPSW NUSS switchdev based ethernet driver
+===================================================================
+
+:Version: 1.0
+
+Port renaming
+=============
+
+In order to rename via udev::
+
+ ip -d link show dev sw0p1 | grep switchid
+
+ SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}==<switchid>, \
+ ATTR{phys_port_name}!="", NAME="sw0$attr{phys_port_name}"
+
+
+Multi mac mode
+==============
+
+- The driver is operating in multi-mac mode by default, thus
+ working as N individual network interfaces.
+
+Devlink configuration parameters
+================================
+
+See Documentation/networking/devlink/am65-nuss-cpsw-switch.rst
+
+Enabling "switch"
+=================
+
+The Switch mode can be enabled by configuring devlink driver parameter
+"switch_mode" to 1/true::
+
+ devlink dev param set platform/c000000.ethernet \
+ name switch_mode value true cmode runtime
+
+This can be done regardless of the state of the ports' netdev devices (UP/DOWN),
+but the ports' netdev devices have to be UP before joining the bridge. Otherwise
+the bridge configuration is overwritten, because the CPSW switch driver
+completely reloads its configuration when the first port changes its state to UP.
+
+When both interfaces have joined the bridge, the CPSW switch driver enables
+marking packets with the offload_fwd_mark flag.
+
+All configuration is implemented via switchdev API.
+
+Bridge setup
+============
+
+::
+
+ devlink dev param set platform/c000000.ethernet \
+ name switch_mode value true cmode runtime
+
+ ip link add name br0 type bridge
+ ip link set dev br0 type bridge ageing_time 1000
+ ip link set dev sw0p1 up
+ ip link set dev sw0p2 up
+ ip link set dev sw0p1 master br0
+ ip link set dev sw0p2 master br0
+
+ [*] bridge vlan add dev br0 vid 1 pvid untagged self
+
+ [*] if vlan_filtering=1. where default_pvid=1
+
+ Note. Steps [*] are mandatory.
+
+
+On/off STP
+==========
+
+::
+
+ ip link set dev BRDEV type bridge stp_state 1/0
+
+VLAN configuration
+==================
+
+::
+
+ bridge vlan add dev br0 vid 1 pvid untagged self <---- add cpu port to VLAN 1
+
+Note. This step is mandatory for bridge/default_pvid.
+
+Add extra VLANs
+===============
+
+ 1. untagged::
+
+ bridge vlan add dev sw0p1 vid 100 pvid untagged master
+ bridge vlan add dev sw0p2 vid 100 pvid untagged master
+ bridge vlan add dev br0 vid 100 pvid untagged self <---- Add cpu port to VLAN100
+
+ 2. tagged::
+
+ bridge vlan add dev sw0p1 vid 100 master
+ bridge vlan add dev sw0p2 vid 100 master
+ bridge vlan add dev br0 vid 100 pvid tagged self <---- Add cpu port to VLAN100
+
+FDBs
+----
+
+FDBs are automatically added on the appropriate switch port upon detection
+
+Manually adding FDBs::
+
+ bridge fdb add aa:bb:cc:dd:ee:ff dev sw0p1 master vlan 100
+ bridge fdb add aa:bb:cc:dd:ee:fe dev sw0p2 master <---- Add on all VLANs
+
+MDBs
+----
+
+MDBs are automatically added on the appropriate switch port upon detection
+
+Manually adding MDBs::
+
+ bridge mdb add dev br0 port sw0p1 grp 239.1.1.1 permanent vid 100
+ bridge mdb add dev br0 port sw0p1 grp 239.1.1.1 permanent <---- Add on all VLANs
+
+Multicast flooding
+==================
+CPU port mcast_flooding is always on
+
+Turning flooding on/off on switch ports::
+
+ bridge link set dev sw0p1 mcast_flood on/off
+
+Access and Trunk port
+=====================
+
+::
+
+ bridge vlan add dev sw0p1 vid 100 pvid untagged master
+ bridge vlan add dev sw0p2 vid 100 master
+
+
+ bridge vlan add dev br0 vid 100 self
+ ip link add link br0 name br0.100 type vlan id 100
+
+Note. Setting PVID on Bridge device itself works only for
+default VLAN (default_pvid).
diff --git a/Documentation/networking/device_drivers/ethernet/ti/cpsw.rst b/Documentation/networking/device_drivers/ethernet/ti/cpsw.rst
new file mode 100644
index 000000000000..a88946bd188b
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/ti/cpsw.rst
@@ -0,0 +1,587 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================
+Texas Instruments CPSW ethernet driver
+======================================
+
+Multiqueue & CBS & MQPRIO
+=========================
+
+
+The cpsw has 3 CBS shapers for each external port. This document
+describes MQPRIO and CBS Qdisc offload configuration for cpsw driver
+based on examples. It potentially can be used in audio video bridging
+(AVB) and time sensitive networking (TSN).
+
+The following examples were tested on AM572x EVM and BBB boards.
+
+Test setup
+==========
+
+Two examples are considered, both with an AM572x EVM running the cpsw driver
+in dual_emac mode.
+
+Several prerequisites:
+
+- TX queues must be rated starting from txq0 that has highest priority
+- Traffic classes are used starting from 0, that has highest priority
+- CBS shapers should be used with rated queues
+- The bandwidth for CBS shapers has to be set a little bit higher than the
+ potential incoming rate, thus, the rate of all incoming tx queues has
+ to be a little less
+- Real rates can differ, due to discreteness
+- Mapping skb-priority to txq is not enough; an skb-priority to l2 prio
+ map also has to be created with the ip or vconfig tool
+- Any l2/socket prio (0 - 7) for classes can be used, but for
+ simplicity default values are used: 3 and 2
+- Only 2 classes were tested: A and B. More were checked and can work;
+ the maximum allowed is 4, but a rate can be set for only 3 of them.
+
+Test setup for examples
+=======================
+
+::
+
+                                        +-------------------------------+
+                                        |--+                            |
+                                        |  |      Workstation0          |
+                                        |E |  MAC 18:03:73:66:87:42     |
+    +-----------------------------+  +--|t |                            |
+    |                    | 1  | E |  |  |h |./tsn_listener -d \         |
+    |  Target board:     | 0  | t |--+  |0 | 18:03:73:66:87:42 -i eth0 \|
+    |  AM572x EVM        | 0  | h |     |  | -s 1500                    |
+    |                    | 0  | 0 |     |--+                            |
+    |  Only 2 classes:   |Mb  +---|     +-------------------------------+
+    |  class A, class B  |        |
+    |                    |    +---|     +-------------------------------+
+    |                    | 1  | E |     |--+                            |
+    |                    | 0  | t |     |  |      Workstation1          |
+    |                    | 0  | h |--+  |E |  MAC 20:cf:30:85:7d:fd     |
+    |                    |Mb  | 1 |  +--|t |                            |
+    +-----------------------------+     |h |./tsn_listener -d \         |
+                                        |0 | 20:cf:30:85:7d:fd -i eth0 \|
+                                        |  | -s 1500                    |
+                                        |--+                            |
+                                        +-------------------------------+
+
+
+Example 1: One port tx AVB configuration scheme for target board
+----------------------------------------------------------------
+
+(prints and scheme for AM572x evm, applicable for single port boards)
+
+- tc - traffic class
+- txq - transmit queue
+- p - priority
+- f - fifo (cpsw fifo)
+- S - shaper configured
+
+::
+
+ +------------------------------------------------------------------+ u
+ | +---------------+ +---------------+ +------+ +------+ | s
+ | | | | | | | | | | e
+ | | App 1 | | App 2 | | Apps | | Apps | | r
+ | | Class A | | Class B | | Rest | | Rest | |
+ | | Eth0 | | Eth0 | | Eth0 | | Eth1 | | s
+ | | VLAN100 | | VLAN100 | | | | | | | | p
+ | | 40 Mb/s | | 20 Mb/s | | | | | | | | a
+ | | SO_PRIORITY=3 | | SO_PRIORITY=2 | | | | | | | | c
+ | | | | | | | | | | | | | | e
+ | +---|-----------+ +---|-----------+ +---|--+ +---|--+ |
+ +-----|------------------|------------------|--------|-------------+
+ +-+ +------------+ | |
+ | | +-----------------+ +--+
+ | | | |
+ +---|-------|-------------|-----------------------|----------------+
+ | +----+ +----+ +----+ +----+ +----+ |
+ | | p3 | | p2 | | p1 | | p0 | | p0 | | k
+ | \ / \ / \ / \ / \ / | e
+ | \ / \ / \ / \ / \ / | r
+ | \/ \/ \/ \/ \/ | n
+ | | | | | | e
+ | | | +-----+ | | l
+ | | | | | |
+ | +----+ +----+ +----+ +----+ | s
+ | |tc0 | |tc1 | |tc2 | |tc0 | | p
+ | \ / \ / \ / \ / | a
+ | \ / \ / \ / \ / | c
+ | \/ \/ \/ \/ | e
+ | | | +-----+ | |
+ | | | | | | |
+ | | | | | | |
+ | | | | | | |
+ | +----+ +----+ +----+ +----+ +----+ |
+ | |txq0| |txq1| |txq2| |txq3| |txq4| |
+ | \ / \ / \ / \ / \ / |
+ | \ / \ / \ / \ / \ / |
+ | \/ \/ \/ \/ \/ |
+ | +-|------|------|------|--+ +--|--------------+ |
+ | | | | | | | Eth0.100 | | Eth1 | |
+ +---|------|------|------|------------------------|----------------+
+ | | | | |
+ p p p p |
+ 3 2 0-1, 4-7 <- L2 priority |
+ | | | | |
+ | | | | |
+ +---|------|------|------|------------------------|----------------+
+ | | | | | |----------+ |
+ | +----+ +----+ +----+ +----+ +----+ |
+ | |dma7| |dma6| |dma5| |dma4| |dma3| |
+ | \ / \ / \ / \ / \ / | c
+ | \S / \S / \ / \ / \ / | p
+ | \/ \/ \/ \/ \/ | s
+ | | | | +----- | | w
+ | | | | | | |
+ | | | | | | | d
+ | +----+ +----+ +----+p p+----+ | r
+ | | | | | | |o o| | | i
+ | | f3 | | f2 | | f0 |r r| f0 | | v
+ | |tc0 | |tc1 | |tc2 |t t|tc0 | | e
+ | \CBS / \CBS / \CBS /1 2\CBS / | r
+ | \S / \S / \ / \ / |
+ | \/ \/ \/ \/ |
+ +------------------------------------------------------------------+
+
+
+1) ::
+
+
+ // Add 4 tx queues, for interface Eth0, and 1 tx queue for Eth1
+ $ ethtool -L eth0 rx 1 tx 5
+ rx unmodified, ignoring
+
+2) ::
+
+ // Check if num of queues is set correctly:
+ $ ethtool -l eth0
+ Channel parameters for eth0:
+ Pre-set maximums:
+ RX: 8
+ TX: 8
+ Other: 0
+ Combined: 0
+ Current hardware settings:
+ RX: 1
+ TX: 5
+ Other: 0
+ Combined: 0
+
+3) ::
+
+ // TX queues must be rated starting from 0, so set bws for tx0 and tx1
+ // Set rates 40 and 20 Mb/s respectively.
+ // Note that the real speed can differ a bit due to discreteness.
+ // Leave last 2 tx queues not rated.
+ $ echo 40 > /sys/class/net/eth0/queues/tx-0/tx_maxrate
+ $ echo 20 > /sys/class/net/eth0/queues/tx-1/tx_maxrate
+
+4) ::
+
+ // Check maximum rate of tx (cpdma) queues:
+ $ cat /sys/class/net/eth0/queues/tx-*/tx_maxrate
+ 40
+ 20
+ 0
+ 0
+ 0
+
+5) ::
+
+ // Map skb->priority to traffic class:
+ // 3pri -> tc0, 2pri -> tc1, (0,1,4-7)pri -> tc2
+ // Map traffic class to transmit queue:
+ // tc0 -> txq0, tc1 -> txq1, tc2 -> (txq2, txq3)
+ $ tc qdisc replace dev eth0 handle 100: parent root mqprio num_tc 3 \
+ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 1
+
+5a) ::
+
+ // As the two interfaces share the same set of tx queues, assign all
+ // traffic for interface Eth1 to a separate queue so that it does not
+ // mix with traffic from interface Eth0: use a separate txq to send
+ // packets to Eth1, so all prio -> tc0 and tc0 -> txq4
+ // Here hw 0, so the default configuration is still used for eth1 in hw
+ $ tc qdisc replace dev eth1 handle 100: parent root mqprio num_tc 1 \
+ map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 queues 1@4 hw 0
+
+6) ::
+
+ // Check classes settings
+ $ tc -g class show dev eth0
+ +---(100:ffe2) mqprio
+ | +---(100:3) mqprio
+ | +---(100:4) mqprio
+ |
+ +---(100:ffe1) mqprio
+ | +---(100:2) mqprio
+ |
+ +---(100:ffe0) mqprio
+ +---(100:1) mqprio
+
+ $ tc -g class show dev eth1
+ +---(100:ffe0) mqprio
+ +---(100:5) mqprio
+
+7) ::
+
+ // Set rate for class A - 41 Mbit (tc0, txq0) using CBS Qdisc
+ // Set it +1 Mb for reserve (important!)
+ // Here only idleslope is important, other args are ignored
+ // Note that the real speed can differ a bit due to discreteness
+ $ tc qdisc add dev eth0 parent 100:1 cbs locredit -1438 \
+ hicredit 62 sendslope -959000 idleslope 41000 offload 1
+ net eth0: set FIFO3 bw = 50
+
+8) ::
+
+ // Set rate for class B - 21 Mbit (tc1, txq1) using CBS Qdisc:
+ // Set it +1 Mb for reserve (important!)
+ $ tc qdisc add dev eth0 parent 100:2 cbs locredit -1468 \
+ hicredit 65 sendslope -979000 idleslope 21000 offload 1
+ net eth0: set FIFO2 bw = 30
+
+9) ::
+
+ // Create vlan 100 to map sk->priority to vlan qos
+ $ ip link add link eth0 name eth0.100 type vlan id 100
+ 8021q: 802.1Q VLAN Support v1.8
+ 8021q: adding VLAN 0 to HW filter on device eth0
+ 8021q: adding VLAN 0 to HW filter on device eth1
+ net eth0: Adding vlanid 100 to vlan filter
+
+10) ::
+
+ // Map skb->priority to L2 prio, 1 to 1
+ $ ip link set eth0.100 type vlan \
+ egress 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
+
+11) ::
+
+ // Check egress map for vlan 100
+ $ cat /proc/net/vlan/eth0.100
+ [...]
+ INGRESS priority mappings: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0
+ EGRESS priority mappings: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
+
+12) ::
+
+ // Run your tools with socket option "SO_PRIORITY" set
+ // to 3 for class A and/or to 2 for class B
+ // (tools taken from https://www.spinics.net/lists/netdev/msg460869.html)
+ ./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p3 -s 1500&
+ ./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p2 -s 1500&
+
+13) ::
+
+ // run your listener on workstation (should be in same vlan)
+ // (tools taken from https://www.spinics.net/lists/netdev/msg460869.html)
+ ./tsn_listener -d 18:03:73:66:87:42 -i enp5s0 -s 1500
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39000 kbps
+
+14) ::
+
+ // Restore default configuration if needed
+ $ ip link del eth0.100
+ $ tc qdisc del dev eth1 root
+ $ tc qdisc del dev eth0 root
+ net eth0: Prev FIFO2 is shaped
+ net eth0: set FIFO3 bw = 0
+ net eth0: set FIFO2 bw = 0
+ $ ethtool -L eth0 rx 1 tx 1
+
+Example 2: Two port tx AVB configuration scheme for target board
+----------------------------------------------------------------
+
+(prints and scheme for AM572x evm, for dual emac boards only)
+
+::
+
+ +------------------------------------------------------------------+ u
+ | +----------+ +----------+ +------+ +----------+ +----------+ | s
+ | | | | | | | | | | | | e
+ | | App 1 | | App 2 | | Apps | | App 3 | | App 4 | | r
+ | | Class A | | Class B | | Rest | | Class B | | Class A | |
+ | | Eth0 | | Eth0 | | | | | Eth1 | | Eth1 | | s
+ | | VLAN100 | | VLAN100 | | | | | VLAN100 | | VLAN100 | | p
+ | | 40 Mb/s | | 20 Mb/s | | | | | 10 Mb/s | | 30 Mb/s | | a
+ | | SO_PRI=3 | | SO_PRI=2 | | | | | SO_PRI=3 | | SO_PRI=2 | | c
+ | | | | | | | | | | | | | | | | | e
+ | +---|------+ +---|------+ +---|--+ +---|------+ +---|------+ |
+ +-----|-------------|-------------|---------|-------------|--------+
+ +-+ +-------+ | +----------+ +----+
+ | | +-------+------+ | |
+ | | | | | |
+ +---|-------|-------------|--------------|-------------|-------|---+
+ | +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ |
+ | | p3 | | p2 | | p1 | | p0 | | p0 | | p1 | | p2 | | p3 | | k
+ | \ / \ / \ / \ / \ / \ / \ / \ / | e
+ | \ / \ / \ / \ / \ / \ / \ / \ / | r
+ | \/ \/ \/ \/ \/ \/ \/ \/ | n
+ | | | | | | | | e
+ | | | +----+ +----+ | | | l
+ | | | | | | | |
+ | +----+ +----+ +----+ +----+ +----+ +----+ | s
+ | |tc0 | |tc1 | |tc2 | |tc2 | |tc1 | |tc0 | | p
+ | \ / \ / \ / \ / \ / \ / | a
+ | \ / \ / \ / \ / \ / \ / | c
+ | \/ \/ \/ \/ \/ \/ | e
+ | | | +-----+ +-----+ | | |
+ | | | | | | | | | |
+ | | | | | | | | | |
+ | | | | | E E | | | | |
+ | +----+ +----+ +----+ +----+ t t +----+ +----+ +----+ +----+ |
+ | |txq0| |txq1| |txq4| |txq5| h h |txq6| |txq7| |txq3| |txq2| |
+ | \ / \ / \ / \ / 0 1 \ / \ / \ / \ / |
+ | \ / \ / \ / \ / . . \ / \ / \ / \ / |
+ | \/ \/ \/ \/ 1 1 \/ \/ \/ \/ |
+ | +-|------|------|------|--+ 0 0 +-|------|------|------|--+ |
+ | | | | | | | 0 0 | | | | | | |
+ +---|------|------|------|---------------|------|------|------|----+
+ | | | | | | | |
+ p p p p p p p p
+ 3 2 0-1, 4-7 <-L2 pri-> 0-1, 4-7 2 3
+ | | | | | | | |
+ | | | | | | | |
+ +---|------|------|------|---------------|------|------|------|----+
+ | | | | | | | | | |
+ | +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ |
+ | |dma7| |dma6| |dma3| |dma2| |dma1| |dma0| |dma4| |dma5| |
+ | \ / \ / \ / \ / \ / \ / \ / \ / | c
+ | \S / \S / \ / \ / \ / \ / \S / \S / | p
+ | \/ \/ \/ \/ \/ \/ \/ \/ | s
+ | | | | +----- | | | | | w
+ | | | | | +----+ | | | |
+ | | | | | | | | | | d
+ | +----+ +----+ +----+p p+----+ +----+ +----+ | r
+ | | | | | | |o o| | | | | | | i
+ | | f3 | | f2 | | f0 |r CPSW r| f3 | | f2 | | f0 | | v
+ | |tc0 | |tc1 | |tc2 |t t|tc0 | |tc1 | |tc2 | | e
+ | \CBS / \CBS / \CBS /1 2\CBS / \CBS / \CBS / | r
+ | \S / \S / \ / \S / \S / \ / |
+ | \/ \/ \/ \/ \/ \/ |
+ +------------------------------------------------------------------+
+ ========================================Eth==========================>
+
+1) ::
+
+ // Add 8 tx queues. The queues are common, so they are accessed
+ // by both interfaces, Eth0 and Eth1.
+ $ ethtool -L eth1 rx 1 tx 8
+ rx unmodified, ignoring
+
+2) ::
+
+ // Check if num of queues is set correctly:
+ $ ethtool -l eth0
+ Channel parameters for eth0:
+ Pre-set maximums:
+ RX: 8
+ TX: 8
+ Other: 0
+ Combined: 0
+ Current hardware settings:
+ RX: 1
+ TX: 8
+ Other: 0
+ Combined: 0
+
+3) ::
+
+ // TX queues must be rated starting from 0, so set bws for tx0 and tx1 for Eth0
+ // and for tx2 and tx3 for Eth1. That is, rates 40 and 20 Mb/s respectively
+ // for Eth0 and 30 and 10 Mb/s for Eth1.
+ // Real speed can differ a bit due to discreteness
+ // Leave last 4 tx queues as not rated
+ $ echo 40 > /sys/class/net/eth0/queues/tx-0/tx_maxrate
+ $ echo 20 > /sys/class/net/eth0/queues/tx-1/tx_maxrate
+ $ echo 30 > /sys/class/net/eth1/queues/tx-2/tx_maxrate
+ $ echo 10 > /sys/class/net/eth1/queues/tx-3/tx_maxrate
+
+4) ::
+
+ // Check maximum rate of tx (cpdma) queues:
+ $ cat /sys/class/net/eth0/queues/tx-*/tx_maxrate
+ 40
+ 20
+ 30
+ 10
+ 0
+ 0
+ 0
+ 0
+
+5) ::
+
+ // Map skb->priority to traffic class for Eth0:
+ // 3pri -> tc0, 2pri -> tc1, (0,1,4-7)pri -> tc2
+ // Map traffic class to transmit queue:
+ // tc0 -> txq0, tc1 -> txq1, tc2 -> (txq4, txq5)
+ $ tc qdisc replace dev eth0 handle 100: parent root mqprio num_tc 3 \
+ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@4 hw 1
+
+6) ::
+
+ // Check classes settings
+ $ tc -g class show dev eth0
+ +---(100:ffe2) mqprio
+ | +---(100:5) mqprio
+ | +---(100:6) mqprio
+ |
+ +---(100:ffe1) mqprio
+ | +---(100:2) mqprio
+ |
+ +---(100:ffe0) mqprio
+ +---(100:1) mqprio
+
+7) ::
+
+ // Set rate for class A - 41 Mbit (tc0, txq0) using CBS Qdisc for Eth0
+ // Here only idleslope is important, other args are ignored
+ // Real speed can differ a bit due to discreteness
+ $ tc qdisc add dev eth0 parent 100:1 cbs locredit -1470 \
+ hicredit 62 sendslope -959000 idleslope 41000 offload 1
+ net eth0: set FIFO3 bw = 50
+
+8) ::
+
+ // Set rate for class B - 21 Mbit (tc1, txq1) using CBS Qdisc for Eth0
+ $ tc qdisc add dev eth0 parent 100:2 cbs locredit -1470 \
+ hicredit 65 sendslope -979000 idleslope 21000 offload 1
+ net eth0: set FIFO2 bw = 30
+
+9) ::
+
+ // Create vlan 100 to map sk->priority to vlan qos for Eth0
+ $ ip link add link eth0 name eth0.100 type vlan id 100
+ net eth0: Adding vlanid 100 to vlan filter
+
+10) ::
+
+ // Map skb->priority to L2 prio for Eth0.100, one to one
+ $ ip link set eth0.100 type vlan \
+ egress 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
+
+11) ::
+
+ // Check egress map for vlan 100
+ $ cat /proc/net/vlan/eth0.100
+ [...]
+ INGRESS priority mappings: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0
+ EGRESS priority mappings: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
+
+12) ::
+
+ // Map skb->priority to traffic class for Eth1:
+ // 3pri -> tc0, 2pri -> tc1, (0,1,4-7)pri -> tc2
+ // Map traffic class to transmit queue:
+ // tc0 -> txq2, tc1 -> txq3, tc2 -> (txq6, txq7)
+ $ tc qdisc replace dev eth1 handle 100: parent root mqprio num_tc 3 \
+ map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@2 1@3 2@6 hw 1
+
+13) ::
+
+ // Check classes settings
+ $ tc -g class show dev eth1
+ +---(100:ffe2) mqprio
+ | +---(100:7) mqprio
+ | +---(100:8) mqprio
+ |
+ +---(100:ffe1) mqprio
+ | +---(100:4) mqprio
+ |
+ +---(100:ffe0) mqprio
+ +---(100:3) mqprio
+
+14) ::
+
+ // Set rate for class A - 31 Mbit (tc0, txq2) using CBS Qdisc for Eth1
+ // here only idle slope is important, others ignored, but calculated
+ // for interface speed - 100Mb for eth1 port.
+ // Set it +1 Mb for reserve (important!)
+ $ tc qdisc add dev eth1 parent 100:3 cbs locredit -1035 \
+ hicredit 465 sendslope -69000 idleslope 31000 offload 1
+ net eth1: set FIFO3 bw = 31
+
+15) ::
+
+ // Set rate for class B - 11 Mbit (tc1, txq3) using CBS Qdisc for Eth1
+ // Set it +1 Mb for reserve (important!)
+ $ tc qdisc add dev eth1 parent 100:4 cbs locredit -1335 \
+ hicredit 405 sendslope -89000 idleslope 11000 offload 1
+ net eth1: set FIFO2 bw = 11
+
+16) ::
+
+ // Create vlan 100 to map sk->priority to vlan qos for Eth1
+ $ ip link add link eth1 name eth1.100 type vlan id 100
+ net eth1: Adding vlanid 100 to vlan filter
+
+17) ::
+
+ // Map skb->priority to L2 prio for Eth1.100, one to one
+ $ ip link set eth1.100 type vlan \
+ egress 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
+
+18) ::
+
+ // Check egress map for vlan 100
+ $ cat /proc/net/vlan/eth1.100
+ [...]
+ INGRESS priority mappings: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0
+ EGRESS priority mappings: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
+
+19) ::
+
+ // Run appropriate tools with socket option "SO_PRIORITY" to 3
+ // for class A and to 2 for class B. For both interfaces
+ ./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p2 -s 1500&
+ ./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p3 -s 1500&
+ ./tsn_talker -d 20:cf:30:85:7d:fd -i eth1.100 -p2 -s 1500&
+ ./tsn_talker -d 20:cf:30:85:7d:fd -i eth1.100 -p3 -s 1500&
+
+20) ::
+
+ // run your listener on workstation (should be in same vlan)
+ // (tools taken from https://www.spinics.net/lists/netdev/msg460869.html)
+ ./tsn_listener -d 18:03:73:66:87:42 -i enp5s0 -s 1500
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39012 kbps
+ Receiving data rate: 39000 kbps
+
+21) ::
+
+ // Restore default configuration if needed
+ $ ip link del eth1.100
+ $ ip link del eth0.100
+ $ tc qdisc del dev eth1 root
+ net eth1: Prev FIFO2 is shaped
+ net eth1: set FIFO3 bw = 0
+ net eth1: set FIFO2 bw = 0
+ $ tc qdisc del dev eth0 root
+ net eth0: Prev FIFO2 is shaped
+ net eth0: set FIFO3 bw = 0
+ net eth0: set FIFO2 bw = 0
+ $ ethtool -L eth0 rx 1 tx 1
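The CBS steps in both examples pass locredit, hicredit and sendslope next to idleslope, and step 14 of Example 2 notes that they are calculated for the interface speed. The sketch below shows one way to derive them, assuming the relations described in the tc-cbs(8) man page and a 1500 byte maximum frame. ``cbs_params`` is a hypothetical helper, not part of iproute2 or the driver. Its hicredit accounts for only one interfering frame of maximum size, so it reproduces the class A values of step 14 but not the class B hicredit values used above.

```shell
# cbs_params IDLESLOPE_KBIT PORT_RATE_KBIT [MAX_FRAME_BYTES]
# Hypothetical helper: prints CBS arguments derived from idleslope.
cbs_params() {
    idleslope=$1
    port_rate=$2
    max_frame=${3:-1500}    # assumption: 1500 byte frames

    # sendslope is the rate at which credit drains while transmitting.
    sendslope=$((idleslope - port_rate))
    # Credits scale the frame size by the slope/port rate ratio.
    hicredit=$((max_frame * idleslope / port_rate))
    locredit=$((max_frame * sendslope / port_rate))

    echo "locredit $locredit hicredit $hicredit sendslope $sendslope idleslope $idleslope"
}

# Class A on the 100 Mb eth1 port, as in step 14 of Example 2.
cbs_params 31000 100000
# prints: locredit -1035 hicredit 465 sendslope -69000 idleslope 31000
```

With offload 1 only idleslope is used, as the step comments say, so the remaining values mainly matter when the same command line is reused without offload.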
diff --git a/Documentation/networking/device_drivers/ti/cpsw_switchdev.txt b/Documentation/networking/device_drivers/ethernet/ti/cpsw_switchdev.rst
index 12855ab268b8..1241ecac73bd 100644
--- a/Documentation/networking/device_drivers/ti/cpsw_switchdev.txt
+++ b/Documentation/networking/device_drivers/ethernet/ti/cpsw_switchdev.rst
@@ -1,30 +1,44 @@
-* Texas Instruments CPSW switchdev based ethernet driver 2.0
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================================
+Texas Instruments CPSW switchdev based ethernet driver
+======================================================
+
+:Version: 2.0
+
+Port renaming
+=============
-- Port renaming
On older udev versions renaming of ethX to swXpY will not be automatically
supported
-In order to rename via udev:
-ip -d link show dev sw0p1 | grep switchid
-SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}==<switchid>, \
- ATTR{phys_port_name}!="", NAME="sw0$attr{phys_port_name}"
+In order to rename via udev::
+
+ ip -d link show dev sw0p1 | grep switchid
+
+ SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}==<switchid>, \
+ ATTR{phys_port_name}!="", NAME="sw0$attr{phys_port_name}"
+
+Dual mac mode
+=============
-====================
-# Dual mac mode
-====================
- The new (cpsw_new.c) driver is operating in dual-emac mode by default, thus
-working as 2 individual network interfaces. Main differences from legacy CPSW
-driver are:
+ working as 2 individual network interfaces. Main differences from legacy CPSW
+ driver are:
+
- optimized promiscuous mode: The P0_UNI_FLOOD (both ports) is enabled in
-addition to ALLMULTI (current port) instead of ALE_BYPASS.
-So, Ports in promiscuous mode will keep possibility of mcast and vlan filtering,
-which is provides significant benefits when ports are joined to the same bridge,
-but without enabling "switch" mode, or to different bridges.
+ addition to ALLMULTI (current port) instead of ALE_BYPASS.
+ So, ports in promiscuous mode keep the possibility of mcast and vlan
+ filtering, which provides significant benefits when ports are joined
+ to the same bridge, but without enabling "switch" mode, or to different
+ bridges.
- learning disabled on ports as it make not too much sense for
segregated ports - no forwarding in HW.
- enabled basic support for devlink.
+ ::
+
devlink dev show
platform/48484000.switch
@@ -38,22 +52,25 @@ but without enabling "switch" mode, or to different bridges.
cmode runtime value false
Devlink configuration parameters
-====================
+================================
+
See Documentation/networking/devlink/ti-cpsw-switch.rst
-====================
-# Bridging in dual mac mode
-====================
+Bridging in dual mac mode
+=========================
+
The dual_mac mode requires two vids to be reserved for internal purposes,
which, by default, equal CPSW Port numbers. As result, bridge has to be
-configured in vlan unaware mode or default_pvid has to be adjusted.
+configured in vlan unaware mode or default_pvid has to be adjusted::
ip link add name br0 type bridge
ip link set dev br0 type bridge vlan_filtering 0
echo 0 > /sys/class/net/br0/bridge/default_pvid
ip link set dev sw0p1 master br0
ip link set dev sw0p2 master br0
- - or -
+
+or::
+
ip link add name br0 type bridge
ip link set dev br0 type bridge vlan_filtering 0
echo 100 > /sys/class/net/br0/bridge/default_pvid
@@ -61,11 +78,12 @@ configured in vlan unaware mode or default_pvid has to be adjusted.
ip link set dev sw0p1 master br0
ip link set dev sw0p2 master br0
-====================
-# Enabling "switch"
-====================
+Enabling "switch"
+=================
+
The Switch mode can be enabled by configuring devlink driver parameter
-"switch_mode" to 1/true:
+"switch_mode" to 1/true::
+
devlink dev param set platform/48484000.switch \
name switch_mode value 1 cmode runtime
@@ -79,9 +97,11 @@ marking packets with offload_fwd_mark flag unless "ale_bypass=0"
All configuration is implemented via switchdev API.
-====================
-# Bridge setup
-====================
+Bridge setup
+============
+
+::
+
devlink dev param set platform/48484000.switch \
name switch_mode value 1 cmode runtime
@@ -91,56 +111,65 @@ All configuration is implemented via switchdev API.
ip link set dev sw0p2 up
ip link set dev sw0p1 master br0
ip link set dev sw0p2 master br0
+
[*] bridge vlan add dev br0 vid 1 pvid untagged self
-[*] if vlan_filtering=1. where default_pvid=1
+ [*] if vlan_filtering=1. where default_pvid=1
-=================
-# On/off STP
-=================
-ip link set dev BRDEV type bridge stp_state 1/0
+ Note. Steps [*] are mandatory.
+
+
+On/off STP
+==========
-Note. Steps [*] are mandatory.
+::
-====================
-# VLAN configuration
-====================
-bridge vlan add dev br0 vid 1 pvid untagged self <---- add cpu port to VLAN 1
+ ip link set dev BRDEV type bridge stp_state 1/0
+
+VLAN configuration
+==================
+
+::
+
+ bridge vlan add dev br0 vid 1 pvid untagged self <---- add cpu port to VLAN 1
Note. This step is mandatory for bridge/default_pvid.
-=================
-# Add extra VLANs
-=================
- 1. untagged:
- bridge vlan add dev sw0p1 vid 100 pvid untagged master
- bridge vlan add dev sw0p2 vid 100 pvid untagged master
- bridge vlan add dev br0 vid 100 pvid untagged self <---- Add cpu port to VLAN100
+Add extra VLANs
+===============
- 2. tagged:
- bridge vlan add dev sw0p1 vid 100 master
- bridge vlan add dev sw0p2 vid 100 master
- bridge vlan add dev br0 vid 100 pvid tagged self <---- Add cpu port to VLAN100
+ 1. untagged::
+
+ bridge vlan add dev sw0p1 vid 100 pvid untagged master
+ bridge vlan add dev sw0p2 vid 100 pvid untagged master
+ bridge vlan add dev br0 vid 100 pvid untagged self <---- Add cpu port to VLAN100
+
+ 2. tagged::
+
+ bridge vlan add dev sw0p1 vid 100 master
+ bridge vlan add dev sw0p2 vid 100 master
+ bridge vlan add dev br0 vid 100 pvid tagged self <---- Add cpu port to VLAN100
-====
FDBs
-====
+----
+
FDBs are automatically added on the appropriate switch port upon detection
-Manually adding FDBs:
-bridge fdb add aa:bb:cc:dd:ee:ff dev sw0p1 master vlan 100
-bridge fdb add aa:bb:cc:dd:ee:fe dev sw0p2 master <---- Add on all VLANs
+Manually adding FDBs::
+
+ bridge fdb add aa:bb:cc:dd:ee:ff dev sw0p1 master vlan 100
+ bridge fdb add aa:bb:cc:dd:ee:fe dev sw0p2 master <---- Add on all VLANs
-====
MDBs
-====
+----
+
MDBs are automatically added on the appropriate switch port upon detection
-Manually adding MDBs:
-bridge mdb add dev br0 port sw0p1 grp 239.1.1.1 permanent vid 100
-bridge mdb add dev br0 port sw0p1 grp 239.1.1.1 permanent <---- Add on all VLANs
+Manually adding MDBs::
+
+ bridge mdb add dev br0 port sw0p1 grp 239.1.1.1 permanent vid 100
+ bridge mdb add dev br0 port sw0p1 grp 239.1.1.1 permanent <---- Add on all VLANs
-==================
Multicast flooding
==================
CPU port mcast_flooding is always on
@@ -148,9 +177,11 @@ CPU port mcast_flooding is always on
Turning flooding on/off on swithch ports:
bridge link set dev sw0p1 mcast_flood on/off
-==================
Access and Trunk port
-==================
+=====================
+
+::
+
bridge vlan add dev sw0p1 vid 100 pvid untagged master
bridge vlan add dev sw0p2 vid 100 master
@@ -158,52 +189,54 @@ Access and Trunk port
bridge vlan add dev br0 vid 100 self
ip link add link br0 name br0.100 type vlan id 100
- Note. Setting PVID on Bridge device itself working only for
- default VLAN (default_pvid).
+Note. Setting PVID on Bridge device itself works only for
+default VLAN (default_pvid).
+
+NFS
+===
-=====================
- NFS
-=====================
The only way for NFS to work is by chrooting to a minimal environment when
switch configuration that will affect connectivity is needed.
Assuming you are booting NFS with eth1 interface(the script is hacky and
it's just there to prove NFS is doable).
-setup.sh:
-#!/bin/sh
-mkdir proc
-mount -t proc none /proc
-ifconfig br0 > /dev/null
-if [ $? -ne 0 ]; then
- echo "Setting up bridge"
- ip link add name br0 type bridge
- ip link set dev br0 type bridge ageing_time 1000
- ip link set dev br0 type bridge vlan_filtering 1
-
- ip link set eth1 down
- ip link set eth1 name sw0p1
- ip link set dev sw0p1 up
- ip link set dev sw0p2 up
- ip link set dev sw0p2 master br0
- ip link set dev sw0p1 master br0
- bridge vlan add dev br0 vid 1 pvid untagged self
- ifconfig sw0p1 0.0.0.0
- udhchc -i br0
-fi
-umount /proc
-
-run_nfs.sh:
-#!/bin/sh
-mkdir /tmp/root/bin -p
-mkdir /tmp/root/lib -p
-
-cp -r /lib/ /tmp/root/
-cp -r /bin/ /tmp/root/
-cp /sbin/ip /tmp/root/bin
-cp /sbin/bridge /tmp/root/bin
-cp /sbin/ifconfig /tmp/root/bin
-cp /sbin/udhcpc /tmp/root/bin
-cp /path/to/setup.sh /tmp/root/bin
-chroot /tmp/root/ busybox sh /bin/setup.sh
-
-run ./run_nfs.sh
+setup.sh::
+
+ #!/bin/sh
+ mkdir proc
+ mount -t proc none /proc
+ ifconfig br0 > /dev/null
+ if [ $? -ne 0 ]; then
+ echo "Setting up bridge"
+ ip link add name br0 type bridge
+ ip link set dev br0 type bridge ageing_time 1000
+ ip link set dev br0 type bridge vlan_filtering 1
+
+ ip link set eth1 down
+ ip link set eth1 name sw0p1
+ ip link set dev sw0p1 up
+ ip link set dev sw0p2 up
+ ip link set dev sw0p2 master br0
+ ip link set dev sw0p1 master br0
+ bridge vlan add dev br0 vid 1 pvid untagged self
+ ifconfig sw0p1 0.0.0.0
+ udhcpc -i br0
+ fi
+ umount /proc
+
+run_nfs.sh::
+
+ #!/bin/sh
+ mkdir /tmp/root/bin -p
+ mkdir /tmp/root/lib -p
+
+ cp -r /lib/ /tmp/root/
+ cp -r /bin/ /tmp/root/
+ cp /sbin/ip /tmp/root/bin
+ cp /sbin/bridge /tmp/root/bin
+ cp /sbin/ifconfig /tmp/root/bin
+ cp /sbin/udhcpc /tmp/root/bin
+ cp /path/to/setup.sh /tmp/root/bin
+ chroot /tmp/root/ busybox sh /bin/setup.sh
+
+Then run ./run_nfs.sh.
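For the "Port renaming" section of this document, the switch id found with ``ip -d link show`` has to be copied into the udev rule by hand. The sketch below automates that step under two assumptions: the id is the word that follows ``switchid`` in the ``ip`` output, and the value is quoted in the rule. ``print_udev_rule`` is a hypothetical helper, and the sample input line is made up for illustration.

```shell
# print_udev_rule: reads "ip -d link show dev <port>" output on stdin and
# prints a udev rule of the form shown in the "Port renaming" section.
print_udev_rule() {
    # Assumption: the id is the word right after "switchid".
    id=$(sed -n 's/.*switchid \([^ ]*\).*/\1/p' | head -n 1)
    [ -n "$id" ] || { echo "no switchid found" >&2; return 1; }
    printf '%s\n' "SUBSYSTEM==\"net\", ACTION==\"add\", ATTR{phys_switch_id}==\"$id\", ATTR{phys_port_name}!=\"\", NAME=\"sw0\$attr{phys_port_name}\""
}

# Made-up sample line; on a target use: ip -d link show dev sw0p1 | print_udev_rule
echo "link/ether 00:00:00:00:00:01 switchid 0f011900 portname p1" | print_udev_rule
```

The printed rule can then be reviewed and placed in a udev rules file on the target.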
diff --git a/Documentation/networking/device_drivers/ti/tlan.txt b/Documentation/networking/device_drivers/ethernet/ti/tlan.rst
index 34550dfcef74..4fdc0907f4fc 100644
--- a/Documentation/networking/device_drivers/ti/tlan.txt
+++ b/Documentation/networking/device_drivers/ethernet/ti/tlan.rst
@@ -1,20 +1,33 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+TLAN driver for Linux
+=====================
+
+:Version: 1.14a
+
(C) 1997-1998 Caldera, Inc.
+
(C) 1998 James Banks
+
(C) 1999-2001 Torben Mathiasen <tmm@image.dk, torben.mathiasen@compaq.com>
For driver information/updates visit http://www.compaq.com
-TLAN driver for Linux, version 1.14a
-README
-I. Supported Devices.
+
+I. Supported Devices
+====================
Only PCI devices will work with this driver.
Supported:
+
+ ========= ========= ===========================================
Vendor ID Device ID Name
+ ========= ========= ===========================================
0e11 ae32 Compaq Netelligent 10/100 TX PCI UTP
0e11 ae34 Compaq Netelligent 10 T PCI UTP
0e11 ae35 Compaq Integrated NetFlex 3/P
@@ -25,13 +38,14 @@ I. Supported Devices.
0e11 b030 Compaq Netelligent 10/100 TX UTP
0e11 f130 Compaq NetFlex 3/P
0e11 f150 Compaq NetFlex 3/P
- 108d 0012 Olicom OC-2325
+ 108d 0012 Olicom OC-2325
108d 0013 Olicom OC-2183
- 108d 0014 Olicom OC-2326
+ 108d 0014 Olicom OC-2326
+ ========= ========= ===========================================
Caveats:
-
+
I am not sure if 100BaseTX daughterboards (for those cards which
support such things) will work. I haven't had any solid evidence
either way.
@@ -41,21 +55,25 @@ I. Supported Devices.
The "Netelligent 10 T/2 PCI UTP/Coax" (b012) device is untested,
but I do not expect any problems.
-
-II. Driver Options
+
+II. Driver Options
+==================
+
1. You can append debug=x to the end of the insmod line to get
- debug messages, where x is a bit field where the bits mean
+ debug messages, where x is a bit field where the bits mean
the following:
-
+
+ ==== =====================================
0x01 Turn on general debugging messages.
0x02 Turn on receive debugging messages.
0x04 Turn on transmit debugging messages.
0x08 Turn on list debugging messages.
+ ==== =====================================
2. You can append aui=1 to the end of the insmod line to cause
- the adapter to use the AUI interface instead of the 10 Base T
- interface. This is also what to do if you want to use the BNC
+ the adapter to use the AUI interface instead of the 10 Base T
+ interface. This is also what to do if you want to use the BNC
connector on a TLAN based device. (Setting this option on a
device that does not have an AUI/BNC connector will probably
cause it to not function correctly.)
@@ -70,41 +88,45 @@ II. Driver Options
5. You have to use speed=X duplex=Y together now. If you just
do "insmod tlan.o speed=100" the driver will do Auto-Neg.
- To force a 10Mbps Half-Duplex link do "insmod tlan.o speed=10
+ To force a 10Mbps Half-Duplex link do "insmod tlan.o speed=10
duplex=1".
6. If the driver is built into the kernel, you can use the 3rd
and 4th parameters to set aui and debug respectively. For
- example:
+ example::
- ether=0,0,0x1,0x7,eth0
+ ether=0,0,0x1,0x7,eth0
This sets aui to 0x1 and debug to 0x7, assuming eth0 is a
supported TLAN device.
The bits in the third byte are assigned as follows:
- 0x01 = aui
- 0x02 = use half duplex
- 0x04 = use full duplex
- 0x08 = use 10BaseT
- 0x10 = use 100BaseTx
+ ==== ===============
+ 0x01 aui
+ 0x02 use half duplex
+ 0x04 use full duplex
+ 0x08 use 10BaseT
+ 0x10 use 100BaseTx
+ ==== ===============
You also need to set both speed and duplex settings when forcing
- speeds with kernel-parameters.
+ speeds with kernel-parameters.
ether=0,0,0x12,0,eth0 will force link to 100Mbps Half-Duplex.
7. If you have more than one tlan adapter in your system, you can
use the above options on a per adapter basis. To force a 100Mbit/HD
- link with your eth1 adapter use:
-
- insmod tlan speed=0,100 duplex=0,1
+ link with your eth1 adapter use::
+
+ insmod tlan speed=0,100 duplex=0,1
Now eth0 will use auto-neg and eth1 will be forced to 100Mbit/HD.
Note that the tlan driver supports a maximum of 8 adapters.
-III. Things to try if you have problems.
+III. Things to try if you have problems
+=======================================
+
1. Make sure your card's PCI id is among those listed in
section I, above.
2. Make sure routing is correct.
@@ -113,5 +135,6 @@ III. Things to try if you have problems.
There is also a tlan mailing list which you can join by sending "subscribe tlan"
in the body of an email to majordomo@vuser.vu.union.edu.
+
There is also a tlan website at http://www.compaq.com
diff --git a/Documentation/networking/device_drivers/toshiba/spider_net.txt b/Documentation/networking/device_drivers/ethernet/toshiba/spider_net.rst
index b0b75f8463b3..fe5b32be15cd 100644
--- a/Documentation/networking/device_drivers/toshiba/spider_net.txt
+++ b/Documentation/networking/device_drivers/ethernet/toshiba/spider_net.rst
@@ -1,6 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
- The Spidernet Device Driver
- ===========================
+===========================
+The Spidernet Device Driver
+===========================
Written by Linas Vepstas <linas@austin.ibm.com>
@@ -78,15 +80,15 @@ GDACTDPA, tail and head pointers. It will also summarize the contents
of the ring, starting at the tail pointer, and listing the status
of the descrs that follow.
-A typical example of the output, for a nearly idle system, might be
+A typical example of the output, for a nearly idle system, might be::
-net eth1: Total number of descrs=256
-net eth1: Chain tail located at descr=20
-net eth1: Chain head is at 20
-net eth1: HW curr desc (GDACTDPA) is at 21
-net eth1: Have 1 descrs with stat=x40800101
-net eth1: HW next desc (GDACNEXTDA) is at 22
-net eth1: Last 255 descrs with stat=xa0800000
+ net eth1: Total number of descrs=256
+ net eth1: Chain tail located at descr=20
+ net eth1: Chain head is at 20
+ net eth1: HW curr desc (GDACTDPA) is at 21
+ net eth1: Have 1 descrs with stat=x40800101
+ net eth1: HW next desc (GDACNEXTDA) is at 22
+ net eth1: Last 255 descrs with stat=xa0800000
In the above, the hardware has filled in one descr, number 20. Both
head and tail are pointing at 20, because it has not yet been emptied.
@@ -101,11 +103,11 @@ The status x4... corresponds to "full" and status xa... corresponds
to "empty". The actual value printed is RXCOMST_A.
In the device driver source code, a different set of names are
-used for these same concepts, so that
+used for these same concepts, so that::
-"empty" == SPIDER_NET_DESCR_CARDOWNED == 0xa
-"full" == SPIDER_NET_DESCR_FRAME_END == 0x4
-"not in use" == SPIDER_NET_DESCR_NOT_IN_USE == 0xf
+ "empty" == SPIDER_NET_DESCR_CARDOWNED == 0xa
+ "full" == SPIDER_NET_DESCR_FRAME_END == 0x4
+ "not in use" == SPIDER_NET_DESCR_NOT_IN_USE == 0xf
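
As a rough illustration of the mapping above, the sketch below classifies
a printed status value by its leading hex digit. It assumes the status is
a 32-bit word whose top four bits carry the state, as the printed values
x4... and xa... suggest; the function name is hypothetical and this is
not driver code.

```python
# State codes as listed above; the top-nibble layout is an assumption.
DESCR_STATES = {0xA: "empty", 0x4: "full", 0xF: "not in use"}


def descr_state(stat):
    """Classify a status string such as 'x40800101' or '0xa0800000'."""
    digits = stat.lower()
    if digits.startswith("0x"):
        digits = digits[2:]
    elif digits.startswith("x"):
        digits = digits[1:]
    value = int(digits, 16)
    # Assumed layout: the state lives in bits 31..28 of a 32-bit word.
    return DESCR_STATES.get((value >> 28) & 0xF, "unknown")


print(descr_state("x40800101"))  # prints full
print(descr_state("xa0800000"))  # prints empty
```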
The RX RAM full bug/feature
@@ -137,19 +139,19 @@ while the hardware is waiting for a different set of descrs to become
empty.
A call to show_rx_chain() at this point indicates the nature of the
-problem. A typical print when the network is hung shows the following:
-
-net eth1: Spider RX RAM full, incoming packets might be discarded!
-net eth1: Total number of descrs=256
-net eth1: Chain tail located at descr=255
-net eth1: Chain head is at 255
-net eth1: HW curr desc (GDACTDPA) is at 0
-net eth1: Have 1 descrs with stat=xa0800000
-net eth1: HW next desc (GDACNEXTDA) is at 1
-net eth1: Have 127 descrs with stat=x40800101
-net eth1: Have 1 descrs with stat=x40800001
-net eth1: Have 126 descrs with stat=x40800101
-net eth1: Last 1 descrs with stat=xa0800000
+problem. A typical print when the network is hung shows the following::
+
+ net eth1: Spider RX RAM full, incoming packets might be discarded!
+ net eth1: Total number of descrs=256
+ net eth1: Chain tail located at descr=255
+ net eth1: Chain head is at 255
+ net eth1: HW curr desc (GDACTDPA) is at 0
+ net eth1: Have 1 descrs with stat=xa0800000
+ net eth1: HW next desc (GDACNEXTDA) is at 1
+ net eth1: Have 127 descrs with stat=x40800101
+ net eth1: Have 1 descrs with stat=x40800001
+ net eth1: Have 126 descrs with stat=x40800101
+ net eth1: Last 1 descrs with stat=xa0800000
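
One way to sanity-check a dump like the one above is to add up the
per-status counts and compare them with the reported total. The sketch
below does that for the "Have N descrs" and "Last N descrs" lines; the
regular expressions and function name are assumptions based only on the
sample output shown here, not on the driver source.

```python
import re

# Patterns modelled on the sample output above (assumed, not authoritative).
COUNT_LINE = re.compile(r"(?:Have|Last) (\d+) descrs with stat=x([0-9a-fA-F]+)")
TOTAL_LINE = re.compile(r"Total number of descrs=(\d+)")


def summarize_rx_chain(log_text):
    """Return (total, runs) where runs is a list of (count, status_hex)."""
    total = None
    runs = []
    for line in log_text.splitlines():
        match = TOTAL_LINE.search(line)
        if match:
            total = int(match.group(1))
            continue
        match = COUNT_LINE.search(line)
        if match:
            runs.append((int(match.group(1)), match.group(2).lower()))
    return total, runs


hung_log = """\
net eth1: Total number of descrs=256
net eth1: Have 1 descrs with stat=xa0800000
net eth1: Have 127 descrs with stat=x40800101
net eth1: Have 1 descrs with stat=x40800001
net eth1: Have 126 descrs with stat=x40800101
net eth1: Last 1 descrs with stat=xa0800000
"""

total, runs = summarize_rx_chain(hung_log)
full = sum(count for count, stat in runs if stat.startswith("4"))
print(total, sum(count for count, _ in runs), full)  # prints 256 256 254
```

In the hung case nearly every descr is marked full (254 of 256), while
the two marked empty sit at the positions the head and tail pointers
refer to.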
Both the tail and head pointers are pointing at descr 255, which is
marked xa... which is "empty". Thus, from the OS point of view, there
@@ -198,7 +200,3 @@ For large packets, this mechanism generates a relatively small number
of interrupts, about 1K/sec. For smaller packets, this will drop to zero
interrupts, as the hardware can empty the queue faster than the kernel
can fill it.
-
-
- ======= END OF DOCUMENT ========
-
diff --git a/Documentation/networking/device_drivers/ethernet/wangxun/ngbe.rst b/Documentation/networking/device_drivers/ethernet/wangxun/ngbe.rst
new file mode 100644
index 000000000000..43a02f9943e1
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/wangxun/ngbe.rst
@@ -0,0 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================================================
+Linux Base Driver for WangXun(R) Gigabit PCI Express Adapters
+=============================================================
+
+WangXun Gigabit Linux driver.
+Copyright (c) 2019 - 2022 Beijing WangXun Technology Co., Ltd.
+
+Support
+=======
+ If you have problems with the software or hardware, please contact our
+ customer support team via email at nic-support@net-swift.com or check our website
+ at https://www.net-swift.com
diff --git a/Documentation/networking/device_drivers/ethernet/wangxun/txgbe.rst b/Documentation/networking/device_drivers/ethernet/wangxun/txgbe.rst
new file mode 100644
index 000000000000..eaa87dbe8848
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/wangxun/txgbe.rst
@@ -0,0 +1,20 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================================================
+Linux Base Driver for WangXun(R) 10 Gigabit PCI Express Adapters
+================================================================
+
+WangXun 10 Gigabit Linux driver.
+Copyright (c) 2015 - 2022 Beijing WangXun Technology Co., Ltd.
+
+
+Contents
+========
+
+- Support
+
+
+Support
+=======
+If you have any problems, contact the WangXun support team via
+support@trustnetic.com and Cc: netdev.
diff --git a/Documentation/networking/defza.txt b/Documentation/networking/device_drivers/fddi/defza.rst
index 663e4a906751..7393f33ea705 100644
--- a/Documentation/networking/defza.txt
+++ b/Documentation/networking/device_drivers/fddi/defza.rst
@@ -1,4 +1,10 @@
-Notes on the DEC FDDIcontroller 700 (DEFZA-xx) driver v.1.1.4.
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================================
+Notes on the DEC FDDIcontroller 700 (DEFZA-xx) driver
+=====================================================
+
+:Version: v.1.1.4
DEC FDDIcontroller 700 is DEC's first-generation TURBOchannel FDDI
@@ -54,4 +60,4 @@ To do:
Both success and failure reports are welcome.
-Maciej W. Rozycki <macro@linux-mips.org>
+Maciej W. Rozycki <macro@orcam.me.uk>
diff --git a/Documentation/networking/device_drivers/fddi/index.rst b/Documentation/networking/device_drivers/fddi/index.rst
new file mode 100644
index 000000000000..0b75294e6c8b
--- /dev/null
+++ b/Documentation/networking/device_drivers/fddi/index.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Fiber Distributed Data Interface (FDDI) Device Drivers
+======================================================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ defza
+ skfp
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/skfp.txt b/Documentation/networking/device_drivers/fddi/skfp.rst
index 203ec66c9fb4..58f548105c1d 100644
--- a/Documentation/networking/skfp.txt
+++ b/Documentation/networking/device_drivers/fddi/skfp.rst
@@ -1,35 +1,41 @@
-(C)Copyright 1998-2000 SysKonnect,
-===========================================================================
+.. SPDX-License-Identifier: GPL-2.0
+
+.. include:: <isonum.txt>
+
+========================
+SysKonnect driver - SKFP
+========================
+
+|copy| Copyright 1998-2000 SysKonnect,
skfp.txt created 11-May-2000
Readme File for skfp.o v2.06
-This file contains
-(1) OVERVIEW
-(2) SUPPORTED ADAPTERS
-(3) GENERAL INFORMATION
-(4) INSTALLATION
-(5) INCLUSION OF THE ADAPTER IN SYSTEM START
-(6) TROUBLESHOOTING
-(7) FUNCTION OF THE ADAPTER LEDS
-(8) HISTORY
+.. This file contains
-===========================================================================
+ (1) OVERVIEW
+ (2) SUPPORTED ADAPTERS
+ (3) GENERAL INFORMATION
+ (4) INSTALLATION
+ (5) INCLUSION OF THE ADAPTER IN SYSTEM START
+ (6) TROUBLESHOOTING
+ (7) FUNCTION OF THE ADAPTER LEDS
+ (8) HISTORY
-
-(1) OVERVIEW
-============
+1. Overview
+===========
This README explains how to use the driver 'skfp' for Linux with your
network adapter.
Chapter 2: Contains a list of all network adapters that are supported by
- this driver.
+this driver.
-Chapter 3: Gives some general information.
+Chapter 3:
+ Gives some general information.
Chapter 4: Describes common problems and solutions.
@@ -37,14 +43,13 @@ Chapter 5: Shows the changed functionality of the adapter LEDs.
Chapter 6: History of development.
-***
-
-(2) SUPPORTED ADAPTERS
-======================
+2. Supported adapters
+=====================
The network driver 'skfp' supports the following network adapters:
SysKonnect adapters:
+
- SK-5521 (SK-NET FDDI-UP)
- SK-5522 (SK-NET FDDI-UP DAS)
- SK-5541 (SK-NET FDDI-FP)
@@ -55,157 +60,187 @@ SysKonnect adapters:
- SK-5841 (SK-NET FDDI-FP64)
- SK-5843 (SK-NET FDDI-LP64)
- SK-5844 (SK-NET FDDI-LP64 DAS)
+
Compaq adapters (not tested):
+
- Netelligent 100 FDDI DAS Fibre SC
- Netelligent 100 FDDI SAS Fibre SC
- Netelligent 100 FDDI DAS UTP
- Netelligent 100 FDDI SAS UTP
- Netelligent 100 FDDI SAS Fibre MIC
-***
-(3) GENERAL INFORMATION
-=======================
+3. General Information
+======================
From v2.01 on, the driver is integrated in the linux kernel sources.
Therefore, the installation is the same as for any other adapter
supported by the kernel.
+
Refer to the manual of your distribution about the installation
of network adapters.
-Makes my life much easier :-)
-***
+Makes my life much easier :-)
-(4) TROUBLESHOOTING
-===================
+4. Troubleshooting
+==================
If you run into problems during installation, check the following items:
-Problem: The FDDI adapter cannot be found by the driver.
-Reason: Look in /proc/pci for the following entry:
- 'FDDI network controller: SysKonnect SK-FDDI-PCI ...'
+Problem:
+ The FDDI adapter cannot be found by the driver.
+
+Reason:
+ Look in /proc/pci for the following entry:
+
+ 'FDDI network controller: SysKonnect SK-FDDI-PCI ...'
+
If this entry exists, then the FDDI adapter has been
found by the system and should be able to be used.
+
If this entry does not exist or if the file '/proc/pci'
is not there, then you may have a hardware problem or PCI
support may not be enabled in your kernel.
+
The adapter can be checked using the diagnostic program
which is available from the SysKonnect web site:
+
www.syskonnect.de
+
Some COMPAQ machines have a problem with PCI under
Linux. This is described in the 'PCI howto' document
(included in some distributions or available from the
www, e.g. at 'www.linux.org') and no workaround is available.
-Problem: You want to use your computer as a router between
- multiple IP subnetworks (using multiple adapters), but
+Problem:
+ You want to use your computer as a router between
+ multiple IP subnetworks (using multiple adapters), but
you cannot reach computers in other subnetworks.
-Reason: Either the router's kernel is not configured for IP
+
+Reason:
+ Either the router's kernel is not configured for IP
forwarding or there is a problem with the routing table
and gateway configuration in at least one of the
computers.
If your problem is not listed here, please contact our
-technical support for help.
-You can send email to:
- linux@syskonnect.de
+technical support for help.
+
+You can send email to: linux@syskonnect.de
+
When contacting our technical support,
please ensure that the following information is available:
+
- System Manufacturer and Model
- Boards in your system
- Distribution
- Kernel version
-***
-
-
-(5) FUNCTION OF THE ADAPTER LEDS
-================================
- The functionality of the LED's on the FDDI network adapters was
- changed in SMT version v2.82. With this new SMT version, the yellow
- LED works as a ring operational indicator. An active yellow LED
- indicates that the ring is down. The green LED on the adapter now
- works as a link indicator where an active GREEN LED indicates that
- the respective port has a physical connection.
+5. Function of the Adapter LEDs
+===============================
- With versions of SMT prior to v2.82 a ring up was indicated if the
- yellow LED was off while the green LED(s) showed the connection
- status of the adapter. During a ring down the green LED was off and
- the yellow LED was on.
+ The functionality of the LEDs on the FDDI network adapters was
+ changed in SMT version v2.82. With this new SMT version, the yellow
+ LED works as a ring operational indicator. An active yellow LED
+ indicates that the ring is down. The green LED on the adapter now
+ works as a link indicator where an active GREEN LED indicates that
+ the respective port has a physical connection.
- All implementations indicate that a driver is not loaded if
- all LEDs are off.
+ With versions of SMT prior to v2.82 a ring up was indicated if the
+ yellow LED was off while the green LED(s) showed the connection
+ status of the adapter. During a ring down the green LED was off and
+ the yellow LED was on.
-***
+ All implementations indicate that a driver is not loaded if
+ all LEDs are off.
-(6) HISTORY
-===========
+6. History
+==========
v2.06 (20000511) (In-Kernel version)
New features:
+
- 64 bit support
- new pci dma interface
- in kernel 2.3.99
v2.05 (20000217) (In-Kernel version)
New features:
+
- Changes for 2.3.45 kernel
v2.04 (20000207) (Standalone version)
New features:
+
- Added rx/tx byte counter
v2.03 (20000111) (Standalone version)
Problems fixed:
+
- Fixed printk statements from v2.02
v2.02 (991215) (Standalone version)
Problems fixed:
+
- Removed unnecessary output
- Fixed path for "printver.sh" in makefile
v2.01 (991122) (In-Kernel version)
New features:
+
- Integration in Linux kernel sources
- Support for memory mapped I/O.
v2.00 (991112)
New features:
+
- Full source released under GPL
v1.05 (991023)
Problems fixed:
+
- Compilation with kernel version 2.2.13 failed
v1.04 (990427)
Changes:
+
- New SMT module included, changing LED functionality
+
Problems fixed:
+
- Synchronization on SMP machines was buggy
v1.03 (990325)
Problems fixed:
+
- Interrupt routing on SMP machines could be incorrect
v1.02 (990310)
New features:
+
- Support for kernel versions 2.2.x added
- Kernel patch instead of private duplicate of kernel functions
v1.01 (980812)
Problems fixed:
+
Connection hangup with telnet
Slow telnet connection
v1.00 beta 01 (980507)
New features:
+
None.
+
Problems fixed:
+
None.
+
Known limitations:
- - tar archive instead of standard package format (rpm).
+
+ - tar archive instead of standard package format (rpm).
- FDDI statistic is empty.
- not tested with 2.1.xx kernels
- integration in kernel not tested
@@ -216,5 +251,3 @@ v1.00 beta 01 (980507)
- does not work on some COMPAQ machines. See the PCI howto
document for details about this problem.
- data corruption with kernel versions below 2.0.33.
-
-*** End of information file ***
diff --git a/Documentation/networking/baycom.txt b/Documentation/networking/device_drivers/hamradio/baycom.rst
index 688f18fd4467..fe2d010f0e86 100644
--- a/Documentation/networking/baycom.txt
+++ b/Documentation/networking/device_drivers/hamradio/baycom.rst
@@ -1,26 +1,31 @@
- LINUX DRIVERS FOR BAYCOM MODEMS
+.. SPDX-License-Identifier: GPL-2.0
- Thomas M. Sailer, HB9JNX/AE4WA, <sailer@ife.ee.ethz.ch>
+===============================
+Linux Drivers for Baycom Modems
+===============================
-!!NEW!! (04/98) The drivers for the baycom modems have been split into
+Thomas M. Sailer, HB9JNX/AE4WA, <sailer@ife.ee.ethz.ch>
+
+The drivers for the baycom modems have been split into
separate drivers as they did not share any code, and the driver
and device names have changed.
This document describes the Linux Kernel Drivers for simple Baycom style
-amateur radio modems.
+amateur radio modems.
The following drivers are available:
+====================================
baycom_ser_fdx:
This driver supports the SER12 modems either full or half duplex.
- Its baud rate may be changed via the `baud' module parameter,
+ Its baud rate may be changed via the ``baud`` module parameter,
therefore it supports just about every bit bang modem on a
serial port. Its devices are called bcsf0 through bcsf3.
This is the recommended driver for SER12 type modems,
however if you have a broken UART clone that does not have working
- delta status bits, you may try baycom_ser_hdx.
+ delta status bits, you may try baycom_ser_hdx.
-baycom_ser_hdx:
+baycom_ser_hdx:
This is an alternative driver for SER12 type modems.
It only supports half duplex, and only 1200 baud. Its devices
are called bcsh0 through bcsh3. Use this driver only if baycom_ser_fdx
@@ -37,45 +42,48 @@ baycom_epp:
The following modems are supported:
-ser12: This is a very simple 1200 baud AFSK modem. The modem consists only
- of a modulator/demodulator chip, usually a TI TCM3105. The computer
- is responsible for regenerating the receiver bit clock, as well as
- for handling the HDLC protocol. The modem connects to a serial port,
- hence the name. Since the serial port is not used as an async serial
- port, the kernel driver for serial ports cannot be used, and this
- driver only supports standard serial hardware (8250, 16450, 16550)
-
-par96: This is a modem for 9600 baud FSK compatible to the G3RUH standard.
- The modem does all the filtering and regenerates the receiver clock.
- Data is transferred from and to the PC via a shift register.
- The shift register is filled with 16 bits and an interrupt is signalled.
- The PC then empties the shift register in a burst. This modem connects
- to the parallel port, hence the name. The modem leaves the
- implementation of the HDLC protocol and the scrambler polynomial to
- the PC.
-
-picpar: This is a redesign of the par96 modem by Henning Rech, DF9IC. The modem
- is protocol compatible to par96, but uses only three low power ICs
- and can therefore be fed from the parallel port and does not require
- an additional power supply. Furthermore, it incorporates a carrier
- detect circuitry.
-
-EPP: This is a high-speed modem adaptor that connects to an enhanced parallel port.
- Its target audience is users working over a high speed hub (76.8kbit/s).
-
-eppfpga: This is a redesign of the EPP adaptor.
-
-
+======= ========================================================================
+ser12 This is a very simple 1200 baud AFSK modem. The modem consists only
+ of a modulator/demodulator chip, usually a TI TCM3105. The computer
+ is responsible for regenerating the receiver bit clock, as well as
+ for handling the HDLC protocol. The modem connects to a serial port,
+ hence the name. Since the serial port is not used as an async serial
+ port, the kernel driver for serial ports cannot be used, and this
+ driver only supports standard serial hardware (8250, 16450, 16550)
+
+par96 This is a modem for 9600 baud FSK compatible to the G3RUH standard.
+ The modem does all the filtering and regenerates the receiver clock.
+ Data is transferred from and to the PC via a shift register.
+ The shift register is filled with 16 bits and an interrupt is signalled.
+ The PC then empties the shift register in a burst. This modem connects
+ to the parallel port, hence the name. The modem leaves the
+ implementation of the HDLC protocol and the scrambler polynomial to
+ the PC.
+
+picpar This is a redesign of the par96 modem by Henning Rech, DF9IC. The modem
+ is protocol compatible to par96, but uses only three low power ICs
+ and can therefore be fed from the parallel port and does not require
+ an additional power supply. Furthermore, it incorporates a carrier
+ detect circuitry.
+
+EPP This is a high-speed modem adaptor that connects to an enhanced parallel
+ port.
+
+ Its target audience is users working over a high speed hub (76.8kbit/s).
+
+eppfpga This is a redesign of the EPP adaptor.
+======= ========================================================================
All of the above modems only support half duplex communications. However,
the driver supports the KISS (see below) fullduplex command. It then simply
starts to send as soon as there's a packet to transmit and does not care
about DCD, i.e. it starts to send even if there's someone else on the channel.
-This command is required by some implementations of the DAMA channel
+This command is required by some implementations of the DAMA channel
access protocol.
The Interface of the drivers
+============================
Unlike previous drivers, these drivers are no longer character devices,
but they are now true kernel network interfaces. Installation is therefore
@@ -88,20 +96,22 @@ me for WAMPES which allows attaching a kernel network interface directly.
Configuring the driver
+======================
Every time a driver is inserted into the kernel, it has to know which
modems it should access at which ports. This can be done with the setbaycom
utility. If you are only using one modem, you can also configure the
driver from the insmod command line (or by means of an option line in
-/etc/modprobe.d/*.conf).
+``/etc/modprobe.d/*.conf``).
+
+Examples::
-Examples:
modprobe baycom_ser_fdx mode="ser12*" iobase=0x3f8 irq=4
sethdlc -i bcsf0 -p mode "ser12*" io 0x3f8 irq 4
Both lines configure the first port to drive a ser12 modem at the first
-serial port (COM1 under DOS). The * in the mode parameter instructs the driver to use
-the software DCD algorithm (see below).
+serial port (COM1 under DOS). The * in the mode parameter instructs the driver
+to use the software DCD algorithm (see below)::
insmod baycom_par mode="picpar" iobase=0x378
sethdlc -i bcp0 -p mode "picpar" io 0x378
@@ -115,29 +125,33 @@ Note that both utilities interpret the values slightly differently.
Hardware DCD versus Software DCD
+================================
To avoid collisions on the air, the driver must know when the channel is
busy. This is the task of the DCD circuitry/software. The driver may either
utilise a software DCD algorithm (options=1) or use a DCD signal from
the hardware (options=0).
-ser12: if software DCD is utilised, the radio's squelch should always be
- open. It is highly recommended to use the software DCD algorithm,
- as it is much faster than most hardware squelch circuitry. The
- disadvantage is a slightly higher load on the system.
+======= =================================================================
+ser12 if software DCD is utilised, the radio's squelch should always be
+ open. It is highly recommended to use the software DCD algorithm,
+ as it is much faster than most hardware squelch circuitry. The
+ disadvantage is a slightly higher load on the system.
-par96: the software DCD algorithm for this type of modem is rather poor.
- The modem simply does not provide enough information to implement
- a reasonable DCD algorithm in software. Therefore, if your radio
- feeds the DCD input of the PAR96 modem, the use of the hardware
- DCD circuitry is recommended.
+par96 the software DCD algorithm for this type of modem is rather poor.
+ The modem simply does not provide enough information to implement
+ a reasonable DCD algorithm in software. Therefore, if your radio
+ feeds the DCD input of the PAR96 modem, the use of the hardware
+ DCD circuitry is recommended.
-picpar: the picpar modem features a builtin DCD hardware, which is highly
- recommended.
+picpar the picpar modem features a builtin DCD hardware, which is highly
+ recommended.
+======= =================================================================
Compatibility with the rest of the Linux kernel
+===============================================
The serial driver and the baycom serial drivers compete
for the same hardware resources. Of course only one driver can access a given
@@ -154,5 +168,7 @@ The parallel port drivers (baycom_par, baycom_epp) now use the parport subsystem
to arbitrate the ports between different client drivers.
vy 73s de
+
Tom Sailer, sailer@ife.ee.ethz.ch
+
hb9jnx @ hb9w.ampr.org
diff --git a/Documentation/networking/device_drivers/hamradio/index.rst b/Documentation/networking/device_drivers/hamradio/index.rst
new file mode 100644
index 000000000000..7e731732057b
--- /dev/null
+++ b/Documentation/networking/device_drivers/hamradio/index.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Amateur Radio Device Drivers
+============================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ baycom
+ z8530drv
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/z8530drv.txt b/Documentation/networking/device_drivers/hamradio/z8530drv.rst
index 2206abbc3e1b..d2942760f167 100644
--- a/Documentation/networking/z8530drv.txt
+++ b/Documentation/networking/device_drivers/hamradio/z8530drv.rst
@@ -1,33 +1,30 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+=========================================================
+SCC.C - Linux driver for Z8530 based HDLC cards for AX.25
+=========================================================
+
+
This is a subset of the documentation. To use this driver you MUST have the
full package from:
Internet:
-=========
-1. ftp://ftp.ccac.rwth-aachen.de/pub/jr/z8530drv-utils_3.0-3.tar.gz
+ 1. ftp://ftp.ccac.rwth-aachen.de/pub/jr/z8530drv-utils_3.0-3.tar.gz
-2. ftp://ftp.pspt.fi/pub/ham/linux/ax25/z8530drv-utils_3.0-3.tar.gz
+ 2. ftp://ftp.pspt.fi/pub/ham/linux/ax25/z8530drv-utils_3.0-3.tar.gz
Please note that the information in this document may be hopelessly outdated.
A new version of the documentation, along with links to other important
Linux Kernel AX.25 documentation and programs, is available on
http://yaina.de/jreuter
------------------------------------------------------------------------------
-
-
- SCC.C - Linux driver for Z8530 based HDLC cards for AX.25
-
- ********************************************************************
-
- (c) 1993,2000 by Joerg Reuter DL1BKE <jreuter@yaina.de>
-
- portions (c) 1993 Guido ten Dolle PE1NNZ
-
- for the complete copyright notice see >> Copying.Z8530DRV <<
+Copyright |copy| 1993,2000 by Joerg Reuter DL1BKE <jreuter@yaina.de>
- ********************************************************************
+portions Copyright |copy| 1993 Guido ten Dolle PE1NNZ
+for the complete copyright notice see >> Copying.Z8530DRV <<
1. Initialization of the driver
===============================
@@ -50,7 +47,7 @@ AX.25-HOWTO on how to emulate a KISS TNC on network device drivers.
(If you're going to compile the driver as a part of the kernel image,
skip this chapter and continue with 1.2)
-Before you can use a module, you'll have to load it with
+Before you can use a module, you'll have to load it with::
insmod scc.o
@@ -75,61 +72,73 @@ The file itself consists of two main sections.
==========================================
The hardware setup section defines the following parameters for each
-Z8530:
-
-chip 1
-data_a 0x300 # data port A
-ctrl_a 0x304 # control port A
-data_b 0x301 # data port B
-ctrl_b 0x305 # control port B
-irq 5 # IRQ No. 5
-pclock 4915200 # clock
-board BAYCOM # hardware type
-escc no # enhanced SCC chip? (8580/85180/85280)
-vector 0 # latch for interrupt vector
-special no # address of special function register
-option 0 # option to set via sfr
-
-
-chip - this is just a delimiter to make sccinit a bit simpler to
+Z8530::
+
+ chip 1
+ data_a 0x300 # data port A
+ ctrl_a 0x304 # control port A
+ data_b 0x301 # data port B
+ ctrl_b 0x305 # control port B
+ irq 5 # IRQ No. 5
+ pclock 4915200 # clock
+ board BAYCOM # hardware type
+ escc no # enhanced SCC chip? (8580/85180/85280)
+ vector 0 # latch for interrupt vector
+ special no # address of special function register
+ option 0 # option to set via sfr
+
+
+chip
+ - this is just a delimiter to make sccinit a bit simpler to
program. A parameter has no effect.
-data_a - the address of the data port A of this Z8530 (needed)
-ctrl_a - the address of the control port A (needed)
-data_b - the address of the data port B (needed)
-ctrl_b - the address of the control port B (needed)
-
-irq - the used IRQ for this chip. Different chips can use different
- IRQs or the same. If they share an interrupt, it needs to be
+data_a
+ - the address of the data port A of this Z8530 (needed)
+ctrl_a
+ - the address of the control port A (needed)
+data_b
+ - the address of the data port B (needed)
+ctrl_b
+ - the address of the control port B (needed)
+
+irq
+ - the used IRQ for this chip. Different chips can use different
+ IRQs or the same. If they share an interrupt, it needs to be
specified within one chip-definition only.
pclock - the clock at the PCLK pin of the Z8530 (option, 4915200 is
- default), measured in Hertz
+ default), measured in Hertz
-board - the "type" of the board:
+board
+ - the "type" of the board:
+ ======================= ========
SCC type value
- ---------------------------------
+ ======================= ========
PA0HZP SCC card PA0HZP
EAGLE card EAGLE
PC100 card PC100
PRIMUS-PC (DG9BL) card PRIMUS
BayCom (U)SCC card BAYCOM
+ ======================= ========
-escc - if you want support for ESCC chips (8580, 85180, 85280), set
- this to "yes" (option, defaults to "no")
+escc
+ - if you want support for ESCC chips (8580, 85180, 85280), set
+ this to "yes" (option, defaults to "no")
-vector - address of the vector latch (aka "intack port") for PA0HZP
- cards. There can be only one vector latch for all chips!
+vector
+ - address of the vector latch (aka "intack port") for PA0HZP
+ cards. There can be only one vector latch for all chips!
(option, defaults to 0)
-special - address of the special function register on several cards.
- (option, defaults to 0)
+special
+ - address of the special function register on several cards.
+ (option, defaults to 0)
option - The value you write into that register (option, default is 0)
You can specify up to four chips (8 channels). If this is not enough,
-just change
+just change::
#define MAXSCC 4
@@ -138,75 +147,81 @@ to a higher value.
Example for the BAYCOM USCC:
----------------------------
-chip 1
-data_a 0x300 # data port A
-ctrl_a 0x304 # control port A
-data_b 0x301 # data port B
-ctrl_b 0x305 # control port B
-irq 5 # IRQ No. 5 (#)
-board BAYCOM # hardware type (*)
-#
-# SCC chip 2
-#
-chip 2
-data_a 0x302
-ctrl_a 0x306
-data_b 0x303
-ctrl_b 0x307
-board BAYCOM
+::
+
+ chip 1
+ data_a 0x300 # data port A
+ ctrl_a 0x304 # control port A
+ data_b 0x301 # data port B
+ ctrl_b 0x305 # control port B
+ irq 5 # IRQ No. 5 (#)
+ board BAYCOM # hardware type (*)
+ #
+ # SCC chip 2
+ #
+ chip 2
+ data_a 0x302
+ ctrl_a 0x306
+ data_b 0x303
+ ctrl_b 0x307
+ board BAYCOM
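
To make the layout of the hardware section concrete, here is a small
sketch that reads such a file into one dictionary per chip. It treats
"chip" purely as a delimiter (its parameter is ignored, as described
above) and strips "#" comments. This is not how sccinit itself parses the
file; the function name and error handling are assumptions made for
illustration.

```python
def parse_hardware_section(text):
    """Parse 'keyword value  # comment' lines into a list of per-chip dicts."""
    chips = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if key == "chip":
            # 'chip' only starts a new definition; its parameter has no effect.
            chips.append({})
            continue
        if not chips:
            raise ValueError("parameter %r before first 'chip' line" % key)
        chips[-1][key] = value  # values are kept as strings
    return chips


sample = """
chip    1
data_a  0x300   # data port A
ctrl_a  0x304   # control port A
irq     5       # IRQ No. 5
board   BAYCOM  # hardware type
#
# SCC chip 2
#
chip    2
data_a  0x302
board   BAYCOM
"""

chips = parse_hardware_section(sample)
print(len(chips), chips[0]["data_a"], chips[1]["board"])  # prints 2 0x300 BAYCOM
```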
An example for a PA0HZP card:
-----------------------------
-chip 1
-data_a 0x153
-data_b 0x151
-ctrl_a 0x152
-ctrl_b 0x150
-irq 9
-pclock 4915200
-board PA0HZP
-vector 0x168
-escc no
-#
-#
-#
-chip 2
-data_a 0x157
-data_b 0x155
-ctrl_a 0x156
-ctrl_b 0x154
-irq 9
-pclock 4915200
-board PA0HZP
-vector 0x168
-escc no
+::
+
+ chip 1
+ data_a 0x153
+ data_b 0x151
+ ctrl_a 0x152
+ ctrl_b 0x150
+ irq 9
+ pclock 4915200
+ board PA0HZP
+ vector 0x168
+ escc no
+ #
+ #
+ #
+ chip 2
+ data_a 0x157
+ data_b 0x155
+ ctrl_a 0x156
+ ctrl_b 0x154
+ irq 9
+ pclock 4915200
+ board PA0HZP
+ vector 0x168
+ escc no
A DRSI should probably work with this:
--------------------------------------------
(actually: two DRSI cards...)
-chip 1
-data_a 0x303
-data_b 0x301
-ctrl_a 0x302
-ctrl_b 0x300
-irq 7
-pclock 4915200
-board DRSI
-escc no
-#
-#
-#
-chip 2
-data_a 0x313
-data_b 0x311
-ctrl_a 0x312
-ctrl_b 0x310
-irq 7
-pclock 4915200
-board DRSI
-escc no
+::
+
+ chip 1
+ data_a 0x303
+ data_b 0x301
+ ctrl_a 0x302
+ ctrl_b 0x300
+ irq 7
+ pclock 4915200
+ board DRSI
+ escc no
+ #
+ #
+ #
+ chip 2
+ data_a 0x313
+ data_b 0x311
+ ctrl_a 0x312
+ ctrl_b 0x310
+ irq 7
+ pclock 4915200
+ board DRSI
+ escc no
Note that you cannot use the on-board baudrate generator of DRSI
cards. Use "mode dpll" for clock source (see below).
@@ -220,17 +235,19 @@ The utility "gencfg"
If you only know the parameters for the PE1CHL driver for DOS,
run gencfg. It will generate the correct port addresses (I hope).
Its parameters are exactly the same as the ones you use with
-the "attach scc" command in net, except that the string "init" must
-not appear. Example:
+the "attach scc" command in net, except that the string "init" must
+not appear. Example::
-gencfg 2 0x150 4 2 0 1 0x168 9 4915200
+ gencfg 2 0x150 4 2 0 1 0x168 9 4915200
will print a skeleton z8530drv.conf for the OptoSCC to stdout.
-gencfg 2 0x300 2 4 5 -4 0 7 4915200 0x10
+::
+
+ gencfg 2 0x300 2 4 5 -4 0 7 4915200 0x10
does the same for the BAYCOM USCC card. In my opinion it is much easier
-to edit scc_config.h...
+to edit scc_config.h...
1.2.2 channel configuration
@@ -239,58 +256,58 @@ to edit scc_config.h...
The channel definition is divided into three sub sections for each
channel:
-An example for scc0:
-
-# DEVICE
-
-device scc0 # the device for the following params
-
-# MODEM / BUFFERS
-
-speed 1200 # the default baudrate
-clock dpll # clock source:
- # dpll = normal half duplex operation
- # external = MODEM provides own Rx/Tx clock
- # divider = use full duplex divider if
- # installed (1)
-mode nrzi # HDLC encoding mode
- # nrzi = 1k2 MODEM, G3RUH 9k6 MODEM
- # nrz = DF9IC 9k6 MODEM
- #
-bufsize 384 # size of buffers. Note that this must include
- # the AX.25 header, not only the data field!
- # (optional, defaults to 384)
-
-# KISS (Layer 1)
-
-txdelay 36 # (see chapter 1.4)
-persist 64
-slot 8
-tail 8
-fulldup 0
-wait 12
-min 3
-maxkey 7
-idle 3
-maxdef 120
-group 0
-txoff off
-softdcd on
-slip off
+An example for scc0::
+
+ # DEVICE
+
+ device scc0 # the device for the following params
+
+ # MODEM / BUFFERS
+
+ speed 1200 # the default baudrate
+ clock dpll # clock source:
+ # dpll = normal half duplex operation
+ # external = MODEM provides own Rx/Tx clock
+ # divider = use full duplex divider if
+ # installed (1)
+ mode nrzi # HDLC encoding mode
+ # nrzi = 1k2 MODEM, G3RUH 9k6 MODEM
+ # nrz = DF9IC 9k6 MODEM
+ #
+ bufsize 384 # size of buffers. Note that this must include
+ # the AX.25 header, not only the data field!
+ # (optional, defaults to 384)
+
+ # KISS (Layer 1)
+
+ txdelay 36 # (see chapter 1.4)
+ persist 64
+ slot 8
+ tail 8
+ fulldup 0
+ wait 12
+ min 3
+ maxkey 7
+ idle 3
+ maxdef 120
+ group 0
+ txoff off
+ softdcd on
+ slip off
The order WITHIN these sections is unimportant. The order OF these
sections IS important. The MODEM parameters are set with the first
recognized KISS parameter...
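The ordering rule above can be checked mechanically. The following Python
sketch is an illustration only and not part of z8530drv: the function name
check_channel_order is hypothetical, and the split of keywords into MODEM and
KISS sets is simply taken from the scc0 example above. It reports MODEM
keywords that appear after the first KISS keyword of a device.

```python
# Keyword sets copied from the scc0 example; this is not an official list.
MODEM_KEYS = {"speed", "clock", "mode", "bufsize"}
KISS_KEYS = {
    "txdelay", "persist", "slot", "tail", "fulldup", "wait", "min",
    "maxkey", "idle", "maxdef", "group", "txoff", "softdcd", "slip",
}


def check_channel_order(text):
    """Return a list of messages for MODEM keywords placed after KISS ones."""
    problems = []
    device = None
    seen_kiss = False
    for lineno, raw in enumerate(text.splitlines(), start=1):
        words = raw.split("#", 1)[0].split()  # drop comments and blank lines
        if not words:
            continue
        key = words[0]
        if key == "device":
            device = words[1] if len(words) > 1 else "?"
            seen_kiss = False  # a new channel definition starts here
        elif device is None:
            continue  # chip-level lines before the first "device"
        elif key in KISS_KEYS:
            seen_kiss = True
        elif key in MODEM_KEYS and seen_kiss:
            problems.append(
                f"line {lineno}: '{key}' follows a KISS parameter of {device}"
            )
    return problems


good = "device scc0\nspeed 1200\nclock dpll\nmode nrzi\ntxdelay 36\npersist 64\n"
bad = "device scc0\ntxdelay 36\nspeed 1200\n"
print(check_channel_order(good))
print(check_channel_order(bad))
```

Unknown keywords are ignored on purpose, so the check stays silent about
anything the example above does not cover.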
Please note that you can initialize the board only once after boot
-(or insmod). You can change all parameters but "mode" and "clock"
-later with the Sccparam program or through KISS. Just to avoid
-security holes...
+(or insmod). You can change all parameters but "mode" and "clock"
+later with the Sccparam program or through KISS. Just to avoid
+security holes...
(1) this divider is usually mounted on the SCC-PBC (PA0HZP) or not
- present at all (BayCom). It feeds back the output of the DPLL
- (digital pll) as transmit clock. Using this mode without a divider
- installed will normally result in keying the transceiver until
+ present at all (BayCom). It feeds back the output of the DPLL
+ (digital pll) as transmit clock. Using this mode without a divider
+ installed will normally result in keying the transceiver until
maxkey expires --- of course without sending anything (useful).
2. Attachment of a channel by your AX.25 software
@@ -299,15 +316,15 @@ security holes...
2.1 Kernel AX.25
================
-To set up an AX.25 device you can simply type:
+To set up an AX.25 device you can simply type::
ifconfig scc0 44.128.1.1 hw ax25 dl0tha-7
-This will create a network interface with the IP number 44.128.20.107
-and the callsign "dl0tha". If you do not have any IP number (yet) you
-can use any of the 44.128.0.0 network. Note that you do not need
-axattach. The purpose of axattach (like slattach) is to create a KISS
-network device linked to a TTY. Please read the documentation of the
+This will create a network interface with the IP number 44.128.1.1
+and the callsign "dl0tha". If you do not have any IP number (yet) you
+can use any of the 44.128.0.0 network. Note that you do not need
+axattach. The purpose of axattach (like slattach) is to create a KISS
+network device linked to a TTY. Please read the documentation of the
ax25-utils and the AX.25-HOWTO to learn how to set the parameters of
the kernel AX.25.
@@ -318,16 +335,16 @@ Since the TTY driver (aka KISS TNC emulation) is gone you need
to emulate the old behaviour. The cost of using these programs is
that you probably need to compile the kernel AX.25, regardless of whether
you actually use it or not. First set up your /etc/ax25/axports,
-for example:
+for example::
9k6 dl0tha-9 9600 255 4 9600 baud port (scc3)
axlink dl0tha-15 38400 255 4 Link to NOS
-Now "ifconfig" the scc device:
+Now "ifconfig" the scc device::
ifconfig scc3 44.128.1.1 hw ax25 dl0tha-9
-You can now axattach a pseudo-TTY:
+You can now axattach a pseudo-TTY::
axattach /dev/ptys0 axlink
@@ -335,11 +352,11 @@ and start your NOS and attach /dev/ptys0 there. The problem is that
NOS is reachable only via digipeating through the kernel AX.25
(disastrous on a DAMA controlled channel). To solve this problem,
configure "rxecho" to echo the incoming frames from "9k6" to "axlink"
-and outgoing frames from "axlink" to "9k6" and start:
+and outgoing frames from "axlink" to "9k6" and start::
rxecho
-Or simply use "kissbridge" coming with z8530drv-utils:
+Or simply use "kissbridge" coming with z8530drv-utils::
ifconfig scc3 hw ax25 dl0tha-9
kissbridge scc3 /dev/ptys0
@@ -351,55 +368,57 @@ Or simply use "kissbridge" coming with z8530drv-utils:
3.1 Displaying SCC Parameters:
==============================
-Once a SCC channel has been attached, the parameter settings and
-some statistic information can be shown using the param program:
+Once an SCC channel has been attached, the parameter settings and
+some statistical information can be shown using the sccstat program::
-dl1bke-u:~$ sccstat scc0
+ dl1bke-u:~$ sccstat scc0
-Parameters:
+ Parameters:
-speed : 1200 baud
-txdelay : 36
-persist : 255
-slottime : 0
-txtail : 8
-fulldup : 1
-waittime : 12
-mintime : 3 sec
-maxkeyup : 7 sec
-idletime : 3 sec
-maxdefer : 120 sec
-group : 0x00
-txoff : off
-softdcd : on
-SLIP : off
+ speed : 1200 baud
+ txdelay : 36
+ persist : 255
+ slottime : 0
+ txtail : 8
+ fulldup : 1
+ waittime : 12
+ mintime : 3 sec
+ maxkeyup : 7 sec
+ idletime : 3 sec
+ maxdefer : 120 sec
+ group : 0x00
+ txoff : off
+ softdcd : on
+ SLIP : off
-Status:
+ Status:
-HDLC Z8530 Interrupts Buffers
------------------------------------------------------------------------
-Sent : 273 RxOver : 0 RxInts : 125074 Size : 384
-Received : 1095 TxUnder: 0 TxInts : 4684 NoSpace : 0
-RxErrors : 1591 ExInts : 11776
-TxErrors : 0 SpInts : 1503
-Tx State : idle
+ HDLC                  Z8530           Interrupts         Buffers
+ -----------------------------------------------------------------------
+ Sent     :     273    RxOver :     0  RxInts :   125074  Size    :  384
+ Received :    1095    TxUnder:     0  TxInts :     4684  NoSpace :    0
+ RxErrors :    1591                    ExInts :    11776
+ TxErrors :       0                    SpInts :     1503
+ Tx State :    idle
The status info shown is:
-Sent - number of frames transmitted
-Received - number of frames received
-RxErrors - number of receive errors (CRC, ABORT)
-TxErrors - number of discarded Tx frames (due to various reasons)
-Tx State - status of the Tx interrupt handler: idle/busy/active/tail (2)
-RxOver - number of receiver overruns
-TxUnder - number of transmitter underruns
-RxInts - number of receiver interrupts
-TxInts - number of transmitter interrupts
-EpInts - number of receiver special condition interrupts
-SpInts - number of external/status interrupts
-Size - maximum size of an AX.25 frame (*with* AX.25 headers!)
-NoSpace - number of times a buffer could not get allocated
+============== ==============================================================
+Sent           number of frames transmitted
+Received       number of frames received
+RxErrors       number of receive errors (CRC, ABORT)
+TxErrors       number of discarded Tx frames (due to various reasons)
+Tx State       status of the Tx interrupt handler: idle/busy/active/tail (2)
+RxOver         number of receiver overruns
+TxUnder        number of transmitter underruns
+RxInts         number of receiver interrupts
+TxInts         number of transmitter interrupts
+ExInts         number of external/status interrupts
+SpInts         number of receiver special condition interrupts
+Size           maximum size of an AX.25 frame (*with* AX.25 headers!)
+NoSpace        number of times a buffer could not get allocated
+============== ==============================================================
An overrun is abnormal. If lots of these occur, the product of
baudrate and number of interfaces is too high for the processing
@@ -411,32 +430,34 @@ driver or the kernel AX.25.
======================
-The setting of parameters of the emulated KISS TNC is done in the
+The setting of parameters of the emulated KISS TNC is done in the
same way in the SCC driver. You can change parameters by using
-the kissparms program from the ax25-utils package or use the program
-"sccparam":
+the kissparms program from the ax25-utils package or use the program
+"sccparam"::
sccparam <device> <paramname> <decimal-|hexadecimal value>
You can change the following parameters:
-param : value
-------------------------
-speed : 1200
-txdelay : 36
-persist : 255
-slottime : 0
-txtail : 8
-fulldup : 1
-waittime : 12
-mintime : 3
-maxkeyup : 7
-idletime : 3
-maxdefer : 120
-group : 0x00
-txoff : off
-softdcd : on
-SLIP : off
+=========== =====
+param       value
+=========== =====
+speed       1200
+txdelay     36
+persist     255
+slottime    0
+txtail      8
+fulldup     1
+waittime    12
+mintime     3
+maxkeyup    7
+idletime    3
+maxdefer    120
+group       0x00
+txoff       off
+softdcd     on
+SLIP        off
+=========== =====
The parameters have the following meaning:
@@ -447,92 +468,92 @@ speed:
Example: sccparam /dev/scc3 speed 9600
txdelay:
- The delay (in units of 10 ms) after keying of the
- transmitter, until the first byte is sent. This is usually
- called "TXDELAY" in a TNC. When 0 is specified, the driver
- will just wait until the CTS signal is asserted. This
- assumes the presence of a timer or other circuitry in the
- MODEM and/or transmitter, that asserts CTS when the
+ The delay (in units of 10 ms) after keying of the
+ transmitter, until the first byte is sent. This is usually
+ called "TXDELAY" in a TNC. When 0 is specified, the driver
+ will just wait until the CTS signal is asserted. This
+ assumes the presence of a timer or other circuitry in the
+ MODEM and/or transmitter, that asserts CTS when the
transmitter is ready for data.
A normal value of this parameter is 30-36.
Example: sccparam /dev/scc0 txd 20
persist:
- This is the probability that the transmitter will be keyed
- when the channel is found to be free. It is a value from 0
- to 255, and the probability is (value+1)/256. The value
- should be somewhere near 50-60, and should be lowered when
+ This is the probability that the transmitter will be keyed
+ when the channel is found to be free. It is a value from 0
+ to 255, and the probability is (value+1)/256. The value
+ should be somewhere near 50-60, and should be lowered when
the channel is used more heavily.
Example: sccparam /dev/scc2 persist 20
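As a rough illustration of the formula above, this Python sketch converts a
persist value into the keying probability (value+1)/256. The helper name
keyup_probability is hypothetical and exists only for this example.

```python
def keyup_probability(persist):
    """Probability of keying when the channel is free: (value + 1) / 256."""
    if not 0 <= persist <= 255:
        raise ValueError("persist must be in the range 0..255")
    return (persist + 1) / 256


# Values taken from the examples in this document.
for value in (20, 64, 255):
    print(f"persist {value:3d} -> p = {keyup_probability(value):.3f}")
```

A value of 255 therefore keys the transmitter every time the channel is found
free, while lower values back off more often.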
slottime:
- This is the time between samples of the channel. It is
- expressed in units of 10 ms. About 200-300 ms (value 20-30)
+ This is the time between samples of the channel. It is
+ expressed in units of 10 ms. About 200-300 ms (value 20-30)
seems to be a good value.
Example: sccparam /dev/scc0 slot 20
tail:
- The time the transmitter will remain keyed after the last
- byte of a packet has been transferred to the SCC. This is
- necessary because the CRC and a flag still have to leave the
- SCC before the transmitter is keyed down. The value depends
- on the baudrate selected. A few character times should be
+ The time the transmitter will remain keyed after the last
+ byte of a packet has been transferred to the SCC. This is
+ necessary because the CRC and a flag still have to leave the
+ SCC before the transmitter is keyed down. The value depends
+ on the baudrate selected. A few character times should be
sufficient, e.g. 40ms at 1200 baud. (value 4)
The value of this parameter is in 10 ms units.
Example: sccparam /dev/scc2 tail 4
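The relation between baudrate and tail value can be sketched in Python as
follows. This is only an estimate under stated assumptions: the helper name
tail_units is hypothetical, and "a few character times" is taken to mean six
characters of eight bits each, which reproduces the 40 ms figure quoted for
1200 baud.

```python
def tail_units(baud, char_times=6, bits_per_char=8):
    """Return a tail value (in 10 ms units) covering char_times characters."""
    if baud <= 0:
        raise ValueError("baud must be positive")
    bits = char_times * bits_per_char
    # One unit is 10 ms (1/100 s); round up using integer arithmetic only.
    return -(-bits * 100 // baud)


for baud in (1200, 9600):
    print(f"{baud} baud -> tail {tail_units(baud)}")
```

The result is rounded up so that the transmitter is never keyed down before
the assumed number of characters has left the SCC.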
full:
- The full-duplex mode switch. This can be one of the following
+ The full-duplex mode switch. This can be one of the following
values:
- 0: The interface will operate in CSMA mode (the normal
- half-duplex packet radio operation)
- 1: Fullduplex mode, i.e. the transmitter will be keyed at
- any time, without checking the received carrier. It
- will be unkeyed when there are no packets to be sent.
- 2: Like 1, but the transmitter will remain keyed, also
- when there are no packets to be sent. Flags will be
- sent in that case, until a timeout (parameter 10)
- occurs.
+ 0: The interface will operate in CSMA mode (the normal
+ half-duplex packet radio operation)
+ 1: Fullduplex mode, i.e. the transmitter will be keyed at
+ any time, without checking the received carrier. It
+ will be unkeyed when there are no packets to be sent.
+ 2: Like 1, but the transmitter will remain keyed, also
+ when there are no packets to be sent. Flags will be
+ sent in that case, until a timeout (parameter 10)
+ occurs.
Example: sccparam /dev/scc0 fulldup off
wait:
- The initial waittime before any transmit attempt, after the
- frame has been queue for transmit. This is the length of
+ The initial waittime before any transmit attempt, after the
+ frame has been queued for transmit. This is the length of
the first slot in CSMA mode. In full duplex modes it is
set to 0 for maximum performance.
- The value of this parameter is in 10 ms units.
+ The value of this parameter is in 10 ms units.
Example: sccparam /dev/scc1 wait 4
maxkey:
- The maximal time the transmitter will be keyed to send
- packets, in seconds. This can be useful on busy CSMA
- channels, to avoid "getting a bad reputation" when you are
- generating a lot of traffic. After the specified time has
+ The maximal time the transmitter will be keyed to send
+ packets, in seconds. This can be useful on busy CSMA
+ channels, to avoid "getting a bad reputation" when you are
+ generating a lot of traffic. After the specified time has
elapsed, no new frame will be started. Instead, the trans-
- mitter will be switched off for a specified time (parameter
- min), and then the selected algorithm for keyup will be
+ mitter will be switched off for a specified time (parameter
+ min), and then the selected algorithm for keyup will be
started again.
- The value 0 as well as "off" will disable this feature,
- and allow infinite transmission time.
+ The value 0 as well as "off" will disable this feature,
+ and allow infinite transmission time.
Example: sccparam /dev/scc0 maxk 20
min:
- This is the time the transmitter will be switched off when
+ This is the time the transmitter will be switched off when
the maximum transmission time is exceeded.
Example: sccparam /dev/scc3 min 10
-idle
- This parameter specifies the maximum idle time in full duplex
- 2 mode, in seconds. When no frames have been sent for this
+idle:
+ This parameter specifies the maximum idle time in full duplex
+ 2 mode, in seconds. When no frames have been sent for this
time, the transmitter will be keyed down. A value of 0
has the same result as the fullduplex mode 1. This parameter
can be disabled.
@@ -541,7 +562,7 @@ idle
maxdefer
This is the maximum time (in seconds) to wait for a free channel
- to send. When this timer expires the transmitter will be keyed
+ to send. When this timer expires the transmitter will be keyed
IMMEDIATELY. If you love to get into trouble with other users you
should set this to a very low value ;-)
@@ -555,32 +576,38 @@ txoff:
Example: sccparam /dev/scc2 txoff on
group:
- It is possible to build special radio equipment to use more than
- one frequency on the same band, e.g. using several receivers and
+ It is possible to build special radio equipment to use more than
+ one frequency on the same band, e.g. using several receivers and
only one transmitter that can be switched between frequencies.
- Also, you can connect several radios that are active on the same
- band. In these cases, it is not possible, or not a good idea, to
- transmit on more than one frequency. The SCC driver provides a
- method to lock transmitters on different interfaces, using the
- "param <interface> group <x>" command. This will only work when
+ Also, you can connect several radios that are active on the same
+ band. In these cases, it is not possible, or not a good idea, to
+ transmit on more than one frequency. The SCC driver provides a
+ method to lock transmitters on different interfaces, using the
+ "param <interface> group <x>" command. This will only work when
you are using CSMA mode (parameter full = 0).
- The number <x> must be 0 if you want no group restrictions, and
+
+ The number <x> must be 0 if you want no group restrictions, and
can be computed as follows to create restricted groups:
<x> is the sum of some OCTAL numbers:
- 200 This transmitter will only be keyed when all other
- transmitters in the group are off.
- 100 This transmitter will only be keyed when the carrier
- detect of all other interfaces in the group is off.
- 0xx A byte that can be used to define different groups.
- Interfaces are in the same group, when the logical AND
- between their xx values is nonzero.
+
+ === =======================================================
+ 200 This transmitter will only be keyed when all other
+     transmitters in the group are off.
+ 100 This transmitter will only be keyed when the carrier
+     detect of all other interfaces in the group is off.
+ 0xx A byte that can be used to define different groups.
+     Interfaces are in the same group, when the logical AND
+     between their xx values is nonzero.
+ === =======================================================
Examples:
- When 2 interfaces use group 201, their transmitters will never be
+
+ When 2 interfaces use group 201, their transmitters will never be
keyed at the same time.
- When 2 interfaces use group 101, the transmitters will only key
- when both channels are clear at the same time. When group 301,
+
+ When 2 interfaces use group 101, the transmitters will only key
+ when both channels are clear at the same time. When group 301,
the transmitters will not be keyed at the same time.
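The octal arithmetic in these examples can be checked with a short Python
sketch. The names below are hypothetical, and GROUP_MASK assumes that the "xx"
in "0xx" stands for the two low octal digits (six bits); the driver itself may
treat the field differently.

```python
TX_LOCK = 0o200     # key only when all other transmitters in the group are off
DCD_LOCK = 0o100    # key only when carrier detect of all others is off
GROUP_MASK = 0o077  # assumed position of the "xx" group bits


def group_value(octal_text):
    """Convert an octal group string such as "201" to its decimal value."""
    return int(octal_text, 8)


def same_group(a, b):
    """True when the logical AND of the xx bits of both values is nonzero."""
    return (a & b & GROUP_MASK) != 0


for text in ("201", "101", "301"):
    value = group_value(text)
    print(f"octal {text} -> decimal {value} (hex {value:#x})")
```

The decimal or hexadecimal result is what would be passed as the value of the
group parameter.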
Don't forget to convert the octal numbers into decimal before
@@ -595,19 +622,19 @@ softdcd:
Example: sccparam /dev/scc0 soft on
-4. Problems
+4. Problems
===========
If you have tx-problems with your BayCom USCC card please check
the manufacturer of the 8530. SGS chips have a slightly
-different timing. Try Zilog... A solution is to write to register 8
-instead to the data port, but this won't work with the ESCC chips.
+different timing. Try Zilog... A solution is to write to register 8
+instead of to the data port, but this won't work with the ESCC chips.
*SIGH!*
A very common problem is that the PTT locks until the maxkeyup timer
expires, although interrupts and clock source are correct. In most
cases compiling the driver with CONFIG_SCC_DELAY (set with
-make config) solves the problems. For more hints read the (pseudo) FAQ
+make config) solves the problems. For more hints read the (pseudo) FAQ
and the documentation coming with z8530drv-utils.
I got reports that the driver has problems on some 386-based systems.
@@ -651,7 +678,9 @@ got it up-and-running?
Many thanks to Linus Torvalds and Alan Cox for including the driver
in the Linux standard distribution and their support.
-Joerg Reuter ampr-net: dl1bke@db0pra.ampr.org
- AX-25 : DL1BKE @ DB0ABH.#BAY.DEU.EU
- Internet: jreuter@yaina.de
- WWW : http://yaina.de/jreuter
+::
+
+ Joerg Reuter ampr-net: dl1bke@db0pra.ampr.org
+ AX-25 : DL1BKE @ DB0ABH.#BAY.DEU.EU
+ Internet: jreuter@yaina.de
+ WWW : http://yaina.de/jreuter
diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst
index a191faaf97de..601eacaf12f3 100644
--- a/Documentation/networking/device_drivers/index.rst
+++ b/Documentation/networking/device_drivers/index.rst
@@ -1,32 +1,24 @@
.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
-Vendor Device Drivers
-=====================
+Hardware Device Drivers
+=======================
Contents:
.. toctree::
:maxdepth: 2
- freescale/dpaa2/index
- intel/e100
- intel/e1000
- intel/e1000e
- intel/fm10k
- intel/igb
- intel/igbvf
- intel/ixgb
- intel/ixgbe
- intel/ixgbevf
- intel/i40e
- intel/iavf
- intel/ice
- google/gve
- marvell/octeontx2
- mellanox/mlx5
- netronome/nfp
- pensando/ionic
- stmicro/stmmac
+ appletalk/index
+ atm/index
+ cable/index
+ can/index
+ cellular/index
+ ethernet/index
+ fddi/index
+ hamradio/index
+ qlogic/index
+ wifi/index
+ wwan/index
.. only:: subproject and html
diff --git a/Documentation/networking/device_drivers/intel/ice.rst b/Documentation/networking/device_drivers/intel/ice.rst
deleted file mode 100644
index ee43ea57d443..000000000000
--- a/Documentation/networking/device_drivers/intel/ice.rst
+++ /dev/null
@@ -1,46 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0+
-
-==================================================================
-Linux Base Driver for the Intel(R) Ethernet Connection E800 Series
-==================================================================
-
-Intel ice Linux driver.
-Copyright(c) 2018 Intel Corporation.
-
-Contents
-========
-
-- Enabling the driver
-- Support
-
-The driver in this release supports Intel's E800 Series of products. For
-more information, visit Intel's support page at https://support.intel.com.
-
-Enabling the driver
-===================
-The driver is enabled via the standard kernel configuration system,
-using the make command::
-
- make oldconfig/menuconfig/etc.
-
-The driver is located in the menu structure at:
-
- -> Device Drivers
- -> Network device support (NETDEVICES [=y])
- -> Ethernet driver support
- -> Intel devices
- -> Intel(R) Ethernet Connection E800 Series Support
-
-Support
-=======
-For general information, go to the Intel support website at:
-
-https://www.intel.com/support/
-
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
-If an issue is identified with the released source code on a supported kernel
-with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
diff --git a/Documentation/networking/device_drivers/mellanox/mlx5.rst b/Documentation/networking/device_drivers/mellanox/mlx5.rst
deleted file mode 100644
index f575a49790e8..000000000000
--- a/Documentation/networking/device_drivers/mellanox/mlx5.rst
+++ /dev/null
@@ -1,321 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
-
-=================================================
-Mellanox ConnectX(R) mlx5 core VPI Network Driver
-=================================================
-
-Copyright (c) 2019, Mellanox Technologies LTD.
-
-Contents
-========
-
-- `Enabling the driver and kconfig options`_
-- `Devlink info`_
-- `Devlink parameters`_
-- `Devlink health reporters`_
-- `mlx5 tracepoints`_
-
-Enabling the driver and kconfig options
-================================================
-
-| mlx5 core is modular and most of the major mlx5 core driver features can be selected (compiled in/out)
-| at build time via kernel Kconfig flags.
-| Basic features, ethernet net device rx/tx offloads and XDP, are available with the most basic flags
-| CONFIG_MLX5_CORE=y/m and CONFIG_MLX5_CORE_EN=y.
-| For the list of advanced features please see below.
-
-**CONFIG_MLX5_CORE=(y/m/n)** (module mlx5_core.ko)
-
-| The driver can be enabled by choosing CONFIG_MLX5_CORE=y/m in kernel config.
-| This will provide mlx5 core driver for mlx5 ulps to interface with (mlx5e, mlx5_ib).
-
-
-**CONFIG_MLX5_CORE_EN=(y/n)**
-
-| Choosing this option will allow basic ethernet netdevice support with all of the standard rx/tx offloads.
-| mlx5e is the mlx5 ulp driver which provides netdevice kernel interface, when chosen, mlx5e will be
-| built-in into mlx5_core.ko.
-
-
-**CONFIG_MLX5_EN_ARFS=(y/n)**
-
-| Enables Hardware-accelerated receive flow steering (arfs) support, and ntuple filtering.
-| https://community.mellanox.com/s/article/howto-configure-arfs-on-connectx-4
-
-
-**CONFIG_MLX5_EN_RXNFC=(y/n)**
-
-| Enables ethtool receive network flow classification, which allows user defined
-| flow rules to direct traffic into arbitrary rx queue via ethtool set/get_rxnfc API.
-
-
-**CONFIG_MLX5_CORE_EN_DCB=(y/n)**:
-
-| Enables `Data Center Bridging (DCB) Support <https://community.mellanox.com/s/article/howto-auto-config-pfc-and-ets-on-connectx-4-via-lldp-dcbx>`_.
-
-
-**CONFIG_MLX5_MPFS=(y/n)**
-
-| Ethernet Multi-Physical Function Switch (MPFS) support in ConnectX NIC.
-| MPFs is required for when `Multi-Host <http://www.mellanox.com/page/multihost>`_ configuration is enabled to allow passing
-| user configured unicast MAC addresses to the requesting PF.
-
-
-**CONFIG_MLX5_ESWITCH=(y/n)**
-
-| Ethernet SRIOV E-Switch support in ConnectX NIC. E-Switch provides internal SRIOV packet steering
-| and switching for the enabled VFs and PF in two available modes:
-| 1) `Legacy SRIOV mode (L2 mac vlan steering based) <https://community.mellanox.com/s/article/howto-configure-sr-iov-for-connectx-4-connectx-5-with-kvm--ethernet-x>`_.
-| 2) `Switchdev mode (eswitch offloads) <https://www.mellanox.com/related-docs/prod_software/ASAP2_Hardware_Offloading_for_vSwitches_User_Manual_v4.4.pdf>`_.
-
-
-**CONFIG_MLX5_CORE_IPOIB=(y/n)**
-
-| IPoIB offloads & acceleration support.
-| Requires CONFIG_MLX5_CORE_EN to provide an accelerated interface for the rdma
-| IPoIB ulp netdevice.
-
-
-**CONFIG_MLX5_FPGA=(y/n)**
-
-| Build support for the Innova family of network cards by Mellanox Technologies.
-| Innova network cards are comprised of a ConnectX chip and an FPGA chip on one board.
-| If you select this option, the mlx5_core driver will include the Innova FPGA core and allow
-| building sandbox-specific client drivers.
-
-
-**CONFIG_MLX5_EN_IPSEC=(y/n)**
-
-| Enables `IPSec XFRM cryptography-offload accelaration <http://www.mellanox.com/related-docs/prod_software/Mellanox_Innova_IPsec_Ethernet_Adapter_Card_User_Manual.pdf>`_.
-
-**CONFIG_MLX5_EN_TLS=(y/n)**
-
-| TLS cryptography-offload accelaration.
-
-
-**CONFIG_MLX5_INFINIBAND=(y/n/m)** (module mlx5_ib.ko)
-
-| Provides low-level InfiniBand/RDMA and `RoCE <https://community.mellanox.com/s/article/recommended-network-configuration-examples-for-roce-deployment>`_ support.
-
-
-**External options** ( Choose if the corresponding mlx5 feature is required )
-
-- CONFIG_PTP_1588_CLOCK: When chosen, mlx5 ptp support will be enabled
-- CONFIG_VXLAN: When chosen, mlx5 vxaln support will be enabled.
-- CONFIG_MLXFW: When chosen, mlx5 firmware flashing support will be enabled (via devlink and ethtool).
-
-Devlink info
-============
-
-The devlink info reports the running and stored firmware versions on device.
-It also prints the device PSID which represents the HCA board type ID.
-
-User command example::
-
- $ devlink dev info pci/0000:00:06.0
- pci/0000:00:06.0:
- driver mlx5_core
- versions:
- fixed:
- fw.psid MT_0000000009
- running:
- fw.version 16.26.0100
- stored:
- fw.version 16.26.0100
-
-Devlink parameters
-==================
-
-flow_steering_mode: Device flow steering mode
----------------------------------------------
-The flow steering mode parameter controls the flow steering mode of the driver.
-Two modes are supported:
-1. 'dmfs' - Device managed flow steering.
-2. 'smfs - Software/Driver managed flow steering.
-
-In DMFS mode, the HW steering entities are created and managed through the
-Firmware.
-In SMFS mode, the HW steering entities are created and managed though by
-the driver directly into Hardware without firmware intervention.
-
-SMFS mode is faster and provides better rule inserstion rate compared to default DMFS mode.
-
-User command examples:
-
-- Set SMFS flow steering mode::
-
- $ devlink dev param set pci/0000:06:00.0 name flow_steering_mode value "smfs" cmode runtime
-
-- Read device flow steering mode::
-
- $ devlink dev param show pci/0000:06:00.0 name flow_steering_mode
- pci/0000:06:00.0:
- name flow_steering_mode type driver-specific
- values:
- cmode runtime value smfs
-
-enable_roce: RoCE enablement state
-----------------------------------
-RoCE enablement state controls driver support for RoCE traffic.
-When RoCE is disabled, there is no gid table, only raw ethernet QPs are supported and traffic on the well known UDP RoCE port is handled as raw ethernet traffic.
-
-To change RoCE enablement state a user must change the driverinit cmode value and run devlink reload.
-
-User command examples:
-
-- Disable RoCE::
-
- $ devlink dev param set pci/0000:06:00.0 name enable_roce value false cmode driverinit
- $ devlink dev reload pci/0000:06:00.0
-
-- Read RoCE enablement state::
-
- $ devlink dev param show pci/0000:06:00.0 name enable_roce
- pci/0000:06:00.0:
- name enable_roce type generic
- values:
- cmode driverinit value true
-
-Devlink health reporters
-========================
-
-tx reporter
------------
-The tx reporter is responsible for reporting and recovering of the following two error scenarios:
-
-- TX timeout
- Report on kernel tx timeout detection.
- Recover by searching lost interrupts.
-- TX error completion
- Report on error tx completion.
- Recover by flushing the TX queue and reset it.
-
-TX reporter also support on demand diagnose callback, on which it provides
-real time information of its send queues status.
-
-User commands examples:
-
-- Diagnose send queues status::
-
- $ devlink health diagnose pci/0000:82:00.0 reporter tx
-
-NOTE: This command has valid output only when interface is up, otherwise the command has empty output.
-
-- Show number of tx errors indicated, number of recover flows ended successfully,
- is autorecover enabled and graceful period from last recover::
-
- $ devlink health show pci/0000:82:00.0 reporter tx
-
-rx reporter
------------
-The rx reporter is responsible for reporting and recovering of the following two error scenarios:
-
-- RX queues initialization (population) timeout
- RX queues descriptors population on ring initialization is done in
- napi context via triggering an irq, in case of a failure to get
- the minimum amount of descriptors, a timeout would occur and it
- could be recoverable by polling the EQ (Event Queue).
-- RX completions with errors (reported by HW on interrupt context)
- Report on rx completion error.
- Recover (if needed) by flushing the related queue and reset it.
-
-RX reporter also supports on demand diagnose callback, on which it
-provides real time information of its receive queues status.
-
-- Diagnose rx queues status, and corresponding completion queue::
-
- $ devlink health diagnose pci/0000:82:00.0 reporter rx
-
-NOTE: This command has valid output only when interface is up, otherwise the command has empty output.
-
-- Show number of rx errors indicated, number of recover flows ended successfully,
- is autorecover enabled and graceful period from last recover::
-
- $ devlink health show pci/0000:82:00.0 reporter rx
-
-fw reporter
------------
-The fw reporter implements diagnose and dump callbacks.
-It follows symptoms of fw error such as fw syndrome by triggering
-fw core dump and storing it into the dump buffer.
-The fw reporter diagnose command can be triggered any time by the user to check
-current fw status.
-
-User commands examples:
-
-- Check fw heath status::
-
- $ devlink health diagnose pci/0000:82:00.0 reporter fw
-
-- Read FW core dump if already stored or trigger new one::
-
- $ devlink health dump show pci/0000:82:00.0 reporter fw
-
-NOTE: This command can run only on the PF which has fw tracer ownership,
-running it on other PF or any VF will return "Operation not permitted".
-
-fw fatal reporter
------------------
-The fw fatal reporter implements dump and recover callbacks.
-It follows fatal errors indications by CR-space dump and recover flow.
-The CR-space dump uses vsc interface which is valid even if the FW command
-interface is not functional, which is the case in most FW fatal errors.
-The recover function runs recover flow which reloads the driver and triggers fw
-reset if needed.
-
-User commands examples:
-
-- Run fw recover flow manually::
-
- $ devlink health recover pci/0000:82:00.0 reporter fw_fatal
-
-- Read FW CR-space dump if already stored or trigger new one::
-
- $ devlink health dump show pci/0000:82:00.1 reporter fw_fatal
-
-NOTE: This command can run only on PF.
-
-mlx5 tracepoints
-================
-
-mlx5 driver provides internal trace points for tracking and debugging using
-kernel tracepoints interfaces (refer to Documentation/trace/ftrace.rst).
-
-For the list of supported mlx5 events, check /sys/kernel/debug/tracing/events/mlx5/
-
-tc and eswitch offloads tracepoints:
-
-- mlx5e_configure_flower: trace flower filter actions and cookies offloaded to mlx5::
-
- $ echo mlx5:mlx5e_configure_flower >> /sys/kernel/debug/tracing/set_event
- $ cat /sys/kernel/debug/tracing/trace
- ...
- tc-6535 [019] ...1 2672.404466: mlx5e_configure_flower: cookie=0000000067874a55 actions= REDIRECT
-
-- mlx5e_delete_flower: trace flower filter actions and cookies deleted from mlx5::
-
- $ echo mlx5:mlx5e_delete_flower >> /sys/kernel/debug/tracing/set_event
- $ cat /sys/kernel/debug/tracing/trace
- ...
- tc-6569 [010] .N.1 2686.379075: mlx5e_delete_flower: cookie=0000000067874a55 actions= NULL
-
-- mlx5e_stats_flower: trace flower stats request::
-
- $ echo mlx5:mlx5e_stats_flower >> /sys/kernel/debug/tracing/set_event
- $ cat /sys/kernel/debug/tracing/trace
- ...
- tc-6546 [010] ...1 2679.704889: mlx5e_stats_flower: cookie=0000000060eb3d6a bytes=0 packets=0 lastused=4295560217
-
-- mlx5e_tc_update_neigh_used_value: trace tunnel rule neigh update value offloaded to mlx5::
-
- $ echo mlx5:mlx5e_tc_update_neigh_used_value >> /sys/kernel/debug/tracing/set_event
- $ cat /sys/kernel/debug/tracing/trace
- ...
- kworker/u48:4-8806 [009] ...1 55117.882428: mlx5e_tc_update_neigh_used_value: netdev: ens1f0 IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_used=1
-
-- mlx5e_rep_neigh_update: trace neigh update tasks scheduled due to neigh state change events::
-
- $ echo mlx5:mlx5e_rep_neigh_update >> /sys/kernel/debug/tracing/set_event
- $ cat /sys/kernel/debug/tracing/trace
- ...
- kworker/u48:7-2221 [009] ...1 1475.387435: mlx5e_rep_neigh_update: netdev: ens1f0 MAC: 24:8a:07:9a:17:9a IPv4: 1.1.1.10 IPv6: ::ffff:1.1.1.10 neigh_connected=1
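
The tracepoint samples in the removed mlx5 text follow the usual ftrace line layout: task and pid, CPU, flags, timestamp, event name, then the event fields. A minimal Python sketch, assuming that layout and the `key=value` fields seen in the `mlx5e_stats_flower` sample above, could extract the counters from one line. The name `parse_stats_flower` and the regular expression are illustrative only; they are not part of the driver or of ftrace.

```python
import re

# Matches the "event: key=value ..." tail of an ftrace line such as the
# mlx5e_stats_flower sample in the removed documentation.
_STATS_RE = re.compile(
    r"mlx5e_stats_flower:\s+cookie=(?P<cookie>[0-9a-fA-F]+)"
    r"\s+bytes=(?P<bytes>\d+)"
    r"\s+packets=(?P<packets>\d+)"
    r"\s+lastused=(?P<lastused>\d+)"
)


def parse_stats_flower(line):
    """Return the fields of one mlx5e_stats_flower trace line, or None."""
    match = _STATS_RE.search(line)
    if match is None:
        return None
    return {
        "cookie": match.group("cookie"),      # kept as the hex string shown
        "bytes": int(match.group("bytes")),
        "packets": int(match.group("packets")),
        "lastused": int(match.group("lastused")),
    }


if __name__ == "__main__":
    sample = (
        "tc-6546 [010] ...1 2679.704889: mlx5e_stats_flower: "
        "cookie=0000000060eb3d6a bytes=0 packets=0 lastused=4295560217"
    )
    print(parse_stats_flower(sample))
```

Lines from other events, or lines whose field order differs from the sample, simply yield `None`, so the helper can be run over a whole trace buffer.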
diff --git a/Documentation/networking/device_drivers/neterion/s2io.txt b/Documentation/networking/device_drivers/neterion/s2io.txt
deleted file mode 100644
index 0362a42f7cf4..000000000000
--- a/Documentation/networking/device_drivers/neterion/s2io.txt
+++ /dev/null
@@ -1,141 +0,0 @@
-Release notes for Neterion's (Formerly S2io) Xframe I/II PCI-X 10GbE driver.
-
-Contents
-=======
-- 1. Introduction
-- 2. Identifying the adapter/interface
-- 3. Features supported
-- 4. Command line parameters
-- 5. Performance suggestions
-- 6. Available Downloads
-
-
-1. Introduction:
-This Linux driver supports Neterion's Xframe I PCI-X 1.0 and
-Xframe II PCI-X 2.0 adapters. It supports several features
-such as jumbo frames, MSI/MSI-X, checksum offloads, TSO, UFO and so on.
-See below for complete list of features.
-All features are supported for both IPv4 and IPv6.
-
-2. Identifying the adapter/interface:
-a. Insert the adapter(s) in your system.
-b. Build and load driver
-# insmod s2io.ko
-c. View log messages
-# dmesg | tail -40
-You will see messages similar to:
-eth3: Neterion Xframe I 10GbE adapter (rev 3), Version 2.0.9.1, Intr type INTA
-eth4: Neterion Xframe II 10GbE adapter (rev 2), Version 2.0.9.1, Intr type INTA
-eth4: Device is on 64 bit 133MHz PCIX(M1) bus
-
-The above messages identify the adapter type(Xframe I/II), adapter revision,
-driver version, interface name(eth3, eth4), Interrupt type(INTA, MSI, MSI-X).
-In case of Xframe II, the PCI/PCI-X bus width and frequency are displayed
-as well.
-
-To associate an interface with a physical adapter use "ethtool -p <ethX>".
-The corresponding adapter's LED will blink multiple times.
-
-3. Features supported:
-a. Jumbo frames. Xframe I/II supports MTU up to 9600 bytes,
-modifiable using ip command.
-
-b. Offloads. Supports checksum offload(TCP/UDP/IP) on transmit
-and receive, TSO.
-
-c. Multi-buffer receive mode. Scattering of packet across multiple
-buffers. Currently driver supports 2-buffer mode which yields
-significant performance improvement on certain platforms(SGI Altix,
-IBM xSeries).
-
-d. MSI/MSI-X. Can be enabled on platforms which support this feature
-(IA64, Xeon) resulting in noticeable performance improvement(up to 7%
-on certain platforms).
-
-e. Statistics. Comprehensive MAC-level and software statistics displayed
-using "ethtool -S" option.
-
-f. Multi-FIFO/Ring. Supports up to 8 transmit queues and receive rings,
-with multiple steering options.
-
-4. Command line parameters
-a. tx_fifo_num
-Number of transmit queues
-Valid range: 1-8
-Default: 1
-
-b. rx_ring_num
-Number of receive rings
-Valid range: 1-8
-Default: 1
-
-c. tx_fifo_len
-Size of each transmit queue
-Valid range: Total length of all queues should not exceed 8192
-Default: 4096
-
-d. rx_ring_sz
-Size of each receive ring(in 4K blocks)
-Valid range: Limited by memory on system
-Default: 30
-
-e. intr_type
-Specifies interrupt type. Possible values 0(INTA), 2(MSI-X)
-Valid values: 0, 2
-Default: 2
-
-5. Performance suggestions
-General:
-a. Set MTU to maximum(9000 for switch setup, 9600 in back-to-back configuration)
-b. Set TCP window size to optimal value.
-For instance, for MTU=1500 a value of 210K has been observed to result in
-good performance.
-# sysctl -w net.ipv4.tcp_rmem="210000 210000 210000"
-# sysctl -w net.ipv4.tcp_wmem="210000 210000 210000"
-For MTU=9000, TCP window size of 10 MB is recommended.
-# sysctl -w net.ipv4.tcp_rmem="10000000 10000000 10000000"
-# sysctl -w net.ipv4.tcp_wmem="10000000 10000000 10000000"
-
-Transmit performance:
-a. By default, the driver respects BIOS settings for PCI bus parameters.
-However, you may want to experiment with PCI bus parameters
-max-split-transactions(MOST) and MMRBC (use setpci command).
-A MOST value of 2 has been found optimal for Opterons and 3 for Itanium.
-It could be different for your hardware.
-Set MMRBC to 4K**.
-
-For example you can set
-For opteron
-#setpci -d 17d5:* 62=1d
-For Itanium
-#setpci -d 17d5:* 62=3d
-
-For detailed description of the PCI registers, please see Xframe User Guide.
-
-b. Ensure Transmit Checksum offload is enabled. Use ethtool to set/verify this
-parameter.
-c. Turn on TSO(using "ethtool -K")
-# ethtool -K <ethX> tso on
-
-Receive performance:
-a. By default, the driver respects BIOS settings for PCI bus parameters.
-However, you may want to set PCI latency timer to 248.
-#setpci -d 17d5:* LATENCY_TIMER=f8
-For detailed description of the PCI registers, please see Xframe User Guide.
-b. Use 2-buffer mode. This results in large performance boost on
-certain platforms(eg. SGI Altix, IBM xSeries).
-c. Ensure Receive Checksum offload is enabled. Use "ethtool -K ethX" command to
-set/verify this option.
-d. Enable NAPI feature(in kernel configuration Device Drivers ---> Network
-device support ---> Ethernet (10000 Mbit) ---> S2IO 10Gbe Xframe NIC) to
-bring down CPU utilization.
-
-** For AMD opteron platforms with 8131 chipset, MMRBC=1 and MOST=1 are
-recommended as safe parameters.
-For more information, please review the AMD8131 errata at
-http://vip.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/
-26310_AMD-8131_HyperTransport_PCI-X_Tunnel_Revision_Guide_rev_3_18.pdf
-
-6. Support
-For further support please contact your 10GbE Xframe NIC vendor (IBM,
-HP, SGI etc.).
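
The parameter ranges in section 4 of the removed s2io.txt, and the hexadecimal value passed to setpci in its receive-performance notes, can be checked with a little arithmetic. The sketch below is illustrative only: `check_s2io_params` and `latency_timer_hex` are hypothetical helpers, not part of the driver, and the sketch assumes every transmit queue uses the same `tx_fifo_len`, so that the "total length of all queues" is `tx_fifo_num * tx_fifo_len`.

```python
def check_s2io_params(tx_fifo_num=1, rx_ring_num=1, tx_fifo_len=4096,
                      intr_type=2):
    """Return a list of problems with the given s2io module parameters.

    Ranges and defaults are the ones quoted in the removed s2io.txt.
    """
    problems = []
    if not 1 <= tx_fifo_num <= 8:
        problems.append("tx_fifo_num must be in 1-8")
    if not 1 <= rx_ring_num <= 8:
        problems.append("rx_ring_num must be in 1-8")
    # Assumption: all queues share one length, so the total is a product.
    if tx_fifo_num * tx_fifo_len > 8192:
        problems.append("total transmit queue length exceeds 8192")
    if intr_type not in (0, 2):
        problems.append("intr_type must be 0 (INTA) or 2 (MSI-X)")
    return problems


def latency_timer_hex(value):
    """Format a PCI latency timer value as the hex digits setpci takes."""
    if not 0 <= value <= 255:
        raise ValueError("latency timer is an 8-bit register")
    return format(value, "02x")


if __name__ == "__main__":
    print(check_s2io_params())               # documented defaults
    print(check_s2io_params(tx_fifo_num=4))  # 4 queues at the default length
    print(latency_timer_hex(248))            # the value 248 from the text
```

Under the uniform-length assumption, raising `tx_fifo_num` above 2 while keeping the default `tx_fifo_len` of 4096 would exceed the documented 8192 total, so the queue length would need to shrink accordingly.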
diff --git a/Documentation/networking/device_drivers/neterion/vxge.txt b/Documentation/networking/device_drivers/neterion/vxge.txt
deleted file mode 100644
index abfec245f97c..000000000000
--- a/Documentation/networking/device_drivers/neterion/vxge.txt
+++ /dev/null
@@ -1,93 +0,0 @@
-Neterion's (Formerly S2io) X3100 Series 10GbE PCIe Server Adapter Linux driver
-==============================================================================
-
-Contents
---------
-
-1) Introduction
-2) Features supported
-3) Configurable driver parameters
-4) Troubleshooting
-
-1) Introduction:
-----------------
-This Linux driver supports all Neterion's X3100 series 10 GbE PCIe I/O
-Virtualized Server adapters.
-The X3100 series supports four modes of operation, configurable via
-firmware -
- Single function mode
- Multi function mode
- SRIOV mode
- MRIOV mode
-The functions share a 10GbE link and the pci-e bus, but hardly anything else
-inside the ASIC. Features like independent hw reset, statistics, bandwidth/
-priority allocation and guarantees, GRO, TSO, interrupt moderation etc are
-supported independently on each function.
-
-(See below for a complete list of features supported for both IPv4 and IPv6)
-
-2) Features supported:
-----------------------
-
-i) Single function mode (up to 17 queues)
-
-ii) Multi function mode (up to 17 functions)
-
-iii) PCI-SIG's I/O Virtualization
- - Single Root mode: v1.0 (up to 17 functions)
- - Multi-Root mode: v1.0 (up to 17 functions)
-
-iv) Jumbo frames
- X3100 Series supports MTU up to 9600 bytes, modifiable using
- ip command.
-
-v) Offloads supported: (Enabled by default)
- Checksum offload (TCP/UDP/IP) on transmit and receive paths
- TCP Segmentation Offload (TSO) on transmit path
- Generic Receive Offload (GRO) on receive path
-
-vi) MSI-X: (Enabled by default)
- Resulting in noticeable performance improvement (up to 7% on certain
- platforms).
-
-vii) NAPI: (Enabled by default)
- For better Rx interrupt moderation.
-
-viii)RTH (Receive Traffic Hash): (Enabled by default)
- Receive side steering for better scaling.
-
-ix) Statistics
- Comprehensive MAC-level and software statistics displayed using
- "ethtool -S" option.
-
-x) Multiple hardware queues: (Enabled by default)
- Up to 17 hardware based transmit and receive data channels, with
- multiple steering options (transmit multiqueue enabled by default).
-
-3) Configurable driver parameters:
-----------------------------------
-
-i) max_config_dev
- Specifies maximum device functions to be enabled.
- Valid range: 1-8
-
-ii) max_config_port
- Specifies number of ports to be enabled.
- Valid range: 1,2
- Default: 1
-
-iii)max_config_vpath
- Specifies maximum VPATH(s) configured for each device function.
- Valid range: 1-17
-
-iv) vlan_tag_strip
- Enables/disables vlan tag stripping from all received tagged frames that
- are not replicated at the internal L2 switch.
- Valid range: 0,1 (disabled, enabled respectively)
- Default: 1
-
-v) addr_learn_en
- Enable learning the mac address of the guest OS interface in
- virtualization environment.
- Valid range: 0,1 (disabled, enabled respectively)
- Default: 0
diff --git a/Documentation/networking/device_drivers/pensando/ionic.rst b/Documentation/networking/device_drivers/pensando/ionic.rst
deleted file mode 100644
index c17d680cf334..000000000000
--- a/Documentation/networking/device_drivers/pensando/ionic.rst
+++ /dev/null
@@ -1,45 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0+
-
-========================================================
-Linux Driver for the Pensando(R) Ethernet adapter family
-========================================================
-
-Pensando Linux Ethernet driver.
-Copyright(c) 2019 Pensando Systems, Inc
-
-Contents
-========
-
-- Identifying the Adapter
-- Support
-
-Identifying the Adapter
-=======================
-
-To find if one or more Pensando PCI Ethernet devices are installed on the
-host, check for the PCI devices::
-
- $ lspci -d 1dd8:
- b5:00.0 Ethernet controller: Device 1dd8:1002
- b6:00.0 Ethernet controller: Device 1dd8:1002
-
-If such devices are listed as above, then the ionic.ko driver should find
-and configure them for use. There should be log entries in the kernel
-messages such as these::
-
- $ dmesg | grep ionic
- ionic Pensando Ethernet NIC Driver, ver 0.15.0-k
- ionic 0000:b5:00.0 enp181s0: renamed from eth0
- ionic 0000:b6:00.0 enp182s0: renamed from eth0
-
-Support
-=======
-For general Linux networking support, please use the netdev mailing
-list, which is monitored by Pensando personnel::
-
- netdev@vger.kernel.org
-
-For more specific support needs, please use the Pensando driver support
-email::
-
- drivers@pensando.io
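
The adapter check in the removed ionic.rst relies on the PCI vendor ID `1dd8` appearing in the `lspci` output. As a rough sketch, assuming output lines shaped like the two sample lines in that file (slot, class, then a trailing `vendor:device` pair), the matching devices could be collected as follows. The function name `find_pensando_devices` is hypothetical, and real `lspci` output may name the device instead of printing the numeric pair, in which case this parser finds nothing.

```python
import re

# Slot (optionally with a PCI domain prefix), free text, then "vvvv:dddd".
_LINE_RE = re.compile(
    r"^(?P<slot>(?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+.*"
    r"\b(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\s*$"
)

PENSANDO_VENDOR_ID = "1dd8"  # vendor ID used by "lspci -d 1dd8:" in the text


def find_pensando_devices(lspci_output):
    """Return (slot, device_id) pairs for lines ending in 1dd8:<device>."""
    devices = []
    for line in lspci_output.splitlines():
        match = _LINE_RE.match(line.strip())
        if match and match.group("vendor") == PENSANDO_VENDOR_ID:
            devices.append((match.group("slot"), match.group("device")))
    return devices


if __name__ == "__main__":
    sample = (
        "b5:00.0 Ethernet controller: Device 1dd8:1002\n"
        "b6:00.0 Ethernet controller: Device 1dd8:1002\n"
    )
    for slot, device in find_pensando_devices(sample):
        print(slot, device)
```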
diff --git a/Documentation/networking/device_drivers/qlogic/LICENSE.qla3xxx b/Documentation/networking/device_drivers/qlogic/LICENSE.qla3xxx
deleted file mode 100644
index 2f2077e34d81..000000000000
--- a/Documentation/networking/device_drivers/qlogic/LICENSE.qla3xxx
+++ /dev/null
@@ -1,46 +0,0 @@
-Copyright (c) 2003-2006 QLogic Corporation
-QLogic Linux Networking HBA Driver
-
-This program includes a device driver for Linux 2.6 that may be
-distributed with QLogic hardware specific firmware binary file.
-You may modify and redistribute the device driver code under the
-GNU General Public License as published by the Free Software
-Foundation (version 2 or a later version).
-
-You may redistribute the hardware specific firmware binary file
-under the following terms:
-
- 1. Redistribution of source code (only if applicable),
- must retain the above copyright notice, this list of
- conditions and the following disclaimer.
-
- 2. Redistribution in binary form must reproduce the above
- copyright notice, this list of conditions and the
- following disclaimer in the documentation and/or other
- materials provided with the distribution.
-
- 3. The name of QLogic Corporation may not be used to
- endorse or promote products derived from this software
- without specific prior written permission
-
-REGARDLESS OF WHAT LICENSING MECHANISM IS USED OR APPLICABLE,
-THIS PROGRAM IS PROVIDED BY QLOGIC CORPORATION "AS IS'' AND ANY
-EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
-IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR
-BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
-TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
-DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
-ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
-OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
-OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
-POSSIBILITY OF SUCH DAMAGE.
-
-USER ACKNOWLEDGES AND AGREES THAT USE OF THIS PROGRAM WILL NOT
-CREATE OR GIVE GROUNDS FOR A LICENSE BY IMPLICATION, ESTOPPEL, OR
-OTHERWISE IN ANY INTELLECTUAL PROPERTY RIGHTS (PATENT, COPYRIGHT,
-TRADE SECRET, MASK WORK, OR OTHER PROPRIETARY RIGHT) EMBODIED IN
-ANY OTHER QLOGIC HARDWARE OR SOFTWARE EITHER SOLELY OR IN
-COMBINATION WITH THIS PROGRAM.
-
diff --git a/Documentation/networking/device_drivers/qlogic/LICENSE.qlcnic b/Documentation/networking/device_drivers/qlogic/LICENSE.qlcnic
deleted file mode 100644
index 2ae3b64983ab..000000000000
--- a/Documentation/networking/device_drivers/qlogic/LICENSE.qlcnic
+++ /dev/null
@@ -1,288 +0,0 @@
-Copyright (c) 2009-2013 QLogic Corporation
-QLogic Linux qlcnic NIC Driver
-
-You may modify and redistribute the device driver code under the
-GNU General Public License (a copy of which is attached hereto as
-Exhibit A) published by the Free Software Foundation (version 2).
-
-
-EXHIBIT A
-
- GNU GENERAL PUBLIC LICENSE
- Version 2, June 1991
-
- Copyright (C) 1989, 1991 Free Software Foundation, Inc.
- 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
- Everyone is permitted to copy and distribute verbatim copies
- of this license document, but changing it is not allowed.
-
- Preamble
-
- The licenses for most software are designed to take away your
-freedom to share and change it. By contrast, the GNU General Public
-License is intended to guarantee your freedom to share and change free
-software--to make sure the software is free for all its users. This
-General Public License applies to most of the Free Software
-Foundation's software and to any other program whose authors commit to
-using it. (Some other Free Software Foundation software is covered by
-the GNU Lesser General Public License instead.) You can apply it to
-your programs, too.
-
- When we speak of free software, we are referring to freedom, not
-price. Our General Public Licenses are designed to make sure that you
-have the freedom to distribute copies of free software (and charge for
-this service if you wish), that you receive source code or can get it
-if you want it, that you can change the software or use pieces of it
-in new free programs; and that you know you can do these things.
-
- To protect your rights, we need to make restrictions that forbid
-anyone to deny you these rights or to ask you to surrender the rights.
-These restrictions translate to certain responsibilities for you if you
-distribute copies of the software, or if you modify it.
-
- For example, if you distribute copies of such a program, whether
-gratis or for a fee, you must give the recipients all the rights that
-you have. You must make sure that they, too, receive or can get the
-source code. And you must show them these terms so they know their
-rights.
-
- We protect your rights with two steps: (1) copyright the software, and
-(2) offer you this license which gives you legal permission to copy,
-distribute and/or modify the software.
-
- Also, for each author's protection and ours, we want to make certain
-that everyone understands that there is no warranty for this free
-software. If the software is modified by someone else and passed on, we
-want its recipients to know that what they have is not the original, so
-that any problems introduced by others will not reflect on the original
-authors' reputations.
-
- Finally, any free program is threatened constantly by software
-patents. We wish to avoid the danger that redistributors of a free
-program will individually obtain patent licenses, in effect making the
-program proprietary. To prevent this, we have made it clear that any
-patent must be licensed for everyone's free use or not licensed at all.
-
- The precise terms and conditions for copying, distribution and
-modification follow.
-
- GNU GENERAL PUBLIC LICENSE
- TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
-
- 0. This License applies to any program or other work which contains
-a notice placed by the copyright holder saying it may be distributed
-under the terms of this General Public License. The "Program", below,
-refers to any such program or work, and a "work based on the Program"
-means either the Program or any derivative work under copyright law:
-that is to say, a work containing the Program or a portion of it,
-either verbatim or with modifications and/or translated into another
-language. (Hereinafter, translation is included without limitation in
-the term "modification".) Each licensee is addressed as "you".
-
-Activities other than copying, distribution and modification are not
-covered by this License; they are outside its scope. The act of
-running the Program is not restricted, and the output from the Program
-is covered only if its contents constitute a work based on the
-Program (independent of having been made by running the Program).
-Whether that is true depends on what the Program does.
-
- 1. You may copy and distribute verbatim copies of the Program's
-source code as you receive it, in any medium, provided that you
-conspicuously and appropriately publish on each copy an appropriate
-copyright notice and disclaimer of warranty; keep intact all the
-notices that refer to this License and to the absence of any warranty;
-and give any other recipients of the Program a copy of this License
-along with the Program.
-
-You may charge a fee for the physical act of transferring a copy, and
-you may at your option offer warranty protection in exchange for a fee.
-
- 2. You may modify your copy or copies of the Program or any portion
-of it, thus forming a work based on the Program, and copy and
-distribute such modifications or work under the terms of Section 1
-above, provided that you also meet all of these conditions:
-
- a) You must cause the modified files to carry prominent notices
- stating that you changed the files and the date of any change.
-
- b) You must cause any work that you distribute or publish, that in
- whole or in part contains or is derived from the Program or any
- part thereof, to be licensed as a whole at no charge to all third
- parties under the terms of this License.
-
- c) If the modified program normally reads commands interactively
- when run, you must cause it, when started running for such
- interactive use in the most ordinary way, to print or display an
- announcement including an appropriate copyright notice and a
- notice that there is no warranty (or else, saying that you provide
- a warranty) and that users may redistribute the program under
- these conditions, and telling the user how to view a copy of this
- License. (Exception: if the Program itself is interactive but
- does not normally print such an announcement, your work based on
- the Program is not required to print an announcement.)
-
-These requirements apply to the modified work as a whole. If
-identifiable sections of that work are not derived from the Program,
-and can be reasonably considered independent and separate works in
-themselves, then this License, and its terms, do not apply to those
-sections when you distribute them as separate works. But when you
-distribute the same sections as part of a whole which is a work based
-on the Program, the distribution of the whole must be on the terms of
-this License, whose permissions for other licensees extend to the
-entire whole, and thus to each and every part regardless of who wrote it.
-
-Thus, it is not the intent of this section to claim rights or contest
-your rights to work written entirely by you; rather, the intent is to
-exercise the right to control the distribution of derivative or
-collective works based on the Program.
-
-In addition, mere aggregation of another work not based on the Program
-with the Program (or with a work based on the Program) on a volume of
-a storage or distribution medium does not bring the other work under
-the scope of this License.
-
- 3. You may copy and distribute the Program (or a work based on it,
-under Section 2) in object code or executable form under the terms of
-Sections 1 and 2 above provided that you also do one of the following:
-
- a) Accompany it with the complete corresponding machine-readable
- source code, which must be distributed under the terms of Sections
- 1 and 2 above on a medium customarily used for software interchange; or,
-
- b) Accompany it with a written offer, valid for at least three
- years, to give any third party, for a charge no more than your
- cost of physically performing source distribution, a complete
- machine-readable copy of the corresponding source code, to be
- distributed under the terms of Sections 1 and 2 above on a medium
- customarily used for software interchange; or,
-
- c) Accompany it with the information you received as to the offer
- to distribute corresponding source code. (This alternative is
- allowed only for noncommercial distribution and only if you
- received the program in object code or executable form with such
- an offer, in accord with Subsection b above.)
-
-The source code for a work means the preferred form of the work for
-making modifications to it. For an executable work, complete source
-code means all the source code for all modules it contains, plus any
-associated interface definition files, plus the scripts used to
-control compilation and installation of the executable. However, as a
-special exception, the source code distributed need not include
-anything that is normally distributed (in either source or binary
-form) with the major components (compiler, kernel, and so on) of the
-operating system on which the executable runs, unless that component
-itself accompanies the executable.
-
-If distribution of executable or object code is made by offering
-access to copy from a designated place, then offering equivalent
-access to copy the source code from the same place counts as
-distribution of the source code, even though third parties are not
-compelled to copy the source along with the object code.
-
- 4. You may not copy, modify, sublicense, or distribute the Program
-except as expressly provided under this License. Any attempt
-otherwise to copy, modify, sublicense or distribute the Program is
-void, and will automatically terminate your rights under this License.
-However, parties who have received copies, or rights, from you under
-this License will not have their licenses terminated so long as such
-parties remain in full compliance.
-
- 5. You are not required to accept this License, since you have not
-signed it. However, nothing else grants you permission to modify or
-distribute the Program or its derivative works. These actions are
-prohibited by law if you do not accept this License. Therefore, by
-modifying or distributing the Program (or any work based on the
-Program), you indicate your acceptance of this License to do so, and
-all its terms and conditions for copying, distributing or modifying
-the Program or works based on it.
-
- 6. Each time you redistribute the Program (or any work based on the
-Program), the recipient automatically receives a license from the
-original licensor to copy, distribute or modify the Program subject to
-these terms and conditions. You may not impose any further
-restrictions on the recipients' exercise of the rights granted herein.
-You are not responsible for enforcing compliance by third parties to
-this License.
-
- 7. If, as a consequence of a court judgment or allegation of patent
-infringement or for any other reason (not limited to patent issues),
-conditions are imposed on you (whether by court order, agreement or
-otherwise) that contradict the conditions of this License, they do not
-excuse you from the conditions of this License. If you cannot
-distribute so as to satisfy simultaneously your obligations under this
-License and any other pertinent obligations, then as a consequence you
-may not distribute the Program at all. For example, if a patent
-license would not permit royalty-free redistribution of the Program by
-all those who receive copies directly or indirectly through you, then
-the only way you could satisfy both it and this License would be to
-refrain entirely from distribution of the Program.
-
-If any portion of this section is held invalid or unenforceable under
-any particular circumstance, the balance of the section is intended to
-apply and the section as a whole is intended to apply in other
-circumstances.
-
-It is not the purpose of this section to induce you to infringe any
-patents or other property right claims or to contest validity of any
-such claims; this section has the sole purpose of protecting the
-integrity of the free software distribution system, which is
-implemented by public license practices. Many people have made
-generous contributions to the wide range of software distributed
-through that system in reliance on consistent application of that
-system; it is up to the author/donor to decide if he or she is willing
-to distribute software through any other system and a licensee cannot
-impose that choice.
-
-This section is intended to make thoroughly clear what is believed to
-be a consequence of the rest of this License.
-
- 8. If the distribution and/or use of the Program is restricted in
-certain countries either by patents or by copyrighted interfaces, the
-original copyright holder who places the Program under this License
-may add an explicit geographical distribution limitation excluding
-those countries, so that distribution is permitted only in or among
-countries not thus excluded. In such case, this License incorporates
-the limitation as if written in the body of this License.
-
- 9. The Free Software Foundation may publish revised and/or new versions
-of the General Public License from time to time. Such new versions will
-be similar in spirit to the present version, but may differ in detail to
-address new problems or concerns.
-
-Each version is given a distinguishing version number. If the Program
-specifies a version number of this License which applies to it and "any
-later version", you have the option of following the terms and conditions
-either of that version or of any later version published by the Free
-Software Foundation. If the Program does not specify a version number of
-this License, you may choose any version ever published by the Free Software
-Foundation.
-
- 10. If you wish to incorporate parts of the Program into other free
-programs whose distribution conditions are different, write to the author
-to ask for permission. For software which is copyrighted by the Free
-Software Foundation, write to the Free Software Foundation; we sometimes
-make exceptions for this. Our decision will be guided by the two goals
-of preserving the free status of all derivatives of our free software and
-of promoting the sharing and reuse of software generally.
-
- NO WARRANTY
-
- 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
-FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
-OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
-PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
-OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
-MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
-TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
-PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
-REPAIR OR CORRECTION.
-
- 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
-WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
-REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
-INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
-OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
-TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
-YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
-PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
-POSSIBILITY OF SUCH DAMAGES.
diff --git a/Documentation/networking/device_drivers/qlogic/LICENSE.qlge b/Documentation/networking/device_drivers/qlogic/LICENSE.qlge
deleted file mode 100644
index ce64e4d15b21..000000000000
--- a/Documentation/networking/device_drivers/qlogic/LICENSE.qlge
+++ /dev/null
@@ -1,288 +0,0 @@
-Copyright (c) 2003-2011 QLogic Corporation
-QLogic Linux qlge NIC Driver
-
-You may modify and redistribute the device driver code under the
-GNU General Public License (a copy of which is attached hereto as
-Exhibit A) published by the Free Software Foundation (version 2).
-
-
-EXHIBIT A
-
- GNU GENERAL PUBLIC LICENSE
- Version 2, June 1991
-
- Copyright (C) 1989, 1991 Free Software Foundation, Inc.
- 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
- Everyone is permitted to copy and distribute verbatim copies
- of this license document, but changing it is not allowed.
-
- Preamble
-
- The licenses for most software are designed to take away your
-freedom to share and change it. By contrast, the GNU General Public
-License is intended to guarantee your freedom to share and change free
-software--to make sure the software is free for all its users. This
-General Public License applies to most of the Free Software
-Foundation's software and to any other program whose authors commit to
-using it. (Some other Free Software Foundation software is covered by
-the GNU Lesser General Public License instead.) You can apply it to
-your programs, too.
-
- When we speak of free software, we are referring to freedom, not
-price. Our General Public Licenses are designed to make sure that you
-have the freedom to distribute copies of free software (and charge for
-this service if you wish), that you receive source code or can get it
-if you want it, that you can change the software or use pieces of it
-in new free programs; and that you know you can do these things.
-
- To protect your rights, we need to make restrictions that forbid
-anyone to deny you these rights or to ask you to surrender the rights.
-These restrictions translate to certain responsibilities for you if you
-distribute copies of the software, or if you modify it.
-
- For example, if you distribute copies of such a program, whether
-gratis or for a fee, you must give the recipients all the rights that
-you have. You must make sure that they, too, receive or can get the
-source code. And you must show them these terms so they know their
-rights.
-
- We protect your rights with two steps: (1) copyright the software, and
-(2) offer you this license which gives you legal permission to copy,
-distribute and/or modify the software.
-
- Also, for each author's protection and ours, we want to make certain
-that everyone understands that there is no warranty for this free
-software. If the software is modified by someone else and passed on, we
-want its recipients to know that what they have is not the original, so
-that any problems introduced by others will not reflect on the original
-authors' reputations.
-
- Finally, any free program is threatened constantly by software
-patents. We wish to avoid the danger that redistributors of a free
-program will individually obtain patent licenses, in effect making the
-program proprietary. To prevent this, we have made it clear that any
-patent must be licensed for everyone's free use or not licensed at all.
-
- The precise terms and conditions for copying, distribution and
-modification follow.
-
- GNU GENERAL PUBLIC LICENSE
- TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
-
- 0. This License applies to any program or other work which contains
-a notice placed by the copyright holder saying it may be distributed
-under the terms of this General Public License. The "Program", below,
-refers to any such program or work, and a "work based on the Program"
-means either the Program or any derivative work under copyright law:
-that is to say, a work containing the Program or a portion of it,
-either verbatim or with modifications and/or translated into another
-language. (Hereinafter, translation is included without limitation in
-the term "modification".) Each licensee is addressed as "you".
-
-Activities other than copying, distribution and modification are not
-covered by this License; they are outside its scope. The act of
-running the Program is not restricted, and the output from the Program
-is covered only if its contents constitute a work based on the
-Program (independent of having been made by running the Program).
-Whether that is true depends on what the Program does.
-
- 1. You may copy and distribute verbatim copies of the Program's
-source code as you receive it, in any medium, provided that you
-conspicuously and appropriately publish on each copy an appropriate
-copyright notice and disclaimer of warranty; keep intact all the
-notices that refer to this License and to the absence of any warranty;
-and give any other recipients of the Program a copy of this License
-along with the Program.
-
-You may charge a fee for the physical act of transferring a copy, and
-you may at your option offer warranty protection in exchange for a fee.
-
- 2. You may modify your copy or copies of the Program or any portion
-of it, thus forming a work based on the Program, and copy and
-distribute such modifications or work under the terms of Section 1
-above, provided that you also meet all of these conditions:
-
- a) You must cause the modified files to carry prominent notices
- stating that you changed the files and the date of any change.
-
- b) You must cause any work that you distribute or publish, that in
- whole or in part contains or is derived from the Program or any
- part thereof, to be licensed as a whole at no charge to all third
- parties under the terms of this License.
-
- c) If the modified program normally reads commands interactively
- when run, you must cause it, when started running for such
- interactive use in the most ordinary way, to print or display an
- announcement including an appropriate copyright notice and a
- notice that there is no warranty (or else, saying that you provide
- a warranty) and that users may redistribute the program under
- these conditions, and telling the user how to view a copy of this
- License. (Exception: if the Program itself is interactive but
- does not normally print such an announcement, your work based on
- the Program is not required to print an announcement.)
-
-These requirements apply to the modified work as a whole. If
-identifiable sections of that work are not derived from the Program,
-and can be reasonably considered independent and separate works in
-themselves, then this License, and its terms, do not apply to those
-sections when you distribute them as separate works. But when you
-distribute the same sections as part of a whole which is a work based
-on the Program, the distribution of the whole must be on the terms of
-this License, whose permissions for other licensees extend to the
-entire whole, and thus to each and every part regardless of who wrote it.
-
-Thus, it is not the intent of this section to claim rights or contest
-your rights to work written entirely by you; rather, the intent is to
-exercise the right to control the distribution of derivative or
-collective works based on the Program.
-
-In addition, mere aggregation of another work not based on the Program
-with the Program (or with a work based on the Program) on a volume of
-a storage or distribution medium does not bring the other work under
-the scope of this License.
-
- 3. You may copy and distribute the Program (or a work based on it,
-under Section 2) in object code or executable form under the terms of
-Sections 1 and 2 above provided that you also do one of the following:
-
- a) Accompany it with the complete corresponding machine-readable
- source code, which must be distributed under the terms of Sections
- 1 and 2 above on a medium customarily used for software interchange; or,
-
- b) Accompany it with a written offer, valid for at least three
- years, to give any third party, for a charge no more than your
- cost of physically performing source distribution, a complete
- machine-readable copy of the corresponding source code, to be
- distributed under the terms of Sections 1 and 2 above on a medium
- customarily used for software interchange; or,
-
- c) Accompany it with the information you received as to the offer
- to distribute corresponding source code. (This alternative is
- allowed only for noncommercial distribution and only if you
- received the program in object code or executable form with such
- an offer, in accord with Subsection b above.)
-
-The source code for a work means the preferred form of the work for
-making modifications to it. For an executable work, complete source
-code means all the source code for all modules it contains, plus any
-associated interface definition files, plus the scripts used to
-control compilation and installation of the executable. However, as a
-special exception, the source code distributed need not include
-anything that is normally distributed (in either source or binary
-form) with the major components (compiler, kernel, and so on) of the
-operating system on which the executable runs, unless that component
-itself accompanies the executable.
-
-If distribution of executable or object code is made by offering
-access to copy from a designated place, then offering equivalent
-access to copy the source code from the same place counts as
-distribution of the source code, even though third parties are not
-compelled to copy the source along with the object code.
-
- 4. You may not copy, modify, sublicense, or distribute the Program
-except as expressly provided under this License. Any attempt
-otherwise to copy, modify, sublicense or distribute the Program is
-void, and will automatically terminate your rights under this License.
-However, parties who have received copies, or rights, from you under
-this License will not have their licenses terminated so long as such
-parties remain in full compliance.
-
- 5. You are not required to accept this License, since you have not
-signed it. However, nothing else grants you permission to modify or
-distribute the Program or its derivative works. These actions are
-prohibited by law if you do not accept this License. Therefore, by
-modifying or distributing the Program (or any work based on the
-Program), you indicate your acceptance of this License to do so, and
-all its terms and conditions for copying, distributing or modifying
-the Program or works based on it.
-
- 6. Each time you redistribute the Program (or any work based on the
-Program), the recipient automatically receives a license from the
-original licensor to copy, distribute or modify the Program subject to
-these terms and conditions. You may not impose any further
-restrictions on the recipients' exercise of the rights granted herein.
-You are not responsible for enforcing compliance by third parties to
-this License.
-
- 7. If, as a consequence of a court judgment or allegation of patent
-infringement or for any other reason (not limited to patent issues),
-conditions are imposed on you (whether by court order, agreement or
-otherwise) that contradict the conditions of this License, they do not
-excuse you from the conditions of this License. If you cannot
-distribute so as to satisfy simultaneously your obligations under this
-License and any other pertinent obligations, then as a consequence you
-may not distribute the Program at all. For example, if a patent
-license would not permit royalty-free redistribution of the Program by
-all those who receive copies directly or indirectly through you, then
-the only way you could satisfy both it and this License would be to
-refrain entirely from distribution of the Program.
-
-If any portion of this section is held invalid or unenforceable under
-any particular circumstance, the balance of the section is intended to
-apply and the section as a whole is intended to apply in other
-circumstances.
-
-It is not the purpose of this section to induce you to infringe any
-patents or other property right claims or to contest validity of any
-such claims; this section has the sole purpose of protecting the
-integrity of the free software distribution system, which is
-implemented by public license practices. Many people have made
-generous contributions to the wide range of software distributed
-through that system in reliance on consistent application of that
-system; it is up to the author/donor to decide if he or she is willing
-to distribute software through any other system and a licensee cannot
-impose that choice.
-
-This section is intended to make thoroughly clear what is believed to
-be a consequence of the rest of this License.
-
- 8. If the distribution and/or use of the Program is restricted in
-certain countries either by patents or by copyrighted interfaces, the
-original copyright holder who places the Program under this License
-may add an explicit geographical distribution limitation excluding
-those countries, so that distribution is permitted only in or among
-countries not thus excluded. In such case, this License incorporates
-the limitation as if written in the body of this License.
-
- 9. The Free Software Foundation may publish revised and/or new versions
-of the General Public License from time to time. Such new versions will
-be similar in spirit to the present version, but may differ in detail to
-address new problems or concerns.
-
-Each version is given a distinguishing version number. If the Program
-specifies a version number of this License which applies to it and "any
-later version", you have the option of following the terms and conditions
-either of that version or of any later version published by the Free
-Software Foundation. If the Program does not specify a version number of
-this License, you may choose any version ever published by the Free Software
-Foundation.
-
- 10. If you wish to incorporate parts of the Program into other free
-programs whose distribution conditions are different, write to the author
-to ask for permission. For software which is copyrighted by the Free
-Software Foundation, write to the Free Software Foundation; we sometimes
-make exceptions for this. Our decision will be guided by the two goals
-of preserving the free status of all derivatives of our free software and
-of promoting the sharing and reuse of software generally.
-
- NO WARRANTY
-
- 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
-FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
-OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
-PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
-OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
-MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
-TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
-PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
-REPAIR OR CORRECTION.
-
- 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
-WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
-REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
-INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
-OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
-TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
-YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
-PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
-POSSIBILITY OF SUCH DAMAGES.
diff --git a/Documentation/networking/device_drivers/qlogic/index.rst b/Documentation/networking/device_drivers/qlogic/index.rst
new file mode 100644
index 000000000000..ad05b04286e4
--- /dev/null
+++ b/Documentation/networking/device_drivers/qlogic/index.rst
@@ -0,0 +1,18 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+QLogic QLGE Device Drivers
+===============================================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ qlge
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/qlogic/qlge.rst b/Documentation/networking/device_drivers/qlogic/qlge.rst
new file mode 100644
index 000000000000..0b888253d152
--- /dev/null
+++ b/Documentation/networking/device_drivers/qlogic/qlge.rst
@@ -0,0 +1,118 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================================
+QLogic QLGE 10Gb Ethernet device driver
+=======================================
+
+This driver uses drgn and devlink for debugging.
+
+Dump kernel data structures in drgn
+-----------------------------------
+
+To dump kernel data structures, the following Python script can be used
+in drgn:
+
+.. code-block:: python
+
+    def align(x, a):
+        """the alignment a should be a power of 2
+        """
+        mask = a - 1
+        return (x + mask) & ~mask
+
+    def struct_size(struct_type):
+        struct_str = "struct {}".format(struct_type)
+        return sizeof(Object(prog, struct_str, address=0x0))
+
+    def netdev_priv(netdevice):
+        NETDEV_ALIGN = 32
+        return netdevice.value_() + align(struct_size("net_device"), NETDEV_ALIGN)
+
+    name = 'xxx'
+    qlge_device = None
+    netdevices = prog['init_net'].dev_base_head.address_of_()
+    for netdevice in list_for_each_entry("struct net_device", netdevices, "dev_list"):
+        if netdevice.name.string_().decode('ascii') == name:
+            qlge_device = netdevice  # remember the matching device
+
+    ql_adapter = Object(prog, "struct ql_adapter", address=netdev_priv(qlge_device))
+
+The struct ql_adapter will be printed in drgn as follows:
+
+ >>> ql_adapter
+ (struct ql_adapter){
+ .ricb = (struct ricb){
+ .base_cq = (u8)0,
+ .flags = (u8)120,
+ .mask = (__le16)26637,
+ .hash_cq_id = (u8 [1024]){ 172, 142, 255, 255 },
+ .ipv6_hash_key = (__le32 [10]){},
+ .ipv4_hash_key = (__le32 [4]){},
+ },
+ .flags = (unsigned long)0,
+ .wol = (u32)0,
+ .nic_stats = (struct nic_stats){
+ .tx_pkts = (u64)0,
+ .tx_bytes = (u64)0,
+ .tx_mcast_pkts = (u64)0,
+ .tx_bcast_pkts = (u64)0,
+ .tx_ucast_pkts = (u64)0,
+ .tx_ctl_pkts = (u64)0,
+ .tx_pause_pkts = (u64)0,
+ ...
+ },
+ .active_vlans = (unsigned long [64]){
+ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 52780853100545, 18446744073709551615,
+ 18446619461681283072, 0, 42949673024, 2147483647,
+ },
+ .rx_ring = (struct rx_ring [17]){
+ {
+ .cqicb = (struct cqicb){
+ .msix_vect = (u8)0,
+ .reserved1 = (u8)0,
+ .reserved2 = (u8)0,
+ .flags = (u8)0,
+ .len = (__le16)0,
+ .rid = (__le16)0,
+ ...
+ },
+ .cq_base = (void *)0x0,
+ .cq_base_dma = (dma_addr_t)0,
+ }
+ ...
+ }
+ }
+
+coredump via devlink
+--------------------
+
+
+The coredump obtained via devlink in JSON format looks like this:
+
+.. code:: shell
+
+ $ devlink health dump show DEVICE reporter coredump -p -j
+ {
+ "Core Registers": {
+ "segment": 1,
+ "values": [ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ]
+ },
+ "Test Logic Regs": {
+ "segment": 2,
+ "values": [ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ]
+ },
+ "RMII Registers": {
+ "segment": 3,
+ "values": [ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ]
+ },
+ ...
+ "Sem Registers": {
+ "segment": 50,
+ "values": [ 0,0,0,0 ]
+ }
+ }
+
+When the module parameter qlge_force_coredump is set to true, the MPI
+RISC is reset before coredumping. Coredumping then takes much longer, since
+the devlink tool has to wait 5 seconds for the reset to
+finish.
diff --git a/Documentation/networking/device_drivers/qualcomm/rmnet.txt b/Documentation/networking/device_drivers/qualcomm/rmnet.txt
deleted file mode 100644
index 6b341eaf2062..000000000000
--- a/Documentation/networking/device_drivers/qualcomm/rmnet.txt
+++ /dev/null
@@ -1,82 +0,0 @@
-1. Introduction
-
-rmnet driver is used for supporting the Multiplexing and aggregation
-Protocol (MAP). This protocol is used by all recent chipsets using Qualcomm
-Technologies, Inc. modems.
-
-This driver can be used to register onto any physical network device in
-IP mode. Physical transports include USB, HSIC, PCIe and IP accelerator.
-
-Multiplexing allows for creation of logical netdevices (rmnet devices) to
-handle multiple private data networks (PDN) like a default internet, tethering,
-multimedia messaging service (MMS) or IP media subsystem (IMS). Hardware sends
-packets with MAP headers to rmnet. Based on the multiplexer id, rmnet
-routes to the appropriate PDN after removing the MAP header.
-
-Aggregation is required to achieve high data rates. This involves hardware
-sending aggregated bunch of MAP frames. rmnet driver will de-aggregate
-these MAP frames and send them to appropriate PDN's.
-
-2. Packet format
-
-a. MAP packet (data / control)
-
-MAP header has the same endianness of the IP packet.
-
-Packet format -
-
-Bit 0 1 2-7 8 - 15 16 - 31
-Function Command / Data Reserved Pad Multiplexer ID Payload length
-Bit 32 - x
-Function Raw Bytes
-
-Command (1)/ Data (0) bit value is to indicate if the packet is a MAP command
-or data packet. Control packet is used for transport level flow control. Data
-packets are standard IP packets.
-
-Reserved bits are usually zeroed out and to be ignored by receiver.
-
-Padding is number of bytes to be added for 4 byte alignment if required by
-hardware.
-
-Multiplexer ID is to indicate the PDN on which data has to be sent.
-
-Payload length includes the padding length but does not include MAP header
-length.
-
-b. MAP packet (command specific)
-
-Bit 0 1 2-7 8 - 15 16 - 31
-Function Command Reserved Pad Multiplexer ID Payload length
-Bit 32 - 39 40 - 45 46 - 47 48 - 63
-Function Command name Reserved Command Type Reserved
-Bit 64 - 95
-Function Transaction ID
-Bit 96 - 127
-Function Command data
-
-Command 1 indicates disabling flow while 2 is enabling flow
-
-Command types -
-0 for MAP command request
-1 is to acknowledge the receipt of a command
-2 is for unsupported commands
-3 is for error during processing of commands
-
-c. Aggregation
-
-Aggregation is multiple MAP packets (can be data or command) delivered to
-rmnet in a single linear skb. rmnet will process the individual
-packets and either ACK the MAP command or deliver the IP packet to the
-network stack as needed
-
-MAP header|IP Packet|Optional padding|MAP header|IP Packet|Optional padding....
-MAP header|IP Packet|Optional padding|MAP header|Command Packet|Optional pad...
-
-3. Userspace configuration
-
-rmnet userspace configuration is done through netlink library librmnetctl
-and command line utility rmnetcli. Utility is hosted in codeaurora forum git.
-The driver uses rtnl_link_ops for communication.
-
-https://source.codeaurora.org/quic/la/platform/vendor/qcom-opensource/dataservices/tree/rmnetctl
diff --git a/Documentation/networking/device_drivers/sb1000.txt b/Documentation/networking/device_drivers/sb1000.txt
deleted file mode 100644
index f92c2aac56a9..000000000000
--- a/Documentation/networking/device_drivers/sb1000.txt
+++ /dev/null
@@ -1,207 +0,0 @@
-sb1000 is a module network device driver for the General Instrument (also known
-as NextLevel) SURFboard1000 internal cable modem board. This is an ISA card
-which is used by a number of cable TV companies to provide cable modem access.
-It's a one-way downstream-only cable modem, meaning that your upstream net link
-is provided by your regular phone modem.
-
-This driver was written by Franco Venturi <fventuri@mediaone.net>. He deserves
-a great deal of thanks for this wonderful piece of code!
-
------------------------------------------------------------------------------
-
-Support for this device is now a part of the standard Linux kernel. The
-driver source code file is drivers/net/sb1000.c. In addition to this
-you will need:
-
-1.) The "cmconfig" program. This is a utility which supplements "ifconfig"
-to configure the cable modem and network interface (usually called "cm0");
-and
-
-2.) Several PPP scripts which live in /etc/ppp to make connecting via your
-cable modem easy.
-
- These utilities can be obtained from:
-
- http://www.jacksonville.net/~fventuri/
-
- in Franco's original source code distribution .tar.gz file. Support for
- the sb1000 driver can be found at:
-
- http://web.archive.org/web/*/http://home.adelphia.net/~siglercm/sb1000.html
- http://web.archive.org/web/*/http://linuxpower.cx/~cable/
-
- along with these utilities.
-
-3.) The standard isapnp tools. These are necessary to configure your SB1000
-card at boot time (or afterwards by hand) since it's a PnP card.
-
- If you don't have these installed as a standard part of your Linux
- distribution, you can find them at:
-
- http://www.roestock.demon.co.uk/isapnptools/
-
- or check your Linux distribution binary CD or their web site. For help with
- isapnp, pnpdump, or /etc/isapnp.conf, go to:
-
- http://www.roestock.demon.co.uk/isapnptools/isapnpfaq.html
-
------------------------------------------------------------------------------
-
-To make the SB1000 card work, follow these steps:
-
-1.) Run `make config', or `make menuconfig', or `make xconfig', whichever
-you prefer, in the top kernel tree directory to set up your kernel
-configuration. Make sure to say "Y" to "Prompt for development drivers"
-and to say "M" to the sb1000 driver. Also say "Y" or "M" to all the standard
-networking questions to get TCP/IP and PPP networking support.
-
-2.) *BEFORE* you build the kernel, edit drivers/net/sb1000.c. Make sure
-to redefine the value of READ_DATA_PORT to match the I/O address used
-by isapnp to access your PnP cards. This is the value of READPORT in
-/etc/isapnp.conf or given by the output of pnpdump.
-
-3.) Build and install the kernel and modules as usual.
-
-4.) Boot your new kernel following the usual procedures.
-
-5.) Set up to configure the new SB1000 PnP card by capturing the output
-of "pnpdump" to a file and editing this file to set the correct I/O ports,
-IRQ, and DMA settings for all your PnP cards. Make sure none of the settings
-conflict with one another. Then test this configuration by running the
-"isapnp" command with your new config file as the input. Check for
-errors and fix as necessary. (As an aside, I use I/O ports 0x110 and
-0x310 and IRQ 11 for my SB1000 card and these work well for me. YMMV.)
-Then save the finished config file as /etc/isapnp.conf for proper configuration
-on subsequent reboots.
-
-6.) Download the original file sb1000-1.1.2.tar.gz from Franco's site or one of
-the others referenced above. As root, unpack it into a temporary directory and
-do a `make cmconfig' and then `install -c cmconfig /usr/local/sbin'. Don't do
-`make install' because it expects to find all the utilities built and ready for
-installation, not just cmconfig.
-
-7.) As root, copy all the files under the ppp/ subdirectory in Franco's
-tar file into /etc/ppp, being careful not to overwrite any files that are
-already in there. Then modify ppp@gi-on to set the correct login name,
-phone number, and frequency for the cable modem. Also edit pap-secrets
-to specify your login name and password and any site-specific information
-you need.
-
-8.) Be sure to modify /etc/ppp/firewall to use ipchains instead of
-the older ipfwadm commands from the 2.0.x kernels. There's a neat utility to
-convert ipfwadm commands to ipchains commands:
-
- http://users.dhp.com/~whisper/ipfwadm2ipchains/
-
-You may also wish to modify the firewall script to implement a different
-firewalling scheme.
-
-9.) Start the PPP connection via the script /etc/ppp/ppp@gi-on. You must be
-root to do this. It's better to use a utility like sudo to execute
-frequently used commands like this with root permissions if possible. If you
-connect successfully the cable modem interface will come up and you'll see a
-driver message like this at the console:
-
- cm0: sb1000 at (0x110,0x310), csn 1, S/N 0x2a0d16d8, IRQ 11.
- sb1000.c:v1.1.2 6/01/98 (fventuri@mediaone.net)
-
-The "ifconfig" command should show two new interfaces, ppp0 and cm0.
-The command "cmconfig cm0" will give you information about the cable modem
-interface.
-
-10.) Try pinging a site via `ping -c 5 www.yahoo.com', for example. You should
-see packets received.
-
-11.) If you can't get site names (like www.yahoo.com) to resolve into
-IP addresses (like 204.71.200.67), be sure your /etc/resolv.conf file
-has no syntax errors and has the right nameserver IP addresses in it.
-If this doesn't help, try something like `ping -c 5 204.71.200.67' to
-see if the networking is running but the DNS resolution is where the
-problem lies.
-
-12.) If you still have problems, go to the support web sites mentioned above
-and read the information and documentation there.
-
------------------------------------------------------------------------------
-
-Common problems:
-
-1.) Packets go out on the ppp0 interface but don't come back on the cm0
-interface. It looks like I'm connected but I can't even ping any
-numerical IP addresses. (This happens predominantly on Debian systems due
-to a default boot-time configuration script.)
-
-Solution -- As root `echo 0 > /proc/sys/net/ipv4/conf/cm0/rp_filter' so it
-can share the same IP address as the ppp0 interface. Note that this
-command should probably be added to the /etc/ppp/cablemodem script
-*right*between* the "/sbin/ifconfig" and "/sbin/cmconfig" commands.
-You may need to do this to /proc/sys/net/ipv4/conf/ppp0/rp_filter as well.
-If you do this to /proc/sys/net/ipv4/conf/default/rp_filter on each reboot
-(in rc.local or some such) then any interfaces can share the same IP
-addresses.
-
-2.) I get "unresolved symbol" error messages on executing `insmod sb1000.o'.
-
-Solution -- You probably have a non-matching kernel source tree and
-/usr/include/linux and /usr/include/asm header files. Make sure you
-install the correct versions of the header files in these two directories.
-Then rebuild and reinstall the kernel.
-
-3.) When isapnp runs it reports an error, and my SB1000 card isn't working.
-
-Solution -- There's a problem with later versions of isapnp using the "(CHECK)"
-option in the lines that allocate the two I/O addresses for the SB1000 card.
-This first popped up on RH 6.0. Delete "(CHECK)" for the SB1000 I/O addresses.
-Make sure they don't conflict with any other pieces of hardware first! Then
-rerun isapnp and go from there.
-
-4.) I can't execute the /etc/ppp/ppp@gi-on file.
-
-Solution -- As root do `chmod ug+x /etc/ppp/ppp@gi-on'.
-
-5.) The firewall script isn't working (with 2.2.x and higher kernels).
-
-Solution -- Use the ipfwadm2ipchains script referenced above to convert the
-/etc/ppp/firewall script from the deprecated ipfwadm commands to ipchains.
-
-6.) I'm getting *tons* of firewall deny messages in the /var/kern.log,
-/var/messages, and/or /var/syslog files, and they're filling up my /var
-partition!!!
-
-Solution -- First, tell your ISP that you're receiving DoS (Denial of Service)
-and/or portscanning (UDP connection attempts) attacks! Look over the deny
-messages to figure out what the attack is and where it's coming from. Next,
-edit /etc/ppp/cablemodem and make sure the ",nobroadcast" option is turned on
-to the "cmconfig" command (uncomment that line). If you're not receiving these
-denied packets on your broadcast interface (IP address xxx.yyy.zzz.255
-typically), then someone is attacking your machine in particular. Be careful
-out there....
-
-7.) Everything seems to work fine but my computer locks up after a while
-(and typically during a lengthy download through the cable modem)!
-
-Solution -- You may need to add a short delay in the driver to 'slow down' the
-SURFboard because your PC might not be able to keep up with the transfer rate
-of the SB1000. To do this, it's probably best to download Franco's
-sb1000-1.1.2.tar.gz archive and build and install sb1000.o manually. You'll
-want to edit the 'Makefile' and look for the 'SB1000_DELAY'
-define. Uncomment those 'CFLAGS' lines (and comment out the default ones)
-and try setting the delay to something like 60 microseconds with:
-'-DSB1000_DELAY=60'. Then do `make' and as root `make install' and try
-it out. If it still doesn't work or you like playing with the driver, you may
-try other numbers. Remember though that the higher the delay, the slower the
-driver (which slows down the rest of the PC too when it is actively
-used). Thanks to Ed Daiga for this tip!
-
------------------------------------------------------------------------------
-
-Credits: This README came from Franco Venturi's original README file which is
-still supplied with his driver .tar.gz archive. I and all other sb1000 users
-owe Franco a tremendous "Thank you!" Additional thanks goes to Carl Patten
-and Ralph Bonnell who are now managing the Linux SB1000 web site, and to
-the SB1000 users who reported and helped debug the common problems listed
-above.
-
-
- Clemmitt Sigler
- csigler@vt.edu
diff --git a/Documentation/networking/device_drivers/smsc/smc9.txt b/Documentation/networking/device_drivers/smsc/smc9.txt
deleted file mode 100644
index d1e15074e43d..000000000000
--- a/Documentation/networking/device_drivers/smsc/smc9.txt
+++ /dev/null
@@ -1,42 +0,0 @@
-
-SMC 9xxxx Driver
-Revision 0.12
-3/5/96
-Copyright 1996 Erik Stahlman
-Released under terms of the GNU General Public License.
-
-This file contains the instructions and caveats for my SMC9xxx driver. You
-should not be using the driver without reading this file.
-
-Things to note about installation:
-
- 1. The driver should work on all kernels from 1.2.13 until 1.3.71.
- (A kernel patch is supplied for 1.3.71 )
-
- 2. If you include this into the kernel, you might need to change some
- options, such as for forcing IRQ.
-
-
- 3. To compile as a module, run 'make' .
- Make will give you the appropriate options for various kernel support.
-
- 4. Loading the driver as a module :
-
- use: insmod smc9194.o
- optional parameters:
- io=xxxx : your base address
- irq=xx : your irq
- ifport=x : 0 for whatever is default
- 1 for twisted pair
- 2 for AUI ( or BNC on some cards )
-
-How to obtain the latest version?
-
-FTP:
- ftp://fenris.campus.vt.edu/smc9/smc9-12.tar.gz
- ftp://sfbox.vt.edu/filebox/F/fenris/smc9/smc9-12.tar.gz
-
-
-Contacting me:
- erik@mail.vt.edu
-
diff --git a/Documentation/networking/device_drivers/ti/cpsw.txt b/Documentation/networking/device_drivers/ti/cpsw.txt
deleted file mode 100644
index d4d4c0751a09..000000000000
--- a/Documentation/networking/device_drivers/ti/cpsw.txt
+++ /dev/null
@@ -1,541 +0,0 @@
-* Texas Instruments CPSW ethernet driver
-
-Multiqueue & CBS & MQPRIO
-=====================================================================
-=====================================================================
-
-The cpsw has 3 CBS shapers for each external port. This document
-describes MQPRIO and CBS Qdisc offload configuration for the cpsw driver
-based on examples. It can potentially be used in audio video bridging
-(AVB) and time sensitive networking (TSN).
-
-The following examples were tested on AM572x EVM and BBB boards.
-
-Test setup
-==========
-
-Two examples are considered, both with an AM572x EVM running the cpsw
-driver in dual_emac mode.
-
-Several prerequisites:
-- TX queues must be rated starting from txq0 that has highest priority
-- Traffic classes are used starting from 0, that has highest priority
-- CBS shapers should be used with rated queues
-- The bandwidth for CBS shapers has to be set a little bit more than
- potential incoming rate, thus, rate of all incoming tx queues has
- to be a little less
-- Real rates can differ, due to discreteness
-- Mapping skb-priority to txq is not enough; an skb-priority to l2 prio
- map also has to be created with the ip or vconfig tool
-- Any l2/socket prio (0 - 7) for classes can be used, but for
- simplicity default values are used: 3 and 2
-- Only 2 classes were tested, A and B, but it was checked that more can
- work: at most 4 are allowed, but a rate can be set for only 3 of them.
-
-Test setup for examples
-=======================
- +-------------------------------+
- |--+ |
- | | Workstation0 |
- |E | MAC 18:03:73:66:87:42 |
-+-----------------------------+ +--|t | |
-| | 1 | E | | |h |./tsn_listener -d \ |
-| Target board: | 0 | t |--+ |0 | 18:03:73:66:87:42 -i eth0 \|
-| AM572x EVM | 0 | h | | | -s 1500 |
-| | 0 | 0 | |--+ |
-| Only 2 classes: |Mb +---| +-------------------------------+
-| class A, class B | |
-| | +---| +-------------------------------+
-| | 1 | E | |--+ |
-| | 0 | t | | | Workstation1 |
-| | 0 | h |--+ |E | MAC 20:cf:30:85:7d:fd |
-| |Mb | 1 | +--|t | |
-+-----------------------------+ |h |./tsn_listener -d \ |
- |0 | 20:cf:30:85:7d:fd -i eth0 \|
- | | -s 1500 |
- |--+ |
- +-------------------------------+
-
-*********************************************************************
-*********************************************************************
-*********************************************************************
-Example 1: One port tx AVB configuration scheme for target board
-----------------------------------------------------------------------
-(prints and scheme for AM572x evm, applicable for single port boards)
-
-tc - traffic class
-txq - transmit queue
-p - priority
-f - fifo (cpsw fifo)
-S - shaper configured
-
-+------------------------------------------------------------------+ u
-| +---------------+ +---------------+ +------+ +------+ | s
-| | | | | | | | | | e
-| | App 1 | | App 2 | | Apps | | Apps | | r
-| | Class A | | Class B | | Rest | | Rest | |
-| | Eth0 | | Eth0 | | Eth0 | | Eth1 | | s
-| | VLAN100 | | VLAN100 | | | | | | | | p
-| | 40 Mb/s | | 20 Mb/s | | | | | | | | a
-| | SO_PRIORITY=3 | | SO_PRIORITY=2 | | | | | | | | c
-| | | | | | | | | | | | | | e
-| +---|-----------+ +---|-----------+ +---|--+ +---|--+ |
-+-----|------------------|------------------|--------|-------------+
- +-+ +------------+ | |
- | | +-----------------+ +--+
- | | | |
-+---|-------|-------------|-----------------------|----------------+
-| +----+ +----+ +----+ +----+ +----+ |
-| | p3 | | p2 | | p1 | | p0 | | p0 | | k
-| \ / \ / \ / \ / \ / | e
-| \ / \ / \ / \ / \ / | r
-| \/ \/ \/ \/ \/ | n
-| | | | | | e
-| | | +-----+ | | l
-| | | | | |
-| +----+ +----+ +----+ +----+ | s
-| |tc0 | |tc1 | |tc2 | |tc0 | | p
-| \ / \ / \ / \ / | a
-| \ / \ / \ / \ / | c
-| \/ \/ \/ \/ | e
-| | | +-----+ | |
-| | | | | | |
-| | | | | | |
-| | | | | | |
-| +----+ +----+ +----+ +----+ +----+ |
-| |txq0| |txq1| |txq2| |txq3| |txq4| |
-| \ / \ / \ / \ / \ / |
-| \ / \ / \ / \ / \ / |
-| \/ \/ \/ \/ \/ |
-| +-|------|------|------|--+ +--|--------------+ |
-| | | | | | | Eth0.100 | | Eth1 | |
-+---|------|------|------|------------------------|----------------+
- | | | | |
- p p p p |
- 3 2 0-1, 4-7 <- L2 priority |
- | | | | |
- | | | | |
-+---|------|------|------|------------------------|----------------+
-| | | | | |----------+ |
-| +----+ +----+ +----+ +----+ +----+ |
-| |dma7| |dma6| |dma5| |dma4| |dma3| |
-| \ / \ / \ / \ / \ / | c
-| \S / \S / \ / \ / \ / | p
-| \/ \/ \/ \/ \/ | s
-| | | | +----- | | w
-| | | | | | |
-| | | | | | | d
-| +----+ +----+ +----+p p+----+ | r
-| | | | | | |o o| | | i
-| | f3 | | f2 | | f0 |r r| f0 | | v
-| |tc0 | |tc1 | |tc2 |t t|tc0 | | e
-| \CBS / \CBS / \CBS /1 2\CBS / | r
-| \S / \S / \ / \ / |
-| \/ \/ \/ \/ |
-+------------------------------------------------------------------+
-========================================Eth==========================>
-
-1)
-// Add 4 tx queues, for interface Eth0, and 1 tx queue for Eth1
-$ ethtool -L eth0 rx 1 tx 5
-rx unmodified, ignoring
-
-2)
-// Check if num of queues is set correctly:
-$ ethtool -l eth0
-Channel parameters for eth0:
-Pre-set maximums:
-RX: 8
-TX: 8
-Other: 0
-Combined: 0
-Current hardware settings:
-RX: 1
-TX: 5
-Other: 0
-Combined: 0
-
-3)
-// TX queues must be rated starting from 0, so set bws for tx0 and tx1
-// Set rates 40 and 20 Mb/s respectively.
-// Note that the real speed can differ a bit due to discreteness.
-// Leave last 2 tx queues not rated.
-$ echo 40 > /sys/class/net/eth0/queues/tx-0/tx_maxrate
-$ echo 20 > /sys/class/net/eth0/queues/tx-1/tx_maxrate
-
-4)
-// Check maximum rate of tx (cpdma) queues:
-$ cat /sys/class/net/eth0/queues/tx-*/tx_maxrate
-40
-20
-0
-0
-0
-
-5)
-// Map skb->priority to traffic class:
-// 3pri -> tc0, 2pri -> tc1, (0,1,4-7)pri -> tc2
-// Map traffic class to transmit queue:
-// tc0 -> txq0, tc1 -> txq1, tc2 -> (txq2, txq3)
-$ tc qdisc replace dev eth0 handle 100: parent root mqprio num_tc 3 \
-map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 1
-
-5a)
-// As the two interfaces share the same set of tx queues, assign all
-// traffic coming to interface Eth1 to a separate queue so that it does not
-// mix with traffic from interface Eth0: use a separate txq to send
-// packets to Eth1, so all prio -> tc0 and tc0 -> txq4
-// Here hw 0, so the default hw configuration is still used for eth1
-$ tc qdisc replace dev eth1 handle 100: parent root mqprio num_tc 1 \
-map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 queues 1@4 hw 0
-
-6)
-// Check classes settings
-$ tc -g class show dev eth0
-+---(100:ffe2) mqprio
-| +---(100:3) mqprio
-| +---(100:4) mqprio
-|
-+---(100:ffe1) mqprio
-| +---(100:2) mqprio
-|
-+---(100:ffe0) mqprio
- +---(100:1) mqprio
-
-$ tc -g class show dev eth1
-+---(100:ffe0) mqprio
- +---(100:5) mqprio
-
-7)
-// Set rate for class A - 41 Mbit (tc0, txq0) using CBS Qdisc
-// Set it +1 Mb for reserve (important!)
-// Here only idleslope matters; the other args are ignored
-// Note that the real speed can differ a bit due to discreteness
-$ tc qdisc add dev eth0 parent 100:1 cbs locredit -1438 \
-hicredit 62 sendslope -959000 idleslope 41000 offload 1
-net eth0: set FIFO3 bw = 50
-
-8)
-// Set rate for class B - 21 Mbit (tc1, txq1) using CBS Qdisc:
-// Set it +1 Mb for reserve (important!)
-$ tc qdisc add dev eth0 parent 100:2 cbs locredit -1468 \
-hicredit 65 sendslope -979000 idleslope 21000 offload 1
-net eth0: set FIFO2 bw = 30
-
-9)
-// Create vlan 100 to map sk->priority to vlan qos
-$ ip link add link eth0 name eth0.100 type vlan id 100
-8021q: 802.1Q VLAN Support v1.8
-8021q: adding VLAN 0 to HW filter on device eth0
-8021q: adding VLAN 0 to HW filter on device eth1
-net eth0: Adding vlanid 100 to vlan filter
-
-10)
-// Map skb->priority to L2 prio, 1 to 1
-$ ip link set eth0.100 type vlan \
-egress 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
-
-11)
-// Check egress map for vlan 100
-$ cat /proc/net/vlan/eth0.100
-[...]
-INGRESS priority mappings: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0
-EGRESS priority mappings: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
-
-12)
-// Run your appropriate tools with socket option "SO_PRIORITY"
-// to 3 for class A and/or to 2 for class B
-// (taken from https://www.spinics.net/lists/netdev/msg460869.html)
-./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p3 -s 1500&
-./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p2 -s 1500&
-
-13)
-// run your listener on workstation (should be in same vlan)
-// (taken from https://www.spinics.net/lists/netdev/msg460869.html)
-./tsn_listener -d 18:03:73:66:87:42 -i enp5s0 -s 1500
-Receiving data rate: 39012 kbps
-Receiving data rate: 39012 kbps
-Receiving data rate: 39012 kbps
-Receiving data rate: 39012 kbps
-Receiving data rate: 39012 kbps
-Receiving data rate: 39012 kbps
-Receiving data rate: 39012 kbps
-Receiving data rate: 39012 kbps
-Receiving data rate: 39012 kbps
-Receiving data rate: 39012 kbps
-Receiving data rate: 39012 kbps
-Receiving data rate: 39012 kbps
-Receiving data rate: 39000 kbps
-
-14)
-// Restore default configuration if needed
-$ ip link del eth0.100
-$ tc qdisc del dev eth1 root
-$ tc qdisc del dev eth0 root
-net eth0: Prev FIFO2 is shaped
-net eth0: set FIFO3 bw = 0
-net eth0: set FIFO2 bw = 0
-$ ethtool -L eth0 rx 1 tx 1
-
-*********************************************************************
-*********************************************************************
-*********************************************************************
-Example 2: Two port tx AVB configuration scheme for target board
-----------------------------------------------------------------------
-(prints and scheme for AM572x evm, for dual emac boards only)
-
-+------------------------------------------------------------------+ u
-| +----------+ +----------+ +------+ +----------+ +----------+ | s
-| | | | | | | | | | | | e
-| | App 1 | | App 2 | | Apps | | App 3 | | App 4 | | r
-| | Class A | | Class B | | Rest | | Class B | | Class A | |
-| | Eth0 | | Eth0 | | | | | Eth1 | | Eth1 | | s
-| | VLAN100 | | VLAN100 | | | | | VLAN100 | | VLAN100 | | p
-| | 40 Mb/s | | 20 Mb/s | | | | | 10 Mb/s | | 30 Mb/s | | a
-| | SO_PRI=3 | | SO_PRI=2 | | | | | SO_PRI=2 | | SO_PRI=3 | | c
-| | | | | | | | | | | | | | | | | e
-| +---|------+ +---|------+ +---|--+ +---|------+ +---|------+ |
-+-----|-------------|-------------|---------|-------------|--------+
- +-+ +-------+ | +----------+ +----+
- | | +-------+------+ | |
- | | | | | |
-+---|-------|-------------|--------------|-------------|-------|---+
-| +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ |
-| | p3 | | p2 | | p1 | | p0 | | p0 | | p1 | | p2 | | p3 | | k
-| \ / \ / \ / \ / \ / \ / \ / \ / | e
-| \ / \ / \ / \ / \ / \ / \ / \ / | r
-| \/ \/ \/ \/ \/ \/ \/ \/ | n
-| | | | | | | | e
-| | | +----+ +----+ | | | l
-| | | | | | | |
-| +----+ +----+ +----+ +----+ +----+ +----+ | s
-| |tc0 | |tc1 | |tc2 | |tc2 | |tc1 | |tc0 | | p
-| \ / \ / \ / \ / \ / \ / | a
-| \ / \ / \ / \ / \ / \ / | c
-| \/ \/ \/ \/ \/ \/ | e
-| | | +-----+ +-----+ | | |
-| | | | | | | | | |
-| | | | | | | | | |
-| | | | | E E | | | | |
-| +----+ +----+ +----+ +----+ t t +----+ +----+ +----+ +----+ |
-| |txq0| |txq1| |txq4| |txq5| h h |txq6| |txq7| |txq3| |txq2| |
-| \ / \ / \ / \ / 0 1 \ / \ / \ / \ / |
-| \ / \ / \ / \ / . . \ / \ / \ / \ / |
-| \/ \/ \/ \/ 1 1 \/ \/ \/ \/ |
-| +-|------|------|------|--+ 0 0 +-|------|------|------|--+ |
-| | | | | | | 0 0 | | | | | | |
-+---|------|------|------|---------------|------|------|------|----+
- | | | | | | | |
- p p p p p p p p
- 3 2 0-1, 4-7 <-L2 pri-> 0-1, 4-7 2 3
- | | | | | | | |
- | | | | | | | |
-+---|------|------|------|---------------|------|------|------|----+
-| | | | | | | | | |
-| +----+ +----+ +----+ +----+ +----+ +----+ +----+ +----+ |
-| |dma7| |dma6| |dma3| |dma2| |dma1| |dma0| |dma4| |dma5| |
-| \ / \ / \ / \ / \ / \ / \ / \ / | c
-| \S / \S / \ / \ / \ / \ / \S / \S / | p
-| \/ \/ \/ \/ \/ \/ \/ \/ | s
-| | | | +----- | | | | | w
-| | | | | +----+ | | | |
-| | | | | | | | | | d
-| +----+ +----+ +----+p p+----+ +----+ +----+ | r
-| | | | | | |o o| | | | | | | i
-| | f3 | | f2 | | f0 |r CPSW r| f3 | | f2 | | f0 | | v
-| |tc0 | |tc1 | |tc2 |t t|tc0 | |tc1 | |tc2 | | e
-| \CBS / \CBS / \CBS /1 2\CBS / \CBS / \CBS / | r
-| \S / \S / \ / \S / \S / \ / |
-| \/ \/ \/ \/ \/ \/ |
-+------------------------------------------------------------------+
-========================================Eth==========================>
-
-1)
-// Add 8 tx queues, for interface Eth0, but they are common, so are accessed
-// by two interfaces Eth0 and Eth1.
-$ ethtool -L eth1 rx 1 tx 8
-rx unmodified, ignoring
-
-2)
-// Check if num of queues is set correctly:
-$ ethtool -l eth0
-Channel parameters for eth0:
-Pre-set maximums:
-RX: 8
-TX: 8
-Other: 0
-Combined: 0
-Current hardware settings:
-RX: 1
-TX: 8
-Other: 0
-Combined: 0
-
-3)
-// TX queues must be rated starting from 0, so set bws for tx0 and tx1 for Eth0
-// and for tx2 and tx3 for Eth1. That is, rates 40 and 20 Mb/s respectively
-// for Eth0 and 30 and 10 Mb/s for Eth1.
-// Real speed can differ a bit due to discreteness
-// Leave last 4 tx queues as not rated
-$ echo 40 > /sys/class/net/eth0/queues/tx-0/tx_maxrate
-$ echo 20 > /sys/class/net/eth0/queues/tx-1/tx_maxrate
-$ echo 30 > /sys/class/net/eth1/queues/tx-2/tx_maxrate
-$ echo 10 > /sys/class/net/eth1/queues/tx-3/tx_maxrate
-
-4)
-// Check maximum rate of tx (cpdma) queues:
-$ cat /sys/class/net/eth0/queues/tx-*/tx_maxrate
-40
-20
-30
-10
-0
-0
-0
-0
-
-5)
-// Map skb->priority to traffic class for Eth0:
-// 3pri -> tc0, 2pri -> tc1, (0,1,4-7)pri -> tc2
-// Map traffic class to transmit queue:
-// tc0 -> txq0, tc1 -> txq1, tc2 -> (txq4, txq5)
-$ tc qdisc replace dev eth0 handle 100: parent root mqprio num_tc 3 \
-map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@4 hw 1
-
-6)
-// Check classes settings
-$ tc -g class show dev eth0
-+---(100:ffe2) mqprio
-| +---(100:5) mqprio
-| +---(100:6) mqprio
-|
-+---(100:ffe1) mqprio
-| +---(100:2) mqprio
-|
-+---(100:ffe0) mqprio
- +---(100:1) mqprio
-
-7)
-// Set rate for class A - 41 Mbit (tc0, txq0) using CBS Qdisc for Eth0
-// Here only idleslope matters; the other args are ignored
-// Real speed can differ a bit due to discreteness
-$ tc qdisc add dev eth0 parent 100:1 cbs locredit -1470 \
-hicredit 62 sendslope -959000 idleslope 41000 offload 1
-net eth0: set FIFO3 bw = 50
-
-8)
-// Set rate for class B - 21 Mbit (tc1, txq1) using CBS Qdisc for Eth0
-$ tc qdisc add dev eth0 parent 100:2 cbs locredit -1470 \
-hicredit 65 sendslope -979000 idleslope 21000 offload 1
-net eth0: set FIFO2 bw = 30
-
-9)
-// Create vlan 100 to map sk->priority to vlan qos for Eth0
-$ ip link add link eth0 name eth0.100 type vlan id 100
-net eth0: Adding vlanid 100 to vlan filter
-
-10)
-// Map skb->priority to L2 prio for Eth0.100, one to one
-$ ip link set eth0.100 type vlan \
-egress 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
-
-11)
-// Check egress map for vlan 100
-$ cat /proc/net/vlan/eth0.100
-[...]
-INGRESS priority mappings: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0
-EGRESS priority mappings: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
-
-12)
-// Map skb->priority to traffic class for Eth1:
-// 3pri -> tc0, 2pri -> tc1, (0,1,4-7)pri -> tc2
-// Map traffic class to transmit queue:
-// tc0 -> txq2, tc1 -> txq3, tc2 -> (txq6, txq7)
-$ tc qdisc replace dev eth1 handle 100: parent root mqprio num_tc 3 \
-map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@2 1@3 2@6 hw 1
-
-13)
-// Check classes settings
-$ tc -g class show dev eth1
-+---(100:ffe2) mqprio
-| +---(100:7) mqprio
-| +---(100:8) mqprio
-|
-+---(100:ffe1) mqprio
-| +---(100:4) mqprio
-|
-+---(100:ffe0) mqprio
- +---(100:3) mqprio
-
-14)
-// Set rate for class A - 31 Mbit (tc0, txq2) using CBS Qdisc for Eth1
-// Here only idleslope matters; the other args are ignored, but they are
-// calculated for the interface speed: 100Mb for the eth1 port.
-// Set it +1 Mb for reserve (important!)
-$ tc qdisc add dev eth1 parent 100:3 cbs locredit -1035 \
-hicredit 465 sendslope -69000 idleslope 31000 offload 1
-net eth1: set FIFO3 bw = 31
-
-15)
-// Set rate for class B - 11 Mbit (tc1, txq3) using CBS Qdisc for Eth1
-// Set it +1 Mb for reserve (important!)
-$ tc qdisc add dev eth1 parent 100:4 cbs locredit -1335 \
-hicredit 405 sendslope -89000 idleslope 11000 offload 1
-net eth1: set FIFO2 bw = 11
-
-16)
-// Create vlan 100 to map sk->priority to vlan qos for Eth1
-$ ip link add link eth1 name eth1.100 type vlan id 100
-net eth1: Adding vlanid 100 to vlan filter
-
-17)
-// Map skb->priority to L2 prio for Eth1.100, one to one
-$ ip link set eth1.100 type vlan \
-egress 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
-
-18)
-// Check egress map for vlan 100
-$ cat /proc/net/vlan/eth1.100
-[...]
-INGRESS priority mappings: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0
-EGRESS priority mappings: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7
-
-19)
-// Run appropriate tools with socket option "SO_PRIORITY" to 3
-// for class A and to 2 for class B. For both interfaces
-./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p2 -s 1500&
-./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p3 -s 1500&
-./tsn_talker -d 20:cf:30:85:7d:fd -i eth1.100 -p2 -s 1500&
-./tsn_talker -d 20:cf:30:85:7d:fd -i eth1.100 -p3 -s 1500&
-
-20)
-// run your listener on workstation (should be in same vlan)
-// (taken from https://www.spinics.net/lists/netdev/msg460869.html)
-./tsn_listener -d 18:03:73:66:87:42 -i enp5s0 -s 1500
-Receiving data rate: 39012 kbps
-Receiving data rate: 39012 kbps
-Receiving data rate: 39012 kbps
-Receiving data rate: 39012 kbps
-Receiving data rate: 39012 kbps
-Receiving data rate: 39012 kbps
-Receiving data rate: 39012 kbps
-Receiving data rate: 39012 kbps
-Receiving data rate: 39012 kbps
-Receiving data rate: 39012 kbps
-Receiving data rate: 39012 kbps
-Receiving data rate: 39012 kbps
-Receiving data rate: 39000 kbps
-
-21)
-// Restore default configuration if needed
-$ ip link del eth1.100
-$ ip link del eth0.100
-$ tc qdisc del dev eth1 root
-net eth1: Prev FIFO2 is shaped
-net eth1: set FIFO3 bw = 0
-net eth1: set FIFO2 bw = 0
-$ tc qdisc del dev eth0 root
-net eth0: Prev FIFO2 is shaped
-net eth0: set FIFO3 bw = 0
-net eth0: set FIFO2 bw = 0
-$ ethtool -L eth0 rx 1 tx 1
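Note on the removed cpsw.txt examples above: the mqprio `map`/`queues` arguments and the CBS `sendslope` values can be cross-checked with a few lines of Python. This is only a sketch. The helper names `decode_mqprio` and `cbs_sendslope` are hypothetical and are not part of tc or the cpsw driver. It assumes the usual mqprio reading (position in `map` is the skb priority, value is the traffic class, each `queues` entry is `count@offset`) and the relation sendslope = idleslope - port rate, with port rates of 1 Gb for eth0 and 100 Mb for eth1 as in the test setup.

```python
# Sketch only: helper names are hypothetical, not part of tc or the cpsw driver.

def decode_mqprio(prio_map, queues):
    """Return {priority: [txq, ...]} for an mqprio 'map' and 'queues' spec.

    prio_map: traffic-class index per skb priority (list index = priority).
    queues:   one 'count@offset' string per traffic class.
    """
    tc_queues = []
    for spec in queues:
        count, offset = (int(x) for x in spec.split("@"))
        tc_queues.append(list(range(offset, offset + count)))
    return {prio: tc_queues[tc] for prio, tc in enumerate(prio_map)}


def cbs_sendslope(idleslope_kbit, port_rate_kbit):
    """Assumed relation: sendslope = idleslope - port transmit rate (kbit/s)."""
    return idleslope_kbit - port_rate_kbit


# Example 1, step 5: map 2 2 1 0 2 ... 2, queues 1@0 1@1 2@2
prio_to_txq = decode_mqprio([2, 2, 1, 0] + [2] * 12, ["1@0", "1@1", "2@2"])
print(prio_to_txq[3], prio_to_txq[2], prio_to_txq[0])  # [0] [1] [2, 3]

# Example 1, step 7 (1 Gb port) and Example 2, step 14 (100 Mb port)
print(cbs_sendslope(41000, 1000000))  # -959000
print(cbs_sendslope(31000, 100000))   # -69000
```

The printed values match the comments in step 5 (3pri -> txq0, 2pri -> txq1, the rest -> txq2 and txq3) and the `sendslope` arguments used in steps 7 and 14.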
diff --git a/Documentation/networking/device_drivers/wifi/index.rst b/Documentation/networking/device_drivers/wifi/index.rst
new file mode 100644
index 000000000000..bf91a87c7acf
--- /dev/null
+++ b/Documentation/networking/device_drivers/wifi/index.rst
@@ -0,0 +1,20 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+Wi-Fi Device Drivers
+====================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ intel/ipw2100
+ intel/ipw2200
+ ray_cs
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/intel/ipw2100.txt b/Documentation/networking/device_drivers/wifi/intel/ipw2100.rst
index 6f85e1d06031..883e96355799 100644
--- a/Documentation/networking/device_drivers/intel/ipw2100.txt
+++ b/Documentation/networking/device_drivers/wifi/intel/ipw2100.rst
@@ -1,31 +1,37 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
-Intel(R) PRO/Wireless 2100 Driver for Linux in support of:
+===========================================
+Intel(R) PRO/Wireless 2100 Driver for Linux
+===========================================
-Intel(R) PRO/Wireless 2100 Network Connection
+Support for:
-Copyright (C) 2003-2006, Intel Corporation
+- Intel(R) PRO/Wireless 2100 Network Connection
+
+Copyright |copy| 2003-2006, Intel Corporation
README.ipw2100
-Version: git-1.1.5
-Date : January 25, 2006
+:Version: git-1.1.5
+:Date: January 25, 2006
-Index
------------------------------------------------
-0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER
-1. Introduction
-2. Release git-1.1.5 Current Features
-3. Command Line Parameters
-4. Sysfs Helper Files
-5. Radio Kill Switch
-6. Dynamic Firmware
-7. Power Management
-8. Support
-9. License
+.. Index
+
+ 0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER
+ 1. Introduction
+ 2. Release git-1.1.5 Current Features
+ 3. Command Line Parameters
+ 4. Sysfs Helper Files
+ 5. Radio Kill Switch
+ 6. Dynamic Firmware
+ 7. Power Management
+ 8. Support
+ 9. License
-0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER
------------------------------------------------
+0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER
+=================================================
Important Notice FOR ALL USERS OR DISTRIBUTORS!!!!
@@ -72,13 +78,13 @@ such, if you are interested in deploying or shipping a driver as part of
solution intended to be used for purposes other than development, please
obtain a tested driver from Intel Customer Support at:
-http://www.intel.com/support/wireless/sb/CS-006408.htm
+https://www.intel.com/support/wireless/sb/CS-006408.htm
1. Introduction
------------------------------------------------
+===============
-This document provides a brief overview of the features supported by the
-IPW2100 driver project. The main project website, where the latest
+This document provides a brief overview of the features supported by the
+IPW2100 driver project. The main project website, where the latest
development version of the driver can be found, is:
http://ipw2100.sourceforge.net
@@ -89,10 +95,11 @@ for the driver project.
2. Release git-1.1.5 Current Supported Features
------------------------------------------------
+===============================================
+
- Managed (BSS) and Ad-Hoc (IBSS)
- WEP (shared key and open)
-- Wireless Tools support
+- Wireless Tools support
- 802.1x (tested with XSupplicant 1.0.1)
Enabled (but not supported) features:
@@ -105,11 +112,11 @@ performed on a given feature.
3. Command Line Parameters
------------------------------------------------
+==========================
If the driver is built as a module, the following optional parameters are used
by entering them on the command line with the modprobe command using this
-syntax:
+syntax::
modprobe ipw2100 [<option>=<VAL1><,VAL2>...]
@@ -119,61 +126,76 @@ For example, to disable the radio on driver loading, enter:
The ipw2100 driver supports the following module parameters:
-Name Value Example:
-debug 0x0-0xffffffff debug=1024
-mode 0,1,2 mode=1 /* AdHoc */
-channel int channel=3 /* Only valid in AdHoc or Monitor */
-associate boolean associate=0 /* Do NOT auto associate */
-disable boolean disable=1 /* Do not power the HW */
+========= ============== ============ ==============================
+Name Value Example Meaning
+========= ============== ============ ==============================
+debug 0x0-0xffffffff debug=1024 Debug level set to 1024
+mode 0,1,2 mode=1 AdHoc
+channel int channel=3 Only valid in AdHoc or Monitor
+associate boolean associate=0 Do NOT auto associate
+disable boolean disable=1 Do not power the HW
+========= ============== ============ ==============================
4. Sysfs Helper Files
----------------------------
------------------------------------------------
+=====================
-There are several ways to control the behavior of the driver. Many of the
+There are several ways to control the behavior of the driver. Many of the
general capabilities are exposed through the Wireless Tools (iwconfig). There
are a few capabilities that are exposed through entries in the Linux Sysfs.
------ Driver Level ------
+**Driver Level**
+
For the driver level files, look in /sys/bus/pci/drivers/ipw2100/
- debug_level
-
- This controls the same global as the 'debug' module parameter. For
- information on the various debugging levels available, run the 'dvals'
+ debug_level
+ This controls the same global as the 'debug' module parameter. For
+ information on the various debugging levels available, run the 'dvals'
script found in the driver source directory.
- NOTE: 'debug_level' is only enabled if CONFIG_IPW2100_DEBUG is turn
- on.
+ .. note::
+
+ 'debug_level' is only enabled if CONFIG_IPW2100_DEBUG is turned on.
+
+**Device Level**
+
+For the device level files look in::
------ Device Level ------
-For the device level files look in
-
/sys/bus/pci/drivers/ipw2100/{PCI-ID}/
-For example:
+For example::
+
/sys/bus/pci/drivers/ipw2100/0000:02:01.0
For the device level files, see /sys/bus/pci/drivers/ipw2100:
rf_kill
- read -
- 0 = RF kill not enabled (radio on)
- 1 = SW based RF kill active (radio off)
- 2 = HW based RF kill active (radio off)
- 3 = Both HW and SW RF kill active (radio off)
- write -
- 0 = If SW based RF kill active, turn the radio back on
- 1 = If radio is on, activate SW based RF kill
+ read
+
+ == =========================================
+ 0 RF kill not enabled (radio on)
+ 1 SW based RF kill active (radio off)
+ 2 HW based RF kill active (radio off)
+ 3 Both HW and SW RF kill active (radio off)
+ == =========================================
+
+ write
+
+ == ==================================================
+ 0 If SW based RF kill active, turn the radio back on
+ 1 If radio is on, activate SW based RF kill
+ == ==================================================
- NOTE: If you enable the SW based RF kill and then toggle the HW
- based RF kill from ON -> OFF -> ON, the radio will NOT come back on
+ .. note::
+
+ If you enable the SW based RF kill and then toggle the HW
+ based RF kill from ON -> OFF -> ON, the radio will NOT come back on
5. Radio Kill Switch
------------------------------------------------
+====================
+
Most laptops provide the ability for the user to physically disable the radio.
Some vendors have implemented this as a physical switch that requires no
software to turn the radio off and on. On other laptops, however, the switch
@@ -186,9 +208,10 @@ on your system.
6. Dynamic Firmware
------------------------------------------------
-As the firmware is licensed under a restricted use license, it can not be
-included within the kernel sources. To enable the IPW2100 you will need a
+===================
+
+As the firmware is licensed under a restricted use license, it can not be
+included within the kernel sources. To enable the IPW2100 you will need a
firmware image to load into the wireless NIC's processors.
You can obtain these images from <http://ipw2100.sf.net/firmware.php>.
@@ -197,52 +220,57 @@ See INSTALL for instructions on installing the firmware.
7. Power Management
------------------------------------------------
-The IPW2100 supports the configuration of the Power Save Protocol
-through a private wireless extension interface. The IPW2100 supports
+===================
+
+The IPW2100 supports the configuration of the Power Save Protocol
+through a private wireless extension interface. The IPW2100 supports
the following different modes:
+ === ===========================================================
off No power management. Radio is always on.
on Automatic power management
- 1-5 Different levels of power management. The higher the
- number the greater the power savings, but with an impact to
- packet latencies.
-
-Power management works by powering down the radio after a certain
-interval of time has passed where no packets are passed through the
-radio. Once powered down, the radio remains in that state for a given
-period of time. For higher power savings, the interval between last
+ 1-5 Different levels of power management. The higher the
+ number the greater the power savings, but with an impact to
+ packet latencies.
+ === ===========================================================
+
+Power management works by powering down the radio after a certain
+interval of time has passed where no packets are passed through the
+radio. Once powered down, the radio remains in that state for a given
+period of time. For higher power savings, the interval between last
packet processed to sleep is shorter and the sleep period is longer.
-When the radio is asleep, the access point sending data to the station
-must buffer packets at the AP until the station wakes up and requests
-any buffered packets. If you have an AP that does not correctly support
-the PSP protocol you may experience packet loss or very poor performance
-while power management is enabled. If this is the case, you will need
-to try and find a firmware update for your AP, or disable power
-management (via `iwconfig eth1 power off`)
+When the radio is asleep, the access point sending data to the station
+must buffer packets at the AP until the station wakes up and requests
+any buffered packets. If you have an AP that does not correctly support
+the PSP protocol you may experience packet loss or very poor performance
+while power management is enabled. If this is the case, you will need
+to try and find a firmware update for your AP, or disable power
+management (via ``iwconfig eth1 power off``)
-To configure the power level on the IPW2100 you use a combination of
-iwconfig and iwpriv. iwconfig is used to turn power management on, off,
+To configure the power level on the IPW2100 you use a combination of
+iwconfig and iwpriv. iwconfig is used to turn power management on, off,
and set it to auto.
+ ========================= ====================================
iwconfig eth1 power off Disables radio power down
- iwconfig eth1 power on Enables radio power management to
+ iwconfig eth1 power on Enables radio power management to
last set level (defaults to AUTO)
- iwpriv eth1 set_power 0 Sets power level to AUTO and enables
- power management if not previously
+ iwpriv eth1 set_power 0 Sets power level to AUTO and enables
+ power management if not previously
enabled.
- iwpriv eth1 set_power 1-5 Set the power level as specified,
- enabling power management if not
+ iwpriv eth1 set_power 1-5 Set the power level as specified,
+ enabling power management if not
previously enabled.
+ ========================= ====================================
+
+You can view the current power level setting via::
-You can view the current power level setting via:
-
iwpriv eth1 get_power
It will return the current period or timeout that is configured as a string
in the form of xxxx/yyyy (z) where xxxx is the timeout interval (amount of
-time after packet processing), yyyy is the period to sleep (amount of time to
+time after packet processing), yyyy is the period to sleep (amount of time to
wait before powering the radio and querying the access point for buffered
packets), and z is the 'power level'. If power management is turned off the
xxxx/yyyy will be replaced with 'off' -- the level reported will be the active
@@ -250,44 +278,46 @@ level if `iwconfig eth1 power on` is invoked.
8. Support
------------------------------------------------
+==========
For general development information and support,
go to:
-
+
http://ipw2100.sf.net/
-The ipw2100 1.1.0 driver and firmware can be downloaded from:
+The ipw2100 1.1.0 driver and firmware can be downloaded from:
http://support.intel.com
-For installation support on the ipw2100 1.1.0 driver on Linux kernels
-2.6.8 or greater, email support is available from:
+For installation support on the ipw2100 1.1.0 driver on Linux kernels
+2.6.8 or greater, email support is available from:
http://supportmail.intel.com
9. License
------------------------------------------------
+==========
- Copyright(c) 2003 - 2006 Intel Corporation. All rights reserved.
+ Copyright |copy| 2003 - 2006 Intel Corporation. All rights reserved.
- This program is free software; you can redistribute it and/or modify it
- under the terms of the GNU General Public License (version 2) as
+ This program is free software; you can redistribute it and/or modify it
+ under the terms of the GNU General Public License (version 2) as
published by the Free Software Foundation.
-
- This program is distributed in the hope that it will be useful, but WITHOUT
- ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+
+ This program is distributed in the hope that it will be useful, but WITHOUT
+ ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.
-
+
You should have received a copy of the GNU General Public License along with
- this program; if not, write to the Free Software Foundation, Inc., 59
+ this program; if not, write to the Free Software Foundation, Inc., 59
Temple Place - Suite 330, Boston, MA 02111-1307, USA.
-
+
The full GNU General Public License is included in this distribution in the
file called LICENSE.
-
+
License Contact Information:
+
James P. Ketrenos <ipw2100-admin@linux.intel.com>
+
Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124-6497
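
Note on the ipw2100 hunk above: the text describes the `iwpriv eth1 get_power` result as a string of the form xxxx/yyyy (z), with xxxx/yyyy replaced by 'off' when power management is disabled. The following is a minimal Python sketch of a parser for that form only. The names `PowerSetting` and `parse_power` are hypothetical, the units of the two intervals are not stated in the text, and real driver output may be worded differently, so treat this as an illustration rather than a description of the driver.

```python
import re
from typing import NamedTuple, Optional


class PowerSetting(NamedTuple):
    timeout: Optional[int]  # xxxx: timeout interval; None when reported as 'off'
    period: Optional[int]   # yyyy: sleep period; None when reported as 'off'
    level: int              # z: the power level


# Matches "xxxx/yyyy (z)" or "off (z)"; quotes around 'off' are tolerated.
_POWER_RE = re.compile(r"(?:(\d+)/(\d+)|'?off'?)\s*\((\d+)\)", re.IGNORECASE)


def parse_power(text: str) -> PowerSetting:
    """Extract timeout, period and level from a get_power style string."""
    match = _POWER_RE.search(text)
    if match is None:
        raise ValueError(f"unrecognised get_power output: {text!r}")
    timeout, period, level = match.groups()
    if timeout is None:
        # Power management is off: only the level is reported.
        return PowerSetting(None, None, int(level))
    return PowerSetting(int(timeout), int(period), int(level))
```

When the result has `timeout is None`, the level is the one that would become active if `iwconfig eth1 power on` were invoked, as the text above explains.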
diff --git a/Documentation/networking/device_drivers/intel/ipw2200.txt b/Documentation/networking/device_drivers/wifi/intel/ipw2200.rst
index b7658bed4906..0cb42d2fd7e5 100644
--- a/Documentation/networking/device_drivers/intel/ipw2200.txt
+++ b/Documentation/networking/device_drivers/wifi/intel/ipw2200.rst
@@ -1,8 +1,15 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
-Intel(R) PRO/Wireless 2915ABG Driver for Linux in support of:
+==============================================
+Intel(R) PRO/Wireless 2915ABG Driver for Linux
+==============================================
-Intel(R) PRO/Wireless 2200BG Network Connection
-Intel(R) PRO/Wireless 2915ABG Network Connection
+
+Support for:
+
+- Intel(R) PRO/Wireless 2200BG Network Connection
+- Intel(R) PRO/Wireless 2915ABG Network Connection
Note: The Intel(R) PRO/Wireless 2915ABG Driver for Linux and Intel(R)
PRO/Wireless 2200BG Driver for Linux is a unified driver that works on
@@ -10,37 +17,37 @@ both hardware adapters listed above. In this document the Intel(R)
PRO/Wireless 2915ABG Driver for Linux will be used to reference the
unified driver.
-Copyright (C) 2004-2006, Intel Corporation
+Copyright |copy| 2004-2006, Intel Corporation
README.ipw2200
-Version: 1.1.2
-Date : March 30, 2006
+:Version: 1.1.2
+:Date: March 30, 2006
-Index
------------------------------------------------
-0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER
-1. Introduction
-1.1. Overview of features
-1.2. Module parameters
-1.3. Wireless Extension Private Methods
-1.4. Sysfs Helper Files
-1.5. Supported channels
-2. Ad-Hoc Networking
-3. Interacting with Wireless Tools
-3.1. iwconfig mode
-3.2. iwconfig sens
-4. About the Version Numbers
-5. Firmware installation
-6. Support
-7. License
+.. Index
+ 0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER
+ 1. Introduction
+ 1.1. Overview of features
+ 1.2. Module parameters
+ 1.3. Wireless Extension Private Methods
+ 1.4. Sysfs Helper Files
+ 1.5. Supported channels
+ 2. Ad-Hoc Networking
+ 3. Interacting with Wireless Tools
+ 3.1. iwconfig mode
+ 3.2. iwconfig sens
+ 4. About the Version Numbers
+ 5. Firmware installation
+ 6. Support
+ 7. License
-0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER
------------------------------------------------
-Important Notice FOR ALL USERS OR DISTRIBUTORS!!!!
+0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER
+=================================================
+
+Important Notice FOR ALL USERS OR DISTRIBUTORS!!!!
Intel wireless LAN adapters are engineered, manufactured, tested, and
quality checked to ensure that they meet all necessary local and
@@ -56,7 +63,7 @@ product is granted. Intel's wireless LAN's EEPROM, firmware, and
software driver are designed to carefully control parameters that affect
radio operation and to ensure electromagnetic compliance (EMC). These
parameters include, without limitation, RF power, spectrum usage,
-channel scanning, and human exposure.
+channel scanning, and human exposure.
For these reasons Intel cannot permit any manipulation by third parties
of the software provided in binary format with the wireless WLAN
@@ -70,7 +77,7 @@ no liability, under any theory of liability for any issues associated
with the modified products, including without limitation, claims under
the warranty and/or issues arising from regulatory non-compliance, and
(iii) Intel will not provide or be required to assist in providing
-support to any third parties for such modified products.
+support to any third parties for such modified products.
Note: Many regulatory agencies consider Wireless LAN adapters to be
modules, and accordingly, condition system-level regulatory approval
@@ -78,23 +85,24 @@ upon receipt and review of test data documenting that the antennas and
system configuration do not cause the EMC and radio operation to be
non-compliant.
-The drivers available for download from SourceForge are provided as a
-part of a development project. Conformance to local regulatory
-requirements is the responsibility of the individual developer. As
-such, if you are interested in deploying or shipping a driver as part of
-solution intended to be used for purposes other than development, please
+The drivers available for download from SourceForge are provided as a
+part of a development project. Conformance to local regulatory
+requirements is the responsibility of the individual developer. As
+such, if you are interested in deploying or shipping a driver as part of
+solution intended to be used for purposes other than development, please
obtain a tested driver from Intel Customer Support at:
http://support.intel.com
-1. Introduction
------------------------------------------------
-The following sections attempt to provide a brief introduction to using
+1. Introduction
+===============
+
+The following sections attempt to provide a brief introduction to using
the Intel(R) PRO/Wireless 2915ABG Driver for Linux.
-This document is not meant to be a comprehensive manual on
-understanding or using wireless technologies, but should be sufficient
+This document is not meant to be a comprehensive manual on
+understanding or using wireless technologies, but should be sufficient
to get you moving without wires on Linux.
For information on building and installing the driver, see the INSTALL
@@ -102,14 +110,14 @@ file.
1.1. Overview of Features
------------------------------------------------
+-------------------------
The current release (1.1.2) supports the following features:
+ BSS mode (Infrastructure, Managed)
+ IBSS mode (Ad-Hoc)
+ WEP (OPEN and SHARED KEY mode)
+ 802.1x EAP via wpa_supplicant and xsupplicant
-+ Wireless Extension support
++ Wireless Extension support
+ Full B and G rate support (2200 and 2915)
+ Full A rate support (2915 only)
+ Transmit power control
@@ -122,102 +130,107 @@ supported:
+ long/short preamble support
+ Monitor mode (aka RFMon)
-The distinction between officially supported and enabled is a reflection
+The distinction between officially supported and enabled is a reflection
on the amount of validation and interoperability testing that has been
-performed on a given feature.
+performed on a given feature.
1.2. Command Line Parameters
------------------------------------------------
+----------------------------
Like many modules used in the Linux kernel, the Intel(R) PRO/Wireless
-2915ABG Driver for Linux allows configuration options to be provided
-as module parameters. The most common way to specify a module parameter
-is via the command line.
+2915ABG Driver for Linux allows configuration options to be provided
+as module parameters. The most common way to specify a module parameter
+is via the command line.
-The general form is:
+The general form is::
-% modprobe ipw2200 parameter=value
+ % modprobe ipw2200 parameter=value
Where the supported parameters are:
associate
Set to 0 to disable the auto scan-and-associate functionality of the
- driver. If disabled, the driver will not attempt to scan
- for and associate to a network until it has been configured with
- one or more properties for the target network, for example configuring
+ driver. If disabled, the driver will not attempt to scan
+ for and associate to a network until it has been configured with
+ one or more properties for the target network, for example configuring
the network SSID. Default is 0 (do not auto-associate)
-
+
Example: % modprobe ipw2200 associate=0
auto_create
- Set to 0 to disable the auto creation of an Ad-Hoc network
- matching the channel and network name parameters provided.
+ Set to 0 to disable the auto creation of an Ad-Hoc network
+ matching the channel and network name parameters provided.
Default is 1.
channel
channel number for association. The normal method for setting
- the channel would be to use the standard wireless tools
- (i.e. `iwconfig eth1 channel 10`), but it is useful sometimes
+ the channel would be to use the standard wireless tools
+ (i.e. `iwconfig eth1 channel 10`), but it is useful sometimes
to set this while debugging. Channel 0 means 'ANY'
debug
If using a debug build, this is used to control the amount of debug
info that is logged. See the 'dvals' and 'load' script for more info on
- how to use this (the dvals and load scripts are provided as part
- of the ipw2200 development snapshot releases available from the
+ how to use this (the dvals and load scripts are provided as part
+ of the ipw2200 development snapshot releases available from the
SourceForge project at http://ipw2200.sf.net)
-
+
led
Can be used to turn on experimental LED code.
0 = Off, 1 = On. Default is 1.
mode
- Can be used to set the default mode of the adapter.
+ Can be used to set the default mode of the adapter.
0 = Managed, 1 = Ad-Hoc, 2 = Monitor
1.3. Wireless Extension Private Methods
------------------------------------------------
+---------------------------------------
-As an interface designed to handle generic hardware, there are certain
-capabilities not exposed through the normal Wireless Tool interface. As
-such, a provision is provided for a driver to declare custom, or
-private, methods. The Intel(R) PRO/Wireless 2915ABG Driver for Linux
+As an interface designed to handle generic hardware, there are certain
+capabilities not exposed through the normal Wireless Tool interface. As
+such, a provision is provided for a driver to declare custom, or
+private, methods. The Intel(R) PRO/Wireless 2915ABG Driver for Linux
defines several of these to configure various settings.
-The general form of using the private wireless methods is:
+The general form of using the private wireless methods is::
% iwpriv $IFNAME method parameters
-Where $IFNAME is the interface name the device is registered with
+Where $IFNAME is the interface name the device is registered with
(typically eth1, customized via one of the various network interface
name managers, such as ifrename)
The supported private methods are:
get_mode
- Can be used to report out which IEEE mode the driver is
+ Can be used to report out which IEEE mode the driver is
configured to support. Example:
-
+
% iwpriv eth1 get_mode
eth1 get_mode:802.11bg (6)
set_mode
- Can be used to configure which IEEE mode the driver will
- support.
+ Can be used to configure which IEEE mode the driver will
+ support.
+
+ Usage::
+
+ % iwpriv eth1 set_mode {mode}
- Usage:
- % iwpriv eth1 set_mode {mode}
Where {mode} is a number in the range 1-7:
+
+ == =====================
1 802.11a (2915 only)
2 802.11b
3 802.11ab (2915 only)
- 4 802.11g
+ 4 802.11g
5 802.11ag (2915 only)
6 802.11bg
7 802.11abg (2915 only)
+ == =====================
get_preamble
Can be used to report configuration of preamble length.
@@ -225,99 +238,123 @@ The supported private methods are:
set_preamble
Can be used to set the configuration of preamble length:
- Usage:
- % iwpriv eth1 set_preamble {mode}
+ Usage::
+
+ % iwpriv eth1 set_preamble {mode}
+
Where {mode} is one of:
+
+ == ========================================
1 Long preamble only
0 Auto (long or short based on connection)
-
+ == ========================================
+
-1.4. Sysfs Helper Files:
------------------------------------------------
+1.4. Sysfs Helper Files
+-----------------------
-The Linux kernel provides a pseudo file system that can be used to
+The Linux kernel provides a pseudo file system that can be used to
access various components of the operating system. The Intel(R)
PRO/Wireless 2915ABG Driver for Linux exposes several configuration
parameters through this mechanism.
-An entry in the sysfs can support reading and/or writing. You can
-typically query the contents of a sysfs entry through the use of cat,
-and can set the contents via echo. For example:
+An entry in the sysfs can support reading and/or writing. You can
+typically query the contents of a sysfs entry through the use of cat,
+and can set the contents via echo. For example::
-% cat /sys/bus/pci/drivers/ipw2200/debug_level
+ % cat /sys/bus/pci/drivers/ipw2200/debug_level
-Will report the current debug level of the driver's logging subsystem
+Will report the current debug level of the driver's logging subsystem
(only available if CONFIG_IPW2200_DEBUG was configured when the driver
was built).
-You can set the debug level via:
+You can set the debug level via::
-% echo $VALUE > /sys/bus/pci/drivers/ipw2200/debug_level
+ % echo $VALUE > /sys/bus/pci/drivers/ipw2200/debug_level
-Where $VALUE would be a number in the case of this sysfs entry. The
-input to sysfs files does not have to be a number. For example, the
-firmware loader used by hotplug utilizes sysfs entries for transferring
+Where $VALUE would be a number in the case of this sysfs entry. The
+input to sysfs files does not have to be a number. For example, the
+firmware loader used by hotplug utilizes sysfs entries for transferring
the firmware image from user space into the driver.
-The Intel(R) PRO/Wireless 2915ABG Driver for Linux exposes sysfs entries
-at two levels -- driver level, which apply to all instances of the driver
-(in the event that there are more than one device installed) and device
+The Intel(R) PRO/Wireless 2915ABG Driver for Linux exposes sysfs entries
+at two levels -- driver level, which apply to all instances of the driver
+(in the event that there are more than one device installed) and device
level, which applies only to the single specific instance.
1.4.1 Driver Level Sysfs Helper Files
------------------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For the driver level files, look in /sys/bus/pci/drivers/ipw2200/
- debug_level
-
+ debug_level
This controls the same global as the 'debug' module parameter
1.4.2 Device Level Sysfs Helper Files
------------------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+For the device level files, look in::
-For the device level files, look in
-
/sys/bus/pci/drivers/ipw2200/{PCI-ID}/
-For example:
+For example::
+
/sys/bus/pci/drivers/ipw2200/0000:02:01.0
For the device level files, see /sys/bus/pci/drivers/ipw2200:
rf_kill
- read -
- 0 = RF kill not enabled (radio on)
- 1 = SW based RF kill active (radio off)
- 2 = HW based RF kill active (radio off)
- 3 = Both HW and SW RF kill active (radio off)
+ read -
+
+ == =========================================
+ 0 RF kill not enabled (radio on)
+ 1 SW based RF kill active (radio off)
+ 2 HW based RF kill active (radio off)
+ 3 Both HW and SW RF kill active (radio off)
+ == =========================================
+
write -
- 0 = If SW based RF kill active, turn the radio back on
- 1 = If radio is on, activate SW based RF kill
- NOTE: If you enable the SW based RF kill and then toggle the HW
- based RF kill from ON -> OFF -> ON, the radio will NOT come back on
-
- ucode
+ == ==================================================
+ 0 If SW based RF kill active, turn the radio back on
+ 1 If radio is on, activate SW based RF kill
+ == ==================================================
+
+ .. note::
+
+ If you enable the SW based RF kill and then toggle the HW
+ based RF kill from ON -> OFF -> ON, the radio will NOT come back on
+
+ ucode
read-only access to the ucode version number
led
read -
- 0 = LED code disabled
- 1 = LED code enabled
+
+ == =================
+ 0 LED code disabled
+ 1 LED code enabled
+ == =================
+
write -
- 0 = Disable LED code
- 1 = Enable LED code
- NOTE: The LED code has been reported to hang some systems when
- running ifconfig and is therefore disabled by default.
+ == ================
+ 0 Disable LED code
+ 1 Enable LED code
+ == ================
+
+
+ .. note::
+
+ The LED code has been reported to hang some systems when
+ running ifconfig and is therefore disabled by default.
1.5. Supported channels
------------------------------------------------
+-----------------------
Upon loading the Intel(R) PRO/Wireless 2915ABG Driver for Linux, a
message stating the detected geography code and the number of 802.11
@@ -326,44 +363,59 @@ channels supported by the card will be displayed in the log.
The geography code corresponds to a regulatory domain as shown in the
table below.
- Supported channels
-Code Geography 802.11bg 802.11a
-
---- Restricted 11 0
-ZZF Custom US/Canada 11 8
-ZZD Rest of World 13 0
-ZZA Custom USA & Europe & High 11 13
-ZZB Custom NA & Europe 11 13
-ZZC Custom Japan 11 4
-ZZM Custom 11 0
-ZZE Europe 13 19
-ZZJ Custom Japan 14 4
-ZZR Rest of World 14 0
-ZZH High Band 13 4
-ZZG Custom Europe 13 4
-ZZK Europe 13 24
-ZZL Europe 11 13
-
-
-2. Ad-Hoc Networking
------------------------------------------------
-
-When using a device in an Ad-Hoc network, it is useful to understand the
-sequence and requirements for the driver to be able to create, join, or
+ +------+----------------------------+--------------------+
+ | | | Supported channels |
+ | Code | Geography +----------+---------+
+ | | | 802.11bg | 802.11a |
+ +======+============================+==========+=========+
+ | --- | Restricted | 11 | 0 |
+ +------+----------------------------+----------+---------+
+ | ZZF | Custom US/Canada | 11 | 8 |
+ +------+----------------------------+----------+---------+
+ | ZZD | Rest of World | 13 | 0 |
+ +------+----------------------------+----------+---------+
+ | ZZA | Custom USA & Europe & High | 11 | 13 |
+ +------+----------------------------+----------+---------+
+ | ZZB | Custom NA & Europe | 11 | 13 |
+ +------+----------------------------+----------+---------+
+ | ZZC | Custom Japan | 11 | 4 |
+ +------+----------------------------+----------+---------+
+ | ZZM | Custom | 11 | 0 |
+ +------+----------------------------+----------+---------+
+ | ZZE | Europe | 13 | 19 |
+ +------+----------------------------+----------+---------+
+ | ZZJ | Custom Japan | 14 | 4 |
+ +------+----------------------------+----------+---------+
+ | ZZR | Rest of World | 14 | 0 |
+ +------+----------------------------+----------+---------+
+ | ZZH | High Band | 13 | 4 |
+ +------+----------------------------+----------+---------+
+ | ZZG | Custom Europe | 13 | 4 |
+ +------+----------------------------+----------+---------+
+ | ZZK | Europe | 13 | 24 |
+ +------+----------------------------+----------+---------+
+ | ZZL | Europe | 11 | 13 |
+ +------+----------------------------+----------+---------+
+
+2. Ad-Hoc Networking
+=====================
+
+When using a device in an Ad-Hoc network, it is useful to understand the
+sequence and requirements for the driver to be able to create, join, or
merge networks.
-The following attempts to provide enough information so that you can
-have a consistent experience while using the driver as a member of an
+The following attempts to provide enough information so that you can
+have a consistent experience while using the driver as a member of an
Ad-Hoc network.
2.1. Joining an Ad-Hoc Network
------------------------------------------------
+------------------------------
-The easiest way to get onto an Ad-Hoc network is to join one that
+The easiest way to get onto an Ad-Hoc network is to join one that
already exists.
2.2. Creating an Ad-Hoc Network
------------------------------------------------
+-------------------------------
An Ad-Hoc network is created using the syntax of the Wireless tool.
@@ -371,21 +423,21 @@ For Example:
iwconfig eth1 mode ad-hoc essid testing channel 2
2.3. Merging Ad-Hoc Networks
------------------------------------------------
+----------------------------
-3. Interaction with Wireless Tools
------------------------------------------------
+3. Interaction with Wireless Tools
+==================================
3.1 iwconfig mode
------------------------------------------------
+-----------------
When configuring the mode of the adapter, all run-time configured parameters
are reset to the value used when the module was loaded. This includes
channels, rates, ESSID, etc.
3.2 iwconfig sens
------------------------------------------------
+-----------------
The 'iwconfig ethX sens XX' command will not set the signal sensitivity
threshold, as described in iwconfig documentation, but rather the number
@@ -394,35 +446,35 @@ to another access point. At the same time, it will set the disassociation
threshold to 3 times the given value.
-4. About the Version Numbers
------------------------------------------------
+4. About the Version Numbers
+=============================
-Due to the nature of open source development projects, there are
-frequently changes being incorporated that have not gone through
-a complete validation process. These changes are incorporated into
+Due to the nature of open source development projects, there are
+frequently changes being incorporated that have not gone through
+a complete validation process. These changes are incorporated into
development snapshot releases.
-Releases are numbered with a three level scheme:
+Releases are numbered with a three level scheme:
major.minor.development
Any version where the 'development' portion is 0 (for example
-1.0.0, 1.1.0, etc.) indicates a stable version that will be made
+1.0.0, 1.1.0, etc.) indicates a stable version that will be made
available for kernel inclusion.
Any version where the 'development' portion is not a 0 (for
example 1.0.1, 1.1.5, etc.) indicates a development version that is
-being made available for testing and cutting edge users. The stability
+being made available for testing and cutting edge users. The stability
and functionality of the development releases are not known. We make
efforts to try and keep all snapshots reasonably stable, but due to the
-frequency of their release, and the desire to get those releases
+frequency of their release, and the desire to get those releases
available as quickly as possible, unknown anomalies should be expected.
The major version number will be incremented when significant changes
are made to the driver. Currently, there are no major changes planned.
-5. Firmware installation
-----------------------------------------------
+5. Firmware installation
+========================
The driver requires a firmware image, download it and extract the
files under /lib/firmware (or wherever your hotplug's firmware.agent
@@ -433,40 +485,42 @@ The firmware can be downloaded from the following URL:
http://ipw2200.sf.net/
-6. Support
------------------------------------------------
+6. Support
+==========
-For direct support of the 1.0.0 version, you can contact
+For direct support of the 1.0.0 version, you can contact
http://supportmail.intel.com, or you can use the open source project
support.
For general information and support, go to:
-
+
http://ipw2200.sf.net/
-7. License
------------------------------------------------
+7. License
+==========
- Copyright(c) 2003 - 2006 Intel Corporation. All rights reserved.
+ Copyright |copy| 2003 - 2006 Intel Corporation. All rights reserved.
- This program is free software; you can redistribute it and/or modify it
- under the terms of the GNU General Public License version 2 as
+ This program is free software; you can redistribute it and/or modify it
+ under the terms of the GNU General Public License version 2 as
published by the Free Software Foundation.
-
- This program is distributed in the hope that it will be useful, but WITHOUT
- ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+
+ This program is distributed in the hope that it will be useful, but WITHOUT
+ ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.
-
+
You should have received a copy of the GNU General Public License along with
- this program; if not, write to the Free Software Foundation, Inc., 59
+ this program; if not, write to the Free Software Foundation, Inc., 59
Temple Place - Suite 330, Boston, MA 02111-1307, USA.
-
+
The full GNU General Public License is included in this distribution in the
file called LICENSE.
-
+
Contact Information:
+
James P. Ketrenos <ipw2100-admin@linux.intel.com>
+
Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124-6497
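
Note on the ipw2200 hunk above: the numeric codes in the `set_mode` table (1-7) and in the `rf_kill` read table (0-3) are both consistent with simple bit masks (a=1, b=2, g=4 for the mode; SW=1, HW=2 for RF kill). That is an observation about the tables, not something the text states about the driver internals. The Python sketch below reproduces the two tables under that assumption; all function names are hypothetical.

```python
# Bit assignments inferred from the set_mode table (an assumption, see above).
IEEE_MODE_BITS = (("a", 1), ("b", 2), ("g", 4))


def ieee_mode_name(mode: int) -> str:
    """Return e.g. '802.11bg' for mode 6, following the set_mode table."""
    if not 1 <= mode <= 7:
        raise ValueError("mode must be a number in the range 1-7")
    letters = "".join(letter for letter, bit in IEEE_MODE_BITS if mode & bit)
    return "802.11" + letters


def needs_2915(mode: int) -> bool:
    """Every table row that includes 802.11a is marked '2915 only'."""
    ieee_mode_name(mode)  # reuse the range check
    return bool(mode & 1)


def decode_rf_kill(value: int) -> dict:
    """Decode a value read from the rf_kill sysfs entry (0-3)."""
    if not 0 <= value <= 3:
        raise ValueError("rf_kill reads as a number in the range 0-3")
    return {
        "sw_kill": bool(value & 1),  # 1 and 3: SW based RF kill active
        "hw_kill": bool(value & 2),  # 2 and 3: HW based RF kill active
        "radio_on": value == 0,      # only 0 means the radio is on
    }
```

Writing to `rf_kill` only accepts 0 or 1 and only affects the SW based kill, so a decoded `hw_kill` of True cannot be cleared from software.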
diff --git a/Documentation/networking/ray_cs.txt b/Documentation/networking/device_drivers/wifi/ray_cs.rst
index c0c12307ed9d..9a46d1ae8f20 100644
--- a/Documentation/networking/ray_cs.txt
+++ b/Documentation/networking/device_drivers/wifi/ray_cs.rst
@@ -1,6 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. include:: <isonum.txt>
+
+=========================
+Raylink wireless LAN card
+=========================
+
September 21, 1999
-Copyright (c) 1998 Corey Thomas (corey@world.std.com)
+Copyright |copy| 1998 Corey Thomas (corey@world.std.com)
This file is the documentation for the Raylink Wireless LAN card driver for
Linux. The Raylink wireless LAN card is a PCMCIA card which provides IEEE
@@ -13,7 +21,7 @@ wireless LAN cards.
As of kernel 2.3.18, the ray_cs driver is part of the Linux kernel
source. My web page for the development of ray_cs is at
-http://web.ralinktech.com/ralink/Home/Support/Linux.html
+http://web.ralinktech.com/ralink/Home/Support/Linux.html
and I can be emailed at corey@world.std.com
The kernel driver is based on ray_cs-1.62.tgz
@@ -29,6 +37,7 @@ with nondefault parameters, they can be edited in
will find them all.
Information on card services is available at:
+
http://pcmcia-cs.sourceforge.net/
@@ -39,72 +48,78 @@ the driver.
Currently, ray_cs is not part of David Hinds card services package,
so the following magic is required.
-At the end of the /etc/pcmcia/config.opts file, add the line:
-source ./ray_cs.opts
+At the end of the /etc/pcmcia/config.opts file, add the line:
+source ./ray_cs.opts
This will make card services read the ray_cs.opts file
when starting. Create the file /etc/pcmcia/ray_cs.opts containing the
-following:
+following::
-#### start of /etc/pcmcia/ray_cs.opts ###################
-# Configuration options for Raylink Wireless LAN PCMCIA card
-device "ray_cs"
- class "network" module "misc/ray_cs"
+ #### start of /etc/pcmcia/ray_cs.opts ###################
+ # Configuration options for Raylink Wireless LAN PCMCIA card
+ device "ray_cs"
+ class "network" module "misc/ray_cs"
-card "RayLink PC Card WLAN Adapter"
- manfid 0x01a6, 0x0000
- bind "ray_cs"
+ card "RayLink PC Card WLAN Adapter"
+ manfid 0x01a6, 0x0000
+ bind "ray_cs"
-module "misc/ray_cs" opts ""
-#### end of /etc/pcmcia/ray_cs.opts #####################
+ module "misc/ray_cs" opts ""
+ #### end of /etc/pcmcia/ray_cs.opts #####################
To join an existing network with
-different parameters, contact the network administrator for the
+different parameters, contact the network administrator for the
configuration information, and edit /etc/pcmcia/ray_cs.opts.
Add the parameters below between the empty quotes.
Parameters for ray_cs driver which may be specified in ray_cs.opts:
-bc integer 0 = normal mode (802.11 timing)
- 1 = slow down inter frame timing to allow
- operation with older breezecom access
- points.
-
-beacon_period integer beacon period in Kilo-microseconds
- legal values = must be integer multiple
- of hop dwell
- default = 256
-
-country integer 1 = USA (default)
- 2 = Europe
- 3 = Japan
- 4 = Korea
- 5 = Spain
- 6 = France
- 7 = Israel
- 8 = Australia
+=============== =============== =============================================
+bc integer 0 = normal mode (802.11 timing),
+ 1 = slow down inter frame timing to allow
+ operation with older breezecom access
+ points.
+
+beacon_period integer beacon period in Kilo-microseconds,
+
+ legal values = must be integer multiple
+ of hop dwell
+
+ default = 256
+
+country integer 1 = USA (default),
+ 2 = Europe,
+ 3 = Japan,
+ 4 = Korea,
+ 5 = Spain,
+ 6 = France,
+ 7 = Israel,
+ 8 = Australia
essid string ESS ID - network name to join
+
string with maximum length of 32 chars
default value = "ADHOC_ESSID"
-hop_dwell integer hop dwell time in Kilo-microseconds
+hop_dwell integer hop dwell time in Kilo-microseconds
+
legal values = 16,32,64,128(default),256
irq_mask integer linux standard 16 bit value 1bit/IRQ
+
lsb is IRQ 0, bit 1 is IRQ 1 etc.
Used to restrict choice of IRQ's to use.
- Recommended method for controlling
- interrupts is in /etc/pcmcia/config.opts
+ Recommended method for controlling
+ interrupts is in /etc/pcmcia/config.opts
-net_type integer 0 (default) = adhoc network,
+net_type integer 0 (default) = adhoc network,
1 = infrastructure
phy_addr string string containing new MAC address in
hex, must start with x eg
x00008f123456
-psm integer 0 = continuously active
+psm integer 0 = continuously active,
1 = power save mode (not useful yet)
pc_debug integer (0-5) larger values for more verbose
@@ -114,14 +129,14 @@ ray_debug integer Replaced with pc_debug
ray_mem_speed integer defaults to 500
-sniffer integer 0 = not sniffer (default)
- 1 = sniffer which can be used to record all
- network traffic using tcpdump or similar,
- but no normal network use is allowed.
+sniffer integer 0 = not sniffer (default),
+ 1 = sniffer which can be used to record all
+ network traffic using tcpdump or similar,
+ but no normal network use is allowed.
-translate integer 0 = no translation (encapsulate frames)
+translate integer 0 = no translation (encapsulate frames),
1 = translation (RFC1042/802.1)
-
+=============== =============== =============================================
More on sniffer mode:
@@ -136,7 +151,7 @@ package which parses the 802.11 headers.
Known Problems and missing features
- Does not work with non x86
+ Does not work with non x86
Does not work with SMP
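
Note on the ray_cs hunk above: the parameter table describes `irq_mask` as a 16 bit value with one bit per IRQ (lsb is IRQ 0), and requires `beacon_period` to be an integer multiple of `hop_dwell`, whose legal values are 16, 32, 64, 128 and 256. The Python sketch below shows one way to compute and check such values before editing ray_cs.opts. The helper names are hypothetical and the driver itself is not involved.

```python
from typing import Iterable, List

# Legal hop_dwell values as listed in the parameter table (128 is the default).
HOP_DWELL_VALUES = (16, 32, 64, 128, 256)


def irq_mask(irqs: Iterable[int]) -> int:
    """Build the 16 bit mask: bit n set means IRQ n may be used."""
    mask = 0
    for irq in irqs:
        if not 0 <= irq <= 15:
            raise ValueError(f"IRQ {irq} does not fit in a 16 bit mask")
        mask |= 1 << irq
    return mask


def irqs_in_mask(mask: int) -> List[int]:
    """List the IRQ numbers whose bits are set in a 16 bit mask."""
    if not 0 <= mask <= 0xFFFF:
        raise ValueError("irq_mask must be a 16 bit value")
    return [irq for irq in range(16) if mask & (1 << irq)]


def check_timing(hop_dwell: int = 128, beacon_period: int = 256) -> None:
    """Raise ValueError if the two timing parameters break the table's rules."""
    if hop_dwell not in HOP_DWELL_VALUES:
        raise ValueError(f"hop_dwell must be one of {HOP_DWELL_VALUES}")
    if beacon_period <= 0 or beacon_period % hop_dwell != 0:
        raise ValueError("beacon_period must be an integer multiple of hop_dwell")
```

For example, restricting the card to IRQ 3 and IRQ 5 gives `hex(irq_mask([3, 5]))`, which is `0x28`. The table still recommends controlling interrupts in /etc/pcmcia/config.opts instead.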
diff --git a/Documentation/networking/device_drivers/wwan/index.rst b/Documentation/networking/device_drivers/wwan/index.rst
new file mode 100644
index 000000000000..370d8264d5dc
--- /dev/null
+++ b/Documentation/networking/device_drivers/wwan/index.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+WWAN Device Drivers
+===================
+
+Contents:
+
+.. toctree::
+ :maxdepth: 2
+
+ iosm
+ t7xx
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/device_drivers/wwan/iosm.rst b/Documentation/networking/device_drivers/wwan/iosm.rst
new file mode 100644
index 000000000000..aceb0223eb46
--- /dev/null
+++ b/Documentation/networking/device_drivers/wwan/iosm.rst
@@ -0,0 +1,96 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+.. Copyright (C) 2020-21 Intel Corporation
+
+.. _iosm_driver_doc:
+
+===========================================
+IOSM Driver for Intel M.2 PCIe based Modems
+===========================================
+The IOSM (IPC over Shared Memory) driver is a WWAN PCIe host driver developed
+for Linux or Chrome platforms for data exchange over the PCIe interface between
+the host platform and an Intel M.2 modem. The driver exposes an interface
+conforming to the MBIM protocol [1]. Any front end application (e.g. Modem
+Manager) can manage the MBIM interface to enable data communication towards WWAN.
+
+Basic usage
+===========
+MBIM functions are inactive when unmanaged. The IOSM driver only provides a
+userspace interface MBIM "WWAN PORT" representing MBIM control channel and does
+not play any role in managing the functionality. It is the job of a userspace
+application to detect port enumeration and enable MBIM functionality.
+
+Examples of such userspace applications are:
+
+- mbimcli (included with the libmbim [2] library), and
+- Modem Manager [3]
+
+Management applications must carry out the following actions to establish an
+MBIM IP session:
+
+- open the MBIM control channel
+- configure network connection settings
+- connect to network
+- configure IP network interface
+
+Management application development
+==================================
+The driver and userspace interfaces are described below. The MBIM protocol is
+described in [1] Mobile Broadband Interface Model v1.0 Errata-1.
+
+MBIM control channel userspace ABI
+----------------------------------
+
+/dev/wwan0mbim0 character device
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The driver exposes an MBIM interface to the MBIM function by implementing
+an MBIM WWAN port. The userspace end of the control channel pipe is a
+/dev/wwan0mbim0 character device. Applications shall use this interface for
+MBIM protocol communication.
+
+Fragmentation
+~~~~~~~~~~~~~
+The userspace application is responsible for all control message fragmentation
+and defragmentation as per the MBIM specification.
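
As a rough illustration of the fragmentation duty described above, the following Python sketch splits one control message payload into fragments that each fit within a negotiated control message size, and joins them again. The header layout (a 12-byte message header followed by an 8-byte fragment header, all little-endian 32-bit fields) is an assumption based on a reading of the MBIM specification [1] and applies only to message types that carry a fragment header; the function names are hypothetical and should be checked against the specification before use.

```python
import struct

# Assumed layout, little-endian 32-bit fields (verify against the MBIM spec):
MSG_HEADER = struct.Struct("<III")   # MessageType, MessageLength, TransactionId
FRAG_HEADER = struct.Struct("<II")   # TotalFragments, CurrentFragment
OVERHEAD = MSG_HEADER.size + FRAG_HEADER.size  # 20 bytes per fragment


def fragment(message_type, transaction_id, payload, max_control_size):
    """Split payload into fragments no larger than max_control_size bytes."""
    room = max_control_size - OVERHEAD
    if room <= 0:
        raise ValueError("negotiated size leaves no room for payload")
    # An empty payload still produces a single (header-only) fragment.
    chunks = [payload[i:i + room] for i in range(0, len(payload), room)] or [b""]
    fragments = []
    for index, chunk in enumerate(chunks):
        length = OVERHEAD + len(chunk)  # MessageLength covers this fragment only
        fragments.append(
            MSG_HEADER.pack(message_type, length, transaction_id)
            + FRAG_HEADER.pack(len(chunks), index)
            + chunk
        )
    return fragments


def defragment(fragments):
    """Reassemble in-order fragments; return (type, transaction id, payload)."""
    payload = bytearray()
    first = None
    for expected_index, frag in enumerate(fragments):
        msg_type, length, tid = MSG_HEADER.unpack_from(frag, 0)
        total, index = FRAG_HEADER.unpack_from(frag, MSG_HEADER.size)
        if length != len(frag):
            raise ValueError("MessageLength does not match fragment size")
        if first is None:
            first = (msg_type, tid, total)
        if (msg_type, tid, total) != first:
            raise ValueError("fragment belongs to a different message")
        if index != expected_index:
            raise ValueError("fragment out of order")
        payload += frag[OVERHEAD:]
    if first is None or first[2] != len(fragments):
        raise ValueError("missing fragments")
    return first[0], first[1], bytes(payload)
```

In a real application each returned fragment would be passed to a separate write() on the control channel device, and the receive side would collect fragments from successive read() calls before calling the reassembly helper.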
+
+/dev/wwan0mbim0 write()
+~~~~~~~~~~~~~~~~~~~~~~~
+The MBIM control messages from the management application must not exceed the
+negotiated control message size.
+
+/dev/wwan0mbim0 read()
+~~~~~~~~~~~~~~~~~~~~~~
+The management application must accept control messages of up to the negotiated
+control message size.
+
+MBIM data channel userspace ABI
+-------------------------------
+
+wwan0-X network device
+~~~~~~~~~~~~~~~~~~~~~~
+The IOSM driver exposes an IP link interface "wwan0-X" of type "wwan" for IP
+traffic. The iproute2 network utility is used for creating the "wwan0-X" network
+interface and for associating it with an MBIM IP session. The driver supports
+up to 8 IP sessions for simultaneous IP communication.
+
+The userspace management application is responsible for creating a new IP link
+prior to establishing an MBIM IP session where the SessionId is greater than 0.
+
+For example, creating a new IP link for an MBIM IP session with SessionId 1:
+
+ ip link add dev wwan0-1 parentdev wwan0 type wwan linkid 1
+
+The driver will automatically map the "wwan0-1" network device to MBIM IP
+session 1.
+
+References
+==========
+[1] "MBIM (Mobile Broadband Interface Model) Errata-1"
+ - https://www.usb.org/document-library/
+
+[2] libmbim - "a glib-based library for talking to WWAN modems and
+ devices which speak the Mobile Interface Broadband Model (MBIM)
+ protocol"
+ - http://www.freedesktop.org/wiki/Software/libmbim/
+
+[3] Modem Manager - "a DBus-activated daemon which controls mobile
+ broadband (2G/3G/4G) devices and connections"
+ - http://www.freedesktop.org/wiki/Software/ModemManager/
diff --git a/Documentation/networking/device_drivers/wwan/t7xx.rst b/Documentation/networking/device_drivers/wwan/t7xx.rst
new file mode 100644
index 000000000000..dd5b731957ca
--- /dev/null
+++ b/Documentation/networking/device_drivers/wwan/t7xx.rst
@@ -0,0 +1,120 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+.. Copyright (C) 2020-21 Intel Corporation
+
+.. _t7xx_driver_doc:
+
+============================================
+t7xx driver for MTK PCIe based T700 5G modem
+============================================
+The t7xx driver is a WWAN PCIe host driver developed for Linux or Chrome OS platforms
+for data exchange over the PCIe interface between the host platform and MediaTek's
+T700 5G modem. The driver exposes an interface conforming to the MBIM protocol [1].
+Any front end application (e.g. Modem Manager) can manage the MBIM interface to enable
+data communication towards WWAN. The driver also provides an interface to interact
+with the MediaTek modem via AT commands.
+
+Basic usage
+===========
+MBIM & AT functions are inactive when unmanaged. The t7xx driver provides
+WWAN port userspace interfaces representing MBIM & AT control channels and does
+not play any role in managing their functionality. It is the job of a userspace
+application to detect port enumeration and enable MBIM & AT functionalities.
+
+Examples of such userspace applications are:
+
+- mbimcli (included with the libmbim [2] library), and
+- Modem Manager [3]
+
+Management applications must carry out the following actions to establish an
+MBIM IP session:
+
+- open the MBIM control channel
+- configure network connection settings
+- connect to network
+- configure IP network interface
+
+Management applications must carry out the following action to send an AT
+command and receive the response:
+
+- open the AT control channel using a UART tool or a special user tool
+
+Management application development
+==================================
+The driver and userspace interfaces are described below. The MBIM protocol is
+described in [1] Mobile Broadband Interface Model v1.0 Errata-1.
+
+MBIM control channel userspace ABI
+----------------------------------
+
+/dev/wwan0mbim0 character device
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The driver exposes an MBIM interface to the MBIM function by implementing
+an MBIM WWAN port. The userspace end of the control channel pipe is a
+/dev/wwan0mbim0 character device. Applications shall use this interface for
+MBIM protocol communication.
+
+Fragmentation
+~~~~~~~~~~~~~
+The userspace application is responsible for all control message fragmentation
+and defragmentation as per the MBIM specification.
+
+/dev/wwan0mbim0 write()
+~~~~~~~~~~~~~~~~~~~~~~~
+The MBIM control messages from the management application must not exceed the
+negotiated control message size.
+
+/dev/wwan0mbim0 read()
+~~~~~~~~~~~~~~~~~~~~~~
+The management application must accept control messages of up to the negotiated
+control message size.
+
+MBIM data channel userspace ABI
+-------------------------------
+
+wwan0-X network device
+~~~~~~~~~~~~~~~~~~~~~~
+The t7xx driver exposes an IP link interface "wwan0-X" of type "wwan" for IP
+traffic. The iproute2 network utility is used for creating the "wwan0-X" network
+interface and for associating it with an MBIM IP session.
+
+The userspace management application is responsible for creating a new IP link
+prior to establishing an MBIM IP session where the SessionId is greater than 0.
+
+For example, creating a new IP link for an MBIM IP session with SessionId 1:
+
+ ip link add dev wwan0-1 parentdev wwan0 type wwan linkid 1
+
+The driver will automatically map the "wwan0-1" network device to MBIM IP
+session 1.
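
A management application that creates links for several sessions may want to build the command shown above programmatically. The following Python sketch only assembles the argument vector; the helper name and the default parent device name "wwan0" are illustrative assumptions, and the check on the session number simply mirrors the statement that an explicit link is needed for a SessionId greater than 0.

```python
def wwan_link_add_argv(session_id, parent="wwan0"):
    """Build the 'ip link add' argument vector for one MBIM IP session.

    Hypothetical helper: it mirrors the documented example command and
    does not execute anything itself.
    """
    if session_id < 1:
        # Per the text above, only SessionId > 0 needs an explicit link.
        raise ValueError("an explicit link is only created for SessionId > 0")
    return [
        "ip", "link", "add",
        "dev", f"{parent}-{session_id}",
        "parentdev", parent,
        "type", "wwan",
        "linkid", str(session_id),
    ]
```

The resulting list can be handed to subprocess.run() by a privileged process; passing a list rather than a shell string avoids quoting problems with device names.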
+
+AT port userspace ABI
+----------------------------------
+
+/dev/wwan0at0 character device
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The driver exposes an AT port by implementing an AT WWAN port.
+The userspace end of the control port is a /dev/wwan0at0 character
+device. Applications shall use this interface to issue AT commands.
+
+MediaTek's T700 modem supports the 3GPP TS 27.007 [4] specification.
+
+References
+==========
+[1] *MBIM (Mobile Broadband Interface Model) Errata-1*
+
+- https://www.usb.org/document-library/
+
+[2] *libmbim "a glib-based library for talking to WWAN modems and devices which
+speak the Mobile Interface Broadband Model (MBIM) protocol"*
+
+- http://www.freedesktop.org/wiki/Software/libmbim/
+
+[3] *Modem Manager "a DBus-activated daemon which controls mobile broadband
+(2G/3G/4G/5G) devices and connections"*
+
+- http://www.freedesktop.org/wiki/Software/ModemManager/
+
+[4] *Specification # 27.007 - 3GPP*
+
+- https://www.3gpp.org/DynaReport/27007.htm
diff --git a/Documentation/networking/devlink/am65-nuss-cpsw-switch.rst b/Documentation/networking/devlink/am65-nuss-cpsw-switch.rst
new file mode 100644
index 000000000000..1e589c26abff
--- /dev/null
+++ b/Documentation/networking/devlink/am65-nuss-cpsw-switch.rst
@@ -0,0 +1,26 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+am65-cpsw-nuss devlink support
+==============================
+
+This document describes the devlink features implemented by the ``am65-cpsw-nuss``
+device driver.
+
+Parameters
+==========
+
+The ``am65-cpsw-nuss`` driver implements the following driver-specific
+parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``switch_mode``
+ - Boolean
+ - runtime
+ - Enable switch mode
diff --git a/Documentation/networking/devlink/bnxt.rst b/Documentation/networking/devlink/bnxt.rst
index 82ef9ec46707..a4fb27663cd6 100644
--- a/Documentation/networking/devlink/bnxt.rst
+++ b/Documentation/networking/devlink/bnxt.rst
@@ -22,6 +22,8 @@ Parameters
- Permanent
* - ``msix_vec_per_pf_min``
- Permanent
+ * - ``enable_remote_dev_reset``
+ - Runtime
The ``bnxt`` driver also implements the following driver-specific
parameters.
@@ -51,6 +53,9 @@ The ``bnxt_en`` driver reports the following versions
* - Name
- Type
- Description
+ * - ``board.id``
+ - fixed
+ - Part number identifying the board design
* - ``asic.id``
- fixed
- ASIC design identifier
@@ -63,12 +68,15 @@ The ``bnxt_en`` driver reports the following versions
* - ``fw``
- stored, running
- Overall board firmware version
- * - ``fw.app``
- - stored, running
- - Data path firmware version
* - ``fw.mgmt``
- stored, running
- - Management firmware version
+ - NIC hardware resource management firmware version
+ * - ``fw.mgmt.api``
+ - running
+ - Minimum firmware interface spec version supported between driver and firmware
+ * - ``fw.nsci``
+ - stored, running
+ - General platform management firmware version
* - ``fw.roce``
- stored, running
- RoCE management firmware version
diff --git a/Documentation/networking/devlink/devlink-dpipe.rst b/Documentation/networking/devlink/devlink-dpipe.rst
index 468fe1001b74..af37f250df43 100644
--- a/Documentation/networking/devlink/devlink-dpipe.rst
+++ b/Documentation/networking/devlink/devlink-dpipe.rst
@@ -52,7 +52,7 @@ purposes as a standard complementary tool. The system's view from
``devlink-dpipe`` should change according to the changes done by the
standard configuration tools.
-For example, it’s quiet common to implement Access Control Lists (ACL)
+For example, it’s quite common to implement Access Control Lists (ACL)
using Ternary Content Addressable Memory (TCAM). The TCAM memory can be
divided into TCAM regions. Complex TC filters can have multiple rules with
different priorities and different lookup keys. On the other hand hardware
diff --git a/Documentation/networking/devlink/devlink-flash.rst b/Documentation/networking/devlink/devlink-flash.rst
new file mode 100644
index 000000000000..603e732f00cc
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-flash.rst
@@ -0,0 +1,121 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+.. _devlink_flash:
+
+=============
+Devlink Flash
+=============
+
+The ``devlink-flash`` API allows updating device firmware. It replaces the
+older ``ethtool-flash`` mechanism, and doesn't require taking any
+networking locks in the kernel to perform the flash update. Example use::
+
+ $ devlink dev flash pci/0000:05:00.0 file flash-boot.bin
+
+Note that the file name is a path relative to the firmware loading path
+(usually ``/lib/firmware/``). Drivers may send status updates to inform
+user space about the progress of the update operation.
+
+Overwrite Mask
+==============
+
+The ``devlink-flash`` command allows optionally specifying a mask indicating
+how the device should handle subsections of flash components when updating.
+This mask indicates the set of sections which are allowed to be overwritten.
+
+.. list-table:: List of overwrite mask bits
+ :widths: 5 95
+
+ * - Name
+ - Description
+ * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS``
+ - Indicates that the device should overwrite settings in the components
+ being updated with the settings found in the provided image.
+ * - ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS``
+ - Indicates that the device should overwrite identifiers in the
+ components being updated with the identifiers found in the provided
+ image. This includes MAC addresses, serial IDs, and similar device
+ identifiers.
+
+Multiple overwrite bits may be combined and requested together. If no bits
+are provided, it is expected that the device only update firmware binaries
+in the components being updated. Settings and identifiers are expected to be
+preserved across the update. A device may not support every combination and
+the driver for such a device must reject any combination which cannot be
+faithfully implemented.
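
The rejection rule in the paragraph above can be sketched in a few lines of Python. The bit positions and the helper name are assumptions made for illustration only; the authoritative values live in the kernel's devlink UAPI header, and real drivers perform this check in C.

```python
# Illustrative bit values (assumption; check the devlink UAPI header).
OVERWRITE_SETTINGS = 1 << 0
OVERWRITE_IDENTIFIERS = 1 << 1


def validate_overwrite_mask(requested, supported_combinations):
    """Return the requested mask, or raise if the device cannot honour it.

    supported_combinations is the set of complete masks the device can
    implement faithfully. A set of whole masks is used instead of a
    bitmask of supported bits because a device may support two bits
    together without supporting each of them alone.
    """
    if requested not in supported_combinations:
        raise ValueError(f"unsupported overwrite mask: {requested:#x}")
    return requested
```

For example, a device that can only overwrite identifiers together with settings would list the masks 0, settings, and settings plus identifiers, and a request for identifiers alone would be rejected.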
+
+Firmware Loading
+================
+
+Devices which require firmware to operate usually store it in non-volatile
+memory on the board, e.g. flash. Some devices store only basic firmware on
+the board, and the driver loads the rest from disk during probing.
+``devlink-info`` allows users to query firmware information (loaded
+components and versions).
+
+In other cases the device can store the image on the board, load it from
+disk, or automatically flash a new image from disk. The ``fw_load_policy``
+devlink parameter can be used to control this behavior
+(:ref:`Documentation/networking/devlink/devlink-params.rst <devlink_params_generic>`).
+
+On-disk firmware files are usually stored in ``/lib/firmware/``.
+
+Firmware Version Management
+===========================
+
+Drivers are expected to implement ``devlink-flash`` and ``devlink-info``
+functionality, which together allow for implementing vendor-independent
+automated firmware update facilities.
+
+``devlink-info`` exposes the ``driver`` name and three version groups
+(``fixed``, ``running``, ``stored``).
+
+The ``driver`` attribute and ``fixed`` group identify the specific device
+design, e.g. for looking up applicable firmware updates. This is why
+``serial_number`` is not part of the ``fixed`` versions (even though it
+is fixed) - ``fixed`` versions should identify the design, not a single
+device.
+
+``running`` and ``stored`` firmware versions identify the firmware running
+on the device, and firmware which will be activated after reboot or device
+reset.
+
+The firmware update agent is supposed to be able to follow this simple
+algorithm to update firmware contents, regardless of the device vendor:
+
+.. code-block:: sh
+
+ # Get unique HW design identifier
+ $hw_id = devlink-dev-info['fixed']
+
+ # Find out which FW flash we want to use for this NIC
+ $want_flash_vers = some-db-backed.lookup($hw_id, 'flash')
+
+ # Update flash if necessary
+ if $want_flash_vers != devlink-dev-info['stored']:
+ $file = some-db-backed.download($hw_id, 'flash')
+ devlink-dev-flash($file)
+
+ # Find out the expected overall firmware versions
+ $want_fw_vers = some-db-backed.lookup($hw_id, 'all')
+
+ # Update on-disk file if necessary
+ if $want_fw_vers != devlink-dev-info['running']:
+ $file = some-db-backed.download($hw_id, 'disk')
+ write($file, '/lib/firmware/')
+
+ # Try device reset, if available
+ if $want_fw_vers != devlink-dev-info['running']:
+ devlink-reset()
+
+ # Reboot, if reset wasn't enough
+ if $want_fw_vers != devlink-dev-info['running']:
+ reboot()
+
+Note that each reference to ``devlink-dev-info`` in this pseudo-code
+is expected to fetch up-to-date information from the kernel.
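
The pseudo-code above can be turned into a small runnable Python sketch. Everything in it is a stand-in: FakeDevice and FakeDb are hypothetical test doubles, each version group is reduced to a single string, and no real devlink call is made. The point is only to show the order of the checks and that device information is re-read before every decision.

```python
class FakeDevice:
    """Hypothetical stand-in for one devlink device (not a real API)."""

    def __init__(self, stored, running, reset_activates):
        self.fixed = "board-A"
        self.stored = stored
        self.running = running
        self.pending = running          # what the next activation will run
        self.reset_activates = reset_activates

    def info(self):
        # Always returns fresh data, like a new devlink-info query.
        return {"fixed": self.fixed, "stored": self.stored,
                "running": self.running}

    def flash(self, version):
        self.stored = self.pending = version

    def write_disk_file(self, version):
        self.pending = version

    def reset(self):
        if self.reset_activates:
            self.running = self.pending

    def reboot(self):
        self.running = self.pending


class FakeDb:
    """Hypothetical firmware database that wants one version everywhere."""

    def __init__(self, version):
        self.version = version

    def lookup(self, hw_id, kind):
        return self.version

    def download(self, hw_id, kind):
        return self.version


def update_firmware(dev, db):
    """Follow the documented algorithm; return the actions taken, in order."""
    actions = []
    hw_id = dev.info()["fixed"]

    want_flash = db.lookup(hw_id, "flash")
    if want_flash != dev.info()["stored"]:
        dev.flash(db.download(hw_id, "flash"))
        actions.append("flash")

    want_fw = db.lookup(hw_id, "all")
    if want_fw != dev.info()["running"]:
        dev.write_disk_file(db.download(hw_id, "disk"))
        actions.append("write-disk")

    if want_fw != dev.info()["running"]:   # try device reset first
        dev.reset()
        actions.append("reset")

    if want_fw != dev.info()["running"]:   # reboot only if reset was not enough
        dev.reboot()
        actions.append("reboot")
    return actions
```

Running the function a second time against an up-to-date device performs no actions, which is the property a fleet-wide update agent relies on.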
+
+For the convenience of identifying firmware files some vendors add
+``bundle_id`` information to the firmware versions. This meta-version covers
+multiple per-component versions and can be used e.g. in firmware file names
+(all component versions could get rather long.)
diff --git a/Documentation/networking/devlink/devlink-health.rst b/Documentation/networking/devlink/devlink-health.rst
index 0c99b11f05f9..e37f77734b5b 100644
--- a/Documentation/networking/devlink/devlink-health.rst
+++ b/Documentation/networking/devlink/devlink-health.rst
@@ -24,7 +24,7 @@ attributes of the health reporting and recovery procedures.
The ``devlink`` health reporter:
Device driver creates a "health reporter" per each error/health type.
-Error/Health type can be a known/generic (eg pci error, fw error, rx/tx error)
+Error/Health type can be a known/generic (e.g. PCI error, fw error, rx/tx error)
or unknown (driver specific).
For each registered health reporter a driver can issue error/health reports
asynchronously. All health reports handling is done by ``devlink``.
@@ -48,6 +48,7 @@ Once an error is reported, devlink health will perform the following actions:
* Object dump is being taken and saved at the reporter instance (as long as
there is no other dump which is already stored)
* Auto recovery attempt is being done. Depends on:
+
- Auto-recovery configuration
- Grace period vs. time passed since last recover
@@ -72,14 +73,18 @@ via ``devlink``, e.g per error type (per health reporter):
* - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
- Allows reporter-related configuration setting.
* - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
- - Triggers a reporter's recovery procedure.
+ - Triggers reporter's recovery procedure.
+ * - ``DEVLINK_CMD_HEALTH_REPORTER_TEST``
+ - Triggers a fake health event on the reporter. The effects of the test
+ event in terms of recovery flow should follow closely that of a real
+ event.
* - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
- - Retrieves diagnostics data from a reporter on a device.
+ - Retrieves current device state related to the reporter.
* - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
- Retrieves the last stored dump. Devlink health
- saves a single dump. If an dump is not already stored by the devlink
+ saves a single dump. If a dump is not already stored by devlink
for this reporter, devlink generates a new dump.
- dump output is defined by the reporter.
+ Dump output is defined by the reporter.
* - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
- Clears the last saved dump file for the specified reporter.
@@ -93,7 +98,7 @@ The following diagram provides a general overview of ``devlink-health``::
+--------------------------+
|request for ops
|(diagnose,
- mlx5_core devlink |recover,
+ driver devlink |recover,
|dump)
+--------+ +--------------------------+
| | | reporter| |
diff --git a/Documentation/networking/devlink/devlink-info.rst b/Documentation/networking/devlink/devlink-info.rst
index 70981dd1b981..7572bf6de5c1 100644
--- a/Documentation/networking/devlink/devlink-info.rst
+++ b/Documentation/networking/devlink/devlink-info.rst
@@ -5,34 +5,121 @@ Devlink Info
============
The ``devlink-info`` mechanism enables device drivers to report device
-information in a generic fashion. It is extensible, and enables exporting
-even device or driver specific information.
+(hardware and firmware) information in a standard, extensible fashion.
-devlink supports representing the following types of versions
+The original motivation for the ``devlink-info`` API was twofold:
-.. list-table:: List of version types
+ - making it possible to automate device and firmware management in a fleet
+ of machines in a vendor-independent fashion (see also
+ :ref:`Documentation/networking/devlink/devlink-flash.rst <devlink_flash>`);
+ - naming the per-component FW versions (as opposed to the crowded ethtool
+ version string).
+
+``devlink-info`` supports reporting multiple types of objects. Reporting driver
+versions is generally discouraged - here, and via any other Linux API.
+
+.. list-table:: List of top level info objects
:widths: 5 95
- * - Type
+ * - Name
- Description
+ * - ``driver``
+ - Name of the currently used device driver, also available through sysfs.
+
+ * - ``serial_number``
+ - Serial number of the device.
+
+ This is usually the serial number of the ASIC, also often available
+ in PCI config space of the device in the *Device Serial Number*
+ capability.
+
+ The serial number should be unique per physical device.
+ Sometimes the serial number of the device is only 48 bits long (the
+ length of the Ethernet MAC address), and since PCI DSN is 64 bits long
+ devices pad or encode additional information into the serial number.
+ One example is adding port ID or PCI interface ID in the extra two bytes.
+ Drivers should make sure to strip or normalize any such padding
+ or interface ID, and report only the part of the serial number
+ which uniquely identifies the hardware. In other words serial number
+ reported for two ports of the same device or on two hosts of
+ a multi-host device should be identical.
+
+ * - ``board.serial_number``
+ - Board serial number of the device.
+
+ This is usually the serial number of the board, often available in
+ PCI *Vital Product Data*.
+
* - ``fixed``
- - Represents fixed versions, which cannot change. For example,
+ - Group for hardware identifiers, and versions of components
+ which are not field-updatable.
+
+ Versions in this section identify the device design. For example,
component identifiers or the board version reported in the PCI VPD.
+ Data in ``devlink-info`` should be broken into the smallest logical
+ components, e.g. PCI VPD may concatenate various information
+ to form the Part Number string, while in ``devlink-info`` all parts
+ should be reported as separate items.
+
+ This group must not contain any frequently changing identifiers,
+ such as serial numbers. See
+ :ref:`Documentation/networking/devlink/devlink-flash.rst <devlink_flash>`
+ to understand why.
+
* - ``running``
- - Represents the version of the currently running component. For
- example the running version of firmware. These versions generally
- only update after a reboot.
+ - Group for information about currently running software/firmware.
+ These versions often only update after a reboot, sometimes device reset.
+
* - ``stored``
- - Represents the version of a component as stored, such as after a
- flash update. Stored values should update to reflect changes in the
- flash even if a reboot has not yet occurred.
+ - Group for software/firmware versions in device flash.
+
+ Stored values must update to reflect changes in the flash even
+ if reboot has not yet occurred. If device is not capable of updating
+ ``stored`` versions when new software is flashed, it must not report
+ them.
+
+Each version can be reported at most once in each version group. Firmware
+components stored on the flash should feature in both the ``running`` and
+``stored`` sections, if device is capable of reporting ``stored`` versions
+(see :ref:`Documentation/networking/devlink/devlink-flash.rst <devlink_flash>`).
+In case software/firmware components are loaded from the disk (e.g.
+``/lib/firmware``) only the running version should be reported via
+the kernel API.
Generic Versions
================
It is expected that drivers use the following generic names for exporting
-version information. Other information may be exposed using driver-specific
-names, but these should be documented in the driver-specific file.
+version information. If a generic name for a given component doesn't exist yet,
+driver authors should consult existing driver-specific versions and attempt
+reuse. As last resort, if a component is truly unique, using driver-specific
+names is allowed, but these should be documented in the driver-specific file.
+
+All versions should try to use the following terminology:
+
+.. list-table:: List of common version suffixes
+ :widths: 10 90
+
+ * - Name
+ - Description
+ * - ``id``, ``revision``
+ - Identifiers of designs and revision, mostly used for hardware versions.
+
+ * - ``api``
+ - Version of API between components. API items are usually of limited
+ value to the user, and can be inferred from other versions by the vendor,
+ so adding API versions is generally discouraged as noise.
+
+ * - ``bundle_id``
+ - Identifier of a distribution package which was flashed onto the device.
+ This is an attribute of a firmware package which covers multiple versions
+ for ease of managing firmware images (see
+ :ref:`Documentation/networking/devlink/devlink-flash.rst <devlink_flash>`).
+
+ ``bundle_id`` can appear in both ``running`` and ``stored`` versions,
+ but it must not be reported if any of the components covered by the
+ ``bundle_id`` was changed and no longer matches the version from
+ the bundle.
board.id
--------
@@ -52,7 +139,7 @@ ASIC design identifier.
asic.rev
--------
-ASIC design revision.
+ASIC design revision/stepping.
board.manufacture
-----------------
@@ -72,6 +159,12 @@ Control unit firmware version. This firmware is responsible for house
keeping tasks, PHY control etc. but not the packet-by-packet data path
operation.
+fw.mgmt.api
+-----------
+
+Firmware interface specification version of the software interfaces between
+driver and firmware.
+
fw.app
------
@@ -91,10 +184,27 @@ Network Controller Sideband Interface.
fw.psid
-------
-Unique identifier of the firmware parameter set.
+Unique identifier of the firmware parameter set. These are usually
+parameters of a particular board, defined at manufacturing time.
fw.roce
-------
RoCE firmware version which is responsible for handling roce
management.
+
+fw.bundle_id
+------------
+
+Unique identifier of the entire firmware bundle.
+
+Future work
+===========
+
+The following extensions could be useful:
+
+ - on-disk firmware file names - drivers list the file names of firmware they
+ may need to load onto devices via the ``MODULE_FIRMWARE()`` macro. These,
+ however, are per module, rather than per device. It'd be useful to list
+ the names of firmware files the driver will try to load for a given device,
+ in order of priority.
diff --git a/Documentation/networking/devlink/devlink-linecard.rst b/Documentation/networking/devlink/devlink-linecard.rst
new file mode 100644
index 000000000000..6c0b8928bc13
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-linecard.rst
@@ -0,0 +1,122 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Devlink Line card
+=================
+
+Background
+==========
+
+The ``devlink-linecard`` mechanism is targeted at the manipulation of
+line cards that serve as detachable PHY modules for a modular switch
+system. The following operations are provided:
+
+ * Get a list of supported line card types.
+ * Provision a slot with a specific line card type.
+ * Get and monitor the line card state and its changes.
+
+Depending on its type, a line card may contain one or more gearboxes
+to mux lanes of a certain speed to multiple ports with lanes
+of a different speed. The line card ensures an N:M mapping between
+the switch ASIC modules and physical front panel ports.
+
+Overview
+========
+
+Each line card devlink object is created by device driver,
+according to the physical line card slots available on the device.
+
+Similar to a splitter cable, where the device might have no way
+of detecting the splitter cable geometry, the device
+might not have a way to detect the line card type. For such devices,
+the concept of provisioning is introduced. It allows the user to:
+
+ * Provision a line card slot with certain line card type
+
+ - Device driver would instruct the ASIC to prepare all
+ resources accordingly. The device driver would
+ create all instances, namely devlink port and netdevices
+ that reside on the line card, according to the line card type
+ * Manipulate line card entities even without a line card
+ being physically connected or powered-up
+ * Setup splitter cable on line card ports
+
+ - As on the ordinary ports, user may provision a splitter
+ cable of a certain type, without the need to
+ be physically connected to the port
+ * Configure devlink ports and netdevices
+
+Netdevice carrier is decided as follows:
+
+ * Line card is not inserted or powered-down
+
+ - The carrier is always down
+ * Line card is inserted and powered up
+
+ - The carrier is decided as for ordinary port netdevice
+
+Line card state
+===============
+
+The ``devlink-linecard`` mechanism supports the following line card states:
+
+ * ``unprovisioned``: Line card is not provisioned on the slot.
+ * ``unprovisioning``: Line card slot is currently being unprovisioned.
+ * ``provisioning``: Line card slot is currently in a process of being provisioned
+ with a line card type.
+ * ``provisioning_failed``: Provisioning was not successful.
+ * ``provisioned``: Line card slot is provisioned with a type.
+ * ``active``: Line card is powered-up and active.
+
+The following diagram provides a general overview of ``devlink-linecard``
+state transitions::
+
+ +-------------------------+
+ | |
+ +----------------------------------> unprovisioned |
+ | | |
+ | +--------|-------^--------+
+ | | |
+ | | |
+ | +--------v-------|--------+
+ | | |
+ | | provisioning |
+ | | |
+ | +------------|------------+
+ | |
+ | +-----------------------------+
+ | | |
+ | +------------v------------+ +------------v------------+ +-------------------------+
+ | | | | ----> |
+ +----- provisioning_failed | | provisioned | | active |
+ | | | | <---- |
+ | +------------^------------+ +------------|------------+ +-------------------------+
+ | | |
+ | | |
+ | | +------------v------------+
+ | | | |
+ | | | unprovisioning |
+ | | | |
+ | | +------------|------------+
+ | | |
+ | +-----------------------------+
+ | |
+ +-----------------------------------------------+
+
+
+Example usage
+=============
+
+.. code:: shell
+
+ $ devlink lc show [ DEV [ lc LC_INDEX ] ]
+ $ devlink lc set DEV lc LC_INDEX [ { type LC_TYPE | notype } ]
+
+ # Show current line card configuration and status for all slots:
+ $ devlink lc
+
+ # Set slot 8 to be provisioned with type "16x100G":
+ $ devlink lc set pci/0000:01:00.0 lc 8 type 16x100G
+
+ # Set slot 8 to be unprovisioned:
+ $ devlink lc set pci/0000:01:00.0 lc 8 notype
diff --git a/Documentation/networking/devlink/devlink-params.rst b/Documentation/networking/devlink/devlink-params.rst
index da2f85c0fa21..4e01dc32bc08 100644
--- a/Documentation/networking/devlink/devlink-params.rst
+++ b/Documentation/networking/devlink/devlink-params.rst
@@ -41,6 +41,8 @@ In order for ``driverinit`` parameters to take effect, the driver must
support reloading via the ``devlink-reload`` command. This command will
request a reload of the device driver.
+.. _devlink_params_generic:
+
Generic configuration parameters
================================
The following is a list of generic configuration parameters that drivers may
@@ -95,14 +97,43 @@ own name.
* - ``enable_roce``
- Boolean
- Enable handling of RoCE traffic in the device.
+ * - ``enable_eth``
+ - Boolean
+ - When enabled, the device driver will instantiate Ethernet specific
+ auxiliary device of the devlink device.
+ * - ``enable_rdma``
+ - Boolean
+ - When enabled, the device driver will instantiate RDMA specific
+ auxiliary device of the devlink device.
+ * - ``enable_vnet``
+ - Boolean
+ - When enabled, the device driver will instantiate VDPA networking
+ specific auxiliary device of the devlink device.
+ * - ``enable_iwarp``
+ - Boolean
+ - Enable handling of iWARP traffic in the device.
* - ``internal_err_reset``
- Boolean
- When enabled, the device driver will reset the device on internal
errors.
* - ``max_macs``
- u32
- - Specifies the maximum number of MAC addresses per ethernet port of
- this device.
+ - Typically the MAC addresses of macvlan and vlan net devices are also
+ programmed in their parent netdevice's function rx filter. This parameter
+ limits the maximum number of unicast MAC address filters used to receive
+ traffic per Ethernet port of this device.
* - ``region_snapshot_enable``
- Boolean
- Enable capture of ``devlink-region`` snapshots.
+ * - ``enable_remote_dev_reset``
+ - Boolean
+ - Enable device reset by remote host. When cleared, the device driver
+ will NACK any attempt by another host to reset the device. This parameter
+ is useful for setups where a device is shared by different hosts, such
+ as a multi-host setup.
+ * - ``io_eq_size``
+ - u32
+ - Control the size of I/O completion EQs.
+ * - ``event_eq_size``
+ - u32
+ - Control the size of asynchronous control events EQ.
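The ``driverinit`` note in devlink-params.rst above implies a two-step flow: store the new value, then reload the driver so it takes effect. A minimal sketch of that flow, assuming the iproute2 ``devlink dev param set`` syntax, a hypothetical device ``pci/0000:01:00.0``, and a driver that actually exposes ``enable_eth``. The ``run`` helper and the ``DRY_RUN`` variable are illustrative names, not part of devlink; by default the commands are only printed.

```shell
#!/bin/sh
# Hypothetical helper: print the command unless DRY_RUN=0, in which case run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Store a parameter value in driverinit configuration mode, then request a
# reload so that the driver picks the value up during re-initialization.
set_driverinit_param() {
    dev=$1
    name=$2
    value=$3
    run devlink dev param set "$dev" name "$name" value "$value" cmode driverinit
    run devlink dev reload "$dev"
}

# Example: ask the driver to instantiate the Ethernet auxiliary device.
set_driverinit_param pci/0000:01:00.0 enable_eth true
```

Setting ``DRY_RUN=0`` executes the commands instead of printing them; this requires root privileges and a driver that supports ``devlink-reload``.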
diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst
new file mode 100644
index 000000000000..7627b1da01f2
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-port.rst
@@ -0,0 +1,234 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _devlink_port:
+
+============
+Devlink Port
+============
+
+``devlink-port`` is a port that exists on the device. It is a logically
+separate ingress/egress point of the device. A devlink port can be any one
+of many flavours. A devlink port flavour, along with port attributes,
+describes what a port represents.
+
+A device driver that intends to publish a devlink port sets the
+devlink port attributes and registers the devlink port.
+
+Devlink port flavours are described below.
+
+.. list-table:: List of devlink port flavours
+ :widths: 33 90
+
+ * - Flavour
+ - Description
+ * - ``DEVLINK_PORT_FLAVOUR_PHYSICAL``
+ - Any kind of physical port. This can be an eswitch physical port or any
+ other physical port on the device.
+ * - ``DEVLINK_PORT_FLAVOUR_DSA``
+ - This indicates a DSA interconnect port.
+ * - ``DEVLINK_PORT_FLAVOUR_CPU``
+ - This indicates a CPU port applicable only to DSA.
+ * - ``DEVLINK_PORT_FLAVOUR_PCI_PF``
+ - This indicates an eswitch port representing a port of PCI
+ physical function (PF).
+ * - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
+ - This indicates an eswitch port representing a port of PCI
+ virtual function (VF).
+ * - ``DEVLINK_PORT_FLAVOUR_PCI_SF``
+ - This indicates an eswitch port representing a port of PCI
+ subfunction (SF).
+ * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
+ - This indicates a virtual port for the PCI virtual function.
+
+Devlink port can have a different type based on the link layer described below.
+
+.. list-table:: List of devlink port types
+ :widths: 23 90
+
+ * - Type
+ - Description
+ * - ``DEVLINK_PORT_TYPE_ETH``
+ - Driver should set this port type when a link layer of the port is
+ Ethernet.
+ * - ``DEVLINK_PORT_TYPE_IB``
+ - Driver should set this port type when a link layer of the port is
+ InfiniBand.
+ * - ``DEVLINK_PORT_TYPE_AUTO``
+ - This type is indicated by the user when driver should detect the port
+ type automatically.
+
+PCI controllers
+---------------
+In most cases a PCI device has only one controller. A controller consists of
+potentially multiple physical functions, virtual functions and subfunctions.
+A function consists of one or more ports. Each such port is represented by a
+devlink eswitch port.
+
+A PCI device connected to multiple CPUs or multiple PCI root complexes or a
+SmartNIC, however, may have multiple controllers. For a device with multiple
+controllers, each controller is distinguished by a unique controller number.
+An eswitch is on the PCI device which supports ports of multiple controllers.
+
+An example view of a system with two controllers::
+
+                 ---------------------------------------------------------
+                 |                                                       |
+                 |           --------- ---------         ------- ------- |
+    -----------  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
+    | server  |  | -------   ----/---- ---/----- ------- ---/--- ---/--- |
+    | pci rc  |=== | pf0 |______/________/       | pf1 |___/_______/     |
+    | connect |  | -------                       -------                 |
+    -----------  |     | controller_num=1 (no eswitch)                   |
+                 ------|--------------------------------------------------
+                 (internal wire)
+                       |
+                 ---------------------------------------------------------
+                 | devlink eswitch ports and reps                        |
+                 | ----------------------------------------------------- |
+                 | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
+                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
+                 | ----------------------------------------------------- |
+                 | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
+                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
+                 | ----------------------------------------------------- |
+                 |                                                       |
+                 |                                                       |
+    -----------  |           --------- ---------         ------- ------- |
+    | smartNIC|  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
+    | pci rc  |==| -------   ----/---- ---/----- ------- ---/--- ---/--- |
+    | connect |  | | pf0 |______/________/       | pf1 |___/_______/     |
+    -----------  | -------                       -------                 |
+                 |                                                       |
+                 |  local controller_num=0 (eswitch)                     |
+                 ---------------------------------------------------------
+
+In the above example, the external controller (identified by controller
+number = 1) doesn't have the eswitch. The local controller (identified by
+controller number = 0) has the eswitch. The devlink instance on the local
+controller has eswitch devlink ports for both controllers.
+
+Function configuration
+======================
+
+A user can configure the function attribute before enumerating the PCI
+function. Usually this means that the user should configure the function
+attribute before a bus specific device for the function is created. However,
+when SRIOV is enabled, virtual function devices are created on the PCI bus.
+Hence, the function attribute should be configured before binding the virtual
+function device to the driver. For subfunctions, this means the user should
+configure the port function attribute before activating the port function.
+
+A user may set the hardware address of the function using the
+``devlink port function set hw_addr`` command. For an Ethernet port function
+this means a MAC address.
+
+Subfunction
+============
+
+A subfunction is a lightweight function that has a parent PCI function on which
+it is deployed. Subfunctions are created and deployed in units of one. Unlike
+SRIOV VFs, a subfunction doesn't require its own PCI virtual function.
+A subfunction communicates with the hardware through the parent PCI function.
+
+To use a subfunction, a three-step setup sequence is followed:
+
+(1) create - create a subfunction;
+(2) configure - configure subfunction attributes;
+(3) deploy - deploy the subfunction.
+
+Subfunction management is done using the devlink port user interface.
+The user performs setup on the subfunction management device.
+
+(1) Create
+----------
+A subfunction is created using a devlink port interface. A user adds the
+subfunction by adding a devlink port of subfunction flavour. The devlink
+kernel code calls down to the subfunction management driver (devlink ops) and
+asks it to create a subfunction devlink port. The driver then instantiates the
+subfunction port and any associated objects such as health reporters and a
+representor netdevice.
+
+(2) Configure
+-------------
+A subfunction devlink port is created but it is not active yet. That means the
+entities are created on the devlink side and the e-switch port representor is
+created, but the subfunction device itself is not created. A user might use the
+e-switch port representor to apply settings, such as putting it into a bridge
+or adding TC rules. A user might also configure the hardware address (such as
+a MAC address) of the subfunction while the subfunction is inactive.
+
+(3) Deploy
+----------
+Once a subfunction is configured, the user must activate it to use it. Upon
+activation, the subfunction management driver asks the subfunction management
+device to instantiate the subfunction device on a particular PCI function.
+A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`.
+At this point a matching subfunction driver binds to the subfunction's auxiliary device.
+
+Rate object management
+======================
+
+Devlink provides an API to manage tx rates of a single devlink port or a group.
+This is done through rate objects, which can be one of two types:
+
+``leaf``
+ Represents a single devlink port; created/destroyed by the driver. Since a
+ leaf has a 1:1 mapping to its devlink port, in user space it is referred to
+ as ``pci/<bus_addr>/<port_index>``;
+
+``node``
+ Represents a group of rate objects (leafs and/or nodes); created/deleted by
+ request from user space; initially empty (no rate objects added). In
+ user space it is referred to as ``pci/<bus_addr>/<node_name>``, where
+ ``node_name`` can be any identifier except a decimal number, to avoid
+ collisions with leafs.
+
+The API allows configuring the following rate object parameters:
+
+``tx_share``
+ Minimum TX rate value shared among all other rate objects, or rate objects
+ that are parts of the parent group, if it is a part of the same group.
+
+``tx_max``
+ Maximum TX rate value.
+
+``parent``
+ Parent node name. Parent node rate limits are considered as additional limits
+ to all node children limits. ``tx_max`` is an upper limit for children.
+ ``tx_share`` is a total bandwidth distributed among children.
+
+Driver implementations are allowed to support both or either rate object types
+and setting methods of their parameters.
+
+Terms and Definitions
+=====================
+
+.. list-table:: Terms and Definitions
+ :widths: 22 90
+
+ * - Term
+ - Definitions
+ * - ``PCI device``
+ - A physical PCI device having one or more PCI buses consists of one or
+ more PCI controllers.
+ * - ``PCI controller``
+ - A controller consists of potentially multiple physical functions,
+ virtual functions and subfunctions.
+ * - ``Port function``
+ - An object to manage the function of a port.
+ * - ``Subfunction``
+ - A lightweight function that has a parent PCI function on which it is
+ deployed.
+ * - ``Subfunction device``
+ - A bus device of the subfunction, usually on an auxiliary bus.
+ * - ``Subfunction driver``
+ - A device driver for the subfunction auxiliary device.
+ * - ``Subfunction management device``
+ - A PCI physical function that supports subfunction management.
+ * - ``Subfunction management driver``
+ - A device driver for PCI physical function that supports
+ subfunction management using devlink port interface.
+ * - ``Subfunction host driver``
+ - A device driver for a PCI physical function that hosts subfunction
+ devices. In most cases it is the same as the subfunction management
+ driver. When a subfunction is used on an external controller, the
+ subfunction management and host drivers are different.
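The three-step subfunction sequence described in devlink-port.rst above (create, configure, deploy) can be sketched with the iproute2 ``devlink port`` commands. This is an illustration under stated assumptions rather than a reference procedure: the device ``pci/0000:06:00.0``, the ``sfnum`` value 88, the MAC address and the port index ``32768`` are hypothetical (the real port index is reported by ``devlink port add``), and the ``run`` helper with its ``DRY_RUN`` variable only prints the commands by default.

```shell
#!/bin/sh
# Hypothetical helper: print the command unless DRY_RUN=0, in which case run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# (1) create: add a devlink port of subfunction flavour on the given PF.
sf_create() {
    run devlink port add "$1" flavour pcisf pfnum "$2" sfnum "$3"
}

# (2) configure: set the hardware (MAC) address while the function is inactive.
sf_configure() {
    run devlink port function set "$1" hw_addr "$2"
}

# (3) deploy: activate the function so that its auxiliary device is created.
sf_deploy() {
    run devlink port function set "$1" state active
}

DEV=pci/0000:06:00.0
PORT=$DEV/32768   # hypothetical index; use the one printed by "devlink port add"

sf_create "$DEV" 0 88
sf_configure "$PORT" 00:00:00:00:88:88
sf_deploy "$PORT"
```

Keeping the configure step separate from activation mirrors the requirement above that port function attributes be set before the port function is activated.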
diff --git a/Documentation/networking/devlink/devlink-region.rst b/Documentation/networking/devlink/devlink-region.rst
index 8b46e8591fe0..f06dca9a1eb6 100644
--- a/Documentation/networking/devlink/devlink-region.rst
+++ b/Documentation/networking/devlink/devlink-region.rst
@@ -14,11 +14,22 @@ Region snapshots are collected by the driver, and can be accessed via read
or dump commands. This allows future analysis on the created snapshots.
Regions may optionally support triggering snapshots on demand.
+Snapshot identifiers are scoped to the devlink instance, not a region.
+All snapshots with the same snapshot id within a devlink instance
+correspond to the same event.
+
The major benefit to creating a region is to provide access to internal
address regions that are otherwise inaccessible to the user.
Regions may also be used to provide an additional way to debug complex error
-states, but see also :doc:`devlink-health`
+states, but see also Documentation/networking/devlink/devlink-health.rst
+
+Regions may optionally support capturing a snapshot on demand via the
+``DEVLINK_CMD_REGION_NEW`` netlink message. A driver wishing to allow
+requested snapshots must implement the ``.snapshot`` callback for the region
+in its ``devlink_region_ops`` structure. If the snapshot id is not set in
+the ``DEVLINK_CMD_REGION_NEW`` request, the kernel will allocate one and send
+the snapshot information to user space.
example usage
-------------
@@ -29,17 +40,20 @@ example usage
$ devlink region show [ DEV/REGION ]
$ devlink region del DEV/REGION snapshot SNAPSHOT_ID
$ devlink region dump DEV/REGION [ snapshot SNAPSHOT_ID ]
- $ devlink region read DEV/REGION [ snapshot SNAPSHOT_ID ]
- address ADDRESS length length
+ $ devlink region read DEV/REGION [ snapshot SNAPSHOT_ID ] address ADDRESS length length
# Show all of the exposed regions with region sizes:
$ devlink region show
- pci/0000:00:05.0/cr-space: size 1048576 snapshot [1 2]
- pci/0000:00:05.0/fw-health: size 64 snapshot [1 2]
+ pci/0000:00:05.0/cr-space: size 1048576 snapshot [1 2] max 8
+ pci/0000:00:05.0/fw-health: size 64 snapshot [1 2] max 8
# Delete a snapshot using:
$ devlink region del pci/0000:00:05.0/cr-space snapshot 1
+ # Request an immediate snapshot, if supported by the region
+ $ devlink region new pci/0000:00:05.0/cr-space
+ pci/0000:00:05.0/cr-space: snapshot 5
+
# Dump a snapshot:
$ devlink region dump pci/0000:00:05.0/fw-health snapshot 1
0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
@@ -48,8 +62,7 @@ example usage
0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5
# Read a specific part of a snapshot:
- $ devlink region read pci/0000:00:05.0/fw-health snapshot 1 address 0
- length 16
+ $ devlink region read pci/0000:00:05.0/fw-health snapshot 1 address 0 length 16
0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
As regions are likely very device or driver specific, no generic regions are
diff --git a/Documentation/networking/devlink/devlink-reload.rst b/Documentation/networking/devlink/devlink-reload.rst
new file mode 100644
index 000000000000..505d22da027d
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-reload.rst
@@ -0,0 +1,81 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+Devlink Reload
+==============
+
+``devlink-reload`` provides a mechanism to reinitialize driver entities,
+applying new values of ``devlink-params`` and ``devlink-resources``. It also
+provides a mechanism to activate firmware.
+
+Reload Actions
+==============
+
+The user may select a reload action.
+By default the ``driver_reinit`` action is selected.
+
+.. list-table:: Possible reload actions
+ :widths: 5 90
+
+ * - Name
+ - Description
+ * - ``driver_reinit``
+ - Devlink driver entities re-initialization, including applying
+ new values to devlink entities which are used during driver
+ load such as ``devlink-params`` in configuration mode
+ ``driverinit`` or ``devlink-resources``
+ * - ``fw_activate``
+ - Firmware activate. Activates new firmware if such an image is stored and
+ pending activation. If no limit is specified, this action may involve a
+ firmware reset. If no new image is pending, this action will reload the
+ current firmware image.
+
+Note that even though the user asks for a specific action, the driver
+implementation might need to perform another action alongside
+it. For example, some drivers do not support driver reinitialization
+being performed without fw activation. Therefore, the devlink reload
+command returns the list of actions which were actually performed.
+
+Reload Limits
+=============
+
+By default reload actions are not limited and driver implementation may
+include reset or downtime as needed to perform the actions.
+
+However, some drivers support action limits, which limit the action
+implementation to specific constraints.
+
+.. list-table:: Possible reload limits
+ :widths: 5 90
+
+ * - Name
+ - Description
+ * - ``no_reset``
+ - No reset allowed, no down time allowed, no link flap and no
+ configuration is lost.
+
+Change Namespace
+================
+
+The netns option allows the user to move devlink instances into
+namespaces during a devlink reload operation.
+By default all devlink instances are created in init_net and stay there.
+
+example usage
+-------------
+
+.. code:: shell
+
+ $ devlink dev reload help
+ $ devlink dev reload DEV [ netns { PID | NAME | ID } ] [ action { driver_reinit | fw_activate } ] [ limit no_reset ]
+
+ # Run reload command for devlink driver entities re-initialization:
+ $ devlink dev reload pci/0000:82:00.0 action driver_reinit
+ reload_actions_performed:
+ driver_reinit
+
+ # Run reload command to activate firmware:
+ # Note that mlx5 driver reloads the driver while activating firmware
+ $ devlink dev reload pci/0000:82:00.0 action fw_activate
+ reload_actions_performed:
+ driver_reinit fw_activate
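The ``netns``, ``action`` and ``limit`` options from the synopsis in devlink-reload.rst above can be combined in one invocation. The sketch below builds the command line from optional arguments, following that synopsis; the ``reload_dev`` and ``run`` helpers, the ``DRY_RUN`` variable and the namespace name ``ns1`` are hypothetical, and by default the commands are only printed. Whether ``limit no_reset`` is accepted depends on the driver.

```shell
#!/bin/sh
# Hypothetical helper: print the command unless DRY_RUN=0, in which case run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Build "devlink dev reload DEV [netns N] [action A] [limit L]".
# Empty arguments are skipped, so the defaults described above apply.
reload_dev() {
    dev=$1
    netns=$2
    action=$3
    limit=$4
    set -- devlink dev reload "$dev"
    if [ -n "$netns" ]; then set -- "$@" netns "$netns"; fi
    if [ -n "$action" ]; then set -- "$@" action "$action"; fi
    if [ -n "$limit" ]; then set -- "$@" limit "$limit"; fi
    run "$@"
}

# Activate pending firmware while asking for no reset and no link flap.
reload_dev pci/0000:82:00.0 "" fw_activate no_reset

# Move the devlink instance into the namespace "ns1" during reload.
reload_dev pci/0000:82:00.0 ns1 "" ""
```

When run for real, the ``reload_actions_performed`` output shows which actions the driver actually carried out, which may be more than the one requested.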
diff --git a/Documentation/networking/devlink/devlink-resource.rst b/Documentation/networking/devlink/devlink-resource.rst
index 93e92d2f0752..3d5ae51e65a2 100644
--- a/Documentation/networking/devlink/devlink-resource.rst
+++ b/Documentation/networking/devlink/devlink-resource.rst
@@ -23,6 +23,20 @@ current size and related sub resources. To access a sub resource, you
specify the path of the resource. For example ``/IPv4/fib`` is the id for
the ``fib`` sub-resource under the ``IPv4`` resource.
+Generic Resources
+=================
+
+Generic resources are used to describe resources that can be shared by multiple
+device drivers and their description must be added to the following table:
+
+.. list-table:: List of Generic Resources
+ :widths: 10 90
+
+ * - Name
+ - Description
+ * - ``physical_ports``
+ - A limited capacity of physical ports that the switch ASIC can support
+
example usage
-------------
diff --git a/Documentation/networking/devlink/devlink-selftests.rst b/Documentation/networking/devlink/devlink-selftests.rst
new file mode 100644
index 000000000000..c0aa1f3aef0d
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-selftests.rst
@@ -0,0 +1,38 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+=================
+Devlink Selftests
+=================
+
+The ``devlink-selftests`` API allows executing selftests on the device.
+
+Tests Mask
+==========
+The ``devlink-selftests`` command should be run with a mask indicating
+the tests to be executed.
+
+Tests Description
+=================
+The following is a list of tests that drivers may execute.
+
+.. list-table:: List of tests
+ :widths: 5 90
+
+ * - Name
+ - Description
+ * - ``DEVLINK_SELFTEST_FLASH``
+ - Devices may have the firmware on non-volatile memory on the board, e.g.
+ flash. This particular test helps to run a flash selftest on the device.
+ Implementation of the test is left to the driver/firmware.
+
+example usage
+-------------
+
+.. code:: shell
+
+ # Query selftests supported on the devlink device
+ $ devlink dev selftests show DEV
+ # Query selftests supported on all devlink devices
+ $ devlink dev selftests show
+ # Executes selftests on the device
+ $ devlink dev selftests run DEV id flash
diff --git a/Documentation/networking/devlink/devlink-trap.rst b/Documentation/networking/devlink/devlink-trap.rst
index 47a429bb8658..90d1381b88de 100644
--- a/Documentation/networking/devlink/devlink-trap.rst
+++ b/Documentation/networking/devlink/devlink-trap.rst
@@ -55,7 +55,7 @@ The following diagram provides a general overview of ``devlink-trap``::
| |
+-------^--------+
|
- |
+ | Non-control traps
|
+----+----+
| | Kernel's Rx path
@@ -97,6 +97,12 @@ The ``devlink-trap`` mechanism supports the following packet trap types:
processed by ``devlink`` and injected to the kernel's Rx path. Changing the
action of such traps is not allowed, as it can easily break the control
plane.
+ * ``control``: Trapped packets were trapped by the device because these are
+ control packets required for the correct functioning of the control plane.
+ For example, ARP request and IGMP query packets. Packets are injected to
+ the kernel's Rx path, but not reported to the kernel's drop monitor.
+ Changing the action of such traps is not allowed, as it can easily break
+ the control plane.
.. _Trap-Actions:
@@ -108,6 +114,8 @@ The ``devlink-trap`` mechanism supports the following packet trap actions:
* ``trap``: The sole copy of the packet is sent to the CPU.
* ``drop``: The packet is dropped by the underlying device and a copy is not
sent to the CPU.
+ * ``mirror``: The packet is forwarded by the underlying device and a copy is
+ sent to the CPU.
Generic Packet Traps
====================
@@ -238,6 +246,245 @@ be added to the following table:
- ``drop``
- Traps NVE packets that the device decided to drop because their overlay
source MAC is multicast
+ * - ``ingress_flow_action_drop``
+ - ``drop``
+ - Traps packets dropped during processing of ingress flow action drop
+ * - ``egress_flow_action_drop``
+ - ``drop``
+ - Traps packets dropped during processing of egress flow action drop
+ * - ``stp``
+ - ``control``
+ - Traps STP packets
+ * - ``lacp``
+ - ``control``
+ - Traps LACP packets
+ * - ``lldp``
+ - ``control``
+ - Traps LLDP packets
+ * - ``igmp_query``
+ - ``control``
+ - Traps IGMP Membership Query packets
+ * - ``igmp_v1_report``
+ - ``control``
+ - Traps IGMP Version 1 Membership Report packets
+ * - ``igmp_v2_report``
+ - ``control``
+ - Traps IGMP Version 2 Membership Report packets
+ * - ``igmp_v3_report``
+ - ``control``
+ - Traps IGMP Version 3 Membership Report packets
+ * - ``igmp_v2_leave``
+ - ``control``
+ - Traps IGMP Version 2 Leave Group packets
+ * - ``mld_query``
+ - ``control``
+ - Traps MLD Multicast Listener Query packets
+ * - ``mld_v1_report``
+ - ``control``
+ - Traps MLD Version 1 Multicast Listener Report packets
+ * - ``mld_v2_report``
+ - ``control``
+ - Traps MLD Version 2 Multicast Listener Report packets
+ * - ``mld_v1_done``
+ - ``control``
+ - Traps MLD Version 1 Multicast Listener Done packets
+ * - ``ipv4_dhcp``
+ - ``control``
+ - Traps IPv4 DHCP packets
+ * - ``ipv6_dhcp``
+ - ``control``
+ - Traps IPv6 DHCP packets
+ * - ``arp_request``
+ - ``control``
+ - Traps ARP request packets
+ * - ``arp_response``
+ - ``control``
+ - Traps ARP response packets
+ * - ``arp_overlay``
+ - ``control``
+ - Traps NVE-decapsulated ARP packets that reached the overlay network.
+ This is required, for example, when the address that needs to be
+ resolved is a local address
+ * - ``ipv6_neigh_solicit``
+ - ``control``
+ - Traps IPv6 Neighbour Solicitation packets
+ * - ``ipv6_neigh_advert``
+ - ``control``
+ - Traps IPv6 Neighbour Advertisement packets
+ * - ``ipv4_bfd``
+ - ``control``
+ - Traps IPv4 BFD packets
+ * - ``ipv6_bfd``
+ - ``control``
+ - Traps IPv6 BFD packets
+ * - ``ipv4_ospf``
+ - ``control``
+ - Traps IPv4 OSPF packets
+ * - ``ipv6_ospf``
+ - ``control``
+ - Traps IPv6 OSPF packets
+ * - ``ipv4_bgp``
+ - ``control``
+ - Traps IPv4 BGP packets
+ * - ``ipv6_bgp``
+ - ``control``
+ - Traps IPv6 BGP packets
+ * - ``ipv4_vrrp``
+ - ``control``
+ - Traps IPv4 VRRP packets
+ * - ``ipv6_vrrp``
+ - ``control``
+ - Traps IPv6 VRRP packets
+ * - ``ipv4_pim``
+ - ``control``
+ - Traps IPv4 PIM packets
+ * - ``ipv6_pim``
+ - ``control``
+ - Traps IPv6 PIM packets
+ * - ``uc_loopback``
+ - ``control``
+ - Traps unicast packets that need to be routed through the same layer 3
+ interface from which they were received. Such packets are routed by the
+ kernel, but also cause it to potentially generate ICMP redirect packets
+ * - ``local_route``
+ - ``control``
+ - Traps unicast packets that hit a local route and need to be locally
+ delivered
+ * - ``external_route``
+ - ``control``
+ - Traps packets that should be routed through an external interface (e.g.,
+ management interface) that does not belong to the same device (e.g.,
+ switch ASIC) as the ingress interface
+ * - ``ipv6_uc_dip_link_local_scope``
+ - ``control``
+ - Traps unicast IPv6 packets that need to be routed and have a destination
+ IP address with a link-local scope (i.e., fe80::/10). The trap allows
+ device drivers to avoid programming link-local routes, but still receive
+ packets for local delivery
+ * - ``ipv6_dip_all_nodes``
+ - ``control``
+ - Traps IPv6 packets whose destination IP address is the "All Nodes
+ Address" (i.e., ff02::1)
+ * - ``ipv6_dip_all_routers``
+ - ``control``
+ - Traps IPv6 packets whose destination IP address is the "All Routers
+ Address" (i.e., ff02::2)
+ * - ``ipv6_router_solicit``
+ - ``control``
+ - Traps IPv6 Router Solicitation packets
+ * - ``ipv6_router_advert``
+ - ``control``
+ - Traps IPv6 Router Advertisement packets
+ * - ``ipv6_redirect``
+ - ``control``
+ - Traps IPv6 Redirect Message packets
+ * - ``ipv4_router_alert``
+ - ``control``
+ - Traps IPv4 packets that need to be routed and include the Router Alert
+ option. Such packets need to be locally delivered to raw sockets that
+ have the IP_ROUTER_ALERT socket option set
+ * - ``ipv6_router_alert``
+ - ``control``
+ - Traps IPv6 packets that need to be routed and include the Router Alert
+ option in their Hop-by-Hop extension header. Such packets need to be
+ locally delivered to raw sockets that have the IPV6_ROUTER_ALERT socket
+ option set
+ * - ``ptp_event``
+ - ``control``
+ - Traps PTP time-critical event messages (Sync, Delay_req, Pdelay_Req and
+ Pdelay_Resp)
+ * - ``ptp_general``
+ - ``control``
+ - Traps PTP general messages (Announce, Follow_Up, Delay_Resp,
+ Pdelay_Resp_Follow_Up, management and signaling)
+ * - ``flow_action_sample``
+ - ``control``
+ - Traps packets sampled during processing of flow action sample (e.g., via
+ tc's sample action)
+ * - ``flow_action_trap``
+ - ``control``
+ - Traps packets logged during processing of flow action trap (e.g., via
+ tc's trap action)
+ * - ``early_drop``
+ - ``drop``
+ - Traps packets dropped due to the RED (Random Early Detection) algorithm
+ (i.e., early drops)
+ * - ``vxlan_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the VXLAN header parsing, for
+ example because of packet truncation or because the I flag is not set.
+ * - ``llc_snap_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the LLC+SNAP header parsing
+ * - ``vlan_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the VLAN header parsing. Could
+ include unexpected packet truncation.
+ * - ``pppoe_ppp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the PPPoE+PPP header parsing.
+ This could include finding a session ID of 0xFFFF (which is reserved and
+ not for use), a PPPoE length which is larger than the frame received or
+ any common error on this type of header
+ * - ``mpls_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the MPLS header parsing which
+ could include unexpected header truncation
+ * - ``arp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the ARP header parsing
+ * - ``ip_1_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the first IP header parsing.
+ This packet trap could include packets which do not pass an IP checksum
+ check or a header length check (a minimum of 20 bytes), or which suffer
+ from packet truncation such that the total length field exceeds the
+ received packet length, etc.
+ * - ``ip_n_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the parsing of the last IP
+ header (the inner one in case of an IP over IP tunnel). The same common
+ error checking is performed here as for the ip_1_parsing trap
+ * - ``gre_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the GRE header parsing
+ * - ``udp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the UDP header parsing.
+ This packet trap could include checksum errors, an improper UDP
+ length detected (smaller than 8 bytes) or detection of header
+ truncation.
+ * - ``tcp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the TCP header parsing.
+ This could include TCP checksum errors, improper combination of SYN, FIN
+ and/or RESET etc.
+ * - ``ipsec_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the IPSEC header parsing
+ * - ``sctp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the SCTP header parsing.
+ This would mean that port number 0 was used or that the header is
+ truncated.
+ * - ``dccp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the DCCP header parsing
+ * - ``gtp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the GTP header parsing
+ * - ``esp_parsing``
+ - ``drop``
+ - Traps packets dropped due to an error in the ESP header parsing
+ * - ``blackhole_nexthop``
+ - ``drop``
+ - Traps packets that the device decided to drop in case they hit a
+ blackhole nexthop
+ * - ``dmac_filter``
+ - ``drop``
+ - Traps incoming packets that the device decided to drop because
+ the destination MAC is not configured in the MAC table and
+ the interface is not in promiscuous mode
Driver-specific Packet Traps
============================
@@ -248,8 +495,11 @@ help debug packet drops caused by these exceptions. The following list includes
links to the description of driver-specific traps registered by various device
drivers:
- * :doc:`netdevsim`
- * :doc:`mlxsw`
+ * Documentation/networking/devlink/netdevsim.rst
+ * Documentation/networking/devlink/mlxsw.rst
+ * Documentation/networking/devlink/prestera.rst
+
+.. _Generic-Packet-Trap-Groups:
Generic Packet Trap Groups
==========================
@@ -269,14 +519,102 @@ narrow. The description of these groups must be added to the following table:
- Contains packet traps for packets that were dropped by the device during
layer 2 forwarding (i.e., bridge)
* - ``l3_drops``
- - Contains packet traps for packets that were dropped by the device or hit
- an exception (e.g., TTL error) during layer 3 forwarding
+ - Contains packet traps for packets that were dropped by the device during
+ layer 3 forwarding
+ * - ``l3_exceptions``
+ - Contains packet traps for packets that hit an exception (e.g., TTL
+ error) during layer 3 forwarding
* - ``buffer_drops``
- Contains packet traps for packets that were dropped by the device due to
an enqueue decision
* - ``tunnel_drops``
- Contains packet traps for packets that were dropped by the device during
tunnel encapsulation / decapsulation
+ * - ``acl_drops``
+ - Contains packet traps for packets that were dropped by the device during
+ ACL processing
+ * - ``stp``
+ - Contains packet traps for STP packets
+ * - ``lacp``
+ - Contains packet traps for LACP packets
+ * - ``lldp``
+ - Contains packet traps for LLDP packets
+ * - ``mc_snooping``
+ - Contains packet traps for IGMP and MLD packets required for multicast
+ snooping
+ * - ``dhcp``
+ - Contains packet traps for DHCP packets
+ * - ``neigh_discovery``
+ - Contains packet traps for neighbour discovery packets (e.g., ARP, IPv6
+ ND)
+ * - ``bfd``
+ - Contains packet traps for BFD packets
+ * - ``ospf``
+ - Contains packet traps for OSPF packets
+ * - ``bgp``
+ - Contains packet traps for BGP packets
+ * - ``vrrp``
+ - Contains packet traps for VRRP packets
+ * - ``pim``
+ - Contains packet traps for PIM packets
+ * - ``uc_loopback``
+ - Contains a packet trap for unicast loopback packets (i.e.,
+ ``uc_loopback``). This trap is singled out because in cases such as a
+ one-armed router it will be constantly triggered. To limit the impact on
+ the CPU usage, a packet trap policer with a low rate can be bound to the
+ group without affecting other traps
+ * - ``local_delivery``
+ - Contains packet traps for packets that should be locally delivered after
+ routing, but do not match more specific packet traps (e.g.,
+ ``ipv4_bgp``)
+ * - ``external_delivery``
+ - Contains packet traps for packets that should be routed through an
+ external interface (e.g., management interface) that does not belong to
+ the same device (e.g., switch ASIC) as the ingress interface
+ * - ``ipv6``
+ - Contains packet traps for various IPv6 control packets (e.g., Router
+ Advertisements)
+ * - ``ptp_event``
+ - Contains packet traps for PTP time-critical event messages (Sync,
+ Delay_req, Pdelay_Req and Pdelay_Resp)
+ * - ``ptp_general``
+ - Contains packet traps for PTP general messages (Announce, Follow_Up,
+ Delay_Resp, Pdelay_Resp_Follow_Up, management and signaling)
+ * - ``acl_sample``
+ - Contains packet traps for packets that were sampled by the device during
+ ACL processing
+ * - ``acl_trap``
+ - Contains packet traps for packets that were trapped (logged) by the
+ device during ACL processing
+ * - ``parser_error_drops``
+ - Contains packet traps for packets that were marked by the device during
+ parsing as erroneous
+
+Packet Trap Policers
+====================
+
+As previously explained, the underlying device can trap certain packets to the
+CPU for processing. In most cases, the underlying device is capable of handling
+packet rates that are several orders of magnitude higher compared to those that
+can be handled by the CPU.
+
+Therefore, in order to prevent the underlying device from overwhelming the CPU,
+devices usually include packet trap policers that are able to police the
+trapped packets to rates that can be handled by the CPU.
+
+The ``devlink-trap`` mechanism allows capable device drivers to register their
+supported packet trap policers with ``devlink``. The device driver can choose
+to associate these policers with supported packet trap groups (see
+:ref:`Generic-Packet-Trap-Groups`) during its initialization, thereby exposing
+its default control plane policy to user space.
+
+Device drivers should allow user space to change the parameters of the policers
+(e.g., rate, burst size) as well as the association between the policers and
+trap groups by implementing the relevant callbacks.
+
+If possible, device drivers should implement a callback that allows user space
+to retrieve the number of packets that were dropped by the policer because its
+configured policy was violated.
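
The policing behaviour described above (a rate, a burst size, and a counter of packets dropped for violating the policy) can be illustrated with a small token-bucket model. This is only a sketch under assumptions: the class ``TrapPolicer`` and its fields are hypothetical names chosen for illustration and are not part of devlink or of any driver, and real devices implement policing in hardware.

```python
from dataclasses import dataclass


@dataclass
class TrapPolicer:
    """Hypothetical token-bucket model of a packet trap policer."""

    rate: int            # packets per second that may reach the CPU
    burst: int           # maximum number of packets admitted back to back
    tokens: float = 0.0  # current bucket fill level
    last: float = 0.0    # timestamp of the previous packet, in seconds
    dropped: int = 0     # packets dropped because the policy was violated

    def __post_init__(self) -> None:
        # Start with a full bucket so an initial burst is admitted.
        self.tokens = float(self.burst)

    def admit(self, now: float) -> bool:
        """Return True if a packet arriving at time `now` is passed to the CPU."""
        elapsed = now - self.last
        self.last = now
        # Refill according to the configured rate, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        self.dropped += 1
        return False


policer = TrapPolicer(rate=100, burst=4)
admitted = sum(policer.admit(0.0) for _ in range(10))
print(admitted, policer.dropped)
```

In this model, lowering ``rate`` or ``burst`` corresponds to user space changing the policer parameters, and ``dropped`` corresponds to the counter a driver may expose through the optional callback.
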
Testing
=======
diff --git a/Documentation/networking/devlink/hns3.rst b/Documentation/networking/devlink/hns3.rst
new file mode 100644
index 000000000000..4562a6e4782f
--- /dev/null
+++ b/Documentation/networking/devlink/hns3.rst
@@ -0,0 +1,25 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+hns3 devlink support
+====================
+
+This document describes the devlink features implemented by the ``hns3``
+device driver.
+
+The ``hns3`` driver supports reloading via ``DEVLINK_CMD_RELOAD``.
+
+Info versions
+=============
+
+The ``hns3`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 10 10 80
+
+ * - Name
+ - Type
+ - Description
+ * - ``fw``
+ - running
+ - Used to represent the firmware version.
diff --git a/Documentation/networking/devlink/ice.rst b/Documentation/networking/devlink/ice.rst
new file mode 100644
index 000000000000..0c89ceb8986d
--- /dev/null
+++ b/Documentation/networking/devlink/ice.rst
@@ -0,0 +1,256 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+ice devlink support
+===================
+
+This document describes the devlink features implemented by the ``ice``
+device driver.
+
+Info versions
+=============
+
+The ``ice`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 5 90
+
+ * - Name
+ - Type
+ - Example
+ - Description
+ * - ``board.id``
+ - fixed
+ - K65390-000
+ - The Product Board Assembly (PBA) identifier of the board.
+ * - ``fw.mgmt``
+ - running
+ - 2.1.7
+ - 3-digit version number of the management firmware running on the
+ Embedded Management Processor of the device. It controls the PHY,
+ link, access to device resources, etc. Intel documentation refers to
+ this as the EMP firmware.
+ * - ``fw.mgmt.api``
+ - running
+ - 1.5.1
+ - 3-digit version number (major.minor.patch) of the API exported over
+ the AdminQ by the management firmware. Used by the driver to
+ identify what commands are supported. Historical versions of the
+ kernel only displayed a 2-digit version number (major.minor).
+ * - ``fw.mgmt.build``
+ - running
+ - 0x305d955f
+ - Unique identifier of the source for the management firmware.
+ * - ``fw.undi``
+ - running
+ - 1.2581.0
+ - Version of the Option ROM containing the UEFI driver. The version is
+ reported in ``major.minor.patch`` format. The major version is
+ incremented whenever a major breaking change occurs, or when the
+ minor version would overflow. The minor version is incremented for
+ non-breaking changes and reset to 1 when the major version is
+ incremented. The patch version is normally 0 but is incremented when
+ a fix is delivered as a patch against an older base Option ROM.
+ * - ``fw.psid.api``
+ - running
+ - 0.80
+ - Version defining the format of the flash contents.
+ * - ``fw.bundle_id``
+ - running
+ - 0x80002ec0
+ - Unique identifier of the firmware image file that was loaded onto
+ the device. Also referred to as the EETRACK identifier of the NVM.
+ * - ``fw.app.name``
+ - running
+ - ICE OS Default Package
+ - The name of the DDP package that is active in the device. The DDP
+ package is loaded by the driver during initialization. Each
+ variation of the DDP package has a unique name.
+ * - ``fw.app``
+ - running
+ - 1.3.1.0
+ - The version of the DDP package that is active in the device. Note
+ that both the name (as reported by ``fw.app.name``) and version are
+ required to uniquely identify the package.
+ * - ``fw.app.bundle_id``
+ - running
+ - 0xc0000001
+ - Unique identifier for the DDP package loaded in the device. Also
+ referred to as the DDP Track ID. Can be used to uniquely identify
+ the specific DDP package.
+ * - ``fw.netlist``
+ - running
+ - 1.1.2000-6.7.0
+ - The version of the netlist module. This module defines the device's
+ Ethernet capabilities and default settings, and is used by the
+ management firmware as part of managing link and device
+ connectivity.
+ * - ``fw.netlist.build``
+ - running
+ - 0xee16ced7
+ - The first 4 bytes of the hash of the netlist module contents.
+
+Flash Update
+============
+
+The ``ice`` driver implements support for flash update using the
+``devlink-flash`` interface. It supports updating the device flash using a
+combined flash image that contains the ``fw.mgmt``, ``fw.undi``, and
+``fw.netlist`` components.
+
+.. list-table:: List of supported overwrite modes
+ :widths: 5 95
+
+ * - Bits
+ - Behavior
+ * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS``
+ - Do not preserve settings stored in the flash components being
+ updated. This includes overwriting the port configuration that
+ determines the number of physical functions the device will
+ initialize with.
+ * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS`` and ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS``
+ - Do not preserve either settings or identifiers. Overwrite everything
+ in the flash with the contents from the provided image, without
+ performing any preservation. This includes overwriting device
+ identifying fields such as the MAC address, VPD area, and device
+ serial number. It is expected that this combination be used with an
+ image customized for the specific device.
+
+The ice hardware does not support overwriting only identifiers while
+preserving settings, and thus ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS`` on its
+own will be rejected. If no overwrite mask is provided, the firmware will be
+instructed to preserve all settings and identifying fields when updating.
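
The accepted and rejected overwrite mask combinations described above can be summarised in a short sketch. The function name ``resolve_overwrite_mode`` and the returned strings are hypothetical, and the bit values are assumed for illustration only; this is not the driver's actual code.

```python
# Bit values assumed for illustration; consult the devlink UAPI header for
# the authoritative definitions.
OVERWRITE_SETTINGS = 1 << 0
OVERWRITE_IDENTIFIERS = 1 << 1


def resolve_overwrite_mode(mask: int) -> str:
    """Map an overwrite mask to the preservation behaviour described above."""
    if mask == 0:
        # No mask: preserve all settings and identifying fields.
        return "preserve_all"
    if mask == OVERWRITE_SETTINGS:
        return "overwrite_settings"
    if mask == OVERWRITE_SETTINGS | OVERWRITE_IDENTIFIERS:
        return "overwrite_everything"
    # Identifiers on their own (or any unknown bit) are rejected.
    raise ValueError(f"unsupported overwrite mask: {mask:#x}")
```
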
+
+Reload
+======
+
+The ``ice`` driver supports activating new firmware after a flash update
+using ``DEVLINK_CMD_RELOAD`` with the ``DEVLINK_RELOAD_ACTION_FW_ACTIVATE``
+action.
+
+.. code:: shell
+
+ $ devlink dev reload pci/0000:01:00.0 reload action fw_activate
+
+The new firmware is activated by issuing a device specific Embedded
+Management Processor reset which requests the device to reset and reload the
+EMP firmware image.
+
+The driver does not currently support reloading the driver via
+``DEVLINK_RELOAD_ACTION_DRIVER_REINIT``.
+
+Port split
+==========
+
+The ``ice`` driver supports port splitting only for port 0, as the FW has
+a predefined set of available port split options for the whole device.
+
+A system reboot is required for port split to be applied.
+
+The following command will select the port split option with 4 ports:
+
+.. code:: shell
+
+ $ devlink port split pci/0000:16:00.0/0 count 4
+
+The list of all available port options will be printed to dynamic debug after
+each ``split`` and ``unsplit`` command. The first option is the default.
+
+.. code:: shell
+
+ ice 0000:16:00.0: Available port split options and max port speeds (Gbps):
+ ice 0000:16:00.0: Status Split Quad 0 Quad 1
+ ice 0000:16:00.0: count L0 L1 L2 L3 L4 L5 L6 L7
+ ice 0000:16:00.0: Active 2 100 - - - 100 - - -
+ ice 0000:16:00.0: 2 50 - 50 - - - - -
+ ice 0000:16:00.0: Pending 4 25 25 25 25 - - - -
+ ice 0000:16:00.0: 4 25 25 - - 25 25 - -
+ ice 0000:16:00.0: 8 10 10 10 10 10 10 10 10
+ ice 0000:16:00.0: 1 100 - - - - - - -
+
+There could be multiple FW port options with the same port split count. When
+the same port split count request is issued again, the next FW port option with
+the same port split count will be selected.
+
+``devlink port unsplit`` will select the option with a split count of 1. If
+there is no FW option available with split count 1, you will receive an error.
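
The selection rule described above can be sketched as follows. This is an illustrative model only: the function ``select_split_option`` is hypothetical, and two details are assumptions rather than documented behaviour, namely that a first request picks the first option with the requested count and that repeated requests wrap around to the first matching option.

```python
from typing import List, Optional


def select_split_option(counts: List[int], pending: Optional[int],
                        requested: int) -> int:
    """Return the index of the FW port option to select.

    counts    -- split count of each FW port option, in FW order
    pending   -- index of the currently pending option, if any
    requested -- split count requested by the user (1 for unsplit)
    """
    candidates = [i for i, count in enumerate(counts) if count == requested]
    if not candidates:
        raise ValueError(f"no FW port option with split count {requested}")
    if pending in candidates:
        # Same count requested again: move on to the next matching option
        # (wrap-around is an assumption).
        return candidates[(candidates.index(pending) + 1) % len(candidates)]
    return candidates[0]


# Split counts taken from the example listing above.
counts = [2, 2, 4, 4, 8, 1]
first = select_split_option(counts, None, 4)
second = select_split_option(counts, first, 4)
print(first, second)
```
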
+
+Regions
+=======
+
+The ``ice`` driver implements the following regions for accessing internal
+device data.
+
+.. list-table:: regions implemented
+ :widths: 15 85
+
+ * - Name
+ - Description
+ * - ``nvm-flash``
+ - The contents of the entire flash chip, sometimes referred to as
+ the device's Non Volatile Memory.
+ * - ``device-caps``
+ - The contents of the device firmware's capabilities buffer. Useful to
+ determine the current state and configuration of the device.
+
+Users can request an immediate capture of a snapshot via the
+``DEVLINK_CMD_REGION_NEW`` command:
+
+.. code:: shell
+
+ $ devlink region show
+ pci/0000:01:00.0/nvm-flash: size 10485760 snapshot [] max 1
+ pci/0000:01:00.0/device-caps: size 4096 snapshot [] max 10
+
+ $ devlink region new pci/0000:01:00.0/nvm-flash snapshot 1
+ $ devlink region dump pci/0000:01:00.0/nvm-flash snapshot 1
+
+ $ devlink region dump pci/0000:01:00.0/nvm-flash snapshot 1
+ 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
+ 0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
+ 0000000000000020 0016 0bb8 0016 1720 0000 0000 c00f 3ffc
+ 0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5
+
+ $ devlink region read pci/0000:01:00.0/nvm-flash snapshot 1 address 0 length 16
+ 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
+
+ $ devlink region delete pci/0000:01:00.0/nvm-flash snapshot 1
+
+ $ devlink region new pci/0000:01:00.0/device-caps snapshot 1
+ $ devlink region dump pci/0000:01:00.0/device-caps snapshot 1
+ 0000000000000000 01 00 01 00 00 00 00 00 01 00 00 00 00 00 00 00
+ 0000000000000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000020 02 00 02 01 32 03 00 00 0a 00 00 00 25 00 00 00
+ 0000000000000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000040 04 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000060 05 00 01 00 03 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000080 06 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000000a0 08 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000000b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000000c0 12 00 01 00 01 00 00 00 01 00 01 00 00 00 00 00
+ 00000000000000d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000000e0 13 00 01 00 00 01 00 00 00 00 00 00 00 00 00 00
+ 00000000000000f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000100 14 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000120 15 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000130 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000140 16 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000150 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000160 17 00 01 00 06 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000170 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000180 18 00 01 00 01 00 00 00 01 00 00 00 08 00 00 00
+ 0000000000000190 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000001a0 22 00 01 00 01 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000001b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000001c0 40 00 01 00 00 08 00 00 08 00 00 00 00 00 00 00
+ 00000000000001d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 00000000000001e0 41 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
+ 00000000000001f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+ 0000000000000200 42 00 01 00 00 08 00 00 00 00 00 00 00 00 00 00
+ 0000000000000210 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+
+ $ devlink region delete pci/0000:01:00.0/device-caps snapshot 1
diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst
index 087ff54d53fc..4b653d040627 100644
--- a/Documentation/networking/devlink/index.rst
+++ b/Documentation/networking/devlink/index.rst
@@ -4,6 +4,20 @@ Linux Devlink Documentation
devlink is an API to expose device information and resources not directly
related to any device class, such as chip-wide/switch-ASIC-wide configuration.
+Locking
+-------
+
+Driver facing APIs are currently transitioning to allow more explicit
+locking. Drivers can use the existing ``devlink_*`` set of APIs, or
+new APIs prefixed by ``devl_*``. The older APIs handle all the locking
+in devlink core, but don't allow registration of most sub-objects once
+the main devlink object is itself registered. The newer ``devl_*`` APIs assume
+the devlink instance lock is already held. Drivers can take the instance
+lock by calling ``devl_lock()``. It is also held in all callbacks of devlink
+netlink commands.
+
+Drivers are encouraged to use the devlink instance lock for their own needs.
+
Interface documentation
-----------------------
@@ -16,10 +30,15 @@ general.
devlink-dpipe
devlink-health
devlink-info
+ devlink-flash
devlink-params
+ devlink-port
devlink-region
devlink-resource
+ devlink-reload
+ devlink-selftests
devlink-trap
+ devlink-linecard
Driver-specific documentation
-----------------------------
@@ -31,7 +50,9 @@ parameters, info versions, and other features it supports.
:maxdepth: 1
bnxt
+ hns3
ionic
+ ice
mlx4
mlx5
mlxsw
@@ -40,3 +61,7 @@ parameters, info versions, and other features it supports.
nfp
qed
ti-cpsw-switch
+ am65-nuss-cpsw-switch
+ prestera
+ iosm
+ octeontx2
diff --git a/Documentation/networking/devlink/iosm.rst b/Documentation/networking/devlink/iosm.rst
new file mode 100644
index 000000000000..6136181339aa
--- /dev/null
+++ b/Documentation/networking/devlink/iosm.rst
@@ -0,0 +1,162 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+iosm devlink support
+====================
+
+This document describes the devlink features implemented by the ``iosm``
+device driver.
+
+Parameters
+==========
+
+The ``iosm`` driver implements the following driver-specific parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``erase_full_flash``
+ - u8
+ - runtime
+ - The ``erase_full_flash`` parameter controls whether a full erase of the
+ device is required during firmware flashing.
+ If set, a full NAND erase command is sent to the device. By default,
+ only conditional erase is enabled.
+
+
+Flash Update
+============
+
+The ``iosm`` driver implements support for flash update using the
+``devlink-flash`` interface.
+
+It supports updating the device flash using a combined flash image which contains
+the Bootloader images and other modem software images.
+
+The driver uses DEVLINK_SUPPORT_FLASH_UPDATE_COMPONENT to identify the type of
+firmware image that needs to be flashed, as requested by the user space
+application. The supported firmware image types are listed below.
+
+.. list-table:: Firmware Image types
+ :widths: 15 85
+
+ * - Name
+ - Description
+ * - ``PSI RAM``
+ - Primary Signed Image
+ * - ``EBL``
+ - External Bootloader
+ * - ``FLS``
+ - Modem Software Image
+
+PSI RAM and EBL are RAM images that are injected into the device while it is
+in the Boot ROM stage. Once this succeeds, the actual modem firmware image is
+flashed to the device. The modem software image contains multiple files, each
+having one secure bin file and at least one Loadmap/Region file. To flash
+these files, the appropriate commands are sent to the modem device along with
+the data required for flashing. Data such as the region count and the address
+of each region must be passed to the driver using the devlink param command.
+
+If the device has to be fully erased before firmware flashing, the user
+application needs to set the erase_full_flash parameter using the devlink
+param command. By default, only conditional erase is enabled.
+
+Flash Commands:
+===============
+1) When the modem is in the Boot ROM stage, use the command below to inject the
+PSI RAM image with the devlink flash command.
+
+$ devlink dev flash pci/0000:02:00.0 file <PSI_RAM_File_name>
+
+2) To perform a full erase, issue the command below to set the
+erase_full_flash parameter (set it only if a full erase is required).
+
+$ devlink dev param set pci/0000:02:00.0 name erase_full_flash value true cmode runtime
+
+3) Inject EBL after the modem is in PSI stage.
+
+$ devlink dev flash pci/0000:02:00.0 file <EBL_File_name>
+
+4) Once EBL is injected successfully, the actual firmware flashing takes
+place. Below is the sequence of commands used for each of the firmware images.
+
+a) Flash secure bin file.
+
+$ devlink dev flash pci/0000:02:00.0 file <Secure_bin_file_name>
+
+b) Flash the Loadmap/Region file.
+
+$ devlink dev flash pci/0000:02:00.0 file <Load_map_file_name>
+
+Regions
+=======
+
+The ``iosm`` driver supports dumping the coredump logs.
+
+If the firmware encounters an exception, the driver takes a snapshot. The
+following regions are available for accessing device internal data.
+
+.. list-table:: Regions implemented
+ :widths: 15 85
+
+ * - Name
+ - Description
+ * - ``report.json``
+ - This region contains a summary of the exception details.
+ * - ``coredump.fcd``
+ - This region contains the details related to the exception that occurred in
+ the device (RAM dump).
+ * - ``cdd.log``
+ - This region contains the logs related to the modem CDD driver.
+ * - ``eeprom.bin``
+ - This region contains the eeprom logs.
+ * - ``bootcore_trace.bin``
+ - This region contains the current instance of bootloader logs.
+ * - ``bootcore_prev_trace.bin``
+ - This region contains the previous instance of bootloader logs.
+
+
+Region commands
+===============
+
+$ devlink region show
+
+$ devlink region new pci/0000:02:00.0/report.json
+
+$ devlink region dump pci/0000:02:00.0/report.json snapshot 0
+
+$ devlink region del pci/0000:02:00.0/report.json snapshot 0
+
+$ devlink region new pci/0000:02:00.0/coredump.fcd
+
+$ devlink region dump pci/0000:02:00.0/coredump.fcd snapshot 1
+
+$ devlink region del pci/0000:02:00.0/coredump.fcd snapshot 1
+
+$ devlink region new pci/0000:02:00.0/cdd.log
+
+$ devlink region dump pci/0000:02:00.0/cdd.log snapshot 2
+
+$ devlink region del pci/0000:02:00.0/cdd.log snapshot 2
+
+$ devlink region new pci/0000:02:00.0/eeprom.bin
+
+$ devlink region dump pci/0000:02:00.0/eeprom.bin snapshot 3
+
+$ devlink region del pci/0000:02:00.0/eeprom.bin snapshot 3
+
+$ devlink region new pci/0000:02:00.0/bootcore_trace.bin
+
+$ devlink region dump pci/0000:02:00.0/bootcore_trace.bin snapshot 4
+
+$ devlink region del pci/0000:02:00.0/bootcore_trace.bin snapshot 4
+
+$ devlink region new pci/0000:02:00.0/bootcore_prev_trace.bin
+
+$ devlink region dump pci/0000:02:00.0/bootcore_prev_trace.bin snapshot 5
+
+$ devlink region del pci/0000:02:00.0/bootcore_prev_trace.bin snapshot 5
diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst
index 629a6e69c036..29ad304e6fba 100644
--- a/Documentation/networking/devlink/mlx5.rst
+++ b/Documentation/networking/devlink/mlx5.rst
@@ -14,8 +14,19 @@ Parameters
* - Name
- Mode
+ - Validation
* - ``enable_roce``
- driverinit
+ - Type: Boolean
+ * - ``io_eq_size``
+ - driverinit
+ - The range is between 64 and 4096.
+ * - ``event_eq_size``
+ - driverinit
+ - The range is between 64 and 4096.
+ * - ``max_macs``
+ - driverinit
+ - The range is between 1 and 2^31. Only power of 2 values are supported.
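
The validation rules listed in the table above can be expressed as simple checks. This is a minimal sketch, not the driver's implementation: the helper names are hypothetical, and it assumes the stated ranges are inclusive at both ends.

```python
def is_valid_eq_size(value: int) -> bool:
    """Check an io_eq_size / event_eq_size value (assumed inclusive range)."""
    return 64 <= value <= 4096


def is_valid_max_macs(value: int) -> bool:
    """Check a max_macs value: within 1..2^31 and a power of two."""
    in_range = 1 <= value <= 2**31
    # A power of two has exactly one bit set, so clearing the lowest set bit
    # leaves zero.
    power_of_two = value > 0 and (value & (value - 1)) == 0
    return in_range and power_of_two
```
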
The ``mlx5`` driver also implements the following driver-specific
parameters.
@@ -37,6 +48,12 @@ parameters.
* ``smfs`` Software managed flow steering. In SMFS mode, the HW
steering entities are created and manage through the driver without
firmware intervention.
+ * - ``fdb_large_groups``
+ - u32
+ - driverinit
+ - Control the number of large groups (size > 1) in the FDB table.
+
+ * The default value is 15, and the range is between 1 and 1024.
The ``mlx5`` driver supports reloading via ``DEVLINK_CMD_RELOAD``
diff --git a/Documentation/networking/devlink/mlxsw.rst b/Documentation/networking/devlink/mlxsw.rst
index cf857cb4ba8f..433962225bd4 100644
--- a/Documentation/networking/devlink/mlxsw.rst
+++ b/Documentation/networking/devlink/mlxsw.rst
@@ -58,6 +58,30 @@ The ``mlxsw`` driver reports the following versions
- running
- Three digit firmware version
+Line card auxiliary device info versions
+========================================
+
+The ``mlxsw`` driver reports the following versions for line card auxiliary device
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``hw.revision``
+ - fixed
+ - The hardware revision for this line card
+ * - ``ini.version``
+ - running
+ - Version of line card INI loaded
+ * - ``fw.psid``
+ - fixed
+ - Line card device PSID
+ * - ``fw.version``
+ - running
+ - Three digit firmware version of line card device
+
Driver-specific Traps
=====================
diff --git a/Documentation/networking/devlink/netdevsim.rst b/Documentation/networking/devlink/netdevsim.rst
index 2a266b7e7b38..ec5e6d79b2e2 100644
--- a/Documentation/networking/devlink/netdevsim.rst
+++ b/Documentation/networking/devlink/netdevsim.rst
@@ -46,7 +46,7 @@ Resources
=========
The ``netdevsim`` driver exposes resources to control the number of FIB
-entries and FIB rule entries that the driver will allow.
+entries, FIB rule entries and nexthops that the driver will allow.
.. code:: shell
@@ -54,8 +54,35 @@ entries and FIB rule entries that the driver will allow.
$ devlink resource set netdevsim/netdevsim0 path /IPv4/fib-rules size 16
$ devlink resource set netdevsim/netdevsim0 path /IPv6/fib size 64
$ devlink resource set netdevsim/netdevsim0 path /IPv6/fib-rules size 16
+ $ devlink resource set netdevsim/netdevsim0 path /nexthops size 16
$ devlink dev reload netdevsim/netdevsim0
+Rate objects
+============
+
+The ``netdevsim`` driver supports rate objects management, which includes:
+
+- registering/unregistering leaf rate objects per VF devlink port;
+- creation/deletion of node rate objects;
+- setting tx_share and tx_max rate values for any rate object type;
+- setting parent node for any rate object type.
+
+Rate nodes and their parameters are exposed read-only in ``netdevsim`` debugfs.
+For example, for a created rate node named ``some_group``:
+
+.. code:: shell
+
+ $ ls /sys/kernel/debug/netdevsim/netdevsim0/rate_groups/some_group
+ rate_parent tx_max tx_share
+
+The same parameters are exposed for leaf objects in the corresponding port
+directories. For example:
+
+.. code:: shell
+
+ $ ls /sys/kernel/debug/netdevsim/netdevsim0/ports/1
+ dev ethtool rate_parent tx_max tx_share
+
Driver-specific Traps
=====================
diff --git a/Documentation/networking/devlink/octeontx2.rst b/Documentation/networking/devlink/octeontx2.rst
new file mode 100644
index 000000000000..610de99b728a
--- /dev/null
+++ b/Documentation/networking/devlink/octeontx2.rst
@@ -0,0 +1,42 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+octeontx2 devlink support
+=========================
+
+This document describes the devlink features implemented by the ``octeontx2 AF, PF and VF``
+device drivers.
+
+Parameters
+==========
+
+The ``octeontx2 PF and VF`` drivers implement the following driver-specific parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``mcam_count``
+ - u16
+ - runtime
+ - Select number of match CAM entries to be allocated for an interface.
+ The same is used for ntuple filters of the interface. Supported by
+ PF and VF drivers.
+
+The ``octeontx2 AF`` driver implements the following driver-specific parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``dwrr_mtu``
+ - u32
+ - runtime
+ - Used to set the quantum that the hardware uses for scheduling among transmit queues.
+ The hardware uses a weighted DWRR algorithm to schedule among all transmit queues.
diff --git a/Documentation/networking/devlink/prestera.rst b/Documentation/networking/devlink/prestera.rst
new file mode 100644
index 000000000000..49409d1d3081
--- /dev/null
+++ b/Documentation/networking/devlink/prestera.rst
@@ -0,0 +1,141 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================
+prestera devlink support
+========================
+
+This document describes the devlink features implemented by the ``prestera``
+device driver.
+
+Driver-specific Traps
+=====================
+
+.. list-table:: List of Driver-specific Traps Registered by ``prestera``
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``arp_bc``
+ - ``trap``
+ - Traps ARP broadcast packets (both requests/responses)
+ * - ``is_is``
+ - ``trap``
+ - Traps IS-IS packets
+ * - ``ospf``
+ - ``trap``
+ - Traps OSPF packets
+ * - ``ip_bc_mac``
+ - ``trap``
+ - Traps IPv4 packets with broadcast DA Mac address
+ * - ``stp``
+ - ``trap``
+ - Traps STP BPDU
+ * - ``lacp``
+ - ``trap``
+ - Traps LACP packets
+ * - ``lldp``
+ - ``trap``
+ - Traps LLDP packets
+ * - ``router_mc``
+ - ``trap``
+ - Traps multicast packets
+ * - ``vrrp``
+ - ``trap``
+ - Traps VRRP packets
+ * - ``dhcp``
+ - ``trap``
+ - Traps DHCP packets
+ * - ``mtu_error``
+ - ``trap``
+ - Traps (exception) packets that exceeded port's MTU
+ * - ``mac_to_me``
+ - ``trap``
+ - Traps packets with switch-port's DA Mac address
+ * - ``ttl_error``
+ - ``trap``
+ - Traps (exception) IPv4 packets whose TTL was exceeded
+ * - ``ipv4_options``
+ - ``trap``
+ - Traps (exception) packets with malformed IPv4 header options
+ * - ``ip_default_route``
+ - ``trap``
+ - Traps packets that have no specific IP interface (IP to me) and no forwarding prefix
+ * - ``local_route``
+ - ``trap``
+ - Traps packets that have been sent to one of the switch's IP interface addresses
+ * - ``ipv4_icmp_redirect``
+ - ``trap``
+ - Traps (exception) IPV4 ICMP redirect packets
+ * - ``arp_response``
+ - ``trap``
+ - Traps ARP reply packets that have the switch-port's DA Mac address
+ * - ``acl_code_0``
+ - ``trap``
+ - Traps packets that have ACL priority set to 0 (tc pref 0)
+ * - ``acl_code_1``
+ - ``trap``
+ - Traps packets that have ACL priority set to 1 (tc pref 1)
+ * - ``acl_code_2``
+ - ``trap``
+ - Traps packets that have ACL priority set to 2 (tc pref 2)
+ * - ``acl_code_3``
+ - ``trap``
+ - Traps packets that have ACL priority set to 3 (tc pref 3)
+ * - ``acl_code_4``
+ - ``trap``
+ - Traps packets that have ACL priority set to 4 (tc pref 4)
+ * - ``acl_code_5``
+ - ``trap``
+ - Traps packets that have ACL priority set to 5 (tc pref 5)
+ * - ``acl_code_6``
+ - ``trap``
+ - Traps packets that have ACL priority set to 6 (tc pref 6)
+ * - ``acl_code_7``
+ - ``trap``
+ - Traps packets that have ACL priority set to 7 (tc pref 7)
+ * - ``ipv4_bgp``
+ - ``trap``
+ - Traps IPv4 BGP packets
+ * - ``ssh``
+ - ``trap``
+ - Traps SSH packets
+ * - ``telnet``
+ - ``trap``
+ - Traps Telnet packets
+ * - ``icmp``
+ - ``trap``
+ - Traps ICMP packets
+ * - ``rxdma_drop``
+ - ``drop``
+ - Drops packets (RxDMA) due to the lack of ingress buffers etc.
+ * - ``port_no_vlan``
+ - ``drop``
+ - Drops packets due to a misconfigured network or an internal bug (config issue).
+ * - ``local_port``
+ - ``drop``
+ - Drops packets whose decision (FDB entry) is to bridge packet back to the incoming port/trunk.
+ * - ``invalid_sa``
+ - ``drop``
+ - Drops packets with multicast source MAC address.
+ * - ``illegal_ip_addr``
+ - ``drop``
+ - Drops packets with illegal SIP/DIP multicast/unicast addresses.
+ * - ``illegal_ipv4_hdr``
+ - ``drop``
+ - Drops packets with illegal IPV4 header.
+ * - ``ip_uc_dip_da_mismatch``
+ - ``drop``
+ - Drops packets with destination MAC being unicast, but destination IP address being multicast.
+ * - ``ip_sip_is_zero``
+ - ``drop``
+ - Drops packets with zero (0) IPV4 source address.
+ * - ``met_red``
+ - ``drop``
+ - Drops non-conforming packets (dropped by Ingress policer, metering drop), e.g. packet rate exceeded configured bandwidth.
diff --git a/Documentation/networking/dns_resolver.txt b/Documentation/networking/dns_resolver.rst
index eaa8f9a6fd5d..add4d59a99a5 100644
--- a/Documentation/networking/dns_resolver.txt
+++ b/Documentation/networking/dns_resolver.rst
@@ -1,8 +1,10 @@
- ===================
- DNS Resolver Module
- ===================
+.. SPDX-License-Identifier: GPL-2.0
-Contents:
+===================
+DNS Resolver Module
+===================
+
+.. Contents:
- Overview.
- Compilation.
@@ -12,8 +14,7 @@ Contents:
- Debugging.
-========
-OVERVIEW
+Overview
========
The DNS resolver module provides a way for kernel services to make DNS queries
@@ -33,50 +34,50 @@ It does not yet support the following AFS features:
This code is extracted from the CIFS filesystem.
-===========
-COMPILATION
+Compilation
===========
-The module should be enabled by turning on the kernel configuration options:
+The module should be enabled by turning on the kernel configuration options::
CONFIG_DNS_RESOLVER - tristate "DNS Resolver support"
-==========
-SETTING UP
+Setting up
==========
To set up this facility, the /etc/request-key.conf file must be altered so that
/sbin/request-key can appropriately direct the upcalls. For example, to handle
basic dname to IPv4/IPv6 address resolution, the following line should be
-added:
+added::
+
#OP TYPE DESC CO-INFO PROGRAM ARG1 ARG2 ARG3 ...
#====== ============ ======= ======= ==========================
create dns_resolver * * /usr/sbin/cifs.upcall %k
To direct a query for query type 'foo', a line of the following should be added
-before the more general line given above as the first match is the one taken.
+before the more general line given above as the first match is the one taken::
create dns_resolver foo:* * /usr/sbin/dns.foo %k
-=====
-USAGE
+Usage
=====
To make use of this facility, one of the following functions that are
-implemented in the module can be called after doing:
+implemented in the module can be called after doing::
#include <linux/dns_resolver.h>
- (1) int dns_query(const char *type, const char *name, size_t namelen,
- const char *options, char **_result, time_t *_expiry);
+ ::
+
+ int dns_query(const char *type, const char *name, size_t namelen,
+ const char *options, char **_result, time_t *_expiry);
This is the basic access function. It looks for a cached DNS query and if
it doesn't find it, it upcalls to userspace to make a new DNS query, which
may then be cached. The key description is constructed as a string of the
- form:
+ form::
[<type>:]<name>
@@ -107,16 +108,14 @@ This can be cleared by any process that has the CAP_SYS_ADMIN capability by
the use of KEYCTL_KEYRING_CLEAR on the keyring ID.
-===============================
-READING DNS KEYS FROM USERSPACE
+Reading DNS Keys from Userspace
===============================
Keys of dns_resolver type can be read from userspace using keyctl_read() or
"keyctl read/print/pipe".
-=========
-MECHANISM
+Mechanism
=========
The dnsresolver module registers a key type called "dns_resolver". Keys of
@@ -147,11 +146,10 @@ See <file:Documentation/security/keys/request-key.rst> for further
information about request-key function.
-=========
-DEBUGGING
+Debugging
=========
Debugging messages can be turned on dynamically by writing a 1 into the
-following file:
+following file::
- /sys/module/dnsresolver/parameters/debug
+ /sys/module/dnsresolver/parameters/debug
diff --git a/Documentation/networking/driver.txt b/Documentation/networking/driver.rst
index da59e2884130..64f7236ff10b 100644
--- a/Documentation/networking/driver.txt
+++ b/Documentation/networking/driver.rst
@@ -1,14 +1,18 @@
-Document about softnet driver issues
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Softnet Driver Issues
+=====================
Transmit path guidelines:
1) The ndo_start_xmit method must not return NETDEV_TX_BUSY under
any normal circumstances. It is considered a hard error unless
- there is no way your device can tell ahead of time when it's
+ there is no way your device can tell ahead of time when its
transmit function will become busy.
Instead it must maintain the queue properly. For example,
- for a driver implementing scatter-gather this means:
+ for a driver implementing scatter-gather this means::
static netdev_tx_t drv_hard_start_xmit(struct sk_buff *skb,
struct net_device *dev)
@@ -38,25 +42,25 @@ Transmit path guidelines:
return NETDEV_TX_OK;
}
- And then at the end of your TX reclamation event handling:
+ And then at the end of your TX reclamation event handling::
if (netif_queue_stopped(dp->dev) &&
- TX_BUFFS_AVAIL(dp) > (MAX_SKB_FRAGS + 1))
+ TX_BUFFS_AVAIL(dp) > (MAX_SKB_FRAGS + 1))
netif_wake_queue(dp->dev);
- For a non-scatter-gather supporting card, the three tests simply become:
+ For a non-scatter-gather supporting card, the three tests simply become::
/* This is a hard error, log it. */
if (TX_BUFFS_AVAIL(dp) <= 0)
- and:
+ and::
if (TX_BUFFS_AVAIL(dp) == 0)
- and:
+ and::
if (netif_queue_stopped(dp->dev) &&
- TX_BUFFS_AVAIL(dp) > 0)
+ TX_BUFFS_AVAIL(dp) > 0)
netif_wake_queue(dp->dev);
2) An ndo_start_xmit method must not modify the shared parts of a
@@ -86,7 +90,7 @@ Close/stop guidelines:
1) After the ndo_stop routine has been called, the hardware must
not receive or transmit any data. All in flight packets must
- be aborted. If necessary, poll or wait for completion of
+ be aborted. If necessary, poll or wait for completion of
any reset commands.
2) The ndo_stop routine will be called by unregister_netdevice
diff --git a/Documentation/networking/dsa/configuration.rst b/Documentation/networking/dsa/configuration.rst
index af029b3ca2ab..827701f8cbfe 100644
--- a/Documentation/networking/dsa/configuration.rst
+++ b/Documentation/networking/dsa/configuration.rst
@@ -34,14 +34,24 @@ interface. The CPU port is the switch port connected to an Ethernet MAC chip.
The corresponding linux Ethernet interface is called the master interface.
All other corresponding linux interfaces are called slave interfaces.
-The slave interfaces depend on the master interface. They can only brought up,
-when the master interface is up.
+The slave interfaces depend on the master interface being up in order for them
+to send or receive traffic. Prior to kernel v5.12, the state of the master
+interface had to be managed explicitly by the user. Starting with kernel v5.12,
+the behavior is as follows:
+
+- when a DSA slave interface is brought up, the master interface is
+ automatically brought up.
+- when the master interface is brought down, all DSA slave interfaces are
+ automatically brought down.
In this documentation the following Ethernet interfaces are used:
*eth0*
the master interface
+*eth1*
+ another master interface
+
*lan1*
a slave interface
@@ -78,79 +88,76 @@ The tagging based configuration is desired and supported by the majority of
DSA switches. These switches are capable of tagging incoming and outgoing traffic
without using a VLAN based configuration.
-single port
-~~~~~~~~~~~
-
-.. code-block:: sh
-
- # configure each interface
- ip addr add 192.0.2.1/30 dev lan1
- ip addr add 192.0.2.5/30 dev lan2
- ip addr add 192.0.2.9/30 dev lan3
-
- # The master interface needs to be brought up before the slave ports.
- ip link set eth0 up
+*single port*
+ .. code-block:: sh
- # bring up the slave interfaces
- ip link set lan1 up
- ip link set lan2 up
- ip link set lan3 up
+ # configure each interface
+ ip addr add 192.0.2.1/30 dev lan1
+ ip addr add 192.0.2.5/30 dev lan2
+ ip addr add 192.0.2.9/30 dev lan3
-bridge
-~~~~~~
+ # For kernels earlier than v5.12, the master interface needs to be
+ # brought up manually before the slave ports.
+ ip link set eth0 up
-.. code-block:: sh
+ # bring up the slave interfaces
+ ip link set lan1 up
+ ip link set lan2 up
+ ip link set lan3 up
- # The master interface needs to be brought up before the slave ports.
- ip link set eth0 up
+*bridge*
+ .. code-block:: sh
- # bring up the slave interfaces
- ip link set lan1 up
- ip link set lan2 up
- ip link set lan3 up
+ # For kernels earlier than v5.12, the master interface needs to be
+ # brought up manually before the slave ports.
+ ip link set eth0 up
- # create bridge
- ip link add name br0 type bridge
+ # bring up the slave interfaces
+ ip link set lan1 up
+ ip link set lan2 up
+ ip link set lan3 up
- # add ports to bridge
- ip link set dev lan1 master br0
- ip link set dev lan2 master br0
- ip link set dev lan3 master br0
+ # create bridge
+ ip link add name br0 type bridge
- # configure the bridge
- ip addr add 192.0.2.129/25 dev br0
+ # add ports to bridge
+ ip link set dev lan1 master br0
+ ip link set dev lan2 master br0
+ ip link set dev lan3 master br0
- # bring up the bridge
- ip link set dev br0 up
+ # configure the bridge
+ ip addr add 192.0.2.129/25 dev br0
-gateway
-~~~~~~~
+ # bring up the bridge
+ ip link set dev br0 up
-.. code-block:: sh
+*gateway*
+ .. code-block:: sh
- # The master interface needs to be brought up before the slave ports.
- ip link set eth0 up
+ # For kernels earlier than v5.12, the master interface needs to be
+ # brought up manually before the slave ports.
+ ip link set eth0 up
- # bring up the slave interfaces
- ip link set wan up
- ip link set lan1 up
- ip link set lan2 up
+ # bring up the slave interfaces
+ ip link set wan up
+ ip link set lan1 up
+ ip link set lan2 up
- # configure the upstream port
- ip addr add 192.0.2.1/30 dev wan
+ # configure the upstream port
+ ip addr add 192.0.2.1/30 dev wan
- # create bridge
- ip link add name br0 type bridge
+ # create bridge
+ ip link add name br0 type bridge
- # add ports to bridge
- ip link set dev lan1 master br0
- ip link set dev lan2 master br0
+ # add ports to bridge
+ ip link set dev lan1 master br0
+ ip link set dev lan2 master br0
- # configure the bridge
- ip addr add 192.0.2.129/25 dev br0
+ # configure the bridge
+ ip addr add 192.0.2.129/25 dev br0
- # bring up the bridge
- ip link set dev br0 up
+ # bring up the bridge
+ ip link set dev br0 up
.. _dsa-vlan-configuration:
@@ -161,132 +168,291 @@ A minority of switches are not capable to use a taging protocol
(DSA_TAG_PROTO_NONE). These switches can be configured by a VLAN based
configuration.
-single port
-~~~~~~~~~~~
-The configuration can only be set up via VLAN tagging and bridge setup.
-
-.. code-block:: sh
-
- # tag traffic on CPU port
- ip link add link eth0 name eth0.1 type vlan id 1
- ip link add link eth0 name eth0.2 type vlan id 2
- ip link add link eth0 name eth0.3 type vlan id 3
-
- # The master interface needs to be brought up before the slave ports.
- ip link set eth0 up
- ip link set eth0.1 up
- ip link set eth0.2 up
- ip link set eth0.3 up
-
- # bring up the slave interfaces
- ip link set lan1 up
- ip link set lan1 up
- ip link set lan3 up
-
- # create bridge
- ip link add name br0 type bridge
-
- # activate VLAN filtering
- ip link set dev br0 type bridge vlan_filtering 1
-
- # add ports to bridges
- ip link set dev lan1 master br0
- ip link set dev lan2 master br0
- ip link set dev lan3 master br0
-
- # tag traffic on ports
- bridge vlan add dev lan1 vid 1 pvid untagged
- bridge vlan add dev lan2 vid 2 pvid untagged
- bridge vlan add dev lan3 vid 3 pvid untagged
-
- # configure the VLANs
- ip addr add 192.0.2.1/30 dev eth0.1
- ip addr add 192.0.2.5/30 dev eth0.2
- ip addr add 192.0.2.9/30 dev eth0.3
-
- # bring up the bridge devices
- ip link set br0 up
-
+*single port*
+ The configuration can only be set up via VLAN tagging and bridge setup.
-bridge
-~~~~~~
+ .. code-block:: sh
-.. code-block:: sh
+ # tag traffic on CPU port
+ ip link add link eth0 name eth0.1 type vlan id 1
+ ip link add link eth0 name eth0.2 type vlan id 2
+ ip link add link eth0 name eth0.3 type vlan id 3
- # tag traffic on CPU port
- ip link add link eth0 name eth0.1 type vlan id 1
+ # For kernels earlier than v5.12, the master interface needs to be
+ # brought up manually before the slave ports.
+ ip link set eth0 up
+ ip link set eth0.1 up
+ ip link set eth0.2 up
+ ip link set eth0.3 up
- # The master interface needs to be brought up before the slave ports.
- ip link set eth0 up
- ip link set eth0.1 up
+ # bring up the slave interfaces
+ ip link set lan1 up
+ ip link set lan2 up
+ ip link set lan3 up
- # bring up the slave interfaces
- ip link set lan1 up
- ip link set lan2 up
- ip link set lan3 up
+ # create bridge
+ ip link add name br0 type bridge
- # create bridge
- ip link add name br0 type bridge
+ # activate VLAN filtering
+ ip link set dev br0 type bridge vlan_filtering 1
- # activate VLAN filtering
- ip link set dev br0 type bridge vlan_filtering 1
+ # add ports to bridges
+ ip link set dev lan1 master br0
+ ip link set dev lan2 master br0
+ ip link set dev lan3 master br0
- # add ports to bridge
- ip link set dev lan1 master br0
- ip link set dev lan2 master br0
- ip link set dev lan3 master br0
- ip link set eth0.1 master br0
+ # tag traffic on ports
+ bridge vlan add dev lan1 vid 1 pvid untagged
+ bridge vlan add dev lan2 vid 2 pvid untagged
+ bridge vlan add dev lan3 vid 3 pvid untagged
- # tag traffic on ports
- bridge vlan add dev lan1 vid 1 pvid untagged
- bridge vlan add dev lan2 vid 1 pvid untagged
- bridge vlan add dev lan3 vid 1 pvid untagged
+ # configure the VLANs
+ ip addr add 192.0.2.1/30 dev eth0.1
+ ip addr add 192.0.2.5/30 dev eth0.2
+ ip addr add 192.0.2.9/30 dev eth0.3
- # configure the bridge
- ip addr add 192.0.2.129/25 dev br0
+ # bring up the bridge devices
+ ip link set br0 up
- # bring up the bridge
- ip link set dev br0 up
-gateway
-~~~~~~~
+*bridge*
+ .. code-block:: sh
-.. code-block:: sh
+ # tag traffic on CPU port
+ ip link add link eth0 name eth0.1 type vlan id 1
- # tag traffic on CPU port
- ip link add link eth0 name eth0.1 type vlan id 1
- ip link add link eth0 name eth0.2 type vlan id 2
+ # For kernels earlier than v5.12, the master interface needs to be
+ # brought up manually before the slave ports.
+ ip link set eth0 up
+ ip link set eth0.1 up
- # The master interface needs to be brought up before the slave ports.
- ip link set eth0 up
- ip link set eth0.1 up
- ip link set eth0.2 up
+ # bring up the slave interfaces
+ ip link set lan1 up
+ ip link set lan2 up
+ ip link set lan3 up
- # bring up the slave interfaces
- ip link set wan up
- ip link set lan1 up
- ip link set lan2 up
+ # create bridge
+ ip link add name br0 type bridge
- # create bridge
- ip link add name br0 type bridge
+ # activate VLAN filtering
+ ip link set dev br0 type bridge vlan_filtering 1
- # activate VLAN filtering
- ip link set dev br0 type bridge vlan_filtering 1
+ # add ports to bridge
+ ip link set dev lan1 master br0
+ ip link set dev lan2 master br0
+ ip link set dev lan3 master br0
+ ip link set eth0.1 master br0
- # add ports to bridges
- ip link set dev wan master br0
- ip link set eth0.1 master br0
- ip link set dev lan1 master br0
- ip link set dev lan2 master br0
+ # tag traffic on ports
+ bridge vlan add dev lan1 vid 1 pvid untagged
+ bridge vlan add dev lan2 vid 1 pvid untagged
+ bridge vlan add dev lan3 vid 1 pvid untagged
- # tag traffic on ports
- bridge vlan add dev lan1 vid 1 pvid untagged
- bridge vlan add dev lan2 vid 1 pvid untagged
- bridge vlan add dev wan vid 2 pvid untagged
+ # configure the bridge
+ ip addr add 192.0.2.129/25 dev br0
- # configure the VLANs
- ip addr add 192.0.2.1/30 dev eth0.2
- ip addr add 192.0.2.129/25 dev br0
+ # bring up the bridge
+ ip link set dev br0 up
- # bring up the bridge devices
- ip link set br0 up
+*gateway*
+ .. code-block:: sh
+
+ # tag traffic on CPU port
+ ip link add link eth0 name eth0.1 type vlan id 1
+ ip link add link eth0 name eth0.2 type vlan id 2
+
+ # For kernels earlier than v5.12, the master interface needs to be
+ # brought up manually before the slave ports.
+ ip link set eth0 up
+ ip link set eth0.1 up
+ ip link set eth0.2 up
+
+ # bring up the slave interfaces
+ ip link set wan up
+ ip link set lan1 up
+ ip link set lan2 up
+
+ # create bridge
+ ip link add name br0 type bridge
+
+ # activate VLAN filtering
+ ip link set dev br0 type bridge vlan_filtering 1
+
+ # add ports to bridges
+ ip link set dev wan master br0
+ ip link set eth0.1 master br0
+ ip link set dev lan1 master br0
+ ip link set dev lan2 master br0
+
+ # tag traffic on ports
+ bridge vlan add dev lan1 vid 1 pvid untagged
+ bridge vlan add dev lan2 vid 1 pvid untagged
+ bridge vlan add dev wan vid 2 pvid untagged
+
+ # configure the VLANs
+ ip addr add 192.0.2.1/30 dev eth0.2
+ ip addr add 192.0.2.129/25 dev br0
+
+ # bring up the bridge devices
+ ip link set br0 up
+
+Forwarding database (FDB) management
+------------------------------------
+
+The existing DSA switches do not have the necessary hardware support to keep
+the software FDB of the bridge in sync with the hardware tables, so the two
+tables are managed separately (``bridge fdb show`` queries both, and depending
+on whether the ``self`` or ``master`` flags are being used, a ``bridge fdb
+add`` or ``bridge fdb del`` command acts upon entries from one or both tables).
+
+Up until kernel v4.14, DSA only supported user space management of bridge FDB
+entries using the bridge bypass operations (which do not update the software
+FDB, just the hardware one) using the ``self`` flag (which is optional and can
+be omitted).
+
+ .. code-block:: sh
+
+ bridge fdb add dev swp0 00:01:02:03:04:05 self static
+ # or shorthand
+ bridge fdb add dev swp0 00:01:02:03:04:05 static
+
+Due to a bug, the bridge bypass FDB implementation provided by DSA did not
+distinguish between ``static`` and ``local`` FDB entries (``static`` are meant
+to be forwarded, while ``local`` are meant to be locally terminated, i.e. sent
+to the host port). Instead, all FDB entries with the ``self`` flag (implicit or
+explicit) are treated by DSA as ``static`` even if they are ``local``.
+
+ .. code-block:: sh
+
+ # This command:
+ bridge fdb add dev swp0 00:01:02:03:04:05 static
+ # behaves the same for DSA as this command:
+ bridge fdb add dev swp0 00:01:02:03:04:05 local
+ # Because the 'local' flag is implicit when 'static' is not specified,
+ # it also behaves the same as this shorthand:
+ bridge fdb add dev swp0 00:01:02:03:04:05
+
+The last command is an incorrect way of adding a static bridge FDB entry to a
+DSA switch using the bridge bypass operations, and works by mistake. Other
+drivers will treat an FDB entry added by the same command as ``local`` and as
+such, will not forward it, as opposed to DSA.
+
+Between kernel v4.14 and v5.14, DSA supported two modes of adding a bridge FDB
+entry to the switch in parallel: the bridge bypass discussed above, as
+well as a new mode using the ``master`` flag which installs FDB entries in the
+software bridge too.
+
+ .. code-block:: sh
+
+ bridge fdb add dev swp0 00:01:02:03:04:05 master static
+
+Since kernel v5.14, DSA has gained stronger integration with the bridge's
+software FDB, and the support for its bridge bypass FDB implementation (using
+the ``self`` flag) has been removed. This results in the following changes:
+
+ .. code-block:: sh
+
+ # This is the only valid way of adding an FDB entry that is supported,
+ # compatible with v4.14 kernels and later:
+ bridge fdb add dev swp0 00:01:02:03:04:05 master static
+ # This command is no longer buggy and the entry is properly treated as
+ # 'local' instead of being forwarded:
+ bridge fdb add dev swp0 00:01:02:03:04:05
+ # This command no longer installs a static FDB entry to hardware:
+ bridge fdb add dev swp0 00:01:02:03:04:05 static
+
+Script writers are therefore encouraged to use the ``master static`` set of
+flags when working with bridge FDB entries on DSA switch interfaces.
+
+Affinity of user ports to CPU ports
+-----------------------------------
+
+Typically, DSA switches are attached to the host via a single Ethernet
+interface, but in cases where the switch chip is discrete, the hardware design
+may permit the use of 2 or more ports connected to the host, for an increase in
+termination throughput.
+
+DSA can make use of multiple CPU ports in two ways. First, it is possible to
+statically assign the termination traffic associated with a certain user port
+to be processed by a certain CPU port. This way, user space can implement
+custom policies of static load balancing between user ports, by spreading the
+affinities according to the available CPU ports.
+
+Secondly, it is possible to perform load balancing between CPU ports on a per
+packet basis, rather than statically assigning user ports to CPU ports.
+This can be achieved by placing the DSA masters under a LAG interface (bonding
+or team). DSA monitors this operation and creates a mirror of this software LAG
+on the CPU ports facing the physical DSA masters that constitute the LAG slave
+devices.
+
+To make use of multiple CPU ports, the firmware (device tree) description of
+the switch must mark all the links between CPU ports and their DSA masters
+using the ``ethernet`` reference/phandle. At startup, only a single CPU port
+and DSA master will be used - the numerically first port from the firmware
+description which has an ``ethernet`` property. It is up to the user to
+configure the system for the switch to use other masters.
+
+DSA uses the ``rtnl_link_ops`` mechanism (with a "dsa" ``kind``) to allow
+changing the DSA master of a user port. The ``IFLA_DSA_MASTER`` u32 netlink
+attribute contains the ifindex of the master device that handles each slave
+device. The DSA master must be a valid candidate based on firmware node
+information, or a LAG interface which contains only slaves which are valid
+candidates.
+
+Using iproute2, the following manipulations are possible:
+
+ .. code-block:: sh
+
+ # See the DSA master in current use
+ ip -d link show dev swp0
+ (...)
+ dsa master eth0
+
+ # Static CPU port distribution
+ ip link set swp0 type dsa master eth1
+ ip link set swp1 type dsa master eth0
+ ip link set swp2 type dsa master eth1
+ ip link set swp3 type dsa master eth0
+
+ # CPU ports in LAG, using explicit assignment of the DSA master
+ ip link add bond0 type bond mode balance-xor && ip link set bond0 up
+ ip link set eth1 down && ip link set eth1 master bond0
+ ip link set swp0 type dsa master bond0
+ ip link set swp1 type dsa master bond0
+ ip link set swp2 type dsa master bond0
+ ip link set swp3 type dsa master bond0
+ ip link set eth0 down && ip link set eth0 master bond0
+ ip -d link show dev swp0
+ (...)
+ dsa master bond0
+
+ # CPU ports in LAG, relying on implicit migration of the DSA master
+ ip link add bond0 type bond mode balance-xor && ip link set bond0 up
+ ip link set eth0 down && ip link set eth0 master bond0
+ ip link set eth1 down && ip link set eth1 master bond0
+ ip -d link show dev swp0
+ (...)
+ dsa master bond0
+
+Notice that in the case of CPU ports under a LAG, the use of the
+``IFLA_DSA_MASTER`` netlink attribute is not strictly needed, but rather, DSA
+reacts to the ``IFLA_MASTER`` attribute change of its present master (``eth0``)
+and migrates all user ports to the new upper of ``eth0``, ``bond0``. Similarly,
+when ``bond0`` is destroyed using ``RTM_DELLINK``, DSA migrates the user ports
+that were assigned to this interface to the first physical DSA master which is
+eligible, based on the firmware description (it effectively reverts to the
+startup configuration).
+
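The static CPU port distribution shown above can also be generated rather than typed by hand. The following is a minimal sketch that only builds the command strings and does not run them; `static_affinity_commands` is a hypothetical helper, and the round-robin policy is just one possible choice.

```python
def static_affinity_commands(user_ports, masters):
    """Spread user ports over DSA masters round-robin.

    Returns the iproute2 commands that would apply the assignment;
    nothing is executed here.
    """
    if not masters:
        raise ValueError("at least one DSA master is required")
    return [
        f"ip link set {port} type dsa master {masters[i % len(masters)]}"
        for i, port in enumerate(user_ports)
    ]
```

Calling it with user ports `swp0` to `swp3` and masters `["eth1", "eth0"]` reproduces the four commands of the static example above.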
+In a setup with more than 2 physical CPU ports, it is therefore possible to mix
+static user to CPU port assignment with LAG between DSA masters. It is not
+possible to statically assign a user port towards a DSA master that has any
+upper interfaces (this includes LAG devices - the master must always be the LAG
+in this case).
+
+Live changing of the DSA master (and thus CPU port) affinity of a user port is
+permitted, in order to allow dynamic redistribution in response to traffic.
+
+Physical DSA masters are allowed to join and leave a LAG interface used as a
+DSA master at any time; however, DSA will reject a LAG interface as a valid
+candidate for being a DSA master unless it has at least one physical DSA master
+as a slave device.
diff --git a/Documentation/networking/dsa/dsa.rst b/Documentation/networking/dsa/dsa.rst
index 563d56c6a25c..a94ddf83348a 100644
--- a/Documentation/networking/dsa/dsa.rst
+++ b/Documentation/networking/dsa/dsa.rst
@@ -10,21 +10,21 @@ in joining the effort.
Design principles
=================
-The Distributed Switch Architecture is a subsystem which was primarily designed
-to support Marvell Ethernet switches (MV88E6xxx, a.k.a Linkstreet product line)
-using Linux, but has since evolved to support other vendors as well.
+The Distributed Switch Architecture subsystem was primarily designed to
+support Marvell Ethernet switches (MV88E6xxx, a.k.a. Link Street product
+line) using Linux, but has since evolved to support other vendors as well.
The original philosophy behind this design was to be able to use unmodified
Linux tools such as bridge, iproute2, ifconfig to work transparently whether
they configured/queried a switch port network device or a regular network
device.
-An Ethernet switch is typically comprised of multiple front-panel ports, and one
-or more CPU or management port. The DSA subsystem currently relies on the
+An Ethernet switch typically comprises multiple front-panel ports and one
+or more CPU or management ports. The DSA subsystem currently relies on the
presence of a management port connected to an Ethernet controller capable of
receiving Ethernet frames from the switch. This is a very common setup for all
kinds of Ethernet switches found in Small Home and Office products: routers,
-gateways, or even top-of-the rack switches. This host Ethernet controller will
+gateways, or even top-of-rack switches. This host Ethernet controller will
be later referred to as "master" and "cpu" in DSA terminology and code.
The D in DSA stands for Distributed, because the subsystem has been designed
@@ -33,14 +33,14 @@ using upstream and downstream Ethernet links between switches. These specific
ports are referred to as "dsa" ports in DSA terminology and code. A collection
of multiple switches connected to each other is called a "switch tree".
-For each front-panel port, DSA will create specialized network devices which are
+For each front-panel port, DSA creates specialized network devices which are
used as controlling and data-flowing endpoints for use by the Linux networking
stack. These specialized network interfaces are referred to as "slave" network
interfaces in DSA terminology and code.
The ideal case for using DSA is when an Ethernet switch supports a "switch tag"
which is a hardware feature making the switch insert a specific tag for each
-Ethernet frames it received to/from specific ports to help the management
+Ethernet frame it receives to/from specific ports to help the management
interface figure out:
- what port is this frame coming from
@@ -65,14 +65,8 @@ Note that DSA does not currently create network interfaces for the "cpu" and
Switch tagging protocols
------------------------
-DSA currently supports 5 different tagging protocols, and a tag-less mode as
-well. The different protocols are implemented in:
-
-- ``net/dsa/tag_trailer.c``: Marvell's 4 trailer tag mode (legacy)
-- ``net/dsa/tag_dsa.c``: Marvell's original DSA tag
-- ``net/dsa/tag_edsa.c``: Marvell's enhanced DSA tag
-- ``net/dsa/tag_brcm.c``: Broadcom's 4 bytes tag
-- ``net/dsa/tag_qca.c``: Qualcomm's 2 bytes tag
+DSA supports many vendor-specific tagging protocols, one software-defined
+tagging protocol, and a tag-less mode as well (``DSA_TAG_PROTO_NONE``).
The exact format of the tag protocol is vendor specific, but in general, they
all contain something which:
@@ -80,6 +74,149 @@ all contain something which:
- identifies which port the Ethernet frame came from/should be sent to
- provides a reason why this frame was forwarded to the management interface
+All tagging protocols are in ``net/dsa/tag_*.c`` files and implement the
+methods of the ``struct dsa_device_ops`` structure, which are detailed below.
+
+Tagging protocols generally fall in one of three categories:
+
+1. The switch-specific frame header is located before the Ethernet header,
+ shifting to the right (from the perspective of the DSA master's frame
+ parser) the MAC DA, MAC SA, EtherType and the entire L2 payload.
+2. The switch-specific frame header is located before the EtherType, keeping
+ the MAC DA and MAC SA in place from the DSA master's perspective, but
+ shifting the 'real' EtherType and L2 payload to the right.
+3. The switch-specific frame header is located at the tail of the packet,
+ keeping all frame headers in place and not altering the view of the packet
+ that the DSA master's frame parser has.
+
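The three categories above differ only in where the switch tag sits relative to the Ethernet header. A minimal sketch on raw byte strings, assuming an untagged frame laid out as MAC DA, MAC SA, EtherType, payload (the function names are illustrative, not kernel APIs):

```python
ETH_ALEN = 6  # length of one MAC address, in octets


def tag_before_ethernet_header(frame: bytes, tag: bytes) -> bytes:
    """Category 1: everything, including MAC DA and MAC SA, shifts right."""
    return tag + frame


def tag_before_ethertype(frame: bytes, tag: bytes) -> bytes:
    """Category 2: MAC DA and MAC SA stay put, EtherType and payload shift."""
    return frame[:2 * ETH_ALEN] + tag + frame[2 * ETH_ALEN:]


def tag_at_tail(frame: bytes, tag: bytes) -> bytes:
    """Category 3: all headers stay where the master's parser expects them."""
    return frame + tag
```

Only the third variant leaves the first 14 octets unchanged, which is why tail taggers do not disturb header parsing on the DSA master.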
+A tagging protocol may tag all packets with switch tags of the same length, or
+the tag length might vary (for example packets with PTP timestamps might
+require an extended switch tag, or there might be one tag length on TX and a
+different one on RX). Either way, the tagging protocol driver must populate the
+``struct dsa_device_ops::needed_headroom`` and/or ``struct dsa_device_ops::needed_tailroom``
+with the length in octets of the longest switch frame header/trailer. The DSA
+framework will automatically adjust the MTU of the master interface to
+accommodate this extra size in order for DSA user ports to support the
+standard MTU (L2 payload length) of 1500 octets. The ``needed_headroom`` and
+``needed_tailroom`` properties are also used to request from the network stack,
+on a best-effort basis, the allocation of packets with enough extra space such
+that the act of pushing the switch tag on transmission of a packet does not
+cause it to reallocate due to lack of space.
+
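As a worked example of the MTU adjustment described above, the master interface has to carry the user port's L2 payload plus the longest header and trailer the tagger may add. This is a sketch of the arithmetic only; `master_mtu` is a hypothetical helper, and the tag sizes used with it are illustrative.

```python
ETH_DATA_LEN = 1500  # standard Ethernet L2 payload length, in octets


def master_mtu(needed_headroom: int, needed_tailroom: int,
               user_mtu: int = ETH_DATA_LEN) -> int:
    """MTU the DSA master needs so user ports can carry user_mtu octets."""
    return user_mtu + needed_headroom + needed_tailroom
```

For instance, a tagger with a 4-octet header and no trailer would need the master to accept 1504 octets of payload for user ports to keep the standard 1500.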
+Even though applications are not expected to parse DSA-specific frame headers,
+the format on the wire of the tagging protocol represents an Application Binary
+Interface exposed by the kernel towards user space, for decoders such as
+``libpcap``. The tagging protocol driver must populate the ``proto`` member of
+``struct dsa_device_ops`` with a value that uniquely describes the
+characteristics of the interaction required between the switch hardware and the
+data path driver: the offset of each bit field within the frame header and any
+stateful processing required to deal with the frames (as may be required for
+PTP timestamping).
+
+From the perspective of the network stack, all switches within the same DSA
+switch tree use the same tagging protocol. In case of a packet transiting a
+fabric with more than one switch, the switch-specific frame header is inserted
+by the first switch in the fabric that the packet was received on. This header
+typically contains information regarding its type (whether it is a control
+frame that must be trapped to the CPU, or a data frame to be forwarded).
+Control frames should be decapsulated only by the software data path, whereas
+data frames might also be autonomously forwarded towards other user ports of
+other switches from the same fabric, and in this case, the outermost switch
+ports must decapsulate the packet.
+
+Note that in certain cases, the tagging format used by a leaf switch (not
+connected directly to the CPU) might not be the same as what the network stack
+sees. This can be seen with Marvell switch trees, where the
+CPU port can be configured to use either the DSA or the Ethertype DSA (EDSA)
+format, but the DSA links are configured to use the shorter (without Ethertype)
+DSA frame header, in order to reduce the autonomous packet forwarding overhead.
+It still remains the case that, if the DSA switch tree is configured for the
+EDSA tagging protocol, the operating system sees EDSA-tagged packets from the
+leaf switches that tagged them with the shorter DSA header. This can be done
+because the Marvell switch connected directly to the CPU is configured to
+perform tag translation between DSA and EDSA (which is simply the operation of
+adding or removing the ``ETH_P_EDSA`` EtherType and some padding octets).
+
+It is possible to construct cascaded setups of DSA switches even if their
+tagging protocols are not compatible with one another. In this case, there are
+no DSA links in this fabric, and each switch constitutes a disjoint DSA switch
+tree. The DSA links are viewed as simply a pair of a DSA master (the out-facing
+port of the upstream DSA switch) and a CPU port (the in-facing port of the
+downstream DSA switch).
+
+The tagging protocol of the attached DSA switch tree can be viewed through the
+``dsa/tagging`` sysfs attribute of the DSA master::
+
+ cat /sys/class/net/eth0/dsa/tagging
+
+If the hardware and driver are capable, the tagging protocol of the DSA switch
+tree can be changed at runtime. This is done by writing the new tagging
+protocol name to the same sysfs device attribute as above (the DSA master and
+all attached switch ports must be down while doing this).
+
+It is desirable that all tagging protocols are testable with the ``dsa_loop``
+mockup driver, which can be attached to any network interface. The goal is that
+any network interface should be capable of transmitting the same packet in the
+same way, and the tagger should decode the same received packet in the same way
+regardless of the driver used for the switch control path, and the driver used
+for the DSA master.
+
+The transmission of a packet goes through the tagger's ``xmit`` function.
+The passed ``struct sk_buff *skb`` has ``skb->data`` pointing at
+``skb_mac_header(skb)``, i.e. at the destination MAC address, and the passed
+``struct net_device *dev`` represents the virtual DSA user network interface
+whose hardware counterpart the packet must be steered to (e.g. ``swp0``).
+The job of this method is to prepare the skb in a way that the switch will
+understand what egress port the packet is for (and not deliver it towards other
+ports). Typically this is fulfilled by pushing a frame header. Checking for
+insufficient size in the skb headroom or tailroom is unnecessary provided that
+the ``needed_headroom`` and ``needed_tailroom`` properties were filled out
+properly, because DSA ensures there is enough space before calling this method.
+
+The reception of a packet goes through the tagger's ``rcv`` function. The
+passed ``struct sk_buff *skb`` has ``skb->data`` pointing at
+``skb_mac_header(skb) + ETH_HLEN`` octets, i.e. to where the first octet after
+the EtherType would have been, were this frame not tagged. The role of this
+method is to consume the frame header, adjust ``skb->data`` to really point at
+the first octet after the EtherType, and to change ``skb->dev`` to point to the
+virtual DSA user network interface corresponding to the physical front-facing
+switch port that the packet was received on.
+
+Since tagging protocols in categories 1 and 2 break software (and most often also
+hardware) packet dissection on the DSA master, features such as RPS (Receive
+Packet Steering) on the DSA master would be broken. The DSA framework deals
+with this by hooking into the flow dissector and shifting the offset at which
+the IP header is to be found in the tagged frame as seen by the DSA master.
+This behavior is automatic based on the ``overhead`` value of the tagging
+protocol. If not all packets are of equal size, the tagger can implement the
+``flow_dissect`` method of the ``struct dsa_device_ops`` and override this
+default behavior by specifying the correct offset incurred by each individual
+RX packet. Tail taggers do not cause issues to the flow dissector.
+
+Checksum offload should work with category 1 and 2 taggers when the DSA master
+driver declares NETIF_F_HW_CSUM in vlan_features and looks at csum_start and
+csum_offset. For those cases, DSA will shift the checksum start and offset by
+the tag size. If the DSA master driver still uses the legacy NETIF_F_IP_CSUM
+or NETIF_F_IPV6_CSUM in vlan_features, the offload might only work if the
+offload hardware already expects that specific tag (perhaps due to matching
+vendors). DSA slaves inherit those flags from the master port, and it is up to
+the driver to correctly fall back to software checksum when the IP header is not
+where the hardware expects. If that check is ineffective, the packets might go
+to the network without a proper checksum (the checksum field will have the
+pseudo IP header sum). For category 3, when the offload hardware does not
+already expect the switch tag in use, the checksum must be calculated before any
+tag is inserted (i.e. inside the tagger). Otherwise, the DSA master would
+include the tail tag in the (software or hardware) checksum calculation. Then,
+when the tag gets stripped by the switch during transmission, it will leave an
+incorrect IP checksum in place.
+
+Due to various reasons (most common being category 1 taggers being associated
+with DSA-unaware masters, mangling what the master perceives as MAC DA), the
+tagging protocol may require the DSA master to operate in promiscuous mode, to
+receive all frames regardless of the value of the MAC DA. This can be done by
+setting the ``promisc_on_master`` property of the ``struct dsa_device_ops``.
+Note that this assumes a DSA-unaware master driver, which is the norm.
+
Master network devices
----------------------
@@ -95,7 +232,7 @@ Ethernet switch.
Networking stack hooks
----------------------
-When a master netdev is used with DSA, a small hook is placed in in the
+When a master netdev is used with DSA, a small hook is placed in the
networking stack in order to have the DSA subsystem process the Ethernet
switch specific tagging protocol. DSA accomplishes this by registering a
specific (and fake) Ethernet type (later becoming ``skb->protocol``) with the
@@ -150,21 +287,35 @@ These interfaces are specialized in order to:
to/from specific switch ports
- query the switch for ethtool operations: statistics, link state,
Wake-on-LAN, register dumps...
-- external/internal PHY management: link, auto-negotiation etc.
+- manage external/internal PHY: link, auto-negotiation, etc.
These slave network devices have custom net_device_ops and ethtool_ops function
pointers which allow DSA to introduce a level of layering between the networking
-stack/ethtool, and the switch driver implementation.
+stack/ethtool and the switch driver implementation.
Upon frame transmission from these slave network devices, DSA will look up which
-switch tagging protocol is currently registered with these network devices, and
+switch tagging protocol is currently registered with these network devices and
invoke a specific transmit routine which takes care of adding the relevant
switch tag in the Ethernet frames.
These frames are then queued for transmission using the master network device
-``ndo_start_xmit()`` function, since they contain the appropriate switch tag, the
+``ndo_start_xmit()`` function. Since they contain the appropriate switch tag, the
Ethernet switch will be able to process these incoming frames from the
-management interface and delivers these frames to the physical switch port.
+management interface and deliver them to the physical switch port.
+
+When using multiple CPU ports, it is possible to stack a LAG (bonding/team)
+device between the DSA slave devices and the physical DSA masters. The LAG
+device is thus also a DSA master, but the LAG slave devices continue to be DSA
+masters as well (just with no user port assigned to them; this is needed for
+recovery in case the LAG DSA master disappears). Thus, the data path of the LAG
+DSA master is used asymmetrically. On RX, the ``ETH_P_XDSA`` handler, which
+calls ``dsa_switch_rcv()``, is invoked early (on the physical DSA master;
+LAG slave). Therefore, the RX data path of the LAG DSA master is not used.
+On the other hand, TX takes place linearly: ``dsa_slave_xmit`` calls
+``dsa_enqueue_skb``, which calls ``dev_queue_xmit`` towards the LAG DSA master.
+The latter calls ``dev_queue_xmit`` towards one physical DSA master or the
+other, and in both cases, the packet exits the system through a hardware path
+towards the switch.
Graphical representation
------------------------
@@ -172,23 +323,34 @@ Graphical representation
Summarized, this is basically what DSA looks like from a network device
perspective::
-
- |---------------------------
- | CPU network device (eth0)|
- ----------------------------
- | <tag added by switch |
- | |
- | |
- | tag added by CPU> |
- |--------------------------------------------|
- | Switch driver |
- |--------------------------------------------|
- || || ||
- |-------| |-------| |-------|
- | sw0p0 | | sw0p1 | | sw0p2 |
- |-------| |-------| |-------|
-
-
+ Unaware application
+ opens and binds socket
+ | ^
+ | |
+ +-----------v--|--------------------+
+ |+------+ +------+ +------+ +------+|
+ || swp0 | | swp1 | | swp2 | | swp3 ||
+ |+------+-+------+-+------+-+------+|
+ | DSA switch driver |
+ +-----------------------------------+
+ | ^
+ Tag added by | | Tag consumed by
+ switch driver | | switch driver
+ v |
+ +-----------------------------------+
+ | Unmodified host interface driver | Software
+ --------+-----------------------------------+------------
+ | Host interface (eth0) | Hardware
+ +-----------------------------------+
+ | ^
+ Tag consumed by | | Tag added by
+ switch hardware | | switch hardware
+ v |
+ +-----------------------------------+
+ | Switch |
+ |+------+ +------+ +------+ +------+|
+ || swp0 | | swp1 | | swp2 | | swp3 ||
+ ++------+-+------+-+------+-+------++
Slave MDIO bus
--------------
@@ -199,9 +361,9 @@ MDIO reads/writes towards specific PHY addresses. In most MDIO-connected
switches, these functions would utilize direct or indirect PHY addressing mode
to return standard MII registers from the switch builtin PHYs, allowing the PHY
library to return link status, link partner pages, auto-negotiation
-results etc..
+results, etc.
-For Ethernet switches which have both external and internal MDIO busses, the
+For Ethernet switches which have both external and internal MDIO buses, the
slave MII bus can be utilized to mux/demux MDIO reads and writes towards either
internal or external MDIO devices this switch might be connected to: internal
PHYs, external PHYs, or even external switches.
@@ -218,7 +380,7 @@ DSA data structures are defined in ``include/net/dsa.h`` as well as
table indication (when cascading switches)
- ``dsa_platform_data``: platform device configuration data which can reference
- a collection of dsa_chip_data structure if multiples switches are cascaded,
+ a collection of dsa_chip_data structures if multiple switches are cascaded,
the master network device this switch tree is attached to needs to be
referenced
@@ -239,14 +401,6 @@ DSA data structures are defined in ``include/net/dsa.h`` as well as
Design limitations
==================
-Limits on the number of devices and ports
------------------------------------------
-
-DSA currently limits the number of maximum switches within a tree to 4
-(``DSA_MAX_SWITCHES``), and the number of ports per switch to 12 (``DSA_MAX_PORTS``).
-These limits could be extended to support larger configurations would this need
-arise.
-
Lack of CPU/DSA network devices
-------------------------------
@@ -273,10 +427,6 @@ will not make us go through the switch tagging protocol transmit function, so
the Ethernet switch on the other end, expecting a tag will typically drop this
frame.
-Slave network devices check that the master network device is UP before allowing
-you to administratively bring UP these slave network devices. A common
-configuration mistake is forgetting to bring UP the master network device first.
-
Interactions with other subsystems
==================================
@@ -285,6 +435,7 @@ DSA currently leverages the following subsystems:
- MDIO/PHY library: ``drivers/net/phy/phy.c``, ``mdio_bus.c``
- Switchdev: ``net/switchdev/*``
- Device Tree for various of_* functions
+- Devlink: ``net/core/devlink.c``
MDIO/PHY library
----------------
@@ -306,7 +457,7 @@ logic basically looks like this:
"phy-handle" property, if found, this PHY device is created and registered
using ``of_phy_connect()``
-- if Device Tree is used, and the PHY device is "fixed", that is, conforms to
+- if Device Tree is used and the PHY device is "fixed", that is, conforms to
the definition of a non-MDIO managed PHY as defined in
``Documentation/devicetree/bindings/net/fixed-link.txt``, the PHY is registered
and connected transparently using the special fixed MDIO bus driver
@@ -321,14 +472,39 @@ SWITCHDEV
DSA directly utilizes SWITCHDEV when interfacing with the bridge layer, and
more specifically with its VLAN filtering portion when configuring VLANs on top
-of per-port slave network devices. Since DSA primarily deals with
-MDIO-connected switches, although not exclusively, SWITCHDEV's
-prepare/abort/commit phases are often simplified into a prepare phase which
-checks whether the operation is supported by the DSA switch driver, and a commit
-phase which applies the changes.
-
-As of today, the only SWITCHDEV objects supported by DSA are the FDB and VLAN
-objects.
+of per-port slave network devices. As of today, the only SWITCHDEV objects
+supported by DSA are the FDB and VLAN objects.
+
+Devlink
+-------
+
+DSA registers one devlink device per physical switch in the fabric.
+For each devlink device, every physical port (i.e. user ports, CPU ports, DSA
+links or unused ports) is exposed as a devlink port.
+
+DSA drivers can make use of the following devlink features:
+
+- Regions: debugging feature which allows user space to dump driver-defined
+ areas of hardware information in a low-level, binary format. Both global
+ regions as well as per-port regions are supported. It is possible to export
+ devlink regions even for pieces of data that are already exposed in some way
+ to the standard iproute2 user space programs (ip-link, bridge), like address
+ tables and VLAN tables. For example, this might be useful if the tables
+ contain additional hardware-specific details which are not visible through
+ the iproute2 abstraction, or it might be useful to inspect these tables on
+ the non-user ports too, which are invisible to iproute2 because no network
+ interface is registered for them.
+- Params: a feature which enables users to configure certain low-level tunable
+ knobs pertaining to the device. Drivers may implement applicable generic
+ devlink params, or may add new device-specific devlink params.
+- Resources: a monitoring feature which enables users to see the degree of
+ utilization of certain hardware tables in the device, such as FDB, VLAN, etc.
+- Shared buffers: a QoS feature for adjusting and partitioning memory and frame
+ reservations per port and per traffic class, in the ingress and egress
+ directions, such that low-priority bulk traffic does not impede the
+ processing of high-priority critical traffic.
+
+For more details, consult ``Documentation/networking/devlink/``.
Device Tree
-----------
@@ -336,35 +512,117 @@ Device Tree
DSA features a standardized binding which is documented in
``Documentation/devicetree/bindings/net/dsa/dsa.txt``. PHY/MDIO library helper
functions such as ``of_get_phy_mode()``, ``of_phy_connect()`` are also used to query
-per-port PHY specific details: interface connection, MDIO bus location etc..
+per-port PHY specific details: interface connection, MDIO bus location, etc.
Driver development
==================
-DSA switch drivers need to implement a dsa_switch_ops structure which will
+DSA switch drivers need to implement a ``dsa_switch_ops`` structure which will
contain the various members described below.
-``register_switch_driver()`` registers this dsa_switch_ops in its internal list
-of drivers to probe for. ``unregister_switch_driver()`` does the exact opposite.
+Probing, registration and device lifetime
+-----------------------------------------
+
+DSA switches are regular ``device`` structures on buses (be they platform, SPI,
+I2C, MDIO or otherwise). The DSA framework is not involved in their probing
+with the device core.
+
+Switch registration from the perspective of a driver means passing a valid
+``struct dsa_switch`` pointer to ``dsa_register_switch()``, usually from the
+switch driver's probing function. The following members must be valid in the
+provided structure:
+
+- ``ds->dev``: will be used to parse the switch's OF node or platform data.
+
+- ``ds->num_ports``: will be used to create the port list for this switch, and
+ to validate the port indices provided in the OF node.
+
+- ``ds->ops``: a pointer to the ``dsa_switch_ops`` structure holding the DSA
+ method implementations.
+
+- ``ds->priv``: backpointer to a driver-private data structure which can be
+ retrieved in all further DSA method callbacks.
+
+In addition, the following flags in the ``dsa_switch`` structure may optionally
+be configured to obtain driver-specific behavior from the DSA core. Their
+behavior when set is documented through comments in ``include/net/dsa.h``.
+
+- ``ds->vlan_filtering_is_global``
+
+- ``ds->needs_standalone_vlan_filtering``
+
+- ``ds->configure_vlan_while_not_filtering``
+
+- ``ds->untag_bridge_pvid``
+
+- ``ds->assisted_learning_on_cpu_port``
-Unless requested differently by setting the priv_size member accordingly, DSA
-does not allocate any driver private context space.
+- ``ds->mtu_enforcement_ingress``
+
+- ``ds->fdb_isolation``
+
+Internally, DSA keeps an array of switch trees (group of switches) global to
+the kernel, and attaches a ``dsa_switch`` structure to a tree on registration.
+The tree ID to which the switch is attached is determined by the first u32
+number of the ``dsa,member`` property of the switch's OF node (0 if missing).
+The switch ID within the tree is determined by the second u32 number of the
+same OF property (0 if missing). Registering multiple switches with the same
+switch ID and tree ID is illegal and will cause an error. Using platform data,
+only a single switch and a single switch tree are permitted.
+
+In case of a tree with multiple switches, probing takes place asymmetrically.
+The first N-1 callers of ``dsa_register_switch()`` only add their ports to the
+port list of the tree (``dst->ports``), each port having a backpointer to its
+associated switch (``dp->ds``). Then, these switches exit their
+``dsa_register_switch()`` call early, because ``dsa_tree_setup_routing_table()``
+has determined that the tree is not yet complete (not all ports referenced by
+DSA links are present in the tree's port list). The tree becomes complete when
+the last switch calls ``dsa_register_switch()``, and this triggers the effective
+continuation of initialization (including the call to ``ds->ops->setup()``) for
+all switches within that tree, all as part of the calling context of the last
+switch's probe function.
+
+The opposite of registration takes place when calling ``dsa_unregister_switch()``,
+which removes a switch's ports from the port list of the tree. The entire tree
+is torn down when the first switch unregisters.
+
+It is mandatory for DSA switch drivers to implement the ``shutdown()`` callback
+of their respective bus, and call ``dsa_switch_shutdown()`` from it (a minimal
+version of the full teardown performed by ``dsa_unregister_switch()``).
+The reason is that DSA keeps a reference on the master net device, and if the
+driver for the master device decides to unbind on shutdown, DSA's reference
+will block that operation from finalizing.
+
+Either ``dsa_switch_shutdown()`` or ``dsa_unregister_switch()`` must be called,
+but not both, and the device driver model permits the bus' ``remove()`` method
+to be called even if ``shutdown()`` was already called. Therefore, drivers are
+expected to implement a mutual exclusion method between ``remove()`` and
+``shutdown()`` by setting their drvdata to NULL after any of these has run, and
+checking whether the drvdata is NULL before proceeding to take any action.
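
The mutual exclusion described above can be sketched as follows. This is a self-contained user space model, not driver code: ``struct toy_device``, ``struct toy_switch`` and the ``toy_*`` functions are hypothetical stand-ins for the bus device, ``struct dsa_switch``, ``dsa_unregister_switch()`` and ``dsa_switch_shutdown()``. A real driver would use its bus' drvdata accessors instead of touching a field directly.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct dsa_switch and the bus device. */
struct toy_switch { int registered; };
struct toy_device { void *drvdata; };

static int toy_unregister_calls;
static int toy_shutdown_calls;

/* Stub playing the role of dsa_unregister_switch(). */
static void toy_dsa_unregister_switch(struct toy_switch *ds)
{
	ds->registered = 0;
	toy_unregister_calls++;
}

/* Stub playing the role of dsa_switch_shutdown(). */
static void toy_dsa_switch_shutdown(struct toy_switch *ds)
{
	ds->registered = 0;
	toy_shutdown_calls++;
}

/* Bus remove() callback: a no-op if shutdown() already ran. */
static void toy_remove(struct toy_device *dev)
{
	struct toy_switch *ds = dev->drvdata;

	if (!ds)
		return;

	toy_dsa_unregister_switch(ds);
	dev->drvdata = NULL;    /* tell shutdown() there is nothing left to do */
}

/* Bus shutdown() callback: a no-op if remove() already ran. */
static void toy_shutdown(struct toy_device *dev)
{
	struct toy_switch *ds = dev->drvdata;

	if (!ds)
		return;

	toy_dsa_switch_shutdown(ds);
	dev->drvdata = NULL;    /* tell remove() there is nothing left to do */
}
```

Whichever callback runs first clears the drvdata, so the second one returns early and exactly one of the two teardown paths is taken.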
+
+After ``dsa_switch_shutdown()`` or ``dsa_unregister_switch()`` was called, no
+further callbacks via the provided ``dsa_switch_ops`` may take place, and the
+driver may free the data structures associated with the ``dsa_switch``.
Switch configuration
--------------------
-- ``tag_protocol``: this is to indicate what kind of tagging protocol is supported,
- should be a valid value from the ``dsa_tag_protocol`` enum
+- ``get_tag_protocol``: indicates which tagging protocol is supported; it
+ should return a valid value from the ``dsa_tag_protocol`` enum.
+ The returned information does not have to be static; the driver is passed the
+ CPU port number, as well as the tagging protocol of a possibly stacked
+ upstream switch, in case there are hardware limitations in terms of supported
+ tag formats.
-- ``probe``: probe routine which will be invoked by the DSA platform device upon
- registration to test for the presence/absence of a switch device. For MDIO
- devices, it is recommended to issue a read towards internal registers using
- the switch pseudo-PHY and return whether this is a supported device. For other
- buses, return a non-NULL string
+- ``change_tag_protocol``: when the default tagging protocol has compatibility
+ problems with the master or other issues, the driver may support changing it
+ at runtime, either through a device tree property or through sysfs. In that
+ case, further calls to ``get_tag_protocol`` should report the protocol in
+ current use.
- ``setup``: setup function for the switch, this function is responsible for setting
up the ``dsa_switch_ops`` private structure with all it needs: register maps,
- interrupts, mutexes, locks etc.. This function is also expected to properly
+ interrupts, mutexes, locks, etc. This function is also expected to properly
configure the switch to separate all network interfaces from each other, that
is, they should be isolated by the switch hardware itself, typically by creating
a Port-based VLAN ID for each port and allowing only the CPU port and the
@@ -373,7 +631,35 @@ Switch configuration
fully configured and ready to serve any kind of request. It is recommended
to issue a software reset of the switch during this setup function in order to
avoid relying on what a previous software agent such as a bootloader/firmware
- may have previously configured.
+ may have previously configured. The method responsible for undoing any
+ applicable allocations or operations done here is ``teardown``.
+
+- ``port_setup`` and ``port_teardown``: methods for initialization and
+ destruction of per-port data structures. It is mandatory for some operations
+ such as registering and unregistering devlink port regions to be done from
+ these methods, otherwise they are optional. A port will be torn down only if
+ it has been previously set up. It is possible for a port to be set up during
+ probing only to be torn down immediately afterwards, for example in case its
+ PHY cannot be found. In this case, probing of the DSA switch continues
+ without that particular port.
+
+- ``port_change_master``: method through which the affinity (association used
+ for traffic termination purposes) between a user port and a CPU port can be
+ changed. By default all user ports from a tree are assigned to the first
+ available CPU port that makes sense for them (most of the time this means
+ the user ports of a tree are all assigned to the same CPU port, except for H
+ topologies as described in commit 2c0b03258b8b). The ``port`` argument
+ represents the index of the user port, and the ``master`` argument represents
+ the new DSA master ``net_device``. The CPU port associated with the new
+ master can be retrieved by looking at ``struct dsa_port *cpu_dp =
+ master->dsa_ptr``. Additionally, the master can also be a LAG device where
+ all the slave devices are physical DSA masters. LAG DSA masters also have a
+ valid ``master->dsa_ptr`` pointer, however this is not unique, but rather a
+ duplicate of the first physical DSA master's (LAG slave) ``dsa_ptr``. In case
+ of a LAG DSA master, a further call to ``port_lag_join`` will be emitted
+ separately for the physical CPU ports associated with the physical DSA
+ masters, requesting them to create a hardware LAG associated with the LAG
+ interface.
PHY devices and link management
-------------------------------
@@ -381,13 +667,13 @@ PHY devices and link management
- ``get_phy_flags``: Some switches are interfaced to various kinds of Ethernet PHYs,
if the PHY library PHY driver needs to know about information it cannot obtain
on its own (e.g.: coming from switch memory mapped registers), this function
- should return a 32-bits bitmask of "flags", that is private between the switch
+ should return a 32-bit bitmask of "flags" that is private between the switch
driver and the Ethernet PHY driver in ``drivers/net/phy/\*``.
- ``phy_read``: Function invoked by the DSA slave MDIO bus when attempting to read
the switch port MDIO registers. If unavailable, return 0xffff for each read.
For builtin switch Ethernet PHYs, this function should allow reading the link
- status, auto-negotiation results, link partner pages etc..
+ status, auto-negotiation results, link partner pages, etc.
- ``phy_write``: Function invoked by the DSA slave MDIO bus when attempting to write
to the switch port MDIO registers. If unavailable return a negative error
@@ -409,7 +695,7 @@ Ethtool operations
------------------
- ``get_strings``: ethtool function used to query the driver's strings, will
- typically return statistics strings, private flags strings etc.
+ typically return statistics strings, private flags strings, etc.
- ``get_ethtool_stats``: ethtool function used to query per-port statistics and
return their values. DSA overlays slave network devices general statistics:
@@ -419,7 +705,7 @@ Ethtool operations
- ``get_sset_count``: ethtool function used to query the number of statistics items
- ``get_wol``: ethtool function used to obtain Wake-on-LAN settings per-port, this
- function may, for certain implementations also query the master network device
+ function may for certain implementations also query the master network device
Wake-on-LAN settings if this interface needs to participate in Wake-on-LAN
- ``set_wol``: ethtool function used to configure Wake-on-LAN settings per-port,
@@ -462,37 +748,226 @@ Power management
in a fully active state
- ``port_enable``: function invoked by the DSA slave network device ndo_open
- function when a port is administratively brought up, this function should be
- fully enabling a given switch port. DSA takes care of marking the port with
+ function when a port is administratively brought up, this function should
+ fully enable a given switch port. DSA takes care of marking the port with
``BR_STATE_BLOCKING`` if the port is a bridge member, or ``BR_STATE_FORWARDING`` if it
was not, and propagating these changes down to the hardware
- ``port_disable``: function invoked by the DSA slave network device ndo_close
- function when a port is administratively brought down, this function should be
- fully disabling a given switch port. DSA takes care of marking the port with
+ function when a port is administratively brought down, this function should
+ fully disable a given switch port. DSA takes care of marking the port with
``BR_STATE_DISABLED`` and propagating changes to the hardware if this port is
disabled while being a bridge member
+Address databases
+-----------------
+
+Switching hardware is expected to have a table for FDB entries; however, not all
+of them are active at the same time. An address database is the subset (partition)
+of FDB entries that is active (can be matched by address learning on RX, or FDB
+lookup on TX) depending on the state of the port. An address database may
+occasionally be called "FID" (Filtering ID) in this document, although the
+underlying implementation may choose whatever is available to the hardware.
+
+For example, all ports that belong to a VLAN-unaware bridge (which is
+*currently* VLAN-unaware) are expected to learn source addresses in the
+database associated by the driver with that bridge (and not with other
+VLAN-unaware bridges). During forwarding and FDB lookup, a packet received on a
+VLAN-unaware bridge port should be able to find a VLAN-unaware FDB entry having
+the same MAC DA as the packet, which is present on another port member of the
+same bridge. At the same time, the FDB lookup process must be able to not find
+an FDB entry having the same MAC DA as the packet, if that entry points towards
+a port which is a member of a different VLAN-unaware bridge (and is therefore
+associated with a different address database).
+
+Similarly, each VLAN of each offloaded VLAN-aware bridge should have an
+associated address database, which is shared by all ports which are members of
+that VLAN, but not shared by ports belonging to different bridges that are
+members of the same VID.
+
+In this context, a VLAN-unaware database means that all packets are expected to
+match on it irrespective of VLAN ID (only MAC address lookup), whereas a
+VLAN-aware database means that packets are supposed to match based on the VLAN
+ID from the classified 802.1Q header (or the pvid if untagged).
+
+At the bridge layer, VLAN-unaware FDB entries have the special VID value of 0,
+whereas VLAN-aware FDB entries have non-zero VID values. Note that a
+VLAN-unaware bridge may have VLAN-aware (non-zero VID) FDB entries, and a
+VLAN-aware bridge may have VLAN-unaware FDB entries. As in hardware, the
+software bridge keeps separate address databases, and offloads to hardware the
+FDB entries belonging to these databases, through switchdev, asynchronously
+relative to the moment when the databases become active or inactive.
+
+When a user port operates in standalone mode, its driver should configure it to
+use a separate database called a port private database. This is different from
+the databases described above, and should impede operation as standalone port
+(packet in, packet out to the CPU port) as little as possible. For example,
+on ingress, it should not attempt to learn the MAC SA of ingress traffic, since
+learning is a bridging layer service and this is a standalone port, therefore
+it would consume useless space. With no address learning, the port private
+database should be empty in a naive implementation, and in this case, all
+received packets should be trivially flooded to the CPU port.
+
+DSA (cascade) and CPU ports are also called "shared" ports because they service
+multiple address databases, and the database that a packet should be associated
+to is usually embedded in the DSA tag. This means that the CPU port may
+simultaneously transport packets coming from a standalone port (which were
+classified by hardware in one address database), and from a bridge port (which
+were classified to a different address database).
+
+Switch drivers which satisfy certain criteria are able to optimize the naive
+configuration by removing the CPU port from the flooding domain of the switch,
+and just program the hardware with FDB entries pointing towards the CPU port
+for which it is known that software is interested in those MAC addresses.
+Packets which do not match a known FDB entry will not be delivered to the CPU,
+which will save CPU cycles required for creating an skb just to drop it.
+
+DSA is able to perform host address filtering for the following kinds of
+addresses:
+
+- Primary unicast MAC addresses of ports (``dev->dev_addr``). These are
+ associated with the port private database of the respective user port,
+ and the driver is notified to install them through ``port_fdb_add`` towards
+ the CPU port.
+
+- Secondary unicast and multicast MAC addresses of ports (addresses added
+ through ``dev_uc_add()`` and ``dev_mc_add()``). These are also associated
+ with the port private database of the respective user port.
+
+- Local/permanent bridge FDB entries (``BR_FDB_LOCAL``). These are the MAC
+ addresses of the bridge ports, for which packets must be terminated locally
+ and not forwarded. They are associated with the address database for that
+ bridge.
+
+- Static bridge FDB entries installed towards foreign (non-DSA) interfaces
+ present in the same bridge as some DSA switch ports. These are also
+ associated with the address database for that bridge.
+
+- Dynamically learned FDB entries on foreign interfaces present in the same
+ bridge as some DSA switch ports, only if ``ds->assisted_learning_on_cpu_port``
+ is set to true by the driver. These are associated with the address database
+ for that bridge.
+
+For various operations detailed below, DSA provides a ``dsa_db`` structure
+which can be of the following types:
+
+- ``DSA_DB_PORT``: the FDB (or MDB) entry to be installed or deleted belongs to
+ the port private database of user port ``db->dp``.
+- ``DSA_DB_BRIDGE``: the entry belongs to one of the address databases of bridge
+ ``db->bridge``. Separation between the VLAN-unaware database and the per-VID
+ databases of this bridge is expected to be done by the driver.
+- ``DSA_DB_LAG``: the entry belongs to the address database of LAG ``db->lag``.
+ Note: ``DSA_DB_LAG`` is currently unused and may be removed in the future.
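
One possible way for a driver to separate these databases is to map each of them to a hardware FID. The sketch below is purely illustrative: the ``toy_*`` names, the table sizes and the linear FID layout are assumptions, and real hardware usually has far fewer FIDs than this layout requires. It treats VID 0 as the VLAN-unaware database of a bridge and uses the one-based bridge number as an index.

```c
#include <stdint.h>

#define TOY_NUM_PORTS   8     /* hypothetical number of user ports */
#define TOY_MAX_BRIDGES 4     /* hypothetical number of offloaded bridges */
#define TOY_NUM_VIDS    4096  /* VID 0 doubles as the VLAN-unaware database */

/* Simplified mirror of the database types described above. */
enum toy_db_type { TOY_DB_PORT, TOY_DB_BRIDGE };

struct toy_db {
	enum toy_db_type type;
	int port;        /* valid for TOY_DB_PORT */
	int bridge_num;  /* valid for TOY_DB_BRIDGE, one-based */
};

/* Return the FID for an address database, or -1 if it cannot be isolated. */
static int toy_db_to_fid(const struct toy_db *db, uint16_t vid)
{
	if (vid >= TOY_NUM_VIDS)
		return -1;

	switch (db->type) {
	case TOY_DB_PORT:
		/* One port private database per user port, VID ignored. */
		if (db->port < 0 || db->port >= TOY_NUM_PORTS)
			return -1;
		return db->port;
	case TOY_DB_BRIDGE:
		/* A bridge number of 0 means no isolation information. */
		if (db->bridge_num < 1 || db->bridge_num > TOY_MAX_BRIDGES)
			return -1;
		return TOY_NUM_PORTS +
		       (db->bridge_num - 1) * TOY_NUM_VIDS + vid;
	}

	return -1;
}
```

With this layout, two bridges never share a FID, and neither do a VLAN-unaware entry and a per-VID entry of the same bridge.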
+
+The drivers which act upon the ``dsa_db`` argument in ``port_fdb_add``,
+``port_mdb_add``, etc. should declare ``ds->fdb_isolation`` as true.
+
+DSA associates each offloaded bridge and each offloaded LAG with a one-based ID
+(``struct dsa_bridge :: num``, ``struct dsa_lag :: id``) for the purposes of
+refcounting addresses on shared ports. Drivers may piggyback on DSA's numbering
+scheme (the ID is readable through ``db->bridge.num`` and ``db->lag.id``) or may
+implement their own.
+
+Only the drivers which declare support for FDB isolation are notified of FDB
+entries on the CPU port belonging to ``DSA_DB_PORT`` databases.
+For compatibility/legacy reasons, ``DSA_DB_BRIDGE`` addresses are notified to
+drivers even if they do not support FDB isolation. However, ``db->bridge.num``
+and ``db->lag.id`` are always set to 0 in that case (to denote the lack of
+isolation, for refcounting purposes).
+
+Note that it is not mandatory for a switch driver to implement physically
+separate address databases for each standalone user port. Since FDB entries in
+the port private databases will always point to the CPU port, there is no risk
+for incorrect forwarding decisions. In this case, all standalone ports may
+share the same database, but the reference counting of host-filtered addresses
+(not deleting the FDB entry for a port's MAC address if it's still in use by
+another port) becomes the responsibility of the driver, because DSA is unaware
+that the port databases are in fact shared. This can be achieved by calling
+``dsa_fdb_present_in_other_db()`` and ``dsa_mdb_present_in_other_db()``.
+The downside is that the RX filtering lists of each user port are in fact
+shared, which means that user port A may accept a packet with a MAC DA it
+shouldn't have, only because that MAC address was in the RX filtering list of
+user port B. These packets will still be dropped in software, however.
+
Bridge layer
------------
+Offloading the bridge forwarding plane is optional and handled by the methods
+below. They may be absent, they may return ``-EOPNOTSUPP``, or
+``ds->max_num_bridges`` may be non-zero and exceeded. In any of these cases,
+joining a bridge port is still possible, but packet forwarding takes place in
+software, and the ports under a software bridge must remain configured in the
+same way as for standalone operation, i.e. have all bridging service functions
+(address learning etc.) disabled, and send all received packets to the CPU port
+only.
+
+Concretely, a port starts offloading the forwarding plane of a bridge once it
+returns success to the ``port_bridge_join`` method, and stops doing so after
+``port_bridge_leave`` has been called. Offloading the bridge means autonomously
+learning FDB entries in accordance with the software bridge port's state, and
+autonomously forwarding (or flooding) received packets without CPU intervention.
+This is optional even when offloading a bridge port. Tagging protocol drivers
+are expected to call ``dsa_default_offload_fwd_mark(skb)`` for packets which
+have already been autonomously forwarded in the forwarding domain of the
+ingress switch port. DSA, through ``dsa_port_devlink_setup()``, considers all
+switch ports part of the same tree ID to be part of the same bridge forwarding
+domain (capable of autonomous forwarding to each other).
+
+Offloading the TX forwarding process of a bridge is a distinct concept from
+simply offloading its forwarding plane, and refers to the ability of certain
+driver and tag protocol combinations to transmit a single skb coming from the
+bridge device's transmit function to potentially multiple egress ports (and
+thereby avoid its cloning in software).
+
+Packets for which the bridge requests this behavior are called data plane
+packets and have ``skb->offload_fwd_mark`` set to true in the tag protocol
+driver's ``xmit`` function. Data plane packets are subject to FDB lookup,
+hardware learning on the CPU port, and do not override the port STP state.
+Additionally, replication of data plane packets (multicast, flooding) is
+handled in hardware and the bridge driver will transmit a single skb for each
+packet that may or may not need replication.
+
+When the TX forwarding offload is enabled, the tag protocol driver is
+responsible for injecting packets into the data plane of the hardware towards
+the correct bridging domain (FID) that the port is a part of. The port may be
+VLAN-unaware, and in this case the FID must be equal to the FID used by the
+driver for its VLAN-unaware address database associated with that bridge.
+Alternatively, the bridge may be VLAN-aware, and in that case, it is guaranteed
+that the packet is also VLAN-tagged with the VLAN ID that the bridge processed
+this packet in. It is the responsibility of the hardware to untag the VID on
+the egress-untagged ports, or keep the tag on the egress-tagged ones.
+
- ``port_bridge_join``: bridge layer function invoked when a given switch port is
- added to a bridge, this function should be doing the necessary at the switch
- level to permit the joining port from being added to the relevant logical
+ added to a bridge, this function should do what's necessary at the switch
+ level to permit the joining port to be added to the relevant logical
domain for it to ingress/egress traffic with other members of the bridge.
+ By setting the ``tx_fwd_offload`` argument to true, the TX forwarding process
+ of this bridge is also offloaded.
- ``port_bridge_leave``: bridge layer function invoked when a given switch port is
- removed from a bridge, this function should be doing the necessary at the
+ removed from a bridge, this function should do what's necessary at the
switch level to deny the leaving port from ingress/egress traffic from the
- remaining bridge members. When the port leaves the bridge, it should be aged
- out at the switch hardware for the switch to (re) learn MAC addresses behind
- this port.
+ remaining bridge members.
- ``port_stp_state_set``: bridge layer function invoked when a given switch port STP
state is computed by the bridge layer and should be propagated to switch
- hardware to forward/block/learn traffic. The switch driver is responsible for
- computing a STP state change based on current and asked parameters and perform
- the relevant ageing based on the intersection results
+ hardware to forward/block/learn traffic.
+
+- ``port_bridge_flags``: bridge layer function invoked when a port must
+ configure its settings for e.g. flooding of unknown traffic or source address
+ learning. The switch driver is responsible for initial setup of the
+ standalone ports with address learning disabled and egress flooding of all
+ types of traffic, then the DSA core notifies of any change to the bridge port
+ flags when the port joins and leaves a bridge. DSA does not currently manage
+ the bridge port flags for the CPU port. The assumption is that address
+ learning should be statically enabled (if supported by the hardware) on the
+ CPU port, and flooding towards the CPU port should also be enabled, due to a
+ lack of an explicit address filtering mechanism in the DSA core.
+
+- ``port_fast_age``: bridge layer function invoked when flushing the
+ dynamically learned FDB entries on the port is necessary. This is called when
+ transitioning from an STP state where learning should take place to an STP
+ state where it shouldn't, or when leaving a bridge, or when address learning
+ is turned off via ``port_bridge_flags``.
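
As a rough illustration of the user-space side of ``port_bridge_flags`` and
``port_fast_age``, the commands below change the bridge port flags with the
iproute2 ``bridge`` tool. This is only a sketch: the interface name ``swp0``
is an assumption, the port is assumed to already be a member of an offloaded
bridge, and whether a given flag is accepted depends on the switch driver.

```shell
# Assumption: swp0 is a DSA user port that has already joined a bridge.

# Turn off source address learning on the bridge port. According to the text
# above, this is one of the events that leads to a port_fast_age call.
bridge link set dev swp0 learning off

# Stop flooding unknown unicast, unknown multicast and broadcast traffic
# towards this port.
bridge link set dev swp0 flood off mcast_flood off bcast_flood off

# Show the resulting per-port flags.
bridge -d link show dev swp0
```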
Bridge VLAN filtering
---------------------
@@ -507,63 +982,139 @@ Bridge VLAN filtering
accept any 802.1Q frames irrespective of their VLAN ID, and untagged frames are
allowed.
-- ``port_vlan_prepare``: bridge layer function invoked when the bridge prepares the
- configuration of a VLAN on the given port. If the operation is not supported
- by the hardware, this function should return ``-EOPNOTSUPP`` to inform the bridge
- code to fallback to a software implementation. No hardware setup must be done
- in this function. See port_vlan_add for this and details.
-
- ``port_vlan_add``: bridge layer function invoked when a VLAN is configured
- (tagged or untagged) for the given switch port
+ (tagged or untagged) for the given switch port. The CPU port becomes a member
+ of a VLAN only if a foreign bridge port is also a member of it (and
+ forwarding needs to take place in software), or the VLAN is installed to the
+ VLAN group of the bridge device itself, for termination purposes
+ (``bridge vlan add dev br0 vid 100 self``). VLANs on shared ports are
+ reference counted and removed when there is no user left. Drivers do not need
+ to manually install a VLAN on the CPU port.
- ``port_vlan_del``: bridge layer function invoked when a VLAN is removed from the
given switch port
-- ``port_vlan_dump``: bridge layer function invoked with a switchdev callback
- function that the driver has to call for each VLAN the given port is a member
- of. A switchdev object is used to carry the VID and bridge flags.
-
- ``port_fdb_add``: bridge layer function invoked when the bridge wants to install a
Forwarding Database entry, the switch hardware should be programmed with the
specified address in the specified VLAN Id in the forwarding database
- associated with this VLAN ID. If the operation is not supported, this
- function should return ``-EOPNOTSUPP`` to inform the bridge code to fallback to
- a software implementation.
-
-.. note:: VLAN ID 0 corresponds to the port private database, which, in the context
- of DSA, would be its port-based VLAN, used by the associated bridge device.
+ associated with this VLAN ID.
- ``port_fdb_del``: bridge layer function invoked when the bridge wants to remove a
Forwarding Database entry, the switch hardware should be programmed to delete
the specified MAC address from the specified VLAN ID if it was mapped into
this port forwarding database
-- ``port_fdb_dump``: bridge layer function invoked with a switchdev callback
- function that the driver has to call for each MAC address known to be behind
- the given port. A switchdev object is used to carry the VID and FDB info.
-
-- ``port_mdb_prepare``: bridge layer function invoked when the bridge prepares the
- installation of a multicast database entry. If the operation is not supported,
- this function should return ``-EOPNOTSUPP`` to inform the bridge code to fallback
- to a software implementation. No hardware setup must be done in this function.
- See ``port_fdb_add`` for this and details.
+- ``port_fdb_dump``: bridge bypass function invoked by ``ndo_fdb_dump`` on the
+ physical DSA port interfaces. Since DSA does not attempt to keep in sync its
+ hardware FDB entries with the software bridge, this method is implemented as
+ a means to view the entries visible on user ports in the hardware database.
+ The entries reported by this function have the ``self`` flag in the output of
+ the ``bridge fdb show`` command.
- ``port_mdb_add``: bridge layer function invoked when the bridge wants to install
- a multicast database entry, the switch hardware should be programmed with the
+ a multicast database entry. The switch hardware should be programmed with the
specified address in the specified VLAN ID in the forwarding database
associated with this VLAN ID.
-.. note:: VLAN ID 0 corresponds to the port private database, which, in the context
- of DSA, would be its port-based VLAN, used by the associated bridge device.
-
- ``port_mdb_del``: bridge layer function invoked when the bridge wants to remove a
multicast database entry, the switch hardware should be programmed to delete
the specified MAC address from the specified VLAN ID if it was mapped into
this port forwarding database.
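
The VLAN, FDB and MDB methods above are normally exercised from user space
through the iproute2 ``bridge`` tool. The following is a minimal sketch of such
a session, not a reference procedure: the names ``br0`` and ``swp0``, the VLAN
ID and the MAC address are all illustrative assumptions, and the mapping of
each command to a driver method is the expected one rather than a guarantee.

```shell
# Assumption: swp0 is a DSA user port that is a member of bridge br0.
ip link set dev br0 type bridge vlan_filtering 1

# Make swp0 a member of VLAN 100 (expected to reach port_vlan_add).
bridge vlan add dev swp0 vid 100

# Install VLAN 100 on the bridge device itself, for termination purposes.
bridge vlan add dev br0 vid 100 self

# Add a static FDB entry in VLAN 100 (expected to reach port_fdb_add).
bridge fdb add 00:01:02:03:04:05 dev swp0 vlan 100 master static

# Dump the FDB; entries reported by port_fdb_dump carry the "self" flag.
bridge fdb show dev swp0

# Undo the configuration (expected to reach port_fdb_del and port_vlan_del).
bridge fdb del 00:01:02:03:04:05 dev swp0 vlan 100 master
bridge vlan del dev swp0 vid 100
```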
-- ``port_mdb_dump``: bridge layer function invoked with a switchdev callback
- function that the driver has to call for each MAC address known to be behind
- the given port. A switchdev object is used to carry the VID and MDB info.
+Link aggregation
+----------------
+
+Link aggregation is implemented in the Linux networking stack by the bonding
+and team drivers, which are modeled as virtual, stackable network interfaces.
+DSA is capable of offloading a link aggregation group (LAG) to hardware that
+supports the feature, and supports bridging between physical ports and LAGs,
+as well as between LAGs. A bonding/team interface which holds multiple physical
+ports constitutes a logical port, although DSA has no explicit concept of a
+logical port at the moment. Due to this, events where a LAG joins/leaves a
+bridge are treated as if all individual physical ports that are members of that
+LAG join/leave the bridge. Switchdev port attributes (VLAN filtering, STP
+state, etc) and objects (VLANs, MDB entries) offloaded to a LAG as bridge port
+are treated similarly: DSA offloads the same switchdev object / port attribute
+on all members of the LAG. Static bridge FDB entries on a LAG are not yet
+supported, since the DSA driver API does not have the concept of a logical port
+ID.
+
+- ``port_lag_join``: function invoked when a given switch port is added to a
+ LAG. The driver may return ``-EOPNOTSUPP``, and in this case, DSA will fall
+ back to a software implementation where all traffic from this port is sent to
+ the CPU.
+- ``port_lag_leave``: function invoked when a given switch port leaves a LAG
+ and returns to operation as a standalone port.
+- ``port_lag_change``: function invoked when the link state of any member of
+ the LAG changes, and the hashing function needs rebalancing to only make use
+ of the subset of physical LAG member ports that are up.
+
+Drivers that benefit from having an ID associated with each offloaded LAG
+can optionally populate ``ds->num_lag_ids`` from the ``dsa_switch_ops::setup``
+method. The LAG ID associated with a bonding/team interface can then be
+retrieved by a DSA switch driver using the ``dsa_lag_id`` function.
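
For orientation, the sequence below sketches how a LAG made of two switch
ports might be created and then bridged with a third port from user space.
The interface names (``swp0``, ``swp1``, ``swp2``, ``bond0``, ``br0``) and the
802.3ad bonding mode are assumptions chosen for the example; which bonding
modes can actually be offloaded depends on the switch driver.

```shell
# Assumption: swp0, swp1 and swp2 are user ports of the same DSA switch.
ip link add bond0 type bond mode 802.3ad

# Each enslavement is expected to reach the driver as port_lag_join.
ip link set swp0 down
ip link set swp1 down
ip link set swp0 master bond0
ip link set swp1 master bond0
ip link set bond0 up

# Bridge the LAG with a physical port. DSA treats this as if swp0 and swp1
# had individually joined br0.
ip link add br0 type bridge
ip link set bond0 master br0
ip link set swp2 master br0
ip link set br0 up

# Removing a member is expected to reach the driver as port_lag_leave.
ip link set swp1 nomaster
```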
+
+IEC 62439-2 (MRP)
+-----------------
+
+The Media Redundancy Protocol is a topology management protocol optimized for
+fast fault recovery time for ring networks, which has some components
+implemented as a function of the bridge driver. MRP uses management PDUs
+(Test, Topology, LinkDown/Up, Option) sent at a multicast destination MAC
+address range of 01:15:4e:00:00:0x and with an EtherType of 0x88e3.
+Depending on the node's role in the ring (MRM: Media Redundancy Manager,
+MRC: Media Redundancy Client, MRA: Media Redundancy Automanager), certain MRP
+PDUs might need to be terminated locally and others might need to be forwarded.
+An MRM might also benefit from offloading to hardware the creation and
+transmission of certain MRP PDUs (Test).
+
+Normally an MRP instance can be created on top of any network interface,
+however in the case of a device with an offloaded data path such as DSA, it is
+necessary for the hardware, even if it is not MRP-aware, to be able to extract
+the MRP PDUs from the fabric before the driver can proceed with the software
+implementation. DSA today has no driver which is MRP-aware, therefore it only
+listens for the bare minimum switchdev objects required for the software assist
+to work properly. The operations are detailed below.
+
+- ``port_mrp_add`` and ``port_mrp_del``: notify the driver when an MRP instance
+ with a certain ring ID, priority, primary port and secondary port is
+ created/deleted.
+- ``port_mrp_add_ring_role`` and ``port_mrp_del_ring_role``: functions invoked
+ when an MRP instance changes ring roles between MRM and MRC. This affects
+ which MRP PDUs should be trapped to software and which should be autonomously
+ forwarded.
+
+IEC 62439-3 (HSR/PRP)
+---------------------
+
+The Parallel Redundancy Protocol (PRP) is a network redundancy protocol which
+works by duplicating and sequence numbering packets through two independent L2
+networks (which are unaware of the PRP tail tags carried in the packets), and
+eliminating the duplicates at the receiver. The High-availability Seamless
+Redundancy (HSR) protocol is similar in concept, except all nodes that carry
+the redundant traffic are aware of the fact that it is HSR-tagged (because HSR
+uses a header with an EtherType of 0x892f) and are physically connected in a
+ring topology. Both HSR and PRP use supervision frames for monitoring the
+health of the network and for discovery of other nodes.
+
+In Linux, both HSR and PRP are implemented in the hsr driver, which
+instantiates a virtual, stackable network interface with two member ports.
+The driver only implements the basic roles of DANH (Doubly Attached Node
+implementing HSR) and DANP (Doubly Attached Node implementing PRP); the roles
+of RedBox and QuadBox are not implemented (therefore, bridging a hsr network
+interface with a physical switch port does not produce the expected result).
+
+A driver which is capable of offloading certain functions of a DANP or DANH should
+declare the corresponding netdev features as indicated by the documentation at
+``Documentation/networking/netdev-features.rst``. Additionally, the following
+methods must be implemented:
+
+- ``port_hsr_join``: function invoked when a given switch port is added to a
+ DANP/DANH. The driver may return ``-EOPNOTSUPP`` and in this case, DSA will
+ fall back to a software implementation where all traffic from this port is
+ sent to the CPU.
+- ``port_hsr_leave``: function invoked when a given switch port leaves a
+ DANP/DANH and returns to normal operation as a standalone port.
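
As a sketch of how these methods get triggered, an hsr interface can be
created on top of two switch ports with iproute2. The interface names and the
supervision address byte below are illustrative assumptions, and the ``proto``
option requires an iproute2 version that knows about PRP.

```shell
# Assumption: swp0..swp3 are user ports of the same DSA switch.

# HSR (version 1) node on swp0/swp1; each member port is expected to reach
# the driver as port_hsr_join.
ip link add name hsr0 type hsr slave1 swp0 slave2 swp1 supervision 45 version 1
ip link set hsr0 up

# PRP node on swp2/swp3 (proto 1 selects PRP instead of HSR).
ip link add name prp0 type hsr slave1 swp2 slave2 swp3 supervision 45 proto 1
ip link set prp0 up

# Deleting the interface is expected to reach the driver as port_hsr_leave
# for both member ports.
ip link del hsr0
```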
TODO
====
@@ -576,12 +1127,3 @@ capable hardware, but does not enforce a strict switch device driver model. On
the other DSA enforces a fairly strict device driver model, and deals with most
of the switch specific. At some point we should envision a merger between these
two subsystems and get the best of both worlds.
-
-Other hanging fruits
---------------------
-
-- making the number of ports fully dynamic and not dependent on ``DSA_MAX_PORTS``
-- allowing more than one CPU/management interface:
- http://comments.gmane.org/gmane.linux.network/365657
-- porting more drivers from other vendors:
- http://comments.gmane.org/gmane.linux.network/365510
diff --git a/Documentation/networking/dsa/sja1105.rst b/Documentation/networking/dsa/sja1105.rst
index 64553d8d91cb..e0219c1452ab 100644
--- a/Documentation/networking/dsa/sja1105.rst
+++ b/Documentation/networking/dsa/sja1105.rst
@@ -5,7 +5,7 @@ NXP SJA1105 switch driver
Overview
========
-The NXP SJA1105 is a family of 6 devices:
+The NXP SJA1105 is a family of 10 SPI-managed automotive switches:
- SJA1105E: First generation, no TTEthernet
- SJA1105T: First generation, TTEthernet
@@ -13,9 +13,11 @@ The NXP SJA1105 is a family of 6 devices:
- SJA1105Q: Second generation, TTEthernet, no SGMII
- SJA1105R: Second generation, no TTEthernet, SGMII
- SJA1105S: Second generation, TTEthernet, SGMII
-
-These are SPI-managed automotive switches, with all ports being gigabit
-capable, and supporting MII/RMII/RGMII and optionally SGMII on one port.
+- SJA1110A: Third generation, TTEthernet, SGMII, integrated 100base-T1 and
+ 100base-TX PHYs
+- SJA1110B: Third generation, TTEthernet, SGMII, 100base-T1, 100base-TX
+- SJA1110C: Third generation, TTEthernet, SGMII, 100base-T1, 100base-TX
+- SJA1110D: Third generation, TTEthernet, SGMII, 100base-T1
Being automotive parts, their configuration interface is geared towards
set-and-forget use, with minimal dynamic interaction at runtime. They
@@ -63,38 +65,6 @@ If that changed setting can be transmitted to the switch through the dynamic
reconfiguration interface, it is; otherwise the switch is reset and
reprogrammed with the updated static configuration.
-Traffic support
-===============
-
-The switches do not support switch tagging in hardware. But they do support
-customizing the TPID by which VLAN traffic is identified as such. The switch
-driver is leveraging ``CONFIG_NET_DSA_TAG_8021Q`` by requesting that special
-VLANs (with a custom TPID of ``ETH_P_EDSA`` instead of ``ETH_P_8021Q``) are
-installed on its ports when not in ``vlan_filtering`` mode. This does not
-interfere with the reception and transmission of real 802.1Q-tagged traffic,
-because the switch does no longer parse those packets as VLAN after the TPID
-change.
-The TPID is restored when ``vlan_filtering`` is requested by the user through
-the bridge layer, and general IP termination becomes no longer possible through
-the switch netdevices in this mode.
-
-The switches have two programmable filters for link-local destination MACs.
-These are used to trap BPDUs and PTP traffic to the master netdevice, and are
-further used to support STP and 1588 ordinary clock/boundary clock
-functionality.
-
-The following traffic modes are supported over the switch netdevices:
-
-+--------------------+------------+------------------+------------------+
-| | Standalone | Bridged with | Bridged with |
-| | ports | vlan_filtering 0 | vlan_filtering 1 |
-+====================+============+==================+==================+
-| Regular traffic | Yes | Yes | No (use master) |
-+--------------------+------------+------------------+------------------+
-| Management traffic | Yes | Yes | Yes |
-| (BPDU, PTP) | | | |
-+--------------------+------------+------------------+------------------+
-
Switching features
==================
@@ -119,33 +89,10 @@ untagged), and therefore this mode is also supported.
Segregating the switch ports in multiple bridges is supported (e.g. 2 + 2), but
all bridges should have the same level of VLAN awareness (either both have
-``vlan_filtering`` 0, or both 1). Also an inevitable limitation of the fact
-that VLAN awareness is global at the switch level is that once a bridge with
-``vlan_filtering`` enslaves at least one switch port, the other un-bridged
-ports are no longer available for standalone traffic termination.
+``vlan_filtering`` 0, or both 1).
Topology and loop detection through STP is supported.
-L2 FDB manipulation (add/delete/dump) is currently possible for the first
-generation devices. Aging time of FDB entries, as well as enabling fully static
-management (no address learning and no flooding of unknown traffic) is not yet
-configurable in the driver.
-
-A special comment about bridging with other netdevices (illustrated with an
-example):
-
-A board has eth0, eth1, swp0@eth1, swp1@eth1, swp2@eth1, swp3@eth1.
-The switch ports (swp0-3) are under br0.
-It is desired that eth0 is turned into another switched port that communicates
-with swp0-3.
-
-If br0 has vlan_filtering 0, then eth0 can simply be added to br0 with the
-intended results.
-If br0 has vlan_filtering 1, then a new br1 interface needs to be created that
-enslaves eth0 and eth1 (the DSA master of the switch ports). This is because in
-this mode, the switch ports beneath br0 are not capable of regular traffic, and
-are only used as a conduit for switchdev operations.
-
Offloads
========
@@ -230,10 +177,153 @@ simultaneously on two ports. The driver checks the consistency of the schedules
against this restriction and errors out when appropriate. Schedule analysis is
needed to avoid this, which is outside the scope of the document.
+Routing actions (redirect, trap, drop)
+--------------------------------------
+
+The switch is able to offload flow-based redirection of packets to a set of
+destination ports specified by the user. Internally, this is implemented by
+making use of Virtual Links, a TTEthernet concept.
+
+The driver supports 2 types of keys for Virtual Links:
+
+- VLAN-aware virtual links: these match on destination MAC address, VLAN ID and
+ VLAN PCP.
+- VLAN-unaware virtual links: these match on destination MAC address only.
+
+The VLAN awareness state of the bridge (vlan_filtering) cannot be changed while
+there are virtual link rules installed.
+
+Composing multiple actions inside the same rule is supported. When only routing
+actions are requested, the driver creates a "non-critical" virtual link. When
+the action list also contains tc-gate (more details below), the virtual link
+becomes "time-critical" (draws frame buffers from a reserved memory partition,
+etc).
+
+The 3 routing actions that are supported are "trap", "drop" and "redirect".
+
+Example 1: send frames received on swp2 with a DA of 42:be:24:9b:76:20 to the
+CPU and to swp3. This type of key (DA only) is used when the port's VLAN
+awareness state is off::
+
+ tc qdisc add dev swp2 clsact
+ tc filter add dev swp2 ingress flower skip_sw dst_mac 42:be:24:9b:76:20 \
+ action mirred egress redirect dev swp3 \
+ action trap
+
+Example 2: drop frames received on swp2 with a DA of 42:be:24:9b:76:20, a VID
+of 100 and a PCP of 0::
+
+ tc filter add dev swp2 ingress protocol 802.1Q flower skip_sw \
+ dst_mac 42:be:24:9b:76:20 vlan_id 100 vlan_prio 0 action drop
+
+Time-based ingress policing
+---------------------------
+
+The TTEthernet hardware abilities of the switch can be constrained to act
+similarly to the Per-Stream Filtering and Policing (PSFP) clause specified in
+IEEE 802.1Q-2018 (formerly 802.1Qci). This means it can be used to perform
+tight timing-based admission control for up to 1024 flows (identified by a
+tuple composed of destination MAC address, VLAN ID and VLAN PCP). Packets which
+are received outside their expected reception window are dropped.
+
+This capability can be managed through the offload of the tc-gate action. As
+routing actions are intrinsic to virtual links in TTEthernet (which performs
+explicit routing of time-critical traffic and does not leave that in the hands
+of the FDB, flooding etc), the tc-gate action may never appear alone when
+asking sja1105 to offload it. One (or more) redirect or trap actions must also
+follow along.
+
+Example: create a tc-taprio schedule that is phase-aligned with a tc-gate
+schedule (the clocks must be synchronized by a 1588 application stack, which is
+outside the scope of this document). No packet delivered by the sender will be
+dropped. Note that the reception window is larger than the transmission window
+(and much more so, in this example) to compensate for the packet propagation
+delay of the link (which can be determined by the 1588 application stack).
+
+Receiver (sja1105)::
+
+ tc qdisc add dev swp2 clsact
+ now=$(phc_ctl /dev/ptp1 get | awk '/clock time is/ {print $5}') && \
+ sec=$(echo $now | awk -F. '{print $1}') && \
+ base_time="$(((sec + 2) * 1000000000))" && \
+ echo "base time ${base_time}"
+ tc filter add dev swp2 ingress flower skip_sw \
+ dst_mac 42:be:24:9b:76:20 \
+ action gate base-time ${base_time} \
+ sched-entry OPEN 60000 -1 -1 \
+ sched-entry CLOSE 40000 -1 -1 \
+ action trap
+
+Sender::
+
+ now=$(phc_ctl /dev/ptp0 get | awk '/clock time is/ {print $5}') && \
+ sec=$(echo $now | awk -F. '{print $1}') && \
+ base_time="$(((sec + 2) * 1000000000))" && \
+ echo "base time ${base_time}"
+ tc qdisc add dev eno0 parent root taprio \
+ num_tc 8 \
+ map 0 1 2 3 4 5 6 7 \
+ queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
+ base-time ${base_time} \
+ sched-entry S 01 50000 \
+ sched-entry S 00 50000 \
+ flags 2
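
The arithmetic shared by the receiver and sender snippets above can be
sketched as two small shell helpers. The helper names ``base_time_from`` and
``cycle_ns`` are made up for this illustration and are not part of any tool.
The first one rounds a PHC timestamp down to a whole second and moves it two
seconds into the future, expressed in nanoseconds; the second one sums the
``sched-entry`` durations, which shows that both schedules repeat with the
same 100 us cycle.

```shell
# Convert a PHC timestamp of the form "<sec>.<nsec>" into a base time in
# nanoseconds that lies on a whole-second boundary, two seconds ahead.
base_time_from() {
    sec=${1%%.*}    # keep only the integer seconds
    echo $(( (sec + 2) * 1000000000 ))
}

# Sum a list of sched-entry durations (in ns) to obtain the cycle length.
cycle_ns() {
    total=0
    for d in "$@"; do
        total=$(( total + d ))
    done
    echo "$total"
}

base_time_from 1600000000.123456789   # prints 1600000002000000000
cycle_ns 60000 40000                  # receiver (tc-gate) cycle: prints 100000
cycle_ns 50000 50000                  # sender (tc-taprio) cycle: prints 100000
```

Within each 100 us cycle the sender's gate is open for 50 us while the
receiver's window is open for 60 us, which is the margin that the text above
attributes to the propagation delay of the link.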
+
+The engine used to schedule the ingress gate operations is the same as the
+one used for the tc-taprio offload. Therefore, the restrictions regarding the
+fact that no two gate actions (either tc-gate or tc-taprio gates) may fire at
+the same time (during the same 200 ns slot) still apply.
+
+Conveniently, it is possible to share time-triggered virtual links across
+more than 1 ingress port, via flow blocks. In this case, the restriction of
+firing at the same time does not apply because there is a single schedule in
+the system, that of the shared virtual link::
+
+ tc qdisc add dev swp2 ingress_block 1 clsact
+ tc qdisc add dev swp3 ingress_block 1 clsact
+ tc filter add block 1 flower skip_sw dst_mac 42:be:24:9b:76:20 \
+ action gate index 2 \
+ base-time 0 \
+ sched-entry OPEN 50000000 -1 -1 \
+ sched-entry CLOSE 50000000 -1 -1 \
+ action trap
+
+Hardware statistics for each flow are also available ("pkts" counts the number
+of dropped frames, which is a sum of frames dropped due to timing violations,
+lack of destination ports and MTU enforcement checks). Byte-level counters are
+not available.
+
+Limitations
+===========
+
+The SJA1105 switch family always performs VLAN processing. When configured as
+VLAN-unaware, frames carry a different VLAN tag internally, depending on
+whether the port is standalone or under a VLAN-unaware bridge.
+
+The virtual link keys are always fixed at {MAC DA, VLAN ID, VLAN PCP}, but the
+driver asks for the VLAN ID and VLAN PCP when the port is under a VLAN-aware
+bridge. Otherwise, it fills in the VLAN ID and PCP automatically, based on
+whether the port is standalone or in a VLAN-unaware bridge, and accepts only
+"VLAN-unaware" tc-flower keys (MAC DA).
+
+The existing tc-flower keys that are offloaded using virtual links are no
+longer operational after one of the following happens:
+
+- port was standalone and joins a bridge (VLAN-aware or VLAN-unaware)
+- port is part of a bridge whose VLAN awareness state changes
+- port was part of a bridge and becomes standalone
+- port was standalone, but another port joins a VLAN-aware bridge and this
+ changes the global VLAN awareness state of the bridge
+
+The driver cannot veto all these operations, and it cannot update/remove the
+existing tc-flower filters either. So for proper operation, the tc-flower
+filters should be installed only after the forwarding configuration of the port
+has been made, and removed by user space before making any changes to it.
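
The ordering recommended above could look as follows when a standalone port
is moved into a VLAN-aware bridge. This is a sketch under assumptions: the
names ``swp2`` and ``br0`` are illustrative, the ``clsact`` qdisc is assumed
to be already attached to ``swp2``, and the re-installed rule simply reuses
the VLAN-aware key from Example 2.

```shell
# 1. Remove the offloaded tc-flower filters while the old forwarding
#    configuration is still in place.
tc filter del dev swp2 ingress

# 2. Change the forwarding configuration of the port.
ip link set dev swp2 master br0
ip link set dev br0 type bridge vlan_filtering 1

# 3. Re-install the filters, now using a VLAN-aware key.
tc filter add dev swp2 ingress protocol 802.1Q flower skip_sw \
	dst_mac 42:be:24:9b:76:20 vlan_id 100 vlan_prio 0 action drop
```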
+
Device Tree bindings and board design
=====================================
-This section references ``Documentation/devicetree/bindings/net/dsa/sja1105.txt``
+This section references ``Documentation/devicetree/bindings/net/dsa/nxp,sja1105.yaml``
and aims to showcase some potential switch caveats.
RMII PHY role and out-of-band signaling
@@ -302,3 +392,54 @@ A board would need to hook up the PHYs connected to the switch to any other
MDIO bus available to Linux within the system (e.g. to the DSA master's MDIO
bus). Link state management then works by the driver manually keeping in sync
(over SPI commands) the MAC link speed with the settings negotiated by the PHY.
+
+By comparison, the SJA1110 supports an MDIO slave access point over which its
+internal 100base-T1 PHYs can be accessed from the host. This is, however, not
+used by the driver, instead the internal 100base-T1 and 100base-TX PHYs are
+accessed through SPI commands, modeled in Linux as virtual MDIO buses.
+
+The microcontroller attached to the SJA1110 port 0 also has an MDIO controller
+operating in master mode, however the driver does not support this either,
+since the microcontroller gets disabled when the Linux driver operates.
+Discrete PHYs connected to the switch ports should have their MDIO interface
+attached to an MDIO controller from the host system and not to the switch,
+similar to SJA1105.
+
+Port compatibility matrix
+-------------------------
+
+The SJA1105 port compatibility matrix is:
+
+===== ============== ============== ==============
+Port SJA1105E/T SJA1105P/Q SJA1105R/S
+===== ============== ============== ==============
+0 xMII xMII xMII
+1 xMII xMII xMII
+2 xMII xMII xMII
+3 xMII xMII xMII
+4 xMII xMII SGMII
+===== ============== ============== ==============
+
+
+The SJA1110 port compatibility matrix is:
+
+===== ============== ============== ============== ==============
+Port SJA1110A SJA1110B SJA1110C SJA1110D
+===== ============== ============== ============== ==============
+0 RevMII (uC) RevMII (uC) RevMII (uC) RevMII (uC)
+1 100base-TX 100base-TX 100base-TX
+ or SGMII SGMII
+2 xMII xMII xMII xMII
+ or SGMII or SGMII
+3 xMII xMII xMII
+ or SGMII or SGMII SGMII
+ or 2500base-X or 2500base-X or 2500base-X
+4 SGMII SGMII SGMII SGMII
+ or 2500base-X or 2500base-X or 2500base-X or 2500base-X
+5 100base-T1 100base-T1 100base-T1 100base-T1
+6 100base-T1 100base-T1 100base-T1 100base-T1
+7 100base-T1 100base-T1 100base-T1 100base-T1
+8 100base-T1 100base-T1 n/a n/a
+9 100base-T1 100base-T1 n/a n/a
+10 100base-T1 n/a n/a n/a
+===== ============== ============== ============== ==============
diff --git a/Documentation/networking/eql.txt b/Documentation/networking/eql.rst
index 0f1550150f05..a628c4c81166 100644
--- a/Documentation/networking/eql.txt
+++ b/Documentation/networking/eql.rst
@@ -1,5 +1,11 @@
- EQL Driver: Serial IP Load Balancing HOWTO
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================================
+EQL Driver: Serial IP Load Balancing HOWTO
+==========================================
+
Simon "Guru Aleph-Null" Janes, simon@ncm.com
+
v1.1, February 27, 1995
This is the manual for the EQL device driver. EQL is a software device
@@ -12,7 +18,8 @@
which was only created to patch cleanly in the very latest kernel
source trees. (Yes, it worked fine.)
- 1. Introduction
+1. Introduction
+===============
Which is worse? A huge fee for a 56K leased line or two phone lines?
It's probably the former. If you find yourself craving more bandwidth,
@@ -41,47 +48,40 @@
Hey, we can all dream you know...
- 2. Kernel Configuration
+2. Kernel Configuration
+=======================
Here I describe the general steps of getting a kernel up and working
with the eql driver. From patching, building, to installing.
- 2.1. Patching The Kernel
+2.1. Patching The Kernel
+------------------------
If you do not have or cannot get a copy of the kernel with the eql
driver folded into it, get your copy of the driver from
ftp://slaughter.ncm.com/pub/Linux/LOAD_BALANCING/eql-1.1.tar.gz.
Unpack this archive someplace obvious like /usr/local/src/. It will
- create the following files:
-
-
+ create the following files::
- ______________________________________________________________________
-rw-r--r-- guru/ncm 198 Jan 19 18:53 1995 eql-1.1/NO-WARRANTY
-rw-r--r-- guru/ncm 30620 Feb 27 21:40 1995 eql-1.1/eql-1.1.patch
-rwxr-xr-x guru/ncm 16111 Jan 12 22:29 1995 eql-1.1/eql_enslave
-rw-r--r-- guru/ncm 2195 Jan 10 21:48 1995 eql-1.1/eql_enslave.c
- ______________________________________________________________________
Unpack a recent kernel (something after 1.1.92) someplace convenient
like say /usr/src/linux-1.1.92.eql. Use symbolic links to point
/usr/src/linux to this development directory.
- Apply the patch by running the commands:
+ Apply the patch by running the commands::
-
- ______________________________________________________________________
cd /usr/src
patch </usr/local/src/eql-1.1/eql-1.1.patch
- ______________________________________________________________________
-
-
-
- 2.2. Building The Kernel
+2.2. Building The Kernel
+------------------------
After patching the kernel, run make config and configure the kernel
for your hardware.
@@ -90,7 +90,8 @@
After configuration, make and install according to your habit.
- 3. Network Configuration
+3. Network Configuration
+========================
So far, I have only used the eql device with the DSLIP SLIP connection
manager by Matt Dillon (-- "The man who sold his soul to code so much
@@ -100,37 +101,27 @@
connection.
- 3.1. /etc/rc.d/rc.inet1
+3.1. /etc/rc.d/rc.inet1
+-----------------------
In rc.inet1, ifconfig the eql device to the IP address you usually use
for your machine, and the MTU you prefer for your SLIP lines. One
could argue that MTU should be roughly half the usual size for two
modems, one-third for three, one-fourth for four, etc... But going
too far below 296 is probably overkill. Here is an example ifconfig
- command that sets up the eql device:
-
+ command that sets up the eql device::
-
- ______________________________________________________________________
ifconfig eql 198.67.33.239 mtu 1006
- ______________________________________________________________________
-
-
-
-
Once the eql device is up and running, add a static default route to
it in the routing table using the cool new route syntax that makes
- life so much easier:
+ life so much easier::
-
-
- ______________________________________________________________________
route add default eql
- ______________________________________________________________________
- 3.2. Enslaving Devices By Hand
+3.2. Enslaving Devices By Hand
+------------------------------
Enslaving devices by hand requires two utility programs: eql_enslave
and eql_emancipate (-- eql_emancipate hasn't been written because when
@@ -140,87 +131,56 @@
The syntax for enslaving a device is "eql_enslave <master-name>
- <slave-name> <estimated-bps>". Here are some example enslavings:
-
+ <slave-name> <estimated-bps>". Here are some example enslavings::
-
- ______________________________________________________________________
eql_enslave eql sl0 28800
eql_enslave eql ppp0 14400
eql_enslave eql sl1 57600
- ______________________________________________________________________
-
-
-
-
When you want to free a device from its life of slavery, you can
either down the device with ifconfig (eql will automatically bury the
dead slave and remove it from its queue) or use eql_emancipate to free
it. (-- Or just ifconfig it down, and the eql driver will take it out
- for you.--)
-
-
+ for you.--)::
- ______________________________________________________________________
eql_emancipate eql sl0
eql_emancipate eql ppp0
eql_emancipate eql sl1
- ______________________________________________________________________
-
-
-
- 3.3. DSLIP Configuration for the eql Device
+3.3. DSLIP Configuration for the eql Device
+-------------------------------------------
The general idea is to bring up and keep up as many SLIP connections
as you need, automatically.
- 3.3.1. /etc/slip/runslip.conf
-
- Here is an example runslip.conf:
-
-
-
-
-
-
-
-
-
-
-
+3.3.1. /etc/slip/runslip.conf
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ Here is an example runslip.conf::
+ name sl-line-1
+ enabled
+ baud 38400
+ mtu 576
+ ducmd -e /etc/slip/dialout/cua2-288.xp -t 9
+ command eql_enslave eql $interface 28800
+ address 198.67.33.239
+ line /dev/cua2
+ name sl-line-2
+ enabled
+ baud 38400
+ mtu 576
+ ducmd -e /etc/slip/dialout/cua3-288.xp -t 9
+ command eql_enslave eql $interface 28800
+ address 198.67.33.239
+ line /dev/cua3
- ______________________________________________________________________
- name sl-line-1
- enabled
- baud 38400
- mtu 576
- ducmd -e /etc/slip/dialout/cua2-288.xp -t 9
- command eql_enslave eql $interface 28800
- address 198.67.33.239
- line /dev/cua2
- name sl-line-2
- enabled
- baud 38400
- mtu 576
- ducmd -e /etc/slip/dialout/cua3-288.xp -t 9
- command eql_enslave eql $interface 28800
- address 198.67.33.239
- line /dev/cua3
- ______________________________________________________________________
-
-
-
-
-
- 3.4. Using PPP and the eql Device
+3.4. Using PPP and the eql Device
+---------------------------------
I have not yet done any load-balancing testing for PPP devices, mainly
because I don't have a PPP-connection manager like SLIP has with
@@ -235,7 +195,8 @@
year.
- 4. About the Slave Scheduler Algorithm
+4. About the Slave Scheduler Algorithm
+======================================
The slave scheduler probably could be replaced with a dozen other
things and push traffic much faster. The formula in the current set
@@ -254,7 +215,8 @@
traffic and the "slower" modem starved.
- 5. Testers' Reports
+5. Testers' Reports
+===================
Some people have experimented with the eql device with newer
kernels (than 1.1.75). I have since updated the driver to patch
@@ -262,87 +224,29 @@
balancing" driver config option.
- o icee from LinuxNET patched 1.1.86 without any rejects and was able
+ - icee from LinuxNET patched 1.1.86 without any rejects and was able
to boot the kernel and enslave a couple of ISDN PPP links.
- 5.1. Randolph Bentson's Test Report
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+5.1. Randolph Bentson's Test Report
+-----------------------------------
+ ::
+ From bentson@grieg.seaslug.org Wed Feb 8 19:08:09 1995
+ Date: Tue, 7 Feb 95 22:57 PST
+ From: Randolph Bentson <bentson@grieg.seaslug.org>
+ To: guru@ncm.com
+ Subject: EQL driver tests
+ I have been checking out your eql driver. (Nice work, that!)
+ Although you may already done this performance testing, here
+ are some data I've discovered.
+ Randolph Bentson
+ bentson@grieg.seaslug.org
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- From bentson@grieg.seaslug.org Wed Feb 8 19:08:09 1995
- Date: Tue, 7 Feb 95 22:57 PST
- From: Randolph Bentson <bentson@grieg.seaslug.org>
- To: guru@ncm.com
- Subject: EQL driver tests
-
-
- I have been checking out your eql driver. (Nice work, that!)
- Although you may already done this performance testing, here
- are some data I've discovered.
-
- Randolph Bentson
- bentson@grieg.seaslug.org
-
- ---------------------------------------------------------
+------------------------------------------------------------------
A pseudo-device driver, EQL, written by Simon Janes, can be used
@@ -363,7 +267,7 @@
Once a link was established, I timed a binary ftp transfer of
289284 bytes of data. If there were no overhead (packet headers,
inter-character and inter-packet delays, etc.) the transfers
- would take the following times:
+ would take the following times::
bits/sec seconds
345600 8.3
@@ -388,141 +292,82 @@
that the connection establishment seemed fragile for the higher
speeds. Once established, the connection seemed robust enough.)
- #lines speed mtu seconds theory actual %of
- kbit/sec duration speed speed max
- 3 115200 900 _ 345600
- 3 115200 400 18.1 345600 159825 46
- 2 115200 900 _ 230400
- 2 115200 600 18.1 230400 159825 69
- 2 115200 400 19.3 230400 149888 65
- 4 57600 900 _ 234600
- 4 57600 600 _ 234600
- 4 57600 400 _ 234600
- 3 57600 600 20.9 172800 138413 80
- 3 57600 900 21.2 172800 136455 78
- 3 115200 600 21.7 345600 133311 38
- 3 57600 400 22.5 172800 128571 74
- 4 38400 900 25.2 153600 114795 74
- 4 38400 600 26.4 153600 109577 71
- 4 38400 400 27.3 153600 105965 68
- 2 57600 900 29.1 115200 99410.3 86
- 1 115200 900 30.7 115200 94229.3 81
- 2 57600 600 30.2 115200 95789.4 83
- 3 38400 900 30.3 115200 95473.3 82
- 3 38400 600 31.2 115200 92719.2 80
- 1 115200 600 31.3 115200 92423 80
- 2 57600 400 32.3 115200 89561.6 77
- 1 115200 400 32.8 115200 88196.3 76
- 3 38400 400 33.5 115200 86353.4 74
- 2 38400 900 43.7 76800 66197.7 86
- 2 38400 600 44 76800 65746.4 85
- 2 38400 400 47.2 76800 61289 79
- 4 19200 900 50.8 76800 56945.7 74
- 4 19200 400 53.2 76800 54376.7 70
- 4 19200 600 53.7 76800 53870.4 70
- 1 57600 900 54.6 57600 52982.4 91
- 1 57600 600 56.2 57600 51474 89
- 3 19200 900 60.5 57600 47815.5 83
- 1 57600 400 60.2 57600 48053.8 83
- 3 19200 600 62 57600 46658.7 81
- 3 19200 400 64.7 57600 44711.6 77
- 1 38400 900 79.4 38400 36433.8 94
- 1 38400 600 82.4 38400 35107.3 91
- 2 19200 900 84.4 38400 34275.4 89
- 1 38400 400 86.8 38400 33327.6 86
- 2 19200 600 87.6 38400 33023.3 85
- 2 19200 400 91.2 38400 31719.7 82
- 4 9600 900 94.7 38400 30547.4 79
- 4 9600 400 106 38400 27290.9 71
- 4 9600 600 110 38400 26298.5 68
- 3 9600 900 118 28800 24515.6 85
- 3 9600 600 120 28800 24107 83
- 3 9600 400 131 28800 22082.7 76
- 1 19200 900 155 19200 18663.5 97
- 1 19200 600 161 19200 17968 93
- 1 19200 400 170 19200 17016.7 88
- 2 9600 600 176 19200 16436.6 85
- 2 9600 900 180 19200 16071.3 83
- 2 9600 400 181 19200 15982.5 83
- 1 9600 900 305 9600 9484.72 98
- 1 9600 600 314 9600 9212.87 95
- 1 9600 400 332 9600 8713.37 90
-
-
-
-
-
- 5.2. Anthony Healy's Report
-
-
-
-
-
-
-
- Date: Mon, 13 Feb 1995 16:17:29 +1100 (EST)
- From: Antony Healey <ahealey@st.nepean.uws.edu.au>
- To: Simon Janes <guru@ncm.com>
- Subject: Re: Load Balancing
-
- Hi Simon,
+ ====== ======== === ======== ======= ======= ===
+ #lines speed mtu seconds theory actual %of
+ kbit/sec duration speed speed max
+ ====== ======== === ======== ======= ======= ===
+ 3 115200 900 _ 345600
+ 3 115200 400 18.1 345600 159825 46
+ 2 115200 900 _ 230400
+ 2 115200 600 18.1 230400 159825 69
+ 2 115200 400 19.3 230400 149888 65
+ 4 57600 900 _ 234600
+ 4 57600 600 _ 234600
+ 4 57600 400 _ 234600
+ 3 57600 600 20.9 172800 138413 80
+ 3 57600 900 21.2 172800 136455 78
+ 3 115200 600 21.7 345600 133311 38
+ 3 57600 400 22.5 172800 128571 74
+ 4 38400 900 25.2 153600 114795 74
+ 4 38400 600 26.4 153600 109577 71
+ 4 38400 400 27.3 153600 105965 68
+ 2 57600 900 29.1 115200 99410.3 86
+ 1 115200 900 30.7 115200 94229.3 81
+ 2 57600 600 30.2 115200 95789.4 83
+ 3 38400 900 30.3 115200 95473.3 82
+ 3 38400 600 31.2 115200 92719.2 80
+ 1 115200 600 31.3 115200 92423 80
+ 2 57600 400 32.3 115200 89561.6 77
+ 1 115200 400 32.8 115200 88196.3 76
+ 3 38400 400 33.5 115200 86353.4 74
+ 2 38400 900 43.7 76800 66197.7 86
+ 2 38400 600 44 76800 65746.4 85
+ 2 38400 400 47.2 76800 61289 79
+ 4 19200 900 50.8 76800 56945.7 74
+ 4 19200 400 53.2 76800 54376.7 70
+ 4 19200 600 53.7 76800 53870.4 70
+ 1 57600 900 54.6 57600 52982.4 91
+ 1 57600 600 56.2 57600 51474 89
+ 3 19200 900 60.5 57600 47815.5 83
+ 1 57600 400 60.2 57600 48053.8 83
+ 3 19200 600 62 57600 46658.7 81
+ 3 19200 400 64.7 57600 44711.6 77
+ 1 38400 900 79.4 38400 36433.8 94
+ 1 38400 600 82.4 38400 35107.3 91
+ 2 19200 900 84.4 38400 34275.4 89
+ 1 38400 400 86.8 38400 33327.6 86
+ 2 19200 600 87.6 38400 33023.3 85
+ 2 19200 400 91.2 38400 31719.7 82
+ 4 9600 900 94.7 38400 30547.4 79
+ 4 9600 400 106 38400 27290.9 71
+ 4 9600 600 110 38400 26298.5 68
+ 3 9600 900 118 28800 24515.6 85
+ 3 9600 600 120 28800 24107 83
+ 3 9600 400 131 28800 22082.7 76
+ 1 19200 900 155 19200 18663.5 97
+ 1 19200 600 161 19200 17968 93
+ 1 19200 400 170 19200 17016.7 88
+ 2 9600 600 176 19200 16436.6 85
+ 2 9600 900 180 19200 16071.3 83
+ 2 9600 400 181 19200 15982.5 83
+ 1 9600 900 305 9600 9484.72 98
+ 1 9600 600 314 9600 9212.87 95
+ 1 9600 400 332 9600 8713.37 90
+ ====== ======== === ======== ======= ======= ===
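The "actual speed" and "%of max" columns in the table above can be reproduced from the file size and the measured duration. The sketch below is not part of the original report. It assumes 10 bits on the wire per byte (8 data bits plus start and stop bits), an assumption inferred from the table's numbers rather than stated in the text. The function names are illustrative only.

```python
FILE_BYTES = 289284          # size of the ftp transfer quoted in the report
BITS_PER_BYTE_ON_WIRE = 10   # assumption: 8 data bits + 1 start + 1 stop bit


def ideal_seconds(lines, bps_per_line):
    """Transfer time with no overhead, using the aggregate line speed."""
    return FILE_BYTES * BITS_PER_BYTE_ON_WIRE / (lines * bps_per_line)


def actual_bps(seconds):
    """Effective throughput for a measured transfer duration."""
    return FILE_BYTES * BITS_PER_BYTE_ON_WIRE / seconds


def percent_of_max(lines, bps_per_line, seconds):
    """Effective throughput as a percentage of the aggregate line speed."""
    return 100.0 * actual_bps(seconds) / (lines * bps_per_line)


if __name__ == "__main__":
    # Row "3 115200 400 18.1 345600 159825 46" from the table.
    print(round(actual_bps(18.1)), int(percent_of_max(3, 115200, 18.1)))
```

The table appears to truncate the percentage rather than round it, which is why the sketch uses `int()` in the usage line.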
+
+5.2. Anthony Healy's Report
+---------------------------
+
+ ::
+
+ Date: Mon, 13 Feb 1995 16:17:29 +1100 (EST)
+ From: Antony Healey <ahealey@st.nepean.uws.edu.au>
+ To: Simon Janes <guru@ncm.com>
+ Subject: Re: Load Balancing
+
+ Hi Simon,
I've installed your patch and it works great. I have trialed
it over twin SL/IP lines, just over null modems, but I was
able to data at over 48Kb/s [ISDN link -Simon]. I managed a
transfer of up to 7.5 Kbyte/s on one go, but averaged around
6.4 Kbyte/s, which I think is pretty cool. :)
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index f1f868479ceb..d578b8bcd8a4 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -41,6 +41,11 @@ In the message structure descriptions below, if an attribute name is suffixed
with "+", parent nest can contain multiple attributes of the same type. This
implements an array of entries.
+Attributes that need to be filled-in by device drivers and that are dumped to
+user space based on whether they are valid or not should not use zero as a
+valid value. This avoids the need to explicitly signal the validity of the
+attribute in the device driver API.
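As a rough illustration of the convention described above (not kernel code), the sketch below treats a zero value as "not filled in by the driver" and leaves such attributes out of the dump. The attribute names and the helper `attrs_to_dump` are hypothetical.

```python
def attrs_to_dump(driver_values):
    """Keep only attributes the driver filled in; zero means 'not valid'."""
    return {name: value for name, value in driver_values.items() if value != 0}


if __name__ == "__main__":
    # Hypothetical values reported by a driver; 0 marks an unset attribute.
    reported = {"example_speed": 1000, "example_lanes": 0}
    print(attrs_to_dump(reported))
```

Because zero doubles as the "unset" marker, no separate validity flag has to travel through the driver API.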
+
Request header
==============
@@ -68,6 +73,7 @@ the flags may not apply to requests. Recognized flags are:
================================= ===================================
``ETHTOOL_FLAG_COMPACT_BITSETS`` use compact format bitsets in reply
``ETHTOOL_FLAG_OMIT_REPLY`` omit optional reply (_SET and _ACT)
+ ``ETHTOOL_FLAG_STATS`` include optional device statistics
================================= ===================================
New request flags should follow the general idea that if the flag is not set,
@@ -178,7 +184,7 @@ according to message purpose:
Userspace to kernel:
- ===================================== ================================
+ ===================================== =================================
``ETHTOOL_MSG_STRSET_GET`` get string set
``ETHTOOL_MSG_LINKINFO_GET`` get link settings
``ETHTOOL_MSG_LINKINFO_SET`` set link settings
@@ -189,22 +195,75 @@ Userspace to kernel:
``ETHTOOL_MSG_DEBUG_SET`` set debugging settings
``ETHTOOL_MSG_WOL_GET`` get wake-on-lan settings
``ETHTOOL_MSG_WOL_SET`` set wake-on-lan settings
- ===================================== ================================
+ ``ETHTOOL_MSG_FEATURES_GET`` get device features
+ ``ETHTOOL_MSG_FEATURES_SET`` set device features
+ ``ETHTOOL_MSG_PRIVFLAGS_GET`` get private flags
+ ``ETHTOOL_MSG_PRIVFLAGS_SET`` set private flags
+ ``ETHTOOL_MSG_RINGS_GET`` get ring sizes
+ ``ETHTOOL_MSG_RINGS_SET`` set ring sizes
+ ``ETHTOOL_MSG_CHANNELS_GET`` get channel counts
+ ``ETHTOOL_MSG_CHANNELS_SET`` set channel counts
+ ``ETHTOOL_MSG_COALESCE_GET`` get coalescing parameters
+ ``ETHTOOL_MSG_COALESCE_SET`` set coalescing parameters
+ ``ETHTOOL_MSG_PAUSE_GET`` get pause parameters
+ ``ETHTOOL_MSG_PAUSE_SET`` set pause parameters
+ ``ETHTOOL_MSG_EEE_GET`` get EEE settings
+ ``ETHTOOL_MSG_EEE_SET`` set EEE settings
+ ``ETHTOOL_MSG_TSINFO_GET`` get timestamping info
+ ``ETHTOOL_MSG_CABLE_TEST_ACT`` action start cable test
+ ``ETHTOOL_MSG_CABLE_TEST_TDR_ACT`` action start raw TDR cable test
+ ``ETHTOOL_MSG_TUNNEL_INFO_GET`` get tunnel offload info
+ ``ETHTOOL_MSG_FEC_GET`` get FEC settings
+ ``ETHTOOL_MSG_FEC_SET`` set FEC settings
+ ``ETHTOOL_MSG_MODULE_EEPROM_GET`` read SFP module EEPROM
+ ``ETHTOOL_MSG_STATS_GET`` get standard statistics
+ ``ETHTOOL_MSG_PHC_VCLOCKS_GET`` get PHC virtual clocks info
+ ``ETHTOOL_MSG_MODULE_SET`` set transceiver module parameters
+ ``ETHTOOL_MSG_MODULE_GET`` get transceiver module parameters
+ ``ETHTOOL_MSG_PSE_SET`` set PSE parameters
+ ``ETHTOOL_MSG_PSE_GET`` get PSE parameters
+ ===================================== =================================
Kernel to userspace:
- ===================================== =================================
- ``ETHTOOL_MSG_STRSET_GET_REPLY`` string set contents
- ``ETHTOOL_MSG_LINKINFO_GET_REPLY`` link settings
- ``ETHTOOL_MSG_LINKINFO_NTF`` link settings notification
- ``ETHTOOL_MSG_LINKMODES_GET_REPLY`` link modes info
- ``ETHTOOL_MSG_LINKMODES_NTF`` link modes notification
- ``ETHTOOL_MSG_LINKSTATE_GET_REPLY`` link state info
- ``ETHTOOL_MSG_DEBUG_GET_REPLY`` debugging settings
- ``ETHTOOL_MSG_DEBUG_NTF`` debugging settings notification
- ``ETHTOOL_MSG_WOL_GET_REPLY`` wake-on-lan settings
- ``ETHTOOL_MSG_WOL_NTF`` wake-on-lan settings notification
- ===================================== =================================
+ ======================================== =================================
+ ``ETHTOOL_MSG_STRSET_GET_REPLY`` string set contents
+ ``ETHTOOL_MSG_LINKINFO_GET_REPLY`` link settings
+ ``ETHTOOL_MSG_LINKINFO_NTF`` link settings notification
+ ``ETHTOOL_MSG_LINKMODES_GET_REPLY`` link modes info
+ ``ETHTOOL_MSG_LINKMODES_NTF`` link modes notification
+ ``ETHTOOL_MSG_LINKSTATE_GET_REPLY`` link state info
+ ``ETHTOOL_MSG_DEBUG_GET_REPLY`` debugging settings
+ ``ETHTOOL_MSG_DEBUG_NTF`` debugging settings notification
+ ``ETHTOOL_MSG_WOL_GET_REPLY`` wake-on-lan settings
+ ``ETHTOOL_MSG_WOL_NTF`` wake-on-lan settings notification
+ ``ETHTOOL_MSG_FEATURES_GET_REPLY`` device features
+ ``ETHTOOL_MSG_FEATURES_SET_REPLY`` optional reply to FEATURES_SET
+ ``ETHTOOL_MSG_FEATURES_NTF`` netdev features notification
+ ``ETHTOOL_MSG_PRIVFLAGS_GET_REPLY`` private flags
+ ``ETHTOOL_MSG_PRIVFLAGS_NTF`` private flags
+ ``ETHTOOL_MSG_RINGS_GET_REPLY`` ring sizes
+ ``ETHTOOL_MSG_RINGS_NTF`` ring sizes
+ ``ETHTOOL_MSG_CHANNELS_GET_REPLY`` channel counts
+ ``ETHTOOL_MSG_CHANNELS_NTF`` channel counts
+ ``ETHTOOL_MSG_COALESCE_GET_REPLY`` coalescing parameters
+ ``ETHTOOL_MSG_COALESCE_NTF`` coalescing parameters
+ ``ETHTOOL_MSG_PAUSE_GET_REPLY`` pause parameters
+ ``ETHTOOL_MSG_PAUSE_NTF`` pause parameters
+ ``ETHTOOL_MSG_EEE_GET_REPLY`` EEE settings
+ ``ETHTOOL_MSG_EEE_NTF`` EEE settings
+ ``ETHTOOL_MSG_TSINFO_GET_REPLY`` timestamping info
+ ``ETHTOOL_MSG_CABLE_TEST_NTF`` Cable test results
+ ``ETHTOOL_MSG_CABLE_TEST_TDR_NTF`` Cable test TDR results
+ ``ETHTOOL_MSG_TUNNEL_INFO_GET_REPLY`` tunnel offload info
+ ``ETHTOOL_MSG_FEC_GET_REPLY`` FEC settings
+ ``ETHTOOL_MSG_FEC_NTF`` FEC settings
+ ``ETHTOOL_MSG_MODULE_EEPROM_GET_REPLY`` read SFP module EEPROM
+ ``ETHTOOL_MSG_STATS_GET_REPLY`` standard statistics
+ ``ETHTOOL_MSG_PHC_VCLOCKS_GET_REPLY`` PHC virtual clocks info
+ ``ETHTOOL_MSG_MODULE_GET_REPLY`` transceiver module parameters
+ ``ETHTOOL_MSG_PSE_GET_REPLY`` PSE parameters
+ ======================================== =================================
``GET`` requests are sent by userspace applications to retrieve device
information. They usually do not contain any message specific attributes.
@@ -361,14 +420,17 @@ Request contents:
Kernel response contents:
- ==================================== ====== ==========================
- ``ETHTOOL_A_LINKMODES_HEADER`` nested reply header
- ``ETHTOOL_A_LINKMODES_AUTONEG`` u8 autonegotiation status
- ``ETHTOOL_A_LINKMODES_OURS`` bitset advertised link modes
- ``ETHTOOL_A_LINKMODES_PEER`` bitset partner link modes
- ``ETHTOOL_A_LINKMODES_SPEED`` u32 link speed (Mb/s)
- ``ETHTOOL_A_LINKMODES_DUPLEX`` u8 duplex mode
- ==================================== ====== ==========================
+ ========================================== ====== ==========================
+ ``ETHTOOL_A_LINKMODES_HEADER`` nested reply header
+ ``ETHTOOL_A_LINKMODES_AUTONEG`` u8 autonegotiation status
+ ``ETHTOOL_A_LINKMODES_OURS`` bitset advertised link modes
+ ``ETHTOOL_A_LINKMODES_PEER`` bitset partner link modes
+ ``ETHTOOL_A_LINKMODES_SPEED`` u32 link speed (Mb/s)
+ ``ETHTOOL_A_LINKMODES_DUPLEX`` u8 duplex mode
+ ``ETHTOOL_A_LINKMODES_MASTER_SLAVE_CFG`` u8 Master/slave port mode
+ ``ETHTOOL_A_LINKMODES_MASTER_SLAVE_STATE`` u8 Master/slave port state
+ ``ETHTOOL_A_LINKMODES_RATE_MATCHING`` u8 PHY rate matching
+ ========================================== ====== ==========================
For ``ETHTOOL_A_LINKMODES_OURS``, value represents advertised modes and mask
represents supported modes. ``ETHTOOL_A_LINKMODES_PEER`` in the reply is a bit
@@ -383,32 +445,36 @@ LINKMODES_SET
Request contents:
- ==================================== ====== ==========================
- ``ETHTOOL_A_LINKMODES_HEADER`` nested request header
- ``ETHTOOL_A_LINKMODES_AUTONEG`` u8 autonegotiation status
- ``ETHTOOL_A_LINKMODES_OURS`` bitset advertised link modes
- ``ETHTOOL_A_LINKMODES_PEER`` bitset partner link modes
- ``ETHTOOL_A_LINKMODES_SPEED`` u32 link speed (Mb/s)
- ``ETHTOOL_A_LINKMODES_DUPLEX`` u8 duplex mode
- ==================================== ====== ==========================
+ ========================================== ====== ==========================
+ ``ETHTOOL_A_LINKMODES_HEADER`` nested request header
+ ``ETHTOOL_A_LINKMODES_AUTONEG`` u8 autonegotiation status
+ ``ETHTOOL_A_LINKMODES_OURS`` bitset advertised link modes
+ ``ETHTOOL_A_LINKMODES_PEER`` bitset partner link modes
+ ``ETHTOOL_A_LINKMODES_SPEED`` u32 link speed (Mb/s)
+ ``ETHTOOL_A_LINKMODES_DUPLEX`` u8 duplex mode
+ ``ETHTOOL_A_LINKMODES_MASTER_SLAVE_CFG`` u8 Master/slave port mode
+ ``ETHTOOL_A_LINKMODES_RATE_MATCHING`` u8 PHY rate matching
+ ``ETHTOOL_A_LINKMODES_LANES`` u32 lanes
+ ========================================== ====== ==========================
``ETHTOOL_A_LINKMODES_OURS`` bit set allows setting advertised link modes. If
autonegotiation is on (either set now or kept from before), advertised modes
are not changed (no ``ETHTOOL_A_LINKMODES_OURS`` attribute) and at least one
-of speed and duplex is specified, kernel adjusts advertised modes to all
-supported modes matching speed, duplex or both (whatever is specified). This
-autoselection is done on ethtool side with ioctl interface, netlink interface
-is supposed to allow requesting changes without knowing what exactly kernel
-supports.
+of speed, duplex and lanes is specified, kernel adjusts advertised modes to all
+supported modes matching speed, duplex, lanes or all (whatever is specified).
+With the ioctl interface, this autoselection is done on the ethtool side; the
+netlink interface is supposed to allow requesting changes without knowing what
+exactly the kernel supports.
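The autoselection described above can be pictured as a filter over the supported link modes. This is a minimal sketch under stated assumptions, not the kernel's implementation: the `LinkMode` tuple and `select_advertised` helper are hypothetical, and the speed, duplex and lane values attached to each mode name are filled in only for illustration.

```python
from typing import NamedTuple


class LinkMode(NamedTuple):
    name: str
    speed: int    # Mb/s
    duplex: str   # "full" or "half"
    lanes: int


def select_advertised(supported, speed=None, duplex=None, lanes=None):
    """Return names of supported modes matching every parameter given."""
    return [
        m.name
        for m in supported
        if (speed is None or m.speed == speed)
        and (duplex is None or m.duplex == duplex)
        and (lanes is None or m.lanes == lanes)
    ]


SUPPORTED = [
    LinkMode("100baseT/Half", 100, "half", 1),
    LinkMode("100baseT/Full", 100, "full", 1),
    LinkMode("50000baseCR/Full", 50000, "full", 1),
    LinkMode("50000baseCR2/Full", 50000, "full", 2),
]

if __name__ == "__main__":
    # Only the speed is given, so both 50G modes are selected.
    print(select_advertised(SUPPORTED, speed=50000))
```

Parameters left as `None` are simply not used for matching, mirroring the "whatever is specified" wording.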
LINKSTATE_GET
=============
-Requests link state information. At the moment, only link up/down flag (as
-provided by ``ETHTOOL_GLINK`` ioctl command) is provided but some future
-extensions are planned (e.g. link down reason). This request does not have any
-attributes.
+Requests link state information. Link up/down flag (as provided by
+``ETHTOOL_GLINK`` ioctl command) is provided. Optionally, extended state might
+be provided as well. In general, extended state describes reasons for why a port
+is down, or why it operates in some non-obvious mode. This request does not have
+any attributes.
Request contents:
@@ -418,19 +484,158 @@ Request contents:
Kernel response contents:
- ==================================== ====== ==========================
+ ==================================== ====== ============================
``ETHTOOL_A_LINKSTATE_HEADER`` nested reply header
``ETHTOOL_A_LINKSTATE_LINK`` bool link state (up/down)
- ==================================== ====== ==========================
+ ``ETHTOOL_A_LINKSTATE_SQI`` u32 Current Signal Quality Index
+ ``ETHTOOL_A_LINKSTATE_SQI_MAX`` u32 Max supported SQI value
+ ``ETHTOOL_A_LINKSTATE_EXT_STATE`` u8 link extended state
+ ``ETHTOOL_A_LINKSTATE_EXT_SUBSTATE`` u8 link extended substate
+ ==================================== ====== ============================
For most NIC drivers, the value of ``ETHTOOL_A_LINKSTATE_LINK`` returns
carrier flag provided by ``netif_carrier_ok()`` but there are drivers which
define their own handler.
+``ETHTOOL_A_LINKSTATE_EXT_STATE`` and ``ETHTOOL_A_LINKSTATE_EXT_SUBSTATE`` are
+optional values. ethtool core can provide either both
+``ETHTOOL_A_LINKSTATE_EXT_STATE`` and ``ETHTOOL_A_LINKSTATE_EXT_SUBSTATE``,
+or only ``ETHTOOL_A_LINKSTATE_EXT_STATE``, or none of them.
+
``LINKSTATE_GET`` allows dump requests (kernel returns reply messages for all
devices supporting the request).
+Link extended states:
+
+ ================================================ ============================================
+ ``ETHTOOL_LINK_EXT_STATE_AUTONEG`` States relating to the autonegotiation or
+ issues therein
+
+ ``ETHTOOL_LINK_EXT_STATE_LINK_TRAINING_FAILURE`` Failure during link training
+
+ ``ETHTOOL_LINK_EXT_STATE_LINK_LOGICAL_MISMATCH`` Logical mismatch in physical coding sublayer
+ or forward error correction sublayer
+
+ ``ETHTOOL_LINK_EXT_STATE_BAD_SIGNAL_INTEGRITY`` Signal integrity issues
+
+ ``ETHTOOL_LINK_EXT_STATE_NO_CABLE`` No cable connected
+
+ ``ETHTOOL_LINK_EXT_STATE_CABLE_ISSUE`` Failure is related to cable,
+ e.g., unsupported cable
+
+ ``ETHTOOL_LINK_EXT_STATE_EEPROM_ISSUE`` Failure is related to EEPROM, e.g., failure
+ during reading or parsing the data
+
+ ``ETHTOOL_LINK_EXT_STATE_CALIBRATION_FAILURE`` Failure during calibration algorithm
+
+ ``ETHTOOL_LINK_EXT_STATE_POWER_BUDGET_EXCEEDED`` The hardware is not able to provide the
+ power required from cable or module
+
+ ``ETHTOOL_LINK_EXT_STATE_OVERHEAT`` The module is overheated
+
+ ``ETHTOOL_LINK_EXT_STATE_MODULE`` Transceiver module issue
+ ================================================ ============================================
+
+Link extended substates:
+
+ Autoneg substates:
+
+ =============================================================== ================================
+ ``ETHTOOL_LINK_EXT_SUBSTATE_AN_NO_PARTNER_DETECTED`` Peer side is down
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_AN_ACK_NOT_RECEIVED`` Ack not received from peer side
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_AN_NEXT_PAGE_EXCHANGE_FAILED`` Next page exchange failed
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_AN_NO_PARTNER_DETECTED_FORCE_MODE`` Peer side is down during force
+ mode or there is no agreement of
+ speed
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_AN_FEC_MISMATCH_DURING_OVERRIDE`` Forward error correction modes
+ in both sides are mismatched
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_AN_NO_HCD`` No Highest Common Denominator
+ =============================================================== ================================
+
+ Link training substates:
+
+ =========================================================================== ====================
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LT_KR_FRAME_LOCK_NOT_ACQUIRED`` Frames were not
+ recognized, the
+ lock failed
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LT_KR_LINK_INHIBIT_TIMEOUT`` The lock did not
+ occur before
+ timeout
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LT_KR_LINK_PARTNER_DID_NOT_SET_RECEIVER_READY`` Peer side did not
+ send ready signal
+ after training
+ process
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LT_REMOTE_FAULT`` Remote side is not
+ ready yet
+ =========================================================================== ====================
+
+ Link logical mismatch substates:
+
+ ================================================================ ===============================
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LLM_PCS_DID_NOT_ACQUIRE_BLOCK_LOCK`` Physical coding sublayer was
+ not locked in first phase -
+ block lock
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LLM_PCS_DID_NOT_ACQUIRE_AM_LOCK`` Physical coding sublayer was
+ not locked in second phase -
+ alignment markers lock
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LLM_PCS_DID_NOT_GET_ALIGN_STATUS`` Physical coding sublayer did
+ not get align status
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LLM_FC_FEC_IS_NOT_LOCKED`` FC forward error correction is
+ not locked
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_LLM_RS_FEC_IS_NOT_LOCKED`` RS forward error correction is
+ not locked
+ ================================================================ ===============================
+
+ Bad signal integrity substates:
+
+ ================================================================= =============================
+ ``ETHTOOL_LINK_EXT_SUBSTATE_BSI_LARGE_NUMBER_OF_PHYSICAL_ERRORS`` Large number of physical
+ errors
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_BSI_UNSUPPORTED_RATE`` The system attempted to
+ operate the cable at a rate
+ that is not formally
+ supported, which led to
+ signal integrity issues
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_BSI_SERDES_REFERENCE_CLOCK_LOST`` The external clock signal for
+ SerDes is too weak or
+ unavailable.
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_BSI_SERDES_ALOS`` The received signal for
+ SerDes is too weak because of
+ analog loss of signal.
+ ================================================================= =============================
+
+ Cable issue substates:
+
+ =================================================== ============================================
+ ``ETHTOOL_LINK_EXT_SUBSTATE_CI_UNSUPPORTED_CABLE`` Unsupported cable
+
+ ``ETHTOOL_LINK_EXT_SUBSTATE_CI_CABLE_TEST_FAILURE`` Cable test failure
+ =================================================== ============================================
+
+ Transceiver module issue substates:
+
+ =================================================== ============================================
+ ``ETHTOOL_LINK_EXT_SUBSTATE_MODULE_CMIS_NOT_READY`` The CMIS Module State Machine did not reach
+ the ModuleReady state. For example, if the
+ module is stuck at ModuleFault state
+ =================================================== ============================================
+
DEBUG_GET
=========
@@ -521,12 +726,973 @@ Request contents:
``WAKE_MAGICSECURE`` mode.
+FEATURES_GET
+============
+
+Gets netdev features like ``ETHTOOL_GFEATURES`` ioctl request.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_FEATURES_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_FEATURES_HEADER`` nested reply header
+ ``ETHTOOL_A_FEATURES_HW`` bitset dev->hw_features
+ ``ETHTOOL_A_FEATURES_WANTED`` bitset dev->wanted_features
+ ``ETHTOOL_A_FEATURES_ACTIVE`` bitset dev->features
+ ``ETHTOOL_A_FEATURES_NOCHANGE`` bitset NETIF_F_NEVER_CHANGE
+ ==================================== ====== ==========================
+
+Bitmaps in kernel response have the same meaning as bitmaps used in ioctl
+interface but attribute names are different (they are based on
+corresponding members of struct net_device). Legacy "flags" are not provided;
+if userspace needs them (most likely only ethtool for backward compatibility),
+it can calculate their values from related feature bits itself.
+``ETHTOOL_A_FEATURES_HW`` uses a mask consisting of all features recognized by
+the kernel (to provide all names when using verbose bitmap format); the other
+three use no mask (simple bit lists).
+
+
+FEATURES_SET
+============
+
+Request to set netdev features like ``ETHTOOL_SFEATURES`` ioctl request.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_FEATURES_HEADER`` nested request header
+ ``ETHTOOL_A_FEATURES_WANTED`` bitset requested features
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_FEATURES_HEADER`` nested reply header
+ ``ETHTOOL_A_FEATURES_WANTED`` bitset diff wanted vs. result
+ ``ETHTOOL_A_FEATURES_ACTIVE`` bitset diff old vs. new active
+ ==================================== ====== ==========================
+
+Request contains only one bitset which can be either a value/mask pair (request
+to change specific feature bits and leave the rest) or only a value (request
+to set all features to the specified set).
+
+As request is subject to netdev_change_features() sanity checks, optional
+kernel reply (can be suppressed by ``ETHTOOL_FLAG_OMIT_REPLY`` flag in request
+header) informs client about the actual result. ``ETHTOOL_A_FEATURES_WANTED``
+reports the difference between client request and actual result: mask consists
+of bits which differ between requested features and result (dev->features
+after the operation), value consists of values of these bits in the request
+(i.e. negated values from resulting features). ``ETHTOOL_A_FEATURES_ACTIVE``
+reports the difference between old and new dev->features: mask consists of
+bits which have changed, values are their values in new dev->features (after
+the operation).
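+
+The reply semantics described above can be sketched with plain integers
+standing in for feature bitmaps. This is only an illustrative model, not the
+kernel implementation: ``features_set_reply()`` and its parameters are
+hypothetical names, and restricting the "wanted" difference to the bits named
+in the request mask is an assumption of the sketch.
+

```python
def features_set_reply(old_active, new_active, req_value, req_mask):
    """Model both reply bitsets as (value, mask) pairs of Python ints.

    Hypothetical helper: bit i of every int stands for feature bit i.
    """
    # Bits the client asked about whose result differs from the request.
    wanted_mask = (req_value ^ new_active) & req_mask
    # The value carries the requested state of exactly those bits.
    wanted_value = req_value & wanted_mask
    # Bits that actually changed in dev->features.
    active_mask = old_active ^ new_active
    # The value carries their state after the operation.
    active_value = new_active & active_mask
    return (wanted_value, wanted_mask), (active_value, active_mask)


# The client asks to enable bits 0 and 1; only bit 0 ends up enabled.
wanted, active = features_set_reply(
    old_active=0b0100, new_active=0b0101, req_value=0b0011, req_mask=0b0011
)
print(wanted)  # (2, 2): bit 1 was requested on but is off in the result
print(active)  # (1, 1): bit 0 changed and is now on
```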
+
+``ETHTOOL_MSG_FEATURES_NTF`` notification is sent not only if device features
+are modified using ``ETHTOOL_MSG_FEATURES_SET`` request or one of the ethtool
+ioctl requests but also each time features are modified with
+netdev_update_features() or netdev_change_features().
+
+
+PRIVFLAGS_GET
+=============
+
+Gets private flags like ``ETHTOOL_GPFLAGS`` ioctl request.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_PRIVFLAGS_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_PRIVFLAGS_HEADER`` nested reply header
+ ``ETHTOOL_A_PRIVFLAGS_FLAGS`` bitset private flags
+ ==================================== ====== ==========================
+
+``ETHTOOL_A_PRIVFLAGS_FLAGS`` is a bitset with values of device private flags.
+These flags are defined by driver, their number and names (and also meaning)
+are device dependent. For compact bitset format, names can be retrieved as
+``ETH_SS_PRIV_FLAGS`` string set. If verbose bitset format is requested,
+response uses all private flags supported by the device as mask so that client
+gets the full information without having to fetch the string set with names.
+
+
+PRIVFLAGS_SET
+=============
+
+Sets or modifies values of device private flags like ``ETHTOOL_SPFLAGS``
+ioctl request.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_PRIVFLAGS_HEADER`` nested request header
+ ``ETHTOOL_A_PRIVFLAGS_FLAGS`` bitset private flags
+ ==================================== ====== ==========================
+
+``ETHTOOL_A_PRIVFLAGS_FLAGS`` can either set the whole set of private flags or
+modify only values of some of them.
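+
+The two ways of using the bitset can be illustrated with a small model in
+which integers stand in for the flag word. ``apply_bitset()`` is a
+hypothetical helper written for this sketch; it is not part of any ethtool
+API.
+

```python
def apply_bitset(old_flags, value, mask=None):
    """Return the flag word that results from a bitset request.

    Without a mask the value replaces the whole set of flags; with a
    mask only the masked bits are taken from the value and the rest
    keep their old state.
    """
    if mask is None:
        return value
    return (old_flags & ~mask) | (value & mask)


# Replace everything: only bit 0 stays set.
print(bin(apply_bitset(0b1010, 0b0001)))          # 0b1
# Modify bits 0 and 1 only: bit 3 keeps its old value.
print(bin(apply_bitset(0b1010, 0b0001, 0b0011)))  # 0b1001
```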
+
+
+RINGS_GET
+=========
+
+Gets ring sizes like ``ETHTOOL_GRINGPARAM`` ioctl request.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_RINGS_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ==================================== ====== ===========================
+ ``ETHTOOL_A_RINGS_HEADER`` nested reply header
+ ``ETHTOOL_A_RINGS_RX_MAX`` u32 max size of RX ring
+ ``ETHTOOL_A_RINGS_RX_MINI_MAX`` u32 max size of RX mini ring
+ ``ETHTOOL_A_RINGS_RX_JUMBO_MAX`` u32 max size of RX jumbo ring
+ ``ETHTOOL_A_RINGS_TX_MAX`` u32 max size of TX ring
+ ``ETHTOOL_A_RINGS_RX`` u32 size of RX ring
+ ``ETHTOOL_A_RINGS_RX_MINI`` u32 size of RX mini ring
+ ``ETHTOOL_A_RINGS_RX_JUMBO`` u32 size of RX jumbo ring
+ ``ETHTOOL_A_RINGS_TX`` u32 size of TX ring
+ ``ETHTOOL_A_RINGS_RX_BUF_LEN`` u32 size of buffers on the ring
+ ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` u8 TCP header / data split
+ ``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE
+ ``ETHTOOL_A_RINGS_TX_PUSH`` u8 flag of TX Push mode
+ ==================================== ====== ===========================
+
+``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` indicates whether the device is usable with
+page-flipping TCP zero-copy receive (``getsockopt(TCP_ZEROCOPY_RECEIVE)``).
+If enabled the device is configured to place frame headers and data into
+separate buffers. The device configuration must make it possible to receive
+full memory pages of data, for example because MTU is high enough or through
+HW-GRO.
+
+``ETHTOOL_A_RINGS_TX_PUSH`` flag is used to enable descriptor fast
+path to send packets. In ordinary path, driver fills descriptors in DRAM and
+notifies NIC hardware. In fast path, driver pushes descriptors to the device
+through MMIO writes, thus reducing the latency. However, enabling this feature
+may increase the CPU cost. Drivers may enforce additional per-packet
+eligibility checks (e.g. on packet size).
+
+RINGS_SET
+=========
+
+Sets ring sizes like ``ETHTOOL_SRINGPARAM`` ioctl request.
+
+Request contents:
+
+ ==================================== ====== ===========================
+ ``ETHTOOL_A_RINGS_HEADER``           nested request header
+ ``ETHTOOL_A_RINGS_RX``               u32    size of RX ring
+ ``ETHTOOL_A_RINGS_RX_MINI``          u32    size of RX mini ring
+ ``ETHTOOL_A_RINGS_RX_JUMBO``         u32    size of RX jumbo ring
+ ``ETHTOOL_A_RINGS_TX``               u32    size of TX ring
+ ``ETHTOOL_A_RINGS_RX_BUF_LEN``       u32    size of buffers on the ring
+ ``ETHTOOL_A_RINGS_CQE_SIZE``         u32    Size of TX/RX CQE
+ ``ETHTOOL_A_RINGS_TX_PUSH``          u8     flag of TX Push mode
+ ==================================== ====== ===========================
+
+Kernel checks that requested ring sizes do not exceed limits reported by
+driver. Driver may impose additional constraints and may not support all
+attributes.
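+
+A userspace tool may want to perform the same limit check before sending the
+request. The sketch below assumes the values from a previous ``RINGS_GET``
+reply have been stored in a dictionary; the dictionary keys and the
+``oversized_rings()`` helper are hypothetical names chosen for illustration.
+

```python
# Hypothetical mapping: requested size -> limit reported by RINGS_GET.
LIMITS = {
    "rx": "rx_max",
    "rx_mini": "rx_mini_max",
    "rx_jumbo": "rx_jumbo_max",
    "tx": "tx_max",
}


def oversized_rings(request, current):
    """Return the requested ring names whose size exceeds the reported maximum."""
    return [
        name
        for name, limit in LIMITS.items()
        if name in request and request[name] > current[limit]
    ]


current = {"rx_max": 4096, "rx_mini_max": 0, "rx_jumbo_max": 0, "tx_max": 4096}
print(oversized_rings({"rx": 8192, "tx": 512}, current))  # ['rx']
```

+
+Passing this check does not guarantee success, since the driver may still
+reject the request because of its own constraints.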
+
+
+``ETHTOOL_A_RINGS_CQE_SIZE`` specifies the completion queue event size.
+Completion queue events (CQE) are the events posted by the NIC to indicate
+the completion status of a packet when the packet is sent (like send success
+or error) or received (like pointers to packet fragments). The CQE size
+parameter allows the CQE size to be changed from the default if the NIC
+supports it. A bigger CQE can hold more receive buffer pointers, which in
+turn lets the NIC transfer a bigger frame from the wire. Depending on the NIC
+hardware, the driver may adjust the overall completion queue size when the
+CQE size is modified.
+
+CHANNELS_GET
+============
+
+Gets channel counts like ``ETHTOOL_GCHANNELS`` ioctl request.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_CHANNELS_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_CHANNELS_HEADER`` nested reply header
+ ``ETHTOOL_A_CHANNELS_RX_MAX`` u32 max receive channels
+ ``ETHTOOL_A_CHANNELS_TX_MAX`` u32 max transmit channels
+ ``ETHTOOL_A_CHANNELS_OTHER_MAX`` u32 max other channels
+ ``ETHTOOL_A_CHANNELS_COMBINED_MAX`` u32 max combined channels
+ ``ETHTOOL_A_CHANNELS_RX_COUNT`` u32 receive channel count
+ ``ETHTOOL_A_CHANNELS_TX_COUNT`` u32 transmit channel count
+ ``ETHTOOL_A_CHANNELS_OTHER_COUNT`` u32 other channel count
+ ``ETHTOOL_A_CHANNELS_COMBINED_COUNT`` u32 combined channel count
+ ===================================== ====== ==========================
+
+
+CHANNELS_SET
+============
+
+Sets channel counts like ``ETHTOOL_SCHANNELS`` ioctl request.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_CHANNELS_HEADER`` nested request header
+ ``ETHTOOL_A_CHANNELS_RX_COUNT`` u32 receive channel count
+ ``ETHTOOL_A_CHANNELS_TX_COUNT`` u32 transmit channel count
+ ``ETHTOOL_A_CHANNELS_OTHER_COUNT`` u32 other channel count
+ ``ETHTOOL_A_CHANNELS_COMBINED_COUNT`` u32 combined channel count
+ ===================================== ====== ==========================
+
+Kernel checks that requested channel counts do not exceed limits reported by
+driver. Driver may impose additional constraints and may not support all
+attributes.
+
+
+COALESCE_GET
+============
+
+Gets coalescing parameters like ``ETHTOOL_GCOALESCE`` ioctl request.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_COALESCE_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ =========================================== ====== =======================
+ ``ETHTOOL_A_COALESCE_HEADER`` nested reply header
+ ``ETHTOOL_A_COALESCE_RX_USECS`` u32 delay (us), normal Rx
+ ``ETHTOOL_A_COALESCE_RX_MAX_FRAMES`` u32 max packets, normal Rx
+ ``ETHTOOL_A_COALESCE_RX_USECS_IRQ`` u32 delay (us), Rx in IRQ
+ ``ETHTOOL_A_COALESCE_RX_MAX_FRAMES_IRQ`` u32 max packets, Rx in IRQ
+ ``ETHTOOL_A_COALESCE_TX_USECS`` u32 delay (us), normal Tx
+ ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES`` u32 max packets, normal Tx
+ ``ETHTOOL_A_COALESCE_TX_USECS_IRQ`` u32 delay (us), Tx in IRQ
+ ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES_IRQ`` u32 max packets, Tx in IRQ
+ ``ETHTOOL_A_COALESCE_STATS_BLOCK_USECS`` u32 delay of stats update
+ ``ETHTOOL_A_COALESCE_USE_ADAPTIVE_RX`` bool adaptive Rx coalesce
+ ``ETHTOOL_A_COALESCE_USE_ADAPTIVE_TX`` bool adaptive Tx coalesce
+ ``ETHTOOL_A_COALESCE_PKT_RATE_LOW`` u32 threshold for low rate
+ ``ETHTOOL_A_COALESCE_RX_USECS_LOW`` u32 delay (us), low Rx
+ ``ETHTOOL_A_COALESCE_RX_MAX_FRAMES_LOW`` u32 max packets, low Rx
+ ``ETHTOOL_A_COALESCE_TX_USECS_LOW`` u32 delay (us), low Tx
+ ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES_LOW`` u32 max packets, low Tx
+ ``ETHTOOL_A_COALESCE_PKT_RATE_HIGH`` u32 threshold for high rate
+ ``ETHTOOL_A_COALESCE_RX_USECS_HIGH`` u32 delay (us), high Rx
+ ``ETHTOOL_A_COALESCE_RX_MAX_FRAMES_HIGH`` u32 max packets, high Rx
+ ``ETHTOOL_A_COALESCE_TX_USECS_HIGH`` u32 delay (us), high Tx
+ ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES_HIGH`` u32 max packets, high Tx
+ ``ETHTOOL_A_COALESCE_RATE_SAMPLE_INTERVAL`` u32 rate sampling interval
+ ``ETHTOOL_A_COALESCE_USE_CQE_TX`` bool timer reset mode, Tx
+ ``ETHTOOL_A_COALESCE_USE_CQE_RX`` bool timer reset mode, Rx
+ =========================================== ====== =======================
+
+Attributes are only included in reply if their value is not zero or the
+corresponding bit in ``ethtool_ops::supported_coalesce_params`` is set (i.e.
+they are declared as supported by driver).
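+
+The inclusion rule can be modelled as a simple filter. In this sketch the
+attribute names are shortened, the set of supported parameters is a plain
+Python set rather than the ``supported_coalesce_params`` bit mask, and
+``reply_attributes()`` is a hypothetical helper.
+

```python
def reply_attributes(values, supported):
    """Keep an attribute if its value is non-zero or the driver supports it.

    values:    dict mapping attribute name to its current value
    supported: set of attribute names declared as supported by the driver
    """
    return {
        name: val
        for name, val in values.items()
        if val != 0 or name in supported
    }


values = {"rx_usecs": 0, "tx_usecs": 0, "rx_max_frames": 32}
print(reply_attributes(values, supported={"rx_usecs"}))
# {'rx_usecs': 0, 'rx_max_frames': 32}
```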
+
+Timer reset mode (``ETHTOOL_A_COALESCE_USE_CQE_TX`` and
+``ETHTOOL_A_COALESCE_USE_CQE_RX``) controls the interaction between packet
+arrival and the various time based delay parameters. By default timers are
+expected to limit the max delay between any packet arrival/departure and a
+corresponding interrupt. In this mode timer should be started by packet
+arrival (sometimes delivery of previous interrupt) and reset when interrupt
+is delivered.
+Setting the appropriate attribute to 1 will enable ``CQE`` mode, where
+each packet event resets the timer. In this mode timer is used to force
+the interrupt if queue goes idle, while busy queues depend on the packet
+limit to trigger interrupts.
+
+COALESCE_SET
+============
+
+Sets coalescing parameters like ``ETHTOOL_SCOALESCE`` ioctl request.
+
+Request contents:
+
+ =========================================== ====== =======================
+ ``ETHTOOL_A_COALESCE_HEADER`` nested request header
+ ``ETHTOOL_A_COALESCE_RX_USECS`` u32 delay (us), normal Rx
+ ``ETHTOOL_A_COALESCE_RX_MAX_FRAMES`` u32 max packets, normal Rx
+ ``ETHTOOL_A_COALESCE_RX_USECS_IRQ`` u32 delay (us), Rx in IRQ
+ ``ETHTOOL_A_COALESCE_RX_MAX_FRAMES_IRQ`` u32 max packets, Rx in IRQ
+ ``ETHTOOL_A_COALESCE_TX_USECS`` u32 delay (us), normal Tx
+ ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES`` u32 max packets, normal Tx
+ ``ETHTOOL_A_COALESCE_TX_USECS_IRQ`` u32 delay (us), Tx in IRQ
+ ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES_IRQ`` u32 max packets, Tx in IRQ
+ ``ETHTOOL_A_COALESCE_STATS_BLOCK_USECS`` u32 delay of stats update
+ ``ETHTOOL_A_COALESCE_USE_ADAPTIVE_RX`` bool adaptive Rx coalesce
+ ``ETHTOOL_A_COALESCE_USE_ADAPTIVE_TX`` bool adaptive Tx coalesce
+ ``ETHTOOL_A_COALESCE_PKT_RATE_LOW`` u32 threshold for low rate
+ ``ETHTOOL_A_COALESCE_RX_USECS_LOW`` u32 delay (us), low Rx
+ ``ETHTOOL_A_COALESCE_RX_MAX_FRAMES_LOW`` u32 max packets, low Rx
+ ``ETHTOOL_A_COALESCE_TX_USECS_LOW`` u32 delay (us), low Tx
+ ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES_LOW`` u32 max packets, low Tx
+ ``ETHTOOL_A_COALESCE_PKT_RATE_HIGH`` u32 threshold for high rate
+ ``ETHTOOL_A_COALESCE_RX_USECS_HIGH`` u32 delay (us), high Rx
+ ``ETHTOOL_A_COALESCE_RX_MAX_FRAMES_HIGH`` u32 max packets, high Rx
+ ``ETHTOOL_A_COALESCE_TX_USECS_HIGH`` u32 delay (us), high Tx
+ ``ETHTOOL_A_COALESCE_TX_MAX_FRAMES_HIGH`` u32 max packets, high Tx
+ ``ETHTOOL_A_COALESCE_RATE_SAMPLE_INTERVAL`` u32 rate sampling interval
+ ``ETHTOOL_A_COALESCE_USE_CQE_TX`` bool timer reset mode, Tx
+ ``ETHTOOL_A_COALESCE_USE_CQE_RX`` bool timer reset mode, Rx
+ =========================================== ====== =======================
+
+Request is rejected if it contains attributes declared as unsupported by driver (i.e.
+such that the corresponding bit in ``ethtool_ops::supported_coalesce_params``
+is not set), regardless of their values. Driver may impose additional
+constraints on coalescing parameters and their values.
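+
+A client can predict such a rejection by comparing the attributes it is about
+to send with the set the driver supports. The sketch below uses shortened,
+hypothetical attribute names and a hypothetical ``unsupported_attributes()``
+helper; note that a zero value does not make an unsupported attribute
+acceptable.
+

```python
def unsupported_attributes(request, supported):
    """Return the names of request attributes the driver does not support.

    The check looks only at which attributes are present, never at
    their values.
    """
    return sorted(name for name in request if name not in supported)


supported = {"rx_usecs", "rx_max_frames"}
print(unsupported_attributes({"rx_usecs": 50, "tx_usecs": 0}, supported))
# ['tx_usecs']
```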
+
+
+PAUSE_GET
+=========
+
+Gets pause frame settings like ``ETHTOOL_GPAUSEPARAM`` ioctl request.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_PAUSE_HEADER`` nested request header
+ ===================================== ====== ==========================
+
+Kernel response contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_PAUSE_HEADER``            nested reply header
+ ``ETHTOOL_A_PAUSE_AUTONEG``           bool   pause autonegotiation
+ ``ETHTOOL_A_PAUSE_RX``                bool   receive pause frames
+ ``ETHTOOL_A_PAUSE_TX``                bool   transmit pause frames
+ ``ETHTOOL_A_PAUSE_STATS``             nested pause statistics
+ ===================================== ====== ==========================
+
+``ETHTOOL_A_PAUSE_STATS`` are reported if ``ETHTOOL_FLAG_STATS`` was set
+in ``ETHTOOL_A_HEADER_FLAGS``.
+It will be empty if driver did not report any statistics. Drivers fill in
+the statistics in the following structure:
+
+.. kernel-doc:: include/linux/ethtool.h
+ :identifiers: ethtool_pause_stats
+
+Each member has a corresponding attribute defined.
+
+PAUSE_SET
+=========
+
+Sets pause parameters like ``ETHTOOL_SPAUSEPARAM`` ioctl request.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_PAUSE_HEADER`` nested request header
+ ``ETHTOOL_A_PAUSE_AUTONEG`` bool pause autonegotiation
+ ``ETHTOOL_A_PAUSE_RX`` bool receive pause frames
+ ``ETHTOOL_A_PAUSE_TX`` bool transmit pause frames
+ ===================================== ====== ==========================
+
+
+EEE_GET
+=======
+
+Gets Energy Efficient Ethernet settings like ``ETHTOOL_GEEE`` ioctl request.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_EEE_HEADER`` nested request header
+ ===================================== ====== ==========================
+
+Kernel response contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_EEE_HEADER``              nested reply header
+ ``ETHTOOL_A_EEE_MODES_OURS``          bitset supported/advertised modes
+ ``ETHTOOL_A_EEE_MODES_PEER``          bitset peer advertised link modes
+ ``ETHTOOL_A_EEE_ACTIVE``              bool   EEE is actively used
+ ``ETHTOOL_A_EEE_ENABLED``             bool   EEE is enabled
+ ``ETHTOOL_A_EEE_TX_LPI_ENABLED``      bool   Tx lpi enabled
+ ``ETHTOOL_A_EEE_TX_LPI_TIMER``        u32    Tx lpi timeout (in us)
+ ===================================== ====== ==========================
+
+In ``ETHTOOL_A_EEE_MODES_OURS``, mask consists of link modes for which EEE is
+enabled, value of link modes for which EEE is advertised. Link modes for which
+peer advertises EEE are listed in ``ETHTOOL_A_EEE_MODES_PEER`` (no mask). The
+netlink interface allows reporting EEE status for all link modes but only
+first 32 are provided by the ``ethtool_ops`` callback.
+
+
+EEE_SET
+=======
+
+Sets Energy Efficient Ethernet parameters like ``ETHTOOL_SEEE`` ioctl request.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_EEE_HEADER``              nested request header
+ ``ETHTOOL_A_EEE_MODES_OURS``          bitset advertised modes
+ ``ETHTOOL_A_EEE_ENABLED``             bool   EEE is enabled
+ ``ETHTOOL_A_EEE_TX_LPI_ENABLED``      bool   Tx lpi enabled
+ ``ETHTOOL_A_EEE_TX_LPI_TIMER``        u32    Tx lpi timeout (in us)
+ ===================================== ====== ==========================
+
+``ETHTOOL_A_EEE_MODES_OURS`` is used to either list link modes to advertise
+EEE for (if there is no mask) or specify changes to the list (if there is
+a mask). The netlink interface allows reporting EEE status for all link modes
+but only first 32 can be set at the moment as that is what the ``ethtool_ops``
+callback supports.
+
+
+TSINFO_GET
+==========
+
+Gets timestamping information like ``ETHTOOL_GET_TS_INFO`` ioctl request.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_TSINFO_HEADER`` nested request header
+ ===================================== ====== ==========================
+
+Kernel response contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_TSINFO_HEADER``           nested reply header
+ ``ETHTOOL_A_TSINFO_TIMESTAMPING``     bitset SO_TIMESTAMPING flags
+ ``ETHTOOL_A_TSINFO_TX_TYPES``         bitset supported Tx types
+ ``ETHTOOL_A_TSINFO_RX_FILTERS``       bitset supported Rx filters
+ ``ETHTOOL_A_TSINFO_PHC_INDEX``        u32    PTP hw clock index
+ ===================================== ====== ==========================
+
+``ETHTOOL_A_TSINFO_PHC_INDEX`` is absent if there is no associated PHC (there
+is no special value for this case). The bitset attributes are omitted if they
+would be empty (no bit set).
+
+CABLE_TEST
+==========
+
+Start a cable test.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_CABLE_TEST_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Notification contents:
+
+An Ethernet cable typically contains 1, 2 or 4 pairs. The length of
+the pair can only be measured when there is a fault in the pair and
+hence a reflection. Information about the fault may not be available,
+depending on the specific hardware. Hence the contents of the notify
+message are mostly optional. The attributes can be repeated an
+arbitrary number of times, in an arbitrary order, for an arbitrary
+number of pairs.
+
+The example shows the notification sent when the test is completed for
+a T2 cable, i.e. two pairs. One pair is OK and hence has no length
+information. The second pair has a fault and does have length
+information.
+
+ +---------------------------------------------+--------+---------------------+
+ | ``ETHTOOL_A_CABLE_TEST_HEADER`` | nested | reply header |
+ +---------------------------------------------+--------+---------------------+
+ | ``ETHTOOL_A_CABLE_TEST_STATUS`` | u8 | completed |
+ +---------------------------------------------+--------+---------------------+
+ | ``ETHTOOL_A_CABLE_TEST_NTF_NEST`` | nested | all the results |
+ +-+-------------------------------------------+--------+---------------------+
+ | | ``ETHTOOL_A_CABLE_NEST_RESULT`` | nested | cable test result |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_CABLE_RESULTS_PAIR`` | u8 | pair number |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_CABLE_RESULTS_CODE`` | u8 | result code |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | ``ETHTOOL_A_CABLE_NEST_RESULT`` | nested | cable test results |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_CABLE_RESULTS_PAIR`` | u8 | pair number |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_CABLE_RESULTS_CODE`` | u8 | result code |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | ``ETHTOOL_A_CABLE_NEST_FAULT_LENGTH`` | nested | cable length |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_CABLE_FAULT_LENGTH_PAIR`` | u8 | pair number |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_CABLE_FAULT_LENGTH_CM`` | u32 | length in cm |
+ +-+-+-----------------------------------------+--------+---------------------+
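+
+Because result and fault-length nests may arrive in any order and any number,
+a receiver typically has to merge them per pair. The sketch below models the
+already-parsed contents of ``ETHTOOL_A_CABLE_TEST_NTF_NEST`` as a list of
+``(nest_type, attributes)`` tuples; this representation, the string result
+codes and the ``collect_pairs()`` helper are assumptions made for
+illustration, not an existing parser API.
+

```python
def collect_pairs(ntf_nest):
    """Merge repeated result and fault-length nests into one record per pair."""
    pairs = {}
    for nest_type, attrs in ntf_nest:
        record = pairs.setdefault(attrs["pair"], {})
        if nest_type == "result":
            record["code"] = attrs["code"]
        elif nest_type == "fault_length":
            record["length_cm"] = attrs["cm"]
    return pairs


# Two pairs: pair 0 is fine, pair 1 has a fault 12 m down the cable.
example = [
    ("result", {"pair": 0, "code": "ok"}),
    ("result", {"pair": 1, "code": "open"}),
    ("fault_length", {"pair": 1, "cm": 1200}),
]
print(collect_pairs(example))
# {0: {'code': 'ok'}, 1: {'code': 'open', 'length_cm': 1200}}
```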
+
+CABLE_TEST TDR
+==============
+
+Start a cable test and report raw TDR data.
+
+Request contents:
+
+ +--------------------------------------------+--------+-----------------------+
+ | ``ETHTOOL_A_CABLE_TEST_TDR_HEADER``        | nested | request header        |
+ +--------------------------------------------+--------+-----------------------+
+ | ``ETHTOOL_A_CABLE_TEST_TDR_CFG``           | nested | test configuration    |
+ +-+------------------------------------------+--------+-----------------------+
+ | | ``ETHTOOL_A_CABLE_STEP_FIRST_DISTANCE``  | u32    | first data distance   |
+ +-+------------------------------------------+--------+-----------------------+
+ | | ``ETHTOOL_A_CABLE_STEP_LAST_DISTANCE``   | u32    | last data distance    |
+ +-+------------------------------------------+--------+-----------------------+
+ | | ``ETHTOOL_A_CABLE_STEP_STEP_DISTANCE``   | u32    | distance of each step |
+ +-+------------------------------------------+--------+-----------------------+
+ | | ``ETHTOOL_A_CABLE_TEST_TDR_CFG_PAIR``    | u8     | pair to test          |
+ +-+------------------------------------------+--------+-----------------------+
+
+``ETHTOOL_A_CABLE_TEST_TDR_CFG`` is optional, as are all members
+of the nest. All distances are expressed in centimeters. The PHY takes
+the distances as a guide, and rounds to the nearest distance it
+actually supports. If a pair is passed, only that one pair will be
+tested. Otherwise all pairs are tested.
+
+Notification contents:
+
+Raw TDR data is gathered by sending a pulse down the cable and
+recording the amplitude of the reflected pulse for a given distance.
+
+It can take a number of seconds to collect TDR data, especially if the
+full 100 meters is probed at 1 meter intervals. When the test is
+started a notification will be sent containing just
+``ETHTOOL_A_CABLE_TEST_TDR_STATUS`` with the value
+``ETHTOOL_A_CABLE_TEST_NTF_STATUS_STARTED``.
+
+When the test has completed a second notification will be sent
+containing ``ETHTOOL_A_CABLE_TEST_TDR_STATUS`` with the value
+``ETHTOOL_A_CABLE_TEST_NTF_STATUS_COMPLETED`` and the TDR data.
+
+The message may optionally contain the amplitude of the pulse sent
+down the cable. This is measured in mV. A reflection should not be
+bigger than the transmitted pulse.
+
+The raw TDR data should be preceded by an
+``ETHTOOL_A_CABLE_TDR_NEST_STEP`` nest containing information about the
+distance along the cable for the first reading, the last reading, and
+the step between each reading. Distances are measured in centimeters.
+These should be the exact values the PHY used. They may differ from what
+the user requested, if the native measurement resolution is greater
+than 1 cm.
+
+For each step along the cable, an ``ETHTOOL_A_CABLE_TDR_NEST_AMPLITUDE``
+is used to report the amplitude of the reflection for a given pair.
+
+ +---------------------------------------------+--------+----------------------+
+ | ``ETHTOOL_A_CABLE_TEST_TDR_HEADER`` | nested | reply header |
+ +---------------------------------------------+--------+----------------------+
+ | ``ETHTOOL_A_CABLE_TEST_TDR_STATUS`` | u8 | completed |
+ +---------------------------------------------+--------+----------------------+
+ | ``ETHTOOL_A_CABLE_TEST_TDR_NTF_NEST`` | nested | all the results |
+ +-+-------------------------------------------+--------+----------------------+
+ | | ``ETHTOOL_A_CABLE_TDR_NEST_PULSE`` | nested | TX Pulse amplitude |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | | ``ETHTOOL_A_CABLE_PULSE_mV`` | s16 | Pulse amplitude |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | ``ETHTOOL_A_CABLE_TDR_NEST_STEP`` | nested | TDR step info |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | | ``ETHTOOL_A_CABLE_STEP_FIRST_DISTANCE`` | u32 | First data distance |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | | ``ETHTOOL_A_CABLE_STEP_LAST_DISTANCE`` | u32 | Last data distance |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | | ``ETHTOOL_A_CABLE_STEP_STEP_DISTANCE`` | u32 | distance of each step|
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | ``ETHTOOL_A_CABLE_TDR_NEST_AMPLITUDE`` | nested | Reflection amplitude |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | | ``ETHTOOL_A_CABLE_RESULTS_PAIR`` | u8 | pair number |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | | ``ETHTOOL_A_CABLE_AMPLITUDE_mV`` | s16 | Reflection amplitude |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | ``ETHTOOL_A_CABLE_TDR_NEST_AMPLITUDE`` | nested | Reflection amplitude |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | | ``ETHTOOL_A_CABLE_RESULTS_PAIR`` | u8 | pair number |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | | ``ETHTOOL_A_CABLE_AMPLITUDE_mV`` | s16 | Reflection amplitude |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | ``ETHTOOL_A_CABLE_TDR_NEST_AMPLITUDE`` | nested | Reflection amplitude |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | | ``ETHTOOL_A_CABLE_RESULTS_PAIR`` | u8 | pair number |
+ +-+-+-----------------------------------------+--------+----------------------+
+ | | | ``ETHTOOL_A_CABLE_AMPLITUDE_mV`` | s16 | Reflection amplitude |
+ +-+-+-----------------------------------------+--------+----------------------+
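+
+A receiver can combine the step information with the amplitude nests to
+rebuild a distance/amplitude trace for each pair. The sketch below is only a
+model: it assumes the message has already been parsed, and it assumes that
+the amplitude readings of each pair appear in order of increasing distance,
+which the text above does not state explicitly. ``tdr_trace()`` is a
+hypothetical helper.
+

```python
def tdr_trace(first_cm, last_cm, step_cm, amplitudes):
    """Attach a distance to every amplitude reading, grouped by pair.

    amplitudes: list of (pair, mV) tuples in message order.
    Assumption: readings of one pair are ordered from the first to the
    last distance reported in the step nest.
    """
    distances = list(range(first_cm, last_cm + 1, step_cm))
    traces = {}
    for pair, millivolts in amplitudes:
        readings = traces.setdefault(pair, [])
        if len(readings) >= len(distances):
            raise ValueError(f"too many readings for pair {pair}")
        readings.append((distances[len(readings)], millivolts))
    return traces


# One pair probed at 0 cm, 100 cm and 200 cm.
print(tdr_trace(0, 200, 100, [(0, 5), (0, -3), (0, 40)]))
# {0: [(0, 5), (100, -3), (200, 40)]}
```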
+
+TUNNEL_INFO
+===========
+
+Gets information about the tunnel state the NIC is aware of.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_TUNNEL_INFO_HEADER`` nested request header
+ ===================================== ====== ==========================
+
+Kernel response contents:
+
+ +---------------------------------------------+--------+---------------------+
+ | ``ETHTOOL_A_TUNNEL_INFO_HEADER`` | nested | reply header |
+ +---------------------------------------------+--------+---------------------+
+ | ``ETHTOOL_A_TUNNEL_INFO_UDP_PORTS`` | nested | all UDP port tables |
+ +-+-------------------------------------------+--------+---------------------+
+ | | ``ETHTOOL_A_TUNNEL_UDP_TABLE`` | nested | one UDP port table |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_TUNNEL_UDP_TABLE_SIZE`` | u32 | max size of the |
+ | | | | | table |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_TUNNEL_UDP_TABLE_TYPES`` | bitset | tunnel types which |
+ | | | | | table can hold |
+ +-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_TUNNEL_UDP_TABLE_ENTRY`` | nested | offloaded UDP port |
+ +-+-+-+---------------------------------------+--------+---------------------+
+ | | | | ``ETHTOOL_A_TUNNEL_UDP_ENTRY_PORT`` | be16 | UDP port |
+ +-+-+-+---------------------------------------+--------+---------------------+
+ | | | | ``ETHTOOL_A_TUNNEL_UDP_ENTRY_TYPE`` | u32 | tunnel type |
+ +-+-+-+---------------------------------------+--------+---------------------+
+
+For a UDP tunnel table, an empty ``ETHTOOL_A_TUNNEL_UDP_TABLE_TYPES``
+indicates that the table contains static entries, hard-coded by the NIC.
+
+FEC_GET
+=======
+
+Gets FEC configuration and state like ``ETHTOOL_GFECPARAM`` ioctl request.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_FEC_HEADER`` nested request header
+ ===================================== ====== ==========================
+
+Kernel response contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_FEC_HEADER``              nested reply header
+ ``ETHTOOL_A_FEC_MODES``               bitset configured modes
+ ``ETHTOOL_A_FEC_AUTO``                bool   FEC mode auto selection
+ ``ETHTOOL_A_FEC_ACTIVE``              u32    index of active FEC mode
+ ``ETHTOOL_A_FEC_STATS``               nested FEC statistics
+ ===================================== ====== ==========================
+
+``ETHTOOL_A_FEC_ACTIVE`` is the bit index of the FEC link mode currently
+active on the interface. This attribute may not be present if device does
+not support FEC.
+
+``ETHTOOL_A_FEC_MODES`` and ``ETHTOOL_A_FEC_AUTO`` are only meaningful when
+autonegotiation is disabled. If ``ETHTOOL_A_FEC_AUTO`` is non-zero driver will
+select the FEC mode automatically based on the parameters of the SFP module.
+This is equivalent to the ``ETHTOOL_FEC_AUTO`` bit of the ioctl interface.
+``ETHTOOL_A_FEC_MODES`` carries the current FEC configuration using link mode
+bits (rather than old ``ETHTOOL_FEC_*`` bits).
+
+``ETHTOOL_A_FEC_STATS`` are reported if ``ETHTOOL_FLAG_STATS`` was set in
+``ETHTOOL_A_HEADER_FLAGS``.
+Each attribute carries an array of 64bit statistics. First entry in the array
+contains the total number of events on the port, while the following entries
+are counters corresponding to lanes/PCS instances. The number of entries in
+the array will be:
+
++--------------+---------------------------------------------+
+| `0` | device does not support FEC statistics |
++--------------+---------------------------------------------+
+| `1` | device does not support per-lane break down |
++--------------+---------------------------------------------+
+| `1 + #lanes` | device has full support for FEC stats |
++--------------+---------------------------------------------+
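+
+The three array shapes listed above can be told apart by length alone. The
+sketch below assumes the array has already been decoded into a Python list
+of integers; ``split_fec_stat()`` is a hypothetical helper name.
+

```python
def split_fec_stat(entries):
    """Split one FEC statistic array into (port_total, per_lane_counters).

    An empty array means the device does not support FEC statistics, so
    the total is reported as None. A single entry means there is no
    per-lane breakdown.
    """
    if not entries:
        return None, []
    return entries[0], list(entries[1:])


print(split_fec_stat([]))          # (None, [])
print(split_fec_stat([10]))        # (10, [])
print(split_fec_stat([10, 4, 6]))  # (10, [4, 6])
```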
+
+Drivers fill in the statistics in the following structure:
+
+.. kernel-doc:: include/linux/ethtool.h
+ :identifiers: ethtool_fec_stats
+
+FEC_SET
+=======
+
+Sets FEC parameters like ``ETHTOOL_SFECPARAM`` ioctl request.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_FEC_HEADER`` nested request header
+ ``ETHTOOL_A_FEC_MODES`` bitset configured modes
+ ``ETHTOOL_A_FEC_AUTO`` bool FEC mode auto selection
+ ===================================== ====== ==========================
+
+``FEC_SET`` is only meaningful when autonegotiation is disabled. Otherwise
+FEC mode is selected as part of autonegotiation.
+
+``ETHTOOL_A_FEC_MODES`` selects which FEC mode should be used. It's recommended
+to set only one bit; if multiple bits are set, the driver may choose between
+them in an implementation specific way.
+
+``ETHTOOL_A_FEC_AUTO`` requests the driver to choose FEC mode based on SFP
+module parameters. This does not mean autonegotiation.
+
+MODULE_EEPROM_GET
+=================
+
+Fetch module EEPROM data dump.
+This interface is designed to allow dumps of at most 1/2 page at once. This
+means only dumps of 128 bytes or fewer are allowed, without crossing the half
+page boundary located at offset 128. For pages other than 0 only the high 128
+bytes are accessible.
+
+Request contents:
+
+ ======================================= ====== ==========================
+ ``ETHTOOL_A_MODULE_EEPROM_HEADER`` nested request header
+ ``ETHTOOL_A_MODULE_EEPROM_OFFSET`` u32 offset within a page
+ ``ETHTOOL_A_MODULE_EEPROM_LENGTH`` u32 amount of bytes to read
+ ``ETHTOOL_A_MODULE_EEPROM_PAGE`` u8 page number
+ ``ETHTOOL_A_MODULE_EEPROM_BANK`` u8 bank number
+ ``ETHTOOL_A_MODULE_EEPROM_I2C_ADDRESS`` u8 page I2C address
+ ======================================= ====== ==========================
+
+If ``ETHTOOL_A_MODULE_EEPROM_BANK`` is not specified, bank 0 is assumed.
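+
+The half-page restrictions on offset and length can be checked before a
+request is sent. The following is a sketch under stated assumptions: a page
+is taken to be 256 bytes (two halves of 128 bytes), a zero length is treated
+as invalid, and ``eeprom_request_ok()`` is a hypothetical helper rather than
+part of any existing library.
+

```python
HALF_PAGE = 128  # half-page boundary named in the text; 256-byte page assumed


def eeprom_request_ok(page, offset, length):
    """Return True if (page, offset, length) respects the half-page rules."""
    # At most half a page may be dumped at once.
    if length < 1 or length > HALF_PAGE:
        return False
    # The read must stay inside the (assumed) 256-byte page.
    if offset < 0 or offset + length > 2 * HALF_PAGE:
        return False
    # Pages other than 0 expose only the high 128 bytes.
    if page != 0 and offset < HALF_PAGE:
        return False
    # A read starting in the low half must not cross offset 128.
    if offset < HALF_PAGE and offset + length > HALF_PAGE:
        return False
    return True


print(eeprom_request_ok(page=0, offset=0, length=128))    # True
print(eeprom_request_ok(page=0, offset=64, length=128))   # False
print(eeprom_request_ok(page=1, offset=128, length=128))  # True
```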
+
+Kernel response contents:
+
+ +---------------------------------------------+--------+---------------------+
+ | ``ETHTOOL_A_MODULE_EEPROM_HEADER`` | nested | reply header |
+ +---------------------------------------------+--------+---------------------+
+ | ``ETHTOOL_A_MODULE_EEPROM_DATA`` | binary | array of bytes from |
+ | | | module EEPROM |
+ +---------------------------------------------+--------+---------------------+
+
+``ETHTOOL_A_MODULE_EEPROM_DATA`` has an attribute length equal to the number of
+bytes the driver actually read.
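
The half-page rules for ``MODULE_EEPROM_GET`` requests can be sketched as a
small validity check. This is only an illustration of the constraints stated
above, not the kernel's implementation: the helper name
``module_eeprom_request_ok`` is hypothetical, the 256-byte page size is inferred
from the half-page boundary at offset 128, and rejecting a zero length is an
assumption.

```c
#include <stdbool.h>
#include <stdint.h>

#define EEPROM_PAGE_SIZE      256u  /* assumed: two halves of 128 bytes */
#define EEPROM_HALF_PAGE_SIZE 128u  /* half page boundary from the text */

/* Hypothetical helper: true if the request obeys the half-page rules. */
bool module_eeprom_request_ok(uint8_t page, uint32_t offset, uint32_t length)
{
    /* At most half a page per dump; a zero length is treated as invalid. */
    if (length == 0 || length > EEPROM_HALF_PAGE_SIZE)
        return false;

    /* The dump must stay inside the page. */
    if (offset >= EEPROM_PAGE_SIZE || offset + length > EEPROM_PAGE_SIZE)
        return false;

    /* The dump must not cross the boundary at offset 128. */
    if (offset < EEPROM_HALF_PAGE_SIZE &&
        offset + length > EEPROM_HALF_PAGE_SIZE)
        return false;

    /* For pages other than 0 only the high 128 bytes are accessible. */
    if (page != 0 && offset < EEPROM_HALF_PAGE_SIZE)
        return false;

    return true;
}
```

For instance, offset 64 with length 128 is rejected because it crosses the
boundary, while offset 128 with length 128 is accepted on any page.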
+
+STATS_GET
+=========
+
+Get standard statistics for the interface. Note that this is not
+a re-implementation of ``ETHTOOL_GSTATS`` which exposed driver-defined
+stats.
+
+Request contents:
+
+ ======================================= ====== ==========================
+ ``ETHTOOL_A_STATS_HEADER`` nested request header
+ ``ETHTOOL_A_STATS_GROUPS`` bitset requested groups of stats
+ ======================================= ====== ==========================
+
+Kernel response contents:
+
+ +-----------------------------------+--------+--------------------------------+
+ | ``ETHTOOL_A_STATS_HEADER`` | nested | reply header |
+ +-----------------------------------+--------+--------------------------------+
+ | ``ETHTOOL_A_STATS_GRP`` | nested | one or more group of stats |
+ +-+---------------------------------+--------+--------------------------------+
+ | | ``ETHTOOL_A_STATS_GRP_ID`` | u32 | group ID - ``ETHTOOL_STATS_*`` |
+ +-+---------------------------------+--------+--------------------------------+
+ | | ``ETHTOOL_A_STATS_GRP_SS_ID`` | u32 | string set ID for names |
+ +-+---------------------------------+--------+--------------------------------+
+ | | ``ETHTOOL_A_STATS_GRP_STAT`` | nested | nest containing a statistic |
+ +-+---------------------------------+--------+--------------------------------+
+ | | ``ETHTOOL_A_STATS_GRP_HIST_RX`` | nested | histogram statistic (Rx) |
+ +-+---------------------------------+--------+--------------------------------+
+ | | ``ETHTOOL_A_STATS_GRP_HIST_TX`` | nested | histogram statistic (Tx) |
+ +-+---------------------------------+--------+--------------------------------+
+
+Users specify which groups of statistics they are requesting via
+the ``ETHTOOL_A_STATS_GROUPS`` bitset. Currently defined values are:
+
+ ====================== ======== ===============================================
+ ETHTOOL_STATS_ETH_MAC eth-mac Basic IEEE 802.3 MAC statistics (30.3.1.1.*)
+ ETHTOOL_STATS_ETH_PHY eth-phy Basic IEEE 802.3 PHY statistics (30.3.2.1.*)
+ ETHTOOL_STATS_ETH_CTRL eth-ctrl Basic IEEE 802.3 MAC Ctrl statistics (30.3.3.*)
+ ETHTOOL_STATS_RMON rmon RMON (RFC 2819) statistics
+ ====================== ======== ===============================================
+
+Each group should have a corresponding ``ETHTOOL_A_STATS_GRP`` in the reply.
+``ETHTOOL_A_STATS_GRP_ID`` identifies which group's statistics the nest contains.
+``ETHTOOL_A_STATS_GRP_SS_ID`` identifies the string set ID for the names of
+the statistics in the group, if available.
+
+Statistics are added to the ``ETHTOOL_A_STATS_GRP`` nest under
+``ETHTOOL_A_STATS_GRP_STAT``. ``ETHTOOL_A_STATS_GRP_STAT`` should contain a
+single 8-byte (u64) attribute inside - the type of that attribute is
+the statistic ID and the value is the value of the statistic.
+Each group has its own interpretation of statistic IDs.
+Attribute IDs correspond to strings from the string set identified
+by ``ETHTOOL_A_STATS_GRP_SS_ID``. Complex statistics (such as RMON histogram
+entries) are also listed inside ``ETHTOOL_A_STATS_GRP`` and do not have
+a string defined in the string set.
+
+RMON "histogram" counters count the number of packets within a given size
+range. Because the RFC does not specify the ranges beyond the standard 1518
+MTU, devices differ in their definition of buckets. For this reason the
+definition of packet ranges is left to each driver.
+
+``ETHTOOL_A_STATS_GRP_HIST_RX`` and ``ETHTOOL_A_STATS_GRP_HIST_TX`` nests
+contain the following attributes:
+
+ ================================= ====== ===================================
+ ETHTOOL_A_STATS_RMON_HIST_BKT_LOW u32 low bound of the packet size bucket
+ ETHTOOL_A_STATS_RMON_HIST_BKT_HI u32 high bound of the bucket
+ ETHTOOL_A_STATS_RMON_HIST_VAL u64 packet counter
+ ================================= ====== ===================================
+
+Low and high bounds are inclusive, for example:
+
+ ============================= ==== ====
+ RFC statistic low high
+ ============================= ==== ====
+ etherStatsPkts64Octets 0 64
+ etherStatsPkts512to1023Octets 512 1023
+ ============================= ==== ====
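
Because both bounds are inclusive, a consumer that maps a packet size to a
bucket has to compare with ``>=`` and ``<=``. The following is a minimal
sketch; the struct and function names are hypothetical and only mirror the
three attributes of a histogram nest.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space view of one histogram nest. */
struct rmon_hist_bucket {
    uint32_t low;   /* ETHTOOL_A_STATS_RMON_HIST_BKT_LOW */
    uint32_t high;  /* ETHTOOL_A_STATS_RMON_HIST_BKT_HI */
    uint64_t val;   /* ETHTOOL_A_STATS_RMON_HIST_VAL */
};

/*
 * Return the index of the bucket whose inclusive range contains pkt_size,
 * or -1 if no reported bucket covers it.
 */
int rmon_find_bucket(const struct rmon_hist_bucket *buckets, size_t n,
                     uint32_t pkt_size)
{
    for (size_t i = 0; i < n; i++) {
        if (pkt_size >= buckets[i].low && pkt_size <= buckets[i].high)
            return (int)i;
    }
    return -1;
}
```

With the two example ranges above, sizes 64 and 1023 each fall into a bucket,
while 65 does not match either of them.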
+
+PHC_VCLOCKS_GET
+===============
+
+Query device PHC virtual clocks information.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_PHC_VCLOCKS_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_PHC_VCLOCKS_HEADER`` nested reply header
+ ``ETHTOOL_A_PHC_VCLOCKS_NUM`` u32 PHC virtual clocks number
+ ``ETHTOOL_A_PHC_VCLOCKS_INDEX`` s32 PHC index array
+ ==================================== ====== ==========================
+
+MODULE_GET
+==========
+
+Gets transceiver module parameters.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_MODULE_HEADER`` nested request header
+ ===================================== ====== ==========================
+
+Kernel response contents:
+
+ ====================================== ====== ==========================
+ ``ETHTOOL_A_MODULE_HEADER`` nested reply header
+ ``ETHTOOL_A_MODULE_POWER_MODE_POLICY`` u8 power mode policy
+ ``ETHTOOL_A_MODULE_POWER_MODE`` u8 operational power mode
+ ====================================== ====== ==========================
+
+The optional ``ETHTOOL_A_MODULE_POWER_MODE_POLICY`` attribute encodes the
+transceiver module power mode policy enforced by the host. The default policy
+is driver-dependent, but "auto" is the recommended default and it should be
+implemented by new drivers and drivers where conformance to a legacy behavior
+is not critical.
+
+The optional ``ETHTOOL_A_MODULE_POWER_MODE`` attribute encodes the operational
+power mode of the transceiver module. It is only reported when a module
+is plugged in. Possible values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_module_power_mode
+
+MODULE_SET
+==========
+
+Sets transceiver module parameters.
+
+Request contents:
+
+ ====================================== ====== ==========================
+ ``ETHTOOL_A_MODULE_HEADER`` nested request header
+ ``ETHTOOL_A_MODULE_POWER_MODE_POLICY`` u8 power mode policy
+ ====================================== ====== ==========================
+
+When set, the optional ``ETHTOOL_A_MODULE_POWER_MODE_POLICY`` attribute is used
+to set the transceiver module power policy enforced by the host. Possible
+values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_module_power_mode_policy
+
+For SFF-8636 modules, low power mode is forced by the host according to table
+6-10 in revision 2.10a of the specification.
+
+For CMIS modules, low power mode is forced by the host according to table 6-12
+in revision 5.0 of the specification.
+
+PSE_GET
+=======
+
+Gets PSE attributes.
+
+Request contents:
+
+ ===================================== ====== ==========================
+ ``ETHTOOL_A_PSE_HEADER`` nested request header
+ ===================================== ====== ==========================
+
+Kernel response contents:
+
+ ====================================== ====== =============================
+ ``ETHTOOL_A_PSE_HEADER`` nested reply header
+ ``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` u32 Operational state of the PoDL
+ PSE functions
+ ``ETHTOOL_A_PODL_PSE_PW_D_STATUS`` u32 power detection status of the
+ PoDL PSE.
+ ====================================== ====== =============================
+
+When set, the optional ``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` attribute identifies
+the operational state of the PoDL PSE functions. The operational state of the
+PSE function can be changed using the ``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL``
+action. This option corresponds to ``IEEE 802.3-2018`` 30.15.1.1.2
+aPoDLPSEAdminState. Possible values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_podl_pse_admin_state
+
+When set, the optional ``ETHTOOL_A_PODL_PSE_PW_D_STATUS`` attribute identifies
+the power detection status of the PoDL PSE. The status depends on the internal
+PSE state machine and automatic PD classification support. This option
+corresponds to ``IEEE 802.3-2018`` 30.15.1.1.3 aPoDLPSEPowerDetectionStatus.
+Possible values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_podl_pse_pw_d_status
+
+PSE_SET
+=======
+
+Sets PSE parameters.
+
+Request contents:
+
+ ====================================== ====== =============================
+ ``ETHTOOL_A_PSE_HEADER`` nested request header
+ ``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL`` u32 Control PoDL PSE Admin state
+ ====================================== ====== =============================
+
+When set, the optional ``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL`` attribute is used
+to control PoDL PSE Admin functions. This option implements
+``IEEE 802.3-2018`` 30.15.1.2.1 acPoDLPSEAdminControl. See
+``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` for supported values.
+
Request translation
===================
The following table maps ioctl commands to netlink commands providing their
functionality. Entries with "n/a" in right column are commands which do not
-have their netlink replacement yet.
+have their netlink replacement yet. Entries with "n/a" in the left column
+are netlink only.
=================================== =====================================
ioctl command netlink command
@@ -545,37 +1711,37 @@ have their netlink replacement yet.
``ETHTOOL_GLINK`` ``ETHTOOL_MSG_LINKSTATE_GET``
``ETHTOOL_GEEPROM`` n/a
``ETHTOOL_SEEPROM`` n/a
- ``ETHTOOL_GCOALESCE`` n/a
- ``ETHTOOL_SCOALESCE`` n/a
- ``ETHTOOL_GRINGPARAM`` n/a
- ``ETHTOOL_SRINGPARAM`` n/a
- ``ETHTOOL_GPAUSEPARAM`` n/a
- ``ETHTOOL_SPAUSEPARAM`` n/a
- ``ETHTOOL_GRXCSUM`` n/a
- ``ETHTOOL_SRXCSUM`` n/a
- ``ETHTOOL_GTXCSUM`` n/a
- ``ETHTOOL_STXCSUM`` n/a
- ``ETHTOOL_GSG`` n/a
- ``ETHTOOL_SSG`` n/a
+ ``ETHTOOL_GCOALESCE`` ``ETHTOOL_MSG_COALESCE_GET``
+ ``ETHTOOL_SCOALESCE`` ``ETHTOOL_MSG_COALESCE_SET``
+ ``ETHTOOL_GRINGPARAM`` ``ETHTOOL_MSG_RINGS_GET``
+ ``ETHTOOL_SRINGPARAM`` ``ETHTOOL_MSG_RINGS_SET``
+ ``ETHTOOL_GPAUSEPARAM`` ``ETHTOOL_MSG_PAUSE_GET``
+ ``ETHTOOL_SPAUSEPARAM`` ``ETHTOOL_MSG_PAUSE_SET``
+ ``ETHTOOL_GRXCSUM`` ``ETHTOOL_MSG_FEATURES_GET``
+ ``ETHTOOL_SRXCSUM`` ``ETHTOOL_MSG_FEATURES_SET``
+ ``ETHTOOL_GTXCSUM`` ``ETHTOOL_MSG_FEATURES_GET``
+ ``ETHTOOL_STXCSUM`` ``ETHTOOL_MSG_FEATURES_SET``
+ ``ETHTOOL_GSG`` ``ETHTOOL_MSG_FEATURES_GET``
+ ``ETHTOOL_SSG`` ``ETHTOOL_MSG_FEATURES_SET``
``ETHTOOL_TEST`` n/a
``ETHTOOL_GSTRINGS`` ``ETHTOOL_MSG_STRSET_GET``
``ETHTOOL_PHYS_ID`` n/a
``ETHTOOL_GSTATS`` n/a
- ``ETHTOOL_GTSO`` n/a
- ``ETHTOOL_STSO`` n/a
+ ``ETHTOOL_GTSO`` ``ETHTOOL_MSG_FEATURES_GET``
+ ``ETHTOOL_STSO`` ``ETHTOOL_MSG_FEATURES_SET``
``ETHTOOL_GPERMADDR`` rtnetlink ``RTM_GETLINK``
- ``ETHTOOL_GUFO`` n/a
- ``ETHTOOL_SUFO`` n/a
- ``ETHTOOL_GGSO`` n/a
- ``ETHTOOL_SGSO`` n/a
- ``ETHTOOL_GFLAGS`` n/a
- ``ETHTOOL_SFLAGS`` n/a
- ``ETHTOOL_GPFLAGS`` n/a
- ``ETHTOOL_SPFLAGS`` n/a
+ ``ETHTOOL_GUFO`` ``ETHTOOL_MSG_FEATURES_GET``
+ ``ETHTOOL_SUFO`` ``ETHTOOL_MSG_FEATURES_SET``
+ ``ETHTOOL_GGSO`` ``ETHTOOL_MSG_FEATURES_GET``
+ ``ETHTOOL_SGSO`` ``ETHTOOL_MSG_FEATURES_SET``
+ ``ETHTOOL_GFLAGS`` ``ETHTOOL_MSG_FEATURES_GET``
+ ``ETHTOOL_SFLAGS`` ``ETHTOOL_MSG_FEATURES_SET``
+ ``ETHTOOL_GPFLAGS`` ``ETHTOOL_MSG_PRIVFLAGS_GET``
+ ``ETHTOOL_SPFLAGS`` ``ETHTOOL_MSG_PRIVFLAGS_SET``
``ETHTOOL_GRXFH`` n/a
``ETHTOOL_SRXFH`` n/a
- ``ETHTOOL_GGRO`` n/a
- ``ETHTOOL_SGRO`` n/a
+ ``ETHTOOL_GGRO`` ``ETHTOOL_MSG_FEATURES_GET``
+ ``ETHTOOL_SGRO`` ``ETHTOOL_MSG_FEATURES_SET``
``ETHTOOL_GRXRINGS`` n/a
``ETHTOOL_GRXCLSRLCNT`` n/a
``ETHTOOL_GRXCLSRULE`` n/a
@@ -589,18 +1755,18 @@ have their netlink replacement yet.
``ETHTOOL_GSSET_INFO`` ``ETHTOOL_MSG_STRSET_GET``
``ETHTOOL_GRXFHINDIR`` n/a
``ETHTOOL_SRXFHINDIR`` n/a
- ``ETHTOOL_GFEATURES`` n/a
- ``ETHTOOL_SFEATURES`` n/a
- ``ETHTOOL_GCHANNELS`` n/a
- ``ETHTOOL_SCHANNELS`` n/a
+ ``ETHTOOL_GFEATURES`` ``ETHTOOL_MSG_FEATURES_GET``
+ ``ETHTOOL_SFEATURES`` ``ETHTOOL_MSG_FEATURES_SET``
+ ``ETHTOOL_GCHANNELS`` ``ETHTOOL_MSG_CHANNELS_GET``
+ ``ETHTOOL_SCHANNELS`` ``ETHTOOL_MSG_CHANNELS_SET``
``ETHTOOL_SET_DUMP`` n/a
``ETHTOOL_GET_DUMP_FLAG`` n/a
``ETHTOOL_GET_DUMP_DATA`` n/a
- ``ETHTOOL_GET_TS_INFO`` n/a
- ``ETHTOOL_GMODULEINFO`` n/a
- ``ETHTOOL_GMODULEEEPROM`` n/a
- ``ETHTOOL_GEEE`` n/a
- ``ETHTOOL_SEEE`` n/a
+ ``ETHTOOL_GET_TS_INFO`` ``ETHTOOL_MSG_TSINFO_GET``
+ ``ETHTOOL_GMODULEINFO`` ``ETHTOOL_MSG_MODULE_EEPROM_GET``
+ ``ETHTOOL_GMODULEEEPROM`` ``ETHTOOL_MSG_MODULE_EEPROM_GET``
+ ``ETHTOOL_GEEE`` ``ETHTOOL_MSG_EEE_GET``
+ ``ETHTOOL_SEEE`` ``ETHTOOL_MSG_EEE_SET``
``ETHTOOL_GRSSH`` n/a
``ETHTOOL_SRSSH`` n/a
``ETHTOOL_GTUNABLE`` n/a
@@ -613,6 +1779,12 @@ have their netlink replacement yet.
``ETHTOOL_MSG_LINKMODES_SET``
``ETHTOOL_PHY_GTUNABLE`` n/a
``ETHTOOL_PHY_STUNABLE`` n/a
- ``ETHTOOL_GFECPARAM`` n/a
- ``ETHTOOL_SFECPARAM`` n/a
+ ``ETHTOOL_GFECPARAM`` ``ETHTOOL_MSG_FEC_GET``
+ ``ETHTOOL_SFECPARAM`` ``ETHTOOL_MSG_FEC_SET``
+ n/a ``ETHTOOL_MSG_CABLE_TEST_ACT``
+ n/a ``ETHTOOL_MSG_CABLE_TEST_TDR_ACT``
+ n/a ``ETHTOOL_MSG_TUNNEL_INFO_GET``
+ n/a ``ETHTOOL_MSG_PHC_VCLOCKS_GET``
+ n/a ``ETHTOOL_MSG_MODULE_GET``
+ n/a ``ETHTOOL_MSG_MODULE_SET``
=================================== =====================================
diff --git a/Documentation/networking/fib_trie.txt b/Documentation/networking/fib_trie.rst
index fe719388518b..f1435b7fcdb7 100644
--- a/Documentation/networking/fib_trie.txt
+++ b/Documentation/networking/fib_trie.rst
@@ -1,8 +1,12 @@
- LC-trie implementation notes.
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
+LC-trie implementation notes
+============================
Node types
----------
-leaf
+leaf
An end node with data. This has a copy of the relevant key, along
with 'hlist' with routing table entries sorted by prefix length.
See struct leaf and struct leaf_info.
@@ -13,7 +17,7 @@ trie node or tnode
A few concepts explained
------------------------
-Bits (tnode)
+Bits (tnode)
The number of bits in the key segment used for indexing into the
child array - the "child index". See Level Compression.
@@ -23,7 +27,7 @@ Pos (tnode)
Path Compression / skipped bits
Any given tnode is linked to from the child array of its parent, using
- a segment of the key specified by the parent's "pos" and "bits"
+ a segment of the key specified by the parent's "pos" and "bits"
In certain cases, this tnode's own "pos" will not be immediately
adjacent to the parent (pos+bits), but there will be some bits
in the key skipped over because they represent a single path with no
@@ -56,8 +60,8 @@ full_children
Comments
---------
-We have tried to keep the structure of the code as close to fib_hash as
-possible to allow verification and help up reviewing.
+We have tried to keep the structure of the code as close to fib_hash as
+possible to allow verification and help up reviewing.
fib_find_node()
A good start for understanding this code. This function implements a
diff --git a/Documentation/networking/filter.rst b/Documentation/networking/filter.rst
new file mode 100644
index 000000000000..f69da5074860
--- /dev/null
+++ b/Documentation/networking/filter.rst
@@ -0,0 +1,685 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _networking-filter:
+
+=======================================================
+Linux Socket Filtering aka Berkeley Packet Filter (BPF)
+=======================================================
+
+Notice
+------
+
+This file used to document the eBPF format and mechanisms even when not
+related to socket filtering. The ../bpf/index.rst has more details
+on eBPF.
+
+Introduction
+------------
+
+Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter.
+Although there are some distinct differences between BSD and Linux
+kernel filtering, when we speak of BPF or LSF in a Linux context, we
+mean the very same mechanism of filtering in the Linux kernel.
+
+BPF allows a user-space program to attach a filter onto any socket and
+allow or disallow certain types of data to come through the socket. LSF
+follows exactly the same filter code structure as BSD's BPF, so referring
+to the BSD bpf.4 manpage is very helpful in creating filters.
+
+On Linux, BPF is much simpler than on BSD. One does not have to worry
+about devices or anything like that. You simply create your filter code,
+send it to the kernel via the SO_ATTACH_FILTER option and if your filter
+code passes the kernel check on it, you then immediately begin filtering
+data on that socket.
+
+You can also detach filters from your socket via the SO_DETACH_FILTER
+option. This will probably not be used much since when you close a socket
+that has a filter on it the filter is automagically removed. The other
+less common case may be adding a different filter on the same socket where
+you had another filter that is still running: the kernel takes care of
+removing the old one and placing your new one in its place, assuming your
+filter has passed the checks, otherwise if it fails the old filter will
+remain on that socket.
+
+The SO_LOCK_FILTER option allows locking the filter attached to a socket.
+Once set, a filter cannot be removed or changed. This allows one process to
+set up a socket, attach a filter, lock it, then drop privileges and be
+assured that the filter will be kept until the socket is closed.
+
+The biggest user of this construct might be libpcap. Issuing a high-level
+filter command like `tcpdump -i em1 port 22` passes through the libpcap
+internal compiler that generates a structure that can eventually be loaded
+via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd`
+displays what is being placed into this structure.
+
+Although we were only speaking about sockets here, BPF in Linux is used
+in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel
+qdisc layer, SECCOMP-BPF (SECure COMPuting [1]_), and lots of other places
+such as team driver, PTP code, etc where BPF is being used.
+
+.. [1] Documentation/userspace-api/seccomp_filter.rst
+
+Original BPF paper:
+
+Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new
+architecture for user-level packet capture. In Proceedings of the
+USENIX Winter 1993 Conference Proceedings on USENIX Winter 1993
+Conference Proceedings (USENIX'93). USENIX Association, Berkeley,
+CA, USA, 2-2. [http://www.tcpdump.org/papers/bpf-usenix93.pdf]
+
+Structure
+---------
+
+User space applications include <linux/filter.h> which contains the
+following relevant structures::
+
+ struct sock_filter { /* Filter block */
+ __u16 code; /* Actual filter code */
+ __u8 jt; /* Jump true */
+ __u8 jf; /* Jump false */
+ __u32 k; /* Generic multiuse field */
+ };
+
+Such a structure is assembled as an array of 4-tuples, that contains
+a code, jt, jf and k value. jt and jf are jump offsets and k a generic
+value to be used for a provided code::
+
+ struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
+ unsigned short len; /* Number of filter blocks */
+ struct sock_filter __user *filter;
+ };
+
+For socket filtering, a pointer to this structure (as shown in
+follow-up example) is being passed to the kernel through setsockopt(2).
+
+Example
+-------
+
+::
+
+ #include <sys/socket.h>
+ #include <sys/types.h>
+ #include <arpa/inet.h>
+ #include <unistd.h>
+ #include <linux/if_ether.h>
+ #include <linux/filter.h>
+ /* ... */
+
+ /* From the example above: tcpdump -i em1 port 22 -dd */
+ struct sock_filter code[] = {
+ { 0x28, 0, 0, 0x0000000c },
+ { 0x15, 0, 8, 0x000086dd },
+ { 0x30, 0, 0, 0x00000014 },
+ { 0x15, 2, 0, 0x00000084 },
+ { 0x15, 1, 0, 0x00000006 },
+ { 0x15, 0, 17, 0x00000011 },
+ { 0x28, 0, 0, 0x00000036 },
+ { 0x15, 14, 0, 0x00000016 },
+ { 0x28, 0, 0, 0x00000038 },
+ { 0x15, 12, 13, 0x00000016 },
+ { 0x15, 0, 12, 0x00000800 },
+ { 0x30, 0, 0, 0x00000017 },
+ { 0x15, 2, 0, 0x00000084 },
+ { 0x15, 1, 0, 0x00000006 },
+ { 0x15, 0, 8, 0x00000011 },
+ { 0x28, 0, 0, 0x00000014 },
+ { 0x45, 6, 0, 0x00001fff },
+ { 0xb1, 0, 0, 0x0000000e },
+ { 0x48, 0, 0, 0x0000000e },
+ { 0x15, 2, 0, 0x00000016 },
+ { 0x48, 0, 0, 0x00000010 },
+ { 0x15, 0, 1, 0x00000016 },
+ { 0x06, 0, 0, 0x0000ffff },
+ { 0x06, 0, 0, 0x00000000 },
+ };
+
+ struct sock_fprog bpf = {
+     /* ARRAY_SIZE() is a kernel macro; spell it out in user space */
+     .len = sizeof(code) / sizeof(code[0]),
+     .filter = code,
+ };
+
+ sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
+ if (sock < 0) {
+     /* ... bail out ... */
+ }
+
+ ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));
+ if (ret < 0) {
+     /* ... bail out ... */
+ }
+
+ /* ... */
+ close(sock);
+
+The above example code attaches a socket filter for a PF_PACKET socket
+in order to let all IPv4/IPv6 packets with port 22 pass. The rest will
+be dropped for this socket.
+
+The setsockopt(2) call for SO_DETACH_FILTER does not need any arguments,
+while SO_LOCK_FILTER, which prevents the filter from being detached, takes
+an integer value of 0 or 1.
+
+Note that socket filters are not restricted to PF_PACKET sockets only,
+but can also be used on other socket families.
+
+Summary of system calls:
+
+ * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val));
+ * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val));
+ * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER, &val, sizeof(val));
+
+Normally, most use cases for socket filtering on packet sockets will be
+covered by libpcap in high-level syntax, so as an application developer
+you should stick to that. libpcap wraps its own layer around all that.
+
+Unless i) using/linking to libpcap is not an option, ii) the required BPF
+filters use Linux extensions that are not supported by libpcap's compiler,
+iii) a filter might be more complex and not cleanly implementable with
+libpcap's compiler, or iv) particular filter codes should be optimized
+differently than libpcap's internal compiler does; then in such cases
+writing such a filter "by hand" can be an alternative. For example,
+xt_bpf and cls_bpf users might have requirements that could result in
+more complex filter code, or one that cannot be expressed with libpcap
+(e.g. different return codes for various code paths). Moreover, BPF JIT
+implementors may wish to manually write test cases and thus need low-level
+access to BPF code as well.
+
+BPF engine and instruction set
+------------------------------
+
+Under tools/bpf/ there's a small helper tool called bpf_asm which can
+be used to write low-level filters for example scenarios mentioned in the
+previous section. Asm-like syntax mentioned here has been implemented in
+bpf_asm and will be used for further explanations (instead of dealing with
+less readable opcodes directly, principles are the same). The syntax is
+closely modelled after Steven McCanne's and Van Jacobson's BPF paper.
+
+The BPF architecture consists of the following basic elements:
+
+ ======= ====================================================
+ Element Description
+ ======= ====================================================
+ A 32 bit wide accumulator
+ X 32 bit wide X register
+ M[] 16 x 32 bit wide misc registers aka "scratch memory
+ store", addressable from 0 to 15
+ ======= ====================================================
+
+A program that is translated by bpf_asm into "opcodes" is an array that
+consists of the following elements (as already mentioned)::
+
+ op:16, jt:8, jf:8, k:32
+
+The element op is a 16 bit wide opcode that has a particular instruction
+encoded. jt and jf are two 8 bit wide jump targets, one for condition
+"jump if true", the other one "jump if false". Finally, element k
+contains a miscellaneous argument that can be interpreted in different
+ways depending on the given instruction in op.
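
To make the roles of op, jt, jf and k concrete, here is a minimal sketch of an
interpreter for a tiny subset of the instruction set. It is not the kernel's
engine: ``struct filter_insn`` and ``run_filter`` are hypothetical names, only
four opcodes are handled (the numeric values are the ones that appear in the
filter dumps in this document), and X and M[] are left out. It shows that jt
and jf are offsets relative to the next instruction and that a load beyond the
packet end terminates the filter with 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the op:16, jt:8, jf:8, k:32 layout. */
struct filter_insn {
    uint16_t op;
    uint8_t  jt;
    uint8_t  jf;
    uint32_t k;
};

enum {
    OP_LDH_ABS = 0x28,  /* ldh [k]           */
    OP_LDB_ABS = 0x30,  /* ldb [k]           */
    OP_JEQ_K   = 0x15,  /* jeq #k, Lt, Lf    */
    OP_RET_K   = 0x06,  /* ret #k            */
};

/* Run prog over pkt; the return value is the filter verdict. */
uint32_t run_filter(const struct filter_insn *prog, size_t len,
                    const uint8_t *pkt, size_t pkt_len)
{
    uint32_t A = 0;  /* accumulator */

    for (size_t pc = 0; pc < len; pc++) {
        const struct filter_insn *insn = &prog[pc];

        switch (insn->op) {
        case OP_LDH_ABS:  /* big-endian half-word at byte offset k */
            if (insn->k > pkt_len || pkt_len - insn->k < 2)
                return 0;
            A = ((uint32_t)pkt[insn->k] << 8) | pkt[insn->k + 1];
            break;
        case OP_LDB_ABS:  /* byte at offset k */
            if (insn->k >= pkt_len)
                return 0;
            A = pkt[insn->k];
            break;
        case OP_JEQ_K:    /* offsets count from the next instruction */
            pc += (A == insn->k) ? insn->jt : insn->jf;
            break;
        case OP_RET_K:
            return insn->k;
        default:          /* anything else is outside this sketch */
            return 0;
        }
    }
    return 0;  /* fell off the end of the program */
}
```

Feeding it the four-instruction ARP filter (``0x28 0 0 12``, ``0x15 0 1 0x806``,
``0x06 0 0 0xffffffff``, ``0x06 0 0 0``) accepts a frame whose bytes 12 and 13
are ``08 06`` and rejects one carrying ``08 00``.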
+
+The instruction set consists of load, store, branch, alu, miscellaneous
+and return instructions that are also represented in bpf_asm syntax. This
+table lists all available bpf_asm instructions and what their underlying
+opcodes, as defined in linux/filter.h, stand for:
+
+ =========== =================== =====================
+ Instruction Addressing mode     Description
+ =========== =================== =====================
+ ld          1, 2, 3, 4, 12      Load word into A
+ ldi         4                   Load word into A
+ ldh         1, 2                Load half-word into A
+ ldb         1, 2                Load byte into A
+ ldx         3, 4, 5, 12         Load word into X
+ ldxi        4                   Load word into X
+ ldxb        5                   Load byte into X
+
+ st          3                   Store A into M[]
+ stx         3                   Store X into M[]
+
+ jmp         6                   Jump to label
+ ja          6                   Jump to label
+ jeq         7, 8, 9, 10         Jump on A == <x>
+ jneq        9, 10               Jump on A != <x>
+ jne         9, 10               Jump on A != <x>
+ jlt         9, 10               Jump on A < <x>
+ jle         9, 10               Jump on A <= <x>
+ jgt         7, 8, 9, 10         Jump on A > <x>
+ jge         7, 8, 9, 10         Jump on A >= <x>
+ jset        7, 8, 9, 10         Jump on A & <x>
+
+ add         0, 4                A + <x>
+ sub         0, 4                A - <x>
+ mul         0, 4                A * <x>
+ div         0, 4                A / <x>
+ mod         0, 4                A % <x>
+ neg                             -A
+ and         0, 4                A & <x>
+ or          0, 4                A | <x>
+ xor         0, 4                A ^ <x>
+ lsh         0, 4                A << <x>
+ rsh         0, 4                A >> <x>
+
+ tax                             Copy A into X
+ txa                             Copy X into A
+
+ ret         4, 11               Return
+ =========== =================== =====================
+
+The next table shows addressing formats from the 2nd column:
+
+ =============== =================== ===============================================
+ Addressing mode Syntax Description
+ =============== =================== ===============================================
+ 0 x/%x Register X
+ 1 [k] BHW at byte offset k in the packet
+ 2 [x + k] BHW at the offset X + k in the packet
+ 3 M[k] Word at offset k in M[]
+ 4 #k Literal value stored in k
+ 5 4*([k]&0xf) Lower nibble * 4 at byte offset k in the packet
+ 6 L Jump label L
+ 7 #k,Lt,Lf Jump to Lt if true, otherwise jump to Lf
+ 8 x/%x,Lt,Lf Jump to Lt if true, otherwise jump to Lf
+ 9 #k,Lt Jump to Lt if predicate is true
+ 10 x/%x,Lt Jump to Lt if predicate is true
+ 11 a/%a Accumulator A
+ 12 extension BPF extension
+ =============== =================== ===============================================
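
Addressing mode 5 is the least obvious entry in the table. The sketch below
spells out the arithmetic it performs. The function name ``load_msh`` is
hypothetical, and returning -1 for an out-of-range offset is a convention of
this sketch only. A typical use is ``ldxb 4*([14]&0xf)``, which on an Ethernet
frame turns the IPv4 IHL nibble at byte 14 into a header length in bytes.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Addressing mode 5, 4*([k]&0xf): take the lower nibble of the byte at
 * offset k and multiply it by 4. Returns -1 if k lies outside the packet.
 */
int load_msh(const uint8_t *pkt, size_t pkt_len, size_t k)
{
    if (k >= pkt_len)
        return -1;
    return 4 * (pkt[k] & 0xf);
}
```

For a first IPv4 header byte of 0x45 (version 4, IHL 5) the result is 20, the
length of an IPv4 header without options.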
+
+The Linux kernel also has a couple of BPF extensions that are used along
+with the class of load instructions by "overloading" the k argument with
+a negative offset + a particular extension offset. The result of such BPF
+extensions is loaded into A.
+
+Possible BPF extensions are shown in the following table:
+
+ =================================== =================================================
+ Extension Description
+ =================================== =================================================
+ len skb->len
+ proto skb->protocol
+ type skb->pkt_type
+ poff Payload start offset
+ ifidx skb->dev->ifindex
+ nla Netlink attribute of type X with offset A
+ nlan Nested Netlink attribute of type X with offset A
+ mark skb->mark
+ queue skb->queue_mapping
+ hatype skb->dev->type
+ rxhash skb->hash
+ cpu raw_smp_processor_id()
+ vlan_tci skb_vlan_tag_get(skb)
+ vlan_avail skb_vlan_tag_present(skb)
+ vlan_tpid skb->vlan_proto
+ rand get_random_u32()
+ =================================== =================================================
+
+These extensions can also be prefixed with '#'.
+Examples for low-level BPF:
+
+**ARP packets**::
+
+ ldh [12]
+ jne #0x806, drop
+ ret #-1
+ drop: ret #0
+
+**IPv4 TCP packets**::
+
+ ldh [12]
+ jne #0x800, drop
+ ldb [23]
+ jneq #6, drop
+ ret #-1
+ drop: ret #0
+
+**icmp random packet sampling, 1 in 4**::
+
+ ldh [12]
+ jne #0x800, drop
+ ldb [23]
+ jneq #1, drop
+ # get a random uint32 number
+ ld rand
+ mod #4
+ jneq #1, drop
+ ret #-1
+ drop: ret #0
+
+**SECCOMP filter example**::
+
+ ld [4] /* offsetof(struct seccomp_data, arch) */
+ jne #0xc000003e, bad /* AUDIT_ARCH_X86_64 */
+ ld [0] /* offsetof(struct seccomp_data, nr) */
+ jeq #15, good /* __NR_rt_sigreturn */
+ jeq #231, good /* __NR_exit_group */
+ jeq #60, good /* __NR_exit */
+ jeq #0, good /* __NR_read */
+ jeq #1, good /* __NR_write */
+ jeq #5, good /* __NR_fstat */
+ jeq #9, good /* __NR_mmap */
+ jeq #14, good /* __NR_rt_sigprocmask */
+ jeq #13, good /* __NR_rt_sigaction */
+ jeq #35, good /* __NR_nanosleep */
+ bad: ret #0 /* SECCOMP_RET_KILL_THREAD */
+ good: ret #0x7fff0000 /* SECCOMP_RET_ALLOW */
+
+Examples for low-level BPF extension:
+
+**Packet for interface index 13**::
+
+ ld ifidx
+ jneq #13, drop
+ ret #-1
+ drop: ret #0
+
+**(Accelerated) VLAN w/ id 10**::
+
+ ld vlan_tci
+ jneq #10, drop
+ ret #-1
+ drop: ret #0
+
+The above example code can be placed into a file (here called "foo"), and
+then be passed to the bpf_asm tool for generating opcodes, output that xt_bpf
+and cls_bpf understand and can be loaded with directly. Example with the above
+ARP code::
+
+ $ ./bpf_asm foo
+ 4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0,
+
+In copy and paste C-like output::
+
+ $ ./bpf_asm -c foo
+ { 0x28, 0, 0, 0x0000000c },
+ { 0x15, 0, 1, 0x00000806 },
+ { 0x06, 0, 0, 0xffffffff },
+ { 0x06, 0, 0, 0000000000 },
+
+In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF
+filters that might not be obvious at first, it's good to test filters before
+attaching to a live system. For that purpose, there's a small tool called
+bpf_dbg under tools/bpf/ in the kernel source directory. This debugger allows
+for testing BPF filters against given pcap files, single stepping through the
+BPF code on the pcap's packets and to do BPF machine register dumps.
+
+Starting bpf_dbg is trivial and just requires issuing::
+
+ # ./bpf_dbg
+
+In case input and output do not equal stdin/stdout, bpf_dbg takes an
+alternative stdin source as a first argument, and an alternative stdout
+sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`.
+
+Other than that, a particular libreadline configuration can be set via
+file "~/.bpf_dbg_init" and the command history is stored in the file
+"~/.bpf_dbg_history".
+
+Interaction in bpf_dbg happens through a shell that also has auto-completion
+support (follow-up example commands starting with '>' denote bpf_dbg shell).
+The usual workflow would be to ...
+
+* load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0
+ Loads a BPF filter from standard output of bpf_asm, or transformed via
+ e.g. ``tcpdump -iem1 -ddd port 22 | tr '\n' ','``. Note that for JIT
+ debugging (next section), this command creates a temporary socket and
+ loads the BPF code into the kernel. Thus, this will also be useful for
+ JIT developers.
+
+* load pcap foo.pcap
+
+ Loads standard tcpdump pcap file.
+
+* run [<n>]
+
+bpf passes:1 fails:9
+ Runs through all packets from a pcap to account how many passes and fails
+ the filter will generate. A limit of packets to traverse can be given.
+
+* disassemble::
+
+ l0: ldh [12]
+ l1: jeq #0x800, l2, l5
+ l2: ldb [23]
+ l3: jeq #0x1, l4, l5
+ l4: ret #0xffff
+ l5: ret #0
+
+ Prints out BPF code disassembly.
+
+* dump::
+
+ /* { op, jt, jf, k }, */
+ { 0x28, 0, 0, 0x0000000c },
+ { 0x15, 0, 3, 0x00000800 },
+ { 0x30, 0, 0, 0x00000017 },
+ { 0x15, 0, 1, 0x00000001 },
+ { 0x06, 0, 0, 0x0000ffff },
+ { 0x06, 0, 0, 0000000000 },
+
+ Prints out C-style BPF code dump.
+
+* breakpoint 0::
+
+ breakpoint at: l0: ldh [12]
+
+* breakpoint 1::
+
+ breakpoint at: l1: jeq #0x800, l2, l5
+
+ ...
+
+ Sets breakpoints at particular BPF instructions. Issuing a `run` command
+ will walk through the pcap file continuing from the current packet and
+ break when a breakpoint is being hit (another `run` will continue from
+ the currently active breakpoint executing next instructions):
+
+ * run::
+
+ -- register dump --
+ pc: [0] <-- program counter
+ code: [40] jt[0] jf[0] k[12] <-- plain BPF code of current instruction
+ curr: l0: ldh [12] <-- disassembly of current instruction
+ A: [00000000][0] <-- content of A (hex, decimal)
+ X: [00000000][0] <-- content of X (hex, decimal)
+ M[0,15]: [00000000][0] <-- folded content of M (hex, decimal)
+ -- packet dump -- <-- Current packet from pcap (hex)
+ len: 42
+ 0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01
+ 16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26
+ 32: 00 00 00 00 00 00 0a 3b 01 01
+ (breakpoint)
+ >
+
+ * breakpoint::
+
+ breakpoints: 0 1
+
+ Prints currently set breakpoints.
+
+* step [-<n>, +<n>]
+
+ Performs single stepping through the BPF program from the current pc
+ offset. On each step invocation, the register dump shown above is issued.
+ Stepping can go forwards and backwards in time; a plain `step` will break
+ on the next BPF instruction, thus +1. (No `run` needs to be issued here.)
+
+* select <n>
+
+ Selects a given packet from the pcap file to continue from. Thus, on
+ the next `run` or `step`, the BPF program is evaluated against the
+ selected packet. Numbering starts at index 1, just as in Wireshark.
+
+* quit
+
+ Exits bpf_dbg.
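The `load bpf` and `run` steps above can be approximated in user space. The following is a rough sketch only, not bpf_dbg's actual implementation: it assumes the filter uses just the four opcodes that appear in the example (ldh/ldb with absolute offset, jeq against a constant, ret with a constant), and `struct insn`, `parse_filter()`, `run_filter()` and `arp_pkt` are names made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Same field layout as struct sock_filter in <linux/filter.h>. */
struct insn {
	uint16_t code;
	uint8_t jt;
	uint8_t jf;
	uint32_t k;
};

/* The 42-byte ARP packet shown in the register dump above. */
const uint8_t arp_pkt[42] = {
	0x00, 0x19, 0xcb, 0x55, 0x55, 0xa4, 0x00, 0x14,
	0xa4, 0x43, 0x78, 0x69, 0x08, 0x06, 0x00, 0x01,
	0x08, 0x00, 0x06, 0x04, 0x00, 0x01, 0x00, 0x14,
	0xa4, 0x43, 0x78, 0x69, 0x0a, 0x3b, 0x01, 0x26,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x0a, 0x3b,
	0x01, 0x01,
};

/* Parse "len,code jt jf k,..." as passed to `load bpf`; returns len or -1. */
int parse_filter(const char *s, struct insn *prog, int max)
{
	unsigned int code, jt, jf, k;
	int len, n, i;

	if (sscanf(s, "%d%n", &len, &n) != 1 || len < 1 || len > max)
		return -1;
	s += n;
	for (i = 0; i < len; i++) {
		if (sscanf(s, ",%u %u %u %u%n", &code, &jt, &jf, &k, &n) != 4)
			return -1;
		prog[i].code = (uint16_t)code;
		prog[i].jt = (uint8_t)jt;
		prog[i].jf = (uint8_t)jf;
		prog[i].k = k;
		s += n;
	}
	return len;
}

/* Evaluate the filter on one packet (hypothetical helper, 4 opcodes only). */
uint32_t run_filter(const struct insn *prog, int len,
		    const uint8_t *pkt, size_t plen)
{
	uint32_t A = 0;
	int pc;

	for (pc = 0; pc < len; pc++) {
		const struct insn *in = &prog[pc];

		switch (in->code) {
		case 0x28: /* ldh [k]: big-endian half-word at offset k */
			if (plen < 2 || in->k > plen - 2)
				return 0;
			A = (uint32_t)pkt[in->k] << 8 | pkt[in->k + 1];
			break;
		case 0x30: /* ldb [k] */
			if (in->k >= plen)
				return 0;
			A = pkt[in->k];
			break;
		case 0x15: /* jeq #k, jt, jf: offsets are relative to pc + 1 */
			pc += (A == in->k) ? in->jt : in->jf;
			break;
		case 0x06: /* ret #k */
			return in->k;
		default: /* opcode outside the scope of this sketch */
			return 0;
		}
	}
	return 0;
}
```

Fed the six-instruction filter from the `load bpf` example, run_filter() returns 0 for this packet: the half-word at offset 12 is 0x0806 (ARP) rather than 0x0800, so the first jeq branches to the final `ret #0`, which bpf_dbg would count as a fail.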
+
+JIT compiler
+------------
+
+The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC,
+PowerPC, ARM, ARM64, MIPS, RISC-V and s390, which can be enabled through
+CONFIG_BPF_JIT. The JIT compiler is transparently invoked for each
+filter attached from user space or by internal kernel users, provided it
+has been previously enabled by root::
+
+ echo 1 > /proc/sys/net/core/bpf_jit_enable
+
+For JIT developers doing audits etc., each compile run can output the generated
+opcode image into the kernel log via::
+
+ echo 2 > /proc/sys/net/core/bpf_jit_enable
+
+Example output from dmesg::
+
+ [ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f
+ [ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68
+ [ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00
+ [ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00
+ [ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00
+ [ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3
+
+When CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently set to 1 and
+setting any other value will result in failure. This is even the case for
+setting bpf_jit_enable to 2, since dumping the final JIT image into the kernel log
+is discouraged and introspection through bpftool (under tools/bpf/bpftool/) is the
+generally recommended approach instead.
+
+In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for
+generating disassembly out of the kernel log's hexdump::
+
+ # ./bpf_jit_disasm
+ 70 bytes emitted from JIT compiler (pass:3, flen:6)
+ ffffffffa0069c8f + <x>:
+ 0: push %rbp
+ 1: mov %rsp,%rbp
+ 4: sub $0x60,%rsp
+ 8: mov %rbx,-0x8(%rbp)
+ c: mov 0x68(%rdi),%r9d
+ 10: sub 0x6c(%rdi),%r9d
+ 14: mov 0xd8(%rdi),%r8
+ 1b: mov $0xc,%esi
+ 20: callq 0xffffffffe0ff9442
+ 25: cmp $0x800,%eax
+ 2a: jne 0x0000000000000042
+ 2c: mov $0x17,%esi
+ 31: callq 0xffffffffe0ff945e
+ 36: cmp $0x1,%eax
+ 39: jne 0x0000000000000042
+ 3b: mov $0xffff,%eax
+ 40: jmp 0x0000000000000044
+ 42: xor %eax,%eax
+ 44: leaveq
+ 45: retq
+
+Issuing option `-o` will "annotate" opcodes to resulting assembler
+instructions, which can be very useful for JIT developers::
+
+ # ./bpf_jit_disasm -o
+ 70 bytes emitted from JIT compiler (pass:3, flen:6)
+ ffffffffa0069c8f + <x>:
+ 0: push %rbp
+ 55
+ 1: mov %rsp,%rbp
+ 48 89 e5
+ 4: sub $0x60,%rsp
+ 48 83 ec 60
+ 8: mov %rbx,-0x8(%rbp)
+ 48 89 5d f8
+ c: mov 0x68(%rdi),%r9d
+ 44 8b 4f 68
+ 10: sub 0x6c(%rdi),%r9d
+ 44 2b 4f 6c
+ 14: mov 0xd8(%rdi),%r8
+ 4c 8b 87 d8 00 00 00
+ 1b: mov $0xc,%esi
+ be 0c 00 00 00
+ 20: callq 0xffffffffe0ff9442
+ e8 1d 94 ff e0
+ 25: cmp $0x800,%eax
+ 3d 00 08 00 00
+ 2a: jne 0x0000000000000042
+ 75 16
+ 2c: mov $0x17,%esi
+ be 17 00 00 00
+ 31: callq 0xffffffffe0ff945e
+ e8 28 94 ff e0
+ 36: cmp $0x1,%eax
+ 83 f8 01
+ 39: jne 0x0000000000000042
+ 75 07
+ 3b: mov $0xffff,%eax
+ b8 ff ff 00 00
+ 40: jmp 0x0000000000000044
+ eb 02
+ 42: xor %eax,%eax
+ 31 c0
+ 44: leaveq
+ c9
+ 45: retq
+ c3
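The jump and call targets in the annotated output can be cross-checked against the opcode bytes by hand. The sketch below is only an illustration and not part of bpf_jit_disasm; the function names are hypothetical, and it assumes the usual x86-64 encodings of a 2-byte short jump (opcode plus signed 8-bit displacement) and a 5-byte near call (`e8` plus little-endian signed 32-bit displacement), both relative to the next instruction.

```c
#include <stdint.h>

/*
 * Target of a 2-byte short jump located at offset off: the signed 8-bit
 * displacement is added to the offset of the following instruction.
 */
uint64_t rel8_target(uint64_t off, uint8_t disp)
{
	uint64_t d = disp;

	if (d & 0x80) /* sign-extend the displacement to 64 bits */
		d |= ~(uint64_t)0xff;
	return off + 2 + d;
}

/*
 * Target of a 5-byte near call located at offset off: disp holds the four
 * displacement bytes in the order they appear in the hexdump.
 */
uint64_t rel32_target(uint64_t off, const uint8_t disp[4])
{
	uint64_t d = (uint64_t)disp[0] | (uint64_t)disp[1] << 8 |
		     (uint64_t)disp[2] << 16 | (uint64_t)disp[3] << 24;

	if (d & 0x80000000u) /* sign-extend the displacement to 64 bits */
		d |= ~(uint64_t)0xffffffffu;
	return off + 5 + d;
}
```

For instance, `75 16` at offset 0x2a gives 0x2a + 2 + 0x16 = 0x42, matching the `jne` target, and `e8 1d 94 ff e0` at offset 0x20 gives 0x25 plus the sign-extended displacement 0xe0ff941d, i.e. the 0xffffffffe0ff9442 shown for the first `callq`.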
+
+For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provide a useful
+toolchain for developing and testing the kernel's JIT compiler.
+
+BPF kernel internals
+--------------------
+
+Internally, the kernel interpreter uses a different instruction set
+format that follows similar underlying principles to the BPF described in
+the previous paragraphs. However, the instruction set format is modelled
+closer to the underlying architecture to mimic native instruction sets, so
+that better performance can be achieved (more details later). This new
+ISA is called eBPF. See ../bpf/index.rst for details. (Note: eBPF which
+originates from [e]xtended BPF is not the same as BPF extensions! While
+eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading'
+of BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.)
+
+The new instruction set was originally designed with the goal of writing
+programs in "restricted C" and compiling them into eBPF with an optional
+GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with
+minimal performance overhead over two steps, that is, C -> eBPF -> native code.
+
+Currently, the new format is being used for running user BPF programs, which
+includes seccomp BPF, classic socket filters, cls_bpf traffic classifier,
+team driver's classifier for its load-balancing mode, netfilter's xt_bpf
+extension, PTP dissector/classifier, and much more. They are all internally
+converted by the kernel into the new instruction set representation and run
+in the eBPF interpreter. For in-kernel handlers, this all works transparently
+by using bpf_prog_create() for setting up the filter and
+bpf_prog_destroy() for destroying it. The function
+bpf_prog_run(filter, ctx) transparently invokes the eBPF interpreter or JITed
+code to run the filter. 'filter' is a pointer to the struct bpf_prog that we
+got from bpf_prog_create(), and 'ctx' is the given context (e.g. an
+skb pointer). All constraints and restrictions from bpf_check_classic() apply
+before the conversion to the new layout is done behind the scenes!
+
+Currently, the classic BPF format is used for JITing on most
+32-bit architectures, whereas x86-64, aarch64, s390x, powerpc64,
+sparc64, arm32, riscv64 and riscv32 perform JIT compilation from the eBPF
+instruction set.
+
+Testing
+-------
+
+In addition to the BPF toolchain, the kernel also ships a test module that
+contains various test cases for classic BPF and eBPF that can be executed
+against the BPF interpreter and JIT compiler. It can be found in
+lib/test_bpf.c and enabled via Kconfig::
+
+ CONFIG_TEST_BPF=m
+
+After the module has been built and installed, the test suite can be executed
+by loading the 'test_bpf' module via insmod or modprobe. Results of the test
+cases, including timings in nsec, can be found in the kernel log (dmesg).
+
+Misc
+----
+
+Trinity, the Linux syscall fuzzer, also has built-in support for BPF and
+SECCOMP-BPF kernel fuzzing.
+
+Written by
+----------
+
+The document was written in the hope that it is found useful and in order
+to give potential BPF hackers or security auditors a better overview of
+the underlying architecture.
+
+- Jay Schulist <jschlst@samba.org>
+- Daniel Borkmann <daniel@iogearbox.net>
+- Alexei Starovoitov <ast@kernel.org>
diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
deleted file mode 100644
index c4a328f2d57a..000000000000
--- a/Documentation/networking/filter.txt
+++ /dev/null
@@ -1,1545 +0,0 @@
-Linux Socket Filtering aka Berkeley Packet Filter (BPF)
-=======================================================
-
-Introduction
-------------
-
-Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter.
-Though there are some distinct differences between the BSD and Linux
-Kernel filtering, but when we speak of BPF or LSF in Linux context, we
-mean the very same mechanism of filtering in the Linux kernel.
-
-BPF allows a user-space program to attach a filter onto any socket and
-allow or disallow certain types of data to come through the socket. LSF
-follows exactly the same filter code structure as BSD's BPF, so referring
-to the BSD bpf.4 manpage is very helpful in creating filters.
-
-On Linux, BPF is much simpler than on BSD. One does not have to worry
-about devices or anything like that. You simply create your filter code,
-send it to the kernel via the SO_ATTACH_FILTER option and if your filter
-code passes the kernel check on it, you then immediately begin filtering
-data on that socket.
-
-You can also detach filters from your socket via the SO_DETACH_FILTER
-option. This will probably not be used much since when you close a socket
-that has a filter on it the filter is automagically removed. The other
-less common case may be adding a different filter on the same socket where
-you had another filter that is still running: the kernel takes care of
-removing the old one and placing your new one in its place, assuming your
-filter has passed the checks, otherwise if it fails the old filter will
-remain on that socket.
-
-SO_LOCK_FILTER option allows to lock the filter attached to a socket. Once
-set, a filter cannot be removed or changed. This allows one process to
-setup a socket, attach a filter, lock it then drop privileges and be
-assured that the filter will be kept until the socket is closed.
-
-The biggest user of this construct might be libpcap. Issuing a high-level
-filter command like `tcpdump -i em1 port 22` passes through the libpcap
-internal compiler that generates a structure that can eventually be loaded
-via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd`
-displays what is being placed into this structure.
-
-Although we were only speaking about sockets here, BPF in Linux is used
-in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel
-qdisc layer, SECCOMP-BPF (SECure COMPuting [1]), and lots of other places
-such as team driver, PTP code, etc where BPF is being used.
-
- [1] Documentation/userspace-api/seccomp_filter.rst
-
-Original BPF paper:
-
-Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new
-architecture for user-level packet capture. In Proceedings of the
-USENIX Winter 1993 Conference Proceedings on USENIX Winter 1993
-Conference Proceedings (USENIX'93). USENIX Association, Berkeley,
-CA, USA, 2-2. [http://www.tcpdump.org/papers/bpf-usenix93.pdf]
-
-Structure
----------
-
-User space applications include <linux/filter.h> which contains the
-following relevant structures:
-
-struct sock_filter { /* Filter block */
- __u16 code; /* Actual filter code */
- __u8 jt; /* Jump true */
- __u8 jf; /* Jump false */
- __u32 k; /* Generic multiuse field */
-};
-
-Such a structure is assembled as an array of 4-tuples, that contains
-a code, jt, jf and k value. jt and jf are jump offsets and k a generic
-value to be used for a provided code.
-
-struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
- unsigned short len; /* Number of filter blocks */
- struct sock_filter __user *filter;
-};
-
-For socket filtering, a pointer to this structure (as shown in
-follow-up example) is being passed to the kernel through setsockopt(2).
-
-Example
--------
-
-#include <sys/socket.h>
-#include <sys/types.h>
-#include <arpa/inet.h>
-#include <linux/if_ether.h>
-/* ... */
-
-/* From the example above: tcpdump -i em1 port 22 -dd */
-struct sock_filter code[] = {
- { 0x28, 0, 0, 0x0000000c },
- { 0x15, 0, 8, 0x000086dd },
- { 0x30, 0, 0, 0x00000014 },
- { 0x15, 2, 0, 0x00000084 },
- { 0x15, 1, 0, 0x00000006 },
- { 0x15, 0, 17, 0x00000011 },
- { 0x28, 0, 0, 0x00000036 },
- { 0x15, 14, 0, 0x00000016 },
- { 0x28, 0, 0, 0x00000038 },
- { 0x15, 12, 13, 0x00000016 },
- { 0x15, 0, 12, 0x00000800 },
- { 0x30, 0, 0, 0x00000017 },
- { 0x15, 2, 0, 0x00000084 },
- { 0x15, 1, 0, 0x00000006 },
- { 0x15, 0, 8, 0x00000011 },
- { 0x28, 0, 0, 0x00000014 },
- { 0x45, 6, 0, 0x00001fff },
- { 0xb1, 0, 0, 0x0000000e },
- { 0x48, 0, 0, 0x0000000e },
- { 0x15, 2, 0, 0x00000016 },
- { 0x48, 0, 0, 0x00000010 },
- { 0x15, 0, 1, 0x00000016 },
- { 0x06, 0, 0, 0x0000ffff },
- { 0x06, 0, 0, 0x00000000 },
-};
-
-struct sock_fprog bpf = {
- .len = ARRAY_SIZE(code),
- .filter = code,
-};
-
-sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
-if (sock < 0)
- /* ... bail out ... */
-
-ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));
-if (ret < 0)
- /* ... bail out ... */
-
-/* ... */
-close(sock);
-
-The above example code attaches a socket filter for a PF_PACKET socket
-in order to let all IPv4/IPv6 packets with port 22 pass. The rest will
-be dropped for this socket.
-
-The setsockopt(2) call to SO_DETACH_FILTER doesn't need any arguments
-and SO_LOCK_FILTER for preventing the filter to be detached, takes an
-integer value with 0 or 1.
-
-Note that socket filters are not restricted to PF_PACKET sockets only,
-but can also be used on other socket families.
-
-Summary of system calls:
-
- * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val));
- * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val));
- * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER, &val, sizeof(val));
-
-Normally, most use cases for socket filtering on packet sockets will be
-covered by libpcap in high-level syntax, so as an application developer
-you should stick to that. libpcap wraps its own layer around all that.
-
-Unless i) using/linking to libpcap is not an option, ii) the required BPF
-filters use Linux extensions that are not supported by libpcap's compiler,
-iii) a filter might be more complex and not cleanly implementable with
-libpcap's compiler, or iv) particular filter codes should be optimized
-differently than libpcap's internal compiler does; then in such cases
-writing such a filter "by hand" can be of an alternative. For example,
-xt_bpf and cls_bpf users might have requirements that could result in
-more complex filter code, or one that cannot be expressed with libpcap
-(e.g. different return codes for various code paths). Moreover, BPF JIT
-implementors may wish to manually write test cases and thus need low-level
-access to BPF code as well.
-
-BPF engine and instruction set
-------------------------------
-
-Under tools/bpf/ there's a small helper tool called bpf_asm which can
-be used to write low-level filters for example scenarios mentioned in the
-previous section. Asm-like syntax mentioned here has been implemented in
-bpf_asm and will be used for further explanations (instead of dealing with
-less readable opcodes directly, principles are the same). The syntax is
-closely modelled after Steven McCanne's and Van Jacobson's BPF paper.
-
-The BPF architecture consists of the following basic elements:
-
- Element Description
-
- A 32 bit wide accumulator
- X 32 bit wide X register
- M[] 16 x 32 bit wide misc registers aka "scratch memory
- store", addressable from 0 to 15
-
-A program, that is translated by bpf_asm into "opcodes" is an array that
-consists of the following elements (as already mentioned):
-
- op:16, jt:8, jf:8, k:32
-
-The element op is a 16 bit wide opcode that has a particular instruction
-encoded. jt and jf are two 8 bit wide jump targets, one for condition
-"jump if true", the other one "jump if false". Eventually, element k
-contains a miscellaneous argument that can be interpreted in different
-ways depending on the given instruction in op.
-
-The instruction set consists of load, store, branch, alu, miscellaneous
-and return instructions that are also represented in bpf_asm syntax. This
-table lists all bpf_asm instructions available resp. what their underlying
-opcodes as defined in linux/filter.h stand for:
-
- Instruction Addressing mode Description
-
- ld 1, 2, 3, 4, 12 Load word into A
- ldi 4 Load word into A
- ldh 1, 2 Load half-word into A
- ldb 1, 2 Load byte into A
- ldx 3, 4, 5, 12 Load word into X
- ldxi 4 Load word into X
- ldxb 5 Load byte into X
-
- st 3 Store A into M[]
- stx 3 Store X into M[]
-
- jmp 6 Jump to label
- ja 6 Jump to label
- jeq 7, 8, 9, 10 Jump on A == <x>
- jneq 9, 10 Jump on A != <x>
- jne 9, 10 Jump on A != <x>
- jlt 9, 10 Jump on A < <x>
- jle 9, 10 Jump on A <= <x>
- jgt 7, 8, 9, 10 Jump on A > <x>
- jge 7, 8, 9, 10 Jump on A >= <x>
- jset 7, 8, 9, 10 Jump on A & <x>
-
- add 0, 4 A + <x>
- sub 0, 4 A - <x>
- mul 0, 4 A * <x>
- div 0, 4 A / <x>
- mod 0, 4 A % <x>
- neg !A
- and 0, 4 A & <x>
- or 0, 4 A | <x>
- xor 0, 4 A ^ <x>
- lsh 0, 4 A << <x>
- rsh 0, 4 A >> <x>
-
- tax Copy A into X
- txa Copy X into A
-
- ret 4, 11 Return
-
-The next table shows addressing formats from the 2nd column:
-
- Addressing mode Syntax Description
-
- 0 x/%x Register X
- 1 [k] BHW at byte offset k in the packet
- 2 [x + k] BHW at the offset X + k in the packet
- 3 M[k] Word at offset k in M[]
- 4 #k Literal value stored in k
- 5 4*([k]&0xf) Lower nibble * 4 at byte offset k in the packet
- 6 L Jump label L
- 7 #k,Lt,Lf Jump to Lt if true, otherwise jump to Lf
- 8 x/%x,Lt,Lf Jump to Lt if true, otherwise jump to Lf
- 9 #k,Lt Jump to Lt if predicate is true
- 10 x/%x,Lt Jump to Lt if predicate is true
- 11 a/%a Accumulator A
- 12 extension BPF extension
-
-The Linux kernel also has a couple of BPF extensions that are used along
-with the class of load instructions by "overloading" the k argument with
-a negative offset + a particular extension offset. The result of such BPF
-extensions are loaded into A.
-
-Possible BPF extensions are shown in the following table:
-
- Extension Description
-
- len skb->len
- proto skb->protocol
- type skb->pkt_type
- poff Payload start offset
- ifidx skb->dev->ifindex
- nla Netlink attribute of type X with offset A
- nlan Nested Netlink attribute of type X with offset A
- mark skb->mark
- queue skb->queue_mapping
- hatype skb->dev->type
- rxhash skb->hash
- cpu raw_smp_processor_id()
- vlan_tci skb_vlan_tag_get(skb)
- vlan_avail skb_vlan_tag_present(skb)
- vlan_tpid skb->vlan_proto
- rand prandom_u32()
-
-These extensions can also be prefixed with '#'.
-Examples for low-level BPF:
-
-** ARP packets:
-
- ldh [12]
- jne #0x806, drop
- ret #-1
- drop: ret #0
-
-** IPv4 TCP packets:
-
- ldh [12]
- jne #0x800, drop
- ldb [23]
- jneq #6, drop
- ret #-1
- drop: ret #0
-
-** (Accelerated) VLAN w/ id 10:
-
- ld vlan_tci
- jneq #10, drop
- ret #-1
- drop: ret #0
-
-** icmp random packet sampling, 1 in 4
- ldh [12]
- jne #0x800, drop
- ldb [23]
- jneq #1, drop
- # get a random uint32 number
- ld rand
- mod #4
- jneq #1, drop
- ret #-1
- drop: ret #0
-
-** SECCOMP filter example:
-
- ld [4] /* offsetof(struct seccomp_data, arch) */
- jne #0xc000003e, bad /* AUDIT_ARCH_X86_64 */
- ld [0] /* offsetof(struct seccomp_data, nr) */
- jeq #15, good /* __NR_rt_sigreturn */
- jeq #231, good /* __NR_exit_group */
- jeq #60, good /* __NR_exit */
- jeq #0, good /* __NR_read */
- jeq #1, good /* __NR_write */
- jeq #5, good /* __NR_fstat */
- jeq #9, good /* __NR_mmap */
- jeq #14, good /* __NR_rt_sigprocmask */
- jeq #13, good /* __NR_rt_sigaction */
- jeq #35, good /* __NR_nanosleep */
- bad: ret #0 /* SECCOMP_RET_KILL_THREAD */
- good: ret #0x7fff0000 /* SECCOMP_RET_ALLOW */
-
-The above example code can be placed into a file (here called "foo"), and
-then be passed to the bpf_asm tool for generating opcodes, output that xt_bpf
-and cls_bpf understands and can directly be loaded with. Example with above
-ARP code:
-
-$ ./bpf_asm foo
-4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0,
-
-In copy and paste C-like output:
-
-$ ./bpf_asm -c foo
-{ 0x28, 0, 0, 0x0000000c },
-{ 0x15, 0, 1, 0x00000806 },
-{ 0x06, 0, 0, 0xffffffff },
-{ 0x06, 0, 0, 0000000000 },
-
-In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF
-filters that might not be obvious at first, it's good to test filters before
-attaching to a live system. For that purpose, there's a small tool called
-bpf_dbg under tools/bpf/ in the kernel source directory. This debugger allows
-for testing BPF filters against given pcap files, single stepping through the
-BPF code on the pcap's packets and to do BPF machine register dumps.
-
-Starting bpf_dbg is trivial and just requires issuing:
-
-# ./bpf_dbg
-
-In case input and output do not equal stdin/stdout, bpf_dbg takes an
-alternative stdin source as a first argument, and an alternative stdout
-sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`.
-
-Other than that, a particular libreadline configuration can be set via
-file "~/.bpf_dbg_init" and the command history is stored in the file
-"~/.bpf_dbg_history".
-
-Interaction in bpf_dbg happens through a shell that also has auto-completion
-support (follow-up example commands starting with '>' denote bpf_dbg shell).
-The usual workflow would be to ...
-
-> load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0
- Loads a BPF filter from standard output of bpf_asm, or transformed via
- e.g. `tcpdump -iem1 -ddd port 22 | tr '\n' ','`. Note that for JIT
- debugging (next section), this command creates a temporary socket and
- loads the BPF code into the kernel. Thus, this will also be useful for
- JIT developers.
-
-> load pcap foo.pcap
- Loads standard tcpdump pcap file.
-
-> run [<n>]
-bpf passes:1 fails:9
- Runs through all packets from a pcap to account how many passes and fails
- the filter will generate. A limit of packets to traverse can be given.
-
-> disassemble
-l0: ldh [12]
-l1: jeq #0x800, l2, l5
-l2: ldb [23]
-l3: jeq #0x1, l4, l5
-l4: ret #0xffff
-l5: ret #0
- Prints out BPF code disassembly.
-
-> dump
-/* { op, jt, jf, k }, */
-{ 0x28, 0, 0, 0x0000000c },
-{ 0x15, 0, 3, 0x00000800 },
-{ 0x30, 0, 0, 0x00000017 },
-{ 0x15, 0, 1, 0x00000001 },
-{ 0x06, 0, 0, 0x0000ffff },
-{ 0x06, 0, 0, 0000000000 },
- Prints out C-style BPF code dump.
-
-> breakpoint 0
-breakpoint at: l0: ldh [12]
-> breakpoint 1
-breakpoint at: l1: jeq #0x800, l2, l5
- ...
- Sets breakpoints at particular BPF instructions. Issuing a `run` command
- will walk through the pcap file continuing from the current packet and
- break when a breakpoint is being hit (another `run` will continue from
- the currently active breakpoint executing next instructions):
-
- > run
- -- register dump --
- pc: [0] <-- program counter
- code: [40] jt[0] jf[0] k[12] <-- plain BPF code of current instruction
- curr: l0: ldh [12] <-- disassembly of current instruction
- A: [00000000][0] <-- content of A (hex, decimal)
- X: [00000000][0] <-- content of X (hex, decimal)
- M[0,15]: [00000000][0] <-- folded content of M (hex, decimal)
- -- packet dump -- <-- Current packet from pcap (hex)
- len: 42
- 0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01
- 16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26
- 32: 00 00 00 00 00 00 0a 3b 01 01
- (breakpoint)
- >
-
-> breakpoint
-breakpoints: 0 1
- Prints currently set breakpoints.
-
-> step [-<n>, +<n>]
- Performs single stepping through the BPF program from the current pc
- offset. Thus, on each step invocation, above register dump is issued.
- This can go forwards and backwards in time, a plain `step` will break
- on the next BPF instruction, thus +1. (No `run` needs to be issued here.)
-
-> select <n>
- Selects a given packet from the pcap file to continue from. Thus, on
- the next `run` or `step`, the BPF program is being evaluated against
- the user pre-selected packet. Numbering starts just as in Wireshark
- with index 1.
-
-> quit
-#
- Exits bpf_dbg.
-
-JIT compiler
-------------
-
-The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC,
-PowerPC, ARM, ARM64, MIPS, RISC-V and s390 and can be enabled through
-CONFIG_BPF_JIT. The JIT compiler is transparently invoked for each
-attached filter from user space or for internal kernel users if it has
-been previously enabled by root:
-
- echo 1 > /proc/sys/net/core/bpf_jit_enable
-
-For JIT developers, doing audits etc, each compile run can output the generated
-opcode image into the kernel log via:
-
- echo 2 > /proc/sys/net/core/bpf_jit_enable
-
-Example output from dmesg:
-
-[ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f
-[ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68
-[ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00
-[ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00
-[ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00
-[ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3
-
-When CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently set to 1 and
-setting any other value than that will return in failure. This is even the case for
-setting bpf_jit_enable to 2, since dumping the final JIT image into the kernel log
-is discouraged and introspection through bpftool (under tools/bpf/bpftool/) is the
-generally recommended approach instead.
-
-In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for
-generating disassembly out of the kernel log's hexdump:
-
-# ./bpf_jit_disasm
-70 bytes emitted from JIT compiler (pass:3, flen:6)
-ffffffffa0069c8f + <x>:
- 0: push %rbp
- 1: mov %rsp,%rbp
- 4: sub $0x60,%rsp
- 8: mov %rbx,-0x8(%rbp)
- c: mov 0x68(%rdi),%r9d
- 10: sub 0x6c(%rdi),%r9d
- 14: mov 0xd8(%rdi),%r8
- 1b: mov $0xc,%esi
- 20: callq 0xffffffffe0ff9442
- 25: cmp $0x800,%eax
- 2a: jne 0x0000000000000042
- 2c: mov $0x17,%esi
- 31: callq 0xffffffffe0ff945e
- 36: cmp $0x1,%eax
- 39: jne 0x0000000000000042
- 3b: mov $0xffff,%eax
- 40: jmp 0x0000000000000044
- 42: xor %eax,%eax
- 44: leaveq
- 45: retq
-
-Issuing option `-o` will "annotate" opcodes to resulting assembler
-instructions, which can be very useful for JIT developers:
-
-# ./bpf_jit_disasm -o
-70 bytes emitted from JIT compiler (pass:3, flen:6)
-ffffffffa0069c8f + <x>:
- 0: push %rbp
- 55
- 1: mov %rsp,%rbp
- 48 89 e5
- 4: sub $0x60,%rsp
- 48 83 ec 60
- 8: mov %rbx,-0x8(%rbp)
- 48 89 5d f8
- c: mov 0x68(%rdi),%r9d
- 44 8b 4f 68
- 10: sub 0x6c(%rdi),%r9d
- 44 2b 4f 6c
- 14: mov 0xd8(%rdi),%r8
- 4c 8b 87 d8 00 00 00
- 1b: mov $0xc,%esi
- be 0c 00 00 00
- 20: callq 0xffffffffe0ff9442
- e8 1d 94 ff e0
- 25: cmp $0x800,%eax
- 3d 00 08 00 00
- 2a: jne 0x0000000000000042
- 75 16
- 2c: mov $0x17,%esi
- be 17 00 00 00
- 31: callq 0xffffffffe0ff945e
- e8 28 94 ff e0
- 36: cmp $0x1,%eax
- 83 f8 01
- 39: jne 0x0000000000000042
- 75 07
- 3b: mov $0xffff,%eax
- b8 ff ff 00 00
- 40: jmp 0x0000000000000044
- eb 02
- 42: xor %eax,%eax
- 31 c0
- 44: leaveq
- c9
- 45: retq
- c3
-
-For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provides a useful
-toolchain for developing and testing the kernel's JIT compiler.
-
-BPF kernel internals
---------------------
-Internally, for the kernel interpreter, a different instruction set
-format with similar underlying principles from BPF described in previous
-paragraphs is being used. However, the instruction set format is modelled
-closer to the underlying architecture to mimic native instruction sets, so
-that a better performance can be achieved (more details later). This new
-ISA is called 'eBPF' or 'internal BPF' interchangeably. (Note: eBPF which
-originates from [e]xtended BPF is not the same as BPF extensions! While
-eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading'
-of BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.)
-
-It is designed to be JITed with one to one mapping, which can also open up
-the possibility for GCC/LLVM compilers to generate optimized eBPF code through
-an eBPF backend that performs almost as fast as natively compiled code.
-
-The new instruction set was originally designed with the possible goal in
-mind to write programs in "restricted C" and compile into eBPF with a optional
-GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with
-minimal performance overhead over two steps, that is, C -> eBPF -> native code.
-
-Currently, the new format is being used for running user BPF programs, which
-includes seccomp BPF, classic socket filters, cls_bpf traffic classifier,
-team driver's classifier for its load-balancing mode, netfilter's xt_bpf
-extension, PTP dissector/classifier, and much more. They are all internally
-converted by the kernel into the new instruction set representation and run
-in the eBPF interpreter. For in-kernel handlers, this all works transparently
-by using bpf_prog_create() for setting up the filter, resp.
-bpf_prog_destroy() for destroying it. The macro
-BPF_PROG_RUN(filter, ctx) transparently invokes eBPF interpreter or JITed
-code to run the filter. 'filter' is a pointer to struct bpf_prog that we
-got from bpf_prog_create(), and 'ctx' the given context (e.g.
-skb pointer). All constraints and restrictions from bpf_check_classic() apply
-before a conversion to the new layout is being done behind the scenes!
-
-Currently, the classic BPF format is being used for JITing on most
-32-bit architectures, whereas x86-64, aarch64, s390x, powerpc64,
-sparc64, arm32, riscv (RV64G) perform JIT compilation from eBPF
-instruction set.
-
-Some core changes of the new internal format:
-
-- Number of registers increase from 2 to 10:
-
- The old format had two registers A and X, and a hidden frame pointer. The
- new layout extends this to be 10 internal registers and a read-only frame
- pointer. Since 64-bit CPUs are passing arguments to functions via registers
- the number of args from eBPF program to in-kernel function is restricted
- to 5 and one register is used to accept return value from an in-kernel
- function. Natively, x86_64 passes first 6 arguments in registers, aarch64/
- sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved
- registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers.
-
- Therefore, eBPF calling convention is defined as:
-
- * R0 - return value from in-kernel function, and exit value for eBPF program
- * R1 - R5 - arguments from eBPF program to in-kernel function
- * R6 - R9 - callee saved registers that in-kernel function will preserve
- * R10 - read-only frame pointer to access stack
-
- Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64,
- etc, and eBPF calling convention maps directly to ABIs used by the kernel on
- 64-bit architectures.
-
- On 32-bit architectures JIT may map programs that use only 32-bit arithmetic
- and may let more complex programs to be interpreted.
-
- R0 - R5 are scratch registers and eBPF program needs spill/fill them if
- necessary across calls. Note that there is only one eBPF program (== one
- eBPF main routine) and it cannot call other eBPF functions, it can only
- call predefined in-kernel functions, though.
-
-- Register width increases from 32-bit to 64-bit:
-
- Still, the semantics of the original 32-bit ALU operations are preserved
- via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower
- subregisters that zero-extend into 64-bit if they are being written to.
- That behavior maps directly to x86_64 and arm64 subregister definition, but
- makes other JITs more difficult.
-
- 32-bit architectures run 64-bit internal BPF programs via interpreter.
- Their JITs may convert BPF programs that only use 32-bit subregisters into
- native instruction set and let the rest being interpreted.
-
- Operation is 64-bit, because on 64-bit architectures, pointers are also
- 64-bit wide, and we want to pass 64-bit values in/out of kernel functions,
- so 32-bit eBPF registers would otherwise require to define register-pair
- ABI, thus, there won't be able to use a direct eBPF register to HW register
- mapping and JIT would need to do combine/split/move operations for every
- register in and out of the function, which is complex, bug prone and slow.
- Another reason is the use of atomic 64-bit counters.
-
-- Conditional jt/jf targets replaced with jt/fall-through:
-
- While the original design has constructs such as "if (cond) jump_true;
- else jump_false;", they are replaced with alternative constructs like
- "if (cond) jump_true; /* else fall-through */".
-
-- Introduces bpf_call insn and register passing convention for zero overhead
- calls from/to other kernel functions:
-
- Before an in-kernel function call, the internal BPF program needs to
- place function arguments into R1 to R5 registers to satisfy calling
- convention, then the interpreter will take them from registers and pass
- to in-kernel function. If R1 - R5 registers are mapped to CPU registers
- that are used for argument passing on given architecture, the JIT compiler
- doesn't need to emit extra moves. Function arguments will be in the correct
- registers and BPF_CALL instruction will be JITed as single 'call' HW
- instruction. This calling convention was picked to cover common call
- situations without performance penalty.
-
- After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has
- a return value of the function. Since R6 - R9 are callee saved, their state
- is preserved across the call.
-
- For example, consider three C functions:
-
- u64 f1() { return (*_f2)(1); }
- u64 f2(u64 a) { return f3(a + 1, a); }
- u64 f3(u64 a, u64 b) { return a - b; }
-
- GCC can compile f1, f3 into x86_64:
-
- f1:
- movl $1, %edi
- movq _f2(%rip), %rax
- jmp *%rax
- f3:
- movq %rdi, %rax
- subq %rsi, %rax
- ret
-
- Function f2 in eBPF may look like:
-
- f2:
- bpf_mov R2, R1
- bpf_add R1, 1
- bpf_call f3
- bpf_exit
-
- If f2 is JITed and the pointer is stored to '_f2', the calls f1 -> f2 -> f3
- and returns will be seamless. Without JIT, the __bpf_prog_run() interpreter
- needs to be used to call into f2.
-
- For practical reasons all eBPF programs have only one argument 'ctx' which is
- already placed into R1 (e.g. on __bpf_prog_run() startup) and the programs
- can call kernel functions with up to 5 arguments. Calls with 6 or more arguments
- are currently not supported, but these restrictions can be lifted if necessary
- in the future.
-
- On 64-bit architectures all registers map to HW registers one to one. For
- example, x86_64 JIT compiler can map them as ...
-
- R0 - rax
- R1 - rdi
- R2 - rsi
- R3 - rdx
- R4 - rcx
- R5 - r8
- R6 - rbx
- R7 - r13
- R8 - r14
- R9 - r15
- R10 - rbp
-
- ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing
- and rbx, r12 - r15 are callee saved.
-
- Then the following internal BPF pseudo-program:
-
- bpf_mov R6, R1 /* save ctx */
- bpf_mov R2, 2
- bpf_mov R3, 3
- bpf_mov R4, 4
- bpf_mov R5, 5
- bpf_call foo
- bpf_mov R7, R0 /* save foo() return value */
- bpf_mov R1, R6 /* restore ctx for next call */
- bpf_mov R2, 6
- bpf_mov R3, 7
- bpf_mov R4, 8
- bpf_mov R5, 9
- bpf_call bar
- bpf_add R0, R7
- bpf_exit
-
- After JIT to x86_64 may look like:
-
- push %rbp
- mov %rsp,%rbp
- sub $0x228,%rsp
- mov %rbx,-0x228(%rbp)
- mov %r13,-0x220(%rbp)
- mov %rdi,%rbx
- mov $0x2,%esi
- mov $0x3,%edx
- mov $0x4,%ecx
- mov $0x5,%r8d
- callq foo
- mov %rax,%r13
- mov %rbx,%rdi
- mov $0x6,%esi
- mov $0x7,%edx
- mov $0x8,%ecx
- mov $0x9,%r8d
- callq bar
- add %r13,%rax
- mov -0x228(%rbp),%rbx
- mov -0x220(%rbp),%r13
- leaveq
- retq
-
- Which is in this example equivalent in C to:
-
- u64 bpf_filter(u64 ctx)
- {
- return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9);
- }
-
- In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64
- arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper
- registers and place their return value into '%rax' which is R0 in eBPF.
- Prologue and epilogue are emitted by JIT and are implicit in the
- interpreter. R0-R5 are scratch registers, so an eBPF program needs to
- preserve them across calls itself, as defined by the calling convention.
-
- For example the following program is invalid:
-
- bpf_mov R1, 1
- bpf_call foo
- bpf_mov R0, R1
- bpf_exit
-
- After the call the registers R1-R5 contain junk values and cannot be read.
- An in-kernel eBPF verifier is used to validate internal BPF programs.
-
-Also in the new design, eBPF is limited to 4096 insns, which means that any
-program will terminate quickly and will only call a fixed number of kernel
-functions. Both original BPF and the new format use two-operand instructions,
-which helps to do a one-to-one mapping between eBPF insn and x86 insn during JIT.
-
-The input context pointer for invoking the interpreter function is generic,
-its content is defined by a specific use case. For seccomp register R1 points
-to seccomp_data, for converted BPF filters R1 points to a skb.
-
-A program that is translated internally consists of the following elements:
-
- op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32
-
-So far 87 internal BPF instructions were implemented. The 8-bit 'op' opcode
-field has room for new instructions. Some of them may use 16/24/32 byte
-encoding. New instructions must be a multiple of 8 bytes to preserve backward
-compatibility.
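The new 8-byte layout given above (op:8, dst_reg:4, src_reg:4, off:16, imm:32) can be mirrored in a small C struct. This is only an illustrative sketch: the names `insn_sketch`, `insn_make`, `insn_dst` and `insn_src` are hypothetical, the signedness of `off` and `imm` is an assumption, and placing dst_reg in the low nibble of the register byte is an assumption as well.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the 64-bit instruction word:
 * op:8, dst_reg:4, src_reg:4, off:16, imm:32. */
struct insn_sketch {
	uint8_t op;   /* 8-bit opcode */
	uint8_t regs; /* assumed: dst_reg in low nibble, src_reg in high nibble */
	int16_t off;  /* 16-bit offset (signedness assumed) */
	int32_t imm;  /* 32-bit immediate (signedness assumed) */
};

/* Build one instruction; register numbers must fit in 4 bits. */
static inline struct insn_sketch insn_make(uint8_t op, uint8_t dst,
					   uint8_t src, int16_t off,
					   int32_t imm)
{
	struct insn_sketch i;

	assert(dst < 16 && src < 16);
	i.op = op;
	i.regs = (uint8_t)(dst | (src << 4));
	i.off = off;
	i.imm = imm;
	return i;
}

/* Extract the destination register number. */
static inline uint8_t insn_dst(const struct insn_sketch *i)
{
	return i->regs & 0x0f;
}

/* Extract the source register number. */
static inline uint8_t insn_src(const struct insn_sketch *i)
{
	return i->regs >> 4;
}
```

With natural alignment the four fields add up to exactly 8 bytes, which is why longer encodings are required to be a multiple of 8 bytes.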
-
-Internal BPF is a general purpose RISC instruction set. Not every register and
-every instruction is used during translation from original BPF to new format.
-For example, socket filters do not use the 'exclusive add' instruction, but
-tracing filters may use it to maintain counters of events. Register R9
-is not used by socket filters either, but more complex filters may run
-out of registers and would have to resort to spill/fill to stack.
-
-Internal BPF can be used as a generic assembler for last step performance
-optimizations; socket filters and seccomp are using it as an assembler. Tracing
-filters may use it as an assembler to generate code from the kernel. In-kernel
-usage may not be bounded by security considerations, since generated internal
-BPF code may be optimizing an internal code path and not be exposed to user
-space. Safety of internal BPF can come from a verifier (TBD). In such use cases
-as described, it may be used as a safe instruction set.
-
-Just like the original BPF, the new format runs within a controlled environment,
-is deterministic and the kernel can easily prove that. The safety of the program
-can be determined in two steps: first step does depth-first-search to disallow
-loops and other CFG validation; second step starts from the first insn and
-descends all possible paths. It simulates execution of every insn and observes
-the state change of registers and stack.
-
-eBPF opcode encoding
---------------------
-
-eBPF is reusing most of the opcode encoding from classic to simplify conversion
-of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code'
-field is divided into three parts:
-
- +----------------+--------+--------------------+
- | 4 bits | 1 bit | 3 bits |
- | operation code | source | instruction class |
- +----------------+--------+--------------------+
- (MSB) (LSB)
-
-Three LSB bits store instruction class which is one of:
-
- Classic BPF classes: eBPF classes:
-
- BPF_LD 0x00 BPF_LD 0x00
- BPF_LDX 0x01 BPF_LDX 0x01
- BPF_ST 0x02 BPF_ST 0x02
- BPF_STX 0x03 BPF_STX 0x03
- BPF_ALU 0x04 BPF_ALU 0x04
- BPF_JMP 0x05 BPF_JMP 0x05
- BPF_RET 0x06 BPF_JMP32 0x06
- BPF_MISC 0x07 BPF_ALU64 0x07
-
-When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ...
-
- BPF_K 0x00
- BPF_X 0x08
-
- * in classic BPF, this means:
-
- BPF_SRC(code) == BPF_X - use register X as source operand
- BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
-
- * in eBPF, this means:
-
- BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
- BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
-
-... and four MSB bits store operation code.
-
-If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of:
-
- BPF_ADD 0x00
- BPF_SUB 0x10
- BPF_MUL 0x20
- BPF_DIV 0x30
- BPF_OR 0x40
- BPF_AND 0x50
- BPF_LSH 0x60
- BPF_RSH 0x70
- BPF_NEG 0x80
- BPF_MOD 0x90
- BPF_XOR 0xa0
- BPF_MOV 0xb0 /* eBPF only: mov reg to reg */
- BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */
- BPF_END 0xd0 /* eBPF only: endianness conversion */
-
-If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of:
-
- BPF_JA 0x00 /* BPF_JMP only */
- BPF_JEQ 0x10
- BPF_JGT 0x20
- BPF_JGE 0x30
- BPF_JSET 0x40
- BPF_JNE 0x50 /* eBPF only: jump != */
- BPF_JSGT 0x60 /* eBPF only: signed '>' */
- BPF_JSGE 0x70 /* eBPF only: signed '>=' */
- BPF_CALL 0x80 /* eBPF BPF_JMP only: function call */
- BPF_EXIT 0x90 /* eBPF BPF_JMP only: function return */
- BPF_JLT 0xa0 /* eBPF only: unsigned '<' */
- BPF_JLE 0xb0 /* eBPF only: unsigned '<=' */
- BPF_JSLT 0xc0 /* eBPF only: signed '<' */
- BPF_JSLE 0xd0 /* eBPF only: signed '<=' */
-
-So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF
-and eBPF. There are only two registers in classic BPF, so it means A += X.
-In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly,
-BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and the analogous
-dst_reg = (u32) dst_reg ^ (u32) imm32 in eBPF.
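The 4/1/3 bit split of the opcode byte can be sketched with a few masks. The `SK_`-prefixed names below are hypothetical stand-ins chosen to avoid clashing with real headers; the numeric values are the ones listed in the tables above.

```c
#include <assert.h>
#include <stdint.h>

/* Field masks for arithmetic and jump opcodes (4-bit op, 1-bit source,
 * 3-bit class). Hypothetical names; values taken from the tables above. */
#define SK_CLASS(code) ((code) & 0x07)
#define SK_SRC(code)   ((code) & 0x08)
#define SK_OP(code)    ((code) & 0xf0)

enum { SK_ALU = 0x04, SK_JMP = 0x05, SK_JMP32 = 0x06, SK_ALU64 = 0x07 };
enum { SK_K = 0x00, SK_X = 0x08 };
enum { SK_ADD = 0x00, SK_JEQ = 0x10, SK_EXIT = 0x90, SK_XOR = 0xa0,
       SK_MOV = 0xb0 };

/* Combine operation, source selector and class into one opcode byte. */
static inline uint8_t sk_encode(uint8_t op, uint8_t src, uint8_t cls)
{
	/* Each part must stay inside its own bit field. */
	assert((op & 0x0f) == 0);
	assert(src == SK_K || src == SK_X);
	assert(cls <= 0x07);
	return (uint8_t)(op | src | cls);
}
```

For instance, `sk_encode(SK_MOV, SK_X, SK_ALU64)` yields 0xbf and `sk_encode(SK_EXIT, SK_K, SK_JMP)` yields 0x95, the same hex values that appear in parentheses in the verifier log excerpts later in this document.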
-
-Classic BPF is using BPF_MISC class to represent A = X and X = A moves.
-eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no
-BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean
-exactly the same operations as BPF_ALU, but with 64-bit wide operands
-instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.:
-dst_reg = dst_reg + src_reg
-
-Classic BPF wastes the whole BPF_RET class to represent a single 'ret'
-operation. Classic BPF_RET | BPF_K means copy imm32 into return register
-and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT
-in eBPF means function exit only. The eBPF program needs to store return
-value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as
-BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide
-operands for the comparisons instead.
-
-For load and store instructions the 8-bit 'code' field is divided as:
-
- +--------+--------+-------------------+
- | 3 bits | 2 bits | 3 bits |
- | mode | size | instruction class |
- +--------+--------+-------------------+
- (MSB) (LSB)
-
-Size modifier is one of ...
-
- BPF_W 0x00 /* word */
- BPF_H 0x08 /* half word */
- BPF_B 0x10 /* byte */
- BPF_DW 0x18 /* eBPF only, double word */
-
-... which encodes size of load/store operation:
-
- B - 1 byte
- H - 2 byte
- W - 4 byte
- DW - 8 byte (eBPF only)
-
-Mode modifier is one of:
-
- BPF_IMM 0x00 /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
- BPF_ABS 0x20
- BPF_IND 0x40
- BPF_MEM 0x60
- BPF_LEN 0x80 /* classic BPF only, reserved in eBPF */
- BPF_MSH 0xa0 /* classic BPF only, reserved in eBPF */
- BPF_XADD 0xc0 /* eBPF only, exclusive add */
-
-eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and
-(BPF_IND | <size> | BPF_LD) which are used to access packet data.
-
-They had to be carried over from classic to have strong performance of
-socket filters running in eBPF interpreter. These instructions can only
-be used when interpreter context is a pointer to 'struct sk_buff' and
-have seven implicit operands. Register R6 is an implicit input that must
-contain pointer to sk_buff. Register R0 is an implicit output which contains
-the data fetched from the packet. Registers R1-R5 are scratch registers
-and must not be used to store the data across BPF_ABS | BPF_LD or
-BPF_IND | BPF_LD instructions.
-
-These instructions have implicit program exit condition as well. When
-eBPF program is trying to access the data beyond the packet boundary,
-the interpreter will abort the execution of the program. JIT compilers
-therefore must preserve this property. src_reg and imm32 fields are
-explicit inputs to these instructions.
-
-For example:
-
- BPF_IND | BPF_W | BPF_LD means:
-
- R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32))
- and R1 - R5 were scratched.
-
-Unlike classic BPF instruction set, eBPF has generic load/store operations:
-
-BPF_MEM | <size> | BPF_STX: *(size *) (dst_reg + off) = src_reg
-BPF_MEM | <size> | BPF_ST: *(size *) (dst_reg + off) = imm32
-BPF_MEM | <size> | BPF_LDX: dst_reg = *(size *) (src_reg + off)
-BPF_XADD | BPF_W | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg
-BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg
-
-Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
-2 byte atomic increments are not supported.
-
-eBPF has one 16-byte instruction: BPF_LD | BPF_DW | BPF_IMM which consists
-of two consecutive 'struct bpf_insn' 8-byte blocks and is interpreted as a
-single instruction that loads a 64-bit immediate value into a dst_reg.
-Classic BPF has a similar instruction: BPF_LD | BPF_W | BPF_IMM which loads
-a 32-bit immediate value into a register.
-
-eBPF verifier
--------------
-The safety of the eBPF program is determined in two steps.
-
-The first step does a DAG check to disallow loops, plus other CFG validation.
-In particular it will detect programs that have unreachable instructions
-(though the classic BPF checker allows them).
-
-Second step starts from the first insn and descends all possible paths.
-It simulates execution of every insn and observes the state change of
-registers and stack.
-
-At the start of the program the register R1 contains a pointer to context
-and has type PTR_TO_CTX.
-If the verifier sees an insn that does R2=R1, then R2 now has type
-PTR_TO_CTX as well and can be used on the right hand side of an expression.
-If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=SCALAR_VALUE,
-since addition of two valid pointers makes invalid pointer.
-(In 'secure' mode verifier will reject any type of pointer arithmetic to make
-sure that kernel addresses don't leak to unprivileged users)
-
-If register was never written to, it's not readable:
- bpf_mov R0 = R2
- bpf_exit
-will be rejected, since R2 is unreadable at the start of the program.
-
-After kernel function call, R1-R5 are reset to unreadable and
-R0 has a return type of the function.
-
-Since R6-R9 are callee saved, their state is preserved across the call.
- bpf_mov R6 = 1
- bpf_call foo
- bpf_mov R0 = R6
- bpf_exit
-is a correct program. If there was R1 instead of R6, it would have
-been rejected.
-
-load/store instructions are allowed only with registers of valid types, which
-are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked.
-For example:
- bpf_mov R1 = 1
- bpf_mov R2 = 2
- bpf_xadd *(u32 *)(R1 + 3) += R2
- bpf_exit
-will be rejected, since R1 doesn't have a valid pointer type at the time of
-execution of instruction bpf_xadd.
-
-At the start R1 type is PTR_TO_CTX (a pointer to generic 'struct bpf_context').
-A callback is used to customize verifier to restrict eBPF program access to only
-certain fields within ctx structure with specified size and alignment.
-
-For example, the following insn:
- bpf_ld R0 = *(u32 *)(R6 + 8)
-intends to load a word from address R6 + 8 and store it into R0.
-If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know
-that offset 8 of size 4 bytes can be accessed for reading, otherwise
-the verifier will reject the program.
-If R6=PTR_TO_STACK, then access should be aligned and be within
-stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8,
-so it will fail verification, since it's out of bounds.
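The stack rule described above (access must be aligned and lie within [-MAX_BPF_STACK, 0)) can be sketched as a small predicate. This is a simplified illustration, not the verifier's code: `sk_stack_access_ok` and `SK_MAX_STACK` are hypothetical names, and the value 512 for the stack size is an assumption that this text does not state.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_MAX_STACK 512 /* assumed stand-in for MAX_BPF_STACK */

/* True if a size-byte access at frame-pointer offset off lies entirely
 * within [-SK_MAX_STACK, 0) and is naturally aligned. */
static inline bool sk_stack_access_ok(int off, int size)
{
	assert(size == 1 || size == 2 || size == 4 || size == 8);

	/* Out of bounds below the stack or at/above the frame pointer. */
	if (off < -SK_MAX_STACK || off + size > 0)
		return false;

	/* Natural alignment relative to the frame pointer. */
	return off % size == 0;
}
```

Under these assumptions the example from the text, a 4-byte access at offset 8, is rejected because 8 + 4 is above the frame pointer, while a 4-byte access at offset -4 passes the bounds and alignment checks.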
-
-The verifier will allow eBPF program to read data from stack only after
-it wrote into it.
-Classic BPF verifier does similar check with M[0-15] memory slots.
-For example:
- bpf_ld R0 = *(u32 *)(R10 - 4)
- bpf_exit
-is an invalid program.
-Though R10 is a correct read-only register and has type PTR_TO_STACK
-and R10 - 4 is within stack bounds, there were no stores into that location.
-
-Pointer register spill/fill is tracked as well, since four (R6-R9)
-callee saved registers may not be enough for some programs.
-
-Allowed function calls are customized with bpf_verifier_ops->get_func_proto().
-The eBPF verifier will check that registers match argument constraints.
-After the call register R0 will be set to return type of the function.
-
-Function calls are the main mechanism to extend functionality of eBPF programs.
-Socket filters may let programs call one set of functions, whereas tracing
-filters may allow a completely different set.
-
-If a function is made accessible to an eBPF program, it needs to be thought
-through from a safety point of view. The verifier will guarantee that the
-function is called with valid arguments.
-
-Seccomp and socket filters have different security restrictions for classic
-BPF. Seccomp solves this with a two-stage verifier: the classic BPF verifier is
-followed by the seccomp verifier. In case of eBPF one configurable verifier is
-shared for all use cases.
-
-See details of eBPF verifier in kernel/bpf/verifier.c
-
-Register value tracking
------------------------
-In order to determine the safety of an eBPF program, the verifier must track
-the range of possible values in each register and also in each stack slot.
-This is done with 'struct bpf_reg_state', defined in include/linux/
-bpf_verifier.h, which unifies tracking of scalar and pointer values. Each
-register state has a type, which is either NOT_INIT (the register has not been
-written to), SCALAR_VALUE (some value which is not usable as a pointer), or a
-pointer type. The types of pointers describe their base, as follows:
- PTR_TO_CTX Pointer to bpf_context.
- CONST_PTR_TO_MAP Pointer to struct bpf_map. "Const" because arithmetic
- on these pointers is forbidden.
- PTR_TO_MAP_VALUE Pointer to the value stored in a map element.
- PTR_TO_MAP_VALUE_OR_NULL
- Either a pointer to a map value, or NULL; map accesses
- (see section 'eBPF maps', below) return this type,
- which becomes a PTR_TO_MAP_VALUE when checked != NULL.
- Arithmetic on these pointers is forbidden.
- PTR_TO_STACK Frame pointer.
- PTR_TO_PACKET skb->data.
- PTR_TO_PACKET_END skb->data + headlen; arithmetic forbidden.
- PTR_TO_SOCKET Pointer to struct bpf_sock_ops, implicitly refcounted.
- PTR_TO_SOCKET_OR_NULL
- Either a pointer to a socket, or NULL; socket lookup
- returns this type, which becomes a PTR_TO_SOCKET when
- checked != NULL. PTR_TO_SOCKET is reference-counted,
- so programs must release the reference through the
- socket release function before the end of the program.
- Arithmetic on these pointers is forbidden.
-However, a pointer may be offset from this base (as a result of pointer
-arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable
-offset'. The former is used when an exactly-known value (e.g. an immediate
-operand) is added to a pointer, while the latter is used for values which are
-not exactly known. The variable offset is also used in SCALAR_VALUEs, to track
-the range of possible values in the register.
-The verifier's knowledge about the variable offset consists of:
-* minimum and maximum values as unsigned
-* minimum and maximum values as signed
-* knowledge of the values of individual bits, in the form of a 'tnum': a u64
-'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown;
-1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both
-mask and value; no bit should ever be 1 in both. For example, if a byte is read
-into a register from memory, the register's top 56 bits are known zero, while
-the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we
-then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0;
-0x1ff), because of potential carries.
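The tnum example above can be reproduced with a short sketch. The type and function names (`tnum_sketch`, `tn`, `tn_const`, `tn_or`, `tn_add`) are hypothetical, and the two operations show one way to compute results consistent with the numbers in the text; they are not presented as the kernel's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Tracked number: mask bits are unknown, value bits are known to be 1. */
struct tnum_sketch {
	uint64_t value;
	uint64_t mask;
};

/* Construct a tnum, enforcing that no bit is set in both fields. */
static inline struct tnum_sketch tn(uint64_t value, uint64_t mask)
{
	assert((value & mask) == 0);
	return (struct tnum_sketch){ value, mask };
}

/* A fully known constant has an empty mask. */
static inline struct tnum_sketch tn_const(uint64_t v)
{
	return tn(v, 0);
}

/* OR: a bit known to be 1 in either input is known to be 1 in the result;
 * the remaining bits are unknown if they were unknown in either input. */
static inline struct tnum_sketch tn_or(struct tnum_sketch a,
				       struct tnum_sketch b)
{
	uint64_t v = a.value | b.value;
	uint64_t mu = a.mask | b.mask;

	return tn(v, mu & ~v);
}

/* ADD: any bit that a carry from an unknown bit could reach is unknown. */
static inline struct tnum_sketch tn_add(struct tnum_sketch a,
					struct tnum_sketch b)
{
	uint64_t sm = a.mask + b.mask;      /* sum of the unknown parts */
	uint64_t sv = a.value + b.value;    /* sum of the known parts */
	uint64_t sigma = sm + sv;           /* largest possible sum */
	uint64_t chi = sigma ^ sv;          /* bits that carries may change */
	uint64_t mu = chi | a.mask | b.mask;

	return tn(sv & ~mu, mu);
}
```

Starting from `tn(0x0, 0xff)`, `tn_or` with `tn_const(0x40)` gives (0x40; 0xbf), and `tn_add` of that with `tn_const(1)` gives (0x0; 0x1ff), matching the worked example.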
-
-Besides arithmetic, the register state can also be updated by conditional
-branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch
-it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false'
-branch it will have a umax_value of 8. A signed compare (with BPF_JSGT or
-BPF_JSGE) would instead update the signed minimum/maximum values. Information
-from the signed and unsigned bounds can be combined; for instance if a value is
-first tested < 8 and then tested s> 4, the verifier will conclude that the value
-is also > 4 and s< 8, since the bounds prevent crossing the sign boundary.
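The effect of an unsigned "greater than" branch on the tracked bounds can be sketched as below. The struct and function names are hypothetical, and only the unsigned minimum/maximum pair is modelled; the signed bounds and the tnum are left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Unsigned range tracked for a scalar register. */
struct ubounds_sketch {
	uint64_t umin;
	uint64_t umax;
};

/* Refine the bounds for one side of "if (reg > imm)".
 * taken == true is the branch where the comparison held. */
static inline struct ubounds_sketch sk_refine_jgt(struct ubounds_sketch b,
						  uint64_t imm, bool taken)
{
	assert(b.umin <= b.umax);

	if (taken) {
		/* reg > imm, so reg >= imm + 1 (if that does not wrap). */
		if (imm != UINT64_MAX && b.umin < imm + 1)
			b.umin = imm + 1;
	} else {
		/* reg <= imm */
		if (b.umax > imm)
			b.umax = imm;
	}
	/* umin > umax afterwards would indicate an impossible branch. */
	return b;
}
```

For a completely unknown scalar compared against 8, the taken side ends up with umin = 9 and the other side with umax = 8, as in the paragraph above.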
-
-PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all
-pointers sharing that same variable offset. This is important for packet range
-checks: after adding a variable to a packet pointer register A, if you then copy
-it to another register B and then add a constant 4 to A, both registers will
-share the same 'id' but A will have a fixed offset of +4. Then if A is
-bounds-checked and found to be less than a PTR_TO_PACKET_END, the register B is
-now known to have a safe range of at least 4 bytes. See 'Direct packet access',
-below, for more on PTR_TO_PACKET ranges.
-
-The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of
-the pointer returned from a map lookup. This means that when one copy is
-checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs.
-As well as range-checking, the tracked information is also used for enforcing
-alignment of pointer accesses. For instance, on most systems the packet pointer
-is 2 bytes after a 4-byte alignment. If a program adds 14 bytes to that to jump
-over the Ethernet header, then reads IHL and adds (IHL * 4), the resulting
-pointer will have a variable offset known to be 4n+2 for some n, so adding the 2
-bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through
-that pointer are safe.
-The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common
-to all copies of the pointer returned from a socket lookup. This has similar
-behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but
-it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly
-represents a reference to the corresponding 'struct sock'. To ensure that the
-reference is not leaked, it is imperative to NULL-check the reference and, in
-the non-NULL case, pass the valid reference to the socket release function.
-
-Direct packet access
---------------------
-In cls_bpf and act_bpf programs the verifier allows direct access to the packet
-data via skb->data and skb->data_end pointers.
-Ex:
-1: r4 = *(u32 *)(r1 +80) /* load skb->data_end */
-2: r3 = *(u32 *)(r1 +76) /* load skb->data */
-3: r5 = r3
-4: r5 += 14
-5: if r5 > r4 goto pc+16
-R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
-6: r0 = *(u16 *)(r3 +12) /* access 12 and 13 bytes of the packet */
-
-This 2-byte load from the packet is safe to do, since the program author
-did check 'if (skb->data + 14 > skb->data_end) goto err' at insn #5 which
-means that in the fall-through case the register R3 (which points to skb->data)
-has at least 14 directly accessible bytes. The verifier marks it
-as R3=pkt(id=0,off=0,r=14).
-id=0 means that no additional variables were added to the register.
-off=0 means that no additional constants were added.
-r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok.
-Note that R5 is marked as R5=pkt(id=0,off=14,r=14). It also points
-to the packet data, but constant 14 was added to the register, so
-it now points to 'skb->data + 14' and accessible range is [R5, R5 + 14 - 14)
-which is zero bytes.
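The relation between 'off', 'r' and the bytes a load may touch can be sketched as a single comparison. This is a simplified model with hypothetical names (`pkt_reg_sketch`, `sk_pkt_access_ok`); it ignores variable offsets and the 'id' field.

```c
#include <assert.h>
#include <stdbool.h>

/* Fixed offset and verified range of a packet pointer register,
 * as printed in pkt(id=...,off=...,r=...). */
struct pkt_reg_sketch {
	int off;   /* constant already added to skb->data */
	int range; /* bytes from skb->data known to be accessible */
};

/* True if a size-byte load at (reg + insn_off) stays inside the range
 * [skb->data, skb->data + range). */
static inline bool sk_pkt_access_ok(struct pkt_reg_sketch r, int insn_off,
				    int size)
{
	assert(size > 0);

	int start = r.off + insn_off; /* offset relative to skb->data */

	return start >= 0 && start + size <= r.range;
}
```

With R3 = {off 0, range 14}, the 2-byte load at +12 is accepted because 12 + 2 <= 14, whereas with R5 = {off 14, range 14} even a 1-byte load at +0 is rejected, matching the "zero bytes" remark above.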
-
-More complex packet access may look like:
- R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
- 6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */
- 7: r4 = *(u8 *)(r3 +12)
- 8: r4 *= 14
- 9: r3 = *(u32 *)(r1 +76) /* load skb->data */
-10: r3 += r4
-11: r2 = r1
-12: r2 <<= 48
-13: r2 >>= 48
-14: r3 += r2
-15: r2 = r3
-16: r2 += 8
-17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */
-18: if r2 > r1 goto pc+2
- R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp
-19: r1 = *(u8 *)(r3 +4)
-The state of the register R3 is R3=pkt(id=2,off=0,r=8)
-id=2 means that two 'r3 += rX' instructions were seen, so r3 points to some
-offset within a packet and since the program author did
-'if (r3 + 8 > r1) goto err' at insn #18, the safe range is [R3, R3 + 8).
-The verifier only allows 'add'/'sub' operations on packet registers. Any other
-operation will set the register state to 'SCALAR_VALUE' and it won't be
-available for direct packet access.
-Operation 'r3 += rX' may overflow and become less than original skb->data,
-therefore the verifier has to prevent that. So when it sees 'r3 += rX'
-instruction and rX is more than 16-bit value, any subsequent bounds-check of r3
-against skb->data_end will not give us 'range' information, so attempts to read
-through the pointer will give "invalid access to packet" error.
-Ex. after insn 'r4 = *(u8 *)(r3 +12)' (insn #7 above) the state of r4 is
-R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits
-of the register are guaranteed to be zero, and nothing is known about the lower
-8 bits. After insn 'r4 *= 14' the state becomes
-R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit
-value by constant 14 will keep upper 52 bits as zero, also the least significant
-bit will be zero as 14 is even. Similarly 'r2 >>= 48' will make
-R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign
-extending. This logic is implemented in adjust_reg_min_max_vals() function,
-which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice
-versa) and adjust_scalar_min_max_vals() for operations on two scalars.
-
-The end result is that a BPF program author can access the packet directly
-using normal C code such as:
- void *data = (void *)(long)skb->data;
- void *data_end = (void *)(long)skb->data_end;
- struct eth_hdr *eth = data;
- struct iphdr *iph = data + sizeof(*eth);
- struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph);
-
- if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end)
- return 0;
- if (eth->h_proto != htons(ETH_P_IP))
- return 0;
- if (iph->protocol != IPPROTO_UDP || iph->ihl != 5)
- return 0;
- if (udp->dest == 53 || udp->source == 9)
- ...;
-which makes such programs easier to write compared to the LD_ABS insn
-and significantly faster.
-
-eBPF maps
----------
-'maps' is a generic storage of different types for sharing data between kernel
-and userspace.
-
-The maps are accessed from user space via BPF syscall, which has commands:
-- create a map with given type and attributes
- map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)
- using attr->map_type, attr->key_size, attr->value_size, attr->max_entries
- returns process-local file descriptor or negative error
-
-- lookup key in a given map
- err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)
- using attr->map_fd, attr->key, attr->value
- returns zero and stores found elem into value or negative error
-
-- create or update key/value pair in a given map
- err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)
- using attr->map_fd, attr->key, attr->value
- returns zero or negative error
-
-- find and delete element by key in a given map
- err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)
- using attr->map_fd, attr->key
-
-- to delete map: close(fd)
- Exiting process will delete maps automatically
-
-userspace programs use this syscall to create/access maps that eBPF programs
-are concurrently updating.
-
-maps can have different types: hash, array, bloom filter, radix-tree, etc.
-
-The map is defined by:
- . type
- . max number of elements
- . key size in bytes
- . value size in bytes
-
-Pruning
--------
-The verifier does not actually walk all possible paths through the program. For
-each new branch to analyse, the verifier looks at all the states it's previously
-been in when at this instruction. If any of them contain the current state as a
-subset, the branch is 'pruned' - that is, the fact that the previous state was
-accepted implies the current state would be as well. For instance, if in the
-previous state, r1 held a packet-pointer, and in the current state, r1 holds a
-packet-pointer with a range as long or longer and at least as strict an
-alignment, then r1 is safe. Similarly, if r2 was NOT_INIT before then it can't
-have been used by any path from that point, so any value in r2 (including
-another NOT_INIT) is safe. The implementation is in the function regsafe().
-Pruning considers not only the registers but also the stack (and any spilled
-registers it may hold). They must all be safe for the branch to be pruned.
-This is implemented in states_equal().
-
-Understanding eBPF verifier messages
-------------------------------------
-
-The following are a few examples of invalid eBPF programs and verifier error
-messages as seen in the log:
-
-Program with unreachable instructions:
-static struct bpf_insn prog[] = {
- BPF_EXIT_INSN(),
- BPF_EXIT_INSN(),
-};
-Error:
- unreachable insn 1
-
-Program that reads uninitialized register:
- BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
- BPF_EXIT_INSN(),
-Error:
- 0: (bf) r0 = r2
- R2 !read_ok
-
-Program that doesn't initialize R0 before exiting:
- BPF_MOV64_REG(BPF_REG_2, BPF_REG_1),
- BPF_EXIT_INSN(),
-Error:
- 0: (bf) r2 = r1
- 1: (95) exit
- R0 !read_ok
-
-Program that accesses stack out of bounds:
- BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0),
- BPF_EXIT_INSN(),
-Error:
- 0: (7a) *(u64 *)(r10 +8) = 0
- invalid stack off=8 size=8
-
-Program that doesn't initialize stack before passing its address into function:
- BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
- BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
- BPF_LD_MAP_FD(BPF_REG_1, 0),
- BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
- BPF_EXIT_INSN(),
-Error:
- 0: (bf) r2 = r10
- 1: (07) r2 += -8
- 2: (b7) r1 = 0x0
- 3: (85) call 1
- invalid indirect read from stack off -8+0 size 8
-
-Program that uses invalid map_fd=0 while calling to map_lookup_elem() function:
- BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
- BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
- BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
- BPF_LD_MAP_FD(BPF_REG_1, 0),
- BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
- BPF_EXIT_INSN(),
-Error:
- 0: (7a) *(u64 *)(r10 -8) = 0
- 1: (bf) r2 = r10
- 2: (07) r2 += -8
- 3: (b7) r1 = 0x0
- 4: (85) call 1
- fd 0 is not pointing to valid bpf_map
-
-Program that doesn't check return value of map_lookup_elem() before accessing
-map element:
- BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
- BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
- BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
- BPF_LD_MAP_FD(BPF_REG_1, 0),
- BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
- BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
- BPF_EXIT_INSN(),
-Error:
- 0: (7a) *(u64 *)(r10 -8) = 0
- 1: (bf) r2 = r10
- 2: (07) r2 += -8
- 3: (b7) r1 = 0x0
- 4: (85) call 1
- 5: (7a) *(u64 *)(r0 +0) = 0
- R0 invalid mem access 'map_value_or_null'
-
-Program that correctly checks map_lookup_elem() returned value for NULL, but
-accesses the memory with incorrect alignment:
- BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
- BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
- BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
- BPF_LD_MAP_FD(BPF_REG_1, 0),
- BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
- BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
- BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
- BPF_EXIT_INSN(),
-Error:
- 0: (7a) *(u64 *)(r10 -8) = 0
- 1: (bf) r2 = r10
- 2: (07) r2 += -8
- 3: (b7) r1 = 1
- 4: (85) call 1
- 5: (15) if r0 == 0x0 goto pc+1
- R0=map_ptr R10=fp
- 6: (7a) *(u64 *)(r0 +4) = 0
- misaligned access off 4 size 8
-
-Program that correctly checks map_lookup_elem() returned value for NULL and
-accesses memory with correct alignment in one side of 'if' branch, but fails
-to do so in the other side of 'if' branch:
- BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
- BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
- BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
- BPF_LD_MAP_FD(BPF_REG_1, 0),
- BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
- BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
- BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
- BPF_EXIT_INSN(),
- BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1),
- BPF_EXIT_INSN(),
-Error:
- 0: (7a) *(u64 *)(r10 -8) = 0
- 1: (bf) r2 = r10
- 2: (07) r2 += -8
- 3: (b7) r1 = 1
- 4: (85) call 1
- 5: (15) if r0 == 0x0 goto pc+2
- R0=map_ptr R10=fp
- 6: (7a) *(u64 *)(r0 +0) = 0
- 7: (95) exit
-
- from 5 to 8: R0=imm0 R10=fp
- 8: (7a) *(u64 *)(r0 +0) = 1
- R0 invalid mem access 'imm'
-
-Program that performs a socket lookup then sets the pointer to NULL without
-checking it:
- BPF_MOV64_IMM(BPF_REG_2, 0),
- BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
- BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
- BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
- BPF_MOV64_IMM(BPF_REG_3, 4),
- BPF_MOV64_IMM(BPF_REG_4, 0),
- BPF_MOV64_IMM(BPF_REG_5, 0),
- BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
- BPF_MOV64_IMM(BPF_REG_0, 0),
- BPF_EXIT_INSN(),
-Error:
- 0: (b7) r2 = 0
- 1: (63) *(u32 *)(r10 -8) = r2
- 2: (bf) r2 = r10
- 3: (07) r2 += -8
- 4: (b7) r3 = 4
- 5: (b7) r4 = 0
- 6: (b7) r5 = 0
- 7: (85) call bpf_sk_lookup_tcp#65
- 8: (b7) r0 = 0
- 9: (95) exit
- Unreleased reference id=1, alloc_insn=7
-
-Program that performs a socket lookup but does not NULL-check the returned
-value:
- BPF_MOV64_IMM(BPF_REG_2, 0),
- BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
- BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
- BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
- BPF_MOV64_IMM(BPF_REG_3, 4),
- BPF_MOV64_IMM(BPF_REG_4, 0),
- BPF_MOV64_IMM(BPF_REG_5, 0),
- BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
- BPF_EXIT_INSN(),
-Error:
- 0: (b7) r2 = 0
- 1: (63) *(u32 *)(r10 -8) = r2
- 2: (bf) r2 = r10
- 3: (07) r2 += -8
- 4: (b7) r3 = 4
- 5: (b7) r4 = 0
- 6: (b7) r5 = 0
- 7: (85) call bpf_sk_lookup_tcp#65
- 8: (95) exit
- Unreleased reference id=1, alloc_insn=7
-
-Testing
--------
-
-Next to the BPF toolchain, the kernel also ships a test module that contains
-various test cases for classic and internal BPF that can be executed against
-the BPF interpreter and JIT compiler. It can be found in lib/test_bpf.c and
-enabled via Kconfig:
-
- CONFIG_TEST_BPF=m
-
-After the module has been built and installed, the test suite can be executed
-via insmod or modprobe against 'test_bpf' module. Results of the test cases
-including timings in nsec can be found in the kernel log (dmesg).
-
-Misc
-----
-
-Also trinity, the Linux syscall fuzzer, has built-in support for BPF and
-SECCOMP-BPF kernel fuzzing.
-
-Written by
-----------
-
-The document was written in the hope that it is found useful and in order
-to give potential BPF hackers or security auditors a better overview of
-the underlying architecture.
-
-Jay Schulist <jschlst@samba.org>
-Daniel Borkmann <daniel@iogearbox.net>
-Alexei Starovoitov <ast@kernel.org>
diff --git a/Documentation/networking/framerelay.txt b/Documentation/networking/framerelay.txt
deleted file mode 100644
index 1a0b720440dd..000000000000
--- a/Documentation/networking/framerelay.txt
+++ /dev/null
@@ -1,39 +0,0 @@
-Frame Relay (FR) support for linux is built into a two tiered system of device
-drivers. The upper layer implements RFC1490 FR specification, and uses the
-Data Link Connection Identifier (DLCI) as its hardware address. Usually these
-are assigned by your network supplier, they give you the number/numbers of
-the Virtual Connections (VC) assigned to you.
-
-Each DLCI is a point-to-point link between your machine and a remote one.
-As such, a separate device is needed to accommodate the routing. Within the
-net-tools archives is 'dlcicfg'. This program will communicate with the
-base "DLCI" device, and create new net devices named 'dlci00', 'dlci01'...
-The configuration script will ask you how many DLCIs you need, as well as
-how many DLCIs you want to assign to each Frame Relay Access Device (FRAD).
-
-The DLCI uses a number of function calls to communicate with the FRAD, all
-of which are stored in the FRAD's private data area. assoc/deassoc,
-activate/deactivate and dlci_config. The DLCI supplies a receive function
-to the FRAD to accept incoming packets.
-
-With this initial offering, only 1 FRAD driver is available. With many thanks
-to Sangoma Technologies, David Mandelstam & Gene Kozin, the S502A, S502E &
-S508 are supported. This driver is currently set up for only FR, but as
-Sangoma makes more firmware modules available, it can be updated to provide
-them as well.
-
-Configuration of the FRAD makes use of another net-tools program, 'fradcfg'.
-This program makes use of a configuration file (which dlcicfg can also read)
-to specify the types of boards to be configured as FRADs, as well as perform
-any board specific configuration. The Sangoma module of fradcfg loads the
-FR firmware into the card, sets the irq/port/memory information, and provides
-an initial configuration.
-
-Additional FRAD device drivers can be added as hardware is available.
-
-At this time, the dlcicfg and fradcfg programs have not been incorporated into
-the net-tools distribution. They can be found at ftp.invlogic.com, in
-/pub/linux. Note that with OS/2 FTPD, you end up in /pub by default, so just
-use 'cd linux'. v0.10 is for use on pre-2.0.3 and earlier, v0.15 is for
-pre-2.0.4 and later.
-
diff --git a/Documentation/networking/gen_stats.txt b/Documentation/networking/gen_stats.rst
index 179b18ce45ff..595a83b9a61b 100644
--- a/Documentation/networking/gen_stats.txt
+++ b/Documentation/networking/gen_stats.rst
@@ -1,67 +1,76 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================================
Generic networking statistics for netlink users
-======================================================================
+===============================================
Statistic counters are grouped into structs:
+==================== ===================== =====================
Struct TLV type Description
-----------------------------------------------------------------------
+==================== ===================== =====================
gnet_stats_basic TCA_STATS_BASIC Basic statistics
gnet_stats_rate_est TCA_STATS_RATE_EST Rate estimator
gnet_stats_queue TCA_STATS_QUEUE Queue statistics
none TCA_STATS_APP Application specific
+==================== ===================== =====================
Collecting:
-----------
-Declare the statistic structs you need:
-struct mystruct {
- struct gnet_stats_basic bstats;
- struct gnet_stats_queue qstats;
- ...
-};
+Declare the statistic structs you need::
+
+ struct mystruct {
+ struct gnet_stats_basic bstats;
+ struct gnet_stats_queue qstats;
+ ...
+ };
+
+Update statistics, in dequeue() methods only, (while owning qdisc->running)::
-Update statistics, in dequeue() methods only, (while owning qdisc->running)
-mystruct->tstats.packet++;
-mystruct->qstats.backlog += skb->pkt_len;
+ mystruct->tstats.packet++;
+ mystruct->qstats.backlog += skb->pkt_len;
Export to userspace (Dump):
---------------------------
-my_dumping_routine(struct sk_buff *skb, ...)
-{
- struct gnet_dump dump;
+::
- if (gnet_stats_start_copy(skb, TCA_STATS2, &mystruct->lock, &dump,
- TCA_PAD) < 0)
- goto rtattr_failure;
+ my_dumping_routine(struct sk_buff *skb, ...)
+ {
+ struct gnet_dump dump;
- if (gnet_stats_copy_basic(&dump, &mystruct->bstats) < 0 ||
- gnet_stats_copy_queue(&dump, &mystruct->qstats) < 0 ||
- gnet_stats_copy_app(&dump, &xstats, sizeof(xstats)) < 0)
- goto rtattr_failure;
+ if (gnet_stats_start_copy(skb, TCA_STATS2, &mystruct->lock, &dump,
+ TCA_PAD) < 0)
+ goto rtattr_failure;
- if (gnet_stats_finish_copy(&dump) < 0)
- goto rtattr_failure;
- ...
-}
+ if (gnet_stats_copy_basic(&dump, &mystruct->bstats) < 0 ||
+ gnet_stats_copy_queue(&dump, &mystruct->qstats) < 0 ||
+ gnet_stats_copy_app(&dump, &xstats, sizeof(xstats)) < 0)
+ goto rtattr_failure;
+
+ if (gnet_stats_finish_copy(&dump) < 0)
+ goto rtattr_failure;
+ ...
+ }
TCA_STATS/TCA_XSTATS backward compatibility:
--------------------------------------------
Prior users of struct tc_stats and xstats can maintain backward
compatibility by calling the compat wrappers to keep providing the
-existing TLV types.
+existing TLV types::
-my_dumping_routine(struct sk_buff *skb, ...)
-{
- if (gnet_stats_start_copy_compat(skb, TCA_STATS2, TCA_STATS,
- TCA_XSTATS, &mystruct->lock, &dump,
- TCA_PAD) < 0)
- goto rtattr_failure;
- ...
-}
+ my_dumping_routine(struct sk_buff *skb, ...)
+ {
+ if (gnet_stats_start_copy_compat(skb, TCA_STATS2, TCA_STATS,
+ TCA_XSTATS, &mystruct->lock, &dump,
+ TCA_PAD) < 0)
+ goto rtattr_failure;
+ ...
+ }
A struct tc_stats will be filled out during gnet_stats_copy_* calls
and appended to the skb. TCA_XSTATS is provided if gnet_stats_copy_app
@@ -77,7 +86,7 @@ are responsible for making sure that the lock is initialized.
Rate Estimator:
---------------
+---------------
0) Prepare an estimator attribute. Most likely this would be in user
space. The value of this TLV should contain a tc_estimator structure.
@@ -92,18 +101,19 @@ Rate Estimator:
TCA_RATE to your code in the kernel.
In the kernel when setting up:
+
1) make sure you have basic stats and rate stats set up first.
2) make sure you have initialized the stats lock that is used to set up such
stats.
-3) Now initialize a new estimator:
+3) Now initialize a new estimator::
- int ret = gen_new_estimator(my_basicstats,my_rate_est_stats,
- mystats_lock, attr_with_tcestimator_struct);
+ int ret = gen_new_estimator(my_basicstats,my_rate_est_stats,
+ mystats_lock, attr_with_tcestimator_struct);
- if ret == 0
- success
- else
- failed
+ if ret == 0
+ success
+ else
+ failed
From now on, every time you dump my_rate_est_stats it will contain
up-to-date info.
@@ -115,5 +125,5 @@ are still valid (i.e still exist) at the time of making this call.
Authors:
--------
-Thomas Graf <tgraf@suug.ch>
-Jamal Hadi Salim <hadi@cyberus.ca>
+- Thomas Graf <tgraf@suug.ch>
+- Jamal Hadi Salim <hadi@cyberus.ca>
diff --git a/Documentation/networking/generic-hdlc.txt b/Documentation/networking/generic-hdlc.rst
index 4eb3cc40b702..1c3bb5cb98d4 100644
--- a/Documentation/networking/generic-hdlc.txt
+++ b/Documentation/networking/generic-hdlc.rst
@@ -1,14 +1,22 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
Generic HDLC layer
+==================
+
Krzysztof Halasa <khc@pm.waw.pl>
Generic HDLC layer currently supports:
+
1. Frame Relay (ANSI, CCITT, Cisco and no LMI)
+
- Normal (routed) and Ethernet-bridged (Ethernet device emulation)
interfaces can share a single PVC.
- ARP support (no InARP support in the kernel - there is an
experimental InARP user-space daemon available on:
http://www.kernel.org/pub/linux/utils/net/hdlc/).
+
2. raw HDLC - either IP (IPv4) interface or Ethernet device emulation
3. Cisco HDLC
4. PPP
@@ -24,19 +32,24 @@ with IEEE 802.1Q (VLANs) and 802.1D (Ethernet bridging).
Make sure the hdlc.o and the hardware driver are loaded. It should
create a number of "hdlc" (hdlc0 etc) network devices, one for each
WAN port. You'll need the "sethdlc" utility, get it from:
+
http://www.kernel.org/pub/linux/utils/net/hdlc/
-Compile sethdlc.c utility:
+Compile sethdlc.c utility::
+
gcc -O2 -Wall -o sethdlc sethdlc.c
+
Make sure you're using a correct version of sethdlc for your kernel.
Use sethdlc to set physical interface, clock rate, HDLC mode used,
and add any required PVCs if using Frame Relay.
-Usually you want something like:
+Usually you want something like::
sethdlc hdlc0 clock int rate 128000
sethdlc hdlc0 cisco interval 10 timeout 25
-or
+
+or::
+
sethdlc hdlc0 rs232 clock ext
sethdlc hdlc0 fr lmi ansi
sethdlc hdlc0 create 99
@@ -49,46 +62,63 @@ any IP address to it) before using pvc devices.
Setting interface:
-* v35 | rs232 | x21 | t1 | e1 - sets physical interface for a given port
- if the card has software-selectable interfaces
- loopback - activate hardware loopback (for testing only)
-* clock ext - both RX clock and TX clock external
-* clock int - both RX clock and TX clock internal
-* clock txint - RX clock external, TX clock internal
-* clock txfromrx - RX clock external, TX clock derived from RX clock
-* rate - sets clock rate in bps (for "int" or "txint" clock only)
+* v35 | rs232 | x21 | t1 | e1
+ - sets physical interface for a given port
+ if the card has software-selectable interfaces
+ loopback
+ - activate hardware loopback (for testing only)
+* clock ext
+ - both RX clock and TX clock external
+* clock int
+ - both RX clock and TX clock internal
+* clock txint
+ - RX clock external, TX clock internal
+* clock txfromrx
+ - RX clock external, TX clock derived from RX clock
+* rate
+ - sets clock rate in bps (for "int" or "txint" clock only)
Setting protocol:
* hdlc - sets raw HDLC (IP-only) mode
+
nrz / nrzi / fm-mark / fm-space / manchester - sets transmission code
+
no-parity / crc16 / crc16-pr0 (CRC16 with preset zeros) / crc32-itu
+
crc16-itu (CRC16 with ITU-T polynomial) / crc16-itu-pr0 - sets parity
* hdlc-eth - Ethernet device emulation using HDLC. Parity and encoding
as above.
* cisco - sets Cisco HDLC mode (IP, IPv6 and IPX supported)
+
interval - time in seconds between keepalive packets
+
timeout - time in seconds after last received keepalive packet before
- we assume the link is down
+ we assume the link is down
* ppp - sets synchronous PPP mode
* x25 - sets X.25 mode
* fr - Frame Relay mode
+
lmi ansi / ccitt / cisco / none - LMI (link management) type
+
dce - Frame Relay DCE (network) side LMI instead of default DTE (user).
+
It has nothing to do with clocks!
- t391 - link integrity verification polling timer (in seconds) - user
- t392 - polling verification timer (in seconds) - network
- n391 - full status polling counter - user
- n392 - error threshold - both user and network
- n393 - monitored events count - both user and network
+
+ - t391 - link integrity verification polling timer (in seconds) - user
+ - t392 - polling verification timer (in seconds) - network
+ - n391 - full status polling counter - user
+ - n392 - error threshold - both user and network
+ - n393 - monitored events count - both user and network
Frame-Relay only:
+
* create n | delete n - adds / deletes PVC interface with DLCI #n.
Newly created interface will be named pvc0, pvc1 etc.
@@ -101,26 +131,34 @@ Frame-Relay only:
Board-specific issues
---------------------
-n2.o and c101.o need parameters to work:
+n2.o and c101.o need parameters to work::
insmod n2 hw=io,irq,ram,ports[:io,irq,...]
-example:
+
+example::
+
insmod n2 hw=0x300,10,0xD0000,01
-or
+or::
+
insmod c101 hw=irq,ram[:irq,...]
-example:
+
+example::
+
insmod c101 hw=9,0xdc000
-If built into the kernel, these drivers need kernel (command line) parameters:
+If built into the kernel, these drivers need kernel (command line) parameters::
+
n2.hw=io,irq,ram,ports:...
-or
+
+or::
+
c101.hw=irq,ram:...
If you have a problem with N2, C101 or PLX200SYN card, you can issue the
-"private" command to see port's packet descriptor rings (in kernel logs):
+"private" command to see port's packet descriptor rings (in kernel logs)::
sethdlc hdlc0 private
diff --git a/Documentation/networking/generic_netlink.txt b/Documentation/networking/generic_netlink.rst
index 3e071115ca90..59e04ccf80c1 100644
--- a/Documentation/networking/generic_netlink.txt
+++ b/Documentation/networking/generic_netlink.rst
@@ -1,3 +1,9 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+Generic Netlink
+===============
+
A wiki document on how to use Generic Netlink can be found here:
* http://www.linuxfoundation.org/collaborate/workgroups/networking/generic_netlink_howto
diff --git a/Documentation/networking/gtp.txt b/Documentation/networking/gtp.rst
index 6966bbec1ecb..1563fb94b289 100644
--- a/Documentation/networking/gtp.txt
+++ b/Documentation/networking/gtp.rst
@@ -1,12 +1,18 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
The Linux kernel GTP tunneling module
-======================================================================
-Documentation by Harald Welte <laforge@gnumonks.org> and
- Andreas Schultz <aschultz@tpip.net>
+=====================================
+
+Documentation by
+ Harald Welte <laforge@gnumonks.org> and
+ Andreas Schultz <aschultz@tpip.net>
In 'drivers/net/gtp.c' you will find a kernel-level implementation
of a GTP tunnel endpoint.
-== What is GTP ==
+What is GTP
+===========
GTP is the Generic Tunnel Protocol, which is a 3GPP protocol used for
tunneling User-IP payload between a mobile station (phone, modem)
@@ -41,7 +47,8 @@ publicly via the 3GPP website at http://www.3gpp.org/DynaReport/29060.htm
A direct PDF link to v13.6.0 is provided for convenience below:
http://www.etsi.org/deliver/etsi_ts/129000_129099/129060/13.06.00_60/ts_129060v130600p.pdf
-== The Linux GTP tunnelling module ==
+The Linux GTP tunnelling module
+===============================
The module implements the function of a tunnel endpoint, i.e. it is
able to decapsulate tunneled IP packets in the uplink originated by
@@ -70,7 +77,8 @@ Userspace :)
The official homepage of the module is at
https://osmocom.org/projects/linux-kernel-gtp-u/wiki
-== Userspace Programs with Linux Kernel GTP-U support ==
+Userspace Programs with Linux Kernel GTP-U support
+==================================================
At the time of this writing, there are at least two Free Software
implementations that implement GTP-C and can use the netlink interface
@@ -82,7 +90,8 @@ to make use of the Linux kernel GTP-U support:
* ergw (GGSN + P-GW in Erlang):
https://github.com/travelping/ergw
-== Userspace Library / Command Line Utilities ==
+Userspace Library / Command Line Utilities
+==========================================
There is a userspace library called 'libgtpnl' which is based on
libmnl and which implements a C-language API towards the netlink
@@ -90,7 +99,8 @@ interface provided by the Kernel GTP module:
http://git.osmocom.org/libgtpnl/
-== Protocol Versions ==
+Protocol Versions
+=================
There are two different versions of GTP-U: v0 [GSM TS 09.60] and v1
[3GPP TS 29.281]. Both are implemented in the Kernel GTP module.
@@ -105,7 +115,8 @@ doesn't implement GTP-C, we don't have to worry about this. It's the
responsibility of the control plane implementation in userspace to
implement that.
-== IPv6 ==
+IPv6
+====
The 3GPP specifications indicate either IPv4 or IPv6 can be used both
on the inner (user) IP layer, or on the outer (transport) layer.
@@ -114,22 +125,25 @@ Unfortunately, the Kernel module currently supports IPv6 neither for
the User IP payload, nor for the outer IP layer. Patches or other
Contributions to fix this are most welcome!
-== Mailing List ==
+Mailing List
+============
-If yo have questions regarding how to use the Kernel GTP module from
+If you have questions regarding how to use the Kernel GTP module from
your own software, or want to contribute to the code, please use the
osmocom-net-gprs mailing list for related discussion. The list can be
reached at osmocom-net-gprs@lists.osmocom.org and the mailman
interface for managing your subscription is at
https://lists.osmocom.org/mailman/listinfo/osmocom-net-gprs
-== Issue Tracker ==
+Issue Tracker
+=============
The Osmocom project maintains an issue tracker for the Kernel GTP-U
module at
https://osmocom.org/projects/linux-kernel-gtp-u/issues
-== History / Acknowledgements ==
+History / Acknowledgements
+==========================
The Module was originally created in 2012 by Harald Welte, but never
completed. Pablo came in to finish the mess Harald left behind. But
@@ -139,9 +153,11 @@ In 2015, Andreas Schultz came to the rescue and fixed lots more bugs,
extended it with new features and finally pushed all of us to get it
mainline, where it was merged in 4.7.0.
-== Architectural Details ==
+Architectural Details
+=====================
-=== Local GTP-U entity and tunnel identification ===
+Local GTP-U entity and tunnel identification
+--------------------------------------------
GTP-U uses UDP for transporting PDU's. The receiving UDP port is 2152
for GTPv1-U and 3386 for GTPv0-U.
@@ -164,15 +180,15 @@ Therefore:
destination IP and the tunnel endpoint id. The source IP and port
have no meaning and can change at any time.
-[3GPP TS 29.281] Section 4.3.0 defines this so:
+[3GPP TS 29.281] Section 4.3.0 defines this so::
-> The TEID in the GTP-U header is used to de-multiplex traffic
-> incoming from remote tunnel endpoints so that it is delivered to the
-> User plane entities in a way that allows multiplexing of different
-> users, different packet protocols and different QoS levels.
-> Therefore no two remote GTP-U endpoints shall send traffic to a
-> GTP-U protocol entity using the same TEID value except
-> for data forwarding as part of mobility procedures.
+ The TEID in the GTP-U header is used to de-multiplex traffic
+ incoming from remote tunnel endpoints so that it is delivered to the
+ User plane entities in a way that allows multiplexing of different
+ users, different packet protocols and different QoS levels.
+ Therefore no two remote GTP-U endpoints shall send traffic to a
+ GTP-U protocol entity using the same TEID value except
+ for data forwarding as part of mobility procedures.
The definition above only defines that two remote GTP-U endpoints
*should not* send to the same TEID, it *does not* forbid or exclude
@@ -183,7 +199,8 @@ multiple or unknown peers.
Therefore, the receiving side identifies tunnels exclusively based on
TEIDs, not based on the source IP!
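
As an illustration of identifying a tunnel purely by TEID, the following
userspace sketch extracts the TEID from a received GTPv1-U packet. It is not
taken from drivers/net/gtp.c. The header layout it assumes (version in the top
three bits of the first octet, TEID in octets 5 to 8, network byte order) comes
from 3GPP TS 29.281 rather than from the text above, and the helper name
gtp1u_teid() is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define GTP1U_PORT    2152  /* receiving UDP port for GTPv1-U */
#define GTP1U_HDR_MIN 8     /* mandatory GTPv1-U header length (assumed, TS 29.281) */

/*
 * Hypothetical helper: read the TEID from a GTPv1-U header.
 * Returns 0 and stores the TEID on success, -1 if the buffer is too
 * short or the version field is not 1. The source IP and port of the
 * packet are deliberately not part of the lookup key.
 */
static int gtp1u_teid(const uint8_t *buf, size_t len, uint32_t *teid)
{
	if (len < GTP1U_HDR_MIN)
		return -1;
	if ((buf[0] >> 5) != 1)		/* version lives in the top 3 bits */
		return -1;

	/* TEID is carried big-endian in octets 5..8 (offsets 4..7) */
	*teid = ((uint32_t)buf[4] << 24) | ((uint32_t)buf[5] << 16) |
		((uint32_t)buf[6] << 8)  |  (uint32_t)buf[7];
	return 0;
}
```

A receiver would use the returned TEID, together with the local destination
IP, as the only key when looking up the tunnel.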
-== APN vs. Network Device ==
+APN vs. Network Device
+======================
The GTP-U driver creates a Linux network device for each Gi/SGi
interface.
@@ -201,29 +218,33 @@ number of Gi/SGi interfaces implemented by a GGSN/P-GW.
[3GPP TS 29.061] Section 11.3 makes it clear that the selection of a
specific Gi/SGi interfaces is made through the Access Point Name
-(APN):
-
-> 2. each private network manages its own addressing. In general this
-> will result in different private networks having overlapping
-> address ranges. A logically separate connection (e.g. an IP in IP
-> tunnel or layer 2 virtual circuit) is used between the GGSN/P-GW
-> and each private network.
->
-> In this case the IP address alone is not necessarily unique. The
-> pair of values, Access Point Name (APN) and IPv4 address and/or
-> IPv6 prefixes, is unique.
+(APN)::
+
+ 2. each private network manages its own addressing. In general this
+ will result in different private networks having overlapping
+ address ranges. A logically separate connection (e.g. an IP in IP
+ tunnel or layer 2 virtual circuit) is used between the GGSN/P-GW
+ and each private network.
+
+ In this case the IP address alone is not necessarily unique. The
+ pair of values, Access Point Name (APN) and IPv4 address and/or
+ IPv6 prefixes, is unique.
In order to support the overlapping address range use case, each APN
is mapped to a separate Gi/SGi interface (network device).
-NOTE: The Access Point Name is purely a control plane (GTP-C) concept.
-At the GTP-U level, only Tunnel Endpoint Identifiers are present in
-GTP-U packets and network devices are known
+.. note::
+
+ The Access Point Name is purely a control plane (GTP-C) concept.
+ At the GTP-U level, only Tunnel Endpoint Identifiers are present in
+ GTP-U packets and network devices are known
Therefore for a given UE the mapping in IP to PDN network is:
+
* network device + MS IP -> Peer IP + Peer TEID,
and from PDN to IP network:
+
* local GTP-U IP + TEID -> network device
Furthermore, before a received T-PDU is injected into the network
diff --git a/Documentation/networking/ieee802154.rst b/Documentation/networking/ieee802154.rst
index 36ca823a1122..f27856d77c8b 100644
--- a/Documentation/networking/ieee802154.rst
+++ b/Documentation/networking/ieee802154.rst
@@ -26,12 +26,14 @@ The stack is composed of three main parts:
Socket API
==========
-.. c:function:: int sd = socket(PF_IEEE802154, SOCK_DGRAM, 0);
+::
+
+ int sd = socket(PF_IEEE802154, SOCK_DGRAM, 0);
The address family, socket addresses etc. are defined in the
include/net/af_ieee802154.h header or in the special header
-in the userspace package (see either http://wpan.cakelab.org/ or the
-git tree at https://github.com/linux-wpan/wpan-tools).
+in the userspace package (see either https://linux-wpan.org/wpan-tools.html
+or the git tree at https://github.com/linux-wpan/wpan-tools).
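
A minimal userspace sketch of the call above with error handling might look as
follows. The wrapper name open_ieee802154_dgram() is hypothetical, and the
fallback value 36 for PF_IEEE802154 is an assumption taken from the kernel's
include/linux/socket.h for systems whose libc headers do not define it. On
kernels built without IEEE 802.15.4 socket support the call is expected to fail.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef PF_IEEE802154
#define PF_IEEE802154 36	/* assumed fallback, AF_IEEE802154 in linux/socket.h */
#endif

/*
 * Hypothetical wrapper: open an IEEE 802.15.4 datagram socket.
 * Returns the socket descriptor, or -1 with errno set on failure.
 */
static int open_ieee802154_dgram(void)
{
	int sd = socket(PF_IEEE802154, SOCK_DGRAM, 0);

	if (sd < 0) {
		int saved = errno;	/* fprintf() may clobber errno */

		fprintf(stderr, "socket(PF_IEEE802154): %s\n", strerror(saved));
		errno = saved;
	}
	return sd;
}
```

The caller is responsible for close() on the returned descriptor.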
6LoWPAN Linux implementation
============================
@@ -131,12 +133,12 @@ Register PHY in the system.
Freeing registered PHY.
-.. c:function:: void ieee802154_rx_irqsafe(struct ieee802154_hw *hw, struct sk_buff *skb, u8 lqi):
+.. c:function:: void ieee802154_rx_irqsafe(struct ieee802154_hw *hw, struct sk_buff *skb, u8 lqi)
Telling 802.15.4 module there is a new received frame in the skb with
the RF Link Quality Indicator (LQI) from the hardware device.
-.. c:function:: void ieee802154_xmit_complete(struct ieee802154_hw *hw, struct sk_buff *skb, bool ifs_handling):
+.. c:function:: void ieee802154_xmit_complete(struct ieee802154_hw *hw, struct sk_buff *skb, bool ifs_handling)
Telling 802.15.4 module the frame in the skb has been, or is going to be,
transmitted through the hardware device
@@ -155,25 +157,25 @@ operations structure at least::
...
};
-.. c:function:: int start(struct ieee802154_hw *hw):
+.. c:function:: int start(struct ieee802154_hw *hw)
Handler that 802.15.4 module calls for the hardware device initialization.
-.. c:function:: void stop(struct ieee802154_hw *hw):
+.. c:function:: void stop(struct ieee802154_hw *hw)
Handler that 802.15.4 module calls for the hardware device cleanup.
-.. c:function:: int xmit_async(struct ieee802154_hw *hw, struct sk_buff *skb):
+.. c:function:: int xmit_async(struct ieee802154_hw *hw, struct sk_buff *skb)
Handler that 802.15.4 module calls for each frame in the skb going to be
transmitted through the hardware device.
-.. c:function:: int ed(struct ieee802154_hw *hw, u8 *level):
+.. c:function:: int ed(struct ieee802154_hw *hw, u8 *level)
Handler that 802.15.4 module calls for Energy Detection from the hardware
device.
-.. c:function:: int set_channel(struct ieee802154_hw *hw, u8 page, u8 channel):
+.. c:function:: int set_channel(struct ieee802154_hw *hw, u8 page, u8 channel)
Set radio for listening on specific channel of the hardware device.
diff --git a/Documentation/networking/ila.txt b/Documentation/networking/ila.rst
index a17dac9dc915..5ac0a6270b17 100644
--- a/Documentation/networking/ila.txt
+++ b/Documentation/networking/ila.rst
@@ -1,4 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================
Identifier Locator Addressing (ILA)
+===================================
Introduction
@@ -26,11 +30,13 @@ The ILA protocol is described in Internet-Draft draft-herbert-intarea-ila.
ILA terminology
===============
- - Identifier A number that identifies an addressable node in the network
+ - Identifier
+ A number that identifies an addressable node in the network
independent of its location. ILA identifiers are sixty-four
bit values.
- - Locator A network prefix that routes to a physical host. Locators
+ - Locator
+ A network prefix that routes to a physical host. Locators
provide the topological location of an addressed node. ILA
locators are sixty-four bit prefixes.
@@ -51,17 +57,20 @@ ILA terminology
bits) and an identifier (low order sixty-four bits). ILA
addresses are never visible to an application.
- - ILA host An end host that is capable of performing ILA translations
+ - ILA host
+ An end host that is capable of performing ILA translations
on transmit or receive.
- - ILA router A network node that performs ILA translation and forwarding
+ - ILA router
+ A network node that performs ILA translation and forwarding
of translated packets.
- ILA forwarding cache
A type of ILA router that only maintains a working set
cache of mappings.
- - ILA node A network node capable of performing ILA translations. This
+ - ILA node
+ A network node capable of performing ILA translations. This
can be an ILA router, ILA forwarding cache, or ILA host.
@@ -82,18 +91,18 @@ Configuration and datapath for these two points of deployment is somewhat
different.
The diagram below illustrates the flow of packets through ILA as well
-as showing ILA hosts and routers.
+as showing ILA hosts and routers::
+--------+ +--------+
| Host A +-+ +--->| Host B |
| | | (2) ILA (') | |
+--------+ | ...addressed.... ( ) +--------+
- V +---+--+ . packet . +---+--+ (_)
+ V +---+--+ . packet . +---+--+ (_)
(1) SIR | | ILA |----->-------->---->| ILA | | (3) SIR
addressed +->|router| . . |router|->-+ addressed
packet +---+--+ . IPv6 . +---+--+ packet
- / . Network .
- / . . +--+-++--------+
+ / . Network .
+ / . . +--+-++--------+
+--------+ / . . |ILA || Host |
| Host +--+ . .- -|host|| |
| | . . +--+-++--------+
@@ -173,7 +182,7 @@ ILA address, never a SIR address.
In the simplest format the identifier types, C-bit, and checksum
adjustment value are not present so an identifier is considered an
-unstructured sixty-four bit value.
+unstructured sixty-four bit value::
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Identifier |
@@ -184,7 +193,7 @@ unstructured sixty-four bit value.
The checksum neutral adjustment may be configured to always be
present using neutral-map-auto. In this case there is no C-bit, but the
checksum adjustment is in the low order 16 bits. The identifier is
-still sixty-four bits.
+still sixty-four bits::
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Identifier |
@@ -193,7 +202,7 @@ still sixty-four bits.
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The C-bit may be used to explicitly indicate that checksum neutral
-mapping has been applied to an ILA address. The format is:
+mapping has been applied to an ILA address. The format is::
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |C| Identifier |
@@ -204,7 +213,7 @@ mapping has been applied to an ILA address. The format is:
The identifier type field may be present to indicate the identifier
type. If it is not present then the type is inferred based on mapping
configuration. The checksum neutral adjustment may automatically be
-used with the identifier type as illustrated below.
+used with the identifier type as illustrated below::
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type| Identifier |
@@ -213,7 +222,7 @@ used with the identifier type as illustrated below.
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The identifier type and the C-bit can be present simultaneously, so
-the identifier format would be:
+the identifier format would be::
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type|C| Identifier |
@@ -258,28 +267,30 @@ same meanings as described above.
Some examples
=============
-# Configure an ILA route that uses checksum neutral mapping as well
-# as type field. Note that the type field is set in the SIR address
-# (the 2000 implies type is 1 which is LUID).
-ip route add 3333:0:0:1:2000:0:1:87/128 encap ila 2001:0:87:0 \
- csum-mode neutral-map ident-type use-format
-
-# Configure an ILA LWT route that uses auto checksum neutral mapping
-# (no C-bit) and configure identifier type to be LUID so that the
-# identifier type field will not be present.
-ip route add 3333:0:0:1:2000:0:2:87/128 encap ila 2001:0:87:1 \
- csum-mode neutral-map-auto ident-type luid
-
-ila_xlat configuration
-
-# Configure an ILA to SIR mapping that matches a locator and overwrites
-# it with a SIR address (3333:0:0:1 in this example). The C-bit and
-# identifier field are used.
-ip ila add loc_match 2001:0:119:0 loc 3333:0:0:1 \
- csum-mode neutral-map-auto ident-type use-format
-
-# Configure an ILA to SIR mapping where checksum neutral is automatically
-# set without the C-bit and the identifier type is configured to be LUID
-# so that the identifier type field is not present.
-ip ila add loc_match 2001:0:119:0 loc 3333:0:0:1 \
- csum-mode neutral-map-auto ident-type use-format
+::
+
+ # Configure an ILA route that uses checksum neutral mapping as well
+ # as type field. Note that the type field is set in the SIR address
+ # (the 2000 implies type is 1 which is LUID).
+ ip route add 3333:0:0:1:2000:0:1:87/128 encap ila 2001:0:87:0 \
+ csum-mode neutral-map ident-type use-format
+
+ # Configure an ILA LWT route that uses auto checksum neutral mapping
+ # (no C-bit) and configure identifier type to be LUID so that the
+ # identifier type field will not be present.
+ ip route add 3333:0:0:1:2000:0:2:87/128 encap ila 2001:0:87:1 \
+ csum-mode neutral-map-auto ident-type luid
+
+ ila_xlat configuration
+
+ # Configure an ILA to SIR mapping that matches a locator and overwrites
+ # it with a SIR address (3333:0:0:1 in this example). The C-bit and
+ # identifier field are used.
+ ip ila add loc_match 2001:0:119:0 loc 3333:0:0:1 \
+ csum-mode neutral-map-auto ident-type use-format
+
+ # Configure an ILA to SIR mapping where checksum neutral is automatically
+ # set without the C-bit and the identifier type is configured to be LUID
+ # so that the identifier type field is not present.
+ ip ila add loc_match 2001:0:119:0 loc 3333:0:0:1 \
+ csum-mode neutral-map-auto ident-type luid
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index d07d9855dcd3..16a153bcc5fe 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -1,27 +1,31 @@
-Linux Networking Documentation
-==============================
+Networking
+==========
+
+Refer to :ref:`netdev-FAQ` for a guide on netdev development process specifics.
Contents:
.. toctree::
:maxdepth: 2
- netdev-FAQ
af_xdp
+ bareudp
batman-adv
can
can_ucan_protocol
device_drivers/index
dsa/index
devlink/index
+ caif/index
ethtool-netlink
ieee802154
j1939
kapi
- z8530book
msg_zerocopy
failover
+ net_dim
net_failover
+ page_pool
phy
sfp-phylink
alias
@@ -33,6 +37,88 @@ Contents:
tls
tls-offload
nfc
+ 6lowpan
+ 6pack
+ arcnet-hardware
+ arcnet
+ atm
+ ax25
+ bonding
+ cdc_mbim
+ dccp
+ dctcp
+ dns_resolver
+ driver
+ eql
+ fib_trie
+ filter
+ generic-hdlc
+ generic_netlink
+ gen_stats
+ gtp
+ ila
+ ioam6-sysctl
+ ipddp
+ ip_dynaddr
+ ipsec
+ ip-sysctl
+ ipv6
+ ipvlan
+ ipvs-sysctl
+ kcm
+ l2tp
+ lapb-module
+ mac80211-injection
+ mctp
+ mpls-sysctl
+ mptcp-sysctl
+ multiqueue
+ netconsole
+ netdev-features
+ netdevices
+ netfilter-sysctl
+ netif-msg
+ nexthop-group-resilient
+ nf_conntrack-sysctl
+ nf_flowtable
+ openvswitch
+ operstates
+ packet_mmap
+ phonet
+ pktgen
+ plip
+ ppp_generic
+ proc_net_tcp
+ radiotap-headers
+ rds
+ regulatory
+ representors
+ rxrpc
+ sctp
+ secid
+ seg6-sysctl
+ skbuff
+ smc-sysctl
+ statistics
+ strparser
+ switchdev
+ sysfs-tagging
+ tc-actions-env-rules
+ tcp-thin
+ team
+ timestamping
+ tipc
+ tproxy
+ tuntap
+ udplite
+ vrf
+ vxlan
+ x25-iface
+ x25
+ xfrm_device
+ xfrm_proc
+ xfrm_sync
+ xfrm_sysctl
.. only:: subproject and html
diff --git a/Documentation/networking/ioam6-sysctl.rst b/Documentation/networking/ioam6-sysctl.rst
new file mode 100644
index 000000000000..c18cab2c481a
--- /dev/null
+++ b/Documentation/networking/ioam6-sysctl.rst
@@ -0,0 +1,26 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+IOAM6 Sysctl variables
+======================
+
+
+/proc/sys/net/ipv6/conf/<iface>/ioam6_* variables:
+==================================================
+
+ioam6_enabled - BOOL
+ Accept (= enabled) or ignore (= disabled) IPv6 IOAM options on ingress
+ for this interface.
+
+ * 0 - disabled (default)
+ * 1 - enabled
+
+ioam6_id - SHORT INTEGER
+ Define the IOAM id of this interface.
+
+ Default is ~0.
+
+ioam6_id_wide - INTEGER
+ Define the wide IOAM id of this interface.
+
+ Default is ~0.
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.rst
index 5f53faff4e25..e7b3fa7bb3f7 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.rst
@@ -1,8 +1,15 @@
-/proc/sys/net/ipv4/* Variables:
+.. SPDX-License-Identifier: GPL-2.0
+
+=========
+IP Sysctl
+=========
+
+/proc/sys/net/ipv4/* Variables
+==============================
ip_forward - BOOLEAN
- 0 - disabled (default)
- not 0 - enabled
+ - 0 - disabled (default)
+ - not 0 - enabled
Forward Packets between interfaces.
@@ -18,7 +25,8 @@ ip_default_ttl - INTEGER
ip_no_pmtu_disc - INTEGER
Disable Path MTU Discovery. If enabled in mode 1 and a
fragmentation-required ICMP is received, the PMTU to this
- destination will be set to min_pmtu (see below). You will need
+ destination will be set to the smallest of the old MTU to
+ this destination and min_pmtu (see below). You will need
to raise min_pmtu to the smallest interface MTU on your system
manually if you want to avoid locally generated fragments.
@@ -38,10 +46,12 @@ ip_no_pmtu_disc - INTEGER
could break other protocols.
Possible values: 0-3
+
Default: FALSE
min_pmtu - INTEGER
- default 552 - minimum discovered Path MTU
+ default 552 - minimum Path MTU. Unless this is changed manually,
+ each cached pmtu will never be lower than this setting.
ip_forward_use_pmtu - BOOLEAN
By default we don't trust protocol path MTUs while forwarding
@@ -51,16 +61,20 @@ ip_forward_use_pmtu - BOOLEAN
which tries to discover path mtus by itself and depends on the
kernel honoring this information. This is normally not the
case.
+
Default: 0 (disabled)
+
Possible values:
- 0 - disabled
- 1 - enabled
+
+ - 0 - disabled
+ - 1 - enabled
fwmark_reflect - BOOLEAN
Controls the fwmark of kernel-generated IPv4 reply packets that are not
associated with a socket (for example, TCP RSTs or ICMP echo replies).
If unset, these packets have a fwmark of zero. If set, they have the
fwmark of the packet they are replying to.
+
Default: 0
fib_multipath_use_neigh - BOOLEAN
@@ -68,63 +82,109 @@ fib_multipath_use_neigh - BOOLEAN
multipath routes. If disabled, neighbor information is not used and
packets could be directed to a failed nexthop. Only valid for kernels
built with CONFIG_IP_ROUTE_MULTIPATH enabled.
+
Default: 0 (disabled)
+
Possible values:
- 0 - disabled
- 1 - enabled
+
+ - 0 - disabled
+ - 1 - enabled
fib_multipath_hash_policy - INTEGER
Controls which hash policy to use for multipath routes. Only valid
for kernels built with CONFIG_IP_ROUTE_MULTIPATH enabled.
+
Default: 0 (Layer 3)
+
Possible values:
- 0 - Layer 3
- 1 - Layer 4
- 2 - Layer 3 or inner Layer 3 if present
+
+ - 0 - Layer 3
+ - 1 - Layer 4
+ - 2 - Layer 3 or inner Layer 3 if present
+ - 3 - Custom multipath hash. Fields used for multipath hash calculation
+ are determined by fib_multipath_hash_fields sysctl
+
+fib_multipath_hash_fields - UNSIGNED INTEGER
+ When fib_multipath_hash_policy is set to 3 (custom multipath hash), the
+ fields used for multipath hash calculation are determined by this
+ sysctl.
+
+ This value is a bitmask which enables various fields for multipath hash
+ calculation.
+
+ Possible fields are:
+
+ ====== ============================
+ 0x0001 Source IP address
+ 0x0002 Destination IP address
+ 0x0004 IP protocol
+ 0x0008 Unused (Flow Label)
+ 0x0010 Source port
+ 0x0020 Destination port
+ 0x0040 Inner source IP address
+ 0x0080 Inner destination IP address
+ 0x0100 Inner IP protocol
+ 0x0200 Inner Flow Label
+ 0x0400 Inner source port
+ 0x0800 Inner destination port
+ ====== ============================
+
+ Default: 0x0007 (source IP, destination IP and IP protocol)
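Note: the bitmask described for fib_multipath_hash_fields can be composed by OR-ing the values from the table. The following is a minimal Python sketch of that arithmetic only; the dictionary keys and helper names are illustrative assumptions, not kernel or iproute2 identifiers.

```python
# Bit values copied from the fib_multipath_hash_fields table.
# The key names are hypothetical labels chosen for this sketch.
HASH_FIELDS = {
    "src_ip": 0x0001,
    "dst_ip": 0x0002,
    "ip_proto": 0x0004,
    "flow_label": 0x0008,        # listed as unused for IPv4
    "src_port": 0x0010,
    "dst_port": 0x0020,
    "inner_src_ip": 0x0040,
    "inner_dst_ip": 0x0080,
    "inner_ip_proto": 0x0100,
    "inner_flow_label": 0x0200,
    "inner_src_port": 0x0400,
    "inner_dst_port": 0x0800,
}


def hash_fields_mask(*names):
    """OR together the bits for the named fields."""
    mask = 0
    for name in names:
        mask |= HASH_FIELDS[name]
    return mask


def describe_mask(mask):
    """List the field names whose bits are set in mask."""
    return [name for name, bit in HASH_FIELDS.items() if mask & bit]


default_mask = hash_fields_mask("src_ip", "dst_ip", "ip_proto")
five_tuple = hash_fields_mask("src_ip", "dst_ip", "ip_proto",
                              "src_port", "dst_port")
print(f"default:    {default_mask:#06x}")
print(f"five-tuple: {five_tuple:#06x}")
```

The resulting value only takes effect when fib_multipath_hash_policy is set to 3, as stated in the text.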
fib_sync_mem - UNSIGNED INTEGER
Amount of dirty memory from fib entries that can be backlogged before
synchronize_rcu is forced.
- Default: 512kB Minimum: 64kB Maximum: 64MB
+
+ Default: 512kB Minimum: 64kB Maximum: 64MB
ip_forward_update_priority - INTEGER
Whether to update SKB priority from "TOS" field in IPv4 header after it
is forwarded. The new SKB priority is mapped from TOS field value
according to an rt_tos2priority table (see e.g. man tc-prio).
+
Default: 1 (Update priority.)
+
Possible values:
- 0 - Do not update priority.
- 1 - Update priority.
+
+ - 0 - Do not update priority.
+ - 1 - Update priority.
route/max_size - INTEGER
Maximum number of routes allowed in the kernel. Increase
this when using large numbers of interfaces and/or routes.
+
From linux kernel 3.6 onwards, this is deprecated for ipv4
as route cache is no longer used.
neigh/default/gc_thresh1 - INTEGER
Minimum number of entries to keep. Garbage collector will not
purge entries if there are fewer than this number.
+
Default: 128
neigh/default/gc_thresh2 - INTEGER
Threshold when garbage collector becomes more aggressive about
purging entries. Entries older than 5 seconds will be cleared
when over this number.
+
Default: 512
neigh/default/gc_thresh3 - INTEGER
Maximum number of non-PERMANENT neighbor entries allowed. Increase
this when using large numbers of interfaces and when communicating
with large numbers of directly-connected peers.
+
Default: 1024
neigh/default/unres_qlen_bytes - INTEGER
The maximum number of bytes which may be used by packets
queued for each unresolved address by other network layers.
(added in linux 3.3)
+
Setting negative value is meaningless and will return error.
+
Default: SK_WMEM_MAX, (same as net.core.wmem_default).
+
Exact value depends on architecture and kernel options,
but should be enough to allow queuing 256 packets
of medium size.
@@ -132,13 +192,22 @@ neigh/default/unres_qlen_bytes - INTEGER
neigh/default/unres_qlen - INTEGER
The maximum number of packets which may be queued for each
unresolved address by other network layers.
+
(deprecated in linux 3.3) : use unres_qlen_bytes instead.
+
Prior to linux 3.3, the default value is 3 which may cause
unexpected packet loss. The current default value is calculated
according to default value of unres_qlen_bytes and true size of
packet.
+
Default: 101
+neigh/default/interval_probe_time_ms - INTEGER
+ The probe interval for neighbor entries with NTF_MANAGED flag,
+ the min value is 1.
+
+ Default: 5000
+
mtu_expires - INTEGER
Time, in seconds, that cached PMTU information is kept.
@@ -146,6 +215,27 @@ min_adv_mss - INTEGER
The advertised MSS depends on the first hop route MTU, but will
never be lower than this setting.
+fib_notify_on_flag_change - INTEGER
+ Whether to emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/
+ RTM_F_TRAP/RTM_F_OFFLOAD_FAILED flags are changed.
+
+ After installing a route to the kernel, user space receives an
+ acknowledgment, which means the route was installed in the kernel,
+ but not necessarily in hardware.
+ It is also possible for a route already installed in hardware to change
+ its action and therefore its flags. For example, a host route that is
+ trapping packets can be "promoted" to perform decapsulation following
+ the installation of an IPinIP/VXLAN tunnel.
+ The notifications will indicate to user-space the state of the route.
+
+ Default: 0 (Do not emit notifications.)
+
+ Possible values:
+
+ - 0 - Do not emit notifications.
+ - 1 - Emit notifications.
+ - 2 - Emit notifications only for RTM_F_OFFLOAD_FAILED flag change.
+
IP Fragmentation:
ipfrag_high_thresh - LONG INTEGER
@@ -183,7 +273,15 @@ ipfrag_max_dist - INTEGER
from different IP datagrams, which could result in data corruption.
Default: 64
-INET peer storage:
+bc_forwarding - INTEGER
+ bc_forwarding enables the feature described in rfc1812#section-5.3.5.2
+ and rfc2644. It allows the router to forward directed broadcast.
+ To enable this feature, the 'all' entry and the input interface entry
+ should be set to 1.
+ Default: 0
+
+INET peer storage
+=================
inet_peer_threshold - INTEGER
The approximate size of the storage. Starting from this threshold
@@ -203,7 +301,8 @@ inet_peer_maxttl - INTEGER
when the number of entries in the pool is very small).
Measured in seconds.
-TCP variables:
+TCP variables
+=============
somaxconn - INTEGER
Limit of socket listen() backlog, known in userspace as SOMAXCONN.
@@ -222,18 +321,22 @@ tcp_adv_win_scale - INTEGER
Count buffering overhead as bytes/2^tcp_adv_win_scale
(if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale),
if it is <= 0.
+
Possible values are [-31, 31], inclusive.
+
Default: 1
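Note: the two-branch overhead rule for tcp_adv_win_scale can be worked through numerically. This is a small Python sketch of the formula as written in the text, using integer shifts for the powers of two; the function name is an assumption made for illustration and does not correspond to a kernel symbol.

```python
def buffering_overhead(nbytes, tcp_adv_win_scale):
    """Overhead per the documented rule:
    bytes / 2^scale            if scale > 0
    bytes - bytes / 2^(-scale) if scale <= 0
    """
    if not -31 <= tcp_adv_win_scale <= 31:
        raise ValueError("tcp_adv_win_scale must be in [-31, 31]")
    if tcp_adv_win_scale > 0:
        return nbytes >> tcp_adv_win_scale
    return nbytes - (nbytes >> -tcp_adv_win_scale)


for scale in (2, 1, 0, -1, -2):
    print(scale, buffering_overhead(65536, scale))
```

With a 65536-byte buffer, a scale of 1 counts half of the buffer as overhead, while a scale of -2 counts three quarters of it.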
tcp_allowed_congestion_control - STRING
Show/set the congestion control choices available to non-privileged
processes. The list is a subset of those listed in
tcp_available_congestion_control.
+
Default is "reno" and the default setting (tcp_congestion_control).
tcp_app_win - INTEGER
Reserve max(window/2^tcp_app_win, mss) of window for application
buffer. Value 0 is special, it means that nothing is reserved.
+
Default: 31
tcp_autocorking - BOOLEAN
@@ -244,6 +347,7 @@ tcp_autocorking - BOOLEAN
packet for the flow is waiting in Qdisc queues or device transmit
queue. Applications can still use TCP_CORK for optimal behavior
when they know how/when to uncork their sockets.
+
Default : 1
tcp_available_congestion_control - STRING
@@ -265,6 +369,7 @@ tcp_mtu_probe_floor - INTEGER
tcp_min_snd_mss - INTEGER
TCP SYN and SYNACK messages usually advertise an ADVMSS option,
as described in RFC 1122 and RFC 6691.
+
If this ADVMSS option is smaller than tcp_min_snd_mss,
it is silently capped to tcp_min_snd_mss.
@@ -277,6 +382,7 @@ tcp_congestion_control - STRING
Default is set as part of kernel configuration.
For passive connections, the listener congestion control choice
is inherited.
+
[see setsockopt(listenfd, SOL_TCP, TCP_CONGESTION, "name" ...) ]
tcp_dsack - BOOLEAN
@@ -286,9 +392,12 @@ tcp_early_retrans - INTEGER
Tail loss probe (TLP) converts RTOs occurring due to tail
losses into fast recovery (draft-ietf-tcpm-rack). Note that
TLP requires RACK to function properly (see tcp_recovery below)
+
Possible values:
- 0 disables TLP
- 3 or 4 enables TLP
+
+ - 0 disables TLP
+ - 3 or 4 enables TLP
+
Default: 3
tcp_ecn - INTEGER
@@ -297,12 +406,17 @@ tcp_ecn - INTEGER
support for it. This feature is useful in avoiding losses due
to congestion by allowing supporting routers to signal
congestion before having to drop packets.
+
Possible values are:
- 0 Disable ECN. Neither initiate nor accept ECN.
- 1 Enable ECN when requested by incoming connections and
- also request ECN on outgoing connection attempts.
- 2 Enable ECN when requested by incoming connections
- but do not request ECN on outgoing connections.
+
+ = =====================================================
+ 0 Disable ECN. Neither initiate nor accept ECN.
+ 1 Enable ECN when requested by incoming connections and
+ also request ECN on outgoing connection attempts.
+ 2 Enable ECN when requested by incoming connections
+ but do not request ECN on outgoing connections.
+ = =====================================================
+
Default: 2
tcp_ecn_fallback - BOOLEAN
@@ -312,6 +426,7 @@ tcp_ecn_fallback - BOOLEAN
additional detection mechanisms could be implemented under this
knob. The value is not used, if tcp_ecn or per route (or congestion
control) ECN settings are disabled.
+
Default: 1 (fallback enabled)
tcp_fack - BOOLEAN
@@ -324,7 +439,9 @@ tcp_fin_timeout - INTEGER
valid "receive only" state for an un-orphaned connection, an
orphaned connection in FIN_WAIT_2 state could otherwise wait
forever for the remote to close its end of the connection.
+
Cf. tcp_max_orphans
+
Default: 60 seconds
tcp_frto - INTEGER
@@ -390,7 +507,8 @@ tcp_l3mdev_accept - BOOLEAN
derived from the listen socket to be bound to the L3 domain in
which the packets originated. Only valid when the kernel was
compiled with CONFIG_NET_L3_MASTER_DEV.
- Default: 0 (disabled)
+
+ Default: 0 (disabled)
tcp_low_latency - BOOLEAN
This is a legacy option, it has no effect anymore.
@@ -410,10 +528,14 @@ tcp_max_orphans - INTEGER
tcp_max_syn_backlog - INTEGER
Maximal number of remembered connection requests (SYN_RECV),
which have not received an acknowledgment from connecting client.
+
This is a per-listener limit.
+
The minimal value is 128 for low memory machines, and it will
increase in proportion to the memory of machine.
+
If server suffers from overload, try increasing this number.
+
Remember to also check /proc/sys/net/core/somaxconn
A SYN_RECV request socket consumes about 304 bytes of memory.
@@ -445,7 +567,9 @@ tcp_min_rtt_wlen - INTEGER
minimum RTT when it is moved to a longer path (e.g., due to traffic
engineering). A longer window makes the filter more resistant to RTT
inflations such as transient congestion. The unit is seconds.
+
Possible values: 0 - 86400 (1 day)
+
Default: 300
tcp_moderate_rcvbuf - BOOLEAN
@@ -457,9 +581,10 @@ tcp_moderate_rcvbuf - BOOLEAN
tcp_mtu_probing - INTEGER
Controls TCP Packetization-Layer Path MTU Discovery. Takes three
values:
- 0 - Disabled
- 1 - Disabled by default, enabled when an ICMP black hole detected
- 2 - Always enabled, use initial MSS of tcp_base_mss.
+
+ - 0 - Disabled
+ - 1 - Disabled by default, enabled when an ICMP black hole detected
+ - 2 - Always enabled, use initial MSS of tcp_base_mss.
tcp_probe_interval - UNSIGNED INTEGER
Controls how often to start TCP Packetization-Layer Path MTU
@@ -481,6 +606,7 @@ tcp_no_metrics_save - BOOLEAN
tcp_no_ssthresh_metrics_save - BOOLEAN
Controls whether TCP saves ssthresh metrics in the route cache.
+
Default is 1, which disables ssthresh metrics.
tcp_orphan_retries - INTEGER
@@ -489,6 +615,7 @@ tcp_orphan_retries - INTEGER
See tcp_retries2 for more details.
The default value is 8.
+
If your machine is a loaded WEB server,
you should think about lowering this value, such sockets
may consume significant resources. Cf. tcp_max_orphans.
@@ -497,24 +624,40 @@ tcp_recovery - INTEGER
This value is a bitmap to enable various experimental loss recovery
features.
- RACK: 0x1 enables the RACK loss detection for fast detection of lost
- retransmissions and tail drops. It also subsumes and disables
- RFC6675 recovery for SACK connections.
- RACK: 0x2 makes RACK's reordering window static (min_rtt/4).
- RACK: 0x4 disables RACK's DUPACK threshold heuristic
+ ========= =============================================================
+ RACK: 0x1 enables the RACK loss detection for fast detection of lost
+ retransmissions and tail drops. It also subsumes and disables
+ RFC6675 recovery for SACK connections.
+
+ RACK: 0x2 makes RACK's reordering window static (min_rtt/4).
+
+ RACK: 0x4 disables RACK's DUPACK threshold heuristic
+ ========= =============================================================
Default: 0x1
+tcp_reflect_tos - BOOLEAN
+ For listening sockets, reuse the DSCP value of the initial SYN message
+ for outgoing packets. This allows both directions of a TCP
+ stream to use the same DSCP value, assuming DSCP remains unchanged for
+ the lifetime of the connection.
+
+ This option affects both IPv4 and IPv6.
+
+ Default: 0 (disabled)
+
tcp_reordering - INTEGER
Initial reordering level of packets in a TCP stream.
TCP stack can then dynamically adjust flow reordering level
between this initial value and tcp_max_reordering
+
Default: 3
tcp_max_reordering - INTEGER
Maximal reordering level of packets in a TCP stream.
300 is a fairly conservative value, but you might increase it
if paths are using per packet load balancing (like bonding rr mode)
+
Default: 300
tcp_retrans_collapse - BOOLEAN
@@ -550,26 +693,27 @@ tcp_rfc1337 - BOOLEAN
If set, the TCP stack behaves conforming to RFC1337. If unset,
we are not conforming to RFC, but prevent TCP TIME_WAIT
assassination.
+
Default: 0
tcp_rmem - vector of 3 INTEGERs: min, default, max
min: Minimal size of receive buffer used by TCP sockets.
It is guaranteed to each TCP socket, even under moderate memory
pressure.
+
Default: 4K
default: initial size of receive buffer used by TCP sockets.
This value overrides net.core.rmem_default used by other protocols.
- Default: 87380 bytes. This value results in window of 65535 with
- default setting of tcp_adv_win_scale and tcp_app_win:0 and a bit
- less for default tcp_app_win. See below about these variables.
+ Default: 131072 bytes.
+ This value results in initial window of 65535.
max: maximal size of receive buffer allowed for automatically
selected receiver buffers for TCP socket. This value does not override
net.core.rmem_max. Calling setsockopt() with SO_RCVBUF disables
automatic tuning of that socket's receive buffer size, in which
case this value is ignored.
- Default: between 87380B and 6MB, depending on RAM size.
+ Default: between 131072 and 6MB, depending on RAM size.
tcp_sack - BOOLEAN
Enable selective acknowledgments (SACKs).
@@ -581,6 +725,14 @@ tcp_comp_sack_delay_ns - LONG INTEGER
Default : 1,000,000 ns (1 ms)
+tcp_comp_sack_slack_ns - LONG INTEGER
+ This sysctl controls the slack used when arming the
+ timer used by SACK compression. This gives extra time
+ for small RTT flows, and reduces system overhead by allowing
+ opportunistic reduction of timer interrupts.
+
+ Default : 100,000 ns (100 us)
+
tcp_comp_sack_nr - INTEGER
Max number of SACK that can be compressed.
Using 0 disables SACK compression.
@@ -592,12 +744,14 @@ tcp_slow_start_after_idle - BOOLEAN
window after an idle period. An idle period is defined at
the current RTO. If unset, the congestion window will not
be timed out after an idle period.
+
Default: 1
tcp_stdurg - BOOLEAN
Use the Host requirements interpretation of the TCP urgent pointer field.
Most hosts use the older BSD interpretation, so if you turn this on
Linux might not communicate correctly with them.
+
Default: FALSE
tcp_synack_retries - INTEGER
@@ -632,6 +786,31 @@ tcp_syncookies - INTEGER
network connections you can set this knob to 2 to enable
unconditional generation of syncookies.
+tcp_migrate_req - BOOLEAN
+ The incoming connection is tied to a specific listening socket when
+ the initial SYN packet is received during the three-way handshake.
+ When a listener is closed, in-flight request sockets during the
+ handshake and established sockets in the accept queue are aborted.
+
+ If the listener has SO_REUSEPORT enabled, other listeners on the
+ same port should have been able to accept such connections. This
+ option makes it possible to migrate such child sockets to another
+ listener after close() or shutdown().
+
+ The BPF_SK_REUSEPORT_SELECT_OR_MIGRATE type of eBPF program should
+ usually be used to define the policy to pick an alive listener.
+ Otherwise, the kernel will randomly pick an alive listener only if
+ this option is enabled.
+
+ Note that migration between listeners with different settings may
+ crash applications. Let's say migration happens from listener A to
+ B, and only B has TCP_SAVE_SYN enabled. B cannot read SYN data from
+ the requests migrated from A. To avoid such a situation, cancel
+ migration by returning SK_DROP in the type of eBPF program, or
+ disable this option.
+
+ Default: 0
+
tcp_fastopen - INTEGER
Enable TCP Fast Open (RFC7413) to send and accept data in the opening
SYN packet.
@@ -646,19 +825,22 @@ tcp_fastopen - INTEGER
the option value being the length of the syn-data backlog.
The values (bitmap) are
- 0x1: (client) enables sending data in the opening SYN on the client.
- 0x2: (server) enables the server support, i.e., allowing data in
+
+ ===== ======== ======================================================
+ 0x1 (client) enables sending data in the opening SYN on the client.
+ 0x2 (server) enables the server support, i.e., allowing data in
a SYN packet to be accepted and passed to the
application before 3-way handshake finishes.
- 0x4: (client) send data in the opening SYN regardless of cookie
+ 0x4 (client) send data in the opening SYN regardless of cookie
availability and without a cookie option.
- 0x200: (server) accept data-in-SYN w/o any cookie option present.
- 0x400: (server) enable all listeners to support Fast Open by
+ 0x200 (server) accept data-in-SYN w/o any cookie option present.
+ 0x400 (server) enable all listeners to support Fast Open by
default without explicit TCP_FASTOPEN socket option.
+ ===== ======== ======================================================
Default: 0x1
- Note that that additional client or server features are only
+ Note that additional client or server features are only
effective if the basic support (0x1 and 0x2) are enabled respectively.
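Note: the tcp_fastopen value is a bitmap, so a configured value can be decoded against the table. The Python sketch below does only that decoding; the helper name and the shortened descriptions are assumptions for illustration.

```python
# Flag values copied from the tcp_fastopen table; descriptions abbreviated.
TFO_FLAGS = {
    0x1: "client: send data in the opening SYN",
    0x2: "server: accept data in SYN before handshake finishes",
    0x4: "client: send data in SYN regardless of cookie availability",
    0x200: "server: accept data-in-SYN without a cookie option",
    0x400: "server: all listeners support Fast Open by default",
}


def decode_tcp_fastopen(value):
    """Return the descriptions of all flags set in value."""
    return [desc for bit, desc in TFO_FLAGS.items() if value & bit]


for line in decode_tcp_fastopen(0x3):
    print(line)
```

Here 0x3 enables the basic client and server support together, which the note above says is a precondition for the additional client or server features.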
tcp_fastopen_blackhole_timeout_sec - INTEGER
@@ -668,7 +850,8 @@ tcp_fastopen_blackhole_timeout_sec - INTEGER
get detected right after Fastopen is re-enabled and will reset to
initial value when the blackhole issue goes away.
0 to disable the blackhole detection.
- By default, it is set to 1hr.
+
+ By default, it is set to 0 (feature is disabled).
tcp_fastopen_key - list of comma separated 32-digit hexadecimal INTEGERs
The list consists of a primary key and an optional backup key. The
@@ -698,28 +881,56 @@ tcp_syn_retries - INTEGER
for an active TCP connection attempt will happen after 127 seconds.
tcp_timestamps - INTEGER
-Enable timestamps as defined in RFC1323.
- 0: Disabled.
- 1: Enable timestamps as defined in RFC1323 and use random offset for
- each connection rather than only using the current time.
- 2: Like 1, but without random offsets.
+ Enable timestamps as defined in RFC1323.
+
+ - 0: Disabled.
+ - 1: Enable timestamps as defined in RFC1323 and use random offset for
+ each connection rather than only using the current time.
+ - 2: Like 1, but without random offsets.
+
Default: 1
tcp_min_tso_segs - INTEGER
Minimal number of segments per TSO frame.
+
Since linux-3.12, TCP does an automatic sizing of TSO frames,
depending on flow rate, instead of filling 64Kbytes packets.
For specific usages, it's possible to force TCP to build big
TSO frames. Note that TCP stack might split too big TSO packets
if available window is too small.
+
Default: 2
+tcp_tso_rtt_log - INTEGER
+ Adjustment of TSO packet sizes based on min_rtt
+
+ Starting from linux-5.18, TCP autosizing can be tweaked
+ for flows having small RTT.
+
+ Old autosizing was splitting the pacing budget to send 1024 TSO
+ per second.
+
+ tso_packet_size = sk->sk_pacing_rate / 1024;
+
+ With the new mechanism, we increase this TSO sizing using:
+
+ distance = min_rtt_usec / (2^tcp_tso_rtt_log)
+ tso_packet_size += gso_max_size >> distance;
+
+ This means that flows between very close hosts can use bigger
+ TSO packets, reducing their cpu costs.
+
+ If you want to use the old autosizing, set this sysctl to 0.
+
+ Default: 9 (2^9 = 512 usec)
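Note: the tcp_tso_rtt_log arithmetic can be checked with a few numbers. The Python sketch below follows the two formulas quoted in the text. It assumes a gso_max_size of 65536 bytes and omits any clamping the kernel applies afterwards; the function and parameter names are illustrative, not kernel symbols.

```python
def tso_packet_size(pacing_rate, min_rtt_usec,
                    tcp_tso_rtt_log=9, gso_max_size=65536):
    """Sketch of the documented TSO autosizing.

    pacing_rate   -- bytes per second (stands in for sk->sk_pacing_rate)
    min_rtt_usec  -- minimum RTT of the flow in microseconds
    gso_max_size  -- assumed device limit; 65536 is only a placeholder
    """
    # Old autosizing: split the pacing budget into 1024 TSO per second.
    size = pacing_rate // 1024
    if tcp_tso_rtt_log == 0:
        # The text says 0 selects the old autosizing.
        return size
    # New mechanism: the closer the peer, the larger the bonus.
    distance = min_rtt_usec >> tcp_tso_rtt_log
    size += gso_max_size >> distance
    return size


for rtt in (100, 512, 2048, 10240):
    print(rtt, tso_packet_size(1_024_000, rtt))
```

With the default of 9, every additional 512 usec of min_rtt halves the bonus, so only flows between very close hosts see noticeably larger TSO packets.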
+
tcp_pacing_ss_ratio - INTEGER
sk->sk_pacing_rate is set by TCP stack using a ratio applied
to current rate. (current_rate = cwnd * mss / srtt)
If TCP is in slow start, tcp_pacing_ss_ratio is applied
to let TCP probe for bigger speeds, assuming cwnd can be
doubled every other RTT.
+
Default: 200
tcp_pacing_ca_ratio - INTEGER
@@ -727,6 +938,7 @@ tcp_pacing_ca_ratio - INTEGER
to current rate. (current_rate = cwnd * mss / srtt)
If TCP is in congestion avoidance phase, tcp_pacing_ca_ratio
is applied to conservatively probe for bigger throughput.
+
Default: 120
tcp_tso_win_divisor - INTEGER
@@ -734,16 +946,20 @@ tcp_tso_win_divisor - INTEGER
can be consumed by a single TSO frame.
The setting of this parameter is a choice between burstiness and
building larger TSO frames.
+
Default: 3
tcp_tw_reuse - INTEGER
Enable reuse of TIME-WAIT sockets for new connections when it is
safe from protocol viewpoint.
- 0 - disable
- 1 - global enable
- 2 - enable for loopback traffic only
+
+ - 0 - disable
+ - 1 - global enable
+ - 2 - enable for loopback traffic only
+
It should not be changed without advice/request of technical
experts.
+
Default: 2
tcp_window_scaling - BOOLEAN
@@ -752,11 +968,14 @@ tcp_window_scaling - BOOLEAN
tcp_wmem - vector of 3 INTEGERs: min, default, max
min: Amount of memory reserved for send buffers for TCP sockets.
Each TCP socket has rights to use it due to fact of its birth.
+
Default: 4K
default: initial size of send buffer used by TCP sockets. This
value overrides net.core.wmem_default used by other protocols.
+
It is usually lower than net.core.wmem_default.
+
Default: 16K
max: Maximal amount of memory allowed for automatically tuned
@@ -764,6 +983,7 @@ tcp_wmem - vector of 3 INTEGERs: min, default, max
net.core.wmem_max. Calling setsockopt() with SO_SNDBUF disables
automatic tuning of that socket's send buffer size, in which case
this value is ignored.
+
Default: between 64K and 4MB, depending on RAM size.
tcp_notsent_lowat - UNSIGNED INTEGER
@@ -784,6 +1004,7 @@ tcp_workaround_signed_windows - BOOLEAN
remote TCP is broken and treats the window as a signed quantity.
If unset, assume the remote TCP is not broken even if we do
not receive a window scaling option from them.
+
Default: 0
tcp_thin_linear_timeouts - BOOLEAN
@@ -795,7 +1016,8 @@ tcp_thin_linear_timeouts - BOOLEAN
initiated. This improves retransmission latency for
non-aggressive thin streams, often found to be time-dependent.
For more information on thin streams, see
- Documentation/networking/tcp-thin.txt
+ Documentation/networking/tcp-thin.rst
+
Default: 0
tcp_limit_output_bytes - INTEGER
@@ -807,22 +1029,48 @@ tcp_limit_output_bytes - INTEGER
flows, for typical pfifo_fast qdiscs. tcp_limit_output_bytes
limits the number of bytes on qdisc or device to reduce artificial
RTT/cwnd and reduce bufferbloat.
+
Default: 1048576 (16 * 65536)
tcp_challenge_ack_limit - INTEGER
Limits number of Challenge ACK sent per second, as recommended
in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks)
- Default: 100
+ Note that this per netns rate limit can allow some side channel
+ attacks and probably should not be enabled.
+ TCP stack implements per TCP socket limits anyway.
+
+ Default: INT_MAX (unlimited)
-tcp_rx_skb_cache - BOOLEAN
- Controls a per TCP socket cache of one skb, that might help
- performance of some workloads. This might be dangerous
- on systems with a lot of TCP sockets, since it increases
- memory usage.
+tcp_ehash_entries - INTEGER
+ Show the number of hash buckets for TCP sockets in the current
+ networking namespace.
- Default: 0 (disabled)
+ A negative value means the networking namespace does not own its
+ hash buckets and shares the initial networking namespace's one.
+
+tcp_child_ehash_entries - INTEGER
+ Control the number of hash buckets for TCP sockets in the child
+ networking namespace, which must be set before clone() or unshare().
+
+ If the value is not 0, the kernel uses a value rounded up to 2^n
+ as the actual hash bucket size. 0 is a special value, meaning
+ the child networking namespace will share the initial networking
+ namespace's hash buckets.
+
+ Note that the child will use the global one in case the kernel
+ fails to allocate enough memory. In addition, the global hash
+ buckets are spread over available NUMA nodes, but the allocation
+ of the child hash table depends on the current process's NUMA
+ policy, which could result in performance differences.
-UDP variables:
+ Note also that the default values of tcp_max_tw_buckets and
+ tcp_max_syn_backlog depend on the hash bucket size.
+
+ Possible values: 0, 2^n (n: 0 - 24 (16Mi))
+
+ Default: 0
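Note: the "rounded up to 2^n" behaviour described for tcp_child_ehash_entries can be illustrated as follows. This Python sketch only mirrors the rounding and the documented range (n from 0 to 24); the function name is an assumption, and the kernel's fallback to the global table on allocation failure is not modelled.

```python
MAX_EHASH_ENTRIES = 1 << 24  # 16Mi, the documented upper bound


def rounded_ehash_entries(requested):
    """Return the bucket count a child netns would use for a request.

    0 is returned unchanged: it means the child shares the initial
    networking namespace's hash buckets.
    """
    if requested == 0:
        return 0
    if not 0 < requested <= MAX_EHASH_ENTRIES:
        raise ValueError("value must be 0 or within 1..2^24")
    # Smallest power of two that is >= requested.
    return 1 << (requested - 1).bit_length()


for value in (0, 1, 1000, 1024, 1025):
    print(value, rounded_ehash_entries(value))
```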
+
+UDP variables
+=============
udp_l3mdev_accept - BOOLEAN
Enabling this option allows a "global" bound socket to work
@@ -830,18 +1078,17 @@ udp_l3mdev_accept - BOOLEAN
being received regardless of the L3 domain in which they
originated. Only valid when the kernel was compiled with
CONFIG_NET_L3_MASTER_DEV.
- Default: 0 (disabled)
+
+ Default: 0 (disabled)
udp_mem - vector of 3 INTEGERs: min, pressure, max
Number of pages allowed for queueing by all UDP sockets.
- min: Below this number of pages UDP is not bothered about its
- memory appetite. When amount of memory allocated by UDP exceeds
- this number, UDP starts to moderate memory usage.
+ min: Number of pages allowed for queueing by all UDP sockets.
pressure: This value was introduced to follow format of tcp_mem.
- max: Number of pages allowed for queueing by all UDP sockets.
+ max: This value was introduced to follow format of tcp_mem.
Default is calculated at boot time from amount of available memory.
@@ -849,15 +1096,14 @@ udp_rmem_min - INTEGER
Minimal size of receive buffer used by UDP sockets in moderation.
Each UDP socket is able to use the size for receiving data, even if
total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
+
Default: 4K
udp_wmem_min - INTEGER
- Minimal size of send buffer used by UDP sockets in moderation.
- Each UDP socket is able to use the size for sending data, even if
- total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
- Default: 4K
+ UDP does not have tx memory accounting and this tunable has no effect.
-RAW variables:
+RAW variables
+=============
raw_l3mdev_accept - BOOLEAN
Enabling this option allows a "global" bound socket to work
@@ -865,9 +1111,11 @@ raw_l3mdev_accept - BOOLEAN
being received regardless of the L3 domain in which they
originated. Only valid when the kernel was compiled with
CONFIG_NET_L3_MASTER_DEV.
+
Default: 1 (enabled)
-CIPSOv4 Variables:
+CIPSOv4 Variables
+=================
cipso_cache_enable - BOOLEAN
If set, enable additions to and lookups from the CIPSO label mapping
@@ -875,15 +1123,17 @@ cipso_cache_enable - BOOLEAN
miss. However, regardless of the setting the cache is still
invalidated when required, which means you can safely toggle this on and
off and the cache will always be "safe".
+
Default: 1
cipso_cache_bucket_size - INTEGER
The CIPSO label cache consists of a fixed size hash table with each
hash bucket containing a number of cache entries. This variable limits
- the number of entries in each hash bucket; the larger the value the
+ the number of entries in each hash bucket; the larger the value is, the
more CIPSO label mappings can be cached. When the number of
entries in a given hash bucket reaches this limit, adding new entries
causes the oldest entry in the bucket to be removed to make room.
+
Default: 10
cipso_rbm_optfmt - BOOLEAN
@@ -891,6 +1141,7 @@ cipso_rbm_optfmt - BOOLEAN
the CIPSO draft specification (see Documentation/netlabel for details).
This means that when set the CIPSO tag will be padded with empty
categories in order to make the packet data 32-bit aligned.
+
Default: 0
cipso_rbm_structvalid - BOOLEAN
@@ -900,9 +1151,11 @@ cipso_rbm_structvalid - BOOLEAN
where in the CIPSO processing code but setting this to 0 (False) should
result in less work (i.e. it should be faster) but could cause problems
with other implementations that require strict checking.
+
Default: 0
-IP Variables:
+IP Variables
+============
ip_local_port_range - 2 INTEGERS
Defines the local port range that is used by TCP and UDP to
@@ -931,16 +1184,18 @@ ip_local_reserved_ports - list of comma separated ranges
assignments.
You can reserve ports which are not in the current
- ip_local_port_range, e.g.:
+ ip_local_port_range, e.g.::
- $ cat /proc/sys/net/ipv4/ip_local_port_range
- 32000 60999
- $ cat /proc/sys/net/ipv4/ip_local_reserved_ports
- 8080,9148
+ $ cat /proc/sys/net/ipv4/ip_local_port_range
+ 32000 60999
+ $ cat /proc/sys/net/ipv4/ip_local_reserved_ports
+ 8080,9148
although this is redundant. However such a setting is useful
if later the port range is changed to a value that will
- include the reserved ports.
+ include the reserved ports. Also keep in mind that overlapping
+ these ranges may affect the probability of selecting ephemeral
+ ports that come right after a block of reserved ports.
Default: Empty
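The relationship between the two settings shown above can be checked with a short sketch. It is illustrative only: the helper names are hypothetical, and the ``low-high`` range syntax is an assumption about the "comma separated ranges" format, which is not spelled out in this excerpt.

```python
def parse_port_list(text):
    """Parse a comma separated list of ports or low-high ranges (assumed syntax)."""
    ports = set()
    for item in text.strip().split(","):
        if not item:
            continue  # an empty list is the documented default
        low, _, high = item.partition("-")
        ports.update(range(int(low), int(high or low) + 1))
    return ports


def reserved_in_range(port_range, reserved):
    """Return the reserved ports that fall inside the local port range."""
    low, high = (int(v) for v in port_range.split())
    return sorted(p for p in parse_port_list(reserved) if low <= p <= high)


# Values from the example above: the reservation is redundant for now.
print(reserved_in_range("32000 60999", "8080,9148"))  # []
# After widening the range, the reservation starts to matter.
print(reserved_in_range("1024 60999", "8080,9148"))   # [8080, 9148]
```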
@@ -956,13 +1211,24 @@ ip_unprivileged_port_start - INTEGER
ip_nonlocal_bind - BOOLEAN
If set, allows processes to bind() to non-local IP addresses,
which can be quite useful - but may break some applications.
+
Default: 0
-ip_dynaddr - BOOLEAN
+ip_autobind_reuse - BOOLEAN
+ By default, bind() does not select the ports automatically even if
+ the new socket and all sockets bound to the port have SO_REUSEADDR.
+ ip_autobind_reuse allows bind() to reuse the port and this is useful
+ when you use bind()+connect(), but may break some applications.
+ The preferred solution is to use IP_BIND_ADDRESS_NO_PORT and this
+ option should only be set by experts.
+ Default: 0
+
+ip_dynaddr - INTEGER
If set non-zero, enables support for dynamic addresses.
If set to a non-zero value larger than 1, a kernel log
message will be printed when dynamic address rewriting
occurs.
+
Default: 0
ip_early_demux - BOOLEAN
@@ -972,25 +1238,43 @@ ip_early_demux - BOOLEAN
It may add an additional cost for pure routing workloads that
reduces overall throughput, in such case you should disable it.
+
Default: 1
+ping_group_range - 2 INTEGERS
+ Restrict ICMP_PROTO datagram sockets to users in the group range.
+ The default is "1 0", meaning that nobody (not even root) may
+ create ping sockets. Setting it to "100 100" would grant permissions
+ to the single group. "0 4294967295" would enable it for the world, "100
+ 4294967295" would enable it for the users, but not daemons.
+
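A minimal sketch of how the ping_group_range examples above can be read. It assumes the range is inclusive at both ends and that membership in any group inside the range is sufficient; the function name is hypothetical and this is not the kernel's implementation.

```python
def may_create_ping_socket(group_range, gids):
    """Return True if any of the given group IDs lies in the inclusive range."""
    low, high = (int(v) for v in group_range.split())
    return any(low <= gid <= high for gid in gids)


print(may_create_ping_socket("1 0", [0]))                # False: empty range
print(may_create_ping_socket("100 100", [100]))          # True: the single group
print(may_create_ping_socket("0 4294967295", [12345]))   # True: everyone
print(may_create_ping_socket("100 4294967295", [5]))     # False: low-numbered group
```

With the default "1 0" the lower bound is above the upper bound, so no group ID can match, which is why not even root may create ping sockets.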
tcp_early_demux - BOOLEAN
Enable early demux for established TCP sockets.
+
Default: 1
udp_early_demux - BOOLEAN
Enable early demux for connected UDP sockets. Disable this if
your system could experience more unconnected load.
+
Default: 1
icmp_echo_ignore_all - BOOLEAN
If set non-zero, then the kernel will ignore all ICMP ECHO
requests sent to it.
+
Default: 0
+icmp_echo_enable_probe - BOOLEAN
+ If set to one, then the kernel will respond to RFC 8335 PROBE
+ requests sent to it.
+
+ Default: 0
+
icmp_echo_ignore_broadcasts - BOOLEAN
If set non-zero, then the kernel will ignore all ICMP ECHO and
TIMESTAMP requests sent to it via broadcast/multicast.
+
Default: 1
icmp_ratelimit - INTEGER
@@ -1000,46 +1284,57 @@ icmp_ratelimit - INTEGER
otherwise the minimal space between responses in milliseconds.
Note that another sysctl, icmp_msgs_per_sec limits the number
of ICMP packets sent on all targets.
+
Default: 1000
icmp_msgs_per_sec - INTEGER
Limit maximal number of ICMP packets sent per second from this host.
Only messages whose type matches icmp_ratemask (see below) are
- controlled by this limit.
+ controlled by this limit. For security reasons, the precise count
+ of messages per second is randomized.
+
Default: 1000
icmp_msgs_burst - INTEGER
icmp_msgs_per_sec controls number of ICMP packets sent per second,
while icmp_msgs_burst controls the burst size of these packets.
+ For security reasons, the precise burst size is randomized.
+
Default: 50
icmp_ratemask - INTEGER
Mask made of ICMP types for which rates are being limited.
+
Significant bits: IHGFEDCBA9876543210
+
Default mask: 0000001100000011000 (6168)
Bit definitions (see include/linux/icmp.h):
+
+ = =========================
0 Echo Reply
- 3 Destination Unreachable *
- 4 Source Quench *
+ 3 Destination Unreachable [1]_
+ 4 Source Quench [1]_
5 Redirect
8 Echo Request
- B Time Exceeded *
- C Parameter Problem *
+ B Time Exceeded [1]_
+ C Parameter Problem [1]_
D Timestamp Request
E Timestamp Reply
F Info Request
G Info Reply
H Address Mask Request
I Address Mask Reply
+ = =========================
- * These are rate limited by default (see default mask above)
+ .. [1] These are rate limited by default (see default mask above)
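The default mask can be reproduced from the bit table above. The sketch below treats each bit label as a base-36 digit (so B is 11 and C is 12), which is an interpretation of the "Significant bits" line; the function name is hypothetical.

```python
def ratemask(bit_labels):
    """Build an icmp_ratemask value from bit labels such as "3" or "B"."""
    mask = 0
    for label in bit_labels:
        mask |= 1 << int(label, 36)  # "0"-"9" map to 0-9, "A"-"I" to 10-18
    return mask


# Destination Unreachable, Source Quench, Time Exceeded, Parameter Problem
default_mask = ratemask(["3", "4", "B", "C"])
print(default_mask)                  # 6168
print(format(default_mask, "019b"))  # 0000001100000011000
```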
icmp_ignore_bogus_error_responses - BOOLEAN
Some routers violate RFC1122 by sending bogus responses to broadcast
frames. Such violations are normally logged via a kernel warning.
If this is set to TRUE, the kernel will not give such warnings, which
will avoid log file clutter.
+
Default: 1
icmp_errors_use_inbound_ifaddr - BOOLEAN
@@ -1049,7 +1344,7 @@ icmp_errors_use_inbound_ifaddr - BOOLEAN
If non-zero, the message will be sent with the primary address of
the interface that received the packet that caused the icmp error.
- This is the behaviour network many administrators will expect from
+ This is the behaviour many network administrators will expect from
a router. And it can make debugging complicated network layouts
much easier.
@@ -1084,32 +1379,39 @@ igmp_max_memberships - INTEGER
igmp_max_msf - INTEGER
Maximum number of addresses allowed in the source filter list for a
multicast group.
+
Default: 10
igmp_qrv - INTEGER
Controls the IGMP query robustness variable (see RFC2236 8.1).
+
Default: 2 (as specified by RFC2236 8.1)
+
Minimum: 1 (as specified by RFC6636 4.5)
force_igmp_version - INTEGER
- 0 - (default) No enforcement of a IGMP version, IGMPv1/v2 fallback
- allowed. Will back to IGMPv3 mode again if all IGMPv1/v2 Querier
- Present timer expires.
- 1 - Enforce to use IGMP version 1. Will also reply IGMPv1 report if
- receive IGMPv2/v3 query.
- 2 - Enforce to use IGMP version 2. Will fallback to IGMPv1 if receive
- IGMPv1 query message. Will reply report if receive IGMPv3 query.
- 3 - Enforce to use IGMP version 3. The same react with default 0.
+ - 0 - (default) No enforcement of an IGMP version, IGMPv1/v2 fallback
+ allowed. Will go back to IGMPv3 mode again once all IGMPv1/v2 Querier
+ Present timers expire.
+ - 1 - Enforce use of IGMP version 1. Will also reply with an IGMPv1
+ report if an IGMPv2/v3 query is received.
+ - 2 - Enforce use of IGMP version 2. Will fall back to IGMPv1 if an
+ IGMPv1 query message is received. Will reply with a report if an
+ IGMPv3 query is received.
+ - 3 - Enforce use of IGMP version 3. Behaves the same as the default 0.
- Note: this is not the same with force_mld_version because IGMPv3 RFC3376
- Security Considerations does not have clear description that we could
- ignore other version messages completely as MLDv2 RFC3810. So make
- this value as default 0 is recommended.
+ .. note::
-conf/interface/* changes special settings per interface (where
-"interface" is the name of your network interface)
+ this is not the same as force_mld_version because the IGMPv3 RFC3376
+ Security Considerations do not clearly state that we could
+ ignore other version messages completely, as MLDv2 RFC3810 does. So
+ keeping this value at the default 0 is recommended.
-conf/all/* is special, changes the settings for all interfaces
+``conf/interface/*``
+ changes special settings per interface (where
+ "interface" is the name of your network interface)
+
+``conf/all/*``
+ is special, changes the settings for all interfaces
log_martians - BOOLEAN
Log packets with impossible addresses to kernel log.
@@ -1120,14 +1422,21 @@ log_martians - BOOLEAN
accept_redirects - BOOLEAN
Accept ICMP redirect messages.
accept_redirects for the interface will be enabled if:
+
- both conf/{all,interface}/accept_redirects are TRUE in the case
forwarding for the interface is enabled
+
or
+
- at least one of conf/{all,interface}/accept_redirects is TRUE in the
case forwarding for the interface is disabled
+
accept_redirects for the interface will be disabled otherwise
- default TRUE (host)
- FALSE (router)
+
+ default:
+
+ - TRUE (host)
+ - FALSE (router)
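The accept_redirects rule above combines three inputs. A minimal sketch of that rule as stated in the text, with a hypothetical function name and plain booleans standing in for the sysctl values:

```python
def accept_redirects_enabled(conf_all, conf_interface, forwarding):
    """Effective accept_redirects for an interface, per the rule above."""
    if forwarding:
        # Forwarding enabled: both settings must be TRUE.
        return conf_all and conf_interface
    # Forwarding disabled: at least one setting must be TRUE.
    return conf_all or conf_interface


print(accept_redirects_enabled(True, False, forwarding=True))   # False
print(accept_redirects_enabled(True, False, forwarding=False))  # True
```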
forwarding - BOOLEAN
Enable IP forwarding on this interface. This controls whether packets
@@ -1152,12 +1461,14 @@ medium_id - INTEGER
proxy_arp - BOOLEAN
Do proxy arp.
+
proxy_arp for the interface will be enabled if at least one of
conf/{all,interface}/proxy_arp is set to TRUE,
it will be disabled otherwise
proxy_arp_pvlan - BOOLEAN
Private VLAN proxy arp.
+
Basically allow proxy arp replies back to the same interface
(from which the ARP request/solicitation was received).
@@ -1170,6 +1481,7 @@ proxy_arp_pvlan - BOOLEAN
proxy_arp.
This technology is known by different names:
+
In RFC 3069 it is called VLAN Aggregation.
Cisco and Allied Telesyn call it Private VLAN.
Hewlett-Packard call it Source-Port filtering or port-isolation.
@@ -1178,26 +1490,33 @@ proxy_arp_pvlan - BOOLEAN
shared_media - BOOLEAN
Send(router) or accept(host) RFC1620 shared media redirects.
Overrides secure_redirects.
+
shared_media for the interface will be enabled if at least one of
conf/{all,interface}/shared_media is set to TRUE,
it will be disabled otherwise
+
default TRUE
secure_redirects - BOOLEAN
Accept ICMP redirect messages only to gateways listed in the
interface's current gateway list. Even if disabled, RFC1122 redirect
rules still apply.
+
Overridden by shared_media.
+
secure_redirects for the interface will be enabled if at least one of
conf/{all,interface}/secure_redirects is set to TRUE,
it will be disabled otherwise
+
default TRUE
send_redirects - BOOLEAN
Send redirects, if router.
+
send_redirects for the interface will be enabled if at least one of
conf/{all,interface}/send_redirects is set to TRUE,
it will be disabled otherwise
+
Default: TRUE
bootp_relay - BOOLEAN
@@ -1206,15 +1525,20 @@ bootp_relay - BOOLEAN
BOOTP relay daemon will catch and forward such packets.
conf/all/bootp_relay must also be set to TRUE to enable BOOTP relay
for the interface
+
default FALSE
+
Not Implemented Yet.
accept_source_route - BOOLEAN
Accept packets with SRR option.
conf/all/accept_source_route must also be set to TRUE to accept packets
with SRR option on the interface
- default TRUE (router)
- FALSE (host)
+
+ default
+
+ - TRUE (router)
+ - FALSE (host)
accept_local - BOOLEAN
Accept packets with local source addresses. In combination with
@@ -1225,18 +1549,19 @@ accept_local - BOOLEAN
route_localnet - BOOLEAN
Do not consider loopback addresses as martian source or destination
while routing. This enables the use of 127/8 for local routing purposes.
+
default FALSE
rp_filter - INTEGER
- 0 - No source validation.
- 1 - Strict mode as defined in RFC3704 Strict Reverse Path
- Each incoming packet is tested against the FIB and if the interface
- is not the best reverse path the packet check will fail.
- By default failed packets are discarded.
- 2 - Loose mode as defined in RFC3704 Loose Reverse Path
- Each incoming packet's source address is also tested against the FIB
- and if the source address is not reachable via any interface
- the packet check will fail.
+ - 0 - No source validation.
+ - 1 - Strict mode as defined in RFC3704 Strict Reverse Path
+ Each incoming packet is tested against the FIB and if the interface
+ is not the best reverse path the packet check will fail.
+ By default failed packets are discarded.
+ - 2 - Loose mode as defined in RFC3704 Loose Reverse Path
+ Each incoming packet's source address is also tested against the FIB
+ and if the source address is not reachable via any interface
+ the packet check will fail.
Current recommended practice in RFC3704 is to enable strict mode
to prevent IP spoofing from DDoS attacks. If using asymmetric routing
@@ -1248,20 +1573,39 @@ rp_filter - INTEGER
Default value is 0. Note that some distributions enable it
in startup scripts.
+src_valid_mark - BOOLEAN
+ - 0 - The fwmark of the packet is not included in reverse path
+ route lookup. This allows for asymmetric routing configurations
+ utilizing the fwmark in only one direction, e.g., transparent
+ proxying.
+
+ - 1 - The fwmark of the packet is included in reverse path route
+ lookup. This permits rp_filter to function when the fwmark is
+ used for routing traffic in both directions.
+
+ This setting also affects the utilization of fwmark when
+ performing source address selection for ICMP replies, or
+ determining addresses stored for the IPOPT_TS_TSANDADDR and
+ IPOPT_RR IP options.
+
+ The max value from conf/{all,interface}/src_valid_mark is used.
+
+ Default value is 0.
+
arp_filter - BOOLEAN
- 1 - Allows you to have multiple network interfaces on the same
- subnet, and have the ARPs for each interface be answered
- based on whether or not the kernel would route a packet from
- the ARP'd IP out that interface (therefore you must use source
- based routing for this to work). In other words it allows control
- of which cards (usually 1) will respond to an arp request.
-
- 0 - (default) The kernel can respond to arp requests with addresses
- from other interfaces. This may seem wrong but it usually makes
- sense, because it increases the chance of successful communication.
- IP addresses are owned by the complete host on Linux, not by
- particular interfaces. Only for more complex setups like load-
- balancing, does this behaviour cause problems.
+ - 1 - Allows you to have multiple network interfaces on the same
+ subnet, and have the ARPs for each interface be answered
+ based on whether or not the kernel would route a packet from
+ the ARP'd IP out that interface (therefore you must use source
+ based routing for this to work). In other words it allows control
+ of which cards (usually 1) will respond to an arp request.
+
+ - 0 - (default) The kernel can respond to arp requests with addresses
+ from other interfaces. This may seem wrong but it usually makes
+ sense, because it increases the chance of successful communication.
+ IP addresses are owned by the complete host on Linux, not by
+ particular interfaces. Only for more complex setups like load-
+ balancing, does this behaviour cause problems.
arp_filter for the interface will be enabled if at least one of
conf/{all,interface}/arp_filter is set to TRUE,
@@ -1271,26 +1615,27 @@ arp_announce - INTEGER
Define different restriction levels for announcing the local
source IP address from IP packets in ARP requests sent on
interface:
- 0 - (default) Use any local address, configured on any interface
- 1 - Try to avoid local addresses that are not in the target's
- subnet for this interface. This mode is useful when target
- hosts reachable via this interface require the source IP
- address in ARP requests to be part of their logical network
- configured on the receiving interface. When we generate the
- request we will check all our subnets that include the
- target IP and will preserve the source address if it is from
- such subnet. If there is no such subnet we select source
- address according to the rules for level 2.
- 2 - Always use the best local address for this target.
- In this mode we ignore the source address in the IP packet
- and try to select local address that we prefer for talks with
- the target host. Such local address is selected by looking
- for primary IP addresses on all our subnets on the outgoing
- interface that include the target IP address. If no suitable
- local address is found we select the first local address
- we have on the outgoing interface or on all other interfaces,
- with the hope we will receive reply for our request and
- even sometimes no matter the source IP address we announce.
+
+ - 0 - (default) Use any local address, configured on any interface
+ - 1 - Try to avoid local addresses that are not in the target's
+ subnet for this interface. This mode is useful when target
+ hosts reachable via this interface require the source IP
+ address in ARP requests to be part of their logical network
+ configured on the receiving interface. When we generate the
+ request we will check all our subnets that include the
+ target IP and will preserve the source address if it is from
+ such subnet. If there is no such subnet we select source
+ address according to the rules for level 2.
+ - 2 - Always use the best local address for this target.
+ In this mode we ignore the source address in the IP packet
+ and try to select local address that we prefer for talks with
+ the target host. Such local address is selected by looking
+ for primary IP addresses on all our subnets on the outgoing
+ interface that include the target IP address. If no suitable
+ local address is found we select the first local address
+ we have on the outgoing interface or on all other interfaces,
+ with the hope we will receive reply for our request and
+ even sometimes no matter the source IP address we announce.
The max value from conf/{all,interface}/arp_announce is used.
@@ -1301,32 +1646,40 @@ arp_announce - INTEGER
arp_ignore - INTEGER
Define different modes for sending replies in response to
received ARP requests that resolve local target IP addresses:
- 0 - (default): reply for any local target IP address, configured
- on any interface
- 1 - reply only if the target IP address is local address
- configured on the incoming interface
- 2 - reply only if the target IP address is local address
- configured on the incoming interface and both with the
- sender's IP address are part from same subnet on this interface
- 3 - do not reply for local addresses configured with scope host,
- only resolutions for global and link addresses are replied
- 4-7 - reserved
- 8 - do not reply for all local addresses
+
+ - 0 - (default): reply for any local target IP address, configured
+ on any interface
+ - 1 - reply only if the target IP address is local address
+ configured on the incoming interface
+ - 2 - reply only if the target IP address is a local address
+ configured on the incoming interface and both it and the
+ sender's IP address are part of the same subnet on this interface
+ - 3 - do not reply for local addresses configured with scope host,
+ only resolutions for global and link addresses are replied
+ - 4-7 - reserved
+ - 8 - do not reply for all local addresses
The max value from conf/{all,interface}/arp_ignore is used
when ARP request is received on the {interface}
arp_notify - BOOLEAN
Define mode for notification of address and device changes.
- 0 - (default): do nothing
- 1 - Generate gratuitous arp requests when device is brought up
- or hardware address changes.
-arp_accept - BOOLEAN
- Define behavior for gratuitous ARP frames who's IP is not
- already present in the ARP table:
- 0 - don't create new entries in the ARP table
- 1 - create new entries in the ARP table
+ == ==========================================================
+ 0 (default): do nothing
+ 1 Generate gratuitous arp requests when device is brought up
+ or hardware address changes.
+ == ==========================================================
+
+arp_accept - INTEGER
+ Define behavior for accepting gratuitous ARP (garp) frames from devices
+ that are not already present in the ARP table:
+
+ - 0 - don't create new entries in the ARP table
+ - 1 - create new entries in the ARP table
+ - 2 - create new entries only if the source IP address is in the same
+ subnet as an address configured on the interface that received the
+ garp message.
Both replies and requests type gratuitous arp will trigger the
ARP table to be updated, if this setting is on.
@@ -1335,6 +1688,15 @@ arp_accept - BOOLEAN
gratuitous arp frame, the arp table will be updated regardless
if this setting is on or off.
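The three arp_accept modes can be sketched as a decision function. This models only the case described above, where the sender is not yet in the ARP table. The function name and the way interface addresses are passed in (as "address/prefix" strings) are assumptions made for illustration.

```python
import ipaddress


def garp_creates_entry(arp_accept, source_ip, interface_addrs):
    """Decide whether a gratuitous ARP from an unknown sender adds an entry."""
    if arp_accept == 0:
        return False
    if arp_accept == 1:
        return True
    if arp_accept == 2:
        # Accept only if the sender is in a subnet configured on the interface.
        src = ipaddress.ip_address(source_ip)
        return any(src in ipaddress.ip_interface(addr).network
                   for addr in interface_addrs)
    raise ValueError("arp_accept must be 0, 1 or 2")


addrs = ["192.168.1.10/24"]
print(garp_creates_entry(2, "192.168.1.77", addrs))  # True: same subnet
print(garp_creates_entry(2, "10.0.0.5", addrs))      # False: foreign subnet
```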
+arp_evict_nocarrier - BOOLEAN
+ Clears the ARP cache on NOCARRIER events. This option is important for
+ wireless devices where the ARP cache should not be cleared when roaming
+ between access points on the same network. In most cases this should
+ remain as the default (1).
+
+ - 1 - (default): Clear the ARP cache on NOCARRIER events
+ - 0 - Do not clear ARP cache on NOCARRIER events
+
mcast_solicit - INTEGER
The maximum number of multicast probes in INCOMPLETE state,
when the associated hardware address is unknown. Defaults
@@ -1362,13 +1724,18 @@ disable_xfrm - BOOLEAN
igmpv2_unsolicited_report_interval - INTEGER
The interval in milliseconds in which the next unsolicited
IGMPv1 or IGMPv2 report retransmit will take place.
+
Default: 10000 (10 seconds)
igmpv3_unsolicited_report_interval - INTEGER
The interval in milliseconds in which the next unsolicited
IGMPv3 report retransmit will take place.
+
Default: 1000 (1 second)
+ignore_routes_with_linkdown - BOOLEAN
+ Ignore routes whose link is down when performing a FIB lookup.
+
promote_secondaries - BOOLEAN
When a primary IP address is removed from this interface
promote a corresponding secondary IP address instead of
@@ -1377,19 +1744,23 @@ promote_secondaries - BOOLEAN
drop_unicast_in_l2_multicast - BOOLEAN
Drop any unicast IP packets that are received in link-layer
multicast (or broadcast) frames.
+
This behavior (for multicast) is actually a SHOULD in RFC
1122, but is disabled by default for compatibility reasons.
+
Default: off (0)
drop_gratuitous_arp - BOOLEAN
Drop all gratuitous ARP frames, for example if there's a known
good ARP proxy on the network and such frames need not be used
(or in the case of 802.11, must not be used to prevent attacks.)
+
Default: off (0)
tag - INTEGER
Allows you to write a number, which can be used as required.
+
Default value is 0.
xfrm4_gc_thresh - INTEGER
@@ -1401,21 +1772,24 @@ xfrm4_gc_thresh - INTEGER
igmp_link_local_mcast_reports - BOOLEAN
Enable IGMP reports for link local multicast groups in the
224.0.0.X range.
+
Default TRUE
Alexey Kuznetsov.
kuznet@ms2.inr.ac.ru
Updated by:
-Andi Kleen
-ak@muc.de
-Nicolas Delon
-delon.nicolas@wanadoo.fr
+- Andi Kleen
+ ak@muc.de
+- Nicolas Delon
+ delon.nicolas@wanadoo.fr
-/proc/sys/net/ipv6/* Variables:
+
+/proc/sys/net/ipv6/* Variables
+==============================
IPv6 has no global variables such as tcp_*. tcp_* settings under ipv4/ also
apply to IPv6 [XXX?].
@@ -1424,8 +1798,9 @@ bindv6only - BOOLEAN
Default value for IPV6_V6ONLY socket option,
which restricts use of the IPv6 socket to IPv6 communication
only.
- TRUE: disable IPv4-mapped address feature
- FALSE: enable IPv4-mapped address feature
+
+ - TRUE: disable IPv4-mapped address feature
+ - FALSE: enable IPv4-mapped address feature
Default: FALSE (as specified in RFC3493)
@@ -1433,8 +1808,10 @@ flowlabel_consistency - BOOLEAN
Protect the consistency (and unicity) of flow label.
You have to disable it to use IPV6_FL_F_REFLECT flag on the
flow label manager.
- TRUE: enabled
- FALSE: disabled
+
+ - TRUE: enabled
+ - FALSE: disabled
+
Default: TRUE
auto_flowlabels - INTEGER
@@ -1442,22 +1819,28 @@ auto_flowlabels - INTEGER
packet. This allows intermediate devices, such as routers, to
identify packet flows for mechanisms like Equal Cost Multipath
Routing (see RFC 6438).
- 0: automatic flow labels are completely disabled
- 1: automatic flow labels are enabled by default, they can be
+
+ = ===========================================================
+ 0 automatic flow labels are completely disabled
+ 1 automatic flow labels are enabled by default, they can be
disabled on a per socket basis using the IPV6_AUTOFLOWLABEL
socket option
- 2: automatic flow labels are allowed, they may be enabled on a
+ 2 automatic flow labels are allowed, they may be enabled on a
per socket basis using the IPV6_AUTOFLOWLABEL socket option
- 3: automatic flow labels are enabled and enforced, they cannot
+ 3 automatic flow labels are enabled and enforced, they cannot
be disabled by the socket option
+ = ===========================================================
+
Default: 1
flowlabel_state_ranges - BOOLEAN
Split the flow label number space into two ranges. 0-0x7FFFF is
reserved for the IPv6 flow manager facility, 0x80000-0xFFFFF
is reserved for stateless flow labels as described in RFC6437.
- TRUE: enabled
- FALSE: disabled
+
+ - TRUE: enabled
+ - FALSE: disabled
+
Default: true
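The split of the 20-bit flow label space described for flowlabel_state_ranges can be illustrated as follows. The function name and the returned strings are hypothetical labels for the two ranges.

```python
def flowlabel_range(label):
    """Classify a 20-bit IPv6 flow label into the two documented ranges."""
    if not 0 <= label <= 0xFFFFF:
        raise ValueError("flow labels are 20 bits wide")
    if label <= 0x7FFFF:
        return "flow manager"  # 0-0x7FFFF: IPv6 flow manager facility
    return "stateless"         # 0x80000-0xFFFFF: stateless flow labels


print(flowlabel_range(0x12345))  # flow manager
print(flowlabel_range(0x80000))  # stateless
```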
flowlabel_reflect - INTEGER
@@ -1467,49 +1850,88 @@ flowlabel_reflect - INTEGER
https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01
This is a bitmask.
- 1: enabled for established flows
- Note that this prevents automatic flowlabel changes, as done
- in "tcp: change IPv6 flow-label upon receiving spurious retransmission"
- and "tcp: Change txhash on every SYN and RTO retransmit"
+ - 1: enabled for established flows
- 2: enabled for TCP RESET packets (no active listener)
- If set, a RST packet sent in response to a SYN packet on a closed
- port will reflect the incoming flow label.
+ Note that this prevents automatic flowlabel changes, as done
+ in "tcp: change IPv6 flow-label upon receiving spurious retransmission"
+ and "tcp: Change txhash on every SYN and RTO retransmit"
- 4: enabled for ICMPv6 echo reply messages.
+ - 2: enabled for TCP RESET packets (no active listener)
+ If set, a RST packet sent in response to a SYN packet on a closed
+ port will reflect the incoming flow label.
+
+ - 4: enabled for ICMPv6 echo reply messages.
Default: 0
fib_multipath_hash_policy - INTEGER
Controls which hash policy to use for multipath routes.
+
Default: 0 (Layer 3)
+
Possible values:
- 0 - Layer 3 (source and destination addresses plus flow label)
- 1 - Layer 4 (standard 5-tuple)
- 2 - Layer 3 or inner Layer 3 if present
+
+ - 0 - Layer 3 (source and destination addresses plus flow label)
+ - 1 - Layer 4 (standard 5-tuple)
+ - 2 - Layer 3 or inner Layer 3 if present
+ - 3 - Custom multipath hash. Fields used for multipath hash calculation
+ are determined by fib_multipath_hash_fields sysctl
+
+fib_multipath_hash_fields - UNSIGNED INTEGER
+ When fib_multipath_hash_policy is set to 3 (custom multipath hash), the
+ fields used for multipath hash calculation are determined by this
+ sysctl.
+
+ This value is a bitmask which enables various fields for multipath hash
+ calculation.
+
+ Possible fields are:
+
+ ====== ============================
+ 0x0001 Source IP address
+ 0x0002 Destination IP address
+ 0x0004 IP protocol
+ 0x0008 Flow Label
+ 0x0010 Source port
+ 0x0020 Destination port
+ 0x0040 Inner source IP address
+ 0x0080 Inner destination IP address
+ 0x0100 Inner IP protocol
+ 0x0200 Inner Flow Label
+ 0x0400 Inner source port
+ 0x0800 Inner destination port
+ ====== ============================
+
+ Default: 0x0007 (source IP, destination IP and IP protocol)
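Since fib_multipath_hash_fields is a bitmask, a value is built by OR-ing entries from the table above. The sketch below shows this with hypothetical short names for the fields; only the numeric values come from the table.

```python
HASH_FIELDS = {
    "src_ip": 0x0001,
    "dst_ip": 0x0002,
    "ip_proto": 0x0004,
    "flowlabel": 0x0008,
    "src_port": 0x0010,
    "dst_port": 0x0020,
    "inner_src_ip": 0x0040,
    "inner_dst_ip": 0x0080,
    "inner_ip_proto": 0x0100,
    "inner_flowlabel": 0x0200,
    "inner_src_port": 0x0400,
    "inner_dst_port": 0x0800,
}


def hash_fields_mask(*names):
    """Combine field names into a fib_multipath_hash_fields bitmask."""
    mask = 0
    for name in names:
        mask |= HASH_FIELDS[name]
    return mask


def decode_hash_fields(mask):
    """List the field names enabled in a bitmask."""
    return [name for name, bit in HASH_FIELDS.items() if mask & bit]


print(hex(hash_fields_mask("src_ip", "dst_ip", "ip_proto")))  # 0x7
print(decode_hash_fields(0x0037))
```

The first call reproduces the documented default of 0x0007. The second decodes a mask that additionally enables the source and destination ports.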
anycast_src_echo_reply - BOOLEAN
Controls the use of anycast addresses as source addresses for ICMPv6
echo reply
- TRUE: enabled
- FALSE: disabled
+
+ - TRUE: enabled
+ - FALSE: disabled
+
Default: FALSE
idgen_delay - INTEGER
Controls the delay in seconds after which time to retry
privacy stable address generation if a DAD conflict is
detected.
+
Default: 1 (as specified in RFC7217)
idgen_retries - INTEGER
Controls the number of retries to generate a stable privacy
address if a DAD conflict is detected.
+
Default: 3 (as specified in RFC7217)
mld_qrv - INTEGER
Controls the MLD query robustness variable (see RFC3810 9.1).
+
Default: 2 (as specified by RFC3810 9.1)
+
Minimum: 1 (as specified by RFC6636 4.5)
max_dst_opts_number - INTEGER
@@ -1517,6 +1939,7 @@ max_dst_opts_number - INTEGER
options extension header. If this value is less than zero
then unknown options are disallowed and the number of known
TLVs allowed is the absolute value of this number.
+
Default: 8
max_hbh_opts_number - INTEGER
@@ -1524,16 +1947,19 @@ max_hbh_opts_number - INTEGER
options extension header. If this value is less than zero
then unknown options are disallowed and the number of known
TLVs allowed is the absolute value of this number.
+
Default: 8
max_dst_opts_length - INTEGER
Maximum length allowed for a Destination options extension
header.
+
Default: INT_MAX (unlimited)
max_hbh_length - INTEGER
Maximum length allowed for a Hop-by-Hop options extension
header.
+
Default: INT_MAX (unlimited)
skip_notify_on_dev_down - BOOLEAN
@@ -1542,8 +1968,59 @@ skip_notify_on_dev_down - BOOLEAN
generate this message; IPv6 does by default. Setting this sysctl
to true skips the message, making IPv4 and IPv6 on par in relying
on userspace caches to track link events and evict routes.
+
Default: false (generate message)
+nexthop_compat_mode - BOOLEAN
+ New nexthop API provides a means for managing nexthops independent of
+ prefixes. Backwards compatibility with old route format is enabled by
+ default which means route dumps and notifications contain the new
+ nexthop attribute but also the full, expanded nexthop definition.
+ Further, updates or deletes of a nexthop configuration generate route
+ notifications for each fib entry using the nexthop. Once a system
+ understands the new API, this sysctl can be disabled to achieve full
+ performance benefits of the new API by disabling the nexthop expansion
+ and extraneous notifications.
+ Default: true (backward compat mode)
+
+fib_notify_on_flag_change - INTEGER
+ Whether to emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/
+ RTM_F_TRAP/RTM_F_OFFLOAD_FAILED flags are changed.
+
+ After installing a route to the kernel, user space receives an
+ acknowledgment, which means the route was installed in the kernel,
+ but not necessarily in hardware.
+ It is also possible for a route already installed in hardware to change
+ its action and therefore its flags. For example, a host route that is
+ trapping packets can be "promoted" to perform decapsulation following
+ the installation of an IPinIP/VXLAN tunnel.
+ The notifications will indicate to user-space the state of the route.
+
+ Default: 0 (Do not emit notifications.)
+
+ Possible values:
+
+ - 0 - Do not emit notifications.
+ - 1 - Emit notifications.
+ - 2 - Emit notifications only for RTM_F_OFFLOAD_FAILED flag change.
+
+ioam6_id - INTEGER
+ Define the IOAM id of this node. Uses only 24 bits out of 32 in total.
+
+ Min: 0
+ Max: 0xFFFFFF
+
+ Default: 0xFFFFFF
+
+ioam6_id_wide - LONG INTEGER
+ Define the wide IOAM id of this node. Uses only 56 bits out of 64 in
+ total. Can be different from ioam6_id.
+
+ Min: 0
+ Max: 0xFFFFFFFFFFFFFF
+
+ Default: 0xFFFFFFFFFFFFFF
+
IPv6 Fragmentation:
ip6frag_high_thresh - INTEGER
@@ -1558,28 +2035,27 @@ ip6frag_low_thresh - INTEGER
ip6frag_time - INTEGER
Time in seconds to keep an IPv6 fragment in memory.
-IPv6 Segment Routing:
-
-seg6_flowlabel - INTEGER
- Controls the behaviour of computing the flowlabel of outer
- IPv6 header in case of SR T.encaps
-
- -1 set flowlabel to zero.
- 0 copy flowlabel from Inner packet in case of Inner IPv6
- (Set flowlabel to 0 in case IPv4/L2)
- 1 Compute the flowlabel using seg6_make_flowlabel()
-
- Default is 0.
-
-conf/default/*:
+``conf/default/*``:
Change the interface-specific default settings.
+ These settings would be used during creating new interfaces.
-conf/all/*:
+
+``conf/all/*``:
Change all the interface-specific settings.
[XXX: Other special features than forwarding?]
+conf/all/disable_ipv6 - BOOLEAN
+ Changing this value is same as changing ``conf/default/disable_ipv6``
+ setting and also all per-interface ``disable_ipv6`` settings to the same
+ value.
+
+ Reading this value does not have any particular meaning. It does not say
+ whether IPv6 support is enabled or disabled. Returned value can be 1
+ also in the case when some interface has ``disable_ipv6`` set to 0 and
+ has configured IPv6 addresses.
+
conf/all/forwarding - BOOLEAN
Enable global IPv6 forwarding between all interfaces.
@@ -1599,9 +2075,10 @@ fwmark_reflect - BOOLEAN
associated with a socket for example, TCP RSTs or ICMPv6 echo replies).
If unset, these packets have a fwmark of zero. If set, they have the
fwmark of the packet they are replying to.
+
Default: 0
-conf/interface/*:
+``conf/interface/*``:
Change special settings per interface.
The functional behaviour for certain settings is different
@@ -1616,31 +2093,50 @@ accept_ra - INTEGER
transmitted.
Possible values are:
- 0 Do not accept Router Advertisements.
- 1 Accept Router Advertisements if forwarding is disabled.
- 2 Overrule forwarding behaviour. Accept Router Advertisements
- even if forwarding is enabled.
- Functional default: enabled if local forwarding is disabled.
- disabled if local forwarding is enabled.
+ == ===========================================================
+ 0 Do not accept Router Advertisements.
+ 1 Accept Router Advertisements if forwarding is disabled.
+ 2 Overrule forwarding behaviour. Accept Router Advertisements
+ even if forwarding is enabled.
+ == ===========================================================
+
+ Functional default:
+
+ - enabled if local forwarding is disabled.
+ - disabled if local forwarding is enabled.
accept_ra_defrtr - BOOLEAN
Learn default router in Router Advertisement.
- Functional default: enabled if accept_ra is enabled.
- disabled if accept_ra is disabled.
+ Functional default:
+
+ - enabled if accept_ra is enabled.
+ - disabled if accept_ra is disabled.
+
+ra_defrtr_metric - UNSIGNED INTEGER
+ Route metric for default route learned in Router Advertisement. This value
+ will be assigned as metric for the default route learned via IPv6 Router
+ Advertisement. Takes effect only if accept_ra_defrtr is enabled.
+
+ Possible values:
+ 1 to 0xFFFFFFFF
+
+ Default: IP6_RT_PRIO_USER i.e. 1024.
accept_ra_from_local - BOOLEAN
Accept RA with source-address that is found on local machine
- if the RA is otherwise proper and able to be accepted.
- Default is to NOT accept these as it may be an un-intended
- network loop.
+ if the RA is otherwise proper and able to be accepted.
+
+ Default is to NOT accept these as it may be an un-intended
+ network loop.
Functional default:
- enabled if accept_ra_from_local is enabled
- on a specific interface.
- disabled if accept_ra_from_local is disabled
- on a specific interface.
+
+ - enabled if accept_ra_from_local is enabled
+ on a specific interface.
+ - disabled if accept_ra_from_local is disabled
+ on a specific interface.
accept_ra_min_hop_limit - INTEGER
Minimum hop limit Information in Router Advertisement.
@@ -1653,8 +2149,10 @@ accept_ra_min_hop_limit - INTEGER
accept_ra_pinfo - BOOLEAN
Learn Prefix Information in Router Advertisement.
- Functional default: enabled if accept_ra is enabled.
- disabled if accept_ra is disabled.
+ Functional default:
+
+ - enabled if accept_ra is enabled.
+ - disabled if accept_ra is disabled.
accept_ra_rt_info_min_plen - INTEGER
Minimum prefix length of Route Information in RA.
@@ -1662,8 +2160,10 @@ accept_ra_rt_info_min_plen - INTEGER
Route Information w/ prefix smaller than this variable shall
be ignored.
- Functional default: 0 if accept_ra_rtr_pref is enabled.
- -1 if accept_ra_rtr_pref is disabled.
+ Functional default:
+
+ * 0 if accept_ra_rtr_pref is enabled.
+ * -1 if accept_ra_rtr_pref is disabled.
accept_ra_rt_info_max_plen - INTEGER
Maximum prefix length of Route Information in RA.
@@ -1671,33 +2171,41 @@ accept_ra_rt_info_max_plen - INTEGER
Route Information w/ prefix larger than this variable shall
be ignored.
- Functional default: 0 if accept_ra_rtr_pref is enabled.
- -1 if accept_ra_rtr_pref is disabled.
+ Functional default:
+
+ * 0 if accept_ra_rtr_pref is enabled.
+ * -1 if accept_ra_rtr_pref is disabled.
accept_ra_rtr_pref - BOOLEAN
Accept Router Preference in RA.
- Functional default: enabled if accept_ra is enabled.
- disabled if accept_ra is disabled.
+ Functional default:
+
+ - enabled if accept_ra is enabled.
+ - disabled if accept_ra is disabled.
accept_ra_mtu - BOOLEAN
Apply the MTU value specified in RA option 5 (RFC4861). If
disabled, the MTU specified in the RA will be ignored.
- Functional default: enabled if accept_ra is enabled.
- disabled if accept_ra is disabled.
+ Functional default:
+
+ - enabled if accept_ra is enabled.
+ - disabled if accept_ra is disabled.
accept_redirects - BOOLEAN
Accept Redirects.
- Functional default: enabled if local forwarding is disabled.
- disabled if local forwarding is enabled.
+ Functional default:
+
+ - enabled if local forwarding is disabled.
+ - disabled if local forwarding is enabled.
accept_source_route - INTEGER
Accept source routing (routing extension header).
- >= 0: Accept only routing header type 2.
- < 0: Do not accept routing header.
+ - >= 0: Accept only routing header type 2.
+ - < 0: Do not accept routing header.
Default: 0
@@ -1705,24 +2213,30 @@ autoconf - BOOLEAN
Autoconfigure addresses using Prefix Information in Router
Advertisements.
- Functional default: enabled if accept_ra_pinfo is enabled.
- disabled if accept_ra_pinfo is disabled.
+ Functional default:
+
+ - enabled if accept_ra_pinfo is enabled.
+ - disabled if accept_ra_pinfo is disabled.
dad_transmits - INTEGER
The amount of Duplicate Address Detection probes to send.
+
Default: 1
forwarding - INTEGER
Configure interface-specific Host/Router behaviour.
- Note: It is recommended to have the same setting on all
- interfaces; mixed router/host scenarios are rather uncommon.
+ .. note::
+
+ It is recommended to have the same setting on all
+ interfaces; mixed router/host scenarios are rather uncommon.
Possible values are:
- 0 Forwarding disabled
- 1 Forwarding enabled
- FALSE (0):
+ - 0 Forwarding disabled
+ - 1 Forwarding enabled
+
+ **FALSE (0)**:
By default, Host behaviour is assumed. This means:
@@ -1733,7 +2247,7 @@ forwarding - INTEGER
Advertisements (and do autoconfiguration).
4. If accept_redirects is TRUE (default), accept Redirects.
- TRUE (1):
+ **TRUE (1)**:
If local forwarding is enabled, Router behaviour is assumed.
This means exactly the reverse from the above:
@@ -1744,19 +2258,22 @@ forwarding - INTEGER
4. Redirects are ignored.
Default: 0 (disabled) if global forwarding is disabled (default),
- otherwise 1 (enabled).
+ otherwise 1 (enabled).
hop_limit - INTEGER
Default Hop Limit to set.
+
Default: 64
mtu - INTEGER
Default Maximum Transfer Unit
+
Default: 1280 (IPv6 required minimum)
ip_nonlocal_bind - BOOLEAN
If set, allows processes to bind() to non-local IPv6 addresses,
which can be quite useful - but may break some applications.
+
Default: 0
router_probe_interval - INTEGER
@@ -1768,15 +2285,18 @@ router_probe_interval - INTEGER
router_solicitation_delay - INTEGER
Number of seconds to wait after interface is brought up
before sending Router Solicitations.
+
Default: 1
router_solicitation_interval - INTEGER
Number of seconds to wait between Router Solicitations.
+
Default: 4
router_solicitations - INTEGER
Number of Router Solicitations to send until assuming no
routers are present.
+
Default: 3
use_oif_addrs_only - BOOLEAN
@@ -1788,28 +2308,35 @@ use_oif_addrs_only - BOOLEAN
use_tempaddr - INTEGER
Preference for Privacy Extensions (RFC3041).
- <= 0 : disable Privacy Extensions
- == 1 : enable Privacy Extensions, but prefer public
- addresses over temporary addresses.
- > 1 : enable Privacy Extensions and prefer temporary
- addresses over public addresses.
- Default: 0 (for most devices)
- -1 (for point-to-point devices and loopback devices)
+
+ * <= 0 : disable Privacy Extensions
+ * == 1 : enable Privacy Extensions, but prefer public
+ addresses over temporary addresses.
+ * > 1 : enable Privacy Extensions and prefer temporary
+ addresses over public addresses.
+
+ Default:
+
+ * 0 (for most devices)
+ * -1 (for point-to-point devices and loopback devices)
temp_valid_lft - INTEGER
valid lifetime (in seconds) for temporary addresses.
- Default: 604800 (7 days)
+
+ Default: 172800 (2 days)
temp_prefered_lft - INTEGER
Preferred lifetime (in seconds) for temporary addresses.
+
Default: 86400 (1 day)
keep_addr_on_down - INTEGER
Keep all IPv6 addresses on an interface down event. If set static
global addresses with no expiration time are not flushed.
- >0 : enabled
- 0 : system default
- <0 : disabled
+
+ * >0 : enabled
+ * 0 : system default
+ * <0 : disabled
Default: 0 (addresses are removed)
@@ -1818,11 +2345,13 @@ max_desync_factor - INTEGER
that ensures that clients don't synchronize with each
other and generate new addresses at exactly the same time.
value is in seconds.
+
Default: 600
regen_max_retry - INTEGER
Number of attempts before give up attempting to generate
valid temporary addresses.
+
Default: 5
max_addresses - INTEGER
@@ -1830,12 +2359,14 @@ max_addresses - INTEGER
to zero disables the limitation. It is not recommended to set this
value too large (or to zero) because it would be an easy way to
crash the kernel by allowing too many addresses to be created.
+
Default: 16
disable_ipv6 - BOOLEAN
Disable IPv6 operation. If accept_dad is set to 2, this value
will be dynamically set to TRUE if DAD fails for the link-local
address.
+
Default: FALSE (enable IPv6 operation)
When this value is changed from 1 to 0 (IPv6 is being enabled),
@@ -1849,10 +2380,13 @@ disable_ipv6 - BOOLEAN
accept_dad - INTEGER
Whether to accept DAD (Duplicate Address Detection).
- 0: Disable DAD
- 1: Enable DAD (default)
- 2: Enable DAD, and disable IPv6 operation if MAC-based duplicate
- link-local address has been found.
+
+ == ==============================================================
+ 0 Disable DAD
+ 1 Enable DAD (default)
+ 2 Enable DAD, and disable IPv6 operation if MAC-based duplicate
+ link-local address has been found.
+ == ==============================================================
DAD operation and mode on a given interface will be selected according
to the maximum value of conf/{all,interface}/accept_dad.
@@ -1860,6 +2394,7 @@ accept_dad - INTEGER
force_tllao - BOOLEAN
Enable sending the target link-layer address option even when
responding to a unicast neighbor solicitation.
+
Default: FALSE
Quoting from RFC 2461, section 4.4, Target link-layer address:
@@ -1877,9 +2412,10 @@ force_tllao - BOOLEAN
ndisc_notify - BOOLEAN
Define mode for notification of address and device changes.
- 0 - (default): do nothing
- 1 - Generate unsolicited neighbour advertisements when device is brought
- up or hardware address changes.
+
+ * 0 - (default): do nothing
+ * 1 - Generate unsolicited neighbour advertisements when device is brought
+ up or hardware address changes.
ndisc_tclass - INTEGER
The IPv6 Traffic Class to use by default when sending IPv6 Neighbor
@@ -1888,33 +2424,47 @@ ndisc_tclass - INTEGER
These 8 bits can be interpreted as 6 high order bits holding the DSCP
value and 2 low order bits representing ECN (which you probably want
to leave cleared).
- 0 - (default)
+
+ * 0 - (default)
+
+ndisc_evict_nocarrier - BOOLEAN
+ Clears the neighbor discovery table on NOCARRIER events. This option is
+ important for wireless devices where the neighbor discovery cache should
+ not be cleared when roaming between access points on the same network.
+ In most cases this should remain as the default (1).
+
+ - 1 - (default): Clear neighbor discovery cache on NOCARRIER events.
+ - 0 - Do not clear neighbor discovery cache on NOCARRIER events.
mldv1_unsolicited_report_interval - INTEGER
The interval in milliseconds in which the next unsolicited
MLDv1 report retransmit will take place.
+
Default: 10000 (10 seconds)
mldv2_unsolicited_report_interval - INTEGER
The interval in milliseconds in which the next unsolicited
MLDv2 report retransmit will take place.
+
Default: 1000 (1 second)
force_mld_version - INTEGER
- 0 - (default) No enforcement of a MLD version, MLDv1 fallback allowed
- 1 - Enforce to use MLD version 1
- 2 - Enforce to use MLD version 2
+ * 0 - (default) No enforcement of a MLD version, MLDv1 fallback allowed
+ * 1 - Enforce to use MLD version 1
+ * 2 - Enforce to use MLD version 2
suppress_frag_ndisc - INTEGER
Control RFC 6980 (Security Implications of IPv6 Fragmentation
with IPv6 Neighbor Discovery) behavior:
- 1 - (default) discard fragmented neighbor discovery packets
- 0 - allow fragmented neighbor discovery packets
+
+ * 1 - (default) discard fragmented neighbor discovery packets
+ * 0 - allow fragmented neighbor discovery packets
optimistic_dad - BOOLEAN
Whether to perform Optimistic Duplicate Address Detection (RFC 4429).
- 0: disabled (default)
- 1: enabled
+
+ * 0: disabled (default)
+ * 1: enabled
Optimistic Duplicate Address Detection for the interface will be enabled
if at least one of conf/{all,interface}/optimistic_dad is set to 1,
@@ -1925,8 +2475,9 @@ use_optimistic - BOOLEAN
source address selection. Preferred addresses will still be chosen
before optimistic addresses, subject to other ranking in the source
address selection algorithm.
- 0: disabled (default)
- 1: enabled
+
+ * 0: disabled (default)
+ * 1: enabled
This will be enabled if at least one of
conf/{all,interface}/use_optimistic is set to 1, disabled otherwise.
@@ -1948,12 +2499,14 @@ stable_secret - IPv6 address
addr_gen_mode - INTEGER
Defines how link-local and autoconf addresses are generated.
- 0: generate address based on EUI64 (default)
- 1: do no generate a link-local address, use EUI64 for addresses generated
- from autoconf
- 2: generate stable privacy addresses, using the secret from
+ = =================================================================
+ 0 generate address based on EUI64 (default)
+ 1 do not generate a link-local address, use EUI64 for addresses
+ generated from autoconf
+ 2 generate stable privacy addresses, using the secret from
stable_secret (RFC7217)
- 3: generate stable privacy addresses, using a random secret if unset
+ 3 generate stable privacy addresses, using a random secret if unset
+ = =================================================================
drop_unicast_in_l2_multicast - BOOLEAN
Drop any unicast IPv6 packets that are received in link-layer
@@ -1968,6 +2521,37 @@ drop_unsolicited_na - BOOLEAN
By default this is turned off.
+accept_untracked_na - INTEGER
+ Define behavior for accepting neighbor advertisements from devices that
+ are absent in the neighbor cache:
+
+ - 0 - (default) Do not accept unsolicited and untracked neighbor
+ advertisements.
+
+ - 1 - Add a new neighbor cache entry in STALE state for routers on
+ receiving a neighbor advertisement (either solicited or unsolicited)
+ with target link-layer address option specified if no neighbor entry
+ is already present for the advertised IPv6 address. Without this knob,
+ NAs received for untracked addresses (absent in neighbor cache) are
+ silently ignored.
+
+ This is as per router-side behavior documented in RFC9131.
+
+ This has lower precedence than drop_unsolicited_na.
+
+ This will optimize the return path for the initial off-link
+ communication that is initiated by a directly connected host, by
+ ensuring that the first-hop router which turns on this setting doesn't
+ have to buffer the initial return packets to do neighbor-solicitation.
+ The prerequisite is that the host is configured to send unsolicited
+ neighbor advertisements on interface bringup. This setting should be
+ used in conjunction with the ndisc_notify setting on the host to
+ satisfy this prerequisite.
+
+ - 2 - Extend option (1) to add a new neighbor cache entry only if the
+ source IP address is in the same subnet as an address configured on
+ the interface that received the neighbor advertisement.
+
enhanced_dad - BOOLEAN
Include a nonce option in the IPv6 neighbor solicitation messages used for
duplicate address detection per RFC7527. A received DAD NS will only signal
@@ -1975,13 +2559,18 @@ enhanced_dad - BOOLEAN
detection of duplicates due to loopback of the NS messages that we send.
The nonce option will be sent on an interface unless both of
conf/{all,interface}/enhanced_dad are set to FALSE.
+
Default: TRUE
-icmp/*:
+``icmp/*``:
+===========
+
ratelimit - INTEGER
Limit the maximal rates for sending ICMPv6 messages.
+
0 to disable any limiting,
otherwise the minimal space between responses in milliseconds.
+
Default: 1000
ratemask - list of comma separated ranges
@@ -2002,16 +2591,19 @@ ratemask - list of comma separated ranges
echo_ignore_all - BOOLEAN
If set non-zero, then the kernel will ignore all ICMP ECHO
requests sent to it over the IPv6 protocol.
+
Default: 0
echo_ignore_multicast - BOOLEAN
If set non-zero, then the kernel will ignore all ICMP ECHO
requests sent to it over the IPv6 protocol via multicast.
+
Default: 0
echo_ignore_anycast - BOOLEAN
If set non-zero, then the kernel will ignore all ICMP ECHO
requests sent to it over the IPv6 protocol destined to anycast address.
+
Default: 0
xfrm6_gc_thresh - INTEGER
@@ -2027,43 +2619,52 @@ YOSHIFUJI Hideaki / USAGI Project <yoshfuji@linux-ipv6.org>
/proc/sys/net/bridge/* Variables:
+=================================
bridge-nf-call-arptables - BOOLEAN
- 1 : pass bridged ARP traffic to arptables' FORWARD chain.
- 0 : disable this.
+ - 1 : pass bridged ARP traffic to arptables' FORWARD chain.
+ - 0 : disable this.
+
Default: 1
bridge-nf-call-iptables - BOOLEAN
- 1 : pass bridged IPv4 traffic to iptables' chains.
- 0 : disable this.
+ - 1 : pass bridged IPv4 traffic to iptables' chains.
+ - 0 : disable this.
+
Default: 1
bridge-nf-call-ip6tables - BOOLEAN
- 1 : pass bridged IPv6 traffic to ip6tables' chains.
- 0 : disable this.
+ - 1 : pass bridged IPv6 traffic to ip6tables' chains.
+ - 0 : disable this.
+
Default: 1
bridge-nf-filter-vlan-tagged - BOOLEAN
- 1 : pass bridged vlan-tagged ARP/IP/IPv6 traffic to {arp,ip,ip6}tables.
- 0 : disable this.
+ - 1 : pass bridged vlan-tagged ARP/IP/IPv6 traffic to {arp,ip,ip6}tables.
+ - 0 : disable this.
+
Default: 0
bridge-nf-filter-pppoe-tagged - BOOLEAN
- 1 : pass bridged pppoe-tagged IP/IPv6 traffic to {ip,ip6}tables.
- 0 : disable this.
+ - 1 : pass bridged pppoe-tagged IP/IPv6 traffic to {ip,ip6}tables.
+ - 0 : disable this.
+
Default: 0
bridge-nf-pass-vlan-input-dev - BOOLEAN
- 1: if bridge-nf-filter-vlan-tagged is enabled, try to find a vlan
- interface on the bridge and set the netfilter input device to the vlan.
- This allows use of e.g. "iptables -i br0.1" and makes the REDIRECT
- target work with vlan-on-top-of-bridge interfaces. When no matching
- vlan interface is found, or this switch is off, the input device is
- set to the bridge interface.
- 0: disable bridge netfilter vlan interface lookup.
+ - 1: if bridge-nf-filter-vlan-tagged is enabled, try to find a vlan
+ interface on the bridge and set the netfilter input device to the
+ vlan. This allows use of e.g. "iptables -i br0.1" and makes the
+ REDIRECT target work with vlan-on-top-of-bridge interfaces. When no
+ matching vlan interface is found, or this switch is off, the input
+ device is set to the bridge interface.
+
+ - 0: disable bridge netfilter vlan interface lookup.
+
Default: 0
-proc/sys/net/sctp/* Variables:
+``proc/sys/net/sctp/*`` Variables:
+==================================
addip_enable - BOOLEAN
Enable or disable extension of Dynamic Address Reconfiguration
@@ -2128,11 +2729,13 @@ addip_noauth_enable - BOOLEAN
we provide this variable to control the enforcement of the
authentication requirement.
- 1: Allow ADD-IP extension to be used without authentication. This
+ == ===============================================================
+ 1 Allow ADD-IP extension to be used without authentication. This
should only be set in a closed environment for interoperability
with older implementations.
- 0: Enforce the authentication requirement
+ 0 Enforce the authentication requirement
+ == ===============================================================
Default: 0
@@ -2142,8 +2745,8 @@ auth_enable - BOOLEAN
required for secure operation of Dynamic Address Reconfiguration
(ADD-IP) extension.
- 1: Enable this extension.
- 0: Disable this extension.
+ - 1: Enable this extension.
+ - 0: Disable this extension.
Default: 0
@@ -2151,8 +2754,8 @@ prsctp_enable - BOOLEAN
Enable or disable the Partial Reliability extension (RFC3758) which
is used to notify peers that a given DATA should no longer be expected.
- 1: Enable extension
- 0: Disable
+ - 1: Enable extension
+ - 0: Disable
Default: 1
@@ -2254,8 +2857,8 @@ cookie_preserve_enable - BOOLEAN
Enable or disable the ability to extend the lifetime of the SCTP cookie
that is used during the establishment phase of SCTP association
- 1: Enable cookie lifetime extension.
- 0: Disable
+ - 1: Enable cookie lifetime extension.
+ - 0: Disable
Default: 1
@@ -2263,9 +2866,11 @@ cookie_hmac_alg - STRING
Select the hmac algorithm used when generating the cookie value sent by
a listening sctp socket to a connecting client in the INIT-ACK chunk.
Valid values are:
+
* md5
* sha1
* none
+
Ability to assign md5 or sha1 as the selected alg is predicated on the
configuration of those algorithms at build time (CONFIG_CRYPTO_MD5 and
CONFIG_CRYPTO_SHA1).
@@ -2284,16 +2889,16 @@ rcvbuf_policy - INTEGER
to each association instead of the socket. This prevents the described
blocking.
- 1: rcvbuf space is per association
- 0: rcvbuf space is per socket
+ - 1: rcvbuf space is per association
+ - 0: rcvbuf space is per socket
Default: 0
sndbuf_policy - INTEGER
Similar to rcvbuf_policy above, this applies to send buffer space.
- 1: Send buffer is tracked per association
- 0: Send buffer is tracked per socket.
+ - 1: Send buffer is tracked per association
+ - 0: Send buffer is tracked per socket.
Default: 0
@@ -2321,24 +2926,115 @@ sctp_rmem - vector of 3 INTEGERs: min, default, max
Default: 4K
sctp_wmem - vector of 3 INTEGERs: min, default, max
- Currently this tunable has no effect.
+ Only the first value ("min") is used, "default" and "max" are
+ ignored.
+
+ min: Minimum size of send buffer that can be used by SCTP sockets.
+ It is guaranteed to each SCTP socket (but not association) even
+ under moderate memory pressure.
+
+ Default: 4K
addr_scope_policy - INTEGER
Control IPv4 address scoping - draft-stewart-tsvwg-sctp-ipv4-00
- 0 - Disable IPv4 address scoping
- 1 - Enable IPv4 address scoping
- 2 - Follow draft but allow IPv4 private addresses
- 3 - Follow draft but allow IPv4 link local addresses
+ - 0 - Disable IPv4 address scoping
+ - 1 - Enable IPv4 address scoping
+ - 2 - Follow draft but allow IPv4 private addresses
+ - 3 - Follow draft but allow IPv4 link local addresses
Default: 1
+udp_port - INTEGER
+ The listening port for the local UDP tunneling sock. Normally it's
+ using the IANA-assigned UDP port number 9899 (sctp-tunneling).
+
+ This UDP sock is used for processing the incoming UDP-encapsulated
+ SCTP packets (from RFC6951), and shared by all applications in the
+ same net namespace. This UDP sock will be closed when the value is
+ set to 0.
+
+ The value will also be used to set the src port of the UDP header
+ for the outgoing UDP-encapsulated SCTP packets. For the dest port,
+ please refer to 'encap_port' below.
+
+ Default: 0
+
+encap_port - INTEGER
+ The default remote UDP encapsulation port.
+
+ This value is used to set the dest port of the UDP header for the
+ outgoing UDP-encapsulated SCTP packets by default. Users can also
+ change the value for each sock/asoc/transport by using setsockopt.
+ For further information, please refer to RFC6951.
+
+ Note that when connecting to a remote server, the client should set
+ this to the port that the UDP tunneling sock on the peer server is
+ listening to and the local UDP tunneling sock on the client also
+ must be started. On the server, it would get the encap_port from
+ the incoming packet's source port.
+
+ Default: 0
+
+plpmtud_probe_interval - INTEGER
+ The time interval (in milliseconds) for the PLPMTUD probe timer,
+ which is configured to expire after this period to receive an
+ acknowledgment to a probe packet. This is also the time interval
+ between the probes for the current pmtu when the probe search
+ is done.
+
+ PLPMTUD will be disabled when 0 is set, and other values for it
+ must be >= 5000.
+
+ Default: 0
+
+reconf_enable - BOOLEAN
+ Enable or disable extension of Stream Reconfiguration functionality
+ specified in RFC6525. This extension provides the ability to "reset"
+ a stream, and it includes the Parameters of "Outgoing/Incoming SSN
+ Reset", "SSN/TSN Reset" and "Add Outgoing/Incoming Streams".
+
+ - 1: Enable extension.
+ - 0: Disable extension.
+
+ Default: 0
+
+intl_enable - BOOLEAN
+ Enable or disable extension of User Message Interleaving functionality
+ specified in RFC8260. This extension allows the interleaving of user
+ messages sent on different streams. With this feature enabled, I-DATA
+ chunk will replace DATA chunk to carry user messages if also supported
+ by the peer. Note that to use this feature, one needs to set this option
+ to 1 and also needs to set socket options SCTP_FRAGMENT_INTERLEAVE to 2
+ and SCTP_INTERLEAVING_SUPPORTED to 1.
+
+ - 1: Enable extension.
+ - 0: Disable extension.
+
+ Default: 0
+
+ecn_enable - BOOLEAN
+ Control use of Explicit Congestion Notification (ECN) by SCTP.
+ Like in TCP, ECN is used only when both ends of the SCTP connection
+ indicate support for it. This feature is useful in avoiding losses
+ due to congestion by allowing supporting routers to signal congestion
+ before having to drop packets.
+
+ - 1: Enable ecn.
+ - 0: Disable ecn.
+
+ Default: 1
+
+
+``/proc/sys/net/core/*``
+========================
-/proc/sys/net/core/*
Please see: Documentation/admin-guide/sysctl/net.rst for descriptions of these entries.
-/proc/sys/net/unix/*
+``/proc/sys/net/unix/*``
+========================
+
max_dgram_qlen - INTEGER
The maximum length of dgram socket receive queue
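Reviewer note on ip-sysctl.rst: several entries in this file state how the ``conf/all`` and per-interface values combine (accept_dad uses the maximum of the two, optimistic_dad is enabled if at least one is 1, the enhanced_dad nonce is sent unless both are FALSE), and some of the new entries state value ranges (ioam6_id, ioam6_id_wide, plpmtud_probe_interval). The Python sketch below restates those documented rules only. All function and constant names are hypothetical, nothing here reads kernel state, and it is not derived from the kernel implementation.

```python
"""Sketch of rules stated in the ip-sysctl.rst text; all names are illustrative."""

IOAM6_ID_MAX = 0xFFFFFF                # ioam6_id: 24 bits out of 32
IOAM6_ID_WIDE_MAX = 0xFFFFFFFFFFFFFF   # ioam6_id_wide: 56 bits out of 64


def effective_accept_dad(conf_all: int, conf_iface: int) -> int:
    """DAD mode is selected by the maximum of conf/{all,interface}/accept_dad."""
    return max(conf_all, conf_iface)


def optimistic_dad_enabled(conf_all: int, conf_iface: int) -> bool:
    """Enabled if at least one of conf/{all,interface}/optimistic_dad is 1."""
    return conf_all == 1 or conf_iface == 1


def enhanced_dad_nonce_sent(conf_all: bool, conf_iface: bool) -> bool:
    """The nonce is sent unless both enhanced_dad settings are FALSE."""
    return bool(conf_all) or bool(conf_iface)


def valid_ioam6_id(value: int, wide: bool = False) -> bool:
    """Check a value against the documented Min/Max of ioam6_id(_wide)."""
    limit = IOAM6_ID_WIDE_MAX if wide else IOAM6_ID_MAX
    return 0 <= value <= limit


def valid_plpmtud_probe_interval(ms: int) -> bool:
    """0 disables PLPMTUD; any other value must be >= 5000 milliseconds."""
    return ms == 0 or ms >= 5000
```

For example, ``effective_accept_dad(1, 2)`` yields mode 2 for that interface, which matches the "maximum value" wording in the accept_dad entry.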
diff --git a/Documentation/networking/ip_dynaddr.txt b/Documentation/networking/ip_dynaddr.rst
index 45f3c1268e86..eacc0c780c7f 100644
--- a/Documentation/networking/ip_dynaddr.txt
+++ b/Documentation/networking/ip_dynaddr.rst
@@ -1,10 +1,15 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
IP dynamic address hack-port v0.03
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+==================================
+
This stuff allows diald ONESHOT connections to get established by
dynamically changing packet source address (and socket's if local procs).
It is implemented for TCP diald-box connections(1) and IP_MASQuerading(2).
-If enabled[*] and forwarding interface has changed:
+If enabled\ [#]_ and forwarding interface has changed:
+
1) Socket (and packet) source address is rewritten ON RETRANSMISSIONS
while in SYN_SENT state (diald-box processes).
2) Out-bounded MASQueraded source address changes ON OUTPUT (when
@@ -12,18 +17,24 @@ If enabled[*] and forwarding interface has changed:
received by the tunnel.
This is specially helpful for auto dialup links (diald), where the
-``actual'' outgoing address is unknown at the moment the link is
+``actual`` outgoing address is unknown at the moment the link is
going up. So, the *same* (local AND masqueraded) connections requests that
bring the link up will be able to get established.
-[*] At boot, by default no address rewriting is attempted.
- To enable:
+.. [#] At boot, by default no address rewriting is attempted.
+
+ To enable::
+
# echo 1 > /proc/sys/net/ipv4/ip_dynaddr
- To enable verbose mode:
- # echo 2 > /proc/sys/net/ipv4/ip_dynaddr
- To disable (default)
+
+ To enable verbose mode::
+
+ # echo 2 > /proc/sys/net/ipv4/ip_dynaddr
+
+ To disable (default)::
+
# echo 0 > /proc/sys/net/ipv4/ip_dynaddr
Enjoy!
--- Juanjo <jjciarla@raiz.uncu.edu.ar>
+Juanjo <jjciarla@raiz.uncu.edu.ar>
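Reviewer note on ip_dynaddr.rst: the footnote lists three values for ``/proc/sys/net/ipv4/ip_dynaddr`` (0 to disable, 1 to enable, 2 for verbose mode). A small Python equivalent of the ``echo`` commands might look like the sketch below. The helper name, the mode labels, and the ``proc_root`` parameter are assumptions added so the function can be exercised against a scratch directory; writing to the real ``/proc`` requires root.

```python
from pathlib import Path

# Values taken from the ip_dynaddr document; the labels are illustrative.
IP_DYNADDR_MODES = {"disable": 0, "enable": 1, "verbose": 2}


def set_ip_dynaddr(mode: str, proc_root: Path = Path("/proc")) -> int:
    """Write the value for `mode` to <proc_root>/sys/net/ipv4/ip_dynaddr.

    Raises KeyError for an unknown mode label. Returns the value written.
    """
    value = IP_DYNADDR_MODES[mode]
    target = Path(proc_root) / "sys" / "net" / "ipv4" / "ip_dynaddr"
    target.write_text(f"{value}\n")
    return value
```

Calling ``set_ip_dynaddr("verbose")`` as root corresponds to ``echo 2 > /proc/sys/net/ipv4/ip_dynaddr``.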
diff --git a/Documentation/networking/ipddp.txt b/Documentation/networking/ipddp.rst
index ba5c217fffe0..be7091b77927 100644
--- a/Documentation/networking/ipddp.txt
+++ b/Documentation/networking/ipddp.rst
@@ -1,7 +1,12 @@
-Text file for ipddp.c:
- AppleTalk-IP Decapsulation and AppleTalk-IP Encapsulation
+.. SPDX-License-Identifier: GPL-2.0
-This text file is written by Jay Schulist <jschlst@samba.org>
+=========================================================
+AppleTalk-IP Decapsulation and AppleTalk-IP Encapsulation
+=========================================================
+
+Documentation ipddp.c
+
+This file is written by Jay Schulist <jschlst@samba.org>
Introduction
------------
@@ -21,7 +26,7 @@ kernel AppleTalk layer and drivers are available.
Each mode requires its own user space software.
Compiling AppleTalk-IP Decapsulation/Encapsulation
-=================================================
+==================================================
AppleTalk-IP decapsulation needs to be compiled into your kernel. You
will need to turn on AppleTalk-IP driver support. Then you will need to
diff --git a/Documentation/networking/ipsec.txt b/Documentation/networking/ipsec.rst
index ba794b7e51be..afe9d7b48be3 100644
--- a/Documentation/networking/ipsec.txt
+++ b/Documentation/networking/ipsec.rst
@@ -1,12 +1,20 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====
+IPsec
+=====
+
This file documents known IPsec corner cases which need to be kept in mind when
deploying various IPsec configurations in real-world production environments.
-1. IPcomp: Small IP packet won't get compressed at sender, and failed on
+1. IPcomp:
+ Small IP packet won't get compressed at sender, and failed on
policy check on receiver.
-Quote from RFC3173:
-2.2. Non-Expansion Policy
+Quote from RFC3173::
+
+ 2.2. Non-Expansion Policy
If the total size of a compressed payload and the IPComp header, as
defined in section 3, is not smaller than the size of the original
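Reviewer note on ipsec.rst: the non-expansion rule quoted from RFC3173 is why small packets leave the sender uncompressed and then fail the receiver's policy check, as item 1 describes. The sender's decision can be sketched as below, assuming the 4-octet IPComp header defined in RFC 3173 section 3. The function and constant names are hypothetical and this is not the kernel's code.

```python
IPCOMP_HEADER_LEN = 4  # octets: Next Header (1) + Flags (1) + CPI (2)


def send_compressed(original_len: int, compressed_len: int) -> bool:
    """Apply the RFC 3173 section 2.2 non-expansion policy.

    The packet is sent in compressed form only if the compressed payload
    plus the IPComp header is strictly smaller than the original payload.
    """
    return compressed_len + IPCOMP_HEADER_LEN < original_len
```

A payload that compresses from 100 to 96 octets is therefore sent uncompressed, because 96 + 4 is not smaller than 100.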
diff --git a/Documentation/networking/ipv6.txt b/Documentation/networking/ipv6.rst
index 6cd74fa55358..ba09c2f2dcc7 100644
--- a/Documentation/networking/ipv6.txt
+++ b/Documentation/networking/ipv6.rst
@@ -1,9 +1,15 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====
+IPv6
+====
+
Options for the ipv6 module are supplied as parameters at load time.
Module options may be given as command line arguments to the insmod
or modprobe command, but are usually specified in either
-/etc/modules.d/*.conf configuration files, or in a distro-specific
+``/etc/modules.d/*.conf`` configuration files, or in a distro-specific
configuration file.
The available ipv6 module parameters are listed below. If a parameter
diff --git a/Documentation/networking/ipvlan.txt b/Documentation/networking/ipvlan.rst
index 27a38e50c287..0000c1d383bc 100644
--- a/Documentation/networking/ipvlan.txt
+++ b/Documentation/networking/ipvlan.rst
@@ -1,46 +1,64 @@
+.. SPDX-License-Identifier: GPL-2.0
- IPVLAN Driver HOWTO
+===================
+IPVLAN Driver HOWTO
+===================
Initial Release:
Mahesh Bandewar <maheshb AT google.com>
1. Introduction:
- This is conceptually very similar to the macvlan driver with one major
+================
+This is conceptually very similar to the macvlan driver with one major
exception of using L3 for mux-ing /demux-ing among slaves. This property makes
-the master device share the L2 with it's slave devices. I have developed this
+the master device share the L2 with its slave devices. I have developed this
driver in conjunction with network namespaces and not sure if there is use case
outside of it.
2. Building and Installation:
- In order to build the driver, please select the config item CONFIG_IPVLAN.
+=============================
+
+In order to build the driver, please select the config item CONFIG_IPVLAN.
The driver can be built into the kernel (CONFIG_IPVLAN=y) or as a module
(CONFIG_IPVLAN=m).
3. Configuration:
- There are no module parameters for this driver and it can be configured
+=================
+
+There are no module parameters for this driver and it can be configured
using IProute2/ip utility.
+::
ip link add link <master> name <slave> type ipvlan [ mode MODE ] [ FLAGS ]
where
- MODE: l3 (default) | l3s | l2
- FLAGS: bridge (default) | private | vepa
+ MODE: l3 (default) | l3s | l2
+ FLAGS: bridge (default) | private | vepa
+
+e.g.
- e.g.
(a) Following will create IPvlan link with eth0 as master in
- L3 bridge mode
- bash# ip link add link eth0 name ipvl0 type ipvlan
- (b) This command will create IPvlan link in L2 bridge mode.
- bash# ip link add link eth0 name ipvl0 type ipvlan mode l2 bridge
- (c) This command will create an IPvlan device in L2 private mode.
- bash# ip link add link eth0 name ipvlan type ipvlan mode l2 private
- (d) This command will create an IPvlan device in L2 vepa mode.
- bash# ip link add link eth0 name ipvlan type ipvlan mode l2 vepa
+ L3 bridge mode::
+
+ bash# ip link add link eth0 name ipvl0 type ipvlan
+ (b) This command will create IPvlan link in L2 bridge mode::
+
+ bash# ip link add link eth0 name ipvl0 type ipvlan mode l2 bridge
+
+ (c) This command will create an IPvlan device in L2 private mode::
+
+ bash# ip link add link eth0 name ipvlan type ipvlan mode l2 private
+
+ (d) This command will create an IPvlan device in L2 vepa mode::
+
+ bash# ip link add link eth0 name ipvlan type ipvlan mode l2 vepa
4. Operating modes:
- IPvlan has two modes of operation - L2 and L3. For a given master device,
+===================
+
+IPvlan has two modes of operation - L2 and L3. For a given master device,
you can select one of these two modes and all slaves on that master will
operate in the same (selected) mode. The RX mode is almost identical except
that in L3 mode the slaves won't receive any multicast / broadcast traffic.
@@ -48,39 +66,50 @@ L3 mode is more restrictive since routing is controlled from the other (mostly)
default namespace.
4.1 L2 mode:
- In this mode TX processing happens on the stack instance attached to the
+------------
+
+In this mode TX processing happens on the stack instance attached to the
slave device and packets are switched and queued to the master device to send
out. In this mode the slaves will RX/TX multicast and broadcast (if applicable)
as well.
4.2 L3 mode:
- In this mode TX processing up to L3 happens on the stack instance attached
+------------
+
+In this mode TX processing up to L3 happens on the stack instance attached
to the slave device and packets are switched to the stack instance of the
master device for the L2 processing and routing from that instance will be
used before packets are queued on the outbound device. In this mode the slaves
will not receive nor can send multicast / broadcast traffic.
4.3 L3S mode:
- This is very similar to the L3 mode except that iptables (conn-tracking)
+-------------
+
+This is very similar to the L3 mode except that iptables (conn-tracking)
works in this mode and hence it is L3-symmetric (L3s). This will have slightly less
performance but that shouldn't matter since you are choosing this mode over plain-L3
mode to make conn-tracking work.
5. Mode flags:
- At this time following mode flags are available
+==============
+
+At this time the following mode flags are available:
5.1 bridge:
- This is the default option. To configure the IPvlan port in this mode,
+-----------
+This is the default option. To configure the IPvlan port in this mode,
user can choose to either add this option on the command-line or don't specify
anything. This is the traditional mode where slaves can cross-talk among
themselves apart from talking through the master device.
5.2 private:
- If this option is added to the command-line, the port is set in private
+------------
+If this option is added to the command-line, the port is set in private
mode. i.e. port won't allow cross communication between slaves.
5.3 vepa:
- If this is added to the command-line, the port is set in VEPA mode.
+---------
+If this is added to the command-line, the port is set in VEPA mode.
i.e. port will offload switching functionality to the external entity as
described in 802.1Qbg
Note: VEPA mode in IPvlan has limitations. IPvlan uses the mac-address of the
@@ -89,18 +118,25 @@ neighbor will have source and destination mac same. This will make the switch /
router send the redirect message.
6. What to choose (macvlan vs. ipvlan)?
- These two devices are very similar in many regards and the specific use
+=======================================
+
+These two devices are very similar in many regards and the specific use
case could very well define which device to choose. if one of the following
-situations defines your use case then you can choose to use ipvlan -
- (a) The Linux host that is connected to the external switch / router has
-policy configured that allows only one mac per port.
- (b) No of virtual devices created on a master exceed the mac capacity and
-puts the NIC in promiscuous mode and degraded performance is a concern.
- (c) If the slave device is to be put into the hostile / untrusted network
-namespace where L2 on the slave could be changed / misused.
+situations defines your use case then you can choose to use ipvlan:
+
+
+(a) The Linux host that is connected to the external switch / router has
+ policy configured that allows only one mac per port.
+(b) No of virtual devices created on a master exceed the mac capacity and
+ puts the NIC in promiscuous mode and degraded performance is a concern.
+(c) If the slave device is to be put into the hostile / untrusted network
+ namespace where L2 on the slave could be changed / misused.
7. Example configuration:
+=========================
+
+::
+=============================================================+
| Host: host1 |
@@ -117,30 +153,37 @@ namespace where L2 on the slave could be changed / misused.
+==============================#==============================+
- (a) Create two network namespaces - ns0, ns1
- ip netns add ns0
- ip netns add ns1
-
- (b) Create two ipvlan slaves on eth0 (master device)
- ip link add link eth0 ipvl0 type ipvlan mode l2
- ip link add link eth0 ipvl1 type ipvlan mode l2
-
- (c) Assign slaves to the respective network namespaces
- ip link set dev ipvl0 netns ns0
- ip link set dev ipvl1 netns ns1
-
- (d) Now switch to the namespace (ns0 or ns1) to configure the slave devices
- - For ns0
- (1) ip netns exec ns0 bash
- (2) ip link set dev ipvl0 up
- (3) ip link set dev lo up
- (4) ip -4 addr add 127.0.0.1 dev lo
- (5) ip -4 addr add $IPADDR dev ipvl0
- (6) ip -4 route add default via $ROUTER dev ipvl0
- - For ns1
- (1) ip netns exec ns1 bash
- (2) ip link set dev ipvl1 up
- (3) ip link set dev lo up
- (4) ip -4 addr add 127.0.0.1 dev lo
- (5) ip -4 addr add $IPADDR dev ipvl1
- (6) ip -4 route add default via $ROUTER dev ipvl1
+(a) Create two network namespaces - ns0, ns1::
+
+ ip netns add ns0
+ ip netns add ns1
+
+(b) Create two ipvlan slaves on eth0 (master device)::
+
+ ip link add link eth0 ipvl0 type ipvlan mode l2
+ ip link add link eth0 ipvl1 type ipvlan mode l2
+
+(c) Assign slaves to the respective network namespaces::
+
+ ip link set dev ipvl0 netns ns0
+ ip link set dev ipvl1 netns ns1
+
+(d) Now switch to the namespace (ns0 or ns1) to configure the slave devices
+
+ - For ns0::
+
+ (1) ip netns exec ns0 bash
+ (2) ip link set dev ipvl0 up
+ (3) ip link set dev lo up
+ (4) ip -4 addr add 127.0.0.1 dev lo
+ (5) ip -4 addr add $IPADDR dev ipvl0
+ (6) ip -4 route add default via $ROUTER dev ipvl0
+
+ - For ns1::
+
+ (1) ip netns exec ns1 bash
+ (2) ip link set dev ipvl1 up
+ (3) ip link set dev lo up
+ (4) ip -4 addr add 127.0.0.1 dev lo
+ (5) ip -4 addr add $IPADDR dev ipvl1
+ (6) ip -4 route add default via $ROUTER dev ipvl1
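The example configuration in steps (a) to (d) above can be sketched as a small command generator. This is only an illustrative sketch: the function name `ipvlan_example_commands`, the tuple layout, and the sample addresses (taken from the 192.0.2.0/24 documentation range in place of `$IPADDR` and `$ROUTER`) are assumptions, not part of the original text. Instead of opening an interactive shell with `ip netns exec <ns> bash`, the sketch prefixes each per-namespace command with `ip netns exec <ns>`. It only builds the command strings and does not run them.

```python
def ipvlan_example_commands(master, slaves, router):
    """Build the ip(8) command lines for the example configuration.

    master -- name of the master device, e.g. "eth0"
    slaves -- list of (namespace, device, ipaddr) tuples (hypothetical layout)
    router -- default gateway used inside each namespace
    """
    cmds = []
    # (a) create the network namespaces
    for ns, _dev, _ip in slaves:
        cmds.append(f"ip netns add {ns}")
    # (b) create the ipvlan slaves on the master device in L2 mode
    for _ns, dev, _ip in slaves:
        cmds.append(f"ip link add link {master} {dev} type ipvlan mode l2")
    # (c) move each slave into its namespace
    for ns, dev, _ip in slaves:
        cmds.append(f"ip link set dev {dev} netns {ns}")
    # (d) configure the devices from within each namespace
    for ns, dev, ip in slaves:
        prefix = f"ip netns exec {ns} "
        cmds += [prefix + c for c in (
            f"ip link set dev {dev} up",
            "ip link set dev lo up",
            "ip -4 addr add 127.0.0.1 dev lo",
            f"ip -4 addr add {ip} dev {dev}",
            f"ip -4 route add default via {router} dev {dev}",
        )]
    return cmds


if __name__ == "__main__":
    example = ipvlan_example_commands(
        "eth0",
        [("ns0", "ipvl0", "192.0.2.10/24"), ("ns1", "ipvl1", "192.0.2.11/24")],
        "192.0.2.1",
    )
    print("\n".join(example))
```

The generated lines would need root privileges and a kernel with CONFIG_IPVLAN to take effect if executed.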
diff --git a/Documentation/networking/ipvs-sysctl.txt b/Documentation/networking/ipvs-sysctl.rst
index 056898685d40..387fda80f05f 100644
--- a/Documentation/networking/ipvs-sysctl.txt
+++ b/Documentation/networking/ipvs-sysctl.rst
@@ -1,23 +1,30 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========
+IPvs-sysctl
+===========
+
/proc/sys/net/ipv4/vs/* Variables:
+==================================
am_droprate - INTEGER
- default 10
+ default 10
- It sets the always mode drop rate, which is used in the mode 3
- of the drop_rate defense.
+ It sets the always mode drop rate, which is used in the mode 3
+ of the drop_rate defense.
amemthresh - INTEGER
- default 1024
+ default 1024
- It sets the available memory threshold (in pages), which is
- used in the automatic modes of defense. When there is no
- enough available memory, the respective strategy will be
- enabled and the variable is automatically set to 2, otherwise
- the strategy is disabled and the variable is set to 1.
+ It sets the available memory threshold (in pages), which is
+ used in the automatic modes of defense. When there is not
+ enough available memory, the respective strategy will be
+ enabled and the variable is automatically set to 2, otherwise
+ the strategy is disabled and the variable is set to 1.
backup_only - BOOLEAN
- 0 - disabled (default)
- not 0 - enabled
+ - 0 - disabled (default)
+ - not 0 - enabled
If set, disable the director function while the server is
in backup mode to avoid packet loops for DR/TUN methods.
@@ -30,8 +37,7 @@ conn_reuse_mode - INTEGER
0: disable any special handling on port reuse. The new
connection will be delivered to the same real server that was
- servicing the previous connection. This will effectively
- disable expire_nodest_conn.
+ servicing the previous connection.
bit 1: enable rescheduling of new connections when it is safe.
That is, whenever expire_nodest_conn and for TCP sockets, when
@@ -44,8 +50,8 @@ conn_reuse_mode - INTEGER
real servers to a very busy cluster.
conntrack - BOOLEAN
- 0 - disabled (default)
- not 0 - enabled
+ - 0 - disabled (default)
+ - not 0 - enabled
If set, maintain connection tracking entries for
connections handled by IPVS.
@@ -61,28 +67,28 @@ conntrack - BOOLEAN
Only available when IPVS is compiled with CONFIG_IP_VS_NFCT enabled.
cache_bypass - BOOLEAN
- 0 - disabled (default)
- not 0 - enabled
+ - 0 - disabled (default)
+ - not 0 - enabled
- If it is enabled, forward packets to the original destination
- directly when no cache server is available and destination
- address is not local (iph->daddr is RTN_UNICAST). It is mostly
- used in transparent web cache cluster.
+ If it is enabled, forward packets to the original destination
+ directly when no cache server is available and destination
+ address is not local (iph->daddr is RTN_UNICAST). It is mostly
+ used in transparent web cache cluster.
debug_level - INTEGER
- 0 - transmission error messages (default)
- 1 - non-fatal error messages
- 2 - configuration
- 3 - destination trash
- 4 - drop entry
- 5 - service lookup
- 6 - scheduling
- 7 - connection new/expire, lookup and synchronization
- 8 - state transition
- 9 - binding destination, template checks and applications
- 10 - IPVS packet transmission
- 11 - IPVS packet handling (ip_vs_in/ip_vs_out)
- 12 or more - packet traversal
+ - 0 - transmission error messages (default)
+ - 1 - non-fatal error messages
+ - 2 - configuration
+ - 3 - destination trash
+ - 4 - drop entry
+ - 5 - service lookup
+ - 6 - scheduling
+ - 7 - connection new/expire, lookup and synchronization
+ - 8 - state transition
+ - 9 - binding destination, template checks and applications
+ - 10 - IPVS packet transmission
+ - 11 - IPVS packet handling (ip_vs_in/ip_vs_out)
+ - 12 or more - packet traversal
Only available when IPVS is compiled with CONFIG_IP_VS_DEBUG enabled.
@@ -92,58 +98,58 @@ debug_level - INTEGER
the level.
drop_entry - INTEGER
- 0 - disabled (default)
-
- The drop_entry defense is to randomly drop entries in the
- connection hash table, just in order to collect back some
- memory for new connections. In the current code, the
- drop_entry procedure can be activated every second, then it
- randomly scans 1/32 of the whole and drops entries that are in
- the SYN-RECV/SYNACK state, which should be effective against
- syn-flooding attack.
-
- The valid values of drop_entry are from 0 to 3, where 0 means
- that this strategy is always disabled, 1 and 2 mean automatic
- modes (when there is no enough available memory, the strategy
- is enabled and the variable is automatically set to 2,
- otherwise the strategy is disabled and the variable is set to
- 1), and 3 means that that the strategy is always enabled.
+ - 0 - disabled (default)
+
+ The drop_entry defense is to randomly drop entries in the
+ connection hash table, just in order to collect back some
+ memory for new connections. In the current code, the
+ drop_entry procedure can be activated every second, then it
+ randomly scans 1/32 of the whole and drops entries that are in
+ the SYN-RECV/SYNACK state, which should be effective against
+ syn-flooding attack.
+
+ The valid values of drop_entry are from 0 to 3, where 0 means
+ that this strategy is always disabled, 1 and 2 mean automatic
+ modes (when there is not enough available memory, the strategy
+ is enabled and the variable is automatically set to 2,
+ otherwise the strategy is disabled and the variable is set to
+ 1), and 3 means that the strategy is always enabled.
drop_packet - INTEGER
- 0 - disabled (default)
+ - 0 - disabled (default)
- The drop_packet defense is designed to drop 1/rate packets
- before forwarding them to real servers. If the rate is 1, then
- drop all the incoming packets.
+ The drop_packet defense is designed to drop 1/rate packets
+ before forwarding them to real servers. If the rate is 1, then
+ drop all the incoming packets.
- The value definition is the same as that of the drop_entry. In
- the automatic mode, the rate is determined by the follow
- formula: rate = amemthresh / (amemthresh - available_memory)
- when available memory is less than the available memory
- threshold. When the mode 3 is set, the always mode drop rate
- is controlled by the /proc/sys/net/ipv4/vs/am_droprate.
+ The value definition is the same as that of the drop_entry. In
+ the automatic mode, the rate is determined by the following
+ formula: rate = amemthresh / (amemthresh - available_memory)
+ when available memory is less than the available memory
+ threshold. When the mode 3 is set, the always mode drop rate
+ is controlled by the /proc/sys/net/ipv4/vs/am_droprate.
expire_nodest_conn - BOOLEAN
- 0 - disabled (default)
- not 0 - enabled
-
- The default value is 0, the load balancer will silently drop
- packets when its destination server is not available. It may
- be useful, when user-space monitoring program deletes the
- destination server (because of server overload or wrong
- detection) and add back the server later, and the connections
- to the server can continue.
-
- If this feature is enabled, the load balancer will expire the
- connection immediately when a packet arrives and its
- destination server is not available, then the client program
- will be notified that the connection is closed. This is
- equivalent to the feature some people requires to flush
- connections when its destination is not available.
+ - 0 - disabled (default)
+ - not 0 - enabled
+
+ The default value is 0, the load balancer will silently drop
+ packets when its destination server is not available. It may
+ be useful, when user-space monitoring program deletes the
+ destination server (because of server overload or wrong
+ detection) and add back the server later, and the connections
+ to the server can continue.
+
+ If this feature is enabled, the load balancer will expire the
+ connection immediately when a packet arrives and its
+ destination server is not available, then the client program
+ will be notified that the connection is closed. This is
+ equivalent to the feature some people require to flush
+ connections when its destination is not available.
expire_quiescent_template - BOOLEAN
- 0 - disabled (default)
- not 0 - enabled
+ - 0 - disabled (default)
+ - not 0 - enabled
When set to a non-zero value, the load balancer will expire
persistent templates when the destination server is quiescent.
@@ -158,8 +164,8 @@ expire_quiescent_template - BOOLEAN
connection and the destination server is quiescent.
ignore_tunneled - BOOLEAN
- 0 - disabled (default)
- not 0 - enabled
+ - 0 - disabled (default)
+ - not 0 - enabled
If set, ipvs will set the ipvs_property on all packets which are of
unrecognized protocols. This prevents us from routing tunneled
@@ -168,30 +174,30 @@ ignore_tunneled - BOOLEAN
ipvs routing loops when ipvs is also acting as a real server).
nat_icmp_send - BOOLEAN
- 0 - disabled (default)
- not 0 - enabled
+ - 0 - disabled (default)
+ - not 0 - enabled
- It controls sending icmp error messages (ICMP_DEST_UNREACH)
- for VS/NAT when the load balancer receives packets from real
- servers but the connection entries don't exist.
+ It controls sending icmp error messages (ICMP_DEST_UNREACH)
+ for VS/NAT when the load balancer receives packets from real
+ servers but the connection entries don't exist.
pmtu_disc - BOOLEAN
- 0 - disabled
- not 0 - enabled (default)
+ - 0 - disabled
+ - not 0 - enabled (default)
By default, reject with FRAG_NEEDED all DF packets that exceed
the PMTU, irrespective of the forwarding method. For TUN method
the flag can be disabled to fragment such packets.
secure_tcp - INTEGER
- 0 - disabled (default)
+ - 0 - disabled (default)
The secure_tcp defense is to use a more complicated TCP state
transition table. For VS/NAT, it also delays entering the
TCP ESTABLISHED state until the three way handshake is completed.
- The value definition is the same as that of drop_entry and
- drop_packet.
+ The value definition is the same as that of drop_entry and
+ drop_packet.
sync_threshold - vector of 2 INTEGERs: sync_threshold, sync_period
default 3 50
@@ -248,8 +254,8 @@ sync_ports - INTEGER
8848+sync_ports-1.
snat_reroute - BOOLEAN
- 0 - disabled
- not 0 - enabled (default)
+ - 0 - disabled
+ - not 0 - enabled (default)
If enabled, recalculate the route of SNATed packets from
realservers so that they are routed as if they originate from the
@@ -270,6 +276,7 @@ sync_persist_mode - INTEGER
Controls the synchronisation of connections when using persistence
0: All types of connections are synchronised
+
1: Attempt to reduce the synchronisation traffic depending on
the connection type. For persistent services avoid synchronisation
for normal connections, do it only for persistence templates.
@@ -292,3 +299,14 @@ sync_version - INTEGER
Kernels with this sync_version entry are able to receive messages
of both version 1 and version 2 of the synchronisation protocol.
+
+run_estimation - BOOLEAN
+ 0 - disabled
+ not 0 - enabled (default)
+
+ If disabled, the estimation will be stopped, and you can't see
+ any update on speed estimation data.
+
+ You can always re-enable estimation by setting this value to 1.
+ But be careful, the first estimation after re-enable is not
+ accurate.
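The drop_packet formula quoted in the ipvs-sysctl text above, rate = amemthresh / (amemthresh - available_memory), can be illustrated with a short worked sketch. The function name `drop_rate` is hypothetical, and the use of integer division is an assumption of this sketch; the text itself only gives the formula and states that a rate of 1 means all incoming packets are dropped.

```python
def drop_rate(amemthresh, available_memory):
    """Return the automatic-mode drop rate, or None if the defense is idle.

    Both arguments are in pages. The formula only applies while the
    available memory is below the threshold.
    """
    if available_memory >= amemthresh:
        return None  # enough memory: automatic mode does not drop packets
    # Assumption: integer division, so the result is a whole "1 in rate" value.
    return amemthresh // (amemthresh - available_memory)


if __name__ == "__main__":
    for avail in (2048, 1000, 512, 0):
        print(avail, drop_rate(1024, avail))
```

With the default amemthresh of 1024, half of the threshold remaining gives a rate of 2 (one packet in two is dropped), and the rate falls towards 1 (drop everything) as available memory approaches zero.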
diff --git a/Documentation/networking/j1939.rst b/Documentation/networking/j1939.rst
index f5be243d250a..b705d2801e9c 100644
--- a/Documentation/networking/j1939.rst
+++ b/Documentation/networking/j1939.rst
@@ -10,9 +10,9 @@ Overview / What Is J1939
SAE J1939 defines a higher layer protocol on CAN. It implements a more
sophisticated addressing scheme and extends the maximum packet size above 8
bytes. Several derived specifications exist, which differ from the original
-J1939 on the application level, like MilCAN A, NMEA2000 and especially
+J1939 on the application level, like MilCAN A, NMEA2000, and especially
ISO-11783 (ISOBUS). This last one specifies the so-called ETP (Extended
-Transport Protocol) which is has been included in this implementation. This
+Transport Protocol), which has been included in this implementation. This
results in a maximum packet size of ((2 ^ 24) - 1) * 7 bytes == 111 MiB.
Specifications used
@@ -32,15 +32,15 @@ sockets, we found some reasons to justify a kernel implementation for the
addressing and transport methods used by J1939.
* **Addressing:** when a process on an ECU communicates via J1939, it should
- not necessarily know its source address. Although at least one process per
+ not necessarily know its source address. Although, at least one process per
ECU should know the source address. Other processes should be able to reuse
that address. This way, address parameters for different processes
cooperating for the same ECU, are not duplicated. This way of working is
- closely related to the UNIX concept where programs do just one thing, and do
+ closely related to the UNIX concept, where programs do just one thing and do
it well.
* **Dynamic addressing:** Address Claiming in J1939 is time critical.
- Furthermore data transport should be handled properly during the address
+ Furthermore, data transport should be handled properly during the address
negotiation. Putting this functionality in the kernel eliminates it as a
requirement for _every_ user space process that communicates via J1939. This
results in a consistent J1939 bus with proper addressing.
@@ -58,7 +58,7 @@ Therefore, these parts are left to user space.
The J1939 sockets operate on CAN network devices (see SocketCAN). Any J1939
user space library operating on CAN raw sockets will still operate properly.
-Since such library does not communicate with the in-kernel implementation, care
+Since such a library does not communicate with the in-kernel implementation, care
must be taken that these two do not interfere. In practice, this means they
cannot share ECU addresses. A single ECU (or virtual ECU) address is used by
the library exclusively, or by the in-kernel system exclusively.
@@ -69,21 +69,59 @@ J1939 concepts
PGN
---
+The J1939 protocol uses the 29-bit CAN identifier with the following structure:
+
+  ============  ==============  ====================
+  29 bit CAN-ID
+  --------------------------------------------------
+  Bit positions within the CAN-ID
+  --------------------------------------------------
+  28 ... 26     25 ... 8        7 ... 0
+  ============  ==============  ====================
+  Priority      PGN             SA (Source Address)
+  ============  ==============  ====================
+
The PGN (Parameter Group Number) is a number to identify a packet. The PGN
is composed as follows:
-1 bit : Reserved Bit
-1 bit : Data Page
-8 bits : PF (PDU Format)
-8 bits : PS (PDU Specific)
+
+  ============  ==============  =================  =================
+  PGN
+  ------------------------------------------------------------------
+  Bit positions within the CAN-ID
+  ------------------------------------------------------------------
+  25            24              23 ... 16          15 ... 8
+  ============  ==============  =================  =================
+  R (Reserved)  DP (Data Page)  PF (PDU Format)    PS (PDU Specific)
+  ============  ==============  =================  =================
In J1939-21 distinction is made between PDU1 format (where PF < 240) and PDU2
-format (where PF >= 240). Furthermore, when using PDU2 format, the PS-field
+format (where PF >= 240). Furthermore, when using the PDU2 format, the PS-field
contains a so-called Group Extension, which is part of the PGN. When using PDU2
format, the Group Extension is set in the PS-field.
+  ==============  ========================
+  PDU1 Format (specific) (peer to peer)
+  ----------------------------------------
+  Bit positions within the CAN-ID
+  ----------------------------------------
+  23 ... 16       15 ... 8
+  ==============  ========================
+  00h ... EFh     DA (Destination address)
+  ==============  ========================
+
+  ==============  ========================
+  PDU2 Format (global) (broadcast)
+  ----------------------------------------
+  Bit positions within the CAN-ID
+  ----------------------------------------
+  23 ... 16       15 ... 8
+  ==============  ========================
+  F0h ... FFh     GE (Group Extension)
+  ==============  ========================
+
On the other hand, when using PDU1 format, the PS-field contains a so-called
Destination Address, which is _not_ part of the PGN. When communicating a PGN
-from user space to kernel (or visa versa) and PDU2 format is used, the PS-field
+from user space to kernel (or vice versa) and PDU1 format is used, the PS-field
of the PGN shall be set to zero. The Destination Address shall be set
elsewhere.
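The bit layout described in this hunk can be sketched as a small decoder. This is an illustrative sketch only: the function name `split_j1939_can_id` and the returned dictionary keys are assumptions and not part of the kernel API. It follows the text above: bits 28..26 carry the priority, bits 25..8 the PGN, bits 7..0 the source address, and for PDU1 (PF < 240) the PS-field is a destination address that is masked out of the PGN.

```python
def split_j1939_can_id(can_id):
    """Split a 29-bit J1939 CAN identifier into its fields."""
    if not 0 <= can_id < (1 << 29):
        raise ValueError("J1939 uses 29-bit CAN identifiers")

    priority = (can_id >> 26) & 0x7   # bits 28..26
    pgn = (can_id >> 8) & 0x3FFFF     # bits 25..8 (R, DP, PF, PS)
    sa = can_id & 0xFF                # bits 7..0
    pf = (pgn >> 8) & 0xFF            # PDU Format, bits 23..16 of the CAN-ID
    ps = pgn & 0xFF                   # PDU Specific, bits 15..8 of the CAN-ID

    if pf < 240:
        # PDU1: PS is the Destination Address and is not part of the PGN,
        # so the PS-field of the PGN is set to zero.
        da = ps
        pgn &= 0x3FF00
    else:
        # PDU2: PS is the Group Extension and stays part of the PGN.
        da = None

    return {"priority": priority, "pgn": pgn, "sa": sa, "da": da}


if __name__ == "__main__":
    for can_id in (0x18EAFF00, 0x18FEF100):
        print(hex(can_id), split_j1939_can_id(can_id))
```

For example, the identifier 0x18EAFF00 has PF = 0xEA, so it is PDU1: the PGN is 0x0EA00 and 0xFF is the destination address. The identifier 0x18FEF100 has PF = 0xFE, so it is PDU2 and the PGN is 0x0FEF1 with no destination address.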
@@ -96,15 +134,15 @@ Addressing
Both static and dynamic addressing methods can be used.
-For static addresses, no extra checks are made by the kernel, and provided
+For static addresses, no extra checks are made by the kernel and provided
addresses are considered right. This responsibility is for the OEM or system
integrator.
For dynamic addressing, so-called Address Claiming, extra support is foreseen
-in the kernel. In J1939 any ECU is known by it's 64-bit NAME. At the moment of
+in the kernel. In J1939 any ECU is known by its 64-bit NAME. At the moment of
a successful address claim, the kernel keeps track of both NAME and source
address being claimed. This serves as a base for filter schemes. By default,
-packets with a destination that is not locally, will be rejected.
+packets with a destination that is not local will be rejected.
Mixed mode packets (from a static to a dynamic address or vice versa) are
allowed. The BSD sockets define separate API calls for getting/setting the
@@ -131,31 +169,31 @@ API Calls
---------
On CAN, you first need to open a socket for communicating over a CAN network.
-To use J1939, #include <linux/can/j1939.h>. From there, <linux/can.h> will be
+To use J1939, ``#include <linux/can/j1939.h>``. From there, ``<linux/can.h>`` will be
included too. To open a socket, use:
.. code-block:: C
s = socket(PF_CAN, SOCK_DGRAM, CAN_J1939);
-J1939 does use SOCK_DGRAM sockets. In the J1939 specification, connections are
+J1939 does use ``SOCK_DGRAM`` sockets. In the J1939 specification, connections are
mentioned in the context of transport protocol sessions. These still deliver
-packets to the other end (using several CAN packets). SOCK_STREAM is not
+packets to the other end (using several CAN packets). ``SOCK_STREAM`` is not
supported.
-After the successful creation of the socket, you would normally use the bind(2)
-and/or connect(2) system call to bind the socket to a CAN interface. After
-binding and/or connecting the socket, you can read(2) and write(2) from/to the
-socket or use send(2), sendto(2), sendmsg(2) and the recv*() counterpart
+After the successful creation of the socket, you would normally use the ``bind(2)``
+and/or ``connect(2)`` system call to bind the socket to a CAN interface. After
+binding and/or connecting the socket, you can ``read(2)`` and ``write(2)`` from/to the
+socket or use ``send(2)``, ``sendto(2)``, ``sendmsg(2)`` and the ``recv*()`` counterpart
operations on the socket as usual. There are also J1939 specific socket options
described below.
-In order to send data, a bind(2) must have been successful. bind(2) assigns a
+In order to send data, a ``bind(2)`` must have been successful. ``bind(2)`` assigns a
local address to a socket.
-Different from CAN is that the payload data is just the data that get send,
-without it's header info. The header info is derived from the sockaddr supplied
-to bind(2), connect(2), sendto(2) and recvfrom(2). A write(2) with size 4 will
+Different from CAN is that the payload data is just the data that gets sent,
+without its header info. The header info is derived from the sockaddr supplied
+to ``bind(2)``, ``connect(2)``, ``sendto(2)`` and ``recvfrom(2)``. A ``write(2)`` with size 4 will
result in a packet with 4 bytes.
The sockaddr structure has extensions for use with J1939 as specified below:
@@ -180,47 +218,47 @@ The sockaddr structure has extensions for use with J1939 as specified below:
} can_addr;
}
-can_family & can_ifindex serve the same purpose as for other SocketCAN sockets.
+``can_family`` & ``can_ifindex`` serve the same purpose as for other SocketCAN sockets.
-can_addr.j1939.pgn specifies the PGN (max 0x3ffff). Individual bits are
+``can_addr.j1939.pgn`` specifies the PGN (max 0x3ffff). Individual bits are
specified above.
-can_addr.j1939.name contains the 64-bit J1939 NAME.
+``can_addr.j1939.name`` contains the 64-bit J1939 NAME.
-can_addr.j1939.addr contains the address.
+``can_addr.j1939.addr`` contains the address.
-The bind(2) system call assigns the local address, i.e. the source address when
-sending packages. If a PGN during bind(2) is set, it's used as a RX filter.
-I.e. only packets with a matching PGN are received. If an ADDR or NAME is set
+The ``bind(2)`` system call assigns the local address, i.e. the source address when
+sending packages. If a PGN during ``bind(2)`` is set, it's used as a RX filter.
+I.e. only packets with a matching PGN are received. If an ADDR or NAME is set
it is used as a receive filter, too. It will match the destination NAME or ADDR
of the incoming packet. The NAME filter will work only if appropriate Address
Claiming for this name was done on the CAN bus and registered/cached by the
kernel.
-On the other hand connect(2) assigns the remote address, i.e. the destination
-address. The PGN from connect(2) is used as the default PGN when sending
+On the other hand ``connect(2)`` assigns the remote address, i.e. the destination
+address. The PGN from ``connect(2)`` is used as the default PGN when sending
packets. If ADDR or NAME is set it will be used as the default destination ADDR
-or NAME. Further a set ADDR or NAME during connect(2) is used as a receive
+or NAME. Further a set ADDR or NAME during ``connect(2)`` is used as a receive
filter. It will match the source NAME or ADDR of the incoming packet.
-Both write(2) and send(2) will send a packet with local address from bind(2) and
-the remote address from connect(2). Use sendto(2) to overwrite the destination
+Both ``write(2)`` and ``send(2)`` will send a packet with local address from ``bind(2)`` and the
+remote address from ``connect(2)``. Use ``sendto(2)`` to overwrite the destination
address.
-If can_addr.j1939.name is set (!= 0) the NAME is looked up by the kernel and
-the corresponding ADDR is used. If can_addr.j1939.name is not set (== 0),
-can_addr.j1939.addr is used.
+If ``can_addr.j1939.name`` is set (!= 0) the NAME is looked up by the kernel and
+the corresponding ADDR is used. If ``can_addr.j1939.name`` is not set (== 0),
+``can_addr.j1939.addr`` is used.
When creating a socket, reasonable defaults are set. Some options can be
-modified with setsockopt(2) & getsockopt(2).
+modified with ``setsockopt(2)`` & ``getsockopt(2)``.
RX path related options:
-- SO_J1939_FILTER - configure array of filters
-- SO_J1939_PROMISC - disable filters set by bind(2) and connect(2)
+- ``SO_J1939_FILTER`` - configure array of filters
+- ``SO_J1939_PROMISC`` - disable filters set by ``bind(2)`` and ``connect(2)``
By default no broadcast packets can be sent or received. To enable sending or
-receiving broadcast packets use the socket option SO_BROADCAST:
+receiving broadcast packets use the socket option ``SO_BROADCAST``:
.. code-block:: C
@@ -261,26 +299,26 @@ The following diagram illustrates the RX path:
+---------------------------+
TX path related options:
-SO_J1939_SEND_PRIO - change default send priority for the socket
+``SO_J1939_SEND_PRIO`` - change default send priority for the socket
Message Flags during send() and Related System Calls
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-send(2), sendto(2) and sendmsg(2) take a 'flags' argument. Currently
+``send(2)``, ``sendto(2)`` and ``sendmsg(2)`` take a 'flags' argument. Currently
supported flags are:
-* MSG_DONTWAIT, i.e. non-blocking operation.
+* ``MSG_DONTWAIT``, i.e. non-blocking operation.
recvmsg(2)
^^^^^^^^^^
-In most cases recvmsg(2) is needed if you want to extract more information than
-recvfrom(2) can provide. For example package priority and timestamp. The
+In most cases ``recvmsg(2)`` is needed if you want to extract more information than
+``recvfrom(2)`` can provide, for example packet priority and timestamp. The
Destination Address, name and packet priority (if applicable) are attached to
-the msghdr in the recvmsg(2) call. They can be extracted using cmsg(3) macros,
-with cmsg_level == SOL_J1939 && cmsg_type == SCM_J1939_DEST_ADDR,
-SCM_J1939_DEST_NAME or SCM_J1939_PRIO. The returned data is a uint8_t for
-priority and dst_addr, and uint64_t for dst_name.
+the msghdr in the ``recvmsg(2)`` call. They can be extracted using ``cmsg(3)`` macros,
+with ``cmsg_level == SOL_CAN_J1939 && cmsg_type == SCM_J1939_DEST_ADDR``,
+``SCM_J1939_DEST_NAME`` or ``SCM_J1939_PRIO``. The returned data is a ``uint8_t`` for
+``priority`` and ``dst_addr``, and ``uint64_t`` for ``dst_name``.
.. code-block:: C
@@ -305,12 +343,12 @@ Dynamic Addressing
Distinction has to be made between using the claimed address and doing an
address claim. To use an already claimed address, one has to fill in the
-j1939.name member and provide it to bind(2). If the name had claimed an address
+``j1939.name`` member and provide it to ``bind(2)``. If the name had claimed an address
earlier, all further messages being sent will use that address. And the
-j1939.addr member will be ignored.
+``j1939.addr`` member will be ignored.
An exception to this is PGN 0x0ee00. This is the "Address Claim/Cannot Claim
-Address" message and the kernel will use the j1939.addr member for that PGN if
+Address" message and the kernel will use the ``j1939.addr`` member for that PGN if
necessary.
To claim an address, the following code example can be used:
@@ -371,12 +409,12 @@ NAME can send packets.
If another ECU claims the address, the kernel will mark the NAME-SA expired.
No socket bound to the NAME can send packets (other than address claims). To
-claim another address, some socket bound to NAME, must bind(2) again, but with
-only j1939.addr changed to the new SA, and must then send a valid address claim
+claim another address, some socket bound to NAME must ``bind(2)`` again, but with
+only ``j1939.addr`` changed to the new SA, and must then send a valid address claim
packet. This restarts the state machine in the kernel (and any other
participant on the bus) for this NAME.
-can-utils also include the jacd tool, so it can be used as code example or as
+``can-utils`` also includes the ``j1939acd`` tool, so it can be used as a code example or as
a default Address Claiming daemon.
Send Examples
@@ -403,8 +441,8 @@ Bind:
bind(sock, (struct sockaddr *)&baddr, sizeof(baddr));
-Now, the socket 'sock' is bound to the SA 0x20. Since no connect(2) was called,
-at this point we can use only sendto(2) or sendmsg(2).
+Now, the socket 'sock' is bound to the SA 0x20. Since no ``connect(2)`` was called,
+at this point we can use only ``sendto(2)`` or ``sendmsg(2)``.
Send:
@@ -414,8 +452,8 @@ Send:
.can_family = AF_CAN,
.can_addr.j1939 = {
.name = J1939_NO_NAME,
- .pgn = 0x30,
- .addr = 0x12300,
+ .addr = 0x30,
+ .pgn = 0x12300,
},
};
diff --git a/Documentation/networking/kapi.rst b/Documentation/networking/kapi.rst
index f03ae64be8bc..ea55f462cefa 100644
--- a/Documentation/networking/kapi.rst
+++ b/Documentation/networking/kapi.rst
@@ -83,27 +83,6 @@ SUN RPC subsystem
.. kernel-doc:: net/sunrpc/clnt.c
:export:
-WiMAX
------
-
-.. kernel-doc:: net/wimax/op-msg.c
- :export:
-
-.. kernel-doc:: net/wimax/op-reset.c
- :export:
-
-.. kernel-doc:: net/wimax/op-rfkill.c
- :export:
-
-.. kernel-doc:: net/wimax/stack.c
- :export:
-
-.. kernel-doc:: include/net/wimax.h
- :internal:
-
-.. kernel-doc:: include/uapi/linux/wimax.h
- :internal:
-
Network device support
======================
@@ -134,6 +113,15 @@ PHY Support
.. kernel-doc:: drivers/net/phy/phy.c
:internal:
+.. kernel-doc:: drivers/net/phy/phy-core.c
+ :export:
+
+.. kernel-doc:: drivers/net/phy/phy-c45.c
+ :export:
+
+.. kernel-doc:: include/linux/phy.h
+ :internal:
+
.. kernel-doc:: drivers/net/phy/phy_device.c
:export:
diff --git a/Documentation/networking/kcm.txt b/Documentation/networking/kcm.rst
index b773a5278ac4..db0f5560ac1c 100644
--- a/Documentation/networking/kcm.txt
+++ b/Documentation/networking/kcm.rst
@@ -1,35 +1,38 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================
Kernel Connection Multiplexor
------------------------------
+=============================
Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based
interface over TCP for generic application protocols. With KCM an application
can efficiently send and receive application protocol messages over TCP using
datagram sockets.
-KCM implements an NxM multiplexor in the kernel as diagrammed below:
-
-+------------+ +------------+ +------------+ +------------+
-| KCM socket | | KCM socket | | KCM socket | | KCM socket |
-+------------+ +------------+ +------------+ +------------+
- | | | |
- +-----------+ | | +----------+
- | | | |
- +----------------------------------+
- | Multiplexor |
- +----------------------------------+
- | | | | |
- +---------+ | | | ------------+
- | | | | |
-+----------+ +----------+ +----------+ +----------+ +----------+
-| Psock | | Psock | | Psock | | Psock | | Psock |
-+----------+ +----------+ +----------+ +----------+ +----------+
- | | | | |
-+----------+ +----------+ +----------+ +----------+ +----------+
-| TCP sock | | TCP sock | | TCP sock | | TCP sock | | TCP sock |
-+----------+ +----------+ +----------+ +----------+ +----------+
+KCM implements an NxM multiplexor in the kernel as diagrammed below::
+
+ +------------+ +------------+ +------------+ +------------+
+ | KCM socket | | KCM socket | | KCM socket | | KCM socket |
+ +------------+ +------------+ +------------+ +------------+
+ | | | |
+ +-----------+ | | +----------+
+ | | | |
+ +----------------------------------+
+ | Multiplexor |
+ +----------------------------------+
+ | | | | |
+ +---------+ | | | ------------+
+ | | | | |
+ +----------+ +----------+ +----------+ +----------+ +----------+
+ | Psock | | Psock | | Psock | | Psock | | Psock |
+ +----------+ +----------+ +----------+ +----------+ +----------+
+ | | | | |
+ +----------+ +----------+ +----------+ +----------+ +----------+
+ | TCP sock | | TCP sock | | TCP sock | | TCP sock | | TCP sock |
+ +----------+ +----------+ +----------+ +----------+ +----------+
KCM sockets
------------
+===========
The KCM sockets provide the user interface to the multiplexor. All the KCM sockets
bound to a multiplexor are considered to have equivalent function, and I/O
@@ -37,7 +40,7 @@ operations in different sockets may be done in parallel without the need for
synchronization between threads in userspace.
Multiplexor
------------
+===========
The multiplexor provides the message steering. In the transmit path, messages
written on a KCM socket are sent atomically on an appropriate TCP socket.
@@ -45,14 +48,14 @@ Similarly, in the receive path, messages are constructed on each TCP socket
(Psock) and complete messages are steered to a KCM socket.
TCP sockets & Psocks
---------------------
+====================
TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated
for each bound TCP socket, this structure holds the state for constructing
messages on receive as well as other connection specific information for KCM.
Connected mode semantics
-------------------------
+========================
Each multiplexor assumes that all attached TCP connections are to the same
destination and can use the different connections for load balancing when
@@ -60,7 +63,7 @@ transmitting. The normal send and recv calls (include sendmmsg and recvmmsg)
can be used to send and receive messages from the KCM socket.
Socket types
-------------
+============
KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types.
@@ -110,23 +113,23 @@ User interface
Creating a multiplexor
----------------------
-A new multiplexor and initial KCM socket is created by a socket call:
+A new multiplexor and initial KCM socket are created by a socket call::
socket(AF_KCM, type, protocol)
- - type is either SOCK_DGRAM or SOCK_SEQPACKET
- - protocol is KCMPROTO_CONNECTED
+- type is either SOCK_DGRAM or SOCK_SEQPACKET
+- protocol is KCMPROTO_CONNECTED
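The socket call above can be wrapped with basic error handling. This is a
minimal sketch, assuming ``<linux/kcm.h>`` is installed; ``kcm_open`` is a
hypothetical helper name, and the fallback ``AF_KCM`` value is only used when
the C library does not define it:

```c
#include <errno.h>
#include <sys/socket.h>
#include <linux/kcm.h>

#ifndef AF_KCM
#define AF_KCM 41	/* kernel's value; some older libcs do not define it */
#endif

/* Hypothetical helper: create a multiplexor and its first KCM socket.
 * Returns the new fd, or -errno on failure (for example when the
 * running kernel has no KCM support).
 */
static int kcm_open(int type)
{
	int fd;

	/* KCM supports only these two socket types */
	if (type != SOCK_DGRAM && type != SOCK_SEQPACKET)
		return -EINVAL;

	fd = socket(AF_KCM, type, KCMPROTO_CONNECTED);
	return fd < 0 ? -errno : fd;
}
```

The type check is done in userspace here only to give a clear error early;
the kernel performs its own validation as well.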
Cloning KCM sockets
-------------------
After the first KCM socket is created using the socket call as described
above, additional sockets for the multiplexor can be created by cloning
-a KCM socket. This is accomplished by an ioctl on a KCM socket:
+a KCM socket. This is accomplished by an ioctl on a KCM socket::
/* From linux/kcm.h */
struct kcm_clone {
- int fd;
+ int fd;
};
struct kcm_clone info;
@@ -142,11 +145,11 @@ Attach transport sockets
------------------------
Attaching of transport sockets to a multiplexor is performed by calling an
-ioctl on a KCM socket for the multiplexor. e.g.:
+ioctl on a KCM socket for the multiplexor. e.g.::
/* From linux/kcm.h */
struct kcm_attach {
- int fd;
+ int fd;
int bpf_fd;
};
@@ -160,18 +163,19 @@ ioctl on a KCM socket for the multiplexor. e.g.:
ioctl(kcmfd, SIOCKCMATTACH, &info);
The kcm_attach structure contains:
- fd: file descriptor for TCP socket being attached
- bpf_prog_fd: file descriptor for compiled BPF program downloaded
+
+ - fd: file descriptor for TCP socket being attached
+ - bpf_fd: file descriptor for compiled BPF program downloaded
Unattach transport sockets
--------------------------
Unattaching a transport socket from a multiplexor is straightforward. An
-"unattach" ioctl is done with the kcm_unattach structure as the argument:
+"unattach" ioctl is done with the kcm_unattach structure as the argument::
/* From linux/kcm.h */
struct kcm_unattach {
- int fd;
+ int fd;
};
struct kcm_unattach info;
@@ -190,7 +194,7 @@ When receive is disabled, any pending messages in the socket's
receive buffer are moved to other sockets. This feature is useful
if an application thread knows that it will be doing a lot of
work on a request and won't be able to service new messages for a
-while. Example use:
+while. Example use::
int val = 1;
@@ -200,7 +204,7 @@ BFP programs for message delineation
------------------------------------
BPF programs can be compiled using the BPF LLVM backend. For example,
-the BPF program for parsing Thrift is:
+the BPF program for parsing Thrift is::
#include "bpf.h" /* for __sk_buff */
#include "bpf_helpers.h" /* for load_word intrinsic */
@@ -250,6 +254,7 @@ based on groups, or batches of messages, can be beneficial for performance.
On transmit, there are three ways an application can batch (pipeline)
messages on a KCM socket.
+
1) Send multiple messages in a single sendmmsg.
2) Send a group of messages each with a sendmsg call, where all messages
except the last have MSG_BATCH in the flags of sendmsg call.
diff --git a/Documentation/networking/l2tp.rst b/Documentation/networking/l2tp.rst
new file mode 100644
index 000000000000..7f383e99dbad
--- /dev/null
+++ b/Documentation/networking/l2tp.rst
@@ -0,0 +1,677 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====
+L2TP
+====
+
+Layer 2 Tunneling Protocol (L2TP) allows L2 frames to be tunneled over
+an IP network.
+
+This document covers the kernel's L2TP subsystem. It documents kernel
+APIs for application developers who want to use the L2TP subsystem and
+it provides some technical details about the internal implementation
+which may be useful to kernel developers and maintainers.
+
+Overview
+========
+
+The kernel's L2TP subsystem implements the datapath for L2TPv2 and
+L2TPv3. L2TPv2 is carried over UDP. L2TPv3 is carried over UDP or
+directly over IP (protocol 115).
+
+The L2TP RFCs define two basic kinds of L2TP packets: control packets
+(the "control plane"), and data packets (the "data plane"). The kernel
+deals only with data packets. The more complex control packets are
+handled by user space.
+
+An L2TP tunnel carries one or more L2TP sessions. Each tunnel is
+associated with a socket. Each session is associated with a virtual
+netdevice, e.g. ``pppN``, ``l2tpethN``, through which data frames pass
+to/from L2TP. Fields in the L2TP header identify the tunnel or session
+and whether it is a control or data packet. When tunnels and sessions
+are set up using the Linux kernel API, we're just setting up the L2TP
+data path. All aspects of the control protocol are to be handled by
+user space.
+
+This split in responsibilities leads to a natural sequence of
+operations when establishing tunnels and sessions. The procedure looks
+like this:
+
+ 1) Create a tunnel socket. Exchange L2TP control protocol messages
+ with the peer over that socket in order to establish a tunnel.
+
+ 2) Create a tunnel context in the kernel, using information
+ obtained from the peer using the control protocol messages.
+
+ 3) Exchange L2TP control protocol messages with the peer over the
+ tunnel socket in order to establish a session.
+
+ 4) Create a session context in the kernel using information
+ obtained from the peer using the control protocol messages.
+
+L2TP APIs
+=========
+
+This section documents each userspace API of the L2TP subsystem.
+
+Tunnel Sockets
+--------------
+
+L2TPv2 always uses UDP. L2TPv3 may use UDP or IP encapsulation.
+
+To create a tunnel socket for use by L2TP, the standard POSIX
+socket API is used.
+
+For example, for a tunnel using IPv4 addresses and UDP encapsulation::
+
+ int sockfd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
+
+Or for a tunnel using IPv6 addresses and IP encapsulation::
+
+ int sockfd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_L2TP);
+
+UDP socket programming doesn't need to be covered here.
+
+IPPROTO_L2TP is an IP protocol type implemented by the kernel's L2TP
+subsystem. The L2TPIP socket address is defined in struct
+sockaddr_l2tpip and struct sockaddr_l2tpip6 at
+`include/uapi/linux/l2tp.h`_. The address includes the L2TP tunnel
+(connection) id. To use L2TP IP encapsulation, an L2TPv3 application
+should bind the L2TPIP socket using the locally assigned
+tunnel id. When the peer's tunnel id and IP address are known, a
+connect must be done.
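The bind-then-connect sequence for an IPv4 L2TPIP socket might look like the
sketch below. The helper names ``l2tpip_fill_addr`` and ``l2tpip_open`` are
hypothetical, and passing the tunnel id in host byte order is an assumption of
this sketch; the ``IPPROTO_L2TP`` fallback is only used when no included
header defines it:

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <linux/l2tp.h>

#ifndef IPPROTO_L2TP
#define IPPROTO_L2TP 115	/* IP protocol number used by L2TPv3 */
#endif

/* Hypothetical helper: fill an IPv4 L2TPIP address.
 * conn_id is the tunnel (connection) id, assumed host byte order.
 */
static void l2tpip_fill_addr(struct sockaddr_l2tpip *sa,
			     struct in_addr ip, uint32_t conn_id)
{
	memset(sa, 0, sizeof(*sa));
	sa->l2tp_family = AF_INET;
	sa->l2tp_addr = ip;
	sa->l2tp_conn_id = conn_id;
}

/* Hypothetical helper: bind with the local tunnel id, then connect
 * with the peer's tunnel id. Returns the socket fd or -errno.
 */
static int l2tpip_open(struct in_addr local_ip, uint32_t local_tid,
		       struct in_addr peer_ip, uint32_t peer_tid)
{
	struct sockaddr_l2tpip local, peer;
	int fd, err;

	fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_L2TP);
	if (fd < 0)
		return -errno;

	l2tpip_fill_addr(&local, local_ip, local_tid);
	l2tpip_fill_addr(&peer, peer_ip, peer_tid);

	if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0 ||
	    connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
		err = errno;
		close(fd);
		return -err;
	}
	return fd;
}
```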
+
+If the L2TP application needs to handle L2TPv3 tunnel setup requests
+from peers using L2TPIP, it must open a dedicated L2TPIP
+socket to listen for those requests and bind the socket using tunnel
+id 0 since tunnel setup requests are addressed to tunnel id 0.
+
+An L2TP tunnel and all of its sessions are automatically closed when
+its tunnel socket is closed.
+
+Netlink API
+-----------
+
+L2TP applications use netlink to manage L2TP tunnel and session
+instances in the kernel. The L2TP netlink API is defined in
+`include/uapi/linux/l2tp.h`_.
+
+L2TP uses `Generic Netlink`_ (GENL). Several commands are defined:
+Create, Delete, Modify and Get for tunnel and session
+instances, e.g. ``L2TP_CMD_TUNNEL_CREATE``. The API header lists the
+netlink attribute types that can be used with each command.
+
+Tunnel and session instances are identified by a locally unique
+32-bit id. L2TP tunnel ids are given by ``L2TP_ATTR_CONN_ID`` and
+``L2TP_ATTR_PEER_CONN_ID`` attributes and L2TP session ids are given
+by ``L2TP_ATTR_SESSION_ID`` and ``L2TP_ATTR_PEER_SESSION_ID``
+attributes. If netlink is used to manage L2TPv2 tunnel and session
+instances, the L2TPv2 16-bit tunnel/session id is cast to a 32-bit
+value in these attributes.
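As an illustration of the id widths described above, an application could
check ids before placing them in the 32-bit attributes. This is a sketch only;
both helper names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: check that an id destined for a 32-bit netlink
 * attribute is representable in the given L2TP protocol version.
 */
static bool l2tp_id_fits(unsigned int proto_version, uint32_t id)
{
	if (proto_version == 2)
		return id <= UINT16_MAX;	/* L2TPv2 ids are 16-bit */
	return proto_version == 3;		/* L2TPv3 ids use all 32 bits */
}

/* Hypothetical helper: widen an L2TPv2 id for a u32 attribute. */
static uint32_t l2tpv2_id_to_attr(uint16_t id)
{
	return (uint32_t)id;
}
```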
+
+In the ``L2TP_CMD_TUNNEL_CREATE`` command, ``L2TP_ATTR_FD`` tells the
+kernel the tunnel socket fd being used. If not specified, the kernel
+creates a kernel socket for the tunnel, using IP parameters set in
+``L2TP_ATTR_IP[6]_SADDR``, ``L2TP_ATTR_IP[6]_DADDR``,
+``L2TP_ATTR_UDP_SPORT``, ``L2TP_ATTR_UDP_DPORT`` attributes. Kernel
+sockets are used to implement unmanaged L2TPv3 tunnels (iproute2's "ip
+l2tp" commands). If ``L2TP_ATTR_FD`` is given, it must be a socket fd
+that is already bound and connected. There is more information about
+unmanaged tunnels later in this document.
+
+``L2TP_CMD_TUNNEL_CREATE`` attributes:-
+
+================== ======== ===
+Attribute Required Use
+================== ======== ===
+CONN_ID Y Sets the tunnel (connection) id.
+PEER_CONN_ID Y Sets the peer tunnel (connection) id.
+PROTO_VERSION Y Protocol version. 2 or 3.
+ENCAP_TYPE Y Encapsulation type: UDP or IP.
+FD N Tunnel socket file descriptor.
+UDP_CSUM N Enable IPv4 UDP checksums. Used only if FD is
+ not set.
+UDP_ZERO_CSUM6_TX N Zero IPv6 UDP checksum on transmit. Used only
+ if FD is not set.
+UDP_ZERO_CSUM6_RX N Zero IPv6 UDP checksum on receive. Used only if
+ FD is not set.
+IP_SADDR N IPv4 source address. Used only if FD is not
+ set.
+IP_DADDR N IPv4 destination address. Used only if FD is
+ not set.
+UDP_SPORT N UDP source port. Used only if FD is not set.
+UDP_DPORT N UDP destination port. Used only if FD is not
+ set.
+IP6_SADDR N IPv6 source address. Used only if FD is not
+ set.
+IP6_DADDR N IPv6 destination address. Used only if FD is
+ not set.
+DEBUG N Debug flags.
+================== ======== ===
+
+``L2TP_CMD_TUNNEL_DELETE`` attributes:-
+
+================== ======== ===
+Attribute Required Use
+================== ======== ===
+CONN_ID Y Identifies the tunnel id to be destroyed.
+================== ======== ===
+
+``L2TP_CMD_TUNNEL_MODIFY`` attributes:-
+
+================== ======== ===
+Attribute Required Use
+================== ======== ===
+CONN_ID Y Identifies the tunnel id to be modified.
+DEBUG N Debug flags.
+================== ======== ===
+
+``L2TP_CMD_TUNNEL_GET`` attributes:-
+
+================== ======== ===
+Attribute Required Use
+================== ======== ===
+CONN_ID N Identifies the tunnel id to be queried.
+ Ignored in DUMP requests.
+================== ======== ===
+
+``L2TP_CMD_SESSION_CREATE`` attributes:-
+
+================== ======== ===
+Attribute Required Use
+================== ======== ===
+CONN_ID Y The parent tunnel id.
+SESSION_ID Y Sets the session id.
+PEER_SESSION_ID Y Sets the peer session id.
+PW_TYPE Y Sets the pseudowire type.
+DEBUG N Debug flags.
+RECV_SEQ N Enable rx data sequence numbers.
+SEND_SEQ N Enable tx data sequence numbers.
+LNS_MODE N Enable LNS mode (auto-enable data sequence
+ numbers).
+RECV_TIMEOUT N Timeout to wait when reordering received
+ packets.
+L2SPEC_TYPE N Sets layer2-specific-sublayer type (L2TPv3
+ only).
+COOKIE N Sets optional cookie (L2TPv3 only).
+PEER_COOKIE N Sets optional peer cookie (L2TPv3 only).
+IFNAME N Sets interface name (L2TPv3 only).
+================== ======== ===
+
+For Ethernet session types, this will create an l2tpeth virtual
+interface which can then be configured as required. For PPP session
+types, a PPPoL2TP socket must also be opened and connected, mapping it
+onto the new session. This is covered in "PPPoL2TP Sockets" later.
+
+``L2TP_CMD_SESSION_DELETE`` attributes:-
+
+================== ======== ===
+Attribute Required Use
+================== ======== ===
+CONN_ID Y Identifies the parent tunnel id of the session
+ to be destroyed.
+SESSION_ID Y Identifies the session id to be destroyed.
+IFNAME N Identifies the session by interface name. If
+ set, this overrides any CONN_ID and SESSION_ID
+ attributes. Currently supported for L2TPv3
+ Ethernet sessions only.
+================== ======== ===
+
+``L2TP_CMD_SESSION_MODIFY`` attributes:-
+
+================== ======== ===
+Attribute Required Use
+================== ======== ===
+CONN_ID Y Identifies the parent tunnel id of the session
+ to be modified.
+SESSION_ID Y Identifies the session id to be modified.
+IFNAME N Identifies the session by interface name. If
+ set, this overrides any CONN_ID and SESSION_ID
+ attributes. Currently supported for L2TPv3
+ Ethernet sessions only.
+DEBUG N Debug flags.
+RECV_SEQ N Enable rx data sequence numbers.
+SEND_SEQ N Enable tx data sequence numbers.
+LNS_MODE N Enable LNS mode (auto-enable data sequence
+ numbers).
+RECV_TIMEOUT N Timeout to wait when reordering received
+ packets.
+================== ======== ===
+
+``L2TP_CMD_SESSION_GET`` attributes:-
+
+================== ======== ===
+Attribute Required Use
+================== ======== ===
+CONN_ID N Identifies the tunnel id to be queried.
+ Ignored for DUMP requests.
+SESSION_ID N Identifies the session id to be queried.
+ Ignored for DUMP requests.
+IFNAME N Identifies the session by interface name.
+ If set, this overrides any CONN_ID and
+ SESSION_ID attributes. Ignored for DUMP
+ requests. Currently supported for L2TPv3
+ Ethernet sessions only.
+================== ======== ===
+
+Application developers should refer to `include/uapi/linux/l2tp.h`_ for
+netlink command and attribute definitions.
+
+Sample userspace code using libmnl_:
+
+ - Open L2TP netlink socket::
+
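+ /* Note: the three calls below are libnl-genl functions, not libmnl */
+ struct nl_sock *nl_sock;
+ int genl_id; /* family id, used as nlmsg_type in the requests below */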
+
+ nl_sock = nl_socket_alloc();
+ genl_connect(nl_sock);
+ genl_id = genl_ctrl_resolve(nl_sock, L2TP_GENL_NAME);
+
+ - Create a tunnel::
+
+ struct nlmsghdr *nlh;
+ struct genlmsghdr *gnlh;
+
+ nlh = mnl_nlmsg_put_header(buf);
+ nlh->nlmsg_type = genl_id; /* assigned to genl socket */
+ nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+ nlh->nlmsg_seq = seq;
+
+ gnlh = mnl_nlmsg_put_extra_header(nlh, sizeof(*gnlh));
+ gnlh->cmd = L2TP_CMD_TUNNEL_CREATE;
+ gnlh->version = L2TP_GENL_VERSION;
+ gnlh->reserved = 0;
+
+ mnl_attr_put_u32(nlh, L2TP_ATTR_FD, tunl_sock_fd);
+ mnl_attr_put_u32(nlh, L2TP_ATTR_CONN_ID, tid);
+ mnl_attr_put_u32(nlh, L2TP_ATTR_PEER_CONN_ID, peer_tid);
+ mnl_attr_put_u8(nlh, L2TP_ATTR_PROTO_VERSION, protocol_version);
+ mnl_attr_put_u16(nlh, L2TP_ATTR_ENCAP_TYPE, encap);
+
+ - Create a session::
+
+ struct nlmsghdr *nlh;
+ struct genlmsghdr *gnlh;
+
+ nlh = mnl_nlmsg_put_header(buf);
+ nlh->nlmsg_type = genl_id; /* assigned to genl socket */
+ nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+ nlh->nlmsg_seq = seq;
+
+ gnlh = mnl_nlmsg_put_extra_header(nlh, sizeof(*gnlh));
+ gnlh->cmd = L2TP_CMD_SESSION_CREATE;
+ gnlh->version = L2TP_GENL_VERSION;
+ gnlh->reserved = 0;
+
+ mnl_attr_put_u32(nlh, L2TP_ATTR_CONN_ID, tid);
+ mnl_attr_put_u32(nlh, L2TP_ATTR_PEER_CONN_ID, peer_tid);
+ mnl_attr_put_u32(nlh, L2TP_ATTR_SESSION_ID, sid);
+ mnl_attr_put_u32(nlh, L2TP_ATTR_PEER_SESSION_ID, peer_sid);
+ mnl_attr_put_u16(nlh, L2TP_ATTR_PW_TYPE, pwtype);
+ /* there are other session options which can be set using netlink
+ * attributes during session creation -- see l2tp.h
+ */
+
+ - Delete a session::
+
+ struct nlmsghdr *nlh;
+ struct genlmsghdr *gnlh;
+
+ nlh = mnl_nlmsg_put_header(buf);
+ nlh->nlmsg_type = genl_id; /* assigned to genl socket */
+ nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+ nlh->nlmsg_seq = seq;
+
+ gnlh = mnl_nlmsg_put_extra_header(nlh, sizeof(*gnlh));
+ gnlh->cmd = L2TP_CMD_SESSION_DELETE;
+ gnlh->version = L2TP_GENL_VERSION;
+ gnlh->reserved = 0;
+
+ mnl_attr_put_u32(nlh, L2TP_ATTR_CONN_ID, tid);
+ mnl_attr_put_u32(nlh, L2TP_ATTR_SESSION_ID, sid);
+
+ - Delete a tunnel and all of its sessions (if any)::
+
+ struct nlmsghdr *nlh;
+ struct genlmsghdr *gnlh;
+
+ nlh = mnl_nlmsg_put_header(buf);
+ nlh->nlmsg_type = genl_id; /* assigned to genl socket */
+ nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+ nlh->nlmsg_seq = seq;
+
+ gnlh = mnl_nlmsg_put_extra_header(nlh, sizeof(*gnlh));
+ gnlh->cmd = L2TP_CMD_TUNNEL_DELETE;
+ gnlh->version = L2TP_GENL_VERSION;
+ gnlh->reserved = 0;
+
+ mnl_attr_put_u32(nlh, L2TP_ATTR_CONN_ID, tid);
+
+PPPoL2TP Session Socket API
+---------------------------
+
+For PPP session types, a PPPoL2TP socket must be opened and connected
+to the L2TP session.
+
+When creating PPPoL2TP sockets, the application provides information
+to the kernel about the tunnel and session in a socket connect()
+call. Source and destination tunnel and session ids are provided, as
+well as the file descriptor of a UDP or L2TPIP socket. See struct
+pppol2tp_addr in `include/linux/if_pppol2tp.h`_. For historical reasons,
+there are unfortunately slightly different address structures for
+L2TPv2/L2TPv3 IPv4/IPv6 tunnels and userspace must use the appropriate
+structure that matches the tunnel socket type.
+
+Userspace may control behavior of the tunnel or session using
+setsockopt and ioctl on the PPPoX socket. The following socket
+options are supported:-
+
+========= ===========================================================
+DEBUG bitmask of debug message categories. See below.
+SENDSEQ - 0 => don't send packets with sequence numbers
+ - 1 => send packets with sequence numbers
+RECVSEQ - 0 => receive packet sequence numbers are optional
+ - 1 => drop receive packets without sequence numbers
+LNSMODE - 0 => act as LAC.
+ - 1 => act as LNS.
+REORDERTO reorder timeout (in millisecs). If 0, don't try to reorder.
+========= ===========================================================
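Setting these options from userspace could look like the sketch below. It
assumes ``<linux/if_pppol2tp.h>`` provides the ``PPPOL2TP_SO_*`` option names;
the helper names are hypothetical, and the ``SOL_PPPOL2TP`` fallback is only
used when the C library does not define it:

```c
#include <errno.h>
#include <sys/socket.h>
#include <linux/if_pppol2tp.h>

#ifndef SOL_PPPOL2TP
#define SOL_PPPOL2TP 273	/* kernel's value; some libcs do not define it */
#endif

/* Hypothetical helper: set one integer PPPoL2TP option on a session's
 * PPPoX socket. Returns 0 on success or -errno.
 */
static int pppol2tp_set_opt(int session_fd, int optname, int value)
{
	if (setsockopt(session_fd, SOL_PPPOL2TP, optname,
		       &value, sizeof(value)) < 0)
		return -errno;
	return 0;
}

/* Hypothetical helper: require sequence numbers in both directions and
 * set the reorder timeout (milliseconds). Stops at the first error.
 */
static int pppol2tp_enable_seq(int session_fd, int reorder_ms)
{
	int err;

	err = pppol2tp_set_opt(session_fd, PPPOL2TP_SO_SENDSEQ, 1);
	if (!err)
		err = pppol2tp_set_opt(session_fd, PPPOL2TP_SO_RECVSEQ, 1);
	if (!err)
		err = pppol2tp_set_opt(session_fd, PPPOL2TP_SO_REORDERTO,
				       reorder_ms);
	return err;
}
```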
+
+In addition to the standard PPP ioctls, a PPPIOCGL2TPSTATS is provided
+to retrieve tunnel and session statistics from the kernel using the
+PPPoX socket of the appropriate tunnel or session.
+
+Sample userspace code:
+
+ - Create session PPPoX data socket::
+
+ struct sockaddr_pppol2tp sax;
+ int fd;
+
+ /* Note, the tunnel socket must be bound already, else it
+ * will not be ready
+ */
+ sax.sa_family = AF_PPPOX;
+ sax.sa_protocol = PX_PROTO_OL2TP;
+ sax.pppol2tp.fd = tunnel_fd;
+ sax.pppol2tp.addr.sin_addr.s_addr = addr->sin_addr.s_addr;
+ sax.pppol2tp.addr.sin_port = addr->sin_port;
+ sax.pppol2tp.addr.sin_family = AF_INET;
+ sax.pppol2tp.s_tunnel = tunnel_id;
+ sax.pppol2tp.s_session = session_id;
+ sax.pppol2tp.d_tunnel = peer_tunnel_id;
+ sax.pppol2tp.d_session = peer_session_id;
+
+ /* session_fd is the fd of the session's PPPoL2TP socket.
+ * tunnel_fd is the fd of the tunnel UDP / L2TPIP socket.
+ * connect() returns 0 on success and -1 on error, not a new fd.
+ */
+ fd = connect(session_fd, (struct sockaddr *)&sax, sizeof(sax));
+ if (fd < 0) {
+ return -errno;
+ }
+ return 0;
+
+Old L2TPv2-only API
+-------------------
+
+When L2TP was first added to the Linux kernel in 2.6.23, it
+implemented only L2TPv2 and did not include a netlink API. Instead,
+tunnel and session instances in the kernel were managed directly using
+only PPPoL2TP sockets. The PPPoL2TP socket is used as described in
+section "PPPoL2TP Session Socket API" but tunnel and session instances
+are automatically created on a connect() of the socket instead of
+being created by a separate netlink request:
+
+ - Tunnels are managed using a tunnel management socket which is a
+ dedicated PPPoL2TP socket, connected to (invalid) session
+ id 0. The L2TP tunnel instance is created when the PPPoL2TP
+ tunnel management socket is connected and is destroyed when the
+ socket is closed.
+
+ - Session instances are created in the kernel when a PPPoL2TP
+ socket is connected to a non-zero session id. Session parameters
+ are set using setsockopt. The L2TP session instance is destroyed
+ when the socket is closed.
+
+This API is still supported but its use is discouraged. Instead, new
+L2TPv2 applications should use netlink to first create the tunnel and
+session, then create a PPPoL2TP socket for the session.
+
+Unmanaged L2TPv3 tunnels
+------------------------
+
+The kernel L2TP subsystem also supports static (unmanaged) L2TPv3
+tunnels. Unmanaged tunnels have no userspace tunnel socket, and
+exchange no control messages with the peer to set up the tunnel; the
+tunnel is configured manually at each end of the tunnel. All
+configuration is done using netlink. There is no need for an L2TP
+userspace application in this case -- the tunnel socket is created by
+the kernel and configured using parameters sent in the
+``L2TP_CMD_TUNNEL_CREATE`` netlink request. The ``ip`` utility of
+``iproute2`` has commands for managing static L2TPv3 tunnels; do ``ip
+l2tp help`` for more information.
+
+Debugging
+---------
+
+The L2TP subsystem offers a range of debugging interfaces through the
+debugfs filesystem.
+
+To access these interfaces, the debugfs filesystem must first be mounted::
+
+ # mount -t debugfs debugfs /debug
+
+Files under the l2tp directory can then be accessed, providing a summary
+of the current population of tunnel and session contexts existing in the
+kernel::
+
+ # cat /debug/l2tp/tunnels
+
+The debugfs files should not be used by applications to obtain L2TP
+state information because the file format is subject to change. It is
+implemented to provide extra debug information to help diagnose
+problems. Applications should instead use the netlink API.
+
+In addition the L2TP subsystem implements tracepoints using the standard
+kernel event tracing API. The available L2TP events can be reviewed as
+follows::
+
+ # find /debug/tracing/events/l2tp
+
+Finally, /proc/net/pppol2tp is also provided for backwards compatibility
+with the original pppol2tp code. It lists information about L2TPv2
+tunnels and sessions only. Its use is discouraged.
+
+Internal Implementation
+=======================
+
+This section is for kernel developers and maintainers.
+
+Sockets
+-------
+
+UDP sockets are implemented by the networking core. When an L2TP
+tunnel is created using a UDP socket, the socket is set up as an
+encapsulated UDP socket by setting encap_rcv and encap_destroy
+callbacks on the UDP socket. l2tp_udp_encap_recv is called when
+packets are received on the socket. l2tp_udp_encap_destroy is called
+when userspace closes the socket.
+
+L2TPIP sockets are implemented in `net/l2tp/l2tp_ip.c`_ and
+`net/l2tp/l2tp_ip6.c`_.
+
+Tunnels
+-------
+
+The kernel keeps a struct l2tp_tunnel context per L2TP tunnel. The
+l2tp_tunnel is always associated with a UDP or L2TP/IP socket and
+keeps a list of sessions in the tunnel. When a tunnel is first
+registered with L2TP core, the reference count on the socket is
+increased. This ensures that the socket cannot be removed while L2TP's
+data structures reference it.
+
+Tunnels are identified by a unique tunnel id. The id is 16-bit for
+L2TPv2 and 32-bit for L2TPv3. Internally, the id is stored as a 32-bit
+value.
+
+Tunnels are kept in a per-net list, indexed by tunnel id. The tunnel
+id namespace is shared by L2TPv2 and L2TPv3. The tunnel context can be
+derived from the socket's sk_user_data.
+
+Handling tunnel socket close is perhaps the trickiest part of the
+L2TP implementation. If userspace closes a tunnel socket, the L2TP
+tunnel and all of its sessions must be closed and destroyed. Since the
+tunnel context holds a ref on the tunnel socket, the socket's
+sk_destruct won't be called until the tunnel sock_put's its
+socket. For UDP sockets, when userspace closes the tunnel socket, the
+socket's encap_destroy handler is invoked, which L2TP uses to initiate
+its tunnel close actions. For L2TPIP sockets, the socket's close
+handler initiates the same tunnel close actions. All sessions are
+first closed. Each session drops its tunnel ref. When the tunnel ref
+reaches zero, the tunnel puts its socket ref. When the socket is
+eventually destroyed, its sk_destruct finally frees the L2TP tunnel
+context.
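The ordering of reference drops described above can be illustrated with a toy
model. This is not kernel code: the ``toy_*`` structures and functions are
hypothetical stand-ins that only track the counts, and the model simplifies
the close path to a single function:

```c
#include <stdbool.h>

struct toy_tunnel;

struct toy_sock {
	int refs;			/* socket reference count */
	struct toy_tunnel *tunnel;	/* stands in for sk_user_data */
};

struct toy_tunnel {
	int refs;			/* 1 for the tunnel + 1 per session */
	struct toy_sock *sk;
	bool freed;			/* set when the context is freed */
};

static void toy_sock_put(struct toy_sock *sk)
{
	/* the destructor runs only when the last socket ref is dropped */
	if (--sk->refs == 0 && sk->tunnel)
		sk->tunnel->freed = true;
}

static void toy_tunnel_put(struct toy_tunnel *t)
{
	/* the tunnel puts its socket ref when its own count hits zero */
	if (--t->refs == 0)
		toy_sock_put(t->sk);
}

static void toy_tunnel_register(struct toy_tunnel *t, struct toy_sock *sk)
{
	t->refs = 1;
	t->sk = sk;
	t->freed = false;
	sk->tunnel = t;
	sk->refs++;		/* the tunnel holds a ref on its socket */
}

static void toy_session_add(struct toy_tunnel *t)
{
	t->refs++;		/* each session holds a ref on its tunnel */
}

/* Userspace closes the tunnel socket */
static void toy_userspace_close(struct toy_sock *sk, int nsessions)
{
	struct toy_tunnel *t = sk->tunnel;

	toy_sock_put(sk);		/* socket survives: tunnel holds a ref */
	while (nsessions--)
		toy_tunnel_put(t);	/* each closed session drops its ref */
	toy_tunnel_put(t);		/* finally drop the tunnel's own ref */
}
```

In this model the tunnel context is marked freed only after every session
and the tunnel itself have dropped their references, matching the order
given in the text.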
+
+Sessions
+--------
+
+The kernel keeps a struct l2tp_session context for each session. Each
+session has private data which is used for data specific to the
+session type. With L2TPv2, the session always carries PPP
+traffic. With L2TPv3, the session can carry Ethernet frames (Ethernet
+pseudowire) or other data types such as PPP, ATM, HDLC or Frame
+Relay. Linux currently implements only Ethernet and PPP session types.
+
+Some L2TP session types also have a socket (PPP pseudowires) while
+others do not (Ethernet pseudowires). The socket reference count
+therefore cannot serve as the reference count for session contexts,
+so the L2TP implementation keeps its own internal reference counts
+on the session contexts.
+
+Like tunnels, L2TP sessions are identified by a unique
+session id. Just as with tunnel ids, the session id is 16-bit for
+L2TPv2 and 32-bit for L2TPv3. Internally, the id is stored as a 32-bit
+value.
+
+Sessions hold a ref on their parent tunnel to ensure that the tunnel
+stays extant while one or more sessions reference it.
+
+Sessions are kept in a per-tunnel list, indexed by session id. L2TPv3
+sessions are also kept in a per-net list indexed by session id,
+because L2TPv3 session ids are unique across all tunnels and L2TPv3
+data packets do not contain a tunnel id in the header. This list is
+therefore needed to find the session context associated with a
+received data packet when the tunnel context cannot be derived from
+the tunnel socket.
+
+Although the L2TPv3 RFC specifies that L2TPv3 session ids are not
+scoped by the tunnel, the kernel does not police this for L2TPv3 UDP
+tunnels and does not add sessions of L2TPv3 UDP tunnels into the
+per-net session list. In the UDP receive code, we must trust that the
+tunnel can be identified using the tunnel socket's sk_user_data and
+look up the session in the tunnel's session list instead of the per-net
+session list.
+
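The two lookup paths described above can be sketched as follows. This is a toy model under stated assumptions, not the kernel's receive code: all `demo_` names are hypothetical, and simple arrays stand in for the per-tunnel and per-net session lists.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_SESSIONS 4

struct demo_session {
    uint32_t session_id;
};

/* A flat array standing in for a list indexed by session id. */
struct demo_session_list {
    struct demo_session *entries[DEMO_MAX_SESSIONS];
    size_t count;
};

struct demo_tunnel {
    struct demo_session_list sessions;  /* per-tunnel session list */
};

static struct demo_session *
demo_list_find(const struct demo_session_list *list, uint32_t session_id)
{
    for (size_t i = 0; i < list->count; i++)
        if (list->entries[i]->session_id == session_id)
            return list->entries[i];
    return NULL;
}

/* sk_tunnel models the tunnel derived from the socket's sk_user_data
 * (the UDP case). Pass NULL when no tunnel can be derived, in which
 * case the per-net list is searched instead.
 */
static struct demo_session *
demo_rx_find_session(const struct demo_tunnel *sk_tunnel,
                     const struct demo_session_list *pernet,
                     uint32_t session_id)
{
    if (sk_tunnel)
        return demo_list_find(&sk_tunnel->sessions, session_id);
    return demo_list_find(pernet, session_id);
}
```

Because sessions of UDP tunnels are not added to the per-net list, a session id that exists only in a UDP tunnel is not found through the per-net path in this model.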
+PPP
+---
+
+`net/l2tp/l2tp_ppp.c`_ implements the PPPoL2TP socket family. Each PPP
+session has a PPPoL2TP socket.
+
+The PPPoL2TP socket's sk_user_data references the l2tp_session.
+
+Userspace sends and receives PPP packets over L2TP using a PPPoL2TP
+socket. Only PPP control frames pass over this socket: PPP data
+packets are handled entirely by the kernel, passing between the L2TP
+session and its associated ``pppN`` netdev through the PPP channel
+interface of the kernel PPP subsystem.
+
+The L2TP PPP implementation handles the closing of a PPPoL2TP socket
+by closing its corresponding L2TP session. This is complicated because
+it must consider racing with netlink session create/destroy requests
+and pppol2tp_connect trying to reconnect with a session that is in the
+process of being closed. Unlike tunnels, PPP sessions do not hold a
+ref on their associated socket, so code must be careful to sock_hold
+the socket where necessary. For all the details, see commit
+3d609342cc04129ff7568e19316ce3d7451a27e8.
+
+Ethernet
+--------
+
+`net/l2tp/l2tp_eth.c`_ implements L2TPv3 Ethernet pseudowires. It
+manages a netdev for each session.
+
+L2TP Ethernet sessions are created and destroyed by netlink request,
+or are destroyed when the tunnel is destroyed. Unlike PPP sessions,
+Ethernet sessions do not have an associated socket.
+
+Miscellaneous
+=============
+
+RFCs
+----
+
+The kernel code implements the datapath features specified in the
+following RFCs:
+
+======= =============== ===================================
+RFC2661 L2TPv2 https://tools.ietf.org/html/rfc2661
+RFC3931 L2TPv3 https://tools.ietf.org/html/rfc3931
+RFC4719 L2TPv3 Ethernet https://tools.ietf.org/html/rfc4719
+======= =============== ===================================
+
+Implementations
+---------------
+
+A number of open source applications use the L2TP kernel subsystem:
+
+============ ==============================================
+iproute2 https://github.com/shemminger/iproute2
+go-l2tp https://github.com/katalix/go-l2tp
+tunneldigger https://github.com/wlanslovenija/tunneldigger
+xl2tpd https://github.com/xelerance/xl2tpd
+============ ==============================================
+
+Limitations
+-----------
+
+The current implementation has a number of limitations:
+
+ 1) Multiple UDP sockets with the same 5-tuple address cannot be
+ used. The kernel's tunnel context is identified using private
+ data associated with the socket so it is important that each
+ socket is uniquely identified by its address.
+
+ 2) Interfacing with openvswitch is not yet implemented. It may be
+ useful to map OVS Ethernet and VLAN ports into L2TPv3 tunnels.
+
+ 3) VLAN pseudowires are implemented using an ``l2tpethN`` interface
+ configured with a VLAN sub-interface. Since L2TPv3 VLAN
+ pseudowires carry one and only one VLAN, it may be better to use
+ a single netdevice rather than an ``l2tpethN`` and ``l2tpethN:M``
+ pair per VLAN session. The netlink attribute
+ ``L2TP_ATTR_VLAN_ID`` was added for this, but it was never
+ implemented.
+
+Testing
+-------
+
+Unmanaged L2TPv3 Ethernet features are tested by the kernel's built-in
+selftests. See `tools/testing/selftests/net/l2tp.sh`_.
+
+Another test suite, l2tp-ktest_, covers all
+of the L2TP APIs and tunnel/session types. This may be integrated into
+the kernel's built-in L2TP selftests in the future.
+
+.. Links
+.. _Generic Netlink: generic_netlink.html
+.. _libmnl: https://www.netfilter.org/projects/libmnl
+.. _include/uapi/linux/l2tp.h: ../../../include/uapi/linux/l2tp.h
+.. _include/linux/if_pppol2tp.h: ../../../include/linux/if_pppol2tp.h
+.. _net/l2tp/l2tp_ip.c: ../../../net/l2tp/l2tp_ip.c
+.. _net/l2tp/l2tp_ip6.c: ../../../net/l2tp/l2tp_ip6.c
+.. _net/l2tp/l2tp_ppp.c: ../../../net/l2tp/l2tp_ppp.c
+.. _net/l2tp/l2tp_eth.c: ../../../net/l2tp/l2tp_eth.c
+.. _tools/testing/selftests/net/l2tp.sh: ../../../tools/testing/selftests/net/l2tp.sh
+.. _l2tp-ktest: https://github.com/katalix/l2tp-ktest
diff --git a/Documentation/networking/l2tp.txt b/Documentation/networking/l2tp.txt
deleted file mode 100644
index 9bc271cdc9a8..000000000000
--- a/Documentation/networking/l2tp.txt
+++ /dev/null
@@ -1,345 +0,0 @@
-This document describes how to use the kernel's L2TP drivers to
-provide L2TP functionality. L2TP is a protocol that tunnels one or
-more sessions over an IP tunnel. It is commonly used for VPNs
-(L2TP/IPSec) and by ISPs to tunnel subscriber PPP sessions over an IP
-network infrastructure. With L2TPv3, it is also useful as a Layer-2
-tunneling infrastructure.
-
-Features
-========
-
-L2TPv2 (PPP over L2TP (UDP tunnels)).
-L2TPv3 ethernet pseudowires.
-L2TPv3 PPP pseudowires.
-L2TPv3 IP encapsulation.
-Netlink sockets for L2TPv3 configuration management.
-
-History
-=======
-
-The original pppol2tp driver was introduced in 2.6.23 and provided
-L2TPv2 functionality (rfc2661). L2TPv2 is used to tunnel one or more PPP
-sessions over a UDP tunnel.
-
-L2TPv3 (rfc3931) changes the protocol to allow different frame types
-to be passed over an L2TP tunnel by moving the PPP-specific parts of
-the protocol out of the core L2TP packet headers. Each frame type is
-known as a pseudowire type. Ethernet, PPP, HDLC, Frame Relay and ATM
-pseudowires for L2TP are defined in separate RFC standards. Another
-change for L2TPv3 is that it can be carried directly over IP with no
-UDP header (UDP is optional). It is also possible to create static
-unmanaged L2TPv3 tunnels manually without a control protocol
-(userspace daemon) to manage them.
-
-To support L2TPv3, the original pppol2tp driver was split up to
-separate the L2TP and PPP functionality. Existing L2TPv2 userspace
-apps should be unaffected as the original pppol2tp sockets API is
-retained. L2TPv3, however, uses netlink to manage L2TPv3 tunnels and
-sessions.
-
-Design
-======
-
-The L2TP protocol separates control and data frames. The L2TP kernel
-drivers handle only L2TP data frames; control frames are always
-handled by userspace. L2TP control frames carry messages between L2TP
-clients/servers and are used to setup / teardown tunnels and
-sessions. An L2TP client or server is implemented in userspace.
-
-Each L2TP tunnel is implemented using a UDP or L2TPIP socket; L2TPIP
-provides L2TPv3 IP encapsulation (no UDP) and is implemented using a
-new l2tpip socket family. The tunnel socket is typically created by
-userspace, though for unmanaged L2TPv3 tunnels, the socket can also be
-created by the kernel. Each L2TP session (pseudowire) gets a network
-interface instance. In the case of PPP, these interfaces are created
-indirectly by pppd using a pppol2tp socket. In the case of ethernet,
-the netdevice is created upon a netlink request to create an L2TPv3
-ethernet pseudowire.
-
-For PPP, the PPPoL2TP driver, net/l2tp/l2tp_ppp.c, provides a
-mechanism by which PPP frames carried through an L2TP session are
-passed through the kernel's PPP subsystem. The standard PPP daemon,
-pppd, handles all PPP interaction with the peer. PPP network
-interfaces are created for each local PPP endpoint. The kernel's PPP
-subsystem arranges for PPP control frames to be delivered to pppd,
-while data frames are forwarded as usual.
-
-For ethernet, the L2TPETH driver, net/l2tp/l2tp_eth.c, implements a
-netdevice driver, managing virtual ethernet devices, one per
-pseudowire. These interfaces can be managed using standard Linux tools
-such as "ip" and "ifconfig". If only IP frames are passed over the
-tunnel, the interface can be given an IP addresses of itself and its
-peer. If non-IP frames are to be passed over the tunnel, the interface
-can be added to a bridge using brctl. All L2TP datapath protocol
-functions are handled by the L2TP core driver.
-
-Each tunnel and session within a tunnel is assigned a unique tunnel_id
-and session_id. These ids are carried in the L2TP header of every
-control and data packet. (Actually, in L2TPv3, the tunnel_id isn't
-present in data frames - it is inferred from the IP connection on
-which the packet was received.) The L2TP driver uses the ids to lookup
-internal tunnel and/or session contexts to determine how to handle the
-packet. Zero tunnel / session ids are treated specially - zero ids are
-never assigned to tunnels or sessions in the network. In the driver,
-the tunnel context keeps a reference to the tunnel UDP or L2TPIP
-socket. The session context holds data that lets the driver interface
-to the kernel's network frame type subsystems, i.e. PPP, ethernet.
-
-Userspace Programming
-=====================
-
-For L2TPv2, there are a number of requirements on the userspace L2TP
-daemon in order to use the pppol2tp driver.
-
-1. Use a UDP socket per tunnel.
-
-2. Create a single PPPoL2TP socket per tunnel bound to a special null
- session id. This is used only for communicating with the driver but
- must remain open while the tunnel is active. Opening this tunnel
- management socket causes the driver to mark the tunnel socket as an
- L2TP UDP encapsulation socket and flags it for use by the
- referenced tunnel id. This hooks up the UDP receive path via
- udp_encap_rcv() in net/ipv4/udp.c. PPP data frames are never passed
- in this special PPPoX socket.
-
-3. Create a PPPoL2TP socket per L2TP session. This is typically done
- by starting pppd with the pppol2tp plugin and appropriate
- arguments. A PPPoL2TP tunnel management socket (Step 2) must be
- created before the first PPPoL2TP session socket is created.
-
-When creating PPPoL2TP sockets, the application provides information
-to the driver about the socket in a socket connect() call. Source and
-destination tunnel and session ids are provided, as well as the file
-descriptor of a UDP socket. See struct pppol2tp_addr in
-include/linux/if_pppol2tp.h. Note that zero tunnel / session ids are
-treated specially. When creating the per-tunnel PPPoL2TP management
-socket in Step 2 above, zero source and destination session ids are
-specified, which tells the driver to prepare the supplied UDP file
-descriptor for use as an L2TP tunnel socket.
-
-Userspace may control behavior of the tunnel or session using
-setsockopt and ioctl on the PPPoX socket. The following socket
-options are supported:-
-
-DEBUG - bitmask of debug message categories. See below.
-SENDSEQ - 0 => don't send packets with sequence numbers
- 1 => send packets with sequence numbers
-RECVSEQ - 0 => receive packet sequence numbers are optional
- 1 => drop receive packets without sequence numbers
-LNSMODE - 0 => act as LAC.
- 1 => act as LNS.
-REORDERTO - reorder timeout (in millisecs). If 0, don't try to reorder.
-
-Only the DEBUG option is supported by the special tunnel management
-PPPoX socket.
-
-In addition to the standard PPP ioctls, a PPPIOCGL2TPSTATS is provided
-to retrieve tunnel and session statistics from the kernel using the
-PPPoX socket of the appropriate tunnel or session.
-
-For L2TPv3, userspace must use the netlink API defined in
-include/linux/l2tp.h to manage tunnel and session contexts. The
-general procedure to create a new L2TP tunnel with one session is:-
-
-1. Open a GENL socket using L2TP_GENL_NAME for configuring the kernel
- using netlink.
-
-2. Create a UDP or L2TPIP socket for the tunnel.
-
-3. Create a new L2TP tunnel using a L2TP_CMD_TUNNEL_CREATE
- request. Set attributes according to desired tunnel parameters,
- referencing the UDP or L2TPIP socket created in the previous step.
-
-4. Create a new L2TP session in the tunnel using a
- L2TP_CMD_SESSION_CREATE request.
-
-The tunnel and all of its sessions are closed when the tunnel socket
-is closed. The netlink API may also be used to delete sessions and
-tunnels. Configuration and status info may be set or read using netlink.
-
-The L2TP driver also supports static (unmanaged) L2TPv3 tunnels. These
-are where there is no L2TP control message exchange with the peer to
-setup the tunnel; the tunnel is configured manually at each end of the
-tunnel. There is no need for an L2TP userspace application in this
-case -- the tunnel socket is created by the kernel and configured
-using parameters sent in the L2TP_CMD_TUNNEL_CREATE netlink
-request. The "ip" utility of iproute2 has commands for managing static
-L2TPv3 tunnels; do "ip l2tp help" for more information.
-
-Debugging
-=========
-
-The driver supports a flexible debug scheme where kernel trace
-messages may be optionally enabled per tunnel and per session. Care is
-needed when debugging a live system since the messages are not
-rate-limited and a busy system could be swamped. Userspace uses
-setsockopt on the PPPoX socket to set a debug mask.
-
-The following debug mask bits are available:
-
-L2TP_MSG_DEBUG verbose debug (if compiled in)
-L2TP_MSG_CONTROL userspace - kernel interface
-L2TP_MSG_SEQ sequence numbers handling
-L2TP_MSG_DATA data packets
-
-If enabled, files under a l2tp debugfs directory can be used to dump
-kernel state about L2TP tunnels and sessions. To access it, the
-debugfs filesystem must first be mounted.
-
-# mount -t debugfs debugfs /debug
-
-Files under the l2tp directory can then be accessed.
-
-# cat /debug/l2tp/tunnels
-
-The debugfs files should not be used by applications to obtain L2TP
-state information because the file format is subject to change. It is
-implemented to provide extra debug information to help diagnose
-problems.) Users should use the netlink API.
-
-/proc/net/pppol2tp is also provided for backwards compatibility with
-the original pppol2tp driver. It lists information about L2TPv2
-tunnels and sessions only. Its use is discouraged.
-
-Unmanaged L2TPv3 Tunnels
-========================
-
-Some commercial L2TP products support unmanaged L2TPv3 ethernet
-tunnels, where there is no L2TP control protocol; tunnels are
-configured at each side manually. New commands are available in
-iproute2's ip utility to support this.
-
-To create an L2TPv3 ethernet pseudowire between local host 192.168.1.1
-and peer 192.168.1.2, using IP addresses 10.5.1.1 and 10.5.1.2 for the
-tunnel endpoints:-
-
-# ip l2tp add tunnel tunnel_id 1 peer_tunnel_id 1 udp_sport 5000 \
- udp_dport 5000 encap udp local 192.168.1.1 remote 192.168.1.2
-# ip l2tp add session tunnel_id 1 session_id 1 peer_session_id 1
-# ip -s -d show dev l2tpeth0
-# ip addr add 10.5.1.2/32 peer 10.5.1.1/32 dev l2tpeth0
-# ip li set dev l2tpeth0 up
-
-Choose IP addresses to be the address of a local IP interface and that
-of the remote system. The IP addresses of the l2tpeth0 interface can be
-anything suitable.
-
-Repeat the above at the peer, with ports, tunnel/session ids and IP
-addresses reversed. The tunnel and session IDs can be any non-zero
-32-bit number, but the values must be reversed at the peer.
-
-Host 1 Host2
-udp_sport=5000 udp_sport=5001
-udp_dport=5001 udp_dport=5000
-tunnel_id=42 tunnel_id=45
-peer_tunnel_id=45 peer_tunnel_id=42
-session_id=128 session_id=5196755
-peer_session_id=5196755 peer_session_id=128
-
-When done at both ends of the tunnel, it should be possible to send
-data over the network. e.g.
-
-# ping 10.5.1.1
-
-
-Sample Userspace Code
-=====================
-
-1. Create tunnel management PPPoX socket
-
- kernel_fd = socket(AF_PPPOX, SOCK_DGRAM, PX_PROTO_OL2TP);
- if (kernel_fd >= 0) {
- struct sockaddr_pppol2tp sax;
- struct sockaddr_in const *peer_addr;
-
- peer_addr = l2tp_tunnel_get_peer_addr(tunnel);
- memset(&sax, 0, sizeof(sax));
- sax.sa_family = AF_PPPOX;
- sax.sa_protocol = PX_PROTO_OL2TP;
- sax.pppol2tp.fd = udp_fd; /* fd of tunnel UDP socket */
- sax.pppol2tp.addr.sin_addr.s_addr = peer_addr->sin_addr.s_addr;
- sax.pppol2tp.addr.sin_port = peer_addr->sin_port;
- sax.pppol2tp.addr.sin_family = AF_INET;
- sax.pppol2tp.s_tunnel = tunnel_id;
- sax.pppol2tp.s_session = 0; /* special case: mgmt socket */
- sax.pppol2tp.d_tunnel = 0;
- sax.pppol2tp.d_session = 0; /* special case: mgmt socket */
-
- if(connect(kernel_fd, (struct sockaddr *)&sax, sizeof(sax) ) < 0 ) {
- perror("connect failed");
- result = -errno;
- goto err;
- }
- }
-
-2. Create session PPPoX data socket
-
- struct sockaddr_pppol2tp sax;
- int fd;
-
- /* Note, the target socket must be bound already, else it will not be ready */
- sax.sa_family = AF_PPPOX;
- sax.sa_protocol = PX_PROTO_OL2TP;
- sax.pppol2tp.fd = tunnel_fd;
- sax.pppol2tp.addr.sin_addr.s_addr = addr->sin_addr.s_addr;
- sax.pppol2tp.addr.sin_port = addr->sin_port;
- sax.pppol2tp.addr.sin_family = AF_INET;
- sax.pppol2tp.s_tunnel = tunnel_id;
- sax.pppol2tp.s_session = session_id;
- sax.pppol2tp.d_tunnel = peer_tunnel_id;
- sax.pppol2tp.d_session = peer_session_id;
-
- /* session_fd is the fd of the session's PPPoL2TP socket.
- * tunnel_fd is the fd of the tunnel UDP socket.
- */
- fd = connect(session_fd, (struct sockaddr *)&sax, sizeof(sax));
- if (fd < 0 ) {
- return -errno;
- }
- return 0;
-
-Internal Implementation
-=======================
-
-The driver keeps a struct l2tp_tunnel context per L2TP tunnel and a
-struct l2tp_session context for each session. The l2tp_tunnel is
-always associated with a UDP or L2TP/IP socket and keeps a list of
-sessions in the tunnel. The l2tp_session context keeps kernel state
-about the session. It has private data which is used for data specific
-to the session type. With L2TPv2, the session always carried PPP
-traffic. With L2TPv3, the session can also carry ethernet frames
-(ethernet pseudowire) or other data types such as ATM, HDLC or Frame
-Relay.
-
-When a tunnel is first opened, the reference count on the socket is
-increased using sock_hold(). This ensures that the kernel socket
-cannot be removed while L2TP's data structures reference it.
-
-Some L2TP sessions also have a socket (PPP pseudowires) while others
-do not (ethernet pseudowires). We can't use the socket reference count
-as the reference count for session contexts. The L2TP implementation
-therefore has its own internal reference counts on the session
-contexts.
-
-To Do
-=====
-
-Add L2TP tunnel switching support. This would route tunneled traffic
-from one L2TP tunnel into another. Specified in
-http://tools.ietf.org/html/draft-ietf-l2tpext-tunnel-switching-08
-
-Add L2TPv3 VLAN pseudowire support.
-
-Add L2TPv3 IP pseudowire support.
-
-Add L2TPv3 ATM pseudowire support.
-
-Miscellaneous
-=============
-
-The L2TP drivers were developed as part of the OpenL2TP project by
-Katalix Systems Ltd. OpenL2TP is a full-featured L2TP client / server,
-designed from the ground up to have the L2TP datapath in the
-kernel. The project also implemented the pppol2tp plugin for pppd
-which allows pppd to use the kernel driver. Details can be found at
-http://www.openl2tp.org.
diff --git a/Documentation/networking/lapb-module.txt b/Documentation/networking/lapb-module.rst
index d4fc8f221559..ff586bc9f005 100644
--- a/Documentation/networking/lapb-module.txt
+++ b/Documentation/networking/lapb-module.rst
@@ -1,8 +1,14 @@
- The Linux LAPB Module Interface 1.3
+.. SPDX-License-Identifier: GPL-2.0
- Jonathan Naylor 29.12.96
+===============================
+The Linux LAPB Module Interface
+===============================
-Changed (Henner Eisen, 2000-10-29): int return value for data_indication()
+Version 1.3
+
+Jonathan Naylor 29.12.96
+
+Changed (Henner Eisen, 2000-10-29): int return value for data_indication()
The LAPB module will be a separately compiled module for use by any parts of
the Linux operating system that require a LAPB service. This document
@@ -32,16 +38,16 @@ LAPB Initialisation Structure
This structure is used only once, in the call to lapb_register (see below).
It contains information about the device driver that requires the services
-of the LAPB module.
+of the LAPB module::
-struct lapb_register_struct {
- void (*connect_confirmation)(int token, int reason);
- void (*connect_indication)(int token, int reason);
- void (*disconnect_confirmation)(int token, int reason);
- void (*disconnect_indication)(int token, int reason);
- int (*data_indication)(int token, struct sk_buff *skb);
- void (*data_transmit)(int token, struct sk_buff *skb);
-};
+ struct lapb_register_struct {
+ void (*connect_confirmation)(int token, int reason);
+ void (*connect_indication)(int token, int reason);
+ void (*disconnect_confirmation)(int token, int reason);
+ void (*disconnect_indication)(int token, int reason);
+ int (*data_indication)(int token, struct sk_buff *skb);
+ void (*data_transmit)(int token, struct sk_buff *skb);
+ };
Each member of this structure corresponds to a function in the device driver
that is called when a particular event in the LAPB module occurs. These will
@@ -54,19 +60,19 @@ LAPB Parameter Structure
This structure is used with the lapb_getparms and lapb_setparms functions
(see below). They are used to allow the device driver to get and set the
-operational parameters of the LAPB implementation for a given connection.
-
-struct lapb_parms_struct {
- unsigned int t1;
- unsigned int t1timer;
- unsigned int t2;
- unsigned int t2timer;
- unsigned int n2;
- unsigned int n2count;
- unsigned int window;
- unsigned int state;
- unsigned int mode;
-};
+operational parameters of the LAPB implementation for a given connection::
+
+ struct lapb_parms_struct {
+ unsigned int t1;
+ unsigned int t1timer;
+ unsigned int t2;
+ unsigned int t2timer;
+ unsigned int n2;
+ unsigned int n2count;
+ unsigned int window;
+ unsigned int state;
+ unsigned int mode;
+ };
T1 and T2 are protocol timing parameters and are given in units of 100ms. N2
is the maximum number of tries on the link before it is declared a failure.
@@ -78,11 +84,14 @@ link.
The mode variable is a bit field used for setting (at present) three values.
The bit fields have the following meanings:
+====== =================================================
Bit Meaning
+====== =================================================
0 LAPB operation (0=LAPB_STANDARD 1=LAPB_EXTENDED).
1 [SM]LP operation (0=LAPB_SLP 1=LAPB_MLP).
2 DTE/DCE operation (0=LAPB_DTE 1=LAPB_DCE)
3-31 Reserved, must be 0.
+====== =================================================
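As a rough illustration of the bit layout in the table above, a mode word could be composed and checked as shown here. The constants are defined locally from the table's bit positions and should be treated as assumptions. `DEMO_LAPB_RESERVED` and the `demo_` helpers are hypothetical names, not part of the LAPB module interface.

```c
#include <stdbool.h>

/* Local definitions derived from the bit table; assumed for illustration. */
#define LAPB_STANDARD 0x00u   /* bit 0 clear */
#define LAPB_EXTENDED 0x01u   /* bit 0 set */
#define LAPB_SLP      0x00u   /* bit 1 clear */
#define LAPB_MLP      0x02u   /* bit 1 set */
#define LAPB_DTE      0x00u   /* bit 2 clear */
#define LAPB_DCE      0x04u   /* bit 2 set */

/* Hypothetical mask covering bits 3-31, which must be 0. */
#define DEMO_LAPB_RESERVED (~0x07u)

/* True if no reserved bit is set in the mode word. */
static bool demo_lapb_mode_valid(unsigned int mode)
{
    return (mode & DEMO_LAPB_RESERVED) == 0;
}

/* True if the mode word selects extended LAPB operation. */
static bool demo_lapb_mode_extended(unsigned int mode)
{
    return (mode & LAPB_EXTENDED) != 0;
}
```

For example, `LAPB_EXTENDED | LAPB_SLP | LAPB_DCE` selects extended sequence numbers, single link procedure and DCE operation in one value.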
Extended LAPB operation indicates the use of extended sequence numbers and
consequently larger window sizes, the default is standard LAPB operation.
@@ -99,8 +108,9 @@ Functions
The LAPB module provides a number of function entry points.
+::
-int lapb_register(void *token, struct lapb_register_struct);
+ int lapb_register(void *token, struct lapb_register_struct);
This must be called before the LAPB module may be used. If the call is
successful then LAPB_OK is returned. The token must be a unique identifier
@@ -111,33 +121,42 @@ For multiple LAPB links in a single device driver, multiple calls to
lapb_register must be made. The format of the lapb_register_struct is given
above. The return values are:
+============= =============================
LAPB_OK LAPB registered successfully.
LAPB_BADTOKEN Token is already registered.
LAPB_NOMEM Out of memory
+============= =============================
+::
-int lapb_unregister(void *token);
+ int lapb_unregister(void *token);
This releases all the resources associated with a LAPB link. Any current
LAPB link will be abandoned without further messages being passed. After
this call, the value of token is no longer valid for any calls to the LAPB
function. The valid return values are:
+============= ===============================
LAPB_OK LAPB unregistered successfully.
LAPB_BADTOKEN Invalid/unknown LAPB token.
+============= ===============================
+::
-int lapb_getparms(void *token, struct lapb_parms_struct *parms);
+ int lapb_getparms(void *token, struct lapb_parms_struct *parms);
This allows the device driver to get the values of the current LAPB
variables, the lapb_parms_struct is described above. The valid return values
are:
+============= =============================
LAPB_OK LAPB getparms was successful.
LAPB_BADTOKEN Invalid/unknown LAPB token.
+============= =============================
+::
-int lapb_setparms(void *token, struct lapb_parms_struct *parms);
+ int lapb_setparms(void *token, struct lapb_parms_struct *parms);
This allows the device driver to set the values of the current LAPB
variables, the lapb_parms_struct is described above. The values of t1timer,
@@ -145,42 +164,54 @@ t2timer and n2count are ignored, likewise changing the mode bits when
connected will be ignored. An error implies that none of the values have
been changed. The valid return values are:
+============= =================================================
LAPB_OK LAPB setparms was successful.
LAPB_BADTOKEN Invalid/unknown LAPB token.
LAPB_INVALUE One of the values was out of its allowable range.
+============= =================================================
+::
-int lapb_connect_request(void *token);
+ int lapb_connect_request(void *token);
Initiate a connect using the current parameter settings. The valid return
values are:
+============== =================================
LAPB_OK LAPB is starting to connect.
LAPB_BADTOKEN Invalid/unknown LAPB token.
LAPB_CONNECTED LAPB module is already connected.
+============== =================================
+::
-int lapb_disconnect_request(void *token);
+ int lapb_disconnect_request(void *token);
Initiate a disconnect. The valid return values are:
+================= ===============================
LAPB_OK LAPB is starting to disconnect.
LAPB_BADTOKEN Invalid/unknown LAPB token.
LAPB_NOTCONNECTED LAPB module is not connected.
+================= ===============================
+::
-int lapb_data_request(void *token, struct sk_buff *skb);
+ int lapb_data_request(void *token, struct sk_buff *skb);
Queue data with the LAPB module for transmitting over the link. If the call
is successful then the skbuff is owned by the LAPB module and may not be
used by the device driver again. The valid return values are:
+================= =============================
LAPB_OK LAPB has accepted the data.
LAPB_BADTOKEN Invalid/unknown LAPB token.
LAPB_NOTCONNECTED LAPB module is not connected.
+================= =============================
+::
-int lapb_data_received(void *token, struct sk_buff *skb);
+ int lapb_data_received(void *token, struct sk_buff *skb);
Queue data with the LAPB module which has been received from the device. It
is expected that the data passed to the LAPB module has skb->data pointing
@@ -188,9 +219,10 @@ to the beginning of the LAPB data. If the call is successful then the skbuff
is owned by the LAPB module and may not be used by the device driver again.
The valid return values are:
+============= ===========================
LAPB_OK LAPB has accepted the data.
LAPB_BADTOKEN Invalid/unknown LAPB token.
-
+============= ===========================
Callbacks
---------
@@ -200,49 +232,58 @@ module to call when an event occurs. They are registered with the LAPB
module with lapb_register (see above) in the structure lapb_register_struct
(see above).
+::
-void (*connect_confirmation)(void *token, int reason);
+ void (*connect_confirmation)(void *token, int reason);
This is called by the LAPB module when a connection is established after
being requested by a call to lapb_connect_request (see above). The reason is
always LAPB_OK.
+::
-void (*connect_indication)(void *token, int reason);
+ void (*connect_indication)(void *token, int reason);
This is called by the LAPB module when the link is established by the remote
system. The value of reason is always LAPB_OK.
+::
-void (*disconnect_confirmation)(void *token, int reason);
+ void (*disconnect_confirmation)(void *token, int reason);
This is called by the LAPB module when an event occurs after the device
driver has called lapb_disconnect_request (see above). The reason indicates
what has happened. In all cases the LAPB link can be regarded as being
terminated. The values for reason are:
+================= ====================================================
LAPB_OK The LAPB link was terminated normally.
LAPB_NOTCONNECTED The remote system was not connected.
LAPB_TIMEDOUT No response was received in N2 tries from the remote
system.
+================= ====================================================
+::
-void (*disconnect_indication)(void *token, int reason);
+ void (*disconnect_indication)(void *token, int reason);
This is called by the LAPB module when the link is terminated by the remote
system or another event has occurred to terminate the link. This may be
returned in response to a lapb_connect_request (see above) if the remote
system refused the request. The values for reason are:
+================= ====================================================
LAPB_OK The LAPB link was terminated normally by the remote
system.
LAPB_REFUSED The remote system refused the connect request.
LAPB_NOTCONNECTED The remote system was not connected.
LAPB_TIMEDOUT No response was received in N2 tries from the remote
system.
+================= ====================================================
+::
-int (*data_indication)(void *token, struct sk_buff *skb);
+ int (*data_indication)(void *token, struct sk_buff *skb);
This is called by the LAPB module when data has been received from the
remote system that should be passed onto the next layer in the protocol
@@ -254,8 +295,9 @@ This method should return NET_RX_DROP (as defined in the header
file include/linux/netdevice.h) if and only if the frame was dropped
before it could be delivered to the upper layer.
+::
-void (*data_transmit)(void *token, struct sk_buff *skb);
+ void (*data_transmit)(void *token, struct sk_buff *skb);
This is called by the LAPB module when data is to be transmitted to the
remote system by the device driver. The skbuff becomes the property of the
diff --git a/Documentation/networking/ltpc.txt b/Documentation/networking/ltpc.txt
deleted file mode 100644
index 0bf3220c715b..000000000000
--- a/Documentation/networking/ltpc.txt
+++ /dev/null
@@ -1,131 +0,0 @@
-This is the ALPHA version of the ltpc driver.
-
-In order to use it, you will need at least version 1.3.3 of the
-netatalk package, and the Apple or Farallon LocalTalk PC card.
-There are a number of different LocalTalk cards for the PC; this
-driver applies only to the one with the 65c02 processor chip on it.
-
-To include it in the kernel, select the CONFIG_LTPC switch in the
-configuration dialog. You can also compile it as a module.
-
-While the driver will attempt to autoprobe the I/O port address, IRQ
-line, and DMA channel of the card, this does not always work. For
-this reason, you should be prepared to supply these parameters
-yourself. (see "Card Configuration" below for how to determine or
-change the settings on your card)
-
-When the driver is compiled into the kernel, you can add a line such
-as the following to your /etc/lilo.conf:
-
- append="ltpc=0x240,9,1"
-
-where the parameters (in order) are the port address, IRQ, and DMA
-channel. The second and third values can be omitted, in which case
-the driver will try to determine them itself.
-
-If you load the driver as a module, you can pass the parameters "io=",
-"irq=", and "dma=" on the command line with insmod or modprobe, or add
-them as options in a configuration file in /etc/modprobe.d/ directory:
-
- alias lt0 ltpc # autoload the module when the interface is configured
- options ltpc io=0x240 irq=9 dma=1
-
-Before starting up the netatalk demons (perhaps in rc.local), you
-need to add a line such as:
-
- /sbin/ifconfig lt0 127.0.0.42
-
-The address is unimportant - however, the card needs to be configured
-with ifconfig so that Netatalk can find it.
-
-The appropriate netatalk configuration depends on whether you are
-attached to a network that includes AppleTalk routers or not. If,
-like me, you are simply connecting to your home Macintoshes and
-printers, you need to set up netatalk to "seed". The way I do this
-is to have the lines
-
- dummy -seed -phase 2 -net 2000 -addr 2000.26 -zone "1033"
- lt0 -seed -phase 1 -net 1033 -addr 1033.27 -zone "1033"
-
-in my atalkd.conf. What is going on here is that I need to fool
-netatalk into thinking that there are two AppleTalk interfaces
-present; otherwise, it refuses to seed. This is a hack, and a more
-permanent solution would be to alter the netatalk code. Also, make
-sure you have the correct name for the dummy interface - If it's
-compiled as a module, you will need to refer to it as "dummy0" or some
-such.
-
-If you are attached to an extended AppleTalk network, with routers on
-it, then you don't need to fool around with this -- the appropriate
-line in atalkd.conf is
-
- lt0 -phase 1
-
---------------------------------------
-
-Card Configuration:
-
-The interrupts and so forth are configured via the dipswitch on the
-board. Set the switches so as not to conflict with other hardware.
-
- Interrupts -- set at most one. If none are set, the driver uses
- polled mode. Because the card was developed in the XT era, the
- original documentation refers to IRQ2. Since you'll be running
- this on an AT (or later) class machine, that really means IRQ9.
-
- SW1 IRQ 4
- SW2 IRQ 3
- SW3 IRQ 9 (2 in original card documentation only applies to XT)
-
-
- DMA -- choose DMA 1 or 3, and set both corresponding switches.
-
- SW4 DMA 3
- SW5 DMA 1
- SW6 DMA 3
- SW7 DMA 1
-
-
- I/O address -- choose one.
-
- SW8 220 / 240
-
---------------------------------------
-
-IP:
-
-Yes, it is possible to do IP over LocalTalk. However, you can't just
-treat the LocalTalk device like an ordinary Ethernet device, even if
-that's what it looks like to Netatalk.
-
-Instead, you follow the same procedure as for doing IP in EtherTalk.
-See Documentation/networking/ipddp.txt for more information about the
-kernel driver and userspace tools needed.
-
---------------------------------------
-
-BUGS:
-
-IRQ autoprobing often doesn't work on a cold boot. To get around
-this, either compile the driver as a module, or pass the parameters
-for the card to the kernel as described above.
-
-Also, as usual, autoprobing is not recommended when you use the driver
-as a module. (though it usually works at boot time, at least)
-
-Polled mode is *really* slow sometimes, but this seems to depend on
-the configuration of the network.
-
-It may theoretically be possible to use two LTPC cards in the same
-machine, but this is unsupported, so if you really want to do this,
-you'll probably have to hack the initialization code a bit.
-
-______________________________________
-
-THANKS:
- Thanks to Alan Cox for helpful discussions early on in this
-work, and to Denis Hainsworth for doing the bleeding-edge testing.
-
--- Bradford Johnson <bradford@math.umn.edu>
-
--- Updated 11/09/1998 by David Huggins-Daines <dhd@debian.org>
diff --git a/Documentation/networking/mac80211-injection.txt b/Documentation/networking/mac80211-injection.rst
index d58d78df9ca2..63ba6611fdff 100644
--- a/Documentation/networking/mac80211-injection.txt
+++ b/Documentation/networking/mac80211-injection.rst
@@ -1,16 +1,19 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================================
How to use packet injection with mac80211
=========================================
mac80211 now allows arbitrary packets to be injected down any Monitor Mode
interface from userland. The packet you inject needs to be composed in the
-following format:
+following format::
[ radiotap header ]
[ ieee80211 header ]
[ payload ]
The radiotap format is discussed in
-./Documentation/networking/radiotap-headers.txt.
+./Documentation/networking/radiotap-headers.rst.
Despite many radiotap parameters being currently defined, most only make sense
to appear on received packets. The following information is parsed from the
@@ -18,15 +21,19 @@ radiotap headers and used to control injection:
* IEEE80211_RADIOTAP_FLAGS
- IEEE80211_RADIOTAP_F_FCS: FCS will be removed and recalculated
- IEEE80211_RADIOTAP_F_WEP: frame will be encrypted if key available
- IEEE80211_RADIOTAP_F_FRAG: frame will be fragmented if longer than the
+ ========================= ===========================================
+ IEEE80211_RADIOTAP_F_FCS FCS will be removed and recalculated
+ IEEE80211_RADIOTAP_F_WEP frame will be encrypted if key available
+ IEEE80211_RADIOTAP_F_FRAG frame will be fragmented if longer than the
current fragmentation threshold.
+ ========================= ===========================================
* IEEE80211_RADIOTAP_TX_FLAGS
- IEEE80211_RADIOTAP_F_TX_NOACK: frame should be sent without waiting for
+ ============================= ========================================
+ IEEE80211_RADIOTAP_F_TX_NOACK frame should be sent without waiting for
an ACK even if it is a unicast frame
+ ============================= ========================================
* IEEE80211_RADIOTAP_RATE
@@ -37,8 +44,10 @@ radiotap headers and used to control injection:
HT rate for the transmission (only for devices without own rate control).
Also some flags are parsed
- IEEE80211_RADIOTAP_MCS_SGI: use short guard interval
- IEEE80211_RADIOTAP_MCS_BW_40: send in HT40 mode
+ ============================ ========================
+ IEEE80211_RADIOTAP_MCS_SGI use short guard interval
+ IEEE80211_RADIOTAP_MCS_BW_40 send in HT40 mode
+ ============================ ========================
* IEEE80211_RADIOTAP_DATA_RETRIES
@@ -51,17 +60,17 @@ radiotap headers and used to control injection:
without own rate control). Also other fields are parsed
flags field
- IEEE80211_RADIOTAP_VHT_FLAG_SGI: use short guard interval
+ IEEE80211_RADIOTAP_VHT_FLAG_SGI: use short guard interval
bandwidth field
- 1: send using 40MHz channel width
- 4: send using 80MHz channel width
- 11: send using 160MHz channel width
+ * 1: send using 40MHz channel width
+ * 4: send using 80MHz channel width
+ * 11: send using 160MHz channel width
The injection code can also skip all other currently defined radiotap fields
facilitating replay of captured radiotap headers directly.
-Here is an example valid radiotap header defining some parameters
+Here is an example valid radiotap header defining some parameters::
0x00, 0x00, // <-- radiotap version
0x0b, 0x00, // <- radiotap header length
@@ -71,7 +80,7 @@ Here is an example valid radiotap header defining some parameters
0x01 //<-- antenna
The ieee80211 header follows immediately afterwards, looking for example like
-this:
+this::
0x08, 0x01, 0x00, 0x00,
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
@@ -84,14 +93,14 @@ Then lastly there is the payload.
After composing the packet contents, it is sent by send()-ing it to a logical
mac80211 interface that is in Monitor mode. Libpcap can also be used,
(which is easier than doing the work to bind the socket to the right
-interface), along the following lines:
+interface), along the following lines::
ppcap = pcap_open_live(szInterfaceName, 800, 1, 20, szErrbuf);
-...
+ ...
r = pcap_inject(ppcap, u8aSendBuffer, nLength);
You can also find a link to a complete inject application here:
-http://wireless.kernel.org/en/users/Documentation/packetspammer
+https://wireless.wiki.kernel.org/en/users/Documentation/packetspammer
Andy Green <andy@warmcat.com>
diff --git a/Documentation/networking/mctp.rst b/Documentation/networking/mctp.rst
new file mode 100644
index 000000000000..c628cb5406d2
--- /dev/null
+++ b/Documentation/networking/mctp.rst
@@ -0,0 +1,320 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================================
+Management Component Transport Protocol (MCTP)
+==============================================
+
+net/mctp/ contains protocol support for MCTP, as defined by DMTF standard
+DSP0236. Physical interface drivers ("bindings" in the specification) are
+provided in drivers/net/mctp/.
+
+The core code provides a socket-based interface to send and receive MCTP
+messages, through an AF_MCTP, SOCK_DGRAM socket.
+
+Structure: interfaces & networks
+================================
+
+The kernel models the local MCTP topology through two items: interfaces and
+networks.
+
+An interface (or "link") is an instance of an MCTP physical transport binding
+(as defined by DSP0236, section 3.2.47), likely connected to a specific hardware
+device. This is represented as a ``struct net_device``.
+
+A network defines a unique address space for MCTP endpoints by endpoint-ID
+(described by DSP0236, section 3.2.31). A network has a user-visible identifier
+to allow references from userspace. Route definitions are specific to one
+network.
+
+Interfaces are associated with one network. A network may be associated with one
+or more interfaces.
+
+If multiple networks are present, each may contain endpoint IDs (EIDs) that are
+also present on other networks.
+
+Sockets API
+===========
+
+Protocol definitions
+--------------------
+
+MCTP uses ``AF_MCTP`` / ``PF_MCTP`` for the address and protocol families.
+Since MCTP is message-based, only ``SOCK_DGRAM`` sockets are supported.
+
+.. code-block:: C
+
+ int sd = socket(AF_MCTP, SOCK_DGRAM, 0);
+
+The only (current) value for the ``protocol`` argument is 0.
+
+As with all socket address families, source and destination addresses are
+specified with a ``sockaddr`` type, with a single-byte endpoint address:
+
+.. code-block:: C
+
+ typedef __u8 mctp_eid_t;
+
+ struct mctp_addr {
+ mctp_eid_t s_addr;
+ };
+
+ struct sockaddr_mctp {
+ __kernel_sa_family_t smctp_family;
+ unsigned int smctp_network;
+ struct mctp_addr smctp_addr;
+ __u8 smctp_type;
+ __u8 smctp_tag;
+ };
+
+ #define MCTP_NET_ANY 0x0
+ #define MCTP_ADDR_ANY 0xff
+
+
+Syscall behaviour
+-----------------
+
+The following sections describe the MCTP-specific behaviours of the standard
+socket system calls. These behaviours have been chosen to map closely to the
+existing sockets APIs.
+
+``bind()`` : set local socket address
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Sockets that receive incoming request packets will bind to a local address,
+using the ``bind()`` syscall.
+
+.. code-block:: C
+
+ struct sockaddr_mctp addr;
+
+ addr.smctp_family = AF_MCTP;
+ addr.smctp_network = MCTP_NET_ANY;
+ addr.smctp_addr.s_addr = MCTP_ADDR_ANY;
+ addr.smctp_type = MCTP_TYPE_PLDM;
+ addr.smctp_tag = MCTP_TAG_OWNER;
+
+ int rc = bind(sd, (struct sockaddr *)&addr, sizeof(addr));
+
+This establishes the local address of the socket. Incoming MCTP messages that
+match the network, address, and message type will be received by this socket.
+The reference to 'incoming' is important here; a bound socket will only receive
+messages with the TO bit set, to indicate an incoming request message, rather
+than a response.
+
+The ``smctp_tag`` value will configure the tags accepted from the remote side of
+this socket. Given the above, the only valid value is ``MCTP_TAG_OWNER``, which
+will result in remotely "owned" tags being routed to this socket. Since
+``MCTP_TAG_OWNER`` is set, the 3 least-significant bits of ``smctp_tag`` are not
+used; callers must set them to zero.
+
+A ``smctp_network`` value of ``MCTP_NET_ANY`` will configure the socket to
+receive incoming packets from any locally-connected network. A specific network
+value will cause the socket to only receive incoming messages from that network.
+
+The ``smctp_addr`` field specifies a local address to bind to. A value of
+``MCTP_ADDR_ANY`` configures the socket to receive messages addressed to any
+local destination EID.
+
+The ``smctp_type`` field specifies which message types to receive. Only the
+lower 7 bits of the type are matched on incoming messages (i.e., the
+most-significant IC bit is not part of the match). This results in the socket
+receiving packets with and without a message integrity check footer.
+
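The matching rules above can be sketched as a small user-space model. This is an illustration only, not the kernel's lookup code: ``struct bind_model``, ``bind_model_matches()`` and ``MODEL_TYPE_MASK`` are hypothetical names, and the TO-bit check described earlier is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MCTP_NET_ANY	0x0
#define MCTP_ADDR_ANY	0xff
/* Hypothetical name: lower 7 bits of the type byte (IC bit excluded). */
#define MODEL_TYPE_MASK	0x7f

/* Hypothetical model of a bound address; not a kernel structure. */
struct bind_model {
	unsigned int net;
	uint8_t addr;
	uint8_t type;
};

/* Return true if an incoming request would match the modelled bind. */
static inline bool bind_model_matches(const struct bind_model *b,
				      unsigned int net, uint8_t dest_eid,
				      uint8_t msg_type)
{
	assert(b != NULL);

	/* MCTP_NET_ANY accepts any network, otherwise it must be equal. */
	if (b->net != MCTP_NET_ANY && b->net != net)
		return false;
	/* MCTP_ADDR_ANY accepts any local destination EID. */
	if (b->addr != MCTP_ADDR_ANY && b->addr != dest_eid)
		return false;
	/* The IC bit (bit 7) takes no part in the type comparison. */
	return (b->type & MODEL_TYPE_MASK) == (msg_type & MODEL_TYPE_MASK);
}
```

With a wildcard bind on type 0x01, the model accepts both 0x01 and 0x81 (the same type with the IC bit set) and rejects 0x02.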
+``sendto()``, ``sendmsg()``, ``send()`` : transmit an MCTP message
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+An MCTP message is transmitted using one of the ``sendto()``, ``sendmsg()`` or
+``send()`` syscalls. Using ``sendto()`` as the primary example:
+
+.. code-block:: C
+
+ struct sockaddr_mctp addr;
+ char buf[14];
+ ssize_t len;
+
+ /* set message destination */
+ addr.smctp_family = AF_MCTP;
+ addr.smctp_network = 0;
+ addr.smctp_addr.s_addr = 8;
+ addr.smctp_tag = MCTP_TAG_OWNER;
+ addr.smctp_type = MCTP_TYPE_ECHO;
+
+ /* arbitrary message to send, with message-type header */
+ buf[0] = MCTP_TYPE_ECHO;
+ memcpy(buf + 1, "hello, world!", sizeof(buf) - 1);
+
+ len = sendto(sd, buf, sizeof(buf), 0,
+ (struct sockaddr *)&addr, sizeof(addr));
+
+The network and address fields of ``addr`` define the remote address to send to.
+If ``smctp_tag`` has ``MCTP_TAG_OWNER`` set, the kernel will ignore any bits set
+in ``MCTP_TAG_VALUE``, and generate a tag value suitable for the destination
+EID. If ``MCTP_TAG_OWNER`` is not set, the message will be sent with the tag
+value as specified. If a tag value cannot be allocated, the system call will
+report an errno of ``EAGAIN``.
+
+The application must provide the message type byte as the first byte of the
+message buffer passed to ``sendto()``. If a message integrity check is to be
+included in the transmitted message, it must also be provided in the message
+buffer, and the most-significant bit of the message type byte must be 1.
+
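A minimal sketch of laying out a transmit buffer as described above, assuming the caller has already appended any integrity check to the message body. ``build_mctp_msg()`` and ``MODEL_IC_BIT`` are hypothetical names used for illustration; they are not part of the kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical name: most-significant (IC) bit of the type byte. */
#define MODEL_IC_BIT	0x80

/*
 * Write the type byte followed by the body into buf.
 * Returns the total message length, or -1 if buf is too small.
 */
static inline ssize_t build_mctp_msg(uint8_t *buf, size_t buflen,
				     uint8_t type, const void *body,
				     size_t body_len, bool has_ic)
{
	assert(buf != NULL && (body != NULL || body_len == 0));

	/* One extra byte is needed for the message type. */
	if (body_len >= buflen)
		return -1;

	buf[0] = (type & 0x7f) | (has_ic ? MODEL_IC_BIT : 0);
	if (body_len)
		memcpy(buf + 1, body, body_len);

	return (ssize_t)(body_len + 1);
}
```

The returned length is what would be passed as the ``len`` argument to ``sendto()``.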
+The ``sendmsg()`` system call allows a more compact argument interface, and the
+message buffer to be specified as a scatter-gather list. At present no ancillary
+message types (used for the ``msg_control`` data passed to ``sendmsg()``) are
+defined.
+
+Transmitting a message on an unconnected socket with ``MCTP_TAG_OWNER``
+specified will cause an allocation of a tag, if no valid tag is already
+allocated for that destination. The (destination-eid,tag) tuple acts as an
+implicit local socket address, to allow the socket to receive responses to this
+outgoing message. If any previous allocation has been performed (for a
+different remote EID), that allocation is lost.
+
+Sockets will only receive responses to requests they have sent (with TO=1) and
+may only respond (with TO=0) to requests they have received.
+
+``recvfrom()``, ``recvmsg()``, ``recv()`` : receive an MCTP message
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+An MCTP message can be received by an application using one of the
+``recvfrom()``, ``recvmsg()``, or ``recv()`` system calls. Using ``recvfrom()``
+as the primary example:
+
+.. code-block:: C
+
+ struct sockaddr_mctp addr;
+ socklen_t addrlen;
+ char buf[14];
+ ssize_t len;
+
+ addrlen = sizeof(addr);
+
+ len = recvfrom(sd, buf, sizeof(buf), 0,
+ (struct sockaddr *)&addr, &addrlen);
+
+ /* We can expect addr to describe an MCTP address */
+ assert(addrlen >= sizeof(addr));
+ assert(addr.smctp_family == AF_MCTP);
+
+ printf("received %zd bytes from remote EID %d\n", len, addr.smctp_addr.s_addr);
+
+The address argument to ``recvfrom`` and ``recvmsg`` is populated with the
+remote address of the incoming message, including tag value (this will be needed
+in order to reply to the message).
+
+The first byte of the message buffer will contain the message type byte. If an
+integrity check follows the message, it will be included in the received buffer.
+
+The ``recv()`` system call behaves in a similar way, but does not provide a
+remote address to the application. Therefore, it is only useful if the
+remote address is already known, or the message does not require a reply.
+
+Like the send calls, sockets will only receive responses to requests they have
+sent (TO=1) and may only respond (TO=0) to requests they have received.
+
+``ioctl(SIOCMCTPALLOCTAG)`` and ``ioctl(SIOCMCTPDROPTAG)``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+These ioctls give applications more control over MCTP message tags, by allocating
+(and dropping) tag values explicitly, rather than the kernel automatically
+allocating a per-message tag at ``sendmsg()`` time.
+
+In general, you will only need to use these ioctls if your MCTP protocol does
+not fit the usual request/response model. For example, if you need to persist
+tags across multiple requests, or a request may generate more than one response.
+In these cases, the ioctls allow you to decouple the tag allocation (and
+release) from individual message send and receive operations.
+
+Both ioctls are passed a pointer to a ``struct mctp_ioc_tag_ctl``:
+
+.. code-block:: C
+
+ struct mctp_ioc_tag_ctl {
+ mctp_eid_t peer_addr;
+ __u8 tag;
+ __u16 flags;
+ };
+
+``SIOCMCTPALLOCTAG`` allocates a tag for a specific peer, which an application
+can use in future ``sendmsg()`` calls. The application populates the
+``peer_addr`` member with the remote EID. Other fields must be zero.
+
+On return, the ``tag`` member will be populated with the allocated tag value.
+The allocated tag will have the following tag bits set:
+
+ - ``MCTP_TAG_OWNER``: it only makes sense to allocate tags if you're the tag
+ owner
+
+ - ``MCTP_TAG_PREALLOC``: to indicate to ``sendmsg()`` that this is a
+ preallocated tag.
+
+ - ... and the actual tag value, within the least-significant three bits
+ (``MCTP_TAG_MASK``). Note that zero is a valid tag value.
+
+The tag value should be used as-is for the ``smctp_tag`` member of ``struct
+sockaddr_mctp``.
+
+``SIOCMCTPDROPTAG`` releases a tag that has been previously allocated by a
+``SIOCMCTPALLOCTAG`` ioctl. The ``peer_addr`` must be the same as used for the
+allocation, and the ``tag`` value must match exactly the tag returned from the
+allocation (including the ``MCTP_TAG_OWNER`` and ``MCTP_TAG_PREALLOC`` bits).
+The ``flags`` field must be zero.
+
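The tag layout described above can be sketched in user space without issuing the ioctl. The numeric bit values below are assumed to mirror ``<linux/mctp.h>`` and should be checked against the installed header; only the three-bit value mask is stated in the text. ``struct tag_ctl_model`` and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed values; verify against <linux/mctp.h> on your system. */
#define MODEL_TAG_MASK		0x07
#define MODEL_TAG_OWNER		0x08
#define MODEL_TAG_PREALLOC	0x10

/* Hypothetical stand-in with the same fields as struct mctp_ioc_tag_ctl. */
struct tag_ctl_model {
	uint8_t peer_addr;
	uint8_t tag;
	uint16_t flags;
};

/* Prepare an allocation request: peer EID set, all other fields zero. */
static inline void tag_ctl_prepare_alloc(struct tag_ctl_model *ctl,
					 uint8_t peer_eid)
{
	assert(ctl != NULL);
	memset(ctl, 0, sizeof(*ctl));
	ctl->peer_addr = peer_eid;
}

/* True if tag carries both OWNER and PREALLOC and no unknown bits. */
static inline bool tag_is_prealloc_owner(uint8_t tag)
{
	uint8_t known = MODEL_TAG_MASK | MODEL_TAG_OWNER | MODEL_TAG_PREALLOC;

	return (tag & MODEL_TAG_OWNER) && (tag & MODEL_TAG_PREALLOC) &&
	       !(tag & ~known);
}

/* Extract the three-bit tag value; zero is a valid value. */
static inline uint8_t tag_value(uint8_t tag)
{
	return tag & MODEL_TAG_MASK;
}
```

In a real program the prepared structure would be a ``struct mctp_ioc_tag_ctl`` passed to ``ioctl(sd, SIOCMCTPALLOCTAG, &ctl)``, and the returned ``ctl.tag`` would be copied unchanged into ``smctp_tag``.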
+Kernel internals
+================
+
+There are a few possible packet flows in the MCTP stack:
+
+1. local TX to remote endpoint, message <= MTU::
+
+ sendmsg()
+ -> mctp_local_output()
+ : route lookup
+ -> rt->output() (== mctp_route_output)
+ -> dev_queue_xmit()
+
+2. local TX to remote endpoint, message > MTU::
+
+ sendmsg()
+ -> mctp_local_output()
+ -> mctp_do_fragment_route()
+ : creates packet-sized skbs. For each new skb:
+ -> rt->output() (== mctp_route_output)
+ -> dev_queue_xmit()
+
+3. remote TX to local endpoint, single-packet message::
+
+ mctp_pkttype_receive()
+ : route lookup
+ -> rt->output() (== mctp_route_input)
+ : sk_key lookup
+ -> sock_queue_rcv_skb()
+
+4. remote TX to local endpoint, multiple-packet message::
+
+ mctp_pkttype_receive()
+ : route lookup
+ -> rt->output() (== mctp_route_input)
+ : sk_key lookup
+ : stores skb in struct sk_key->reasm_head
+
+ mctp_pkttype_receive()
+ : route lookup
+ -> rt->output() (== mctp_route_input)
+ : sk_key lookup
+ : finds existing reassembly in sk_key->reasm_head
+ : appends new fragment
+ -> sock_queue_rcv_skb()
+
+Key refcounts
+-------------
+
+ * keys are refed by:
+
+ - a skb: during route output, stored in ``skb->cb``.
+
+ - netns and sock lists.
+
+ * keys can be associated with a device, in which case they hold a
+ reference to the dev (set through ``key->dev``, counted through
+ ``dev->key_count``). Multiple keys can reference the device.
diff --git a/Documentation/networking/mpls-sysctl.txt b/Documentation/networking/mpls-sysctl.rst
index 025cc9b96992..0a2ac88404d7 100644
--- a/Documentation/networking/mpls-sysctl.txt
+++ b/Documentation/networking/mpls-sysctl.rst
@@ -1,4 +1,11 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+MPLS Sysctl variables
+=====================
+
/proc/sys/net/mpls/* Variables:
+===============================
platform_labels - INTEGER
Number of entries in the platform label table. It is not
@@ -17,6 +24,7 @@ platform_labels - INTEGER
no longer fit in the table.
Possible values: 0 - 1048575
+
Default: 0
ip_ttl_propagate - BOOL
@@ -27,8 +35,8 @@ ip_ttl_propagate - BOOL
If disabled, the MPLS transport network will appear as a
single hop to transit traffic.
- 0 - disabled / RFC 3443 [Short] Pipe Model
- 1 - enabled / RFC 3443 Uniform Model (default)
+ * 0 - disabled / RFC 3443 [Short] Pipe Model
+ * 1 - enabled / RFC 3443 Uniform Model (default)
default_ttl - INTEGER
Default TTL value to use for MPLS packets where it cannot be
@@ -36,6 +44,7 @@ default_ttl - INTEGER
or ip_ttl_propagate has been disabled.
Possible values: 1 - 255
+
Default: 255
conf/<interface>/input - BOOL
@@ -44,5 +53,5 @@ conf/<interface>/input - BOOL
If disabled, packets will be discarded without further
processing.
- 0 - disabled (default)
- not 0 - enabled
+ * 0 - disabled (default)
+ * not 0 - enabled
diff --git a/Documentation/networking/mptcp-sysctl.rst b/Documentation/networking/mptcp-sysctl.rst
new file mode 100644
index 000000000000..213510698014
--- /dev/null
+++ b/Documentation/networking/mptcp-sysctl.rst
@@ -0,0 +1,76 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+MPTCP Sysctl variables
+======================
+
+/proc/sys/net/mptcp/* Variables
+===============================
+
+enabled - BOOLEAN
+ Control whether MPTCP sockets can be created.
+
+ MPTCP sockets can be created if the value is 1. This is a
+ per-namespace sysctl.
+
+ Default: 1 (enabled)
+
+add_addr_timeout - INTEGER (seconds)
+ Set the timeout after which an ADD_ADDR control message will be
+ resent to an MPTCP peer that has not acknowledged a previous
+ ADD_ADDR message.
+
+ The default value matches TCP_RTO_MAX. This is a per-namespace
+ sysctl.
+
+ Default: 120
+
+checksum_enabled - BOOLEAN
+ Control whether DSS checksum can be enabled.
+
+ DSS checksum can be enabled if the value is nonzero. This is a
+ per-namespace sysctl.
+
+ Default: 0
+
+allow_join_initial_addr_port - BOOLEAN
+ Allow peers to send join requests to the IP address and port number used
+ by the initial subflow if the value is 1. This controls a flag that is
+ sent to the peer at connection time, and whether such join requests are
+ accepted or denied.
+
+ Joins to addresses advertised with ADD_ADDR are not affected by this
+ value.
+
+ This is a per-namespace sysctl.
+
+ Default: 1
+
+pm_type - INTEGER
+ Set the default path manager type to use for each new MPTCP
+ socket. In-kernel path management will control subflow
+ connections and address advertisements according to
+ per-namespace values configured over the MPTCP netlink
+ API. Userspace path management puts per-MPTCP-connection subflow
+ connection decisions and address advertisements under control of
+ a privileged userspace program, at the cost of more netlink
+ traffic to propagate all of the related events and commands.
+
+ This is a per-namespace sysctl.
+
+ * 0 - In-kernel path manager
+ * 1 - Userspace path manager
+
+ Default: 0
+
+stale_loss_cnt - INTEGER
+ The number of MPTCP-level retransmission intervals with no traffic and
+ pending outstanding data on a given subflow required to declare it stale.
+ The packet scheduler ignores stale subflows.
+ A low stale_loss_cnt value allows for fast active-backup switch-over;
+ a high value maximizes link utilization in edge scenarios, e.g. a lossy
+ link with high BER or a peer pausing data processing.
+
+ This is a per-namespace sysctl.
+
+ Default: 4
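The integer sysctls listed above are exposed as text files under /proc/sys/net/mptcp/, so they can be read with ordinary file I/O. The helper below is a minimal sketch; ``read_sysctl_long()`` is a hypothetical name, and the code assumes the file holds a single decimal integer on its first line.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Read one decimal integer from a sysctl-style file.
 * Returns 0 and stores the value in *out on success, -1 on failure.
 */
static inline int read_sysctl_long(const char *path, long *out)
{
	char line[64];
	char *end;
	long val;
	FILE *f;

	assert(path != NULL && out != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(line, sizeof(line), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	errno = 0;
	val = strtol(line, &end, 10);
	if (errno != 0 || end == line)
		return -1;	/* out of range, or no digits found */

	*out = val;
	return 0;
}
```

For example, ``read_sysctl_long("/proc/sys/net/mptcp/add_addr_timeout", &v)`` would be expected to store the configured timeout in seconds on a kernel built with MPTCP, and to return -1 where the file does not exist.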
diff --git a/Documentation/networking/msg_zerocopy.rst b/Documentation/networking/msg_zerocopy.rst
index ace56204dd03..15920db8d35d 100644
--- a/Documentation/networking/msg_zerocopy.rst
+++ b/Documentation/networking/msg_zerocopy.rst
@@ -50,7 +50,7 @@ the excellent reporting over at LWN.net or read the original code.
patchset
[PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
- https://lkml.kernel.org/netdev/20170803202945.70750-1-willemdebruijn.kernel@gmail.com
+ https://lore.kernel.org/netdev/20170803202945.70750-1-willemdebruijn.kernel@gmail.com
Interface
diff --git a/Documentation/networking/multiqueue.txt b/Documentation/networking/multiqueue.rst
index 4caa0e314cc2..0a576166e9dd 100644
--- a/Documentation/networking/multiqueue.txt
+++ b/Documentation/networking/multiqueue.rst
@@ -1,17 +1,17 @@
+.. SPDX-License-Identifier: GPL-2.0
- HOWTO for multiqueue network device support
- ===========================================
+===========================================
+HOWTO for multiqueue network device support
+===========================================
Section 1: Base driver requirements for implementing multiqueue support
+=======================================================================
Intro: Kernel support for multiqueue devices
---------------------------------------------------------
Kernel support for multiqueue devices is always present.
-Section 1: Base driver requirements for implementing multiqueue support
------------------------------------------------------------------------
-
Base drivers are required to use the new alloc_etherdev_mq() or
alloc_netdev_mq() functions to allocate the subqueues for the device. The
underlying kernel API will take care of the allocation and deallocation of
@@ -26,8 +26,7 @@ comes online or when it's completely shut down (unregister_netdev(), etc.).
Section 2: Qdisc support for multiqueue devices
-
------------------------------------------------
+===============================================
Currently two qdiscs are optimized for multiqueue devices. The first is the
default pfifo_fast qdisc. This qdisc supports one qdisc per hardware queue.
@@ -46,22 +45,22 @@ will be queued to the band associated with the hardware queue.
Section 3: Brief howto using MULTIQ for multiqueue devices
----------------------------------------------------------------
+==========================================================
The userspace command 'tc,' part of the iproute2 package, is used to configure
qdiscs. To add the MULTIQ qdisc to your network device, assuming the device
-is called eth0, run the following command:
+is called eth0, run the following command::
-# tc qdisc add dev eth0 root handle 1: multiq
+ # tc qdisc add dev eth0 root handle 1: multiq
The qdisc will allocate the number of bands to equal the number of queues that
the device reports, and bring the qdisc online. Assuming eth0 has 4 Tx
-queues, the band mapping would look like:
+queues, the band mapping would look like::
-band 0 => queue 0
-band 1 => queue 1
-band 2 => queue 2
-band 3 => queue 3
+ band 0 => queue 0
+ band 1 => queue 1
+ band 2 => queue 2
+ band 3 => queue 3
Traffic will begin flowing through each queue based on either the simple_tx_hash
function or based on netdev->select_queue() if you have it defined.
@@ -69,11 +68,11 @@ function or based on netdev->select_queue() if you have it defined.
The behavior of tc filters remains the same. However a new tc action,
skbedit, has been added. Assuming you wanted to route all traffic to a
specific host, for example 192.168.0.3, through a specific queue you could use
-this action and establish a filter such as:
+this action and establish a filter such as::
-tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
- match ip dst 192.168.0.3 \
- action skbedit queue_mapping 3
+ tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
+ match ip dst 192.168.0.3 \
+ action skbedit queue_mapping 3
-Author: Alexander Duyck <alexander.h.duyck@intel.com>
-Original Author: Peter P. Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com>
+:Author: Alexander Duyck <alexander.h.duyck@intel.com>
+:Original Author: Peter P. Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com>
diff --git a/Documentation/networking/net_dim.txt b/Documentation/networking/net_dim.rst
index 9bdb7d5a3ba3..3bed9fd95336 100644
--- a/Documentation/networking/net_dim.txt
+++ b/Documentation/networking/net_dim.rst
@@ -1,28 +1,20 @@
+======================================================
Net DIM - Generic Network Dynamic Interrupt Moderation
======================================================
-Author:
- Tal Gilboa <talgi@mellanox.com>
-
-
-Contents
-=========
+:Author: Tal Gilboa <talgi@mellanox.com>
-- Assumptions
-- Introduction
-- The Net DIM Algorithm
-- Registering a Network Device to DIM
-- Example
+.. contents:: :depth: 2
-Part 0: Assumptions
-======================
+Assumptions
+===========
This document assumes the reader has basic knowledge in network drivers
and in general interrupt moderation.
-Part I: Introduction
-======================
+Introduction
+============
Dynamic Interrupt Moderation (DIM) (in networking) refers to changing the
interrupt moderation configuration of a channel in order to optimize packet
@@ -41,14 +33,15 @@ number of wanted packets per event. The Net DIM algorithm ascribes importance to
increase bandwidth over reducing interrupt rate.
-Part II: The Net DIM Algorithm
-===============================
+Net DIM Algorithm
+=================
Each iteration of the Net DIM algorithm follows these steps:
-1. Calculates new data sample.
-2. Compares it to previous sample.
-3. Makes a decision - suggests interrupt moderation configuration fields.
-4. Applies a schedule work function, which applies suggested configuration.
+
+#. Calculates new data sample.
+#. Compares it to previous sample.
+#. Makes a decision - suggests interrupt moderation configuration fields.
+#. Applies a schedule work function, which applies suggested configuration.
The first two steps are straightforward, both the new and the previous data are
supplied by the driver registered to Net DIM. The previous data is the new data
@@ -89,19 +82,21 @@ manoeuvre as it may provide partial data or ignore the algorithm suggestion
under some conditions.
-Part III: Registering a Network Device to DIM
-==============================================
+Registering a Network Device to DIM
+===================================
-Net DIM API exposes the main function net_dim(struct dim *dim,
-struct dim_sample end_sample). This function is the entry point to the Net
+Net DIM API exposes the main function net_dim().
+This function is the entry point to the Net
DIM algorithm and has to be called every time the driver would like to check if
it should change interrupt moderation parameters. The driver should provide two
-data structures: struct dim and struct dim_sample. Struct dim
+data structures: :c:type:`struct dim <dim>` and
+:c:type:`struct dim_sample <dim_sample>`. :c:type:`struct dim <dim>`
describes the state of DIM for a specific object (RX queue, TX queue,
other queues, etc.). This includes the current selected profile, previous data
samples, the callback function provided by the driver and more.
-Struct dim_sample describes a data sample, which will be compared to the
-data sample stored in struct dim in order to decide on the algorithm's next
+:c:type:`struct dim_sample <dim_sample>` describes a data sample,
+which will be compared to the data sample stored in :c:type:`struct dim <dim>`
+in order to decide on the algorithm's next
step. The sample should include bytes, packets and interrupts, measured by
the driver.
@@ -110,9 +105,10 @@ main net_dim() function. The recommended method is to call net_dim() on each
interrupt. Since Net DIM has a built-in moderation and it might decide to skip
iterations under certain conditions, there is no need to moderate the net_dim()
calls as well. As mentioned above, the driver needs to provide an object of type
-struct dim to the net_dim() function call. It is advised for each entity
-using Net DIM to hold a struct dim as part of its data structure and use it
-as the main Net DIM API object. The struct dim_sample should hold the latest
+:c:type:`struct dim <dim>` to the net_dim() function call. It is advised for
+each entity using Net DIM to hold a :c:type:`struct dim <dim>` as part of its
+data structure and use it as the main Net DIM API object.
+The :c:type:`struct dim_sample <dim_sample>` should hold the latest
bytes, packets and interrupts count. No need to perform any calculations, just
include the raw data.
@@ -124,19 +120,19 @@ the data flow. After the work is done, Net DIM algorithm needs to be set to
the proper state in order to move to the next iteration.
-Part IV: Example
-=================
+Example
+=======
The following code demonstrates how to register a driver to Net DIM. The actual
usage is not complete but it should make the outline of the usage clear.
-my_driver.c:
+.. code-block:: c
-#include <linux/dim.h>
+ #include <linux/dim.h>
-/* Callback for net DIM to schedule on a decision to change moderation */
-void my_driver_do_dim_work(struct work_struct *work)
-{
+ /* Callback for net DIM to schedule on a decision to change moderation */
+ void my_driver_do_dim_work(struct work_struct *work)
+ {
/* Get struct dim from struct work_struct */
struct dim *dim = container_of(work, struct dim,
work);
@@ -145,11 +141,11 @@ void my_driver_do_dim_work(struct work_struct *work)
/* Signal net DIM work is done and it should move to next iteration */
dim->state = DIM_START_MEASURE;
-}
+ }
-/* My driver's interrupt handler */
-int my_driver_handle_interrupt(struct my_driver_entity *my_entity, ...)
-{
+ /* My driver's interrupt handler */
+ int my_driver_handle_interrupt(struct my_driver_entity *my_entity, ...)
+ {
...
/* A struct to hold current measured data */
struct dim_sample dim_sample;
@@ -162,13 +158,19 @@ int my_driver_handle_interrupt(struct my_driver_entity *my_entity, ...)
/* Call net DIM */
net_dim(&my_entity->dim, dim_sample);
...
-}
+ }
-/* My entity's initialization function (my_entity was already allocated) */
-int my_driver_init_my_entity(struct my_driver_entity *my_entity, ...)
-{
+ /* My entity's initialization function (my_entity was already allocated) */
+ int my_driver_init_my_entity(struct my_driver_entity *my_entity, ...)
+ {
...
/* Initiate struct work_struct with my driver's callback function */
INIT_WORK(&my_entity->dim.work, my_driver_do_dim_work);
...
-}
+ }
+
+Dynamic Interrupt Moderation (DIM) library API
+==============================================
+
+.. kernel-doc:: include/linux/dim.h
+ :internal:
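The iteration described in the Net DIM text above (compute a new sample, compare it to the previous one, suggest a step) can be modeled in a few lines of user-space Python. This is only a toy sketch under stated assumptions: the names `Sample`, `Rates`, `compare` and `next_step`, the 10% tolerance, and the "reverse direction when worse" rule are illustrative and are not taken from `include/linux/dim.h`. The fourth step, scheduling the work function, is left out.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Sample:
    """Raw counters at one instant (hypothetical stand-in for struct dim_sample)."""
    time_us: int
    packets: int
    byte_count: int
    events: int  # interrupt count


@dataclass
class Rates:
    packets_per_ms: float
    bytes_per_ms: float
    events_per_ms: float


def rates_between(start: Sample, end: Sample) -> Rates:
    """Step 1: turn two raw samples into per-millisecond rates."""
    elapsed_ms = (end.time_us - start.time_us) / 1000
    if elapsed_ms <= 0:
        raise ValueError("end sample must be later than start sample")
    return Rates(
        (end.packets - start.packets) / elapsed_ms,
        (end.byte_count - start.byte_count) / elapsed_ms,
        (end.events - start.events) / elapsed_ms,
    )


def compare(prev: Rates, curr: Rates, tolerance: float = 0.10) -> str:
    """Step 2: return 'better', 'worse' or 'same' (bytes first, then packets).

    The tolerance value is an assumption made for this sketch.
    """
    for new, old in ((curr.bytes_per_ms, prev.bytes_per_ms),
                     (curr.packets_per_ms, prev.packets_per_ms)):
        if old == 0:
            if new > 0:
                return "better"
            continue
        if abs(new - old) / old > tolerance:
            return "better" if new > old else "worse"
    return "same"


def next_step(profile: int, direction: int, verdict: str,
              n_profiles: int) -> Tuple[int, int]:
    """Step 3: suggest the next moderation profile index (toy rule)."""
    if verdict == "same":
        return profile, direction          # stay where we are
    if verdict == "worse":
        direction = -direction             # last move hurt, turn around
    new_profile = min(max(profile + direction, 0), n_profiles - 1)
    return new_profile, direction
```

A driver-like caller would keep the previous `Rates` and the current `(profile, direction)` pair per queue, which mirrors the advice above to embed one `struct dim` in each entity that uses Net DIM.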
diff --git a/Documentation/networking/net_failover.rst b/Documentation/networking/net_failover.rst
index e143ab79a960..3a662f2b4d6e 100644
--- a/Documentation/networking/net_failover.rst
+++ b/Documentation/networking/net_failover.rst
@@ -35,7 +35,7 @@ To support this, the hypervisor needs to enable VIRTIO_NET_F_STANDBY
feature on the virtio-net interface and assign the same MAC address to both
virtio-net and VF interfaces.
-Here is an example XML snippet that shows such configuration.
+Here is an example libvirt XML snippet that shows such configuration:
::
<interface type='network'>
@@ -45,18 +45,32 @@ Here is an example XML snippet that shows such configuration.
<model type='virtio'/>
<driver name='vhost' queues='4'/>
<link state='down'/>
- <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
+ <teaming type='persistent'/>
+ <alias name='ua-backup0'/>
</interface>
<interface type='hostdev' managed='yes'>
<mac address='52:54:00:00:12:53'/>
<source>
<address type='pci' domain='0x0000' bus='0x42' slot='0x02' function='0x5'/>
</source>
- <address type='pci' domain='0x0000' bus='0x00' slot='0x0b' function='0x0'/>
+ <teaming type='transient' persistent='ua-backup0'/>
</interface>
+In this configuration, the first device definition is for the virtio-net
+interface and this acts as the 'persistent' device indicating that this
+interface will always be plugged in. This is specified by the 'teaming' tag with
+required attribute type having value 'persistent'. The link state for the
+virtio-net device is set to 'down' to ensure that the 'failover' netdev prefers
+the VF passthrough device for normal communication. The virtio-net device will
+be brought UP during live migration to allow uninterrupted communication.
+
+The second device definition is for the VF passthrough interface. Here the
+'teaming' tag is provided with type 'transient' indicating that this device may
+periodically be unplugged. A second attribute, 'persistent', is provided and
+points to the alias name declared for the virtio-net device.
+
Booting a VM with the above configuration will result in the following 3
-netdevs created in the VM.
+interfaces created in the VM:
::
4: ens10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
@@ -65,13 +79,36 @@ netdevs created in the VM.
valid_lft 42482sec preferred_lft 42482sec
inet6 fe80::97d8:db2:8c10:b6d6/64 scope link
valid_lft forever preferred_lft forever
- 5: ens10nsby: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master ens10 state UP group default qlen 1000
+ 5: ens10nsby: <BROADCAST,MULTICAST> mtu 1500 qdisc fq_codel master ens10 state DOWN group default qlen 1000
link/ether 52:54:00:00:12:53 brd ff:ff:ff:ff:ff:ff
7: ens11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ens10 state UP group default qlen 1000
link/ether 52:54:00:00:12:53 brd ff:ff:ff:ff:ff:ff
-ens10 is the 'failover' master netdev, ens10nsby and ens11 are the slave
-'standby' and 'primary' netdevs respectively.
+Here, ens10 is the 'failover' master interface, ens10nsby is the slave 'standby'
+virtio-net interface, and ens11 is the slave 'primary' VF passthrough interface.
+
+One point to note here is that some user space network configuration daemons
+like systemd-networkd, ifupdown, etc., do not understand the 'net_failover'
+device; and on the first boot, the VM might end up with both 'failover' device
+and VF acquiring IP addresses (either same or different) from the DHCP server.
+This will result in lack of connectivity to the VM. So some tweaks might be
+needed to these network configuration daemons to make sure that an IP is
+received only on the 'failover' device.
+
+Below is the patch snippet used with 'cloud-ifupdown-helper' script found on
+Debian cloud images:
+
+::
+ @@ -27,6 +27,8 @@ do_setup() {
+ local working="$cfgdir/.$INTERFACE"
+ local final="$cfgdir/$INTERFACE"
+
+ + if [ -d "/sys/class/net/${INTERFACE}/master" ]; then exit 0; fi
+ +
+ if ifup --no-act "$INTERFACE" > /dev/null 2>&1; then
+ # interface is already known to ifupdown, no need to generate cfg
+ log "Skipping configuration generation for $INTERFACE"
+
Live Migration of a VM with SR-IOV VF & virtio-net in STANDBY mode
==================================================================
@@ -80,40 +117,68 @@ net_failover also enables hypervisor controlled live migration to be supported
with VMs that have direct attached SR-IOV VF devices by automatic failover to
the paravirtual datapath when the VF is unplugged.
-Here is a sample script that shows the steps to initiate live migration on
-the source hypervisor.
+Here is a sample script that shows the steps to initiate live migration from
+the source hypervisor. Note: It is assumed that the VM is connected to a
+software bridge 'br0' which has a single VF attached to it along with the vnet
+device to the VM. This is not the VF that was passthrough'd to the VM (seen in
+the vf.xml file).
::
- # cat vf_xml
+ # cat vf.xml
<interface type='hostdev' managed='yes'>
<mac address='52:54:00:00:12:53'/>
<source>
<address type='pci' domain='0x0000' bus='0x42' slot='0x02' function='0x5'/>
</source>
- <address type='pci' domain='0x0000' bus='0x00' slot='0x0b' function='0x0'/>
+ <teaming type='transient' persistent='ua-backup0'/>
</interface>
- # Source Hypervisor
+ # Source Hypervisor migrate.sh
#!/bin/bash
- DOMAIN=fedora27-tap01
- PF=enp66s0f0
- VF_NUM=5
- TAP_IF=tap01
- VF_XML=
+ DOMAIN=vm-01
+ PF=ens6np0
+ VF=ens6v1 # VF attached to the bridge.
+ VF_NUM=1
+ TAP_IF=vmtap01 # virtio-net interface in the VM.
+ VF_XML=vf.xml
MAC=52:54:00:00:12:53
ZERO_MAC=00:00:00:00:00:00
+ # Set the virtio-net interface up.
virsh domif-setlink $DOMAIN $TAP_IF up
- bridge fdb del $MAC dev $PF master
- virsh detach-device $DOMAIN $VF_XML
+
+ # Remove the VF that was passthrough'd to the VM.
+ virsh detach-device --live --config $DOMAIN $VF_XML
+
ip link set $PF vf $VF_NUM mac $ZERO_MAC
- virsh migrate --live $DOMAIN qemu+ssh://$REMOTE_HOST/system
+ # Add FDB entry for traffic to continue going to the VM via
+ # the VF -> br0 -> vnet interface path.
+ bridge fdb add $MAC dev $VF
+ bridge fdb add $MAC dev $TAP_IF master
+
+ # Migrate the VM
+ virsh migrate --live --persistent $DOMAIN qemu+ssh://$REMOTE_HOST/system
+
+ # Clean up FDB entries after migration completes.
+ bridge fdb del $MAC dev $VF
+ bridge fdb del $MAC dev $TAP_IF master
- # Destination Hypervisor
+On the destination hypervisor, a shared bridge 'br0' is created before migration
+starts, and a VF from the destination PF is added to the bridge. Similarly, an
+appropriate FDB entry is added.
+
+The following script is executed on the destination hypervisor once migration
+completes, and it reattaches the VF to the VM and brings down the virtio-net
+interface.
+
+::
+ # reattach-vf.sh
#!/bin/bash
- virsh attach-device $DOMAIN $VF_XML
- virsh domif-setlink $DOMAIN $TAP_IF down
+ bridge fdb del 52:54:00:00:12:53 dev ens36v0
+ bridge fdb del 52:54:00:00:12:53 dev vmtap01 master
+ virsh attach-device --config --live vm-01 vf.xml
+ virsh domif-setlink vm-01 vmtap01 down
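The cloud-ifupdown-helper patch quoted earlier in this file skips any interface that has a `master` entry under `/sys/class/net`, so that only the 'failover' device requests an address. A configuration tool written in Python could apply the same test. The sketch below assumes the helper name `is_enslaved` and a `sysfs_net` parameter, both hypothetical; the parameter exists only so the check can be tried against a scratch directory.

```python
import os


def is_enslaved(ifname: str, sysfs_net: str = "/sys/class/net") -> bool:
    """Return True when <sysfs_net>/<ifname>/master is a directory.

    os.path.isdir() follows symlinks, like the shell test `[ -d ... ]`
    used in the patch snippet.
    """
    return os.path.isdir(os.path.join(sysfs_net, ifname, "master"))
```

With the interface names from the example output above, one would expect the check to be true for `ens10nsby` and `ens11` and false for `ens10`, so DHCP would be attempted on `ens10` only.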
diff --git a/Documentation/networking/netconsole.txt b/Documentation/networking/netconsole.rst
index 296ea00fd3eb..1f5c4a04027c 100644
--- a/Documentation/networking/netconsole.txt
+++ b/Documentation/networking/netconsole.rst
@@ -1,7 +1,16 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========
+Netconsole
+==========
+
started by Ingo Molnar <mingo@redhat.com>, 2001.09.17
+
2.6 port and netpoll api by Matt Mackall <mpm@selenic.com>, Sep 9 2003
+
IPv6 support by Cong Wang <xiyou.wangcong@gmail.com>, Jan 1 2013
+
Extended console support by Tejun Heo <tj@kernel.org>, May 1 2015
Please send bug reports to Matt Mackall <mpm@selenic.com>
@@ -23,34 +32,34 @@ Sender and receiver configuration:
==================================
It takes a string configuration parameter "netconsole" in the
-following format:
+following format::
netconsole=[+][src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr]
where
- + if present, enable extended console support
- src-port source for UDP packets (defaults to 6665)
- src-ip source IP to use (interface address)
- dev network interface (eth0)
- tgt-port port for logging agent (6666)
- tgt-ip IP address for logging agent
- tgt-macaddr ethernet MAC address for logging agent (broadcast)
+ + if present, enable extended console support
+ src-port source for UDP packets (defaults to 6665)
+ src-ip source IP to use (interface address)
+ dev network interface (eth0)
+ tgt-port port for logging agent (6666)
+ tgt-ip IP address for logging agent
+ tgt-macaddr ethernet MAC address for logging agent (broadcast)
-Examples:
+Examples::
linux netconsole=4444@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc
- or
+or::
insmod netconsole netconsole=@/,@10.0.0.2/
- or using IPv6
+or using IPv6::
insmod netconsole netconsole=@/,@fd00:1:2:3::1/
It also supports logging to multiple remote agents by specifying
parameters for the multiple agents separated by semicolons and the
-complete string enclosed in "quotes", thusly:
+complete string enclosed in "quotes", thusly::
modprobe netconsole netconsole="@/,@10.0.0.2/;@/eth1,6892@10.0.0.3/"
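To make the parameter grammar above concrete, here is a small Python sketch that splits a `netconsole=` value into its fields. It is not the kernel's parser. The `Target` class and the function names are made up for illustration, and the defaults (6665, eth0, 6666, broadcast MAC) are the ones listed in the field table above.

```python
from dataclasses import dataclass
from typing import List, Optional

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"


@dataclass
class Target:
    extended: bool
    src_port: int
    src_ip: Optional[str]
    dev: str
    tgt_port: int
    tgt_ip: str
    tgt_mac: str


def parse_target(spec: str) -> Target:
    """Parse one [+][src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr]."""
    extended = spec.startswith("+")
    if extended:
        spec = spec[1:]
    local, comma, remote = spec.partition(",")
    src_port, at_local, local_rest = local.partition("@")
    tgt_port, at_remote, remote_rest = remote.partition("@")
    if not (comma and at_local and at_remote):
        raise ValueError(f"malformed netconsole target: {spec!r}")
    # IPv6 addresses and MAC addresses contain ':' but never '/', so
    # splitting on the first '/' is enough for this sketch.
    src_ip, _, dev = local_rest.partition("/")
    tgt_ip, _, tgt_mac = remote_rest.partition("/")
    if not tgt_ip:
        raise ValueError("tgt-ip is mandatory")
    return Target(
        extended=extended,
        src_port=int(src_port) if src_port else 6665,
        src_ip=src_ip or None,
        dev=dev or "eth0",
        tgt_port=int(tgt_port) if tgt_port else 6666,
        tgt_ip=tgt_ip,
        tgt_mac=tgt_mac or BROADCAST_MAC,
    )


def parse_netconsole(value: str) -> List[Target]:
    """Parse a full value; multiple targets are separated by semicolons."""
    return [parse_target(part) for part in value.split(";") if part]
```

Feeding it the multi-agent string from the last example yields two targets, the second one bound to `eth1` and sending to port 6892 on 10.0.0.3.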
@@ -67,14 +76,19 @@ for example:
On distributions using a BSD-based netcat version (e.g. Fedora,
openSUSE and Ubuntu) the listening port must be specified without
- the -p switch:
+ the -p switch::
+
+ 'nc -u -l -p <port>' / 'nc -u -l <port>'
+
+ or::
- 'nc -u -l -p <port>' / 'nc -u -l <port>' or
- 'netcat -u -l -p <port>' / 'netcat -u -l <port>'
+ 'netcat -u -l -p <port>' / 'netcat -u -l <port>'
3) socat
- 'socat udp-recv:<port> -'
+::
+
+ socat udp-recv:<port> -
Dynamic reconfiguration:
========================
@@ -92,7 +106,7 @@ netconsole module (or kernel, if netconsole is built-in).
Some examples follow (where configfs is mounted at the /sys/kernel/config
mountpoint).
-To add a remote logging target (target names can be arbitrary):
+To add a remote logging target (target names can be arbitrary)::
cd /sys/kernel/config/netconsole/
mkdir target1
@@ -102,12 +116,13 @@ above) and are disabled by default -- they must first be enabled by writing
"1" to the "enabled" attribute (usually after setting parameters accordingly)
as described below.
-To remove a target:
+To remove a target::
rmdir /sys/kernel/config/netconsole/othertarget/
The interface exposes these parameters of a netconsole target to userspace:
+ ============== ================================= ============
enabled Is this target currently enabled? (read-write)
extended Extended mode enabled (read-write)
dev_name Local network interface name (read-write)
@@ -117,12 +132,13 @@ The interface exposes these parameters of a netconsole target to userspace:
remote_ip Remote agent's IP address (read-write)
local_mac Local interface's MAC address (read-only)
remote_mac Remote agent's MAC address (read-write)
+ ============== ================================= ============
The "enabled" attribute is also used to control whether the parameters of
a target can be updated or not -- you can modify the parameters of only
disabled targets (i.e. if "enabled" is 0).
-To update a target's parameters:
+To update a target's parameters::
cat enabled # check if enabled is 1
echo 0 > enabled # disable the target (if required)
@@ -140,12 +156,12 @@ Extended console:
If '+' is prefixed to the configuration line or "extended" config file
is set to 1, extended console support is enabled. An example boot
-param follows.
+param follows::
linux netconsole=+4444@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc
Log messages are transmitted with extended metadata header in the
-following format which is the same as /dev/kmsg.
+following format which is the same as /dev/kmsg::
<level>,<sequnum>,<timestamp>,<contflag>;<message text>
@@ -155,12 +171,12 @@ newline is used as the delimeter.
If a message doesn't fit in certain number of bytes (currently 1000),
the message is split into multiple fragments by netconsole. These
-fragments are transmitted with "ncfrag" header field added.
+fragments are transmitted with "ncfrag" header field added::
ncfrag=<byte-offset>/<total-bytes>
For example, assuming a lot smaller chunk size, a message "the first
-chunk, the 2nd chunk." may be split as follows.
+chunk, the 2nd chunk." may be split as follows::
6,416,1758426,-,ncfrag=0/31;the first chunk,
6,416,1758426,-,ncfrag=16/31; the 2nd chunk.
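On the receiving side, fragments carrying an "ncfrag" field have to be stitched back together. The following Python sketch shows one way a logging agent might do that, assuming fragments of one message share the same sequence number and may arrive out of order. `parse_fragment` and `Reassembler` are hypothetical names, and the code ignores everything in the header except the sequence number and "ncfrag".

```python
from typing import Dict, Optional, Set, Tuple


def parse_fragment(line: bytes) -> Tuple[int, int, int, bytes]:
    """Return (sequnum, byte offset, total bytes, payload) for one line."""
    header, _, payload = line.partition(b";")
    fields = header.split(b",")
    sequnum = int(fields[1])
    for field in fields[4:]:
        if field.startswith(b"ncfrag="):
            offset, _, total = field[len(b"ncfrag="):].partition(b"/")
            return sequnum, int(offset), int(total), payload
    # No ncfrag field: the line already holds the whole message.
    return sequnum, 0, len(payload), payload


class Reassembler:
    """Collect fragments per sequence number until every byte has arrived."""

    def __init__(self) -> None:
        self._pending: Dict[int, Tuple[bytearray, Set[int]]] = {}

    def feed(self, line: bytes) -> Optional[bytes]:
        sequnum, offset, total, payload = parse_fragment(line)
        buf, seen = self._pending.setdefault(
            sequnum, (bytearray(total), set()))
        buf[offset:offset + len(payload)] = payload
        seen.update(range(offset, offset + len(payload)))
        if len(seen) < total:
            return None                    # still waiting for more fragments
        del self._pending[sequnum]
        return bytes(buf)
```

Feeding the two example fragments above, in either order, returns the original 31-byte message once the second one arrives. A real agent would also want to expire incomplete entries, which this sketch does not do.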
@@ -168,39 +184,52 @@ chunk, the 2nd chunk." may be split as follows.
Miscellaneous notes:
====================
-WARNING: the default target ethernet setting uses the broadcast
-ethernet address to send packets, which can cause increased load on
-other systems on the same ethernet segment.
+.. Warning::
+
+ the default target ethernet setting uses the broadcast
+ ethernet address to send packets, which can cause increased load on
+ other systems on the same ethernet segment.
+
+.. Tip::
+
+ some LAN switches may be configured to suppress ethernet broadcasts
+ so it is advised to explicitly specify the remote agents' MAC addresses
+ from the config parameters passed to netconsole.
+
+.. Tip::
+
+ to find out the MAC address of, say, 10.0.0.2, you may try using::
+
+ ping -c 1 10.0.0.2 ; /sbin/arp -n | grep 10.0.0.2
-TIP: some LAN switches may be configured to suppress ethernet broadcasts
-so it is advised to explicitly specify the remote agents' MAC addresses
-from the config parameters passed to netconsole.
+.. Tip::
-TIP: to find out the MAC address of, say, 10.0.0.2, you may try using:
+ in case the remote logging agent is on a different LAN subnet than
+ the sender, it is suggested to try specifying the MAC address of the
+ default gateway (you may use /sbin/route -n to find it out) as the
+ remote MAC address instead.
- ping -c 1 10.0.0.2 ; /sbin/arp -n | grep 10.0.0.2
+.. note::
-TIP: in case the remote logging agent is on a separate LAN subnet than
-the sender, it is suggested to try specifying the MAC address of the
-default gateway (you may use /sbin/route -n to find it out) as the
-remote MAC address instead.
+ the network device (eth1 in the above case) can run any kind
+ of other network traffic, netconsole is not intrusive. Netconsole
+ might cause slight delays in other traffic if the volume of kernel
+ messages is high, but should have no other impact.
-NOTE: the network device (eth1 in the above case) can run any kind
-of other network traffic, netconsole is not intrusive. Netconsole
-might cause slight delays in other traffic if the volume of kernel
-messages is high, but should have no other impact.
+.. note::
-NOTE: if you find that the remote logging agent is not receiving or
-printing all messages from the sender, it is likely that you have set
-the "console_loglevel" parameter (on the sender) to only send high
-priority messages to the console. You can change this at runtime using:
+ if you find that the remote logging agent is not receiving or
+ printing all messages from the sender, it is likely that you have set
+ the "console_loglevel" parameter (on the sender) to only send high
+ priority messages to the console. You can change this at runtime using::
- dmesg -n 8
+ dmesg -n 8
-or by specifying "debug" on the kernel command line at boot, to send
-all kernel messages to the console. A specific value for this parameter
-can also be set using the "loglevel" kernel boot option. See the
-dmesg(8) man page and Documentation/admin-guide/kernel-parameters.rst for details.
+ or by specifying "debug" on the kernel command line at boot, to send
+ all kernel messages to the console. A specific value for this parameter
+ can also be set using the "loglevel" kernel boot option. See the
+ dmesg(8) man page and Documentation/admin-guide/kernel-parameters.rst
+ for details.
Netconsole was designed to be as instantaneous as possible, to
enable the logging of even the most critical kernel bugs. It works
diff --git a/Documentation/networking/netdev-FAQ.rst b/Documentation/networking/netdev-FAQ.rst
deleted file mode 100644
index d5c9320901c3..000000000000
--- a/Documentation/networking/netdev-FAQ.rst
+++ /dev/null
@@ -1,272 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-.. _netdev-FAQ:
-
-==========
-netdev FAQ
-==========
-
-Q: What is netdev?
-------------------
-A: It is a mailing list for all network-related Linux stuff. This
-includes anything found under net/ (i.e. core code like IPv6) and
-drivers/net (i.e. hardware specific drivers) in the Linux source tree.
-
-Note that some subsystems (e.g. wireless drivers) which have a high
-volume of traffic have their own specific mailing lists.
-
-The netdev list is managed (like many other Linux mailing lists) through
-VGER (http://vger.kernel.org/) and archives can be found below:
-
-- http://marc.info/?l=linux-netdev
-- http://www.spinics.net/lists/netdev/
-
-Aside from subsystems like that mentioned above, all network-related
-Linux development (i.e. RFC, review, comments, etc.) takes place on
-netdev.
-
-Q: How do the changes posted to netdev make their way into Linux?
------------------------------------------------------------------
-A: There are always two trees (git repositories) in play. Both are
-driven by David Miller, the main network maintainer. There is the
-``net`` tree, and the ``net-next`` tree. As you can probably guess from
-the names, the ``net`` tree is for fixes to existing code already in the
-mainline tree from Linus, and ``net-next`` is where the new code goes
-for the future release. You can find the trees here:
-
-- https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git
-- https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git
-
-Q: How often do changes from these trees make it to the mainline Linus tree?
-----------------------------------------------------------------------------
-A: To understand this, you need to know a bit of background information on
-the cadence of Linux development. Each new release starts off with a
-two week "merge window" where the main maintainers feed their new stuff
-to Linus for merging into the mainline tree. After the two weeks, the
-merge window is closed, and it is called/tagged ``-rc1``. No new
-features get mainlined after this -- only fixes to the rc1 content are
-expected. After roughly a week of collecting fixes to the rc1 content,
-rc2 is released. This repeats on a roughly weekly basis until rc7
-(typically; sometimes rc6 if things are quiet, or rc8 if things are in a
-state of churn), and a week after the last vX.Y-rcN was done, the
-official vX.Y is released.
-
-Relating that to netdev: At the beginning of the 2-week merge window,
-the ``net-next`` tree will be closed - no new changes/features. The
-accumulated new content of the past ~10 weeks will be passed onto
-mainline/Linus via a pull request for vX.Y -- at the same time, the
-``net`` tree will start accumulating fixes for this pulled content
-relating to vX.Y
-
-An announcement indicating when ``net-next`` has been closed is usually
-sent to netdev, but knowing the above, you can predict that in advance.
-
-IMPORTANT: Do not send new ``net-next`` content to netdev during the
-period during which ``net-next`` tree is closed.
-
-Shortly after the two weeks have passed (and vX.Y-rc1 is released), the
-tree for ``net-next`` reopens to collect content for the next (vX.Y+1)
-release.
-
-If you aren't subscribed to netdev and/or are simply unsure if
-``net-next`` has re-opened yet, simply check the ``net-next`` git
-repository link above for any new networking-related commits. You may
-also check the following website for the current status:
-
- http://vger.kernel.org/~davem/net-next.html
-
-The ``net`` tree continues to collect fixes for the vX.Y content, and is
-fed back to Linus at regular (~weekly) intervals. Meaning that the
-focus for ``net`` is on stabilization and bug fixes.
-
-Finally, the vX.Y gets released, and the whole cycle starts over.
-
-Q: So where are we now in this cycle?
-
-Load the mainline (Linus) page here:
-
- https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
-
-and note the top of the "tags" section. If it is rc1, it is early in
-the dev cycle. If it was tagged rc7 a week ago, then a release is
-probably imminent.
-
-Q: How do I indicate which tree (net vs. net-next) my patch should be in?
--------------------------------------------------------------------------
-A: Firstly, think whether you have a bug fix or new "next-like" content.
-Then once decided, assuming that you use git, use the prefix flag, i.e.
-::
-
- git format-patch --subject-prefix='PATCH net-next' start..finish
-
-Use ``net`` instead of ``net-next`` (always lower case) in the above for
-bug-fix ``net`` content. If you don't use git, then note the only magic
-in the above is just the subject text of the outgoing e-mail, and you
-can manually change it yourself with whatever MUA you are comfortable
-with.
-
-Q: I sent a patch and I'm wondering what happened to it?
---------------------------------------------------------
-Q: How can I tell whether it got merged?
-A: Start by looking at the main patchworks queue for netdev:
-
- http://patchwork.ozlabs.org/project/netdev/list/
-
-The "State" field will tell you exactly where things are at with your
-patch.
-
-Q: The above only says "Under Review". How can I find out more?
-----------------------------------------------------------------
-A: Generally speaking, the patches get triaged quickly (in less than
-48h). So be patient. Asking the maintainer for status updates on your
-patch is a good way to ensure your patch is ignored or pushed to the
-bottom of the priority list.
-
-Q: I submitted multiple versions of the patch series
-----------------------------------------------------
-Q: should I directly update patchwork for the previous versions of these
-patch series?
-A: No, please don't interfere with the patch status on patchwork, leave
-it to the maintainer to figure out what is the most recent and current
-version that should be applied. If there is any doubt, the maintainer
-will reply and ask what should be done.
-
-Q: I made changes to only a few patches in a patch series should I resend only those changed?
----------------------------------------------------------------------------------------------
-A: No, please resend the entire patch series and make sure you do number your
-patches such that it is clear this is the latest and greatest set of patches
-that can be applied.
-
-Q: I submitted multiple versions of a patch series and it looks like a version other than the last one has been accepted, what should I do?
--------------------------------------------------------------------------------------------------------------------------------------------
-A: There is no revert possible, once it is pushed out, it stays like that.
-Please send incremental versions on top of what has been merged in order to fix
-the patches the way they would look like if your latest patch series was to be
-merged.
-
-Q: How can I tell what patches are queued up for backporting to the various stable releases?
---------------------------------------------------------------------------------------------
-A: Normally Greg Kroah-Hartman collects stable commits himself, but for
-networking, Dave collects up patches he deems critical for the
-networking subsystem, and then hands them off to Greg.
-
-There is a patchworks queue that you can see here:
-
- http://patchwork.ozlabs.org/bundle/davem/stable/?state=*
-
-It contains the patches which Dave has selected, but not yet handed off
-to Greg. If Greg already has the patch, then it will be here:
-
- https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git
-
-A quick way to find whether the patch is in this stable-queue is to
-simply clone the repo, and then git grep the mainline commit ID, e.g.
-::
-
- stable-queue$ git grep -l 284041ef21fdf2e
- releases/3.0.84/ipv6-fix-possible-crashes-in-ip6_cork_release.patch
- releases/3.4.51/ipv6-fix-possible-crashes-in-ip6_cork_release.patch
- releases/3.9.8/ipv6-fix-possible-crashes-in-ip6_cork_release.patch
- stable/stable-queue$
-
-Q: I see a network patch and I think it should be backported to stable.
------------------------------------------------------------------------
-Q: Should I request it via stable@vger.kernel.org like the references in
-the kernel's Documentation/process/stable-kernel-rules.rst file say?
-A: No, not for networking. Check the stable queues as per above first
-to see if it is already queued. If not, then send a mail to netdev,
-listing the upstream commit ID and why you think it should be a stable
-candidate.
-
-Before you jump to go do the above, do note that the normal stable rules
-in :ref:`Documentation/process/stable-kernel-rules.rst <stable_kernel_rules>`
-still apply. So you need to explicitly indicate why it is a critical
-fix and exactly what users are impacted. In addition, you need to
-convince yourself that you *really* think it has been overlooked,
-vs. having been considered and rejected.
-
-Generally speaking, the longer it has had a chance to "soak" in
-mainline, the better the odds that it is an OK candidate for stable. So
-scrambling to request a commit be added the day after it appears should
-be avoided.
-
-Q: I have created a network patch and I think it should be backported to stable.
---------------------------------------------------------------------------------
-Q: Should I add a Cc: stable@vger.kernel.org like the references in the
-kernel's Documentation/ directory say?
-A: No. See above answer. In short, if you think it really belongs in
-stable, then ensure you write a decent commit log that describes who
-gets impacted by the bug fix and how it manifests itself, and when the
-bug was introduced. If you do that properly, then the commit will get
-handled appropriately and most likely get put in the patchworks stable
-queue if it really warrants it.
-
-If you think there is some valid information relating to it being in
-stable that does *not* belong in the commit log, then use the three dash
-marker line as described in
-:ref:`Documentation/process/submitting-patches.rst <the_canonical_patch_format>`
-to temporarily embed that information into the patch that you send.
-
-Q: Are all networking bug fixes backported to all stable releases?
-------------------------------------------------------------------
-A: Due to capacity, Dave could only take care of the backports for the
-last two stable releases. For earlier stable releases, each stable
-branch maintainer is supposed to take care of them. If you find any
-patch is missing from an earlier stable branch, please notify
-stable@vger.kernel.org with either a commit ID or a formal patch
-backported, and CC Dave and other relevant networking developers.
-
-Q: Is the comment style convention different for the networking content?
-------------------------------------------------------------------------
-A: Yes, in a largely trivial way. Instead of this::
-
- /*
- * foobar blah blah blah
- * another line of text
- */
-
-it is requested that you make it look like this::
-
- /* foobar blah blah blah
- * another line of text
- */
-
-Q: I am working in existing code that has the former comment style and not the latter.
---------------------------------------------------------------------------------------
-Q: Should I submit new code in the former style or the latter?
-A: Make it the latter style, so that eventually all code in the domain
-of netdev is of this format.
-
-Q: I found a bug that might have possible security implications or similar.
----------------------------------------------------------------------------
-Q: Should I mail the main netdev maintainer off-list?**
-A: No. The current netdev maintainer has consistently requested that
-people use the mailing lists and not reach out directly. If you aren't
-OK with that, then perhaps consider mailing security@kernel.org or
-reading about http://oss-security.openwall.org/wiki/mailing-lists/distros
-as possible alternative mechanisms.
-
-Q: What level of testing is expected before I submit my change?
----------------------------------------------------------------
-A: If your changes are against ``net-next``, the expectation is that you
-have tested by layering your changes on top of ``net-next``. Ideally
-you will have done run-time testing specific to your change, but at a
-minimum, your changes should survive an ``allyesconfig`` and an
-``allmodconfig`` build without new warnings or failures.
-
-Q: Any other tips to help ensure my net/net-next patch gets OK'd?
------------------------------------------------------------------
-A: Attention to detail. Re-read your own work as if you were the
-reviewer. You can start with using ``checkpatch.pl``, perhaps even with
-the ``--strict`` flag. But do not be mindlessly robotic in doing so.
-If your change is a bug fix, make sure your commit log indicates the
-end-user visible symptom, the underlying reason as to why it happens,
-and then if necessary, explain why the fix proposed is the best way to
-get things done. Don't mangle whitespace, and as is common, don't
-mis-indent function arguments that span multiple lines. If it is your
-first patch, mail it to yourself so you can test apply it to an
-unpatched tree to confirm infrastructure didn't mangle it.
-
-Finally, go back and read
-:ref:`Documentation/process/submitting-patches.rst <submittingpatches>`
-to be sure you are not repeating some common mistake documented there.
diff --git a/Documentation/networking/netdev-features.txt b/Documentation/networking/netdev-features.rst
index 58dd1c1e3c65..d7b15bb64deb 100644
--- a/Documentation/networking/netdev-features.txt
+++ b/Documentation/networking/netdev-features.rst
@@ -1,3 +1,6 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================================
Netdev features mess and how to get out from it alive
=====================================================
@@ -6,8 +9,8 @@ Author:
- Part I: Feature sets
-======================
+Part I: Feature sets
+====================
Long gone are the days when a network card would just take and give packets
verbatim. Today's devices add multiple features and bugs (read: offloads)
@@ -39,8 +42,8 @@ one used internally by network core:
- Part II: Controlling enabled features
-=======================================
+Part II: Controlling enabled features
+=====================================
When current feature set (netdev->features) is to be changed, new set
is calculated and filtered by calling ndo_fix_features callback
@@ -65,8 +68,8 @@ driver except by means of ndo_fix_features callback.
- Part III: Implementation hints
-================================
+Part III: Implementation hints
+==============================
* ndo_fix_features:
@@ -94,8 +97,8 @@ Errors returned are not (and cannot be) propagated anywhere except dmesg.
- Part IV: Features
-===================
+Part IV: Features
+=================
For current list of features, see include/linux/netdev_features.h.
This section describes semantics of some of them.
@@ -179,3 +182,24 @@ stricter than Hardware LRO. A packet stream merged by Hardware GRO must
be re-segmentable by GSO or TSO back to the exact original packet stream.
Hardware GRO is dependent on RXCSUM since every packet successfully merged
by hardware must also have the checksum verified by hardware.
+
+* hsr-tag-ins-offload
+
+This should be set for devices which insert an HSR (High-availability Seamless
+Redundancy) or PRP (Parallel Redundancy Protocol) tag automatically.
+
+* hsr-tag-rm-offload
+
+This should be set for devices which remove HSR (High-availability Seamless
+Redundancy) or PRP (Parallel Redundancy Protocol) tags automatically.
+
+* hsr-fwd-offload
+
+This should be set for devices which forward HSR (High-availability Seamless
+Redundancy) frames from one port to another in hardware.
+
+* hsr-dup-offload
+
+This should be set for devices which automatically duplicate outgoing HSR
+(High-availability Seamless Redundancy) or PRP (Parallel Redundancy Protocol)
+frames in hardware.
diff --git a/Documentation/networking/netdevices.rst b/Documentation/networking/netdevices.rst
new file mode 100644
index 000000000000..9e4cccb90b87
--- /dev/null
+++ b/Documentation/networking/netdevices.rst
@@ -0,0 +1,299 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+Network Devices, the Kernel, and You!
+=====================================
+
+
+Introduction
+============
+The following is a random collection of documentation regarding
+network devices.
+
+struct net_device lifetime rules
+================================
+Network device structures need to persist even after module is unloaded and
+must be allocated with alloc_netdev_mqs() and friends.
+If device has registered successfully, it will be freed on last use
+by free_netdev(). This is required to handle the pathological case cleanly
+(example: ``rmmod mydriver </sys/class/net/myeth/mtu``)
+
+alloc_netdev_mqs() / alloc_netdev() reserve extra space for driver
+private data which gets freed when the network device is freed. If
+separately allocated data is attached to the network device
+(netdev_priv()) then it is up to the module exit handler to free that.
+
+There are two groups of APIs for registering struct net_device.
+First group can be used in normal contexts where ``rtnl_lock`` is not already
+held: register_netdev(), unregister_netdev().
+Second group can be used when ``rtnl_lock`` is already held:
+register_netdevice(), unregister_netdevice(), free_netdev().
+
+Simple drivers
+--------------
+
+Most drivers (especially device drivers) handle lifetime of struct net_device
+in context where ``rtnl_lock`` is not held (e.g. driver probe and remove paths).
+
+In that case the struct net_device registration is done using
+the register_netdev() and unregister_netdev() functions:
+
+.. code-block:: c
+
+  int probe()
+  {
+    struct my_device_priv *priv;
+    int err;
+
+    dev = alloc_netdev_mqs(...);
+    if (!dev)
+      return -ENOMEM;
+    priv = netdev_priv(dev);
+
+    /* ... do all device setup before calling register_netdev() ... */
+
+    err = register_netdev(dev);
+    if (err)
+      goto err_undo;
+
+    /* net_device is visible to the user! */
+    return 0;
+
+  err_undo:
+    /* ... undo the device setup ... */
+    free_netdev(dev);
+    return err;
+  }
+
+  void remove()
+  {
+    unregister_netdev(dev);
+    free_netdev(dev);
+  }
+
+Note that after calling register_netdev() the device is visible in the system.
+Users can open it and start sending / receiving traffic immediately,
+or run any other callback, so all initialization must be done prior to
+registration.
+
+unregister_netdev() closes the device and waits for all users to be done
+with it. The memory of struct net_device itself may still be referenced
+by sysfs but all operations on that device will fail.
+
+free_netdev() can be called after unregister_netdev() returns or when
+register_netdev() failed.
+
+Device management under RTNL
+----------------------------
+
+Registering struct net_device while in context which already holds
+the ``rtnl_lock`` requires extra care. In those scenarios most drivers
+will want to make use of struct net_device's ``needs_free_netdev``
+and ``priv_destructor`` members for freeing of state.
+
+Example flow of netdev handling under ``rtnl_lock``:
+
+.. code-block:: c
+
+  static void my_setup(struct net_device *dev)
+  {
+    dev->needs_free_netdev = true;
+  }
+
+  static void my_destructor(struct net_device *dev)
+  {
+    struct my_device_priv *priv = netdev_priv(dev);
+
+    some_obj_destroy(priv->obj);
+    some_uninit(priv);
+  }
+
+  int create_link()
+  {
+    struct my_device_priv *priv;
+    struct net_device *dev;
+    int err;
+
+    ASSERT_RTNL();
+
+    dev = alloc_netdev(sizeof(*priv), "net%d", NET_NAME_UNKNOWN, my_setup);
+    if (!dev)
+      return -ENOMEM;
+    priv = netdev_priv(dev);
+
+    /* Implicit constructor */
+    err = some_init(priv);
+    if (err)
+      goto err_free_dev;
+
+    priv->obj = some_obj_create();
+    if (!priv->obj) {
+      err = -ENOMEM;
+      goto err_some_uninit;
+    }
+    /* End of constructor, set the destructor: */
+    dev->priv_destructor = my_destructor;
+
+    err = register_netdevice(dev);
+    if (err)
+      /* register_netdevice() calls destructor on failure */
+      goto err_free_dev;
+
+    /* If anything fails now unregister_netdevice() (or unregister_netdev())
+     * will take care of calling my_destructor and free_netdev().
+     */
+
+    return 0;
+
+  err_some_uninit:
+    some_uninit(priv);
+  err_free_dev:
+    free_netdev(dev);
+    return err;
+  }
+
+If struct net_device.priv_destructor is set it will be called by the core
+some time after unregister_netdevice(); it will also be called if
+register_netdevice() fails. The callback may be invoked with or without
+``rtnl_lock`` held.
+
+There is no explicit constructor callback, driver "constructs" the private
+netdev state after allocating it and before registration.
+
+Setting struct net_device.needs_free_netdev makes core call free_netdev()
+automatically after unregister_netdevice() when all references to the device
+are gone. It only takes effect after a successful call to register_netdevice()
+so if register_netdevice() fails driver is responsible for calling
+free_netdev().
+
+free_netdev() is safe to call on error paths right after unregister_netdevice()
+or when register_netdevice() fails. Parts of netdev (de)registration process
+happen after ``rtnl_lock`` is released, therefore in those cases free_netdev()
+will defer some of the processing until ``rtnl_lock`` is released.
+
+Devices spawned from struct rtnl_link_ops should never free the
+struct net_device directly.
+
+.ndo_init and .ndo_uninit
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``.ndo_init`` and ``.ndo_uninit`` callbacks are called during net_device
+registration and de-registration, under ``rtnl_lock``. Drivers can use
+those e.g. when parts of their init process need to run under ``rtnl_lock``.
+
+``.ndo_init`` runs before device is visible in the system, ``.ndo_uninit``
+runs during de-registering after device is closed but other subsystems
+may still have outstanding references to the netdevice.
+
+MTU
+===
+Each network device has a Maximum Transfer Unit (MTU). The MTU does not
+include any link layer protocol overhead. Upper layer protocols must
+not pass a socket buffer (skb) to a device to transmit with more data
+than the MTU. For example, on Ethernet with the standard MTU of 1500
+bytes, the actual skb will contain up to 1514 bytes because of the
+Ethernet header. Devices should allow for the 4 byte VLAN header as well.
+
+Segmentation Offload (GSO, TSO) is an exception to this rule. The
+upper layer protocol may pass a large socket buffer to the device
+transmit routine, and the device will break that up into separate
+packets based on the current MTU.
+
+MTU is symmetrical and applies both to receive and transmit. A device
+must be able to receive at least the maximum size packet allowed by
+the MTU. A network device may use the MTU as a mechanism to size receive
+buffers, but the device should allow packets with a VLAN header. With
+the standard Ethernet MTU of 1500 bytes, the device should allow up to
+1518 byte packets (1500 + 14 header + 4 tag). The device may drop,
+truncate, or pass up oversize packets, but dropping oversize packets
+is preferred.
+
+
+struct net_device synchronization rules
+=======================================
+ndo_open:
+ Synchronization: rtnl_lock() semaphore.
+ Context: process
+
+ndo_stop:
+ Synchronization: rtnl_lock() semaphore.
+ Context: process
+ Note: netif_running() is guaranteed false
+
+ndo_do_ioctl:
+ Synchronization: rtnl_lock() semaphore.
+ Context: process
+
+ This is only called by network subsystems internally,
+ not by user space calling ioctl as it was before
+ linux-5.14.
+
+ndo_siocbond:
+ Synchronization: rtnl_lock() semaphore.
+ Context: process
+
+ Used by the bonding driver for the SIOCBOND family of
+ ioctl commands.
+
+ndo_siocwandev:
+ Synchronization: rtnl_lock() semaphore.
+ Context: process
+
+ Used by the drivers/net/wan framework to handle
+ the SIOCWANDEV ioctl with the if_settings structure.
+
+ndo_siocdevprivate:
+ Synchronization: rtnl_lock() semaphore.
+ Context: process
+
+ This is used to implement SIOCDEVPRIVATE ioctl helpers.
+ These should not be added to new drivers, so don't use.
+
+ndo_eth_ioctl:
+ Synchronization: rtnl_lock() semaphore.
+ Context: process
+
+ndo_get_stats:
+ Synchronization: rtnl_lock() semaphore, dev_base_lock rwlock, or RCU.
+ Context: atomic (can't sleep under rwlock or RCU)
+
+ndo_start_xmit:
+ Synchronization: __netif_tx_lock spinlock.
+
+ When the driver sets NETIF_F_LLTX in dev->features this will be
+ called without holding netif_tx_lock. In this case the driver
+ has to lock by itself when needed.
+ The locking there should also properly protect against
+ set_rx_mode. WARNING: use of NETIF_F_LLTX is deprecated.
+ Don't use it for new drivers.
+
+ Context: Process with BHs disabled or BH (timer),
+ will be called with interrupts disabled by netconsole.
+
+ Return codes:
+
+ * NETDEV_TX_OK everything ok.
+ * NETDEV_TX_BUSY Cannot transmit packet, try later
+ Usually a bug, means queue start/stop flow control is broken in
+ the driver. Note: the driver must NOT put the skb in its DMA ring.
+
+ndo_tx_timeout:
+ Synchronization: netif_tx_lock spinlock; all TX queues frozen.
+ Context: BHs disabled
+ Notes: netif_queue_stopped() is guaranteed true
+
+ndo_set_rx_mode:
+ Synchronization: netif_addr_lock spinlock.
+ Context: BHs disabled
+
+struct napi_struct synchronization rules
+========================================
+napi->poll:
+ Synchronization:
+ NAPI_STATE_SCHED bit in napi->state. Device
+ driver's ndo_stop method will invoke napi_disable() on
+ all NAPI instances which will do a sleeping poll on the
+ NAPI_STATE_SCHED napi->state bit, waiting for all pending
+ NAPI activity to cease.
+
+ Context:
+ softirq
+ will be called with interrupts disabled by netconsole.
diff --git a/Documentation/networking/netdevices.txt b/Documentation/networking/netdevices.txt
deleted file mode 100644
index 7fec2061a334..000000000000
--- a/Documentation/networking/netdevices.txt
+++ /dev/null
@@ -1,104 +0,0 @@
-
-Network Devices, the Kernel, and You!
-
-
-Introduction
-============
-The following is a random collection of documentation regarding
-network devices.
-
-struct net_device allocation rules
-==================================
-Network device structures need to persist even after module is unloaded and
-must be allocated with alloc_netdev_mqs() and friends.
-If device has registered successfully, it will be freed on last use
-by free_netdev(). This is required to handle the pathologic case cleanly
-(example: rmmod mydriver </sys/class/net/myeth/mtu )
-
-alloc_netdev_mqs()/alloc_netdev() reserve extra space for driver
-private data which gets freed when the network device is freed. If
-separately allocated data is attached to the network device
-(netdev_priv(dev)) then it is up to the module exit handler to free that.
-
-MTU
-===
-Each network device has a Maximum Transfer Unit. The MTU does not
-include any link layer protocol overhead. Upper layer protocols must
-not pass a socket buffer (skb) to a device to transmit with more data
-than the mtu. The MTU does not include link layer header overhead, so
-for example on Ethernet if the standard MTU is 1500 bytes used, the
-actual skb will contain up to 1514 bytes because of the Ethernet
-header. Devices should allow for the 4 byte VLAN header as well.
-
-Segmentation Offload (GSO, TSO) is an exception to this rule. The
-upper layer protocol may pass a large socket buffer to the device
-transmit routine, and the device will break that up into separate
-packets based on the current MTU.
-
-MTU is symmetrical and applies both to receive and transmit. A device
-must be able to receive at least the maximum size packet allowed by
-the MTU. A network device may use the MTU as mechanism to size receive
-buffers, but the device should allow packets with VLAN header. With
-standard Ethernet mtu of 1500 bytes, the device should allow up to
-1518 byte packets (1500 + 14 header + 4 tag). The device may either:
-drop, truncate, or pass up oversize packets, but dropping oversize
-packets is preferred.
-
-
-struct net_device synchronization rules
-=======================================
-ndo_open:
- Synchronization: rtnl_lock() semaphore.
- Context: process
-
-ndo_stop:
- Synchronization: rtnl_lock() semaphore.
- Context: process
- Note: netif_running() is guaranteed false
-
-ndo_do_ioctl:
- Synchronization: rtnl_lock() semaphore.
- Context: process
-
-ndo_get_stats:
- Synchronization: dev_base_lock rwlock.
- Context: nominally process, but don't sleep inside an rwlock
-
-ndo_start_xmit:
- Synchronization: __netif_tx_lock spinlock.
-
- When the driver sets NETIF_F_LLTX in dev->features this will be
- called without holding netif_tx_lock. In this case the driver
- has to lock by itself when needed.
- The locking there should also properly protect against
- set_rx_mode. WARNING: use of NETIF_F_LLTX is deprecated.
- Don't use it for new drivers.
-
- Context: Process with BHs disabled or BH (timer),
- will be called with interrupts disabled by netconsole.
-
- Return codes:
- o NETDEV_TX_OK everything ok.
- o NETDEV_TX_BUSY Cannot transmit packet, try later
- Usually a bug, means queue start/stop flow control is broken in
- the driver. Note: the driver must NOT put the skb in its DMA ring.
-
-ndo_tx_timeout:
- Synchronization: netif_tx_lock spinlock; all TX queues frozen.
- Context: BHs disabled
- Notes: netif_queue_stopped() is guaranteed true
-
-ndo_set_rx_mode:
- Synchronization: netif_addr_lock spinlock.
- Context: BHs disabled
-
-struct napi_struct synchronization rules
-========================================
-napi->poll:
- Synchronization: NAPI_STATE_SCHED bit in napi->state. Device
- driver's ndo_stop method will invoke napi_disable() on
- all NAPI instances which will do a sleeping poll on the
- NAPI_STATE_SCHED napi->state bit, waiting for all pending
- NAPI activity to cease.
- Context: softirq
- will be called with interrupts disabled by netconsole.
diff --git a/Documentation/networking/netfilter-sysctl.txt b/Documentation/networking/netfilter-sysctl.rst
index 55791e50e169..beb6d7b275d4 100644
--- a/Documentation/networking/netfilter-sysctl.txt
+++ b/Documentation/networking/netfilter-sysctl.rst
@@ -1,8 +1,15 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+Netfilter Sysctl variables
+==========================
+
/proc/sys/net/netfilter/* Variables:
+====================================
nf_log_all_netns - BOOLEAN
- 0 - disabled (default)
- not 0 - enabled
+ - 0 - disabled (default)
+ - not 0 - enabled
By default, only init_net namespace can log packets into kernel log
with LOG target; this aims to prevent containers from flooding host
diff --git a/Documentation/networking/netif-msg.rst b/Documentation/networking/netif-msg.rst
new file mode 100644
index 000000000000..b20d265a734d
--- /dev/null
+++ b/Documentation/networking/netif-msg.rst
@@ -0,0 +1,95 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+NETIF Msg Level
+===============
+
+The design of the network interface message level setting.
+
+History
+-------
+
+ The design of the debugging message interface was guided and
+ constrained by backwards compatibility with previous practice. It is useful
+ to understand the history and evolution in order to understand current
+ practice and relate it to older driver source code.
+
+ From the beginning of Linux, each network device driver has had a local
+ integer variable that controls the debug message level. The message
+ level ranged from 0 to 7, and monotonically increased in verbosity.
+
+ The message levels were not precisely defined past level 3, but were
+ always implemented within +-1 of the specified level. Drivers tended
+ to shed the more verbose level messages as they matured.
+
+ - 0 Minimal messages, only essential information on fatal errors.
+ - 1 Standard messages, initialization status. No run-time messages
+ - 2 Special media selection messages, generally timer-driven.
+ - 3 Interface starts and stops, including normal status messages
+ - 4 Tx and Rx frame error messages, and abnormal driver operation
+ - 5 Tx packet queue information, interrupt events.
+ - 6 Status on each completed Tx packet and received Rx packets
+ - 7 Initial contents of Tx and Rx packets
+
+ Initially this message level variable was uniquely named in each driver
+ e.g. "lance_debug", so that a kernel symbolic debugger could locate and
+ modify the setting. When kernel modules became common, the variables
+ were consistently renamed to "debug" and allowed to be set as a module
+ parameter.
+
+ This approach worked well. However there is always a demand for
+ additional features. Over the years the following emerged as
+ reasonable and easily implemented enhancements:
+
+ - Using an ioctl() call to modify the level.
+ - Per-interface rather than per-driver message level setting.
+ - More selective control over the type of messages emitted.
+
+ The netif_msg recommendation adds these features with only a minor
+ complexity and code size increase.
+
+ The recommendation consists of the following points:
+
+ - Retaining the per-driver integer variable "debug" as a module
+ parameter with a default level of '1'.
+
+ - Adding a per-interface private variable named "msg_enable". The
+ variable is a bit map rather than a level, and is initialized as::
+
+ 1 << debug
+
+ Or more precisely::
+
+ debug < 0 ? 0 : 1 << min(sizeof(int)-1, debug)
+
+ Messages should change from::
+
+ if (debug > 1)
+         printk(KERN_DEBUG "%s: ...
+
+ to::
+
+ if (np->msg_enable & NETIF_MSG_LINK)
+         printk(KERN_DEBUG "%s: ...
+
+
+The set of message levels is named
+
+
+ ========= =================== ============
+ Old level Name Bit position
+ ========= =================== ============
+ 0 NETIF_MSG_DRV 0x0001
+ 1 NETIF_MSG_PROBE 0x0002
+ 2 NETIF_MSG_LINK 0x0004
+ 2 NETIF_MSG_TIMER 0x0004
+ 3 NETIF_MSG_IFDOWN 0x0008
+ 3 NETIF_MSG_IFUP 0x0008
+ 4 NETIF_MSG_RX_ERR 0x0010
+ 4 NETIF_MSG_TX_ERR 0x0010
+ 5 NETIF_MSG_TX_QUEUED 0x0020
+ 5 NETIF_MSG_INTR 0x0020
+ 6 NETIF_MSG_TX_DONE 0x0040
+ 6 NETIF_MSG_RX_STATUS 0x0040
+ 7 NETIF_MSG_PKTDATA 0x0080
+ ========= =================== ============
diff --git a/Documentation/networking/netif-msg.txt b/Documentation/networking/netif-msg.txt
deleted file mode 100644
index c967ddb90d0b..000000000000
--- a/Documentation/networking/netif-msg.txt
+++ /dev/null
@@ -1,79 +0,0 @@
-
-________________
-NETIF Msg Level
-
-The design of the network interface message level setting.
-
-History
-
- The design of the debugging message interface was guided and
- constrained by backwards compatibility previous practice. It is useful
- to understand the history and evolution in order to understand current
- practice and relate it to older driver source code.
-
- From the beginning of Linux, each network device driver has had a local
- integer variable that controls the debug message level. The message
- level ranged from 0 to 7, and monotonically increased in verbosity.
-
- The message level was not precisely defined past level 3, but were
- always implemented within +-1 of the specified level. Drivers tended
- to shed the more verbose level messages as they matured.
- 0 Minimal messages, only essential information on fatal errors.
- 1 Standard messages, initialization status. No run-time messages
- 2 Special media selection messages, generally timer-driver.
- 3 Interface starts and stops, including normal status messages
- 4 Tx and Rx frame error messages, and abnormal driver operation
- 5 Tx packet queue information, interrupt events.
- 6 Status on each completed Tx packet and received Rx packets
- 7 Initial contents of Tx and Rx packets
-
- Initially this message level variable was uniquely named in each driver
- e.g. "lance_debug", so that a kernel symbolic debugger could locate and
- modify the setting. When kernel modules became common, the variables
- were consistently renamed to "debug" and allowed to be set as a module
- parameter.
-
- This approach worked well. However there is always a demand for
- additional features. Over the years the following emerged as
- reasonable and easily implemented enhancements
- Using an ioctl() call to modify the level.
- Per-interface rather than per-driver message level setting.
- More selective control over the type of messages emitted.
-
- The netif_msg recommendation adds these features with only a minor
- complexity and code size increase.
-
- The recommendation is the following points
- Retaining the per-driver integer variable "debug" as a module
- parameter with a default level of '1'.
-
- Adding a per-interface private variable named "msg_enable". The
- variable is a bit map rather than a level, and is initialized as
- 1 << debug
- Or more precisely
- debug < 0 ? 0 : 1 << min(sizeof(int)-1, debug)
-
- Messages should changes from
- if (debug > 1)
- printk(MSG_DEBUG "%s: ...
- to
- if (np->msg_enable & NETIF_MSG_LINK)
- printk(MSG_DEBUG "%s: ...
-
-
-The set of message levels is named
- Old level Name Bit position
- 0 NETIF_MSG_DRV 0x0001
- 1 NETIF_MSG_PROBE 0x0002
- 2 NETIF_MSG_LINK 0x0004
- 2 NETIF_MSG_TIMER 0x0004
- 3 NETIF_MSG_IFDOWN 0x0008
- 3 NETIF_MSG_IFUP 0x0008
- 4 NETIF_MSG_RX_ERR 0x0010
- 4 NETIF_MSG_TX_ERR 0x0010
- 5 NETIF_MSG_TX_QUEUED 0x0020
- 5 NETIF_MSG_INTR 0x0020
- 6 NETIF_MSG_TX_DONE 0x0040
- 6 NETIF_MSG_RX_STATUS 0x0040
- 7 NETIF_MSG_PKTDATA 0x0080
-
diff --git a/Documentation/networking/nexthop-group-resilient.rst b/Documentation/networking/nexthop-group-resilient.rst
new file mode 100644
index 000000000000..fabecee24d85
--- /dev/null
+++ b/Documentation/networking/nexthop-group-resilient.rst
@@ -0,0 +1,293 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+Resilient Next-hop Groups
+=========================
+
+Resilient groups are a type of next-hop group that is aimed at minimizing
+disruption in flow routing across changes to the group composition and
+weights of constituent next hops.
+
+The idea behind resilient hashing groups is best explained in contrast to
+the legacy multipath next-hop group, which uses the hash-threshold
+algorithm, described in RFC 2992.
+
+To select a next hop, the hash-threshold algorithm first assigns a range of
+hashes to each next hop in the group, and then selects the next hop by
+comparing the SKB hash with the individual ranges. When a next hop is
+removed from the group, the ranges are recomputed, which leads to
+reassignment of parts of hash space from one next hop to another. RFC 2992
+illustrates it thus::
+
+ +-------+-------+-------+-------+-------+
+ | 1 | 2 | 3 | 4 | 5 |
+ +-------+-+-----+---+---+-----+-+-------+
+ | 1 | 2 | 4 | 5 |
+ +---------+---------+---------+---------+
+
+ Before and after deletion of next hop 3
+ under the hash-threshold algorithm.
+
+Note how next hop 2 gave up part of the hash space in favor of next hop 1,
+and 4 in favor of 5. While there will usually be some overlap between the
+previous and the new distribution, some traffic flows change the next hop
+that they resolve to.
+
+If a multipath group is used for load-balancing between multiple servers,
+this hash space reassignment causes an issue that packets from a single
+flow suddenly end up arriving at a server that does not expect them. This
+can result in TCP connections being reset.
+
+If a multipath group is used for load-balancing among available paths to
+the same server, the issue is that different latencies and reordering along
+the way cause the packets to arrive in the wrong order, resulting in
+degraded application performance.
+
+To mitigate the above-mentioned flow redirection, resilient next-hop groups
+insert another layer of indirection between the hash space and its
+constituent next hops: a hash table. The selection algorithm uses SKB hash
+to choose a hash table bucket, then reads the next hop that this bucket
+contains, and forwards traffic there.
+
+This indirection brings an important feature. In the hash-threshold
+algorithm, the range of hashes associated with a next hop must be
+contiguous. With a hash table, the mapping between the hash table buckets and
+the individual next hops is arbitrary. Therefore when a next hop is deleted
+the buckets that held it are simply reassigned to other next hops::
+
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ v v v v
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+ Before and after deletion of next hop 3
+ under the resilient hashing algorithm.
+
+When weights of next hops in a group are altered, it may be possible to
+choose a subset of buckets that are currently not used for forwarding
+traffic, and use those to satisfy the new next-hop distribution demands,
+keeping the "busy" buckets intact. This way, established flows are ideally
+kept being forwarded to the same endpoints through the same paths as before
+the next-hop group change.
+
+Algorithm
+---------
+
+In a nutshell, the algorithm works as follows. Each next hop deserves a
+certain number of buckets, according to its weight and the number of
+buckets in the hash table. In accordance with the source code, we will call
+this number a "wants count" of a next hop. In case of an event that might
+cause bucket allocation change, the wants counts for individual next hops
+are updated.
+
+Next hops that have fewer buckets than their wants count are called
+"underweight". Those that have more are "overweight". If there are no
+overweight (and therefore no underweight) next hops in the group, it is
+said to be "balanced".
+
+Each bucket maintains a last-used timer. Every time a packet is forwarded
+through a bucket, this timer is updated to current jiffies value. One
+attribute of a resilient group is then the "idle timer", which is the
+amount of time that a bucket must not be hit by traffic in order for it to
+be considered "idle". Buckets that are not idle are busy.
+
+After assigning wants counts to next hops, an "upkeep" algorithm runs. For
+buckets:
+
+1) that have no assigned next hop, or
+2) whose next hop has been removed, or
+3) that are idle and their next hop is overweight,
+
+upkeep changes the next hop that the bucket references to one of the
+underweight next hops. If, after considering all buckets in this manner,
+there are still underweight next hops, another upkeep run is scheduled to a
+future time.
+
+There may not be enough "idle" buckets to satisfy the updated wants counts
+of all next hops. Another attribute of a resilient group is the "unbalanced
+timer". This timer can be set to 0, in which case the table will stay out
+of balance until idle buckets do appear, possibly never. If set to a
+non-zero value, the value represents the period of time that the table is
+permitted to stay out of balance.
+
+With this in mind, we update the above list of conditions with one more
+item. Thus buckets:
+
+4) whose next hop is overweight, and the amount of time that the table has
+ been out of balance exceeds the unbalanced timer, if that is non-zero,
+
+\... are migrated as well.
+
+Offloading & Driver Feedback
+----------------------------
+
+When offloading resilient groups, the algorithm that distributes buckets
+among next hops is still the one in SW. Drivers are notified of updates to
+next hop groups in the following three ways:
+
+- Full group notification with the type
+ ``NH_NOTIFIER_INFO_TYPE_RES_TABLE``. This is used just after the group is
+ created and buckets populated for the first time.
+
+- Single-bucket notifications of the type
+ ``NH_NOTIFIER_INFO_TYPE_RES_BUCKET``, which is used for notifications of
+ individual migrations within an already-established group.
+
+- Pre-replace notification, ``NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE``. This
+ is sent before the group is replaced, and is a way for the driver to veto
+ the group before committing anything to the HW.
+
+Some single-bucket notifications are forced, as indicated by the "force"
+flag in the notification. Those are used for the cases where e.g. the next
+hop associated with the bucket was removed, and the bucket really must be
+migrated.
+
+Non-forced notifications can be overridden by the driver by returning an
+error code. The use case for this is that the driver notifies the HW that a
+bucket should be migrated, but the HW discovers that the bucket has in fact
+been hit by traffic.
+
+A second way for the HW to report that a bucket is busy is through the
+``nexthop_res_grp_activity_update()`` API. The buckets identified this way
+as busy are treated as if traffic hit them.
+
+Offloaded buckets should be flagged as either "offload" or "trap". This is
+done through the ``nexthop_bucket_set_hw_flags()`` API.
+
+Netlink UAPI
+------------
+
+Resilient Group Replacement
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Resilient groups are configured using the ``RTM_NEWNEXTHOP`` message in the
+same manner as other multipath groups. The following changes apply to the
+attributes passed in the netlink message:
+
+ =================== =========================================================
+ ``NHA_GROUP_TYPE``  Should be ``NEXTHOP_GRP_TYPE_RES`` for resilient group.
+ ``NHA_RES_GROUP``   A nest that contains attributes specific to resilient
+                     groups.
+ =================== =========================================================
+
+``NHA_RES_GROUP`` payload:
+
+ =================================== =========================================
+ ``NHA_RES_GROUP_BUCKETS``           Number of buckets in the hash table.
+ ``NHA_RES_GROUP_IDLE_TIMER``        Idle timer in units of clock_t.
+ ``NHA_RES_GROUP_UNBALANCED_TIMER``  Unbalanced timer in units of clock_t.
+ =================================== =========================================
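
Since both timers are expressed in clock_t rather than seconds, a userspace
tool has to convert before filling in the attributes. A minimal sketch,
assuming the usual Linux userspace tick rate of 100 per second (``USER_HZ``
and the helper name are assumptions, not part of the text above):

```python
USER_HZ = 100  # assumption: clock_t ticks per second as seen by userspace


def seconds_to_clock_t(seconds):
    """Convert a timer value in seconds to clock_t units."""
    return int(round(seconds * USER_HZ))


if __name__ == "__main__":
    # Values matching "idle_timer 60 unbalanced_timer 300"
    print(seconds_to_clock_t(60), seconds_to_clock_t(300))
```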
+
+Next Hop Get
+^^^^^^^^^^^^
+
+Requests to get resilient next-hop groups use the ``RTM_GETNEXTHOP``
+message in exactly the same way as other next hop get requests. The
+response attributes match the replacement attributes cited above, except
+``NHA_RES_GROUP`` payload will include the following attribute:
+
+ =================================== =========================================
+ ``NHA_RES_GROUP_UNBALANCED_TIME``   How long has the resilient group been out
+                                     of balance, in units of clock_t.
+ =================================== =========================================
+
+Bucket Get
+^^^^^^^^^^
+
+The message ``RTM_GETNEXTHOPBUCKET`` without the ``NLM_F_DUMP`` flag is
+used to request a single bucket. The attributes recognized at get requests
+are:
+
+ =================== =========================================================
+ ``NHA_ID``          ID of the next-hop group that the bucket belongs to.
+ ``NHA_RES_BUCKET``  A nest that contains attributes specific to bucket.
+ =================== =========================================================
+
+``NHA_RES_BUCKET`` payload:
+
+ ======================== ====================================================
+ ``NHA_RES_BUCKET_INDEX`` Index of bucket in the resilient table.
+ ======================== ====================================================
+
+Bucket Dumps
+^^^^^^^^^^^^
+
+The message ``RTM_GETNEXTHOPBUCKET`` with the ``NLM_F_DUMP`` flag is used
+to request a dump of matching buckets. The attributes recognized at dump
+requests are:
+
+ =================== =========================================================
+ ``NHA_ID``          If specified, limits the dump to just the next-hop group
+                     with this ID.
+ ``NHA_OIF``         If specified, limits the dump to buckets that contain
+                     next hops that use the device with this ifindex.
+ ``NHA_MASTER``      If specified, limits the dump to buckets that contain
+                     next hops that use a device in the VRF with this ifindex.
+ ``NHA_RES_BUCKET``  A nest that contains attributes specific to bucket.
+ =================== =========================================================
+
+``NHA_RES_BUCKET`` payload:
+
+ ======================== ====================================================
+ ``NHA_RES_BUCKET_NH_ID`` If specified, limits the dump to just the buckets
+                          that contain the next hop with this ID.
+ ======================== ====================================================
+
+Usage
+-----
+
+To illustrate the usage, consider the following commands::
+
+ # ip nexthop add id 1 via 192.0.2.2 dev eth0
+ # ip nexthop add id 2 via 192.0.2.3 dev eth0
+ # ip nexthop add id 10 group 1/2 type resilient \
+ buckets 8 idle_timer 60 unbalanced_timer 300
+
+The last command creates a resilient next-hop group. It will have 8 buckets
+(which is an unusually low number, used here for demonstration purposes
+only), each bucket will be considered idle when no traffic hits it for at
+least 60 seconds, and if the table remains out of balance for 300 seconds,
+it will be forcefully brought into balance.
+
+Changing next-hop weights leads to a change in bucket allocation::
+
+ # ip nexthop replace id 10 group 1,3/2 type resilient
+
+This can be confirmed by looking at individual buckets::
+
+ # ip nexthop bucket show id 10
+ id 10 index 0 idle_time 5.59 nhid 1
+ id 10 index 1 idle_time 5.59 nhid 1
+ id 10 index 2 idle_time 8.74 nhid 2
+ id 10 index 3 idle_time 8.74 nhid 2
+ id 10 index 4 idle_time 8.74 nhid 1
+ id 10 index 5 idle_time 8.74 nhid 1
+ id 10 index 6 idle_time 8.74 nhid 1
+ id 10 index 7 idle_time 8.74 nhid 1
+
+Note the two buckets that have a shorter idle time. Those are the ones that
+were migrated after the next-hop replace command to satisfy the new demand
+that next hop 1 be given 6 buckets instead of 4.
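
The bucket counts in this example (4/4 before the replace, 6/2 after) follow
from splitting the table in proportion to the weights. The sketch below is one
way to do that split with cumulative rounding so the counts always sum to the
table size. It is an illustration only: the function name is hypothetical and
the kernel's exact rounding is not claimed here.

```python
def wanted_buckets(weights, num_buckets):
    """Split num_buckets among next hops in proportion to their weights."""
    total = sum(weights)
    counts = []
    cumulative = 0
    prev_bound = 0
    for w in weights:
        cumulative += w
        # Integer round-half-up of cumulative * num_buckets / total
        bound = (2 * cumulative * num_buckets + total) // (2 * total)
        counts.append(bound - prev_bound)
        prev_bound = bound
    return counts


if __name__ == "__main__":
    print(wanted_buckets([1, 1], 8))  # "group 1/2"
    print(wanted_buckets([3, 1], 8))  # "group 1,3/2"
```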
+
+Netdevsim
+---------
+
+The netdevsim driver implements a mock offload of resilient groups, and
+exposes a debugfs interface that allows marking individual buckets as busy.
+For example, the following will mark bucket 23 in next-hop group 10 as
+active::
+
+ # echo 10 23 > /sys/kernel/debug/netdevsim/netdevsim10/fib/nexthop_bucket_activity
+
+In addition, another debugfs interface can be used to make the next
+attempt to migrate a bucket fail::
+
+ # echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace
+
+Besides serving as an example, the interfaces that netdevsim exposes are
+useful in automated testing, and
+``tools/testing/selftests/drivers/net/netdevsim/nexthop.sh`` makes use of
+them to test the algorithm.
diff --git a/Documentation/networking/nf_conntrack-sysctl.txt b/Documentation/networking/nf_conntrack-sysctl.rst
index f75c2ce6e136..1120d71f28d7 100644
--- a/Documentation/networking/nf_conntrack-sysctl.txt
+++ b/Documentation/networking/nf_conntrack-sysctl.rst
@@ -1,8 +1,15 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================
+Netfilter Conntrack Sysctl variables
+====================================
+
/proc/sys/net/netfilter/nf_conntrack_* Variables:
+=================================================
nf_conntrack_acct - BOOLEAN
- 0 - disabled (default)
- not 0 - enabled
+ - 0 - disabled (default)
+ - not 0 - enabled
Enable connection tracking flow accounting. 64-bit byte and packet
counters per flow are added.
@@ -10,14 +17,13 @@ nf_conntrack_acct - BOOLEAN
nf_conntrack_buckets - INTEGER
Size of hash table. If not specified as parameter during module
loading, the default size is calculated by dividing total memory
- by 16384 to determine the number of buckets but the hash table will
- never have fewer than 32 and limited to 16384 buckets. For systems
- with more than 4GB of memory it will be 65536 buckets.
+ by 16384 to determine the number of buckets. The hash table will
+ never have fewer than 1024 and never more than 262144 buckets.
This sysctl is only writeable in the initial net namespace.
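
The default sizing rule described above can be written out as a small
calculation. This sketch follows the wording of this entry only (total memory
in bytes divided by 16384, clamped to the stated bounds). The function name is
hypothetical and the kernel's actual computation may differ in detail.

```python
MIN_BUCKETS = 1024
MAX_BUCKETS = 262144


def default_conntrack_buckets(total_memory_bytes):
    """Default hash table size as described in the text (illustrative)."""
    buckets = total_memory_bytes // 16384
    return max(MIN_BUCKETS, min(MAX_BUCKETS, buckets))


if __name__ == "__main__":
    for mem in (8 << 20, 1 << 30, 64 << 30):
        print(mem, default_conntrack_buckets(mem))
```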
nf_conntrack_checksum - BOOLEAN
- 0 - disabled
- not 0 - enabled (default)
+ - 0 - disabled
+ - not 0 - enabled (default)
Verify checksum of incoming packets. Packets with bad checksums are
in INVALID state. If this is enabled, such packets will not be
@@ -27,11 +33,14 @@ nf_conntrack_count - INTEGER (read-only)
Number of currently allocated flow entries.
nf_conntrack_events - BOOLEAN
- 0 - disabled
- not 0 - enabled (default)
+ - 0 - disabled
+ - 1 - enabled
+ - 2 - auto (default)
If this option is enabled, the connection tracking code will
provide userspace with connection tracking events via ctnetlink.
+ The default allocates the extension if a userspace program is
+ listening to ctnetlink events.
nf_conntrack_expect_max - INTEGER
Maximum size of expectation table. Default value is
@@ -61,15 +70,6 @@ nf_conntrack_generic_timeout - INTEGER (seconds)
Default for generic timeout. This refers to layer 4 unknown/unsupported
protocols.
-nf_conntrack_helper - BOOLEAN
- 0 - disabled (default)
- not 0 - enabled
-
- Enable automatic conntrack helper assignment.
- If disabled it is required to set up iptables rules to assign
- helpers to connections. See the CT target description in the
- iptables-extensions(8) man page for further information.
-
nf_conntrack_icmp_timeout - INTEGER (seconds)
default 30
@@ -81,31 +81,41 @@ nf_conntrack_icmpv6_timeout - INTEGER (seconds)
Default for ICMP6 timeout.
nf_conntrack_log_invalid - INTEGER
- 0 - disable (default)
- 1 - log ICMP packets
- 6 - log TCP packets
- 17 - log UDP packets
- 33 - log DCCP packets
- 41 - log ICMPv6 packets
- 136 - log UDPLITE packets
- 255 - log packets of any protocol
+ - 0 - disable (default)
+ - 1 - log ICMP packets
+ - 6 - log TCP packets
+ - 17 - log UDP packets
+ - 33 - log DCCP packets
+ - 41 - log ICMPv6 packets
+ - 136 - log UDPLITE packets
+ - 255 - log packets of any protocol
Log invalid packets of a type specified by value.
nf_conntrack_max - INTEGER
- Size of connection tracking table. Default value is
- nf_conntrack_buckets value * 4.
+ Maximum number of allowed connection tracking entries. This value is set
+ to nf_conntrack_buckets by default.
+ Note that connection tracking entries are added to the table twice -- once
+ for the original direction and once for the reply direction (i.e., with
+ the reversed address). This means that with default settings a maxed-out
+ table will have an average hash chain length of 2, not 1.
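
The chain length claim is simple arithmetic: every tracked connection occupies
two hash slots. A short sketch (the helper name is hypothetical):

```python
def average_chain_length(entries, buckets):
    """Average hash chain length when each entry is inserted twice."""
    # One insertion for the original direction, one for the reply direction.
    return 2 * entries / buckets


if __name__ == "__main__":
    # Default: nf_conntrack_max equals nf_conntrack_buckets
    print(average_chain_length(65536, 65536))
```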
nf_conntrack_tcp_be_liberal - BOOLEAN
- 0 - disabled (default)
- not 0 - enabled
+ - 0 - disabled (default)
+ - not 0 - enabled
Be conservative in what you do, be liberal in what you accept from others.
If it's non-zero, we mark only out of window RST segments as INVALID.
+nf_conntrack_tcp_ignore_invalid_rst - BOOLEAN
+ - 0 - disabled (default)
+ - 1 - enabled
+
+ If it's 1, we don't mark out of window RST segments as INVALID.
+
nf_conntrack_tcp_loose - BOOLEAN
- 0 - disabled
- not 0 - enabled (default)
+ - 0 - disabled
+ - not 0 - enabled (default)
If it is set to zero, we disable picking up already established
connections.
@@ -148,8 +158,8 @@ nf_conntrack_tcp_timeout_unacknowledged - INTEGER (seconds)
default 300
nf_conntrack_timestamp - BOOLEAN
- 0 - disabled (default)
- not 0 - enabled
+ - 0 - disabled (default)
+ - not 0 - enabled
Enable connection tracking flow timestamping.
@@ -170,3 +180,24 @@ nf_conntrack_gre_timeout_stream - INTEGER (seconds)
This extended timeout will be used in case there is an GRE stream
detected.
+
+nf_hooks_lwtunnel - BOOLEAN
+ - 0 - disabled (default)
+ - not 0 - enabled
+
+ If this option is enabled, the lightweight tunnel netfilter hooks are
+ enabled. This option cannot be disabled once it is enabled.
+
+nf_flowtable_tcp_timeout - INTEGER (seconds)
+ default 30
+
+ Control offload timeout for tcp connections.
+ TCP connections may be offloaded from nf conntrack to nf flow table.
+ Once aged, the connection is returned to nf conntrack with tcp pickup timeout.
+
+nf_flowtable_udp_timeout - INTEGER (seconds)
+ default 30
+
+ Control offload timeout for udp connections.
+ UDP connections may be offloaded from nf conntrack to nf flow table.
+ Once aged, the connection is returned to nf conntrack with udp pickup timeout.
diff --git a/Documentation/networking/nf_flowtable.rst b/Documentation/networking/nf_flowtable.rst
new file mode 100644
index 000000000000..d757c21c10f2
--- /dev/null
+++ b/Documentation/networking/nf_flowtable.rst
@@ -0,0 +1,235 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================
+Netfilter's flowtable infrastructure
+====================================
+
+This documentation describes the Netfilter flowtable infrastructure which allows
+you to define a fastpath through the flowtable datapath. This infrastructure
+also provides hardware offload support. The flowtable supports the layer 3
+IPv4 and IPv6 protocols and the layer 4 TCP and UDP protocols.
+
+Overview
+--------
+
+Once the first packet of the flow successfully goes through the IP forwarding
+path, from the second packet on, you might decide to offload the flow to the
+flowtable through your ruleset. The flowtable infrastructure provides a rule
+action that allows you to specify when to add a flow to the flowtable.
+
+A packet that finds a matching entry in the flowtable (i.e. flowtable hit) is
+transmitted to the output netdevice via neigh_xmit(), hence, packets bypass the
+classic IP forwarding path (the visible effect is that you do not see these
+packets from any of the Netfilter hooks coming after ingress). If there is
+no matching entry in the flowtable (i.e. flowtable miss), the packet
+follows the classic IP forwarding path.
+
+The flowtable uses a resizable hashtable. Lookups are based on the following
+n-tuple selectors: layer 2 protocol encapsulation (VLAN and PPPoE), layer 3
+source and destination, layer 4 source and destination ports and the input
+interface (useful in case there are several conntrack zones in place).
+
+The 'flow add' action allows you to populate the flowtable: the user selectively
+specifies what flows are placed into the flowtable. Hence, packets follow the
+classic IP forwarding path unless the user explicitly instructs flows to use this
+new alternative forwarding path via policy.
+
+The flowtable datapath is represented in Fig.1, which describes the classic IP
+forwarding path including the Netfilter hooks and the flowtable fastpath bypass.
+
+::
+
+                                         userspace process
+                                          ^              |
+                                          |              |
+                                     _____|____     ____\/___
+                                    /          \   /         \
+                                    |  input   |   | output  |
+                                    \__________/   \_________/
+                                         ^               |
+                                         |               |
+      _________      __________      ---------     _____\/______
+     /         \    /          \     |Routing |   /             \
+  -->  ingress  ---> prerouting ---> |decision|   | postrouting |--> neigh_xmit
+     \_________/    \__________/     ----------   \_____________/         ^
+       |      ^                          |               ^                |
+   flowtable  |                     ____\/___            |                |
+       |      |                    /         \           |                |
+    __\/___   |                    | forward |------------                |
+    |-----|   |                    \_________/                            |
+    |-----|   |                 'flow offload' rule                       |
+    |-----|   |                   adds entry to                           |
+    |_____|   |                     flowtable                             |
+       |      |                                                           |
+      / \     |                                                           |
+     /hit\_no_|                                                           |
+     \ ? /                                                                |
+      \ /                                                                 |
+       |__yes_________________fastpath bypass ____________________________|
+
+               Fig.1 Netfilter hooks and flowtable interactions
+
+The flowtable entry also stores the NAT configuration, so all packets are
+mangled according to the NAT policy that is specified from the classic IP
+forwarding path. The TTL is decremented before calling neigh_xmit(). Fragmented
+traffic is passed up to follow the classic IP forwarding path given that the
+transport header is missing; in this case, flowtable lookups are not possible.
+TCP RST and FIN packets are also passed up to the classic IP forwarding path to
+release the flow gracefully. Packets that exceed the MTU are also passed up to
+the classic forwarding path to report packet-too-big ICMP errors to the sender.
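
The rules above for when a packet takes the fastpath and when it is passed up
can be summarised in a small decision function. This is a conceptual model
only, not the kernel datapath: ``Packet``, ``choose_path`` and the shape of
the lookup key are hypothetical, and the flowtable is modelled as a plain set
of keys.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    key: tuple                   # lookup tuple: addresses, ports, input interface
    length: int = 100            # packet length in bytes
    is_fragment: bool = False    # fragments lack the transport header
    tcp_fin_or_rst: bool = False


def choose_path(pkt, flowtable, mtu=1500):
    """Return 'fastpath' on a usable flowtable hit, else 'classic'."""
    if pkt.is_fragment:
        return "classic"         # no transport header, lookup impossible
    if pkt.key not in flowtable:
        return "classic"         # flowtable miss
    if pkt.tcp_fin_or_rst:
        return "classic"         # let the flow be released gracefully
    if pkt.length > mtu:
        return "classic"         # classic path reports packet-too-big
    return "fastpath"


if __name__ == "__main__":
    key = ("10.0.0.1", "10.0.0.2", 1234, 80, "eth0")
    table = {key}
    print(choose_path(Packet(key), table))
    print(choose_path(Packet(key, is_fragment=True), table))
```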
+
+Example configuration
+---------------------
+
+Enabling the flowtable bypass is relatively easy: you only need to create a
+flowtable and add one rule to your forward chain::
+
+    table inet x {
+        flowtable f {
+            hook ingress priority 0; devices = { eth0, eth1 };
+        }
+        chain y {
+            type filter hook forward priority 0; policy accept;
+            ip protocol tcp flow add @f
+            counter packets 0 bytes 0
+        }
+    }
+
+This example adds the flowtable 'f' to the ingress hook of the eth0 and eth1
+netdevices. You can create as many flowtables as you want in case you need to
+perform resource partitioning. The flowtable priority defines the order in which
+hooks are run in the pipeline. This is convenient in case you already have an
+nftables ingress chain (make sure the flowtable priority is smaller than that of
+the nftables ingress chain, so that the flowtable runs first in the pipeline).
+
+The 'flow add' action from the forward chain 'y' adds an entry to the
+flowtable for the TCP syn-ack packet coming in the reply direction. Once the
+flow is offloaded, you will observe that the counter rule in the example above
+does not get updated for the packets that are being forwarded through the
+forwarding bypass.
+
+You can identify offloaded flows through the [OFFLOAD] tag when listing your
+connection tracking table.
+
+::
+
+ # conntrack -L
+ tcp 6 src=10.141.10.2 dst=192.168.10.2 sport=52728 dport=5201 src=192.168.10.2 dst=192.168.10.1 sport=5201 dport=52728 [OFFLOAD] mark=0 use=2
+
+
+Layer 2 encapsulation
+---------------------
+
+Since Linux kernel 5.13, the flowtable infrastructure discovers the real
+netdevice behind VLAN and PPPoE netdevices. The flowtable software datapath
+parses the VLAN and PPPoE layer 2 headers to extract the ethertype and the
+VLAN ID / PPPoE session ID which are used for the flowtable lookups. The
+flowtable datapath also deals with layer 2 decapsulation.
+
+You do not need to add the PPPoE and the VLAN devices to your flowtable,
+instead the real device is sufficient for the flowtable to track your flows.
+
+Bridge and IP forwarding
+------------------------
+
+Since Linux kernel 5.13, you can add bridge ports to the flowtable. The
+flowtable infrastructure discovers the topology behind the bridge device. This
+allows the flowtable to define a fastpath bypass between the bridge ports
+(represented as eth1 and eth2 in the example figure below) and the gateway
+device (represented as eth0) in your switch/router.
+
+::
+
+                      fastpath bypass
+               .-------------------------.
+              /                           \
+              |           IP forwarding   |
+              |          /             \ \/
+              |       br0               eth0 ..... eth0
+              .       / \                          *host B*
+               -> eth1  eth2
+              .           *switch/router*
+              .
+              .
+            eth0
+          *host A*
+
+The flowtable infrastructure also supports bridge VLAN filtering actions
+such as PVID and untagged. You can also stack a classic VLAN device on top of
+your bridge port.
+
+If you would like your flowtable to define a fastpath between your bridge
+ports and your IP forwarding path, you have to add your bridge ports (as
+represented by the real netdevice) to your flowtable definition.
+
+Counters
+--------
+
+The flowtable can synchronize packet and byte counters with the existing
+connection tracking entry by specifying the counter statement in your flowtable
+definition, e.g.
+
+::
+
+    table inet x {
+        flowtable f {
+            hook ingress priority 0; devices = { eth0, eth1 };
+            counter
+        }
+    }
+
+Counter support is available since Linux kernel 5.7.
+
+Hardware offload
+----------------
+
+If your network device provides hardware offload support, you can turn it on by
+means of the 'offload' flag in your flowtable definition, e.g.
+
+::
+
+    table inet x {
+        flowtable f {
+            hook ingress priority 0; devices = { eth0, eth1 };
+            flags offload;
+        }
+    }
+
+There is a workqueue that adds the flows to the hardware. Note that a few
+packets might still run over the flowtable software path until the workqueue has
+a chance to offload the flow to the network device.
+
+You can identify hardware offloaded flows through the [HW_OFFLOAD] tag when
+listing your connection tracking table. Note that there is a distinction
+between the two tags: [OFFLOAD] refers to the software flowtable fastpath,
+whereas [HW_OFFLOAD] indicates that the flow uses the hardware offload
+datapath.
+
+The flowtable hardware offload infrastructure also supports DSA
+(Distributed Switch Architecture).
+
+Limitations
+-----------
+
+The flowtable behaves like a cache. The flowtable entries might get stale if
+either the destination MAC address or the egress netdevice that is used for
+transmission changes.
+
+This might be a problem if:
+
+- You run the flowtable in software mode and you combine bridge and IP
+ forwarding in your setup.
+- Hardware offload is enabled.
+
+More reading
+------------
+
+This documentation is based on the LWN.net articles [1]_\ [2]_. Rafal Milecki
+also made a very complete and comprehensive summary called "A state of network
+acceleration" that describes how things were before this infrastructure was
+mainlined [3]_ and it also makes a rough summary of this work [4]_.
+
+.. [1] https://lwn.net/Articles/738214/
+.. [2] https://lwn.net/Articles/742164/
+.. [3] http://lists.infradead.org/pipermail/lede-dev/2018-January/010830.html
+.. [4] http://lists.infradead.org/pipermail/lede-dev/2018-January/010829.html
diff --git a/Documentation/networking/nf_flowtable.txt b/Documentation/networking/nf_flowtable.txt
deleted file mode 100644
index 0bf32d1121be..000000000000
--- a/Documentation/networking/nf_flowtable.txt
+++ /dev/null
@@ -1,112 +0,0 @@
-Netfilter's flowtable infrastructure
-====================================
-
-This documentation describes the software flowtable infrastructure available in
-Netfilter since Linux kernel 4.16.
-
-Overview
---------
-
-Initial packets follow the classic forwarding path, once the flow enters the
-established state according to the conntrack semantics (ie. we have seen traffic
-in both directions), then you can decide to offload the flow to the flowtable
-from the forward chain via the 'flow offload' action available in nftables.
-
-Packets that find an entry in the flowtable (ie. flowtable hit) are sent to the
-output netdevice via neigh_xmit(), hence, they bypass the classic forwarding
-path (the visible effect is that you do not see these packets from any of the
-netfilter hooks coming after the ingress). In case of flowtable miss, the packet
-follows the classic forward path.
-
-The flowtable uses a resizable hashtable, lookups are based on the following
-7-tuple selectors: source, destination, layer 3 and layer 4 protocols, source
-and destination ports and the input interface (useful in case there are several
-conntrack zones in place).
-
-Flowtables are populated via the 'flow offload' nftables action, so the user can
-selectively specify what flows are placed into the flow table. Hence, packets
-follow the classic forwarding path unless the user explicitly instruct packets
-to use this new alternative forwarding path via nftables policy.
-
-This is represented in Fig.1, which describes the classic forwarding path
-including the Netfilter hooks and the flowtable fastpath bypass.
-
- userspace process
- ^ |
- | |
- _____|____ ____\/___
- / \ / \
- | input | | output |
- \__________/ \_________/
- ^ |
- | |
- _________ __________ --------- _____\/_____
- / \ / \ |Routing | / \
- --> ingress ---> prerouting ---> |decision| | postrouting |--> neigh_xmit
- \_________/ \__________/ ---------- \____________/ ^
- | ^ | ^ |
- flowtable | ____\/___ | |
- | | / \ | |
- __\/___ | | forward |------------ |
- |-----| | \_________/ |
- |-----| | 'flow offload' rule |
- |-----| | adds entry to |
- |_____| | flowtable |
- | | |
- / \ | |
- /hit\_no_| |
- \ ? / |
- \ / |
- |__yes_________________fastpath bypass ____________________________|
-
- Fig.1 Netfilter hooks and flowtable interactions
-
-The flowtable entry also stores the NAT configuration, so all packets are
-mangled according to the NAT policy that matches the initial packets that went
-through the classic forwarding path. The TTL is decremented before calling
-neigh_xmit(). Fragmented traffic is passed up to follow the classic forwarding
-path given that the transport selectors are missing, therefore flowtable lookup
-is not possible.
-
-Example configuration
----------------------
-
-Enabling the flowtable bypass is relatively easy, you only need to create a
-flowtable and add one rule to your forward chain.
-
- table inet x {
- flowtable f {
- hook ingress priority 0; devices = { eth0, eth1 };
- }
- chain y {
- type filter hook forward priority 0; policy accept;
- ip protocol tcp flow offload @f
- counter packets 0 bytes 0
- }
- }
-
-This example adds the flowtable 'f' to the ingress hook of the eth0 and eth1
-netdevices. You can create as many flowtables as you want in case you need to
-perform resource partitioning. The flowtable priority defines the order in which
-hooks are run in the pipeline, this is convenient in case you already have a
-nftables ingress chain (make sure the flowtable priority is smaller than the
-nftables ingress chain hence the flowtable runs before in the pipeline).
-
-The 'flow offload' action from the forward chain 'y' adds an entry to the
-flowtable for the TCP syn-ack packet coming in the reply direction. Once the
-flow is offloaded, you will observe that the counter rule in the example above
-does not get updated for the packets that are being forwarded through the
-forwarding bypass.
-
-More reading
-------------
-
-This documentation is based on the LWN.net articles [1][2]. Rafal Milecki also
-made a very complete and comprehensive summary called "A state of network
-acceleration" that describes how things were before this infrastructure was
-mailined [3] and it also makes a rough summary of this work [4].
-
-[1] https://lwn.net/Articles/738214/
-[2] https://lwn.net/Articles/742164/
-[3] http://lists.infradead.org/pipermail/lede-dev/2018-January/010830.html
-[4] http://lists.infradead.org/pipermail/lede-dev/2018-January/010829.html
diff --git a/Documentation/networking/openvswitch.txt b/Documentation/networking/openvswitch.rst
index b3b9ac61d29d..1a8353dbf1b6 100644
--- a/Documentation/networking/openvswitch.txt
+++ b/Documentation/networking/openvswitch.rst
@@ -1,3 +1,6 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================================
Open vSwitch datapath developer documentation
=============================================
@@ -80,13 +83,13 @@ The <linux/openvswitch.h> header file defines the exact format of the
flow key attributes. For informal explanatory purposes here, we write
them as comma-separated strings, with parentheses indicating arguments
and nesting. For example, the following could represent a flow key
-corresponding to a TCP packet that arrived on vport 1:
+corresponding to a TCP packet that arrived on vport 1::
in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),
eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=17, tos=0,
frag=no), tcp(src=49163, dst=80)
-Often we ellipsize arguments not important to the discussion, e.g.:
+Often we ellipsize arguments not important to the discussion, e.g.::
in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...)
@@ -151,20 +154,20 @@ Some care is needed to really maintain forward and backward
compatibility for applications that follow the rules listed under
"Flow key compatibility" above.
-The basic rule is obvious:
+The basic rule is obvious::
- ------------------------------------------------------------------
+ ==================================================================
New network protocol support must only supplement existing flow
key attributes. It must not change the meaning of already defined
flow key attributes.
- ------------------------------------------------------------------
+ ==================================================================
This rule does have less-obvious consequences so it is worth working
through a few examples. Suppose, for example, that the kernel module
did not already implement VLAN parsing. Instead, it just interpreted
the 802.1Q TPID (0x8100) as the Ethertype then stopped parsing the
packet. The flow key for any packet with an 802.1Q header would look
-essentially like this, ignoring metadata:
+essentially like this, ignoring metadata::
eth(...), eth_type(0x8100)
@@ -172,7 +175,7 @@ Naively, to add VLAN support, it makes sense to add a new "vlan" flow
key attribute to contain the VLAN tag, then continue to decode the
encapsulated headers beyond the VLAN tag using the existing field
definitions. With this change, a TCP packet in VLAN 10 would have a
-flow key much like this:
+flow key much like this::
eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...)
@@ -187,7 +190,7 @@ across kernel versions even though it follows the compatibility rules.
The solution is to use a set of nested attributes. This is, for
example, why 802.1Q support uses nested attributes. A TCP packet in
-VLAN 10 is actually expressed as:
+VLAN 10 is actually expressed as::
eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800),
ip(proto=6, ...), tcp(...)))
@@ -215,14 +218,14 @@ For example, consider a packet that contains an IP header that
indicates protocol 6 for TCP, but which is truncated just after the IP
header, so that the TCP header is missing. The flow key for this
packet would include a tcp attribute with all-zero src and dst, like
-this:
+this::
eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0)
As another example, consider a packet with an Ethernet type of 0x8100,
indicating that a VLAN TCI should follow, but which is truncated just
after the Ethernet type. The flow key for this packet would include
-an all-zero-bits vlan and an empty encap attribute, like this:
+an all-zero-bits vlan and an empty encap attribute, like this::
eth(...), eth_type(0x8100), vlan(0), encap()
diff --git a/Documentation/networking/operstates.txt b/Documentation/networking/operstates.rst
index b203d1334822..1ee2141e8ef1 100644
--- a/Documentation/networking/operstates.txt
+++ b/Documentation/networking/operstates.rst
@@ -1,5 +1,12 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
+Operational States
+==================
+
1. Introduction
+===============
Linux distinguishes between administrative and operational state of an
interface. Administrative state is the result of "ip link set dev
@@ -20,6 +27,7 @@ and changeable from userspace under certain rules.
2. Querying from userspace
+==========================
Both admin and operational state can be queried via the netlink
operation RTM_GETLINK. It is also possible to subscribe to RTNLGRP_LINK
@@ -30,16 +38,20 @@ These values contain interface state:
ifinfomsg::if_flags & IFF_UP:
Interface is admin up
+
ifinfomsg::if_flags & IFF_RUNNING:
Interface is in RFC2863 operational state UP or UNKNOWN. This is for
backward compatibility, routing daemons, dhcp clients can use this
flag to determine whether they should use the interface.
+
ifinfomsg::if_flags & IFF_LOWER_UP:
Driver has signaled netif_carrier_on()
+
ifinfomsg::if_flags & IFF_DORMANT:
Driver has signaled netif_dormant_on()
TLV IFLA_OPERSTATE
+------------------
contains RFC2863 state of the interface in numeric representation:
@@ -47,26 +59,35 @@ IF_OPER_UNKNOWN (0):
Interface is in unknown state, neither driver nor userspace has set
operational state. Interface must be considered for user data as
setting operational state has not been implemented in every driver.
+
IF_OPER_NOTPRESENT (1):
Unused in current kernel (notpresent interfaces normally disappear),
just a numerical placeholder.
+
IF_OPER_DOWN (2):
Interface is unable to transfer data on L1, f.e. ethernet is not
plugged or interface is ADMIN down.
+
IF_OPER_LOWERLAYERDOWN (3):
Interfaces stacked on an interface that is IF_OPER_DOWN show this
state (f.e. VLAN).
+
IF_OPER_TESTING (4):
- Unused in current kernel.
+ Interface is in testing mode, for example executing driver self-tests
+ or media (cable) test. It can't be used for normal traffic until tests
+ complete.
+
IF_OPER_DORMANT (5):
Interface is L1 up, but waiting for an external event, f.e. for a
protocol to establish. (802.1X)
+
IF_OPER_UP (6):
Interface is operational up and can be used.
This TLV can also be queried via sysfs.
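
The numeric states listed above map naturally onto an enumeration. The sketch
below is illustrative: the class and helper names are hypothetical, and it
assumes the sysfs file (typically ``/sys/class/net/<ifname>/operstate``)
reports the state as a lowercase name such as ``up`` or ``dormant``.

```python
from enum import IntEnum


class IfOper(IntEnum):
    """RFC2863 operational states with the values listed above."""
    UNKNOWN = 0
    NOTPRESENT = 1
    DOWN = 2
    LOWERLAYERDOWN = 3
    TESTING = 4
    DORMANT = 5
    UP = 6


def parse_operstate(text):
    """Map a sysfs-style state name (assumed lowercase) to IfOper."""
    return IfOper[text.strip().upper()]


def is_running(state):
    """IFF_RUNNING corresponds to operational state UP or UNKNOWN."""
    return state in (IfOper.UP, IfOper.UNKNOWN)


if __name__ == "__main__":
    state = parse_operstate("dormant\n")
    print(state.name, int(state), is_running(state))
```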
TLV IFLA_LINKMODE
+-----------------
contains link policy. This is needed for userspace interaction
described below.
@@ -75,6 +96,7 @@ This TLV can also be queried via sysfs.
3. Kernel driver API
+====================
Kernel drivers have access to two flags that map to IFF_LOWER_UP and
IFF_DORMANT. These flags can be set from everywhere, even from
@@ -91,7 +113,7 @@ it as lower layer.
Note that for certain kind of soft-devices, which are not managing any
real hardware, it is possible to set this bit from userspace. One
-should use TVL IFLA_CARRIER to do so.
+should use TLV IFLA_CARRIER to do so.
netif_carrier_ok() can be used to query that bit.
@@ -126,6 +148,7 @@ netif_carrier_ok() && !netif_dormant():
4. Setting from userspace
+=========================
Applications have to use the netlink interface to influence the
RFC2863 operational state of an interface. Setting IFLA_LINKMODE to 1
@@ -139,18 +162,18 @@ are multicasted on the netlink group RTNLGRP_LINK.
So basically a 802.1X supplicant interacts with the kernel like this:
--subscribe to RTNLGRP_LINK
--set IFLA_LINKMODE to 1 via RTM_SETLINK
--query RTM_GETLINK once to get initial state
--if initial flags are not (IFF_LOWER_UP && !IFF_DORMANT), wait until
- netlink multicast signals this state
--do 802.1X, eventually abort if flags go down again
--send RTM_SETLINK to set operstate to IF_OPER_UP if authentication
- succeeds, IF_OPER_DORMANT otherwise
--see how operstate and IFF_RUNNING is echoed via netlink multicast
--set interface back to IF_OPER_DORMANT if 802.1X reauthentication
- fails
--restart if kernel changes IFF_LOWER_UP or IFF_DORMANT flag
+- subscribe to RTNLGRP_LINK
+- set IFLA_LINKMODE to 1 via RTM_SETLINK
+- query RTM_GETLINK once to get initial state
+- if initial flags are not (IFF_LOWER_UP && !IFF_DORMANT), wait until
+ netlink multicast signals this state
+- do 802.1X, eventually abort if flags go down again
+- send RTM_SETLINK to set operstate to IF_OPER_UP if authentication
+ succeeds, IF_OPER_DORMANT otherwise
+- see how operstate and IFF_RUNNING is echoed via netlink multicast
+- set interface back to IF_OPER_DORMANT if 802.1X reauthentication
+ fails
+- restart if kernel changes IFF_LOWER_UP or IFF_DORMANT flag
if supplicant goes down, bring back IFLA_LINKMODE to 0 and
IFLA_OPERSTATE to a sane value.
diff --git a/Documentation/networking/packet_mmap.rst b/Documentation/networking/packet_mmap.rst
new file mode 100644
index 000000000000..c5da1a5d93de
--- /dev/null
+++ b/Documentation/networking/packet_mmap.rst
@@ -0,0 +1,1083 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========
+Packet MMAP
+===========
+
+Abstract
+========
+
+This file documents the mmap() facility available with the PACKET
+socket interface. This type of socket is used to
+
+i) capture network traffic with utilities like tcpdump,
+ii) transmit network traffic, or perform any other task that needs raw
+ access to a network interface.
+
+Howto can be found at:
+
+ https://sites.google.com/site/packetmmap/
+
+Please send your comments to
+ - Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
+ - Johann Baudy
+
+Why use PACKET_MMAP
+===================
+
+A capture process that does not use PACKET_MMAP (plain AF_PACKET) is very
+inefficient. It uses very limited buffers and requires one system call to
+capture each packet, or two if you want the packet's timestamp
+(as libpcap always does).
+
+On the other hand, PACKET_MMAP is very efficient. PACKET_MMAP provides a
+size-configurable circular buffer mapped in user space that can be used to
+either send or receive packets. This way, reading packets just means waiting
+for them; most of the time there is no need to issue a single system call.
+For transmission, multiple packets can be sent through one system call to get
+the highest bandwidth. Using a shared buffer between the kernel and the user
+also has the benefit of minimizing packet copies.
+
+It's fine to use PACKET_MMAP to improve the performance of the capture and
+transmission process, but it isn't everything. At least, if you are capturing
+at high speeds (this is relative to the CPU speed), you should check whether
+the device driver of your network interface card supports some sort of
+interrupt load mitigation or (even better) NAPI, and make sure it is
+enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
+supported by the devices of your network. CPU IRQ pinning of your network
+interface card can also be an advantage.
+
+How to use mmap() to improve capture process
+============================================
+
+From the user standpoint, you should use the higher level libpcap library, which
+is a de facto standard, portable across nearly all operating systems
+including Win32.
+
+Packet MMAP support was integrated into libpcap around the time of version 1.3.0;
+TPACKET_V3 support was added in version 1.5.0.
+
+How to use mmap() directly to improve capture process
+=====================================================
+
+From the system call standpoint, the use of PACKET_MMAP involves
+the following process::
+
+
+ [setup] socket() -------> creation of the capture socket
+ setsockopt() ---> allocation of the circular buffer (ring)
+ option: PACKET_RX_RING
+ mmap() ---------> mapping of the allocated buffer to the
+ user process
+
+ [capture] poll() ---------> to wait for incoming packets
+
+ [shutdown] close() --------> destruction of the capture socket and
+ deallocation of all associated
+ resources.
+
+
+Socket creation and destruction are straightforward, and are done
+the same way with or without PACKET_MMAP::
+
+ int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));
+
+where mode is SOCK_RAW for the raw interface where link level
+information can be captured or SOCK_DGRAM for the cooked
+interface where link level information capture is not
+supported and a link level pseudo-header is provided
+by the kernel.
+
+The destruction of the socket and all associated resources
+is done by a simple call to close(fd).
+
+As without PACKET_MMAP, it is possible to use one socket
+for capture and transmission. This can be done by mapping the
+allocated RX and TX buffer ring with a single mmap() call.
+See "Mapping and use of the circular buffer (ring)".
+
+The following sections describe the PACKET_MMAP settings and their
+constraints, as well as the mapping of the circular buffer in the user
+process and the use of this buffer.
+
+How to use mmap() directly to improve transmission process
+==========================================================
+
+The transmission process is similar to capture, as shown below::
+
+ [setup] socket() -------> creation of the transmission socket
+ setsockopt() ---> allocation of the circular buffer (ring)
+ option: PACKET_TX_RING
+ bind() ---------> bind transmission socket with a network interface
+ mmap() ---------> mapping of the allocated buffer to the
+ user process
+
+ [transmission] poll() ---------> wait for free packets (optional)
+ send() ---------> send all packets that are set as ready in
+ the ring
+ The flag MSG_DONTWAIT can be used to return
+ before end of transfer.
+
+ [shutdown] close() --------> destruction of the transmission socket and
+ deallocation of all associated resources.
+
+Socket creation and destruction are also straightforward, and are done
+the same way as for capturing, described in the previous section::
+
+ int fd = socket(PF_PACKET, mode, 0);
+
+The protocol can optionally be 0 if we only want to transmit
+via this socket, which avoids an expensive call to packet_rcv().
+In this case, you also need to bind(2) the TX_RING with sll_protocol = 0
+set. Otherwise, use htons(ETH_P_ALL) or any other protocol.
+
+Binding the socket to your network interface is mandatory (with zero copy) to
+know the header size of frames used in the circular buffer.
+
+As with capture, each frame contains two parts::
+
+ --------------------
+ | struct tpacket_hdr | Header. It contains the status of
+ | | this frame
+ |--------------------|
+ | data buffer |
+ . . Data that will be sent over the network interface.
+ . .
+ --------------------
+
+ bind() associates the socket with your network interface through the
+ sll_ifindex field of struct sockaddr_ll.
+
+ Initialization example::
+
+ struct sockaddr_ll my_addr;
+ struct ifreq s_ifr;
+ ...
+
+ /* strscpy_pad() is kernel-only; snprintf() truncates and NUL-terminates */
+ snprintf(s_ifr.ifr_name, sizeof(s_ifr.ifr_name), "%s", "eth0");
+
+ /* get interface index of eth0 */
+ ioctl(this->socket, SIOCGIFINDEX, &s_ifr);
+
+ /* fill sockaddr_ll struct to prepare binding */
+ my_addr.sll_family = AF_PACKET;
+ my_addr.sll_protocol = htons(ETH_P_ALL);
+ my_addr.sll_ifindex = s_ifr.ifr_ifindex;
+
+ /* bind socket to eth0 */
+ bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));
+
+ A complete tutorial is available at: https://sites.google.com/site/packetmmap/
+
+By default, the user should put data at::
+
+ frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll)
+
+So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW),
+the beginning of the user data will be at::
+
+ frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
+
+If you wish to put user data at a custom offset from the beginning of
+the frame (for payload alignment with SOCK_RAW mode for instance) you
+can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order
+to make this work it must be enabled previously with setsockopt()
+and the PACKET_TX_HAS_OFF option.
+
+PACKET_MMAP settings
+====================
+
+Setting up PACKET_MMAP from user level code is done with a call like
+
+ - Capture process::
+
+ setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))
+
+ - Transmission process::
+
+ setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))
+
+The most significant argument in the previous call is the req parameter,
+which must have the following structure::
+
+ struct tpacket_req
+ {
+ unsigned int tp_block_size; /* Minimal size of contiguous block */
+ unsigned int tp_block_nr; /* Number of blocks */
+ unsigned int tp_frame_size; /* Size of frame */
+ unsigned int tp_frame_nr; /* Total number of frames */
+ };
+
+This structure is defined in /usr/include/linux/if_packet.h and establishes a
+circular buffer (ring) of unswappable memory.
+Because the ring is mapped into the capture process, captured frames and
+related meta-information like timestamps can be read without a system call.
+
+Frames are grouped in blocks. Each block is a physically contiguous
+region of memory and holds tp_block_size/tp_frame_size frames. The total number
+of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because::
+
+ frames_per_block = tp_block_size/tp_frame_size
+
+indeed, packet_set_ring checks that the following condition is true::
+
+ frames_per_block * tp_block_nr == tp_frame_nr
+
+Let's see an example, with the following values::
+
+ tp_block_size= 4096
+ tp_frame_size= 2048
+ tp_block_nr = 4
+ tp_frame_nr = 8
+
+we will get the following buffer structure::
+
+ block #1 block #2
+ +---------+---------+ +---------+---------+
+ | frame 1 | frame 2 | | frame 3 | frame 4 |
+ +---------+---------+ +---------+---------+
+
+ block #3 block #4
+ +---------+---------+ +---------+---------+
+ | frame 5 | frame 6 | | frame 7 | frame 8 |
+ +---------+---------+ +---------+---------+
+
+A frame can be of any size, with the only condition that it fits in a block. A
+block can only hold an integer number of frames, or in other words, a frame
+cannot span two blocks, so there are some details you have to take into
+account when choosing the frame_size. See "Mapping and use of the circular
+buffer (ring)".
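+
+As a rough illustration of the relationship above, the following sketch derives
+tp_frame_nr from the other three values and re-checks the condition that
+packet_set_ring enforces. The ``ring_geometry`` type and both helper names are
+hypothetical and only mirror the four fields of struct tpacket_req.
+

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the four struct tpacket_req fields. */
struct ring_geometry {
	unsigned int block_size;	/* tp_block_size */
	unsigned int block_nr;		/* tp_block_nr */
	unsigned int frame_size;	/* tp_frame_size */
	unsigned int frame_nr;		/* tp_frame_nr */
};

/* Derive the redundant frame count from the other three values. */
static struct ring_geometry make_geometry(unsigned int block_size,
					  unsigned int frame_size,
					  unsigned int block_nr)
{
	struct ring_geometry g;

	/* A frame must fit in a block, otherwise no frame can be stored. */
	assert(frame_size != 0 && frame_size <= block_size);

	g.block_size = block_size;
	g.block_nr = block_nr;
	g.frame_size = frame_size;
	g.frame_nr = (block_size / frame_size) * block_nr;
	return g;
}

/* The consistency condition the text attributes to packet_set_ring. */
static bool geometry_consistent(const struct ring_geometry *g)
{
	unsigned int frames_per_block;

	if (g->frame_size == 0)
		return false;

	frames_per_block = g->block_size / g->frame_size;
	return frames_per_block * g->block_nr == g->frame_nr;
}
```

+
+With the example values (4096, 2048, 4), ``make_geometry()`` yields a
+tp_frame_nr of 8, matching the diagram above.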
+
+PACKET_MMAP setting constraints
+===============================
+
+In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch),
+the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or
+16384 in a 64 bit architecture.
+
+Block size limit
+----------------
+
+As stated earlier, each block is a contiguous physical region of memory. These
+memory regions are allocated with calls to the __get_free_pages() function. As
+the name indicates, this function allocates pages of memory, and the second
+argument is "order" or a power of two number of pages, that is
+(for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes,
+order=2 ==> 16384 bytes, etc. The maximum size of a
+region allocated by __get_free_pages is determined by the MAX_ORDER macro. More
+precisely the limit can be calculated as::
+
+ PAGE_SIZE << MAX_ORDER
+
+ In a i386 architecture PAGE_SIZE is 4096 bytes
+ In a 2.4/i386 kernel MAX_ORDER is 10
+ In a 2.6/i386 kernel MAX_ORDER is 11
+
+So __get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel
+respectively, with an i386 architecture.
+
+User space programs can include /usr/include/sys/user.h and
+/usr/include/linux/mmzone.h to get the PAGE_SIZE and MAX_ORDER declarations.
+
+The pagesize can also be determined dynamically with the getpagesize (2)
+system call.
+
+Block number limit
+------------------
+
+To understand the constraints of PACKET_MMAP, we have to see the structure
+used to hold the pointers to each block.
+
+Currently, this structure is a vector called pg_vec, dynamically allocated
+with kmalloc; its size limits the number of blocks that can be allocated::
+
+ +---+---+---+---+
+ | x | x | x | x |
+ +---+---+---+---+
+ | | | |
+ | | | v
+ | | v block #4
+ | v block #3
+ v block #2
+ block #1
+
+kmalloc allocates any number of bytes of physically contiguous memory from
+a pool of pre-determined sizes. This pool of memory is maintained by the slab
+allocator, which is ultimately responsible for doing the allocation and
+hence imposes the maximum memory that kmalloc can allocate.
+
+In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The
+predetermined sizes that kmalloc uses can be checked in the "size-<bytes>"
+entries of /proc/slabinfo.
+
+In a 32 bit architecture, pointers are 4 bytes long, so the total number of
+pointers to blocks is::
+
+ 131072/4 = 32768 blocks
+
+PACKET_MMAP buffer size calculator
+==================================
+
+Definitions:
+
+============== ================================================================
+<size-max> is the maximum size allocable with kmalloc
+ (see /proc/slabinfo)
+<pointer size> depends on the architecture -- ``sizeof(void *)``
+<page size> depends on the architecture -- PAGE_SIZE or getpagesize (2)
+<max-order> is the value defined with MAX_ORDER
+<frame size> it's an upper bound of frame's capture size (more on this later)
+============== ================================================================
+
+from these definitions we will derive::
+
+ <block number> = <size-max>/<pointer size>
+ <block size> = <page size> << <max-order>
+
+so, the max buffer size is::
+
+ <block number> * <block size>
+
+and the number of frames is::
+
+ <block number> * <block size> / <frame size>
+
+Suppose the following parameters, which apply for 2.6 kernel and an
+i386 architecture::
+
+ <size-max> = 131072 bytes
+ <pointer size> = 4 bytes
+ <page size> = 4096 bytes
+ <max-order> = 11
+
+and a value for <frame size> of 2048 bytes. These parameters will yield::
+
+ <block number> = 131072/4 = 32768 blocks
+ <block size> = 4096 << 11 = 8 MiB.
+
+and hence the buffer will have a size of 262144 MiB. So it can hold
+262144 MiB / 2048 bytes = 134217728 frames.
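+
+The calculation above can be sketched in C as follows. This is only an
+illustration: ``struct ring_limits`` and ``compute_ring_limits()`` are
+hypothetical names, and 64-bit arithmetic is assumed so that the theoretical
+buffer size does not overflow.
+

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for the calculator described in the text. */
struct ring_limits {
	uint64_t block_number;	/* <size-max> / <pointer size> */
	uint64_t block_size;	/* <page size> << <max-order> */
	uint64_t buffer_size;	/* <block number> * <block size> */
	uint64_t frames;	/* buffer size / <frame size> */
};

static struct ring_limits compute_ring_limits(uint64_t size_max,
					      uint64_t pointer_size,
					      uint64_t page_size,
					      unsigned int max_order,
					      uint64_t frame_size)
{
	struct ring_limits l;

	/* Both values are used as divisors below. */
	assert(pointer_size != 0 && frame_size != 0);

	l.block_number = size_max / pointer_size;
	l.block_size = page_size << max_order;
	l.buffer_size = l.block_number * l.block_size;
	l.frames = l.buffer_size / frame_size;
	return l;
}
```

+
+Calling ``compute_ring_limits(131072, 4, 4096, 11, 2048)`` reproduces the
+figures above: 32768 blocks of 8 MiB each and 134217728 frames.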
+
+Actually, this buffer size is not possible with an i386 architecture.
+Remember that the memory is allocated in kernel space; on i386, the
+kernel's memory size is limited to 1GiB.
+
+Memory allocations are not freed until the socket is closed. The memory
+allocations are done with GFP_KERNEL priority, which basically means that
+the allocation can wait and swap other processes' memory in order to allocate
+the necessary memory, so normally limits can be reached.
+
+Other constraints
+-----------------
+
+If you check the source code you will see that what is drawn here as a frame
+is not only the link level frame. At the beginning of each frame there is a
+header called struct tpacket_hdr, used in PACKET_MMAP to hold the link level
+frame's meta information like the timestamp. So what is drawn here as a frame
+is really the following (from include/linux/if_packet.h)::
+
+ /*
+ Frame structure:
+
+ - Start. Frame must be aligned to TPACKET_ALIGNMENT=16
+ - struct tpacket_hdr
+ - pad to TPACKET_ALIGNMENT=16
+ - struct sockaddr_ll
+ - Gap, chosen so that packet data (Start+tp_net) aligns to
+ TPACKET_ALIGNMENT=16
+ - Start+tp_mac: [ Optional MAC header ]
+ - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
+ - Pad to align to TPACKET_ALIGNMENT=16
+ */
+
+The following are conditions that are checked in packet_set_ring:
+
+ - tp_block_size must be a multiple of PAGE_SIZE (1)
+ - tp_frame_size must be greater than TPACKET_HDRLEN (obvious)
+ - tp_frame_size must be a multiple of TPACKET_ALIGNMENT
+ - tp_frame_nr must be exactly frames_per_block*tp_block_nr
+
+Note that tp_block_size should be chosen to be a power of two or there will
+be a waste of memory.
+
+Mapping and use of the circular buffer (ring)
+---------------------------------------------
+
+The mapping of the buffer in the user process is done with the conventional
+mmap function. Even though the circular buffer is composed of several
+physically discontiguous blocks of memory, they are contiguous in user space,
+hence just one call to mmap is needed::
+
+ mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+
+If tp_frame_size is a divisor of tp_block_size, frames will be
+contiguously spaced by tp_frame_size bytes. If not, there will be a gap
+after every tp_block_size/tp_frame_size frames. This is because a frame
+cannot span two blocks.
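+
+The spacing rule can be expressed as a small offset computation. The sketch
+below is illustrative only; ``frame_offset()`` is a hypothetical helper that
+returns the byte offset of frame ``idx`` from the start of the mapping,
+assuming blocks are laid out back to back as described above.
+

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of frame idx from the start of the mapped ring. */
static size_t frame_offset(size_t idx, size_t block_size, size_t frame_size)
{
	size_t frames_per_block;

	/* A frame must fit in a block. */
	assert(frame_size != 0 && frame_size <= block_size);

	frames_per_block = block_size / frame_size;

	/* Whole blocks first, then whole frames inside the last block. */
	return (idx / frames_per_block) * block_size +
	       (idx % frames_per_block) * frame_size;
}
```

+
+For a block size of 4096 and a frame size of 1536, two frames fit in each
+block, so frame index 2 starts at offset 4096 rather than 3072; the 1024
+bytes in between are the gap.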
+
+To use one socket for capture and transmission, the mapping of both the
+RX and TX buffer ring has to be done with one call to mmap::
+
+ ...
+ setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &foo, sizeof(foo));
+ setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &bar, sizeof(bar));
+ ...
+ rx_ring = mmap(0, size * 2, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+ tx_ring = rx_ring + size;
+
+RX must be first, as the kernel maps the TX ring memory right
+after the RX one.
+
+At the beginning of each frame there is a status field (see
+struct tpacket_hdr). If this field is 0, the frame is ready
+to be used by the kernel. If not, there is a frame the user can read
+and the following flags apply:
+
+Capture process
+^^^^^^^^^^^^^^^
+
+From include/linux/if_packet.h::
+
+ #define TP_STATUS_COPY (1 << 1)
+ #define TP_STATUS_LOSING (1 << 2)
+ #define TP_STATUS_CSUMNOTREADY (1 << 3)
+ #define TP_STATUS_CSUM_VALID (1 << 7)
+
+====================== =======================================================
+TP_STATUS_COPY This flag indicates that the frame (and associated
+ meta information) has been truncated because it's
+ larger than tp_frame_size. This packet can be
+ read entirely with recvfrom().
+
+ In order to make this work it must be
+ enabled previously with setsockopt() and
+ the PACKET_COPY_THRESH option.
+
+ The number of frames that can be buffered to
+ be read with recvfrom is limited like a normal socket.
+ See the SO_RCVBUF option in the socket (7) man page.
+
+TP_STATUS_LOSING indicates there were packet drops since the last time
+ statistics were checked with getsockopt() and
+ the PACKET_STATISTICS option.
+
+TP_STATUS_CSUMNOTREADY currently used for outgoing IP packets whose
+ checksum will be done in hardware. So while
+ reading the packet we should not try to check the
+ checksum.
+
+TP_STATUS_CSUM_VALID This flag indicates that at least the transport
+ header checksum of the packet has been already
+ validated on the kernel side. If the flag is not set
+ then we are free to check the checksum by ourselves
+ provided that TP_STATUS_CSUMNOTREADY is also not set.
+====================== =======================================================
+
+For convenience there are also the following defines::
+
+ #define TP_STATUS_KERNEL 0
+ #define TP_STATUS_USER 1
+
+The kernel initializes all frames to TP_STATUS_KERNEL. When the kernel
+receives a packet, it puts it in the buffer and updates the status with
+at least the TP_STATUS_USER flag. Then the user can read the packet.
+Once the packet is read, the user must zero the status field so the kernel
+can use that frame buffer again.
+
+The user can use poll (any other variant should apply too) to check if new
+packets are in the ring::
+
+ struct pollfd pfd;
+
+ pfd.fd = fd;
+ pfd.revents = 0;
+ pfd.events = POLLIN|POLLRDNORM|POLLERR;
+
+ if (status == TP_STATUS_KERNEL)
+ retval = poll(&pfd, 1, timeout);
+
+Checking the status value first and then polling for frames does not
+introduce a race condition.
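+
+The ownership handshake described above might be sketched as follows.
+``drain_ring()`` and the ``frame_cb`` callback type are hypothetical; the
+sketch assumes TPACKET_V1 headers and a tp_frame_size that divides
+tp_block_size, so that frames are evenly spaced. A production implementation
+would likely also want a memory barrier before handing a frame back.
+

```c
#include <assert.h>
#include <stddef.h>
#include <linux/if_packet.h>

/* Hypothetical callback: packet data pointer and captured length. */
typedef void (*frame_cb)(const unsigned char *data, unsigned int len);

/*
 * Consume every frame the kernel has released to user space, starting at
 * index *next. Returns the number of frames consumed; 0 means the caller
 * should poll() before trying again.
 */
static unsigned int drain_ring(unsigned char *ring, unsigned int frame_nr,
			       unsigned int frame_size, unsigned int *next,
			       frame_cb cb)
{
	unsigned int done = 0;

	assert(frame_nr != 0 && *next < frame_nr);

	while (done < frame_nr) {
		struct tpacket_hdr *hdr = (struct tpacket_hdr *)
			(ring + (size_t)*next * frame_size);

		/* Frame still owned by the kernel: nothing more to read. */
		if (!(hdr->tp_status & TP_STATUS_USER))
			break;

		if (cb)
			cb((const unsigned char *)hdr + hdr->tp_mac,
			   hdr->tp_snaplen);

		/* Zero the status so the kernel can reuse this frame. */
		hdr->tp_status = TP_STATUS_KERNEL;
		*next = (*next + 1) % frame_nr;
		done++;
	}

	return done;
}
```

+
+A capture loop would call ``drain_ring()`` and fall back to poll() whenever it
+returns 0.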
+
+Transmission process
+^^^^^^^^^^^^^^^^^^^^
+
+These defines are also used for transmission::
+
+ #define TP_STATUS_AVAILABLE 0 // Frame is available
+ #define TP_STATUS_SEND_REQUEST 1 // Frame will be sent on next send()
+ #define TP_STATUS_SENDING 2 // Frame is currently in transmission
+ #define TP_STATUS_WRONG_FORMAT 4 // Frame format is not correct
+
+First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a
+packet, the user fills a data buffer of an available frame, sets tp_len to
+current data buffer size and sets its status field to TP_STATUS_SEND_REQUEST.
+This can be done on multiple frames. Once the user is ready to transmit, it
+calls send(). Then all buffers with status equal to TP_STATUS_SEND_REQUEST are
+forwarded to the network device. The kernel updates each status of sent
+frames with TP_STATUS_SENDING until the end of transfer.
+
+At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE.
+
+::
+
+ header->tp_len = in_i_size;
+ header->tp_status = TP_STATUS_SEND_REQUEST;
+ retval = send(this->socket, NULL, 0, 0);
+
+The user can also use poll() to check if a buffer is available:
+
+(status == TP_STATUS_SENDING)
+
+::
+
+ struct pollfd pfd;
+ pfd.fd = fd;
+ pfd.revents = 0;
+ pfd.events = POLLOUT;
+ retval = poll(&pfd, 1, timeout);
+
+What TPACKET versions are available and when to use them?
+=========================================================
+
+::
+
+ int val = tpacket_version;
+ setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
+ getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
+
+where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
+
+TPACKET_V1:
+ - Default if not otherwise specified by setsockopt(2)
+ - RX_RING, TX_RING available
+
+TPACKET_V1 --> TPACKET_V2:
+ - Made 64 bit clean due to unsigned long usage in TPACKET_V1
+ structures, thus this also works on 64 bit kernel with 32 bit
+ userspace and the like
+ - Timestamp resolution in nanoseconds instead of microseconds
+ - RX_RING, TX_RING available
+ - VLAN metadata information available for packets
+ (TP_STATUS_VLAN_VALID, TP_STATUS_VLAN_TPID_VALID),
+ in the tpacket2_hdr structure:
+
+ - TP_STATUS_VLAN_VALID bit being set into the tp_status field indicates
+ that the tp_vlan_tci field has valid VLAN TCI value
+ - TP_STATUS_VLAN_TPID_VALID bit being set into the tp_status field
+ indicates that the tp_vlan_tpid field has valid VLAN TPID value
+
+ - How to switch to TPACKET_V2:
+
+ 1. Replace struct tpacket_hdr by struct tpacket2_hdr
+ 2. Query header len and save
+ 3. Set protocol version to 2, set up ring as usual
+ 4. For getting the sockaddr_ll,
+ use ``(void *)hdr + TPACKET_ALIGN(hdrlen)`` instead of
+ ``(void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))``
+
+TPACKET_V2 --> TPACKET_V3:
+ - Flexible buffer implementation for RX_RING:
+ 1. Blocks can be configured with non-static frame-size
+ 2. Read/poll is at a block-level (as opposed to packet-level)
+ 3. Added poll timeout to avoid indefinite user-space wait
+ on idle links
+ 4. Added user-configurable knobs:
+
+ 4.1 block::timeout
+ 4.2 tpkt_hdr::sk_rxhash
+
+ - RX Hash data available in user space
+ - TX_RING semantics are conceptually similar to TPACKET_V2;
+ use tpacket3_hdr instead of tpacket2_hdr, and TPACKET3_HDRLEN
+ instead of TPACKET2_HDRLEN. In the current implementation,
+ the tp_next_offset field in the tpacket3_hdr MUST be set to
+ zero, indicating that the ring does not hold variable sized frames.
+ Packets with non-zero values of tp_next_offset will be dropped.
+
+AF_PACKET fanout mode
+=====================
+
+In the AF_PACKET fanout mode, packet reception can be load balanced among
+processes. This also works in combination with mmap(2) on packet sockets.
+
+Currently implemented fanout policies are:
+
+ - PACKET_FANOUT_HASH: schedule to socket by skb's packet hash
+ - PACKET_FANOUT_LB: schedule to socket by round-robin
+ - PACKET_FANOUT_CPU: schedule to socket by CPU packet arrives on
+ - PACKET_FANOUT_RND: schedule to socket by random selection
+ - PACKET_FANOUT_ROLLOVER: if one socket is full, rollover to another
+ - PACKET_FANOUT_QM: schedule to socket by skb's recorded queue_mapping
+
+Minimal example code by David S. Miller (try things like "./test eth0 hash",
+"./test eth0 lb", etc.)::
+
+ #include <stddef.h>
+ #include <stdlib.h>
+ #include <stdio.h>
+ #include <string.h>
+
+ #include <sys/types.h>
+ #include <sys/wait.h>
+ #include <sys/socket.h>
+ #include <sys/ioctl.h>
+
+ #include <unistd.h>
+
+ #include <linux/if_ether.h>
+ #include <linux/if_packet.h>
+
+ #include <net/if.h>
+
+ static const char *device_name;
+ static int fanout_type;
+ static int fanout_id;
+
+ #ifndef PACKET_FANOUT
+ # define PACKET_FANOUT 18
+ # define PACKET_FANOUT_HASH 0
+ # define PACKET_FANOUT_LB 1
+ #endif
+
+ static int setup_socket(void)
+ {
+ int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
+ struct sockaddr_ll ll;
+ struct ifreq ifr;
+ int fanout_arg;
+
+ if (fd < 0) {
+ perror("socket");
+ return -1;
+ }
+
+ memset(&ifr, 0, sizeof(ifr));
+ strcpy(ifr.ifr_name, device_name);
+ err = ioctl(fd, SIOCGIFINDEX, &ifr);
+ if (err < 0) {
+ perror("SIOCGIFINDEX");
+ return -1;
+ }
+
+ memset(&ll, 0, sizeof(ll));
+ ll.sll_family = AF_PACKET;
+ ll.sll_ifindex = ifr.ifr_ifindex;
+ err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
+ if (err < 0) {
+ perror("bind");
+ return -1;
+ }
+
+ fanout_arg = (fanout_id | (fanout_type << 16));
+ err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
+ &fanout_arg, sizeof(fanout_arg));
+ if (err) {
+ perror("setsockopt");
+ return -1;
+ }
+
+ return fd;
+ }
+
+ static void fanout_thread(void)
+ {
+ int fd = setup_socket();
+ int limit = 10000;
+
+ if (fd < 0)
+ exit(fd);
+
+ while (limit-- > 0) {
+ char buf[1600];
+ int err;
+
+ err = read(fd, buf, sizeof(buf));
+ if (err < 0) {
+ perror("read");
+ exit(EXIT_FAILURE);
+ }
+ if ((limit % 10) == 0)
+ fprintf(stdout, "(%d) \n", getpid());
+ }
+
+ fprintf(stdout, "%d: Received 10000 packets\n", getpid());
+
+ close(fd);
+ exit(0);
+ }
+
+ int main(int argc, char **argp)
+ {
+ int fd, err;
+ int i;
+
+ if (argc != 3) {
+ fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
+ return EXIT_FAILURE;
+ }
+
+ if (!strcmp(argp[2], "hash"))
+ fanout_type = PACKET_FANOUT_HASH;
+ else if (!strcmp(argp[2], "lb"))
+ fanout_type = PACKET_FANOUT_LB;
+ else {
+ fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
+ exit(EXIT_FAILURE);
+ }
+
+ device_name = argp[1];
+ fanout_id = getpid() & 0xffff;
+
+ for (i = 0; i < 4; i++) {
+ pid_t pid = fork();
+
+ switch (pid) {
+ case 0:
+ fanout_thread();
+
+ case -1:
+ perror("fork");
+ exit(EXIT_FAILURE);
+ }
+ }
+
+ for (i = 0; i < 4; i++) {
+ int status;
+
+ wait(&status);
+ }
+
+ return 0;
+ }
+
+AF_PACKET TPACKET_V3 example
+============================
+
+AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame
+sizes by doing its own memory management. It is based on blocks where polling
+works on a per block basis instead of per ring as in TPACKET_V2 and predecessor.
+
+It is said that TPACKET_V3 brings the following benefits:
+
+ * ~15% - 20% reduction in CPU-usage
+ * ~20% increase in packet capture rate
+ * ~2x increase in packet density
+ * Port aggregation analysis
+ * Non static frame size to capture entire packet payload
+
+So it seems to be a good candidate to be used with packet fanout.
+
+Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile
+it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.)::
+
+ /* Written from scratch, but kernel-to-user space API usage
+ * dissected from lolpcap:
+ * Copyright 2011, Chetan Loke <loke.chetan@gmail.com>
+ * License: GPL, version 2.0
+ */
+
+ #include <stdio.h>
+ #include <stdlib.h>
+ #include <stdint.h>
+ #include <string.h>
+ #include <assert.h>
+ #include <net/if.h>
+ #include <arpa/inet.h>
+ #include <netdb.h>
+ #include <poll.h>
+ #include <unistd.h>
+ #include <signal.h>
+ #include <inttypes.h>
+ #include <sys/socket.h>
+ #include <sys/mman.h>
+ #include <linux/if_packet.h>
+ #include <linux/if_ether.h>
+ #include <linux/ip.h>
+
+ #ifndef likely
+ # define likely(x) __builtin_expect(!!(x), 1)
+ #endif
+ #ifndef unlikely
+ # define unlikely(x) __builtin_expect(!!(x), 0)
+ #endif
+
+ struct block_desc {
+ uint32_t version;
+ uint32_t offset_to_priv;
+ struct tpacket_hdr_v1 h1;
+ };
+
+ struct ring {
+ struct iovec *rd;
+ uint8_t *map;
+ struct tpacket_req3 req;
+ };
+
+ static unsigned long packets_total = 0, bytes_total = 0;
+ static sig_atomic_t sigint = 0;
+
+ static void sighandler(int num)
+ {
+ sigint = 1;
+ }
+
+ static int setup_socket(struct ring *ring, char *netdev)
+ {
+ int err, i, fd, v = TPACKET_V3;
+ struct sockaddr_ll ll;
+ unsigned int blocksiz = 1 << 22, framesiz = 1 << 11;
+ unsigned int blocknum = 64;
+
+ fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
+ if (fd < 0) {
+ perror("socket");
+ exit(1);
+ }
+
+ err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v));
+ if (err < 0) {
+ perror("setsockopt");
+ exit(1);
+ }
+
+ memset(&ring->req, 0, sizeof(ring->req));
+ ring->req.tp_block_size = blocksiz;
+ ring->req.tp_frame_size = framesiz;
+ ring->req.tp_block_nr = blocknum;
+ ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz;
+ ring->req.tp_retire_blk_tov = 60;
+ ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH;
+
+ err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req,
+ sizeof(ring->req));
+ if (err < 0) {
+ perror("setsockopt");
+ exit(1);
+ }
+
+ ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr,
+ PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0);
+ if (ring->map == MAP_FAILED) {
+ perror("mmap");
+ exit(1);
+ }
+
+ ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd));
+ assert(ring->rd);
+ for (i = 0; i < ring->req.tp_block_nr; ++i) {
+ ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size);
+ ring->rd[i].iov_len = ring->req.tp_block_size;
+ }
+
+ memset(&ll, 0, sizeof(ll));
+ ll.sll_family = PF_PACKET;
+ ll.sll_protocol = htons(ETH_P_ALL);
+ ll.sll_ifindex = if_nametoindex(netdev);
+ ll.sll_hatype = 0;
+ ll.sll_pkttype = 0;
+ ll.sll_halen = 0;
+
+ err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
+ if (err < 0) {
+ perror("bind");
+ exit(1);
+ }
+
+ return fd;
+ }
+
+ static void display(struct tpacket3_hdr *ppd)
+ {
+ struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac);
+ struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN);
+
+ if (eth->h_proto == htons(ETH_P_IP)) {
+ struct sockaddr_in ss, sd;
+ char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST];
+
+ memset(&ss, 0, sizeof(ss));
+ ss.sin_family = PF_INET;
+ ss.sin_addr.s_addr = ip->saddr;
+ getnameinfo((struct sockaddr *) &ss, sizeof(ss),
+ sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST);
+
+ memset(&sd, 0, sizeof(sd));
+ sd.sin_family = PF_INET;
+ sd.sin_addr.s_addr = ip->daddr;
+ getnameinfo((struct sockaddr *) &sd, sizeof(sd),
+ dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST);
+
+ printf("%s -> %s, ", sbuff, dbuff);
+ }
+
+ printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash);
+ }
+
+ static void walk_block(struct block_desc *pbd, const int block_num)
+ {
+ int num_pkts = pbd->h1.num_pkts, i;
+ unsigned long bytes = 0;
+ struct tpacket3_hdr *ppd;
+
+ ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd +
+ pbd->h1.offset_to_first_pkt);
+ for (i = 0; i < num_pkts; ++i) {
+ bytes += ppd->tp_snaplen;
+ display(ppd);
+
+ ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd +
+ ppd->tp_next_offset);
+ }
+
+ packets_total += num_pkts;
+ bytes_total += bytes;
+ }
+
+ static void flush_block(struct block_desc *pbd)
+ {
+ pbd->h1.block_status = TP_STATUS_KERNEL;
+ }
+
+ static void teardown_socket(struct ring *ring, int fd)
+ {
+ munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr);
+ free(ring->rd);
+ close(fd);
+ }
+
+ int main(int argc, char **argp)
+ {
+ int fd, err;
+ socklen_t len;
+ struct ring ring;
+ struct pollfd pfd;
+ unsigned int block_num = 0, blocks = 64;
+ struct block_desc *pbd;
+ struct tpacket_stats_v3 stats;
+
+ if (argc != 2) {
+ fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]);
+ return EXIT_FAILURE;
+ }
+
+ signal(SIGINT, sighandler);
+
+ memset(&ring, 0, sizeof(ring));
+ fd = setup_socket(&ring, argp[argc - 1]);
+ assert(fd > 0);
+
+ memset(&pfd, 0, sizeof(pfd));
+ pfd.fd = fd;
+ pfd.events = POLLIN | POLLERR;
+ pfd.revents = 0;
+
+ while (likely(!sigint)) {
+ pbd = (struct block_desc *) ring.rd[block_num].iov_base;
+
+ if ((pbd->h1.block_status & TP_STATUS_USER) == 0) {
+ poll(&pfd, 1, -1);
+ continue;
+ }
+
+ walk_block(pbd, block_num);
+ flush_block(pbd);
+ block_num = (block_num + 1) % blocks;
+ }
+
+ len = sizeof(stats);
+ err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len);
+ if (err < 0) {
+ perror("getsockopt");
+ exit(1);
+ }
+
+ fflush(stdout);
+ printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n",
+ stats.tp_packets, bytes_total, stats.tp_drops,
+ stats.tp_freeze_q_cnt);
+
+ teardown_socket(&ring, fd);
+ return 0;
+ }
+
+PACKET_QDISC_BYPASS
+===================
+
+If you need to load the network with many packets, in a similar fashion to
+pktgen, you can set the following option after socket creation::
+
+ int one = 1;
+ setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one));
+
+This has the side effect that packets sent through PF_PACKET bypass the
+kernel's qdisc layer and are pushed directly to the driver. As a result,
+packets are not buffered, tc disciplines are ignored, increased loss can
+occur, and such packets are no longer visible to other PF_PACKET sockets.
+Use the option with care; it is mainly useful for stress testing various
+components of a system.
+
+By default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly enabled
+on PF_PACKET sockets.
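
A minimal sketch of enabling the option with error checking, assuming an
already created PF_PACKET socket. The helper name ``packet_set_qdisc_bypass``
is hypothetical and not part of any kernel or libc API; the fallback values
for ``SOL_PACKET`` and ``PACKET_QDISC_BYPASS`` are only used when the system
headers do not provide them.

```c
#include <errno.h>
#include <sys/socket.h>
#ifdef __linux__
# include <linux/if_packet.h>
#endif

/* Fallbacks for older headers; values as defined by the Linux UAPI. */
#ifndef SOL_PACKET
# define SOL_PACKET 263
#endif
#ifndef PACKET_QDISC_BYPASS
# define PACKET_QDISC_BYPASS 20
#endif

/*
 * Hypothetical helper: switch qdisc bypass on or off for a packet socket
 * and read the setting back. Returns 0 on success. Returns -1 if either
 * system call fails (errno is set by the call) or if the value read back
 * does not match the request (errno is left untouched in that case).
 */
static int packet_set_qdisc_bypass(int fd, int enable)
{
	int val = !!enable;
	socklen_t len = sizeof(val);

	if (setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS,
		       &val, sizeof(val)) < 0)
		return -1;

	if (getsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &val, &len) < 0)
		return -1;

	return val == !!enable ? 0 : -1;
}
```

Reading the value back is optional; it is done here only so that a caller
can tell a silently ignored request apart from a successful one.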
+
+PACKET_TIMESTAMP
+================
+
+The PACKET_TIMESTAMP setting determines the source of the timestamp in
+the packet meta information for mmap(2)ed RX_RING and TX_RINGs. If your
+NIC is capable of timestamping packets in hardware, you can request those
+hardware timestamps to be used. Note: you may need to enable the generation
+of hardware timestamps with SIOCSHWTSTAMP (see related information from
+Documentation/networking/timestamping.rst).
+
+PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING::
+
+ int req = SOF_TIMESTAMPING_RAW_HARDWARE;
+ setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req));
+
+For the mmap(2)ed ring buffers, such timestamps are stored in the
+``tpacket{,2,3}_hdr`` structure's tp_sec and ``tp_{n,u}sec`` members.
+To indicate what kind of timestamp has been reported, the tp_status field
+is bitwise ORed with one of the following bits ...
+
+::
+
+ TP_STATUS_TS_RAW_HARDWARE
+ TP_STATUS_TS_SOFTWARE
+
+... that are equivalent to their ``SOF_TIMESTAMPING_*`` counterparts. For the
+RX_RING, if neither is set (i.e. PACKET_TIMESTAMP is not set), then a
+software fallback was invoked *within* PF_PACKET's processing code (less
+precise).
+
+Getting timestamps for the TX_RING works as follows: i) fill the ring frames,
+ii) call sendto(), e.g. in blocking mode, iii) wait for the status of the
+relevant frames to be updated, i.e. for the frames to be handed back to the
+application, iv) walk through the frames to pick up the individual hw/sw
+timestamps.
+
+Only if transmit timestamping is enabled are these bits bitwise ORed with
+TP_STATUS_AVAILABLE, so your application must account for that: first check
+``!(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING))`` to see whether
+the frame belongs to the application, and then extract the type of timestamp
+from tp_status in a second step.
+
+If you don't care about timestamps and leave transmit timestamping disabled,
+checking for TP_STATUS_AVAILABLE or TP_STATUS_WRONG_FORMAT is sufficient. If
+only TP_STATUS_AVAILABLE is set for a TX_RING frame, then the tp_sec and
+tp_{n,u}sec members do not contain a valid value. For TX_RINGs, no timestamp
+is generated by default!
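
The two-step check on a TX_RING frame's tp_status can be sketched as a small
classifier. This is an illustration only: the enum ``tx_ts_kind`` and the
function ``tx_frame_ts_kind`` are hypothetical names, and the fallback
definitions apply only where ``linux/if_packet.h`` is unavailable or too old.
Handling of TP_STATUS_WRONG_FORMAT is left to the caller.

```c
#include <stdint.h>
#ifdef __linux__
# include <linux/if_packet.h>
#endif

/* Fallbacks for older headers; values as defined by the Linux UAPI. */
#ifndef TP_STATUS_AVAILABLE
# define TP_STATUS_AVAILABLE 0
#endif
#ifndef TP_STATUS_SEND_REQUEST
# define TP_STATUS_SEND_REQUEST (1 << 0)
#endif
#ifndef TP_STATUS_SENDING
# define TP_STATUS_SENDING (1 << 1)
#endif
#ifndef TP_STATUS_TS_SOFTWARE
# define TP_STATUS_TS_SOFTWARE (1 << 29)
#endif
#ifndef TP_STATUS_TS_RAW_HARDWARE
# define TP_STATUS_TS_RAW_HARDWARE (1U << 31)
#endif

/* Hypothetical result type for this sketch. */
enum tx_ts_kind {
	TX_TS_BUSY = -1,	/* frame is still owned by the kernel */
	TX_TS_NONE,		/* frame is ours, no valid timestamp */
	TX_TS_SOFTWARE,		/* frame is ours, software timestamp */
	TX_TS_RAW_HARDWARE,	/* frame is ours, raw hardware timestamp */
};

static enum tx_ts_kind tx_frame_ts_kind(uint32_t tp_status)
{
	/* Step 1: does the frame belong to the application yet? */
	if (tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING))
		return TX_TS_BUSY;

	/* Step 2: which kind of timestamp, if any, was reported? */
	if (tp_status & TP_STATUS_TS_RAW_HARDWARE)
		return TX_TS_RAW_HARDWARE;
	if (tp_status & TP_STATUS_TS_SOFTWARE)
		return TX_TS_SOFTWARE;

	return TX_TS_NONE;
}
```

Comparing tp_status against TP_STATUS_AVAILABLE with ``==`` would miss frames
that carry a timestamp bit, which is why the sketch tests the "busy" bits
instead.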
+
+See include/linux/net_tstamp.h and Documentation/networking/timestamping.rst
+for more information on hardware timestamps.
+
+Miscellaneous bits
+==================
+
+- Packet sockets work well together with Linux socket filters, so you
+ might also want to have a look at Documentation/networking/filter.rst
+
+THANKS
+======
+
+ Jesse Brandeburg, for fixing my grammathical/spelling errors
diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
deleted file mode 100644
index 999eb41da81d..000000000000
--- a/Documentation/networking/packet_mmap.txt
+++ /dev/null
@@ -1,1061 +0,0 @@
---------------------------------------------------------------------------------
-+ ABSTRACT
---------------------------------------------------------------------------------
-
-This file documents the mmap() facility available with the PACKET
-socket interface on 2.4/2.6/3.x kernels. This type of sockets is used for
-i) capture network traffic with utilities like tcpdump, ii) transmit network
-traffic, or any other that needs raw access to network interface.
-
-Howto can be found at:
- https://sites.google.com/site/packetmmap/
-
-Please send your comments to
- Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
- Johann Baudy
-
--------------------------------------------------------------------------------
-+ Why use PACKET_MMAP
---------------------------------------------------------------------------------
-
-In Linux 2.4/2.6/3.x if PACKET_MMAP is not enabled, the capture process is very
-inefficient. It uses very limited buffers and requires one system call to
-capture each packet, it requires two if you want to get packet's timestamp
-(like libpcap always does).
-
-In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size
-configurable circular buffer mapped in user space that can be used to either
-send or receive packets. This way reading packets just needs to wait for them,
-most of the time there is no need to issue a single system call. Concerning
-transmission, multiple packets can be sent through one system call to get the
-highest bandwidth. By using a shared buffer between the kernel and the user
-also has the benefit of minimizing packet copies.
-
-It's fine to use PACKET_MMAP to improve the performance of the capture and
-transmission process, but it isn't everything. At least, if you are capturing
-at high speeds (this is relative to the cpu speed), you should check if the
-device driver of your network interface card supports some sort of interrupt
-load mitigation or (even better) if it supports NAPI, also make sure it is
-enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
-supported by devices of your network. CPU IRQ pinning of your network interface
-card can also be an advantage.
-
---------------------------------------------------------------------------------
-+ How to use mmap() to improve capture process
---------------------------------------------------------------------------------
-
-From the user standpoint, you should use the higher level libpcap library, which
-is a de facto standard, portable across nearly all operating systems
-including Win32.
-
-Packet MMAP support was integrated into libpcap around the time of version 1.3.0;
-TPACKET_V3 support was added in version 1.5.0
-
---------------------------------------------------------------------------------
-+ How to use mmap() directly to improve capture process
---------------------------------------------------------------------------------
-
-From the system calls stand point, the use of PACKET_MMAP involves
-the following process:
-
-
-[setup] socket() -------> creation of the capture socket
- setsockopt() ---> allocation of the circular buffer (ring)
- option: PACKET_RX_RING
- mmap() ---------> mapping of the allocated buffer to the
- user process
-
-[capture] poll() ---------> to wait for incoming packets
-
-[shutdown] close() --------> destruction of the capture socket and
- deallocation of all associated
- resources.
-
-
-socket creation and destruction is straight forward, and is done
-the same way with or without PACKET_MMAP:
-
- int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));
-
-where mode is SOCK_RAW for the raw interface were link level
-information can be captured or SOCK_DGRAM for the cooked
-interface where link level information capture is not
-supported and a link level pseudo-header is provided
-by the kernel.
-
-The destruction of the socket and all associated resources
-is done by a simple call to close(fd).
-
-Similarly as without PACKET_MMAP, it is possible to use one socket
-for capture and transmission. This can be done by mapping the
-allocated RX and TX buffer ring with a single mmap() call.
-See "Mapping and use of the circular buffer (ring)".
-
-Next I will describe PACKET_MMAP settings and its constraints,
-also the mapping of the circular buffer in the user process and
-the use of this buffer.
-
---------------------------------------------------------------------------------
-+ How to use mmap() directly to improve transmission process
---------------------------------------------------------------------------------
-Transmission process is similar to capture as shown below.
-
-[setup] socket() -------> creation of the transmission socket
- setsockopt() ---> allocation of the circular buffer (ring)
- option: PACKET_TX_RING
- bind() ---------> bind transmission socket with a network interface
- mmap() ---------> mapping of the allocated buffer to the
- user process
-
-[transmission] poll() ---------> wait for free packets (optional)
- send() ---------> send all packets that are set as ready in
- the ring
- The flag MSG_DONTWAIT can be used to return
- before end of transfer.
-
-[shutdown] close() --------> destruction of the transmission socket and
- deallocation of all associated resources.
-
-Socket creation and destruction is also straight forward, and is done
-the same way as in capturing described in the previous paragraph:
-
- int fd = socket(PF_PACKET, mode, 0);
-
-The protocol can optionally be 0 in case we only want to transmit
-via this socket, which avoids an expensive call to packet_rcv().
-In this case, you also need to bind(2) the TX_RING with sll_protocol = 0
-set. Otherwise, htons(ETH_P_ALL) or any other protocol, for example.
-
-Binding the socket to your network interface is mandatory (with zero copy) to
-know the header size of frames used in the circular buffer.
-
-As capture, each frame contains two parts:
-
- --------------------
-| struct tpacket_hdr | Header. It contains the status of
-| | of this frame
-|--------------------|
-| data buffer |
-. . Data that will be sent over the network interface.
-. .
- --------------------
-
- bind() associates the socket to your network interface thanks to
- sll_ifindex parameter of struct sockaddr_ll.
-
- Initialization example:
-
- struct sockaddr_ll my_addr;
- struct ifreq s_ifr;
- ...
-
- strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));
-
- /* get interface index of eth0 */
- ioctl(this->socket, SIOCGIFINDEX, &s_ifr);
-
- /* fill sockaddr_ll struct to prepare binding */
- my_addr.sll_family = AF_PACKET;
- my_addr.sll_protocol = htons(ETH_P_ALL);
- my_addr.sll_ifindex = s_ifr.ifr_ifindex;
-
- /* bind socket to eth0 */
- bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));
-
- A complete tutorial is available at: https://sites.google.com/site/packetmmap/
-
-By default, the user should put data at :
- frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll)
-
-So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW),
-the beginning of the user data will be at :
- frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
-
-If you wish to put user data at a custom offset from the beginning of
-the frame (for payload alignment with SOCK_RAW mode for instance) you
-can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order
-to make this work it must be enabled previously with setsockopt()
-and the PACKET_TX_HAS_OFF option.
-
---------------------------------------------------------------------------------
-+ PACKET_MMAP settings
---------------------------------------------------------------------------------
-
-To setup PACKET_MMAP from user level code is done with a call like
-
- - Capture process
- setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))
- - Transmission process
- setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))
-
-The most significant argument in the previous call is the req parameter,
-this parameter must to have the following structure:
-
- struct tpacket_req
- {
- unsigned int tp_block_size; /* Minimal size of contiguous block */
- unsigned int tp_block_nr; /* Number of blocks */
- unsigned int tp_frame_size; /* Size of frame */
- unsigned int tp_frame_nr; /* Total number of frames */
- };
-
-This structure is defined in /usr/include/linux/if_packet.h and establishes a
-circular buffer (ring) of unswappable memory.
-Being mapped in the capture process allows reading the captured frames and
-related meta-information like timestamps without requiring a system call.
-
-Frames are grouped in blocks. Each block is a physically contiguous
-region of memory and holds tp_block_size/tp_frame_size frames. The total number
-of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because
-
- frames_per_block = tp_block_size/tp_frame_size
-
-indeed, packet_set_ring checks that the following condition is true
-
- frames_per_block * tp_block_nr == tp_frame_nr
-
-Lets see an example, with the following values:
-
- tp_block_size= 4096
- tp_frame_size= 2048
- tp_block_nr = 4
- tp_frame_nr = 8
-
-we will get the following buffer structure:
-
- block #1 block #2
-+---------+---------+ +---------+---------+
-| frame 1 | frame 2 | | frame 3 | frame 4 |
-+---------+---------+ +---------+---------+
-
- block #3 block #4
-+---------+---------+ +---------+---------+
-| frame 5 | frame 6 | | frame 7 | frame 8 |
-+---------+---------+ +---------+---------+
-
-A frame can be of any size with the only condition it can fit in a block. A block
-can only hold an integer number of frames, or in other words, a frame cannot
-be spawned across two blocks, so there are some details you have to take into
-account when choosing the frame_size. See "Mapping and use of the circular
-buffer (ring)".
-
---------------------------------------------------------------------------------
-+ PACKET_MMAP setting constraints
---------------------------------------------------------------------------------
-
-In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch),
-the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or
-16384 in a 64 bit architecture. For information on these kernel versions
-see http://pusa.uv.es/~ulisses/packet_mmap/packet_mmap.pre-2.4.26_2.6.5.txt
-
- Block size limit
-------------------
-
-As stated earlier, each block is a contiguous physical region of memory. These
-memory regions are allocated with calls to the __get_free_pages() function. As
-the name indicates, this function allocates pages of memory, and the second
-argument is "order" or a power of two number of pages, that is
-(for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes,
-order=2 ==> 16384 bytes, etc. The maximum size of a
-region allocated by __get_free_pages is determined by the MAX_ORDER macro. More
-precisely the limit can be calculated as:
-
- PAGE_SIZE << MAX_ORDER
-
- In a i386 architecture PAGE_SIZE is 4096 bytes
- In a 2.4/i386 kernel MAX_ORDER is 10
- In a 2.6/i386 kernel MAX_ORDER is 11
-
-So get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel
-respectively, with an i386 architecture.
-
-User space programs can include /usr/include/sys/user.h and
-/usr/include/linux/mmzone.h to get PAGE_SIZE MAX_ORDER declarations.
-
-The pagesize can also be determined dynamically with the getpagesize (2)
-system call.
-
- Block number limit
---------------------
-
-To understand the constraints of PACKET_MMAP, we have to see the structure
-used to hold the pointers to each block.
-
-Currently, this structure is a dynamically allocated vector with kmalloc
-called pg_vec, its size limits the number of blocks that can be allocated.
-
- +---+---+---+---+
- | x | x | x | x |
- +---+---+---+---+
- | | | |
- | | | v
- | | v block #4
- | v block #3
- v block #2
- block #1
-
-kmalloc allocates any number of bytes of physically contiguous memory from
-a pool of pre-determined sizes. This pool of memory is maintained by the slab
-allocator which is at the end the responsible for doing the allocation and
-hence which imposes the maximum memory that kmalloc can allocate.
-
-In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The
-predetermined sizes that kmalloc uses can be checked in the "size-<bytes>"
-entries of /proc/slabinfo
-
-In a 32 bit architecture, pointers are 4 bytes long, so the total number of
-pointers to blocks is
-
- 131072/4 = 32768 blocks
-
- PACKET_MMAP buffer size calculator
-------------------------------------
-
-Definitions:
-
-<size-max> : is the maximum size of allocable with kmalloc (see /proc/slabinfo)
-<pointer size>: depends on the architecture -- sizeof(void *)
-<page size> : depends on the architecture -- PAGE_SIZE or getpagesize (2)
-<max-order> : is the value defined with MAX_ORDER
-<frame size> : it's an upper bound of frame's capture size (more on this later)
-
-from these definitions we will derive
-
- <block number> = <size-max>/<pointer size>
- <block size> = <pagesize> << <max-order>
-
-so, the max buffer size is
-
- <block number> * <block size>
-
-and, the number of frames be
-
- <block number> * <block size> / <frame size>
-
-Suppose the following parameters, which apply for 2.6 kernel and an
-i386 architecture:
-
- <size-max> = 131072 bytes
- <pointer size> = 4 bytes
- <pagesize> = 4096 bytes
- <max-order> = 11
-
-and a value for <frame size> of 2048 bytes. These parameters will yield
-
- <block number> = 131072/4 = 32768 blocks
- <block size> = 4096 << 11 = 8 MiB.
-
-and hence the buffer will have a 262144 MiB size. So it can hold
-262144 MiB / 2048 bytes = 134217728 frames
-
-Actually, this buffer size is not possible with an i386 architecture.
-Remember that the memory is allocated in kernel space, in the case of
-an i386 kernel's memory size is limited to 1GiB.
-
-All memory allocations are not freed until the socket is closed. The memory
-allocations are done with GFP_KERNEL priority, this basically means that
-the allocation can wait and swap other process' memory in order to allocate
-the necessary memory, so normally limits can be reached.
-
- Other constraints
--------------------
-
-If you check the source code you will see that what I draw here as a frame
-is not only the link level frame. At the beginning of each frame there is a
-header called struct tpacket_hdr used in PACKET_MMAP to hold link level's frame
-meta information like timestamp. So what we draw here a frame it's really
-the following (from include/linux/if_packet.h):
-
-/*
- Frame structure:
-
- - Start. Frame must be aligned to TPACKET_ALIGNMENT=16
- - struct tpacket_hdr
- - pad to TPACKET_ALIGNMENT=16
- - struct sockaddr_ll
- - Gap, chosen so that packet data (Start+tp_net) aligns to
- TPACKET_ALIGNMENT=16
- - Start+tp_mac: [ Optional MAC header ]
- - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
- - Pad to align to TPACKET_ALIGNMENT=16
- */
-
- The following are conditions that are checked in packet_set_ring
-
- tp_block_size must be a multiple of PAGE_SIZE (1)
- tp_frame_size must be greater than TPACKET_HDRLEN (obvious)
- tp_frame_size must be a multiple of TPACKET_ALIGNMENT
- tp_frame_nr must be exactly frames_per_block*tp_block_nr
-
-Note that tp_block_size should be chosen to be a power of two or there will
-be a waste of memory.
-
---------------------------------------------------------------------------------
-+ Mapping and use of the circular buffer (ring)
---------------------------------------------------------------------------------
-
-The mapping of the buffer in the user process is done with the conventional
-mmap function. Even the circular buffer is compound of several physically
-discontiguous blocks of memory, they are contiguous to the user space, hence
-just one call to mmap is needed:
-
- mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
-
-If tp_frame_size is a divisor of tp_block_size frames will be
-contiguously spaced by tp_frame_size bytes. If not, each
-tp_block_size/tp_frame_size frames there will be a gap between
-the frames. This is because a frame cannot be spawn across two
-blocks.
-
-To use one socket for capture and transmission, the mapping of both the
-RX and TX buffer ring has to be done with one call to mmap:
-
- ...
- setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &foo, sizeof(foo));
- setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &bar, sizeof(bar));
- ...
- rx_ring = mmap(0, size * 2, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
- tx_ring = rx_ring + size;
-
-RX must be the first as the kernel maps the TX ring memory right
-after the RX one.
-
-At the beginning of each frame there is an status field (see
-struct tpacket_hdr). If this field is 0 means that the frame is ready
-to be used for the kernel, If not, there is a frame the user can read
-and the following flags apply:
-
-+++ Capture process:
- from include/linux/if_packet.h
-
- #define TP_STATUS_COPY (1 << 1)
- #define TP_STATUS_LOSING (1 << 2)
- #define TP_STATUS_CSUMNOTREADY (1 << 3)
- #define TP_STATUS_CSUM_VALID (1 << 7)
-
-TP_STATUS_COPY : This flag indicates that the frame (and associated
- meta information) has been truncated because it's
- larger than tp_frame_size. This packet can be
- read entirely with recvfrom().
-
- In order to make this work it must to be
- enabled previously with setsockopt() and
- the PACKET_COPY_THRESH option.
-
- The number of frames that can be buffered to
- be read with recvfrom is limited like a normal socket.
- See the SO_RCVBUF option in the socket (7) man page.
-
-TP_STATUS_LOSING : indicates there were packet drops from last time
- statistics where checked with getsockopt() and
- the PACKET_STATISTICS option.
-
-TP_STATUS_CSUMNOTREADY: currently it's used for outgoing IP packets which
- its checksum will be done in hardware. So while
- reading the packet we should not try to check the
- checksum.
-
-TP_STATUS_CSUM_VALID : This flag indicates that at least the transport
- header checksum of the packet has been already
- validated on the kernel side. If the flag is not set
- then we are free to check the checksum by ourselves
- provided that TP_STATUS_CSUMNOTREADY is also not set.
-
-for convenience there are also the following defines:
-
- #define TP_STATUS_KERNEL 0
- #define TP_STATUS_USER 1
-
-The kernel initializes all frames to TP_STATUS_KERNEL, when the kernel
-receives a packet it puts in the buffer and updates the status with
-at least the TP_STATUS_USER flag. Then the user can read the packet,
-once the packet is read the user must zero the status field, so the kernel
-can use again that frame buffer.
-
-The user can use poll (any other variant should apply too) to check if new
-packets are in the ring:
-
- struct pollfd pfd;
-
- pfd.fd = fd;
- pfd.revents = 0;
- pfd.events = POLLIN|POLLRDNORM|POLLERR;
-
- if (status == TP_STATUS_KERNEL)
- retval = poll(&pfd, 1, timeout);
-
-It doesn't incur in a race condition to first check the status value and
-then poll for frames.
-
-++ Transmission process
-Those defines are also used for transmission:
-
- #define TP_STATUS_AVAILABLE 0 // Frame is available
- #define TP_STATUS_SEND_REQUEST 1 // Frame will be sent on next send()
- #define TP_STATUS_SENDING 2 // Frame is currently in transmission
- #define TP_STATUS_WRONG_FORMAT 4 // Frame format is not correct
-
-First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a
-packet, the user fills a data buffer of an available frame, sets tp_len to
-current data buffer size and sets its status field to TP_STATUS_SEND_REQUEST.
-This can be done on multiple frames. Once the user is ready to transmit, it
-calls send(). Then all buffers with status equal to TP_STATUS_SEND_REQUEST are
-forwarded to the network device. The kernel updates each status of sent
-frames with TP_STATUS_SENDING until the end of transfer.
-At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE.
-
- header->tp_len = in_i_size;
- header->tp_status = TP_STATUS_SEND_REQUEST;
- retval = send(this->socket, NULL, 0, 0);
-
-The user can also use poll() to check if a buffer is available:
-(status == TP_STATUS_SENDING)
-
- struct pollfd pfd;
- pfd.fd = fd;
- pfd.revents = 0;
- pfd.events = POLLOUT;
- retval = poll(&pfd, 1, timeout);
-
--------------------------------------------------------------------------------
-+ What TPACKET versions are available and when to use them?
--------------------------------------------------------------------------------
-
- int val = tpacket_version;
- setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
- getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
-
-where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
-
-TPACKET_V1:
- - Default if not otherwise specified by setsockopt(2)
- - RX_RING, TX_RING available
-
-TPACKET_V1 --> TPACKET_V2:
- - Made 64 bit clean due to unsigned long usage in TPACKET_V1
- structures, thus this also works on 64 bit kernel with 32 bit
- userspace and the like
- - Timestamp resolution in nanoseconds instead of microseconds
- - RX_RING, TX_RING available
- - VLAN metadata information available for packets
- (TP_STATUS_VLAN_VALID, TP_STATUS_VLAN_TPID_VALID),
- in the tpacket2_hdr structure:
- - TP_STATUS_VLAN_VALID bit being set into the tp_status field indicates
- that the tp_vlan_tci field has valid VLAN TCI value
- - TP_STATUS_VLAN_TPID_VALID bit being set into the tp_status field
- indicates that the tp_vlan_tpid field has valid VLAN TPID value
- - How to switch to TPACKET_V2:
- 1. Replace struct tpacket_hdr by struct tpacket2_hdr
- 2. Query header len and save
- 3. Set protocol version to 2, set up ring as usual
- 4. For getting the sockaddr_ll,
- use (void *)hdr + TPACKET_ALIGN(hdrlen) instead of
- (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
-
-TPACKET_V2 --> TPACKET_V3:
- - Flexible buffer implementation for RX_RING:
- 1. Blocks can be configured with non-static frame-size
- 2. Read/poll is at a block-level (as opposed to packet-level)
- 3. Added poll timeout to avoid indefinite user-space wait
- on idle links
- 4. Added user-configurable knobs:
- 4.1 block::timeout
- 4.2 tpkt_hdr::sk_rxhash
- - RX Hash data available in user space
- - TX_RING semantics are conceptually similar to TPACKET_V2;
- use tpacket3_hdr instead of tpacket2_hdr, and TPACKET3_HDRLEN
- instead of TPACKET2_HDRLEN. In the current implementation,
- the tp_next_offset field in the tpacket3_hdr MUST be set to
- zero, indicating that the ring does not hold variable sized frames.
- Packets with non-zero values of tp_next_offset will be dropped.
-
--------------------------------------------------------------------------------
-+ AF_PACKET fanout mode
--------------------------------------------------------------------------------
-
-In the AF_PACKET fanout mode, packet reception can be load balanced among
-processes. This also works in combination with mmap(2) on packet sockets.
-
-Currently implemented fanout policies are:
-
- - PACKET_FANOUT_HASH: schedule to socket by skb's packet hash
- - PACKET_FANOUT_LB: schedule to socket by round-robin
- - PACKET_FANOUT_CPU: schedule to socket by CPU packet arrives on
- - PACKET_FANOUT_RND: schedule to socket by random selection
- - PACKET_FANOUT_ROLLOVER: if one socket is full, rollover to another
- - PACKET_FANOUT_QM: schedule to socket by skbs recorded queue_mapping
-
-Minimal example code by David S. Miller (try things like "./test eth0 hash",
-"./test eth0 lb", etc.):
-
-#include <stddef.h>
-#include <stdlib.h>
-#include <stdio.h>
-#include <string.h>
-
-#include <sys/types.h>
-#include <sys/wait.h>
-#include <sys/socket.h>
-#include <sys/ioctl.h>
-
-#include <unistd.h>
-
-#include <linux/if_ether.h>
-#include <linux/if_packet.h>
-
-#include <net/if.h>
-
-static const char *device_name;
-static int fanout_type;
-static int fanout_id;
-
-#ifndef PACKET_FANOUT
-# define PACKET_FANOUT 18
-# define PACKET_FANOUT_HASH 0
-# define PACKET_FANOUT_LB 1
-#endif
-
-static int setup_socket(void)
-{
- int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
- struct sockaddr_ll ll;
- struct ifreq ifr;
- int fanout_arg;
-
- if (fd < 0) {
- perror("socket");
- return EXIT_FAILURE;
- }
-
- memset(&ifr, 0, sizeof(ifr));
- strcpy(ifr.ifr_name, device_name);
- err = ioctl(fd, SIOCGIFINDEX, &ifr);
- if (err < 0) {
- perror("SIOCGIFINDEX");
- return EXIT_FAILURE;
- }
-
- memset(&ll, 0, sizeof(ll));
- ll.sll_family = AF_PACKET;
- ll.sll_ifindex = ifr.ifr_ifindex;
- err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
- if (err < 0) {
- perror("bind");
- return EXIT_FAILURE;
- }
-
- fanout_arg = (fanout_id | (fanout_type << 16));
- err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
- &fanout_arg, sizeof(fanout_arg));
- if (err) {
- perror("setsockopt");
- return EXIT_FAILURE;
- }
-
- return fd;
-}
-
-static void fanout_thread(void)
-{
- int fd = setup_socket();
- int limit = 10000;
-
- if (fd < 0)
- exit(fd);
-
- while (limit-- > 0) {
- char buf[1600];
- int err;
-
- err = read(fd, buf, sizeof(buf));
- if (err < 0) {
- perror("read");
- exit(EXIT_FAILURE);
- }
- if ((limit % 10) == 0)
- fprintf(stdout, "(%d) \n", getpid());
- }
-
- fprintf(stdout, "%d: Received 10000 packets\n", getpid());
-
- close(fd);
- exit(0);
-}
-
-int main(int argc, char **argp)
-{
- int fd, err;
- int i;
-
- if (argc != 3) {
- fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
- return EXIT_FAILURE;
- }
-
- if (!strcmp(argp[2], "hash"))
- fanout_type = PACKET_FANOUT_HASH;
- else if (!strcmp(argp[2], "lb"))
- fanout_type = PACKET_FANOUT_LB;
- else {
- fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
- exit(EXIT_FAILURE);
- }
-
- device_name = argp[1];
- fanout_id = getpid() & 0xffff;
-
- for (i = 0; i < 4; i++) {
- pid_t pid = fork();
-
- switch (pid) {
- case 0:
- fanout_thread();
-
- case -1:
- perror("fork");
- exit(EXIT_FAILURE);
- }
- }
-
- for (i = 0; i < 4; i++) {
- int status;
-
- wait(&status);
- }
-
- return 0;
-}
-
--------------------------------------------------------------------------------
-+ AF_PACKET TPACKET_V3 example
--------------------------------------------------------------------------------
-
-AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame
-sizes by doing it's own memory management. It is based on blocks where polling
-works on a per block basis instead of per ring as in TPACKET_V2 and predecessor.
-
-It is said that TPACKET_V3 brings the following benefits:
- *) ~15 - 20% reduction in CPU-usage
- *) ~20% increase in packet capture rate
- *) ~2x increase in packet density
- *) Port aggregation analysis
- *) Non static frame size to capture entire packet payload
-
-So it seems to be a good candidate to be used with packet fanout.
-
-Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile
-it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.):
-
-/* Written from scratch, but kernel-to-user space API usage
- * dissected from lolpcap:
- * Copyright 2011, Chetan Loke <loke.chetan@gmail.com>
- * License: GPL, version 2.0
- */
-
-#include <stdio.h>
-#include <stdlib.h>
-#include <stdint.h>
-#include <string.h>
-#include <assert.h>
-#include <net/if.h>
-#include <arpa/inet.h>
-#include <netdb.h>
-#include <poll.h>
-#include <unistd.h>
-#include <signal.h>
-#include <inttypes.h>
-#include <sys/socket.h>
-#include <sys/mman.h>
-#include <linux/if_packet.h>
-#include <linux/if_ether.h>
-#include <linux/ip.h>
-
-#ifndef likely
-# define likely(x) __builtin_expect(!!(x), 1)
-#endif
-#ifndef unlikely
-# define unlikely(x) __builtin_expect(!!(x), 0)
-#endif
-
-struct block_desc {
- uint32_t version;
- uint32_t offset_to_priv;
- struct tpacket_hdr_v1 h1;
-};
-
-struct ring {
- struct iovec *rd;
- uint8_t *map;
- struct tpacket_req3 req;
-};
-
-static unsigned long packets_total = 0, bytes_total = 0;
-static sig_atomic_t sigint = 0;
-
-static void sighandler(int num)
-{
- sigint = 1;
-}
-
-static int setup_socket(struct ring *ring, char *netdev)
-{
- int err, i, fd, v = TPACKET_V3;
- struct sockaddr_ll ll;
- unsigned int blocksiz = 1 << 22, framesiz = 1 << 11;
- unsigned int blocknum = 64;
-
- fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
- if (fd < 0) {
- perror("socket");
- exit(1);
- }
-
- err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v));
- if (err < 0) {
- perror("setsockopt");
- exit(1);
- }
-
- memset(&ring->req, 0, sizeof(ring->req));
- ring->req.tp_block_size = blocksiz;
- ring->req.tp_frame_size = framesiz;
- ring->req.tp_block_nr = blocknum;
- ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz;
- ring->req.tp_retire_blk_tov = 60;
- ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH;
-
- err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req,
- sizeof(ring->req));
- if (err < 0) {
- perror("setsockopt");
- exit(1);
- }
-
- ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr,
- PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0);
- if (ring->map == MAP_FAILED) {
- perror("mmap");
- exit(1);
- }
-
- ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd));
- assert(ring->rd);
- for (i = 0; i < ring->req.tp_block_nr; ++i) {
- ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size);
- ring->rd[i].iov_len = ring->req.tp_block_size;
- }
-
- memset(&ll, 0, sizeof(ll));
- ll.sll_family = PF_PACKET;
- ll.sll_protocol = htons(ETH_P_ALL);
- ll.sll_ifindex = if_nametoindex(netdev);
- ll.sll_hatype = 0;
- ll.sll_pkttype = 0;
- ll.sll_halen = 0;
-
- err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
- if (err < 0) {
- perror("bind");
- exit(1);
- }
-
- return fd;
-}
-
-static void display(struct tpacket3_hdr *ppd)
-{
- struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac);
- struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN);
-
- if (eth->h_proto == htons(ETH_P_IP)) {
- struct sockaddr_in ss, sd;
- char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST];
-
- memset(&ss, 0, sizeof(ss));
- ss.sin_family = PF_INET;
- ss.sin_addr.s_addr = ip->saddr;
- getnameinfo((struct sockaddr *) &ss, sizeof(ss),
- sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST);
-
- memset(&sd, 0, sizeof(sd));
- sd.sin_family = PF_INET;
- sd.sin_addr.s_addr = ip->daddr;
- getnameinfo((struct sockaddr *) &sd, sizeof(sd),
- dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST);
-
- printf("%s -> %s, ", sbuff, dbuff);
- }
-
- printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash);
-}
-
-static void walk_block(struct block_desc *pbd, const int block_num)
-{
- int num_pkts = pbd->h1.num_pkts, i;
- unsigned long bytes = 0;
- struct tpacket3_hdr *ppd;
-
- ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd +
- pbd->h1.offset_to_first_pkt);
- for (i = 0; i < num_pkts; ++i) {
- bytes += ppd->tp_snaplen;
- display(ppd);
-
- ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd +
- ppd->tp_next_offset);
- }
-
- packets_total += num_pkts;
- bytes_total += bytes;
-}
-
-static void flush_block(struct block_desc *pbd)
-{
- pbd->h1.block_status = TP_STATUS_KERNEL;
-}
-
-static void teardown_socket(struct ring *ring, int fd)
-{
- munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr);
- free(ring->rd);
- close(fd);
-}
-
-int main(int argc, char **argp)
-{
- int fd, err;
- socklen_t len;
- struct ring ring;
- struct pollfd pfd;
- unsigned int block_num = 0, blocks = 64;
- struct block_desc *pbd;
- struct tpacket_stats_v3 stats;
-
- if (argc != 2) {
- fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]);
- return EXIT_FAILURE;
- }
-
- signal(SIGINT, sighandler);
-
- memset(&ring, 0, sizeof(ring));
- fd = setup_socket(&ring, argp[argc - 1]);
- assert(fd > 0);
-
- memset(&pfd, 0, sizeof(pfd));
- pfd.fd = fd;
- pfd.events = POLLIN | POLLERR;
- pfd.revents = 0;
-
- while (likely(!sigint)) {
- pbd = (struct block_desc *) ring.rd[block_num].iov_base;
-
- if ((pbd->h1.block_status & TP_STATUS_USER) == 0) {
- poll(&pfd, 1, -1);
- continue;
- }
-
- walk_block(pbd, block_num);
- flush_block(pbd);
- block_num = (block_num + 1) % blocks;
- }
-
- len = sizeof(stats);
- err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len);
- if (err < 0) {
- perror("getsockopt");
- exit(1);
- }
-
- fflush(stdout);
- printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n",
- stats.tp_packets, bytes_total, stats.tp_drops,
- stats.tp_freeze_q_cnt);
-
- teardown_socket(&ring, fd);
- return 0;
-}
-
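The ring sizing done in setup_socket() above can be checked in isolation. What follows is a minimal sketch, not part of the original example: `struct ring_geometry` and `ring_geometry()` are hypothetical names, and rejecting a frame size that does not divide the block size evenly is a simplification of this sketch, not a rule taken from this document.

```c
#include <stddef.h>

/* Hypothetical helper type: derived sizes for a TPACKET_V3 RX ring. */
struct ring_geometry {
	unsigned int frame_nr;	/* value to store in tp_frame_nr */
	size_t map_len;		/* length to pass to mmap(2) */
};

/*
 * Compute tp_frame_nr and the mmap length from the block and frame sizes,
 * using the same arithmetic as the example (frames per block times blocks).
 * Returns 0 on success, -1 if frame_size is zero or does not divide
 * block_size (a restriction of this sketch only).
 */
int ring_geometry(unsigned int block_size, unsigned int block_nr,
		  unsigned int frame_size, struct ring_geometry *g)
{
	if (frame_size == 0 || block_size % frame_size != 0)
		return -1;

	g->frame_nr = (block_size / frame_size) * block_nr;
	g->map_len = (size_t)block_size * block_nr;
	return 0;
}
```

With the values used in the example (4 MiB blocks, 2 KiB frames, 64 blocks) this yields 131072 frames and a 256 MiB mapping.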
--------------------------------------------------------------------------------
-+ PACKET_QDISC_BYPASS
--------------------------------------------------------------------------------
-
-If there is a requirement to load the network with many packets in a similar
-fashion as pktgen does, you might set the following option after socket
-creation:
-
- int one = 1;
- setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one));
-
-This has the side effect that packets sent through PF_PACKET bypass the
-kernel's qdisc layer and are pushed directly to the driver. This means that
-packets are not buffered, tc disciplines are ignored, increased loss can occur
-and such packets are no longer visible to other PF_PACKET sockets. So,
-you have been warned; generally, this can be useful for stress testing various
-components of a system.
-
-By default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly enabled
-on PF_PACKET sockets.
-
--------------------------------------------------------------------------------
-+ PACKET_TIMESTAMP
--------------------------------------------------------------------------------
-
-The PACKET_TIMESTAMP setting determines the source of the timestamp in
-the packet meta information for mmap(2)ed RX_RING and TX_RINGs. If your
-NIC is capable of timestamping packets in hardware, you can request those
-hardware timestamps to be used. Note: you may need to enable the generation
-of hardware timestamps with SIOCSHWTSTAMP (see related information from
-Documentation/networking/timestamping.txt).
-
-PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING:
-
- int req = SOF_TIMESTAMPING_RAW_HARDWARE;
- setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req));
-
-For the mmap(2)ed ring buffers, such timestamps are stored in the
-tpacket{,2,3}_hdr structure's tp_sec and tp_{n,u}sec members. To determine
-what kind of timestamp has been reported, the tp_status field is bitwise ORed
-with the following possible bits ...
-
- TP_STATUS_TS_RAW_HARDWARE
- TP_STATUS_TS_SOFTWARE
-
-... that are equivalent to its SOF_TIMESTAMPING_* counterparts. For the
-RX_RING, if neither is set (i.e. PACKET_TIMESTAMP is not set), then a
-software fallback was invoked *within* PF_PACKET's processing code (less
-precise).
-
-Getting timestamps for the TX_RING works as follows: i) fill the ring frames,
-ii) call sendto() e.g. in blocking mode, iii) wait for the status of the relevant
-frames to be updated, or for the frame to be handed over to the application,
-iv) walk through the frames to pick up the individual hw/sw timestamps.
-
-Only (!) if transmit timestamping is enabled are these bits combined with
-TP_STATUS_AVAILABLE using a bitwise OR, so you must check for that in your
-application (e.g. !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING))
-in a first step to see if the frame belongs to the application, and then
-extract the type of timestamp from tp_status in a second step)!
-
-If you don't care about timestamps and leave them disabled, checking for
-TP_STATUS_AVAILABLE or TP_STATUS_WRONG_FORMAT is sufficient. If in the
-TX_RING part only TP_STATUS_AVAILABLE is set, then the tp_sec and tp_{n,u}sec
-members do not contain a valid value. For TX_RINGs, by default no timestamp
-is generated!
-
-See include/linux/net_tstamp.h and Documentation/networking/timestamping.txt
-for more information on hardware timestamps.
-
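The two-step TX_RING status check described above can be sketched as a small pure function. This is an illustration only: `enum tx_ts_kind` and `classify_tx_status()` are hypothetical names that do not exist in the kernel headers, while the TP_STATUS_* constants come from <linux/if_packet.h>. TP_STATUS_WRONG_FORMAT is not handled separately in this sketch.

```c
#include <stdint.h>
#include <linux/if_packet.h>

/* Hypothetical classification of a TX_RING frame, for illustration only. */
enum tx_ts_kind {
	TX_TS_NOT_OURS,	/* frame is still queued or being sent by the kernel */
	TX_TS_NONE,	/* frame is ours, no timestamp bit is set */
	TX_TS_SOFTWARE,	/* tp_sec/tp_{n,u}sec hold a software timestamp */
	TX_TS_HARDWARE,	/* tp_sec/tp_{n,u}sec hold a raw hardware timestamp */
};

enum tx_ts_kind classify_tx_status(uint32_t tp_status)
{
	/* Step 1: has the frame been handed back to the application? */
	if (tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING))
		return TX_TS_NOT_OURS;

	/* Step 2: which kind of timestamp, if any, was reported? */
	if (tp_status & TP_STATUS_TS_RAW_HARDWARE)
		return TX_TS_HARDWARE;
	if (tp_status & TP_STATUS_TS_SOFTWARE)
		return TX_TS_SOFTWARE;

	return TX_TS_NONE;
}
```

A caller would read tp_sec and tp_{n,u}sec only when the result is TX_TS_SOFTWARE or TX_TS_HARDWARE.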
--------------------------------------------------------------------------------
-+ Miscellaneous bits
--------------------------------------------------------------------------------
-
-- Packet sockets work well together with Linux socket filters, thus you also
- might want to have a look at Documentation/networking/filter.txt
-
---------------------------------------------------------------------------------
-+ THANKS
---------------------------------------------------------------------------------
-
- Jesse Brandeburg, for fixing my grammatical/spelling errors
-
diff --git a/Documentation/networking/page_pool.rst b/Documentation/networking/page_pool.rst
new file mode 100644
index 000000000000..5db8c263b0c6
--- /dev/null
+++ b/Documentation/networking/page_pool.rst
@@ -0,0 +1,223 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Page Pool API
+=============
+
+The page_pool allocator is optimized for the XDP mode that uses one frame
+per-page, but it can fall back on the regular page allocator APIs.
+
+Basic use involves replacing alloc_pages() calls with the
+page_pool_alloc_pages() call. Drivers should use page_pool_dev_alloc_pages()
+replacing dev_alloc_pages().
+
+The API keeps track of inflight pages in order to let API users know
+when it is safe to free a page_pool object. Thus, API users
+must run page_pool_release_page() when a page is leaving the page_pool or
+call page_pool_put_page() where appropriate in order to maintain correct
+accounting.
+
+API users must call page_pool_put_page() once on a page, as it
+will either recycle the page, or in case of refcnt > 1, it will
+release the DMA mapping and inflight state accounting.
+
+Architecture overview
+=====================
+
+.. code-block:: none
+
+ +------------------+
+ | Driver |
+ +------------------+
+ ^
+ |
+ |
+ |
+ v
+ +--------------------------------------------+
+ | request memory |
+ +--------------------------------------------+
+ ^ ^
+ | |
+ | Pool empty | Pool has entries
+ | |
+ v v
+ +-----------------------+ +------------------------+
+ | alloc (and map) pages | | get page from cache |
+ +-----------------------+ +------------------------+
+ ^ ^
+ | |
+ | cache available | No entries, refill
+ | | from ptr-ring
+ | |
+ v v
+ +-----------------+ +------------------+
+ | Fast cache | | ptr-ring cache |
+ +-----------------+ +------------------+
+
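The allocation order in the diagram can be modelled in a few lines of plain C. This is a userspace toy, not kernel code: `struct toy_pool`, its functions, the array sizes and the integer "page ids" are all assumptions made for illustration, and the real page_pool differs in many details (locking, DMA mapping, NUMA checks).

```c
#include <stddef.h>

#define TOY_CACHE_SIZE 4	/* assumed size of the fast cache */
#define TOY_RING_SIZE  8	/* assumed size of the ptr-ring stand-in */

struct toy_pool {
	int cache[TOY_CACHE_SIZE];	/* fast cache of page ids (LIFO) */
	size_t cache_count;
	int ring[TOY_RING_SIZE];	/* stand-in for the ptr-ring */
	size_t ring_count;
	int next_id;			/* next id from the pretend slow path */
	unsigned long fast, refill, slow;
};

/* Return a page id, trying the sources in the order shown in the diagram. */
int toy_pool_alloc(struct toy_pool *p)
{
	/* 1. Fast cache has entries: take one. */
	if (p->cache_count > 0) {
		p->fast++;
		return p->cache[--p->cache_count];
	}

	/* 2. Cache is empty: refill it from the ring, then take one. */
	if (p->ring_count > 0) {
		while (p->ring_count > 0 && p->cache_count < TOY_CACHE_SIZE)
			p->cache[p->cache_count++] = p->ring[--p->ring_count];
		p->refill++;
		return p->cache[--p->cache_count];
	}

	/* 3. Pool is empty: fall back to the (pretend) page allocator. */
	p->slow++;
	return p->next_id++;
}

/* Put a page id back into the ring; returns 0, or -1 if the ring is full. */
int toy_pool_recycle(struct toy_pool *p, int id)
{
	if (p->ring_count == TOY_RING_SIZE)
		return -1;
	p->ring[p->ring_count++] = id;
	return 0;
}
```

Starting from a zero-initialized pool, the first allocations take the slow path; once ids have been recycled, the next allocation triggers a refill and later ones are served from the fast cache.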
+API interface
+=============
+The number of pools created **must** match the number of hardware queues
+unless hardware restrictions make that impossible. Doing otherwise would defeat the
+purpose of page pool, which is to allocate pages fast from the cache without locking.
+This lockless guarantee naturally comes from running under a NAPI softirq.
+The protection doesn't strictly have to be NAPI; any guarantee that allocating
+a page will cause no race conditions is enough.
+
+* page_pool_create(): Create a pool.
+ * flags: PP_FLAG_DMA_MAP, PP_FLAG_DMA_SYNC_DEV
+ * order: 2^order pages on allocation
+ * pool_size: size of the ptr_ring
+ * nid: preferred NUMA node for allocation
+ * dev: struct device. Used on DMA operations
+ * dma_dir: DMA direction
+ * max_len: max DMA sync memory size
+ * offset: DMA address offset
+
+* page_pool_put_page(): The outcome of this depends on the page refcnt. If the
+ driver bumps the refcnt > 1 this will unmap the page. If the page refcnt is 1
+ the allocator owns the page and will try to recycle it in one of the pool
+ caches. If PP_FLAG_DMA_SYNC_DEV is set, the page will be synced for_device
+ using dma_sync_single_range_for_device().
+
+* page_pool_put_full_page(): Similar to page_pool_put_page(), but will DMA sync
+ for the entire memory area configured in pool->max_len.
+
+* page_pool_recycle_direct(): Similar to page_pool_put_full_page() but caller
+ must guarantee safe context (e.g. NAPI), since it will recycle the page
+ directly into the pool fast cache.
+
+* page_pool_release_page(): Unmap the page (if mapped) and account for it on
+ inflight counters.
+
+* page_pool_dev_alloc_pages(): Get a page from the page allocator or page_pool
+ caches.
+
+* page_pool_get_dma_addr(): Retrieve the stored DMA address.
+
+* page_pool_get_dma_dir(): Retrieve the stored DMA direction.
+
+* page_pool_put_page_bulk(): Tries to refill a number of pages into the
+ ptr_ring cache holding ptr_ring producer lock. If the ptr_ring is full,
+ page_pool_put_page_bulk() will release leftover pages to the page allocator.
+ page_pool_put_page_bulk() is suitable to be run inside the driver NAPI tx
+ completion loop for the XDP_REDIRECT use case.
+ Please note the caller must not use data area after running
+ page_pool_put_page_bulk(), as this function overwrites it.
+
+* page_pool_get_stats(): Retrieve statistics about the page_pool. This API
+ is only available if the kernel has been configured with
+ ``CONFIG_PAGE_POOL_STATS=y``. A pointer to a caller allocated ``struct
+ page_pool_stats`` structure is passed to this API which is filled in. The
+ caller can then report those stats to the user (perhaps via ethtool,
+ debugfs, etc.). See below for an example usage of this API.
+
+Stats API and structures
+------------------------
+If the kernel is configured with ``CONFIG_PAGE_POOL_STATS=y``, the API
+``page_pool_get_stats()`` and structures described below are available. It
+takes a pointer to a ``struct page_pool`` and a pointer to a ``struct
+page_pool_stats`` allocated by the caller.
+
+The API will fill in the provided ``struct page_pool_stats`` with
+statistics about the page_pool.
+
+The stats structure has the following fields::
+
+ struct page_pool_stats {
+ struct page_pool_alloc_stats alloc_stats;
+ struct page_pool_recycle_stats recycle_stats;
+ };
+
+
+The ``struct page_pool_alloc_stats`` has the following fields:
+ * ``fast``: successful fast path allocations
+ * ``slow``: slow path order-0 allocations
+ * ``slow_high_order``: slow path high order allocations
+ * ``empty``: ptr ring is empty, so a slow path allocation was forced.
+ * ``refill``: an allocation which triggered a refill of the cache
+ * ``waive``: pages obtained from the ptr ring that cannot be added to
+ the cache due to a NUMA mismatch.
+
+The ``struct page_pool_recycle_stats`` has the following fields:
+ * ``cached``: recycling placed page in the page pool cache
+ * ``cache_full``: page pool cache was full
+ * ``ring``: page placed into the ptr ring
+ * ``ring_full``: page released from page pool because the ptr ring was full
+ * ``released_refcnt``: page released (and not recycled) because refcnt > 1
+
+Coding examples
+===============
+
+Registration
+------------
+
+.. code-block:: c
+
+ /* Page pool registration */
+ struct page_pool_params pp_params = { 0 };
+ struct xdp_rxq_info xdp_rxq;
+ int err;
+
+ pp_params.order = 0;
+ /* internal DMA mapping in page_pool */
+ pp_params.flags = PP_FLAG_DMA_MAP;
+ pp_params.pool_size = DESC_NUM;
+ pp_params.nid = NUMA_NO_NODE;
+ pp_params.dev = priv->dev;
+ pp_params.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
+ page_pool = page_pool_create(&pp_params);
+
+ err = xdp_rxq_info_reg(&xdp_rxq, ndev, 0);
+ if (err)
+ goto err_out;
+
+ err = xdp_rxq_info_reg_mem_model(&xdp_rxq, MEM_TYPE_PAGE_POOL, page_pool);
+ if (err)
+ goto err_out;
+
+NAPI poller
+-----------
+
+
+.. code-block:: c
+
+    /* NAPI Rx poller (sketch: rx_error, packet_is_xdp, packet_is_skb and
+     * xdp_act are placeholders for driver-specific state)
+     */
+    enum dma_data_direction dma_dir;
+
+    dma_dir = page_pool_get_dma_dir(dring->page_pool);
+    while (done < budget) {
+        if (rx_error)
+            page_pool_recycle_direct(page_pool, page);
+        if (packet_is_xdp) {
+            if (xdp_act == XDP_DROP)
+                page_pool_recycle_direct(page_pool, page);
+        } else if (packet_is_skb) {
+            page_pool_release_page(page_pool, page);
+            new_page = page_pool_dev_alloc_pages(page_pool);
+        }
+    }
+
+Stats
+-----
+
+.. code-block:: c
+
+ #ifdef CONFIG_PAGE_POOL_STATS
+ /* retrieve stats */
+ struct page_pool_stats stats = { 0 };
+ if (page_pool_get_stats(page_pool, &stats)) {
+ /* perhaps the driver reports statistics with ethtool */
+ ethtool_print_allocation_stats(&stats.alloc_stats);
+ ethtool_print_recycle_stats(&stats.recycle_stats);
+ }
+ #endif
+
+Driver unload
+-------------
+
+.. code-block:: c
+
+ /* Driver unload */
+ page_pool_put_full_page(page_pool, page, false);
+ xdp_rxq_info_unreg(&xdp_rxq);
diff --git a/Documentation/networking/phonet.txt b/Documentation/networking/phonet.rst
index 81003581f47a..8668dcbc5e6a 100644
--- a/Documentation/networking/phonet.txt
+++ b/Documentation/networking/phonet.rst
@@ -1,3 +1,7 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+============================
Linux Phonet protocol family
============================
@@ -11,6 +15,7 @@ device attached to the modem. The modem takes care of routing.
Phonet packets can be exchanged through various hardware connections
depending on the device, such as:
+
- USB with the CDC Phonet interface,
- infrared,
- Bluetooth,
@@ -21,7 +26,7 @@ depending on the device, such as:
Packets format
--------------
-Phonet packets have a common header as follows:
+Phonet packets have a common header as follows::
struct phonethdr {
uint8_t pn_media; /* Media type (link-layer identifier) */
@@ -72,7 +77,7 @@ only the (default) Linux FIFO qdisc should be used with them.
Network layer
-------------
-The Phonet socket address family maps the Phonet packet header:
+The Phonet socket address family maps the Phonet packet header::
struct sockaddr_pn {
sa_family_t spn_family; /* AF_PHONET */
@@ -94,6 +99,8 @@ protocol from the PF_PHONET family. Each socket is bound to one of the
2^10 object IDs available, and can send and receive packets with any
other peer.
+::
+
struct sockaddr_pn addr = { .spn_family = AF_PHONET, };
ssize_t len;
socklen_t addrlen = sizeof(addr);
@@ -105,7 +112,7 @@ other peer.
sendto(fd, msg, msglen, 0, (struct sockaddr *)&addr, sizeof(addr));
len = recvfrom(fd, buf, sizeof(buf), 0,
- (struct sockaddr *)&addr, &addrlen);
+ (struct sockaddr *)&addr, &addrlen);
This protocol follows the SOCK_DGRAM connection-less semantics.
However, connect() and getpeername() are not supported, as they did
@@ -116,7 +123,7 @@ Resource subscription
---------------------
A Phonet datagram socket can be subscribed to any number of 8-bits
-Phonet resources, as follow:
+Phonet resources, as follows::
uint32_t res = 0xXX;
ioctl(fd, SIOCPNADDRESOURCE, &res);
@@ -137,6 +144,8 @@ socket paradigm. The listening socket is bound to an unique free object
ID. Each listening socket can handle up to 255 simultaneous
connections, one per accept()'d socket.
+::
+
int lfd, cfd;
lfd = socket(PF_PHONET, SOCK_SEQPACKET, PN_PROTO_PIPE);
@@ -161,7 +170,7 @@ Connections are traditionally established between two endpoints by a
As of Linux kernel version 2.6.39, it is also possible to connect
two endpoints directly, using connect() on the active side. This is
intended to support the newer Nokia Wireless Modem API, as found in
-e.g. the Nokia Slim Modem in the ST-Ericsson U8500 platform:
+e.g. the Nokia Slim Modem in the ST-Ericsson U8500 platform::
struct sockaddr_spn spn;
int fd;
@@ -177,38 +186,45 @@ e.g. the Nokia Slim Modem in the ST-Ericsson U8500 platform:
close(fd);
-WARNING:
-When polling a connected pipe socket for writability, there is an
-intrinsic race condition whereby writability might be lost between the
-polling and the writing system calls. In this case, the socket will
-block until write becomes possible again, unless non-blocking mode
-is enabled.
+.. Warning::
+
+ When polling a connected pipe socket for writability, there is an
+ intrinsic race condition whereby writability might be lost between the
+ polling and the writing system calls. In this case, the socket will
+ block until write becomes possible again, unless non-blocking mode
+ is enabled.
The pipe protocol provides two socket options at the SOL_PNPIPE level:
PNPIPE_ENCAP accepts one integer value (int) of:
- PNPIPE_ENCAP_NONE: The socket operates normally (default).
+ PNPIPE_ENCAP_NONE:
+ The socket operates normally (default).
- PNPIPE_ENCAP_IP: The socket is used as a backend for a virtual IP
+ PNPIPE_ENCAP_IP:
+ The socket is used as a backend for a virtual IP
interface. This requires CAP_NET_ADMIN capability. GPRS data
support on Nokia modems can use this. Note that the socket cannot
be reliably poll()'d or read() from while in this mode.
- PNPIPE_IFINDEX is a read-only integer value. It contains the
- interface index of the network interface created by PNPIPE_ENCAP,
- or zero if encapsulation is off.
+ PNPIPE_IFINDEX
+ is a read-only integer value. It contains the
+ interface index of the network interface created by PNPIPE_ENCAP,
+ or zero if encapsulation is off.
- PNPIPE_HANDLE is a read-only integer value. It contains the underlying
- identifier ("pipe handle") of the pipe. This is only defined for
- socket descriptors that are already connected or being connected.
+ PNPIPE_HANDLE
+ is a read-only integer value. It contains the underlying
+ identifier ("pipe handle") of the pipe. This is only defined for
+ socket descriptors that are already connected or being connected.
Authors
-------
Linux Phonet was initially written by Sakari Ailus.
+
Other contributors include Mikä Liljeberg, Andras Domokos,
Carlos Chinea and Rémi Denis-Courmont.
-Copyright (C) 2008 Nokia Corporation.
+
+Copyright |copy| 2008 Nokia Corporation.
diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst
index 256106054c8c..d11329a08984 100644
--- a/Documentation/networking/phy.rst
+++ b/Documentation/networking/phy.rst
@@ -80,8 +80,8 @@ values of phy_interface_t must be understood from the perspective of the PHY
device itself, leading to the following:
* PHY_INTERFACE_MODE_RGMII: the PHY is not responsible for inserting any
- internal delay by itself, it assumes that either the Ethernet MAC (if capable
- or the PCB traces) insert the correct 1.5-2ns delay
+ internal delay by itself, it assumes that either the Ethernet MAC (if capable)
+ or the PCB traces insert the correct 1.5-2ns delay
* PHY_INTERFACE_MODE_RGMII_TXID: the PHY should insert an internal delay
for the transmit data lines (TXD[3:0]) processed by the PHY device
@@ -104,7 +104,7 @@ Whenever possible, use the PHY side RGMII delay for these reasons:
* PHY device drivers in PHYLIB being reusable by nature, being able to
configure correctly a specified delay enables more designs with similar delay
- requirements to be operate correctly
+ requirements to be operated correctly
For cases where the PHY is not capable of providing this delay, but the
Ethernet MAC driver is capable of doing so, the correct phy_interface_t value
@@ -120,7 +120,7 @@ required delays, as defined per the RGMII standard, several options may be
available:
* Some SoCs may offer a pin pad/mux/controller capable of configuring a given
- set of pins'strength, delays, and voltage; and it may be a suitable
+ set of pins' strength, delays, and voltage; and it may be a suitable
option to insert the expected 2ns RGMII delay.
* Modifying the PCB design to include a fixed delay (e.g: using a specifically
@@ -216,7 +216,7 @@ put into an unsupported state.
Lastly, once the controller is ready to handle network traffic, you call
phy_start(phydev). This tells the PAL that you are ready, and configures the
PHY to connect to the network. If the MAC interrupt of your network driver
-also handles PHY status changes, just set phydev->irq to PHY_IGNORE_INTERRUPT
+also handles PHY status changes, just set phydev->irq to PHY_MAC_INTERRUPT
before you call phy_start and use phy_mac_interrupt() from the network
driver. If you don't want to use interrupts, set phydev->irq to PHY_POLL.
phy_start() enables the PHY interrupts (if applicable) and starts the
@@ -237,6 +237,11 @@ negotiation results.
Some of the interface modes are described below:
+``PHY_INTERFACE_MODE_SMII``
+ This is serial MII, clocked at 125MHz, supporting 100M and 10M speeds.
+ Some details can be found in
+ https://opencores.org/ocsvn/smii/smii/trunk/doc/SMII.pdf
+
``PHY_INTERFACE_MODE_1000BASEX``
This defines the 1000BASE-X single-lane serdes link as defined by the
802.3 standard section 36. The link operates at a fixed bit rate of
@@ -247,8 +252,8 @@ Some of the interface modes are described below:
speeds (see below.)
``PHY_INTERFACE_MODE_2500BASEX``
- This defines a variant of 1000BASE-X which is clocked 2.5 times faster,
- than the 802.3 standard giving a fixed bit rate of 3.125Gbaud.
+ This defines a variant of 1000BASE-X which is clocked 2.5 times as fast
+ as the 802.3 standard, giving a fixed bit rate of 3.125Gbaud.
``PHY_INTERFACE_MODE_SGMII``
This is used for Cisco SGMII, which is a modification of 1000BASE-X
@@ -267,6 +272,12 @@ Some of the interface modes are described below:
duplex, pause or other settings. This is dependent on the MAC and/or
PHY behaviour.
+``PHY_INTERFACE_MODE_5GBASER``
+ This is the IEEE 802.3 Clause 129 defined 5GBASE-R protocol. It is
+ identical to the 10GBASE-R protocol defined in Clause 49, with the
+ exception that it operates at half the frequency. Please refer to the
+ IEEE standard for the definition.
+
``PHY_INTERFACE_MODE_10GBASER``
This is the IEEE 802.3 Clause 49 defined 10GBASE-R protocol used with
various different mediums. Please refer to the IEEE standard for a
@@ -286,6 +297,32 @@ Some of the interface modes are described below:
Note: due to legacy usage, some 10GBASE-R usage incorrectly makes
use of this definition.
+``PHY_INTERFACE_MODE_25GBASER``
+ This is the IEEE 802.3 PCS Clause 107 defined 25GBASE-R protocol.
+ The PCS is identical to 10GBASE-R, i.e. 64B/66B encoded
+ running 2.5 times as fast, giving a fixed bit rate of 25.78125 Gbaud.
+ Please refer to the IEEE standard for further information.
+
+``PHY_INTERFACE_MODE_100BASEX``
+ This defines IEEE 802.3 Clause 24. The link operates at a fixed data
+ rate of 125Mbps using a 4B/5B encoding scheme, resulting in an underlying
+ data rate of 100Mbps.
+
+``PHY_INTERFACE_MODE_QUSGMII``
+ This defines the Cisco Quad USGMII mode, which is the Quad variant of
+ the USGMII (Universal SGMII) link. It's very similar to QSGMII, but uses
+ a Packet Control Header (PCH) instead of the 7-byte preamble to carry not
+ only the port id, but also so-called "extensions". The only documented
+ extension so far in the specification is the inclusion of timestamps, for
+ PTP-enabled PHYs. This mode isn't compatible with QSGMII, but offers the
+ same capabilities in terms of link speed and negotiation.
+
+``PHY_INTERFACE_MODE_1000BASEKX``
+ This is 1000BASE-X as defined by IEEE 802.3 Clause 36 with Clause 73
+ autonegotiation. Generally, it will be used with a Clause 70 PMD. To
+ contrast with the 1000BASE-X phy mode used for Clause 38 and 39 PMDs, this
+ interface mode has different autonegotiation and only supports full duplex.
+
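The rates quoted in the mode descriptions above follow from the payload rate and the line code. Below is a minimal sketch of that arithmetic, assuming the standard 8B/10B coding for 1000BASE-X and 2500BASE-X (the coding is not stated in this excerpt); `line_rate_mbaud()` is a hypothetical helper, not a PHYLIB function.

```c
/*
 * Line rate in Mbaud for a payload rate in Mbit/s carried over an nB/mB
 * line code that puts coded_bits on the wire for every data_bits of payload.
 */
double line_rate_mbaud(double data_mbps, unsigned int data_bits,
		       unsigned int coded_bits)
{
	return data_mbps * coded_bits / data_bits;
}
```

With these inputs, 2500 Mbit/s over 8B/10B gives 3125 Mbaud, 25000 Mbit/s over 64B/66B gives 25781.25 Mbaud, and 100 Mbit/s over 4B/5B gives 125 Mbaud, matching the figures in the text.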
Pause frames / flow control
===========================
diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.rst
index d2fd78f85aa4..1225f0f63ff0 100644
--- a/Documentation/networking/pktgen.txt
+++ b/Documentation/networking/pktgen.rst
@@ -1,7 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
-
- HOWTO for the linux packet generator
- ------------------------------------
+====================================
+HOWTO for the linux packet generator
+====================================
Enable CONFIG_NET_PKTGEN to compile and build pktgen either in-kernel
or as a module. A module is preferred; modprobe pktgen if needed. Once
@@ -9,17 +10,18 @@ running, pktgen creates a thread for each CPU with affinity to that CPU.
Monitoring and controlling is done via /proc. It is easiest to select a
suitable sample script and configure that.
-On a dual CPU:
+On a dual CPU::
+
+ ps aux | grep pkt
+ root 129 0.3 0.0 0 0 ? SW 2003 523:20 [kpktgend_0]
+ root 130 0.3 0.0 0 0 ? SW 2003 509:50 [kpktgend_1]
-ps aux | grep pkt
-root 129 0.3 0.0 0 0 ? SW 2003 523:20 [kpktgend_0]
-root 130 0.3 0.0 0 0 ? SW 2003 509:50 [kpktgend_1]
+For monitoring and control pktgen creates::
-For monitoring and control pktgen creates:
/proc/net/pktgen/pgctrl
/proc/net/pktgen/kpktgend_X
- /proc/net/pktgen/ethX
+ /proc/net/pktgen/ethX
Tuning NIC for max performance
@@ -28,7 +30,8 @@ Tuning NIC for max performance
The default NIC settings are (likely) not tuned for pktgen's artificial
overload type of benchmarking, as this could hurt the normal use-case.
-Specifically increasing the TX ring buffer in the NIC:
+Specifically increasing the TX ring buffer in the NIC::
+
# ethtool -G ethX tx 1024
A larger TX ring can improve pktgen's performance, while it can hurt
@@ -46,7 +49,8 @@ This cleanup issue is specifically the case for the driver ixgbe
and the cleanup interval is affected by the ethtool --coalesce setting
of parameter "rx-usecs".
-For ixgbe use e.g. "30" resulting in approx 33K interrupts/sec (1/30*10^6):
+For ixgbe use e.g. "30" resulting in approx 33K interrupts/sec (1/30*10^6)::
+
# ethtool -C ethX rx-usecs 30
@@ -55,7 +59,7 @@ Kernel threads
Pktgen creates a thread for each CPU with affinity to that CPU.
Which is controlled through procfile /proc/net/pktgen/kpktgend_X.
-Example: /proc/net/pktgen/kpktgend_0
+Example: /proc/net/pktgen/kpktgend_0::
Running:
Stopped: eth4@0
@@ -64,6 +68,7 @@ Example: /proc/net/pktgen/kpktgend_0
Most important are the devices assigned to the thread.
The two basic thread commands are:
+
* add_device DEVICE@NAME -- adds a single device
* rem_device_all -- remove all associated devices
@@ -73,7 +78,7 @@ be unique.
To support adding the same device to multiple threads, which is useful
with multi queue NICs, the device naming scheme is extended with "@":
- device@something
+device@something
The part after "@" can be anything, but it is custom to use the thread
number.
@@ -83,30 +88,30 @@ Viewing devices
The Params section holds configured information. The Current section
holds running statistics. The Result is printed after a run or after
-interruption. Example:
-
-/proc/net/pktgen/eth4@0
-
- Params: count 100000 min_pkt_size: 60 max_pkt_size: 60
- frags: 0 delay: 0 clone_skb: 64 ifname: eth4@0
- flows: 0 flowlen: 0
- queue_map_min: 0 queue_map_max: 0
- dst_min: 192.168.81.2 dst_max:
- src_min: src_max:
- src_mac: 90:e2:ba:0a:56:b4 dst_mac: 00:1b:21:3c:9d:f8
- udp_src_min: 9 udp_src_max: 109 udp_dst_min: 9 udp_dst_max: 9
- src_mac_count: 0 dst_mac_count: 0
- Flags: UDPSRC_RND NO_TIMESTAMP QUEUE_MAP_CPU
- Current:
- pkts-sofar: 100000 errors: 0
- started: 623913381008us stopped: 623913396439us idle: 25us
- seq_num: 100001 cur_dst_mac_offset: 0 cur_src_mac_offset: 0
- cur_saddr: 192.168.8.3 cur_daddr: 192.168.81.2
- cur_udp_dst: 9 cur_udp_src: 42
- cur_queue_map: 0
- flows: 0
- Result: OK: 15430(c15405+d25) usec, 100000 (60byte,0frags)
- 6480562pps 3110Mb/sec (3110669760bps) errors: 0
+interruption. Example::
+
+ /proc/net/pktgen/eth4@0
+
+ Params: count 100000 min_pkt_size: 60 max_pkt_size: 60
+ frags: 0 delay: 0 clone_skb: 64 ifname: eth4@0
+ flows: 0 flowlen: 0
+ queue_map_min: 0 queue_map_max: 0
+ dst_min: 192.168.81.2 dst_max:
+ src_min: src_max:
+ src_mac: 90:e2:ba:0a:56:b4 dst_mac: 00:1b:21:3c:9d:f8
+ udp_src_min: 9 udp_src_max: 109 udp_dst_min: 9 udp_dst_max: 9
+ src_mac_count: 0 dst_mac_count: 0
+ Flags: UDPSRC_RND NO_TIMESTAMP QUEUE_MAP_CPU
+ Current:
+ pkts-sofar: 100000 errors: 0
+ started: 623913381008us stopped: 623913396439us idle: 25us
+ seq_num: 100001 cur_dst_mac_offset: 0 cur_src_mac_offset: 0
+ cur_saddr: 192.168.8.3 cur_daddr: 192.168.81.2
+ cur_udp_dst: 9 cur_udp_src: 42
+ cur_queue_map: 0
+ flows: 0
+ Result: OK: 15430(c15405+d25) usec, 100000 (60byte,0frags)
+ 6480562pps 3110Mb/sec (3110669760bps) errors: 0
Configuring devices
@@ -114,11 +119,12 @@ Configuring devices
This is done via the /proc interface, and most easily done via pgset
as defined in the sample scripts.
You need to specify PGDEV environment variable to use functions from sample
-scripts, i.e.:
-export PGDEV=/proc/net/pktgen/eth4@0
-source samples/pktgen/functions.sh
+scripts, i.e.::
+
+ export PGDEV=/proc/net/pktgen/eth4@0
+ source samples/pktgen/functions.sh
-Examples:
+Examples::
pg_ctrl start starts injection.
pg_ctrl stop aborts injection. Also, ^C aborts generator.
@@ -126,17 +132,17 @@ Examples:
pgset "clone_skb 1" sets the number of copies of the same packet
pgset "clone_skb 0" use single SKB for all transmits
pgset "burst 8" uses xmit_more API to queue 8 copies of the same
- packet and update HW tx queue tail pointer once.
- "burst 1" is the default
+ packet and update HW tx queue tail pointer once.
+ "burst 1" is the default
pgset "pkt_size 9014" sets packet size to 9014
pgset "frags 5" packet will consist of 5 fragments
pgset "count 200000" sets number of packets to send, set to zero
- for continuous sends until explicitly stopped.
+ for continuous sends until explicitly stopped.
pgset "delay 5000" adds delay to hard_start_xmit(). nanoseconds
pgset "dst 10.0.0.1" sets IP destination address
- (BEWARE! This generator is very aggressive!)
+ (BEWARE! This generator is very aggressive!)
pgset "dst_min 10.0.0.1" Same as dst
pgset "dst_max 10.0.0.254" Set the maximum destination IP.
@@ -149,46 +155,46 @@ Examples:
pgset "queue_map_min 0" Sets the min value of tx queue interval
pgset "queue_map_max 7" Sets the max value of tx queue interval, for multiqueue devices
- To select queue 1 of a given device,
- use queue_map_min=1 and queue_map_max=1
+ To select queue 1 of a given device,
+ use queue_map_min=1 and queue_map_max=1
pgset "src_mac_count 1" Sets the number of MACs we'll range through.
- The 'minimum' MAC is what you set with srcmac.
+ The 'minimum' MAC is what you set with srcmac.
pgset "dst_mac_count 1" Sets the number of MACs we'll range through.
- The 'minimum' MAC is what you set with dstmac.
+ The 'minimum' MAC is what you set with dstmac.
pgset "flag [name]" Set a flag to determine behaviour. Current flags
- are: IPSRC_RND # IP source is random (between min/max)
- IPDST_RND # IP destination is random
- UDPSRC_RND, UDPDST_RND,
- MACSRC_RND, MACDST_RND
- TXSIZE_RND, IPV6,
- MPLS_RND, VID_RND, SVID_RND
- FLOW_SEQ,
- QUEUE_MAP_RND # queue map random
- QUEUE_MAP_CPU # queue map mirrors smp_processor_id()
- UDPCSUM,
- IPSEC # IPsec encapsulation (needs CONFIG_XFRM)
- NODE_ALLOC # node specific memory allocation
- NO_TIMESTAMP # disable timestamping
+ are: IPSRC_RND # IP source is random (between min/max)
+ IPDST_RND # IP destination is random
+ UDPSRC_RND, UDPDST_RND,
+ MACSRC_RND, MACDST_RND
+ TXSIZE_RND, IPV6,
+ MPLS_RND, VID_RND, SVID_RND
+ FLOW_SEQ,
+ QUEUE_MAP_RND # queue map random
+ QUEUE_MAP_CPU # queue map mirrors smp_processor_id()
+ UDPCSUM,
+ IPSEC # IPsec encapsulation (needs CONFIG_XFRM)
+ NODE_ALLOC # node specific memory allocation
+ NO_TIMESTAMP # disable timestamping
pgset 'flag ![name]' Clear a flag to determine behaviour.
- Note that you might need to use single quote in
- interactive mode, so that your shell wouldn't expand
- the specified flag as a history command.
+ Note that you might need to use single quote in
+ interactive mode, so that your shell wouldn't expand
+ the specified flag as a history command.
pgset "spi [SPI_VALUE]" Set specific SA used to transform packet.
pgset "udp_src_min 9" set UDP source port min, If < udp_src_max, then
- cycle through the port range.
+ cycle through the port range.
pgset "udp_src_max 9" set UDP source port max.
pgset "udp_dst_min 9" set UDP destination port min, If < udp_dst_max, then
- cycle through the port range.
+ cycle through the port range.
pgset "udp_dst_max 9" set UDP destination port max.
pgset "mpls 0001000a,0002000a,0000000a" set MPLS labels (in this example
- outer label=16,middle label=32,
+ outer label=16,middle label=32,
inner label=0 (IPv4 NULL)) Note that
there must be no spaces between the
arguments. Leading zeros are required.
@@ -232,32 +238,34 @@ A collection of tutorial scripts and helpers for pktgen is in the
samples/pktgen directory. The helper parameters.sh file supports easy
and consistent parameter parsing across the sample scripts.
-Usage example and help:
+Usage example and help::
+
./pktgen_sample01_simple.sh -i eth4 -m 00:1B:21:3C:9D:F8 -d 192.168.8.2
-Usage: ./pktgen_sample01_simple.sh [-vx] -i ethX
+Usage::
+
+ ./pktgen_sample01_simple.sh [-vx] -i ethX
+
-i : ($DEV) output interface/device (required)
-s : ($PKT_SIZE) packet size
- -d : ($DEST_IP) destination IP
+ -d : ($DEST_IP) destination IP. CIDR (e.g. 198.18.0.0/15) is also allowed
-m : ($DST_MAC) destination MAC-addr
+ -p : ($DST_PORT) destination PORT range (e.g. 433-444) is also allowed
-t : ($THREADS) threads to start
+ -f : ($F_THREAD) index of first thread (zero indexed CPU number)
-c : ($SKB_CLONE) SKB clones send before alloc new SKB
+ -n : ($COUNT) num messages to send per thread, 0 means indefinitely
-b : ($BURST) HW level bursting of SKBs
-v : ($VERBOSE) verbose
-x : ($DEBUG) debug
+ -6 : ($IP6) IPv6
+ -w : ($DELAY) Tx Delay value (ns)
+ -a : ($APPEND) Script will not reset generator's state, but will append its config
The global variables being set are also listed. E.g. the required
interface/device parameter "-i" sets variable $DEV. Copy the
pktgen_sampleXX scripts and modify them to fit your own needs.
-The old scripts:
-
-pktgen.conf-1-2 # 1 CPU 2 dev
-pktgen.conf-1-1-rdos # 1 CPU 1 dev w. route DoS
-pktgen.conf-1-1-ip6 # 1 CPU 1 dev ipv6
-pktgen.conf-1-1-ip6-rdos # 1 CPU 1 dev ipv6 w. route DoS
-pktgen.conf-1-1-flows # 1 CPU 1 dev multiple flows.
-
Interrupt affinity
===================
@@ -271,10 +279,10 @@ to the running threads CPU (directly from smp_processor_id()).
Enable IPsec
============
Default IPsec transformation with ESP encapsulation plus transport mode
-can be enabled by simply setting:
+can be enabled by simply setting::
-pgset "flag IPSEC"
-pgset "flows 1"
+ pgset "flag IPSEC"
+ pgset "flows 1"
To avoid breaking existing testbed scripts for using AH type and tunnel mode,
you can use "pgset spi SPI_VALUE" to specify which transformation mode
@@ -284,115 +292,117 @@ to employ.
Current commands and configuration options
==========================================
-** Pgcontrol commands:
+**Pgcontrol commands**::
-start
-stop
-reset
+ start
+ stop
+ reset
-** Thread commands:
+**Thread commands**::
-add_device
-rem_device_all
+ add_device
+ rem_device_all
-** Device commands:
+**Device commands**::
-count
-clone_skb
-burst
-debug
+ count
+ clone_skb
+ burst
+ debug
-frags
-delay
+ frags
+ delay
-src_mac_count
-dst_mac_count
+ src_mac_count
+ dst_mac_count
-pkt_size
-min_pkt_size
-max_pkt_size
+ pkt_size
+ min_pkt_size
+ max_pkt_size
-queue_map_min
-queue_map_max
-skb_priority
+ queue_map_min
+ queue_map_max
+ skb_priority
-tos (ipv4)
-traffic_class (ipv6)
+ tos (ipv4)
+ traffic_class (ipv6)
-mpls
+ mpls
-udp_src_min
-udp_src_max
+ udp_src_min
+ udp_src_max
-udp_dst_min
-udp_dst_max
+ udp_dst_min
+ udp_dst_max
-node
+ node
-flag
- IPSRC_RND
- IPDST_RND
- UDPSRC_RND
- UDPDST_RND
- MACSRC_RND
- MACDST_RND
- TXSIZE_RND
- IPV6
- MPLS_RND
- VID_RND
- SVID_RND
- FLOW_SEQ
- QUEUE_MAP_RND
- QUEUE_MAP_CPU
- UDPCSUM
- IPSEC
- NODE_ALLOC
- NO_TIMESTAMP
+ flag
+ IPSRC_RND
+ IPDST_RND
+ UDPSRC_RND
+ UDPDST_RND
+ MACSRC_RND
+ MACDST_RND
+ TXSIZE_RND
+ IPV6
+ MPLS_RND
+ VID_RND
+ SVID_RND
+ FLOW_SEQ
+ QUEUE_MAP_RND
+ QUEUE_MAP_CPU
+ UDPCSUM
+ IPSEC
+ NODE_ALLOC
+ NO_TIMESTAMP
-spi (ipsec)
+ spi (ipsec)
-dst_min
-dst_max
+ dst_min
+ dst_max
-src_min
-src_max
+ src_min
+ src_max
-dst_mac
-src_mac
+ dst_mac
+ src_mac
-clear_counters
+ clear_counters
-src6
-dst6
-dst6_max
-dst6_min
+ src6
+ dst6
+ dst6_max
+ dst6_min
-flows
-flowlen
+ flows
+ flowlen
-rate
-ratep
+ rate
+ ratep
-xmit_mode <start_xmit|netif_receive>
+ xmit_mode <start_xmit|netif_receive>
-vlan_cfi
-vlan_id
-vlan_p
+ vlan_cfi
+ vlan_id
+ vlan_p
-svlan_cfi
-svlan_id
-svlan_p
+ svlan_cfi
+ svlan_id
+ svlan_p
References:
-ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/
-ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/examples/
+
+- ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/
+- ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/examples/
Paper from Linux-Kongress in Erlangen 2004.
-ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/pktgen_paper.pdf
+- ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/pktgen_paper.pdf
Thanks to:
+
Grant Grundler for testing on IA-64 and parisc, Harald Welte, Lennert Buytenhek
Stephen Hemminger, Andi Kleen, Dave Miller and many others.
diff --git a/Documentation/networking/PLIP.txt b/Documentation/networking/plip.rst
index ad7e3f7c3bbf..0eda745050ff 100644
--- a/Documentation/networking/PLIP.txt
+++ b/Documentation/networking/plip.rst
@@ -1,4 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================================
PLIP: The Parallel Line Internet Protocol Device
+================================================
Donald Becker (becker@super.org)
I.D.A. Supercomputing Research Center, Bowie MD 20715
@@ -83,7 +87,7 @@ When the PLIP driver is used in IRQ mode, the timeout used for triggering a
data transfer (the maximal time the PLIP driver would allow the other side
before announcing a timeout, when trying to handshake a transfer of some
data) is, by default, 500usec. As IRQ delivery is more or less immediate,
-this timeout is quite sufficient.
+this timeout is quite sufficient.
When in IRQ-less mode, the PLIP driver polls the parallel port HZ times
per second (where HZ is typically 100 on most platforms, and 1024 on an
@@ -115,7 +119,7 @@ printer "null" cable to transfer data four bits at a time using
data bit outputs connected to status bit inputs.
The second data transfer method relies on both machines having
-bi-directional parallel ports, rather than output-only ``printer''
+bi-directional parallel ports, rather than output-only ``printer``
ports. This allows byte-wide transfers and avoids reconstructing
nibbles into bytes, leading to much faster transfers.
@@ -132,7 +136,7 @@ bits with standard status register implementation.
A cable that implements this protocol is available commercially as a
"Null Printer" or "Turbo Laplink" cable. It can be constructed with
-two DB-25 male connectors symmetrically connected as follows:
+two DB-25 male connectors symmetrically connected as follows::
STROBE output 1*
D0->ERROR 2 - 15 15 - 2
@@ -146,7 +150,8 @@ two DB-25 male connectors symmetrically connected as follows:
SLCTIN 17 - 17
extra grounds are 18*,19*,20*,21*,22*,23*,24*
GROUND 25 - 25
-* Do not connect these pins on either end
+
+ * Do not connect these pins on either end
If the cable you are using has a metallic shield it should be
connected to the metallic DB-25 shell at one end only.
@@ -155,14 +160,14 @@ Parallel Transfer Mode 1
========================
The second data transfer method relies on both machines having
-bi-directional parallel ports, rather than output-only ``printer''
+bi-directional parallel ports, rather than output-only ``printer``
ports. This allows byte-wide transfers, and avoids reconstructing
nibbles into bytes. This cable should not be used on unidirectional
-``printer'' (as opposed to ``parallel'') ports or when the machine
+``printer`` (as opposed to ``parallel``) ports or when the machine
isn't configured for PLIP, as it will result in output driver
conflicts and the (unlikely) possibility of damage.
-The cable for this transfer mode should be constructed as follows:
+The cable for this transfer mode should be constructed as follows::
STROBE->BUSY 1 - 11
D0->D0 2 - 2
@@ -179,7 +184,8 @@ The cable for this transfer mode should be constructed as follows:
GND->ERROR 18 - 15
extra grounds are 19*,20*,21*,22*,23*,24*
GROUND 25 - 25
-* Do not connect these pins on either end
+
+ * Do not connect these pins on either end
Once again, if the cable you are using has a metallic shield it should
be connected to the metallic DB-25 shell at one end only.
@@ -188,7 +194,7 @@ PLIP Mode 0 transfer protocol
=============================
The PLIP driver is compatible with the "Crynwr" parallel port transfer
-standard in Mode 0. That standard specifies the following protocol:
+standard in Mode 0. That standard specifies the following protocol::
send header nibble '0x8'
count-low octet
@@ -196,20 +202,21 @@ standard in Mode 0. That standard specifies the following protocol:
... data octets
checksum octet
-Each octet is sent as
+Each octet is sent as::
+
<wait for rx. '0x1?'> <send 0x10+(octet&0x0F)>
<wait for rx. '0x0?'> <send 0x00+((octet>>4)&0x0F)>
To start a transfer the transmitting machine outputs a nibble 0x08.
That raises the ACK line, triggering an interrupt in the receiving
machine. The receiving machine disables interrupts and raises its own ACK
-line.
+line.
-Restated:
+Restated::
-(OUT is bit 0-4, OUT.j is bit j from OUT. IN likewise)
-Send_Byte:
- OUT := low nibble, OUT.4 := 1
- WAIT FOR IN.4 = 1
- OUT := high nibble, OUT.4 := 0
- WAIT FOR IN.4 = 0
+ (OUT is bit 0-4, OUT.j is bit j from OUT. IN likewise)
+ Send_Byte:
+ OUT := low nibble, OUT.4 := 1
+ WAIT FOR IN.4 = 1
+ OUT := high nibble, OUT.4 := 0
+ WAIT FOR IN.4 = 0
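The per-octet encoding described in the PLIP Mode 0 protocol above can be sketched in a few lines of Python. This is only an illustration of the nibble arithmetic, not the driver's code: the function names are hypothetical, and the handshake waits (``WAIT FOR IN.4``) are deliberately left out.

```python
def plip_mode0_nibbles(octet):
    """Return the two 5-bit values written to OUT for one octet.

    Hypothetical helper: the low nibble goes out first with bit 4 set,
    then the high nibble with bit 4 clear, as in Send_Byte above.
    """
    low = 0x10 + (octet & 0x0F)          # OUT := low nibble, OUT.4 := 1
    high = 0x00 + ((octet >> 4) & 0x0F)  # OUT := high nibble, OUT.4 := 0
    return low, high


def plip_mode0_octet(low, high):
    """Rebuild the octet on the receiving side from the two values."""
    return ((high & 0x0F) << 4) | (low & 0x0F)


if __name__ == "__main__":
    low, high = plip_mode0_nibbles(0xA7)
    print(hex(low), hex(high), hex(plip_mode0_octet(low, high)))
```

Bit 4 acts as the handshake toggle, so only the lower four bits of each value carry data.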
diff --git a/Documentation/networking/ppp_generic.txt b/Documentation/networking/ppp_generic.rst
index fd563aff5fc9..5a10abce5964 100644
--- a/Documentation/networking/ppp_generic.txt
+++ b/Documentation/networking/ppp_generic.rst
@@ -1,8 +1,12 @@
- PPP Generic Driver and Channel Interface
- ----------------------------------------
+.. SPDX-License-Identifier: GPL-2.0
- Paul Mackerras
+========================================
+PPP Generic Driver and Channel Interface
+========================================
+
+ Paul Mackerras
paulus@samba.org
+
7 Feb 2002
The generic PPP driver in linux-2.4 provides an implementation of the
@@ -19,7 +23,7 @@ functionality which is of use in any PPP implementation, including:
* simple packet filtering
For sending and receiving PPP frames, the generic PPP driver calls on
-the services of PPP `channels'. A PPP channel encapsulates a
+the services of PPP ``channels``. A PPP channel encapsulates a
mechanism for transporting PPP frames from one machine to another. A
PPP channel implementation can be arbitrarily complex internally but
has a very simple interface with the generic PPP code: it merely has
@@ -102,7 +106,7 @@ communications medium and prepare it to do PPP. For example, with an
async tty, this can involve setting the tty speed and modes, issuing
modem commands, and then going through some sort of dialog with the
remote system to invoke PPP service there. We refer to this process
-as `discovery'. Then the user-level process tells the medium to
+as ``discovery``. Then the user-level process tells the medium to
become a PPP channel and register itself with the generic PPP layer.
The channel then has to report the channel number assigned to it back
to the user-level process. From that point, the PPP negotiation code
@@ -111,8 +115,8 @@ negotiation, accessing the channel through the /dev/ppp interface.
At the interface to the PPP generic layer, PPP frames are stored in
skbuff structures and start with the two-byte PPP protocol number.
-The frame does *not* include the 0xff `address' byte or the 0x03
-`control' byte that are optionally used in async PPP. Nor is there
+The frame does *not* include the 0xff ``address`` byte or the 0x03
+``control`` byte that are optionally used in async PPP. Nor is there
any escaping of control characters, nor are there any FCS or framing
characters included. That is all the responsibility of the channel
code, if it is needed for the particular medium. That is, the skbuffs
@@ -121,16 +125,16 @@ protocol number and the data, and the skbuffs presented to ppp_input()
must be in the same format.
The channel must provide an instance of a ppp_channel struct to
-represent the channel. The channel is free to use the `private' field
-however it wishes. The channel should initialize the `mtu' and
-`hdrlen' fields before calling ppp_register_channel() and not change
-them until after ppp_unregister_channel() returns. The `mtu' field
+represent the channel. The channel is free to use the ``private`` field
+however it wishes. The channel should initialize the ``mtu`` and
+``hdrlen`` fields before calling ppp_register_channel() and not change
+them until after ppp_unregister_channel() returns. The ``mtu`` field
represents the maximum size of the data part of the PPP frames, that
is, it does not include the 2-byte protocol number.
If the channel needs some headroom in the skbuffs presented to it for
transmission (i.e., some space free in the skbuff data area before the
-start of the PPP frame), it should set the `hdrlen' field of the
+start of the PPP frame), it should set the ``hdrlen`` field of the
ppp_channel struct to the amount of headroom required. The generic
PPP layer will attempt to provide that much headroom but the channel
should still check if there is sufficient headroom and copy the skbuff
@@ -310,6 +314,22 @@ channel are:
it is connected to. It will return an EINVAL error if the channel
is not connected to an interface.
+* PPPIOCBRIDGECHAN bridges a channel with another. The argument should
+ point to an int containing the channel number of the channel to bridge
+ to. Once two channels are bridged, frames presented to one channel by
+ ppp_input() are passed to the bridge instance for onward transmission.
+ This allows frames to be switched from one channel into another: for
+ example, to pass PPPoE frames into a PPPoL2TP session. Since channel
+ bridging interrupts the normal ppp_input() path, a given channel may
+ not be part of a bridge at the same time as being part of a unit.
+ This ioctl will return an EALREADY error if the channel is already
+ part of a bridge or unit, or ENXIO if the requested channel does not
+ exist.
+
+* PPPIOCUNBRIDGECHAN performs the inverse of PPPIOCBRIDGECHAN, unbridging
+ a channel pair. This ioctl will return an EINVAL error if the channel
+ does not form part of a bridge.
+
* All other ioctl commands are passed to the channel ioctl() function.
The ioctl calls that are available on an instance that is attached to
@@ -322,6 +342,8 @@ an interface unit are:
interface. The argument should be a pointer to an int containing
the new flags value. The bits in the flags value that can be set
are:
+
+ ================ ========================================
SC_COMP_TCP enable transmit TCP header compression
SC_NO_TCP_CCID disable connection-id compression for
TCP header compression
@@ -335,6 +357,7 @@ an interface unit are:
SC_MP_SHORTSEQ expect short multilink sequence
numbers on received multilink fragments
SC_MP_XSHORTSEQ transmit short multilink sequence nos.
+ ================ ========================================
The values of these flags are defined in <linux/ppp-ioctl.h>. Note
that the values of the SC_MULTILINK, SC_MP_SHORTSEQ and
@@ -345,17 +368,20 @@ an interface unit are:
interface unit. The argument should point to an int where the ioctl
will store the flags value. As well as the values listed above for
PPPIOCSFLAGS, the following bits may be set in the returned value:
+
+ ================ =========================================
SC_COMP_RUN CCP compressor is running
SC_DECOMP_RUN CCP decompressor is running
SC_DC_ERROR CCP decompressor detected non-fatal error
SC_DC_FERROR CCP decompressor detected fatal error
+ ================ =========================================
* PPPIOCSCOMPRESS sets the parameters for packet compression or
decompression. The argument should point to a ppp_option_data
structure (defined in <linux/ppp-ioctl.h>), which contains a
pointer/length pair which should describe a block of memory
containing a CCP option specifying a compression method and its
- parameters. The ppp_option_data struct also contains a `transmit'
+ parameters. The ppp_option_data struct also contains a ``transmit``
field. If this is 0, the ioctl will affect the receive path,
otherwise the transmit path.
@@ -377,7 +403,7 @@ an interface unit are:
ppp_idle structure (defined in <linux/ppp_defs.h>). If the
CONFIG_PPP_FILTER option is enabled, the set of packets which reset
the transmit and receive idle timers is restricted to those which
- pass the `active' packet filter.
+ pass the ``active`` packet filter.
Two versions of this command exist, to deal with user space
expecting times as either 32-bit or 64-bit time_t seconds.
@@ -391,31 +417,33 @@ an interface unit are:
* PPPIOCSNPMODE sets the network-protocol mode for a given network
protocol. The argument should point to an npioctl struct (defined
- in <linux/ppp-ioctl.h>). The `protocol' field gives the PPP protocol
- number for the protocol to be affected, and the `mode' field
+ in <linux/ppp-ioctl.h>). The ``protocol`` field gives the PPP protocol
+ number for the protocol to be affected, and the ``mode`` field
specifies what to do with packets for that protocol:
+ ============= ==============================================
NPMODE_PASS normal operation, transmit and receive packets
NPMODE_DROP silently drop packets for this protocol
NPMODE_ERROR drop packets and return an error on transmit
NPMODE_QUEUE queue up packets for transmit, drop received
packets
+ ============= ==============================================
At present NPMODE_ERROR and NPMODE_QUEUE have the same effect as
NPMODE_DROP.
* PPPIOCGNPMODE returns the network-protocol mode for a given
protocol. The argument should point to an npioctl struct with the
- `protocol' field set to the PPP protocol number for the protocol of
- interest. On return the `mode' field will be set to the network-
+ ``protocol`` field set to the PPP protocol number for the protocol of
+ interest. On return the ``mode`` field will be set to the network-
protocol mode for that protocol.
-* PPPIOCSPASS and PPPIOCSACTIVE set the `pass' and `active' packet
+* PPPIOCSPASS and PPPIOCSACTIVE set the ``pass`` and ``active`` packet
filters. These ioctls are only available if the CONFIG_PPP_FILTER
option is selected. The argument should point to a sock_fprog
structure (defined in <linux/filter.h>) containing the compiled BPF
instructions for the filter. Packets are dropped if they fail the
- `pass' filter; otherwise, if they fail the `active' filter they are
+ ``pass`` filter; otherwise, if they fail the ``active`` filter they are
passed but they do not reset the transmit or receive idle timer.
* PPPIOCSMRRU enables or disables multilink processing for received
diff --git a/Documentation/networking/proc_net_tcp.txt b/Documentation/networking/proc_net_tcp.rst
index 4a79209e77a7..7d9dfe36af45 100644
--- a/Documentation/networking/proc_net_tcp.txt
+++ b/Documentation/networking/proc_net_tcp.rst
@@ -1,15 +1,21 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================================
+The proc/net/tcp and proc/net/tcp6 variables
+============================================
+
This document describes the interfaces /proc/net/tcp and /proc/net/tcp6.
Note that these interfaces are deprecated in favor of tcp_diag.
-These /proc interfaces provide information about currently active TCP
+These /proc interfaces provide information about currently active TCP
connections, and are implemented by tcp4_seq_show() in net/ipv4/tcp_ipv4.c
and tcp6_seq_show() in net/ipv6/tcp_ipv6.c, respectively.
It will first list all listening TCP sockets, and next list all established
-TCP connections. A typical entry of /proc/net/tcp would look like this (split
-up into 3 parts because of the length of the line):
+TCP connections. A typical entry of /proc/net/tcp would look like this (split
+up into 3 parts because of the length of the line)::
- 46: 010310AC:9C4C 030310AC:1770 01
+ 46: 010310AC:9C4C 030310AC:1770 01
| | | | | |--> connection state
| | | | |------> remote TCP port number
| | | |-------------> remote IPv4 address
@@ -17,7 +23,7 @@ up into 3 parts because of the length of the line):
| |---------------------------> local IPv4 address
|----------------------------------> number of entry
- 00000150:00000000 01:00000019 00000000
+ 00000150:00000000 01:00000019 00000000
| | | | |--> number of unrecovered RTO timeouts
| | | |----------> number of jiffies until timer expires
| | |----------------> timer_active (see below)
@@ -25,7 +31,7 @@ up into 3 parts because of the length of the line):
|-------------------------------> transmit-queue
1000 0 54165785 4 cd1e6040 25 4 27 3 -1
- | | | | | | | | | |--> slow start size threshold,
+ | | | | | | | | | |--> slow start size threshold,
| | | | | | | | | or -1 if the threshold
| | | | | | | | | is >= 0xFFFF
| | | | | | | | |----> sending congestion window
@@ -40,9 +46,12 @@ up into 3 parts because of the length of the line):
|---------------------------------------------> uid
timer_active:
+
+ == ================================================================
0 no timer is pending
1 retransmit-timer is pending
2 another timer (e.g. delayed ack or keepalive) is pending
- 3 this is a socket in TIME_WAIT state. Not all fields will contain
+ 3 this is a socket in TIME_WAIT state. Not all fields will contain
data (or even exist)
4 zero window probe timer is pending
+ == ================================================================
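As a rough user-space illustration of the field layout shown above, the sketch below decodes the address and port columns of the sample entry. It assumes the file was produced on a little-endian host (such as x86), where the 32-bit address is printed in host byte order; the helper name is hypothetical.

```python
import socket
import struct


def decode_ipv4_endpoint(field):
    """Decode one 'ADDRESS:PORT' column, e.g. '010310AC:9C4C'.

    Assumption: little-endian host, so the hex address is re-packed as a
    little-endian 32-bit value to get the bytes in network order.
    """
    addr_hex, port_hex = field.split(":")
    addr = socket.inet_ntoa(struct.pack("<I", int(addr_hex, 16)))
    return addr, int(port_hex, 16)


if __name__ == "__main__":
    entry = "46: 010310AC:9C4C 030310AC:1770 01"
    _, local, remote, state = entry.split()
    print(decode_ipv4_endpoint(local))   # local address and port
    print(decode_ipv4_endpoint(remote))  # remote address and port
    print(int(state, 16))                # connection state as an integer
```

On a big-endian host the address bytes would need to be packed with ``">I"`` instead.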
diff --git a/Documentation/networking/radiotap-headers.txt b/Documentation/networking/radiotap-headers.rst
index 953331c7984f..1a1bd1ec0650 100644
--- a/Documentation/networking/radiotap-headers.txt
+++ b/Documentation/networking/radiotap-headers.rst
@@ -1,3 +1,6 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================
How to use radiotap headers
===========================
@@ -5,9 +8,9 @@ Pointer to the radiotap include file
------------------------------------
Radiotap headers are variable-length and extensible; you can get most of the
-information you need to know on them from:
+information you need to know on them from::
-./include/net/ieee80211_radiotap.h
+ ./include/net/ieee80211_radiotap.h
This document gives an overview and warns on some corner cases.
@@ -21,6 +24,8 @@ of the it_present member of ieee80211_radiotap_header is set, it means that
the header for argument index 0 (IEEE80211_RADIOTAP_TSFT) is present in the
argument area.
+::
+
< 8-byte ieee80211_radiotap_header >
[ <possible argument bitmap extensions ... > ]
[ <argument> ... ]
@@ -76,6 +81,8 @@ ieee80211_radiotap_header.
Example valid radiotap header
-----------------------------
+::
+
0x00, 0x00, // <-- radiotap version + pad byte
0x0b, 0x00, // <- radiotap header length
0x04, 0x0c, 0x00, 0x00, // <-- bitmap
@@ -89,64 +96,64 @@ Using the Radiotap Parser
If you are having to parse a radiotap struct, you can radically simplify the
job by using the radiotap parser that lives in net/wireless/radiotap.c and has
-its prototypes available in include/net/cfg80211.h. You use it like this:
+its prototypes available in include/net/cfg80211.h. You use it like this::
-#include <net/cfg80211.h>
+ #include <net/cfg80211.h>
-/* buf points to the start of the radiotap header part */
+ /* buf points to the start of the radiotap header part */
-int MyFunction(u8 * buf, int buflen)
-{
- int pkt_rate_100kHz = 0, antenna = 0, pwr = 0;
- struct ieee80211_radiotap_iterator iterator;
- int ret = ieee80211_radiotap_iterator_init(&iterator, buf, buflen);
+ int MyFunction(u8 * buf, int buflen)
+ {
+ int pkt_rate_100kHz = 0, antenna = 0, pwr = 0;
+ struct ieee80211_radiotap_iterator iterator;
+ int ret = ieee80211_radiotap_iterator_init(&iterator, buf, buflen);
- while (!ret) {
+ while (!ret) {
- ret = ieee80211_radiotap_iterator_next(&iterator);
+ ret = ieee80211_radiotap_iterator_next(&iterator);
- if (ret)
- continue;
+ if (ret)
+ continue;
- /* see if this argument is something we can use */
+ /* see if this argument is something we can use */
- switch (iterator.this_arg_index) {
- /*
- * You must take care when dereferencing iterator.this_arg
- * for multibyte types... the pointer is not aligned. Use
- * get_unaligned((type *)iterator.this_arg) to dereference
- * iterator.this_arg for type "type" safely on all arches.
- */
- case IEEE80211_RADIOTAP_RATE:
- /* radiotap "rate" u8 is in
- * 500kbps units, eg, 0x02=1Mbps
- */
- pkt_rate_100kHz = (*iterator.this_arg) * 5;
- break;
+ switch (iterator.this_arg_index) {
+ /*
+ * You must take care when dereferencing iterator.this_arg
+ * for multibyte types... the pointer is not aligned. Use
+ * get_unaligned((type *)iterator.this_arg) to dereference
+ * iterator.this_arg for type "type" safely on all arches.
+ */
+ case IEEE80211_RADIOTAP_RATE:
+ /* radiotap "rate" u8 is in
+ * 500kbps units, eg, 0x02=1Mbps
+ */
+ pkt_rate_100kHz = (*iterator.this_arg) * 5;
+ break;
- case IEEE80211_RADIOTAP_ANTENNA:
- /* radiotap uses 0 for 1st ant */
- antenna = *iterator.this_arg);
- break;
+ case IEEE80211_RADIOTAP_ANTENNA:
+ /* radiotap uses 0 for 1st ant */
+ antenna = *iterator.this_arg;
+ break;
- case IEEE80211_RADIOTAP_DBM_TX_POWER:
- pwr = *iterator.this_arg;
- break;
+ case IEEE80211_RADIOTAP_DBM_TX_POWER:
+ pwr = *iterator.this_arg;
+ break;
- default:
- break;
- }
- } /* while more rt headers */
+ default:
+ break;
+ }
+ } /* while more rt headers */
- if (ret != -ENOENT)
- return TXRX_DROP;
+ if (ret != -ENOENT)
+ return TXRX_DROP;
- /* discard the radiotap header part */
- buf += iterator.max_length;
- buflen -= iterator.max_length;
+ /* discard the radiotap header part */
+ buf += iterator.max_length;
+ buflen -= iterator.max_length;
- ...
+ ...
-}
+ }
Andy Green <andy@warmcat.com>
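For readers who want to check the example header by hand, the following user-space sketch decodes its fixed 8-byte part. It is not the kernel parser: the function name is hypothetical, and the three index constants are copied on the assumption that they match include/net/ieee80211_radiotap.h, so verify them against that file. Radiotap fields are little-endian, hence the ``<`` in the format string.

```python
import struct

# Assumed to match the argument indexes in ieee80211_radiotap.h.
RADIOTAP_RATE = 2
RADIOTAP_DBM_TX_POWER = 10
RADIOTAP_ANTENNA = 11


def parse_radiotap_fixed_header(buf):
    """Return (version, length, present_bits) from the first 8 bytes.

    Layout: u8 version, u8 pad, le16 length, le32 present bitmap.
    """
    version, _pad, length, present = struct.unpack_from("<BBHI", buf, 0)
    present_bits = [i for i in range(32) if present & (1 << i)]
    return version, length, present_bits


if __name__ == "__main__":
    # First 8 bytes of the "Example valid radiotap header" above.
    header = bytes([0x00, 0x00, 0x0B, 0x00, 0x04, 0x0C, 0x00, 0x00])
    print(parse_radiotap_fixed_header(header))
```

The sketch does not follow bitmap extensions (bit 31) or apply the alignment rules for the argument area; the iterator in net/wireless/radiotap.c handles both.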
diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.rst
index eec61694e894..498395f5fbcb 100644
--- a/Documentation/networking/rds.txt
+++ b/Documentation/networking/rds.rst
@@ -1,3 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===
+RDS
+===
Overview
========
@@ -24,36 +29,39 @@ as IB.
The high-level semantics of RDS from the application's point of view are
* Addressing
- RDS uses IPv4 addresses and 16bit port numbers to identify
- the end point of a connection. All socket operations that involve
- passing addresses between kernel and user space generally
- use a struct sockaddr_in.
- The fact that IPv4 addresses are used does not mean the underlying
- transport has to be IP-based. In fact, RDS over IB uses a
- reliable IB connection; the IP address is used exclusively to
- locate the remote node's GID (by ARPing for the given IP).
+ RDS uses IPv4 addresses and 16bit port numbers to identify
+ the end point of a connection. All socket operations that involve
+ passing addresses between kernel and user space generally
+ use a struct sockaddr_in.
+
+ The fact that IPv4 addresses are used does not mean the underlying
+ transport has to be IP-based. In fact, RDS over IB uses a
+ reliable IB connection; the IP address is used exclusively to
+ locate the remote node's GID (by ARPing for the given IP).
- The port space is entirely independent of UDP, TCP or any other
- protocol.
+ The port space is entirely independent of UDP, TCP or any other
+ protocol.
* Socket interface
- RDS sockets work *mostly* as you would expect from a BSD
- socket. The next section will cover the details. At any rate,
- all I/O is performed through the standard BSD socket API.
- Some additions like zerocopy support are implemented through
- control messages, while other extensions use the getsockopt/
- setsockopt calls.
-
- Sockets must be bound before you can send or receive data.
- This is needed because binding also selects a transport and
- attaches it to the socket. Once bound, the transport assignment
- does not change. RDS will tolerate IPs moving around (eg in
- a active-active HA scenario), but only as long as the address
- doesn't move to a different transport.
+
+ RDS sockets work *mostly* as you would expect from a BSD
+ socket. The next section will cover the details. At any rate,
+ all I/O is performed through the standard BSD socket API.
+ Some additions like zerocopy support are implemented through
+ control messages, while other extensions use the getsockopt/
+ setsockopt calls.
+
+ Sockets must be bound before you can send or receive data.
+ This is needed because binding also selects a transport and
+ attaches it to the socket. Once bound, the transport assignment
+ does not change. RDS will tolerate IPs moving around (e.g. in
+ an active-active HA scenario), but only as long as the address
+ doesn't move to a different transport.
* sysctls
- RDS supports a number of sysctls in /proc/sys/net/rds
+
+ RDS supports a number of sysctls in /proc/sys/net/rds
Socket Interface
@@ -66,89 +74,88 @@ Socket Interface
options.
fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
- This creates a new, unbound RDS socket.
+ This creates a new, unbound RDS socket.
setsockopt(SOL_SOCKET): send and receive buffer size
- RDS honors the send and receive buffer size socket options.
- You are not allowed to queue more than SO_SNDSIZE bytes to
- a socket. A message is queued when sendmsg is called, and
- it leaves the queue when the remote system acknowledges
- its arrival.
-
- The SO_RCVSIZE option controls the maximum receive queue length.
- This is a soft limit rather than a hard limit - RDS will
- continue to accept and queue incoming messages, even if that
- takes the queue length over the limit. However, it will also
- mark the port as "congested" and send a congestion update to
- the source node. The source node is supposed to throttle any
- processes sending to this congested port.
+ RDS honors the send and receive buffer size socket options.
+ You are not allowed to queue more than SO_SNDSIZE bytes to
+ a socket. A message is queued when sendmsg is called, and
+ it leaves the queue when the remote system acknowledges
+ its arrival.
+
+ The SO_RCVSIZE option controls the maximum receive queue length.
+ This is a soft limit rather than a hard limit - RDS will
+ continue to accept and queue incoming messages, even if that
+ takes the queue length over the limit. However, it will also
+ mark the port as "congested" and send a congestion update to
+ the source node. The source node is supposed to throttle any
+ processes sending to this congested port.
bind(fd, &sockaddr_in, ...)
- This binds the socket to a local IP address and port, and a
- transport, if one has not already been selected via the
+ This binds the socket to a local IP address and port, and a
+ transport, if one has not already been selected via the
SO_RDS_TRANSPORT socket option
sendmsg(fd, ...)
- Sends a message to the indicated recipient. The kernel will
- transparently establish the underlying reliable connection
- if it isn't up yet.
+ Sends a message to the indicated recipient. The kernel will
+ transparently establish the underlying reliable connection
+ if it isn't up yet.
- An attempt to send a message that exceeds SO_SNDSIZE will
- return with -EMSGSIZE
+ An attempt to send a message that exceeds SO_SNDSIZE will
+ return with -EMSGSIZE
- An attempt to send a message that would take the total number
- of queued bytes over the SO_SNDSIZE threshold will return
- EAGAIN.
+ An attempt to send a message that would take the total number
+ of queued bytes over the SO_SNDSIZE threshold will return
+ EAGAIN.
- An attempt to send a message to a destination that is marked
- as "congested" will return ENOBUFS.
+ An attempt to send a message to a destination that is marked
+ as "congested" will return ENOBUFS.
recvmsg(fd, ...)
- Receives a message that was queued to this socket. The sockets
- recv queue accounting is adjusted, and if the queue length
- drops below SO_SNDSIZE, the port is marked uncongested, and
- a congestion update is sent to all peers.
-
- Applications can ask the RDS kernel module to receive
- notifications via control messages (for instance, there is a
- notification when a congestion update arrived, or when a RDMA
- operation completes). These notifications are received through
- the msg.msg_control buffer of struct msghdr. The format of the
- messages is described in manpages.
+ Receives a message that was queued to this socket. The socket's
+ recv queue accounting is adjusted, and if the queue length
+ drops below SO_RCVSIZE, the port is marked uncongested, and
+ a congestion update is sent to all peers.
+
+ Applications can ask the RDS kernel module to receive
+ notifications via control messages (for instance, there is a
+ notification when a congestion update arrives, or when an RDMA
+ operation completes). These notifications are received through
+ the msg.msg_control buffer of struct msghdr. The format of the
+ messages is described in manpages.
poll(fd)
- RDS supports the poll interface to allow the application
- to implement async I/O.
+ RDS supports the poll interface to allow the application
+ to implement async I/O.
- POLLIN handling is pretty straightforward. When there's an
- incoming message queued to the socket, or a pending notification,
- we signal POLLIN.
+ POLLIN handling is pretty straightforward. When there's an
+ incoming message queued to the socket, or a pending notification,
+ we signal POLLIN.
- POLLOUT is a little harder. Since you can essentially send
- to any destination, RDS will always signal POLLOUT as long as
- there's room on the send queue (ie the number of bytes queued
- is less than the sendbuf size).
+ POLLOUT is a little harder. Since you can essentially send
+ to any destination, RDS will always signal POLLOUT as long as
+ there's room on the send queue (i.e. the number of bytes queued
+ is less than the sendbuf size).
- However, the kernel will refuse to accept messages to
- a destination marked congested - in this case you will loop
- forever if you rely on poll to tell you what to do.
- This isn't a trivial problem, but applications can deal with
- this - by using congestion notifications, and by checking for
- ENOBUFS errors returned by sendmsg.
+ However, the kernel will refuse to accept messages to
+ a destination marked congested - in this case you will loop
+ forever if you rely on poll to tell you what to do.
+ This isn't a trivial problem, but applications can deal with
+ this - by using congestion notifications, and by checking for
+ ENOBUFS errors returned by sendmsg.
setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
- This allows the application to discard all messages queued to a
- specific destination on this particular socket.
-
- This allows the application to cancel outstanding messages if
- it detects a timeout. For instance, if it tried to send a message,
- and the remote host is unreachable, RDS will keep trying forever.
- The application may decide it's not worth it, and cancel the
- operation. In this case, it would use RDS_CANCEL_SENT_TO to
- nuke any pending messages.
-
- setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
- getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
+ This allows the application to discard all messages queued to a
+ specific destination on this particular socket.
+
+ This allows the application to cancel outstanding messages if
+ it detects a timeout. For instance, if it tried to send a message,
+ and the remote host is unreachable, RDS will keep trying forever.
+ The application may decide it's not worth it, and cancel the
+ operation. In this case, it would use RDS_CANCEL_SENT_TO to
+ nuke any pending messages.
+
+ ``setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..), getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)``
Set or read an integer defining the underlying
encapsulating transport to be used for RDS packets on the
socket. When setting the option, integer argument may be
@@ -180,32 +187,39 @@ RDS Protocol
Message header
The message header is a 'struct rds_header' (see rds.h):
+
Fields:
+
h_sequence:
- per-packet sequence number
+ per-packet sequence number
h_ack:
- piggybacked acknowledgment of last packet received
+ piggybacked acknowledgment of last packet received
h_len:
- length of data, not including header
+ length of data, not including header
h_sport:
- source port
+ source port
h_dport:
- destination port
+ destination port
h_flags:
- CONG_BITMAP - this is a congestion update bitmap
- ACK_REQUIRED - receiver must ack this packet
- RETRANSMITTED - packet has previously been sent
+ Can be:
+
+ ============= ==================================
+ CONG_BITMAP this is a congestion update bitmap
+ ACK_REQUIRED receiver must ack this packet
+ RETRANSMITTED packet has previously been sent
+ ============= ==================================
+
h_credit:
- indicate to other end of connection that
- it has more credits available (i.e. there is
- more send room)
+ indicate to other end of connection that
+ it has more credits available (i.e. there is
+ more send room)
h_padding[4]:
- unused, for future use
+ unused, for future use
h_csum:
- header checksum
+ header checksum
h_exthdr:
- optional data can be passed here. This is currently used for
- passing RDMA-related information.
+ optional data can be passed here. This is currently used for
+ passing RDMA-related information.
ACK and retransmit handling
@@ -260,7 +274,7 @@ RDS Protocol
RDS Transport Layer
-==================
+===================
As mentioned above, RDS is not IB-specific. Its code is divided
into a general RDS layer and a transport layer.
@@ -281,19 +295,25 @@ RDS Kernel Structures
be sent and sets header fields as needed, based on the socket API.
This is then queued for the individual connection and sent by the
connection's transport.
+
struct rds_incoming
a generic struct referring to incoming data that can be handed from
the transport to the general code and queued by the general code
while the socket is awoken. It is then passed back to the transport
code to handle the actual copy-to-user.
+
struct rds_socket
per-socket information
+
struct rds_connection
per-connection information
+
struct rds_transport
pointers to transport-specific functions
+
struct rds_statistics
non-transport-specific statistics
+
struct rds_cong_map
wraps the raw congestion bitmap, contains rbnode, waitq, etc.
@@ -317,53 +337,58 @@ The send path
=============
rds_sendmsg()
- struct rds_message built from incoming data
- CMSGs parsed (e.g. RDMA ops)
- transport connection alloced and connected if not already
- rds_message placed on send queue
- send worker awoken
+ - struct rds_message built from incoming data
+ - CMSGs parsed (e.g. RDMA ops)
+ - transport connection alloced and connected if not already
+ - rds_message placed on send queue
+ - send worker awoken
+
rds_send_worker()
- calls rds_send_xmit() until queue is empty
+ - calls rds_send_xmit() until queue is empty
+
rds_send_xmit()
- transmits congestion map if one is pending
- may set ACK_REQUIRED
- calls transport to send either non-RDMA or RDMA message
- (RDMA ops never retransmitted)
+ - transmits congestion map if one is pending
+ - may set ACK_REQUIRED
+ - calls transport to send either non-RDMA or RDMA message
+ (RDMA ops never retransmitted)
+
rds_ib_xmit()
- allocs work requests from send ring
- adds any new send credits available to peer (h_credits)
- maps the rds_message's sg list
- piggybacks ack
- populates work requests
- post send to connection's queue pair
+ - allocs work requests from send ring
+ - adds any new send credits available to peer (h_credits)
+ - maps the rds_message's sg list
+ - piggybacks ack
+ - populates work requests
+ - post send to connection's queue pair
The recv path
=============
rds_ib_recv_cq_comp_handler()
- looks at write completions
- unmaps recv buffer from device
- no errors, call rds_ib_process_recv()
- refill recv ring
+ - looks at write completions
+ - unmaps recv buffer from device
+ - no errors, call rds_ib_process_recv()
+ - refill recv ring
+
rds_ib_process_recv()
- validate header checksum
- copy header to rds_ib_incoming struct if start of a new datagram
- add to ibinc's fraglist
- if competed datagram:
- update cong map if datagram was cong update
- call rds_recv_incoming() otherwise
- note if ack is required
+ - validate header checksum
+ - copy header to rds_ib_incoming struct if start of a new datagram
+ - add to ibinc's fraglist
+ - if completed datagram:
+ - update cong map if datagram was cong update
+ - call rds_recv_incoming() otherwise
+ - note if ack is required
+
rds_recv_incoming()
- drop duplicate packets
- respond to pings
- find the sock associated with this datagram
- add to sock queue
- wake up sock
- do some congestion calculations
+ - drop duplicate packets
+ - respond to pings
+ - find the sock associated with this datagram
+ - add to sock queue
+ - wake up sock
+ - do some congestion calculations
rds_recvmsg
- copy data into user iovec
- handle CMSGs
- return to application
+ - copy data into user iovec
+ - handle CMSGs
+ - return to application
Multipath RDS (mprds)
=====================
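Reviewer note on the RDS hunks above (not part of the patch): the sendmsg() and poll() text describes three distinct failure modes, and poll() alone cannot tell them apart. The following is a minimal sketch of how an application might map them to actions. The enum and the function name are hypothetical helpers invented for illustration; only the errno values come from the text.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical application-side actions; these names are not part of RDS. */
enum rds_send_action {
	RDS_SEND_OK,           /* message was queued */
	RDS_SEND_WAIT_POLLOUT, /* EAGAIN: send queue full, wait for POLLOUT */
	RDS_SEND_WAIT_CONG,    /* ENOBUFS: destination congested, wait for a
				* congestion notification, not for POLLOUT */
	RDS_SEND_TOO_BIG,      /* EMSGSIZE: message exceeds the send limit */
	RDS_SEND_FAIL          /* any other error */
};

/*
 * Classify the result of a sendmsg() call on an RDS socket.
 * 'ret' is the sendmsg() return value, 'err' the errno saved right after it.
 */
static enum rds_send_action rds_classify_send(long ret, int err)
{
	if (ret >= 0)
		return RDS_SEND_OK;

	switch (err) {
	case EAGAIN:
		return RDS_SEND_WAIT_POLLOUT;
	case ENOBUFS:
		return RDS_SEND_WAIT_CONG;
	case EMSGSIZE:
		return RDS_SEND_TOO_BIG;
	default:
		return RDS_SEND_FAIL;
	}
}
```

Keeping ENOBUFS separate from EAGAIN is the point of the sketch: per the text, POLLOUT stays asserted while a destination is congested, so retrying on POLLOUT would spin.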
diff --git a/Documentation/networking/regulatory.txt b/Documentation/networking/regulatory.rst
index 381e5b23d61d..16782a95b74a 100644
--- a/Documentation/networking/regulatory.txt
+++ b/Documentation/networking/regulatory.rst
@@ -1,12 +1,15 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================================
Linux wireless regulatory documentation
----------------------------------------
+=======================================
This document gives a brief overview of how the Linux wireless
regulatory infrastructure works.
More up to date information can be obtained at the project's web page:
-http://wireless.kernel.org/en/developers/Regulatory
+https://wireless.wiki.kernel.org/en/developers/Regulatory
Keeping regulatory domains in userspace
---------------------------------------
@@ -34,7 +37,7 @@ expected regulatory domains will be respected by the kernel.
A currently available userspace agent which can accomplish this
is CRDA - central regulatory domain agent. It is documented here:
-http://wireless.kernel.org/en/developers/Regulatory/CRDA
+https://wireless.wiki.kernel.org/en/developers/Regulatory/CRDA
Essentially the kernel will send a udev event when it knows
it needs a new regulatory domain. A udev rule can be put in place
@@ -55,9 +58,9 @@ Who asks for regulatory domains?
Users can use iw:
-http://wireless.kernel.org/en/users/Documentation/iw
+https://wireless.wiki.kernel.org/en/users/Documentation/iw
-An example:
+An example::
# set regulatory domain to "Costa Rica"
iw reg set CR
@@ -104,9 +107,9 @@ Example code - drivers hinting an alpha2:
This example comes from the zd1211rw device driver. You can start
by having a mapping of your device's EEPROM country/regulatory
-domain value to a specific alpha2 as follows:
+domain value to a specific alpha2 as follows::
-static struct zd_reg_alpha2_map reg_alpha2_map[] = {
+ static struct zd_reg_alpha2_map reg_alpha2_map[] = {
{ ZD_REGDOMAIN_FCC, "US" },
{ ZD_REGDOMAIN_IC, "CA" },
{ ZD_REGDOMAIN_ETSI, "DE" }, /* Generic ETSI, use most restrictive */
@@ -116,10 +119,10 @@ static struct zd_reg_alpha2_map reg_alpha2_map[] = {
{ ZD_REGDOMAIN_FRANCE, "FR" },
Then you can define a routine to map your read EEPROM value to an alpha2,
-as follows:
+as follows::
-static int zd_reg2alpha2(u8 regdomain, char *alpha2)
-{
+ static int zd_reg2alpha2(u8 regdomain, char *alpha2)
+ {
unsigned int i;
struct zd_reg_alpha2_map *reg_map;
for (i = 0; i < ARRAY_SIZE(reg_alpha2_map); i++) {
@@ -131,12 +134,14 @@ static int zd_reg2alpha2(u8 regdomain, char *alpha2)
}
}
return 1;
-}
+ }
Lastly, you can then hint to the core of your discovered alpha2, if a match
was found. You need to do this after you have registered your wiphy. You
are expected to do this during initialization.
+::
+
r = zd_reg2alpha2(mac->regdomain, alpha2);
if (!r)
regulatory_hint(hw->wiphy, alpha2);
@@ -156,9 +161,9 @@ call regulatory_hint() with the regulatory domain structure in it.
Below is a simple example, with a regulatory domain cached using the stack.
Your implementation may vary (read EEPROM cache instead, for example).
-Example cache of some regulatory domain
+Example cache of some regulatory domain::
-struct ieee80211_regdomain mydriver_jp_regdom = {
+ struct ieee80211_regdomain mydriver_jp_regdom = {
.n_reg_rules = 3,
.alpha2 = "JP",
//.alpha2 = "99", /* If I have no alpha2 to map it to */
@@ -173,9 +178,9 @@ struct ieee80211_regdomain mydriver_jp_regdom = {
NL80211_RRF_NO_IR|
NL80211_RRF_DFS),
}
-};
+ };
-Then in some part of your code after your wiphy has been registered:
+Then in some part of your code after your wiphy has been registered::
struct ieee80211_regdomain *rd;
int size_of_regd;
diff --git a/Documentation/networking/representors.rst b/Documentation/networking/representors.rst
new file mode 100644
index 000000000000..ee1f5cd54496
--- /dev/null
+++ b/Documentation/networking/representors.rst
@@ -0,0 +1,259 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================
+Network Function Representors
+=============================
+
+This document describes the semantics and usage of representor netdevices, as
+used to control internal switching on SmartNICs. For the closely-related port
+representors on physical (multi-port) switches, see
+:ref:`Documentation/networking/switchdev.rst <switchdev>`.
+
+Motivation
+----------
+
+Since the mid-2010s, network cards have started offering more complex
+virtualisation capabilities than the legacy SR-IOV approach (with its simple
+MAC/VLAN-based switching model) can support. This led to a desire to offload
+software-defined networks (such as OpenVSwitch) to these NICs to specify the
+network connectivity of each function. The resulting designs are variously
+called SmartNICs or DPUs.
+
+Network function representors bring the standard Linux networking stack to
+virtual switches and IOV devices. Just as each physical port of a Linux-
+controlled switch has a separate netdev, so does each virtual port of a virtual
+switch.
+When the system boots, and before any offload is configured, all packets from
+the virtual functions appear in the networking stack of the PF via the
+representors. The PF can thus always communicate freely with the virtual
+functions.
+The PF can configure standard Linux forwarding between representors, the uplink
+or any other netdev (routing, bridging, TC classifiers).
+
+Thus, a representor is both a control plane object (representing the function in
+administrative commands) and a data plane object (one end of a virtual pipe).
+As a virtual link endpoint, the representor can be configured like any other
+netdevice; in some cases (e.g. link state) the representee will follow the
+representor's configuration, while in others there are separate APIs to
+configure the representee.
+
+Definitions
+-----------
+
+This document uses the term "switchdev function" to refer to the PCIe function
+which has administrative control over the virtual switch on the device.
+Typically, this will be a PF, but conceivably a NIC could be configured to grant
+these administrative privileges instead to a VF or SF (subfunction).
+Depending on NIC design, a multi-port NIC might have a single switchdev function
+for the whole device or might have a separate virtual switch, and hence
+switchdev function, for each physical network port.
+If the NIC supports nested switching, there might be separate switchdev
+functions for each nested switch, in which case each switchdev function should
+only create representors for the ports on the (sub-)switch it directly
+administers.
+
+A "representee" is the object that a representor represents. So for example in
+the case of a VF representor, the representee is the corresponding VF.
+
+What does a representor do?
+---------------------------
+
+A representor has three main roles.
+
+1. It is used to configure the network connection the representee sees, e.g.
+ link up/down, MTU, etc. For instance, bringing the representor
+ administratively UP should cause the representee to see a link up / carrier
+ on event.
+2. It provides the slow path for traffic which does not hit any offloaded
+ fast-path rules in the virtual switch. Packets transmitted on the
+ representor netdevice should be delivered to the representee; packets
+ transmitted by the representee which fail to match any switching rule should
+ be received on the representor netdevice. (That is, there is a virtual pipe
+ connecting the representor to the representee, similar in concept to a veth
+ pair.)
+ This allows software switch implementations (such as OpenVSwitch or a Linux
+ bridge) to forward packets between representees and the rest of the network.
+3. It acts as a handle by which switching rules (such as TC filters) can refer
+ to the representee, allowing these rules to be offloaded.
+
+The combination of 2) and 3) means that the behaviour (apart from performance)
+should be the same whether a TC filter is offloaded or not. E.g. a TC rule
+on a VF representor applies in software to packets received on that representor
+netdevice, while in hardware offload it would apply to packets transmitted by
+the representee VF. Conversely, a mirred egress redirect to a VF representor
+corresponds in hardware to delivery directly to the representee VF.
+
+What functions should have a representor?
+-----------------------------------------
+
+Essentially, for each virtual port on the device's internal switch, there
+should be a representor.
+Some vendors have chosen to omit representors for the uplink and the physical
+network port, which can simplify usage (the uplink netdev becomes in effect the
+physical port's representor) but does not generalise to devices with multiple
+ports or uplinks.
+
+Thus, the following should all have representors:
+
+ - VFs belonging to the switchdev function.
+ - Other PFs on the local PCIe controller, and any VFs belonging to them.
+ - PFs and VFs on external PCIe controllers on the device (e.g. for any embedded
+ System-on-Chip within the SmartNIC).
+ - PFs and VFs with other personalities, including network block devices (such
+ as a vDPA virtio-blk PF backed by remote/distributed storage), if (and only
+ if) their network access is implemented through a virtual switch port. [#]_
+ Note that such functions can require a representor despite the representee
+ not having a netdev.
+ - Subfunctions (SFs) belonging to any of the above PFs or VFs, if they have
+ their own port on the switch (as opposed to using their parent PF's port).
+ - Any accelerators or plugins on the device whose interface to the network is
+ through a virtual switch port, even if they do not have a corresponding PCIe
+ PF or VF.
+
+This allows the entire switching behaviour of the NIC to be controlled through
+representor TC rules.
+
+It is a common misunderstanding to conflate virtual ports with PCIe virtual
+functions or their netdevs. While in simple cases there will be a 1:1
+correspondence between VF netdevices and VF representors, more advanced device
+configurations may not follow this.
+A PCIe function which does not have network access through the internal switch
+(not even indirectly through the hardware implementation of whatever services
+the function provides) should *not* have a representor (even if it has a
+netdev).
+Such a function has no switch virtual port for the representor to configure or
+to be the other end of the virtual pipe.
+The representor represents the virtual port, not the PCIe function nor the 'end
+user' netdevice.
+
+.. [#] The concept here is that a hardware IP stack in the device performs the
+ translation between block DMA requests and network packets, so that only
+ network packets pass through the virtual port onto the switch. The network
+ access that the IP stack "sees" would then be configurable through tc rules;
+ e.g. its traffic might all be wrapped in a specific VLAN or VxLAN. However,
+ any needed configuration of the block device *qua* block device, not being a
+ networking entity, would not be appropriate for the representor and would
+ thus use some other channel such as devlink.
+ Contrast this with the case of a virtio-blk implementation which forwards the
+ DMA requests unchanged to another PF whose driver then initiates and
+ terminates IP traffic in software; in that case the DMA traffic would *not*
+ run over the virtual switch and the virtio-blk PF should thus *not* have a
+ representor.
+
+How are representors created?
+-----------------------------
+
+The driver instance attached to the switchdev function should, for each virtual
+port on the switch, create a pure-software netdevice which has some form of
+in-kernel reference to the switchdev function's own netdevice or driver private
+data (``netdev_priv()``).
+This may be by enumerating ports at probe time, reacting dynamically to the
+creation and destruction of ports at run time, or a combination of the two.
+
+The operations of the representor netdevice will generally involve acting
+through the switchdev function. For example, ``ndo_start_xmit()`` might send
+the packet through a hardware TX queue attached to the switchdev function, with
+either packet metadata or queue configuration marking it for delivery to the
+representee.
+
+How are representors identified?
+--------------------------------
+
+The representor netdevice should *not* directly refer to a PCIe device (e.g.
+through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of the
+representee or of the switchdev function.
+Instead, it should implement the ``ndo_get_devlink_port()`` netdevice op, which
+the kernel uses to provide the ``phys_switch_id`` and ``phys_port_name`` sysfs
+nodes. (Some legacy drivers implement ``ndo_get_port_parent_id()`` and
+``ndo_get_phys_port_name()`` directly, but this is deprecated.) See
+:ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>` for the
+details of this API.
+
+It is expected that userland will use this information (e.g. through udev rules)
+to construct an appropriately informative name or alias for the netdevice. For
+instance if the switchdev function is ``eth4`` then a representor with a
+``phys_port_name`` of ``p0pf1vf2`` might be renamed ``eth4pf1vf2rep``.
+
+There are as yet no established conventions for naming representors which do not
+correspond to PCIe functions (e.g. accelerators and plugins).
+
+How do representors interact with TC rules?
+-------------------------------------------
+
+Any TC rule on a representor applies (in software TC) to packets received by
+that representor netdevice. Thus, if the delivery part of the rule corresponds
+to another port on the virtual switch, the driver may choose to offload it to
+hardware, applying it to packets transmitted by the representee.
+
+Similarly, since a TC mirred egress action targeting the representor would (in
+software) send the packet through the representor (and thus indirectly deliver
+it to the representee), hardware offload should interpret this as delivery to
+the representee.
+
+As a simple example, if ``PORT_DEV`` is the physical port representor and
+``REP_DEV`` is a VF representor, the following rules::
+
+ tc filter add dev $REP_DEV parent ffff: protocol ipv4 flower \
+ action mirred egress redirect dev $PORT_DEV
+ tc filter add dev $PORT_DEV parent ffff: protocol ipv4 flower skip_sw \
+ action mirred egress mirror dev $REP_DEV
+
+would mean that all IPv4 packets from the VF are sent out the physical port, and
+all IPv4 packets received on the physical port are delivered to the VF in
+addition to ``PORT_DEV``. (Note that without ``skip_sw`` on the second rule,
+the VF would get two copies, as the packet reception on ``PORT_DEV`` would
+trigger the TC rule again and mirror the packet to ``REP_DEV``.)
+
+On devices without separate port and uplink representors, ``PORT_DEV`` would
+instead be the switchdev function's own uplink netdevice.
+
+Of course the rules can (if supported by the NIC) include packet-modifying
+actions (e.g. VLAN push/pop), which should be performed by the virtual switch.
+
+Tunnel encapsulation and decapsulation are rather more complicated, as they
+involve a third netdevice (a tunnel netdev operating in metadata mode, such as
+a VxLAN device created with ``ip link add vxlan0 type vxlan external``) and
+require an IP address to be bound to the underlay device (e.g. switchdev
+function uplink netdev or port representor). TC rules such as::
+
+ tc filter add dev $REP_DEV parent ffff: flower \
+ action tunnel_key set id $VNI src_ip $LOCAL_IP dst_ip $REMOTE_IP \
+ dst_port 4789 \
+ action mirred egress redirect dev vxlan0
+ tc filter add dev vxlan0 parent ffff: flower enc_src_ip $REMOTE_IP \
+ enc_dst_ip $LOCAL_IP enc_key_id $VNI enc_dst_port 4789 \
+ action tunnel_key unset action mirred egress redirect dev $REP_DEV
+
+where ``LOCAL_IP`` is an IP address bound to ``PORT_DEV``, and ``REMOTE_IP`` is
+another IP address on the same subnet, mean that packets sent by the VF should
+be VxLAN encapsulated and sent out the physical port (the driver has to deduce
+this by a route lookup of ``LOCAL_IP`` leading to ``PORT_DEV``, and also
+perform an ARP/neighbour table lookup to find the MAC addresses to use in the
+outer Ethernet frame), while UDP packets received on the physical port with UDP
+port 4789 should be parsed as VxLAN and, if their VSID matches ``$VNI``,
+decapsulated and forwarded to the VF.
+
+If this all seems complicated, just remember the 'golden rule' of TC offload:
+the hardware should ensure the same final results as if the packets were
+processed through the slow path, traversed software TC (except ignoring any
+``skip_hw`` rules and applying any ``skip_sw`` rules) and were transmitted or
+received through the representor netdevices.
+
+Configuring the representee's MAC
+---------------------------------
+
+The representee's link state is controlled through the representor. Setting the
+representor administratively UP or DOWN should cause carrier ON or OFF at the
+representee.
+
+Setting an MTU on the representor should cause that same MTU to be reported to
+the representee.
+(On hardware that allows configuring separate and distinct MTU and MRU values,
+the representor MTU should correspond to the representee's MRU and vice-versa.)
+
+Currently there is no way to use the representor to set the station permanent
+MAC address of the representee; other methods available to do this include:
+
+ - legacy SR-IOV (``ip link set DEVICE vf NUM mac LLADDR``)
+ - devlink port function (see **devlink-port(8)** and
+ :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`)
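Reviewer note on representors.rst above (not part of the patch): the "How are representors identified?" section gives one naming example, where a switchdev function ``eth4`` and a ``phys_port_name`` of ``p0pf1vf2`` yield ``eth4pf1vf2rep``. The following is a minimal userland sketch of that one transformation. The function name and the rule "drop a leading p<digits>, append rep" are assumptions read off that single example, not an established convention.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build "<parent><port suffix>rep" in 'out'.
 * A leading "p<digits>" (the physical port index) is dropped from
 * 'port_name' if present. Returns 0 on success, -1 if 'out' is too small.
 */
static int rep_alias(const char *parent, const char *port_name,
		     char *out, size_t len)
{
	const char *p = port_name;
	int n;

	assert(parent && port_name && out);

	/* Skip "p0", "p12", ... but leave names such as "pf1vf2" alone. */
	if (p[0] == 'p' && isdigit((unsigned char)p[1])) {
		p++;
		while (isdigit((unsigned char)*p))
			p++;
	}

	n = snprintf(out, len, "%s%srep", parent, p);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Called as ``rep_alias("eth4", "p0pf1vf2", buf, sizeof(buf))``, this produces the name from the document's example. In practice such renaming would more likely live in a udev rule reading the ``phys_port_name`` sysfs node.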
diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.rst
index 180e07d956a7..39494a6ea739 100644
--- a/Documentation/networking/rxrpc.txt
+++ b/Documentation/networking/rxrpc.rst
@@ -1,6 +1,8 @@
- ======================
- RxRPC NETWORK PROTOCOL
- ======================
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+RxRPC Network Protocol
+======================
The RxRPC protocol driver provides a reliable two-phase transport on top of UDP
that can be used to perform RxRPC remote operations. This is done over sockets
@@ -9,36 +11,35 @@ receive data, aborts and errors.
Contents of this document:
- (*) Overview.
+ (#) Overview.
- (*) RxRPC protocol summary.
+ (#) RxRPC protocol summary.
- (*) AF_RXRPC driver model.
+ (#) AF_RXRPC driver model.
- (*) Control messages.
+ (#) Control messages.
- (*) Socket options.
+ (#) Socket options.
- (*) Security.
+ (#) Security.
- (*) Example client usage.
+ (#) Example client usage.
- (*) Example server usage.
+ (#) Example server usage.
- (*) AF_RXRPC kernel interface.
+ (#) AF_RXRPC kernel interface.
- (*) Configurable parameters.
+ (#) Configurable parameters.
-========
-OVERVIEW
+Overview
========
RxRPC is a two-layer protocol. There is a session layer which provides
reliable virtual connections using UDP over IPv4 (or IPv6) as the transport
layer, but implements a real network protocol; and there's the presentation
layer which renders structured data to binary blobs and back again using XDR
-(as does SunRPC):
+(as does SunRPC)::
+-------------+
| Application |
@@ -85,31 +86,30 @@ The Andrew File System (AFS) is an example of an application that uses this and
that has both kernel (filesystem) and userspace (utility) components.
-======================
-RXRPC PROTOCOL SUMMARY
+RxRPC Protocol Summary
======================
An overview of the RxRPC protocol:
- (*) RxRPC sits on top of another networking protocol (UDP is the only option
+ (#) RxRPC sits on top of another networking protocol (UDP is the only option
currently), and uses this to provide network transport. UDP ports, for
example, provide transport endpoints.
- (*) RxRPC supports multiple virtual "connections" from any given transport
+ (#) RxRPC supports multiple virtual "connections" from any given transport
endpoint, thus allowing the endpoints to be shared, even to the same
remote endpoint.
- (*) Each connection goes to a particular "service". A connection may not go
+ (#) Each connection goes to a particular "service". A connection may not go
to multiple services. A service may be considered the RxRPC equivalent of
a port number. AF_RXRPC permits multiple services to share an endpoint.
- (*) Client-originating packets are marked, thus a transport endpoint can be
+ (#) Client-originating packets are marked, thus a transport endpoint can be
shared between client and server connections (connections have a
direction).
- (*) Up to a billion connections may be supported concurrently between one
+ (#) Up to a billion connections may be supported concurrently between one
local transport endpoint and one service on one remote endpoint. An RxRPC
- connection is described by seven numbers:
+ connection is described by seven numbers::
Local address }
Local port } Transport (UDP) address
@@ -119,22 +119,22 @@ An overview of the RxRPC protocol:
Connection ID
Service ID
- (*) Each RxRPC operation is a "call". A connection may make up to four
+ (#) Each RxRPC operation is a "call". A connection may make up to four
billion calls, but only up to four calls may be in progress on a
connection at any one time.
- (*) Calls are two-phase and asymmetric: the client sends its request data,
+ (#) Calls are two-phase and asymmetric: the client sends its request data,
which the service receives; then the service transmits the reply data
which the client receives.
- (*) The data blobs are of indefinite size, the end of a phase is marked with a
+ (#) The data blobs are of indefinite size; the end of a phase is marked with a
flag in the packet. The number of packets of data making up one blob may
not exceed 4 billion, however, as this would cause the sequence number to
wrap.
- (*) The first four bytes of the request data are the service operation ID.
+ (#) The first four bytes of the request data are the service operation ID.
- (*) Security is negotiated on a per-connection basis. The connection is
+ (#) Security is negotiated on a per-connection basis. The connection is
initiated by the first data packet on it arriving. If security is
requested, the server then issues a "challenge" and then the client
replies with a "response". If the response is successful, the security is
@@ -143,146 +143,145 @@ An overview of the RxRPC protocol:
connection lapse before the client, the security will be renegotiated if
the client uses the connection again.
- (*) Calls use ACK packets to handle reliability. Data packets are also
+ (#) Calls use ACK packets to handle reliability. Data packets are also
explicitly sequenced per call.
- (*) There are two types of positive acknowledgment: hard-ACKs and soft-ACKs.
+ (#) There are two types of positive acknowledgment: hard-ACKs and soft-ACKs.
A hard-ACK indicates to the far side that all the data received to a point
has been received and processed; a soft-ACK indicates that the data has
been received but may yet be discarded and re-requested. The sender may
not discard any transmittable packets until they've been hard-ACK'd.
- (*) Reception of a reply data packet implicitly hard-ACK's all the data
+ (#) Reception of a reply data packet implicitly hard-ACK's all the data
packets that make up the request.
- (*) An call is complete when the request has been sent, the reply has been
+ (#) A call is complete when the request has been sent, the reply has been
received and the final hard-ACK on the last packet of the reply has
reached the server.
- (*) An call may be aborted by either end at any time up to its completion.
+ (#) A call may be aborted by either end at any time up to its completion.
-=====================
-AF_RXRPC DRIVER MODEL
+AF_RXRPC Driver Model
=====================
About the AF_RXRPC driver:
- (*) The AF_RXRPC protocol transparently uses internal sockets of the transport
+ (#) The AF_RXRPC protocol transparently uses internal sockets of the transport
protocol to represent transport endpoints.
- (*) AF_RXRPC sockets map onto RxRPC connection bundles. Actual RxRPC
+ (#) AF_RXRPC sockets map onto RxRPC connection bundles. Actual RxRPC
connections are handled transparently. One client socket may be used to
make multiple simultaneous calls to the same service. One server socket
may handle calls from many clients.
- (*) Additional parallel client connections will be initiated to support extra
+ (#) Additional parallel client connections will be initiated to support extra
concurrent calls, up to a tunable limit.
- (*) Each connection is retained for a certain amount of time [tunable] after
+ (#) Each connection is retained for a certain amount of time [tunable] after
the last call currently using it has completed in case a new call is made
that could reuse it.
- (*) Each internal UDP socket is retained [tunable] for a certain amount of
+ (#) Each internal UDP socket is retained [tunable] for a certain amount of
time [tunable] after the last connection using it is discarded, in case a new
connection is made that could use it.
- (*) A client-side connection is only shared between calls if they have have
+ (#) A client-side connection is only shared between calls if they have
the same key struct describing their security (and assuming the calls
would otherwise share the connection). Non-secured calls would also be
able to share connections with each other.
- (*) A server-side connection is shared if the client says it is.
+ (#) A server-side connection is shared if the client says it is.
- (*) ACK'ing is handled by the protocol driver automatically, including ping
+ (#) ACK'ing is handled by the protocol driver automatically, including ping
replying.
- (*) SO_KEEPALIVE automatically pings the other side to keep the connection
+ (#) SO_KEEPALIVE automatically pings the other side to keep the connection
alive [TODO].
- (*) If an ICMP error is received, all calls affected by that error will be
+ (#) If an ICMP error is received, all calls affected by that error will be
aborted with an appropriate network error passed through recvmsg().
Interaction with the user of the RxRPC socket:
- (*) A socket is made into a server socket by binding an address with a
+ (#) A socket is made into a server socket by binding an address with a
non-zero service ID.
- (*) In the client, sending a request is achieved with one or more sendmsgs,
+ (#) In the client, sending a request is achieved with one or more sendmsgs,
followed by the reply being received with one or more recvmsgs.
- (*) The first sendmsg for a request to be sent from a client contains a tag to
+ (#) The first sendmsg for a request to be sent from a client contains a tag to
be used in all other sendmsgs or recvmsgs associated with that call. The
tag is carried in the control data.
- (*) connect() is used to supply a default destination address for a client
+ (#) connect() is used to supply a default destination address for a client
socket. This may be overridden by supplying an alternate address to the
first sendmsg() of a call (struct msghdr::msg_name).
- (*) If connect() is called on an unbound client, a random local port will
+ (#) If connect() is called on an unbound client, a random local port will
be bound before the operation takes place.
- (*) A server socket may also be used to make client calls. To do this, the
+ (#) A server socket may also be used to make client calls. To do this, the
first sendmsg() of the call must specify the target address. The server's
transport endpoint is used to send the packets.
- (*) Once the application has received the last message associated with a call,
+ (#) Once the application has received the last message associated with a call,
the tag is guaranteed not to be seen again, and so it can be used to pin
client resources. A new call can then be initiated with the same tag
without fear of interference.
- (*) In the server, a request is received with one or more recvmsgs, then the
+ (#) In the server, a request is received with one or more recvmsgs, then the
reply is transmitted with one or more sendmsgs, and then the final ACK
is received with a last recvmsg.
- (*) When sending data for a call, sendmsg is given MSG_MORE if there's more
+ (#) When sending data for a call, sendmsg is given MSG_MORE if there's more
data to come on that call.
- (*) When receiving data for a call, recvmsg flags MSG_MORE if there's more
+ (#) When receiving data for a call, recvmsg flags MSG_MORE if there's more
data to come for that call.
- (*) When receiving data or messages for a call, MSG_EOR is flagged by recvmsg
+ (#) When receiving data or messages for a call, MSG_EOR is flagged by recvmsg
to indicate the terminal message for that call.
- (*) A call may be aborted by adding an abort control message to the control
+ (#) A call may be aborted by adding an abort control message to the control
data. Issuing an abort terminates the kernel's use of that call's tag.
Any messages waiting in the receive queue for that call will be discarded.
- (*) Aborts, busy notifications and challenge packets are delivered by recvmsg,
+ (#) Aborts, busy notifications and challenge packets are delivered by recvmsg,
and control data messages will be set to indicate the context. Receiving
an abort or a busy message terminates the kernel's use of that call's tag.
- (*) The control data part of the msghdr struct is used for a number of things:
+ (#) The control data part of the msghdr struct is used for a number of things:
- (*) The tag of the intended or affected call.
+ (#) The tag of the intended or affected call.
- (*) Sending or receiving errors, aborts and busy notifications.
+ (#) Sending or receiving errors, aborts and busy notifications.
- (*) Notifications of incoming calls.
+ (#) Notifications of incoming calls.
- (*) Sending debug requests and receiving debug replies [TODO].
+ (#) Sending debug requests and receiving debug replies [TODO].
- (*) When the kernel has received and set up an incoming call, it sends a
+ (#) When the kernel has received and set up an incoming call, it sends a
message to the server application to let it know there's a new call awaiting
its acceptance [recvmsg reports a special control message]. The server
application then uses sendmsg to assign a tag to the new call. Once that
is done, the first part of the request data will be delivered by recvmsg.
- (*) The server application has to provide the server socket with a keyring of
+ (#) The server application has to provide the server socket with a keyring of
secret keys corresponding to the security types it permits. When a secure
connection is being set up, the kernel looks up the appropriate secret key
in the keyring and then sends a challenge packet to the client and
receives a response packet. The kernel then checks the authorisation of
the packet and either aborts the connection or sets up the security.
- (*) The name of the key a client will use to secure its communications is
+ (#) The name of the key a client will use to secure its communications is
nominated by a socket option.
Notes on sendmsg:
- (*) MSG_WAITALL can be set to tell sendmsg to ignore signals if the peer is
+ (#) MSG_WAITALL can be set to tell sendmsg to ignore signals if the peer is
making progress at accepting packets within a reasonable time such that we
manage to queue up all the data for transmission. This requires the
client to accept at least one packet per 2*RTT time period.
@@ -294,7 +293,7 @@ Notes on sendmsg:
Notes on recvmsg:
- (*) If there's a sequence of data messages belonging to a particular call on
+ (#) If there's a sequence of data messages belonging to a particular call on
the receive queue, then recvmsg will keep working through them until:
(a) it meets the end of that call's received data,
@@ -320,13 +319,13 @@ Notes on recvmsg:
flagged.
-================
-CONTROL MESSAGES
+Control Messages
================
AF_RXRPC makes use of control messages in sendmsg() and recvmsg() to multiplex
calls, to invoke certain actions and to report certain conditions. These are:
+ ======================= === =========== ===============================
MESSAGE ID SRT DATA MEANING
======================= === =========== ===============================
RXRPC_USER_CALL_ID sr- User ID App's call specifier
@@ -340,10 +339,11 @@ calls, to invoke certain actions and to report certain conditions. These are:
RXRPC_EXCLUSIVE_CALL s-- n/a Make an exclusive client call
RXRPC_UPGRADE_SERVICE s-- n/a Client call can be upgraded
RXRPC_TX_LENGTH s-- data len Total length of Tx data
+ ======================= === =========== ===============================
(SRT = usable in Sendmsg / delivered by Recvmsg / Terminal message)
- (*) RXRPC_USER_CALL_ID
+ (#) RXRPC_USER_CALL_ID
This is used to indicate the application's call ID. It's an unsigned long
that the app specifies in the client by attaching it to the first data
@@ -351,7 +351,7 @@ calls, to invoke certain actions and to report certain conditions. These are:
message. recvmsg() passes it in conjunction with all messages except
those of the RXRPC_NEW_CALL message.
- (*) RXRPC_ABORT
+ (#) RXRPC_ABORT
This can be used by an application to abort a call by passing it to
sendmsg, or it can be delivered by recvmsg to indicate a remote abort was
@@ -359,13 +359,13 @@ calls, to invoke certain actions and to report certain conditions. These are:
specify the call affected. If an abort is being sent, then error EBADSLT
will be returned if there is no call with that user ID.
- (*) RXRPC_ACK
+ (#) RXRPC_ACK
This is delivered to a server application to indicate that the final ACK
of a call was received from the client. It will be associated with an
RXRPC_USER_CALL_ID to indicate the call that's now complete.
- (*) RXRPC_NET_ERROR
+ (#) RXRPC_NET_ERROR
This is delivered to an application to indicate that an ICMP error message
was encountered in the process of trying to talk to the peer. An
@@ -373,13 +373,13 @@ calls, to invoke certain actions and to report certain conditions. These are:
indicating the problem, and an RXRPC_USER_CALL_ID will indicate the call
affected.
- (*) RXRPC_BUSY
+ (#) RXRPC_BUSY
This is delivered to a client application to indicate that a call was
rejected by the server due to the server being busy. It will be
associated with an RXRPC_USER_CALL_ID to indicate the rejected call.
- (*) RXRPC_LOCAL_ERROR
+ (#) RXRPC_LOCAL_ERROR
This is delivered to an application to indicate that a local error was
encountered and that a call has been aborted because of it. An
@@ -387,13 +387,13 @@ calls, to invoke certain actions and to report certain conditions. These are:
indicating the problem, and an RXRPC_USER_CALL_ID will indicate the call
affected.
- (*) RXRPC_NEW_CALL
+ (#) RXRPC_NEW_CALL
This is delivered to indicate to a server application that a new call has
arrived and is awaiting acceptance. No user ID is associated with this,
as a user ID must subsequently be assigned by doing an RXRPC_ACCEPT.
- (*) RXRPC_ACCEPT
+ (#) RXRPC_ACCEPT
This is used by a server application to attempt to accept a call and
assign it a user ID. It should be associated with an RXRPC_USER_CALL_ID
@@ -402,12 +402,12 @@ calls, to invoke certain actions and to report certain conditions. These are:
return error ENODATA. If the user ID is already in use by another call,
then error EBADSLT will be returned.
- (*) RXRPC_EXCLUSIVE_CALL
+ (#) RXRPC_EXCLUSIVE_CALL
This is used to indicate that a client call should be made on a one-off
connection. The connection is discarded once the call has terminated.
- (*) RXRPC_UPGRADE_SERVICE
+ (#) RXRPC_UPGRADE_SERVICE
This is used to make a client call to probe if the specified service ID
may be upgraded by the server. The caller must check msg_name returned to
@@ -419,7 +419,7 @@ calls, to invoke certain actions and to report certain conditions. These are:
future communication to that server and RXRPC_UPGRADE_SERVICE should no
longer be set.
- (*) RXRPC_TX_LENGTH
+ (#) RXRPC_TX_LENGTH
This is used to inform the kernel of the total amount of data that is
going to be transmitted by a call (whether in a client request or a
@@ -443,7 +443,7 @@ SOCKET OPTIONS
AF_RXRPC sockets support a few socket options at the SOL_RXRPC level:
- (*) RXRPC_SECURITY_KEY
+ (#) RXRPC_SECURITY_KEY
This is used to specify the description of the key to be used. The key is
extracted from the calling process's keyrings with request_key() and
@@ -452,17 +452,17 @@ AF_RXRPC sockets support a few socket options at the SOL_RXRPC level:
The optval pointer points to the description string, and optlen indicates
how long the string is, without the NUL terminator.
- (*) RXRPC_SECURITY_KEYRING
+ (#) RXRPC_SECURITY_KEYRING
Similar to above but specifies a keyring of server secret keys to use (key
type "keyring"). See the "Security" section.
- (*) RXRPC_EXCLUSIVE_CONNECTION
+ (#) RXRPC_EXCLUSIVE_CONNECTION
This is used to request that new connections should be used for each call
made subsequently on this socket. optval should be NULL and optlen 0.
- (*) RXRPC_MIN_SECURITY_LEVEL
+ (#) RXRPC_MIN_SECURITY_LEVEL
This is used to specify the minimum security level required for calls on
this socket. optval must point to an int containing one of the following
@@ -477,19 +477,19 @@ AF_RXRPC sockets support a few socket options at the SOL_RXRPC level:
Encrypted checksum plus packet padded and first eight bytes of packet
encrypted - which includes the actual packet length.
- (c) RXRPC_SECURITY_ENCRYPTED
+ (c) RXRPC_SECURITY_ENCRYPT
Encrypted checksum plus entire packet padded and encrypted, including
actual packet length.
- (*) RXRPC_UPGRADEABLE_SERVICE
+ (#) RXRPC_UPGRADEABLE_SERVICE
This is used to indicate that a service socket with two bindings may
upgrade one bound service to the other if requested by the client. optval
must point to an array of two unsigned short ints. The first is the
service ID to upgrade from and the second the service ID to upgrade to.
- (*) RXRPC_SUPPORTED_CMSG
+ (#) RXRPC_SUPPORTED_CMSG
This is a read-only option that writes an int into the buffer indicating
the highest control message type supported.
@@ -509,7 +509,7 @@ found at:
http://people.redhat.com/~dhowells/rxrpc/klog.c
The payload provided to add_key() on the client should be of the following
-form:
+form::
struct rxrpc_key_sec2_v1 {
uint16_t security_index; /* 2 */
@@ -546,14 +546,14 @@ EXAMPLE CLIENT USAGE
A client would issue an operation by:
- (1) An RxRPC socket is set up by:
+ (1) An RxRPC socket is set up by::
client = socket(AF_RXRPC, SOCK_DGRAM, PF_INET);
Where the third parameter indicates the protocol family of the transport
socket used - usually IPv4 but it can also be IPv6 [TODO].
- (2) A local address can optionally be bound:
+ (2) A local address can optionally be bound::
struct sockaddr_rxrpc srx = {
.srx_family = AF_RXRPC,
@@ -570,20 +570,20 @@ A client would issue an operation by:
several unrelated RxRPC sockets. Security is handled on a per-RxRPC
virtual connection basis.
- (3) The security is set:
+ (3) The security is set::
const char *key = "AFS:cambridge.redhat.com";
setsockopt(client, SOL_RXRPC, RXRPC_SECURITY_KEY, key, strlen(key));
This issues a request_key() to get the key representing the security
- context. The minimum security level can be set:
+ context. The minimum security level can be set::
- unsigned int sec = RXRPC_SECURITY_ENCRYPTED;
+ unsigned int sec = RXRPC_SECURITY_ENCRYPT;
setsockopt(client, SOL_RXRPC, RXRPC_MIN_SECURITY_LEVEL,
&sec, sizeof(sec));
(4) The server to be contacted can then be specified (alternatively this can
- be done through sendmsg):
+ be done through sendmsg)::
struct sockaddr_rxrpc srx = {
.srx_family = AF_RXRPC,
@@ -598,7 +598,9 @@ A client would issue an operation by:
(5) The request data should then be posted to the server socket using a series
of sendmsg() calls, each with the following control message attached:
- RXRPC_USER_CALL_ID - specifies the user ID for this call
+ ================== ===================================
+ RXRPC_USER_CALL_ID specifies the user ID for this call
+ ================== ===================================
MSG_MORE should be set in msghdr::msg_flags on all but the last part of
the request. Multiple requests may be made simultaneously.
@@ -635,13 +637,12 @@ any more calls (further calls to the same destination will be blocked until the
probe is concluded).
-====================
-EXAMPLE SERVER USAGE
+Example Server Usage
====================
A server would be set up to accept operations in the following manner:
- (1) An RxRPC socket is created by:
+ (1) An RxRPC socket is created by::
server = socket(AF_RXRPC, SOCK_DGRAM, PF_INET);
@@ -649,7 +650,7 @@ A server would be set up to accept operations in the following manner:
socket used - usually IPv4.
(2) Security is set up if desired by giving the socket a keyring with server
- secret keys in it:
+ secret keys in it::
keyring = add_key("keyring", "AFSkeys", NULL, 0,
KEY_SPEC_PROCESS_KEYRING);
@@ -663,7 +664,7 @@ A server would be set up to accept operations in the following manner:
The keyring can be manipulated after it has been given to the socket. This
permits the server to add more keys, replace keys, etc. while it is live.
- (3) A local address must then be bound:
+ (3) A local address must then be bound::
struct sockaddr_rxrpc srx = {
.srx_family = AF_RXRPC,
@@ -680,7 +681,7 @@ A server would be set up to accept operations in the following manner:
should be called twice.
(4) If service upgrading is required, first two service IDs must have been
- bound and then the following option must be set:
+ bound and then the following option must be set::
unsigned short service_ids[2] = { from_ID, to_ID };
setsockopt(server, SOL_RXRPC, RXRPC_UPGRADEABLE_SERVICE,
@@ -690,14 +691,14 @@ A server would be set up to accept operations in the following manner:
to_ID if they request it. This will be reflected in msg_name obtained
through recvmsg() when the request data is delivered to userspace.
- (5) The server is then set to listen out for incoming calls:
+ (5) The server is then set to listen out for incoming calls::
listen(server, 100);
(6) The kernel notifies the server of pending incoming connections by sending
it a message for each. This is received with recvmsg() on the server
socket. It has no data, and has a single dataless control message
- attached:
+ attached::
RXRPC_NEW_CALL
@@ -709,8 +710,10 @@ A server would be set up to accept operations in the following manner:
(7) The server then accepts the new call by issuing a sendmsg() with two
pieces of control data and no actual data:
- RXRPC_ACCEPT - indicate connection acceptance
- RXRPC_USER_CALL_ID - specify user ID for this call
+ ================== ==============================
+ RXRPC_ACCEPT indicate connection acceptance
+ RXRPC_USER_CALL_ID specify user ID for this call
+ ================== ==============================
(8) The first request data packet will then be posted to the server socket for
recvmsg() to pick up. At that point, the RxRPC address for the call can
@@ -722,12 +725,17 @@ A server would be set up to accept operations in the following manner:
All data will be delivered with the following control message attached:
- RXRPC_USER_CALL_ID - specifies the user ID for this call
+
+ ================== ===================================
+ RXRPC_USER_CALL_ID specifies the user ID for this call
+ ================== ===================================
(9) The reply data should then be posted to the server socket using a series
of sendmsg() calls, each with the following control messages attached:
- RXRPC_USER_CALL_ID - specifies the user ID for this call
+ ================== ===================================
+ RXRPC_USER_CALL_ID specifies the user ID for this call
+ ================== ===================================
MSG_MORE should be set in msghdr::msg_flags on all but the last message
for a particular call.
@@ -736,8 +744,10 @@ A server would be set up to accept operations in the following manner:
when it is received. It will take the form of a dataless message with two
control messages attached:
- RXRPC_USER_CALL_ID - specifies the user ID for this call
- RXRPC_ACK - indicates final ACK (no data)
+ ================== ===================================
+ RXRPC_USER_CALL_ID specifies the user ID for this call
+ RXRPC_ACK indicates final ACK (no data)
+ ================== ===================================
MSG_EOR will be flagged to indicate that this is the final message for
this call.
@@ -746,8 +756,10 @@ A server would be set up to accept operations in the following manner:
aborted by calling sendmsg() with a dataless message with the following
control messages attached:
- RXRPC_USER_CALL_ID - specifies the user ID for this call
- RXRPC_ABORT - indicates abort code (4 byte data)
+ ================== ===================================
+ RXRPC_USER_CALL_ID specifies the user ID for this call
+ RXRPC_ABORT indicates abort code (4 byte data)
+ ================== ===================================
Any packets waiting in the socket's receive queue will be discarded if
this is issued.
@@ -757,8 +769,7 @@ the one server socket, using control messages on sendmsg() and recvmsg() to
determine the call affected.
-=========================
-AF_RXRPC KERNEL INTERFACE
+AF_RXRPC Kernel Interface
=========================
The AF_RXRPC module also provides an interface for use by in-kernel utilities
@@ -786,7 +797,7 @@ then it passes this to the kernel interface functions.
The kernel interface functions are as follows:
- (*) Begin a new client call.
+ (#) Begin a new client call::
struct rxrpc_call *
rxrpc_kernel_begin_call(struct socket *sock,
@@ -837,7 +848,7 @@ The kernel interface functions are as follows:
returned. The caller now holds a reference on this and it must be
properly ended.
- (*) End a client call.
+ (#) End a client call::
void rxrpc_kernel_end_call(struct socket *sock,
struct rxrpc_call *call);
@@ -846,7 +857,7 @@ The kernel interface functions are as follows:
from AF_RXRPC's knowledge and will not be seen again in association with
the specified call.
- (*) Send data through a call.
+ (#) Send data through a call::
typedef void (*rxrpc_notify_end_tx_t)(struct sock *sk,
unsigned long user_call_ID,
@@ -872,7 +883,7 @@ The kernel interface functions are as follows:
called with the call-state spinlock held to prevent any reply or final ACK
from being delivered first.
- (*) Receive data from a call.
+ (#) Receive data from a call::
int rxrpc_kernel_recv_data(struct socket *sock,
struct rxrpc_call *call,
@@ -902,12 +913,14 @@ The kernel interface functions are as follows:
more data was available, EMSGSIZE is returned.
If a remote ABORT is detected, the abort code received will be stored in
- *_abort and ECONNABORTED will be returned.
+ ``*_abort`` and ECONNABORTED will be returned.
The service ID that the call ended up with is returned into *_service.
This can be used to see if a call got a service upgrade.
- (*) Abort a call.
+ (#) Abort a call::
void rxrpc_kernel_abort_call(struct socket *sock,
struct rxrpc_call *call,
@@ -916,7 +929,7 @@ The kernel interface functions are as follows:
This is used to abort a call if it's still in an abortable state. The
abort code specified will be placed in the ABORT message sent.
- (*) Intercept received RxRPC messages.
+ (#) Intercept received RxRPC messages::
typedef void (*rxrpc_interceptor_t)(struct sock *sk,
unsigned long user_call_ID,
@@ -937,7 +950,8 @@ The kernel interface functions are as follows:
The skb->mark field indicates the type of message:
- MARK MEANING
+ =============================== =======================================
+ Mark Meaning
=============================== =======================================
RXRPC_SKB_MARK_DATA Data message
RXRPC_SKB_MARK_FINAL_ACK Final ACK received for an incoming call
@@ -946,6 +960,7 @@ The kernel interface functions are as follows:
RXRPC_SKB_MARK_NET_ERROR Network error detected
RXRPC_SKB_MARK_LOCAL_ERROR Local error encountered
RXRPC_SKB_MARK_NEW_CALL New incoming call awaiting acceptance
+ =============================== =======================================
The remote abort message can be probed with rxrpc_kernel_get_abort_code().
The two error messages can be probed with rxrpc_kernel_get_error_number().
@@ -961,7 +976,7 @@ The kernel interface functions are as follows:
is possible to get extra refs on all types of message for later freeing,
but this may pin the state of a call until the message is finally freed.
- (*) Accept an incoming call.
+ (#) Accept an incoming call::
struct rxrpc_call *
rxrpc_kernel_accept_call(struct socket *sock,
@@ -975,7 +990,7 @@ The kernel interface functions are as follows:
returned. The caller now holds a reference on this and it must be
properly ended.
- (*) Reject an incoming call.
+ (#) Reject an incoming call::
int rxrpc_kernel_reject_call(struct socket *sock);
@@ -984,21 +999,21 @@ The kernel interface functions are as follows:
Other errors may be returned if the call had been aborted (-ECONNABORTED)
or had timed out (-ETIME).
- (*) Allocate a null key for doing anonymous security.
+ (#) Allocate a null key for doing anonymous security::
struct key *rxrpc_get_null_key(const char *keyname);
This is used to allocate a null RxRPC key that can be used to indicate
anonymous security for a particular domain.
- (*) Get the peer address of a call.
+ (#) Get the peer address of a call::
void rxrpc_kernel_get_peer(struct socket *sock, struct rxrpc_call *call,
struct sockaddr_rxrpc *_srx);
This is used to find the remote peer address of a call.
- (*) Set the total transmit data size on a call.
+ (#) Set the total transmit data size on a call::
void rxrpc_kernel_set_tx_length(struct socket *sock,
struct rxrpc_call *call,
@@ -1009,14 +1024,14 @@ The kernel interface functions are as follows:
size should be set when the call is begun. tx_total_len may not be less
than zero.
- (*) Get call RTT.
+ (#) Get call RTT::
u64 rxrpc_kernel_get_rtt(struct socket *sock, struct rxrpc_call *call);
Get the RTT to the peer in use by a call. The value returned is in
nanoseconds.
- (*) Check call still alive.
+ (#) Check call still alive::
bool rxrpc_kernel_check_life(struct socket *sock,
struct rxrpc_call *call,
@@ -1024,7 +1039,7 @@ The kernel interface functions are as follows:
void rxrpc_kernel_probe_life(struct socket *sock,
struct rxrpc_call *call);
- The first function passes back in *_life a number that is updated when
+ The first function passes back in ``*_life`` a number that is updated when
ACKs are received from the peer (notably including PING RESPONSE ACKs
which we can elicit by sending PING ACKs to see if the call still exists
on the server). The caller should compare the numbers of two calls to see
@@ -1040,18 +1055,7 @@ The kernel interface functions are as follows:
first function to change. Note that this must be called in TASK_RUNNING
state.
- (*) Get reply timestamp.
-
- bool rxrpc_kernel_get_reply_time(struct socket *sock,
- struct rxrpc_call *call,
- ktime_t *_ts)
-
- This allows the timestamp on the first DATA packet of the reply of a
- client call to be queried, provided that it is still in the Rx ring. If
- successful, the timestamp will be stored into *_ts and true will be
- returned; false will be returned otherwise.
-
- (*) Get remote client epoch.
+ (#) Get remote client epoch::
u32 rxrpc_kernel_get_epoch(struct socket *sock,
struct rxrpc_call *call)
@@ -1065,7 +1069,7 @@ The kernel interface functions are as follows:
This value can be used to determine if the remote client has been
restarted as it shouldn't change otherwise.
- (*) Set the maxmimum lifespan on a call.
+ (#) Set the maximum lifespan on a call::
void rxrpc_kernel_set_max_life(struct socket *sock,
struct rxrpc_call *call,
@@ -1075,15 +1079,23 @@ The kernel interface functions are as follows:
jiffies). In the event of the timeout occurring, the call will be
aborted and -ETIME or -ETIMEDOUT will be returned.
+ (#) Apply the RXRPC_MIN_SECURITY_LEVEL sockopt to a socket from within the
+ kernel::
-=======================
-CONFIGURABLE PARAMETERS
+ int rxrpc_sock_set_min_security_level(struct sock *sk,
+ unsigned int val);
+
+ This specifies the minimum security level required for calls on this
+ socket.
+
+
+Configurable Parameters
=======================
The RxRPC protocol driver has a number of configurable parameters that can be
adjusted through sysctls in /proc/net/rxrpc/:
- (*) req_ack_delay
+ (#) req_ack_delay
The amount of time in milliseconds after receiving a packet with the
request-ack flag set before we honour the flag and actually send the
@@ -1093,60 +1105,60 @@ adjusted through sysctls in /proc/net/rxrpc/:
reception window is full (to a maximum of 255 packets), so delaying the
ACK permits several packets to be ACK'd in one go.
- (*) soft_ack_delay
+ (#) soft_ack_delay
The amount of time in milliseconds after receiving a new packet before we
generate a soft-ACK to tell the sender that it doesn't need to resend.
- (*) idle_ack_delay
+ (#) idle_ack_delay
The amount of time in milliseconds after all the packets currently in the
received queue have been consumed before we generate a hard-ACK to tell
the sender it can free its buffers, assuming no other reason occurs that
we would send an ACK.
- (*) resend_timeout
+ (#) resend_timeout
The amount of time in milliseconds after transmitting a packet before we
transmit it again, assuming no ACK is received from the receiver telling
us they got it.
- (*) max_call_lifetime
+ (#) max_call_lifetime
The maximum amount of time in seconds that a call may be in progress
before we preemptively kill it.
- (*) dead_call_expiry
+ (#) dead_call_expiry
The amount of time in seconds before we remove a dead call from the call
list. Dead calls are kept around for a little while for the purpose of
repeating ACK and ABORT packets.
- (*) connection_expiry
+ (#) connection_expiry
The amount of time in seconds after a connection was last used before we
remove it from the connection list. While a connection is in existence,
it serves as a placeholder for negotiated security; when it is deleted,
the security must be renegotiated.
- (*) transport_expiry
+ (#) transport_expiry
The amount of time in seconds after a transport was last used before we
remove it from the transport list. While a transport is in existence, it
serves to anchor the peer data and keeps the connection ID counter.
- (*) rxrpc_rx_window_size
+ (#) rxrpc_rx_window_size
The size of the receive window in packets. This is the maximum number of
unconsumed received packets we're willing to hold in memory for any
particular call.
- (*) rxrpc_rx_mtu
+ (#) rxrpc_rx_mtu
The maximum packet MTU size that we're willing to receive in bytes. This
indicates to the peer whether we're willing to accept jumbo packets.
- (*) rxrpc_rx_jumbo_max
+ (#) rxrpc_rx_jumbo_max
The maximum number of packets that we're willing to accept in a jumbo
packet. Non-terminal packets in a jumbo packet must contain a four byte
diff --git a/Documentation/networking/scaling.rst b/Documentation/networking/scaling.rst
index f78d7bf27ff5..3d435caa3ef2 100644
--- a/Documentation/networking/scaling.rst
+++ b/Documentation/networking/scaling.rst
@@ -81,7 +81,7 @@ of queues to IRQs can be determined from /proc/interrupts. By default,
an IRQ may be handled on any CPU. Because a non-negligible part of packet
processing takes place in receive interrupt handling, it is advantageous
to spread receive interrupts between CPUs. To manually adjust the IRQ
-affinity of each interrupt see Documentation/IRQ-affinity.txt. Some systems
+affinity of each interrupt see Documentation/core-api/irq/irq-affinity.rst. Some systems
will be running irqbalance, a daemon that dynamically optimizes IRQ
assignments and as a result may override any manual settings.
@@ -160,7 +160,7 @@ can be configured for each receive queue using a sysfs file entry::
This file implements a bitmap of CPUs. RPS is disabled when it is zero
(the default), in which case packets are processed on the interrupting
-CPU. Documentation/IRQ-affinity.txt explains how CPUs are assigned to
+CPU. Documentation/core-api/irq/irq-affinity.rst explains how CPUs are assigned to
the bitmap.
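The CPU bitmap mentioned above can be derived from a list of CPU ids. The following is a minimal sketch, not part of the patch; the helper name `cpus_to_rps_mask` is hypothetical, and it assumes the usual kernel bitmap text form of comma-separated 32-bit hexadecimal words, most significant word first.

```python
def cpus_to_rps_mask(cpus):
    """Return a hexadecimal CPU bitmap string for an iterable of CPU ids.

    Hypothetical helper for illustration only. An empty iterable yields
    "0", which the text above describes as RPS being disabled.
    """
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU ids must be non-negative")
        mask |= 1 << cpu  # set the bit for this CPU

    # Split the mask into 32-bit words, least significant word first.
    words = []
    while True:
        words.append(mask & 0xFFFFFFFF)
        mask >>= 32
        if mask == 0:
            break

    # Emit the most significant word first; pad the lower words to 8 digits.
    words.reverse()
    head = format(words[0], "x")
    tail = [format(word, "08x") for word in words[1:]]
    return ",".join([head] + tail)
```

The resulting string is what would be written to the per-queue `rps_cpus` sysfs file for the receive queue being configured.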
@@ -465,9 +465,9 @@ XPS Configuration
-----------------
XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
-default for SMP). The functionality remains disabled until explicitly
-configured. To enable XPS, the bitmap of CPUs/receive-queues that may
-use a transmit queue is configured using the sysfs file entry:
+default for SMP). If compiled in, it is driver dependent whether, and
+how, XPS is configured at device init. The mapping of CPUs/receive-queues
+to transmit queue can be inspected and configured using sysfs:
For selection based on CPUs map::
diff --git a/Documentation/networking/sctp.txt b/Documentation/networking/sctp.rst
index 97b810ca9082..9f4d9c8a925b 100644
--- a/Documentation/networking/sctp.txt
+++ b/Documentation/networking/sctp.rst
@@ -1,35 +1,42 @@
-Linux Kernel SCTP
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Linux Kernel SCTP
+=================
This is the current BETA release of the Linux Kernel SCTP reference
-implementation.
+implementation.
SCTP (Stream Control Transmission Protocol) is a IP based, message oriented,
reliable transport protocol, with congestion control, support for
transparent multi-homing, and multiple ordered streams of messages.
RFC2960 defines the core protocol. The IETF SIGTRAN working group originally
-developed the SCTP protocol and later handed the protocol over to the
-Transport Area (TSVWG) working group for the continued evolvement of SCTP as a
-general purpose transport.
+developed the SCTP protocol and later handed the protocol over to the
+Transport Area (TSVWG) working group for the continued evolvement of SCTP as a
+general purpose transport.
-See the IETF website (http://www.ietf.org) for further documents on SCTP.
-See http://www.ietf.org/rfc/rfc2960.txt
+See the IETF website (http://www.ietf.org) for further documents on SCTP.
+See http://www.ietf.org/rfc/rfc2960.txt
The initial project goal is to create an Linux kernel reference implementation
-of SCTP that is RFC 2960 compliant and provides an programming interface
-referred to as the UDP-style API of the Sockets Extensions for SCTP, as
-proposed in IETF Internet-Drafts.
+of SCTP that is RFC 2960 compliant and provides a programming interface
+referred to as the UDP-style API of the Sockets Extensions for SCTP, as
+proposed in IETF Internet-Drafts.
-Caveats:
+Caveats
+=======
--lksctp can be built as statically or as a module. However, be aware that
-module removal of lksctp is not yet a safe activity.
+- lksctp can be built statically or as a module. However, be aware that
+ module removal of lksctp is not yet a safe activity.
--There is tentative support for IPv6, but most work has gone towards
-implementation and testing lksctp on IPv4.
+- There is tentative support for IPv6, but most work has gone towards
+ implementation and testing lksctp on IPv4.
For more information, please visit the lksctp project website:
+
http://www.sf.net/projects/lksctp
Or contact the lksctp developers through the mailing list:
+
<linux-sctp@vger.kernel.org>
diff --git a/Documentation/networking/secid.txt b/Documentation/networking/secid.rst
index 95ea06784333..b45141a98027 100644
--- a/Documentation/networking/secid.txt
+++ b/Documentation/networking/secid.rst
@@ -1,3 +1,9 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+LSM/SELinux secid
+=================
+
flowi structure:
The secid member in the flow structure is used in LSMs (e.g. SELinux) to indicate
diff --git a/Documentation/networking/seg6-sysctl.rst b/Documentation/networking/seg6-sysctl.rst
new file mode 100644
index 000000000000..07c20e470baf
--- /dev/null
+++ b/Documentation/networking/seg6-sysctl.rst
@@ -0,0 +1,39 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Seg6 Sysctl variables
+=====================
+
+
+/proc/sys/net/ipv6/conf/<iface>/seg6_* variables:
+=================================================
+
+seg6_enabled - BOOL
+ Accept or drop SR-enabled IPv6 packets on this interface.
+
+ Relevant packets are those with SRH present and DA = local.
+
+ * 0 - disabled (default)
+ * not 0 - enabled
+
+seg6_require_hmac - INTEGER
+ Define HMAC policy for ingress SR-enabled packets on this interface.
+
+ * -1 - Ignore HMAC field
+ * 0 - Accept SR packets without HMAC, validate SR packets with HMAC
+ * 1 - Drop SR packets without HMAC, validate SR packets with HMAC
+
+ Default is 0.
+
+seg6_flowlabel - INTEGER
+ Controls the behaviour of computing the flowlabel of outer
+ IPv6 header in case of SR T.encaps
+
+ == =======================================================
+ -1 set flowlabel to zero.
+ 0 copy flowlabel from Inner packet in case of Inner IPv6
+ (Set flowlabel to 0 in case IPv4/L2)
+ 1 Compute the flowlabel using seg6_make_flowlabel()
+ == =======================================================
+
+ Default is 0.
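The three `seg6_require_hmac` policies documented above can be summarised as a small decision function. This is only a sketch of the documented semantics, not kernel code; the function name and the returned strings are hypothetical.

```python
def seg6_hmac_action(policy, has_hmac):
    """Model the documented seg6_require_hmac policy for one ingress packet.

    Hypothetical illustration: returns "accept", "validate" or "drop".
    """
    if policy == -1:
        # -1: the HMAC field is ignored, so the packet is never validated.
        return "accept"
    if policy not in (0, 1):
        raise ValueError("seg6_require_hmac must be -1, 0 or 1")
    if has_hmac:
        # 0 and 1: SR packets carrying an HMAC are validated.
        return "validate"
    # Without an HMAC: 0 accepts the packet, 1 drops it.
    return "accept" if policy == 0 else "drop"
```

With the default policy of 0, a packet without an HMAC is accepted and a packet with one is validated.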
diff --git a/Documentation/networking/seg6-sysctl.txt b/Documentation/networking/seg6-sysctl.txt
deleted file mode 100644
index bdbde23b19cb..000000000000
--- a/Documentation/networking/seg6-sysctl.txt
+++ /dev/null
@@ -1,18 +0,0 @@
-/proc/sys/net/conf/<iface>/seg6_* variables:
-
-seg6_enabled - BOOL
- Accept or drop SR-enabled IPv6 packets on this interface.
-
- Relevant packets are those with SRH present and DA = local.
-
- 0 - disabled (default)
- not 0 - enabled
-
-seg6_require_hmac - INTEGER
- Define HMAC policy for ingress SR-enabled packets on this interface.
-
- -1 - Ignore HMAC field
- 0 - Accept SR packets without HMAC, validate SR packets with HMAC
- 1 - Drop SR packets without HMAC, validate SR packets with HMAC
-
- Default is 0.
diff --git a/Documentation/networking/sfp-phylink.rst b/Documentation/networking/sfp-phylink.rst
index d753a309f9d1..55b65f607a64 100644
--- a/Documentation/networking/sfp-phylink.rst
+++ b/Documentation/networking/sfp-phylink.rst
@@ -74,10 +74,13 @@ phylib to the sfp/phylink support. Please send patches to improve
this documentation.
1. Optionally split the network driver's phylib update function into
- three parts dealing with link-down, link-up and reconfiguring the
- MAC settings. This can be done as a separate preparation commit.
+ two parts dealing with link-down and link-up. This can be done as
+ a separate preparation commit.
- An example of this preparation can be found in git commit fc548b991fb0.
+ An older example of this preparation can be found in git commit
+ fc548b991fb0, although this was splitting into three parts; the
+ link-up part now includes configuring the MAC for the link settings.
+ Please see :c:func:`mac_link_up` for more information on this.
2. Replace::
@@ -135,32 +138,32 @@ this documentation.
.. code-block:: c
- static int foo_ethtool_set_link_ksettings(struct net_device *dev,
- const struct ethtool_link_ksettings *cmd)
- {
- struct foo_priv *priv = netdev_priv(dev);
-
- return phylink_ethtool_ksettings_set(priv->phylink, cmd);
- }
-
- static int foo_ethtool_get_link_ksettings(struct net_device *dev,
- struct ethtool_link_ksettings *cmd)
- {
- struct foo_priv *priv = netdev_priv(dev);
+ static int foo_ethtool_set_link_ksettings(struct net_device *dev,
+ const struct ethtool_link_ksettings *cmd)
+ {
+ struct foo_priv *priv = netdev_priv(dev);
+
+ return phylink_ethtool_ksettings_set(priv->phylink, cmd);
+ }
- return phylink_ethtool_ksettings_get(priv->phylink, cmd);
- }
+ static int foo_ethtool_get_link_ksettings(struct net_device *dev,
+ struct ethtool_link_ksettings *cmd)
+ {
+ struct foo_priv *priv = netdev_priv(dev);
+
+ return phylink_ethtool_ksettings_get(priv->phylink, cmd);
+ }
-7. Replace the call to:
+7. Replace the call to::
phy_dev = of_phy_connect(dev, node, link_func, flags, phy_interface);
- and associated code with a call to:
+ and associated code with a call to::
err = phylink_of_phy_connect(priv->phylink, node, flags);
For the most part, ``flags`` can be zero; these flags are passed to
- the of_phy_attach() inside this function call if a PHY is specified
+ the phy_attach_direct() inside this function call if a PHY is specified
in the DT node ``node``.
``node`` should be the DT node which contains the network phy property,
@@ -200,20 +203,28 @@ this documentation.
The :c:func:`validate` method should mask the supplied supported mask,
and ``state->advertising`` with the supported ethtool link modes.
These are the new ethtool link modes, so bitmask operations must be
- used. For an example, see drivers/net/ethernet/marvell/mvneta.c.
+ used. For an example, see ``drivers/net/ethernet/marvell/mvneta.c``.
The :c:func:`mac_link_state` method is used to read the link state
from the MAC, and report back the settings that the MAC is currently
using. This is particularly important for in-band negotiation
methods such as 1000base-X and SGMII.
+ The :c:func:`mac_link_up` method is used to inform the MAC that the
+ link has come up. The call includes the negotiation mode and interface
+ for reference only. The finalised link parameters are also supplied
+ (speed, duplex and flow control/pause enablement settings) which
+ should be used to configure the MAC when the MAC and PCS are not
+ tightly integrated, or when the settings are not coming from in-band
+ negotiation.
+
The :c:func:`mac_config` method is used to update the MAC with the
requested state, and must avoid unnecessarily taking the link down
when making changes to the MAC configuration. This means the
function should modify the state and only take the link down when
absolutely necessary to change the MAC configuration. An example
of how to do this can be found in :c:func:`mvneta_mac_config` in
- drivers/net/ethernet/marvell/mvneta.c.
+ ``drivers/net/ethernet/marvell/mvneta.c``.
For further information on these methods, please see the inline
documentation in :c:type:`struct phylink_mac_ops <phylink_mac_ops>`.
@@ -270,4 +281,4 @@ as necessary.
For information describing the SFP cage in DT, please see the binding
documentation in the kernel source tree
-``Documentation/devicetree/bindings/net/sff,sfp.txt``
+``Documentation/devicetree/bindings/net/sff,sfp.yaml``.
diff --git a/Documentation/networking/skbuff.rst b/Documentation/networking/skbuff.rst
new file mode 100644
index 000000000000..5b74275a73a3
--- /dev/null
+++ b/Documentation/networking/skbuff.rst
@@ -0,0 +1,37 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+struct sk_buff
+==============
+
+:c:type:`sk_buff` is the main networking structure representing
+a packet.
+
+Basic sk_buff geometry
+----------------------
+
+.. kernel-doc:: include/linux/skbuff.h
+ :doc: Basic sk_buff geometry
+
+Shared skbs and skb clones
+--------------------------
+
+:c:member:`sk_buff.users` is a simple refcount allowing multiple entities
+to keep a struct sk_buff alive. skbs with ``sk_buff.users != 1`` are referred
+to as shared skbs (see skb_shared()).
+
+skb_clone() allows for fast duplication of skbs. None of the data buffers
+get copied, but the caller gets a new metadata struct (struct sk_buff).
+&skb_shared_info.refcount indicates the number of skbs pointing at the same
+packet data (i.e. clones).
+
+dataref and headerless skbs
+---------------------------
+
+.. kernel-doc:: include/linux/skbuff.h
+ :doc: dataref and headerless skbs
+
+Checksum information
+--------------------
+
+.. kernel-doc:: include/linux/skbuff.h
+ :doc: skb checksums
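The difference between a shared skb and a clone, as described in the new skbuff.rst, can be pictured with a toy model. This is a loose user-space sketch, not the kernel implementation; the classes are hypothetical, and the field name `dataref` is an assumption used here for the shared-data refcount the text describes.

```python
class PacketData:
    """Stand-in for the packet data buffer and its shared refcount."""

    def __init__(self, payload):
        self.payload = payload
        self.dataref = 1  # number of skbs pointing at this data


class Skb:
    """Stand-in for the metadata struct (struct sk_buff)."""

    def __init__(self, data):
        self.data = data
        self.users = 1  # entities keeping this metadata struct alive


def skb_get(skb):
    # Take another reference on the same metadata struct.
    skb.users += 1
    return skb


def skb_shared(skb):
    # A shared skb is one whose users count is not exactly 1.
    return skb.users != 1


def skb_clone(skb):
    # A clone gets new metadata but points at the same data buffer.
    skb.data.dataref += 1
    return Skb(skb.data)
```

In this model, `skb_get()` makes an skb shared without touching the data refcount, while `skb_clone()` returns a distinct, unshared metadata object over the same payload.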
diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
new file mode 100644
index 000000000000..6d8acdbe9be1
--- /dev/null
+++ b/Documentation/networking/smc-sysctl.rst
@@ -0,0 +1,61 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========
+SMC Sysctl
+==========
+
+/proc/sys/net/smc/* Variables
+=============================
+
+autocorking_size - INTEGER
+ Setting SMC auto corking size:
+ SMC auto corking is like TCP auto corking from the application's
+ point of view. When applications do consecutive small
+ write()/sendmsg() system calls, we try to coalesce these small writes
+ as much as possible, to lower the total number of CDC and RDMA Write
+ messages sent.
+ autocorking_size limits the maximum corked bytes that can be sent to
+ the underlying device in a single send. If set to 0, SMC auto corking
+ is disabled.
+ Applications can still use TCP_CORK for optimal behavior when they
+ know how/when to uncork their sockets.
+
+ Default: 64K
+
+smcr_buf_type - INTEGER
+ Controls which type of sndbufs and RMBs to use in newly created
+ SMC-R link groups. Only for SMC-R.
+
+ Default: 0 (physically contiguous sndbufs and RMBs)
+
+ Possible values:
+
+ - 0 - Use physically contiguous buffers
+ - 1 - Use virtually contiguous buffers
+ - 2 - Mixed use of the two types. Try physically contiguous buffers first.
+ If not available, use virtually contiguous buffers then.
+
+smcr_testlink_time - INTEGER
+ How frequently an SMC-R link sends out TEST_LINK LLC messages to confirm
+ viability, after the last activity of connections on it. A value of 0
+ disables TEST_LINK.
+
+ Default: 30 seconds.
+
+wmem - INTEGER
+ Initial size of send buffer used by SMC sockets.
+ The default value inherits from net.ipv4.tcp_wmem[1].
+
+ The minimum value is 16KiB and there is no hard limit on the maximum value,
+ but at most 512KiB is allowed for SMC-R and 1MiB for SMC-D.
+
+ Default: 16K
+
+rmem - INTEGER
+ Initial size of receive buffer (RMB) used by SMC sockets.
+ The default value inherits from net.ipv4.tcp_rmem[1].
+
+ The minimum value is 16KiB and there is no hard limit on the maximum value,
+ but at most 512KiB is allowed for SMC-R and 1MiB for SMC-D.
+
+ Default: 128K
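The SMC variables documented above are plain integer files, so they can be collected with a few lines of user-space code. The sketch below is an illustration only; the function names are hypothetical, and the base directory is a parameter so the code can be exercised against any directory of integer files.

```python
import os

# Meanings of smcr_buf_type as listed in the documentation above.
SMCR_BUF_TYPES = {
    0: "physically contiguous buffers",
    1: "virtually contiguous buffers",
    2: "mixed: physically contiguous first, then virtually contiguous",
}


def read_smc_sysctls(base="/proc/sys/net/smc"):
    """Return {name: int} for every readable integer file under base."""
    values = {}
    for name in sorted(os.listdir(base)):
        path = os.path.join(base, name)
        try:
            with open(path) as handle:
                values[name] = int(handle.read().strip())
        except (OSError, ValueError):
            # Skip subdirectories, unreadable files and non-integer content.
            continue
    return values


def describe_buf_type(value):
    """Translate an smcr_buf_type value into the documented description."""
    return SMCR_BUF_TYPES.get(value, "unknown")
```

On a system without the SMC module loaded the default directory may not exist, in which case `os.listdir()` raises `FileNotFoundError`.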
diff --git a/Documentation/networking/snmp_counter.rst b/Documentation/networking/snmp_counter.rst
index 38a4edc4522b..423d138b5ff3 100644
--- a/Documentation/networking/snmp_counter.rst
+++ b/Documentation/networking/snmp_counter.rst
@@ -314,7 +314,7 @@ https://lwn.net/Articles/576263/
* TcpExtTCPOrigDataSent
This counter is explained by `kernel commit f19c29e3e391`_, I pasted the
-explaination below::
+explanation below::
TCPOrigDataSent: number of outgoing packets with original data (excluding
retransmission but including data-in-SYN). This counter is different from
@@ -324,7 +324,7 @@ explaination below::
* TCPSynRetrans
This counter is explained by `kernel commit f19c29e3e391`_, I pasted the
-explaination below::
+explanation below::
TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down
retransmissions into SYN, fast-retransmits, timeout retransmits, etc.
@@ -332,7 +332,7 @@ explaination below::
* TCPFastOpenActiveFail
This counter is explained by `kernel commit f19c29e3e391`_, I pasted the
-explaination below::
+explanation below::
TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed because
the remote does not accept it or the attempts timed out.
@@ -382,7 +382,7 @@ Defined in `RFC1213 tcpAttemptFails`_.
Defined in `RFC1213 tcpOutRsts`_. The RFC says this counter indicates
the 'segments sent containing the RST flag', but in linux kernel, this
-couner indicates the segments kerenl tried to send. The sending
+counter indicates the segments kernel tried to send. The sending
process might be failed due to some errors (e.g. memory alloc failed).
.. _RFC1213 tcpOutRsts: https://tools.ietf.org/html/rfc1213#page-52
@@ -700,7 +700,7 @@ SACK option could have up to 4 blocks, they are checked
individually. E.g., if 3 blocks of a SACk is invalid, the
corresponding counter would be updated 3 times. The comment of the
`Add counters for discarded SACK blocks`_ patch has additional
-explaination:
+explanation:
.. _Add counters for discarded SACK blocks: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=18f02545a9a16c9a89778b91a162ad16d510bb32
@@ -792,7 +792,7 @@ counters to indicate the ACK is skipped in which scenario. The ACK
would only be skipped if the received packet is either a SYN packet or
it has no data.
-.. _sysctl document: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
+.. _sysctl document: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.rst
* TcpExtTCPACKSkippedSynRecv
@@ -829,7 +829,7 @@ PAWS check fails or the received sequence number is out of window.
* TcpExtTCPACKSkippedTimeWait
-Tha ACK is skipped in Time-Wait status, the reason would be either
+The ACK is skipped in Time-Wait status, the reason would be either
PAWS check failed or the received sequence number is out of window.
* TcpExtTCPACKSkippedChallenge
@@ -908,8 +908,8 @@ A TLP probe packet is sent.
A packet loss is detected and recovered by TLP.
-TCP Fast Open
-=============
+TCP Fast Open description
+=========================
TCP Fast Open is a technology which allows data transfer before the
3-way handshake complete. Please refer the `TCP Fast Open wiki`_ for a
general description.
@@ -984,7 +984,7 @@ TcpExtSyncookiesRecv counter wont be updated.
Challenge ACK
=============
-For details of challenge ACK, please refer the explaination of
+For details of challenge ACK, please refer the explanation of
TcpExtTCPACKSkippedChallenge.
* TcpExtTCPChallengeACK
@@ -1002,7 +1002,7 @@ prune
=====
When a socket is under memory pressure, the TCP stack will try to
reclaim memory from the receiving queue and out of order queue. One of
-the reclaiming method is 'collapse', which means allocate a big sbk,
+the reclaiming method is 'collapse', which means allocate a big skb,
copy the contiguous skbs to the single big skb, and free these
contiguous skbs.
@@ -1163,7 +1163,7 @@ The server side nstat output::
IpExtOutOctets 52 0.0
IpExtInNoECTPkts 1 0.0
-Input a string in nc client side again ('world' in our exmaple)::
+Input a string in nc client side again ('world' in our example)::
nstatuser@nstat-a:~$ nc -v nstat-b 9000
Connection to nstat-b 9000 port [tcp/*] succeeded!
@@ -1211,7 +1211,7 @@ replied an ACK. But kernel handled them in different ways. When the
TCP window scale option is not used, kernel will try to enable fast
path immediately when the connection comes into the established state,
but if the TCP window scale option is used, kernel will disable the
-fast path at first, and try to enable it after kerenl receives
+fast path at first, and try to enable it after kernel receives
packets. We could use the 'ss' command to verify whether the window
scale option is used. e.g. run below command on either server or
client::
@@ -1343,7 +1343,7 @@ Check TcpExtTCPAbortOnMemory on client::
nstatuser@nstat-a:~$ nstat | grep -i abort
TcpExtTCPAbortOnMemory 54 0.0
-Check orphane socket count on client::
+Check orphaned socket count on client::
nstatuser@nstat-a:~$ ss -s
Total: 131 (kernel 0)
@@ -1685,7 +1685,7 @@ Send 3 SYN repeatly to nstat-b::
nstatuser@nstat-a:~$ for i in {1..3}; do sudo tcpreplay -i ens3 /tmp/syn_fixcsum.pcap; done
-Check snmp cunter on nstat-b::
+Check snmp counter on nstat-b::
nstatuser@nstat-b:~$ nstat | grep -i skip
TcpExtTCPACKSkippedSynRecv 1 0.0
@@ -1770,7 +1770,7 @@ string 'foo' in our example::
Connection from nstat-a 42132 received!
foo
-On nstat-a, the tcpdump should have caputred the ACK. We should check
+On nstat-a, the tcpdump should have captured the ACK. We should check
the source port numbers of the two nc clients::
nstatuser@nstat-a:~$ ss -ta '( dport = :9000 || dport = :9001 )' | tee
@@ -1778,7 +1778,7 @@ the source port numbers of the two nc clients::
ESTAB 0 0 192.168.122.250:50208 192.168.122.251:9000
ESTAB 0 0 192.168.122.250:42132 192.168.122.251:9001
-Run tcprewrite, change port 9001 to port 9000, chagne port 42132 to
+Run tcprewrite, change port 9001 to port 9000, change port 42132 to
port 50208::
nstatuser@nstat-a:~$ tcprewrite --infile /tmp/seq_pre.pcap --outfile /tmp/seq.pcap -r 9001:9000 -r 42132:50208 --fixcsum
diff --git a/Documentation/networking/statistics.rst b/Documentation/networking/statistics.rst
new file mode 100644
index 000000000000..c9aeb70dafa2
--- /dev/null
+++ b/Documentation/networking/statistics.rst
@@ -0,0 +1,220 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+Interface statistics
+====================
+
+Overview
+========
+
+This document is a guide to Linux network interface statistics.
+
+There are three main sources of interface statistics in Linux:
+
+ - standard interface statistics based on
+ :c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>`;
+ - protocol-specific statistics; and
+ - driver-defined statistics available via ethtool.
+
+Standard interface statistics
+-----------------------------
+
+There are multiple interfaces to reach the standard statistics.
+Most commonly used is the `ip` command from `iproute2`::
+
+ $ ip -s -s link show dev ens4u1u1
+ 6: ens4u1u1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
+ link/ether 48:2a:e3:4c:b1:d1 brd ff:ff:ff:ff:ff:ff
+ RX: bytes packets errors dropped overrun mcast
+ 74327665117 69016965 0 0 0 0
+ RX errors: length crc frame fifo missed
+ 0 0 0 0 0
+ TX: bytes packets errors dropped carrier collsns
+ 21405556176 44608960 0 0 0 0
+ TX errors: aborted fifo window heartbeat transns
+ 0 0 0 0 128
+ altname enp58s0u1u1
+
+Note that `-s` has been specified twice to see all members of
+:c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>`.
+If `-s` is specified once the detailed errors won't be shown.
+
+`ip` supports JSON formatting via the `-j` option.
+
+Protocol-specific statistics
+----------------------------
+
+Protocol-specific statistics are exposed via relevant interfaces,
+the same interfaces as are used to configure them.
+
+ethtool
+~~~~~~~
+
+Ethtool exposes common low-level statistics.
+All the standard statistics are expected to be maintained
+by the device, not the driver (as opposed to driver-defined stats
+described in the next section which mix software and hardware stats).
+For devices which contain unmanaged
+switches (e.g. legacy SR-IOV or multi-host NICs) the events counted
+may not pertain exclusively to the packets destined to
+the local host interface. In other words the events may
+be counted at the network port (MAC/PHY blocks) without separation
+for different host side (PCIe) devices. Such ambiguity must not
+be present when the internal switch is managed by Linux (so-called
+switchdev mode for NICs).
+
+Standard ethtool statistics can be accessed via the interfaces used
+for configuration. For example, the ethtool interface used
+to configure pause frames can report corresponding hardware counters::
+
+ $ ethtool --include-statistics -a eth0
+ Pause parameters for eth0:
+ Autonegotiate: on
+ RX: on
+ TX: on
+ Statistics:
+ tx_pause_frames: 1
+ rx_pause_frames: 1
+
+General Ethernet statistics not associated with any particular
+functionality are exposed via ``ethtool -S $ifc`` by specifying
+the ``--groups`` parameter::
+
+ $ ethtool -S eth0 --groups eth-phy eth-mac eth-ctrl rmon
+ Stats for eth0:
+ eth-phy-SymbolErrorDuringCarrier: 0
+ eth-mac-FramesTransmittedOK: 1
+ eth-mac-FrameTooLongErrors: 1
+ eth-ctrl-MACControlFramesTransmitted: 1
+ eth-ctrl-MACControlFramesReceived: 0
+ eth-ctrl-UnsupportedOpcodesReceived: 1
+ rmon-etherStatsUndersizePkts: 1
+ rmon-etherStatsJabbers: 0
+ rmon-rx-etherStatsPkts64Octets: 1
+ rmon-rx-etherStatsPkts65to127Octets: 0
+ rmon-rx-etherStatsPkts128to255Octets: 0
+ rmon-tx-etherStatsPkts64Octets: 2
+ rmon-tx-etherStatsPkts65to127Octets: 3
+ rmon-tx-etherStatsPkts128to255Octets: 0
+
+Driver-defined statistics
+-------------------------
+
+Driver-defined ethtool statistics can be dumped using `ethtool -S $ifc`, e.g.::
+
+ $ ethtool -S ens4u1u1
+ NIC statistics:
+ tx_single_collisions: 0
+ tx_multi_collisions: 0
+
+uAPIs
+=====
+
+procfs
+------
+
+The historical `/proc/net/dev` text interface gives access to the list
+of interfaces as well as their statistics.
+
+Note that even though this interface is using
+:c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>`
+internally it combines some of the fields.
+
+sysfs
+-----
+
+Each device directory in sysfs contains a `statistics` directory (e.g.
+`/sys/class/net/lo/statistics/`) with files corresponding to
+members of :c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>`.
+
+This simple interface is convenient especially in constrained/embedded
+environments without access to tools. However, it's inefficient when
+reading multiple stats as it internally performs a full dump of
+:c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>`
+and reports only the stat corresponding to the accessed file.
+
+Sysfs files are documented in
+`Documentation/ABI/testing/sysfs-class-net-statistics`.
+
+
+netlink
+-------
+
+`rtnetlink` (`NETLINK_ROUTE`) is the preferred method of accessing
+:c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>` stats.
+
+Statistics are reported both in the responses to link information
+requests (`RTM_GETLINK`) and statistic requests (`RTM_GETSTATS`,
+when `IFLA_STATS_LINK_64` bit is set in the `.filter_mask` of the request).
+
+ethtool
+-------
+
+Ethtool IOCTL interface allows drivers to report implementation
+specific statistics. Historically it has also been used to report
+statistics for which other APIs did not exist, like per-device-queue
+statistics, or standard-based statistics (e.g. RFC 2863).
+
+Statistics and their string identifiers are retrieved separately.
+Identifiers via `ETHTOOL_GSTRINGS` with `string_set` set to `ETH_SS_STATS`,
+and values via `ETHTOOL_GSTATS`. User space should use `ETHTOOL_GDRVINFO`
+to retrieve the number of statistics (`.n_stats`).
+
+ethtool-netlink
+---------------
+
+Ethtool netlink is a replacement for the older IOCTL interface.
+
+Protocol-related statistics can be requested in get commands by setting
+the `ETHTOOL_FLAG_STATS` flag in `ETHTOOL_A_HEADER_FLAGS`. Currently
+statistics are supported in the following commands:
+
+ - `ETHTOOL_MSG_PAUSE_GET`
+ - `ETHTOOL_MSG_FEC_GET`
+
+debugfs
+-------
+
+Some drivers expose extra statistics via `debugfs`.
+
+struct rtnl_link_stats64
+========================
+
+.. kernel-doc:: include/uapi/linux/if_link.h
+ :identifiers: rtnl_link_stats64
+
+Notes for driver authors
+========================
+
+Drivers should report all statistics which have a matching member in
+:c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>` exclusively
+via `.ndo_get_stats64`. Reporting such standard stats via ethtool
+or debugfs will not be accepted.
+
+Drivers must ensure best possible compliance with
+:c:type:`struct rtnl_link_stats64 <rtnl_link_stats64>`.
+Please note for example that detailed error statistics must be
+added into the general `rx_errors` / `tx_errors` counters.
+
+The `.ndo_get_stats64` callback cannot sleep because of accesses
+via `/proc/net/dev`. If the driver may sleep when retrieving the statistics
+from the device, it should do so periodically and asynchronously, and only
+return a recent copy from `.ndo_get_stats64`. The ethtool interrupt coalescing
+interface allows setting the frequency of refreshing statistics, if needed.
+
+Retrieving ethtool statistics is a multi-syscall process; drivers are advised
+to keep the number of statistics constant to avoid race conditions with
+user space trying to read them.
+
+Statistics must persist across routine operations like bringing the interface
+down and up.
+
+Kernel-internal data structures
+-------------------------------
+
+The following structures are internal to the kernel; their members are
+translated to netlink attributes when dumped. Drivers must not overwrite
+the statistics they don't report with 0.
+
+- struct ethtool_pause_stats
+- struct ethtool_fec_stats
diff --git a/Documentation/networking/strparser.txt b/Documentation/networking/strparser.rst
index a7d354ddda7b..6cab1f74ae05 100644
--- a/Documentation/networking/strparser.txt
+++ b/Documentation/networking/strparser.rst
@@ -1,4 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
Stream Parser (strparser)
+=========================
Introduction
============
@@ -34,8 +38,10 @@ that is called when a full message has been completed.
Functions
=========
-strp_init(struct strparser *strp, struct sock *sk,
- const struct strp_callbacks *cb)
+ ::
+
+ strp_init(struct strparser *strp, struct sock *sk,
+ const struct strp_callbacks *cb)
Called to initialize a stream parser. strp is a struct of type
strparser that is allocated by the upper layer. sk is the TCP
@@ -43,31 +49,41 @@ strp_init(struct strparser *strp, struct sock *sk,
callback mode; in general mode this is set to NULL. Callbacks
are called by the stream parser (the callbacks are listed below).
-void strp_pause(struct strparser *strp)
+ ::
+
+ void strp_pause(struct strparser *strp)
Temporarily pause a stream parser. Message parsing is suspended
and no new messages are delivered to the upper layer.
-void strp_unpause(struct strparser *strp)
+ ::
+
+ void strp_unpause(struct strparser *strp)
Unpause a paused stream parser.
-void strp_stop(struct strparser *strp);
+ ::
+
+ void strp_stop(struct strparser *strp);
strp_stop is called to completely stop stream parser operations.
This is called internally when the stream parser encounters an
error, and it is called from the upper layer to stop parsing
operations.
-void strp_done(struct strparser *strp);
+ ::
+
+ void strp_done(struct strparser *strp);
strp_done is called to release any resources held by the stream
parser instance. This must be called after the stream processor
has been stopped.
-int strp_process(struct strparser *strp, struct sk_buff *orig_skb,
- unsigned int orig_offset, size_t orig_len,
- size_t max_msg_size, long timeo)
+ ::
+
+ int strp_process(struct strparser *strp, struct sk_buff *orig_skb,
+ unsigned int orig_offset, size_t orig_len,
+ size_t max_msg_size, long timeo)
strp_process is called in general mode for a stream parser to
parse an sk_buff. The number of bytes processed or a negative
@@ -75,7 +91,9 @@ int strp_process(struct strparser *strp, struct sk_buff *orig_skb,
consume the sk_buff. max_msg_size is maximum size the stream
parser will parse. timeo is timeout for completing a message.
-void strp_data_ready(struct strparser *strp);
+ ::
+
+ void strp_data_ready(struct strparser *strp);
The upper layer calls strp_data_ready when data is ready on
the lower socket for strparser to process. This should be called
@@ -83,7 +101,9 @@ void strp_data_ready(struct strparser *strp);
maximum messages size is the limit of the receive socket
buffer and message timeout is the receive timeout for the socket.
-void strp_check_rcv(struct strparser *strp);
+ ::
+
+ void strp_check_rcv(struct strparser *strp);
strp_check_rcv is called to check for new messages on the socket.
This is normally called at initialization of a stream parser
@@ -94,7 +114,9 @@ Callbacks
There are six callbacks:
-int (*parse_msg)(struct strparser *strp, struct sk_buff *skb);
+ ::
+
+ int (*parse_msg)(struct strparser *strp, struct sk_buff *skb);
parse_msg is called to determine the length of the next message
in the stream. The upper layer must implement this function. It
@@ -107,14 +129,16 @@ int (*parse_msg)(struct strparser *strp, struct sk_buff *skb);
The return values of this function are:
- >0 : indicates length of successfully parsed message
- 0 : indicates more data must be received to parse the message
- -ESTRPIPE : current message should not be processed by the
- kernel, return control of the socket to userspace which
- can proceed to read the messages itself
- other < 0 : Error in parsing, give control back to userspace
- assuming that synchronization is lost and the stream
- is unrecoverable (application expected to close TCP socket)
+ ========= ===========================================================
+ >0 indicates length of successfully parsed message
+ 0 indicates more data must be received to parse the message
+ -ESTRPIPE current message should not be processed by the
+ kernel, return control of the socket to userspace which
+ can proceed to read the messages itself
+ other < 0 Error in parsing, give control back to userspace
+ assuming that synchronization is lost and the stream
+ is unrecoverable (application expected to close TCP socket)
+ ========= ===========================================================
In the case that an error is returned (return value is less than
zero) and the parser is in receive callback mode, then it will set
@@ -123,7 +147,9 @@ int (*parse_msg)(struct strparser *strp, struct sk_buff *skb);
the current message, then the error set on the attached socket is
ENODATA since the stream is unrecoverable in that case.
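The return-value convention of parse_msg can be illustrated with a plain-buffer analogue. This is only a sketch: a real callback receives a `struct sk_buff` and runs in the kernel, whereas `demo_parse_msg`, the 2-byte big-endian length header and the payload limit below are all assumptions made up for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_HDR_LEN     2u     /* hypothetical length header, big-endian */
#define DEMO_MAX_PAYLOAD 1024u  /* hypothetical protocol limit */

/*
 * Determine the length of the next message at the start of `buf`.
 * Returns >0 (full message length including header), 0 if the header
 * is not complete yet, or a negative errno if the stream is invalid.
 */
int demo_parse_msg(const uint8_t *buf, size_t avail)
{
	size_t payload;

	if (avail < DEMO_HDR_LEN)
		return 0;		/* need more data to learn the length */

	payload = ((size_t)buf[0] << 8) | buf[1];
	if (payload == 0 || payload > DEMO_MAX_PAYLOAD)
		return -EPROTO;		/* synchronization is lost */

	/* The length is known even if the payload has not fully arrived. */
	return (int)(DEMO_HDR_LEN + payload);
}
```

Note that the function reports the full message length as soon as the header is readable; waiting for the remaining bytes is the parser's job, not the callback's.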
-void (*lock)(struct strparser *strp)
+ ::
+
+ void (*lock)(struct strparser *strp)
The lock callback is called to lock the strp structure when
the strparser is performing an asynchronous operation (such as
@@ -131,14 +157,18 @@ void (*lock)(struct strparser *strp)
function is to lock_sock for the associated socket. In general
mode the callback must be set appropriately.
-void (*unlock)(struct strparser *strp)
+ ::
+
+ void (*unlock)(struct strparser *strp)
The unlock callback is called to release the lock obtained
by the lock callback. In receive callback mode the default
function is release_sock for the associated socket. In general
mode the callback must be set appropriately.
-void (*rcv_msg)(struct strparser *strp, struct sk_buff *skb);
+ ::
+
+ void (*rcv_msg)(struct strparser *strp, struct sk_buff *skb);
rcv_msg is called when a full message has been received and
is queued. The callee must consume the sk_buff; it can
@@ -152,7 +182,9 @@ void (*rcv_msg)(struct strparser *strp, struct sk_buff *skb);
the length of the message. skb->len - offset may be greater
than full_len since strparser does not trim the skb.
-int (*read_sock_done)(struct strparser *strp, int err);
+ ::
+
+ int (*read_sock_done)(struct strparser *strp, int err);
read_sock_done is called when the stream parser is done reading
the TCP socket in receive callback mode. The stream parser may
@@ -160,7 +192,9 @@ int (*read_sock_done)(struct strparser *strp, int err);
to occur when exiting the loop. If the callback is not set (NULL
in strp_init) a default function is used.
-void (*abort_parser)(struct strparser *strp, int err);
+ ::
+
+ void (*abort_parser)(struct strparser *strp, int err);
This function is called when stream parser encounters an error
in parsing. The default function stops the stream parser and
@@ -204,4 +238,3 @@ Author
======
Tom Herbert (tom@quantonium.net)
-
diff --git a/Documentation/networking/switchdev.txt b/Documentation/networking/switchdev.rst
index 86174ce8cd13..758f1dae3fce 100644
--- a/Documentation/networking/switchdev.txt
+++ b/Documentation/networking/switchdev.rst
@@ -1,7 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+.. _switchdev:
+
+===============================================
Ethernet switch device driver model (switchdev)
===============================================
-Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
-Copyright (c) 2014-2015 Scott Feldman <sfeldma@gmail.com>
+
+Copyright |copy| 2014 Jiri Pirko <jiri@resnulli.us>
+
+Copyright |copy| 2014-2015 Scott Feldman <sfeldma@gmail.com>
The Ethernet switch device driver model (switchdev) is an in-kernel driver
@@ -12,53 +19,57 @@ Figure 1 is a block diagram showing the components of the switchdev model for
an example setup using a data-center-class switch ASIC chip. Other setups
with SR-IOV or soft switches, such as OVS, are possible.
+::
- User-space tools
+
+ User-space tools
user space |
+-------------------------------------------------------------------+
kernel | Netlink
- |
- +--------------+-------------------------------+
- | Network stack |
- | (Linux) |
- | |
- +----------------------------------------------+
-
- sw1p2 sw1p4 sw1p6
- sw1p1 + sw1p3 + sw1p5 + eth1
- + | + | + | +
- | | | | | | |
- +--+----+----+----+----+----+---+ +-----+-----+
- | Switch driver | | mgmt |
- | (this document) | | driver |
- | | | |
- +--------------+----------------+ +-----------+
- |
+ |
+ +--------------+-------------------------------+
+ | Network stack |
+ | (Linux) |
+ | |
+ +----------------------------------------------+
+
+ sw1p2 sw1p4 sw1p6
+ sw1p1 + sw1p3 + sw1p5 + eth1
+ + | + | + | +
+ | | | | | | |
+ +--+----+----+----+----+----+---+ +-----+-----+
+ | Switch driver | | mgmt |
+ | (this document) | | driver |
+ | | | |
+ +--------------+----------------+ +-----------+
+ |
kernel | HW bus (eg PCI)
+-------------------------------------------------------------------+
hardware |
- +--------------+----------------+
- | Switch device (sw1) |
- | +----+ +--------+
- | | v offloaded data path | mgmt port
- | | | |
- +--|----|----+----+----+----+---+
- | | | | | |
- + + + + + +
- p1 p2 p3 p4 p5 p6
+ +--------------+----------------+
+ | Switch device (sw1) |
+ | +----+ +--------+
+ | | v offloaded data path | mgmt port
+ | | | |
+ +--|----|----+----+----+----+---+
+ | | | | | |
+ + + + + + +
+ p1 p2 p3 p4 p5 p6
- front-panel ports
+ front-panel ports
- Fig 1.
+ Fig 1.
Include Files
-------------
-#include <linux/netdevice.h>
-#include <net/switchdev.h>
+::
+
+ #include <linux/netdevice.h>
+ #include <net/switchdev.h>
Configuration
@@ -114,10 +125,10 @@ Using port PHYS name (ndo_get_phys_port_name) for the key is particularly
useful for dynamically-named ports where the device names its ports based on
external configuration. For example, if a physical 40G port is split logically
into 4 10G ports, resulting in 4 port netdevs, the device can give a unique
-name for each port using port PHYS name. The udev rule would be:
+name for each port using port PHYS name. The udev rule would be::
-SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \
- ATTR{phys_port_name}!="", NAME="swX$attr{phys_port_name}"
+ SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \
+ ATTR{phys_port_name}!="", NAME="swX$attr{phys_port_name}"
Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y
is the port name or ID, and Z is the sub-port name or ID. For example, sw1p1s0
@@ -149,7 +160,7 @@ tools such as iproute2.
The switchdev driver can know a particular port's position in the topology by
monitoring NETDEV_CHANGEUPPER notifications. For example, a port moved into a
-bond will see it's upper master change. If that bond is moved into a bridge,
+bond will see its upper master change. If that bond is moved into a bridge,
the bond's upper master will change. And so on. The driver will track such
movements to know what position a port is in in the overall topology by
registering for netdevice events and acting on NETDEV_CHANGEUPPER.
@@ -171,21 +182,44 @@ To offloading L2 bridging, the switchdev driver/device should support:
Static FDB Entries
^^^^^^^^^^^^^^^^^^
-The switchdev driver should implement ndo_fdb_add, ndo_fdb_del and ndo_fdb_dump
-to support static FDB entries installed to the device. Static bridge FDB
-entries are installed, for example, using iproute2 bridge cmd:
+A driver which implements the ``ndo_fdb_add``, ``ndo_fdb_del`` and
+``ndo_fdb_dump`` operations is able to support the command below, which adds a
+static bridge FDB entry::
+
+ bridge fdb add dev DEV ADDRESS [vlan VID] [self] static
+
+(the "static" keyword is non-optional: if not specified, the entry defaults to
+being "local", which means that it should not be forwarded)
- bridge fdb add ADDR dev DEV [vlan VID] [self]
+The "self" keyword (optional because it is implicit) has the role of
+instructing the kernel to fulfill the operation through the ``ndo_fdb_add``
+implementation of the ``DEV`` device itself. If ``DEV`` is a bridge port, this
+will bypass the bridge and therefore leave the software database out of sync
+with the hardware one.
-The driver should use the helper switchdev_port_fdb_xxx ops for ndo_fdb_xxx
-ops, and handle add/delete/dump of SWITCHDEV_OBJ_ID_PORT_FDB object using
-switchdev_port_obj_xxx ops.
+To avoid this, the "master" keyword can be used::
-XXX: what should be done if offloading this rule to hardware fails (for
-example, due to full capacity in hardware tables) ?
+ bridge fdb add dev DEV ADDRESS [vlan VID] master static
+
+The above command instructs the kernel to search for a master interface of
+``DEV`` and fulfill the operation through the ``ndo_fdb_add`` method of that.
+This time, the bridge generates a ``SWITCHDEV_FDB_ADD_TO_DEVICE`` notification
+which the port driver can handle and use to program its hardware table. This
+way, the software and the hardware database will both contain this static FDB
+entry.
+
+Note: for new switchdev drivers that offload the Linux bridge, implementing the
+``ndo_fdb_add`` and ``ndo_fdb_del`` bridge bypass methods is strongly
+discouraged: all static FDB entries should be added on a bridge port using the
+"master" flag. The ``ndo_fdb_dump`` is an exception and can be implemented to
+visualize the hardware tables, if the device does not have an interrupt for
+notifying the operating system of newly learned/forgotten dynamic FDB
+addresses. In that case, the hardware FDB might end up having entries that the
+software FDB does not, and implementing ``ndo_fdb_dump`` is the only way to see
+them.
Note: by default, the bridge does not filter on VLAN and only bridges untagged
-traffic. To enable VLAN support, turn on VLAN filtering:
+traffic. To enable VLAN support, turn on VLAN filtering::
echo 1 >/sys/class/net/<bridge>/bridge/vlan_filtering
@@ -194,7 +228,7 @@ Notification of Learned/Forgotten Source MAC/VLANs
The switch device will learn/forget source MAC address/VLAN on ingress packets
and notify the switch driver of the mac/vlan/port tuples. The switch driver,
-in turn, will notify the bridge driver using the switchdev notifier call:
+in turn, will notify the bridge driver using the switchdev notifier call::
err = call_switchdev_notifiers(val, dev, info, extack);
@@ -202,7 +236,7 @@ Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when
forgetting, and info points to a struct switchdev_notifier_fdb_info. On
SWITCHDEV_FDB_ADD, the bridge driver will install the FDB entry into the
bridge's FDB and mark the entry as NTF_EXT_LEARNED. The iproute2 bridge
-command will label these entries "offload":
+command will label these entries "offload"::
$ bridge fdb
52:54:00:12:35:01 dev sw1p1 master br0 permanent
@@ -219,11 +253,11 @@ command will label these entries "offload":
01:00:5e:00:00:01 dev br0 self permanent
33:33:ff:12:35:01 dev br0 self permanent
-Learning on the port should be disabled on the bridge using the bridge command:
+Learning on the port should be disabled on the bridge using the bridge command::
bridge link set dev DEV learning off
-Learning on the device port should be enabled, as well as learning_sync:
+Learning on the device port should be enabled, as well as learning_sync::
bridge link set dev DEV learning on self
bridge link set dev DEV learning_sync on self
@@ -314,12 +348,16 @@ forwards the packet to the matching FIB entry's nexthop(s) egress ports.
To program the device, the driver has to register a FIB notifier handler
using register_fib_notifier. The following events are available:
-FIB_EVENT_ENTRY_ADD: used for both adding a new FIB entry to the device,
- or modifying an existing entry on the device.
-FIB_EVENT_ENTRY_DEL: used for removing a FIB entry
-FIB_EVENT_RULE_ADD, FIB_EVENT_RULE_DEL: used to propagate FIB rule changes
-FIB_EVENT_ENTRY_ADD and FIB_EVENT_ENTRY_DEL events pass:
+=================== ===================================================
+FIB_EVENT_ENTRY_ADD used for both adding a new FIB entry to the device,
+ or modifying an existing entry on the device.
+FIB_EVENT_ENTRY_DEL used for removing a FIB entry
+FIB_EVENT_RULE_ADD,
+FIB_EVENT_RULE_DEL used to propagate FIB rule changes
+=================== ===================================================
+
+FIB_EVENT_ENTRY_ADD and FIB_EVENT_ENTRY_DEL events pass::
struct fib_entry_notifier_info {
struct fib_notifier_info info; /* must be first */
@@ -332,12 +370,12 @@ FIB_EVENT_ENTRY_ADD and FIB_EVENT_ENTRY_DEL events pass:
u32 nlflags;
};
-to add/modify/delete IPv4 dst/dest_len prefix on table tb_id. The *fi
-structure holds details on the route and route's nexthops. *dev is one of the
-port netdevs mentioned in the route's next hop list.
+to add/modify/delete IPv4 dst/dest_len prefix on table tb_id. The ``*fi``
+structure holds details on the route and route's nexthops. ``*dev`` is one
+of the port netdevs mentioned in the route's next hop list.
Routes offloaded to the device are labeled with "offload" in the ip route
-listing:
+listing::
$ ip route show
default via 192.168.0.2 dev eth0
@@ -371,3 +409,156 @@ The driver can monitor for updates to arp_tbl using the netevent notifier
NETEVENT_NEIGH_UPDATE. The device can be programmed with resolved nexthops
for the routes as arp_tbl updates. The driver implements ndo_neigh_destroy
to know when arp_tbl neighbor entries are purged from the port.
+
+Device driver expected behavior
+-------------------------------
+
+Below is a set of defined behaviors that switchdev-enabled network devices
+must adhere to.
+
+Configuration-less state
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Upon driver bring up, the network devices must be fully operational, and the
+backing driver must configure the network device such that it is possible to
+send and receive traffic to this network device and it is properly separated
+from other network devices/ports (e.g.: as is frequent with a switch ASIC). How
+this is achieved is heavily hardware dependent, but a simple solution can be to
+use per-port VLAN identifiers unless a better mechanism is available
+(proprietary metadata for each network port for instance).
+
+The network device must be capable of running a full IP protocol stack
+including multicast, DHCP, IPv4/6, etc. If necessary, it should program the
+appropriate filters for VLAN, multicast, unicast etc. The underlying device
+driver must effectively be configured in a similar fashion to what it would do
+when IGMP snooping is enabled for IP multicast over these switchdev network
+devices and unsolicited multicast must be filtered as early as possible in
+the hardware.
+
+When configuring VLANs on top of the network device, all VLANs must be working,
+irrespective of the state of other network devices (e.g.: other ports being part
+of a VLAN-aware bridge doing ingress VID checking). See below for details.
+
+If the device implements, e.g., VLAN filtering, putting the interface in
+promiscuous mode should allow the reception of all VLAN tags (including those
+not present in the filter(s)).
+
+Bridged switch ports
+^^^^^^^^^^^^^^^^^^^^
+
+When a switchdev enabled network device is added as a bridge member, it should
+not disrupt any functionality of non-bridged network devices and they
+should continue to behave as normal network devices. Depending on the bridge
+configuration knobs below, the expected behavior is documented.
+
+Bridge VLAN filtering
+^^^^^^^^^^^^^^^^^^^^^
+
+The Linux bridge allows the configuration of a VLAN filtering mode (statically,
+at device creation time, and dynamically, during run time) which must be
+observed by the underlying switchdev network device/hardware:
+
+- with VLAN filtering turned off: the bridge is strictly VLAN unaware and its
+ data path will process all Ethernet frames as if they are VLAN-untagged.
+ The bridge VLAN database can still be modified, but the modifications should
+ have no effect while VLAN filtering is turned off. Frames ingressing the
+ device with a VID that is not programmed into the bridge/switch's VLAN table
+ must be forwarded and may be processed using a VLAN device (see below).
+
+- with VLAN filtering turned on: the bridge is VLAN-aware and frames ingressing
+ the device with a VID that is not programmed into the bridge/switch's VLAN
+ table must be dropped (strict VID checking).
+
+When there is a VLAN device (e.g., sw0p1.100) configured on top of a switchdev
+network device which is a bridge port member, the behavior of the software
+network stack must be preserved, or the configuration must be refused if that
+is not possible.
+
+- with VLAN filtering turned off, the bridge will process all ingress traffic
+ for the port, except for the traffic tagged with a VLAN ID destined for a
+ VLAN upper. The VLAN upper interface (which consumes the VLAN tag) can even
+ be added to a second bridge, which includes other switch ports or software
+ interfaces. Some approaches to ensure that the forwarding domain for traffic
+ belonging to the VLAN upper interfaces is managed properly:
+
+ * If forwarding destinations can be managed per VLAN, the hardware could be
+ configured to map all traffic, except the packets tagged with a VID
+ belonging to a VLAN upper interface, to an internal VID corresponding to
+ untagged packets. This internal VID spans all ports of the VLAN-unaware
+ bridge. The VID corresponding to the VLAN upper interface spans the
+ physical port of that VLAN interface, as well as the other ports that
+ might be bridged with it.
+ * Treat bridge ports with VLAN upper interfaces as standalone, and let
+ forwarding be handled in the software data path.
+
+- with VLAN filtering turned on, these VLAN devices can be created as long as
+ the bridge does not have an existing VLAN entry with the same VID on any
+ bridge port. These VLAN devices cannot be enslaved into the bridge since they
+ duplicate functionality/use case with the bridge's VLAN data path processing.
+
+Non-bridged network ports of the same switch fabric must not be disturbed in any
+way by the enabling of VLAN filtering on the bridge device(s). If the VLAN
+filtering setting is global to the entire chip, then the standalone ports
+should indicate to the network stack that VLAN filtering is required by setting
+'rx-vlan-filter: on [fixed]' in the ethtool features.
+
+Because VLAN filtering can be turned on/off at runtime, the switchdev driver
+must be able to reconfigure the underlying hardware on the fly to honor the
+toggling of that option and behave appropriately. If that is not possible, the
+switchdev driver can also refuse to support dynamic toggling of the VLAN
+filtering knob at runtime and require a destruction of the bridge device(s) and
+creation of new bridge device(s) with a different VLAN filtering value to
+ensure VLAN awareness is pushed down to the hardware.
+
+Even when VLAN filtering in the bridge is turned off, the underlying switch
+hardware and driver may still configure itself in a VLAN-aware mode provided
+that the behavior described above is observed.
+
+The VLAN protocol of the bridge plays a role in deciding whether a packet is
+treated as tagged or not: a bridge using the 802.1ad protocol must treat both
+VLAN-untagged packets, as well as packets tagged with 802.1Q headers, as
+untagged.
+
+The 802.1p (VID 0) tagged packets must be treated in the same way by the device
+as untagged packets, since the bridge device does not allow the manipulation of
+VID 0 in its database.
+
+When the bridge has VLAN filtering enabled and a PVID is not configured on the
+ingress port, untagged and 802.1p tagged packets must be dropped. When the bridge
+has VLAN filtering enabled and a PVID exists on the ingress port, untagged and
+priority-tagged packets must be accepted and forwarded according to the
+bridge's port membership of the PVID VLAN. When the bridge has VLAN filtering
+disabled, the presence/lack of a PVID should not influence the packet
+forwarding decision.
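The ingress rules in this section can be condensed into a small decision function. The sketch below is a user-space model under stated assumptions: `demo_bridge_port`, `demo_ingress_vid` and the use of 0 as a "no PVID" marker are invented for illustration and do not correspond to any kernel structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_VID_MAX 4094u
#define DEMO_NO_PVID 0u		/* hypothetical marker: no PVID configured */

/* Hypothetical per-port view of the bridge VLAN configuration. */
struct demo_bridge_port {
	bool vlan_filtering;		/* bridge vlan_filtering knob */
	uint16_t pvid;			/* DEMO_NO_PVID if none */
	bool member[DEMO_VID_MAX + 1];	/* VLAN table for this port */
};

/*
 * Classify an ingress frame. Returns -1 if the frame must be dropped,
 * 0 if it is accepted by a VLAN-unaware bridge, or the VID it is
 * classified into on a VLAN-aware bridge.
 */
int demo_ingress_vid(const struct demo_bridge_port *port, bool tagged,
		     uint16_t vid)
{
	/* Filtering off: VID and PVID do not influence the decision. */
	if (!port->vlan_filtering)
		return 0;

	/* Untagged and 802.1p (VID 0) frames follow the PVID. */
	if (!tagged || vid == 0)
		return port->pvid == DEMO_NO_PVID ? -1 : (int)port->pvid;

	/* Strict VID checking for everything else. */
	if (vid > DEMO_VID_MAX || !port->member[vid])
		return -1;
	return (int)vid;
}
```

The model deliberately ignores VLAN upper interfaces and the 802.1ad case, both of which add further conditions described above.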
+
+Bridge IGMP snooping
+^^^^^^^^^^^^^^^^^^^^
+
+The Linux bridge allows the configuration of IGMP snooping (statically, at
+interface creation time, or dynamically, during runtime) which must be observed
+by the underlying switchdev network device/hardware in the following way:
+
+- when IGMP snooping is turned off, multicast traffic must be flooded to all
+ ports within the same bridge that have mcast_flood=true. The CPU/management
+ port should ideally not be flooded (unless the ingress interface has
+ IFF_ALLMULTI or IFF_PROMISC) and continue to learn multicast traffic through
+ the network stack notifications. If the hardware is not capable of doing that
+ then the CPU/management port must also be flooded and multicast filtering
+ happens in software.
+
+- when IGMP snooping is turned on, multicast traffic must selectively flow
+ to the appropriate network ports (including CPU/management port). Flooding of
+ unknown multicast should be only towards the ports connected to a multicast
+ router (the local device may also act as a multicast router).
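The two snooping modes can be modelled as a function that computes the set of egress ports for one multicast frame. This is a simplified sketch: the names `demo_mc_port` and `demo_mc_egress_mask` are hypothetical, the CPU/management port is left out, and sending known groups to router ports as well as member ports is this example's reading of RFC 4541 rather than something spelled out in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-port multicast state for one frame's group. */
struct demo_mc_port {
	bool mcast_flood;	/* bridge port flag mcast_flood */
	bool mcast_router;	/* a multicast router is reachable here */
	bool group_member;	/* port has joined the frame's group */
};

/* Return a bitmask of egress ports (bit i set means port i), up to 32 ports. */
uint32_t demo_mc_egress_mask(const struct demo_mc_port *ports,
			     unsigned int nports, unsigned int ingress,
			     bool snooping, bool group_known)
{
	uint32_t mask = 0;
	unsigned int i;

	for (i = 0; i < nports && i < 32; i++) {
		bool out;

		if (i == ingress)
			continue;	/* never send back to the source port */

		if (!snooping)
			out = ports[i].mcast_flood;
		else if (group_known)
			out = ports[i].group_member || ports[i].mcast_router;
		else
			out = ports[i].mcast_router;

		if (out)
			mask |= UINT32_C(1) << i;
	}
	return mask;
}
```

With snooping off the result depends only on the `mcast_flood` flags; with snooping on, unknown multicast reaches router ports only.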
+
+The switch must adhere to RFC 4541 and flood multicast traffic accordingly
+since that is what the Linux bridge implementation does.
+
+Because IGMP snooping can be turned on/off at runtime, the switchdev driver
+must be able to reconfigure the underlying hardware on the fly to honor the
+toggling of that option and behave appropriately.
+
+A switchdev driver can also refuse to support dynamic toggling of the multicast
+snooping knob at runtime and require the destruction of the bridge device(s)
+and creation of new bridge device(s) with a different multicast snooping
+value.
diff --git a/Documentation/networking/sysfs-tagging.rst b/Documentation/networking/sysfs-tagging.rst
new file mode 100644
index 000000000000..83647e10c207
--- /dev/null
+++ b/Documentation/networking/sysfs-tagging.rst
@@ -0,0 +1,48 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Sysfs tagging
+=============
+
+(Taken almost verbatim from Eric Biederman's netns tagging patch
+commit msg)
+
+The problem: network devices show up in sysfs, and with network
+namespaces active, multiple devices with the same name can show up in
+the same directory, ouch!
+
+To avoid that problem and allow existing applications in network
+namespaces to see the same interface that is currently presented in
+sysfs, sysfs now has tagging directory support.
+
+By using the network namespace pointers as tags to separate out
+the sysfs directory entries we ensure that we don't have conflicts
+in the directories and applications only see a limited set of
+the network devices.
+
+Each sysfs directory entry may be tagged with a namespace via the
+``void *ns`` member of its ``kernfs_node``. If a directory entry is tagged,
+then ``kernfs_node->flags`` will have a flag between KOBJ_NS_TYPE_NONE
+and KOBJ_NS_TYPES, and ns will point to the namespace to which it
+belongs.
+
+Each sysfs superblock's kernfs_super_info contains an array
+``void *ns[KOBJ_NS_TYPES]``. When a task in a tagging namespace
+kobj_nstype first mounts sysfs, a new superblock is created. It
+will be differentiated from other sysfs mounts by having its
+``s_fs_info->ns[kobj_nstype]`` set to the new namespace. Note that
+through bind mounting and mounts propagation, a task can easily view
+the contents of other namespaces' sysfs mounts. Therefore, when a
+namespace exits, it will call kobj_ns_exit() to invalidate any
+kernfs_node->ns pointers pointing to it.
+
+Users of this interface:
+
+- define a type in the ``kobj_ns_type`` enumeration.
+- call kobj_ns_type_register() with its ``kobj_ns_type_operations`` which has
+
+ - current_ns() which returns current's namespace
+ - netlink_ns() which returns a socket's namespace
+ - initial_ns() which returns the initial namespace
+
+- call kobj_ns_exit() when an individual tag is no longer valid
diff --git a/Documentation/networking/tc-actions-env-rules.rst b/Documentation/networking/tc-actions-env-rules.rst
new file mode 100644
index 000000000000..86884b8fb4e0
--- /dev/null
+++ b/Documentation/networking/tc-actions-env-rules.rst
@@ -0,0 +1,29 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================
+TC Actions - Environmental Rules
+================================
+
+
+The "environmental" rules for authors of any new tc actions are:
+
+1) If you stealeth or borroweth any packet thou shalt be branching
+ from the righteous path and thou shalt cloneth.
+
+ For example if your action queues a packet to be processed later,
+ or intentionally branches by redirecting a packet, then you need to
+ clone the packet.
+
+2) If you munge any packet thou shalt call pskb_expand_head in the case
+ someone else is referencing the skb. After that you "own" the skb.
+
+3) Dropping packets you don't own is a no-no. You simply return
+ TC_ACT_SHOT to the caller and they will drop it.
+
+The "environmental" rules for callers of actions (qdiscs etc) are:
+
+#) Thou art responsible for freeing anything returned as being
+ TC_ACT_SHOT/STOLEN/QUEUED. If none of TC_ACT_SHOT/STOLEN/QUEUED is
+ returned, then all is great and you don't need to do anything.
+
+Post on netdev if something is unclear.
diff --git a/Documentation/networking/tc-actions-env-rules.txt b/Documentation/networking/tc-actions-env-rules.txt
deleted file mode 100644
index f37814693ad3..000000000000
--- a/Documentation/networking/tc-actions-env-rules.txt
+++ /dev/null
@@ -1,24 +0,0 @@
-
-The "environmental" rules for authors of any new tc actions are:
-
-1) If you stealeth or borroweth any packet thou shalt be branching
-from the righteous path and thou shalt cloneth.
-
-For example if your action queues a packet to be processed later,
-or intentionally branches by redirecting a packet, then you need to
-clone the packet.
-
-2) If you munge any packet thou shalt call pskb_expand_head in the case
-someone else is referencing the skb. After that you "own" the skb.
-
-3) Dropping packets you don't own is a no-no. You simply return
-TC_ACT_SHOT to the caller and they will drop it.
-
-The "environmental" rules for callers of actions (qdiscs etc) are:
-
-*) Thou art responsible for freeing anything returned as being
-TC_ACT_SHOT/STOLEN/QUEUED. If none of TC_ACT_SHOT/STOLEN/QUEUED is
-returned, then all is great and you don't need to do anything.
-
-Post on netdev if something is unclear.
-
diff --git a/Documentation/networking/tcp-thin.txt b/Documentation/networking/tcp-thin.rst
index 151e229980f1..b06765c96ea1 100644
--- a/Documentation/networking/tcp-thin.txt
+++ b/Documentation/networking/tcp-thin.rst
@@ -1,5 +1,9 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
Thin-streams and TCP
====================
+
A wide range of Internet-based services that use reliable transport
protocols display what we call thin-stream properties. This means
that the application sends data with such a low rate that the
@@ -42,6 +46,7 @@ References
==========
More information on the modifications, as well as a wide range of
experimental data can be found here:
+
"Improving latency for interactive, thin-stream applications over
reliable transport"
http://simula.no/research/nd/publications/Simula.nd.477/simula_pdf_file
diff --git a/Documentation/networking/team.txt b/Documentation/networking/team.rst
index 5a013686b9ea..0a7f3a059586 100644
--- a/Documentation/networking/team.txt
+++ b/Documentation/networking/team.rst
@@ -1,2 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====
+Team
+====
+
Team devices are driven from userspace via libteam library which is here:
https://github.com/jpirko/libteam
diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.rst
index 8dd6333c3270..be4eb1242057 100644
--- a/Documentation/networking/timestamping.txt
+++ b/Documentation/networking/timestamping.rst
@@ -1,9 +1,16 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+Timestamping
+============
+
1. Control Interfaces
+=====================
The interfaces for receiving network packet timestamps are:
-* SO_TIMESTAMP
+SO_TIMESTAMP
Generates a timestamp for each incoming packet in (not necessarily
monotonic) system time. Reports the timestamp via recvmsg() in a
control message in usec resolution.
@@ -13,7 +20,7 @@ The interfaces for receiving network packages timestamps are:
SO_TIMESTAMP_OLD and in struct __kernel_sock_timeval for
SO_TIMESTAMP_NEW options respectively.
-* SO_TIMESTAMPNS
+SO_TIMESTAMPNS
Same timestamping mechanism as SO_TIMESTAMP, but reports the
timestamp as struct timespec in nsec resolution.
SO_TIMESTAMPNS is defined as SO_TIMESTAMPNS_NEW or SO_TIMESTAMPNS_OLD
@@ -22,17 +29,18 @@ The interfaces for receiving network packages timestamps are:
and in struct __kernel_timespec for SO_TIMESTAMPNS_NEW options
respectively.
-* IP_MULTICAST_LOOP + SO_TIMESTAMP[NS]
+IP_MULTICAST_LOOP + SO_TIMESTAMP[NS]
Only for multicast: approximate transmit timestamp obtained by
reading the looped packet receive timestamp.
-* SO_TIMESTAMPING
+SO_TIMESTAMPING
Generates timestamps on reception, transmission or both. Supports
multiple timestamp sources, including hardware. Supports generating
timestamps for stream sockets.
-1.1 SO_TIMESTAMP (also SO_TIMESTAMP_OLD and SO_TIMESTAMP_NEW):
+1.1 SO_TIMESTAMP (also SO_TIMESTAMP_OLD and SO_TIMESTAMP_NEW)
+-------------------------------------------------------------
This socket option enables timestamping of datagrams on the reception
path. Because the destination socket, if any, is not known early in
@@ -47,7 +55,8 @@ struct __kernel_sock_timeval format.
SO_TIMESTAMP_OLD returns incorrect timestamps after the year 2038
on 32 bit machines.
-1.2 SO_TIMESTAMPNS (also SO_TIMESTAMPNS_OLD and SO_TIMESTAMPNS_NEW):
+1.2 SO_TIMESTAMPNS (also SO_TIMESTAMPNS_OLD and SO_TIMESTAMPNS_NEW)
+-------------------------------------------------------------------
This option is identical to SO_TIMESTAMP except for the returned data type.
Its struct timespec allows for higher resolution (ns) timestamps than the
@@ -59,10 +68,11 @@ struct __kernel_timespec format.
SO_TIMESTAMPNS_OLD returns incorrect timestamps after the year 2038
on 32 bit machines.
-1.3 SO_TIMESTAMPING (also SO_TIMESTAMPING_OLD and SO_TIMESTAMPING_NEW):
+1.3 SO_TIMESTAMPING (also SO_TIMESTAMPING_OLD and SO_TIMESTAMPING_NEW)
+----------------------------------------------------------------------
Supports multiple types of timestamp requests. As a result, this
-socket option takes a bitmap of flags, not a boolean. In
+socket option takes a bitmap of flags, not a boolean. In::
err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
@@ -76,6 +86,7 @@ be enabled for individual sendmsg calls using cmsg (1.3.4).
1.3.1 Timestamp Generation
+^^^^^^^^^^^^^^^^^^^^^^^^^^
Some bits are requests to the stack to try to generate timestamps. Any
combination of them is valid. Changes to these bits apply to newly
@@ -106,7 +117,6 @@ SOF_TIMESTAMPING_TX_SOFTWARE:
require driver support and may not be available for all devices.
This flag can be enabled via both socket options and control messages.
-
SOF_TIMESTAMPING_TX_SCHED:
Request tx timestamps prior to entering the packet scheduler. Kernel
transmit latency is, if long, often dominated by queuing delay. The
@@ -132,6 +142,7 @@ SOF_TIMESTAMPING_TX_ACK:
1.3.2 Timestamp Reporting
+^^^^^^^^^^^^^^^^^^^^^^^^^
The other three bits control which timestamps will be reported in a
generated control message. Changes to the bits take immediate
@@ -151,11 +162,11 @@ SOF_TIMESTAMPING_RAW_HARDWARE:
1.3.3 Timestamp Options
+^^^^^^^^^^^^^^^^^^^^^^^
The interface supports the options
SOF_TIMESTAMPING_OPT_ID:
-
Generate a unique identifier along with each packet. A process can
have multiple concurrent timestamping requests outstanding. Packets
can be reordered in the transmit path, for instance in the packet
@@ -183,7 +194,6 @@ SOF_TIMESTAMPING_OPT_ID:
SOF_TIMESTAMPING_OPT_CMSG:
-
Support recv() cmsg for all timestamped packets. Control messages
are already supported unconditionally on all packets with receive
timestamps and on IPv6 packets with transmit timestamp. This option
@@ -193,7 +203,6 @@ SOF_TIMESTAMPING_OPT_CMSG:
SOF_TIMESTAMPING_OPT_TSONLY:
-
Applies to transmit timestamps only. Makes the kernel return the
timestamp as a cmsg alongside an empty packet, as opposed to
alongside the original packet. This reduces the amount of memory
@@ -202,7 +211,6 @@ SOF_TIMESTAMPING_OPT_TSONLY:
This option disables SOF_TIMESTAMPING_OPT_CMSG.
SOF_TIMESTAMPING_OPT_STATS:
-
Optional stats that are obtained along with the transmit timestamps.
It must be used together with SOF_TIMESTAMPING_OPT_TSONLY. When the
transmit timestamp is available, the stats are available in a
@@ -213,7 +221,6 @@ SOF_TIMESTAMPING_OPT_STATS:
data was limited by peer's receiver window.
SOF_TIMESTAMPING_OPT_PKTINFO:
-
Enable the SCM_TIMESTAMPING_PKTINFO control message for incoming
packets with hardware timestamps. The message contains struct
scm_ts_pktinfo, which supplies the index of the real interface which
@@ -223,7 +230,6 @@ SOF_TIMESTAMPING_OPT_PKTINFO:
other fields, but they are reserved and undefined.
SOF_TIMESTAMPING_OPT_TX_SWHW:
-
Request both hardware and software timestamps for outgoing packets
when SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE
are enabled at the same time. If both timestamps are generated,
@@ -242,12 +248,13 @@ combined with SOF_TIMESTAMPING_OPT_TSONLY.
1.3.4. Enabling timestamps via control messages
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In addition to socket options, timestamp generation can be requested
per write via cmsg, only for SOF_TIMESTAMPING_TX_* (see Section 1.3.1).
Using this feature, applications can sample timestamps per sendmsg()
without paying the overhead of enabling and disabling timestamps via
-setsockopt:
+setsockopt::
struct msghdr *msg;
...
@@ -264,7 +271,7 @@ The SOF_TIMESTAMPING_TX_* flags set via cmsg will override
the SOF_TIMESTAMPING_TX_* flags set via setsockopt.
Moreover, applications must still enable timestamp reporting via
-setsockopt to receive timestamps:
+setsockopt to receive timestamps::
__u32 val = SOF_TIMESTAMPING_SOFTWARE |
SOF_TIMESTAMPING_OPT_ID /* or any other flag */;
@@ -272,6 +279,7 @@ setsockopt to receive timestamps:
1.4 Bytestream Timestamps
+-------------------------
The SO_TIMESTAMPING interface supports timestamping of bytes in a
bytestream. Each request is interpreted as a request for when the
@@ -331,6 +339,7 @@ unusual.
2 Data Interfaces
+==================
Timestamps are read using the ancillary data feature of recvmsg().
See `man 3 cmsg` for details of this interface. The socket manual
@@ -339,20 +348,21 @@ SO_TIMESTAMP and SO_TIMESTAMPNS records can be retrieved.
2.1 SCM_TIMESTAMPING records
+----------------------------
These timestamps are returned in a control message with cmsg_level
SOL_SOCKET, cmsg_type SCM_TIMESTAMPING, and payload of type
-For SO_TIMESTAMPING_OLD:
+For SO_TIMESTAMPING_OLD::
-struct scm_timestamping {
- struct timespec ts[3];
-};
+ struct scm_timestamping {
+ struct timespec ts[3];
+ };
-For SO_TIMESTAMPING_NEW:
+For SO_TIMESTAMPING_NEW::
-struct scm_timestamping64 {
- struct __kernel_timespec ts[3];
+ struct scm_timestamping64 {
+ struct __kernel_timespec ts[3];
Always use SO_TIMESTAMPING_NEW to get timestamps in
struct scm_timestamping64 format.
@@ -377,6 +387,7 @@ in ts[0] when a real software timestamp is missing. This happens also
on hardware transmit timestamps.
2.1.1 Transmit timestamps with MSG_ERRQUEUE
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For transmit timestamps the outgoing packet is looped back to the
socket's error queue with the send timestamp(s) attached. A process
@@ -393,6 +404,7 @@ embeds the struct scm_timestamping.
2.1.1.2 Timestamp types
+~~~~~~~~~~~~~~~~~~~~~~~
The semantics of the three struct timespec are defined by field
ee_info in the extended error structure. It contains a value of
@@ -408,6 +420,7 @@ case the timestamp is stored in ts[0].
2.1.1.3 Fragmentation
+~~~~~~~~~~~~~~~~~~~~~
Fragmentation of outgoing datagrams is rare, but is possible, e.g., by
explicitly disabling PMTU discovery. If an outgoing packet is fragmented,
@@ -416,6 +429,7 @@ socket.
2.1.1.4 Packet Payload
+~~~~~~~~~~~~~~~~~~~~~~
The calling application is often not interested in receiving the whole
packet payload that it passed to the stack originally: the socket
@@ -427,6 +441,7 @@ however, the full packet is queued, taking up budget from SO_RCVBUF.
2.1.1.5 Blocking Read
+~~~~~~~~~~~~~~~~~~~~~
Reading from the error queue is always a non-blocking operation. To
block waiting on a timestamp, use poll or select. poll() will return
@@ -436,6 +451,7 @@ ignored on request. See also `man 2 poll`.
2.1.2 Receive timestamps
+^^^^^^^^^^^^^^^^^^^^^^^^
On reception, there is no reason to read from the socket error queue.
The SCM_TIMESTAMPING ancillary data is sent along with the packet data
@@ -447,16 +463,17 @@ is again deprecated and ts[2] holds a hardware timestamp if set.
3. Hardware Timestamping configuration: SIOCSHWTSTAMP and SIOCGHWTSTAMP
+=======================================================================
Hardware time stamping must also be initialized for each device driver
that is expected to do hardware time stamping. The parameter is defined in
-include/uapi/linux/net_tstamp.h as:
+include/uapi/linux/net_tstamp.h as::
-struct hwtstamp_config {
- int flags; /* no flags defined right now, must be zero */
- int tx_type; /* HWTSTAMP_TX_* */
- int rx_filter; /* HWTSTAMP_FILTER_* */
-};
+ struct hwtstamp_config {
+ int flags; /* no flags defined right now, must be zero */
+ int tx_type; /* HWTSTAMP_TX_* */
+ int rx_filter; /* HWTSTAMP_FILTER_* */
+ };
Desired behavior is passed into the kernel and to a specific device by
calling ioctl(SIOCSHWTSTAMP) with a pointer to a struct ifreq whose
@@ -469,8 +486,8 @@ of packets.
Drivers are free to use a more permissive configuration than the requested
configuration. It is expected that drivers should only implement directly the
most generic mode that can be supported. For example if the hardware can
-support HWTSTAMP_FILTER_V2_EVENT, then it should generally always upscale
-HWTSTAMP_FILTER_V2_L2_SYNC_MESSAGE, and so forth, as HWTSTAMP_FILTER_V2_EVENT
+support HWTSTAMP_FILTER_PTP_V2_EVENT, then it should generally always upscale
+HWTSTAMP_FILTER_PTP_V2_L2_SYNC, and so forth, as HWTSTAMP_FILTER_PTP_V2_EVENT
is more generic (and more useful to applications).
A driver which supports hardware time stamping shall update the struct
@@ -487,44 +504,47 @@ Any process can read the actual configuration by passing this
structure to ioctl(SIOCGHWTSTAMP) in the same way. However, this has
not been implemented in all drivers.
-/* possible values for hwtstamp_config->tx_type */
-enum {
- /*
- * no outgoing packet will need hardware time stamping;
- * should a packet arrive which asks for it, no hardware
- * time stamping will be done
- */
- HWTSTAMP_TX_OFF,
-
- /*
- * enables hardware time stamping for outgoing packets;
- * the sender of the packet decides which are to be
- * time stamped by setting SOF_TIMESTAMPING_TX_SOFTWARE
- * before sending the packet
- */
- HWTSTAMP_TX_ON,
-};
-
-/* possible values for hwtstamp_config->rx_filter */
-enum {
- /* time stamp no incoming packet at all */
- HWTSTAMP_FILTER_NONE,
-
- /* time stamp any incoming packet */
- HWTSTAMP_FILTER_ALL,
-
- /* return value: time stamp all packets requested plus some others */
- HWTSTAMP_FILTER_SOME,
-
- /* PTP v1, UDP, any kind of event packet */
- HWTSTAMP_FILTER_PTP_V1_L4_EVENT,
-
- /* for the complete list of values, please check
- * the include file include/uapi/linux/net_tstamp.h
- */
-};
+::
+
+ /* possible values for hwtstamp_config->tx_type */
+ enum {
+ /*
+ * no outgoing packet will need hardware time stamping;
+ * should a packet arrive which asks for it, no hardware
+ * time stamping will be done
+ */
+ HWTSTAMP_TX_OFF,
+
+ /*
+ * enables hardware time stamping for outgoing packets;
+ * the sender of the packet decides which are to be
+ * time stamped by setting SOF_TIMESTAMPING_TX_SOFTWARE
+ * before sending the packet
+ */
+ HWTSTAMP_TX_ON,
+ };
+
+ /* possible values for hwtstamp_config->rx_filter */
+ enum {
+ /* time stamp no incoming packet at all */
+ HWTSTAMP_FILTER_NONE,
+
+ /* time stamp any incoming packet */
+ HWTSTAMP_FILTER_ALL,
+
+ /* return value: time stamp all packets requested plus some others */
+ HWTSTAMP_FILTER_SOME,
+
+ /* PTP v1, UDP, any kind of event packet */
+ HWTSTAMP_FILTER_PTP_V1_L4_EVENT,
+
+ /* for the complete list of values, please check
+ * the include file include/uapi/linux/net_tstamp.h
+ */
+ };
3.1 Hardware Timestamping Implementation: Device Drivers
+--------------------------------------------------------
A driver which supports hardware time stamping must support the
SIOCSHWTSTAMP ioctl and update the supplied struct hwtstamp_config with
@@ -533,22 +553,23 @@ should also support SIOCGHWTSTAMP.
Time stamps for received packets must be stored in the skb. To get a pointer
to the shared time stamp structure of the skb call skb_hwtstamps(). Then
-set the time stamps in the structure:
+set the time stamps in the structure::
-struct skb_shared_hwtstamps {
- /* hardware time stamp transformed into duration
- * since arbitrary point in time
- */
- ktime_t hwtstamp;
-};
+ struct skb_shared_hwtstamps {
+ /* hardware time stamp transformed into duration
+ * since arbitrary point in time
+ */
+ ktime_t hwtstamp;
+ };
Time stamps for outgoing packets are to be generated as follows:
+
- In hard_start_xmit(), check if (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP)
 is set non-zero. If yes, then the driver is expected to do hardware time
stamping.
- If this is possible for the skb and requested, then declare
that the driver is doing the time stamping by setting the flag
- SKBTX_IN_PROGRESS in skb_shinfo(skb)->tx_flags , e.g. with
+ SKBTX_IN_PROGRESS in skb_shinfo(skb)->tx_flags, e.g. with::
skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
@@ -561,11 +582,191 @@ Time stamps for outgoing packets are to be generated as follows:
and hardware timestamping is not possible (SKBTX_IN_PROGRESS not set).
- As soon as the driver has sent the packet and/or obtained a
hardware time stamp for it, it passes the time stamp back by
- calling skb_hwtstamp_tx() with the original skb, the raw
- hardware time stamp. skb_hwtstamp_tx() clones the original skb and
+ calling skb_tstamp_tx() with the original skb, the raw
+ hardware time stamp. skb_tstamp_tx() clones the original skb and
adds the timestamps, therefore the original skb has to be freed now.
If obtaining the hardware time stamp somehow fails, then the driver
should not fall back to software time stamping. The rationale is that
this would occur at a later time in the processing pipeline than other
software time stamping and therefore could lead to unexpected deltas
between time stamps.
+
+3.2 Special considerations for stacked PTP Hardware Clocks
+----------------------------------------------------------
+
+There are situations when there may be more than one PHC (PTP Hardware Clock)
+in the data path of a packet. The kernel has no explicit mechanism to allow the
+user to select which PHC to use for timestamping Ethernet frames. Instead, the
+assumption is that the outermost PHC is always the most preferable, and that
+kernel drivers collaborate towards achieving that goal. Currently there are 3
+cases of stacked PHCs, detailed below:
+
+3.2.1 DSA (Distributed Switch Architecture) switches
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+These are Ethernet switches which have one of their ports connected to an
+(otherwise completely unaware) host Ethernet interface, and perform the role of
+a port multiplier with optional forwarding acceleration features. Each DSA
+switch port is visible to the user as a standalone (virtual) network interface,
+and its network I/O is performed, under the hood, indirectly through the host
+interface (redirecting to the host port on TX, and intercepting frames on RX).
+
+When a DSA switch is attached to a host port, PTP synchronization has to
+suffer, since the switch's variable queuing delay introduces a path delay
+jitter between the host port and its PTP partner. For this reason, some DSA
+switches include a timestamping clock of their own, and have the ability to
+perform network timestamping on their own MAC, such that path delays only
+measure wire and PHY propagation latencies. Timestamping DSA switches are
+supported in Linux and expose the same ABI as any other network interface
+(although the DSA interfaces are virtual in terms of network I/O, they do
+have their own PHC). It is typical, but not mandatory, for all
+interfaces of a DSA switch to share the same PHC.
+
+By design, PTP timestamping with a DSA switch does not need any special
+handling in the driver for the host port it is attached to. However, when the
+host port also supports PTP timestamping, DSA will take care of intercepting
+the ``.ndo_eth_ioctl`` calls towards the host port, and block attempts to enable
+hardware timestamping on it. This is because the SO_TIMESTAMPING API does not
+allow the delivery of multiple hardware timestamps for the same packet, so
+anybody else except for the DSA switch port must be prevented from doing so.
+
+In the generic layer, DSA provides the following infrastructure for PTP
+timestamping:
+
+- ``.port_txtstamp()``: a hook called prior to the transmission of
+ packets with a hardware TX timestamping request from user space.
+ This is required for two-step timestamping, since the hardware
+ timestamp becomes available after the actual MAC transmission, so the
+ driver must be prepared to correlate the timestamp with the original
+ packet so that it can re-enqueue the packet back into the socket's
+ error queue. To save the packet for when the timestamp becomes
+ available, the driver can call ``skb_clone_sk``, save the clone pointer
+ in skb->cb and enqueue it on a TX skb queue. Typically, a switch will have a
+ PTP TX timestamp register (or sometimes a FIFO) where the timestamp
+ becomes available. In case of a FIFO, the hardware might store
+ key-value pairs of PTP sequence ID/message type/domain number and the
+ actual timestamp. To perform the correlation correctly between the
+ packets in a queue waiting for timestamping and the actual timestamps,
+ drivers can use a BPF classifier (``ptp_classify_raw``) to identify
+ the PTP transport type, and ``ptp_parse_header`` to interpret the PTP
+ header fields. There may be an IRQ that is raised upon this
+ timestamp's availability, or the driver might have to poll after
+ invoking ``dev_queue_xmit()`` towards the host interface.
+ One-step TX timestamping does not require packet cloning, since there is
+ no follow-up message required by the PTP protocol (because the
+ TX timestamp is embedded into the packet by the MAC), and therefore
+ user space does not expect the packet annotated with the TX timestamp
+ to be re-enqueued into its socket's error queue.
+
+- ``.port_rxtstamp()``: On RX, the BPF classifier is run by DSA to
+ identify PTP event messages (any other packets, including PTP general
+ messages, are not timestamped). The original (and only) timestampable
+ skb is provided to the driver, for it to annotate it with a timestamp,
+ if that is immediately available, or defer to later. On reception,
+ timestamps might either be available in-band (through metadata in the
+ DSA header, or attached in other ways to the packet), or out-of-band
+ (through another RX timestamping FIFO). Deferral on RX is typically
+ necessary when retrieving the timestamp needs a sleepable context. In
+ that case, it is the responsibility of the DSA driver to call
+ ``netif_rx()`` on the freshly timestamped skb.
+
+3.2.2 Ethernet PHYs
+^^^^^^^^^^^^^^^^^^^
+
+These are devices that typically fulfill a Layer 1 role in the network stack,
+hence they do not have a representation in terms of a network interface as DSA
+switches do. However, PHYs may be able to detect and timestamp PTP packets, for
+performance reasons: timestamps taken as close as possible to the wire have the
+potential to yield a more stable and precise synchronization.
+
+A PHY driver that supports PTP timestamping must create a ``struct
+mii_timestamper`` and add a pointer to it in ``phydev->mii_ts``. The presence
+of this pointer will be checked by the networking stack.
+
+Since PHYs do not have network interface representations, the timestamping and
+ethtool ioctl operations for them need to be mediated by their respective MAC
+driver. Therefore, as opposed to DSA switches, modifications need to be done
+to each individual MAC driver for PHY timestamping support. This entails:
+
+- Checking, in ``.ndo_eth_ioctl``, whether ``phy_has_hwtstamp(netdev->phydev)``
+ is true or not. If it is, then the MAC driver should not process this request
+ but instead pass it on to the PHY using ``phy_mii_ioctl()``.
+
+- On RX, special intervention may or may not be needed, depending on the
+ function used to deliver skb's up the network stack. In the case of plain
+ ``netif_rx()`` and similar, MAC drivers must check whether
+ ``skb_defer_rx_timestamp(skb)`` is necessary or not - and if it is, don't
+ call ``netif_rx()`` at all. If ``CONFIG_NETWORK_PHY_TIMESTAMPING`` is
+ enabled, and ``skb->dev->phydev->mii_ts`` exists, its ``.rxtstamp()`` hook
+ will be called now, to determine, using logic very similar to DSA, whether
+ deferral for RX timestamping is necessary. Again like DSA, it becomes the
+ responsibility of the PHY driver to send the packet up the stack when the
+ timestamp is available.
+
+ For other skb receive functions, such as ``napi_gro_receive`` and
+ ``netif_receive_skb``, the stack automatically checks whether
+ ``skb_defer_rx_timestamp()`` is necessary, so this check is not needed inside
+ the driver.
+
+- On TX, again, special intervention might or might not be needed. The
+ function that calls the ``mii_ts->txtstamp()`` hook is named
+ ``skb_clone_tx_timestamp()``. This function can either be called directly
+ (in which case explicit MAC driver support is indeed needed), or reached
+ indirectly through the ``skb_tx_timestamp()`` call, which many MAC
+ drivers already perform for software timestamping purposes. Therefore, if a
+ MAC supports software timestamping, it does not need to do anything further
+ at this stage.
+
+3.2.3 MII bus snooping devices
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+These perform the same role as timestamping Ethernet PHYs, save for the fact
+that they are discrete devices and can therefore be used in conjunction with
+any PHY even if it doesn't support timestamping. In Linux, they are
+discoverable and attachable to a ``struct phy_device`` through Device Tree, and
+otherwise they use the same mii_ts infrastructure as timestamping PHYs. See
+Documentation/devicetree/bindings/ptp/timestamper.txt for more details.
+
+3.2.4 Other caveats for MAC drivers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A caveat of stacked PHCs, especially (but not only) with DSA, is that they
+uncover bugs which were impossible to trigger before the existence of stacked
+PTP clocks. Because DSA does not require any modification to MAC drivers, it
+is more difficult to ensure correctness of all possible code paths. One
+example has to do with this line of code, already presented earlier::
+
+ skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
+
+Any TX timestamping logic, be it a plain MAC driver, a DSA switch driver, a PHY
+driver or a MII bus snooping device driver, should set this flag.
+But a MAC driver that is unaware of PHC stacking might get tripped up by
+somebody other than itself setting this flag, and deliver a duplicate
+timestamp.
+For example, a typical driver design for TX timestamping might be to split the
+transmission part into 2 portions:
+
+1. "TX": checks whether PTP timestamping has been previously enabled through
+ the ``.ndo_eth_ioctl`` ("``priv->hwtstamp_tx_enabled == true``") and the
+ current skb requires a TX timestamp ("``skb_shinfo(skb)->tx_flags &
+ SKBTX_HW_TSTAMP``"). If this is true, it sets the
+ "``skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS``" flag. Note: as
+ described above, in the case of a stacked PHC system, this condition should
+ never trigger, as this MAC is certainly not the outermost PHC. But this is
+ not where the typical issue is. Transmission proceeds with this packet.
+
+2. "TX confirmation": Transmission has finished. The driver checks whether it
+ is necessary to collect any TX timestamp for it. Here is where the typical
+ issues are: the MAC driver takes a shortcut and only checks whether
+ "``skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS``" was set. With a stacked
+ PHC system, this is incorrect because this MAC driver is not the only entity
+ in the TX data path who could have enabled SKBTX_IN_PROGRESS in the first
+ place.
+
+The correct solution for this problem is for MAC drivers to have a compound
+check in their "TX confirmation" portion, not only for
+"``skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS``", but also for
+"``priv->hwtstamp_tx_enabled == true``". Because the rest of the system ensures
+that PTP timestamping is not enabled for anything other than the outermost PHC,
+this enhanced check will avoid delivering a duplicated TX timestamp to user
+space.
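
Reviewer sketch (not part of the patch): the compound "TX confirmation" check described in section 3.2.4 can be modelled in plain user-space C. Everything below is a hypothetical stand-in: `struct mock_skb`, `struct mock_priv`, `mock_xmit()` and `mock_should_deliver_tx_tstamp()` are invented for illustration, and the flag values are chosen for the example rather than copied from the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for the skb tx_flags bits named in the text.
 * The values are assumptions made for this sketch only. */
#define SKBTX_HW_TSTAMP   (1u << 0)
#define SKBTX_IN_PROGRESS (1u << 2)

/* Hypothetical, minimal model of an skb. */
struct mock_skb {
	unsigned int tx_flags;
};

/* Hypothetical MAC driver private data. */
struct mock_priv {
	bool hwtstamp_tx_enabled; /* true once TX timestamping was enabled by ioctl */
};

/* "TX" portion: claim the timestamp only if this MAC was configured for it
 * and the skb asks for a hardware timestamp. */
void mock_xmit(const struct mock_priv *priv, struct mock_skb *skb)
{
	assert(priv != NULL && skb != NULL);
	if (priv->hwtstamp_tx_enabled && (skb->tx_flags & SKBTX_HW_TSTAMP))
		skb->tx_flags |= SKBTX_IN_PROGRESS;
}

/* "TX confirmation" portion: the compound check. SKBTX_IN_PROGRESS alone is
 * not enough, because another entity in the TX path (a DSA switch, a PHY)
 * may have set it. */
bool mock_should_deliver_tx_tstamp(const struct mock_priv *priv,
				   const struct mock_skb *skb)
{
	assert(priv != NULL && skb != NULL);
	return priv->hwtstamp_tx_enabled &&
	       (skb->tx_flags & SKBTX_IN_PROGRESS) != 0;
}
```

With a MAC whose `hwtstamp_tx_enabled` is false, an skb that already carries SKBTX_IN_PROGRESS (set by an outer PHC) yields false from `mock_should_deliver_tx_tstamp()`, so no duplicate timestamp would be delivered in this model.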
diff --git a/Documentation/networking/tipc.rst b/Documentation/networking/tipc.rst
new file mode 100644
index 000000000000..ab63d298cca2
--- /dev/null
+++ b/Documentation/networking/tipc.rst
@@ -0,0 +1,215 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Linux Kernel TIPC
+=================
+
+Introduction
+============
+
+TIPC (Transparent Inter Process Communication) is a protocol that is specially
+designed for intra-cluster communication. It can be configured to transmit
+messages either on UDP or directly across Ethernet. Message delivery is
+sequence guaranteed, loss free and flow controlled. Latency times are shorter
+than with any other known protocol, while maximal throughput is comparable to
+that of TCP.
+
+TIPC Features
+-------------
+
+- Cluster wide IPC service
+
+ Have you ever wished you had the convenience of Unix Domain Sockets even when
+ transmitting data between cluster nodes? Where you yourself determine the
+ addresses you want to bind to and use? Where you don't have to perform DNS
+ lookups and worry about IP addresses? Where you don't have to start timers
+ to monitor the continuous existence of peer sockets? And yet without the
+ downsides of that socket type, such as the risk of lingering inodes?
+
+ Welcome to the Transparent Inter Process Communication service, TIPC in short,
+ which gives you all of this, and a lot more.
+
+- Service Addressing
+
+ A fundamental concept in TIPC is that of Service Addressing which makes it
+ possible for a programmer to chose his own address, bind it to a server
+ socket and let client programs use only that address for sending messages.
+
+- Service Tracking
+
+ A client wanting to wait for the availability of a server uses the Service
+ Tracking mechanism to subscribe for binding and unbinding/close events for
+ sockets with the associated service address.
+
+ The service tracking mechanism can also be used for Cluster Topology Tracking,
+ i.e., subscribing for availability/non-availability of cluster nodes.
+
+ Likewise, the service tracking mechanism can be used for Cluster Connectivity
+ Tracking, i.e., subscribing for up/down events for individual links between
+ cluster nodes.
+
+- Transmission Modes
+
+ Using a service address, a client can send datagram messages to a server socket.
+
+ Using the same address type, it can establish a connection towards an accepting
+ server socket.
+
+ It can also use a service address to create and join a Communication Group,
+ which is the TIPC manifestation of a brokerless message bus.
+
+ Multicast with very good performance and scalability is available both in
+ datagram mode and in communication group mode.
+
+- Inter Node Links
+
+ Communication between any two nodes in a cluster is maintained by one or two
+ Inter Node Links, which both guarantee data traffic integrity and monitor
+ the peer node's availability.
+
+- Cluster Scalability
+
+ By applying the Overlapping Ring Monitoring algorithm on the inter node links
+ it is possible to scale TIPC clusters up to 1000 nodes with a maintained
+ neighbor failure discovery time of 1-2 seconds. For smaller clusters this
+ time can be made much shorter.
+
+- Neighbor Discovery
+
+ Neighbor Node Discovery in the cluster is done by Ethernet broadcast or UDP
+ multicast, when any of those services are available. If not, configured peer
+ IP addresses can be used.
+
+- Configuration
+
+ When running TIPC in single node mode no configuration whatsoever is needed.
+ When running in cluster mode TIPC must as a minimum be given a node address
+ (before Linux 4.17) and told which interface to attach to. The "tipc"
+ configuration tool makes it possible to add and maintain many more
+ configuration parameters.
+
+- Performance
+
+ TIPC message transfer latency times are better than in any other known protocol.
+ Maximal byte throughput for inter-node connections is still somewhat lower than
+ for TCP, while it is superior for intra-node and inter-container throughput
+ on the same host.
+
+- Language Support
+
+ The TIPC user API has support for C, Python, Perl, Ruby, D and Go.
+
+More Information
+----------------
+
+- How to set up TIPC:
+
+ http://tipc.io/getting_started.html
+
+- How to program with TIPC:
+
+ http://tipc.io/programming.html
+
+- How to contribute to TIPC:
+
+ http://tipc.io/contacts.html
+
+- More details about TIPC specification:
+
+ http://tipc.io/protocol.html
+
+
+Implementation
+==============
+
+TIPC is implemented as a kernel module in the net/tipc/ directory.
+
+TIPC Base Types
+---------------
+
+.. kernel-doc:: net/tipc/subscr.h
+ :internal:
+
+.. kernel-doc:: net/tipc/bearer.h
+ :internal:
+
+.. kernel-doc:: net/tipc/name_table.h
+ :internal:
+
+.. kernel-doc:: net/tipc/name_distr.h
+ :internal:
+
+.. kernel-doc:: net/tipc/bcast.c
+ :internal:
+
+TIPC Bearer Interfaces
+----------------------
+
+.. kernel-doc:: net/tipc/bearer.c
+ :internal:
+
+.. kernel-doc:: net/tipc/udp_media.c
+ :internal:
+
+TIPC Crypto Interfaces
+----------------------
+
+.. kernel-doc:: net/tipc/crypto.c
+ :internal:
+
+TIPC Discoverer Interfaces
+--------------------------
+
+.. kernel-doc:: net/tipc/discover.c
+ :internal:
+
+TIPC Link Interfaces
+--------------------
+
+.. kernel-doc:: net/tipc/link.c
+ :internal:
+
+TIPC msg Interfaces
+-------------------
+
+.. kernel-doc:: net/tipc/msg.c
+ :internal:
+
+TIPC Name Interfaces
+--------------------
+
+.. kernel-doc:: net/tipc/name_table.c
+ :internal:
+
+.. kernel-doc:: net/tipc/name_distr.c
+ :internal:
+
+TIPC Node Management Interfaces
+-------------------------------
+
+.. kernel-doc:: net/tipc/node.c
+ :internal:
+
+TIPC Socket Interfaces
+----------------------
+
+.. kernel-doc:: net/tipc/socket.c
+ :internal:
+
+TIPC Network Topology Interfaces
+--------------------------------
+
+.. kernel-doc:: net/tipc/subscr.c
+ :internal:
+
+TIPC Server Interfaces
+----------------------
+
+.. kernel-doc:: net/tipc/topsrv.c
+ :internal:
+
+TIPC Trace Interfaces
+---------------------
+
+.. kernel-doc:: net/tipc/trace.c
+ :internal:
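
Reviewer sketch (not part of the patch): the "Service Addressing" feature described in the new tipc.rst could be illustrated by filling in a `struct sockaddr_tipc` from the `<linux/tipc.h>` UAPI header. The helper name `make_service_addr()` and the numbers 18888 and 17 are assumptions picked for the example; this sketch only prepares the address and does not open a socket, since that would need the tipc module to be loaded.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/tipc.h>

/* Hypothetical helper: build a TIPC service address from an
 * application-chosen service type and instance number. */
void make_service_addr(struct sockaddr_tipc *sa,
		       unsigned int type, unsigned int instance)
{
	assert(sa != NULL);
	memset(sa, 0, sizeof(*sa));
	sa->family = AF_TIPC;
	sa->addrtype = TIPC_ADDR_NAME;        /* a single service address */
	sa->scope = TIPC_CLUSTER_SCOPE;       /* visible cluster-wide */
	sa->addr.name.name.type = type;
	sa->addr.name.name.instance = instance;
	sa->addr.name.domain = 0;             /* no node restriction */
}
```

A server would then, presumably, pass the result to bind() on a socket created with AF_TIPC, and a client would use the same address with sendto() or connect().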
diff --git a/Documentation/networking/tls-offload.rst b/Documentation/networking/tls-offload.rst
index f914e81fd3a6..5f0dea3d571e 100644
--- a/Documentation/networking/tls-offload.rst
+++ b/Documentation/networking/tls-offload.rst
@@ -428,6 +428,24 @@ by the driver:
which were part of a TLS stream.
* ``rx_tls_decrypted_bytes`` - number of TLS payload bytes in RX packets
which were successfully decrypted.
+ * ``rx_tls_ctx`` - number of TLS RX HW offload contexts added to device for
+ decryption.
+ * ``rx_tls_del`` - number of TLS RX HW offload contexts deleted from device
+ (connection has finished).
+ * ``rx_tls_resync_req_pkt`` - number of received TLS packets with a resync
+ request.
+ * ``rx_tls_resync_req_start`` - number of times the TLS async resync request
+ was started.
+ * ``rx_tls_resync_req_end`` - number of times the TLS async resync request
+ ended properly by providing the HW-tracked tcp-seq.
+ * ``rx_tls_resync_req_skip`` - number of times the TLS async resync request
+ procedure was started but not properly ended.
+ * ``rx_tls_resync_res_ok`` - number of times the TLS resync response call to
+ the driver was successfully handled.
+ * ``rx_tls_resync_res_skip`` - number of times the TLS resync response call to
+ the driver was terminated unsuccessfully.
+ * ``rx_tls_err`` - number of RX packets which were part of a TLS stream
+ but were not decrypted due to unexpected error in the state machine.
* ``tx_tls_encrypted_packets`` - number of TX packets passed to the device
for encryption of their TLS payload.
* ``tx_tls_encrypted_bytes`` - number of TLS payload bytes in TX packets
@@ -506,7 +524,16 @@ on TCP retransmissions to handle corner cases is not acceptable.
TLS device features
-------------------
-Drivers should ignore the changes to TLS the device feature flags.
+Drivers should ignore the changes to the TLS device feature flags.
These flags will be acted upon accordingly by the core ``ktls`` code.
TLS device feature flags only control adding of new TLS connection
offloads, old connections will remain active after flags are cleared.
+
+TLS encryption cannot be offloaded to devices without checksum calculation
+offload. Hence, the TLS TX device feature flag requires TX csum offload to be set.
+Disabling the latter implies clearing the former. Disabling TX checksum offload
+should not affect old connections, and drivers should make sure checksum
+calculation does not break for them.
+Similarly, device-offloaded TLS decryption implies doing RXCSUM. If the user
+does not want to enable RX csum offload, TLS RX device feature is disabled
+as well.
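
The checksum dependency added to tls-offload.rst above can be modelled in a few lines. This is a minimal user-space sketch, not kernel code: the ``F_*`` bits and ``fix_tls_features()`` are hypothetical names invented for illustration and do not correspond to the kernel's feature flag values.

```c
#include <stdint.h>

/* Hypothetical feature bits for this sketch; not the kernel's values. */
#define F_TX_CSUM (1u << 0)
#define F_RX_CSUM (1u << 1)
#define F_TLS_TX  (1u << 2)
#define F_TLS_RX  (1u << 3)

/*
 * Apply the dependency described in the text: TLS TX offload requires
 * TX checksum offload, and TLS RX offload requires RX checksum offload.
 * Returns the feature set with unsupported TLS bits cleared.
 */
uint32_t fix_tls_features(uint32_t features)
{
	if (!(features & F_TX_CSUM))
		features &= ~F_TLS_TX;
	if (!(features & F_RX_CSUM))
		features &= ~F_TLS_RX;
	return features;
}
```

Only the flags for new offloads are modelled here; per the text, connections that are already offloaded stay active after a flag is cleared.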
diff --git a/Documentation/networking/tls.rst b/Documentation/networking/tls.rst
index 8cb2cd4e2a80..658ed3a71e1b 100644
--- a/Documentation/networking/tls.rst
+++ b/Documentation/networking/tls.rst
@@ -214,6 +214,44 @@ of calling send directly after a handshake using gnutls.
Since it doesn't implement a full record layer, control
messages are not supported.
+Optional optimizations
+----------------------
+
+There are certain condition-specific optimizations the TLS ULP can make,
+if requested. Those optimizations are either not universally beneficial
+or may impact correctness, hence they require an opt-in.
+All options are set per-socket using setsockopt(), and their
+state can be checked using getsockopt() and via socket diag (``ss``).
+
+TLS_TX_ZEROCOPY_RO
+~~~~~~~~~~~~~~~~~~
+
+For device offload only. Allow sendfile() data to be transmitted directly
+to the NIC without making an in-kernel copy. This allows true zero-copy
+behavior when device offload is enabled.
+
+The application must make sure that the data is not modified between being
+submitted and transmission completing. In other words this is mostly
+applicable if the data sent on a socket via sendfile() is read-only.
+
+Modifying the data may result in different versions of the data being used
+for the original TCP transmission and TCP retransmissions. To the receiver
+this will look like TLS records had been tampered with and will result
+in record authentication failures.
+
+TLS_RX_EXPECT_NO_PAD
+~~~~~~~~~~~~~~~~~~~~
+
+TLS 1.3 only. Expect the sender to not pad records. This allows the data
+to be decrypted directly into user space buffers with TLS 1.3.
+
+This optimization is safe to enable only if the remote end is trusted,
+otherwise it is an attack vector for doubling the TLS processing cost.
+
+If the decrypted record turns out to have been padded or is not a data
+record it will be decrypted again into a kernel buffer without zero copy.
+Such events are counted in the ``TlsDecryptRetry`` statistic.
+
Statistics
==========
@@ -239,3 +277,12 @@ TLS implementation exposes the following per-namespace statistics
- ``TlsDeviceRxResync`` -
number of RX resyncs sent to NICs handling cryptography
+
+- ``TlsDecryptRetry`` -
+ number of RX records which had to be re-decrypted due to
+ ``TLS_RX_EXPECT_NO_PAD`` mis-prediction. Note that this counter will
+ also increment for non-data records.
+
+- ``TlsRxNoPadViolation`` -
+ number of data RX records which had to be re-decrypted due to
+ ``TLS_RX_EXPECT_NO_PAD`` mis-prediction.
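
The per-socket opt-in described under "Optional optimizations" in tls.rst could look roughly as follows. This is a sketch under assumptions: the socket is expected to already have the ``tls`` ULP attached and its crypto state configured, the helper names ``tls_opt_in()`` and ``tls_opt_state()`` are hypothetical, and the fallback constants are assumed to match the values in the kernel's ``linux/tls.h`` and socket headers.

```c
#include <sys/socket.h>

/* Fallbacks for older headers; assumed to match the kernel's values. */
#ifndef SOL_TLS
#define SOL_TLS 282
#endif
#ifndef TLS_TX_ZEROCOPY_RO
#define TLS_TX_ZEROCOPY_RO 3
#endif
#ifndef TLS_RX_EXPECT_NO_PAD
#define TLS_RX_EXPECT_NO_PAD 4
#endif

/*
 * Opt in to one of the optional optimizations on a kTLS socket.
 * Returns 0 on success, -1 with errno set on failure.
 */
int tls_opt_in(int fd, int optname)
{
	unsigned int one = 1;

	return setsockopt(fd, SOL_TLS, optname, &one, sizeof(one));
}

/* Read back an option's state: 1 = enabled, 0 = disabled, -1 = error. */
int tls_opt_state(int fd, int optname)
{
	unsigned int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, SOL_TLS, optname, &val, &len) < 0)
		return -1;
	return val ? 1 : 0;
}
```

A receiver that trusts its peer might call ``tls_opt_in(fd, TLS_RX_EXPECT_NO_PAD)`` once after the handshake and then watch the ``TlsDecryptRetry`` counter to see how often the prediction fails.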
diff --git a/Documentation/networking/tproxy.txt b/Documentation/networking/tproxy.rst
index b9a188823d9f..00dc3a1a66b4 100644
--- a/Documentation/networking/tproxy.txt
+++ b/Documentation/networking/tproxy.rst
@@ -1,3 +1,6 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
Transparent proxy support
=========================
@@ -11,39 +14,39 @@ From Linux 4.18 transparent proxy support is also available in nf_tables.
================================
The idea is that you identify packets with destination address matching a local
-socket on your box, set the packet mark to a certain value:
+socket on your box, set the packet mark to a certain value::
-# iptables -t mangle -N DIVERT
-# iptables -t mangle -A PREROUTING -p tcp -m socket -j DIVERT
-# iptables -t mangle -A DIVERT -j MARK --set-mark 1
-# iptables -t mangle -A DIVERT -j ACCEPT
+ # iptables -t mangle -N DIVERT
+ # iptables -t mangle -A PREROUTING -p tcp -m socket -j DIVERT
+ # iptables -t mangle -A DIVERT -j MARK --set-mark 1
+ # iptables -t mangle -A DIVERT -j ACCEPT
-Alternatively you can do this in nft with the following commands:
+Alternatively you can do this in nft with the following commands::
-# nft add table filter
-# nft add chain filter divert "{ type filter hook prerouting priority -150; }"
-# nft add rule filter divert meta l4proto tcp socket transparent 1 meta mark set 1 accept
+ # nft add table filter
+ # nft add chain filter divert "{ type filter hook prerouting priority -150; }"
+ # nft add rule filter divert meta l4proto tcp socket transparent 1 meta mark set 1 accept
And then match on that value using policy routing to have those packets
-delivered locally:
+delivered locally::
-# ip rule add fwmark 1 lookup 100
-# ip route add local 0.0.0.0/0 dev lo table 100
+ # ip rule add fwmark 1 lookup 100
+ # ip route add local 0.0.0.0/0 dev lo table 100
Because of certain restrictions in the IPv4 routing output code you'll have to
modify your application to allow it to send datagrams _from_ non-local IP
addresses. All you have to do is enable the (SOL_IP, IP_TRANSPARENT) socket
-option before calling bind:
-
-fd = socket(AF_INET, SOCK_STREAM, 0);
-/* - 8< -*/
-int value = 1;
-setsockopt(fd, SOL_IP, IP_TRANSPARENT, &value, sizeof(value));
-/* - 8< -*/
-name.sin_family = AF_INET;
-name.sin_port = htons(0xCAFE);
-name.sin_addr.s_addr = htonl(0xDEADBEEF);
-bind(fd, &name, sizeof(name));
+option before calling bind::
+
+ fd = socket(AF_INET, SOCK_STREAM, 0);
+ /* - 8< -*/
+ int value = 1;
+ setsockopt(fd, SOL_IP, IP_TRANSPARENT, &value, sizeof(value));
+ /* - 8< -*/
+ name.sin_family = AF_INET;
+ name.sin_port = htons(0xCAFE);
+ name.sin_addr.s_addr = htonl(0xDEADBEEF);
+ bind(fd, &name, sizeof(name));
A trivial patch for netcat is available here:
http://people.netfilter.org/hidden/tproxy/netcat-ip_transparent-support.patch
@@ -61,10 +64,10 @@ be able to find out the original destination address. Even in case of TCP
getting the original destination address is racy.)
The 'TPROXY' target provides similar functionality without relying on NAT. Simply
-add rules like this to the iptables ruleset above:
+add rules like this to the iptables ruleset above::
-# iptables -t mangle -A PREROUTING -p tcp --dport 80 -j TPROXY \
- --tproxy-mark 0x1/0x1 --on-port 50080
+ # iptables -t mangle -A PREROUTING -p tcp --dport 80 -j TPROXY \
+ --tproxy-mark 0x1/0x1 --on-port 50080
Or the following rule to nft:
@@ -82,10 +85,12 @@ nf_tables implementation.
====================================
To use tproxy you'll need to have the following modules compiled for iptables:
+
- NETFILTER_XT_MATCH_SOCKET
- NETFILTER_XT_TARGET_TPROXY
Or the following modules for nf_tables:
+
- NFT_SOCKET
- NFT_TPROXY
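
The ``IP_TRANSPARENT`` fragment in tproxy.rst omits declarations and error handling. A self-contained variant might look like the sketch below; the function name ``bind_transparent()`` is hypothetical, and the call is expected to fail without sufficient privileges (normally CAP_NET_ADMIN).

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SOL_IP
#define SOL_IP IPPROTO_IP	/* same level; fallback for strict modes */
#endif

/*
 * Create a TCP socket bound to a possibly non-local IPv4 address.
 * addr and port are given in host byte order.
 * Returns the file descriptor, or -1 on error.
 */
int bind_transparent(uint32_t addr, uint16_t port)
{
	struct sockaddr_in name;
	int value = 1;
	int fd;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	/* The option has to be enabled before bind() is called. */
	if (setsockopt(fd, SOL_IP, IP_TRANSPARENT, &value, sizeof(value)) < 0) {
		close(fd);
		return -1;
	}

	memset(&name, 0, sizeof(name));
	name.sin_family = AF_INET;
	name.sin_port = htons(port);
	name.sin_addr.s_addr = htonl(addr);

	if (bind(fd, (struct sockaddr *)&name, sizeof(name)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```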
diff --git a/Documentation/networking/tuntap.txt b/Documentation/networking/tuntap.rst
index 0104830d5075..4d7087f727be 100644
--- a/Documentation/networking/tuntap.txt
+++ b/Documentation/networking/tuntap.rst
@@ -1,20 +1,28 @@
-Universal TUN/TAP device driver.
-Copyright (C) 1999-2000 Maxim Krasnyansky <max_mk@yahoo.com>
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
- Linux, Solaris drivers
- Copyright (C) 1999-2000 Maxim Krasnyansky <max_mk@yahoo.com>
+===============================
+Universal TUN/TAP device driver
+===============================
- FreeBSD TAP driver
- Copyright (c) 1999-2000 Maksim Yevmenkin <m_evmenkin@yahoo.com>
+Copyright |copy| 1999-2000 Maxim Krasnyansky <max_mk@yahoo.com>
+
+ Linux, Solaris drivers
+ Copyright |copy| 1999-2000 Maxim Krasnyansky <max_mk@yahoo.com>
+
+ FreeBSD TAP driver
+ Copyright |copy| 1999-2000 Maksim Yevmenkin <m_evmenkin@yahoo.com>
Revision of this document 2002 by Florian Thiel <florian.thiel@gmx.net>
1. Description
- TUN/TAP provides packet reception and transmission for user space programs.
+==============
+
+ TUN/TAP provides packet reception and transmission for user space programs.
It can be seen as a simple Point-to-Point or Ethernet device, which,
- instead of receiving packets from physical media, receives them from
- user space program and instead of sending packets via physical media
- writes them to the user space program.
+ instead of receiving packets from physical media, receives them from
+ user space program and instead of sending packets via physical media
+ writes them to the user space program.
In order to use the driver a program has to open /dev/net/tun and issue a
corresponding ioctl() to register a network device with the kernel. A network
@@ -33,41 +41,51 @@ Copyright (C) 1999-2000 Maxim Krasnyansky <max_mk@yahoo.com>
br_sigio.c - bridge based on async io and SIGIO signal.
However, the best example is VTun http://vtun.sourceforge.net :))
-2. Configuration
- Create device node:
+2. Configuration
+================
+
+ Create device node::
+
mkdir /dev/net (if it doesn't exist already)
mknod /dev/net/tun c 10 200
-
- Set permissions:
+
+ Set permissions::
+
e.g. chmod 0666 /dev/net/tun
- There's no harm in allowing the device to be accessible by non-root users,
- since CAP_NET_ADMIN is required for creating network devices or for
- connecting to network devices which aren't owned by the user in question.
- If you want to create persistent devices and give ownership of them to
- unprivileged users, then you need the /dev/net/tun device to be usable by
- those users.
+
+ There's no harm in allowing the device to be accessible by non-root users,
+ since CAP_NET_ADMIN is required for creating network devices or for
+ connecting to network devices which aren't owned by the user in question.
+ If you want to create persistent devices and give ownership of them to
+ unprivileged users, then you need the /dev/net/tun device to be usable by
+ those users.
Driver module autoloading
Make sure that "Kernel module loader" - module auto-loading
support is enabled in your kernel. The kernel should load it on
first access.
-
- Manual loading
- insert the module by hand:
- modprobe tun
+
+ Manual loading
+
+ insert the module by hand::
+
+ modprobe tun
If you do it the latter way, you have to load the module every time you
need it, if you do it the other way it will be automatically loaded when
/dev/net/tun is being opened.
-3. Program interface
- 3.1 Network device allocation:
+3. Program interface
+====================
+
+3.1 Network device allocation
+-----------------------------
- char *dev should be the name of the device with a format string (e.g.
- "tun%d"), but (as far as I can see) this can be any valid network device name.
- Note that the character pointer becomes overwritten with the real device name
- (e.g. "tun0")
+``char *dev`` should be the name of the device with a format string (e.g.
+"tun%d"), but (as far as I can see) this can be any valid network device name.
+Note that the character pointer becomes overwritten with the real device name
+(e.g. "tun0")::
#include <linux/if.h>
#include <linux/if_tun.h>
@@ -78,45 +96,51 @@ Copyright (C) 1999-2000 Maxim Krasnyansky <max_mk@yahoo.com>
int fd, err;
if( (fd = open("/dev/net/tun", O_RDWR)) < 0 )
- return tun_alloc_old(dev);
+ return tun_alloc_old(dev);
memset(&ifr, 0, sizeof(ifr));
- /* Flags: IFF_TUN - TUN device (no Ethernet headers)
- * IFF_TAP - TAP device
+ /* Flags: IFF_TUN - TUN device (no Ethernet headers)
+ * IFF_TAP - TAP device
*
- * IFF_NO_PI - Do not provide packet information
- */
- ifr.ifr_flags = IFF_TUN;
+ * IFF_NO_PI - Do not provide packet information
+ */
+ ifr.ifr_flags = IFF_TUN;
if( *dev )
- strncpy(ifr.ifr_name, dev, IFNAMSIZ);
+ strncpy(ifr.ifr_name, dev, IFNAMSIZ - 1);
if( (err = ioctl(fd, TUNSETIFF, (void *) &ifr)) < 0 ){
- close(fd);
- return err;
+ close(fd);
+ return err;
}
strcpy(dev, ifr.ifr_name);
return fd;
- }
-
- 3.2 Frame format:
- If flag IFF_NO_PI is not set each frame format is:
+ }
+
+3.2 Frame format
+----------------
+
+If flag IFF_NO_PI is not set each frame format is::
+
Flags [2 bytes]
Proto [2 bytes]
Raw protocol(IP, IPv6, etc) frame.
- 3.3 Multiqueue tuntap interface:
+3.3 Multiqueue tuntap interface
+-------------------------------
+
+From version 3.8, Linux supports multiqueue tuntap which can use multiple
+file descriptors (queues) to parallelize packet sending or receiving. The
+device allocation is the same as before, and if the user wants to create
+multiple queues, TUNSETIFF with the same device name must be called many times
+with the IFF_MULTI_QUEUE flag.
- From version 3.8, Linux supports multiqueue tuntap which can uses multiple
- file descriptors (queues) to parallelize packets sending or receiving. The
- device allocation is the same as before, and if user wants to create multiple
- queues, TUNSETIFF with the same device name must be called many times with
- IFF_MULTI_QUEUE flag.
+``char *dev`` should be the name of the device, ``queues`` is the number of
+queues to be created, and ``fds`` is used to store and return the created file
+descriptors (queues) to the caller. Each file descriptor serves as the
+interface of a queue which can be accessed by userspace.
- char *dev should be the name of the device, queues is the number of queues to
- be created, fds is used to store and return the file descriptors (queues)
- created to the caller. Each file descriptor were served as the interface of a
- queue which could be accessed by userspace.
+::
#include <linux/if.h>
#include <linux/if_tun.h>
@@ -127,7 +151,7 @@ Copyright (C) 1999-2000 Maxim Krasnyansky <max_mk@yahoo.com>
int fd, err, i;
if (!dev)
- return -1;
+ return -1;
memset(&ifr, 0, sizeof(ifr));
/* Flags: IFF_TUN - TUN device (no Ethernet headers)
@@ -140,30 +164,30 @@ Copyright (C) 1999-2000 Maxim Krasnyansky <max_mk@yahoo.com>
strcpy(ifr.ifr_name, dev);
for (i = 0; i < queues; i++) {
- if ((fd = open("/dev/net/tun", O_RDWR)) < 0)
- goto err;
- err = ioctl(fd, TUNSETIFF, (void *)&ifr);
- if (err) {
- close(fd);
- goto err;
- }
- fds[i] = fd;
+ if ((fd = open("/dev/net/tun", O_RDWR)) < 0)
+ goto err;
+ err = ioctl(fd, TUNSETIFF, (void *)&ifr);
+ if (err) {
+ close(fd);
+ goto err;
+ }
+ fds[i] = fd;
}
return 0;
err:
for (--i; i >= 0; i--)
- close(fds[i]);
+ close(fds[i]);
return err;
}
- A new ioctl(TUNSETQUEUE) were introduced to enable or disable a queue. When
- calling it with IFF_DETACH_QUEUE flag, the queue were disabled. And when
- calling it with IFF_ATTACH_QUEUE flag, the queue were enabled. The queue were
- enabled by default after it was created through TUNSETIFF.
+A new ioctl(TUNSETQUEUE) was introduced to enable or disable a queue. Calling
+it with the IFF_DETACH_QUEUE flag disables the queue, and calling it with the
+IFF_ATTACH_QUEUE flag enables it. A queue is enabled by default after it is
+created through TUNSETIFF.
- fd is the file descriptor (queue) that we want to enable or disable, when
- enable is true we enable it, otherwise we disable it
+``fd`` is the file descriptor (queue) that we want to enable or disable; when
+``enable`` is true we enable it, otherwise we disable it::
#include <linux/if.h>
#include <linux/if_tun.h>
@@ -175,53 +199,61 @@ Copyright (C) 1999-2000 Maxim Krasnyansky <max_mk@yahoo.com>
memset(&ifr, 0, sizeof(ifr));
if (enable)
- ifr.ifr_flags = IFF_ATTACH_QUEUE;
+ ifr.ifr_flags = IFF_ATTACH_QUEUE;
else
- ifr.ifr_flags = IFF_DETACH_QUEUE;
+ ifr.ifr_flags = IFF_DETACH_QUEUE;
return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
}
-Universal TUN/TAP device driver Frequently Asked Question.
-
+Universal TUN/TAP device driver Frequently Asked Questions
+==========================================================
+
1. What platforms are supported by TUN/TAP driver ?
+
Currently driver has been written for 3 Unices:
- Linux kernels 2.2.x, 2.4.x
- FreeBSD 3.x, 4.x, 5.x
- Solaris 2.6, 7.0, 8.0
+
+ - Linux kernels 2.2.x, 2.4.x
+ - FreeBSD 3.x, 4.x, 5.x
+ - Solaris 2.6, 7.0, 8.0
2. What is TUN/TAP driver used for?
-As mentioned above, main purpose of TUN/TAP driver is tunneling.
+
+As mentioned above, main purpose of TUN/TAP driver is tunneling.
It is used by VTun (http://vtun.sourceforge.net).
Another interesting application using TUN/TAP is pipsecd
(http://perso.enst.fr/~beyssac/pipsec/), a userspace IPSec
implementation that can use complete kernel routing (unlike FreeS/WAN).
-3. How does Virtual network device actually work ?
+3. How does Virtual network device actually work ?
+
Virtual network device can be viewed as a simple Point-to-Point or
-Ethernet device, which instead of receiving packets from a physical
-media, receives them from user space program and instead of sending
-packets via physical media sends them to the user space program.
+Ethernet device, which instead of receiving packets from a physical
+media, receives them from user space program and instead of sending
+packets via physical media sends them to the user space program.
Let's say that you configured IPv6 on the tap0, then whenever
the kernel sends an IPv6 packet to tap0, it is passed to the application
-(VTun for example). The application encrypts, compresses and sends it to
+(VTun for example). The application encrypts, compresses and sends it to
the other side over TCP or UDP. The application on the other side decompresses
-and decrypts the data received and writes the packet to the TAP device,
+and decrypts the data received and writes the packet to the TAP device,
the kernel handles the packet like it came from real physical device.
4. What is the difference between TUN driver and TAP driver?
+
TUN works with IP frames. TAP works with Ethernet frames.
This means that you have to read/write IP packets when you are using tun and
ethernet frames when using tap.
5. What is the difference between BPF and TUN/TAP driver?
+
BPF is an advanced packet filter. It can be attached to existing
network interface. It does not provide a virtual network interface.
A TUN/TAP driver does provide a virtual network interface and it is possible
to attach BPF to this interface.
6. Does TAP driver support kernel Ethernet bridging?
-Yes. Linux and FreeBSD drivers support Ethernet bridging.
+
+Yes. Linux and FreeBSD drivers support Ethernet bridging.
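
The frame format from section 3.2 of tuntap.rst (2 bytes of flags, 2 bytes of protocol, then the raw frame) can be decoded as sketched below. The names ``struct pi_hdr`` and ``parse_pi()`` are hypothetical; the sketch assumes the protocol field is an EtherType in network byte order and the flags field is in host byte order, as in ``struct tun_pi`` from ``linux/if_tun.h``.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Header that precedes each frame when IFF_NO_PI is not set. */
struct pi_hdr {
	uint16_t flags;	/* host byte order */
	uint16_t proto;	/* converted to host byte order by parse_pi() */
};

/*
 * Split a buffer read from the tun/tap file descriptor into header and
 * payload. Returns a pointer to the raw protocol frame and stores its
 * length in *payload_len, or returns NULL if the buffer is too short.
 */
const uint8_t *parse_pi(const uint8_t *buf, size_t len,
			struct pi_hdr *hdr, size_t *payload_len)
{
	if (len < 4)
		return NULL;

	memcpy(&hdr->flags, buf, sizeof(hdr->flags));
	/* Assemble the big-endian protocol field byte by byte. */
	hdr->proto = (uint16_t)((buf[2] << 8) | buf[3]);
	*payload_len = len - 4;
	return buf + 4;
}
```

With this layout an IPv6 packet read from a TUN device would yield ``proto == 0x86dd`` followed by the IPv6 header.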
diff --git a/Documentation/networking/udplite.txt b/Documentation/networking/udplite.rst
index 53a726855e49..2c225f28b7b2 100644
--- a/Documentation/networking/udplite.txt
+++ b/Documentation/networking/udplite.rst
@@ -1,6 +1,8 @@
- ===========================================================================
- The UDP-Lite protocol (RFC 3828)
- ===========================================================================
+.. SPDX-License-Identifier: GPL-2.0
+
+================================
+The UDP-Lite protocol (RFC 3828)
+================================
UDP-Lite is a Standards-Track IETF transport protocol whose characteristic
@@ -11,39 +13,43 @@
This file briefly describes the existing kernel support and the socket API.
For in-depth information, you can consult:
- o The UDP-Lite Homepage:
- http://web.archive.org/web/*/http://www.erg.abdn.ac.uk/users/gerrit/udp-lite/
- From here you can also download some example application source code.
+ - The UDP-Lite Homepage:
+ http://web.archive.org/web/%2E/http://www.erg.abdn.ac.uk/users/gerrit/udp-lite/
+
+ From here you can also download some example application source code.
- o The UDP-Lite HOWTO on
- http://web.archive.org/web/*/http://www.erg.abdn.ac.uk/users/gerrit/udp-lite/
- files/UDP-Lite-HOWTO.txt
+ - The UDP-Lite HOWTO on
+ http://web.archive.org/web/%2E/http://www.erg.abdn.ac.uk/users/gerrit/udp-lite/files/UDP-Lite-HOWTO.txt
- o The Wireshark UDP-Lite WiKi (with capture files):
- https://wiki.wireshark.org/Lightweight_User_Datagram_Protocol
+ - The Wireshark UDP-Lite WiKi (with capture files):
+ https://wiki.wireshark.org/Lightweight_User_Datagram_Protocol
- o The Protocol Spec, RFC 3828, http://www.ietf.org/rfc/rfc3828.txt
+ - The Protocol Spec, RFC 3828, http://www.ietf.org/rfc/rfc3828.txt
- I) APPLICATIONS
+1. Applications
+===============
Several applications have been ported successfully to UDP-Lite. Ethereal
- (now called wireshark) has UDP-Litev4/v6 support by default.
+ (now called wireshark) has UDP-Litev4/v6 support by default.
+
Porting applications to UDP-Lite is straightforward: only socket level and
IPPROTO need to be changed; senders additionally set the checksum coverage
length (default = header length = 8). Details are in the next section.
-
- II) PROGRAMMING API
+2. Programming API
+==================
UDP-Lite provides a connectionless, unreliable datagram service and hence
uses the same socket type as UDP. In fact, porting from UDP to UDP-Lite is
- very easy: simply add `IPPROTO_UDPLITE' as the last argument of the socket(2)
- call so that the statement looks like:
+ very easy: simply add ``IPPROTO_UDPLITE`` as the last argument of the
+ socket(2) call so that the statement looks like::
s = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDPLITE);
- or, respectively,
+ or, respectively,
+
+ ::
s = socket(PF_INET6, SOCK_DGRAM, IPPROTO_UDPLITE);
@@ -56,10 +62,10 @@
* Sender checksum coverage: UDPLITE_SEND_CSCOV
- For example,
+ For example::
- int val = 20;
- setsockopt(s, SOL_UDPLITE, UDPLITE_SEND_CSCOV, &val, sizeof(int));
+ int val = 20;
+ setsockopt(s, SOL_UDPLITE, UDPLITE_SEND_CSCOV, &val, sizeof(int));
sets the checksum coverage length to 20 bytes (12b data + 8b header).
Of each packet only the first 20 bytes (plus the pseudo-header) will be
@@ -74,10 +80,10 @@
that of a traffic filter: when enabled, it instructs the kernel to drop
all packets which have a coverage _less_ than this value. For example, if
RTP and UDP headers are to be protected, a receiver can enforce that only
- packets with a minimum coverage of 20 are admitted:
+ packets with a minimum coverage of 20 are admitted::
- int min = 20;
- setsockopt(s, SOL_UDPLITE, UDPLITE_RECV_CSCOV, &min, sizeof(int));
+ int min = 20;
+ setsockopt(s, SOL_UDPLITE, UDPLITE_RECV_CSCOV, &min, sizeof(int));
The calls to getsockopt(2) are analogous. Being an extension and not a stand-
alone protocol, all socket options known from UDP can be used in exactly the
@@ -85,18 +91,18 @@
A detailed discussion of UDP-Lite checksum coverage options is in section IV.
-
- III) HEADER FILES
+3. Header Files
+===============
The socket API requires support through header files in /usr/include:
* /usr/include/netinet/in.h
- to define IPPROTO_UDPLITE
+ to define IPPROTO_UDPLITE
* /usr/include/netinet/udplite.h
- for UDP-Lite header fields and protocol constants
+ for UDP-Lite header fields and protocol constants
- For testing purposes, the following can serve as a `mini' header file:
+ For testing purposes, the following can serve as a ``mini`` header file::
#define IPPROTO_UDPLITE 136
#define SOL_UDPLITE 136
@@ -105,8 +111,9 @@
Ready-made header files for various distros are in the UDP-Lite tarball.
+4. Kernel Behaviour with Regard to the Various Socket Options
+=============================================================
- IV) KERNEL BEHAVIOUR WITH REGARD TO THE VARIOUS SOCKET OPTIONS
To enable debugging messages, the log level need to be set to 8, as most
messages use the KERN_DEBUG level (7).
@@ -136,13 +143,13 @@
3) Disabling the Checksum Computation
On both sender and receiver, checksumming will always be performed
- and cannot be disabled using SO_NO_CHECK. Thus
+ and cannot be disabled using SO_NO_CHECK. Thus::
- setsockopt(sockfd, SOL_SOCKET, SO_NO_CHECK, ... );
+ setsockopt(sockfd, SOL_SOCKET, SO_NO_CHECK, ... );
- will always will be ignored, while the value of
+ will always be ignored, while the value of::
- getsockopt(sockfd, SOL_SOCKET, SO_NO_CHECK, &value, ...);
+ getsockopt(sockfd, SOL_SOCKET, SO_NO_CHECK, &value, ...);
is meaningless (as in TCP). Packets with a zero checksum field are
illegal (cf. RFC 3828, sec. 3.1) and will be silently discarded.
@@ -167,15 +174,15 @@
first one contains the L4 header.
The send buffer size has implications on the checksum coverage length.
- Consider the following example:
+ Consider the following example::
- Payload: 1536 bytes Send Buffer: 1024 bytes
- MTU: 1500 bytes Coverage Length: 856 bytes
+ Payload: 1536 bytes Send Buffer: 1024 bytes
+ MTU: 1500 bytes Coverage Length: 856 bytes
- UDP-Lite will ship the 1536 bytes in two separate packets:
+ UDP-Lite will ship the 1536 bytes in two separate packets::
- Packet 1: 1024 payload + 8 byte header + 20 byte IP header = 1052 bytes
- Packet 2: 512 payload + 8 byte header + 20 byte IP header = 540 bytes
+ Packet 1: 1024 payload + 8 byte header + 20 byte IP header = 1052 bytes
+ Packet 2: 512 payload + 8 byte header + 20 byte IP header = 540 bytes
The coverage packet covers the UDP-Lite header and 848 bytes of the
payload in the first packet, the second packet is fully covered. Note
@@ -184,17 +191,17 @@
length in such cases.
As an example of what happens when one UDP-Lite packet is split into
- several tiny fragments, consider the following example.
+ several tiny fragments, consider the following example::
- Payload: 1024 bytes Send buffer size: 1024 bytes
- MTU: 300 bytes Coverage length: 575 bytes
+ Payload: 1024 bytes Send buffer size: 1024 bytes
+ MTU: 300 bytes Coverage length: 575 bytes
- +-+-----------+--------------+--------------+--------------+
- |8| 272 | 280 | 280 | 280 |
- +-+-----------+--------------+--------------+--------------+
- 280 560 840 1032
- ^
- *****checksum coverage*************
+ +-+-----------+--------------+--------------+--------------+
+ |8| 272 | 280 | 280 | 280 |
+ +-+-----------+--------------+--------------+--------------+
+ 280 560 840 1032
+ ^
+ *****checksum coverage*************
The UDP-Lite module generates one 1032 byte packet (1024 + 8 byte
header). According to the interface MTU, these are split into 4 IP
@@ -208,7 +215,7 @@
lengths), only the first fragment needs to be considered. When using
larger checksum coverage lengths, each eligible fragment needs to be
checksummed. Suppose we have a checksum coverage of 3062. The buffer
- of 3356 bytes will be split into the following fragments:
+ of 3356 bytes will be split into the following fragments::
Fragment 1: 1280 bytes carrying 1232 bytes of UDP-Lite data
Fragment 2: 1280 bytes carrying 1232 bytes of UDP-Lite data
@@ -222,57 +229,63 @@
performance over wireless (or generally noisy) links and thus smaller
coverage lengths are likely to be expected.
-
- V) UDP-LITE RUNTIME STATISTICS AND THEIR MEANING
+5. UDP-Lite Runtime Statistics and their Meaning
+================================================
Exceptional and error conditions are logged to syslog at the KERN_DEBUG
level. Live statistics about UDP-Lite are available in /proc/net/snmp
- and can (with newer versions of netstat) be viewed using
+ and can (with newer versions of netstat) be viewed using::
- netstat -svu
+ netstat -svu
This displays UDP-Lite statistics variables, whose meaning is as follows.
- InDatagrams: The total number of datagrams delivered to users.
+ ============ =====================================================
+ InDatagrams The total number of datagrams delivered to users.
- NoPorts: Number of packets received to an unknown port.
- These cases are counted separately (not as InErrors).
+ NoPorts Number of packets received to an unknown port.
+ These cases are counted separately (not as InErrors).
- InErrors: Number of erroneous UDP-Lite packets. Errors include:
- * internal socket queue receive errors
- * packet too short (less than 8 bytes or stated
- coverage length exceeds received length)
- * xfrm4_policy_check() returned with error
- * application has specified larger min. coverage
- length than that of incoming packet
- * checksum coverage violated
- * bad checksum
+ InErrors Number of erroneous UDP-Lite packets. Errors include:
- OutDatagrams: Total number of sent datagrams.
+ * internal socket queue receive errors
+ * packet too short (less than 8 bytes or stated
+ coverage length exceeds received length)
+ * xfrm4_policy_check() returned with error
+ * application has specified larger min. coverage
+ length than that of incoming packet
+ * checksum coverage violated
+ * bad checksum
- These statistics derive from the UDP MIB (RFC 2013).
+ OutDatagrams Total number of sent datagrams.
+ ============ =====================================================
+ These statistics derive from the UDP MIB (RFC 2013).
- VI) IPTABLES
+6. IPtables
+===========
There is packet match support for UDP-Lite as well as support for the LOG target.
- If you copy and paste the following line into /etc/protocols,
+ If you copy and paste the following line into /etc/protocols::
- udplite 136 UDP-Lite # UDP-Lite [RFC 3828]
+ udplite 136 UDP-Lite # UDP-Lite [RFC 3828]
- then
- iptables -A INPUT -p udplite -j LOG
+ then::
- will produce logging output to syslog. Dropping and rejecting packets also works.
+ iptables -A INPUT -p udplite -j LOG
+ will produce logging output to syslog. Dropping and rejecting packets also works.
- VII) MAINTAINER ADDRESS
+7. Maintainer Address
+=====================
The UDP-Lite patch was developed at
- University of Aberdeen
- Electronics Research Group
- Department of Engineering
- Fraser Noble Building
- Aberdeen AB24 3UE; UK
+
+ University of Aberdeen
+ Electronics Research Group
+ Department of Engineering
+ Fraser Noble Building
+ Aberdeen AB24 3UE; UK
+
The current maintainer is Gerrit Renker, <gerrit@erg.abdn.ac.uk>. Initial
code was developed by William Stanislaus, <william@erg.abdn.ac.uk>.
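
The checksum coverage arithmetic used in the udplite.rst examples can be reproduced with two small helpers. This is a sketch under assumptions: the function names are hypothetical, a coverage value of 0 is taken to mean "whole packet" as in RFC 3828, and values below the 8-byte header are treated as header-only.

```c
#include <stddef.h>

#define UDPLITE_HDR_LEN 8	/* UDP-Lite header size in bytes */

/*
 * Number of payload bytes protected by the checksum in one packet.
 * cscov is the coverage length including the header; payload_len is
 * the payload size of the packet.
 */
size_t covered_payload(size_t cscov, size_t payload_len)
{
	size_t total = UDPLITE_HDR_LEN + payload_len;

	if (cscov == 0 || cscov >= total)
		return payload_len;	/* whole packet is covered */
	if (cscov < UDPLITE_HDR_LEN)
		return 0;		/* sketch choice: header only */
	return cscov - UDPLITE_HDR_LEN;
}

/*
 * Number of IP fragments the checksum has to touch when a datagram is
 * split into fragments carrying frag_payload bytes each. cscov must be
 * non-zero and must not exceed the datagram length.
 */
size_t fragments_to_checksum(size_t cscov, size_t frag_payload)
{
	return (cscov + frag_payload - 1) / frag_payload;
}
```

For the send-buffer example, a coverage of 856 protects 848 payload bytes of the 1024-byte packet and all of the 512-byte packet; for the 300-byte MTU example, a coverage of 575 reaches into the third 280-byte fragment.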
diff --git a/Documentation/networking/vrf.rst b/Documentation/networking/vrf.rst
new file mode 100644
index 000000000000..0a9a6f968cb9
--- /dev/null
+++ b/Documentation/networking/vrf.rst
@@ -0,0 +1,464 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================
+Virtual Routing and Forwarding (VRF)
+====================================
+
+The VRF Device
+==============
+
+The VRF device combined with ip rules provides the ability to create virtual
+routing and forwarding domains (aka VRFs, VRF-lite to be specific) in the
+Linux network stack. One use case is the multi-tenancy problem where each
+tenant has their own unique routing tables and at the very least needs
+different default gateways.
+
+Processes can be "VRF aware" by binding a socket to the VRF device. Packets
+through the socket then use the routing table associated with the VRF
+device. An important feature of the VRF device implementation is that it
+impacts only Layer 3 and above so L2 tools (e.g., LLDP) are not affected
+(i.e., they do not need to be run in each VRF). The design also allows
+the use of higher priority ip rules (Policy Based Routing, PBR) to take
+precedence over the VRF device rules directing specific traffic as desired.
+
+In addition, VRF devices allow VRFs to be nested within namespaces. For
+example network namespaces provide separation of network interfaces at the
+device layer, VLANs on the interfaces within a namespace provide L2 separation
+and then VRF devices provide L3 separation.
+
+Design
+------
+A VRF device is created with an associated route table. Network interfaces
+are then enslaved to a VRF device::
+
+    +-----------------------------+
+    |           vrf-blue          |  ===> route table 10
+    +-----------------------------+
+       |        |                |
+    +------+ +------+     +-------------+
+    | eth1 | | eth2 | ... |    bond1    |
+    +------+ +------+     +-------------+
+                            |        |
+                         +------+ +------+
+                         | eth8 | | eth9 |
+                         +------+ +------+
+
+Packets received on an enslaved device are switched to the VRF device
+in the IPv4 and IPv6 processing stacks, giving the impression that packets
+flow through the VRF device. Similarly, on egress, routing rules are used to
+send packets to the VRF device driver before getting sent out the actual
+interface. This allows tcpdump on a VRF device to capture all packets into
+and out of the VRF as a whole\ [1]_. Similarly, netfilter\ [2]_ and tc rules
+can be applied using the VRF device to specify rules that apply to the VRF
+domain as a whole.
+
+.. [1] Packets in the forwarded state do not flow through the device, so those
+ packets are not seen by tcpdump. Will revisit this limitation in a
+ future release.
+
+.. [2] Iptables on ingress supports PREROUTING with skb->dev set to the real
+ ingress device and both INPUT and PREROUTING rules with skb->dev set to
+ the VRF device. For egress, POSTROUTING and OUTPUT rules can be written
+ using either the VRF device or real egress device.
+
+Setup
+-----
+1. VRF device is created with an association to a FIB table.
+ e.g.::
+
+ ip link add vrf-blue type vrf table 10
+ ip link set dev vrf-blue up
+
+2. An l3mdev FIB rule directs lookups to the table associated with the device.
+ A single l3mdev rule is sufficient for all VRFs. The VRF device adds the
+ l3mdev rule for IPv4 and IPv6 when the first device is created with a
+ default preference of 1000. Users may delete the rule if desired and re-add
+ it with a different priority, or install per-VRF rules.
+
+ Prior to the v4.8 kernel iif and oif rules are needed for each VRF device::
+
+ ip ru add oif vrf-blue table 10
+ ip ru add iif vrf-blue table 10
+
+3. Set the default route for the table (and hence default route for the VRF)::
+
+ ip route add table 10 unreachable default metric 4278198272
+
+ This high metric value ensures that the default unreachable route can
+ be overridden by a routing protocol suite. FRRouting interprets
+ kernel metrics as a combined admin distance (upper byte) and priority
+ (lower 3 bytes). Thus the above metric translates to [255/8192].
+
+4. Enslave L3 interfaces to a VRF device::
+
+ ip link set dev eth1 master vrf-blue
+
+ Local and connected routes for enslaved devices are automatically moved to
+ the table associated with VRF device. Any additional routes depending on
+ the enslaved device are dropped and will need to be reinserted to the VRF
+ FIB table following the enslavement.
+
+ The IPv6 sysctl option keep_addr_on_down can be enabled to keep IPv6 global
+ addresses as VRF enslavement changes::
+
+ sysctl -w net.ipv6.conf.all.keep_addr_on_down=1
+
+5. Additional VRF routes are added to the associated table::
+
+ ip route add table 10 ...
+
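The metric arithmetic in step 3 can be checked with a short C sketch. It splits a 32-bit metric into the two fields described above (upper byte as admin distance, lower 3 bytes as priority) and recombines them. The function names are hypothetical and chosen for illustration only; they are not part of the kernel, iproute2, or FRRouting.

```c
#include <assert.h>
#include <stdint.h>

/* Upper byte of the metric, read as the admin distance. */
uint8_t metric_admin_distance(uint32_t metric)
{
    return (uint8_t)(metric >> 24);
}

/* Lower 3 bytes of the metric, read as the priority. */
uint32_t metric_priority(uint32_t metric)
{
    return metric & 0x00FFFFFFu;
}

/* Recombine the two fields; the priority must fit in 24 bits. */
uint32_t metric_compose(uint8_t distance, uint32_t priority)
{
    assert(priority <= 0x00FFFFFFu);
    return ((uint32_t)distance << 24) | priority;
}
```

With these helpers, 4278198272 (0xFF002000) decodes to distance 255 and priority 8192, which matches the [255/8192] notation used in step 3.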
+
+Applications
+------------
+Applications that are to work within a VRF need to bind their socket to the
+VRF device::
+
+ setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1);
+
+or to specify the output device using cmsg and IP_PKTINFO.
+
+By default the scope of the port bindings for unbound sockets is
+limited to the default VRF. That is, it will not be matched by packets
+arriving on interfaces enslaved to an l3mdev and processes may bind to
+the same port if they bind to an l3mdev.
+
+TCP & UDP services running in the default VRF context (i.e., not bound
+to any VRF device) can work across all VRF domains by enabling the
+tcp_l3mdev_accept and udp_l3mdev_accept sysctl options::
+
+ sysctl -w net.ipv4.tcp_l3mdev_accept=1
+ sysctl -w net.ipv4.udp_l3mdev_accept=1
+
+These options are disabled by default so that a socket in a VRF is only
+selected for packets in that VRF. There is a similar option for RAW
+sockets, which is enabled by default for reasons of backwards compatibility.
+This makes it possible to specify the output device with cmsg and IP_PKTINFO
+while using a socket not bound to the corresponding VRF. For example, older
+ping implementations can be invoked by specifying the device, without being
+run in the VRF. This option can be disabled so that packets received in a VRF
+context are only handled by a raw socket bound to the VRF, and packets in the
+default VRF are only handled by a socket not bound to any VRF::
+
+ sysctl -w net.ipv4.raw_l3mdev_accept=0
+
+netfilter rules on the VRF device can be used to limit access to services
+running in the default VRF context as well.
+
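A program that depends on the l3mdev_accept settings above may want to check them at startup. The sketch below reads an integer sysctl through procfs, where a name such as net.ipv4.tcp_l3mdev_accept corresponds to the file /proc/sys/net/ipv4/tcp_l3mdev_accept. The helper name read_sysctl_int is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a single integer from a sysctl file under /proc/sys.
 * Returns 0 and stores the value on success, -1 if the file cannot be
 * opened or does not start with an integer.
 */
int read_sysctl_int(const char *path, int *value)
{
    FILE *f;
    int ok;

    assert(path != NULL && value != NULL);

    f = fopen(path, "r");
    if (f == NULL)
        return -1;

    ok = (fscanf(f, "%d", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

For example, read_sysctl_int("/proc/sys/net/ipv4/tcp_l3mdev_accept", &v) leaves v at 0 when the option is disabled, which the text describes as the default for TCP and UDP.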
+Using VRF-aware applications (applications which simultaneously create sockets
+outside and inside VRFs) in conjunction with ``net.ipv4.tcp_l3mdev_accept=1``
+is possible but may lead to problems in some situations. With that sysctl
+value, it is unspecified which listening socket will be selected to handle
+connections for VRF traffic; i.e., either a socket bound to the VRF or an unbound
+socket may be used to accept new connections from a VRF. This somewhat
+unexpected behavior can lead to problems if sockets are configured with extra
+options (e.g., TCP MD5 keys) with the expectation that VRF traffic will
+exclusively be handled by sockets bound to VRFs, as would be the case with
+``net.ipv4.tcp_l3mdev_accept=0``. Finally and as a reminder, regardless of
+which listening socket is selected, established sockets will be created in the
+VRF based on the ingress interface, as documented earlier.
+
+--------------------------------------------------------------------------------
+
+Using iproute2 for VRFs
+=======================
+iproute2 supports the vrf keyword as of v4.7. For backwards compatibility this
+section lists both commands where appropriate -- with the vrf keyword and the
+older form without it.
+
+1. Create a VRF
+
+ To instantiate a VRF device and associate it with a table::
+
+ $ ip link add dev NAME type vrf table ID
+
+ As of v4.8 the kernel supports the l3mdev FIB rule where a single rule
+ covers all VRFs. The l3mdev rule is created for IPv4 and IPv6 on first
+ device create.
+
+2. List VRFs
+
+ To list VRFs that have been created::
+
+ $ ip [-d] link show type vrf
+ NOTE: The -d option is needed to show the table id
+
+ For example::
+
+ $ ip -d link show type vrf
+ 11: mgmt: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
+ link/ether 72:b3:ba:91:e2:24 brd ff:ff:ff:ff:ff:ff promiscuity 0
+ vrf table 1 addrgenmode eui64
+ 12: red: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
+ link/ether b6:6f:6e:f6:da:73 brd ff:ff:ff:ff:ff:ff promiscuity 0
+ vrf table 10 addrgenmode eui64
+ 13: blue: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
+ link/ether 36:62:e8:7d:bb:8c brd ff:ff:ff:ff:ff:ff promiscuity 0
+ vrf table 66 addrgenmode eui64
+ 14: green: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
+ link/ether e6:28:b8:63:70:bb brd ff:ff:ff:ff:ff:ff promiscuity 0
+ vrf table 81 addrgenmode eui64
+
+
+ Or in brief output::
+
+ $ ip -br link show type vrf
+ mgmt UP 72:b3:ba:91:e2:24 <NOARP,MASTER,UP,LOWER_UP>
+ red UP b6:6f:6e:f6:da:73 <NOARP,MASTER,UP,LOWER_UP>
+ blue UP 36:62:e8:7d:bb:8c <NOARP,MASTER,UP,LOWER_UP>
+ green UP e6:28:b8:63:70:bb <NOARP,MASTER,UP,LOWER_UP>
+
+
+3. Assign a Network Interface to a VRF
+
+ Network interfaces are assigned to a VRF by enslaving the netdevice to a
+ VRF device::
+
+ $ ip link set dev NAME master NAME
+
+ On enslavement connected and local routes are automatically moved to the
+ table associated with the VRF device.
+
+ For example::
+
+ $ ip link set dev eth0 master mgmt
+
+
+4. Show Devices Assigned to a VRF
+
+ To show devices that have been assigned to a specific VRF add the master
+ option to the ip command::
+
+ $ ip link show vrf NAME
+ $ ip link show master NAME
+
+ For example::
+
+ $ ip link show vrf red
+ 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP mode DEFAULT group default qlen 1000
+ link/ether 02:00:00:00:02:02 brd ff:ff:ff:ff:ff:ff
+ 4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP mode DEFAULT group default qlen 1000
+ link/ether 02:00:00:00:02:03 brd ff:ff:ff:ff:ff:ff
+ 7: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master red state DOWN mode DEFAULT group default qlen 1000
+ link/ether 02:00:00:00:02:06 brd ff:ff:ff:ff:ff:ff
+
+
+ Or using the brief output::
+
+ $ ip -br link show vrf red
+ eth1 UP 02:00:00:00:02:02 <BROADCAST,MULTICAST,UP,LOWER_UP>
+ eth2 UP 02:00:00:00:02:03 <BROADCAST,MULTICAST,UP,LOWER_UP>
+ eth5 DOWN 02:00:00:00:02:06 <BROADCAST,MULTICAST>
+
+
+5. Show Neighbor Entries for a VRF
+
+ To list neighbor entries associated with devices enslaved to a VRF device
+ add the master option to the ip command::
+
+ $ ip [-6] neigh show vrf NAME
+ $ ip [-6] neigh show master NAME
+
+ For example::
+
+ $ ip neigh show vrf red
+ 10.2.1.254 dev eth1 lladdr a6:d9:c7:4f:06:23 REACHABLE
+ 10.2.2.254 dev eth2 lladdr 5e:54:01:6a:ee:80 REACHABLE
+
+ $ ip -6 neigh show vrf red
+ 2002:1::64 dev eth1 lladdr a6:d9:c7:4f:06:23 REACHABLE
+
+
+6. Show Addresses for a VRF
+
+ To show addresses for interfaces associated with a VRF add the master
+ option to the ip command::
+
+ $ ip addr show vrf NAME
+ $ ip addr show master NAME
+
+ For example::
+
+ $ ip addr show vrf red
+ 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
+ link/ether 02:00:00:00:02:02 brd ff:ff:ff:ff:ff:ff
+ inet 10.2.1.2/24 brd 10.2.1.255 scope global eth1
+ valid_lft forever preferred_lft forever
+ inet6 2002:1::2/120 scope global
+ valid_lft forever preferred_lft forever
+ inet6 fe80::ff:fe00:202/64 scope link
+ valid_lft forever preferred_lft forever
+ 4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
+ link/ether 02:00:00:00:02:03 brd ff:ff:ff:ff:ff:ff
+ inet 10.2.2.2/24 brd 10.2.2.255 scope global eth2
+ valid_lft forever preferred_lft forever
+ inet6 2002:2::2/120 scope global
+ valid_lft forever preferred_lft forever
+ inet6 fe80::ff:fe00:203/64 scope link
+ valid_lft forever preferred_lft forever
+ 7: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master red state DOWN group default qlen 1000
+ link/ether 02:00:00:00:02:06 brd ff:ff:ff:ff:ff:ff
+
+ Or in brief format::
+
+ $ ip -br addr show vrf red
+ eth1 UP 10.2.1.2/24 2002:1::2/120 fe80::ff:fe00:202/64
+ eth2 UP 10.2.2.2/24 2002:2::2/120 fe80::ff:fe00:203/64
+ eth5 DOWN
+
+
+7. Show Routes for a VRF
+
+ To show routes for a VRF use the ip command to display the table associated
+ with the VRF device::
+
+ $ ip [-6] route show vrf NAME
+ $ ip [-6] route show table ID
+
+ For example::
+
+ $ ip route show vrf red
+ unreachable default metric 4278198272
+ broadcast 10.2.1.0 dev eth1 proto kernel scope link src 10.2.1.2
+ 10.2.1.0/24 dev eth1 proto kernel scope link src 10.2.1.2
+ local 10.2.1.2 dev eth1 proto kernel scope host src 10.2.1.2
+ broadcast 10.2.1.255 dev eth1 proto kernel scope link src 10.2.1.2
+ broadcast 10.2.2.0 dev eth2 proto kernel scope link src 10.2.2.2
+ 10.2.2.0/24 dev eth2 proto kernel scope link src 10.2.2.2
+ local 10.2.2.2 dev eth2 proto kernel scope host src 10.2.2.2
+ broadcast 10.2.2.255 dev eth2 proto kernel scope link src 10.2.2.2
+
+ $ ip -6 route show vrf red
+ local 2002:1:: dev lo proto none metric 0 pref medium
+ local 2002:1::2 dev lo proto none metric 0 pref medium
+ 2002:1::/120 dev eth1 proto kernel metric 256 pref medium
+ local 2002:2:: dev lo proto none metric 0 pref medium
+ local 2002:2::2 dev lo proto none metric 0 pref medium
+ 2002:2::/120 dev eth2 proto kernel metric 256 pref medium
+ local fe80:: dev lo proto none metric 0 pref medium
+ local fe80:: dev lo proto none metric 0 pref medium
+ local fe80::ff:fe00:202 dev lo proto none metric 0 pref medium
+ local fe80::ff:fe00:203 dev lo proto none metric 0 pref medium
+ fe80::/64 dev eth1 proto kernel metric 256 pref medium
+ fe80::/64 dev eth2 proto kernel metric 256 pref medium
+ ff00::/8 dev red metric 256 pref medium
+ ff00::/8 dev eth1 metric 256 pref medium
+ ff00::/8 dev eth2 metric 256 pref medium
+ unreachable default dev lo metric 4278198272 error -101 pref medium
+
+8. Route Lookup for a VRF
+
+ A test route lookup can be done for a VRF::
+
+ $ ip [-6] route get vrf NAME ADDRESS
+ $ ip [-6] route get oif NAME ADDRESS
+
+ For example::
+
+ $ ip route get 10.2.1.40 vrf red
+ 10.2.1.40 dev eth1 table red src 10.2.1.2
+ cache
+
+ $ ip -6 route get 2002:1::32 vrf red
+ 2002:1::32 from :: dev eth1 table red proto kernel src 2002:1::2 metric 256 pref medium
+
+
+9. Removing Network Interface from a VRF
+
+ Network interfaces are removed from a VRF by breaking the enslavement to
+ the VRF device::
+
+ $ ip link set dev NAME nomaster
+
+ Connected routes are moved back to the default table and local entries are
+ moved to the local table.
+
+ For example::
+
+ $ ip link set dev eth0 nomaster
+
+--------------------------------------------------------------------------------
+
+Commands used in this example::
+
+ cat >> /etc/iproute2/rt_tables.d/vrf.conf <<EOF
+ 1 mgmt
+ 10 red
+ 66 blue
+ 81 green
+ EOF
+
+ function vrf_create
+ {
+ VRF=$1
+ TBID=$2
+
+ # create VRF device
+ ip link add ${VRF} type vrf table ${TBID}
+
+ if [ "${VRF}" != "mgmt" ]; then
+ ip route add table ${TBID} unreachable default metric 4278198272
+ fi
+ ip link set dev ${VRF} up
+ }
+
+ vrf_create mgmt 1
+ ip link set dev eth0 master mgmt
+
+ vrf_create red 10
+ ip link set dev eth1 master red
+ ip link set dev eth2 master red
+ ip link set dev eth5 master red
+
+ vrf_create blue 66
+ ip link set dev eth3 master blue
+
+ vrf_create green 81
+ ip link set dev eth4 master green
+
+
+ Interface addresses from /etc/network/interfaces:
+ auto eth0
+ iface eth0 inet static
+ address 10.0.0.2
+ netmask 255.255.255.0
+ gateway 10.0.0.254
+
+ iface eth0 inet6 static
+ address 2000:1::2
+ netmask 120
+
+ auto eth1
+ iface eth1 inet static
+ address 10.2.1.2
+ netmask 255.255.255.0
+
+ iface eth1 inet6 static
+ address 2002:1::2
+ netmask 120
+
+ auto eth2
+ iface eth2 inet static
+ address 10.2.2.2
+ netmask 255.255.255.0
+
+ iface eth2 inet6 static
+ address 2002:2::2
+ netmask 120
+
+ auto eth3
+ iface eth3 inet static
+ address 10.2.3.2
+ netmask 255.255.255.0
+
+ iface eth3 inet6 static
+ address 2002:3::2
+ netmask 120
+
+ auto eth4
+ iface eth4 inet static
+ address 10.2.4.2
+ netmask 255.255.255.0
+
+ iface eth4 inet6 static
+ address 2002:4::2
+ netmask 120
diff --git a/Documentation/networking/vrf.txt b/Documentation/networking/vrf.txt
deleted file mode 100644
index a5f103b083a0..000000000000
--- a/Documentation/networking/vrf.txt
+++ /dev/null
@@ -1,418 +0,0 @@
-Virtual Routing and Forwarding (VRF)
-====================================
-The VRF device combined with ip rules provides the ability to create virtual
-routing and forwarding domains (aka VRFs, VRF-lite to be specific) in the
-Linux network stack. One use case is the multi-tenancy problem where each
-tenant has their own unique routing tables and in the very least need
-different default gateways.
-
-Processes can be "VRF aware" by binding a socket to the VRF device. Packets
-through the socket then use the routing table associated with the VRF
-device. An important feature of the VRF device implementation is that it
-impacts only Layer 3 and above so L2 tools (e.g., LLDP) are not affected
-(ie., they do not need to be run in each VRF). The design also allows
-the use of higher priority ip rules (Policy Based Routing, PBR) to take
-precedence over the VRF device rules directing specific traffic as desired.
-
-In addition, VRF devices allow VRFs to be nested within namespaces. For
-example network namespaces provide separation of network interfaces at the
-device layer, VLANs on the interfaces within a namespace provide L2 separation
-and then VRF devices provide L3 separation.
-
-Design
-------
-A VRF device is created with an associated route table. Network interfaces
-are then enslaved to a VRF device:
-
-    +-----------------------------+
-    |           vrf-blue          |  ===> route table 10
-    +-----------------------------+
-       |        |                |
-    +------+ +------+     +-------------+
-    | eth1 | | eth2 | ... |    bond1    |
-    +------+ +------+     +-------------+
-                            |        |
-                         +------+ +------+
-                         | eth8 | | eth9 |
-                         +------+ +------+
-
-Packets received on an enslaved device and are switched to the VRF device
-in the IPv4 and IPv6 processing stacks giving the impression that packets
-flow through the VRF device. Similarly on egress routing rules are used to
-send packets to the VRF device driver before getting sent out the actual
-interface. This allows tcpdump on a VRF device to capture all packets into
-and out of the VRF as a whole.[1] Similarly, netfilter[2] and tc rules can be
-applied using the VRF device to specify rules that apply to the VRF domain
-as a whole.
-
-[1] Packets in the forwarded state do not flow through the device, so those
- packets are not seen by tcpdump. Will revisit this limitation in a
- future release.
-
-[2] Iptables on ingress supports PREROUTING with skb->dev set to the real
- ingress device and both INPUT and PREROUTING rules with skb->dev set to
- the VRF device. For egress POSTROUTING and OUTPUT rules can be written
- using either the VRF device or real egress device.
-
-Setup
------
-1. VRF device is created with an association to a FIB table.
- e.g, ip link add vrf-blue type vrf table 10
- ip link set dev vrf-blue up
-
-2. An l3mdev FIB rule directs lookups to the table associated with the device.
- A single l3mdev rule is sufficient for all VRFs. The VRF device adds the
- l3mdev rule for IPv4 and IPv6 when the first device is created with a
- default preference of 1000. Users may delete the rule if desired and add
- with a different priority or install per-VRF rules.
-
- Prior to the v4.8 kernel iif and oif rules are needed for each VRF device:
- ip ru add oif vrf-blue table 10
- ip ru add iif vrf-blue table 10
-
-3. Set the default route for the table (and hence default route for the VRF).
- ip route add table 10 unreachable default metric 4278198272
-
- This high metric value ensures that the default unreachable route can
- be overridden by a routing protocol suite. FRRouting interprets
- kernel metrics as a combined admin distance (upper byte) and priority
- (lower 3 bytes). Thus the above metric translates to [255/8192].
-
-4. Enslave L3 interfaces to a VRF device.
- ip link set dev eth1 master vrf-blue
-
- Local and connected routes for enslaved devices are automatically moved to
- the table associated with VRF device. Any additional routes depending on
- the enslaved device are dropped and will need to be reinserted to the VRF
- FIB table following the enslavement.
-
- The IPv6 sysctl option keep_addr_on_down can be enabled to keep IPv6 global
- addresses as VRF enslavement changes.
- sysctl -w net.ipv6.conf.all.keep_addr_on_down=1
-
-5. Additional VRF routes are added to associated table.
- ip route add table 10 ...
-
-
-Applications
-------------
-Applications that are to work within a VRF need to bind their socket to the
-VRF device:
-
- setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1);
-
-or to specify the output device using cmsg and IP_PKTINFO.
-
-By default the scope of the port bindings for unbound sockets is
-limited to the default VRF. That is, it will not be matched by packets
-arriving on interfaces enslaved to an l3mdev and processes may bind to
-the same port if they bind to an l3mdev.
-
-TCP & UDP services running in the default VRF context (ie., not bound
-to any VRF device) can work across all VRF domains by enabling the
-tcp_l3mdev_accept and udp_l3mdev_accept sysctl options:
-
- sysctl -w net.ipv4.tcp_l3mdev_accept=1
- sysctl -w net.ipv4.udp_l3mdev_accept=1
-
-These options are disabled by default so that a socket in a VRF is only
-selected for packets in that VRF. There is a similar option for RAW
-sockets, which is enabled by default for reasons of backwards compatibility.
-This is so as to specify the output device with cmsg and IP_PKTINFO, but
-using a socket not bound to the corresponding VRF. This allows e.g. older ping
-implementations to be run with specifying the device but without executing it
-in the VRF. This option can be disabled so that packets received in a VRF
-context are only handled by a raw socket bound to the VRF, and packets in the
-default VRF are only handled by a socket not bound to any VRF:
-
- sysctl -w net.ipv4.raw_l3mdev_accept=0
-
-netfilter rules on the VRF device can be used to limit access to services
-running in the default VRF context as well.
-
-################################################################################
-
-Using iproute2 for VRFs
-=======================
-iproute2 supports the vrf keyword as of v4.7. For backwards compatibility this
-section lists both commands where appropriate -- with the vrf keyword and the
-older form without it.
-
-1. Create a VRF
-
- To instantiate a VRF device and associate it with a table:
- $ ip link add dev NAME type vrf table ID
-
- As of v4.8 the kernel supports the l3mdev FIB rule where a single rule
- covers all VRFs. The l3mdev rule is created for IPv4 and IPv6 on first
- device create.
-
-2. List VRFs
-
- To list VRFs that have been created:
- $ ip [-d] link show type vrf
- NOTE: The -d option is needed to show the table id
-
- For example:
- $ ip -d link show type vrf
- 11: mgmt: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
- link/ether 72:b3:ba:91:e2:24 brd ff:ff:ff:ff:ff:ff promiscuity 0
- vrf table 1 addrgenmode eui64
- 12: red: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
- link/ether b6:6f:6e:f6:da:73 brd ff:ff:ff:ff:ff:ff promiscuity 0
- vrf table 10 addrgenmode eui64
- 13: blue: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
- link/ether 36:62:e8:7d:bb:8c brd ff:ff:ff:ff:ff:ff promiscuity 0
- vrf table 66 addrgenmode eui64
- 14: green: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
- link/ether e6:28:b8:63:70:bb brd ff:ff:ff:ff:ff:ff promiscuity 0
- vrf table 81 addrgenmode eui64
-
-
- Or in brief output:
-
- $ ip -br link show type vrf
- mgmt UP 72:b3:ba:91:e2:24 <NOARP,MASTER,UP,LOWER_UP>
- red UP b6:6f:6e:f6:da:73 <NOARP,MASTER,UP,LOWER_UP>
- blue UP 36:62:e8:7d:bb:8c <NOARP,MASTER,UP,LOWER_UP>
- green UP e6:28:b8:63:70:bb <NOARP,MASTER,UP,LOWER_UP>
-
-
-3. Assign a Network Interface to a VRF
-
- Network interfaces are assigned to a VRF by enslaving the netdevice to a
- VRF device:
- $ ip link set dev NAME master NAME
-
- On enslavement connected and local routes are automatically moved to the
- table associated with the VRF device.
-
- For example:
- $ ip link set dev eth0 master mgmt
-
-
-4. Show Devices Assigned to a VRF
-
- To show devices that have been assigned to a specific VRF add the master
- option to the ip command:
- $ ip link show vrf NAME
- $ ip link show master NAME
-
- For example:
- $ ip link show vrf red
- 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP mode DEFAULT group default qlen 1000
- link/ether 02:00:00:00:02:02 brd ff:ff:ff:ff:ff:ff
- 4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP mode DEFAULT group default qlen 1000
- link/ether 02:00:00:00:02:03 brd ff:ff:ff:ff:ff:ff
- 7: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master red state DOWN mode DEFAULT group default qlen 1000
- link/ether 02:00:00:00:02:06 brd ff:ff:ff:ff:ff:ff
-
-
- Or using the brief output:
- $ ip -br link show vrf red
- eth1 UP 02:00:00:00:02:02 <BROADCAST,MULTICAST,UP,LOWER_UP>
- eth2 UP 02:00:00:00:02:03 <BROADCAST,MULTICAST,UP,LOWER_UP>
- eth5 DOWN 02:00:00:00:02:06 <BROADCAST,MULTICAST>
-
-
-5. Show Neighbor Entries for a VRF
-
- To list neighbor entries associated with devices enslaved to a VRF device
- add the master option to the ip command:
- $ ip [-6] neigh show vrf NAME
- $ ip [-6] neigh show master NAME
-
- For example:
- $ ip neigh show vrf red
- 10.2.1.254 dev eth1 lladdr a6:d9:c7:4f:06:23 REACHABLE
- 10.2.2.254 dev eth2 lladdr 5e:54:01:6a:ee:80 REACHABLE
-
- $ ip -6 neigh show vrf red
- 2002:1::64 dev eth1 lladdr a6:d9:c7:4f:06:23 REACHABLE
-
-
-6. Show Addresses for a VRF
-
- To show addresses for interfaces associated with a VRF add the master
- option to the ip command:
- $ ip addr show vrf NAME
- $ ip addr show master NAME
-
- For example:
- $ ip addr show vrf red
- 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
- link/ether 02:00:00:00:02:02 brd ff:ff:ff:ff:ff:ff
- inet 10.2.1.2/24 brd 10.2.1.255 scope global eth1
- valid_lft forever preferred_lft forever
- inet6 2002:1::2/120 scope global
- valid_lft forever preferred_lft forever
- inet6 fe80::ff:fe00:202/64 scope link
- valid_lft forever preferred_lft forever
- 4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
- link/ether 02:00:00:00:02:03 brd ff:ff:ff:ff:ff:ff
- inet 10.2.2.2/24 brd 10.2.2.255 scope global eth2
- valid_lft forever preferred_lft forever
- inet6 2002:2::2/120 scope global
- valid_lft forever preferred_lft forever
- inet6 fe80::ff:fe00:203/64 scope link
- valid_lft forever preferred_lft forever
- 7: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master red state DOWN group default qlen 1000
- link/ether 02:00:00:00:02:06 brd ff:ff:ff:ff:ff:ff
-
- Or in brief format:
- $ ip -br addr show vrf red
- eth1 UP 10.2.1.2/24 2002:1::2/120 fe80::ff:fe00:202/64
- eth2 UP 10.2.2.2/24 2002:2::2/120 fe80::ff:fe00:203/64
- eth5 DOWN
-
-
-7. Show Routes for a VRF
-
- To show routes for a VRF use the ip command to display the table associated
- with the VRF device:
- $ ip [-6] route show vrf NAME
- $ ip [-6] route show table ID
-
- For example:
- $ ip route show vrf red
- unreachable default metric 4278198272
- broadcast 10.2.1.0 dev eth1 proto kernel scope link src 10.2.1.2
- 10.2.1.0/24 dev eth1 proto kernel scope link src 10.2.1.2
- local 10.2.1.2 dev eth1 proto kernel scope host src 10.2.1.2
- broadcast 10.2.1.255 dev eth1 proto kernel scope link src 10.2.1.2
- broadcast 10.2.2.0 dev eth2 proto kernel scope link src 10.2.2.2
- 10.2.2.0/24 dev eth2 proto kernel scope link src 10.2.2.2
- local 10.2.2.2 dev eth2 proto kernel scope host src 10.2.2.2
- broadcast 10.2.2.255 dev eth2 proto kernel scope link src 10.2.2.2
-
- $ ip -6 route show vrf red
- local 2002:1:: dev lo proto none metric 0 pref medium
- local 2002:1::2 dev lo proto none metric 0 pref medium
- 2002:1::/120 dev eth1 proto kernel metric 256 pref medium
- local 2002:2:: dev lo proto none metric 0 pref medium
- local 2002:2::2 dev lo proto none metric 0 pref medium
- 2002:2::/120 dev eth2 proto kernel metric 256 pref medium
- local fe80:: dev lo proto none metric 0 pref medium
- local fe80:: dev lo proto none metric 0 pref medium
- local fe80::ff:fe00:202 dev lo proto none metric 0 pref medium
- local fe80::ff:fe00:203 dev lo proto none metric 0 pref medium
- fe80::/64 dev eth1 proto kernel metric 256 pref medium
- fe80::/64 dev eth2 proto kernel metric 256 pref medium
- ff00::/8 dev red metric 256 pref medium
- ff00::/8 dev eth1 metric 256 pref medium
- ff00::/8 dev eth2 metric 256 pref medium
- unreachable default dev lo metric 4278198272 error -101 pref medium
-
-8. Route Lookup for a VRF
-
- A test route lookup can be done for a VRF:
- $ ip [-6] route get vrf NAME ADDRESS
- $ ip [-6] route get oif NAME ADDRESS
-
- For example:
- $ ip route get 10.2.1.40 vrf red
- 10.2.1.40 dev eth1 table red src 10.2.1.2
- cache
-
- $ ip -6 route get 2002:1::32 vrf red
- 2002:1::32 from :: dev eth1 table red proto kernel src 2002:1::2 metric 256 pref medium
-
-
-9. Removing Network Interface from a VRF
-
- Network interfaces are removed from a VRF by breaking the enslavement to
- the VRF device:
- $ ip link set dev NAME nomaster
-
- Connected routes are moved back to the default table and local entries are
- moved to the local table.
-
- For example:
- $ ip link set dev eth0 nomaster
-
---------------------------------------------------------------------------------
-
-Commands used in this example:
-
-cat >> /etc/iproute2/rt_tables.d/vrf.conf <<EOF
-1 mgmt
-10 red
-66 blue
-81 green
-EOF
-
-function vrf_create
-{
- VRF=$1
- TBID=$2
-
- # create VRF device
- ip link add ${VRF} type vrf table ${TBID}
-
- if [ "${VRF}" != "mgmt" ]; then
- ip route add table ${TBID} unreachable default metric 4278198272
- fi
- ip link set dev ${VRF} up
-}
-
-vrf_create mgmt 1
-ip link set dev eth0 master mgmt
-
-vrf_create red 10
-ip link set dev eth1 master red
-ip link set dev eth2 master red
-ip link set dev eth5 master red
-
-vrf_create blue 66
-ip link set dev eth3 master blue
-
-vrf_create green 81
-ip link set dev eth4 master green
-
-
-Interface addresses from /etc/network/interfaces:
-auto eth0
-iface eth0 inet static
- address 10.0.0.2
- netmask 255.255.255.0
- gateway 10.0.0.254
-
-iface eth0 inet6 static
- address 2000:1::2
- netmask 120
-
-auto eth1
-iface eth1 inet static
- address 10.2.1.2
- netmask 255.255.255.0
-
-iface eth1 inet6 static
- address 2002:1::2
- netmask 120
-
-auto eth2
-iface eth2 inet static
- address 10.2.2.2
- netmask 255.255.255.0
-
-iface eth2 inet6 static
- address 2002:2::2
- netmask 120
-
-auto eth3
-iface eth3 inet static
- address 10.2.3.2
- netmask 255.255.255.0
-
-iface eth3 inet6 static
- address 2002:3::2
- netmask 120
-
-auto eth4
-iface eth4 inet static
- address 10.2.4.2
- netmask 255.255.255.0
-
-iface eth4 inet6 static
- address 2002:4::2
- netmask 120
diff --git a/Documentation/networking/vxlan.txt b/Documentation/networking/vxlan.rst
index c28f4989c3f0..2759dc1cc525 100644
--- a/Documentation/networking/vxlan.txt
+++ b/Documentation/networking/vxlan.rst
@@ -1,3 +1,6 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================================
Virtual eXtensible Local Area Networking documentation
======================================================
@@ -21,8 +24,9 @@ neighbors GRE and VLAN. Configuring VXLAN requires the version of
iproute2 that matches the kernel release where VXLAN was first merged
upstream.
-1. Create vxlan device
- # ip link add vxlan0 type vxlan id 42 group 239.1.1.1 dev eth1 dstport 4789
+1. Create vxlan device::
+
+ # ip link add vxlan0 type vxlan id 42 group 239.1.1.1 dev eth1 dstport 4789
This creates a new device named vxlan0. The device uses the multicast
group 239.1.1.1 over eth1 to handle traffic for which there is no
@@ -32,20 +36,53 @@ pre-dates the IANA's selection of a standard destination port number
and uses the Linux-selected value by default to maintain backwards
compatibility.
-2. Delete vxlan device
- # ip link delete vxlan0
+2. Delete vxlan device::
+
+ # ip link delete vxlan0
-3. Show vxlan info
- # ip -d link show vxlan0
+3. Show vxlan info::
+
+ # ip -d link show vxlan0
It is possible to create, destroy and display the vxlan
forwarding table using the new bridge command.
-1. Create forwarding table entry
- # bridge fdb add to 00:17:42:8a:b4:05 dst 192.19.0.2 dev vxlan0
+1. Create forwarding table entry::
+
+ # bridge fdb add to 00:17:42:8a:b4:05 dst 192.19.0.2 dev vxlan0
+
+2. Delete forwarding table entry::
+
+ # bridge fdb delete 00:17:42:8a:b4:05 dev vxlan0
+
+3. Show forwarding table::
+
+ # bridge fdb show dev vxlan0
+
+The following NIC features may indicate support for UDP tunnel-related
+offloads (most commonly VXLAN features, but support for a particular
+encapsulation protocol is NIC specific):
+
+ - `tx-udp_tnl-segmentation`
+ - `tx-udp_tnl-csum-segmentation`
+ ability to perform TCP segmentation offload of UDP encapsulated frames
+
+ - `rx-udp_tunnel-port-offload`
+ receive side parsing of UDP encapsulated frames which allows NICs to
+ perform protocol-aware offloads, like checksum validation offload of
+ inner frames (only needed by NICs without protocol-agnostic offloads)
-2. Delete forwarding table entry
- # bridge fdb delete 00:17:42:8a:b4:05 dev vxlan0
+For devices supporting `rx-udp_tunnel-port-offload` the list of currently
+offloaded ports can be interrogated with `ethtool`::
-3. Show forwarding table
- # bridge fdb show dev vxlan0
+ $ ethtool --show-tunnels eth0
+ Tunnel information for eth0:
+ UDP port table 0:
+ Size: 4
+ Types: vxlan
+ No entries
+ UDP port table 1:
+ Size: 4
+ Types: geneve, vxlan-gpe
+ Entries (1):
+ port 1230, vxlan-gpe
diff --git a/Documentation/networking/x25-iface.rst b/Documentation/networking/x25-iface.rst
new file mode 100644
index 000000000000..f34e9ec64937
--- /dev/null
+++ b/Documentation/networking/x25-iface.rst
@@ -0,0 +1,82 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
+X.25 Device Driver Interface
+============================
+
+Version 1.1
+
+ Jonathan Naylor 26.12.96
+
+This is a description of the messages to be passed between the X.25 Packet
+Layer and the X.25 device driver. They are designed to allow for the easy
+setting of the LAPB mode from within the Packet Layer.
+
+The X.25 device driver will be coded normally as per the Linux device driver
+standards. Most X.25 device drivers will be moderately similar to the
+already existing Ethernet device drivers. However unlike those drivers, the
+X.25 device driver has a state associated with it, and this information
+needs to be passed to and from the Packet Layer for proper operation.
+
+All messages are held in sk_buff's just like real data to be transmitted
+over the LAPB link. The first byte of the skbuff indicates the meaning of
+the rest of the skbuff, if any more information does exist.
+
+
+Packet Layer to Device Driver
+-----------------------------
+
+First Byte = 0x00 (X25_IFACE_DATA)
+
+This indicates that the rest of the skbuff contains data to be transmitted
+over the LAPB link. The LAPB link should already exist before any data is
+passed down.
+
+First Byte = 0x01 (X25_IFACE_CONNECT)
+
+Establish the LAPB link. If the link is already established then the connect
+confirmation message should be returned as soon as possible.
+
+First Byte = 0x02 (X25_IFACE_DISCONNECT)
+
+Terminate the LAPB link. If it is already disconnected then the disconnect
+confirmation message should be returned as soon as possible.
+
+First Byte = 0x03 (X25_IFACE_PARAMS)
+
+LAPB parameters. To be defined.
+
+
+Device Driver to Packet Layer
+-----------------------------
+
+First Byte = 0x00 (X25_IFACE_DATA)
+
+This indicates that the rest of the skbuff contains data that has been
+received over the LAPB link.
+
+First Byte = 0x01 (X25_IFACE_CONNECT)
+
+LAPB link has been established. The same message is used for both a LAPB
+link connect_confirmation and a connect_indication.
+
+First Byte = 0x02 (X25_IFACE_DISCONNECT)
+
+LAPB link has been terminated. This same message is used for both a LAPB
+link disconnect_confirmation and a disconnect_indication.
+
+First Byte = 0x03 (X25_IFACE_PARAMS)
+
+LAPB parameters. To be defined.
+
+
+Requirements for the device driver
+----------------------------------
+
+Packets should not be reordered or dropped when delivering between the
+Packet Layer and the device driver.
+
+To prevent packets from being reordered or dropped when delivering from
+the device driver to the Packet Layer, the device driver should not
+call "netif_rx" to deliver the received packets. Instead, it should
+call "netif_receive_skb_core" from softirq context to deliver them.
diff --git a/Documentation/networking/x25-iface.txt b/Documentation/networking/x25-iface.txt
deleted file mode 100644
index 7f213b556e85..000000000000
--- a/Documentation/networking/x25-iface.txt
+++ /dev/null
@@ -1,123 +0,0 @@
- X.25 Device Driver Interface 1.1
-
- Jonathan Naylor 26.12.96
-
-This is a description of the messages to be passed between the X.25 Packet
-Layer and the X.25 device driver. They are designed to allow for the easy
-setting of the LAPB mode from within the Packet Layer.
-
-The X.25 device driver will be coded normally as per the Linux device driver
-standards. Most X.25 device drivers will be moderately similar to the
-already existing Ethernet device drivers. However unlike those drivers, the
-X.25 device driver has a state associated with it, and this information
-needs to be passed to and from the Packet Layer for proper operation.
-
-All messages are held in sk_buff's just like real data to be transmitted
-over the LAPB link. The first byte of the skbuff indicates the meaning of
-the rest of the skbuff, if any more information does exist.
-
-
-Packet Layer to Device Driver
------------------------------
-
-First Byte = 0x00 (X25_IFACE_DATA)
-
-This indicates that the rest of the skbuff contains data to be transmitted
-over the LAPB link. The LAPB link should already exist before any data is
-passed down.
-
-First Byte = 0x01 (X25_IFACE_CONNECT)
-
-Establish the LAPB link. If the link is already established then the connect
-confirmation message should be returned as soon as possible.
-
-First Byte = 0x02 (X25_IFACE_DISCONNECT)
-
-Terminate the LAPB link. If it is already disconnected then the disconnect
-confirmation message should be returned as soon as possible.
-
-First Byte = 0x03 (X25_IFACE_PARAMS)
-
-LAPB parameters. To be defined.
-
-
-Device Driver to Packet Layer
------------------------------
-
-First Byte = 0x00 (X25_IFACE_DATA)
-
-This indicates that the rest of the skbuff contains data that has been
-received over the LAPB link.
-
-First Byte = 0x01 (X25_IFACE_CONNECT)
-
-LAPB link has been established. The same message is used for both a LAPB
-link connect_confirmation and a connect_indication.
-
-First Byte = 0x02 (X25_IFACE_DISCONNECT)
-
-LAPB link has been terminated. This same message is used for both a LAPB
-link disconnect_confirmation and a disconnect_indication.
-
-First Byte = 0x03 (X25_IFACE_PARAMS)
-
-LAPB parameters. To be defined.
-
-
-
-Possible Problems
-=================
-
-(Henner Eisen, 2000-10-28)
-
-The X.25 packet layer protocol depends on a reliable datalink service.
-The LAPB protocol provides such reliable service. But this reliability
-is not preserved by the Linux network device driver interface:
-
-- With Linux 2.4.x (and above) SMP kernels, packet ordering is not
- preserved. Even if a device driver calls netif_rx(skb1) and later
- netif_rx(skb2), skb2 might be delivered to the network layer
- earlier that skb1.
-- Data passed upstream by means of netif_rx() might be dropped by the
- kernel if the backlog queue is congested.
-
-The X.25 packet layer protocol will detect this and reset the virtual
-call in question. But many upper layer protocols are not designed to
-handle such N-Reset events gracefully. And frequent N-Reset events
-will always degrade performance.
-
-Thus, driver authors should make netif_rx() as reliable as possible:
-
-SMP re-ordering will not occur if the driver's interrupt handler is
-always executed on the same CPU. Thus,
-
-- Driver authors should use irq affinity for the interrupt handler.
-
-The probability of packet loss due to backlog congestion can be
-reduced by the following measures or a combination thereof:
-
-(1) Drivers for kernel versions 2.4.x and above should always check the
- return value of netif_rx(). If it returns NET_RX_DROP, the
- driver's LAPB protocol must not confirm reception of the frame
- to the peer.
- This will reliably suppress packet loss. The LAPB protocol will
- automatically cause the peer to re-transmit the dropped packet
- later.
- The lapb module interface was modified to support this. Its
- data_indication() method should now transparently pass the
- netif_rx() return value to the (lapb module) caller.
-(2) Drivers for kernel versions 2.2.x should always check the global
- variable netdev_dropping when a new frame is received. The driver
- should only call netif_rx() if netdev_dropping is zero. Otherwise
- the driver should not confirm delivery of the frame and drop it.
- Alternatively, the driver can queue the frame internally and call
- netif_rx() later when netif_dropping is 0 again. In that case, delivery
- confirmation should also be deferred such that the internal queue
- cannot grow to much.
- This will not reliably avoid packet loss, but the probability
- of packet loss in netif_rx() path will be significantly reduced.
-(3) Additionally, driver authors might consider to support
- CONFIG_NET_HW_FLOWCONTROL. This allows the driver to be woken up
- when a previously congested backlog queue becomes empty again.
- The driver could uses this for flow-controlling the peer by means
- of the LAPB protocol's flow-control service.
diff --git a/Documentation/networking/x25.txt b/Documentation/networking/x25.rst
index c91c6d7159ff..e11d9ebdf9a3 100644
--- a/Documentation/networking/x25.txt
+++ b/Documentation/networking/x25.rst
@@ -1,4 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
Linux X.25 Project
+==================
As my third year dissertation at University I have taken it upon myself to
write an X.25 implementation for Linux. My aim is to provide a complete X.25
@@ -15,13 +19,11 @@ implementation of LAPB. Therefore the LAPB modules would be called by
unintelligent X.25 card drivers and not by intelligent ones, this would
provide a uniform device driver interface, and simplify configuration.
-To confuse matters a little, an 802.2 LLC implementation for Linux is being
-written which will allow X.25 to be run over an Ethernet (or Token Ring) and
-conform with the JNT "Pink Book", this will have a different interface to
-the Packet Layer but there will be no confusion since the class of device
-being served by the LLC will be completely separate from LAPB. The LLC
-implementation is being done as part of another protocol project (SNA) and
-by a different author.
+To confuse matters a little, an 802.2 LLC implementation is also possible
+which could allow X.25 to be run over an Ethernet (or Token Ring) and
+conform with the JNT "Pink Book". This would have a different interface to
+the Packet Layer, but there would be no confusion since the class of device
+being served by the LLC would be completely separate from LAPB.
Just when you thought that it could not become more confusing, another
option appeared, XOT. This allows X.25 Packet Layer frames to operate over
diff --git a/Documentation/networking/xfrm_device.txt b/Documentation/networking/xfrm_device.rst
index a1c904dc70dc..01391dfd37d9 100644
--- a/Documentation/networking/xfrm_device.txt
+++ b/Documentation/networking/xfrm_device.rst
@@ -1,7 +1,9 @@
+.. SPDX-License-Identifier: GPL-2.0
===============================================
XFRM device - offloading the IPsec computations
===============================================
+
Shannon Nelson <shannon.nelson@oracle.com>
@@ -19,7 +21,7 @@ hardware offload.
Userland access to the offload is typically through a system such as
libreswan or KAME/raccoon, but the iproute2 'ip xfrm' command set can
be handy when experimenting. An example command might look something
-like this:
+like this::
ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \
reqid 0x07 replay-window 32 \
@@ -34,19 +36,21 @@ Yes, that's ugly, but that's what shell scripts and/or libreswan are for.
Callbacks to implement
======================
-/* from include/linux/netdevice.h */
-struct xfrmdev_ops {
+::
+
+ /* from include/linux/netdevice.h */
+ struct xfrmdev_ops {
int (*xdo_dev_state_add) (struct xfrm_state *x);
void (*xdo_dev_state_delete) (struct xfrm_state *x);
void (*xdo_dev_state_free) (struct xfrm_state *x);
bool (*xdo_dev_offload_ok) (struct sk_buff *skb,
struct xfrm_state *x);
void (*xdo_dev_state_advance_esn) (struct xfrm_state *x);
-};
+ };
The NIC driver offering ipsec offload will need to implement these
callbacks to make the offload available to the network stack's
-XFRM subsytem. Additionally, the feature bits NETIF_F_HW_ESP and
+XFRM subsystem. Additionally, the feature bits NETIF_F_HW_ESP and
NETIF_F_HW_ESP_TX_CSUM will signal the availability of the offload.
@@ -58,6 +62,8 @@ At probe time and before the call to register_netdev(), the driver should
set up local data structures and XFRM callbacks, and set the feature bits.
The XFRM code's listener will finish the setup on NETDEV_REGISTER.
+::
+
adapter->netdev->xfrmdev_ops = &ixgbe_xfrmdev_ops;
adapter->netdev->features |= NETIF_F_HW_ESP;
adapter->netdev->hw_enc_features |= NETIF_F_HW_ESP;
@@ -65,16 +71,20 @@ The XFRM code's listener will finish the setup on NETDEV_REGISTER.
When new SAs are set up with a request for "offload" feature, the
driver's xdo_dev_state_add() will be given the new SA to be offloaded
and an indication of whether it is for Rx or Tx. The driver should
+
- verify the algorithm is supported for offloads
- store the SA information (key, salt, target-ip, protocol, etc)
- enable the HW offload of the SA
- return status value:
+
+ =========== ===================================
0 success
-EOPNOTSUPP offload not supported, try SW IPsec
other fail the request
+ =========== ===================================
The driver can also set an offload_handle in the SA, an opaque void pointer
-that can be used to convey context into the fast-path offload requests.
+that can be used to convey context into the fast-path offload requests::
xs->xso.offload_handle = context;
@@ -88,7 +98,7 @@ return true of false to signify its support.
When ready to send, the driver needs to inspect the Tx packet for the
offload information, including the opaque context, and set up the packet
-send accordingly.
+send accordingly::
xs = xfrm_input_state(skb);
context = xs->xso.offload_handle;
@@ -105,18 +115,21 @@ the packet's skb. At this point the data should be decrypted but the
IPsec headers are still in the packet data; they are removed later up
the stack in xfrm_input().
- find and hold the SA that was used to the Rx skb
+ find and hold the SA that was used to the Rx skb::
+
get spi, protocol, and destination IP from packet headers
xs = find xs from (spi, protocol, dest_IP)
xfrm_state_hold(xs);
- store the state information into the skb
+ store the state information into the skb::
+
sp = secpath_set(skb);
if (!sp) return;
sp->xvec[sp->len++] = xs;
sp->olen++;
- indicate the success and/or error status of the offload
+ indicate the success and/or error status of the offload::
+
xo = xfrm_offload(skb);
xo->flags = CRYPTO_DONE;
xo->status = crypto_status;
@@ -136,5 +149,3 @@ hardware needs.
As a netdev is set to DOWN the XFRM stack's netdev listener will call
xdo_dev_state_delete() and xdo_dev_state_free() on any remaining offloaded
states.
-
-
diff --git a/Documentation/networking/xfrm_proc.txt b/Documentation/networking/xfrm_proc.rst
index 2eae619ab67b..0a771c5a7399 100644
--- a/Documentation/networking/xfrm_proc.txt
+++ b/Documentation/networking/xfrm_proc.rst
@@ -1,5 +1,9 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
XFRM proc - /proc/net/xfrm_* files
==================================
+
Masahide NAKAMURA <nakam@linux-ipv6.org>
@@ -14,42 +18,58 @@ as part of the linux private MIB. These counters can be viewed in
Inbound errors
~~~~~~~~~~~~~~
+
XfrmInError:
All errors which is not matched others
+
XfrmInBufferError:
No buffer is left
+
XfrmInHdrError:
Header error
+
XfrmInNoStates:
No state is found
i.e. Either inbound SPI, address, or IPsec protocol at SA is wrong
+
XfrmInStateProtoError:
Transformation protocol specific error
e.g. SA key is wrong
+
XfrmInStateModeError:
Transformation mode specific error
+
XfrmInStateSeqError:
Sequence error
i.e. Sequence number is out of window
+
XfrmInStateExpired:
State is expired
+
XfrmInStateMismatch:
State has mismatch option
e.g. UDP encapsulation type is mismatch
+
XfrmInStateInvalid:
State is invalid
+
XfrmInTmplMismatch:
No matching template for states
e.g. Inbound SAs are correct but SP rule is wrong
+
XfrmInNoPols:
No policy is found for states
e.g. Inbound SAs are correct but no SP is found
+
XfrmInPolBlock:
Policy discards
+
XfrmInPolError:
Policy error
+
XfrmAcquireError:
State hasn't been fully acquired before use
+
XfrmFwdHdrError:
Forward routing of a packet is not allowed
@@ -57,26 +77,37 @@ Outbound errors
~~~~~~~~~~~~~~~
XfrmOutError:
All errors which is not matched others
+
XfrmOutBundleGenError:
Bundle generation error
+
XfrmOutBundleCheckError:
Bundle check error
+
XfrmOutNoStates:
No state is found
+
XfrmOutStateProtoError:
Transformation protocol specific error
+
XfrmOutStateModeError:
Transformation mode specific error
+
XfrmOutStateSeqError:
Sequence error
i.e. Sequence number overflow
+
XfrmOutStateExpired:
State is expired
+
XfrmOutPolBlock:
Policy discards
+
XfrmOutPolDead:
Policy is dead
+
XfrmOutPolError:
Policy error
+
XfrmOutStateInvalid:
State is invalid, perhaps expired
diff --git a/Documentation/networking/xfrm_sync.txt b/Documentation/networking/xfrm_sync.rst
index 8d88e0f2ec49..6246503ceab2 100644
--- a/Documentation/networking/xfrm_sync.txt
+++ b/Documentation/networking/xfrm_sync.rst
@@ -1,3 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====
+XFRM
+====
The sync patches work is based on initial patches from
Krisztian <hidden@balabit.hu> and others and additional patches
@@ -40,30 +45,32 @@ The netlink message types are:
XFRM_MSG_NEWAE and XFRM_MSG_GETAE.
A XFRM_MSG_GETAE does not have TLVs.
+
A XFRM_MSG_NEWAE will have at least two TLVs (as is
discussed further below).
-aevent_id structure looks like:
+aevent_id structure looks like::
struct xfrm_aevent_id {
- struct xfrm_usersa_id sa_id;
- xfrm_address_t saddr;
- __u32 flags;
- __u32 reqid;
+ struct xfrm_usersa_id sa_id;
+ xfrm_address_t saddr;
+ __u32 flags;
+ __u32 reqid;
};
The unique SA is identified by the combination of xfrm_usersa_id,
reqid and saddr.
flags are used to indicate different things. The possible
-flags are:
- XFRM_AE_RTHR=1, /* replay threshold*/
- XFRM_AE_RVAL=2, /* replay value */
- XFRM_AE_LVAL=4, /* lifetime value */
- XFRM_AE_ETHR=8, /* expiry timer threshold */
- XFRM_AE_CR=16, /* Event cause is replay update */
- XFRM_AE_CE=32, /* Event cause is timer expiry */
- XFRM_AE_CU=64, /* Event cause is policy update */
+flags are::
+
+ XFRM_AE_RTHR=1, /* replay threshold */
+ XFRM_AE_RVAL=2, /* replay value */
+ XFRM_AE_LVAL=4, /* lifetime value */
+ XFRM_AE_ETHR=8, /* expiry timer threshold */
+ XFRM_AE_CR=16, /* Event cause is replay update */
+ XFRM_AE_CE=32, /* Event cause is timer expiry */
+ XFRM_AE_CU=64, /* Event cause is policy update */
How these flags are used is dependent on the direction of the
message (kernel<->user) as well the cause (config, query or event).
@@ -80,23 +87,27 @@ to get notified of these events.
-----------------------------------------
a) byte value (XFRMA_LTIME_VAL)
+
This TLV carries the running/current counter for byte lifetime since
last event.
b)replay value (XFRMA_REPLAY_VAL)
+
This TLV carries the running/current counter for replay sequence since
last event.
c)replay threshold (XFRMA_REPLAY_THRESH)
+
This TLV carries the threshold being used by the kernel to trigger events
when the replay sequence is exceeded.
d) expiry timer (XFRMA_ETIMER_THRESH)
+
This is a timer value in milliseconds which is used as the nagle
value to rate limit the events.
3) Default configurations for the parameters:
-----------------------------------------------
+---------------------------------------------
By default these events should be turned off unless there is
at least one listener registered to listen to the multicast
@@ -108,6 +119,7 @@ we also provide default threshold values for these different parameters
in case they are not specified.
the two sysctls/proc entries are:
+
a) /proc/sys/net/core/sysctl_xfrm_aevent_etime
used to provide default values for the XFRMA_ETIMER_THRESH in incremental
units of time of 100ms. The default is 10 (1 second)
@@ -120,37 +132,45 @@ in incremental packet count. The default is two packets.
----------------
a) XFRM_MSG_GETAE issued by user-->kernel.
-XFRM_MSG_GETAE does not carry any TLVs.
+ XFRM_MSG_GETAE does not carry any TLVs.
+
The response is a XFRM_MSG_NEWAE which is formatted based on what
XFRM_MSG_GETAE queried for.
+
The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
-*if XFRM_AE_RTHR flag is set, then XFRMA_REPLAY_THRESH is also retrieved
-*if XFRM_AE_ETHR flag is set, then XFRMA_ETIMER_THRESH is also retrieved
+* if XFRM_AE_RTHR flag is set, then XFRMA_REPLAY_THRESH is also retrieved
+* if XFRM_AE_ETHR flag is set, then XFRMA_ETIMER_THRESH is also retrieved
b) XFRM_MSG_NEWAE is issued by either user space to configure
-or kernel to announce events or respond to a XFRM_MSG_GETAE.
+ or kernel to announce events or respond to a XFRM_MSG_GETAE.
i) user --> kernel to configure a specific SA.
+
any of the values or threshold parameters can be updated by passing the
appropriate TLV.
+
A response is issued back to the sender in user space to indicate success
or failure.
+
In the case of success, additionally an event with
XFRM_MSG_NEWAE is also issued to any listeners as described in iii).
ii) kernel->user direction as a response to XFRM_MSG_GETAE
+
The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
+
The threshold TLVs will be included if explicitly requested in
the XFRM_MSG_GETAE message.
iii) kernel->user to report as event if someone sets any values or
-thresholds for an SA using XFRM_MSG_NEWAE (as described in #i above).
-In such a case XFRM_AE_CU flag is set to inform the user that
-the change happened as a result of an update.
-The message will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
+ thresholds for an SA using XFRM_MSG_NEWAE (as described in #i above).
+ In such a case XFRM_AE_CU flag is set to inform the user that
+ the change happened as a result of an update.
+ The message will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
iv) kernel->user to report event when replay threshold or a timeout
-is exceeded.
+ is exceeded.
+
In such a case either XFRM_AE_CR (replay exceeded) or XFRM_AE_CE (timeout
happened) is set to inform the user what happened.
Note the two flags are mutually exclusive.
diff --git a/Documentation/networking/xfrm_sysctl.txt b/Documentation/networking/xfrm_sysctl.rst
index 5bbd16792fe1..47b9bbdd0179 100644
--- a/Documentation/networking/xfrm_sysctl.txt
+++ b/Documentation/networking/xfrm_sysctl.rst
@@ -1,4 +1,11 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+XFRM Sysctl
+============
+
/proc/sys/net/core/xfrm_* Variables:
+====================================
xfrm_acq_expires - INTEGER
default 30 - hard timeout in seconds for acquire requests
diff --git a/Documentation/networking/z8530book.rst b/Documentation/networking/z8530book.rst
deleted file mode 100644
index fea2c40e7973..000000000000
--- a/Documentation/networking/z8530book.rst
+++ /dev/null
@@ -1,256 +0,0 @@
-=======================
-Z8530 Programming Guide
-=======================
-
-:Author: Alan Cox
-
-Introduction
-============
-
-The Z85x30 family synchronous/asynchronous controller chips are used on
-a large number of cheap network interface cards. The kernel provides a
-core interface layer that is designed to make it easy to provide WAN
-services using this chip.
-
-The current driver only support synchronous operation. Merging the
-asynchronous driver support into this code to allow any Z85x30 device to
-be used as both a tty interface and as a synchronous controller is a
-project for Linux post the 2.4 release
-
-Driver Modes
-============
-
-The Z85230 driver layer can drive Z8530, Z85C30 and Z85230 devices in
-three different modes. Each mode can be applied to an individual channel
-on the chip (each chip has two channels).
-
-The PIO synchronous mode supports the most common Z8530 wiring. Here the
-chip is interface to the I/O and interrupt facilities of the host
-machine but not to the DMA subsystem. When running PIO the Z8530 has
-extremely tight timing requirements. Doing high speeds, even with a
-Z85230 will be tricky. Typically you should expect to achieve at best
-9600 baud with a Z8C530 and 64Kbits with a Z85230.
-
-The DMA mode supports the chip when it is configured to use dual DMA
-channels on an ISA bus. The better cards tend to support this mode of
-operation for a single channel. With DMA running the Z85230 tops out
-when it starts to hit ISA DMA constraints at about 512Kbits. It is worth
-noting here that many PC machines hang or crash when the chip is driven
-fast enough to hold the ISA bus solid.
-
-Transmit DMA mode uses a single DMA channel. The DMA channel is used for
-transmission as the transmit FIFO is smaller than the receive FIFO. it
-gives better performance than pure PIO mode but is nowhere near as ideal
-as pure DMA mode.
-
-Using the Z85230 driver
-=======================
-
-The Z85230 driver provides the back end interface to your board. To
-configure a Z8530 interface you need to detect the board and to identify
-its ports and interrupt resources. It is also your problem to verify the
-resources are available.
-
-Having identified the chip you need to fill in a struct z8530_dev,
-which describes each chip. This object must exist until you finally
-shutdown the board. Firstly zero the active field. This ensures nothing
-goes off without you intending it. The irq field should be set to the
-interrupt number of the chip. (Each chip has a single interrupt source
-rather than each channel). You are responsible for allocating the
-interrupt line. The interrupt handler should be set to
-:c:func:`z8530_interrupt()`. The device id should be set to the
-z8530_dev structure pointer. Whether the interrupt can be shared or not
-is board dependent, and up to you to initialise.
-
-The structure holds two channel structures. Initialise chanA.ctrlio and
-chanA.dataio with the address of the control and data ports. You can or
-this with Z8530_PORT_SLEEP to indicate your interface needs the 5uS
-delay for chip settling done in software. The PORT_SLEEP option is
-architecture specific. Other flags may become available on future
-platforms, eg for MMIO. Initialise the chanA.irqs to &z8530_nop to
-start the chip up as disabled and discarding interrupt events. This
-ensures that stray interrupts will be mopped up and not hang the bus.
-Set chanA.dev to point to the device structure itself. The private and
-name field you may use as you wish. The private field is unused by the
-Z85230 layer. The name is used for error reporting and it may thus make
-sense to make it match the network name.
-
-Repeat the same operation with the B channel if your chip has both
-channels wired to something useful. This isn't always the case. If it is
-not wired then the I/O values do not matter, but you must initialise
-chanB.dev.
-
-If your board has DMA facilities then initialise the txdma and rxdma
-fields for the relevant channels. You must also allocate the ISA DMA
-channels and do any necessary board level initialisation to configure
-them. The low level driver will do the Z8530 and DMA controller
-programming but not board specific magic.
-
-Having initialised the device you can then call
-:c:func:`z8530_init()`. This will probe the chip and reset it into
-a known state. An identification sequence is then run to identify the
-chip type. If the checks fail to pass the function returns a non zero
-error code. Typically this indicates that the port given is not valid.
-After this call the type field of the z8530_dev structure is
-initialised to either Z8530, Z85C30 or Z85230 according to the chip
-found.
-
-Once you have called z8530_init you can also make use of the utility
-function :c:func:`z8530_describe()`. This provides a consistent
-reporting format for the Z8530 devices, and allows all the drivers to
-provide consistent reporting.
-
-Attaching Network Interfaces
-============================
-
-If you wish to use the network interface facilities of the driver, then
-you need to attach a network device to each channel that is present and
-in use. In addition to use the generic HDLC you need to follow some
-additional plumbing rules. They may seem complex but a look at the
-example hostess_sv11 driver should reassure you.
-
-The network device used for each channel should be pointed to by the
-netdevice field of each channel. The hdlc-> priv field of the network
-device points to your private data - you will need to be able to find
-your private data from this.
-
-The way most drivers approach this particular problem is to create a
-structure holding the Z8530 device definition and put that into the
-private field of the network device. The network device fields of the
-channels then point back to the network devices.
-
-If you wish to use the generic HDLC then you need to register the HDLC
-device.
-
-Before you register your network device you will also need to provide
-suitable handlers for most of the network device callbacks. See the
-network device documentation for more details on this.
-
-Configuring And Activating The Port
-===================================
-
-The Z85230 driver provides helper functions and tables to load the port
-registers on the Z8530 chips. When programming the register settings for
-a channel be aware that the documentation recommends initialisation
-orders. Strange things happen when these are not followed.
-
-:c:func:`z8530_channel_load()` takes an array of pairs of
-initialisation values in an array of u8 type. The first value is the
-Z8530 register number. Add 16 to indicate the alternate register bank on
-the later chips. The array is terminated by a 255.
-
-The driver provides a pair of public tables. The z8530_hdlc_kilostream
-table is for the UK 'Kilostream' service and also happens to cover most
-other end host configurations. The z8530_hdlc_kilostream_85230 table
-is the same configuration using the enhancements of the 85230 chip. The
-configuration loaded is standard NRZ encoded synchronous data with HDLC
-bitstuffing. All of the timing is taken from the other end of the link.
-
-When writing your own tables be aware that the driver internally tracks
-register values. It may need to reload values. You should therefore be
-sure to set registers 1-7, 9-11, 14 and 15 in all configurations. Where
-the register settings depend on DMA selection the driver will update the
-bits itself when you open or close. Loading a new table with the
-interface open is not recommended.
-
-There are three standard configurations supported by the core code. In
-PIO mode the interface is programmed up to use interrupt driven PIO.
-This places high demands on the host processor to avoid latency. The
-driver is written to take account of latency issues but it cannot avoid
-latencies caused by other drivers, notably IDE in PIO mode. Because the
-drivers allocate buffers you must also prevent MTU changes while the
-port is open.
-
-Once the port is open it will call the rx_function of each channel
-whenever a completed packet arrived. This is invoked from interrupt
-context and passes you the channel and a network buffer (struct
-sk_buff) holding the data. The data includes the CRC bytes so most
-users will want to trim the last two bytes before processing the data.
-This function is very timing critical. To simply discard incoming data,
-use the :c:func:`z8530_null_rx()` function provided by the support
-code.
-
-To activate PIO mode sending and receiving, ``z8530_sync_open`` is called.
-This expects to be passed the network device and the channel. Typically
-this is called from your network device open callback. On failure a
-non-zero error status is returned.
-The :c:func:`z8530_sync_close()` function shuts down a PIO
-channel. This must be done before the channel is opened again and before
-the driver shuts down and unloads.
-
-The ideal mode of operation is dual channel DMA mode. Here the kernel
-driver will configure the board for DMA in both directions. The driver
-also handles ISA DMA issues such as controller programming and the
-memory range limit for you. This mode is activated by calling the
-:c:func:`z8530_sync_dma_open()` function. On failure a non-zero
-error value is returned. Once this mode is activated it can be shut down
-by calling :c:func:`z8530_sync_dma_close()`. You must call
-the close function matching the open mode you used.
-
-The final supported mode uses a single DMA channel to drive the transmit
-side. As the Z85C30 has a larger FIFO on the receive channel this tends
-to increase the maximum speed a little. This mode is activated by calling
-``z8530_sync_txdma_open``. This returns a non-zero error code on failure. The
-:c:func:`z8530_sync_txdma_close()` function closes down the Z8530
-interface from this mode.
-
-Network Layer Functions
-=======================
-
-The Z8530 layer provides functions to queue packets for transmission.
-The driver internally buffers the frame currently being transmitted and
-one further frame (in order to keep back to back transmission running).
-Any further buffering is up to the caller.
-
-The function :c:func:`z8530_queue_xmit()` takes a network buffer
-in sk_buff format and queues it for transmission. The caller must
-provide the entire packet with the exception of the bitstuffing and CRC.
-This is normally done by the caller via the generic HDLC interface
-layer. It returns 0 if the buffer has been queued and a non-zero value
-if the queue is full. If the function accepts the buffer it becomes the
-property of the Z8530 layer and the caller should not free it.
-
-The function :c:func:`z8530_get_stats()` returns a pointer to an
-internally maintained per interface statistics block. This provides most
-of the interface code needed to implement the network layer get_stats
-callback.
-
-Porting The Z8530 Driver
-========================
-
-The Z8530 driver is written to be portable. In DMA mode it makes
-assumptions about the use of ISA DMA. These are probably warranted in
-most cases as the Z85230 in particular was designed to glue to PC type
-machines. The PIO mode makes no real assumptions.
-
-Should you need to retarget the Z8530 driver to another architecture the
-only code that should need changing is the port I/O functions. At the
-moment these assume PC I/O port accesses. This may not be appropriate
-for all platforms. Replacing :c:func:`z8530_read_port()` and
-``z8530_write_port`` is intended to be all that is required to port
-this driver layer.
-
-Known Bugs And Assumptions
-==========================
-
-Interrupt Locking
- The locking in the driver is done via the global cli/sti lock. This
- makes for relatively poor SMP performance. Switching this to use a
- per device spin lock would probably materially improve performance.
-
-Occasional Failures
- We have reports of occasional failures in which, after running for
- very long periods of time, the driver starts to receive junk frames.
- At the moment the cause of this is not clear.
-
-Public Functions Provided
-=========================
-
-.. kernel-doc:: drivers/net/wan/z85230.c
- :export:
-
-Internal Functions
-==================
-
-.. kernel-doc:: drivers/net/wan/z85230.c
- :internal: