Diffstat (limited to 'Documentation/networking')
38 files changed, 2601 insertions, 715 deletions
diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst index c1f7f75e5fd9..4bc6ff29976a 100644 --- a/Documentation/networking/device_drivers/index.rst +++ b/Documentation/networking/device_drivers/index.rst @@ -25,6 +25,7 @@ Contents: mellanox/mlx5 netronome/nfp pensando/ionic + stmicro/stmmac .. only:: subproject and html diff --git a/Documentation/networking/device_drivers/netronome/nfp.rst b/Documentation/networking/device_drivers/netronome/nfp.rst index 6c08ac8b5147..ada611fb427c 100644 --- a/Documentation/networking/device_drivers/netronome/nfp.rst +++ b/Documentation/networking/device_drivers/netronome/nfp.rst @@ -131,3 +131,119 @@ abi_drv_reset abi_drv_load_ifc Defines a list of PF devices allowed to load FW on the device. This variable is not currently user configurable. + +Statistics +========== + +Following device statistics are available through the ``ethtool -S`` interface: + +.. flat-table:: NFP device statistics + :header-rows: 1 + :widths: 3 1 11 + + * - Name + - ID + - Meaning + + * - dev_rx_discards + - 1 + - Packet can be discarded on the RX path for one of the following reasons: + + * The NIC is not in promisc mode, and the destination MAC address + doesn't match the interfaces' MAC address. + * The received packet is larger than the max buffer size on the host. + I.e. it exceeds the Layer 3 MRU. + * There is no freelist descriptor available on the host for the packet. + It is likely that the NIC couldn't cache one in time. + * A BPF program discarded the packet. + * The datapath drop action was executed. + * The MAC discarded the packet due to lack of ingress buffer space + on the NIC. + + * - dev_rx_errors + - 2 + - A packet can be counted (and dropped) as RX error for the following + reasons: + + * A problem with the VEB lookup (only when SR-IOV is used). + * A physical layer problem that causes Ethernet errors, like FCS or + alignment errors. 
The cause is usually faulty cables or SFPs. + + * - dev_rx_bytes + - 3 + - Total number of bytes received. + + * - dev_rx_uc_bytes + - 4 + - Unicast bytes received. + + * - dev_rx_mc_bytes + - 5 + - Multicast bytes received. + + * - dev_rx_bc_bytes + - 6 + - Broadcast bytes received. + + * - dev_rx_pkts + - 7 + - Total number of packets received. + + * - dev_rx_mc_pkts + - 8 + - Multicast packets received. + + * - dev_rx_bc_pkts + - 9 + - Broadcast packets received. + + * - dev_tx_discards + - 10 + - A packet can be discarded in the TX direction if the MAC is + being flow controlled and the NIC runs out of TX queue space. + + * - dev_tx_errors + - 11 + - A packet can be counted as TX error (and dropped) for one for the + following reasons: + + * The packet is an LSO segment, but the Layer 3 or Layer 4 offset + could not be determined. Therefore LSO could not continue. + * An invalid packet descriptor was received over PCIe. + * The packet Layer 3 length exceeds the device MTU. + * An error on the MAC/physical layer. Usually due to faulty cables or + SFPs. + * A CTM buffer could not be allocated. + * The packet offset was incorrect and could not be fixed by the NIC. + + * - dev_tx_bytes + - 12 + - Total number of bytes transmitted. + + * - dev_tx_uc_bytes + - 13 + - Unicast bytes transmitted. + + * - dev_tx_mc_bytes + - 14 + - Multicast bytes transmitted. + + * - dev_tx_bc_bytes + - 15 + - Broadcast bytes transmitted. + + * - dev_tx_pkts + - 16 + - Total number of packets transmitted. + + * - dev_tx_mc_pkts + - 17 + - Multicast packets transmitted. + + * - dev_tx_bc_pkts + - 18 + - Broadcast packets transmitted. + +Note that statistics unknown to the driver will be displayed as +``dev_unknown_stat$ID``, where ``$ID`` refers to the second column +above. 
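Annotation on the nfp.rst hunk above: the closing note says that counters the driver does not recognise are printed as ``dev_unknown_stat$ID``, with ``$ID`` taken from the second column of the table. A minimal sketch of resolving such names in user space follows. It assumes the usual ``name: value`` line layout of ``ethtool -S`` output; the helper name ``parse_ethtool_stats`` and the dictionary ``NFP_DEV_STATS`` are illustrative and not part of the driver or of ethtool. IDs outside the table are left unchanged.

```python
import re

# ID -> counter name, copied from the "NFP device statistics" table in the
# hunk above. The dictionary itself is only an illustration.
NFP_DEV_STATS = {
    1: "dev_rx_discards", 2: "dev_rx_errors", 3: "dev_rx_bytes",
    4: "dev_rx_uc_bytes", 5: "dev_rx_mc_bytes", 6: "dev_rx_bc_bytes",
    7: "dev_rx_pkts", 8: "dev_rx_mc_pkts", 9: "dev_rx_bc_pkts",
    10: "dev_tx_discards", 11: "dev_tx_errors", 12: "dev_tx_bytes",
    13: "dev_tx_uc_bytes", 14: "dev_tx_mc_bytes", 15: "dev_tx_bc_bytes",
    16: "dev_tx_pkts", 17: "dev_tx_mc_pkts", 18: "dev_tx_bc_pkts",
}

_UNKNOWN = re.compile(r"^dev_unknown_stat(\d+)$")


def parse_ethtool_stats(text):
    """Parse 'name: value' lines (hypothetical helper, not an ethtool API).

    Names of the form dev_unknown_stat$ID are replaced by the table name
    for that ID when one exists; otherwise they are kept as printed.
    """
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().rpartition(":")
        if not sep:
            continue  # blank line or no colon at all
        try:
            number = int(value)
        except ValueError:
            continue  # e.g. the "NIC statistics:" header line
        name = name.strip()
        match = _UNKNOWN.match(name)
        if match:
            name = NFP_DEV_STATS.get(int(match.group(1)), name)
        stats[name] = number
    return stats


# Assumed sample input; real output would come from `ethtool -S <iface>`.
sample = (
    "NIC statistics:\n"
    "     dev_rx_pkts: 10\n"
    "     dev_unknown_stat12: 2048\n"
    "     dev_unknown_stat99: 7\n"
)
result = parse_ethtool_stats(sample)
```

A lookup of this kind is mainly useful when an older driver meets newer firmware: the firmware exposes a counter ID the driver has no name for, and the table from a newer copy of the documentation supplies it.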
diff --git a/Documentation/networking/device_drivers/stmicro/stmmac.rst b/Documentation/networking/device_drivers/stmicro/stmmac.rst new file mode 100644 index 000000000000..c34bab3d2df0 --- /dev/null +++ b/Documentation/networking/device_drivers/stmicro/stmmac.rst @@ -0,0 +1,697 @@ +.. SPDX-License-Identifier: GPL-2.0+ + +============================================================== +Linux Driver for the Synopsys(R) Ethernet Controllers "stmmac" +============================================================== + +Authors: Giuseppe Cavallaro <peppe.cavallaro@st.com>, +Alexandre Torgue <alexandre.torgue@st.com>, Jose Abreu <joabreu@synopsys.com> + +Contents +======== + +- In This Release +- Feature List +- Kernel Configuration +- Command Line Parameters +- Driver Information and Notes +- Debug Information +- Support + +In This Release +=============== + +This file describes the stmmac Linux Driver for all the Synopsys(R) Ethernet +Controllers. + +Currently, this network device driver is for all STi embedded MAC/GMAC +(i.e. 7xxx/5xxx SoCs), SPEAr (arm), Loongson1B (mips) and XILINX XC2V3000 +FF1152AMT0221 D1215994A VIRTEX FPGA board. The Synopsys Ethernet QoS 5.0 IPK +is also supported. + +DesignWare(R) Cores Ethernet MAC 10/100/1000 Universal version 3.70a +(and older) and DesignWare(R) Cores Ethernet Quality-of-Service version 4.0 +(and upper) have been used for developing this driver as well as +DesignWare(R) Cores XGMAC - 10G Ethernet MAC. + +This driver supports both the platform bus and PCI. + +This driver includes support for the following Synopsys(R) DesignWare(R) +Cores Ethernet Controllers and corresponding minimum and maximum versions: + ++-------------------------------+--------------+--------------+--------------+ +| Controller Name | Min. Version | Max. Version | Abbrev. 
Name | ++===============================+==============+==============+==============+ +| Ethernet MAC Universal | N/A | 3.73a | GMAC | ++-------------------------------+--------------+--------------+--------------+ +| Ethernet Quality-of-Service | 4.00a | N/A | GMAC4+ | ++-------------------------------+--------------+--------------+--------------+ +| XGMAC - 10G Ethernet MAC | 2.10a | N/A | XGMAC2+ | ++-------------------------------+--------------+--------------+--------------+ + +For questions related to hardware requirements, refer to the documentation +supplied with your Ethernet adapter. All hardware requirements listed apply +to use with Linux. + +Feature List +============ + +The following features are available in this driver: + - GMII/MII/RGMII/SGMII/RMII/XGMII Interface + - Half-Duplex / Full-Duplex Operation + - Energy Efficient Ethernet (EEE) + - IEEE 802.3x PAUSE Packets (Flow Control) + - RMON/MIB Counters + - IEEE 1588 Timestamping (PTP) + - Pulse-Per-Second Output (PPS) + - MDIO Clause 22 / Clause 45 Interface + - MAC Loopback + - ARP Offloading + - Automatic CRC / PAD Insertion and Checking + - Checksum Offload for Received and Transmitted Packets + - Standard or Jumbo Ethernet Packets + - Source Address Insertion / Replacement + - VLAN TAG Insertion / Replacement / Deletion / Filtering (HASH and PERFECT) + - Programmable TX and RX Watchdog and Coalesce Settings + - Destination Address Filtering (PERFECT) + - HASH Filtering (Multicast) + - Layer 3 / Layer 4 Filtering + - Remote Wake-Up Detection + - Receive Side Scaling (RSS) + - Frame Preemption for TX and RX + - Programmable Burst Length, Threshold, Queue Size + - Multiple Queues (up to 8) + - Multiple Scheduling Algorithms (TX: WRR, DWRR, WFQ, SP, CBS, EST, TBS; + RX: WRR, SP) + - Flexible RX Parser + - TCP / UDP Segmentation Offload (TSO, USO) + - Split Header (SPH) + - Safety Features (ECC Protection, Data Parity Protection) + - Selftests using Ethtool + +Kernel Configuration 
+==================== + +The kernel configuration option is ``CONFIG_STMMAC_ETH``: + - ``CONFIG_STMMAC_PLATFORM``: is to enable the platform driver. + - ``CONFIG_STMMAC_PCI``: is to enable the pci driver. + +Command Line Parameters +======================= + +If the driver is built as a module the following optional parameters are used +by entering them on the command line with the modprobe command using this +syntax (e.g. for PCI module):: + + modprobe stmmac_pci [<option>=<VAL1>,<VAL2>,...] + +Driver parameters can be also passed in command line by using:: + + stmmaceth=watchdog:100,chain_mode=1 + +The default value for each parameter is generally the recommended setting, +unless otherwise noted. + +watchdog +-------- +:Valid Range: 5000-None +:Default Value: 5000 + +This parameter overrides the transmit timeout in milliseconds. + +debug +----- +:Valid Range: 0-16 (0=none,...,16=all) +:Default Value: 0 + +This parameter adjusts the level of debug messages displayed in the system +logs. + +phyaddr +------- +:Valid Range: 0-31 +:Default Value: -1 + +This parameter overrides the physical address of the PHY device. + +flow_ctrl +--------- +:Valid Range: 0-3 (0=off,1=rx,2=tx,3=rx/tx) +:Default Value: 3 + +This parameter changes the default Flow Control ability. + +pause +----- +:Valid Range: 0-65535 +:Default Value: 65535 + +This parameter changes the default Flow Control Pause time. + +tc +-- +:Valid Range: 64-256 +:Default Value: 64 + +This parameter changes the default HW FIFO Threshold control value. + +buf_sz +------ +:Valid Range: 1536-16384 +:Default Value: 1536 + +This parameter changes the default RX DMA packet buffer size. + +eee_timer +--------- +:Valid Range: 0-None +:Default Value: 1000 + +This parameter changes the default LPI TX Expiration time in milliseconds. + +chain_mode +---------- +:Valid Range: 0-1 (0=off,1=on) +:Default Value: 0 + +This parameter changes the default mode of operation from Ring Mode to +Chain Mode. 
+ +Driver Information and Notes +============================ + +Transmit Process +---------------- + +The xmit method is invoked when the kernel needs to transmit a packet; it sets +the descriptors in the ring and informs the DMA engine that there is a packet +ready to be transmitted. + +By default, the driver sets the ``NETIF_F_SG`` bit in the features field of +the ``net_device`` structure, enabling the scatter-gather feature. This is +true on chips and configurations where the checksum can be done in hardware. + +Once the controller has finished transmitting the packet, timer will be +scheduled to release the transmit resources. + +Receive Process +--------------- + +When one or more packets are received, an interrupt happens. The interrupts +are not queued, so the driver has to scan all the descriptors in the ring +during the receive process. + +This is based on NAPI, so the interrupt handler signals only if there is work +to be done, and it exits. Then the poll method will be scheduled at some +future point. + +The incoming packets are stored, by the DMA, in a list of pre-allocated socket +buffers in order to avoid the memcpy (zero-copy). + +Interrupt Mitigation +-------------------- + +The driver is able to mitigate the number of its DMA interrupts using NAPI for +the reception on chips older than the 3.50. New chips have an HW RX Watchdog +used for this mitigation. + +Mitigation parameters can be tuned by ethtool. + +WoL +--- + +Wake up on Lan feature through Magic and Unicast frames are supported for the +GMAC, GMAC4/5 and XGMAC core. + +DMA Descriptors +--------------- + +Driver handles both normal and alternate descriptors. The latter has been only +tested on DesignWare(R) Cores Ethernet MAC Universal version 3.41a and later. + +stmmac supports DMA descriptor to operate both in dual buffer (RING) and +linked-list(CHAINED) mode. In RING each descriptor points to two data buffer +pointers whereas in CHAINED mode they point to only one data buffer pointer. 
+RING mode is the default. + +In CHAINED mode each descriptor will have pointer to next descriptor in the +list, hence creating the explicit chaining in the descriptor itself, whereas +such explicit chaining is not possible in RING mode. + +Extended Descriptors +-------------------- + +The extended descriptors give us information about the Ethernet payload when +it is carrying PTP packets or TCP/UDP/ICMP over IP. These are not available on +GMAC Synopsys(R) chips older than the 3.50. At probe time the driver will +decide if these can be actually used. This support also is mandatory for PTPv2 +because the extra descriptors are used for saving the hardware timestamps and +Extended Status. + +Ethtool Support +--------------- + +Ethtool is supported. For example, driver statistics (including RMON), +internal errors can be taken using:: + + ethtool -S ethX + +Ethtool selftests are also supported. This allows to do some early sanity +checks to the HW using MAC and PHY loopback mechanisms:: + + ethtool -t ethX + +Jumbo and Segmentation Offloading +--------------------------------- + +Jumbo frames are supported and tested for the GMAC. The GSO has been also +added but it's performed in software. LRO is not supported. + +TSO Support +----------- + +TSO (TCP Segmentation Offload) feature is supported by GMAC > 4.x and XGMAC +chip family. When a packet is sent through TCP protocol, the TCP stack ensures +that the SKB provided to the low level driver (stmmac in our case) matches +with the maximum frame len (IP header + TCP header + payload <= 1500 bytes +(for MTU set to 1500)). It means that if an application using TCP want to send +a packet which will have a length (after adding headers) > 1514 the packet +will be split in several TCP packets: The data payload is split and headers +(TCP/IP ..) are added. It is done by software. + +When TSO is enabled, the TCP stack doesn't care about the maximum frame length +and provide SKB packet to stmmac as it is. 
The GMAC IP will have to perform +the segmentation by it self to match with maximum frame length. + +This feature can be enabled in device tree through ``snps,tso`` entry. + +Energy Efficient Ethernet +------------------------- + +Energy Efficient Ethernet (EEE) enables IEEE 802.3 MAC sublayer along with a +family of Physical layer to operate in the Low Power Idle (LPI) mode. The EEE +mode supports the IEEE 802.3 MAC operation at 100Mbps, 1000Mbps and 1Gbps. + +The LPI mode allows power saving by switching off parts of the communication +device functionality when there is no data to be transmitted & received. +The system on both the side of the link can disable some functionalities and +save power during the period of low-link utilization. The MAC controls whether +the system should enter or exit the LPI mode and communicate this to PHY. + +As soon as the interface is opened, the driver verifies if the EEE can be +supported. This is done by looking at both the DMA HW capability register and +the PHY devices MCD registers. + +To enter in TX LPI mode the driver needs to have a software timer that enable +and disable the LPI mode when there is nothing to be transmitted. + +Precision Time Protocol (PTP) +----------------------------- + +The driver supports the IEEE 1588-2002, Precision Time Protocol (PTP), which +enables precise synchronization of clocks in measurement and control systems +implemented with technologies such as network communication. + +In addition to the basic timestamp features mentioned in IEEE 1588-2002 +Timestamps, new GMAC cores support the advanced timestamp features. +IEEE 1588-2008 can be enabled when configuring the Kernel. + +SGMII/RGMII Support +------------------- + +New GMAC devices provide own way to manage RGMII/SGMII. This information is +available at run-time by looking at the HW capability register. This means +that the stmmac can manage auto-negotiation and link status w/o using the +PHYLIB stuff. 
In fact, the HW provides a subset of extended registers to +restart the ANE, verify Full/Half duplex mode and Speed. Thanks to these +registers, it is possible to look at the Auto-negotiated Link Parter Ability. + +Physical +-------- + +The driver is compatible with Physical Abstraction Layer to be connected with +PHY and GPHY devices. + +Platform Information +-------------------- + +Several information can be passed through the platform and device-tree. + +:: + + struct plat_stmmacenet_data { + +1) Bus identifier:: + + int bus_id; + +2) PHY Physical Address. If set to -1 the driver will pick the first PHY it +finds:: + + int phy_addr; + +3) PHY Device Interface:: + + int interface; + +4) Specific platform fields for the MDIO bus:: + + struct stmmac_mdio_bus_data *mdio_bus_data; + +5) Internal DMA parameters:: + + struct stmmac_dma_cfg *dma_cfg; + +6) Fixed CSR Clock Range selection:: + + int clk_csr; + +7) HW uses the GMAC core:: + + int has_gmac; + +8) If set the MAC will use Enhanced Descriptors:: + + int enh_desc; + +9) Core is able to perform TX Checksum and/or RX Checksum in HW:: + + int tx_coe; + int rx_coe; + +11) Some HWs are not able to perform the csum in HW for over-sized frames due +to limited buffer sizes. 
Setting this flag the csum will be done in SW on +JUMBO frames:: + + int bugged_jumbo; + +12) Core has the embedded power module:: + + int pmt; + +13) Force DMA to use the Store and Forward mode or Threshold mode:: + + int force_sf_dma_mode; + int force_thresh_dma_mode; + +15) Force to disable the RX Watchdog feature and switch to NAPI mode:: + + int riwt_off; + +16) Limit the maximum operating speed and MTU:: + + int max_speed; + int maxmtu; + +18) Number of Multicast/Unicast filters:: + + int multicast_filter_bins; + int unicast_filter_entries; + +20) Limit the maximum TX and RX FIFO size:: + + int tx_fifo_size; + int rx_fifo_size; + +21) Use the specified number of TX and RX Queues:: + + u32 rx_queues_to_use; + u32 tx_queues_to_use; + +22) Use the specified TX and RX scheduling algorithm:: + + u8 rx_sched_algorithm; + u8 tx_sched_algorithm; + +23) Internal TX and RX Queue parameters:: + + struct stmmac_rxq_cfg rx_queues_cfg[MTL_MAX_RX_QUEUES]; + struct stmmac_txq_cfg tx_queues_cfg[MTL_MAX_TX_QUEUES]; + +24) This callback is used for modifying some syscfg registers (on ST SoCs) +according to the link speed negotiated by the physical layer:: + + void (*fix_mac_speed)(void *priv, unsigned int speed); + +25) Callbacks used for calling a custom initialization; This is sometimes +necessary on some platforms (e.g. ST boxes) where the HW needs to have set +some PIO lines or system cfg registers. init/exit callbacks should not use +or modify platform data:: + + int (*init)(struct platform_device *pdev, void *priv); + void (*exit)(struct platform_device *pdev, void *priv); + +26) Perform HW setup of the bus. 
For example, on some ST platforms this field +is used to configure the AMBA bridge to generate more efficient STBus traffic:: + + struct mac_device_info *(*setup)(void *priv); + void *bsp_priv; + +27) Internal clocks and rates:: + + struct clk *stmmac_clk; + struct clk *pclk; + struct clk *clk_ptp_ref; + unsigned int clk_ptp_rate; + unsigned int clk_ref_rate; + s32 ptp_max_adj; + +28) Main reset:: + + struct reset_control *stmmac_rst; + +29) AXI Internal Parameters:: + + struct stmmac_axi *axi; + +30) HW uses GMAC>4 cores:: + + int has_gmac4; + +31) HW is sun8i based:: + + bool has_sun8i; + +32) Enables TSO feature:: + + bool tso_en; + +33) Enables Receive Side Scaling (RSS) feature:: + + int rss_en; + +34) MAC Port selection:: + + int mac_port_sel_speed; + +35) Enables TX LPI Clock Gating:: + + bool en_tx_lpi_clockgating; + +36) HW uses XGMAC>2.10 cores:: + + int has_xgmac; + +:: + + } + +For MDIO bus data, we have: + +:: + + struct stmmac_mdio_bus_data { + +1) PHY mask passed when MDIO bus is registered:: + + unsigned int phy_mask; + +2) List of IRQs, one per PHY:: + + int *irqs; + +3) If IRQs is NULL, use this for probed PHY:: + + int probed_phy_irq; + +4) Set to true if PHY needs reset:: + + bool needs_reset; + +:: + + } + +For DMA engine configuration, we have: + +:: + + struct stmmac_dma_cfg { + +1) Programmable Burst Length (TX and RX):: + + int pbl; + +2) If set, DMA TX / RX will use this value rather than pbl:: + + int txpbl; + int rxpbl; + +3) Enable 8xPBL:: + + bool pblx8; + +4) Enable Fixed or Mixed burst:: + + int fixed_burst; + int mixed_burst; + +5) Enable Address Aligned Beats:: + + bool aal; + +6) Enable Enhanced Addressing (> 32 bits):: + + bool eame; + +:: + + } + +For DMA AXI parameters, we have: + +:: + + struct stmmac_axi { + +1) Enable AXI LPI:: + + bool axi_lpi_en; + bool axi_xit_frm; + +2) Set AXI Write / Read maximum outstanding requests:: + + u32 axi_wr_osr_lmt; + u32 axi_rd_osr_lmt; + +3) Set AXI 4KB bursts:: + + bool axi_kbbe; + +4) Set 
AXI maximum burst length map:: + + u32 axi_blen[AXI_BLEN]; + +5) Set AXI Fixed burst / mixed burst:: + + bool axi_fb; + bool axi_mb; + +6) Set AXI rebuild incrx mode:: + + bool axi_rb; + +:: + + } + +For the RX Queues configuration, we have: + +:: + + struct stmmac_rxq_cfg { + +1) Mode to use (DCB or AVB):: + + u8 mode_to_use; + +2) DMA channel to use:: + + u32 chan; + +3) Packet routing, if applicable:: + + u8 pkt_route; + +4) Use priority routing, and priority to route:: + + bool use_prio; + u32 prio; + +:: + + } + +For the TX Queues configuration, we have: + +:: + + struct stmmac_txq_cfg { + +1) Queue weight in scheduler:: + + u32 weight; + +2) Mode to use (DCB or AVB):: + + u8 mode_to_use; + +3) Credit Base Shaper Parameters:: + + u32 send_slope; + u32 idle_slope; + u32 high_credit; + u32 low_credit; + +4) Use priority scheduling, and priority:: + + bool use_prio; + u32 prio; + +:: + + } + +Device Tree Information +----------------------- + +Please refer to the following document: +Documentation/devicetree/bindings/net/snps,dwmac.yaml + +HW Capabilities +--------------- + +Note that, starting from new chips, where it is available the HW capability +register, many configurations are discovered at run-time for example to +understand if EEE, HW csum, PTP, enhanced descriptor etc are actually +available. As strategy adopted in this driver, the information from the HW +capability register can replace what has been passed from the platform. + +Debug Information +================= + +The driver exports many information i.e. internal statistics, debug +information, MAC and DMA registers etc. + +These can be read in several ways depending on the type of the information +actually needed. + +For example a user can be use the ethtool support to get statistics: e.g. +using: ``ethtool -S ethX`` (that shows the Management counters (MMC) if +supported) or sees the MAC/DMA registers: e.g. 
using: ``ethtool -d ethX`` + +Compiling the Kernel with ``CONFIG_DEBUG_FS`` the driver will export the +following debugfs entries: + + - ``descriptors_status``: To show the DMA TX/RX descriptor rings + - ``dma_cap``: To show the HW Capabilities + +Developer can also use the ``debug`` module parameter to get further debug +information (please see: NETIF Msg Level). + +Support +======= + +If an issue is identified with the released source code on a supported kernel +with a supported adapter, email the specific information related to the +issue to netdev@vger.kernel.org diff --git a/Documentation/networking/device_drivers/stmicro/stmmac.txt b/Documentation/networking/device_drivers/stmicro/stmmac.txt deleted file mode 100644 index 1ae979fd90d2..000000000000 --- a/Documentation/networking/device_drivers/stmicro/stmmac.txt +++ /dev/null @@ -1,401 +0,0 @@ - STMicroelectronics 10/100/1000 Synopsys Ethernet driver - -Copyright (C) 2007-2015 STMicroelectronics Ltd -Author: Giuseppe Cavallaro <peppe.cavallaro@st.com> - -This is the driver for the MAC 10/100/1000 on-chip Ethernet controllers -(Synopsys IP blocks). - -Currently this network device driver is for all STi embedded MAC/GMAC -(i.e. 7xxx/5xxx SoCs), SPEAr (arm), Loongson1B (mips) and XLINX XC2V3000 -FF1152AMT0221 D1215994A VIRTEX FPGA board. - -DWC Ether MAC 10/100/1000 Universal version 3.70a (and older) and DWC Ether -MAC 10/100 Universal version 4.0 have been used for developing this driver. - -This driver supports both the platform bus and PCI. - -Please, for more information also visit: www.stlinux.com - -1) Kernel Configuration -The kernel configuration option is STMMAC_ETH: - Device Drivers ---> Network device support ---> Ethernet (1000 Mbit) ---> - STMicroelectronics 10/100/1000 Ethernet driver (STMMAC_ETH) - -CONFIG_STMMAC_PLATFORM: is to enable the platform driver. -CONFIG_STMMAC_PCI: is to enable the pci driver. 
- -2) Driver parameters list: - debug: message level (0: no output, 16: all); - phyaddr: to manually provide the physical address to the PHY device; - buf_sz: DMA buffer size; - tc: control the HW FIFO threshold; - watchdog: transmit timeout (in milliseconds); - flow_ctrl: Flow control ability [on/off]; - pause: Flow Control Pause Time; - eee_timer: tx EEE timer; - chain_mode: select chain mode instead of ring. - -3) Command line options -Driver parameters can be also passed in command line by using: - stmmaceth=watchdog:100,chain_mode=1 - -4) Driver information and notes - -4.1) Transmit process -The xmit method is invoked when the kernel needs to transmit a packet; it sets -the descriptors in the ring and informs the DMA engine, that there is a packet -ready to be transmitted. -By default, the driver sets the NETIF_F_SG bit in the features field of the -net_device structure, enabling the scatter-gather feature. This is true on -chips and configurations where the checksum can be done in hardware. -Once the controller has finished transmitting the packet, timer will be -scheduled to release the transmit resources. - -4.2) Receive process -When one or more packets are received, an interrupt happens. The interrupts -are not queued, so the driver has to scan all the descriptors in the ring during -the receive process. -This is based on NAPI, so the interrupt handler signals only if there is work -to be done, and it exits. -Then the poll method will be scheduled at some future point. -The incoming packets are stored, by the DMA, in a list of pre-allocated socket -buffers in order to avoid the memcpy (zero-copy). - -4.3) Interrupt mitigation -The driver is able to mitigate the number of its DMA interrupts -using NAPI for the reception on chips older than the 3.50. -New chips have an HW RX-Watchdog used for this mitigation. -Mitigation parameters can be tuned by ethtool. 
- -4.4) WOL -Wake up on Lan feature through Magic and Unicast frames are supported for the -GMAC core. - -4.5) DMA descriptors -Driver handles both normal and alternate descriptors. The latter has been only -tested on DWC Ether MAC 10/100/1000 Universal version 3.41a and later. - -STMMAC supports DMA descriptor to operate both in dual buffer (RING) -and linked-list(CHAINED) mode. In RING each descriptor points to two -data buffer pointers whereas in CHAINED mode they point to only one data -buffer pointer. RING mode is the default. - -In CHAINED mode each descriptor will have pointer to next descriptor in -the list, hence creating the explicit chaining in the descriptor itself, -whereas such explicit chaining is not possible in RING mode. - -4.5.1) Extended descriptors -The extended descriptors give us information about the Ethernet payload -when it is carrying PTP packets or TCP/UDP/ICMP over IP. -These are not available on GMAC Synopsys chips older than the 3.50. -At probe time the driver will decide if these can be actually used. -This support also is mandatory for PTPv2 because the extra descriptors -are used for saving the hardware timestamps and Extended Status. - -4.6) Ethtool support -Ethtool is supported. - -For example, driver statistics (including RMON), internal errors can be taken -using: - # ethtool -S ethX -command - -4.7) Jumbo and Segmentation Offloading -Jumbo frames are supported and tested for the GMAC. -The GSO has been also added but it's performed in software. -LRO is not supported. - -4.8) Physical -The driver is compatible with Physical Abstraction Layer to be connected with -PHY and GPHY devices. - -4.9) Platform information -Several information can be passed through the platform and device-tree. 
- -struct plat_stmmacenet_data { - char *phy_bus_name; - int bus_id; - int phy_addr; - int interface; - struct stmmac_mdio_bus_data *mdio_bus_data; - struct stmmac_dma_cfg *dma_cfg; - int clk_csr; - int has_gmac; - int enh_desc; - int tx_coe; - int rx_coe; - int bugged_jumbo; - int pmt; - int force_sf_dma_mode; - int force_thresh_dma_mode; - int riwt_off; - int max_speed; - int maxmtu; - void (*fix_mac_speed)(void *priv, unsigned int speed); - void (*bus_setup)(void __iomem *ioaddr); - int (*init)(struct platform_device *pdev, void *priv); - void (*exit)(struct platform_device *pdev, void *priv); - void *bsp_priv; - int has_gmac4; - bool tso_en; -}; - -Where: - o phy_bus_name: phy bus name to attach to the stmmac. - o bus_id: bus identifier. - o phy_addr: the physical address can be passed from the platform. - If it is set to -1 the driver will automatically - detect it at run-time by probing all the 32 addresses. - o interface: PHY device's interface. - o mdio_bus_data: specific platform fields for the MDIO bus. - o dma_cfg: internal DMA parameters - o pbl: the Programmable Burst Length is maximum number of beats to - be transferred in one DMA transaction. - GMAC also enables the 4xPBL by default. (8xPBL for GMAC 3.50 and newer) - o txpbl/rxpbl: GMAC and newer supports independent DMA pbl for tx/rx. - o pblx8: Enable 8xPBL (4xPBL for core rev < 3.50). Enabled by default. - o fixed_burst/mixed_burst/aal - o clk_csr: fixed CSR Clock range selection. - o has_gmac: uses the GMAC core. - o enh_desc: if sets the MAC will use the enhanced descriptor structure. - o tx_coe: core is able to perform the tx csum in HW. - o rx_coe: the supports three check sum offloading engine types: - type_1, type_2 (full csum) and no RX coe. - o bugged_jumbo: some HWs are not able to perform the csum in HW for - over-sized frames due to limited buffer sizes. - Setting this flag the csum will be done in SW on - JUMBO frames. - o pmt: core has the embedded power module (optional). 
- o force_sf_dma_mode: force DMA to use the Store and Forward mode - instead of the Threshold. - o force_thresh_dma_mode: force DMA to use the Threshold mode other than - the Store and Forward mode. - o riwt_off: force to disable the RX watchdog feature and switch to NAPI mode. - o fix_mac_speed: this callback is used for modifying some syscfg registers - (on ST SoCs) according to the link speed negotiated by the - physical layer . - o bus_setup: perform HW setup of the bus. For example, on some ST platforms - this field is used to configure the AMBA bridge to generate more - efficient STBus traffic. - o init/exit: callbacks used for calling a custom initialization; - this is sometime necessary on some platforms (e.g. ST boxes) - where the HW needs to have set some PIO lines or system cfg - registers. init/exit callbacks should not use or modify - platform data. - o bsp_priv: another private pointer. - o has_gmac4: uses GMAC4 core. - o tso_en: Enables TSO (TCP Segmentation Offload) feature. - -For MDIO bus The we have: - - struct stmmac_mdio_bus_data { - int (*phy_reset)(void *priv); - unsigned int phy_mask; - int *irqs; - int probed_phy_irq; - }; - -Where: - o phy_reset: hook to reset the phy device attached to the bus. - o phy_mask: phy mask passed when register the MDIO bus within the driver. - o irqs: list of IRQs, one per PHY. - o probed_phy_irq: if irqs is NULL, use this for probed PHY. - -For DMA engine we have the following internal fields that should be -tuned according to the HW capabilities. - -struct stmmac_dma_cfg { - int pbl; - int txpbl; - int rxpbl; - bool pblx8; - int fixed_burst; - int mixed_burst; - bool aal; -}; - -Where: - o pbl: Programmable Burst Length (tx and rx) - o txpbl: Transmit Programmable Burst Length. Only for GMAC and newer. - If set, DMA tx will use this value rather than pbl. - o rxpbl: Receive Programmable Burst Length. Only for GMAC and newer. - If set, DMA rx will use this value rather than pbl. 
- o pblx8: Enable 8xPBL (4xPBL for core rev < 3.50). Enabled by default. - o fixed_burst: program the DMA to use the fixed burst mode - o mixed_burst: program the DMA to use the mixed burst mode - o aal: Address-Aligned Beats - ---- - -Below an example how the structures above are using on ST platforms. - - static struct plat_stmmacenet_data stxYYY_ethernet_platform_data = { - .has_gmac = 0, - .enh_desc = 0, - .fix_mac_speed = stxYYY_ethernet_fix_mac_speed, - | - |-> to write an internal syscfg - | on this platform when the - | link speed changes from 10 to - | 100 and viceversa - .init = &stmmac_claim_resource, - | - |-> On ST SoC this calls own "PAD" - | manager framework to claim - | all the resources necessary - | (GPIO ...). The .custom_cfg field - | is used to pass a custom config. -}; - -Below the usage of the stmmac_mdio_bus_data: on this SoC, in fact, -there are two MAC cores: one MAC is for MDIO Bus/PHY emulation -with fixed_link support. - -static struct stmmac_mdio_bus_data stmmac1_mdio_bus = { - .phy_reset = phy_reset; - | - |-> function to provide the phy_reset on this board - .phy_mask = 0, -}; - -static struct fixed_phy_status stmmac0_fixed_phy_status = { - .link = 1, - .speed = 100, - .duplex = 1, -}; - -During the board's device_init we can configure the first -MAC for fixed_link by calling: - fixed_phy_add(PHY_POLL, 1, &stmmac0_fixed_phy_status); -and the second one, with a real PHY device attached to the bus, -by using the stmmac_mdio_bus_data structure (to provide the id, the -reset procedure etc). - -Note that, starting from new chips, where it is available the HW capability -register, many configurations are discovered at run-time for example to -understand if EEE, HW csum, PTP, enhanced descriptor etc are actually -available. As strategy adopted in this driver, the information from the HW -capability register can replace what has been passed from the platform. - -4.10) Device-tree support. 
- -Please see the following document: - Documentation/devicetree/bindings/net/stmmac.txt - -4.11) This is a summary of the content of some relevant files: - o stmmac_main.c: implements the main network device driver; - o stmmac_mdio.c: provides MDIO functions; - o stmmac_pci: this is the PCI driver; - o stmmac_platform.c: this the platform driver (OF supported); - o stmmac_ethtool.c: implements the ethtool support; - o stmmac.h: private driver structure; - o common.h: common definitions and VFTs; - o mmc_core.c/mmc.h: Management MAC Counters; - o stmmac_hwtstamp.c: HW timestamp support for PTP; - o stmmac_ptp.c: PTP 1588 clock; - o stmmac_pcs.h: Physical Coding Sublayer common implementation; - o dwmac-<XXX>.c: these are for the platform glue-logic file; e.g. dwmac-sti.c - for STMicroelectronics SoCs. - -- GMAC 3.x - o descs.h: descriptor structure definitions; - o dwmac1000_core.c: dwmac GiGa core functions; - o dwmac1000_dma.c: dma functions for the GMAC chip; - o dwmac1000.h: specific header file for the dwmac GiGa; - o dwmac100_core: dwmac 100 core code; - o dwmac100_dma.c: dma functions for the dwmac 100 chip; - o dwmac1000.h: specific header file for the MAC; - o dwmac_lib.c: generic DMA functions; - o enh_desc.c: functions for handling enhanced descriptors; - o norm_desc.c: functions for handling normal descriptors; - o chain_mode.c/ring_mode.c:: functions to manage RING/CHAINED modes; - -- GMAC4.x generation - o dwmac4_core.c: dwmac GMAC4.x core functions; - o dwmac4_desc.c: functions for handling GMAC4.x descriptors; - o dwmac4_descs.h: descriptor definitions; - o dwmac4_dma.c: dma functions for the GMAC4.x chip; - o dwmac4_dma.h: dma definitions for the GMAC4.x chip; - o dwmac4.h: core definitions for the GMAC4.x chip; - o dwmac4_lib.c: generic GMAC4.x functions; - -4.12) TSO support (GMAC4.x) - -TSO (Tcp Segmentation Offload) feature is supported by GMAC 4.x chip family. 
-When a packet is sent through TCP protocol, the TCP stack ensures that -the SKB provided to the low level driver (stmmac in our case) matches with -the maximum frame len (IP header + TCP header + payload <= 1500 bytes (for -MTU set to 1500)). It means that if an application using TCP want to send a -packet which will have a length (after adding headers) > 1514 the packet -will be split in several TCP packets: The data payload is split and headers -(TCP/IP ..) are added. It is done by software. - -When TSO is enabled, the TCP stack doesn't care about the maximum frame -length and provide SKB packet to stmmac as it is. The GMAC IP will have to -perform the segmentation by it self to match with maximum frame length. - -This feature can be enabled in device tree through "snps,tso" entry. - -5) Debug Information - -The driver exports many information i.e. internal statistics, -debug information, MAC and DMA registers etc. - -These can be read in several ways depending on the -type of the information actually needed. - -For example a user can be use the ethtool support -to get statistics: e.g. using: ethtool -S ethX -(that shows the Management counters (MMC) if supported) -or sees the MAC/DMA registers: e.g. using: ethtool -d ethX - -Compiling the Kernel with CONFIG_DEBUG_FS the driver will export the following -debugfs entries: - -/sys/kernel/debug/stmmaceth/descriptors_status - To show the DMA TX/RX descriptor rings - -Developer can also use the "debug" module parameter to get further debug -information (please see: NETIF Msg Level). - -6) Energy Efficient Ethernet - -Energy Efficient Ethernet(EEE) enables IEEE 802.3 MAC sublayer along -with a family of Physical layer to operate in the Low power Idle(LPI) -mode. The EEE mode supports the IEEE 802.3 MAC operation at 100Mbps, -1000Mbps & 10Gbps. - -The LPI mode allows power saving by switching off parts of the -communication device functionality when there is no data to be -transmitted & received. 
The system on both the side of the link can -disable some functionalities & save power during the period of low-link -utilization. The MAC controls whether the system should enter or exit -the LPI mode & communicate this to PHY. - -As soon as the interface is opened, the driver verifies if the EEE can -be supported. This is done by looking at both the DMA HW capability -register and the PHY devices MCD registers. -To enter in Tx LPI mode the driver needs to have a software timer -that enable and disable the LPI mode when there is nothing to be -transmitted. - -7) Precision Time Protocol (PTP) -The driver supports the IEEE 1588-2002, Precision Time Protocol (PTP), -which enables precise synchronization of clocks in measurement and -control systems implemented with technologies such as network -communication. - -In addition to the basic timestamp features mentioned in IEEE 1588-2002 -Timestamps, new GMAC cores support the advanced timestamp features. -IEEE 1588-2008 that can be enabled when configure the Kernel. - -8) SGMII/RGMII support -New GMAC devices provide own way to manage RGMII/SGMII. -This information is available at run-time by looking at the -HW capability register. This means that the stmmac can manage -auto-negotiation and link status w/o using the PHYLIB stuff. -In fact, the HW provides a subset of extended registers to -restart the ANE, verify Full/Half duplex mode and Speed. -Thanks to these registers, it is possible to look at the -Auto-negotiated Link Parter Ability. diff --git a/Documentation/networking/device_drivers/ti/cpsw_switchdev.txt b/Documentation/networking/device_drivers/ti/cpsw_switchdev.txt index 5c8cee17fca9..12855ab268b8 100644 --- a/Documentation/networking/device_drivers/ti/cpsw_switchdev.txt +++ b/Documentation/networking/device_drivers/ti/cpsw_switchdev.txt @@ -39,7 +39,7 @@ but without enabling "switch" mode, or to different bridges. 
Devlink configuration parameters ==================== -See Documentation/networking/devlink-params-ti-cpsw-switch.txt +See Documentation/networking/devlink/ti-cpsw-switch.rst ==================== # Bridging in dual mac mode diff --git a/Documentation/networking/devlink-health.txt b/Documentation/networking/devlink-health.txt deleted file mode 100644 index 1db3fbea0831..000000000000 --- a/Documentation/networking/devlink-health.txt +++ /dev/null @@ -1,86 +0,0 @@ -The health mechanism is targeted for Real Time Alerting, in order to know when -something bad had happened to a PCI device -- Provide alert debug information -- Self healing -- If problem needs vendor support, provide a way to gather all needed debugging - information. - -The main idea is to unify and centralize driver health reports in the -generic devlink instance and allow the user to set different -attributes of the health reporting and recovery procedures. - -The devlink health reporter: -Device driver creates a "health reporter" per each error/health type. -Error/Health type can be a known/generic (eg pci error, fw error, rx/tx error) -or unknown (driver specific). -For each registered health reporter a driver can issue error/health reports -asynchronously. All health reports handling is done by devlink. -Device driver can provide specific callbacks for each "health reporter", e.g. - - Recovery procedures - - Diagnostics and object dump procedures - - OOB initial parameters -Different parts of the driver can register different types of health reporters -with different handlers. - -Once an error is reported, devlink health will do the following actions: - * A log is being send to the kernel trace events buffer - * Health status and statistics are being updated for the reporter instance - * Object dump is being taken and saved at the reporter instance (as long as - there is no other dump which is already stored) - * Auto recovery attempt is being done. 
Depends on: - - Auto-recovery configuration - - Grace period vs. time passed since last recover - -The user interface: -User can access/change each reporter's parameters and driver specific callbacks -via devlink, e.g per error type (per health reporter) - - Configure reporter's generic parameters (like: disable/enable auto recovery) - - Invoke recovery procedure - - Run diagnostics - - Object dump - -The devlink health interface (via netlink): -DEVLINK_CMD_HEALTH_REPORTER_GET - Retrieves status and configuration info per DEV and reporter. -DEVLINK_CMD_HEALTH_REPORTER_SET - Allows reporter-related configuration setting. -DEVLINK_CMD_HEALTH_REPORTER_RECOVER - Triggers a reporter's recovery procedure. -DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE - Retrieves diagnostics data from a reporter on a device. -DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET - Retrieves the last stored dump. Devlink health - saves a single dump. If an dump is not already stored by the devlink - for this reporter, devlink generates a new dump. - dump output is defined by the reporter. -DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR - Clears the last saved dump file for the specified reporter. 
- - - netlink - +--------------------------+ - | | - | + | - | | | - +--------------------------+ - |request for ops - |(diagnose, - mlx5_core devlink |recover, - |dump) -+--------+ +--------------------------+ -| | | reporter| | -| | | +---------v----------+ | -| | ops execution | | | | -| <----------------------------------+ | | -| | | | | | -| | | + ^------------------+ | -| | | | request for ops | -| | | | (recover, dump) | -| | | | | -| | | +-+------------------+ | -| | health report | | health handler | | -| +-------------------------------> | | -| | | +--------------------+ | -| | health reporter create | | -| +----------------------------> | -+--------+ +--------------------------+ diff --git a/Documentation/networking/devlink-info-versions.rst b/Documentation/networking/devlink-info-versions.rst deleted file mode 100644 index 4914f581b1fd..000000000000 --- a/Documentation/networking/devlink-info-versions.rst +++ /dev/null @@ -1,64 +0,0 @@ -.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) - -===================== -Devlink info versions -===================== - -board.id -======== - -Unique identifier of the board design. - -board.rev -========= - -Board design revision. - -asic.id -======= - -ASIC design identifier. - -asic.rev -======== - -ASIC design revision. - -board.manufacture -================= - -An identifier of the company or the facility which produced the part. - -fw -== - -Overall firmware version, often representing the collection of -fw.mgmt, fw.app, etc. - -fw.mgmt -======= - -Control unit firmware version. This firmware is responsible for house -keeping tasks, PHY control etc. but not the packet-by-packet data path -operation. - -fw.app -====== - -Data path microcode controlling high-speed packet processing. - -fw.undi -======= - -UNDI software, may include the UEFI driver, firmware or both. - -fw.ncsi -======= - -Version of the software responsible for supporting/handling the -Network Controller Sideband Interface. 
- -fw.psid -======= - -Unique identifier of the firmware parameter set. diff --git a/Documentation/networking/devlink-params-bnxt.txt b/Documentation/networking/devlink-params-bnxt.txt deleted file mode 100644 index 481aa303d5b4..000000000000 --- a/Documentation/networking/devlink-params-bnxt.txt +++ /dev/null @@ -1,18 +0,0 @@ -enable_sriov [DEVICE, GENERIC] - Configuration mode: Permanent - -ignore_ari [DEVICE, GENERIC] - Configuration mode: Permanent - -msix_vec_per_pf_max [DEVICE, GENERIC] - Configuration mode: Permanent - -msix_vec_per_pf_min [DEVICE, GENERIC] - Configuration mode: Permanent - -gre_ver_check [DEVICE, DRIVER-SPECIFIC] - Generic Routing Encapsulation (GRE) version check will - be enabled in the device. If disabled, device skips - version checking for incoming packets. - Type: Boolean - Configuration mode: Permanent diff --git a/Documentation/networking/devlink-params-mlx5.txt b/Documentation/networking/devlink-params-mlx5.txt deleted file mode 100644 index 5071467118bd..000000000000 --- a/Documentation/networking/devlink-params-mlx5.txt +++ /dev/null @@ -1,17 +0,0 @@ -flow_steering_mode [DEVICE, DRIVER-SPECIFIC] - Controls the flow steering mode of the driver. - Two modes are supported: - 1. 'dmfs' - Device managed flow steering. - 2. 'smfs - Software/Driver managed flow steering. - In DMFS mode, the HW steering entities are created and - managed through the Firmware. - In SMFS mode, the HW steering entities are created and - managed though by the driver directly into Hardware - without firmware intervention. - Type: String - Configuration mode: runtime - -enable_roce [DEVICE, GENERIC] - Enable handling of RoCE traffic in the device. - Defaultly enabled. 
- Configuration mode: driverinit diff --git a/Documentation/networking/devlink-params-mlxsw.txt b/Documentation/networking/devlink-params-mlxsw.txt deleted file mode 100644 index c63ea9fc7009..000000000000 --- a/Documentation/networking/devlink-params-mlxsw.txt +++ /dev/null @@ -1,10 +0,0 @@ -fw_load_policy [DEVICE, GENERIC] - Configuration mode: driverinit - -acl_region_rehash_interval [DEVICE, DRIVER-SPECIFIC] - Sets an interval for periodic ACL region rehashes. - The value is in milliseconds, minimal value is "3000". - Value "0" disables the periodic work. - The first rehash will be run right after value is set. - Type: u32 - Configuration mode: runtime diff --git a/Documentation/networking/devlink-params-mv88e6xxx.txt b/Documentation/networking/devlink-params-mv88e6xxx.txt deleted file mode 100644 index 21c4b3556ef2..000000000000 --- a/Documentation/networking/devlink-params-mv88e6xxx.txt +++ /dev/null @@ -1,7 +0,0 @@ -ATU_hash [DEVICE, DRIVER-SPECIFIC] - Select one of four possible hashing algorithms for - MAC addresses in the Address Translation Unit. - A value of 3 seems to work better than the default of - 1 when many MAC addresses have the same OUI. - Configuration mode: runtime - Type: u8. 0-3 valid. 
diff --git a/Documentation/networking/devlink-params-nfp.txt b/Documentation/networking/devlink-params-nfp.txt deleted file mode 100644 index 43e4d4034865..000000000000 --- a/Documentation/networking/devlink-params-nfp.txt +++ /dev/null @@ -1,5 +0,0 @@ -fw_load_policy [DEVICE, GENERIC] - Configuration mode: permanent - -reset_dev_on_drv_probe [DEVICE, GENERIC] - Configuration mode: permanent diff --git a/Documentation/networking/devlink-params-ti-cpsw-switch.txt b/Documentation/networking/devlink-params-ti-cpsw-switch.txt deleted file mode 100644 index 4037458499f7..000000000000 --- a/Documentation/networking/devlink-params-ti-cpsw-switch.txt +++ /dev/null @@ -1,10 +0,0 @@ -ale_bypass [DEVICE, DRIVER-SPECIFIC] - Allows to enable ALE_CONTROL(4).BYPASS mode for debug purposes. - All packets will be sent to the Host port only if enabled. - Type: bool - Configuration mode: runtime - -switch_mode [DEVICE, DRIVER-SPECIFIC] - Enable switch mode - Type: bool - Configuration mode: runtime diff --git a/Documentation/networking/devlink-params.txt b/Documentation/networking/devlink-params.txt deleted file mode 100644 index 04e234e9acc9..000000000000 --- a/Documentation/networking/devlink-params.txt +++ /dev/null @@ -1,71 +0,0 @@ -Devlink configuration parameters -================================ -Following is the list of configuration parameters via devlink interface. -Each parameter can be generic or driver specific and are device level -parameters. - -Note that the driver-specific files should contain the generic params -they support to, with supported config modes. - -Each parameter can be set in different configuration modes: - runtime - set while driver is running, no reset required. - driverinit - applied while driver initializes, requires restart - driver by devlink reload command. - permanent - written to device's non-volatile memory, hard reset - required. 
- -Following is the list of parameters: -==================================== -enable_sriov [DEVICE, GENERIC] - Enable Single Root I/O Virtualisation (SRIOV) in - the device. - Type: Boolean - -ignore_ari [DEVICE, GENERIC] - Ignore Alternative Routing-ID Interpretation (ARI) - capability. If enabled, adapter will ignore ARI - capability even when platforms has the support - enabled and creates same number of partitions when - platform does not support ARI. - Type: Boolean - -msix_vec_per_pf_max [DEVICE, GENERIC] - Provides the maximum number of MSIX interrupts that - a device can create. Value is same across all - physical functions (PFs) in the device. - Type: u32 - -msix_vec_per_pf_min [DEVICE, GENERIC] - Provides the minimum number of MSIX interrupts required - for the device initialization. Value is same across all - physical functions (PFs) in the device. - Type: u32 - -fw_load_policy [DEVICE, GENERIC] - Controls the device's firmware loading policy. - Valid values: - * DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DRIVER (0) - Load firmware version preferred by the driver. - * DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_FLASH (1) - Load firmware currently stored in flash. - * DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DISK (2) - Load firmware currently available on host's disk. - Type: u8 - -reset_dev_on_drv_probe [DEVICE, GENERIC] - Controls the device's reset policy on driver probe. - Valid values: - * DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_UNKNOWN (0) - Unknown or invalid value. - * DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_ALWAYS (1) - Always reset device on driver probe. - * DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_NEVER (2) - Never reset device on driver probe. - * DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_DISK (3) - Reset only if device firmware can be found in the - filesystem. - Type: u8 - -enable_roce [DEVICE, GENERIC] - Enable handling of RoCE traffic in the device. 
- Type: Boolean diff --git a/Documentation/networking/devlink-trap-netdevsim.rst b/Documentation/networking/devlink-trap-netdevsim.rst deleted file mode 100644 index b721c9415473..000000000000 --- a/Documentation/networking/devlink-trap-netdevsim.rst +++ /dev/null @@ -1,20 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -====================== -Devlink Trap netdevsim -====================== - -Driver-specific Traps -===================== - -.. list-table:: List of Driver-specific Traps Registered by ``netdevsim`` - :widths: 5 5 90 - - * - Name - - Type - - Description - * - ``fid_miss`` - - ``exception`` - - When a packet enters the device it is classified to a filtering - indentifier (FID) based on the ingress port and VLAN. This trap is used - to trap packets for which a FID could not be found diff --git a/Documentation/networking/devlink/bnxt.rst b/Documentation/networking/devlink/bnxt.rst new file mode 100644 index 000000000000..79e746d22a53 --- /dev/null +++ b/Documentation/networking/devlink/bnxt.rst @@ -0,0 +1,41 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================== +bnxt devlink support +==================== + +This document describes the devlink features implemented by the ``bnxt`` +device driver. + +Parameters +========== + +.. list-table:: Generic parameters implemented + + * - Name + - Mode + * - ``enable_sriov`` + - Permanent + * - ``ignore_ari`` + - Permanent + * - ``msix_vec_per_pf_max`` + - Permanent + * - ``msix_vec_per_pf_min`` + - Permanent + +The ``bnxt`` driver also implements the following driver-specific +parameters. + +.. list-table:: Driver-specific parameters implemented + :widths: 5 5 5 85 + + * - Name + - Type + - Mode + - Description + * - ``gre_ver_check`` + - Boolean + - Permanent + - Generic Routing Encapsulation (GRE) version check will be enabled in + the device. If disabled, the device will skip the version check for + incoming packets. 
diff --git a/Documentation/networking/devlink/devlink-dpipe.rst b/Documentation/networking/devlink/devlink-dpipe.rst new file mode 100644 index 000000000000..468fe1001b74 --- /dev/null +++ b/Documentation/networking/devlink/devlink-dpipe.rst @@ -0,0 +1,252 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +Devlink DPIPE +============= + +Background +========== + +While performing the hardware offloading process, much of the hardware +specifics cannot be presented. These details are useful for debugging, and +``devlink-dpipe`` provides a standardized way to provide visibility into the +offloading process. + +For example, the routing longest prefix match (LPM) algorithm used by the +Linux kernel may differ from the hardware implementation. The pipeline debug +API (DPIPE) is aimed at providing the user visibility into the ASIC's +pipeline in a generic way. + +The hardware offload process is expected to be done in a way that the user +should not be able to distinguish between the hardware vs. software +implementation. In this process, hardware specifics are neglected. In +reality those details can have lots of meaning and should be exposed in some +standard way. + +This problem is made even more complex when one wishes to offload the +control path of the whole networking stack to a switch ASIC. Due to +differences in the hardware and software models some processes cannot be +represented correctly. + +One example is the kernel's LPM algorithm which in many cases differs +greatly to the hardware implementation. The configuration API is the same, +but one cannot rely on the Forward Information Base (FIB) to look like the +Level Path Compression trie (LPC-trie) in hardware. + +In many situations trying to analyze systems failure solely based on the +kernel's dump may not be enough. 
By combining this data with complementary +information about the underlying hardware, this debugging can be made +easier; additionally, the information can be useful when debugging +performance issues. + +Overview +======== + +The ``devlink-dpipe`` interface closes this gap. The hardware's pipeline is +modeled as a graph of match/action tables. Each table represents a specific +hardware block. This model is not new, first being used by the P4 language. + +Traditionally it has been used as an alternative model for hardware +configuration, but the ``devlink-dpipe`` interface uses it for visibility +purposes as a standard complementary tool. The system's view from +``devlink-dpipe`` should change according to the changes done by the +standard configuration tools. + +For example, it's quite common to implement Access Control Lists (ACL) +using Ternary Content Addressable Memory (TCAM). The TCAM memory can be +divided into TCAM regions. Complex TC filters can have multiple rules with +different priorities and different lookup keys. On the other hand hardware +TCAM regions have a predefined lookup key. Offloading the TC filter rules +using TCAM engine can result in multiple TCAM regions being interconnected +in a chain (which may affect the data path latency). In response to a new TC +filter new tables should be created describing those regions. + +Model +===== + +The ``DPIPE`` model introduces several objects: + + * headers + * tables + * entries + +A ``header`` describes packet formats and provides names for fields within +the packet. A ``table`` describes hardware blocks. An ``entry`` describes +the actual content of a specific table. + +The hardware pipeline is not port specific, but rather describes the whole +ASIC. Thus it is tied to the top of the ``devlink`` infrastructure. + +Drivers can register and unregister tables at run time, in order to support +dynamic behavior. 
This dynamic behavior is mandatory for describing hardware +blocks like TCAM regions which can be allocated and freed dynamically. + +``devlink-dpipe`` generally is not intended for configuration. The exception +is hardware counting for a specific table. + +The following commands are used to obtain the ``dpipe`` objects from +userspace: + + * ``table_get``: Receive a table's description. + * ``headers_get``: Receive a device's supported headers. + * ``entries_get``: Receive a table's current entries. + * ``counters_set``: Enable or disable counters on a table. + +Table +----- + +The driver should implement the following operations for each table: + + * ``matches_dump``: Dump the supported matches. + * ``actions_dump``: Dump the supported actions. + * ``entries_dump``: Dump the actual content of the table. + * ``counters_set_update``: Synchronize hardware with counters enabled or + disabled. + +Header/Field +------------ + +In a similar way to P4, headers and fields are used to describe a table's +behavior. There is a slight difference between the standard protocol headers +and specific ASIC metadata. The protocol headers should be declared in the +``devlink`` core API. On the other hand ASIC metadata is driver specific +and should be defined in the driver. Additionally, each driver-specific +devlink documentation file should document the driver-specific ``dpipe`` +headers it implements. The headers and fields are identified by enumeration. + +In order to provide further visibility some ASIC metadata fields could be +mapped to kernel objects. For example, internal router interface indexes can +be directly mapped to the net device ifindex. FIB table indexes used by +different Virtual Routing and Forwarding (VRF) tables can be mapped to +internal routing table indexes. + +Match +----- + +Matches are kept primitive and close to hardware operation. Match types like +LPM are not supported due to the fact that this is exactly a process we wish +to describe in full detail. 
Example of matches: + + * ``field_exact``: Exact match on a specific field. + * ``field_exact_mask``: Exact match on a specific field after masking. + * ``field_range``: Match on a specific range. + +The IDs of the header and the field should be specified in order to +identify the specific field. Furthermore, the header index should be +specified in order to distinguish multiple headers of the same type in a +packet (tunneling). + +Action +------ + +Similar to match, the actions are kept primitive and close to hardware +operation. For example: + + * ``field_modify``: Modify the field value. + * ``field_inc``: Increment the field value. + * ``push_header``: Add a header. + * ``pop_header``: Remove a header. + +Entry +----- + +Entries of a specific table can be dumped on demand. Each entry is +identified with an index and its properties are described by a list of +match/action values and a specific counter. By dumping the table's content the +interactions between tables can be resolved. + +Abstraction Example +=================== + +The following is an example of the abstraction model of the L3 part of +Mellanox Spectrum ASIC. The blocks are described in the order they appear in +the pipeline. The table sizes in the following examples are not real +hardware sizes and are provided for demonstration purposes. + +LPM +--- + +The LPM algorithm can be implemented as a list of hash tables. Each hash +table contains routes with the same prefix length. The root of the list is +/32, and in case of a miss the hardware will continue to the next hash +table. The depth of the search will affect the data path latency. + +In case of a hit the entry contains information about the next stage of the +pipeline which resolves the MAC address. The next stage can be either local +host table for directly connected routes, or adjacency table for next-hops. +The ``meta.lpm_prefix`` field is used to connect two LPM tables. + +.. 
code:: + + table lpm_prefix_16 { + size: 4096, + counters_enabled: true, + match: { meta.vr_id: exact, + ipv4.dst_addr: exact_mask, + ipv6.dst_addr: exact_mask, + meta.lpm_prefix: exact }, + action: { meta.adj_index: set, + meta.adj_group_size: set, + meta.rif_port: set, + meta.lpm_prefix: set }, + } + +Local Host +---------- + +In the case of local routes the LPM lookup already resolves the egress +router interface (RIF), yet the exact MAC address is not known. The local +host table is a hash table combining the output interface id with +destination IP address as a key. The result is the MAC address. + +.. code:: + + table local_host { + size: 4096, + counters_enabled: true, + match: { meta.rif_port: exact, + ipv4.dst_addr: exact}, + action: { ethernet.daddr: set } + } + +Adjacency +--------- + +In case of remote routes this table does the ECMP. The LPM lookup results in +ECMP group size and index that serves as a global offset into this table. +Concurrently a hash of the packet is generated. Based on the ECMP group size +and the packet's hash a local offset is generated. Multiple LPM entries can +point to the same adjacency group. + +.. code:: + + table adjacency { + size: 4096, + counters_enabled: true, + match: { meta.adj_index: exact, + meta.adj_group_size: exact, + meta.packet_hash_index: exact }, + action: { ethernet.daddr: set, + meta.erif: set } + } + +ERIF +---- + +In case the egress RIF and destination MAC have been resolved by previous +tables this table does multiple operations like TTL decrease and MTU check. +Then the decision of forward/drop is taken and the port L3 statistics are +updated based on the packet's type (broadcast, unicast, multicast). + +.. 
code:: + + table erif { + size: 800, + counters_enabled: true, + match: { meta.rif_port: exact, + meta.is_l3_unicast: exact, + meta.is_l3_broadcast: exact, + meta.is_l3_multicast: exact }, + action: { meta.l3_drop: set, + meta.l3_forward: set } + } diff --git a/Documentation/networking/devlink/devlink-health.rst b/Documentation/networking/devlink/devlink-health.rst new file mode 100644 index 000000000000..0c99b11f05f9 --- /dev/null +++ b/Documentation/networking/devlink/devlink-health.rst @@ -0,0 +1,114 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============== +Devlink Health +============== + +Background +========== + +The ``devlink`` health mechanism is targeted for Real Time Alerting, in +order to know when something bad happened to a PCI device. + + * Provide alert debug information. + * Self healing. + * If problem needs vendor support, provide a way to gather all needed + debugging information. + +Overview +======== + +The main idea is to unify and centralize driver health reports in the +generic ``devlink`` instance and allow the user to set different +attributes of the health reporting and recovery procedures. + +The ``devlink`` health reporter: +Device driver creates a "health reporter" per each error/health type. +Error/Health type can be known/generic (e.g. PCI error, fw error, rx/tx error) +or unknown (driver specific). +For each registered health reporter a driver can issue error/health reports +asynchronously. All health reports handling is done by ``devlink``. +Device driver can provide specific callbacks for each "health reporter", e.g.: + + * Recovery procedures + * Diagnostics procedures + * Object dump procedures + * OOB initial parameters + +Different parts of the driver can register different types of health reporters +with different handlers. 
+ +Actions +======= + +Once an error is reported, devlink health will perform the following actions: + + * A log is sent to the kernel trace events buffer + * Health status and statistics are updated for the reporter instance + * Object dump is taken and saved at the reporter instance (as long as + there is no other dump which is already stored) + * Auto recovery attempt is made. Depends on: + - Auto-recovery configuration + - Grace period vs. time passed since last recovery + +User Interface +============== + +User can access/change each reporter's parameters and driver specific callbacks +via ``devlink``, e.g. per error type (per health reporter): + + * Configure reporter's generic parameters (like: disable/enable auto recovery) + * Invoke recovery procedure + * Run diagnostics + * Object dump + +.. list-table:: List of devlink health interfaces + :widths: 10 90 + + * - Name + - Description + * - ``DEVLINK_CMD_HEALTH_REPORTER_GET`` + - Retrieves status and configuration info per DEV and reporter. + * - ``DEVLINK_CMD_HEALTH_REPORTER_SET`` + - Allows reporter-related configuration setting. + * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER`` + - Triggers a reporter's recovery procedure. + * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE`` + - Retrieves diagnostics data from a reporter on a device. + * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET`` + - Retrieves the last stored dump. Devlink health + saves a single dump. If a dump is not already stored by the devlink + for this reporter, devlink generates a new dump. + Dump output is defined by the reporter. + * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR`` + - Clears the last saved dump file for the specified reporter. 
+ +The following diagram provides a general overview of ``devlink-health``:: + + netlink + +--------------------------+ + | | + | + | + | | | + +--------------------------+ + |request for ops + |(diagnose, + mlx5_core devlink |recover, + |dump) + +--------+ +--------------------------+ + | | | reporter| | + | | | +---------v----------+ | + | | ops execution | | | | + | <----------------------------------+ | | + | | | | | | + | | | + ^------------------+ | + | | | | request for ops | + | | | | (recover, dump) | + | | | | | + | | | +-+------------------+ | + | | health report | | health handler | | + | +-------------------------------> | | + | | | +--------------------+ | + | | health reporter create | | + | +----------------------------> | + +--------+ +--------------------------+ diff --git a/Documentation/networking/devlink/devlink-info.rst b/Documentation/networking/devlink/devlink-info.rst new file mode 100644 index 000000000000..0385f15028b1 --- /dev/null +++ b/Documentation/networking/devlink/devlink-info.rst @@ -0,0 +1,94 @@ +.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) + +============ +Devlink Info +============ + +The ``devlink-info`` mechanism enables device drivers to report device +information in a generic fashion. It is extensible, and enables exporting +even device or driver specific information. + +devlink supports representing the following types of versions + +.. list-table:: List of version types + :widths: 5 95 + + * - Type + - Description + * - ``fixed`` + - Represents fixed versions, which cannot change. For example, + component identifiers or the board version reported in the PCI VPD. + * - ``running`` + - Represents the version of the currently running component. For + example the running version of firmware. These versions generally + only update after a reboot. + * - ``stored`` + - Represents the version of a component as stored, such as after a + flash update. 
Stored values should update to reflect changes in the + flash even if a reboot has not yet occurred. + +Generic Versions +================ + +It is expected that drivers use the following generic names for exporting +version information. Other information may be exposed using driver-specific +names, but these should be documented in the driver-specific file. + +board.id +-------- + +Unique identifier of the board design. + +board.rev +--------- + +Board design revision. + +asic.id +------- + +ASIC design identifier. + +asic.rev +-------- + +ASIC design revision. + +board.manufacture +----------------- + +An identifier of the company or the facility which produced the part. + +fw +-- + +Overall firmware version, often representing the collection of +fw.mgmt, fw.app, etc. + +fw.mgmt +------- + +Control unit firmware version. This firmware is responsible for house +keeping tasks, PHY control etc. but not the packet-by-packet data path +operation. + +fw.app +------ + +Data path microcode controlling high-speed packet processing. + +fw.undi +------- + +UNDI software, may include the UEFI driver, firmware or both. + +fw.ncsi +------- + +Version of the software responsible for supporting/handling the +Network Controller Sideband Interface. + +fw.psid +------- + +Unique identifier of the firmware parameter set. diff --git a/Documentation/networking/devlink/devlink-params.rst b/Documentation/networking/devlink/devlink-params.rst new file mode 100644 index 000000000000..da2f85c0fa21 --- /dev/null +++ b/Documentation/networking/devlink/devlink-params.rst @@ -0,0 +1,108 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============== +Devlink Params +============== + +``devlink`` provides capability for a driver to expose device parameters for low +level device functionality. Since devlink can operate at the device-wide +level, it can be used to provide configuration that may affect multiple +ports on a single device. 
+ +This document describes a number of generic parameters that are supported +across multiple drivers. Each driver is also free to add their own +parameters. Each driver must document the specific parameters they support, +whether generic or not. + +Configuration modes +=================== + +Parameters may be set in different configuration modes. + +.. list-table:: Possible configuration modes + :widths: 5 90 + + * - Name + - Description + * - ``runtime`` + - set while the driver is running, and takes effect immediately. No + reset is required. + * - ``driverinit`` + - applied while the driver initializes. Requires the user to restart + the driver using the ``devlink`` reload command. + * - ``permanent`` + - written to the device's non-volatile memory. A hard reset is required + for it to take effect. + +Reloading +--------- + +In order for ``driverinit`` parameters to take effect, the driver must +support reloading via the ``devlink-reload`` command. This command will +request a reload of the device driver. + +Generic configuration parameters +================================ +The following is a list of generic configuration parameters that drivers may +add. Use of generic parameters is preferred over each driver creating their +own name. + +.. list-table:: List of generic parameters + :widths: 5 5 90 + + * - Name + - Type + - Description + * - ``enable_sriov`` + - Boolean + - Enable Single Root I/O Virtualization (SRIOV) in the device. + * - ``ignore_ari`` + - Boolean + - Ignore Alternative Routing-ID Interpretation (ARI) capability. If + enabled, the adapter will ignore ARI capability even when the + platform has support enabled. The device will create the same number + of partitions as when the platform does not support ARI. + * - ``msix_vec_per_pf_max`` + - u32 + - Provides the maximum number of MSI-X interrupts that a device can + create. Value is the same across all physical functions (PFs) in the + device. 
+ * - ``msix_vec_per_pf_min`` + - u32 + - Provides the minimum number of MSI-X interrupts required for the + device to initialize. Value is the same across all physical functions + (PFs) in the device. + * - ``fw_load_policy`` + - u8 + - Control the device's firmware loading policy. + - ``DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DRIVER`` (0) + Load firmware version preferred by the driver. + - ``DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_FLASH`` (1) + Load firmware currently stored in flash. + - ``DEVLINK_PARAM_FW_LOAD_POLICY_VALUE_DISK`` (2) + Load firmware currently available on host's disk. + * - ``reset_dev_on_drv_probe`` + - u8 + - Controls the device's reset policy on driver probe. + - ``DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_UNKNOWN`` (0) + Unknown or invalid value. + - ``DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_ALWAYS`` (1) + Always reset device on driver probe. + - ``DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_NEVER`` (2) + Never reset device on driver probe. + - ``DEVLINK_PARAM_RESET_DEV_ON_DRV_PROBE_VALUE_DISK`` (3) + Reset the device only if firmware can be found in the filesystem. + * - ``enable_roce`` + - Boolean + - Enable handling of RoCE traffic in the device. + * - ``internal_err_reset`` + - Boolean + - When enabled, the device driver will reset the device on internal + errors. + * - ``max_macs`` + - u32 + - Specifies the maximum number of MAC addresses per ethernet port of + this device. + * - ``region_snapshot_enable`` + - Boolean + - Enable capture of ``devlink-region`` snapshots. diff --git a/Documentation/networking/devlink/devlink-region.rst b/Documentation/networking/devlink/devlink-region.rst new file mode 100644 index 000000000000..1a7683e7acb2 --- /dev/null +++ b/Documentation/networking/devlink/devlink-region.rst @@ -0,0 +1,60 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============== +Devlink Region +============== + +``devlink`` regions enable access to driver defined address regions using +devlink. 
+ +Each device can create and register its own supported address regions. The +region can then be accessed via the devlink region interface. + +Region snapshots are collected by the driver, and can be accessed via read +or dump commands. This allows future analysis on the created snapshots. +Regions may optionally support triggering snapshots on demand. + +The major benefit to creating a region is to provide access to internal +address regions that are otherwise inaccessible to the user. + +Regions may also be used to provide an additional way to debug complex error +states, but see also :doc:`devlink-health` + +example usage +------------- + +.. code:: shell + + $ devlink region help + $ devlink region show [ DEV/REGION ] + $ devlink region del DEV/REGION snapshot SNAPSHOT_ID + $ devlink region dump DEV/REGION [ snapshot SNAPSHOT_ID ] + $ devlink region read DEV/REGION [ snapshot SNAPSHOT_ID ] + address ADDRESS length length + + # Show all of the exposed regions with region sizes: + $ devlink region show + pci/0000:00:05.0/cr-space: size 1048576 snapshot [1 2] + pci/0000:00:05.0/fw-health: size 64 snapshot [1 2] + + # Delete a snapshot using: + $ devlink region del pci/0000:00:05.0/cr-space snapshot 1 + + # Trigger (request) a snapshot be taken: + $ devlink region trigger pci/0000:00:05.0/cr-space + + # Dump a snapshot: + $ devlink region dump pci/0000:00:05.0/fw-health snapshot 1 + 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30 + 0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8 + 0000000000000020 0016 0bb8 0016 1720 0000 0000 c00f 3ffc + 0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5 + + # Read a specific part of a snapshot: + $ devlink region read pci/0000:00:05.0/fw-health snapshot 1 address 0 + length 16 + 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30 + +As regions are likely very device or driver specific, no generic regions are +defined. 
See the driver-specific documentation files for information on the +specific regions a driver supports. diff --git a/Documentation/networking/devlink/devlink-resource.rst b/Documentation/networking/devlink/devlink-resource.rst new file mode 100644 index 000000000000..93e92d2f0752 --- /dev/null +++ b/Documentation/networking/devlink/devlink-resource.rst @@ -0,0 +1,62 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================ +Devlink Resource +================ + +``devlink`` provides the ability for drivers to register resources, which +can allow administrators to see the device restrictions for a given +resource, as well as how much of the given resource is currently +in use. Additionally, these resources can optionally have a configurable size. +This could enable the administrator to limit the number of resources that +are used. + +For example, the ``netdevsim`` driver enables ``/IPv4/fib`` and +``/IPv4/fib-rules`` as resources to limit the number of IPv4 FIB entries and +rules for a given device. + +Resource Ids +============ + +Each resource is represented by an id, and contains information about its +current size and related sub-resources. To access a sub-resource, you +specify the path of the resource. For example ``/IPv4/fib`` is the id for +the ``fib`` sub-resource under the ``IPv4`` resource. + +example usage +------------- + +The resources exposed by the driver can be observed, for example: + +.. code:: shell + + $ devlink resource show pci/0000:03:00.0 + pci/0000:03:00.0: + name kvd size 245760 unit entry + resources: + name linear size 98304 occ 0 unit entry size_min 0 size_max 147456 size_gran 128 + name hash_double size 60416 unit entry size_min 32768 size_max 180224 size_gran 128 + name hash_single size 87040 unit entry size_min 65536 size_max 212992 size_gran 128 + +The size of some resources can be changed. Examples: + +..
code:: shell + + $ devlink resource set pci/0000:03:00.0 path /kvd/hash_single size 73088 + $ devlink resource set pci/0000:03:00.0 path /kvd/hash_double size 74368 + +The changes do not apply immediately; this can be validated by the ``size_new`` +attribute, which represents the pending change in size. For example: + +.. code:: shell + + $ devlink resource show pci/0000:03:00.0 + pci/0000:03:00.0: + name kvd size 245760 unit entry size_valid false + resources: + name linear size 98304 size_new 147456 occ 0 unit entry size_min 0 size_max 147456 size_gran 128 + name hash_double size 60416 unit entry size_min 32768 size_max 180224 size_gran 128 + name hash_single size 87040 unit entry size_min 65536 size_max 212992 size_gran 128 + +Note that changes in resource size may require a device reload to properly +take effect. diff --git a/Documentation/networking/devlink-trap.rst b/Documentation/networking/devlink/devlink-trap.rst index 03311849bfb1..47a429bb8658 100644 --- a/Documentation/networking/devlink-trap.rst +++ b/Documentation/networking/devlink/devlink-trap.rst @@ -223,6 +223,21 @@ be added to the following table: * - ``ipv6_lpm_miss`` - ``exception`` - Traps unicast IPv6 packets that did not match any route + * - ``non_routable_packet`` + - ``drop`` + - Traps packets that the device decided to drop because they are not + supposed to be routed. For example, IGMP queries can be flooded by the + device in layer 2 and reach the router. Such packets should not be + routed and instead dropped + * - ``decap_error`` + - ``exception`` + - Traps NVE and IPinIP packets that the device decided to drop because of + failure during decapsulation (e.g., packet being too short, reserved + bits set in VXLAN header) + * - ``overlay_smac_is_mc`` + - ``drop`` + - Traps NVE packets that the device decided to drop because their overlay + source MAC is multicast Driver-specific Packet Traps ============================ @@ -233,7 +248,8 @@ help debug packet drops caused by these exceptions.
The following list includes links to the description of driver-specific traps registered by various device drivers: - * :doc:`devlink-trap-netdevsim` + * :doc:`netdevsim` + * :doc:`mlxsw` Generic Packet Trap Groups ========================== @@ -258,6 +274,9 @@ narrow. The description of these groups must be added to the following table: * - ``buffer_drops`` - Contains packet traps for packets that were dropped by the device due to an enqueue decision + * - ``tunnel_drops`` + - Contains packet traps for packets that were dropped by the device during + tunnel encapsulation / decapsulation Testing ======= diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst new file mode 100644 index 000000000000..087ff54d53fc --- /dev/null +++ b/Documentation/networking/devlink/index.rst @@ -0,0 +1,42 @@ +Linux Devlink Documentation +=========================== + +devlink is an API to expose device information and resources not directly +related to any device class, such as chip-wide/switch-ASIC-wide configuration. + +Interface documentation +----------------------- + +The following pages describe various interfaces available through devlink in +general. + +.. toctree:: + :maxdepth: 1 + + devlink-dpipe + devlink-health + devlink-info + devlink-params + devlink-region + devlink-resource + devlink-trap + +Driver-specific documentation +----------------------------- + +Each driver that implements ``devlink`` is expected to document what +parameters, info versions, and other features it supports. + +.. toctree:: + :maxdepth: 1 + + bnxt + ionic + mlx4 + mlx5 + mlxsw + mv88e6xxx + netdevsim + nfp + qed + ti-cpsw-switch diff --git a/Documentation/networking/devlink/ionic.rst b/Documentation/networking/devlink/ionic.rst new file mode 100644 index 000000000000..48da9c92d584 --- /dev/null +++ b/Documentation/networking/devlink/ionic.rst @@ -0,0 +1,29 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +===================== +ionic devlink support +===================== + +This document describes the devlink features implemented by the ``ionic`` +device driver. + +Info versions +============= + +The ``ionic`` driver reports the following versions + +.. list-table:: devlink info versions implemented + :widths: 5 5 90 + + * - Name + - Type + - Description + * - ``fw`` + - running + - Version of firmware running on the device + * - ``asic.id`` + - fixed + - The ASIC type for this device + * - ``asic.rev`` + - fixed + - The revision of the ASIC for this device diff --git a/Documentation/networking/devlink/mlx4.rst b/Documentation/networking/devlink/mlx4.rst new file mode 100644 index 000000000000..7b2d17ea5471 --- /dev/null +++ b/Documentation/networking/devlink/mlx4.rst @@ -0,0 +1,56 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================== +mlx4 devlink support +==================== + +This document describes the devlink features implemented by the ``mlx4`` +device driver. + +Parameters +========== + +.. list-table:: Generic parameters implemented + + * - Name + - Mode + * - ``internal_err_reset`` + - driverinit, runtime + * - ``max_macs`` + - driverinit + * - ``region_snapshot_enable`` + - driverinit, runtime + +The ``mlx4`` driver also implements the following driver-specific +parameters. + +.. list-table:: Driver-specific parameters implemented + :widths: 5 5 5 85 + + * - Name + - Type + - Mode + - Description + * - ``enable_64b_cqe_eqe`` + - Boolean + - driverinit + - Enable 64 byte CQEs/EQEs, if the FW supports it. + * - ``enable_4k_uar`` + - Boolean + - driverinit + - Enable using the 4k UAR. + +The ``mlx4`` driver supports reloading via ``DEVLINK_CMD_RELOAD`` + +Regions +======= + +The ``mlx4`` driver supports dumping the firmware PCI crspace and health +buffer during a critical firmware issue. 
+ +If a firmware command times out, the firmware gets stuck, or the +catastrophic buffer holds a non-zero value, the driver takes a snapshot. + +The ``cr-space`` region will contain the firmware PCI crspace contents. The +``fw-health`` region will contain the device firmware's health buffer. +Snapshots for both of these regions are taken on the same event triggers. diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst new file mode 100644 index 000000000000..629a6e69c036 --- /dev/null +++ b/Documentation/networking/devlink/mlx5.rst @@ -0,0 +1,59 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================== +mlx5 devlink support +==================== + +This document describes the devlink features implemented by the ``mlx5`` +device driver. + +Parameters +========== + +.. list-table:: Generic parameters implemented + + * - Name + - Mode + * - ``enable_roce`` + - driverinit + +The ``mlx5`` driver also implements the following driver-specific +parameters. + +.. list-table:: Driver-specific parameters implemented + :widths: 5 5 5 85 + + * - Name + - Type + - Mode + - Description + * - ``flow_steering_mode`` + - string + - runtime + - Controls the flow steering mode of the driver + + * ``dmfs`` Device managed flow steering. In DMFS mode, the HW + steering entities are created and managed through firmware. + * ``smfs`` Software managed flow steering. In SMFS mode, the HW + steering entities are created and managed through the driver without + firmware intervention. + +The ``mlx5`` driver supports reloading via ``DEVLINK_CMD_RELOAD`` + +Info versions +============= + +The ``mlx5`` driver reports the following versions + +.. list-table:: devlink info versions implemented + :widths: 5 5 90 + + * - Name + - Type + - Description + * - ``fw.psid`` + - fixed + - Used to represent the board id of the device. + * - ``fw.version`` + - stored, running + - Three digit major.minor.subminor firmware version number.
diff --git a/Documentation/networking/devlink/mlxsw.rst b/Documentation/networking/devlink/mlxsw.rst new file mode 100644 index 000000000000..cf857cb4ba8f --- /dev/null +++ b/Documentation/networking/devlink/mlxsw.rst @@ -0,0 +1,81 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================== +mlxsw devlink support +===================== + +This document describes the devlink features implemented by the ``mlxsw`` +device driver. + +Parameters +========== + +.. list-table:: Generic parameters implemented + + * - Name + - Mode + * - ``fw_load_policy`` + - driverinit + +The ``mlxsw`` driver also implements the following driver-specific +parameters. + +.. list-table:: Driver-specific parameters implemented + :widths: 5 5 5 85 + + * - Name + - Type + - Mode + - Description + * - ``acl_region_rehash_interval`` + - u32 + - runtime + - Sets an interval for periodic ACL region rehashes. The value is + specified in milliseconds, with a minimum of ``3000``. The value of + ``0`` disables periodic work entirely. The first rehash will be run + immediately after the value is set. + +The ``mlxsw`` driver supports reloading via ``DEVLINK_CMD_RELOAD`` + +Info versions +============= + +The ``mlxsw`` driver reports the following versions + +.. list-table:: devlink info versions implemented + :widths: 5 5 90 + + * - Name + - Type + - Description + * - ``hw.revision`` + - fixed + - The hardware revision for this board + * - ``fw.psid`` + - fixed + - Firmware PSID + * - ``fw.version`` + - running + - Three digit firmware version + +Driver-specific Traps +===================== + +.. list-table:: List of Driver-specific Traps Registered by ``mlxsw`` + :widths: 5 5 90 + + * - Name + - Type + - Description + * - ``irif_disabled`` + - ``drop`` + - Traps packets that the device decided to drop because they need to be + routed from a disabled router interface (RIF). 
This can happen during + RIF dismantle, when the RIF is first disabled before being removed + completely + * - ``erif_disabled`` + - ``drop`` + - Traps packets that the device decided to drop because they need to be + routed through a disabled router interface (RIF). This can happen during + RIF dismantle, when the RIF is first disabled before being removed + completely diff --git a/Documentation/networking/devlink/mv88e6xxx.rst b/Documentation/networking/devlink/mv88e6xxx.rst new file mode 100644 index 000000000000..c621212a47a1 --- /dev/null +++ b/Documentation/networking/devlink/mv88e6xxx.rst @@ -0,0 +1,28 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================= +mv88e6xxx devlink support +========================= + +This document describes the devlink features implemented by the ``mv88e6xxx`` +device driver. + +Parameters +========== + +The ``mv88e6xxx`` driver implements the following driver-specific parameters. + +.. list-table:: Driver-specific parameters implemented + :widths: 5 5 5 85 + + * - Name + - Type + - Mode + - Description + * - ``ATU_hash`` + - u8 + - runtime + - Select one of four possible hashing algorithms for MAC addresses in + the Address Translation Unit. A value of 3 may work better than the + default of 1 when many MAC addresses have the same OUI. Only the + values 0 to 3 are valid for this parameter. diff --git a/Documentation/networking/devlink/netdevsim.rst b/Documentation/networking/devlink/netdevsim.rst new file mode 100644 index 000000000000..2a266b7e7b38 --- /dev/null +++ b/Documentation/networking/devlink/netdevsim.rst @@ -0,0 +1,72 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================= +netdevsim devlink support +========================= + +This document describes the ``devlink`` features supported by the +``netdevsim`` device driver. + +Parameters +========== + +.. 
list-table:: Generic parameters implemented + + * - Name + - Mode + * - ``max_macs`` + - driverinit + +The ``netdevsim`` driver also implements the following driver-specific +parameters. + +.. list-table:: Driver-specific parameters implemented + :widths: 5 5 5 85 + + * - Name + - Type + - Mode + - Description + * - ``test1`` + - Boolean + - driverinit + - Test parameter used to show how a driver-specific devlink parameter + can be implemented. + +The ``netdevsim`` driver supports reloading via ``DEVLINK_CMD_RELOAD`` + +Regions +======= + +The ``netdevsim`` driver exposes a ``dummy`` region as an example of how the +devlink-region interfaces work. A snapshot is taken whenever the +``take_snapshot`` debugfs file is written to. + +Resources +========= + +The ``netdevsim`` driver exposes resources to control the number of FIB +entries and FIB rule entries that the driver will allow. + +.. code:: shell + + $ devlink resource set netdevsim/netdevsim0 path /IPv4/fib size 96 + $ devlink resource set netdevsim/netdevsim0 path /IPv4/fib-rules size 16 + $ devlink resource set netdevsim/netdevsim0 path /IPv6/fib size 64 + $ devlink resource set netdevsim/netdevsim0 path /IPv6/fib-rules size 16 + $ devlink dev reload netdevsim/netdevsim0 + +Driver-specific Traps +===================== + +.. list-table:: List of Driver-specific Traps Registered by ``netdevsim`` + :widths: 5 5 90 + + * - Name + - Type + - Description + * - ``fid_miss`` + - ``exception`` + - When a packet enters the device it is classified to a filtering + identifier (FID) based on the ingress port and VLAN. This trap is used + to trap packets for which a FID could not be found diff --git a/Documentation/networking/devlink/nfp.rst b/Documentation/networking/devlink/nfp.rst new file mode 100644 index 000000000000..a1717db0dfcc --- /dev/null +++ b/Documentation/networking/devlink/nfp.rst @@ -0,0 +1,65 @@ +..
SPDX-License-Identifier: GPL-2.0 + +=================== +nfp devlink support +=================== + +This document describes the devlink features implemented by the ``nfp`` +device driver. + +Parameters +========== + +.. list-table:: Generic parameters implemented + + * - Name + - Mode + * - ``fw_load_policy`` + - permanent + * - ``reset_dev_on_drv_probe`` + - permanent + +Info versions +============= + +The ``nfp`` driver reports the following versions + +.. list-table:: devlink info versions implemented + :widths: 5 5 90 + + * - Name + - Type + - Description + * - ``board.id`` + - fixed + - Part number identifying the board design + * - ``board.rev`` + - fixed + - Revision of the board design + * - ``board.manufacture`` + - fixed + - Vendor of the board design + * - ``board.model`` + - fixed + - Model name of the board design + * - ``fw.bundle_id`` + - stored, running + - Firmware bundle id + * - ``fw.mgmt`` + - stored, running + - Version of the management firmware + * - ``fw.cpld`` + - stored, running + - The CPLD firmware component version + * - ``fw.app`` + - stored, running + - The APP firmware component version + * - ``fw.undi`` + - stored, running + - The UNDI firmware component version + * - ``fw.ncsi`` + - stored, running + - The NCSI firmware component version + * - ``chip.init`` + - stored, running + - The CFGR firmware component version diff --git a/Documentation/networking/devlink/qed.rst b/Documentation/networking/devlink/qed.rst new file mode 100644 index 000000000000..805c6f63621a --- /dev/null +++ b/Documentation/networking/devlink/qed.rst @@ -0,0 +1,26 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================== +qed devlink support +=================== + +This document describes the devlink features implemented by the ``qed`` core +device driver. + +Parameters +========== + +The ``qed`` driver implements the following driver-specific parameters. + +..
list-table:: Driver-specific parameters implemented + :widths: 5 5 5 85 + + * - Name + - Type + - Mode + - Description + * - ``iwarp_cmt`` + - Boolean + - runtime + - Enable iWARP functionality for 100g devices. Note that this impacts + L2 performance, and is therefore not enabled by default. diff --git a/Documentation/networking/devlink/ti-cpsw-switch.rst b/Documentation/networking/devlink/ti-cpsw-switch.rst new file mode 100644 index 000000000000..dc399e32abaa --- /dev/null +++ b/Documentation/networking/devlink/ti-cpsw-switch.rst @@ -0,0 +1,31 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================== +ti-cpsw-switch devlink support +============================== + +This document describes the devlink features implemented by the ``ti-cpsw-switch`` +device driver. + +Parameters +========== + +The ``ti-cpsw-switch`` driver implements the following driver-specific +parameters. + +.. list-table:: Driver-specific parameters implemented + :widths: 5 5 5 85 + + * - Name + - Type + - Mode + - Description + * - ``ale_bypass`` + - Boolean + - runtime + - Enables ALE_CONTROL(4).BYPASS mode for debugging purposes. In this + mode, all packets will be sent to the host port only. + * - ``switch_mode`` + - Boolean + - runtime + - Enable switch mode diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst new file mode 100644 index 000000000000..c60afba69e3c --- /dev/null +++ b/Documentation/networking/ethtool-netlink.rst @@ -0,0 +1,520 @@ +============================= +Netlink interface for ethtool +============================= + + +Basic information +================= + +Netlink interface for ethtool uses generic netlink family ``ethtool`` +(userspace application should use macros ``ETHTOOL_GENL_NAME`` and +``ETHTOOL_GENL_VERSION`` defined in ``<linux/ethtool_netlink.h>`` uapi +header). This family does not use a specific header, all information in +requests and replies is passed using netlink attributes. 
+ +The ethtool netlink interface uses extended ACK for error and warning +reporting, userspace application developers are encouraged to make these +messages available to user in a suitable way. + +Requests can be divided into three categories: "get" (retrieving information), +"set" (setting parameters) and "action" (invoking an action). + +All "set" and "action" type requests require admin privileges +(``CAP_NET_ADMIN`` in the namespace). Most "get" type requests are allowed for +anyone but there are exceptions (where the response contains sensitive +information). In some cases, the request as such is allowed for anyone but +unprivileged users have attributes with sensitive information (e.g. +wake-on-lan password) omitted. + + +Conventions +=========== + +Attributes which represent a boolean value usually use NLA_U8 type so that we +can distinguish three states: "on", "off" and "not present" (meaning the +information is not available in "get" requests or value is not to be changed +in "set" requests). For these attributes, the "true" value should be passed as +number 1 but any non-zero value should be understood as "true" by recipient. +In the tables below, "bool" denotes NLA_U8 attributes interpreted in this way. + +In the message structure descriptions below, if an attribute name is suffixed +with "+", parent nest can contain multiple attributes of the same type. This +implements an array of entries. + + +Request header +============== + +Each request or reply message contains a nested attribute with common header. +Structure of this header is + + ============================== ====== ============================= + ``ETHTOOL_A_HEADER_DEV_INDEX`` u32 device ifindex + ``ETHTOOL_A_HEADER_DEV_NAME`` string device name + ``ETHTOOL_A_HEADER_FLAGS`` u32 flags common for all requests + ============================== ====== ============================= + +``ETHTOOL_A_HEADER_DEV_INDEX`` and ``ETHTOOL_A_HEADER_DEV_NAME`` identify the +device message relates to. 
One of them is sufficient in requests, if both are +used, they must identify the same device. Some requests, e.g. global string +sets, do not require device identification. Most ``GET`` requests also allow +dump requests without device identification to query the same information for +all devices providing it (each device in a separate message). + +``ETHTOOL_A_HEADER_FLAGS`` is a bitmap of request flags common for all request +types. The interpretation of these flags is the same for all request types but +the flags may not apply to requests. Recognized flags are: + + ================================= =================================== + ``ETHTOOL_FLAG_COMPACT_BITSETS`` use compact format bitsets in reply + ``ETHTOOL_FLAG_OMIT_REPLY`` omit optional reply (_SET and _ACT) + ================================= =================================== + +New request flags should follow the general idea that if the flag is not set, +the behaviour is backward compatible, i.e. requests from old clients not aware +of the flag should be interpreted the way the client expects. A client must +not set flags it does not understand. + + +Bit sets +======== + +For short bitmaps of (reasonably) fixed length, standard ``NLA_BITFIELD32`` +type is used. For arbitrary length bitmaps, ethtool netlink uses a nested +attribute with contents of one of two forms: compact (two binary bitmaps +representing bit values and mask of affected bits) and bit-by-bit (list of +bits identified by either index or name). + +Verbose (bit-by-bit) bitsets allow sending symbolic names for bits together +with their values which saves a round trip (when the bitset is passed in a +request) or at least a second request (when the bitset is in a reply). This is +useful for one shot applications like traditional ethtool command. 
On the +other hand, long running applications like ethtool monitor (displaying +notifications) or network management daemons may prefer fetching the names +only once and using compact form to save message size. Notifications from +ethtool netlink interface always use compact form for bitsets. + +A bitset can represent either a value/mask pair (``ETHTOOL_A_BITSET_NOMASK`` +not set) or a single bitmap (``ETHTOOL_A_BITSET_NOMASK`` set). In requests +modifying a bitmap, the former changes the bits set in mask to values set in +value and preserves the rest; the latter sets the bits set in the bitmap and +clears the rest. + +Compact form: nested (bitset) attribute contents: + + ============================ ====== ============================ + ``ETHTOOL_A_BITSET_NOMASK`` flag no mask, only a list + ``ETHTOOL_A_BITSET_SIZE`` u32 number of significant bits + ``ETHTOOL_A_BITSET_VALUE`` binary bitmap of bit values + ``ETHTOOL_A_BITSET_MASK`` binary bitmap of valid bits + ============================ ====== ============================ + +Value and mask must have length at least ``ETHTOOL_A_BITSET_SIZE`` bits +rounded up to a multiple of 32 bits. They consist of 32-bit words in host byte +order, words ordered from least significant to most significant (i.e. the same +way as bitmaps are passed with ioctl interface). + +For compact form, ``ETHTOOL_A_BITSET_SIZE`` and ``ETHTOOL_A_BITSET_VALUE`` are +mandatory. ``ETHTOOL_A_BITSET_MASK`` attribute is mandatory if +``ETHTOOL_A_BITSET_NOMASK`` is not set (bitset represents a value/mask pair); +if ``ETHTOOL_A_BITSET_NOMASK`` is set, ``ETHTOOL_A_BITSET_MASK`` is not +allowed (bitset represents a single bitmap). + +Kernel bit set length may differ from userspace length if older application is +used on newer kernel or vice versa. If userspace bitmap is longer, an error is +issued only if the request actually tries to set values of some bits not +recognized by kernel. 
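As a rough illustration of the compact form described above, the following Python sketch packs bit indices into 32-bit words (least significant word first) and applies either a value/mask pair or a plain bitmap to an existing bitmap. The function names are hypothetical and not part of any ethtool or netlink API; the sketch assumes all word lists have the same length and ignores host byte order, which only matters when the words are serialized into a message.

```python
def bitset_words(bits, size):
    """Pack bit indices into 32-bit words, least significant word first."""
    nwords = (size + 31) // 32  # size rounded up to a multiple of 32 bits
    words = [0] * nwords
    for bit in bits:
        if not 0 <= bit < size:
            raise ValueError(f"bit {bit} outside bitset of size {size}")
        words[bit // 32] |= 1 << (bit % 32)
    return words


def apply_bitset(current, value, mask=None):
    """Apply a compact bitset to the current bitmap words.

    With a mask (NOMASK not set), bits selected by the mask take their
    values from value and all other bits are preserved. Without a mask
    (NOMASK set), the result is exactly the bitmap given in value.
    """
    if mask is None:
        return list(value)
    return [(c & ~m & 0xFFFFFFFF) | (v & m)
            for c, v, m in zip(current, value, mask)]
```

For example, ``bitset_words([0, 33], 40)`` needs two words because 40 bits round up to 64, and bit 33 lands in the second word.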
+ +Bit-by-bit form: nested (bitset) attribute contents: + + +------------------------------------+--------+-----------------------------+ + | ``ETHTOOL_A_BITSET_NOMASK`` | flag | no mask, only a list | + +------------------------------------+--------+-----------------------------+ + | ``ETHTOOL_A_BITSET_SIZE`` | u32 | number of significant bits | + +------------------------------------+--------+-----------------------------+ + | ``ETHTOOL_A_BITSET_BITS`` | nested | array of bits | + +-+----------------------------------+--------+-----------------------------+ + | | ``ETHTOOL_A_BITSET_BITS_BIT+`` | nested | one bit | + +-+-+--------------------------------+--------+-----------------------------+ + | | | ``ETHTOOL_A_BITSET_BIT_INDEX`` | u32 | bit index (0 for LSB) | + +-+-+--------------------------------+--------+-----------------------------+ + | | | ``ETHTOOL_A_BITSET_BIT_NAME`` | string | bit name | + +-+-+--------------------------------+--------+-----------------------------+ + | | | ``ETHTOOL_A_BITSET_BIT_VALUE`` | flag | present if bit is set | + +-+-+--------------------------------+--------+-----------------------------+ + +Bit size is optional for bit-by-bit form. ``ETHTOOL_A_BITSET_BITS`` nest can +only contain ``ETHTOOL_A_BITSET_BITS_BIT`` attributes but there can be an +arbitrary number of them. A bit may be identified by its index or by its +name. When used in requests, listed bits are set to 0 or 1 according to +``ETHTOOL_A_BITSET_BIT_VALUE``, the rest is preserved. A request fails if +index exceeds kernel bit length or if name is not recognized. + +When ``ETHTOOL_A_BITSET_NOMASK`` flag is present, bitset is interpreted as +a simple bitmap. ``ETHTOOL_A_BITSET_BIT_VALUE`` attributes are not used in +such case. Such bitset represents a bitmap with listed bits set and the rest +zero. + +In requests, application can use either form. Form used by kernel in reply is +determined by ``ETHTOOL_FLAG_COMPACT_BITSETS`` flag in flags field of request +header. 
Semantics of value and mask depend on the attribute. + + +List of message types +===================== + +All constants identifying message types use ``ETHTOOL_MSG_`` prefix and suffix +according to message purpose: + + ============== ====================================== + ``_GET`` userspace request to retrieve data + ``_SET`` userspace request to set data + ``_ACT`` userspace request to perform an action + ``_GET_REPLY`` kernel reply to a ``GET`` request + ``_SET_REPLY`` kernel reply to a ``SET`` request + ``_ACT_REPLY`` kernel reply to an ``ACT`` request + ``_NTF`` kernel notification + ============== ====================================== + +Userspace to kernel: + + ===================================== ================================ + ``ETHTOOL_MSG_STRSET_GET`` get string set + ``ETHTOOL_MSG_LINKINFO_GET`` get link settings + ``ETHTOOL_MSG_LINKINFO_SET`` set link settings + ``ETHTOOL_MSG_LINKMODES_GET`` get link modes info + ``ETHTOOL_MSG_LINKMODES_SET`` set link modes info + ``ETHTOOL_MSG_LINKSTATE_GET`` get link state + ===================================== ================================ + +Kernel to userspace: + + ===================================== ================================ + ``ETHTOOL_MSG_STRSET_GET_REPLY`` string set contents + ``ETHTOOL_MSG_LINKINFO_GET_REPLY`` link settings + ``ETHTOOL_MSG_LINKINFO_NTF`` link settings notification + ``ETHTOOL_MSG_LINKMODES_GET_REPLY`` link modes info + ``ETHTOOL_MSG_LINKMODES_NTF`` link modes notification + ``ETHTOOL_MSG_LINKSTATE_GET_REPLY`` link state info + ===================================== ================================ + +``GET`` requests are sent by userspace applications to retrieve device +information. They usually do not contain any message specific attributes. +Kernel replies with corresponding ``GET_REPLY`` message. For most types, ``GET`` +request with ``NLM_F_DUMP`` and no device identification can be used to query +the information for all devices supporting the request. 
+ +If the data can be also modified, corresponding ``SET`` message with the same +layout as corresponding ``GET_REPLY`` is used to request changes. Only +attributes where a change is requested are included in such request (also, not +all attributes may be changed). Replies to most ``SET`` requests consist only +of error code and extack; if kernel provides additional data, it is sent in +the form of corresponding ``SET_REPLY`` message which can be suppressed by +setting ``ETHTOOL_FLAG_OMIT_REPLY`` flag in request header. + +Data modification also triggers sending a ``NTF`` message with a notification. +These usually bear only a subset of attributes which were affected by the +change. The same notification is issued if the data is modified using other +means (mostly ioctl ethtool interface). Unlike notifications from ethtool +netlink code which are only sent if something actually changed, notifications +triggered by ioctl interface may be sent even if the request did not actually +change any data. + +``ACT`` messages request kernel (driver) to perform a specific action. If some +information is reported by kernel (which can be suppressed by setting +``ETHTOOL_FLAG_OMIT_REPLY`` flag in request header), the reply takes form of +an ``ACT_REPLY`` message. Performing an action also triggers a notification +(``NTF`` message). + +Later sections describe the format and semantics of these messages. + + +STRSET_GET +========== + +Requests contents of a string set as provided by ioctl commands +``ETHTOOL_GSSET_INFO`` and ``ETHTOOL_GSTRINGS``. String sets are not user +writeable so that the corresponding ``STRSET_SET`` message is only used in +kernel replies. There are two types of string sets: global (independent of +a device, e.g. device feature names) and device specific (e.g. device private +flags). 
+ +Request contents: + + +---------------------------------------+--------+------------------------+ + | ``ETHTOOL_A_STRSET_HEADER`` | nested | request header | + +---------------------------------------+--------+------------------------+ + | ``ETHTOOL_A_STRSET_STRINGSETS`` | nested | string set to request | + +-+-------------------------------------+--------+------------------------+ + | | ``ETHTOOL_A_STRINGSETS_STRINGSET+`` | nested | one string set | + +-+-+-----------------------------------+--------+------------------------+ + | | | ``ETHTOOL_A_STRINGSET_ID`` | u32 | set id | + +-+-+-----------------------------------+--------+------------------------+ + +Kernel response contents: + + +---------------------------------------+--------+-----------------------+ + | ``ETHTOOL_A_STRSET_HEADER`` | nested | reply header | + +---------------------------------------+--------+-----------------------+ + | ``ETHTOOL_A_STRSET_STRINGSETS`` | nested | array of string sets | + +-+-------------------------------------+--------+-----------------------+ + | | ``ETHTOOL_A_STRINGSETS_STRINGSET+`` | nested | one string set | + +-+-+-----------------------------------+--------+-----------------------+ + | | | ``ETHTOOL_A_STRINGSET_ID`` | u32 | set id | + +-+-+-----------------------------------+--------+-----------------------+ + | | | ``ETHTOOL_A_STRINGSET_COUNT`` | u32 | number of strings | + +-+-+-----------------------------------+--------+-----------------------+ + | | | ``ETHTOOL_A_STRINGSET_STRINGS`` | nested | array of strings | + +-+-+-+---------------------------------+--------+-----------------------+ + | | | | ``ETHTOOL_A_STRINGS_STRING+`` | nested | one string | + +-+-+-+-+-------------------------------+--------+-----------------------+ + | | | | | ``ETHTOOL_A_STRING_INDEX`` | u32 | string index | + +-+-+-+-+-------------------------------+--------+-----------------------+ + | | | | | ``ETHTOOL_A_STRING_VALUE`` | string | string value | + 
+-+-+-+-+-------------------------------+--------+-----------------------+ + | ``ETHTOOL_A_STRSET_COUNTS_ONLY`` | flag | return only counts | + +---------------------------------------+--------+-----------------------+ + +Device identification in request header is optional. Depending on its presence +and the ``NLM_F_DUMP`` flag, there are three types of ``STRSET_GET`` requests: + + - no ``NLM_F_DUMP``, no device: get "global" string sets + - no ``NLM_F_DUMP``, with device: get string sets related to the device + - ``NLM_F_DUMP``, no device: get device related string sets for all devices + +If there is no ``ETHTOOL_A_STRSET_STRINGSETS`` array, all string sets of +requested type are returned, otherwise only those specified in the request. +Flag ``ETHTOOL_A_STRSET_COUNTS_ONLY`` tells kernel to only return string +counts of the sets, not the actual strings. + + +LINKINFO_GET +============ + +Requests link settings as provided by ``ETHTOOL_GLINKSETTINGS`` except for +link modes and autonegotiation related information. The request does not use +any attributes. + +Request contents: + + ==================================== ====== ========================== + ``ETHTOOL_A_LINKINFO_HEADER`` nested request header + ==================================== ====== ========================== + +Kernel response contents: + + ==================================== ====== ========================== + ``ETHTOOL_A_LINKINFO_HEADER`` nested reply header + ``ETHTOOL_A_LINKINFO_PORT`` u8 physical port + ``ETHTOOL_A_LINKINFO_PHYADDR`` u8 phy MDIO address + ``ETHTOOL_A_LINKINFO_TP_MDIX`` u8 MDI(-X) status + ``ETHTOOL_A_LINKINFO_TP_MDIX_CTRL`` u8 MDI(-X) control + ``ETHTOOL_A_LINKINFO_TRANSCEIVER`` u8 transceiver + ==================================== ====== ========================== + +Attributes and their values have the same meaning as matching members of the +corresponding ioctl structures. 
+ +``LINKINFO_GET`` allows dump requests (kernel returns reply message for all +devices supporting the request). + + +LINKINFO_SET +============ + +``LINKINFO_SET`` request allows setting some of the attributes reported by +``LINKINFO_GET``. + +Request contents: + + ==================================== ====== ========================== + ``ETHTOOL_A_LINKINFO_HEADER`` nested request header + ``ETHTOOL_A_LINKINFO_PORT`` u8 physical port + ``ETHTOOL_A_LINKINFO_PHYADDR`` u8 phy MDIO address + ``ETHTOOL_A_LINKINFO_TP_MDIX_CTRL`` u8 MDI(-X) control + ==================================== ====== ========================== + +MDI(-X) status and transceiver cannot be set, request with the corresponding +attributes is rejected. + + +LINKMODES_GET +============= + +Requests link modes (supported, advertised and peer advertised) and related +information (autonegotiation status, link speed and duplex) as provided by +``ETHTOOL_GLINKSETTINGS``. The request does not use any attributes. + +Request contents: + + ==================================== ====== ========================== + ``ETHTOOL_A_LINKMODES_HEADER`` nested request header + ==================================== ====== ========================== + +Kernel response contents: + + ==================================== ====== ========================== + ``ETHTOOL_A_LINKMODES_HEADER`` nested reply header + ``ETHTOOL_A_LINKMODES_AUTONEG`` u8 autonegotiation status + ``ETHTOOL_A_LINKMODES_OURS`` bitset advertised link modes + ``ETHTOOL_A_LINKMODES_PEER`` bitset partner link modes + ``ETHTOOL_A_LINKMODES_SPEED`` u32 link speed (Mb/s) + ``ETHTOOL_A_LINKMODES_DUPLEX`` u8 duplex mode + ==================================== ====== ========================== + +For ``ETHTOOL_A_LINKMODES_OURS``, value represents advertised modes and mask +represents supported modes. ``ETHTOOL_A_LINKMODES_PEER`` in the reply is a bit +list. + +``LINKMODES_GET`` allows dump requests (kernel returns reply messages for all +devices supporting the request). 
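To illustrate how a client might read ``ETHTOOL_A_LINKMODES_OURS`` from a ``LINKMODES_GET`` reply, the sketch below splits a value/mask pair into supported modes (bits set in mask) and advertised modes (bits set in value). The helper name and the example mode names are hypothetical; value and mask are assumed to be already combined into Python integers, and a real client would take the bit names from the corresponding string set.

```python
def decode_link_modes(value, mask, names):
    """Split a value/mask pair into (supported, advertised) name lists.

    Bit i of value and mask corresponds to names[i]. Following the
    description of ETHTOOL_A_LINKMODES_OURS, mask marks supported modes
    and value marks advertised modes.
    """
    supported = [n for i, n in enumerate(names) if (mask >> i) & 1]
    advertised = [n for i, n in enumerate(names) if (value >> i) & 1]
    return supported, advertised


# Hypothetical bit names, for illustration only.
example_names = ["10baseT/Half", "10baseT/Full", "100baseT/Half"]
example_supported, example_advertised = decode_link_modes(
    0b010, 0b011, example_names)
```

Here the first two modes are reported as supported and only the second one as advertised.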
+ + +LINKMODES_SET +============= + +Request contents: + + ==================================== ====== ========================== + ``ETHTOOL_A_LINKMODES_HEADER`` nested request header + ``ETHTOOL_A_LINKMODES_AUTONEG`` u8 autonegotiation status + ``ETHTOOL_A_LINKMODES_OURS`` bitset advertised link modes + ``ETHTOOL_A_LINKMODES_PEER`` bitset partner link modes + ``ETHTOOL_A_LINKMODES_SPEED`` u32 link speed (Mb/s) + ``ETHTOOL_A_LINKMODES_DUPLEX`` u8 duplex mode + ==================================== ====== ========================== + +``ETHTOOL_A_LINKMODES_OURS`` bit set allows setting advertised link modes. If +autonegotiation is on (either set now or kept from before), advertised modes +are not changed (no ``ETHTOOL_A_LINKMODES_OURS`` attribute) and at least one +of speed and duplex is specified, kernel adjusts advertised modes to all +supported modes matching speed, duplex or both (whatever is specified). This +autoselection is done on ethtool side with ioctl interface, netlink interface +is supposed to allow requesting changes without knowing what exactly kernel +supports. + + +LINKSTATE_GET +============= + +Requests link state information. At the moment, only link up/down flag (as +provided by ``ETHTOOL_GLINK`` ioctl command) is provided but some future +extensions are planned (e.g. link down reason). This request does not have any +attributes. 
+ +Request contents: + + ==================================== ====== ========================== + ``ETHTOOL_A_LINKSTATE_HEADER`` nested request header + ==================================== ====== ========================== + +Kernel response contents: + + ==================================== ====== ========================== + ``ETHTOOL_A_LINKSTATE_HEADER`` nested reply header + ``ETHTOOL_A_LINKSTATE_LINK`` bool link state (up/down) + ==================================== ====== ========================== + +For most NIC drivers, the value of ``ETHTOOL_A_LINKSTATE_LINK`` returns +carrier flag provided by ``netif_carrier_ok()`` but there are drivers which +define their own handler. + +``LINKSTATE_GET`` allows dump requests (kernel returns reply messages for all +devices supporting the request). + + +Request translation +=================== + +The following table maps ioctl commands to netlink commands providing their +functionality. Entries with "n/a" in right column are commands which do not +have their netlink replacement yet. 
+ + =================================== ===================================== + ioctl command netlink command + =================================== ===================================== + ``ETHTOOL_GSET`` ``ETHTOOL_MSG_LINKINFO_GET`` + ``ETHTOOL_MSG_LINKMODES_GET`` + ``ETHTOOL_SSET`` ``ETHTOOL_MSG_LINKINFO_SET`` + ``ETHTOOL_MSG_LINKMODES_SET`` + ``ETHTOOL_GDRVINFO`` n/a + ``ETHTOOL_GREGS`` n/a + ``ETHTOOL_GWOL`` n/a + ``ETHTOOL_SWOL`` n/a + ``ETHTOOL_GMSGLVL`` n/a + ``ETHTOOL_SMSGLVL`` n/a + ``ETHTOOL_NWAY_RST`` n/a + ``ETHTOOL_GLINK`` ``ETHTOOL_MSG_LINKSTATE_GET`` + ``ETHTOOL_GEEPROM`` n/a + ``ETHTOOL_SEEPROM`` n/a + ``ETHTOOL_GCOALESCE`` n/a + ``ETHTOOL_SCOALESCE`` n/a + ``ETHTOOL_GRINGPARAM`` n/a + ``ETHTOOL_SRINGPARAM`` n/a + ``ETHTOOL_GPAUSEPARAM`` n/a + ``ETHTOOL_SPAUSEPARAM`` n/a + ``ETHTOOL_GRXCSUM`` n/a + ``ETHTOOL_SRXCSUM`` n/a + ``ETHTOOL_GTXCSUM`` n/a + ``ETHTOOL_STXCSUM`` n/a + ``ETHTOOL_GSG`` n/a + ``ETHTOOL_SSG`` n/a + ``ETHTOOL_TEST`` n/a + ``ETHTOOL_GSTRINGS`` ``ETHTOOL_MSG_STRSET_GET`` + ``ETHTOOL_PHYS_ID`` n/a + ``ETHTOOL_GSTATS`` n/a + ``ETHTOOL_GTSO`` n/a + ``ETHTOOL_STSO`` n/a + ``ETHTOOL_GPERMADDR`` rtnetlink ``RTM_GETLINK`` + ``ETHTOOL_GUFO`` n/a + ``ETHTOOL_SUFO`` n/a + ``ETHTOOL_GGSO`` n/a + ``ETHTOOL_SGSO`` n/a + ``ETHTOOL_GFLAGS`` n/a + ``ETHTOOL_SFLAGS`` n/a + ``ETHTOOL_GPFLAGS`` n/a + ``ETHTOOL_SPFLAGS`` n/a + ``ETHTOOL_GRXFH`` n/a + ``ETHTOOL_SRXFH`` n/a + ``ETHTOOL_GGRO`` n/a + ``ETHTOOL_SGRO`` n/a + ``ETHTOOL_GRXRINGS`` n/a + ``ETHTOOL_GRXCLSRLCNT`` n/a + ``ETHTOOL_GRXCLSRULE`` n/a + ``ETHTOOL_GRXCLSRLALL`` n/a + ``ETHTOOL_SRXCLSRLDEL`` n/a + ``ETHTOOL_SRXCLSRLINS`` n/a + ``ETHTOOL_FLASHDEV`` n/a + ``ETHTOOL_RESET`` n/a + ``ETHTOOL_SRXNTUPLE`` n/a + ``ETHTOOL_GRXNTUPLE`` n/a + ``ETHTOOL_GSSET_INFO`` ``ETHTOOL_MSG_STRSET_GET`` + ``ETHTOOL_GRXFHINDIR`` n/a + ``ETHTOOL_SRXFHINDIR`` n/a + ``ETHTOOL_GFEATURES`` n/a + ``ETHTOOL_SFEATURES`` n/a + ``ETHTOOL_GCHANNELS`` n/a + ``ETHTOOL_SCHANNELS`` n/a + ``ETHTOOL_SET_DUMP`` n/a + 
``ETHTOOL_GET_DUMP_FLAG`` n/a + ``ETHTOOL_GET_DUMP_DATA`` n/a + ``ETHTOOL_GET_TS_INFO`` n/a + ``ETHTOOL_GMODULEINFO`` n/a + ``ETHTOOL_GMODULEEEPROM`` n/a + ``ETHTOOL_GEEE`` n/a + ``ETHTOOL_SEEE`` n/a + ``ETHTOOL_GRSSH`` n/a + ``ETHTOOL_SRSSH`` n/a + ``ETHTOOL_GTUNABLE`` n/a + ``ETHTOOL_STUNABLE`` n/a + ``ETHTOOL_GPHYSTATS`` n/a + ``ETHTOOL_PERQUEUE`` n/a + ``ETHTOOL_GLINKSETTINGS`` ``ETHTOOL_MSG_LINKINFO_GET`` + ``ETHTOOL_MSG_LINKMODES_GET`` + ``ETHTOOL_SLINKSETTINGS`` ``ETHTOOL_MSG_LINKINFO_SET`` + ``ETHTOOL_MSG_LINKMODES_SET`` + ``ETHTOOL_PHY_GTUNABLE`` n/a + ``ETHTOOL_PHY_STUNABLE`` n/a + ``ETHTOOL_GFECPARAM`` n/a + ``ETHTOOL_SFECPARAM`` n/a + =================================== ===================================== diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 5acab1290e03..d07d9855dcd3 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -13,9 +13,8 @@ Contents: can_ucan_protocol device_drivers/index dsa/index - devlink-info-versions - devlink-trap - devlink-trap-netdevsim + devlink/index + ethtool-netlink ieee802154 j1939 kapi diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 48ccb1b31160..5f53faff4e25 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -479,6 +479,10 @@ tcp_no_metrics_save - BOOLEAN degradation. If set, TCP will not cache metrics on closing connections. +tcp_no_ssthresh_metrics_save - BOOLEAN + Controls whether TCP saves ssthresh metrics in the route cache. + Default is 1, which disables ssthresh metrics. + tcp_orphan_retries - INTEGER This value influences the timeout of a locally closed TCP connection, when RTO retransmissions remain unacknowledged. 
diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst index e0a7c7af6525..1e4735cc0553 100644 --- a/Documentation/networking/phy.rst +++ b/Documentation/networking/phy.rst @@ -267,6 +267,24 @@ Some of the interface modes are described below: duplex, pause or other settings. This is dependent on the MAC and/or PHY behaviour. +``PHY_INTERFACE_MODE_10GBASER`` + This is the IEEE 802.3 Clause 49 defined 10GBASE-R protocol used with + various different mediums. Please refer to the IEEE standard for a + definition of this. + + Note: 10GBASE-R is just one protocol that can be used with XFI and SFI. + XFI and SFI permit multiple protocols over a single SERDES lane, and + also defines the electrical characteristics of the signals with a host + compliance board plugged into the host XFP/SFP connector. Therefore, + XFI and SFI are not PHY interface types in their own right. + +``PHY_INTERFACE_MODE_10GKR`` + This is the IEEE 802.3 Clause 49 defined 10GBASE-R with Clause 73 + autonegotiation. Please refer to the IEEE standard for further + information. + + Note: due to legacy usage, some 10GBASE-R usage incorrectly makes + use of this definition. Pause frames / flow control =========================== diff --git a/Documentation/networking/sfp-phylink.rst b/Documentation/networking/sfp-phylink.rst index a5e00a159d21..d753a309f9d1 100644 --- a/Documentation/networking/sfp-phylink.rst +++ b/Documentation/networking/sfp-phylink.rst @@ -251,7 +251,8 @@ this documentation. phylink_mac_change(priv->phylink, link_is_up); where ``link_is_up`` is true if the link is currently up or false - otherwise. + otherwise. If a MAC is unable to provide these interrupts, then + it should set ``priv->phylink_config.pcs_poll = true;`` in step 9. 11. Verify that the driver does not call:: |