Diffstat (limited to 'Documentation/networking')
-rw-r--r-- | Documentation/networking/af_xdp.rst | 24
-rw-r--r-- | Documentation/networking/device_drivers/aquantia/atlantic.txt | 439
-rw-r--r-- | Documentation/networking/device_drivers/google/gve.rst | 123
-rw-r--r-- | Documentation/networking/device_drivers/index.rst | 2
-rw-r--r-- | Documentation/networking/device_drivers/mellanox/mlx5.rst | 173
-rw-r--r-- | Documentation/networking/ip-sysctl.txt | 39
-rw-r--r-- | Documentation/networking/phy.rst | 45
-rw-r--r-- | Documentation/networking/rds.txt | 2
-rw-r--r-- | Documentation/networking/tls-offload.rst | 54
9 files changed, 888 insertions, 13 deletions
diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst index e14d7d40fc75..eeedc2e826aa 100644 --- a/Documentation/networking/af_xdp.rst +++ b/Documentation/networking/af_xdp.rst @@ -220,7 +220,21 @@ Usage In order to use AF_XDP sockets there are two parts needed. The user-space application and the XDP program. For a complete setup and usage example, please refer to the sample application. The user-space -side is xdpsock_user.c and the XDP side xdpsock_kern.c. +side is xdpsock_user.c and the XDP side is part of libbpf. + +The XDP code sample included in tools/lib/bpf/xsk.c is the following:: + + SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx) + { + int index = ctx->rx_queue_index; + + // A set entry here means that the correspnding queue_id + // has an active AF_XDP socket bound to it. + if (bpf_map_lookup_elem(&xsks_map, &index)) + return bpf_redirect_map(&xsks_map, index, 0); + + return XDP_PASS; + } Naive ring dequeue and enqueue could look like this:: @@ -316,16 +330,16 @@ A: When a netdev of a physical NIC is initialized, Linux usually all the traffic, you can force the netdev to only have 1 queue, queue id 0, and then bind to queue 0. You can use ethtool to do this:: - sudo ethtool -L <interface> combined 1 + sudo ethtool -L <interface> combined 1 If you want to only see part of the traffic, you can program the NIC through ethtool to filter out your traffic to a single queue id that you can bind your XDP socket to. Here is one example in which UDP traffic to and from port 4242 are sent to queue 2:: - sudo ethtool -N <interface> rx-flow-hash udp4 fn - sudo ethtool -N <interface> flow-type udp4 src-port 4242 dst-port \ - 4242 action 2 + sudo ethtool -N <interface> rx-flow-hash udp4 fn + sudo ethtool -N <interface> flow-type udp4 src-port 4242 dst-port \ + 4242 action 2 A number of other ways are possible all up to the capabilitites of the NIC you have. 
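The per-packet decision made by the sample XDP program in this af_xdp.rst hunk, and the reason the ethtool steering rules matter, can be modeled in a few lines. This is a minimal Python sketch, not BPF or libbpf code: `xsks_map` is represented by a plain dict keyed by queue id, and `xdp_sock_prog_model` and the string verdicts are hypothetical names chosen for illustration.

```python
# Toy model of the xdp_sock_prog logic. This is NOT BPF code; the dict
# stands in for xsks_map and the string verdicts for the XDP return codes.
XDP_PASS = "XDP_PASS"
XDP_REDIRECT = "XDP_REDIRECT"


def xdp_sock_prog_model(rx_queue_index, xsks_map):
    """Return (verdict, target) for a packet received on rx_queue_index.

    xsks_map maps a queue id to whatever represents the AF_XDP socket
    bound to that queue (an assumption of this sketch).
    """
    # A set entry means an AF_XDP socket is bound to this queue id.
    if rx_queue_index in xsks_map:
        return XDP_REDIRECT, xsks_map[rx_queue_index]
    # No socket bound to this queue: hand the packet to the normal stack.
    return XDP_PASS, None


# One socket bound to queue 2, matching the "action 2" ethtool example.
xsks_map = {2: "xsk-on-queue-2"}
print(xdp_sock_prog_model(2, xsks_map))
print(xdp_sock_prog_model(0, xsks_map))
```

Traffic that the NIC delivers to a queue without a bound socket falls through to the regular network stack, which is why the flow rule has to steer the wanted traffic to the queue id the socket is bound to.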
diff --git a/Documentation/networking/device_drivers/aquantia/atlantic.txt b/Documentation/networking/device_drivers/aquantia/atlantic.txt new file mode 100644 index 000000000000..d235cbaeccc6 --- /dev/null +++ b/Documentation/networking/device_drivers/aquantia/atlantic.txt @@ -0,0 +1,439 @@ +aQuantia AQtion Driver for the aQuantia Multi-Gigabit PCI Express Family of +Ethernet Adapters +============================================================================= + +Contents +======== + +- Identifying Your Adapter +- Configuration +- Supported ethtool options +- Command Line Parameters +- Config file parameters +- Support +- License + +Identifying Your Adapter +======================== + +The driver in this release is compatible with AQC-100, AQC-107, AQC-108 based ethernet adapters. + + +SFP+ Devices (for AQC-100 based adapters) +---------------------------------- + +This release tested with passive Direct Attach Cables (DAC) and SFP+/LC Optical Transceiver. + +Configuration +========================= + Viewing Link Messages + --------------------- + Link messages will not be displayed to the console if the distribution is + restricting system messages. In order to see network driver link messages on + your console, set dmesg to eight by entering the following: + + dmesg -n 8 + + NOTE: This setting is not saved across reboots. + + Jumbo Frames + ------------ + The driver supports Jumbo Frames for all adapters. Jumbo Frames support is + enabled by changing the MTU to a value larger than the default of 1500. + The maximum value for the MTU is 16000. Use the `ip` command to + increase the MTU size. For example: + + ip link set mtu 16000 dev enp1s0 + + ethtool + ------- + The driver utilizes the ethtool interface for driver configuration and + diagnostics, as well as displaying statistical information. The latest + ethtool version is required for this functionality. + + NAPI + ---- + NAPI (Rx polling mode) is supported in the atlantic driver. 
+ +Supported ethtool options +============================ + Viewing adapter settings + --------------------- + ethtool <ethX> + + Output example: + + Settings for enp1s0: + Supported ports: [ TP ] + Supported link modes: 100baseT/Full + 1000baseT/Full + 10000baseT/Full + 2500baseT/Full + 5000baseT/Full + Supported pause frame use: Symmetric + Supports auto-negotiation: Yes + Supported FEC modes: Not reported + Advertised link modes: 100baseT/Full + 1000baseT/Full + 10000baseT/Full + 2500baseT/Full + 5000baseT/Full + Advertised pause frame use: Symmetric + Advertised auto-negotiation: Yes + Advertised FEC modes: Not reported + Speed: 10000Mb/s + Duplex: Full + Port: Twisted Pair + PHYAD: 0 + Transceiver: internal + Auto-negotiation: on + MDI-X: Unknown + Supports Wake-on: g + Wake-on: d + Link detected: yes + + --- + Note: AQrate speeds (2.5/5 Gb/s) will be displayed only with linux kernels > 4.10. + But you can still use these speeds: + ethtool -s eth0 autoneg off speed 2500 + + Viewing adapter information + --------------------- + ethtool -i <ethX> + + Output example: + + driver: atlantic + version: 5.2.0-050200rc5-generic-kern + firmware-version: 3.1.78 + expansion-rom-version: + bus-info: 0000:01:00.0 + supports-statistics: yes + supports-test: no + supports-eeprom-access: no + supports-register-dump: yes + supports-priv-flags: no + + + Viewing Ethernet adapter statistics: + --------------------- + ethtool -S <ethX> + + Output example: + NIC statistics: + InPackets: 13238607 + InUCast: 13293852 + InMCast: 52 + InBCast: 3 + InErrors: 0 + OutPackets: 23703019 + OutUCast: 23704941 + OutMCast: 67 + OutBCast: 11 + InUCastOctects: 213182760 + OutUCastOctects: 22698443 + InMCastOctects: 6600 + OutMCastOctects: 8776 + InBCastOctects: 192 + OutBCastOctects: 704 + InOctects: 2131839552 + OutOctects: 226938073 + InPacketsDma: 95532300 + OutPacketsDma: 59503397 + InOctetsDma: 1137102462 + OutOctetsDma: 2394339518 + InDroppedDma: 0 + Queue[0] InPackets: 23567131 + Queue[0] 
OutPackets: 20070028 + Queue[0] InJumboPackets: 0 + Queue[0] InLroPackets: 0 + Queue[0] InErrors: 0 + Queue[1] InPackets: 45428967 + Queue[1] OutPackets: 11306178 + Queue[1] InJumboPackets: 0 + Queue[1] InLroPackets: 0 + Queue[1] InErrors: 0 + Queue[2] InPackets: 3187011 + Queue[2] OutPackets: 13080381 + Queue[2] InJumboPackets: 0 + Queue[2] InLroPackets: 0 + Queue[2] InErrors: 0 + Queue[3] InPackets: 23349136 + Queue[3] OutPackets: 15046810 + Queue[3] InJumboPackets: 0 + Queue[3] InLroPackets: 0 + Queue[3] InErrors: 0 + + Interrupt coalescing support + --------------------------------- + ITR mode, TX/RX coalescing timings could be viewed with: + + ethtool -c <ethX> + + and changed with: + + ethtool -C <ethX> tx-usecs <usecs> rx-usecs <usecs> + + To disable coalescing: + + ethtool -C <ethX> tx-usecs 0 rx-usecs 0 tx-max-frames 1 tx-max-frames 1 + + Wake on LAN support + --------------------------------- + + WOL support by magic packet: + + ethtool -s <ethX> wol g + + To disable WOL: + + ethtool -s <ethX> wol d + + Set and check the driver message level + --------------------------------- + + Set message level + + ethtool -s <ethX> msglvl <level> + + Level values: + + 0x0001 - general driver status. + 0x0002 - hardware probing. + 0x0004 - link state. + 0x0008 - periodic status check. + 0x0010 - interface being brought down. + 0x0020 - interface being brought up. + 0x0040 - receive error. + 0x0080 - transmit error. + 0x0200 - interrupt handling. + 0x0400 - transmit completion. + 0x0800 - receive completion. + 0x1000 - packet contents. + 0x2000 - hardware status. + 0x4000 - Wake-on-LAN status. + + By default, the level of debugging messages is set 0x0001(general driver status). + + Check message level + + ethtool <ethX> | grep "Current message level" + + If you want to disable the output of messages + + ethtool -s <ethX> msglvl 0 + + RX flow rules (ntuple filters) + --------------------------------- + There are separate rules supported, that applies in that order: + 1. 
16 VLAN ID rules + 2. 16 L2 EtherType rules + 3. 8 L3/L4 5-Tuple rules + + + The driver utilizes the ethtool interface for configuring ntuple filters, + via "ethtool -N <device> <filter>". + + To enable or disable the RX flow rules: + + ethtool -K ethX ntuple <on|off> + + When disabling ntuple filters, all the user programed filters are + flushed from the driver cache and hardware. All needed filters must + be re-added when ntuple is re-enabled. + + Because of the fixed order of the rules, the location of filters is also fixed: + - Locations 0 - 15 for VLAN ID filters + - Locations 16 - 31 for L2 EtherType filters + - Locations 32 - 39 for L3/L4 5-tuple filters (locations 32, 36 for IPv6) + + The L3/L4 5-tuple (protocol, source and destination IP address, source and + destination TCP/UDP/SCTP port) is compared against 8 filters. For IPv4, up to + 8 source and destination addresses can be matched. For IPv6, up to 2 pairs of + addresses can be supported. Source and destination ports are only compared for + TCP/UDP/SCTP packets. + + To add a filter that directs packet to queue 5, use <-N|-U|--config-nfc|--config-ntuple> switch: + + ethtool -N <ethX> flow-type udp4 src-ip 10.0.0.1 dst-ip 10.0.0.2 src-port 2000 dst-port 2001 action 5 <loc 32> + + - action is the queue number. + - loc is the rule number. + + For "flow-type ip4|udp4|tcp4|sctp4|ip6|udp6|tcp6|sctp6" you must set the loc + number within 32 - 39. + For "flow-type ip4|udp4|tcp4|sctp4|ip6|udp6|tcp6|sctp6" you can set 8 rules + for traffic IPv4 or you can set 2 rules for traffic IPv6. Loc number traffic + IPv6 is 32 and 36. + At the moment you can not use IPv4 and IPv6 filters at the same time. 
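The fixed location ranges described above lend themselves to a small lookup helper. The following Python sketch only encodes the ranges stated in the atlantic text (0 - 15, 16 - 31, 32 - 39, with 32 and 36 for IPv6); the function names are hypothetical and nothing here talks to ethtool or the driver.

```python
# Location ranges as stated in the atlantic documentation.
VLAN_LOCS = range(0, 16)        # 16 VLAN ID rules
ETHERTYPE_LOCS = range(16, 32)  # 16 L2 EtherType rules
L3L4_LOCS = range(32, 40)       # 8 L3/L4 5-tuple rules
L3L4_IPV6_LOCS = (32, 36)       # only these two locations accept IPv6 rules


def filter_kind(loc):
    """Map an ethtool 'loc' value to the kind of filter stored there."""
    if loc in VLAN_LOCS:
        return "vlan"
    if loc in ETHERTYPE_LOCS:
        return "ethertype"
    if loc in L3L4_LOCS:
        return "l3l4"
    raise ValueError(f"location {loc} is outside the documented 0 - 39 range")


def l3l4_location_ok(loc, ipv6):
    """Check a 'loc' for an ip4/udp4/tcp4/sctp4 or ip6/udp6/tcp6/sctp6 rule."""
    if ipv6:
        return loc in L3L4_IPV6_LOCS
    return loc in L3L4_LOCS


print(filter_kind(0), filter_kind(16), filter_kind(32))
print(l3l4_location_ok(33, ipv6=False), l3l4_location_ok(33, ipv6=True))
```

A check like this can be run before issuing `ethtool -N ... loc <n>` to catch a location that does not match the intended flow type.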
+ + Example filter for IPv6 filter traffic: + + sudo ethtool -N <ethX> flow-type tcp6 src-ip 2001:db8:0:f101::1 dst-ip 2001:db8:0:f101::2 action 1 loc 32 + sudo ethtool -N <ethX> flow-type ip6 src-ip 2001:db8:0:f101::2 dst-ip 2001:db8:0:f101::5 action -1 loc 36 + + Example filter for IPv4 filter traffic: + + sudo ethtool -N <ethX> flow-type udp4 src-ip 10.0.0.4 dst-ip 10.0.0.7 src-port 2000 dst-port 2001 loc 32 + sudo ethtool -N <ethX> flow-type tcp4 src-ip 10.0.0.3 dst-ip 10.0.0.9 src-port 2000 dst-port 2001 loc 33 + sudo ethtool -N <ethX> flow-type ip4 src-ip 10.0.0.6 dst-ip 10.0.0.4 loc 34 + + If you set action -1, then all traffic corresponding to the filter will be discarded. + The maximum value action is 31. + + + The VLAN filter (VLAN id) is compared against 16 filters. + VLAN id must be accompanied by mask 0xF000. That is to distinguish VLAN filter + from L2 Ethertype filter with UserPriority since both User Priority and VLAN ID + are passed in the same 'vlan' parameter. + + To add a filter that directs packets from VLAN 2001 to queue 5: + ethtool -N <ethX> flow-type ip4 vlan 2001 m 0xF000 action 1 loc 0 + + + L2 EtherType filters allows filter packet by EtherType field or both EtherType + and User Priority (PCP) field of 802.1Q. + UserPriority (vlan) parameter must be accompanied by mask 0x1FFF. That is to + distinguish VLAN filter from L2 Ethertype filter with UserPriority since both + User Priority and VLAN ID are passed in the same 'vlan' parameter. + + To add a filter that directs IP4 packess of priority 3 to queue 3: + ethtool -N <ethX> flow-type ether proto 0x800 vlan 0x600 m 0x1FFF action 3 loc 16 + + + To see the list of filters currently present: + + ethtool <-u|-n|--show-nfc|--show-ntuple> <ethX> + + Rules may be deleted from the table itself. This is done using: + + sudo ethtool <-N|-U|--config-nfc|--config-ntuple> <ethX> delete <loc> + + - loc is the rule number to be deleted. 
+ + Rx filters is an interface to load the filter table that funnels all flow + into queue 0 unless an alternative queue is specified using "action". In that + case, any flow that matches the filter criteria will be directed to the + appropriate queue. RX filters is supported on all kernels 2.6.30 and later. + + RSS for UDP + --------------------------------- + Currently, NIC does not support RSS for fragmented IP packets, which leads to + incorrect working of RSS for fragmented UDP traffic. To disable RSS for UDP the + RX Flow L3/L4 rule may be used. + + Example: + ethtool -N eth0 flow-type udp4 action 0 loc 32 + +Command Line Parameters +======================= +The following command line parameters are available on atlantic driver: + +aq_itr -Interrupt throttling mode +---------------------------------------- +Accepted values: 0, 1, 0xFFFF +Default value: 0xFFFF +0 - Disable interrupt throttling. +1 - Enable interrupt throttling and use specified tx and rx rates. +0xFFFF - Auto throttling mode. Driver will choose the best RX and TX + interrupt throtting settings based on link speed. + +aq_itr_tx - TX interrupt throttle rate +---------------------------------------- +Accepted values: 0 - 0x1FF +Default value: 0 +TX side throttling in microseconds. Adapter will setup maximum interrupt delay +to this value. Minimum interrupt delay will be a half of this value + +aq_itr_rx - RX interrupt throttle rate +---------------------------------------- +Accepted values: 0 - 0x1FF +Default value: 0 +RX side throttling in microseconds. Adapter will setup maximum interrupt delay +to this value. Minimum interrupt delay will be a half of this value + +Note: ITR settings could be changed in runtime by ethtool -c means (see below) + +Config file parameters +======================= +For some fine tuning and performance optimizations, +some parameters can be changed in the {source_dir}/aq_cfg.h file. 
+ +AQ_CFG_RX_PAGEORDER +---------------------------------------- +Default value: 0 +RX page order override. Thats a power of 2 number of RX pages allocated for +each descriptor. Received descriptor size is still limited by AQ_CFG_RX_FRAME_MAX. +Increasing pageorder makes page reuse better (actual on iommu enabled systems). + +AQ_CFG_RX_REFILL_THRES +---------------------------------------- +Default value: 32 +RX refill threshold. RX path will not refill freed descriptors until the +specified number of free descriptors is observed. Larger values may help +better page reuse but may lead to packet drops as well. + +AQ_CFG_VECS_DEF +------------------------------------------------------------ +Number of queues +Valid Range: 0 - 8 (up to AQ_CFG_VECS_MAX) +Default value: 8 +Notice this value will be capped by the number of cores available on the system. + +AQ_CFG_IS_RSS_DEF +------------------------------------------------------------ +Enable/disable Receive Side Scaling + +This feature allows the adapter to distribute receive processing +across multiple CPU-cores and to prevent from overloading a single CPU core. + +Valid values +0 - disabled +1 - enabled + +Default value: 1 + +AQ_CFG_NUM_RSS_QUEUES_DEF +------------------------------------------------------------ +Number of queues for Receive Side Scaling +Valid Range: 0 - 8 (up to AQ_CFG_VECS_DEF) + +Default value: AQ_CFG_VECS_DEF + +AQ_CFG_IS_LRO_DEF +------------------------------------------------------------ +Enable/disable Large Receive Offload + +This offload enables the adapter to coalesce multiple TCP segments and indicate +them as a single coalesced unit to the OS networking subsystem. +The system consumes less energy but it also introduces more latency in packets processing. + +Valid values +0 - disabled +1 - enabled + +Default value: 1 + +AQ_CFG_TX_CLEAN_BUDGET +---------------------------------------- +Maximum descriptors to cleanup on TX at once. 
+Default value: 256 + +After the aq_cfg.h file changed the driver must be rebuilt to take effect. + +Support +======= + +If an issue is identified with the released source code on the supported +kernel with a supported adapter, email the specific information related +to the issue to support@aquantia.com + +License +======= + +aQuantia Corporation Network Driver +Copyright(c) 2014 - 2019 aQuantia Corporation. + +This program is free software; you can redistribute it and/or modify it +under the terms and conditions of the GNU General Public License, +version 2, as published by the Free Software Foundation. diff --git a/Documentation/networking/device_drivers/google/gve.rst b/Documentation/networking/device_drivers/google/gve.rst new file mode 100644 index 000000000000..793693cef6e3 --- /dev/null +++ b/Documentation/networking/device_drivers/google/gve.rst @@ -0,0 +1,123 @@ +.. SPDX-License-Identifier: GPL-2.0+ + +============================================================== +Linux kernel driver for Compute Engine Virtual Ethernet (gve): +============================================================== + +Supported Hardware +=================== +The GVE driver binds to a single PCI device id used by the virtual +Ethernet device found in some Compute Engine VMs. + ++--------------+----------+---------+ +|Field | Value | Comments| ++==============+==========+=========+ +|Vendor ID | `0x1AE0` | Google | ++--------------+----------+---------+ +|Device ID | `0x0042` | | ++--------------+----------+---------+ +|Sub-vendor ID | `0x1AE0` | Google | ++--------------+----------+---------+ +|Sub-device ID | `0x0058` | | ++--------------+----------+---------+ +|Revision ID | `0x0` | | ++--------------+----------+---------+ +|Device Class | `0x200` | Ethernet| ++--------------+----------+---------+ + +PCI Bars +======== +The gVNIC PCI device exposes three 32-bit memory BARS: +- Bar0 - Device configuration and status registers. 
+- Bar1 - MSI-X vector table +- Bar2 - IRQ, RX and TX doorbells + +Device Interactions +=================== +The driver interacts with the device in the following ways: + - Registers + - A block of MMIO registers + - See gve_register.h for more detail + - Admin Queue + - See description below + - Reset + - At any time the device can be reset + - Interrupts + - See supported interrupts below + - Transmit and Receive Queues + - See description below + +Registers +--------- +All registers are MMIO and big endian. + +The registers are used for initializing and configuring the device as well as +querying device status in response to management interrupts. + +Admin Queue (AQ) +---------------- +The Admin Queue is a PAGE_SIZE memory block, treated as an array of AQ +commands, used by the driver to issue commands to the device and set up +resources. The driver and the device maintain a count of how many commands +have been submitted and executed. To issue AQ commands, the driver must do +the following (with proper locking): + +1) Copy new commands into the next available slots in the AQ array +2) Increment its counter by the number of new commands +3) Write the counter into the GVE_ADMIN_QUEUE_DOORBELL register +4) Poll the ADMIN_QUEUE_EVENT_COUNTER register until it equals + the value written to the doorbell, or until a timeout. + +The device will update the status field in each AQ command reported as +executed through the ADMIN_QUEUE_EVENT_COUNTER register. + +Device Resets +------------- +A device reset is triggered by writing 0x0 to the AQ PFN register. +This causes the device to release all resources allocated by the +driver, including the AQ itself. + +Interrupts +---------- +The following interrupts are supported by the driver: + +Management Interrupt +~~~~~~~~~~~~~~~~~~~~ +The management interrupt is used by the device to tell the driver to +look at the GVE_DEVICE_STATUS register.
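The four Admin Queue submission steps listed in the "Admin Queue (AQ)" section of gve.rst can be sketched as a small simulation. This is a Python model under stated assumptions, not driver code: `FakeDevice`, `AdminQueue`, the 64-slot array, the modulo slot indexing, and the command dictionaries are all hypothetical. Only the doorbell and event-counter roles come from the text.

```python
import threading
import time


class FakeDevice:
    """Stand-in for the device; runs commands as soon as the doorbell is written."""

    def __init__(self):
        self.event_counter = 0  # models ADMIN_QUEUE_EVENT_COUNTER

    def write_doorbell(self, value, slots):
        # Models a write to GVE_ADMIN_QUEUE_DOORBELL: execute every command
        # submitted so far and update its status field.
        while self.event_counter != value:
            slots[self.event_counter % len(slots)]["status"] = "executed"
            self.event_counter += 1


class AdminQueue:
    def __init__(self, device, num_slots=64):  # slot count is an assumption
        self.device = device
        self.slots = [None] * num_slots
        self.counter = 0  # driver-side count of submitted commands
        self.lock = threading.Lock()  # "with proper locking"

    def submit(self, commands, timeout_s=1.0):
        if len(commands) > len(self.slots):
            raise ValueError("more commands than AQ slots")
        with self.lock:
            # 1) Copy new commands into the next available slots.
            for offset, cmd in enumerate(commands):
                self.slots[(self.counter + offset) % len(self.slots)] = cmd
            # 2) Increment the counter by the number of new commands.
            self.counter += len(commands)
            # 3) Write the counter to the doorbell.
            self.device.write_doorbell(self.counter, self.slots)
            # 4) Poll the event counter until it matches, or time out.
            deadline = time.monotonic() + timeout_s
            while self.device.event_counter != self.counter:
                if time.monotonic() > deadline:
                    raise TimeoutError("admin queue command timed out")
                time.sleep(0.001)
        return [cmd["status"] for cmd in commands]


aq = AdminQueue(FakeDevice())
print(aq.submit([{"opcode": "A", "status": None}, {"opcode": "B", "status": None}]))
```

Holding the lock across all four steps keeps the counter, the slot contents, and the doorbell value consistent when several callers submit commands concurrently.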
+ +The handler for the management irq simply queues the service task in +the workqueue to check the register and acks the irq. + +Notification Block Interrupts +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The notification block interrupts are used to tell the driver to poll +the queues associated with that interrupt. + +The handler for these irqs schedule the napi for that block to run +and poll the queues. + +Traffic Queues +-------------- +gVNIC's queues are composed of a descriptor ring and a buffer and are +assigned to a notification block. + +The descriptor rings are power-of-two-sized ring buffers consisting of +fixed-size descriptors. They advance their head pointer using a __be32 +doorbell located in Bar2. The tail pointers are advanced by consuming +descriptors in-order and updating a __be32 counter. Both the doorbell +and the counter overflow to zero. + +Each queue's buffers must be registered in advance with the device as a +queue page list, and packet data can only be put in those pages. + +Transmit +~~~~~~~~ +gve maps the buffers for transmit rings into a FIFO and copies the packets +into the FIFO before sending them to the NIC. + +Receive +~~~~~~~ +The buffers for receive rings are put into a data ring that is the same +length as the descriptor ring and the head and tail pointers advance over +the rings together. diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst index 75fa537763a4..2b7fefe72351 100644 --- a/Documentation/networking/device_drivers/index.rst +++ b/Documentation/networking/device_drivers/index.rst @@ -21,6 +21,8 @@ Contents: intel/i40e intel/iavf intel/ice + google/gve + mellanox/mlx5 .. only:: subproject diff --git a/Documentation/networking/device_drivers/mellanox/mlx5.rst b/Documentation/networking/device_drivers/mellanox/mlx5.rst new file mode 100644 index 000000000000..4eeef2df912f --- /dev/null +++ b/Documentation/networking/device_drivers/mellanox/mlx5.rst @@ -0,0 +1,173 @@ +.. 
SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB + +================================================= +Mellanox ConnectX(R) mlx5 core VPI Network Driver +================================================= + +Copyright (c) 2019, Mellanox Technologies LTD. + +Contents +======== + +- `Enabling the driver and kconfig options`_ +- `Devlink health reporters`_ + +Enabling the driver and kconfig options +================================================ + +| mlx5 core is modular and most of the major mlx5 core driver features can be selected (compiled in/out) +| at build time via kernel Kconfig flags. +| Basic features, ethernet net device rx/tx offloads and XDP, are available with the most basic flags +| CONFIG_MLX5_CORE=y/m and CONFIG_MLX5_CORE_EN=y. +| For the list of advanced features please see below. + +**CONFIG_MLX5_CORE=(y/m/n)** (module mlx5_core.ko) + +| The driver can be enabled by choosing CONFIG_MLX5_CORE=y/m in kernel config. +| This will provide mlx5 core driver for mlx5 ulps to interface with (mlx5e, mlx5_ib). + + +**CONFIG_MLX5_CORE_EN=(y/n)** + +| Choosing this option will allow basic ethernet netdevice support with all of the standard rx/tx offloads. +| mlx5e is the mlx5 ulp driver which provides netdevice kernel interface, when chosen, mlx5e will be +| built-in into mlx5_core.ko. + + +**CONFIG_MLX5_EN_ARFS=(y/n)** + +| Enables Hardware-accelerated receive flow steering (arfs) support, and ntuple filtering. +| https://community.mellanox.com/s/article/howto-configure-arfs-on-connectx-4 + + +**CONFIG_MLX5_EN_RXNFC=(y/n)** + +| Enables ethtool receive network flow classification, which allows user defined +| flow rules to direct traffic into arbitrary rx queue via ethtool set/get_rxnfc API. + + +**CONFIG_MLX5_CORE_EN_DCB=(y/n)**: + +| Enables `Data Center Bridging (DCB) Support <https://community.mellanox.com/s/article/howto-auto-config-pfc-and-ets-on-connectx-4-via-lldp-dcbx>`_. 
+ + +**CONFIG_MLX5_MPFS=(y/n)** + +| Ethernet Multi-Physical Function Switch (MPFS) support in ConnectX NIC. +| MPFs is required for when `Multi-Host <http://www.mellanox.com/page/multihost>`_ configuration is enabled to allow passing +| user configured unicast MAC addresses to the requesting PF. + + +**CONFIG_MLX5_ESWITCH=(y/n)** + +| Ethernet SRIOV E-Switch support in ConnectX NIC. E-Switch provides internal SRIOV packet steering +| and switching for the enabled VFs and PF in two available modes: +| 1) `Legacy SRIOV mode (L2 mac vlan steering based) <https://community.mellanox.com/s/article/howto-configure-sr-iov-for-connectx-4-connectx-5-with-kvm--ethernet-x>`_. +| 2) `Switchdev mode (eswitch offloads) <https://www.mellanox.com/related-docs/prod_software/ASAP2_Hardware_Offloading_for_vSwitches_User_Manual_v4.4.pdf>`_. + + +**CONFIG_MLX5_CORE_IPOIB=(y/n)** + +| IPoIB offloads & acceleration support. +| Requires CONFIG_MLX5_CORE_EN to provide an accelerated interface for the rdma +| IPoIB ulp netdevice. + + +**CONFIG_MLX5_FPGA=(y/n)** + +| Build support for the Innova family of network cards by Mellanox Technologies. +| Innova network cards are comprised of a ConnectX chip and an FPGA chip on one board. +| If you select this option, the mlx5_core driver will include the Innova FPGA core and allow +| building sandbox-specific client drivers. + + +**CONFIG_MLX5_EN_IPSEC=(y/n)** + +| Enables `IPSec XFRM cryptography-offload accelaration <http://www.mellanox.com/related-docs/prod_software/Mellanox_Innova_IPsec_Ethernet_Adapter_Card_User_Manual.pdf>`_. + +**CONFIG_MLX5_EN_TLS=(y/n)** + +| TLS cryptography-offload accelaration. + + +**CONFIG_MLX5_INFINIBAND=(y/n/m)** (module mlx5_ib.ko) + +| Provides low-level InfiniBand/RDMA and `RoCE <https://community.mellanox.com/s/article/recommended-network-configuration-examples-for-roce-deployment>`_ support. 
+ + +**External options** ( Choose if the corresponding mlx5 feature is required ) + +- CONFIG_PTP_1588_CLOCK: When chosen, mlx5 ptp support will be enabled +- CONFIG_VXLAN: When chosen, mlx5 vxlan support will be enabled. +- CONFIG_MLXFW: When chosen, mlx5 firmware flashing support will be enabled (via devlink and ethtool). + + +Devlink health reporters +======================== + +tx reporter +----------- +The tx reporter is responsible for two error scenarios: + +- TX timeout + Report on kernel tx timeout detection. + Recover by searching lost interrupts. +- TX error completion + Report on error tx completion. + Recover by flushing the TX queue and resetting it. + +TX reporter also supports Diagnose callback, on which it provides +real time information of its send queues status. + +User commands examples: + +- Diagnose send queues status:: + + $ devlink health diagnose pci/0000:82:00.0 reporter tx + +- Show number of tx errors indicated, number of recover flows ended successfully, + is autorecover enabled and graceful period from last recover:: + + $ devlink health show pci/0000:82:00.0 reporter tx + +fw reporter +----------- +The fw reporter implements diagnose and dump callbacks. +It follows symptoms of fw error such as fw syndrome by triggering +fw core dump and storing it into the dump buffer. +The fw reporter diagnose command can be triggered any time by the user to check +current fw status. + +User commands examples: + +- Check fw health status:: + + $ devlink health diagnose pci/0000:82:00.0 reporter fw + +- Read FW core dump if already stored or trigger new one:: + + $ devlink health dump show pci/0000:82:00.0 reporter fw + +NOTE: This command can run only on the PF which has fw tracer ownership, +running it on other PF or any VF will return "Operation not permitted". + +fw fatal reporter +----------------- +The fw fatal reporter implements dump and recover callbacks. +It follows fatal errors indications by CR-space dump and recover flow.
+The CR-space dump uses vsc interface which is valid even if the FW command +interface is not functional, which is the case in most FW fatal errors. +The recover function runs recover flow which reloads the driver and triggers fw +reset if needed. + +User commands examples: + +- Run fw recover flow manually:: + + $ devlink health recover pci/0000:82:00.0 reporter fw_fatal + +- Read FW CR-space dump if already strored or trigger new one:: + + $ devlink health dump show pci/0000:82:00.1 reporter fw_fatal + +NOTE: This command can run only on PF. diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index a73b3a02e49a..f0e6d1f53485 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -80,6 +80,7 @@ fib_multipath_hash_policy - INTEGER Possible values: 0 - Layer 3 1 - Layer 4 + 2 - Layer 3 or inner Layer 3 if present fib_sync_mem - UNSIGNED INTEGER Amount of dirty memory from fib entries that can be backlogged before @@ -255,6 +256,14 @@ tcp_base_mss - INTEGER Path MTU discovery (MTU probing). If MTU probing is enabled, this is the initial MSS used by the connection. +tcp_min_snd_mss - INTEGER + TCP SYN and SYNACK messages usually advertise an ADVMSS option, + as described in RFC 1122 and RFC 6691. + If this ADVMSS option is smaller than tcp_min_snd_mss, + it is silently capped to tcp_min_snd_mss. + + Default : 48 (at least 8 bytes of payload per segment) + tcp_congestion_control - STRING Set the congestion control algorithm to be used for new connections. The algorithm "reno" is always available, but @@ -792,6 +801,14 @@ tcp_challenge_ack_limit - INTEGER in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks) Default: 100 +tcp_rx_skb_cache - BOOLEAN + Controls a per TCP socket cache of one skb, that might help + performance of some workloads. This might be dangerous + on systems with a lot of TCP sockets, since it increases + memory usage. 
+ + Default: 0 (disabled) + UDP variables: udp_l3mdev_accept - BOOLEAN @@ -1429,14 +1446,26 @@ flowlabel_state_ranges - BOOLEAN FALSE: disabled Default: true -flowlabel_reflect - BOOLEAN - Automatically reflect the flow label. Needed for Path MTU +flowlabel_reflect - INTEGER + Control flow label reflection. Needed for Path MTU Discovery to work with Equal Cost Multipath Routing in anycast environments. See RFC 7690 and: https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01 - TRUE: enabled - FALSE: disabled - Default: FALSE + + This is a bitmask. + 1: enabled for established flows + + Note that this prevents automatic flowlabel changes, as done + in "tcp: change IPv6 flow-label upon receiving spurious retransmission" + and "tcp: Change txhash on every SYN and RTO retransmit" + + 2: enabled for TCP RESET packets (no active listener) + If set, a RST packet sent in response to a SYN packet on a closed + port will reflect the incoming flow label. + + 4: enabled for ICMPv6 echo reply messages. + + Default: 0 fib_multipath_hash_policy - INTEGER Controls which hash policy to use for multipath routes. diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst index 0dd90d7df5ec..a689966bc4be 100644 --- a/Documentation/networking/phy.rst +++ b/Documentation/networking/phy.rst @@ -202,7 +202,8 @@ the PHY/controller, of which the PHY needs to be aware. *interface* is a u32 which specifies the connection type used between the controller and the PHY. Examples are GMII, MII, -RGMII, and SGMII. For a full list, see include/linux/phy.h +RGMII, and SGMII. See "PHY interface mode" below. For a full +list, see include/linux/phy.h Now just make sure that phydev->supported and phydev->advertising have any values pruned from them which don't make sense for your controller (a 10/100 @@ -225,6 +226,48 @@ When you want to disconnect from the network (even if just briefly), you call phy_stop(phydev). 
This function also stops the phylib state machine and disables PHY interrupts. +PHY interface modes +=================== + +The PHY interface mode supplied in the phy_connect() family of functions +defines the initial operating mode of the PHY interface. This is not +guaranteed to remain constant; there are PHYs which dynamically change +their interface mode without software interaction depending on the +negotiation results. + +Some of the interface modes are described below: + +``PHY_INTERFACE_MODE_1000BASEX`` + This defines the 1000BASE-X single-lane serdes link as defined by the + 802.3 standard section 36. The link operates at a fixed bit rate of + 1.25Gbaud using a 10B/8B encoding scheme, resulting in an underlying + data rate of 1Gbps. Embedded in the data stream is a 16-bit control + word which is used to negotiate the duplex and pause modes with the + remote end. This does not include "up-clocked" variants such as 2.5Gbps + speeds (see below.) + +``PHY_INTERFACE_MODE_2500BASEX`` + This defines a variant of 1000BASE-X which is clocked 2.5 times faster, + than the 802.3 standard giving a fixed bit rate of 3.125Gbaud. + +``PHY_INTERFACE_MODE_SGMII`` + This is used for Cisco SGMII, which is a modification of 1000BASE-X + as defined by the 802.3 standard. The SGMII link consists of a single + serdes lane running at a fixed bit rate of 1.25Gbaud with 10B/8B + encoding. The underlying data rate is 1Gbps, with the slower speeds of + 100Mbps and 10Mbps being achieved through replication of each data symbol. + The 802.3 control word is re-purposed to send the negotiated speed and + duplex information from to the MAC, and for the MAC to acknowledge + receipt. This does not include "up-clocked" variants such as 2.5Gbps + speeds. 
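The bit rates and data rates quoted for 1000BASE-X, 2500BASE-X, and SGMII are related by the encoding overhead: 8 data bits are carried in every 10 line bits. The Python sketch below works through that arithmetic. The helper names are hypothetical, and the SGMII replication factors (10x for 100Mbps, 100x for 10Mbps) are derived from the fixed 1Gbps underlying rate rather than taken from the text.

```python
def data_rate_gbps(line_rate_gbaud, data_bits=8, line_bits=10):
    """Payload rate of a serdes lane that carries data_bits per line_bits."""
    return line_rate_gbaud * data_bits / line_bits


def sgmii_replication(speed_mbps):
    """How many times each data symbol is repeated on a 1Gbps SGMII link."""
    if speed_mbps not in (10, 100, 1000):
        raise ValueError("SGMII carries 10, 100 or 1000 Mbps")
    return 1000 // speed_mbps


print(data_rate_gbps(1.25))    # 1000BASE-X and SGMII line rate
print(data_rate_gbps(3.125))   # 2500BASE-X line rate
print(sgmii_replication(100), sgmii_replication(10))
```

The same formula shows why 2500BASE-X, clocked 2.5 times faster than 1000BASE-X, carries 2.5 times the data.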
+ + Note: mismatched SGMII vs 1000BASE-X configuration on a link can + successfully pass data in some circumstances, but the 16-bit control + word will not be correctly interpreted, which may cause mismatches in + duplex, pause or other settings. This is dependent on the MAC and/or + PHY behaviour. + + Pause frames / flow control =========================== diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt index 0235ae69af2a..f2a0147c933d 100644 --- a/Documentation/networking/rds.txt +++ b/Documentation/networking/rds.txt @@ -389,7 +389,7 @@ Multipath RDS (mprds) a common (to all paths) part, and a per-path struct rds_conn_path. All I/O workqs and reconnect threads are driven from the rds_conn_path. Transports such as TCP that are multipath capable may then set up a - TPC socket per rds_conn_path, and this is managed by the transport via + TCP socket per rds_conn_path, and this is managed by the transport via the transport privatee cp_transport_data pointer. Transports announce themselves as multipath capable by setting the diff --git a/Documentation/networking/tls-offload.rst b/Documentation/networking/tls-offload.rst index eb7c9b81ccf5..048e5ca44824 100644 --- a/Documentation/networking/tls-offload.rst +++ b/Documentation/networking/tls-offload.rst @@ -206,7 +206,11 @@ TX Segments transmitted from an offloaded socket can get out of sync in similar ways to the receive side-retransmissions - local drops -are possible, though network reorders are not. +are possible, though network reorders are not. There are currently +two mechanisms for dealing with out of order segments. + +Crypto state rebuilding +~~~~~~~~~~~~~~~~~~~~~~~ Whenever an out of order segment is transmitted the driver provides the device with enough information to perform cryptographic operations. @@ -225,6 +229,35 @@ was just a retransmission. 
The former is simpler, and does not require retransmission detection therefore it is the recommended method until such time it is proven inefficient. +Next record sync +~~~~~~~~~~~~~~~~ + +Whenever an out of order segment is detected the driver requests +that the ``ktls`` software fallback code encrypt it. If the segment's +sequence number is lower than expected the driver assumes retransmission +and doesn't change device state. If the segment is in the future, which +may imply a local drop, the driver asks the stack to sync the device +to the next record state and falls back to software. + +Resync request is indicated with: + +.. code-block:: c + + void tls_offload_tx_resync_request(struct sock *sk, u32 got_seq, u32 exp_seq) + +Until resync is complete the driver should not access its expected TCP +sequence number (as it will be updated from a different context). +The following helper should be used to test if resync is complete: + +.. code-block:: c + + bool tls_offload_tx_resync_pending(struct sock *sk) + +Next time ``ktls`` pushes a record it will first send its TCP sequence number +and TLS record number to the driver. The stack will also make sure that +the new record will start on a segment boundary (like it does when +the connection is initially added). + RX -- @@ -268,6 +301,9 @@ Device can only detect that segment 4 also contains a TLS header if it knows the length of the previous record from segment 2. In this case the device will lose synchronization with the stream. +Stream scan resynchronization +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + When the device gets out of sync and the stream reaches TCP sequence numbers more than a max size record past the expected TCP sequence number, the device starts scanning for a known header pattern. For example @@ -298,6 +334,22 @@ Special care has to be taken if the confirmation request is passed asynchronously to the packet stream and record may get processed by the kernel before the confirmation request. 
+Stack-driven resynchronization +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The driver may also request the stack to perform resynchronization +whenever it sees the records are no longer getting decrypted. +If the connection is configured in this mode the stack automatically +schedules resynchronization after it has received two completely encrypted +records. + +The stack waits for the socket to drain and informs the device about +the next expected record number and its TCP sequence number. If the +records continue to be received fully encrypted, the stack retries the +synchronization with an exponential backoff (first after 2 encrypted +records, then after 4 records, after 8, after 16... up until every +128 records). + Error handling ==============