aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/networking
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/networking')
-rw-r--r--Documentation/networking/af_xdp.rst24
-rw-r--r--Documentation/networking/device_drivers/aquantia/atlantic.txt439
-rw-r--r--Documentation/networking/device_drivers/google/gve.rst123
-rw-r--r--Documentation/networking/device_drivers/index.rst2
-rw-r--r--Documentation/networking/device_drivers/mellanox/mlx5.rst173
-rw-r--r--Documentation/networking/ip-sysctl.txt39
-rw-r--r--Documentation/networking/phy.rst45
-rw-r--r--Documentation/networking/rds.txt2
-rw-r--r--Documentation/networking/tls-offload.rst54
9 files changed, 888 insertions, 13 deletions
diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
index e14d7d40fc75..eeedc2e826aa 100644
--- a/Documentation/networking/af_xdp.rst
+++ b/Documentation/networking/af_xdp.rst
@@ -220,7 +220,21 @@ Usage
In order to use AF_XDP sockets there are two parts needed. The
user-space application and the XDP program. For a complete setup and
usage example, please refer to the sample application. The user-space
-side is xdpsock_user.c and the XDP side xdpsock_kern.c.
+side is xdpsock_user.c and the XDP side is part of libbpf.
+
+The XDP code sample included in tools/lib/bpf/xsk.c is the following::
+
+ SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
+ {
+ int index = ctx->rx_queue_index;
+
+ // A set entry here means that the correspnding queue_id
+ // has an active AF_XDP socket bound to it.
+ if (bpf_map_lookup_elem(&xsks_map, &index))
+ return bpf_redirect_map(&xsks_map, index, 0);
+
+ return XDP_PASS;
+ }
Naive ring dequeue and enqueue could look like this::
@@ -316,16 +330,16 @@ A: When a netdev of a physical NIC is initialized, Linux usually
all the traffic, you can force the netdev to only have 1 queue, queue
id 0, and then bind to queue 0. You can use ethtool to do this::
- sudo ethtool -L <interface> combined 1
+ sudo ethtool -L <interface> combined 1
If you want to only see part of the traffic, you can program the
NIC through ethtool to filter out your traffic to a single queue id
that you can bind your XDP socket to. Here is one example in which
UDP traffic to and from port 4242 are sent to queue 2::
- sudo ethtool -N <interface> rx-flow-hash udp4 fn
- sudo ethtool -N <interface> flow-type udp4 src-port 4242 dst-port \
- 4242 action 2
+ sudo ethtool -N <interface> rx-flow-hash udp4 fn
+ sudo ethtool -N <interface> flow-type udp4 src-port 4242 dst-port \
+ 4242 action 2
A number of other ways are possible all up to the capabilitites of
the NIC you have.
diff --git a/Documentation/networking/device_drivers/aquantia/atlantic.txt b/Documentation/networking/device_drivers/aquantia/atlantic.txt
new file mode 100644
index 000000000000..d235cbaeccc6
--- /dev/null
+++ b/Documentation/networking/device_drivers/aquantia/atlantic.txt
@@ -0,0 +1,439 @@
+aQuantia AQtion Driver for the aQuantia Multi-Gigabit PCI Express Family of
+Ethernet Adapters
+=============================================================================
+
+Contents
+========
+
+- Identifying Your Adapter
+- Configuration
+- Supported ethtool options
+- Command Line Parameters
+- Config file parameters
+- Support
+- License
+
+Identifying Your Adapter
+========================
+
+The driver in this release is compatible with AQC-100, AQC-107, AQC-108 based ethernet adapters.
+
+
+SFP+ Devices (for AQC-100 based adapters)
+----------------------------------
+
+This release tested with passive Direct Attach Cables (DAC) and SFP+/LC Optical Transceiver.
+
+Configuration
+=========================
+ Viewing Link Messages
+ ---------------------
+ Link messages will not be displayed to the console if the distribution is
+ restricting system messages. In order to see network driver link messages on
+ your console, set dmesg to eight by entering the following:
+
+ dmesg -n 8
+
+ NOTE: This setting is not saved across reboots.
+
+ Jumbo Frames
+ ------------
+ The driver supports Jumbo Frames for all adapters. Jumbo Frames support is
+ enabled by changing the MTU to a value larger than the default of 1500.
+ The maximum value for the MTU is 16000. Use the `ip` command to
+ increase the MTU size. For example:
+
+ ip link set mtu 16000 dev enp1s0
+
+ ethtool
+ -------
+ The driver utilizes the ethtool interface for driver configuration and
+ diagnostics, as well as displaying statistical information. The latest
+ ethtool version is required for this functionality.
+
+ NAPI
+ ----
+ NAPI (Rx polling mode) is supported in the atlantic driver.
+
+Supported ethtool options
+============================
+ Viewing adapter settings
+ ---------------------
+ ethtool <ethX>
+
+ Output example:
+
+ Settings for enp1s0:
+ Supported ports: [ TP ]
+ Supported link modes: 100baseT/Full
+ 1000baseT/Full
+ 10000baseT/Full
+ 2500baseT/Full
+ 5000baseT/Full
+ Supported pause frame use: Symmetric
+ Supports auto-negotiation: Yes
+ Supported FEC modes: Not reported
+ Advertised link modes: 100baseT/Full
+ 1000baseT/Full
+ 10000baseT/Full
+ 2500baseT/Full
+ 5000baseT/Full
+ Advertised pause frame use: Symmetric
+ Advertised auto-negotiation: Yes
+ Advertised FEC modes: Not reported
+ Speed: 10000Mb/s
+ Duplex: Full
+ Port: Twisted Pair
+ PHYAD: 0
+ Transceiver: internal
+ Auto-negotiation: on
+ MDI-X: Unknown
+ Supports Wake-on: g
+ Wake-on: d
+ Link detected: yes
+
+ ---
+ Note: AQrate speeds (2.5/5 Gb/s) will be displayed only with linux kernels > 4.10.
+ But you can still use these speeds:
+ ethtool -s eth0 autoneg off speed 2500
+
+ Viewing adapter information
+ ---------------------
+ ethtool -i <ethX>
+
+ Output example:
+
+ driver: atlantic
+ version: 5.2.0-050200rc5-generic-kern
+ firmware-version: 3.1.78
+ expansion-rom-version:
+ bus-info: 0000:01:00.0
+ supports-statistics: yes
+ supports-test: no
+ supports-eeprom-access: no
+ supports-register-dump: yes
+ supports-priv-flags: no
+
+
+ Viewing Ethernet adapter statistics:
+ ---------------------
+ ethtool -S <ethX>
+
+ Output example:
+ NIC statistics:
+ InPackets: 13238607
+ InUCast: 13293852
+ InMCast: 52
+ InBCast: 3
+ InErrors: 0
+ OutPackets: 23703019
+ OutUCast: 23704941
+ OutMCast: 67
+ OutBCast: 11
+ InUCastOctects: 213182760
+ OutUCastOctects: 22698443
+ InMCastOctects: 6600
+ OutMCastOctects: 8776
+ InBCastOctects: 192
+ OutBCastOctects: 704
+ InOctects: 2131839552
+ OutOctects: 226938073
+ InPacketsDma: 95532300
+ OutPacketsDma: 59503397
+ InOctetsDma: 1137102462
+ OutOctetsDma: 2394339518
+ InDroppedDma: 0
+ Queue[0] InPackets: 23567131
+ Queue[0] OutPackets: 20070028
+ Queue[0] InJumboPackets: 0
+ Queue[0] InLroPackets: 0
+ Queue[0] InErrors: 0
+ Queue[1] InPackets: 45428967
+ Queue[1] OutPackets: 11306178
+ Queue[1] InJumboPackets: 0
+ Queue[1] InLroPackets: 0
+ Queue[1] InErrors: 0
+ Queue[2] InPackets: 3187011
+ Queue[2] OutPackets: 13080381
+ Queue[2] InJumboPackets: 0
+ Queue[2] InLroPackets: 0
+ Queue[2] InErrors: 0
+ Queue[3] InPackets: 23349136
+ Queue[3] OutPackets: 15046810
+ Queue[3] InJumboPackets: 0
+ Queue[3] InLroPackets: 0
+ Queue[3] InErrors: 0
+
+ Interrupt coalescing support
+ ---------------------------------
+ ITR mode, TX/RX coalescing timings could be viewed with:
+
+ ethtool -c <ethX>
+
+ and changed with:
+
+ ethtool -C <ethX> tx-usecs <usecs> rx-usecs <usecs>
+
+ To disable coalescing:
+
+ ethtool -C <ethX> tx-usecs 0 rx-usecs 0 tx-max-frames 1 tx-max-frames 1
+
+ Wake on LAN support
+ ---------------------------------
+
+ WOL support by magic packet:
+
+ ethtool -s <ethX> wol g
+
+ To disable WOL:
+
+ ethtool -s <ethX> wol d
+
+ Set and check the driver message level
+ ---------------------------------
+
+ Set message level
+
+ ethtool -s <ethX> msglvl <level>
+
+ Level values:
+
+ 0x0001 - general driver status.
+ 0x0002 - hardware probing.
+ 0x0004 - link state.
+ 0x0008 - periodic status check.
+ 0x0010 - interface being brought down.
+ 0x0020 - interface being brought up.
+ 0x0040 - receive error.
+ 0x0080 - transmit error.
+ 0x0200 - interrupt handling.
+ 0x0400 - transmit completion.
+ 0x0800 - receive completion.
+ 0x1000 - packet contents.
+ 0x2000 - hardware status.
+ 0x4000 - Wake-on-LAN status.
+
+ By default, the level of debugging messages is set 0x0001(general driver status).
+
+ Check message level
+
+ ethtool <ethX> | grep "Current message level"
+
+ If you want to disable the output of messages
+
+ ethtool -s <ethX> msglvl 0
+
+ RX flow rules (ntuple filters)
+ ---------------------------------
+ There are separate rules supported, that applies in that order:
+ 1. 16 VLAN ID rules
+ 2. 16 L2 EtherType rules
+ 3. 8 L3/L4 5-Tuple rules
+
+
+ The driver utilizes the ethtool interface for configuring ntuple filters,
+ via "ethtool -N <device> <filter>".
+
+ To enable or disable the RX flow rules:
+
+ ethtool -K ethX ntuple <on|off>
+
+ When disabling ntuple filters, all the user programed filters are
+ flushed from the driver cache and hardware. All needed filters must
+ be re-added when ntuple is re-enabled.
+
+ Because of the fixed order of the rules, the location of filters is also fixed:
+ - Locations 0 - 15 for VLAN ID filters
+ - Locations 16 - 31 for L2 EtherType filters
+ - Locations 32 - 39 for L3/L4 5-tuple filters (locations 32, 36 for IPv6)
+
+ The L3/L4 5-tuple (protocol, source and destination IP address, source and
+ destination TCP/UDP/SCTP port) is compared against 8 filters. For IPv4, up to
+ 8 source and destination addresses can be matched. For IPv6, up to 2 pairs of
+ addresses can be supported. Source and destination ports are only compared for
+ TCP/UDP/SCTP packets.
+
+ To add a filter that directs packet to queue 5, use <-N|-U|--config-nfc|--config-ntuple> switch:
+
+ ethtool -N <ethX> flow-type udp4 src-ip 10.0.0.1 dst-ip 10.0.0.2 src-port 2000 dst-port 2001 action 5 <loc 32>
+
+ - action is the queue number.
+ - loc is the rule number.
+
+ For "flow-type ip4|udp4|tcp4|sctp4|ip6|udp6|tcp6|sctp6" you must set the loc
+ number within 32 - 39.
+ For "flow-type ip4|udp4|tcp4|sctp4|ip6|udp6|tcp6|sctp6" you can set 8 rules
+ for traffic IPv4 or you can set 2 rules for traffic IPv6. Loc number traffic
+ IPv6 is 32 and 36.
+ At the moment you can not use IPv4 and IPv6 filters at the same time.
+
+ Example filter for IPv6 filter traffic:
+
+ sudo ethtool -N <ethX> flow-type tcp6 src-ip 2001:db8:0:f101::1 dst-ip 2001:db8:0:f101::2 action 1 loc 32
+ sudo ethtool -N <ethX> flow-type ip6 src-ip 2001:db8:0:f101::2 dst-ip 2001:db8:0:f101::5 action -1 loc 36
+
+ Example filter for IPv4 filter traffic:
+
+ sudo ethtool -N <ethX> flow-type udp4 src-ip 10.0.0.4 dst-ip 10.0.0.7 src-port 2000 dst-port 2001 loc 32
+ sudo ethtool -N <ethX> flow-type tcp4 src-ip 10.0.0.3 dst-ip 10.0.0.9 src-port 2000 dst-port 2001 loc 33
+ sudo ethtool -N <ethX> flow-type ip4 src-ip 10.0.0.6 dst-ip 10.0.0.4 loc 34
+
+ If you set action -1, then all traffic corresponding to the filter will be discarded.
+ The maximum value action is 31.
+
+
+ The VLAN filter (VLAN id) is compared against 16 filters.
+ VLAN id must be accompanied by mask 0xF000. That is to distinguish VLAN filter
+ from L2 Ethertype filter with UserPriority since both User Priority and VLAN ID
+ are passed in the same 'vlan' parameter.
+
+ To add a filter that directs packets from VLAN 2001 to queue 5:
+ ethtool -N <ethX> flow-type ip4 vlan 2001 m 0xF000 action 1 loc 0
+
+
+ L2 EtherType filters allows filter packet by EtherType field or both EtherType
+ and User Priority (PCP) field of 802.1Q.
+ UserPriority (vlan) parameter must be accompanied by mask 0x1FFF. That is to
+ distinguish VLAN filter from L2 Ethertype filter with UserPriority since both
+ User Priority and VLAN ID are passed in the same 'vlan' parameter.
+
+ To add a filter that directs IP4 packess of priority 3 to queue 3:
+ ethtool -N <ethX> flow-type ether proto 0x800 vlan 0x600 m 0x1FFF action 3 loc 16
+
+
+ To see the list of filters currently present:
+
+ ethtool <-u|-n|--show-nfc|--show-ntuple> <ethX>
+
+ Rules may be deleted from the table itself. This is done using:
+
+ sudo ethtool <-N|-U|--config-nfc|--config-ntuple> <ethX> delete <loc>
+
+ - loc is the rule number to be deleted.
+
+ Rx filters is an interface to load the filter table that funnels all flow
+ into queue 0 unless an alternative queue is specified using "action". In that
+ case, any flow that matches the filter criteria will be directed to the
+ appropriate queue. RX filters is supported on all kernels 2.6.30 and later.
+
+ RSS for UDP
+ ---------------------------------
+ Currently, NIC does not support RSS for fragmented IP packets, which leads to
+ incorrect working of RSS for fragmented UDP traffic. To disable RSS for UDP the
+ RX Flow L3/L4 rule may be used.
+
+ Example:
+ ethtool -N eth0 flow-type udp4 action 0 loc 32
+
+Command Line Parameters
+=======================
+The following command line parameters are available on atlantic driver:
+
+aq_itr -Interrupt throttling mode
+----------------------------------------
+Accepted values: 0, 1, 0xFFFF
+Default value: 0xFFFF
+0 - Disable interrupt throttling.
+1 - Enable interrupt throttling and use specified tx and rx rates.
+0xFFFF - Auto throttling mode. Driver will choose the best RX and TX
+ interrupt throtting settings based on link speed.
+
+aq_itr_tx - TX interrupt throttle rate
+----------------------------------------
+Accepted values: 0 - 0x1FF
+Default value: 0
+TX side throttling in microseconds. Adapter will setup maximum interrupt delay
+to this value. Minimum interrupt delay will be a half of this value
+
+aq_itr_rx - RX interrupt throttle rate
+----------------------------------------
+Accepted values: 0 - 0x1FF
+Default value: 0
+RX side throttling in microseconds. Adapter will setup maximum interrupt delay
+to this value. Minimum interrupt delay will be a half of this value
+
+Note: ITR settings could be changed in runtime by ethtool -c means (see below)
+
+Config file parameters
+=======================
+For some fine tuning and performance optimizations,
+some parameters can be changed in the {source_dir}/aq_cfg.h file.
+
+AQ_CFG_RX_PAGEORDER
+----------------------------------------
+Default value: 0
+RX page order override. Thats a power of 2 number of RX pages allocated for
+each descriptor. Received descriptor size is still limited by AQ_CFG_RX_FRAME_MAX.
+Increasing pageorder makes page reuse better (actual on iommu enabled systems).
+
+AQ_CFG_RX_REFILL_THRES
+----------------------------------------
+Default value: 32
+RX refill threshold. RX path will not refill freed descriptors until the
+specified number of free descriptors is observed. Larger values may help
+better page reuse but may lead to packet drops as well.
+
+AQ_CFG_VECS_DEF
+------------------------------------------------------------
+Number of queues
+Valid Range: 0 - 8 (up to AQ_CFG_VECS_MAX)
+Default value: 8
+Notice this value will be capped by the number of cores available on the system.
+
+AQ_CFG_IS_RSS_DEF
+------------------------------------------------------------
+Enable/disable Receive Side Scaling
+
+This feature allows the adapter to distribute receive processing
+across multiple CPU-cores and to prevent from overloading a single CPU core.
+
+Valid values
+0 - disabled
+1 - enabled
+
+Default value: 1
+
+AQ_CFG_NUM_RSS_QUEUES_DEF
+------------------------------------------------------------
+Number of queues for Receive Side Scaling
+Valid Range: 0 - 8 (up to AQ_CFG_VECS_DEF)
+
+Default value: AQ_CFG_VECS_DEF
+
+AQ_CFG_IS_LRO_DEF
+------------------------------------------------------------
+Enable/disable Large Receive Offload
+
+This offload enables the adapter to coalesce multiple TCP segments and indicate
+them as a single coalesced unit to the OS networking subsystem.
+The system consumes less energy but it also introduces more latency in packets processing.
+
+Valid values
+0 - disabled
+1 - enabled
+
+Default value: 1
+
+AQ_CFG_TX_CLEAN_BUDGET
+----------------------------------------
+Maximum descriptors to cleanup on TX at once.
+Default value: 256
+
+After the aq_cfg.h file changed the driver must be rebuilt to take effect.
+
+Support
+=======
+
+If an issue is identified with the released source code on the supported
+kernel with a supported adapter, email the specific information related
+to the issue to support@aquantia.com
+
+License
+=======
+
+aQuantia Corporation Network Driver
+Copyright(c) 2014 - 2019 aQuantia Corporation.
+
+This program is free software; you can redistribute it and/or modify it
+under the terms and conditions of the GNU General Public License,
+version 2, as published by the Free Software Foundation.
diff --git a/Documentation/networking/device_drivers/google/gve.rst b/Documentation/networking/device_drivers/google/gve.rst
new file mode 100644
index 000000000000..793693cef6e3
--- /dev/null
+++ b/Documentation/networking/device_drivers/google/gve.rst
@@ -0,0 +1,123 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+==============================================================
+Linux kernel driver for Compute Engine Virtual Ethernet (gve):
+==============================================================
+
+Supported Hardware
+===================
+The GVE driver binds to a single PCI device id used by the virtual
+Ethernet device found in some Compute Engine VMs.
+
++--------------+----------+---------+
+|Field | Value | Comments|
++==============+==========+=========+
+|Vendor ID | `0x1AE0` | Google |
++--------------+----------+---------+
+|Device ID | `0x0042` | |
++--------------+----------+---------+
+|Sub-vendor ID | `0x1AE0` | Google |
++--------------+----------+---------+
+|Sub-device ID | `0x0058` | |
++--------------+----------+---------+
+|Revision ID | `0x0` | |
++--------------+----------+---------+
+|Device Class | `0x200` | Ethernet|
++--------------+----------+---------+
+
+PCI Bars
+========
+The gVNIC PCI device exposes three 32-bit memory BARS:
+- Bar0 - Device configuration and status registers.
+- Bar1 - MSI-X vector table
+- Bar2 - IRQ, RX and TX doorbells
+
+Device Interactions
+===================
+The driver interacts with the device in the following ways:
+ - Registers
+ - A block of MMIO registers
+ - See gve_register.h for more detail
+ - Admin Queue
+ - See description below
+ - Reset
+ - At any time the device can be reset
+ - Interrupts
+ - See supported interrupts below
+ - Transmit and Receive Queues
+ - See description below
+
+Registers
+---------
+All registers are MMIO and big endian.
+
+The registers are used for initializing and configuring the device as well as
+querying device status in response to management interrupts.
+
+Admin Queue (AQ)
+----------------
+The Admin Queue is a PAGE_SIZE memory block, treated as an array of AQ
+commands, used by the driver to issue commands to the device and set up
+resources.The driver and the device maintain a count of how many commands
+have been submitted and executed. To issue AQ commands, the driver must do
+the following (with proper locking):
+
+1) Copy new commands into next available slots in the AQ array
+2) Increment its counter by he number of new commands
+3) Write the counter into the GVE_ADMIN_QUEUE_DOORBELL register
+4) Poll the ADMIN_QUEUE_EVENT_COUNTER register until it equals
+ the value written to the doorbell, or until a timeout.
+
+The device will update the status field in each AQ command reported as
+executed through the ADMIN_QUEUE_EVENT_COUNTER register.
+
+Device Resets
+-------------
+A device reset is triggered by writing 0x0 to the AQ PFN register.
+This causes the device to release all resources allocated by the
+driver, including the AQ itself.
+
+Interrupts
+----------
+The following interrupts are supported by the driver:
+
+Management Interrupt
+~~~~~~~~~~~~~~~~~~~~
+The management interrupt is used by the device to tell the driver to
+look at the GVE_DEVICE_STATUS register.
+
+The handler for the management irq simply queues the service task in
+the workqueue to check the register and acks the irq.
+
+Notification Block Interrupts
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The notification block interrupts are used to tell the driver to poll
+the queues associated with that interrupt.
+
+The handler for these irqs schedule the napi for that block to run
+and poll the queues.
+
+Traffic Queues
+--------------
+gVNIC's queues are composed of a descriptor ring and a buffer and are
+assigned to a notification block.
+
+The descriptor rings are power-of-two-sized ring buffers consisting of
+fixed-size descriptors. They advance their head pointer using a __be32
+doorbell located in Bar2. The tail pointers are advanced by consuming
+descriptors in-order and updating a __be32 counter. Both the doorbell
+and the counter overflow to zero.
+
+Each queue's buffers must be registered in advance with the device as a
+queue page list, and packet data can only be put in those pages.
+
+Transmit
+~~~~~~~~
+gve maps the buffers for transmit rings into a FIFO and copies the packets
+into the FIFO before sending them to the NIC.
+
+Receive
+~~~~~~~
+The buffers for receive rings are put into a data ring that is the same
+length as the descriptor ring and the head and tail pointers advance over
+the rings together.
diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst
index 75fa537763a4..2b7fefe72351 100644
--- a/Documentation/networking/device_drivers/index.rst
+++ b/Documentation/networking/device_drivers/index.rst
@@ -21,6 +21,8 @@ Contents:
intel/i40e
intel/iavf
intel/ice
+ google/gve
+ mellanox/mlx5
.. only:: subproject
diff --git a/Documentation/networking/device_drivers/mellanox/mlx5.rst b/Documentation/networking/device_drivers/mellanox/mlx5.rst
new file mode 100644
index 000000000000..4eeef2df912f
--- /dev/null
+++ b/Documentation/networking/device_drivers/mellanox/mlx5.rst
@@ -0,0 +1,173 @@
+.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+
+=================================================
+Mellanox ConnectX(R) mlx5 core VPI Network Driver
+=================================================
+
+Copyright (c) 2019, Mellanox Technologies LTD.
+
+Contents
+========
+
+- `Enabling the driver and kconfig options`_
+- `Devlink health reporters`_
+
+Enabling the driver and kconfig options
+================================================
+
+| mlx5 core is modular and most of the major mlx5 core driver features can be selected (compiled in/out)
+| at build time via kernel Kconfig flags.
+| Basic features, ethernet net device rx/tx offloads and XDP, are available with the most basic flags
+| CONFIG_MLX5_CORE=y/m and CONFIG_MLX5_CORE_EN=y.
+| For the list of advanced features please see below.
+
+**CONFIG_MLX5_CORE=(y/m/n)** (module mlx5_core.ko)
+
+| The driver can be enabled by choosing CONFIG_MLX5_CORE=y/m in kernel config.
+| This will provide mlx5 core driver for mlx5 ulps to interface with (mlx5e, mlx5_ib).
+
+
+**CONFIG_MLX5_CORE_EN=(y/n)**
+
+| Choosing this option will allow basic ethernet netdevice support with all of the standard rx/tx offloads.
+| mlx5e is the mlx5 ulp driver which provides netdevice kernel interface, when chosen, mlx5e will be
+| built-in into mlx5_core.ko.
+
+
+**CONFIG_MLX5_EN_ARFS=(y/n)**
+
+| Enables Hardware-accelerated receive flow steering (arfs) support, and ntuple filtering.
+| https://community.mellanox.com/s/article/howto-configure-arfs-on-connectx-4
+
+
+**CONFIG_MLX5_EN_RXNFC=(y/n)**
+
+| Enables ethtool receive network flow classification, which allows user defined
+| flow rules to direct traffic into arbitrary rx queue via ethtool set/get_rxnfc API.
+
+
+**CONFIG_MLX5_CORE_EN_DCB=(y/n)**:
+
+| Enables `Data Center Bridging (DCB) Support <https://community.mellanox.com/s/article/howto-auto-config-pfc-and-ets-on-connectx-4-via-lldp-dcbx>`_.
+
+
+**CONFIG_MLX5_MPFS=(y/n)**
+
+| Ethernet Multi-Physical Function Switch (MPFS) support in ConnectX NIC.
+| MPFs is required for when `Multi-Host <http://www.mellanox.com/page/multihost>`_ configuration is enabled to allow passing
+| user configured unicast MAC addresses to the requesting PF.
+
+
+**CONFIG_MLX5_ESWITCH=(y/n)**
+
+| Ethernet SRIOV E-Switch support in ConnectX NIC. E-Switch provides internal SRIOV packet steering
+| and switching for the enabled VFs and PF in two available modes:
+| 1) `Legacy SRIOV mode (L2 mac vlan steering based) <https://community.mellanox.com/s/article/howto-configure-sr-iov-for-connectx-4-connectx-5-with-kvm--ethernet-x>`_.
+| 2) `Switchdev mode (eswitch offloads) <https://www.mellanox.com/related-docs/prod_software/ASAP2_Hardware_Offloading_for_vSwitches_User_Manual_v4.4.pdf>`_.
+
+
+**CONFIG_MLX5_CORE_IPOIB=(y/n)**
+
+| IPoIB offloads & acceleration support.
+| Requires CONFIG_MLX5_CORE_EN to provide an accelerated interface for the rdma
+| IPoIB ulp netdevice.
+
+
+**CONFIG_MLX5_FPGA=(y/n)**
+
+| Build support for the Innova family of network cards by Mellanox Technologies.
+| Innova network cards are comprised of a ConnectX chip and an FPGA chip on one board.
+| If you select this option, the mlx5_core driver will include the Innova FPGA core and allow
+| building sandbox-specific client drivers.
+
+
+**CONFIG_MLX5_EN_IPSEC=(y/n)**
+
+| Enables `IPSec XFRM cryptography-offload accelaration <http://www.mellanox.com/related-docs/prod_software/Mellanox_Innova_IPsec_Ethernet_Adapter_Card_User_Manual.pdf>`_.
+
+**CONFIG_MLX5_EN_TLS=(y/n)**
+
+| TLS cryptography-offload accelaration.
+
+
+**CONFIG_MLX5_INFINIBAND=(y/n/m)** (module mlx5_ib.ko)
+
+| Provides low-level InfiniBand/RDMA and `RoCE <https://community.mellanox.com/s/article/recommended-network-configuration-examples-for-roce-deployment>`_ support.
+
+
+**External options** ( Choose if the corresponding mlx5 feature is required )
+
+- CONFIG_PTP_1588_CLOCK: When chosen, mlx5 ptp support will be enabled
+- CONFIG_VXLAN: When chosen, mlx5 vxaln support will be enabled.
+- CONFIG_MLXFW: When chosen, mlx5 firmware flashing support will be enabled (via devlink and ethtool).
+
+
+Devlink health reporters
+========================
+
+tx reporter
+-----------
+The tx reporter is responsible of two error scenarios:
+
+- TX timeout
+ Report on kernel tx timeout detection.
+ Recover by searching lost interrupts.
+- TX error completion
+ Report on error tx completion.
+ Recover by flushing the TX queue and reset it.
+
+TX reporter also support Diagnose callback, on which it provides
+real time information of its send queues status.
+
+User commands examples:
+
+- Diagnose send queues status::
+
+ $ devlink health diagnose pci/0000:82:00.0 reporter tx
+
+- Show number of tx errors indicated, number of recover flows ended successfully,
+ is autorecover enabled and graceful period from last recover::
+
+ $ devlink health show pci/0000:82:00.0 reporter tx
+
+fw reporter
+-----------
+The fw reporter implements diagnose and dump callbacks.
+It follows symptoms of fw error such as fw syndrome by triggering
+fw core dump and storing it into the dump buffer.
+The fw reporter diagnose command can be triggered any time by the user to check
+current fw status.
+
+User commands examples:
+
+- Check fw heath status::
+
+ $ devlink health diagnose pci/0000:82:00.0 reporter fw
+
+- Read FW core dump if already stored or trigger new one::
+
+ $ devlink health dump show pci/0000:82:00.0 reporter fw
+
+NOTE: This command can run only on the PF which has fw tracer ownership,
+running it on other PF or any VF will return "Operation not permitted".
+
+fw fatal reporter
+-----------------
+The fw fatal reporter implements dump and recover callbacks.
+It follows fatal errors indications by CR-space dump and recover flow.
+The CR-space dump uses vsc interface which is valid even if the FW command
+interface is not functional, which is the case in most FW fatal errors.
+The recover function runs recover flow which reloads the driver and triggers fw
+reset if needed.
+
+User commands examples:
+
+- Run fw recover flow manually::
+
+ $ devlink health recover pci/0000:82:00.0 reporter fw_fatal
+
+- Read FW CR-space dump if already strored or trigger new one::
+
+ $ devlink health dump show pci/0000:82:00.1 reporter fw_fatal
+
+NOTE: This command can run only on PF.
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index a73b3a02e49a..f0e6d1f53485 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -80,6 +80,7 @@ fib_multipath_hash_policy - INTEGER
Possible values:
0 - Layer 3
1 - Layer 4
+ 2 - Layer 3 or inner Layer 3 if present
fib_sync_mem - UNSIGNED INTEGER
Amount of dirty memory from fib entries that can be backlogged before
@@ -255,6 +256,14 @@ tcp_base_mss - INTEGER
Path MTU discovery (MTU probing). If MTU probing is enabled,
this is the initial MSS used by the connection.
+tcp_min_snd_mss - INTEGER
+ TCP SYN and SYNACK messages usually advertise an ADVMSS option,
+ as described in RFC 1122 and RFC 6691.
+ If this ADVMSS option is smaller than tcp_min_snd_mss,
+ it is silently capped to tcp_min_snd_mss.
+
+ Default : 48 (at least 8 bytes of payload per segment)
+
tcp_congestion_control - STRING
Set the congestion control algorithm to be used for new
connections. The algorithm "reno" is always available, but
@@ -792,6 +801,14 @@ tcp_challenge_ack_limit - INTEGER
in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks)
Default: 100
+tcp_rx_skb_cache - BOOLEAN
+ Controls a per TCP socket cache of one skb, that might help
+ performance of some workloads. This might be dangerous
+ on systems with a lot of TCP sockets, since it increases
+ memory usage.
+
+ Default: 0 (disabled)
+
UDP variables:
udp_l3mdev_accept - BOOLEAN
@@ -1429,14 +1446,26 @@ flowlabel_state_ranges - BOOLEAN
FALSE: disabled
Default: true
-flowlabel_reflect - BOOLEAN
- Automatically reflect the flow label. Needed for Path MTU
+flowlabel_reflect - INTEGER
+ Control flow label reflection. Needed for Path MTU
Discovery to work with Equal Cost Multipath Routing in anycast
environments. See RFC 7690 and:
https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01
- TRUE: enabled
- FALSE: disabled
- Default: FALSE
+
+ This is a bitmask.
+ 1: enabled for established flows
+
+ Note that this prevents automatic flowlabel changes, as done
+ in "tcp: change IPv6 flow-label upon receiving spurious retransmission"
+ and "tcp: Change txhash on every SYN and RTO retransmit"
+
+ 2: enabled for TCP RESET packets (no active listener)
+ If set, a RST packet sent in response to a SYN packet on a closed
+ port will reflect the incoming flow label.
+
+ 4: enabled for ICMPv6 echo reply messages.
+
+ Default: 0
fib_multipath_hash_policy - INTEGER
Controls which hash policy to use for multipath routes.
diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst
index 0dd90d7df5ec..a689966bc4be 100644
--- a/Documentation/networking/phy.rst
+++ b/Documentation/networking/phy.rst
@@ -202,7 +202,8 @@ the PHY/controller, of which the PHY needs to be aware.
*interface* is a u32 which specifies the connection type used
between the controller and the PHY. Examples are GMII, MII,
-RGMII, and SGMII. For a full list, see include/linux/phy.h
+RGMII, and SGMII. See "PHY interface mode" below. For a full
+list, see include/linux/phy.h
Now just make sure that phydev->supported and phydev->advertising have any
values pruned from them which don't make sense for your controller (a 10/100
@@ -225,6 +226,48 @@ When you want to disconnect from the network (even if just briefly), you call
phy_stop(phydev). This function also stops the phylib state machine and
disables PHY interrupts.
+PHY interface modes
+===================
+
+The PHY interface mode supplied in the phy_connect() family of functions
+defines the initial operating mode of the PHY interface. This is not
+guaranteed to remain constant; there are PHYs which dynamically change
+their interface mode without software interaction depending on the
+negotiation results.
+
+Some of the interface modes are described below:
+
+``PHY_INTERFACE_MODE_1000BASEX``
+ This defines the 1000BASE-X single-lane serdes link as defined by the
+ 802.3 standard section 36. The link operates at a fixed bit rate of
+ 1.25Gbaud using a 10B/8B encoding scheme, resulting in an underlying
+ data rate of 1Gbps. Embedded in the data stream is a 16-bit control
+ word which is used to negotiate the duplex and pause modes with the
+ remote end. This does not include "up-clocked" variants such as 2.5Gbps
+ speeds (see below.)
+
+``PHY_INTERFACE_MODE_2500BASEX``
+ This defines a variant of 1000BASE-X which is clocked 2.5 times faster,
+ than the 802.3 standard giving a fixed bit rate of 3.125Gbaud.
+
+``PHY_INTERFACE_MODE_SGMII``
+ This is used for Cisco SGMII, which is a modification of 1000BASE-X
+ as defined by the 802.3 standard. The SGMII link consists of a single
+ serdes lane running at a fixed bit rate of 1.25Gbaud with 10B/8B
+ encoding. The underlying data rate is 1Gbps, with the slower speeds of
+ 100Mbps and 10Mbps being achieved through replication of each data symbol.
+ The 802.3 control word is re-purposed to send the negotiated speed and
+ duplex information from to the MAC, and for the MAC to acknowledge
+ receipt. This does not include "up-clocked" variants such as 2.5Gbps
+ speeds.
+
+ Note: mismatched SGMII vs 1000BASE-X configuration on a link can
+ successfully pass data in some circumstances, but the 16-bit control
+ word will not be correctly interpreted, which may cause mismatches in
+ duplex, pause or other settings. This is dependent on the MAC and/or
+ PHY behaviour.
+
+
Pause frames / flow control
===========================
diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt
index 0235ae69af2a..f2a0147c933d 100644
--- a/Documentation/networking/rds.txt
+++ b/Documentation/networking/rds.txt
@@ -389,7 +389,7 @@ Multipath RDS (mprds)
a common (to all paths) part, and a per-path struct rds_conn_path. All
I/O workqs and reconnect threads are driven from the rds_conn_path.
Transports such as TCP that are multipath capable may then set up a
- TPC socket per rds_conn_path, and this is managed by the transport via
+ TCP socket per rds_conn_path, and this is managed by the transport via
the transport privatee cp_transport_data pointer.
Transports announce themselves as multipath capable by setting the
diff --git a/Documentation/networking/tls-offload.rst b/Documentation/networking/tls-offload.rst
index eb7c9b81ccf5..048e5ca44824 100644
--- a/Documentation/networking/tls-offload.rst
+++ b/Documentation/networking/tls-offload.rst
@@ -206,7 +206,11 @@ TX
Segments transmitted from an offloaded socket can get out of sync
in similar ways to the receive side-retransmissions - local drops
-are possible, though network reorders are not.
+are possible, though network reorders are not. There are currently
+two mechanisms for dealing with out of order segments.
+
+Crypto state rebuilding
+~~~~~~~~~~~~~~~~~~~~~~~
Whenever an out of order segment is transmitted the driver provides
the device with enough information to perform cryptographic operations.
@@ -225,6 +229,35 @@ was just a retransmission. The former is simpler, and does not require
retransmission detection therefore it is the recommended method until
such time it is proven inefficient.
+Next record sync
+~~~~~~~~~~~~~~~~
+
+Whenever an out of order segment is detected the driver requests
+that the ``ktls`` software fallback code encrypt it. If the segment's
+sequence number is lower than expected the driver assumes retransmission
+and doesn't change device state. If the segment is in the future, it
+may imply a local drop, the driver asks the stack to sync the device
+to the next record state and falls back to software.
+
+Resync request is indicated with:
+
+.. code-block:: c
+
+ void tls_offload_tx_resync_request(struct sock *sk, u32 got_seq, u32 exp_seq)
+
+Until resync is complete driver should not access its expected TCP
+sequence number (as it will be updated from a different context).
+Following helper should be used to test if resync is complete:
+
+.. code-block:: c
+
+ bool tls_offload_tx_resync_pending(struct sock *sk)
+
+Next time ``ktls`` pushes a record it will first send its TCP sequence number
+and TLS record number to the driver. Stack will also make sure that
+the new record will start on a segment boundary (like it does when
+the connection is initially added).
+
RX
--
@@ -268,6 +301,9 @@ Device can only detect that segment 4 also contains a TLS header
if it knows the length of the previous record from segment 2. In this case
the device will lose synchronization with the stream.
+Stream scan resynchronization
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
When the device gets out of sync and the stream reaches TCP sequence
numbers more than a max size record past the expected TCP sequence number,
the device starts scanning for a known header pattern. For example
@@ -298,6 +334,22 @@ Special care has to be taken if the confirmation request is passed
asynchronously to the packet stream and record may get processed
by the kernel before the confirmation request.
+Stack-driven resynchronization
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The driver may also request the stack to perform resynchronization
+whenever it sees the records are no longer getting decrypted.
+If the connection is configured in this mode the stack automatically
+schedules resynchronization after it has received two completely encrypted
+records.
+
+The stack waits for the socket to drain and informs the device about
+the next expected record number and its TCP sequence number. If the
+records continue to be received fully encrypted stack retries the
+synchronization with an exponential back off (first after 2 encrypted
+records, then after 4 records, after 8, after 16... up until every
+128 records).
+
Error handling
==============