aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/devicetree/bindings/net/dsa/ksz.txt102
-rw-r--r--Documentation/devicetree/bindings/net/qcom,ethqos.txt64
-rw-r--r--Documentation/devicetree/bindings/ptp/ptp-qoriq.txt2
-rw-r--r--Documentation/networking/dsa/dsa.txt13
-rw-r--r--Documentation/networking/index.rst1
-rw-r--r--Documentation/networking/phy.rst447
-rw-r--r--Documentation/networking/phy.txt427
-rw-r--r--Documentation/networking/snmp_counter.rst111
-rw-r--r--Documentation/networking/switchdev.txt2
-rw-r--r--Documentation/sysctl/net.txt14
10 files changed, 666 insertions, 517 deletions
diff --git a/Documentation/devicetree/bindings/net/dsa/ksz.txt b/Documentation/devicetree/bindings/net/dsa/ksz.txt
index 0f407fb371ce..8d58c2a7de39 100644
--- a/Documentation/devicetree/bindings/net/dsa/ksz.txt
+++ b/Documentation/devicetree/bindings/net/dsa/ksz.txt
@@ -19,58 +19,58 @@ Examples:
Ethernet switch connected via SPI to the host, CPU port wired to eth0:
- eth0: ethernet@10001000 {
- fixed-link {
- speed = <1000>;
- full-duplex;
- };
- };
+ eth0: ethernet@10001000 {
+ fixed-link {
+ speed = <1000>;
+ full-duplex;
+ };
+ };
- spi1: spi@f8008000 {
- pinctrl-0 = <&pinctrl_spi_ksz>;
- cs-gpios = <&pioC 25 0>;
- id = <1>;
+ spi1: spi@f8008000 {
+ pinctrl-0 = <&pinctrl_spi_ksz>;
+ cs-gpios = <&pioC 25 0>;
+ id = <1>;
- ksz9477: ksz9477@0 {
- compatible = "microchip,ksz9477";
- reg = <0>;
+ ksz9477: ksz9477@0 {
+ compatible = "microchip,ksz9477";
+ reg = <0>;
- spi-max-frequency = <44000000>;
- spi-cpha;
- spi-cpol;
+ spi-max-frequency = <44000000>;
+ spi-cpha;
+ spi-cpol;
- ports {
- #address-cells = <1>;
- #size-cells = <0>;
- port@0 {
- reg = <0>;
- label = "lan1";
- };
- port@1 {
- reg = <1>;
- label = "lan2";
- };
- port@2 {
- reg = <2>;
- label = "lan3";
- };
- port@3 {
- reg = <3>;
- label = "lan4";
- };
- port@4 {
- reg = <4>;
- label = "lan5";
- };
- port@5 {
- reg = <5>;
- label = "cpu";
- ethernet = <&eth0>;
- fixed-link {
- speed = <1000>;
- full-duplex;
- };
- };
- };
- };
- };
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ port@0 {
+ reg = <0>;
+ label = "lan1";
+ };
+ port@1 {
+ reg = <1>;
+ label = "lan2";
+ };
+ port@2 {
+ reg = <2>;
+ label = "lan3";
+ };
+ port@3 {
+ reg = <3>;
+ label = "lan4";
+ };
+ port@4 {
+ reg = <4>;
+ label = "lan5";
+ };
+ port@5 {
+ reg = <5>;
+ label = "cpu";
+ ethernet = <&eth0>;
+ fixed-link {
+ speed = <1000>;
+ full-duplex;
+ };
+ };
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/net/qcom,ethqos.txt b/Documentation/devicetree/bindings/net/qcom,ethqos.txt
new file mode 100644
index 000000000000..fcf5035810b5
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/qcom,ethqos.txt
@@ -0,0 +1,64 @@
+Qualcomm Ethernet ETHQOS device
+
+This documents dwmmac based ethernet device which supports Gigabit
+ethernet for version v2.3.0 onwards.
+
+This device has following properties:
+
+Required properties:
+
+- compatible: Should be qcom,qcs404-ethqos"
+
+- reg: Address and length of the register set for the device
+
+- reg-names: Should contain register names "stmmaceth", "rgmii"
+
+- clocks: Should contain phandle to clocks
+
+- clock-names: Should contain clock names "stmmaceth", "pclk",
+ "ptp_ref", "rgmii"
+
+- interrupts: Should contain phandle to interrupts
+
+- interrupt-names: Should contain interrupt names "macirq", "eth_lpi"
+
+Rest of the properties are defined in stmmac.txt file in same directory
+
+
+Example:
+
+ethernet: ethernet@7a80000 {
+ compatible = "qcom,qcs404-ethqos";
+ reg = <0x07a80000 0x10000>,
+ <0x07a96000 0x100>;
+ reg-names = "stmmaceth", "rgmii";
+ clock-names = "stmmaceth", "pclk", "ptp_ref", "rgmii";
+ clocks = <&gcc GCC_ETH_AXI_CLK>,
+ <&gcc GCC_ETH_SLAVE_AHB_CLK>,
+ <&gcc GCC_ETH_PTP_CLK>,
+ <&gcc GCC_ETH_RGMII_CLK>;
+ interrupts = <GIC_SPI 56 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 55 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-names = "macirq", "eth_lpi";
+ snps,reset-gpio = <&tlmm 60 GPIO_ACTIVE_LOW>;
+ snps,reset-active-low;
+
+ snps,txpbl = <8>;
+ snps,rxpbl = <2>;
+ snps,aal;
+ snps,tso;
+
+ phy-handle = <&phy1>;
+ phy-mode = "rgmii";
+
+ mdio {
+ #address-cells = <0x1>;
+ #size-cells = <0x0>;
+ compatible = "snps,dwmac-mdio";
+ phy1: phy@4 {
+ device_type = "ethernet-phy";
+ reg = <0x4>;
+ };
+ };
+
+};
diff --git a/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt b/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt
index c5d0e7998e2b..8e7f8551d190 100644
--- a/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt
+++ b/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt
@@ -17,6 +17,8 @@ Clock Properties:
- fsl,tmr-fiper1 Fixed interval period pulse generator.
- fsl,tmr-fiper2 Fixed interval period pulse generator.
- fsl,max-adj Maximum frequency adjustment in parts per billion.
+ - fsl,extts-fifo The presence of this property indicates hardware
+ support for the external trigger stamp FIFO.
These properties set the operational parameters for the PTP
clock. You must choose these carefully for the clock to work right.
diff --git a/Documentation/networking/dsa/dsa.txt b/Documentation/networking/dsa/dsa.txt
index 25170ad7d25b..1000b821681c 100644
--- a/Documentation/networking/dsa/dsa.txt
+++ b/Documentation/networking/dsa/dsa.txt
@@ -236,19 +236,6 @@ description.
Design limitations
==================
-DSA is a platform device driver
--------------------------------
-
-DSA is implemented as a DSA platform device driver which is convenient because
-it will register the entire DSA switch tree attached to a master network device
-in one-shot, facilitating the device creation and simplifying the device driver
-model a bit, this comes however with a number of limitations:
-
-- building DSA and its switch drivers as modules is currently not working
-- the device driver parenting does not necessarily reflect the original
- bus/device the switch can be created from
-- supporting non-MDIO and non-MMIO (platform) switches is not possible
-
Limits on the number of devices and ports
-----------------------------------------
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 59e86de662cd..f1627ca2a0ea 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -29,6 +29,7 @@ Contents:
msg_zerocopy
failover
net_failover
+ phy
alias
bridge
snmp_counter
diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst
new file mode 100644
index 000000000000..0dd90d7df5ec
--- /dev/null
+++ b/Documentation/networking/phy.rst
@@ -0,0 +1,447 @@
+=====================
+PHY Abstraction Layer
+=====================
+
+Purpose
+=======
+
+Most network devices consist of set of registers which provide an interface
+to a MAC layer, which communicates with the physical connection through a
+PHY. The PHY concerns itself with negotiating link parameters with the link
+partner on the other side of the network connection (typically, an ethernet
+cable), and provides a register interface to allow drivers to determine what
+settings were chosen, and to configure what settings are allowed.
+
+While these devices are distinct from the network devices, and conform to a
+standard layout for the registers, it has been common practice to integrate
+the PHY management code with the network driver. This has resulted in large
+amounts of redundant code. Also, on embedded systems with multiple (and
+sometimes quite different) ethernet controllers connected to the same
+management bus, it is difficult to ensure safe use of the bus.
+
+Since the PHYs are devices, and the management busses through which they are
+accessed are, in fact, busses, the PHY Abstraction Layer treats them as such.
+In doing so, it has these goals:
+
+#. Increase code-reuse
+#. Increase overall code-maintainability
+#. Speed development time for new network drivers, and for new systems
+
+Basically, this layer is meant to provide an interface to PHY devices which
+allows network driver writers to write as little code as possible, while
+still providing a full feature set.
+
+The MDIO bus
+============
+
+Most network devices are connected to a PHY by means of a management bus.
+Different devices use different busses (though some share common interfaces).
+In order to take advantage of the PAL, each bus interface needs to be
+registered as a distinct device.
+
+#. read and write functions must be implemented. Their prototypes are::
+
+ int write(struct mii_bus *bus, int mii_id, int regnum, u16 value);
+ int read(struct mii_bus *bus, int mii_id, int regnum);
+
+ mii_id is the address on the bus for the PHY, and regnum is the register
+ number. These functions are guaranteed not to be called from interrupt
+ time, so it is safe for them to block, waiting for an interrupt to signal
+ the operation is complete
+
+#. A reset function is optional. This is used to return the bus to an
+ initialized state.
+
+#. A probe function is needed. This function should set up anything the bus
+ driver needs, setup the mii_bus structure, and register with the PAL using
+ mdiobus_register. Similarly, there's a remove function to undo all of
+ that (use mdiobus_unregister).
+
+#. Like any driver, the device_driver structure must be configured, and init
+ exit functions are used to register the driver.
+
+#. The bus must also be declared somewhere as a device, and registered.
+
+As an example for how one driver implemented an mdio bus driver, see
+drivers/net/ethernet/freescale/fsl_pq_mdio.c and an associated DTS file
+for one of the users. (e.g. "git grep fsl,.*-mdio arch/powerpc/boot/dts/")
+
+(RG)MII/electrical interface considerations
+===========================================
+
+The Reduced Gigabit Medium Independent Interface (RGMII) is a 12-pin
+electrical signal interface using a synchronous 125Mhz clock signal and several
+data lines. Due to this design decision, a 1.5ns to 2ns delay must be added
+between the clock line (RXC or TXC) and the data lines to let the PHY (clock
+sink) have enough setup and hold times to sample the data lines correctly. The
+PHY library offers different types of PHY_INTERFACE_MODE_RGMII* values to let
+the PHY driver and optionally the MAC driver, implement the required delay. The
+values of phy_interface_t must be understood from the perspective of the PHY
+device itself, leading to the following:
+
+* PHY_INTERFACE_MODE_RGMII: the PHY is not responsible for inserting any
+ internal delay by itself, it assumes that either the Ethernet MAC (if capable
+ or the PCB traces) insert the correct 1.5-2ns delay
+
+* PHY_INTERFACE_MODE_RGMII_TXID: the PHY should insert an internal delay
+ for the transmit data lines (TXD[3:0]) processed by the PHY device
+
+* PHY_INTERFACE_MODE_RGMII_RXID: the PHY should insert an internal delay
+ for the receive data lines (RXD[3:0]) processed by the PHY device
+
+* PHY_INTERFACE_MODE_RGMII_ID: the PHY should insert internal delays for
+ both transmit AND receive data lines from/to the PHY device
+
+Whenever possible, use the PHY side RGMII delay for these reasons:
+
+* PHY devices may offer sub-nanosecond granularity in how they allow a
+ receiver/transmitter side delay (e.g: 0.5, 1.0, 1.5ns) to be specified. Such
+ precision may be required to account for differences in PCB trace lengths
+
+* PHY devices are typically qualified for a large range of applications
+ (industrial, medical, automotive...), and they provide a constant and
+ reliable delay across temperature/pressure/voltage ranges
+
+* PHY device drivers in PHYLIB being reusable by nature, being able to
+ configure correctly a specified delay enables more designs with similar delay
+ requirements to be operate correctly
+
+For cases where the PHY is not capable of providing this delay, but the
+Ethernet MAC driver is capable of doing so, the correct phy_interface_t value
+should be PHY_INTERFACE_MODE_RGMII, and the Ethernet MAC driver should be
+configured correctly in order to provide the required transmit and/or receive
+side delay from the perspective of the PHY device. Conversely, if the Ethernet
+MAC driver looks at the phy_interface_t value, for any other mode but
+PHY_INTERFACE_MODE_RGMII, it should make sure that the MAC-level delays are
+disabled.
+
+In case neither the Ethernet MAC, nor the PHY are capable of providing the
+required delays, as defined per the RGMII standard, several options may be
+available:
+
+* Some SoCs may offer a pin pad/mux/controller capable of configuring a given
+ set of pins'strength, delays, and voltage; and it may be a suitable
+ option to insert the expected 2ns RGMII delay.
+
+* Modifying the PCB design to include a fixed delay (e.g: using a specifically
+ designed serpentine), which may not require software configuration at all.
+
+Common problems with RGMII delay mismatch
+-----------------------------------------
+
+When there is a RGMII delay mismatch between the Ethernet MAC and the PHY, this
+will most likely result in the clock and data line signals to be unstable when
+the PHY or MAC take a snapshot of these signals to translate them into logical
+1 or 0 states and reconstruct the data being transmitted/received. Typical
+symptoms include:
+
+* Transmission/reception partially works, and there is frequent or occasional
+ packet loss observed
+
+* Ethernet MAC may report some or all packets ingressing with a FCS/CRC error,
+ or just discard them all
+
+* Switching to lower speeds such as 10/100Mbits/sec makes the problem go away
+ (since there is enough setup/hold time in that case)
+
+Connecting to a PHY
+===================
+
+Sometime during startup, the network driver needs to establish a connection
+between the PHY device, and the network device. At this time, the PHY's bus
+and drivers need to all have been loaded, so it is ready for the connection.
+At this point, there are several ways to connect to the PHY:
+
+#. The PAL handles everything, and only calls the network driver when
+ the link state changes, so it can react.
+
+#. The PAL handles everything except interrupts (usually because the
+ controller has the interrupt registers).
+
+#. The PAL handles everything, but checks in with the driver every second,
+ allowing the network driver to react first to any changes before the PAL
+ does.
+
+#. The PAL serves only as a library of functions, with the network device
+ manually calling functions to update status, and configure the PHY
+
+
+Letting the PHY Abstraction Layer do Everything
+===============================================
+
+If you choose option 1 (The hope is that every driver can, but to still be
+useful to drivers that can't), connecting to the PHY is simple:
+
+First, you need a function to react to changes in the link state. This
+function follows this protocol::
+
+ static void adjust_link(struct net_device *dev);
+
+Next, you need to know the device name of the PHY connected to this device.
+The name will look something like, "0:00", where the first number is the
+bus id, and the second is the PHY's address on that bus. Typically,
+the bus is responsible for making its ID unique.
+
+Now, to connect, just call this function::
+
+ phydev = phy_connect(dev, phy_name, &adjust_link, interface);
+
+*phydev* is a pointer to the phy_device structure which represents the PHY.
+If phy_connect is successful, it will return the pointer. dev, here, is the
+pointer to your net_device. Once done, this function will have started the
+PHY's software state machine, and registered for the PHY's interrupt, if it
+has one. The phydev structure will be populated with information about the
+current state, though the PHY will not yet be truly operational at this
+point.
+
+PHY-specific flags should be set in phydev->dev_flags prior to the call
+to phy_connect() such that the underlying PHY driver can check for flags
+and perform specific operations based on them.
+This is useful if the system has put hardware restrictions on
+the PHY/controller, of which the PHY needs to be aware.
+
+*interface* is a u32 which specifies the connection type used
+between the controller and the PHY. Examples are GMII, MII,
+RGMII, and SGMII. For a full list, see include/linux/phy.h
+
+Now just make sure that phydev->supported and phydev->advertising have any
+values pruned from them which don't make sense for your controller (a 10/100
+controller may be connected to a gigabit capable PHY, so you would need to
+mask off SUPPORTED_1000baseT*). See include/linux/ethtool.h for definitions
+for these bitfields. Note that you should not SET any bits, except the
+SUPPORTED_Pause and SUPPORTED_AsymPause bits (see below), or the PHY may get
+put into an unsupported state.
+
+Lastly, once the controller is ready to handle network traffic, you call
+phy_start(phydev). This tells the PAL that you are ready, and configures the
+PHY to connect to the network. If the MAC interrupt of your network driver
+also handles PHY status changes, just set phydev->irq to PHY_IGNORE_INTERRUPT
+before you call phy_start and use phy_mac_interrupt() from the network
+driver. If you don't want to use interrupts, set phydev->irq to PHY_POLL.
+phy_start() enables the PHY interrupts (if applicable) and starts the
+phylib state machine.
+
+When you want to disconnect from the network (even if just briefly), you call
+phy_stop(phydev). This function also stops the phylib state machine and
+disables PHY interrupts.
+
+Pause frames / flow control
+===========================
+
+The PHY does not participate directly in flow control/pause frames except by
+making sure that the SUPPORTED_Pause and SUPPORTED_AsymPause bits are set in
+MII_ADVERTISE to indicate towards the link partner that the Ethernet MAC
+controller supports such a thing. Since flow control/pause frames generation
+involves the Ethernet MAC driver, it is recommended that this driver takes care
+of properly indicating advertisement and support for such features by setting
+the SUPPORTED_Pause and SUPPORTED_AsymPause bits accordingly. This can be done
+either before or after phy_connect() and/or as a result of implementing the
+ethtool::set_pauseparam feature.
+
+
+Keeping Close Tabs on the PAL
+=============================
+
+It is possible that the PAL's built-in state machine needs a little help to
+keep your network device and the PHY properly in sync. If so, you can
+register a helper function when connecting to the PHY, which will be called
+every second before the state machine reacts to any changes. To do this, you
+need to manually call phy_attach() and phy_prepare_link(), and then call
+phy_start_machine() with the second argument set to point to your special
+handler.
+
+Currently there are no examples of how to use this functionality, and testing
+on it has been limited because the author does not have any drivers which use
+it (they all use option 1). So Caveat Emptor.
+
+Doing it all yourself
+=====================
+
+There's a remote chance that the PAL's built-in state machine cannot track
+the complex interactions between the PHY and your network device. If this is
+so, you can simply call phy_attach(), and not call phy_start_machine or
+phy_prepare_link(). This will mean that phydev->state is entirely yours to
+handle (phy_start and phy_stop toggle between some of the states, so you
+might need to avoid them).
+
+An effort has been made to make sure that useful functionality can be
+accessed without the state-machine running, and most of these functions are
+descended from functions which did not interact with a complex state-machine.
+However, again, no effort has been made so far to test running without the
+state machine, so tryer beware.
+
+Here is a brief rundown of the functions::
+
+ int phy_read(struct phy_device *phydev, u16 regnum);
+ int phy_write(struct phy_device *phydev, u16 regnum, u16 val);
+
+Simple read/write primitives. They invoke the bus's read/write function
+pointers.
+::
+
+ void phy_print_status(struct phy_device *phydev);
+
+A convenience function to print out the PHY status neatly.
+::
+
+ void phy_request_interrupt(struct phy_device *phydev);
+
+Requests the IRQ for the PHY interrupts.
+::
+
+ struct phy_device * phy_attach(struct net_device *dev, const char *phy_id,
+ phy_interface_t interface);
+
+Attaches a network device to a particular PHY, binding the PHY to a generic
+driver if none was found during bus initialization.
+::
+
+ int phy_start_aneg(struct phy_device *phydev);
+
+Using variables inside the phydev structure, either configures advertising
+and resets autonegotiation, or disables autonegotiation, and configures
+forced settings.
+::
+
+ static inline int phy_read_status(struct phy_device *phydev);
+
+Fills the phydev structure with up-to-date information about the current
+settings in the PHY.
+::
+
+ int phy_ethtool_sset(struct phy_device *phydev, struct ethtool_cmd *cmd);
+
+Ethtool convenience functions.
+::
+
+ int phy_mii_ioctl(struct phy_device *phydev,
+ struct mii_ioctl_data *mii_data, int cmd);
+
+The MII ioctl. Note that this function will completely screw up the state
+machine if you write registers like BMCR, BMSR, ADVERTISE, etc. Best to
+use this only to write registers which are not standard, and don't set off
+a renegotiation.
+
+PHY Device Drivers
+==================
+
+With the PHY Abstraction Layer, adding support for new PHYs is
+quite easy. In some cases, no work is required at all! However,
+many PHYs require a little hand-holding to get up-and-running.
+
+Generic PHY driver
+------------------
+
+If the desired PHY doesn't have any errata, quirks, or special
+features you want to support, then it may be best to not add
+support, and let the PHY Abstraction Layer's Generic PHY Driver
+do all of the work.
+
+Writing a PHY driver
+--------------------
+
+If you do need to write a PHY driver, the first thing to do is
+make sure it can be matched with an appropriate PHY device.
+This is done during bus initialization by reading the device's
+UID (stored in registers 2 and 3), then comparing it to each
+driver's phy_id field by ANDing it with each driver's
+phy_id_mask field. Also, it needs a name. Here's an example::
+
+ static struct phy_driver dm9161_driver = {
+ .phy_id = 0x0181b880,
+ .name = "Davicom DM9161E",
+ .phy_id_mask = 0x0ffffff0,
+ ...
+ }
+
+Next, you need to specify what features (speed, duplex, autoneg,
+etc) your PHY device and driver support. Most PHYs support
+PHY_BASIC_FEATURES, but you can look in include/mii.h for other
+features.
+
+Each driver consists of a number of function pointers, documented
+in include/linux/phy.h under the phy_driver structure.
+
+Of these, only config_aneg and read_status are required to be
+assigned by the driver code. The rest are optional. Also, it is
+preferred to use the generic phy driver's versions of these two
+functions if at all possible: genphy_read_status and
+genphy_config_aneg. If this is not possible, it is likely that
+you only need to perform some actions before and after invoking
+these functions, and so your functions will wrap the generic
+ones.
+
+Feel free to look at the Marvell, Cicada, and Davicom drivers in
+drivers/net/phy/ for examples (the lxt and qsemi drivers have
+not been tested as of this writing).
+
+The PHY's MMD register accesses are handled by the PAL framework
+by default, but can be overridden by a specific PHY driver if
+required. This could be the case if a PHY was released for
+manufacturing before the MMD PHY register definitions were
+standardized by the IEEE. Most modern PHYs will be able to use
+the generic PAL framework for accessing the PHY's MMD registers.
+An example of such usage is for Energy Efficient Ethernet support,
+implemented in the PAL. This support uses the PAL to access MMD
+registers for EEE query and configuration if the PHY supports
+the IEEE standard access mechanisms, or can use the PHY's specific
+access interfaces if overridden by the specific PHY driver. See
+the Micrel driver in drivers/net/phy/ for an example of how this
+can be implemented.
+
+Board Fixups
+============
+
+Sometimes the specific interaction between the platform and the PHY requires
+special handling. For instance, to change where the PHY's clock input is,
+or to add a delay to account for latency issues in the data path. In order
+to support such contingencies, the PHY Layer allows platform code to register
+fixups to be run when the PHY is brought up (or subsequently reset).
+
+When the PHY Layer brings up a PHY it checks to see if there are any fixups
+registered for it, matching based on UID (contained in the PHY device's phy_id
+field) and the bus identifier (contained in phydev->dev.bus_id). Both must
+match, however two constants, PHY_ANY_ID and PHY_ANY_UID, are provided as
+wildcards for the bus ID and UID, respectively.
+
+When a match is found, the PHY layer will invoke the run function associated
+with the fixup. This function is passed a pointer to the phy_device of
+interest. It should therefore only operate on that PHY.
+
+The platform code can either register the fixup using phy_register_fixup()::
+
+ int phy_register_fixup(const char *phy_id,
+ u32 phy_uid, u32 phy_uid_mask,
+ int (*run)(struct phy_device *));
+
+Or using one of the two stubs, phy_register_fixup_for_uid() and
+phy_register_fixup_for_id()::
+
+ int phy_register_fixup_for_uid(u32 phy_uid, u32 phy_uid_mask,
+ int (*run)(struct phy_device *));
+ int phy_register_fixup_for_id(const char *phy_id,
+ int (*run)(struct phy_device *));
+
+The stubs set one of the two matching criteria, and set the other one to
+match anything.
+
+When phy_register_fixup() or \*_for_uid()/\*_for_id() is called at module,
+unregister fixup and free allocate memory are required.
+
+Call one of following function before unloading module::
+
+ int phy_unregister_fixup(const char *phy_id, u32 phy_uid, u32 phy_uid_mask);
+ int phy_unregister_fixup_for_uid(u32 phy_uid, u32 phy_uid_mask);
+ int phy_register_fixup_for_id(const char *phy_id);
+
+Standards
+=========
+
+IEEE Standard 802.3: CSMA/CD Access Method and Physical Layer Specifications, Section Two:
+http://standards.ieee.org/getieee802/download/802.3-2008_section2.pdf
+
+RGMII v1.3:
+http://web.archive.org/web/20160303212629/http://www.hp.com/rnd/pdfs/RGMIIv1_3.pdf
+
+RGMII v2.0:
+http://web.archive.org/web/20160303171328/http://www.hp.com/rnd/pdfs/RGMIIv2_0_final_hp.pdf
diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt
deleted file mode 100644
index bdec0f700bc1..000000000000
--- a/Documentation/networking/phy.txt
+++ /dev/null
@@ -1,427 +0,0 @@
-
--------
-PHY Abstraction Layer
-(Updated 2008-04-08)
-
-Purpose
-
- Most network devices consist of set of registers which provide an interface
- to a MAC layer, which communicates with the physical connection through a
- PHY. The PHY concerns itself with negotiating link parameters with the link
- partner on the other side of the network connection (typically, an ethernet
- cable), and provides a register interface to allow drivers to determine what
- settings were chosen, and to configure what settings are allowed.
-
- While these devices are distinct from the network devices, and conform to a
- standard layout for the registers, it has been common practice to integrate
- the PHY management code with the network driver. This has resulted in large
- amounts of redundant code. Also, on embedded systems with multiple (and
- sometimes quite different) ethernet controllers connected to the same
- management bus, it is difficult to ensure safe use of the bus.
-
- Since the PHYs are devices, and the management busses through which they are
- accessed are, in fact, busses, the PHY Abstraction Layer treats them as such.
- In doing so, it has these goals:
-
- 1) Increase code-reuse
- 2) Increase overall code-maintainability
- 3) Speed development time for new network drivers, and for new systems
-
- Basically, this layer is meant to provide an interface to PHY devices which
- allows network driver writers to write as little code as possible, while
- still providing a full feature set.
-
-The MDIO bus
-
- Most network devices are connected to a PHY by means of a management bus.
- Different devices use different busses (though some share common interfaces).
- In order to take advantage of the PAL, each bus interface needs to be
- registered as a distinct device.
-
- 1) read and write functions must be implemented. Their prototypes are:
-
- int write(struct mii_bus *bus, int mii_id, int regnum, u16 value);
- int read(struct mii_bus *bus, int mii_id, int regnum);
-
- mii_id is the address on the bus for the PHY, and regnum is the register
- number. These functions are guaranteed not to be called from interrupt
- time, so it is safe for them to block, waiting for an interrupt to signal
- the operation is complete
-
- 2) A reset function is optional. This is used to return the bus to an
- initialized state.
-
- 3) A probe function is needed. This function should set up anything the bus
- driver needs, setup the mii_bus structure, and register with the PAL using
- mdiobus_register. Similarly, there's a remove function to undo all of
- that (use mdiobus_unregister).
-
- 4) Like any driver, the device_driver structure must be configured, and init
- exit functions are used to register the driver.
-
- 5) The bus must also be declared somewhere as a device, and registered.
-
- As an example for how one driver implemented an mdio bus driver, see
- drivers/net/ethernet/freescale/fsl_pq_mdio.c and an associated DTS file
- for one of the users. (e.g. "git grep fsl,.*-mdio arch/powerpc/boot/dts/")
-
-(RG)MII/electrical interface considerations
-
- The Reduced Gigabit Medium Independent Interface (RGMII) is a 12-pin
- electrical signal interface using a synchronous 125Mhz clock signal and several
- data lines. Due to this design decision, a 1.5ns to 2ns delay must be added
- between the clock line (RXC or TXC) and the data lines to let the PHY (clock
- sink) have enough setup and hold times to sample the data lines correctly. The
- PHY library offers different types of PHY_INTERFACE_MODE_RGMII* values to let
- the PHY driver and optionally the MAC driver, implement the required delay. The
- values of phy_interface_t must be understood from the perspective of the PHY
- device itself, leading to the following:
-
- * PHY_INTERFACE_MODE_RGMII: the PHY is not responsible for inserting any
- internal delay by itself, it assumes that either the Ethernet MAC (if capable
- or the PCB traces) insert the correct 1.5-2ns delay
-
- * PHY_INTERFACE_MODE_RGMII_TXID: the PHY should insert an internal delay
- for the transmit data lines (TXD[3:0]) processed by the PHY device
-
- * PHY_INTERFACE_MODE_RGMII_RXID: the PHY should insert an internal delay
- for the receive data lines (RXD[3:0]) processed by the PHY device
-
- * PHY_INTERFACE_MODE_RGMII_ID: the PHY should insert internal delays for
- both transmit AND receive data lines from/to the PHY device
-
- Whenever possible, use the PHY side RGMII delay for these reasons:
-
- * PHY devices may offer sub-nanosecond granularity in how they allow a
- receiver/transmitter side delay (e.g: 0.5, 1.0, 1.5ns) to be specified. Such
- precision may be required to account for differences in PCB trace lengths
-
- * PHY devices are typically qualified for a large range of applications
- (industrial, medical, automotive...), and they provide a constant and
- reliable delay across temperature/pressure/voltage ranges
-
- * PHY device drivers in PHYLIB being reusable by nature, being able to
- configure correctly a specified delay enables more designs with similar delay
- requirements to be operate correctly
-
- For cases where the PHY is not capable of providing this delay, but the
- Ethernet MAC driver is capable of doing so, the correct phy_interface_t value
- should be PHY_INTERFACE_MODE_RGMII, and the Ethernet MAC driver should be
- configured correctly in order to provide the required transmit and/or receive
- side delay from the perspective of the PHY device. Conversely, if the Ethernet
- MAC driver looks at the phy_interface_t value, for any other mode but
- PHY_INTERFACE_MODE_RGMII, it should make sure that the MAC-level delays are
- disabled.
-
- In case neither the Ethernet MAC, nor the PHY are capable of providing the
- required delays, as defined per the RGMII standard, several options may be
- available:
-
- * Some SoCs may offer a pin pad/mux/controller capable of configuring a given
- set of pins'strength, delays, and voltage; and it may be a suitable
- option to insert the expected 2ns RGMII delay.
-
- * Modifying the PCB design to include a fixed delay (e.g: using a specifically
- designed serpentine), which may not require software configuration at all.
-
-Common problems with RGMII delay mismatch
-
- When there is a RGMII delay mismatch between the Ethernet MAC and the PHY, this
- will most likely result in the clock and data line signals to be unstable when
- the PHY or MAC take a snapshot of these signals to translate them into logical
- 1 or 0 states and reconstruct the data being transmitted/received. Typical
- symptoms include:
-
- * Transmission/reception partially works, and there is frequent or occasional
- packet loss observed
-
- * Ethernet MAC may report some or all packets ingressing with a FCS/CRC error,
- or just discard them all
-
- * Switching to lower speeds such as 10/100Mbits/sec makes the problem go away
- (since there is enough setup/hold time in that case)
-
-
-Connecting to a PHY
-
- Sometime during startup, the network driver needs to establish a connection
- between the PHY device, and the network device. At this time, the PHY's bus
- and drivers need to all have been loaded, so it is ready for the connection.
- At this point, there are several ways to connect to the PHY:
-
- 1) The PAL handles everything, and only calls the network driver when
- the link state changes, so it can react.
-
- 2) The PAL handles everything except interrupts (usually because the
- controller has the interrupt registers).
-
- 3) The PAL handles everything, but checks in with the driver every second,
- allowing the network driver to react first to any changes before the PAL
- does.
-
- 4) The PAL serves only as a library of functions, with the network device
- manually calling functions to update status, and configure the PHY
-
-
-Letting the PHY Abstraction Layer do Everything
-
- If you choose option 1 (The hope is that every driver can, but to still be
- useful to drivers that can't), connecting to the PHY is simple:
-
- First, you need a function to react to changes in the link state. This
- function follows this protocol:
-
- static void adjust_link(struct net_device *dev);
-
- Next, you need to know the device name of the PHY connected to this device.
- The name will look something like, "0:00", where the first number is the
- bus id, and the second is the PHY's address on that bus. Typically,
- the bus is responsible for making its ID unique.
-
- Now, to connect, just call this function:
-
- phydev = phy_connect(dev, phy_name, &adjust_link, interface);
-
- phydev is a pointer to the phy_device structure which represents the PHY. If
- phy_connect is successful, it will return the pointer. dev, here, is the
- pointer to your net_device. Once done, this function will have started the
- PHY's software state machine, and registered for the PHY's interrupt, if it
- has one. The phydev structure will be populated with information about the
- current state, though the PHY will not yet be truly operational at this
- point.
-
- PHY-specific flags should be set in phydev->dev_flags prior to the call
- to phy_connect() such that the underlying PHY driver can check for flags
- and perform specific operations based on them.
- This is useful if the system has put hardware restrictions on
- the PHY/controller, of which the PHY needs to be aware.
-
- interface is a u32 which specifies the connection type used
- between the controller and the PHY. Examples are GMII, MII,
- RGMII, and SGMII. For a full list, see include/linux/phy.h
-
- Now just make sure that phydev->supported and phydev->advertising have any
- values pruned from them which don't make sense for your controller (a 10/100
- controller may be connected to a gigabit capable PHY, so you would need to
- mask off SUPPORTED_1000baseT*). See include/linux/ethtool.h for definitions
- for these bitfields. Note that you should not SET any bits, except the
- SUPPORTED_Pause and SUPPORTED_AsymPause bits (see below), or the PHY may get
- put into an unsupported state.
-
- Lastly, once the controller is ready to handle network traffic, you call
- phy_start(phydev). This tells the PAL that you are ready, and configures the
- PHY to connect to the network. If you want to handle your own interrupts,
- just set phydev->irq to PHY_IGNORE_INTERRUPT before you call phy_start.
- Similarly, if you don't want to use interrupts, set phydev->irq to PHY_POLL.
-
- When you want to disconnect from the network (even if just briefly), you call
- phy_stop(phydev).
-
-Pause frames / flow control
-
- The PHY does not participate directly in flow control/pause frames except by
- making sure that the SUPPORTED_Pause and SUPPORTED_AsymPause bits are set in
- MII_ADVERTISE to indicate towards the link partner that the Ethernet MAC
- controller supports such a thing. Since flow control/pause frames generation
- involves the Ethernet MAC driver, it is recommended that this driver takes care
- of properly indicating advertisement and support for such features by setting
- the SUPPORTED_Pause and SUPPORTED_AsymPause bits accordingly. This can be done
- either before or after phy_connect() and/or as a result of implementing the
- ethtool::set_pauseparam feature.
-
-
-Keeping Close Tabs on the PAL
-
- It is possible that the PAL's built-in state machine needs a little help to
- keep your network device and the PHY properly in sync. If so, you can
- register a helper function when connecting to the PHY, which will be called
- every second before the state machine reacts to any changes. To do this, you
- need to manually call phy_attach() and phy_prepare_link(), and then call
- phy_start_machine() with the second argument set to point to your special
- handler.
-
- Currently there are no examples of how to use this functionality, and testing
- on it has been limited because the author does not have any drivers which use
- it (they all use option 1). So Caveat Emptor.
-
-Doing it all yourself
-
- There's a remote chance that the PAL's built-in state machine cannot track
- the complex interactions between the PHY and your network device. If this is
- so, you can simply call phy_attach(), and not call phy_start_machine or
- phy_prepare_link(). This will mean that phydev->state is entirely yours to
- handle (phy_start and phy_stop toggle between some of the states, so you
- might need to avoid them).
-
- An effort has been made to make sure that useful functionality can be
- accessed without the state-machine running, and most of these functions are
- descended from functions which did not interact with a complex state-machine.
- However, again, no effort has been made so far to test running without the
- state machine, so tryer beware.
-
- Here is a brief rundown of the functions:
-
- int phy_read(struct phy_device *phydev, u16 regnum);
- int phy_write(struct phy_device *phydev, u16 regnum, u16 val);
-
- Simple read/write primitives. They invoke the bus's read/write function
- pointers.
-
- void phy_print_status(struct phy_device *phydev);
-
- A convenience function to print out the PHY status neatly.
-
- int phy_start_interrupts(struct phy_device *phydev);
- int phy_stop_interrupts(struct phy_device *phydev);
-
- Requests the IRQ for the PHY interrupts, then enables them for
- start, or disables then frees them for stop.
-
- struct phy_device * phy_attach(struct net_device *dev, const char *phy_id,
- phy_interface_t interface);
-
- Attaches a network device to a particular PHY, binding the PHY to a generic
- driver if none was found during bus initialization.
-
- int phy_start_aneg(struct phy_device *phydev);
-
- Using variables inside the phydev structure, either configures advertising
- and resets autonegotiation, or disables autonegotiation, and configures
- forced settings.
-
- static inline int phy_read_status(struct phy_device *phydev);
-
- Fills the phydev structure with up-to-date information about the current
- settings in the PHY.
-
- int phy_ethtool_sset(struct phy_device *phydev, struct ethtool_cmd *cmd);
-
- Ethtool convenience functions.
-
- int phy_mii_ioctl(struct phy_device *phydev,
- struct mii_ioctl_data *mii_data, int cmd);
-
- The MII ioctl. Note that this function will completely screw up the state
- machine if you write registers like BMCR, BMSR, ADVERTISE, etc. Best to
- use this only to write registers which are not standard, and don't set off
- a renegotiation.
-
-
-PHY Device Drivers
-
- With the PHY Abstraction Layer, adding support for new PHYs is
- quite easy. In some cases, no work is required at all! However,
- many PHYs require a little hand-holding to get up-and-running.
-
-Generic PHY driver
-
- If the desired PHY doesn't have any errata, quirks, or special
- features you want to support, then it may be best to not add
- support, and let the PHY Abstraction Layer's Generic PHY Driver
- do all of the work.
-
-Writing a PHY driver
-
- If you do need to write a PHY driver, the first thing to do is
- make sure it can be matched with an appropriate PHY device.
- This is done during bus initialization by reading the device's
- UID (stored in registers 2 and 3), then comparing it to each
- driver's phy_id field by ANDing it with each driver's
- phy_id_mask field. Also, it needs a name. Here's an example:
-
- static struct phy_driver dm9161_driver = {
- .phy_id = 0x0181b880,
- .name = "Davicom DM9161E",
- .phy_id_mask = 0x0ffffff0,
- ...
- }
-
- Next, you need to specify what features (speed, duplex, autoneg,
- etc) your PHY device and driver support. Most PHYs support
- PHY_BASIC_FEATURES, but you can look in include/mii.h for other
- features.
-
- Each driver consists of a number of function pointers, documented
- in include/linux/phy.h under the phy_driver structure.
-
- Of these, only config_aneg and read_status are required to be
- assigned by the driver code. The rest are optional. Also, it is
- preferred to use the generic phy driver's versions of these two
- functions if at all possible: genphy_read_status and
- genphy_config_aneg. If this is not possible, it is likely that
- you only need to perform some actions before and after invoking
- these functions, and so your functions will wrap the generic
- ones.
-
- Feel free to look at the Marvell, Cicada, and Davicom drivers in
- drivers/net/phy/ for examples (the lxt and qsemi drivers have
- not been tested as of this writing).
-
- The PHY's MMD register accesses are handled by the PAL framework
- by default, but can be overridden by a specific PHY driver if
- required. This could be the case if a PHY was released for
- manufacturing before the MMD PHY register definitions were
- standardized by the IEEE. Most modern PHYs will be able to use
- the generic PAL framework for accessing the PHY's MMD registers.
- An example of such usage is for Energy Efficient Ethernet support,
- implemented in the PAL. This support uses the PAL to access MMD
- registers for EEE query and configuration if the PHY supports
- the IEEE standard access mechanisms, or can use the PHY's specific
- access interfaces if overridden by the specific PHY driver. See
- the Micrel driver in drivers/net/phy/ for an example of how this
- can be implemented.
-
-Board Fixups
-
- Sometimes the specific interaction between the platform and the PHY requires
- special handling. For instance, to change where the PHY's clock input is,
- or to add a delay to account for latency issues in the data path. In order
- to support such contingencies, the PHY Layer allows platform code to register
- fixups to be run when the PHY is brought up (or subsequently reset).
-
- When the PHY Layer brings up a PHY it checks to see if there are any fixups
- registered for it, matching based on UID (contained in the PHY device's phy_id
- field) and the bus identifier (contained in phydev->dev.bus_id). Both must
- match, however two constants, PHY_ANY_ID and PHY_ANY_UID, are provided as
- wildcards for the bus ID and UID, respectively.
-
- When a match is found, the PHY layer will invoke the run function associated
- with the fixup. This function is passed a pointer to the phy_device of
- interest. It should therefore only operate on that PHY.
-
- The platform code can either register the fixup using phy_register_fixup():
-
- int phy_register_fixup(const char *phy_id,
- u32 phy_uid, u32 phy_uid_mask,
- int (*run)(struct phy_device *));
-
- Or using one of the two stubs, phy_register_fixup_for_uid() and
- phy_register_fixup_for_id():
-
- int phy_register_fixup_for_uid(u32 phy_uid, u32 phy_uid_mask,
- int (*run)(struct phy_device *));
- int phy_register_fixup_for_id(const char *phy_id,
- int (*run)(struct phy_device *));
-
- The stubs set one of the two matching criteria, and set the other one to
- match anything.
-
- When phy_register_fixup() or *_for_uid()/*_for_id() is called at module,
- unregister fixup and free allocate memory are required.
-
- Call one of following function before unloading module.
-
- int phy_unregister_fixup(const char *phy_id, u32 phy_uid, u32 phy_uid_mask);
- int phy_unregister_fixup_for_uid(u32 phy_uid, u32 phy_uid_mask);
- int phy_register_fixup_for_id(const char *phy_id);
-
-Standards
-
- IEEE Standard 802.3: CSMA/CD Access Method and Physical Layer Specifications, Section Two:
- http://standards.ieee.org/getieee802/download/802.3-2008_section2.pdf
-
- RGMII v1.3:
- http://web.archive.org/web/20160303212629/http://www.hp.com/rnd/pdfs/RGMIIv1_3.pdf
-
- RGMII v2.0:
- http://web.archive.org/web/20160303171328/http://www.hp.com/rnd/pdfs/RGMIIv2_0_final_hp.pdf
diff --git a/Documentation/networking/snmp_counter.rst b/Documentation/networking/snmp_counter.rst
index fe8f741193be..c5642f430d2e 100644
--- a/Documentation/networking/snmp_counter.rst
+++ b/Documentation/networking/snmp_counter.rst
@@ -1,16 +1,17 @@
-===========
+============
SNMP counter
-===========
+============
This document explains the meaning of SNMP counters.
General IPv4 counters
-====================
+=====================
All layer 4 packets and ICMP packets will change these counters, but
these counters won't be changed by layer 2 packets (such as STP) or
ARP packets.
* IpInReceives
+
Defined in `RFC1213 ipInReceives`_
.. _RFC1213 ipInReceives: https://tools.ietf.org/html/rfc1213#page-26
@@ -23,6 +24,7 @@ and so on). It indicates the number of aggregated segments after
GRO/LRO.
* IpInDelivers
+
Defined in `RFC1213 ipInDelivers`_
.. _RFC1213 ipInDelivers: https://tools.ietf.org/html/rfc1213#page-28
@@ -33,6 +35,7 @@ supported protocols will be delivered, if someone listens on the raw
socket, all valid IP packets will be delivered.
* IpOutRequests
+
Defined in `RFC1213 ipOutRequests`_
.. _RFC1213 ipOutRequests: https://tools.ietf.org/html/rfc1213#page-28
@@ -42,6 +45,7 @@ multicast packets, and would always be updated together with
IpExtOutOctets.
* IpExtInOctets and IpExtOutOctets
+
They are Linux kernel extensions, no RFC definitions. Please note,
RFC1213 indeed defines ifInOctets and ifOutOctets, but they
are different things. The ifInOctets and ifOutOctets include the MAC
@@ -49,6 +53,7 @@ layer header size but IpExtInOctets and IpExtOutOctets don't, they
only include the IP layer header and the IP layer data.
* IpExtInNoECTPkts, IpExtInECT1Pkts, IpExtInECT0Pkts, IpExtInCEPkts
+
They indicate the number of four kinds of ECN IP packets, please refer
`Explicit Congestion Notification`_ for more details.
@@ -60,6 +65,7 @@ for the same packet, you might find that IpInReceives count 1, but
IpExtInNoECTPkts counts 2 or more.
* IpInHdrErrors
+
Defined in `RFC1213 ipInHdrErrors`_. It indicates the packet is
dropped due to the IP header error. It might happen in both IP input
and IP forward paths.
@@ -67,6 +73,7 @@ and IP forward paths.
.. _RFC1213 ipInHdrErrors: https://tools.ietf.org/html/rfc1213#page-27
* IpInAddrErrors
+
Defined in `RFC1213 ipInAddrErrors`_. It will be increased in two
scenarios: (1) The IP address is invalid. (2) The destination IP
address is not a local address and IP forwarding is not enabled
@@ -74,6 +81,7 @@ address is not a local address and IP forwarding is not enabled
.. _RFC1213 ipInAddrErrors: https://tools.ietf.org/html/rfc1213#page-27
* IpExtInNoRoutes
+
This counter means the packet is dropped when the IP stack receives a
packet and can't find a route for it from the route table. It might
happen when IP forwarding is enabled and the destination IP address is
@@ -81,6 +89,7 @@ not a local address and there is no route for the destination IP
address.
* IpInUnknownProtos
+
Defined in `RFC1213 ipInUnknownProtos`_. It will be increased if the
layer 4 protocol is unsupported by kernel. If an application is using
raw socket, kernel will always deliver the packet to the raw socket
@@ -89,10 +98,12 @@ and this counter won't be increased.
.. _RFC1213 ipInUnknownProtos: https://tools.ietf.org/html/rfc1213#page-27
* IpExtInTruncatedPkts
+
For IPv4 packet, it means the actual data size is smaller than the
"Total Length" field in the IPv4 header.
* IpInDiscards
+
Defined in `RFC1213 ipInDiscards`_. It indicates the packet is dropped
in the IP receiving path and due to kernel internal reasons (e.g. no
enough memory).
@@ -100,20 +111,23 @@ enough memory).
.. _RFC1213 ipInDiscards: https://tools.ietf.org/html/rfc1213#page-28
* IpOutDiscards
+
Defined in `RFC1213 ipOutDiscards`_. It indicates the packet is
dropped in the IP sending path and due to kernel internal reasons.
.. _RFC1213 ipOutDiscards: https://tools.ietf.org/html/rfc1213#page-28
* IpOutNoRoutes
+
Defined in `RFC1213 ipOutNoRoutes`_. It indicates the packet is
dropped in the IP sending path and no route is found for it.
.. _RFC1213 ipOutNoRoutes: https://tools.ietf.org/html/rfc1213#page-29
ICMP counters
-============
+=============
* IcmpInMsgs and IcmpOutMsgs
+
Defined by `RFC1213 icmpInMsgs`_ and `RFC1213 icmpOutMsgs`_
.. _RFC1213 icmpInMsgs: https://tools.ietf.org/html/rfc1213#page-41
@@ -126,6 +140,7 @@ IcmpOutMsgs would still be updated if the IP header is constructed by
a userspace program.
* ICMP named types
+
| These counters include most of common ICMP types, they are:
| IcmpInDestUnreachs: `RFC1213 icmpInDestUnreachs`_
| IcmpInTimeExcds: `RFC1213 icmpInTimeExcds`_
@@ -180,6 +195,7 @@ straightforward. The 'In' counter means kernel receives such a packet
and the 'Out' counter means kernel sends such a packet.
* ICMP numeric types
+
They are IcmpMsgInType[N] and IcmpMsgOutType[N], the [N] indicates the
ICMP type number. These counters track all kinds of ICMP packets. The
ICMP type number definition could be found in the `ICMP parameters`_
@@ -192,12 +208,14 @@ IcmpMsgOutType8 would increase 1. And if kernel gets an ICMP Echo Reply
packet, IcmpMsgInType0 would increase 1.
* IcmpInCsumErrors
+
This counter indicates the checksum of the ICMP packet is
wrong. Kernel verifies the checksum after updating the IcmpInMsgs and
before updating IcmpMsgInType[N]. If a packet has bad checksum, the
IcmpInMsgs would be updated but none of IcmpMsgInType[N] would be updated.
* IcmpInErrors and IcmpOutErrors
+
Defined by `RFC1213 icmpInErrors`_ and `RFC1213 icmpOutErrors`_
.. _RFC1213 icmpInErrors: https://tools.ietf.org/html/rfc1213#page-41
@@ -209,7 +227,7 @@ and the sending packet path use IcmpOutErrors. When IcmpInCsumErrors
is increased, IcmpInErrors would always be increased too.
relationship of the ICMP counters
--------------------------------
+---------------------------------
The sum of IcmpMsgOutType[N] is always equal to IcmpOutMsgs, as they
are updated at the same time. The sum of IcmpMsgInType[N] plus
IcmpInErrors should be equal or larger than IcmpInMsgs. When kernel
@@ -229,8 +247,9 @@ IcmpInMsgs should be less than the sum of IcmpMsgOutType[N] plus
IcmpInErrors.
General TCP counters
-==================
+====================
* TcpInSegs
+
Defined in `RFC1213 tcpInSegs`_
.. _RFC1213 tcpInSegs: https://tools.ietf.org/html/rfc1213#page-48
@@ -247,6 +266,7 @@ isn't aware of GRO. So if two packets are merged by GRO, the TcpInSegs
counter would only increase 1.
* TcpOutSegs
+
Defined in `RFC1213 tcpOutSegs`_
.. _RFC1213 tcpOutSegs: https://tools.ietf.org/html/rfc1213#page-48
@@ -258,6 +278,7 @@ GSO, so if a packet would be split to 2 by GSO, TcpOutSegs will
increase 2.
* TcpActiveOpens
+
Defined in `RFC1213 tcpActiveOpens`_
.. _RFC1213 tcpActiveOpens: https://tools.ietf.org/html/rfc1213#page-47
@@ -267,6 +288,7 @@ state. Every time TcpActiveOpens increases 1, TcpOutSegs should always
increase 1.
* TcpPassiveOpens
+
Defined in `RFC1213 tcpPassiveOpens`_
.. _RFC1213 tcpPassiveOpens: https://tools.ietf.org/html/rfc1213#page-47
@@ -275,6 +297,7 @@ It means the TCP layer receives a SYN, replies a SYN+ACK, come into
the SYN-RCVD state.
* TcpExtTCPRcvCoalesce
+
When packets are received by the TCP layer and are not be read by the
application, the TCP layer will try to merge them. This counter
indicate how many packets are merged in such situation. If GRO is
@@ -282,12 +305,14 @@ enabled, lots of packets would be merged by GRO, these packets
wouldn't be counted to TcpExtTCPRcvCoalesce.
* TcpExtTCPAutoCorking
+
When sending packets, the TCP layer will try to merge small packets to
a bigger one. This counter increase 1 for every packet merged in such
situation. Please refer to the LWN article for more details:
https://lwn.net/Articles/576263/
* TcpExtTCPOrigDataSent
+
This counter is explained by `kernel commit f19c29e3e391`_, I pasted the
explaination below::
@@ -297,6 +322,7 @@ explaination below::
more useful to track the TCP retransmission rate.
* TCPSynRetrans
+
This counter is explained by `kernel commit f19c29e3e391`_, I pasted the
explaination below::
@@ -304,6 +330,7 @@ explaination below::
retransmissions into SYN, fast-retransmits, timeout retransmits, etc.
* TCPFastOpenActiveFail
+
This counter is explained by `kernel commit f19c29e3e391`_, I pasted the
explaination below::
@@ -313,6 +340,7 @@ explaination below::
.. _kernel commit f19c29e3e391: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f19c29e3e391a66a273e9afebaf01917245148cd
* TcpExtListenOverflows and TcpExtListenDrops
+
When kernel receives a SYN from a client, and if the TCP accept queue
is full, kernel will drop the SYN and add 1 to TcpExtListenOverflows.
At the same time kernel will also add 1 to TcpExtListenDrops. When a
@@ -336,6 +364,8 @@ time client replies ACK, this socket will get another chance to move
to the accept queue.
+TCP Fast Open
+=============
* TcpEstabResets
Defined in `RFC1213 tcpEstabResets`_.
@@ -389,20 +419,23 @@ will disable the fast path at first, and try to enable it after kernel
receives packets.
* TcpExtTCPPureAcks and TcpExtTCPHPAcks
+
If a packet set ACK flag and has no data, it is a pure ACK packet, if
kernel handles it in the fast path, TcpExtTCPHPAcks will increase 1,
if kernel handles it in the slow path, TcpExtTCPPureAcks will
increase 1.
* TcpExtTCPHPHits
+
If a TCP packet has data (which means it is not a pure ACK packet),
and this packet is handled in the fast path, TcpExtTCPHPHits will
increase 1.
TCP abort
-========
+=========
* TcpExtTCPAbortOnData
+
It means TCP layer has data in flight, but need to close the
connection. So TCP layer sends a RST to the other side, indicate the
connection is not closed very graceful. An easy way to increase this
@@ -421,11 +454,13 @@ when the application closes a connection, kernel will send a RST
immediately and increase the TcpExtTCPAbortOnData counter.
* TcpExtTCPAbortOnClose
+
This counter means the application has unread data in the TCP layer when
the application wants to close the TCP connection. In such a situation,
kernel will send a RST to the other side of the TCP connection.
* TcpExtTCPAbortOnMemory
+
When an application closes a TCP connection, kernel still need to track
the connection, let it complete the TCP disconnect process. E.g. an
app calls the close method of a socket, kernel sends fin to the other
@@ -447,10 +482,12 @@ the tcp_mem. Please refer the tcp_mem section in the `TCP man page`_:
* TcpExtTCPAbortOnTimeout
+
This counter will increase when any of the TCP timers expire. In such
situation, kernel won't send RST, just give up the connection.
* TcpExtTCPAbortOnLinger
+
When a TCP connection comes into FIN_WAIT_2 state, instead of waiting
for the fin packet from the other side, kernel could send a RST and
delete the socket immediately. This is not the default behavior of
@@ -458,6 +495,7 @@ Linux kernel TCP stack. By configuring the TCP_LINGER2 socket option,
you could let kernel follow this behavior.
* TcpExtTCPAbortFailed
+
The kernel TCP layer will send RST if the `RFC2525 2.17 section`_ is
satisfied. If an internal error occurs during this process,
TcpExtTCPAbortFailed will be increased.
@@ -465,7 +503,7 @@ TcpExtTCPAbortFailed will be increased.
.. _RFC2525 2.17 section: https://tools.ietf.org/html/rfc2525#page-50
TCP Hybrid Slow Start
-====================
+=====================
The Hybrid Slow Start algorithm is an enhancement of the traditional
TCP congestion window Slow Start algorithm. It uses two pieces of
information to detect whether the max bandwidth of the TCP path is
@@ -481,23 +519,27 @@ relate with the Hybrid Slow Start algorithm.
.. _Hybrid Slow Start paper: https://pdfs.semanticscholar.org/25e9/ef3f03315782c7f1cbcd31b587857adae7d1.pdf
* TcpExtTCPHystartTrainDetect
+
How many times the ACK train length threshold is detected
* TcpExtTCPHystartTrainCwnd
+
The sum of CWND detected by ACK train length. Dividing this value by
TcpExtTCPHystartTrainDetect is the average CWND which detected by the
ACK train length.
* TcpExtTCPHystartDelayDetect
+
How many times the packet delay threshold is detected.
* TcpExtTCPHystartDelayCwnd
+
The sum of CWND detected by packet delay. Dividing this value by
TcpExtTCPHystartDelayDetect is the average CWND which detected by the
packet delay.
TCP retransmission and congestion control
-======================================
+=========================================
The TCP protocol has two retransmission mechanisms: SACK and fast
recovery. They are exclusive with each other. When SACK is enabled,
the kernel TCP stack would use SACK, or kernel would use fast
@@ -516,12 +558,14 @@ https://pdfs.semanticscholar.org/0e9c/968d09ab2e53e24c4dca5b2d67c7f7140f8e.pdf
.. _RFC6582: https://tools.ietf.org/html/rfc6582
* TcpExtTCPRenoRecovery and TcpExtTCPSackRecovery
+
When the congestion control comes into Recovery state, if sack is
used, TcpExtTCPSackRecovery increases 1, if sack is not used,
TcpExtTCPRenoRecovery increases 1. These two counters mean the TCP
stack begins to retransmit the lost packets.
* TcpExtTCPSACKReneging
+
A packet was acknowledged by SACK, but the receiver has dropped this
packet, so the sender needs to retransmit this packet. In this
situation, the sender adds 1 to TcpExtTCPSACKReneging. A receiver
@@ -532,6 +576,7 @@ the RTO expires for this packet, then the sender assumes this packet
has been dropped by the receiver.
* TcpExtTCPRenoReorder
+
The reorder packet is detected by fast recovery. It would only be used
if SACK is disabled. The fast recovery algorithm detects recorder by
the duplicate ACK number. E.g., if retransmission is triggered, and
@@ -542,6 +587,7 @@ order packet. Thus the sender would find more ACks than its
expectation, and the sender knows out of order occurs.
* TcpExtTCPTSReorder
+
The reorder packet is detected when a hole is filled. E.g., assume the
sender sends packet 1,2,3,4,5, and the receiving order is
1,2,4,5,3. When the sender receives the ACK of packet 3 (which will
@@ -551,6 +597,7 @@ fill the hole), two conditions will let TcpExtTCPTSReorder increase
than the retransmission timestamp.
* TcpExtTCPSACKReorder
+
The reorder packet detected by SACK. The SACK has two methods to
detect reorder: (1) DSACK is received by the sender. It means the
sender sends the same packet more than one times. And the only reason
@@ -574,10 +621,12 @@ sender side.
.. _RFC2883 : https://tools.ietf.org/html/rfc2883
* TcpExtTCPDSACKOldSent
+
The TCP stack receives a duplicate packet which has been acked, so it
sends a DSACK to the sender.
* TcpExtTCPDSACKOfoSent
+
The TCP stack receives an out of order duplicate packet, so it sends a
DSACK to the sender.
@@ -586,6 +635,7 @@ The TCP stack receives a DSACK, which indicates an acknowledged
duplicate packet is received.
* TcpExtTCPDSACKOfoRecv
+
The TCP stack receives a DSACK, which indicate an out of order
duplicate packet is received.
@@ -640,23 +690,26 @@ A skb should be shifted or merged, but the TCP stack doesn't do it for
some reasons.
TCP out of order
-===============
+================
* TcpExtTCPOFOQueue
+
The TCP layer receives an out of order packet and has enough memory
to queue it.
* TcpExtTCPOFODrop
+
The TCP layer receives an out of order packet but doesn't have enough
memory, so drops it. Such packets won't be counted into
TcpExtTCPOFOQueue.
* TcpExtTCPOFOMerge
+
The received out of order packet has an overlay with the previous
packet. the overlay part will be dropped. All of TcpExtTCPOFOMerge
packets will also be counted into TcpExtTCPOFOQueue.
TCP PAWS
-=======
+========
PAWS (Protection Against Wrapped Sequence numbers) is an algorithm
which is used to drop old packets. It depends on the TCP
timestamps. For detail information, please refer the `timestamp wiki`_
@@ -666,13 +719,15 @@ and the `RFC of PAWS`_.
.. _timestamp wiki: https://en.wikipedia.org/wiki/Transmission_Control_Protocol#TCP_timestamps
* TcpExtPAWSActive
+
Packets are dropped by PAWS in Syn-Sent status.
* TcpExtPAWSEstab
+
Packets are dropped by PAWS in any status other than Syn-Sent.
TCP ACK skip
-===========
+============
In some scenarios, kernel would avoid sending duplicate ACKs too
frequently. Please find more details in the tcp_invalid_ratelimit
section of the `sysctl document`_. When kernel decides to skip an ACK
@@ -684,6 +739,7 @@ it has no data.
.. _sysctl document: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
* TcpExtTCPACKSkippedSynRecv
+
The ACK is skipped in Syn-Recv status. The Syn-Recv status means the
TCP stack receives a SYN and replies SYN+ACK. Now the TCP stack is
waiting for an ACK. Generally, the TCP stack doesn't need to send ACK
@@ -697,6 +753,7 @@ increase TcpExtTCPACKSkippedSynRecv.
* TcpExtTCPACKSkippedPAWS
+
The ACK is skipped due to PAWS (Protect Against Wrapped Sequence
numbers) check fails. If the PAWS check fails in Syn-Recv, Fin-Wait-2
or Time-Wait statuses, the skipped ACK would be counted to
@@ -705,18 +762,22 @@ TcpExtTCPACKSkippedTimeWait. In all other statuses, the skipped ACK
would be counted to TcpExtTCPACKSkippedPAWS.
* TcpExtTCPACKSkippedSeq
+
The sequence number is out of window and the timestamp passes the PAWS
check and the TCP status is not Syn-Recv, Fin-Wait-2, and Time-Wait.
* TcpExtTCPACKSkippedFinWait2
+
The ACK is skipped in Fin-Wait-2 status, the reason would be either
PAWS check fails or the received sequence number is out of window.
* TcpExtTCPACKSkippedTimeWait
+
Tha ACK is skipped in Time-Wait status, the reason would be either
PAWS check failed or the received sequence number is out of window.
* TcpExtTCPACKSkippedChallenge
+
The ACK is skipped if the ACK is a challenge ACK. The RFC 5961 defines
3 kind of challenge ACK, please refer `RFC 5961 section 3.2`_,
`RFC 5961 section 4.2`_ and `RFC 5961 section 5.2`_. Besides these
@@ -784,10 +845,10 @@ A TLP probe packet is sent.
A packet loss is detected and recovered by TLP.
examples
-=======
+========
ping test
---------
+---------
Run the ping command against the public dns server 8.8.8.8::
nstatuser@nstat-a:~$ ping 8.8.8.8 -c 1
@@ -831,7 +892,7 @@ and its corresponding Echo Reply packet are constructed by:
So the IpExtInOctets and IpExtOutOctets are 20+16+48=84.
tcp 3-way handshake
-------------------
+-------------------
On server side, we run::
nstatuser@nstat-b:~$ nc -lknv 0.0.0.0 9000
@@ -873,7 +934,7 @@ ACK, so client sent 2 packets, received 1 packet, TcpInSegs increased
1, TcpOutSegs increased 2.
TCP normal traffic
------------------
+------------------
Run nc on server::
nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
@@ -996,7 +1057,7 @@ and the packet received from client qualified for fast path, so it
was counted into 'TcpExtTCPHPHits'.
TcpExtTCPAbortOnClose
---------------------
+---------------------
On the server side, we run below python script::
import socket
@@ -1030,7 +1091,7 @@ If we run tcpdump on the server side, we could find the server sent a
RST after we type Ctrl-C.
TcpExtTCPAbortOnMemory and TcpExtTCPAbortOnTimeout
------------------------------------------------
+---------------------------------------------------
Below is an example which let the orphan socket count be higher than
net.ipv4.tcp_max_orphans.
Change tcp_max_orphans to a smaller value on client::
@@ -1152,7 +1213,7 @@ FIN_WAIT_1 state finally. So we wait for a few minutes, we could find
TcpExtTCPAbortOnTimeout 10 0.0
TcpExtTCPAbortOnLinger
----------------------
+----------------------
The server side code::
nstatuser@nstat-b:~$ cat server_linger.py
@@ -1197,7 +1258,7 @@ After run client_linger.py, check the output of nstat::
TcpExtTCPAbortOnLinger 1 0.0
TcpExtTCPRcvCoalesce
--------------------
+--------------------
On the server, we run a program which listen on TCP port 9000, but
doesn't read any data::
@@ -1257,7 +1318,7 @@ the receiving queue. So the TCP layer merged the two packets, and we
could find the TcpExtTCPRcvCoalesce increased 1.
TcpExtListenOverflows and TcpExtListenDrops
-----------------------------------------
+-------------------------------------------
On server, run the nc command, listen on port 9000::
nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
@@ -1305,7 +1366,7 @@ TcpExtListenOverflows and TcpExtListenDrops would be larger, because
the SYN of the 4th nc was dropped, the client was retrying.
IpInAddrErrors, IpExtInNoRoutes and IpOutNoRoutes
-----------------------------------------------
+-------------------------------------------------
server A IP address: 192.168.122.250
server B IP address: 192.168.122.251
Prepare on server A, add a route to server B::
@@ -1400,7 +1461,7 @@ a route for the 8.8.8.8 IP address, so server B increased
IpOutNoRoutes.
TcpExtTCPACKSkippedSynRecv
-------------------------
+--------------------------
In this test, we send 3 same SYN packets from client to server. The
first SYN will let server create a socket, set it to Syn-Recv status,
and reply a SYN/ACK. The second SYN will let server reply the SYN/ACK
@@ -1448,7 +1509,7 @@ Check snmp cunter on nstat-b::
As we expected, TcpExtTCPACKSkippedSynRecv is 1.
TcpExtTCPACKSkippedPAWS
-----------------------
+-----------------------
To trigger PAWS, we could send an old SYN.
On nstat-b, let nc listen on port 9000::
@@ -1485,7 +1546,7 @@ failed, the nstat-b replied an ACK for the first SYN, skipped the ACK
for the second SYN, and updated TcpExtTCPACKSkippedPAWS.
TcpExtTCPACKSkippedSeq
---------------------
+----------------------
To trigger TcpExtTCPACKSkippedSeq, we send packets which have valid
timestamp (to pass PAWS check) but the sequence number is out of
window. The linux TCP stack would avoid to skip if the packet has
diff --git a/Documentation/networking/switchdev.txt b/Documentation/networking/switchdev.txt
index 82236a17b5e6..f3244d87512a 100644
--- a/Documentation/networking/switchdev.txt
+++ b/Documentation/networking/switchdev.txt
@@ -196,7 +196,7 @@ The switch device will learn/forget source MAC address/VLAN on ingress packets
and notify the switch driver of the mac/vlan/port tuples. The switch driver,
in turn, will notify the bridge driver using the switchdev notifier call:
- err = call_switchdev_notifiers(val, dev, info);
+ err = call_switchdev_notifiers(val, dev, info, extack);
Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when
forgetting, and info points to a struct switchdev_notifier_fdb_info. On
diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
index 2793d4eac55f..bc0680706870 100644
--- a/Documentation/sysctl/net.txt
+++ b/Documentation/sysctl/net.txt
@@ -291,6 +291,20 @@ user space is responsible for creating them if needed.
Default : 0 (for compatibility reasons)
+devconf_inherit_init_net
+----------------------------
+
+Controls if a new network namespace should inherit all current
+settings under /proc/sys/net/{ipv4,ipv6}/conf/{all,default}/. By
+default, we keep the current behavior: for IPv4 we inherit all current
+settings from init_net and for IPv6 we reset all settings to default.
+
+If set to 1, both IPv4 and IPv6 settings are forced to inherit from
+current ones in init_net. If set to 2, both IPv4 and IPv6 settings are
+forced to reset to their default values.
+
+Default : 0 (for compatibility reasons)
+
2. /proc/sys/net/unix - Parameters for Unix domain sockets
-------------------------------------------------------