aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/networking
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/networking')
-rw-r--r--Documentation/networking/altera_tse.txt263
-rw-r--r--Documentation/networking/bonding.txt171
-rw-r--r--Documentation/networking/can.txt39
-rw-r--r--Documentation/networking/cdc_mbim.txt339
-rw-r--r--Documentation/networking/dccp.txt2
-rw-r--r--Documentation/networking/filter.txt472
-rw-r--r--Documentation/networking/gianfar.txt30
-rw-r--r--Documentation/networking/i40e.txt7
-rw-r--r--Documentation/networking/igb.txt48
-rw-r--r--Documentation/networking/ip-sysctl.txt38
-rw-r--r--Documentation/networking/netlink_mmap.txt4
-rw-r--r--Documentation/networking/packet_mmap.txt18
-rw-r--r--Documentation/networking/phy.txt29
-rw-r--r--Documentation/networking/pktgen.txt52
-rw-r--r--Documentation/networking/rxrpc.txt81
-rw-r--r--Documentation/networking/scaling.txt2
-rw-r--r--Documentation/networking/spider_net.txt2
-rw-r--r--Documentation/networking/tcp.txt2
-rw-r--r--Documentation/networking/timestamping.txt22
-rw-r--r--Documentation/networking/timestamping/timestamping.c7
20 files changed, 1434 insertions, 194 deletions
diff --git a/Documentation/networking/altera_tse.txt b/Documentation/networking/altera_tse.txt
new file mode 100644
index 000000000000..3f24df8c6e65
--- /dev/null
+++ b/Documentation/networking/altera_tse.txt
@@ -0,0 +1,263 @@
+ Altera Triple-Speed Ethernet MAC driver
+
+Copyright (C) 2008-2014 Altera Corporation
+
+This is the driver for the Altera Triple-Speed Ethernet (TSE) controllers
+using the SGDMA and MSGDMA soft DMA IP components. The driver uses the
+platform bus to obtain component resources. The designs used to test this
+driver were built for a Cyclone(R) V SOC FPGA board, a Cyclone(R) V FPGA board,
+and tested with ARM and NIOS processor hosts seperately. The anticipated use
+cases are simple communications between an embedded system and an external peer
+for status and simple configuration of the embedded system.
+
+For more information visit www.altera.com and www.rocketboards.org. Support
+forums for the driver may be found on www.rocketboards.org, and a design used
+to test this driver may be found there as well. Support is also available from
+the maintainer of this driver, found in MAINTAINERS.
+
+The Triple-Speed Ethernet, SGDMA, and MSGDMA components are all soft IP
+components that can be assembled and built into an FPGA using the Altera
+Quartus toolchain. Quartus 13.1 and 14.0 were used to build the design that
+this driver was tested against. The sopc2dts tool is used to create the
+device tree for the driver, and may be found at rocketboards.org.
+
+The driver probe function examines the device tree and determines if the
+Triple-Speed Ethernet instance is using an SGDMA or MSGDMA component. The
+probe function then installs the appropriate set of DMA routines to
+initialize, setup transmits, receives, and interrupt handling primitives for
+the respective configurations.
+
+The SGDMA component is to be deprecated in the near future (over the next 1-2
+years as of this writing in early 2014) in favor of the MSGDMA component.
+SGDMA support is included for existing designs and reference in case a
+developer wishes to support their own soft DMA logic and driver support. Any
+new designs should not use the SGDMA.
+
+The SGDMA supports only a single transmit or receive operation at a time, and
+therefore will not perform as well compared to the MSGDMA soft IP. Please
+visit www.altera.com for known, documented SGDMA errata.
+
+Scatter-gather DMA is not supported by the SGDMA or MSGDMA at this time.
+Scatter-gather DMA will be added to a future maintenance update to this
+driver.
+
+Jumbo frames are not supported at this time.
+
+The driver limits PHY operations to 10/100Mbps, and has not yet been fully
+tested for 1Gbps. This support will be added in a future maintenance update.
+
+1) Kernel Configuration
+The kernel configuration option is ALTERA_TSE:
+ Device Drivers ---> Network device support ---> Ethernet driver support --->
+ Altera Triple-Speed Ethernet MAC support (ALTERA_TSE)
+
+2) Driver parameters list:
+ debug: message level (0: no output, 16: all);
+ dma_rx_num: Number of descriptors in the RX list (default is 64);
+ dma_tx_num: Number of descriptors in the TX list (default is 64).
+
+3) Command line options
+Driver parameters can be also passed in command line by using:
+ altera_tse=dma_rx_num:128,dma_tx_num:512
+
+4) Driver information and notes
+
+4.1) Transmit process
+When the driver's transmit routine is called by the kernel, it sets up a
+transmit descriptor by calling the underlying DMA transmit routine (SGDMA or
+MSGDMA), and initites a transmit operation. Once the transmit is complete, an
+interrupt is driven by the transmit DMA logic. The driver handles the transmit
+completion in the context of the interrupt handling chain by recycling
+resource required to send and track the requested transmit operation.
+
+4.2) Receive process
+The driver will post receive buffers to the receive DMA logic during driver
+intialization. Receive buffers may or may not be queued depending upon the
+underlying DMA logic (MSGDMA is able queue receive buffers, SGDMA is not able
+to queue receive buffers to the SGDMA receive logic). When a packet is
+received, the DMA logic generates an interrupt. The driver handles a receive
+interrupt by obtaining the DMA receive logic status, reaping receive
+completions until no more receive completions are available.
+
+4.3) Interrupt Mitigation
+The driver is able to mitigate the number of its DMA interrupts
+using NAPI for receive operations. Interrupt mitigation is not yet supported
+for transmit operations, but will be added in a future maintenance release.
+
+4.4) Ethtool support
+Ethtool is supported. Driver statistics and internal errors can be taken using:
+ethtool -S ethX command. It is possible to dump registers etc.
+
+4.5) PHY Support
+The driver is compatible with PAL to work with PHY and GPHY devices.
+
+4.7) List of source files:
+ o Kconfig
+ o Makefile
+ o altera_tse_main.c: main network device driver
+ o altera_tse_ethtool.c: ethtool support
+ o altera_tse.h: private driver structure and common definitions
+ o altera_msgdma.h: MSGDMA implementation function definitions
+ o altera_sgdma.h: SGDMA implementation function definitions
+ o altera_msgdma.c: MSGDMA implementation
+ o altera_sgdma.c: SGDMA implementation
+ o altera_sgdmahw.h: SGDMA register and descriptor definitions
+ o altera_msgdmahw.h: MSGDMA register and descriptor definitions
+ o altera_utils.c: Driver utility functions
+ o altera_utils.h: Driver utility function definitions
+
+5) Debug Information
+
+The driver exports debug information such as internal statistics,
+debug information, MAC and DMA registers etc.
+
+A user may use the ethtool support to get statistics:
+e.g. using: ethtool -S ethX (that shows the statistics counters)
+or sees the MAC registers: e.g. using: ethtool -d ethX
+
+The developer can also use the "debug" module parameter to get
+further debug information.
+
+6) Statistics Support
+
+The controller and driver support a mix of IEEE standard defined statistics,
+RFC defined statistics, and driver or Altera defined statistics. The four
+specifications containing the standard definitions for these statistics are
+as follows:
+
+ o IEEE 802.3-2012 - IEEE Standard for Ethernet.
+ o RFC 2863 found at http://www.rfc-editor.org/rfc/rfc2863.txt.
+ o RFC 2819 found at http://www.rfc-editor.org/rfc/rfc2819.txt.
+ o Altera Triple Speed Ethernet User Guide, found at http://www.altera.com
+
+The statistics supported by the TSE and the device driver are as follows:
+
+"tx_packets" is equivalent to aFramesTransmittedOK defined in IEEE 802.3-2012,
+Section 5.2.2.1.2. This statistics is the count of frames that are successfully
+transmitted.
+
+"rx_packets" is equivalent to aFramesReceivedOK defined in IEEE 802.3-2012,
+Section 5.2.2.1.5. This statistic is the count of frames that are successfully
+received. This count does not include any error packets such as CRC errors,
+length errors, or alignment errors.
+
+"rx_crc_errors" is equivalent to aFrameCheckSequenceErrors defined in IEEE
+802.3-2012, Section 5.2.2.1.6. This statistic is the count of frames that are
+an integral number of bytes in length and do not pass the CRC test as the frame
+is received.
+
+"rx_align_errors" is equivalent to aAlignmentErrors defined in IEEE 802.3-2012,
+Section 5.2.2.1.7. This statistic is the count of frames that are not an
+integral number of bytes in length and do not pass the CRC test as the frame is
+received.
+
+"tx_bytes" is equivalent to aOctetsTransmittedOK defined in IEEE 802.3-2012,
+Section 5.2.2.1.8. This statistic is the count of data and pad bytes
+successfully transmitted from the interface.
+
+"rx_bytes" is equivalent to aOctetsReceivedOK defined in IEEE 802.3-2012,
+Section 5.2.2.1.14. This statistic is the count of data and pad bytes
+successfully received by the controller.
+
+"tx_pause" is equivalent to aPAUSEMACCtrlFramesTransmitted defined in IEEE
+802.3-2012, Section 30.3.4.2. This statistic is a count of PAUSE frames
+transmitted from the network controller.
+
+"rx_pause" is equivalent to aPAUSEMACCtrlFramesReceived defined in IEEE
+802.3-2012, Section 30.3.4.3. This statistic is a count of PAUSE frames
+received by the network controller.
+
+"rx_errors" is equivalent to ifInErrors defined in RFC 2863. This statistic is
+a count of the number of packets received containing errors that prevented the
+packet from being delivered to a higher level protocol.
+
+"tx_errors" is equivalent to ifOutErrors defined in RFC 2863. This statistic
+is a count of the number of packets that could not be transmitted due to errors.
+
+"rx_unicast" is equivalent to ifInUcastPkts defined in RFC 2863. This
+statistic is a count of the number of packets received that were not addressed
+to the broadcast address or a multicast group.
+
+"rx_multicast" is equivalent to ifInMulticastPkts defined in RFC 2863. This
+statistic is a count of the number of packets received that were addressed to
+a multicast address group.
+
+"rx_broadcast" is equivalent to ifInBroadcastPkts defined in RFC 2863. This
+statistic is a count of the number of packets received that were addressed to
+the broadcast address.
+
+"tx_discards" is equivalent to ifOutDiscards defined in RFC 2863. This
+statistic is the number of outbound packets not transmitted even though an
+error was not detected. An example of a reason this might occur is to free up
+internal buffer space.
+
+"tx_unicast" is equivalent to ifOutUcastPkts defined in RFC 2863. This
+statistic counts the number of packets transmitted that were not addressed to
+a multicast group or broadcast address.
+
+"tx_multicast" is equivalent to ifOutMulticastPkts defined in RFC 2863. This
+statistic counts the number of packets transmitted that were addressed to a
+multicast group.
+
+"tx_broadcast" is equivalent to ifOutBroadcastPkts defined in RFC 2863. This
+statistic counts the number of packets transmitted that were addressed to a
+broadcast address.
+
+"ether_drops" is equivalent to etherStatsDropEvents defined in RFC 2819.
+This statistic counts the number of packets dropped due to lack of internal
+controller resources.
+
+"rx_total_bytes" is equivalent to etherStatsOctets defined in RFC 2819.
+This statistic counts the total number of bytes received by the controller,
+including error and discarded packets.
+
+"rx_total_packets" is equivalent to etherStatsPkts defined in RFC 2819.
+This statistic counts the total number of packets received by the controller,
+including error, discarded, unicast, multicast, and broadcast packets.
+
+"rx_undersize" is equivalent to etherStatsUndersizePkts defined in RFC 2819.
+This statistic counts the number of correctly formed packets received less
+than 64 bytes long.
+
+"rx_oversize" is equivalent to etherStatsOversizePkts defined in RFC 2819.
+This statistic counts the number of correctly formed packets greater than 1518
+bytes long.
+
+"rx_64_bytes" is equivalent to etherStatsPkts64Octets defined in RFC 2819.
+This statistic counts the total number of packets received that were 64 octets
+in length.
+
+"rx_65_127_bytes" is equivalent to etherStatsPkts65to127Octets defined in RFC
+2819. This statistic counts the total number of packets received that were
+between 65 and 127 octets in length inclusive.
+
+"rx_128_255_bytes" is equivalent to etherStatsPkts128to255Octets defined in
+RFC 2819. This statistic is the total number of packets received that were
+between 128 and 255 octets in length inclusive.
+
+"rx_256_511_bytes" is equivalent to etherStatsPkts256to511Octets defined in
+RFC 2819. This statistic is the total number of packets received that were
+between 256 and 511 octets in length inclusive.
+
+"rx_512_1023_bytes" is equivalent to etherStatsPkts512to1023Octets defined in
+RFC 2819. This statistic is the total number of packets received that were
+between 512 and 1023 octets in length inclusive.
+
+"rx_1024_1518_bytes" is equivalent to etherStatsPkts1024to1518Octets define
+in RFC 2819. This statistic is the total number of packets received that were
+between 1024 and 1518 octets in length inclusive.
+
+"rx_gte_1519_bytes" is a statistic defined specific to the behavior of the
+Altera TSE. This statistics counts the number of received good and errored
+frames between the length of 1519 and the maximum frame length configured
+in the frm_length register. See the Altera TSE User Guide for More details.
+
+"rx_jabbers" is equivalent to etherStatsJabbers defined in RFC 2819. This
+statistic is the total number of packets received that were longer than 1518
+octets, and had either a bad CRC with an integral number of octets (CRC Error)
+or a bad CRC with a non-integral number of octets (Alignment Error).
+
+"rx_runts" is equivalent to etherStatsFragments defined in RFC 2819. This
+statistic is the total number of packets received that were less than 64 octets
+in length and had either a bad CRC with an integral number of octets (CRC
+error) or a bad CRC with a non-integral number of octets (Alignment Error).
diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index 5cdb22971d19..eeb5b2e97bed 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -270,16 +270,15 @@ arp_ip_target
arp_validate
Specifies whether or not ARP probes and replies should be
- validated in the active-backup mode. This causes the ARP
- monitor to examine the incoming ARP requests and replies, and
- only consider a slave to be up if it is receiving the
- appropriate ARP traffic.
+ validated in any mode that supports arp monitoring, or whether
+ non-ARP traffic should be filtered (disregarded) for link
+ monitoring purposes.
Possible values are:
none or 0
- No validation is performed. This is the default.
+ No validation or filtering is performed.
active or 1
@@ -293,31 +292,68 @@ arp_validate
Validation is performed for all slaves.
- For the active slave, the validation checks ARP replies to
- confirm that they were generated by an arp_ip_target. Since
- backup slaves do not typically receive these replies, the
- validation performed for backup slaves is on the ARP request
- sent out via the active slave. It is possible that some
- switch or network configurations may result in situations
- wherein the backup slaves do not receive the ARP requests; in
- such a situation, validation of backup slaves must be
- disabled.
-
- The validation of ARP requests on backup slaves is mainly
- helping bonding to decide which slaves are more likely to
- work in case of the active slave failure, it doesn't really
- guarantee that the backup slave will work if it's selected
- as the next active slave.
-
- This option is useful in network configurations in which
- multiple bonding hosts are concurrently issuing ARPs to one or
- more targets beyond a common switch. Should the link between
- the switch and target fail (but not the switch itself), the
- probe traffic generated by the multiple bonding instances will
- fool the standard ARP monitor into considering the links as
- still up. Use of the arp_validate option can resolve this, as
- the ARP monitor will only consider ARP requests and replies
- associated with its own instance of bonding.
+ filter or 4
+
+ Filtering is applied to all slaves. No validation is
+ performed.
+
+ filter_active or 5
+
+ Filtering is applied to all slaves, validation is performed
+ only for the active slave.
+
+ filter_backup or 6
+
+ Filtering is applied to all slaves, validation is performed
+ only for backup slaves.
+
+ Validation:
+
+ Enabling validation causes the ARP monitor to examine the incoming
+ ARP requests and replies, and only consider a slave to be up if it
+ is receiving the appropriate ARP traffic.
+
+ For an active slave, the validation checks ARP replies to confirm
+ that they were generated by an arp_ip_target. Since backup slaves
+ do not typically receive these replies, the validation performed
+ for backup slaves is on the broadcast ARP request sent out via the
+ active slave. It is possible that some switch or network
+ configurations may result in situations wherein the backup slaves
+ do not receive the ARP requests; in such a situation, validation
+ of backup slaves must be disabled.
+
+ The validation of ARP requests on backup slaves is mainly helping
+ bonding to decide which slaves are more likely to work in case of
+ the active slave failure, it doesn't really guarantee that the
+ backup slave will work if it's selected as the next active slave.
+
+ Validation is useful in network configurations in which multiple
+ bonding hosts are concurrently issuing ARPs to one or more targets
+ beyond a common switch. Should the link between the switch and
+ target fail (but not the switch itself), the probe traffic
+ generated by the multiple bonding instances will fool the standard
+ ARP monitor into considering the links as still up. Use of
+ validation can resolve this, as the ARP monitor will only consider
+ ARP requests and replies associated with its own instance of
+ bonding.
+
+ Filtering:
+
+ Enabling filtering causes the ARP monitor to only use incoming ARP
+ packets for link availability purposes. Arriving packets that are
+ not ARPs are delivered normally, but do not count when determining
+ if a slave is available.
+
+ Filtering operates by only considering the reception of ARP
+ packets (any ARP packet, regardless of source or destination) when
+ determining if a slave has received traffic for link availability
+ purposes.
+
+ Filtering is useful in network configurations in which significant
+ levels of third party broadcast traffic would fool the standard
+ ARP monitor into considering the links as still up. Use of
+ filtering can resolve this, as only ARP traffic is considered for
+ link availability purposes.
This option was added in bonding version 3.1.0.
@@ -506,10 +542,10 @@ mode
XOR policy: Transmit based on the selected transmit
hash policy. The default policy is a simple [(source
- MAC address XOR'd with destination MAC address) modulo
- slave count]. Alternate transmit policies may be
- selected via the xmit_hash_policy option, described
- below.
+ MAC address XOR'd with destination MAC address XOR
+ packet type ID) modulo slave count]. Alternate transmit
+ policies may be selected via the xmit_hash_policy option,
+ described below.
This mode provides load balancing and fault tolerance.
@@ -549,13 +585,19 @@ mode
balance-tlb or 5
Adaptive transmit load balancing: channel bonding that
- does not require any special switch support. The
- outgoing traffic is distributed according to the
- current load (computed relative to the speed) on each
- slave. Incoming traffic is received by the current
- slave. If the receiving slave fails, another slave
- takes over the MAC address of the failed receiving
- slave.
+ does not require any special switch support.
+
+ In tlb_dynamic_lb=1 mode; the outgoing traffic is
+ distributed according to the current load (computed
+ relative to the speed) on each slave.
+
+ In tlb_dynamic_lb=0 mode; the load balancing based on
+ current load is disabled and the load is distributed
+ only using the hash distribution.
+
+ Incoming traffic is received by the current slave.
+ If the receiving slave fails, another slave takes over
+ the MAC address of the failed receiving slave.
Prerequisite:
@@ -700,6 +742,28 @@ primary_reselect
This option was added for bonding version 3.6.0.
+tlb_dynamic_lb
+
+ Specifies if dynamic shuffling of flows is enabled in tlb
+ mode. The value has no effect on any other modes.
+
+ The default behavior of tlb mode is to shuffle active flows across
+ slaves based on the load in that interval. This gives nice lb
+ characteristics but can cause packet reordering. If re-ordering is
+ a concern use this variable to disable flow shuffling and rely on
+ load balancing provided solely by the hash distribution.
+ xmit-hash-policy can be used to select the appropriate hashing for
+ the setup.
+
+ The sysfs entry can be used to change the setting per bond device
+ and the initial value is derived from the module parameter. The
+ sysfs entry is allowed to be changed only if the bond device is
+ down.
+
+ The default value is "1" that enables flow shuffling while value "0"
+ disables it. This option was added in bonding driver 3.7.1
+
+
updelay
Specifies the time, in milliseconds, to wait before enabling a
@@ -733,14 +797,15 @@ use_carrier
xmit_hash_policy
Selects the transmit hash policy to use for slave selection in
- balance-xor and 802.3ad modes. Possible values are:
+ balance-xor, 802.3ad, and tlb modes. Possible values are:
layer2
- Uses XOR of hardware MAC addresses to generate the
- hash. The formula is
+ Uses XOR of hardware MAC addresses and packet type ID
+ field to generate the hash. The formula is
- (source MAC XOR destination MAC) modulo slave count
+ hash = source MAC XOR destination MAC XOR packet type ID
+ slave number = hash modulo slave count
This algorithm will place all traffic to a particular
network peer on the same slave.
@@ -755,7 +820,7 @@ xmit_hash_policy
Uses XOR of hardware MAC addresses and IP addresses to
generate the hash. The formula is
- hash = source MAC XOR destination MAC
+ hash = source MAC XOR destination MAC XOR packet type ID
hash = hash XOR source IP XOR destination IP
hash = hash XOR (hash RSHIFT 16)
hash = hash XOR (hash RSHIFT 8)
@@ -2237,13 +2302,13 @@ broadcast: Like active-backup, there is not much advantage to this
bandwidth.
Additionally, the linux bonding 802.3ad implementation
- distributes traffic by peer (using an XOR of MAC addresses),
- so in a "gatewayed" configuration, all outgoing traffic will
- generally use the same device. Incoming traffic may also end
- up on a single device, but that is dependent upon the
- balancing policy of the peer's 8023.ad implementation. In a
- "local" configuration, traffic will be distributed across the
- devices in the bond.
+ distributes traffic by peer (using an XOR of MAC addresses
+ and packet type ID), so in a "gatewayed" configuration, all
+ outgoing traffic will generally use the same device. Incoming
+ traffic may also end up on a single device, but that is
+ dependent upon the balancing policy of the peer's 8023.ad
+ implementation. In a "local" configuration, traffic will be
+ distributed across the devices in the bond.
Finally, the 802.3ad mode mandates the use of the MII monitor,
therefore, the ARP monitor is not available in this mode.
diff --git a/Documentation/networking/can.txt b/Documentation/networking/can.txt
index 0cbe6ec22d6f..2236d6dcb7da 100644
--- a/Documentation/networking/can.txt
+++ b/Documentation/networking/can.txt
@@ -469,6 +469,41 @@ solution for a couple of reasons:
having this 'send only' use-case we may remove the receive list in the
Kernel to save a little (really a very little!) CPU usage.
+ 4.1.1.1 CAN filter usage optimisation
+
+ The CAN filters are processed in per-device filter lists at CAN frame
+ reception time. To reduce the number of checks that need to be performed
+ while walking through the filter lists the CAN core provides an optimized
+ filter handling when the filter subscription focusses on a single CAN ID.
+
+ For the possible 2048 SFF CAN identifiers the identifier is used as an index
+ to access the corresponding subscription list without any further checks.
+ For the 2^29 possible EFF CAN identifiers a 10 bit XOR folding is used as
+ hash function to retrieve the EFF table index.
+
+ To benefit from the optimized filters for single CAN identifiers the
+ CAN_SFF_MASK or CAN_EFF_MASK have to be set into can_filter.mask together
+ with set CAN_EFF_FLAG and CAN_RTR_FLAG bits. A set CAN_EFF_FLAG bit in the
+ can_filter.mask makes clear that it matters whether a SFF or EFF CAN ID is
+ subscribed. E.g. in the example from above
+
+ rfilter[0].can_id = 0x123;
+ rfilter[0].can_mask = CAN_SFF_MASK;
+
+ both SFF frames with CAN ID 0x123 and EFF frames with 0xXXXXX123 can pass.
+
+ To filter for only 0x123 (SFF) and 0x12345678 (EFF) CAN identifiers the
+ filter has to be defined in this way to benefit from the optimized filters:
+
+ struct can_filter rfilter[2];
+
+ rfilter[0].can_id = 0x123;
+ rfilter[0].can_mask = (CAN_EFF_FLAG | CAN_RTR_FLAG | CAN_SFF_MASK);
+ rfilter[1].can_id = 0x12345678 | CAN_EFF_FLAG;
+ rfilter[1].can_mask = (CAN_EFF_FLAG | CAN_RTR_FLAG | CAN_EFF_MASK);
+
+ setsockopt(s, SOL_CAN_RAW, CAN_RAW_FILTER, &rfilter, sizeof(rfilter));
+
4.1.2 RAW socket option CAN_RAW_ERR_FILTER
As described in chapter 3.4 the CAN interface driver can generate so
@@ -706,7 +741,7 @@ solution for a couple of reasons:
RX_NO_AUTOTIMER: Prevent automatically starting the timeout monitor.
- RX_ANNOUNCE_RESUME: If passed at RX_SETUP and a receive timeout occured, a
+ RX_ANNOUNCE_RESUME: If passed at RX_SETUP and a receive timeout occurred, a
RX_CHANGED message will be generated when the (cyclic) receive restarts.
TX_RESET_MULTI_IDX: Reset the index for the multiple frame transmission.
@@ -1017,7 +1052,7 @@ solution for a couple of reasons:
in case of a bus-off condition after the specified delay time
in milliseconds. By default it's off.
- "bitrate 125000 sample_point 0.875"
+ "bitrate 125000 sample-point 0.875"
Shows the real bit-rate in bits/sec and the sample-point in the
range 0.000..0.999. If the calculation of bit-timing parameters
is enabled in the kernel (CONFIG_CAN_CALC_BITTIMING=y), the
diff --git a/Documentation/networking/cdc_mbim.txt b/Documentation/networking/cdc_mbim.txt
new file mode 100644
index 000000000000..a15ea602aa52
--- /dev/null
+++ b/Documentation/networking/cdc_mbim.txt
@@ -0,0 +1,339 @@
+ cdc_mbim - Driver for CDC MBIM Mobile Broadband modems
+ ========================================================
+
+The cdc_mbim driver supports USB devices conforming to the "Universal
+Serial Bus Communications Class Subclass Specification for Mobile
+Broadband Interface Model" [1], which is a further development of
+"Universal Serial Bus Communications Class Subclass Specifications for
+Network Control Model Devices" [2] optimized for Mobile Broadband
+devices, aka "3G/LTE modems".
+
+
+Command Line Parameters
+=======================
+
+The cdc_mbim driver has no parameters of its own. But the probing
+behaviour for NCM 1.0 backwards compatible MBIM functions (an
+"NCM/MBIM function" as defined in section 3.2 of [1]) is affected
+by a cdc_ncm driver parameter:
+
+prefer_mbim
+-----------
+Type: Boolean
+Valid Range: N/Y (0-1)
+Default Value: Y (MBIM is preferred)
+
+This parameter sets the system policy for NCM/MBIM functions. Such
+functions will be handled by either the cdc_ncm driver or the cdc_mbim
+driver depending on the prefer_mbim setting. Setting prefer_mbim=N
+makes the cdc_mbim driver ignore these functions and lets the cdc_ncm
+driver handle them instead.
+
+The parameter is writable, and can be changed at any time. A manual
+unbind/bind is required to make the change effective for NCM/MBIM
+functions bound to the "wrong" driver
+
+
+Basic usage
+===========
+
+MBIM functions are inactive when unmanaged. The cdc_mbim driver only
+provides an userspace interface to the MBIM control channel, and will
+not participate in the management of the function. This implies that a
+userspace MBIM management application always is required to enable a
+MBIM function.
+
+Such userspace applications includes, but are not limited to:
+ - mbimcli (included with the libmbim [3] library), and
+ - ModemManager [4]
+
+Establishing a MBIM IP session reequires at least these actions by the
+management application:
+ - open the control channel
+ - configure network connection settings
+ - connect to network
+ - configure IP interface
+
+Management application development
+----------------------------------
+The driver <-> userspace interfaces are described below. The MBIM
+control channel protocol is described in [1].
+
+
+MBIM control channel userspace ABI
+==================================
+
+/dev/cdc-wdmX character device
+------------------------------
+The driver creates a two-way pipe to the MBIM function control channel
+using the cdc-wdm driver as a subdriver. The userspace end of the
+control channel pipe is a /dev/cdc-wdmX character device.
+
+The cdc_mbim driver does not process or police messages on the control
+channel. The channel is fully delegated to the userspace management
+application. It is therefore up to this application to ensure that it
+complies with all the control channel requirements in [1].
+
+The cdc-wdmX device is created as a child of the MBIM control
+interface USB device. The character device associated with a specific
+MBIM function can be looked up using sysfs. For example:
+
+ bjorn@nemi:~$ ls /sys/bus/usb/drivers/cdc_mbim/2-4:2.12/usbmisc
+ cdc-wdm0
+
+ bjorn@nemi:~$ grep . /sys/bus/usb/drivers/cdc_mbim/2-4:2.12/usbmisc/cdc-wdm0/dev
+ 180:0
+
+
+USB configuration descriptors
+-----------------------------
+The wMaxControlMessage field of the CDC MBIM functional descriptor
+limits the maximum control message size. The managament application is
+responsible for negotiating a control message size complying with the
+requirements in section 9.3.1 of [1], taking this descriptor field
+into consideration.
+
+The userspace application can access the CDC MBIM functional
+descriptor of a MBIM function using either of the two USB
+configuration descriptor kernel interfaces described in [6] or [7].
+
+See also the ioctl documentation below.
+
+
+Fragmentation
+-------------
+The userspace application is responsible for all control message
+fragmentation and defragmentaion, as described in section 9.5 of [1].
+
+
+/dev/cdc-wdmX write()
+---------------------
+The MBIM control messages from the management application *must not*
+exceed the negotiated control message size.
+
+
+/dev/cdc-wdmX read()
+--------------------
+The management application *must* accept control messages of up the
+negotiated control message size.
+
+
+/dev/cdc-wdmX ioctl()
+--------------------
+IOCTL_WDM_MAX_COMMAND: Get Maximum Command Size
+This ioctl returns the wMaxControlMessage field of the CDC MBIM
+functional descriptor for MBIM devices. This is intended as a
+convenience, eliminating the need to parse the USB descriptors from
+userspace.
+
+ #include <stdio.h>
+ #include <fcntl.h>
+ #include <sys/ioctl.h>
+ #include <linux/types.h>
+ #include <linux/usb/cdc-wdm.h>
+ int main()
+ {
+ __u16 max;
+ int fd = open("/dev/cdc-wdm0", O_RDWR);
+ if (!ioctl(fd, IOCTL_WDM_MAX_COMMAND, &max))
+ printf("wMaxControlMessage is %d\n", max);
+ }
+
+
+Custom device services
+----------------------
+The MBIM specification allows vendors to freely define additional
+services. This is fully supported by the cdc_mbim driver.
+
+Support for new MBIM services, including vendor specified services, is
+implemented entirely in userspace, like the rest of the MBIM control
+protocol
+
+New services should be registered in the MBIM Registry [5].
+
+
+
+MBIM data channel userspace ABI
+===============================
+
+wwanY network device
+--------------------
+The cdc_mbim driver represents the MBIM data channel as a single
+network device of the "wwan" type. This network device is initially
+mapped to MBIM IP session 0.
+
+
+Multiplexed IP sessions (IPS)
+-----------------------------
+MBIM allows multiplexing up to 256 IP sessions over a single USB data
+channel. The cdc_mbim driver models such IP sessions as 802.1q VLAN
+subdevices of the master wwanY device, mapping MBIM IP session Z to
+VLAN ID Z for all values of Z greater than 0.
+
+The device maximum Z is given in the MBIM_DEVICE_CAPS_INFO structure
+described in section 10.5.1 of [1].
+
+The userspace management application is responsible for adding new
+VLAN links prior to establishing MBIM IP sessions where the SessionId
+is greater than 0. These links can be added by using the normal VLAN
+kernel interfaces, either ioctl or netlink.
+
+For example, adding a link for a MBIM IP session with SessionId 3:
+
+ ip link add link wwan0 name wwan0.3 type vlan id 3
+
+The driver will automatically map the "wwan0.3" network device to MBIM
+IP session 3.
+
+
+Device Service Streams (DSS)
+----------------------------
+MBIM also allows up to 256 non-IP data streams to be multiplexed over
+the same shared USB data channel. The cdc_mbim driver models these
+sessions as another set of 802.1q VLAN subdevices of the master wwanY
+device, mapping MBIM DSS session A to VLAN ID (256 + A) for all values
+of A.
+
+The device maximum A is given in the MBIM_DEVICE_SERVICES_INFO
+structure described in section 10.5.29 of [1].
+
+The DSS VLAN subdevices are used as a practical interface between the
+shared MBIM data channel and a MBIM DSS aware userspace application.
+It is not intended to be presented as-is to an end user. The
+assumption is that an userspace application initiating a DSS session
+also takes care of the necessary framing of the DSS data, presenting
+the stream to the end user in an appropriate way for the stream type.
+
+The network device ABI requires a dummy ethernet header for every DSS
+data frame being transported. The contents of this header is
+arbitrary, with the following exceptions:
+ - TX frames using an IP protocol (0x0800 or 0x86dd) will be dropped
+ - RX frames will have the protocol field set to ETH_P_802_3 (but will
+ not be properly formatted 802.3 frames)
+ - RX frames will have the destination address set to the hardware
+ address of the master device
+
+The DSS supporting userspace management application is responsible for
+adding the dummy ethernet header on TX and stripping it on RX.
+
+This is a simple example using tools commonly available, exporting
+DssSessionId 5 as a pty character device pointed to by a /dev/nmea
+symlink:
+
+ ip link add link wwan0 name wwan0.dss5 type vlan id 261
+ ip link set dev wwan0.dss5 up
+ socat INTERFACE:wwan0.dss5,type=2 PTY:,echo=0,link=/dev/nmea
+
+This is only an example, most suitable for testing out a DSS
+service. Userspace applications supporting specific MBIM DSS services
+are expected to use the tools and programming interfaces required by
+that service.
+
+Note that adding VLAN links for DSS sessions is entirely optional. A
+management application may instead choose to bind a packet socket
+directly to the master network device, using the received VLAN tags to
+map frames to the correct DSS session and adding 18 byte VLAN ethernet
+headers with the appropriate tag on TX. In this case using a socket
+filter is recommended, matching only the DSS VLAN subset. This avoid
+unnecessary copying of unrelated IP session data to userspace. For
+example:
+
+ static struct sock_filter dssfilter[] = {
+ /* use special negative offsets to get VLAN tag */
+ BPF_STMT(BPF_LD|BPF_B|BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG_PRESENT),
+ BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, 1, 0, 6), /* true */
+
+ /* verify DSS VLAN range */
+ BPF_STMT(BPF_LD|BPF_H|BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG),
+ BPF_JUMP(BPF_JMP|BPF_JGE|BPF_K, 256, 0, 4), /* 256 is first DSS VLAN */
+ BPF_JUMP(BPF_JMP|BPF_JGE|BPF_K, 512, 3, 0), /* 511 is last DSS VLAN */
+
+ /* verify ethertype */
+ BPF_STMT(BPF_LD|BPF_H|BPF_ABS, 2 * ETH_ALEN),
+ BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, ETH_P_802_3, 0, 1),
+
+ BPF_STMT(BPF_RET|BPF_K, (u_int)-1), /* accept */
+ BPF_STMT(BPF_RET|BPF_K, 0), /* ignore */
+ };
+
+
+
+Tagged IP session 0 VLAN
+------------------------
+As described above, MBIM IP session 0 is treated as special by the
+driver. It is initially mapped to untagged frames on the wwanY
+network device.
+
+This mapping implies a few restrictions on multiplexed IPS and DSS
+sessions, which may not always be practical:
+ - no IPS or DSS session can use a frame size greater than the MTU on
+ IP session 0
+ - no IPS or DSS session can be in the up state unless the network
+ device representing IP session 0 also is up
+
+These problems can be avoided by optionally making the driver map IP
+session 0 to a VLAN subdevice, similar to all other IP sessions. This
+behaviour is triggered by adding a VLAN link for the magic VLAN ID
+4094. The driver will then immediately start mapping MBIM IP session
+0 to this VLAN, and will drop untagged frames on the master wwanY
+device.
+
+Tip: It might be less confusing to the end user to name this VLAN
+subdevice after the MBIM SessionID instead of the VLAN ID. For
+example:
+
+ ip link add link wwan0 name wwan0.0 type vlan id 4094
+
+
+VLAN mapping
+------------
+
+Summarizing the cdc_mbim driver mapping described above, we have this
+relationship between VLAN tags on the wwanY network device and MBIM
+sessions on the shared USB data channel:
+
+ VLAN ID MBIM type MBIM SessionID Notes
+ ---------------------------------------------------------
+ untagged IPS 0 a)
+ 1 - 255 IPS 1 - 255 <VLANID>
+ 256 - 511 DSS 0 - 255 <VLANID - 256>
+ 512 - 4093 b)
+ 4094 IPS 0 c)
+
+ a) if no VLAN ID 4094 link exists, else dropped
+ b) unsupported VLAN range, unconditionally dropped
+ c) if a VLAN ID 4094 link exists, else dropped
+
+
+
+
+References
+==========
+
+[1] USB Implementers Forum, Inc. - "Universal Serial Bus
+ Communications Class Subclass Specification for Mobile Broadband
+ Interface Model", Revision 1.0 (Errata 1), May 1, 2013
+ - http://www.usb.org/developers/docs/devclass_docs/
+
+[2] USB Implementers Forum, Inc. - "Universal Serial Bus
+ Communications Class Subclass Specifications for Network Control
+ Model Devices", Revision 1.0 (Errata 1), November 24, 2010
+ - http://www.usb.org/developers/docs/devclass_docs/
+
+[3] libmbim - "a glib-based library for talking to WWAN modems and
+ devices which speak the Mobile Interface Broadband Model (MBIM)
+ protocol"
+ - http://www.freedesktop.org/wiki/Software/libmbim/
+
+[4] ModemManager - "a DBus-activated daemon which controls mobile
+ broadband (2G/3G/4G) devices and connections"
+ - http://www.freedesktop.org/wiki/Software/ModemManager/
+
+[5] "MBIM (Mobile Broadband Interface Model) Registry"
+ - http://compliance.usb.org/mbim/
+
+[6] "/proc/bus/usb filesystem output"
+ - Documentation/usb/proc_usb_info.txt
+
+[7] "/sys/bus/usb/devices/.../descriptors"
+ - Documentation/ABI/stable/sysfs-bus-usb
diff --git a/Documentation/networking/dccp.txt b/Documentation/networking/dccp.txt
index bf5dbe3ab8c5..55c575fcaf17 100644
--- a/Documentation/networking/dccp.txt
+++ b/Documentation/networking/dccp.txt
@@ -86,7 +86,7 @@ built-in CCIDs.
DCCP_SOCKOPT_CCID is write-only and sets both the TX and RX CCIDs at the same
time, combining the operation of the next two socket options. This option is
-preferrable over the latter two, since often applications will use the same
+preferable over the latter two, since often applications will use the same
type of CCID for both directions; and mixed use of CCIDs is not currently well
understood. This socket option takes as argument at least one uint8_t value, or
an array of uint8_t values, which must match available CCIDS (see above). CCIDs
diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index a06b48d2f5cc..d16f424c5e8d 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -277,10 +277,11 @@ Possible BPF extensions are shown in the following table:
mark skb->mark
queue skb->queue_mapping
hatype skb->dev->type
- rxhash skb->rxhash
+ rxhash skb->hash
cpu raw_smp_processor_id()
vlan_tci vlan_tx_tag_get(skb)
vlan_pr vlan_tx_tag_present(skb)
+ rand prandom_u32()
These extensions can also be prefixed with '#'.
Examples for low-level BPF:
@@ -308,6 +309,18 @@ Examples for low-level BPF:
ret #-1
drop: ret #0
+** icmp random packet sampling, 1 in 4
+ ldh [12]
+ jne #0x800, drop
+ ldb [23]
+ jneq #1, drop
+ # get a random uint32 number
+ ld rand
+ mod #4
+ jneq #1, drop
+ ret #-1
+ drop: ret #0
+
** SECCOMP filter example:
ld [4] /* offsetof(struct seccomp_data, arch) */
@@ -449,9 +462,9 @@ JIT compiler
------------
The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC, PowerPC,
-ARM and s390 and can be enabled through CONFIG_BPF_JIT. The JIT compiler is
-transparently invoked for each attached filter from user space or for internal
-kernel users if it has been previously enabled by root:
+ARM, MIPS and s390 and can be enabled through CONFIG_BPF_JIT. The JIT compiler
+is transparently invoked for each attached filter from user space or for
+internal kernel users if it has been previously enabled by root:
echo 1 > /proc/sys/net/core/bpf_jit_enable
@@ -546,6 +559,456 @@ ffffffffa0069c8f + <x>:
For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provides a useful
toolchain for developing and testing the kernel's JIT compiler.
+BPF kernel internals
+--------------------
+Internally, for the kernel interpreter, a different instruction set
+format with similar underlying principles from BPF described in previous
+paragraphs is being used. However, the instruction set format is modelled
+closer to the underlying architecture to mimic native instruction sets, so
+that a better performance can be achieved (more details later). This new
+ISA is called 'eBPF' or 'internal BPF' interchangeably. (Note: eBPF which
+originates from [e]xtended BPF is not the same as BPF extensions! While
+eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading'
+of BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.)
+
+It is designed to be JITed with one to one mapping, which can also open up
+the possibility for GCC/LLVM compilers to generate optimized eBPF code through
+an eBPF backend that performs almost as fast as natively compiled code.
+
+The new instruction set was originally designed with the possible goal in
+mind to write programs in "restricted C" and compile into eBPF with a optional
+GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with
+minimal performance overhead over two steps, that is, C -> eBPF -> native code.
+
+Currently, the new format is being used for running user BPF programs, which
+includes seccomp BPF, classic socket filters, cls_bpf traffic classifier,
+team driver's classifier for its load-balancing mode, netfilter's xt_bpf
+extension, PTP dissector/classifier, and much more. They are all internally
+converted by the kernel into the new instruction set representation and run
+in the eBPF interpreter. For in-kernel handlers, this all works transparently
+by using bpf_prog_create() for setting up the filter, resp.
+bpf_prog_destroy() for destroying it. The macro
+BPF_PROG_RUN(filter, ctx) transparently invokes eBPF interpreter or JITed
+code to run the filter. 'filter' is a pointer to struct bpf_prog that we
+got from bpf_prog_create(), and 'ctx' the given context (e.g.
+skb pointer). All constraints and restrictions from bpf_check_classic() apply
+before a conversion to the new layout is being done behind the scenes!
+
+Currently, the classic BPF format is being used for JITing on most of the
+architectures. Only x86-64 performs JIT compilation from eBPF instruction set,
+however, future work will migrate other JIT compilers as well, so that they
+will profit from the very same benefits.
+
+Some core changes of the new internal format:
+
+- Number of registers increase from 2 to 10:
+
+ The old format had two registers A and X, and a hidden frame pointer. The
+ new layout extends this to be 10 internal registers and a read-only frame
+ pointer. Since 64-bit CPUs are passing arguments to functions via registers
+ the number of args from eBPF program to in-kernel function is restricted
+ to 5 and one register is used to accept return value from an in-kernel
+ function. Natively, x86_64 passes first 6 arguments in registers, aarch64/
+ sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved
+ registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers.
+
+ Therefore, eBPF calling convention is defined as:
+
+ * R0 - return value from in-kernel function, and exit value for eBPF program
+ * R1 - R5 - arguments from eBPF program to in-kernel function
+ * R6 - R9 - callee saved registers that in-kernel function will preserve
+ * R10 - read-only frame pointer to access stack
+
+ Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64,
+ etc, and eBPF calling convention maps directly to ABIs used by the kernel on
+ 64-bit architectures.
+
+ On 32-bit architectures JIT may map programs that use only 32-bit arithmetic
+ and may let more complex programs to be interpreted.
+
+ R0 - R5 are scratch registers and eBPF program needs spill/fill them if
+ necessary across calls. Note that there is only one eBPF program (== one
+ eBPF main routine) and it cannot call other eBPF functions, it can only
+ call predefined in-kernel functions, though.
+
+- Register width increases from 32-bit to 64-bit:
+
+ Still, the semantics of the original 32-bit ALU operations are preserved
+ via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower
+ subregisters that zero-extend into 64-bit if they are being written to.
+ That behavior maps directly to x86_64 and arm64 subregister definition, but
+ makes other JITs more difficult.
+
+ 32-bit architectures run 64-bit internal BPF programs via interpreter.
+ Their JITs may convert BPF programs that only use 32-bit subregisters into
+ native instruction set and let the rest being interpreted.
+
+ Operation is 64-bit, because on 64-bit architectures, pointers are also
+ 64-bit wide, and we want to pass 64-bit values in/out of kernel functions,
+ so 32-bit eBPF registers would otherwise require to define register-pair
+ ABI, thus, there won't be able to use a direct eBPF register to HW register
+ mapping and JIT would need to do combine/split/move operations for every
+ register in and out of the function, which is complex, bug prone and slow.
+ Another reason is the use of atomic 64-bit counters.
+
+- Conditional jt/jf targets replaced with jt/fall-through:
+
+ While the original design has constructs such as "if (cond) jump_true;
+ else jump_false;", they are being replaced into alternative constructs like
+ "if (cond) jump_true; /* else fall-through */".
+
+- Introduces bpf_call insn and register passing convention for zero overhead
+ calls from/to other kernel functions:
+
+ Before an in-kernel function call, the internal BPF program needs to
+ place function arguments into R1 to R5 registers to satisfy calling
+ convention, then the interpreter will take them from registers and pass
+ to in-kernel function. If R1 - R5 registers are mapped to CPU registers
+ that are used for argument passing on given architecture, the JIT compiler
+ doesn't need to emit extra moves. Function arguments will be in the correct
+ registers and BPF_CALL instruction will be JITed as single 'call' HW
+ instruction. This calling convention was picked to cover common call
+ situations without performance penalty.
+
+ After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has
+ a return value of the function. Since R6 - R9 are callee saved, their state
+ is preserved across the call.
+
+ For example, consider three C functions:
+
+ u64 f1() { return (*_f2)(1); }
+ u64 f2(u64 a) { return f3(a + 1, a); }
+ u64 f3(u64 a, u64 b) { return a - b; }
+
+ GCC can compile f1, f3 into x86_64:
+
+ f1:
+ movl $1, %edi
+ movq _f2(%rip), %rax
+ jmp *%rax
+ f3:
+ movq %rdi, %rax
+ subq %rsi, %rax
+ ret
+
+ Function f2 in eBPF may look like:
+
+ f2:
+ bpf_mov R2, R1
+ bpf_add R1, 1
+ bpf_call f3
+ bpf_exit
+
+ If f2 is JITed and the pointer stored to '_f2'. The calls f1 -> f2 -> f3 and
+ returns will be seamless. Without JIT, __sk_run_filter() interpreter needs to
+ be used to call into f2.
+
+ For practical reasons all eBPF programs have only one argument 'ctx' which is
+ already placed into R1 (e.g. on __sk_run_filter() startup) and the programs
+ can call kernel functions with up to 5 arguments. Calls with 6 or more arguments
+ are currently not supported, but these restrictions can be lifted if necessary
+ in the future.
+
+ On 64-bit architectures all register map to HW registers one to one. For
+ example, x86_64 JIT compiler can map them as ...
+
+ R0 - rax
+ R1 - rdi
+ R2 - rsi
+ R3 - rdx
+ R4 - rcx
+ R5 - r8
+ R6 - rbx
+ R7 - r13
+ R8 - r14
+ R9 - r15
+ R10 - rbp
+
+ ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing
+ and rbx, r12 - r15 are callee saved.
+
+ Then the following internal BPF pseudo-program:
+
+ bpf_mov R6, R1 /* save ctx */
+ bpf_mov R2, 2
+ bpf_mov R3, 3
+ bpf_mov R4, 4
+ bpf_mov R5, 5
+ bpf_call foo
+ bpf_mov R7, R0 /* save foo() return value */
+ bpf_mov R1, R6 /* restore ctx for next call */
+ bpf_mov R2, 6
+ bpf_mov R3, 7
+ bpf_mov R4, 8
+ bpf_mov R5, 9
+ bpf_call bar
+ bpf_add R0, R7
+ bpf_exit
+
+ After JIT to x86_64 may look like:
+
+ push %rbp
+ mov %rsp,%rbp
+ sub $0x228,%rsp
+ mov %rbx,-0x228(%rbp)
+ mov %r13,-0x220(%rbp)
+ mov %rdi,%rbx
+ mov $0x2,%esi
+ mov $0x3,%edx
+ mov $0x4,%ecx
+ mov $0x5,%r8d
+ callq foo
+ mov %rax,%r13
+ mov %rbx,%rdi
+ mov $0x2,%esi
+ mov $0x3,%edx
+ mov $0x4,%ecx
+ mov $0x5,%r8d
+ callq bar
+ add %r13,%rax
+ mov -0x228(%rbp),%rbx
+ mov -0x220(%rbp),%r13
+ leaveq
+ retq
+
+ Which is in this example equivalent in C to:
+
+ u64 bpf_filter(u64 ctx)
+ {
+ return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9);
+ }
+
+ In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64
+ arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper
+ registers and place their return value into '%rax' which is R0 in eBPF.
+ Prologue and epilogue are emitted by JIT and are implicit in the
+ interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve
+ them across the calls as defined by calling convention.
+
+ For example the following program is invalid:
+
+ bpf_mov R1, 1
+ bpf_call foo
+ bpf_mov R0, R1
+ bpf_exit
+
+ After the call the registers R1-R5 contain junk values and cannot be read.
+ In the future an eBPF verifier can be used to validate internal BPF programs.
+
+Also in the new design, eBPF is limited to 4096 insns, which means that any
+program will terminate quickly and will only call a fixed number of kernel
+functions. Original BPF and the new format are two operand instructions,
+which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT.
+
+The input context pointer for invoking the interpreter function is generic,
+its content is defined by a specific use case. For seccomp register R1 points
+to seccomp_data, for converted BPF filters R1 points to a skb.
+
+A program, that is translated internally consists of the following elements:
+
+ op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32
+
+So far 87 internal BPF instructions were implemented. 8-bit 'op' opcode field
+has room for new instructions. Some of them may use 16/24/32 byte encoding. New
+instructions must be multiple of 8 bytes to preserve backward compatibility.
+
+Internal BPF is a general purpose RISC instruction set. Not every register and
+every instruction are used during translation from original BPF to new format.
+For example, socket filters are not using 'exclusive add' instruction, but
+tracing filters may do to maintain counters of events, for example. Register R9
+is not used by socket filters either, but more complex filters may be running
+out of registers and would have to resort to spill/fill to stack.
+
+Internal BPF can used as generic assembler for last step performance
+optimizations, socket filters and seccomp are using it as assembler. Tracing
+filters may use it as assembler to generate code from kernel. In kernel usage
+may not be bounded by security considerations, since generated internal BPF code
+may be optimizing internal code path and not being exposed to the user space.
+Safety of internal BPF can come from a verifier (TBD). In such use cases as
+described, it may be used as safe instruction set.
+
+Just like the original BPF, the new format runs within a controlled environment,
+is deterministic and the kernel can easily prove that. The safety of the program
+can be determined in two steps: first step does depth-first-search to disallow
+loops and other CFG validation; second step starts from the first insn and
+descends all possible paths. It simulates execution of every insn and observes
+the state change of registers and stack.
+
+eBPF opcode encoding
+--------------------
+
+eBPF is reusing most of the opcode encoding from classic to simplify conversion
+of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code'
+field is divided into three parts:
+
+ +----------------+--------+--------------------+
+ | 4 bits | 1 bit | 3 bits |
+ | operation code | source | instruction class |
+ +----------------+--------+--------------------+
+ (MSB) (LSB)
+
+Three LSB bits store instruction class which is one of:
+
+ Classic BPF classes: eBPF classes:
+
+ BPF_LD 0x00 BPF_LD 0x00
+ BPF_LDX 0x01 BPF_LDX 0x01
+ BPF_ST 0x02 BPF_ST 0x02
+ BPF_STX 0x03 BPF_STX 0x03
+ BPF_ALU 0x04 BPF_ALU 0x04
+ BPF_JMP 0x05 BPF_JMP 0x05
+ BPF_RET 0x06 [ class 6 unused, for future if needed ]
+ BPF_MISC 0x07 BPF_ALU64 0x07
+
+When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ...
+
+ BPF_K 0x00
+ BPF_X 0x08
+
+ * in classic BPF, this means:
+
+ BPF_SRC(code) == BPF_X - use register X as source operand
+ BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
+
+ * in eBPF, this means:
+
+ BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
+ BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
+
+... and four MSB bits store operation code.
+
+If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of:
+
+ BPF_ADD 0x00
+ BPF_SUB 0x10
+ BPF_MUL 0x20
+ BPF_DIV 0x30
+ BPF_OR 0x40
+ BPF_AND 0x50
+ BPF_LSH 0x60
+ BPF_RSH 0x70
+ BPF_NEG 0x80
+ BPF_MOD 0x90
+ BPF_XOR 0xa0
+ BPF_MOV 0xb0 /* eBPF only: mov reg to reg */
+ BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */
+ BPF_END 0xd0 /* eBPF only: endianness conversion */
+
+If BPF_CLASS(code) == BPF_JMP, BPF_OP(code) is one of:
+
+ BPF_JA 0x00
+ BPF_JEQ 0x10
+ BPF_JGT 0x20
+ BPF_JGE 0x30
+ BPF_JSET 0x40
+ BPF_JNE 0x50 /* eBPF only: jump != */
+ BPF_JSGT 0x60 /* eBPF only: signed '>' */
+ BPF_JSGE 0x70 /* eBPF only: signed '>=' */
+ BPF_CALL 0x80 /* eBPF only: function call */
+ BPF_EXIT 0x90 /* eBPF only: function return */
+
+So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF
+and eBPF. There are only two registers in classic BPF, so it means A += X.
+In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly,
+BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous
+src_reg = (u32) src_reg ^ (u32) imm32 in eBPF.
+
+Classic BPF is using BPF_MISC class to represent A = X and X = A moves.
+eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no
+BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean
+exactly the same operations as BPF_ALU, but with 64-bit wide operands
+instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.:
+dst_reg = dst_reg + src_reg
+
+Classic BPF wastes the whole BPF_RET class to represent a single 'ret'
+operation. Classic BPF_RET | BPF_K means copy imm32 into return register
+and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT
+in eBPF means function exit only. The eBPF program needs to store return
+value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is currently
+unused and reserved for future use.
+
+For load and store instructions the 8-bit 'code' field is divided as:
+
+ +--------+--------+-------------------+
+ | 3 bits | 2 bits | 3 bits |
+ | mode | size | instruction class |
+ +--------+--------+-------------------+
+ (MSB) (LSB)
+
+Size modifier is one of ...
+
+ BPF_W 0x00 /* word */
+ BPF_H 0x08 /* half word */
+ BPF_B 0x10 /* byte */
+ BPF_DW 0x18 /* eBPF only, double word */
+
+... which encodes size of load/store operation:
+
+ B - 1 byte
+ H - 2 byte
+ W - 4 byte
+ DW - 8 byte (eBPF only)
+
+Mode modifier is one of:
+
+ BPF_IMM 0x00 /* classic BPF only, reserved in eBPF */
+ BPF_ABS 0x20
+ BPF_IND 0x40
+ BPF_MEM 0x60
+ BPF_LEN 0x80 /* classic BPF only, reserved in eBPF */
+ BPF_MSH 0xa0 /* classic BPF only, reserved in eBPF */
+ BPF_XADD 0xc0 /* eBPF only, exclusive add */
+
+eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and
+(BPF_IND | <size> | BPF_LD) which are used to access packet data.
+
+They had to be carried over from classic to have strong performance of
+socket filters running in eBPF interpreter. These instructions can only
+be used when interpreter context is a pointer to 'struct sk_buff' and
+have seven implicit operands. Register R6 is an implicit input that must
+contain pointer to sk_buff. Register R0 is an implicit output which contains
+the data fetched from the packet. Registers R1-R5 are scratch registers
+and must not be used to store the data across BPF_ABS | BPF_LD or
+BPF_IND | BPF_LD instructions.
+
+These instructions have implicit program exit condition as well. When
+eBPF program is trying to access the data beyond the packet boundary,
+the interpreter will abort the execution of the program. JIT compilers
+therefore must preserve this property. src_reg and imm32 fields are
+explicit inputs to these instructions.
+
+For example:
+
+ BPF_IND | BPF_W | BPF_LD means:
+
+ R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32))
+ and R1 - R5 were scratched.
+
+Unlike classic BPF instruction set, eBPF has generic load/store operations:
+
+BPF_MEM | <size> | BPF_STX: *(size *) (dst_reg + off) = src_reg
+BPF_MEM | <size> | BPF_ST: *(size *) (dst_reg + off) = imm32
+BPF_MEM | <size> | BPF_LDX: dst_reg = *(size *) (src_reg + off)
+BPF_XADD | BPF_W | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg
+BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg
+
+Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
+2 byte atomic increments are not supported.
+
+Testing
+-------
+
+Next to the BPF toolchain, the kernel also ships a test module that contains
+various test cases for classic and internal BPF that can be executed against
+the BPF interpreter and JIT compiler. It can be found in lib/test_bpf.c and
+enabled via Kconfig:
+
+ CONFIG_TEST_BPF=m
+
+After the module has been built and installed, the test suite can be executed
+via insmod or modprobe against 'test_bpf' module. Results of the test cases
+including timings in nsec can be found in the kernel log (dmesg).
+
Misc
----
@@ -561,3 +1024,4 @@ the underlying architecture.
Jay Schulist <jschlst@samba.org>
Daniel Borkmann <dborkman@redhat.com>
+Alexei Starovoitov <ast@plumgrid.com>
diff --git a/Documentation/networking/gianfar.txt b/Documentation/networking/gianfar.txt
index ad474ea07d07..ba1daea7f2e4 100644
--- a/Documentation/networking/gianfar.txt
+++ b/Documentation/networking/gianfar.txt
@@ -1,38 +1,8 @@
The Gianfar Ethernet Driver
-Sysfs File description
Author: Andy Fleming <afleming@freescale.com>
Updated: 2005-07-28
-SYSFS
-
-Several of the features of the gianfar driver are controlled
-through sysfs files. These are:
-
-bd_stash:
-To stash RX Buffer Descriptors in the L2, echo 'on' or '1' to
-bd_stash, echo 'off' or '0' to disable
-
-rx_stash_len:
-To stash the first n bytes of the packet in L2, echo the number
-of bytes to buf_stash_len. echo 0 to disable.
-
-WARNING: You could really screw these up if you set them too low or high!
-fifo_threshold:
-To change the number of bytes the controller needs in the
-fifo before it starts transmission, echo the number of bytes to
-fifo_thresh. Range should be 0-511.
-
-fifo_starve:
-When the FIFO has less than this many bytes during a transmit, it
-enters starve mode, and increases the priority of TX memory
-transactions. To change, echo the number of bytes to
-fifo_starve. Range should be 0-511.
-
-fifo_starve_off:
-Once in starve mode, the FIFO remains there until it has this
-many bytes. To change, echo the number of bytes to
-fifo_starve_off. Range should be 0-511.
CHECKSUM OFFLOADING
diff --git a/Documentation/networking/i40e.txt b/Documentation/networking/i40e.txt
index f737273c6dc1..a251bf4fe9c9 100644
--- a/Documentation/networking/i40e.txt
+++ b/Documentation/networking/i40e.txt
@@ -69,8 +69,11 @@ Additional Configurations
FCoE
----
- Fiber Channel over Ethernet (FCoE) hardware offload is not currently
- supported.
+ The driver supports Fiber Channel over Ethernet (FCoE) and Data Center
+ Bridging (DCB) functionality. Configuring DCB and FCoE is outside the scope
+ of this driver doc. Refer to http://www.open-fcoe.org/ for FCoE project
+ information and http://www.open-lldp.org/ or email list
+ e1000-eedc@lists.sourceforge.net for DCB information.
MAC and VLAN anti-spoofing feature
----------------------------------
diff --git a/Documentation/networking/igb.txt b/Documentation/networking/igb.txt
index 4ebbd659256f..43d3549366a0 100644
--- a/Documentation/networking/igb.txt
+++ b/Documentation/networking/igb.txt
@@ -36,54 +36,6 @@ Default Value: 0
This parameter adds support for SR-IOV. It causes the driver to spawn up to
max_vfs worth of virtual function.
-QueuePairs
-----------
-Valid Range: 0-1
-Default Value: 1 (TX and RX will be paired onto one interrupt vector)
-
-If set to 0, when MSI-X is enabled, the TX and RX will attempt to occupy
-separate vectors.
-
-This option can be overridden to 1 if there are not sufficient interrupts
-available. This can occur if any combination of RSS, VMDQ, and max_vfs
-results in more than 4 queues being used.
-
-Node
-----
-Valid Range: 0-n
-Default Value: -1 (off)
-
- 0 - n: where n is the number of the NUMA node that should be used to
- allocate memory for this adapter port.
- -1: uses the driver default of allocating memory on whichever processor is
- running insmod/modprobe.
-
- The Node parameter will allow you to pick which NUMA node you want to have
- the adapter allocate memory from. All driver structures, in-memory queues,
- and receive buffers will be allocated on the node specified. This parameter
- is only useful when interrupt affinity is specified, otherwise some portion
- of the time the interrupt could run on a different core than the memory is
- allocated on, causing slower memory access and impacting throughput, CPU, or
- both.
-
-EEE
----
-Valid Range: 0-1
-Default Value: 1 (enabled)
-
- A link between two EEE-compliant devices will result in periodic bursts of
- data followed by long periods where in the link is in an idle state. This Low
- Power Idle (LPI) state is supported in both 1Gbps and 100Mbps link speeds.
- NOTE: EEE support requires autonegotiation.
-
-DMAC
-----
-Valid Range: 0-1
-Default Value: 1 (enabled)
- Enables or disables DMA Coalescing feature.
-
-
-
Additional Configurations
=========================
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index ea8f3b182e70..caedb18d4564 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -101,19 +101,17 @@ ipfrag_high_thresh - INTEGER
Maximum memory used to reassemble IP fragments. When
ipfrag_high_thresh bytes of memory is allocated for this purpose,
the fragment handler will toss packets until ipfrag_low_thresh
- is reached.
+ is reached. This also serves as a maximum limit to namespaces
+ different from the initial one.
ipfrag_low_thresh - INTEGER
- See ipfrag_high_thresh
+ Maximum memory used to reassemble IP fragments before the kernel
+ begins to remove incomplete fragment queues to free up resources.
+ The kernel still accepts new fragments for defragmentation.
ipfrag_time - INTEGER
Time in seconds to keep an IP fragment in memory.
-ipfrag_secret_interval - INTEGER
- Regeneration interval (in seconds) of the hash secret (or lifetime
- for the hash secret) for IP fragments.
- Default: 600
-
ipfrag_max_dist - INTEGER
ipfrag_max_dist is a non-negative integer value which defines the
maximum "disorder" which is allowed among fragments which share a
@@ -1126,6 +1124,15 @@ flowlabel_consistency - BOOLEAN
FALSE: disabled
Default: TRUE
+auto_flowlabels - BOOLEAN
+ Automatically generate flow labels based based on a flow hash
+ of the packet. This allows intermediate devices, such as routers,
+ to idenfify packet flows for mechanisms like Equal Cost Multipath
+ Routing (see RFC 6438).
+ TRUE: enabled
+ FALSE: disabled
+ Default: false
+
anycast_src_echo_reply - BOOLEAN
Controls the use of anycast addresses as source addresses for ICMPv6
echo reply
@@ -1147,11 +1154,6 @@ ip6frag_low_thresh - INTEGER
ip6frag_time - INTEGER
Time in seconds to keep an IPv6 fragment in memory.
-ip6frag_secret_interval - INTEGER
- Regeneration interval (in seconds) of the hash secret (or lifetime
- for the hash secret) for IPv6 fragments.
- Default: 600
-
conf/default/*:
Change the interface-specific default settings.
@@ -1204,6 +1206,18 @@ accept_ra_defrtr - BOOLEAN
Functional default: enabled if accept_ra is enabled.
disabled if accept_ra is disabled.
+accept_ra_from_local - BOOLEAN
+ Accept RA with source-address that is found on local machine
+ if the RA is otherwise proper and able to be accepted.
+ Default is to NOT accept these as it may be an un-intended
+ network loop.
+
+ Functional default:
+ enabled if accept_ra_from_local is enabled
+ on a specific interface.
+ disabled if accept_ra_from_local is disabled
+ on a specific interface.
+
accept_ra_pinfo - BOOLEAN
Learn Prefix Information in Router Advertisement.
diff --git a/Documentation/networking/netlink_mmap.txt b/Documentation/networking/netlink_mmap.txt
index b26122973525..c6af4bac5aa8 100644
--- a/Documentation/networking/netlink_mmap.txt
+++ b/Documentation/networking/netlink_mmap.txt
@@ -226,9 +226,9 @@ Ring setup:
void *rx_ring, *tx_ring;
/* Configure ring parameters */
- if (setsockopt(fd, NETLINK_RX_RING, &req, sizeof(req)) < 0)
+ if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0)
exit(1);
- if (setsockopt(fd, NETLINK_TX_RING, &req, sizeof(req)) < 0)
+ if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0)
exit(1)
/* Calculate size of each individual ring */
diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index 6fea79efb4cb..a6d7cb91069e 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -578,7 +578,7 @@ processes. This also works in combination with mmap(2) on packet sockets.
Currently implemented fanout policies are:
- - PACKET_FANOUT_HASH: schedule to socket by skb's rxhash
+ - PACKET_FANOUT_HASH: schedule to socket by skb's packet hash
- PACKET_FANOUT_LB: schedule to socket by round-robin
- PACKET_FANOUT_CPU: schedule to socket by CPU packet arrives on
- PACKET_FANOUT_RND: schedule to socket by random selection
@@ -1008,14 +1008,9 @@ hardware timestamps to be used. Note: you may need to enable the generation
of hardware timestamps with SIOCSHWTSTAMP (see related information from
Documentation/networking/timestamping.txt).
-PACKET_TIMESTAMP accepts the same integer bit field as
-SO_TIMESTAMPING. However, only the SOF_TIMESTAMPING_SYS_HARDWARE
-and SOF_TIMESTAMPING_RAW_HARDWARE values are recognized by
-PACKET_TIMESTAMP. SOF_TIMESTAMPING_SYS_HARDWARE takes precedence over
-SOF_TIMESTAMPING_RAW_HARDWARE if both bits are set.
+PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING:
- int req = 0;
- req |= SOF_TIMESTAMPING_SYS_HARDWARE;
+ int req = SOF_TIMESTAMPING_RAW_HARDWARE;
setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req))
For the mmap(2)ed ring buffers, such timestamps are stored in the
@@ -1023,14 +1018,13 @@ tpacket{,2,3}_hdr structure's tp_sec and tp_{n,u}sec members. To determine
what kind of timestamp has been reported, the tp_status field is binary |'ed
with the following possible bits ...
- TP_STATUS_TS_SYS_HARDWARE
TP_STATUS_TS_RAW_HARDWARE
TP_STATUS_TS_SOFTWARE
... that are equivalent to its SOF_TIMESTAMPING_* counterparts. For the
-RX_RING, if none of those 3 are set (i.e. PACKET_TIMESTAMP is not set),
-then this means that a software fallback was invoked *within* PF_PACKET's
-processing code (less precise).
+RX_RING, if neither is set (i.e. PACKET_TIMESTAMP is not set), then a
+software fallback was invoked *within* PF_PACKET's processing code (less
+precise).
Getting timestamps for the TX_RING works as follows: i) fill the ring frames,
ii) call sendto() e.g. in blocking mode, iii) wait for status of relevant
diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt
index ebf270719402..e839e7efc835 100644
--- a/Documentation/networking/phy.txt
+++ b/Documentation/networking/phy.txt
@@ -48,7 +48,7 @@ The MDIO bus
time, so it is safe for them to block, waiting for an interrupt to signal
the operation is complete
- 2) A reset function is necessary. This is used to return the bus to an
+ 2) A reset function is optional. This is used to return the bus to an
initialized state.
3) A probe function is needed. This function should set up anything the bus
@@ -253,16 +253,27 @@ Writing a PHY driver
Each driver consists of a number of function pointers:
+ soft_reset: perform a PHY software reset
config_init: configures PHY into a sane state after a reset.
For instance, a Davicom PHY requires descrambling disabled.
probe: Allocate phy->priv, optionally refuse to bind.
PHY may not have been reset or had fixups run yet.
suspend/resume: power management
config_aneg: Changes the speed/duplex/negotiation settings
+ aneg_done: Determines the auto-negotiation result
read_status: Reads the current speed/duplex/negotiation settings
ack_interrupt: Clear a pending interrupt
+ did_interrupt: Checks if the PHY generated an interrupt
config_intr: Enable or disable interrupts
remove: Does any driver take-down
+ ts_info: Queries about the HW timestamping status
+ hwtstamp: Set the PHY HW timestamping configuration
+ rxtstamp: Requests a receive timestamp at the PHY level for a 'skb'
+ txtsamp: Requests a transmit timestamp at the PHY level for a 'skb'
+ set_wol: Enable Wake-on-LAN at the PHY level
+ get_wol: Get the Wake-on-LAN status at the PHY level
+ read_mmd_indirect: Read PHY MMD indirect register
+ write_mmd_indirect: Write PHY MMD indirect register
Of these, only config_aneg and read_status are required to be
assigned by the driver code. The rest are optional. Also, it is
@@ -275,7 +286,21 @@ Writing a PHY driver
Feel free to look at the Marvell, Cicada, and Davicom drivers in
drivers/net/phy/ for examples (the lxt and qsemi drivers have
- not been tested as of this writing)
+ not been tested as of this writing).
+
+ The PHY's MMD register accesses are handled by the PAL framework
+ by default, but can be overridden by a specific PHY driver if
+ required. This could be the case if a PHY was released for
+ manufacturing before the MMD PHY register definitions were
+ standardized by the IEEE. Most modern PHYs will be able to use
+ the generic PAL framework for accessing the PHY's MMD registers.
+ An example of such usage is for Energy Efficient Ethernet support,
+ implemented in the PAL. This support uses the PAL to access MMD
+ registers for EEE query and configuration if the PHY supports
+ the IEEE standard access mechanisms, or can use the PHY's specific
+ access interfaces if overridden by the specific PHY driver. See
+ the Micrel driver in drivers/net/phy/ for an example of how this
+ can be implemented.
Board Fixups
diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.txt
index 5a61a240a652..0dffc6e37902 100644
--- a/Documentation/networking/pktgen.txt
+++ b/Documentation/networking/pktgen.txt
@@ -24,6 +24,34 @@ For monitoring and control pktgen creates:
/proc/net/pktgen/ethX
+Tuning NIC for max performance
+==============================
+
+The default NIC setting are (likely) not tuned for pktgen's artificial
+overload type of benchmarking, as this could hurt the normal use-case.
+
+Specifically increasing the TX ring buffer in the NIC:
+ # ethtool -G ethX tx 1024
+
+A larger TX ring can improve pktgen's performance, while it can hurt
+in the general case, 1) because the TX ring buffer might get larger
+than the CPUs L1/L2 cache, 2) because it allow more queueing in the
+NIC HW layer (which is bad for bufferbloat).
+
+One should be careful to conclude, that packets/descriptors in the HW
+TX ring cause delay. Drivers usually delay cleaning up the
+ring-buffers (for various performance reasons), thus packets stalling
+the TX ring, might just be waiting for cleanup.
+
+This cleanup issues is specifically the case, for the driver ixgbe
+(Intel 82599 chip). This driver (ixgbe) combine TX+RX ring cleanups,
+and the cleanup interval is affected by the ethtool --coalesce setting
+of parameter "rx-usecs".
+
+For ixgbe use e.g "30" resulting in approx 33K interrupts/sec (1/30*10^6):
+ # ethtool -C ethX rx-usecs 30
+
+
Viewing threads
===============
/proc/net/pktgen/kpktgend_0
@@ -102,13 +130,18 @@ Examples:
The 'minimum' MAC is what you set with dstmac.
pgset "flag [name]" Set a flag to determine behaviour. Current flags
- are: IPSRC_RND #IP Source is random (between min/max),
- IPDST_RND, UDPSRC_RND,
- UDPDST_RND, MACSRC_RND, MACDST_RND
+ are: IPSRC_RND # IP source is random (between min/max)
+ IPDST_RND # IP destination is random
+ UDPSRC_RND, UDPDST_RND,
+ MACSRC_RND, MACDST_RND
+ TXSIZE_RND, IPV6,
MPLS_RND, VID_RND, SVID_RND
+ FLOW_SEQ,
QUEUE_MAP_RND # queue map random
QUEUE_MAP_CPU # queue map mirrors smp_processor_id()
- IPSEC # Make IPsec encapsulation for packet
+ UDPCSUM,
+ IPSEC # IPsec encapsulation (needs CONFIG_XFRM)
+ NODE_ALLOC # node specific memory allocation
pgset spi SPI_VALUE Set specific SA used to transform packet.
@@ -233,13 +266,22 @@ udp_dst_max
flag
IPSRC_RND
- TXSIZE_RND
IPDST_RND
UDPSRC_RND
UDPDST_RND
MACSRC_RND
MACDST_RND
+ TXSIZE_RND
+ IPV6
+ MPLS_RND
+ VID_RND
+ SVID_RND
+ FLOW_SEQ
+ QUEUE_MAP_RND
+ QUEUE_MAP_CPU
+ UDPCSUM
IPSEC
+ NODE_ALLOC
dst_min
dst_max
diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt
index b89bc82eed46..16a924c486bf 100644
--- a/Documentation/networking/rxrpc.txt
+++ b/Documentation/networking/rxrpc.txt
@@ -27,6 +27,8 @@ Contents of this document:
(*) AF_RXRPC kernel interface.
+ (*) Configurable parameters.
+
========
OVERVIEW
@@ -864,3 +866,82 @@ The kernel interface functions are as follows:
This is used to allocate a null RxRPC key that can be used to indicate
anonymous security for a particular domain.
+
+
+=======================
+CONFIGURABLE PARAMETERS
+=======================
+
+The RxRPC protocol driver has a number of configurable parameters that can be
+adjusted through sysctls in /proc/net/rxrpc/:
+
+ (*) req_ack_delay
+
+ The amount of time in milliseconds after receiving a packet with the
+ request-ack flag set before we honour the flag and actually send the
+ requested ack.
+
+ Usually the other side won't stop sending packets until the advertised
+ reception window is full (to a maximum of 255 packets), so delaying the
+ ACK permits several packets to be ACK'd in one go.
+
+ (*) soft_ack_delay
+
+ The amount of time in milliseconds after receiving a new packet before we
+ generate a soft-ACK to tell the sender that it doesn't need to resend.
+
+ (*) idle_ack_delay
+
+ The amount of time in milliseconds after all the packets currently in the
+ received queue have been consumed before we generate a hard-ACK to tell
+ the sender it can free its buffers, assuming no other reason occurs that
+ we would send an ACK.
+
+ (*) resend_timeout
+
+ The amount of time in milliseconds after transmitting a packet before we
+ transmit it again, assuming no ACK is received from the receiver telling
+ us they got it.
+
+ (*) max_call_lifetime
+
+ The maximum amount of time in seconds that a call may be in progress
+ before we preemptively kill it.
+
+ (*) dead_call_expiry
+
+ The amount of time in seconds before we remove a dead call from the call
+ list. Dead calls are kept around for a little while for the purpose of
+ repeating ACK and ABORT packets.
+
+ (*) connection_expiry
+
+ The amount of time in seconds after a connection was last used before we
+ remove it from the connection list. Whilst a connection is in existence,
+ it serves as a placeholder for negotiated security; when it is deleted,
+ the security must be renegotiated.
+
+ (*) transport_expiry
+
+ The amount of time in seconds after a transport was last used before we
+ remove it from the transport list. Whilst a transport is in existence, it
+ serves to anchor the peer data and keeps the connection ID counter.
+
+ (*) rxrpc_rx_window_size
+
+ The size of the receive window in packets. This is the maximum number of
+ unconsumed received packets we're willing to hold in memory for any
+ particular call.
+
+ (*) rxrpc_rx_mtu
+
+ The maximum packet MTU size that we're willing to receive in bytes. This
+ indicates to the peer whether we're willing to accept jumbo packets.
+
+ (*) rxrpc_rx_jumbo_max
+
+ The maximum number of packets that we're willing to accept in a jumbo
+ packet. Non-terminal packets in a jumbo packet must contain a four byte
+ header plus exactly 1412 bytes of data. The terminal packet must contain
+ a four byte header plus any amount of data. In any event, a jumbo packet
+ may not exceed rxrpc_rx_mtu in size.
diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
index ca6977f5b2ed..99ca40e8e810 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -429,7 +429,7 @@ RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
(therbert@google.com)
Accelerated RFS was introduced in 2.6.35. Original patches were
-submitted by Ben Hutchings (bhutchings@solarflare.com)
+submitted by Ben Hutchings (bwh@kernel.org)
Authors:
Tom Herbert (therbert@google.com)
diff --git a/Documentation/networking/spider_net.txt b/Documentation/networking/spider_net.txt
index 4b4adb8eb14f..b0b75f8463b3 100644
--- a/Documentation/networking/spider_net.txt
+++ b/Documentation/networking/spider_net.txt
@@ -73,7 +73,7 @@ Thus, in an idle system, the GDACTDPA, tail and head pointers will
all be pointing at the same descr, which should be "empty". All of the
other descrs in the ring should be "empty" as well.
-The show_rx_chain() routine will print out the the locations of the
+The show_rx_chain() routine will print out the locations of the
GDACTDPA, tail and head pointers. It will also summarize the contents
of the ring, starting at the tail pointer, and listing the status
of the descrs that follow.
diff --git a/Documentation/networking/tcp.txt b/Documentation/networking/tcp.txt
index 7d11bb5dc30a..bdc4c0db51e1 100644
--- a/Documentation/networking/tcp.txt
+++ b/Documentation/networking/tcp.txt
@@ -30,7 +30,7 @@ A congestion control mechanism can be registered through functions in
tcp_cong.c. The functions used by the congestion control mechanism are
registered via passing a tcp_congestion_ops struct to
tcp_register_congestion_control. As a minimum name, ssthresh,
-cong_avoid, min_cwnd must be valid.
+cong_avoid must be valid.
Private data for a congestion control mechanism is stored in tp->ca_priv.
tcp_ca(tp) returns a pointer to this space. This is preallocated space - it
diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt
index 048c92b487f6..897f942b976b 100644
--- a/Documentation/networking/timestamping.txt
+++ b/Documentation/networking/timestamping.txt
@@ -40,7 +40,7 @@ the set bits correspond to data that is available, then the control
message will not be generated:
SOF_TIMESTAMPING_SOFTWARE: report systime if available
-SOF_TIMESTAMPING_SYS_HARDWARE: report hwtimetrans if available
+SOF_TIMESTAMPING_SYS_HARDWARE: report hwtimetrans if available (deprecated)
SOF_TIMESTAMPING_RAW_HARDWARE: report hwtimeraw if available
It is worth noting that timestamps may be collected for reasons other
@@ -88,13 +88,12 @@ hwtimeraw is the original hardware time stamp. Filled in if
SOF_TIMESTAMPING_RAW_HARDWARE is set. No assumptions about its
relation to system time should be made.
-hwtimetrans is the hardware time stamp transformed so that it
-corresponds as good as possible to system time. This correlation is
-not perfect; as a consequence, sorting packets received via different
-NICs by their hwtimetrans may differ from the order in which they were
-received. hwtimetrans may be non-monotonic even for the same NIC.
-Filled in if SOF_TIMESTAMPING_SYS_HARDWARE is set. Requires support
-by the network device and will be empty without that support.
+hwtimetrans is always zero. This field is deprecated. It used to hold
+hw timestamps converted to system time. Instead, expose the hardware
+clock device on the NIC directly as a HW PTP clock source, to allow
+time conversion in userspace and optionally synchronize system time
+with a userspace PTP stack such as linuxptp. For the PTP clock API,
+see Documentation/ptp/ptp.txt.
SIOCSHWTSTAMP, SIOCGHWTSTAMP:
@@ -185,7 +184,6 @@ struct skb_shared_hwtstamps {
* since arbitrary point in time
*/
ktime_t hwtstamp;
- ktime_t syststamp; /* hwtstamp transformed to system time base */
};
Time stamps for outgoing packets are to be generated as follows:
@@ -202,6 +200,9 @@ Time stamps for outgoing packets are to be generated as follows:
and not free the skb. A driver not supporting hardware time stamping doesn't
do that. A driver must never touch sk_buff::tstamp! It is used to store
software generated time stamps by the network subsystem.
+- Driver should call skb_tx_timestamp() as close to passing sk_buff to hardware
+ as possible. skb_tx_timestamp() provides a software time stamp if requested
+ and hardware timestamping is not possible (SKBTX_IN_PROGRESS not set).
- As soon as the driver has sent the packet and/or obtained a
hardware time stamp for it, it passes the time stamp back by
calling skb_hwtstamp_tx() with the original skb, the raw
@@ -212,6 +213,3 @@ Time stamps for outgoing packets are to be generated as follows:
this would occur at a later time in the processing pipeline than other
software time stamping and therefore could lead to unexpected deltas
between time stamps.
-- If the driver did not set the SKBTX_IN_PROGRESS flag (see above), then
- dev_hard_start_xmit() checks whether software time stamping
- is wanted as fallback and potentially generates the time stamp.
diff --git a/Documentation/networking/timestamping/timestamping.c b/Documentation/networking/timestamping/timestamping.c
index 8ba82bfe6a33..5cdfd743447b 100644
--- a/Documentation/networking/timestamping/timestamping.c
+++ b/Documentation/networking/timestamping/timestamping.c
@@ -76,7 +76,6 @@ static void usage(const char *error)
" SOF_TIMESTAMPING_RX_HARDWARE - hardware time stamping of incoming packets\n"
" SOF_TIMESTAMPING_RX_SOFTWARE - software fallback for incoming packets\n"
" SOF_TIMESTAMPING_SOFTWARE - request reporting of software time stamps\n"
- " SOF_TIMESTAMPING_SYS_HARDWARE - request reporting of transformed HW time stamps\n"
" SOF_TIMESTAMPING_RAW_HARDWARE - request reporting of raw HW time stamps\n"
" SIOCGSTAMP - check last socket time stamp\n"
" SIOCGSTAMPNS - more accurate socket time stamp\n");
@@ -202,9 +201,7 @@ static void printpacket(struct msghdr *msg, int res,
(long)stamp->tv_sec,
(long)stamp->tv_nsec);
stamp++;
- printf("HW transformed %ld.%09ld ",
- (long)stamp->tv_sec,
- (long)stamp->tv_nsec);
+ /* skip deprecated HW transformed */
stamp++;
printf("HW raw %ld.%09ld",
(long)stamp->tv_sec,
@@ -361,8 +358,6 @@ int main(int argc, char **argv)
so_timestamping_flags |= SOF_TIMESTAMPING_RX_SOFTWARE;
else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_SOFTWARE"))
so_timestamping_flags |= SOF_TIMESTAMPING_SOFTWARE;
- else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_SYS_HARDWARE"))
- so_timestamping_flags |= SOF_TIMESTAMPING_SYS_HARDWARE;
else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_RAW_HARDWARE"))
so_timestamping_flags |= SOF_TIMESTAMPING_RAW_HARDWARE;
else