diff options
Diffstat (limited to 'Documentation/networking')
58 files changed, 3166 insertions, 2446 deletions
diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst index 31cfd7d674a6..96cd7a26f3d9 100644 --- a/Documentation/networking/bonding.rst +++ b/Documentation/networking/bonding.rst @@ -196,11 +196,12 @@ ad_actor_sys_prio ad_actor_system In an AD system, this specifies the mac-address for the actor in - protocol packet exchanges (LACPDUs). The value cannot be NULL or - multicast. It is preferred to have the local-admin bit set for this - mac but driver does not enforce it. If the value is not given then - system defaults to using the masters' mac address as actors' system - address. + protocol packet exchanges (LACPDUs). The value cannot be a multicast + address. If the all-zeroes MAC is specified, bonding will internally + use the MAC of the bond itself. It is preferred to have the + local-admin bit set for this mac but driver does not enforce it. If + the value is not given then system defaults to using the masters' + mac address as actors' system address. This parameter has effect only in 802.3ad mode and is available through SysFs interface. @@ -312,6 +313,17 @@ arp_ip_target maximum number of targets that can be specified is 16. The default value is no IP addresses. +ns_ip6_target + + Specifies the IPv6 addresses to use as IPv6 monitoring peers when + arp_interval is > 0. These are the targets of the NS request + sent to determine the health of the link to the targets. + Specify these values in ffff:ffff::ffff:ffff format. Multiple IPv6 + addresses must be separated by a comma. At least one IPv6 + address must be given for NS/NA monitoring to function. The + maximum number of targets that can be specified is 16. The + default value is no IPv6 addresses. + arp_validate Specifies whether or not ARP probes and replies should be @@ -421,6 +433,17 @@ arp_all_targets consider the slave up only when all of the arp_ip_targets are reachable +arp_missed_max + + Specifies the number of arp_interval monitor checks that must + fail in order for an interface to be marked down by the ARP monitor. + + In order to provide orderly failover semantics, backup interfaces + are permitted an extra monitor check (i.e., they must fail + arp_missed_max + 1 times before being marked down). + + The default value is 2, and the allowable range is 1 - 255. + downdelay Specifies the time, in milliseconds, to wait before disabling @@ -757,6 +780,17 @@ peer_notif_delay value is 0 which means to match the value of the link monitor interval. +prio + Slave priority. A higher number means higher priority. + The primary slave has the highest priority. This option also + follows the primary_reselect rules. + + This option could only be configured via netlink, and is only valid + for active-backup(1), balance-tlb (5) and balance-alb (6) mode. + The valid value range is a signed 32 bit integer. + + The default value is 0. + primary A string (eth0, eth2, etc) specifying which slave is the @@ -812,7 +846,7 @@ primary_reselect tlb_dynamic_lb Specifies if dynamic shuffling of flows is enabled in tlb - mode. The value has no effect on any other modes. + or alb mode. The value has no effect on any other modes. The default behavior of tlb mode is to shuffle active flows across slaves based on the load in that interval. This gives nice lb @@ -871,7 +905,7 @@ xmit_hash_policy Uses XOR of hardware MAC addresses and packet type ID field to generate the hash. The formula is - hash = source MAC XOR destination MAC XOR packet type ID + hash = source MAC[5] XOR destination MAC[5] XOR packet type ID slave number = hash modulo slave count This algorithm will place all traffic to a particular @@ -887,7 +921,7 @@ xmit_hash_policy Uses XOR of hardware MAC addresses and IP addresses to generate the hash. The formula is - hash = source MAC XOR destination MAC XOR packet type ID + hash = source MAC[5] XOR destination MAC[5] XOR packet type ID hash = hash XOR source IP XOR destination IP hash = hash XOR (hash RSHIFT 16) hash = hash XOR (hash RSHIFT 8) @@ -1948,15 +1982,6 @@ uses the response as an indication that the link is operating. This gives some assurance that traffic is actually flowing to and from one or more peers on the local network. -The ARP monitor relies on the device driver itself to verify -that traffic is flowing. In particular, the driver must keep up to -date the last receive time, dev->last_rx. Drivers that use NETIF_F_LLTX -flag must also update netdev_queue->trans_start. If they do not, then the -ARP monitor will immediately fail any slaves using that driver, and -those slaves will stay down. If networking monitoring (tcpdump, etc) -shows the ARP requests and replies on the network, then it may be that -your device driver is not updating last_rx and trans_start. - 7.2 Configuring Multiple ARP Targets ------------------------------------ diff --git a/Documentation/networking/can.rst b/Documentation/networking/can.rst index f34cb0e4460e..ebc822e605f5 100644 --- a/Documentation/networking/can.rst +++ b/Documentation/networking/can.rst @@ -168,7 +168,7 @@ reflect the correct [#f1]_ traffic on the node the loopback of the sent data has to be performed right after a successful transmission. If the CAN network interface is not capable of performing the loopback for some reason the SocketCAN core can do this task as a fallback solution. -See :ref:`socketcan-local-loopback1` for details (recommended). +See :ref:`socketcan-local-loopback2` for details (recommended). The loopback functionality is enabled by default to reflect standard networking behaviour for CAN applications. Due to some requests from diff --git a/Documentation/networking/decnet.rst b/Documentation/networking/decnet.rst deleted file mode 100644 index b8bc11ff8370..000000000000 --- a/Documentation/networking/decnet.rst +++ /dev/null @@ -1,243 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -========================================= -Linux DECnet Networking Layer Information -========================================= - -1. Other documentation.... -========================== - - - Project Home Pages - - http://www.chygwyn.com/ - Kernel info - - http://linux-decnet.sourceforge.net/ - Userland tools - - http://www.sourceforge.net/projects/linux-decnet/ - Status page - -2. Configuring the kernel -========================= - -Be sure to turn on the following options: - - - CONFIG_DECNET (obviously) - - CONFIG_PROC_FS (to see what's going on) - - CONFIG_SYSCTL (for easy configuration) - -if you want to try out router support (not properly debugged yet) -you'll need the following options as well... - - - CONFIG_DECNET_ROUTER (to be able to add/delete routes) - - CONFIG_NETFILTER (will be required for the DECnet routing daemon) - -Don't turn on SIOCGIFCONF support for DECnet unless you are really sure -that you need it, in general you won't and it can cause ifconfig to -malfunction. - -Run time configuration has changed slightly from the 2.4 system. If you -want to configure an endnode, then the simplified procedure is as follows: - - - Set the MAC address on your ethernet card before starting _any_ other - network protocols. - -As soon as your network card is brought into the UP state, DECnet should -start working. If you need something more complicated or are unsure how -to set the MAC address, see the next section. Also all configurations which -worked with 2.4 will work under 2.5 with no change. - -3. Command line options -======================= - -You can set a DECnet address on the kernel command line for compatibility -with the 2.4 configuration procedure, but in general it's not needed any more. -If you do st a DECnet address on the command line, it has only one purpose -which is that its added to the addresses on the loopback device. - -With 2.4 kernels, DECnet would only recognise addresses as local if they -were added to the loopback device. In 2.5, any local interface address -can be used to loop back to the local machine. Of course this does not -prevent you adding further addresses to the loopback device if you -want to. - -N.B. Since the address list of an interface determines the addresses for -which "hello" messages are sent, if you don't set an address on the loopback -interface then you won't see any entries in /proc/net/neigh for the local -host until such time as you start a connection. This doesn't affect the -operation of the local communications in any other way though. - -The kernel command line takes options looking like the following:: - - decnet.addr=1,2 - -the two numbers are the node address 1,2 = 1.2 For 2.2.xx kernels -and early 2.3.xx kernels, you must use a comma when specifying the -DECnet address like this. For more recent 2.3.xx kernels, you may -use almost any character except space, although a `.` would be the most -obvious choice :-) - -There used to be a third number specifying the node type. This option -has gone away in favour of a per interface node type. This is now set -using /proc/sys/net/decnet/conf/<dev>/forwarding. This file can be -set with a single digit, 0=EndNode, 1=L1 Router and 2=L2 Router. - -There are also equivalent options for modules. The node address can -also be set through the /proc/sys/net/decnet/ files, as can other system -parameters. - -Currently the only supported devices are ethernet and ip_gre. The -ethernet address of your ethernet card has to be set according to the DECnet -address of the node in order for it to be autoconfigured (and then appear in -/proc/net/decnet_dev). There is a utility available at the above -FTP sites called dn2ethaddr which can compute the correct ethernet -address to use. The address can be set by ifconfig either before or -at the time the device is brought up. If you are using RedHat you can -add the line:: - - MACADDR=AA:00:04:00:03:04 - -or something similar, to /etc/sysconfig/network-scripts/ifcfg-eth0 or -wherever your network card's configuration lives. Setting the MAC address -of your ethernet card to an address starting with "hi-ord" will cause a -DECnet address which matches to be added to the interface (which you can -verify with iproute2). - -The default device for routing can be set through the /proc filesystem -by setting /proc/sys/net/decnet/default_device to the -device you want DECnet to route packets out of when no specific route -is available. Usually this will be eth0, for example:: - - echo -n "eth0" >/proc/sys/net/decnet/default_device - -If you don't set the default device, then it will default to the first -ethernet card which has been autoconfigured as described above. You can -confirm that by looking in the default_device file of course. - -There is a list of what the other files under /proc/sys/net/decnet/ do -on the kernel patch web site (shown above). - -4. Run time kernel configuration -================================ - - -This is either done through the sysctl/proc interface (see the kernel web -pages for details on what the various options do) or through the iproute2 -package in the same way as IPv4/6 configuration is performed. - -Documentation for iproute2 is included with the package, although there is -as yet no specific section on DECnet, most of the features apply to both -IP and DECnet, albeit with DECnet addresses instead of IP addresses and -a reduced functionality. - -If you want to configure a DECnet router you'll need the iproute2 package -since its the _only_ way to add and delete routes currently. Eventually -there will be a routing daemon to send and receive routing messages for -each interface and update the kernel routing tables accordingly. The -routing daemon will use netfilter to listen to routing packets, and -rtnetlink to update the kernels routing tables. - -The DECnet raw socket layer has been removed since it was there purely -for use by the routing daemon which will now use netfilter (a much cleaner -and more generic solution) instead. - -5. How can I tell if its working? -================================= - -Here is a quick guide of what to look for in order to know if your DECnet -kernel subsystem is working. - - - Is the node address set (see /proc/sys/net/decnet/node_address) - - Is the node of the correct type - (see /proc/sys/net/decnet/conf/<dev>/forwarding) - - Is the Ethernet MAC address of each Ethernet card set to match - the DECnet address. If in doubt use the dn2ethaddr utility available - at the ftp archive. - - If the previous two steps are satisfied, and the Ethernet card is up, - you should find that it is listed in /proc/net/decnet_dev and also - that it appears as a directory in /proc/sys/net/decnet/conf/. The - loopback device (lo) should also appear and is required to communicate - within a node. - - If you have any DECnet routers on your network, they should appear - in /proc/net/decnet_neigh, otherwise this file will only contain the - entry for the node itself (if it doesn't check to see if lo is up). - - If you want to send to any node which is not listed in the - /proc/net/decnet_neigh file, you'll need to set the default device - to point to an Ethernet card with connection to a router. This is - again done with the /proc/sys/net/decnet/default_device file. - - Try starting a simple server and client, like the dnping/dnmirror - over the loopback interface. With luck they should communicate. - For this step and those after, you'll need the DECnet library - which can be obtained from the above ftp sites as well as the - actual utilities themselves. - - If this seems to work, then try talking to a node on your local - network, and see if you can obtain the same results. - - At this point you are on your own... :-) - -6. How to send a bug report -=========================== - -If you've found a bug and want to report it, then there are several things -you can do to help me work out exactly what it is that is wrong. Useful -information (_most_ of which _is_ _essential_) includes: - - - What kernel version are you running ? - - What version of the patch are you running ? - - How far though the above set of tests can you get ? - - What is in the /proc/decnet* files and /proc/sys/net/decnet/* files ? - - Which services are you running ? - - Which client caused the problem ? - - How much data was being transferred ? - - Was the network congested ? - - How can the problem be reproduced ? - - Can you use tcpdump to get a trace ? (N.B. Most (all?) versions of - tcpdump don't understand how to dump DECnet properly, so including - the hex listing of the packet contents is _essential_, usually the -x flag. - You may also need to increase the length grabbed with the -s flag. The - -e flag also provides very useful information (ethernet MAC addresses)) - -7. MAC FAQ -========== - -A quick FAQ on ethernet MAC addresses to explain how Linux and DECnet -interact and how to get the best performance from your hardware. - -Ethernet cards are designed to normally only pass received network frames -to a host computer when they are addressed to it, or to the broadcast address. - -Linux has an interface which allows the setting of extra addresses for -an ethernet card to listen to. If the ethernet card supports it, the -filtering operation will be done in hardware, if not the extra unwanted packets -received will be discarded by the host computer. In the latter case, -significant processor time and bus bandwidth can be used up on a busy -network (see the NAPI documentation for a longer explanation of these -effects). - -DECnet makes use of this interface to allow running DECnet on an ethernet -card which has already been configured using TCP/IP (presumably using the -built in MAC address of the card, as usual) and/or to allow multiple DECnet -addresses on each physical interface. If you do this, be aware that if your -ethernet card doesn't support perfect hashing in its MAC address filter -then your computer will be doing more work than required. Some cards -will simply set themselves into promiscuous mode in order to receive -packets from the DECnet specified addresses. So if you have one of these -cards its better to set the MAC address of the card as described above -to gain the best efficiency. Better still is to use a card which supports -NAPI as well. - - -8. Mailing list -=============== - -If you are keen to get involved in development, or want to ask questions -about configuration, or even just report bugs, then there is a mailing -list that you can join, details are at: - -http://sourceforge.net/mail/?group_id=4993 - -9. Legal Info -============= - -The Linux DECnet project team have placed their code under the GPL. The -software is provided "as is" and without warranty express or implied. -DECnet is a trademark of Compaq. This software is not a product of -Compaq. We acknowledge the help of people at Compaq in providing extra -documentation above and beyond what was previously publicly available. - -Steve Whitehouse <SteveW@ACM.org> - diff --git a/Documentation/networking/device_drivers/appletalk/index.rst b/Documentation/networking/device_drivers/appletalk/index.rst index de7507f02037..c196baeb0856 100644 --- a/Documentation/networking/device_drivers/appletalk/index.rst +++ b/Documentation/networking/device_drivers/appletalk/index.rst @@ -9,7 +9,6 @@ Contents: :maxdepth: 2 cops - ltpc .. only:: subproject and html diff --git a/Documentation/networking/device_drivers/appletalk/ltpc.rst b/Documentation/networking/device_drivers/appletalk/ltpc.rst deleted file mode 100644 index 0ad197fd17ce..000000000000 --- a/Documentation/networking/device_drivers/appletalk/ltpc.rst +++ /dev/null @@ -1,144 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -=========== -LTPC Driver -=========== - -This is the ALPHA version of the ltpc driver. - -In order to use it, you will need at least version 1.3.3 of the -netatalk package, and the Apple or Farallon LocalTalk PC card. -There are a number of different LocalTalk cards for the PC; this -driver applies only to the one with the 65c02 processor chip on it. - -To include it in the kernel, select the CONFIG_LTPC switch in the -configuration dialog. You can also compile it as a module. - -While the driver will attempt to autoprobe the I/O port address, IRQ -line, and DMA channel of the card, this does not always work. For -this reason, you should be prepared to supply these parameters -yourself. (see "Card Configuration" below for how to determine or -change the settings on your card) - -When the driver is compiled into the kernel, you can add a line such -as the following to your /etc/lilo.conf:: - - append="ltpc=0x240,9,1" - -where the parameters (in order) are the port address, IRQ, and DMA -channel. The second and third values can be omitted, in which case -the driver will try to determine them itself. - -If you load the driver as a module, you can pass the parameters "io=", -"irq=", and "dma=" on the command line with insmod or modprobe, or add -them as options in a configuration file in /etc/modprobe.d/ directory:: - - alias lt0 ltpc # autoload the module when the interface is configured - options ltpc io=0x240 irq=9 dma=1 - -Before starting up the netatalk demons (perhaps in rc.local), you -need to add a line such as:: - - /sbin/ifconfig lt0 127.0.0.42 - -The address is unimportant - however, the card needs to be configured -with ifconfig so that Netatalk can find it. - -The appropriate netatalk configuration depends on whether you are -attached to a network that includes AppleTalk routers or not. If, -like me, you are simply connecting to your home Macintoshes and -printers, you need to set up netatalk to "seed". The way I do this -is to have the lines:: - - dummy -seed -phase 2 -net 2000 -addr 2000.26 -zone "1033" - lt0 -seed -phase 1 -net 1033 -addr 1033.27 -zone "1033" - -in my atalkd.conf. What is going on here is that I need to fool -netatalk into thinking that there are two AppleTalk interfaces -present; otherwise, it refuses to seed. This is a hack, and a more -permanent solution would be to alter the netatalk code. Also, make -sure you have the correct name for the dummy interface - If it's -compiled as a module, you will need to refer to it as "dummy0" or some -such. - -If you are attached to an extended AppleTalk network, with routers on -it, then you don't need to fool around with this -- the appropriate -line in atalkd.conf is:: - - lt0 -phase 1 - - -Card Configuration -================== - -The interrupts and so forth are configured via the dipswitch on the -board. Set the switches so as not to conflict with other hardware. - - Interrupts -- set at most one. If none are set, the driver uses - polled mode. Because the card was developed in the XT era, the - original documentation refers to IRQ2. Since you'll be running - this on an AT (or later) class machine, that really means IRQ9. - - === =========================================================== - SW1 IRQ 4 - SW2 IRQ 3 - SW3 IRQ 9 (2 in original card documentation only applies to XT) - === =========================================================== - - - DMA -- choose DMA 1 or 3, and set both corresponding switches. - - === ===== - SW4 DMA 3 - SW5 DMA 1 - SW6 DMA 3 - SW7 DMA 1 - === ===== - - - I/O address -- choose one. - - === ========= - SW8 220 / 240 - === ========= - - -IP -== - -Yes, it is possible to do IP over LocalTalk. However, you can't just -treat the LocalTalk device like an ordinary Ethernet device, even if -that's what it looks like to Netatalk. - -Instead, you follow the same procedure as for doing IP in EtherTalk. -See Documentation/networking/ipddp.rst for more information about the -kernel driver and userspace tools needed. - - -Bugs -==== - -IRQ autoprobing often doesn't work on a cold boot. To get around -this, either compile the driver as a module, or pass the parameters -for the card to the kernel as described above. - -Also, as usual, autoprobing is not recommended when you use the driver -as a module. (though it usually works at boot time, at least) - -Polled mode is *really* slow sometimes, but this seems to depend on -the configuration of the network. - -It may theoretically be possible to use two LTPC cards in the same -machine, but this is unsupported, so if you really want to do this, -you'll probably have to hack the initialization code a bit. - - -Thanks -====== - -Thanks to Alan Cox for helpful discussions early on in this -work, and to Denis Hainsworth for doing the bleeding-edge testing. - -Bradford Johnson <bradford@math.umn.edu> - -Updated 11/09/1998 by David Huggins-Daines <dhd@debian.org> diff --git a/Documentation/networking/device_drivers/can/can327.rst b/Documentation/networking/device_drivers/can/can327.rst new file mode 100644 index 000000000000..b87bfbe5d51c --- /dev/null +++ b/Documentation/networking/device_drivers/can/can327.rst @@ -0,0 +1,331 @@ +.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-3-Clause) + +can327: ELM327 driver for Linux SocketCAN +========================================== + +Authors +-------- + +Max Staudt <max@enpas.org> + + + +Motivation +----------- + +This driver aims to lower the initial cost for hackers interested in +working with CAN buses. + +CAN adapters are expensive, few, and far between. +ELM327 interfaces are cheap and plentiful. +Let's use ELM327s as CAN adapters. + + + +Introduction +------------- + +This driver is an effort to turn abundant ELM327 based OBD interfaces +into full fledged (as far as possible) CAN interfaces. + +Since the ELM327 was never meant to be a stand alone CAN controller, +the driver has to switch between its modes as quickly as possible in +order to fake full-duplex operation. + +As such, can327 is a best effort driver. However, this is more than +enough to implement simple request-response protocols (such as OBD II), +and to monitor broadcast messages on a bus (such as in a vehicle). + +Most ELM327s come as nondescript serial devices, attached via USB or +Bluetooth. The driver cannot recognize them by itself, and as such it +is up to the user to attach it in form of a TTY line discipline +(similar to PPP, SLIP, slcan, ...). + +This driver is meant for ELM327 versions 1.4b and up, see below for +known limitations in older controllers and clones. + + + +Data sheet +----------- + +The official data sheets can be found at ELM electronics' home page: + + https://www.elmelectronics.com/ + + + +How to attach the line discipline +---------------------------------- + +Every ELM327 chip is factory programmed to operate at a serial setting +of 38400 baud/s, 8 data bits, no parity, 1 stopbit. + +If you have kept this default configuration, the line discipline can +be attached on a command prompt as follows:: + + sudo ldattach \ + --debug \ + --speed 38400 \ + --eightbits \ + --noparity \ + --onestopbit \ + --iflag -ICRNL,INLCR,-IXOFF \ + 30 \ + /dev/ttyUSB0 + +To change the ELM327's serial settings, please refer to its data +sheet. This needs to be done before attaching the line discipline. + +Once the ldisc is attached, the CAN interface starts out unconfigured. +Set the speed before starting it:: + + # The interface needs to be down to change parameters + sudo ip link set can0 down + sudo ip link set can0 type can bitrate 500000 + sudo ip link set can0 up + +500000 bit/s is a common rate for OBD-II diagnostics. +If you're connecting straight to a car's OBD port, this is the speed +that most cars (but not all!) expect. + +After this, you can set out as usual with candump, cansniffer, etc. + + + +How to check the controller version +------------------------------------ + +Use a terminal program to attach to the controller. + +After issuing the "``AT WS``" command, the controller will respond with +its version:: + + >AT WS + + + ELM327 v1.4b + + > + +Note that clones may claim to be any version they like. +It is not indicative of their actual feature set. + + + + +Communication example +---------------------- + +This is a short and incomplete introduction on how to talk to an ELM327. +It is here to guide understanding of the controller's and the driver's +limitation (listed below) as well as manual testing. + + +The ELM327 has two modes: + +- Command mode +- Reception mode + +In command mode, it expects one command per line, terminated by CR. +By default, the prompt is a "``>``", after which a command can be +entered:: + + >ATE1 + OK + > + +The init script in the driver switches off several configuration options +that are only meaningful in the original OBD scenario the chip is meant +for, and are actually a hindrance for can327. + + +When a command is not recognized, such as by an older version of the +ELM327, a question mark is printed as a response instead of OK:: + + >ATUNKNOWN + ? + > + +At present, can327 does not evaluate this response. See the section +below on known limitations for details. + + +When a CAN frame is to be sent, the target address is configured, after +which the frame is sent as a command that consists of the data's hex +dump:: + + >ATSH123 + OK + >DEADBEEF12345678 + OK + > + +The above interaction sends the SFF frame "``DE AD BE EF 12 34 56 78``" +with (11 bit) CAN ID ``0x123``. +For this to function, the controller must be configured for SFF sending +mode (using "``AT PB``", see code or datasheet). + + +Once a frame has been sent and wait-for-reply mode is on (``ATR1``, +configured on ``listen-only=off``), or when the reply timeout expires +and the driver sets the controller into monitoring mode (``ATMA``), +the ELM327 will send one line for each received CAN frame, consisting +of CAN ID, DLC, and data:: + + 123 8 DEADBEEF12345678 + +For EFF (29 bit) CAN frames, the address format is slightly different, +which can327 uses to tell the two apart:: + + 12 34 56 78 8 DEADBEEF12345678 + +The ELM327 will receive both SFF and EFF frames - the current CAN +config (``ATPB``) does not matter. + + +If the ELM327's internal UART sending buffer runs full, it will abort +the monitoring mode, print "BUFFER FULL" and drop back into command +mode. Note that in this case, unlike with other error messages, the +error message may appear on the same line as the last (usually +incomplete) data frame:: + + 12 34 56 78 8 DEADBEEF123 BUFFER FULL + + + +Known limitations of the controller +------------------------------------ + +- Clone devices ("v1.5" and others) + + Sending RTR frames is not supported and will be dropped silently. + + Receiving RTR with DLC 8 will appear to be a regular frame with + the last received frame's DLC and payload. + + "``AT CSM``" (CAN Silent Monitoring, i.e. don't send CAN ACKs) is + not supported, and is hard coded to ON. Thus, frames are not ACKed + while listening: "``AT MA``" (Monitor All) will always be "silent". + However, immediately after sending a frame, the ELM327 will be in + "receive reply" mode, in which it *does* ACK any received frames. + Once the bus goes silent, or an error occurs (such as BUFFER FULL), + or the receive reply timeout runs out, the ELM327 will end reply + reception mode on its own and can327 will fall back to "``AT MA``" + in order to keep monitoring the bus. + + Other limitations may apply, depending on the clone and the quality + of its firmware. + + +- All versions + + No full duplex operation is supported. The driver will switch + between input/output mode as quickly as possible. + + The length of outgoing RTR frames cannot be set. In fact, some + clones (tested with one identifying as "``v1.5``") are unable to + send RTR frames at all. + + We don't have a way to get real-time notifications on CAN errors. + While there is a command (``AT CS``) to retrieve some basic stats, + we don't poll it as it would force us to interrupt reception mode. + + +- Versions prior to 1.4b + + These versions do not send CAN ACKs when in monitoring mode (AT MA). + However, they do send ACKs while waiting for a reply immediately + after sending a frame. The driver maximizes this time to make the + controller as useful as possible. + + Starting with version 1.4b, the ELM327 supports the "``AT CSM``" + command, and the "listen-only" CAN option will take effect. + + +- Versions prior to 1.4 + + These chips do not support the "``AT PB``" command, and thus cannot + change bitrate or SFF/EFF mode on-the-fly. This will have to be + programmed by the user before attaching the line discipline. See the + data sheet for details. + + +- Versions prior to 1.3 + + These chips cannot be used at all with can327. They do not support + the "``AT D1``" command, which is necessary to avoid parsing conflicts + on incoming data, as well as distinction of RTR frame lengths. + + Specifically, this allows for easy distinction of SFF and EFF + frames, and to check whether frames are complete. While it is possible + to deduce the type and length from the length of the line the ELM327 + sends us, this method fails when the ELM327's UART output buffer + overruns. It may abort sending in the middle of the line, which will + then be mistaken for something else. + + + +Known limitations of the driver +-------------------------------- + +- No 8/7 timing. + + ELM327 can only set CAN bitrates that are of the form 500000/n, where + n is an integer divisor. + However there is an exception: With a separate flag, it may set the + speed to be 8/7 of the speed indicated by the divisor. + This mode is not currently implemented. + +- No evaluation of command responses. + + The ELM327 will reply with OK when a command is understood, and with ? + when it is not. The driver does not currently check this, and simply + assumes that the chip understands every command. + The driver is built such that functionality degrades gracefully + nevertheless. See the section on known limitations of the controller. + +- No use of hardware CAN ID filtering + + An ELM327's UART sending buffer will easily overflow on heavy CAN bus + load, resulting in the "``BUFFER FULL``" message. Using the hardware + filters available through "``AT CF xxx``" and "``AT CM xxx``" would be + helpful here, however SocketCAN does not currently provide a facility + to make use of such hardware features. + + + +Rationale behind the chosen configuration +------------------------------------------ + +``AT E1`` + Echo on + + We need this to be able to get a prompt reliably. + +``AT S1`` + Spaces on + + We need this to distinguish 11/29 bit CAN addresses received. + + Note: + We can usually do this using the line length (odd/even), + but this fails if the line is not transmitted fully to + the host (BUFFER FULL). + +``AT D1`` + DLC on + + We need this to tell the "length" of RTR frames. + + + +A note on CAN bus termination +------------------------------ + +Your adapter may have resistors soldered in which are meant to terminate +the bus. This is correct when it is plugged into a OBD-II socket, but +not helpful when trying to tap into the middle of an existing CAN bus. + +If communications don't work with the adapter connected, check for the +termination resistors on its PCB and try removing them. diff --git a/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst b/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst new file mode 100644 index 000000000000..40c92ea272af --- /dev/null +++ b/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst @@ -0,0 +1,639 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +CTU CAN FD Driver +================= + +Author: Martin Jerabek <martin.jerabek01@gmail.com> + + +About CTU CAN FD IP Core +------------------------ + +`CTU CAN FD <https://gitlab.fel.cvut.cz/canbus/ctucanfd_ip_core>`_ +is an open source soft core written in VHDL. +It originated in 2015 as Ondrej Ille's project +at the `Department of Measurement <https://meas.fel.cvut.cz/>`_ +of `FEE <http://www.fel.cvut.cz/en/>`_ at `CTU <https://www.cvut.cz/en>`_. + +The SocketCAN driver for Xilinx Zynq SoC based MicroZed board +`Vivado integration <https://gitlab.fel.cvut.cz/canbus/zynq/zynq-can-sja1000-top>`_ +and Intel Cyclone V 5CSEMA4U23C6 based DE0-Nano-SoC Terasic board +`QSys integration <https://gitlab.fel.cvut.cz/canbus/intel-soc-ctucanfd>`_ +has been developed as well as support for +`PCIe integration <https://gitlab.fel.cvut.cz/canbus/pcie-ctucanfd>`_ of the core. + +In the case of Zynq, the core is connected via the APB system bus, which does +not have enumeration support, and the device must be specified in Device Tree. +This kind of devices is called platform device in the kernel and is +handled by a platform device driver. + +The basic functional model of the CTU CAN FD peripheral has been +accepted into QEMU mainline. See QEMU `CAN emulation support <https://www.qemu.org/docs/master/system/devices/can.html>`_ +for CAN FD buses, host connection and CTU CAN FD core emulation. The development +version of emulation support can be cloned from ctu-canfd branch of QEMU local +development `repository <https://gitlab.fel.cvut.cz/canbus/qemu-canbus>`_. + + +About SocketCAN +--------------- + +SocketCAN is a standard common interface for CAN devices in the Linux +kernel. As the name suggests, the bus is accessed via sockets, similarly +to common network devices. The reasoning behind this is in depth +described in `Linux SocketCAN <https://www.kernel.org/doc/html/latest/networking/can.html>`_. +In short, it offers a +natural way to implement and work with higher layer protocols over CAN, +in the same way as, e.g., UDP/IP over Ethernet. + +Device probe +~~~~~~~~~~~~ + +Before going into detail about the structure of a CAN bus device driver, +let's reiterate how the kernel gets to know about the device at all. +Some buses, like PCI or PCIe, support device enumeration. That is, when +the system boots, it discovers all the devices on the bus and reads +their configuration. The kernel identifies the device via its vendor ID +and device ID, and if there is a driver registered for this identifier +combination, its probe method is invoked to populate the driver's +instance for the given hardware. A similar situation goes with USB, only +it allows for device hot-plug. + +The situation is different for peripherals which are directly embedded +in the SoC and connected to an internal system bus (AXI, APB, Avalon, +and others). These buses do not support enumeration, and thus the kernel +has to learn about the devices from elsewhere. This is exactly what the +Device Tree was made for. + +Device tree +~~~~~~~~~~~ + +An entry in device tree states that a device exists in the system, how +it is reachable (on which bus it resides) and its configuration – +registers address, interrupts and so on. An example of such a device +tree is given in . + +:: + + / { + /* ... */ + amba: amba { + #address-cells = <1>; + #size-cells = <1>; + compatible = "simple-bus"; + + CTU_CAN_FD_0: CTU_CAN_FD@43c30000 { + compatible = "ctu,ctucanfd"; + interrupt-parent = <&intc>; + interrupts = <0 30 4>; + clocks = <&clkc 15>; + reg = <0x43c30000 0x10000>; + }; + }; + }; + + +.. _sec:socketcan:drv: + +Driver structure +~~~~~~~~~~~~~~~~ + +The driver can be divided into two parts – platform-dependent device +discovery and set up, and platform-independent CAN network device +implementation. + +.. _sec:socketcan:platdev: + +Platform device driver +^^^^^^^^^^^^^^^^^^^^^^ + +In the case of Zynq, the core is connected via the AXI system bus, which +does not have enumeration support, and the device must be specified in +Device Tree. This kind of devices is called *platform device* in the +kernel and is handled by a *platform device driver*\ [1]_. + +A platform device driver provides the following things: + +- A *probe* function + +- A *remove* function + +- A table of *compatible* devices that the driver can handle + +The *probe* function is called exactly once when the device appears (or +the driver is loaded, whichever happens later). If there are more +devices handled by the same driver, the *probe* function is called for +each one of them. Its role is to allocate and initialize resources +required for handling the device, as well as set up low-level functions +for the platform-independent layer, e.g., *read_reg* and *write_reg*. +After that, the driver registers the device to a higher layer, in our +case as a *network device*. + +The *remove* function is called when the device disappears, or the +driver is about to be unloaded. It serves to free the resources +allocated in *probe* and to unregister the device from higher layers. + +Finally, the table of *compatible* devices states which devices the +driver can handle. The Device Tree entry ``compatible`` is matched +against the tables of all *platform drivers*. + +.. code:: c + + /* Match table for OF platform binding */ + static const struct of_device_id ctucan_of_match[] = { + { .compatible = "ctu,canfd-2", }, + { .compatible = "ctu,ctucanfd", }, + { /* end of list */ }, + }; + MODULE_DEVICE_TABLE(of, ctucan_of_match); + + static int ctucan_probe(struct platform_device *pdev); + static int ctucan_remove(struct platform_device *pdev); + + static struct platform_driver ctucanfd_driver = { + .probe = ctucan_probe, + .remove = ctucan_remove, + .driver = { + .name = DRIVER_NAME, + .of_match_table = ctucan_of_match, + }, + }; + module_platform_driver(ctucanfd_driver); + + +.. _sec:socketcan:netdev: + +Network device driver +^^^^^^^^^^^^^^^^^^^^^ + +Each network device must support at least these operations: + +- Bring the device up: ``ndo_open`` + +- Bring the device down: ``ndo_close`` + +- Submit TX frames to the device: ``ndo_start_xmit`` + +- Signal TX completion and errors to the network subsystem: ISR + +- Submit RX frames to the network subsystem: ISR and NAPI + +There are two possible event sources: the device and the network +subsystem. Device events are usually signaled via an interrupt, handled +in an Interrupt Service Routine (ISR). Handlers for the events +originating in the network subsystem are then specified in +``struct net_device_ops``. + +When the device is brought up, e.g., by calling ``ip link set can0 up``, +the driver’s function ``ndo_open`` is called. It should validate the +interface configuration and configure and enable the device. The +analogous opposite is ``ndo_close``, called when the device is being +brought down, be it explicitly or implicitly. + +When the system should transmit a frame, it does so by calling +``ndo_start_xmit``, which enqueues the frame into the device. If the +device HW queue (FIFO, mailboxes or whatever the implementation is) +becomes full, the ``ndo_start_xmit`` implementation informs the network +subsystem that it should stop the TX queue (via ``netif_stop_queue``). +It is then re-enabled later in ISR when the device has some space +available again and is able to enqueue another frame. + +All the device events are handled in ISR, namely: + +#. **TX completion**. When the device successfully finishes transmitting + a frame, the frame is echoed locally. On error, an informative error + frame [2]_ is sent to the network subsystem instead. In both cases, + the software TX queue is resumed so that more frames may be sent. + +#. **Error condition**. If something goes wrong (e.g., the device goes + bus-off or RX overrun happens), error counters are updated, and + informative error frames are enqueued to SW RX queue. + +#. **RX buffer not empty**. In this case, read the RX frames and enqueue + them to SW RX queue. Usually NAPI is used as a middle layer (see ). + +.. _sec:socketcan:napi: + +NAPI +~~~~ + +The frequency of incoming frames can be high and the overhead to invoke +the interrupt service routine for each frame can cause significant +system load. There are multiple mechanisms in the Linux kernel to deal +with this situation. They evolved over the years of Linux kernel +development and enhancements. For network devices, the current standard +is NAPI – *the New API*. It is similar to classical top-half/bottom-half +interrupt handling in that it only acknowledges the interrupt in the ISR +and signals that the rest of the processing should be done in softirq +context. On top of that, it offers the possibility to *poll* for new +frames for a while. This has a potential to avoid the costly round of +enabling interrupts, handling an incoming IRQ in ISR, re-enabling the +softirq and switching context back to softirq. + +More detailed documentation of NAPI may be found on the pages of Linux +Foundation `<https://wiki.linuxfoundation.org/networking/napi>`_. + +Integrating the core to Xilinx Zynq +----------------------------------- + +The core interfaces a simple subset of the Avalon +(search for Intel **Avalon Interface Specifications**) +bus as it was originally used on +Alterra FPGA chips, yet Xilinx natively interfaces with AXI +(search for ARM **AMBA AXI and ACE Protocol Specification AXI3, +AXI4, and AXI4-Lite, ACE and ACE-Lite**). +The most obvious solution would be to use +an Avalon/AXI bridge or implement some simple conversion entity. +However, the core’s interface is half-duplex with no handshake +signaling, whereas AXI is full duplex with two-way signaling. Moreover, +even AXI-Lite slave interface is quite resource-intensive, and the +flexibility and speed of AXI are not required for a CAN core. + +Thus a much simpler bus was chosen – APB (Advanced Peripheral Bus) +(search for ARM **AMBA APB Protocol Specification**). +APB-AXI bridge is directly available in +Xilinx Vivado, and the interface adaptor entity is just a few simple +combinatorial assignments. + +Finally, to be able to include the core in a block diagram as a custom +IP, the core, together with the APB interface, has been packaged as a +Vivado component. + +CTU CAN FD Driver design +------------------------ + +The general structure of a CAN device driver has already been examined +in . The next paragraphs provide a more detailed description of the CTU +CAN FD core driver in particular. + +Low-level driver +~~~~~~~~~~~~~~~~ + +The core is not intended to be used solely with SocketCAN, and thus it +is desirable to have an OS-independent low-level driver. This low-level +driver can then be used in implementations of OS driver or directly +either on bare metal or in a user-space application. Another advantage +is that if the hardware slightly changes, only the low-level driver +needs to be modified. + +The code [3]_ is in part automatically generated and in part written +manually by the core author, with contributions of the thesis’ author. +The low-level driver supports operations such as: set bit timing, set +controller mode, enable/disable, read RX frame, write TX frame, and so +on. + +Configuring bit timing +~~~~~~~~~~~~~~~~~~~~~~ + +On CAN, each bit is divided into four segments: SYNC, PROP, PHASE1, and +PHASE2. Their duration is expressed in multiples of a Time Quantum +(details in `CAN Specification, Version 2.0 <http://esd.cs.ucr.edu/webres/can20.pdf>`_, chapter 8). +When configuring +bitrate, the durations of all the segments (and time quantum) must be +computed from the bitrate and Sample Point. This is performed +independently for both the Nominal bitrate and Data bitrate for CAN FD. + +SocketCAN is fairly flexible and offers either highly customized +configuration by setting all the segment durations manually, or a +convenient configuration by setting just the bitrate and sample point +(and even that is chosen automatically per Bosch recommendation if not +specified). However, each CAN controller may have different base clock +frequency and different width of segment duration registers. The +algorithm thus needs the minimum and maximum values for the durations +(and clock prescaler) and tries to optimize the numbers to fit both the +constraints and the requested parameters. + +.. code:: c + + struct can_bittiming_const { + char name[16]; /* Name of the CAN controller hardware */ + __u32 tseg1_min; /* Time segment 1 = prop_seg + phase_seg1 */ + __u32 tseg1_max; + __u32 tseg2_min; /* Time segment 2 = phase_seg2 */ + __u32 tseg2_max; + __u32 sjw_max; /* Synchronisation jump width */ + __u32 brp_min; /* Bit-rate prescaler */ + __u32 brp_max; + __u32 brp_inc; + }; + + +[lst:can_bittiming_const] + +A curious reader will notice that the durations of the segments PROP_SEG +and PHASE_SEG1 are not determined separately but rather combined and +then, by default, the resulting TSEG1 is evenly divided between PROP_SEG +and PHASE_SEG1. In practice, this has virtually no consequences as the +sample point is between PHASE_SEG1 and PHASE_SEG2. In CTU CAN FD, +however, the duration registers ``PROP`` and ``PH1`` have different +widths (6 and 7 bits, respectively), so the auto-computed values might +overflow the shorter register and must thus be redistributed among the +two [4]_. + +Handling RX +~~~~~~~~~~~ + +Frame reception is handled in NAPI queue, which is enabled from ISR when +the RXNE (RX FIFO Not Empty) bit is set. Frames are read one by one +until either no frame is left in the RX FIFO or the maximum work quota +has been reached for the NAPI poll run (see ). Each frame is then passed +to the network interface RX queue. + +An incoming frame may be either a CAN 2.0 frame or a CAN FD frame. The +way to distinguish between these two in the kernel is to allocate either +``struct can_frame`` or ``struct canfd_frame``, the two having different +sizes. In the controller, the information about the frame type is stored +in the first word of RX FIFO. + +This brings us a chicken-egg problem: we want to allocate the ``skb`` +for the frame, and only if it succeeds, fetch the frame from FIFO; +otherwise keep it there for later. But to be able to allocate the +correct ``skb``, we have to fetch the first work of FIFO. There are +several possible solutions: + +#. Read the word, then allocate. If it fails, discard the rest of the + frame. When the system is low on memory, the situation is bad anyway. + +#. Always allocate ``skb`` big enough for an FD frame beforehand. Then + tweak the ``skb`` internals to look like it has been allocated for + the smaller CAN 2.0 frame. + +#. Add option to peek into the FIFO instead of consuming the word. + +#. If the allocation fails, store the read word into driver’s data. On + the next try, use the stored word instead of reading it again. + +Option 1 is simple enough, but not very satisfying if we could do +better. Option 2 is not acceptable, as it would require modifying the +private state of an integral kernel structure. The slightly higher +memory consumption is just a virtual cherry on top of the “cake”. Option +3 requires non-trivial HW changes and is not ideal from the HW point of +view. + +Option 4 seems like a good compromise, with its disadvantage being that +a partial frame may stay in the FIFO for a prolonged time. Nonetheless, +there may be just one owner of the RX FIFO, and thus no one else should +see the partial frame (disregarding some exotic debugging scenarios). +Basides, the driver resets the core on its initialization, so the +partial frame cannot be “adopted” either. In the end, option 4 was +selected [5]_. + +.. _subsec:ctucanfd:rxtimestamp: + +Timestamping RX frames +^^^^^^^^^^^^^^^^^^^^^^ + +The CTU CAN FD core reports the exact timestamp when the frame has been +received. The timestamp is by default captured at the sample point of +the last bit of EOF but is configurable to be captured at the SOF bit. +The timestamp source is external to the core and may be up to 64 bits +wide. At the time of writing, passing the timestamp from kernel to +userspace is not yet implemented, but is planned in the future. + +Handling TX +~~~~~~~~~~~ + +The CTU CAN FD core has 4 independent TX buffers, each with its own +state and priority. When the core wants to transmit, a TX buffer in +Ready state with the highest priority is selected. + +The priorities are 3bit numbers in register TX_PRIORITY +(nibble-aligned). This should be flexible enough for most use cases. +SocketCAN, however, supports only one FIFO queue for outgoing +frames [6]_. The buffer priorities may be used to simulate the FIFO +behavior by assigning each buffer a distinct priority and *rotating* the +priorities after a frame transmission is completed. + +In addition to priority rotation, the SW must maintain head and tail +pointers into the FIFO formed by the TX buffers to be able to determine +which buffer should be used for next frame (``txb_head``) and which +should be the first completed one (``txb_tail``). The actual buffer +indices are (obviously) modulo 4 (number of TX buffers), but the +pointers must be at least one bit wider to be able to distinguish +between FIFO full and FIFO empty – in this situation, +:math:`txb\_head \equiv txb\_tail\ (\textrm{mod}\ 4)`. An example of how +the FIFO is maintained, together with priority rotation, is depicted in + +| + ++------+---+---+---+---+ +| TXB# | 0 | 1 | 2 | 3 | ++======+===+===+===+===+ +| Seq | A | B | C | | ++------+---+---+---+---+ +| Prio | 7 | 6 | 5 | 4 | ++------+---+---+---+---+ +| | | T | | H | ++------+---+---+---+---+ + +| + ++------+---+---+---+---+ +| TXB# | 0 | 1 | 2 | 3 | ++======+===+===+===+===+ +| Seq | | B | C | | ++------+---+---+---+---+ +| Prio | 4 | 7 | 6 | 5 | ++------+---+---+---+---+ +| | | T | | H | ++------+---+---+---+---+ + +| + ++------+---+---+---+---+----+ +| TXB# | 0 | 1 | 2 | 3 | 0’ | ++======+===+===+===+===+====+ +| Seq | E | B | C | D | | ++------+---+---+---+---+----+ +| Prio | 4 | 7 | 6 | 5 | | ++------+---+---+---+---+----+ +| | | T | | | H | ++------+---+---+---+---+----+ + +| + +.. kernel-figure:: fsm_txt_buffer_user.svg + + TX Buffer states with possible transitions + +.. _subsec:ctucanfd:txtimestamp: + +Timestamping TX frames +^^^^^^^^^^^^^^^^^^^^^^ + +When submitting a frame to a TX buffer, one may specify the timestamp at +which the frame should be transmitted. The frame transmission may start +later, but not sooner. Note that the timestamp does not participate in +buffer prioritization – that is decided solely by the mechanism +described above. + +Support for time-based packet transmission was recently merged to Linux +v4.19 `Time-based packet transmission <https://lwn.net/Articles/748879/>`_, +but it remains yet to be researched +whether this functionality will be practical for CAN. + +Also similarly to retrieving the timestamp of RX frames, the core +supports retrieving the timestamp of TX frames – that is the time when +the frame was successfully delivered. The particulars are very similar +to timestamping RX frames and are described in . + +Handling RX buffer overrun +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When a received frame does no more fit into the hardware RX FIFO in its +entirety, RX FIFO overrun flag (STATUS[DOR]) is set and Data Overrun +Interrupt (DOI) is triggered. When servicing the interrupt, care must be +taken first to clear the DOR flag (via COMMAND[CDO]) and after that +clear the DOI interrupt flag. Otherwise, the interrupt would be +immediately [7]_ rearmed. + +**Note**: During development, it was discussed whether the internal HW +pipelining cannot disrupt this clear sequence and whether an additional +dummy cycle is necessary between clearing the flag and the interrupt. On +the Avalon interface, it indeed proved to be the case, but APB being +safe because it uses 2-cycle transactions. Essentially, the DOR flag +would be cleared, but DOI register’s Preset input would still be high +the cycle when the DOI clear request would also be applied (by setting +the register’s Reset input high). As Set had higher priority than Reset, +the DOI flag would not be reset. This has been already fixed by swapping +the Set/Reset priority (see issue #187). + +Reporting Error Passive and Bus Off conditions +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It may be desirable to report when the node reaches *Error Passive*, +*Error Warning*, and *Bus Off* conditions. The driver is notified about +error state change by an interrupt (EPI, EWLI), and then proceeds to +determine the core’s error state by reading its error counters. + +There is, however, a slight race condition here – there is a delay +between the time when the state transition occurs (and the interrupt is +triggered) and when the error counters are read. When EPI is received, +the node may be either *Error Passive* or *Bus Off*. If the node goes +*Bus Off*, it obviously remains in the state until it is reset. +Otherwise, the node is *or was* *Error Passive*. However, it may happen +that the read state is *Error Warning* or even *Error Active*. It may be +unclear whether and what exactly to report in that case, but I +personally entertain the idea that the past error condition should still +be reported. Similarly, when EWLI is received but the state is later +detected to be *Error Passive*, *Error Passive* should be reported. + + +CTU CAN FD Driver Sources Reference +----------------------------------- + +.. kernel-doc:: drivers/net/can/ctucanfd/ctucanfd.h + :internal: + +.. kernel-doc:: drivers/net/can/ctucanfd/ctucanfd_base.c + :internal: + +.. kernel-doc:: drivers/net/can/ctucanfd/ctucanfd_pci.c + :internal: + +.. kernel-doc:: drivers/net/can/ctucanfd/ctucanfd_platform.c + :internal: + +CTU CAN FD IP Core and Driver Development Acknowledgment +--------------------------------------------------------- + +* Odrej Ille <ondrej.ille@gmail.com> + + * started the project as student at Department of Measurement, FEE, CTU + * invested great amount of personal time and enthusiasm to the project over years + * worked on more funded tasks + +* `Department of Measurement <https://meas.fel.cvut.cz/>`_, + `Faculty of Electrical Engineering <http://www.fel.cvut.cz/en/>`_, + `Czech Technical University <https://www.cvut.cz/en>`_ + + * is the main investor into the project over many years + * uses project in their CAN/CAN FD diagnostics framework for `Skoda Auto <https://www.skoda-auto.cz/>`_ + +* `Digiteq Automotive <https://www.digiteqautomotive.com/en>`_ + + * funding of the project CAN FD Open Cores Support Linux Kernel Based Systems + * negotiated and paid CTU to allow public access to the project + * provided additional funding of the work + +* `Department of Control Engineering <https://control.fel.cvut.cz/en>`_, + `Faculty of Electrical Engineering <http://www.fel.cvut.cz/en/>`_, + `Czech Technical University <https://www.cvut.cz/en>`_ + + * solving the project CAN FD Open Cores Support Linux Kernel Based Systems + * providing GitLab management + * virtual servers and computational power for continuous integration + * providing hardware for HIL continuous integration tests + +* `PiKRON Ltd. <http://pikron.com/>`_ + + * minor funding to initiate preparation of the project open-sourcing + +* Petr Porazil <porazil@pikron.com> + + * design of PCIe transceiver addon board and assembly of boards + * design and assembly of MZ_APO baseboard for MicroZed/Zynq based system + +* Martin Jerabek <martin.jerabek01@gmail.com> + + * Linux driver development + * continuous integration platform architect and GHDL updates + * theses `Open-source and Open-hardware CAN FD Protocol Support <https://dspace.cvut.cz/bitstream/handle/10467/80366/F3-DP-2019-Jerabek-Martin-Jerabek-thesis-2019-canfd.pdf>`_ + +* Jiri Novak <jnovak@fel.cvut.cz> + + * project initiation, management and use at Department of Measurement, FEE, CTU + +* Pavel Pisa <pisa@cmp.felk.cvut.cz> + + * initiate open-sourcing, project coordination, management at Department of Control Engineering, FEE, CTU + +* Jaroslav Beran<jara.beran@gmail.com> + + * system integration for Intel SoC, core and driver testing and updates + +* Carsten Emde (`OSADL <https://www.osadl.org/>`_) + + * provided OSADL expertise to discuss IP core licensing + * pointed to possible deadlock for LGPL and CAN bus possible patent case which lead to relicense IP core design to BSD like license + +* Reiner Zitzmann and Holger Zeltwanger (`CAN in Automation <https://www.can-cia.org/>`_) + + * provided suggestions and help to inform community about the project and invited us to events focused on CAN bus future development directions + +* Jan Charvat + + * implemented CTU CAN FD functional model for QEMU which has been integrated into QEMU mainline (`docs/system/devices/can.rst <https://www.qemu.org/docs/master/system/devices/can.html>`_) + * Bachelor theses Model of CAN FD Communication Controller for QEMU Emulator + +Notes +----- + + +.. [1] + Other buses have their own specific driver interface to set up the + device. + +.. [2] + Not to be mistaken with CAN Error Frame. This is a ``can_frame`` with + ``CAN_ERR_FLAG`` set and some error info in its ``data`` field. + +.. [3] + Available in CTU CAN FD repository + `<https://gitlab.fel.cvut.cz/canbus/ctucanfd_ip_core>`_ + +.. [4] + As is done in the low-level driver functions + ``ctucan_hw_set_nom_bittiming`` and + ``ctucan_hw_set_data_bittiming``. + +.. [5] + At the time of writing this thesis, option 1 is still being used and + the modification is queued in gitlab issue #222 + +.. [6] + Strictly speaking, multiple CAN TX queues are supported since v4.19 + `can: enable multi-queue for SocketCAN devices <https://lore.kernel.org/patchwork/patch/913526/>`_ but no mainline driver is using + them yet. + +.. [7] + Or rather in the next clock cycle diff --git a/Documentation/networking/device_drivers/can/ctu/fsm_txt_buffer_user.svg b/Documentation/networking/device_drivers/can/ctu/fsm_txt_buffer_user.svg new file mode 100644 index 000000000000..b371650788f4 --- /dev/null +++ b/Documentation/networking/device_drivers/can/ctu/fsm_txt_buffer_user.svg @@ -0,0 +1,151 @@ +<?xml version="1.0" encoding="UTF-8"?> +<svg width="113.611mm" height="86.6873mm" version="1.1" viewBox="0 0 113.611 86.6873" xmlns="http://www.w3.org/2000/svg" xmlns:cc="http://creativecommons.org/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"> + <defs> + <marker id="marker3667" overflow="visible" orient="auto"> + <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/> + </marker> + <marker id="marker3517" overflow="visible" orient="auto"> + <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/> + </marker> + <marker id="marker3373" overflow="visible" orient="auto"> + <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/> + </marker> + <marker id="marker3199" overflow="visible" orient="auto"> + <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/> + </marker> + <marker id="marker3037" overflow="visible" orient="auto"> + <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/> + </marker> + <marker id="marker2779" overflow="visible" orient="auto"> + <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/> + </marker> + <marker id="marker2477" overflow="visible" orient="auto"> + <path transform="scale(.6) rotate(180) translate(0)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/> + </marker> + <marker id="marker2074" overflow="visible" orient="auto"> + <path transform="scale(.6) rotate(180) translate(0)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/> + </marker> + <marker id="marker1964" overflow="visible" orient="auto"> + <path transform="scale(.6) rotate(180) translate(0)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/> + </marker> + <marker id="marker1856" overflow="visible" orient="auto"> + <path transform="scale(.6) rotate(180) translate(0)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/> + </marker> + <marker id="Arrow2Mend" overflow="visible" orient="auto"> + <path transform="scale(.6) rotate(180) translate(0)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill-rule="evenodd" stroke="#000" stroke-linejoin="round" stroke-width=".625"/> + </marker> + <filter id="filter1204" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB"> + <feGaussianBlur stdDeviation="0.00018829868"/> + </filter> + <marker id="marker2074-3" overflow="visible" orient="auto"> + <path transform="scale(-.6)" d="m8.71859 4.03374-10.9259-4.01772 10.9259-4.01772c-1.7455 2.37206-1.73544 5.61745-6e-7 8.03544z" fill="#28a4ff" fill-rule="evenodd" stroke="#28a4ff" stroke-linejoin="round" stroke-width=".625"/> + </marker> + <filter id="filter1204-6" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB"> + <feGaussianBlur stdDeviation="0.00018829868"/> + </filter> + <filter id="filter1204-6-9" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB"> + <feGaussianBlur stdDeviation="0.00018829868"/> + </filter> + <filter id="filter1204-6-2" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB"> + <feGaussianBlur stdDeviation="0.00018829868"/> + </filter> + <filter id="filter1204-6-2-9" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB"> + <feGaussianBlur stdDeviation="0.00018829868"/> + </filter> + <filter id="filter1204-6-2-9-4" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB"> + <feGaussianBlur stdDeviation="0.00018829868"/> + </filter> + <filter id="filter1204-6-2-9-1" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB"> + <feGaussianBlur stdDeviation="0.00018829868"/> + </filter> + <filter id="filter1204-6-2-9-1-3" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB"> + <feGaussianBlur stdDeviation="0.00018829868"/> + </filter> + <filter id="filter1204-6-2-9-1-3-1" x="-4.19953e-6" y="-5.60084e-6" width="1.00001" height="1.00001" color-interpolation-filters="sRGB"> + <feGaussianBlur stdDeviation="0.00018829868"/> + </filter> + </defs> + <metadata> + <rdf:RDF> + <cc:Work rdf:about=""> + <dc:format>image/svg+xml</dc:format> + <dc:type rdf:resource="http://purl.org/dc/dcmitype/StillImage"/> + <dc:title/> + </cc:Work> + </rdf:RDF> + </metadata> + <g transform="translate(-49.0277 -104.823)"> + <g> + <path d="m130.534 165.429h-71.1816v-17.5315" fill="none" marker-end="url(#marker2477)" stroke="#28a4ff" stroke-width=".6"/> + <path d="m145.034 122.959v-11.5914h-43.1215" fill="none" marker-end="url(#marker3037)" stroke="#28a4ff" stroke-width=".6"/> + <rect x="130.679" y="122.933" width="28.2965" height="45.2319" rx="0" ry="0" fill="#e5e5e5" stroke="#717171" stroke-linecap="square" stroke-width=".499999"/> + <path d="m102.044 116.236h23.3126l-0.13388 18.8185h19.9383v3.66603" fill="none" marker-end="url(#marker3199)" stroke="#28a4ff" stroke-width=".6"/> + <path d="m59.5006 138.391v-24.2517h20.6338" fill="none" marker-end="url(#marker2779)" stroke="#28a4ff" stroke-width=".6"/> + <rect x="78.1389" y="126.411" width="28.0037" height="35.0443" rx="0" ry="0" fill="#e5e5e5" stroke="#717171" stroke-linecap="square" stroke-width=".5"/> + </g> + <g fill="#ffcb35" stroke="#000" stroke-linecap="square"> + <ellipse cx="92.1408" cy="114.239" rx="10.8866" ry="4.39308" stroke-width=".5"/> + <ellipse cx="92.1408" cy="134.185" rx="10.8866" ry="4.39308" stroke-width=".499999"/> + <ellipse cx="92.1408" cy="152.199" rx="10.8866" ry="4.39308" stroke-width=".499999"/> + </g> + <g fill="#28a4ff" stroke="#000" stroke-linecap="square" stroke-width=".499999"> + <ellipse cx="144.827" cy="143.316" rx="10.8866" ry="4.39308"/> + <ellipse cx="144.827" cy="159.143" rx="10.8866" ry="4.39308"/> + <ellipse cx="59.4364" cy="142.823" rx="7.36455" ry="4.39308"/> + <ellipse cx="144.827" cy="129.196" rx="10.8866" ry="4.39308"/> + <ellipse cx="143.077" cy="180.53" rx="10.8866" ry="4.39308"/> + </g> + <ellipse cx="110.386" cy="180.53" rx="10.8866" ry="4.39308" fill="#ffcb35" stroke="#000" stroke-linecap="square" stroke-width=".499999"/> + <text x="110.90907" y="179.42688" font-size="3.175px" xml:space="preserve"><tspan x="110.90907" y="179.42688" dy="0.60000002" text-align="center" text-anchor="middle">Accessible</tspan><tspan x="110.90907" y="183.39563"><tspan font-size="3.175px" text-align="center" text-anchor="middle">for S</tspan>W</tspan></text> + <text x="143.5869" y="179.52795" xml:space="preserve"><tspan x="143.5869" y="179.52795" dy="1 0 0 0 0 0" font-family="sans-serif" font-size="2.82222px" text-align="center" text-anchor="middle" style="font-variant-caps:normal;font-variant-east-asian:normal;font-variant-ligatures:normal;font-variant-numeric:normal">Inaccessible</tspan><tspan x="143.5869" y="183.36786" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">for S</tspan>W</tspan></text> + <g font-size="3.175px"> + <text x="91.95018" y="115.29005" xml:space="preserve"><tspan x="91.95018" y="115.29005" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">Ready</tspan></tspan></text> + <text x="145.25127" y="130.49019" xml:space="preserve"><tspan x="145.25127" y="130.49019" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">TX OK</tspan></tspan></text> + <text x="145.31845" y="144.43121" xml:space="preserve"><tspan x="145.31845" y="144.43121" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">Aborted</tspan></tspan></text> + <text x="145.40399" y="160.36035" xml:space="preserve"><tspan x="145.40399" y="160.36035" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">TX failed</tspan></tspan></text> + <text x="91.823967" y="133.53941" text-align="center" text-anchor="middle" style="line-height:0.9" xml:space="preserve"><tspan x="91.823967" y="133.53941" text-align="center"><tspan font-size="3.175px" text-align="center" text-anchor="middle">TX in</tspan></tspan><tspan x="91.823967" y="136.39691" text-align="center">progress</tspan></text> + <text x="91.648918" y="151.84813" text-align="center" text-anchor="middle" style="line-height:0.9" xml:space="preserve"><tspan x="91.648918" y="151.84813" text-align="center"><tspan font-size="3.175px" text-align="center" text-anchor="middle">Abort in</tspan></tspan><tspan x="91.648918" y="154.70563" text-align="center">progress</tspan></text> + <text x="59.456043" y="143.91658" xml:space="preserve"><tspan x="59.456043" y="143.91658" font-size="3.175px"><tspan font-size="3.175px" text-align="center" text-anchor="middle">Empty</tspan></tspan></text> + </g> + <g fill="none"> + <g stroke="#000"> + <rect x="52.3943" y="171.63" width="106.581" height="16.601" rx="0" ry="0" stroke-linecap="square" stroke-width=".499999"/> + <g stroke-width=".6"> + <path d="m106.383 159.046h26.4967" marker-end="url(#Arrow2Mend)"/> + <path d="m103.138 152.268h41.5564v-3.92426" marker-end="url(#marker1856)"/> + <path d="m106.38 129.354h17.7785"/> + <path d="m125.818 129.359h7.2418" marker-end="url(#marker1964)"/> + </g> + <path d="m124.169 129.354a0.959514 0.97091 0 0 1 0.47587-0.84557 0.959514 0.97091 0 0 1 0.96164-3e-3 0.959514 0.97091 0 0 1 0.48149 0.84231" stroke-linecap="square" stroke-width=".600001"/> + <path d="m55.7026 180.832h34.8131" marker-end="url(#marker2074)" stroke-width=".6"/> + </g> + <g> + <path d="m55.6464 185.744h34.8131" marker-end="url(#marker2074-3)" stroke="#28a4ff" stroke-width=".600001"/> + <g stroke-width=".6"> + <path d="m94.0487 129.889v-10.6493" marker-end="url(#marker3373)" stroke="#000"/> + <path d="m89.7534 118.621v10.662" marker-end="url(#marker3517)" stroke="#000"/> + <path d="m92.119 138.812v7.9718" marker-end="url(#marker3667)" stroke="#28a4ff"/> + </g> + </g> + </g> + <text transform="matrix(.264583 0 0 .264583 91.8919 139.964)" x="26.959213" y="9.11724" fill="#2aa1ff" filter="url(#filter1204-6-2-9-1-3-1)" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle" style="line-height:1.1" xml:space="preserve"><tspan x="26.959213" y="9.11724" text-align="center">Set</tspan><tspan x="26.959213" y="22.31724" text-align="center">abort</tspan></text> + <text transform="translate(49.0277 104.823)" x="57.620724" y="16.855087" filter="url(#filter1204)" font-size="3.175px" text-align="center" text-anchor="middle" style="line-height:1.1" xml:space="preserve"><tspan x="57.620724" y="16.855087" text-align="center">Transmission</tspan><tspan x="57.620724" y="20.347588" text-align="center">unsuccesfull</tspan></text> + <g font-size="12px" stroke-width="3.77953" text-anchor="middle"> + <text transform="matrix(.264583 0 0 .264583 68.5988 118.913)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">starts</tspan></text> + <text transform="matrix(.264583 0 0 .264583 106.802 130.509)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">succesfull</tspan></text> + <text transform="matrix(.264583 0 0 .264583 107.77 145.476)" x="38.824219" y="9.1171875" filter="url(#filter1204)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Transmission</tspan><tspan x="38.824219" y="22.317188" text-align="center">sborted</tspan></text> + </g> + <g stroke-width="3.77953" text-anchor="middle"> + <text transform="matrix(.264583 0 0 .264583 107.574 155.948)" x="38.824219" y="9.1171875" filter="url(#filter1204)" font-size="10.6667px" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824219" y="9.1171875" text-align="center">Retransmit</tspan><tspan x="38.824219" y="20.850557" text-align="center">limit reached or</tspan><tspan x="38.824219" y="32.583927" text-align="center">node went bus off</tspan><tspan x="38.824219" y="44.317299" text-align="center"/></text> + <text transform="matrix(.264583 0 0 .264583 60.7127 177.384)" x="38.824539" y="9.1173134" filter="url(#filter1204-6)" font-size="12px" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="38.824539" y="9.1173134" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Transmission result</tspan></text> + <text transform="matrix(.264583 0 0 .264583 45.6885 173.226)" x="57.727047" y="9.11724" filter="url(#filter1204-6-9)" font-size="12px" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="57.727047" y="9.11724" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Legend:</tspan></text> + </g> + <g fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-anchor="middle"> + <text transform="matrix(.264583 0 0 .264583 57.0045 182.079)" x="57.727047" y="9.11724" filter="url(#filter1204-6-2)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="57.727047" y="9.11724" fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">SW command</tspan></text> + <text transform="matrix(.264583 0 0 .264583 57.7865 110.104)" x="40.822609" y="9.11724" filter="url(#filter1204-6-2-9)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="40.822609" y="9.11724" fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Set ready</tspan></text> + <text transform="matrix(.264583 0 0 .264583 116.893 107.491)" x="28.049065" y="9.1172523" filter="url(#filter1204-6-2-9-4)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="28.049065" y="9.1172523" fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Set ready</tspan></text> + <text transform="matrix(.264583 0 0 .264583 87.5687 166.324)" x="28.049065" y="9.1172523" filter="url(#filter1204-6-2-9-1)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="28.049065" y="9.1172523" fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Set empty</tspan></text> + <text transform="matrix(.264583 0 0 .264583 106.53 113.074)" x="30.228771" y="8.9063139" filter="url(#filter1204-6-2-9-1-3)" text-align="center" style="line-height:1.1" xml:space="preserve"><tspan x="30.228771" y="8.9063139" fill="#2aa1ff" font-size="12px" stroke-width="3.77953" text-align="center" text-anchor="middle">Set abort</tspan></text> + </g> + </g> +</svg> diff --git a/Documentation/networking/device_drivers/can/freescale/flexcan.rst b/Documentation/networking/device_drivers/can/freescale/flexcan.rst new file mode 100644 index 000000000000..106cd2890135 --- /dev/null +++ b/Documentation/networking/device_drivers/can/freescale/flexcan.rst @@ -0,0 +1,54 @@ +.. SPDX-License-Identifier: GPL-2.0+ + +============================= +Flexcan CAN Controller driver +============================= + +Authors: Marc Kleine-Budde <mkl@pengutronix.de>, +Dario Binacchi <dario.binacchi@amarulasolutions.com> + +On/off RTR frames reception +=========================== + +For most flexcan IP cores the driver supports 2 RX modes: + +- FIFO +- mailbox + +The older flexcan cores (integrated into the i.MX25, i.MX28, i.MX35 +and i.MX53 SOCs) only receive RTR frames if the controller is +configured for RX-FIFO mode. + +The RX FIFO mode uses a hardware FIFO with a depth of 6 CAN frames, +while the mailbox mode uses a software FIFO with a depth of up to 62 +CAN frames. With the help of the bigger buffer, the mailbox mode +performs better under high system load situations. + +As reception of RTR frames is part of the CAN standard, all flexcan +cores come up in a mode where RTR reception is possible. + +With the "rx-rtr" private flag the ability to receive RTR frames can +be waived at the expense of losing the ability to receive RTR +messages. This trade off is beneficial in certain use cases. + +"rx-rtr" on + Receive RTR frames. (default) + + The CAN controller can and will receive RTR frames. + + On some IP cores the controller cannot receive RTR frames in the + more performant "RX mailbox" mode and will use "RX FIFO" mode + instead. + +"rx-rtr" off + + Waive ability to receive RTR frames. (not supported on all IP cores) + + This mode activates the "RX mailbox mode" for better performance, on + some IP cores RTR frames cannot be received anymore. + +The setting can only be changed if the interface is down:: + + ip link set dev can0 down + ethtool --set-priv-flags can0 rx-rtr {off|on} + ip link set dev can0 up diff --git a/Documentation/networking/device_drivers/wan/index.rst b/Documentation/networking/device_drivers/can/index.rst index 9d9ae94f00b4..6a8a4f74fa26 100644 --- a/Documentation/networking/device_drivers/wan/index.rst +++ b/Documentation/networking/device_drivers/can/index.rst @@ -1,14 +1,18 @@ .. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) -Classic WAN Device Drivers -========================== +Controller Area Network (CAN) Device Drivers +============================================ + +Device drivers for CAN devices. Contents: .. toctree:: :maxdepth: 2 - z8530book + can327 + ctu/ctucanfd-driver + freescale/flexcan .. only:: subproject and html diff --git a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst index 01b2a69b0cb0..8bcb173e0353 100644 --- a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst +++ b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst @@ -135,7 +135,7 @@ The ENA driver supports two Queue Operation modes for Tx SQs: - **Low Latency Queue (LLQ) mode or "push-mode":** In this mode the driver pushes the transmit descriptors and the - first 128 bytes of the packet directly to the ENA device memory + first 96 bytes of the packet directly to the ENA device memory space. The rest of the packet payload is fetched by the device. For this operation mode, the driver uses a dedicated PCI device memory BAR, which is mapped with write-combine capability. diff --git a/Documentation/networking/device_drivers/ethernet/dec/de4x5.rst b/Documentation/networking/device_drivers/ethernet/dec/de4x5.rst deleted file mode 100644 index e03e9c631879..000000000000 --- a/Documentation/networking/device_drivers/ethernet/dec/de4x5.rst +++ /dev/null @@ -1,189 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -=================================== -DEC EtherWORKS Ethernet De4x5 cards -=================================== - - Originally, this driver was written for the Digital Equipment - Corporation series of EtherWORKS Ethernet cards: - - - DE425 TP/COAX EISA - - DE434 TP PCI - - DE435 TP/COAX/AUI PCI - - DE450 TP/COAX/AUI PCI - - DE500 10/100 PCI Fasternet - - but it will now attempt to support all cards which conform to the - Digital Semiconductor SROM Specification. The driver currently - recognises the following chips: - - - DC21040 (no SROM) - - DC21041[A] - - DC21140[A] - - DC21142 - - DC21143 - - So far the driver is known to work with the following cards: - - - KINGSTON - - Linksys - - ZNYX342 - - SMC8432 - - SMC9332 (w/new SROM) - - ZNYX31[45] - - ZNYX346 10/100 4 port (can act as a 10/100 bridge!) - - The driver has been tested on a relatively busy network using the DE425, - DE434, DE435 and DE500 cards and benchmarked with 'ttcp': it transferred - 16M of data to a DECstation 5000/200 as follows:: - - TCP UDP - TX RX TX RX - DE425 1030k 997k 1170k 1128k - DE434 1063k 995k 1170k 1125k - DE435 1063k 995k 1170k 1125k - DE500 1063k 998k 1170k 1125k in 10Mb/s mode - - All values are typical (in kBytes/sec) from a sample of 4 for each - measurement. Their error is +/-20k on a quiet (private) network and also - depend on what load the CPU has. - ----------------------------------------------------------------------------- - - The ability to load this driver as a loadable module has been included - and used extensively during the driver development (to save those long - reboot sequences). Loadable module support under PCI and EISA has been - achieved by letting the driver autoprobe as if it were compiled into the - kernel. Do make sure you're not sharing interrupts with anything that - cannot accommodate interrupt sharing! - - To utilise this ability, you have to do 8 things: - - 0) have a copy of the loadable modules code installed on your system. - 1) copy de4x5.c from the /linux/drivers/net directory to your favourite - temporary directory. - 2) for fixed autoprobes (not recommended), edit the source code near - line 5594 to reflect the I/O address you're using, or assign these when - loading by:: - - insmod de4x5 io=0xghh where g = bus number - hh = device number - - .. note:: - - autoprobing for modules is now supported by default. You may just - use:: - - insmod de4x5 - - to load all available boards. For a specific board, still use - the 'io=?' above. - 3) compile de4x5.c, but include -DMODULE in the command line to ensure - that the correct bits are compiled (see end of source code). - 4) if you are wanting to add a new card, goto 5. Otherwise, recompile a - kernel with the de4x5 configuration turned off and reboot. - 5) insmod de4x5 [io=0xghh] - 6) run the net startup bits for your new eth?? interface(s) manually - (usually /etc/rc.inet[12] at boot time). - 7) enjoy! - - To unload a module, turn off the associated interface(s) - 'ifconfig eth?? down' then 'rmmod de4x5'. - - Automedia detection is included so that in principle you can disconnect - from, e.g. TP, reconnect to BNC and things will still work (after a - pause while the driver figures out where its media went). My tests - using ping showed that it appears to work.... - - By default, the driver will now autodetect any DECchip based card. - Should you have a need to restrict the driver to DIGITAL only cards, you - can compile with a DEC_ONLY define, or if loading as a module, use the - 'dec_only=1' parameter. - - I've changed the timing routines to use the kernel timer and scheduling - functions so that the hangs and other assorted problems that occurred - while autosensing the media should be gone. A bonus for the DC21040 - auto media sense algorithm is that it can now use one that is more in - line with the rest (the DC21040 chip doesn't have a hardware timer). - The downside is the 1 'jiffies' (10ms) resolution. - - IEEE 802.3u MII interface code has been added in anticipation that some - products may use it in the future. - - The SMC9332 card has a non-compliant SROM which needs fixing - I have - patched this driver to detect it because the SROM format used complies - to a previous DEC-STD format. - - I have removed the buffer copies needed for receive on Intels. I cannot - remove them for Alphas since the Tulip hardware only does longword - aligned DMA transfers and the Alphas get alignment traps with non - longword aligned data copies (which makes them really slow). No comment. - - I have added SROM decoding routines to make this driver work with any - card that supports the Digital Semiconductor SROM spec. This will help - all cards running the dc2114x series chips in particular. Cards using - the dc2104x chips should run correctly with the basic driver. I'm in - debt to <mjacob@feral.com> for the testing and feedback that helped get - this feature working. So far we have tested KINGSTON, SMC8432, SMC9332 - (with the latest SROM complying with the SROM spec V3: their first was - broken), ZNYX342 and LinkSys. ZNYX314 (dual 21041 MAC) and ZNYX 315 - (quad 21041 MAC) cards also appear to work despite their incorrectly - wired IRQs. - - I have added a temporary fix for interrupt problems when some SCSI cards - share the same interrupt as the DECchip based cards. The problem occurs - because the SCSI card wants to grab the interrupt as a fast interrupt - (runs the service routine with interrupts turned off) vs. this card - which really needs to run the service routine with interrupts turned on. - This driver will now add the interrupt service routine as a fast - interrupt if it is bounced from the slow interrupt. THIS IS NOT A - RECOMMENDED WAY TO RUN THE DRIVER and has been done for a limited time - until people sort out their compatibility issues and the kernel - interrupt service code is fixed. YOU SHOULD SEPARATE OUT THE FAST - INTERRUPT CARDS FROM THE SLOW INTERRUPT CARDS to ensure that they do not - run on the same interrupt. PCMCIA/CardBus is another can of worms... - - Finally, I think I have really fixed the module loading problem with - more than one DECchip based card. As a side effect, I don't mess with - the device structure any more which means that if more than 1 card in - 2.0.x is installed (4 in 2.1.x), the user will have to edit - linux/drivers/net/Space.c to make room for them. Hence, module loading - is the preferred way to use this driver, since it doesn't have this - limitation. - - Where SROM media detection is used and full duplex is specified in the - SROM, the feature is ignored unless lp->params.fdx is set at compile - time OR during a module load (insmod de4x5 args='eth??:fdx' [see - below]). This is because there is no way to automatically detect full - duplex links except through autonegotiation. When I include the - autonegotiation feature in the SROM autoconf code, this detection will - occur automatically for that case. - - Command line arguments are now allowed, similar to passing arguments - through LILO. This will allow a per adapter board set up of full duplex - and media. The only lexical constraints are: the board name (dev->name) - appears in the list before its parameters. The list of parameters ends - either at the end of the parameter list or with another board name. The - following parameters are allowed: - - ========= =============================================== - fdx for full duplex - autosense to set the media/speed; with the following - sub-parameters: - TP, TP_NW, BNC, AUI, BNC_AUI, 100Mb, 10Mb, AUTO - ========= =============================================== - - Case sensitivity is important for the sub-parameters. They *must* be - upper case. Examples:: - - insmod de4x5 args='eth1:fdx autosense=BNC eth0:autosense=100Mb'. - - For a compiled in driver, in linux/drivers/net/CONFIG, place e.g.:: - - DE4X5_OPTS = -DDE4X5_PARM='"eth0:fdx autosense=AUI eth2:autosense=TP"' - - Yes, I know full duplex isn't permissible on BNC or AUI; they're just - examples. By default, full duplex is turned off and AUTO is the default - autosense setting. In reality, I expect only the full duplex option to - be used. Note the use of single quotes in the two examples above and the - lack of commas to separate items. diff --git a/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst index d638b5a8aadd..199647729251 100644 --- a/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst +++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst @@ -183,6 +183,7 @@ PHY and allows physical transmission and reception of Ethernet frames. IRQ config, enable, reset DPNI (Datapath Network Interface) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Contains TX/RX queues, network interface configuration, and RX buffer pool configuration mechanisms. The TX/RX queues are in memory and are identified by queue number. diff --git a/Documentation/networking/device_drivers/ethernet/index.rst b/Documentation/networking/device_drivers/ethernet/index.rst index 6b5dc203da2b..5196905582c5 100644 --- a/Documentation/networking/device_drivers/ethernet/index.rst +++ b/Documentation/networking/device_drivers/ethernet/index.rst @@ -19,7 +19,6 @@ Contents: cirrus/cs89x0 dlink/dl2k davicom/dm9000 - dec/de4x5 dec/dmfe freescale/dpaa freescale/dpaa2/index @@ -39,10 +38,10 @@ Contents: intel/iavf intel/ice marvell/octeontx2 + marvell/octeon_ep mellanox/mlx5 microsoft/netvsc neterion/s2io - neterion/vxge netronome/nfp pensando/ionic smsc/smc9 @@ -52,6 +51,8 @@ Contents: ti/am65_nuss_cpsw_switchdev ti/tlan toshiba/spider_net + wangxun/txgbe + wangxun/ngbe .. only:: subproject and html diff --git a/Documentation/networking/device_drivers/ethernet/intel/ice.rst b/Documentation/networking/device_drivers/ethernet/intel/ice.rst index 67b7a701ce9e..dc2e60ced927 100644 --- a/Documentation/networking/device_drivers/ethernet/intel/ice.rst +++ b/Documentation/networking/device_drivers/ethernet/intel/ice.rst @@ -901,6 +901,15 @@ To enable/disable UDP Segmentation Offload, issue the following command:: # ethtool -K <ethX> tx-udp-segmentation [off|on] +GNSS module +----------- +Allows user to read messages from the GNSS module and write supported commands. +If the module is physically present, driver creates 2 TTYs for each supported +device in /dev, ttyGNSS_<device>:<function>_0 and _1. First one (_0) is RW and +the second one is RO. +The protocol of write commands is dependent on the GNSS module as the driver +writes raw bytes from the TTY to the GNSS i2c. Please refer to the module +documentation for details. Performance Optimization ======================== diff --git a/Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst b/Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst index f1d5233e5e51..0a233b17c664 100644 --- a/Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst +++ b/Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst @@ -440,6 +440,22 @@ NOTE: For 82599-based network connections, if you are enabling jumbo frames in a virtual function (VF), jumbo frames must first be enabled in the physical function (PF). The VF MTU setting cannot be larger than the PF MTU. +NBASE-T Support +--------------- +The ixgbe driver supports NBASE-T on some devices. However, the advertisement +of NBASE-T speeds is suppressed by default, to accommodate broken network +switches which cannot cope with advertised NBASE-T speeds. Use the ethtool +command to enable advertising NBASE-T speeds on devices which support it:: + + ethtool -s eth? advertise 0x1800000001028 + +On Linux systems with INTERFACES(5), this can be specified as a pre-up command +in /etc/network/interfaces so that the interface is always brought up with +NBASE-T support, e.g.:: + + iface eth? inet dhcp + pre-up ethtool -s eth? advertise 0x1800000001028 || true + Generic Receive Offload, aka GRO -------------------------------- The driver supports the in-kernel software implementation of GRO. GRO has diff --git a/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst b/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst new file mode 100644 index 000000000000..bc562c49011b --- /dev/null +++ b/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst @@ -0,0 +1,35 @@ +.. SPDX-License-Identifier: GPL-2.0+ + +==================================================================== +Linux kernel networking driver for Marvell's Octeon PCI Endpoint NIC +==================================================================== + +Network driver for Marvell's Octeon PCI EndPoint NIC. +Copyright (c) 2020 Marvell International Ltd. + +Contents +======== + +- `Overview`_ +- `Supported Devices`_ +- `Interface Control`_ + +Overview +======== +This driver implements networking functionality of Marvell's Octeon PCI +EndPoint NIC. + +Supported Devices +================= +Currently, this driver support following devices: + * Network controller: Cavium, Inc. Device b200 + +Interface Control +================= +Network Interface control like changing mtu, link speed, link down/up are +done by writing command to mailbox command queue, a mailbox interface +implemented through a reserved region in BAR4. +This driver writes the commands into the mailbox and the firmware on the +Octeon device processes them. The firmware also sends unsolicited notifications +to driver for events suchs as link change, through notification queue +implemented as part of mailbox interface. diff --git a/Documentation/networking/device_drivers/ethernet/neterion/vxge.rst b/Documentation/networking/device_drivers/ethernet/neterion/vxge.rst deleted file mode 100644 index 589c6b15c63d..000000000000 --- a/Documentation/networking/device_drivers/ethernet/neterion/vxge.rst +++ /dev/null @@ -1,115 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -============================================================================== -Neterion's (Formerly S2io) X3100 Series 10GbE PCIe Server Adapter Linux driver -============================================================================== - -.. Contents - - 1) Introduction - 2) Features supported - 3) Configurable driver parameters - 4) Troubleshooting - -1. Introduction -=============== - -This Linux driver supports all Neterion's X3100 series 10 GbE PCIe I/O -Virtualized Server adapters. - -The X3100 series supports four modes of operation, configurable via -firmware: - - - Single function mode - - Multi function mode - - SRIOV mode - - MRIOV mode - -The functions share a 10GbE link and the pci-e bus, but hardly anything else -inside the ASIC. Features like independent hw reset, statistics, bandwidth/ -priority allocation and guarantees, GRO, TSO, interrupt moderation etc are -supported independently on each function. - -(See below for a complete list of features supported for both IPv4 and IPv6) - -2. Features supported -===================== - -i) Single function mode (up to 17 queues) - -ii) Multi function mode (up to 17 functions) - -iii) PCI-SIG's I/O Virtualization - - - Single Root mode: v1.0 (up to 17 functions) - - Multi-Root mode: v1.0 (up to 17 functions) - -iv) Jumbo frames - - X3100 Series supports MTU up to 9600 bytes, modifiable using - ip command. - -v) Offloads supported: (Enabled by default) - - - Checksum offload (TCP/UDP/IP) on transmit and receive paths - - TCP Segmentation Offload (TSO) on transmit path - - Generic Receive Offload (GRO) on receive path - -vi) MSI-X: (Enabled by default) - - Resulting in noticeable performance improvement (up to 7% on certain - platforms). - -vii) NAPI: (Enabled by default) - - For better Rx interrupt moderation. - -viii)RTH (Receive Traffic Hash): (Enabled by default) - - Receive side steering for better scaling. - -ix) Statistics - - Comprehensive MAC-level and software statistics displayed using - "ethtool -S" option. - -x) Multiple hardware queues: (Enabled by default) - - Up to 17 hardware based transmit and receive data channels, with - multiple steering options (transmit multiqueue enabled by default). - -3) Configurable driver parameters: ----------------------------------- - -i) max_config_dev - Specifies maximum device functions to be enabled. - - Valid range: 1-8 - -ii) max_config_port - Specifies number of ports to be enabled. - - Valid range: 1,2 - - Default: 1 - -iii) max_config_vpath - Specifies maximum VPATH(s) configured for each device function. - - Valid range: 1-17 - -iv) vlan_tag_strip - Enables/disables vlan tag stripping from all received tagged frames that - are not replicated at the internal L2 switch. - - Valid range: 0,1 (disabled, enabled respectively) - - Default: 1 - -v) addr_learn_en - Enable learning the mac address of the guest OS interface in - virtualization environment. - - Valid range: 0,1 (disabled, enabled respectively) - - Default: 0 diff --git a/Documentation/networking/device_drivers/ethernet/wangxun/ngbe.rst b/Documentation/networking/device_drivers/ethernet/wangxun/ngbe.rst new file mode 100644 index 000000000000..43a02f9943e1 --- /dev/null +++ b/Documentation/networking/device_drivers/ethernet/wangxun/ngbe.rst @@ -0,0 +1,14 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================================= +Linux Base Driver for WangXun(R) Gigabit PCI Express Adapters +============================================================= + +WangXun Gigabit Linux driver. +Copyright (c) 2019 - 2022 Beijing WangXun Technology Co., Ltd. + +Support +======= + If you have problems with the software or hardware, please contact our + customer support team via email at nic-support@net-swift.com or check our website + at https://www.net-swift.com diff --git a/Documentation/networking/device_drivers/ethernet/wangxun/txgbe.rst b/Documentation/networking/device_drivers/ethernet/wangxun/txgbe.rst new file mode 100644 index 000000000000..eaa87dbe8848 --- /dev/null +++ b/Documentation/networking/device_drivers/ethernet/wangxun/txgbe.rst @@ -0,0 +1,20 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================================================ +Linux Base Driver for WangXun(R) 10 Gigabit PCI Express Adapters +================================================================ + +WangXun 10 Gigabit Linux driver. +Copyright (c) 2015 - 2022 Beijing WangXun Technology Co., Ltd. + + +Contents +======== + +- Support + + +Support +======= +If you got any problem, contact Wangxun support team via support@trustnetic.com +and Cc: netdev. diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst index 3a5a1d46e77e..601eacaf12f3 100644 --- a/Documentation/networking/device_drivers/index.rst +++ b/Documentation/networking/device_drivers/index.rst @@ -11,12 +11,12 @@ Contents: appletalk/index atm/index cable/index + can/index cellular/index ethernet/index fddi/index hamradio/index qlogic/index - wan/index wifi/index wwan/index diff --git a/Documentation/networking/device_drivers/wan/z8530book.rst b/Documentation/networking/device_drivers/wan/z8530book.rst deleted file mode 100644 index fea2c40e7973..000000000000 --- a/Documentation/networking/device_drivers/wan/z8530book.rst +++ /dev/null @@ -1,256 +0,0 @@ -======================= -Z8530 Programming Guide -======================= - -:Author: Alan Cox - -Introduction -============ - -The Z85x30 family synchronous/asynchronous controller chips are used on -a large number of cheap network interface cards. The kernel provides a -core interface layer that is designed to make it easy to provide WAN -services using this chip. - -The current driver only support synchronous operation. Merging the -asynchronous driver support into this code to allow any Z85x30 device to -be used as both a tty interface and as a synchronous controller is a -project for Linux post the 2.4 release - -Driver Modes -============ - -The Z85230 driver layer can drive Z8530, Z85C30 and Z85230 devices in -three different modes. Each mode can be applied to an individual channel -on the chip (each chip has two channels). - -The PIO synchronous mode supports the most common Z8530 wiring. Here the -chip is interface to the I/O and interrupt facilities of the host -machine but not to the DMA subsystem. When running PIO the Z8530 has -extremely tight timing requirements. Doing high speeds, even with a -Z85230 will be tricky. Typically you should expect to achieve at best -9600 baud with a Z8C530 and 64Kbits with a Z85230. - -The DMA mode supports the chip when it is configured to use dual DMA -channels on an ISA bus. The better cards tend to support this mode of -operation for a single channel. With DMA running the Z85230 tops out -when it starts to hit ISA DMA constraints at about 512Kbits. It is worth -noting here that many PC machines hang or crash when the chip is driven -fast enough to hold the ISA bus solid. - -Transmit DMA mode uses a single DMA channel. The DMA channel is used for -transmission as the transmit FIFO is smaller than the receive FIFO. it -gives better performance than pure PIO mode but is nowhere near as ideal -as pure DMA mode. - -Using the Z85230 driver -======================= - -The Z85230 driver provides the back end interface to your board. To -configure a Z8530 interface you need to detect the board and to identify -its ports and interrupt resources. It is also your problem to verify the -resources are available. - -Having identified the chip you need to fill in a struct z8530_dev, -which describes each chip. This object must exist until you finally -shutdown the board. Firstly zero the active field. This ensures nothing -goes off without you intending it. The irq field should be set to the -interrupt number of the chip. (Each chip has a single interrupt source -rather than each channel). You are responsible for allocating the -interrupt line. The interrupt handler should be set to -:c:func:`z8530_interrupt()`. The device id should be set to the -z8530_dev structure pointer. Whether the interrupt can be shared or not -is board dependent, and up to you to initialise. - -The structure holds two channel structures. Initialise chanA.ctrlio and -chanA.dataio with the address of the control and data ports. You can or -this with Z8530_PORT_SLEEP to indicate your interface needs the 5uS -delay for chip settling done in software. The PORT_SLEEP option is -architecture specific. Other flags may become available on future -platforms, eg for MMIO. Initialise the chanA.irqs to &z8530_nop to -start the chip up as disabled and discarding interrupt events. This -ensures that stray interrupts will be mopped up and not hang the bus. -Set chanA.dev to point to the device structure itself. The private and -name field you may use as you wish. The private field is unused by the -Z85230 layer. The name is used for error reporting and it may thus make -sense to make it match the network name. - -Repeat the same operation with the B channel if your chip has both -channels wired to something useful. This isn't always the case. If it is -not wired then the I/O values do not matter, but you must initialise -chanB.dev. - -If your board has DMA facilities then initialise the txdma and rxdma -fields for the relevant channels. You must also allocate the ISA DMA -channels and do any necessary board level initialisation to configure -them. The low level driver will do the Z8530 and DMA controller -programming but not board specific magic. - -Having initialised the device you can then call -:c:func:`z8530_init()`. This will probe the chip and reset it into -a known state. An identification sequence is then run to identify the -chip type. If the checks fail to pass the function returns a non zero -error code. Typically this indicates that the port given is not valid. -After this call the type field of the z8530_dev structure is -initialised to either Z8530, Z85C30 or Z85230 according to the chip -found. - -Once you have called z8530_init you can also make use of the utility -function :c:func:`z8530_describe()`. This provides a consistent -reporting format for the Z8530 devices, and allows all the drivers to -provide consistent reporting. - -Attaching Network Interfaces -============================ - -If you wish to use the network interface facilities of the driver, then -you need to attach a network device to each channel that is present and -in use. In addition to use the generic HDLC you need to follow some -additional plumbing rules. They may seem complex but a look at the -example hostess_sv11 driver should reassure you. - -The network device used for each channel should be pointed to by the -netdevice field of each channel. The hdlc-> priv field of the network -device points to your private data - you will need to be able to find -your private data from this. - -The way most drivers approach this particular problem is to create a -structure holding the Z8530 device definition and put that into the -private field of the network device. The network device fields of the -channels then point back to the network devices. - -If you wish to use the generic HDLC then you need to register the HDLC -device. - -Before you register your network device you will also need to provide -suitable handlers for most of the network device callbacks. See the -network device documentation for more details on this. - -Configuring And Activating The Port -=================================== - -The Z85230 driver provides helper functions and tables to load the port -registers on the Z8530 chips. When programming the register settings for -a channel be aware that the documentation recommends initialisation -orders. Strange things happen when these are not followed. - -:c:func:`z8530_channel_load()` takes an array of pairs of -initialisation values in an array of u8 type. The first value is the -Z8530 register number. Add 16 to indicate the alternate register bank on -the later chips. The array is terminated by a 255. - -The driver provides a pair of public tables. The z8530_hdlc_kilostream -table is for the UK 'Kilostream' service and also happens to cover most -other end host configurations. The z8530_hdlc_kilostream_85230 table -is the same configuration using the enhancements of the 85230 chip. The -configuration loaded is standard NRZ encoded synchronous data with HDLC -bitstuffing. All of the timing is taken from the other end of the link. - -When writing your own tables be aware that the driver internally tracks -register values. It may need to reload values. You should therefore be -sure to set registers 1-7, 9-11, 14 and 15 in all configurations. Where -the register settings depend on DMA selection the driver will update the -bits itself when you open or close. Loading a new table with the -interface open is not recommended. - -There are three standard configurations supported by the core code. In -PIO mode the interface is programmed up to use interrupt driven PIO. -This places high demands on the host processor to avoid latency. The -driver is written to take account of latency issues but it cannot avoid -latencies caused by other drivers, notably IDE in PIO mode. Because the -drivers allocate buffers you must also prevent MTU changes while the -port is open. - -Once the port is open it will call the rx_function of each channel -whenever a completed packet arrived. This is invoked from interrupt -context and passes you the channel and a network buffer (struct -sk_buff) holding the data. The data includes the CRC bytes so most -users will want to trim the last two bytes before processing the data. -This function is very timing critical. When you wish to simply discard -data the support code provides the function -:c:func:`z8530_null_rx()` to discard the data. - -To active PIO mode sending and receiving the ``z8530_sync_open`` is called. -This expects to be passed the network device and the channel. Typically -this is called from your network device open callback. On a failure a -non zero error status is returned. -The :c:func:`z8530_sync_close()` function shuts down a PIO -channel. This must be done before the channel is opened again and before -the driver shuts down and unloads. - -The ideal mode of operation is dual channel DMA mode. Here the kernel -driver will configure the board for DMA in both directions. The driver -also handles ISA DMA issues such as controller programming and the -memory range limit for you. This mode is activated by calling the -:c:func:`z8530_sync_dma_open()` function. On failure a non zero -error value is returned. Once this mode is activated it can be shut down -by calling the :c:func:`z8530_sync_dma_close()`. You must call -the close function matching the open mode you used. - -The final supported mode uses a single DMA channel to drive the transmit -side. As the Z85C30 has a larger FIFO on the receive channel this tends -to increase the maximum speed a little. This is activated by calling the -``z8530_sync_txdma_open``. This returns a non zero error code on failure. The -:c:func:`z8530_sync_txdma_close()` function closes down the Z8530 -interface from this mode. - -Network Layer Functions -======================= - -The Z8530 layer provides functions to queue packets for transmission. -The driver internally buffers the frame currently being transmitted and -one further frame (in order to keep back to back transmission running). -Any further buffering is up to the caller. - -The function :c:func:`z8530_queue_xmit()` takes a network buffer -in sk_buff format and queues it for transmission. The caller must -provide the entire packet with the exception of the bitstuffing and CRC. -This is normally done by the caller via the generic HDLC interface -layer. It returns 0 if the buffer has been queued and non zero values -for queue full. If the function accepts the buffer it becomes property -of the Z8530 layer and the caller should not free it. - -The function :c:func:`z8530_get_stats()` returns a pointer to an -internally maintained per interface statistics block. This provides most -of the interface code needed to implement the network layer get_stats -callback. - -Porting The Z8530 Driver -======================== - -The Z8530 driver is written to be portable. In DMA mode it makes -assumptions about the use of ISA DMA. These are probably warranted in -most cases as the Z85230 in particular was designed to glue to PC type -machines. The PIO mode makes no real assumptions. - -Should you need to retarget the Z8530 driver to another architecture the -only code that should need changing are the port I/O functions. At the -moment these assume PC I/O port accesses. This may not be appropriate -for all platforms. Replacing :c:func:`z8530_read_port()` and -``z8530_write_port`` is intended to be all that is required to port -this driver layer. - -Known Bugs And Assumptions -========================== - -Interrupt Locking - The locking in the driver is done via the global cli/sti lock. This - makes for relatively poor SMP performance. Switching this to use a - per device spin lock would probably materially improve performance. - -Occasional Failures - We have reports of occasional failures when run for very long - periods of time and the driver starts to receive junk frames. At the - moment the cause of this is not clear. - -Public Functions Provided -========================= - -.. kernel-doc:: drivers/net/wan/z85230.c - :export: - -Internal Functions -================== - -.. kernel-doc:: drivers/net/wan/z85230.c - :internal: diff --git a/Documentation/networking/device_drivers/wwan/index.rst b/Documentation/networking/device_drivers/wwan/index.rst index 1cb8c7371401..370d8264d5dc 100644 --- a/Documentation/networking/device_drivers/wwan/index.rst +++ b/Documentation/networking/device_drivers/wwan/index.rst @@ -9,6 +9,7 @@ Contents: :maxdepth: 2 iosm + t7xx .. only:: subproject and html diff --git a/Documentation/networking/device_drivers/wwan/t7xx.rst b/Documentation/networking/device_drivers/wwan/t7xx.rst new file mode 100644 index 000000000000..dd5b731957ca --- /dev/null +++ b/Documentation/networking/device_drivers/wwan/t7xx.rst @@ -0,0 +1,120 @@ +.. SPDX-License-Identifier: GPL-2.0-only + +.. Copyright (C) 2020-21 Intel Corporation + +.. _t7xx_driver_doc: + +============================================ +t7xx driver for MTK PCIe based T700 5G modem +============================================ +The t7xx driver is a WWAN PCIe host driver developed for linux or Chrome OS platforms +for data exchange over PCIe interface between Host platform & MediaTek's T700 5G modem. +The driver exposes an interface conforming to the MBIM protocol [1]. Any front end +application (e.g. Modem Manager) could easily manage the MBIM interface to enable +data communication towards WWAN. The driver also provides an interface to interact +with the MediaTek's modem via AT commands. + +Basic usage +=========== +MBIM & AT functions are inactive when unmanaged. The t7xx driver provides +WWAN port userspace interfaces representing MBIM & AT control channels and does +not play any role in managing their functionality. It is the job of a userspace +application to detect port enumeration and enable MBIM & AT functionalities. + +Examples of few such userspace applications are: + +- mbimcli (included with the libmbim [2] library), and +- Modem Manager [3] + +Management Applications to carry out below required actions for establishing +MBIM IP session: + +- open the MBIM control channel +- configure network connection settings +- connect to network +- configure IP network interface + +Management Applications to carry out below required actions for send an AT +command and receive response: + +- open the AT control channel using a UART tool or a special user tool + +Management application development +================================== +The driver and userspace interfaces are described below. The MBIM protocol is +described in [1] Mobile Broadband Interface Model v1.0 Errata-1. + +MBIM control channel userspace ABI +---------------------------------- + +/dev/wwan0mbim0 character device +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The driver exposes an MBIM interface to the MBIM function by implementing +MBIM WWAN Port. The userspace end of the control channel pipe is a +/dev/wwan0mbim0 character device. Application shall use this interface for +MBIM protocol communication. + +Fragmentation +~~~~~~~~~~~~~ +The userspace application is responsible for all control message fragmentation +and defragmentation as per MBIM specification. + +/dev/wwan0mbim0 write() +~~~~~~~~~~~~~~~~~~~~~~~ +The MBIM control messages from the management application must not exceed the +negotiated control message size. + +/dev/wwan0mbim0 read() +~~~~~~~~~~~~~~~~~~~~~~ +The management application must accept control messages of up the negotiated +control message size. + +MBIM data channel userspace ABI +------------------------------- + +wwan0-X network device +~~~~~~~~~~~~~~~~~~~~~~ +The t7xx driver exposes IP link interface "wwan0-X" of type "wwan" for IP +traffic. Iproute network utility is used for creating "wwan0-X" network +interface and for associating it with MBIM IP session. + +The userspace management application is responsible for creating new IP link +prior to establishing MBIM IP session where the SessionId is greater than 0. + +For example, creating new IP link for a MBIM IP session with SessionId 1: + + ip link add dev wwan0-1 parentdev wwan0 type wwan linkid 1 + +The driver will automatically map the "wwan0-1" network device to MBIM IP +session 1. + +AT port userspace ABI +---------------------------------- + +/dev/wwan0at0 character device +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The driver exposes an AT port by implementing AT WWAN Port. +The userspace end of the control port is a /dev/wwan0at0 character +device. Application shall use this interface to issue AT commands. + +The MediaTek's T700 modem supports the 3GPP TS 27.007 [4] specification. + +References +========== +[1] *MBIM (Mobile Broadband Interface Model) Errata-1* + +- https://www.usb.org/document-library/ + +[2] *libmbim "a glib-based library for talking to WWAN modems and devices which +speak the Mobile Interface Broadband Model (MBIM) protocol"* + +- http://www.freedesktop.org/wiki/Software/libmbim/ + +[3] *Modem Manager "a DBus-activated daemon which controls mobile broadband +(2G/3G/4G/5G) devices and connections"* + +- http://www.freedesktop.org/wiki/Software/ModemManager/ + +[4] *Specification # 27.007 - 3GPP* + +- https://www.3gpp.org/DynaReport/27007.htm diff --git a/Documentation/networking/devlink/devlink-linecard.rst b/Documentation/networking/devlink/devlink-linecard.rst new file mode 100644 index 000000000000..6c0b8928bc13 --- /dev/null +++ b/Documentation/networking/devlink/devlink-linecard.rst @@ -0,0 +1,122 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +Devlink Line card +================= + +Background +========== + +The ``devlink-linecard`` mechanism is targeted for manipulation of +line cards that serve as a detachable PHY modules for modular switch +system. Following operations are provided: + + * Get a list of supported line card types. + * Provision of a slot with specific line card type. + * Get and monitor of line card state and its change. + +Line card according to the type may contain one or more gearboxes +to mux the lanes with certain speed to multiple ports with lanes +of different speed. Line card ensures N:M mapping between +the switch ASIC modules and physical front panel ports. + +Overview +======== + +Each line card devlink object is created by device driver, +according to the physical line card slots available on the device. + +Similar to splitter cable, where the device might have no way +of detection of the splitter cable geometry, the device +might not have a way to detect line card type. For that devices, +concept of provisioning is introduced. It allows the user to: + + * Provision a line card slot with certain line card type + + - Device driver would instruct the ASIC to prepare all + resources accordingly. The device driver would + create all instances, namely devlink port and netdevices + that reside on the line card, according to the line card type + * Manipulate of line card entities even without line card + being physically connected or powered-up + * Setup splitter cable on line card ports + + - As on the ordinary ports, user may provision a splitter + cable of a certain type, without the need to + be physically connected to the port + * Configure devlink ports and netdevices + +Netdevice carrier is decided as follows: + + * Line card is not inserted or powered-down + + - The carrier is always down + * Line card is inserted and powered up + + - The carrier is decided as for ordinary port netdevice + +Line card state +=============== + +The ``devlink-linecard`` mechanism supports the following line card states: + + * ``unprovisioned``: Line card is not provisioned on the slot. + * ``unprovisioning``: Line card slot is currently being unprovisioned. + * ``provisioning``: Line card slot is currently in a process of being provisioned + with a line card type. + * ``provisioning_failed``: Provisioning was not successful. + * ``provisioned``: Line card slot is provisioned with a type. + * ``active``: Line card is powered-up and active. + +The following diagram provides a general overview of ``devlink-linecard`` +state transitions:: + + +-------------------------+ + | | + +----------------------------------> unprovisioned | + | | | + | +--------|-------^--------+ + | | | + | | | + | +--------v-------|--------+ + | | | + | | provisioning | + | | | + | +------------|------------+ + | | + | +-----------------------------+ + | | | + | +------------v------------+ +------------v------------+ +-------------------------+ + | | | | ----> | + +----- provisioning_failed | | provisioned | | active | + | | | | <---- | + | +------------^------------+ +------------|------------+ +-------------------------+ + | | | + | | | + | | +------------v------------+ + | | | | + | | | unprovisioning | + | | | | + | | +------------|------------+ + | | | + | +-----------------------------+ + | | + +-----------------------------------------------+ + + +Example usage +============= + +.. code:: shell + + $ devlink lc show [ DEV [ lc LC_INDEX ] ] + $ devlink lc set DEV lc LC_INDEX [ { type LC_TYPE | notype } ] + + # Show current line card configuration and status for all slots: + $ devlink lc + + # Set slot 8 to be provisioned with type "16x100G": + $ devlink lc set pci/0000:01:00.0 lc 8 type 16x100G + + # Set slot 8 to be unprovisioned: + $ devlink lc set pci/0000:01:00.0 lc 8 notype diff --git a/Documentation/networking/devlink/devlink-params.rst b/Documentation/networking/devlink/devlink-params.rst index 4878907e9232..4e01dc32bc08 100644 --- a/Documentation/networking/devlink/devlink-params.rst +++ b/Documentation/networking/devlink/devlink-params.rst @@ -109,14 +109,19 @@ own name. - Boolean - When enabled, the device driver will instantiate VDPA networking specific auxiliary device of the devlink device. + * - ``enable_iwarp`` + - Boolean + - Enable handling of iWARP traffic in the device. * - ``internal_err_reset`` - Boolean - When enabled, the device driver will reset the device on internal errors. * - ``max_macs`` - u32 - - Specifies the maximum number of MAC addresses per ethernet port of - this device. + - Typically macvlan, vlan net devices mac are also programmed in their + parent netdevice's Function rx filter. This parameter limit the + maximum number of unicast mac address filters to receive traffic from + per ethernet port of this device. * - ``region_snapshot_enable`` - Boolean - Enable capture of ``devlink-region`` snapshots. @@ -126,3 +131,9 @@ own name. will NACK any attempt of other host to reset the device. This parameter is useful for setups where a device is shared by different hosts, such as multi-host setup. + * - ``io_eq_size`` + - u32 + - Control the size of I/O completion EQs. + * - ``event_eq_size`` + - u32 + - Control the size of asynchronous control events EQ. diff --git a/Documentation/networking/devlink/devlink-selftests.rst b/Documentation/networking/devlink/devlink-selftests.rst new file mode 100644 index 000000000000..c0aa1f3aef0d --- /dev/null +++ b/Documentation/networking/devlink/devlink-selftests.rst @@ -0,0 +1,38 @@ +.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) + +================= +Devlink Selftests +================= + +The ``devlink-selftests`` API allows executing selftests on the device. + +Tests Mask +========== +The ``devlink-selftests`` command should be run with a mask indicating +the tests to be executed. + +Tests Description +================= +The following is a list of tests that drivers may execute. + +.. list-table:: List of tests + :widths: 5 90 + + * - Name + - Description + * - ``DEVLINK_SELFTEST_FLASH`` + - Devices may have the firmware on non-volatile memory on the board, e.g. + flash. This particular test helps to run a flash selftest on the device. + Implementation of the test is left to the driver/firmware. + +example usage +------------- + +.. code:: shell + + # Query selftests supported on the devlink device + $ devlink dev selftests show DEV + # Query selftests supported on all devlink devices + $ devlink dev selftests show + # Executes selftests on the device + $ devlink dev selftests run DEV id flash diff --git a/Documentation/networking/devlink/ice.rst b/Documentation/networking/devlink/ice.rst index 59c78e9717d2..0c89ceb8986d 100644 --- a/Documentation/networking/devlink/ice.rst +++ b/Documentation/networking/devlink/ice.rst @@ -26,8 +26,10 @@ The ``ice`` driver reports the following versions * - ``fw.mgmt`` - running - 2.1.7 - - 3-digit version number of the management firmware that controls the - PHY, link, etc. + - 3-digit version number of the management firmware running on the + Embedded Management Processor of the device. It controls the PHY, + link, access to device resources, etc. Intel documentation refers to + this as the EMP firmware. * - ``fw.mgmt.api`` - running - 1.5.1 @@ -119,6 +121,60 @@ preserving settings, and thus ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS`` on its own will be rejected. If no overwrite mask is provided, the firmware will be instructed to preserve all settings and identifying fields when updating. +Reload +====== + +The ``ice`` driver supports activating new firmware after a flash update +using ``DEVLINK_CMD_RELOAD`` with the ``DEVLINK_RELOAD_ACTION_FW_ACTIVATE`` +action. + +.. code:: shell + + $ devlink dev reload pci/0000:01:00.0 reload action fw_activate + +The new firmware is activated by issuing a device specific Embedded +Management Processor reset which requests the device to reset and reload the +EMP firmware image. + +The driver does not currently support reloading the driver via +``DEVLINK_RELOAD_ACTION_DRIVER_REINIT``. + +Port split +========== + +The ``ice`` driver supports port splitting only for port 0, as the FW has +a predefined set of available port split options for the whole device. + +A system reboot is required for port split to be applied. + +The following command will select the port split option with 4 ports: + +.. code:: shell + + $ devlink port split pci/0000:16:00.0/0 count 4 + +The list of all available port options will be printed to dynamic debug after +each ``split`` and ``unsplit`` command. The first option is the default. + +.. code:: shell + + ice 0000:16:00.0: Available port split options and max port speeds (Gbps): + ice 0000:16:00.0: Status Split Quad 0 Quad 1 + ice 0000:16:00.0: count L0 L1 L2 L3 L4 L5 L6 L7 + ice 0000:16:00.0: Active 2 100 - - - 100 - - - + ice 0000:16:00.0: 2 50 - 50 - - - - - + ice 0000:16:00.0: Pending 4 25 25 25 25 - - - - + ice 0000:16:00.0: 4 25 25 - - 25 25 - - + ice 0000:16:00.0: 8 10 10 10 10 10 10 10 10 + ice 0000:16:00.0: 1 100 - - - - - - - + +There could be multiple FW port options with the same port split count. When +the same port split count request is issued again, the next FW port option with +the same port split count will be selected. + +``devlink port unsplit`` will select the option with a split count of 1. If +there is no FW option available with split count 1, you will receive an error. + Regions ======= diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst index 443123772f44..4b653d040627 100644 --- a/Documentation/networking/devlink/index.rst +++ b/Documentation/networking/devlink/index.rst @@ -4,6 +4,20 @@ Linux Devlink Documentation devlink is an API to expose device information and resources not directly related to any device class, such as chip-wide/switch-ASIC-wide configuration. +Locking +------- + +Driver facing APIs are currently transitioning to allow more explicit +locking. Drivers can use the existing ``devlink_*`` set of APIs, or +new APIs prefixed by ``devl_*``. The older APIs handle all the locking +in devlink core, but don't allow registration of most sub-objects once +the main devlink object is itself registered. The newer ``devl_*`` APIs assume +the devlink instance lock is already held. Drivers can take the instance +lock by calling ``devl_lock()``. It is also held all callbacks of devlink +netlink commands. + +Drivers are encouraged to use the devlink instance lock for their own needs. + Interface documentation ----------------------- @@ -22,7 +36,9 @@ general. devlink-region devlink-resource devlink-reload + devlink-selftests devlink-trap + devlink-linecard Driver-specific documentation ----------------------------- diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst index 4e4b97f7971a..29ad304e6fba 100644 --- a/Documentation/networking/devlink/mlx5.rst +++ b/Documentation/networking/devlink/mlx5.rst @@ -14,8 +14,19 @@ Parameters * - Name - Mode + - Validation * - ``enable_roce`` - driverinit + - Type: Boolean + * - ``io_eq_size`` + - driverinit + - The range is between 64 and 4096. + * - ``event_eq_size`` + - driverinit + - The range is between 64 and 4096. + * - ``max_macs`` + - driverinit + - The range is between 1 and 2^31. Only power of 2 values are supported. The ``mlx5`` driver also implements the following driver-specific parameters. diff --git a/Documentation/networking/devlink/mlxsw.rst b/Documentation/networking/devlink/mlxsw.rst index cf857cb4ba8f..433962225bd4 100644 --- a/Documentation/networking/devlink/mlxsw.rst +++ b/Documentation/networking/devlink/mlxsw.rst @@ -58,6 +58,30 @@ The ``mlxsw`` driver reports the following versions - running - Three digit firmware version +Line card auxiliary device info versions +======================================== + +The ``mlxsw`` driver reports the following versions for line card auxiliary device + +.. list-table:: devlink info versions implemented + :widths: 5 5 90 + + * - Name + - Type + - Description + * - ``hw.revision`` + - fixed + - The hardware revision for this line card + * - ``ini.version`` + - running + - Version of line card INI loaded + * - ``fw.psid`` + - fixed + - Line card device PSID + * - ``fw.version`` + - running + - Three digit firmware version of line card device + Driver-specific Traps ===================== diff --git a/Documentation/networking/devlink/netdevsim.rst b/Documentation/networking/devlink/netdevsim.rst index 8a292fb5aaea..ec5e6d79b2e2 100644 --- a/Documentation/networking/devlink/netdevsim.rst +++ b/Documentation/networking/devlink/netdevsim.rst @@ -67,7 +67,7 @@ The ``netdevsim`` driver supports rate objects management, which includes: - setting tx_share and tx_max rate values for any rate object type; - setting parent node for any rate object type. -Rate nodes and it's parameters are exposed in ``netdevsim`` debugfs in RO mode. +Rate nodes and their parameters are exposed in ``netdevsim`` debugfs in RO mode. For example created rate node with name ``some_group``: .. code:: shell diff --git a/Documentation/networking/driver.rst b/Documentation/networking/driver.rst index c8f59dbda46f..64f7236ff10b 100644 --- a/Documentation/networking/driver.rst +++ b/Documentation/networking/driver.rst @@ -8,7 +8,7 @@ Transmit path guidelines: 1) The ndo_start_xmit method must not return NETDEV_TX_BUSY under any normal circumstances. It is considered a hard error unless - there is no way your device can tell ahead of time when it's + there is no way your device can tell ahead of time when its transmit function will become busy. Instead it must maintain the queue properly. For example, diff --git a/Documentation/networking/dsa/configuration.rst b/Documentation/networking/dsa/configuration.rst index 2b08f1a772d3..827701f8cbfe 100644 --- a/Documentation/networking/dsa/configuration.rst +++ b/Documentation/networking/dsa/configuration.rst @@ -49,6 +49,9 @@ In this documentation the following Ethernet interfaces are used: *eth0* the master interface +*eth1* + another master interface + *lan1* a slave interface @@ -360,3 +363,96 @@ the ``self`` flag) has been removed. This results in the following changes: Script writers are therefore encouraged to use the ``master static`` set of flags when working with bridge FDB entries on DSA switch interfaces. + +Affinity of user ports to CPU ports +----------------------------------- + +Typically, DSA switches are attached to the host via a single Ethernet +interface, but in cases where the switch chip is discrete, the hardware design +may permit the use of 2 or more ports connected to the host, for an increase in +termination throughput. + +DSA can make use of multiple CPU ports in two ways. First, it is possible to +statically assign the termination traffic associated with a certain user port +to be processed by a certain CPU port. This way, user space can implement +custom policies of static load balancing between user ports, by spreading the +affinities according to the available CPU ports. + +Secondly, it is possible to perform load balancing between CPU ports on a per +packet basis, rather than statically assigning user ports to CPU ports. +This can be achieved by placing the DSA masters under a LAG interface (bonding +or team). DSA monitors this operation and creates a mirror of this software LAG +on the CPU ports facing the physical DSA masters that constitute the LAG slave +devices. + +To make use of multiple CPU ports, the firmware (device tree) description of +the switch must mark all the links between CPU ports and their DSA masters +using the ``ethernet`` reference/phandle. At startup, only a single CPU port +and DSA master will be used - the numerically first port from the firmware +description which has an ``ethernet`` property. It is up to the user to +configure the system for the switch to use other masters. + +DSA uses the ``rtnl_link_ops`` mechanism (with a "dsa" ``kind``) to allow +changing the DSA master of a user port. The ``IFLA_DSA_MASTER`` u32 netlink +attribute contains the ifindex of the master device that handles each slave +device. The DSA master must be a valid candidate based on firmware node +information, or a LAG interface which contains only slaves which are valid +candidates. + +Using iproute2, the following manipulations are possible: + + .. code-block:: sh + + # See the DSA master in current use + ip -d link show dev swp0 + (...) + dsa master eth0 + + # Static CPU port distribution + ip link set swp0 type dsa master eth1 + ip link set swp1 type dsa master eth0 + ip link set swp2 type dsa master eth1 + ip link set swp3 type dsa master eth0 + + # CPU ports in LAG, using explicit assignment of the DSA master + ip link add bond0 type bond mode balance-xor && ip link set bond0 up + ip link set eth1 down && ip link set eth1 master bond0 + ip link set swp0 type dsa master bond0 + ip link set swp1 type dsa master bond0 + ip link set swp2 type dsa master bond0 + ip link set swp3 type dsa master bond0 + ip link set eth0 down && ip link set eth0 master bond0 + ip -d link show dev swp0 + (...) + dsa master bond0 + + # CPU ports in LAG, relying on implicit migration of the DSA master + ip link add bond0 type bond mode balance-xor && ip link set bond0 up + ip link set eth0 down && ip link set eth0 master bond0 + ip link set eth1 down && ip link set eth1 master bond0 + ip -d link show dev swp0 + (...) + dsa master bond0 + +Notice that in the case of CPU ports under a LAG, the use of the +``IFLA_DSA_MASTER`` netlink attribute is not strictly needed, but rather, DSA +reacts to the ``IFLA_MASTER`` attribute change of its present master (``eth0``) +and migrates all user ports to the new upper of ``eth0``, ``bond0``. Similarly, +when ``bond0`` is destroyed using ``RTM_DELLINK``, DSA migrates the user ports +that were assigned to this interface to the first physical DSA master which is +eligible, based on the firmware description (it effectively reverts to the +startup configuration). + +In a setup with more than 2 physical CPU ports, it is therefore possible to mix +static user to CPU port assignment with LAG between DSA masters. It is not +possible to statically assign a user port towards a DSA master that has any +upper interfaces (this includes LAG devices - the master must always be the LAG +in this case). + +Live changing of the DSA master (and thus CPU port) affinity of a user port is +permitted, in order to allow dynamic redistribution in response to traffic. + +Physical DSA masters are allowed to join and leave at any time a LAG interface +used as a DSA master; however, DSA will reject a LAG interface as a valid +candidate for being a DSA master unless it has at least one physical DSA master +as a slave device. diff --git a/Documentation/networking/dsa/dsa.rst b/Documentation/networking/dsa/dsa.rst index 89bb4fa4c362..a94ddf83348a 100644 --- a/Documentation/networking/dsa/dsa.rst +++ b/Documentation/networking/dsa/dsa.rst @@ -10,21 +10,21 @@ in joining the effort. Design principles ================= -The Distributed Switch Architecture is a subsystem which was primarily designed -to support Marvell Ethernet switches (MV88E6xxx, a.k.a Linkstreet product line) -using Linux, but has since evolved to support other vendors as well. +The Distributed Switch Architecture subsystem was primarily designed to +support Marvell Ethernet switches (MV88E6xxx, a.k.a. Link Street product +line) using Linux, but has since evolved to support other vendors as well. The original philosophy behind this design was to be able to use unmodified Linux tools such as bridge, iproute2, ifconfig to work transparently whether they configured/queried a switch port network device or a regular network device. -An Ethernet switch is typically comprised of multiple front-panel ports, and one -or more CPU or management port. The DSA subsystem currently relies on the +An Ethernet switch typically comprises multiple front-panel ports and one +or more CPU or management ports. The DSA subsystem currently relies on the presence of a management port connected to an Ethernet controller capable of receiving Ethernet frames from the switch. This is a very common setup for all kinds of Ethernet switches found in Small Home and Office products: routers, -gateways, or even top-of-the rack switches. This host Ethernet controller will +gateways, or even top-of-rack switches. This host Ethernet controller will be later referred to as "master" and "cpu" in DSA terminology and code. The D in DSA stands for Distributed, because the subsystem has been designed @@ -33,14 +33,14 @@ using upstream and downstream Ethernet links between switches. These specific ports are referred to as "dsa" ports in DSA terminology and code. A collection of multiple switches connected to each other is called a "switch tree". -For each front-panel port, DSA will create specialized network devices which are +For each front-panel port, DSA creates specialized network devices which are used as controlling and data-flowing endpoints for use by the Linux networking stack. These specialized network interfaces are referred to as "slave" network interfaces in DSA terminology and code. The ideal case for using DSA is when an Ethernet switch supports a "switch tag" which is a hardware feature making the switch insert a specific tag for each -Ethernet frames it received to/from specific ports to help the management +Ethernet frame it receives to/from specific ports to help the management interface figure out: - what port is this frame coming from @@ -125,7 +125,7 @@ other switches from the same fabric, and in this case, the outermost switch ports must decapsulate the packet. Note that in certain cases, it might be the case that the tagging format used -by a leaf switch (not connected directly to the CPU) to not be the same as what +by a leaf switch (not connected directly to the CPU) is not the same as what the network stack sees. This can be seen with Marvell switch trees, where the CPU port can be configured to use either the DSA or the Ethertype DSA (EDSA) format, but the DSA links are configured to use the shorter (without Ethertype) @@ -193,6 +193,23 @@ protocol. If not all packets are of equal size, the tagger can implement the default behavior by specifying the correct offset incurred by each individual RX packet. Tail taggers do not cause issues to the flow dissector. +Checksum offload should work with category 1 and 2 taggers when the DSA master +driver declares NETIF_F_HW_CSUM in vlan_features and looks at csum_start and +csum_offset. For those cases, DSA will shift the checksum start and offset by +the tag size. If the DSA master driver still uses the legacy NETIF_F_IP_CSUM +or NETIF_F_IPV6_CSUM in vlan_features, the offload might only work if the +offload hardware already expects that specific tag (perhaps due to matching +vendors). DSA slaves inherit those flags from the master port, and it is up to +the driver to correctly fall back to software checksum when the IP header is not +where the hardware expects. If that check is ineffective, the packets might go +to the network without a proper checksum (the checksum field will have the +pseudo IP header sum). For category 3, when the offload hardware does not +already expect the switch tag in use, the checksum must be calculated before any +tag is inserted (i.e. inside the tagger). Otherwise, the DSA master would +include the tail tag in the (software or hardware) checksum calculation. Then, +when the tag gets stripped by the switch during transmission, it will leave an +incorrect IP checksum in place. + Due to various reasons (most common being category 1 taggers being associated with DSA-unaware masters, mangling what the master perceives as MAC DA), the tagging protocol may require the DSA master to operate in promiscuous mode, to @@ -270,21 +287,35 @@ These interfaces are specialized in order to: to/from specific switch ports - query the switch for ethtool operations: statistics, link state, Wake-on-LAN, register dumps... -- external/internal PHY management: link, auto-negotiation etc. +- manage external/internal PHY: link, auto-negotiation, etc. These slave network devices have custom net_device_ops and ethtool_ops function pointers which allow DSA to introduce a level of layering between the networking -stack/ethtool, and the switch driver implementation. +stack/ethtool and the switch driver implementation. Upon frame transmission from these slave network devices, DSA will look up which -switch tagging protocol is currently registered with these network devices, and +switch tagging protocol is currently registered with these network devices and invoke a specific transmit routine which takes care of adding the relevant switch tag in the Ethernet frames. These frames are then queued for transmission using the master network device -``ndo_start_xmit()`` function, since they contain the appropriate switch tag, the +``ndo_start_xmit()`` function. Since they contain the appropriate switch tag, the Ethernet switch will be able to process these incoming frames from the -management interface and delivers these frames to the physical switch port. +management interface and deliver them to the physical switch port. + +When using multiple CPU ports, it is possible to stack a LAG (bonding/team) +device between the DSA slave devices and the physical DSA masters. The LAG +device is thus also a DSA master, but the LAG slave devices continue to be DSA +masters as well (just with no user port assigned to them; this is needed for +recovery in case the LAG DSA master disappears). Thus, the data path of the LAG +DSA master is used asymmetrically. On RX, the ``ETH_P_XDSA`` handler, which +calls ``dsa_switch_rcv()``, is invoked early (on the physical DSA master; +LAG slave). Therefore, the RX data path of the LAG DSA master is not used. +On the other hand, TX takes place linearly: ``dsa_slave_xmit`` calls +``dsa_enqueue_skb``, which calls ``dev_queue_xmit`` towards the LAG DSA master. +The latter calls ``dev_queue_xmit`` towards one physical DSA master or the +other, and in both cases, the packet exits the system through a hardware path +towards the switch. Graphical representation ------------------------ @@ -330,9 +361,9 @@ MDIO reads/writes towards specific PHY addresses. In most MDIO-connected switches, these functions would utilize direct or indirect PHY addressing mode to return standard MII registers from the switch builtin PHYs, allowing the PHY library and/or to return link status, link partner pages, auto-negotiation -results etc.. +results, etc. -For Ethernet switches which have both external and internal MDIO busses, the +For Ethernet switches which have both external and internal MDIO buses, the slave MII bus can be utilized to mux/demux MDIO reads and writes towards either internal or external MDIO devices this switch might be connected to: internal PHYs, external PHYs, or even external switches. @@ -349,7 +380,7 @@ DSA data structures are defined in ``include/net/dsa.h`` as well as table indication (when cascading switches) - ``dsa_platform_data``: platform device configuration data which can reference - a collection of dsa_chip_data structure if multiples switches are cascaded, + a collection of dsa_chip_data structures if multiple switches are cascaded, the master network device this switch tree is attached to needs to be referenced @@ -426,7 +457,7 @@ logic basically looks like this: "phy-handle" property, if found, this PHY device is created and registered using ``of_phy_connect()`` -- if Device Tree is used, and the PHY device is "fixed", that is, conforms to +- if Device Tree is used and the PHY device is "fixed", that is, conforms to the definition of a non-MDIO managed PHY as defined in ``Documentation/devicetree/bindings/net/fixed-link.txt``, the PHY is registered and connected transparently using the special fixed MDIO bus driver @@ -481,35 +512,117 @@ Device Tree DSA features a standardized binding which is documented in ``Documentation/devicetree/bindings/net/dsa/dsa.txt``. PHY/MDIO library helper functions such as ``of_get_phy_mode()``, ``of_phy_connect()`` are also used to query -per-port PHY specific details: interface connection, MDIO bus location etc.. +per-port PHY specific details: interface connection, MDIO bus location, etc. Driver development ================== -DSA switch drivers need to implement a dsa_switch_ops structure which will +DSA switch drivers need to implement a ``dsa_switch_ops`` structure which will contain the various members described below. -``register_switch_driver()`` registers this dsa_switch_ops in its internal list -of drivers to probe for. ``unregister_switch_driver()`` does the exact opposite. +Probing, registration and device lifetime +----------------------------------------- + +DSA switches are regular ``device`` structures on buses (be they platform, SPI, +I2C, MDIO or otherwise). The DSA framework is not involved in their probing +with the device core. + +Switch registration from the perspective of a driver means passing a valid +``struct dsa_switch`` pointer to ``dsa_register_switch()``, usually from the +switch driver's probing function. The following members must be valid in the +provided structure: + +- ``ds->dev``: will be used to parse the switch's OF node or platform data. + +- ``ds->num_ports``: will be used to create the port list for this switch, and + to validate the port indices provided in the OF node. + +- ``ds->ops``: a pointer to the ``dsa_switch_ops`` structure holding the DSA + method implementations. + +- ``ds->priv``: backpointer to a driver-private data structure which can be + retrieved in all further DSA method callbacks. + +In addition, the following flags in the ``dsa_switch`` structure may optionally +be configured to obtain driver-specific behavior from the DSA core. Their +behavior when set is documented through comments in ``include/net/dsa.h``. + +- ``ds->vlan_filtering_is_global`` + +- ``ds->needs_standalone_vlan_filtering`` + +- ``ds->configure_vlan_while_not_filtering`` + +- ``ds->untag_bridge_pvid`` + +- ``ds->assisted_learning_on_cpu_port`` + +- ``ds->mtu_enforcement_ingress`` + +- ``ds->fdb_isolation`` + +Internally, DSA keeps an array of switch trees (group of switches) global to +the kernel, and attaches a ``dsa_switch`` structure to a tree on registration. +The tree ID to which the switch is attached is determined by the first u32 +number of the ``dsa,member`` property of the switch's OF node (0 if missing). +The switch ID within the tree is determined by the second u32 number of the +same OF property (0 if missing). Registering multiple switches with the same +switch ID and tree ID is illegal and will cause an error. Using platform data, +a single switch and a single switch tree is permitted. -Unless requested differently by setting the priv_size member accordingly, DSA -does not allocate any driver private context space. +In case of a tree with multiple switches, probing takes place asymmetrically. +The first N-1 callers of ``dsa_register_switch()`` only add their ports to the +port list of the tree (``dst->ports``), each port having a backpointer to its +associated switch (``dp->ds``). Then, these switches exit their +``dsa_register_switch()`` call early, because ``dsa_tree_setup_routing_table()`` +has determined that the tree is not yet complete (not all ports referenced by +DSA links are present in the tree's port list). The tree becomes complete when +the last switch calls ``dsa_register_switch()``, and this triggers the effective +continuation of initialization (including the call to ``ds->ops->setup()``) for +all switches within that tree, all as part of the calling context of the last +switch's probe function. + +The opposite of registration takes place when calling ``dsa_unregister_switch()``, +which removes a switch's ports from the port list of the tree. The entire tree +is torn down when the first switch unregisters. + +It is mandatory for DSA switch drivers to implement the ``shutdown()`` callback +of their respective bus, and call ``dsa_switch_shutdown()`` from it (a minimal +version of the full teardown performed by ``dsa_unregister_switch()``). +The reason is that DSA keeps a reference on the master net device, and if the +driver for the master device decides to unbind on shutdown, DSA's reference +will block that operation from finalizing. + +Either ``dsa_switch_shutdown()`` or ``dsa_unregister_switch()`` must be called, +but not both, and the device driver model permits the bus' ``remove()`` method +to be called even if ``shutdown()`` was already called. Therefore, drivers are +expected to implement a mutual exclusion method between ``remove()`` and +``shutdown()`` by setting their drvdata to NULL after any of these has run, and +checking whether the drvdata is NULL before proceeding to take any action. + +After ``dsa_switch_shutdown()`` or ``dsa_unregister_switch()`` was called, no +further callbacks via the provided ``dsa_switch_ops`` may take place, and the +driver may free the data structures associated with the ``dsa_switch``. Switch configuration -------------------- -- ``tag_protocol``: this is to indicate what kind of tagging protocol is supported, - should be a valid value from the ``dsa_tag_protocol`` enum +- ``get_tag_protocol``: this is to indicate what kind of tagging protocol is + supported, should be a valid value from the ``dsa_tag_protocol`` enum. + The returned information does not have to be static; the driver is passed the + CPU port number, as well as the tagging protocol of a possibly stacked + upstream switch, in case there are hardware limitations in terms of supported + tag formats. -- ``probe``: probe routine which will be invoked by the DSA platform device upon - registration to test for the presence/absence of a switch device. For MDIO - devices, it is recommended to issue a read towards internal registers using - the switch pseudo-PHY and return whether this is a supported device. For other - buses, return a non-NULL string +- ``change_tag_protocol``: when the default tagging protocol has compatibility + problems with the master or other issues, the driver may support changing it + at runtime, either through a device tree property or through sysfs. In that + case, further calls to ``get_tag_protocol`` should report the protocol in + current use. - ``setup``: setup function for the switch, this function is responsible for setting up the ``dsa_switch_ops`` private structure with all it needs: register maps, - interrupts, mutexes, locks etc.. This function is also expected to properly + interrupts, mutexes, locks, etc. This function is also expected to properly configure the switch to separate all network interfaces from each other, that is, they should be isolated by the switch hardware itself, typically by creating a Port-based VLAN ID for each port and allowing only the CPU port and the @@ -518,7 +631,35 @@ Switch configuration fully configured and ready to serve any kind of request. It is recommended to issue a software reset of the switch during this setup function in order to avoid relying on what a previous software agent such as a bootloader/firmware - may have previously configured. + may have previously configured. The method responsible for undoing any + applicable allocations or operations done here is ``teardown``. + +- ``port_setup`` and ``port_teardown``: methods for initialization and + destruction of per-port data structures. It is mandatory for some operations + such as registering and unregistering devlink port regions to be done from + these methods, otherwise they are optional. A port will be torn down only if + it has been previously set up. It is possible for a port to be set up during + probing only to be torn down immediately afterwards, for example in case its + PHY cannot be found. In this case, probing of the DSA switch continues + without that particular port. + +- ``port_change_master``: method through which the affinity (association used + for traffic termination purposes) between a user port and a CPU port can be + changed. By default all user ports from a tree are assigned to the first + available CPU port that makes sense for them (most of the times this means + the user ports of a tree are all assigned to the same CPU port, except for H + topologies as described in commit 2c0b03258b8b). The ``port`` argument + represents the index of the user port, and the ``master`` argument represents + the new DSA master ``net_device``. The CPU port associated with the new + master can be retrieved by looking at ``struct dsa_port *cpu_dp = + master->dsa_ptr``. Additionally, the master can also be a LAG device where + all the slave devices are physical DSA masters. LAG DSA masters also have a + valid ``master->dsa_ptr`` pointer, however this is not unique, but rather a + duplicate of the first physical DSA master's (LAG slave) ``dsa_ptr``. In case + of a LAG DSA master, a further call to ``port_lag_join`` will be emitted + separately for the physical CPU ports associated with the physical DSA + masters, requesting them to create a hardware LAG associated with the LAG + interface. PHY devices and link management ------------------------------- @@ -526,13 +667,13 @@ PHY devices and link management - ``get_phy_flags``: Some switches are interfaced to various kinds of Ethernet PHYs, if the PHY library PHY driver needs to know about information it cannot obtain on its own (e.g.: coming from switch memory mapped registers), this function - should return a 32-bits bitmask of "flags", that is private between the switch + should return a 32-bit bitmask of "flags" that is private between the switch driver and the Ethernet PHY driver in ``drivers/net/phy/\*``. - ``phy_read``: Function invoked by the DSA slave MDIO bus when attempting to read the switch port MDIO registers. If unavailable, return 0xffff for each read. For builtin switch Ethernet PHYs, this function should allow reading the link - status, auto-negotiation results, link partner pages etc.. + status, auto-negotiation results, link partner pages, etc. - ``phy_write``: Function invoked by the DSA slave MDIO bus when attempting to write to the switch port MDIO registers. If unavailable return a negative error @@ -554,7 +695,7 @@ Ethtool operations ------------------ - ``get_strings``: ethtool function used to query the driver's strings, will - typically return statistics strings, private flags strings etc. + typically return statistics strings, private flags strings, etc. - ``get_ethtool_stats``: ethtool function used to query per-port statistics and return their values. DSA overlays slave network devices general statistics: @@ -564,7 +705,7 @@ Ethtool operations - ``get_sset_count``: ethtool function used to query the number of statistics items - ``get_wol``: ethtool function used to obtain Wake-on-LAN settings per-port, this - function may, for certain implementations also query the master network device + function may for certain implementations also query the master network device Wake-on-LAN settings if this interface needs to participate in Wake-on-LAN - ``set_wol``: ethtool function used to configure Wake-on-LAN settings per-port, @@ -607,37 +748,209 @@ Power management in a fully active state - ``port_enable``: function invoked by the DSA slave network device ndo_open - function when a port is administratively brought up, this function should be - fully enabling a given switch port. DSA takes care of marking the port with + function when a port is administratively brought up, this function should + fully enable a given switch port. DSA takes care of marking the port with ``BR_STATE_BLOCKING`` if the port is a bridge member, or ``BR_STATE_FORWARDING`` if it was not, and propagating these changes down to the hardware - ``port_disable``: function invoked by the DSA slave network device ndo_close - function when a port is administratively brought down, this function should be - fully disabling a given switch port. DSA takes care of marking the port with + function when a port is administratively brought down, this function should + fully disable a given switch port. DSA takes care of marking the port with ``BR_STATE_DISABLED`` and propagating changes to the hardware if this port is disabled while being a bridge member +Address databases +----------------- + +Switching hardware is expected to have a table for FDB entries, however not all +of them are active at the same time. An address database is the subset (partition) +of FDB entries that is active (can be matched by address learning on RX, or FDB +lookup on TX) depending on the state of the port. An address database may +occasionally be called "FID" (Filtering ID) in this document, although the +underlying implementation may choose whatever is available to the hardware. + +For example, all ports that belong to a VLAN-unaware bridge (which is +*currently* VLAN-unaware) are expected to learn source addresses in the +database associated by the driver with that bridge (and not with other +VLAN-unaware bridges). During forwarding and FDB lookup, a packet received on a +VLAN-unaware bridge port should be able to find a VLAN-unaware FDB entry having +the same MAC DA as the packet, which is present on another port member of the +same bridge. At the same time, the FDB lookup process must be able to not find +an FDB entry having the same MAC DA as the packet, if that entry points towards +a port which is a member of a different VLAN-unaware bridge (and is therefore +associated with a different address database). + +Similarly, each VLAN of each offloaded VLAN-aware bridge should have an +associated address database, which is shared by all ports which are members of +that VLAN, but not shared by ports belonging to different bridges that are +members of the same VID. + +In this context, a VLAN-unaware database means that all packets are expected to +match on it irrespective of VLAN ID (only MAC address lookup), whereas a +VLAN-aware database means that packets are supposed to match based on the VLAN +ID from the classified 802.1Q header (or the pvid if untagged). + +At the bridge layer, VLAN-unaware FDB entries have the special VID value of 0, +whereas VLAN-aware FDB entries have non-zero VID values. Note that a +VLAN-unaware bridge may have VLAN-aware (non-zero VID) FDB entries, and a +VLAN-aware bridge may have VLAN-unaware FDB entries. As in hardware, the +software bridge keeps separate address databases, and offloads to hardware the +FDB entries belonging to these databases, through switchdev, asynchronously +relative to the moment when the databases become active or inactive. + +When a user port operates in standalone mode, its driver should configure it to +use a separate database called a port private database. This is different from +the databases described above, and should impede operation as standalone port +(packet in, packet out to the CPU port) as little as possible. For example, +on ingress, it should not attempt to learn the MAC SA of ingress traffic, since +learning is a bridging layer service and this is a standalone port, therefore +it would consume useless space. With no address learning, the port private +database should be empty in a naive implementation, and in this case, all +received packets should be trivially flooded to the CPU port. + +DSA (cascade) and CPU ports are also called "shared" ports because they service +multiple address databases, and the database that a packet should be associated +to is usually embedded in the DSA tag. This means that the CPU port may +simultaneously transport packets coming from a standalone port (which were +classified by hardware in one address database), and from a bridge port (which +were classified to a different address database). + +Switch drivers which satisfy certain criteria are able to optimize the naive +configuration by removing the CPU port from the flooding domain of the switch, +and just program the hardware with FDB entries pointing towards the CPU port +for which it is known that software is interested in those MAC addresses. +Packets which do not match a known FDB entry will not be delivered to the CPU, +which will save CPU cycles required for creating an skb just to drop it. + +DSA is able to perform host address filtering for the following kinds of +addresses: + +- Primary unicast MAC addresses of ports (``dev->dev_addr``). These are + associated with the port private database of the respective user port, + and the driver is notified to install them through ``port_fdb_add`` towards + the CPU port. + +- Secondary unicast and multicast MAC addresses of ports (addresses added + through ``dev_uc_add()`` and ``dev_mc_add()``). These are also associated + with the port private database of the respective user port. + +- Local/permanent bridge FDB entries (``BR_FDB_LOCAL``). These are the MAC + addresses of the bridge ports, for which packets must be terminated locally + and not forwarded. They are associated with the address database for that + bridge. + +- Static bridge FDB entries installed towards foreign (non-DSA) interfaces + present in the same bridge as some DSA switch ports. These are also + associated with the address database for that bridge. + +- Dynamically learned FDB entries on foreign interfaces present in the same + bridge as some DSA switch ports, only if ``ds->assisted_learning_on_cpu_port`` + is set to true by the driver. These are associated with the address database + for that bridge. + +For various operations detailed below, DSA provides a ``dsa_db`` structure +which can be of the following types: + +- ``DSA_DB_PORT``: the FDB (or MDB) entry to be installed or deleted belongs to + the port private database of user port ``db->dp``. +- ``DSA_DB_BRIDGE``: the entry belongs to one of the address databases of bridge + ``db->bridge``. Separation between the VLAN-unaware database and the per-VID + databases of this bridge is expected to be done by the driver. +- ``DSA_DB_LAG``: the entry belongs to the address database of LAG ``db->lag``. + Note: ``DSA_DB_LAG`` is currently unused and may be removed in the future. + +The drivers which act upon the ``dsa_db`` argument in ``port_fdb_add``, +``port_mdb_add`` etc should declare ``ds->fdb_isolation`` as true. + +DSA associates each offloaded bridge and each offloaded LAG with a one-based ID +(``struct dsa_bridge :: num``, ``struct dsa_lag :: id``) for the purposes of +refcounting addresses on shared ports. Drivers may piggyback on DSA's numbering +scheme (the ID is readable through ``db->bridge.num`` and ``db->lag.id`` or may +implement their own. + +Only the drivers which declare support for FDB isolation are notified of FDB +entries on the CPU port belonging to ``DSA_DB_PORT`` databases. +For compatibility/legacy reasons, ``DSA_DB_BRIDGE`` addresses are notified to +drivers even if they do not support FDB isolation. However, ``db->bridge.num`` +and ``db->lag.id`` are always set to 0 in that case (to denote the lack of +isolation, for refcounting purposes). + +Note that it is not mandatory for a switch driver to implement physically +separate address databases for each standalone user port. Since FDB entries in +the port private databases will always point to the CPU port, there is no risk +for incorrect forwarding decisions. In this case, all standalone ports may +share the same database, but the reference counting of host-filtered addresses +(not deleting the FDB entry for a port's MAC address if it's still in use by +another port) becomes the responsibility of the driver, because DSA is unaware +that the port databases are in fact shared. This can be achieved by calling +``dsa_fdb_present_in_other_db()`` and ``dsa_mdb_present_in_other_db()``. +The down side is that the RX filtering lists of each user port are in fact +shared, which means that user port A may accept a packet with a MAC DA it +shouldn't have, only because that MAC address was in the RX filtering list of +user port B. These packets will still be dropped in software, however. + Bridge layer ------------ +Offloading the bridge forwarding plane is optional and handled by the methods +below. They may be absent, return -EOPNOTSUPP, or ``ds->max_num_bridges`` may +be non-zero and exceeded, and in this case, joining a bridge port is still +possible, but the packet forwarding will take place in software, and the ports +under a software bridge must remain configured in the same way as for +standalone operation, i.e. have all bridging service functions (address +learning etc) disabled, and send all received packets to the CPU port only. + +Concretely, a port starts offloading the forwarding plane of a bridge once it +returns success to the ``port_bridge_join`` method, and stops doing so after +``port_bridge_leave`` has been called. Offloading the bridge means autonomously +learning FDB entries in accordance with the software bridge port's state, and +autonomously forwarding (or flooding) received packets without CPU intervention. +This is optional even when offloading a bridge port. Tagging protocol drivers +are expected to call ``dsa_default_offload_fwd_mark(skb)`` for packets which +have already been autonomously forwarded in the forwarding domain of the +ingress switch port. DSA, through ``dsa_port_devlink_setup()``, considers all +switch ports part of the same tree ID to be part of the same bridge forwarding +domain (capable of autonomous forwarding to each other). + +Offloading the TX forwarding process of a bridge is a distinct concept from +simply offloading its forwarding plane, and refers to the ability of certain +driver and tag protocol combinations to transmit a single skb coming from the +bridge device's transmit function to potentially multiple egress ports (and +thereby avoid its cloning in software). + +Packets for which the bridge requests this behavior are called data plane +packets and have ``skb->offload_fwd_mark`` set to true in the tag protocol +driver's ``xmit`` function. Data plane packets are subject to FDB lookup, +hardware learning on the CPU port, and do not override the port STP state. +Additionally, replication of data plane packets (multicast, flooding) is +handled in hardware and the bridge driver will transmit a single skb for each +packet that may or may not need replication. + +When the TX forwarding offload is enabled, the tag protocol driver is +responsible to inject packets into the data plane of the hardware towards the +correct bridging domain (FID) that the port is a part of. The port may be +VLAN-unaware, and in this case the FID must be equal to the FID used by the +driver for its VLAN-unaware address database associated with that bridge. +Alternatively, the bridge may be VLAN-aware, and in that case, it is guaranteed +that the packet is also VLAN-tagged with the VLAN ID that the bridge processed +this packet in. It is the responsibility of the hardware to untag the VID on +the egress-untagged ports, or keep the tag on the egress-tagged ones. + - ``port_bridge_join``: bridge layer function invoked when a given switch port is - added to a bridge, this function should be doing the necessary at the switch - level to permit the joining port from being added to the relevant logical + added to a bridge, this function should do what's necessary at the switch + level to permit the joining port to be added to the relevant logical domain for it to ingress/egress traffic with other members of the bridge. + By setting the ``tx_fwd_offload`` argument to true, the TX forwarding process + of this bridge is also offloaded. - ``port_bridge_leave``: bridge layer function invoked when a given switch port is - removed from a bridge, this function should be doing the necessary at the + removed from a bridge, this function should do what's necessary at the switch level to deny the leaving port from ingress/egress traffic from the - remaining bridge members. When the port leaves the bridge, it should be aged - out at the switch hardware for the switch to (re) learn MAC addresses behind - this port. + remaining bridge members. - ``port_stp_state_set``: bridge layer function invoked when a given switch port STP state is computed by the bridge layer and should be propagated to switch - hardware to forward/block/learn traffic. The switch driver is responsible for - computing a STP state change based on current and asked parameters and perform - the relevant ageing based on the intersection results + hardware to forward/block/learn traffic. - ``port_bridge_flags``: bridge layer function invoked when a port must configure its settings for e.g. flooding of unknown traffic or source address @@ -650,21 +963,11 @@ Bridge layer CPU port, and flooding towards the CPU port should also be enabled, due to a lack of an explicit address filtering mechanism in the DSA core. -- ``port_bridge_tx_fwd_offload``: bridge layer function invoked after - ``port_bridge_join`` when a driver sets ``ds->num_fwd_offloading_bridges`` to - a non-zero value. Returning success in this function activates the TX - forwarding offload bridge feature for this port, which enables the tagging - protocol driver to inject data plane packets towards the bridging domain that - the port is a part of. Data plane packets are subject to FDB lookup, hardware - learning on the CPU port, and do not override the port STP state. - Additionally, replication of data plane packets (multicast, flooding) is - handled in hardware and the bridge driver will transmit a single skb for each - packet that needs replication. The method is provided as a configuration - point for drivers that need to configure the hardware for enabling this - feature. - -- ``port_bridge_tx_fwd_unoffload``: bridge layer function invoken when a driver - leaves a bridge port which had the TX forwarding offload feature enabled. +- ``port_fast_age``: bridge layer function invoked when flushing the + dynamically learned FDB entries on the port is necessary. This is called when + transitioning from an STP state where learning should take place to an STP + state where it shouldn't, or when leaving a bridge, or when address learning + is turned off via ``port_bridge_flags``. Bridge VLAN filtering --------------------- @@ -680,55 +983,44 @@ Bridge VLAN filtering allowed. - ``port_vlan_add``: bridge layer function invoked when a VLAN is configured - (tagged or untagged) for the given switch port. If the operation is not - supported by the hardware, this function should return ``-EOPNOTSUPP`` to - inform the bridge code to fallback to a software implementation. + (tagged or untagged) for the given switch port. The CPU port becomes a member + of a VLAN only if a foreign bridge port is also a member of it (and + forwarding needs to take place in software), or the VLAN is installed to the + VLAN group of the bridge device itself, for termination purposes + (``bridge vlan add dev br0 vid 100 self``). VLANs on shared ports are + reference counted and removed when there is no user left. Drivers do not need + to manually install a VLAN on the CPU port. - ``port_vlan_del``: bridge layer function invoked when a VLAN is removed from the given switch port -- ``port_vlan_dump``: bridge layer function invoked with a switchdev callback - function that the driver has to call for each VLAN the given port is a member - of. A switchdev object is used to carry the VID and bridge flags. - - ``port_fdb_add``: bridge layer function invoked when the bridge wants to install a Forwarding Database entry, the switch hardware should be programmed with the specified address in the specified VLAN Id in the forwarding database - associated with this VLAN ID. If the operation is not supported, this - function should return ``-EOPNOTSUPP`` to inform the bridge code to fallback to - a software implementation. - -.. note:: VLAN ID 0 corresponds to the port private database, which, in the context - of DSA, would be its port-based VLAN, used by the associated bridge device. + associated with this VLAN ID. - ``port_fdb_del``: bridge layer function invoked when the bridge wants to remove a Forwarding Database entry, the switch hardware should be programmed to delete the specified MAC address from the specified VLAN ID if it was mapped into this port forwarding database -- ``port_fdb_dump``: bridge layer function invoked with a switchdev callback - function that the driver has to call for each MAC address known to be behind - the given port. A switchdev object is used to carry the VID and FDB info. +- ``port_fdb_dump``: bridge bypass function invoked by ``ndo_fdb_dump`` on the + physical DSA port interfaces. Since DSA does not attempt to keep in sync its + hardware FDB entries with the software bridge, this method is implemented as + a means to view the entries visible on user ports in the hardware database. + The entries reported by this function have the ``self`` flag in the output of + the ``bridge fdb show`` command. - ``port_mdb_add``: bridge layer function invoked when the bridge wants to install - a multicast database entry. If the operation is not supported, this function - should return ``-EOPNOTSUPP`` to inform the bridge code to fallback to a - software implementation. The switch hardware should be programmed with the + a multicast database entry. The switch hardware should be programmed with the specified address in the specified VLAN ID in the forwarding database associated with this VLAN ID. -.. note:: VLAN ID 0 corresponds to the port private database, which, in the context - of DSA, would be its port-based VLAN, used by the associated bridge device. - - ``port_mdb_del``: bridge layer function invoked when the bridge wants to remove a multicast database entry, the switch hardware should be programmed to delete the specified MAC address from the specified VLAN ID if it was mapped into this port forwarding database. -- ``port_mdb_dump``: bridge layer function invoked with a switchdev callback - function that the driver has to call for each MAC address known to be behind - the given port. A switchdev object is used to carry the VID and MDB info. - Link aggregation ---------------- @@ -835,9 +1127,3 @@ capable hardware, but does not enforce a strict switch device driver model. On the other DSA enforces a fairly strict device driver model, and deals with most of the switch specific. At some point we should envision a merger between these two subsystems and get the best of both worlds. - -Other hanging fruits --------------------- - -- allowing more than one CPU/management interface: - http://comments.gmane.org/gmane.linux.network/365657 diff --git a/Documentation/networking/dsa/sja1105.rst b/Documentation/networking/dsa/sja1105.rst index 29b1bae0cf00..e0219c1452ab 100644 --- a/Documentation/networking/dsa/sja1105.rst +++ b/Documentation/networking/dsa/sja1105.rst @@ -293,6 +293,33 @@ of dropped frames, which is a sum of frames dropped due to timing violations, lack of destination ports and MTU enforcement checks). Byte-level counters are not available. +Limitations +=========== + +The SJA1105 switch family always performs VLAN processing. When configured as +VLAN-unaware, frames carry a different VLAN tag internally, depending on +whether the port is standalone or under a VLAN-unaware bridge. + +The virtual link keys are always fixed at {MAC DA, VLAN ID, VLAN PCP}, but the +driver asks for the VLAN ID and VLAN PCP when the port is under a VLAN-aware +bridge. Otherwise, it fills in the VLAN ID and PCP automatically, based on +whether the port is standalone or in a VLAN-unaware bridge, and accepts only +"VLAN-unaware" tc-flower keys (MAC DA). + +The existing tc-flower keys that are offloaded using virtual links are no +longer operational after one of the following happens: + +- port was standalone and joins a bridge (VLAN-aware or VLAN-unaware) +- port is part of a bridge whose VLAN awareness state changes +- port was part of a bridge and becomes standalone +- port was standalone, but another port joins a VLAN-aware bridge and this + changes the global VLAN awareness state of the bridge + +The driver cannot veto all these operations, and it cannot update/remove the +existing tc-flower filters either. So for proper operation, the tc-flower +filters should be installed only after the forwarding configuration of the port +has been made, and removed by user space before making any changes to it. + Device Tree bindings and board design ===================================== diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst index 7b598c7e3912..d578b8bcd8a4 100644 --- a/Documentation/networking/ethtool-netlink.rst +++ b/Documentation/networking/ethtool-netlink.rst @@ -220,6 +220,8 @@ Userspace to kernel: ``ETHTOOL_MSG_PHC_VCLOCKS_GET`` get PHC virtual clocks info ``ETHTOOL_MSG_MODULE_SET`` set transceiver module parameters ``ETHTOOL_MSG_MODULE_GET`` get transceiver module parameters + ``ETHTOOL_MSG_PSE_SET`` set PSE parameters + ``ETHTOOL_MSG_PSE_GET`` get PSE parameters ===================================== ================================= Kernel to userspace: @@ -260,6 +262,7 @@ Kernel to userspace: ``ETHTOOL_MSG_STATS_GET_REPLY`` standard statistics ``ETHTOOL_MSG_PHC_VCLOCKS_GET_REPLY`` PHC virtual clocks info ``ETHTOOL_MSG_MODULE_GET_REPLY`` transceiver module parameters + ``ETHTOOL_MSG_PSE_GET_REPLY`` PSE parameters ======================================== ================================= ``GET`` requests are sent by userspace applications to retrieve device @@ -426,6 +429,7 @@ Kernel response contents: ``ETHTOOL_A_LINKMODES_DUPLEX`` u8 duplex mode ``ETHTOOL_A_LINKMODES_MASTER_SLAVE_CFG`` u8 Master/slave port mode ``ETHTOOL_A_LINKMODES_MASTER_SLAVE_STATE`` u8 Master/slave port state + ``ETHTOOL_A_LINKMODES_RATE_MATCHING`` u8 PHY rate matching ========================================== ====== ========================== For ``ETHTOOL_A_LINKMODES_OURS``, value represents advertised modes and mask @@ -449,6 +453,7 @@ Request contents: ``ETHTOOL_A_LINKMODES_SPEED`` u32 link speed (Mb/s) ``ETHTOOL_A_LINKMODES_DUPLEX`` u8 duplex mode ``ETHTOOL_A_LINKMODES_MASTER_SLAVE_CFG`` u8 Master/slave port mode + ``ETHTOOL_A_LINKMODES_RATE_MATCHING`` u8 PHY rate matching ``ETHTOOL_A_LINKMODES_LANES`` u32 lanes ========================================== ====== ========================== @@ -849,7 +854,7 @@ Request contents: Kernel response contents: - ==================================== ====== ========================== + ==================================== ====== =========================== ``ETHTOOL_A_RINGS_HEADER`` nested reply header ``ETHTOOL_A_RINGS_RX_MAX`` u32 max size of RX ring ``ETHTOOL_A_RINGS_RX_MINI_MAX`` u32 max size of RX mini ring @@ -859,8 +864,25 @@ Kernel response contents: ``ETHTOOL_A_RINGS_RX_MINI`` u32 size of RX mini ring ``ETHTOOL_A_RINGS_RX_JUMBO`` u32 size of RX jumbo ring ``ETHTOOL_A_RINGS_TX`` u32 size of TX ring - ==================================== ====== ========================== - + ``ETHTOOL_A_RINGS_RX_BUF_LEN`` u32 size of buffers on the ring + ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` u8 TCP header / data split + ``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE + ``ETHTOOL_A_RINGS_TX_PUSH`` u8 flag of TX Push mode + ==================================== ====== =========================== + +``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` indicates whether the device is usable with +page-flipping TCP zero-copy receive (``getsockopt(TCP_ZEROCOPY_RECEIVE)``). +If enabled the device is configured to place frame headers and data into +separate buffers. The device configuration must make it possible to receive +full memory pages of data, for example because MTU is high enough or through +HW-GRO. + +``ETHTOOL_A_RINGS_TX_PUSH`` flag is used to enable descriptor fast +path to send packets. In ordinary path, driver fills descriptors in DRAM and +notifies NIC hardware. In fast path, driver pushes descriptors to the device +through MMIO writes, thus reducing the latency. However, enabling this feature +may increase the CPU cost. Drivers may enforce additional per-packet +eligibility checks (e.g. on packet size). RINGS_SET ========= @@ -869,19 +891,31 @@ Sets ring sizes like ``ETHTOOL_SRINGPARAM`` ioctl request. Request contents: - ==================================== ====== ========================== + ==================================== ====== =========================== ``ETHTOOL_A_RINGS_HEADER`` nested reply header ``ETHTOOL_A_RINGS_RX`` u32 size of RX ring ``ETHTOOL_A_RINGS_RX_MINI`` u32 size of RX mini ring ``ETHTOOL_A_RINGS_RX_JUMBO`` u32 size of RX jumbo ring ``ETHTOOL_A_RINGS_TX`` u32 size of TX ring - ==================================== ====== ========================== + ``ETHTOOL_A_RINGS_RX_BUF_LEN`` u32 size of buffers on the ring + ``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE + ``ETHTOOL_A_RINGS_TX_PUSH`` u8 flag of TX Push mode + ==================================== ====== =========================== Kernel checks that requested ring sizes do not exceed limits reported by driver. Driver may impose additional constraints and may not suspport all attributes. +``ETHTOOL_A_RINGS_CQE_SIZE`` specifies the completion queue event size. +Completion queue events(CQE) are the events posted by NIC to indicate the +completion status of a packet when the packet is sent(like send success or +error) or received(like pointers to packet fragments). The CQE size parameter +enables to modify the CQE size other than default size if NIC supports it. +A bigger CQE can have more receive buffer pointers inturn NIC can transfer +a bigger frame from wire. Based on the NIC hardware, the overall completion +queue size can be adjusted in the driver if CQE size is modified. + CHANNELS_GET ============ @@ -1596,6 +1630,62 @@ For SFF-8636 modules, low power mode is forced by the host according to table For CMIS modules, low power mode is forced by the host according to table 6-12 in revision 5.0 of the specification. +PSE_GET +======= + +Gets PSE attributes. + +Request contents: + + ===================================== ====== ========================== + ``ETHTOOL_A_PSE_HEADER`` nested request header + ===================================== ====== ========================== + +Kernel response contents: + + ====================================== ====== ============================= + ``ETHTOOL_A_PSE_HEADER`` nested reply header + ``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` u32 Operational state of the PoDL + PSE functions + ``ETHTOOL_A_PODL_PSE_PW_D_STATUS`` u32 power detection status of the + PoDL PSE. + ====================================== ====== ============================= + +When set, the optional ``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` attribute identifies +the operational state of the PoDL PSE functions. The operational state of the +PSE function can be changed using the ``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL`` +action. This option is corresponding to ``IEEE 802.3-2018`` 30.15.1.1.2 +aPoDLPSEAdminState. Possible values are: + +.. kernel-doc:: include/uapi/linux/ethtool.h + :identifiers: ethtool_podl_pse_admin_state + +When set, the optional ``ETHTOOL_A_PODL_PSE_PW_D_STATUS`` attribute identifies +the power detection status of the PoDL PSE. The status depend on internal PSE +state machine and automatic PD classification support. This option is +corresponding to ``IEEE 802.3-2018`` 30.15.1.1.3 aPoDLPSEPowerDetectionStatus. +Possible values are: + +.. kernel-doc:: include/uapi/linux/ethtool.h + :identifiers: ethtool_podl_pse_pw_d_status + +PSE_SET +======= + +Sets PSE parameters. + +Request contents: + + ====================================== ====== ============================= + ``ETHTOOL_A_PSE_HEADER`` nested request header + ``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL`` u32 Control PoDL PSE Admin state + ====================================== ====== ============================= + +When set, the optional ``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL`` attribute is used +to control PoDL PSE Admin functions. This option is implementing +``IEEE 802.3-2018`` 30.15.1.2.1 acPoDLPSEAdminControl. See +``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` for supported values. + Request translation =================== diff --git a/Documentation/networking/filter.rst b/Documentation/networking/filter.rst index ce2b8e8bb9ab..f69da5074860 100644 --- a/Documentation/networking/filter.rst +++ b/Documentation/networking/filter.rst @@ -6,6 +6,13 @@ Linux Socket Filtering aka Berkeley Packet Filter (BPF) ======================================================= +Notice +------ + +This file used to document the eBPF format and mechanisms even when not +related to socket filtering. The ../bpf/index.rst has more details +on eBPF. + Introduction ------------ @@ -298,7 +305,7 @@ Possible BPF extensions are shown in the following table: vlan_tci skb_vlan_tag_get(skb) vlan_avail skb_vlan_tag_present(skb) vlan_tpid skb->vlan_proto - rand prandom_u32() + rand get_random_u32() =================================== ================================================= These extensions can also be prefixed with '#'. @@ -617,15 +624,11 @@ format with similar underlying principles from BPF described in previous paragraphs is being used. However, the instruction set format is modelled closer to the underlying architecture to mimic native instruction sets, so that a better performance can be achieved (more details later). This new -ISA is called 'eBPF' or 'internal BPF' interchangeably. (Note: eBPF which +ISA is called eBPF. See the ../bpf/index.rst for details. (Note: eBPF which originates from [e]xtended BPF is not the same as BPF extensions! While eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading' of BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.) -It is designed to be JITed with one to one mapping, which can also open up -the possibility for GCC/LLVM compilers to generate optimized eBPF code through -an eBPF backend that performs almost as fast as natively compiled code. - The new instruction set was originally designed with the possible goal in mind to write programs in "restricted C" and compile into eBPF with a optional GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with @@ -650,1032 +653,11 @@ Currently, the classic BPF format is being used for JITing on most sparc64, arm32, riscv64, riscv32 perform JIT compilation from eBPF instruction set. -Some core changes of the new internal format: - -- Number of registers increase from 2 to 10: - - The old format had two registers A and X, and a hidden frame pointer. The - new layout extends this to be 10 internal registers and a read-only frame - pointer. Since 64-bit CPUs are passing arguments to functions via registers - the number of args from eBPF program to in-kernel function is restricted - to 5 and one register is used to accept return value from an in-kernel - function. Natively, x86_64 passes first 6 arguments in registers, aarch64/ - sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved - registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers. - - Therefore, eBPF calling convention is defined as: - - * R0 - return value from in-kernel function, and exit value for eBPF program - * R1 - R5 - arguments from eBPF program to in-kernel function - * R6 - R9 - callee saved registers that in-kernel function will preserve - * R10 - read-only frame pointer to access stack - - Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64, - etc, and eBPF calling convention maps directly to ABIs used by the kernel on - 64-bit architectures. - - On 32-bit architectures JIT may map programs that use only 32-bit arithmetic - and may let more complex programs to be interpreted. - - R0 - R5 are scratch registers and eBPF program needs spill/fill them if - necessary across calls. Note that there is only one eBPF program (== one - eBPF main routine) and it cannot call other eBPF functions, it can only - call predefined in-kernel functions, though. - -- Register width increases from 32-bit to 64-bit: - - Still, the semantics of the original 32-bit ALU operations are preserved - via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower - subregisters that zero-extend into 64-bit if they are being written to. - That behavior maps directly to x86_64 and arm64 subregister definition, but - makes other JITs more difficult. - - 32-bit architectures run 64-bit internal BPF programs via interpreter. - Their JITs may convert BPF programs that only use 32-bit subregisters into - native instruction set and let the rest being interpreted. - - Operation is 64-bit, because on 64-bit architectures, pointers are also - 64-bit wide, and we want to pass 64-bit values in/out of kernel functions, - so 32-bit eBPF registers would otherwise require to define register-pair - ABI, thus, there won't be able to use a direct eBPF register to HW register - mapping and JIT would need to do combine/split/move operations for every - register in and out of the function, which is complex, bug prone and slow. - Another reason is the use of atomic 64-bit counters. - -- Conditional jt/jf targets replaced with jt/fall-through: - - While the original design has constructs such as ``if (cond) jump_true; - else jump_false;``, they are being replaced into alternative constructs like - ``if (cond) jump_true; /* else fall-through */``. - -- Introduces bpf_call insn and register passing convention for zero overhead - calls from/to other kernel functions: - - Before an in-kernel function call, the internal BPF program needs to - place function arguments into R1 to R5 registers to satisfy calling - convention, then the interpreter will take them from registers and pass - to in-kernel function. If R1 - R5 registers are mapped to CPU registers - that are used for argument passing on given architecture, the JIT compiler - doesn't need to emit extra moves. Function arguments will be in the correct - registers and BPF_CALL instruction will be JITed as single 'call' HW - instruction. This calling convention was picked to cover common call - situations without performance penalty. - - After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has - a return value of the function. Since R6 - R9 are callee saved, their state - is preserved across the call. - - For example, consider three C functions:: - - u64 f1() { return (*_f2)(1); } - u64 f2(u64 a) { return f3(a + 1, a); } - u64 f3(u64 a, u64 b) { return a - b; } - - GCC can compile f1, f3 into x86_64:: - - f1: - movl $1, %edi - movq _f2(%rip), %rax - jmp *%rax - f3: - movq %rdi, %rax - subq %rsi, %rax - ret - - Function f2 in eBPF may look like:: - - f2: - bpf_mov R2, R1 - bpf_add R1, 1 - bpf_call f3 - bpf_exit - - If f2 is JITed and the pointer stored to ``_f2``. The calls f1 -> f2 -> f3 and - returns will be seamless. Without JIT, __bpf_prog_run() interpreter needs to - be used to call into f2. - - For practical reasons all eBPF programs have only one argument 'ctx' which is - already placed into R1 (e.g. on __bpf_prog_run() startup) and the programs - can call kernel functions with up to 5 arguments. Calls with 6 or more arguments - are currently not supported, but these restrictions can be lifted if necessary - in the future. - - On 64-bit architectures all register map to HW registers one to one. For - example, x86_64 JIT compiler can map them as ... - - :: - - R0 - rax - R1 - rdi - R2 - rsi - R3 - rdx - R4 - rcx - R5 - r8 - R6 - rbx - R7 - r13 - R8 - r14 - R9 - r15 - R10 - rbp - - ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing - and rbx, r12 - r15 are callee saved. - - Then the following internal BPF pseudo-program:: - - bpf_mov R6, R1 /* save ctx */ - bpf_mov R2, 2 - bpf_mov R3, 3 - bpf_mov R4, 4 - bpf_mov R5, 5 - bpf_call foo - bpf_mov R7, R0 /* save foo() return value */ - bpf_mov R1, R6 /* restore ctx for next call */ - bpf_mov R2, 6 - bpf_mov R3, 7 - bpf_mov R4, 8 - bpf_mov R5, 9 - bpf_call bar - bpf_add R0, R7 - bpf_exit - - After JIT to x86_64 may look like:: - - push %rbp - mov %rsp,%rbp - sub $0x228,%rsp - mov %rbx,-0x228(%rbp) - mov %r13,-0x220(%rbp) - mov %rdi,%rbx - mov $0x2,%esi - mov $0x3,%edx - mov $0x4,%ecx - mov $0x5,%r8d - callq foo - mov %rax,%r13 - mov %rbx,%rdi - mov $0x6,%esi - mov $0x7,%edx - mov $0x8,%ecx - mov $0x9,%r8d - callq bar - add %r13,%rax - mov -0x228(%rbp),%rbx - mov -0x220(%rbp),%r13 - leaveq - retq - - Which is in this example equivalent in C to:: - - u64 bpf_filter(u64 ctx) - { - return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9); - } - - In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64 - arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper - registers and place their return value into ``%rax`` which is R0 in eBPF. - Prologue and epilogue are emitted by JIT and are implicit in the - interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve - them across the calls as defined by calling convention. - - For example the following program is invalid:: - - bpf_mov R1, 1 - bpf_call foo - bpf_mov R0, R1 - bpf_exit - - After the call the registers R1-R5 contain junk values and cannot be read. - An in-kernel eBPF verifier is used to validate internal BPF programs. - -Also in the new design, eBPF is limited to 4096 insns, which means that any -program will terminate quickly and will only call a fixed number of kernel -functions. Original BPF and the new format are two operand instructions, -which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT. - -The input context pointer for invoking the interpreter function is generic, -its content is defined by a specific use case. For seccomp register R1 points -to seccomp_data, for converted BPF filters R1 points to a skb. - -A program, that is translated internally consists of the following elements:: - - op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32 - -So far 87 internal BPF instructions were implemented. 8-bit 'op' opcode field -has room for new instructions. Some of them may use 16/24/32 byte encoding. New -instructions must be multiple of 8 bytes to preserve backward compatibility. - -Internal BPF is a general purpose RISC instruction set. Not every register and -every instruction are used during translation from original BPF to new format. -For example, socket filters are not using ``exclusive add`` instruction, but -tracing filters may do to maintain counters of events, for example. Register R9 -is not used by socket filters either, but more complex filters may be running -out of registers and would have to resort to spill/fill to stack. - -Internal BPF can be used as a generic assembler for last step performance -optimizations, socket filters and seccomp are using it as assembler. Tracing -filters may use it as assembler to generate code from kernel. In kernel usage -may not be bounded by security considerations, since generated internal BPF code -may be optimizing internal code path and not being exposed to the user space. -Safety of internal BPF can come from a verifier (TBD). In such use cases as -described, it may be used as safe instruction set. - -Just like the original BPF, the new format runs within a controlled environment, -is deterministic and the kernel can easily prove that. The safety of the program -can be determined in two steps: first step does depth-first-search to disallow -loops and other CFG validation; second step starts from the first insn and -descends all possible paths. It simulates execution of every insn and observes -the state change of registers and stack. - -eBPF opcode encoding --------------------- - -eBPF is reusing most of the opcode encoding from classic to simplify conversion -of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code' -field is divided into three parts:: - - +----------------+--------+--------------------+ - | 4 bits | 1 bit | 3 bits | - | operation code | source | instruction class | - +----------------+--------+--------------------+ - (MSB) (LSB) - -Three LSB bits store instruction class which is one of: - - =================== =============== - Classic BPF classes eBPF classes - =================== =============== - BPF_LD 0x00 BPF_LD 0x00 - BPF_LDX 0x01 BPF_LDX 0x01 - BPF_ST 0x02 BPF_ST 0x02 - BPF_STX 0x03 BPF_STX 0x03 - BPF_ALU 0x04 BPF_ALU 0x04 - BPF_JMP 0x05 BPF_JMP 0x05 - BPF_RET 0x06 BPF_JMP32 0x06 - BPF_MISC 0x07 BPF_ALU64 0x07 - =================== =============== - -When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ... - - :: - - BPF_K 0x00 - BPF_X 0x08 - - * in classic BPF, this means:: - - BPF_SRC(code) == BPF_X - use register X as source operand - BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand - - * in eBPF, this means:: - - BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand - BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand - -... and four MSB bits store operation code. - -If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of:: - - BPF_ADD 0x00 - BPF_SUB 0x10 - BPF_MUL 0x20 - BPF_DIV 0x30 - BPF_OR 0x40 - BPF_AND 0x50 - BPF_LSH 0x60 - BPF_RSH 0x70 - BPF_NEG 0x80 - BPF_MOD 0x90 - BPF_XOR 0xa0 - BPF_MOV 0xb0 /* eBPF only: mov reg to reg */ - BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */ - BPF_END 0xd0 /* eBPF only: endianness conversion */ - -If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of:: - - BPF_JA 0x00 /* BPF_JMP only */ - BPF_JEQ 0x10 - BPF_JGT 0x20 - BPF_JGE 0x30 - BPF_JSET 0x40 - BPF_JNE 0x50 /* eBPF only: jump != */ - BPF_JSGT 0x60 /* eBPF only: signed '>' */ - BPF_JSGE 0x70 /* eBPF only: signed '>=' */ - BPF_CALL 0x80 /* eBPF BPF_JMP only: function call */ - BPF_EXIT 0x90 /* eBPF BPF_JMP only: function return */ - BPF_JLT 0xa0 /* eBPF only: unsigned '<' */ - BPF_JLE 0xb0 /* eBPF only: unsigned '<=' */ - BPF_JSLT 0xc0 /* eBPF only: signed '<' */ - BPF_JSLE 0xd0 /* eBPF only: signed '<=' */ - -So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF -and eBPF. There are only two registers in classic BPF, so it means A += X. -In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly, -BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous -src_reg = (u32) src_reg ^ (u32) imm32 in eBPF. - -Classic BPF is using BPF_MISC class to represent A = X and X = A moves. -eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no -BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean -exactly the same operations as BPF_ALU, but with 64-bit wide operands -instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.: -dst_reg = dst_reg + src_reg - -Classic BPF wastes the whole BPF_RET class to represent a single ``ret`` -operation. Classic BPF_RET | BPF_K means copy imm32 into return register -and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT -in eBPF means function exit only. The eBPF program needs to store return -value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as -BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide -operands for the comparisons instead. - -For load and store instructions the 8-bit 'code' field is divided as:: - - +--------+--------+-------------------+ - | 3 bits | 2 bits | 3 bits | - | mode | size | instruction class | - +--------+--------+-------------------+ - (MSB) (LSB) - -Size modifier is one of ... - -:: - - BPF_W 0x00 /* word */ - BPF_H 0x08 /* half word */ - BPF_B 0x10 /* byte */ - BPF_DW 0x18 /* eBPF only, double word */ - -... which encodes size of load/store operation:: - - B - 1 byte - H - 2 byte - W - 4 byte - DW - 8 byte (eBPF only) - -Mode modifier is one of:: - - BPF_IMM 0x00 /* used for 32-bit mov in classic BPF and 64-bit in eBPF */ - BPF_ABS 0x20 - BPF_IND 0x40 - BPF_MEM 0x60 - BPF_LEN 0x80 /* classic BPF only, reserved in eBPF */ - BPF_MSH 0xa0 /* classic BPF only, reserved in eBPF */ - BPF_ATOMIC 0xc0 /* eBPF only, atomic operations */ - -eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and -(BPF_IND | <size> | BPF_LD) which are used to access packet data. - -They had to be carried over from classic to have strong performance of -socket filters running in eBPF interpreter. These instructions can only -be used when interpreter context is a pointer to ``struct sk_buff`` and -have seven implicit operands. Register R6 is an implicit input that must -contain pointer to sk_buff. Register R0 is an implicit output which contains -the data fetched from the packet. Registers R1-R5 are scratch registers -and must not be used to store the data across BPF_ABS | BPF_LD or -BPF_IND | BPF_LD instructions. - -These instructions have implicit program exit condition as well. When -eBPF program is trying to access the data beyond the packet boundary, -the interpreter will abort the execution of the program. JIT compilers -therefore must preserve this property. src_reg and imm32 fields are -explicit inputs to these instructions. - -For example:: - - BPF_IND | BPF_W | BPF_LD means: - - R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32)) - and R1 - R5 were scratched. - -Unlike classic BPF instruction set, eBPF has generic load/store operations:: - - BPF_MEM | <size> | BPF_STX: *(size *) (dst_reg + off) = src_reg - BPF_MEM | <size> | BPF_ST: *(size *) (dst_reg + off) = imm32 - BPF_MEM | <size> | BPF_LDX: dst_reg = *(size *) (src_reg + off) - -Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. - -It also includes atomic operations, which use the immediate field for extra -encoding:: - - .imm = BPF_ADD, .code = BPF_ATOMIC | BPF_W | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg - .imm = BPF_ADD, .code = BPF_ATOMIC | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg - -The basic atomic operations supported are:: - - BPF_ADD - BPF_AND - BPF_OR - BPF_XOR - -Each having equivalent semantics with the ``BPF_ADD`` example, that is: the -memory location addresed by ``dst_reg + off`` is atomically modified, with -``src_reg`` as the other operand. If the ``BPF_FETCH`` flag is set in the -immediate, then these operations also overwrite ``src_reg`` with the -value that was in memory before it was modified. - -The more special operations are:: - - BPF_XCHG - -This atomically exchanges ``src_reg`` with the value addressed by ``dst_reg + -off``. :: - - BPF_CMPXCHG - -This atomically compares the value addressed by ``dst_reg + off`` with -``R0``. If they match it is replaced with ``src_reg``. In either case, the -value that was there before is zero-extended and loaded back to ``R0``. - -Note that 1 and 2 byte atomic operations are not supported. - -Clang can generate atomic instructions by default when ``-mcpu=v3`` is -enabled. If a lower version for ``-mcpu`` is set, the only atomic instruction -Clang can generate is ``BPF_ADD`` *without* ``BPF_FETCH``. If you need to enable -the atomics features, while keeping a lower ``-mcpu`` version, you can use -``-Xclang -target-feature -Xclang +alu32``. - -You may encounter ``BPF_XADD`` - this is a legacy name for ``BPF_ATOMIC``, -referring to the exclusive-add operation encoded when the immediate field is -zero. - -eBPF has one 16-byte instruction: ``BPF_LD | BPF_DW | BPF_IMM`` which consists -of two consecutive ``struct bpf_insn`` 8-byte blocks and interpreted as single -instruction that loads 64-bit immediate value into a dst_reg. -Classic BPF has similar instruction: ``BPF_LD | BPF_W | BPF_IMM`` which loads -32-bit immediate value into a register. - -eBPF verifier -------------- -The safety of the eBPF program is determined in two steps. - -First step does DAG check to disallow loops and other CFG validation. -In particular it will detect programs that have unreachable instructions. -(though classic BPF checker allows them) - -Second step starts from the first insn and descends all possible paths. -It simulates execution of every insn and observes the state change of -registers and stack. - -At the start of the program the register R1 contains a pointer to context -and has type PTR_TO_CTX. -If verifier sees an insn that does R2=R1, then R2 has now type -PTR_TO_CTX as well and can be used on the right hand side of expression. -If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=SCALAR_VALUE, -since addition of two valid pointers makes invalid pointer. -(In 'secure' mode verifier will reject any type of pointer arithmetic to make -sure that kernel addresses don't leak to unprivileged users) - -If register was never written to, it's not readable:: - - bpf_mov R0 = R2 - bpf_exit - -will be rejected, since R2 is unreadable at the start of the program. - -After kernel function call, R1-R5 are reset to unreadable and -R0 has a return type of the function. - -Since R6-R9 are callee saved, their state is preserved across the call. - -:: - - bpf_mov R6 = 1 - bpf_call foo - bpf_mov R0 = R6 - bpf_exit - -is a correct program. If there was R1 instead of R6, it would have -been rejected. - -load/store instructions are allowed only with registers of valid types, which -are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked. -For example:: - - bpf_mov R1 = 1 - bpf_mov R2 = 2 - bpf_xadd *(u32 *)(R1 + 3) += R2 - bpf_exit - -will be rejected, since R1 doesn't have a valid pointer type at the time of -execution of instruction bpf_xadd. - -At the start R1 type is PTR_TO_CTX (a pointer to generic ``struct bpf_context``) -A callback is used to customize verifier to restrict eBPF program access to only -certain fields within ctx structure with specified size and alignment. - -For example, the following insn:: - - bpf_ld R0 = *(u32 *)(R6 + 8) - -intends to load a word from address R6 + 8 and store it into R0 -If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know -that offset 8 of size 4 bytes can be accessed for reading, otherwise -the verifier will reject the program. -If R6=PTR_TO_STACK, then access should be aligned and be within -stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8, -so it will fail verification, since it's out of bounds. - -The verifier will allow eBPF program to read data from stack only after -it wrote into it. - -Classic BPF verifier does similar check with M[0-15] memory slots. -For example:: - - bpf_ld R0 = *(u32 *)(R10 - 4) - bpf_exit - -is invalid program. -Though R10 is correct read-only register and has type PTR_TO_STACK -and R10 - 4 is within stack bounds, there were no stores into that location. - -Pointer register spill/fill is tracked as well, since four (R6-R9) -callee saved registers may not be enough for some programs. - -Allowed function calls are customized with bpf_verifier_ops->get_func_proto() -The eBPF verifier will check that registers match argument constraints. -After the call register R0 will be set to return type of the function. - -Function calls is a main mechanism to extend functionality of eBPF programs. -Socket filters may let programs to call one set of functions, whereas tracing -filters may allow completely different set. - -If a function made accessible to eBPF program, it needs to be thought through -from safety point of view. The verifier will guarantee that the function is -called with valid arguments. - -seccomp vs socket filters have different security restrictions for classic BPF. -Seccomp solves this by two stage verifier: classic BPF verifier is followed -by seccomp verifier. In case of eBPF one configurable verifier is shared for -all use cases. - -See details of eBPF verifier in kernel/bpf/verifier.c - -Register value tracking ------------------------ -In order to determine the safety of an eBPF program, the verifier must track -the range of possible values in each register and also in each stack slot. -This is done with ``struct bpf_reg_state``, defined in include/linux/ -bpf_verifier.h, which unifies tracking of scalar and pointer values. Each -register state has a type, which is either NOT_INIT (the register has not been -written to), SCALAR_VALUE (some value which is not usable as a pointer), or a -pointer type. The types of pointers describe their base, as follows: - - - PTR_TO_CTX - Pointer to bpf_context. - CONST_PTR_TO_MAP - Pointer to struct bpf_map. "Const" because arithmetic - on these pointers is forbidden. - PTR_TO_MAP_VALUE - Pointer to the value stored in a map element. - PTR_TO_MAP_VALUE_OR_NULL - Either a pointer to a map value, or NULL; map accesses - (see section 'eBPF maps', below) return this type, - which becomes a PTR_TO_MAP_VALUE when checked != NULL. - Arithmetic on these pointers is forbidden. - PTR_TO_STACK - Frame pointer. - PTR_TO_PACKET - skb->data. - PTR_TO_PACKET_END - skb->data + headlen; arithmetic forbidden. - PTR_TO_SOCKET - Pointer to struct bpf_sock_ops, implicitly refcounted. - PTR_TO_SOCKET_OR_NULL - Either a pointer to a socket, or NULL; socket lookup - returns this type, which becomes a PTR_TO_SOCKET when - checked != NULL. PTR_TO_SOCKET is reference-counted, - so programs must release the reference through the - socket release function before the end of the program. - Arithmetic on these pointers is forbidden. - -However, a pointer may be offset from this base (as a result of pointer -arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable -offset'. The former is used when an exactly-known value (e.g. an immediate -operand) is added to a pointer, while the latter is used for values which are -not exactly known. The variable offset is also used in SCALAR_VALUEs, to track -the range of possible values in the register. - -The verifier's knowledge about the variable offset consists of: - -* minimum and maximum values as unsigned -* minimum and maximum values as signed - -* knowledge of the values of individual bits, in the form of a 'tnum': a u64 - 'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown; - 1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both - mask and value; no bit should ever be 1 in both. For example, if a byte is read - into a register from memory, the register's top 56 bits are known zero, while - the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we - then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0; - 0x1ff), because of potential carries. - -Besides arithmetic, the register state can also be updated by conditional -branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch -it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false' -branch it will have a umax_value of 8. A signed compare (with BPF_JSGT or -BPF_JSGE) would instead update the signed minimum/maximum values. Information -from the signed and unsigned bounds can be combined; for instance if a value is -first tested < 8 and then tested s> 4, the verifier will conclude that the value -is also > 4 and s< 8, since the bounds prevent crossing the sign boundary. - -PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all -pointers sharing that same variable offset. This is important for packet range -checks: after adding a variable to a packet pointer register A, if you then copy -it to another register B and then add a constant 4 to A, both registers will -share the same 'id' but the A will have a fixed offset of +4. Then if A is -bounds-checked and found to be less than a PTR_TO_PACKET_END, the register B is -now known to have a safe range of at least 4 bytes. See 'Direct packet access', -below, for more on PTR_TO_PACKET ranges. - -The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of -the pointer returned from a map lookup. This means that when one copy is -checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs. -As well as range-checking, the tracked information is also used for enforcing -alignment of pointer accesses. For instance, on most systems the packet pointer -is 2 bytes after a 4-byte alignment. If a program adds 14 bytes to that to jump -over the Ethernet header, then reads IHL and addes (IHL * 4), the resulting -pointer will have a variable offset known to be 4n+2 for some n, so adding the 2 -bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through -that pointer are safe. -The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common -to all copies of the pointer returned from a socket lookup. This has similar -behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but -it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly -represents a reference to the corresponding ``struct sock``. To ensure that the -reference is not leaked, it is imperative to NULL-check the reference and in -the non-NULL case, and pass the valid reference to the socket release function. - -Direct packet access --------------------- -In cls_bpf and act_bpf programs the verifier allows direct access to the packet -data via skb->data and skb->data_end pointers. -Ex:: - - 1: r4 = *(u32 *)(r1 +80) /* load skb->data_end */ - 2: r3 = *(u32 *)(r1 +76) /* load skb->data */ - 3: r5 = r3 - 4: r5 += 14 - 5: if r5 > r4 goto pc+16 - R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp - 6: r0 = *(u16 *)(r3 +12) /* access 12 and 13 bytes of the packet */ - -this 2byte load from the packet is safe to do, since the program author -did check ``if (skb->data + 14 > skb->data_end) goto err`` at insn #5 which -means that in the fall-through case the register R3 (which points to skb->data) -has at least 14 directly accessible bytes. The verifier marks it -as R3=pkt(id=0,off=0,r=14). -id=0 means that no additional variables were added to the register. -off=0 means that no additional constants were added. -r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok. -Note that R5 is marked as R5=pkt(id=0,off=14,r=14). It also points -to the packet data, but constant 14 was added to the register, so -it now points to ``skb->data + 14`` and accessible range is [R5, R5 + 14 - 14) -which is zero bytes. - -More complex packet access may look like:: - - - R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp - 6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */ - 7: r4 = *(u8 *)(r3 +12) - 8: r4 *= 14 - 9: r3 = *(u32 *)(r1 +76) /* load skb->data */ - 10: r3 += r4 - 11: r2 = r1 - 12: r2 <<= 48 - 13: r2 >>= 48 - 14: r3 += r2 - 15: r2 = r3 - 16: r2 += 8 - 17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */ - 18: if r2 > r1 goto pc+2 - R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp - 19: r1 = *(u8 *)(r3 +4) - -The state of the register R3 is R3=pkt(id=2,off=0,r=8) -id=2 means that two ``r3 += rX`` instructions were seen, so r3 points to some -offset within a packet and since the program author did -``if (r3 + 8 > r1) goto err`` at insn #18, the safe range is [R3, R3 + 8). -The verifier only allows 'add'/'sub' operations on packet registers. Any other -operation will set the register state to 'SCALAR_VALUE' and it won't be -available for direct packet access. - -Operation ``r3 += rX`` may overflow and become less than original skb->data, -therefore the verifier has to prevent that. So when it sees ``r3 += rX`` -instruction and rX is more than 16-bit value, any subsequent bounds-check of r3 -against skb->data_end will not give us 'range' information, so attempts to read -through the pointer will give "invalid access to packet" error. - -Ex. after insn ``r4 = *(u8 *)(r3 +12)`` (insn #7 above) the state of r4 is -R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits -of the register are guaranteed to be zero, and nothing is known about the lower -8 bits. After insn ``r4 *= 14`` the state becomes -R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit -value by constant 14 will keep upper 52 bits as zero, also the least significant -bit will be zero as 14 is even. Similarly ``r2 >>= 48`` will make -R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign -extending. This logic is implemented in adjust_reg_min_max_vals() function, -which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice -versa) and adjust_scalar_min_max_vals() for operations on two scalars. - -The end result is that bpf program author can access packet directly -using normal C code as:: - - void *data = (void *)(long)skb->data; - void *data_end = (void *)(long)skb->data_end; - struct eth_hdr *eth = data; - struct iphdr *iph = data + sizeof(*eth); - struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph); - - if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end) - return 0; - if (eth->h_proto != htons(ETH_P_IP)) - return 0; - if (iph->protocol != IPPROTO_UDP || iph->ihl != 5) - return 0; - if (udp->dest == 53 || udp->source == 9) - ...; - -which makes such programs easier to write comparing to LD_ABS insn -and significantly faster. - -eBPF maps ---------- -'maps' is a generic storage of different types for sharing data between kernel -and userspace. - -The maps are accessed from user space via BPF syscall, which has commands: - -- create a map with given type and attributes - ``map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)`` - using attr->map_type, attr->key_size, attr->value_size, attr->max_entries - returns process-local file descriptor or negative error - -- lookup key in a given map - ``err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)`` - using attr->map_fd, attr->key, attr->value - returns zero and stores found elem into value or negative error - -- create or update key/value pair in a given map - ``err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)`` - using attr->map_fd, attr->key, attr->value - returns zero or negative error - -- find and delete element by key in a given map - ``err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)`` - using attr->map_fd, attr->key - -- to delete map: close(fd) - Exiting process will delete maps automatically - -userspace programs use this syscall to create/access maps that eBPF programs -are concurrently updating. - -maps can have different types: hash, array, bloom filter, radix-tree, etc. - -The map is defined by: - - - type - - max number of elements - - key size in bytes - - value size in bytes - -Pruning -------- -The verifier does not actually walk all possible paths through the program. For -each new branch to analyse, the verifier looks at all the states it's previously -been in when at this instruction. If any of them contain the current state as a -subset, the branch is 'pruned' - that is, the fact that the previous state was -accepted implies the current state would be as well. For instance, if in the -previous state, r1 held a packet-pointer, and in the current state, r1 holds a -packet-pointer with a range as long or longer and at least as strict an -alignment, then r1 is safe. Similarly, if r2 was NOT_INIT before then it can't -have been used by any path from that point, so any value in r2 (including -another NOT_INIT) is safe. The implementation is in the function regsafe(). -Pruning considers not only the registers but also the stack (and any spilled -registers it may hold). They must all be safe for the branch to be pruned. -This is implemented in states_equal(). - -Understanding eBPF verifier messages ------------------------------------- - -The following are few examples of invalid eBPF programs and verifier error -messages as seen in the log: - -Program with unreachable instructions:: - - static struct bpf_insn prog[] = { - BPF_EXIT_INSN(), - BPF_EXIT_INSN(), - }; - -Error: - - unreachable insn 1 - -Program that reads uninitialized register:: - - BPF_MOV64_REG(BPF_REG_0, BPF_REG_2), - BPF_EXIT_INSN(), - -Error:: - - 0: (bf) r0 = r2 - R2 !read_ok - -Program that doesn't initialize R0 before exiting:: - - BPF_MOV64_REG(BPF_REG_2, BPF_REG_1), - BPF_EXIT_INSN(), - -Error:: - - 0: (bf) r2 = r1 - 1: (95) exit - R0 !read_ok - -Program that accesses stack out of bounds:: - - BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0), - BPF_EXIT_INSN(), - -Error:: - - 0: (7a) *(u64 *)(r10 +8) = 0 - invalid stack off=8 size=8 - -Program that doesn't initialize stack before passing its address into function:: - - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_LD_MAP_FD(BPF_REG_1, 0), - BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), - BPF_EXIT_INSN(), - -Error:: - - 0: (bf) r2 = r10 - 1: (07) r2 += -8 - 2: (b7) r1 = 0x0 - 3: (85) call 1 - invalid indirect read from stack off -8+0 size 8 - -Program that uses invalid map_fd=0 while calling to map_lookup_elem() function:: - - BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_LD_MAP_FD(BPF_REG_1, 0), - BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), - BPF_EXIT_INSN(), - -Error:: - - 0: (7a) *(u64 *)(r10 -8) = 0 - 1: (bf) r2 = r10 - 2: (07) r2 += -8 - 3: (b7) r1 = 0x0 - 4: (85) call 1 - fd 0 is not pointing to valid bpf_map - -Program that doesn't check return value of map_lookup_elem() before accessing -map element:: - - BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_LD_MAP_FD(BPF_REG_1, 0), - BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), - BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0), - BPF_EXIT_INSN(), - -Error:: - - 0: (7a) *(u64 *)(r10 -8) = 0 - 1: (bf) r2 = r10 - 2: (07) r2 += -8 - 3: (b7) r1 = 0x0 - 4: (85) call 1 - 5: (7a) *(u64 *)(r0 +0) = 0 - R0 invalid mem access 'map_value_or_null' - -Program that correctly checks map_lookup_elem() returned value for NULL, but -accesses the memory with incorrect alignment:: - - BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_LD_MAP_FD(BPF_REG_1, 0), - BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), - BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1), - BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0), - BPF_EXIT_INSN(), - -Error:: - - 0: (7a) *(u64 *)(r10 -8) = 0 - 1: (bf) r2 = r10 - 2: (07) r2 += -8 - 3: (b7) r1 = 1 - 4: (85) call 1 - 5: (15) if r0 == 0x0 goto pc+1 - R0=map_ptr R10=fp - 6: (7a) *(u64 *)(r0 +4) = 0 - misaligned access off 4 size 8 - -Program that correctly checks map_lookup_elem() returned value for NULL and -accesses memory with correct alignment in one side of 'if' branch, but fails -to do so in the other side of 'if' branch:: - - BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_LD_MAP_FD(BPF_REG_1, 0), - BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), - BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2), - BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0), - BPF_EXIT_INSN(), - BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1), - BPF_EXIT_INSN(), - -Error:: - - 0: (7a) *(u64 *)(r10 -8) = 0 - 1: (bf) r2 = r10 - 2: (07) r2 += -8 - 3: (b7) r1 = 1 - 4: (85) call 1 - 5: (15) if r0 == 0x0 goto pc+2 - R0=map_ptr R10=fp - 6: (7a) *(u64 *)(r0 +0) = 0 - 7: (95) exit - - from 5 to 8: R0=imm0 R10=fp - 8: (7a) *(u64 *)(r0 +0) = 1 - R0 invalid mem access 'imm' - -Program that performs a socket lookup then sets the pointer to NULL without -checking it:: - - BPF_MOV64_IMM(BPF_REG_2, 0), - BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8), - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_MOV64_IMM(BPF_REG_3, 4), - BPF_MOV64_IMM(BPF_REG_4, 0), - BPF_MOV64_IMM(BPF_REG_5, 0), - BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp), - BPF_MOV64_IMM(BPF_REG_0, 0), - BPF_EXIT_INSN(), - -Error:: - - 0: (b7) r2 = 0 - 1: (63) *(u32 *)(r10 -8) = r2 - 2: (bf) r2 = r10 - 3: (07) r2 += -8 - 4: (b7) r3 = 4 - 5: (b7) r4 = 0 - 6: (b7) r5 = 0 - 7: (85) call bpf_sk_lookup_tcp#65 - 8: (b7) r0 = 0 - 9: (95) exit - Unreleased reference id=1, alloc_insn=7 - -Program that performs a socket lookup but does not NULL-check the returned -value:: - - BPF_MOV64_IMM(BPF_REG_2, 0), - BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8), - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), - BPF_MOV64_IMM(BPF_REG_3, 4), - BPF_MOV64_IMM(BPF_REG_4, 0), - BPF_MOV64_IMM(BPF_REG_5, 0), - BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp), - BPF_EXIT_INSN(), - -Error:: - - 0: (b7) r2 = 0 - 1: (63) *(u32 *)(r10 -8) = r2 - 2: (bf) r2 = r10 - 3: (07) r2 += -8 - 4: (b7) r3 = 4 - 5: (b7) r4 = 0 - 6: (b7) r5 = 0 - 7: (85) call bpf_sk_lookup_tcp#65 - 8: (95) exit - Unreleased reference id=1, alloc_insn=7 - Testing ------- Next to the BPF toolchain, the kernel also ships a test module that contains -various test cases for classic and internal BPF that can be executed against +various test cases for classic and eBPF that can be executed against the BPF interpreter and JIT compiler. It can be found in lib/test_bpf.c and enabled via Kconfig:: diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 58bc8cd367c6..16a153bcc5fe 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -1,12 +1,13 @@ -Linux Networking Documentation -============================== +Networking +========== + +Refer to :ref:`netdev-FAQ` for a guide on netdev development process specifics. Contents: .. toctree:: :maxdepth: 2 - netdev-FAQ af_xdp bareudp batman-adv @@ -46,7 +47,6 @@ Contents: cdc_mbim dccp dctcp - decnet dns_resolver driver eql @@ -92,10 +92,13 @@ Contents: radiotap-headers rds regulatory + representors rxrpc sctp secid seg6-sysctl + skbuff + smc-sysctl statistics strparser switchdev diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index c04431144f7a..e7b3fa7bb3f7 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -25,7 +25,8 @@ ip_default_ttl - INTEGER ip_no_pmtu_disc - INTEGER Disable Path MTU Discovery. If enabled in mode 1 and a fragmentation-required ICMP is received, the PMTU to this - destination will be set to min_pmtu (see below). You will need + destination will be set to the smallest of the old MTU to + this destination and min_pmtu (see below). You will need to raise min_pmtu to the smallest interface MTU on your system manually if you want to avoid locally generated fragments. @@ -49,7 +50,8 @@ ip_no_pmtu_disc - INTEGER Default: FALSE min_pmtu - INTEGER - default 552 - minimum discovered Path MTU + default 552 - minimum Path MTU. Unless this is changed mannually, + each cached pmtu will never be lower than this setting. ip_forward_use_pmtu - BOOLEAN By default we don't trust protocol path MTUs while forwarding @@ -200,6 +202,12 @@ neigh/default/unres_qlen - INTEGER Default: 101 +neigh/default/interval_probe_time_ms - INTEGER + The probe interval for neighbor entries with NTF_MANAGED flag, + the min value is 1. + + Default: 5000 + mtu_expires - INTEGER Time, in seconds, that cached PMTU information is kept. @@ -265,6 +273,13 @@ ipfrag_max_dist - INTEGER from different IP datagrams, which could result in data corruption. Default: 64 +bc_forwarding - INTEGER + bc_forwarding enables the feature described in rfc1812#section-5.3.5.2 + and rfc2644. It allows the router to forward directed broadcast. + To enable this feature, the 'all' entry and the input interface entry + should be set to 1. + Default: 0 + INET peer storage ================= @@ -621,6 +636,16 @@ tcp_recovery - INTEGER Default: 0x1 +tcp_reflect_tos - BOOLEAN + For listening sockets, reuse the DSCP value of the initial SYN message + for outgoing packets. This allows to have both directions of a TCP + stream to use the same DSCP value, assuming DSCP remains unchanged for + the lifetime of the connection. + + This options affects both IPv4 and IPv6. + + Default: 0 (disabled) + tcp_reordering - INTEGER Initial reordering level of packets in a TCP stream. TCP stack can then dynamically adjust flow reordering level @@ -876,6 +901,29 @@ tcp_min_tso_segs - INTEGER Default: 2 +tcp_tso_rtt_log - INTEGER + Adjustment of TSO packet sizes based on min_rtt + + Starting from linux-5.18, TCP autosizing can be tweaked + for flows having small RTT. + + Old autosizing was splitting the pacing budget to send 1024 TSO + per second. + + tso_packet_size = sk->sk_pacing_rate / 1024; + + With the new mechanism, we increase this TSO sizing using: + + distance = min_rtt_usec / (2^tcp_tso_rtt_log) + tso_packet_size += gso_max_size >> distance; + + This means that flows between very close hosts can use bigger + TSO packets, reducing their cpu costs. + + If you want to use the old autosizing, set this sysctl to 0. + + Default: 9 (2^9 = 512 usec) + tcp_pacing_ss_ratio - INTEGER sk->sk_pacing_rate is set by TCP stack using a ratio applied to current rate. (current_rate = cwnd * mss / srtt) @@ -987,7 +1035,39 @@ tcp_limit_output_bytes - INTEGER tcp_challenge_ack_limit - INTEGER Limits number of Challenge ACK sent per second, as recommended in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks) - Default: 1000 + Note that this per netns rate limit can allow some side channel + attacks and probably should not be enabled. + TCP stack implements per TCP socket limits anyway. + Default: INT_MAX (unlimited) + +tcp_ehash_entries - INTEGER + Show the number of hash buckets for TCP sockets in the current + networking namespace. + + A negative value means the networking namespace does not own its + hash buckets and shares the initial networking namespace's one. + +tcp_child_ehash_entries - INTEGER + Control the number of hash buckets for TCP sockets in the child + networking namespace, which must be set before clone() or unshare(). + + If the value is not 0, the kernel uses a value rounded up to 2^n + as the actual hash bucket size. 0 is a special value, meaning + the child networking namespace will share the initial networking + namespace's hash buckets. + + Note that the child will use the global one in case the kernel + fails to allocate enough memory. In addition, the global hash + buckets are spread over available NUMA nodes, but the allocation + of the child hash table depends on the current process's NUMA + policy, which could result in performance differences. + + Note also that the default value of tcp_max_tw_buckets and + tcp_max_syn_backlog depend on the hash bucket size. + + Possible values: 0, 2^n (n: 0 - 24 (16Mi)) + + Default: 0 UDP variables ============= @@ -1020,11 +1100,7 @@ udp_rmem_min - INTEGER Default: 4K udp_wmem_min - INTEGER - Minimal size of send buffer used by UDP sockets in moderation. - Each UDP socket is able to use the size for sending data, even if - total pages of UDP sockets exceed udp_mem pressure. The unit is byte. - - Default: 4K + UDP does not have tx memory accounting and this tunable has no effect. RAW variables ============= @@ -1053,7 +1129,7 @@ cipso_cache_enable - BOOLEAN cipso_cache_bucket_size - INTEGER The CIPSO label cache consists of a fixed size hash table with each hash bucket containing a number of cache entries. This variable limits - the number of entries in each hash bucket; the larger the value the + the number of entries in each hash bucket; the larger the value is, the more CIPSO label mappings that can be cached. When the number of entries in a given hash bucket reaches this limit adding new entries causes the oldest entry in the bucket to be removed to make room. @@ -1147,7 +1223,7 @@ ip_autobind_reuse - BOOLEAN option should only be set by experts. Default: 0 -ip_dynaddr - BOOLEAN +ip_dynaddr - INTEGER If set non-zero, enables support for dynamic addresses. If set to a non-zero value larger than 1, a kernel log message will be printed when dynamic address rewriting @@ -1595,12 +1671,15 @@ arp_notify - BOOLEAN or hardware address changes. == ========================================================== -arp_accept - BOOLEAN - Define behavior for gratuitous ARP frames who's IP is not - already present in the ARP table: +arp_accept - INTEGER + Define behavior for accepting gratuitous ARP (garp) frames from devices + that are not already present in the ARP table: - 0 - don't create new entries in the ARP table - 1 - create new entries in the ARP table + - 2 - create new entries only if the source IP address is in the same + subnet as an address configured on the interface that received the + garp message. Both replies and requests type gratuitous arp will trigger the ARP table to be updated, if this setting is on. @@ -2442,6 +2521,37 @@ drop_unsolicited_na - BOOLEAN By default this is turned off. +accept_untracked_na - INTEGER + Define behavior for accepting neighbor advertisements from devices that + are absent in the neighbor cache: + + - 0 - (default) Do not accept unsolicited and untracked neighbor + advertisements. + + - 1 - Add a new neighbor cache entry in STALE state for routers on + receiving a neighbor advertisement (either solicited or unsolicited) + with target link-layer address option specified if no neighbor entry + is already present for the advertised IPv6 address. Without this knob, + NAs received for untracked addresses (absent in neighbor cache) are + silently ignored. + + This is as per router-side behavior documented in RFC9131. + + This has lower precedence than drop_unsolicited_na. + + This will optimize the return path for the initial off-link + communication that is initiated by a directly connected host, by + ensuring that the first-hop router which turns on this setting doesn't + have to buffer the initial return packets to do neighbor-solicitation. + The prerequisite is that the host is configured to send unsolicited + neighbor advertisements on interface bringup. This setting should be + used in conjunction with the ndisc_notify setting on the host to + satisfy this prerequisite. + + - 2 - Extend option (1) to add a new neighbor cache entry only if the + source IP address is in the same subnet as an address configured on + the interface that received the neighbor advertisement. + enhanced_dad - BOOLEAN Include a nonce option in the IPv6 neighbor solicitation messages used for duplicate address detection per RFC7527. A received DAD NS will only signal @@ -2816,7 +2926,14 @@ sctp_rmem - vector of 3 INTEGERs: min, default, max Default: 4K sctp_wmem - vector of 3 INTEGERs: min, default, max - Currently this tunable has no effect. + Only the first value ("min") is used, "default" and "max" are + ignored. + + min: Minimum size of send buffer that can be used by SCTP sockets. + It is guaranteed to each SCTP socket (but not association) even + under moderate memory pressure. + + Default: 4K addr_scope_policy - INTEGER Control IPv4 address scoping - draft-stewart-tsvwg-sctp-ipv4-00 @@ -2871,6 +2988,43 @@ plpmtud_probe_interval - INTEGER Default: 0 +reconf_enable - BOOLEAN + Enable or disable extension of Stream Reconfiguration functionality + specified in RFC6525. This extension provides the ability to "reset" + a stream, and it includes the Parameters of "Outgoing/Incoming SSN + Reset", "SSN/TSN Reset" and "Add Outgoing/Incoming Streams". + + - 1: Enable extension. + - 0: Disable extension. + + Default: 0 + +intl_enable - BOOLEAN + Enable or disable extension of User Message Interleaving functionality + specified in RFC8260. This extension allows the interleaving of user + messages sent on different streams. With this feature enabled, I-DATA + chunk will replace DATA chunk to carry user messages if also supported + by the peer. Note that to use this feature, one needs to set this option + to 1 and also needs to set socket options SCTP_FRAGMENT_INTERLEAVE to 2 + and SCTP_INTERLEAVING_SUPPORTED to 1. + + - 1: Enable extension. + - 0: Disable extension. + + Default: 0 + +ecn_enable - BOOLEAN + Control use of Explicit Congestion Notification (ECN) by SCTP. + Like in TCP, ECN is used only when both ends of the SCTP connection + indicate support for it. This feature is useful in avoiding losses + due to congestion by allowing supporting routers to signal congestion + before having to drop packets. + + 1: Enable ecn. + 0: Disable ecn. + + Default: 1 + ``/proc/sys/net/core/*`` ======================== diff --git a/Documentation/networking/ipvlan.rst b/Documentation/networking/ipvlan.rst index 694adcba36b0..0000c1d383bc 100644 --- a/Documentation/networking/ipvlan.rst +++ b/Documentation/networking/ipvlan.rst @@ -11,7 +11,7 @@ Initial Release: ================ This is conceptually very similar to the macvlan driver with one major exception of using L3 for mux-ing /demux-ing among slaves. This property makes -the master device share the L2 with it's slave devices. I have developed this +the master device share the L2 with its slave devices. I have developed this driver in conjunction with network namespaces and not sure if there is use case outside of it. diff --git a/Documentation/networking/l2tp.rst b/Documentation/networking/l2tp.rst index 498b382d25a0..7f383e99dbad 100644 --- a/Documentation/networking/l2tp.rst +++ b/Documentation/networking/l2tp.rst @@ -530,7 +530,7 @@ its tunnel close actions. For L2TPIP sockets, the socket's close handler initiates the same tunnel close actions. All sessions are first closed. Each session drops its tunnel ref. When the tunnel ref reaches zero, the tunnel puts its socket ref. When the socket is -eventually destroyed, it's sk_destruct finally frees the L2TP tunnel +eventually destroyed, its sk_destruct finally frees the L2TP tunnel context. Sessions diff --git a/Documentation/networking/mctp.rst b/Documentation/networking/mctp.rst index 46f74bffce0f..c628cb5406d2 100644 --- a/Documentation/networking/mctp.rst +++ b/Documentation/networking/mctp.rst @@ -212,6 +212,54 @@ remote address is already known, or the message does not require a reply. Like the send calls, sockets will only receive responses to requests they have sent (TO=1) and may only respond (TO=0) to requests they have received. +``ioctl(SIOCMCTPALLOCTAG)`` and ``ioctl(SIOCMCTPDROPTAG)`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +These tags give applications more control over MCTP message tags, by allocating +(and dropping) tag values explicitly, rather than the kernel automatically +allocating a per-message tag at ``sendmsg()`` time. + +In general, you will only need to use these ioctls if your MCTP protocol does +not fit the usual request/response model. For example, if you need to persist +tags across multiple requests, or a request may generate more than one response. +In these cases, the ioctls allow you to decouple the tag allocation (and +release) from individual message send and receive operations. + +Both ioctls are passed a pointer to a ``struct mctp_ioc_tag_ctl``: + +.. code-block:: C + + struct mctp_ioc_tag_ctl { + mctp_eid_t peer_addr; + __u8 tag; + __u16 flags; + }; + +``SIOCMCTPALLOCTAG`` allocates a tag for a specific peer, which an application +can use in future ``sendmsg()`` calls. The application populates the +``peer_addr`` member with the remote EID. Other fields must be zero. + +On return, the ``tag`` member will be populated with the allocated tag value. +The allocated tag will have the following tag bits set: + + - ``MCTP_TAG_OWNER``: it only makes sense to allocate tags if you're the tag + owner + + - ``MCTP_TAG_PREALLOC``: to indicate to ``sendmsg()`` that this is a + preallocated tag. + + - ... and the actual tag value, within the least-significant three bits + (``MCTP_TAG_MASK``). Note that zero is a valid tag value. + +The tag value should be used as-is for the ``smctp_tag`` member of ``struct +sockaddr_mctp``. + +``SIOCMCTPDROPTAG`` releases a tag that has been previously allocated by a +``SIOCMCTPALLOCTAG`` ioctl. The ``peer_addr`` must be the same as used for the +allocation, and the ``tag`` value must match exactly the tag returned from the +allocation (including the ``MCTP_TAG_OWNER`` and ``MCTP_TAG_PREALLOC`` bits). +The ``flags`` field must be zero. + Kernel internals ================ diff --git a/Documentation/networking/mptcp-sysctl.rst b/Documentation/networking/mptcp-sysctl.rst index b0d4da71e68e..213510698014 100644 --- a/Documentation/networking/mptcp-sysctl.rst +++ b/Documentation/networking/mptcp-sysctl.rst @@ -46,6 +46,23 @@ allow_join_initial_addr_port - BOOLEAN Default: 1 +pm_type - INTEGER + Set the default path manager type to use for each new MPTCP + socket. In-kernel path management will control subflow + connections and address advertisements according to + per-namespace values configured over the MPTCP netlink + API. Userspace path management puts per-MPTCP-connection subflow + connection decisions and address advertisements under control of + a privileged userspace program, at the cost of more netlink + traffic to propagate all of the related events and commands. + + This is a per-namespace sysctl. + + * 0 - In-kernel path manager + * 1 - Userspace path manager + + Default: 0 + stale_loss_cnt - INTEGER The number of MPTCP-level retransmission intervals with no traffic and pending outstanding data on a given subflow required to declare it stale. diff --git a/Documentation/networking/net_failover.rst b/Documentation/networking/net_failover.rst index e143ab79a960..3a662f2b4d6e 100644 --- a/Documentation/networking/net_failover.rst +++ b/Documentation/networking/net_failover.rst @@ -35,7 +35,7 @@ To support this, the hypervisor needs to enable VIRTIO_NET_F_STANDBY feature on the virtio-net interface and assign the same MAC address to both virtio-net and VF interfaces. -Here is an example XML snippet that shows such configuration. +Here is an example libvirt XML snippet that shows such configuration: :: <interface type='network'> @@ -45,18 +45,32 @@ Here is an example XML snippet that shows such configuration. <model type='virtio'/> <driver name='vhost' queues='4'/> <link state='down'/> - <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/> + <teaming type='persistent'/> + <alias name='ua-backup0'/> </interface> <interface type='hostdev' managed='yes'> <mac address='52:54:00:00:12:53'/> <source> <address type='pci' domain='0x0000' bus='0x42' slot='0x02' function='0x5'/> </source> - <address type='pci' domain='0x0000' bus='0x00' slot='0x0b' function='0x0'/> + <teaming type='transient' persistent='ua-backup0'/> </interface> +In this configuration, the first device definition is for the virtio-net +interface and this acts as the 'persistent' device indicating that this +interface will always be plugged in. This is specified by the 'teaming' tag with +required attribute type having value 'persistent'. The link state for the +virtio-net device is set to 'down' to ensure that the 'failover' netdev prefers +the VF passthrough device for normal communication. The virtio-net device will +be brought UP during live migration to allow uninterrupted communication. + +The second device definition is for the VF passthrough interface. Here the +'teaming' tag is provided with type 'transient' indicating that this device may +periodically be unplugged. A second attribute - 'persistent' is provided and +points to the alias name declared for the virtio-net device. + Booting a VM with the above configuration will result in the following 3 -netdevs created in the VM. +interfaces created in the VM: :: 4: ens10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 @@ -65,13 +79,36 @@ netdevs created in the VM. valid_lft 42482sec preferred_lft 42482sec inet6 fe80::97d8:db2:8c10:b6d6/64 scope link valid_lft forever preferred_lft forever - 5: ens10nsby: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master ens10 state UP group default qlen 1000 + 5: ens10nsby: <BROADCAST,MULTICAST> mtu 1500 qdisc fq_codel master ens10 state DOWN group default qlen 1000 link/ether 52:54:00:00:12:53 brd ff:ff:ff:ff:ff:ff 7: ens11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ens10 state UP group default qlen 1000 link/ether 52:54:00:00:12:53 brd ff:ff:ff:ff:ff:ff -ens10 is the 'failover' master netdev, ens10nsby and ens11 are the slave -'standby' and 'primary' netdevs respectively. +Here, ens10 is the 'failover' master interface, ens10nsby is the slave 'standby' +virtio-net interface, and ens11 is the slave 'primary' VF passthrough interface. + +One point to note here is that some user space network configuration daemons +like systemd-networkd, ifupdown, etc, do not understand the 'net_failover' +device; and on the first boot, the VM might end up with both 'failover' device +and VF accquiring IP addresses (either same or different) from the DHCP server. +This will result in lack of connectivity to the VM. So some tweaks might be +needed to these network configuration daemons to make sure that an IP is +received only on the 'failover' device. + +Below is the patch snippet used with 'cloud-ifupdown-helper' script found on +Debian cloud images: + +:: + @@ -27,6 +27,8 @@ do_setup() { + local working="$cfgdir/.$INTERFACE" + local final="$cfgdir/$INTERFACE" + + + if [ -d "/sys/class/net/${INTERFACE}/master" ]; then exit 0; fi + + + if ifup --no-act "$INTERFACE" > /dev/null 2>&1; then + # interface is already known to ifupdown, no need to generate cfg + log "Skipping configuration generation for $INTERFACE" + Live Migration of a VM with SR-IOV VF & virtio-net in STANDBY mode ================================================================== @@ -80,40 +117,68 @@ net_failover also enables hypervisor controlled live migration to be supported with VMs that have direct attached SR-IOV VF devices by automatic failover to the paravirtual datapath when the VF is unplugged. -Here is a sample script that shows the steps to initiate live migration on -the source hypervisor. +Here is a sample script that shows the steps to initiate live migration from +the source hypervisor. Note: It is assumed that the VM is connected to a +software bridge 'br0' which has a single VF attached to it along with the vnet +device to the VM. This is not the VF that was passthrough'd to the VM (seen in +the vf.xml file). :: - # cat vf_xml + # cat vf.xml <interface type='hostdev' managed='yes'> <mac address='52:54:00:00:12:53'/> <source> <address type='pci' domain='0x0000' bus='0x42' slot='0x02' function='0x5'/> </source> - <address type='pci' domain='0x0000' bus='0x00' slot='0x0b' function='0x0'/> + <teaming type='transient' persistent='ua-backup0'/> </interface> - # Source Hypervisor + # Source Hypervisor migrate.sh #!/bin/bash - DOMAIN=fedora27-tap01 - PF=enp66s0f0 - VF_NUM=5 - TAP_IF=tap01 - VF_XML= + DOMAIN=vm-01 + PF=ens6np0 + VF=ens6v1 # VF attached to the bridge. + VF_NUM=1 + TAP_IF=vmtap01 # virtio-net interface in the VM. + VF_XML=vf.xml MAC=52:54:00:00:12:53 ZERO_MAC=00:00:00:00:00:00 + # Set the virtio-net interface up. virsh domif-setlink $DOMAIN $TAP_IF up - bridge fdb del $MAC dev $PF master - virsh detach-device $DOMAIN $VF_XML + + # Remove the VF that was passthrough'd to the VM. + virsh detach-device --live --config $DOMAIN $VF_XML + ip link set $PF vf $VF_NUM mac $ZERO_MAC - virsh migrate --live $DOMAIN qemu+ssh://$REMOTE_HOST/system + # Add FDB entry for traffic to continue going to the VM via + # the VF -> br0 -> vnet interface path. + bridge fdb add $MAC dev $VF + bridge fdb add $MAC dev $TAP_IF master + + # Migrate the VM + virsh migrate --live --persistent $DOMAIN qemu+ssh://$REMOTE_HOST/system + + # Clean up FDB entries after migration completes. + bridge fdb del $MAC dev $VF + bridge fdb del $MAC dev $TAP_IF master - # Destination Hypervisor +On the destination hypervisor, a shared bridge 'br0' is created before migration +starts, and a VF from the destination PF is added to the bridge. Similarly an +appropriate FDB entry is added. + +The following script is executed on the destination hypervisor once migration +completes, and it reattaches the VF to the VM and brings down the virtio-net +interface. + +:: + # reattach-vf.sh #!/bin/bash - virsh attach-device $DOMAIN $VF_XML - virsh domif-setlink $DOMAIN $TAP_IF down + bridge fdb del 52:54:00:00:12:53 dev ens36v0 + bridge fdb del 52:54:00:00:12:53 dev vmtap01 master + virsh attach-device --config --live vm01 vf.xml + virsh domif-setlink vm01 vmtap01 down diff --git a/Documentation/networking/netdev-FAQ.rst b/Documentation/networking/netdev-FAQ.rst deleted file mode 100644 index e26532f49760..000000000000 --- a/Documentation/networking/netdev-FAQ.rst +++ /dev/null @@ -1,263 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -.. _netdev-FAQ: - -========== -netdev FAQ -========== - -What is netdev? ---------------- -It is a mailing list for all network-related Linux stuff. This -includes anything found under net/ (i.e. core code like IPv6) and -drivers/net (i.e. hardware specific drivers) in the Linux source tree. - -Note that some subsystems (e.g. wireless drivers) which have a high -volume of traffic have their own specific mailing lists. - -The netdev list is managed (like many other Linux mailing lists) through -VGER (http://vger.kernel.org/) and archives can be found below: - -- http://marc.info/?l=linux-netdev -- http://www.spinics.net/lists/netdev/ - -Aside from subsystems like that mentioned above, all network-related -Linux development (i.e. RFC, review, comments, etc.) takes place on -netdev. - -How do the changes posted to netdev make their way into Linux? --------------------------------------------------------------- -There are always two trees (git repositories) in play. Both are -driven by David Miller, the main network maintainer. There is the -``net`` tree, and the ``net-next`` tree. As you can probably guess from -the names, the ``net`` tree is for fixes to existing code already in the -mainline tree from Linus, and ``net-next`` is where the new code goes -for the future release. You can find the trees here: - -- https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git -- https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git - -How often do changes from these trees make it to the mainline Linus tree? -------------------------------------------------------------------------- -To understand this, you need to know a bit of background information on -the cadence of Linux development. Each new release starts off with a -two week "merge window" where the main maintainers feed their new stuff -to Linus for merging into the mainline tree. After the two weeks, the -merge window is closed, and it is called/tagged ``-rc1``. No new -features get mainlined after this -- only fixes to the rc1 content are -expected. After roughly a week of collecting fixes to the rc1 content, -rc2 is released. This repeats on a roughly weekly basis until rc7 -(typically; sometimes rc6 if things are quiet, or rc8 if things are in a -state of churn), and a week after the last vX.Y-rcN was done, the -official vX.Y is released. - -Relating that to netdev: At the beginning of the 2-week merge window, -the ``net-next`` tree will be closed - no new changes/features. The -accumulated new content of the past ~10 weeks will be passed onto -mainline/Linus via a pull request for vX.Y -- at the same time, the -``net`` tree will start accumulating fixes for this pulled content -relating to vX.Y - -An announcement indicating when ``net-next`` has been closed is usually -sent to netdev, but knowing the above, you can predict that in advance. - -IMPORTANT: Do not send new ``net-next`` content to netdev during the -period during which ``net-next`` tree is closed. - -Shortly after the two weeks have passed (and vX.Y-rc1 is released), the -tree for ``net-next`` reopens to collect content for the next (vX.Y+1) -release. - -If you aren't subscribed to netdev and/or are simply unsure if -``net-next`` has re-opened yet, simply check the ``net-next`` git -repository link above for any new networking-related commits. You may -also check the following website for the current status: - - http://vger.kernel.org/~davem/net-next.html - -The ``net`` tree continues to collect fixes for the vX.Y content, and is -fed back to Linus at regular (~weekly) intervals. Meaning that the -focus for ``net`` is on stabilization and bug fixes. - -Finally, the vX.Y gets released, and the whole cycle starts over. - -So where are we now in this cycle? ----------------------------------- - -Load the mainline (Linus) page here: - - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git - -and note the top of the "tags" section. If it is rc1, it is early in -the dev cycle. If it was tagged rc7 a week ago, then a release is -probably imminent. - -How do I indicate which tree (net vs. net-next) my patch should be in? ----------------------------------------------------------------------- -Firstly, think whether you have a bug fix or new "next-like" content. -Then once decided, assuming that you use git, use the prefix flag, i.e. -:: - - git format-patch --subject-prefix='PATCH net-next' start..finish - -Use ``net`` instead of ``net-next`` (always lower case) in the above for -bug-fix ``net`` content. If you don't use git, then note the only magic -in the above is just the subject text of the outgoing e-mail, and you -can manually change it yourself with whatever MUA you are comfortable -with. - -I sent a patch and I'm wondering what happened to it - how can I tell whether it got merged? --------------------------------------------------------------------------------------------- -Start by looking at the main patchworks queue for netdev: - - https://patchwork.kernel.org/project/netdevbpf/list/ - -The "State" field will tell you exactly where things are at with your -patch. - -The above only says "Under Review". How can I find out more? -------------------------------------------------------------- -Generally speaking, the patches get triaged quickly (in less than -48h). So be patient. Asking the maintainer for status updates on your -patch is a good way to ensure your patch is ignored or pushed to the -bottom of the priority list. - -I submitted multiple versions of the patch series. Should I directly update patchwork for the previous versions of these patch series? --------------------------------------------------------------------------------------------------------------------------------------- -No, please don't interfere with the patch status on patchwork, leave -it to the maintainer to figure out what is the most recent and current -version that should be applied. If there is any doubt, the maintainer -will reply and ask what should be done. - -I made changes to only a few patches in a patch series should I resend only those changed? ------------------------------------------------------------------------------------------- -No, please resend the entire patch series and make sure you do number your -patches such that it is clear this is the latest and greatest set of patches -that can be applied. - -I submitted multiple versions of a patch series and it looks like a version other than the last one has been accepted, what should I do? ----------------------------------------------------------------------------------------------------------------------------------------- -There is no revert possible, once it is pushed out, it stays like that. -Please send incremental versions on top of what has been merged in order to fix -the patches the way they would look like if your latest patch series was to be -merged. - -Are there special rules regarding stable submissions on netdev? ---------------------------------------------------------------- -While it used to be the case that netdev submissions were not supposed -to carry explicit ``CC: stable@vger.kernel.org`` tags that is no longer -the case today. Please follow the standard stable rules in -:ref:`Documentation/process/stable-kernel-rules.rst <stable_kernel_rules>`, -and make sure you include appropriate Fixes tags! - -Is the comment style convention different for the networking content? ---------------------------------------------------------------------- -Yes, in a largely trivial way. Instead of this:: - - /* - * foobar blah blah blah - * another line of text - */ - -it is requested that you make it look like this:: - - /* foobar blah blah blah - * another line of text - */ - -I am working in existing code that has the former comment style and not the latter. Should I submit new code in the former style or the latter? ------------------------------------------------------------------------------------------------------------------------------------------------ -Make it the latter style, so that eventually all code in the domain -of netdev is of this format. - -I found a bug that might have possible security implications or similar. Should I mail the main netdev maintainer off-list? ---------------------------------------------------------------------------------------------------------------------------- -No. The current netdev maintainer has consistently requested that -people use the mailing lists and not reach out directly. If you aren't -OK with that, then perhaps consider mailing security@kernel.org or -reading about http://oss-security.openwall.org/wiki/mailing-lists/distros -as possible alternative mechanisms. - -What level of testing is expected before I submit my change? ------------------------------------------------------------- -If your changes are against ``net-next``, the expectation is that you -have tested by layering your changes on top of ``net-next``. Ideally -you will have done run-time testing specific to your change, but at a -minimum, your changes should survive an ``allyesconfig`` and an -``allmodconfig`` build without new warnings or failures. - -How do I post corresponding changes to user space components? -------------------------------------------------------------- -User space code exercising kernel features should be posted -alongside kernel patches. This gives reviewers a chance to see -how any new interface is used and how well it works. - -When user space tools reside in the kernel repo itself all changes -should generally come as one series. If series becomes too large -or the user space project is not reviewed on netdev include a link -to a public repo where user space patches can be seen. - -In case user space tooling lives in a separate repository but is -reviewed on netdev (e.g. patches to `iproute2` tools) kernel and -user space patches should form separate series (threads) when posted -to the mailing list, e.g.:: - - [PATCH net-next 0/3] net: some feature cover letter - └─ [PATCH net-next 1/3] net: some feature prep - └─ [PATCH net-next 2/3] net: some feature do it - └─ [PATCH net-next 3/3] selftest: net: some feature - - [PATCH iproute2-next] ip: add support for some feature - -Posting as one thread is discouraged because it confuses patchwork -(as of patchwork 2.2.2). - -Can I reproduce the checks from patchwork on my local machine? --------------------------------------------------------------- - -Checks in patchwork are mostly simple wrappers around existing kernel -scripts, the sources are available at: - -https://github.com/kuba-moo/nipa/tree/master/tests - -Running all the builds and checks locally is a pain, can I post my patches and have the patchwork bot validate them? --------------------------------------------------------------------------------------------------------------------- - -No, you must ensure that your patches are ready by testing them locally -before posting to the mailing list. The patchwork build bot instance -gets overloaded very easily and netdev@vger really doesn't need more -traffic if we can help it. - -netdevsim is great, can I extend it for my out-of-tree tests? -------------------------------------------------------------- - -No, `netdevsim` is a test vehicle solely for upstream tests. -(Please add your tests under tools/testing/selftests/.) - -We also give no guarantees that `netdevsim` won't change in the future -in a way which would break what would normally be considered uAPI. - -Is netdevsim considered a "user" of an API? -------------------------------------------- - -Linux kernel has a long standing rule that no API should be added unless -it has a real, in-tree user. Mock-ups and tests based on `netdevsim` are -strongly encouraged when adding new APIs, but `netdevsim` in itself -is **not** considered a use case/user. - -Any other tips to help ensure my net/net-next patch gets OK'd? --------------------------------------------------------------- -Attention to detail. Re-read your own work as if you were the -reviewer. You can start with using ``checkpatch.pl``, perhaps even with -the ``--strict`` flag. But do not be mindlessly robotic in doing so. -If your change is a bug fix, make sure your commit log indicates the -end-user visible symptom, the underlying reason as to why it happens, -and then if necessary, explain why the fix proposed is the best way to -get things done. Don't mangle whitespace, and as is common, don't -mis-indent function arguments that span multiple lines. If it is your -first patch, mail it to yourself so you can test apply it to an -unpatched tree to confirm infrastructure didn't mangle it. - -Finally, go back and read -:ref:`Documentation/process/submitting-patches.rst <submittingpatches>` -to be sure you are not repeating some common mistake documented there. diff --git a/Documentation/networking/nf_conntrack-sysctl.rst b/Documentation/networking/nf_conntrack-sysctl.rst index 311128abb768..1120d71f28d7 100644 --- a/Documentation/networking/nf_conntrack-sysctl.rst +++ b/Documentation/networking/nf_conntrack-sysctl.rst @@ -34,10 +34,13 @@ nf_conntrack_count - INTEGER (read-only) nf_conntrack_events - BOOLEAN - 0 - disabled - - not 0 - enabled (default) + - 1 - enabled + - 2 - auto (default) If this option is enabled, the connection tracking code will provide userspace with connection tracking events via ctnetlink. + The default allocates the extension if a userspace program is + listening to ctnetlink events. nf_conntrack_expect_max - INTEGER Maximum size of expectation table. Default value is @@ -67,15 +70,6 @@ nf_conntrack_generic_timeout - INTEGER (seconds) Default for generic timeout. This refers to layer 4 unknown/unsupported protocols. -nf_conntrack_helper - BOOLEAN - - 0 - disabled (default) - - not 0 - enabled - - Enable automatic conntrack helper assignment. - If disabled it is required to set up iptables rules to assign - helpers to connections. See the CT target description in the - iptables-extensions(8) man page for further information. - nf_conntrack_icmp_timeout - INTEGER (seconds) default 30 diff --git a/Documentation/networking/page_pool.rst b/Documentation/networking/page_pool.rst index a147591ce203..5db8c263b0c6 100644 --- a/Documentation/networking/page_pool.rst +++ b/Documentation/networking/page_pool.rst @@ -105,6 +105,47 @@ a page will cause no race conditions is enough. Please note the caller must not use data area after running page_pool_put_page_bulk(), as this function overwrites it. +* page_pool_get_stats(): Retrieve statistics about the page_pool. This API + is only available if the kernel has been configured with + ``CONFIG_PAGE_POOL_STATS=y``. A pointer to a caller allocated ``struct + page_pool_stats`` structure is passed to this API which is filled in. The + caller can then report those stats to the user (perhaps via ethtool, + debugfs, etc.). See below for an example usage of this API. + +Stats API and structures +------------------------ +If the kernel is configured with ``CONFIG_PAGE_POOL_STATS=y``, the API +``page_pool_get_stats()`` and structures described below are available. It +takes a pointer to a ``struct page_pool`` and a pointer to a ``struct +page_pool_stats`` allocated by the caller. + +The API will fill in the provided ``struct page_pool_stats`` with +statistics about the page_pool. + +The stats structure has the following fields:: + + struct page_pool_stats { + struct page_pool_alloc_stats alloc_stats; + struct page_pool_recycle_stats recycle_stats; + }; + + +The ``struct page_pool_alloc_stats`` has the following fields: + * ``fast``: successful fast path allocations + * ``slow``: slow path order-0 allocations + * ``slow_high_order``: slow path high order allocations + * ``empty``: ptr ring is empty, so a slow path allocation was forced. + * ``refill``: an allocation which triggered a refill of the cache + * ``waive``: pages obtained from the ptr ring that cannot be added to + the cache due to a NUMA mismatch. + +The ``struct page_pool_recycle_stats`` has the following fields: + * ``cached``: recycling placed page in the page pool cache + * ``cache_full``: page pool cache was full + * ``ring``: page placed into the ptr ring + * ``ring_full``: page released from page pool because the ptr ring was full + * ``released_refcnt``: page released (and not recycled) because refcnt > 1 + Coding examples =============== @@ -157,6 +198,21 @@ NAPI poller } } +Stats +----- + +.. code-block:: c + + #ifdef CONFIG_PAGE_POOL_STATS + /* retrieve stats */ + struct page_pool_stats stats = { 0 }; + if (page_pool_get_stats(page_pool, &stats)) { + /* perhaps the driver reports statistics with ethool */ + ethtool_print_allocation_stats(&stats.alloc_stats); + ethtool_print_recycle_stats(&stats.recycle_stats); + } + #endif + Driver unload ------------- diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst index 571ba08386e7..d11329a08984 100644 --- a/Documentation/networking/phy.rst +++ b/Documentation/networking/phy.rst @@ -104,7 +104,7 @@ Whenever possible, use the PHY side RGMII delay for these reasons: * PHY device drivers in PHYLIB being reusable by nature, being able to configure correctly a specified delay enables more designs with similar delay - requirements to be operate correctly + requirements to be operated correctly For cases where the PHY is not capable of providing this delay, but the Ethernet MAC driver is capable of doing so, the correct phy_interface_t value @@ -120,7 +120,7 @@ required delays, as defined per the RGMII standard, several options may be available: * Some SoCs may offer a pin pad/mux/controller capable of configuring a given - set of pins'strength, delays, and voltage; and it may be a suitable + set of pins' strength, delays, and voltage; and it may be a suitable option to insert the expected 2ns RGMII delay. * Modifying the PCB design to include a fixed delay (e.g: using a specifically @@ -237,6 +237,11 @@ negotiation results. Some of the interface modes are described below: +``PHY_INTERFACE_MODE_SMII`` + This is serial MII, clocked at 125MHz, supporting 100M and 10M speeds. + Some details can be found in + https://opencores.org/ocsvn/smii/smii/trunk/doc/SMII.pdf + ``PHY_INTERFACE_MODE_1000BASEX`` This defines the 1000BASE-X single-lane serdes link as defined by the 802.3 standard section 36. The link operates at a fixed bit rate of @@ -303,6 +308,21 @@ Some of the interface modes are described below: rate of 125Mpbs using a 4B/5B encoding scheme, resulting in an underlying data rate of 100Mpbs. +``PHY_INTERFACE_MODE_QUSGMII`` + This defines the Cisco the Quad USGMII mode, which is the Quad variant of + the USGMII (Universal SGMII) link. It's very similar to QSGMII, but uses + a Packet Control Header (PCH) instead of the 7 bytes preamble to carry not + only the port id, but also so-called "extensions". The only documented + extension so-far in the specification is the inclusion of timestamps, for + PTP-enabled PHYs. This mode isn't compatible with QSGMII, but offers the + same capabilities in terms of link speed and negociation. + +``PHY_INTERFACE_MODE_1000BASEKX`` + This is 1000BASE-X as defined by IEEE 802.3 Clause 36 with Clause 73 + autonegotiation. Generally, it will be used with a Clause 70 PMD. To + contrast with the 1000BASE-X phy mode used for Clause 38 and 39 PMDs, this + interface mode has different autonegotiation and only supports full duplex. + Pause frames / flow control =========================== diff --git a/Documentation/networking/rds.rst b/Documentation/networking/rds.rst index 44936c27ab3a..498395f5fbcb 100644 --- a/Documentation/networking/rds.rst +++ b/Documentation/networking/rds.rst @@ -1,6 +1,6 @@ .. SPDX-License-Identifier: GPL-2.0 -== +=== RDS === diff --git a/Documentation/networking/representors.rst b/Documentation/networking/representors.rst new file mode 100644 index 000000000000..ee1f5cd54496 --- /dev/null +++ b/Documentation/networking/representors.rst @@ -0,0 +1,259 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================= +Network Function Representors +============================= + +This document describes the semantics and usage of representor netdevices, as +used to control internal switching on SmartNICs. For the closely-related port +representors on physical (multi-port) switches, see +:ref:`Documentation/networking/switchdev.rst <switchdev>`. + +Motivation +---------- + +Since the mid-2010s, network cards have started offering more complex +virtualisation capabilities than the legacy SR-IOV approach (with its simple +MAC/VLAN-based switching model) can support. This led to a desire to offload +software-defined networks (such as OpenVSwitch) to these NICs to specify the +network connectivity of each function. The resulting designs are variously +called SmartNICs or DPUs. + +Network function representors bring the standard Linux networking stack to +virtual switches and IOV devices. Just as each physical port of a Linux- +controlled switch has a separate netdev, so does each virtual port of a virtual +switch. +When the system boots, and before any offload is configured, all packets from +the virtual functions appear in the networking stack of the PF via the +representors. The PF can thus always communicate freely with the virtual +functions. +The PF can configure standard Linux forwarding between representors, the uplink +or any other netdev (routing, bridging, TC classifiers). + +Thus, a representor is both a control plane object (representing the function in +administrative commands) and a data plane object (one end of a virtual pipe). +As a virtual link endpoint, the representor can be configured like any other +netdevice; in some cases (e.g. link state) the representee will follow the +representor's configuration, while in others there are separate APIs to +configure the representee. + +Definitions +----------- + +This document uses the term "switchdev function" to refer to the PCIe function +which has administrative control over the virtual switch on the device. +Typically, this will be a PF, but conceivably a NIC could be configured to grant +these administrative privileges instead to a VF or SF (subfunction). +Depending on NIC design, a multi-port NIC might have a single switchdev function +for the whole device or might have a separate virtual switch, and hence +switchdev function, for each physical network port. +If the NIC supports nested switching, there might be separate switchdev +functions for each nested switch, in which case each switchdev function should +only create representors for the ports on the (sub-)switch it directly +administers. + +A "representee" is the object that a representor represents. So for example in +the case of a VF representor, the representee is the corresponding VF. + +What does a representor do? +--------------------------- + +A representor has three main roles. + +1. It is used to configure the network connection the representee sees, e.g. + link up/down, MTU, etc. For instance, bringing the representor + administratively UP should cause the representee to see a link up / carrier + on event. +2. It provides the slow path for traffic which does not hit any offloaded + fast-path rules in the virtual switch. Packets transmitted on the + representor netdevice should be delivered to the representee; packets + transmitted by the representee which fail to match any switching rule should + be received on the representor netdevice. (That is, there is a virtual pipe + connecting the representor to the representee, similar in concept to a veth + pair.) + This allows software switch implementations (such as OpenVSwitch or a Linux + bridge) to forward packets between representees and the rest of the network. +3. It acts as a handle by which switching rules (such as TC filters) can refer + to the representee, allowing these rules to be offloaded. + +The combination of 2) and 3) means that the behaviour (apart from performance) +should be the same whether a TC filter is offloaded or not. E.g. a TC rule +on a VF representor applies in software to packets received on that representor +netdevice, while in hardware offload it would apply to packets transmitted by +the representee VF. Conversely, a mirred egress redirect to a VF representor +corresponds in hardware to delivery directly to the representee VF. + +What functions should have a representor? +----------------------------------------- + +Essentially, for each virtual port on the device's internal switch, there +should be a representor. +Some vendors have chosen to omit representors for the uplink and the physical +network port, which can simplify usage (the uplink netdev becomes in effect the +physical port's representor) but does not generalise to devices with multiple +ports or uplinks. + +Thus, the following should all have representors: + + - VFs belonging to the switchdev function. + - Other PFs on the local PCIe controller, and any VFs belonging to them. + - PFs and VFs on external PCIe controllers on the device (e.g. for any embedded + System-on-Chip within the SmartNIC). + - PFs and VFs with other personalities, including network block devices (such + as a vDPA virtio-blk PF backed by remote/distributed storage), if (and only + if) their network access is implemented through a virtual switch port. [#]_ + Note that such functions can require a representor despite the representee + not having a netdev. + - Subfunctions (SFs) belonging to any of the above PFs or VFs, if they have + their own port on the switch (as opposed to using their parent PF's port). + - Any accelerators or plugins on the device whose interface to the network is + through a virtual switch port, even if they do not have a corresponding PCIe + PF or VF. + +This allows the entire switching behaviour of the NIC to be controlled through +representor TC rules. + +It is a common misunderstanding to conflate virtual ports with PCIe virtual +functions or their netdevs. While in simple cases there will be a 1:1 +correspondence between VF netdevices and VF representors, more advanced device +configurations may not follow this. +A PCIe function which does not have network access through the internal switch +(not even indirectly through the hardware implementation of whatever services +the function provides) should *not* have a representor (even if it has a +netdev). +Such a function has no switch virtual port for the representor to configure or +to be the other end of the virtual pipe. +The representor represents the virtual port, not the PCIe function nor the 'end +user' netdevice. + +.. [#] The concept here is that a hardware IP stack in the device performs the + translation between block DMA requests and network packets, so that only + network packets pass through the virtual port onto the switch. The network + access that the IP stack "sees" would then be configurable through tc rules; + e.g. its traffic might all be wrapped in a specific VLAN or VxLAN. However, + any needed configuration of the block device *qua* block device, not being a + networking entity, would not be appropriate for the representor and would + thus use some other channel such as devlink. + Contrast this with the case of a virtio-blk implementation which forwards the + DMA requests unchanged to another PF whose driver then initiates and + terminates IP traffic in software; in that case the DMA traffic would *not* + run over the virtual switch and the virtio-blk PF should thus *not* have a + representor. + +How are representors created? +----------------------------- + +The driver instance attached to the switchdev function should, for each virtual +port on the switch, create a pure-software netdevice which has some form of +in-kernel reference to the switchdev function's own netdevice or driver private +data (``netdev_priv()``). +This may be by enumerating ports at probe time, reacting dynamically to the +creation and destruction of ports at run time, or a combination of the two. + +The operations of the representor netdevice will generally involve acting +through the switchdev function. For example, ``ndo_start_xmit()`` might send +the packet through a hardware TX queue attached to the switchdev function, with +either packet metadata or queue configuration marking it for delivery to the +representee. + +How are representors identified? +-------------------------------- + +The representor netdevice should *not* directly refer to a PCIe device (e.g. +through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of the +representee or of the switchdev function. +Instead, it should implement the ``ndo_get_devlink_port()`` netdevice op, which +the kernel uses to provide the ``phys_switch_id`` and ``phys_port_name`` sysfs +nodes. (Some legacy drivers implement ``ndo_get_port_parent_id()`` and +``ndo_get_phys_port_name()`` directly, but this is deprecated.) See +:ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>` for the +details of this API. + +It is expected that userland will use this information (e.g. through udev rules) +to construct an appropriately informative name or alias for the netdevice. For +instance if the switchdev function is ``eth4`` then a representor with a +``phys_port_name`` of ``p0pf1vf2`` might be renamed ``eth4pf1vf2rep``. + +There are as yet no established conventions for naming representors which do not +correspond to PCIe functions (e.g. accelerators and plugins). + +How do representors interact with TC rules? +------------------------------------------- + +Any TC rule on a representor applies (in software TC) to packets received by +that representor netdevice. Thus, if the delivery part of the rule corresponds +to another port on the virtual switch, the driver may choose to offload it to +hardware, applying it to packets transmitted by the representee. + +Similarly, since a TC mirred egress action targeting the representor would (in +software) send the packet through the representor (and thus indirectly deliver +it to the representee), hardware offload should interpret this as delivery to +the representee. + +As a simple example, if ``PORT_DEV`` is the physical port representor and +``REP_DEV`` is a VF representor, the following rules:: + + tc filter add dev $REP_DEV parent ffff: protocol ipv4 flower \ + action mirred egress redirect dev $PORT_DEV + tc filter add dev $PORT_DEV parent ffff: protocol ipv4 flower skip_sw \ + action mirred egress mirror dev $REP_DEV + +would mean that all IPv4 packets from the VF are sent out the physical port, and +all IPv4 packets received on the physical port are delivered to the VF in +addition to ``PORT_DEV``. (Note that without ``skip_sw`` on the second rule, +the VF would get two copies, as the packet reception on ``PORT_DEV`` would +trigger the TC rule again and mirror the packet to ``REP_DEV``.) + +On devices without separate port and uplink representors, ``PORT_DEV`` would +instead be the switchdev function's own uplink netdevice. + +Of course the rules can (if supported by the NIC) include packet-modifying +actions (e.g. VLAN push/pop), which should be performed by the virtual switch. + +Tunnel encapsulation and decapsulation are rather more complicated, as they +involve a third netdevice (a tunnel netdev operating in metadata mode, such as +a VxLAN device created with ``ip link add vxlan0 type vxlan external``) and +require an IP address to be bound to the underlay device (e.g. switchdev +function uplink netdev or port representor). TC rules such as:: + + tc filter add dev $REP_DEV parent ffff: flower \ + action tunnel_key set id $VNI src_ip $LOCAL_IP dst_ip $REMOTE_IP \ + dst_port 4789 \ + action mirred egress redirect dev vxlan0 + tc filter add dev vxlan0 parent ffff: flower enc_src_ip $REMOTE_IP \ + enc_dst_ip $LOCAL_IP enc_key_id $VNI enc_dst_port 4789 \ + action tunnel_key unset action mirred egress redirect dev $REP_DEV + +where ``LOCAL_IP`` is an IP address bound to ``PORT_DEV``, and ``REMOTE_IP`` is +another IP address on the same subnet, mean that packets sent by the VF should +be VxLAN encapsulated and sent out the physical port (the driver has to deduce +this by a route lookup of ``LOCAL_IP`` leading to ``PORT_DEV``, and also +perform an ARP/neighbour table lookup to find the MAC addresses to use in the +outer Ethernet frame), while UDP packets received on the physical port with UDP +port 4789 should be parsed as VxLAN and, if their VSID matches ``$VNI``, +decapsulated and forwarded to the VF. + +If this all seems complicated, just remember the 'golden rule' of TC offload: +the hardware should ensure the same final results as if the packets were +processed through the slow path, traversed software TC (except ignoring any +``skip_hw`` rules and applying any ``skip_sw`` rules) and were transmitted or +received through the representor netdevices. + +Configuring the representee's MAC +--------------------------------- + +The representee's link state is controlled through the representor. Setting the +representor administratively UP or DOWN should cause carrier ON or OFF at the +representee. + +Setting an MTU on the representor should cause that same MTU to be reported to +the representee. +(On hardware that allows configuring separate and distinct MTU and MRU values, +the representor MTU should correspond to the representee's MRU and vice-versa.) + +Currently there is no way to use the representor to set the station permanent +MAC address of the representee; other methods available to do this include: + + - legacy SR-IOV (``ip link set DEVICE vf NUM mac LLADDR``) + - devlink port function (see **devlink-port(8)** and + :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`) diff --git a/Documentation/networking/rxrpc.rst b/Documentation/networking/rxrpc.rst index 39c2249c7aa7..39494a6ea739 100644 --- a/Documentation/networking/rxrpc.rst +++ b/Documentation/networking/rxrpc.rst @@ -1055,17 +1055,6 @@ The kernel interface functions are as follows: first function to change. Note that this must be called in TASK_RUNNING state. - (#) Get reply timestamp:: - - bool rxrpc_kernel_get_reply_time(struct socket *sock, - struct rxrpc_call *call, - ktime_t *_ts) - - This allows the timestamp on the first DATA packet of the reply of a - client call to be queried, provided that it is still in the Rx ring. If - successful, the timestamp will be stored into ``*_ts`` and true will be - returned; false will be returned otherwise. - (#) Get remote client epoch:: u32 rxrpc_kernel_get_epoch(struct socket *sock, diff --git a/Documentation/networking/sfp-phylink.rst b/Documentation/networking/sfp-phylink.rst index 328f8d39eeb3..55b65f607a64 100644 --- a/Documentation/networking/sfp-phylink.rst +++ b/Documentation/networking/sfp-phylink.rst @@ -203,7 +203,7 @@ this documentation. The :c:func:`validate` method should mask the supplied supported mask, and ``state->advertising`` with the supported ethtool link modes. These are the new ethtool link modes, so bitmask operations must be - used. For an example, see drivers/net/ethernet/marvell/mvneta.c. + used. For an example, see ``drivers/net/ethernet/marvell/mvneta.c``. The :c:func:`mac_link_state` method is used to read the link state from the MAC, and report back the settings that the MAC is currently @@ -224,7 +224,7 @@ this documentation. function should modify the state and only take the link down when absolutely necessary to change the MAC configuration. An example of how to do this can be found in :c:func:`mvneta_mac_config` in - drivers/net/ethernet/marvell/mvneta.c. + ``drivers/net/ethernet/marvell/mvneta.c``. For further information on these methods, please see the inline documentation in :c:type:`struct phylink_mac_ops <phylink_mac_ops>`. @@ -281,4 +281,4 @@ as necessary. For information describing the SFP cage in DT, please see the binding documentation in the kernel source tree -``Documentation/devicetree/bindings/net/sff,sfp.txt`` +``Documentation/devicetree/bindings/net/sff,sfp.yaml``. diff --git a/Documentation/networking/skbuff.rst b/Documentation/networking/skbuff.rst new file mode 100644 index 000000000000..5b74275a73a3 --- /dev/null +++ b/Documentation/networking/skbuff.rst @@ -0,0 +1,37 @@ +.. SPDX-License-Identifier: GPL-2.0 + +struct sk_buff +============== + +:c:type:`sk_buff` is the main networking structure representing +a packet. + +Basic sk_buff geometry +---------------------- + +.. kernel-doc:: include/linux/skbuff.h + :doc: Basic sk_buff geometry + +Shared skbs and skb clones +-------------------------- + +:c:member:`sk_buff.users` is a simple refcount allowing multiple entities +to keep a struct sk_buff alive. skbs with a ``sk_buff.users != 1`` are referred +to as shared skbs (see skb_shared()). + +skb_clone() allows for fast duplication of skbs. None of the data buffers +get copied, but caller gets a new metadata struct (struct sk_buff). +&skb_shared_info.refcount indicates the number of skbs pointing at the same +packet data (i.e. clones). + +dataref and headerless skbs +--------------------------- + +.. kernel-doc:: include/linux/skbuff.h + :doc: dataref and headerless skbs + +Checksum information +-------------------- + +.. kernel-doc:: include/linux/skbuff.h + :doc: skb checksums diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst new file mode 100644 index 000000000000..6d8acdbe9be1 --- /dev/null +++ b/Documentation/networking/smc-sysctl.rst @@ -0,0 +1,61 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========== +SMC Sysctl +========== + +/proc/sys/net/smc/* Variables +============================= + +autocorking_size - INTEGER + Setting SMC auto corking size: + SMC auto corking is like TCP auto corking from the application's + perspective of view. When applications do consecutive small + write()/sendmsg() system calls, we try to coalesce these small writes + as much as possible, to lower total amount of CDC and RDMA Write been + sent. + autocorking_size limits the maximum corked bytes that can be sent to + the under device in 1 single sending. If set to 0, the SMC auto corking + is disabled. + Applications can still use TCP_CORK for optimal behavior when they + know how/when to uncork their sockets. + + Default: 64K + +smcr_buf_type - INTEGER + Controls which type of sndbufs and RMBs to use in later newly created + SMC-R link group. Only for SMC-R. + + Default: 0 (physically contiguous sndbufs and RMBs) + + Possible values: + + - 0 - Use physically contiguous buffers + - 1 - Use virtually contiguous buffers + - 2 - Mixed use of the two types. Try physically contiguous buffers first. + If not available, use virtually contiguous buffers then. + +smcr_testlink_time - INTEGER + How frequently SMC-R link sends out TEST_LINK LLC messages to confirm + viability, after the last activity of connections on it. Value 0 means + disabling TEST_LINK. + + Default: 30 seconds. + +wmem - INTEGER + Initial size of send buffer used by SMC sockets. + The default value inherits from net.ipv4.tcp_wmem[1]. + + The minimum value is 16KiB and there is no hard limit for max value, but + only allowed 512KiB for SMC-R and 1MiB for SMC-D. + + Default: 16K + +rmem - INTEGER + Initial size of receive buffer (RMB) used by SMC sockets. + The default value inherits from net.ipv4.tcp_rmem[1]. + + The minimum value is 16KiB and there is no hard limit for max value, but + only allowed 512KiB for SMC-R and 1MiB for SMC-D. + + Default: 128K diff --git a/Documentation/networking/switchdev.rst b/Documentation/networking/switchdev.rst index f1f4e6a85a29..758f1dae3fce 100644 --- a/Documentation/networking/switchdev.rst +++ b/Documentation/networking/switchdev.rst @@ -1,5 +1,6 @@ .. SPDX-License-Identifier: GPL-2.0 .. include:: <isonum.txt> +.. _switchdev: =============================================== Ethernet switch device driver model (switchdev) @@ -159,7 +160,7 @@ tools such as iproute2. The switchdev driver can know a particular port's position in the topology by monitoring NETDEV_CHANGEUPPER notifications. For example, a port moved into a -bond will see it's upper master change. If that bond is moved into a bridge, +bond will see its upper master change. If that bond is moved into a bridge, the bond's upper master will change. And so on. The driver will track such movements to know what position a port is in in the overall topology by registering for netdevice events and acting on NETDEV_CHANGEUPPER. diff --git a/Documentation/networking/timestamping.rst b/Documentation/networking/timestamping.rst index 80b13353254a..be4eb1242057 100644 --- a/Documentation/networking/timestamping.rst +++ b/Documentation/networking/timestamping.rst @@ -582,8 +582,8 @@ Time stamps for outgoing packets are to be generated as follows: and hardware timestamping is not possible (SKBTX_IN_PROGRESS not set). - As soon as the driver has sent the packet and/or obtained a hardware time stamp for it, it passes the time stamp back by - calling skb_hwtstamp_tx() with the original skb, the raw - hardware time stamp. skb_hwtstamp_tx() clones the original skb and + calling skb_tstamp_tx() with the original skb, the raw + hardware time stamp. skb_tstamp_tx() clones the original skb and adds the timestamps, therefore the original skb has to be freed now. If obtaining the hardware time stamp somehow fails, then the driver should not fall back to software time stamping. The rationale is that @@ -668,7 +668,7 @@ timestamping: (through another RX timestamping FIFO). Deferral on RX is typically necessary when retrieving the timestamp needs a sleepable context. In that case, it is the responsibility of the DSA driver to call - ``netif_rx_ni()`` on the freshly timestamped skb. + ``netif_rx()`` on the freshly timestamped skb. 3.2.2 Ethernet PHYs ^^^^^^^^^^^^^^^^^^^ diff --git a/Documentation/networking/tls.rst b/Documentation/networking/tls.rst index 8cb2cd4e2a80..658ed3a71e1b 100644 --- a/Documentation/networking/tls.rst +++ b/Documentation/networking/tls.rst @@ -214,6 +214,44 @@ of calling send directly after a handshake using gnutls. Since it doesn't implement a full record layer, control messages are not supported. +Optional optimizations +---------------------- + +There are certain condition-specific optimizations the TLS ULP can make, +if requested. Those optimizations are either not universally beneficial +or may impact correctness, hence they require an opt-in. +All options are set per-socket using setsockopt(), and their +state can be checked using getsockopt() and via socket diag (``ss``). + +TLS_TX_ZEROCOPY_RO +~~~~~~~~~~~~~~~~~~ + +For device offload only. Allow sendfile() data to be transmitted directly +to the NIC without making an in-kernel copy. This allows true zero-copy +behavior when device offload is enabled. + +The application must make sure that the data is not modified between being +submitted and transmission completing. In other words this is mostly +applicable if the data sent on a socket via sendfile() is read-only. + +Modifying the data may result in different versions of the data being used +for the original TCP transmission and TCP retransmissions. To the receiver +this will look like TLS records had been tampered with and will result +in record authentication failures. + +TLS_RX_EXPECT_NO_PAD +~~~~~~~~~~~~~~~~~~~~ + +TLS 1.3 only. Expect the sender to not pad records. This allows the data +to be decrypted directly into user space buffers with TLS 1.3. + +This optimization is safe to enable only if the remote end is trusted, +otherwise it is an attack vector to doubling the TLS processing cost. + +If the record decrypted turns out to had been padded or is not a data +record it will be decrypted again into a kernel buffer without zero copy. +Such events are counted in the ``TlsDecryptRetry`` statistic. + Statistics ========== @@ -239,3 +277,12 @@ TLS implementation exposes the following per-namespace statistics - ``TlsDeviceRxResync`` - number of RX resyncs sent to NICs handling cryptography + +- ``TlsDecryptRetry`` - + number of RX records which had to be re-decrypted due to + ``TLS_RX_EXPECT_NO_PAD`` mis-prediction. Note that this counter will + also increment for non-data records. + +- ``TlsRxNoPadViolation`` - + number of data RX records which had to be re-decrypted due to + ``TLS_RX_EXPECT_NO_PAD`` mis-prediction. |