From 32c0f0bed5bb08625083ed7f5b661c842d63ebd1 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 30 Apr 2020 18:04:27 +0200 Subject: docs: networking: convert switchdev.txt to ReST - add SPDX header; - use copyright symbol; - adjust title markup; - mark code blocks and literals as such; - mark tables as such; - adjust identation, whitespaces and blank lines where needed; - add to networking/index.rst. Signed-off-by: Mauro Carvalho Chehab Signed-off-by: David S. Miller --- Documentation/networking/index.rst | 1 + Documentation/networking/switchdev.rst | 387 +++++++++++++++++++++++++++++++++ Documentation/networking/switchdev.txt | 373 ------------------------------- 3 files changed, 388 insertions(+), 373 deletions(-) create mode 100644 Documentation/networking/switchdev.rst delete mode 100644 Documentation/networking/switchdev.txt (limited to 'Documentation') diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index e5a705024c6a..5e495804f96f 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -105,6 +105,7 @@ Contents: seg6-sysctl skfp strparser + switchdev .. only:: subproject and html diff --git a/Documentation/networking/switchdev.rst b/Documentation/networking/switchdev.rst new file mode 100644 index 000000000000..ddc3f35775dc --- /dev/null +++ b/Documentation/networking/switchdev.rst @@ -0,0 +1,387 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. include:: + +=============================================== +Ethernet switch device driver model (switchdev) +=============================================== + +Copyright |copy| 2014 Jiri Pirko + +Copyright |copy| 2014-2015 Scott Feldman + + +The Ethernet switch device driver model (switchdev) is an in-kernel driver +model for switch devices which offload the forwarding (data) plane from the +kernel. + +Figure 1 is a block diagram showing the components of the switchdev model for +an example setup using a data-center-class switch ASIC chip. Other setups +with SR-IOV or soft switches, such as OVS, are possible. + +:: + + + User-space tools + + user space | + +-------------------------------------------------------------------+ + kernel | Netlink + | + +--------------+-------------------------------+ + | Network stack | + | (Linux) | + | | + +----------------------------------------------+ + + sw1p2 sw1p4 sw1p6 + sw1p1 + sw1p3 + sw1p5 + eth1 + + | + | + | + + | | | | | | | + +--+----+----+----+----+----+---+ +-----+-----+ + | Switch driver | | mgmt | + | (this document) | | driver | + | | | | + +--------------+----------------+ +-----------+ + | + kernel | HW bus (eg PCI) + +-------------------------------------------------------------------+ + hardware | + +--------------+----------------+ + | Switch device (sw1) | + | +----+ +--------+ + | | v offloaded data path | mgmt port + | | | | + +--|----|----+----+----+----+---+ + | | | | | | + + + + + + + + p1 p2 p3 p4 p5 p6 + + front-panel ports + + + Fig 1. + + +Include Files +------------- + +:: + + #include + #include + + +Configuration +------------- + +Use "depends NET_SWITCHDEV" in driver's Kconfig to ensure switchdev model +support is built for driver. + + +Switch Ports +------------ + +On switchdev driver initialization, the driver will allocate and register a +struct net_device (using register_netdev()) for each enumerated physical switch +port, called the port netdev. A port netdev is the software representation of +the physical port and provides a conduit for control traffic to/from the +controller (the kernel) and the network, as well as an anchor point for higher +level constructs such as bridges, bonds, VLANs, tunnels, and L3 routers. Using +standard netdev tools (iproute2, ethtool, etc), the port netdev can also +provide to the user access to the physical properties of the switch port such +as PHY link state and I/O statistics. + +There is (currently) no higher-level kernel object for the switch beyond the +port netdevs. All of the switchdev driver ops are netdev ops or switchdev ops. + +A switch management port is outside the scope of the switchdev driver model. +Typically, the management port is not participating in offloaded data plane and +is loaded with a different driver, such as a NIC driver, on the management port +device. + +Switch ID +^^^^^^^^^ + +The switchdev driver must implement the net_device operation +ndo_get_port_parent_id for each port netdev, returning the same physical ID for +each port of a switch. The ID must be unique between switches on the same +system. The ID does not need to be unique between switches on different +systems. + +The switch ID is used to locate ports on a switch and to know if aggregated +ports belong to the same switch. + +Port Netdev Naming +^^^^^^^^^^^^^^^^^^ + +Udev rules should be used for port netdev naming, using some unique attribute +of the port as a key, for example the port MAC address or the port PHYS name. +Hard-coding of kernel netdev names within the driver is discouraged; let the +kernel pick the default netdev name, and let udev set the final name based on a +port attribute. + +Using port PHYS name (ndo_get_phys_port_name) for the key is particularly +useful for dynamically-named ports where the device names its ports based on +external configuration. For example, if a physical 40G port is split logically +into 4 10G ports, resulting in 4 port netdevs, the device can give a unique +name for each port using port PHYS name. The udev rule would be:: + + SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="", \ + ATTR{phys_port_name}!="", NAME="swX$attr{phys_port_name}" + +Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y +is the port name or ID, and Z is the sub-port name or ID. For example, sw1p1s0 +would be sub-port 0 on port 1 on switch 1. + +Port Features +^^^^^^^^^^^^^ + +NETIF_F_NETNS_LOCAL + +If the switchdev driver (and device) only supports offloading of the default +network namespace (netns), the driver should set this feature flag to prevent +the port netdev from being moved out of the default netns. A netns-aware +driver/device would not set this flag and be responsible for partitioning +hardware to preserve netns containment. This means hardware cannot forward +traffic from a port in one namespace to another port in another namespace. + +Port Topology +^^^^^^^^^^^^^ + +The port netdevs representing the physical switch ports can be organized into +higher-level switching constructs. The default construct is a standalone +router port, used to offload L3 forwarding. Two or more ports can be bonded +together to form a LAG. Two or more ports (or LAGs) can be bridged to bridge +L2 networks. VLANs can be applied to sub-divide L2 networks. L2-over-L3 +tunnels can be built on ports. These constructs are built using standard Linux +tools such as the bridge driver, the bonding/team drivers, and netlink-based +tools such as iproute2. + +The switchdev driver can know a particular port's position in the topology by +monitoring NETDEV_CHANGEUPPER notifications. For example, a port moved into a +bond will see it's upper master change. If that bond is moved into a bridge, +the bond's upper master will change. And so on. The driver will track such +movements to know what position a port is in in the overall topology by +registering for netdevice events and acting on NETDEV_CHANGEUPPER. + +L2 Forwarding Offload +--------------------- + +The idea is to offload the L2 data forwarding (switching) path from the kernel +to the switchdev device by mirroring bridge FDB entries down to the device. An +FDB entry is the {port, MAC, VLAN} tuple forwarding destination. + +To offloading L2 bridging, the switchdev driver/device should support: + + - Static FDB entries installed on a bridge port + - Notification of learned/forgotten src mac/vlans from device + - STP state changes on the port + - VLAN flooding of multicast/broadcast and unknown unicast packets + +Static FDB Entries +^^^^^^^^^^^^^^^^^^ + +The switchdev driver should implement ndo_fdb_add, ndo_fdb_del and ndo_fdb_dump +to support static FDB entries installed to the device. Static bridge FDB +entries are installed, for example, using iproute2 bridge cmd:: + + bridge fdb add ADDR dev DEV [vlan VID] [self] + +The driver should use the helper switchdev_port_fdb_xxx ops for ndo_fdb_xxx +ops, and handle add/delete/dump of SWITCHDEV_OBJ_ID_PORT_FDB object using +switchdev_port_obj_xxx ops. + +XXX: what should be done if offloading this rule to hardware fails (for +example, due to full capacity in hardware tables) ? + +Note: by default, the bridge does not filter on VLAN and only bridges untagged +traffic. To enable VLAN support, turn on VLAN filtering:: + + echo 1 >/sys/class/net//bridge/vlan_filtering + +Notification of Learned/Forgotten Source MAC/VLANs +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The switch device will learn/forget source MAC address/VLAN on ingress packets +and notify the switch driver of the mac/vlan/port tuples. The switch driver, +in turn, will notify the bridge driver using the switchdev notifier call:: + + err = call_switchdev_notifiers(val, dev, info, extack); + +Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when +forgetting, and info points to a struct switchdev_notifier_fdb_info. On +SWITCHDEV_FDB_ADD, the bridge driver will install the FDB entry into the +bridge's FDB and mark the entry as NTF_EXT_LEARNED. The iproute2 bridge +command will label these entries "offload":: + + $ bridge fdb + 52:54:00:12:35:01 dev sw1p1 master br0 permanent + 00:02:00:00:02:00 dev sw1p1 master br0 offload + 00:02:00:00:02:00 dev sw1p1 self + 52:54:00:12:35:02 dev sw1p2 master br0 permanent + 00:02:00:00:03:00 dev sw1p2 master br0 offload + 00:02:00:00:03:00 dev sw1p2 self + 33:33:00:00:00:01 dev eth0 self permanent + 01:00:5e:00:00:01 dev eth0 self permanent + 33:33:ff:00:00:00 dev eth0 self permanent + 01:80:c2:00:00:0e dev eth0 self permanent + 33:33:00:00:00:01 dev br0 self permanent + 01:00:5e:00:00:01 dev br0 self permanent + 33:33:ff:12:35:01 dev br0 self permanent + +Learning on the port should be disabled on the bridge using the bridge command:: + + bridge link set dev DEV learning off + +Learning on the device port should be enabled, as well as learning_sync:: + + bridge link set dev DEV learning on self + bridge link set dev DEV learning_sync on self + +Learning_sync attribute enables syncing of the learned/forgotten FDB entry to +the bridge's FDB. It's possible, but not optimal, to enable learning on the +device port and on the bridge port, and disable learning_sync. + +To support learning, the driver implements switchdev op +switchdev_port_attr_set for SWITCHDEV_ATTR_PORT_ID_{PRE}_BRIDGE_FLAGS. + +FDB Ageing +^^^^^^^^^^ + +The bridge will skip ageing FDB entries marked with NTF_EXT_LEARNED and it is +the responsibility of the port driver/device to age out these entries. If the +port device supports ageing, when the FDB entry expires, it will notify the +driver which in turn will notify the bridge with SWITCHDEV_FDB_DEL. If the +device does not support ageing, the driver can simulate ageing using a +garbage collection timer to monitor FDB entries. Expired entries will be +notified to the bridge using SWITCHDEV_FDB_DEL. See rocker driver for +example of driver running ageing timer. + +To keep an NTF_EXT_LEARNED entry "alive", the driver should refresh the FDB +entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...). The +notification will reset the FDB entry's last-used time to now. The driver +should rate limit refresh notifications, for example, no more than once a +second. (The last-used time is visible using the bridge -s fdb option). + +STP State Change on Port +^^^^^^^^^^^^^^^^^^^^^^^^ + +Internally or with a third-party STP protocol implementation (e.g. mstpd), the +bridge driver maintains the STP state for ports, and will notify the switch +driver of STP state change on a port using the switchdev op +switchdev_attr_port_set for SWITCHDEV_ATTR_PORT_ID_STP_UPDATE. + +State is one of BR_STATE_*. The switch driver can use STP state updates to +update ingress packet filter list for the port. For example, if port is +DISABLED, no packets should pass, but if port moves to BLOCKED, then STP BPDUs +and other IEEE 01:80:c2:xx:xx:xx link-local multicast packets can pass. + +Note that STP BDPUs are untagged and STP state applies to all VLANs on the port +so packet filters should be applied consistently across untagged and tagged +VLANs on the port. + +Flooding L2 domain +^^^^^^^^^^^^^^^^^^ + +For a given L2 VLAN domain, the switch device should flood multicast/broadcast +and unknown unicast packets to all ports in domain, if allowed by port's +current STP state. The switch driver, knowing which ports are within which +vlan L2 domain, can program the switch device for flooding. The packet may +be sent to the port netdev for processing by the bridge driver. The +bridge should not reflood the packet to the same ports the device flooded, +otherwise there will be duplicate packets on the wire. + +To avoid duplicate packets, the switch driver should mark a packet as already +forwarded by setting the skb->offload_fwd_mark bit. The bridge driver will mark +the skb using the ingress bridge port's mark and prevent it from being forwarded +through any bridge port with the same mark. + +It is possible for the switch device to not handle flooding and push the +packets up to the bridge driver for flooding. This is not ideal as the number +of ports scale in the L2 domain as the device is much more efficient at +flooding packets that software. + +If supported by the device, flood control can be offloaded to it, preventing +certain netdevs from flooding unicast traffic for which there is no FDB entry. + +IGMP Snooping +^^^^^^^^^^^^^ + +In order to support IGMP snooping, the port netdevs should trap to the bridge +driver all IGMP join and leave messages. +The bridge multicast module will notify port netdevs on every multicast group +changed whether it is static configured or dynamically joined/leave. +The hardware implementation should be forwarding all registered multicast +traffic groups only to the configured ports. + +L3 Routing Offload +------------------ + +Offloading L3 routing requires that device be programmed with FIB entries from +the kernel, with the device doing the FIB lookup and forwarding. The device +does a longest prefix match (LPM) on FIB entries matching route prefix and +forwards the packet to the matching FIB entry's nexthop(s) egress ports. + +To program the device, the driver has to register a FIB notifier handler +using register_fib_notifier. The following events are available: + +=================== =================================================== +FIB_EVENT_ENTRY_ADD used for both adding a new FIB entry to the device, + or modifying an existing entry on the device. +FIB_EVENT_ENTRY_DEL used for removing a FIB entry +FIB_EVENT_RULE_ADD, +FIB_EVENT_RULE_DEL used to propagate FIB rule changes +=================== =================================================== + +FIB_EVENT_ENTRY_ADD and FIB_EVENT_ENTRY_DEL events pass:: + + struct fib_entry_notifier_info { + struct fib_notifier_info info; /* must be first */ + u32 dst; + int dst_len; + struct fib_info *fi; + u8 tos; + u8 type; + u32 tb_id; + u32 nlflags; + }; + +to add/modify/delete IPv4 dst/dest_len prefix on table tb_id. The ``*fi`` +structure holds details on the route and route's nexthops. ``*dev`` is one +of the port netdevs mentioned in the route's next hop list. + +Routes offloaded to the device are labeled with "offload" in the ip route +listing:: + + $ ip route show + default via 192.168.0.2 dev eth0 + 11.0.0.0/30 dev sw1p1 proto kernel scope link src 11.0.0.2 offload + 11.0.0.4/30 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload + 11.0.0.8/30 dev sw1p2 proto kernel scope link src 11.0.0.10 offload + 11.0.0.12/30 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload + 12.0.0.2 proto zebra metric 30 offload + nexthop via 11.0.0.1 dev sw1p1 weight 1 + nexthop via 11.0.0.9 dev sw1p2 weight 1 + 12.0.0.3 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload + 12.0.0.4 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload + 192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.15 + +The "offload" flag is set in case at least one device offloads the FIB entry. + +XXX: add/mod/del IPv6 FIB API + +Nexthop Resolution +^^^^^^^^^^^^^^^^^^ + +The FIB entry's nexthop list contains the nexthop tuple (gateway, dev), but for +the switch device to forward the packet with the correct dst mac address, the +nexthop gateways must be resolved to the neighbor's mac address. Neighbor mac +address discovery comes via the ARP (or ND) process and is available via the +arp_tbl neighbor table. To resolve the routes nexthop gateways, the driver +should trigger the kernel's neighbor resolution process. See the rocker +driver's rocker_port_ipv4_resolve() for an example. + +The driver can monitor for updates to arp_tbl using the netevent notifier +NETEVENT_NEIGH_UPDATE. The device can be programmed with resolved nexthops +for the routes as arp_tbl updates. The driver implements ndo_neigh_destroy +to know when arp_tbl neighbor entries are purged from the port. diff --git a/Documentation/networking/switchdev.txt b/Documentation/networking/switchdev.txt deleted file mode 100644 index 86174ce8cd13..000000000000 --- a/Documentation/networking/switchdev.txt +++ /dev/null @@ -1,373 +0,0 @@ -Ethernet switch device driver model (switchdev) -=============================================== -Copyright (c) 2014 Jiri Pirko -Copyright (c) 2014-2015 Scott Feldman - - -The Ethernet switch device driver model (switchdev) is an in-kernel driver -model for switch devices which offload the forwarding (data) plane from the -kernel. - -Figure 1 is a block diagram showing the components of the switchdev model for -an example setup using a data-center-class switch ASIC chip. Other setups -with SR-IOV or soft switches, such as OVS, are possible. - - - User-space tools - - user space | - +-------------------------------------------------------------------+ - kernel | Netlink - | - +--------------+-------------------------------+ - | Network stack | - | (Linux) | - | | - +----------------------------------------------+ - - sw1p2 sw1p4 sw1p6 - sw1p1 + sw1p3 + sw1p5 + eth1 - + | + | + | + - | | | | | | | - +--+----+----+----+----+----+---+ +-----+-----+ - | Switch driver | | mgmt | - | (this document) | | driver | - | | | | - +--------------+----------------+ +-----------+ - | - kernel | HW bus (eg PCI) - +-------------------------------------------------------------------+ - hardware | - +--------------+----------------+ - | Switch device (sw1) | - | +----+ +--------+ - | | v offloaded data path | mgmt port - | | | | - +--|----|----+----+----+----+---+ - | | | | | | - + + + + + + - p1 p2 p3 p4 p5 p6 - - front-panel ports - - - Fig 1. - - -Include Files -------------- - -#include -#include - - -Configuration -------------- - -Use "depends NET_SWITCHDEV" in driver's Kconfig to ensure switchdev model -support is built for driver. - - -Switch Ports ------------- - -On switchdev driver initialization, the driver will allocate and register a -struct net_device (using register_netdev()) for each enumerated physical switch -port, called the port netdev. A port netdev is the software representation of -the physical port and provides a conduit for control traffic to/from the -controller (the kernel) and the network, as well as an anchor point for higher -level constructs such as bridges, bonds, VLANs, tunnels, and L3 routers. Using -standard netdev tools (iproute2, ethtool, etc), the port netdev can also -provide to the user access to the physical properties of the switch port such -as PHY link state and I/O statistics. - -There is (currently) no higher-level kernel object for the switch beyond the -port netdevs. All of the switchdev driver ops are netdev ops or switchdev ops. - -A switch management port is outside the scope of the switchdev driver model. -Typically, the management port is not participating in offloaded data plane and -is loaded with a different driver, such as a NIC driver, on the management port -device. - -Switch ID -^^^^^^^^^ - -The switchdev driver must implement the net_device operation -ndo_get_port_parent_id for each port netdev, returning the same physical ID for -each port of a switch. The ID must be unique between switches on the same -system. The ID does not need to be unique between switches on different -systems. - -The switch ID is used to locate ports on a switch and to know if aggregated -ports belong to the same switch. - -Port Netdev Naming -^^^^^^^^^^^^^^^^^^ - -Udev rules should be used for port netdev naming, using some unique attribute -of the port as a key, for example the port MAC address or the port PHYS name. -Hard-coding of kernel netdev names within the driver is discouraged; let the -kernel pick the default netdev name, and let udev set the final name based on a -port attribute. - -Using port PHYS name (ndo_get_phys_port_name) for the key is particularly -useful for dynamically-named ports where the device names its ports based on -external configuration. For example, if a physical 40G port is split logically -into 4 10G ports, resulting in 4 port netdevs, the device can give a unique -name for each port using port PHYS name. The udev rule would be: - -SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="", \ - ATTR{phys_port_name}!="", NAME="swX$attr{phys_port_name}" - -Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y -is the port name or ID, and Z is the sub-port name or ID. For example, sw1p1s0 -would be sub-port 0 on port 1 on switch 1. - -Port Features -^^^^^^^^^^^^^ - -NETIF_F_NETNS_LOCAL - -If the switchdev driver (and device) only supports offloading of the default -network namespace (netns), the driver should set this feature flag to prevent -the port netdev from being moved out of the default netns. A netns-aware -driver/device would not set this flag and be responsible for partitioning -hardware to preserve netns containment. This means hardware cannot forward -traffic from a port in one namespace to another port in another namespace. - -Port Topology -^^^^^^^^^^^^^ - -The port netdevs representing the physical switch ports can be organized into -higher-level switching constructs. The default construct is a standalone -router port, used to offload L3 forwarding. Two or more ports can be bonded -together to form a LAG. Two or more ports (or LAGs) can be bridged to bridge -L2 networks. VLANs can be applied to sub-divide L2 networks. L2-over-L3 -tunnels can be built on ports. These constructs are built using standard Linux -tools such as the bridge driver, the bonding/team drivers, and netlink-based -tools such as iproute2. - -The switchdev driver can know a particular port's position in the topology by -monitoring NETDEV_CHANGEUPPER notifications. For example, a port moved into a -bond will see it's upper master change. If that bond is moved into a bridge, -the bond's upper master will change. And so on. The driver will track such -movements to know what position a port is in in the overall topology by -registering for netdevice events and acting on NETDEV_CHANGEUPPER. - -L2 Forwarding Offload ---------------------- - -The idea is to offload the L2 data forwarding (switching) path from the kernel -to the switchdev device by mirroring bridge FDB entries down to the device. An -FDB entry is the {port, MAC, VLAN} tuple forwarding destination. - -To offloading L2 bridging, the switchdev driver/device should support: - - - Static FDB entries installed on a bridge port - - Notification of learned/forgotten src mac/vlans from device - - STP state changes on the port - - VLAN flooding of multicast/broadcast and unknown unicast packets - -Static FDB Entries -^^^^^^^^^^^^^^^^^^ - -The switchdev driver should implement ndo_fdb_add, ndo_fdb_del and ndo_fdb_dump -to support static FDB entries installed to the device. Static bridge FDB -entries are installed, for example, using iproute2 bridge cmd: - - bridge fdb add ADDR dev DEV [vlan VID] [self] - -The driver should use the helper switchdev_port_fdb_xxx ops for ndo_fdb_xxx -ops, and handle add/delete/dump of SWITCHDEV_OBJ_ID_PORT_FDB object using -switchdev_port_obj_xxx ops. - -XXX: what should be done if offloading this rule to hardware fails (for -example, due to full capacity in hardware tables) ? - -Note: by default, the bridge does not filter on VLAN and only bridges untagged -traffic. To enable VLAN support, turn on VLAN filtering: - - echo 1 >/sys/class/net//bridge/vlan_filtering - -Notification of Learned/Forgotten Source MAC/VLANs -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The switch device will learn/forget source MAC address/VLAN on ingress packets -and notify the switch driver of the mac/vlan/port tuples. The switch driver, -in turn, will notify the bridge driver using the switchdev notifier call: - - err = call_switchdev_notifiers(val, dev, info, extack); - -Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when -forgetting, and info points to a struct switchdev_notifier_fdb_info. On -SWITCHDEV_FDB_ADD, the bridge driver will install the FDB entry into the -bridge's FDB and mark the entry as NTF_EXT_LEARNED. The iproute2 bridge -command will label these entries "offload": - - $ bridge fdb - 52:54:00:12:35:01 dev sw1p1 master br0 permanent - 00:02:00:00:02:00 dev sw1p1 master br0 offload - 00:02:00:00:02:00 dev sw1p1 self - 52:54:00:12:35:02 dev sw1p2 master br0 permanent - 00:02:00:00:03:00 dev sw1p2 master br0 offload - 00:02:00:00:03:00 dev sw1p2 self - 33:33:00:00:00:01 dev eth0 self permanent - 01:00:5e:00:00:01 dev eth0 self permanent - 33:33:ff:00:00:00 dev eth0 self permanent - 01:80:c2:00:00:0e dev eth0 self permanent - 33:33:00:00:00:01 dev br0 self permanent - 01:00:5e:00:00:01 dev br0 self permanent - 33:33:ff:12:35:01 dev br0 self permanent - -Learning on the port should be disabled on the bridge using the bridge command: - - bridge link set dev DEV learning off - -Learning on the device port should be enabled, as well as learning_sync: - - bridge link set dev DEV learning on self - bridge link set dev DEV learning_sync on self - -Learning_sync attribute enables syncing of the learned/forgotten FDB entry to -the bridge's FDB. It's possible, but not optimal, to enable learning on the -device port and on the bridge port, and disable learning_sync. - -To support learning, the driver implements switchdev op -switchdev_port_attr_set for SWITCHDEV_ATTR_PORT_ID_{PRE}_BRIDGE_FLAGS. - -FDB Ageing -^^^^^^^^^^ - -The bridge will skip ageing FDB entries marked with NTF_EXT_LEARNED and it is -the responsibility of the port driver/device to age out these entries. If the -port device supports ageing, when the FDB entry expires, it will notify the -driver which in turn will notify the bridge with SWITCHDEV_FDB_DEL. If the -device does not support ageing, the driver can simulate ageing using a -garbage collection timer to monitor FDB entries. Expired entries will be -notified to the bridge using SWITCHDEV_FDB_DEL. See rocker driver for -example of driver running ageing timer. - -To keep an NTF_EXT_LEARNED entry "alive", the driver should refresh the FDB -entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...). The -notification will reset the FDB entry's last-used time to now. The driver -should rate limit refresh notifications, for example, no more than once a -second. (The last-used time is visible using the bridge -s fdb option). - -STP State Change on Port -^^^^^^^^^^^^^^^^^^^^^^^^ - -Internally or with a third-party STP protocol implementation (e.g. mstpd), the -bridge driver maintains the STP state for ports, and will notify the switch -driver of STP state change on a port using the switchdev op -switchdev_attr_port_set for SWITCHDEV_ATTR_PORT_ID_STP_UPDATE. - -State is one of BR_STATE_*. The switch driver can use STP state updates to -update ingress packet filter list for the port. For example, if port is -DISABLED, no packets should pass, but if port moves to BLOCKED, then STP BPDUs -and other IEEE 01:80:c2:xx:xx:xx link-local multicast packets can pass. - -Note that STP BDPUs are untagged and STP state applies to all VLANs on the port -so packet filters should be applied consistently across untagged and tagged -VLANs on the port. - -Flooding L2 domain -^^^^^^^^^^^^^^^^^^ - -For a given L2 VLAN domain, the switch device should flood multicast/broadcast -and unknown unicast packets to all ports in domain, if allowed by port's -current STP state. The switch driver, knowing which ports are within which -vlan L2 domain, can program the switch device for flooding. The packet may -be sent to the port netdev for processing by the bridge driver. The -bridge should not reflood the packet to the same ports the device flooded, -otherwise there will be duplicate packets on the wire. - -To avoid duplicate packets, the switch driver should mark a packet as already -forwarded by setting the skb->offload_fwd_mark bit. The bridge driver will mark -the skb using the ingress bridge port's mark and prevent it from being forwarded -through any bridge port with the same mark. - -It is possible for the switch device to not handle flooding and push the -packets up to the bridge driver for flooding. This is not ideal as the number -of ports scale in the L2 domain as the device is much more efficient at -flooding packets that software. - -If supported by the device, flood control can be offloaded to it, preventing -certain netdevs from flooding unicast traffic for which there is no FDB entry. - -IGMP Snooping -^^^^^^^^^^^^^ - -In order to support IGMP snooping, the port netdevs should trap to the bridge -driver all IGMP join and leave messages. -The bridge multicast module will notify port netdevs on every multicast group -changed whether it is static configured or dynamically joined/leave. -The hardware implementation should be forwarding all registered multicast -traffic groups only to the configured ports. - -L3 Routing Offload ------------------- - -Offloading L3 routing requires that device be programmed with FIB entries from -the kernel, with the device doing the FIB lookup and forwarding. The device -does a longest prefix match (LPM) on FIB entries matching route prefix and -forwards the packet to the matching FIB entry's nexthop(s) egress ports. - -To program the device, the driver has to register a FIB notifier handler -using register_fib_notifier. The following events are available: -FIB_EVENT_ENTRY_ADD: used for both adding a new FIB entry to the device, - or modifying an existing entry on the device. -FIB_EVENT_ENTRY_DEL: used for removing a FIB entry -FIB_EVENT_RULE_ADD, FIB_EVENT_RULE_DEL: used to propagate FIB rule changes - -FIB_EVENT_ENTRY_ADD and FIB_EVENT_ENTRY_DEL events pass: - - struct fib_entry_notifier_info { - struct fib_notifier_info info; /* must be first */ - u32 dst; - int dst_len; - struct fib_info *fi; - u8 tos; - u8 type; - u32 tb_id; - u32 nlflags; - }; - -to add/modify/delete IPv4 dst/dest_len prefix on table tb_id. The *fi -structure holds details on the route and route's nexthops. *dev is one of the -port netdevs mentioned in the route's next hop list. - -Routes offloaded to the device are labeled with "offload" in the ip route -listing: - - $ ip route show - default via 192.168.0.2 dev eth0 - 11.0.0.0/30 dev sw1p1 proto kernel scope link src 11.0.0.2 offload - 11.0.0.4/30 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload - 11.0.0.8/30 dev sw1p2 proto kernel scope link src 11.0.0.10 offload - 11.0.0.12/30 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload - 12.0.0.2 proto zebra metric 30 offload - nexthop via 11.0.0.1 dev sw1p1 weight 1 - nexthop via 11.0.0.9 dev sw1p2 weight 1 - 12.0.0.3 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload - 12.0.0.4 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload - 192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.15 - -The "offload" flag is set in case at least one device offloads the FIB entry. - -XXX: add/mod/del IPv6 FIB API - -Nexthop Resolution -^^^^^^^^^^^^^^^^^^ - -The FIB entry's nexthop list contains the nexthop tuple (gateway, dev), but for -the switch device to forward the packet with the correct dst mac address, the -nexthop gateways must be resolved to the neighbor's mac address. Neighbor mac -address discovery comes via the ARP (or ND) process and is available via the -arp_tbl neighbor table. To resolve the routes nexthop gateways, the driver -should trigger the kernel's neighbor resolution process. See the rocker -driver's rocker_port_ipv4_resolve() for an example. - -The driver can monitor for updates to arp_tbl using the netevent notifier -NETEVENT_NEIGH_UPDATE. The device can be programmed with resolved nexthops -for the routes as arp_tbl updates. The driver implements ndo_neigh_destroy -to know when arp_tbl neighbor entries are purged from the port. -- cgit v1.2.3-59-g8ed1b