linux-dev - Linux kernel development work

Age	Commit message (Collapse)	Author	Files	Lines
2021-07-23	net: bridge: switchdev: allow the TX data plane forwarding to be offloaded	Tobias Waldekranz	1	-1/+1
	Allow switchdevs to forward frames from the CPU in accordance with the bridge configuration in the same way as is done between bridge ports. This means that the bridge will only send a single skb towards one of the ports under the switchdev's control, and expects the driver to deliver the packet to all eligible ports in its domain. Primarily this improves the performance of multicast flows with multiple subscribers, as it allows the hardware to perform the frame replication. The basic flow between the driver and the bridge is as follows: - When joining a bridge port, the switchdev driver calls switchdev_bridge_port_offload() with tx_fwd_offload = true. - The bridge sends offloadable skbs to one of the ports under the switchdev's control using skb->offload_fwd_mark = true. - The switchdev driver checks the skb->offload_fwd_mark field and lets its FDB lookup select the destination port mask for this packet. v1->v2: - convert br_input_skb_cb::fwd_hwdoms to a plain unsigned long - introduce a static key "br_switchdev_fwd_offload_used" to minimize the impact of the newly introduced feature on all the setups which don't have hardware that can make use of it - introduce a check for nbp->flags & BR_FWD_OFFLOAD to optimize cache line access - reorder nbp_switchdev_frame_mark_accel() and br_handle_vlan() in __br_forward() - do not strip VLAN on egress if forwarding offload on VLAN-aware bridge is being used - propagate errors from .ndo_dfwd_add_station() if not EOPNOTSUPP v2->v3: - replace the solution based on .ndo_dfwd_add_station with a solution based on switchdev_bridge_port_offload - rename BR_FWD_OFFLOAD to BR_TX_FWD_OFFLOAD v3->v4: rebase v4->v5: - make sure the static key is decremented on bridge port unoffload - more function and variable renaming and comments for them: br_switchdev_fwd_offload_used to br_switchdev_tx_fwd_offload br_switchdev_accels_skb to br_switchdev_frame_uses_tx_fwd_offload nbp_switchdev_frame_mark_tx_fwd to nbp_switchdev_frame_mark_tx_fwd_to_hwdom nbp_switchdev_frame_mark_accel to nbp_switchdev_frame_mark_tx_fwd_offload fwd_accel to tx_fwd_offload Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-23	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	David S. Miller	2	-8/+9
	Conflicts are simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-22	dpaa2-switch: seed the buffer pool after allocating the swp	Ioana Ciornei	1	-8/+8
	Any interraction with the buffer pool (seeding a buffer, acquire one) is made through a software portal (SWP, a DPIO object). There are circumstances where the dpaa2-switch driver probes on a DPSW before any DPIO devices have been probed. In this case, seeding of the buffer pool will lead to a panic since no SWPs are initialized. To fix this, seed the buffer pool after making sure that the software portals have been probed and are ready to be used. Fixes: 0b1b71370458 ("staging: dpaa2-switch: handle Rx path on control interface") Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-22	net: bridge: move the switchdev object replay helpers to "push" mode	Vladimir Oltean	1	-2/+10
	Starting with commit 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries"), DSA has introduced some bridge helpers that replay switchdev events (FDB/MDB/VLAN additions and deletions) that can be lost by the switchdev drivers in a variety of circumstances: - an IP multicast group was host-joined on the bridge itself before any switchdev port joined the bridge, leading to the host MDB entries missing in the hardware database. - during the bridge creation process, the MAC address of the bridge was added to the FDB as an entry pointing towards the bridge device itself, but with no switchdev ports being part of the bridge yet, this local FDB entry would remain unknown to the switchdev hardware database. - a VLAN/FDB/MDB was added to a bridge port that is a LAG interface, before any switchdev port joined that LAG, leading to the hardware database missing those entries. - a switchdev port left a LAG that is a bridge port, while the LAG remained part of the bridge, and all FDB/MDB/VLAN entries remained installed in the hardware database of the switchdev port. Also, since commit 0d2cfbd41c4a ("net: bridge: ignore switchdev events for LAG ports which didn't request replay"), DSA introduced a method, based on a const void ctx, to ensure that two switchdev ports under the same LAG that is a bridge port do not see the same MDB/VLAN entry being replayed twice by the bridge, once for every bridge port that joins the LAG. With so many ordering corner cases being possible, it seems unreasonable to expect a switchdev driver writer to get it right from the first try. Therefore, now that DSA has experimented with the bridge replay helpers for a little bit, we can move the code to the bridge driver where it is more readily available to all switchdev drivers. To convert the switchdev object replay helpers from "pull mode" (where the driver asks for them) to a "push mode" (where the bridge offers them automatically), the biggest problem is that the bridge needs to be aware when a switchdev port joins and leaves, even when the switchdev is only indirectly a bridge port (for example when the bridge port is a LAG upper of the switchdev). Luckily, we already have a hook for that, in the form of the newly introduced switchdev_bridge_port_offload() and switchdev_bridge_port_unoffload() calls. These offer a natural place for hooking the object addition and deletion replays. Extend the above 2 functions with: - pointers to the switchdev atomic notifier (for FDB replays) and the blocking notifier (for MDB and VLAN replays). - the "const void ctx" argument required for drivers to be able to disambiguate between which port is targeted, when multiple ports are lowers of the same LAG that is a bridge port. Most of the drivers pass NULL to this argument, except the ones that support LAG offload and have the proper context check already in place in the switchdev blocking notifier handler. Also unexport the replay helpers, since nobody except the bridge calls them directly now. Note that: (a) we abuse the terminology slightly, because FDB entries are not "switchdev objects", but we count them as objects nonetheless. With no direct way to prove it, I think they are not modeled as switchdev objects because those can only be installed by the bridge to the hardware (as opposed to FDB entries which can be propagated in the other direction too). This is merely an abuse of terms, FDB entries are replayed too, despite not being objects. (b) the bridge does not attempt to sync port attributes to newly joined ports, just the countable stuff (the objects). The reason for this is simple: no universal and symmetric way to sync and unsync them is known. For example, VLAN filtering: what to do on unsync, disable or leave it enabled? Similarly, STP state, ageing timer, etc etc. What a switchdev port does when it becomes standalone again is not really up to the bridge's competence, and the driver should deal with it. On the other hand, replaying deletions of switchdev objects can be seen a matter of cleanup and therefore be treated by the bridge, hence this patch. We make the replay helpers opt-in for drivers, because they might not bring immediate benefits for them: - nbp_vlan_init() is called _after_ netdev_master_upper_dev_link(), so br_vlan_replay() should not do anything for the new drivers on which we call it. The existing drivers where there was even a slight possibility for there to exist a VLAN on a bridge port before they join it are already guarded against this: mlxsw and prestera deny joining LAG interfaces that are members of a bridge. - br_fdb_replay() should now notify of local FDB entries, but I patched all drivers except DSA to ignore these new entries in commit 2c4eca3ef716 ("net: bridge: switchdev: include local flag in FDB notifications"). Driver authors can lift this restriction as they wish, and when they do, they can also opt into the FDB replay functionality. - br_mdb_replay() should fix a real issue which is described in commit 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries"). However most drivers do not offload the SWITCHDEV_OBJ_ID_HOST_MDB to see this issue: only cpsw and am65_cpsw offload this switchdev object, and I don't completely understand the way in which they offload this switchdev object anyway. So I'll leave it up to these drivers' respective maintainers to opt into br_mdb_replay(). So most of the drivers pass NULL notifier blocks for the replay helpers, except: - dpaa2-switch which was already acked/regression-tested with the helpers enabled (and there isn't much of a downside in having them) - ocelot which already had replay logic in "pull" mode - DSA which already had replay logic in "pull" mode An important observation is that the drivers which don't currently request bridge event replays don't even have the switchdev_bridge_port_{offload,unoffload} calls placed in proper places right now. This was done to avoid unnecessary rework for drivers which might never even add support for this. For driver writers who wish to add replay support, this can be used as a tentative placement guide: https://patchwork.kernel.org/project/netdevbpf/patch/20210720134655.892334-11-vladimir.oltean@nxp.com/ Cc: Vadym Kochan <vkochan@marvell.com> Cc: Taras Chornyi <tchornyi@marvell.com> Cc: Ioana Ciornei <ioana.ciornei@nxp.com> Cc: Lars Povlsen <lars.povlsen@microchip.com> Cc: Steen Hegelund <Steen.Hegelund@microchip.com> Cc: UNGLinuxDriver@microchip.com Cc: Claudiu Manoil <claudiu.manoil@nxp.com> Cc: Alexandre Belloni <alexandre.belloni@bootlin.com> Cc: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-22	net: bridge: switchdev: let drivers inform which bridge ports are offloaded	Vladimir Oltean	1	-0/+13
	On reception of an skb, the bridge checks if it was marked as 'already forwarded in hardware' (checks if skb->offload_fwd_mark == 1), and if it is, it assigns the source hardware domain of that skb based on the hardware domain of the ingress port. Then during forwarding, it enforces that the egress port must have a different hardware domain than the ingress one (this is done in nbp_switchdev_allowed_egress). Non-switchdev drivers don't report any physical switch id (neither through devlink nor .ndo_get_port_parent_id), therefore the bridge assigns them a hardware domain of 0, and packets coming from them will always have skb->offload_fwd_mark = 0. So there aren't any restrictions. Problems appear due to the fact that DSA would like to perform software fallback for bonding and team interfaces that the physical switch cannot offload. +-- br0 ---+ / / \| \ / / \| \ / \| \| bond0 / \| \| / \ swp0 swp1 swp2 swp3 swp4 There, it is desirable that the presence of swp3 and swp4 under a non-offloaded LAG does not preclude us from doing hardware bridging beteen swp0, swp1 and swp2. The bandwidth of the CPU is often times high enough that software bridging between {swp0,swp1,swp2} and bond0 is not impractical. But this creates an impossible paradox given the current way in which port hardware domains are assigned. When the driver receives a packet from swp0 (say, due to flooding), it must set skb->offload_fwd_mark to something. - If we set it to 0, then the bridge will forward it towards swp1, swp2 and bond0. But the switch has already forwarded it towards swp1 and swp2 (not to bond0, remember, that isn't offloaded, so as far as the switch is concerned, ports swp3 and swp4 are not looking up the FDB, and the entire bond0 is a destination that is strictly behind the CPU). But we don't want duplicated traffic towards swp1 and swp2, so it's not ok to set skb->offload_fwd_mark = 0. - If we set it to 1, then the bridge will not forward the skb towards the ports with the same switchdev mark, i.e. not to swp1, swp2 and bond0. Towards swp1 and swp2 that's ok, but towards bond0? It should have forwarded the skb there. So the real issue is that bond0 will be assigned the same hardware domain as {swp0,swp1,swp2}, because the function that assigns hardware domains to bridge ports, nbp_switchdev_add(), recurses through bond0's lower interfaces until it finds something that implements devlink (calls dev_get_port_parent_id with bool recurse = true). This is a problem because the fact that bond0 can be offloaded by swp3 and swp4 in our example is merely an assumption. A solution is to give the bridge explicit hints as to what hardware domain it should use for each port. Currently, the bridging offload is very 'silent': a driver registers a netdevice notifier, which is put on the netns's notifier chain, and which sniffs around for NETDEV_CHANGEUPPER events where the upper is a bridge, and the lower is an interface it knows about (one registered by this driver, normally). Then, from within that notifier, it does a bunch of stuff behind the bridge's back, without the bridge necessarily knowing that there's somebody offloading that port. It looks like this: ip link set swp0 master br0 \| v br_add_if() calls netdev_master_upper_dev_link() \| v call_netdevice_notifiers \| v dsa_slave_netdevice_event \| v oh, hey! it's for me! \| v .port_bridge_join What we do to solve the conundrum is to be less silent, and change the switchdev drivers to present themselves to the bridge. Something like this: ip link set swp0 master br0 \| v br_add_if() calls netdev_master_upper_dev_link() \| v bridge: Aye! I'll use this call_netdevice_notifiers ^ ppid as the \| \| hardware domain for v \| this port, and zero dsa_slave_netdevice_event \| if I got nothing. \| \| v \| oh, hey! it's for me! \| \| \| v \| .port_bridge_join \| \| \| +------------------------+ switchdev_bridge_port_offload(swp0, swp0) Then stacked interfaces (like bond0 on top of swp3/swp4) would be treated differently in DSA, depending on whether we can or cannot offload them. The offload case: ip link set bond0 master br0 \| v br_add_if() calls netdev_master_upper_dev_link() \| v bridge: Aye! I'll use this call_netdevice_notifiers ^ ppid as the \| \| switchdev mark for v \| bond0. dsa_slave_netdevice_event \| Coincidentally (or not), \| \| bond0 and swp0, swp1, swp2 v \| all have the same switchdev hmm, it's not quite for me, \| mark now, since the ASIC but my driver has already \| is able to forward towards called .port_lag_join \| all these ports in hw. for it, because I have \| a port with dp->lag_dev == bond0. \| \| \| v \| .port_bridge_join \| for swp3 and swp4 \| \| \| +------------------------+ switchdev_bridge_port_offload(bond0, swp3) switchdev_bridge_port_offload(bond0, swp4) And the non-offload case: ip link set bond0 master br0 \| v br_add_if() calls netdev_master_upper_dev_link() \| v bridge waiting: call_netdevice_notifiers ^ huh, switchdev_bridge_port_offload \| \| wasn't called, okay, I'll use a v \| hwdom of zero for this one. dsa_slave_netdevice_event : Then packets received on swp0 will \| : not be software-forwarded towards v : swp1, but they will towards bond0. it's not for me, but bond0 is an upper of swp3 and swp4, but their dp->lag_dev is NULL because they couldn't offload it. Basically we can draw the conclusion that the lowers of a bridge port can come and go, so depending on the configuration of lowers for a bridge port, it can dynamically toggle between offloaded and unoffloaded. Therefore, we need an equivalent switchdev_bridge_port_unoffload too. This patch changes the way any switchdev driver interacts with the bridge. From now on, everybody needs to call switchdev_bridge_port_offload and switchdev_bridge_port_unoffload, otherwise the bridge will treat the port as non-offloaded and allow software flooding to other ports from the same ASIC. Note that these functions lay the ground for a more complex handshake between switchdev drivers and the bridge in the future. For drivers that will request a replay of the switchdev objects when they offload and unoffload a bridge port (DSA, dpaa2-switch, ocelot), we place the call to switchdev_bridge_port_unoffload() strategically inside the NETDEV_PRECHANGEUPPER notifier's code path, and not inside NETDEV_CHANGEUPPER. This is because the switchdev object replay helpers need the netdev adjacency lists to be valid, and that is only true in NETDEV_PRECHANGEUPPER. Cc: Vadym Kochan <vkochan@marvell.com> Cc: Taras Chornyi <tchornyi@marvell.com> Cc: Ioana Ciornei <ioana.ciornei@nxp.com> Cc: Lars Povlsen <lars.povlsen@microchip.com> Cc: Steen Hegelund <Steen.Hegelund@microchip.com> Cc: UNGLinuxDriver@microchip.com Cc: Claudiu Manoil <claudiu.manoil@nxp.com> Cc: Alexandre Belloni <alexandre.belloni@bootlin.com> Cc: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch: regression Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com> # ocelot-switch Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-22	net: dpaa2-switch: refactor prechangeupper sanity checks	Vladimir Oltean	1	-11/+26
	Make more room for some extra code in the NETDEV_PRECHANGEUPPER handler by moving what already exists into a dedicated function. Cc: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-22	net: dpaa2-switch: use extack in dpaa2_switch_port_bridge_join	Vladimir Oltean	1	-4/+7
	We need to propagate the extack argument for dpaa2_switch_port_bridge_join to use it in a future patch, and it looks like there is already an error message there which is currently printed to the console. Move it over netlink so it is properly transmitted to user space. Cc: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com> Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-20	fsl/fman: Add fibre support	Maxim Kochetkov	1	-0/+1
	Set SUPPORTED_FIBRE to mac_dev->if_support. It allows proper usage of PHYs with optical/fiber support. Signed-off-by: Maxim Kochetkov <fido_max@inbox.ru> Acked-by: Madalin Bucur <madalin.bucur@oss.nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next	David S. Miller	2	-10/+1
	Daniel Borkmann says: ==================== pull-request: bpf-next 2021-06-28 The following pull-request contains BPF updates for your net-next tree. We've added 37 non-merge commits during the last 12 day(s) which contain a total of 56 files changed, 394 insertions(+), 380 deletions(-). The main changes are: 1) XDP driver RCU cleanups, from Toke Høiland-Jørgensen and Paul E. McKenney. 2) Fix bpf_skb_change_proto() IPv4/v6 GSO handling, from Maciej Żenczykowski. 3) Fix false positive kmemleak report for BPF ringbuf alloc, from Rustam Kovhaev. 4) Fix x86 JIT's extable offset calculation for PROBE_LDX NULL, from Ravi Bangoria. 5) Enable libbpf fallback probing with tracing under RHEL7, from Jonathan Edwards. 6) Clean up x86 JIT to remove unused cnt tracking from EMIT macro, from Jiri Olsa. 7) Netlink cleanups for libbpf to please Coverity, from Kumar Kartikeya Dwivedi. 8) Allow to retrieve ancestor cgroup id in tracing programs, from Namhyung Kim. 9) Fix lirc BPF program query to use user-provided prog_cnt, from Sean Young. 10) Add initial libbpf doc including generated kdoc for its API, from Grant Seltzer. 11) Make xdp_rxq_info_unreg_mem_model() more robust, from Jakub Kicinski. 12) Fix up bpfilter startup log-level to info level, from Gary Lin. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28	net: switchdev: add a context void pointer to struct switchdev_notifier_info	Vladimir Oltean	1	-1/+1
	In the case where the driver asks for a replay of a certain type of event (port object or attribute) for a bridge port that is a LAG, it may do so because this port has just joined the LAG. But there might already be other switchdev ports in that LAG, and it is preferable that those preexisting switchdev ports do not act upon the replayed event. The solution is to add a context to switchdev events, which is NULL most of the time (when the bridge layer initiates the call) but which can be set to a value controlled by the switchdev driver when a replay is requested. The driver can then check the context to figure out if all ports within the LAG should act upon the switchdev event, or just the ones that match the context. We have to modify all switchdev_handle_* helper functions as well as the prototypes in the drivers that use these helpers too, because these helpers hide the underlying struct switchdev_notifier_info from us and there is no way to retrieve the context otherwise. The context structure will be populated and used in later patches. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-25	net: mdiobus: withdraw fwnode_mdbiobus_register	Marcin Wojtas	2	-3/+12
	The newly implemented fwnode_mdbiobus_register turned out to be problematic - in case the fwnode_/of_/acpi_mdio are built as modules, a dependency cycle can be observed during the depmod phase of modules_install, eg.: depmod: ERROR: Cycle detected: fwnode_mdio -> of_mdio -> fwnode_mdio depmod: ERROR: Found 2 modules in dependency cycles! OR: depmod: ERROR: Cycle detected: acpi_mdio -> fwnode_mdio -> acpi_mdio depmod: ERROR: Found 2 modules in dependency cycles! A possible solution could be to rework fwnode_mdiobus_register, so that to merge the contents of acpi_mdiobus_register and of_mdiobus_register. However feasible, such change would be very intrusive and affect huge amount of the of_mdiobus_register users. Since there are currently 2 users of ACPI and MDIO (xgmac_mdio and mvmdio), withdraw the fwnode_mdbiobus_register and roll back to a simple 'if' condition in affected drivers. Fixes: 62a6ef6a996f ("net: mdiobus: Introduce fwnode_mdbiobus_register()") Signed-off-by: Marcin Wojtas <mw@semihalf.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24	freescale: Remove rcu_read_lock() around XDP program invocation	Toke Høiland-Jørgensen	2	-10/+1
	The dpaa and dpaa2 drivers have rcu_read_lock()/rcu_read_unlock() pairs around XDP program invocations. However, the actual lifetime of the objects referred by the XDP program invocation is longer, all the way through to the call to xdp_do_flush(), making the scope of the rcu_read_lock() too small. This turns out to be harmless because it all happens in a single NAPI poll cycle (and thus under local_bh_disable()), but it makes the rcu_read_lock() misleading. Rather than extend the scope of the rcu_read_lock(), just get rid of it entirely. With the addition of RCU annotations to the XDP_REDIRECT map types that take bh execution into account, lockdep even understands this to be safe, so there's really no reason to keep it around. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Camelia Groza <camelia.groza@nxp.com> Cc: Ioana Radulescu <ruxandra.radulescu@nxp.com> Cc: Madalin Bucur <madalin.bucur@nxp.com> Cc: Ioana Ciornei <ioana.ciornei@nxp.com> Link: https://lore.kernel.org/bpf/20210624160609.292325-11-toke@redhat.com
2021-06-22	net/fsl: switch to fwnode_mdiobus_register	Marcin Wojtas	2	-12/+3
	Utilize the newly added helper routine for registering the MDIO bus via fwnode_ interface. Signed-off-by: Marcin Wojtas <mw@semihalf.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-21	net: fec: add ndo_select_queue to fix TX bandwidth fluctuations	Fugang Duan	1	-0/+32
	As we know that AVB is enabled by default, and the ENET IP design is queue 0 for best effort, queue 1&2 for AVB Class A&B. Bandwidth of each queue 1&2 set in driver is 50%, TX bandwidth fluctuated when selecting tx queues randomly with FEC_QUIRK_HAS_AVB quirk available. This patch adds ndo_select_queue callback to select queues for transmitting to fix this issue. It will always return queue 0 if this is not a vlan packet, and return queue 1 or 2 based on priority of vlan packet. You may complain that in fact we only use single queue for trasmitting if we are not targeted to VLAN. Yes, but seems we have no choice, since AVB is enabled when the driver probed, we can't switch this feature dynamicly. After compare multiple queues to single queue, TX throughput almost no improvement. One way we can implemet is to configure the driver to multiple queues with Round-robin scheme by default. Then add ndo_setup_tc callback to enable/disable AVB feature for users. Unfortunately, ENET AVB IP seems not follow the standard 802.1Qav spec. We only can program DMAnCFG[IDLE_SLOPE] field to calculate bandwidth fraction. And idle slope is restricted to certain valus (a total of 19). It's far away from CBS QDisc implemented in Linux TC framework. If you strongly suggest to do this, I think we only can support limited numbers of bandwidth and reject others, but it's really urgly and wried. With this patch, VLAN tagged packets route to queue 0/1/2 based on vlan priority; VLAN untagged packets route to queue 0. Tested-by: Frieder Schrempf <frieder.schrempf@kontron.de> Reported-by: Frieder Schrempf <frieder.schrempf@kontron.de> Signed-off-by: Fugang Duan <fugang.duan@nxp.com> Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-21	net: fec: add FEC_QUIRK_HAS_MULTI_QUEUES represents i.MX6SX ENET IP	Joakim Zhang	2	-5/+11
	Frieder Schrempf reported a TX throuthput issue [1], it happens quite often that the measured bandwidth in TX direction drops from its expected/nominal value to something like ~50% (for 100M) or ~67% (for 1G) connections. [1] https://lore.kernel.org/linux-arm-kernel/421cc86c-b66f-b372-32f7-21e59f9a98bc@kontron.de/ The issue becomes clear after digging into it, Net core would select queues when transmitting packets. Since FEC have not impletemented ndo_select_queue callback yet, so it will call netdev_pick_tx to select queues randomly. For i.MX6SX ENET IP with AVB support, driver default enables this feature. According to the setting of QOS/RCMRn/DMAnCFG registers, AVB configured to Credit-based scheme, 50% bandwidth of each queue 1&2. With below tests let me think more: 1) With FEC_QUIRK_HAS_AVB quirk, can reproduce TX bandwidth fluctuations issue. 2) Without FEC_QUIRK_HAS_AVB quirk, can't reproduce TX bandwidth fluctuations issue. The related difference with or w/o FEC_QUIRK_HAS_AVB quirk is that, whether we program FTYPE field of TxBD or not. As I describe above, AVB feature is enabled by default. With FEC_QUIRK_HAS_AVB quirk, frames in queue 0 marked as non-AVB, and frames in queue 1&2 marked as AVB Class A&B. It's unreasonable if frames in queue 1&2 are not required to be time-sensitive. So when Net core select tx queues ramdomly, Credit-based scheme would work and lead to TX bandwidth fluctuated. On the other hand, w/o FEC_QUIRK_HAS_AVB quirk, frames in queue 1&2 are all marked as non-AVB, so Credit-based scheme would not work. Till now, how can we fix this TX throughput issue? Yes, please remove FEC_QUIRK_HAS_AVB quirk if you suffer it from time-nonsensitive networking. However, this quirk is used to indicate i.MX6SX, other setting depends on it. So this patch adds a new quirk FEC_QUIRK_HAS_MULTI_QUEUES to represent i.MX6SX, it is safe for us remove FEC_QUIRK_HAS_AVB quirk now. FEC_QUIRK_HAS_AVB quirk is set by default in the driver, and users may not know much about driver details, they would waste effort to find the root cause, that is not we want. The following patch is a implementation to fix it and users don't need to modify the driver. Tested-by: Frieder Schrempf <frieder.schrempf@kontron.de> Reported-by: Frieder Schrempf <frieder.schrempf@kontron.de> Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-18	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	Jakub Kicinski	1	-3/+5
	Trivial conflicts in net/can/isotp.c and tools/testing/selftests/net/mptcp/mptcp_connect.sh scaled_ppm_to_ppb() was moved from drivers/ptp/ptp_clock.c to include/linux/ptp_clock_kernel.h in -next so re-apply the fix there. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-06-17	net: gianfar: Implement rx_missed_errors counter	Esben Haabendal	2	-3/+57
	Devices with RMON support has a 16-bit RDRP counter. It provides: "Receive dropped packets counter. Increments for frames received which are streamed to system but are later dropped due to lack of system resources." To handle more than 2^16 dropped packets, a carry bit in CAR1 register is set on overflow, so we enable irq when this is set, extending the counter to 2^64 for handling situations where lots of packets are missed (e.g. during heavy network storms). Signed-off-by: Esben Haabendal <esben@geanix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-17	net: gianfar: Add definitions for CAR1 and CAM1 register bits	Esben Haabendal	1	-0/+54
	These are for carry status and interrupt mask bits of statistics registers. Signed-off-by: Esben Haabendal <esben@geanix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-17	net: gianfar: Avoid 16 bytes of memset	Esben Haabendal	1	-1/+1
	The memset on CAMx is wrong, as it actually unmasks all carry irq's, which we clearly are not interested in. The memset on CARx registers is just pointless, as they are W1C. So let's just stop the memset before CAR1. Signed-off-by: Esben Haabendal <esben@geanix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-17	net: gianfar: Clear CAR registers	Esben Haabendal	1	-0/+3
	The CAR1 and CAR2 registers are W1C style registers, to the memset does not actually clear them. Signed-off-by: Esben Haabendal <esben@geanix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-17	net: gianfar: Extend statistics counters to 64-bit	Esben Haabendal	1	-5/+5
	No reason to wrap counter values at 2^32. Especially the bytes counters can wrap pretty fast on Gbit networks. Signed-off-by: Esben Haabendal <esben@geanix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-17	net: gianfar: Convert to ndo_get_stats64 interface	Esben Haabendal	1	-18/+7
	No reason to produce the legacy net_device_stats struct, only to have it converted to rtnl_link_stats64. And as a bonus, this allows for improving counter size to 64 bit. Signed-off-by: Esben Haabendal <esben@geanix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-16	net: fec_ptp: fix issue caused by refactor the fec_devtype	Joakim Zhang	1	-3/+1
	Commit da722186f654 ("net: fec: set GPR bit on suspend by DT configuration.") refactor the fec_devtype, need adjust ptp driver accordingly. Fixes: da722186f654 ("net: fec: set GPR bit on suspend by DT configuration.") Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-16	net: fec_ptp: add clock rate zero check	Fugang Duan	1	-0/+4
	Add clock rate zero check to fix coverity issue of "divide by 0". Fixes: commit 85bd1798b24a ("net: fec: fix spin_lock dead lock") Signed-off-by: Fugang Duan <fugang.duan@nxp.com> Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11	net: dpaa2-mac: Add ACPI support for DPAA2 MAC driver	Calvin Johnson	2	-37/+53
	Modify dpaa2_mac_get_node() to get the dpmac fwnode from either DT or ACPI. Modify dpaa2_mac_get_if_mode() to get interface mode from dpmac_node which is a fwnode. Modify dpaa2_pcs_create() to create pcs from dpmac_node fwnode. Modify dpaa2_mac_connect() to support ACPI along with DT. Signed-off-by: Calvin Johnson <calvin.johnson@oss.nxp.com> Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Acked-by: Rafael J. Wysocki <rafael@kernel.org> # from the ACPI side Acked-by: Grant Likely <grant.likely@arm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11	net/fsl: Use [acpi\|of]_mdiobus_register	Calvin Johnson	1	-9/+21
	Depending on the device node type, call the specific OF or ACPI mdiobus_register function. Note: For both ACPI and DT cases, endianness of MDIO controllers need to be specified using the "little-endian" property. Signed-off-by: Calvin Johnson <calvin.johnson@oss.nxp.com> Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Acked-by: Grant Likely <grant.likely@arm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-07	net: enetc: Use devm_platform_get_and_ioremap_resource()	Yang Yingliang	1	-3/+1
	Use devm_platform_get_and_ioremap_resource() to simplify code. Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-04	net: enetc: use get/put_unaligned helpers for MAC address handling	Michael Walle	1	-4/+5
	The supplied buffer for the MAC address might not be aligned. Thus doing a 32bit (or 16bit) access could be on an unaligned address. For now, enetc is only used on aarch64 which can do unaligned accesses, thus there is no error. In any case, be correct and use the get/put_unaligned helpers. Signed-off-by: Michael Walle <michael@walle.cc> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-01	net: enetc: catch negative return code from enetc_pf_to_port()	Vladimir Oltean	1	-7/+24
	After the refactoring introduced in commit 87614b931c24 ("net: enetc: create a common enetc_pf_to_port helper"), enetc_pf_to_port was coded up to return -1 in case the passed PCIe device does not have a recognized BDF. Make sure the -1 value is checked by the callers, to appease static checkers. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-27	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	Jakub Kicinski	1	-5/+19
	cdc-wdm: s/kill_urbs/poison_urbs/ to fix build Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-23	ethernet: ucc_geth: Use kmemdup() rather than kmalloc+memcpy	YueHaibing	1	-2/+1
	Issue identified with Coccinelle. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-21	dpaa2-eth: don't print error from dpaa2_mac_connect if that's EPROBE_DEFER	Vladimir Oltean	1	-3/+4
	When booting a board with DPAA2 interfaces defined statically via DPL (as opposed to creating them dynamically using restool), the driver will print an unspecific error message. This change adds the error code to the message, and avoids printing altogether if the error code is EPROBE_DEFER, because that is not a cause of alarm. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-21	dpaa2-eth: name the debugfs directory after the DPNI object	Ioana Ciornei	1	-1/+5
	Name the debugfs directory after the DPNI object instead of the netdev name since this can be changed after probe by udev rules. Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-21	dpaa2-eth: setup the of_node field of the device	Ioana Ciornei	2	-12/+14
	When the DPNI object is connected to a DPMAC, setup the of_node to point to the DTS device node of that specific MAC. This enables other drivers, for example the DSA subsystem, to find the net_device by its device node. Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-12	net: fec: add defer probe for of_get_mac_address	Fugang Duan	1	-3/+10
	If MAC address read from nvmem efuse by calling .of_get_mac_address(), but nvmem efuse is registered later than the driver, then it return -EPROBE_DEFER value. So modify the driver to support defer probe when read MAC address from nvmem efuse. Signed-off-by: Fugang Duan <fugang.duan@nxp.com> Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-12	net: fec: fix the potential memory leak in fec_enet_init()	Fugang Duan	1	-2/+9
	If the memory allocated for cbd_base is failed, it should free the memory allocated for the queues, otherwise it causes memory leak. And if the memory allocated for the queues is failed, it can return error directly. Fixes: 59d0f7465644 ("net: fec: init multi queue date structure") Signed-off-by: Fugang Duan <fugang.duan@nxp.com> Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-28	net: selftest: fix build issue if INET is disabled	Oleksij Rempel	1	-1/+1
	In case ethernet driver is enabled and INET is disabled, selftest will fail to build. Reported-by: Randy Dunlap <rdunlap@infradead.org> Fixes: 3e1e58d64c3d ("net: add generic selftest support") Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20210428130947.29649-1-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-23	enetc: fix locking for one-step timestamping packet transfer	Yangbo Lu	1	-9/+9
	The previous patch to support PTP Sync packet one-step timestamping described one-step timestamping packet handling logic as below in commit message: - Trasmit packet immediately if no other one in transfer, or queue to skb queue if there is already one in transfer. The test_and_set_bit_lock() is used here to lock and check state. - Start a work when complete transfer on hardware, to release the bit lock and to send one skb in skb queue if has. There was not problem of the description, but there was a mistake in implementation. The locking/test_and_set_bit_lock() should be put in enetc_start_xmit() which may be called by worker, rather than in enetc_xmit(). Otherwise, the worker calling enetc_start_xmit() after bit lock released is not able to lock again for transfer. Fixes: 7294380c5211 ("enetc: support PTP Sync packet one-step timestamping") Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-22	net: enetc: fix link error again	Arnd Bergmann	1	-3/+1
	A link time bug that I had fixed before has come back now that another sub-module was added to the enetc driver: ERROR: modpost: "enetc_ierb_register_pf" [drivers/net/ethernet/freescale/enetc/fsl-enetc.ko] undefined! The problem is that the enetc Makefile is not actually used for the ierb module if that is the only built-in driver in there and everything else is a loadable module. Fix it by always entering the directory this time, regardless of which symbols are configured. This should reliably fix the problem and prevent it from coming back another time. Fixes: 112463ddbe82 ("net: dsa: felix: fix link error") Fixes: e7d48e5fbf30 ("net: enetc: add a mini driver for the Integrated Endpoint Register Block") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-20	net: enetc: automatically select IERB module	Michael Walle	1	-1/+1
	Now that enetc supports flow control we have to make sure the settings in the IERB are correct. Therefore, we actually depend on the enetc-ierb module. Previously it was possible that this module was disabled while the enetc was enabled. Fix it by automatically select the enetc-ierb module. Fixes: e7d48e5fbf30 ("net: enetc: add a mini driver for the Integrated Endpoint Register Block") Signed-off-by: Michael Walle <michael@walle.cc> Acked-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-20	net: fec: make use of generic NET_SELFTESTS library	Oleksij Rempel	2	-0/+8
	With this patch FEC on iMX will able to run generic net selftests Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-19	net: enetc: add support for flow control	Vladimir Oltean	3	-2/+85
	In the ENETC receive path, a frame received by the MAC is first stored in a 256KB 'FIFO' memory, then transferred to DRAM when enqueuing it to the RX ring. The FIFO is a shared resource for all ENETC ports, but every port keeps track of its own memory utilization, on RX and on TX. There is a setting for RX rings through which they can either operate in 'lossy' mode (where the lack of a free buffer causes an immediate discard of the frame) or in 'lossless' mode (where the lack of a free buffer in the ring makes the frame stay longer in the FIFO). In turn, when the memory utilization of the FIFO exceeds a certain margin, the MAC can be configured to emit PAUSE frames. There is enough FIFO memory to buffer up to 3 MTU-sized frames per RX port while not jeopardizing the other use cases (jumbo frames), and also not consume bytes from the port TX allocations. Also, 3 MTU-sized frames worth of memory is enough to ensure zero loss for 64 byte packets at 1G line rate. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-19	net: enetc: add a mini driver for the Integrated Endpoint Register Block	Vladimir Oltean	5	-1/+221
	The NXP ENETC is a 4-port Ethernet controller which 'smells' to operating systems like 4 distinct PCIe PFs with SR-IOV, each PF having its own driver instance, but in fact there are some hardware resources which are shared between all ports, like for example the 256 KB SRAM FIFO between the MACs and the Host Transfer Agent which DMAs frames to DRAM. To hide the stuff that cannot be neatly exposed per port, the hardware designers came up with this idea of having a dedicated register block which is supposed to be populated by the bootloader, and contains everything configuration-related: MAC addresses, FIFO partitioning, etc. When a port is reset using PCIe Function Level Reset, its defaults are transferred from the IERB configuration. Most of the time, the settings made through the IERB are read-only in the port's memory space (if they are even visible), so they cannot be modified at runtime. Linux doesn't have any advanced FIFO partitioning requirements at all, but when reading through the hardware manual, it became clear that, even though there are many good 'recommendations' for default values, many of them were not actually put in practice on LS1028A. So we end up with a default configuration that: (a) does not have enough TX and RX byte credits to support the max MTU of 9600 (which the Linux driver claims already) properly (at full speed) (b) allows the FIFO to be overrun with RX traffic, potentially overwriting internal data structures. The last part sounds a bit catastrophic, but it isn't. Frames are supposed to transit the FIFO for a very short time, but they can actually accumulate there under 2 conditions: (a) there is very severe congestion on DRAM memory, or (b) the RX rings visible to the operating system were configured for lossless operation, and they just ran out of free buffers to copy the frame to. This is what is used to put backpressure onto the MAC with flow control. So since ENETC has not supported flow control thus far, RX FIFO overruns were never seen with Linux. But with the addition of flow control, we should configure some registers to prevent this from happening. What we are trying to protect against are bad actors which continue to send us traffic despite the fact that we have signaled a PAUSE condition. Of course we can't be lossless in that case, but it is best to configure the FIFO to do tail dropping rather than letting it overrun. So in a nutshell, this driver is a fixup for all the IERB default values that should have been but aren't. The IERB configuration needs to be done _before_ the PFs are enabled. So every PF searches for the presence of the "fsl,ls1028a-enetc-ierb" node in the device tree, and if it finds it, it "registers" with the IERB, which means that it requests the IERB to fix up its default values. This is done through -EPROBE_DEFER. The IERB driver is part of the fsl_enetc module, but is technically a platform driver, since the IERB is a good old fashioned MMIO region, as opposed to ENETC ports which pretend to be PCIe devices. The driver was already configuring ENETC_PTXMBAR (FIFO allocation for TX) because due to an omission, TXMBAR is a read/write register in the PF memory space. But the manual is quite clear that the formula for this should depend upon the TX byte credits (TXBCR). In turn, the TX byte credits are only readable/writable through the IERB. So if we want to ensure that the TXBCR register also has a value that is correct and in line with TXMBAR, there is simply no way this can be done from the PF driver, access to the IERB is needed. I could have modified U-Boot to fix up the IERB values, but that is quite undesirable, as old U-Boot versions are likely to be floating around for quite some time from now. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-19	net: enetc: create a common enetc_pf_to_port helper	Vladimir Oltean	2	-10/+22
	Even though ENETC interfaces are exposed as individual PCIe PFs with their own driver instances, the ENETC is still fundamentally a multi-port Ethernet controller, and some parts of the IP take a port number (as can be seen in the PSFP implementation). Create a common helper that can be used outside of the TSN code for retrieving the ENETC port number based on the PF number. This is only correct for LS1028A, the only Linux-capable instantiation of ENETC thus far. Note that ENETC port 3 is PF 6. The TSN code did not care about this because ENETC port 3 does not support TSN, so the wrong mapping done by enetc_get_port for PF 6 could have never been hit. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16	net: enetc: apply the MDIO workaround for XDP_REDIRECT too	Vladimir Oltean	1	-0/+4
	Described in fd5736bf9f23 ("enetc: Workaround for MDIO register access issue") is a workaround for a hardware bug that requires a register access of the MDIO controller to never happen concurrently with a register access of a port PF. To avoid that, a mutual exclusion scheme with rwlocks was implemented - the port PF accessors are the 'read' side, and the MDIO accessors are the 'write' side. When we do XDP_REDIRECT between two ENETC interfaces, all is fine because the MDIO lock is already taken from the NAPI poll loop. But when the ingress interface is not ENETC, just the egress is, the MDIO lock is not taken, so we might access the port PF registers concurrently with MDIO, which will make the link flap due to wrong values returned from the PHY. To avoid this, let's just slap an enetc_lock_mdio/enetc_unlock_mdio at the beginning and ending of enetc_xdp_xmit. The fact that the MDIO lock is designed as a rwlock is important here, because the read side is reentrant (that is one of the main reasons why we chose it). Usually, the way we benefit of its reentrancy is by running the data path concurrently on both CPUs, but in this case, we benefit from the reentrancy by taking the lock even when the lock is already taken (and that's the situation where ENETC is both the ingress and the egress interface for XDP_REDIRECT, which was fine before and still is fine now). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16	net: enetc: fix buffer leaks with XDP_TX enqueue rejections	Vladimir Oltean	1	-4/+12
	If the TX ring is congested, enetc_xdp_tx() returns false for the current XDP frame (represented as an array of software BDs). This array of software TX BDs is constructed in enetc_rx_swbd_to_xdp_tx_swbd from software BDs freshly cleaned from the RX ring. The issue is that we scrub the RX software BDs too soon, more precisely before we know that we can enqueue the TX BDs successfully into the TX ring. If we can't enqueue them (and enetc_xdp_tx returns false), we call enetc_xdp_drop which attempts to recycle the buffers held by the RX software BDs. But because we scrubbed those RX BDs already, two things happen: (a) we leak their memory (b) we populate the RX software BD ring with an all-zero rx_swbd structure, which makes the buffer refill path allocate more memory. enetc_refill_rx_ring -> if (unlikely(!rx_swbd->page)) -> enetc_new_page That is a recipe for fast OOM. Fixes: 7ed2bc80074e ("net: enetc: add support for XDP_TX") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16	net: enetc: handle the invalid XDP action the same way as XDP_DROP	Vladimir Oltean	1	-4/+3
	When the XDP program returns an invalid action, we should free the RX buffer. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16	net: enetc: use dedicated TX rings for XDP	Vladimir Oltean	2	-7/+40
	It is possible for one CPU to perform TX hashing (see netdev_pick_tx) between the 8 ENETC TX rings, and the TX hashing to select TX queue 1. At the same time, it is possible for the other CPU to already use TX ring 1 for XDP (either XDP_TX or XDP_REDIRECT). Since there is no mutual exclusion between XDP and the network stack, we run into an issue because the ENETC TX procedure is not reentrant. The obvious approach would be to just make XDP take the lock of the network stack's TX queue corresponding to the ring it's about to enqueue in. For XDP_REDIRECT, this is quite straightforward, a lock at the beginning and end of enetc_xdp_xmit() should do the trick. But for XDP_TX, it's a bit more complicated. For one, we do TX batching all by ourselves for frames with the XDP_TX verdict. This is something we would like to keep the way it is, for performance reasons. But batching means that the network stack's lock should be kept from the first enqueued XDP_TX frame and until we ring the doorbell. That is mostly fine, except for cases when in the same NAPI loop we have mixed XDP_TX and XDP_REDIRECT frames. So if enetc_xdp_xmit() gets called while we are holding the lock from the RX NAPI, then bam, deadlock. The naive answer could be 'just flush the XDP_TX frames first, then release the network stack's TX queue lock, then call xdp_do_flush_map()'. But even xdp_do_redirect() is capable of flushing the batched XDP_REDIRECT frames, so unless we unlock/relock the TX queue around xdp_do_redirect(), there simply isn't any clean way to protect XDP_TX from concurrent network stack .ndo_start_xmit() on another CPU. So we need to take a different approach, and that is to reserve two rings for the sole use of XDP. We leave TX rings 0..ndev->real_num_tx_queues-1 to be handled by the network stack, and we pick them from the end of the priv->tx_ring array. We make an effort to keep the mapping done by enetc_alloc_msix() which decides which CPU handles the TX completions of which TX ring in its NAPI poll. So the XDP TX ring of CPU 0 is handled by TX ring 6, and the XDP TX ring of CPU 1 is handled by TX ring 7. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16	net: enetc: increase TX ring size	Vladimir Oltean	1	-1/+1
	Now that commit d6a2829e82cf ("net: enetc: increase RX ring default size") has increased the RX ring size, it is quite easy to congest the TX rings when the traffic is predominantly XDP_TX, as the RX ring is quite a bit larger than the TX one. Since we bit the bullet and did the expensive thing already (larger RX rings consume more memory pages), it seems quite foolish to keep the TX rings small. So make them equally sized with TX. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16	net: enetc: remove unneeded xdp_do_flush_map()	Vladimir Oltean	1	-5/+0
	xdp_do_redirect already contains: -> dev_map_enqueue -> __xdp_enqueue -> bq_enqueue -> bq_xmit_all // if we have more than 16 frames So the logic from enetc will never be hit, because ENETC_DEFAULT_TX_WORK is 128. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>