linux-dev - Linux kernel development work

Age	Commit message (Collapse)	Author	Files	Lines
2021-11-01	tcp: rename sk_wmem_free_skb	Talal Ahmad	4	-12/+12
	sk_wmem_free_skb() is only used by TCP. Rename it to make this clear, and move its declaration to include/net/tcp.h Signed-off-by: Talal Ahmad <talalahmad@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Arjun Roy <arjunroy@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-01	netdevsim: fix uninit value in nsim_drv_configure_vfs()	Jakub Kicinski	1	-4/+2
	Build bot points out that I missed initializing ret after refactoring. Reported-by: kernel test robot <lkp@intel.com> Fixes: 1c401078bcf3 ("netdevsim: move details of vf config to dev") Link: https://lore.kernel.org/r/20211101221845.3188490-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-01	net/smc: Introduce tracepoint for smcr link down	Tony Lu	3	-2/+38
	SMC-R link down event is important to help us find links' issues, we should track this event, especially in the single nic mode, which means upper layer connection would be shut down. Then find out the direct link-down reason in time, not only increased the counter, also the location of the code who triggered this event. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01	net/smc: Introduce tracepoints for tx and rx msg	Tony Lu	4	-0/+45
	This introduce two tracepoints for smc tx and rx msg to help us diagnosis issues of data path. These two tracepoitns don't cover the path of CORK or MSG_MORE in tx, just the top half of data path. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01	net/smc: Introduce tracepoint for fallback	Tony Lu	4	-0/+59
	This introduces tracepoint for smc fallback to TCP, so that we can track which connection and why it fallbacks, and map the clcsocks' pointer with /proc/net/tcp to find more details about TCP connections. Compared with kprobe or other dynamic tracing, tracepoints are stable and easy to use. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01	selftests: add amt interface selftest script	Taehee Yoo	3	-0/+286
	This is selftest script for amt interface. This script includes basic forwarding scenarion and torture scenario. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01	amt: add mld report message handler	Taehee Yoo	2	-7/+441
	In the previous patch, igmp report handler was added. That handler can be used for mld too. So, it uses that common code to parse mld report message. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01	amt: add multicast(IGMP) report message handler	Taehee Yoo	2	-2/+1234
	amt 'Relay' interface manages multicast groups(igmp/mld) and sources. In order to manage, it should have the function to parse igmp/mld report messages. So, this adds the logic for parsing igmp report messages and saves them on their own data structure. struct amt_group_node means one group(igmp/mld). struct amt_source_node means one source. The same source can't exist in the same group. The same group can exist in the same tunnel because it manages the host address too. The group information is used when forwarding multicast data. If there are no groups in the specific tunnel, Relay doesn't forward it. Although Relay manages sources, it doesn't support the source filtering feature. Because the reason to manage sources is just that in order to manage group more correctly. In the next patch, MLD part will be added. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01	amt: add data plane of amt interface	Taehee Yoo	2	-1/+1288
	Before forwarding multicast traffic, the amt interface establishes between gateway and relay. In order to establish, amt defined some message type and those message flow looks like the below. Gateway Relay ------- ----- : Request : [1] \| N \| \|---------------------->\| \| Membership Query \| [2] \| N,MAC,gADDR,gPORT \| \|<======================\| [3] \| Membership Update \| \| ({G:INCLUDE({S})}) \| \|======================>\| \| \| ---------------------:-----------------------:--------------------- \| \| \| \| \| \| Multicast Data \| IP Packet(S,G) \| \| \| gADDR,gPORT \|<-----------------() \| \| IP Packet(S,G) \|<======================\| \| \| ()<-----------------\| \| \| \| \| \| \| ---------------------:-----------------------:--------------------- ~ ~ ~ Request ~ [4] \| N' \| \|---------------------->\| \| Membership Query \| [5] \| N',MAC',gADDR',gPORT' \| \|<======================\| [6] \| \| \| Teardown \| \| N,MAC,gADDR,gPORT \| \|---------------------->\| \| \| [7] \| Membership Update \| \| ({G:INCLUDE({S})}) \| \|======================>\| \| \| ---------------------:-----------------------:--------------------- \| \| \| \| \| \| Multicast Data \| IP Packet(S,G) \| \| \| gADDR',gPORT' \|<-----------------() \| \| IP Packet (S,G) \|<======================\| \| \| ()<-----------------\| \| \| \| \| \| \| ---------------------:-----------------------:--------------------- \| \| : : 1. Discovery - Sent by Gateway to Relay - To find Relay unique ip address 2. Advertisement - Sent by Relay to Gateway - Contains the unique IP address 3. Request - Sent by Gateway to Relay - Solicit to receive 'Query' message. 4. Query - Sent by Relay to Gateway - Contains General Query message. 5. Update - Sent by Gateway to Relay - Contains report message. 6. Multicast Data - Sent by Relay to Gateway - encapsulated multicast traffic. 7. Teardown - Not supported at this time. Except for the Teardown message, it supports all messages. In the next patch, IGMP/MLD logic will be added. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01	amt: add control plane of amt interface	Taehee Yoo	6	-0/+815
	It adds definitions and control plane code for AMT. this is very similar to udp tunneling interfaces such as gtp, vxlan, etc. In the next patch, data plane code will be added. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01	netdevsim: rename 'driver' entry points	Jakub Kicinski	3	-14/+14
	Rename functions serving as driver entry points from nsim_dev_... to nsim_drv_... this makes the API boundary between bus and dev clearer. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01	netdevsim: move max vf config to dev	Jakub Kicinski	3	-82/+64
	max_vfs is a strange little beast because the file hangs off of nsim's debugfs, but it configures a field in the bus device. Move it to dev.c, let's look at it as if the device driver was imposing VF limit based on FW info (like pci_sriov_set_totalvfs()). Again, when moving refactor the function not to hold the vfs lock pointlessly while parsing the input. Wrap the access from the read side in READ_ONCE() to appease concurrency checkers. Do not check if return value from snprintf() is negative... Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01	netdevsim: move details of vf config to dev	Jakub Kicinski	3	-73/+63
	Since "eswitch" configuration was added bus.c contains a lot of device details which really belong to dev.c. Restructure the code while moving it. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01	netdevsim: move vfconfig to nsim_dev	Jakub Kicinski	4	-91/+111
	When netdevsim got split into the faux bus vfconfig ended up in the bus device (think pci_dev) which is strange because it contains very networky not to say netdevy information. Move it to nsim_dev, which is the driver "priv" structure for the device. To make sure we don't race with probe/remove take the device lock (much like PCI). While at it remove the NULL-checking of vfconfigs. It appears to be pointless. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01	netdevsim: take rtnl_lock when assigning num_vfs	Jakub Kicinski	1	-3/+11
	Legacy VF NDOs look at num_vfs and then based on that index into vfconfig. If we don't rtnl_lock() num_vfs may get set to 0 and vfconfig freed/replaced while the NDO is running. We don't need to protect replacing vfconfig since it's only done when num_vfs is 0. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01	ethtool: don't drop the rtnl_lock half way thru the ioctl	Jakub Kicinski	3	-46/+43
	devlink compat code needs to drop rtnl_lock to take devlink->lock to ensure correct lock ordering. This is problematic because we're not strictly guaranteed that the netdev will not disappear after we re-lock. It may open a possibility of nested ->begin / ->complete calls. Instead of calling into devlink under rtnl_lock take a ref on the devlink instance and make the call after we've dropped rtnl_lock. We (continue to) assume that netdevs have an implicit reference on the devlink returned from ndo_get_devlink_port Note that ndo_get_devlink_port will now get called under rtnl_lock. That should be fine since none of the drivers seem to be taking serious locks inside ndo_get_devlink_port. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01	devlink: expose get/put functions	Jakub Kicinski	2	-3/+17
	Allow those who hold implicit reference on a devlink instance to try to take a full ref on it. This will be used from netdev code which has an implicit ref because of driver call ordering. Note that after recent changes devlink_unregister() may happen before netdev unregister, but devlink_free() should still happen after, so we are safe to try, but we can't just refcount_inc() and assume it's not zero. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01	ethtool: handle info/flash data copying outside rtnl_lock	Jakub Kicinski	1	-41/+69
	We need to increase the lifetime of the data for .get_info and .flash_update beyond their handlers inside rtnl_lock. Allocate a union on the heap and use it instead. Note that we now copy the ethcmd before we lookup dev, hopefully there is no crazy user space depending on error codes. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01	ethtool: push the rtnl_lock into dev_ethtool()	Jakub Kicinski	2	-3/+13
	Don't take the lock in net/core/dev_ioctl.c, we'll have things to do outside rtnl_lock soon. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01	net: mana: Support hibernation and kexec	Dexuan Cui	3	-52/+164
	Implement the suspend/resume/shutdown callbacks for hibernation/kexec. Add mana_gd_setup() and mana_gd_cleanup() for some common code, and use them in the mand_gd_* callbacks. Reuse mana_probe/remove() for the hibernation path. Signed-off-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01	net: mana: Improve the HWC error handling	Dexuan Cui	2	-44/+31
	Currently when the HWC creation fails, the error handling is flawed, e.g. if mana_hwc_create_channel() -> mana_hwc_establish_channel() fails, the resources acquired in mana_hwc_init_queues() is not released. Enhance mana_hwc_destroy_channel() to do the proper cleanup work and call it accordingly. Signed-off-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01	net: mana: Report OS info to the PF driver	Dexuan Cui	1	-0/+11
	The PF driver might use the OS info for statistical purposes. Signed-off-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01	net: mana: Fix the netdev_err()'s vPort argument in mana_init_port()	Dexuan Cui	1	-1/+2
	Use the correct port index rather than 0. Signed-off-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01	selftests: mptcp: more stable simult_flows tests	Paolo Abeni	2	-36/+72
	Currently the simult_flows.sh self-tests are not very stable, especially when running on slow VMs. The tests measure runtime for transfers on multiple subflows and check that the time is near the theoretical maximum. The current test infra introduces a bit of jitter in test runtime, due to multiple explicit delays. Additionally the runtime is measured by the shell script wrapper. On a slow VM, the script overhead is measurable and subject to relevant jitter. One solution to make the test more stable would be adding more slack to the expected time; that could possibly hide real regressions. Instead move the measurement inside the command doing the transfer, and drop most unneeded sleeps. Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01	selftests: mptcp: fix proto type in link_failure tests	Geliang Tang	1	-1/+1
	In listener_ns, we should pass srv_proto argument to mptcp_connect command, not cl_proto. Fixes: 7d1e6f1639044 ("selftests: mptcp: add testcase for active-back") Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01	nfp: flower: Allow ipv6gretap interface for offloading	Yu Xiao	3	-3/+4
	The tunnel_type check only allows for "netif_is_gretap", but for OVS the port is actually "netif_is_ip6gretap" when setting up GRE for ipv6, which means offloading request was rejected before. Therefore, adding "netif_is_ip6gretap" allow ipv6gretap interface for offloading. Signed-off-by: Yu Xiao <yu.xiao@corigine.com> Signed-off-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01	net: dsa: populate supported_interfaces member	Marek Behún	3	-0/+10
	Add a new DSA switch operation, phylink_get_interfaces, which should fill in which PHY_INTERFACE_MODE_* are supported by given port. Use this before phylink_create() to fill phylinks supported_interfaces member, allowing phylink to determine which PHY_INTERFACE_MODEs are supported. Signed-off-by: Marek Behún <kabel@kernel.org> [tweaked patch and description to add more complete support -- rmk] Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-01	netfilter: nft_payload: support for inner header matching / mangling	Pablo Neira Ayuso	3	-2/+58
	Allow to match and mangle on inner headers / payload data after the transport header. There is a new field in the pktinfo structure that stores the inner header offset which is calculated only when requested. Only TCP and UDP supported at this stage. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-11-01	netfilter: nf_tables: convert pktinfo->tprot_set to flags field	Pablo Neira Ayuso	7	-14/+19
	Generalize boolean field to store more flags on the pktinfo structure. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-11-01	netfilter: nft_meta: add NFT_META_IFTYPE	Pablo Neira Ayuso	2	-2/+8
	Generalize NFT_META_IIFTYPE to NFT_META_IFTYPE which allows you to match on the interface type of the skb->dev field. This field is used by the netdev family to add an implicit dependency to skip non-ethernet packets when matching on layer 3 and 4 TCP/IP header fields. For backward compatibility, add the NFT_META_IIFTYPE alias to NFT_META_IFTYPE. Add __NFT_META_IIFTYPE, to be used by userspace in the future to match specifically on the iiftype. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-11-01	netfilter: conntrack: set on IPS_ASSURED if flows enters internal stream state	Pablo Neira Ayuso	1	-2/+5
	The internal stream state sets the timeout to 120 seconds 2 seconds after the creation of the flow, attach this internal stream state to the IPS_ASSURED flag for consistent event reporting. Before this patch: [NEW] udp 17 30 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 [UNREPLIED] src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [UPDATE] udp 17 30 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [UPDATE] udp 17 30 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [ASSURED] [DESTROY] udp 17 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [ASSURED] Note IPS_ASSURED for the flow not yet in the internal stream state. after this update: [NEW] udp 17 30 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 [UNREPLIED] src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [UPDATE] udp 17 30 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [UPDATE] udp 17 120 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [ASSURED] [DESTROY] udp 17 src=10.246.11.13 dst=216.239.35.0 sport=37282 dport=123 src=216.239.35.0 dst=10.246.11.13 sport=123 dport=37282 [ASSURED] Before this patch, short-lived UDP flows never entered IPS_ASSURED, so they were already candidate flow to be deleted by early_drop under stress. Before this patch, IPS_ASSURED is set on regardless the internal stream state, attach this internal stream state to IPS_ASSURED. packet #1 (original direction) enters NEW state packet #2 (reply direction) enters ESTABLISHED state, sets on IPS_SEEN_REPLY paclet #3 (any direction) sets on IPS_ASSURED (if 2 seconds since the creation has passed by). Reported-by: Maciej Żenczykowski <zenczykowski@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-29	net: bridge: switchdev: fix shim definition for br_switchdev_mdb_notify	Vladimir Oltean	1	-12/+11
	br_switchdev_mdb_notify() is conditionally compiled only when CONFIG_NET_SWITCHDEV=y and CONFIG_BRIDGE_IGMP_SNOOPING=y. It is called from br_mdb.c, which is conditionally compiled only when CONFIG_BRIDGE_IGMP_SNOOPING=y. The shim definition of br_switchdev_mdb_notify() is therefore needed for the case where CONFIG_NET_SWITCHDEV=n, however we mistakenly put it there for the case where CONFIG_BRIDGE_IGMP_SNOOPING=n. This results in build failures when CONFIG_BRIDGE_IGMP_SNOOPING=y and CONFIG_NET_SWITCHDEV=n. To fix this, put the shim definition right next to br_switchdev_fdb_notify(), which is properly guarded by NET_SWITCHDEV=n. Since this is called only from br_mdb.c, we need not take any extra safety precautions, when NET_SWITCHDEV=n and BRIDGE_IGMP_SNOOPING=n, this shim definition will be absent but nobody will be needing it. Fixes: 9776457c784f ("net: bridge: mdb: move all switchdev logic to br_switchdev.c") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20211029223606.3450523-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-29	bnxt_en: Remove not used other ULP define	Leon Romanovsky	1	-2/+1
	There is only one bnxt ULP in the upstream kernel and definition for other ULP can be safely removed. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Link: https://lore.kernel.org/r/3a8ea720b28ec4574648012d2a00208f1144eff5.1635527693.git.leonro@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-29	netdevsim: remove max_vfs dentry	Jakub Kicinski	2	-6/+3
	Commit d395381909a3 ("netdevsim: Add max_vfs to bus_dev") added this file and saved the dentry for no apparent reason. Link: https://lore.kernel.org/r/20211028211753.22612-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-29	net/mlx5: Support internal port as decap route device	Ariel Levkovich	2	-13/+40
	When performing route device lookup for decap action, support the case of ovs internal port as the lookup result. In such case, an internal port struct is mapped and attached to the flow attributes so that the source port matching of the rule will match on the internal port's metadata value. Signed-off-by: Ariel Levkovich <lariel@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-29	net/mlx5e: Term table handling of internal port rules	Ariel Levkovich	1	-2/+2
	Adjust termination table logic to handle rules which involve internal port as filter or forwarding device. For cases where the rule forwards from internal port to uplink, always choose to go via termination table. This is because it is not known from where the packet originally arrived to the internal port and it is possible that it came from the uplink itself, in which case a term table is required to perform hairpin. If the packet arrived from a vport, going via term table has no effect. For cases where the rule forwards to an internal port from uplink the rep pointer will point to the uplink rep, avoid going via termination table as it is not required. Signed-off-by: Ariel Levkovich <lariel@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-29	net/mlx5e: Add indirect tc offload of ovs internal port	Ariel Levkovich	4	-19/+82
	Register callbacks for tc blocks of ovs internal port devices. This allows an indirect offloading rules that apply on such devices as the filter device. In case a rule is added to a tc block of an internal port, the mlx5 driver will implicitly add a matching on the internal port's unique vport metadata value to the rule's matching list. Therefore, only packets that previously hit a rule that redirects to an internal port and got the vport metadata overwritten to the internal port's unique metadata, can match on such indirect rule. Offloading of both ingress and egress tc blocks of internal ports is supported as opposed to other devices where only ingress block offloading is supported. Signed-off-by: Ariel Levkovich <lariel@nvidia.com> Reviewed-by: Paul Blakey <paulb@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-29	net/mlx5e: Offload internal port as encap route device	Ariel Levkovich	3	-3/+41
	When pefroming encap action, a route lookup is performed to find the routing device the packet should be forwarded to after the encapsulation. This is the device that has the local tunnel ip address. This change adds support to offload an encap rule where the route device ends up being an ovs internal port. In such case, the driver will add a HW rule that will encapsulate the packet with the tunnel header and will overwrite the vport metadata in reg_c0 to the internal port metadata value. Finally, the packet will be forwarded to the root table to be processed again with the indication that it came from an internal port. Signed-off-by: Ariel Levkovich <lariel@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-29	net/mlx5e: Offload tc rules that redirect to ovs internal port	Ariel Levkovich	6	-4/+141
	Allow offloading rules that redirect to ovs internal port ingress and egress. To support redirect to ingress device, offloading of REDIRECT_INGRESS action is added. When a tc rule redirects to ovs internal port, the hw rule will overwrite the input vport value in reg_c0 with a new vport metadata value that is mapped for this internal port using the internal port mapping api that is introduce in previous patches. After that the hw rule will redirect the packet to the root table to continue processing with the new vport metadata value. The new vport metadata value indicates that this packet is now arriving through an internal port and therefore should be processed using rules that apply on the same internal port as the filter device. Therefore, following rules that apply on this internal port will have to match on the same vport metadata value as part of their matching keys to make sure the packet belongs to the internal port. Signed-off-by: Ariel Levkovich <lariel@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-29	net/mlx5e: Accept action skbedit in the tc actions list	Ariel Levkovich	1	-0/+7
	Setting the skb packet type field to host is usually done when performing forwarding to ingress device. This is required since the receive handling that is used by the redirect to ingress action checks whether the packet doesn't belong to this host and drops the packet in such case. In order to be able to offload action redirect ingress, tc offload code needs to accept the skbedit ptype action as well. There's no special handling in HW for such action since it will be followed by a redirect action and therefore, this code only allows us to accept such action in the actions list but not performing anything specific in HW for it. Signed-off-by: Ariel Levkovich <lariel@nvidia.com> Reviewed-by: Paul Blakey <paulb@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-29	net/mlx5: E-Switch, Add ovs internal port mapping to metadata support	Ariel Levkovich	11	-9/+607
	Adding infrastructure to map ovs internal port device to vport match metadata to support offload of rules with internal port as the filter device or as the destination device. The infrastructure allows adding and removing internal port device to an eswitch database and getting a unique vport metadata value to be placed and match on in reg_c0 when offloading rules that are coming from or going to an internal port. The new int port metadata can be written to the source port register in HW to indicate that current source port of the packet is the internal port and not one of the actual HW vports (uplink or VF). Using this method, it is possible to offload TC rules with an OVS internal port as their destination port (overwriting the src vport register) or as the filter port (matching on the value of the src vport register and making sure it matches to the internal port's value). There is also a need to handle a miss case where the packet's src port value was changed in HW to an internal port but a following rule which matches on this new src port value wasn't found in HW. In such case, the packet will be forwarded to the driver with metadata which allows driver to restore the info of the internal port's netdevice. Once this info is restored, the uplink driver can forward the packet to the relevant netdevice in SW. Signed-off-by: Ariel Levkovich <lariel@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-29	net/mlx5e: Use generic name for the forwarding dev pointer	Ariel Levkovich	2	-5/+5
	Rename tun_dev to fwd_dev within mlx5e_tc_update_priv struct since future implementation may introduce other device types which the handler is forwarding to. Signed-off-by: Ariel Levkovich <lariel@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-29	net/mlx5e: Refactor rx handler of represetor device	Ariel Levkovich	4	-45/+32
	Move the ownership of skb forwarding to network stack to the tc update_skb handler as different cases will require different handling of the skb. While the tc handler will take care of the various cases and properly handle the handover of the skb to the network stack and freeing the skb, the main rx handler will be kept clean from branches and usage of flags. Signed-off-by: Ariel Levkovich <lariel@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-29	net/mlx5: DR, Add check for unsupported fields in match param	Muhammad Sammar	4	-133/+172
	When a matcher is being built, we "consume" (clear) mask fields one by one, and to verify that we do support all the required fields we check if the whole mask was consumed, else the matching request includes unsupported fields. Signed-off-by: Muhammad Sammar <muhammads@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
2021-10-29	net/mlx5: Allow skipping counter refresh on creation	Paul Blakey	3	-4/+16
	CT creates a counter for each CT rule, and for each such counter, fs_counters tries to queue mlx5_fc_stats_work() work again via mod_delayed_work(0) call to refresh all counters. This call has a large performance impact when reaching high insertion rate and accounts for ~8% of the insertion time when using software steering. Allow skipping the refresh of all counters during counter creation. Change CT to use this refresh skipping for it's counters. Signed-off-by: Paul Blakey <paulb@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-29	net/mlx5e: IPsec: Refactor checksum code in tx data path	Raed Salem	2	-18/+28
	Part of code that is related solely to IPsec is always compiled in the driver code regardless if the IPsec functionality is enabled or disabled in the driver code, this will add unnecessary branch in case IPsec is disabled at Tx data path. Move IPsec related code to IPsec related file such that in case of IPsec is disabled and because of unlikely macro the compiler should be able to optimize and omit the checksum IPsec code all together from Tx data path Signed-off-by: Raed Salem <raeds@nvidia.com> Reviewed-by: Emeel Hakim <ehakim@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-29	net/mlx5: CT: Remove warning of ignore_flow_level support for VFs	Paul Blakey	2	-20/+27
	ignore_flow_level isn't supported for VFs, and so it causes post_act and ct to warn about it. Instead of disabling CT for VFs, and a driver update will be need to enable CT again once firmware support this, remove this warning specifically for VFs. This way, it could be automatically enabled on future firmwares where VFs support ignore_flow_level capability. Signed-off-by: Paul Blakey <paulb@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-29	net/mlx5: Add esw assignment back in mlx5e_tc_sample_unoffload()	Nathan Chancellor	1	-0/+1
	Clang warns: drivers/net/ethernet/mellanox/mlx5/core/en/tc/sample.c:635:34: error: variable 'esw' is uninitialized when used here [-Werror,-Wuninitialized] mlx5_eswitch_del_offloaded_rule(esw, sample_flow->pre_rule, sample_flow->pre_attr); ^~~ drivers/net/ethernet/mellanox/mlx5/core/en/tc/sample.c:626:26: note: initialize the variable 'esw' to silence this warning struct mlx5_eswitch *esw; ^ = NULL 1 error generated. It appears that the assignment should have been shuffled instead of removed outright like in mlx5e_tc_sample_offload(). Add it back so there is no use of esw uninitialized. Fixes: a64c5edbd20e ("net/mlx5: Remove unnecessary checks for slow path flag") Link: https://github.com/ClangBuiltLinux/linux/issues/1494 Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-29	iavf: Fix kernel BUG in free_msi_irqs	Przemyslaw Patynowski	2	-0/+56
	Fix driver not freeing VF's traffic irqs, prior to calling pci_disable_msix in iavf_remove. There were possible 2 erroneous states in which, iavf_close would not be called. One erroneous state is fixed by allowing netdev to register, when state is already running. It was possible for VF adapter to enter state loop from running to resetting, where iavf_open would subsequently fail. If user would then unload driver/remove VF pci, iavf_close would not be called, as the netdev was not registered, leaving traffic pcis still allocated. Fixed this by breaking loop, allowing netdev to open device when adapter state is __IAVF_RUNNING and it is not explicitily downed. Other possiblity is entering to iavf_remove from __IAVF_RESETTING state, where iavf_close would not free irqs, but just return 0. Fixed this by checking for last adapter state and then removing irqs. Kernel panic: [ 2773.628585] kernel BUG at drivers/pci/msi.c:375! ... [ 2773.631567] RIP: 0010:free_msi_irqs+0x180/0x1b0 ... [ 2773.640939] Call Trace: [ 2773.641572] pci_disable_msix+0xf7/0x120 [ 2773.642224] iavf_reset_interrupt_capability.part.41+0x15/0x30 [iavf] [ 2773.642897] iavf_remove+0x12e/0x500 [iavf] [ 2773.643578] pci_device_remove+0x3b/0xc0 [ 2773.644266] device_release_driver_internal+0x103/0x1f0 [ 2773.644948] pci_stop_bus_device+0x69/0x90 [ 2773.645576] pci_stop_and_remove_bus_device+0xe/0x20 [ 2773.646215] pci_iov_remove_virtfn+0xba/0x120 [ 2773.646862] sriov_disable+0x2f/0xe0 [ 2773.647531] ice_free_vfs+0x2f8/0x350 [ice] [ 2773.648207] ice_sriov_configure+0x94/0x960 [ice] [ 2773.648883] ? _kstrtoull+0x3b/0x90 [ 2773.649560] sriov_numvfs_store+0x10a/0x190 [ 2773.650249] kernfs_fop_write+0x116/0x190 [ 2773.650948] vfs_write+0xa5/0x1a0 [ 2773.651651] ksys_write+0x4f/0xb0 [ 2773.652358] do_syscall_64+0x5b/0x1a0 [ 2773.653075] entry_SYSCALL_64_after_hwframe+0x65/0xca Fixes: 22ead37f8af8 ("i40evf: Add longer wait after remove module") Signed-off-by: Przemyslaw Patynowski <przemyslawx.patynowski@intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-10-29	iavf: Add helper function to go from pci_dev to adapter	Karen Sornek	1	-7/+17
	Add helper function to go from pci_dev to adapter to make work simple - to go from a pci_dev to the adapter structure and make netdev assignment instead of having to go to the net_device then the adapter. Signed-off-by: Brett Creeley <brett.creeley@intel.com> Signed-off-by: Karen Sornek <karen.sornek@intel.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>