path: root/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
2019-03-11  net/mlx5e: IPoIB, Fix RX checksum statistics update  [Feras Daoud, 1 file, -3/+8]
Update the RX checksum only if the feature is enabled. Fixes: 9d6bd752c63c ("net/mlx5e: IPoIB, RX handler") Signed-off-by: Feras Daoud <ferasda@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-01-25  net/mlx5e: Take CQ decompress fields into a separate structure  [Tariq Toukan, 1 file, -55/+72]
Only the Receive CQ makes use of these fields. Take them out into a separate struct and save space in the generic CQ structure. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-01-25  net/mlx5e: RX, Make sure packet header does not cross page boundary  [Tariq Toukan, 1 file, -23/+4]
In the non-linear SKB memory scheme of Striding RQ, a packet header could cross a page boundary. This requires special care in the fast path that costs LoC, additional runtime instructions and branches. It can happen when the header (up to 256B) does not fit in a single stride. Avoid this by working with a stride size that fits the maximum possible header: the stride is increased from 64B to 256B. Performance: tested packet rate for UDP streams, single ring, on ConnectX-5. Configuration: Striding RQ and LRO ON (to enable the non-linear SKB scheme), GRO OFF, early drop by TC rule. 64B packets: 4x worse memory utilization, no page-crossing headers - no degradation (5,887,305 pps); the worse memory utilization is compensated for by the branch tests saved. 192B packets: 1.33x worse memory utilization, no page-crossing headers - before: 5,727,252 pps, after: 5,777,037 pps, ~1% gain. 256B packets: same memory utilization, no page-crossing headers - before: 5,691,885 pps, after: 5,748,007 pps, ~1% gain. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-01-18  net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames  [Cong Wang, 1 file, -0/+13]
When an ethernet frame is padded to meet the minimum ethernet frame size, the padding octets are not covered by the hardware checksum. Fortunately the padding octets are usually zeros, which don't affect the checksum. However, we have a switch which pads non-zero octets, and this triggers kernel hardware checksum faults repeatedly. Prior to commit 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE ..."), the skb checksum was forced to CHECKSUM_NONE when padding was detected. After it, we need to keep skb->csum updated, like what we do for RXFCS. However, fixing up CHECKSUM_COMPLETE requires verifying and parsing IP headers, and it is not worth the effort as the packets are so small that CHECKSUM_COMPLETE can't save anything. Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends") Cc: Eric Dumazet <edumazet@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Nikola Ciprich <nikola.ciprich@linuxbox.cz> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
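A minimal sketch of the approach described above; the short_frame() helper and the csum_unnecessary label are illustrative names, not taken from the log:

    /* Padded frames may carry non-zero padding that the HW checksum does
     * not cover; don't trust CHECKSUM_COMPLETE for them.
     */
    #define short_frame(size) ((size) <= ETH_ZLEN + ETH_FCS_LEN)

    if (short_frame(skb->len))
            goto csum_unnecessary;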
2018-12-20  net/mlx5e: XDP, Precede XDP-related operations in RQ poll by a loaded program check  [Tariq Toukan, 1 file, -10/+2]
At the end of the RQ polling loop, some XDP-related operations might be required. Before checking them one by one, check if an XDP program is even loaded. Combine all the checks and operations in a single function in xdp files. This saves unnecessary checks for non-XDP flows. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-12-20  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  [David S. Miller, 1 file, -4/+6]
Lots of conflicts, but happily all cases of overlapping changes, parallel adds, things of that nature. Thanks to Stephen Rothwell, Saeed Mahameed, and others for their guidance in these resolutions. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19  net/mlx5e: RX, Fix wrong early return in receive queue poll  [Tariq Toukan, 1 file, -4/+6]
When the completion queue of the RQ is empty, do not immediately return. If left-over decompressed CQEs (from the previous cycle) were processed, we need to go to the finalization part of the poll function. The bug exists only when CQE compression is turned ON. This solves the following issue: mlx5_core 0000:82:00.1: mlx5_eq_int:544:(pid 0): CQ error on CQN 0xc08, syndrome 0x1 mlx5_core 0000:82:00.1 p4p2: mlx5e_cq_error_event: cqn=0x000c08 event=0x04 Fixes: 4b7dfc992514 ("net/mlx5e: Early-return on empty completion queues") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-12-10  Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux  [Saeed Mahameed, 1 file, -5/+5]
mlx5-next is a shared branch with the rdma subtree, used to avoid mlx5 rdma vs. netdev conflicts. Highlights: 1) RDMA ODP (On Demand Paging) improvements and moving ODP logic to the mlx5 RDMA driver 2) Improved mlx5 core driver and device events handling, and an API for upper layers to subscribe to device events 3) RDMA-only code cleanup from mlx5 core 4) Add a helper to get the CQE opcode 5) Rework handling of port module events 6) Shared mlx5_ifc.h updates to avoid conflicts Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-12-09  net/mlx5: Use helper to get CQE opcode  [Tariq Toukan, 1 file, -5/+5]
Introduce and use a helper that extracts the opcode from a CQE (completion queue entry) structure. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
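A sketch of the helper: in the mlx5 CQE layout the opcode occupies the upper four bits of the op_own byte, so the accessor is a one-liner (the exact header it lands in is not stated in this log):

    /* Extract the opcode from a 64B completion queue entry. */
    static inline u8 get_cqe_opcode(const struct mlx5_cqe64 *cqe)
    {
            return cqe->op_own >> 4;
    }

Callers then use get_cqe_opcode(cqe) instead of open-coding the bit manipulation.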
2018-11-30  mlx5: fix get_ip_proto()  [Cong Wang, 1 file, -3/+3]
The IP header is not necessarily located right after struct ethhdr; there could be multiple 802.1Q headers in between. This is why we call __vlan_get_protocol(). Fixes: fe1dc069990c ("net/mlx5e: don't set CHECKSUM_COMPLETE on SCTP packets") Cc: Alaa Hleihel <alaa@mellanox.com> Cc: Or Gerlitz <ogerlitz@mellanox.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
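A sketch of the fixed helper, assuming the caller has already resolved the network header depth via __vlan_get_protocol() and passes it in rather than assuming the IP header follows struct ethhdr directly:

    static u8 get_ip_proto(struct sk_buff *skb, int network_depth, __be16 proto)
    {
            /* network_depth accounts for any 802.1Q/802.1ad tags in front
             * of the IP header.
             */
            void *ip_p = skb->data + network_depth;

            return (proto == htons(ETH_P_IP)) ?
                    ((struct iphdr *)ip_p)->protocol :
                    ((struct ipv6hdr *)ip_p)->nexthdr;
    }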
2018-11-19  net/mlx5e: RX, verify received packet size in Linear Striding RQ  [Moshe Shemesh, 1 file, -0/+6]
In case of striding RQ, we use MPWRQ (Multi Packet WQE RQ), which means that a WQE (RX descriptor) can be used for many packets, so the WQE is much bigger than the MTU. In virtualization setups where the port MTU can be larger than the VF MTU, a received packet bigger than the VF MTU won't be dropped by HW as too large for the receive WQE. If we use a linear SKB in striding RQ, since each stride has room for an MTU-size payload plus skb info, an oversized packet can lead to a crash by crossing the allocated page boundary upon the call to build_skb. So the driver needs to check the packet size and drop such packets. Introduce a new SW rx counter, rx_oversize_pkts_sw_drop, which counts the number of packets dropped by the driver for being too large. As a new field is added to the RQ struct, re-open the channels whenever this field is being used in the datapath (i.e., in the case of linear Striding RQ). Fixes: 619a8f2a42f1 ("net/mlx5e: Use linear SKB in Striding RQ") Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-31  net/mlx5e: fix csum adjustments caused by RXFCS  [Eric Dumazet, 1 file, -36/+9]
As shown by Dimitris, we need to use csum_block_add() instead of csum_add() when adding the FCS contribution to the skb csum. Before 4.18 (more exactly commit 88078d98d1bb "net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends"), the whole skb csum was thrown away, so RXFCS changes were ignored. Then, before commit d55bef5059dd ("net: fix pskb_trim_rcsum_slow() with odd trim offset"), the mlx5 and pskb_trim_rcsum_slow() bugs were canceling each other. Now that we have fixed pskb_trim_rcsum_slow(), we need to fix mlx5. Note that this patch also rewrites mlx5e_get_fcs() to: - Use skb_header_pointer() instead of reinventing it. - Use __get_unaligned_cpu32() to avoid possible non-aligned accesses, as Dimitris pointed out. Fixes: 902a545904c7 ("net/mlx5e: When RXFCS is set, add FCS data into checksum calculation") Reported-by: Paweł Staszewski <pstaszewski@itcare.pl> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Eran Ben Elisha <eranbe@mellanox.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Cc: Dimitris Michailidis <dmichail@google.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Paweł Staszewski <pstaszewski@itcare.pl> Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com> Tested-By: Maria Pasechnik <mariap@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
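A sketch of the two pieces described: an FCS read that works for both linear and fragmented SKBs, and the offset-aware checksum fold (the exact integration point in the RX csum path is an assumption):

    static u32 mlx5e_get_fcs(const struct sk_buff *skb)
    {
            const void *fcs_bytes;
            u32 _fcs_bytes;

            /* The FCS is the last 4 bytes of the frame; skb_header_pointer()
             * copies it out if the SKB is fragmented, and the unaligned read
             * keeps strict-alignment architectures happy.
             */
            fcs_bytes = skb_header_pointer(skb, skb->len - ETH_FCS_LEN,
                                           ETH_FCS_LEN, &_fcs_bytes);

            return __get_unaligned_cpu32(fcs_bytes);
    }

    /* csum_block_add() folds the FCS in at its actual byte offset. */
    skb->csum = csum_block_add(skb->csum,
                               (__force __wsum)mlx5e_get_fcs(skb),
                               skb->len - ETH_FCS_LEN);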
2018-10-25  drivers: net: remove <net/busy_poll.h> inclusion when not needed  [Eric Dumazet, 1 file, -1/+0]
Drivers using generic NAPI interface no longer need to include <net/busy_poll.h>, since busy polling was moved to core networking stack long ago. See commit 79e7fff47b7b ("net: remove support for per driver ndo_busy_poll()") for reference. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-19  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  [David S. Miller, 1 file, -7/+5]
net/sched/cls_api.c has overlapping changes to a call to nlmsg_parse(): one (from 'net') added rtm_tca_policy instead of NULL to the 5th argument, and another (from 'net-next') added cb->extack instead of NULL to the 6th argument. net/ipv4/ipmr_base.c is a case of a bug fix in 'net' being done to code which moved (to mr_table_dump) in 'net-next'. Thanks to David Ahern for the heads up. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-10  net/mlx5: WQ, fixes for fragmented WQ buffers API  [Tariq Toukan, 1 file, -7/+5]
The mlx5e netdevice used to calculate fragment edges by a call to mlx5_wq_cyc_get_frag_size(). This calculation did not give the correct indication for queues smaller than a PAGE_SIZE (broken by default on PowerPC, where PAGE_SIZE == 64KB). Here it is replaced by the correct new calls/API. Since (TX/RX) Work Queue buffers are fragmented, we introduce changes to the API in the core driver, so that it gets a stride index and returns the index of the last stride on the same fragment, plus an additional wrapping function that returns the number of physically contiguous strides that can be written contiguously to the work queue. This obsoletes the following API functions, and their buggy usage in the EN driver: * mlx5_wq_cyc_get_frag_size() * mlx5_wq_cyc_ctr2fragix() The new API improves modularity and hides the details of such calculation from the mlx5e netdevice and mlx5_ib rdma drivers. The new calculation is also more efficient, and improves performance as follows: Packet rate test: pktgen, UDP / IPv4, 64byte, single ring, 8K ring size. Before: 16,477,619 pps After: 17,085,793 pps 3.7% improvement Fixes: 3a2f70331226 ("net/mlx5: Use order-0 allocations for all WQ types") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-01  net/mlx5e: Allow reporting of checksum unnecessary  [Or Gerlitz, 1 file, -0/+3]
Currently we practically never report checksum unnecessary, because for all IP packets we take the checksum complete path. Enable non-default runs with reporting checksum unnecessary, using an ethtool private flag. This can be useful for performance evals and other explorations. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-10-01  net/mlx5e: Enable reporting checksum unnecessary also for L3 packets  [Or Gerlitz, 1 file, -1/+2]
We can report checksum unnecessary also when the L3 checksum flag on the cqe is set and there's no L4 header. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
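A sketch of the relaxed condition; CQE_L3_OK/CQE_L4_OK refer to the mlx5 CQE status flags, while l4_present is only a placeholder for however the caller detects an L4 header:

    /* Report CHECKSUM_UNNECESSARY when HW validated L4, or when it
     * validated L3 and the packet carries no L4 header at all.
     */
    if ((cqe->hds_ip_ext & CQE_L4_OK) ||
        ((cqe->hds_ip_ext & CQE_L3_OK) && !l4_present)) {
            skb->ip_summed = CHECKSUM_UNNECESSARY;
            stats->csum_unnecessary++;
            return;
    }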
2018-09-05  net/mlx5e: don't set CHECKSUM_COMPLETE on SCTP packets  [Alaa Hleihel, 1 file, -0/+12]
CHECKSUM_COMPLETE is not applicable to SCTP protocol. Setting it for SCTP packets leads to CRC32c validation failure. Fixes: bbceefce9adf ("net/mlx5e: Support RX CHECKSUM_COMPLETE") Signed-off-by: Alaa Hleihel <alaa@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
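A sketch of the guard described above; network_depth and proto are assumed to come from the driver's earlier ethertype check, and the label name is illustrative (get_ip_proto() is the helper later fixed in the 2018-11-30 entry):

    /* CHECKSUM_COMPLETE is a ones'-complement sum over the payload and
     * cannot stand in for SCTP's CRC32c; skip it for SCTP packets.
     */
    if (get_ip_proto(skb, network_depth, proto) == IPPROTO_SCTP)
            goto csum_unnecessary;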
2018-09-05  net/mlx5e: Set ECN for received packets using CQE indication  [Natali Shechtman, 1 file, -5/+30]
In a multi-host (MH) NIC scheme, a single HW port serves multiple hosts or sockets on the same host. The HW uses a mechanism in the PCIe buffer which monitors the amount of consumed PCIe buffers per host. In a certain configuration, under congestion, the HW emulates a switch doing ECN marking on packets, using an ECN indication on the completion descriptor (CQE). The driver needs to set the ECN bits on the packet SKB, such that the network stack can react to that; this commit does that. Signed-off-by: Natali Shechtman <natali@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
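A sketch of the SKB-side marking using the stack's ECN helpers; the is_last_ethertype_ip() helper and the way the congestion indication is read from the CQE are assumptions here:

    static void mlx5e_enable_ecn(struct mlx5e_rq *rq, struct sk_buff *skb)
    {
            int network_depth = 0;
            __be16 proto;
            void *ip;

            if (unlikely(!is_last_ethertype_ip(skb, &network_depth, &proto)))
                    return;

            ip = skb->data + network_depth;

            /* Set the CE codepoint so the receiver's congestion control
             * reacts to the HW's ECN indication from the CQE.
             */
            if (proto == htons(ETH_P_IP))
                    IP_ECN_set_ce((struct iphdr *)ip);
            else
                    IP6_ECN_set_ce(skb, (struct ipv6hdr *)ip);
    }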
2018-09-02  net/mlx5e: IPoIB, Use priv stats in completion rx flow  [Feras Daoud, 1 file, -1/+2]
Since the RQs are shared between all pkey interfaces, the stats should be taken from where the per-ring stats are stored instead of the parent RQ. Fixes: 4c6c615e3f30 ("net/mlx5e: IPoIB, Add PKEY child interface nic profile") Signed-off-by: Feras Daoud <ferasda@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-26  net/mlx5e: RX, Prefetch the xdp_frame data area  [Tariq Toukan, 1 file, -0/+1]
A loaded XDP program might write to the xdp_frame data area, prefetchw() it to avoid a potential cache miss. Performance tests: ConnectX-5, XDP_TX packet rate, single ring. CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz Before: 13,172,976 pps After: 13,456,248 pps 2% gain. Fixes: 22f453988194 ("net/mlx5e: Support XDP over Striding RQ") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
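The change itself is a single prefetch; a sketch of the idea at the point where the receive buffer is about to be handed to XDP (va standing for the start of the buffer is an assumption about the surrounding code):

    /* A loaded XDP program might write to the xdp_frame data area;
     * prefetch it for write, not just read, to avoid a cache miss.
     */
    prefetchw(va);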
2018-07-26  net/mlx5e: Re-order fields of struct mlx5e_xdpsq  [Tariq Toukan, 1 file, -4/+4]
In the downstream patch that adds support for XDP_REDIRECT-out, the XDP xmit frame function doesn't share the same run context as the NAPI that polls the XDP-SQ completion queue. Hence, we need to re-order the XDP-SQ fields to avoid cacheline false sharing. Take redirect_flush and doorbell out of DB, into separate cachelines. Add a cacheline breaker within the stats struct. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-07-26  net/mlx5e: Move XDP related code into new XDP files  [Tariq Toukan, 1 file, -206/+2]
Take the XDP code out of the general EN header and RX file into new XDP files. Currently, an XDP-SQ resides only within an RQ and is used from a single flow (XDP_TX) triggered upon RX completions. In a downstream patch, an additional type of XDP-SQ instance will be introduced and used for the XDP_REDIRECT flow, totally unrelated to the RX context. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-07-26  net/mlx5e: Do not recycle RX pages in interface down flow  [Tariq Toukan, 1 file, -17/+20]
Keep all page-pool recycle calls within NAPI context. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-07-26  net/mlx5e: Replace call to MPWQE free with dealloc in interface down flow  [Tariq Toukan, 1 file, -1/+1]
No need to expose the MPWQE free function to the control path. The dealloc function is already exposed; use it. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-07-16  net/mlx5e: IPsec, fix byte count in CQE  [Boris Pismenny, 1 file, -1/+1]
This patch fixes the byte count indication in CQE for processed IPsec packets that contain a metadata header. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-16  net/mlx5e: TLS, add Innova TLS rx data path  [Boris Pismenny, 1 file, -0/+6]
Implement the TLS rx offload data path according to the requirements of the TLS generic NIC offload infrastructure. A special metadata ethertype is used to pass information to the hardware. When the hardware loses synchronization, a special resync request metadata message is used to request a resync. Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-06-28  net/mlx5e: Add counter for MPWQE filler strides  [Tariq Toukan, 1 file, -1/+4]
Add ethtool counter to indicate the number of strides consumed by filler CQEs. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-06-28  net/mlx5e: Add a counter for congested UMRs  [Tariq Toukan, 1 file, -0/+2]
Add per-ring and global ethtool counters for congested UMR requests. These events indicate congestion in the UMR handlers in HW. Such an event is inferred when there is an outstanding UMR post, yet the SW has consumed at least two additional MPWQEs in the meanwhile. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-06-28  net/mlx5e: Add XDP_TX completions statistics  [Tariq Toukan, 1 file, -0/+2]
Add per-ring and global ethtool counters for XDP_TX completions. This helps us monitor and analyze XDP_TX flow performance. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-06-28  net/mlx5e: RX, Use existing WQ local variable  [Tariq Toukan, 1 file, -1/+1]
The local variable 'wq' already points to &sq->wq; use it. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-06-01  net/mlx5e: RX, Enhance legacy Receive Queue memory scheme  [Tariq Toukan, 1 file, -60/+147]
Enhance the memory scheme of the legacy RQ, such that only order-0 pages are used. Whenever possible, prefer using a linear SKB, and build it wrapping the WQE buffer. Otherwise (for example, jumbo frames on x86), use a non-linear SKB, with as many frags as needed. In this case, multiple WQE scatter entries are used, up to a maximum of 4 frags and 10KB of MTU. This implied removing support for HW LRO in legacy RQ, as it would require a large number of page allocations and scatter entries per WQE on archs with PAGE_SIZE = 4KB, yielding bad performance. In earlier patches, we guaranteed that all completions are in-order, and that we use a cyclic WQ. This creates an opportunity for a performance optimization: the mapping between a struct mlx5e_dma_info and the WQEs (struct mlx5e_wqe_frag_info) pointing to it is constant across different cycles of a WQ. This allows initializing the mapping at the time of RQ creation, and not handling it in the datapath. A struct mlx5e_dma_info that is shared between different WQEs is allocated by the first WQE, and freed by the last one. This implies an important requirement: WQEs that share the same struct mlx5e_dma_info must be posted within the same NAPI. Otherwise, upon completion, struct mlx5e_wqe_frag_info would mistakenly point to the new struct mlx5e_dma_info, not the one that was posted (and the HW wrote to). This bulking requirement is actually good also for performance reasons, hence we extend the bulk beyond the minimal requirement above. With this memory scheme, the RQ's memory footprint is reduced by a factor of 2 on x86, and by a factor of 32 on PowerPC. The same factors apply to the number of pages in a GRO session. Performance tests: ConnectX-4, single core, single RX ring, default MTU. x86: CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz Packet rate (early drop in TC): no degradation TCP streams: ~5% improvement PowerPC: CPU: POWER8 (raw), altivec supported Packet rate (early drop in TC): 20% gain TCP streams: 25% gain Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-06-01  net/mlx5e: RX, Use cyclic WQ in legacy RQ  [Tariq Toukan, 1 file, -64/+51]
Now that LRO is not supported for legacy RQ, there is no source of out-of-order completions in the WQ, and we can use a cyclic one. This has multiple advantages: - reduces the WQE size (smaller PCI transactions). - lower overhead in the datapath (no handling of 'next' pointers). - no reserved WQE for the WQ head (was needed in the linked-list). - allows using a constant map between frag and dma_info struct, in a downstream patch. Performance tests: ConnectX-4, single core, single RX ring. Major gain in packet rate of single-ring XDP drop. The bottleneck is shifted from HW (at 16Mpps) to SW (at 20Mpps). Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-06-01  net/mlx5e: RX, Split WQ objects for different RQ types  [Tariq Toukan, 1 file, -15/+20]
Replace the common RQ WQ object with two separate ones for the different RQ types. This is in preparation for switching to using a cyclic WQ type in Legacy RQ. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-06-01  net/mlx5e: RX, Dedicate a function for copying SKB header  [Tariq Toukan, 1 file, -13/+17]
Move the logic of copying the packet header into the SKB linear part into a generic function. The function handles copy length alignment and DMA buffer sync. It is currently called only within the MPWQE flow. In a downstream patch, it will be called within the legacy RQ flow as well. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-06-01  net/mlx5e: RX, Generalise function of SKB frag addition  [Tariq Toukan, 1 file, -8/+8]
Rename it and pass truesize as an extra argument, as it will be used also in Legacy RQ in a downstream patch. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-06-01  net/mlx5e: RX, Generalise name of non-linear SKB head size  [Tariq Toukan, 1 file, -2/+2]
Make name more generic by dropping MPWRQ from it, as it will be used also in Legacy RQ in a downstream patch. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-29  Merge tag 'mlx5e-updates-2018-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux  [David S. Miller, 1 file, -49/+77]
Saeed Mahameed says: ==================== mlx5e-updates-2018-05-25 This series includes updates for the mlx5e netdev driver. 1) Allow flow-based VF vport mirroring under the sriov switchdev scheme; added support for offloading the TC mirred mirror sub-action, from Chris Mi. ================= From: Or Gerlitz <ogerlitz@mellanox.com> The user will typically set the actions order such that the mirror port (mirror VF) sees packets as the original port (VF under mirroring) sent them or as it will receive them. In the general case, it means that packets are potentially sent to the mirror port before or after some actions were applied on them. To properly do that, we follow the exact action order as set for the flow and make sure this will also be the case when we program the HW offload. If all the actions should apply before forwarding to the mirror and dest port, mirroring is just multicasting to the two vports. Otherwise, we split the TC flow into two HW rules, where the 1st applies only the actions needed up to the mirror (if there are such) and the 2nd the rest of the actions plus the forwarding to the dest vport. ================= 2) Move to order-0-only allocations (using fragmented work queues) for all work queues used by the driver, RX and TX descriptor rings (RQs, SQs and Completion Queues (CQs)), from Tariq Toukan. 3) Avoid resetting netdevice statistics on netdevice state changes, from Eran Ben Elisha. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-26  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  [David S. Miller, 1 file, -0/+42]
Lots of easy overlapping changes in the conflict resolutions here. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-25  net/mlx5e: Avoid reset netdev stats on configuration changes  [Eran Ben Elisha, 1 file, -33/+42]
Move all RQ, SQ and channel counters from the channel objects into the priv structure. With this change, counters will not be reset upon channel configuration changes. Channel statistics for SQs which are associated with TCs higher than zero will be presented in ethtool -S only for SQs which were opened at least once since the module was loaded (regardless of their current open/close status). This is done in order to decrease the total amount of statistics presented and calculated for the common out-of-box use (no QoS). mlx5e_channel_stats is a compound of CH, RQ and SQ stats, in order to create locality for the NAPI when handling TX and RX of the same channel. Align the new per-ring statistics struct to avoid several channels updating the same cache line at the same time. Packet rate was tested; no degradation sensed. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> CC: Qing Huang <qing.huang@oracle.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-25  net/mlx5: Use order-0 allocations for all WQ types  [Tariq Toukan, 1 file, -8/+9]
Complete the transition of all WQ types to use fragmented order-0 coherent memory instead of high-order allocations. CQ-WQ already uses order-0. Here we do the same for cyclic and linked-list WQs. This allows the driver to load cleanly on systems with a highly fragmented coherent memory. Performance tests: ConnectX-5 100Gbps, CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz Packet rate of 64B packets, single transmit ring, size 8K. No degradation is sensed. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-25  net/mlx5e: TX, Use actual WQE size for SQ edge fill  [Tariq Toukan, 1 file, -5/+22]
We fill the SQ edge with NOPs to avoid WQE wraps. Here, instead of doing that in advance for the maximum possible WQE size, we do it on demand using the actual WQE size. We re-order some parts in mlx5e_sq_xmit to finish the calculation of the WQE size (ds_cnt) before doing any writes to the WQE buffer. When the SQ work queue is fragmented (introduced in a downstream patch), dealing with WQE wraps becomes more frequent. This change drastically reduces the overhead in that case. Performance tests: ConnectX-5 100Gbps, CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz Packet rate of 64B packets, single transmit ring, size 8K. Before: 14.9 Mpps After: 15.8 Mpps Improvement of 6%. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-25  net/mlx5e: Use WQ API functions instead of direct fields access  [Tariq Toukan, 1 file, -12/+13]
Use the WQ API to get the WQ size, and to map a counter into a WQ entry index. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-24  net/mlx5e: When RXFCS is set, add FCS data into checksum calculation  [Eran Ben Elisha, 1 file, -0/+42]
When the RXFCS feature is enabled, the HW does not strip the FCS data, however it is not present in the checksum calculated by the HW. Fix that by manually calculating the FCS checksum and adding it to the SKB checksum field. Add a helper function to find the FCS data for all SKB forms (linear, one fragment or more). Fixes: 102722fc6832 ("net/mlx5e: Add support for RXFCS feature flag") Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-14  net/mlx5e: Remove MLX5E_TEST_BIT macro  [Gal Pressman, 1 file, -5/+5]
The MLX5E_TEST_BIT macro is the same as the already existing test_bit; remove it and replace all usages. Signed-off-by: Gal Pressman <galp@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
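For example, a state check in the RX poll path becomes a plain bitops call (a sketch; the MLX5E_RQ_STATE_ENABLED flag name is assumed from the driver's RQ state bits):

    if (unlikely(!test_bit(MLX5E_RQ_STATE_ENABLED, &rq->state)))
            return 0;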
2018-05-14  net/mlx5e: Use bool as return type for mlx5e_xdp_handle  [Tariq Toukan, 1 file, -3/+3]
The function returns boolean values; use bool instead of int. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-14  net/mlx5e: Use u8 instead of int for LRO number of segments  [Tariq Toukan, 1 file, -2/+1]
The range of the LRO number of segments fits in a u8. Also, bring the initialization and declaration together to save code. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-04-17  xdp: transition into using xdp_frame for return API  [Jesper Dangaard Brouer, 1 file, -0/+1]
Changing the xdp_return_frame() API to take struct xdp_frame as argument seems like a natural choice, but there are some subtle performance details here that need extra care; this is a deliberate choice. De-referencing the xdp_frame on a remote CPU during DMA-TX completion results in the cache-line changing to the "Shared" state. Later, when the page is reused for RX, this xdp_frame cache-line is written, which changes the state to "Modified". This situation already happens (naturally) for virtio_net, tun and cpumap, as the xdp_frame pointer is the queued object. In tun and cpumap, the ptr_ring is used for efficiently transferring cache-lines (with pointers) between CPUs. Thus, the only option is to de-reference the xdp_frame. It is only the ixgbe driver that had an optimization in which it could avoid de-referencing the xdp_frame. The driver already has a TX-ring queue, which (in case of remote DMA-TX completion) has to be transferred between CPUs anyhow. In this data area, we stored a struct xdp_mem_info and a data pointer, which allowed us to avoid de-referencing the xdp_frame. To compensate for this, a prefetchw is used for telling the cache coherency protocol about our access pattern. My benchmarks show that this prefetchw is enough to compensate in the ixgbe driver. V7: Adjust for commit d9314c474d4f ("i40e: add support for XDP_REDIRECT") V8: Adjust for commit bd658dda4237 ("net/mlx5e: Separate dma base address and offset in dma_sync call") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-17  mlx5: use page_pool for xdp_return_frame call  [Jesper Dangaard Brouer, 1 file, -5/+11]
This patch shows how it is possible to have both the driver-local page cache, which uses an elevated refcnt for "catching"/avoiding the case where SKB put_page returns the page through the page allocator, and at the same time have pages getting returned to the page_pool from the ndo_xdp_xmit DMA completion. The performance improvement for XDP_REDIRECT in this patch is really good, especially considering that (currently) the xdp_return_frame API and page_pool_put_page() do per-frame operations of both an rhashtable ID lookup and a locked return into the (page_pool) ptr_ring. (The plan is to remove these per-frame operations in a follow-up patchset.) The benchmark performed was RX on mlx5 and XDP_REDIRECT out ixgbe, with xdp_redirect_map (using devmap). The target/maximum capability of ixgbe is 13Mpps (on this HW setup). Before this patch for mlx5, XDP-redirected frames were returned via the page allocator. The single-flow performance was 6Mpps, and if I started two flows the collective performance dropped to 4Mpps, because we hit the page allocator lock (further negative scaling occurs). Two test scenarios need to be covered for the xdp_return_frame API: DMA-TX completion running same-CPU or cross-CPU free/return. Results were same-CPU=10Mpps and cross-CPU=12Mpps. This is very close to our 13Mpps max target. The reason the max target isn't reached in the cross-CPU test is likely due to RX-ring DMA unmap/map overhead (which doesn't occur in ixgbe-to-ixgbe testing). It is also planned to remove this unnecessary DMA unmap in a later patchset. V2: Adjustments requested by Tariq - Changed page_pool_create return codes to not return NULL, only ERR_PTR, as this simplifies error handling in drivers. - Save a branch in mlx5e_page_release - Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ V5: Updated patch desc V8: Adjust for b0cedc844c00 ("net/mlx5e: Remove rq_headroom field from params") V9: - Adjust for 121e89275471 ("net/mlx5e: Refactor RQ XDP_TX indication") - Adjust for 73281b78a37a ("net/mlx5e: Derive Striding RQ size from MTU") - Correct handling if page_pool_create fails for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ V10: Req from Tariq - Change pool_size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
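A sketch of wiring a page_pool into RQ setup along the lines described above (the mlx5e channel/RQ field names such as c->pdev and rq->buff.map_dir are assumptions; the ERR_PTR-only failure mode matches the V2 note):

    struct page_pool_params pp_params = { 0 };

    pp_params.order     = 0;                  /* order-0 pages only */
    pp_params.flags     = 0;                  /* driver still handles DMA mapping */
    pp_params.pool_size = pool_size;          /* per RQ type, see the V10 note */
    pp_params.nid       = cpu_to_node(c->cpu);
    pp_params.dev       = c->pdev;
    pp_params.dma_dir   = rq->buff.map_dir;

    rq->page_pool = page_pool_create(&pp_params);
    if (IS_ERR(rq->page_pool)) {
            err = PTR_ERR(rq->page_pool);
            rq->page_pool = NULL;
            goto err_free;
    }

    /* Let xdp_return_frame() route pages back into this pool. */
    err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
                                     MEM_TYPE_PAGE_POOL, rq->page_pool);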
2018-04-17  mlx5: basic XDP_REDIRECT forward support  [Jesper Dangaard Brouer, 1 file, -3/+24]
This implements basic XDP redirect support in the mlx5 driver. Notice that ndo_xdp_xmit() is NOT implemented, because that API needs some changes that this patchset is working towards. The main purpose of this patch is to have different drivers doing XDP_REDIRECT, to show how different memory models behave in a cross-driver world. Update (pre-RFCv2, Tariq): Need to DMA unmap the page before xdp_do_redirect, as the return API does not exist yet to keep this mapped. Update (pre-RFCv3, Saeed): Don't mix XDP_TX and XDP_REDIRECT flushing; introduce the xdpsq.db.redirect_flush boolean. V9: Adjust for commit 121e89275471 ("net/mlx5e: Refactor RQ XDP_TX indication") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
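A sketch of the redirect arm in the RX XDP handler and the deferred flush, following the update notes above (the unmap helper and field names on the mlx5e side are assumptions; xdp_do_redirect() and xdp_do_flush_map() are the core XDP calls):

    case XDP_REDIRECT:
            /* No frame-return API yet that would keep the page mapped,
             * so unmap it before handing it to another driver.
             */
            mlx5e_page_dma_unmap(rq, di);
            err = xdp_do_redirect(rq->netdev, &xdp, prog);
            if (!err)
                    rq->xdpsq.db.redirect_flush = true;
            return true;

and once per NAPI cycle, at the end of the RQ poll:

    if (rq->xdpsq.db.redirect_flush) {
            xdp_do_flush_map();
            rq->xdpsq.db.redirect_flush = false;
    }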