linux-dev - Linux kernel development work

Age	Commit message (Collapse)	Author	Files	Lines
2018-05-25	net/mlx5e: Avoid reset netdev stats on configuration changes	Eran Ben Elisha	1	-2/+4
	Move all RQ, SQ and channel counters from the channel objects into the priv structure. With this change, counters will not be reset upon channel configuration changes. Channel's statistics for SQs which are associated with TCs higher than zero will be presented in ethtool -S, only for SQs which were opened at least once since the module was loaded (regardless of their open/close current status). This is done in order to decrease the total amount of statistics presented and calculated for the common out of box use (no QoS). mlx5e_channel_stats is a compound of CH,RQ,SQs stats in order to create locality for the NAPI when handling TX and RX of the same channel. Align the new statistics struct per ring to avoid several channels update to the same cache line at the same time. Packet rate was tested, no degradation sensed. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> CC: Qing Huang <qing.huang@oracle.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-05-14	net/mlx5e: Remove MLX5E_TEST_BIT macro	Gal Pressman	1	-2/+2
	MLX5E_TEST_BIT macro is the same as the already existent test_bit, remove it and replace all usages. Signed-off-by: Gal Pressman <galp@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-04-24	net/mlx5e: Enable adaptive-TX moderation	Tal Gilboa	1	-9/+28
	Add support for adaptive TX moderation. This greatly reduces TX interrupt rate and increases bandwidth, mostly for TCP bandwidth over ARM architecture (below). There is a slight single stream TCP with very large message sizes degradation (x86). In this case if there's any moderation on transmitted packets the bandwidth would reduce due to hitting TCP output limit. Since this is a synthetic case, this is still worth doing. Performance improvement (ConnectX-4Lx 40GbE, ARM) TCP 64B bandwidth with 1-50 streams increased 6-35%. TCP 64B bandwidth with 100-500 streams increased 20-70%. Performance improvement (ConnectX-5 100GbE, x86) Bandwidth: increased up to 40% (1024B with 10s of streams). Interrupt rate: reduced up to 50% (1024B with 1000s of streams). Performance degradation (ConnectX-5 100GbE, x86) Bandwidth: up to 10% decrease single stream TCP (1MB message size from 51Gb/s to 47Gb/s). Signed-off-by: Tal Gilboa <talgi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10	net/dim: use struct net_dim_sample as arg to net_dim	Andy Gospodarek	1	-5/+8
	Simplify the arguments net_dim() by formatting them into a struct net_dim_sample before calling the function. Signed-off-by: Andy Gospodarek <gospo@broadcom.com> Suggested-by: Tal Gilboa <talgi@mellanox.com> Acked-by: Tal Gilboa <talgi@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10	net/mlx5e: Change Mellanox references in DIM code	Andy Gospodarek	1	-4/+4
	Change all appropriate mlx5_am* and MLX5_AM* references to net_dim and NET_DIM, respectively, in code that handles dynamic interrupt moderation. Also change all references from 'am' to 'dim' when used as local variables and add generic profile references. Signed-off-by: Andy Gospodarek <gospo@broadcom.com> Acked-by: Tal Gilboa <talgi@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-10	net/mlx5e: Remove rq references in mlx5e_rx_am	Andy Gospodarek	1	-1/+4
	This makes mlx5e_am_sample more generic so that it can be called easily from a driver that does not use the same data structure to store these values in a single structure. Signed-off-by: Andy Gospodarek <gospo@broadcom.com> Acked-by: Tal Gilboa <talgi@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-11-10	net/mlx5e: Fix napi poll with zero budget	Saeed Mahameed	1	-4/+6
	napi->poll can be called with budget 0, e.g. in netpoll scenarios where the caller only wants to poll TX rings (poll_one_napi@net/core/netpoll.c). The below commit changed RX polling from "while" loop to "do {} while", which caused to ignore the initial budget and handle at least one RX packet. This fixes the following warning: [ 2852.049194] mlx5e_napi_poll+0x0/0x260 [mlx5_core] exceeded budget in poll [ 2852.049195] ------------[ cut here ]------------ [ 2852.049195] WARNING: CPU: 0 PID: 25691 at net/core/netpoll.c:171 netpoll_poll_dev+0x18a/0x1a0 Fixes: 4b7dfc992514 ("net/mlx5e: Early-return on empty completion queues") Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Reported-by: Martin KaFai Lau <kafai@fb.com> Tested-by: Martin KaFai Lau <kafai@fb.com> Cc: kernel-team@fb.com Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-09-03	net/mlx5e: Stop NAPI when irq balancer changes affinity	Tariq Toukan	1	-2/+18
	NAPI context keeps rescheduling on same CPU as long as it's busy. This doesn't give the oppurtunity for changes in irq affinities to take effect. Fix that by calling napi_complete_done() upon a change in affinity. This would stop the NAPI and reschedule it on the new CPU. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-09-03	net/mlx5e: Use kernel's mechanism to avoid missing NAPIs	Tariq Toukan	1	-9/+1
	We used a channel state bit MLX5E_CHANNEL_NAPI_SCHED to make sure no NAPI is missed when a channel's napi_schedule() is called for completion events of the different channel's resources/rings while NAPI is currently running. Now, as similar mechanism is implemented in kernel, ("39e6c8208d7b net: solve a NAPI race"), we obsolete our own implementation and rely on the return value of napi_complete_done(). This patch removes a redundant overhead of atomic bit operations. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-09-03	net/mlx5e: Type-specific optimizations for RX post WQEs function	Tariq Toukan	1	-63/+1
	Separate the RX post WQEs function of the different RQ types. This enables RQ type-specific optimizations in data-path. Poll the ICOSQ completion queue only for Striding RQ, and only when a UMR post completion could be possibly available. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-09-03	net/mlx5e: Non-atomic indicator for ring enabled state	Tariq Toukan	1	-2/+2
	Rings enabled state change occurs in control path only, and is always followed by a napi_sychronize(), so that following NAPIs read the new value. This read does not need to be atomic. The RQ auto-moderation bit is not set/cleared in data-path. No need for atomic read, a regular read operation is sufficient. In RQ creation time as well, there's no multiple threads trying to access it yet, hence a regular read can be used. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-06-27	net/mlx5: Make get_cqe routine not ethernet-specific	Ilan Tayari	1	-18/+1
	Move mlx5e_get_cqe routine to wq.h and rename it to mlx5_cqwq_get_cqe. This allows it to be used by other CQ users outside of the ethernet driver code. A later patch in this patchset will make use of it from FPGA code for the FPGA high-speed connection. Signed-off-by: Ilan Tayari <ilant@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-04-30	net/mlx5e: Use u8 as ownership type in mlx5e_get_cqe()	Tariq Toukan	1	-2/+2
	CQE ownership indication is as small as a single bit. Use u8 to speedup the comparison. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Cc: kernel-team@fb.com Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-04-30	net/mlx5e: Optimize poll ICOSQ completion queue	Tariq Toukan	1	-29/+33
	UMR operations are more frequent and important. Check them first, and add a compiler branch predictor hint. According to current design, ICOSQ CQ can contain at most one pending CQE per napi. Poll function is optimized accordingly. Performance: Single-stream packet-rate tested with pktgen. Packets are dropped in tc level to zoom into driver data-path. Larger gain is expected for larger packet sizes, as BW is higher and UMR posts are more frequent. --------------------------------------------- packet size \| before \| after \| gain \| 64B \| 4,092,370 \| 4,113,306 \| 0.5% \| 1024B \| 3,421,435 \| 3,633,819 \| 6.2% \| Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Cc: kernel-team@fb.com Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2017-03-27	net/mlx5e: CQ and RQ don't need priv pointer	Saeed Mahameed	1	-2/+1
	Remove mlx5e_priv pointer from CQ and RQ structs, it was needed only to access mdev pointer from priv pointer. Instead we now pass mdev where needed. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
2017-03-24	net/mlx5e: Different SQ types	Saeed Mahameed	1	-1/+1
	Different SQ types (tx, xdp, ico) are growing apart, we separate them and remove unwanted parts in each one of them, to simplify data path and utilize data cache. Remove DB union from SQ structures since it is not needed anymore as we now have different SQ data type for each SQ. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24	net/mlx5e: Poll XDP TX CQ before RX CQ	Saeed Mahameed	1	-3/+3
	Handle XDP TX completions before handling RX packets, to make sure more free space is available for XDP TX packets a moment before handling RX packets. Performance improvement: System: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz Test case Before Now improvement --------------------------------------------------------------- XDP Drop (1 core) 16.9Mpps 16.9Mpps No change XDP TX (1 core) 12Mpps 13Mpps 8% Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24	net/mlx5e: Move XDP SQ instance into RQ	Saeed Mahameed	1	-1/+1
	To save many rq->channel->sq dereferences in fast-path. And rename it to xdpsq. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24	net/mlx5e: Move XDP completion functions to rx file	Saeed Mahameed	1	-61/+1
	XDP code belongs to RX path, move mlx5e_poll_xdp_tx_cq and mlx5e_free_xdp_tx_descs to en_rx.c. Rename them to mlx5e_poll_xdpsq_cq and mlx5e_free_xdpsq_descs. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24	net/mlx5e: Use dma_rmb rather than rmb in CQE fetch routine	Saeed Mahameed	1	-1/+1
	Use dma_rmb in mlx5e_get_cqe rather than aggressive rmb (at least on some architectures), this should help improve the performance on such CPU archs where dma_rmb is optimized. Performance improvement: System: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz Test case Baseline Now improvement --------------------------------------------------------------- TX packets (24 threads) 45Mpps 50Mpps 11% TC stack Drop (1 core) 3.45Mpps 3.6Mpps 5% XDP Drop (1 core) 14Mpps 16.9Mpps 20% XDP TX (1 core) 10.4Mpps 12Mpps 15% Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-06	net/mlx5e: Change the SQ/RQ operational state to positive logic	Mohamad Haj Yahia	1	-2/+2
	When using the negative logic (i.e. FLUSH state), after the RQ/SQ reopen we will have a time interval that the RQ/SQ is not really ready and the state indicates that its not in FLUSH state because the initial SQ/RQ struct memory starts as zeros. Now we changed the state to indicate if the SQ/RQ is opened and we will set the READY state after finishing preparing all the SQ/RQ resources. Fixes: 6e8dd6d6f4bd ("net/mlx5e: Don't wait for SQ completions on close") Fixes: f2fde18c52a7 ("net/mlx5e: Don't wait for RQ completions on close") Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22	net/mlx5e: XDP TX forwarding support	Saeed Mahameed	1	-1/+64
	Adding support for XDP_TX forwarding from xdp program. Using XDP, now user can loop packets out of the same port. We create a dedicated TX SQ for each channel that will serve XDP programs that return XDP_TX action to loop packets back to the wire directly from the channel RQ RX path. For that RX pages will now need to be mapped bi-directionally, and on XDP_TX action we will sync the page back to device then queue it into SQ for transmission. The XDP xmit frame function will report back to the RX path if the page was consumed (transmitted), if so, RX path will forget about that page as if it were released to the stack. Later on, on XDP TX completion, the page will be released back to the page cache. For simplicity this patch will hit a doorbell on every XDP TX packet. Next patch will introduce a xmit more like mechanism that will queue up more than one packet into SQ w/o notifying the hardware, once RX napi loop is done we will hit doorbell once for all XDP TX packets form the previous loop. This should drastically improve XDP TX performance. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22	net/mlx5e: Have a clear separation between different SQ types	Saeed Mahameed	1	-1/+1
	Make a clear separate between Regular SQ (TXQ) and ICO SQ creation, destruction and union their mutual information structures. Don't allocate redundant TXQ skb/wqe_info/dma_fifo arrays for ICO SQ. And have a different SQ edge for ICO SQ than TXQ SQ, to be more accurate. In preparation for XDP TX support. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-17	net/mlx5e: Single flow order-0 pages for Striding RQ	Tariq Toukan	1	-1/+1
	To improve the memory consumption scheme, we omit the flow that demands and splits high-order pages in Striding RQ, and stay with a single Striding RQ flow that uses order-0 pages. Moving to fragmented memory allows the use of larger MPWQEs, which reduces the number of UMR posts and filler CQEs. Moving to a single flow allows several optimizations that improve performance, especially in production servers where we would anyway fallback to order-0 allocations: - inline functions that were called via function pointers. - improve the UMR post process. This patch alone is expected to give a slight performance reduction. However, the new memory scheme gives the possibility to use a page-cache of a fair size, that doesn't inflate the memory footprint, which will dramatically fix the reduction and even give a performance gain. Performance tests: The following results were measured on a freshly booted system, giving optimal baseline performance, as high-order pages are yet to be fragmented and depleted. We ran pktgen single-stream benchmarks, with iptables-raw-drop: Single stride, 64 bytes: * 4,739,057 - baseline * 4,749,550 - this patch no reduction Larger packets, no page cross, 1024 bytes: * 3,982,361 - baseline * 3,845,682 - this patch 3.5% reduction Larger packets, every 3rd packet crosses a page, 1500 bytes: * 3,731,189 - baseline * 3,579,414 - this patch 4% reduction Fixes: 461017cb006a ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)") Fixes: bc77b240b3c5 ("net/mlx5e: Add fragmented memory support for RX multi packet WQE") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-28	net/mlx5e: Don't wait for SQ completions on close	Saeed Mahameed	1	-2/+4
	Instead of asking the firmware to flush the SQ (Send Queue) via asynchronous completions when moved to error, we handle SQ flush manually (mlx5e_free_tx_descs) same as we did when SQ flush got timed out or on tx_timeout. This will reduce SQs flush time and speedup interface down procedure. Moved mlx5e_free_tx_descs to the end of en_tx.c for tx critical code locality. Fixes: 29429f3300a3 ('net/mlx5e: Timeout if SQ doesn't flush during close') Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-27	net/mlx5e: Support adaptive RX coalescing	Gil Rockah	1	-0/+5
	Striving for high message rate and low interrupt rate. Usage: ethtool -C <interface> adaptive-rx on/off Signed-off-by: Gil Rockah <gilr@mellanox.com> Signed-off-by: Achiad Shochat <achiad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> CC: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21	net/mlx5e: Remove redundant barrier	Tariq Toukan	1	-1/+0
	The bit-op operation one line before is an explicit barrier by itself. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21	net/mlx5e: Add fragmented memory support for RX multi packet WQE	Tariq Toukan	1	-0/+3
	If the allocation of a linear (physically continuous) MPWQE fails, we allocate a fragmented MPWQE. This is implemented via device's UMR (User Memory Registration) which allows to register multiple memory fragments into ConnectX hardware as a continuous buffer. UMR registration is an asynchronous operation and is done via ICO SQs. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21	net/mlx5e: Added ICO SQs	Tariq Toukan	1	-0/+55
	Added ICO (Internal Control Operations) SQ per channel to be used for driver internal operations such as memory registration for fragmented memory and nop requests upon ifconfig up. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-13	mlx5: use napi_consume_skb API to get bulk free operations	Jesper Dangaard Brouer	1	-1/+1
	Bulk free of SKBs happen transparently by the API call napi_consume_skb(). The napi budget parameter is needed by napi_consume_skb() to detect if called from netpoll. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-02	net/mlx5e: Remove wrong poll CQ optimization	Tariq Toukan	1	-1/+0
	With the MLX5E_CQ_HAS_CQES optimization flag, the following buggy flow might occur: - Suppose RX is always busy, TX has a single packet every second. - We poll a single TX cqe and clear its flag. - We never arm it again as RX is always busy. - TX CQ flag is never changed, and new TX cqes are not polled. We revert this optimization. Fixes: e586b3b0baee ('net/mlx5: Ethernet Datapath files') Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-18	mlx5: support napi_complete_done()	Eric Dumazet	1	-5/+6
	A NAPI poll handler should return number of RX packets processed, instead of 0 / budget. This allows proper busy poll accounting through LINUX_MIB_BUSYPOLLRXPACKETS SNMP counter. napi_complete_done() allows /sys/class/net/ethX/gro_flush_timeout to be used for finer GRO aggregation control. Tested: Enabled busy polling, and checked TcpExtBusyPollRxPackets counter is increasing. echo 70 >/proc/sys/net/core/busy_read nstat >/dev/null netperf -H target -t TCP_RR >/dev/null nstat \| grep TcpExtBusyPollRxPackets TcpExtBusyPollRxPackets 490958 0.0 Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Eli Cohen <eli@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-24	net/mlx5e: Pop cq outside mlx5e_get_cqe	Achiad Shochat	1	-2/+0
	Separate between mlx5e_get_cqe() and mlx5_cqwq_pop(), this helps for better code readability and better CQ buffer management. Signed-off-by: Achiad Shochat <achiad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-24	net/mlx5e: Remove mlx5e_cq.sqrq back-pointer	Achiad Shochat	1	-1/+1
	Use container_of() instead. Signed-off-by: Achiad Shochat <achiad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-30	net/mlx5: Ethernet Datapath files	Amir Vadai	1	-0/+107
	en_[rt]x.c contains the data path related code specific to tx or rx. en_txrx.c contains data path code which is common for both the rx and tx, this is mainly napi related code. Below are the objects that are being used by the hardware and the driver in the data path: Channel - one channel per IRQ. Every channel object contains: RQ - describes the rx queue TIR - One TIR (Transport Interface Receive) object per flow type. TIR contains attributes for a type of rx flow (e.g IPv4, IPv6 etc). A flow is defined in the Flow Table. Currently TIR describes the RSS hash parameters if exists and LRO attributes. SQ - describes the a tx queue. There is one SQ (Send Queue) per TC (traffic class). TIS - There is one TIS (Transport Interface Send) per TC. It describes the TC and may later be extended to describe more transport properties. Both RQ and SQ inherit from the object WQ (work queue). This common code to describe the layout of CQE's WQE's in memory is in the files wq.[cj] For every channel there is one NAPI context that is used for RX and for TX. Driver is using netdev_alloc_skb() to allocate skb's. Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>