net/mlx5e: Recover Send Queue (SQ) from error state

An error TX completion (CQE) arriving on a specific SQ indicates that the
hardware has moved this SQ to the error state. All pending and incoming TX
requests are dropped or will be dropped, and no further "good" CQEs will be
generated for that SQ.

Before this patch, error TX completions (CQEs) were not monitored and were
handled as regular CQEs. This left the SQ stuck in the error state, making
it useless for transmitting new packets.

Mitigation plan:
In case of an error completion, schedule a recovery work which does the
following:
- Mark the TXQ as DRV_XOFF to stop new packets from arriving from the
  stack.
- Let NAPI flush all pending SQ WQEs (via the flush_in_error_en bit) to
  release SW and HW resources (SKB, DMA, etc.) and have the SQ and CQ
  consumer/producer indices synced.
- Modify the SQ state ERR -> RST -> RDY (restart the SQ).
- Reactivate the SQ and reset the SQ cc and pc.

If two consecutive SQ recovery requests arrive less than 500 msecs apart,
drop the new request to avoid CPU overload, as this scenario is most likely
caused by a severe, repeating bug.

In addition, add an SQ recover SW counter to monitor successful recoveries.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
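
The 500 msec rule from the commit message can be illustrated with a small
userspace sketch. This is not the driver's code: the struct, the function
name and the millisecond timestamps are hypothetical stand-ins, and the
sketch assumes the timestamp is refreshed only when a request is accepted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimum spacing between recoveries, taken from the commit message. */
#define RECOVER_MIN_INTERVAL_MS 500

/* Hypothetical bookkeeping for illustration; not the driver's struct. */
struct sq_recover_sketch {
	uint64_t last_recover_ms; /* time of the last accepted request */
	bool     recovered_once;  /* false until a first request is accepted */
};

/*
 * Return true if a recovery request arriving at now_ms should proceed,
 * false if it should be dropped because the previous accepted request
 * was less than RECOVER_MIN_INTERVAL_MS ago.
 */
static bool sq_recover_allowed(struct sq_recover_sketch *r, uint64_t now_ms)
{
	if (r->recovered_once &&
	    now_ms - r->last_recover_ms < RECOVER_MIN_INTERVAL_MS)
		return false; /* too soon: drop the request */

	/* Assumption: only accepted requests move the timestamp forward. */
	r->last_recover_ms = now_ms;
	r->recovered_once = true;
	return true;
}
```

With a zero-initialized struct, a request at t=1000 is accepted, one at
t=1200 is dropped, and one at t=1500 is accepted again because a full
500 msecs have passed since the last accepted request.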
author: Eran Ben Elisha <eranbe@mellanox.com> 2017-12-26 16:02:24 +0200
committer: Saeed Mahameed <saeedm@mellanox.com> 2018-03-27 17:29:28 -0700
commit: db75373c91b0cfb6a68ad6ae88721e4e21ae6261 (patch)
tree: 5724e4db736f47e6bce86154b3018b6881b30a95 /drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
parent: net/mlx5e: Dump xmit error completions (diff)
1 files changed, 8 insertions, 2 deletions
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 88b5b7bfc9a9..20297108528a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -469,9 +469,13 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 		wqe_counter = be16_to_cpu(cqe->wqe_counter);
 
 		if (unlikely(cqe->op_own >> 4 == MLX5_CQE_REQ_ERR)) {
-			if (!sq->stats.cqe_err)
+			if (!test_and_set_bit(MLX5E_SQ_STATE_RECOVERING,
+					      &sq->state)) {
 				mlx5e_dump_error_cqe(sq,
 						     (struct mlx5_err_cqe *)cqe);
+				queue_work(cq->channel->priv->wq,
+					   &sq->recover.recover_work);
+			}
 			sq->stats.cqe_err++;
 		}
 
@@ -528,7 +532,9 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 	netdev_tx_completed_queue(sq->txq, npkts, nbytes);
 
 	if (netif_tx_queue_stopped(sq->txq) &&
-	    mlx5e_wqc_has_room_for(&sq->wq, sq->cc, sq->pc, MLX5E_SQ_STOP_ROOM)) {
+	    mlx5e_wqc_has_room_for(&sq->wq, sq->cc, sq->pc,
+				   MLX5E_SQ_STOP_ROOM) &&
+	    !test_bit(MLX5E_SQ_STATE_RECOVERING, &sq->state)) {
 		netif_tx_wake_queue(sq->txq);
 		sq->stats.wake++;
 	}
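
The two hunks above rely on a single state bit: the first error CQE sets
MLX5E_SQ_STATE_RECOVERING and queues the recovery work, later error CQEs
only bump the counter, and the wake path refuses to wake the queue while
the bit is set. A minimal userspace model of that gating, using C11 atomics
in place of the kernel's test_and_set_bit()/test_bit(), might look as
follows. All names ending in _sketch, the bit index and the
recover_scheduled field are assumptions made for illustration only.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit index; the real value lives in the driver headers. */
enum { SQ_STATE_RECOVERING_BIT = 0 };

/* Hypothetical stand-in for the relevant parts of the SQ. */
struct sq_sketch {
	atomic_ulong  state;             /* bitmask of SQ state flags */
	unsigned long cqe_err;           /* counts every error CQE */
	unsigned long recover_scheduled; /* stands in for queue_work() calls */
};

/* Atomically set bit nr and report whether it was already set. */
static bool test_and_set_bit_sketch(int nr, atomic_ulong *addr)
{
	unsigned long mask = 1UL << nr;

	return (atomic_fetch_or(addr, mask) & mask) != 0;
}

/* Report whether bit nr is currently set. */
static bool test_bit_sketch(int nr, atomic_ulong *addr)
{
	return ((atomic_load(addr) >> nr) & 1UL) != 0;
}

/* Models the error-CQE branch: only the first error schedules recovery. */
static void on_error_cqe_sketch(struct sq_sketch *sq)
{
	if (!test_and_set_bit_sketch(SQ_STATE_RECOVERING_BIT, &sq->state))
		sq->recover_scheduled++; /* dump the CQE, queue the work */
	sq->cqe_err++;
}

/* Models the wake condition: never wake a queue that is recovering. */
static bool may_wake_queue_sketch(struct sq_sketch *sq, bool stopped,
				  bool has_room)
{
	return stopped && has_room &&
	       !test_bit_sketch(SQ_STATE_RECOVERING_BIT, &sq->state);
}
```

In this model, three back-to-back error CQEs schedule recovery once while
cqe_err reaches 3, and the queue becomes wakeable again only after the
recovering bit is cleared, which in the driver is presumably done by the
recovery work itself.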