path: root/drivers/infiniband/sw/rxe

Age | Commit message | Author | Files | Lines
2018-09-26 | RDMA: Fully setup the device name in ib_register_device | Jason Gunthorpe | 1 | -2/+1
The current code has two copies of the device name, ibdev->name and dev_name(&ibdev->dev), and they are set up at different times, which is very confusing. Set them both up at the same time and make dev_name() the lead name, which is the proper use of the driver core APIs. To make it very clear that the name is not valid until registration, pass it in to the ib_register_device() call rather than messing with ibdev->name directly. The reorganization also checks that dev_name is unique even if it does not contain a %.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Adit Ranadive <aditr@vmware.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Acked-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
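Editor's note: the driver-side pattern this commit describes, handing the printf-style name template to the registration call instead of formatting ibdev->name beforehand, looks roughly like the sketch below. The exact trailing arguments of ib_register_device() at this kernel version are an assumption here, not a quoted prototype.

    #include <rdma/ib_verbs.h>

    /* Hedged sketch only: the third argument and overall prototype are assumed. */
    static int rxe_register_sketch(struct ib_device *ibdev)
    {
            /* Previously: snprintf() formatted "rxe%d" into ibdev->name first.
             * Now: the core resolves the unique name, and dev_name(&ibdev->dev)
             * is only valid after this call succeeds.
             */
            return ib_register_device(ibdev, "rxe%d", NULL);
    }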
2018-08-30 | IB/rxe: fixes for rdma read retry | Vijay Immanuel | 2 | -11/+25
When a read request is retried for the remaining partial data, the response may restart from read response first or read response only, so support those cases. Do not advance the comp psn beyond the current wqe's last_psn, as that could skip over an entire read wqe and would cause the req_retry() logic to set an incorrect req psn. An example sequence is as follows:
- Write PSN 40 -- this is the current WQE.
- Read request PSN 41
- Write PSN 42
- Receive ACK PSN 42 -- this completes the current WQE for PSN 40 and sets the comp psn to 42, which is a problem because the read request at PSN 41 has been skipped over. When req_retry() then tries to retransmit the read request, it sets the req psn to 42, which is incorrect.
When retrying a read request, calculate the number of psns completed based on the dma resid instead of the wqe first_psn; the wqe first_psn could have moved if the read request was retried multiple times. Set the reth length to the dma resid to handle read retries for the remaining partial data.
Signed-off-by: Vijay Immanuel <vijayi@attalasystems.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
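Editor's note: the "psns completed from the dma resid" calculation amounts to the arithmetic in this hedged sketch. The field names (dma.length, dma.resid, qp->mtu) follow rxe conventions but are assumptions for illustration.

    /* Hedged sketch: how many PSNs of the retried read have already completed,
     * derived from the remaining DMA residual rather than wqe->first_psn.
     */
    static u32 completed_read_psns_sketch(const struct rxe_qp *qp,
                                          const struct rxe_send_wqe *wqe)
    {
            u32 done = wqe->dma.length - wqe->dma.resid;    /* bytes already transferred */

            return (done + qp->mtu - 1) / qp->mtu;          /* whole packets acknowledged */
    }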
2018-08-30 | IB/rxe: avoid back-to-back retries | Vijay Immanuel | 2 | -1/+18
Error retries can occur due to timeouts, NAKs or receiving packets beyond the current read request. Avoid back-to-back retries due to packet processing by retrying only the initial attempt immediately; subsequent retries must be due to timeouts. Continue to process completion packets after scheduling a retry.
Signed-off-by: Vijay Immanuel <vijayi@attalasystems.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-08-30 | IB/rxe: fix for duplicate request processing and ack psns | Vijay Immanuel | 2 | -2/+8
Don't reset the resp opcode for a replayed read response. The resp opcode could be in the middle of a write or send sequence, when the duplicate read request was received. An example sequence is as follows:
- Receive read request for 12KB PSN 20. Transmit read response first, middle and last with PSNs 20, 21, 22.
- Receive write first PSN 23. At this point the resp psn is 24 and resp opcode is write first.
- The sender notices that PSN 20 is dropped and retransmits. Receive read request for 12KB PSN 20. Transmit read response first, middle and last with PSNs 20, 21, 22. The resp opcode is set to -1, the resp psn remains 24.
- Receive write first PSN 23. This is processed by duplicate_request(). The resp opcode remains -1 and resp psn remains 24.
- Receive write middle PSN 24. check_op_seq() reports a missing first error since the resp opcode is -1.
When sending an ack for a duplicate send or write request, use the psn of the previous ack sent. Do not use the psn of a read response for the ack. An example sequence is as follows:
- Receive write PSN 30. Transmit ACK for PSN 30.
- Receive read request 4KB PSN 31. Transmit read response with PSN 31. The resp psn is now 32.
- The sender notices that PSN 30 is dropped and retransmits. Receive write PSN 30. duplicate_request() sends an ACK with PSN 31. That is incorrect since PSN 31 was a read request.
Signed-off-by: Vijay Immanuel <vijayi@attalasystems.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-08-30 | IB/rxe: Refactor lookup memory function | Parav Pandit | 1 | -24/+11
Consolidate all error checks under a single if() condition and use the unlikely() helper macro for them; in addition, drop the unneeded goto labels. rxe_pool_get_index() already provides an efficient RB-tree based lookup, so avoid doing extra checks for error cases that are rare and already covered by rxe_pool_get_index().
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
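Editor's note: the consolidated-check shape described above looks roughly like the following sketch. Struct members, the pool name and the want_remote flag are assumptions for illustration; the point is the single unlikely() branch and the absence of goto labels.

    static struct rxe_mem *lookup_mem_sketch(struct rxe_dev *rxe, struct rxe_pd *pd,
                                             int access, u32 key, int want_remote)
    {
            struct rxe_mem *mem = rxe_pool_get_index(&rxe->mr_pool, key >> 8);

            if (!mem)
                    return NULL;

            /* All rare failure conditions collapse into one branch. */
            if (unlikely((want_remote ? mem->rkey : mem->lkey) != key ||
                         mem->pd != pd ||
                         (access && !(access & mem->access)) ||
                         mem->state != RXE_MEM_STATE_VALID)) {
                    rxe_drop_ref(mem);
                    mem = NULL;
            }

            return mem;
    }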
2018-08-30 | IB/rxe: Avoid NULL check when search is successful | Parav Pandit | 1 | -4/+3
While performing a lookup in a pool, if the entry is found, take the reference right there instead of checking again outside the loop, saving one branch.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
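Editor's note: a sketch of the pattern, with the reference taken at the point of the successful RB-tree match. The pool entry layout and field names are assumptions; the caller is assumed to hold the pool lock.

    #include <linux/rbtree.h>
    #include <linux/kref.h>

    /* Sketch only: caller holds pool->pool_lock. */
    static struct rxe_pool_entry *pool_find_and_hold_sketch(struct rxe_pool *pool,
                                                            u32 index)
    {
            struct rb_node *node = pool->tree.rb_node;
            struct rxe_pool_entry *elem;

            while (node) {
                    elem = rb_entry(node, struct rxe_pool_entry, node);

                    if (index < elem->index)
                            node = node->rb_left;
                    else if (index > elem->index)
                            node = node->rb_right;
                    else {
                            kref_get(&elem->ref_cnt);   /* take the ref right here */
                            return elem;                /* no second NULL check needed */
                    }
            }

            return NULL;
    }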
2018-08-30 | IB/rxe: Change pool state enums to capital letters | Parav Pandit | 2 | -8/+8
Normal practice is to have enum defines in capital letters. Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-08-30 | IB/rxe: Replace spinlock with rwlock | Parav Pandit | 2 | -19/+19
Concurrent readers of the RB tree are protected using a read lock; concurrent writers that add elements to the pool are protected using a write lock.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
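Editor's note: a minimal sketch of the locking split, assuming an rwlock_t pool_lock on the pool; insert_index_sketch() and find_index_sketch() stand in for the RB-tree helpers.

    static void pool_insert_sketch(struct rxe_pool *pool, struct rxe_pool_entry *elem)
    {
            unsigned long flags;

            write_lock_irqsave(&pool->pool_lock, flags);    /* exclusive: tree is modified */
            insert_index_sketch(pool, elem);
            write_unlock_irqrestore(&pool->pool_lock, flags);
    }

    static struct rxe_pool_entry *pool_lookup_sketch(struct rxe_pool *pool, u32 index)
    {
            struct rxe_pool_entry *elem;
            unsigned long flags;

            read_lock_irqsave(&pool->pool_lock, flags);     /* shared among readers */
            elem = find_index_sketch(pool, index);
            read_unlock_irqrestore(&pool->pool_lock, flags);

            return elem;
    }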
2018-08-30 | IB/rxe: Simplify rxe_find_route() to avoid GID query for netdev | Parav Pandit | 4 | -24/+13
rxe_prepare() is called on an skb which has ndev already initialized by rxe_init_packet(). Therefore avoid querying the GID attribute again and use the available netdevice from the skb->dev. Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Tested-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-08-30 | IB/rxe: remove redundant qpn check | Zhu Yanjun | 1 | -2/+2
Commit 536ca245c512 ("IB/rxe: Drop QP0 silently") already returns directly when the qpn is zero, so the subsequent qpn check is unnecessary. Remove the qpn check from the function check_keys.
Fixes: 536ca245c512 ("IB/rxe: Drop QP0 silently")
CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-08-30 | IB/rxe: vary the source udp port for receive scaling | Vijay Immanuel | 3 | -4/+15
Select the source udp port number for a QP based on the source QPN. This provides a better spread of traffic across NIC RX queues for RC/UC QPs. Signed-off-by: Vijay Immanuel <vijayi@attalasystems.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
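Editor's note: a hedged sketch of the idea, deriving the RoCEv2 UDP source port from the source QPN so each RC/UC QP hashes to a stable, distinct NIC RX queue. The base value and mask below are illustrative only, not the driver's actual constants.

    static __be16 rxe_src_port_sketch(u32 src_qpn)
    {
            /* Stay in a high port range; low QPN bits give the per-QP spread. */
            return htons(0xc000 | (src_qpn & 0x3fff));
    }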
2018-08-16 | Merge tag 'v4.18' into rdma.git for-next | Jason Gunthorpe | 1 | -1/+4
Resolve merge conflicts from the -rc cycle against the rdma.git tree:
Conflicts:
 drivers/infiniband/core/uverbs_cmd.c
  - New ifs added to ib_uverbs_ex_create_flow in -rc and for-next
  - Merge removal of file->ucontext in for-next with new code in -rc
 drivers/infiniband/core/uverbs_main.c
  - for-next removed code from ib_uverbs_write() that was modified in for-rc
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-08 | RDMA/rxe: Set wqe->status correctly if an unexpected response is received | Bart Van Assche | 1 | -0/+1
Every function that returns COMPST_ERROR must set wqe->status to another value than IB_WC_SUCCESS before returning COMPST_ERROR. Fix the only code path for which this is not yet the case. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: <stable@vger.kernel.org> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-30 | RDMA, core and ULPs: Declare ib_post_send() and ib_post_recv() arguments const | Bart Van Assche | 1 | -9/+9
Since neither ib_post_send() nor ib_post_recv() modifies the data structure their second argument points at, declare that argument const. This change makes it necessary to declare the 'bad_wr' argument const too, and also to modify all ULPs that call ib_post_send(), ib_post_recv() or ib_post_srq_recv(). This patch does not change any functionality but makes it possible for the compiler to verify whether the ib_post_(send|recv|srq_recv) functions really do not modify the posted work request. To make this possible, only one cast that casts away constness had to be introduced, namely in rpcrdma_post_recvs(). The only way I can think of to avoid that cast is to introduce an additional loop in that function or to change the data type of bad_wr from struct ib_recv_wr ** into an int (an index that refers to an element in the work request list). However, both approaches would require even more extensive changes than this patch.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
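Editor's note: a sketch of the const-correct posting shape this series converges on, shown for a driver-side post routine. queue_one_wr_sketch() is a placeholder, and the exact prototypes are as defined by the patch itself; this is illustrative only.

    static int post_send_sketch(struct ib_qp *ibqp, const struct ib_send_wr *wr,
                                const struct ib_send_wr **bad_wr)
    {
            int err;

            for (; wr; wr = wr->next) {
                    err = queue_one_wr_sketch(ibqp, wr);    /* reads, never writes, *wr */
                    if (err) {
                            *bad_wr = wr;                   /* report the failing WR */
                            return err;
                    }
            }

            return 0;
    }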
2018-07-30 | RDMA: Constify the argument of the work request conversion functions | Bart Van Assche | 1 | -4/+4
When posting a send work request, the work request that is posted is not modified by any of the RDMA drivers. Make this explicit by constifying most ib_send_wr pointers in RDMA transport drivers. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-24 | IB/rxe: Drop QP0 silently | Zhu Yanjun | 1 | -2/+7
According to "Annex A16: RDMA over Converged Ethernet (RoCE)", A16.4.3 MANAGEMENT INTERFACES:
 As defined in the base specification, a special Queue Pair, QP0 is defined solely for communication between subnet manager(s) and subnet management agents. Since such an IB-defined subnet management architecture is outside the scope of this annex, it follows that there is also no requirement that a port which conforms to this annex be associated with a QP0. Thus, for end nodes designed to conform to this annex, the concept of QP0 is undefined and unused for any port connected to an Ethernet network.
 CA16-8: A packet arriving at a RoCE port containing a BTH with the destination QP field set to QP0 shall be silently dropped.
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Acked-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-09 | RDMA/rxe: Simplify the error handling code in rxe_create_ah() | Bart Van Assche | 1 | -10/+3
This patch not only simplifies the error handling code in rxe_create_ah() but also removes the dead code that was left behind by commit 47ec38666210 ("RDMA: Convert drivers to use sgid_attr instead of sgid_index"). Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-29 | IB/rxe: don't clear the tx queue on every transfer | Vijay Immanuel | 1 | -10/+7
Do not call sk_dst_set() on every packet transfer because that calls sk_tx_queue_clear(), which clears the tx queue. A QP must stay on the same tx queue to maintain packet order. Signed-off-by: Vijay Immanuel <vijayi@attalasystems.com> Acked-by: Moni Shoua <monis@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
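Editor's note: a hedged sketch of the intent, installing a new cached route on the socket only when it actually changes, because sk_dst_set() clears the recorded TX queue and the QP needs a stable queue for packet ordering. The helper shape is an assumption, not the patch's exact code.

    static void qp_attach_dst_sketch(struct sock *sk, struct dst_entry *dst)
    {
            struct dst_entry *old = sk_dst_get(sk);        /* takes its own reference */

            if (old != dst)
                    sk_dst_set(sk, dst_clone(dst));        /* tx queue resets only here */

            dst_release(old);                              /* tolerates NULL */
    }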
2018-06-18 | IB/core: add max_send_sge and max_recv_sge attributes | Steve Wise | 2 | -5/+6
This patch replaces ib_device_attr.max_sge with max_send_sge and max_recv_sge. It allows ULPs to take advantage of devices that have very different send and recv SGE depths. For example, cxgb4 has a max_recv_sge of 4, yet a max_send_sge of 16. Splitting out these attributes allows much more efficient use of the SQ for cxgb4 with ULPs that use the RDMA_RW API. Consider a large RDMA WRITE that has 16 scatter/gather entries: with max_sge of 4 the ULP would send 4 WRITE WRs, but with max_sge of 16 it can be done with 1 WRITE WR.
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Acked-by: Shiraz Saleem <shiraz.saleem@intel.com>
Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 | IB/rxe: avoid unnecessary NULL check | Zhu Yanjun | 1 | -2/+1
The variable qp is already checked before the goto to err2, so the check of qp under the err2 label is unnecessary.
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 | IB/rxe: support for 802.1q VLAN on the listener | Vijay Immanuel | 1 | -0/+5
Set the vlan flag and vlan_id field in the wc for rdma_listen() to work over VLAN. This is required by ib_init_ah_attr_from_wc() which is called by the CM REQ handler. Signed-off-by: Vijay Immanuel <vijayi@attalasystems.com> Reviewed-by: Yonatan Cohen <yonatanc@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
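Editor's note: a sketch of reporting the VLAN in the work completion so ib_init_ah_attr_from_wc() in the CM path can reconstruct the address. The ib_wc fields and flag are the standard ones; the actual patch may derive the VLAN id from the AV/netdev rather than the skb tag, so treat this as one plausible shape.

    static void wc_fill_vlan_sketch(struct ib_wc *wc, const struct sk_buff *skb)
    {
            if (skb_vlan_tag_present(skb)) {
                    wc->wc_flags |= IB_WC_WITH_VLAN;
                    wc->vlan_id = skb_vlan_tag_get_id(skb);
            }
    }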
2018-06-18 | IB/rxe: increase max MR limit | Vijay Immanuel | 1 | -1/+1
Increase the max MR limit to support more I/O queues for NVMe over Fabrics hosts. Signed-off-by: Vijay Immanuel <vijayi@attalasystems.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 | IB/rxe: Fix missing completion for mem_reg work requests | Vijay Immanuel | 1 | -0/+3
Run the completer task to post a work completion after processing a memory registration or invalidate work request. This covers the case where the memory registration or invalidate was the last work request posted to the qp. Signed-off-by: Vijay Immanuel <vijayi@attalasystems.com> Reviewed-by: Yonatan Cohen <yonatanc@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-18 | RDMA: Convert drivers to use sgid_attr instead of sgid_index | Parav Pandit | 4 | -57/+13
The core code now ensures that all driver callbacks that receive an rdma_ah_attr will have an sgid_attr pointer if there is a GRH present. Drivers can use this pointer instead of calling a query function with sgid_index. This simplifies the drivers and also avoids races where a gid_index lookup may return different data if it is changed.
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2018-06-18 | IB/rxe: Do not hide uABI stuff in memcpy | Jason Gunthorpe | 1 | -3/+16
struct rxe_global_route and struct ib_global_route are not the same thing and should not be memcpy'd over each other; do a member-by-member copy instead. This allows the layout of the in-kernel struct ib_global_route to be changed without breaking rxe.
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
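Editor's note: a sketch of the member-by-member copy, shown for the uABI-to-kernel direction. The rxe-side field names are assumed; the point is that decoupling the two layouts lets the kernel struct change without breaking the rxe uABI.

    static void grh_from_av_sketch(const struct rxe_av *av, struct ib_global_route *grh)
    {
            memcpy(grh->dgid.raw, av->grh.dgid.raw, sizeof(grh->dgid.raw));
            grh->flow_label    = av->grh.flow_label;
            grh->sgid_index    = av->grh.sgid_index;
            grh->hop_limit     = av->grh.hop_limit;
            grh->traffic_class = av->grh.traffic_class;
    }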
2018-06-18 | IB/rxe: Use rdma GID API | Parav Pandit | 2 | -37/+25
rxe_netdev_from_av can now be done by the core code directly from the gid_attrs, no need for a helper in the driver. ib_find_cached_gid_by_port can be switched to use the rdma version here as well. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-06-11 | IB/rxe: avoid double kfree skb | Zhu Yanjun | 1 | -1/+1
In rxe_send, when network_type is neither RDMA_NETWORK_IPV4 nor RDMA_NETWORK_IPV6, the skb is freed and -EINVAL is returned, and rxe_xmit_packet then returns -EINVAL as well. In that case the skb is freed a second time in rxe_requester. In rxe_requester, kfree_skb is needed only when fill_packet fails, so move it from the err label to the fill_packet failure path.
Fixes: 5793b4652155 ("IB/rxe: remove unnecessary skb_clone in xmit")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
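Editor's note: the ownership rule behind this fix, as a sketch. fill_packet_sketch() and xmit_packet_sketch() are placeholders for fill_packet/rxe_xmit_packet; once the skb is handed to the IP stack, the stack frees it on error, so the requester only frees what it still owns.

    static int send_packet_sketch(struct rxe_qp *qp, struct rxe_pkt_info *pkt,
                                  struct sk_buff *skb)
    {
            int err = fill_packet_sketch(qp, pkt, skb);

            if (err) {
                    kfree_skb(skb);                  /* we still own the skb here */
                    return err;
            }

            return xmit_packet_sketch(qp, pkt, skb); /* lower layers free it on failure */
    }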
2018-05-28 | Merge branch 'mr_fix' into git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma for-next | Jason Gunthorpe | 1 | -9/+1
Update mlx4 to support user MR creation against read-only memory; previously it required the memory to be writable. Based on rdma for-rc due to dependencies.
* mr_fix: (2 commits)
 IB/mlx4: Mark user MR as writable if actual virtual memory is writable
 IB/core: Make testing MR flags for writability a static inline function
2018-05-28 | IB/rxe: avoid unnecessary export | Zhu Yanjun | 1 | -1/+0
The function rxe_remove_all is only used in this module; no other module calls it, so there is no need to export it.
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-05-17 | IB/rxe: avoid calling WARN_ON_ONCE twice | Zhu Yanjun | 1 | -4/+0
In the exit branch, WARN_ON_ONCE is called to show the stack, so it is not necessary to call WARN_ON_ONCE again before going to exit.
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-05-15 | IB: Fix RDMA_RXE and INFINIBAND_RDMAVT dependencies for DMA_VIRT_OPS | Ben Hutchings | 1 | -0/+1
DMA_VIRT_OPS requires that dma_addr_t is at least as wide as a pointer, which is expressed as a dependency on !64BIT || ARCH_DMA_ADDR_T_64BIT. For parisc64 this is not true, and if these IB modules are enabled, kconfig warns:
 WARNING: unmet direct dependencies detected for DMA_VIRT_OPS
  Depends on [n]: HAS_DMA [=y] && (!64BIT [=y] || ARCH_DMA_ADDR_T_64BIT)
  Selected by [m]:
  - INFINIBAND_RDMAVT [=m] && INFINIBAND [=m] && 64BIT [=y] && PCI [=y]
  - RDMA_RXE [=m] && INET [=y] && PCI [=y] && INFINIBAND [=m]
Add dependencies to fix this.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-09 | Merge branch 'k.o/for-rc' into k.o/wip/dl-for-next | Doug Ledford | 3 | -7/+2
Several items of conflict have arisen between the RDMA stack's for-rc branch and upcoming for-next work:
- 9fd4350ba895 ("IB/rxe: avoid double kfree_skb") directly conflicts with 2e47350789eb ("IB/rxe: optimize the function duplicate_request")
- Patches already submitted by Intel for the hfi1 driver will fail to apply cleanly without this merge
- Other people on the mailing list have notified that their upcoming patches also fail to apply cleanly without this merge
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-09 | nvmet,rxe: defer ip datagram sending to tasklet | Alexandru Moise | 1 | -9/+1
This addresses 3 separate problems: 1. When using NVME over Fabrics we may end up sending IP packets in interrupt context, we should defer this work to a tasklet. [ 50.939957] WARNING: CPU: 3 PID: 0 at kernel/softirq.c:161 __local_bh_enable_ip+0x1f/0xa0 [ 50.942602] CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Tainted: G W 4.17.0-rc3-ARCH+ #104 [ 50.945466] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014 [ 50.948163] RIP: 0010:__local_bh_enable_ip+0x1f/0xa0 [ 50.949631] RSP: 0018:ffff88009c183900 EFLAGS: 00010006 [ 50.951029] RAX: 0000000080010403 RBX: 0000000000000200 RCX: 0000000000000001 [ 50.952636] RDX: 0000000000000000 RSI: 0000000000000200 RDI: ffffffff817e04ec [ 50.954278] RBP: ffff88009c183910 R08: 0000000000000001 R09: 0000000000000614 [ 50.956000] R10: ffffea00021d5500 R11: 0000000000000001 R12: ffffffff817e04ec [ 50.957779] R13: 0000000000000000 R14: ffff88009566f400 R15: ffff8800956c7000 [ 50.959402] FS: 0000000000000000(0000) GS:ffff88009c180000(0000) knlGS:0000000000000000 [ 50.961552] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 50.963798] CR2: 000055c4ec0ccac0 CR3: 0000000002209001 CR4: 00000000000606e0 [ 50.966121] Call Trace: [ 50.966845] <IRQ> [ 50.967497] __dev_queue_xmit+0x62d/0x690 [ 50.968722] dev_queue_xmit+0x10/0x20 [ 50.969894] neigh_resolve_output+0x173/0x190 [ 50.971244] ip_finish_output2+0x2b8/0x370 [ 50.972527] ip_finish_output+0x1d2/0x220 [ 50.973785] ? ip_finish_output+0x1d2/0x220 [ 50.975010] ip_output+0xd4/0x100 [ 50.975903] ip_local_out+0x3b/0x50 [ 50.976823] rxe_send+0x74/0x120 [ 50.977702] rxe_requester+0xe3b/0x10b0 [ 50.978881] ? ip_local_deliver_finish+0xd1/0xe0 [ 50.980260] rxe_do_task+0x85/0x100 [ 50.981386] rxe_run_task+0x2f/0x40 [ 50.982470] rxe_post_send+0x51a/0x550 [ 50.983591] nvmet_rdma_queue_response+0x10a/0x170 [ 50.985024] __nvmet_req_complete+0x95/0xa0 [ 50.986287] nvmet_req_complete+0x15/0x60 [ 50.987469] nvmet_bio_done+0x2d/0x40 [ 50.988564] bio_endio+0x12c/0x140 [ 50.989654] blk_update_request+0x185/0x2a0 [ 50.990947] blk_mq_end_request+0x1e/0x80 [ 50.991997] nvme_complete_rq+0x1cc/0x1e0 [ 50.993171] nvme_pci_complete_rq+0x117/0x120 [ 50.994355] __blk_mq_complete_request+0x15e/0x180 [ 50.995988] blk_mq_complete_request+0x6f/0xa0 [ 50.997304] nvme_process_cq+0xe0/0x1b0 [ 50.998494] nvme_irq+0x28/0x50 [ 50.999572] __handle_irq_event_percpu+0xa2/0x1c0 [ 51.000986] handle_irq_event_percpu+0x32/0x80 [ 51.002356] handle_irq_event+0x3c/0x60 [ 51.003463] handle_edge_irq+0x1c9/0x200 [ 51.004473] handle_irq+0x23/0x30 [ 51.005363] do_IRQ+0x46/0xd0 [ 51.006182] common_interrupt+0xf/0xf [ 51.007129] </IRQ> 2. Work must always be offloaded to tasklet for rxe_post_send_kernel() when using NVMEoF in order to solve lock ordering between neigh->ha_lock seqlock and the nvme queue lock: [ 77.833783] Possible interrupt unsafe locking scenario: [ 77.833783] [ 77.835831] CPU0 CPU1 [ 77.837129] ---- ---- [ 77.838313] lock(&(&n->ha_lock)->seqcount); [ 77.839550] local_irq_disable(); [ 77.841377] lock(&(&nvmeq->q_lock)->rlock); [ 77.843222] lock(&(&n->ha_lock)->seqcount); [ 77.845178] <Interrupt> [ 77.846298] lock(&(&nvmeq->q_lock)->rlock); [ 77.847986] [ 77.847986] *** DEADLOCK *** 3. 
Same goes for the lock ordering between sch->q.lock and nvme queue lock: [ 47.634271] Possible interrupt unsafe locking scenario: [ 47.634271] [ 47.636452] CPU0 CPU1 [ 47.637861] ---- ---- [ 47.639285] lock(&(&sch->q.lock)->rlock); [ 47.640654] local_irq_disable(); [ 47.642451] lock(&(&nvmeq->q_lock)->rlock); [ 47.644521] lock(&(&sch->q.lock)->rlock); [ 47.646480] <Interrupt> [ 47.647263] lock(&(&nvmeq->q_lock)->rlock); [ 47.648492] [ 47.648492] *** DEADLOCK *** Using NVMEoF after this patch seems to finally be stable, without it, rxe eventually deadlocks the whole system and causes RCU stalls. Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-05-01 | IB/cxgb4: use skb_put_zero()/__skb_put_zero | YueHaibing | 1 | -3/+1
Use the recently introduced skb_put_zero()/__skb_put_zero() helpers to replace the pattern of __skb_put() followed by memset().
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-27 | IB/rxe: avoid double kfree_skb | Zhu Yanjun | 2 | -6/+1
When an skb is sent, it passes through the following functions in soft RoCE:
 rxe_send [rdma_rxe]
 ip_local_out
 __ip_local_out
 ip_output
 ip_finish_output
 ip_finish_output2
 dev_queue_xmit
 __dev_queue_xmit
 dev_hard_start_xmit
If an error occurs in these functions, or iptables rules drop the skb after ip_local_out, kfree_skb is called, so it is not necessary to call kfree_skb in the soft RoCE module again; otherwise a crash will occur.
The steps to reproduce:
    server                        client
   ---------                     ---------
   |1.1.1.1| <----rxe-channel--> |1.1.1.2|
   ---------                     ---------
On the server: rping -s -a 1.1.1.1 -v -C 10000 -S 512
On the client: rping -c -a 1.1.1.1 -v -C 10000 -S 512
The kernel configs CONFIG_DEBUG_KMEMLEAK and CONFIG_DEBUG_OBJECTS are enabled on both server and client. While rping runs, run the following command on the server:
 iptables -I OUTPUT -p udp --dport 4791 -j DROP
Without this patch, a crash will occur.
CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-27 | IB/rxe: add RXE_START_MASK for rxe_opcode IB_OPCODE_RC_SEND_ONLY_INV | Jianchao Wang | 1 | -1/+1
Without RXE_START_MASK, the last_psn of IB_OPCODE_RC_SEND_ONLY_INV will not be updated in update_wqe_psn, and the corresponding wqe will not be acked in rxe_completer because its last_psn is zero. The other wqes will then also never be acked, because the wqe of IB_OPCODE_RC_SEND_ONLY_INV with last_psn 0 is still there. This causes a large number of I/O timeouts when NVMe over Fabrics runs over rxe. Add RXE_START_MASK for IB_OPCODE_RC_SEND_ONLY_INV to fix this.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-27 | IB/rxe: remove unused function variable | Zhu Yanjun | 6 | -21/+17
In the functions rxe_mem_init_dma, rxe_mem_init_user, rxe_mem_init_fast and copy_data, the function parameter rxe is not used, so it is removed.
CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-27 | IB/rxe: change rxe_set_mtu function type to void | Zhu Yanjun | 2 | -7/+3
The function rxe_set_mtu always returns zero, so its return type is changed to void.
CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-27 | IB/rxe: Change rxe_rcv to return void | Yuval Shaia | 4 | -9/+12
It always returns 0. Change return type to void. Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-19 | IB/rxe: replace refcount_inc with skb_get | Zhu Yanjun | 2 | -2/+2
Following the advice from Bart, the function refcount_inc is replaced with skb_get in commit 99dae690255e ("IB/rxe: optimize mcast recv process") and commit 86af61764151 ("IB/rxe: remove unnecessary skb_clone").
CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Suggested-by: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
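Editor's note: the substitution itself is tiny; skb_get() bumps skb->users and returns the skb, keeping the reference accounting inside the skb API instead of open-coding refcount_inc(&skb->users). A minimal sketch:

    static struct sk_buff *hold_skb_sketch(struct sk_buff *skb)
    {
            return skb_get(skb);    /* instead of refcount_inc(&skb->users) */
    }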
2018-04-19 | IB/rxe: optimize the function duplicate_request | Zhu Yanjun | 1 | -14/+3
In the function duplicate_request, the skb's reference count can be increased instead of calling skb_clone. This makes rxe performance better and saves memory.
CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-19 | IB/rxe: make rxe_release_udp_tunnel static | Zhu Yanjun | 2 | -3/+1
The function rxe_release_udp_tunnel is only used in rxe_net.c, so make it static.
CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2018-04-17 | IB/rxe: avoid export symbols | Zhu Yanjun | 1 | -3/+0
The functions rxe_set_mtu, rxe_add and rxe_remove are only used in their own module. So it is not necessary to export them. CC: Srinivas Eeda <srinivas.eeda@oracle.com> CC: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-17 | IB/rxe: make the variable static | Zhu Yanjun | 2 | -2/+1
The variable rxe_net_notifier is only used in the file rxe_net.c. So remove it from rxe_net.h file and make it static in the file rxe_net.c. CC: Srinivas Eeda <srinivas.eeda@oracle.com> CC: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-05 | IB/rxe: Fix for oops in rxe_register_device on ppc64le arch | Mikhail Malygin | 1 | -1/+1
On ppc64le arch rxe_add command causes oops in kernel log: [ 92.495140] Oops: Kernel access of bad area, sig: 11 [#1] [ 92.499710] SMP NR_CPUS=2048 NUMA pSeries [ 92.499792] Modules linked in: ipt_MASQUERADE(E) nf_nat_masquerade_ipv4(E) nf_conntrack_netlink(E) nfnetlink(E) xfrm_user(E) iptable _nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) nf_nat_ipv4(E) xt_addrtype(E) iptable_filter(E) ip_tables(E) xt_conntrack(E) x_tables(E) nf_nat(E) nf_conntrack(E) br_netfilter(E) bridge(E) stp(E) llc(E) overlay(E) af_packet(E) rpcrdma(E) ib_isert(E) iscsi_target_mod(E) i b_iser(E) libiscsi(E) ib_srpt(E) target_core_mod(E) ib_srp(E) ib_ipoib(E) rdma_ucm(E) ib_ucm(E) ib_uverbs(E) ib_umad(E) bochs_drm(E) tt m(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) drm(E) agpgart(E) virtio_rng(E) virtio_console(E) rtc_ generic(E) dm_ec(OEN) ttln_rdma(OEN) rdma_cm(E) configfs(E) iw_cm(E) ib_cm(E) rdma_rxe(E) ip6_udp_tunnel(E) udp_tunnel(E) ib_core(E) ql a2xxx(E) [ 92.499832] scsi_transport_fc(E) nvme_fc(E) nvme_fabrics(E) nvme_core(E) ipmi_watchdog(E) ipmi_ssif(E) ipmi_poweroff(E) ipmi_powernv(EX) ipmi_devintf(E) ipmi_msghandler(E) dummy(E) ext4(E) crc16(E) jbd2(E) mbcache(E) dm_service_time(E) scsi_transport_iscsi(E) sd_mod(E) sr_mod(E) cdrom(E) hid_generic(E) usbhid(E) virtio_blk(E) virtio_scsi(E) virtio_net(E) ibmvscsi(EX) scsi_transport_srp(E) xhci_pci(E) xhci_hcd(E) usbcore(E) usb_common(E) virtio_pci(E) virtio_ring(E) virtio(E) sunrpc(E) dm_mirror(E) dm_region_hash(E) dm_log(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) autofs4(E) [ 92.499834] Supported: No, Unsupported modules are loaded [ 92.499839] CPU: 3 PID: 5576 Comm: sh Tainted: G OE NX 4.4.120-ttln.17-default #1 [ 92.499841] task: c0000000afe8a490 ti: c0000000beba8000 task.ti: c0000000beba8000 [ 92.499842] NIP: c00000000008ba3c LR: c000000000027644 CTR: c00000000008ba10 [ 92.499844] REGS: c0000000bebab750 TRAP: 0300 Tainted: G OE NX (4.4.120-ttln.17-default) [ 92.499850] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 28424428 XER: 20000000 [ 92.499871] CFAR: 0000000000002424 DAR: 0000000000000208 DSISR: 40000000 SOFTE: 1 GPR00: c000000000027644 c0000000bebab9d0 c000000000f09700 0000000000000000 GPR04: d0000000043d7192 0000000000000002 000000000000001a fffffffffffffffe GPR08: 000000000000009c c00000000008ba10 d0000000043e5848 d0000000043d3828 GPR12: c00000000008ba10 c000000007a02400 0000000010062e38 0000010020388860 GPR16: 0000000000000000 0000000000000000 00000100203885f0 00000000100f6c98 GPR20: c0000000b3f1fcc0 c0000000b3f1fc48 c0000000b3f1fbd0 c0000000b3f1fb58 GPR24: c0000000b3f1fae0 c0000000b3f1fa68 00000000000005dc c0000000b3f1f9f0 GPR28: d0000000043e5848 c0000000b3f1f900 c0000000b3f1f320 c0000000b3f1f000 [ 92.499881] NIP [c00000000008ba3c] dma_get_required_mask_pSeriesLP+0x2c/0x1a0 [ 92.499885] LR [c000000000027644] dma_get_required_mask+0x44/0xac [ 92.499886] Call Trace: [ 92.499891] [c0000000bebab9d0] [c0000000bebaba30] 0xc0000000bebaba30 (unreliable) [ 92.499894] [c0000000bebaba10] [c000000000027644] dma_get_required_mask+0x44/0xac [ 92.499904] [c0000000bebaba30] [d0000000043cb4b4] rxe_register_device+0xc4/0x430 [rdma_rxe] [ 92.499910] [c0000000bebabab0] [d0000000043c06c8] rxe_add+0x448/0x4e0 [rdma_rxe] [ 92.499915] [c0000000bebabb30] [d0000000043d28dc] rxe_net_add+0x4c/0xf0 [rdma_rxe] [ 92.499921] [c0000000bebabb60] [d0000000043d305c] rxe_param_set_add+0x6c/0x1ac [rdma_rxe] [ 92.499924] [c0000000bebabbf0] [c0000000000e78c0] param_attr_store+0xa0/0x180 [ 
92.499927] [c0000000bebabc70] [c0000000000e6448] module_attr_store+0x48/0x70 [ 92.499932] [c0000000bebabc90] [c000000000391f60] sysfs_kf_write+0x70/0xb0 [ 92.499935] [c0000000bebabcb0] [c000000000390f1c] kernfs_fop_write+0x18c/0x1e0 [ 92.499939] [c0000000bebabd00] [c0000000002e22ac] __vfs_write+0x4c/0x1d0 [ 92.499942] [c0000000bebabd90] [c0000000002e2f94] vfs_write+0xc4/0x200 [ 92.499945] [c0000000bebabde0] [c0000000002e488c] SyS_write+0x6c/0x110 [ 92.499948] [c0000000bebabe30] [c000000000009384] system_call+0x38/0xe4 [ 92.499949] Instruction dump: [ 92.499954] 4e800020 3c4c00e8 3842dcf0 7c0802a6 f8010010 60000000 7c0802a6 fba1ffe8 [ 92.499958] fbc1fff0 fbe1fff8 f8010010 f821ffc1 <e9230208> 7c7e1b78 2fa90000 419e0078 [ 92.499962] ---[ end trace bed077e15eb420cf ]--- It fails in dma_get_required_mask, that has ppc-specific implementation, and fail if provided device argument is NULL Signed-off-by: Mikhail Malygin <mikhail@malygin.me> Reviewed-by: Yonatan Cohen <yonatanc@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-05 | IB/rxe: Removed GID add/del dummy routines | Parav Pandit | 1 | -17/+0
rxe driver's add_gid() and del_gid() callbacks are doing simple checks which are already done by the ib core before invoking these callback routines. Therefore, code is simplified to skip implementing add_gid() and del_gid() callback functions. They are only invoked by ib_core if they are implemented. Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-03 | RDMA: Use ib_gid_attr during GID modification | Parav Pandit | 1 | -6/+4
Now that ib_gid_attr contains device, port and index, simplify the provider APIs add_gid() and del_gid() to use device, port and index fields from the ib_gid_attr attributes structure. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-03 | IB/providers: Avoid null netdev check for RoCE | Parav Pandit | 1 | -3/+1
Now that the IB core GID cache ensures that all RoCE entries have an associated netdev, remove the null checks from the provider drivers for clarity.
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-04-03 | RDMA/providers: Simplify query_gid callback of RoCE providers | Parav Pandit | 1 | -18/+0
ib_query_gid() fetches the GID from the software cache maintained in ib_core for RoCE ports. Therefore, simplify the provider drivers for RoCE to treat query_gid() callback as never called for RoCE, and only require non-RoCE devices to implement it. Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-03-29 | IB/rxe: optimize mcast recv process | Zhu Yanjun | 1 | -10/+4
In the mcast recv process the function skb_clone is used. In fact, the refcount can be increased instead of cloning a new skb, since the original skb is not modified before it is freed. This makes the performance better and saves memory.
CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>