path: root/drivers/infiniband/hw
2019-02-15  RDMA/cxgb4: Remove kref accounting for sync operation  (Leon Romanovsky; 3 files, -29/+3)
Ucontext allocation and release aren't async events and don't need kref accounting. The common layer of the RDMA subsystem ensures that dealloc ucontext will be called after all other objects are released. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Raju Rangoju <rajur@chelsio.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-15  RDMA/nes: Remove useless usecnt variable and redundant memset  (Leon Romanovsky; 2 files, -9/+1)
The internal design of RDMA/core ensures that dealloc ucontext will be called only if alloc_ucontext succeeded, hence there is no need to manage an internal variable to mark the validity of the ucontext. As part of this change, remove the redundant memset too. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-15  IB/hfi1: Fix a build warning for TID RDMA READ  (Kaike Wan; 1 file, -0/+1)
The following build warning was produced for the TID RDMA READ patch ("IB/hfi1: Enable TID RDMA READ protocol"):

    drivers/infiniband/hw/hfi1/qp.c: In function 'hfi1_setup_wqe':
    drivers/infiniband/hw/hfi1/qp.c:328:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
       hfi1_setup_tid_rdma_wqe(qp, wqe);
       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    drivers/infiniband/hw/hfi1/qp.c:329:2: note: here
      case IB_QPT_UC:
      ^~~~

This patch fixes the issue by adding a "fall through" comment. Fixes: f1ab4efa6d32 ("IB/hfi1: Enable TID RDMA READ protocol") Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
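A minimal standalone illustration of the fix; the function, constants and flags are hypothetical stand-ins, not the hfi1 code itself:

    /* An intentional fall-through between switch cases, annotated with
     * the comment that silences GCC's -Wimplicit-fallthrough warning. */
    static int setup_flags(int qp_type)
    {
        int flags = 0;

        switch (qp_type) {
        case 2:                 /* stand-in for IB_QPT_RC */
            flags |= 0x1;       /* extra setup only RC needs */
            /* fall through */
        case 3:                 /* stand-in for IB_QPT_UC */
            flags |= 0x2;       /* setup shared by RC and UC */
            break;
        default:
            break;
        }
        return flags;
    }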
2019-02-15  IB/{hw,sw}: Remove 'uobject->context' dependency in object creation APIs  (Shamir Rabinovitch; 17 files, -144/+172)
Now that we have the udata passed to all the ib_xxx object creation APIs, and the additional macro 'rdma_udata_to_drv_context' to get the ib_ucontext from the ib_udata stored in uverbs_attr_bundle, we can finally start to remove the drivers' dependency on ib_xxx->uobject->context. Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
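A hedged sketch of the pattern this series moves drivers to: a create-path callback derives its driver ucontext from the udata it was handed instead of dereferencing ib_xxx->uobject->context. The struct and field names below are illustrative, not any particular driver's:

    #include <rdma/ib_verbs.h>
    #include <rdma/uverbs_ioctl.h>

    /* Hypothetical driver-private ucontext wrapper. */
    struct drv_ucontext {
        struct ib_ucontext ibucontext;
        u32 uar_page;
    };

    static struct drv_ucontext *drv_ctx_from_udata(struct ib_udata *udata)
    {
        /* Resolves the ucontext stashed in the uverbs_attr_bundle;
         * returns NULL for in-kernel callers that pass no udata. */
        return rdma_udata_to_drv_context(udata, struct drv_ucontext,
                                         ibucontext);
    }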
2019-02-14  RDMA/hns: Configure capacity of hns device  (Lijun Ou; 1 file, -0/+5)
This patch adds the IB_DEVICE_MEM_MGT_EXTENSIONS device capability to indicate device support for the following features:
  1. Fast register memory region.
  2. Send with remote invalidate by frmr.
  3. Local invalidate memory region.
It also adds the max depth of the frmr page list len. Signed-off-by: Yangyang Li <liyangyang20@huawei.com> Signed-off-by: Lijun Ou <oulijun@huawei.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-14  RDMA/hns: Delete useless prints for aeq subtype event  (Yixian Liu; 1 file, -51/+6)
Currently, all messages printed for the aeq subtype event are wrong. Thus, delete them and print only the value of the subtype event. Signed-off-by: Yixian Liu <liuyixian@huawei.com> Signed-off-by: Lijun Ou <oulijun@huawei.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-14  RDMA/hns: Set allocated memory to zero for wrid  (Yixian Liu; 1 file, -4/+4)
The memory allocated for wrid should be initialized to zero. Signed-off-by: Yixian Liu <liuyixian@huawei.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-14  RDMA/hns: Fix the state of rereg mr  (Yixian Liu; 1 file, -0/+3)
The state of the mr after a reregister operation should be set to valid. Otherwise, it will keep the same state as before the reregister. Signed-off-by: Yixian Liu <liuyixian@huawei.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-14  RDMA/hns: Limit minimum ROCE CQ depth to 64  (chenglang; 1 file, -0/+1)
This patch modifies the minimum CQ depth specification of hip08 to be consistent with the processing of hip06. Signed-off-by: chenglang <chenglang@huawei.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-13  RDMA/nes: Use for_each_sg_dma_page iterator for umem SGL  (Shiraz, Saleem; 1 file, -131/+89)
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped SGL and get the page DMA address. This avoids the extra loop to iterate pages in the SGE when the for_each_sg iterator is used. Additionally, purge umem->page_shift usage in the driver as it's only relevant for ODP MRs. Use the system page size and shift instead. Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
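A hedged sketch of the conversion pattern shared by this and the similar driver patches below: the DMA-mapped SGL is walked one PAGE_SIZE page at a time and each page's DMA address is collected. The page-table-filling helper is illustrative, not any specific driver's code:

    #include <linux/scatterlist.h>
    #include <rdma/ib_umem.h>

    static void fill_page_table(struct ib_umem *umem, u64 *pbl)
    {
        struct sg_dma_page_iter sg_iter;
        int i = 0;

        /* Visits PAGE_SIZE chunks of the DMA-mapped SGL, so no inner
         * loop over pages within an SGE is needed. */
        for_each_sg_dma_page(umem->sg_head.sgl, &sg_iter, umem->nmap, 0)
            pbl[i++] = sg_page_iter_dma_address(&sg_iter);
    }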
2019-02-13  RDMA: Fix allocation failure on pointer pd  (Colin Ian King; 1 file, -1/+1)
The null check on an allocation failure on pd is currently checking if pd is non-null rather than null. Fix this by adding the missing ! operator. Fixes: 21a428a019c9 ("RDMA: Handle PD allocations by IB/core") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
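An illustrative reconstruction of the bug class (the wrapper struct and helper are made up): the error path must trigger when the allocation returns NULL, which requires the '!' operator:

    #include <linux/errno.h>
    #include <linux/slab.h>

    struct my_pd { u32 pdn; };      /* hypothetical PD wrapper */

    static int alloc_driver_pd(struct my_pd **out)
    {
        struct my_pd *pd = kzalloc(sizeof(*pd), GFP_KERNEL);

        if (!pd)                    /* the fix: was "if (pd)" before */
            return -ENOMEM;

        *out = pd;
        return 0;
    }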
2019-02-13  Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma into for-next  (Doug Ledford; 38 files, -612/+588)
I had merged the hfi1-tid code into my local copy of for-next, but was waiting on 0day testing before pushing it (I pushed it to my wip branch). Having waited several days for 0day testing to show up, I'm finally just going to push it out. In the meantime, though, Jason pushed other stuff to for-next, so I needed to merge up the branches before pushing. Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-11  RDMA/bnxt_re: fix or'ing of data into an uninitialized struct member  (Colin Ian King; 1 file, -1/+1)
The struct member comp_mask has not been initialized, however a bit pattern is being bitwise or'd into the member, and hence other bit fields in comp_mask may contain garbage from the stack. Fix this by turning the bitwise or into an assignment. Fixes: 95b86d1c91ad ("RDMA/bnxt_re: Update kernel user abi to pass chip context") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Devesh Sharma <devesh.sharma@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
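An illustrative sketch of the problem and fix (the response struct is simplified and the flag value is made up): or'ing into a member that was never initialized leaves the remaining bits undefined, so the first write must be a plain assignment (or the struct must be zeroed first):

    struct drv_uctx_resp {
        unsigned long long comp_mask;
    };

    static void fill_resp(struct drv_uctx_resp *resp)
    {
        /* Buggy form: resp->comp_mask |= 0x1ULL; with comp_mask never
         * initialized, the other bits keep whatever garbage was there. */
        resp->comp_mask = 0x1ULL;   /* assignment leaves no undefined bits */
    }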
2019-02-11  RDMA/mlx5: Fix memory leak in case we fail to add an IB device  (Mark Bloch; 1 file, -1/+3)
Make sure the IB device is freed on failure. Fixes: b5ca15ad7e61 ("IB/mlx5: Add proper representors support") Signed-off-by: Mark Bloch <markb@mellanox.com> Reviewed-by: Bodong Wang <bodong@mellanox.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-11  IB/mlx5: Fix bad flow upon DEVX mkey creation  (Yishai Hadas; 1 file, -2/+3)
Fix bad flow upon DEVX mkey creation to prevent deleting the indirect mkey from the radix tree in case there was a previous failure to insert it. Fixes: 534fd7aac56a ("IB/mlx5: Manage indirection mkey upon DEVX flow for ODP") Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-11  RDMA/ocrdma: Use for_each_sg_dma_page iterator on umem SGL  (Shiraz, Saleem; 1 file, -32/+23)
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped SGL and get the page DMA address. This avoids the extra loop to iterate pages in the SGE when the for_each_sg iterator is used. Additionally, purge umem->page_shift usage in the driver as it's only relevant for ODP MRs. Use the system page size and shift instead. Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-11  RDMA/qedr: Use for_each_sg_dma_page iterator on umem SGL  (Shiraz, Saleem; 1 file, -37/+30)
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped SGL and get the page DMA address. This avoids the extra loop to iterate pages in the SGE when the for_each_sg iterator is used. Additionally, purge umem->page_shift usage in the driver as it's only relevant for ODP MRs. Use the system page size and shift instead. Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com> Acked-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-11  RDMA/vmw_pvrdma: Use for_each_sg_dma_page iterator on umem SGL  (Shiraz, Saleem; 1 file, -13/+8)
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped SGL and get the page DMA address. This avoids the extra loop to iterate pages in the SGE when the for_each_sg iterator is used. Additionally, purge umem->page_shift usage in the driver as it's only relevant for ODP MRs. Use the system page size and shift instead. Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-11  RDMA/cxgb3: Use for_each_sg_dma_page iterator on umem SGL  (Shiraz, Saleem; 1 file, -17/+12)
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped SGL and get the page DMA address. This avoids the extra loop to iterate pages in the SGE when the for_each_sg iterator is used. Additionally, purge umem->page_shift usage in the driver as it's only relevant for ODP MRs. Use the system page size and shift instead. Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com> Acked-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-11  RDMA/cxgb4: Use for_each_sg_dma_page iterator on umem SGL  (Shiraz, Saleem; 1 file, -19/+13)
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped SGL and get the page DMA address. This avoids the extra loop to iterate pages in the SGE when the for_each_sg iterator is used. Additionally, purge umem->page_shift usage in the driver as it's only relevant for ODP MRs. Use the system page size and shift instead. Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com> Acked-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-11  RDMA/hns: Use for_each_sg_dma_page iterator on umem SGL  (Shiraz, Saleem; 4 files, -74/+56)
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped SGL and get the page DMA address. This avoids the extra loop to iterate pages in the SGE when the for_each_sg iterator is used. Additionally, purge umem->page_shift usage in the driver as it's only relevant for ODP MRs. Use the system page size and shift instead. Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-11  RDMA/i40iw: Use for_each_sg_dma_page iterator on umem SGL  (Shiraz, Saleem; 1 file, -19/+16)
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped SGL and get the page DMA address. This avoids the extra loop to iterate pages in the SGE when the for_each_sg iterator is used. Additionally, purge umem->page_shift usage in the driver as it's only relevant for ODP MRs. Use the system page size and shift instead. Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-11  RDMA/mthca: Use for_each_sg_dma_page iterator on umem SGL  (Shiraz, Saleem; 1 file, -20/+16)
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped SGL and get the page DMA address. This avoids the extra loop to iterate pages in the SGE when the for_each_sg iterator is used. Additionally, purge umem->page_shift usage in the driver as it's only relevant for ODP MRs. Use the system page size and shift instead. Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-11  RDMA/bnxt_re: Use for_each_sg_dma_page iterator on umem SGL  (Shiraz, Saleem; 2 files, -18/+14)
Use the for_each_sg_dma_page iterator variant to walk the umem DMA-mapped SGL and get the page DMA address. This avoids the extra loop to iterate pages in the SGE when the for_each_sg iterator is used. Additionally, purge umem->page_shift usage in the driver as it's only relevant for ODP MRs. Use the system page size and shift instead. Signed-off-by: Shiraz, Saleem <shiraz.saleem@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-09  Merge branch 'wip/dl-for-next' into for-next  (Doug Ledford; 36 files, -399/+9595)
Due to concurrent work by myself and Jason, a normal fast forward merge was not possible. This brings in a number of hfi1 changes, mainly the hfi1 TID RDMA support (roughly 10,000 LOC change), which was reviewed and integrated over a period of days. Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-08  iw_cxgb4: fix srqidx leak during connection abort  (Raju Rangoju; 1 file, -2/+3)
When an application aborts the connection by moving the QP from RTS to ERROR, iw_cxgb4's modify_rc_qp() RTS->ERROR logic sets *srqidxp to 0 via t4_set_wq_in_error(&qhp->wq, 0), and aborts the connection by calling c4iw_ep_disconnect(). c4iw_ep_disconnect() does the following:
  1. sends up a close_complete_upcall(ep, -ECONNRESET) to libcxgb4.
  2. sends an abort request CPL to hw.
But since the close_complete_upcall() is sent before the ABORT_REQ is sent to hw, libcxgb4 would fail to release the srqidx if the connection holds one, because the srqidx is passed up to libcxgb4 only after the corresponding ABORT_RPL is processed by the kernel in abort_rpl(). This patch handles the corner case by moving the call to close_complete_upcall() from c4iw_ep_disconnect() to abort_rpl(), so that libcxgb4 is notified about the -ECONNRESET only after abort_rpl() and can relinquish the srqidx properly. Signed-off-by: Raju Rangoju <rajur@chelsio.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-08  iw_cxgb4: complete the cached SRQ buffers  (Raju Rangoju; 3 files, -8/+157)
If TP fetches an SRQ buffer but ends up not using it before the connection is aborted, it passes the index of that SRQ buffer to the host in an ABORT_REQ_RSS or ABORT_RPL CPL message. But if the srqidx field is zero in the received ABORT_RPL or ABORT_REQ_RSS CPL, then we need to read the tcb.rq_start field to see if it really did have an RQE cached. This works around a case where HW does not include the srqidx in the ABORT_RPL/ABORT_REQ_RSS CPL. The final value of rq_start is the one present in the TCB with the TF_RX_PDU_OUT bit cleared. So we need to read the TCB and examine the TF_RX_PDU_OUT bit (bit 49 of t_flags) in order to determine whether there is an rx PDU feedback event pending. Signed-off-by: Raju Rangoju <rajur@chelsio.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-08  RDMA: Handle PD allocations by IB/core  (Leon Romanovsky; 27 files, -342/+218)
Allocating the PD in IB/core allows us to simplify drivers and the error flows in their .alloc_pd() paths. The changes in .alloc_pd() go hand in hand with the relevant update in .dealloc_pd(). We use this opportunity to convert .dealloc_pd() so that it cannot fail, as was suggested a long time ago; such failures do not happen in practice, as we have never seen a WARN_ON print. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
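A hedged sketch of the conversion pattern (simplified prototype and made-up names, not the exact in-tree signature): IB/core now allocates the driver's PD wrapper based on a size the driver advertises, so the driver's .alloc_pd() only initializes the embedded ib_pd it is given and never frees it on error:

    #include <rdma/ib_verbs.h>

    struct drv_pd {
        struct ib_pd ib_pd;         /* embedded core object */
        u32 pdn;
    };

    /* After the change: core has already allocated 'ibpd' for us. */
    static int drv_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
    {
        struct drv_pd *pd = container_of(ibpd, struct drv_pd, ib_pd);

        pd->pdn = 0;                /* hardware PD setup would go here */
        return 0;                   /* on error just return; core owns the memory */
    }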
2019-02-08  IB/usnic: Fix locking when unregistering  (Parvi Kaustubhi; 1 file, -2/+4)
Move the call to usnic_ib_device_remove after usnic_ib_ibdev_list_lock has been released. Signed-off-by: Parvi Kaustubhi <pkaustub@cisco.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-08  iw_cxgb4: use tos when finding ipv6 routes  (Steve Wise; 1 file, -2/+3)
When IPv6 support was added, the correct tos was not passed to cxgb_find_route6(). This potentially results in the wrong route entry. Fixes: 830662f6f032 ("RDMA/cxgb4: Add support for active and passive open connection with IPv6 address") Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-08  iw_cxgb4: use tos when importing the endpoint  (Steve Wise; 1 file, -1/+1)
import_ep() is passed the correct tos, but doesn't use it correctly. Fixes: ac8e4c69a021 ("cxgb4/iw_cxgb4: TOS support") Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-08  iw_cxgb4: use listening ep tos when accepting new connections  (Steve Wise; 1 file, -2/+7)
If the parent listening endpoint has a service type set, then use that when setting up the connection. This allows server-side applications to mandate the tos for passive side connections via rdma_set_service_type() on the listening endpoints. Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-07  RDMA/bnxt_re: Update kernel user abi to pass chip context  (Devesh Sharma; 1 file, -3/+14)
The user space verbs provider library needs the chip context. Change the ABI to add chip version details to the structure. Furthermore, change the kernel driver ucontext allocation code to initialize the ABI structure with appropriate values. As suggested by the community, the new fields are appended at the bottom of the ABI structure and the older fields are retained as they were in the older versions. The ABI version is kept at 1, and a new field is added in the ucontext response structure to hold the component mask. The user space library should check pre-defined flags to figure out whether a certain feature is supported or not. Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
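A hedged sketch of the ABI-extension pattern described above (the real structure lives in the bnxt_re uapi header; the field and flag names here are illustrative): new fields are appended so an old library that only knows the shorter struct keeps working, and comp_mask tells a new library which appended fields are valid:

    #include <linux/types.h>

    #define DRV_UCNTX_CMASK_HAVE_CCTX 0x1ULL    /* chip fields below are valid */

    struct drv_uctx_resp {
        /* original fields, unchanged for existing user space */
        __u32 dev_id;
        __u32 max_qp;
        /* appended fields; ABI version stays the same */
        __aligned_u64 comp_mask;
        __u32 chip_num;
        __u32 chip_rev;
    };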
2019-02-07  RDMA/bnxt_re: Add extended psn structure for 57500 adapters  (Devesh Sharma; 5 files, -18/+70)
The new 57500 series of adapters has a bigger psn search structure; the size of the new structure is 16B. Change the control path memory allocation and fast path code to accommodate the new psn structure while maintaining backward compatibility. There are a few additional changes, listed below:
  - For the 57500 chip, max-sge is limited to 6 for now.
  - For the 57500 chip, max-receive-sge should be set to 6 for now.
  - Add the driver/hardware interface structure for the new chip.
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-07  RDMA/bnxt_re: Enable GSI QP support for 57500 series  (Devesh Sharma; 4 files, -55/+99)
In the new 57500 series of adapters the GSI QP is a UD type QP, unlike the previous generation where it was a Raw Eth QP. Change the control and data path to support this. The significant diffs are:
  - AH creation resolves the network type unconditionally.
  - Add checks at relevant places to distinguish from the Raw Eth processing flow.
  - bnxt_re_process_res_ud_wc reports completion with the GRH flag when the qp is GSI.
  - Change length, cfa_meta and smac to match the new driver/hardware interface.
  - Add the new driver/hardware interface.
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-07  RDMA/bnxt_re: Skip backing store allocation for 57500 series  (Devesh Sharma; 4 files, -7/+14)
The backing store that keeps the HW context data structures is allocated and initialized by the L2 driver. For the 57500 chip, the RoCE driver does not need to allocate and initialize additional memory, so skip the duplicate allocation and initialization for 57500 adapters. The driver continues as before for older chips. This patch also takes care of aligning the stats context memory to a 128-byte boundary, a requirement for the 57500 series of chips. Older chips do not care about alignment, thus the change is unconditional. Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-07  RDMA/bnxt_re: Add 64bit doorbells for 57500 series  (Devesh Sharma; 9 files, -153/+276)
The new chip series has 64-bit doorbells for notification queues. Thus, both control and data path event queues need new routines to write the 64-bit doorbell; add them. There is a new doorbell interface between the chip and the driver, so change the chip-specific data structure definitions. Additional significant changes are listed below:
  - bnxt_re_net_ring_free/alloc takes a new argument.
  - bnxt_qplib_enable_nq and enable_rcfw use the new doorbell offset for the new chip.
  - DB mapping for NQ and CREQ now maps 8 bytes.
  - The DBR_DBR_* macros are renamed to DBC_DBC_*.
  - Store nq_db_offset in a 32-bit data type.
  - Get rid of __iowrite64_copy and use writeq instead.
  - Change the DB header initialization to a simpler scheme.
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-07  RDMA/bnxt_re: Add chip context to identify 57500 series  (Devesh Sharma; 5 files, -1/+51)
Add setup and destroy routines for the chip context. The chip context will be used frequently in the control and data paths to choose the execution flow depending on the chip type. A chip context structure pointer is added to the relevant data structures. Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Devesh Sharma <devesh.sharma@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-07  IB/mlx5: Simplify WQE count power of two check  (Gal Pressman; 1 file, -3/+3)
Use is_power_of_2() instead of hard coding it in the driver. While at it, fix the meaningless error print. Signed-off-by: Gal Pressman <galpress@amazon.com> Acked-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
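A minimal illustration of the cleanup, with a made-up validation helper rather than the mlx5 code: the open-coded power-of-two test is replaced by is_power_of_2() from <linux/log2.h>:

    #include <linux/errno.h>
    #include <linux/log2.h>

    static int validate_wqe_count(unsigned int wqe_count)
    {
        /* replaces the open-coded form: (wqe_count & (wqe_count - 1)) != 0 */
        if (!is_power_of_2(wqe_count))
            return -EINVAL;
        return 0;
    }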
2019-02-07  drivers/IB,usnic: reduce scope of mmap_sem  (Davidlohr Bueso; 3 files, -55/+6)
usnic_uiom_get_pages() uses gup_longterm(), so we cannot really get rid of mmap_sem altogether in the driver, but we can get rid of some of the complexity that mmap_sem brings when it is held only for pinned_vm. We can get rid of the wq altogether as we no longer need to defer work to unpin pages now that the counter is atomic. We also share the lock. Acked-by: Parvi Kaustubhi <pkaustub@cisco.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-07  drivers/IB,hfi1: do not use mmap_sem  (Davidlohr Bueso; 1 file, -6/+0)
This driver already uses gup_fast() and thus we can just drop the mmap_sem protection around the pinned_vm counter. Note that the window between when hfi1_can_pin_pages() is called and when the actual counter is incremented remains the same, as mmap_sem was _only_ used while ->pinned_vm was touched. Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Davidlohr Bueso <dbueso@suse.det> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-07  drivers/IB,qib: optimize mmap_sem usage  (Davidlohr Bueso; 1 file, -46/+27)
The driver uses mmap_sem for both pinned_vm accounting and get_user_pages(). Because rdma drivers might want to use gup_longterm() in the future, we still need some sort of mmap_sem serialization (as opposed to removing it entirely by using gup_fast()). Now that pinned_vm is atomic, the writer lock can therefore be converted to a reader lock. This also fixes a bug where __qib_get_user_pages was not taking into account the current value of pinned_vm. Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-07  mm: make mm->pinned_vm an atomic64 counter  (Davidlohr Bueso; 3 files, -9/+9)
Taking a sleeping lock to _only_ increment a variable is quite the overkill, and pretty much all users do this. Furthermore, some drivers (i.e. infiniband and scif) that need pinned semantics can go to quite some trouble to delay the (un)accounting of pinned pages via a workqueue when the lock cannot be acquired. By making the counter atomic we no longer need to hold the mmap_sem and can simplify some code around it for pinned_vm users. The counter is 64-bit so that we need not worry about overflows from rdma user input controlled from userspace. Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Christoph Lameter <cl@linux.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
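A hedged sketch of what the atomic counter buys pinning drivers: the accounting no longer needs mmap_sem for a simple add/subtract. The limit check below follows the common pattern in such drivers and is illustrative, not any specific caller:

    #include <linux/atomic.h>
    #include <linux/errno.h>
    #include <linux/mm_types.h>

    static int account_pinned(struct mm_struct *mm, long npages,
                              unsigned long lock_limit)
    {
        /* No mmap_sem needed: the increment and the limit check are done
         * on the atomic64_t counter directly. */
        if (atomic64_add_return(npages, &mm->pinned_vm) > lock_limit) {
            atomic64_sub(npages, &mm->pinned_vm);
            return -ENOMEM;
        }
        return 0;
    }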
2019-02-05  IB/hfi1: Prioritize the sending of ACK packets  (Kaike Wan; 12 files, -51/+144)
ACK packets are generally associated with request completion and resource release and therefore should be sent first. This patch optimizes the send engine by using the following policies:
  (1) QPs with the RVT_S_ACK_PENDING bit set in qp->s_flags or qpriv->s_flags have their priority incremented;
  (2) QPs with an ACK or TID-ACK packet queued have their priority incremented;
  (3) When a QP is queued to the wait list due to resource constraints, it is queued to the head if it has an ACK packet to send;
  (4) When selecting QPs to run from the wait list, the one with the highest priority and starve_cnt is selected; each priority level is equivalent to a fixed number of starve_cnt (16).
Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05  IB/hfi1: Add static trace for TID RDMA WRITE protocol  (Kaike Wan; 6 files, -3/+692)
This patch makes the following changes to the static trace:
  1. Adds decoding of TID RDMA WRITE packets in the IB header trace;
  2. Adds trace events for various stages of the TID RDMA WRITE protocol.
These events provide fine-grained control for monitoring and debugging the hfi1 driver in the field. Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05  IB/hfi1: Enable TID RDMA WRITE protocol  (Kaike Wan; 2 files, -1/+24)
This patch enables the TID RDMA WRITE protocol by converting a qualified RDMA WRITE request into a TID RDMA WRITE request internally. A request qualifies when (as sketched below):
  (1) the TID RDMA capability is enabled;
  (2) the request starts on a 4K page boundary;
  (3) the request length is a multiple of 4K and is larger than or equal to 256K.
Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
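A hedged sketch of the eligibility test the three conditions above describe; the constants come from the commit text and the helper name is made up:

    #include <linux/types.h>

    #define TID_RDMA_PAGE_SIZE  4096ULL          /* 4K alignment and granularity */
    #define TID_RDMA_MIN_WRITE  (256ULL * 1024)  /* minimum request length */

    static bool qualifies_for_tid_write(bool tid_cap_enabled, u64 start, u64 len)
    {
        if (!tid_cap_enabled)
            return false;
        if (start & (TID_RDMA_PAGE_SIZE - 1))    /* must start on a 4K boundary */
            return false;
        /* length must be a 4K multiple and at least 256K */
        return !(len & (TID_RDMA_PAGE_SIZE - 1)) && len >= TID_RDMA_MIN_WRITE;
    }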
2019-02-05  IB/hfi1: Add interlock between TID RDMA WRITE and other requests  (Kaike Wan; 3 files, -2/+59)
This locking mechanism is designed to prevent various memory corruption scenarios from occurring when requests are pipelined, especially when RDMA WRITE requests are interleaved with TID RDMA READ requests:
  1. READ-AFTER-READ;
  2. READ-AFTER-WRITE;
  3. WRITE-AFTER-READ;
  4. WRITE-AFTER-WRITE.
When memory corruption is likely, a request will be held back until previous requests have been completed. Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05  IB/hfi1: Add TID RDMA WRITE functionality into RDMA verbs  (Kaike Wan; 6 files, -49/+480)
This patch integrates the TID RDMA WRITE protocol into the normal RDMA verbs framework. The TID RDMA WRITE protocol is an end-to-end protocol between the hfi1 drivers on two OPA nodes that converts a qualified RDMA WRITE request into a TID RDMA WRITE request to avoid data copying on the responder side. Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05  IB/hfi1: Add the dual leg code  (Kaike Wan; 7 files, -10/+217)
The "Second Leg" of the TID RDMA WRITE protocol deals with the transfer of data and ack packets, which are in the KDETH PSN space, as opposed to the IB PSN space. Therefore, the Second Leg could be considered as a separate state machine. As such, it is handled by a different work queue item which is scheduled along with the normal IB state machine work item. Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2019-02-05  IB/hfi1: Add the TID second leg ACK packet builder  (Kaike Wan; 1 file, -0/+141)
This patch adds the TID packet builder for the responder side, which contains the state machine to build the TID RDMA ACK packet for either TID RDMA WRITE DATA or TID RDMA RESYNC packets. Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>