path: root/net/sunrpc/xprtrdma/rpc_rdma.c
Age | Commit message | Author | Files | Lines
2017-09-05 | xprtrdma: Use xprt_pin_rqst in rpcrdma_reply_handler | Chuck Lever | 1 | -45/+17
Adopt the use of xprt_pin_rqst to eliminate contention between Call-side users of rb_lock and the use of rb_lock in rpcrdma_reply_handler. This replaces the mechanism introduced in 431af645cf66 ("xprtrdma: Fix client lock-up after application signal fires").

Use recv_lock to quickly find the completing rqst, pin it, then drop the lock. At that point invalidation and pull-up of the Reply XDR can be done. Both are often expensive operations.

Finally, take recv_lock again to signal completion to the RPC layer. It also protects adjustment of "cwnd".

This greatly reduces the amount of time a lock is held by the reply handler. Comparing lock_stat results shows a marked decrease in contention on rb_lock and recv_lock.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
[trond.myklebust@primarydata.com: Remove call to rpcrdma_buffer_put() from the "out_norqst:" path in rpcrdma_reply_handler.]
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
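The find, pin, drop, work, relock sequence described above can be sketched in user-space C. This is a single-threaded model, not the kernel code: the `toy_*` types and functions are hypothetical stand-ins, the "lock" is just a flag, and `toy_slow_step()` only records whether the expensive step ran while the lock was held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are not the kernel's rpc_rqst or rpc_xprt. */
struct toy_rqst {
	uint32_t xid;
	int pinned;		/* non-zero: must not be released yet */
	bool completed;
};

struct toy_xprt {
	bool recv_locked;	/* single-threaded model of recv_lock */
	struct toy_rqst *rqsts;
	size_t nr_rqsts;
	unsigned int slow_steps_under_lock;
};

static void toy_lock(struct toy_xprt *x)
{
	assert(!x->recv_locked);
	x->recv_locked = true;
}

static void toy_unlock(struct toy_xprt *x)
{
	assert(x->recv_locked);
	x->recv_locked = false;
}

/* Stands in for invalidation and pull-up of the Reply XDR. */
static void toy_slow_step(struct toy_xprt *x)
{
	if (x->recv_locked)
		x->slow_steps_under_lock++;
}

static bool toy_reply_handler(struct toy_xprt *x, uint32_t xid)
{
	struct toy_rqst *rqst = NULL;
	size_t i;

	/* 1. Short critical section: find the request and pin it. */
	toy_lock(x);
	for (i = 0; i < x->nr_rqsts; i++) {
		if (x->rqsts[i].xid == xid) {
			rqst = &x->rqsts[i];
			rqst->pinned++;
			break;
		}
	}
	toy_unlock(x);
	if (!rqst)
		return false;

	/* 2. Expensive work runs with the lock dropped. */
	toy_slow_step(x);

	/* 3. Retake the lock only to complete and unpin. */
	toy_lock(x);
	rqst->completed = true;
	rqst->pinned--;
	toy_unlock(x);
	return true;
}
```

The pin count is what would keep another context from releasing the request between steps 1 and 3; the model does not exercise that path, it only shows the ordering.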
2017-09-05 | Merge tag 'nfs-rdma-for-4.14-1' of git://git.linux-nfs.org/projects/anna/linux-nfs into linux-next | Trond Myklebust | 1 | -339/+505
NFS-over-RDMA client updates for Linux 4.14

Bugfixes and cleanups:
- Constify rpc_xprt_ops
- Harden RPC call encoding and decoding
- Clean up rpc call decoding to use xdr_streams
- Remove unused variables from various structures
- Refactor code to remove imul instructions
- Rearrange rx_stats structure for better cacheline sharing
2017-08-18 | SUNRPC: Add a separate spinlock to protect the RPC request receive list | Trond Myklebust | 1 | -4/+4
This further reduces contention with the transport_lock, and allows us to convert to using a non-bh-safe spinlock, since the list is now never accessed from a bh context. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15 | xprtrdma: Remove imul instructions from chunk list encoders | Chuck Lever | 1 | -24/+21
Re-arrange the pointer arithmetic in the chunk list encoders to eliminate several more integer multiplication instructions during Transport Header encoding. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-15 | xprtrdma: Remove imul instructions from rpcrdma_convert_iovs() | Chuck Lever | 1 | -57/+48
Re-arrange the pointer arithmetic in rpcrdma_convert_iovs() to eliminate several integer multiplication instructions during Transport Header encoding. Also, array overflow does not occur outside development environments, so replace overflow checking with one spot check at the end. This reduces the number of conditional branches in the common case. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
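As a rough illustration of the kind of rearrangement meant here (not the patch itself), compare filling an array by index with walking a pointer through it. `struct toy_seg` and both functions are hypothetical. Whether a given compiler actually emits a multiply for the indexed form depends on the element size and optimization level, so treat this as a sketch of the idea only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical segment descriptor; the kernel's rpcrdma_mr_seg differs. */
struct toy_seg {
	void *addr;
	uint32_t len;
};

/* Indexed form: each seg[n] access is base + n * sizeof(*seg). */
static size_t toy_fill_by_index(struct toy_seg *seg, size_t max,
				char *base, size_t len, size_t chunk)
{
	size_t n = 0;

	while (len && n < max) {
		size_t nbytes = len < chunk ? len : chunk;

		seg[n].addr = base;
		seg[n].len = (uint32_t)nbytes;
		base += nbytes;
		len -= nbytes;
		n++;
	}
	return n;
}

/* Pointer form: the cursor advances by a constant stride. */
static size_t toy_fill_by_pointer(struct toy_seg *seg, size_t max,
				  char *base, size_t len, size_t chunk)
{
	struct toy_seg *cur = seg;
	struct toy_seg *end = seg + max;

	while (len && cur < end) {
		size_t nbytes = len < chunk ? len : chunk;

		cur->addr = base;
		cur->len = (uint32_t)nbytes;
		base += nbytes;
		len -= nbytes;
		cur++;
	}
	return (size_t)(cur - seg);
}
```

Both functions produce the same segments; only the addressing style differs.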
2017-08-11 | xprtrdma: Harden chunk list encoding against send buffer overflow | Chuck Lever | 1 | -86/+142
While marshaling chunk lists which are variable-length XDR objects, check for XDR buffer overflow at every step. Measurements show no significant changes in CPU utilization. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
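A minimal user-space sketch of "check for overflow at every step" follows. It is not the kernel's xdr_stream API: `toy_xdr_enc`, `toy_reserve()` and the encoders are hypothetical names, and -1 stands in for whatever error the real code returns. The wire layout assumed is the RFC 8166 Read list entry: an item-present word, then position, handle, length, and a 64-bit offset.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoder state: a cursor and a hard end-of-buffer limit. */
struct toy_xdr_enc {
	uint8_t *p;
	uint8_t *end;
};

/* Hand out nbytes of space, or NULL if that would overrun the buffer. */
static uint8_t *toy_reserve(struct toy_xdr_enc *xdr, size_t nbytes)
{
	uint8_t *p = xdr->p;

	if ((size_t)(xdr->end - xdr->p) < nbytes)
		return NULL;
	xdr->p += nbytes;
	return p;
}

/* Store one 32-bit word in network byte order. */
static uint8_t *toy_put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
	return p + 4;
}

/* Encode one Read list entry: six XDR words, reserved in one step. */
static int toy_encode_read_segment(struct toy_xdr_enc *xdr, uint32_t position,
				   uint32_t handle, uint32_t length,
				   uint64_t offset)
{
	uint8_t *p = toy_reserve(xdr, 6 * 4);

	if (!p)
		return -1;		/* no room: nothing was written */
	p = toy_put_be32(p, 1);		/* item present */
	p = toy_put_be32(p, position);
	p = toy_put_be32(p, handle);
	p = toy_put_be32(p, length);
	p = toy_put_be32(p, (uint32_t)(offset >> 32));
	toy_put_be32(p, (uint32_t)offset);
	return 0;
}

/* Terminate the list with an item-not-present word. */
static int toy_encode_list_end(struct toy_xdr_enc *xdr)
{
	uint8_t *p = toy_reserve(xdr, 4);

	if (!p)
		return -1;
	toy_put_be32(p, 0);
	return 0;
}
```

Reserving the whole entry before writing any of it means a failed encode leaves the cursor where it was, so the caller can fall back cleanly.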
2017-08-11 | xprtrdma: Set up an xdr_stream in rpcrdma_marshal_req() | Chuck Lever | 1 | -6/+18
Initialize an xdr_stream at the top of rpcrdma_marshal_req(), and use it to encode the fixed transport header fields. This xdr_stream will be used to encode the chunk lists in a subsequent patch. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-11 | xprtrdma: Remove rpclen from rpcrdma_marshal_req | Chuck Lever | 1 | -7/+2
Clean up: Remove a variable whose result is no longer used. Commit 655fec6987be ("xprtrdma: Use gathered Send for large inline messages") should have removed it. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-11 | xprtrdma: Clean up rpcrdma_marshal_req() synopsis | Chuck Lever | 1 | -8/+17
Clean up: The caller already has rpcrdma_xprt, so pass that directly instead. And provide a documenting comment for this critical function. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-08 | xprtrdma: Remove rpcrdma_rep::rr_len | Chuck Lever | 1 | -1/+1
This field is no longer used outside the Receive completion handler. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-08 | xprtrdma: Replace rpcrdma_count_chunks() | Chuck Lever | 1 | -94/+127
Clean up chunk list decoding by using the xdr_stream set up in rpcrdma_reply_handler. This hardens decoding by checking for buffer overflow at every step while unmarshaling variable-length XDR objects. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
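The decode side of the same idea can be sketched as follows. Again this is a user-space model with hypothetical names (`toy_xdr_dec`, `toy_inline_decode()`, `toy_count_read_segments()`), not the kernel's xdr_stream. It walks a Read list laid out as in RFC 8166 and refuses to read past the end of the received bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder state over untrusted received bytes. */
struct toy_xdr_dec {
	const uint8_t *p;
	const uint8_t *end;
};

/* Return a pointer to the next nbytes, or NULL if the input is too short. */
static const uint8_t *toy_inline_decode(struct toy_xdr_dec *xdr, size_t nbytes)
{
	const uint8_t *p = xdr->p;

	if ((size_t)(xdr->end - xdr->p) < nbytes)
		return NULL;
	xdr->p += nbytes;
	return p;
}

/* Load one 32-bit word from network byte order. */
static uint32_t toy_get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Count Read list entries. Each entry is an item-present word followed by
 * five more words (position, handle, length, 64-bit offset); a zero word
 * ends the list. Returns 0 on success, -1 if the list is truncated.
 */
static int toy_count_read_segments(struct toy_xdr_dec *xdr, unsigned int *count)
{
	const uint8_t *p;

	*count = 0;
	for (;;) {
		p = toy_inline_decode(xdr, 4);
		if (!p)
			return -1;
		if (toy_get_be32(p) == 0)
			return 0;
		if (!toy_inline_decode(xdr, 5 * 4))
			return -1;
		(*count)++;
	}
}
```

Every iteration consumes input, so the loop terminates on any finite buffer, including one with no terminating word.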
2017-08-08 | xprtrdma: Refactor rpcrdma_reply_handler() | Chuck Lever | 1 | -82/+130
Refactor the reply handler's transport header decoding logic to make it easier to understand and update. Convert some of the handler to use xdr_streams, which will enable stricter validation of input data and enable the eventual addition of support for new combinations of chunks, such as "Write + Reply" or "PZRC + normal Read". Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-08 | xprtrdma: Harden backchannel call decoding | Chuck Lever | 1 | -21/+37
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-08 | xprtrdma: Add xdr_init_decode to rpcrdma_reply_handler() | Chuck Lever | 1 | -14/+23
Transport header decoding deals with untrusted input data, therefore decoding this header needs to be hardened. Adopt the same infrastructure that is used when XDR decoding NFS replies. This is slightly more CPU-intensive than the replaced code, but we're not adding new atomics, locking, or context switches. The cost is manageable. Start by initializing an xdr_stream in rpcrdma_reply_handler(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 | xprtrdma: Replace PAGE_MASK with offset_in_page() | Chuck Lever | 1 | -8/+8
Clean up. Reported by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
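For readers unfamiliar with the helper, the clean-up named in the subject line swaps an open-coded mask for a named one. The sketch below models both forms in user space, assuming a 4096-byte page; `TOY_PAGE_SIZE`, `TOY_PAGE_MASK` and the `toy_*` names are stand-ins for the kernel's definitions.

```c
#include <stdint.h>

#define TOY_PAGE_SIZE	4096UL			/* assumed page size */
#define TOY_PAGE_MASK	(~(TOY_PAGE_SIZE - 1))	/* selects the page-number bits */

/* Open-coded form: the reader must recall what the inverted mask selects. */
static unsigned long toy_open_coded_offset(const void *p)
{
	return (unsigned long)((uintptr_t)p & ~TOY_PAGE_MASK);
}

/* Named form: same arithmetic, but the intent is in the name. */
#define toy_offset_in_page(p) \
	((unsigned long)((uintptr_t)(p) & ~TOY_PAGE_MASK))
```

The two forms compute the same value, so the change is purely about readability.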
2017-07-13 | xprtrdma: Fix client lock-up after application signal fires | Chuck Lever | 1 | -29/+50
After a signal, the RPC client aborts synchronous RPCs running on behalf of the signaled application. The server is still executing those RPCs, and will write the results back into the client's memory when it's done. By the time the server writes the results, that memory is likely being used for other purposes. Therefore xprtrdma has to immediately invalidate all memory regions used by those aborted RPCs to prevent the server's writes from clobbering that re-used memory.

With FMR memory registration, invalidation takes a relatively long time. In fact, the invalidation is often still running when the server tries to write the results into the memory regions that are being invalidated.

This sets up a race between two processes:

1. After the signal, xprt_rdma_free calls ro_unmap_safe.
2. While ro_unmap_safe is still running, the server replies and rpcrdma_reply_handler runs, calling ro_unmap_sync.

Both processes invoke ib_unmap_fmr on the same FMR.

The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at the same time, but HCAs generally don't tolerate this. Sometimes this can result in a system crash.

If the HCA happens to survive, rpcrdma_reply_handler continues. It removes the rpc_rqst from rq_list and releases the transport_lock. This enables xprt_rdma_free to run in another process, and the rpc_rqst is released while rpcrdma_reply_handler is still waiting for the ib_unmap_fmr call to finish.

But further down in rpcrdma_reply_handler, the transport_lock is taken again, and "rqst" is dereferenced. If "rqst" has already been released, this triggers a general protection fault. Since bottom-halves are disabled, the system locks up.

Address both issues by reversing the order of the xprt_lookup_rqst call and the ro_unmap_sync call. Introduce a separate lookup mechanism for rpcrdma_req's to enable calling ro_unmap_sync before xprt_lookup_rqst. Now the handler takes the transport_lock once and holds it for the XID lookup and RPC completion.

BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 | xprtrdma: Pass only the list of registered MRs to ro_unmap_sync | Chuck Lever | 1 | -9/+7
There are rare cases where an rpcrdma_req can be re-used (via rpcrdma_buffer_put) while the RPC reply handler is still running. This is due to a signal firing at just the wrong instant. Since commit 9d6b04097882 ("xprtrdma: Place registered MWs on a per-req list"), rpcrdma_mws are self-contained; ie., they fully describe an MR and scatterlist, and no part of that information is stored in struct rpcrdma_req. As part of closing the above race window, pass only the req's list of registered MRs to ro_unmap_sync, rather than the rpcrdma_req itself. Some extra transport header sanity checking is removed. Since the client depends on its own recollection of what memory had been registered, there doesn't seem to be a way to abuse this change. And, the check was not terribly effective. If the client had sent Read chunks, the "list_empty" test is negative in both of the removed cases, which are actually looking for Write or Reply chunks. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 | xprtrdma: Pre-mark remotely invalidated MRs | Chuck Lever | 1 | -2/+20
There are rare cases where an rpcrdma_req and its matched rpcrdma_rep can be re-used, via rpcrdma_buffer_put, while the RPC reply handler is still using that req. This is typically due to a signal firing at just the wrong instant. As part of closing this race window, avoid using the wrong rpcrdma_rep to detect remotely invalidated MRs. Mark MRs as invalidated while we are sure the rep is still OK to use. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 | xprtrdma: Squelch ENOBUFS warnings | Chuck Lever | 1 | -3/+5
When ro_map is out of buffers, that's not a permanent error, so don't report a problem. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-04-25 | xprtrdma: Use same device when mapping or syncing DMA buffers | Chuck Lever | 1 | -2/+2
When the underlying device driver is reloaded, ia->ri_device will be replaced. All cached copies of that device pointer have to be updated as well. Commit 54cbd6b0c6b9 ("xprtrdma: Delay DMA mapping Send and Receive buffers") added the rg_device field to each regbuf. As part of handling a device removal, rpcrdma_dma_unmap_regbuf is invoked on all regbufs for a transport. Simply calling rpcrdma_dma_map_regbuf for each Receive buffer after the driver has been reloaded should reinitialize rg_device correctly for every case except rpcrdma_wc_receive, which still uses rpcrdma_rep::rr_device. Ensure the same device that was used to map a Receive buffer is also used to sync it in rpcrdma_wc_receive by using rg_device there instead of rr_device. This is the only use of rr_device, so it can be removed. The use of regbufs in the send path is also updated, for completeness. Fixes: 54cbd6b0c6b9 ("xprtrdma: Delay DMA mapping Send and ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10 | xprtrdma: Refactor management of mw_list field | Chuck Lever | 1 | -3/+3
Clean up some duplicate code. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10 | xprtrdma: Properly recover FRWRs with in-flight FASTREG WRs | Chuck Lever | 1 | -6/+8
Sriharsha (sriharsha.basavapatna@broadcom.com) reports an occasional double DMA unmap of an FRWR MR when a connection is lost. I see one way this can happen.

When a request requires more than one segment or chunk, rpcrdma_marshal_req loops, invoking ->frwr_op_map for each segment (MR) in each chunk. Each call posts a FASTREG Work Request to register one MR.

Now suppose that the transport connection is lost part-way through marshaling this request. As part of recovering and resetting that req, rpcrdma_marshal_req invokes ->frwr_op_unmap_safe, which hands all the req's registered FRWRs to the MR recovery thread.

But note: FRWR registration is asynchronous. So it's possible that some of these "already registered" FRWRs are fully registered, and some are still waiting for their FASTREG WR to complete.

When the connection is lost, the "already registered" frmrs are marked FRMR_IS_VALID, and the "still waiting" WRs flush. Then frwr_wc_fastreg marks these frmrs FRMR_FLUSHED_FR.

But thanks to ->frwr_op_unmap_safe, the MR recovery thread is doing an unreg / alloc_mr, a DMA unmap, and marking each of these frwrs FRMR_IS_INVALID, at the same time frwr_wc_fastreg might be running.

- If the recovery thread runs last, then the frmr is marked FRMR_IS_INVALID, and life continues.
- If frwr_wc_fastreg runs last, the frmr is marked FRMR_FLUSHED_FR, but the recovery thread has already DMA unmapped that MR. When ->frwr_op_map later re-uses this frmr, it sees it is not marked FRMR_IS_INVALID, and tries to recover it before using it, resulting in a second DMA unmap of the same MR.

The fix is to guarantee in-flight FASTREG WRs have flushed before MR recovery runs on those FRWRs. Thus we depend on ro_unmap_safe (called from xprt_rdma_send_request on retransmit, or from xprt_rdma_free) to clean up old registrations as needed.

Reported-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
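The two orderings described in this commit message can be replayed with a small state machine. This is a toy model, not driver code: the state names come from the message, while `struct toy_frmr` and the `toy_*` functions are hypothetical, and "DMA unmap" is reduced to a flag plus a counter of unmaps performed on an already-unmapped MR.

```c
#include <stdbool.h>

enum toy_frmr_state { FRMR_IS_INVALID, FRMR_IS_VALID, FRMR_FLUSHED_FR };

struct toy_frmr {
	enum toy_frmr_state state;
	bool dma_mapped;
	int double_unmaps;	/* unmaps attempted while already unmapped */
};

/* MR recovery: unmap and mark the MR reusable. */
static void toy_recover(struct toy_frmr *f)
{
	if (!f->dma_mapped)
		f->double_unmaps++;
	f->dma_mapped = false;
	f->state = FRMR_IS_INVALID;
}

/* Completion handler for a FASTREG WR that was flushed. */
static void toy_wc_fastreg_flushed(struct toy_frmr *f)
{
	f->state = FRMR_FLUSHED_FR;
}

/* Re-use of the MR: anything not marked invalid is recovered first. */
static void toy_op_map(struct toy_frmr *f)
{
	if (f->state != FRMR_IS_INVALID)
		toy_recover(f);
	f->dma_mapped = true;
	f->state = FRMR_IS_VALID;
}
```

Calling `toy_recover()` before `toy_wc_fastreg_flushed()` reproduces the double unmap on the next `toy_op_map()`; running the flush first, as the fix guarantees, does not.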
2017-02-10 | xprtrdma: Reduce required number of send SGEs | Chuck Lever | 1 | -3/+23
The MAX_SEND_SGES check introduced in commit 655fec6987be ("xprtrdma: Use gathered Send for large inline messages") fails for devices that have a small max_sge.

Instead of checking for a large fixed maximum number of SGEs, check for a minimum small number. RPC-over-RDMA will switch to using a Read chunk if an xdr_buf has more pages than can fit in the device's max_sge limit. This is considerably better than failing altogether to mount the server.

This fix supports devices that have as few as three send SGEs available.

Reported-by: Selvin Xavier <selvin.xavier@broadcom.com>
Reported-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reported-by: Honggang Li <honli@redhat.com>
Reported-by: Ram Amrani <Ram.Amrani@cavium.com>
Fixes: 655fec6987be ("xprtrdma: Use gathered Send for large ...")
Cc: stable@vger.kernel.org # v4.9+
Tested-by: Honggang Li <honli@redhat.com>
Tested-by: Ram Amrani <Ram.Amrani@cavium.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-10 | xprtrdma: Per-connection pad optimization | Chuck Lever | 1 | -14/+14
Pad optimization is changed by echoing into /proc/sys/sunrpc/rdma_pad_optimize. This is a global setting, affecting all RPC-over-RDMA connections to all servers. The marshaling code picks up that value and uses it for decisions about how to construct each RPC-over-RDMA frame. Having it change suddenly in mid-operation can result in unexpected failures. And some servers a client mounts might need chunk round-up, while others don't. So instead, copy the pad_optimize setting into each connection's rpcrdma_ia when the transport is created, and use the copy, which can't change during the life of the connection, instead. This also removes a hack: rpcrdma_convert_iovs was using the remote-invalidation-expected flag to predict when it could leave out Write chunk padding. This is because the Linux server handles implicit XDR padding on Write chunks correctly, and only Linux servers can set the connection's remote-invalidation-expected flag. It's more sensible to use the pad optimization setting instead. Fixes: 677eb17e94ed ("xprtrdma: Fix XDR tail buffer marshalling") Cc: stable@vger.kernel.org # v4.9+ Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
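The "copy the setting when the transport is created" approach amounts to taking a snapshot of a global tunable. A minimal sketch, assuming hypothetical names throughout (`toy_pad_optimize` stands in for the sysctl value, and `struct toy_ia` with its `pad_optimize` field stands in for the per-connection state):

```c
#include <stdbool.h>

/* Stand-in for the global tunable; it can change at any time. */
static int toy_pad_optimize = 1;

/* Hypothetical per-connection state holding a private copy. */
struct toy_ia {
	bool pad_optimize;
};

/* Snapshot the global once, when the connection state is set up. */
static void toy_ia_open(struct toy_ia *ia)
{
	ia->pad_optimize = toy_pad_optimize != 0;
}

/* Marshaling decisions consult only the per-connection copy. */
static bool toy_needs_roundup(const struct toy_ia *ia)
{
	return !ia->pad_optimize;
}
```

A later change to the global affects only connections opened afterwards, which is the behavior the commit message asks for.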
2017-02-10 | xprtrdma: Fix Read chunk padding | Chuck Lever | 1 | -6/+4
When pad optimization is disabled, rpcrdma_convert_iovs still does not add explicit XDR round-up padding to a Read chunk. Commit 677eb17e94ed ("xprtrdma: Fix XDR tail buffer marshalling") incorrectly short-circuited the test for whether round-up padding is needed that appears later in rpcrdma_convert_iovs. However, if this is indeed a regular Read chunk (and not a Position-Zero Read chunk), the tail iovec _always_ contains the chunk's padding, and never anything else. So, it's easy to just skip the tail when padding optimization is enabled, and add the tail in a subsequent Read chunk segment, if disabled. Fixes: 677eb17e94ed ("xprtrdma: Fix XDR tail buffer marshalling") Cc: stable@vger.kernel.org # v4.9+ Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
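For reference, XDR rounds every variable-length item up to a multiple of four bytes, so the pad in question is zero to three bytes long. The sketch below computes that pad and applies the rule stated above: with pad optimization enabled the pad is skipped, otherwise it is carried in one extra Read chunk segment. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Bytes needed to round len up to the next multiple of four (0..3). */
static size_t toy_xdr_pad_size(size_t len)
{
	return (4 - (len & 3)) & 3;
}

/*
 * Extra Read chunk segments needed for the pad of a payload of len bytes:
 * none when pad optimization is enabled or no pad is required, else one.
 */
static unsigned int toy_pad_segments(size_t len, bool pad_optimize)
{
	if (pad_optimize)
		return 0;
	return toy_xdr_pad_size(len) ? 1 : 0;
}
```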
2016-11-29 | xprtrdma: Relocate connection helper functions | Chuck Lever | 1 | -34/+0
Clean up: Disentangle connection helpers from RPC-over-RDMA reply decoding functions. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-11-29 | xprtrdma: Update dprintk in rpcrdma_count_chunks | Chuck Lever | 1 | -1/+1
Clean up: offset and handle should be zero-filled, just like in the chunk encoders. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 | xprtrdma: Eliminate rpcrdma_receive_worker() | Chuck Lever | 1 | -1/+3
Clean up: the extra layer of indirection doesn't add value. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 | xprtrdma: Use gathered Send for large inline messages | Chuck Lever | 1 | -131/+170
An RPC Call message that is sent inline but that has a data payload (i.e., one or more items in rq_snd_buf's page list) must be "pulled up":

- call_allocate has to reserve enough RPC Call buffer space to accommodate the data payload
- call_transmit has to memcopy the rq_snd_buf's page list and tail into its head iovec before it is sent

As the inline threshold is increased beyond its current 1KB default, however, this means data payloads of more than a few KB are copied by the host CPU. For example, if the inline threshold is increased just to 4KB, then NFS WRITE requests up to 4KB would involve a memcpy of the NFS WRITE's payload data into the RPC Call buffer. This is an undesirable amount of participation by the host CPU.

The inline threshold may be much larger than 4KB in the future, after negotiation with a peer server.

Instead of copying the components of rq_snd_buf into its head iovec, construct a gather list of these components, and send them all in place. The same approach is already used in the Linux server's RPC-over-RDMA reply path.

This mechanism also eliminates the need for rpcrdma_tail_pullup, which is used to manage the XDR pad and trailing inline content when a Read list is present.

This requires that the pages in rq_snd_buf's page list be DMA-mapped during marshaling, and unmapped when a data-bearing RPC is completed. This is slightly less efficient for very small I/O payloads, but significantly more efficient as data payload size and inline threshold increase past a kilobyte.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
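The difference between pull-up and a gathered Send can be sketched in user space. Everything here is a simplified model with hypothetical names: `struct toy_buf` treats the page list as one contiguous region, `struct toy_sge` carries only an address and a length, and no DMA mapping is modeled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified send buffer: head, one contiguous "page list", and tail. */
struct toy_buf {
	uint8_t *head;
	size_t head_len;
	size_t head_cap;	/* space available in the head buffer */
	const uint8_t *pages;
	size_t page_len;
	const uint8_t *tail;
	size_t tail_len;
};

/* One gather element: where the bytes live and how many there are. */
struct toy_sge {
	const void *addr;
	size_t len;
};

/* Pull-up: the CPU copies pages and tail in behind the head. */
static long toy_pullup(struct toy_buf *b)
{
	size_t extra = b->page_len + b->tail_len;

	if (extra > b->head_cap - b->head_len)
		return -1;
	if (b->page_len)
		memcpy(b->head + b->head_len, b->pages, b->page_len);
	if (b->tail_len)
		memcpy(b->head + b->head_len + b->page_len, b->tail,
		       b->tail_len);
	b->head_len += extra;
	return (long)extra;	/* bytes the CPU had to copy */
}

/* Gather: describe each component in place; no payload bytes are copied. */
static int toy_gather(const struct toy_buf *b, struct toy_sge *sge, int max_sge)
{
	int needed = 1 + (b->page_len ? 1 : 0) + (b->tail_len ? 1 : 0);
	int n = 0;

	if (needed > max_sge)
		return -1;
	sge[n].addr = b->head;
	sge[n++].len = b->head_len;
	if (b->page_len) {
		sge[n].addr = b->pages;
		sge[n++].len = b->page_len;
	}
	if (b->tail_len) {
		sge[n].addr = b->tail;
		sge[n++].len = b->tail_len;
	}
	return n;
}
```

The copy cost of `toy_pullup()` grows with the payload, while `toy_gather()` does a fixed amount of bookkeeping per component, which is the trade-off the commit message describes.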
2016-09-19 | xprtrdma: Basic support for Remote Invalidation | Chuck Lever | 1 | -4/+14
Have frwr's ro_unmap_sync recognize an invalidated rkey that appears as part of a Receive completion. Local invalidation can be skipped for that rkey. Use an out-of-band signaling mechanism to indicate to the server that the client is prepared to receive RDMA Send With Invalidate. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 | xprtrdma: Client-side support for rpcrdma_connect_private | Chuck Lever | 1 | -3/+5
Send an RDMA-CM private message on connect, and look for one during a connection-established event. Both sides can communicate their various implementation limits. Implementations that don't support this sideband protocol ignore it. Once the client knows the server's inline threshold maxima, it can adjust the use of Reply chunks, and eliminate most use of Position Zero Read chunks. Moderately-sized I/O can be done using a pure inline RDMA Send instead of RDMA operations that require memory registration. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 | xprtrdma: Move send_wr to struct rpcrdma_req | Chuck Lever | 1 | -2/+3
Clean up: Most of the fields in each send_wr do not vary. There is no need to initialize them before each ib_post_send(). This removes a large-ish data structure from the stack. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 | xprtrdma: Simplify rpcrdma_ep_post_recv() | Chuck Lever | 1 | -1/+1
Clean up. Since commit fc66448549bb ("xprtrdma: Split the completion queue"), rpcrdma_ep_post_recv() no longer uses the "ep" argument. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 | xprtrdma: Delay DMA mapping Send and Receive buffers | Chuck Lever | 1 | -0/+9
Currently, each regbuf is allocated and DMA mapped at the same time. This is done during transport creation. When a device driver is unloaded, every DMA-mapped buffer in use by a transport has to be unmapped, and then remapped to the new device if the driver is loaded again. Remapping will have to be done _after_ the connect worker has set up the new device. But there's an ordering problem: call_allocate, which invokes xprt_rdma_allocate which calls rpcrdma_alloc_regbuf to allocate Send buffers, happens _before_ the connect worker can run to set up the new device. Instead, at transport creation, allocate each buffer, but leave it unmapped. Once the RPC carries these buffers into ->send_request, by which time a transport connection should have been established, check to see that the RPC's buffers have been DMA mapped. If not, map them there. When device driver unplug support is added, it will simply unmap all the transport's regbufs, but it doesn't have to deallocate the underlying memory. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-09-19 | xprtrdma: Eliminate INLINE_THRESHOLD macros | Chuck Lever | 1 | -1/+1
Clean up: r_xprt is already available everywhere these macros are invoked, so just dereference that directly. RPCRDMA_INLINE_PAD_VALUE is no longer used, so it can simply be removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 | xprtrdma: No direct data placement with krb5i and krb5p | Chuck Lever | 1 | -2/+10
Direct data placement is not allowed when using flavors that guarantee integrity or privacy. When such security flavors are in effect, don't allow the use of Read and Write chunks for moving individual data items. All messages larger than the inline threshold are sent via Long Call or Long Reply. On my systems (CX-3 Pro on FDR), for small I/O operations, the use of Long messages adds only around 5 usecs of latency in each direction. Note that when integrity or encryption is used, the host CPU touches every byte in these messages. Even if it could be used, data movement offload doesn't buy much in this case. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
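The selection rule described here can be written as two small decision functions. This is a sketch under stated assumptions, not the marshaling code: the `toy_*` enum values and `struct toy_rpc` are hypothetical, `toy_areadch` stands for a Position Zero Read chunk (Long Call), and `toy_replych` for a Reply chunk (Long Reply).

```c
#include <stdbool.h>
#include <stddef.h>

enum toy_chunktype {
	toy_noch,	/* fits inline, no chunk needed */
	toy_readch,	/* Read chunk for an individual data item */
	toy_areadch,	/* Position Zero Read chunk: whole Call (Long Call) */
	toy_writech,	/* Write chunk for an individual data item */
	toy_replych	/* Reply chunk: whole Reply (Long Reply) */
};

struct toy_rpc {
	size_t call_len;	/* size of the Call message */
	size_t reply_max;	/* largest Reply the client expects */
	bool call_has_payload;	/* e.g. NFS WRITE data */
	bool reply_has_payload;	/* e.g. NFS READ data */
	bool ddp_allowed;	/* false for krb5i and krb5p */
};

/* How the Call is conveyed to the server. */
static enum toy_chunktype toy_pick_rtype(const struct toy_rpc *rpc,
					 size_t inline_max)
{
	if (rpc->call_len <= inline_max)
		return toy_noch;
	if (rpc->ddp_allowed && rpc->call_has_payload)
		return toy_readch;
	return toy_areadch;
}

/* How the Reply is conveyed back to the client. */
static enum toy_chunktype toy_pick_wtype(const struct toy_rpc *rpc,
					 size_t inline_max)
{
	if (rpc->reply_max <= inline_max)
		return toy_noch;
	if (rpc->ddp_allowed && rpc->reply_has_payload)
		return toy_writech;
	return toy_replych;
}
```

With `ddp_allowed` cleared, a large Call and a large Reply both fall through to the Long message forms, so the whole message passes through the integrity or privacy transform.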
2016-07-11 | xprtrdma: Clean up fixup_copy_count accounting | Chuck Lever | 1 | -13/+13
fixup_copy_count should count only the number of bytes copied to the page list. The head and tail are now always handled without a data copy. And the debugging at the end of rpcrdma_inline_fixup() is also no longer necessary, since copy_len will be non-zero when there is reply data in the tail (a normal and valid case). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 | xprtrdma: Update only specific fields in private receive buffer | Chuck Lever | 1 | -4/+9
Now that rpcrdma_inline_fixup() updates only two fields in rq_rcv_buf, a full memcpy of that structure to rq_private_buf is unwarranted. Updating rq_private_buf fields only where needed also better documents what is going on. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 | xprtrdma: Do not update {head, tail}.iov_len in rpcrdma_inline_fixup() | Chuck Lever | 1 | -28/+33
While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ operations failed. After the client unwrapped the NFS READ reply message, the NFS READ XDR decoder was not able to decode the reply. The message was "Server cheating in reply", with the reported number of received payload bytes being zero. Applications reported a read(2) that returned -1/EIO.

The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero when the incoming reply fits entirely in the head iovec. The zero tail.iov_len confused xdr_buf_trim(), which then mangled the actual reply data instead of simply removing the trailing GSS checksum.

As near as I can tell, RPC transports are not supposed to update the head.iov_len, page_len, or tail.iov_len fields in the receive XDR buffer when handling an incoming RPC reply message. These fields contain the length of each component of the XDR buffer, and hence the maximum number of bytes of reply data that can be stored in each XDR buffer component.

I've concluded this because:

- This is how xdr_partial_copy_from_skb() appears to behave
- rpcrdma_inline_fixup() already does not alter page_len
- call_decode() compares rq_private_buf and rq_rcv_buf and WARNs if they are not exactly the same

Unfortunately, as soon as I tried the simple fix to just remove the line that sets tail.iov_len to zero, I saw that the logic that appends the implicit Write chunk pad inline depends on inline_fixup setting tail.iov_len to zero.

To address this, re-organize the tail iovec handling logic to use the same approach as with the head iovec: simply point tail.iov_base to the correct bytes in the receive buffer.

While I remember all this, write down the conclusion in documenting comments.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 | xprtrdma: rpcrdma_inline_fixup() overruns the receive page list | Chuck Lever | 1 | -5/+11
When the remaining length of an incoming reply is longer than the XDR buf's page_len, switch over to the tail iovec instead of copying more than page_len bytes into the page list. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
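The rule in this fix, never place more than page_len bytes in the page list and let the remainder go to the tail, can be sketched as a pure accounting function. The types and names are hypothetical, and clamping the tail share to the tail's capacity is an assumption of this sketch rather than something the commit message states. The receive buffer's capacities are passed as const and never modified.

```c
#include <stddef.h>

/* Capacities of the three receive buffer components (never modified). */
struct toy_rcv_buf {
	size_t head_len;
	size_t page_len;
	size_t tail_len;
};

/* How many reply bytes land in each component. */
struct toy_split {
	size_t in_head;
	size_t in_pages;
	size_t in_tail;
};

static size_t toy_min(size_t a, size_t b)
{
	return a < b ? a : b;
}

/* Distribute reply_len bytes across head, pages, then tail, in order. */
static struct toy_split toy_inline_fixup(const struct toy_rcv_buf *rcv,
					 size_t reply_len)
{
	struct toy_split s;

	s.in_head = toy_min(reply_len, rcv->head_len);
	reply_len -= s.in_head;

	/* Cap at page_len instead of overrunning the page list. */
	s.in_pages = toy_min(reply_len, rcv->page_len);
	reply_len -= s.in_pages;

	/* Whatever remains belongs to the tail (clamped here by assumption). */
	s.in_tail = toy_min(reply_len, rcv->tail_len);
	return s;
}
```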
2016-07-11 | xprtrdma: Chunk list encoders no longer share one rl_segments array | Chuck Lever | 1 | -33/+28
Currently, all three chunk list encoders each use a portion of the one rl_segments array in rpcrdma_req. This is because the MWs for each chunk list were preserved in rl_segments so that ro_unmap could find and invalidate them after the RPC was complete. However, now that MWs are placed on a per-req linked list as they are registered, there is no longer any information in rpcrdma_mr_seg that is shared between ro_map and ro_unmap_{sync,safe}, and thus nothing in rl_segments needs to be preserved after rpcrdma_marshal_req is complete. Thus the rl_segments array can be used now just for the needs of each rpcrdma_convert_iovs call. Once each chunk list is encoded, the next chunk list encoder is free to re-use all of rl_segments. This means all three chunk lists in one RPC request can now each encode a full size data payload with no increase in the size of rl_segments. This is a key requirement for Kerberos support, since both the Call and Reply for a single RPC transaction are conveyed via Long messages (RDMA Read/Write). Both can be large. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 | xprtrdma: Place registered MWs on a per-req list | Chuck Lever | 1 | -43/+38
Instead of placing registered MWs sparsely into the rl_segments array, place these MWs on a per-req list. ro_unmap_{sync,safe} can then simply pull those MWs off the list instead of walking through the array. This change significantly reduces the size of struct rpcrdma_req by removing nsegs and rl_mw from every array element. As an additional clean-up, chunk co-ordinates are returned in the "*mw" output argument so they are no longer needed in every array element. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 | xprtrdma: Chunk list encoders must not return zero | Chuck Lever | 1 | -3/+3
Clean up, based on code audit: Remove the possibility that the chunk list XDR encoders can return zero, which would be interpreted as a NULL. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-07-11 | xprtrdma: Honor ->send_request API contract | Chuck Lever | 1 | -1/+1
Commit c93c62231cf5 ("xprtrdma: Disconnect on registration failure") added a disconnect for some RPC marshaling failures. This is needed only in a handful of cases, but it was triggering for simple stuff like temporary resource shortages. Try to straighten this out. Fix up the lower layers so they don't return -ENOMEM or other error codes that the RPC client's FSM doesn't explicitly recognize. Also fix up the places in the send_request path that do want a disconnect. For example, when ib_post_send or ib_post_recv fail, this is a sign that there is a send or receive queue resource miscalculation. That should be rare, and is a sign of a software bug. But xprtrdma can recover: disconnect to reset the transport and start over. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17 | xprtrdma: Add ro_unmap_safe memreg method | Chuck Lever | 1 | -4/+1
There needs to be a safe method of releasing registered memory resources when an RPC terminates. Safe can mean a number of things:

+ Doesn't have to sleep
+ Doesn't rely on having a QP in RTS

ro_unmap_safe will be that safe method. It can be used in cases where synchronous memory invalidation can deadlock, or needs to have an active QP.

The important case is fencing an RPC's memory regions after it is signaled (^C) and before it exits. If this is not done, there is a window where the server can write an RPC reply into memory that the client has released and re-used for some other purpose.

Note that this is a full solution for FRWR, but FMR and physical still have some gaps where a particularly bad server can wreak some havoc on the client. These gaps are not made worse by this patch and are expected to be exceptionally rare and timing-based. They are noted in documenting comments.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17xprtrdma: Remove rpcrdma_create_chunks()Chuck Lever1-151/+0
rpcrdma_create_chunks() has been replaced, and can be removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17xprtrdma: Allow Read list and Reply chunk simultaneouslyChuck Lever1-59/+268
rpcrdma_marshal_req() makes a simplifying assumption: that NFS operations with large Call messages have small Reply messages, and vice versa. Therefore with RPC-over-RDMA, only one chunk type is ever needed for each Call/Reply pair, because one direction needs chunks, the other direction will always fit inline.

In fact, this assumption is asserted in the code:

    if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
            dprintk("RPC: %s: cannot marshal multiple chunk lists\n",
                    __func__);
            return -EIO;
    }

But RPCSEC_GSS breaks this assumption. Because krb5i and krb5p perform data transformation on RPC messages before they are transmitted, direct data placement techniques cannot be used, thus RPC messages must be sent via a Long call in both directions. All such calls are sent with a Position Zero Read chunk, and all such replies are handled with a Reply chunk. Thus the client must provide every Call/Reply pair with both a Read list and a Reply chunk.

Without any special security in effect, NFSv4 WRITEs may now also use the Read list and provide a Reply chunk. The marshal_req logic was preventing that, meaning an NFSv4 WRITE with a large payload that included a GETATTR result larger than the inline threshold would fail.

The code that encodes each chunk list is now completely contained in its own function. There is some code duplication, but the trade-off is that the overall logic should be more clear.

Note that all three chunk lists now share the rl_segments array. Some additional per-req accounting is necessary to track this usage. For the same reasons that the above simplifying assumption has held true for so long, I don't expect more array elements are needed at this time.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
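A simplified model of choosing the Call-side and Reply-side chunk types independently, as this message describes, might look like the sketch below. It is an illustration only: `struct rpc_shape`, its fields, and the `pick_rtype()`/`pick_wtype()` helpers are hypothetical, the enum merely borrows the flavor of the `rpcrdma_noch` naming, and a single inline threshold is assumed for both directions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative chunk types; only "noch" echoes a name from the message. */
enum chunktype {
	noch,    /* the message fits inline, no chunk needed */
	readch,  /* Read list carrying a data payload */
	areadch, /* Position Zero Read chunk carrying the whole Call */
	writech, /* Write list receiving a data payload */
	replych  /* Reply chunk receiving the whole Reply */
};

/* Hypothetical description of one RPC; none of these are kernel fields. */
struct rpc_shape {
	size_t call_len;          /* bytes in the marshaled Call */
	size_t max_reply_len;     /* largest Reply the client expects */
	size_t inline_max;        /* inline threshold (assumed symmetric) */
	bool   ddp_allowed;       /* false when krb5i or krb5p is in use */
	bool   has_write_payload; /* Call carries bulk data (e.g. WRITE) */
	bool   has_read_payload;  /* Reply carries bulk data (e.g. READ) */
};

/* Call direction: decided without looking at the Reply direction. */
static enum chunktype pick_rtype(const struct rpc_shape *s)
{
	assert(s != NULL);
	if (s->call_len <= s->inline_max)
		return noch;
	if (s->ddp_allowed && s->has_write_payload)
		return readch;
	return areadch;
}

/* Reply direction: decided without looking at the Call direction. */
static enum chunktype pick_wtype(const struct rpc_shape *s)
{
	assert(s != NULL);
	if (s->max_reply_len <= s->inline_max)
		return noch;
	if (s->ddp_allowed && s->has_read_payload)
		return writech;
	return replych;
}
```

Because the two choices are independent, a large NFSv4 WRITE whose GETATTR result exceeds the threshold yields `readch` together with `replych`, and a large krb5i exchange yields `areadch` together with `replych`, which are the combinations the old check rejected.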
2016-05-17xprtrdma: Update comments in rpcrdma_marshal_req()Chuck Lever1-14/+4
Update documenting comments to reflect code changes over the past year. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17xprtrdma: Avoid using Write list for small NFS READ requestsChuck Lever1-4/+5
Avoid the latency and interrupt overhead of registering a Write chunk when handling NFS READ requests of a few hundred bytes or less. This change does not interoperate with Linux NFS/RDMA servers that do not have commit 9d11b51ce7c1 ('svcrdma: Fix send_reply() scatter/gather set-up'). Commit 9d11b51ce7c1 was introduced in v4.3, and is included in 4.2.y, 4.1.y, and 3.18.y. Oracle bug 22925946 has been filed to request that the above fix be included in the Oracle Linux UEK4 NFS/RDMA server. Red Hat bugzillas 1327280 and 1327554 have been filed to request that RHEL NFS/RDMA server backports include the above fix. Workaround: Replace the "proto=rdma,port=20049" mount options with "proto=tcp" until commit 9d11b51ce7c1 is applied to your NFS server. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2016-05-17xprtrdma: Prevent inline overflowChuck Lever1-10/+75
When deciding whether to send a Call inline, rpcrdma_marshal_req doesn't take into account header bytes consumed by chunk lists. This results in Call messages on the wire that are sometimes larger than the inline threshold. Likewise, when a Write list or Reply chunk is in play, the server's reply has to emit an RDMA Send that includes a larger-than-minimal RPC-over-RDMA header. The actual size of a Call message cannot be estimated until after the chunk lists have been registered. Thus the size of each RPC-over-RDMA header can be estimated only after chunks are registered; but the decision to register chunks is based on the size of that header. Chicken, meet egg. The best a client can do is estimate header size based on the largest header that might occur, and then ensure that inline content is always smaller than that. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
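The worst-case estimate described above can be sketched as plain arithmetic. The byte counts below follow the RPC-over-RDMA version 1 header layout as generally understood (four fixed 32-bit words plus three list discriminators, 16-byte segments); the helper names and the `max_segs` parameter are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum {
	HDR_MIN   = 28, /* xid, vers, credit, proc + three empty lists */
	READ_SEG  = 24, /* item-present + position + 16-byte segment */
	PLAIN_SEG = 16, /* handle (4) + length (4) + offset (8) */
	CHUNK_HDR = 8   /* item-present + segment count of one Write chunk */
};

/* Largest Call header: a Read list holding max_segs segments. */
static size_t call_hdr_max(size_t max_segs)
{
	return HDR_MIN + max_segs * READ_SEG;
}

/* Largest Reply header: one Write chunk holding max_segs segments. */
static size_t reply_hdr_max(size_t max_segs)
{
	return HDR_MIN + CHUNK_HDR + max_segs * PLAIN_SEG;
}

/* Bytes left for inline content once the worst-case header is reserved. */
static size_t inline_payload_max(size_t threshold, size_t hdr_max)
{
	return threshold > hdr_max ? threshold - hdr_max : 0;
}

/* Decide inline vs. chunked before any chunk has been registered. */
static bool call_fits_inline(size_t call_len, size_t threshold,
			     size_t max_segs)
{
	assert(max_segs > 0);
	return call_len <= inline_payload_max(threshold,
					      call_hdr_max(max_segs));
}
```

With a 1024-byte threshold and an assumed limit of 8 segments, the reserved Call header is 220 bytes, so only Calls of 804 bytes or less are sent inline. The estimate is deliberately pessimistic: it breaks the circular dependency at the cost of sometimes using chunks for a message that would have fit.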