aboutsummaryrefslogtreecommitdiffstats
path: root/tools (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2017-11-17NFSv4: Don't try to CLOSE if the stateid 'other' field has changedTrond Myklebust3-12/+13
If the stateid is no longer recognised on the server, either due to a restart, or due to a competing CLOSE call, then we do not have to retry. Any open contexts that triggered a reopen of the file, will also act as triggers for any CLOSE for the updated stateids. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17NFSv4: Retry CLOSE and DELEGRETURN on NFS4ERR_OLD_STATEID.Trond Myklebust5-2/+65
If we're racing with an OPEN, then retry the operation instead of declaring it a success. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> [Andrew W Elble: Fix a typo in nfs4_refresh_open_stateid] Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17NFS: Fix a typo in nfs_rename()Trond Myklebust1-1/+1
On successful rename, the "old_dentry" is retained and is attached to the "new_dir", so we need to call nfs_set_verifier() accordingly. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17NFSv4: Fix open create exclusive when the server rebootsTrond Myklebust1-15/+26
If the server that does not implement NFSv4.1 persistent session semantics reboots while we are performing an exclusive create, then the return value of NFS4ERR_DELAY when we replay the open during the grace period causes us to lose the verifier. When the grace period expires, and we present a new verifier, the server will then correctly reply NFS4ERR_EXIST. This commit ensures that we always present the same verifier when replaying the OPEN. Reported-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17NFSv4: Add a tracepoint to document open stateid updatesTrond Myklebust2-0/+5
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17NFSv4: Fix OPEN / CLOSE raceTrond Myklebust3-35/+123
Ben Coddington has noted the following race between OPEN and CLOSE on a single client. Process 1 Process 2 Server ========= ========= ====== 1) OPEN file 2) OPEN file 3) Process OPEN (1) seqid=1 4) Process OPEN (2) seqid=2 5) Reply OPEN (2) 6) Receive reply (2) 7) new stateid, seqid=2 8) CLOSE file, using stateid w/ seqid=2 9) Reply OPEN (1) 10( Process CLOSE (8) 11) Reply CLOSE (8) 12) Forget stateid file closed 13) Receive reply (7) 14) Forget stateid file closed. 15) Receive reply (1). 16) New stateid seqid=1 is really the same stateid that was closed. IOW: the reply to the first OPEN is delayed. Since "Process 2" does not wait before closing the file, and it does not cache the closed stateid, then when the delayed reply is finally received, it is treated as setting up a new stateid by the client. The fix is to ensure that the client processes the OPEN and CLOSE calls in the same order in which the server processed them. This commit ensures that we examine the seqid of the stateid returned by OPEN. If it is a new stateid, we assume the seqid must be equal to the value 1, and that each state transition increments the seqid value by 1 (See RFC7530, Section 9.1.4.2, and RFC5661, Section 8.2.2). If the tracker sees that an OPEN returns with a seqid that is greater than the cached seqid + 1, then it bumps a flag to ensure that the caller waits for the RPCs carrying the missing seqids to complete. Note that there can still be pathologies where the server crashes before it can even send us the missing seqids. Since the OPEN call is still holding a slot when it waits here, that could cause the recovery to stall forever. To avoid that, we time out after a 5 second wait. Reported-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17sunrpc: Add rpc_request static trace pointChuck Lever2-2/+31
Display information about the RPC procedure being requested in the trace log. This sometimes critical information cannot always be derived from other RPC trace entries. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17sunrpc: Fix rpc_task_begin trace pointChuck Lever1-2/+1
The rpc_task_begin trace point always display a task ID of zero. Move the trace point call site so that it picks up the new task ID. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17SUNRPC: Fix parsing failure in trace points with XIDsChuck Lever1-24/+25
mount.nf-11159 8.... 905.248380: xprt_transmit: [FAILED TO PARSE] xid=351291440 status=0 addr=192.168.2.5 port=20049 mount.nf-11159 8.... 905.248381: rpc_task_sleep: task:6210@1 flags=0e80 state=0005 status=0 timeout=60000 queue=xprt_pending kworker/-1591 1.... 905.248419: xprt_lookup_rqst: [FAILED TO PARSE] xid=351291440 status=0 addr=192.168.2.5 port=20049 kworker/-1591 1.... 905.248423: xprt_complete_rqst: [FAILED TO PARSE] xid=351291440 status=24 addr=192.168.2.5 port=20049 Byte swapping is not available during trace-cmd report. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17net: sunrpc: mark expected switch fall-throughsGustavo A. R. Silva3-0/+16
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17NFS: Fix bool initialization/comparisonThomas Meyer4-8/+8
Bool initializations should use true and false. Bool tests don't need comparisons. Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17NFS: Avoid RCU usage in tracepointsAnna Schumaker1-18/+6
There isn't an obvious way to acquire and release the RCU lock during a tracepoint, so we can't use the rpc_peeraddr2str() function here. Instead, rely on the client's cl_hostname, which should have similar enough information without needing an rcu_dereference(). Reported-by: Dave Jones <davej@codemonkey.org.uk> Cc: stable@vger.kernel.org # v3.12 Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17xprtrdma: Update copyright noticesChuck Lever5-0/+5
Credit work contributed by Oracle engineers since 2014. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17xprtrdma: Remove include for linux/prefetch.hChuck Lever1-1/+0
Clean up. This include should have been removed by commit 23826c7aeac7 ("xprtrdma: Serialize credit accounting again"). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17rpcrdma: Remove C structure definitions of XDR data itemsChuck Lever3-68/+3
Clean up: C-structure style XDR encoding and decoding logic has been replaced over the past several merge windows on both the client and server. These data structures are no longer used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17xprtrdma: Put Send CQ in IB_POLL_WORKQUEUE modeChuck Lever1-1/+1
Lift the Send and LocalInv completion handlers out of soft IRQ mode to make room for other work. Also, move the Send CQ to a different CPU than the CPU where the Receive CQ is running, for improved scalability. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17fs, nfs: convert nfs_client.cl_count from atomic_t to refcount_tElena Reshetova7-32/+33
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable nfs_client.cl_count is used as pure reference counter. Convert it to refcount_t and fix up the operations. Suggested-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17fs, nfs: convert nfs_lock_context.count from atomic_t to refcount_tElena Reshetova2-7/+8
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable nfs_lock_context.count is used as pure reference counter. Convert it to refcount_t and fix up the operations. Suggested-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17fs, nfs: convert nfs4_lock_state.ls_count from atomic_t to refcount_tElena Reshetova3-8/+8
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable nfs4_lock_state.ls_count is used as pure reference counter. Convert it to refcount_t and fix up the operations. Suggested-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17fs, nfs: convert nfs_cache_defer_req.count from atomic_t to refcount_tElena Reshetova2-4/+4
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable nfs_cache_defer_req.count is used as pure reference counter. Convert it to refcount_t and fix up the operations. Suggested-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17fs, nfs: convert nfs4_ff_layout_mirror.ref from atomic_t to refcount_tElena Reshetova2-5/+6
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable nfs4_ff_layout_mirror.ref is used as pure reference counter. Convert it to refcount_t and fix up the operations. Suggested-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17fs, nfs: convert pnfs_layout_hdr.plh_refcount from atomic_t to refcount_tElena Reshetova2-7/+7
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable pnfs_layout_hdr.plh_refcount is used as pure reference counter. Convert it to refcount_t and fix up the operations. Suggested-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17fs, nfs: convert pnfs_layout_segment.pls_refcount from atomic_t to refcount_tElena Reshetova2-8/+8
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17fs, nfs: convert nfs4_pnfs_ds.ds_count from atomic_t to refcount_tElena Reshetova2-6/+7
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable nfs4_pnfs_ds.ds_count is used as pure reference counter. Convert it to refcount_t and fix up the operations. Suggested-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17NFSv4.1: Fix up replays of interrupted requestsTrond Myklebust2-47/+103
If the previous request on a slot was interrupted before it was processed by the server, then our slot sequence number may be out of whack, and so we try the next operation using the old sequence number. The problem with this, is that not all servers check to see that the client is replaying the same operations as previously when they decide to go to the replay cache, and so instead of the expected error of NFS4ERR_SEQ_FALSE_RETRY, we get a replay of the old reply, which could (if the operations match up) be mistaken by the client for a new reply. To fix this, we attempt to send a COMPOUND containing only the SEQUENCE op in order to resync our slot sequence number. Cc: Olga Kornievskaia <olga.kornievskaia@gmail.com> [olga.kornievskaia@gmail.com: fix an Oops] Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17xprtrdma: Remove atomic send completion countingChuck Lever3-33/+0
The sendctx circular queue now guarantees that xprtrdma cannot overflow the Send Queue, so remove the remaining bits of the original Send WQE counting mechanism. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17xprtrdma: RPC completion should wait for Send completionChuck Lever4-4/+34
When an RPC Call includes a file data payload, that payload can come from pages in the page cache, or a user buffer (for direct I/O). If the payload can fit inline, xprtrdma includes it in the Send using a scatter-gather technique. xprtrdma mustn't allow the RPC consumer to re-use the memory where that payload resides before the Send completes. Otherwise, the new contents of that memory would be exposed by an HCA retransmit of the Send operation. So, block RPC completion on Send completion, but only in the case where a separate file data payload is part of the Send. This prevents the reuse of that memory while it is still part of a Send operation without an undue cost to other cases. Waiting is avoided in the common case because typically the Send will have completed long before the RPC Reply arrives. These days, an RPC timeout will trigger a disconnect, which tears down the QP. The disconnect flushes all waiting Sends. This bounds the amount of time the reply handler has to wait for a Send completion. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17xprtrdma: Refactor rpcrdma_deferred_completionChuck Lever3-13/+22
Invoke a common routine for releasing hardware resources (for example, invalidating MRs). This needs to be done whether an RPC Reply has arrived or the RPC was terminated early. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17xprtrdma: Add a field of bit flags to struct rpcrdma_reqChuck Lever4-4/+8
We have one boolean flag in rpcrdma_req today. I'd like to add more flags, so convert that boolean to a bit flag. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17xprtrdma: Add data structure to manage RDMA Send argumentsChuck Lever4-32/+247
Problem statement: Recently Sagi Grimberg <sagi@grimberg.me> observed that kernel RDMA- enabled storage initiators don't handle delayed Send completion correctly. If Send completion is delayed beyond the end of a ULP transaction, the ULP may release resources that are still being used by the HCA to complete a long-running Send operation. This is a common design trait amongst our initiators. Most Send operations are faster than the ULP transaction they are part of. Waiting for a completion for these is typically unnecessary. Infrequently, a network partition or some other problem crops up where an ordering problem can occur. In NFS parlance, the RPC Reply arrives and completes the RPC, but the HCA is still retrying the Send WR that conveyed the RPC Call. In this case, the HCA can try to use memory that has been invalidated or DMA unmapped, and the connection is lost. If that memory has been re-used for something else (possibly not related to NFS), and the Send retransmission exposes that data on the wire. Thus we cannot assume that it is safe to release Send-related resources just because a ULP reply has arrived. After some analysis, we have determined that the completion housekeeping will not be difficult for xprtrdma: - Inline Send buffers are registered via the local DMA key, and are already left DMA mapped for the lifetime of a transport connection, thus no additional handling is necessary for those - Gathered Sends involving page cache pages _will_ need to DMA unmap those pages after the Send completes. But like inline send buffers, they are registered via the local DMA key, and thus will not need to be invalidated In addition, RPC completion will need to wait for Send completion in the latter case. However, nearly always, the Send that conveys the RPC Call will have completed long before the RPC Reply arrives, and thus no additional latency will be accrued. Design notes: In this patch, the rpcrdma_sendctx object is introduced, and a lock-free circular queue is added to manage a set of them per transport. The RPC client's send path already prevents sending more than one RPC Call at the same time. This allows us to treat the consumer side of the queue (rpcrdma_sendctx_get_locked) as if there is a single consumer thread. The producer side of the queue (rpcrdma_sendctx_put_locked) is invoked only from the Send completion handler, which is a single thread of execution (soft IRQ). The only care that needs to be taken is with the tail index, which is shared between the producer and consumer. Only the producer updates the tail index. The consumer compares the head with the tail to ensure that the a sendctx that is in use is never handed out again (or, expressed more conventionally, the queue is empty). When the sendctx queue empties completely, there are enough Sends outstanding that posting more Send operations can result in a Send Queue overflow. In this case, the ULP is told to wait and try again. This introduces strong Send Queue accounting to xprtrdma. As a final touch, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> suggested a mechanism that does not require signaling every Send. We signal once every N Sends, and perform SGE unmapping of N Send operations during that one completion. Reported-by: Sagi Grimberg <sagi@grimberg.me> Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17xprtrdma: "Unoptimize" rpcrdma_prepare_hdr_sge()Chuck Lever1-7/+5
Commit 655fec6987be ("xprtrdma: Use gathered Send for large inline messages") assumed that, since the zeroeth element of the Send SGE array always pointed to req->rl_rdmabuf, it needed to be initialized just once. This was a valid assumption because the Send SGE array and rl_rdmabuf both live in the same rpcrdma_req. In a subsequent patch, the Send SGE array will be separated from the rpcrdma_req, so the zeroeth element of the SGE array needs to be initialized every time. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17xprtrdma: Change return value of rpcrdma_prepare_send_sges()Chuck Lever3-24/+38
Clean up: Make rpcrdma_prepare_send_sges() return a negative errno instead of a bool. Soon callers will want distinct treatments of different types of failures. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17xprtrdma: Fix error handling in rpcrdma_prepare_msg_sges()Chuck Lever1-14/+24
When this function fails, it needs to undo the DMA mappings it's done so far. Otherwise these are leaked. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17xprtrdma: Clean up SGE accounting in rpcrdma_prepare_msg_sges()Chuck Lever1-1/+1
Clean up. rpcrdma_prepare_hdr_sge() sets num_sge to one, then rpcrdma_prepare_msg_sges() sets num_sge again to the count of SGEs it added, plus one for the header SGE just mapped in rpcrdma_prepare_hdr_sge(). This is confusing, and nails in an assumption about when these functions are called. Instead, maintain a running count that both functions can update with just the number of SGEs they have added to the SGE array. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17xprtrdma: Decode credits field in rpcrdma_reply_handlerChuck Lever3-27/+14
We need to decode and save the incoming rdma_credits field _after_ we know that the direction of the message is "forward direction Reply". Otherwise, the credits value in reverse direction Calls is also used to update the forward direction credits. It is safe to decode the rdma_credits field in rpcrdma_reply_handler now that rpcrdma_reply_handler is single-threaded. Receives complete in the same order as they were sent on the NFS server. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17xprtrdma: Invoke rpcrdma_reply_handler directly from RECV completionChuck Lever3-23/+36
I noticed that the soft IRQ thread looked pretty busy under heavy I/O workloads. perf suggested one area that was expensive was the queue_work() call in rpcrdma_wc_receive. That gave me some ideas. Instead of scheduling a separate worker to process RPC Replies, promote the Receive completion handler to IB_POLL_WORKQUEUE, and invoke rpcrdma_reply_handler directly. Note that the poll workqueue is single-threaded. In order to keep memory invalidation from serializing all RPC Replies, handle any necessary invalidation tasks in a separate multi-threaded workqueue. This provides a two-tier scheme, similar to OS I/O interrupt handlers: A fast interrupt handler that schedules the slow handler and re-enables the interrupt, and a slower handler that is invoked for any needed heavy lifting. Benefits include: - One less context switch for RPCs that don't register memory - Receive completion handling is moved out of soft IRQ context to make room for other users of soft IRQ - The same CPU core now DMA syncs and XDR decodes the Receive buffer Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17xprtrdma: Refactor rpcrdma_reply_handler some moreChuck Lever2-57/+69
Clean up: I'd like to be able to invoke the tail of rpcrdma_reply_handler in two different places. Split the tail out into its own helper function. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17xprtrdma: Move decoded header fields into rpcrdma_repChuck Lever2-19/+20
Clean up: Make it easier to pass the decoded XID, vers, credits, and proc fields around by moving these variables into struct rpcrdma_rep. Note: the credits field will be handled in a subsequent patch. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17xprtrdma: Throw away reply when version is unrecognizedChuck Lever1-9/+8
A reply with an unrecognized value in the version field means the transport header is potentially garbled and therefore all the fields are untrustworthy. Fixes: 59aa1f9a3cce3 ("xprtrdma: Properly handle RDMA_ERROR ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-16xprtrdma: Remove ro_unmap_safeChuck Lever3-40/+0
Clean up: There are no remaining callers of this method. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-16xprtrdma: Use ro_unmap_sync in xprt_rdma_send_requestChuck Lever1-1/+2
The "safe" version of ro_unmap is used here to avoid waiting unnecessarily. However: - It is safe to wait. After all, we have to wait anyway when using FMR to register memory. - This case is rare: it occurs only after a reconnect. By switching this call site to ro_unmap_sync, the final use of ro_unmap_safe is removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-16xprtrdma: Don't defer fencing an async RPC's chunksChuck Lever1-1/+1
In current kernels, waiting in xprt_release appears to be safe to do. I had erroneously believed that for ASYNC RPCs, waiting of any kind in xprt_release->xprt_rdma_free would result in deadlock. I've done injection testing and consulted with Trond to confirm that waiting in the RPC release path is safe. For the very few times where RPC resources haven't yet been released earlier by the reply handler, it is safe to wait synchronously in xprt_rdma_free for invalidation rather than defering it to MR recovery. Note: When the QP is error state, posting a LocalInvalidate should flush and mark the MR as bad. There is no way the remote HCA can access that MR via a QP in error state, so it is effectively already inaccessible and thus safe for the Upper Layer to access. The next time the MR is used it should be recognized and cleaned up properly by frwr_op_map. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-16NFS: remove special-case revalidate in nfs_opendir()NeilBrown1-7/+0
Commit f5a73672d181 ("NFS: allow close-to-open cache semantics to apply to root of NFS filesystem") added a call to __nfs_revalidate_inode() to nfs_opendir to as the lookup process wouldn't reliable do this. Subsequent commit a3fbbde70a0c ("VFS: we need to set LOOKUP_JUMPED on mountpoint crossing") make this unnecessary. So remove the unnecessary code. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-16NFS: revalidate "." etc correctly on "open".NeilBrown1-2/+2
For correct close-to-open semantics, NFS must validate the change attribute of a directory (or file) on open. Since commit ecf3d1f1aa74 ("vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op"), open() of "." or a path ending ".." is not revalidated reliably (except when that direct is a mount point). Prior to that commit, "." was revalidated using nfs_lookup_revalidate() which checks the LOOKUP_OPEN flag and forces revalidation if the flag is set. Since that commit, nfs_weak_revalidate() is used for NFSv3 (which ignores the flags) and nothing is used for NFSv4. This is fixed by using nfs_lookup_verify_inode() in nfs_weak_revalidate(). This does the revalidation exactly when needed. Also, add a definition of .d_weak_revalidate for NFSv4. The incorrect behavior is easily demonstrated by running "echo *" in some non-mountpoint NFS directory while watching network traffic. Without this patch, "echo *" sometimes doesn't produce any traffic. With the patch it always does. Fixes: ecf3d1f1aa74 ("vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op") cc: stable@vger.kernel.org (3.9+) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-16NFS: Don't compare apples to elephants to determine access bitsAnna Schumaker3-37/+10
The NFS_ACCESS_* flags aren't a 1:1 mapping to the MAY_* flags, so checking for MAY_WHATEVER might have surprising results in nfs*_proc_access(). Let's simplify this check when determining which bits to ask for, and do it in a generic place instead of copying code for each NFS version. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-16NFS: Create NFS_ACCESS_* flagsAnna Schumaker2-8/+18
Passing the NFS v4 flags into the v3 code seems weird to me, even if they are defined to the same values. This patch adds in generic flags to help me feel better Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-10-15Linux 4.14-rc5Linus Torvalds1-1/+1
2017-10-14x86/microcode: Do the family check firstBorislav Petkov1-9/+18
On CPUs like AMD's Geode, for example, we shouldn't even try to load microcode because they do not support the modern microcode loading interface. However, we do the family check *after* the other checks whether the loader has been disabled on the command line or whether we're running in a guest. So move the family checks first in order to exit early if we're being loaded on an unsupported family. Reported-and-tested-by: Sven Glodowski <glodi1@arcor.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> # 4.11.. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://bugzilla.suse.com/show_bug.cgi?id=1061396 Link: http://lkml.kernel.org/r/20171012112316.977-1-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-14locking/lockdep: Disable cross-release features for nowIngo Molnar1-2/+2
Johan Hovold reported a big lockdep slowdown on his system, caused by lockdep: > I had noticed that the BeagleBone Black boot time appeared to have > increased significantly with 4.14 and yesterday I finally had time to > investigate it. > > Boot time (from "Linux version" to login prompt) had in fact doubled > since 4.13 where it took 17 seconds (with my current config) compared to > the 35 seconds I now see with 4.14-rc4. > > I quick bisect pointed to lockdep and specifically the following commit: > > 28a903f63ec0 ("locking/lockdep: Handle non(or multi)-acquisition of a crosslock") Because the final v4.14 release is close, disable the cross-release lockdep features for now. Bisected-by: Johan Hovold <johan@kernel.org> Debugged-by: Johan Hovold <johan@kernel.org> Reported-by: Johan Hovold <johan@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Lindgren <tony@atomide.com> Cc: kernel-team@lge.com Cc: linux-arm-kernel@lists.infradead.org Cc: linux-mm@kvack.org Cc: linux-omap@vger.kernel.org Link: http://lkml.kernel.org/r/20171014072659.f2yr6mhm5ha3eou7@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-14x86/mm: Flush more aggressively in lazy TLB modeAndy Lutomirski3-49/+136
Since commit: 94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking") x86's lazy TLB mode has been all the way lazy: when running a kernel thread (including the idle thread), the kernel keeps using the last user mm's page tables without attempting to maintain user TLB coherence at all. From a pure semantic perspective, this is fine -- kernel threads won't attempt to access user pages, so having stale TLB entries doesn't matter. Unfortunately, I forgot about a subtlety. By skipping TLB flushes, we also allow any paging-structure caches that may exist on the CPU to become incoherent. This means that we can have a paging-structure cache entry that references a freed page table, and the CPU is within its rights to do a speculative page walk starting at the freed page table. I can imagine this causing two different problems: - A speculative page walk starting from a bogus page table could read IO addresses. I haven't seen any reports of this causing problems. - A speculative page walk that involves a bogus page table can install garbage in the TLB. Such garbage would always be at a user VA, but some AMD CPUs have logic that triggers a machine check when it notices these bogus entries. I've seen a couple reports of this. Boris further explains the failure mode: > It is actually more of an optimization which assumes that paging-structure > entries are in WB DRAM: > > "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables > performance optimization that assumes PML4, PDP, PDE, and PTE entries > are in cacheable WB-DRAM; memory type checks may be bypassed, and > addresses outside of WB-DRAM may result in undefined behavior or NB > protocol errors. 1=Disables performance optimization and allows PML4, > PDP, PDE and PTE entries to be in any memory type. Operating systems > that maintain page tables in memory types other than WB- DRAM must set > TlbCacheDis to insure proper operation." > > The MCE generated is an NB protocol error to signal that > > "Link: A specific coherent-only packet from a CPU was issued to an > IO link. This may be caused by software which addresses page table > structures in a memory type other than cacheable WB-DRAM without > properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for > example, when page table structure addresses are above top of memory. In > such cases, the NB will generate an MCE if it sees a mismatch between > the memory operation generated by the core and the link type." > > I'm assuming coherent-only packets don't go out on IO links, thus the > error. To fix this, reinstate TLB coherence in lazy mode. With this patch applied, we do it in one of two ways: - If we have PCID, we simply switch back to init_mm's page tables when we enter a kernel thread -- this seems to be quite cheap except for the cost of serializing the CPU. - If we don't have PCID, then we set a flag and switch to init_mm the first time we would otherwise need to flush the TLB. The /sys/kernel/debug/x86/tlb_use_lazy_mode debug switch can be changed to override the default mode for benchmarking. In theory, we could optimize this better by only flushing the TLB in lazy CPUs when a page table is freed. Doing that would require auditing the mm code to make sure that all page table freeing goes through tlb_remove_page() as well as reworking some data structures to implement the improved flush logic. Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Reported-by: Adam Borowski <kilobyte@angband.pl> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Eric Biggers <ebiggers@google.com> Cc: Johannes Hirte <johannes.hirte@datenkhaos.de> Cc: Kees Cook <keescook@chromium.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Roman Kagan <rkagan@virtuozzo.com> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 94b1b03b519b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking") Link: http://lkml.kernel.org/r/20171009170231.fkpraqokz6e4zeco@pd.tnic Signed-off-by: Ingo Molnar <mingo@kernel.org>