aboutsummaryrefslogtreecommitdiffstats
path: root/fs/nfs/write.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-02-20NFS: Fix up documentation warningsTrond Myklebust1-1/+0
Fix up some compiler warnings about function parameters, etc not being correctly described or formatted. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20NFS: Ensure NFS writeback allocations don't recurse back into NFS.Trond Myklebust1-1/+6
All the allocations that we can hit in the NFS layer and sunrpc layers themselves are already marked as GFP_NOFS, but we need to ensure that any calls to generic kernel functionality do the right thing as well. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-20NFS: Pass error information to the pgio error cleanup routineTrond Myklebust1-2/+9
Allow the caller to pass error information when cleaning up a failed I/O request so that we can conditionally take action to cancel the request altogether if the error turned out to be fatal. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-02-12NFS: Don't use page_file_mapping after removing the pageBenjamin Coddington1-5/+6
If nfs_page_async_flush() removes the page from the mapping, then we can't use page_file_mapping() on it as nfs_updatepate() is wont to do when receiving an error. Instead, push the mapping to the stack before the page is possibly truncated. Fixes: 8fc75bed96bb ("NFS: Fix up return value on fatal errors in nfs_page_async_flush()") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-29NFS: Fix up return value on fatal errors in nfs_page_async_flush()Trond Myklebust1-4/+5
Ensure that we return the fatal error value that caused us to exit nfs_page_async_flush(). Fixes: c373fff7bd25 ("NFSv4: Don't special case "launder"") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: stable@vger.kernel.org # v4.12+ Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-02Merge tag 'nfs-for-4.21-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds1-3/+21
Pull NFS client updates from Anna Schumaker: "Stable bugfixes: - xprtrdma: Yet another double DMA-unmap # v4.20 Features: - Allow some /proc/sys/sunrpc entries without CONFIG_SUNRPC_DEBUG - Per-xprt rdma receive workqueues - Drop support for FMR memory registration - Make port= mount option optional for RDMA mounts Other bugfixes and cleanups: - Remove unused nfs4_xdev_fs_type declaration - Fix comments for behavior that has changed - Remove generic RPC credentials by switching to 'struct cred' - Fix crossing mountpoints with different auth flavors - Various xprtrdma fixes from testing and auditing the close code - Fixes for disconnect issues when using xprtrdma with krb5 - Clean up and improve xprtrdma trace points - Fix NFS v4.2 async copy reboot recovery" * tag 'nfs-for-4.21-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (63 commits) sunrpc: convert to DEFINE_SHOW_ATTRIBUTE sunrpc: Add xprt after nfs4_test_session_trunk() sunrpc: convert unnecessary GFP_ATOMIC to GFP_NOFS sunrpc: handle ENOMEM in rpcb_getport_async NFS: remove unnecessary test for IS_ERR(cred) xprtrdma: Prevent leak of rpcrdma_rep objects NFSv4.2 fix async copy reboot recovery xprtrdma: Don't leak freed MRs xprtrdma: Add documenting comment for rpcrdma_buffer_destroy xprtrdma: Replace outdated comment for rpcrdma_ep_post xprtrdma: Update comments in frwr_op_send SUNRPC: Fix some kernel doc complaints SUNRPC: Simplify defining common RPC trace events NFS: Fix NFSv4 symbolic trace point output xprtrdma: Trace mapping, alloc, and dereg failures xprtrdma: Add trace points for calls to transport switch methods xprtrdma: Relocate the xprtrdma_mr_map trace points xprtrdma: Clean up of xprtrdma chunk trace points xprtrdma: Remove unused fields from rpcrdma_ia xprtrdma: Cull dprintk() call sites ...
2018-12-28mm: convert totalram_pages and totalhigh_pages variables to atomicArun KS1-1/+1
totalram_pages and totalhigh_pages are made static inline function. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes better to remove the lock and convert variables to atomic, with preventing poteintial store-to-read tearing as a bonus. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Suggested-by: Michal Hocko <mhocko@suse.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-19NFS/NFSD/SUNRPC: replace generic creds with 'struct cred'.NeilBrown1-1/+1
SUNRPC has two sorts of credentials, both of which appear as "struct rpc_cred". There are "generic credentials" which are supplied by clients such as NFS and passed in 'struct rpc_message' to indicate which user should be used to authorize the request, and there are low-level credentials such as AUTH_NULL, AUTH_UNIX, AUTH_GSS which describe the credential to be sent over the wires. This patch replaces all the generic credentials by 'struct cred' pointers - the credential structure used throughout Linux. For machine credentials, there is a special 'struct cred *' pointer which is statically allocated and recognized where needed as having a special meaning. A look-up of a low-level cred will map this to a machine credential. Signed-off-by: NeilBrown <neilb@suse.com> Acked-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-12-19NFS: move credential expiry tracking out of SUNRPC into NFS.NeilBrown1-3/+21
NFS needs to know when a credential is about to expire so that it can modify write-back behaviour to finish the write inside the expiry time. It currently uses functions in SUNRPC code which make use of a fairly complex callback scheme and flags in the generic credientials. As I am working to discard the generic credentials, this has to change. This patch moves the logic into NFS, in part by finding and caching the low-level credential in the open_context. We then make direct cred-api calls on that. This makes the code much simpler and removes a dependency on generic rpc credentials. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-07-26NFS: Ensure we immediately start writeback on rescheduled writesTrond Myklebust1-0/+2
If the writes are being rescheduled due to a pNFS error, then we really want to immediately start a new flush. The O_DIRECT code already does this, so we only need to worry about buffered writes. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31NFS: Move call to nfs4_state_protect() to nfs4_commit_setup()Anna Schumaker1-4/+1
Rather than doing this in the generic NFS client code. Let's put this with the other v4 stuff so it's all in one place. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-05-31NFS: Move call to nfs4_state_protect_write() to nfs4_write_setup()Anna Schumaker1-4/+1
This doesn't really need to be in the generic NFS client code, and I think it makes more sense to keep the v4 code in one place. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-04-12Merge tag 'nfs-for-4.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds1-2/+6
Pull NFS client updates from Anna Schumaker: "Stable bugfixes: - xprtrdma: Fix corner cases when handling device removal # v4.12+ - xprtrdma: Fix latency regression on NUMA NFS/RDMA clients # v4.15+ Features: - New sunrpc tracepoint for RPC pings - Finer grained NFSv4 attribute checking - Don't unnecessarily return NFS v4 delegations Other bugfixes and cleanups: - Several other small NFSoRDMA cleanups - Improvements to the sunrpc RTT measurements - A few sunrpc tracepoint cleanups - Various fixes for NFS v4 lock notifications - Various sunrpc and NFS v4 XDR encoding cleanups - Switch to the ida_simple API - Fix NFSv4.1 exclusive create - Forget acl cache after setattr operation - Don't advance the nfs_entry readdir cookie if xdr decoding fails" * tag 'nfs-for-4.17-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (47 commits) NFS: advance nfs_entry cookie only after decoding completes successfully NFSv3/acl: forget acl cache after setattr NFSv4.1: Fix exclusive create NFSv4: Declare the size up to date after it was set. nfs: Use ida_simple API NFSv4: Fix the nfs_inode_set_delegation() arguments NFSv4: Clean up CB_GETATTR encoding NFSv4: Don't ask for attributes when ACCESS is protected by a delegation NFSv4: Add a helper to encode/decode struct timespec NFSv4: Clean up encode_attrs NFSv4; Clean up XDR encoding of type bitmap4 NFSv4: Allow GFP_NOIO sleeps in decode_attr_owner/decode_attr_group SUNRPC: Add a helper for encoding opaque data inline SUNRPC: Add helpers for decoding opaque and string types NFSv4: Ignore change attribute invalidations if we hold a delegation NFS: More fine grained attribute tracking NFS: Don't force unnecessary cache invalidation in nfs_update_inode() NFS: Don't redirty the attribute cache in nfs_wcc_update_inode() NFS: Don't force a revalidation of all attributes if change is missing NFS: Convert NFS_INO_INVALID flags to unsigned long ...
2018-04-10NFSv4: Declare the size up to date after it was set.Trond Myklebust1-0/+1
When we've changed the file size, then ensure we declare it to be up to date in the inode attributes. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-10NFS: More fine grained attribute trackingTrond Myklebust1-2/+5
Currently, if the NFS_INO_INVALID_ATTR flag is set, for instance by a call to nfs_post_op_update_inode_locked(), then it will not be cleared until all the attributes have been revalidated. This means, for instance, that NFSv4 writes will always force a full attribute revalidation. Track the ctime, mtime, size and change attribute separately from the other attributes so that we can have nfs_post_op_update_inode_locked() set them correctly, and later have the cache consistency bitmask be able to clear them. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-04-02Merge branch 'sched-wait-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-3/+3
Pull wait_var_event updates from Ingo Molnar: "This introduces the new wait_var_event() API, which is a more flexible waiting primitive than wait_on_atomic_t(). All wait_on_atomic_t() users are migrated over to the new API and wait_on_atomic_t() is removed. The migration fixes one bug and should result in no functional changes for the other usecases" * 'sched-wait-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/wait: Improve __var_waitqueue() code generation sched/wait: Remove the wait_on_atomic_t() API sched/wait, arch/mips: Fix and convert wait_on_atomic_t() usage to the new wait_var_event() API sched/wait, fs/ocfs2: Convert wait_on_atomic_t() usage to the new wait_var_event() API sched/wait, fs/nfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API sched/wait, fs/fscache: Convert wait_on_atomic_t() usage to the new wait_var_event() API sched/wait, fs/btrfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API sched/wait, fs/afs: Convert wait_on_atomic_t() usage to the new wait_var_event() API sched/wait, drivers/media: Convert wait_on_atomic_t() usage to the new wait_var_event() API sched/wait, drivers/drm: Convert wait_on_atomic_t() usage to the new wait_var_event() API sched/wait: Introduce wait_var_event()
2018-03-20sched/wait, fs/nfs: Convert wait_on_atomic_t() usage to the new wait_var_event() APIPeter Zijlstra1-3/+3
The old wait_on_atomic_t() is going to get removed, use the more flexible wait_var_event() API instead. No change in functionality. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Anna Schumaker <anna.schumaker@netapp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-08NFS: Fix unstable write completionTrond Myklebust1-40/+43
We do want to respect the FLUSH_SYNC argument to nfs_commit_inode() to ensure that all outstanding COMMIT requests to the inode in question are complete. Currently we may exit early from both nfs_commit_inode() and nfs_write_inode() even if there are COMMIT requests in flight, or unstable writes on the commit list. In order to get the right semantics w.r.t. sync_inode(), we don't need to have nfs_commit_inode() reset the inode dirty flags when called from nfs_wb_page() and/or nfs_wb_all(). We just need to ensure that nfs_write_inode() leaves them in the right state if there are outstanding commits, or stable pages. Reported-by: Scott Mayhew <smayhew@redhat.com> Fixes: dc4fd9ab01ab ("nfs: don't wait on commit in nfs_commit_inode()...") Cc: stable@vger.kernel.org # v4.14+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2018-01-30Merge tag 'nfs-for-4.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds1-0/+2
Pull NFS client updates from Trond Myklebust: "Highlights include: Stable bugfixes: - Fix breakages in the nfsstat utility due to the inclusion of the NFSv4 LOOKUPP operation - Fix a NULL pointer dereference in nfs_idmap_prepare_pipe_upcall() due to nfs_idmap_legacy_upcall() being called without an 'aux' parameter - Fix a refcount leak in the standard O_DIRECT error path - Fix a refcount leak in the pNFS O_DIRECT fallback to MDS path - Fix CPU latency issues with nfs_commit_release_pages() - Fix the LAYOUTUNAVAILABLE error case in the file layout type - NFS: Fix a race between mmap() and O_DIRECT Features: - Support the statx() mask and query flags to enable optimisations when the user is requesting only attributes that are already up to date in the inode cache, or is specifying the AT_STATX_DONT_SYNC flag - Add a module alias for the SCSI pNFS layout type Bugfixes: - Automounting when resolving a NFSv4 referral should preserve the RDMA transport protocol settings - Various other RDMA bugfixes from Chuck - pNFS block layout fixes - Always set NFS_LOCK_LOST when a lock is lost" * tag 'nfs-for-4.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (69 commits) NFS: Fix a race between mmap() and O_DIRECT NFS: Remove a redundant call to unmap_mapping_range() pnfs/blocklayout: Ensure disk address in block device map pnfs/blocklayout: pnfs_block_dev_map uses bytes, not sectors lockd: Fix server refcounting SUNRPC: Fix null rpc_clnt dereference in rpc_task_queued tracepoint SUNRPC: Micro-optimize __rpc_execute SUNRPC: task_run_action should display tk_callback sunrpc: Format RPC events consistently for display SUNRPC: Trace xprt_timer events xprtrdma: Correct some documenting comments xprtrdma: Fix "bytes registered" accounting xprtrdma: Instrument allocation/release of rpcrdma_req/rep objects xprtrdma: Add trace points to instrument QP and CQ access upcalls xprtrdma: Add trace points in the client-side backchannel code paths xprtrdma: Add trace points for connect events xprtrdma: Add trace points to instrument MR allocation and recovery xprtrdma: Add trace points to instrument memory invalidation xprtrdma: Add trace points in reply decoder path xprtrdma: Add trace points to instrument memory registration ..
2018-01-29Merge tag 'iversion-v4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linuxLinus Torvalds1-5/+3
Pull inode->i_version rework from Jeff Layton: "This pile of patches is a rework of the inode->i_version field. We have traditionally incremented that field on every inode data or metadata change. Typically this increment needs to be logged on disk even when nothing else has changed, which is rather expensive. It turns out though that none of the consumers of that field actually require this behavior. The only real requirement for all of them is that it be different iff the inode has changed since the last time the field was checked. Given that, we can optimize away most of the i_version increments and avoid dirtying inode metadata when the only change is to the i_version and no one is querying it. Queries of the i_version field are rather rare, so we can help write performance under many common workloads. This patch series converts existing accesses of the i_version field to a new API, and then converts all of the in-kernel filesystems to use it. The last patch in the series then converts the backend implementation to a scheme that optimizes away a large portion of the metadata updates when no one is looking at it. In my own testing this series significantly helps performance with small I/O sizes. I also got this email for Christmas this year from the kernel test robot (a 244% r/w bandwidth improvement with XFS over DAX, with 4k writes): https://lkml.org/lkml/2017/12/25/8 A few of the earlier patches in this pile are also flowing to you via other trees (mm, integrity, and nfsd trees in particular)". * tag 'iversion-v4.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: (22 commits) fs: handle inode->i_version more efficiently btrfs: only dirty the inode in btrfs_update_time if something was changed xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing fs: only set S_VERSION when updating times if necessary IMA: switch IMA over to new i_version API xfs: convert to new i_version API ufs: use new i_version API ocfs2: convert to new i_version API nfsd: convert to new i_version API nfs: convert to new i_version API ext4: convert to new i_version API ext2: convert to new i_version API exofs: switch to new i_version API btrfs: convert to new i_version API afs: convert to new i_version API affs: convert to new i_version API fat: convert to new i_version API fs: don't take the i_lock in inode_inc_iversion fs: new API for handling inode->i_version ntfs: remove i_version handling ...
2018-01-29nfs: convert to new i_version APIJeff Layton1-5/+3
For NFS, we just use the "raw" API since the i_version is mostly managed by the server. The exception there is when the client holds a write delegation, but we only need to bump it once there anyway to handle CB_GETATTR. Tested-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: Jeff Layton <jlayton@redhat.com>
2018-01-14NFS: Add a cond_resched() to nfs_commit_release_pages()Trond Myklebust1-0/+2
The commit list can get very large, and so we need a cond_resched() in nfs_commit_release_pages() in order to ensure we don't hog the CPU for excessive periods of time. Reported-by: Mike Galbraith <efault@gmx.de> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-12-15nfs: don't wait on commit in nfs_commit_inode() if there were no commit requestsScott Mayhew1-0/+2
If there were no commit requests, then nfs_commit_inode() should not wait on the commit or mark the inode dirty, otherwise the following BUG_ON can be triggered: [ 1917.130762] kernel BUG at fs/inode.c:578! [ 1917.130766] Oops: Exception in kernel mode, sig: 5 [#1] [ 1917.130768] SMP NR_CPUS=2048 NUMA pSeries [ 1917.130772] Modules linked in: iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi blocklayoutdriver rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc sg nx_crypto pseries_rng ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic crct10dif_common ibmvscsi scsi_transport_srp ibmveth scsi_tgt dm_mirror dm_region_hash dm_log dm_mod [ 1917.130805] CPU: 2 PID: 14923 Comm: umount.nfs4 Tainted: G ------------ T 3.10.0-768.el7.ppc64 #1 [ 1917.130810] task: c0000005ecd88040 ti: c00000004cea0000 task.ti: c00000004cea0000 [ 1917.130813] NIP: c000000000354178 LR: c000000000354160 CTR: c00000000012db80 [ 1917.130816] REGS: c00000004cea3720 TRAP: 0700 Tainted: G ------------ T (3.10.0-768.el7.ppc64) [ 1917.130820] MSR: 8000000100029032 <SF,EE,ME,IR,DR,RI> CR: 22002822 XER: 20000000 [ 1917.130828] CFAR: c00000000011f594 SOFTE: 1 GPR00: c000000000354160 c00000004cea39a0 c0000000014c4700 c0000000018cc750 GPR04: 000000000000c750 80c0000000000000 0600000000000000 04eeb76bea749a03 GPR08: 0000000000000034 c0000000018cc758 0000000000000001 d000000005e619e8 GPR12: c00000000012db80 c000000007b31200 0000000000000000 0000000000000000 GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 GPR24: 0000000000000000 c000000000dfc3ec 0000000000000000 c0000005eefc02c0 GPR28: d0000000079dbd50 c0000005b94a02c0 c0000005b94a0250 c0000005b94a01c8 [ 1917.130867] NIP [c000000000354178] .evict+0x1c8/0x350 [ 1917.130871] LR [c000000000354160] .evict+0x1b0/0x350 [ 1917.130873] Call Trace: [ 1917.130876] [c00000004cea39a0] [c000000000354160] .evict+0x1b0/0x350 (unreliable) [ 1917.130880] [c00000004cea3a30] [c0000000003558cc] .evict_inodes+0x13c/0x270 [ 1917.130884] [c00000004cea3af0] [c000000000327d20] .kill_anon_super+0x70/0x1e0 [ 1917.130896] [c00000004cea3b80] [d000000005e43e30] .nfs_kill_super+0x20/0x60 [nfs] [ 1917.130900] [c00000004cea3c00] [c000000000328a20] .deactivate_locked_super+0xa0/0x1b0 [ 1917.130903] [c00000004cea3c80] [c00000000035ba54] .cleanup_mnt+0xd4/0x180 [ 1917.130907] [c00000004cea3d10] [c000000000119034] .task_work_run+0x114/0x150 [ 1917.130912] [c00000004cea3db0] [c00000000001ba6c] .do_notify_resume+0xcc/0x100 [ 1917.130916] [c00000004cea3e30] [c00000000000a7b0] .ret_from_except_lite+0x5c/0x60 [ 1917.130919] Instruction dump: [ 1917.130921] 7fc3f378 486734b5 60000000 387f00a0 38800003 4bdcb365 60000000 e95f00a0 [ 1917.130927] 694a0060 7d4a0074 794ad182 694a0001 <0b0a0000> 892d02a4 2f890000 40de0134 Signed-off-by: Scott Mayhew <smayhew@redhat.com> Cc: stable@vger.kernel.org # 4.5+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17nfs/write: Use common error handling code in nfs_lock_and_join_requests()Markus Elfring1-8/+9
Add a jump target so that a bit of exception handling can be better reused at the end of this function. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-09-11NFS: various changes relating to reporting IO errors.NeilBrown1-7/+0
1/ remove 'start' and 'end' args from nfs_file_fsync_commit(). They aren't used. 2/ Make nfs_context_set_write_error() a "static inline" in internal.h so we can... 3/ Use nfs_context_set_write_error() instead of mapping_set_error() if nfs_pageio_add_request() fails before sending any request. NFS generally keeps errors in the open_context, not the mapping, so this is more consistent. 4/ If filemap_write_and_write_range() reports any error, still check ctx->error. The value in ctx->error is likely to be more useful. As part of this, NFS_CONTEXT_ERROR_WRITE is cleared slightly earlier, before nfs_file_fsync_commit() is called, rather than at the start of that function. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-11NFS: Add static NFS I/O tracepointsChuck Lever1-0/+7
Tools like tcpdump and rpcdebug can be very useful. But there are plenty of environments where they are difficult or impossible to use. For example, we've had customers report I/O failures during workloads so heavy that collecting network traffic or enabling RPC debugging are themselves onerous. The kernel's static tracepoints are lightweight (less likely to introduce timing changes) and efficient (the trace data is compact). They also work in scenarios where capturing network traffic is not possible due to lack of hardware support (some InfiniBand HCAs) or where data or network privacy is a concern. Introduce tracepoints that show when an NFS READ, WRITE, or COMMIT is initiated, and when it completes. Record the arguments and results of each operation, which are not shown by existing sunrpc module's tracepoints. For instance, the recorded offset and count can be used to match an "initiate" event to a "done" event. If an NFS READ result returns fewer bytes than requested or zero, seeing the EOF flag can be probative. Seeing an NFS4ERR_BAD_STATEID result is also indication of a particular class of problems. The timing information attached to each event record can often be useful as well. Usage example: [root@manet tmp]# trace-cmd record -e nfs:*initiate* -e nfs:*done /sys/kernel/debug/tracing/events/nfs/*initiate*/filter /sys/kernel/debug/tracing/events/nfs/*done/filter Hit Ctrl^C to stop recording ^CKernel buffer statistics: Note: "entries" are the entries left in the kernel ring buffer and are not recorded in the trace data. They should all be zero. CPU: 0 entries: 0 overrun: 0 commit overrun: 0 bytes: 3680 oldest event ts: 78.367422 now ts: 100.124419 dropped events: 0 read events: 74 ... and so on. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-09NFS: Count the bytes of skipped subrequests in nfs_lock_and_join_requests()Trond Myklebust1-1/+5
If we skip a subrequest due to a zero refcount, we should still count the byte range that it covered so that we accurately reconstruct the original request size. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-09NFS: Don't hold the group lock when calling nfs_release_request()Trond Myklebust1-1/+1
That can deadlock if this is the last reference since nfs_page_group_destroy() calls nfs_page_group_sync_on_bit(). Note that even if the page was removed from the subpage list, the req->wb_head could still be pointing to the old head. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-09NFS: Remove pnfs_generic_transfer_commit_list()Trond Myklebust1-0/+2
It's pretty much a duplicate of nfs_scan_commit_list() that also clears the PG_COMMIT_TO_DS flag. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-09NFS: nfs_lock_and_join_requests and nfs_scan_commit_list can deadlockTrond Myklebust1-4/+11
Since the commit list is not ordered, it is possible for nfs_scan_commit_list to hold a request that nfs_lock_and_join_requests() is waiting for, while at the same time trying to grab a request that nfs_lock_and_join_requests already holds. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-06NFS: don't expect errors from mempool_alloc().NeilBrown1-4/+2
Commit fbe77c30e9ab ("NFS: move rw_mode to nfs_pageio_header") reintroduced some pointless code that commit 518662e0fcb9 ("NFS: fix usage of mempools.") had recently removed. Remove it again. Cc: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-20Merge branch 'bugfixes'Trond Myklebust1-1/+1
2017-08-20NFS: Remove unused parameter gfp_flags from nfs_pageio_init()Trond Myklebust1-1/+1
Now that the mirror allocation has been moved, the parameter can go. Also remove the redundant symbol export. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Wait for requests that are locked on the commit listTrond Myklebust1-4/+13
If a request is on the commit list, but is locked, we will currently skip it, which can lead to livelocking when the commit count doesn't reduce to zero. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Switch to using mapping->private_lock for page writeback lookups.Trond Myklebust1-11/+16
Switch from using the inode->i_lock for this to avoid contention with other metadata manipulation. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Use an atomic_long_t to count the number of commitsTrond Myklebust1-5/+7
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Use an atomic_long_t to count the number of requestsTrond Myklebust1-13/+5
Rather than forcing us to take the inode->i_lock just in order to bump the number. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFSv4: Use a mutex to protect the per-inode commit listsTrond Myklebust1-12/+12
The commit lists can get very large, so using the inode->i_lock can end up affecting general metadata performance. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Refactor nfs_page_find_head_request()Trond Myklebust1-12/+30
Split out the 2 cases so that we can treat the locking differently. The issue is that the locking in the pageswapcache cache is highly linked to the commit list locking. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFSv4: Convert nfs_lock_and_join_requests() to use nfs_page_find_head_request()Trond Myklebust1-15/+20
Hide the locking from nfs_lock_and_join_requests() so that we can separate out the requirements for swapcache pages. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Fix up nfs_page_group_covers_page()Trond Myklebust1-12/+6
Fix up the test in nfs_page_group_covers_page(). The simplest implementation is to check that we have a set of intersecting or contiguous subrequests that connect page offset 0 to nfs_page_length(req->wb_page). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Remove unused parameter from nfs_page_group_lock()Trond Myklebust1-3/+3
nfs_page_group_lock() is now always called with the 'nonblock' parameter set to 'false'. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Remove nfs_page_group_clear_bits()Trond Myklebust1-26/+3
At this point, we only expect ever to potentially see PG_REMOVE and PG_TEARDOWN being set on the subrequests. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Fix nfs_page_group_destroy() and nfs_lock_and_join_requests() race casesTrond Myklebust1-29/+29
Since nfs_page_group_destroy() does not take any locks on the requests to be freed, we need to ensure that we don't inadvertently free the request in nfs_destroy_unlinked_subrequests() while the last reference is being released elsewhere. Do this by: 1) Taking a reference to the request unless it is already being freed 2) Checking (under the page group lock) if PG_TEARDOWN is already set before freeing an unreferenced request in nfs_destroy_unlinked_subrequests() Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Further optimise nfs_lock_and_join_requests()Trond Myklebust1-27/+18
When locking the entire group in order to remove subrequests, the locks are always taken in order, and with the page group lock being taken after the page head is locked. The intention is that: 1) The lock on the group head guarantees that requests may not be removed from the group (although new entries could be appended if we're not holding the group lock). 2) It is safe to drop and retake the page group lock while iterating through the list, in particular when waiting for a subrequest lock. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Reduce inode->i_lock contention in nfs_lock_and_join_requests()Trond Myklebust1-18/+22
We should no longer need the inode->i_lock, now that we've straightened out the request locking. The locking schema is now: 1) Lock page head request 2) Lock the page group 3) Lock the subrequests one by one Note that there is a subtle race with nfs_inode_remove_request() due to the fact that the latter does not lock the page head, when removing it from the struct page. Only the last subrequest is locked, hence we need to re-check that the PagePrivate(page) is still set after we've locked all the subrequests. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Remove page group limit in nfs_flush_incompatible()Trond Myklebust1-2/+0
nfs_try_to_update_request() should be able to cope now. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Teach nfs_try_to_update_request() to deal with request page_groupsTrond Myklebust1-40/+20
Simplify the code, and avoid some flushes to disk. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Fix the inode request accounting when pages have subrequestsTrond Myklebust1-12/+15
Both nfs_destroy_unlinked_subrequests() and nfs_lock_and_join_requests() manipulate the inode flags adjusting the NFS_I(inode)->nrequests. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Don't unlock writebacks before declaring PG_WB_ENDTrond Myklebust1-4/+4
We don't want nfs_lock_and_join_requests() to start fiddling with the request before the call to nfs_page_group_sync_on_bit(). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>