aboutsummaryrefslogtreecommitdiffstats
path: root/fs/nfs (follow)
AgeCommit message (Collapse)AuthorFilesLines
2017-09-14Merge tag 'nfs-for-4.14-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds7-20/+287
Pull more NFS client updates from Trond Myklebust: "Hightlights include: Bugfixes: - Various changes relating to reporting IO errors. - pnfs: Use the standard I/O stateid when calling LAYOUTGET Features: - Add static NFS I/O tracepoints for debugging" * tag 'nfs-for-4.14-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS: various changes relating to reporting IO errors. NFS: Add static NFS I/O tracepoints pNFS: Use the standard I/O stateid when calling LAYOUTGET
2017-09-14Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-1/+1
Pull mount flag updates from Al Viro: "Another chunk of fmount preparations from dhowells; only trivial conflicts for that part. It separates MS_... bits (very grotty mount(2) ABI) from the struct super_block ->s_flags (kernel-internal, only a small subset of MS_... stuff). This does *not* convert the filesystems to new constants; only the infrastructure is done here. The next step in that series is where the conflicts would be; that's the conversion of filesystems. It's purely mechanical and it's better done after the merge, so if you could run something like list=$(for i in MS_RDONLY MS_NOSUID MS_NODEV MS_NOEXEC MS_SYNCHRONOUS MS_MANDLOCK MS_DIRSYNC MS_NOATIME MS_NODIRATIME MS_SILENT MS_POSIXACL MS_KERNMOUNT MS_I_VERSION MS_LAZYTIME; do git grep -l $i fs drivers/staging/lustre drivers/mtd ipc mm include/linux; done|sort|uniq|grep -v '^fs/namespace.c$') sed -i -e 's/\<MS_RDONLY\>/SB_RDONLY/g' \ -e 's/\<MS_NOSUID\>/SB_NOSUID/g' \ -e 's/\<MS_NODEV\>/SB_NODEV/g' \ -e 's/\<MS_NOEXEC\>/SB_NOEXEC/g' \ -e 's/\<MS_SYNCHRONOUS\>/SB_SYNCHRONOUS/g' \ -e 's/\<MS_MANDLOCK\>/SB_MANDLOCK/g' \ -e 's/\<MS_DIRSYNC\>/SB_DIRSYNC/g' \ -e 's/\<MS_NOATIME\>/SB_NOATIME/g' \ -e 's/\<MS_NODIRATIME\>/SB_NODIRATIME/g' \ -e 's/\<MS_SILENT\>/SB_SILENT/g' \ -e 's/\<MS_POSIXACL\>/SB_POSIXACL/g' \ -e 's/\<MS_KERNMOUNT\>/SB_KERNMOUNT/g' \ -e 's/\<MS_I_VERSION\>/SB_I_VERSION/g' \ -e 's/\<MS_LAZYTIME\>/SB_LAZYTIME/g' \ $list and commit it with something along the lines of 'convert filesystems away from use of MS_... constants' as commit message, it would save a quite a bit of headache next cycle" * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: VFS: Differentiate mount flags (MS_*) from internal superblock flags VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb) vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags
2017-09-11Merge tag 'nfs-for-4.14-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds16-565/+490
Pull NFS client updates from Trond Myklebust: "Hightlights include: Stable bugfixes: - Fix mirror allocation in the writeback code to avoid a use after free - Fix the O_DSYNC writes to use the correct byte range - Fix 2 use after free issues in the I/O code Features: - Writeback fixes to split up the inode->i_lock in order to reduce contention - RPC client receive fixes to reduce the amount of time the xprt->transport_lock is held when receiving data from a socket into am XDR buffer. - Ditto fixes to reduce contention between call side users of the rdma rb_lock, and its use in rpcrdma_reply_handler. - Re-arrange rdma stats to reduce false cacheline sharing. - Various rdma cleanups and optimisations. - Refactor the NFSv4.1 exchange id code and clean up the code. - Const-ify all instances of struct rpc_xprt_ops Bugfixes: - Fix the NFSv2 'sec=' mount option. - NFSv4.1: don't use machine credentials for CLOSE when using 'sec=sys' - Fix the NFSv3 GRANT callback when the port changes on the server. - Fix livelock issues with COMMIT - NFSv4: Use correct inode in _nfs4_opendata_to_nfs4_state() when doing and NFSv4.1 open by filehandle" * tag 'nfs-for-4.14-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (69 commits) NFS: Count the bytes of skipped subrequests in nfs_lock_and_join_requests() NFS: Don't hold the group lock when calling nfs_release_request() NFS: Remove pnfs_generic_transfer_commit_list() NFS: nfs_lock_and_join_requests and nfs_scan_commit_list can deadlock NFS: Fix 2 use after free issues in the I/O code NFS: Sync the correct byte range during synchronous writes lockd: Delete an error message for a failed memory allocation in reclaimer() NFS: remove jiffies field from access cache NFS: flush data when locking a file to ensure cache coherence for mmap. SUNRPC: remove some dead code. NFS: don't expect errors from mempool_alloc(). xprtrdma: Use xprt_pin_rqst in rpcrdma_reply_handler xprtrdma: Re-arrange struct rx_stats NFS: Fix NFSv2 security settings NFSv4.1: don't use machine credentials for CLOSE when using 'sec=sys' SUNRPC: ECONNREFUSED should cause a rebind. NFS: Remove unused parameter gfp_flags from nfs_pageio_init() NFSv4: Fix up mirror allocation SUNRPC: Add a separate spinlock to protect the RPC request receive list SUNRPC: Cleanup xs_tcp_read_common() ...
2017-09-11NFS: various changes relating to reporting IO errors.NeilBrown4-15/+19
1/ remove 'start' and 'end' args from nfs_file_fsync_commit(). They aren't used. 2/ Make nfs_context_set_write_error() a "static inline" in internal.h so we can... 3/ Use nfs_context_set_write_error() instead of mapping_set_error() if nfs_pageio_add_request() fails before sending any request. NFS generally keeps errors in the open_context, not the mapping, so this is more consistent. 4/ If filemap_write_and_write_range() reports any error, still check ctx->error. The value in ctx->error is likely to be more useful. As part of this, NFS_CONTEXT_ERROR_WRITE is cleared slightly earlier, before nfs_file_fsync_commit() is called, rather than at the start of that function. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-11NFS: Add static NFS I/O tracepointsChuck Lever3-0/+259
Tools like tcpdump and rpcdebug can be very useful. But there are plenty of environments where they are difficult or impossible to use. For example, we've had customers report I/O failures during workloads so heavy that collecting network traffic or enabling RPC debugging are themselves onerous. The kernel's static tracepoints are lightweight (less likely to introduce timing changes) and efficient (the trace data is compact). They also work in scenarios where capturing network traffic is not possible due to lack of hardware support (some InfiniBand HCAs) or where data or network privacy is a concern. Introduce tracepoints that show when an NFS READ, WRITE, or COMMIT is initiated, and when it completes. Record the arguments and results of each operation, which are not shown by existing sunrpc module's tracepoints. For instance, the recorded offset and count can be used to match an "initiate" event to a "done" event. If an NFS READ result returns fewer bytes than requested or zero, seeing the EOF flag can be probative. Seeing an NFS4ERR_BAD_STATEID result is also indication of a particular class of problems. The timing information attached to each event record can often be useful as well. Usage example: [root@manet tmp]# trace-cmd record -e nfs:*initiate* -e nfs:*done /sys/kernel/debug/tracing/events/nfs/*initiate*/filter /sys/kernel/debug/tracing/events/nfs/*done/filter Hit Ctrl^C to stop recording ^CKernel buffer statistics: Note: "entries" are the entries left in the kernel ring buffer and are not recorded in the trace data. They should all be zero. CPU: 0 entries: 0 overrun: 0 commit overrun: 0 bytes: 3680 oldest event ts: 78.367422 now ts: 100.124419 dropped events: 0 read events: 74 ... and so on. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-11pNFS: Use the standard I/O stateid when calling LAYOUTGETTrond Myklebust1-5/+9
Instead of having a private method for copying the open/delegation stateid, use the same call that is used for standard I/O through the MDS. Note that this means we transmit the stateid with a zero seqid, avoiding issues with NFS4ERR_OLD_STATEID. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-09NFS: Count the bytes of skipped subrequests in nfs_lock_and_join_requests()Trond Myklebust1-1/+5
If we skip a subrequest due to a zero refcount, we should still count the byte range that it covered so that we accurately reconstruct the original request size. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-09Merge tag 'nfsd-4.14' of git://linux-nfs.org/~bfields/linuxLinus Torvalds1-5/+5
Pull nfsd updates from Bruce Fields: "More RDMA work and some op-structure constification from Chuck Lever, and a small cleanup to our xdr encoding" * tag 'nfsd-4.14' of git://linux-nfs.org/~bfields/linux: svcrdma: Estimate Send Queue depth properly rdma core: Add rdma_rw_mr_payload() svcrdma: Limit RQ depth svcrdma: Populate tail iovec when receiving nfsd: Incoming xdr_bufs may have content in tail buffer svcrdma: Clean up svc_rdma_build_read_chunk() sunrpc: Const-ify struct sv_serv_ops nfsd: Const-ify NFSv4 encoding and decoding ops arrays sunrpc: Const-ify instances of struct svc_xprt_ops nfsd4: individual encoders no longer see error cases nfsd4: skip encoder in trivial error cases nfsd4: define ->op_release for compound ops nfsd4: opdesc will be useful outside nfs4proc.c nfsd4: move some nfsd4 op definitions to xdr4.h
2017-09-09NFS: Don't hold the group lock when calling nfs_release_request()Trond Myklebust1-1/+1
That can deadlock if this is the last reference since nfs_page_group_destroy() calls nfs_page_group_sync_on_bit(). Note that even if the page was removed from the subpage list, the req->wb_head could still be pointing to the old head. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-09NFS: Remove pnfs_generic_transfer_commit_list()Trond Myklebust2-41/+4
It's pretty much a duplicate of nfs_scan_commit_list() that also clears the PG_COMMIT_TO_DS flag. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-09NFS: nfs_lock_and_join_requests and nfs_scan_commit_list can deadlockTrond Myklebust2-9/+22
Since the commit list is not ordered, it is possible for nfs_scan_commit_list to hold a request that nfs_lock_and_join_requests() is waiting for, while at the same time trying to grab a request that nfs_lock_and_join_requests already holds. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-08NFS: Fix 2 use after free issues in the I/O codeTrond Myklebust3-17/+12
The writeback code wants to send a commit after processing the pages, which is why we want to delay releasing the struct path until after that's done. Also, the layout code expects that we do not free the inode before we've put the layout segments in pnfs_writehdr_free() and pnfs_readhdr_free() Fixes: 919e3bd9a875 ("NFS: Ensure we commit after writeback is complete") Fixes: 4714fb51fd03 ("nfs: remove pgio_header refcount, related cleanup") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-07Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+1
Pull block layer updates from Jens Axboe: "This is the first pull request for 4.14, containing most of the code changes. It's a quiet series this round, which I think we needed after the churn of the last few series. This contains: - Fix for a registration race in loop, from Anton Volkov. - Overflow complaint fix from Arnd for DAC960. - Series of drbd changes from the usual suspects. - Conversion of the stec/skd driver to blk-mq. From Bart. - A few BFQ improvements/fixes from Paolo. - CFQ improvement from Ritesh, allowing idling for group idle. - A few fixes found by Dan's smatch, courtesy of Dan. - A warning fixup for a race between changing the IO scheduler and device remova. From David Jeffery. - A few nbd fixes from Josef. - Support for cgroup info in blktrace, from Shaohua. - Also from Shaohua, new features in the null_blk driver to allow it to actually hold data, among other things. - Various corner cases and error handling fixes from Weiping Zhang. - Improvements to the IO stats tracking for blk-mq from me. Can drastically improve performance for fast devices and/or big machines. - Series from Christoph removing bi_bdev as being needed for IO submission, in preparation for nvme multipathing code. - Series from Bart, including various cleanups and fixes for switch fall through case complaints" * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits) kernfs: checking for IS_ERR() instead of NULL drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set drbd: Fix allyesconfig build, fix recent commit drbd: switch from kmalloc() to kmalloc_array() drbd: abort drbd_start_resync if there is no connection drbd: move global variables to drbd namespace and make some static drbd: rename "usermode_helper" to "drbd_usermode_helper" drbd: fix race between handshake and admin disconnect/down drbd: fix potential deadlock when trying to detach during handshake drbd: A single dot should be put into a sequence. drbd: fix rmmod cleanup, remove _all_ debugfs entries drbd: Use setup_timer() instead of init_timer() to simplify the code. drbd: fix potential get_ldev/put_ldev refcount imbalance during attach drbd: new disk-option disable-write-same drbd: Fix resource role for newly created resources in events2 drbd: mark symbols static where possible drbd: Send P_NEG_ACK upon write error in protocol != C drbd: add explicit plugging when submitting batches drbd: change list_for_each_safe to while(list_first_entry_or_null) drbd: introduce drbd_recv_header_maybe_unplug ...
2017-09-07NFS: Sync the correct byte range during synchronous writestarangg@amazon.com1-3/+3
Since commit 18290650b1c8 ("NFS: Move buffered I/O locking into nfs_file_write()") nfs_file_write() has not flushed the correct byte range during synchronous writes. generic_write_sync() expects that iocb->ki_pos points to the right edge of the range rather than the left edge. To replicate the problem, open a file with O_DSYNC, have the client write at increasing offsets, and then print the successful offsets. Block port 2049 partway through that sequence, and observe that the client application indicates successful writes in advance of what the server received. Fixes: 18290650b1c8 ("NFS: Move buffered I/O locking into nfs_file_write()") Signed-off-by: Jacob Strauss <jsstraus@amazon.com> Signed-off-by: Tarang Gupta <tarangg@amazon.com> Tested-by: Tarang Gupta <tarangg@amazon.com> Cc: stable@vger.kernel.org # v4.8+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-06fscache: remove unused ->now_uncached callbackJan Kara1-40/+0
Patch series "Ranged pagevec lookup", v2. In this series I make pagevec_lookup() update the index (to be consistent with pagevec_lookup_tag() and also as a preparation for ranged lookups), provide ranged variant of pagevec_lookup() and use it in places where it makes sense. This not only removes some common code but is also a measurable performance win for some use cases (see patch 4/10) where radix tree is sparse and searching & grabing of a page after the end of the range has measurable overhead. This patch (of 10): The callback doesn't ever get called. Remove it. Link: http://lkml.kernel.org/r/20170726114704.7626-2-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06NFS: remove jiffies field from access cacheNeilBrown2-5/+0
This field hasn't been used since commit 57b691819ee2 ("NFS: Cache access checks more aggressively"). Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-06NFS: flush data when locking a file to ensure cache coherence for mmap.NeilBrown1-4/+7
When a byte range lock (or flock) is taken out on an NFS file, the validity of the cached data is checked and the inode is marked NFS_INODE_INVALID_DATA. However the cached data isn't flushed from the page cache. This is sufficient for future read() requests or mmap() requests as they call nfs_revalidate_mapping() which performs the flush if necessary. However an existing mapping is not affected. Accessing data through that mapping will continue to return old data even though the inode is marked NFS_INODE_INVALID_DATA. This can easily be confirmed using the 'nfs' tool in git://github.com/okirch/twopence-nfs.git and running nfs coherence FILENAME on one client, and nfs coherence -r FILENAME on another client. It appears that prior to Linux 2.6.0 this worked correctly. However commit: http://git.kernel.org/cgit/linux/kernel/git/history/history.git/commit/?id=ca9268fe3ddd075714005adecd4afbd7f9ab87d0 removed the call to inode_invalidate_pages() from nfs_zap_caches(). I haven't tested this code, but inspection suggests that prior to this commit, file locking would invalidate all inode pages. This patch adds a call to nfs_revalidate_mapping() after a successful SETLK so that invalid data is flushed. With this patch the above test passes. To minimize impact (and possibly avoid a GETATTR call) this only happens if the mapping might be mapped into userspace. Cc: Olaf Kirch <okir@suse.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-06NFS: don't expect errors from mempool_alloc().NeilBrown1-4/+2
Commit fbe77c30e9ab ("NFS: move rw_mode to nfs_pageio_header") reintroduced some pointless code that commit 518662e0fcb9 ("NFS: fix usage of mempools.") had recently removed. Remove it again. Cc: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-24sunrpc: Const-ify struct sv_serv_opsChuck Lever1-5/+5
Close an attack vector by moving the arrays of per-server methods to read-only memory. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-23block: replace bi_bdev with a gendisk pointer and partitions indexChristoph Hellwig1-1/+1
This way we don't need a block_device structure to submit I/O. The block_device has different life time rules from the gendisk and request_queue and is usually only available when the block device node is open. Other callers need to explicitly create one (e.g. the lightnvm passthrough code, or the new nvme multipathing code). For the actual I/O path all that we need is the gendisk, which exists once per block device. But given that the block layer also does partition remapping we additionally need a partition index, which is used for said remapping in generic_make_request. Note that all the block drivers generally want request_queue or sometimes the gendisk, so this removes a layer of indirection all over the stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-20Merge branch 'bugfixes'Trond Myklebust5-43/+61
2017-08-20NFS: Fix NFSv2 security settingsChuck Lever1-4/+8
For a while now any NFSv2 mount where sec= is specified uses AUTH_NULL. If sec= is not specified, the mount uses AUTH_UNIX. Commit e68fd7c8071d ("mount: use sec= that was specified on the command line") attempted to address a very similar problem with NFSv3, and should have fixed this too, but it has a bug. The MNTv1 MNT procedure does not return a list of security flavors, so our client makes up a list containing just AUTH_NULL. This should enable nfs_verify_authflavors() to assign the sec= specified flavor, but instead, it incorrectly sets it to AUTH_NULL. I expect this would also be a problem for any NFSv3 server whose MNTv3 MNT procedure returned a security flavor list containing only AUTH_NULL. Fixes: e68fd7c8071d ("mount: use sec= that was specified on ... ") BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=310 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-20NFSv4.1: don't use machine credentials for CLOSE when using 'sec=sys'NeilBrown1-0/+11
An NFSv4.1 client might close a file after the user who opened it has logged off. In this case the user's credentials may no longer be valid, if they are e.g. kerberos credentials that have expired. NFSv4.1 has a mechanism to allow the client to use machine credentials to close a file. However due to a short-coming in the RFC, a CLOSE with those credentials may not be possible if the file in question isn't exported to the same security flavor - the required PUTFH must be rejected when this is the case. Specifically if a server and client support kerberos in general and have used it to form a machine credential, but the file is only exported to "sec=sys", a PUTFH with the machine credentials will fail, so CLOSE is not possible. As RPC_AUTH_UNIX (used by sec=sys) credentials can never expire, there is no value in using the machine credential in place of them. So in that case, just use the users credentials for CLOSE etc, as you would in NFSv4.0 Signed-off-by: Neil Brown <neilb@suse.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-20NFS: Remove unused parameter gfp_flags from nfs_pageio_init()Trond Myklebust3-5/+3
Now that the mirror allocation has been moved, the parameter can go. Also remove the redundant symbol export. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-20NFSv4: Fix up mirror allocationTrond Myklebust1-34/+39
There are a number of callers of nfs_pageio_complete() that want to continue using the nfs_pageio_descriptor without needing to call nfs_pageio_init() again. Examples include nfs_pageio_resend() and nfs_pageio_cond_complete(). The problem is that nfs_pageio_complete() also calls nfs_pageio_cleanup_mirroring(), which frees up the array of mirrors. This can lead to writeback errors, in the next call to nfs_pageio_setup_mirroring(). Fix by simply moving the allocation of the mirrors to nfs_pageio_setup_mirroring(). Link: https://bugzilla.kernel.org/show_bug.cgi?id=196709 Reported-by: JianhongYin <yin-jianhong@163.com> Cc: stable@vger.kernel.org # 4.0+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-18Merge branch 'writeback'Trond Myklebust9-348/+257
2017-08-15NFS: Wait for requests that are locked on the commit listTrond Myklebust3-8/+29
If a request is on the commit list, but is locked, we will currently skip it, which can lead to livelocking when the commit count doesn't reduce to zero. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFSv4/pnfs: Replace pnfs_put_lseg_locked() with pnfs_put_lseg()Trond Myklebust3-45/+2
Now that we no longer hold the inode->i_lock when manipulating the commit lists, it is safe to call pnfs_put_lseg() again. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Switch to using mapping->private_lock for page writeback lookups.Trond Myklebust1-11/+16
Switch from using the inode->i_lock for this to avoid contention with other metadata manipulation. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Use an atomic_long_t to count the number of commitsTrond Myklebust2-6/+8
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Use an atomic_long_t to count the number of requestsTrond Myklebust5-22/+11
Rather than forcing us to take the inode->i_lock just in order to bump the number. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFSv4: Use a mutex to protect the per-inode commit listsTrond Myklebust4-22/+22
The commit lists can get very large, so using the inode->i_lock can end up affecting general metadata performance. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Refactor nfs_page_find_head_request()Trond Myklebust1-12/+30
Split out the 2 cases so that we can treat the locking differently. The issue is that the locking in the pageswapcache cache is highly linked to the commit list locking. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFSv4: Convert nfs_lock_and_join_requests() to use nfs_page_find_head_request()Trond Myklebust1-15/+20
Hide the locking from nfs_lock_and_join_requests() so that we can separate out the requirements for swapcache pages. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Fix up nfs_page_group_covers_page()Trond Myklebust1-12/+6
Fix up the test in nfs_page_group_covers_page(). The simplest implementation is to check that we have a set of intersecting or contiguous subrequests that connect page offset 0 to nfs_page_length(req->wb_page). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Remove unused parameter from nfs_page_group_lock()Trond Myklebust2-23/+14
nfs_page_group_lock() is now always called with the 'nonblock' parameter set to 'false'. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Remove unuse function nfs_page_group_lock_wait()Trond Myklebust1-21/+0
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Remove nfs_page_group_clear_bits()Trond Myklebust1-26/+3
At this point, we only expect ever to potentially see PG_REMOVE and PG_TEARDOWN being set on the subrequests. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Fix nfs_page_group_destroy() and nfs_lock_and_join_requests() race casesTrond Myklebust1-29/+29
Since nfs_page_group_destroy() does not take any locks on the requests to be freed, we need to ensure that we don't inadvertently free the request in nfs_destroy_unlinked_subrequests() while the last reference is being released elsewhere. Do this by: 1) Taking a reference to the request unless it is already being freed 2) Checking (under the page group lock) if PG_TEARDOWN is already set before freeing an unreferenced request in nfs_destroy_unlinked_subrequests() Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Further optimise nfs_lock_and_join_requests()Trond Myklebust1-27/+18
When locking the entire group in order to remove subrequests, the locks are always taken in order, and with the page group lock being taken after the page head is locked. The intention is that: 1) The lock on the group head guarantees that requests may not be removed from the group (although new entries could be appended if we're not holding the group lock). 2) It is safe to drop and retake the page group lock while iterating through the list, in particular when waiting for a subrequest lock. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Reduce inode->i_lock contention in nfs_lock_and_join_requests()Trond Myklebust1-18/+22
We should no longer need the inode->i_lock, now that we've straightened out the request locking. The locking schema is now: 1) Lock page head request 2) Lock the page group 3) Lock the subrequests one by one Note that there is a subtle race with nfs_inode_remove_request() due to the fact that the latter does not lock the page head, when removing it from the struct page. Only the last subrequest is locked, hence we need to re-check that the PagePrivate(page) is still set after we've locked all the subrequests. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Remove page group limit in nfs_flush_incompatible()Trond Myklebust1-2/+0
nfs_try_to_update_request() should be able to cope now. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Teach nfs_try_to_update_request() to deal with request page_groupsTrond Myklebust1-40/+20
Simplify the code, and avoid some flushes to disk. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Fix the inode request accounting when pages have subrequestsTrond Myklebust1-12/+15
Both nfs_destroy_unlinked_subrequests() and nfs_lock_and_join_requests() manipulate the inode flags adjusting the NFS_I(inode)->nrequests. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Don't unlock writebacks before declaring PG_WB_ENDTrond Myklebust1-4/+4
We don't want nfs_lock_and_join_requests() to start fiddling with the request before the call to nfs_page_group_sync_on_bit(). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Don't check request offset and size without holding a lockTrond Myklebust1-12/+12
Request offsets and sizes are not guaranteed to be stable unless you are holding the request locked. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Fix an ABBA issue in nfs_lock_and_join_requests()Trond Myklebust1-12/+17
All other callers of nfs_page_group_lock() appear to already hold the page lock on the head page, so doing it in the opposite order here is inefficient, although not deadlock prone since we roll back all locks on contention. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Fix a reference and lock leak in nfs_lock_and_join_requests()Trond Myklebust1-2/+1
Yes, this is a situation that should never happen (hence the WARN_ON) but we should still ensure that we free up the locks and references to the faulty pages. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Ensure we always dereference the page head lastTrond Myklebust1-5/+6
This fixes a race with nfs_page_group_sync_on_bit() whereby the call to wake_up_bit() in nfs_page_group_unlock() could occur after the page header had been freed. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Reduce lock contention in nfs_try_to_update_request()Trond Myklebust1-5/+3
Micro-optimisation to move the lockless check into the for(;;) loop. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>