aboutsummaryrefslogtreecommitdiffstats
path: root/fs/nfs/write.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2015-10-21NFSv4.1/pnfs: Retry through MDS when getting bad length of dataKinglong Mee1-0/+7
If non rpc-based layout driver return bad length of data, nfs retries by calling rpc_restart_call_prepare() that cause an NULL reference panic. This patch lets nfs retry through MDS for non rpc-based layout driver return bad length of data. [13034.883329] BUG: unable to handle kernel NULL pointer dereference at (null) [13034.884902] IP: [<ffffffffa00db372>] rpc_restart_call_prepare+0x62/0x90 [sunrpc] [13034.886558] PGD 0 [13034.888126] Oops: 0000 [#1] KASAN [13034.889710] Modules linked in: blocklayoutdriver(OE) nfsv4(OE) nfs(OE) fscache(E) nfsd(OE) xfs libcrc32c coretemp btrfs crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ppdev vmw_balloon auth_rpcgss shpchp nfs_acl lockd vmw_vmci parport_pc xor raid6_pq grace parport sunrpc i2c_piix4 vmwgfx drm_kms_helper ttm drm mptspi e1000 serio_raw scsi_transport_spi mptscsih mptbase ata_generic pata_acpi [last unloaded: fscache] [13034.898260] CPU: 0 PID: 10112 Comm: kworker/0:1 Tainted: G OE 4.3.0-rc5+ #279 [13034.899932] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [13034.903342] Workqueue: events bl_read_cleanup [blocklayoutdriver] [13034.905059] task: ffff88006a9148c0 ti: ffff880035e90000 task.ti: ffff880035e90000 [13034.906827] RIP: 0010:[<ffffffffa00db372>] [<ffffffffa00db372>] rpc_restart_call_prepare+0x62/0x90 [sunrpc] [13034.910522] RSP: 0018:ffff880035e97b58 EFLAGS: 00010282 [13034.912378] RAX: fffffbfff04a5a94 RBX: ffff880068fe4858 RCX: 0000000000000003 [13034.914339] RDX: dffffc0000000000 RSI: 0000000000000003 RDI: 0000000000000282 [13034.916236] RBP: ffff880035e97b68 R08: 0000000000000001 R09: 0000000000000001 [13034.918229] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 [13034.920007] R13: ffff880068fe4858 R14: ffff880068fe4a60 R15: 0000000000001000 [13034.921845] FS: 0000000000000000(0000) GS:ffffffff82247000(0000) knlGS:0000000000000000 [13034.923645] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [13034.925525] CR2: 0000000000000000 CR3: 00000000063dd000 CR4: 00000000001406f0 [13034.932808] Stack: [13034.934813] ffff880068fe4780 0000000000001000 ffff880035e97ba8 ffffffffa08800d2 [13034.936675] ffffffffa088029d ffff880068fe4780 ffff880068fe4858 ffffffffa089c0a0 [13034.938593] ffff880068fe47e0 ffff88005d59faf0 ffff880035e97be0 ffffffffa087e08f [13034.940454] Call Trace: [13034.942388] [<ffffffffa08800d2>] nfs_readpage_result+0x112/0x200 [nfs] [13034.944317] [<ffffffffa088029d>] ? nfs_readpage_done+0xdd/0x160 [nfs] [13034.946267] [<ffffffffa087e08f>] nfs_pgio_result+0x9f/0x120 [nfs] [13034.948166] [<ffffffffa09266cc>] pnfs_ld_read_done+0x7c/0x1e0 [nfsv4] [13034.950247] [<ffffffffa03b07ee>] bl_read_cleanup+0x2e/0x60 [blocklayoutdriver] [13034.952156] [<ffffffff810ebf62>] process_one_work+0x412/0x870 [13034.954102] [<ffffffff810ebe84>] ? process_one_work+0x334/0x870 [13034.955949] [<ffffffff810ebb50>] ? queue_delayed_work_on+0x40/0x40 [13034.957985] [<ffffffff810ec441>] worker_thread+0x81/0x6a0 [13034.959817] [<ffffffff810ec3c0>] ? process_one_work+0x870/0x870 [13034.961785] [<ffffffff810f43bd>] kthread+0x17d/0x1a0 [13034.963544] [<ffffffff810f4240>] ? kthread_create_on_node+0x330/0x330 [13034.965479] [<ffffffff81100428>] ? finish_task_switch+0x88/0x220 [13034.967223] [<ffffffff810f4240>] ? kthread_create_on_node+0x330/0x330 [13034.968929] [<ffffffff81b6ae5f>] ret_from_fork+0x3f/0x70 [13034.970534] [<ffffffff810f4240>] ? kthread_create_on_node+0x330/0x330 [13034.972176] Code: c7 43 50 40 84 0d a0 e8 3d fe 1c e1 48 8d 7b 58 c7 83 e4 00 00 00 00 00 00 00 e8 ca fe 1c e1 4c 8b 63 58 4c 89 e7 e8 be fe 1c e1 <49> 83 3c 24 00 74 12 48 c7 43 50 f0 a2 0e a0 b8 01 00 00 00 5b [13034.977148] RIP [<ffffffffa00db372>] rpc_restart_call_prepare+0x62/0x90 [sunrpc] [13034.978780] RSP <ffff880035e97b58> [13034.980399] CR2: 0000000000000000 Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-10-02NFS: Fix a write performance regressionTrond Myklebust1-1/+1
If all other conditions in nfs_can_extend_write() are met, and there are no locks, then we should be able to assume close-to-open semantics and the ability to extend our write to cover the whole page. With this patch, the xfstests generic/074 test completes in 242s instead of >1400s on my test rig. Fixes: bd61e0a9c852 ("locks: convert posix locks to file_lock_context") Cc: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-10-02NFS: Fix up page writeback accountingTrond Myklebust1-6/+6
Currently, we are crediting all the calls to nfs_writepages_callback() (i.e. the nfs_writepages() callback) to nfs_writepage(). Aside from being inconsistent with the behaviour of the equivalent readpage/readpages accounting, this also means that we cannot distinguish between bulk writes and single page writebacks (which confuses the 'nfsiostat -p' tool). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-09-20NFS: Do cleanup before resetting pageio read/write to mdsKinglong Mee1-0/+3
There is a reference leak of layout segment after resetting pageio read/write to mds. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Cc: stable@vger.kernel.org # v4.0+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-17NFS: Rename nfs_commit_unstable_pages() to nfs_write_inode()Anna Schumaker1-6/+1
All nfs_write_inode() does is pass its arguments to nfs_commit_unstable_pages(). Let's cut out the middle man and have nfs_write_pages() do the work directly. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-08-10NFSv4.1/pnfs: Fix atomicity of commit list updatesTrond Myklebust1-5/+24
pnfs_layout_mark_request_commit() needs to ensure that it adds the request to the commit list atomically with all the other updates in order to prevent corruption to buckets[ds_commit_idx].wlseg due to races with pnfs_generic_clear_request_commit(). Fixes: 338d00cfef07d ("pnfs: Refactor the *_layout_mark_request_commit...") Cc: stable@vger.kernel.org # v4.0+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-28Merge tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds1-6/+9
Pull NFS client bugfixes from Trond Myklebust: "Highlights include: Stable patches: - Fix a situation where the client uses the wrong (zero) stateid. - Fix a memory leak in nfs_do_recoalesce Bugfixes: - Plug a memory leak when ->prepare_layoutcommit fails - Fix an Oops in the NFSv4 open code - Fix a backchannel deadlock - Fix a livelock in sunrpc when sendmsg fails due to low memory availability - Don't revalidate the mapping if both size and change attr are up to date - Ensure we don't miss a file extension when doing pNFS - Several fixes to handle NFSv4.1 sequence operation status bits correctly - Several pNFS layout return bugfixes" * tag 'nfs-for-4.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (28 commits) nfs: Fix an oops caused by using other thread's stack space in ASYNC mode nfs: plug memory leak when ->prepare_layoutcommit fails SUNRPC: Report TCP errors to the caller sunrpc: translate -EAGAIN to -ENOBUFS when socket is writable. NFSv4.2: handle NFS-specific llseek errors NFS: Don't clear desc->pg_moreio in nfs_do_recoalesce() NFS: Fix a memory leak in nfs_do_recoalesce NFS: nfs_mark_for_revalidate should always set NFS_INO_REVAL_PAGECACHE NFS: Remove the "NFS_CAP_CHANGE_ATTR" capability NFS: Set NFS_INO_REVAL_PAGECACHE if the change attribute is uninitialised NFS: Don't revalidate the mapping if both size and change attr are up to date NFSv4/pnfs: Ensure we don't miss a file extension NFSv4: We must set NFS_OPEN_STATE flag in nfs_resync_open_stateid_locked SUNRPC: xprt_complete_bc_request must also decrement the free slot count SUNRPC: Fix a backchannel deadlock pNFS: Don't throw out valid layout segments pNFS: pnfs_roc_drain() fix a race with open pNFS: Fix races between return-on-close and layoutreturn. pNFS: pnfs_roc_drain should return 'true' when sleeping pNFS: Layoutreturn must invalidate all existing layout segments. ...
2015-07-22NFSv4/pnfs: Ensure we don't miss a file extensionTrond Myklebust1-6/+9
pNFS writes don't return attributes, however that doesn't mean that we should ignore the fact that they may be extending the file. This patch ensures that if a write is seen to extend the file, then we always set an attribute barrier, and update the cached file size. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-07-02Merge tag 'nfs-for-4.2-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds1-7/+2
Pull NFS client updates from Trond Myklebust: "Highlights include: Stable patches: - Fix a crash in the NFSv4 file locking code. - Fix an fsync() regression, where we were failing to retry I/O in some circumstances. - Fix an infinite loop in NFSv4.0 OPEN stateid recovery - Fix a memory leak when an attempted pnfs fails. - Fix a memory leak in the backchannel code - Large hostnames were not supported correctly in NFSv4.1 - Fix a pNFS/flexfiles bug that was impeding error reporting on I/O. - Fix a couple of credential issues in pNFS/flexfiles Bugfixes + cleanups: - Open flag sanity checks in the NFSv4 atomic open codepath - More NFSv4 delegation related bugfixes - Various NFSv4.1 backchannel bugfixes and cleanups - Fix the NFS swap socket code - Various cleanups of the NFSv4 SETCLIENTID and EXCHANGE_ID code - Fix a UDP transport deadlock issue Features: - More RDMA client transport improvements - NFSv4.2 LAYOUTSTATS functionality for pnfs flexfiles" * tag 'nfs-for-4.2-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (87 commits) nfs: Remove invalid tk_pid from debug message nfs: Remove invalid NFS_ATTR_FATTR_V4_REFERRAL checking in nfs4_get_rootfh nfs: Drop bad comment in nfs41_walk_client_list() nfs: Remove unneeded micro checking of CONFIG_PROC_FS nfs: Don't setting FILE_CREATED flags always nfs: Use remove_proc_subtree() instead remove_proc_entry() nfs: Remove unused argument in nfs_server_set_fsinfo() nfs: Fix a memory leak when meeting an unsupported state protect nfs: take extra reference to fl->fl_file when running a LOCKU operation NFSv4: When returning a delegation, don't reclaim an incompatible open mode. NFSv4.2: LAYOUTSTATS is optional to implement NFSv4.2: Fix up a decoding error in layoutstats pNFS/flexfiles: Fix the reset of struct pgio_header when resending pNFS/flexfiles: Turn off layoutcommit for servers that don't need it pnfs/flexfiles: protect ktime manipulation with mirror lock nfs: provide pnfs_report_layoutstat when NFS42 is disabled nfs: verify open flags before allowing open nfs: always update creds in mirror, even when we have an already connected ds nfs: fix potential credential leak in ff_layout_update_mirror_cred pnfs/flexfiles: report layoutstat regularly ...
2015-07-01nfs: Remove invalid tk_pid from debug messageKinglong Mee1-1/+1
Before rpc_run_task(), tk_pid is uninitiated as 0 always. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-25Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-blockLinus Torvalds1-1/+2
Pull cgroup writeback support from Jens Axboe: "This is the big pull request for adding cgroup writeback support. This code has been in development for a long time, and it has been simmering in for-next for a good chunk of this cycle too. This is one of those problems that has been talked about for at least half a decade, finally there's a solution and code to go with it. Also see last weeks writeup on LWN: http://lwn.net/Articles/648292/" * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits) writeback, blkio: add documentation for cgroup writeback support vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB writeback: do foreign inode detection iff cgroup writeback is enabled v9fs: fix error handling in v9fs_session_init() bdi: fix wrong error return value in cgwb_create() buffer: remove unusued 'ret' variable writeback: disassociate inodes from dying bdi_writebacks writeback: implement foreign cgroup inode bdi_writeback switching writeback: add lockdep annotation to inode_to_wb() writeback: use unlocked_inode_to_wb transaction in inode_congested() writeback: implement unlocked_inode_to_wb transaction and use it for stat updates writeback: implement [locked_]inode_to_wb_and_lock_list() writeback: implement foreign cgroup inode detection writeback: make writeback_control track the inode being written back writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb() mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use writeback: implement memcg writeback domain based throttling writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes writeback: implement memcg wb_domain writeback: update wb_over_bg_thresh() to use wb_domain aware operations ...
2015-06-22Merge branch 'bugfixes'Trond Myklebust1-0/+1
* bugfixes: NFS: Ensure we set NFS_CONTEXT_RESEND_WRITES when requeuing writes pNFS: Fix a memory leak when attempted pnfs fails NFS: Ensure that we update the sequence id under the slot table lock nfs: Initialize cb_sequenceres information before validate_seqid() nfs: Only update callback sequnce id when CB_SEQUENCE success NFSv4: nfs4_handle_delegation_recall_error should ignore EAGAIN
2015-06-17NFS: Ensure we set NFS_CONTEXT_RESEND_WRITES when requeuing writesTrond Myklebust1-0/+1
If a write attempt fails, and the write is queued up for resending to the server, as opposed to being dropped, then we need to set the appropriate flag so that nfs_file_fsync() does the right thing. Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-10NFS: Remove unused nfs_rw_ops->rw_release() functionAnna Schumaker1-6/+0
This was only ever set to nfs_writeback_release_common(), a function which is completely empty. Let's just drop this function pointer and simplify the code a bit. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-06-02writeback: move backing_dev_info->bdi_stat[] into bdi_writebackTejun Heo1-1/+2
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback) and the role of the separation is unclear. For cgroup support for writeback IOs, a bdi will be updated to host multiple wb's where each wb serves writeback IOs of a different cgroup on the bdi. To achieve that, a wb should carry all states necessary for servicing writeback IOs for a cgroup independently. This patch moves bdi->bdi_stat[] into wb. * enum bdi_stat_item is renamed to wb_stat_item and the prefix of all enums is changed from BDI_ to WB_. * BDI_STAT_BATCH() -> WB_STAT_BATCH() * [__]{add|inc|dec|sum}_wb_stat(bdi, ...) -> [__]{add|inc}_wb_stat(wb, ...) * bdi_stat[_error]() -> wb_stat[_error]() * bdi_writeout_inc() -> wb_writeout_inc() * stat init is moved to bdi_wb_init() and bdi_wb_exit() is added and frees stat. * As there's still only one bdi_writeback per backing_dev_info, all uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[] introducing no behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-13nfs: stat(2) fails during cthon04 basic test5 on NFSv4.0Chuck Lever1-5/+8
When running the Connectathon basic tests against a Solaris NFS server over NFSv4.0, test5 reports that stat(2) returns a file size of zero instead of 1MB. On success, nfs_commit_inode() can return a positive result; see other call sites such as nfs_file_fsync_commit() and nfs_commit_unstable_pages(). The call site recently added in nfs_wb_all() does not prevent that positive return value from leaking to its callers. If it leaks through nfs_sync_inode() back to nfs_getattr(), that causes stat(2) to return a positive return value to user space while also not filling in the passed-in struct stat. Additional clean up: the new logic in nfs_wb_all() is rewritten in bfields-normal form. Fixes: 5bb89b4702e2 ("NFSv4.1/pnfs: Separate out metadata . . .") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-26Merge tag 'nfs-for-4.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds1-8/+7
Pull NFS client updates from Trond Myklebust: "Another set of mainly bugfixes and a couple of cleanups. No new functionality in this round. Highlights include: Stable patches: - Fix a regression in /proc/self/mountstats - Fix the pNFS flexfiles O_DIRECT support - Fix high load average due to callback thread sleeping Bugfixes: - Various patches to fix the pNFS layoutcommit support - Do not cache pNFS deviceids unless server notifications are enabled - Fix a SUNRPC transport reconnection regression - make debugfs file creation failure non-fatal in SUNRPC - Another fix for circular directory warnings on NFSv4 "junctioned" mountpoints - Fix locking around NFSv4.2 fallocate() support - Truncating NFSv4 file opens should also sync O_DIRECT writes - Prevent infinite loop in rpcrdma_ep_create() Features: - Various improvements to the RDMA transport code's handling of memory registration - Various code cleanups" * tag 'nfs-for-4.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (55 commits) fs/nfs: fix new compiler warning about boolean in switch nfs: Remove unneeded casts in nfs NFS: Don't attempt to decode missing directory entries Revert "nfs: replace nfs_add_stats with nfs_inc_stats when add one" NFS: Rename idmap.c to nfs4idmap.c NFS: Move nfs_idmap.h into fs/nfs/ NFS: Remove CONFIG_NFS_V4 checks from nfs_idmap.h NFS: Add a stub for GETDEVICELIST nfs: remove WARN_ON_ONCE from nfs_direct_good_bytes nfs: fix DIO good bytes calculation nfs: Fetch MOUNTED_ON_FILEID when updating an inode sunrpc: make debugfs file creation failure non-fatal nfs: fix high load average due to callback thread sleeping NFS: Reduce time spent holding the i_mutex during fallocate() NFS: Don't zap caches on fallocate() xprtrdma: Make rpcrdma_{un}map_one() into inline functions xprtrdma: Handle non-SEND completions via a callout xprtrdma: Add "open" memreg op xprtrdma: Add "destroy MRs" memreg op xprtrdma: Add "reset MRs" memreg op ...
2015-04-26Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-4/+4
Pull fourth vfs update from Al Viro: "d_inode() annotations from David Howells (sat in for-next since before the beginning of merge window) + four assorted fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: RCU pathwalk breakage when running into a symlink overmounting something fix I_DIO_WAKEUP definition direct-io: only inc/dec inode->i_dio_count for file systems fs/9p: fix readdir() VFS: assorted d_backing_inode() annotations VFS: fs/inode.c helpers: d_inode() annotations VFS: fs/cachefiles: d_backing_inode() annotations VFS: fs library helpers: d_inode() annotations VFS: assorted weird filesystems: d_inode() annotations VFS: normal filesystems (and lustre): d_inode() annotations VFS: security/: d_inode() annotations VFS: security/: d_backing_inode() annotations VFS: net/: d_inode() annotations VFS: net/unix: d_backing_inode() annotations VFS: kernel/: d_inode() annotations VFS: audit: d_backing_inode() annotations VFS: Fix up some ->d_inode accesses in the chelsio driver VFS: Cachefiles should perform fs modifications on the top layer only VFS: AF_UNIX sockets should call mknod on the top layer only
2015-04-23Revert "nfs: replace nfs_add_stats with nfs_inc_stats when add one"Nicolas Iooss1-1/+1
This reverts commit 5a254d08b086d80cbead2ebcee6d2a4b3a15587a. Since commit 5a254d08b086 ("nfs: replace nfs_add_stats with nfs_inc_stats when add one"), nfs_readpage and nfs_do_writepage use nfs_inc_stats to increment NFSIOS_READPAGES and NFSIOS_WRITEPAGES instead of nfs_add_stats. However nfs_inc_stats does not do the same thing as nfs_add_stats with value 1 because these functions work on distinct stats: nfs_inc_stats increments stats from "enum nfs_stat_eventcounters" (in server->io_stats->events) and nfs_add_stats those from "enum nfs_stat_bytecounters" (in server->io_stats->bytes). Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Fixes: 5a254d08b086 ("nfs: replace nfs_add_stats with nfs_inc_stats...") Cc: stable@vger.kernel.org # 3.19+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-04-15VFS: normal filesystems (and lustre): d_inode() annotationsDavid Howells1-4/+4
that's the bulk of filesystem drivers dealing with inodes of their own Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-14page_writeback: clean up mess around cancel_dirty_page()Konstantin Khlebnikov1-5/+0
This patch replaces cancel_dirty_page() with a helper function account_page_cleaned() which only updates counters. It's called from truncate_complete_page() and from try_to_free_buffers() (hack for ext3). Page is locked in both cases, page-lock protects against concurrent dirtiers: see commit 2d6d7f982846 ("mm: protect set_page_dirty() from ongoing truncation"). Delete_from_page_cache() shouldn't be called for dirty pages, they must be handled by caller (either written or truncated). This patch treats final dirty accounting fixup at the end of __delete_from_page_cache() as a debug check and adds WARN_ON_ONCE() around it. If something removes dirty pages without proper handling that might be a bug and unwritten data might be lost. Hugetlbfs has no dirty pages accounting, ClearPageDirty() is enough here. cancel_dirty_page() in nfs_wb_page_cancel() is redundant. This is helper for nfs_invalidate_page() and it's called only in case complete invalidation. The mess was started in v2.6.20 after commits 46d2277c796f ("Clean up and make try_to_free_buffers() not race with dirty pages") and 3e67c0987d75 ("truncate: clear page dirtiness before running try_to_free_buffers()") first was reverted right in v2.6.20 in commit ecdfc9787fe5 ("Resurrect 'try_to_free_buffers()' VM hackery"), second in v2.6.25 commit a2b345642f53 ("Fix dirty page accounting leak with ext3 data=journal"). Custom fixes were introduced between these points. NFS in v2.6.23, commit 1b3b4a1a2deb ("NFS: Fix a write request leak in nfs_invalidate_page()"). Kludge in __delete_from_page_cache() in v2.6.24, commit 3a6927906f1b ("Do dirty page accounting when removing a page from the page cache"). Since v2.6.25 all of them are redundant. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Tejun Heo <tj@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-27NFSv4.1/pnfs: Separate out metadata and data consistency for pNFSTrond Myklebust1-7/+6
The LAYOUTCOMMIT operation means different things to different layout types. For blocks and objects, it is both a data and metadata consistency operation. For files and flexfiles, it is only a metadata consistency operation. This patch separates out the 2 cases, allowing the files/flexfiles layout drivers to optimise away the data consistency calls to layoutcommit. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-03-01NFS: Add attribute update barriers to NFS writebacksTrond Myklebust1-0/+30
Ensure that other operations that race with our write RPC calls cannot revert the file size updates that were made on the server. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Chuck Lever <chuck.lever@oracle.com>
2015-02-13nfs: Can call nfs_clear_page_commit() insteadTom Haynes1-5/+2
Signed-off-by: Tom Haynes <loghyr@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-13nfs: Provide and use helper functions for marking a page as unstableTom Haynes1-7/+2
Signed-off-by: Tom Haynes <loghyr@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-02-12Merge branch 'for-3.20/bdi' of git://git.kernel.dk/linux-blockLinus Torvalds1-3/+3
Pull backing device changes from Jens Axboe: "This contains a cleanup of how the backing device is handled, in preparation for a rework of the life time rules. In this part, the most important change is to split the unrelated nommu mmap flags from it, but also removing a backing_dev_info pointer from the address_space (and inode), and a cleanup of other various minor bits. Christoph did all the work here, I just fixed an oops with pages that have a swap backing. Arnd fixed a missing export, and Oleg killed the lustre backing_dev_info from staging. Last patch was from Al, unexporting parts that are now no longer needed outside" * 'for-3.20/bdi' of git://git.kernel.dk/linux-block: Make super_blocks and sb_lock static mtd: export new mtd_mmap_capabilities fs: make inode_to_bdi() handle NULL inode staging/lustre/llite: get rid of backing_dev_info fs: remove default_backing_dev_info fs: don't reassign dirty inodes to default_backing_dev_info nfs: don't call bdi_unregister ceph: remove call to bdi_unregister fs: remove mapping->backing_dev_info fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info nilfs2: set up s_bdi like the generic mount_bdev code block_dev: get bdev inode bdi directly from the block device block_dev: only write bdev inode on close fs: introduce f_op->mmap_capabilities for nommu mmap support fs: kill BDI_CAP_SWAP_BACKED fs: deduplicate noop_backing_dev_info
2015-02-11Merge tag 'nfs-for-3.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds1-19/+33
Pull NFS client updates from Trond Myklebust: "Highlights incluse: Features: - Removing the forced serialisation of open()/close() calls in NFSv4.x (x>0) makes for a significant performance improvement in metadata intensive workloads. - Full support for the pNFS "flexible files" layout type - Further RPC/RDMA client improvements from Chuck Bugfixes: - Stable fix: NFSv4.1 backchannel calls blocking operations with !TASK_RUNNING - Stable fix: pnfs_generic_pg_init_read/write can be called with lseg == NULL - Stable fix: Fix an Oopsable condition when nsm_mon_unmon is called as part of the namespace cleanup, - Stable fix: Ensure we reference the inode for return-on-close in delegreturn - Use SO_REUSEPORT to ensure that NFSv3 TCP connections can rebind to the same source address/port combination during a disconnect/ reconnect event. This is a requirement imposed by most NFSv3 server duplicate reply cache implementations. Optimisations: - Ask for no NFSv4.1 delegations on OPEN if using O_DIRECT Other: - Add Anna Schumaker as co-maintainer for the NFS client" * tag 'nfs-for-3.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (119 commits) SUNRPC: Cleanup to remove xs_tcp_close() pnfs: delete an unintended goto pnfs/flexfiles: Do not dprintk after the free SUNRPC: Fix stupid typo in xs_sock_set_reuseport SUNRPC: Define xs_tcp_fin_timeout only if CONFIG_SUNRPC_DEBUG SUNRPC: Handle connection reset more efficiently. SUNRPC: Remove the redundant XPRT_CONNECTION_CLOSE flag SUNRPC: Make xs_tcp_close() do a socket shutdown rather than a sock_release SUNRPC: Ensure xs_tcp_shutdown() requests a full close of the connection SUNRPC: Cleanup to remove remaining uses of XPRT_CONNECTION_ABORT SUNRPC: Remove TCP socket linger code SUNRPC: Remove TCP client connection reset hack SUNRPC: TCP/UDP always close the old socket before reconnecting SUNRPC: Add helpers to prevent socket create from racing SUNRPC: Ensure xs_reset_transport() resets the close connection flags SUNRPC: Do not clear the source port in xs_reset_transport SUNRPC: Handle EADDRINUSE on connect SUNRPC: Set SO_REUSEPORT socket option for TCP connections NFSv4.1: Fix pnfs_put_lseg races NFSv4.1: pnfs_send_layoutreturn should use GFP_NOFS ...
2015-02-03nfs: add mirroring support to pgio layerWeston Andros Adamson1-2/+8
This patch adds mirrored write support to the pgio layer. The default is to use one mirror, but pgio callers may define callbacks to change this to any value up to the (arbitrarily selected) limit of 16. The basic idea is to break out members of nfs_pageio_descriptor that cannot be shared between mirrored DSes and put them in a new structure. Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
2015-02-03pnfs: pass ds_commit_idx through the commit pathWeston Andros Adamson1-6/+8
Pass ds_commit_idx through the nfs commit path. It's used to select the commit bucket when using pnfs and is ignored when not using pnfs. Several functions had to be changed: nfs_retry_commit, nfs_mark_request_commit, pnfs_mark_request_commit and the pnfs layout driver .mark_request_commit functions. Signed-off-by: Tom Haynes <loghyr@primarydata.com>
2015-02-03nfs: handle overlapping reqs in lock_and_joinWeston Andros Adamson1-6/+11
This is needed for mirrored DS support, where multuple requests cover the same range. Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
2015-02-03nfs: allow different protocol in nfs_initiate_commitPeng Tao1-3/+4
pnfs flexfile layout client may want to use NFSv3 ops rather than the default MDS v4 ops. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
2015-02-03pnfs: Add nfs_rpc_ops in calls to nfs_initiate_pgioTom Haynes1-3/+3
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
2015-01-20fs: export inode_to_bdi and use it in favor of mapping->backing_dev_infoChristoph Hellwig1-3/+3
Now that we got rid of the bdi abuse on character devices we can always use sb->s_bdi to get at the backing_dev_info for a file, except for the block device special case. Export inode_to_bdi and replace uses of mapping->backing_dev_info with it to prepare for the removal of mapping->backing_dev_info. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-16locks: add a dedicated spinlock to protect i_flctx listsJeff Layton1-2/+2
We can now add a dedicated spinlock without expanding struct inode. Change to using that to protect the various i_flctx lists. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Acked-by: Christoph Hellwig <hch@lst.de>
2015-01-16locks: convert posix locks to file_lock_contextJeff Layton1-17/+13
Signed-off-by: Jeff Layton <jlayton@primarydata.com> Acked-by: Christoph Hellwig <hch@lst.de>
2015-01-16locks: move flock locks to file_lock_contextJeff Layton1-5/+38
Signed-off-by: Jeff Layton <jlayton@primarydata.com> Acked-by: Christoph Hellwig <hch@lst.de>
2014-11-24nfs: replace nfs_add_stats with nfs_inc_stats when add oneLi RongQing1-1/+1
Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-24NFS: fix subtle change in COMMIT behaviorWeston Andros Adamson1-5/+12
Recent work in the pgio layer made it possible for there to be more than one request per page. This caused a subtle change in commit behavior, because write.c:nfs_commit_unstable_pages compares the number of *pages* waiting for writeback against the number of requests on a commit list to choose when to send a COMMIT in a non-blocking flush. This is probably hard to hit in normal operation - you have to be using rsize/wsize < PAGE_SIZE, or pnfs with lots of boundaries that are not page aligned to have a noticeable change in behavior. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-11-12nfs: remove spurious WARN_ON_ONCE in write pathWeston Andros Adamson1-2/+0
This WARN_ON_ONCE was supposed to catch reference counting bugs, but can trigger in inappropriate situations. This was reproducible using NFSv2 on an architecture with 64K pages -- we verified that it was not a reference counting bug and the warning was safe to ignore. Reported-by: Will Deacon <will.deacon@arm.com> Tested-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25NFS: avoid waiting at all in nfs_release_page when congested.NeilBrown1-0/+5
If nfs_release_page() is called on a sequence of pages which are all in the same file which is blocked on COMMIT, each page could contribute a 1 second delay which could be come excessive. I have seen delays of as much as 208 seconds. To keep the delay to one second, mark the bdi as write-congested if the commit didn't finished. Once it does finish, the write-congested flag will be cleared by nfs_commit_release_pages(). With this, the longest total delay in try_to_free_pages that I have seen is under 3 seconds. With no waiting in nfs_release_page at all I have seen delays of nearly 1.5 seconds. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25NFS: avoid deadlocks with loop-back mounted NFS filesystems.NeilBrown1-0/+2
Support for loop-back mounted NFS filesystems is useful when NFS is used to access shared storage in a high-availability cluster. If the node running the NFS server fails, some other node can mount the filesystem and start providing NFS service. If that node already had the filesystem NFS mounted, it will now have it loop-back mounted. nfsd can suffer a deadlock when allocating memory and entering direct reclaim. While direct reclaim does not write to the NFS filesystem it can send and wait for a COMMIT through nfs_release_page(). This patch modifies nfs_release_page() to wait a limited time for the commit to complete - one second. If the commit doesn't complete in this time, nfs_release_page() will fail. This means it might now fail in some cases where it wouldn't before. These cases are only when 'gfp' includes '__GFP_WAIT'. nfs_release_page() is only called by try_to_release_page(), and that can only be called on an NFS page with required 'gfp' flags from - page_cache_pipe_buf_steal() in splice.c - shrink_page_list() in vmscan.c - invalidate_inode_pages2_range() in truncate.c The first two handle failure quite safely. The last is only called after ->launder_page() has been called, and that will have waited for the commit to finish already. So aborting if the commit takes longer than 1 second is perfectly safe. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24NFS: don't use STABLE writes during writeback.NeilBrown1-2/+5
commit b31268ac793fd300da66b9c28bbf0a200339ab96 FS: Use stable writes when not doing a bulk flush was a bit heavy handed. The particular problem that lead to this patch was that small writes to an O_SYNC file we being written as UNSTABLE writes followed by a commit. This is appropriate for large writes (which require multiple NFS requests) but for small writes (single NFS request), using NFS_FILE_SYNC is more efficient. So that patch causes the code to select between the two methods depending on how many nfs requests get generated. Unfortunately this ends up applying to non O_SYNC writes as well. In particular if you memory-map a file and update random pages, then when they are eventually written out by writeback they will go as NFS_FILE_SYNC. This is inefficient and slows down the application. So: only set FLUSH_COND_STABLE when wbc->sync_mode is WB_SYNC_ALL. With this patch: O_SYNC writes are NFS_FILE_SYNC for single requests, and NFS_UNSTABLE followed by COMMIT for multiple requests Writing immediately before close of fsync follow the same pattern. Non-O_SYNC writes without an fsync of close eventually get flushed out as UNSTABLE and a commit follows eventually as appropriate. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12NFS: Unconditionally enable commit codeAnna Schumaker1-61/+0
The goal is to create a generic NFS module with code that does not depend on what versions of NFS are enabled. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10NFS: Fix a compile warning when !(CONFIG_NFS_V3 || CONFIG_NFS_V4)Trond Myklebust1-32/+42
gcc reports: linux/fs/nfs/write.c: In function ‘nfs_page_find_head_request_locked.isra.17’: linux/fs/nfs/write.c:121:64: warning: ‘cinfo.mds’ may be used uninitialized in this function [-Wmaybe-uninitialized] list_for_each_entry_safe(freq, t, &cinfo.mds->list, wb_list) { ^ linux/fs/nfs/write.c:110:25: note: ‘cinfo.mds’ was declared here struct nfs_commit_info cinfo; Reported-by: Anna Schumaker <Anna.Schumaker@netapp.com> Cc: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10nfs41: add a helper function to set layoutcommit after commitPeng Tao1-0/+15
Track lwb in nfs_commit_data so that we can use it to setup layoutcommit in commit_done callback. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-22nfs: don't sleep with inode lock in lock_and_join_requestsWeston Andros Adamson1-1/+11
This handles the 'nonblock=false' case in nfs_lock_and_join_requests. If the group is already locked and blocking is allowed, drop the inode lock and wait for the group lock to be cleared before trying it all again. This should fix warnings found in peterz's tree (sched/wait branch), where might_sleep() checks are added to wait.[ch]. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Reviewed-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-22nfs: fix error handling in lock_and_join_requestsWeston Andros Adamson1-1/+4
This fixes handling of errors from nfs_page_group_lock in nfs_lock_and_join_requests. It now releases the inode lock and the reference to the head request. Reported-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Reviewed-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-22nfs: change nfs_page_group_lock argumentWeston Andros Adamson1-2/+2
Flip the meaning of the second argument from 'wait' to 'nonblock' to match related functions. Update all five calls to reflect this change. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Reviewed-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-13Merge tag 'nfs-for-3.17-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds1-71/+79
Pull NFS client updates from Trond Myklebust: "Highlights include: - stable fix for a bug in nfs3_list_one_acl() - speed up NFS path walks by supporting LOOKUP_RCU - more read/write code cleanups - pNFS fixes for layout return on close - fixes for the RCU handling in the rpcsec_gss code - more NFS/RDMA fixes" * tag 'nfs-for-3.17-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (79 commits) nfs: reject changes to resvport and sharecache during remount NFS: Avoid infinite loop when RELEASE_LOCKOWNER getting expired error SUNRPC: remove all refcounting of groupinfo from rpcauth_lookupcred NFS: fix two problems in lookup_revalidate in RCU-walk NFS: allow lockless access to access_cache NFS: teach nfs_lookup_verify_inode to handle LOOKUP_RCU NFS: teach nfs_neg_need_reval to understand LOOKUP_RCU NFS: support RCU_WALK in nfs_permission() sunrpc/auth: allow lockless (rcu) lookup of credential cache. NFS: prepare for RCU-walk support but pushing tests later in code. NFS: nfs4_lookup_revalidate: only evaluate parent if it will be used. NFS: add checks for returned value of try_module_get() nfs: clear_request_commit while holding i_lock pnfs: add pnfs_put_lseg_async pnfs: find swapped pages on pnfs commit lists too nfs: fix comment and add warn_on for PG_INODE_REF nfs: check wait_on_bit_lock err in page_group_lock sunrpc: remove "ec" argument from encrypt_v2 operation sunrpc: clean up sparse endianness warnings in gss_krb5_wrap.c sunrpc: clean up sparse endianness warnings in gss_krb5_seal.c ...
2014-08-03nfs: clear_request_commit while holding i_lockWeston Andros Adamson1-11/+4
Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>