aboutsummaryrefslogtreecommitdiffstats
path: root/fs (follow)
AgeCommit message (Collapse)AuthorFilesLines
2014-10-07f2fs: support volatile operations for transient dataJaegeuk Kim4-2/+25
This patch adds support for volatile writes which keep data pages in memory until f2fs_evict_inode is called by iput. For instance, we can use this feature for the sqlite database as follows. While supporting atomic writes for main database file, we can keep its journal data temporarily in the page cache by the following sequence. 1. open -> ioctl(F2FS_IOC_START_VOLATILE_WRITE); 2. writes : keep all the data in the page cache. 3. flush to the database file with atomic writes a. ioctl(F2FS_IOC_START_ATOMIC_WRITE); b. writes c. ioctl(F2FS_IOC_COMMIT_ATOMIC_WRITE); 4. close -> drop the cached data Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-10-06f2fs: support atomic writesJaegeuk Kim8-5/+139
This patch introduces a very limited functionality for atomic write support. In order to support atomic write, this patch adds two ioctls: o F2FS_IOC_START_ATOMIC_WRITE o F2FS_IOC_COMMIT_ATOMIC_WRITE The database engine should be aware of the following sequence. 1. open -> ioctl(F2FS_IOC_START_ATOMIC_WRITE); 2. writes : all the written data will be treated as atomic pages. 3. commit -> ioctl(F2FS_IOC_COMMIT_ATOMIC_WRITE); : this flushes all the data blocks to the disk, which will be shown all or nothing by f2fs recovery procedure. 4. repeat to #2. The IO pattens should be: ,- START_ATOMIC_WRITE ,- COMMIT_ATOMIC_WRITE CP | D D D D D D | FSYNC | D D D D | FSYNC ... `- COMMIT_ATOMIC_WRITE Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-10-05f2fs: remove unused return valueJaegeuk Kim1-3/+2
Don't return any value without any usage. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-10-03Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds3-8/+4
Pull cifs/smb3 fixes from Steve French: "Fix for CIFS/SMB3 oops on reconnect during readpages (3.17 regression) and for incorrectly closing file handle in symlink error cases" * 'for-linus' of git://git.samba.org/sfrench/cifs-2.6: CIFS: Fix readpages retrying on reconnects Fix problem recognizing symlinks
2014-10-02ocfs2/dlm: should put mle when goto kill in dlm_assert_master_handleralex chen1-0/+4
In dlm_assert_master_handler, the mle is get in dlm_find_mle, should be put when goto kill, otherwise, this mle will never be released. Signed-off-by: Alex Chen <alex.chen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: joyce.xue <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-02CIFS: Fix readpages retrying on reconnectsPavel Shilovsky1-7/+1
If we got a reconnect error from async readv we re-add pages back to page_list and continue loop. That is wrong because these pages have been already added to the pagecache but page_list has pages that have not been added to the pagecache yet. This ends up with a general protection fault in put_pages after readpages. Fix it by not retrying the read of these pages and falling back to readpage instead. Fixes debian bug 762306 Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com> Tested-by: Arthur Marsh <arthur.marsh@internode.on.net>
2014-10-02Fix problem recognizing symlinksSteve French2-1/+3
Changeset eb85d94bd introduced a problem where if a cifs open fails during query info of a file we will still try to close the file (happens with certain types of reparse points) even though the file handle is not valid. In addition for SMB2/SMB3 we were not mapping the return code returned by Windows when trying to open a file (like a Windows NFS symlink) which is a reparse point. Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org> CC: stable <stable@vger.kernel.org> #v3.13+
2014-10-01nfsd: eliminate "to_delegation" defineJeff Layton3-7/+4
We now have cb_to_delegation and to_delegation, which do the same thing and are defined separately in different .c files. Move the cb_to_delegation definition into a header file and eliminate the redundant to_delegation definition. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-30f2fs: clean up f2fs_ioctl functionsJaegeuk Kim1-63/+75
This patch cleans up f2fs_ioctl functions for better readability. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-30f2fs: potential shift wrapping buf in f2fs_trim_fs()Dan Carpenter1-1/+1
My static checker complains that segment is a u64 but only the lower 31 bits can be used before we hit a shift wrapping bug. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-30f2fs: call f2fs_unlock_op after error was handledJaegeuk Kim3-21/+35
This patch relocates f2fs_unlock_op in every directory operations to be called after any error was processed. Otherwise, the checkpoint can be entered with valid node ids without its dentry when -ENOSPC is occurred. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-30f2fs: check the use of macros on block counts and addressesJaegeuk Kim6-103/+81
This patch cleans up the existing and new macros for readability. Rule is like this. ,-----------------------------------------> MAX_BLKADDR -, | ,------------- TOTAL_BLKS ----------------------------, | | | | ,- seg0_blkaddr ,----- sit/nat/ssa/main blkaddress | block | | (SEG0_BLKADDR) | | | | (e.g., MAIN_BLKADDR) | address 0..x................ a b c d ............................. | | global seg# 0...................... m ............................. | | | | `------- MAIN_SEGS -----------' `-------------- TOTAL_SEGS ---------------------------' | | seg# 0..........xx.................. = Note = o GET_SEGNO_FROM_SEG0 : blk address -> global segno o GET_SEGNO : blk address -> segno o START_BLOCK : segno -> starting block address Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-30f2fs: refactor flush_nat_entries to remove costly reorganizing opsJaegeuk Kim3-159/+162
Previously, f2fs tries to reorganize the dirty nat entries into multiple sets according to its nid ranges. This can improve the flushing nat pages, however, if there are a lot of cached nat entries, it becomes a bottleneck. This patch introduces a new set management flow by removing dirty nat list and adding a series of set operations when the nat entry becomes dirty. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-30f2fs: introduce FITRIM in f2fs_ioctlJaegeuk Kim5-13/+134
This patch introduces FITRIM in f2fs_ioctl. In this case, f2fs will issue small discards and prefree discards as many as possible for the given area. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-30f2fs: introduce cp_control structureJaegeuk Kim5-15/+37
This patch add a new data structure to control checkpoint parameters. Currently, it presents the reason of checkpoint such as is_umount and normal sync. Reviewed-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-30Merge branch 'bugfixes' into linux-nextTrond Myklebust6-53/+61
* bugfixes: NFSv4.1: Fix an NFSv4.1 state renewal regression NFSv4: fix open/lock state recovery error handling NFSv4: Fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails NFS: Fabricate fscache server index key correctly SUNRPC: Add missing support for RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT nfs: fix duplicate proc entries
2014-09-30NFSv4.1: Fix an NFSv4.1 state renewal regressionAndy Adamson2-3/+11
Commit 2f60ea6b8ced ("NFSv4: The NFSv4.0 client must send RENEW calls if it holds a delegation") set the NFS4_RENEW_TIMEOUT flag in nfs4_renew_state, and does not put an nfs41_proc_async_sequence call, the NFSv4.1 lease renewal heartbeat call, on the wire to renew the NFSv4.1 state if the flag was not set. The NFS4_RENEW_TIMEOUT flag is set when "now" is after the last renewal (cl_last_renewal) plus the lease time divided by 3. This is arbitrary and sometimes does the following: In normal operation, the only way a future state renewal call is put on the wire is via a call to nfs4_schedule_state_renewal, which schedules a nfs4_renew_state workqueue task. nfs4_renew_state determines if the NFS4_RENEW_TIMEOUT should be set, and the calls nfs41_proc_async_sequence, which only gets sent if the NFS4_RENEW_TIMEOUT flag is set. Then the nfs41_proc_async_sequence rpc_release function schedules another state remewal via nfs4_schedule_state_renewal. Without this change we can get into a state where an application stops accessing the NFSv4.1 share, state renewal calls stop due to the NFS4_RENEW_TIMEOUT flag _not_ being set. The only way to recover from this situation is with a clientid re-establishment, once the application resumes and the server has timed out the lease and so returns NFS4ERR_BAD_SESSION on the subsequent SEQUENCE operation. An example application: open, lock, write a file. sleep for 6 * lease (could be less) ulock, close. In the above example with NFSv4.1 delegations enabled, without this change, there are no OP_SEQUENCE state renewal calls during the sleep, and the clientid is recovered due to lease expiration on the close. This issue does not occur with NFSv4.1 delegations disabled, nor with NFSv4.0, with or without delegations enabled. Signed-off-by: Andy Adamson <andros@netapp.com> Link: http://lkml.kernel.org/r/1411486536-23401-1-git-send-email-andros@netapp.com Fixes: 2f60ea6b8ced (NFSv4: The NFSv4.0 client must send RENEW calls...) Cc: stable@vger.kernel.org # 3.2.x Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-30nfsd4: fix corruption of NFSv4 read dataJ. Bruce Fields1-1/+2
The calculation of page_ptr here is wrong in the case the read doesn't start at an offset that is a multiple of a page. The result is that nfs4svc_encode_compoundres sets rq_next_page to a value one too small, and then the loop in svc_free_res_pages may incorrectly fail to clear a page pointer in rq_respages[]. Pages left in rq_respages[] are available for the next rpc request to use, so xdr data may be written to that page, which may hold data still waiting to be transmitted to the client or data in the page cache. The observed result was silent data corruption seen on an NFSv4 client. We tag this as "fixing" 05638dc73af2 because that commit exposed this bug, though the incorrect calculation predates it. Particular thanks to Andrea Arcangeli and David Gilbert for analysis and testing. Fixes: 05638dc73af2 "nfsd4: simplify server xdr->next_page use" Cc: stable@vger.kernel.org Reported-by: Andrea Arcangeli <aarcange@redhat.com> Tested-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-29NFSD: Implement SEEKAnna Schumaker3-2/+97
This patch adds server support for the NFS v4.2 operation SEEK, which returns the position of the next hole or data segment in a file. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-29NFSD: Add generic v4.2 infrastructureAnna Schumaker1-0/+28
It's cleaner to introduce everything at once and have the server reply with "not supported" than it would be to introduce extra operations when implementing a specific one in the middle of the list. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-28NFSv4: fix open/lock state recovery error handlingTrond Myklebust1-10/+6
The current open/lock state recovery unfortunately does not handle errors such as NFS4ERR_CONN_NOT_BOUND_TO_SESSION correctly. Instead of looping, just proceeds as if the state manager is finished recovering. This patch ensures that we loop back, handle higher priority errors and complete the open/lock state recovery. Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-28NFSv4: Fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM failsTrond Myklebust1-1/+0
If a NFSv4.x server returns NFS4ERR_STALE_CLIENTID in response to a CREATE_SESSION or SETCLIENTID_CONFIRM in order to tell us that it rebooted a second time, then the client will currently take this to mean that it must declare all locks to be stale, and hence ineligible for reboot recovery. RFC3530 and RFC5661 both suggest that the client should instead rely on the server to respond to inelegible open share, lock and delegation reclaim requests with NFS4ERR_NO_GRACE in this situation. Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-27Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds5-78/+47
Pull vfs fixes from Al Viro: "Assorted fixes + unifying __d_move() and __d_materialise_dentry() + minimal regression fix for d_path() of victims of overwriting rename() ported on top of that" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: Don't exchange "short" filenames unconditionally. fold swapping ->d_name.hash into switch_names() fold unlocking the children into dentry_unlock_parents_for_move() kill __d_materialise_dentry() __d_materialise_dentry(): flip the order of arguments __d_move(): fold manipulations with ->d_child/->d_subdirs don't open-code d_rehash() in d_materialise_unique() pull rehashing and unlocking the target dentry into __d_materialise_dentry() ufs: deal with nfsd/iget races fuse: honour max_read and max_write in direct_io mode shmem: fix nlink for rename overwrite directory
2014-09-27vfs: Don't exchange "short" filenames unconditionally.Mikhail Efremov1-9/+18
Only exchange source and destination filenames if flags contain RENAME_EXCHANGE. In case if executable file was running and replaced by other file /proc/PID/exe should still show correct file name, not the old name of the file by which it was replaced. The scenario when this bug manifests itself was like this: * ALT Linux uses rpm and start-stop-daemon; * during a package upgrade rpm creates a temporary file for an executable to rename it upon successful unpacking; * start-stop-daemon is run subsequently and it obtains the (nonexistant) temporary filename via /proc/PID/exe thus failing to identify the running process. Note that "long" filenames (> DNAiME_INLINE_LEN) are still exchanged without RENAME_EXCHANGE and this behaviour exists long enough (should be fixed too apparently). So this patch is just an interim workaround that restores behavior for "short" names as it was before changes introduced by commit da1ce0670c14 ("vfs: add cross-rename"). See https://lkml.org/lkml/2014/9/7/6 for details. AV: the comments about being more careful with ->d_name.hash than with ->d_name.name are from back in 2.3.40s; they became obsolete by 2.3.60s, when we started to unhash the target instead of swapping hash chain positions followed by d_delete() as we used to do when dcache was first introduced. Acked-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: stable@vger.kernel.org Fixes: da1ce0670c14 "vfs: add cross-rename" Signed-off-by: Mikhail Efremov <sem@altlinux.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-27fold swapping ->d_name.hash into switch_names()Linus Torvalds1-2/+1
and do it along with ->d_name.len there Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26fold unlocking the children into dentry_unlock_parents_for_move()Al Viro1-5/+4
... renaming it into dentry_unlock_for_move() and making it more symmetric with dentry_lock_for_move(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26kill __d_materialise_dentry()Al Viro1-44/+10
it folds into __d_move() now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26__d_materialise_dentry(): flip the order of argumentsAl Viro1-24/+20
... thus making it much closer to (now unreachable, BTW) IS_ROOT(dentry) case in __d_move(). A bit more and it'll fold in. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26__d_move(): fold manipulations with ->d_child/->d_subdirsAl Viro1-5/+3
list_del() + list_add() is a slightly pessimised list_move() list_del() + INIT_LIST_HEAD() is a slightly pessimised list_del_init() Interleaving those makes the resulting code even worse. And harder to follow... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26don't open-code d_rehash() in d_materialise_unique()Al Viro1-5/+1
... and get rid of duplicate BUG_ON() there Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26pull rehashing and unlocking the target dentry into __d_materialise_dentry()Al Viro1-7/+4
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26ufs: deal with nfsd/iget racesAl Viro2-1/+9
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26fuse: honour max_read and max_write in direct_io modeMiklos Szeredi2-1/+2
The third argument of fuse_get_user_pages() "nbytesp" refers to the number of bytes a caller asked to pack into fuse request. This value may be lesser than capacity of fuse request or iov_iter. So fuse_get_user_pages() must ensure that *nbytesp won't grow. Now, when helper iov_iter_get_pages() performs all hard work of extracting pages from iov_iter, it can be done by passing properly calculated "maxsize" to the helper. The other caller of iov_iter_get_pages() (dio_refill_pages()) doesn't need this capability, so pass LONG_MAX as the maxsize argument here. Fixes: c9c37e2e6378 ("fuse: switch to iov_iter_get_pages()") Reported-by: Werner Baumann <werner.baumann@onlinehome.de> Tested-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26nfsd: introduce nfsd4_callback_opsChristoph Hellwig3-71/+83
Add a higher level abstraction than the rpc_ops for callback operations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26nfsd: split nfsd4_callback initialization and useChristoph Hellwig3-12/+13
Split out initializing the nfs4_callback structure from using it. For the NULL callback this gets rid of tons of pointless re-initializations. Note that I don't quite understand what protects us from running multiple NULL callbacks at the same time, but at least this chance doesn't make it worse.. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26nfsd: introduce a generic nfsd4_cbChristoph Hellwig3-35/+22
Add a helper to queue up a callback. CB_NULL has a bit of special casing because it is special in the specification, but all other new callback operations will be able to share code with this and a few more changes to refactor the callback code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26nfsd: remove nfsd4_callback.cb_opChristoph Hellwig2-9/+9
We can always get at the private data by using container_of, no need for a void pointer. Also introduce a little to_delegation helper to avoid opencoding the container_of everywhere. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26nfsd: do not clear rpc_resp in nfsd4_cb_done_sequenceBenny Halevy1-3/+0
This is incorrect when a callback is has to be restarted, in which case the XDR decoding of the second iteration will see a NULL cb argument. [hch: updated description] Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26nfsd: fix nfsd4_cb_recall_done error handlingChristoph Hellwig1-10/+7
For any error that is not EBADHANDLE or NFS4ERR_BAD_STATEID, nfsd4_cb_recall_done first marks the connection down, then retries until dl_retries hits zero, then marks the connection down again and sets cb_done. This changes the code to only retry for EBADHANDLE or NFS4ERR_BAD_STATEID, and factors setting cb_done into a single point in the function. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26fs/cachefiles: add missing \n to kerror conversionsFabian Frederick6-33/+33
Commit 0227d6abb378 ("fs/cachefiles: replace kerror by pr_err") didn't include newline featuring in original kerror definition Signed-off-by: Fabian Frederick <fabf@skynet.be> Reported-by: David Howells <dhowells@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Cc: <stable@vger.kernel.org> [3.16.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26mm: softdirty: addresses before VMAs in PTE holes aren't softdirtyPeter Feiner1-9/+18
In PTE holes that contain VM_SOFTDIRTY VMAs, unmapped addresses before VM_SOFTDIRTY VMAs are reported as softdirty by /proc/pid/pagemap. This bug was introduced in commit 68b5a6524856 ("mm: softdirty: respect VM_SOFTDIRTY in PTE holes"). That commit made /proc/pid/pagemap look at VM_SOFTDIRTY in PTE holes but neglected to observe the start of VMAs returned by find_vma. Tested: Wrote a selftest that creates a PMD-sized VMA then unmaps the first page and asserts that the page is not softdirty. I'm going to send the pagemap selftest in a later commit. Signed-off-by: Peter Feiner <pfeiner@google.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Jamie Liu <jamieliu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26ocfs2/dlm: do not get resource spinlock if lockres is newJoseph Qi1-8/+10
There is a deadlock case which reported by Guozhonghua: https://oss.oracle.com/pipermail/ocfs2-devel/2014-September/010079.html This case is caused by &res->spinlock and &dlm->master_lock misordering in different threads. It was introduced by commit 8d400b81cc83 ("ocfs2/dlm: Clean up refmap helpers"). Since lockres is new, it doesn't not require the &res->spinlock. So remove it. Fixes: 8d400b81cc83 ("ocfs2/dlm: Clean up refmap helpers") Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: joyce.xue <xuejiufei@huawei.com> Reported-by: Guozhonghua <guozhonghua@h3c.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26nilfs2: fix data loss with mmap()Andreas Rohner1-1/+6
This bug leads to reproducible silent data loss, despite the use of msync(), sync() and a clean unmount of the file system. It is easily reproducible with the following script: ----------------[BEGIN SCRIPT]-------------------- mkfs.nilfs2 -f /dev/sdb mount /dev/sdb /mnt dd if=/dev/zero bs=1M count=30 of=/mnt/testfile umount /mnt mount /dev/sdb /mnt CHECKSUM_BEFORE="$(md5sum /mnt/testfile)" /root/mmaptest/mmaptest /mnt/testfile 30 10 5 sync CHECKSUM_AFTER="$(md5sum /mnt/testfile)" umount /mnt mount /dev/sdb /mnt CHECKSUM_AFTER_REMOUNT="$(md5sum /mnt/testfile)" umount /mnt echo "BEFORE MMAP:\t$CHECKSUM_BEFORE" echo "AFTER MMAP:\t$CHECKSUM_AFTER" echo "AFTER REMOUNT:\t$CHECKSUM_AFTER_REMOUNT" ----------------[END SCRIPT]-------------------- The mmaptest tool looks something like this (very simplified, with error checking removed): ----------------[BEGIN mmaptest]-------------------- data = mmap(NULL, file_size - file_offset, PROT_READ | PROT_WRITE, MAP_SHARED, fd, file_offset); for (i = 0; i < write_count; ++i) { memcpy(data + i * 4096, buf, sizeof(buf)); msync(data, file_size - file_offset, MS_SYNC)) } ----------------[END mmaptest]-------------------- The output of the script looks something like this: BEFORE MMAP: 281ed1d5ae50e8419f9b978aab16de83 /mnt/testfile AFTER MMAP: 6604a1c31f10780331a6850371b3a313 /mnt/testfile AFTER REMOUNT: 281ed1d5ae50e8419f9b978aab16de83 /mnt/testfile So it is clear, that the changes done using mmap() do not survive a remount. This can be reproduced a 100% of the time. The problem was introduced in commit 136e8770cd5d ("nilfs2: fix issue of nilfs_set_page_dirty() for page at EOF boundary"). If the page was read with mpage_readpage() or mpage_readpages() for example, then it has no buffers attached to it. In that case page_has_buffers(page) in nilfs_set_page_dirty() will be false. Therefore nilfs_set_file_dirty() is never called and the pages are never collected and never written to disk. This patch fixes the problem by also calling nilfs_set_file_dirty() if the page has no buffers attached to it. [akpm@linux-foundation.org: s/PAGE_SHIFT/PAGE_CACHE_SHIFT/] Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net> Tested-by: Andreas Rohner <andreas.rohner@gmx.net> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26ocfs2: free vol_label in ocfs2_delete_osb()Joseph Qi1-0/+1
osb->vol_label is malloced in ocfs2_initialize_super but not freed if error occurs or during umount, thus causing a memory leak. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: joyce.xue <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-25NFS: Fabricate fscache server index key correctlyDavid Howells1-2/+1
When fabricating a server index key for fscache, we should clear the index key buffer before starting to fill it in, not in the middle. Reported-by: James Pearson <james-p@moving-picture.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25NFSv3: Fix missing includes of nfs3_fs.hTrond Myklebust2-0/+2
Silence a few warnings about missing symbols that are due to missing includes of nfs3_fs.h. Fixes: 00a36a1090350 (NFS: Move v3 declarations out of internal.h) Fixes: cb8c20fa53ec2 (NFS: Move NFS v3 acl functions to nfs3_fs.h) Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()NeilBrown1-8/+6
Now that nfs_release_page() doesn't block indefinitely, other deadlock avoidance mechanisms aren't needed. - it doesn't hurt for kswapd to block occasionally. If it doesn't want to block it would clear __GFP_WAIT. The current_is_kswapd() was only added to avoid deadlocks and we have a new approach for that. - memory allocation in the SUNRPC layer can very rarely try to ->releasepage() a page it is trying to handle. The deadlock is removed as nfs_release_page() doesn't block indefinitely. So we don't need to set PF_FSTRANS for sunrpc network operations any more. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25NFS: avoid waiting at all in nfs_release_page when congested.NeilBrown2-2/+12
If nfs_release_page() is called on a sequence of pages which are all in the same file which is blocked on COMMIT, each page could contribute a 1 second delay which could be come excessive. I have seen delays of as much as 208 seconds. To keep the delay to one second, mark the bdi as write-congested if the commit didn't finished. Once it does finish, the write-congested flag will be cleared by nfs_commit_release_pages(). With this, the longest total delay in try_to_free_pages that I have seen is under 3 seconds. With no waiting in nfs_release_page at all I have seen delays of nearly 1.5 seconds. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25NFS: avoid deadlocks with loop-back mounted NFS filesystems.NeilBrown2-10/+18
Support for loop-back mounted NFS filesystems is useful when NFS is used to access shared storage in a high-availability cluster. If the node running the NFS server fails, some other node can mount the filesystem and start providing NFS service. If that node already had the filesystem NFS mounted, it will now have it loop-back mounted. nfsd can suffer a deadlock when allocating memory and entering direct reclaim. While direct reclaim does not write to the NFS filesystem it can send and wait for a COMMIT through nfs_release_page(). This patch modifies nfs_release_page() to wait a limited time for the commit to complete - one second. If the commit doesn't complete in this time, nfs_release_page() will fail. This means it might now fail in some cases where it wouldn't before. These cases are only when 'gfp' includes '__GFP_WAIT'. nfs_release_page() is only called by try_to_release_page(), and that can only be called on an NFS page with required 'gfp' flags from - page_cache_pipe_buf_steal() in splice.c - shrink_page_list() in vmscan.c - invalidate_inode_pages2_range() in truncate.c The first two handle failure quite safely. The last is only called after ->launder_page() has been called, and that will have waited for the commit to finish already. So aborting if the commit takes longer than 1 second is perfectly safe. Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24NFS: don't use STABLE writes during writeback.NeilBrown1-2/+5
commit b31268ac793fd300da66b9c28bbf0a200339ab96 FS: Use stable writes when not doing a bulk flush was a bit heavy handed. The particular problem that lead to this patch was that small writes to an O_SYNC file we being written as UNSTABLE writes followed by a commit. This is appropriate for large writes (which require multiple NFS requests) but for small writes (single NFS request), using NFS_FILE_SYNC is more efficient. So that patch causes the code to select between the two methods depending on how many nfs requests get generated. Unfortunately this ends up applying to non O_SYNC writes as well. In particular if you memory-map a file and update random pages, then when they are eventually written out by writeback they will go as NFS_FILE_SYNC. This is inefficient and slows down the application. So: only set FLUSH_COND_STABLE when wbc->sync_mode is WB_SYNC_ALL. With this patch: O_SYNC writes are NFS_FILE_SYNC for single requests, and NFS_UNSTABLE followed by COMMIT for multiple requests Writing immediately before close of fsync follow the same pattern. Non-O_SYNC writes without an fsync of close eventually get flushed out as UNSTABLE and a commit follows eventually as appropriate. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>