aboutsummaryrefslogtreecommitdiffstats
path: root/include/linux/fs.h (follow)
AgeCommit message (Collapse)AuthorFilesLines
2014-12-16Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-1/+0
Pull vfs pile #2 from Al Viro: "Next pile (and there'll be one or two more). The large piece in this one is getting rid of /proc/*/ns/* weirdness; among other things, it allows to (finally) make nameidata completely opaque outside of fs/namei.c, making for easier further cleanups in there" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: coda_venus_readdir(): use file_inode() fs/namei.c: fold link_path_walk() call into path_init() path_init(): don't bother with LOOKUP_PARENT in argument fs/namei.c: new helper (path_cleanup()) path_init(): store the "base" pointer to file in nameidata itself make default ->i_fop have ->open() fail with ENXIO make nameidata completely opaque outside of fs/namei.c kill proc_ns completely take the targets of /proc/*/ns/* symlinks to separate fs bury struct proc_ns in fs/proc copy address of proc_ns_ops into ns_common new helpers: ns_alloc_inum/ns_free_inum make proc_ns_operations work with struct ns_common * instead of void * switch the rest of proc_ns_operations to working with &...->ns netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns common object embedded into various struct ....ns
2014-12-16Merge branch 'for-3.19' of git://linux-nfs.org/~bfields/linuxLinus Torvalds1-1/+1
Pull nfsd updates from Bruce Fields: "A comparatively quieter cycle for nfsd this time, but still with two larger changes: - RPC server scalability improvements from Jeff Layton (using RCU instead of a spinlock to find idle threads). - server-side NFSv4.2 ALLOCATE/DEALLOCATE support from Anna Schumaker, enabling fallocate on new clients" * 'for-3.19' of git://linux-nfs.org/~bfields/linux: (32 commits) nfsd4: fix xdr4 count of server in fs_location4 nfsd4: fix xdr4 inclusion of escaped char sunrpc/cache: convert to use string_escape_str() sunrpc: only call test_bit once in svc_xprt_received fs: nfsd: Fix signedness bug in compare_blob sunrpc: add some tracepoints around enqueue and dequeue of svc_xprt sunrpc: convert to lockless lookup of queued server threads sunrpc: fix potential races in pool_stats collection sunrpc: add a rcu_head to svc_rqst and use kfree_rcu to free it sunrpc: require svc_create callers to pass in meaningful shutdown routine sunrpc: have svc_wake_up only deal with pool 0 sunrpc: convert sp_task_pending flag to use atomic bitops sunrpc: move rq_cachetype field to better optimize space sunrpc: move rq_splice_ok flag into rq_flags sunrpc: move rq_dropme flag into rq_flags sunrpc: move rq_usedeferral flag to rq_flags sunrpc: move rq_local field to rq_flags sunrpc: add a generic rq_flags field to svc_rqst and move rq_secure to it nfsd: minor off by one checks in __write_versions() sunrpc: release svc_pool_map reference when serv allocation fails ...
2014-12-14Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-securityLinus Torvalds1-0/+1
Pull security layer updates from James Morris: "In terms of changes, there's general maintenance to the Smack, SELinux, and integrity code. The IMA code adds a new kconfig option, IMA_APPRAISE_SIGNED_INIT, which allows IMA appraisal to require signatures. Support for reading keys from rootfs before init is call is also added" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (23 commits) selinux: Remove security_ops extern security: smack: fix out-of-bounds access in smk_parse_smack() VFS: refactor vfs_read() ima: require signature based appraisal integrity: provide a hook to load keys when rootfs is ready ima: load x509 certificate from the kernel integrity: provide a function to load x509 certificate from the kernel integrity: define a new function integrity_read_file() Security: smack: replace kzalloc with kmem_cache for inode_smack Smack: Lock mode for the floor and hat labels ima: added support for new kernel cmdline parameter ima_template_fmt ima: allocate field pointers array on demand in template_desc_init_fields() ima: don't allocate a copy of template_fmt in template_desc_init_fields() ima: display template format in meas. list if template name length is zero ima: added error messages to template-related functions ima: use atomic bit operations to protect policy update interface ima: ignore empty and with whitespaces policy lines ima: no need to allocate entry for comment ima: report policy load status ima: use path names cache ...
2014-12-14Merge git://git.kvack.org/~bcrl/aio-nextLinus Torvalds1-0/+1
Pull aio updates from Benjamin LaHaise. * git://git.kvack.org/~bcrl/aio-next: aio: Skip timer for io_getevents if timeout=0 aio: Make it possible to remap aio ring
2014-12-13aio: Make it possible to remap aio ringPavel Emelyanov1-0/+1
There are actually two issues this patch addresses. Let me start with the one I tried to solve in the beginning. So, in the checkpoint-restore project (criu) we try to dump tasks' state and restore one back exactly as it was. One of the tasks' state bits is rings set up with io_setup() call. There's (almost) no problems in dumping them, there's a problem restoring them -- if I dump a task with aio ring originally mapped at address A, I want to restore one back at exactly the same address A. Unfortunately, the io_setup() does not allow for that -- it mmaps the ring at whatever place mm finds appropriate (it calls do_mmap_pgoff() with zero address and without the MAP_FIXED flag). To make restore possible I'm going to mremap() the freshly created ring into the address A (under which it was seen before dump). The problem is that the ring's virtual address is passed back to the user-space as the context ID and this ID is then used as search key by all the other io_foo() calls. Reworking this ID to be just some integer doesn't seem to work, as this value is already used by libaio as a pointer using which this library accesses memory for aio meta-data. So, to make restore work we need to make sure that a) ring is mapped at desired virtual address b) kioctx->user_id matches this value Having said that, the patch makes mremap() on aio region update the kioctx's user_id and mmap_base values. Here appears the 2nd issue I mentioned in the beginning of this mail. If (regardless of the C/R dances I do) someone creates an io context with io_setup(), then mremap()-s the ring and then destroys the context, the kill_ioctx() routine will call munmap() on wrong (old) address. This will result in a) aio ring remaining in memory and b) some other vma get unexpectedly unmapped. What do you think? Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2014-12-13syscalls: implement execveat() system callDavid Drysdale1-0/+1
This patchset adds execveat(2) for x86, and is derived from Meredydd Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528). The primary aim of adding an execveat syscall is to allow an implementation of fexecve(3) that does not rely on the /proc filesystem, at least for executables (rather than scripts). The current glibc version of fexecve(3) is implemented via /proc, which causes problems in sandboxed or otherwise restricted environments. Given the desire for a /proc-free fexecve() implementation, HPA suggested (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be an appropriate generalization. Also, having a new syscall means that it can take a flags argument without back-compatibility concerns. The current implementation just defines the AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be added in future -- for example, flags for new namespaces (as suggested at https://lkml.org/lkml/2006/7/11/474). Related history: - https://lkml.org/lkml/2006/12/27/123 is an example of someone realizing that fexecve() is likely to fail in a chroot environment. - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered documenting the /proc requirement of fexecve(3) in its manpage, to "prevent other people from wasting their time". - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a problem where a process that did setuid() could not fexecve() because it no longer had access to /proc/self/fd; this has since been fixed. This patch (of 4): Add a new execveat(2) system call. execveat() is to execve() as openat() is to open(): it takes a file descriptor that refers to a directory, and resolves the filename relative to that. In addition, if the filename is empty and AT_EMPTY_PATH is specified, execveat() executes the file to which the file descriptor refers. This replicates the functionality of fexecve(), which is a system call in other UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and so relies on /proc being mounted). The filename fed to the executed program as argv[0] (or the name of the script fed to a script interpreter) will be of the form "/dev/fd/<fd>" (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively reflecting how the executable was found. This does however mean that execution of a script in a /proc-less environment won't work; also, script execution via an O_CLOEXEC file descriptor fails (as the file will not be accessible after exec). Based on patches by Meredydd Luff. Signed-off-by: David Drysdale <drysdale@google.com> Cc: Meredydd Luff <meredydd@senatehouse.org> Cc: Shuah Khan <shuah.kh@samsung.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Rich Felker <dalias@aerifal.cx> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13mm/rmap: share the i_mmap_rwsemDavidlohr Bueso1-0/+10
Similarly to the anon memory counterpart, we can share the mapping's lock ownership as the interval tree is not modified when doing doing the walk, only the file page. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13mm: convert i_mmap_mutex to rwsemDavidlohr Bueso1-3/+4
The i_mmap_mutex is a close cousin of the anon vma lock, both protecting similar data, one for file backed pages and the other for anon memory. To this end, this lock can also be a rwsem. In addition, there are some important opportunities to share the lock when there are no tree modifications. This conversion is straightforward. For now, all users take the write lock. [sfr@canb.auug.org.au: update fremap.c] Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13mm,fs: introduce helpers around the i_mmap_mutexDavidlohr Bueso1-0/+10
This series is a continuation of the conversion of the i_mmap_mutex to rwsem, following what we have for the anon memory counterpart. With Hugh's feedback from the first iteration. Ultimately, the most obvious paths that require exclusive ownership of the lock is when we modify the VMA interval tree, via vma_interval_tree_insert() and vma_interval_tree_remove() families. Cases such as unmapping, where the ptes content is changed but the tree remains untouched should make it safe to share the i_mmap_rwsem. As such, the code of course is straightforward, however the devil is very much in the details. While its been tested on a number of workloads without anything exploding, I would not be surprised if there are some less documented/known assumptions about the lock that could suffer from these changes. Or maybe I'm just missing something, but either way I believe its at the point where it could use more eyes and hopefully some time in linux-next. Because the lock type conversion is the heart of this patchset, its worth noting a few comparisons between mutex vs rwsem (xadd): (i) Same size, no extra footprint. (ii) Both have CONFIG_XXX_SPIN_ON_OWNER capabilities for exclusive lock ownership. (iii) Both can be slightly unfair wrt exclusive ownership, with writer lock stealing properties, not necessarily respecting FIFO order for granting the lock when contended. (iv) Mutexes can be slightly faster than rwsems when the lock is non-contended. (v) Both suck at performance for debug (slowpaths), which shouldn't matter anyway. Sharing the lock is obviously beneficial, and sem writer ownership is close enough to mutexes. The biggest winner of these changes is migration. As for concrete numbers, the following performance results are for a 4-socket 60-core IvyBridge-EX with 130Gb of RAM. Both alltests and disk (xfs+ramdisk) workloads of aim7 suite do quite well with this set, with a steady ~60% throughput (jpm) increase for alltests and up to ~30% for disk for high amounts of concurrency. Lower counts of workload users (< 100) does not show much difference at all, so at least no regressions. 3.18-rc1 3.18-rc1-i_mmap_rwsem alltests-100 17918.72 ( 0.00%) 28417.97 ( 58.59%) alltests-200 16529.39 ( 0.00%) 26807.92 ( 62.18%) alltests-300 16591.17 ( 0.00%) 26878.08 ( 62.00%) alltests-400 16490.37 ( 0.00%) 26664.63 ( 61.70%) alltests-500 16593.17 ( 0.00%) 26433.72 ( 59.30%) alltests-600 16508.56 ( 0.00%) 26409.20 ( 59.97%) alltests-700 16508.19 ( 0.00%) 26298.58 ( 59.31%) alltests-800 16437.58 ( 0.00%) 26433.02 ( 60.81%) alltests-900 16418.35 ( 0.00%) 26241.61 ( 59.83%) alltests-1000 16369.00 ( 0.00%) 26195.76 ( 60.03%) alltests-1100 16330.11 ( 0.00%) 26133.46 ( 60.03%) alltests-1200 16341.30 ( 0.00%) 26084.03 ( 59.62%) alltests-1300 16304.75 ( 0.00%) 26024.74 ( 59.61%) alltests-1400 16231.08 ( 0.00%) 25952.35 ( 59.89%) alltests-1500 16168.06 ( 0.00%) 25850.58 ( 59.89%) alltests-1600 16142.56 ( 0.00%) 25767.42 ( 59.62%) alltests-1700 16118.91 ( 0.00%) 25689.58 ( 59.38%) alltests-1800 16068.06 ( 0.00%) 25599.71 ( 59.32%) alltests-1900 16046.94 ( 0.00%) 25525.92 ( 59.07%) alltests-2000 16007.26 ( 0.00%) 25513.07 ( 59.38%) disk-100 7582.14 ( 0.00%) 7257.48 ( -4.28%) disk-200 6962.44 ( 0.00%) 7109.15 ( 2.11%) disk-300 6435.93 ( 0.00%) 6904.75 ( 7.28%) disk-400 6370.84 ( 0.00%) 6861.26 ( 7.70%) disk-500 6353.42 ( 0.00%) 6846.71 ( 7.76%) disk-600 6368.82 ( 0.00%) 6806.75 ( 6.88%) disk-700 6331.37 ( 0.00%) 6796.01 ( 7.34%) disk-800 6324.22 ( 0.00%) 6788.00 ( 7.33%) disk-900 6253.52 ( 0.00%) 6750.43 ( 7.95%) disk-1000 6242.53 ( 0.00%) 6855.11 ( 9.81%) disk-1100 6234.75 ( 0.00%) 6858.47 ( 10.00%) disk-1200 6312.76 ( 0.00%) 6845.13 ( 8.43%) disk-1300 6309.95 ( 0.00%) 6834.51 ( 8.31%) disk-1400 6171.76 ( 0.00%) 6787.09 ( 9.97%) disk-1500 6139.81 ( 0.00%) 6761.09 ( 10.12%) disk-1600 4807.12 ( 0.00%) 6725.33 ( 39.90%) disk-1700 4669.50 ( 0.00%) 5985.38 ( 28.18%) disk-1800 4663.51 ( 0.00%) 5972.99 ( 28.08%) disk-1900 4674.31 ( 0.00%) 5949.94 ( 27.29%) disk-2000 4668.36 ( 0.00%) 5834.93 ( 24.99%) In addition, a 67.5% increase in successfully migrated NUMA pages, thus improving node locality. The patch layout is simple but designed for bisection (in case reversion is needed if the changes break upstream) and easier review: o Patches 1-4 convert the i_mmap lock from mutex to rwsem. o Patches 5-10 share the lock in specific paths, each patch details the rationale behind why it should be safe. This patchset has been tested with: postgres 9.4 (with brand new hugetlb support), hugetlbfs test suite (all tests pass, in fact more tests pass with these changes than with an upstream kernel), ltp, aim7 benchmarks, memcached and iozone with the -B option for mmap'ing. *Untested* paths are nommu, memory-failure, uprobes and xip. This patch (of 8): Various parts of the kernel acquire and release this mutex, so add i_mmap_lock_write() and immap_unlock_write() helper functions that will encapsulate this logic. The next patch will make use of these. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name> Acked-by: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10make default ->i_fop have ->open() fail with ENXIOAl Viro1-1/+0
As it is, default ->i_fop has NULL ->open() (along with all other methods). The only case where it matters is reopening (via procfs symlink) a file that didn't get its ->f_op from ->i_fop - anything else will have ->i_fop assigned to something sane (default would fail on read/write/ioctl/etc.). Unfortunately, such case exists - alloc_file() users, especially anon_get_file() ones. There we have tons of opened files of very different kinds sharing the same inode. As the result, attempt to reopen those via procfs succeeds and you get a descriptor you can't do anything with. Moreover, in case of sockets we set ->i_fop that will only be used on such reopen attempts - and put a failing ->open() into it to make sure those do not succeed. It would be simpler to put such ->open() into default ->i_fop and leave it unchanged both for anon inode (as we do anyway) and for socket ones. Result: * everything going through do_dentry_open() works as it used to * sock_no_open() kludge is gone * attempts to reopen anon-inode files fail as they really ought to * ditto for aio_private_file() * ditto for perfmon - this one actually tried to imitate sock_no_open() trick, but failed to set ->i_fop, so in the current tree reopens succeed and yield completely useless descriptor. Intent clearly had been to fail with -ENXIO on such reopens; now it actually does. * everything else that used alloc_file() keeps working - it has ->i_fop set for its inodes anyway Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-10Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-3/+10
Pull VFS changes from Al Viro: "First pile out of several (there _definitely_ will be more). Stuff in this one: - unification of d_splice_alias()/d_materialize_unique() - iov_iter rewrite - killing a bunch of ->f_path.dentry users (and f_dentry macro). Getting that completed will make life much simpler for unionmount/overlayfs, since then we'll be able to limit the places sensitive to file _dentry_ to reasonably few. Which allows to have file_inode(file) pointing to inode in a covered layer, with dentry pointing to (negative) dentry in union one. Still not complete, but much closer now. - crapectomy in lustre (dead code removal, mostly) - "let's make seq_printf return nothing" preparations - assorted cleanups and fixes There _definitely_ will be more piles" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) copy_from_iter_nocache() new helper: iov_iter_kvec() csum_and_copy_..._iter() iov_iter.c: handle ITER_KVEC directly iov_iter.c: convert copy_to_iter() to iterate_and_advance iov_iter.c: convert copy_from_iter() to iterate_and_advance iov_iter.c: get rid of bvec_copy_page_{to,from}_iter() iov_iter.c: convert iov_iter_zero() to iterate_and_advance iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds iov_iter.c: convert iov_iter_npages() to iterate_all_kinds iov_iter.c: iterate_and_advance iov_iter.c: macros for iterating over iov_iter kill f_dentry macro dcache: fix kmemcheck warning in switch_names new helper: audit_file() nfsd_vfs_write(): use file_inode() ncpfs: use file_inode() kill f_dentry uses lockd: get rid of ->f_path.dentry->d_sb ...
2014-12-10Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fsLinus Torvalds1-3/+2
Pull quota updates from Jan Kara: "Quota improvements and some minor cleanups. The main portion in the pull request are changes which move i_dquot array from struct inode into fs-private part of an inode which saves memory for filesystems which don't use VFS quotas" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: udf: One function call less in udf_fill_super() after error detection udf: Deletion of unnecessary checks before the function call "iput" jbd: Deletion of an unnecessary check before the function call "iput" vfs: Remove i_dquot field from inode jfs: Convert to private i_dquot field reiserfs: Convert to private i_dquot field ocfs2: Convert to private i_dquot field ext4: Convert to private i_dquot field ext3: Convert to private i_dquot field ext2: Convert to private i_dquot field quota: Use function to provide i_dquot pointers xfs: Set allowed quota types gfs2: Set allowed quota types quota: Allow each filesystem to specify which quota types it supports quota: Remove const from function declarations quota: Add log level to printk
2014-11-19Merge tag 'trace-seq-file-cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into for-nextAl Viro1-1/+1
Pull the beginning of seq_file cleanup from Steven: "I'm looking to clean up the seq_file code and to eventually merge the trace_seq code with seq_file as well, since they basically do the same thing. Part of this process is to remove the return code of seq_printf() and friends as they are rather inconsistent. It is better to use the new function seq_has_overflowed() if you want to stop processing when the buffer is full. Note, if the buffer is full, the seq_file code will throw away the contents, allocate a bigger buffer, and then call your code again to fill in the data. The only thing that breaking out of the function early does is to save a little time which is probably never noticed. I started with patches from Joe Perches and modified them as well. There's many more places that need to be updated before we can convert seq_printf() and friends to return void. But this patch set introduces the seq_has_overflowed() and does some initial updates."
2014-11-19kill f_dentry macroAl Viro1-1/+0
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19merge nfs bugfixes into nfsd for-3.19 branchJ. Bruce Fields1-3/+46
In addition to nfsd bugfixes, there are some fixes in -rc5 for client bugs that can interfere with my testing.
2014-11-19Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity into nextJames Morris1-0/+1
2014-11-17VFS: refactor vfs_read()Dmitry Kasatkin1-0/+1
integrity_kernel_read() duplicates the file read operations code in vfs_read(). This patch refactors vfs_read() code creating a helper function __vfs_read(). It is used by both vfs_read() and integrity_kernel_read(). Signed-off-by: Dmitry Kasatkin <d.kasatkin@samsung.com> Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
2014-11-17fs: add freeze_super/thaw_super fs hooksBenjamin Marzinski1-0/+2
Currently, freezing a filesystem involves calling freeze_super, which locks sb->s_umount and then calls the fs-specific freeze_fs hook. This makes it hard for gfs2 (and potentially other cluster filesystems) to use the vfs freezing code to do freezes on all the cluster nodes. In order to communicate that a freeze has been requested, and to make sure that only one node is trying to freeze at a time, gfs2 uses a glock (sd_freeze_gl). The problem is that there is no hook for gfs2 to acquire this lock before calling freeze_super. This means that two nodes can attempt to freeze the filesystem by both calling freeze_super, acquiring the sb->s_umount lock, and then attempting to grab the cluster glock sd_freeze_gl. Only one will succeed, and the other will be stuck in freeze_super, making it impossible to finish freezing the node. To solve this problem, this patch adds the freeze_super and thaw_super hooks. If a filesystem implements these hooks, they are called instead of the vfs freeze_super and thaw_super functions. This means that every filesystem that implements these hooks must call the vfs freeze_super and thaw_super functions itself within the hook function to make use of the vfs freezing code. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-11-10vfs: Remove i_dquot field from inodeJan Kara1-3/+0
All filesystems using VFS quotas are now converted to use their private i_dquot fields. Remove the i_dquot field from generic inode structure. Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2014-11-10quota: Use function to provide i_dquot pointersJan Kara1-0/+1
i_dquot array is used by relatively few filesystems (ext?, ocfs2, jfs, reiserfs) so it is beneficial to move this array to fs-private part of the inode. We cannot just pass quota pointers from filesystems to quota functions because during quotaon and quotaoff we have to traverse list of all inodes and manipulate i_dquot pointers for each inode. So we provide a function which generic quota code can use to get pointer to the i_dquot array from the filesystem. Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2014-11-10quota: Allow each filesystem to specify which quota types it supportsJan Kara1-0/+1
Currently all filesystems supporting VFS quota support user and group quotas. With introduction of project quotas this is going to change so make sure filesystem isn't called for quota type it doesn't support by introduction of a bitmask determining which quota types each filesystem supports. Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2014-11-07VFS: Rename do_fallocate() to vfs_fallocate()Anna Schumaker1-1/+1
This function needs to be exported so it can be used by the NFSD module when responding to the new ALLOCATE and DEALLOCATE operations in NFS v4.2. Christoph Hellwig suggested renaming the function to stay consistent with how other vfs functions are named. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-11-05fs: Convert show_fdinfo functions to voidJoe Perches1-1/+1
seq_printf functions shouldn't really check the return value. Checking seq_has_overflowed() occasionally is used instead. Update vfs documentation. Link: http://lkml.kernel.org/p/e37e6e7b76acbdcc3bb4ab2a57c8f8ca1ae11b9a.1412031505.git.joe@perches.com Cc: David S. Miller <davem@davemloft.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Joe Perches <joe@perches.com> [ did a few clean ups ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-10-31new helper: is_root_inode()Al Viro1-0/+5
replace open-coded instances Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-31vfs: make first argument of dir_context.actor typedMiklos Szeredi1-1/+4
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-31Return short read or 0 at end of a raw device, not EIODavid Jeffery1-0/+1
Author: David Jeffery <djeffery@redhat.com> Changes to the basic direct I/O code have broken the raw driver when reading to the end of a raw device. Instead of returning a short read for a read that extends partially beyond the device's end or 0 when at the end of the device, these reads now return EIO. The raw driver needs the same end of device handling as was added for normal block devices. Using blkdev_read_iter, which has the needed size checks, prevents the EIO conditions at the end of the device. Signed-off-by: David Jeffery <djeffery@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-28overlayfs: fix lockdep misannotationMiklos Szeredi1-3/+6
In an overlay directory that shadows an empty lower directory, say /mnt/a/empty102, do: touch /mnt/a/empty102/x unlink /mnt/a/empty102/x rmdir /mnt/a/empty102 It's actually harmless, but needs another level of nesting between I_MUTEX_CHILD and I_MUTEX_NORMAL. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-24fs: limit filesystem stacking depthMiklos Szeredi1-0/+11
Add a simple read-only counter to super_block that indicates how deep this is in the stack of filesystems. Previously ecryptfs was the only stackable filesystem and it explicitly disallowed multiple layers of itself. Overlayfs, however, can be stacked recursively and also may be stacked on top of ecryptfs or vice versa. To limit the kernel stack usage we must limit the depth of the filesystem stack. Initially the limit is set to 2. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-10-24vfs: add whiteout supportMiklos Szeredi1-0/+11
Whiteout isn't actually a new file type, but is represented as a char device (Linus's idea) with 0/0 device number. This has several advantages compared to introducing a new whiteout file type: - no userspace API changes (e.g. trivial to make backups of upper layer filesystem, without losing whiteouts) - no fs image format changes (you can boot an old kernel/fsck without whiteout support and things won't break) - implementation is trivial Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-10-24vfs: export check_sticky()Miklos Szeredi1-0/+9
It's already duplicated in btrfs and about to be used in overlayfs too. Move the sticky bit check to an inline helper and call the out-of-line helper only in the unlikly case of the sticky bit being set. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-10-24vfs: export __inode_permission() to modulesMiklos Szeredi1-0/+1
We need to be able to check inode permissions (but not filesystem implied permissions) for stackable filesystems. Expose this interface for overlayfs. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-10-24vfs: export do_splice_direct() to modulesMiklos Szeredi1-0/+3
Export do_splice_direct() to modules. Needed by overlay filesystem. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-10-24vfs: add i_op->dentry_open()Miklos Szeredi1-0/+4
Add a new inode operation i_op->dentry_open(). This is for stacked filesystems that want to return a struct file from a different filesystem. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-10-18Merge branch 'for-3.18/core' of git://git.kernel.dk/linux-blockLinus Torvalds1-2/+0
Pull core block layer changes from Jens Axboe: "This is the core block IO pull request for 3.18. Apart from the new and improved flush machinery for blk-mq, this is all mostly bug fixes and cleanups. - blk-mq timeout updates and fixes from Christoph. - Removal of REQ_END, also from Christoph. We pass it through the ->queue_rq() hook for blk-mq instead, freeing up one of the request bits. The space was overly tight on 32-bit, so Martin also killed REQ_KERNEL since it's no longer used. - blk integrity updates and fixes from Martin and Gu Zheng. - Update to the flush machinery for blk-mq from Ming Lei. Now we have a per hardware context flush request, which both cleans up the code should scale better for flush intensive workloads on blk-mq. - Improve the error printing, from Rob Elliott. - Backing device improvements and cleanups from Tejun. - Fixup of a misplaced rq_complete() tracepoint from Hannes. - Make blk_get_request() return error pointers, fixing up issues where we NULL deref when a device goes bad or missing. From Joe Lawrence. - Prep work for drastically reducing the memory consumption of dm devices from Junichi Nomura. This allows creating clone bio sets without preallocating a lot of memory. - Fix a blk-mq hang on certain combinations of queue depths and hardware queues from me. - Limit memory consumption for blk-mq devices for crash dump scenarios and drivers that use crazy high depths (certain SCSI shared tag setups). We now just use a single queue and limited depth for that" * 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits) block: Remove REQ_KERNEL blk-mq: allocate cpumask on the home node bio-integrity: remove the needless fail handle of bip_slab creating block: include func name in __get_request prints block: make blk_update_request print prefix match ratelimited prefix blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio block: fix alignment_offset math that assumes io_min is a power-of-2 blk-mq: Make bt_clear_tag() easier to read blk-mq: fix potential hang if rolling wakeup depth is too high block: add bioset_create_nobvec() block: use bio_clone_fast() in blk_rq_prep_clone() block: misplaced rq_complete tracepoint sd: Honor block layer integrity handling flags block: Replace strnicmp with strncasecmp block: Add T10 Protection Information functions block: Don't merge requests if integrity flags differ block: Integrity checksum flag block: Relocate bio integrity flags block: Add a disk flag to block integrity profile block: Add prefix to block integrity profile flags ...
2014-10-14block: Remove REQ_KERNELMartin K. Petersen1-2/+0
REQ_KERNEL is no longer used. Remove it and drop the redundant uio argument to nfs_file_direct_{read,write}. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-13Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-2/+3
Pull vfs updates from Al Viro: "The big thing in this pile is Eric's unmount-on-rmdir series; we finally have everything we need for that. The final piece of prereqs is delayed mntput() - now filesystem shutdown always happens on shallow stack. Other than that, we have several new primitives for iov_iter (Matt Wilcox, culled from his XIP-related series) pushing the conversion to ->read_iter()/ ->write_iter() a bit more, a bunch of fs/dcache.c cleanups and fixes (including the external name refcounting, which gives consistent behaviour of d_move() wrt procfs symlinks for long and short names alike) and assorted cleanups and fixes all over the place. This is just the first pile; there's a lot of stuff from various people that ought to go in this window. Starting with unionmount/overlayfs mess... ;-/" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (60 commits) fs/file_table.c: Update alloc_file() comment vfs: Deduplicate code shared by xattr system calls operating on paths reiserfs: remove pointless forward declaration of struct nameidata don't need that forward declaration of struct nameidata in dcache.h anymore take dname_external() into fs/dcache.c let path_init() failures treated the same way as subsequent link_path_walk() fix misuses of f_count() in ppp and netlink ncpfs: use list_for_each_entry() for d_subdirs walk vfs: move getname() from callers to do_mount() gfs2_atomic_open(): skip lookups on hashed dentry [infiniband] remove pointless assignments gadgetfs: saner API for gadgetfs_create_file() f_fs: saner API for ffs_sb_create_file() jfs: don't hash direct inode [s390] remove pointless assignment of ->f_op in vmlogrdr ->open() ecryptfs: ->f_op is never NULL android: ->f_op is never NULL nouveau: __iomem misannotations missing annotation in fs/file.c fs: namespace: suppress 'may be used uninitialized' warnings ...
2014-10-09vfs: move getname() from callers to do_mount()Seunghun Lee1-1/+2
It would make more sense to pass char __user * instead of char * in callers of do_mount() and do getname() inside do_mount(). Suggested-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Seunghun Lee <waydi1@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09constify file_inode()Al Viro1-1/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-07locks: fix fcntl_setlease/getlease return when !CONFIG_FILE_LOCKINGJeff Layton1-2/+2
Currently they both just return 0. Fix them to return more appropriate values instead. For better or worse, most places in the kernel return -EINVAL when leases aren't available. -ENOLCK would probably have been better, but let's follow suit here in the case of F_SETLEASE. In the F_GETLEASE case, just return F_UNLCK since we know that no lease will have been set. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07locks: set fl_owner for leases to filp instead of current->filesJeff Layton1-7/+1
Like flock locks, leases are owned by the file description. Now that the i_have_this_lease check in __break_lease is gone, we don't actually use the fl_owner for leases for anything. So, it's now safe to set this more appropriately to the same value as the fl_file. While we're at it, fix up the comments over the fl_owner_t definition since they're rather out of date. Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-10-07locks: give lm_break a return valueJeff Layton1-1/+1
Christoph suggests: "Add a return value to lm_break so that the lock manager can tell the core code "you can delete this lease right now". That gets rid of the games with the timeout which require all kinds of race avoidance code in the users." Do that here and have the nfsd lease break routine use it when it detects that there was a race between setting up the lease and it being broken. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07locks: move freeing of leases outside of i_lockJeff Layton1-3/+4
There was only one place where we still could free a file_lock while holding the i_lock -- lease_modify. Add a new list_head argument to the lm_change operation, pass in a private list when calling it, and fix those callers to dispose of the list once the lock has been dropped. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07locks: define a lm_setup handler for leasesJeff Layton1-0/+1
...and move the fasync setup into it for fcntl lease calls. At the same time, change the semantics of how the file_lock double-pointer is handled. Up until now, on a successful lease return you got a pointer to the lock on the list. This is bad, since that pointer can no longer be relied on as valid once the inode->i_lock has been released. Change the code to instead just zero out the pointer if the lease we passed in ended up being used. Then the callers can just check to see if it's NULL after the call and free it if it isn't. The priv argument has the same semantics. The lm_setup function can zero the pointer out to signal to the caller that it should not be freed after the function returns. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07locks: plumb a "priv" pointer into the setlease routinesJeff Layton1-6/+6
In later patches, we're going to add a new lock_manager_operation to finish setting up the lease while still holding the i_lock. To do this, we'll need to pass a little bit of info in the fcntl setlease case (primarily an fasync structure). Plumb the extra pointer into there in advance of that. We declare this pointer as a void ** to make it clear that this is private info, and that the caller isn't required to set this unless the lm_setup specifically requires it. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-09-09security: make security_file_set_fowner, f_setown and __f_setown void returnJeff Layton1-2/+2
security_file_set_fowner always returns 0, so make it f_setown and __f_setown void return functions and fix up the error handling in the callers. Cc: linux-security-module@vger.kernel.org Signed-off-by: Jeff Layton <jlayton@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-09-09locks: consolidate "nolease" routinesJeff Layton1-0/+1
GFS2 and NFS have setlease routines that always just return -EINVAL. Turn that into a generic routine that can live in fs/libfs.c. Cc: <linux-nfs@vger.kernel.org> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: <cluster-devel@redhat.com> Signed-off-by: Jeff Layton <jlayton@primarydata.com> Acked-by: Trond Myklebust <trond.myklebust@primarydata.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-09-09locks: remove lock_may_read and lock_may_writeJeff Layton1-14/+0
There are no callers of these functions. Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-09locks: New ops in lock_manager_operations for get/put ownerKinglong Mee1-0/+2
NFSD or other lockmanager may increase the owner's reference, so adds two new options for copying and releasing owner. v5: change order from 2/6 to 3/6 v4: rename lm_copy_owner/lm_release_owner to lm_get_owner/lm_put_owner Reviewed-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-09locks: Rename __locks_copy_lock() to locks_copy_conflock()Kinglong Mee1-2/+2
Jeff advice, " Right now __locks_copy_lock is only used to copy conflocks. It would be good to rename that to something more distinct (i.e.locks_copy_conflock), to make it clear that we're generating a conflock there." v5: change order from 3/6 to 2/6 v4: new patch only renaming function name Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-09locks: Remove unused conf argument from lm_grantJoe Perches1-1/+1
This argument is always NULL so don't pass it around. [jlayton: remove dependencies on previous patches in series] Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jeff Layton <jlayton@primarydata.com>