aboutsummaryrefslogtreecommitdiffstats
path: root/fs/afs/super.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-05-16afs: Pass pre-fetch server and volume break counts into afs_iget5_set()David Howells1-6/+8
Pass the server and volume break counts from before the status fetch operation that queried the attributes of a file into afs_iget5_set() so that the new vnode's break counters can be initialised appropriately. This allows detection of a volume or server break that happened whilst we were fetching the status or setting up the vnode. Fixes: c435ee34551e ("afs: Overhaul the callback handling") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Make vnode->cb_interest RCU safeDavid Howells1-2/+2
Use RCU-based freeing for afs_cb_interest struct objects and use RCU on vnode->cb_interest. Use that change to allow afs_check_validity() to use read_seqbegin_or_lock() instead of read_seqlock_excl(). This also requires the caller of afs_check_validity() to hold the RCU read lock across the call. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Don't save callback version and type fieldsDavid Howells1-1/+0
Don't save callback version and type fields as the version is about the format of the callback information and the type is relative to the particular RPC call. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Fix application of status and callback to be under same lockDavid Howells1-2/+1
When applying the status and callback in the response of an operation, apply them in the same critical section so that there's no race between checking the callback state and checking status-dependent state (such as the data version). Fix this by: (1) Allocating a joint {status,callback} record (afs_status_cb) before calling the RPC function for each vnode for which the RPC reply contains a status or a status plus a callback. A flag is set in the record to indicate if a callback was actually received. (2) These records are passed into the RPC functions to be filled in. The afs_decode_status() and yfs_decode_status() functions are removed and the cb_lock is no longer taken. (3) xdr_decode_AFSFetchStatus() and xdr_decode_YFSFetchStatus() no longer update the vnode. (4) xdr_decode_AFSCallBack() and xdr_decode_YFSCallBack() no longer update the vnode. (5) vnodes, expected data-version numbers and callback break counters (cb_break) no longer need to be passed to the reply delivery functions. Note that, for the moment, the file locking functions still need access to both the call and the vnode at the same time. (6) afs_vnode_commit_status() is now given the cb_break value and the expected data_version and the task of applying the status and the callback to the vnode are now done here. This is done under a single taking of vnode->cb_lock. (7) afs_pages_written_back() is now called by afs_store_data() rather than by the reply delivery function. afs_pages_written_back() has been moved to before the call point and is now given the first and last page numbers rather than a pointer to the call. (8) The indicator from YFS.RemoveFile2 as to whether the target file actually got removed (status.abort_code == VNOVNODE) rather than merely dropping a link is now checked in afs_unlink rather than in xdr_decode_YFSFetchStatus(). Supplementary fixes: (*) afs_cache_permit() now gets the caller_access mask from the afs_status_cb object rather than picking it out of the vnode's status record. afs_fetch_status() returns caller_access through its argument list for this purpose also. (*) afs_inode_init_from_status() now uses a write lock on cb_lock rather than a read lock and now sets the callback inside the same critical section. Fixes: c435ee34551e ("afs: Overhaul the callback handling") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Make some RPC operations non-interruptibleDavid Howells1-1/+1
Make certain RPC operations non-interruptible, including: (*) Set attributes (*) Store data We don't want to get interrupted during a flush on close, flush on unlock, writeback or an inode update, leaving us in a state where we still need to do the writeback or update. (*) Extend lock (*) Release lock We don't want to get lock extension interrupted as the file locks on the server are time-limited. Interruption during lock release is less of an issue since the lock is time-limited, but it's better to complete the release to avoid a several-minute wait to recover it. *Setting* the lock isn't a problem if it's interrupted since we can just return to the user and tell them they were interrupted - at which point they can elect to retry. (*) Silly unlink We want to remove silly unlink files if we can, rather than leaving them for the salvager to clear up. Note that whilst these calls are no longer interruptible, they do have timeouts on them, so if the server stops responding the call will fail with something like ETIME or ECONNRESET. Without this, the following: kAFS: Unexpected error from FS.StoreData -512 appears in dmesg when a pending store data gets interrupted and some processes may just hang. Additionally, make the code that checks/updates the server record ignore failure due to interruption if the main call is uninterruptible and if the server has an address list. The next op will check it again since the expiration time on the old list has past. Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation") Reported-by: Jonathan Billings <jsbillings@jsbillings.org> Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-07Merge tag 'afs-next-20190507' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fsLinus Torvalds1-2/+32
Pull AFS updates from David Howells: "A set of fix and development patches for AFS for 5.2. Summary: - Fix the AFS file locking so that sqlite can run on an AFS mount and also so that firefox and gnome can use a homedir that's mounted through AFS. This required emulation of fine-grained locking when the server will only support whole-file locks and no upgrade/downgrade. Four modes are provided, settable by mount parameter: "flock=local" - No reference to the server "flock=openafs" - Fine-grained locks are local-only, whole-file locks require sufficient server locks "flock=strict" - All locks require sufficient server locks "flock=write" - Always get an exclusive server lock If the volume is a read-only or backup volume, then flock=local for that volume. - Log extra information for a couple of cases where the client mucks up somehow: AFS vnode with undefined type and dir check failure - in both cases we seem to end up with unfilled data, but the issues happen infrequently and are difficult to reproduce at will. - Implement silly rename for unlink() and rename(). - Set i_blocks so that du can get some information about usage. - Fix xattr handlers to return the right amount of data and to not overflow buffers. - Implement getting/setting raw AFS and YFS ACLs as xattrs" * tag 'afs-next-20190507' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: afs: Implement YFS ACL setting afs: Get YFS ACLs and information through xattrs afs: implement acl setting afs: Get an AFS3 ACL as an xattr afs: Fix getting the afs.fid xattr afs: Fix the afs.cell and afs.volume xattr handlers afs: Calculate i_blocks based on file size afs: Log more information for "kAFS: AFS vnode with undefined type\n" afs: Provide mount-time configurable byte-range file locking emulation afs: Add more tracepoints afs: Implement sillyrename for unlink and rename afs: Add directory reload tracepoint afs: Handle lock rpc ops failing on a file that got deleted afs: Improve dir check failure reports afs: Add file locking tracepoints afs: Further fix file locking afs: Fix AFS file locking to allow fine grained locks afs: Calculate lock extend timer from set/extend reply reception afs: Split wait from afs_make_call()
2019-05-07afs: Log more information for "kAFS: AFS vnode with undefined type\n"David Howells1-1/+1
Log more information when "kAFS: AFS vnode with undefined type\n" is displayed due to a vnode record being retrieved from the server that appears to have a duff file type (usually 0). This prints more information to try and help pin down the problem. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-01afs: switch to use of ->free_inode()Al Viro1-5/+4
debugging printks left in ->destroy_inode() and so's the update of inode count; we could take the latter to RCU-delayed part (would take only moving the check on module exit past rcu_barrier() there), but debugging output ought to either stay where it is or go into ->evict_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-25afs: Provide mount-time configurable byte-range file locking emulationDavid Howells1-0/+27
Provide byte-range file locking emulation that can be configured at mount time to one of four modes: (1) flock=local. Locking is done locally only and no reference is made to the server. (2) flock=openafs. Byte-range locking is done locally only; whole-file locking is done with reference to the server. Whole-file locks cannot be upgraded unless the client holds an exclusive lock. (3) flock=strict. Byte-range and whole-file locking both require a sufficient whole-file lock on the server. (4) flock=write. As strict, but the client always gets an exclusive whole-file lock on the server. Signed-off-by: David Howells <dhowells@redhat.com>
2019-04-25afs: Add more tracepointsDavid Howells1-0/+1
Add four more tracepoints: (1) afs_make_fs_call1 - Split from afs_make_fs_call but takes a filename to log also. (2) afs_make_fs_call2 - Like the above but takes two filenames to log. (3) afs_lookup - Log the result of doing a successful lookup, including a negative result (fid 0:0). (4) afs_get_tree - Log the set up of a volume for mounting. It also extends the name buffer on the afs_edit_dir tracepoint to 24 chars and puts quotes around the filename in the text representation. Signed-off-by: David Howells <dhowells@redhat.com>
2019-04-25afs: Implement sillyrename for unlink and renameDavid Howells1-1/+3
Implement sillyrename for AFS unlink and rename, using the NFS variant implementation as a basis. Note that the asynchronous file locking extender/releaser has to be notified with a state change to stop it complaining if there's a race between that and the actual file deletion. A tracepoint, afs_silly_rename, is also added to note the silly rename and the cleanup. The afs_edit_dir tracepoint is given some extra reason indicators and the afs_flock_ev tracepoint is given a silly-delete file lock cancellation indicator. Signed-off-by: David Howells <dhowells@redhat.com>
2019-03-12Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-210/+220
Pull vfs mount infrastructure updates from Al Viro: "The rest of core infrastructure; no new syscalls in that pile, but the old parts are switched to new infrastructure. At that point conversions of individual filesystems can happen independently; some are done here (afs, cgroup, procfs, etc.), there's also a large series outside of that pile dealing with NFS (quite a bit of option-parsing stuff is getting used there - it's one of the most convoluted filesystems in terms of mount-related logics), but NFS bits are the next cycle fodder. It got seriously simplified since the last cycle; documentation is probably the weakest bit at the moment - I considered dropping the commit introducing Documentation/filesystems/mount_api.txt (cutting the size increase by quarter ;-), but decided that it would be better to fix it up after -rc1 instead. That pile allows to do followup work in independent branches, which should make life much easier for the next cycle. fs/super.c size increase is unpleasant; there's a followup series that allows to shrink it considerably, but I decided to leave that until the next cycle" * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (41 commits) afs: Use fs_context to pass parameters over automount afs: Add fs_context support vfs: Add some logging to the core users of the fs_context log vfs: Implement logging through fs_context vfs: Provide documentation for new mount API vfs: Remove kern_mount_data() hugetlbfs: Convert to fs_context cpuset: Use fs_context kernfs, sysfs, cgroup, intel_rdt: Support fs_context cgroup: store a reference to cgroup_ns into cgroup_fs_context cgroup1_get_tree(): separate "get cgroup_root to use" into a separate helper cgroup_do_mount(): massage calling conventions cgroup: stash cgroup_root reference into cgroup_fs_context cgroup2: switch to option-by-option parsing cgroup1: switch to option-by-option parsing cgroup: take options parsing into ->parse_monolithic() cgroup: fold cgroup1_mount() into cgroup1_get_tree() cgroup: start switching to fs_context ipc: Convert mqueue fs to fs_context proc: Add fs_context support to procfs ...
2019-03-12mm: refactor readahead defines in mm.hNikolay Borisov1-1/+1
All users of VM_MAX_READAHEAD actually convert it to kbytes and then to pages. Define the macro explicitly as (SZ_128K / PAGE_SIZE). This simplifies the expression in every filesystem. Also rename the macro to VM_READAHEAD_PAGES to properly convey its meaning. Finally remove unused VM_MIN_READAHEAD [akpm@linux-foundation.org: fix fs/io_uring.c, per Stephen] Link: http://lkml.kernel.org/r/20181221144053.24318-1-nborisov@suse.com Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Latchesar Ionkov <lucho@ionkov.net> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: David Howells <dhowells@redhat.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: David Sterba <dsterba@suse.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-28afs: Use fs_context to pass parameters over automountDavid Howells1-37/+3
Alter the AFS automounting code to create and modify an fs_context struct when parameterising a new mount triggered by an AFS mountpoint rather than constructing device name and option strings. Also remove the cell=, vol= and rwpath options as they are then redundant. The reason they existed is because the 'device name' may be derived literally from a mountpoint object in the filesystem, so default cell and parent-type information needed to be passed in by some other method from the automount routines. The vol= option didn't end up being used. Signed-off-by: David Howells <dhowells@redhat.com> cc: Eric W. Biederman <ebiederm@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28afs: Add fs_context supportDavid Howells1-208/+252
Add fs_context support to the AFS filesystem, converting the parameter parsing to store options there. This will form the basis for namespace propagation over mountpoints within the AFS model, thereby allowing AFS to be used in containers more easily. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-24afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFSDavid Howells1-2/+3
Increase the sizes of the volume ID to 64 bits and the vnode ID (inode number equivalent) to 96 bits to allow the support of YFS. This requires the iget comparator to check the vnode->fid rather than i_ino and i_generation as i_ino is not sufficiently capacious. It also requires this data to be placed into the vnode cache key for fscache. For the moment, just discard the top 32 bits of the vnode ID when returning it though stat. Signed-off-by: David Howells <dhowells@redhat.com>
2018-06-15afs: Display manually added cells in dynamic root mountDavid Howells1-20/+32
Alter the dynroot mount so that cells created by manipulation of /proc/fs/afs/cells and /proc/fs/afs/rootcell and by specification of a root cell as a module parameter will cause directories for those cells to be created in the dynamic root superblock for the network namespace[*]. To this end: (1) Only one dynamic root superblock is now created per network namespace and this is shared between all attempts to mount it. This makes it easier to find the superblock to modify. (2) When a dynamic root superblock is created, the list of cells is walked and directories created for each cell already defined. (3) When a new cell is added, if a dynamic root superblock exists, a directory is created for it. (4) When a cell is destroyed, the directory is removed. (5) These directories are created by calling lookup_one_len() on the root dir which automatically creates them if they don't exist. [*] Inasmuch as network namespaces are currently supported here. Signed-off-by: David Howells <dhowells@redhat.com>
2018-06-02Merge tag 'afs-fixes-20180514' into afs-procAl Viro1-2/+2
backmerge AFS fixes that went into mainline and deal with the conflict in fs/afs/fsclient.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-05-23afs: Implement network namespacingDavid Howells1-8/+12
Implement network namespacing within AFS, but don't yet let mounts occur outside the init namespace. An additional patch will be required propagate the network namespace across automounts. Signed-off-by: David Howells <dhowells@redhat.com>
2018-05-14afs: Fix whole-volume callback handlingDavid Howells1-1/+1
It's possible for an AFS file server to issue a whole-volume notification that callbacks on all the vnodes in the file have been broken. This is done for R/O and backup volumes (which don't have per-file callbacks) and for things like a volume being taken offline. Fix callback handling to detect whole-volume notifications, to track it across operations and to check it during inode validation. Fixes: c435ee34551e ("afs: Overhaul the callback handling") Signed-off-by: David Howells <dhowells@redhat.com>
2018-05-14afs: Fix directory page lockingDavid Howells1-1/+1
The afs directory loading code (primarily afs_read_dir()) locks all the pages that hold a directory's content blob to defend against getdents/getdents races and getdents/lookup races where the competitors issue conflicting reads on the same data. As the reads will complete consecutively, they may retrieve different versions of the data and one may overwrite the data that the other is busy parsing. Fix this by not locking the pages at all, but rather by turning the validation lock into an rwsem and getting an exclusive lock on it whilst reading the data or validating the attributes and a shared lock whilst parsing the data. Sharing the attribute validation lock should be fine as the data fetch will retrieve the attributes also. The individual page locks aren't needed at all as the only place they're being used is to serialise data loading. Without this patch, the: if (!test_bit(AFS_VNODE_DIR_VALID, &dvnode->flags)) { ... } part of afs_read_dir() may be skipped, leaving the pages unlocked when we hit the success: clause - in which case we try to unlock the not-locked pages, leading to the following oops: page:ffffe38b405b4300 count:3 mapcount:0 mapping:ffff98156c83a978 index:0x0 flags: 0xfffe000001004(referenced|private) raw: 000fffe000001004 ffff98156c83a978 0000000000000000 00000003ffffffff raw: dead000000000100 dead000000000200 0000000000000001 ffff98156b27c000 page dumped because: VM_BUG_ON_PAGE(!PageLocked(page)) page->mem_cgroup:ffff98156b27c000 ------------[ cut here ]------------ kernel BUG at mm/filemap.c:1205! ... RIP: 0010:unlock_page+0x43/0x50 ... Call Trace: afs_dir_iterate+0x789/0x8f0 [kafs] ? _cond_resched+0x15/0x30 ? kmem_cache_alloc_trace+0x166/0x1d0 ? afs_do_lookup+0x69/0x490 [kafs] ? afs_do_lookup+0x101/0x490 [kafs] ? key_default_cmp+0x20/0x20 ? request_key+0x3c/0x80 ? afs_lookup+0xf1/0x340 [kafs] ? __lookup_slow+0x97/0x150 ? lookup_slow+0x35/0x50 ? walk_component+0x1bf/0x490 ? path_lookupat.isra.52+0x75/0x200 ? filename_lookup.part.66+0xa0/0x170 ? afs_end_vnode_operation+0x41/0x60 [kafs] ? __check_object_size+0x9c/0x171 ? strncpy_from_user+0x4a/0x170 ? vfs_statx+0x73/0xe0 ? __do_sys_newlstat+0x39/0x70 ? __x64_sys_getdents+0xc9/0x140 ? __x64_sys_getdents+0x140/0x140 ? do_syscall_64+0x5b/0x160 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: f3ddee8dc4e2 ("afs: Fix directory handling") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-09afs: Split the dynroot stuff out and give it its own ops tablesDavid Howells1-4/+7
Split the AFS dynamic root stuff out of the main directory handling file and into its own file as they share little in common. The dynamic root code also gets its own dentry and inode ops tables. Signed-off-by: David Howells <dhowells@redhat.com>
2018-02-06afs: Support the AFS dynamic rootDavid Howells1-42/+90
Support the AFS dynamic root which is a pseudo-volume that doesn't connect to any server resource, but rather is just a root directory that dynamically creates mountpoint directories where the name of such a directory is the name of the cell. Such a mount can be created thus: mount -t afs none /afs -o dyn Dynamic root superblocks aren't shared except by bind mounts and propagation. Cell root volumes can then be mounted by referring to them by name, e.g.: ls /afs/grand.central.org/ ls /afs/.grand.central.org/ The kernel will upcall to consult the DNS if the address wasn't supplied directly. Signed-off-by: David Howells <dhowells@redhat.com>
2017-12-01afs: Properly reset afs_vnode (inode) fieldsDavid Howells1-1/+13
When an AFS inode is allocated by afs_alloc_inode(), the allocated afs_vnode struct isn't necessarily reset from the last time it was used as an inode because the slab constructor is only invoked once when the memory is obtained from the page allocator. This means that information can leak from one inode to the next because we're not calling kmem_cache_zalloc(). Some of the information isn't reset, in particular the permit cache pointer. Bring the clearances up to date. Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Marc Dionne <marc.dionne@auristor.com>
2017-11-27Rename superblock flags (MS_xyz -> SB_xyz)Linus Torvalds1-2/+2
This is a pure automated search-and-replace of the internal kernel superblock flags. The s_flags are now called SB_*, with the names and the values for the moment mirroring the MS_* flags that they're equivalent to. Note how the MS_xyz flags are the ones passed to the mount system call, while the SB_xyz flags are what we then use in sb->s_flags. The script to do this was: # places to look in; re security/*: it generally should *not* be # touched (that stuff parses mount(2) arguments directly), but # there are two places where we really deal with superblock flags. FILES="drivers/mtd drivers/staging/lustre fs ipc mm \ include/linux/fs.h include/uapi/linux/bfs_fs.h \ security/apparmor/apparmorfs.c security/apparmor/include/lib.h" # the list of MS_... constants SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \ DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \ POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \ I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \ ACTIVE NOUSER" SED_PROG= for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done # we want files that contain at least one of MS_..., # with fs/namespace.c and fs/pnode.c excluded. L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c') for f in $L; do sed -i $f $SED_PROG; done Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-13afs: Get rid of the afs_writeback recordDavid Howells1-2/+2
Get rid of the afs_writeback record that kAFS is using to match keys with writes made by that key. Instead, keep a list of keys that have a file open for writing and/or sync'ing and iterate through those. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13afs: Overhaul volume and server record caching and fileserver rotationDavid Howells1-35/+57
The current code assumes that volumes and servers are per-cell and are never shared, but this is not enforced, and, indeed, public cells do exist that are aliases of each other. Further, an organisation can, say, set up a public cell and a private cell with overlapping, but not identical, sets of servers. The difference is purely in the database attached to the VL servers. The current code will malfunction if it sees a server in two cells as it assumes global address -> server record mappings and that each server is in just one cell. Further, each server may have multiple addresses - and may have addresses of different families (IPv4 and IPv6, say). To this end, the following structural changes are made: (1) Server record management is overhauled: (a) Server records are made independent of cell. The namespace keeps track of them, volume records have lists of them and each vnode has a server on which its callback interest currently resides. (b) The cell record no longer keeps a list of servers known to be in that cell. (c) The server records are now kept in a flat list because there's no single address to sort on. (d) Server records are now keyed by their UUID within the namespace. (e) The addresses for a server are obtained with the VL.GetAddrsU rather than with VL.GetEntryByName, using the server's UUID as a parameter. (f) Cached server records are garbage collected after a period of non-use and are counted out of existence before purging is allowed to complete. This protects the work functions against rmmod. (g) The servers list is now in /proc/fs/afs/servers. (2) Volume record management is overhauled: (a) An RCU-replaceable server list is introduced. This tracks both servers and their coresponding callback interests. (b) The superblock is now keyed on cell record and numeric volume ID. (c) The volume record is now tied to the superblock which mounts it, and is activated when mounted and deactivated when unmounted. This makes it easier to handle the cache cookie without causing a double-use in fscache. (d) The volume record is loaded from the VLDB using VL.GetEntryByNameU to get the server UUID list. (e) The volume name is updated if it is seen to have changed when the volume is updated (the update is keyed on the volume ID). (3) The vlocation record is got rid of and VLDB records are no longer cached. Sufficient information is stored in the volume record, though an update to a volume record is now no longer shared between related volumes (volumes come in bundles of three: R/W, R/O and backup). and the following procedural changes are made: (1) The fileserver cursor introduced previously is now fleshed out and used to iterate over fileservers and their addresses. (2) Volume status is checked during iteration, and the server list is replaced if a change is detected. (3) Server status is checked during iteration, and the address list is replaced if a change is detected. (4) The abort code is saved into the address list cursor and -ECONNABORTED returned in afs_make_call() if a remote abort happened rather than translating the abort into an error message. This allows actions to be taken depending on the abort code more easily. (a) If a VMOVED abort is seen then this is handled by rechecking the volume and restarting the iteration. (b) If a VBUSY, VRESTARTING or VSALVAGING abort is seen then this is handled by sleeping for a short period and retrying and/or trying other servers that might serve that volume. A message is also displayed once until the condition has cleared. (c) If a VOFFLINE abort is seen, then this is handled as VBUSY for the moment. (d) If a VNOVOL abort is seen, the volume is rechecked in the VLDB to see if it has been deleted; if not, the fileserver is probably indicating that the volume couldn't be attached and needs salvaging. (e) If statfs() sees one of these aborts, it does not sleep, but rather returns an error, so as not to block the umount program. (5) The fileserver iteration functions in vnode.c are now merged into their callers and more heavily macroised around the cursor. vnode.c is removed. (6) Operations on a particular vnode are serialised on that vnode because the server will lock that vnode whilst it operates on it, so a second op sent will just have to wait. (7) Fileservers are probed with FS.GetCapabilities before being used. This is where service upgrade will be done. (8) A callback interest on a fileserver is set up before an FS operation is performed and passed through to afs_make_call() so that it can be set on the vnode if the operation returns a callback. The callback interest is passed through to afs_iget() also so that it can be set there too. In general, record updating is done on an as-needed basis when we try to access servers, volumes or vnodes rather than offloading it to work items and special threads. Notes: (1) Pre AFS-3.4 servers are no longer supported, though this can be added back if necessary (AFS-3.4 was released in 1998). (2) VBUSY is retried forever for the moment at intervals of 1s. (3) /proc/fs/afs/<cell>/servers no longer exists. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13afs: Overhaul cell database managementDavid Howells1-5/+7
Overhaul the way that the in-kernel AFS client keeps track of cells in the following manner: (1) Cells are now held in an rbtree to make walking them quicker and RCU managed (though this is probably overkill). (2) Cells now have a manager work item that: (A) Looks after fetching and refreshing the VL server list. (B) Manages cell record lifetime, including initialising and destruction. (B) Manages cell record caching whereby threads are kept around for a certain time after last use and then destroyed. (C) Manages the FS-Cache index cookie for a cell. It is not permitted for a cookie to be in use twice, so we have to be careful to not allow a new cell record to exist at the same time as an old record of the same name. (3) Each AFS network namespace is given a manager work item that manages the cells within it, maintaining a single timer to prod cells into updating their DNS records. This uses the reduce_timer() facility to make the timer expire at the soonest timed event that needs happening. (4) When a module is being unloaded, cells and cell managers are now counted out using dec_after_work() to make sure the module text is pinned until after the data structures have been cleaned up. (5) Each cell's VL server list is now protected by a seqlock rather than a semaphore. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13afs: Overhaul permit cachingDavid Howells1-1/+0
Overhaul permit caching in AFS by making it per-vnode and sharing permit lists where possible. When most of the fileserver operations are called, they return a status structure indicating the (revised) details of the vnode or vnodes involved in the operation. This includes the access mark derived from the ACL (named CallerAccess in the protocol definition file). This is cacheable and if the ACL changes, the server will tell us that it is breaking the callback promise, at which point we can discard the currently cached permits. With this patch, the afs_permits structure has, at the end, an array of { key, CallerAccess } elements, sorted by key pointer. This is then cached in a hash table so that it can be shared between vnodes with the same access permits. Permit lists can only be shared if they contain the exact same set of key->CallerAccess mappings. Note that that table is global rather than being per-net_ns. If the keys in a permit list cross net_ns boundaries, there is no problem sharing the cached permits, since the permits are just integer masks. Since permit lists pin keys, the permit cache also makes it easier for a future patch to find all occurrences of a key and remove them by means of setting the afs_permits::invalidated flag and then clearing the appropriate key pointer. In such an event, memory barriers will need adding. Lastly, the permit caching is skipped if the server has sent either a vnode-specific or an entire-server callback since the start of the operation. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13afs: Overhaul the callback handlingDavid Howells1-4/+7
Overhaul the AFS callback handling by the following means: (1) Don't give up callback promises on vnodes that we are no longer using, rather let them just expire on the server or let the server break them. This is actually more efficient for the server as the callback lookup is expensive if there are lots of extant callbacks. (2) Only give up the callback promises we have from a server when the server record is destroyed. Then we can just give up *all* the callback promises on it in one go. (3) Servers can end up being shared between cells if cells are aliased, so don't add all the vnodes being backed by a particular server into a big FID-indexed tree on that server as there may be duplicates. Instead have each volume instance (~= superblock) register an interest in a server as it starts to make use of it and use this to allow the processor for callbacks from the server to find the superblock and thence the inode corresponding to the FID being broken by means of ilookup_nowait(). (4) Rather than iterating over the entire callback list when a mass-break comes in from the server, maintain a counter of mass-breaks in afs_server (cb_seq) and make afs_validate() check it against the copy in afs_vnode. It would be nice not to have to take a read_lock whilst doing this, but that's tricky without using RCU. (5) Save a ref on the fileserver we're using for a call in the afs_call struct so that we can access its cb_s_break during call decoding. (6) Write-lock around callback and status storage in a vnode and read-lock around getattr so that we don't see the status mid-update. This has the following consequences: (1) Data invalidation isn't seen until someone calls afs_validate() on a vnode. Unfortunately, we need to use a key to query the server, but getting one from a background thread is tricky without caching loads of keys all over the place. (2) Mass invalidation isn't seen until someone calls afs_validate(). (3) Callback breaking is going to hit the inode_hash_lock quite a bit. Could this be replaced with rcu_read_lock() since inodes are destroyed under RCU conditions. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13afs: Push the net ns pointer to more placesDavid Howells1-6/+6
Push the network namespace pointer to more places in AFS, including the afs_server structure (which doesn't hold a ref on the netns). In particular, afs_put_cell() now takes requires a net ns parameter so that it can safely alter the netns after decrementing the cell usage count - the cell will be deallocated by a background thread after being cached for a period, which means that it's not safe to access it after reducing its usage count. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13afs: Note the cell in the superblock info alsoDavid Howells1-22/+35
Keep a reference to the cell in the superblock info structure in addition to the volume and net pointers. This will make it easier to clean up in a future patch in which afs_put_volume() will need the cell pointer. Whilst we're at it, make the cell and volume getting functions return a pointer to the object got to make the call sites look neater. Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-13afs: Lay the groundwork for supporting network namespacesDavid Howells1-19/+26
Lay the groundwork for supporting network namespaces (netns) to the AFS filesystem by moving various global features to a network-namespace struct (afs_net) and providing an instance of this as a temporary global variable that everything uses via accessor functions for the moment. The following changes have been made: (1) Store the netns in the superblock info. This will be obtained from the mounter's nsproxy on a manual mount and inherited from the parent superblock on an automount. (2) The cell list is made per-netns. It can be viewed through /proc/net/afs/cells and also be modified by writing commands to that file. (3) The local workstation cell is set per-ns in /proc/net/afs/rootcell. This is unset by default. (4) The 'rootcell' module parameter, which sets a cell and VL server list modifies the init net namespace, thereby allowing an AFS root fs to be theoretically used. (5) The volume location lists and the file lock manager are made per-netns. (6) The AF_RXRPC socket and associated I/O bits are made per-ns. The various workqueues remain global for the moment. Changes still to be made: (1) /proc/fs/afs/ should be moved to /proc/net/afs/ and a symlink emplaced from the old name. (2) A per-netns subsys needs to be registered for AFS into which it can store its per-netns data. (3) Rather than the AF_RXRPC socket being opened on module init, it needs to be opened on the creation of a superblock in that netns. (4) The socket needs to be closed when the last superblock using it is destroyed and all outstanding client calls on it have been completed. This prevents a reference loop on the namespace. (5) It is possible that several namespaces will want to use AFS, in which case each one will need its own UDP port. These can either be set through /proc/net/afs/cm_port or the kernel can pick one at random. The init_ns gets 7001 by default. Other issues that need resolving: (1) The DNS keyring needs net-namespacing. (2) Where do upcalls go (eg. DNS request-key upcall)? (3) Need something like open_socket_in_file_ns() syscall so that AFS command line tools attempting to operate on an AFS file/volume have their RPC calls go to the right place. Signed-off-by: David Howells <dhowells@redhat.com>
2017-07-15Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-2/+43
Pull ->s_options removal from Al Viro: "Preparations for fsmount/fsopen stuff (coming next cycle). Everything gets moved to explicit ->show_options(), killing ->s_options off + some cosmetic bits around fs/namespace.c and friends. Basically, the stuff needed to work with fsmount series with minimum of conflicts with other work. It's not strictly required for this merge window, but it would reduce the PITA during the coming cycle, so it would be nice to have those bits and pieces out of the way" * 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: isofs: Fix isofs_show_options() VFS: Kill off s_options and helpers orangefs: Implement show_options 9p: Implement show_options isofs: Implement show_options afs: Implement show_options affs: Implement show_options befs: Implement show_options spufs: Implement show_options bpf: Implement show_options ramfs: Implement show_options pstore: Implement show_options omfs: Implement show_options hugetlbfs: Implement show_options VFS: Don't use save/replace_mount_options if not using generic_show_options VFS: Provide empty name qstr VFS: Make get_filesystem() return the affected filesystem VFS: Clean up whitespace in fs/namespace.c and fs/super.c Provide a function to create a NUL-terminated string from unterminated data
2017-07-11afs: Implement show_optionsDavid Howells1-2/+43
Implement the show_options superblock op for afs as part of a bid to get rid of s_options and generic_show_options() to make it easier to implement a context-based mount where the mount options can be passed individually over a file descriptor. Also implement the show_devname op to display the correct device name and thus avoid the need to display the cell= and volume= options. Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-afs@lists.infradead.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-09afs: Add metadata xattrsDavid Howells1-0/+1
Add xattrs to allow the user to get/set metadata in lieu of having pioctl() available. The following xattrs are now available: - "afs.cell" The name of the cell in which the vnode's volume resides. - "afs.fid" The volume ID, vnode ID and vnode uniquifier of the file as three hex numbers separated by colons. - "afs.volume" The name of the volume in which the vnode resides. For example: # getfattr -d -m ".*" /mnt/scratch getfattr: Removing leading '/' from absolute path names # file: mnt/scratch afs.cell="mycell.myorg.org" afs.fid="10000b:1:1" afs.volume="scratch" Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-20afs: Convert to separately allocated bdiJan Kara1-1/+4
Allocate struct backing_dev_info separately instead of embedding it inside the superblock. This unifies handling of bdi among users. CC: David Howells <dhowells@redhat.com> CC: linux-afs@lists.infradead.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-04-04mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macrosKirill A. Shutemov1-2/+2
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14kmemcg: account certain kmem allocations to memcgVladimir Davydov1-1/+1
Mark those kmem allocations that are known to be easily triggered from userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to memcg. For the list, see below: - threadinfo - task_struct - task_delay_info - pid - cred - mm_struct - vm_area_struct and vm_region (nommu) - anon_vma and anon_vma_chain - signal_struct - sighand_struct - fs_struct - files_struct - fdtable and fdtable->full_fds_bits - dentry and external_name - inode for all filesystems. This is the most tedious part, because most filesystems overwrite the alloc_inode method. The list is far from complete, so feel free to add more objects. Nevertheless, it should be close to "account everything" approach and keep most workloads within bounds. Malevolent users will be able to breach the limit, but this was possible even with the former "account everything" approach (simply because it did not account everything in fact). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15VFS: normal filesystems (and lustre): d_inode() annotationsDavid Howells1-1/+1
that's the bulk of filesystem drivers dealing with inodes of their own Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-03-03fs: Limit sys_mount to only request filesystem modules.Eric W. Biederman1-0/+1
Modify the request_module to prefix the file system type with "fs-" and add aliases to all of the filesystems that can be built as modules to match. A common practice is to build all of the kernel code and leave code that is not commonly needed as modules, with the result that many users are exposed to any bug anywhere in the kernel. Looking for filesystems with a fs- prefix limits the pool of possible modules that can be loaded by mount to just filesystems trivially making things safer with no real cost. Using aliases means user space can control the policy of which filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf with blacklist and alias directives. Allowing simple, safe, well understood work-arounds to known problematic software. This also addresses a rare but unfortunate problem where the filesystem name is not the same as it's module name and module auto-loading would not work. While writing this patch I saw a handful of such cases. The most significant being autofs that lives in the module autofs4. This is relevant to user namespaces because we can reach the request module in get_fs_type() without having any special permissions, and people get uncomfortable when a user specified string (in this case the filesystem type) goes all of the way to request_module. After having looked at this issue I don't think there is any particular reason to perform any filtering or permission checks beyond making it clear in the module request that we want a filesystem module. The common pattern in the kernel is to call request_module() without regards to the users permissions. In general all a filesystem module does once loaded is call register_filesystem() and go to sleep. Which means there is not much attack surface exposed by loading a filesytem module unless the filesystem is mounted. In a user namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT, which most filesystems do not set today. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Kees Cook <keescook@chromium.org> Reported-by: Kees Cook <keescook@google.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-02-13afs: Only allow mounting afs in the intial network namespaceEric W. Biederman1-0/+6
rxrpc sockets only work in the initial network namespace so it isn't possible to support afs in any other network namespace. Cc: David Howells <dhowells@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-10-02fs: push rcu_barrier() from deactivate_locked_super() to filesystemsKirill A. Shutemov1-0/+5
There's no reason to call rcu_barrier() on every deactivate_locked_super(). We only need to make sure that all delayed rcu free inodes are flushed before we destroy related cache. Removing rcu_barrier() from deactivate_locked_super() affects some fast paths. E.g. on my machine exit_group() of a last process in IPC namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14VFS: Pass mount flags to sget()David Howells1-2/+1
Pass mount flags to sget() so that it can use them in initialising a new superblock before the set function is called. They could also be passed to the compare function. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-03-20switch open-coded instances of d_make_root() to new helperAl Viro1-5/+2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03vfs: fix the stupidity with i_dentry in inode destructorsAl Viro1-1/+0
Seeing that just about every destructor got that INIT_LIST_HEAD() copied into it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once(); the cost of taking it into inode_init_always() will be negligible for pipes and sockets and negative for everything else. Not to mention the removal of boilerplate code from ->destroy_inode() instances... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-06-16AFS: Set s_id in the superblock to the volume nameDavid Howells1-0/+1
Set s_id in the superblock to the name of the AFS volume that this superblock corresponds to. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-06-12afs: fix sget() races, close leak on umountAl Viro1-41/+32
* set ->s_fs_info in set() callback passed to sget() * allocate the thing and set it up enough for afs_test_super() before making it visible * have it freed in ->kill_sb() (current tree simply leaks it) * have ->put_super() leave ->s_fs_info->volume alone; it's too early for dropping it; do that from ->kill_sb() after having called kill_anon_super(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-12switch afsAl Viro1-0/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-07fs: icache RCU free inodesNick Piggin1-1/+9
RCU free the struct inode. This will allow: - Subsequent store-free path walking patch. The inode must be consulted for permissions when walking, so an RCU inode reference is a must. - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want to take i_lock no longer need to take sb_inode_list_lock to walk the list in the first place. This will simplify and optimize locking. - Could remove some nested trylock loops in dcache code - Could potentially simplify things a bit in VM land. Do not need to take the page lock to follow page->mapping. The downsides of this is the performance cost of using RCU. In a simple creat/unlink microbenchmark, performance drops by about 10% due to inability to reuse cache-hot slab objects. As iterations increase and RCU freeing starts kicking over, this increases to about 20%. In cases where inode lifetimes are longer (ie. many inodes may be allocated during the average life span of a single inode), a lot of this cache reuse is not applicable, so the regression caused by this patch is smaller. The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU, however this adds some complexity to list walking and store-free path walking, so I prefer to implement this at a later date, if it is shown to be a win in real situations. I haven't found a regression in any non-micro benchmark so I doubt it will be a problem. Signed-off-by: Nick Piggin <npiggin@kernel.dk>