aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/fs/ceph/caps.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2022-10-04ceph: Use kcalloc for allocating multiple elementsKenneth Lee1-1/+1
Prefer using kcalloc(a, b) over kzalloc(a * b) as this improves semantics since kcalloc is intended for allocating an array of memory. Signed-off-by: Kenneth Lee <klee33@uw.edu> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-10-04ceph: no need to wait for transition RDCACHE|RD -> RDXiubo Li1-2/+6
For write when trying to get the Fwb caps we need to keep waiting on transition from WRBUFFER|WR -> WR to avoid a new WR sync write from going before a prior buffered writeback happens. While for read there is no need to wait on transition from RDCACHE|RD -> RD, and we can just exclude the revoking caps and force to sync read. Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-10-04ceph: wake up the waiters if any new caps comesXiubo Li1-0/+4
When new caps comes we need to wake up the waiters and also when revoking the caps, there also could be new caps comes. Link: https://tracker.ceph.com/issues/54044 Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03ceph: flush the dirty caps immediatelly when quota is approachingXiubo Li1-2/+3
When the quota is approaching we need to notify it to the MDS as soon as possible, or the client could write to the directory more than expected. This will flush the dirty caps without delaying after each write, though this couldn't prevent the real size of a directory exceed the quota but could prevent it as soon as possible. Link: https://tracker.ceph.com/issues/56180 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Luís Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03ceph: don't get the inline data for new creating filesXiubo Li1-1/+1
If the 'i_inline_version' is 1, that means the file is just new created and there shouldn't have any inline data in it, we should skip retrieving the inline data from MDS. This also could help reduce possiblity of dead lock issue introduce by the inline data and Fcr caps. Gradually we will remove the inline feature from kclient after ceph's scrub too have support to unline the inline data, currently this could help reduce the teuthology test failures. This is possiblly could also fix a bug that for some old clients if they couldn't explictly uninline the inline data when writing, the inline version will keep as 1 always. We may always reading non-exist data from inline data. Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03ceph: make change_auth_cap_ses a global symbolXiubo Li1-2/+2
Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-08-03ceph: don't leak snap_rwsem in handle_cap_grantJeff Layton1-14/+13
When handle_cap_grant is called on an IMPORT op, then the snap_rwsem is held and the function is expected to release it before returning. It currently fails to do that in all cases which could lead to a deadlock. Fixes: 6f05b30ea063 ("ceph: reset i_requested_max_size if file write is not wanted") Link: https://tracker.ceph.com/issues/55857 Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Luís Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-06-29ceph: wait on async create before checking caps for syncfsJeff Layton1-0/+1
Currently, we'll call ceph_check_caps, but if we're still waiting on the reply, we'll end up spinning around on the same inode in flush_dirty_session_caps. Wait for the async create reply before flushing caps. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/55823 Fixes: fbed7045f552 ("ceph: wait for async create reply before sending any cap messages") Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-06-09netfs: Fix gcc-12 warning by embedding vfs inode in netfs_i_contextDavid Howells1-52/+52
While randstruct was satisfied with using an open-coded "void *" offset cast for the netfs_i_context <-> inode casting, __builtin_object_size() as used by FORTIFY_SOURCE was not as easily fooled. This was causing the following complaint[1] from gcc v12: In file included from include/linux/string.h:253, from include/linux/ceph/ceph_debug.h:7, from fs/ceph/inode.c:2: In function 'fortify_memset_chk', inlined from 'netfs_i_context_init' at include/linux/netfs.h:326:2, inlined from 'ceph_alloc_inode' at fs/ceph/inode.c:463:2: include/linux/fortify-string.h:242:25: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning] 242 | __write_overflow_field(p_size_field, size); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Fix this by embedding a struct inode into struct netfs_i_context (which should perhaps be renamed to struct netfs_inode). The struct inode vfs_inode fields are then removed from the 9p, afs, ceph and cifs inode structs and vfs_inode is then simply changed to "netfs.inode" in those filesystems. Further, rename netfs_i_context to netfs_inode, get rid of the netfs_inode() function that converted a netfs_i_context pointer to an inode pointer (that can now be done with &ctx->inode) and rename the netfs_i_context() function to netfs_inode() (which is now a wrapper around container_of()). Most of the changes were done with: perl -p -i -e 's/vfs_inode/netfs.inode/'g \ `git grep -l 'vfs_inode' -- fs/{9p,afs,ceph,cifs}/*.[ch]` Kees suggested doing it with a pair structure[2] and a special declarator to insert that into the network filesystem's inode wrapper[3], but I think it's cleaner to embed it - and then it doesn't matter if struct randomisation reorders things. Dave Chinner suggested using a filesystem-specific VFS_I() function in each filesystem to convert that filesystem's own inode wrapper struct into the VFS inode struct[4]. Version #2: - Fix a couple of missed name changes due to a disabled cifs option. - Rename nfs_i_context to nfs_inode - Use "netfs" instead of "nic" as the member name in per-fs inode wrapper structs. [ This also undoes commit 507160f46c55 ("netfs: gcc-12: temporarily disable '-Wattribute-warning' for now") that is no longer needed ] Fixes: bc899ee1c898 ("netfs: Add a netfs inode context") Reported-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> cc: Jonathan Corbet <corbet@lwn.net> cc: Eric Van Hensbergen <ericvh@gmail.com> cc: Latchesar Ionkov <lucho@ionkov.net> cc: Dominique Martinet <asmadeus@codewreck.org> cc: Christian Schoenebeck <linux_oss@crudebyte.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: Steve French <smfrench@gmail.com> cc: William Kucharski <william.kucharski@oracle.com> cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> cc: Dave Chinner <david@fromorbit.com> cc: linux-doc@vger.kernel.org cc: v9fs-developer@lists.sourceforge.net cc: linux-afs@lists.infradead.org cc: ceph-devel@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: samba-technical@lists.samba.org cc: linux-fsdevel@vger.kernel.org cc: linux-hardening@vger.kernel.org Link: https://lore.kernel.org/r/d2ad3a3d7bdd794c6efb562d2f2b655fb67756b9.camel@kernel.org/ [1] Link: https://lore.kernel.org/r/20220517210230.864239-1-keescook@chromium.org/ [2] Link: https://lore.kernel.org/r/20220518202212.2322058-1-keescook@chromium.org/ [3] Link: https://lore.kernel.org/r/20220524101205.GI2306852@dread.disaster.area/ [4] Link: https://lore.kernel.org/r/165296786831.3591209.12111293034669289733.stgit@warthog.procyon.org.uk/ # v1 Link: https://lore.kernel.org/r/165305805651.4094995.7763502506786714216.stgit@warthog.procyon.org.uk # v2 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-05-25ceph: try to queue a writeback if revoking failsXiubo Li1-4/+24
If the pagecaches writeback just finished and the i_wrbuffer_ref reaches zero it will try to trigger ceph_check_caps(). But if just before ceph_check_caps() the i_wrbuffer_ref could be increased again by mmap/cache write, then the Fwb revoke will fail. We need to try to queue a writeback in this case instead of triggering the writeback by BDI's delayed work per 5 seconds. URL: https://tracker.ceph.com/issues/46904 URL: https://tracker.ceph.com/issues/55377 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-05-25ceph: rename unsafe_request_wait()Xiubo Li1-4/+4
Rename it to flush_mdlog_and_wait_inode_unsafe_requests() to make it more descriptive. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-05-25ceph: replace usage of found with dedicated list iterator variableJakob Koschel1-17/+15
To move the list iterator variable into the list_for_each_entry_*() macro in the future it should be avoided to use the list iterator variable after the loop body. To *never* use the list iterator variable after the loop it was concluded to use a separate iterator variable instead of a found boolean. This removes the need to use a found variable and simply checking if the variable was set, can determine if the break/goto was hit. Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-05-25ceph: use dedicated list iterator variableJakob Koschel1-3/+4
To move the list iterator variable into the list_for_each_entry_*() macro in the future it should be avoided to use the list iterator variable after the loop body. To *never* use the list iterator variable after the loop it was concluded to use a separate iterator variable. Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-04-25ceph: fix possible NULL pointer dereference for req->r_sessionXiubo Li1-0/+4
The request will be inserted into the ci->i_unsafe_dirops before assigning the req->r_session, so it's possible that we will hit NULL pointer dereference bug here. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/55327 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Aaron Tomlin <atomlin@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-04-25ceph: get snap_rwsem read lock in handle_cap_export for ceph_add_capNiels Dossche1-0/+3
ceph_add_cap says in its function documentation that the caller should hold the read lock on the session snap_rwsem. Furthermore, not only ceph_add_cap needs that lock, when it calls to ceph_lookup_snap_realm it eventually calls ceph_get_snap_realm which states via lockdep that snap_rwsem needs to be held. handle_cap_export calls ceph_add_cap without that mdsc->snap_rwsem held. Thus, since ceph_get_snap_realm and ceph_add_cap both need the lock, the common place to acquire that lock is inside handle_cap_export. Signed-off-by: Niels Dossche <dossche.niels@gmail.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-03-21ceph: assign the ci only when the inode isn't NULLXiubo Li1-1/+1
The ceph_find_inode() may will fail and return NULL. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-03-01ceph: wait for async create reply before sending any cap messagesJeff Layton1-0/+14
If we haven't received a reply to an async create request, then we don't want to send any cap messages to the MDS for that inode yet. Just have ceph_check_caps and __kick_flushing_caps return without doing anything, and have ceph_write_inode wait for the reply if we were asked to wait on the inode writeback. URL: https://tracker.ceph.com/issues/54107 Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Patrick Donnelly <pdonnell@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-01-26ceph: put the requests/sessions when it fails to alloc memoryXiubo Li1-18/+37
When failing to allocate the sessions memory we should make sure the req1 and req2 and the sessions get put. And also in case the max_sessions decreased so when kreallocate the new memory some sessions maybe missed being put. And if the max_sessions is 0 krealloc will return ZERO_SIZE_PTR, which will lead to a distinct access fault. URL: https://tracker.ceph.com/issues/53819 Fixes: e1a4541ec0b9 ("ceph: flush the mdlog before waiting on unsafe reqs") Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Venky Shankar <vshankar@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-01-20Merge tag 'ceph-for-5.17-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds1-2/+1
Pull ceph updates from Ilya Dryomov: "The highlight is the new mount "device" string syntax implemented by Venky Shankar. It solves some long-standing issues with using different auth entities and/or mounting different CephFS filesystems from the same cluster, remounting and also misleading /proc/mounts contents. The existing syntax of course remains to be maintained. On top of that, there is a couple of fixes for edge cases in quota and a new mount option for turning on unbuffered I/O mode globally instead of on a per-file basis with ioctl(CEPH_IOC_SYNCIO)" * tag 'ceph-for-5.17-rc1' of git://github.com/ceph/ceph-client: ceph: move CEPH_SUPER_MAGIC definition to magic.h ceph: remove redundant Lsx caps check ceph: add new "nopagecache" option ceph: don't check for quotas on MDS stray dirs ceph: drop send metrics debug message rbd: make const pointer spaces a static const array ceph: Fix incorrect statfs report for small quota ceph: mount syntax module parameter doc: document new CephFS mount device syntax ceph: record updated mon_addr on remount ceph: new device mount syntax libceph: rename parse_fsid() to ceph_parse_fsid() and export libceph: generalize addr/ip parsing based on delimiter
2022-01-13ceph: remove redundant Lsx caps checkXiubo Li1-2/+1
The newcaps has already included the Ls, no need to check it again. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-01-12Merge tag 'fscache-rewrite-20220111' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fsLinus Torvalds1-1/+2
Pull fscache rewrite from David Howells: "This is a set of patches that rewrites the fscache driver and the cachefiles driver, significantly simplifying the code compared to what's upstream, removing the complex operation scheduling and object state machine in favour of something much smaller and simpler. The series is structured such that the first few patches disable fscache use by the network filesystems using it, remove the cachefiles driver entirely and as much of the fscache driver as can be got away with without causing build failures in the network filesystems. The patches after that recreate fscache and then cachefiles, attempting to add the pieces in a logical order. Finally, the filesystems are reenabled and then the very last patch changes the documentation. [!] Note: I have dropped the cifs patch for the moment, leaving local caching in cifs disabled. I've been having trouble getting that working. I think I have it done, but it needs more testing (there seem to be some test failures occurring with v5.16 also from xfstests), so I propose deferring that patch to the end of the merge window. WHY REWRITE? ============ Fscache's operation scheduling API was intended to handle sequencing of cache operations, which were all required (where possible) to run asynchronously in parallel with the operations being done by the network filesystem, whilst allowing the cache to be brought online and offline and to interrupt service for invalidation. With the advent of the tmpfile capacity in the VFS, however, an opportunity arises to do invalidation much more simply, without having to wait for I/O that's actually in progress: Cachefiles can simply create a tmpfile, cut over the file pointer for the backing object attached to a cookie and abandon the in-progress I/O, dismissing it upon completion. Future work here would involve using Omar Sandoval's vfs_link() with AT_LINK_REPLACE[1] to allow an extant file to be displaced by a new hard link from a tmpfile as currently I have to unlink the old file first. These patches can also simplify the object state handling as I/O operations to the cache don't all have to be brought to a stop in order to invalidate a file. To that end, and with an eye on to writing a new backing cache model in the future, I've taken the opportunity to simplify the indexing structure. I've separated the index cookie concept from the file cookie concept by C type now. The former is now called a "volume cookie" (struct fscache_volume) and there is a container of file cookies. There are then just the two levels. All the index cookie levels are collapsed into a single volume cookie, and this has a single printable string as a key. For instance, an AFS volume would have a key of something like "afs,example.com,1000555", combining the filesystem name, cell name and volume ID. This is freeform, but must not have '/' chars in it. I've also eliminated all pointers back from fscache into the network filesystem. This required the duplication of a little bit of data in the cookie (cookie key, coherency data and file size), but it's not actually that much. This gets rid of problems with making sure we keep netfs data structures around so that the cache can access them. These patches mean that most of the code that was in the drivers before is simply gone and those drivers are now almost entirely new code. That being the case, there doesn't seem any particular reason to try and maintain bisectability across it. Further, there has to be a point in the middle where things are cut over as there's a single point everything has to go through (ie. /dev/cachefiles) and it can't be in use by two drivers at once. ISSUES YET OUTSTANDING ====================== There are some issues still outstanding, unaddressed by this patchset, that will need fixing in future patchsets, but that don't stop this series from being usable: (1) The cachefiles driver needs to stop using the backing filesystem's metadata to store information about what parts of the cache are populated. This is not reliable with modern extent-based filesystems. Fixing this is deferred to a separate patchset as it involves negotiation with the network filesystem and the VM as to how much data to download to fulfil a read - which brings me on to (2)... (2) NFS (and CIFS with the dropped patch) do not take account of how the cache would like I/O to be structured to meet its granularity requirements. Previously, the cache used page granularity, which was fine as the network filesystems also dealt in page granularity, and the backing filesystem (ext4, xfs or whatever) did whatever it did out of sight. However, we now have folios to deal with and the cache will now have to store its own metadata to track its contents. The change I'm looking at making for cachefiles is to store content bitmaps in one or more xattrs and making a bit in the map correspond to something like a 256KiB block. However, the size of an xattr and the fact that they have to be read/updated in one go means that I'm looking at covering 1GiB of data per 512-byte map and storing each map in an xattr. Cachefiles has the potential to grow into a fully fledged filesystem of its very own if I'm not careful. However, I'm also looking at changing things even more radically and going to a different model of how the cache is arranged and managed - one that's more akin to the way, say, openafs does things - which brings me on to (3)... (3) The way cachefilesd does culling is very inefficient for large caches and it would be better to move it into the kernel if I can as cachefilesd has to keep asking the kernel if it can cull a file. Changing the way the backend works would allow this to be addressed. BITS THAT MAY BE CONTROVERSIAL ============================== There are some bits I've added that may be controversial: (1) I've provided a flag, S_KERNEL_FILE, that cachefiles uses to check if a files is already being used by some other kernel service (e.g. a duplicate cachefiles cache in the same directory) and reject it if it is. This isn't entirely necessary, but it helps prevent accidental data corruption. I don't want to use S_SWAPFILE as that has other effects, but quite possibly swapon() should set S_KERNEL_FILE too. Note that it doesn't prevent userspace from interfering, though perhaps it should. (I have made it prevent a marked directory from being rmdir-able). (2) Cachefiles wants to keep the backing file for a cookie open whilst we might need to write to it from network filesystem writeback. The problem is that the network filesystem unuses its cookie when its file is closed, and so we have nothing pinning the cachefiles file open and it will get closed automatically after a short time to avoid EMFILE/ENFILE problems. Reopening the cache file, however, is a problem if this is being done due to writeback triggered by exit(). Some filesystems will oops if we try to open a file in that context because they want to access current->fs or suchlike. To get around this, I added the following: (A) An inode flag, I_PINNING_FSCACHE_WB, to be set on a network filesystem inode to indicate that we have a usage count on the cookie caching that inode. (B) A flag in struct writeback_control, unpinned_fscache_wb, that is set when __writeback_single_inode() clears the last dirty page from i_pages - at which point it clears I_PINNING_FSCACHE_WB and sets this flag. This has to be done here so that clearing I_PINNING_FSCACHE_WB can be done atomically with the check of PAGECACHE_TAG_DIRTY that clears I_DIRTY_PAGES. (C) A function, fscache_set_page_dirty(), which if it is not set, sets I_PINNING_FSCACHE_WB and calls fscache_use_cookie() to pin the cache resources. (D) A function, fscache_unpin_writeback(), to be called by ->write_inode() to unuse the cookie. (E) A function, fscache_clear_inode_writeback(), to be called when the inode is evicted, before clear_inode() is called. This cleans up any lingering I_PINNING_FSCACHE_WB. The network filesystem can then use these tools to make sure that fscache_write_to_cache() can write locally modified data to the cache as well as to the server. For the future, I'm working on write helpers for netfs lib that should allow this facility to be removed by keeping track of the dirty regions separately - but that's incomplete at the moment and is also going to be affected by folios, one way or another, since it deals with pages" Link: https://lore.kernel.org/all/510611.1641942444@warthog.procyon.org.uk/ Tested-by: Dominique Martinet <asmadeus@codewreck.org> # 9p Tested-by: kafs-testing@auristor.com # afs Tested-by: Jeff Layton <jlayton@kernel.org> # ceph Tested-by: Dave Wysochanski <dwysocha@redhat.com> # nfs Tested-by: Daire Byrne <daire@dneg.com> # nfs * tag 'fscache-rewrite-20220111' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (67 commits) 9p, afs, ceph, nfs: Use current_is_kswapd() rather than gfpflags_allow_blocking() fscache: Add a tracepoint for cookie use/unuse fscache: Rewrite documentation ceph: add fscache writeback support ceph: conversion to new fscache API nfs: Implement cache I/O by accessing the cache directly nfs: Convert to new fscache volume/cookie API 9p: Copy local writes to the cache when writing to the server 9p: Use fscache indexing rewrite and reenable caching afs: Skip truncation on the server of data we haven't written yet afs: Copy local writes to the cache when writing to the server afs: Convert afs to use the new fscache API fscache, cachefiles: Display stat of culling events fscache, cachefiles: Display stats of no-space events cachefiles: Allow cachefiles to actually function fscache, cachefiles: Store the volume coherency data cachefiles: Implement the I/O routines cachefiles: Implement cookie resize for truncate cachefiles: Implement begin and end I/O operation cachefiles: Implement backing file wrangling ...
2022-01-11ceph: conversion to new fscache APIJeff Layton1-1/+2
Now that the fscache API has been reworked and simplified, change ceph over to use it. With the old API, we would only instantiate a cookie when the file was open for reads. Change it to instantiate the cookie when the inode is instantiated and call use/unuse when the file is opened/closed. Also, ensure we resize the cached data on truncates, and invalidate the cache in response to the appropriate events. This will allow us to plumb in write support later. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20211129162907.149445-2-jlayton@kernel.org/ # v1 Link: https://lore.kernel.org/r/20211207134451.66296-2-jlayton@kernel.org/ # v2 Link: https://lore.kernel.org/r/163906984277.143852.14697110691303589000.stgit@warthog.procyon.org.uk/ # v2 Link: https://lore.kernel.org/r/163967188351.1823006.5065634844099079351.stgit@warthog.procyon.org.uk/ # v3 Link: https://lore.kernel.org/r/164021581427.640689.14128682147127509264.stgit@warthog.procyon.org.uk/ # v4
2021-12-01ceph: fix duplicate increment of opened_inodes metricHu Weiwen1-8/+8
opened_inodes is incremented twice when the same inode is opened twice with O_RDONLY and O_WRONLY respectively. To reproduce, run this python script, then check the metrics: import os for _ in range(10000): fd_r = os.open('a', os.O_RDONLY) fd_w = os.open('a', os.O_WRONLY) os.close(fd_r) os.close(fd_w) Fixes: 1dd8d4708136 ("ceph: metrics for opened files, pinned caps and opened inodes") Signed-off-by: Hu Weiwen <sehuww@mail.scut.edu.cn> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-11-08ceph: shut down access to inode when async create failsJeff Layton1-6/+6
Add proper error handling for when an async create fails. The inode never existed, so any dirty caps or data are now toast. We already d_drop the dentry in that case, but the now-stale inode may still be around. We want to shut down access to these inodes, and ensure that they can't harbor any more dirty data, which can cause problems at umount time. When this occurs, flag such inodes as being SHUTDOWN, and trash any caps and cap flushes that may be in flight for them, and invalidate the pagecache for the inode. Add a new helper that can check whether an inode or an entire mount is now shut down, and call it instead of accessing the mount_state directly in places where we test that now. URL: https://tracker.ceph.com/issues/51279 Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-11-08ceph: refactor remove_session_caps_cbJeff Layton1-0/+116
Move remove_capsnaps to caps.c. Move the part of remove_session_caps_cb under i_ceph_lock into a separate function that lives in caps.c. Have remove_session_caps_cb call the new helper after taking the lock. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-11-08ceph: don't use -ESTALE as special return code in try_get_cap_refsJeff Layton1-8/+8
In some cases, we may want to return -ESTALE if it ends up that we're dealing with an inode that no longer exists. Switch to using -EUCLEAN as the "special" error return. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-11-08ceph: print inode numbers instead of pointer valuesJeff Layton1-4/+5
We have a lot of log messages that print inode pointer values. This is of dubious utility. Switch a random assortment of the ones I've found most useful to use ceph_vinop to print the snap:inum tuple instead. [ idryomov: use . as a separator, break unnecessarily long lines ] Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-10-19ceph: fix handling of "meta" errorsJeff Layton1-9/+3
Currently, we check the wb_err too early for directories, before all of the unsafe child requests have been waited on. In order to fix that we need to check the mapping->wb_err later nearer to the end of ceph_fsync. We also have an overly-complex method for tracking errors after blocklisting. The errors recorded in cleanup_session_requests go to a completely separate field in the inode, but we end up reporting them the same way we would for any other error (in fsync). There's no real benefit to tracking these errors in two different places, since the only reporting mechanism for them is in fsync, and we'd need to advance them both every time. Given that, we can just remove i_meta_err, and convert the places that used it to instead just use mapping->wb_err instead. That also fixes the original problem by ensuring that we do a check_and_advance of the wb_err at the end of the fsync op. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/52864 Reported-by: Patrick Donnelly <pdonnell@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-21ceph: fix off by one bugs in unsafe_request_wait()Dan Carpenter1-2/+2
The "> max" tests should be ">= max" to prevent an out of bounds access on the next lines. Fixes: e1a4541ec0b9 ("ceph: flush the mdlog before waiting on unsafe reqs") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-03ceph: fix dereference of null pointer cfColin Ian King1-0/+3
Currently in the case where kmem_cache_alloc fails the null pointer cf is dereferenced when assigning cf->is_capsnap = false. Fix this by adding a null pointer check and return path. Cc: stable@vger.kernel.org Addresses-Coverity: ("Dereference null return") Fixes: b2f9fa1f3bd8 ("ceph: correctly handle releasing an embedded cap flush") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02ceph: lockdep annotations for try_nonblocking_invalidateJeff Layton1-0/+2
Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02ceph: don't WARN if we're forcibly removing the session capsXiubo Li1-9/+30
For example in the case of a forced umount, we'll remove all the session caps even if they are dirty. Move the warning to a wrapper function and make most of the callers use it. Call the core function when removing caps due to a forced umount. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02ceph: remove the capsnaps when removing capsXiubo Li1-17/+51
capsnaps will take inode references via ihold when queueing to flush. When force unmounting, the client will just close the sessions and may never get a flush reply, causing a leak and inode ref leak. Fix this by removing the capsnaps for an inode when removing the caps. URL: https://tracker.ceph.com/issues/52295 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02ceph: print more information when we can't find snaprealmJeff Layton1-6/+5
Print a bit more information when we can't find the realm during ceph_add_cap. Show both the inode number and the old realm inode number. Suggested-by: Sage Weil <sage@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02ceph: add ceph_change_snap_realm() helperJeff Layton1-33/+3
Consolidate some fiddly code for changing an inode's snap_realm into a new helper function, and change the callers to use it. While we're in here, nothing uses the i_snap_realm_counter field, so remove that from the inode. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Luis Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02ceph: flush the mdlog before waiting on unsafe reqsXiubo Li1-0/+76
For the client requests who will have unsafe and safe replies from MDS daemons, in the MDS side the MDS daemons won't flush the mdlog (journal log) immediatelly, because they think it's unnecessary. That's true for most cases but not all, likes the fsync request. The fsync will wait until all the unsafe replied requests to be safely replied. Normally if there have multiple threads or clients are running, the whole mdlog in MDS daemons could be flushed in time if any request will trigger the mdlog submit thread. So usually we won't experience the normal operations will stuck for a long time. But in case there has only one client with only thread is running, the stuck phenomenon maybe obvious and the worst case it must wait at most 5 seconds to wait the mdlog to be flushed by the MDS's tick thread periodically. This patch will trigger to flush the mdlog in the relevant and auth MDSes to which the in-flight requests are sent just before waiting the unsafe requests to finish. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02ceph: make iterate_sessions a global symbolXiubo Li1-25/+1
Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-09-02ceph: fix memory leak on decode error in ceph_handle_capsJeff Layton1-2/+3
If we hit a decoding error late in the frame, then we might exit the function without putting the pool_ns string. Ensure that we always put that reference on the way out of the function. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-08-25ceph: correctly handle releasing an embedded cap flushXiubo Li1-8/+13
The ceph_cap_flush structures are usually dynamically allocated, but the ceph_cap_snap has an embedded one. When force umounting, the client will try to remove all the session caps. During this, it will free them, but that should not be done with the ones embedded in a capsnap. Fix this by adding a new boolean that indicates that the cap flush is embedded in a capsnap, and skip freeing it if that's set. At the same time, switch to using list_del_init() when detaching the i_list and g_list heads. It's possible for a forced umount to remove these objects but then handle_cap_flushsnap_ack() races in and does the list_del_init() again, corrupting memory. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/52283 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-08-04ceph: reduce contention in ceph_check_delayed_caps()Luis Henriques1-1/+16
Function ceph_check_delayed_caps() is called from the mdsc->delayed_work workqueue and it can be kept looping for quite some time if caps keep being added back to the mdsc->cap_delay_list. This may result in the watchdog tainting the kernel with the softlockup flag. This patch breaks this loop if the caps have been recently (i.e. during the loop execution). Any new caps added to the list will be handled in the next run. Also, allow schedule_delayed() callers to explicitly set the delay value instead of defaulting to 5s, so we can ensure that it runs soon afterward if it looks like there is more work. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/46284 Signed-off-by: Luis Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29ceph: eliminate ceph_async_iput()Jeff Layton1-6/+3
Now that we don't need to hold session->s_mutex or the snap_rwsem when calling ceph_check_caps, we can eliminate ceph_async_iput and just use normal iput calls. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29ceph: don't take s_mutex in ceph_flush_snapsJeff Layton1-10/+3
Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Luis Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29ceph: don't take s_mutex in try_flush_capsJeff Layton1-14/+2
The s_mutex doesn't protect anything in this codepath. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Luis Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29ceph: don't take s_mutex or snap_rwsem in ceph_check_capsJeff Layton1-61/+11
These locks appear to be completely unnecessary. Almost all of this function is done under the inode->i_ceph_lock, aside from the actual sending of the message. Don't take either lock in this function. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Luis Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-06-29ceph: eliminate session->s_gen_ttl_lockJeff Layton1-9/+6
Turn s_cap_gen field into an atomic_t, and just rely on the fact that we hold the s_mutex when changing the s_cap_ttl field. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Luis Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-05-06Merge tag 'ceph-for-5.13-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds1-18/+9
Pull ceph updates from Ilya Dryomov: "Notable items here are - a series to take advantage of David Howells' netfs helper library from Jeff - three new filesystem client metrics from Xiubo - ceph.dir.rsnaps vxattr from Yanhu - two auth-related fixes from myself, marked for stable. Interspersed is a smattering of assorted fixes and cleanups across the filesystem" * tag 'ceph-for-5.13-rc1' of git://github.com/ceph/ceph-client: (24 commits) libceph: allow addrvecs with a single NONE/blank address libceph: don't set global_id until we get an auth ticket libceph: bump CephXAuthenticate encoding version ceph: don't allow access to MDS-private inodes ceph: fix up some bare fetches of i_size ceph: convert some PAGE_SIZE invocations to thp_size() ceph: support getting ceph.dir.rsnaps vxattr ceph: drop pinned_page parameter from ceph_get_caps ceph: fix inode leak on getattr error in __fh_to_dentry ceph: only check pool permissions for regular files ceph: send opened files/pinned caps/opened inodes metrics to MDS daemon ceph: avoid counting the same request twice or more ceph: rename the metric helpers ceph: fix kerneldoc copypasta over ceph_start_io_direct ceph: use attach/detach_page_private for tracking snap context ceph: don't use d_add in ceph_handle_snapdir ceph: don't clobber i_snap_caps on non-I_NEW inode ceph: fix fall-through warnings for Clang ceph: convert ceph_readpages to ceph_readahead ceph: convert ceph_write_begin to netfs_write_begin ...
2021-04-27ceph: fix up some bare fetches of i_sizeJeff Layton1-3/+3
We need to use i_size_read(), which properly handles the torn read case on 32-bit arches. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27ceph: drop pinned_page parameter from ceph_get_capsJeff Layton1-6/+5
All of the existing callers that don't set this to NULL just drop the page reference at some arbitrary point later in processing. There's no point in keeping a page reference that we don't use, so just drop the reference immediately after checking the Uptodate flag. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27ceph: fix fscache invalidationJeff Layton1-0/+1
Ensure that we invalidate the fscache whenever we invalidate the pagecache. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-04-27ceph: rip out old fscache readpage handlingJeff Layton1-9/+0
With the new netfs read helper functions, we won't need a lot of this infrastructure as it handles the pagecache pages itself. Rip out the read handling for now, and much of the old infrastructure that deals in individual pages. The cookie handling is mostly unchanged, however. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>