aboutsummaryrefslogtreecommitdiffstats
path: root/fs/fuse
AgeCommit message (Collapse)AuthorFilesLines
2025-12-01Merge tag 'vfs-6.19-rc1.directory.locking' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfsLinus Torvalds1-11/+8
Pull directory locking updates from Christian Brauner: "This contains the work to add centralized APIs for directory locking operations. This series is part of a larger effort to change directory operation locking to allow multiple concurrent operations in a directory. The ultimate goal is to lock the target dentry(s) rather than the whole parent directory. To help with changing the locking protocol, this series centralizes locking and lookup in new helper functions. The helpers establish a pattern where it is the dentry that is being locked and unlocked (currently the lock is held on dentry->d_parent->d_inode, but that can change in the future). This also changes vfs_mkdir() to unlock the parent on failure, as well as dput()ing the dentry. This allows end_creating() to only require the target dentry (which may be IS_ERR() after vfs_mkdir()), not the parent" * tag 'vfs-6.19-rc1.directory.locking' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: nfsd: fix end_creating() conversion VFS: introduce end_creating_keep() VFS: change vfs_mkdir() to unlock on failure. ecryptfs: use new start_creating/start_removing APIs Add start_renaming_two_dentries() VFS/ovl/smb: introduce start_renaming_dentry() VFS/nfsd/ovl: introduce start_renaming() and end_renaming() VFS: add start_creating_killable() and start_removing_killable() VFS: introduce start_removing_dentry() smb/server: use end_removing_noperm for for target of smb2_create_link() VFS: introduce start_creating_noperm() and start_removing_noperm() VFS/nfsd/cachefiles/ovl: introduce start_removing() and end_removing() VFS/nfsd/cachefiles/ovl: add start_creating() and end_creating() VFS: tidy up do_unlinkat() VFS: introduce start_dirop() and end_dirop() debugfs: rename end_creating() to debugfs_end_creating()
2025-12-01Merge tag 'vfs-6.19-rc1.directory.delegations' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfsLinus Torvalds1-0/+1
Pull directory delegations update from Christian Brauner: "This contains the work for recall-only directory delegations for knfsd. Add support for simple, recallable-only directory delegations. This was decided at the fall NFS Bakeathon where the NFS client and server maintainers discussed how to merge directory delegation support. The approach starts with recallable-only delegations for several reasons: 1. RFC8881 has gaps that are being addressed in RFC8881bis. In particular, it requires directory position information for CB_NOTIFY callbacks, which is difficult to implement properly under Linux. The spec is being extended to allow that information to be omitted. 2. Client-side support for CB_NOTIFY still lags. The client side involves heuristics about when to request a delegation. 3. Early indication shows simple, recallable-only delegations can help performance. Anna Schumaker mentioned seeing a multi-minute speedup in xfstests runs with them enabled. With these changes, userspace can also request a read lease on a directory that will be recalled on conflicting accesses. This may be useful for applications like Samba. Users can disable leases altogether via the fs.leases-enable sysctl if needed. VFS changes: - Dedicated Type for Delegations Introduce struct delegated_inode to track inodes that may have delegations that need to be broken. This replaces the previous approach of passing raw inode pointers through the delegation breaking code paths, providing better type safety and clearer semantics for the delegation machinery. - Break parent directory delegations in open(..., O_CREAT) codepath - Allow mkdir to wait for delegation break on parent - Allow rmdir to wait for delegation break on parent - Add try_break_deleg calls for parents to vfs_link(), vfs_rename(), and vfs_unlink() - Make vfs_create(), vfs_mknod(), and vfs_symlink() break delegations on parent directory - Clean up argument list for vfs_create() - Expose delegation support to userland Filelock changes: - Make lease_alloc() take a flags argument - Rework the __break_lease API to use flags - Add struct delegated_inode - Push the S_ISREG check down to ->setlease handlers - Lift the ban on directory leases in generic_setlease NFSD changes: - Allow filecache to hold S_IFDIR files - Allow DELEGRETURN on directories - Wire up GET_DIR_DELEGATION handling Fixes: - Fix kernel-doc warnings in __fcntl_getlease - Add needed headers for new struct delegation definition" * tag 'vfs-6.19-rc1.directory.delegations' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: vfs: add needed headers for new struct delegation definition filelock: __fcntl_getlease: fix kernel-doc warnings vfs: expose delegation support to userland nfsd: wire up GET_DIR_DELEGATION handling nfsd: allow DELEGRETURN on directories nfsd: allow filecache to hold S_IFDIR files filelock: lift the ban on directory leases in generic_setlease vfs: make vfs_symlink break delegations on parent dir vfs: make vfs_mknod break delegations on parent directory vfs: make vfs_create break delegations on parent directory vfs: clean up argument list for vfs_create() vfs: break parent dir delegations in open(..., O_CREAT) codepath vfs: allow rmdir to wait for delegation break on parent vfs: allow mkdir to wait for delegation break on parent vfs: add try_break_deleg calls for parents to vfs_{link,rename,unlink} filelock: push the S_ISREG check down to ->setlease handlers filelock: add struct delegated_inode filelock: rework the __break_lease API to use flags filelock: make lease_alloc() take a flags argument
2025-12-01Merge tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfsLinus Torvalds1-2/+2
Pull vfs inode updates from Christian Brauner: "Features: - Hide inode->i_state behind accessors. Open-coded accesses prevent asserting they are done correctly. One obvious aspect is locking, but significantly more can be checked. For example it can be detected when the code is clearing flags which are already missing, or is setting flags when it is illegal (e.g., I_FREEING when ->i_count > 0) - Provide accessors for ->i_state, converts all filesystems using coccinelle and manual conversions (btrfs, ceph, smb, f2fs, gfs2, overlayfs, nilfs2, xfs), and makes plain ->i_state access fail to compile - Rework I_NEW handling to operate without fences, simplifying the code after the accessor infrastructure is in place Cleanups: - Move wait_on_inode() from writeback.h to fs.h - Spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb for clarity - Cosmetic fixes to LRU handling - Push list presence check into inode_io_list_del() - Touch up predicts in __d_lookup_rcu() - ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage - Assert on ->i_count in iput_final() - Assert ->i_lock held in __iget() Fixes: - Add missing fences to I_NEW handling" * tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits) dcache: touch up predicts in __d_lookup_rcu() fs: push list presence check into inode_io_list_del() fs: cosmetic fixes to lru handling fs: rework I_NEW handling to operate without fences fs: make plain ->i_state access fail to compile xfs: use the new ->i_state accessors nilfs2: use the new ->i_state accessors overlayfs: use the new ->i_state accessors gfs2: use the new ->i_state accessors f2fs: use the new ->i_state accessors smb: use the new ->i_state accessors ceph: use the new ->i_state accessors btrfs: use the new ->i_state accessors Manual conversion to use ->i_state accessors of all places not covered by coccinelle Coccinelle-based conversion to use ->i_state accessors fs: provide accessors for ->i_state fs: spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb fs: move wait_on_inode() from writeback.h to fs.h fs: add missing fences to I_NEW handling ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage ...
2025-12-01Merge tag 'vfs-6.19-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfsLinus Torvalds4-137/+172
Pull iomap updates from Christian Brauner: "FUSE iomap Support for Buffered Reads: This adds iomap support for FUSE buffered reads and readahead. This enables granular uptodate tracking with large folios so only non-uptodate portions need to be read. Also fixes a race condition with large folios + writeback cache that could cause data corruption on partial writes followed by reads. - Refactored iomap read/readahead bio logic into helpers - Added caller-provided callbacks for read operations - Moved buffered IO bio logic into new file - FUSE now uses iomap for read_folio and readahead Zero Range Folio Batch Support: Add folio batch support for iomap_zero_range() to handle dirty folios over unwritten mappings. Fix raciness issues where dirty data could be lost during zero range operations. - filemap_get_folios_tag_range() helper for dirty folio lookup - Optional zero range dirty folio processing - XFS fills dirty folios on zero range of unwritten mappings - Removed old partial EOF zeroing optimization DIO Write Completions from Interrupt Context: Restore pre-iomap behavior where pure overwrite completions run inline rather than being deferred to workqueue. Reduces context switches for high-performance workloads like ScyllaDB. - Removed unused IOCB_DIO_CALLER_COMP code - Error completions always run in user context (fixes zonefs) - Reworked REQ_FUA selection logic - Inverted IOMAP_DIO_INLINE_COMP to IOMAP_DIO_OFFLOAD_COMP Buffered IO Cleanups: Some performance and code clarity improvements: - Replace manual bitmap scanning with find_next_bit() - Simplify read skip logic for writes - Optimize pending async writeback accounting - Better variable naming - Documentation for iomap_finish_folio_write() requirements Misaligned Vectors for Zoned XFS: Enables sub-block aligned vectors in XFS always-COW mode for zoned devices via new IOMAP_DIO_FSBLOCK_ALIGNED flag. Bug Fixes: - Allocate s_dio_done_wq for async reads (fixes syzbot report after error completion changes) - Fix iomap_read_end() for already uptodate folios (regression fix)" * tag 'vfs-6.19-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (40 commits) iomap: allocate s_dio_done_wq for async reads as well iomap: fix iomap_read_end() for already uptodate folios iomap: invert the polarity of IOMAP_DIO_INLINE_COMP iomap: support write completions from interrupt context iomap: rework REQ_FUA selection iomap: always run error completions in user context fs, iomap: remove IOCB_DIO_CALLER_COMP iomap: use find_next_bit() for uptodate bitmap scanning iomap: use find_next_bit() for dirty bitmap scanning iomap: simplify when reads can be skipped for writes iomap: simplify ->read_folio_range() error handling for reads iomap: optimize pending async writeback accounting docs: document iomap writeback's iomap_finish_folio_write() requirement iomap: account for unaligned end offsets when truncating read range iomap: rename bytes_pending/bytes_accounted to bytes_submitted/bytes_not_submitted xfs: support sub-block aligned vectors in always COW mode iomap: add IOMAP_DIO_FSBLOCK_ALIGNED flag xfs: error tag to force zeroing on debug kernels iomap: remove old partial eof zeroing optimization xfs: fill dirty folios on zero range of unwritten mappings ...
2025-11-14VFS: introduce start_creating_noperm() and start_removing_noperm()NeilBrown1-11/+8
xfs, fuse, ipc/mqueue need variants of start_creating or start_removing which do not check permissions. This patch adds _noperm versions of these functions. Note that do_mq_open() was only calling mntget() so it could call path_put() - it didn't really need an extra reference on the mnt. Now it doesn't call mntget() and uses end_creating() which does the dput() half of path_put(). Also mq_unlink() previously passed d_inode(dentry->d_parent) as the dir inode to vfs_unlink(). This is after locking d_inode(mnt->mnt_root) These two inodes are the same, but normally calls use the textual parent. So I've changes the vfs_unlink() call to be given d_inode(mnt->mnt_root). Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neil@brown.name> -- changes since v2: - dir arg passed to vfs_unlink() in mq_unlink() changed to match the dir passed to lookup_noperm() - restore assignment to path->mnt even though the mntget() is removed. Link: https://patch.msgid.link/20251113002050.676694-7-neilb@ownmail.net Tested-by: syzbot@syzkaller.appspotmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12iomap: simplify ->read_folio_range() error handling for readsJoanne Koong1-8/+2
Instead of requiring that the caller calls iomap_finish_folio_read() even if the ->read_folio_range() callback returns an error, account for this internally in iomap instead, which makes the interface simpler and makes it match writeback's ->read_folio_range() error handling expectations. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://patch.msgid.link/20251111193658.3495942-6-joannelkoong@gmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12iomap: optimize pending async writeback accountingJoanne Koong1-2/+2
Pending writebacks must be accounted for to determine when all requests have completed and writeback on the folio should be ended. Currently this is done by atomically incrementing ifs->write_bytes_pending for every range to be written back. Instead, the number of atomic operations can be minimized by setting ifs->write_bytes_pending to the folio size, internally tracking how many bytes are written back asynchronously, and then after sending off all the requests, decrementing ifs->write_bytes_pending by the number of bytes not written back asynchronously. Now, for N ranges written back, only N + 2 atomic operations are required instead of 2N + 2. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://patch.msgid.link/20251111193658.3495942-5-joannelkoong@gmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-12filelock: push the S_ISREG check down to ->setlease handlersJeff Layton1-0/+1
When nfsd starts requesting directory delegations, setlease handlers may see requests for leases on directories. Push the !S_ISREG check down into the non-trivial setlease handlers, so we can selectively enable them where they're supported. FUSE is special: It's the only filesystem that supports atomic_open and allows kernel-internal leases. atomic_open is issued when the VFS doesn't know the state of the dentry being opened. If the file doesn't exist, it may be created, in which case the dir lease should be broken. The existing kernel-internal lease implementation has no provision for this. Ensure that we don't allow directory leases by default going forward by explicitly disabling them there. Reviewed-by: NeilBrown <neil@brown.name> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20251111-dir-deleg-ro-v6-4-52f3feebb2f2@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05virtio-fs: fix incorrect check for fsvq->kobjAlok Tiwari1-1/+1
In virtio_fs_add_queues_sysfs(), the code incorrectly checks fs->mqs_kobj after calling kobject_create_and_add(). Change the check to fsvq->kobj (fs->mqs_kobj -> fsvq->kobj) to ensure the per-queue kobject is successfully created. Fixes: 87cbdc396a31 ("virtio_fs: add sysfs entries for queue information") Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Link: https://patch.msgid.link/20251027104658.1668537-1-alok.a.tiwari@oracle.com Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05fuse: remove fc->blkbits workaround for partial writesJoanne Koong3-21/+2
Now that fuse is integrated with iomap for read/readahead, we can remove the workaround that was added in commit bd24d2108e9c ("fuse: fix fuseblk i_blkbits for iomap partial writes"), which was previously needed to avoid a race condition where an iomap partial write may be overwritten by a read if blocksize < PAGE_SIZE. Now that fuse does iomap read/readahead, this is protected against since there is granular uptodate tracking of blocks, which means this workaround can be removed. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Tested-by: syzbot@syzkaller.appspotmail.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05fuse: use iomap for readaheadJoanne Koong1-96/+124
Do readahead in fuse using iomap. This gives us granular uptodate tracking for large folios, which optimizes how much data needs to be read in. If some portions of the folio are already uptodate (eg through a prior write), we only need to read in the non-uptodate portions. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-05fuse: use iomap for read_folioJoanne Koong1-24/+56
Read folio data into the page cache using iomap. This gives us granular uptodate tracking for large folios, which optimizes how much data needs to be read in. If some portions of the folio are already uptodate (eg through a prior write), we only need to read in the non-uptodate portions. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-20Coccinelle-based conversion to use ->i_state accessorsMateusz Guzik1-2/+2
All places were patched by coccinelle with the default expecting that ->i_lock is held, afterwards entries got fixed up by hand to use unlocked variants as needed. The script: @@ expression inode, flags; @@ - inode->i_state & flags + inode_state_read(inode) & flags @@ expression inode, flags; @@ - inode->i_state &= ~flags + inode_state_clear(inode, flags) @@ expression inode, flag1, flag2; @@ - inode->i_state &= ~flag1 & ~flag2 + inode_state_clear(inode, flag1 | flag2) @@ expression inode, flags; @@ - inode->i_state |= flags + inode_state_set(inode, flags) @@ expression inode, flags; @@ - inode->i_state = flags + inode_state_assign(inode, flags) @@ expression inode, flags; @@ - flags = inode->i_state + flags = inode_state_read(inode) @@ expression inode, flags; @@ - READ_ONCE(inode->i_state) & flags + inode_state_read(inode) & flags Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-10Revert "fs: make vfs_fileattr_[get|set] return -EOPNOTSUPP"Andrey Albershteyn1-4/+0
This reverts commit 474b155adf3927d2c944423045757b54aa1ca4de. This patch caused regression in ioctl_setflags(). Underlying filesystems use EOPNOTSUPP to indicate that flag is not supported. This error is also gets converted in ioctl_setflags(). Therefore, for unsupported flags error changed from EOPNOSUPP to ENOIOCTLCMD. Link: https://lore.kernel.org/linux-xfs/a622643f-1585-40b0-9441-cf7ece176e83@kernel.org/ Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-03Merge tag 'fuse-update-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuseLinus Torvalds14-345/+519
Pull fuse updates from Miklos Szeredi: - Extend copy_file_range interface to be fully 64bit capable (Miklos) - Add selftest for fusectl (Chen Linxuan) - Move fuse docs into a separate directory (Bagas Sanjaya) - Allow fuse to enter freezable state in some cases (Sergey Senozhatsky) - Clean up writeback accounting after removing tmp page copies (Joanne) - Optimize virtiofs request handling (Li RongQing) - Add synchronous FUSE_INIT support (Miklos) - Allow server to request prune of unused inodes (Miklos) - Fix deadlock with AIO/sync release (Darrick) - Add some prep patches for block/iomap support (Darrick) - Misc fixes and cleanups * tag 'fuse-update-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (26 commits) fuse: move CREATE_TRACE_POINTS to a separate file fuse: move the backing file idr and code into a new source file fuse: enable FUSE_SYNCFS for all fuseblk servers fuse: capture the unique id of fuse commands being sent fuse: fix livelock in synchronous file put from fuseblk workers mm: fix lockdep issues in writeback handling fuse: add prune notification fuse: remove redundant calls to fuse_copy_finish() in fuse_notify() fuse: fix possibly missing fuse_copy_finish() call in fuse_notify() fuse: remove FUSE_NOTIFY_CODE_MAX from <uapi/linux/fuse.h> fuse: remove fuse_readpages_end() null mapping check fuse: fix references to fuse.rst -> fuse/fuse.rst fuse: allow synchronous FUSE_INIT fuse: zero initialize inode private data fuse: remove unused 'inode' parameter in fuse_passthrough_open virtio_fs: fix the hash table using in virtio_fs_enqueue_req() mm: remove BDI_CAP_WRITEBACK_ACCT fuse: use default writeback accounting virtio_fs: Remove redundant spinlock in virtio_fs_request_complete() fuse: remove unneeded offset assignment when filling write pages ...
2025-10-03Merge tag 'pull-finish_no_open' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-14/+7
Pull finish_no_open updates from Al Viro: "finish_no_open calling conventions change to simplify callers" * tag 'pull-finish_no_open' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: slightly simplify nfs_atomic_open() simplify gfs2_atomic_open() simplify fuse_atomic_open() simplify nfs_atomic_open_v23() simplify vboxsf_dir_atomic_open() simplify cifs_atomic_open() 9p: simplify v9fs_vfs_atomic_open_dotl() 9p: simplify v9fs_vfs_atomic_open() allow finish_no_open(file, ERR_PTR(-E...))
2025-10-02Merge tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mmLinus Torvalds1-1/+1
Pull MM updates from Andrew Morton: - "mm, swap: improve cluster scan strategy" from Kairui Song improves performance and reduces the failure rate of swap cluster allocation - "support large align and nid in Rust allocators" from Vitaly Wool permits Rust allocators to set NUMA node and large alignment when perforning slub and vmalloc reallocs - "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets for virtual address spaces for ops-level DAMOS filters - "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren Baghdasaryan reduces mmap_lock contention during reads of /proc/pid/maps - "mm/mincore: minor clean up for swap cache checking" from Kairui Song performs some cleanup in the swap code - "mm: vm_normal_page*() improvements" from David Hildenbrand provides code cleanup in the pagemap code - "add persistent huge zero folio support" from Pankaj Raghav provides a block layer speedup by optionalls making the huge_zero_pagepersistent, instead of releasing it when its refcount falls to zero - "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to the recently added Kexec Handover feature - "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap. To end the constant struggle with space shortage on 32-bit conflicting with 64-bit's needs - "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap code - "selftests/mm: Fix false positives and skip unsupported tests" from Donet Tom fixes a few things in our selftests code - "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised" from David Hildenbrand "allows individual processes to opt-out of THP=always into THP=madvise, without affecting other workloads on the system". It's a long story - the [1/N] changelog spells out the considerations - "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on the memdesc project. Please see https://kernelnewbies.org/MatthewWilcox/Memdescs and https://blogs.oracle.com/linux/post/introducing-memdesc - "Tiny optimization for large read operations" from Chi Zhiling improves the efficiency of the pagecache read path - "Better split_huge_page_test result check" from Zi Yan improves our folio splitting selftest code - "test that rmap behaves as expected" from Wei Yang adds some rmap selftests - "remove write_cache_pages()" from Christoph Hellwig removes that function and converts its two remaining callers - "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD selftests issues - "introduce kernel file mapped folios" from Boris Burkov introduces the concept of "kernel file pages". Using these permits btrfs to account its metadata pages to the root cgroup, rather than to the cgroups of random inappropriate tasks - "mm/pageblock: improve readability of some pageblock handling" from Wei Yang provides some readability improvements to the page allocator code - "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON to understand arm32 highmem - "tools: testing: Use existing atomic.h for vma/maple tests" from Brendan Jackman performs some code cleanups and deduplication under tools/testing/ - "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes a couple of 32-bit issues in tools/testing/radix-tree.c - "kasan: unify kasan_enabled() and remove arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific initialization code into a common arch-neutral implementation - "mm: remove zpool" from Johannes Weiner removes zspool - an indirection layer which now only redirects to a single thing (zsmalloc) - "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a couple of cleanups in the fork code - "mm: remove nth_page()" from David Hildenbrand makes rather a lot of adjustments at various nth_page() callsites, eventually permitting the removal of that undesirable helper function - "introduce kasan.write_only option in hw-tags" from Yeoreum Yun creates a KASAN read-only mode for ARM, using that architecture's memory tagging feature. It is felt that a read-only mode KASAN is suitable for use in production systems rather than debug-only - "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does some tidying in the hugetlb folio allocation code - "mm: establish const-correctness for pointer parameters" from Max Kellermann makes quite a number of the MM API functions more accurate about the constness of their arguments. This was getting in the way of subsystems (in this case CEPH) when they attempt to improving their own const/non-const accuracy - "Cleanup free_pages() misuse" from Vishal Moola fixes a number of code sites which were confused over when to use free_pages() vs __free_pages() - "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the mapletree code accessible to Rust. Required by nouveau and by its forthcoming successor: the new Rust Nova driver - "selftests/mm: split_huge_page_test: split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and some cleanups to the thp selftesting code - "mm, swap: introduce swap table as swap cache (phase I)" from Chris Li and Kairui Song is the first step along the path to implementing "swap tables" - a new approach to swap allocation and state tracking which is expected to yield speed and space improvements. This patchset itself yields a 5-20% performance benefit in some situations - "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc layer to clean up the ptdesc code a little - "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some issues in our 5-level pagetable selftesting code - "Minor fixes for memory allocation profiling" from Suren Baghdasaryan addresses a couple of minor issues in relatively new memory allocation profiling feature - "Small cleanups" from Matthew Wilcox has a few cleanups in preparation for more memdesc work - "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in furtherance of supporting arm highmem - "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad Anjum adds that compiler check to selftests code and fixes the fallout, by removing dead code - "Improvements to Victim Process Thawing and OOM Reaper Traversal Order" from zhongjinji makes a number of improvements in the OOM killer: mainly thawing a more appropriate group of victim threads so they can release resources - "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park is a bunch of small and unrelated fixups for DAMON - "mm/damon: define and use DAMON initialization check function" from SeongJae Park implement reliability and maintainability improvements to a recently-added bug fix - "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from SeongJae Park provides additional transparency to userspace clients of the DAMON_STAT information - "Expand scope of khugepaged anonymous collapse" from Dev Jain removes some constraints on khubepaged's collapsing of anon VMAs. It also increases the success rate of MADV_COLLAPSE against an anon vma - "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards removal of file_operations.mmap(). This patchset concentrates upon clearing up the treatment of stacked filesystems - "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau provides some fixes and improvements to mlock's tracking of large folios. /proc/meminfo's "Mlocked" field became more accurate - "mm/ksm: Fix incorrect accounting of KSM counters during fork" from Donet Tom fixes several user-visible KSM stats inaccuracies across forks and adds selftest code to verify these counters - "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses some potential but presently benign issues in KSM's mm_slot handling * tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits) mm: swap: check for stable address space before operating on the VMA mm: convert folio_page() back to a macro mm/khugepaged: use start_addr/addr for improved readability hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list alloc_tag: fix boot failure due to NULL pointer dereference mm: silence data-race in update_hiwater_rss mm/memory-failure: don't select MEMORY_ISOLATION mm/khugepaged: remove definition of struct khugepaged_mm_slot mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL hugetlb: increase number of reserving hugepages via cmdline selftests/mm: add fork inheritance test for ksm_merging_pages counter mm/ksm: fix incorrect KSM counter handling in mm_struct during fork drivers/base/node: fix double free in register_one_node() mm: remove PMD alignment constraint in execmem_vmalloc() mm/memory_hotplug: fix typo 'esecially' -> 'especially' mm/rmap: improve mlock tracking for large folios mm/filemap: map entire large folio faultaround mm/fault: try to map the entire file folio in finish_fault() mm/rmap: mlock large folios in try_to_unmap_one() mm/rmap: fix a mlock race condition in folio_referenced_one() ...
2025-10-02Merge tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linuxLinus Torvalds1-4/+4
Pull io_uring updates from Jens Axboe: - Store ring provided buffers locally for the users, rather than stuff them into struct io_kiocb. These types of buffers must always be fully consumed or recycled in the current context, and leaving them in struct io_kiocb is hence not a good ideas as that struct has a vastly different life time. Basically just an architecture cleanup that can help prevent issues with ring provided buffers in the future. - Support for mixed CQE sizes in the same ring. Before this change, a CQ ring either used the default 16b CQEs, or it was setup with 32b CQE using IORING_SETUP_CQE32. For use cases where a few 32b CQEs were needed, this caused everything else to use big CQEs. This is wasteful both in terms of memory usage, but also memory bandwidth for the posted CQEs. With IORING_SETUP_CQE_MIXED, applications may use request types that post both normal 16b and big 32b CQEs on the same ring. - Add helpers for async data management, to make it harder for opcode handlers to mess it up. - Add support for multishot for uring_cmd, which ublk can use. This helps improve efficiency, by providing a persistent request type that can trigger multiple CQEs. - Add initial support for ring feature querying. We had basic support for probe operations, but the API isn't great. Rather than expand that, add support for QUERY which is easily expandable and can cover a lot more cases than the existing probe support. This will help applications get a better idea of what operations are supported on a given host. - zcrx improvements from Pavel: - Improve refill entry alignment for better caching - Various cleanups, especially around deduplicating normal memory vs dmabuf setup. - Generalisation of the niov size (Patch 12). It's still hard coded to PAGE_SIZE on init, but will let the user to specify the rx buffer length on setup. - Syscall / synchronous bufer return. It'll be used as a slow fallback path for returning buffers when the refill queue is full. Useful for tolerating slight queue size misconfiguration or with inconsistent load. - Accounting more memory to cgroups. - Additional independent cleanups that will also be useful for mutli-area support. - Various fixes and cleanups * tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits) io_uring/cmd: drop unused res2 param from io_uring_cmd_done() io_uring: fix nvme's 32b cqes on mixed cq io_uring/query: cap number of queries io_uring/query: prevent infinite loops io_uring/zcrx: account niov arrays to cgroup io_uring/zcrx: allow synchronous buffer return io_uring/zcrx: introduce io_parse_rqe() io_uring/zcrx: don't adjust free cache space io_uring/zcrx: use guards for the refill lock io_uring/zcrx: reduce netmem scope in refill io_uring/zcrx: protect netdev with pp_lock io_uring/zcrx: rename dma lock io_uring/zcrx: make niov size variable io_uring/zcrx: set sgt for umem area io_uring/zcrx: remove dmabuf_offset io_uring/zcrx: deduplicate area mapping io_uring/zcrx: pass ifq to io_zcrx_alloc_fallback() io_uring/zcrx: check all niovs filled with dma addresses io_uring/zcrx: move area reg checks into io_import_area io_uring/zcrx: don't pass slot to io_zcrx_create_area ...
2025-09-29Merge tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfsLinus Torvalds2-2/+2
Pull vfs workqueue updates from Christian Brauner: "This contains various workqueue changes affecting the filesystem layer. Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This replaces the use of system_wq and system_unbound_wq. system_wq is a per-CPU workqueue which isn't very obvious from the name and system_unbound_wq is to be used when locality is not required. So this renames system_wq to system_percpu_wq, and system_unbound_wq to system_dfl_wq. This also adds a new WQ_PERCPU flag to allow the fs subsystem users to explicitly request the use of per-CPU behavior. Both WQ_UNBOUND and WQ_PERCPU flags coexist for one release cycle to allow callers to transition their calls. WQ_UNBOUND will be removed in a next release cycle" * tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: WQ_PERCPU added to alloc_workqueue users fs: replace use of system_wq with system_percpu_wq fs: replace use of system_unbound_wq with system_dfl_wq
2025-09-29Merge tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfsLinus Torvalds1-1/+1
Pull misc vfs updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Add "initramfs_options" parameter to set initramfs mount options. This allows to add specific mount options to the rootfs to e.g., limit the memory size - Add RWF_NOSIGNAL flag for pwritev2() Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE signal from being raised when writing on disconnected pipes or sockets. The flag is handled directly by the pipe filesystem and converted to the existing MSG_NOSIGNAL flag for sockets - Allow to pass pid namespace as procfs mount option Ever since the introduction of pid namespaces, procfs has had very implicit behaviour surrounding them (the pidns used by a procfs mount is auto-selected based on the mounting process's active pidns, and the pidns itself is basically hidden once the mount has been constructed) This implicit behaviour has historically meant that userspace was required to do some special dances in order to configure the pidns of a procfs mount as desired. Examples include: * In order to bypass the mnt_too_revealing() check, Kubernetes creates a procfs mount from an empty pidns so that user namespaced containers can be nested (without this, the nested containers would fail to mount procfs) But this requires forking off a helper process because you cannot just one-shot this using mount(2) * Container runtimes in general need to fork into a container before configuring its mounts, which can lead to security issues in the case of shared-pidns containers (a privileged process in the pidns can interact with your container runtime process) While SUID_DUMP_DISABLE and user namespaces make this less of an issue, the strict need for this due to a minor uAPI wart is kind of unfortunate Things would be much easier if there was a way for userspace to just specify the pidns they want. So this pull request contains changes to implement a new "pidns" argument which can be set using fsconfig(2): fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd); fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0); or classic mount(2) / mount(8): // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid"); Cleanups: - Remove the last references to EXPORT_OP_ASYNC_LOCK - Make file_remove_privs_flags() static - Remove redundant __GFP_NOWARN when GFP_NOWAIT is used - Use try_cmpxchg() in start_dir_add() - Use try_cmpxchg() in sb_init_done_wq() - Replace offsetof() with struct_size() in ioctl_file_dedupe_range() - Remove vfs_ioctl() export - Replace rwlock() with spinlock in epoll code as rwlock causes priority inversion on preempt rt kernels - Make ns_entries in fs/proc/namespaces const - Use a switch() statement() in init_special_inode() just like we do in may_open() - Use struct_size() in dir_add() in the initramfs code - Use str_plural() in rd_load_image() - Replace strcpy() with strscpy() in find_link() - Rename generic_delete_inode() to inode_just_drop() and generic_drop_inode() to inode_generic_drop() - Remove unused arguments from fcntl_{g,s}et_rw_hint() Fixes: - Document @name parameter for name_contains_dotdot() helper - Fix spelling mistake - Always return zero from replace_fd() instead of the file descriptor number - Limit the size for copy_file_range() in compat mode to prevent a signed overflow - Fix debugfs mount options not being applied - Verify the inode mode when loading it from disk in minixfs - Verify the inode mode when loading it from disk in cramfs - Don't trigger automounts with RESOLVE_NO_XDEV If openat2() was called with RESOLVE_NO_XDEV it didn't traverse through automounts, but could still trigger them - Add FL_RECLAIM flag to show_fl_flags() macro so it appears in tracepoints - Fix unused variable warning in rd_load_image() on s390 - Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD - Use ns_capable_noaudit() when determining net sysctl permissions - Don't call path_put() under namespace semaphore in listmount() and statmount()" * tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits) fcntl: trim arguments listmount: don't call path_put() under namespace semaphore statmount: don't call path_put() under namespace semaphore pid: use ns_capable_noaudit() when determining net sysctl permissions fs: rename generic_delete_inode() and generic_drop_inode() init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD initramfs: Replace strcpy() with strscpy() in find_link() initrd: Use str_plural() in rd_load_image() initramfs: Use struct_size() helper to improve dir_add() initrd: Fix unused variable warning in rd_load_image() on s390 fs: use the switch statement in init_special_inode() fs/proc/namespaces: make ns_entries const filelock: add FL_RECLAIM to show_fl_flags() macro eventpoll: Replace rwlock with spinlock selftests/proc: add tests for new pidns APIs procfs: add "pidns" mount option pidns: move is-ancestor logic to helper openat2: don't trigger automounts with RESOLVE_NO_XDEV namei: move cross-device check to __traverse_mounts namei: remove LOOKUP_NO_XDEV check from handle_mounts ...
2025-09-25fuse: move CREATE_TRACE_POINTS to a separate fileDarrick J. Wong3-2/+15
Before we start adding new tracepoints for fuse+iomap, move the tracepoint creation itself to a separate source file so that we don't have to start pulling iomap dependencies into dev.c just for the iomap structures. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-09-25fuse: move the backing file idr and code into a new source fileDarrick J. Wong4-183/+208
iomap support for fuse is also going to want the ability to attach backing files to a fuse filesystem. Move the fuse_backing code into a separate file so that both can use it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-09-23fuse: enable FUSE_SYNCFS for all fuseblk serversDarrick J. Wong1-0/+1
Turn on syncfs for all fuseblk servers so that the ones in the know can flush cached intermediate data and logs to disk. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-09-23fuse: capture the unique id of fuse commands being sentDarrick J. Wong4-8/+31
The fuse_request_{send,end} tracepoints capture the value of req->in.h.unique in the trace output. It would be really nice if we could use this to match a request to its response for debugging and latency analysis, but the call to trace_fuse_request_send occurs before the unique id has been set: fuse_request_send: connection 8388608 req 0 opcode 1 (FUSE_LOOKUP) len 107 fuse_request_end: connection 8388608 req 6 len 16 error -2 (Notice that req moves from 0 to 6) Move the callsites to trace_fuse_request_send to after the unique id has been set by introducing a helper to do that for standard fuse_req requests. FUSE_FORGET requests are not covered by this because they appear to be synthesized into the event stream without a fuse_req object and are never replied to. Requests that are aborted without ever having been submitted to the fuse server retain the behavior that only the fuse_request_end tracepoint shows up in the trace record, and with req==0. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-09-23fuse: fix livelock in synchronous file put from fuseblk workersDarrick J. Wong1-1/+7
I observed a hang when running generic/323 against a fuseblk server. This test opens a file, initiates a lot of AIO writes to that file descriptor, and closes the file descriptor before the writes complete. Unsurprisingly, the AIO exerciser threads are mostly stuck waiting for responses from the fuseblk server: # cat /proc/372265/task/372313/stack [<0>] request_wait_answer+0x1fe/0x2a0 [fuse] [<0>] __fuse_simple_request+0xd3/0x2b0 [fuse] [<0>] fuse_do_getattr+0xfc/0x1f0 [fuse] [<0>] fuse_file_read_iter+0xbe/0x1c0 [fuse] [<0>] aio_read+0x130/0x1e0 [<0>] io_submit_one+0x542/0x860 [<0>] __x64_sys_io_submit+0x98/0x1a0 [<0>] do_syscall_64+0x37/0xf0 [<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53 But the /weird/ part is that the fuseblk server threads are waiting for responses from itself: # cat /proc/372210/task/372232/stack [<0>] request_wait_answer+0x1fe/0x2a0 [fuse] [<0>] __fuse_simple_request+0xd3/0x2b0 [fuse] [<0>] fuse_file_put+0x9a/0xd0 [fuse] [<0>] fuse_release+0x36/0x50 [fuse] [<0>] __fput+0xec/0x2b0 [<0>] task_work_run+0x55/0x90 [<0>] syscall_exit_to_user_mode+0xe9/0x100 [<0>] do_syscall_64+0x43/0xf0 [<0>] entry_SYSCALL_64_after_hwframe+0x4b/0x53 The fuseblk server is fuse2fs so there's nothing all that exciting in the server itself. So why is the fuse server calling fuse_file_put? The commit message for the fstest sheds some light on that: "By closing the file descriptor before calling io_destroy, you pretty much guarantee that the last put on the ioctx will be done in interrupt context (during I/O completion). Aha. AIO fgets a new struct file from the fd when it queues the ioctx. The completion of the FUSE_WRITE command from userspace causes the fuse server to call the AIO completion function. The completion puts the struct file, queuing a delayed fput to the fuse server task. When the fuse server task returns to userspace, it has to run the delayed fput, which in the case of a fuseblk server, it does synchronously. Sending the FUSE_RELEASE command sychronously from fuse server threads is a bad idea because a client program can initiate enough simultaneous AIOs such that all the fuse server threads end up in delayed_fput, and now there aren't any threads left to handle the queued fuse commands. Fix this by only using asynchronous fputs when closing files, and leave a comment explaining why. Cc: stable@vger.kernel.org # v2.6.38 Fixes: 5a18ec176c934c ("fuse: fix hang of single threaded fuseblk filesystem") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-09-23io_uring/cmd: drop unused res2 param from io_uring_cmd_done()Caleb Sander Mateos1-4/+4
Commit 79525b51acc1 ("io_uring: fix nvme's 32b cqes on mixed cq") split out a separate io_uring_cmd_done32() helper for ->uring_cmd() implementations that return 32-byte CQEs. The res2 value passed to io_uring_cmd_done() is now unused because __io_uring_cmd_done() ignores it when is_cqe32 is passed as false. So drop the parameter from io_uring_cmd_done() to simplify the callers and clarify that it's not possible to return an extra value beyond the 32-bit CQE result. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-19fs: replace use of system_wq with system_percpu_wqMarco Crivellari2-2/+2
Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. system_wq is a per-CPU worqueue, yet nothing in its name tells about that CPU affinity constraint, which is very often not required by users. Make it clear by adding a system_percpu_wq to all the fs subsystem. The old wq will be kept for a few release cylces. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://lore.kernel.org/20250916082906.77439-3-marco.crivellari@suse.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-16simplify fuse_atomic_open()Al Viro1-14/+7
Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-09-15fs: rename generic_delete_inode() and generic_drop_inode()Mateusz Guzik1-1/+1
generic_delete_inode() is rather misleading for what the routine is doing. inode_just_drop() should be much clearer. The new naming is inconsistent with generic_drop_inode(), so rename that one as well with inode_ as the suffix. No functional changes. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-13mm: introduce memdesc_flags_tMatthew Wilcox (Oracle)1-1/+1
Patch series "Add and use memdesc_flags_t". At some point struct page will be separated from struct slab and struct folio. This is a step towards that by introducing a type for the 'flags' word of all three structures. This gives us a certain amount of type safety by establishing that some of these unsigned longs are different from other unsigned longs in that they contain things like node ID, section number and zone number in the upper bits. That lets us have functions that can be easily called by anyone who has a slab, folio or page (but not easily by anyone else) to get the node or zone. There's going to be some unusual merge problems with this as some odd bits of the kernel decide they want to print out the flags value or something similar by writing page->flags and now they'll need to write page->flags.f instead. That's most of the churn here. Maybe we should be removing these things from the debug output? This patch (of 11): Wrap the unsigned long flags in a typedef. In upcoming patches, this will provide a strong hint that you can't just pass a random unsigned long to functions which take this as an argument. [willy@infradead.org: s/flags/flags.f/ in several architectures] Link: https://lkml.kernel.org/r/aKMgPRLD-WnkPxYm@casper.infradead.org [nicola.vetrini@gmail.com: mips: fix compilation error] Link: https://lore.kernel.org/lkml/CA+G9fYvkpmqGr6wjBNHY=dRp71PLCoi2341JxOudi60yqaeUdg@mail.gmail.com/ Link: https://lkml.kernel.org/r/20250825214245.1838158-1-nicola.vetrini@gmail.com Link: https://lkml.kernel.org/r/20250805172307.1302730-1-willy@infradead.org Link: https://lkml.kernel.org/r/20250805172307.1302730-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-09-05fuse: virtio_fs: fix page fault for DAX page addressHaiyue Wang1-1/+1
The commit ced17ee32a99 ("Revert "virtio: reject shm region if length is zero"") exposes the following DAX page fault bug (this fix the failure that getting shm region alway returns false because of zero length): The commit 21aa65bf82a7 ("mm: remove callers of pfn_t functionality") handles the DAX physical page address incorrectly: the removed macro 'phys_to_pfn_t()' should be replaced with 'PHYS_PFN()'. [ 1.390321] BUG: unable to handle page fault for address: ffffd3fb40000008 [ 1.390875] #PF: supervisor read access in kernel mode [ 1.391257] #PF: error_code(0x0000) - not-present page [ 1.391509] PGD 0 P4D 0 [ 1.391626] Oops: Oops: 0000 [#1] SMP NOPTI [ 1.391806] CPU: 6 UID: 1000 PID: 162 Comm: weston Not tainted 6.17.0-rc3-WSL2-STABLE #2 PREEMPT(none) [ 1.392361] RIP: 0010:dax_to_folio+0x14/0x60 [ 1.392653] Code: 52 c9 c3 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 c1 ef 05 48 c1 e7 06 48 03 3d 34 b5 31 01 <48> 8b 57 08 48 89 f8 f6 c2 01 75 2b 66 90 c3 cc cc cc cc f7 c7 ff [ 1.393727] RSP: 0000:ffffaf7d04407aa8 EFLAGS: 00010086 [ 1.394003] RAX: 000000a000000000 RBX: ffffaf7d04407bb0 RCX: 0000000000000000 [ 1.394524] RDX: ffffd17b40000008 RSI: 0000000000000083 RDI: ffffd3fb40000000 [ 1.394967] RBP: 0000000000000011 R08: 000000a000000000 R09: 0000000000000000 [ 1.395400] R10: 0000000000001000 R11: ffffaf7d04407c10 R12: 0000000000000000 [ 1.395806] R13: ffffa020557be9c0 R14: 0000014000000001 R15: 0000725970e94000 [ 1.396268] FS: 000072596d6d2ec0(0000) GS:ffffa0222dc59000(0000) knlGS:0000000000000000 [ 1.396715] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1.397100] CR2: ffffd3fb40000008 CR3: 000000011579c005 CR4: 0000000000372ef0 [ 1.397518] Call Trace: [ 1.397663] <TASK> [ 1.397900] dax_insert_entry+0x13b/0x390 [ 1.398179] dax_fault_iter+0x2a5/0x6c0 [ 1.398443] dax_iomap_pte_fault+0x193/0x3c0 [ 1.398750] __fuse_dax_fault+0x8b/0x270 [ 1.398997] ? vm_mmap_pgoff+0x161/0x210 [ 1.399175] __do_fault+0x30/0x180 [ 1.399360] do_fault+0xc4/0x550 [ 1.399547] __handle_mm_fault+0x8e3/0xf50 [ 1.399731] ? do_syscall_64+0x72/0x1e0 [ 1.399958] handle_mm_fault+0x192/0x2f0 [ 1.400204] do_user_addr_fault+0x20e/0x700 [ 1.400418] exc_page_fault+0x66/0x150 [ 1.400602] asm_exc_page_fault+0x26/0x30 [ 1.400831] RIP: 0033:0x72596d1bf703 [ 1.401076] Code: 31 f6 45 31 e4 48 8d 15 b3 73 00 00 e8 06 03 00 00 8b 83 68 01 00 00 e9 8e fa ff ff 0f 1f 00 48 8b 44 24 08 4c 89 ee 48 89 df <c7> 00 21 43 34 12 e8 72 09 00 00 e9 6a fa ff ff 0f 1f 44 00 00 e8 [ 1.402172] RSP: 002b:00007ffc350f6dc0 EFLAGS: 00010202 [ 1.402488] RAX: 0000725970e94000 RBX: 00005b7c642c2560 RCX: 0000725970d359a7 [ 1.402898] RDX: 0000000000000003 RSI: 00007ffc350f6dc0 RDI: 00005b7c642c2560 [ 1.403284] RBP: 00007ffc350f6e90 R08: 000000000000000d R09: 0000000000000000 [ 1.403634] R10: 00007ffc350f6dd8 R11: 0000000000000246 R12: 0000000000000001 [ 1.404078] R13: 00007ffc350f6dc0 R14: 0000725970e29ce0 R15: 0000000000000003 [ 1.404450] </TASK> [ 1.404570] Modules linked in: [ 1.404821] CR2: ffffd3fb40000008 [ 1.405029] ---[ end trace 0000000000000000 ]--- [ 1.405323] RIP: 0010:dax_to_folio+0x14/0x60 [ 1.405556] Code: 52 c9 c3 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 c1 ef 05 48 c1 e7 06 48 03 3d 34 b5 31 01 <48> 8b 57 08 48 89 f8 f6 c2 01 75 2b 66 90 c3 cc cc cc cc f7 c7 ff [ 1.406639] RSP: 0000:ffffaf7d04407aa8 EFLAGS: 00010086 [ 1.406910] RAX: 000000a000000000 RBX: ffffaf7d04407bb0 RCX: 0000000000000000 [ 1.407379] RDX: ffffd17b40000008 RSI: 0000000000000083 RDI: ffffd3fb40000000 [ 1.407800] RBP: 0000000000000011 R08: 000000a000000000 R09: 0000000000000000 [ 1.408246] R10: 0000000000001000 R11: ffffaf7d04407c10 R12: 0000000000000000 [ 1.408666] R13: ffffa020557be9c0 R14: 0000014000000001 R15: 0000725970e94000 [ 1.409170] FS: 000072596d6d2ec0(0000) GS:ffffa0222dc59000(0000) knlGS:0000000000000000 [ 1.409608] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1.409977] CR2: ffffd3fb40000008 CR3: 000000011579c005 CR4: 0000000000372ef0 [ 1.410437] Kernel panic - not syncing: Fatal exception [ 1.410857] Kernel Offset: 0xc000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) Fixes: 21aa65bf82a7 ("mm: remove callers of pfn_t functionality") Signed-off-by: Haiyue Wang <haiyuewa@163.com> Link: https://lore.kernel.org/20250904120339.972-1-haiyuewa@163.com Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-05fuse: add prune notificationMiklos Szeredi3-0/+56
Some fuse servers need to prune their caches, which can only be done if the kernel's own dentry/inode caches are pruned first to avoid dangling references. Add FUSE_NOTIFY_PRUNE, which takes an array of node ID's to try and get rid of. Inodes with active references are skipped. A similar functionality is already provided by FUSE_NOTIFY_INVAL_ENTRY with the FUSE_EXPIRE_ONLY flag. Differences in the interface are FUSE_NOTIFY_INVAL_ENTRY: - can only prune one dentry - dentry is determined by parent ID and name - if inode has multiple aliases (cached hard links), then they would have to be invalidated individually to be able to get rid of the inode FUSE_NOTIFY_PRUNE: - can prune multiple inodes - inodes determined by their node ID - aliases are taken care of automatically Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-09-05fuse: remove redundant calls to fuse_copy_finish() in fuse_notify()Miklos Szeredi1-56/+23
Remove tail calls of fuse_copy_finish(), since it's now done from fuse_dev_do_write(). No functional change. Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-09-05fuse: fix possibly missing fuse_copy_finish() call in fuse_notify()Miklos Szeredi1-1/+1
In case of FUSE_NOTIFY_RESEND and FUSE_NOTIFY_INC_EPOCH fuse_copy_finish() isn't called. Fix by always calling fuse_copy_finish() after fuse_notify(). It's a no-op if called a second time. Fixes: 760eac73f9f6 ("fuse: Introduce a new notification type for resend pending requests") Fixes: 2396356a945b ("fuse: add more control over cache invalidation behaviour") Cc: <stable@vger.kernel.org> # v6.9 Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-09-02fuse: remove fuse_readpages_end() null mapping checkJoanne Koong1-13/+11
Remove extra logic in fuse_readpages_end() that checks against null folio mappings. This was added in commit ce534fb05292 ("fuse: allow splice to move pages"): "Since the remove_from_page_cache() + add_to_page_cache_locked() are non-atomic it is possible that the page cache is repopulated in between the two and add_to_page_cache_locked() will fail. This could be fixed by creating a new atomic replace_page_cache_page() function. fuse_readpages_end() needed to be reworked so it works even if page->mapping is NULL for some or all pages which can happen if the add_to_page_cache_locked() failed." Commit ef6a3c63112e ("mm: add replace_page_cache_page() function") added atomic page cache replacement, which means the check against null mappings can be removed. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-09-02fuse: fix references to fuse.rst -> fuse/fuse.rstMiklos Szeredi2-2/+2
Commit 6be0ddb20200 ("Documentation: fuse: Consolidate FUSE docs into its own subdirectory") moved fuse docs to a subdirectory but didn't update references inside the kernel tree. Fixes: 6be0ddb20200 ("Documentation: fuse: Consolidate FUSE docs into its own subdirectory") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202508261621.EaNMWVjm-lkp@intel.com/ Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-09-02fuse: allow synchronous FUSE_INITMiklos Szeredi6-34/+113
FUSE_INIT has always been asynchronous with mount. That means that the server processed this request after the mount syscall returned. This means that FUSE_INIT can't supply the root inode's ID, hence it currently has a hardcoded value. There are other limitations such as not being able to perform getxattr during mount, which is needed by selinux. To remove these limitations allow server to process FUSE_INIT while initializing the in-core super block for the fuse filesystem. This can only be done if the server is prepared to handle this, so add FUSE_DEV_IOC_SYNC_INIT ioctl, which a) lets the server know whether this feature is supported, returning ENOTTY othewrwise. b) lets the kernel know to perform a synchronous initialization The implementation is slightly tricky, since fuse_dev/fuse_conn are set up only during super block creation. This is solved by setting the private data of the fuse device file to a special value ((struct fuse_dev *) 1) and waiting for this to be turned into a proper fuse_dev before commecing with operations on the device file. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-27fuse: zero initialize inode private dataMiklos Szeredi1-7/+4
This is slightly tricky, since the VFS uses non-zeroing allocation to preserve some fields that are left in a consistent state. Reported-by: Chunsheng Luo <luochunsheng@ustc.edu> Closes: https://lore.kernel.org/all/20250818083224.229-1-luochunsheng@ustc.edu/ Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-27fuse: remove unused 'inode' parameter in fuse_passthrough_openChunsheng Luo3-8/+3
The 'inode' parameter in fuse_passthrough_open() is never referenced in the function implementation. Signed-off-by: Chunsheng Luo <luochunsheng@ustc.edu> Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-27fuse: Block access to folio overlimitEdward Adam Davis1-1/+1
syz reported a slab-out-of-bounds Write in fuse_dev_do_write. When the number of bytes to be retrieved is truncated to the upper limit by fc->max_pages and there is an offset, the oob is triggered. Add a loop termination condition to prevent overruns. Fixes: 3568a9569326 ("fuse: support large folios for retrieves") Reported-by: syzbot+2d215d165f9354b9c4ea@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=2d215d165f9354b9c4ea Tested-by: syzbot+2d215d165f9354b9c4ea@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-27virtio_fs: fix the hash table using in virtio_fs_enqueue_req()Li RongQing2-2/+5
The original commit be2ff42c5d6e ("fuse: Use hash table to link processing request") converted fuse_pqueue->processing to a hash table, but virtio_fs_enqueue_req() was not updated to use it correctly. So use fuse_pqueue->processing as a hash table, this make the code more coherent Co-developed-by: Fushuai Wang <wangfushuai@baidu.com> Signed-off-by: Fushuai Wang <wangfushuai@baidu.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-27fuse: use default writeback accountingJoanne Koong2-10/+1
commit 0c58a97f919c ("fuse: remove tmp folio for writebacks and internal rb tree") removed temp folios for dirty page writeback. Consequently, fuse can now use the default writeback accounting. With switching fuse to use default writeback accounting, there are some added benefits. This updates wb->writeback_inodes tracking as well now and updates writeback throughput estimates after writeback completion. This commit also removes inc_wb_stat() and dec_wb_stat(). These have no callers anymore now that fuse does not call them. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-27virtio_fs: Remove redundant spinlock in virtio_fs_request_complete()Li RongQing1-3/+0
Since clear_bit is an atomic operation, the spinlock is redundant and can be removed, reducing lock contention is good for performance. Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-27fuse: remove unneeded offset assignment when filling write pagesJoanne Koong1-1/+0
With the change in aee03ea7ff98 ("fuse: support large folios for writethrough writes"), this old line for setting ap->descs[0].offset is now obsolete and unneeded. This should have been removed as part of aee03ea7ff98. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Fixes: aee03ea7ff98 ("fuse: support large folios for writethrough writes") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-27fuse: use freezable wait in fuse_get_req()Sergey Senozhatsky1-2/+3
Use freezable wait in fuse_get_req() so that it won't block the system from entering suspend: Freezing user space processes failed after 20.009 seconds Call trace: __switch_to+0xcc/0x168 schedule+0x57c/0x1138 fuse_get_req+0xd0/0x2b0 fuse_simple_request+0x120/0x620 fuse_getxattr+0xe4/0x158 fuse_xattr_get+0x2c/0x48 __vfs_getxattr+0x160/0x1d8 get_vfs_caps_from_disk+0x74/0x1a8 __audit_inode+0x244/0x4d8 user_path_at_empty+0x2e0/0x390 __arm64_sys_faccessat+0xdc/0x260 Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-27fuse: add COPY_FILE_RANGE_64 that allows large copiesMiklos Szeredi2-12/+35
The FUSE protocol uses struct fuse_write_out to convey the return value of copy_file_range, which is restricted to uint32_t. But the COPY_FILE_RANGE interface supports a 64-bit size copies and there's no reason why copies should be limited to 32-bit. Introduce a new op COPY_FILE_RANGE_64, which is identical, except the number of bytes copied is returned in a 64-bit value. If the fuse server does not support COPY_FILE_RANGE_64, fall back to COPY_FILE_RANGE. Reported-by: Florian Weimer <fweimer@redhat.com> Closes: https://lore.kernel.org/all/lhuh5ynl8z5.fsf@oldenburg.str.redhat.com/ Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-26fuse: fix fuseblk i_blkbits for iomap partial writesJoanne Koong3-2/+21
On regular fuse filesystems, i_blkbits is set to PAGE_SHIFT which means any iomap partial writes will mark the entire folio as uptodate. However fuseblk filesystems work differently and allow the blocksize to be less than the page size. As such, this may lead to data corruption if fuseblk sets its blocksize to less than the page size, uses the writeback cache, and does a partial write, then a read and the read happens before the write has undergone writeback, since the folio will not be marked uptodate from the partial write so the read will read in the entire folio from disk, which will overwrite the partial write. The long-term solution for this, which will also be needed for fuse to enable large folios with the writeback cache on, is to have fuse also use iomap for folio reads, but until that is done, the cleanest workaround is to use the page size for fuseblk's internal kernel inode blksize/blkbits values while maintaining current behavior for stat(). This was verified using ntfs-3g: $ sudo mkfs.ntfs -f -c 512 /dev/vdd1 $ sudo ntfs-3g /dev/vdd1 ~/fuseblk $ stat ~/fuseblk/hi.txt IO Block: 512 Fixes: a4c9ab1d4975 ("fuse: use iomap for buffered writes") Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-26fuse: reflect cached blocksize if blocksize was changedJoanne Koong3-0/+12
As pointed out by Miklos[1], in the fuse_update_get_attr() path, the attributes returned to stat may be cached values instead of fresh ones fetched from the server. In the case where the server returned a modified blocksize value, we need to cache it and reflect it back to stat if values are not re-fetched since we now no longer directly change inode->i_blkbits. Link: https://lore.kernel.org/linux-fsdevel/CAJfpeguCOxeVX88_zPd1hqziB_C+tmfuDhZP5qO2nKmnb-dTUA@mail.gmail.com/ [1] Fixes: 542ede096e48 ("fuse: keep inode->i_blkbits constant") Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-26fuse: prevent overflow in copy_file_range return valueMiklos Szeredi1-1/+1
The FUSE protocol uses struct fuse_write_out to convey the return value of copy_file_range, which is restricted to uint32_t. But the COPY_FILE_RANGE interface supports a 64-bit size copies. Currently the number of bytes copied is silently truncated to 32-bit, which may result in poor performance or even failure to copy in case of truncation to zero. Reported-by: Florian Weimer <fweimer@redhat.com> Closes: https://lore.kernel.org/all/lhuh5ynl8z5.fsf@oldenburg.str.redhat.com/ Fixes: 88bc7d5097a1 ("fuse: add support for copy_file_range()") Cc: <stable@vger.kernel.org> # v4.20 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-08-26fuse: check if copy_file_range() returns larger than requested sizeMiklos Szeredi1-0/+3
Just like write(), copy_file_range() should check if the return value is less or equal to the requested number of bytes. Reported-by: Chunsheng Luo <luochunsheng@ustc.edu> Closes: https://lore.kernel.org/all/20250807062425.694-1-luochunsheng@ustc.edu/ Fixes: 88bc7d5097a1 ("fuse: add support for copy_file_range()") Cc: <stable@vger.kernel.org> # v4.20 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>