diff options
author | 2025-06-02 15:31:05 -0700 | |
---|---|---|
committer | 2025-06-02 15:31:05 -0700 | |
commit | 2619a6d413f4c3c4c1eddf63e83ecc345f250d07 (patch) | |
tree | 841c37f113d1a2b0f2a2421e27fc0dd2b023cda7 /Documentation | |
parent | Merge tag 'vfs-6.16-rc1.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs (diff) | |
parent | fuse: increase readdir buffer size (diff) | |
download | linux-rng-2619a6d413f4c3c4c1eddf63e83ecc345f250d07.tar.xz linux-rng-2619a6d413f4c3c4c1eddf63e83ecc345f250d07.zip |
Merge tag 'fuse-update-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
- Remove tmp page copying in writeback path (Joanne).
This removes ~300 lines and with that a lot of complexity related to
avoiding reclaim related deadlock. The old mechanism is replaced with
a mapping flag that tells the MM not to block reclaim waiting for
writeback to complete. The MM parts have been reviewed/acked by
respective maintainers.
- Convert more code to handle large folios (Joanne). This still just
adds the code to deal with large folios and does not enable them yet.
- Allow invalidating all cached lookups atomically (Luis Henriques).
This feature is useful for CernVMFS, which currently does this
iteratively.
- Align write prefaulting in fuse with generic one (Dave Hansen)
- Fix race causing invalid data to be cached when setting attributes on
different nodes of a distributed fs (Guang Yuan Wu)
- Update documentation for passthrough (Chen Linxuan)
- Add fdinfo about the device number associated with an opened
/dev/fuse instance (Chen Linxuan)
- Increase readdir buffer size (Miklos). This depends on a patch to VFS
readdir code that was already merged through Christians tree.
- Optimize io-uring request expiration (Joanne)
- Misc cleanups
* tag 'fuse-update-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits)
fuse: increase readdir buffer size
readdir: supply dir_context.count as readdir buffer size hint
fuse: don't allow signals to interrupt getdents copying
fuse: support large folios for writeback
fuse: support large folios for readahead
fuse: support large folios for queued writes
fuse: support large folios for stores
fuse: support large folios for symlinks
fuse: support large folios for folio reads
fuse: support large folios for writethrough writes
fuse: refactor fuse_fill_write_pages()
fuse: support large folios for retrieves
fuse: support copying large folios
fs: fuse: add dev id to /dev/fuse fdinfo
docs: filesystems: add fuse-passthrough.rst
MAINTAINERS: update filter of FUSE documentation
fuse: fix race between concurrent setattrs from multiple nodes
fuse: remove tmp folio for writebacks and internal rb tree
mm: skip folio reclaim in legacy memcg contexts for deadlockable mappings
fuse: optimize over-io-uring request expiration check
...
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/filesystems/fuse-passthrough.rst | 133 | ||||
-rw-r--r-- | Documentation/filesystems/index.rst | 1 |
2 files changed, 134 insertions, 0 deletions
diff --git a/Documentation/filesystems/fuse-passthrough.rst b/Documentation/filesystems/fuse-passthrough.rst new file mode 100644 index 000000000000..2b0e7c2da54a --- /dev/null +++ b/Documentation/filesystems/fuse-passthrough.rst @@ -0,0 +1,133 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================ +FUSE Passthrough +================ + +Introduction +============ + +FUSE (Filesystem in Userspace) passthrough is a feature designed to improve the +performance of FUSE filesystems for I/O operations. Typically, FUSE operations +involve communication between the kernel and a userspace FUSE daemon, which can +incur overhead. Passthrough allows certain operations on a FUSE file to bypass +the userspace daemon and be executed directly by the kernel on an underlying +"backing file". + +This is achieved by the FUSE daemon registering a file descriptor (pointing to +the backing file on a lower filesystem) with the FUSE kernel module. The kernel +then receives an identifier (``backing_id``) for this registered backing file. +When a FUSE file is subsequently opened, the FUSE daemon can, in its response to +the ``OPEN`` request, include this ``backing_id`` and set the +``FOPEN_PASSTHROUGH`` flag. This establishes a direct link for specific +operations. + +Currently, passthrough is supported for operations like ``read(2)``/``write(2)`` +(via ``read_iter``/``write_iter``), ``splice(2)``, and ``mmap(2)``. + +Enabling Passthrough +==================== + +To use FUSE passthrough: + + 1. The FUSE filesystem must be compiled with ``CONFIG_FUSE_PASSTHROUGH`` + enabled. + 2. The FUSE daemon, during the ``FUSE_INIT`` handshake, must negotiate the + ``FUSE_PASSTHROUGH`` capability and specify its desired + ``max_stack_depth``. + 3. The (privileged) FUSE daemon uses the ``FUSE_DEV_IOC_BACKING_OPEN`` ioctl + on its connection file descriptor (e.g., ``/dev/fuse``) to register a + backing file descriptor and obtain a ``backing_id``. + 4. When handling an ``OPEN`` or ``CREATE`` request for a FUSE file, the daemon + replies with the ``FOPEN_PASSTHROUGH`` flag set in + ``fuse_open_out::open_flags`` and provides the corresponding ``backing_id`` + in ``fuse_open_out::backing_id``. + 5. The FUSE daemon should eventually call ``FUSE_DEV_IOC_BACKING_CLOSE`` with + the ``backing_id`` to release the kernel's reference to the backing file + when it's no longer needed for passthrough setups. + +Privilege Requirements +====================== + +Setting up passthrough functionality currently requires the FUSE daemon to +possess the ``CAP_SYS_ADMIN`` capability. This requirement stems from several +security and resource management considerations that are actively being +discussed and worked on. The primary reasons for this restriction are detailed +below. + +Resource Accounting and Visibility +---------------------------------- + +The core mechanism for passthrough involves the FUSE daemon opening a file +descriptor to a backing file and registering it with the FUSE kernel module via +the ``FUSE_DEV_IOC_BACKING_OPEN`` ioctl. This ioctl returns a ``backing_id`` +associated with a kernel-internal ``struct fuse_backing`` object, which holds a +reference to the backing ``struct file``. + +A significant concern arises because the FUSE daemon can close its own file +descriptor to the backing file after registration. The kernel, however, will +still hold a reference to the ``struct file`` via the ``struct fuse_backing`` +object as long as it's associated with a ``backing_id`` (or subsequently, with +an open FUSE file in passthrough mode). + +This behavior leads to two main issues for unprivileged FUSE daemons: + + 1. **Invisibility to lsof and other inspection tools**: Once the FUSE + daemon closes its file descriptor, the open backing file held by the kernel + becomes "hidden." Standard tools like ``lsof``, which typically inspect + process file descriptor tables, would not be able to identify that this + file is still open by the system on behalf of the FUSE filesystem. This + makes it difficult for system administrators to track resource usage or + debug issues related to open files (e.g., preventing unmounts). + + 2. **Bypassing RLIMIT_NOFILE**: The FUSE daemon process is subject to + resource limits, including the maximum number of open file descriptors + (``RLIMIT_NOFILE``). If an unprivileged daemon could register backing files + and then close its own FDs, it could potentially cause the kernel to hold + an unlimited number of open ``struct file`` references without these being + accounted against the daemon's ``RLIMIT_NOFILE``. This could lead to a + denial-of-service (DoS) by exhausting system-wide file resources. + +The ``CAP_SYS_ADMIN`` requirement acts as a safeguard against these issues, +restricting this powerful capability to trusted processes. + +**NOTE**: ``io_uring`` solves this similar issue by exposing its "fixed files", +which are visible via ``fdinfo`` and accounted under the registering user's +``RLIMIT_NOFILE``. + +Filesystem Stacking and Shutdown Loops +-------------------------------------- + +Another concern relates to the potential for creating complex and problematic +filesystem stacking scenarios if unprivileged users could set up passthrough. +A FUSE passthrough filesystem might use a backing file that resides: + + * On the *same* FUSE filesystem. + * On another filesystem (like OverlayFS) which itself might have an upper or + lower layer that is a FUSE filesystem. + +These configurations could create dependency loops, particularly during +filesystem shutdown or unmount sequences, leading to deadlocks or system +instability. This is conceptually similar to the risks associated with the +``LOOP_SET_FD`` ioctl, which also requires ``CAP_SYS_ADMIN``. + +To mitigate this, FUSE passthrough already incorporates checks based on +filesystem stacking depth (``sb->s_stack_depth`` and ``fc->max_stack_depth``). +For example, during the ``FUSE_INIT`` handshake, the FUSE daemon can negotiate +the ``max_stack_depth`` it supports. When a backing file is registered via +``FUSE_DEV_IOC_BACKING_OPEN``, the kernel checks if the backing file's +filesystem stack depth is within the allowed limit. + +The ``CAP_SYS_ADMIN`` requirement provides an additional layer of security, +ensuring that only privileged users can create these potentially complex +stacking arrangements. + +General Security Posture +------------------------ + +As a general principle for new kernel features that allow userspace to instruct +the kernel to perform direct operations on its behalf based on user-provided +file descriptors, starting with a higher privilege requirement (like +``CAP_SYS_ADMIN``) is a conservative and common security practice. This allows +the feature to be used and tested while further security implications are +evaluated and addressed. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 32618512a965..11a599387266 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -99,6 +99,7 @@ Documentation for filesystem implementations. fuse fuse-io fuse-io-uring + fuse-passthrough inotify isofs nilfs2 |