Diffstat (limited to 'Documentation/filesystems')
-rw-r--r--  Documentation/filesystems/9p.rst | 116
-rw-r--r--  Documentation/filesystems/affs.rst | 16
-rw-r--r--  Documentation/filesystems/afs.rst | 12
-rw-r--r--  Documentation/filesystems/api-summary.rst | 18
-rw-r--r--  Documentation/filesystems/autofs-mount-control.rst | 8
-rw-r--r--  Documentation/filesystems/autofs.rst | 10
-rw-r--r--  Documentation/filesystems/bcachefs/CodingStyle.rst | 186
-rw-r--r--  Documentation/filesystems/bcachefs/SubmittingPatches.rst | 105
-rw-r--r--  Documentation/filesystems/bcachefs/casefolding.rst | 108
-rw-r--r--  Documentation/filesystems/bcachefs/errorcodes.rst | 30
-rw-r--r--  Documentation/filesystems/bcachefs/future/idle_work.rst | 78
-rw-r--r--  Documentation/filesystems/bcachefs/index.rst | 38
-rw-r--r--  Documentation/filesystems/befs.rst | 4
-rw-r--r--  Documentation/filesystems/btrfs.rst | 17
-rw-r--r--  Documentation/filesystems/buffer.rst | 12
-rw-r--r--  Documentation/filesystems/caching/backend-api.rst | 850
-rw-r--r--  Documentation/filesystems/caching/cachefiles.rst | 190
-rw-r--r--  Documentation/filesystems/caching/fscache.rst | 531
-rw-r--r--  Documentation/filesystems/caching/index.rst | 4
-rw-r--r--  Documentation/filesystems/caching/netfs-api.rst | 1136
-rw-r--r--  Documentation/filesystems/caching/object.rst | 313
-rw-r--r--  Documentation/filesystems/caching/operations.rst | 210
-rw-r--r--  Documentation/filesystems/ceph.rst | 57
-rw-r--r--  Documentation/filesystems/coda.rst | 6
-rw-r--r--  Documentation/filesystems/configfs.rst | 52
-rw-r--r--  Documentation/filesystems/dax.rst | 306
-rw-r--r--  Documentation/filesystems/dax.txt | 268
-rw-r--r--  Documentation/filesystems/debugfs.rst | 57
-rw-r--r--  Documentation/filesystems/devpts.rst | 4
-rw-r--r--  Documentation/filesystems/directory-locking.rst | 343
-rw-r--r--  Documentation/filesystems/dlmfs.rst | 4
-rw-r--r--  Documentation/filesystems/efivarfs.rst | 2
-rw-r--r--  Documentation/filesystems/erofs.rst | 327
-rw-r--r--  Documentation/filesystems/ext2.rst | 5
-rw-r--r--  Documentation/filesystems/ext4/about.rst | 2
-rw-r--r--  Documentation/filesystems/ext4/atomic_writes.rst | 225
-rw-r--r--  Documentation/filesystems/ext4/attributes.rst | 68
-rw-r--r--  Documentation/filesystems/ext4/bigalloc.rst | 2
-rw-r--r--  Documentation/filesystems/ext4/bitmaps.rst | 6
-rw-r--r--  Documentation/filesystems/ext4/blockgroup.rst | 38
-rw-r--r--  Documentation/filesystems/ext4/blockmap.rst | 2
-rw-r--r--  Documentation/filesystems/ext4/blocks.rst | 2
-rw-r--r--  Documentation/filesystems/ext4/checksums.rst | 26
-rw-r--r--  Documentation/filesystems/ext4/directory.rst | 187
-rw-r--r--  Documentation/filesystems/ext4/eainode.rst | 10
-rw-r--r--  Documentation/filesystems/ext4/globals.rst | 1
-rw-r--r--  Documentation/filesystems/ext4/group_descr.rst | 126
-rw-r--r--  Documentation/filesystems/ext4/ifork.rst | 60
-rw-r--r--  Documentation/filesystems/ext4/inlinedata.rst | 8
-rw-r--r--  Documentation/filesystems/ext4/inodes.rst | 316
-rw-r--r--  Documentation/filesystems/ext4/journal.rst | 374
-rw-r--r--  Documentation/filesystems/ext4/mmp.rst | 36
-rw-r--r--  Documentation/filesystems/ext4/orphan.rst | 42
-rw-r--r--  Documentation/filesystems/ext4/overview.rst | 3
-rw-r--r--  Documentation/filesystems/ext4/special_inodes.rst | 19
-rw-r--r--  Documentation/filesystems/ext4/super.rst | 574
-rw-r--r--  Documentation/filesystems/f2fs.rst | 656
-rw-r--r--  Documentation/filesystems/fiemap.rst | 49
-rw-r--r--  Documentation/filesystems/files.rst | 53
-rw-r--r--  Documentation/filesystems/fscrypt.rst | 679
-rw-r--r--  Documentation/filesystems/fsverity.rst | 501
-rw-r--r--  Documentation/filesystems/fuse-io-uring.rst | 99
-rw-r--r--  Documentation/filesystems/fuse-io.rst | 3
-rw-r--r--  Documentation/filesystems/fuse-passthrough.rst | 133
-rw-r--r--  Documentation/filesystems/fuse.rst | 31
-rw-r--r--  Documentation/filesystems/gfs2-glocks.rst | 60
-rw-r--r--  Documentation/filesystems/gfs2.rst | 37
-rw-r--r--  Documentation/filesystems/hfs.rst | 2
-rw-r--r--  Documentation/filesystems/hpfs.rst | 2
-rw-r--r--  Documentation/filesystems/idmappings.rst | 1039
-rw-r--r--  Documentation/filesystems/index.rst | 24
-rw-r--r--  Documentation/filesystems/iomap/design.rst | 462
-rw-r--r--  Documentation/filesystems/iomap/index.rst | 13
-rw-r--r--  Documentation/filesystems/iomap/operations.rst | 744
-rw-r--r--  Documentation/filesystems/iomap/porting.rst | 120
-rw-r--r--  Documentation/filesystems/journalling.rst | 103
-rw-r--r--  Documentation/filesystems/locking.rst | 382
-rw-r--r--  Documentation/filesystems/locks.rst | 17
-rw-r--r--  Documentation/filesystems/mandatory-locking.rst | 188
-rw-r--r--  Documentation/filesystems/mount_api.rst | 54
-rw-r--r--  Documentation/filesystems/multigrain-ts.rst | 125
-rw-r--r--  Documentation/filesystems/netfs_library.rst | 1051
-rw-r--r--  Documentation/filesystems/nfs/client-identifier.rst | 216
-rw-r--r--  Documentation/filesystems/nfs/exporting.rst | 87
-rw-r--r--  Documentation/filesystems/nfs/index.rst | 4
-rw-r--r--  Documentation/filesystems/nfs/localio.rst | 357
-rw-r--r--  Documentation/filesystems/nfs/reexport.rst | 117
-rw-r--r--  Documentation/filesystems/nfs/rpc-cache.rst | 2
-rw-r--r--  Documentation/filesystems/nfs/rpc-server-gss.rst | 11
-rw-r--r--  Documentation/filesystems/nilfs2.rst | 2
-rw-r--r--  Documentation/filesystems/ntfs.rst | 466
-rw-r--r--  Documentation/filesystems/ntfs3.rst | 123
-rw-r--r--  Documentation/filesystems/ocfs2.rst | 2
-rw-r--r--  Documentation/filesystems/omfs.rst | 2
-rw-r--r--  Documentation/filesystems/orangefs.rst | 2
-rw-r--r--  Documentation/filesystems/overlayfs.rst | 417
-rw-r--r--  Documentation/filesystems/path-lookup.rst | 232
-rw-r--r--  Documentation/filesystems/path-lookup.txt | 2
-rw-r--r--  Documentation/filesystems/porting.rst | 433
-rw-r--r--  Documentation/filesystems/proc.rst | 808
-rw-r--r--  Documentation/filesystems/qnx6.rst | 4
-rw-r--r--  Documentation/filesystems/quota.rst | 12
-rw-r--r--  Documentation/filesystems/ramfs-rootfs-initramfs.rst | 15
-rw-r--r--  Documentation/filesystems/relay.rst | 36
-rw-r--r--  Documentation/filesystems/resctrl.rst | 1523
-rw-r--r--  Documentation/filesystems/seq_file.rst | 26
-rw-r--r--  Documentation/filesystems/sharedsubtree.rst | 4
-rw-r--r--  Documentation/filesystems/smb/cifsroot.rst (renamed from Documentation/filesystems/cifs/cifsroot.rst) | 2
-rw-r--r--  Documentation/filesystems/smb/index.rst | 11
-rw-r--r--  Documentation/filesystems/smb/ksmbd.rst | 186
-rw-r--r--  Documentation/filesystems/smb/smbdirect.rst | 103
-rw-r--r--  Documentation/filesystems/spufs/spufs.rst | 2
-rw-r--r--  Documentation/filesystems/squashfs.rst | 72
-rw-r--r--  Documentation/filesystems/sysfs-pci.rst | 138
-rw-r--r--  Documentation/filesystems/sysfs-tagging.rst | 48
-rw-r--r--  Documentation/filesystems/sysfs.rst | 58
-rw-r--r--  Documentation/filesystems/sysv-fs.rst | 264
-rw-r--r--  Documentation/filesystems/tmpfs.rst | 145
-rw-r--r--  Documentation/filesystems/ubifs-authentication.rst | 12
-rw-r--r--  Documentation/filesystems/ubifs.rst | 2
-rw-r--r--  Documentation/filesystems/udf.rst | 2
-rw-r--r--  Documentation/filesystems/vfat.rst | 4
-rw-r--r--  Documentation/filesystems/vfs.rst | 538
-rw-r--r--  Documentation/filesystems/xfs/index.rst | 14
-rw-r--r--  Documentation/filesystems/xfs/xfs-delayed-logging-design.rst (renamed from Documentation/filesystems/xfs-delayed-logging-design.rst) | 371
-rw-r--r--  Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst | 194
-rw-r--r--  Documentation/filesystems/xfs/xfs-online-fsck-design.rst | 5503
-rw-r--r--  Documentation/filesystems/xfs/xfs-self-describing-metadata.rst (renamed from Documentation/filesystems/xfs-self-describing-metadata.rst) | 1
-rw-r--r--  Documentation/filesystems/zonefs.rst | 89
129 files changed, 20172 insertions, 6667 deletions
diff --git a/Documentation/filesystems/9p.rst b/Documentation/filesystems/9p.rst
index 2995279ddc24..be3504ca034a 100644
--- a/Documentation/filesystems/9p.rst
+++ b/Documentation/filesystems/9p.rst
@@ -18,7 +18,7 @@ and Maya Gokhale. Additional development by Greg Watson
The best detailed explanation of the Linux implementation and applications of
the 9p client is available in the form of a USENIX paper:
- http://www.usenix.org/events/usenix05/tech/freenix/hensbergen.html
+ https://www.usenix.org/events/usenix05/tech/freenix/hensbergen.html
Other applications are described in the following papers:
@@ -31,7 +31,7 @@ Other applications are described in the following papers:
* PROSE I/O: Using 9p to enable Application Partitions
http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf
* VirtFS: A Virtualization Aware File System pass-through
- http://goo.gl/3WPDg
+ https://kernel.org/doc/ols/2010/ols2010-pages-109-120.pdf
Usage
=====
@@ -40,7 +40,7 @@ For remote file server::
mount -t 9p 10.10.1.2 /mnt/9
-For Plan 9 From User Space applications (http://swtch.com/plan9)::
+For Plan 9 From User Space applications (https://9fans.github.io/plan9port/)::
mount -t 9p `namespace`/acme /mnt/9 -o trans=unix,uname=$USER
@@ -48,11 +48,66 @@ For server running on QEMU host with virtio transport::
mount -t 9p -o trans=virtio <mount_tag> /mnt/9
-where mount_tag is the tag associated by the server to each of the exported
+where mount_tag is the tag generated by the server for each of the exported
mount points. Each 9P export is seen by the client as a virtio device with an
associated "mount_tag" property. Available mount tags can be
seen by reading /sys/bus/virtio/drivers/9pnet_virtio/virtio<n>/mount_tag files.
+USBG Usage
+==========
+
+To mount a 9p FS on a USB Host accessible via the gadget at runtime::
+
+ mount -t 9p -o trans=usbg,aname=/path/to/fs <device> /mnt/9
+
+To mount a 9p FS on a USB Host accessible via the gadget as root filesystem::
+
+ root=<device> rootfstype=9p rootflags=trans=usbg,cache=loose,uname=root,access=0,dfltuid=0,dfltgid=0,aname=/path/to/rootfs
+
+where <device> is the tag assigned by the usb gadget transport.
+It is defined by the configfs instance name.
+
+USBG Example
+============
+
+The USB host exports a filesystem, while the gadget on the USB device
+side makes it mountable.
+
+Diod (9pfs server) and the forwarder are on the development host, where
+the root filesystem is actually stored. The gadget is initialized during
+boot (or later) on the embedded board. Then the forwarder will find it
+on the USB bus and start forwarding requests.
+
+In this case the 9p requests come from the device and are handled by the
+host. The reason is that USB device ports are normally not available on
+PCs, so a connection in the other direction would not work.
+
+When using the usbg transport, there is currently no native USB host
+service capable of handling the requests from the gadget driver. For
+this we have to use the extra Python tool p9_fwd.py from tools/usb.
+
+Just start a 9pfs-capable network server such as diod or nfs-ganesha, e.g.::
+
+ $ diod -f -n -d 0 -S -l 0.0.0.0:9999 -e $PWD
+
+Optionally, scan your bus to find the gadgets' paths if there is more than one usbg gadget::
+
+ $ python $kernel_dir/tools/usb/p9_fwd.py list
+
+ Bus | Addr | Manufacturer | Product | ID | Path
+ --- | ---- | ---------------- | ---------------- | --------- | ----
+ 2 | 67 | unknown | unknown | 1d6b:0109 | 2-1.1.2
+ 2 | 68 | unknown | unknown | 1d6b:0109 | 2-1.1.3
+
+Then start the Python forwarder::
+
+ $ python $kernel_dir/tools/usb/p9_fwd.py --path 2-1.1.2 connect -p 9999
+
+After that the gadget driver can be used as described above.
+
+One use case is as an alternative to NFS root booting during the
+development of embedded Linux devices.
+
Options
=======
@@ -68,6 +123,7 @@ Options
virtio connect to the next virtio channel available
(from QEMU with trans_virtio module)
rdma connect to a specified RDMA channel
+ usbg connect to a specified usb gadget channel
======== ============================================
uname=name user name to attempt mount as on the remote server. The
@@ -78,19 +134,39 @@ Options
offering several exported file systems.
cache=mode specifies a caching policy. By default, no caches are used.
-
- none
- default no cache policy, metadata and data
- alike are synchronous.
- loose
- no attempts are made at consistency,
- intended for exclusive, read-only mounts
- fscache
- use FS-Cache for a persistent, read-only
- cache backend.
- mmap
- minimal cache that is only used for read-write
- mmap. Northing else is cached, like cache=none
+ The mode can be specified as a bitmask or by using one of the
+ preexisting common 'shortcuts'.
+ The bitmask is described below: (unspecified bits are reserved)
+
+ ========== ====================================================
+ 0b00000000 all caches disabled, mmap disabled
+ 0b00000001 file caches enabled
+ 0b00000010 meta-data caches enabled
+ 0b00000100 writeback behavior (as opposed to writethrough)
+ 0b00001000 loose caches (no explicit consistency with server)
+ 0b10000000 fscache enabled for persistent caching
+ ========== ====================================================
+
+ The current shortcuts and their associated bitmask are:
+
+ ========= ====================================================
+ none 0b00000000 (no caching)
+ readahead 0b00000001 (only read-ahead file caching)
+ mmap 0b00000101 (read-ahead + writeback file cache)
+ loose 0b00001111 (non-coherent file and meta-data caches)
+ fscache 0b10001111 (persistent loose cache)
+ ========= ====================================================
+
+ NOTE: only these shortcuts are tested modes of operation at the
+ moment, so using other combinations of bit-patterns is not
+ known to work. Work on better cache support is in progress.
+
+ IMPORTANT: loose caches (and by extension at the moment fscache)
+ do not necessarily validate cached values on the server. In other
+ words changes on the server are not guaranteed to be reflected
+ on the client system. Only use this mode of operation if you
+ have an exclusive mount and the server will not modify the
+ filesystem underneath you.
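+
+	    For example, these two mounts should be equivalent, since
+	    'loose' is the shortcut for bitmask 0b00001111 (decimal 15).
+	    This is a sketch - that a plain decimal bitmask value is
+	    accepted here is an assumption based on the description
+	    above::
+
+	      mount -t 9p -o trans=virtio,cache=loose <mount_tag> /mnt/9
+	      mount -t 9p -o trans=virtio,cache=15 <mount_tag> /mnt/9
+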
debug=n specifies debug level. The debug level is a bitmask.
@@ -137,6 +213,12 @@ Options
This can be used to share devices/named pipes/sockets between
hosts. This functionality will be expanded in later versions.
+ directio bypass page cache on all read/write operations
+
+ ignoreqv ignore qid.version==0 as a marker to ignore cache
+
+ noxattr do not offer xattr functions on this mount.
+
access there are four access modes.
user
if a user tries to access a file on v9fs
diff --git a/Documentation/filesystems/affs.rst b/Documentation/filesystems/affs.rst
index 7f1a40dce6d3..5776cbd5fa53 100644
--- a/Documentation/filesystems/affs.rst
+++ b/Documentation/filesystems/affs.rst
@@ -110,13 +110,15 @@ The Amiga protection flags RWEDRWEDHSPARWED are handled as follows:
- R maps to r for user, group and others. On directories, R implies x.
- - If both W and D are allowed, w will be set.
+ - W maps to w.
- E maps to x.
- - H and P are always retained and ignored under Linux.
+ - D is ignored.
- - A is always reset when a file is written to.
+ - H, S and P are always retained and ignored under Linux.
+
+ - A is cleared when a file is written to.
User id and group id will be used unless set[gu]id are given as mount
options. Since most of the Amiga file systems are single user systems
@@ -128,11 +130,13 @@ Linux -> Amiga:
The Linux rwxrwxrwx file mode is handled as follows:
- - r permission will set R for user, group and others.
+ - r permission will allow R for user, group and others.
+
+ - w permission will allow W for user, group and others.
- - w permission will set W and D for user, group and others.
+ - x permission of the user will allow E for plain files.
- - x permission of the user will set E for plain files.
+ - D will be allowed for user, group and others.
- All other flags (suid, sgid, ...) are ignored and will
not be retained.
diff --git a/Documentation/filesystems/afs.rst b/Documentation/filesystems/afs.rst
index cada9464d6bd..f15ba388bbde 100644
--- a/Documentation/filesystems/afs.rst
+++ b/Documentation/filesystems/afs.rst
@@ -44,7 +44,7 @@ options::
CONFIG_AF_RXRPC - The RxRPC protocol transport
CONFIG_RXKAD - The RxRPC Kerberos security handler
- CONFIG_AFS - The AFS filesystem
+ CONFIG_AFS_FS - The AFS filesystem
Additionally, the following can be turned on to aid debugging::
@@ -109,7 +109,7 @@ Mountpoints
AFS has a concept of mountpoints. In AFS terms, these are specially formatted
symbolic links (of the same form as the "device name" passed to mount). kAFS
presents these to the user as directories that have a follow-link capability
-(ie: symbolic link semantics). If anyone attempts to access them, they will
+(i.e.: symbolic link semantics). If anyone attempts to access them, they will
automatically cause the target volume to be mounted (if possible) on that site.
Automatically mounted filesystems will be automatically unmounted approximately
@@ -144,7 +144,7 @@ looks up a cell of the same name, for example::
Proc Filesystem
===============
-The AFS modules creates a "/proc/fs/afs/" directory and populates it:
+The AFS module creates a "/proc/fs/afs/" directory and populates it:
(*) A "cells" file that lists cells currently known to the afs module and
their usage counts::
@@ -190,7 +190,7 @@ Security
Secure operations are initiated by acquiring a key using the klog program. A
very primitive klog program is available at:
- http://people.redhat.com/~dhowells/rxrpc/klog.c
+ https://people.redhat.com/~dhowells/rxrpc/klog.c
This should be compiled by::
@@ -201,7 +201,7 @@ And then run as::
./klog
Assuming it's successful, this adds a key of type RxRPC, named for the service
-and cell, eg: "afs@<cellname>". This can be viewed with the keyctl program or
+and cell, e.g.: "afs@<cellname>". This can be viewed with the keyctl program or
by cat'ing /proc/keys::
[root@andromeda ~]# keyctl show
@@ -211,7 +211,7 @@ by cat'ing /proc/keys::
111416553 --als--v 0 0 \_ rxrpc: afs@CAMBRIDGE.REDHAT.COM
Currently the username, realm, password and proposed ticket lifetime are
-compiled in to the program.
+compiled into the program.
It is not required to acquire a key before using AFS facilities, but if one is
not acquired then all operations will be governed by the anonymous user parts
diff --git a/Documentation/filesystems/api-summary.rst b/Documentation/filesystems/api-summary.rst
index bbb0c1c0e5cf..cc5cc7f3fbd8 100644
--- a/Documentation/filesystems/api-summary.rst
+++ b/Documentation/filesystems/api-summary.rst
@@ -56,9 +56,6 @@ Other Functions
.. kernel-doc:: fs/namei.c
:export:
-.. kernel-doc:: fs/buffer.c
- :export:
-
.. kernel-doc:: block/bio.c
:export:
@@ -71,9 +68,6 @@ Other Functions
.. kernel-doc:: fs/fs-writeback.c
:export:
-.. kernel-doc:: fs/block_dev.c
- :export:
-
.. kernel-doc:: fs/anon_inodes.c
:export:
@@ -86,9 +80,6 @@ Other Functions
.. kernel-doc:: fs/dax.c
:export:
-.. kernel-doc:: fs/direct-io.c
- :export:
-
.. kernel-doc:: fs/libfs.c
:export:
@@ -104,6 +95,9 @@ Other Functions
.. kernel-doc:: fs/xattr.c
:export:
+.. kernel-doc:: fs/namespace.c
+ :export:
+
The proc filesystem
===================
@@ -125,6 +119,12 @@ Events based on file descriptors
.. kernel-doc:: fs/eventfd.c
:export:
+eventpoll (epoll) interfaces
+============================
+
+.. kernel-doc:: fs/eventpoll.c
+ :internal:
+
The Filesystem for Exporting Kernel Objects
===========================================
diff --git a/Documentation/filesystems/autofs-mount-control.rst b/Documentation/filesystems/autofs-mount-control.rst
index 2903aed92316..b5a379d25c40 100644
--- a/Documentation/filesystems/autofs-mount-control.rst
+++ b/Documentation/filesystems/autofs-mount-control.rst
@@ -196,7 +196,7 @@ information and return operation results::
struct args_ismountpoint ismountpoint;
};
- char path[0];
+ char path[];
};
The ioctlfd field is a mount point file descriptor of an autofs mount
@@ -391,7 +391,7 @@ variation uses the path and optionally in.type field of struct args_ismountpoint
set to an autofs mount type. The call returns 1 if this is a mount point
and sets out.devid field to the device number of the mount and out.magic
field to the relevant super block magic number (described below) or 0 if
-it isn't a mountpoint. In both cases the the device number (as returned
+it isn't a mountpoint. In both cases the device number (as returned
by new_encode_dev()) is returned in out.devid field.
If supplied with a file descriptor we're looking for a specific mount,
@@ -399,12 +399,12 @@ not necessarily at the top of the mounted stack. In this case the path
the descriptor corresponds to is considered a mountpoint if it is itself
a mountpoint or contains a mount, such as a multi-mount without a root
mount. In this case we return 1 if the descriptor corresponds to a mount
-point and and also returns the super magic of the covering mount if there
+point and also returns the super magic of the covering mount if there
is one or 0 if it isn't a mountpoint.
If a path is supplied (and the ioctlfd field is set to -1) then the path
is looked up and is checked to see if it is the root of a mount. If a
type is also given we are looking for a particular autofs mount and if
-a match isn't found a fail is returned. If the the located path is the
+a match isn't found a fail is returned. If the located path is the
root of a mount 1 is returned along with the super magic of the mount
or 0 otherwise.
diff --git a/Documentation/filesystems/autofs.rst b/Documentation/filesystems/autofs.rst
index 681c6a492bc0..5eb02394fcc3 100644
--- a/Documentation/filesystems/autofs.rst
+++ b/Documentation/filesystems/autofs.rst
@@ -18,7 +18,7 @@ key advantages:
2. The names and locations of filesystems can be stored in
a remote database and can change at any time. The content
- in that data base at the time of access will be used to provide
+ in that database at the time of access will be used to provide
a target for the access. The interpretation of names in the
filesystem can even be programmatic rather than database-backed,
allowing wildcards for example, and can vary based on the user who
@@ -35,7 +35,7 @@ This document describes only the kernel module and the interactions
required with any user-space program. Subsequent text refers to this
as the "automount daemon" or simply "the daemon".
-"autofs" is a Linux kernel module with provides the "autofs"
+"autofs" is a Linux kernel module which provides the "autofs"
filesystem type. Several "autofs" filesystems can be mounted and they
can each be managed separately, or all managed by the same daemon.
@@ -423,7 +423,7 @@ The available ioctl commands are:
and objects are expired if the are not in use.
**AUTOFS_EXP_FORCED** causes the in use status to be ignored
- and objects are expired ieven if they are in use. This assumes
+ and objects are expired even if they are in use. This assumes
that the daemon has requested this because it is capable of
performing the umount.
@@ -442,7 +442,7 @@ which can be used to communicate directly with the autofs filesystem.
It requires CAP_SYS_ADMIN for access.
The 'ioctl's that can be used on this device are described in a separate
-document `autofs-mount-control.txt`, and are summarised briefly here.
+document `autofs-mount-control.rst`, and are summarised briefly here.
Each ioctl is passed a pointer to an `autofs_dev_ioctl` structure::
struct autofs_dev_ioctl {
@@ -467,7 +467,7 @@ Each ioctl is passed a pointer to an `autofs_dev_ioctl` structure::
struct args_ismountpoint ismountpoint;
};
- char path[0];
+ char path[];
};
For the **OPEN_MOUNT** and **IS_MOUNTPOINT** commands, the target
diff --git a/Documentation/filesystems/bcachefs/CodingStyle.rst b/Documentation/filesystems/bcachefs/CodingStyle.rst
new file mode 100644
index 000000000000..b29562a6bf55
--- /dev/null
+++ b/Documentation/filesystems/bcachefs/CodingStyle.rst
@@ -0,0 +1,186 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+bcachefs coding style
+=====================
+
+Good development is like gardening, and codebases are our gardens. Tend to them
+every day; look for little things that are out of place or in need of tidying.
+A little weeding here and there goes a long way; don't wait until things have
+spiraled out of control.
+
+Things don't always have to be perfect - nitpicking often does more harm than
+good. But appreciate beauty when you see it - and let people know.
+
+The code that you are afraid to touch is the code most in need of refactoring.
+
+A little organizing here and there goes a long way.
+
+Put real thought into how you organize things.
+
+Good code is readable code, where the structure is simple and leaves nowhere
+for bugs to hide.
+
+Assertions are one of our most important tools for writing reliable code. If in
+the course of writing a patchset you encounter a condition that shouldn't
+happen (and will have unpredictable or undefined behaviour if it does), or
+you're not sure if it can happen and not sure how to handle it yet - make it a
+BUG_ON(). Don't leave undefined or unspecified behavior lurking in the codebase.
+
+By the time you finish the patchset, you should understand better which
+assertions need to be handled and turned into checks with error paths, and
+which should be logically impossible. Leave the BUG_ON()s in for the ones which
+are logically impossible. (Or, make them debug mode assertions if they're
+expensive - but don't turn everything into a debug mode assertion, so that
+we're not stuck debugging undefined behaviour should it turn out that you were
+wrong).
+
+Assertions are documentation that can't go out of date. Good assertions are
+wonderful.
+
+Good assertions drastically and dramatically reduce the amount of testing
+required to shake out bugs.
+
+Good assertions are based on state, not logic. To write good assertions, you
+have to think about what the invariants on your state are.
+
+Good invariants and assertions will hold everywhere in your codebase. This
+means that you can run them in only a few places in the checked in version, but
+should you need to debug something that caused the assertion to fail, you can
+quickly shotgun them everywhere to find the codepath that broke the invariant.
+
+A good assertion checks something that the compiler could check for us, and
+elide - if we were working in a language with embedded correctness proofs that
+the compiler could check. This is something that exists today, but it'll likely
+still be a few decades before it comes to systems programming languages. But we
+can still incorporate that kind of thinking into our code and document the
+invariants with runtime checks - much like the way people working in
+dynamically typed languages may add type annotations, gradually making their
+code statically typed.
+
+Looking for ways to make your assertions simpler - and higher level - will
+often nudge you towards making the entire system simpler and more robust.
+
+Good code is code where you can poke around and see what it's doing -
+introspection. We can't debug anything if we can't see what's going on.
+
+Whenever we're debugging, and the solution isn't immediately obvious, if the
+issue is that we don't know where the issue is because we can't see what's
+going on - fix that first.
+
+We have the tools to make anything visible at runtime, efficiently - RCU and
+percpu data structures among them. Don't let things stay hidden.
+
+The most important tool for introspection is the humble pretty printer - in
+bcachefs, this means `*_to_text()` functions, which output to printbufs.
+
+Pretty printers are wonderful, because they compose and you can use them
+everywhere. Having functions to print whatever object you're working with will
+make your error messages much easier to write (therefore they will actually
+exist) and much more informative. And they can be used from sysfs/debugfs, as
+well as tracepoints.
+
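+As a hypothetical illustration (the type and its fields are invented for this
+example; only the printbuf pattern is the point), a `*_to_text()` function
+simply formats one object into a printbuf, and the same function can then be
+reused from sysfs, debugfs and tracepoints alike::
+
+  static void my_dev_stats_to_text(struct printbuf *out,
+				   struct my_dev_stats *s)
+  {
+	  /* One labelled value per line - no files full of bare integers. */
+	  prt_printf(out, "reads completed:\t%llu\n", s->reads);
+	  prt_printf(out, "writes completed:\t%llu\n", s->writes);
+	  prt_printf(out, "read errors:\t%llu\n", s->read_errs);
+  }
+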
+Runtime info and debugging tools should come with clear descriptions and
+labels, and good structure - we don't want files with a list of bare integers,
+like in procfs. Part of the job of the debugging tools is to educate users and
+new developers as to how the system works.
+
+Error messages should, whenever possible, tell you everything you need to debug
+the issue. It's worth putting effort into them.
+
+Tracepoints shouldn't be the first thing you reach for. They're an important
+tool, but always look for more immediate ways to make things visible. When we
+have to rely on tracing, we have to know which tracepoints we're looking for,
+and then we have to run the troublesome workload, and then we have to sift
+through logs. This is a lot of steps to go through when a user is hitting
+something, and if it's intermittent it may not even be possible.
+
+The humble counter is an incredibly useful tool. They're cheap and simple to
+use, and many complicated internal operations with lots of things that can
+behave weirdly (anything involving memory reclaim, for example) become
+shockingly easy to debug once you have counters on every distinct codepath.
+
+Persistent counters are even better.
+
+When debugging, try to get the most out of every bug you come across; don't
+rush to fix the initial issue. Look for things that will make related bugs
+easier the next time around - introspection, new assertions, better error
+messages, new debug tools, and do those first. Look for ways to make the system
+better behaved; often one bug will uncover several other bugs through
+downstream effects.
+
+Fix all that first, and then the original bug last - even if that means keeping
+a user waiting. They'll thank you in the long run, and when they understand
+what you're doing you'll be amazed at how patient they're happy to be. Users
+like to help - otherwise they wouldn't be reporting the bug in the first place.
+
+Talk to your users. Don't isolate yourself.
+
+Users notice all sorts of interesting things, and by just talking to them and
+interacting with them you can benefit from their experience.
+
+Spend time doing support and helpdesk stuff. Don't just write code - code isn't
+finished until it's being used trouble free.
+
+This will also motivate you to make your debugging tools as good as possible,
+and perhaps even your documentation, too. Like anything else in life, the more
+time you spend at it the better you'll get, and you the developer are the
+person most able to improve the tools to make debugging quick and easy.
+
+Be wary of how you take on and commit to big projects. Don't let development
+become product-manager focused. Often an idea is a good one but needs to
+wait for its proper time - but you won't know if it's the proper time for an
+idea until you start writing code.
+
+Expect to throw a lot of things away, or leave them half finished for later.
+Nobody writes all perfect code that all gets shipped, and you'll be much more
+productive in the long run if you notice this early and shift to something
+else. The experience gained and lessons learned will be valuable for all the
+other work you do.
+
+But don't be afraid to tackle projects that require significant rework of
+existing code. Sometimes these can be the best projects, because they can lead
+us to make existing code more general, more flexible, more multipurpose and
+perhaps more robust. Just don't hesitate to abandon the idea if it looks like
+it's going to make a mess of things.
+
+Complicated features can often be done as a series of refactorings, with the
+final change that actually implements the feature as a quite small patch at the
+end. It's wonderful when this happens, especially when those refactorings are
+things that improve the codebase in their own right. When that happens there's
+much less risk of wasted effort if the feature you were going for doesn't work
+out.
+
+Always strive to work incrementally. Always strive to turn the big projects
+into little bite sized projects that can prove their own merits.
+
+Instead of always tackling those big projects, look for little things that
+will be useful, and make the big projects easier.
+
+The question of what's likely to be useful is where junior developers most
+often go astray - doing something because it seems like it'll be useful often
+leads to overengineering. Knowing what's useful comes from many years of
+experience, or talking with people who have that experience - or from simply
+reading lots of code and looking for common patterns and issues. Don't be
+afraid to throw things away and do something simpler.
+
+Talk about your ideas with your fellow developers; often times the best things
+come from relaxed conversations where people aren't afraid to say "what if?".
+
+Don't neglect your tools.
+
+The most important tools (besides the compiler and our text editor) are the
+tools we use for testing. The shortest possible edit/test/debug cycle is
+essential for working productively. We learn, gain experience, and discover the
+errors in our thinking by running our code and seeing what happens. If your
+time is being wasted because your tools are bad or too slow - don't accept it,
+fix it.
+
+Put effort into your documentation, commit messages, and code comments - but
+don't go overboard. A good commit message is wonderful - but if the information
+was important enough to go in a commit message, ask yourself if it would be
+even better as a code comment.
+
+A good code comment is wonderful, but even better is the comment that didn't
+need to exist because the code was so straightforward as to be obvious;
+organized into small clean and tidy modules, with clear and descriptive names
+for functions and variables, where every line of code has a clear purpose.
diff --git a/Documentation/filesystems/bcachefs/SubmittingPatches.rst b/Documentation/filesystems/bcachefs/SubmittingPatches.rst
new file mode 100644
index 000000000000..18c79d548391
--- /dev/null
+++ b/Documentation/filesystems/bcachefs/SubmittingPatches.rst
@@ -0,0 +1,105 @@
+Submitting patches to bcachefs
+==============================
+
+Here are some suggestions for submitting patches to the bcachefs subsystem.
+
+Submission checklist
+--------------------
+
+Patches must be tested before being submitted, either with the xfstests suite
+[0]_, or the full bcachefs test suite in ktest [1]_, depending on what's being
+touched. Note that ktest wraps xfstests and will be an easier way to run
+it for most users; it includes single-command wrappers for all the mainstream
+in-kernel local filesystems.
+
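+For reference, a direct xfstests invocation (outside of ktest) looks roughly
+like the following - the devices, mount points and test group here are
+illustrative, not required values::
+
+  cd xfstests-dev
+  export FSTYP=bcachefs
+  export TEST_DEV=/dev/vdb TEST_DIR=/mnt/test
+  export SCRATCH_DEV=/dev/vdc SCRATCH_MNT=/mnt/scratch
+  ./check -g quick
+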
+Patches will undergo more testing after being merged (including
+lockdep/kasan/preempt/etc. variants); these are not generally required to be
+run by the submitter - but do put some thought into what you're changing and
+which tests might be relevant. For example, if you're dealing with tricky
+memory layout work, run kasan; if you're doing locking work, run lockdep.
+ktest includes single-command variants for the debug build types you'll most
+likely need.
+
+The exception to this rule is incomplete WIP/RFC patches: if you're working on
+something nontrivial, it's encouraged to send out a WIP patch to let people
+know what you're doing and make sure you're on the right track. Just make sure
+it includes a brief note as to what's done and what's incomplete, to avoid
+confusion.
+
+Rigorous checkpatch.pl adherence is not required (many of its warnings are
+considered out of date), but try not to deviate too much without reason.
+
+Focus on writing code that reads well and is organized well; code should be
+aesthetically pleasing.
+
+CI
+--
+
+Instead of running your tests locally, when running the full test suite it's
+preferable to let a server farm do it in parallel, and then have the results
+in a nice test dashboard (which can tell you which failures are new, and
+presents results in a git log view, avoiding the need for most bisecting).
+
+That exists [2]_, and community members may request an account. If you work for
+a big tech company, you'll need to help out with server costs to get access -
+but the CI is not restricted to running bcachefs tests: it runs any ktest test
+(which generally makes it easy to wrap other tests that can run in qemu).
+
+Other things to think about
+---------------------------
+
+- How will we debug this code? Is there sufficient introspection to diagnose
+ when something starts acting wonky on a user machine?
+
+ We don't necessarily need every single field of every data structure visible
+ with introspection, but having the important fields of all the core data
+ types wired up makes debugging drastically easier - a bit of thoughtful
+ foresight greatly reduces the need to have people build custom kernels with
+ debug patches.
+
+ More broadly, think about all the debug tooling that might be needed.
+
+- Does it make the codebase more or less of a mess? Can we also try to do some
+ organizing, too?
+
+- Do new tests need to be written? New assertions? How do we know and verify
+ that the code is correct, and what happens if something goes wrong?
+
+ We don't yet have automated code coverage analysis or easy fault injection -
+ but for now, pretend we did and ask what they might tell us.
+
+ Assertions are hugely important, given that we don't yet have a systems
+ language that can do ergonomic embedded correctness proofs. Hitting an assert
+ in testing is much better than wandering off into undefined behaviour la-la
+ land - use them. Use them judiciously, and not as a replacement for proper
+ error handling, but use them.
+
+- Does it need to be performance tested? Should we add new performance counters?
+
+ bcachefs has a set of persistent runtime counters which can be viewed with
+ the 'bcachefs fs top' command; this should give users a basic idea of what
+ their filesystem is currently doing. If you're doing a new feature or looking
+ at old code, think if anything should be added.
+
+- If it's a new on disk format feature - have upgrades and downgrades been
+  tested? (Automated tests exist but aren't in the CI, due to the hassle of
+ disk image management; coordinate to have them run.)
+
+Mailing list, IRC
+-----------------
+
+Patches should hit the list [3]_, but much discussion and code review happens
+on IRC as well [4]_; many people appreciate the more conversational approach
+and quicker feedback.
+
+Additionally, we have a lively user community doing excellent QA work, which
+exists primarily on IRC. Please make use of that resource; user feedback is
+important for any nontrivial feature, and documenting it in commit messages
+would be a good idea.
+
+.. rubric:: References
+
+.. [0] git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
+.. [1] https://evilpiepirate.org/git/ktest.git/
+.. [2] https://evilpiepirate.org/~testdashboard/ci/
+.. [3] linux-bcachefs@vger.kernel.org
+.. [4] irc.oftc.net#bcache, #bcachefs-dev
diff --git a/Documentation/filesystems/bcachefs/casefolding.rst b/Documentation/filesystems/bcachefs/casefolding.rst
new file mode 100644
index 000000000000..871a38f557e8
--- /dev/null
+++ b/Documentation/filesystems/bcachefs/casefolding.rst
@@ -0,0 +1,108 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Casefolding
+===========
+
+bcachefs has support for case-insensitive file and directory
+lookups using the regular `chattr +F` (`S_CASEFOLD`, `FS_CASEFOLD_FL`)
+casefolding attributes.
+
+The main usecase for casefolding is compatibility with software written
+against other filesystems that rely on casefolded lookups
+(eg. NTFS and Wine/Proton).
+Taking advantage of file-system level casefolding can lead to great
+loading time gains in many applications and games.
+
+Casefolding support requires a kernel with `CONFIG_UNICODE` enabled.
+Once a directory has been flagged for casefolding, a feature bit
+is enabled on the superblock which marks the filesystem as using
+casefolding.
+When the feature bit for casefolding is enabled, it is no longer possible
+to mount that filesystem on kernels without `CONFIG_UNICODE` enabled.
+
+On the lookup/query side: casefolding is implemented by allocating a new
+string of `BCH_NAME_MAX` length using the `utf8_casefold` function to
+casefold the query string.
+
+On the dirent side: casefolding is implemented by ensuring the `bkey`'s
+hash is made from the casefolded string and storing the cached casefolded
+name with the regular name in the dirent.
+
+The structure looks like this:
+
+* Regular: [dirent data][regular name][nul][nul]...
+* Casefolded: [dirent data][reg len][cf len][regular name][casefolded name][nul][nul]...
+
+(Do note, the number of NULs here is merely for illustration; their count can
+vary per-key, and they may not even be present if the key is aligned to
+`sizeof(u64)`.)
+
+This is efficient as it means that for all file lookups that require casefolding,
+it has identical performance to a regular lookup:
+a hash comparison and a `memcmp` of the name.
+
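+Sketched as C, the two name blocks look conceptually like this (an
+illustration only - the field names and exact widths are invented here, not
+the actual on-disk `bch_dirent` definition)::
+
+  /* Regular dirent name block: the name, NUL-padded. */
+  struct dirent_name_plain {
+	  __u8	name[];		/* "blah\0\0"... */
+  };
+
+  /* Casefolded name block: two lengths, then both names back to back. */
+  struct dirent_name_casefolded {
+	  __le16	reg_len;	/* length of the regular name */
+	  __le16	cf_len;		/* length of the casefolded name */
+	  __u8	names[];	/* regular name, casefolded name, NULs */
+  };
+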
+Rationale
+---------
+
+Several designs were considered for this system:
+One was to introduce a dirent_v2, however that would be painful especially as
+the hash system only has support for a single key type. This would also need
+`BCH_NAME_MAX` to change between versions, and a new feature bit.
+
+Another option was to store the two names contiguously without the lengths,
+and take half of the combined length as the length of each. This would
+assume that the regular length == casefolded length, but that could potentially
+not be true, if the uppercase unicode glyph had a different UTF-8 encoding than
+the lowercase unicode glyph.
+It would be possible to disregard the casefold cache for those cases, but it was
+decided to simply encode the two string lengths in the key to avoid random
+performance issues if this edgecase was ever hit.
+
+The option settled on was to use a free bit in d_type to mark a dirent as having
+a casefold cache, and then treat the first 4 bytes of the name block as lengths.
+You can see this in the `d_cf_name_block` member of the union in `bch_dirent`.
+
+The feature bit was used so that casefolding support could be enabled for the
+majority of users, while still allowing users who have no need for the feature
+to use bcachefs without it: `CONFIG_UNICODE` can increase the kernel size by a
+significant amount due to the tables used, which may be the deciding factor for
+using bcachefs on e.g. embedded platforms.
+
+Other filesystems like ext4 and f2fs have a super-block level option for casefolding
+encoding, but bcachefs currently does not provide this. ext4 and f2fs do not expose
+any encoding other than a single UTF-8 version. When future encodings are desirable,
+they will be added trivially using the opts mechanism.
+
+dentry/dcache considerations
+----------------------------
+
+Currently, in casefolded directories, bcachefs (like other filesystems) will not cache
+negative dentries.
+
+This is because currently doing so presents a problem in the following scenario:
+
+ - Lookup file "blAH" in a casefolded directory
+ - Creation of file "BLAH" in a casefolded directory
+ - Lookup file "blAH" in a casefolded directory
+
+This would fail if negative dentries were cached.
+
+This is slightly suboptimal, but could be fixed in future with some vfs work.
+
+
+References
+----------
+
+(from Peter Anvin, on the list)
+
+It is worth noting that Microsoft has basically declared their
+"recommended" case folding (upcase) table to be permanently frozen (for
+new filesystem instances in the case where they use an on-disk
+translation table created at format time.) As far as I know they have
+never supported anything other than 1:1 conversion of BMP code points,
+nor normalization.
+
+The exFAT specification enumerates the full recommended upcase table,
+although in a somewhat annoying format (basically a hex dump of
+compressed data):
+
+https://learn.microsoft.com/en-us/windows/win32/fileio/exfat-specification
diff --git a/Documentation/filesystems/bcachefs/errorcodes.rst b/Documentation/filesystems/bcachefs/errorcodes.rst
new file mode 100644
index 000000000000..2cccaa0ba7cd
--- /dev/null
+++ b/Documentation/filesystems/bcachefs/errorcodes.rst
@@ -0,0 +1,30 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+bcachefs private error codes
+----------------------------
+
+In bcachefs, as a hard rule we do not throw or directly use standard error
+codes (-EINVAL, -EBUSY, etc.). Instead, we define private error codes as needed
+in fs/bcachefs/errcode.h.
+
+This gives us much better error messages and makes debugging much easier. Any
+direct uses of standard error codes you see in the source code are simply old
+code that has yet to be converted - feel free to clean it up!
+
+Private error codes may subtype another error code; this allows for grouping of
+related errors that should be handled similarly (e.g. transaction restart
+errors), as well as specifying which standard error code should be returned at
+the bcachefs module boundary.
+
+At the module boundary, we use bch2_err_class() to convert to a standard error
+code; this also emits a trace event so that the original error code can be
+recovered even if it wasn't logged.
+
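+A minimal sketch of the pattern at the module boundary (the private error
+code named here is hypothetical; real ones are defined in
+fs/bcachefs/errcode.h)::
+
+  int ret = do_internal_thing(trans);
+  if (ret) {
+	  /*
+	   * A private code such as -BCH_ERR_ENOSPC_stripe_create
+	   * (hypothetical) identifies the one place it was thrown;
+	   * bch2_err_class() folds it back to its standard parent
+	   * class, e.g. plain -ENOSPC, for the VFS.
+	   */
+	  return bch2_err_class(ret);
+  }
+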
+Do not reuse error codes! Generally speaking, a private error code should only
+be thrown in one place. That means that when we see it in a log message we can
+see, unambiguously, exactly which file and line number it was returned from.
+
+Try to give error codes names that are as reasonably descriptive of the error
+as possible. Frequently, the error will be logged at a place far removed from
+where the error was generated; good names for error codes mean much more
+descriptive and useful error messages.
diff --git a/Documentation/filesystems/bcachefs/future/idle_work.rst b/Documentation/filesystems/bcachefs/future/idle_work.rst
new file mode 100644
index 000000000000..59a332509dcd
--- /dev/null
+++ b/Documentation/filesystems/bcachefs/future/idle_work.rst
@@ -0,0 +1,78 @@
+Idle/background work classes design doc:
+
+Right now, our behaviour at idle isn't ideal; it was designed for servers that
+would be under sustained load, to keep pending work at a "medium" level, to
+let work build up so we can process it in more efficient batches, while also
+giving headroom for bursts in load.
+
+But for desktops or mobile - scenarios where work is less sustained and power
+usage is more important - we want to operate differently, with a "rush to
+idle" so the system can go to sleep. We don't want to be dribbling out
+background work while the system should be idle.
+
+The complicating factor is that there are a number of background tasks, which
+form a hierarchy (or a digraph, depending on how you divide it up) - one
+background task may generate work for another.
+
+Thus proper idle detection needs to model this hierarchy.
+
+- Foreground writes
+- Page cache writeback
+- Copygc, rebalance
+- Journal reclaim
+
+When we implement idle detection and rush to idle, we need to be careful not
+to disturb too much the existing behaviour that works reasonably well when the
+system is under sustained load (or perhaps improve it in the case of
+rebalance, which currently does not actively attempt to let work batch up).
+
+SUSTAINED LOAD REGIME
+---------------------
+
+When the system is under continuous load, we want these jobs to run
+continuously - this is perhaps best modelled with a P/D controller, where
+they'll be trying to keep a target value (i.e. fragmented disk space,
+available journal space) roughly in the middle of some range.
+
+The goal under sustained load is to balance our ability to handle load spikes
+without running out of x resource (free disk space, free space in the
+journal), while also letting some work accumulate to be batched (or become
+unnecessary).
+
+For example, we don't want to run copygc too aggressively, because then it
+will be evacuating buckets that would have become empty (been overwritten or
+deleted) anyways, and we don't want to wait until we're almost out of free
+space because then the system will behave unpredictably - suddenly we're doing
+a lot more work to service each write and the system becomes much slower.
+
+IDLE REGIME
+-----------
+
+When the system becomes idle, we should start flushing our pending work
+quicker so the system can go to sleep.
+
+Note that the definition of "idle" depends on where in the hierarchy a task
+is - a task should start flushing work more quickly when the task above it has
+stopped generating new work.
+
+e.g. rebalance should start flushing more quickly when page cache writeback is
+idle, and journal reclaim should only start flushing more quickly when both
+copygc and rebalance are idle.
+
+It's important to let work accumulate when more work is still incoming and we
+still have room, because flushing is always more efficient if we let it batch
+up. New writes may overwrite data before rebalance moves it, and tasks may be
+generating more updates for the btree nodes that journal reclaim needs to flush.
+
+On idle, how much work we do at each interval should be proportional to the
+length of time we have been idle for. If we're idle only for a short duration,
+we shouldn't flush everything right away; the system might wake up and start
+generating new work soon, and flushing immediately might end up doing a lot of
+work that would have been unnecessary if we'd allowed things to batch more.
+
+To summarize, we will need:
+
+ - A list of classes for background tasks that generate work, which will
+ include one "foreground" class.
+ - Tracking for each class - "Am I doing work, or have I gone to sleep?"
+ - And each class should check the class above it when deciding how much work to issue.
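+
+A minimal sketch of that last point (all names here are hypothetical - this
+is a design document, not existing code)::
+
+  struct work_class {
+	  struct work_class	*above;		/* class that generates our work */
+	  unsigned long		last_active;	/* when we last saw new work */
+  };
+
+  /* Rush to idle only once the class feeding us has gone quiet. */
+  static bool should_rush_to_idle(struct work_class *c)
+  {
+	  return !c->above ||
+		 time_after(jiffies, c->above->last_active + IDLE_TIMEOUT);
+  }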
diff --git a/Documentation/filesystems/bcachefs/index.rst b/Documentation/filesystems/bcachefs/index.rst
new file mode 100644
index 000000000000..e5c4c2120b93
--- /dev/null
+++ b/Documentation/filesystems/bcachefs/index.rst
@@ -0,0 +1,38 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+bcachefs Documentation
+======================
+
+Subsystem-specific development process notes
+--------------------------------------------
+
+Development notes specific to bcachefs. These are intended to supplement the
+:doc:`general kernel development handbook </process/index>`.
+
+.. toctree::
+ :maxdepth: 1
+ :numbered:
+
+ CodingStyle
+ SubmittingPatches
+
+Filesystem implementation
+-------------------------
+
+Documentation for filesystem features and their implementation details.
+At this moment, only a few of these are described here.
+
+.. toctree::
+ :maxdepth: 1
+ :numbered:
+
+ casefolding
+ errorcodes
+
+Future design
+-------------
+.. toctree::
+ :maxdepth: 1
+
+ future/idle_work
diff --git a/Documentation/filesystems/befs.rst b/Documentation/filesystems/befs.rst
index 79f9740d76ff..a22f603b2938 100644
--- a/Documentation/filesystems/befs.rst
+++ b/Documentation/filesystems/befs.rst
@@ -106,8 +106,8 @@ iocharset=xxx Use xxx as the name of the NLS translation table.
debug The driver will output debugging information to the syslog.
============= ===========================================================
-How to Get Lastest Version
-==========================
+How to Get Latest Version
+=========================
The latest version is currently available at:
<http://befs-driver.sourceforge.net/>
diff --git a/Documentation/filesystems/btrfs.rst b/Documentation/filesystems/btrfs.rst
index d0904f602819..a81db8f54d68 100644
--- a/Documentation/filesystems/btrfs.rst
+++ b/Documentation/filesystems/btrfs.rst
@@ -19,15 +19,24 @@ The main Btrfs features include:
* Subvolumes (separate internal filesystem roots)
* Object level mirroring and striping
* Checksums on data and metadata (multiple algorithms available)
- * Compression
+ * Compression (multiple algorithms available)
+ * Reflink, deduplication
+ * Scrub (on-line checksum verification)
+ * Hierarchical quota groups (subvolume and snapshot support)
* Integrated multiple device support, with several raid algorithms
* Offline filesystem check
- * Efficient incremental backup and FS mirroring
+ * Efficient incremental backup and FS mirroring (send/receive)
+ * Trim/discard
* Online filesystem defragmentation
+ * Swapfile support
+ * Zoned mode
+ * Read/write metadata verification
+ * Online resize (shrink, grow)
-For more information please refer to the wiki
+For more information please refer to the documentation site or wiki
+
+ https://btrfs.readthedocs.io
- https://btrfs.wiki.kernel.org
that maintains information about administration tasks, frequently asked
questions, use cases, mount options, comprehensible changelogs, features,
diff --git a/Documentation/filesystems/buffer.rst b/Documentation/filesystems/buffer.rst
new file mode 100644
index 000000000000..ae24faf68efa
--- /dev/null
+++ b/Documentation/filesystems/buffer.rst
@@ -0,0 +1,12 @@
+Buffer Heads
+============
+
+Linux uses buffer heads to maintain state about individual filesystem blocks.
+Buffer heads are deprecated and new filesystems should use iomap instead.
+
+Functions
+---------
+
+.. kernel-doc:: include/linux/buffer_head.h
+.. kernel-doc:: fs/buffer.c
+ :export:
diff --git a/Documentation/filesystems/caching/backend-api.rst b/Documentation/filesystems/caching/backend-api.rst
index 19fbf6b9aa36..3a199fc50828 100644
--- a/Documentation/filesystems/caching/backend-api.rst
+++ b/Documentation/filesystems/caching/backend-api.rst
@@ -1,727 +1,479 @@
.. SPDX-License-Identifier: GPL-2.0
-==========================
-FS-Cache Cache backend API
-==========================
+=================
+Cache Backend API
+=================
The FS-Cache system provides an API by which actual caches can be supplied to
FS-Cache for it to then serve out to network filesystems and other interested
-parties.
+parties. This API is used by::
-This API is declared in <linux/fscache-cache.h>.
+	#include <linux/fscache-cache.h>
-Initialising and Registering a Cache
-====================================
-
-To start off, a cache definition must be initialised and registered for each
-cache the backend wants to make available. For instance, CacheFS does this in
-the fill_super() operation on mounting.
-
-The cache definition (struct fscache_cache) should be initialised by calling::
-
- void fscache_init_cache(struct fscache_cache *cache,
- struct fscache_cache_ops *ops,
- const char *idfmt,
- ...);
-
-Where:
-
- * "cache" is a pointer to the cache definition;
-
- * "ops" is a pointer to the table of operations that the backend supports on
- this cache; and
-
- * "idfmt" is a format and printf-style arguments for constructing a label
- for the cache.
-
-
-The cache should then be registered with FS-Cache by passing a pointer to the
-previously initialised cache definition to::
-
- int fscache_add_cache(struct fscache_cache *cache,
- struct fscache_object *fsdef,
- const char *tagname);
-
-Two extra arguments should also be supplied:
-
- * "fsdef" which should point to the object representation for the FS-Cache
- master index in this cache. Netfs primary index entries will be created
- here. FS-Cache keeps the caller's reference to the index object if
- successful and will release it upon withdrawal of the cache.
-
- * "tagname" which, if given, should be a text string naming this cache. If
- this is NULL, the identifier will be used instead. For CacheFS, the
- identifier is set to name the underlying block device and the tag can be
- supplied by mount.
-
-This function may return -ENOMEM if it ran out of memory or -EEXIST if the tag
-is already in use. 0 will be returned on success.
-
-
-Unregistering a Cache
-=====================
-
-A cache can be withdrawn from the system by calling this function with a
-pointer to the cache definition::
-
- void fscache_withdraw_cache(struct fscache_cache *cache);
-
-In CacheFS's case, this is called by put_super().
-
-
-Security
+Overview
========
-The cache methods are executed one of two contexts:
-
- (1) that of the userspace process that issued the netfs operation that caused
- the cache method to be invoked, or
-
- (2) that of one of the processes in the FS-Cache thread pool.
-
-In either case, this may not be an appropriate context in which to access the
-cache.
-
-The calling process's fsuid, fsgid and SELinux security identities may need to
-be masqueraded for the duration of the cache driver's access to the cache.
-This is left to the cache to handle; FS-Cache makes no effort in this regard.
-
+Interaction with the API is handled on three levels: cache, volume and data
+storage, and each level has its own type of cookie object:
-Control and Statistics Presentation
-===================================
+ ======================= =======================
+ COOKIE C TYPE
+ ======================= =======================
+ Cache cookie struct fscache_cache
+ Volume cookie struct fscache_volume
+ Data storage cookie struct fscache_cookie
+ ======================= =======================
-The cache may present data to the outside world through FS-Cache's interfaces
-in sysfs and procfs - the former for control and the latter for statistics.
+Cookies are used to provide some filesystem data to the cache, manage state and
+pin the cache during access in addition to acting as reference points for the
+API functions. Each cookie has a debugging ID that is included in trace points
+to make it easier to correlate traces. Note, though, that debugging IDs are
+simply allocated from incrementing counters and will eventually wrap.
-A sysfs directory called /sys/fs/fscache/<cachetag>/ is created if CONFIG_SYSFS
-is enabled. This is accessible through the kobject struct fscache_cache::kobj
-and is for use by the cache as it sees fit.
+The cache backend and the network filesystem can both ask for cache cookies -
+and if they ask for one of the same name, they'll get the same cookie. Volume
+and data cookies, however, are created at the behest of the filesystem only.
-Relevant Data Structures
-========================
+Cache Cookies
+=============
- * Index/Data file FS-Cache representation cookie::
+Caches are represented in the API by cache cookies. These are objects of
+type::
- struct fscache_cookie {
- struct fscache_object_def *def;
- struct fscache_netfs *netfs;
- void *netfs_data;
- ...
- };
-
- The fields that might be of use to the backend describe the object
- definition, the netfs definition and the netfs's data for this cookie.
- The object definition contain functions supplied by the netfs for loading
- and matching index entries; these are required to provide some of the
- cache operations.
-
-
- * In-cache object representation::
-
- struct fscache_object {
- int debug_id;
- enum {
- FSCACHE_OBJECT_RECYCLING,
- ...
- } state;
- spinlock_t lock
- struct fscache_cache *cache;
- struct fscache_cookie *cookie;
+ struct fscache_cache {
+ void *cache_priv;
+ unsigned int debug_id;
+ char *name;
...
};
- Structures of this type should be allocated by the cache backend and
- passed to FS-Cache when requested by the appropriate cache operation. In
- the case of CacheFS, they're embedded in CacheFS's internal object
- structures.
+There are a few fields that the cache backend might be interested in. The
+``debug_id`` can be used in tracing to match lines referring to the same cache
+and ``name`` is the name the cache was registered with. The ``cache_priv``
+member is private data provided by the cache when it is brought online. The
+other fields are for internal use.
- The debug_id is a simple integer that can be used in debugging messages
- that refer to a particular object. In such a case it should be printed
- using "OBJ%x" to be consistent with FS-Cache.
- Each object contains a pointer to the cookie that represents the object it
- is backing. An object should retired when put_object() is called if it is
- in state FSCACHE_OBJECT_RECYCLING. The fscache_object struct should be
- initialised by calling fscache_object_init(object).
+Registering a Cache
+===================
+When a cache backend wants to bring a cache online, it should first register
+the cache name and that will get it a cache cookie. This is done with::
- * FS-Cache operation record::
+ struct fscache_cache *fscache_acquire_cache(const char *name);
- struct fscache_operation {
- atomic_t usage;
- struct fscache_object *object;
- unsigned long flags;
- #define FSCACHE_OP_EXCLUSIVE
- void (*processor)(struct fscache_operation *op);
- void (*release)(struct fscache_operation *op);
- ...
- };
+This will look up and potentially create a cache cookie. The cache cookie may
+have already been created by a network filesystem looking for it, in which case
+that cache cookie will be used. If the cache cookie is not in use by another
+cache, it will be moved into the preparing state; otherwise a busy error will
+be returned.
- FS-Cache has a pool of threads that it uses to give CPU time to the
- various asynchronous operations that need to be done as part of driving
- the cache. These are represented by the above structure. The processor
- method is called to give the op CPU time, and the release method to get
- rid of it when its usage count reaches 0.
+If successful, the cache backend can then start setting up the cache. In the
+event that the initialisation fails, the cache backend should call::
- An operation can be made exclusive upon an object by setting the
- appropriate flag before enqueuing it with fscache_enqueue_operation(). If
- an operation needs more processing time, it should be enqueued again.
+ void fscache_relinquish_cache(struct fscache_cache *cache);
+to reset and discard the cookie.
- * FS-Cache retrieval operation record::
- struct fscache_retrieval {
- struct fscache_operation op;
- struct address_space *mapping;
- struct list_head *to_do;
- ...
- };
+Bringing a Cache Online
+=======================
- A structure of this type is allocated by FS-Cache to record retrieval and
- allocation requests made by the netfs. This struct is then passed to the
- backend to do the operation. The backend may get extra refs to it by
- calling fscache_get_retrieval() and refs may be discarded by calling
- fscache_put_retrieval().
+Once the cache is set up, it can be brought online by calling::
- A retrieval operation can be used by the backend to do retrieval work. To
- do this, the retrieval->op.processor method pointer should be set
- appropriately by the backend and fscache_enqueue_retrieval() called to
- submit it to the thread pool. CacheFiles, for example, uses this to queue
- page examination when it detects PG_lock being cleared.
+ int fscache_add_cache(struct fscache_cache *cache,
+ const struct fscache_cache_ops *ops,
+ void *cache_priv);
- The to_do field is an empty list available for the cache backend to use as
- it sees fit.
+This stores the cache operations table pointer and cache private data into the
+cache cookie and moves the cache to the active state, thereby allowing accesses
+to take place.
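+
+As an illustration of how these calls fit together, a hypothetical backend
+might bring a cache online like this (a minimal sketch; the mycache_* names
+and the setup step are assumptions, not part of the API)::
+
+	struct mycache {
+		struct fscache_cache *cache;
+		/* ... backend-private state ... */
+	};
+
+	static const struct fscache_cache_ops mycache_ops = {
+		.name = "mycache",
+		/* ... method pointers, as described below ... */
+	};
+
+	int mycache_bring_online(struct mycache *mc)
+	{
+		struct fscache_cache *cache;
+		int ret;
+
+		cache = fscache_acquire_cache("mycache");
+		if (IS_ERR(cache))
+			return PTR_ERR(cache);
+
+		ret = mycache_setup(mc);	/* hypothetical setup step */
+		if (ret < 0) {
+			fscache_relinquish_cache(cache);
+			return ret;
+		}
+
+		mc->cache = cache;
+		return fscache_add_cache(cache, &mycache_ops, mc);
+	}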
- * FS-Cache storage operation record::
+Withdrawing a Cache From Service
+================================
- struct fscache_storage {
- struct fscache_operation op;
- pgoff_t store_limit;
- ...
- };
+The cache backend can withdraw a cache from service by calling this function::
- A structure of this type is allocated by FS-Cache to record outstanding
- writes to be made. FS-Cache itself enqueues this operation and invokes
- the write_page() method on the object at appropriate times to effect
- storage.
+ void fscache_withdraw_cache(struct fscache_cache *cache);
+This moves the cache to the withdrawn state to prevent new cache- and
+volume-level accesses from starting and then waits for outstanding cache-level
+accesses to complete.
-Cache Operations
-================
+The cache must then go through the data storage objects it has and tell fscache
+to withdraw them, calling::
-The cache backend provides FS-Cache with a table of operations that can be
-performed on the denizens of the cache. These are held in a structure of type:
+ void fscache_withdraw_cookie(struct fscache_cookie *cookie);
- ::
+on the cookie that each object belongs to. This schedules the specified cookie
+for withdrawal; the actual work is offloaded to a workqueue. The cache backend
+can wait for completion by calling::
- struct fscache_cache_ops
+ void fscache_wait_for_objects(struct fscache_cache *cache);
- * Name of cache provider [mandatory]::
+Once all the cookies are withdrawn, a cache backend can withdraw all the
+volumes, calling::
- const char *name
+ void fscache_withdraw_volume(struct fscache_volume *volume);
- This isn't strictly an operation, but should be pointed at a string naming
- the backend.
+to tell fscache that a volume has been withdrawn. This waits for all
+outstanding accesses on the volume to complete before returning.
+When the cache is completely withdrawn, fscache should be notified by
+calling::
- * Allocate a new object [mandatory]::
+ void fscache_relinquish_cache(struct fscache_cache *cache);
- struct fscache_object *(*alloc_object)(struct fscache_cache *cache,
- struct fscache_cookie *cookie)
+to clear fields in the cookie and discard the caller's ref on it.
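+
+Putting the teardown sequence together, a backend might withdraw a cache like
+this (a sketch that assumes the backend keeps its own lists of objects and
+volumes; the mycache_for_each_* iterators are made up)::
+
+	void mycache_withdraw(struct mycache *mc)
+	{
+		struct mycache_object *obj;
+		struct mycache_volume *vol;
+
+		fscache_withdraw_cache(mc->cache);
+
+		/* Schedule withdrawal of every data storage object... */
+		mycache_for_each_object(mc, obj)
+			fscache_withdraw_cookie(obj->cookie);
+
+		/* ...and wait for the workqueue to finish them all. */
+		fscache_wait_for_objects(mc->cache);
+
+		mycache_for_each_volume(mc, vol)
+			fscache_withdraw_volume(vol->vcookie);
+
+		fscache_relinquish_cache(mc->cache);
+		mc->cache = NULL;
+	}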
- This method is used to allocate a cache object representation to back a
- cookie in a particular cache. fscache_object_init() should be called on
- the object to initialise it prior to returning.
- This function may also be used to parse the index key to be used for
- multiple lookup calls to turn it into a more convenient form. FS-Cache
- will call the lookup_complete() method to allow the cache to release the
- form once lookup is complete or aborted.
+Volume Cookies
+==============
+Within a cache, the data storage objects are organised into logical volumes.
+These are represented in the API as objects of type::
- * Look up and create object [mandatory]::
+ struct fscache_volume {
+ struct fscache_cache *cache;
+ void *cache_priv;
+ unsigned int debug_id;
+ char *key;
+ unsigned int key_hash;
+ ...
+ u8 coherency_len;
+ u8 coherency[];
+ };
- void (*lookup_object)(struct fscache_object *object)
+There are a number of fields here that are of interest to the caching backend:
- This method is used to look up an object, given that the object is already
- allocated and attached to the cookie. This should instantiate that object
- in the cache if it can.
+ * ``cache`` - The parent cache cookie.
- The method should call fscache_object_lookup_negative() as soon as
- possible if it determines the object doesn't exist in the cache. If the
- object is found to exist and the netfs indicates that it is valid then
- fscache_obtained_object() should be called once the object is in a
- position to have data stored in it. Similarly, fscache_obtained_object()
- should also be called once a non-present object has been created.
+ * ``cache_priv`` - A place for the cache to stash private data.
- If a lookup error occurs, fscache_object_lookup_error() should be called
- to abort the lookup of that object.
+ * ``debug_id`` - A debugging ID for logging in tracepoints.
+ * ``key`` - A printable string with no '/' characters in it that represents
+ the index key for the volume. The key is NUL-terminated and padded out to
+ a multiple of 4 bytes.
- * Release lookup data [mandatory]::
+ * ``key_hash`` - A hash of the index key. This should work out the same
+ regardless of CPU architecture and endianness.
- void (*lookup_complete)(struct fscache_object *object)
+ * ``coherency`` - A piece of coherency data that should be checked when the
+ volume is bound to in the cache.
- This method is called to ask the cache to release any resources it was
- using to perform a lookup.
+ * ``coherency_len`` - The amount of data in the coherency buffer.
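+
+As an example of how the coherency fields above might be used, a backend could
+compare the data it previously stored against the volume cookie when binding
+to it (a sketch; mycache_load_coherency() is a stand-in for however the
+backend persists the data on disk)::
+
+	static bool mycache_volume_coherent(struct fscache_volume *volume)
+	{
+		u8 buf[32];
+		ssize_t len;
+
+		/* Hypothetical helper that reads back the stored copy. */
+		len = mycache_load_coherency(volume->cache_priv, buf, sizeof(buf));
+
+		return len == volume->coherency_len &&
+		       memcmp(buf, volume->coherency, len) == 0;
+	}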
- * Increment object refcount [mandatory]::
+Data Storage Cookies
+====================
- struct fscache_object *(*grab_object)(struct fscache_object *object)
+A volume is a logical group of data storage objects, each of which is
+represented to the network filesystem by a cookie. Cookies are represented in
+the API as objects of type::
- This method is called to increment the reference count on an object. It
- may fail (for instance if the cache is being withdrawn) by returning NULL.
- It should return the object pointer if successful.
+ struct fscache_cookie {
+ struct fscache_volume *volume;
+ void *cache_priv;
+ unsigned long flags;
+ unsigned int debug_id;
+ unsigned int inval_counter;
+ loff_t object_size;
+ u8 advice;
+ u32 key_hash;
+ u8 key_len;
+ u8 aux_len;
+ ...
+ };
+The fields in the cookie that are of interest to the cache backend are:
- * Lock/Unlock object [mandatory]::
+ * ``volume`` - The parent volume cookie.
- void (*lock_object)(struct fscache_object *object)
- void (*unlock_object)(struct fscache_object *object)
+ * ``cache_priv`` - A place for the cache to stash private data.
- These methods are used to exclusively lock an object. It must be possible
- to schedule with the lock held, so a spinlock isn't sufficient.
+ * ``flags`` - A collection of bit flags, including:
+ * FSCACHE_COOKIE_NO_DATA_TO_READ - There is no data available in the
+ cache to be read as the cookie has been created or invalidated.
- * Pin/Unpin object [optional]::
+ * FSCACHE_COOKIE_NEEDS_UPDATE - The coherency data and/or object size has
+ been changed and needs committing.
- int (*pin_object)(struct fscache_object *object)
- void (*unpin_object)(struct fscache_object *object)
+ * FSCACHE_COOKIE_LOCAL_WRITE - The netfs's data has been modified
+ locally, so the cache object may be in an incoherent state with respect
+ to the server.
- These methods are used to pin an object into the cache. Once pinned an
- object cannot be reclaimed to make space. Return -ENOSPC if there's not
- enough space in the cache to permit this.
+ * FSCACHE_COOKIE_HAVE_DATA - The backend should set this if it
+ successfully stores data into the cache.
+ * FSCACHE_COOKIE_RETIRED - The cookie was invalidated when it was
+ relinquished and the cached data should be discarded.
- * Check coherency state of an object [mandatory]::
+ * ``debug_id`` - A debugging ID for logging in tracepoints.
- int (*check_consistency)(struct fscache_object *object)
+ * ``inval_counter`` - The number of invalidations done on the cookie.
- This method is called to have the cache check the saved auxiliary data of
- the object against the netfs's idea of the state. 0 should be returned
- if they're consistent and -ESTALE otherwise. -ENOMEM and -ERESTARTSYS
- may also be returned.
+ * ``advice`` - Information about how the cookie is to be used.
- * Update object [mandatory]::
+ * ``key_hash`` - A hash of the index key. This should work out the same
+ regardless of CPU architecture and endianness.
- int (*update_object)(struct fscache_object *object)
+ * ``key_len`` - The length of the index key.
- This is called to update the index entry for the specified object. The
- new information should be in object->cookie->netfs_data. This can be
- obtained by calling object->cookie->def->get_aux()/get_attr().
+ * ``aux_len`` - The length of the coherency data buffer.
+Each cookie has an index key, which may be stored inline to the cookie or
+elsewhere. A pointer to this can be obtained by calling::
- * Invalidate data object [mandatory]::
+ void *fscache_get_key(struct fscache_cookie *cookie);
- int (*invalidate_object)(struct fscache_operation *op)
+The index key is a binary blob, the storage for which is padded out to a
+multiple of 4 bytes.
- This is called to invalidate a data object (as pointed to by op->object).
- All the data stored for this object should be discarded and an
- attr_changed operation should be performed. The caller will follow up
- with an object update operation.
+Each cookie also has a buffer for coherency data. This may also be inline or
+detached from the cookie and a pointer is obtained by calling::
- fscache_op_complete() must be called on op before returning.
+ void *fscache_get_aux(struct fscache_cookie *cookie);
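+
+For instance, a backend that keeps each object in a file might derive the
+filename from the index key and persist the coherency data next to it (a
+sketch; mycache_encode_name() and mycache_store_aux() are invented helpers)::
+
+	static int mycache_bind_file(struct mycache_object *obj,
+				     struct fscache_cookie *cookie)
+	{
+		const u8 *key = fscache_get_key(cookie);
+		const u8 *aux = fscache_get_aux(cookie);
+
+		/* cookie->key_len bytes of binary key: encode for a filename. */
+		obj->name = mycache_encode_name(key, cookie->key_len);
+		if (!obj->name)
+			return -ENOMEM;
+
+		/* cookie->aux_len bytes of coherency data to keep alongside. */
+		return mycache_store_aux(obj, aux, cookie->aux_len);
+	}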
- * Discard object [mandatory]::
- void (*drop_object)(struct fscache_object *object)
+Cookie Accounting
+=================
- This method is called to indicate that an object has been unbound from its
- cookie, and that the cache should release the object's resources and
- retire it if it's in state FSCACHE_OBJECT_RECYCLING.
+Data storage cookies are counted and this is used to block cache withdrawal
+completion until all objects have been destroyed. The following functions are
+provided to the cache to deal with that::
- This method should not attempt to release any references held by the
- caller. The caller will invoke the put_object() method as appropriate.
+ void fscache_count_object(struct fscache_cache *cache);
+ void fscache_uncount_object(struct fscache_cache *cache);
+ void fscache_wait_for_objects(struct fscache_cache *cache);
+The count function records the allocation of an object in a cache and the
+uncount function records its destruction. Warning: by the time the uncount
+function returns, the cache may have been destroyed.
- * Release object reference [mandatory]::
+The wait function can be used during the withdrawal procedure to wait for
+fscache to finish withdrawing all the objects in the cache. When it completes,
+there will be no remaining objects referring to the cache object or any volume
+objects.
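+
+For example, a backend might pair the counters with its object allocation and
+freeing routines like this (a sketch)::
+
+	struct mycache_object *mycache_alloc_object(struct fscache_cache *cache)
+	{
+		struct mycache_object *obj = kzalloc(sizeof(*obj), GFP_KERNEL);
+
+		if (obj)
+			fscache_count_object(cache);
+		return obj;
+	}
+
+	void mycache_free_object(struct fscache_cache *cache,
+				 struct mycache_object *obj)
+	{
+		kfree(obj);
+		fscache_uncount_object(cache);
+		/* Careful: the cache may already be gone at this point. */
+	}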
- void (*put_object)(struct fscache_object *object)
- This method is used to discard a reference to an object. The object may
- be freed when all the references to it are released.
+Cache Management API
+====================
+The cache backend implements the cache management API by providing a table of
+operations that fscache can use to manage various aspects of the cache. These
+are held in a structure of type::
- * Synchronise a cache [mandatory]::
+ struct fscache_cache_ops {
+ const char *name;
+ ...
+ };
- void (*sync)(struct fscache_cache *cache)
+This contains a printable name for the cache backend driver plus a number of
+pointers to methods to allow fscache to request management of the cache:
- This is called to ask the backend to synchronise a cache with its backing
- device.
+ * Set up a volume cookie [optional]::
+ void (*acquire_volume)(struct fscache_volume *volume);
- * Dissociate a cache [mandatory]::
+ This method is called when a volume cookie is being created. The caller
+ holds a cache-level access pin to prevent the cache from going away for
+ the duration. This method should set up the resources to access a volume
+ in the cache and should not return until it has done so.
- void (*dissociate_pages)(struct fscache_cache *cache)
+ If successful, it can set ``cache_priv`` to its own data.
- This is called to ask a cache to perform any page dissociations as part of
- cache withdrawal.
+ * Clean up volume cookie [optional]::
- * Notification that the attributes on a netfs file changed [mandatory]::
+ void (*free_volume)(struct fscache_volume *volume);
- int (*attr_changed)(struct fscache_object *object);
+ This method is called when a volume cookie is being released if
+ ``cache_priv`` is set.
- This is called to indicate to the cache that certain attributes on a netfs
- file have changed (for example the maximum size a file may reach). The
- cache can read these from the netfs by calling the cookie's get_attr()
- method.
- The cache may use the file size information to reserve space on the cache.
- It should also call fscache_set_store_limit() to indicate to FS-Cache the
- highest byte it's willing to store for an object.
+ * Look up a cookie in the cache [mandatory]::
- This method may return -ve if an error occurred or the cache object cannot
- be expanded. In such a case, the object will be withdrawn from service.
+ bool (*lookup_cookie)(struct fscache_cookie *cookie);
- This operation is run asynchronously from FS-Cache's thread pool, and
- storage and retrieval operations from the netfs are excluded during the
- execution of this operation.
+ This method is called to look up/create the resources needed to access the
+ data storage for a cookie. It is called from a worker thread with a
+ volume-level access pin in the cache to prevent it from being withdrawn.
+ True should be returned if successful and false otherwise. If false is
+ returned, the withdraw_cookie op (see below) will be called.
- * Reserve cache space for an object's data [optional]::
+ If lookup fails, but the object could still be created (e.g. it hasn't
+ been cached before), then::
- int (*reserve_space)(struct fscache_object *object, loff_t size);
+ void fscache_cookie_lookup_negative(
+ struct fscache_cookie *cookie);
- This is called to request that cache space be reserved to hold the data
- for an object and the metadata used to track it. Zero size should be
- taken as request to cancel a reservation.
+ can be called to let the network filesystem proceed and start downloading
+ stuff whilst the cache backend gets on with the job of creating things.
- This should return 0 if successful, -ENOSPC if there isn't enough space
- available, or -ENOMEM or -EIO on other errors.
+ If successful, ``cookie->cache_priv`` can be set.
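+
+   As a rough sketch of the expected shape (mycache_lookup_file() and
+   mycache_create_file() are made-up helpers)::
+
+	static bool mycache_lookup_cookie(struct fscache_cookie *cookie)
+	{
+		struct mycache_object *obj;
+
+		obj = mycache_lookup_file(cookie);	/* hypothetical */
+		if (!obj) {
+			/* Nothing cached yet: let the netfs start reading
+			 * from the server while the file is created. */
+			fscache_cookie_lookup_negative(cookie);
+			obj = mycache_create_file(cookie); /* hypothetical */
+			if (!obj)
+				return false;
+		}
+
+		cookie->cache_priv = obj;
+		return true;
+	}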
- The reservation may exceed the current size of the object, thus permitting
- future expansion. If the amount of space consumed by an object would
- exceed the reservation, it's permitted to refuse requests to allocate
- pages, but not required. An object may be pruned down to its reservation
- size if larger than that already.
+ * Withdraw an object without any cookie access counts held [mandatory]::
- * Request page be read from cache [mandatory]::
+ void (*withdraw_cookie)(struct fscache_cookie *cookie);
- int (*read_or_alloc_page)(struct fscache_retrieval *op,
- struct page *page,
- gfp_t gfp)
+ This method is called to withdraw a cookie from service. It will be
+ called when the cookie is relinquished by the netfs, withdrawn or culled
+ by the cache backend or closed after a period of non-use by fscache.
- This is called to attempt to read a netfs page from the cache, or to
- reserve a backing block if not. FS-Cache will have done as much checking
- as it can before calling, but most of the work belongs to the backend.
+ The caller doesn't hold any access pins, but it is called from a
+ non-reentrant work item to manage races between the various ways
+ withdrawal can occur.
- If there's no page in the cache, then -ENODATA should be returned if the
- backend managed to reserve a backing block; -ENOBUFS or -ENOMEM if it
- didn't.
+ The cookie will have the ``FSCACHE_COOKIE_RETIRED`` flag set on it if the
+ associated data is to be removed from the cache.
- If there is suitable data in the cache, then a read operation should be
- queued and 0 returned. When the read finishes, fscache_end_io() should be
- called.
- The fscache_mark_pages_cached() should be called for the page if any cache
- metadata is retained. This will indicate to the netfs that the page needs
- explicit uncaching. This operation takes a pagevec, thus allowing several
- pages to be marked at once.
+ * Change the size of a data storage object [mandatory]::
- The retrieval record pointed to by op should be retained for each page
- queued and released when I/O on the page has been formally ended.
- fscache_get/put_retrieval() are available for this purpose.
+ void (*resize_cookie)(struct netfs_cache_resources *cres,
+ loff_t new_size);
- The retrieval record may be used to get CPU time via the FS-Cache thread
- pool. If this is desired, the op->op.processor should be set to point to
- the appropriate processing routine, and fscache_enqueue_retrieval() should
- be called at an appropriate point to request CPU time. For instance, the
- retrieval routine could be enqueued upon the completion of a disk read.
- The to_do field in the retrieval record is provided to aid in this.
+ This method is called to inform the cache backend of a change in size of
+ the netfs file due to local truncation. The cache backend should make all
+ of the changes it needs to make before returning as this is done under the
+ netfs inode mutex.
- If an I/O error occurs, fscache_io_error() should be called and -ENOBUFS
- returned if possible or fscache_end_io() called with a suitable error
- code.
+ The caller holds a cookie-level access pin to prevent a race with
+ withdrawal and the netfs must have the cookie marked in-use to prevent
+ garbage collection or culling from removing any resources.
- fscache_put_retrieval() should be called after a page or pages are dealt
- with. This will complete the operation when all pages are dealt with.
+ * Invalidate a data storage object [mandatory]::
- * Request pages be read from cache [mandatory]::
+ bool (*invalidate_cookie)(struct fscache_cookie *cookie);
- int (*read_or_alloc_pages)(struct fscache_retrieval *op,
- struct list_head *pages,
- unsigned *nr_pages,
- gfp_t gfp)
+ This is called when the network filesystem detects a third-party
+ modification or when an O_DIRECT write is made locally. This requests
+ that the cache backend should throw away all the data in the cache for
+ this object and start afresh. It should return true if successful and
+ false otherwise.
- This is like the read_or_alloc_page() method, except it is handed a list
- of pages instead of one page. Any pages on which a read operation is
- started must be added to the page cache for the specified mapping and also
- to the LRU. Such pages must also be removed from the pages list and
- ``*nr_pages`` decremented per page.
+ On entry, new I/O operations are blocked. Once the cache is in a position
+ to accept I/O again, the backend should release the block by calling::
- If there was an error such as -ENOMEM, then that should be returned; else
- if one or more pages couldn't be read or allocated, then -ENOBUFS should
- be returned; else if one or more pages couldn't be read, then -ENODATA
- should be returned. If all the pages are dispatched then 0 should be
- returned.
+ void fscache_resume_after_invalidation(struct fscache_cookie *cookie);
+ If the method returns false, caching will be withdrawn for this cookie.
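+
+   A minimal sketch of such a method (mycache_reset_storage() is a made-up
+   helper that discards and recreates the backing storage)::
+
+	static bool mycache_invalidate_cookie(struct fscache_cookie *cookie)
+	{
+		struct mycache_object *obj = cookie->cache_priv;
+
+		if (!mycache_reset_storage(obj))	/* hypothetical */
+			return false;
+
+		/* The storage is empty and usable again; unblock I/O. */
+		fscache_resume_after_invalidation(cookie);
+		return true;
+	}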
- * Request page be allocated in the cache [mandatory]::
- int (*allocate_page)(struct fscache_retrieval *op,
- struct page *page,
- gfp_t gfp)
+ * Prepare to make local modifications to the cache [mandatory]::
- This is like the read_or_alloc_page() method, except that it shouldn't
- read from the cache, even if there's data there that could be retrieved.
- It should, however, set up any internal metadata required such that
- the write_page() method can write to the cache.
+ void (*prepare_to_write)(struct fscache_cookie *cookie);
- If there's no backing block available, then -ENOBUFS should be returned
- (or -ENOMEM if there were other problems). If a block is successfully
- allocated, then the netfs page should be marked and 0 returned.
+ This method is called when the network filesystem finds that it is going
+ to need to modify the contents of the cache due to local writes or
+ truncations. This gives the cache a chance to note that a cache object
+ may be incoherent with respect to the server and may need writing back
+ later. This may also cause the cached data to be scrapped on later
+ rebinding if not properly committed.
- * Request pages be allocated in the cache [mandatory]::
+ * Begin an operation for the netfs lib [mandatory]::
- int (*allocate_pages)(struct fscache_retrieval *op,
- struct list_head *pages,
- unsigned *nr_pages,
- gfp_t gfp)
+ bool (*begin_operation)(struct netfs_cache_resources *cres,
+ enum fscache_want_state want_state);
- This is an multiple page version of the allocate_page() method. pages and
- nr_pages should be treated as for the read_or_alloc_pages() method.
+ This method is called when an I/O operation is being set up (read, write
+ or resize). The caller holds an access pin on the cookie and must have
+ marked the cookie as in-use.
+ If it can, the backend should attach any resources it needs to keep around
+ to the netfs_cache_resources object and return true.
- * Request page be written to cache [mandatory]::
+ If it can't complete the setup, it should return false.
- int (*write_page)(struct fscache_storage *op,
- struct page *page);
+ The want_state parameter indicates the state the caller needs the cache
+ object to be in and what it wants to do during the operation:
- This is called to write from a page on which there was a previously
- successful read_or_alloc_page() call or similar. FS-Cache filters out
- pages that don't have mappings.
+ * ``FSCACHE_WANT_PARAMS`` - The caller just wants to access cache
+ object parameters; it doesn't need to do data I/O yet.
- This method is called asynchronously from the FS-Cache thread pool. It is
- not required to actually store anything, provided -ENODATA is then
- returned to the next read of this page.
+ * ``FSCACHE_WANT_READ`` - The caller wants to read data.
- If an error occurred, then a negative error code should be returned,
- otherwise zero should be returned. FS-Cache will take appropriate action
- in response to an error, such as withdrawing this object.
+ * ``FSCACHE_WANT_WRITE`` - The caller wants to write to or resize the
+ cache object.
- If this method returns success then FS-Cache will inform the netfs
- appropriately.
+ Note that there won't necessarily be anything attached to the cookie's
+ cache_priv yet if the cookie is still being created.
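+
+   A sketch of how a backend might honour want_state (assuming the backend
+   stashes an object pointer in ``cache_priv``; mycache_grab_file() is an
+   invented helper that attaches resources to cres)::
+
+	static bool mycache_begin_operation(struct netfs_cache_resources *cres,
+					    enum fscache_want_state want_state)
+	{
+		struct fscache_cookie *cookie = fscache_cres_cookie(cres);
+		struct mycache_object *obj = cookie ? cookie->cache_priv : NULL;
+
+		/* Parameter access needs no backing file. */
+		if (want_state == FSCACHE_WANT_PARAMS)
+			return true;
+
+		/* Data I/O needs the object; it may not exist yet. */
+		if (!obj || !mycache_grab_file(cres, obj))
+			return false;
+		return true;
+	}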
- * Discard retained per-page metadata [mandatory]::
+Data I/O API
+============
- void (*uncache_page)(struct fscache_object *object, struct page *page)
+A cache backend provides a data I/O API through the netfs library's ``struct
+netfs_cache_ops`` attached to a ``struct netfs_cache_resources`` by the
+``begin_operation`` method described above.
- This is called when a netfs page is being evicted from the pagecache. The
- cache backend should tear down any internal representation or tracking it
- maintains for this page.
+See Documentation/filesystems/netfs_library.rst for a description.
-FS-Cache Utilities
-==================
+Miscellaneous Functions
+=======================
FS-Cache provides some utilities that a cache backend may make use of:
* Note occurrence of an I/O error in a cache::
- void fscache_io_error(struct fscache_cache *cache)
+ void fscache_io_error(struct fscache_cache *cache);
- This tells FS-Cache that an I/O error occurred in the cache. After this
- has been called, only resource dissociation operations (object and page
- release) will be passed from the netfs to the cache backend for the
- specified cache.
+ This tells FS-Cache that an I/O error occurred in the cache. This
+ prevents any new I/O from being started on the cache.
This does not actually withdraw the cache. That must be done separately.
+ * Note cessation of caching on a cookie due to failure::
- * Invoke the retrieval I/O completion function::
-
- void fscache_end_io(struct fscache_retrieval *op, struct page *page,
- int error);
-
- This is called to note the end of an attempt to retrieve a page. The
- error value should be 0 if successful and an error otherwise.
-
-
- * Record that one or more pages being retrieved or allocated have been dealt
- with::
-
- void fscache_retrieval_complete(struct fscache_retrieval *op,
- int n_pages);
-
- This is called to record the fact that one or more pages have been dealt
- with and are no longer the concern of this operation. When the number of
- pages remaining in the operation reaches 0, the operation will be
- completed.
-
-
- * Record operation completion::
-
- void fscache_op_complete(struct fscache_operation *op);
-
- This is called to record the completion of an operation. This deducts
- this operation from the parent object's run state, potentially permitting
- one or more pending operations to start running.
-
-
- * Set highest store limit::
-
- void fscache_set_store_limit(struct fscache_object *object,
- loff_t i_size);
-
- This sets the limit FS-Cache imposes on the highest byte it's willing to
- try and store for a netfs. Any page over this limit is automatically
- rejected by fscache_read_alloc_page() and co with -ENOBUFS.
-
-
- * Mark pages as being cached::
-
- void fscache_mark_pages_cached(struct fscache_retrieval *op,
- struct pagevec *pagevec);
-
- This marks a set of pages as being cached. After this has been called,
- the netfs must call fscache_uncache_page() to unmark the pages.
-
-
- * Perform coherency check on an object::
-
- enum fscache_checkaux fscache_check_aux(struct fscache_object *object,
- const void *data,
- uint16_t datalen);
-
- This asks the netfs to perform a coherency check on an object that has
- just been looked up. The cookie attached to the object will determine the
- netfs to use. data and datalen should specify where the auxiliary data
- retrieved from the cache can be found.
-
- One of three values will be returned:
-
- FSCACHE_CHECKAUX_OKAY
- The coherency data indicates the object is valid as is.
-
- FSCACHE_CHECKAUX_NEEDS_UPDATE
- The coherency data needs updating, but otherwise the object is
- valid.
-
- FSCACHE_CHECKAUX_OBSOLETE
- The coherency data indicates that the object is obsolete and should
- be discarded.
-
-
- * Initialise a freshly allocated object::
-
- void fscache_object_init(struct fscache_object *object);
-
- This initialises all the fields in an object representation.
-
-
- * Indicate the destruction of an object::
-
- void fscache_object_destroyed(struct fscache_cache *cache);
-
- This must be called to inform FS-Cache that an object that belonged to a
- cache has been destroyed and deallocated. This will allow continuation
- of the cache withdrawal process when it is stopped pending destruction of
- all the objects.
-
-
- * Indicate negative lookup on an object::
-
- void fscache_object_lookup_negative(struct fscache_object *object);
-
- This is called to indicate to FS-Cache that a lookup process for an object
- found a negative result.
-
- This changes the state of an object to permit reads pending on lookup
- completion to go off and start fetching data from the netfs server as it's
- known at this point that there can't be any data in the cache.
-
- This may be called multiple times on an object. Only the first call is
- significant - all subsequent calls are ignored.
-
-
- * Indicate an object has been obtained::
-
- void fscache_obtained_object(struct fscache_object *object);
-
- This is called to indicate to FS-Cache that a lookup process for an object
- produced a positive result, or that an object was created. This should
- only be called once for any particular object.
-
- This changes the state of an object to indicate:
-
- (1) if no call to fscache_object_lookup_negative() has been made on
- this object, that there may be data available, and that reads can
- now go and look for it; and
-
- (2) that writes may now proceed against this object.
-
-
- * Indicate that object lookup failed::
-
- void fscache_object_lookup_error(struct fscache_object *object);
-
- This marks an object as having encountered a fatal error (usually EIO)
- and causes it to move into a state whereby it will be withdrawn as soon
- as possible.
-
-
- * Indicate that a stale object was found and discarded::
-
- void fscache_object_retrying_stale(struct fscache_object *object);
-
- This is called to indicate that the lookup procedure found an object in
- the cache that the netfs decided was stale. The object has been
- discarded from the cache and the lookup will be performed again.
-
-
- * Indicate that the caching backend killed an object::
-
- void fscache_object_mark_killed(struct fscache_object *object,
- enum fscache_why_object_killed why);
-
- This is called to indicate that the cache backend preemptively killed an
- object. The why parameter should be set to indicate the reason:
+ void fscache_caching_failed(struct fscache_cookie *cookie);
- FSCACHE_OBJECT_IS_STALE
- - the object was stale and needs discarding.
+ This notes that the caching being done on a cookie failed in some way, for
+ instance the backing storage failed to be created or invalidation failed,
+ and that no further I/O operations should take place on it until the cache
+ is reset.
- FSCACHE_OBJECT_NO_SPACE
- - there was insufficient cache space
+ * Count I/O requests::
- FSCACHE_OBJECT_WAS_RETIRED
- - the object was retired when relinquished.
+ void fscache_count_read(void);
+ void fscache_count_write(void);
- FSCACHE_OBJECT_WAS_CULLED
- - the object was culled to make space.
+ These record reads and writes from/to the cache. The numbers are
+ displayed in /proc/fs/fscache/stats.
+ * Count out-of-space errors::
- * Get and release references on a retrieval record::
+ void fscache_count_no_write_space(void);
+ void fscache_count_no_create_space(void);
- void fscache_get_retrieval(struct fscache_retrieval *op);
- void fscache_put_retrieval(struct fscache_retrieval *op);
+ These record ENOSPC errors in the cache, divided into failures of data
+ writes and failures of filesystem object creations (e.g. mkdir).
- These two functions are used to retain a retrieval record while doing
- asynchronous data retrieval and block allocation.
+ * Count objects culled::
+ void fscache_count_culled(void);
- * Enqueue a retrieval record for processing::
+ This records the culling of an object.
- void fscache_enqueue_retrieval(struct fscache_retrieval *op);
+ * Get the cookie from a set of cache resources::
- This enqueues a retrieval record for processing by the FS-Cache thread
- pool. One of the threads in the pool will invoke the retrieval record's
- op->op.processor callback function. This function may be called from
- within the callback function.
+ struct fscache_cookie *fscache_cres_cookie(struct netfs_cache_resources *cres);
+ Pull a pointer to the cookie from the cache resources. This may return a
+ NULL cookie if no cookie was set.
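+
+   For example, a backend's read path might combine this with the counters
+   above (a sketch; mycache_submit_read() is invented)::
+
+	static int mycache_do_read(struct netfs_cache_resources *cres,
+				   struct iov_iter *iter, loff_t start)
+	{
+		struct fscache_cookie *cookie = fscache_cres_cookie(cres);
+
+		if (!cookie)
+			return -ENOBUFS;
+
+		fscache_count_read();	/* shows as "IO rd=N" in the stats */
+		return mycache_submit_read(cres, iter, start);
+	}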
- * List of object state names::
- const char *fscache_object_states[];
+API Function Reference
+======================
- For debugging purposes, this may be used to turn the state that an object
- is in into a text string for display purposes.
+.. kernel-doc:: include/linux/fscache-cache.h
diff --git a/Documentation/filesystems/caching/cachefiles.rst b/Documentation/filesystems/caching/cachefiles.rst
index 65d3db476765..b3ccc782cb3b 100644
--- a/Documentation/filesystems/caching/cachefiles.rst
+++ b/Documentation/filesystems/caching/cachefiles.rst
@@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0
-===============================================
-CacheFiles: CACHE ON ALREADY MOUNTED FILESYSTEM
-===============================================
+===================================
+Cache on Already Mounted Filesystem
+===================================
.. Contents:
@@ -28,6 +28,7 @@ CacheFiles: CACHE ON ALREADY MOUNTED FILESYSTEM
(*) Debugging.
+ (*) On-demand Read.
Overview
@@ -114,7 +115,7 @@ set up cache ready for use. The following script commands are available:
This mask can also be set through sysfs, eg::
- echo 5 >/sys/modules/cachefiles/parameters/debug
+ echo 5 > /sys/module/cachefiles/parameters/debug
Starting the Cache
@@ -348,7 +349,7 @@ data cached therein; nor is it permitted to create new files in the cache.
There are policy source files available in:
- http://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2
+ https://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2
and later versions. In that tarball, see the files::
@@ -415,7 +416,7 @@ process is the target of an operation by some other process (SIGKILL for
example).
The subjective security holds the active security properties of a process, and
-may be overridden. This is not seen externally, and is used whan a process
+may be overridden. This is not seen externally, and is used when a process
acts upon another object, for example SIGKILLing another process or opening a
file.
@@ -482,3 +483,180 @@ the control file. For example::
echo $((1|4|8)) >/sys/module/cachefiles/parameters/debug
will turn on all function entry debugging.
+
+
+On-demand Read
+==============
+
+When working in its original mode, CacheFiles serves as a local cache for a
+remote networking fs - while in on-demand read mode, CacheFiles can support
+scenarios where on-demand read semantics are needed, e.g. container image
+distribution.
+
+The essential difference between these two modes is seen when a cache miss
+occurs: In the original mode, the netfs will fetch the data from the remote
+server and then write it to the cache file; in on-demand read mode, fetching
+the data and writing it into the cache is delegated to a user daemon.
+
+``CONFIG_CACHEFILES_ONDEMAND`` should be enabled to support on-demand read mode.
+
+
+Protocol Communication
+----------------------
+
+The on-demand read mode uses a simple protocol for communication between kernel
+and user daemon. The protocol can be modeled as::
+
+ kernel --[request]--> user daemon --[reply]--> kernel
+
+CacheFiles will send requests to the user daemon when needed. The user daemon
+should poll the devnode ('/dev/cachefiles') to check if there's a pending
+request to be processed. A POLLIN event will be returned when there's a pending
+request.
+
+The user daemon then reads the devnode to fetch a request to process. It should
+be noted that each read only gets one request. When it has finished processing
+the request, the user daemon should write the reply to the devnode.
+
+Each request starts with a message header of the form::
+
+ struct cachefiles_msg {
+ __u32 msg_id;
+ __u32 opcode;
+ __u32 len;
+ __u32 object_id;
+ __u8 data[];
+ };
+
+where:
+
+ * ``msg_id`` is a unique ID identifying this request among all pending
+ requests.
+
+ * ``opcode`` indicates the type of this request.
+
+ * ``object_id`` is a unique ID identifying the cache file operated on.
+
+ * ``data`` indicates the payload of this request.
+
+ * ``len`` indicates the whole length of this request, including the
+ header and following type-specific payload.
+
+
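+Putting the protocol together, the core of a user daemon might look roughly
+like this (a sketch in userspace C; the protocol only prescribes the
+poll/read/reply shape - the handle_*() dispatchers are invented and sketched
+in later sections)::
+
+	#include <poll.h>
+	#include <unistd.h>
+	#include <linux/cachefiles.h>	/* message and opcode definitions */
+
+	static void daemon_loop(int devfd)
+	{
+		char buf[4096];
+		struct pollfd p = { .fd = devfd, .events = POLLIN };
+
+		for (;;) {
+			if (poll(&p, 1, -1) <= 0)
+				continue;
+			/* Each read fetches exactly one pending request. */
+			if (read(devfd, buf, sizeof(buf)) <= 0)
+				continue;
+
+			struct cachefiles_msg *msg = (struct cachefiles_msg *)buf;
+
+			switch (msg->opcode) {
+			case CACHEFILES_OP_OPEN:
+				handle_open(devfd, msg);
+				break;
+			case CACHEFILES_OP_CLOSE:
+				handle_close(msg);
+				break;
+			case CACHEFILES_OP_READ:
+				handle_read(msg);
+				break;
+			}
+		}
+	}
+
+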
+Turning on On-demand Mode
+-------------------------
+
+An optional parameter becomes available to the "bind" command::
+
+ bind [ondemand]
+
+When the "bind" command is given no argument, it defaults to the original mode.
+When it is given the "ondemand" argument, i.e. "bind ondemand", on-demand read
+mode will be enabled.
+
+
+The OPEN Request
+----------------
+
+When the netfs opens a cache file for the first time, a request with the
+CACHEFILES_OP_OPEN opcode, a.k.a. an OPEN request, will be sent to the user
+daemon. The payload format is of the form::
+
+ struct cachefiles_open {
+ __u32 volume_key_size;
+ __u32 cookie_key_size;
+ __u32 fd;
+ __u32 flags;
+ __u8 data[];
+ };
+
+where:
+
+ * ``data`` contains the volume_key followed directly by the cookie_key.
+ The volume key is a NUL-terminated string; the cookie key is binary
+ data.
+
+ * ``volume_key_size`` indicates the size of the volume key in bytes.
+
+ * ``cookie_key_size`` indicates the size of the cookie key in bytes.
+
+ * ``fd`` indicates an anonymous fd referring to the cache file, through
+ which the user daemon can perform write/llseek file operations on the
+ cache file.
+
+
+The user daemon can use the given (volume_key, cookie_key) pair to distinguish
+the requested cache file. With the given anonymous fd, the user daemon can
+fetch the data and write it to the cache file in the background, even when
+the kernel has not triggered a cache miss yet.
+
+Note that each cache file has a unique object_id, while it may have multiple
+anonymous fds. The user daemon may duplicate anonymous fds from the initial
+anonymous fd indicated by the @fd field through dup(). Thus each object_id can
+be mapped to multiple anonymous fds, and the user daemon itself needs to
+maintain the mapping.
+
+When implementing a user daemon, please be careful of RLIMIT_NOFILE,
+``/proc/sys/fs/nr_open`` and ``/proc/sys/fs/file-max``. Typically these needn't
+be huge since they're related to the number of open device blobs rather than
+open files of each individual filesystem.
+
+The user daemon should reply to the OPEN request by issuing a "copen"
+(complete open) command on the devnode::
+
+ copen <msg_id>,<cache_size>
+
+where:
+
+ * ``msg_id`` must match the msg_id field of the OPEN request.
+
+ * When >= 0, ``cache_size`` indicates the size of the cache file;
+ when < 0, ``cache_size`` indicates any error code encountered by the
+ user daemon.
+
+
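+The reply itself is just a line written back to the devnode. For example (a
+sketch; a real handle_open() would first parse the cachefiles_open payload
+and record the object_id-to-fd mapping before replying)::
+
+	#include <stdint.h>
+	#include <stdio.h>
+	#include <unistd.h>
+
+	static void send_copen(int devfd, uint32_t msg_id, long long size)
+	{
+		char cmd[64];
+		/* A negative size reports an error code to the kernel. */
+		int n = snprintf(cmd, sizeof(cmd), "copen %u,%lld", msg_id, size);
+
+		write(devfd, cmd, n);
+	}
+
+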
+The CLOSE Request
+-----------------
+
+When a cookie is withdrawn, a CLOSE request (opcode CACHEFILES_OP_CLOSE) will
+be sent to the user daemon. This tells the user daemon to close all anonymous
+fds associated with the given object_id. The CLOSE request has no extra
+payload, and should not be replied to.
+
+
+The READ Request
+----------------
+
+When a cache miss is encountered in on-demand read mode, CacheFiles will send a
+READ request (opcode CACHEFILES_OP_READ) to the user daemon. This tells the user
+daemon to fetch the contents of the requested file range. The payload is of the
+form::
+
+ struct cachefiles_read {
+ __u64 off;
+ __u64 len;
+ };
+
+where:
+
+ * ``off`` indicates the starting offset of the requested file range.
+
+ * ``len`` indicates the length of the requested file range.
+
+
+When it receives a READ request, the user daemon should fetch the requested data
+and write it to the cache file identified by object_id.
+
+When it has finished processing the READ request, the user daemon should reply
+by using the CACHEFILES_IOC_READ_COMPLETE ioctl on one of the anonymous fds
+associated with the object_id given in the READ request. The ioctl is of the
+form::
+
+ ioctl(fd, CACHEFILES_IOC_READ_COMPLETE, msg_id);
+
+where:
+
+ * ``fd`` is one of the anonymous fds associated with the object_id
+ given.
+
+ * ``msg_id`` must match the msg_id field of the READ request.
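+
+Tying the READ flow together, a handler might fetch the range, write it
+through one of the anonymous fds and then complete the request (a sketch;
+fd_for_object() and fetch_from_source() stand in for the daemon's own
+object-id mapping and data source)::
+
+	#include <sys/ioctl.h>
+	#include <unistd.h>
+	#include <linux/cachefiles.h>
+
+	static void handle_read(struct cachefiles_msg *msg)
+	{
+		struct cachefiles_read *req = (struct cachefiles_read *)msg->data;
+		int fd = fd_for_object(msg->object_id);	/* hypothetical map */
+		char *data = fetch_from_source(req->off, req->len); /* hypothetical */
+
+		/* Fill in the missing range of the cache file... */
+		pwrite(fd, data, req->len, req->off);
+
+		/* ...then tell the kernel this msg_id is complete. */
+		ioctl(fd, CACHEFILES_IOC_READ_COMPLETE, msg->msg_id);
+	}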
diff --git a/Documentation/filesystems/caching/fscache.rst b/Documentation/filesystems/caching/fscache.rst
index 70de86922b6a..de1f32526cc1 100644
--- a/Documentation/filesystems/caching/fscache.rst
+++ b/Documentation/filesystems/caching/fscache.rst
@@ -10,25 +10,25 @@ Overview
This facility is a general purpose cache for network filesystems, though it
could be used for caching other things such as ISO9660 filesystems too.
-FS-Cache mediates between cache backends (such as CacheFS) and network
+FS-Cache mediates between cache backends (such as CacheFiles) and network
filesystems::
+---------+
- | | +--------------+
- | NFS |--+ | |
- | | | +-->| CacheFS |
- +---------+ | +----------+ | | /dev/hda5 |
- | | | | +--------------+
- +---------+ +-->| | |
- | | | |--+
- | AFS |----->| FS-Cache |
- | | | |--+
- +---------+ +-->| | |
- | | | | +--------------+
- +---------+ | +----------+ | | |
- | | | +-->| CacheFiles |
- | ISOFS |--+ | /var/cache |
- | | +--------------+
+ | | +--------------+
+ | NFS |--+ | |
+ | | | +-->| CacheFS |
+ +---------+ | +----------+ | | /dev/hda5 |
+ | | | | +--------------+
+ +---------+ +-------------->| | |
+ | | +-------+ | |--+
+ | AFS |----->| | | FS-Cache |
+ | | | netfs |-->| |--+
+ +---------+ +-->| lib | | | |
+ | | | | | | +--------------+
+ +---------+ | +-------+ +----------+ | | |
+ | | | +-->| CacheFiles |
+ | 9P |--+ | /var/cache |
+ | | +--------------+
+---------+
Or to look at it another way, FS-Cache is a module that provides a caching
@@ -84,101 +84,62 @@ then serving the pages out of that cache rather than the netfs inode because:
one-off access of a small portion of it (such as might be done with the
"file" program).
-It instead serves the cache out in PAGE_SIZE chunks as and when requested by
-the netfs('s) using it.
+It instead serves the cache out in chunks as and when requested by the netfs
+using it.
FS-Cache provides the following facilities:
- (1) More than one cache can be used at once. Caches can be selected
+ * More than one cache can be used at once. Caches can be selected
explicitly by use of tags.
- (2) Caches can be added / removed at any time.
+ * Caches can be added / removed at any time, even whilst being accessed.
- (3) The netfs is provided with an interface that allows either party to
+ * The netfs is provided with an interface that allows either party to
withdraw caching facilities from a file (required for (2)).
- (4) The interface to the netfs returns as few errors as possible, preferring
+ * The interface to the netfs returns as few errors as possible, preferring
rather to let the netfs remain oblivious.
- (5) Cookies are used to represent indices, files and other objects to the
- netfs. The simplest cookie is just a NULL pointer - indicating nothing
- cached there.
-
- (6) The netfs is allowed to propose - dynamically - any index hierarchy it
- desires, though it must be aware that the index search function is
- recursive, stack space is limited, and indices can only be children of
- indices.
-
- (7) Data I/O is done direct to and from the netfs's pages. The netfs
- indicates that page A is at index B of the data-file represented by cookie
- C, and that it should be read or written. The cache backend may or may
- not start I/O on that page, but if it does, a netfs callback will be
- invoked to indicate completion. The I/O may be either synchronous or
- asynchronous.
-
- (8) Cookies can be "retired" upon release. At this point FS-Cache will mark
- them as obsolete and the index hierarchy rooted at that point will get
- recycled.
-
- (9) The netfs provides a "match" function for index searches. In addition to
- saying whether a match was made or not, this can also specify that an
- entry should be updated or deleted.
-
-(10) As much as possible is done asynchronously.
-
-
-FS-Cache maintains a virtual indexing tree in which all indices, files, objects
-and pages are kept. Bits of this tree may actually reside in one or more
-caches::
-
- FSDEF
- |
- +------------------------------------+
- | |
- NFS AFS
- | |
- +--------------------------+ +-----------+
- | | | |
- homedir mirror afs.org redhat.com
- | | |
- +------------+ +---------------+ +----------+
- | | | | | |
- 00001 00002 00007 00125 vol00001 vol00002
- | | | | |
- +---+---+ +-----+ +---+ +------+------+ +-----+----+
- | | | | | | | | | | | | |
- PG0 PG1 PG2 PG0 XATTR PG0 PG1 DIRENT DIRENT DIRENT R/W R/O Bak
- | |
- PG0 +-------+
- | |
- 00001 00003
- |
- +---+---+
- | | |
- PG0 PG1 PG2
-
-In the example above, you can see two netfs's being backed: NFS and AFS. These
-have different index hierarchies:
-
- * The NFS primary index contains per-server indices. Each server index is
- indexed by NFS file handles to get data file objects. Each data file
- objects can have an array of pages, but may also have further child
- objects, such as extended attributes and directory entries. Extended
- attribute objects themselves have page-array contents.
-
- * The AFS primary index contains per-cell indices. Each cell index contains
- per-logical-volume indices. Each of volume index contains up to three
- indices for the read-write, read-only and backup mirrors of those volumes.
- Each of these contains vnode data file objects, each of which contains an
- array of pages.
-
-The very top index is the FS-Cache master index in which individual netfs's
-have entries.
-
-Any index object may reside in more than one cache, provided it only has index
-children. Any index with non-index object children will be assumed to only
-reside in one cache.
+ * There are three types of cookie: cache, volume and data file cookies.
+ Cache cookies represent the cache as a whole and are not normally visible
+ to the netfs; the netfs gets a volume cookie to represent a collection of
+ files (typically something that a netfs would get for a superblock); and
+ data file cookies are used to cache data (something that would be got for
+ an inode).
+
+ * Volumes are matched using a key. This is a printable string that is used
+ to encode all the information that might be needed to distinguish one
+ superblock, say, from another. This would be a compound of things like
+ cell name or server address, volume name or share path. It must be a
+ valid pathname.
+
+ * Cookies are matched using a key. This is a binary blob and is used to
+ represent the object within a volume (so the volume key need not form
+ part of the blob). This might include things like an inode number and
+ uniquifier or a file handle.
+
+ * Cookie resources are set up and pinned by marking the cookie in-use.
+ This prevents the backing resources from being culled. Timed garbage
+ collection is employed to eliminate cookies that haven't been used for a
+ short while, thereby reducing resource overload. This is intended to be
+ used when a file is opened or closed.
+
+ A cookie can be marked in-use multiple times simultaneously; each mark
+ must be unused.
+
+ * Begin/end access functions are provided to delay cache withdrawal for the
+ duration of an operation and prevent structs from being freed whilst
+ we're looking at them.
+
+ * Data I/O is done by asynchronous DIO to/from a buffer described by the
+ netfs using an iov_iter.
+
+ * An invalidation facility is available to discard data from the cache and
+ to deal with I/O that's in progress that is accessing old data.
+
+ * Cookies can be "retired" upon release, thereby causing the object to be
+ removed from the cache.
The netfs API to FS-Cache can be found in:
@@ -189,11 +150,6 @@ The cache backend API to FS-Cache can be found in:
Documentation/filesystems/caching/backend-api.rst
-A description of the internal representations and object state machine can be
-found in:
-
- Documentation/filesystems/caching/object.rst
-
Statistical Information
=======================
@@ -201,342 +157,171 @@ Statistical Information
If FS-Cache is compiled with the following options enabled::
CONFIG_FSCACHE_STATS=y
- CONFIG_FSCACHE_HISTOGRAM=y
-then it will gather certain statistics and display them through a number of
-proc files.
+then it will gather certain statistics and display them through:
-/proc/fs/fscache/stats
-----------------------
+ /proc/fs/fscache/stats
- This shows counts of a number of events that can happen in FS-Cache:
+This shows counts of a number of events that can happen in FS-Cache:
+--------------+-------+-------------------------------------------------------+
|CLASS |EVENT |MEANING |
+==============+=======+=======================================================+
-|Cookies |idx=N |Number of index cookies allocated |
-+ +-------+-------------------------------------------------------+
-| |dat=N |Number of data storage cookies allocated |
+|Cookies |n=N |Number of data storage cookies allocated |
+ +-------+-------------------------------------------------------+
-| |spc=N |Number of special cookies allocated |
-+--------------+-------+-------------------------------------------------------+
-|Objects |alc=N |Number of objects allocated |
-+ +-------+-------------------------------------------------------+
-| |nal=N |Number of object allocation failures |
+| |v=N |Number of volume index cookies allocated |
+ +-------+-------------------------------------------------------+
-| |avl=N |Number of objects that reached the available state |
-+ +-------+-------------------------------------------------------+
-| |ded=N |Number of objects that reached the dead state |
-+--------------+-------+-------------------------------------------------------+
-|ChkAux |non=N |Number of objects that didn't have a coherency check |
+| |vcol=N |Number of volume index key collisions |
+ +-------+-------------------------------------------------------+
-| |ok=N |Number of objects that passed a coherency check |
-+ +-------+-------------------------------------------------------+
-| |upd=N |Number of objects that needed a coherency data update |
-+ +-------+-------------------------------------------------------+
-| |obs=N |Number of objects that were declared obsolete |
-+--------------+-------+-------------------------------------------------------+
-|Pages |mrk=N |Number of pages marked as being cached |
-| |unc=N |Number of uncache page requests seen |
+| |voom=N |Number of OOM events when allocating volume cookies |
+--------------+-------+-------------------------------------------------------+
|Acquire |n=N |Number of acquire cookie requests seen |
+ +-------+-------------------------------------------------------+
-| |nul=N |Number of acq reqs given a NULL parent |
-+ +-------+-------------------------------------------------------+
-| |noc=N |Number of acq reqs rejected due to no cache available |
-+ +-------+-------------------------------------------------------+
| |ok=N |Number of acq reqs succeeded |
+ +-------+-------------------------------------------------------+
-| |nbf=N |Number of acq reqs rejected due to error |
-+ +-------+-------------------------------------------------------+
| |oom=N |Number of acq reqs failed on ENOMEM |
+--------------+-------+-------------------------------------------------------+
-|Lookups |n=N |Number of lookup calls made on cache backends |
+|LRU |n=N |Number of cookies currently on the LRU |
+ +-------+-------------------------------------------------------+
-| |neg=N |Number of negative lookups made |
+| |exp=N |Number of cookies expired off of the LRU |
+ +-------+-------------------------------------------------------+
-| |pos=N |Number of positive lookups made |
+| |rmv=N |Number of cookies removed from the LRU |
+ +-------+-------------------------------------------------------+
-| |crt=N |Number of objects created by lookup |
+| |drp=N |Number of LRU'd cookies relinquished/withdrawn |
+ +-------+-------------------------------------------------------+
-| |tmo=N |Number of lookups timed out and requeued |
+| |at=N |Time till next LRU cull (jiffies) |
++--------------+-------+-------------------------------------------------------+
+|Invals |n=N |Number of invalidations |
+--------------+-------+-------------------------------------------------------+
|Updates |n=N |Number of update cookie requests seen |
+ +-------+-------------------------------------------------------+
-| |nul=N |Number of upd reqs given a NULL parent |
+| |rsz=N |Number of resize requests |
+ +-------+-------------------------------------------------------+
-| |run=N |Number of upd reqs granted CPU time |
+| |rsn=N |Number of skipped resize requests |
+--------------+-------+-------------------------------------------------------+
|Relinqs |n=N |Number of relinquish cookie requests seen |
+ +-------+-------------------------------------------------------+
-| |nul=N |Number of rlq reqs given a NULL parent |
+| |rtr=N |Number of rlq reqs with retire=true |
+ +-------+-------------------------------------------------------+
-| |wcr=N |Number of rlq reqs waited on completion of creation |
+| |drop=N |Number of cookies no longer blocking re-acquisition |
+--------------+-------+-------------------------------------------------------+
-|AttrChg |n=N |Number of attribute changed requests seen |
-+ +-------+-------------------------------------------------------+
-| |ok=N |Number of attr changed requests queued |
-+ +-------+-------------------------------------------------------+
-| |nbf=N |Number of attr changed rejected -ENOBUFS |
+|NoSpace |nwr=N |Number of write requests refused due to lack of space |
+ +-------+-------------------------------------------------------+
-| |oom=N |Number of attr changed failed -ENOMEM |
+| |ncr=N |Number of create requests refused due to lack of space |
+ +-------+-------------------------------------------------------+
-| |run=N |Number of attr changed ops given CPU time |
+| |cull=N |Number of objects culled to make space |
+--------------+-------+-------------------------------------------------------+
-|Allocs |n=N |Number of allocation requests seen |
+|IO |rd=N |Number of read operations in the cache |
+ +-------+-------------------------------------------------------+
-| |ok=N |Number of successful alloc reqs |
-+ +-------+-------------------------------------------------------+
-| |wt=N |Number of alloc reqs that waited on lookup completion |
-+ +-------+-------------------------------------------------------+
-| |nbf=N |Number of alloc reqs rejected -ENOBUFS |
-+ +-------+-------------------------------------------------------+
-| |int=N |Number of alloc reqs aborted -ERESTARTSYS |
-+ +-------+-------------------------------------------------------+
-| |ops=N |Number of alloc reqs submitted |
-+ +-------+-------------------------------------------------------+
-| |owt=N |Number of alloc reqs waited for CPU time |
-+ +-------+-------------------------------------------------------+
-| |abt=N |Number of alloc reqs aborted due to object death |
-+--------------+-------+-------------------------------------------------------+
-|Retrvls |n=N |Number of retrieval (read) requests seen |
-+ +-------+-------------------------------------------------------+
-| |ok=N |Number of successful retr reqs |
-+ +-------+-------------------------------------------------------+
-| |wt=N |Number of retr reqs that waited on lookup completion |
-+ +-------+-------------------------------------------------------+
-| |nod=N |Number of retr reqs returned -ENODATA |
-+ +-------+-------------------------------------------------------+
-| |nbf=N |Number of retr reqs rejected -ENOBUFS |
-+ +-------+-------------------------------------------------------+
-| |int=N |Number of retr reqs aborted -ERESTARTSYS |
-+ +-------+-------------------------------------------------------+
-| |oom=N |Number of retr reqs failed -ENOMEM |
-+ +-------+-------------------------------------------------------+
-| |ops=N |Number of retr reqs submitted |
-+ +-------+-------------------------------------------------------+
-| |owt=N |Number of retr reqs waited for CPU time |
-+ +-------+-------------------------------------------------------+
-| |abt=N |Number of retr reqs aborted due to object death |
-+--------------+-------+-------------------------------------------------------+
-|Stores |n=N |Number of storage (write) requests seen |
-+ +-------+-------------------------------------------------------+
-| |ok=N |Number of successful store reqs |
-+ +-------+-------------------------------------------------------+
-| |agn=N |Number of store reqs on a page already pending storage |
-+ +-------+-------------------------------------------------------+
-| |nbf=N |Number of store reqs rejected -ENOBUFS |
-+ +-------+-------------------------------------------------------+
-| |oom=N |Number of store reqs failed -ENOMEM |
-+ +-------+-------------------------------------------------------+
-| |ops=N |Number of store reqs submitted |
-+ +-------+-------------------------------------------------------+
-| |run=N |Number of store reqs granted CPU time |
-+ +-------+-------------------------------------------------------+
-| |pgs=N |Number of pages given store req processing time |
-+ +-------+-------------------------------------------------------+
-| |rxd=N |Number of store reqs deleted from tracking tree |
-+ +-------+-------------------------------------------------------+
-| |olm=N |Number of store reqs over store limit |
-+--------------+-------+-------------------------------------------------------+
-|VmScan |nos=N |Number of release reqs against pages with no |
-| | |pending store |
-+ +-------+-------------------------------------------------------+
-| |gon=N |Number of release reqs against pages stored by |
-| | |time lock granted |
-+ +-------+-------------------------------------------------------+
-| |bsy=N |Number of release reqs ignored due to in-progress store|
-+ +-------+-------------------------------------------------------+
-| |can=N |Number of page stores cancelled due to release req |
-+--------------+-------+-------------------------------------------------------+
-|Ops |pend=N |Number of times async ops added to pending queues |
-+ +-------+-------------------------------------------------------+
-| |run=N |Number of times async ops given CPU time |
-+ +-------+-------------------------------------------------------+
-| |enq=N |Number of times async ops queued for processing |
-+ +-------+-------------------------------------------------------+
-| |can=N |Number of async ops cancelled |
-+ +-------+-------------------------------------------------------+
-| |rej=N |Number of async ops rejected due to object |
-| | |lookup/create failure |
-+ +-------+-------------------------------------------------------+
-| |ini=N |Number of async ops initialised |
-+ +-------+-------------------------------------------------------+
-| |dfr=N |Number of async ops queued for deferred release |
-+ +-------+-------------------------------------------------------+
-| |rel=N |Number of async ops released |
-| | |(should equal ini=N when idle) |
-+ +-------+-------------------------------------------------------+
-| |gc=N |Number of deferred-release async ops garbage collected |
-+--------------+-------+-------------------------------------------------------+
-|CacheOp |alo=N |Number of in-progress alloc_object() cache ops |
-+ +-------+-------------------------------------------------------+
-| |luo=N |Number of in-progress lookup_object() cache ops |
-+ +-------+-------------------------------------------------------+
-| |luc=N |Number of in-progress lookup_complete() cache ops |
-+ +-------+-------------------------------------------------------+
-| |gro=N |Number of in-progress grab_object() cache ops |
-+ +-------+-------------------------------------------------------+
-| |upo=N |Number of in-progress update_object() cache ops |
-+ +-------+-------------------------------------------------------+
-| |dro=N |Number of in-progress drop_object() cache ops |
-+ +-------+-------------------------------------------------------+
-| |pto=N |Number of in-progress put_object() cache ops |
-+ +-------+-------------------------------------------------------+
-| |syn=N |Number of in-progress sync_cache() cache ops |
-+ +-------+-------------------------------------------------------+
-| |atc=N |Number of in-progress attr_changed() cache ops |
-+ +-------+-------------------------------------------------------+
-| |rap=N |Number of in-progress read_or_alloc_page() cache ops |
-+ +-------+-------------------------------------------------------+
-| |ras=N |Number of in-progress read_or_alloc_pages() cache ops |
-+ +-------+-------------------------------------------------------+
-| |alp=N |Number of in-progress allocate_page() cache ops |
-+ +-------+-------------------------------------------------------+
-| |als=N |Number of in-progress allocate_pages() cache ops |
-+ +-------+-------------------------------------------------------+
-| |wrp=N |Number of in-progress write_page() cache ops |
-+ +-------+-------------------------------------------------------+
-| |ucp=N |Number of in-progress uncache_page() cache ops |
-+ +-------+-------------------------------------------------------+
-| |dsp=N |Number of in-progress dissociate_pages() cache ops |
-+--------------+-------+-------------------------------------------------------+
-|CacheEv |nsp=N |Number of object lookups/creations rejected due to |
-| | |lack of space |
-+ +-------+-------------------------------------------------------+
-| |stl=N |Number of stale objects deleted |
-+ +-------+-------------------------------------------------------+
-| |rtr=N |Number of objects retired when relinquished |
-+ +-------+-------------------------------------------------------+
-| |cul=N |Number of objects culled |
+| |wr=N |Number of write operations in the cache |
+--------------+-------+-------------------------------------------------------+
+Netfslib will also add some stats counters of its own.
-/proc/fs/fscache/histogram
---------------------------
+Cache List
+==========
- ::
+FS-Cache provides a list of cache cookies:
- cat /proc/fs/fscache/histogram
- JIFS SECS OBJ INST OP RUNS OBJ RUNS RETRV DLY RETRIEVLS
- ===== ===== ========= ========= ========= ========= =========
+	/proc/fs/fscache/caches
- This shows the breakdown of the number of times each amount of time
- between 0 jiffies and HZ-1 jiffies a variety of tasks took to run. The
- columns are as follows:
+This will look something like::
- ========= =======================================================
- COLUMN TIME MEASUREMENT
- ========= =======================================================
- OBJ INST Length of time to instantiate an object
- OP RUNS Length of time a call to process an operation took
- OBJ RUNS Length of time a call to process an object event took
- RETRV DLY Time between an requesting a read and lookup completing
- RETRIEVLS Time between beginning and end of a retrieval
- ========= =======================================================
+ # cat /proc/fs/fscache/caches
+ CACHE REF VOLS OBJS ACCES S NAME
+ ======== ===== ===== ===== ===== = ===============
+ 00000001 2 1 2123 1 A default
- Each row shows the number of events that took a particular range of times.
- Each step is 1 jiffy in size. The JIFS column indicates the particular
- jiffy range covered, and the SECS field the equivalent number of seconds.
+where the columns are:
+ ======= ===============================================================
+ COLUMN DESCRIPTION
+ ======= ===============================================================
+ CACHE Cache cookie debug ID (also appears in traces)
+ REF Number of references on the cache cookie
+ VOLS Number of volume cookies in this cache
+ OBJS Number of cache objects in use
+ ACCES Number of accesses pinning the cache
+ S State
+ NAME Name of the cache
+ ======= ===============================================================
+
+The state can be (-) Inactive, (P)reparing, (A)ctive, (E)rror or (W)ithdrawing.
-Object List
+Volume List
===========
-If CONFIG_FSCACHE_OBJECT_LIST is enabled, the FS-Cache facility will maintain a
-list of all the objects currently allocated and allow them to be viewed
-through::
+FS-Cache provides a list of volume cookies:
- /proc/fs/fscache/objects
+ /proc/fs/fscache/volumes
This will look something like::
- [root@andromeda ~]# head /proc/fs/fscache/objects
- OBJECT PARENT STAT CHLDN OPS OOP IPR EX READS EM EV F S | NETFS_COOKIE_DEF TY FL NETFS_DATA OBJECT_KEY, AUX_DATA
- ======== ======== ==== ===== === === === == ===== == == = = | ================ == == ================ ================
- 17e4b 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88001dd82820 010006017edcf8bbc93b43298fdfbe71e50b57b13a172c0117f38472, e567634700000000000000000000000063f2404a000000000000000000000000c9030000000000000000000063f2404a
- 1693a 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88002db23380 010006017edcf8bbc93b43298fdfbe71e50b57b1e0162c01a2df0ea6, 420ebc4a000000000000000000000000420ebc4a0000000000000000000000000e1801000000000000000000420ebc4a
+ VOLUME REF nCOOK ACC FL CACHE KEY
+ ======== ===== ===== === == =============== ================
+ 00000001 55 54 1 00 default afs,example.com,100058
-where the first set of columns before the '|' describe the object:
+where the columns are:
======= ===============================================================
COLUMN DESCRIPTION
======= ===============================================================
- OBJECT Object debugging ID (appears as OBJ%x in some debug messages)
- PARENT Debugging ID of parent object
- STAT Object state
- CHLDN Number of child objects of this object
- OPS Number of outstanding operations on this object
- OOP Number of outstanding child object management operations
- IPR
- EX Number of outstanding exclusive operations
- READS Number of outstanding read operations
- EM Object's event mask
- EV Events raised on this object
- F Object flags
- S Object work item busy state mask (1:pending 2:running)
+ VOLUME The volume cookie debug ID (also appears in traces)
+ REF Number of references on the volume cookie
+ nCOOK Number of cookies in the volume
+ ACC Number of accesses pinning the volume
+ FL Flags on the volume cookie
+ CACHE Name of the cache or "-"
+ KEY The indexing key for the volume
======= ===============================================================
-and the second set of columns describe the object's cookie, if present:
-
- ================ ======================================================
- COLUMN DESCRIPTION
- ================ ======================================================
- NETFS_COOKIE_DEF Name of netfs cookie definition
- TY Cookie type (IX - index, DT - data, hex - special)
- FL Cookie flags
- NETFS_DATA Netfs private data stored in the cookie
- OBJECT_KEY Object key } 1 column, with separating comma
- AUX_DATA Object aux data } presence may be configured
- ================ ======================================================
-
-The data shown may be filtered by attaching the a key to an appropriate keyring
-before viewing the file. Something like::
-
- keyctl add user fscache:objlist <restrictions> @s
-
-where <restrictions> are a selection of the following letters:
- == =========================================================
- K Show hexdump of object key (don't show if not given)
- A Show hexdump of object aux data (don't show if not given)
- == =========================================================
+Cookie List
+===========
-and the following paired letters:
+FS-Cache provides a list of cookies:
- == =========================================================
- C Show objects that have a cookie
- c Show objects that don't have a cookie
- B Show objects that are busy
- b Show objects that aren't busy
- W Show objects that have pending writes
- w Show objects that don't have pending writes
- R Show objects that have outstanding reads
- r Show objects that don't have outstanding reads
- S Show objects that have work queued
- s Show objects that don't have work queued
- == =========================================================
+ /proc/fs/fscache/cookies
-If neither side of a letter pair is given, then both are implied. For example:
+This will look something like::
- keyctl add user fscache:objlist KB @s
+ # head /proc/fs/fscache/cookies
+ COOKIE VOLUME REF ACT ACC S FL DEF
+ ======== ======== === === === = == ================
+ 00000435 00000001 1 0 -1 - 08 0000000201d080070000000000000000, 0000000000000000
+ 00000436 00000001 1 0 -1 - 00 0000005601d080080000000000000000, 0000000000000051
+ 00000437 00000001 1 0 -1 - 08 00023b3001d0823f0000000000000000, 0000000000000000
+ 00000438 00000001 1 0 -1 - 08 0000005801d0807b0000000000000000, 0000000000000000
+ 00000439 00000001 1 0 -1 - 08 00023b3201d080a10000000000000000, 0000000000000000
+ 0000043a 00000001 1 0 -1 - 08 00023b3401d080a30000000000000000, 0000000000000000
+ 0000043b 00000001 1 0 -1 - 08 00023b3601d080b30000000000000000, 0000000000000000
+ 0000043c 00000001 1 0 -1 - 08 00023b3801d080b40000000000000000, 0000000000000000
-shows objects that are busy, and lists their object keys, but does not dump
-their auxiliary data. It also implies "CcWwRrSs", but as 'B' is given, 'b' is
-not implied.
+where the columns are:
-By default all objects and all fields will be shown.
+ ======= ===============================================================
+ COLUMN DESCRIPTION
+ ======= ===============================================================
+ COOKIE The cookie debug ID (also appears in traces)
+ VOLUME The parent volume cookie debug ID
+ REF Number of references on the cookie
+ ACT Number of times the cookie is marked in use
+ ACC Number of access pins on the cookie
+ S State of the cookie
+ FL Flags on the cookie
+ DEF Key, auxiliary data
+ ======= ===============================================================
Debugging
=========
-If CONFIG_FSCACHE_DEBUG is enabled, the FS-Cache facility can have runtime
-debugging enabled by adjusting the value in::
+If CONFIG_NETFS_DEBUG is enabled, the FS-Cache facility and NETFS support can
+have runtime debugging enabled by adjusting the value in::
- /sys/module/fscache/parameters/debug
+ /sys/module/netfs/parameters/debug
This is a bitmask of debugging streams to enable:
@@ -549,10 +334,8 @@ This is a bitmask of debugging streams to enable:
3 8 Cookie management Function entry trace
4 16 Function exit trace
5 32 General
- 6 64 Page handling Function entry trace
- 7 128 Function exit trace
- 8 256 General
- 9 512 Operation management Function entry trace
+ 6-8 (Not used)
+ 9 512 I/O operation management Function entry trace
10 1024 Function exit trace
11 2048 General
======= ======= =============================== =======================
@@ -560,6 +343,6 @@ This is a bitmask of debugging streams to enable:
The appropriate set of values should be OR'd together and the result written to
the control file. For example::
- echo $((1|8|64)) >/sys/module/fscache/parameters/debug
+ echo $((1|8|512)) >/sys/module/netfs/parameters/debug
will turn on all function entry debugging.
diff --git a/Documentation/filesystems/caching/index.rst b/Documentation/filesystems/caching/index.rst
index 033da7ac7c6e..df4307124b00 100644
--- a/Documentation/filesystems/caching/index.rst
+++ b/Documentation/filesystems/caching/index.rst
@@ -7,8 +7,6 @@ Filesystem Caching
:maxdepth: 2
fscache
- object
+ netfs-api
backend-api
cachefiles
- netfs-api
- operations
diff --git a/Documentation/filesystems/caching/netfs-api.rst b/Documentation/filesystems/caching/netfs-api.rst
index d9f14b8610ba..665b27f1556e 100644
--- a/Documentation/filesystems/caching/netfs-api.rst
+++ b/Documentation/filesystems/caching/netfs-api.rst
@@ -1,896 +1,452 @@
.. SPDX-License-Identifier: GPL-2.0
-===============================
-FS-Cache Network Filesystem API
-===============================
+==============================
+Network Filesystem Caching API
+==============================
-There's an API by which a network filesystem can make use of the FS-Cache
-facilities. This is based around a number of principles:
+Fscache provides an API by which a network filesystem can make use of local
+caching facilities. The API is arranged around a number of principles:
- (1) Caches can store a number of different object types. There are two main
- object types: indices and files. The first is a special type used by
- FS-Cache to make finding objects faster and to make retiring of groups of
- objects easier.
+ (1) A cache is logically organised into volumes and data storage objects
+ within those volumes.
- (2) Every index, file or other object is represented by a cookie. This cookie
- may or may not have anything associated with it, but the netfs doesn't
- need to care.
+ (2) Volumes and data storage objects are represented by various types of
+ cookie.
- (3) Barring the top-level index (one entry per cached netfs), the index
- hierarchy for each netfs is structured according the whim of the netfs.
+ (3) Cookies have keys that distinguish them from their peers.
-This API is declared in <linux/fscache.h>.
+ (4) Cookies have coherency data that allows a cache to determine if the
+ cached data is still valid.
-.. This document contains the following sections:
-
- (1) Network filesystem definition
- (2) Index definition
- (3) Object definition
- (4) Network filesystem (un)registration
- (5) Cache tag lookup
- (6) Index registration
- (7) Data file registration
- (8) Miscellaneous object registration
- (9) Setting the data file size
- (10) Page alloc/read/write
- (11) Page uncaching
- (12) Index and data file consistency
- (13) Cookie enablement
- (14) Miscellaneous cookie operations
- (15) Cookie unregistration
- (16) Index invalidation
- (17) Data file invalidation
- (18) FS-Cache specific page flags.
-
-
-Network Filesystem Definition
-=============================
-
-FS-Cache needs a description of the network filesystem. This is specified
-using a record of the following structure::
-
- struct fscache_netfs {
- uint32_t version;
- const char *name;
- struct fscache_cookie *primary_index;
- ...
- };
-
-This first two fields should be filled in before registration, and the third
-will be filled in by the registration function; any other fields should just be
-ignored and are for internal use only.
-
-The fields are:
-
- (1) The name of the netfs (used as the key in the toplevel index).
-
- (2) The version of the netfs (if the name matches but the version doesn't, the
- entire in-cache hierarchy for this netfs will be scrapped and begun
- afresh).
-
- (3) The cookie representing the primary index will be allocated according to
- another parameter passed into the registration function.
-
-For example, kAFS (linux/fs/afs/) uses the following definitions to describe
-itself::
-
- struct fscache_netfs afs_cache_netfs = {
- .version = 0,
- .name = "afs",
- };
-
-
-Index Definition
-================
-
-Indices are used for two purposes:
-
- (1) To aid the finding of a file based on a series of keys (such as AFS's
- "cell", "volume ID", "vnode ID").
-
- (2) To make it easier to discard a subset of all the files cached based around
- a particular key - for instance to mirror the removal of an AFS volume.
-
-However, since it's unlikely that any two netfs's are going to want to define
-their index hierarchies in quite the same way, FS-Cache tries to impose as few
-restraints as possible on how an index is structured and where it is placed in
-the tree. The netfs can even mix indices and data files at the same level, but
-it's not recommended.
-
-Each index entry consists of a key of indeterminate length plus some auxiliary
-data, also of indeterminate length.
-
-There are some limits on indices:
-
- (1) Any index containing non-index objects should be restricted to a single
- cache. Any such objects created within an index will be created in the
- first cache only. The cache in which an index is created can be
- controlled by cache tags (see below).
-
- (2) The entry data must be atomically journallable, so it is limited to about
- 400 bytes at present. At least 400 bytes will be available.
-
- (3) The depth of the index tree should be judged with care as the search
- function is recursive. Too many layers will run the kernel out of stack.
-
-
-Object Definition
-=================
-
-To define an object, a structure of the following type should be filled out::
-
- struct fscache_cookie_def
- {
- uint8_t name[16];
- uint8_t type;
-
- struct fscache_cache_tag *(*select_cache)(
- const void *parent_netfs_data,
- const void *cookie_netfs_data);
-
- enum fscache_checkaux (*check_aux)(void *cookie_netfs_data,
- const void *data,
- uint16_t datalen,
- loff_t object_size);
-
- void (*get_context)(void *cookie_netfs_data, void *context);
-
- void (*put_context)(void *cookie_netfs_data, void *context);
-
- void (*mark_pages_cached)(void *cookie_netfs_data,
- struct address_space *mapping,
- struct pagevec *cached_pvec);
- };
-
-This has the following fields:
-
- (1) The type of the object [mandatory].
-
- This is one of the following values:
-
- FSCACHE_COOKIE_TYPE_INDEX
- This defines an index, which is a special FS-Cache type.
-
- FSCACHE_COOKIE_TYPE_DATAFILE
- This defines an ordinary data file.
-
- Any other value between 2 and 255
- This defines an extraordinary object such as an XATTR.
-
- (2) The name of the object type (NUL terminated unless all 16 chars are used)
- [optional].
-
- (3) A function to select the cache in which to store an index [optional].
-
- This function is invoked when an index needs to be instantiated in a cache
- during the instantiation of a non-index object. Only the immediate index
- parent for the non-index object will be queried. Any indices above that
- in the hierarchy may be stored in multiple caches. This function does not
- need to be supplied for any non-index object or any index that will only
- have index children.
-
- If this function is not supplied or if it returns NULL then the first
- cache in the parent's list will be chosen, or failing that, the first
- cache in the master list.
-
- (4) A function to check the auxiliary data [optional].
-
- This function will be called to check that a match found in the cache for
- this object is valid. For instance with AFS it could check the auxiliary
- data against the data version number returned by the server to determine
- whether the index entry in a cache is still valid.
-
- If this function is absent, it will be assumed that matching objects in a
- cache are always valid.
-
- The function is also passed the cache's idea of the object size and may
- use this to manage coherency also.
-
- If present, the function should return one of the following values:
-
- FSCACHE_CHECKAUX_OKAY
- - the entry is okay as is
-
- FSCACHE_CHECKAUX_NEEDS_UPDATE
- - the entry requires update
-
- FSCACHE_CHECKAUX_OBSOLETE
- - the entry should be deleted
+ (5) I/O is done asynchronously where possible.
- This function can also be used to extract data from the auxiliary data in
- the cache and copy it into the netfs's structures.
+This API is used by::
- (5) A pair of functions to manage contexts for the completion callback
- [optional].
+	#include <linux/fscache.h>
- The cache read/write functions are passed a context which is then passed
- to the I/O completion callback function. To ensure this context remains
- valid until after the I/O completion is called, two functions may be
- provided: one to get an extra reference on the context, and one to drop a
- reference to it.
-
- If the context is not used or is a type of object that won't go out of
- scope, then these functions are not required. These functions are not
- required for indices as indices may not contain data. These functions may
- be called in interrupt context and so may not sleep.
-
- (6) A function to mark a page as retaining cache metadata [optional].
-
- This is called by the cache to indicate that it is retaining in-memory
- information for this page and that the netfs should uncache the page when
- it has finished. This does not indicate whether there's data on the disk
- or not. Note that several pages at once may be presented for marking.
-
- The PG_fscache bit is set on the pages before this function would be
- called, so the function need not be provided if this is sufficient.
-
- This function is not required for indices as they're not permitted data.
-
- (7) A function to unmark all the pages retaining cache metadata [mandatory].
-
- This is called by FS-Cache to indicate that a backing store is being
- unbound from a cookie and that all the marks on the pages should be
- cleared to prevent confusion. Note that the cache will have torn down all
- its tracking information so that the pages don't need to be explicitly
- uncached.
-
- This function is not required for indices as they're not permitted data.
-
-
-Network Filesystem (Un)registration
-===================================
-
-The first step is to declare the network filesystem to the cache. This also
-involves specifying the layout of the primary index (for AFS, this would be the
-"cell" level).
-
-The registration function is::
-
- int fscache_register_netfs(struct fscache_netfs *netfs);
-
-It just takes a pointer to the netfs definition. It returns 0 or an error as
-appropriate.
-
-For kAFS, registration is done as follows::
-
- ret = fscache_register_netfs(&afs_cache_netfs);
-
-The last step is, of course, unregistration::
-
- void fscache_unregister_netfs(struct fscache_netfs *netfs);
-
-
-Cache Tag Lookup
-================
-
-FS-Cache permits the use of more than one cache. To permit particular index
-subtrees to be bound to particular caches, the second step is to look up cache
-representation tags. This step is optional; it can be left entirely up to
-FS-Cache as to which cache should be used. The problem with doing that is that
-FS-Cache will always pick the first cache that was registered.
-
-To get the representation for a named tag::
-
- struct fscache_cache_tag *fscache_lookup_cache_tag(const char *name);
-
-This takes a text string as the name and returns a representation of a tag. It
-will never return an error. It may return a dummy tag, however, if it runs out
-of memory; this will inhibit caching with this tag.
-
-Any representation so obtained must be released by passing it to this function::
-
- void fscache_release_cache_tag(struct fscache_cache_tag *tag);
+.. This document contains the following sections:
-The tag will be retrieved by FS-Cache when it calls the object definition
-operation select_cache().
+ (1) Overview
+ (2) Volume registration
+ (3) Data file registration
+ (4) Declaring a cookie to be in use
+ (5) Resizing a data file (truncation)
+ (6) Data I/O API
+ (7) Data file coherency
+ (8) Data file invalidation
+ (9) Write back resource management
+ (10) Caching of local modifications
+ (11) Page release and invalidation
+
+
+Overview
+========
+
+The fscache hierarchy is organised on two levels from a network filesystem's
+point of view. The upper level represents "volumes" and the lower level
+represents "data storage objects". These are represented by two types of
+cookie, hereafter referred to as "volume cookies" and "cookies".
+
+A network filesystem acquires a volume cookie for a volume using a volume key,
+which represents all the information that defines that volume (e.g. cell name
+or server address, volume ID or share name). This must be rendered as a
+printable string that can be used as a directory name (ie. no '/' characters
+and shouldn't begin with a '.'). The maximum name length is one less than the
+maximum size of a filename component (allowing the cache backend one char for
+its own purposes).
+
+A filesystem would typically have a volume cookie for each superblock.
+
+The filesystem then acquires a cookie for each file within that volume using an
+object key. Object keys are binary blobs and only need to be unique within
+their parent volume. The cache backend is responsible for rendering the binary
+blob into something it can use and may employ hash tables, trees or whatever to
+improve its ability to find an object. This is transparent to the network
+filesystem.
+
+A filesystem would typically have a cookie for each inode, and would acquire
+it in iget and relinquish it when evicting the inode.
+
+Once it has a cookie, the filesystem needs to mark the cookie as being in use.
+This causes fscache to send the cache backend off to look up/create resources
+for the cookie in the background, to check its coherency and, if necessary, to
+mark the object as being under modification.
+
+A filesystem would typically "use" the cookie in its file open routine and
+unuse it in file release, and it needs to use the cookie around calls that
+truncate the file locally. It *also* needs to use the cookie when the
+pagecache becomes dirty and unuse it when writeback is complete. This is
+slightly tricky, and provision is made for it.
+
+When performing a read, write or resize on a cookie, the filesystem must first
+begin an operation. This copies the resources into a holding struct and puts
+extra pins into the cache to stop cache withdrawal from tearing down the
+structures being used. The actual operation can then be issued and conflicting
+invalidations can be detected upon completion.
+
+The filesystem is expected to use netfslib to access the cache, but that's not
+actually required and it can use the fscache I/O API directly.
+
+
+Volume Registration
+===================
+
+The first step for a network filesystem is to acquire a volume cookie for the
+volume it wants to access::
+
+ struct fscache_volume *
+ fscache_acquire_volume(const char *volume_key,
+ const char *cache_name,
+ const void *coherency_data,
+ size_t coherency_len);
+
+This function creates a volume cookie with the specified volume key as its name
+and notes the coherency data.
+
+The volume key must be a printable string with no '/' characters in it. It
+should begin with the name of the filesystem and should be no longer than 254
+characters. It should uniquely represent the volume and will be matched with
+what's stored in the cache.
+
+The caller may also specify the name of the cache to use. If specified,
+fscache will look up or create a cache cookie of that name and will use a cache
+of that name if it is online or comes online. If no cache name is specified,
+it will use the first cache that comes to hand and set the name to that.
+
+The specified coherency data is stored in the cookie and will be matched
+against coherency data stored on disk. The data pointer may be NULL if no data
+is provided. If the coherency data doesn't match, the entire cache volume will
+be invalidated.
+
+This function can return errors such as EBUSY if the volume key is already in
+use by an acquired volume or ENOMEM if an allocation failure occurred. It may
+also return a NULL volume cookie if fscache is not enabled. It is safe to
+pass a NULL cookie to any function that takes a volume cookie. This will
+cause that function to do nothing.
+
+
+When the network filesystem has finished with a volume, it should relinquish it
+by calling::
+
+ void fscache_relinquish_volume(struct fscache_volume *volume,
+ const void *coherency_data,
+ bool invalidate);
+
+This will cause the volume to be committed or removed and, if coherency data
+is supplied, it will be updated to the value given. The amount of coherency data
+must match the length specified when the volume was acquired. Note that all
+data cookies obtained in this volume must be relinquished before the volume is
+relinquished.
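+
+As an illustration, a filesystem might acquire its volume cookie at mount time
+something like the sketch below (the "myfs" names and the use of a creation
+time as coherency data are hypothetical)::
+
+	struct myfs_sb_info {
+		struct fscache_volume *volume;
+		__le64 creation_time;	/* coherency data */
+	};
+
+	static int myfs_acquire_volume(struct myfs_sb_info *sbi,
+				       const char *cell, const char *volname)
+	{
+		struct fscache_volume *volume;
+		char *key;
+
+		/* Render the volume key as a printable string, beginning with
+		 * the name of the filesystem. */
+		key = kasprintf(GFP_KERNEL, "myfs,%s,%s", cell, volname);
+		if (!key)
+			return -ENOMEM;
+
+		volume = fscache_acquire_volume(key, NULL /* any cache */,
+						&sbi->creation_time,
+						sizeof(sbi->creation_time));
+		kfree(key);
+		if (IS_ERR(volume))
+			return PTR_ERR(volume);
+		sbi->volume = volume;	/* may be NULL: caching disabled */
+		return 0;
+	}
+
+and relinquish it at unmount time with something like::
+
+	fscache_relinquish_volume(sbi->volume, &sbi->creation_time, false);
+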
-Index Registration
-==================
+Data File Registration
+======================
-The third step is to inform FS-Cache about part of an index hierarchy that can
-be used to locate files. This is done by requesting a cookie for each index in
-the path to the file::
+Once it has a volume cookie, a network filesystem can use it to acquire a
+cookie for data storage::
struct fscache_cookie *
- fscache_acquire_cookie(struct fscache_cookie *parent,
- const struct fscache_object_def *def,
+ fscache_acquire_cookie(struct fscache_volume *volume,
+ u8 advice,
const void *index_key,
size_t index_key_len,
const void *aux_data,
size_t aux_data_len,
- void *netfs_data,
- loff_t object_size,
- bool enable);
+			       loff_t object_size);
-This function creates an index entry in the index represented by parent,
-filling in the index entry by calling the operations pointed to by def.
+This creates the cookie in the volume using the specified index key. The index
+key is a binary blob of the given length and must be unique for the volume.
+This is saved into the cookie. There are no restrictions on the content, but
+its length shouldn't exceed about three quarters of the maximum filename length
+to allow for encoding.
-A unique key that represents the object within the parent must be pointed to by
-index_key and is of length index_key_len.
+The caller should also pass in a piece of coherency data in aux_data. A buffer
+of size aux_data_len will be allocated and the coherency data copied in. It is
+assumed that the size is invariant over time. The coherency data is used to
+check the validity of data in the cache. Functions are provided by which the
+coherency data can be updated.
-An optional blob of auxiliary data that is to be stored within the cache can be
-pointed to with aux_data and should be of length aux_data_len. This would
-typically be used for storing coherency data.
+The file size of the object being cached should also be provided. This may be
+used to trim the data and will be stored with the coherency data.
-The netfs may pass an arbitrary value in netfs_data and this will be presented
-to it in the event of any calling back. This may also be used in tracing or
-logging of messages.
+This function never returns an error, though it may return a NULL cookie on
+allocation failure or if fscache is not enabled. It is safe to pass in a NULL
+volume cookie and pass the NULL cookie returned to any function that takes it.
+This will cause that function to do nothing.
-The cache tracks the size of the data attached to an object and this set to be
-object_size. For indices, this should be 0. This value will be passed to the
-->check_aux() callback.
-Note that this function never returns an error - all errors are handled
-internally. It may, however, return NULL to indicate no cookie. It is quite
-acceptable to pass this token back to this function as the parent to another
-acquisition (or even to the relinquish cookie, read page and write page
-functions - see below).
+When the network filesystem has finished with a cookie, it should relinquish it
+by calling::
-Note also that no indices are actually created in a cache until a non-index
-object needs to be created somewhere down the hierarchy. Furthermore, an index
-may be created in several different caches independently at different times.
-This is all handled transparently, and the netfs doesn't see any of it.
+ void fscache_relinquish_cookie(struct fscache_cookie *cookie,
+ bool retire);
-A cookie will be created in the disabled state if enabled is false. A cookie
-must be enabled to do anything with it. A disabled cookie can be enabled by
-calling fscache_enable_cookie() (see below).
+This will cause fscache to either commit the storage backing the cookie or
+delete it.
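+
+For illustration, a cookie might be acquired when an inode is set up and
+relinquished when the inode is evicted, something like the sketch below (the
+"myfs" structure, the use of a file ID as the index key and a data version as
+coherency data are all hypothetical)::
+
+	struct myfs_inode {
+		struct inode vfs_inode;
+		struct fscache_cookie *cookie;
+		__le64 fid;		/* unique within the volume */
+		__le64 data_version;	/* coherency data */
+	};
+
+	static void myfs_get_cookie(struct myfs_inode *mi,
+				    struct fscache_volume *volume)
+	{
+		mi->cookie = fscache_acquire_cookie(
+			volume, 0 /* advice */,
+			&mi->fid, sizeof(mi->fid),
+			&mi->data_version, sizeof(mi->data_version),
+			i_size_read(&mi->vfs_inode));
+	}
+
+	static void myfs_evict_inode(struct inode *inode)
+	{
+		struct myfs_inode *mi =
+			container_of(inode, struct myfs_inode, vfs_inode);
+
+		/* ... pagecache and VFS teardown ... */
+		fscache_relinquish_cookie(mi->cookie, false /* don't retire */);
+	}
+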
-For example, with AFS, a cell would be added to the primary index. This index
-entry would have a dependent inode containing volume mappings within this cell::
- cell->cache =
- fscache_acquire_cookie(afs_cache_netfs.primary_index,
- &afs_cell_cache_index_def,
- cell->name, strlen(cell->name),
- NULL, 0,
- cell, 0, true);
+Marking A Cookie In-Use
+=======================
-And then a particular volume could be added to that index by ID, creating
-another index for vnodes (AFS inode equivalents)::
+Once a cookie has been acquired by a network filesystem, the filesystem should
+tell fscache when it intends to use the cookie (typically done on file open)
+and should say when it has finished with it (typically on file close)::
- volume->cache =
- fscache_acquire_cookie(volume->cell->cache,
- &afs_volume_cache_index_def,
- &volume->vid, sizeof(volume->vid),
- NULL, 0,
- volume, 0, true);
+ void fscache_use_cookie(struct fscache_cookie *cookie,
+ bool will_modify);
+ void fscache_unuse_cookie(struct fscache_cookie *cookie,
+ const void *aux_data,
+ const loff_t *object_size);
+The *use* function tells fscache that the filesystem will be using the cookie
+and, additionally, indicates whether the user intends to modify the contents
+locally. If not yet
+done, this will trigger the cache backend to go and gather the resources it
+needs to access/store data in the cache. This is done in the background, and
+so may not be complete by the time the function returns.
-Data File Registration
-======================
+The *unuse* function indicates that a filesystem has finished using a cookie.
+It optionally updates the stored coherency data and object size and then
+decreases the in-use counter. When the last user unuses the cookie, it is
+scheduled for garbage collection. If not reused within a short time, the
+resources will be released to reduce system resource consumption.
-The fourth step is to request a data file be created in the cache. This is
-identical to index cookie acquisition. The only difference is that the type in
-the object definition should be something other than index type::
+A cookie must be marked in-use before it can be accessed for read, write or
+resize - and an in-use mark must be kept whilst there is dirty data in the
+pagecache in order to avoid an oops due to trying to open a file during process
+exit.
- vnode->cache =
- fscache_acquire_cookie(volume->cache,
- &afs_vnode_cache_object_def,
- &key, sizeof(key),
- &aux, sizeof(aux),
- vnode, vnode->status.size, true);
+Note that in-use marks are cumulative. For each time a cookie is marked
+in-use, it must be unused.
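+
+For example, open and release handlers might pair uses and unuses like this (a
+sketch; MYFS_I() and the data_version field are hypothetical)::
+
+	static int myfs_open(struct inode *inode, struct file *file)
+	{
+		/* Say whether we might modify the pagecache contents. */
+		fscache_use_cookie(MYFS_I(inode)->cookie,
+				   file->f_mode & FMODE_WRITE);
+		return 0;
+	}
+
+	static int myfs_release(struct inode *inode, struct file *file)
+	{
+		struct myfs_inode *mi = MYFS_I(inode);
+		loff_t size = i_size_read(inode);
+
+		/* Hand back our use, updating coherency data and size. */
+		fscache_unuse_cookie(mi->cookie, &mi->data_version, &size);
+		return 0;
+	}
+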
-Miscellaneous Object Registration
+Resizing A Data File (Truncation)
=================================
-An optional step is to request an object of miscellaneous type be created in
-the cache. This is almost identical to index cookie acquisition. The only
-difference is that the type in the object definition should be something other
-than index type. While the parent object could be an index, it's more likely
-it would be some other type of object such as a data file::
-
- xattr->cache =
- fscache_acquire_cookie(vnode->cache,
- &afs_xattr_cache_object_def,
- &xattr->name, strlen(xattr->name),
- NULL, 0,
- xattr, strlen(xattr->val), true);
-
-Miscellaneous objects might be used to store extended attributes or directory
-entries for example.
-
-
-Setting the Data File Size
-==========================
+If a network filesystem file is resized locally by truncation, the following
+should be called to notify the cache::
-The fifth step is to set the physical attributes of the file, such as its size.
-This doesn't automatically reserve any space in the cache, but permits the
-cache to adjust its metadata for data tracking appropriately::
+ void fscache_resize_cookie(struct fscache_cookie *cookie,
+ loff_t new_size);
- int fscache_attr_changed(struct fscache_cookie *cookie);
+The caller must have first marked the cookie in-use. The cookie and the new
+size are passed in and the cache is synchronously resized. This is expected to
+be called from the ``->setattr()`` inode operation under the inode lock.
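+
+As a rough sketch, this might sit in a filesystem's ``->setattr()`` handler
+something like this (MYFS_I() is hypothetical; locking and error handling are
+elided)::
+
+	if (attr->ia_valid & ATTR_SIZE) {
+		struct fscache_cookie *cookie = MYFS_I(inode)->cookie;
+
+		fscache_use_cookie(cookie, true /* will modify */);
+		truncate_setsize(inode, attr->ia_size);
+		fscache_resize_cookie(cookie, attr->ia_size);
+		fscache_unuse_cookie(cookie, NULL, &attr->ia_size);
+	}
+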
-The cache will return -ENOBUFS if there is no backing cache or if there is no
-space to allocate any extra metadata required in the cache.
-Note that attempts to read or write data pages in the cache over this size may
-be rebuffed with -ENOBUFS.
+Data I/O API
+============
-This operation schedules an attribute adjustment to happen asynchronously at
-some point in the future, and as such, it may happen after the function returns
-to the caller. The attribute adjustment excludes read and write operations.
+To do data I/O operations directly through a cookie, the following functions
+are available::
+ int fscache_begin_read_operation(struct netfs_cache_resources *cres,
+ struct fscache_cookie *cookie);
+ int fscache_read(struct netfs_cache_resources *cres,
+ loff_t start_pos,
+ struct iov_iter *iter,
+ enum netfs_read_from_hole read_hole,
+ netfs_io_terminated_t term_func,
+ void *term_func_priv);
+ int fscache_write(struct netfs_cache_resources *cres,
+ loff_t start_pos,
+ struct iov_iter *iter,
+ netfs_io_terminated_t term_func,
+ void *term_func_priv);
-Page alloc/read/write
-=====================
+The *begin* function sets up an operation, attaching the resources required to
+the cache resources block from the cookie. Assuming it doesn't return an error
+(for instance, it will return -ENOBUFS if given a NULL cookie, having done
+nothing), one of the other two functions can then be issued.
-And the sixth step is to store and retrieve pages in the cache. There are
-three functions that are used to do this.
+The *read* and *write* functions initiate a direct-IO operation. Both take the
+previously set up cache resources block, an indication of the start file
+position, and an I/O iterator that describes the buffer and indicates the
+amount of data.
-Note:
+The read function also takes a parameter to indicate how it should handle a
+partially populated region (a hole) in the disk content. This may be to ignore
+it, skip over an initial hole and place zeros in the buffer, or give an error.
- (1) A page should not be re-read or re-allocated without uncaching it first.
-
- (2) A read or allocated page must be uncached when the netfs page is released
- from the pagecache.
-
- (3) A page should only be written to the cache if previous read or allocated.
-
-This permits the cache to maintain its page tracking in proper order.
-
-
-PAGE READ
----------
-
-Firstly, the netfs should ask FS-Cache to examine the caches and read the
-contents cached for a particular page of a particular file if present, or else
-allocate space to store the contents if not::
+The read and write functions can be given an optional termination function that
+will be run on completion::
typedef
- void (*fscache_rw_complete_t)(struct page *page,
- void *context,
- int error);
-
- int fscache_read_or_alloc_page(struct fscache_cookie *cookie,
- struct page *page,
- fscache_rw_complete_t end_io_func,
- void *context,
- gfp_t gfp);
-
-The cookie argument must specify a cookie for an object that isn't an index,
-the page specified will have the data loaded into it (and is also used to
-specify the page number), and the gfp argument is used to control how any
-memory allocations made are satisfied.
-
-If the cookie indicates the inode is not cached:
-
- (1) The function will return -ENOBUFS.
-
-Else if there's a copy of the page resident in the cache:
-
- (1) The mark_pages_cached() cookie operation will be called on that page.
+ void (*netfs_io_terminated_t)(void *priv, ssize_t transferred_or_error,
+ bool was_async);
- (2) The function will submit a request to read the data from the cache's
- backing device directly into the page specified.
+If a termination function is given, the operation will be run asynchronously
+and the termination function will be called upon completion. If not given, the
+operation will be run synchronously. Note that in the asynchronous case, it is
+possible for the operation to complete before the function returns.
- (3) The function will return 0.
+Both the read and write functions end the operation when they complete,
+detaching any pinned resources.
- (4) When the read is complete, end_io_func() will be invoked with:
+The read operation will fail with ESTALE if invalidation occurred whilst the
+operation was ongoing.
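+
+Putting these together, a synchronous read of part of a file from the cache
+might look something like the sketch below. It assumes the destination is the
+inode's pagecache and that failing on a hole (the NETFS_READ_HOLE_FAIL policy
+from linux/netfs.h) is the desired behaviour::
+
+	static int myfs_read_from_cache(struct fscache_cookie *cookie,
+					struct address_space *mapping,
+					loff_t start_pos, size_t len)
+	{
+		struct netfs_cache_resources cres = {};
+		struct iov_iter iter;
+		int ret;
+
+		ret = fscache_begin_read_operation(&cres, cookie);
+		if (ret < 0)
+			return ret;	/* e.g. -ENOBUFS: nothing usable cached */
+
+		/* Describe the destination buffer: here, pagecache pages. */
+		iov_iter_xarray(&iter, READ, &mapping->i_pages, start_pos, len);
+
+		/* No termination function is given, so this runs synchronously
+		 * and detaches the pinned resources when it completes. */
+		return fscache_read(&cres, start_pos, &iter,
+				    NETFS_READ_HOLE_FAIL, NULL, NULL);
+	}
+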
- * The netfs data supplied when the cookie was created.
- * The page descriptor.
+Data File Coherency
+===================
- * The context argument passed to the above function. This will be
- maintained with the get_context/put_context functions mentioned above.
-
- * An argument that's 0 on success or negative for an error code.
-
- If an error occurs, it should be assumed that the page contains no usable
- data. fscache_readpages_cancel() may need to be called.
-
- end_io_func() will be called in process context if the read is results in
- an error, but it might be called in interrupt context if the read is
- successful.
-
-Otherwise, if there's not a copy available in cache, but the cache may be able
-to store the page:
-
- (1) The mark_pages_cached() cookie operation will be called on that page.
-
- (2) A block may be reserved in the cache and attached to the object at the
- appropriate place.
-
- (3) The function will return -ENODATA.
-
-This function may also return -ENOMEM or -EINTR, in which case it won't have
-read any data from the cache.
-
-
-Page Allocate
--------------
-
-Alternatively, if there's not expected to be any data in the cache for a page
-because the file has been extended, a block can simply be allocated instead::
-
- int fscache_alloc_page(struct fscache_cookie *cookie,
- struct page *page,
- gfp_t gfp);
-
-This is similar to the fscache_read_or_alloc_page() function, except that it
-never reads from the cache. It will return 0 if a block has been allocated,
-rather than -ENODATA as the other would. One or the other must be performed
-before writing to the cache.
-
-The mark_pages_cached() cookie operation will be called on the page if
-successful.
-
-
-Page Write
-----------
-
-Secondly, if the netfs changes the contents of the page (either due to an
-initial download or if a user performs a write), then the page should be
-written back to the cache::
-
- int fscache_write_page(struct fscache_cookie *cookie,
- struct page *page,
- loff_t object_size,
- gfp_t gfp);
-
-The cookie argument must specify a data file cookie, the page specified should
-contain the data to be written (and is also used to specify the page number),
-object_size is the revised size of the object and the gfp argument is used to
-control how any memory allocations made are satisfied.
-
-The page must have first been read or allocated successfully and must not have
-been uncached before writing is performed.
-
-If the cookie indicates the inode is not cached then:
-
- (1) The function will return -ENOBUFS.
-
-Else if space can be allocated in the cache to hold this page:
-
- (1) PG_fscache_write will be set on the page.
-
- (2) The function will submit a request to write the data to cache's backing
- device directly from the page specified.
-
- (3) The function will return 0.
-
- (4) When the write is complete PG_fscache_write is cleared on the page and
- anyone waiting for that bit will be woken up.
-
-Else if there's no space available in the cache, -ENOBUFS will be returned. It
-is also possible for the PG_fscache_write bit to be cleared when no write took
-place if unforeseen circumstances arose (such as a disk error).
-
-Writing takes place asynchronously.
-
-
-Multiple Page Read
-------------------
-
-A facility is provided to read several pages at once, as requested by the
-readpages() address space operation::
-
- int fscache_read_or_alloc_pages(struct fscache_cookie *cookie,
- struct address_space *mapping,
- struct list_head *pages,
- int *nr_pages,
- fscache_rw_complete_t end_io_func,
- void *context,
- gfp_t gfp);
-
-This works in a similar way to fscache_read_or_alloc_page(), except:
-
- (1) Any page it can retrieve data for is removed from pages and nr_pages and
- dispatched for reading to the disk. Reads of adjacent pages on disk may
- be merged for greater efficiency.
-
- (2) The mark_pages_cached() cookie operation will be called on several pages
- at once if they're being read or allocated.
-
- (3) If there was an general error, then that error will be returned.
-
- Else if some pages couldn't be allocated or read, then -ENOBUFS will be
- returned.
-
- Else if some pages couldn't be read but were allocated, then -ENODATA will
- be returned.
-
- Otherwise, if all pages had reads dispatched, then 0 will be returned, the
- list will be empty and ``*nr_pages`` will be 0.
-
- (4) end_io_func will be called once for each page being read as the reads
- complete. It will be called in process context if error != 0, but it may
- be called in interrupt context if there is no error.
-
-Note that a return of -ENODATA, -ENOBUFS or any other error does not preclude
-some of the pages being read and some being allocated. Those pages will have
-been marked appropriately and will need uncaching.
-
-
-Cancellation of Unread Pages
-----------------------------
-
-If one or more pages are passed to fscache_read_or_alloc_pages() but not then
-read from the cache and also not read from the underlying filesystem then
-those pages will need to have any marks and reservations removed. This can be
-done by calling::
-
- void fscache_readpages_cancel(struct fscache_cookie *cookie,
- struct list_head *pages);
-
-prior to returning to the caller. The cookie argument should be as passed to
-fscache_read_or_alloc_pages(). Every page in the pages list will be examined
-and any that have PG_fscache set will be uncached.
-
-
-Page Uncaching
-==============
-
-To uncache a page, this function should be called::
-
- void fscache_uncache_page(struct fscache_cookie *cookie,
- struct page *page);
-
-This function permits the cache to release any in-memory representation it
-might be holding for this netfs page. This function must be called once for
-each page on which the read or write page functions above have been called to
-make sure the cache's in-memory tracking information gets torn down.
-
-Note that pages can't be explicitly deleted from the a data file. The whole
-data file must be retired (see the relinquish cookie function below).
-
-Furthermore, note that this does not cancel the asynchronous read or write
-operation started by the read/alloc and write functions, so the page
-invalidation functions must use::
-
- bool fscache_check_page_write(struct fscache_cookie *cookie,
- struct page *page);
-
-to see if a page is being written to the cache, and::
-
- void fscache_wait_on_page_write(struct fscache_cookie *cookie,
- struct page *page);
-
-to wait for it to finish if it is.
-
-
-When releasepage() is being implemented, a special FS-Cache function exists to
-manage the heuristics of coping with vmscan trying to eject pages, which may
-conflict with the cache trying to write pages to the cache (which may itself
-need to allocate memory)::
-
- bool fscache_maybe_release_page(struct fscache_cookie *cookie,
- struct page *page,
- gfp_t gfp);
-
-This takes the netfs cookie, and the page and gfp arguments as supplied to
-releasepage(). It will return false if the page cannot be released yet for
-some reason and if it returns true, the page has been uncached and can now be
-released.
-
-To make a page available for release, this function may wait for an outstanding
-storage request to complete, or it may attempt to cancel the storage request -
-in which case the page will not be stored in the cache this time.
-
-
-Bulk Image Page Uncache
------------------------
-
-A convenience routine is provided to perform an uncache on all the pages
-attached to an inode. This assumes that the pages on the inode correspond on a
-1:1 basis with the pages in the cache::
-
- void fscache_uncache_all_inode_pages(struct fscache_cookie *cookie,
- struct inode *inode);
-
-This takes the netfs cookie that the pages were cached with and the inode that
-the pages are attached to. This function will wait for pages to finish being
-written to the cache and for the cache to finish with the page generally. No
-error is returned.
-
-
-Index and Data File consistency
-===============================
-
-To find out whether auxiliary data for an object is up to data within the
-cache, the following function can be called::
-
- int fscache_check_consistency(struct fscache_cookie *cookie,
- const void *aux_data);
-
-This will call back to the netfs to check whether the auxiliary data associated
-with a cookie is correct; if aux_data is non-NULL, it will update the auxiliary
-data buffer first. It returns 0 if it is and -ESTALE if it isn't; it may also
-return -ENOMEM and -ERESTARTSYS.
-
-To request an update of the index data for an index or other object, the
-following function should be called::
+To request an update of the coherency data and file size on a cookie, the
+following should be called::
void fscache_update_cookie(struct fscache_cookie *cookie,
- const void *aux_data);
-
-This function will update the cookie's auxiliary data buffer from aux_data if
-that is non-NULL and then schedule this to be stored on disk. The update
-method in the parent index definition will be called to transfer the data.
-
-Note that partial updates may happen automatically at other times, such as when
-data blocks are added to a data file object.
-
-
-Cookie Enablement
-=================
-
-Cookies exist in one of two states: enabled and disabled. If a cookie is
-disabled, it ignores all attempts to acquire child cookies; check, update or
-invalidate its state; allocate, read or write backing pages - though it is
-still possible to uncache pages and relinquish the cookie.
-
-The initial enablement state is set by fscache_acquire_cookie(), but the cookie
-can be enabled or disabled later. To disable a cookie, call::
-
- void fscache_disable_cookie(struct fscache_cookie *cookie,
- const void *aux_data,
- bool invalidate);
-
-If the cookie is not already disabled, this locks the cookie against other
-enable and disable ops, marks the cookie as being disabled, discards or
-invalidates any backing objects and waits for cessation of activity on any
-associated object before unlocking the cookie.
-
-All possible failures are handled internally. The caller should consider
-calling fscache_uncache_all_inode_pages() afterwards to make sure all page
-markings are cleared up.
-
-Cookies can be enabled or reenabled with::
-
- void fscache_enable_cookie(struct fscache_cookie *cookie,
const void *aux_data,
- loff_t object_size,
- bool (*can_enable)(void *data),
- void *data)
-
-If the cookie is not already enabled, this locks the cookie against other
-enable and disable ops, invokes can_enable() and, if the cookie is not an index
-cookie, will begin the procedure of acquiring backing objects.
-
-The optional can_enable() function is passed the data argument and returns a
-ruling as to whether or not enablement should actually be permitted to begin.
+ const loff_t *object_size);
-All possible failures are handled internally. The cookie will only be marked
-as enabled if provisional backing objects are allocated.
+This will update the cookie's coherency data and/or file size.
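+
+For illustration, a netfs might call this when the server returns fresh
+attributes for a file. A minimal sketch, assuming a hypothetical
+``mynetfs_inode`` wrapper that carries the cookie and auxiliary data (these
+names are not part of the fscache API)::
+
+	static void mynetfs_attrs_changed(struct mynetfs_inode *mi)
+	{
+		loff_t size = i_size_read(&mi->vfs_inode);
+
+		/* Push the netfs's current coherency data and file size
+		 * through to the cache.
+		 */
+		fscache_update_cookie(mi->cache_cookie, &mi->aux_data, &size);
+	}
+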
-The object's data size is updated from object_size and is passed to the
-->check_aux() function.
-In both cases, the cookie's auxiliary data buffer is updated from aux_data if
-that is non-NULL inside the enablement lock before proceeding.
-
-
-Miscellaneous Cookie operations
-===============================
+Data File Invalidation
+======================
-There are a number of operations that can be used to control cookies:
+Sometimes it will be necessary to invalidate an object that contains data.
+Typically this will be necessary when the server informs the network filesystem
+of a remote third-party change - at which point the filesystem has to throw
+away the state and cached data that it had for a file and reload from the
+server.
- * Cookie pinning::
+To indicate that a cache object should be invalidated, the following should be
+called::
- int fscache_pin_cookie(struct fscache_cookie *cookie);
- void fscache_unpin_cookie(struct fscache_cookie *cookie);
+ void fscache_invalidate(struct fscache_cookie *cookie,
+ const void *aux_data,
+ loff_t size,
+ unsigned int flags);
- These operations permit data cookies to be pinned into the cache and to
- have the pinning removed. They are not permitted on index cookies.
+This increases the invalidation counter in the cookie to cause outstanding
+reads to fail with -ESTALE, sets the coherency data and file size from the
+information supplied, blocks new I/O on the cookie and dispatches the cache
+to discard the old data.
- The pinning function will return 0 if successful, -ENOBUFS if the cookie
- isn't backed by a cache, -EOPNOTSUPP if the cache doesn't support pinning,
- -ENOSPC if there isn't enough space to honour the operation, -ENOMEM or
- -EIO if there's any other problem.
+Invalidation runs asynchronously in a worker thread so that it doesn't block
+the caller for long.
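+
+For example, a netfs might do something like the following when it learns of
+a third-party change (a sketch; the surrounding inode wrapper is
+hypothetical)::
+
+	static void mynetfs_remote_change(struct mynetfs_inode *mi,
+					  loff_t new_size)
+	{
+		/* Outstanding reads will now fail with -ESTALE and the
+		 * cache will discard its old copy of the data.
+		 */
+		fscache_invalidate(mi->cache_cookie, &mi->aux_data,
+				   new_size, 0);
+	}
+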
- * Data space reservation::
- int fscache_reserve_space(struct fscache_cookie *cookie, loff_t size);
+Write-Back Resource Management
+==============================
- This permits a netfs to request cache space be reserved to store up to the
- given amount of a file. It is permitted to ask for more than the current
- size of the file to allow for future file expansion.
+To write data to the cache from network filesystem writeback, the cache
+resources required need to be pinned at the point the modification is made (for
+instance when the page is marked dirty) as it's not possible to open a file in
+a thread that's exiting.
- If size is given as zero then the reservation will be cancelled.
+The following facilities are provided to manage this:
- The function will return 0 if successful, -ENOBUFS if the cookie isn't
- backed by a cache, -EOPNOTSUPP if the cache doesn't support reservations,
- -ENOSPC if there isn't enough space to honour the operation, -ENOMEM or
- -EIO if there's any other problem.
+ * An inode flag, ``I_PINNING_FSCACHE_WB``, is provided to indicate that an
+   in-use pin is held on the cookie for this inode. It can only be changed if
+   the inode lock is held.
- Note that this doesn't pin an object in a cache; it can still be culled to
- make space if it's not in use.
+ * A flag, ``unpinned_fscache_wb``, is placed in the ``writeback_control``
+   struct; it gets set if ``__writeback_single_inode()`` clears
+   ``I_PINNING_FSCACHE_WB`` because all the dirty pages were cleared.
+To support this, the following functions are provided::
-Cookie Unregistration
-=====================
+ bool fscache_dirty_folio(struct address_space *mapping,
+ struct folio *folio,
+ struct fscache_cookie *cookie);
+ void fscache_unpin_writeback(struct writeback_control *wbc,
+ struct fscache_cookie *cookie);
+ void fscache_clear_inode_writeback(struct fscache_cookie *cookie,
+ struct inode *inode,
+ const void *aux);
-To get rid of a cookie, this function should be called::
+The *set* function is intended to be called from the filesystem's
+``dirty_folio`` address space operation. If ``I_PINNING_FSCACHE_WB`` is not
+set, it sets that flag and increments the use count on the cookie (the caller
+must already have called ``fscache_use_cookie()``).
- void fscache_relinquish_cookie(struct fscache_cookie *cookie,
- const void *aux_data,
- bool retire);
+The *unpin* function is intended to be called from the filesystem's
+``write_inode`` superblock operation. It cleans up after writing by unusing
+the cookie if ``unpinned_fscache_wb`` is set in the ``writeback_control``
+struct.
-If retire is non-zero, then the object will be marked for recycling, and all
-copies of it will be removed from all active caches in which it is present.
-Not only that but all child objects will also be retired.
+The *clear* function is intended to be called from the netfs's ``evict_inode``
+superblock operation. It must be called *after*
+``truncate_inode_pages_final()``, but *before* ``clear_inode()``. This cleans
+up any hanging ``I_PINNING_FSCACHE_WB``. It also allows the coherency data to
+be updated.
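+
+As an illustration, the three hooks might be wired up roughly as follows.
+This is only a sketch: the ``mynetfs_*`` names and the ``mynetfs_i_cookie()``
+accessor are hypothetical stand-ins for the netfs's own code::
+
+	static bool mynetfs_dirty_folio(struct address_space *mapping,
+					struct folio *folio)
+	{
+		return fscache_dirty_folio(mapping, folio,
+					   mynetfs_i_cookie(mapping->host));
+	}
+
+	static int mynetfs_write_inode(struct inode *inode,
+				       struct writeback_control *wbc)
+	{
+		fscache_unpin_writeback(wbc, mynetfs_i_cookie(inode));
+		return 0;
+	}
+
+	static void mynetfs_evict_inode(struct inode *inode)
+	{
+		truncate_inode_pages_final(&inode->i_data);
+		fscache_clear_inode_writeback(mynetfs_i_cookie(inode),
+					      inode, NULL);
+		clear_inode(inode);
+	}
+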
-If retire is zero, then the object may be available again when next the
-acquisition function is called. Retirement here will overrule the pinning on a
-cookie.
-The cookie's auxiliary data will be updated from aux_data if that is non-NULL
-so that the cache can lazily update it on disk.
+Caching of Local Modifications
+==============================
-One very important note - relinquish must NOT be called for a cookie unless all
-the cookies for "child" indices, objects and pages have been relinquished
-first.
+If a network filesystem has locally modified data that it wants to write to the
+cache, it needs to mark the pages to indicate that a write is in progress, and
+if the mark is already present, it needs to wait for it to be removed first
+(presumably due to an already in-progress operation). This prevents multiple
+competing DIO writes to the same storage in the cache.
+Firstly, the netfs should determine if caching is available by doing something
+like::
-Index Invalidation
-==================
+ bool caching = fscache_cookie_enabled(cookie);
-There is no direct way to invalidate an index subtree. To do this, the caller
-should relinquish and retire the cookie they have, and then acquire a new one.
+If caching is to be attempted, pages should be waited for and then marked using
+the following functions provided by the netfs helper library::
+ void set_page_fscache(struct page *page);
+ void wait_on_page_fscache(struct page *page);
+ int wait_on_page_fscache_killable(struct page *page);
-Data File Invalidation
-======================
+Once all the pages in the span are marked, the netfs can ask fscache to
+schedule a write of that region::
-Sometimes it will be necessary to invalidate an object that contains data.
-Typically this will be necessary when the server tells the netfs of a foreign
-change - at which point the netfs has to throw away all the state it had for an
-inode and reload from the server.
+ void fscache_write_to_cache(struct fscache_cookie *cookie,
+ struct address_space *mapping,
+ loff_t start, size_t len, loff_t i_size,
+ netfs_io_terminated_t term_func,
+ void *term_func_priv,
+ bool caching)
-To indicate that a cache object should be invalidated, the following function
-can be called::
+And if an error occurs before that point is reached, the marks can be removed
+by calling::
- void fscache_invalidate(struct fscache_cookie *cookie);
+ void fscache_clear_page_bits(struct address_space *mapping,
+ loff_t start, size_t len,
+ bool caching)
-This can be called with spinlocks held as it defers the work to a thread pool.
-All extant storage, retrieval and attribute change ops at this point are
-cancelled and discarded. Some future operations will be rejected until the
-cache has had a chance to insert a barrier in the operations queue. After
-that, operations will be queued again behind the invalidation operation.
+In these functions, a pointer to the mapping to which the source pages are
+attached is passed in and start and len indicate the size of the region that's
+going to be written (it doesn't have to align to page boundaries necessarily,
+but it does have to align to DIO boundaries on the backing filesystem). The
+caching parameter indicates whether caching should be attempted; if it is
+false, the functions do nothing.
-The invalidation operation will perform an attribute change operation and an
-auxiliary data update operation as it is very likely these will have changed.
+The write function takes some additional parameters: the cookie representing
+the cache object to be written to; i_size, which indicates the size of the
+netfs file; and term_func, an optional completion function to which
+term_func_priv will be passed along with the error or amount written.
-Using the following function, the netfs can wait for the invalidation operation
-to have reached a point at which it can start submitting ordinary operations
-once again::
+Note that the write function will always run asynchronously and will unmark all
+the pages upon completion before calling term_func.
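+
+Putting this together, a write-back path might look roughly like the sketch
+below (the ``mynetfs_*`` names are hypothetical and error handling is elided;
+the pages in the span are assumed to have been marked already)::
+
+	static void mynetfs_write_done(void *priv, ssize_t transferred_or_error,
+				       bool was_async)
+	{
+		/* The pages have already been unmarked by this point. */
+		if (transferred_or_error < 0)
+			pr_warn("Cache write failed: %zd\n",
+				transferred_or_error);
+	}
+
+	/* In the writeback path, once the span is marked: */
+	fscache_write_to_cache(cookie, mapping, start, len, i_size,
+			       mynetfs_write_done, NULL,
+			       fscache_cookie_enabled(cookie));
+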
- void fscache_wait_on_invalidate(struct fscache_cookie *cookie);
+Page Release and Invalidation
+=============================
-FS-cache Specific Page Flag
-===========================
+Fscache keeps track of whether we have any data in the cache yet for a cache
+object we've just created. It knows it doesn't have to do any reading until it
+has done a write and then the page it wrote from has been released by the VM,
+after which it *has* to look in the cache.
-FS-Cache makes use of a page flag, PG_private_2, for its own purpose. This is
-given the alternative name PG_fscache.
+To inform fscache that a page might now be in the cache, the following function
+should be called from the ``release_folio`` address space op::
-PG_fscache is used to indicate that the page is known by the cache, and that
-the cache must be informed if the page is going to go away. It's an indication
-to the netfs that the cache has an interest in this page, where an interest may
-be a pointer to it, resources allocated or reserved for it, or I/O in progress
-upon it.
+ void fscache_note_page_release(struct fscache_cookie *cookie);
-The netfs can use this information in methods such as releasepage() to
-determine whether it needs to uncache a page or update it.
+if the page has been released (ie. release_folio returned true).
-Furthermore, if this bit is set, releasepage() and invalidatepage() operations
-will be called on a page to get rid of it, even if PG_private is not set. This
-allows caching to be attempted on a page before read_cache_pages() is called
-after fscache_read_or_alloc_pages(), as the former will try to release pages
-it was given under certain circumstances.
+Page release and page invalidation should also wait for any mark left on the
+page to say that a DIO write is underway from that page::
-This bit does not overlap with flags such as PG_private. This means that FS-Cache
-can be used with a filesystem that uses the block buffering code.
+ void wait_on_page_fscache(struct page *page);
+ int wait_on_page_fscache_killable(struct page *page);
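+
+For instance, a ``release_folio`` implementation might look something like
+this sketch, using the folio equivalents of the functions above
+(``mynetfs_i_cookie()`` is a hypothetical accessor)::
+
+	static bool mynetfs_release_folio(struct folio *folio, gfp_t gfp)
+	{
+		/* Don't release while a DIO write to the cache is underway. */
+		if (folio_test_fscache(folio)) {
+			if (current_is_kswapd() || !(gfp & __GFP_FS))
+				return false;
+			folio_wait_fscache(folio);
+		}
+		fscache_note_page_release(mynetfs_i_cookie(folio->mapping->host));
+		return true;
+	}
+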
-There are a number of operations defined on this flag::
- int PageFsCache(struct page *page);
- void SetPageFsCache(struct page *page)
- void ClearPageFsCache(struct page *page)
- int TestSetPageFsCache(struct page *page)
- int TestClearPageFsCache(struct page *page)
+API Function Reference
+======================
-These functions are bit test, bit set, bit clear, bit test and set and bit
-test and clear operations on PG_fscache.
+.. kernel-doc:: include/linux/fscache.h
diff --git a/Documentation/filesystems/caching/object.rst b/Documentation/filesystems/caching/object.rst
deleted file mode 100644
index ce0e043ccd33..000000000000
--- a/Documentation/filesystems/caching/object.rst
+++ /dev/null
@@ -1,313 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-====================================================
-In-Kernel Cache Object Representation and Management
-====================================================
-
-By: David Howells <dhowells@redhat.com>
-
-.. Contents:
-
- (*) Representation
-
- (*) Object management state machine.
-
- - Provision of cpu time.
- - Locking simplification.
-
- (*) The set of states.
-
- (*) The set of events.
-
-
-Representation
-==============
-
-FS-Cache maintains an in-kernel representation of each object that a netfs is
-currently interested in. Such objects are represented by the fscache_cookie
-struct and are referred to as cookies.
-
-FS-Cache also maintains a separate in-kernel representation of the objects that
-a cache backend is currently actively caching. Such objects are represented by
-the fscache_object struct. The cache backends allocate these upon request, and
-are expected to embed them in their own representations. These are referred to
-as objects.
-
-There is a 1:N relationship between cookies and objects. A cookie may be
-represented by multiple objects - an index may exist in more than one cache -
-or even by no objects (it may not be cached).
-
-Furthermore, both cookies and objects are hierarchical. The two hierarchies
-correspond, but the cookies tree is a superset of the union of the object trees
-of multiple caches::
-
- NETFS INDEX TREE : CACHE 1 : CACHE 2
- : :
- : +-----------+ :
- +----------->| IObject | :
- +-----------+ | : +-----------+ :
- | ICookie |-------+ : | :
- +-----------+ | : | : +-----------+
- | +------------------------------>| IObject |
- | : | : +-----------+
- | : V : |
- | : +-----------+ : |
- V +----------->| IObject | : |
- +-----------+ | : +-----------+ : |
- | ICookie |-------+ : | : V
- +-----------+ | : | : +-----------+
- | +------------------------------>| IObject |
- +-----+-----+ : | : +-----------+
- | | : | : |
- V | : V : |
- +-----------+ | : +-----------+ : |
- | ICookie |------------------------->| IObject | : |
- +-----------+ | : +-----------+ : |
- | V : | : V
- | +-----------+ : | : +-----------+
- | | ICookie |-------------------------------->| IObject |
- | +-----------+ : | : +-----------+
- V | : V : |
- +-----------+ | : +-----------+ : |
- | DCookie |------------------------->| DObject | : |
- +-----------+ | : +-----------+ : |
- | : : |
- +-------+-------+ : : |
- | | : : |
- V V : : V
- +-----------+ +-----------+ : : +-----------+
- | DCookie | | DCookie |------------------------>| DObject |
- +-----------+ +-----------+ : : +-----------+
- : :
-
-In the above illustration, ICookie and IObject represent indices and DCookie
-and DObject represent data storage objects. Indices may have representation in
-multiple caches, but currently, non-index objects may not. Objects of any type
-may also be entirely unrepresented.
-
-As far as the netfs API goes, the netfs is only actually permitted to see
-pointers to the cookies. The cookies themselves and any objects attached to
-those cookies are hidden from it.
-
-
-Object Management State Machine
-===============================
-
-Within FS-Cache, each active object is managed by its own individual state
-machine. The state for an object is kept in the fscache_object struct, in
-object->state. A cookie may point to a set of objects that are in different
-states.
-
-Each state has an action associated with it that is invoked when the machine
-wakes up in that state. There are four logical sets of states:
-
- (1) Preparation: states that wait for the parent objects to become ready. The
- representations are hierarchical, and it is expected that an object must
- be created or accessed with respect to its parent object.
-
- (2) Initialisation: states that perform lookups in the cache and validate
- what's found and that create on disk any missing metadata.
-
- (3) Normal running: states that allow netfs operations on objects to proceed
- and that update the state of objects.
-
- (4) Termination: states that detach objects from their netfs cookies, that
- delete objects from disk, that handle disk and system errors and that free
- up in-memory resources.
-
-
-In most cases, transitioning between states is in response to signalled events.
-When a state has finished processing, it will usually set the mask of events in
-which it is interested (object->event_mask) and relinquish the worker thread.
-Then when an event is raised (by calling fscache_raise_event()), if the event
-is not masked, the object will be queued for processing (by calling
-fscache_enqueue_object()).
-
-
-Provision of CPU Time
----------------------
-
-The work to be done by the various states was given CPU time by the threads of
-the slow work facility. This was used in preference to the workqueue facility
-because:
-
- (1) Threads may be completely occupied for very long periods of time by a
- particular work item. These state actions may be doing sequences of
- synchronous, journalled disk accesses (lookup, mkdir, create, setxattr,
- getxattr, truncate, unlink, rmdir, rename).
-
- (2) Threads may do little actual work, but may rather spend a lot of time
- sleeping on I/O. This means that single-threaded and 1-per-CPU-threaded
- workqueues don't necessarily have the right numbers of threads.
-
-
-Locking Simplification
-----------------------
-
-Because only one worker thread may be operating on any particular object's
-state machine at once, this simplifies the locking, particularly with respect
-to disconnecting the netfs's representation of a cache object (fscache_cookie)
-from the cache backend's representation (fscache_object) - which may be
-requested from either end.
-
-
-The Set of States
-=================
-
-The object state machine has a set of states that it can be in. There are
-preparation states in which the object sets itself up and waits for its parent
-object to transit to a state that allows access to its children:
-
- (1) State FSCACHE_OBJECT_INIT.
-
- Initialise the object and wait for the parent object to become active. In
- the cache, it is expected that it will not be possible to look an object
- up from the parent object, until that parent object itself has been looked
- up.
-
-There are initialisation states in which the object sets itself up and accesses
-disk for the object metadata:
-
- (2) State FSCACHE_OBJECT_LOOKING_UP.
-
- Look up the object on disk, using the parent as a starting point.
- FS-Cache expects the cache backend to probe the cache to see whether this
- object is represented there, and if it is, to see if it's valid (coherency
- management).
-
- The cache should call fscache_object_lookup_negative() to indicate lookup
- failure for whatever reason, and should call fscache_obtained_object() to
- indicate success.
-
- At the completion of lookup, FS-Cache will let the netfs go ahead with
- read operations, no matter whether the file is yet cached. If not yet
- cached, read operations will be immediately rejected with ENODATA until
- the first known page is uncached - as to that point there can be no data
- to be read out of the cache for that file that isn't currently also held
- in the pagecache.
-
- (3) State FSCACHE_OBJECT_CREATING.
-
- Create an object on disk, using the parent as a starting point. This
- happens if the lookup failed to find the object, or if the object's
- coherency data indicated what's on disk is out of date. In this state,
- FS-Cache expects the cache to create the object on disk.
-
- The cache should call fscache_obtained_object() if creation completes
- successfully, fscache_object_lookup_negative() otherwise.
-
- At the completion of creation, FS-Cache will start processing write
- operations the netfs has queued for an object. If creation failed, the
- write ops will be transparently discarded, and nothing recorded in the
- cache.
-
-There are some normal running states in which the object spends its time
-servicing netfs requests:
-
- (4) State FSCACHE_OBJECT_AVAILABLE.
-
- A transient state in which pending operations are started, child objects
- are permitted to advance from FSCACHE_OBJECT_INIT state, and temporary
- lookup data is freed.
-
- (5) State FSCACHE_OBJECT_ACTIVE.
-
- The normal running state. In this state, requests the netfs makes will be
- passed on to the cache.
-
- (6) State FSCACHE_OBJECT_INVALIDATING.
-
- The object is undergoing invalidation. When the state comes here, it
- discards all pending read, write and attribute change operations as it is
- going to clear out the cache entirely and reinitialise it. It will then
- continue to the FSCACHE_OBJECT_UPDATING state.
-
- (7) State FSCACHE_OBJECT_UPDATING.
-
- The state machine comes here to update the object in the cache from the
- netfs's records. This involves updating the auxiliary data that is used
- to maintain coherency.
-
-And there are terminal states in which an object cleans itself up, deallocates
-memory and potentially deletes stuff from disk:
-
- (8) State FSCACHE_OBJECT_LC_DYING.
-
- The object comes here if it is dying because of a lookup or creation
- error. This would be due to a disk error or system error of some sort.
- Temporary data is cleaned up, and the parent is released.
-
- (9) State FSCACHE_OBJECT_DYING.
-
- The object comes here if it is dying due to an error, because its parent
- cookie has been relinquished by the netfs or because the cache is being
- withdrawn.
-
- Any child objects waiting on this one are given CPU time so that they too
- can destroy themselves. This object waits for all its children to go away
- before advancing to the next state.
-
-(10) State FSCACHE_OBJECT_ABORT_INIT.
-
- The object comes to this state if it was waiting on its parent in
- FSCACHE_OBJECT_INIT, but its parent died. The object will destroy itself
- so that the parent may proceed from the FSCACHE_OBJECT_DYING state.
-
-(11) State FSCACHE_OBJECT_RELEASING.
-(12) State FSCACHE_OBJECT_RECYCLING.
-
- The object comes to one of these two states when dying once it is rid of
- all its children, if it is dying because the netfs relinquished its
- cookie. In the first state, the cached data is expected to persist, and
- in the second it will be deleted.
-
-(13) State FSCACHE_OBJECT_WITHDRAWING.
-
- The object transits to this state if the cache decides it wants to
- withdraw the object from service, perhaps to make space, but also due to
- error or just because the whole cache is being withdrawn.
-
-(14) State FSCACHE_OBJECT_DEAD.
-
- The object transits to this state when the in-memory object record is
- ready to be deleted. The object processor shouldn't ever see an object in
- this state.
-
-
-The Set of Events
------------------
-
-There are a number of events that can be raised to an object state machine:
-
- FSCACHE_OBJECT_EV_UPDATE
- The netfs requested that an object be updated. The state machine will ask
- the cache backend to update the object, and the cache backend will ask the
- netfs for details of the change through its cookie definition ops.
-
- FSCACHE_OBJECT_EV_CLEARED
- This is signalled in two circumstances:
-
- (a) when an object's last child object is dropped and
-
- (b) when the last operation outstanding on an object is completed.
-
- This is used to proceed from the dying state.
-
- FSCACHE_OBJECT_EV_ERROR
- This is signalled when an I/O error occurs during the processing of some
- object.
-
- FSCACHE_OBJECT_EV_RELEASE, FSCACHE_OBJECT_EV_RETIRE
- These are signalled when the netfs relinquishes a cookie it was using.
- The event selected depends on whether the netfs asks for the backing
- object to be retired (deleted) or retained.
-
- FSCACHE_OBJECT_EV_WITHDRAW
- This is signalled when the cache backend wants to withdraw an object.
- This means that the object will have to be detached from the netfs's
- cookie.
-
-Because the withdrawing, releasing and retiring events are all handled by the
-object state machine, it doesn't matter if there's a collision with both ends
-trying to sever the connection at the same time. The state machine can just
-pick which one it wants to honour, and that affects the other.
diff --git a/Documentation/filesystems/caching/operations.rst b/Documentation/filesystems/caching/operations.rst
deleted file mode 100644
index f7ddcc028939..000000000000
--- a/Documentation/filesystems/caching/operations.rst
+++ /dev/null
@@ -1,210 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-================================
-Asynchronous Operations Handling
-================================
-
-By: David Howells <dhowells@redhat.com>
-
-.. Contents:
-
- (*) Overview.
-
- (*) Operation record initialisation.
-
- (*) Parameters.
-
- (*) Procedure.
-
- (*) Asynchronous callback.
-
-
-Overview
-========
-
-FS-Cache has an asynchronous operations handling facility that it uses for its
-data storage and retrieval routines. Its operations are represented by
-fscache_operation structs, though these are usually embedded into some other
-structure.
-
-This facility is available to and expected to be used by the cache backends,
-and FS-Cache will create operations and pass them off to the appropriate cache
-backend for completion.
-
-To make use of this facility, <linux/fscache-cache.h> should be #included.
-
-
-Operation Record Initialisation
-===============================
-
-An operation is recorded in an fscache_operation struct::
-
- struct fscache_operation {
- union {
- struct work_struct fast_work;
- struct slow_work slow_work;
- };
- unsigned long flags;
- fscache_operation_processor_t processor;
- ...
- };
-
-Someone wanting to issue an operation should allocate something with this
-struct embedded in it. They should initialise it by calling::
-
- void fscache_operation_init(struct fscache_operation *op,
- fscache_operation_release_t release);
-
-with the operation to be initialised and the release function to use.
-
-The op->flags parameter should be set to indicate the CPU time provision and
-the exclusivity (see the Parameters section).
-
-The op->fast_work, op->slow_work and op->processor fields should be set as
-appropriate for the CPU time provision (see the Parameters section).
-
-FSCACHE_OP_WAITING may be set in op->flags prior to each submission of the
-operation and waited for afterwards.
-
-
-Parameters
-==========
-
-There are a number of parameters that can be set in the operation record's flag
-parameter. There are three options for the provision of CPU time in these
-operations:
-
- (1) The operation may be done synchronously (FSCACHE_OP_MYTHREAD). A thread
- may decide it wants to handle an operation itself without deferring it to
- another thread.
-
- This is, for example, used in read operations for calling readpages() on
- the backing filesystem in CacheFiles. Although readpages() does an
- asynchronous data fetch, the determination of whether pages exist is done
- synchronously - and the netfs does not proceed until this has been
- determined.
-
- If this option is to be used, FSCACHE_OP_WAITING must be set in op->flags
- before submitting the operation, and the operating thread must wait for it
- to be cleared before proceeding::
-
- wait_on_bit(&op->flags, FSCACHE_OP_WAITING,
- TASK_UNINTERRUPTIBLE);
-
-
- (2) The operation may be fast asynchronous (FSCACHE_OP_FAST), in which case it
- will be given to keventd to process. Such an operation is not permitted
- to sleep on I/O.
-
- This is, for example, used by CacheFiles to copy data from a backing fs
- page to a netfs page after the backing fs has read the page in.
-
- If this option is used, op->fast_work and op->processor must be
- initialised before submitting the operation::
-
- INIT_WORK(&op->fast_work, do_some_work);
-
-
- (3) The operation may be slow asynchronous (FSCACHE_OP_SLOW), in which case it
- will be given to the slow work facility to process. Such an operation is
- permitted to sleep on I/O.
-
- This is, for example, used by FS-Cache to handle background writes of
- pages that have just been fetched from a remote server.
-
- If this option is used, op->slow_work and op->processor must be
- initialised before submitting the operation::
-
- fscache_operation_init_slow(op, processor)
-
-
-Furthermore, operations may be one of two types:
-
- (1) Exclusive (FSCACHE_OP_EXCLUSIVE). Operations of this type may not run in
- conjunction with any other operation on the object being operated upon.
-
- An example of this is the attribute change operation, in which the file
- being written to may need truncation.
-
- (2) Shareable. Operations of this type may be running simultaneously. It's
- up to the operation implementation to prevent interference between other
- operations running at the same time.
-
-
-Procedure
-=========
-
-Operations are used through the following procedure:
-
- (1) The submitting thread must allocate the operation and initialise it
- itself. Normally this would be part of a more specific structure with the
- generic op embedded within.
-
- (2) The submitting thread must then submit the operation for processing using
- one of the following two functions::
-
- int fscache_submit_op(struct fscache_object *object,
- struct fscache_operation *op);
-
- int fscache_submit_exclusive_op(struct fscache_object *object,
- struct fscache_operation *op);
-
- The first function should be used to submit non-exclusive ops and the
- second to submit exclusive ones. The caller must still set the
- FSCACHE_OP_EXCLUSIVE flag.
-
- If successful, both functions will assign the operation to the specified
- object and return 0. -ENOBUFS will be returned if the object specified is
- permanently unavailable.
-
- The operation manager will defer operations on an object that is still
- undergoing lookup or creation. The operation will also be deferred if an
- operation of conflicting exclusivity is in progress on the object.
-
- If the operation is asynchronous, the manager will retain a reference to
- it, so the caller should put their reference to it by passing it to::
-
- void fscache_put_operation(struct fscache_operation *op);
-
- (3) If the submitting thread wants to do the work itself, and has marked the
- operation with FSCACHE_OP_MYTHREAD, then it should monitor
- FSCACHE_OP_WAITING as described above and check the state of the object if
- necessary (the object might have died while the thread was waiting).
-
- When it has finished doing its processing, it should call
- fscache_op_complete() and fscache_put_operation() on it.
-
- (4) The operation holds an effective lock upon the object, preventing other
- exclusive ops conflicting until it is released. The operation can be
- enqueued for further immediate asynchronous processing by adjusting the
- CPU time provisioning option if necessary, eg::
-
- op->flags &= ~FSCACHE_OP_TYPE;
- op->flags |= FSCACHE_OP_FAST;
-
- and calling::
-
- void fscache_enqueue_operation(struct fscache_operation *op)
-
- This can be used to allow other things to have use of the worker thread
- pools.
-
-
-Asynchronous Callback
-=====================
-
-When used in asynchronous mode, the worker thread pool will invoke the
-processor method with a pointer to the operation. This should then get at the
-container struct by using container_of()::
-
- static void fscache_write_op(struct fscache_operation *_op)
- {
- struct fscache_storage *op =
- container_of(_op, struct fscache_storage, op);
- ...
- }
-
-The caller holds a reference on the operation, and will invoke
-fscache_put_operation() when the processor function returns. The processor
-function is at liberty to call fscache_enqueue_operation() or to take extra
-references.
diff --git a/Documentation/filesystems/ceph.rst b/Documentation/filesystems/ceph.rst
index 0aa70750df0f..6d2276a87a5a 100644
--- a/Documentation/filesystems/ceph.rst
+++ b/Documentation/filesystems/ceph.rst
@@ -57,12 +57,25 @@ a snapshot on any subdirectory (and its nested contents) in the
system. Snapshot creation and deletion are as simple as 'mkdir
.snap/foo' and 'rmdir .snap/foo'.
-Ceph also provides some recursive accounting on directories for nested
-files and bytes. That is, a 'getfattr -d foo' on any directory in the
-system will reveal the total number of nested regular files and
-subdirectories, and a summation of all nested file sizes. This makes
-the identification of large disk space consumers relatively quick, as
-no 'du' or similar recursive scan of the file system is required.
+Snapshot names have two limitations:
+
+* They cannot start with an underscore ('_'), as these names are reserved
+  for internal usage by the MDS.
+* They cannot exceed 240 characters in size. This is because the MDS makes
+  use of long snapshot names internally, which follow the format:
+  `_<SNAPSHOT-NAME>_<INODE-NUMBER>`. Since filenames in general can't have
+  more than 255 characters, and `<INODE-NUMBER>` takes 13 characters, the
+  long snapshot names can be at most 255 - 1 - 1 - 13 = 240 characters.
+
+Ceph also provides some recursive accounting on directories for nested files
+and bytes. You can run the commands::
+
+ getfattr -n ceph.dir.rfiles /some/dir
+ getfattr -n ceph.dir.rbytes /some/dir
+
+to get the total number of nested files and their combined size, respectively.
+This makes the identification of large disk space consumers relatively quick,
+as no 'du' or similar recursive scan of the file system is required.
Finally, Ceph also allows quotas to be set on any directory in the system.
The quota can restrict the number of bytes or the number of files stored
@@ -82,7 +95,7 @@ Mount Syntax
The basic mount syntax is::
- # mount -t ceph monip[:port][,monip2[:port]...]:/[subdir] mnt
+ # mount -t ceph user@fsid.fs_name=/[subdir] mnt -o mon_addr=monip1[:port][/monip2[:port]]
You only need to specify a single monitor, as the client will get the
full list when it connects. (However, if the monitor you specify
@@ -90,16 +103,35 @@ happens to be down, the mount won't succeed.) The port can be left
off if the monitor is using the default. So if the monitor is at
1.2.3.4::
- # mount -t ceph 1.2.3.4:/ /mnt/ceph
+ # mount -t ceph cephuser@07fe3187-00d9-42a3-814b-72a4d5e7d5be.cephfs=/ /mnt/ceph -o mon_addr=1.2.3.4
is sufficient. If /sbin/mount.ceph is installed, a hostname can be
-used instead of an IP address.
+used instead of an IP address and the cluster FSID can be left out
+(as the mount helper will fill it in by reading the ceph configuration
+file)::
+ # mount -t ceph cephuser@cephfs=/ /mnt/ceph -o mon_addr=mon-addr
+Multiple monitor addresses can be passed by separating each address with a slash (`/`)::
+
+ # mount -t ceph cephuser@cephfs=/ /mnt/ceph -o mon_addr=192.168.1.100/192.168.1.101
+
+When using the mount helper, the monitor address can be read from the ceph
+configuration file if available. Note that the cluster FSID (passed as part
+of the device string) is validated by checking it against the FSID reported
+by the monitor.
Mount Options
=============
+ mon_addr=ip_address[:port][/ip_address[:port]]
+ Monitor address of the cluster. This is used to bootstrap the
+ connection to the cluster. Once the connection is established, the
+ monitor addresses in the monitor map are followed.
+
+ fsid=cluster-id
+ FSID of the cluster (from `ceph fsid` command).
+
ip=A.B.C.D[:N]
Specify the IP and/or port the client should bind to locally.
There is normally not much reason to do this. If the IP is not
@@ -163,14 +195,14 @@ Mount Options
to the default VFS implementation if this option is used.
recover_session=<no|clean>
- Set auto reconnect mode in the case where the client is blacklisted. The
+ Set auto reconnect mode in the case where the client is blocklisted. The
available modes are "no" and "clean". The default is "no".
* no: never attempt to reconnect when client detects that it has been
- blacklisted. Operations will generally fail after being blacklisted.
+ blocklisted. Operations will generally fail after being blocklisted.
* clean: client reconnects to the ceph cluster automatically when it
- detects that it has been blacklisted. During reconnect, client drops
+ detects that it has been blocklisted. During reconnect, client drops
dirty data/metadata, invalidates page caches and writable file handles.
After reconnect, file locks become stale because the MDS loses track
of them. If an inode contains any stale file locks, read/write on the
@@ -184,7 +216,6 @@ For more information on Ceph, see the home page at
The Linux kernel client source tree is available at
- https://github.com/ceph/ceph-client.git
- - git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git
and the source for the full system is at
https://github.com/ceph/ceph.git
diff --git a/Documentation/filesystems/coda.rst b/Documentation/filesystems/coda.rst
index 84c860c89887..0db3c83a50e5 100644
--- a/Documentation/filesystems/coda.rst
+++ b/Documentation/filesystems/coda.rst
@@ -141,7 +141,7 @@ kernel support.
a process P which accessing a Coda file. It makes a system call which
traps to the OS kernel. Examples of such calls trapping to the kernel
are ``read``, ``write``, ``open``, ``close``, ``create``, ``mkdir``,
- ``rmdir``, ``chmod`` in a Unix ontext. Similar calls exist in the Win32
+ ``rmdir``, ``chmod`` in a Unix context. Similar calls exist in the Win32
environment, and are named ``CreateFile``.
Generally the operating system handles the request in a virtual
@@ -524,7 +524,7 @@ kernel support.
Description
This call is made to determine the ViceFid and filetype of
- a directory entry. The directory entry requested carries name name
+ a directory entry. The directory entry requested carries name 'name'
and Venus will search the directory identified by cfs_lookup_in.VFid.
The result may indicate that the name does not exist, or that
difficulty was encountered in finding it (e.g. due to disconnection).
@@ -886,7 +886,7 @@ kernel support.
none
Description
- Remove the directory with name name from the directory
+ Remove the directory with name 'name' from the directory
identified by VFid.
.. Note:: The attributes of the parent directory should be returned since
diff --git a/Documentation/filesystems/configfs.rst b/Documentation/filesystems/configfs.rst
index f8941954c667..ac22138de6a4 100644
--- a/Documentation/filesystems/configfs.rst
+++ b/Documentation/filesystems/configfs.rst
@@ -226,7 +226,7 @@ filename. configfs_attribute->ca_mode specifies the file permissions.
If an attribute is readable and provides a ->show method, that method will
be called whenever userspace asks for a read(2) on the attribute. If an
attribute is writable and provides a ->store method, that method will be
-be called whenever userspace asks for a write(2) on the attribute.
+called whenever userspace asks for a write(2) on the attribute.
struct configfs_bin_attribute
=============================
@@ -253,7 +253,7 @@ to be used.
If binary attribute is readable and the config_item provides a
ct_item_ops->read_bin_attribute() method, that method will be called
whenever userspace asks for a read(2) on the attribute. The converse
-will happen for write(2). The reads/writes are bufferred so only a
+will happen for write(2). The reads/writes are buffered so only a
single read/write will occur; the attribute implementation need not concern
itself with it.
@@ -289,7 +289,6 @@ config_item_type::
const char *name);
struct config_group *(*make_group)(struct config_group *group,
const char *name);
- int (*commit_item)(struct config_item *item);
void (*disconnect_notify)(struct config_group *group,
struct config_item *item);
void (*drop_item)(struct config_group *group,
@@ -486,50 +485,3 @@ up. Here, the heartbeat code calls configfs_depend_item(). If it
succeeds, then heartbeat knows the region is safe to give to ocfs2.
If it fails, it was being torn down anyway, and heartbeat can gracefully
pass up an error.
-
-Committable Items
-=================
-
-Note:
- Committable items are currently unimplemented.
-
-Some config_items cannot have a valid initial state. That is, no
-default values can be specified for the item's attributes such that the
-item can do its work. Userspace must configure one or more attributes,
-after which the subsystem can start whatever entity this item
-represents.
-
-Consider the FakeNBD device from above. Without a target address *and*
-a target device, the subsystem has no idea what block device to import.
-The simple example assumes that the subsystem merely waits until all the
-appropriate attributes are configured, and then connects. This will,
-indeed, work, but now every attribute store must check if the attributes
-are initialized. Every attribute store must fire off the connection if
-that condition is met.
-
-Far better would be an explicit action notifying the subsystem that the
-config_item is ready to go. More importantly, an explicit action allows
-the subsystem to provide feedback as to whether the attributes are
-initialized in a way that makes sense. configfs provides this as
-committable items.
-
-configfs still uses only normal filesystem operations. An item is
-committed via rename(2). The item is moved from a directory where it
-can be modified to a directory where it cannot.
-
-Any group that provides the ct_group_ops->commit_item() method has
-committable items. When this group appears in configfs, mkdir(2) will
-not work directly in the group. Instead, the group will have two
-subdirectories: "live" and "pending". The "live" directory does not
-support mkdir(2) or rmdir(2) either. It only allows rename(2). The
-"pending" directory does allow mkdir(2) and rmdir(2). An item is
-created in the "pending" directory. Its attributes can be modified at
-will. Userspace commits the item by renaming it into the "live"
-directory. At this point, the subsystem receives the ->commit_item()
-callback. If all required attributes are filled to satisfaction, the
-method returns zero and the item is moved to the "live" directory.
-
-As rmdir(2) does not work in the "live" directory, an item must be
-shutdown, or "uncommitted". Again, this is done via rename(2), this
-time from the "live" directory back to the "pending" one. The subsystem
-is notified by the ct_group_ops->uncommit_object() method.
diff --git a/Documentation/filesystems/dax.rst b/Documentation/filesystems/dax.rst
new file mode 100644
index 000000000000..08dd5e254cc5
--- /dev/null
+++ b/Documentation/filesystems/dax.rst
@@ -0,0 +1,306 @@
+=======================
+Direct Access for files
+=======================
+
+Motivation
+----------
+
+The page cache is usually used to buffer reads and writes to files.
+It is also used to provide the pages which are mapped into userspace
+by a call to mmap.
+
+For block devices that are memory-like, the page cache pages would be
+unnecessary copies of the original storage. The `DAX` code removes the
+extra copy by performing reads and writes directly to the storage device.
+For file mappings, the storage device is mapped directly into userspace.
+
+
+Usage
+-----
+
+If you have a block device which supports `DAX`, you can make a filesystem
+on it as usual. The `DAX` code currently only supports files with a block
+size equal to your kernel's `PAGE_SIZE`, so you may need to specify a block
+size when creating the filesystem.
+
+Currently 5 filesystems support `DAX`: ext2, ext4, xfs, virtiofs and erofs.
+Enabling `DAX` on them is different.
+
+Enabling DAX on ext2 and erofs
+------------------------------
+
+When mounting the filesystem, use the ``-o dax`` option on the command line or
+add 'dax' to the options in ``/etc/fstab``. This works to enable `DAX` on all files
+within the filesystem. It is equivalent to the ``-o dax=always`` behavior below.
+
+
+Enabling DAX on xfs and ext4
+----------------------------
+
+Summary
+-------
+
+ 1. There exists an in-kernel file access mode flag `S_DAX` that corresponds to
+ the statx flag `STATX_ATTR_DAX`. See the manpage for statx(2) for details
+ about this access mode.
+
+ 2. There exists a persistent flag `FS_XFLAG_DAX` that can be applied to regular
+ files and directories. This advisory flag can be set or cleared at any
+ time, but doing so does not immediately affect the `S_DAX` state.
+
+ 3. If the persistent `FS_XFLAG_DAX` flag is set on a directory, this flag will
+ be inherited by all regular files and subdirectories that are subsequently
+ created in this directory. Files and subdirectories that exist at the time
+ this flag is set or cleared on the parent directory are not modified by
+ this modification of the parent directory.
+
+ 4. There exist dax mount options which can override `FS_XFLAG_DAX` in the
+ setting of the `S_DAX` flag. Given underlying storage which supports `DAX`,
+ the following hold:
+
+ ``-o dax=inode`` means "follow `FS_XFLAG_DAX`" and is the default.
+
+ ``-o dax=never`` means "never set `S_DAX`, ignore `FS_XFLAG_DAX`."
+
+ ``-o dax=always`` means "always set `S_DAX`, ignore `FS_XFLAG_DAX`."
+
+ ``-o dax`` is a legacy option which is an alias for ``dax=always``.
+
+ .. warning::
+
+ The option ``-o dax`` may be removed in the future so ``-o dax=always`` is
+ the preferred method for specifying this behavior.
+
+ .. note::
+
+ Modifications to and the inheritance behavior of `FS_XFLAG_DAX` remain
+ the same even when the filesystem is mounted with a dax option. However,
+ in-core inode state (`S_DAX`) will be overridden until the filesystem is
+ remounted with dax=inode and the inode is evicted from kernel memory.
+
+ 5. The `S_DAX` policy can be changed via:
+
+ a) Setting the parent directory `FS_XFLAG_DAX` as needed before files are
+ created
+
+ b) Setting the appropriate dax="foo" mount option
+
+ c) Changing the `FS_XFLAG_DAX` flag on existing regular files and
+ directories. This has runtime constraints and limitations that are
+ described in 6) below.
+
+ 6. When changing the `S_DAX` policy via toggling the persistent `FS_XFLAG_DAX`
+ flag, the change to existing regular files won't take effect until the
+ files are closed by all processes.
+
+
+Details
+-------
+
+There are 2 per-file dax flags. One is a persistent inode setting (`FS_XFLAG_DAX`)
+and the other is a volatile flag indicating the active state of the feature
+(`S_DAX`).
+
+`FS_XFLAG_DAX` is preserved within the filesystem. This persistent config
+setting can be set, cleared and/or queried using the `FS_IOC_FSGETXATTR` and
+`FS_IOC_FSSETXATTR` ioctls (see ioctl_xfs_fsgetxattr(2)) or a utility such as
+'xfs_io'.
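+
+For example, the flag could be set programmatically with something like this
+sketch (userspace C; error handling abbreviated)::
+
+	#include <linux/fs.h>
+	#include <sys/ioctl.h>
+
+	static int set_dax_hint(int fd)
+	{
+		struct fsxattr fsx;
+
+		if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
+			return -1;
+		fsx.fsx_xflags |= FS_XFLAG_DAX;  /* advisory, persistent */
+		return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
+	}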
+
+New files and directories automatically inherit `FS_XFLAG_DAX` from
+their parent directory **when created**. Therefore, setting `FS_XFLAG_DAX` at
+directory creation time can be used to set a default behavior for an entire
+sub-tree.
+
+To clarify inheritance, here are 3 examples:
+
+Example A:
+
+.. code-block:: shell
+
+ mkdir -p a/b/c
+ xfs_io -c 'chattr +x' a
+ mkdir a/b/c/d
+ mkdir a/e
+
+ ------[outcome]------
+
+ dax: a,e
+ no dax: b,c,d
+
+Example B:
+
+.. code-block:: shell
+
+ mkdir a
+ xfs_io -c 'chattr +x' a
+ mkdir -p a/b/c/d
+
+ ------[outcome]------
+
+ dax: a,b,c,d
+ no dax:
+
+Example C:
+
+.. code-block:: shell
+
+ mkdir -p a/b/c
+ xfs_io -c 'chattr +x' c
+ mkdir a/b/c/d
+
+ ------[outcome]------
+
+ dax: c,d
+ no dax: a,b
+
+The current enabled state (`S_DAX`) is set when a file inode is instantiated in
+memory by the kernel. It is set based on the underlying media support, the
+value of `FS_XFLAG_DAX` and the filesystem's dax mount option.
+
+statx can be used to query `S_DAX`.
+
+.. note::
+
+ Only regular files will ever have `S_DAX` set, and therefore statx
+ will never indicate that `S_DAX` is set on directories.
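+
+A userspace check might look like this sketch (assuming glibc's statx(2)
+wrapper is available)::
+
+	#define _GNU_SOURCE
+	#include <fcntl.h>
+	#include <sys/stat.h>
+
+	static int file_is_dax(const char *path)
+	{
+		struct statx stx;
+
+		if (statx(AT_FDCWD, path, 0, STATX_BASIC_STATS, &stx) < 0)
+			return -1;
+		/* stx_attributes carries the STATX_ATTR_DAX bit. */
+		return !!(stx.stx_attributes & STATX_ATTR_DAX);
+	}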
+
+Setting the `FS_XFLAG_DAX` flag (specifically or through inheritance) occurs even
+if the underlying media does not support dax and/or the filesystem is
+overridden with a mount option.
+
+
+Enabling DAX on virtiofs
+------------------------
+
+The semantics of DAX on virtiofs are basically the same as on ext4 and xfs,
+except that when ``-o dax=inode`` is specified, the virtiofs client derives
+the hint for whether DAX shall be enabled from the virtiofs server through
+the FUSE protocol, rather than from the persistent `FS_XFLAG_DAX` flag. That
+is, whether DAX shall be enabled is completely determined by the virtiofs
+server, and the virtiofs server may deploy various algorithms to make this
+decision, e.g. depending on the persistent `FS_XFLAG_DAX` flag on the host.
+
+Setting or clearing the persistent `FS_XFLAG_DAX` flag inside the guest is
+still supported, but it is not guaranteed that DAX will then be enabled or
+disabled for the corresponding file. Users inside the guest still need to
+call statx(2) and check the statx flag `STATX_ATTR_DAX` to see whether DAX is
+enabled for the file.
+
+
+Implementation Tips for Block Driver Writers
+--------------------------------------------
+
+To support `DAX` in your block driver, implement the 'direct_access'
+block device operation. It is used to translate the sector number
+(expressed in units of 512-byte sectors) to a page frame number (pfn)
+that identifies the physical page for the memory. It also returns a
+kernel virtual address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested. The function should return the number
+of bytes that can be contiguously accessed at that offset. It may also
+return a negative errno if an error occurs.
+
+In order to support this method, the storage must be byte-accessible by
+the CPU at all times. If your device uses paging techniques to expose
+a large amount of memory through a smaller window, then you cannot
+implement direct_access. Equally, if your device can occasionally
+stall the CPU for an extended period, you should also not attempt to
+implement direct_access.
+
+These block devices may be used for inspiration:
+
+- brd: RAM backed block device driver
+- pmem: NVDIMM persistent memory driver
+
+
+Implementation Tips for Filesystem Writers
+------------------------------------------
+
+Filesystem support consists of:
+
+* Adding support to mark inodes as being `DAX` by setting the `S_DAX` flag in
+ i_flags
+* Implementing ->read_iter and ->write_iter operations which use
+ :c:func:`dax_iomap_rw()` when inode has `S_DAX` flag set
+* Implementing an mmap file operation for `DAX` files which sets the
+ `VM_MIXEDMAP` and `VM_HUGEPAGE` flags on the `VMA`, and setting the vm_ops to
+ include handlers for fault, pmd_fault, page_mkwrite, pfn_mkwrite. These
+ handlers should probably call :c:func:`dax_iomap_fault()` passing the
+ appropriate fault size and iomap operations.
+* Calling :c:func:`iomap_zero_range()` passing appropriate iomap operations
+ instead of :c:func:`block_truncate_page()` for `DAX` files
+* Ensuring that there is sufficient locking between reads, writes,
+ truncates and page faults
+
+The iomap handlers for allocating blocks must make sure that allocated blocks
+are zeroed out and converted to written extents before being returned to avoid
+exposure of uninitialized data through mmap.
+
+These filesystems may be used for inspiration:
+
+.. seealso::
+
+ ext2: see Documentation/filesystems/ext2.rst
+
+.. seealso::
+
+ xfs: see Documentation/admin-guide/xfs.rst
+
+.. seealso::
+
+ ext4: see Documentation/filesystems/ext4/
+
+
+Handling Media Errors
+---------------------
+
+The libnvdimm subsystem stores a record of known media error locations for
+each pmem block device (in gendisk->badblocks). If we fault at such location,
+or one with a latent error not yet discovered, the application can expect
+to receive a `SIGBUS`. Libnvdimm also allows clearing of these errors by simply
+writing the affected sectors (through the pmem driver, and if the underlying
+NVDIMM supports the clear_poison DSM defined by ACPI).
+
+Since `DAX` IO normally doesn't go through the ``driver/bio`` path, applications or
+sysadmins have an option to restore the lost data from a prior ``backup/inbuilt``
+redundancy in the following ways:
+
+1. Delete the affected file, and restore from a backup (sysadmin route):
+ This will free the filesystem blocks that were being used by the file,
+ and the next time they're allocated, they will be zeroed first, which
+ happens through the driver, and will clear bad sectors.
+
+2. Truncate or hole-punch the part of the file that has a bad-block (at least
+ an entire aligned sector has to be hole-punched, but not necessarily an
+ entire filesystem block).
+
+These are the two basic paths that allow `DAX` filesystems to continue operating
+in the presence of media errors. More robust error recovery mechanisms can be
+built on top of this in the future, for example, involving redundancy/mirroring
+provided at the block layer through DM, or additionally, at the filesystem
+level. These would have to rely on the above two tenets, that error clearing
+can happen either by sending an IO through the driver, or zeroing (also through
+the driver).
+
+
+Shortcomings
+------------
+
+Even if the kernel or its modules are stored on a filesystem that supports
+`DAX` on a block device that supports `DAX`, they will still be copied into RAM.
+
+The DAX code does not work correctly on architectures which have virtually
+mapped caches such as ARM, MIPS and SPARC.
+
+Calling :c:func:`get_user_pages()` on a range of user memory that has been
+mmapped from a `DAX` file will fail when there are no 'struct page' to describe
+those pages. This problem has been addressed in some device drivers
+by adding optional struct page support for pages under the control of
+the driver (see `CONFIG_NVDIMM_PFN` in ``drivers/nvdimm`` for an example of
+how to do this). In the non-struct-page case, `O_DIRECT` reads/writes to
+those memory ranges from a non-`DAX` file will fail.
+
+
+.. note::
+
+ `O_DIRECT` reads/writes *of* a `DAX` file do work (it is the memory that
+ is being accessed that is key here). Other things that will not work in
+ the non-struct-page case include RDMA, :c:func:`sendfile()` and
+ :c:func:`splice()`.
diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
deleted file mode 100644
index 8fdb78f3c6c9..000000000000
--- a/Documentation/filesystems/dax.txt
+++ /dev/null
@@ -1,268 +0,0 @@
-Direct Access for files
------------------------
-
-Motivation
-----------
-
-The page cache is usually used to buffer reads and writes to files.
-It is also used to provide the pages which are mapped into userspace
-by a call to mmap.
-
-For block devices that are memory-like, the page cache pages would be
-unnecessary copies of the original storage. The DAX code removes the
-extra copy by performing reads and writes directly to the storage device.
-For file mappings, the storage device is mapped directly into userspace.
-
-
-Usage
------
-
-If you have a block device which supports DAX, you can make a filesystem
-on it as usual. The DAX code currently only supports files with a block
-size equal to your kernel's PAGE_SIZE, so you may need to specify a block
-size when creating the filesystem.
-
-Currently 3 filesystems support DAX: ext2, ext4 and xfs. Enabling DAX on them
-is different.
-
-Enabling DAX on ext2
------------------------------
-
-When mounting the filesystem, use the "-o dax" option on the command line or
-add 'dax' to the options in /etc/fstab. This works to enable DAX on all files
-within the filesystem. It is equivalent to the '-o dax=always' behavior below.
-
-
-Enabling DAX on xfs and ext4
-----------------------------
-
-Summary
--------
-
- 1. There exists an in-kernel file access mode flag S_DAX that corresponds to
- the statx flag STATX_ATTR_DAX. See the manpage for statx(2) for details
- about this access mode.
-
- 2. There exists a persistent flag FS_XFLAG_DAX that can be applied to regular
- files and directories. This advisory flag can be set or cleared at any
- time, but doing so does not immediately affect the S_DAX state.
-
- 3. If the persistent FS_XFLAG_DAX flag is set on a directory, this flag will
- be inherited by all regular files and subdirectories that are subsequently
- created in this directory. Files and subdirectories that exist at the time
- this flag is set or cleared on the parent directory are not modified by
- this modification of the parent directory.
-
- 4. There exist dax mount options which can override FS_XFLAG_DAX in the
- setting of the S_DAX flag. Given underlying storage which supports DAX the
- following hold:
-
- "-o dax=inode" means "follow FS_XFLAG_DAX" and is the default.
-
- "-o dax=never" means "never set S_DAX, ignore FS_XFLAG_DAX."
-
- "-o dax=always" means "always set S_DAX ignore FS_XFLAG_DAX."
-
- "-o dax" is a legacy option which is an alias for "dax=always".
- This may be removed in the future so "-o dax=always" is
- the preferred method for specifying this behavior.
-
- NOTE: Modifications to and the inheritance behavior of FS_XFLAG_DAX remain
- the same even when the filesystem is mounted with a dax option. However,
- in-core inode state (S_DAX) will be overridden until the filesystem is
- remounted with dax=inode and the inode is evicted from kernel memory.
-
- 5. The S_DAX policy can be changed via:
-
- a) Setting the parent directory FS_XFLAG_DAX as needed before files are
- created
-
- b) Setting the appropriate dax="foo" mount option
-
- c) Changing the FS_XFLAG_DAX flag on existing regular files and
- directories. This has runtime constraints and limitations that are
- described in 6) below.
-
- 6. When changing the S_DAX policy via toggling the persistent FS_XFLAG_DAX flag,
- the change in behaviour for existing regular files may not occur
- immediately. If the change must take effect immediately, the administrator
- needs to:
-
- a) stop the application so there are no active references to the data set
- the policy change will affect
-
- b) evict the data set from kernel caches so it will be re-instantiated when
- the application is restarted. This can be achieved by:
-
- i. drop-caches
- ii. a filesystem unmount and mount cycle
- iii. a system reboot
-
-
-Details
--------
-
-There are 2 per-file dax flags. One is a persistent inode setting (FS_XFLAG_DAX)
-and the other is a volatile flag indicating the active state of the feature
-(S_DAX).
-
-FS_XFLAG_DAX is preserved within the filesystem. This persistent config
-setting can be set, cleared and/or queried using the FS_IOC_FS[GS]ETXATTR ioctl
-(see ioctl_xfs_fsgetxattr(2)) or an utility such as 'xfs_io'.
-
-New files and directories automatically inherit FS_XFLAG_DAX from
-their parent directory _when_ _created_. Therefore, setting FS_XFLAG_DAX at
-directory creation time can be used to set a default behavior for an entire
-sub-tree.
-
-To clarify inheritance, here are 3 examples:
-
-Example A:
-
-mkdir -p a/b/c
-xfs_io -c 'chattr +x' a
-mkdir a/b/c/d
-mkdir a/e
-
- dax: a,e
- no dax: b,c,d
-
-Example B:
-
-mkdir a
-xfs_io -c 'chattr +x' a
-mkdir -p a/b/c/d
-
- dax: a,b,c,d
- no dax:
-
-Example C:
-
-mkdir -p a/b/c
-xfs_io -c 'chattr +x' c
-mkdir a/b/c/d
-
- dax: c,d
- no dax: a,b
-
-
-The current enabled state (S_DAX) is set when a file inode is instantiated in
-memory by the kernel. It is set based on the underlying media support, the
-value of FS_XFLAG_DAX and the filesystem's dax mount option.
-
-statx can be used to query S_DAX. NOTE that only regular files will ever have
-S_DAX set and therefore statx will never indicate that S_DAX is set on
-directories.
-
-Setting the FS_XFLAG_DAX flag (specifically or through inheritance) occurs even
-if the underlying media does not support dax and/or the filesystem is
-overridden with a mount option.
-
-
-
-Implementation Tips for Block Driver Writers
---------------------------------------------
-
-To support DAX in your block driver, implement the 'direct_access'
-block device operation. It is used to translate the sector number
-(expressed in units of 512-byte sectors) to a page frame number (pfn)
-that identifies the physical page for the memory. It also returns a
-kernel virtual address that can be used to access the memory.
-
-The direct_access method takes a 'size' parameter that indicates the
-number of bytes being requested. The function should return the number
-of bytes that can be contiguously accessed at that offset. It may also
-return a negative errno if an error occurs.
-
-In order to support this method, the storage must be byte-accessible by
-the CPU at all times. If your device uses paging techniques to expose
-a large amount of memory through a smaller window, then you cannot
-implement direct_access. Equally, if your device can occasionally
-stall the CPU for an extended period, you should also not attempt to
-implement direct_access.
-
-These block devices may be used for inspiration:
-- brd: RAM backed block device driver
-- dcssblk: s390 dcss block device driver
-- pmem: NVDIMM persistent memory driver
-
-
-Implementation Tips for Filesystem Writers
-------------------------------------------
-
-Filesystem support consists of
-- adding support to mark inodes as being DAX by setting the S_DAX flag in
- i_flags
-- implementing ->read_iter and ->write_iter operations which use dax_iomap_rw()
- when inode has S_DAX flag set
-- implementing an mmap file operation for DAX files which sets the
- VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to
- include handlers for fault, pmd_fault, page_mkwrite, pfn_mkwrite. These
- handlers should probably call dax_iomap_fault() passing the appropriate
- fault size and iomap operations.
-- calling iomap_zero_range() passing appropriate iomap operations instead of
- block_truncate_page() for DAX files
-- ensuring that there is sufficient locking between reads, writes,
- truncates and page faults
-
-The iomap handlers for allocating blocks must make sure that allocated blocks
-are zeroed out and converted to written extents before being returned to avoid
-exposure of uninitialized data through mmap.
-
-These filesystems may be used for inspiration:
-- ext2: see Documentation/filesystems/ext2.rst
-- ext4: see Documentation/filesystems/ext4/
-- xfs: see Documentation/admin-guide/xfs.rst
-
-
-Handling Media Errors
----------------------
-
-The libnvdimm subsystem stores a record of known media error locations for
-each pmem block device (in gendisk->badblocks). If we fault at such location,
-or one with a latent error not yet discovered, the application can expect
-to receive a SIGBUS. Libnvdimm also allows clearing of these errors by simply
-writing the affected sectors (through the pmem driver, and if the underlying
-NVDIMM supports the clear_poison DSM defined by ACPI).
-
-Since DAX IO normally doesn't go through the driver/bio path, applications or
-sysadmins have an option to restore the lost data from a prior backup/inbuilt
-redundancy in the following ways:
-
-1. Delete the affected file, and restore from a backup (sysadmin route):
- This will free the filesystem blocks that were being used by the file,
- and the next time they're allocated, they will be zeroed first, which
- happens through the driver, and will clear bad sectors.
-
-2. Truncate or hole-punch the part of the file that has a bad-block (at least
- an entire aligned sector has to be hole-punched, but not necessarily an
- entire filesystem block).
-
-These are the two basic paths that allow DAX filesystems to continue operating
-in the presence of media errors. More robust error recovery mechanisms can be
-built on top of this in the future, for example, involving redundancy/mirroring
-provided at the block layer through DM, or additionally, at the filesystem
-level. These would have to rely on the above two tenets, that error clearing
-can happen either by sending an IO through the driver, or zeroing (also through
-the driver).
-
-
-Shortcomings
-------------
-
-Even if the kernel or its modules are stored on a filesystem that supports
-DAX on a block device that supports DAX, they will still be copied into RAM.
-
-The DAX code does not work correctly on architectures which have virtually
-mapped caches such as ARM, MIPS and SPARC.
-
-Calling get_user_pages() on a range of user memory that has been mmaped
-from a DAX file will fail when there are no 'struct page' to describe
-those pages. This problem has been addressed in some device drivers
-by adding optional struct page support for pages under the control of
-the driver (see CONFIG_NVDIMM_PFN in drivers/nvdimm for an example of
-how to do this). In the non struct page cases O_DIRECT reads/writes to
-those memory ranges from a non-DAX file will fail (note that O_DIRECT
-reads/writes _of a DAX file_ do work, it is the memory that is being
-accessed that is key here). Other things that will not work in the
-non struct page case include RDMA, sendfile() and splice().
diff --git a/Documentation/filesystems/debugfs.rst b/Documentation/filesystems/debugfs.rst
index 1da7a4b7383d..55f807293924 100644
--- a/Documentation/filesystems/debugfs.rst
+++ b/Documentation/filesystems/debugfs.rst
@@ -120,8 +120,8 @@ and hexadecimal::
Boolean values can be placed in debugfs with::
- struct dentry *debugfs_create_bool(const char *name, umode_t mode,
- struct dentry *parent, bool *value);
+ void debugfs_create_bool(const char *name, umode_t mode,
+ struct dentry *parent, bool *value);
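+
+For example, a module might expose a runtime toggle like this (a minimal
+sketch; ``my_dir`` is a hypothetical parent dentry)::
+
+  static bool verbose;
+
+  /* back the file directly by the variable */
+  debugfs_create_bool("verbose", 0644, my_dir, &verbose);
+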
A read on the resulting file will yield either Y (for non-zero values) or
N, followed by a newline. If written to, it will accept either upper- or
@@ -155,8 +155,8 @@ any code which does so in the mainline. Note that all files created with
debugfs_create_blob() are read-only.
If you want to dump a block of registers (something that happens quite
-often during development, even if little such code reaches mainline.
-Debugfs offers two functions: one to make a registers-only file, and
+often during development, even if little such code reaches mainline),
+debugfs offers two functions: one to make a registers-only file, and
another to insert a register block in the middle of another sequential
file::
@@ -183,19 +183,23 @@ The "base" argument may be 0, but you may want to build the reg32 array
using __stringify, and a number of register names (macros) are actually
byte offsets over a base for the register block.
-If you want to dump an u32 array in debugfs, you can create file with::
+If you want to dump a u32 array in debugfs, you can create a file with::
+
+ struct debugfs_u32_array {
+ u32 *array;
+ u32 n_elements;
+ };
void debugfs_create_u32_array(const char *name, umode_t mode,
struct dentry *parent,
- u32 *array, u32 elements);
+ struct debugfs_u32_array *array);
-The "array" argument provides data, and the "elements" argument is
-the number of elements in the array. Note: Once array is created its
-size can not be changed.
+The "array" argument wraps a pointer to the array's data and the number
+of its elements. Note: Once array is created its size can not be changed.
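+
+For example (a sketch; ``my_dir`` is hypothetical, and the wrapper struct
+must stay alive for as long as the file exists, hence static)::
+
+  static u32 counters[4];
+  static struct debugfs_u32_array counters_array = {
+          .array = counters,
+          .n_elements = ARRAY_SIZE(counters),
+  };
+
+  debugfs_create_u32_array("counters", 0444, my_dir, &counters_array);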
-There is a helper function to create device related seq_file::
+There is a helper function to create a device-related seq_file::
- struct dentry *debugfs_create_devm_seqfile(struct device *dev,
+ void debugfs_create_devm_seqfile(struct device *dev,
const char *name,
struct dentry *parent,
int (*read_fn)(struct seq_file *s,
@@ -207,18 +211,16 @@ seq_file content.
There are a couple of other directory-oriented helper functions::
- struct dentry *debugfs_rename(struct dentry *old_dir,
- struct dentry *old_dentry,
- struct dentry *new_dir,
- const char *new_name);
+ int debugfs_change_name(struct dentry *dentry,
+ const char *fmt, ...);
struct dentry *debugfs_create_symlink(const char *name,
struct dentry *parent,
const char *target);
-A call to debugfs_rename() will give a new name to an existing debugfs
-file, possibly in a different directory. The new_name must not exist prior
-to the call; the return value is old_dentry with updated information.
+A call to debugfs_change_name() will give a new name to an existing debugfs
+file, always in the same directory. The new_name must not exist prior
+to the call; the return value is 0 on success and -E... on failure.
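+
+For example (a sketch; ``dentry`` is assumed to come from an earlier
+debugfs_create_*() call and ``id`` is illustrative)::
+
+  int err = debugfs_change_name(dentry, "stats-%d", id);
+  if (err)
+          pr_warn("debugfs rename failed: %d\n", err);
+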
Symbolic links can be created with debugfs_create_symlink().
There is one important thing that all debugfs users must take into account:
@@ -227,22 +229,15 @@ module is unloaded without explicitly removing debugfs entries, the result
will be a lot of stale pointers and no end of highly antisocial behavior.
So all debugfs users - at least those which can be built as modules - must
be prepared to remove all files and directories they create there. A file
-can be removed with::
+or directory can be removed with::
void debugfs_remove(struct dentry *dentry);
The dentry value can be NULL or an error value, in which case nothing will
-be removed.
-
-Once upon a time, debugfs users were required to remember the dentry
-pointer for every debugfs file they created so that all files could be
-cleaned up. We live in more civilized times now, though, and debugfs users
-can call::
-
- void debugfs_remove_recursive(struct dentry *dentry);
-
-If this function is passed a pointer for the dentry corresponding to the
-top-level directory, the entire hierarchy below that directory will be
-removed.
+be removed. Note that this function will recursively remove all files and
+directories underneath it. Previously, debugfs_remove_recursive() was used
+to perform that task, but this function is now just an alias to
+debugfs_remove(). debugfs_remove_recursive() should be considered
+deprecated.
.. [1] http://lwn.net/Articles/309298/
diff --git a/Documentation/filesystems/devpts.rst b/Documentation/filesystems/devpts.rst
index a03248ddfb4c..b6324ab1960d 100644
--- a/Documentation/filesystems/devpts.rst
+++ b/Documentation/filesystems/devpts.rst
@@ -5,8 +5,8 @@ The Devpts Filesystem
=====================
Each mount of the devpts filesystem is now distinct such that ptys
-and their indicies allocated in one mount are independent from ptys
-and their indicies in all other mounts.
+and their indices allocated in one mount are independent from ptys
+and their indices in all other mounts.
All mounts of the devpts filesystem now create a ``/dev/pts/ptmx`` node
with permissions ``0000``.
diff --git a/Documentation/filesystems/directory-locking.rst b/Documentation/filesystems/directory-locking.rst
index de12016ee419..6fdf0b02df43 100644
--- a/Documentation/filesystems/directory-locking.rst
+++ b/Documentation/filesystems/directory-locking.rst
@@ -11,127 +11,268 @@ When taking the i_rwsem on multiple non-directory objects, we
always acquire the locks in order by increasing address. We'll call
that "inode pointer" order in the following.
-For our purposes all operations fall in 5 classes:
-1) read access. Locking rules: caller locks directory we are accessing.
-The lock is taken shared.
+Primitives
+==========
-2) object creation. Locking rules: same as above, but the lock is taken
-exclusive.
+For our purposes all operations fall into 6 classes:
-3) object removal. Locking rules: caller locks parent, finds victim,
-locks victim and calls the method. Locks are exclusive.
+1. read access. Locking rules:
-4) rename() that is _not_ cross-directory. Locking rules: caller locks
-the parent and finds source and target. In case of exchange (with
-RENAME_EXCHANGE in flags argument) lock both. In any case,
-if the target already exists, lock it. If the source is a non-directory,
-lock it. If we need to lock both, lock them in inode pointer order.
-Then call the method. All locks are exclusive.
-NB: we might get away with locking the the source (and target in exchange
-case) shared.
+ * lock the directory we are accessing (shared)
-5) link creation. Locking rules:
+2. object creation. Locking rules:
- * lock parent
- * check that source is not a directory
- * lock source
- * call the method.
+ * lock the directory we are accessing (exclusive)
-All locks are exclusive.
+3. object removal. Locking rules:
-6) cross-directory rename. The trickiest in the whole bunch. Locking
-rules:
+ * lock the parent (exclusive)
+ * find the victim
+ * lock the victim (exclusive)
- * lock the filesystem
- * lock parents in "ancestors first" order.
- * find source and target.
- * if old parent is equal to or is a descendent of target
- fail with -ENOTEMPTY
- * if new parent is equal to or is a descendent of source
- fail with -ELOOP
- * If it's an exchange, lock both the source and the target.
- * If the target exists, lock it. If the source is a non-directory,
- lock it. If we need to lock both, do so in inode pointer order.
- * call the method.
+4. link creation. Locking rules:
+
+ * lock the parent (exclusive)
+ * check that the source is not a directory
+ * lock the source (exclusive; probably could be weakened to shared)
-All ->i_rwsem are taken exclusive. Again, we might get away with locking
-the the source (and target in exchange case) shared.
+5. rename that is _not_ cross-directory. Locking rules:
-The rules above obviously guarantee that all directories that are going to be
-read, modified or removed by method will be locked by caller.
+ * lock the parent (exclusive)
+ * find the source and target
+ * decide which of the source and target need to be locked.
+ The source needs to be locked if it's a non-directory; the target - if it's
+ a non-directory or about to be removed.
+ * take the locks that need to be taken (exclusive), in inode pointer order
+ if we need to take both (that can happen only when both source and target
+ are non-directories - the source because it wouldn't need to be locked
+ otherwise and the target because mixing directory and non-directory is
+ allowed only with RENAME_EXCHANGE, and that won't be removing the target).
+6. cross-directory rename. The trickiest in the whole bunch. Locking rules:
+
+ * lock the filesystem
+ * if the parents don't have a common ancestor, fail the operation.
+ * lock the parents in "ancestors first" order (exclusive). If neither is an
+ ancestor of the other, lock the parent of source first.
+ * find the source and target.
+ * verify that the source is not a descendent of the target and
+ target is not a descendent of source; fail the operation otherwise.
+ * lock the subdirectories involved (exclusive), source before target.
+ * lock the non-directories involved (exclusive), in inode pointer order.
+
+The rules above obviously guarantee that all directories that are going
+to be read, modified or removed by the method will be locked by the caller.
+
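+As a concrete illustration of "inode pointer order", a helper along the lines
+of the kernel's lock_two_nondirectories() could be sketched as follows
+(simplified; the real code also handles NULL arguments and uses lockdep
+nesting annotations)::
+
+  static void lock_two_inodes_sketch(struct inode *a, struct inode *b)
+  {
+          /* lower address first yields a single global order */
+          if (b < a)
+                  swap(a, b);
+          inode_lock(a);
+          if (b != a)
+                  inode_lock(b);
+  }
+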
+
+Splicing
+========
+
+There is one more thing to consider - splicing. It's not an operation
+in its own right; it may happen as part of lookup. We speak of the
+operations on directory trees, but we obviously do not have the full
+picture of those - especially for network filesystems. What we have
+is a bunch of subtrees visible in dcache and locking happens on those.
+Trees grow as we do operations; memory pressure prunes them. Normally
+that's not a problem, but there is a nasty twist - what should we do
+when one growing tree reaches the root of another? That can happen in
+several scenarios, starting from "somebody mounted two nested subtrees
+from the same NFS4 server and doing lookups in one of them has reached
+the root of another"; there's also open-by-fhandle stuff, and there's a
+possibility that a directory we see in one place gets moved by the server
+to another and we run into it when we do a lookup.
+
+For a lot of reasons we want to have the same directory present in dcache
+only once. Multiple aliases are not allowed. So when lookup runs into
+a subdirectory that already has an alias, something needs to be done with
+dcache trees. Lookup is already holding the parent locked. If the alias is
+a root of a separate tree, it gets attached to the directory we are doing a
+lookup in, under the name we'd been looking for. If the alias is already
+a child of the directory we are looking in, it changes name to the one
+we'd been looking for. No extra locking is involved in these two cases.
+However, if it's a child of some other directory, things get trickier.
+First of all, we verify that it is *not* an ancestor of our directory
+and fail the lookup if it is. Then we try to lock the filesystem and the
+current parent of the alias. If either trylock fails, we fail the lookup.
+If trylocks succeed, we detach the alias from its current parent and
+attach to our directory, under the name we are looking for.
+
+Note that splicing does *not* involve any modification of the filesystem;
+all we change is the view in dcache. Moreover, holding a directory locked
+exclusive prevents such changes involving its children and holding the
+filesystem lock prevents any changes of tree topology, other than having a
+root of one tree becoming a child of a directory in another. In particular,
+if two dentries have been found to have a common ancestor after taking
+the filesystem lock, their relationship will remain unchanged until
+the lock is dropped. So from the directory operations' point of view
+splicing is almost irrelevant - the only place where it matters is one
+step in cross-directory renames; we need to be careful when checking if
+parents have a common ancestor.
+
+
+Multiple-filesystem stuff
+=========================
+
+For some filesystems a method can involve a directory operation on
+another filesystem; it may be ecryptfs doing an operation in the underlying
+filesystem, overlayfs doing something to the layers, network filesystem
+using a local one as a cache, etc. In all such cases the operations
+on other filesystems must follow the same locking rules. Moreover, "a
+directory operation on this filesystem might involve directory operations
+on that filesystem" should be an asymmetric relation (or, if you will,
+it should be possible to rank the filesystems so that directory operation
+on a filesystem could trigger directory operations only on higher-ranked
+ones - in these terms overlayfs ranks lower than its layers, network
+filesystem ranks lower than whatever it caches on, etc.)
+
+
+Deadlock avoidance
+==================
If no directory is its own ancestor, the scheme above is deadlock-free.
Proof:
- First of all, at any moment we have a partial ordering of the
- objects - A < B iff A is an ancestor of B.
-
- That ordering can change. However, the following is true:
-
-(1) if object removal or non-cross-directory rename holds lock on A and
- attempts to acquire lock on B, A will remain the parent of B until we
- acquire the lock on B. (Proof: only cross-directory rename can change
- the parent of object and it would have to lock the parent).
-
-(2) if cross-directory rename holds the lock on filesystem, order will not
- change until rename acquires all locks. (Proof: other cross-directory
- renames will be blocked on filesystem lock and we don't start changing
- the order until we had acquired all locks).
-
-(3) locks on non-directory objects are acquired only after locks on
- directory objects, and are acquired in inode pointer order.
- (Proof: all operations but renames take lock on at most one
- non-directory object, except renames, which take locks on source and
- target in inode pointer order in the case they are not directories.)
-
-Now consider the minimal deadlock. Each process is blocked on
-attempt to acquire some lock and already holds at least one lock. Let's
-consider the set of contended locks. First of all, filesystem lock is
-not contended, since any process blocked on it is not holding any locks.
-Thus all processes are blocked on ->i_rwsem.
-
-By (3), any process holding a non-directory lock can only be
-waiting on another non-directory lock with a larger address. Therefore
-the process holding the "largest" such lock can always make progress, and
-non-directory objects are not included in the set of contended locks.
-
-Thus link creation can't be a part of deadlock - it can't be
-blocked on source and it means that it doesn't hold any locks.
-
-Any contended object is either held by cross-directory rename or
-has a child that is also contended. Indeed, suppose that it is held by
-operation other than cross-directory rename. Then the lock this operation
-is blocked on belongs to child of that object due to (1).
-
-It means that one of the operations is cross-directory rename.
-Otherwise the set of contended objects would be infinite - each of them
-would have a contended child and we had assumed that no object is its
-own descendent. Moreover, there is exactly one cross-directory rename
-(see above).
-
-Consider the object blocking the cross-directory rename. One
-of its descendents is locked by cross-directory rename (otherwise we
-would again have an infinite set of contended objects). But that
-means that cross-directory rename is taking locks out of order. Due
-to (2) the order hadn't changed since we had acquired filesystem lock.
-But locking rules for cross-directory rename guarantee that we do not
-try to acquire lock on descendent before the lock on ancestor.
-Contradiction. I.e. deadlock is impossible. Q.E.D.
-
+There is a ranking on the locks, such that all primitives take
+them in order of non-decreasing rank. Namely,
+
+ * rank ->i_rwsem of non-directories on given filesystem in inode pointer
+ order.
+ * put ->i_rwsem of all directories on a filesystem at the same rank,
+ lower than ->i_rwsem of any non-directory on the same filesystem.
+ * put ->s_vfs_rename_mutex at rank lower than that of any ->i_rwsem
+ on the same filesystem.
+ * among the locks on different filesystems use the relative
+ rank of those filesystems.
+
+For example, if we have NFS filesystem caching on a local one, we have
+
+ 1. ->s_vfs_rename_mutex of NFS filesystem
+ 2. ->i_rwsem of directories on that NFS filesystem, same rank for all
+ 3. ->i_rwsem of non-directories on that filesystem, in order of
+ increasing address of inode
+ 4. ->s_vfs_rename_mutex of local filesystem
+ 5. ->i_rwsem of directories on the local filesystem, same rank for all
+ 6. ->i_rwsem of non-directories on local filesystem, in order of
+ increasing address of inode.
+
+It's easy to verify that operations never take a lock with rank
+lower than that of an already held lock.
+
+Suppose deadlocks are possible. Consider the minimal deadlocked
+set of threads. It is a cycle of several threads, each blocked on a lock
+held by the next thread in the cycle.
+
+Since the locking order is consistent with the ranking, all
+contended locks in the minimal deadlock will be of the same rank,
+i.e. they all will be ->i_rwsem of directories on the same filesystem.
+Moreover, without loss of generality we can assume that all operations
+are done directly to that filesystem and none of them has actually
+reached the method call.
+
+In other words, we have a cycle of threads, T1,..., Tn,
+and the same number of directories (D1,...,Dn) such that
+
+ T1 is blocked on D1 which is held by T2
+
+ T2 is blocked on D2 which is held by T3
+
+ ...
+
+ Tn is blocked on Dn which is held by T1.
+
+Each operation in the minimal cycle must have locked at least
+one directory and blocked on attempt to lock another. That leaves
+only 3 possible operations: directory removal (locks parent, then
+child), same-directory rename killing a subdirectory (ditto) and
+cross-directory rename of some sort.
+
+There must be a cross-directory rename in the set; indeed,
+if all operations had been of the "lock parent, then child" sort
+we would have Dn a parent of D1, which is a parent of D2, which is
+a parent of D3, ..., which is a parent of Dn. Relationships couldn't
+have changed since the moment directory locks had been acquired,
+so they would all hold simultaneously at the deadlock time and
+we would have a loop.
+
+Since all operations are on the same filesystem, there can't be
+more than one cross-directory rename among them. Without loss of
+generality we can assume that T1 is the one doing a cross-directory
+rename and everything else is of the "lock parent, then child" sort.
+
+In other words, we have a cross-directory rename that locked
+Dn and blocked on attempt to lock D1, which is a parent of D2, which is
+a parent of D3, ..., which is a parent of Dn. Relationships between
+D1,...,Dn all hold simultaneously at the deadlock time. Moreover,
+cross-directory rename does not get to locking any directories until it
+has acquired filesystem lock and verified that directories involved have
+a common ancestor, which guarantees that ancestry relationships between
+all of them had been stable.
+
+Consider the order in which directories are locked by the
+cross-directory rename; parents first, then possibly their children.
+Dn and D1 would have to be among those, with Dn locked before D1.
+Which pair could it be?
+
+It can't be the parents - indeed, since D1 is an ancestor of Dn,
+it would be the first parent to be locked. Therefore at least one of the
+children must be involved and thus neither of them could be a descendent
+of another - otherwise the operation would not have progressed past
+locking the parents.
+
+It can't be a parent and its child; otherwise we would've had
+a loop, since the parents are locked before the children, so the parent
+would have to be a descendent of its child.
+
+It can't be a parent and a child of another parent either.
+Otherwise the child of the parent in question would've been a descendent
+of another child.
+
+That leaves only one possibility - namely, both Dn and D1 are
+among the children, in some order. But that is also impossible, since
+neither of the children is a descendent of another.
+
+That concludes the proof, since the set of operations with the
+properties required for a minimal deadlock can not exist.
+
+Note that the check for having a common ancestor in cross-directory
+rename is crucial - without it a deadlock would be possible. Indeed,
+suppose the parents are initially in different trees; we would lock the
+parent of source, then try to lock the parent of target, only to have
+an unrelated lookup splice a distant ancestor of source to some distant
+descendent of the parent of target. At that point we have cross-directory
+rename holding the lock on parent of source and trying to lock its
+distant ancestor. Add a bunch of rmdir() attempts on all directories
+in between (all of those would fail with -ENOTEMPTY, had they ever gotten
+the locks) and voila - we have a deadlock.
+
+Loop avoidance
+==============
These operations are guaranteed to avoid loop creation. Indeed,
the only operation that could introduce loops is cross-directory rename.
-Since the only new (parent, child) pair added by rename() is (new parent,
-source), such loop would have to contain these objects and the rest of it
-would have to exist before rename(). I.e. at the moment of loop creation
-rename() responsible for that would be holding filesystem lock and new parent
-would have to be equal to or a descendent of source. But that means that
-new parent had been equal to or a descendent of source since the moment when
-we had acquired filesystem lock and rename() would fail with -ELOOP in that
-case.
+Suppose after the operation there is a loop; since there hadn't been such
+loops before the operation, at least one of the nodes in that loop must've
+had its parent changed. In other words, the loop must be passing through
+the source or, in case of exchange, possibly the target.
+
+Since the operation has succeeded, neither source nor target could have
+been ancestors of each other. Therefore the chain of ancestors starting
+in the parent of source could not have passed through the target and
+vice versa. On the other hand, the chain of ancestors of any node could
+not have passed through the node itself, or we would've had a loop before
+the operation. But everything other than source and target has kept
+the parent after the operation, so the operation does not change the
+chains of ancestors of (ex-)parents of source and target. In particular,
+those chains must end after a finite number of steps.
+
+Now consider the loop created by the operation. It passes through either
+source or target; the next node in the loop would be the ex-parent of
+target or source respectively. After that the loop would follow the chain of
+ancestors of that parent. But as we have just shown, that chain must
+end after a finite number of steps, which means that it can't be a part
+of any loop. Q.E.D.
While this locking scheme works for arbitrary DAGs, it relies on
ability to check that directory is a descendent of another object. Current
diff --git a/Documentation/filesystems/dlmfs.rst b/Documentation/filesystems/dlmfs.rst
index 68daaa7facf9..70d4e48242c3 100644
--- a/Documentation/filesystems/dlmfs.rst
+++ b/Documentation/filesystems/dlmfs.rst
@@ -12,7 +12,7 @@ dlmfs is built with OCFS2 as it requires most of its infrastructure.
:Project web page: http://ocfs2.wiki.kernel.org
:Tools web page: https://github.com/markfasheh/ocfs2-tools
-:OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/
+:OCFS2 mailing lists: https://subspace.kernel.org/lists.linux.dev.html
All code copyright 2005 Oracle except when otherwise noted.
@@ -36,7 +36,7 @@ None
Usage
=====
-If you're just interested in OCFS2, then please see ocfs2.txt. The
+If you're just interested in OCFS2, then please see ocfs2.rst. The
rest of this document will be geared towards those who want to use
dlmfs for easy to setup and easy to use clustered locking in
userspace.
diff --git a/Documentation/filesystems/efivarfs.rst b/Documentation/filesystems/efivarfs.rst
index 0551985821b8..f646c3f0980f 100644
--- a/Documentation/filesystems/efivarfs.rst
+++ b/Documentation/filesystems/efivarfs.rst
@@ -40,4 +40,4 @@ accidentally.
*See also:*
- Documentation/admin-guide/acpi/ssdt-overlays.rst
-- Documentation/ABI/stable/sysfs-firmware-efi-vars
+- Documentation/ABI/removed/sysfs-firmware-efi-vars
diff --git a/Documentation/filesystems/erofs.rst b/Documentation/filesystems/erofs.rst
index bf145171c2bf..7ddb235aee9d 100644
--- a/Documentation/filesystems/erofs.rst
+++ b/Documentation/filesystems/erofs.rst
@@ -1,63 +1,100 @@
.. SPDX-License-Identifier: GPL-2.0
======================================
-Enhanced Read-Only File System - EROFS
+EROFS - Enhanced Read-Only File System
======================================
Overview
========
-EROFS file-system stands for Enhanced Read-Only File System. Different
-from other read-only file systems, it aims to be designed for flexibility,
-scalability, but be kept simple and high performance.
+EROFS filesystem stands for Enhanced Read-Only File System. It aims to form a
+generic read-only filesystem solution for various read-only use cases instead
+of just focusing on storage space saving without considering any side effects
+of runtime performance.
-It is designed as a better filesystem solution for the following scenarios:
+It is designed to be flexible, feature-extensible and friendly to user
+payloads. Apart from that, it is still kept as a simple, random-access
+friendly, high-performance filesystem to get rid of unneeded I/O
+amplification and memory-resident overhead compared to similar approaches.
+
+It is implemented to be a better choice for the following scenarios:
- read-only storage media or
- part of a fully trusted read-only solution, which means it needs to be
immutable and bit-for-bit identical to the official golden image for
- their releases due to security and other considerations and
+ their releases due to security or other considerations and
- - hope to save some extra storage space with guaranteed end-to-end performance
- by using reduced metadata and transparent file compression, especially
- for those embedded devices with limited memory (ex, smartphone);
+ - hope to minimize extra storage space with guaranteed end-to-end performance
+ by using compact layout, transparent file compression and direct access,
+ especially for those embedded devices with limited memory and high-density
+ hosts with numerous containers.
-Here is the main features of EROFS:
+Here are the main features of EROFS:
- Little endian on-disk design;
- - Currently 4KB block size (nobh) and therefore maximum 16TB address space;
+ - Block-based distribution and file-based distribution over fscache are
+ supported;
+
+ - Support multiple devices to refer to external blobs, which can be used
+ for container images;
- - Metadata & data could be mixed by design;
+ - 32-bit block addresses for each device, therefore 16TiB address space at
+ most with 4KiB block size for now;
- - 2 inode versions for different requirements:
+ - Two inode layouts for different requirements:
- ===================== ============ =====================================
+ ===================== ============ ======================================
compact (v1) extended (v2)
- ===================== ============ =====================================
+ ===================== ============ ======================================
Inode metadata size 32 bytes 64 bytes
- Max file size 4 GB 16 EB (also limited by max. vol size)
+ Max file size 4 GiB 16 EiB (also limited by max. vol size)
Max uids/gids 65536 4294967296
- File change time no yes (64 + 32-bit timestamp)
+ Per-inode timestamp no yes (64 + 32-bit timestamp)
Max hardlinks 65536 4294967296
- Metadata reserved 4 bytes 14 bytes
- ===================== ============ =====================================
+ Metadata reserved 8 bytes 18 bytes
+ ===================== ============ ======================================
+
+ - Support extended attributes as an option;
+
+ - Support a bloom filter that speeds up negative extended attribute lookups;
+
+ - Support POSIX.1e ACLs by using extended attributes;
+
+ - Support transparent data compression as an option:
+ LZ4, MicroLZMA and DEFLATE algorithms can be used on a per-file basis; In
+ addition, inplace decompression is also supported to avoid bounce compressed
+ buffers and unnecessary page cache thrashing.
+
+ - Support chunk-based data deduplication and rolling-hash compressed data
+ deduplication;
+
+ - Support tailpacking inline compared to byte-addressed unaligned metadata
+ or smaller block size alternatives;
+
+ - Support merging tail-end data into a special inode as fragments.
- - Support extended attributes (xattrs) as an option;
+ - Support large folios to make use of THPs (Transparent Hugepages);
- - Support xattr inline and tail-end data inline for all files;
+ - Support direct I/O on uncompressed files to avoid double caching for loop
+ devices;
- - Support POSIX.1e ACLs by using xattrs;
+ - Support FSDAX on uncompressed images for secure containers and ramdisks in
+ order to get rid of unnecessary page cache.
- - Support transparent file compression as an option:
- LZ4 algorithm with 4 KB fixed-sized output compression for high performance.
+ - Support file-based on-demand loading with the Fscache infrastructure.
The following git tree provides the file system user-space tools under
-development (ex, formatting tool mkfs.erofs):
+development, such as a formatting tool (mkfs.erofs), an on-disk consistency &
+compatibility checking tool (fsck.erofs), and a debugging tool (dump.erofs):
- git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git
+For more information, please also refer to the documentation site:
+
+- https://erofs.docs.kernel.org
+
Bugs and patches are welcome, please kindly help us and send to the following
linux-erofs mailing list:
@@ -84,8 +121,24 @@ cache_strategy=%s Select a strategy for cached decompression from now on:
It still does in-place I/O decompression
for the rest compressed physical clusters.
========== =============================================
+dax={always,never} Use direct access (no page cache). See
+ Documentation/filesystems/dax.rst.
+dax A legacy option which is an alias for ``dax=always``.
+device=%s Specify a path to an extra device to be used together.
+fsid=%s Specify a filesystem image ID for Fscache back-end.
+domain_id=%s Specify a domain ID in fscache mode so that different images
+ with the same blobs under a given domain ID can share storage.
+fsoffset=%llu Specify block-aligned filesystem offset for the primary device.
=================== =========================================================
+Sysfs Entries
+=============
+
+Information about mounted erofs file systems can be found in /sys/fs/erofs.
+Each mounted filesystem will have a directory in /sys/fs/erofs based on its
+device name (e.g., /sys/fs/erofs/sda).
+(see also Documentation/ABI/testing/sysfs-fs-erofs)
+
On-disk details
===============
@@ -113,31 +166,31 @@ may not. All metadatas can be now observed in two different spaces (views):
::
- |-> aligned with 8B
- |-> followed closely
- + meta_blkaddr blocks |-> another slot
- _____________________________________________________________________
- | ... | inode | xattrs | extents | data inline | ... | inode ...
- |________|_______|(optional)|(optional)|__(optional)_|_____|__________
- |-> aligned with the inode slot size
- . .
- . .
- . .
- . .
- . .
- . .
- .____________________________________________________|-> aligned with 4B
- | xattr_ibody_header | shared xattrs | inline xattrs |
- |____________________|_______________|_______________|
- |-> 12 bytes <-|->x * 4 bytes<-| .
- . . .
- . . .
- . . .
- ._______________________________.______________________.
- | id | id | id | id | ... | id | ent | ... | ent| ... |
- |____|____|____|____|______|____|_____|_____|____|_____|
- |-> aligned with 4B
- |-> aligned with 4B
+ |-> aligned with 8B
+ |-> followed closely
+ + meta_blkaddr blocks |-> another slot
+ _____________________________________________________________________
+ | ... | inode | xattrs | extents | data inline | ... | inode ...
+ |________|_______|(optional)|(optional)|__(optional)_|_____|__________
+ |-> aligned with the inode slot size
+ . .
+ . .
+ . .
+ . .
+ . .
+ . .
+ .____________________________________________________|-> aligned with 4B
+ | xattr_ibody_header | shared xattrs | inline xattrs |
+ |____________________|_______________|_______________|
+ |-> 12 bytes <-|->x * 4 bytes<-| .
+ . . .
+ . . .
+ . . .
+ ._______________________________.______________________.
+ | id | id | id | id | ... | id | ent | ... | ent| ... |
+ |____|____|____|____|______|____|_____|_____|____|_____|
+ |-> aligned with 4B
+ |-> aligned with 4B
Inode could be 32 or 64 bytes, which can be distinguished from a common
field which all inode versions have -- i_format::
@@ -151,15 +204,16 @@ may not. All metadatas can be now observed in two different spaces (views):
| |
|__________________| 64 bytes
- Xattrs, extents, data inline are followed by the corresponding inode with
+ Xattrs, extents, data inline are placed after the corresponding inode with
proper alignment, and they could be optional for different data mappings.
- _currently_ total 4 valid data mappings are supported:
+ _currently_ total 5 data layouts are supported:
== ====================================================================
0 flat file data without data inline (no extent);
1 fixed-sized output data compression (with non-compacted indexes);
2 flat file data with tail packing data inline (no extent);
- 3 fixed-sized output data compression (with compacted indexes, v5.3+).
+ 3 fixed-sized output data compression (with compacted indexes, v5.3+);
+ 4 chunk-based file (v5.15+).
== ====================================================================
The size of the optional xattrs is indicated by i_xattr_count in inode
@@ -175,13 +229,13 @@ may not. All metadatas can be now observed in two different spaces (views):
Each share xattr can also be directly found by the following formula:
xattr offset = xattr_blkaddr * block_size + 4 * xattr_id
- ::
+::
- |-> aligned by 4 bytes
- + xattr_blkaddr blocks |-> aligned with 4 bytes
- _________________________________________________________________________
- | ... | xattr_entry | xattr data | ... | xattr_entry | xattr data ...
- |________|_____________|_____________|_____|______________|_______________
+ |-> aligned by 4 bytes
+ + xattr_blkaddr blocks |-> aligned with 4 bytes
+ _________________________________________________________________________
+ | ... | xattr_entry | xattr data | ... | xattr_entry | xattr data ...
+ |________|_____________|_____________|_____|______________|_______________
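+
+In code form, resolving a shared xattr id to its byte offset is just the
+formula above (an illustrative sketch, not the actual kernel helper)::
+
+  static u64 shared_xattr_offset(u32 xattr_blkaddr, u32 block_size,
+                                 u32 xattr_id)
+  {
+          return (u64)xattr_blkaddr * block_size + 4 * xattr_id;
+  }
+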
Directories
-----------
@@ -193,48 +247,123 @@ algorithm (could refer to the related source code).
::
- ___________________________
- / |
- / ______________|________________
- / / | nameoff1 | nameoffN-1
- ____________.______________._______________v________________v__________
- | dirent | dirent | ... | dirent | filename | filename | ... | filename |
- |___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____|
- \ ^
- \ | * could have
- \ | trailing '\0'
- \________________________| nameoff0
-
- Directory block
+ ___________________________
+ / |
+ / ______________|________________
+ / / | nameoff1 | nameoffN-1
+ ____________.______________._______________v________________v__________
+ | dirent | dirent | ... | dirent | filename | filename | ... | filename |
+ |___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____|
+ \ ^
+ \ | * could have
+ \ | trailing '\0'
+ \________________________| nameoff0
+ Directory block
Note that apart from the offset of the first filename, nameoff0 also indicates
the total number of directory entries in this block since it is no need to
introduce another on-disk field at all.
-Compression
------------
-Currently, EROFS supports 4KB fixed-sized output transparent file compression,
-as illustrated below::
-
- |---- Variant-Length Extent ----|-------- VLE --------|----- VLE -----
- clusterofs clusterofs clusterofs
- | | | logical data
- _________v_______________________________v_____________________v_______________
- ... | . | | . | | . | ...
- ____|____.________|_____________|________.____|_____________|__.__________|____
- |-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-|-> cluster <-|
- size size size size size
- . . . .
- . . . .
- . . . .
- _______._____________._____________._____________._____________________
- ... | | | | ... physical data
- _______|_____________|_____________|_____________|_____________________
- |-> cluster <-|-> cluster <-|-> cluster <-|
- size size size
-
-Currently each on-disk physical cluster can contain 4KB (un)compressed data
-at most. For each logical cluster, there is a corresponding on-disk index to
-describe its cluster type, physical cluster address, etc.
-
-See "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details.
+Chunk-based files
+-----------------
+In order to support chunk-based data deduplication, a new inode data layout has
+been supported since Linux v5.15: files are split into equal-sized data chunks,
+with the ``extents`` area of the inode metadata indicating how to get the chunk
+data: it can simply be a 4-byte block address array or take the 8-byte
+chunk index form (see struct erofs_inode_chunk_index in erofs_fs.h for more
+details).
+
+Note that chunk-based files are all uncompressed for now.
+
+Long extended attribute name prefixes
+-------------------------------------
+There are use cases where extended attributes with different values can have
+only a few common prefixes (such as overlayfs xattrs). The predefined prefixes
+work inefficiently in both image size and runtime performance in such cases.
+
+The long xattr name prefixes feature is introduced to address this issue. The
+overall idea is that, apart from the existing predefined prefixes, the xattr
+entry could also refer to user-specified long xattr name prefixes, e.g.
+"trusted.overlay.".
+
+When referring to a long xattr name prefix, the highest bit (bit 7) of
+erofs_xattr_entry.e_name_index is set, while the lower bits (bit 0-6) as a whole
+represent the index of the referred long name prefix among all long name
+prefixes. Therefore, only the trailing part of the name apart from the long
+xattr name prefix is stored in erofs_xattr_entry.e_name, which could be empty if
+the full xattr name exactly matches its long xattr name prefix.
+
+All long xattr prefixes are stored one by one in the packed inode as long as
+the packed inode is valid, or in the meta inode otherwise. The
+xattr_prefix_count (of the on-disk superblock) indicates the total number of
+long xattr name prefixes, while (xattr_prefix_start * 4) indicates the start
+offset of long name prefixes in the packed/meta inode. Note that, long extended
+attribute name prefixes are disabled if xattr_prefix_count is 0.
+
+Each long name prefix is stored in the format: ALIGN({__le16 len, data}, 4),
+where len represents the total size of the data part. The data part is actually
+represented by 'struct erofs_xattr_long_prefix', where base_index represents the
+index of the predefined xattr name prefix, e.g. EROFS_XATTR_INDEX_TRUSTED for
+"trusted.overlay." long name prefix, while the infix string keeps the string
+after stripping the short prefix, e.g. "overlay." for the example above.
+
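+In code form, walking the stored prefixes might look like this (a rough
+sketch; struct erofs_xattr_long_prefix comes from erofs_fs.h, everything else
+is illustrative)::
+
+  /* each record is ALIGN({__le16 len, data}, 4) */
+  const u8 *p = prefixes_start;
+
+  while (p < prefixes_end) {
+          u16 len = le16_to_cpu(*(const __le16 *)p);
+          const struct erofs_xattr_long_prefix *pfx =
+                  (const void *)(p + sizeof(__le16));
+
+          /* pfx->base_index plus an infix string of (len - 1) bytes */
+          p += round_up(sizeof(__le16) + len, 4);
+  }
+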
+Data compression
+----------------
+EROFS implements fixed-sized output compression which generates fixed-sized
+compressed data blocks from variable-sized input in contrast to other existing
+fixed-sized input solutions. Relatively higher compression ratios can be
+achieved with fixed-sized output compression, since popular data compression
+algorithms are mostly LZ77-based and such a fixed-sized output approach can
+benefit from the historical dictionary (aka. the sliding window).
+
+In detail, original (uncompressed) data is turned into several variable-sized
+extents and in the meanwhile, compressed into physical clusters (pclusters).
+In order to record each variable-sized extent, logical clusters (lclusters) are
+introduced as the basic unit of compress indexes to indicate whether a new
+extent is generated within the range (HEAD) or not (NONHEAD). Lclusters are now
+fixed in block size, as illustrated below::
+
+ |<- variable-sized extent ->|<- VLE ->|
+ clusterofs clusterofs clusterofs
+ | | |
+ _________v_________________________________v_______________________v________
+ ... | . | | . | | . ...
+ ____|____._________|______________|________.___ _|______________|__.________
+ |-> lcluster <-|-> lcluster <-|-> lcluster <-|-> lcluster <-|
+ (HEAD) (NONHEAD) (HEAD) (NONHEAD) .
+ . CBLKCNT . .
+ . . .
+ . . .
+ _______._____________________________.______________._________________
+ ... | | | | ...
+ _______|______________|______________|______________|_________________
+ |-> big pcluster <-|-> pcluster <-|
+
+A physical cluster can be seen as a container of physical compressed blocks
+which contains compressed data. Previously, only lcluster-sized (4KB) pclusters
+were supported. Since the big pcluster feature was introduced (available in
+Linux v5.13), a pcluster can be a multiple of the lcluster size.
+
+For each HEAD lcluster, clusterofs is recorded to indicate where a new extent
+starts and blkaddr is used to seek the compressed data. For each NONHEAD
+lcluster, delta0 and delta1 are available instead of blkaddr to indicate the
+distance to its HEAD lcluster and the next HEAD lcluster. A PLAIN lcluster is
+also a HEAD lcluster except that its data is uncompressed. See the comments
+around "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details.
+
+If big pcluster is enabled, the pcluster size in lclusters needs to be recorded
+as well. The delta0 of the first NONHEAD lcluster stores the compressed block
+count together with a special flag, making it a special NONHEAD lcluster called
+CBLKCNT. It's easy to see that its delta0 is constantly 1, as illustrated
+below::
+
+ __________________________________________________________
+ | HEAD | NONHEAD | NONHEAD | ... | NONHEAD | HEAD | HEAD |
+ |__:___|_(CBLKCNT)_|_________|_____|_________|__:___|____:_|
+ |<----- a big pcluster (with CBLKCNT) ------>|<-- -->|
+ a lcluster-sized pcluster (without CBLKCNT) ^
+
+If another HEAD follows a HEAD lcluster, there is no room to record CBLKCNT,
+but it's easy to know that the size of such a pcluster is 1 lcluster as well.
+
+Since Linux v6.1, each pcluster can be used for multiple variable-sized extents;
+therefore it can be used for compressed data deduplication.
diff --git a/Documentation/filesystems/ext2.rst b/Documentation/filesystems/ext2.rst
index d83dbbb162e2..92aae683e16a 100644
--- a/Documentation/filesystems/ext2.rst
+++ b/Documentation/filesystems/ext2.rst
@@ -1,6 +1,7 @@
.. SPDX-License-Identifier: GPL-2.0
+==============================
The Second Extended Filesystem
==============================
@@ -24,7 +25,7 @@ check=none, nocheck (*) Don't do extra checking of bitmaps on mount
(check=normal and check=strict options removed)
dax Use direct access (no page cache). See
- Documentation/filesystems/dax.txt.
+ Documentation/filesystems/dax.rst.
debug Extra debugging information is sent to the
kernel syslog. Useful for developers.
@@ -58,8 +59,6 @@ acl Enable POSIX Access Control Lists support
(requires CONFIG_EXT2_FS_POSIX_ACL).
noacl Don't support POSIX ACLs.
-nobh Do not attach buffer_heads to file pagecache.
-
quota, usrquota Enable user disk quota support
(requires CONFIG_QUOTA).
diff --git a/Documentation/filesystems/ext4/about.rst b/Documentation/filesystems/ext4/about.rst
index 0aadba052264..cc76b577d2f4 100644
--- a/Documentation/filesystems/ext4/about.rst
+++ b/Documentation/filesystems/ext4/about.rst
@@ -39,6 +39,6 @@ entry.
Other References
----------------
-Also see http://www.nongnu.org/ext2-doc/ for quite a collection of
+Also see https://www.nongnu.org/ext2-doc/ for quite a collection of
information about ext2/3. Here's another old reference:
http://wiki.osdev.org/Ext2
diff --git a/Documentation/filesystems/ext4/atomic_writes.rst b/Documentation/filesystems/ext4/atomic_writes.rst
new file mode 100644
index 000000000000..f65767df3620
--- /dev/null
+++ b/Documentation/filesystems/ext4/atomic_writes.rst
@@ -0,0 +1,225 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _atomic_writes:
+
+Atomic Block Writes
+-------------------------
+
+Introduction
+~~~~~~~~~~~~
+
+Atomic (untorn) block writes ensure that either the entire write is committed
+to disk or none of it is. This prevents "torn writes" during power loss or
+system crashes. The ext4 filesystem supports atomic writes (only with Direct
+I/O) on regular files with extents, provided the underlying storage device
+supports hardware atomic writes. This is supported in the following two ways:
+
+1. **Single-fsblock Atomic Writes**:
+ EXT4 supports atomic write operations on a single filesystem block since
+ v6.13. Here the atomic write unit minimum and maximum sizes are both set
+ to the filesystem block size; e.g. an atomic write of 16KB is possible
+ with a 16KB filesystem block size on a 64KB page-size system.
+
+2. **Multi-fsblock Atomic Writes with Bigalloc**:
+ EXT4 now also supports atomic writes spanning multiple filesystem blocks
+ using a feature known as bigalloc. The atomic write unit's minimum and
+ maximum sizes are determined by the filesystem block size and cluster size,
+ based on the underlying device’s supported atomic write unit limits.
+
+Requirements
+~~~~~~~~~~~~
+
+Basic requirements for atomic writes in ext4:
+
+ 1. The extents feature must be enabled (default for ext4)
+ 2. The underlying block device must support atomic writes
+ 3. For single-fsblock atomic writes:
+
+ 1. A filesystem with appropriate block size (up to the page size)
+
+ 4. For multi-fsblock atomic writes:
+
+ 1. The bigalloc feature must be enabled
+ 2. The cluster size must be appropriately configured
+
+NOTE: EXT4 does not support software or COW based atomic write, which means
+atomic writes on ext4 are only supported if underlying storage device supports
+it.
+
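+From userspace, an atomic write is requested by passing RWF_ATOMIC to
+pwritev2(2) on a file opened with O_DIRECT. A minimal sketch, assuming headers
+that define RWF_ATOMIC (Linux v6.11+ UAPI) and a size/offset that satisfy the
+advertised atomic write unit limits::
+
+  #define _GNU_SOURCE
+  #include <fcntl.h>
+  #include <stdlib.h>
+  #include <sys/uio.h>
+
+  static ssize_t atomic_write_16k(int fd, off_t off)
+  {
+          void *buf;
+          struct iovec iov;
+          ssize_t ret;
+
+          /* O_DIRECT requires suitably aligned memory */
+          if (posix_memalign(&buf, 16384, 16384))
+                  return -1;
+          iov.iov_base = buf;
+          iov.iov_len = 16384;
+
+          /* either all 16KiB reach the disk or none of them do */
+          ret = pwritev2(fd, &iov, 1, off, RWF_ATOMIC);
+          free(buf);
+          return ret;
+  }
+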
+Multi-fsblock Implementation Details
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The bigalloc feature changes ext4 to allocate in units of multiple filesystem
+blocks, also known as clusters. With bigalloc, each bit within the block bitmap
+represents a cluster (a power-of-2 number of blocks) rather than an individual
+filesystem block.
+
+EXT4 supports multi-fsblock atomic writes with bigalloc, subject to the
+following constraints. The minimum atomic write size is the larger of the fs
+block size and the minimum hardware atomic write unit; and the maximum atomic
+write size is the smaller of the bigalloc cluster size and the maximum hardware
+atomic write unit. Bigalloc ensures that all allocations are aligned to the
+cluster size, which satisfies the LBA alignment requirements of the hardware
+device if the start of the partition/logical volume is itself aligned correctly.
+
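+Expressed as code, the advertised limits follow directly from those
+constraints (a sketch of the rule, not the actual ext4 implementation)::
+
+  awu_min = max(fs_block_size, hw_atomic_write_unit_min);
+  awu_max = min(cluster_size, hw_atomic_write_unit_max);
+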
+Here is the block allocation strategy in bigalloc for atomic writes:
+
+ * For regions with fully mapped extents, no additional work is needed
+ * For append writes, a new mapped extent is allocated
+  * For regions that are entirely holes, an unwritten extent is created
+  * For large unwritten extents, the extent gets split into two unwritten
+    extents of the appropriate requested size
+  * For mixed mapping regions (combinations of holes, unwritten extents, or
+    mapped extents), ext4_map_blocks() is called in a loop with the
+    EXT4_GET_BLOCKS_ZERO flag to convert the region into a single contiguous
+    mapped extent by writing zeroes to it and converting any unwritten extents
+    to written, if found within the range.
+
+Note: Writing on a single contiguous underlying extent, whether mapped or
+unwritten, is not inherently problematic. However, writing to a mixed mapping
+region (i.e. one containing a combination of mapped and unwritten extents)
+must be avoided when performing atomic writes.
+
+The reason is that atomic writes, when issued via pwritev2() with the RWF_ATOMIC
+flag, require that either all data is written or none at all. In the event of
+a system crash or unexpected power loss during the write operation, the affected
+region (when later read) must reflect either the complete old data or the
+complete new data, but never a mix of both.
+
+To enforce this guarantee, we ensure that the write target is backed by
+a single, contiguous extent before any data is written. This is critical because
+ext4 defers the conversion of unwritten extents to written extents until the I/O
+completion path (typically in ->end_io()). If a write is allowed to proceed over
+a mixed mapping region (with mapped and unwritten extents) and a failure occurs
+mid-write, the system could observe partially updated regions after reboot, i.e.
+new data over mapped areas, and stale (old) data over unwritten extents that
+were never marked written. This violates the atomicity and/or torn write
+prevention guarantee.
+
+To prevent such torn writes, ext4 proactively allocates a single contiguous
+extent for the entire requested region in ``ext4_iomap_alloc`` via
+``ext4_map_blocks_atomic()``. EXT4 also force-commits the current journalling
+transaction in case the allocation is done over a mixed mapping. This ensures
+any pending metadata updates (like unwritten-to-written extent conversion) in
+this range are in a consistent state with the file data blocks before
+performing the actual write I/O. If the commit fails, the whole I/O must be
+aborted to prevent any possible torn writes.
+Only after this step is the actual data write performed by iomap.
+
+Handling Split Extents Across Leaf Blocks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There can be a special edge case where we have logically and physically
+contiguous extents stored in separate leaf nodes of the on-disk extent tree.
+This occurs because on-disk extent tree merges only happen within leaf
+blocks, except for the case of a 2-level tree, which can be merged and
+collapsed entirely into the inode.
+If such a layout exists and, in the worst case, the extent status cache entries
+are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return
+a single contiguous extent for these split leaf extents.
+
+To address this edge case, a new get block flag,
+``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS``, is added to enhance the
+``ext4_map_query_blocks()`` lookup behavior.
+
+This new get block flag allows ``ext4_map_blocks()`` to first check if there is
+an entry in the extent status cache for the full range.
+If not present, it consults the on-disk extent tree using
+``ext4_map_query_blocks()``.
+If the located extent is at the end of a leaf node, it probes the next logical
+block (lblk) to detect a contiguous extent in the adjacent leaf.
+
+For now, only one additional leaf block is queried to maintain efficiency, as
+atomic writes are typically constrained to small sizes
+(e.g. [blocksize, clustersize]).
+
+
+Handling Journal Transactions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To support multi-fsblock atomic writes, we ensure enough journal credits are
+reserved during:
+
+ 1. Block allocation time in ``ext4_iomap_alloc()``. We first query whether
+    there could be a mixed mapping for the underlying requested range. If so,
+    we reserve credits for up to ``m_len`` blocks, assuming every alternate
+    block can be an unwritten extent followed by a hole.
+
+ 2. During the ``->end_io()`` call, we make sure a single transaction is
+    started for doing the unwritten-to-written conversion. The loop for
+    conversion is mainly required to handle a split extent across leaf blocks.
+
+How to
+------
+
+Creating Filesystems with Atomic Write Support
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+First, check the atomic write units supported by the block device.
+See :ref:`atomic_write_bdev_support` for more details.
+
+For single-fsblock atomic writes with a larger block size
+(on systems with block size < page size):
+
+.. code-block:: bash
+
+ # Create an ext4 filesystem with a 16KB block size
+ # (requires page size >= 16KB)
+ mkfs.ext4 -b 16384 /dev/device
+
+For multi-fsblock atomic writes with bigalloc:
+
+.. code-block:: bash
+
+ # Create an ext4 filesystem with bigalloc and 64KB cluster size
+ mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device
+
+Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes,
+and ``-O bigalloc`` enables the bigalloc feature.
+
+Application Interface
+~~~~~~~~~~~~~~~~~~~~~
+
+Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag
+to perform atomic writes:
+
+.. code-block:: c
+
+ pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC);
+
+The write must be aligned to the filesystem's block size and not exceed the
+filesystem's maximum atomic write unit size.
+See ``generic_atomic_write_valid()`` for more details.
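+
+As a fuller illustration, here is a hypothetical sketch of a single 16KiB
+untorn write using Direct I/O. The file path, size, and offset are assumptions
+and must satisfy the limits reported by ``statx()``; building it requires
+kernel/libc headers recent enough to define ``RWF_ATOMIC``:
+
+.. code-block:: c
+
+    #define _GNU_SOURCE
+    #include <fcntl.h>
+    #include <stdlib.h>
+    #include <string.h>
+    #include <sys/uio.h>
+    #include <unistd.h>
+
+    int main(void)
+    {
+        size_t len = 16384;     /* must be within [awu_min, awu_max] */
+        struct iovec iov;
+        void *buf;
+
+        /* Placeholder path; ext4 atomic writes require O_DIRECT. */
+        int fd = open("/mnt/testfile", O_WRONLY | O_DIRECT);
+        if (fd < 0)
+            return 1;
+
+        /* Direct I/O needs an aligned buffer; align it to the write size. */
+        if (posix_memalign(&buf, len, len))
+            return 1;
+        memset(buf, 0xab, len);
+
+        iov.iov_base = buf;
+        iov.iov_len = len;
+
+        /* Offset 0 is naturally aligned; RWF_ATOMIC asks for an untorn write. */
+        if (pwritev2(fd, &iov, 1, 0, RWF_ATOMIC) != (ssize_t)len)
+            return 1;
+
+        free(buf);
+        close(fd);
+        return 0;
+    }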
+
+The ``statx()`` system call with the ``STATX_WRITE_ATOMIC`` flag provides the
+following details:
+
+ * ``stx_atomic_write_unit_min``: Minimum size of an atomic write request.
+ * ``stx_atomic_write_unit_max``: Maximum size of an atomic write request.
+  * ``stx_atomic_write_segments_max``: Upper limit on the number of segments,
+    i.e. the number of separate memory buffers that can be gathered into a
+    write operation (e.g., the iovcnt parameter for IOV_ITER). Currently,
+    this is always set to one.
+
+The STATX_ATTR_WRITE_ATOMIC flag in ``statx->attributes`` is set if atomic
+writes are supported.
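+
+For example, a minimal sketch that queries these limits (the path is a
+placeholder, and the ``stx_atomic_write_unit_*`` fields require kernel/libc
+headers recent enough to define ``STATX_WRITE_ATOMIC``):
+
+.. code-block:: c
+
+    #define _GNU_SOURCE
+    #include <fcntl.h>     /* AT_FDCWD */
+    #include <stdio.h>
+    #include <sys/stat.h>  /* statx() */
+
+    int main(void)
+    {
+        struct statx stx;
+
+        if (statx(AT_FDCWD, "/mnt/testfile", 0, STATX_WRITE_ATOMIC, &stx))
+            return 1;
+
+        if (stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC)
+            printf("untorn writes: min=%u max=%u segments=%u\n",
+                   stx.stx_atomic_write_unit_min,
+                   stx.stx_atomic_write_unit_max,
+                   stx.stx_atomic_write_segments_max);
+        return 0;
+    }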
+
+.. _atomic_write_bdev_support:
+
+Hardware Support
+----------------
+
+The underlying storage device must support atomic write operations.
+Modern NVMe and SCSI devices often provide this capability.
+The Linux kernel exposes this information through sysfs:
+
+* ``/sys/block/<device>/queue/atomic_write_unit_min`` - Minimum atomic write size
+* ``/sys/block/<device>/queue/atomic_write_unit_max`` - Maximum atomic write size
+
+Nonzero values for these attributes indicate that the device supports
+atomic writes.
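+
+A small sketch that reads these attributes (``sda`` is a placeholder device
+name; a shell ``cat`` of the same paths works equally well):
+
+.. code-block:: c
+
+    #include <stdio.h>
+
+    /* Returns the attribute value in bytes, or -1 on error. */
+    static long read_attr(const char *path)
+    {
+        long val = -1;
+        FILE *f = fopen(path, "r");
+
+        if (f) {
+            if (fscanf(f, "%ld", &val) != 1)
+                val = -1;
+            fclose(f);
+        }
+        return val;
+    }
+
+    int main(void)
+    {
+        printf("min=%ld max=%ld\n",
+               read_attr("/sys/block/sda/queue/atomic_write_unit_min"),
+               read_attr("/sys/block/sda/queue/atomic_write_unit_max"));
+        return 0;
+    }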
+
+See Also
+--------
+
+* :doc:`bigalloc` - Documentation on the bigalloc feature
+* :doc:`allocators` - Documentation on block allocation in ext4
+* Support for atomic block writes in 6.13:
+ https://lwn.net/Articles/1009298/
diff --git a/Documentation/filesystems/ext4/attributes.rst b/Documentation/filesystems/ext4/attributes.rst
index 54386a010a8d..87814696a65b 100644
--- a/Documentation/filesystems/ext4/attributes.rst
+++ b/Documentation/filesystems/ext4/attributes.rst
@@ -13,8 +13,8 @@ disappeared as of Linux 3.0.
There are two places where extended attributes can be found. The first
place is between the end of each inode entry and the beginning of the
-next inode entry. For example, if inode.i\_extra\_isize = 28 and
-sb.inode\_size = 256, then there are 256 - (128 + 28) = 100 bytes
+next inode entry. For example, if inode.i_extra_isize = 28 and
+sb.inode_size = 256, then there are 256 - (128 + 28) = 100 bytes
available for in-inode extended attribute storage. The second place
where extended attributes can be found is in the block pointed to by
``inode.i_file_acl``. As of Linux 3.11, it is not possible for this
@@ -38,8 +38,8 @@ Extended attributes, when stored after the inode, have a header
- Name
- Description
* - 0x0
- - \_\_le32
- - h\_magic
+ - __le32
+ - h_magic
- Magic number for identification, 0xEA020000. This value is set by the
Linux driver, though e2fsprogs doesn't seem to check it(?)
@@ -55,28 +55,28 @@ The beginning of an extended attribute block is in
- Name
- Description
* - 0x0
- - \_\_le32
- - h\_magic
+ - __le32
+ - h_magic
- Magic number for identification, 0xEA020000.
* - 0x4
- - \_\_le32
- - h\_refcount
+ - __le32
+ - h_refcount
- Reference count.
* - 0x8
- - \_\_le32
- - h\_blocks
+ - __le32
+ - h_blocks
- Number of disk blocks used.
* - 0xC
- - \_\_le32
- - h\_hash
+ - __le32
+ - h_hash
- Hash value of all attributes.
* - 0x10
- - \_\_le32
- - h\_checksum
+ - __le32
+ - h_checksum
- Checksum of the extended attribute block.
* - 0x14
- - \_\_u32
- - h\_reserved[2]
+ - __u32
+ - h_reserved[3]
- Zero.
The checksum is calculated against the FS UUID, the 64-bit block number
@@ -100,46 +100,46 @@ Attributes stored inside an inode do not need be stored in sorted order.
- Name
- Description
* - 0x0
- - \_\_u8
- - e\_name\_len
+ - __u8
+ - e_name_len
- Length of name.
* - 0x1
- - \_\_u8
- - e\_name\_index
+ - __u8
+ - e_name_index
- Attribute name index. There is a discussion of this below.
* - 0x2
- - \_\_le16
- - e\_value\_offs
+ - __le16
+ - e_value_offs
- Location of this attribute's value on the disk block where it is stored.
Multiple attributes can share the same value. For an inode attribute
this value is relative to the start of the first entry; for a block this
value is relative to the start of the block (i.e. the header).
* - 0x4
- - \_\_le32
- - e\_value\_inum
+ - __le32
+ - e_value_inum
- The inode where the value is stored. Zero indicates the value is in the
same block as this entry. This field is only used if the
- INCOMPAT\_EA\_INODE feature is enabled.
+ INCOMPAT_EA_INODE feature is enabled.
* - 0x8
- - \_\_le32
- - e\_value\_size
+ - __le32
+ - e_value_size
- Length of attribute value.
* - 0xC
- - \_\_le32
- - e\_hash
+ - __le32
+ - e_hash
- Hash value of attribute name and attribute value. The kernel doesn't
update the hash for in-inode attributes, so for that case this value
must be zero, because e2fsck validates any non-zero hash regardless of
where the xattr lives.
* - 0x10
- char
- - e\_name[e\_name\_len]
+ - e_name[e_name_len]
- Attribute name. Does not include trailing NULL.
Attribute values can follow the end of the entry table. There appears to
be a requirement that they be aligned to 4-byte boundaries. The values
are stored starting at the end of the block and grow towards the
-xattr\_header/xattr\_entry table. When the two collide, the overflow is
+xattr_header/xattr_entry table. When the two collide, the overflow is
put into a separate disk block. If the disk block fills up, the
filesystem returns -ENOSPC.
@@ -167,15 +167,15 @@ the key name. Here is a map of name index values to key prefixes:
* - 1
- “user.”
* - 2
- - “system.posix\_acl\_access”
+ - “system.posix_acl_access”
* - 3
- - “system.posix\_acl\_default”
+ - “system.posix_acl_default”
* - 4
- “trusted.”
* - 6
- “security.”
* - 7
- - “system.” (inline\_data only?)
+ - “system.” (inline_data only?)
* - 8
- “system.richacl” (SuSE kernels only?)
diff --git a/Documentation/filesystems/ext4/bigalloc.rst b/Documentation/filesystems/ext4/bigalloc.rst
index 72075aa608e4..976a180b209c 100644
--- a/Documentation/filesystems/ext4/bigalloc.rst
+++ b/Documentation/filesystems/ext4/bigalloc.rst
@@ -23,7 +23,7 @@ means that a block group addresses 32 gigabytes instead of 128 megabytes,
also shrinking the amount of file system overhead for metadata.
The administrator can set a block cluster size at mkfs time (which is
-stored in the s\_log\_cluster\_size field in the superblock); from then
+stored in the s_log_cluster_size field in the superblock); from then
on, the block bitmaps track clusters, not individual blocks. This means
that block groups can be several gigabytes in size (instead of just
128MiB); however, the minimum allocation unit becomes a cluster, not a
diff --git a/Documentation/filesystems/ext4/bitmaps.rst b/Documentation/filesystems/ext4/bitmaps.rst
index c7546dbc197a..91c45d86e9bb 100644
--- a/Documentation/filesystems/ext4/bitmaps.rst
+++ b/Documentation/filesystems/ext4/bitmaps.rst
@@ -9,15 +9,15 @@ group.
The inode bitmap records which entries in the inode table are in use.
As with most bitmaps, one bit represents the usage status of one data
-block or inode table entry. This implies a block group size of 8 \*
-number\_of\_bytes\_in\_a\_logical\_block.
+block or inode table entry. This implies a block group size of 8 *
+number_of_bytes_in_a_logical_block.
NOTE: If ``BLOCK_UNINIT`` is set for a given block group, various parts
of the kernel and e2fsprogs code pretends that the block bitmap contains
zeros (i.e. all blocks in the group are free). However, it is not
necessarily the case that no blocks are in use -- if ``meta_bg`` is set,
the bitmaps and group descriptor live inside the group. Unfortunately,
-ext2fs\_test\_block\_bitmap2() will return '0' for those locations,
+ext2fs_test_block_bitmap2() will return '0' for those locations,
which produces confusing debugfs output.
Inode Table
diff --git a/Documentation/filesystems/ext4/blockgroup.rst b/Documentation/filesystems/ext4/blockgroup.rst
index 3da156633339..ed5a5cac6d40 100644
--- a/Documentation/filesystems/ext4/blockgroup.rst
+++ b/Documentation/filesystems/ext4/blockgroup.rst
@@ -56,39 +56,39 @@ established that the super block and the group descriptor table, if
present, will be at the beginning of the block group. The bitmaps and
the inode table can be anywhere, and it is quite possible for the
bitmaps to come after the inode table, or for both to be in different
-groups (flex\_bg). Leftover space is used for file data blocks, indirect
+groups (flex_bg). Leftover space is used for file data blocks, indirect
block maps, extent tree blocks, and extended attributes.
Flexible Block Groups
---------------------
Starting in ext4, there is a new feature called flexible block groups
-(flex\_bg). In a flex\_bg, several block groups are tied together as one
+(flex_bg). In a flex_bg, several block groups are tied together as one
logical block group; the bitmap spaces and the inode table space in the
-first block group of the flex\_bg are expanded to include the bitmaps
-and inode tables of all other block groups in the flex\_bg. For example,
-if the flex\_bg size is 4, then group 0 will contain (in order) the
+first block group of the flex_bg are expanded to include the bitmaps
+and inode tables of all other block groups in the flex_bg. For example,
+if the flex_bg size is 4, then group 0 will contain (in order) the
superblock, group descriptors, data block bitmaps for groups 0-3, inode
bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining
space in group 0 is for file data. The effect of this is to group the
block group metadata close together for faster loading, and to enable
large files to be continuous on disk. Backup copies of the superblock
and group descriptors are always at the beginning of block groups, even
-if flex\_bg is enabled. The number of block groups that make up a
-flex\_bg is given by 2 ^ ``sb.s_log_groups_per_flex``.
+if flex_bg is enabled. The number of block groups that make up a
+flex_bg is given by 2 ^ ``sb.s_log_groups_per_flex``.
Meta Block Groups
-----------------
-Without the option META\_BG, for safety concerns, all block group
+Without the option META_BG, for safety concerns, all block group
descriptors copies are kept in the first block group. Given the default
128MiB(2^27 bytes) block group size and 64-byte group descriptors, ext4
can have at most 2^27/64 = 2^21 block groups. This limits the entire
-filesystem size to 2^21 ∗ 2^27 = 2^48bytes or 256TiB.
+filesystem size to 2^21 * 2^27 = 2^48bytes or 256TiB.
The solution to this problem is to use the metablock group feature
-(META\_BG), which is already in ext3 for all 2.6 releases. With the
-META\_BG feature, ext4 filesystems are partitioned into many metablock
+(META_BG), which is already in ext3 for all 2.6 releases. With the
+META_BG feature, ext4 filesystems are partitioned into many metablock
groups. Each metablock group is a cluster of block groups whose group
descriptor structures can be stored in a single disk block. For ext4
filesystems with 4 KB block size, a single metablock group partition
@@ -105,12 +105,12 @@ descriptors. Instead, the superblock and a single block group descriptor
block is placed at the beginning of the first, second, and last block
groups in a meta-block group. A meta-block group is a collection of
block groups which can be described by a single block group descriptor
-block. Since the size of the block group descriptor structure is 32
-bytes, a meta-block group contains 32 block groups for filesystems with
-a 1KB block size, and 128 block groups for filesystems with a 4KB
+block. Since the size of the block group descriptor structure is 64
+bytes, a meta-block group contains 16 block groups for filesystems with
+a 1KB block size, and 64 block groups for filesystems with a 4KB
blocksize. Filesystems can either be created using this new block group
descriptor layout, or existing filesystems can be resized on-line, and
-the field s\_first\_meta\_bg in the superblock will indicate the first
+the field s_first_meta_bg in the superblock will indicate the first
block group using this new layout.
Please see an important note about ``BLOCK_UNINIT`` in the section about
@@ -121,15 +121,15 @@ Lazy Block Group Initialization
A new feature for ext4 are three block group descriptor flags that
enable mkfs to skip initializing other parts of the block group
-metadata. Specifically, the INODE\_UNINIT and BLOCK\_UNINIT flags mean
+metadata. Specifically, the INODE_UNINIT and BLOCK_UNINIT flags mean
that the inode and block bitmaps for that group can be calculated and
therefore the on-disk bitmap blocks are not initialized. This is
generally the case for an empty block group or a block group containing
-only fixed-location block group metadata. The INODE\_ZEROED flag means
+only fixed-location block group metadata. The INODE_ZEROED flag means
that the inode table has been initialized; mkfs will unset this flag and
rely on the kernel to initialize the inode tables in the background.
By not writing zeroes to the bitmaps and inode table, mkfs time is
-reduced considerably. Note the feature flag is RO\_COMPAT\_GDT\_CSUM,
-but the dumpe2fs output prints this as “uninit\_bg”. They are the same
+reduced considerably. Note the feature flag is RO_COMPAT_GDT_CSUM,
+but the dumpe2fs output prints this as “uninit_bg”. They are the same
thing.
diff --git a/Documentation/filesystems/ext4/blockmap.rst b/Documentation/filesystems/ext4/blockmap.rst
index 30e25750d88a..cc596541ce79 100644
--- a/Documentation/filesystems/ext4/blockmap.rst
+++ b/Documentation/filesystems/ext4/blockmap.rst
@@ -1,7 +1,7 @@
.. SPDX-License-Identifier: GPL-2.0
+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| i.i\_block Offset | Where It Points |
+| i.i_block Offset | Where It Points |
+=====================+==============================================================================================================================================================================================================================+
| 0 to 11 | Direct map to file blocks 0 to 11. |
+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
diff --git a/Documentation/filesystems/ext4/blocks.rst b/Documentation/filesystems/ext4/blocks.rst
index bd722ecd92d6..b0f80ea87c90 100644
--- a/Documentation/filesystems/ext4/blocks.rst
+++ b/Documentation/filesystems/ext4/blocks.rst
@@ -39,7 +39,7 @@ For 32-bit filesystems, limits are as follows:
- 4TiB
- 8TiB
- 16TiB
- - 256PiB
+ - 256TiB
* - Blocks Per Block Group
- 8,192
- 16,384
diff --git a/Documentation/filesystems/ext4/checksums.rst b/Documentation/filesystems/ext4/checksums.rst
index 5519e253810d..e232749daf5f 100644
--- a/Documentation/filesystems/ext4/checksums.rst
+++ b/Documentation/filesystems/ext4/checksums.rst
@@ -4,7 +4,7 @@ Checksums
---------
Starting in early 2012, metadata checksums were added to all major ext4
-and jbd2 data structures. The associated feature flag is metadata\_csum.
+and jbd2 data structures. The associated feature flag is metadata_csum.
The desired checksum algorithm is indicated in the superblock, though as
of October 2012 the only supported algorithm is crc32c. Some data
structures did not have space to fit a full 32-bit checksum, so only the
@@ -20,7 +20,7 @@ encounters directory blocks that lack sufficient empty space to add a
checksum, it will request that you run ``e2fsck -D`` to have the
directories rebuilt with checksums. This has the added benefit of
removing slack space from the directory files and rebalancing the htree
-indexes. If you \_ignore\_ this step, your directories will not be
+indexes. If you _ignore_ this step, your directories will not be
protected by a checksum!
The following table describes the data elements that go into each type
@@ -35,39 +35,39 @@ of checksum. The checksum function is whatever the superblock describes
- Length
- Ingredients
* - Superblock
- - \_\_le32
+ - __le32
- The entire superblock up to the checksum field. The UUID lives inside
the superblock.
* - MMP
- - \_\_le32
+ - __le32
- UUID + the entire MMP block up to the checksum field.
* - Extended Attributes
- - \_\_le32
+ - __le32
- UUID + the entire extended attribute block. The checksum field is set to
zero.
* - Directory Entries
- - \_\_le32
+ - __le32
- UUID + inode number + inode generation + the directory block up to the
fake entry enclosing the checksum field.
* - HTREE Nodes
- - \_\_le32
+ - __le32
- UUID + inode number + inode generation + all valid extents + HTREE tail.
The checksum field is set to zero.
* - Extents
- - \_\_le32
+ - __le32
- UUID + inode number + inode generation + the entire extent block up to
the checksum field.
* - Bitmaps
- - \_\_le32 or \_\_le16
+ - __le32 or __le16
- UUID + the entire bitmap. Checksums are stored in the group descriptor,
and truncated if the group descriptor size is 32 bytes (i.e. ^64bit)
* - Inodes
- - \_\_le32
+ - __le32
- UUID + inode number + inode generation + the entire inode. The checksum
field is set to zero. Each inode has its own checksum.
* - Group Descriptors
- - \_\_le16
- - If metadata\_csum, then UUID + group number + the entire descriptor;
- else if gdt\_csum, then crc16(UUID + group number + the entire
+ - __le16
+ - If metadata_csum, then UUID + group number + the entire descriptor;
+ else if gdt_csum, then crc16(UUID + group number + the entire
descriptor). In all cases, only the lower 16 bits are stored.
diff --git a/Documentation/filesystems/ext4/directory.rst b/Documentation/filesystems/ext4/directory.rst
index 073940cc64ed..6eece8e31df8 100644
--- a/Documentation/filesystems/ext4/directory.rst
+++ b/Documentation/filesystems/ext4/directory.rst
@@ -42,24 +42,24 @@ is at most 263 bytes long, though on disk you'll need to reference
- Name
- Description
* - 0x0
- - \_\_le32
+ - __le32
- inode
- Number of the inode that this directory entry points to.
* - 0x4
- - \_\_le16
- - rec\_len
+ - __le16
+ - rec_len
- Length of this directory entry. Must be a multiple of 4.
* - 0x6
- - \_\_le16
- - name\_len
+ - __le16
+ - name_len
- Length of the file name.
* - 0x8
- char
- - name[EXT4\_NAME\_LEN]
+ - name[EXT4_NAME_LEN]
- File name.
Since file names cannot be longer than 255 bytes, the new directory
-entry format shortens the name\_len field and uses the space for a file
+entry format shortens the name_len field and uses the space for a file
type flag, probably to avoid having to load every inode during directory
tree traversal. This format is ``ext4_dir_entry_2``, which is at most
263 bytes long, though on disk you'll need to reference
@@ -74,24 +74,24 @@ tree traversal. This format is ``ext4_dir_entry_2``, which is at most
- Name
- Description
* - 0x0
- - \_\_le32
+ - __le32
- inode
- Number of the inode that this directory entry points to.
* - 0x4
- - \_\_le16
- - rec\_len
+ - __le16
+ - rec_len
- Length of this directory entry.
* - 0x6
- - \_\_u8
- - name\_len
+ - __u8
+ - name_len
- Length of the file name.
* - 0x7
- - \_\_u8
- - file\_type
+ - __u8
+ - file_type
- File type code, see ftype_ table below.
* - 0x8
- char
- - name[EXT4\_NAME\_LEN]
+ - name[EXT4_NAME_LEN]
- File name.
.. _ftype:
@@ -121,10 +121,35 @@ The directory file type is one of the following values:
* - 0x7
- Symbolic link.
+To support directories that are both encrypted and casefolded, we must also
+include hash information in the directory entry. We append
+``ext4_extended_dir_entry_2`` to ``ext4_dir_entry_2`` except for the entries
+for dot and dotdot, which are kept the same. The structure follows immediately
+after ``name`` and is included in the size listed by ``rec_len``. If a
+directory entry uses this extension, it may be up to 271 bytes.
+
+.. list-table::
+ :widths: 8 8 24 40
+ :header-rows: 1
+
+ * - Offset
+ - Size
+ - Name
+ - Description
+ * - 0x0
+ - __le32
+ - hash
+ - The hash of the directory name
+ * - 0x4
+ - __le32
+ - minor_hash
+ - The minor hash of the directory name
+
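+A minimal C rendering of this appended structure (field names follow the table
+above; a sketch of the on-disk layout, not necessarily the kernel's exact
+definition):
+
+.. code-block:: c
+
+    #include <stdint.h>
+
+    /* Appended after `name` in ext4_dir_entry_2 for directories that are
+     * both encrypted and casefolded; fields are little-endian on disk. */
+    struct ext4_extended_dir_entry_2 {
+        uint32_t hash;        /* hash of the directory entry name */
+        uint32_t minor_hash;  /* minor hash of the directory entry name */
+    };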
+
In order to add checksums to these classic directory blocks, a phony
``struct ext4_dir_entry`` is placed at the end of each leaf block to
hold the checksum. The directory entry is 12 bytes long. The inode
-number and name\_len fields are set to zero to fool old software into
+number and name_len fields are set to zero to fool old software into
ignoring an apparently empty directory entry, and the checksum is stored
in the place where the name normally goes. The structure is
``struct ext4_dir_entry_tail``:
@@ -138,24 +163,24 @@ in the place where the name normally goes. The structure is
- Name
- Description
* - 0x0
- - \_\_le32
- - det\_reserved\_zero1
+ - __le32
+ - det_reserved_zero1
- Inode number, which must be zero.
* - 0x4
- - \_\_le16
- - det\_rec\_len
+ - __le16
+ - det_rec_len
- Length of this directory entry, which must be 12.
* - 0x6
- - \_\_u8
- - det\_reserved\_zero2
+ - __u8
+ - det_reserved_zero2
- Length of the file name, which must be zero.
* - 0x7
- - \_\_u8
- - det\_reserved\_ft
+ - __u8
+ - det_reserved_ft
- File type, which must be 0xDE.
* - 0x8
- - \_\_le32
- - det\_checksum
+ - __le32
+ - det_checksum
- Directory leaf block checksum.
The leaf directory block checksum is calculated against the FS UUID, the
@@ -169,7 +194,7 @@ Hash Tree Directories
A linear array of directory entries isn't great for performance, so a
new feature was added to ext3 to provide a faster (but peculiar)
balanced tree keyed off a hash of the directory entry name. If the
-EXT4\_INDEX\_FL (0x1000) flag is set in the inode, this directory uses a
+EXT4_INDEX_FL (0x1000) flag is set in the inode, this directory uses a
hashed btree (htree) to organize and find directory entries. For
backwards read-only compatibility with ext2, this tree is actually
hidden inside the directory file, masquerading as “empty” directory data
@@ -181,14 +206,14 @@ rest of the directory block is empty so that it moves on.
The root of the tree always lives in the first data block of the
directory. By ext2 custom, the '.' and '..' entries must appear at the
beginning of this first block, so they are put here as two
-``struct ext4_dir_entry_2``\ s and not stored in the tree. The rest of
+``struct ext4_dir_entry_2`` entries and not stored in the tree. The rest of
the root node contains metadata about the tree and finally a hash->block
map to find nodes that are lower in the htree. If
``dx_root.info.indirect_levels`` is non-zero then the htree has two
levels; the data block pointed to by the root node's map is an interior
node, which is indexed by a minor hash. Interior nodes in this tree
contains a zeroed out ``struct ext4_dir_entry_2`` followed by a
-minor\_hash->block map to find leafe nodes. Leaf nodes contain a linear
+minor_hash->block map to find leaf nodes. Leaf nodes contain a linear
array of all ``struct ext4_dir_entry_2``; all of these entries
(presumably) hash to the same value. If there is an overflow, the
entries simply overflow into the next leaf node, and the
@@ -220,83 +245,83 @@ of a data block:
- Name
- Description
* - 0x0
- - \_\_le32
+ - __le32
- dot.inode
- inode number of this directory.
* - 0x4
- - \_\_le16
- - dot.rec\_len
+ - __le16
+ - dot.rec_len
- Length of this record, 12.
* - 0x6
- u8
- - dot.name\_len
+ - dot.name_len
- Length of the name, 1.
* - 0x7
- u8
- - dot.file\_type
+ - dot.file_type
- File type of this entry, 0x2 (directory) (if the feature flag is set).
* - 0x8
- char
- dot.name[4]
- - “.\\0\\0\\0”
+ - “.\0\0\0”
* - 0xC
- - \_\_le32
+ - __le32
- dotdot.inode
- inode number of parent directory.
* - 0x10
- - \_\_le16
- - dotdot.rec\_len
- - block\_size - 12. The record length is long enough to cover all htree
+ - __le16
+ - dotdot.rec_len
+ - block_size - 12. The record length is long enough to cover all htree
data.
* - 0x12
- u8
- - dotdot.name\_len
+ - dotdot.name_len
- Length of the name, 2.
* - 0x13
- u8
- - dotdot.file\_type
+ - dotdot.file_type
- File type of this entry, 0x2 (directory) (if the feature flag is set).
* - 0x14
- char
- - dotdot\_name[4]
- - “..\\0\\0”
+ - dotdot_name[4]
+ - “..\0\0”
* - 0x18
- - \_\_le32
- - struct dx\_root\_info.reserved\_zero
+ - __le32
+ - struct dx_root_info.reserved_zero
- Zero.
* - 0x1C
- u8
- - struct dx\_root\_info.hash\_version
+ - struct dx_root_info.hash_version
- Hash type, see dirhash_ table below.
* - 0x1D
- u8
- - struct dx\_root\_info.info\_length
+ - struct dx_root_info.info_length
- Length of the tree information, 0x8.
* - 0x1E
- u8
- - struct dx\_root\_info.indirect\_levels
- - Depth of the htree. Cannot be larger than 3 if the INCOMPAT\_LARGEDIR
+ - struct dx_root_info.indirect_levels
+ - Depth of the htree. Cannot be larger than 3 if the INCOMPAT_LARGEDIR
feature is set; cannot be larger than 2 otherwise.
* - 0x1F
- u8
- - struct dx\_root\_info.unused\_flags
+ - struct dx_root_info.unused_flags
-
* - 0x20
- - \_\_le16
+ - __le16
- limit
- - Maximum number of dx\_entries that can follow this header, plus 1 for
+ - Maximum number of dx_entries that can follow this header, plus 1 for
the header itself.
* - 0x22
- - \_\_le16
+ - __le16
- count
- - Actual number of dx\_entries that follow this header, plus 1 for the
+ - Actual number of dx_entries that follow this header, plus 1 for the
header itself.
* - 0x24
- - \_\_le32
+ - __le32
- block
- The block number (within the directory file) that goes with hash=0.
* - 0x28
- - struct dx\_entry
+ - struct dx_entry
- entries[0]
- As many 8-byte ``struct dx_entry`` as fits in the rest of the data block.
@@ -322,6 +347,8 @@ The directory hash is one of the following values:
- Half MD4, unsigned.
* - 0x5
- Tea, unsigned.
+ * - 0x6
+ - Siphash.
Interior nodes of an htree are recorded as ``struct dx_node``, which is
also the full length of a data block:
@@ -335,38 +362,38 @@ also the full length of a data block:
- Name
- Description
* - 0x0
- - \_\_le32
+ - __le32
- fake.inode
- Zero, to make it look like this entry is not in use.
* - 0x4
- - \_\_le16
- - fake.rec\_len
- - The size of the block, in order to hide all of the dx\_node data.
+ - __le16
+ - fake.rec_len
+ - The size of the block, in order to hide all of the dx_node data.
* - 0x6
- u8
- - name\_len
+ - name_len
- Zero. There is no name for this “unused” directory entry.
* - 0x7
- u8
- - file\_type
+ - file_type
- Zero. There is no file type for this “unused” directory entry.
* - 0x8
- - \_\_le16
+ - __le16
- limit
- - Maximum number of dx\_entries that can follow this header, plus 1 for
+ - Maximum number of dx_entries that can follow this header, plus 1 for
the header itself.
* - 0xA
- - \_\_le16
+ - __le16
- count
- - Actual number of dx\_entries that follow this header, plus 1 for the
+ - Actual number of dx_entries that follow this header, plus 1 for the
header itself.
* - 0xE
- - \_\_le32
+ - __le32
- block
- The block number (within the directory file) that goes with the lowest
hash value of this block. This value is stored in the parent block.
* - 0x12
- - struct dx\_entry
+ - struct dx_entry
- entries[0]
- As many 8-byte ``struct dx_entry`` as fits in the rest of the data block.
@@ -383,11 +410,11 @@ long:
- Name
- Description
* - 0x0
- - \_\_le32
+ - __le32
- hash
- Hash code.
* - 0x4
- - \_\_le32
+ - __le32
- block
- Block number (within the directory file, not filesystem blocks) of the
next node in the htree.
@@ -396,13 +423,13 @@ long:
author.)
If metadata checksums are enabled, the last 8 bytes of the directory
-block (precisely the length of one dx\_entry) are used to store a
+block (precisely the length of one dx_entry) are used to store a
``struct dx_tail``, which contains the checksum. The ``limit`` and
-``count`` entries in the dx\_root/dx\_node structures are adjusted as
-necessary to fit the dx\_tail into the block. If there is no space for
-the dx\_tail, the user is notified to run e2fsck -D to rebuild the
+``count`` entries in the dx_root/dx_node structures are adjusted as
+necessary to fit the dx_tail into the block. If there is no space for
+the dx_tail, the user is notified to run e2fsck -D to rebuild the
 directory index (which will ensure that there's space for the checksum).
-The dx\_tail structure is 8 bytes long and looks like this:
+The dx_tail structure is 8 bytes long and looks like this:
.. list-table::
:widths: 8 8 24 40
@@ -414,13 +441,13 @@ The dx\_tail structure is 8 bytes long and looks like this:
- Description
* - 0x0
- u32
- - dt\_reserved
+ - dt_reserved
- Zero.
* - 0x4
- - \_\_le32
- - dt\_checksum
+ - __le32
+ - dt_checksum
- Checksum of the htree directory block.
The checksum is calculated against the FS UUID, the htree index header
-(dx\_root or dx\_node), all of the htree indices (dx\_entry) that are in
-use, and the tail block (dx\_tail).
+(dx_root or dx_node), all of the htree indices (dx_entry) that are in
+use, and the tail block (dx_tail).
diff --git a/Documentation/filesystems/ext4/eainode.rst b/Documentation/filesystems/ext4/eainode.rst
index ecc0d01a0a72..7a2ef26b064a 100644
--- a/Documentation/filesystems/ext4/eainode.rst
+++ b/Documentation/filesystems/ext4/eainode.rst
@@ -5,14 +5,14 @@ Large Extended Attribute Values
To enable ext4 to store extended attribute values that do not fit in the
inode or in the single extended attribute block attached to an inode,
-the EA\_INODE feature allows us to store the value in the data blocks of
+the EA_INODE feature allows us to store the value in the data blocks of
a regular file inode. This “EA inode” is linked only from the extended
attribute name index and must not appear in a directory entry. The
-inode's i\_atime field is used to store a checksum of the xattr value;
-and i\_ctime/i\_version store a 64-bit reference count, which enables
+inode's i_atime field is used to store a checksum of the xattr value;
+and i_ctime/i_version store a 64-bit reference count, which enables
sharing of large xattr values between multiple owning inodes. For
backward compatibility with older versions of this feature, the
-i\_mtime/i\_generation *may* store a back-reference to the inode number
-and i\_generation of the **one** owning inode (in cases where the EA
+i_mtime/i_generation *may* store a back-reference to the inode number
+and i_generation of the **one** owning inode (in cases where the EA
inode is not referenced by multiple inodes) to verify that the EA inode
is the correct one being accessed.
diff --git a/Documentation/filesystems/ext4/globals.rst b/Documentation/filesystems/ext4/globals.rst
index 368bf7662b96..b17418974fd3 100644
--- a/Documentation/filesystems/ext4/globals.rst
+++ b/Documentation/filesystems/ext4/globals.rst
@@ -11,3 +11,4 @@ have static metadata at fixed locations.
.. include:: bitmaps.rst
.. include:: mmp.rst
.. include:: journal.rst
+.. include:: orphan.rst
diff --git a/Documentation/filesystems/ext4/group_descr.rst b/Documentation/filesystems/ext4/group_descr.rst
index 7ba6114e7f5c..392ec44f8fb0 100644
--- a/Documentation/filesystems/ext4/group_descr.rst
+++ b/Documentation/filesystems/ext4/group_descr.rst
@@ -7,34 +7,34 @@ Each block group on the filesystem has one of these descriptors
associated with it. As noted in the Layout section above, the group
descriptors (if present) are the second item in the block group. The
standard configuration is for each block group to contain a full copy of
-the block group descriptor table unless the sparse\_super feature flag
+the block group descriptor table unless the sparse_super feature flag
is set.
Notice how the group descriptor records the location of both bitmaps and
the inode table (i.e. they can float). This means that within a block
group, the only data structures with fixed locations are the superblock
-and the group descriptor table. The flex\_bg mechanism uses this
+and the group descriptor table. The flex_bg mechanism uses this
property to group several block groups into a flex group and lay out all
of the groups' bitmaps and inode tables into one long run in the first
group of the flex group.
-If the meta\_bg feature flag is set, then several block groups are
-grouped together into a meta group. Note that in the meta\_bg case,
+If the meta_bg feature flag is set, then several block groups are
+grouped together into a meta group. Note that in the meta_bg case,
however, the first and last two block groups within the larger meta
group contain only group descriptors for the groups inside the meta
group.
-flex\_bg and meta\_bg do not appear to be mutually exclusive features.
+flex_bg and meta_bg do not appear to be mutually exclusive features.
In ext2, ext3, and ext4 (when the 64bit feature is not enabled), the
block group descriptor was only 32 bytes long and therefore ends at
-bg\_checksum. On an ext4 filesystem with the 64bit feature enabled, the
+bg_checksum. On an ext4 filesystem with the 64bit feature enabled, the
block group descriptor expands to at least the 64 bytes described below;
the size is stored in the superblock.
-If gdt\_csum is set and metadata\_csum is not set, the block group
+If gdt_csum is set and metadata_csum is not set, the block group
checksum is the crc16 of the FS UUID, the group number, and the group
-descriptor structure. If metadata\_csum is set, then the block group
+descriptor structure. If metadata_csum is set, then the block group
checksum is the lower 16 bits of the checksum of the FS UUID, the group
number, and the group descriptor structure. Both block and inode bitmap
checksums are calculated against the FS UUID, the group number, and the
@@ -51,59 +51,59 @@ The block group descriptor is laid out in ``struct ext4_group_desc``.
- Name
- Description
* - 0x0
- - \_\_le32
- - bg\_block\_bitmap\_lo
+ - __le32
+ - bg_block_bitmap_lo
- Lower 32-bits of location of block bitmap.
* - 0x4
- - \_\_le32
- - bg\_inode\_bitmap\_lo
+ - __le32
+ - bg_inode_bitmap_lo
- Lower 32-bits of location of inode bitmap.
* - 0x8
- - \_\_le32
- - bg\_inode\_table\_lo
+ - __le32
+ - bg_inode_table_lo
- Lower 32-bits of location of inode table.
* - 0xC
- - \_\_le16
- - bg\_free\_blocks\_count\_lo
+ - __le16
+ - bg_free_blocks_count_lo
- Lower 16-bits of free block count.
* - 0xE
- - \_\_le16
- - bg\_free\_inodes\_count\_lo
+ - __le16
+ - bg_free_inodes_count_lo
- Lower 16-bits of free inode count.
* - 0x10
- - \_\_le16
- - bg\_used\_dirs\_count\_lo
+ - __le16
+ - bg_used_dirs_count_lo
- Lower 16-bits of directory count.
* - 0x12
- - \_\_le16
- - bg\_flags
+ - __le16
+ - bg_flags
- Block group flags. See the bgflags_ table below.
* - 0x14
- - \_\_le32
- - bg\_exclude\_bitmap\_lo
+ - __le32
+ - bg_exclude_bitmap_lo
- Lower 32-bits of location of snapshot exclusion bitmap.
* - 0x18
- - \_\_le16
- - bg\_block\_bitmap\_csum\_lo
+ - __le16
+ - bg_block_bitmap_csum_lo
- Lower 16-bits of the block bitmap checksum.
* - 0x1A
- - \_\_le16
- - bg\_inode\_bitmap\_csum\_lo
+ - __le16
+ - bg_inode_bitmap_csum_lo
- Lower 16-bits of the inode bitmap checksum.
* - 0x1C
- - \_\_le16
- - bg\_itable\_unused\_lo
+ - __le16
+ - bg_itable_unused_lo
- Lower 16-bits of unused inode count. If set, we needn't scan past the
- ``(sb.s_inodes_per_group - gdt.bg_itable_unused)``\ th entry in the
+ ``(sb.s_inodes_per_group - gdt.bg_itable_unused)`` th entry in the
inode table for this group.
* - 0x1E
- - \_\_le16
- - bg\_checksum
- - Group descriptor checksum; crc16(sb\_uuid+group\_num+bg\_desc) if the
- RO\_COMPAT\_GDT\_CSUM feature is set, or
- crc32c(sb\_uuid+group\_num+bg\_desc) & 0xFFFF if the
- RO\_COMPAT\_METADATA\_CSUM feature is set. The bg\_checksum
- field in bg\_desc is skipped when calculating crc16 checksum,
+ - __le16
+ - bg_checksum
+ - Group descriptor checksum; crc16(sb_uuid+group_num+bg_desc) if the
+ RO_COMPAT_GDT_CSUM feature is set, or
+ crc32c(sb_uuid+group_num+bg_desc) & 0xFFFF if the
+ RO_COMPAT_METADATA_CSUM feature is set. The bg_checksum
+ field in bg_desc is skipped when calculating crc16 checksum,
and set to zero if crc32c checksum is used.
* -
-
@@ -111,48 +111,48 @@ The block group descriptor is laid out in ``struct ext4_group_desc``.
- These fields only exist if the 64bit feature is enabled and s_desc_size
> 32.
* - 0x20
- - \_\_le32
- - bg\_block\_bitmap\_hi
+ - __le32
+ - bg_block_bitmap_hi
- Upper 32-bits of location of block bitmap.
* - 0x24
- - \_\_le32
- - bg\_inode\_bitmap\_hi
+ - __le32
+ - bg_inode_bitmap_hi
- Upper 32-bits of location of inodes bitmap.
* - 0x28
- - \_\_le32
- - bg\_inode\_table\_hi
+ - __le32
+ - bg_inode_table_hi
- Upper 32-bits of location of inodes table.
* - 0x2C
- - \_\_le16
- - bg\_free\_blocks\_count\_hi
+ - __le16
+ - bg_free_blocks_count_hi
- Upper 16-bits of free block count.
* - 0x2E
- - \_\_le16
- - bg\_free\_inodes\_count\_hi
+ - __le16
+ - bg_free_inodes_count_hi
- Upper 16-bits of free inode count.
* - 0x30
- - \_\_le16
- - bg\_used\_dirs\_count\_hi
+ - __le16
+ - bg_used_dirs_count_hi
- Upper 16-bits of directory count.
* - 0x32
- - \_\_le16
- - bg\_itable\_unused\_hi
+ - __le16
+ - bg_itable_unused_hi
- Upper 16-bits of unused inode count.
* - 0x34
- - \_\_le32
- - bg\_exclude\_bitmap\_hi
+ - __le32
+ - bg_exclude_bitmap_hi
- Upper 32-bits of location of snapshot exclusion bitmap.
* - 0x38
- - \_\_le16
- - bg\_block\_bitmap\_csum\_hi
+ - __le16
+ - bg_block_bitmap_csum_hi
- Upper 16-bits of the block bitmap checksum.
* - 0x3A
- - \_\_le16
- - bg\_inode\_bitmap\_csum\_hi
+ - __le16
+ - bg_inode_bitmap_csum_hi
- Upper 16-bits of the inode bitmap checksum.
* - 0x3C
- - \_\_u32
- - bg\_reserved
+ - __u32
+ - bg_reserved
- Padding to 64 bytes.
.. _bgflags:
@@ -166,8 +166,8 @@ Block group flags can be any combination of the following:
* - Value
- Description
* - 0x1
- - inode table and bitmap are not initialized (EXT4\_BG\_INODE\_UNINIT).
+ - inode table and bitmap are not initialized (EXT4_BG_INODE_UNINIT).
* - 0x2
- - block bitmap is not initialized (EXT4\_BG\_BLOCK\_UNINIT).
+ - block bitmap is not initialized (EXT4_BG_BLOCK_UNINIT).
* - 0x4
- - inode table is zeroed (EXT4\_BG\_INODE\_ZEROED).
+ - inode table is zeroed (EXT4_BG_INODE_ZEROED).
diff --git a/Documentation/filesystems/ext4/ifork.rst b/Documentation/filesystems/ext4/ifork.rst
index b9816d5a896b..dc31f505e6c8 100644
--- a/Documentation/filesystems/ext4/ifork.rst
+++ b/Documentation/filesystems/ext4/ifork.rst
@@ -1,6 +1,6 @@
.. SPDX-License-Identifier: GPL-2.0
-The Contents of inode.i\_block
+The Contents of inode.i_block
------------------------------
Depending on the type of file an inode describes, the 60 bytes of
@@ -47,7 +47,7 @@ In ext4, the file to logical block map has been replaced with an extent
tree. Under the old scheme, allocating a contiguous run of 1,000 blocks
requires an indirect block to map all 1,000 entries; with extents, the
mapping is reduced to a single ``struct ext4_extent`` with
-``ee_len = 1000``. If flex\_bg is enabled, it is possible to allocate
+``ee_len = 1000``. If flex_bg is enabled, it is possible to allocate
very large files with a single extent, at a considerable reduction in
metadata block use, and some improvement in disk efficiency. The inode
must have the extents flag (0x80000) flag set for this feature to be in
@@ -76,28 +76,28 @@ which is 12 bytes long:
- Name
- Description
* - 0x0
- - \_\_le16
- - eh\_magic
+ - __le16
+ - eh_magic
- Magic number, 0xF30A.
* - 0x2
- - \_\_le16
- - eh\_entries
+ - __le16
+ - eh_entries
- Number of valid entries following the header.
* - 0x4
- - \_\_le16
- - eh\_max
+ - __le16
+ - eh_max
- Maximum number of entries that could follow the header.
* - 0x6
- - \_\_le16
- - eh\_depth
+ - __le16
+ - eh_depth
- Depth of this extent node in the extent tree. 0 = this extent node
points to data blocks; otherwise, this extent node points to other
extent nodes. The extent tree can be at most 5 levels deep: a logical
block number can be at most ``2^32``, and the smallest ``n`` that
satisfies ``4*(((blocksize - 12)/12)^n) >= 2^32`` is 5.
* - 0x8
- - \_\_le32
- - eh\_generation
+ - __le32
+ - eh_generation
- Generation of the tree. (Used by Lustre, but not standard ext4).
Internal nodes of the extent tree, also known as index nodes, are
@@ -112,22 +112,22 @@ recorded as ``struct ext4_extent_idx``, and are 12 bytes long:
- Name
- Description
* - 0x0
- - \_\_le32
- - ei\_block
+ - __le32
+ - ei_block
- This index node covers file blocks from 'block' onward.
* - 0x4
- - \_\_le32
- - ei\_leaf\_lo
+ - __le32
+ - ei_leaf_lo
- Lower 32-bits of the block number of the extent node that is the next
level lower in the tree. The tree node pointed to can be either another
internal node or a leaf node, described below.
* - 0x8
- - \_\_le16
- - ei\_leaf\_hi
+ - __le16
+ - ei_leaf_hi
- Upper 16-bits of the previous field.
* - 0xA
- - \_\_u16
- - ei\_unused
+ - __u16
+ - ei_unused
-
Leaf nodes of the extent tree are recorded as ``struct ext4_extent``,
@@ -142,24 +142,24 @@ and are also 12 bytes long:
- Name
- Description
* - 0x0
- - \_\_le32
- - ee\_block
+ - __le32
+ - ee_block
- First file block number that this extent covers.
* - 0x4
- - \_\_le16
- - ee\_len
+ - __le16
+ - ee_len
- Number of blocks covered by extent. If the value of this field is <=
32768, the extent is initialized. If the value of the field is > 32768,
the extent is uninitialized and the actual extent length is ``ee_len`` -
32768. Therefore, the maximum length of a initialized extent is 32768
blocks, and the maximum length of an uninitialized extent is 32767.
* - 0x6
- - \_\_le16
- - ee\_start\_hi
+ - __le16
+ - ee_start_hi
- Upper 16-bits of the block number to which this extent points.
* - 0x8
- - \_\_le32
- - ee\_start\_lo
+ - __le32
+ - ee_start_lo
- Lower 32-bits of the block number to which this extent points.
Prior to the introduction of metadata checksums, the extent header +
@@ -182,8 +182,8 @@ including) the checksum itself.
- Name
- Description
* - 0x0
- - \_\_le32
- - eb\_checksum
+ - __le32
+ - eb_checksum
- Checksum of the extent block, crc32c(uuid+inum+igeneration+extentblock)
Inline Data
diff --git a/Documentation/filesystems/ext4/inlinedata.rst b/Documentation/filesystems/ext4/inlinedata.rst
index d1075178ce0b..a728af0d2fd0 100644
--- a/Documentation/filesystems/ext4/inlinedata.rst
+++ b/Documentation/filesystems/ext4/inlinedata.rst
@@ -11,12 +11,12 @@ file is smaller than 60 bytes, then the data are stored inline in
attribute space, then it might be found as an extended attribute
“system.data” within the inode body (“ibody EA”). This of course
constrains the amount of extended attributes one can attach to an inode.
-If the data size increases beyond i\_block + ibody EA, a regular block
+If the data size increases beyond i_block + ibody EA, a regular block
is allocated and the contents moved to that block.
Pending a change to compact the extended attribute key used to store
inline data, one ought to be able to store 160 bytes of data in a
-256-byte inode (as of June 2015, when i\_extra\_isize is 28). Prior to
+256-byte inode (as of June 2015, when i_extra_isize is 28). Prior to
that, the limit was 156 bytes due to inefficient use of inode space.
The inline data feature requires the presence of an extended attribute
@@ -25,12 +25,12 @@ for “system.data”, even if the attribute value is zero length.
Inline Directories
~~~~~~~~~~~~~~~~~~
-The first four bytes of i\_block are the inode number of the parent
+The first four bytes of i_block are the inode number of the parent
directory. Following that is a 56-byte space for an array of directory
entries; see ``struct ext4_dir_entry``. If there is a “system.data”
attribute in the inode body, the EA value is an array of
``struct ext4_dir_entry`` as well. Note that for inline directories, the
-i\_block and EA space are treated as separate dirent blocks; directory
+i_block and EA space are treated as separate dirent blocks; directory
entries cannot span the two.
Inline directory entries are not checksummed, as the inode checksum
diff --git a/Documentation/filesystems/ext4/inodes.rst b/Documentation/filesystems/ext4/inodes.rst
index a65baffb4ebf..cfc6c1659931 100644
--- a/Documentation/filesystems/ext4/inodes.rst
+++ b/Documentation/filesystems/ext4/inodes.rst
@@ -38,138 +38,138 @@ The inode table entry is laid out in ``struct ext4_inode``.
- Name
- Description
* - 0x0
- - \_\_le16
- - i\_mode
+ - __le16
+ - i_mode
- File mode. See the table i_mode_ below.
* - 0x2
- - \_\_le16
- - i\_uid
+ - __le16
+ - i_uid
- Lower 16-bits of Owner UID.
* - 0x4
- - \_\_le32
- - i\_size\_lo
+ - __le32
+ - i_size_lo
- Lower 32-bits of size in bytes.
* - 0x8
- - \_\_le32
- - i\_atime
- - Last access time, in seconds since the epoch. However, if the EA\_INODE
+ - __le32
+ - i_atime
+ - Last access time, in seconds since the epoch. However, if the EA_INODE
inode flag is set, this inode stores an extended attribute value and
this field contains the checksum of the value.
* - 0xC
- - \_\_le32
- - i\_ctime
+ - __le32
+ - i_ctime
- Last inode change time, in seconds since the epoch. However, if the
- EA\_INODE inode flag is set, this inode stores an extended attribute
+ EA_INODE inode flag is set, this inode stores an extended attribute
value and this field contains the lower 32 bits of the attribute value's
reference count.
* - 0x10
- - \_\_le32
- - i\_mtime
+ - __le32
+ - i_mtime
- Last data modification time, in seconds since the epoch. However, if the
- EA\_INODE inode flag is set, this inode stores an extended attribute
+ EA_INODE inode flag is set, this inode stores an extended attribute
value and this field contains the number of the inode that owns the
extended attribute.
* - 0x14
- - \_\_le32
- - i\_dtime
+ - __le32
+ - i_dtime
- Deletion Time, in seconds since the epoch.
* - 0x18
- - \_\_le16
- - i\_gid
+ - __le16
+ - i_gid
- Lower 16-bits of GID.
* - 0x1A
- - \_\_le16
- - i\_links\_count
+ - __le16
+ - i_links_count
- Hard link count. Normally, ext4 does not permit an inode to have more
than 65,000 hard links. This applies to files as well as directories,
which means that there cannot be more than 64,998 subdirectories in a
directory (each subdirectory's '..' entry counts as a hard link, as does
- the '.' entry in the directory itself). With the DIR\_NLINK feature
+ the '.' entry in the directory itself). With the DIR_NLINK feature
enabled, ext4 supports more than 64,998 subdirectories by setting this
field to 1 to indicate that the number of hard links is not known.
* - 0x1C
- - \_\_le32
- - i\_blocks\_lo
- - Lower 32-bits of “block” count. If the huge\_file feature flag is not
+ - __le32
+ - i_blocks_lo
+ - Lower 32-bits of “block” count. If the huge_file feature flag is not
set on the filesystem, the file consumes ``i_blocks_lo`` 512-byte blocks
- on disk. If huge\_file is set and EXT4\_HUGE\_FILE\_FL is NOT set in
+ on disk. If huge_file is set and EXT4_HUGE_FILE_FL is NOT set in
``inode.i_flags``, then the file consumes ``i_blocks_lo + (i_blocks_hi
- << 32)`` 512-byte blocks on disk. If huge\_file is set and
- EXT4\_HUGE\_FILE\_FL IS set in ``inode.i_flags``, then this file
+ << 32)`` 512-byte blocks on disk. If huge_file is set and
+ EXT4_HUGE_FILE_FL IS set in ``inode.i_flags``, then this file
consumes (``i_blocks_lo + i_blocks_hi`` << 32) filesystem blocks on
disk.
* - 0x20
- - \_\_le32
- - i\_flags
+ - __le32
+ - i_flags
- Inode flags. See the table i_flags_ below.
* - 0x24
- 4 bytes
- - i\_osd1
+ - i_osd1
- See the table i_osd1_ for more details.
* - 0x28
- 60 bytes
- - i\_block[EXT4\_N\_BLOCKS=15]
- - Block map or extent tree. See the section “The Contents of inode.i\_block”.
+ - i_block[EXT4_N_BLOCKS=15]
+ - Block map or extent tree. See the section “The Contents of inode.i_block”.
* - 0x64
- - \_\_le32
- - i\_generation
+ - __le32
+ - i_generation
- File version (for NFS).
* - 0x68
- - \_\_le32
- - i\_file\_acl\_lo
+ - __le32
+ - i_file_acl_lo
- Lower 32-bits of extended attribute block. ACLs are of course one of
many possible extended attributes; I think the name of this field is a
result of the first use of extended attributes being for ACLs.
* - 0x6C
- - \_\_le32
- - i\_size\_high / i\_dir\_acl
+ - __le32
+ - i_size_high / i_dir_acl
- Upper 32-bits of file/directory size. In ext2/3 this field was named
- i\_dir\_acl, though it was usually set to zero and never used.
+ i_dir_acl, though it was usually set to zero and never used.
* - 0x70
- - \_\_le32
- - i\_obso\_faddr
+ - __le32
+ - i_obso_faddr
- (Obsolete) fragment address.
* - 0x74
- 12 bytes
- - i\_osd2
+ - i_osd2
- See the table i_osd2_ for more details.
* - 0x80
- - \_\_le16
- - i\_extra\_isize
+ - __le16
+ - i_extra_isize
- Size of this inode - 128. Alternately, the size of the extended inode
fields beyond the original ext2 inode, including this field.
* - 0x82
- - \_\_le16
- - i\_checksum\_hi
+ - __le16
+ - i_checksum_hi
- Upper 16-bits of the inode checksum.
* - 0x84
- - \_\_le32
- - i\_ctime\_extra
+ - __le32
+ - i_ctime_extra
- Extra change time bits. This provides sub-second precision. See Inode
Timestamps section.
* - 0x88
- - \_\_le32
- - i\_mtime\_extra
+ - __le32
+ - i_mtime_extra
- Extra modification time bits. This provides sub-second precision.
* - 0x8C
- - \_\_le32
- - i\_atime\_extra
+ - __le32
+ - i_atime_extra
- Extra access time bits. This provides sub-second precision.
* - 0x90
- - \_\_le32
- - i\_crtime
+ - __le32
+ - i_crtime
- File creation time, in seconds since the epoch.
* - 0x94
- - \_\_le32
- - i\_crtime\_extra
+ - __le32
+ - i_crtime_extra
- Extra file creation time bits. This provides sub-second precision.
* - 0x98
- - \_\_le32
- - i\_version\_hi
+ - __le32
+ - i_version_hi
- Upper 32-bits for version number.
* - 0x9C
- - \_\_le32
- - i\_projid
+ - __le32
+ - i_projid
- Project ID.
.. _i_mode:
@@ -183,45 +183,45 @@ The ``i_mode`` value is a combination of the following flags:
* - Value
- Description
* - 0x1
- - S\_IXOTH (Others may execute)
+ - S_IXOTH (Others may execute)
* - 0x2
- - S\_IWOTH (Others may write)
+ - S_IWOTH (Others may write)
* - 0x4
- - S\_IROTH (Others may read)
+ - S_IROTH (Others may read)
* - 0x8
- - S\_IXGRP (Group members may execute)
+ - S_IXGRP (Group members may execute)
* - 0x10
- - S\_IWGRP (Group members may write)
+ - S_IWGRP (Group members may write)
* - 0x20
- - S\_IRGRP (Group members may read)
+ - S_IRGRP (Group members may read)
* - 0x40
- - S\_IXUSR (Owner may execute)
+ - S_IXUSR (Owner may execute)
* - 0x80
- - S\_IWUSR (Owner may write)
+ - S_IWUSR (Owner may write)
* - 0x100
- - S\_IRUSR (Owner may read)
+ - S_IRUSR (Owner may read)
* - 0x200
- - S\_ISVTX (Sticky bit)
+ - S_ISVTX (Sticky bit)
* - 0x400
- - S\_ISGID (Set GID)
+ - S_ISGID (Set GID)
* - 0x800
- - S\_ISUID (Set UID)
+ - S_ISUID (Set UID)
* -
- These are mutually-exclusive file types:
* - 0x1000
- - S\_IFIFO (FIFO)
+ - S_IFIFO (FIFO)
* - 0x2000
- - S\_IFCHR (Character device)
+ - S_IFCHR (Character device)
* - 0x4000
- - S\_IFDIR (Directory)
+ - S_IFDIR (Directory)
* - 0x6000
- - S\_IFBLK (Block device)
+ - S_IFBLK (Block device)
* - 0x8000
- - S\_IFREG (Regular file)
+ - S_IFREG (Regular file)
* - 0xA000
- - S\_IFLNK (Symbolic link)
+ - S_IFLNK (Symbolic link)
* - 0xC000
- - S\_IFSOCK (Socket)
+ - S_IFSOCK (Socket)
.. _i_flags:
@@ -234,56 +234,56 @@ The ``i_flags`` field is a combination of these values:
* - Value
- Description
* - 0x1
- - This file requires secure deletion (EXT4\_SECRM\_FL). (not implemented)
+ - This file requires secure deletion (EXT4_SECRM_FL). (not implemented)
* - 0x2
- This file should be preserved, should undeletion be desired
- (EXT4\_UNRM\_FL). (not implemented)
+ (EXT4_UNRM_FL). (not implemented)
* - 0x4
- - File is compressed (EXT4\_COMPR\_FL). (not really implemented)
+ - File is compressed (EXT4_COMPR_FL). (not really implemented)
* - 0x8
- - All writes to the file must be synchronous (EXT4\_SYNC\_FL).
+ - All writes to the file must be synchronous (EXT4_SYNC_FL).
* - 0x10
- - File is immutable (EXT4\_IMMUTABLE\_FL).
+ - File is immutable (EXT4_IMMUTABLE_FL).
* - 0x20
- - File can only be appended (EXT4\_APPEND\_FL).
+ - File can only be appended (EXT4_APPEND_FL).
* - 0x40
- - The dump(1) utility should not dump this file (EXT4\_NODUMP\_FL).
+ - The dump(1) utility should not dump this file (EXT4_NODUMP_FL).
* - 0x80
- - Do not update access time (EXT4\_NOATIME\_FL).
+ - Do not update access time (EXT4_NOATIME_FL).
* - 0x100
- - Dirty compressed file (EXT4\_DIRTY\_FL). (not used)
+ - Dirty compressed file (EXT4_DIRTY_FL). (not used)
* - 0x200
- - File has one or more compressed clusters (EXT4\_COMPRBLK\_FL). (not used)
+ - File has one or more compressed clusters (EXT4_COMPRBLK_FL). (not used)
* - 0x400
- - Do not compress file (EXT4\_NOCOMPR\_FL). (not used)
+ - Do not compress file (EXT4_NOCOMPR_FL). (not used)
* - 0x800
- - Encrypted inode (EXT4\_ENCRYPT\_FL). This bit value previously was
- EXT4\_ECOMPR\_FL (compression error), which was never used.
+ - Encrypted inode (EXT4_ENCRYPT_FL). This bit value previously was
+ EXT4_ECOMPR_FL (compression error), which was never used.
* - 0x1000
- - Directory has hashed indexes (EXT4\_INDEX\_FL).
+ - Directory has hashed indexes (EXT4_INDEX_FL).
* - 0x2000
- - AFS magic directory (EXT4\_IMAGIC\_FL).
+ - AFS magic directory (EXT4_IMAGIC_FL).
* - 0x4000
- File data must always be written through the journal
- (EXT4\_JOURNAL\_DATA\_FL).
+ (EXT4_JOURNAL_DATA_FL).
* - 0x8000
- - File tail should not be merged (EXT4\_NOTAIL\_FL). (not used by ext4)
+ - File tail should not be merged (EXT4_NOTAIL_FL). (not used by ext4)
* - 0x10000
- All directory entry data should be written synchronously (see
- ``dirsync``) (EXT4\_DIRSYNC\_FL).
+ ``dirsync``) (EXT4_DIRSYNC_FL).
* - 0x20000
- - Top of directory hierarchy (EXT4\_TOPDIR\_FL).
+ - Top of directory hierarchy (EXT4_TOPDIR_FL).
* - 0x40000
- - This is a huge file (EXT4\_HUGE\_FILE\_FL).
+ - This is a huge file (EXT4_HUGE_FILE_FL).
* - 0x80000
- - Inode uses extents (EXT4\_EXTENTS\_FL).
+ - Inode uses extents (EXT4_EXTENTS_FL).
* - 0x100000
- - Verity protected file (EXT4\_VERITY\_FL).
+ - Verity protected file (EXT4_VERITY_FL).
* - 0x200000
- Inode stores a large extended attribute value in its data blocks
- (EXT4\_EA\_INODE\_FL).
+ (EXT4_EA_INODE_FL).
* - 0x400000
- - This file has blocks allocated past EOF (EXT4\_EOFBLOCKS\_FL).
+ - This file has blocks allocated past EOF (EXT4_EOFBLOCKS_FL).
(deprecated)
* - 0x01000000
- Inode is a snapshot (``EXT4_SNAPFILE_FL``). (not in mainline)
@@ -294,21 +294,21 @@ The ``i_flags`` field is a combination of these values:
- Snapshot shrink has completed (``EXT4_SNAPFILE_SHRUNK_FL``). (not in
mainline)
* - 0x10000000
- - Inode has inline data (EXT4\_INLINE\_DATA\_FL).
+ - Inode has inline data (EXT4_INLINE_DATA_FL).
* - 0x20000000
- - Create children with the same project ID (EXT4\_PROJINHERIT\_FL).
+ - Create children with the same project ID (EXT4_PROJINHERIT_FL).
* - 0x80000000
- - Reserved for ext4 library (EXT4\_RESERVED\_FL).
+ - Reserved for ext4 library (EXT4_RESERVED_FL).
* -
- Aggregate flags:
* - 0x705BDFFF
- User-visible flags.
* - 0x604BC0FF
- - User-modifiable flags. Note that while EXT4\_JOURNAL\_DATA\_FL and
- EXT4\_EXTENTS\_FL can be set with setattr, they are not in the kernel's
- EXT4\_FL\_USER\_MODIFIABLE mask, since it needs to handle the setting of
+ - User-modifiable flags. Note that while EXT4_JOURNAL_DATA_FL and
+ EXT4_EXTENTS_FL can be set with setattr, they are not in the kernel's
+ EXT4_FL_USER_MODIFIABLE mask, since it needs to handle the setting of
these flags in a special manner and they are masked out of the set of
- flags that are saved directly to i\_flags.
+ flags that are saved directly to i_flags.
.. _i_osd1:
@@ -325,9 +325,9 @@ Linux:
- Name
- Description
* - 0x0
- - \_\_le32
- - l\_i\_version
- - Inode version. However, if the EA\_INODE inode flag is set, this inode
+ - __le32
+ - l_i_version
+ - Inode version. However, if the EA_INODE inode flag is set, this inode
stores an extended attribute value and this field contains the upper 32
bits of the attribute value's reference count.
@@ -342,8 +342,8 @@ Hurd:
- Name
- Description
* - 0x0
- - \_\_le32
- - h\_i\_translator
+ - __le32
+ - h_i_translator
- ??
Masix:
@@ -357,8 +357,8 @@ Masix:
- Name
- Description
* - 0x0
- - \_\_le32
- - m\_i\_reserved
+ - __le32
+ - m_i_reserved
- ??
.. _i_osd2:
@@ -376,30 +376,30 @@ Linux:
- Name
- Description
* - 0x0
- - \_\_le16
- - l\_i\_blocks\_high
+ - __le16
+ - l_i_blocks_high
- Upper 16-bits of the block count. Please see the note attached to
- i\_blocks\_lo.
+ i_blocks_lo.
* - 0x2
- - \_\_le16
- - l\_i\_file\_acl\_high
+ - __le16
+ - l_i_file_acl_high
- Upper 16-bits of the extended attribute block (historically, the file
ACL location). See the Extended Attributes section below.
* - 0x4
- - \_\_le16
- - l\_i\_uid\_high
+ - __le16
+ - l_i_uid_high
- Upper 16-bits of the Owner UID.
* - 0x6
- - \_\_le16
- - l\_i\_gid\_high
+ - __le16
+ - l_i_gid_high
- Upper 16-bits of the GID.
* - 0x8
- - \_\_le16
- - l\_i\_checksum\_lo
+ - __le16
+ - l_i_checksum_lo
- Lower 16-bits of the inode checksum.
* - 0xA
- - \_\_le16
- - l\_i\_reserved
+ - __le16
+ - l_i_reserved
- Unused.
Hurd:
@@ -413,24 +413,24 @@ Hurd:
- Name
- Description
* - 0x0
- - \_\_le16
- - h\_i\_reserved1
+ - __le16
+ - h_i_reserved1
- ??
* - 0x2
- - \_\_u16
- - h\_i\_mode\_high
+ - __u16
+ - h_i_mode_high
- Upper 16-bits of the file mode.
* - 0x4
- - \_\_le16
- - h\_i\_uid\_high
+ - __le16
+ - h_i_uid_high
- Upper 16-bits of the Owner UID.
* - 0x6
- - \_\_le16
- - h\_i\_gid\_high
+ - __le16
+ - h_i_gid_high
- Upper 16-bits of the GID.
* - 0x8
- - \_\_u32
- - h\_i\_author
+ - __u32
+ - h_i_author
- Author code?
Masix:
@@ -444,17 +444,17 @@ Masix:
- Name
- Description
* - 0x0
- - \_\_le16
- - h\_i\_reserved1
+ - __le16
+ - h_i_reserved1
- ??
* - 0x2
- - \_\_u16
- - m\_i\_file\_acl\_high
+ - __u16
+ - m_i_file_acl_high
- Upper 16-bits of the extended attribute block (historically, the file
ACL location).
* - 0x4
- - \_\_u32
- - m\_i\_reserved2[2]
+ - __u32
+ - m_i_reserved2[2]
- ??
Inode Size
@@ -466,11 +466,11 @@ In ext2 and ext3, the inode structure size was fixed at 128 bytes
on-disk inode at format time for all inodes in the filesystem to provide
space beyond the end of the original ext2 inode. The on-disk inode
record size is recorded in the superblock as ``s_inode_size``. The
-number of bytes actually used by struct ext4\_inode beyond the original
+number of bytes actually used by struct ext4_inode beyond the original
128-byte ext2 inode is recorded in the ``i_extra_isize`` field for each
-inode, which allows struct ext4\_inode to grow for a new kernel without
+inode, which allows struct ext4_inode to grow for a new kernel without
having to upgrade all of the on-disk inodes. Access to fields beyond
-EXT2\_GOOD\_OLD\_INODE\_SIZE should be verified to be within
+EXT2_GOOD_OLD_INODE_SIZE should be verified to be within
``i_extra_isize``. By default, ext4 inode records are 256 bytes, and (as
of August 2019) the inode structure is 160 bytes
(``i_extra_isize = 32``). The extra space between the end of the inode
@@ -498,11 +498,11 @@ structure -- inode change time (ctime), access time (atime), data
modification time (mtime), and deletion time (dtime). The four fields
are 32-bit signed integers that represent seconds since the Unix epoch
(1970-01-01 00:00:00 GMT), which means that the fields will overflow in
-January 2038. For inodes that are not linked from any directory but are
-still open (orphan inodes), the dtime field is overloaded for use with
-the orphan list. The superblock field ``s_last_orphan`` points to the
-first inode in the orphan list; dtime is then the number of the next
-orphaned inode, or zero if there are no more orphans.
+January 2038. If the filesystem does not have the orphan_file feature, inodes
+that are not linked from any directory but are still open (orphan inodes) have
+the dtime field overloaded for use with the orphan list. The superblock field
+``s_last_orphan`` points to the first inode in the orphan list; dtime is then
+the number of the next orphaned inode, or zero if there are no more orphans.
If the inode structure size ``sb->s_inode_size`` is larger than 128
bytes and the ``i_inode_extra`` field is large enough to encompass the
@@ -516,7 +516,7 @@ creation time (crtime); this field is 64-bits wide and decoded in the
same manner as 64-bit [cma]time. Neither crtime nor dtime are accessible
through the regular stat() interface, though debugfs will report them.
-We use the 32-bit signed time value plus (2^32 \* (extra epoch bits)).
+We use the 32-bit signed time value plus (2^32 * (extra epoch bits)).
In other words:
.. list-table::
@@ -525,8 +525,8 @@ In other words:
* - Extra epoch bits
- MSB of 32-bit time
- - Adjustment for signed 32-bit to 64-bit tv\_sec
- - Decoded 64-bit tv\_sec
+ - Adjustment for signed 32-bit to 64-bit tv_sec
+ - Decoded 64-bit tv_sec
- valid time range
* - 0 0
- 1
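+
+A minimal userspace sketch of this decoding, assuming (as the rule above
+states) that the two epoch bits are the low bits of the "extra" field and
+the remaining 30 bits hold nanoseconds::
+
+    #include <stdint.h>
+
+    /* 'seconds' is the classic signed 32-bit [cma]time field; 'extra' is
+     * the matching *_extra field, both already converted from __le32. */
+    static void ext4_decode_xtime(int32_t seconds, uint32_t extra,
+                                  int64_t *tv_sec, uint32_t *tv_nsec)
+    {
+        int64_t epoch = extra & 0x3;           /* extra epoch bits */
+
+        *tv_sec  = (int64_t)seconds + (epoch << 32);
+        *tv_nsec = extra >> 2;                 /* 30 bits of nanoseconds */
+    }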
diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
index ea613ee701f5..6e8fb2d4b46f 100644
--- a/Documentation/filesystems/ext4/journal.rst
+++ b/Documentation/filesystems/ext4/journal.rst
@@ -4,14 +4,14 @@ Journal (jbd2)
--------------
Introduced in ext3, the ext4 filesystem employs a journal to protect the
-filesystem against corruption in the case of a system crash. A small
-continuous region of disk (default 128MiB) is reserved inside the
-filesystem as a place to land “important” data writes on-disk as quickly
-as possible. Once the important data transaction is fully written to the
-disk and flushed from the disk write cache, a record of the data being
-committed is also written to the journal. At some later point in time,
-the journal code writes the transactions to their final locations on
-disk (this could involve a lot of seeking or a lot of small
+filesystem against metadata inconsistencies in the case of a system crash. Up
+to 10,240,000 file system blocks (see man mke2fs(8) for more details on journal
+size limits) can be reserved inside the filesystem as a place to land
+“important” data writes on-disk as quickly as possible. Once the important
+data transaction is fully written to the disk and flushed from the disk write
+cache, a record of the data being committed is also written to the journal. At
+some later point in time, the journal code writes the transactions to their
+final locations on disk (this could involve a lot of seeking or a lot of small
read-write-erases) before erasing the commit record. Should the system
crash during the second slow write, the journal can be replayed all the
way to the latest commit record, guaranteeing the atomicity of whatever
@@ -28,6 +28,17 @@ metadata are written to disk through the journal. This is slower but
safest. If ``data=writeback``, dirty data blocks are not flushed to the
disk before the metadata are written to disk through the journal.
+In ``data=ordered`` mode, Ext4 also supports fast commits, which
+help reduce commit latency significantly. The default ``data=ordered``
+mode works by logging metadata blocks to the journal. In fast commit
+mode, Ext4 only stores the minimal delta needed to recreate the
+affected metadata in a fast commit area that is shared with JBD2.
+Once the fast commit area fills up, or if a fast commit is not
+possible, or if the JBD2 commit timer goes off, Ext4 performs a
+traditional full commit. A full commit invalidates all the fast commits
+that happened before it and thus empties the fast commit area for
+further fast commits. This feature needs to be enabled at mkfs time.
+
The journal inode is typically inode 8. The first 68 bytes of the
journal inode are replicated in the ext4 superblock. The journal itself
is a normal (but hidden) file within the filesystem. The file usually
@@ -52,8 +63,8 @@ Generally speaking, the journal has this format:
:header-rows: 1
* - Superblock
- - descriptor\_block (data\_blocks or revocation\_block) [more data or
- revocations] commmit\_block
+ - descriptor_block (data_blocks or revocation_block) [more data or
+ revocations] commit_block
- [more transactions...]
* -
- One transaction
@@ -82,8 +93,8 @@ superblock.
* - 1024 bytes of padding
- ext4 Superblock
- Journal Superblock
- - descriptor\_block (data\_blocks or revocation\_block) [more data or
- revocations] commmit\_block
+ - descriptor_block (data_blocks or revocation_block) [more data or
+ revocations] commit_block
- [more transactions...]
* -
-
@@ -106,17 +117,17 @@ Every block in the journal starts with a common 12-byte header
- Name
- Description
* - 0x0
- - \_\_be32
- - h\_magic
+ - __be32
+ - h_magic
- jbd2 magic number, 0xC03B3998.
* - 0x4
- - \_\_be32
- - h\_blocktype
+ - __be32
+ - h_blocktype
- Description of what this block contains. See the jbd2_blocktype_ table
below.
* - 0x8
- - \_\_be32
- - h\_sequence
+ - __be32
+ - h_sequence
- The transaction ID that goes with this block.
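+
+For reference, the corresponding definition in include/linux/jbd2.h is
+essentially::
+
+    typedef struct journal_header_s {
+        __be32 h_magic;        /* 0xC03B3998 */
+        __be32 h_blocktype;    /* see the jbd2_blocktype table below */
+        __be32 h_sequence;     /* transaction ID */
+    } journal_header_t;
+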
.. _jbd2_blocktype:
@@ -166,95 +177,104 @@ which is 1024 bytes long:
-
- Static information describing the journal.
* - 0x0
- - journal\_header\_t (12 bytes)
- - s\_header
+ - journal_header_t (12 bytes)
+ - s_header
- Common header identifying this as a superblock.
* - 0xC
- - \_\_be32
- - s\_blocksize
+ - __be32
+ - s_blocksize
- Journal device block size.
* - 0x10
- - \_\_be32
- - s\_maxlen
+ - __be32
+ - s_maxlen
- Total number of blocks in this journal.
* - 0x14
- - \_\_be32
- - s\_first
+ - __be32
+ - s_first
- First block of log information.
* -
-
-
- Dynamic information describing the current state of the log.
* - 0x18
- - \_\_be32
- - s\_sequence
+ - __be32
+ - s_sequence
- First commit ID expected in log.
* - 0x1C
- - \_\_be32
- - s\_start
+ - __be32
+ - s_start
- Block number of the start of log. Contrary to the comments, this field
being zero does not imply that the journal is clean!
* - 0x20
- - \_\_be32
- - s\_errno
- - Error value, as set by jbd2\_journal\_abort().
+ - __be32
+ - s_errno
+ - Error value, as set by jbd2_journal_abort().
* -
-
-
- The remaining fields are only valid in a v2 superblock.
* - 0x24
- - \_\_be32
- - s\_feature\_compat;
+ - __be32
+ - s_feature_compat
- Compatible feature set. See the table jbd2_compat_ below.
* - 0x28
- - \_\_be32
- - s\_feature\_incompat
+ - __be32
+ - s_feature_incompat
- Incompatible feature set. See the table jbd2_incompat_ below.
* - 0x2C
- - \_\_be32
- - s\_feature\_ro\_compat
+ - __be32
+ - s_feature_ro_compat
- Read-only compatible feature set. There aren't any of these currently.
* - 0x30
- - \_\_u8
- - s\_uuid[16]
+ - __u8
+ - s_uuid[16]
- 128-bit uuid for journal. This is compared against the copy in the ext4
super block at mount time.
* - 0x40
- - \_\_be32
- - s\_nr\_users
+ - __be32
+ - s_nr_users
- Number of file systems sharing this journal.
* - 0x44
- - \_\_be32
- - s\_dynsuper
+ - __be32
+ - s_dynsuper
- Location of dynamic super block copy. (Not used?)
* - 0x48
- - \_\_be32
- - s\_max\_transaction
+ - __be32
+ - s_max_transaction
- Limit of journal blocks per transaction. (Not used?)
* - 0x4C
- - \_\_be32
- - s\_max\_trans\_data
+ - __be32
+ - s_max_trans_data
- Limit of data blocks per transaction. (Not used?)
* - 0x50
- - \_\_u8
- - s\_checksum\_type
+ - __u8
+ - s_checksum_type
- Checksum algorithm used for the journal. See jbd2_checksum_type_ for
more info.
* - 0x51
- - \_\_u8[3]
- - s\_padding2
+ - __u8[3]
+ - s_padding2
-
* - 0x54
- - \_\_u32
- - s\_padding[42]
+ - __be32
+ - s_num_fc_blocks
+ - Number of fast commit blocks in the journal.
+ * - 0x58
+ - __be32
+ - s_head
+ - Block number of the head (first unused block) of the journal, only
+ up-to-date when the journal is empty.
+ * - 0x5C
+ - __u32
+ - s_padding[40]
-
* - 0xFC
- - \_\_be32
- - s\_checksum
+ - __be32
+ - s_checksum
- Checksum of the entire superblock, with this field set to zero.
* - 0x100
- - \_\_u8
- - s\_users[16\*48]
+ - __u8
+ - s_users[16*48]
- ids of all file systems sharing the log. e2fsprogs/Linux don't allow
shared external journals, but I imagine Lustre (or ocfs2?), which use
the jbd2 code, might.
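+
+A sketch of identifying journal metadata blocks while scanning the log,
+using the common 12-byte header described earlier (userspace illustration;
+the escaped-magic corner case handled by JBD2_FLAG_ESCAPE is ignored
+here)::
+
+    #include <endian.h>
+    #include <stdint.h>
+
+    #define JBD2_MAGIC_NUMBER 0xC03B3998u
+
+    struct jbd2_hdr {                  /* the common 12-byte header */
+        uint32_t h_magic;
+        uint32_t h_blocktype;
+        uint32_t h_sequence;
+    };
+
+    /* A block starts a descriptor/commit/revocation block for
+     * transaction 'tid' iff the magic and sequence both match. */
+    static int is_journal_meta(const struct jbd2_hdr *h, uint32_t tid)
+    {
+        return be32toh(h->h_magic) == JBD2_MAGIC_NUMBER &&
+               be32toh(h->h_sequence) == tid;
+    }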
@@ -271,7 +291,7 @@ The journal compat features are any combination of the following:
- Description
* - 0x1
- Journal maintains checksums on the data blocks.
- (JBD2\_FEATURE\_COMPAT\_CHECKSUM)
+ (JBD2_FEATURE_COMPAT_CHECKSUM)
.. _jbd2_incompat:
@@ -284,21 +304,23 @@ The journal incompat features are any combination of the following:
* - Value
- Description
* - 0x1
- - Journal has block revocation records. (JBD2\_FEATURE\_INCOMPAT\_REVOKE)
+ - Journal has block revocation records. (JBD2_FEATURE_INCOMPAT_REVOKE)
* - 0x2
- Journal can deal with 64-bit block numbers.
- (JBD2\_FEATURE\_INCOMPAT\_64BIT)
+ (JBD2_FEATURE_INCOMPAT_64BIT)
* - 0x4
- - Journal commits asynchronously. (JBD2\_FEATURE\_INCOMPAT\_ASYNC\_COMMIT)
+ - Journal commits asynchronously. (JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)
* - 0x8
- This journal uses v2 of the checksum on-disk format. Each journal
metadata block gets its own checksum, and the block tags in the
descriptor table contain checksums for each of the data blocks in the
- journal. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2)
+ journal. (JBD2_FEATURE_INCOMPAT_CSUM_V2)
* - 0x10
- This journal uses v3 of the checksum on-disk format. This is the same as
v2, but the journal block tag size is fixed regardless of the size of
- block numbers. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3)
+ block numbers. (JBD2_FEATURE_INCOMPAT_CSUM_V3)
+ * - 0x20
+ - Journal has fast commit blocks. (JBD2_FEATURE_INCOMPAT_FAST_COMMIT)
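+
+A sketch of how a reader might refuse journals with unknown incompatible
+features, using the flag values from this table (illustrative userspace
+code, not the kernel's)::
+
+    #include <stdint.h>
+
+    #define JBD2_FEATURE_INCOMPAT_REVOKE        0x1
+    #define JBD2_FEATURE_INCOMPAT_64BIT         0x2
+    #define JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT  0x4
+    #define JBD2_FEATURE_INCOMPAT_CSUM_V2       0x8
+    #define JBD2_FEATURE_INCOMPAT_CSUM_V3       0x10
+    #define JBD2_FEATURE_INCOMPAT_FAST_COMMIT   0x20
+
+    #define JBD2_KNOWN_INCOMPAT                                        \
+        (JBD2_FEATURE_INCOMPAT_REVOKE | JBD2_FEATURE_INCOMPAT_64BIT |  \
+         JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT |                          \
+         JBD2_FEATURE_INCOMPAT_CSUM_V2 |                               \
+         JBD2_FEATURE_INCOMPAT_CSUM_V3 |                               \
+         JBD2_FEATURE_INCOMPAT_FAST_COMMIT)
+
+    /* 's_feature_incompat' already converted from __be32. */
+    static int journal_is_usable(uint32_t s_feature_incompat)
+    {
+        return (s_feature_incompat & ~JBD2_KNOWN_INCOMPAT) == 0;
+    }
+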
.. _jbd2_checksum_type:
@@ -338,11 +360,11 @@ Descriptor blocks consume at least 36 bytes, but use a full block:
- Name
- Descriptor
* - 0x0
- - journal\_header\_t
+ - journal_header_t
- (open coded)
- Common block header.
* - 0xC
- - struct journal\_block\_tag\_s
+ - struct journal_block_tag_s
- open coded array[]
- Enough tags either to fill up the block or to describe all the data
blocks that follow this descriptor block.
@@ -350,7 +372,7 @@ Descriptor blocks consume at least 36 bytes, but use a full block:
Journal block tags have any of the following formats, depending on which
journal feature and block tag flags are set.
-If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the journal block tag is
+If JBD2_FEATURE_INCOMPAT_CSUM_V3 is set, the journal block tag is
defined as ``struct journal_block_tag3_s``, which looks like the
following. The size is 16 or 32 bytes.
@@ -363,24 +385,24 @@ following. The size is 16 or 32 bytes.
- Name
- Descriptor
* - 0x0
- - \_\_be32
- - t\_blocknr
+ - __be32
+ - t_blocknr
- Lower 32-bits of the location of where the corresponding data block
should end up on disk.
* - 0x4
- - \_\_be32
- - t\_flags
+ - __be32
+ - t_flags
- Flags that go with the descriptor. See the table jbd2_tag_flags_ for
more info.
* - 0x8
- - \_\_be32
- - t\_blocknr\_high
+ - __be32
+ - t_blocknr_high
- Upper 32-bits of the location of where the corresponding data block
- should end up on disk. This is zero if JBD2\_FEATURE\_INCOMPAT\_64BIT is
+ should end up on disk. This is zero if JBD2_FEATURE_INCOMPAT_64BIT is
not enabled.
* - 0xC
- - \_\_be32
- - t\_checksum
+ - __be32
+ - t_checksum
- Checksum of the journal UUID, the sequence number, and the data block.
* -
-
@@ -416,7 +438,7 @@ The journal tag flags are any combination of the following:
* - 0x8
- This is the last tag in this descriptor block.
-If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is NOT set, the journal block tag
+If JBD2_FEATURE_INCOMPAT_CSUM_V3 is NOT set, the journal block tag
is defined as ``struct journal_block_tag_s``, which looks like the
following. The size is 8, 12, 24, or 28 bytes:
@@ -429,18 +451,18 @@ following. The size is 8, 12, 24, or 28 bytes:
- Name
- Descriptor
* - 0x0
- - \_\_be32
- - t\_blocknr
+ - __be32
+ - t_blocknr
- Lower 32-bits of the location of where the corresponding data block
should end up on disk.
* - 0x4
- - \_\_be16
- - t\_checksum
+ - __be16
+ - t_checksum
- Checksum of the journal UUID, the sequence number, and the data block.
Note that only the lower 16 bits are stored.
* - 0x6
- - \_\_be16
- - t\_flags
+ - __be16
+ - t_flags
- Flags that go with the descriptor. See the table jbd2_tag_flags_ for
more info.
* -
@@ -449,8 +471,8 @@ following. The size is 8, 12, 24, or 28 bytes:
- This next field is only present if the super block indicates support for
64-bit block numbers.
* - 0x8
- - \_\_be32
- - t\_blocknr\_high
+ - __be32
+ - t_blocknr_high
- Upper 32-bits of the location of where the corresponding data block
should end up on disk.
* -
@@ -466,8 +488,8 @@ following. The size is 8, 12, 24, or 28 bytes:
``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
field.
-If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or
-JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 are set, the end of the block is a
+If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
+JBD2_FEATURE_INCOMPAT_CSUM_V3 are set, the end of the block is a
``struct jbd2_journal_block_tail``, which looks like this:
.. list-table::
@@ -479,8 +501,8 @@ JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 are set, the end of the block is a
- Name
- Descriptor
* - 0x0
- - \_\_be32
- - t\_checksum
+ - __be32
+ - t_checksum
- Checksum of the journal UUID + the descriptor block, with this field set
to zero.
@@ -521,25 +543,25 @@ length, but use a full block:
- Name
- Description
* - 0x0
- - journal\_header\_t
- - r\_header
+ - journal_header_t
+ - r_header
- Common block header.
* - 0xC
- - \_\_be32
- - r\_count
+ - __be32
+ - r_count
- Number of bytes used in this block.
* - 0x10
- - \_\_be32 or \_\_be64
+ - __be32 or __be64
- blocks[0]
- Blocks to revoke.
-After r\_count is a linear array of block numbers that are effectively
+After r_count is a linear array of block numbers that are effectively
revoked by this transaction. The size of each block number is 8 bytes if
the superblock advertises 64-bit block number support, or 4 bytes
otherwise.
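+
+A sketch of walking one revocation block according to these rules
+(userspace illustration; ``blk`` points at the start of the block, and
+r_count is assumed to have been validated against the block size)::
+
+    #include <endian.h>
+    #include <stdint.h>
+    #include <string.h>
+
+    /* 'is_64bit' reflects JBD2_FEATURE_INCOMPAT_64BIT. */
+    static void for_each_revoked(const uint8_t *blk, int is_64bit,
+                                 void (*cb)(uint64_t blocknr))
+    {
+        uint32_t count;
+        size_t off, width = is_64bit ? 8 : 4;
+
+        memcpy(&count, blk + 0xC, 4);      /* r_count, big-endian */
+        count = be32toh(count);            /* bytes used, incl. header */
+
+        for (off = 0x10; off + width <= count; off += width) {
+            uint64_t nr;
+
+            if (is_64bit) {
+                uint64_t be;
+                memcpy(&be, blk + off, 8); /* unaligned-safe copy */
+                nr = be64toh(be);
+            } else {
+                uint32_t be;
+                memcpy(&be, blk + off, 4);
+                nr = be32toh(be);
+            }
+            cb(nr);
+        }
+    }
+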
-If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or
-JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 are set, the end of the revocation
+If JBD2_FEATURE_INCOMPAT_CSUM_V2 or
+JBD2_FEATURE_INCOMPAT_CSUM_V3 are set, the end of the revocation
block is a ``struct jbd2_journal_revoke_tail``, which has this format:
.. list-table::
@@ -551,8 +573,8 @@ block is a ``struct jbd2_journal_revoke_tail``, which has this format:
- Name
- Description
* - 0x0
- - \_\_be32
- - r\_checksum
+ - __be32
+ - r_checksum
- Checksum of the journal UUID + revocation block
Commit Block
@@ -575,37 +597,165 @@ bytes long (but uses a full block):
- Name
- Descriptor
* - 0x0
- - journal\_header\_s
+ - journal_header_s
- (open coded)
- Common block header.
* - 0xC
- unsigned char
- - h\_chksum\_type
+ - h_chksum_type
- The type of checksum to use to verify the integrity of the data blocks
in the transaction. See jbd2_checksum_type_ for more info.
* - 0xD
- unsigned char
- - h\_chksum\_size
+ - h_chksum_size
- The number of bytes used by the checksum. Most likely 4.
* - 0xE
- unsigned char
- - h\_padding[2]
+ - h_padding[2]
-
* - 0x10
- - \_\_be32
- - h\_chksum[JBD2\_CHECKSUM\_BYTES]
+ - __be32
+ - h_chksum[JBD2_CHECKSUM_BYTES]
- 32 bytes of space to store checksums. If
- JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3
+ JBD2_FEATURE_INCOMPAT_CSUM_V2 or JBD2_FEATURE_INCOMPAT_CSUM_V3
are set, the first ``__be32`` is the checksum of the journal UUID and
the entire commit block, with this field zeroed. If
- JBD2\_FEATURE\_COMPAT\_CHECKSUM is set, the first ``__be32`` is the
+ JBD2_FEATURE_COMPAT_CHECKSUM is set, the first ``__be32`` is the
crc32 of all the blocks already written to the transaction.
* - 0x30
- - \_\_be64
- - h\_commit\_sec
+ - __be64
+ - h_commit_sec
- The time that the transaction was committed, in seconds since the epoch.
* - 0x38
- - \_\_be32
- - h\_commit\_nsec
+ - __be32
+ - h_commit_nsec
- Nanoseconds component of the above timestamp.
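+
+For reference, these fields map onto jbd2's ``struct commit_header``
+roughly as follows (types shown kernel-style)::
+
+    #define JBD2_CHECKSUM_BYTES (32 / sizeof(__u32))
+
+    struct commit_header {
+        __be32        h_magic;
+        __be32        h_blocktype;             /* commit block */
+        __be32        h_sequence;
+        unsigned char h_chksum_type;
+        unsigned char h_chksum_size;
+        unsigned char h_padding[2];
+        __be32        h_chksum[JBD2_CHECKSUM_BYTES];
+        __be64        h_commit_sec;
+        __be32        h_commit_nsec;
+    };
+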
+Fast commits
+~~~~~~~~~~~~
+
+The fast commit area is organized as a log of tag-length-value (TLV)
+entries. Each TLV begins with a ``struct ext4_fc_tl`` that stores the tag
+and the length, and is followed by a variable-length, tag-specific value.
+Here is the list of supported tags and their meanings (a sketch of walking
+this log follows the table):
+
+.. list-table::
+ :widths: 8 20 20 32
+ :header-rows: 1
+
+ * - Tag
+ - Meaning
+ - Value struct
+ - Description
+ * - EXT4_FC_TAG_HEAD
+ - Fast commit area header
+ - ``struct ext4_fc_head``
+ - Stores the TID of the transaction after which these fast commits should
+ be applied.
+ * - EXT4_FC_TAG_ADD_RANGE
+ - Add extent to inode
+ - ``struct ext4_fc_add_range``
+ - Stores the inode number and the extent to be added to this inode
+ * - EXT4_FC_TAG_DEL_RANGE
+ - Remove a range of logical offsets from an inode
+ - ``struct ext4_fc_del_range``
+ - Stores the inode number and the logical offset range that needs to be
+ removed
+ * - EXT4_FC_TAG_CREAT
+ - Create directory entry for a newly created file
+ - ``struct ext4_fc_dentry_info``
+ - Stores the parent inode number, inode number and directory entry of the
+ newly created file
+ * - EXT4_FC_TAG_LINK
+ - Link a directory entry to an inode
+ - ``struct ext4_fc_dentry_info``
+ - Stores the parent inode number, inode number and directory entry
+ * - EXT4_FC_TAG_UNLINK
+ - Unlink a directory entry of an inode
+ - ``struct ext4_fc_dentry_info``
+ - Stores the parent inode number, inode number and directory entry
+
+ * - EXT4_FC_TAG_PAD
+ - Padding (unused area)
+ - None
+ - Unused bytes in the fast commit area.
+
+ * - EXT4_FC_TAG_TAIL
+ - Mark the end of a fast commit
+ - ``struct ext4_fc_tail``
+ - Stores the TID of the commit and the CRC of the fast commit that this
+ tag marks the end of
+
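+A sketch of walking the TLV log above (assumptions: ``struct ext4_fc_tl``
+is two little-endian 16-bit fields, and ``fc_len`` counts the value bytes
+that follow the header, as in current kernels)::
+
+    #include <endian.h>
+    #include <stdint.h>
+    #include <string.h>
+
+    struct ext4_fc_tl {        /* TLV header */
+        uint16_t fc_tag;
+        uint16_t fc_len;
+    };
+
+    static void walk_fc_area(const uint8_t *cur, const uint8_t *end)
+    {
+        struct ext4_fc_tl tl;
+
+        while (cur + sizeof(tl) <= end) {
+            memcpy(&tl, cur, sizeof(tl));  /* unaligned-safe copy */
+            /* dispatch on le16toh(tl.fc_tag): ADD_RANGE, CREAT, ... */
+            cur += sizeof(tl) + le16toh(tl.fc_len);
+        }
+    }
+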
+Fast Commit Replay Idempotence
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Fast commit tags are idempotent in nature, provided the recovery code follows
+certain rules. The guiding principle that the commit path follows while
+committing is that it stores the result of a particular operation instead of
+storing the procedure.
+
+Let's consider this rename operation: 'mv /a /b'. Let's assume dirent '/a'
+was associated with inode 10. During fast commit, instead of storing this
+operation as a procedure "rename a to b", we store the resulting file system
+state as a "series" of outcomes:
+
+- Link dirent b to inode 10
+- Unlink dirent a
+- Inode 10 with valid refcount
+
+Now when the recovery code runs, it needs to "enforce" this state on the file
+system. This is what guarantees idempotence of fast commit replay.
+
+Let's take an example of a procedure that is not idempotent and see how fast
+commits make it idempotent. Consider following sequence of operations:
+
+1) rm A
+2) mv B A
+3) read A
+
+If we store this sequence of operations as is, then the replay is not
+idempotent. Let's say that during replay, we crash after (2). During the
+second replay, file A (which was actually created as a result of the
+"mv B A" operation) would get deleted. Thus, the file named A would be absent
+when we try to read A. So, this sequence of operations is not idempotent.
+However, as mentioned above, instead of storing the procedure, fast commits
+store the outcome of each procedure. Thus the fast commit log for the above
+procedure would be as follows:
+
+(Let's assume dirent A was linked to inode 10 and dirent B was linked to
+inode 11 before the replay)
+
+1) Unlink A
+2) Link A to inode 11
+3) Unlink B
+4) Inode 11
+
+If we crash after (3), we will have file A linked to inode 11. During the
+second replay, we will remove file A (inode 11), but we will create it back
+and make it point to inode 11. We won't find B, so we'll just skip that step.
+At this point, the refcount for inode 11 is not reliable, but that gets fixed
+by the replay of the last inode 11 tag. Thus, by converting a non-idempotent
+procedure into a series of idempotent outcomes, fast commits ensure
+idempotence during the replay.
+
+Journal Checkpoint
+~~~~~~~~~~~~~~~~~~
+
+Checkpointing the journal ensures all transactions and their associated buffers
+are submitted to the disk. In-progress transactions are waited upon and included
+in the checkpoint. Checkpointing is used internally during critical updates to
+the filesystem including journal recovery, filesystem resizing, and freeing of
+the journal_t structure.
+
+A journal checkpoint can be triggered from userspace via the ioctl
+EXT4_IOC_CHECKPOINT. This ioctl takes a single u64 argument for flags.
+Currently, three flags are supported. First, EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN
+can be used to verify input to the ioctl. It returns an error if there is any
+invalid input; otherwise it returns success without performing
+any checkpointing. This can be used to check whether the ioctl exists on a
+system and to verify there are no issues with arguments or flags. The
+other two flags are EXT4_IOC_CHECKPOINT_FLAG_DISCARD and
+EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT. These flags cause the journal blocks to be
+discarded or zero-filled, respectively, after the journal checkpoint is
+complete. EXT4_IOC_CHECKPOINT_FLAG_DISCARD and EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT
+cannot both be set. The ioctl may be useful when snapshotting a system or for
+complying with content deletion SLOs.
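+
+A hedged userspace sketch of invoking this ioctl (the request number and
+flag values are copied from the kernel's uapi header as an assumption;
+check include/uapi/linux/ext4.h on your system)::
+
+    #include <fcntl.h>
+    #include <linux/types.h>
+    #include <stdio.h>
+    #include <sys/ioctl.h>
+    #include <unistd.h>
+
+    #ifndef EXT4_IOC_CHECKPOINT
+    #define EXT4_IOC_CHECKPOINT               _IOW('f', 43, __u32)
+    #define EXT4_IOC_CHECKPOINT_FLAG_DISCARD  0x1u
+    #define EXT4_IOC_CHECKPOINT_FLAG_ZEROOUT  0x2u
+    #define EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN  0x4u
+    #endif
+
+    int main(void)
+    {
+        /* Any open file or directory on the ext4 filesystem works. */
+        int fd = open("/mnt", O_RDONLY);
+        __u64 flags = EXT4_IOC_CHECKPOINT_FLAG_DRY_RUN;
+
+        if (fd < 0 || ioctl(fd, EXT4_IOC_CHECKPOINT, &flags) < 0) {
+            perror("EXT4_IOC_CHECKPOINT");
+            return 1;
+        }
+        close(fd);
+        return 0;
+    }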
diff --git a/Documentation/filesystems/ext4/mmp.rst b/Documentation/filesystems/ext4/mmp.rst
index 25660981d93c..174dd6538737 100644
--- a/Documentation/filesystems/ext4/mmp.rst
+++ b/Documentation/filesystems/ext4/mmp.rst
@@ -7,8 +7,8 @@ Multiple mount protection (MMP) is a feature that protects the
filesystem against multiple hosts trying to use the filesystem
simultaneously. When a filesystem is opened (for mounting, or fsck,
etc.), the MMP code running on the node (call it node A) checks a
-sequence number. If the sequence number is EXT4\_MMP\_SEQ\_CLEAN, the
-open continues. If the sequence number is EXT4\_MMP\_SEQ\_FSCK, then
+sequence number. If the sequence number is EXT4_MMP_SEQ_CLEAN, the
+open continues. If the sequence number is EXT4_MMP_SEQ_FSCK, then
fsck is (hopefully) running, and open fails immediately. Otherwise, the
open code will wait for twice the specified MMP check interval and check
the sequence number again. If the sequence number has changed, then the
@@ -40,38 +40,38 @@ The MMP structure (``struct mmp_struct``) is as follows:
- Name
- Description
* - 0x0
- - \_\_le32
- - mmp\_magic
+ - __le32
+ - mmp_magic
- Magic number for MMP, 0x004D4D50 (“MMP”).
* - 0x4
- - \_\_le32
- - mmp\_seq
+ - __le32
+ - mmp_seq
- Sequence number, updated periodically.
* - 0x8
- - \_\_le64
- - mmp\_time
+ - __le64
+ - mmp_time
- Time that the MMP block was last updated.
* - 0x10
- char[64]
- - mmp\_nodename
+ - mmp_nodename
- Hostname of the node that opened the filesystem.
* - 0x50
- char[32]
- - mmp\_bdevname
+ - mmp_bdevname
- Block device name of the filesystem.
* - 0x70
- - \_\_le16
- - mmp\_check\_interval
+ - __le16
+ - mmp_check_interval
- The MMP re-check interval, in seconds.
* - 0x72
- - \_\_le16
- - mmp\_pad1
+ - __le16
+ - mmp_pad1
- Zero.
* - 0x74
- - \_\_le32[226]
- - mmp\_pad2
+ - __le32[226]
+ - mmp_pad2
- Zero.
* - 0x3FC
- - \_\_le32
- - mmp\_checksum
+ - __le32
+ - mmp_checksum
- Checksum of the MMP block.
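+
+A condensed sketch of the open-time check described at the top of this
+section (the sequence constants mirror the kernel's; ``reread_mmp_seq()``
+is a hypothetical helper that re-reads ``mmp_seq`` from the MMP block on
+disk)::
+
+    #include <stdint.h>
+    #include <unistd.h>
+
+    #define EXT4_MMP_SEQ_CLEAN 0xFF4D4D50u
+    #define EXT4_MMP_SEQ_FSCK  0xE24D4D50u
+
+    static uint32_t reread_mmp_seq(void);    /* hypothetical helper */
+
+    static int mmp_open_allowed(uint32_t seq, uint32_t check_interval)
+    {
+        if (seq == EXT4_MMP_SEQ_CLEAN)
+            return 1;                        /* clean: open continues */
+        if (seq == EXT4_MMP_SEQ_FSCK)
+            return 0;                        /* fsck running: fail now */
+
+        sleep(2 * check_interval);           /* wait 2x the MMP interval */
+        return reread_mmp_seq() == seq;      /* changed: fs is in use */
+    }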
diff --git a/Documentation/filesystems/ext4/orphan.rst b/Documentation/filesystems/ext4/orphan.rst
new file mode 100644
index 000000000000..03cca178864b
--- /dev/null
+++ b/Documentation/filesystems/ext4/orphan.rst
@@ -0,0 +1,42 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Orphan file
+-----------
+
+In Unix there can be inodes that are unlinked from the directory hierarchy but
+that are still alive because they are open. In case of a crash the filesystem
+has to clean up these inodes, as otherwise they (and the blocks referenced
+from them) would leak. Similarly, if we truncate or extend a file, we may not
+be able to perform the operation in a single journalling transaction. In such
+cases we track the inode as an orphan so that after a crash any extra blocks
+allocated to the file get truncated.
+
+Traditionally ext4 tracks orphan inodes in the form of a singly linked list,
+where the superblock contains the inode number of the last orphan inode (the
+s_last_orphan field) and each inode contains the inode number of the
+previously orphaned inode (the i_dtime inode field is overloaded for this).
+However, this filesystem-global singly linked list is a scalability bottleneck
+for workloads that result in heavy creation of orphan inodes. When the orphan
+file feature (COMPAT_ORPHAN_FILE) is enabled, the filesystem has a special
+inode (referenced from the superblock through s_orphan_file_inum) with several
+blocks. Each of these blocks has the following structure:
+
+============= ================ =============== ===============================
+Offset Type Name Description
+============= ================ =============== ===============================
+0x0 Array of Orphan inode Each __le32 entry is either
+ __le32 entries entries empty (0) or it contains
+ inode number of an orphan
+ inode.
+blocksize-8 __le32 ob_magic Magic value stored in orphan
+ block tail (0x0b10ca04)
+blocksize-4 __le32 ob_checksum Checksum of the orphan block.
+============= ================ =============== ===============================
+
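+A sketch of this block layout in C (the tail struct mirrors the kernel's
+``struct ext4_orphan_block_tail``; treat the exact names as an
+assumption)::
+
+    #include <stdint.h>
+
+    #define EXT4_ORPHAN_BLOCK_MAGIC 0x0b10ca04
+
+    struct ext4_orphan_block_tail {
+        uint32_t ob_magic;       /* __le32, at blocksize - 8 */
+        uint32_t ob_checksum;    /* __le32, at blocksize - 4 */
+    };
+
+    /* __le32 orphan inode entries fill the block up to the tail. */
+    static inline uint32_t orphan_entries_per_block(uint32_t blocksize)
+    {
+        return (blocksize - sizeof(struct ext4_orphan_block_tail)) / 4;
+    }
+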
+When a filesystem with the orphan file feature is mounted writeably, we set
+the RO_COMPAT_ORPHAN_PRESENT feature in the superblock to indicate there may
+be valid orphan entries. If we see this feature when mounting the
+filesystem, we read the whole orphan file and process all orphan inodes found
+there as usual. When cleanly unmounting the filesystem we remove the
+RO_COMPAT_ORPHAN_PRESENT feature to avoid unnecessary scanning of the orphan
+file and also to keep the filesystem fully compatible with older kernels.
diff --git a/Documentation/filesystems/ext4/overview.rst b/Documentation/filesystems/ext4/overview.rst
index 123ebfde47ee..9d4054c17ecb 100644
--- a/Documentation/filesystems/ext4/overview.rst
+++ b/Documentation/filesystems/ext4/overview.rst
@@ -7,7 +7,7 @@ An ext4 file system is split into a series of block groups. To reduce
performance difficulties due to fragmentation, the block allocator tries
very hard to keep each file's blocks within the same group, thereby
reducing seek times. The size of a block group is specified in
-``sb.s_blocks_per_group`` blocks, though it can also calculated as 8 \*
+``sb.s_blocks_per_group`` blocks, though it can also be calculated as 8 *
``block_size_in_bytes``. With the default block size of 4KiB, each group
will contain 32,768 blocks, for a length of 128MiB. The number of block
groups is the size of the device divided by the size of a block group.
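+
+A quick arithmetic sketch of these relationships (illustration only; the
+8x factor exists because one block-bitmap block has 8 bits per byte of
+block size)::
+
+    #include <stdint.h>
+
+    /* With the default s_log_block_size = 2 (4 KiB blocks):
+     * 8 * 4096 = 32768 blocks/group; 32768 * 4096 bytes = 128 MiB. */
+    static uint64_t block_group_bytes(uint32_t s_log_block_size)
+    {
+        uint32_t block_size       = 1u << (10 + s_log_block_size);
+        uint32_t blocks_per_group = 8 * block_size;
+
+        return (uint64_t)blocks_per_group * block_size;
+    }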
@@ -25,3 +25,4 @@ order.
.. include:: inlinedata.rst
.. include:: eainode.rst
.. include:: verity.rst
+.. include:: atomic_writes.rst
diff --git a/Documentation/filesystems/ext4/special_inodes.rst b/Documentation/filesystems/ext4/special_inodes.rst
index 9061aabba827..fc0636901fa0 100644
--- a/Documentation/filesystems/ext4/special_inodes.rst
+++ b/Documentation/filesystems/ext4/special_inodes.rst
@@ -34,5 +34,22 @@ ext4 reserves some inode for special features, as follows:
* - 10
- Replica inode, used for some non-upstream feature?
* - 11
- - Traditional first non-reserved inode. Usually this is the lost+found directory. See s\_first\_ino in the superblock.
+ - Traditional first non-reserved inode. Usually this is the lost+found directory. See s_first_ino in the superblock.
+Note that there are also some inodes allocated from non-reserved inode numbers
+for other filesystem features; these are not referenced from the standard
+directory hierarchy. They are generally referenced from the superblock. They are:
+
+.. list-table::
+ :widths: 20 50
+ :header-rows: 1
+
+ * - Superblock field
+ - Description
+
+ * - s_lpf_ino
+ - Inode number of lost+found directory.
+ * - s_prj_quota_inum
+ - Inode number of quota file tracking project quotas.
+ * - s_orphan_file_inum
+ - Inode number of file tracking orphan inodes.
diff --git a/Documentation/filesystems/ext4/super.rst b/Documentation/filesystems/ext4/super.rst
index 93e55d7c1d40..1b240661bfa3 100644
--- a/Documentation/filesystems/ext4/super.rst
+++ b/Documentation/filesystems/ext4/super.rst
@@ -7,7 +7,7 @@ The superblock records various information about the enclosing
filesystem, such as block counts, inode counts, supported features,
maintenance information, and more.
-If the sparse\_super feature flag is set, redundant copies of the
+If the sparse_super feature flag is set, redundant copies of the
superblock and group descriptors are kept only in the groups whose group
number is either 0 or a power of 3, 5, or 7. If the flag is not set,
redundant copies are kept in all groups.
@@ -27,107 +27,107 @@ The ext4 superblock is laid out as follows in
- Name
- Description
* - 0x0
- - \_\_le32
- - s\_inodes\_count
+ - __le32
+ - s_inodes_count
- Total inode count.
* - 0x4
- - \_\_le32
- - s\_blocks\_count\_lo
+ - __le32
+ - s_blocks_count_lo
- Total block count.
* - 0x8
- - \_\_le32
- - s\_r\_blocks\_count\_lo
+ - __le32
+ - s_r_blocks_count_lo
- This number of blocks can only be allocated by the super-user.
* - 0xC
- - \_\_le32
- - s\_free\_blocks\_count\_lo
+ - __le32
+ - s_free_blocks_count_lo
- Free block count.
* - 0x10
- - \_\_le32
- - s\_free\_inodes\_count
+ - __le32
+ - s_free_inodes_count
- Free inode count.
* - 0x14
- - \_\_le32
- - s\_first\_data\_block
+ - __le32
+ - s_first_data_block
- First data block. This must be at least 1 for 1k-block filesystems and
is typically 0 for all other block sizes.
* - 0x18
- - \_\_le32
- - s\_log\_block\_size
- - Block size is 2 ^ (10 + s\_log\_block\_size).
+ - __le32
+ - s_log_block_size
+ - Block size is 2 ^ (10 + s_log_block_size).
* - 0x1C
- - \_\_le32
- - s\_log\_cluster\_size
- - Cluster size is 2 ^ (10 + s\_log\_cluster\_size) blocks if bigalloc is
- enabled. Otherwise s\_log\_cluster\_size must equal s\_log\_block\_size.
+ - __le32
+ - s_log_cluster_size
+ - Cluster size is 2 ^ (10 + s_log_cluster_size) blocks if bigalloc is
+ enabled. Otherwise s_log_cluster_size must equal s_log_block_size.
* - 0x20
- - \_\_le32
- - s\_blocks\_per\_group
+ - __le32
+ - s_blocks_per_group
- Blocks per group.
* - 0x24
- - \_\_le32
- - s\_clusters\_per\_group
+ - __le32
+ - s_clusters_per_group
- Clusters per group, if bigalloc is enabled. Otherwise
- s\_clusters\_per\_group must equal s\_blocks\_per\_group.
+ s_clusters_per_group must equal s_blocks_per_group.
* - 0x28
- - \_\_le32
- - s\_inodes\_per\_group
+ - __le32
+ - s_inodes_per_group
- Inodes per group.
* - 0x2C
- - \_\_le32
- - s\_mtime
+ - __le32
+ - s_mtime
- Mount time, in seconds since the epoch.
* - 0x30
- - \_\_le32
- - s\_wtime
+ - __le32
+ - s_wtime
- Write time, in seconds since the epoch.
* - 0x34
- - \_\_le16
- - s\_mnt\_count
+ - __le16
+ - s_mnt_count
- Number of mounts since the last fsck.
* - 0x36
- - \_\_le16
- - s\_max\_mnt\_count
+ - __le16
+ - s_max_mnt_count
- Number of mounts beyond which a fsck is needed.
* - 0x38
- - \_\_le16
- - s\_magic
+ - __le16
+ - s_magic
- Magic signature, 0xEF53
* - 0x3A
- - \_\_le16
- - s\_state
+ - __le16
+ - s_state
- File system state. See super_state_ for more info.
* - 0x3C
- - \_\_le16
- - s\_errors
+ - __le16
+ - s_errors
- Behaviour when detecting errors. See super_errors_ for more info.
* - 0x3E
- - \_\_le16
- - s\_minor\_rev\_level
+ - __le16
+ - s_minor_rev_level
- Minor revision level.
* - 0x40
- - \_\_le32
- - s\_lastcheck
+ - __le32
+ - s_lastcheck
- Time of last check, in seconds since the epoch.
* - 0x44
- - \_\_le32
- - s\_checkinterval
+ - __le32
+ - s_checkinterval
- Maximum time between checks, in seconds.
* - 0x48
- - \_\_le32
- - s\_creator\_os
+ - __le32
+ - s_creator_os
- Creator OS. See the table super_creator_ for more info.
* - 0x4C
- - \_\_le32
- - s\_rev\_level
+ - __le32
+ - s_rev_level
- Revision level. See the table super_revision_ for more info.
* - 0x50
- - \_\_le16
- - s\_def\_resuid
+ - __le16
+ - s_def_resuid
- Default uid for reserved blocks.
* - 0x52
- - \_\_le16
- - s\_def\_resgid
+ - __le16
+ - s_def_resgid
- Default gid for reserved blocks.
* -
-
@@ -143,50 +143,50 @@ The ext4 superblock is laid out as follows in
about a feature in either the compatible or incompatible feature set, it
must abort and not try to meddle with things it doesn't understand...
* - 0x54
- - \_\_le32
- - s\_first\_ino
+ - __le32
+ - s_first_ino
- First non-reserved inode.
* - 0x58
- - \_\_le16
- - s\_inode\_size
+ - __le16
+ - s_inode_size
- Size of inode structure, in bytes.
* - 0x5A
- - \_\_le16
- - s\_block\_group\_nr
+ - __le16
+ - s_block_group_nr
- Block group # of this superblock.
* - 0x5C
- - \_\_le32
- - s\_feature\_compat
+ - __le32
+ - s_feature_compat
- Compatible feature set flags. Kernel can still read/write this fs even
if it doesn't understand a flag; fsck should not do that. See the
super_compat_ table for more info.
* - 0x60
- - \_\_le32
- - s\_feature\_incompat
+ - __le32
+ - s_feature_incompat
- Incompatible feature set. If the kernel or fsck doesn't understand one
of these bits, it should stop. See the super_incompat_ table for more
info.
* - 0x64
- - \_\_le32
- - s\_feature\_ro\_compat
+ - __le32
+ - s_feature_ro_compat
- Readonly-compatible feature set. If the kernel doesn't understand one of
these bits, it can still mount read-only. See the super_rocompat_ table
for more info.
* - 0x68
- - \_\_u8
- - s\_uuid[16]
+ - __u8
+ - s_uuid[16]
- 128-bit UUID for volume.
* - 0x78
- char
- - s\_volume\_name[16]
+ - s_volume_name[16]
- Volume label.
* - 0x88
- char
- - s\_last\_mounted[64]
+ - s_last_mounted[64]
- Directory where filesystem was last mounted.
* - 0xC8
- - \_\_le32
- - s\_algorithm\_usage\_bitmap
+ - __le32
+ - s_algorithm_usage_bitmap
- For compression (Not used in e2fsprogs/Linux)
* -
-
@@ -194,18 +194,18 @@ The ext4 superblock is laid out as follows in
- Performance hints. Directory preallocation should only happen if the
EXT4_FEATURE_COMPAT_DIR_PREALLOC flag is on.
* - 0xCC
- - \_\_u8
- - s\_prealloc\_blocks
+ - __u8
+ - s_prealloc_blocks
- #. of blocks to try to preallocate for ... files? (Not used in
e2fsprogs/Linux)
* - 0xCD
- - \_\_u8
- - s\_prealloc\_dir\_blocks
+ - __u8
+ - s_prealloc_dir_blocks
- #. of blocks to preallocate for directories. (Not used in
e2fsprogs/Linux)
* - 0xCE
- - \_\_le16
- - s\_reserved\_gdt\_blocks
+ - __le16
+ - s_reserved_gdt_blocks
- Number of reserved GDT entries for future filesystem expansion.
* -
-
@@ -213,277 +213,289 @@ The ext4 superblock is laid out as follows in
- Journalling support is valid only if EXT4_FEATURE_COMPAT_HAS_JOURNAL is
set.
* - 0xD0
- - \_\_u8
- - s\_journal\_uuid[16]
+ - __u8
+ - s_journal_uuid[16]
- UUID of journal superblock
* - 0xE0
- - \_\_le32
- - s\_journal\_inum
+ - __le32
+ - s_journal_inum
- inode number of journal file.
* - 0xE4
- - \_\_le32
- - s\_journal\_dev
+ - __le32
+ - s_journal_dev
- Device number of journal file, if the external journal feature flag is
set.
* - 0xE8
- - \_\_le32
- - s\_last\_orphan
+ - __le32
+ - s_last_orphan
- Start of list of orphaned inodes to delete.
* - 0xEC
- - \_\_le32
- - s\_hash\_seed[4]
+ - __le32
+ - s_hash_seed[4]
- HTREE hash seed.
* - 0xFC
- - \_\_u8
- - s\_def\_hash\_version
+ - __u8
+ - s_def_hash_version
- Default hash algorithm to use for directory hashes. See super_def_hash_
for more info.
* - 0xFD
- - \_\_u8
- - s\_jnl\_backup\_type
- - If this value is 0 or EXT3\_JNL\_BACKUP\_BLOCKS (1), then the
+ - __u8
+ - s_jnl_backup_type
+ - If this value is 0 or EXT3_JNL_BACKUP_BLOCKS (1), then the
``s_jnl_blocks`` field contains a duplicate copy of the inode's
``i_block[]`` array and ``i_size``.
* - 0xFE
- - \_\_le16
- - s\_desc\_size
+ - __le16
+ - s_desc_size
- Size of group descriptors, in bytes, if the 64bit incompat feature flag
is set.
* - 0x100
- - \_\_le32
- - s\_default\_mount\_opts
+ - __le32
+ - s_default_mount_opts
- Default mount options. See the super_mountopts_ table for more info.
* - 0x104
- - \_\_le32
- - s\_first\_meta\_bg
- - First metablock block group, if the meta\_bg feature is enabled.
+ - __le32
+ - s_first_meta_bg
+ - First metablock block group, if the meta_bg feature is enabled.
* - 0x108
- - \_\_le32
- - s\_mkfs\_time
+ - __le32
+ - s_mkfs_time
- When the filesystem was created, in seconds since the epoch.
* - 0x10C
- - \_\_le32
- - s\_jnl\_blocks[17]
+ - __le32
+ - s_jnl_blocks[17]
- Backup copy of the journal inode's ``i_block[]`` array in the first 15
- elements and i\_size\_high and i\_size in the 16th and 17th elements,
+ elements and i_size_high and i_size in the 16th and 17th elements,
respectively.
* -
-
-
- 64bit support is valid only if EXT4_FEATURE_COMPAT_64BIT is set.
* - 0x150
- - \_\_le32
- - s\_blocks\_count\_hi
+ - __le32
+ - s_blocks_count_hi
- High 32-bits of the block count.
* - 0x154
- - \_\_le32
- - s\_r\_blocks\_count\_hi
+ - __le32
+ - s_r_blocks_count_hi
- High 32-bits of the reserved block count.
* - 0x158
- - \_\_le32
- - s\_free\_blocks\_count\_hi
+ - __le32
+ - s_free_blocks_count_hi
- High 32-bits of the free block count.
* - 0x15C
- - \_\_le16
- - s\_min\_extra\_isize
+ - __le16
+ - s_min_extra_isize
- All inodes have at least # bytes.
* - 0x15E
- - \_\_le16
- - s\_want\_extra\_isize
+ - __le16
+ - s_want_extra_isize
- New inodes should reserve # bytes.
* - 0x160
- - \_\_le32
- - s\_flags
+ - __le32
+ - s_flags
- Miscellaneous flags. See the super_flags_ table for more info.
* - 0x164
- - \_\_le16
- - s\_raid\_stride
+ - __le16
+ - s_raid_stride
- RAID stride. This is the number of logical blocks read from or written
to the disk before moving to the next disk. This affects the placement
of filesystem metadata, which will hopefully make RAID storage faster.
* - 0x166
- - \_\_le16
- - s\_mmp\_interval
+ - __le16
+ - s_mmp_interval
- #. seconds to wait in multi-mount prevention (MMP) checking. In theory,
MMP is a mechanism to record in the superblock which host and device
have mounted the filesystem, in order to prevent multiple mounts. This
feature does not seem to be implemented...
* - 0x168
- - \_\_le64
- - s\_mmp\_block
+ - __le64
+ - s_mmp_block
- Block # for multi-mount protection data.
* - 0x170
- - \_\_le32
- - s\_raid\_stripe\_width
+ - __le32
+ - s_raid_stripe_width
- RAID stripe width. This is the number of logical blocks read from or
written to the disk before coming back to the current disk. This is used
by the block allocator to try to reduce the number of read-modify-write
operations in a RAID5/6.
* - 0x174
- - \_\_u8
- - s\_log\_groups\_per\_flex
+ - __u8
+ - s_log_groups_per_flex
- Size of a flexible block group is 2 ^ ``s_log_groups_per_flex``.
* - 0x175
- - \_\_u8
- - s\_checksum\_type
+ - __u8
+ - s_checksum_type
- Metadata checksum algorithm type. The only valid value is 1 (crc32c).
* - 0x176
- - \_\_le16
- - s\_reserved\_pad
- -
+ - __u8
+ - s_encryption_level
+ - Versioning level for encryption.
+ * - 0x177
+ - __u8
+ - s_reserved_pad
+ - Padding to next 32bits.
* - 0x178
- - \_\_le64
- - s\_kbytes\_written
+ - __le64
+ - s_kbytes_written
- Number of KiB written to this filesystem over its lifetime.
* - 0x180
- - \_\_le32
- - s\_snapshot\_inum
+ - __le32
+ - s_snapshot_inum
- inode number of active snapshot. (Not used in e2fsprogs/Linux.)
* - 0x184
- - \_\_le32
- - s\_snapshot\_id
+ - __le32
+ - s_snapshot_id
- Sequential ID of active snapshot. (Not used in e2fsprogs/Linux.)
* - 0x188
- - \_\_le64
- - s\_snapshot\_r\_blocks\_count
+ - __le64
+ - s_snapshot_r_blocks_count
- Number of blocks reserved for active snapshot's future use. (Not used in
e2fsprogs/Linux.)
* - 0x190
- - \_\_le32
- - s\_snapshot\_list
+ - __le32
+ - s_snapshot_list
- inode number of the head of the on-disk snapshot list. (Not used in
e2fsprogs/Linux.)
* - 0x194
- - \_\_le32
- - s\_error\_count
+ - __le32
+ - s_error_count
- Number of errors seen.
* - 0x198
- - \_\_le32
- - s\_first\_error\_time
+ - __le32
+ - s_first_error_time
- First time an error happened, in seconds since the epoch.
* - 0x19C
- - \_\_le32
- - s\_first\_error\_ino
+ - __le32
+ - s_first_error_ino
- inode involved in first error.
* - 0x1A0
- - \_\_le64
- - s\_first\_error\_block
+ - __le64
+ - s_first_error_block
- Number of block involved of first error.
* - 0x1A8
- - \_\_u8
- - s\_first\_error\_func[32]
+ - __u8
+ - s_first_error_func[32]
- Name of function where the error happened.
* - 0x1C8
- - \_\_le32
- - s\_first\_error\_line
+ - __le32
+ - s_first_error_line
- Line number where error happened.
* - 0x1CC
- - \_\_le32
- - s\_last\_error\_time
+ - __le32
+ - s_last_error_time
- Time of most recent error, in seconds since the epoch.
* - 0x1D0
- - \_\_le32
- - s\_last\_error\_ino
+ - __le32
+ - s_last_error_ino
- inode involved in most recent error.
* - 0x1D4
- - \_\_le32
- - s\_last\_error\_line
+ - __le32
+ - s_last_error_line
- Line number where most recent error happened.
* - 0x1D8
- - \_\_le64
- - s\_last\_error\_block
+ - __le64
+ - s_last_error_block
- Number of block involved in most recent error.
* - 0x1E0
- - \_\_u8
- - s\_last\_error\_func[32]
+ - __u8
+ - s_last_error_func[32]
- Name of function where the most recent error happened.
* - 0x200
- - \_\_u8
- - s\_mount\_opts[64]
+ - __u8
+ - s_mount_opts[64]
- ASCIIZ string of mount options.
* - 0x240
- - \_\_le32
- - s\_usr\_quota\_inum
+ - __le32
+ - s_usr_quota_inum
- Inode number of user `quota <quota>`__ file.
* - 0x244
- - \_\_le32
- - s\_grp\_quota\_inum
+ - __le32
+ - s_grp_quota_inum
- Inode number of group `quota <quota>`__ file.
* - 0x248
- - \_\_le32
- - s\_overhead\_blocks
+ - __le32
+ - s_overhead_blocks
- Overhead blocks/clusters in fs. (Huh? This field is always zero, which
means that the kernel calculates it dynamically.)
* - 0x24C
- - \_\_le32
- - s\_backup\_bgs[2]
- - Block groups containing superblock backups (if sparse\_super2)
+ - __le32
+ - s_backup_bgs[2]
+ - Block groups containing superblock backups (if sparse_super2)
* - 0x254
- - \_\_u8
- - s\_encrypt\_algos[4]
+ - __u8
+ - s_encrypt_algos[4]
- Encryption algorithms in use. There can be up to four algorithms in use
at any time; valid algorithm codes are given in the super_encrypt_ table
below.
* - 0x258
- - \_\_u8
- - s\_encrypt\_pw\_salt[16]
+ - __u8
+ - s_encrypt_pw_salt[16]
- Salt for the string2key algorithm for encryption.
* - 0x268
- - \_\_le32
- - s\_lpf\_ino
+ - __le32
+ - s_lpf_ino
- Inode number of lost+found
* - 0x26C
- - \_\_le32
- - s\_prj\_quota\_inum
+ - __le32
+ - s_prj_quota_inum
- Inode that tracks project quotas.
* - 0x270
- - \_\_le32
- - s\_checksum\_seed
- - Checksum seed used for metadata\_csum calculations. This value is
- crc32c(~0, $orig\_fs\_uuid).
+ - __le32
+ - s_checksum_seed
+ - Checksum seed used for metadata_csum calculations. This value is
+ crc32c(~0, $orig_fs_uuid).
* - 0x274
- - \_\_u8
- - s\_wtime_hi
+ - __u8
+ - s_wtime_hi
- Upper 8 bits of the s_wtime field.
* - 0x275
- - \_\_u8
- - s\_mtime_hi
+ - __u8
+ - s_mtime_hi
- Upper 8 bits of the s_mtime field.
* - 0x276
- - \_\_u8
- - s\_mkfs_time_hi
+ - __u8
+ - s_mkfs_time_hi
- Upper 8 bits of the s_mkfs_time field.
* - 0x277
- - \_\_u8
- - s\_lastcheck_hi
- - Upper 8 bits of the s_lastcheck_hi field.
+ - __u8
+ - s_lastcheck_hi
+ - Upper 8 bits of the s_lastcheck field.
* - 0x278
- - \_\_u8
- - s\_first_error_time_hi
- - Upper 8 bits of the s_first_error_time_hi field.
+ - __u8
+ - s_first_error_time_hi
+ - Upper 8 bits of the s_first_error_time field.
* - 0x279
- - \_\_u8
- - s\_last_error_time_hi
- - Upper 8 bits of the s_last_error_time_hi field.
+ - __u8
+ - s_last_error_time_hi
+ - Upper 8 bits of the s_last_error_time field.
* - 0x27A
- - \_\_u8
- - s\_pad[2]
- - Zero padding.
+ - __u8
+ - s_first_error_errcode
+ - Error code of first error.
+ * - 0x27B
+ - __u8
+ - s_last_error_errcode
+ - Error code of most recent error.
* - 0x27C
- - \_\_le16
- - s\_encoding
+ - __le16
+ - s_encoding
- Filename charset encoding.
* - 0x27E
- - \_\_le16
- - s\_encoding_flags
+ - __le16
+ - s_encoding_flags
- Filename charset encoding flags.
* - 0x280
- - \_\_le32
- - s\_reserved[95]
+ - __le32
+ - s_orphan_file_inum
+ - Orphan file inode number.
+ * - 0x284
+ - __le32
+ - s_reserved[94]
- Padding to the end of the block.
* - 0x3FC
- - \_\_le32
- - s\_checksum
+ - __le32
+ - s_checksum
- Superblock checksum.
.. _super_state:
@@ -570,32 +582,44 @@ following:
* - Value
- Description
* - 0x1
- - Directory preallocation (COMPAT\_DIR\_PREALLOC).
+ - Directory preallocation (COMPAT_DIR_PREALLOC).
* - 0x2
- “imagic inodes”. Not clear from the code what this does
- (COMPAT\_IMAGIC\_INODES).
+ (COMPAT_IMAGIC_INODES).
* - 0x4
- - Has a journal (COMPAT\_HAS\_JOURNAL).
+ - Has a journal (COMPAT_HAS_JOURNAL).
* - 0x8
- - Supports extended attributes (COMPAT\_EXT\_ATTR).
+ - Supports extended attributes (COMPAT_EXT_ATTR).
* - 0x10
- Has reserved GDT blocks for filesystem expansion
- (COMPAT\_RESIZE\_INODE). Requires RO\_COMPAT\_SPARSE\_SUPER.
+ (COMPAT_RESIZE_INODE). Requires RO_COMPAT_SPARSE_SUPER.
* - 0x20
- - Has directory indices (COMPAT\_DIR\_INDEX).
+ - Has directory indices (COMPAT_DIR_INDEX).
* - 0x40
- “Lazy BG”. Not in Linux kernel, seems to have been for uninitialized
- block groups? (COMPAT\_LAZY\_BG)
+ block groups? (COMPAT_LAZY_BG)
* - 0x80
- - “Exclude inode”. Not used. (COMPAT\_EXCLUDE\_INODE).
+ - “Exclude inode”. Not used. (COMPAT_EXCLUDE_INODE).
* - 0x100
- “Exclude bitmap”. Seems to be used to indicate the presence of
snapshot-related exclude bitmaps? Not defined in kernel or used in
- e2fsprogs (COMPAT\_EXCLUDE\_BITMAP).
+ e2fsprogs (COMPAT_EXCLUDE_BITMAP).
* - 0x200
- - Sparse Super Block, v2. If this flag is set, the SB field s\_backup\_bgs
+ - Sparse Super Block, v2. If this flag is set, the SB field s_backup_bgs
points to the two block groups that contain backup superblocks
- (COMPAT\_SPARSE\_SUPER2).
+ (COMPAT_SPARSE_SUPER2).
+ * - 0x400
+ - Fast commits supported. Although fast commit blocks are
+ backward incompatible, they are not always present in
+ the journal. If fast commit blocks are present in the journal,
+ the JBD2 incompat feature
+ (JBD2_FEATURE_INCOMPAT_FAST_COMMIT) gets
+ set (COMPAT_FAST_COMMIT).
+ * - 0x1000
+ - Orphan file allocated (COMPAT_ORPHAN_FILE). This is the special file
+ for more efficient tracking of unlinked but still open inodes. When
+ there may be any entries in the file, we additionally set the
+ corresponding rocompat feature (RO_COMPAT_ORPHAN_PRESENT).
.. _super_incompat:
@@ -609,45 +633,45 @@ following:
* - Value
- Description
* - 0x1
- - Compression (INCOMPAT\_COMPRESSION).
+ - Compression (INCOMPAT_COMPRESSION).
* - 0x2
- - Directory entries record the file type. See ext4\_dir\_entry\_2 below
- (INCOMPAT\_FILETYPE).
+ - Directory entries record the file type. See ext4_dir_entry_2 below
+ (INCOMPAT_FILETYPE).
* - 0x4
- - Filesystem needs recovery (INCOMPAT\_RECOVER).
+ - Filesystem needs recovery (INCOMPAT_RECOVER).
* - 0x8
- - Filesystem has a separate journal device (INCOMPAT\_JOURNAL\_DEV).
+ - Filesystem has a separate journal device (INCOMPAT_JOURNAL_DEV).
* - 0x10
- Meta block groups. See the earlier discussion of this feature
- (INCOMPAT\_META\_BG).
+ (INCOMPAT_META_BG).
* - 0x40
- - Files in this filesystem use extents (INCOMPAT\_EXTENTS).
+ - Files in this filesystem use extents (INCOMPAT_EXTENTS).
* - 0x80
- - Enable a filesystem size of 2^64 blocks (INCOMPAT\_64BIT).
+ - Enable a filesystem size of 2^64 blocks (INCOMPAT_64BIT).
* - 0x100
- - Multiple mount protection (INCOMPAT\_MMP).
+ - Multiple mount protection (INCOMPAT_MMP).
* - 0x200
- Flexible block groups. See the earlier discussion of this feature
- (INCOMPAT\_FLEX\_BG).
+ (INCOMPAT_FLEX_BG).
* - 0x400
- Inodes can be used to store large extended attribute values
- (INCOMPAT\_EA\_INODE).
+ (INCOMPAT_EA_INODE).
* - 0x1000
- - Data in directory entry (INCOMPAT\_DIRDATA). (Not implemented?)
+ - Data in directory entry (INCOMPAT_DIRDATA). (Not implemented?)
* - 0x2000
- Metadata checksum seed is stored in the superblock. This feature enables
- the administrator to change the UUID of a metadata\_csum filesystem
+ the administrator to change the UUID of a metadata_csum filesystem
while the filesystem is mounted; without it, the checksum definition
- requires all metadata blocks to be rewritten (INCOMPAT\_CSUM\_SEED).
+ requires all metadata blocks to be rewritten (INCOMPAT_CSUM_SEED).
* - 0x4000
- - Large directory >2GB or 3-level htree (INCOMPAT\_LARGEDIR). Prior to
+ - Large directory >2GB or 3-level htree (INCOMPAT_LARGEDIR). Prior to
this feature, directories could not be larger than 4GiB and could not
have an htree more than 2 levels deep. If this feature is enabled,
directories can be larger than 4GiB and have a maximum htree depth of 3.
* - 0x8000
- - Data in inode (INCOMPAT\_INLINE\_DATA).
+ - Data in inode (INCOMPAT_INLINE_DATA).
* - 0x10000
- - Encrypted inodes are present on the filesystem. (INCOMPAT\_ENCRYPT).
+ - Encrypted inodes are present on the filesystem. (INCOMPAT_ENCRYPT).
.. _super_rocompat:
@@ -662,50 +686,54 @@ the following:
- Description
* - 0x1
- Sparse superblocks. See the earlier discussion of this feature
- (RO\_COMPAT\_SPARSE\_SUPER).
+ (RO_COMPAT_SPARSE_SUPER).
* - 0x2
- This filesystem has been used to store a file greater than 2GiB
- (RO\_COMPAT\_LARGE\_FILE).
+ (RO_COMPAT_LARGE_FILE).
* - 0x4
- - Not used in kernel or e2fsprogs (RO\_COMPAT\_BTREE\_DIR).
+ - Not used in kernel or e2fsprogs (RO_COMPAT_BTREE_DIR).
* - 0x8
- This filesystem has files whose sizes are represented in units of
logical blocks, not 512-byte sectors. This implies a very large file
- indeed! (RO\_COMPAT\_HUGE\_FILE)
+ indeed! (RO_COMPAT_HUGE_FILE)
* - 0x10
- Group descriptors have checksums. In addition to detecting corruption,
this is useful for lazy formatting with uninitialized groups
- (RO\_COMPAT\_GDT\_CSUM).
+ (RO_COMPAT_GDT_CSUM).
* - 0x20
- Indicates that the old ext3 32,000 subdirectory limit no longer applies
- (RO\_COMPAT\_DIR\_NLINK). A directory's i\_links\_count will be set to 1
+ (RO_COMPAT_DIR_NLINK). A directory's i_links_count will be set to 1
if it is incremented past 64,999.
* - 0x40
- Indicates that large inodes exist on this filesystem
- (RO\_COMPAT\_EXTRA\_ISIZE).
+ (RO_COMPAT_EXTRA_ISIZE).
* - 0x80
- - This filesystem has a snapshot (RO\_COMPAT\_HAS\_SNAPSHOT).
+ - This filesystem has a snapshot (RO_COMPAT_HAS_SNAPSHOT).
* - 0x100
- - `Quota <Quota>`__ (RO\_COMPAT\_QUOTA).
+ - `Quota <Quota>`__ (RO_COMPAT_QUOTA).
* - 0x200
- This filesystem supports “bigalloc”, which means that file extents are
tracked in units of clusters (of blocks) instead of blocks
- (RO\_COMPAT\_BIGALLOC).
+ (RO_COMPAT_BIGALLOC).
* - 0x400
- This filesystem supports metadata checksumming.
- (RO\_COMPAT\_METADATA\_CSUM; implies RO\_COMPAT\_GDT\_CSUM, though
- GDT\_CSUM must not be set)
+ (RO_COMPAT_METADATA_CSUM; implies RO_COMPAT_GDT_CSUM, though
+ GDT_CSUM must not be set)
* - 0x800
- Filesystem supports replicas. This feature is neither in the kernel nor
- e2fsprogs. (RO\_COMPAT\_REPLICA)
+ e2fsprogs. (RO_COMPAT_REPLICA)
* - 0x1000
- Read-only filesystem image; the kernel will not mount this image
read-write and most tools will refuse to write to the image.
- (RO\_COMPAT\_READONLY)
+ (RO_COMPAT_READONLY)
* - 0x2000
- - Filesystem tracks project quotas. (RO\_COMPAT\_PROJECT)
+ - Filesystem tracks project quotas. (RO_COMPAT_PROJECT)
* - 0x8000
- - Verity inodes may be present on the filesystem. (RO\_COMPAT\_VERITY)
+ - Verity inodes may be present on the filesystem. (RO_COMPAT_VERITY)
+ * - 0x10000
+ - Indicates that the orphan file may contain valid orphan entries that
+ need to be cleaned up when mounting the filesystem
+ (RO_COMPAT_ORPHAN_PRESENT).
.. _super_def_hash:
@@ -741,36 +769,36 @@ The ``s_default_mount_opts`` field is any combination of the following:
* - Value
- Description
* - 0x0001
- - Print debugging info upon (re)mount. (EXT4\_DEFM\_DEBUG)
+ - Print debugging info upon (re)mount. (EXT4_DEFM_DEBUG)
* - 0x0002
- New files take the gid of the containing directory (instead of the fsgid
- of the current process). (EXT4\_DEFM\_BSDGROUPS)
+ of the current process). (EXT4_DEFM_BSDGROUPS)
* - 0x0004
- - Support userspace-provided extended attributes. (EXT4\_DEFM\_XATTR\_USER)
+ - Support userspace-provided extended attributes. (EXT4_DEFM_XATTR_USER)
* - 0x0008
- - Support POSIX access control lists (ACLs). (EXT4\_DEFM\_ACL)
+ - Support POSIX access control lists (ACLs). (EXT4_DEFM_ACL)
* - 0x0010
- - Do not support 32-bit UIDs. (EXT4\_DEFM\_UID16)
+ - Do not support 32-bit UIDs. (EXT4_DEFM_UID16)
* - 0x0020
- - All data and metadata are commited to the journal.
- (EXT4\_DEFM\_JMODE\_DATA)
+ - All data and metadata are committed to the journal.
+ (EXT4_DEFM_JMODE_DATA)
* - 0x0040
- All data are flushed to the disk before metadata are committed to the
- journal. (EXT4\_DEFM\_JMODE\_ORDERED)
+ journal. (EXT4_DEFM_JMODE_ORDERED)
* - 0x0060
- Data ordering is not preserved; data may be written after the metadata
- has been written. (EXT4\_DEFM\_JMODE\_WBACK)
+ has been written. (EXT4_DEFM_JMODE_WBACK)
* - 0x0100
- - Disable write flushes. (EXT4\_DEFM\_NOBARRIER)
+ - Disable write flushes. (EXT4_DEFM_NOBARRIER)
* - 0x0200
- Track which blocks in a filesystem are metadata and therefore should not
be used as data blocks. This option will be enabled by default on 3.18,
- hopefully. (EXT4\_DEFM\_BLOCK\_VALIDITY)
+ hopefully. (EXT4_DEFM_BLOCK_VALIDITY)
* - 0x0400
- Enable DISCARD support, where the storage device is told about blocks
- becoming unused. (EXT4\_DEFM\_DISCARD)
+ becoming unused. (EXT4_DEFM_DISCARD)
* - 0x0800
- - Disable delayed allocation. (EXT4\_DEFM\_NODELALLOC)
+ - Disable delayed allocation. (EXT4_DEFM_NODELALLOC)
.. _super_flags:
@@ -800,12 +828,12 @@ The ``s_encrypt_algos`` list can contain any of the following:
* - Value
- Description
* - 0
- - Invalid algorithm (ENCRYPTION\_MODE\_INVALID).
+ - Invalid algorithm (ENCRYPTION_MODE_INVALID).
* - 1
- - 256-bit AES in XTS mode (ENCRYPTION\_MODE\_AES\_256\_XTS).
+ - 256-bit AES in XTS mode (ENCRYPTION_MODE_AES_256_XTS).
* - 2
- - 256-bit AES in GCM mode (ENCRYPTION\_MODE\_AES\_256\_GCM).
+ - 256-bit AES in GCM mode (ENCRYPTION_MODE_AES_256_GCM).
* - 3
- - 256-bit AES in CBC mode (ENCRYPTION\_MODE\_AES\_256\_CBC).
+ - 256-bit AES in CBC mode (ENCRYPTION_MODE_AES_256_CBC).
Total size of the superblock is 1024 bytes.
diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst
index 099d45ac8d8f..440e4ae74e44 100644
--- a/Documentation/filesystems/f2fs.rst
+++ b/Documentation/filesystems/f2fs.rst
@@ -25,10 +25,14 @@ a consistency checking tool (fsck.f2fs), and a debugging tool (dump.f2fs).
- git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git
-For reporting bugs and sending patches, please use the following mailing list:
+For sending patches, please use the following mailing list:
- linux-f2fs-devel@lists.sourceforge.net
+For reporting bugs, please use the following f2fs bug tracker link:
+
+- https://bugzilla.kernel.org/enter_bug.cgi?product=File%20System&component=f2fs
+
Background and Design issues
============================
@@ -101,164 +105,272 @@ Mount Options
=============
-====================== ============================================================
-background_gc=%s Turn on/off cleaning operations, namely garbage
- collection, triggered in background when I/O subsystem is
- idle. If background_gc=on, it will turn on the garbage
- collection and if background_gc=off, garbage collection
- will be turned off. If background_gc=sync, it will turn
- on synchronous garbage collection running in background.
- Default value for this option is on. So garbage
- collection is on by default.
-disable_roll_forward Disable the roll-forward recovery routine
-norecovery Disable the roll-forward recovery routine, mounted read-
- only (i.e., -o ro,disable_roll_forward)
-discard/nodiscard Enable/disable real-time discard in f2fs, if discard is
- enabled, f2fs will issue discard/TRIM commands when a
- segment is cleaned.
-no_heap Disable heap-style segment allocation which finds free
- segments for data from the beginning of main area, while
- for node from the end of main area.
-nouser_xattr Disable Extended User Attributes. Note: xattr is enabled
- by default if CONFIG_F2FS_FS_XATTR is selected.
-noacl Disable POSIX Access Control List. Note: acl is enabled
- by default if CONFIG_F2FS_FS_POSIX_ACL is selected.
-active_logs=%u Support configuring the number of active logs. In the
- current design, f2fs supports only 2, 4, and 6 logs.
- Default number is 6.
-disable_ext_identify Disable the extension list configured by mkfs, so f2fs
- does not aware of cold files such as media files.
-inline_xattr Enable the inline xattrs feature.
-noinline_xattr Disable the inline xattrs feature.
-inline_xattr_size=%u Support configuring inline xattr size, it depends on
- flexible inline xattr feature.
-inline_data Enable the inline data feature: New created small(<~3.4k)
- files can be written into inode block.
-inline_dentry Enable the inline dir feature: data in new created
- directory entries can be written into inode block. The
- space of inode block which is used to store inline
- dentries is limited to ~3.4k.
-noinline_dentry Disable the inline dentry feature.
-flush_merge Merge concurrent cache_flush commands as much as possible
- to eliminate redundant command issues. If the underlying
- device handles the cache_flush command relatively slowly,
- recommend to enable this option.
-nobarrier This option can be used if underlying storage guarantees
- its cached data should be written to the novolatile area.
- If this option is set, no cache_flush commands are issued
- but f2fs still guarantees the write ordering of all the
- data writes.
-fastboot This option is used when a system wants to reduce mount
- time as much as possible, even though normal performance
- can be sacrificed.
-extent_cache Enable an extent cache based on rb-tree, it can cache
- as many as extent which map between contiguous logical
- address and physical address per inode, resulting in
- increasing the cache hit ratio. Set by default.
-noextent_cache Disable an extent cache based on rb-tree explicitly, see
- the above extent_cache mount option.
-noinline_data Disable the inline data feature, inline data feature is
- enabled by default.
-data_flush Enable data flushing before checkpoint in order to
- persist data of regular and symlink.
-reserve_root=%d Support configuring reserved space which is used for
- allocation from a privileged user with specified uid or
- gid, unit: 4KB, the default limit is 0.2% of user blocks.
-resuid=%d The user ID which may use the reserved blocks.
-resgid=%d The group ID which may use the reserved blocks.
-fault_injection=%d Enable fault injection in all supported types with
- specified injection rate.
-fault_type=%d Support configuring fault injection type, should be
- enabled with fault_injection option, fault type value
- is shown below, it supports single or combined type.
-
- =================== ===========
- Type_Name Type_Value
- =================== ===========
- FAULT_KMALLOC 0x000000001
- FAULT_KVMALLOC 0x000000002
- FAULT_PAGE_ALLOC 0x000000004
- FAULT_PAGE_GET 0x000000008
- FAULT_ALLOC_BIO 0x000000010
- FAULT_ALLOC_NID 0x000000020
- FAULT_ORPHAN 0x000000040
- FAULT_BLOCK 0x000000080
- FAULT_DIR_DEPTH 0x000000100
- FAULT_EVICT_INODE 0x000000200
- FAULT_TRUNCATE 0x000000400
- FAULT_READ_IO 0x000000800
- FAULT_CHECKPOINT 0x000001000
- FAULT_DISCARD 0x000002000
- FAULT_WRITE_IO 0x000004000
- =================== ===========
-mode=%s Control block allocation mode which supports "adaptive"
- and "lfs". In "lfs" mode, there should be no random
- writes towards main area.
-io_bits=%u Set the bit size of write IO requests. It should be set
- with "mode=lfs".
-usrquota Enable plain user disk quota accounting.
-grpquota Enable plain group disk quota accounting.
-prjquota Enable plain project quota accounting.
-usrjquota=<file> Appoint specified file and type during mount, so that quota
-grpjquota=<file> information can be properly updated during recovery flow,
-prjjquota=<file> <quota file>: must be in root directory;
-jqfmt=<quota type> <quota type>: [vfsold,vfsv0,vfsv1].
-offusrjquota Turn off user journelled quota.
-offgrpjquota Turn off group journelled quota.
-offprjjquota Turn off project journelled quota.
-quota Enable plain user disk quota accounting.
-noquota Disable all plain disk quota option.
-whint_mode=%s Control which write hints are passed down to block
- layer. This supports "off", "user-based", and
- "fs-based". In "off" mode (default), f2fs does not pass
- down hints. In "user-based" mode, f2fs tries to pass
- down hints given by users. And in "fs-based" mode, f2fs
- passes down hints with its policy.
-alloc_mode=%s Adjust block allocation policy, which supports "reuse"
- and "default".
-fsync_mode=%s Control the policy of fsync. Currently supports "posix",
- "strict", and "nobarrier". In "posix" mode, which is
- default, fsync will follow POSIX semantics and does a
- light operation to improve the filesystem performance.
- In "strict" mode, fsync will be heavy and behaves in line
- with xfs, ext4 and btrfs, where xfstest generic/342 will
- pass, but the performance will regress. "nobarrier" is
- based on "posix", but doesn't issue flush command for
- non-atomic files likewise "nobarrier" mount option.
+======================== ============================================================
+background_gc=%s Turn on/off cleaning operations, namely garbage
+ collection, triggered in background when I/O subsystem is
+ idle. If background_gc=on, it will turn on the garbage
+ collection and if background_gc=off, garbage collection
+ will be turned off. If background_gc=sync, it will turn
+ on synchronous garbage collection running in background.
+ The default value for this option is on, so garbage
+ collection is enabled by default.
+gc_merge When background_gc is on, this option can be enabled to
+ let the background GC thread handle foreground GC requests;
+ this can eliminate the sluggishness caused by slow foreground
+ GC when GC is triggered from a process with limited I/O and
+ CPU resources. See the mount(2) sketch after this table.
+nogc_merge Disable GC merge feature.
+disable_roll_forward Disable the roll-forward recovery routine
+norecovery Disable the roll-forward recovery routine, mounted read-
+ only (i.e., -o ro,disable_roll_forward)
+discard/nodiscard Enable/disable real-time discard in f2fs, if discard is
+ enabled, f2fs will issue discard/TRIM commands when a
+ segment is cleaned.
+heap/no_heap Deprecated.
+nouser_xattr Disable Extended User Attributes. Note: xattr is enabled
+ by default if CONFIG_F2FS_FS_XATTR is selected.
+noacl Disable POSIX Access Control List. Note: acl is enabled
+ by default if CONFIG_F2FS_FS_POSIX_ACL is selected.
+active_logs=%u Support configuring the number of active logs. In the
+ current design, f2fs supports only 2, 4, and 6 logs.
+ Default number is 6.
+disable_ext_identify Disable the extension list configured by mkfs, so f2fs
+ is not aware of cold files such as media files.
+inline_xattr Enable the inline xattrs feature.
+noinline_xattr Disable the inline xattrs feature.
+inline_xattr_size=%u Support configuring inline xattr size, it depends on
+ flexible inline xattr feature.
+inline_data Enable the inline data feature: Newly created small (<~3.4k)
+ files can be written into inode block.
+inline_dentry Enable the inline dir feature: data in newly created
+ directory entries can be written into inode block. The
+ space of inode block which is used to store inline
+ dentries is limited to ~3.4k.
+noinline_dentry Disable the inline dentry feature.
+flush_merge Merge concurrent cache_flush commands as much as possible
+ to eliminate redundant command issues. If the underlying
+ device handles the cache_flush command relatively slowly,
+ enabling this option is recommended.
+nobarrier This option can be used if the underlying storage guarantees
+ that its cached data will be written to the non-volatile area.
+ If this option is set, no cache_flush commands are issued
+ but f2fs still guarantees the write ordering of all the
+ data writes.
+barrier If this option is set, cache_flush commands are allowed to be
+ issued.
+fastboot This option is used when a system wants to reduce mount
+ time as much as possible, even though normal performance
+ can be sacrificed.
+extent_cache Enable an extent cache based on rb-tree. Per inode, it can
+ cache extents that map contiguous logical addresses to
+ contiguous physical addresses, increasing the cache hit
+ ratio. Set by default.
+noextent_cache Disable an extent cache based on rb-tree explicitly, see
+ the above extent_cache mount option.
+noinline_data Disable the inline data feature, inline data feature is
+ enabled by default.
+data_flush Enable data flushing before checkpoint in order to
+ persist data of regular and symlink.
+reserve_root=%d Support configuring reserved space which is used for
+ allocation from a privileged user with specified uid or
+ gid, unit: 4KB, the default limit is 0.2% of user blocks.
+resuid=%d The user ID which may use the reserved blocks.
+resgid=%d The group ID which may use the reserved blocks.
+fault_injection=%d Enable fault injection in all supported types with
+ specified injection rate.
+fault_type=%d Support configuring the fault injection type; it should be
+ enabled with the fault_injection option. The fault type values
+ are shown below; single or combined types are supported.
+
+ =========================== ==========
+ Type_Name Type_Value
+ =========================== ==========
+ FAULT_KMALLOC 0x00000001
+ FAULT_KVMALLOC 0x00000002
+ FAULT_PAGE_ALLOC 0x00000004
+ FAULT_PAGE_GET 0x00000008
+ FAULT_ALLOC_BIO 0x00000010 (obsolete)
+ FAULT_ALLOC_NID 0x00000020
+ FAULT_ORPHAN 0x00000040
+ FAULT_BLOCK 0x00000080
+ FAULT_DIR_DEPTH 0x00000100
+ FAULT_EVICT_INODE 0x00000200
+ FAULT_TRUNCATE 0x00000400
+ FAULT_READ_IO 0x00000800
+ FAULT_CHECKPOINT 0x00001000
+ FAULT_DISCARD 0x00002000
+ FAULT_WRITE_IO 0x00004000
+ FAULT_SLAB_ALLOC 0x00008000
+ FAULT_DQUOT_INIT 0x00010000
+ FAULT_LOCK_OP 0x00020000
+ FAULT_BLKADDR_VALIDITY 0x00040000
+ FAULT_BLKADDR_CONSISTENCE 0x00080000
+ FAULT_NO_SEGMENT 0x00100000
+ FAULT_INCONSISTENT_FOOTER 0x00200000
+ FAULT_TIMEOUT 0x00400000 (1000ms)
+ FAULT_VMALLOC 0x00800000
+ =========================== ==========
+mode=%s Control block allocation mode which supports "adaptive"
+ and "lfs". In "lfs" mode, there should be no random
+ writes towards main area.
+ "fragment:segment" and "fragment:block" are newly added here.
+ These are developer options for experiments to simulate filesystem
+ fragmentation/after-GC situation itself. The developers use these
+ modes to understand filesystem fragmentation/after-GC condition well,
+ and eventually get some insights to handle them better.
+ In "fragment:segment", f2fs allocates a new segment in ramdom
+ position. With this, we can simulate the after-GC condition.
+ In "fragment:block", we can scatter block allocation with
+ "max_fragment_chunk" and "max_fragment_hole" sysfs nodes.
+ We added some randomness to both chunk and hole size to make
+ it close to realistic IO pattern. So, in this mode, f2fs will allocate
+ 1..<max_fragment_chunk> blocks in a chunk and make a hole in the
+ length of 1..<max_fragment_hole> by turns. With this, the newly
+ allocated blocks will be scattered throughout the whole partition.
+ Note that "fragment:block" implicitly enables "fragment:segment"
+ option for more randomness.
+ Please use these options only for experiments; re-formatting
+ the filesystem after using them is strongly recommended.
+usrquota Enable plain user disk quota accounting.
+grpquota Enable plain group disk quota accounting.
+prjquota Enable plain project quota accounting.
+usrjquota=<file> Appoint specified file and type during mount, so that quota
+grpjquota=<file> information can be properly updated during recovery flow,
+prjjquota=<file> <quota file>: must be in root directory;
+jqfmt=<quota type> <quota type>: [vfsold,vfsv0,vfsv1].
+offusrjquota Turn off user journalled quota.
+offgrpjquota Turn off group journalled quota.
+offprjjquota Turn off project journalled quota.
+quota Enable plain user disk quota accounting.
+noquota Disable all plain disk quota options.
+alloc_mode=%s Adjust block allocation policy, which supports "reuse"
+ and "default".
+fsync_mode=%s Control the policy of fsync. Currently supports "posix",
+ "strict", and "nobarrier". In "posix" mode, which is
+ default, fsync will follow POSIX semantics and does a
+ light operation to improve the filesystem performance.
+ In "strict" mode, fsync will be heavy and behaves in line
+ with xfs, ext4 and btrfs, where xfstest generic/342 will
+ pass, but the performance will regress. "nobarrier" is
+ based on "posix", but doesn't issue flush command for
+ non-atomic files likewise "nobarrier" mount option.
test_dummy_encryption
test_dummy_encryption=%s
- Enable dummy encryption, which provides a fake fscrypt
- context. The fake fscrypt context is used by xfstests.
- The argument may be either "v1" or "v2", in order to
- select the corresponding fscrypt policy version.
-checkpoint=%s[:%u[%]] Set to "disable" to turn off checkpointing. Set to "enable"
- to reenable checkpointing. Is enabled by default. While
- disabled, any unmounting or unexpected shutdowns will cause
- the filesystem contents to appear as they did when the
- filesystem was mounted with that option.
- While mounting with checkpoint=disabled, the filesystem must
- run garbage collection to ensure that all available space can
- be used. If this takes too much time, the mount may return
- EAGAIN. You may optionally add a value to indicate how much
- of the disk you would be willing to temporarily give up to
- avoid additional garbage collection. This can be given as a
- number of blocks, or as a percent. For instance, mounting
- with checkpoint=disable:100% would always succeed, but it may
- hide up to all remaining free space. The actual space that
- would be unusable can be viewed at /sys/fs/f2fs/<disk>/unusable
- This space is reclaimed once checkpoint=enable.
-compress_algorithm=%s Control compress algorithm, currently f2fs supports "lzo",
- "lz4", "zstd" and "lzo-rle" algorithm.
-compress_log_size=%u Support configuring compress cluster size, the size will
- be 4KB * (1 << %u), 16KB is minimum size, also it's
- default size.
-compress_extension=%s Support adding specified extension, so that f2fs can enable
- compression on those corresponding files, e.g. if all files
- with '.ext' has high compression rate, we can set the '.ext'
- on compression extension list and enable compression on
- these file by default rather than to enable it via ioctl.
- For other files, we can still enable compression via ioctl.
-====================== ============================================================
+ Enable dummy encryption, which provides a fake fscrypt
+ context. The fake fscrypt context is used by xfstests.
+ The argument may be either "v1" or "v2", in order to
+ select the corresponding fscrypt policy version.
+checkpoint=%s[:%u[%]] Set to "disable" to turn off checkpointing. Set to "enable"
+ to reenable checkpointing. Is enabled by default. While
+ disabled, any unmounting or unexpected shutdowns will cause
+ the filesystem contents to appear as they did when the
+ filesystem was mounted with that option.
+ While mounting with checkpoint=disable, the filesystem must
+ run garbage collection to ensure that all available space can
+ be used. If this takes too much time, the mount may return
+ EAGAIN. You may optionally add a value to indicate how much
+ of the disk you would be willing to temporarily give up to
+ avoid additional garbage collection. This can be given as a
+ number of blocks, or as a percent. For instance, mounting
+ with checkpoint=disable:100% would always succeed, but it may
+ hide up to all remaining free space. The actual space that
+ would be unusable can be viewed at /sys/fs/f2fs/<disk>/unusable
+ This space is reclaimed once checkpoint=enable.
+checkpoint_merge When checkpoint is enabled, this can be used to create a kernel
+ daemon and make it merge concurrent checkpoint requests as
+ much as possible to eliminate redundant checkpoint issues. Plus,
+ it can eliminate the sluggishness caused by a slow checkpoint
+ operation when the checkpoint is done in a process context in
+ a cgroup having a low I/O budget and CPU shares. To make this
+ work better, the default I/O priority of the kernel daemon is
+ set to "3", one level higher than other kernel threads. This
+ is the same way an I/O priority is given to the jbd2 journaling
+ thread of the ext4 filesystem.
+nocheckpoint_merge Disable checkpoint merge feature.
+compress_algorithm=%s Control compress algorithm, currently f2fs supports "lzo",
+ "lz4", "zstd" and "lzo-rle" algorithm.
+compress_algorithm=%s:%d Control compress algorithm and its compress level; currently
+ only "lz4" and "zstd" support compress level config.
+ algorithm level range
+ lz4 3 - 16
+ zstd 1 - 22
+compress_log_size=%u Support configuring compress cluster size. The size will
+ be 4KB * (1 << %u). The default and minimum sizes are 16KB.
+compress_extension=%s Support adding specified extension, so that f2fs can enable
+ compression on those corresponding files, e.g. if all files
+ with '.ext' has high compression rate, we can set the '.ext'
+ on compression extension list and enable compression on
+ these file by default rather than to enable it via ioctl.
+ For other files, we can still enable compression via ioctl.
+ Note that there is one reserved special extension, '*'; it
+ can be set to enable compression for all files.
+nocompress_extension=%s Support adding specified extension, so that f2fs can disable
+ compression on those corresponding files, contrary to compress_extension.
+ If you know exactly which files cannot be compressed, you can use this.
+ The same extension name can't appear in both the compress and nocompress
+ extension lists at the same time.
+ If the compress extension specifies all files, the types specified by the
+ nocompress extension will be treated as special cases and will not be compressed.
+ Using '*' to specify all files is not allowed in the nocompress extension.
+ After adding nocompress_extension, the priority is:
+ dir_flag < compress_extension,nocompress_extension < comp_file_flag,no_comp_file_flag.
+ See more in the compression sections.
+
+compress_chksum Support verifying chksum of raw data in compressed cluster.
+compress_mode=%s Control file compression mode. This supports "fs" and "user"
+ modes. In "fs" mode (default), f2fs does automatic compression
+ on the compression enabled files. In "user" mode, f2fs disables
+ the automatic compression and gives the user discretion of
+ choosing the target file and the timing. The user can do manual
+ compression/decompression on the compression enabled files using
+ ioctls.
+compress_cache Support using the address space of a filesystem-managed inode
+ to cache compressed blocks, in order to improve the cache hit
+ ratio of random reads.
+inlinecrypt When possible, encrypt/decrypt the contents of encrypted
+ files using the blk-crypto framework rather than
+ filesystem-layer encryption. This allows the use of
+ inline encryption hardware. The on-disk format is
+ unaffected. For more details, see
+ Documentation/block/inline-encryption.rst.
+atgc Enable age-threshold garbage collection; it provides high
+ effectiveness and efficiency for background GC.
+discard_unit=%s Control the discard unit; the argument can be "block",
+ "segment" or "section". Issued discard commands' offset/size
+ will be aligned to the unit. By default, "discard_unit=block"
+ is set, so that small discard functionality is enabled.
+ For a blkzoned device, "discard_unit=section" will be set by
+ default; it helps large-sized SMR or ZNS devices reduce memory
+ cost by getting rid of the fs metadata that supports small
+ discards.
+memory=%s Control memory mode. This supports "normal" and "low" modes.
+ "low" mode is introduced to support low memory devices.
+ Because of the nature of low memory devices, in this mode, f2fs
+ will try to save memory sometimes by sacrificing performance.
+ "normal" mode is the default mode and same as before.
+age_extent_cache Enable an age extent cache based on rb-tree. It records
+ data block update frequency of the extent per inode, in
+ order to provide better temperature hints for data block
+ allocation.
+errors=%s Specify f2fs behavior on critical errors. This supports the
+ modes "panic", "continue" and "remount-ro", which respectively
+ trigger an immediate panic, continue without doing anything,
+ and remount the partition in read-only mode. By default it
+ uses "continue" mode.
+ ====================== =============== =============== ========
+ mode continue remount-ro panic
+ ====================== =============== =============== ========
+ access ops normal normal N/A
+ syscall errors -EIO -EROFS N/A
+ mount option rw ro N/A
+ pending dir write keep keep N/A
+ pending non-dir write drop keep N/A
+ pending node write drop keep N/A
+ pending meta write keep keep N/A
+ ====================== =============== =============== ========
+nat_bits Enable nat_bits feature to enhance full/empty nat blocks access,
+ by default it's disabled.
+======================== ============================================================
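+
+As a concrete illustration of combining the options above, the sketch below
+mounts an f2fs partition with gc_merge and a compression setup through
+mount(2). The device and mountpoint paths are placeholders; only the option
+string is f2fs-specific, and the same string works with mount(8) via -o::
+
+    /* Hedged sketch: mount f2fs with a few of the options above. */
+    #include <stdio.h>
+    #include <sys/mount.h>
+
+    int main(void)
+    {
+        const char *opts = "background_gc=on,gc_merge,"
+                           "compress_algorithm=lz4:6,compress_extension=log";
+
+        /* /dev/sdb1 and /mnt/f2fs are placeholder paths. */
+        if (mount("/dev/sdb1", "/mnt/f2fs", "f2fs", 0, opts) != 0) {
+            perror("mount");
+            return 1;
+        }
+        return 0;
+    }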
Debugfs Entries
===============
@@ -293,7 +405,7 @@ Usage
# insmod f2fs.ko
-3. Create a directory trying to mount::
+3. Create a directory to use when mounting::
# mkdir /mnt/f2fs
@@ -307,7 +419,7 @@ mkfs.f2fs
The mkfs.f2fs is for the use of formatting a partition as the f2fs filesystem,
which builds a basic on-disk layout.
-The options consist of:
+The quick options consist of:
=============== ===========================================================
``-l [label]`` Give a volume label, up to 512 unicode name.
@@ -329,6 +441,8 @@ The options consist of:
1 is set by default, which conducts discard.
=============== ===========================================================
+Note: please refer to the manpage of mkfs.f2fs(8) for the full option list.
+
fsck.f2fs
---------
The fsck.f2fs is a tool to check the consistency of an f2fs-formatted
@@ -336,10 +450,12 @@ partition, which examines whether the filesystem metadata and user-made data
are cross-referenced correctly or not.
Note that, initial version of the tool does not fix any inconsistency.
-The options consist of::
+The quick options consist of::
-d debug level [default:0]
+Note: please refer to the manpage of fsck.f2fs(8) for the full option list.
+
dump.f2fs
---------
The dump.f2fs shows the information of specific inode and dumps SSA and SIT to
@@ -363,6 +479,37 @@ Examples::
# dump.f2fs -s 0~-1 /dev/sdx (SIT dump)
# dump.f2fs -a 0~-1 /dev/sdx (SSA dump)
+Note: please refer to the manpage of dump.f2fs(8) for the full option list.
+
+sload.f2fs
+----------
+The sload.f2fs gives a way to insert files and directories into an existing
+disk image. This tool is useful when building f2fs images given compiled files.
+
+Note: please refer to the manpage of sload.f2fs(8) for the full option list.
+
+resize.f2fs
+-----------
+The resize.f2fs lets a user resize the f2fs-formatted disk image, while preserving
+all the files and directories stored in the image.
+
+Note: please refer to the manpage of resize.f2fs(8) for the full option list.
+
+defrag.f2fs
+-----------
+The defrag.f2fs can be used to defragment scattered written data as well as
+filesystem metadata across the disk. This can improve the write speed by giving
+more free consecutive space.
+
+Note: please refer to the manpage of defrag.f2fs(8) for the full option list.
+
+f2fs_io
+-------
+The f2fs_io is a simple tool to exercise various filesystem APIs as well as
+f2fs-specific ones, which is very useful for QA tests.
+
+Note: please refer to the manpage of f2fs_io(8) for the full option list.
+
Design
======
@@ -375,7 +522,7 @@ consists of a set of sections. By default, section and zone sizes are set to one
segment size identically, but users can easily modify the sizes by mkfs.
F2FS splits the entire volume into six areas, and all the areas except superblock
-consists of multiple segments as described below::
+consist of multiple segments as described below::
align with the zone size <-|
|-> align with the segment size
@@ -478,7 +625,7 @@ one inode block (i.e., a file) covers::
`- direct node (1018)
`- data (1018)
-Note that, all the node blocks are mapped by NAT which means the location of
+Note that all the node blocks are mapped by NAT which means the location of
each node is translated by the NAT table. In the consideration of the wandering
tree problem, F2FS is able to cut off the propagation of node updates caused by
leaf data writes.
@@ -558,7 +705,7 @@ When F2FS finds a file name in a directory, at first a hash value of the file
name is calculated. Then, F2FS scans the hash table in level #0 to find the
dentry consisting of the file name and its inode number. If not found, F2FS
scans the next hash table in level #1. In this way, F2FS scans hash tables in
-each levels incrementally from 1 to N. In each levels F2FS needs to scan only
+each level incrementally from 1 to N. In each level F2FS needs to scan only
one bucket determined by the following equation, which shows O(log(# of files))
complexity::
@@ -635,57 +782,22 @@ bitmap is composed of a bit stream covering whole blocks in main area.
Write-hint Policy
-----------------
-1) whint_mode=off. F2FS only passes down WRITE_LIFE_NOT_SET.
-
-2) whint_mode=user-based. F2FS tries to pass down hints given by
-users.
+F2FS always sets the write hint according to the below policy.
===================== ======================== ===================
User F2FS Block
===================== ======================== ===================
- META WRITE_LIFE_NOT_SET
- HOT_NODE "
- WARM_NODE "
- COLD_NODE "
+N/A META WRITE_LIFE_NONE|REQ_META
+N/A HOT_NODE WRITE_LIFE_NONE
+N/A WARM_NODE WRITE_LIFE_MEDIUM
+N/A COLD_NODE WRITE_LIFE_LONG
ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME
extension list " "
-- buffered io
-WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME
-WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT
-WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET
-WRITE_LIFE_NONE " "
-WRITE_LIFE_MEDIUM " "
-WRITE_LIFE_LONG " "
-
--- direct io
-WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME
-WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT
-WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_NOT_SET
-WRITE_LIFE_NONE " WRITE_LIFE_NONE
-WRITE_LIFE_MEDIUM " WRITE_LIFE_MEDIUM
-WRITE_LIFE_LONG " WRITE_LIFE_LONG
-===================== ======================== ===================
-
-3) whint_mode=fs-based. F2FS passes down hints with its policy.
-
-===================== ======================== ===================
-User F2FS Block
-===================== ======================== ===================
- META WRITE_LIFE_MEDIUM;
- HOT_NODE WRITE_LIFE_NOT_SET
- WARM_NODE "
- COLD_NODE WRITE_LIFE_NONE
-ioctl(COLD) COLD_DATA WRITE_LIFE_EXTREME
-extension list " "
-
--- buffered io
-WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME
-WRITE_LIFE_SHORT HOT_DATA WRITE_LIFE_SHORT
-WRITE_LIFE_NOT_SET WARM_DATA WRITE_LIFE_LONG
-WRITE_LIFE_NONE " "
-WRITE_LIFE_MEDIUM " "
-WRITE_LIFE_LONG " "
+N/A COLD_DATA WRITE_LIFE_EXTREME
+N/A HOT_DATA WRITE_LIFE_SHORT
+N/A WARM_DATA WRITE_LIFE_NOT_SET
-- direct io
WRITE_LIFE_EXTREME COLD_DATA WRITE_LIFE_EXTREME
@@ -699,7 +811,7 @@ WRITE_LIFE_LONG " WRITE_LIFE_LONG
Fallocate(2) Policy
-------------------
-The default policy follows the below posix rule.
+The default policy follows the below POSIX rule.
Allocating disk space
The default operation (i.e., mode is zero) of fallocate() allocates
@@ -712,7 +824,7 @@ Allocating disk space
as a method of optimally implementing that function.
However, once F2FS receives ioctl(fd, F2FS_IOC_SET_PIN_FILE) in prior to
-fallocate(fd, DEFAULT_MODE), it allocates on-disk blocks addressess having
+fallocate(fd, DEFAULT_MODE), it allocates on-disk block addresses having
zero or random data, which is useful to the below scenario where:
1. create(fd)
@@ -731,20 +843,49 @@ Compression implementation
cluster can be compressed or not.
- In cluster metadata layout, one special block address is used to indicate
- cluster is compressed one or normal one, for compressed cluster, following
+ a cluster is a compressed one or normal one; for compressed cluster, following
metadata maps cluster to [1, 4 << n - 1] physical blocks, in where f2fs
stores data including compress header and compressed data.
- In order to eliminate write amplification during overwrite, F2FS only
support compression on write-once file, data can be compressed only when
- all logical blocks in file are valid and cluster compress ratio is lower
- than specified threshold.
+ all logical blocks in cluster contain valid data and compress ratio of
+ cluster data is lower than specified threshold.
-- To enable compression on regular inode, there are three ways:
+- To enable compression on regular inode, there are four ways:
* chattr +c file
* chattr +c dir; touch dir/file
* mount w/ -o compress_extension=ext; touch file.ext
+ * mount w/ -o compress_extension=*; touch any_file
+
+- To disable compression on regular inode, there are two ways:
+
+ * chattr -c file
+ * mount w/ -o nocompress_extension=ext; touch file.ext
+
+- Priority in between FS_COMPR_FL, FS_NOCOMP_FL, extensions:
+
+ * compress_extension=so; nocompress_extension=zip; chattr +c dir; touch
+ dir/foo.so; touch dir/bar.zip; touch dir/baz.txt; then foo.so and baz.txt
+ should be compressed, and bar.zip should be non-compressed. chattr +c
+ dir/bar.zip can enable compression on bar.zip.
+ * compress_extension=so; nocompress_extension=zip; chattr -c dir; touch
+ dir/foo.so; touch dir/bar.zip; touch dir/baz.txt; then foo.so should be
+ compressed, and bar.zip and baz.txt should be non-compressed.
+ chattr +c dir/bar.zip; chattr +c dir/baz.txt; can enable compression on
+ bar.zip and baz.txt.
+
+- At this point, the compression feature doesn't expose compressed space to the
+ user directly in order to guarantee potential data updates later to the space.
+ Instead, the main goal is to reduce data writes to the flash disk as much as
+ possible, resulting in extended disk lifetime as well as relaxed IO
+ congestion. Alternatively, the ioctl(F2FS_IOC_RELEASE_COMPRESS_BLOCKS)
+ interface can reclaim compressed space and show it to the user after setting a
+ special flag on the inode. Once the compressed space is released, the flag
+ will block writing data to the file until either the compressed space is
+ reserved via ioctl(F2FS_IOC_RESERVE_COMPRESS_BLOCKS) or the file size is
+ truncated to zero. A usage sketch follows below.
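+
+The release/reserve flow can be sketched as follows. This is an illustrative
+sketch, assuming the uapi header <linux/f2fs.h> is available; both ioctls
+report a block count through a __u64 argument, and error handling is
+abbreviated::
+
+    /* Hedged sketch: reclaim and later re-reserve compressed space. */
+    #include <fcntl.h>
+    #include <stdio.h>
+    #include <sys/ioctl.h>
+    #include <linux/f2fs.h>
+
+    int main(int argc, char **argv)
+    {
+        unsigned long long blocks = 0;
+        int fd;
+
+        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
+            return 1;
+        /* Returns the number of released blocks through the argument. */
+        if (ioctl(fd, F2FS_IOC_RELEASE_COMPRESS_BLOCKS, &blocks) == 0)
+            printf("released %llu blocks\n", blocks);
+        /* Re-reserving the space makes the file writable again. */
+        if (ioctl(fd, F2FS_IOC_RESERVE_COMPRESS_BLOCKS, &blocks) == 0)
+            printf("reserved %llu blocks\n", blocks);
+        return 0;
+    }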
Compress metadata layout::
@@ -753,14 +894,101 @@ Compress metadata layout::
| cluster 1 | cluster 2 | ......... | cluster N |
+-----------------------------------------------+
. . . .
- . . . .
+ . . . .
. Compressed Cluster . . Normal Cluster .
+----------+---------+---------+---------+ +---------+---------+---------+---------+
|compr flag| block 1 | block 2 | block 3 | | block 1 | block 2 | block 3 | block 4 |
+----------+---------+---------+---------+ +---------+---------+---------+---------+
- . .
+ . .
. .
. .
+-------------+-------------+----------+----------------------------+
| data length | data chksum | reserved | compressed data |
+-------------+-------------+----------+----------------------------+
+
+Compression mode
+----------------
+
+f2fs supports "fs" and "user" compression modes with the "compress_mode" mount
+option. With this option, f2fs provides a choice of how to compress the
+compression enabled files (refer to the "Compression implementation" section
+for how to enable compression on a regular inode).
+
+1) compress_mode=fs
+This is the default option. f2fs does automatic compression in the writeback of the
+compression enabled files.
+
+2) compress_mode=user
+This disables the automatic compression and gives the user discretion of choosing the
+target file and the timing. The user can do manual compression/decompression on the
+compression enabled files using F2FS_IOC_DECOMPRESS_FILE and F2FS_IOC_COMPRESS_FILE
+ioctls like the below.
+
+To decompress a file::
+
+    fd = open(filename, O_WRONLY, 0);
+    ret = ioctl(fd, F2FS_IOC_DECOMPRESS_FILE);
+
+To compress a file::
+
+    fd = open(filename, O_WRONLY, 0);
+    ret = ioctl(fd, F2FS_IOC_COMPRESS_FILE);
+
+NVMe Zoned Namespace devices
+----------------------------
+
+- ZNS defines a per-zone capacity which can be equal to or less than the
+ zone-size. Zone-capacity is the number of usable blocks in the zone.
+ F2FS checks if zone-capacity is less than zone-size; if it is, any
+ segment which starts after the zone-capacity is marked as not-free in
+ the free segment bitmap at initial mount time. These segments are marked
+ as permanently used so they are not allocated for writes and
+ consequently do not need to be garbage collected. In case the
+ zone-capacity is not aligned to the default segment size (2MB), a segment
+ can start before the zone-capacity and span across the zone-capacity boundary.
+ Such spanning segments are also considered as usable segments. All blocks
+ past the zone-capacity are considered unusable in these segments.
+
+Device aliasing feature
+-----------------------
+
+f2fs can utilize a special file called a "device aliasing file." This file allows
+the entire storage device to be mapped with a single, large extent, not using
+the usual f2fs node structures. This mapped area is pinned and primarily intended
+for holding the space.
+
+Essentially, this mechanism allows a portion of the f2fs area to be temporarily
+reserved and used by another filesystem or for different purposes. Once that
+external usage is complete, the device aliasing file can be deleted, releasing
+the reserved space back to F2FS for its own use.
+
+Use-case::
+
+    # ls /dev/vd*
+    /dev/vdb (32GB) /dev/vdc (32GB)
+    # mkfs.ext4 /dev/vdc
+    # mkfs.f2fs -c /dev/vdc@vdc.file /dev/vdb
+    # mount /dev/vdb /mnt/f2fs
+    # ls -l /mnt/f2fs
+    vdc.file
+    # df -h
+    /dev/vdb 64G 33G 32G 52% /mnt/f2fs
+
+    # mount -o loop /dev/vdc /mnt/ext4
+    # df -h
+    /dev/vdb 64G 33G 32G 52% /mnt/f2fs
+    /dev/loop7 32G 24K 30G 1% /mnt/ext4
+    # umount /mnt/ext4
+
+    # f2fs_io getflags /mnt/f2fs/vdc.file
+    get a flag on /mnt/f2fs/vdc.file ret=0, flags=nocow(pinned),immutable
+    # f2fs_io setflags noimmutable /mnt/f2fs/vdc.file
+    get a flag on noimmutable ret=0, flags=800010
+    set a flag on /mnt/f2fs/vdc.file ret=0, flags=noimmutable
+    # rm /mnt/f2fs/vdc.file
+    # df -h
+    /dev/vdb 64G 753M 64G 2% /mnt/f2fs
+
+So, the key idea is that the user can do any file operations on /dev/vdc and
+reclaim the space after use, while the space is counted as part of the f2fs
+volume. This requires no change to the partition size or filesystem format.
diff --git a/Documentation/filesystems/fiemap.rst b/Documentation/filesystems/fiemap.rst
index 93fc96f760aa..23b3ed229e49 100644
--- a/Documentation/filesystems/fiemap.rst
+++ b/Documentation/filesystems/fiemap.rst
@@ -12,21 +12,10 @@ returns a list of extents.
Request Basics
--------------
-A fiemap request is encoded within struct fiemap::
-
- struct fiemap {
- __u64 fm_start; /* logical offset (inclusive) at
- * which to start mapping (in) */
- __u64 fm_length; /* logical length of mapping which
- * userspace cares about (in) */
- __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
- __u32 fm_mapped_extents; /* number of extents that were
- * mapped (out) */
- __u32 fm_extent_count; /* size of fm_extents array (in) */
- __u32 fm_reserved;
- struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */
- };
+A fiemap request is encoded within struct fiemap:
+
+.. kernel-doc:: include/uapi/linux/fiemap.h
+    :identifiers: fiemap
fm_start, and fm_length specify the logical range within the file
which the process would like mappings for. Extents returned mirror
@@ -60,6 +49,8 @@ FIEMAP_FLAG_XATTR
If this flag is set, the extents returned will describe the inodes
extended attribute lookup tree, instead of its data tree.
+FIEMAP_FLAG_CACHE
+ This flag requests caching of the extents; a request sketch follows
+ the flag list below.
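+
+Putting a request together, a caller typically issues FS_IOC_FIEMAP twice:
+first with fm_extent_count set to zero to learn how many extents are mapped,
+then again with an appropriately sized fm_extents array. The following is an
+illustrative sketch only (error handling is abbreviated)::
+
+    #include <fcntl.h>
+    #include <stdio.h>
+    #include <stdlib.h>
+    #include <string.h>
+    #include <sys/ioctl.h>
+    #include <linux/fs.h>
+    #include <linux/fiemap.h>
+
+    int main(int argc, char **argv)
+    {
+        struct fiemap probe, *fm;
+        int fd;
+
+        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
+            return 1;
+        memset(&probe, 0, sizeof(probe));
+        probe.fm_length = ~0ULL;        /* map the whole file */
+        probe.fm_extent_count = 0;      /* only count the extents */
+        if (ioctl(fd, FS_IOC_FIEMAP, &probe) != 0)
+            return 1;
+
+        fm = calloc(1, sizeof(*fm) +
+                    probe.fm_mapped_extents * sizeof(struct fiemap_extent));
+        fm->fm_length = ~0ULL;
+        fm->fm_extent_count = probe.fm_mapped_extents;
+        if (ioctl(fd, FS_IOC_FIEMAP, fm) != 0)
+            return 1;
+        for (unsigned int i = 0; i < fm->fm_mapped_extents; i++)
+            printf("logical %llu physical %llu length %llu flags 0x%x\n",
+                   (unsigned long long)fm->fm_extents[i].fe_logical,
+                   (unsigned long long)fm->fm_extents[i].fe_physical,
+                   (unsigned long long)fm->fm_extents[i].fe_length,
+                   fm->fm_extents[i].fe_flags);
+        return 0;
+    }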
Extent Mapping
--------------
@@ -77,18 +68,10 @@ complete the requested range and will not have the FIEMAP_EXTENT_LAST
flag set (see the next section on extent flags).
Each extent is described by a single fiemap_extent structure as
-returned in fm_extents::
-
- struct fiemap_extent {
- __u64 fe_logical; /* logical offset in bytes for the start of
- * the extent */
- __u64 fe_physical; /* physical offset in bytes for the start
- * of the extent */
- __u64 fe_length; /* length in bytes for the extent */
- __u64 fe_reserved64[2];
- __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
- __u32 fe_reserved[3];
- };
+returned in fm_extents:
+
+.. kernel-doc:: include/uapi/linux/fiemap.h
+    :identifiers: fiemap_extent
All offsets and lengths are in bytes and mirror those on disk. It is valid
for an extents logical offset to start before the request or its logical
@@ -175,6 +158,8 @@ FIEMAP_EXTENT_MERGED
userspace would be highly inefficient, the kernel will try to merge most
adjacent blocks into 'extents'.
+FIEMAP_EXTENT_SHARED
+ This flag is set to indicate that this extent's space is shared with
+ other files.
VFS -> File System Implementation
---------------------------------
@@ -191,14 +176,10 @@ each discovered extent::
u64 len);
->fiemap is passed struct fiemap_extent_info which describes the
-fiemap request::
-
- struct fiemap_extent_info {
- unsigned int fi_flags; /* Flags as passed from user */
- unsigned int fi_extents_mapped; /* Number of mapped extents */
- unsigned int fi_extents_max; /* Size of fiemap_extent array */
- struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent array */
- };
+fiemap request:
+
+.. kernel-doc:: include/linux/fiemap.h
+    :identifiers: fiemap_extent_info
It is intended that the file system should not need to access any of this
structure directly. Filesystem handlers should be tolerant to signals and return
diff --git a/Documentation/filesystems/files.rst b/Documentation/filesystems/files.rst
index cbf8e57376bf..eb770f891b27 100644
--- a/Documentation/filesystems/files.rst
+++ b/Documentation/filesystems/files.rst
@@ -62,7 +62,7 @@ the fdtable structure -
be held.
4. To look up the file structure given an fd, a reader
- must use either fcheck() or fcheck_files() APIs. These
+ must use either lookup_fdget_rcu() or files_lookup_fdget_rcu() APIs. These
take care of barrier requirements due to lock-free lookup.
An example::
@@ -70,43 +70,22 @@ the fdtable structure -
struct file *file;
rcu_read_lock();
- file = fcheck(fd);
- if (file) {
- ...
- }
- ....
+ file = lookup_fdget_rcu(fd);
rcu_read_unlock();
-
-5. Handling of the file structures is special. Since the look-up
- of the fd (fget()/fget_light()) are lock-free, it is possible
- that look-up may race with the last put() operation on the
- file structure. This is avoided using atomic_long_inc_not_zero()
- on ->f_count::
-
- rcu_read_lock();
- file = fcheck_files(files, fd);
if (file) {
- if (atomic_long_inc_not_zero(&file->f_count))
- *fput_needed = 1;
- else
- /* Didn't get the reference, someone's freed */
- file = NULL;
+ ...
+ fput(file);
}
- rcu_read_unlock();
....
- return file;
-
- atomic_long_inc_not_zero() detects if refcounts is already zero or
- goes to zero during increment. If it does, we fail
- fget()/fget_light().
-6. Since both fdtable and file structures can be looked up
+5. Since both fdtable and file structures can be looked up
lock-free, they must be installed using rcu_assign_pointer()
API. If they are looked up lock-free, rcu_dereference()
must be used. However it is advisable to use files_fdtable()
- and fcheck()/fcheck_files() which take care of these issues.
+ and lookup_fdget_rcu()/files_lookup_fdget_rcu() which take care of these
+ issues.
-7. While updating, the fdtable pointer must be looked up while
+6. While updating, the fdtable pointer must be looked up while
holding files->file_lock. If ->file_lock is dropped, then
another thread expand the files thereby creating a new
fdtable and making the earlier fdtable pointer stale.
@@ -126,3 +105,19 @@ the fdtable structure -
Since locate_fd() can drop ->file_lock (and reacquire ->file_lock),
the fdtable pointer (fdt) must be loaded after locate_fd().
+On newer kernels rcu based file lookup has been switched to rely on
+SLAB_TYPESAFE_BY_RCU instead of call_rcu(). It isn't sufficient anymore
+to just acquire a reference to the file in question under rcu using
+atomic_long_inc_not_zero() since the file might have already been
+recycled and someone else might have bumped the reference. In other
+words, callers might see reference count bumps from newer users. For
+this is reason it is necessary to verify that the pointer is the same
+before and after the reference count increment. This pattern can be seen
+in get_file_rcu() and __files_get_rcu().
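+
+A simplified sketch of that pattern, where fdentry points at the fd table
+slot being read and the caller holds rcu_read_lock(), looks like the below.
+The function name is illustrative; the real helpers in fs/file.c carry
+additional checks::
+
+	static struct file *get_file_rcu_sketch(struct file __rcu **fdentry)
+	{
+		struct file *file = rcu_dereference_raw(*fdentry);
+
+		if (file && atomic_long_inc_not_zero(&file->f_count)) {
+			/*
+			 * With SLAB_TYPESAFE_BY_RCU the object may have been
+			 * recycled; re-checking the pointer proves that the
+			 * acquired reference belongs to the file looked up
+			 * above.
+			 */
+			if (file == rcu_dereference_raw(*fdentry))
+				return file;
+			fput(file);	/* recycled file: drop stray reference */
+		}
+		return NULL;
+	}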
+
+In addition, it isn't possible to access or check fields in struct file
+without first acquiring a reference on it under rcu lookup. Not doing
+that was always very dodgy and it was only usable for non-pointer data
+in struct file. With SLAB_TYPESAFE_BY_RCU, callers must either first
+acquire a reference or hold the file_lock of the fdtable.
diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst
index f517af8ec11c..29e84d125e02 100644
--- a/Documentation/filesystems/fscrypt.rst
+++ b/Documentation/filesystems/fscrypt.rst
@@ -31,15 +31,15 @@ However, except for filenames, fscrypt does not encrypt filesystem
metadata.
Unlike eCryptfs, which is a stacked filesystem, fscrypt is integrated
-directly into supported filesystems --- currently ext4, F2FS, and
-UBIFS. This allows encrypted files to be read and written without
-caching both the decrypted and encrypted pages in the pagecache,
-thereby nearly halving the memory used and bringing it in line with
-unencrypted files. Similarly, half as many dentries and inodes are
-needed. eCryptfs also limits encrypted filenames to 143 bytes,
-causing application compatibility issues; fscrypt allows the full 255
-bytes (NAME_MAX). Finally, unlike eCryptfs, the fscrypt API can be
-used by unprivileged users, with no need to mount anything.
+directly into supported filesystems --- currently ext4, F2FS, UBIFS,
+and CephFS. This allows encrypted files to be read and written
+without caching both the decrypted and encrypted pages in the
+pagecache, thereby nearly halving the memory used and bringing it in
+line with unencrypted files. Similarly, half as many dentries and
+inodes are needed. eCryptfs also limits encrypted filenames to 143
+bytes, causing application compatibility issues; fscrypt allows the
+full 255 bytes (NAME_MAX). Finally, unlike eCryptfs, the fscrypt API
+can be used by unprivileged users, with no need to mount anything.
fscrypt does not support encrypting files in-place. Instead, it
supports marking an empty directory as encrypted. Then, after
@@ -70,18 +70,18 @@ Online attacks
--------------
fscrypt (and storage encryption in general) can only provide limited
-protection, if any at all, against online attacks. In detail:
+protection against online attacks. In detail:
Side-channel attacks
~~~~~~~~~~~~~~~~~~~~
fscrypt is only resistant to side-channel attacks, such as timing or
electromagnetic attacks, to the extent that the underlying Linux
-Cryptographic API algorithms are. If a vulnerable algorithm is used,
-such as a table-based implementation of AES, it may be possible for an
-attacker to mount a side channel attack against the online system.
-Side channel attacks may also be mounted against applications
-consuming decrypted data.
+Cryptographic API algorithms or inline encryption hardware are. If a
+vulnerable algorithm is used, such as a table-based implementation of
+AES, it may be possible for an attacker to mount a side channel attack
+against the online system. Side channel attacks may also be mounted
+against applications consuming decrypted data.
Unauthorized file access
~~~~~~~~~~~~~~~~~~~~~~~~
@@ -99,16 +99,23 @@ Therefore, any encryption-specific access control checks would merely
be enforced by kernel *code* and therefore would be largely redundant
with the wide variety of access control mechanisms already available.)
-Kernel memory compromise
-~~~~~~~~~~~~~~~~~~~~~~~~
+Read-only kernel memory compromise
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Unless `hardware-wrapped keys`_ are used, an attacker who gains the
+ability to read from arbitrary kernel memory, e.g. by mounting a
+physical attack or by exploiting a kernel security vulnerability, can
+compromise all fscrypt keys that are currently in-use. This also
+extends to cold boot attacks; if the system is suddenly powered off,
+keys the system was using may remain in memory for a short time.
-An attacker who compromises the system enough to read from arbitrary
-memory, e.g. by mounting a physical attack or by exploiting a kernel
-security vulnerability, can compromise all encryption keys that are
-currently in use.
+However, if hardware-wrapped keys are used, then the fscrypt master
+keys and file contents encryption keys (but not other types of fscrypt
+subkeys such as filenames encryption keys) are protected from
+compromises of arbitrary kernel memory.
-However, fscrypt allows encryption keys to be removed from the kernel,
-which may protect them from later compromise.
+In addition, fscrypt allows encryption keys to be removed from the
+kernel, which may protect them from later compromise.
In more detail, the FS_IOC_REMOVE_ENCRYPTION_KEY ioctl (or the
FS_IOC_REMOVE_ENCRYPTION_KEY_ALL_USERS ioctl) can wipe a master
@@ -137,14 +144,31 @@ However, these ioctls have some limitations:
- In general, decrypted contents and filenames in the kernel VFS
caches are freed but not wiped. Therefore, portions thereof may be
recoverable from freed memory, even after the corresponding key(s)
- were wiped. To partially solve this, you can set
- CONFIG_PAGE_POISONING=y in your kernel config and add page_poison=1
- to your kernel command line. However, this has a performance cost.
+ were wiped. To partially solve this, you can add init_on_free=1 to
+ your kernel command line. However, this has a performance cost.
- Secret keys might still exist in CPU registers, in crypto
accelerator hardware (if used by the crypto API to implement any of
the algorithms), or in other places not explicitly considered here.
+Full system compromise
+~~~~~~~~~~~~~~~~~~~~~~
+
+An attacker who gains "root" access and/or the ability to execute
+arbitrary kernel code can freely exfiltrate data that is protected by
+any in-use fscrypt keys. Thus, usually fscrypt provides no meaningful
+protection in this scenario. (Data that is protected by a key that is
+absent throughout the entire attack remains protected, modulo the
+limitations of key removal mentioned above in the case where the key
+was removed prior to the attack.)
+
+However, if `hardware-wrapped keys`_ are used, such attackers will be
+unable to exfiltrate the master keys or file contents keys in a form
+that will be usable after the system is powered off. This may be
+useful if the attacker is significantly time-limited and/or
+bandwidth-limited, so they can only exfiltrate some data and need to
+rely on a later offline attack to exfiltrate the rest of it.
+
Limitations of v1 policies
~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -171,16 +195,20 @@ policies on all new encrypted directories.
Key hierarchy
=============
+Note: this section assumes the use of raw keys rather than
+hardware-wrapped keys. The use of hardware-wrapped keys modifies the
+key hierarchy slightly. For details, see `Hardware-wrapped keys`_.
+
Master Keys
-----------
Each encrypted directory tree is protected by a *master key*. Master
keys can be up to 64 bytes long, and must be at least as long as the
-greater of the key length needed by the contents and filenames
-encryption modes being used. For example, if AES-256-XTS is used for
-contents encryption, the master key must be 64 bytes (512 bits). Note
-that the XTS mode is defined to require a key twice as long as that
-required by the underlying block cipher.
+greater of the security strength of the contents and filenames
+encryption modes being used. For example, if any AES-256 mode is
+used, the master key must be at least 256 bits, i.e. 32 bytes. A
+stricter requirement applies if the key is used by a v1 encryption
+policy and AES-256-XTS is used; such keys must be 64 bytes.
To "unlock" an encrypted directory tree, userspace must provide the
appropriate master key. There can be any number of master keys, each
@@ -261,9 +289,9 @@ DIRECT_KEY policies
The Adiantum encryption mode (see `Encryption modes and usage`_) is
suitable for both contents and filenames encryption, and it accepts
-long IVs --- long enough to hold both an 8-byte logical block number
-and a 16-byte per-file nonce. Also, the overhead of each Adiantum key
-is greater than that of an AES-256-XTS key.
+long IVs --- long enough to hold both an 8-byte data unit index and a
+16-byte per-file nonce. Also, the overhead of each Adiantum key is
+greater than that of an AES-256-XTS key.
Therefore, to improve performance and save memory, for Adiantum a
"direct key" configuration is supported. When the user has enabled
@@ -300,8 +328,8 @@ IV_INO_LBLK_32 policies
IV_INO_LBLK_32 policies work like IV_INO_LBLK_64, except that for
IV_INO_LBLK_32, the inode number is hashed with SipHash-2-4 (where the
-SipHash key is derived from the master key) and added to the file
-logical block number mod 2^32 to produce a 32-bit IV.
+SipHash key is derived from the master key) and added to the file data
+unit index mod 2^32 to produce a 32-bit IV.
This format is optimized for use with inline encryption hardware
compliant with the eMMC v5.2 standard, which supports only 32 IV bits
@@ -332,64 +360,181 @@ Encryption modes and usage
fscrypt allows one encryption mode to be specified for file contents
and one encryption mode to be specified for filenames. Different
directory trees are permitted to use different encryption modes.
+
+Supported modes
+---------------
+
Currently, the following pairs of encryption modes are supported:
-- AES-256-XTS for contents and AES-256-CTS-CBC for filenames
-- AES-128-CBC for contents and AES-128-CTS-CBC for filenames
+- AES-256-XTS for contents and AES-256-CBC-CTS for filenames
+- AES-256-XTS for contents and AES-256-HCTR2 for filenames
- Adiantum for both contents and filenames
-
-If unsure, you should use the (AES-256-XTS, AES-256-CTS-CBC) pair.
-
-AES-128-CBC was added only for low-powered embedded devices with
-crypto accelerators such as CAAM or CESA that do not support XTS. To
-use AES-128-CBC, CONFIG_CRYPTO_ESSIV and CONFIG_CRYPTO_SHA256 (or
-another SHA-256 implementation) must be enabled so that ESSIV can be
-used.
-
-Adiantum is a (primarily) stream cipher-based mode that is fast even
-on CPUs without dedicated crypto instructions. It's also a true
-wide-block mode, unlike XTS. It can also eliminate the need to derive
-per-file encryption keys. However, it depends on the security of two
-primitives, XChaCha12 and AES-256, rather than just one. See the
-paper "Adiantum: length-preserving encryption for entry-level
-processors" (https://eprint.iacr.org/2018/720.pdf) for more details.
-To use Adiantum, CONFIG_CRYPTO_ADIANTUM must be enabled. Also, fast
-implementations of ChaCha and NHPoly1305 should be enabled, e.g.
-CONFIG_CRYPTO_CHACHA20_NEON and CONFIG_CRYPTO_NHPOLY1305_NEON for ARM.
-
-New encryption modes can be added relatively easily, without changes
-to individual filesystems. However, authenticated encryption (AE)
-modes are not currently supported because of the difficulty of dealing
-with ciphertext expansion.
+- AES-128-CBC-ESSIV for contents and AES-128-CBC-CTS for filenames
+- SM4-XTS for contents and SM4-CBC-CTS for filenames
+
+Note: in the API, "CBC" means CBC-ESSIV, and "CTS" means CBC-CTS.
+So, for example, FSCRYPT_MODE_AES_256_CTS means AES-256-CBC-CTS.
+
+Authenticated encryption modes are not currently supported because of
+the difficulty of dealing with ciphertext expansion. Therefore,
+contents encryption uses a block cipher in `XTS mode
+<https://en.wikipedia.org/wiki/Disk_encryption_theory#XTS>`_ or
+`CBC-ESSIV mode
+<https://en.wikipedia.org/wiki/Disk_encryption_theory#Encrypted_salt-sector_initialization_vector_(ESSIV)>`_,
+or a wide-block cipher. Filenames encryption uses a
+block cipher in `CBC-CTS mode
+<https://en.wikipedia.org/wiki/Ciphertext_stealing>`_ or a wide-block
+cipher.
+
+The (AES-256-XTS, AES-256-CBC-CTS) pair is the recommended default.
+It is also the only option that is *guaranteed* to always be supported
+if the kernel supports fscrypt at all; see `Kernel config options`_.
+
+The (AES-256-XTS, AES-256-HCTR2) pair is also a good choice that
+upgrades the filenames encryption to use a wide-block cipher. (A
+*wide-block cipher*, also called a tweakable super-pseudorandom
+permutation, has the property that changing one bit scrambles the
+entire result.) As described in `Filenames encryption`_, a wide-block
+cipher is the ideal mode for the problem domain, though CBC-CTS is the
+"least bad" choice among the alternatives. For more information about
+HCTR2, see `the HCTR2 paper <https://eprint.iacr.org/2021/1441.pdf>`_.
+
+Adiantum is recommended on systems where AES is too slow due to lack
+of hardware acceleration for AES. Adiantum is a wide-block cipher
+that uses XChaCha12 and AES-256 as its underlying components. Most of
+the work is done by XChaCha12, which is much faster than AES when AES
+acceleration is unavailable. For more information about Adiantum, see
+`the Adiantum paper <https://eprint.iacr.org/2018/720.pdf>`_.
+
+The (AES-128-CBC-ESSIV, AES-128-CBC-CTS) pair exists only to support
+systems whose only form of AES acceleration is an off-CPU crypto
+accelerator such as CAAM or CESA that does not support XTS.
+
+The remaining mode pairs are the "national pride ciphers":
+
+- (SM4-XTS, SM4-CBC-CTS)
+
+Generally speaking, these ciphers aren't "bad" per se, but they
+receive limited security review compared to the usual choices such as
+AES and ChaCha. They also don't bring much new to the table. It is
+suggested to only use these ciphers where their use is mandated.
+
+Kernel config options
+---------------------
+
+Enabling fscrypt support (CONFIG_FS_ENCRYPTION) automatically pulls in
+only the basic support from the crypto API needed to use AES-256-XTS
+and AES-256-CBC-CTS encryption. For optimal performance, it is
+strongly recommended to also enable any available platform-specific
+kconfig options that provide acceleration for the algorithm(s) you
+wish to use. Support for any "non-default" encryption modes typically
+requires extra kconfig options as well.
+
+Below, some relevant options are listed by encryption mode. Note,
+acceleration options not listed below may be available for your
+platform; refer to the kconfig menus. File contents encryption can
+also be configured to use inline encryption hardware instead of the
+kernel crypto API (see `Inline encryption support`_); in that case,
+the file contents mode doesn't need to be supported in the kernel crypto
+API, but the filenames mode still does.
+
+- AES-256-XTS and AES-256-CBC-CTS
+ - Recommended:
+ - arm64: CONFIG_CRYPTO_AES_ARM64_CE_BLK
+ - x86: CONFIG_CRYPTO_AES_NI_INTEL
+
+- AES-256-HCTR2
+ - Mandatory:
+ - CONFIG_CRYPTO_HCTR2
+ - Recommended:
+ - arm64: CONFIG_CRYPTO_AES_ARM64_CE_BLK
+ - arm64: CONFIG_CRYPTO_POLYVAL_ARM64_CE
+ - x86: CONFIG_CRYPTO_AES_NI_INTEL
+ - x86: CONFIG_CRYPTO_POLYVAL_CLMUL_NI
+
+- Adiantum
+ - Mandatory:
+ - CONFIG_CRYPTO_ADIANTUM
+ - Recommended:
+ - arm32: CONFIG_CRYPTO_NHPOLY1305_NEON
+ - arm64: CONFIG_CRYPTO_NHPOLY1305_NEON
+ - x86: CONFIG_CRYPTO_NHPOLY1305_SSE2
+ - x86: CONFIG_CRYPTO_NHPOLY1305_AVX2
+
+- AES-128-CBC-ESSIV and AES-128-CBC-CTS:
+ - Mandatory:
+ - CONFIG_CRYPTO_ESSIV
+ - CONFIG_CRYPTO_SHA256 or another SHA-256 implementation
+ - Recommended:
+ - AES-CBC acceleration
+
+fscrypt also uses HMAC-SHA512 for key derivation, so enabling SHA-512
+acceleration is recommended:
+
+- SHA-512
+ - Recommended:
+ - arm64: CONFIG_CRYPTO_SHA512_ARM64_CE
+ - x86: CONFIG_CRYPTO_SHA512_SSSE3
Contents encryption
-------------------
-For file contents, each filesystem block is encrypted independently.
-Starting from Linux kernel 5.5, encryption of filesystems with block
-size less than system's page size is supported.
-
-Each block's IV is set to the logical block number within the file as
-a little endian number, except that:
-
-- With CBC mode encryption, ESSIV is also used. Specifically, each IV
- is encrypted with AES-256 where the AES-256 key is the SHA-256 hash
- of the file's data encryption key.
-
-- With `DIRECT_KEY policies`_, the file's nonce is appended to the IV.
- Currently this is only allowed with the Adiantum encryption mode.
-
-- With `IV_INO_LBLK_64 policies`_, the logical block number is limited
- to 32 bits and is placed in bits 0-31 of the IV. The inode number
- (which is also limited to 32 bits) is placed in bits 32-63.
-
-- With `IV_INO_LBLK_32 policies`_, the logical block number is limited
- to 32 bits and is placed in bits 0-31 of the IV. The inode number
- is then hashed and added mod 2^32.
-
-Note that because file logical block numbers are included in the IVs,
-filesystems must enforce that blocks are never shifted around within
-encrypted files, e.g. via "collapse range" or "insert range".
+For contents encryption, each file's contents is divided into "data
+units". Each data unit is encrypted independently. The IV for each
+data unit incorporates the zero-based index of the data unit within
+the file. This ensures that each data unit within a file is encrypted
+differently, which is essential to prevent leaking information.
+
+Note: because the encryption depends on the offset into the file,
+operations like "collapse range" and "insert range" that rearrange the
+extent mapping of files are not supported on encrypted files.
+
+There are two cases for the sizes of the data units:
+
+* Fixed-size data units. This is how all filesystems other than UBIFS
+ work. A file's data units are all the same size; the last data unit
+ is zero-padded if needed. By default, the data unit size is equal
+ to the filesystem block size. On some filesystems, users can select
+ a sub-block data unit size via the ``log2_data_unit_size`` field of
+ the encryption policy; see `FS_IOC_SET_ENCRYPTION_POLICY`_.
+
+* Variable-size data units. This is what UBIFS does. Each "UBIFS
+ data node" is treated as a crypto data unit. Each contains variable
+ length, possibly compressed data, zero-padded to the next 16-byte
+ boundary. Users cannot select a sub-block data unit size on UBIFS.
+
+In the case of compression + encryption, the compressed data is
+encrypted. UBIFS compression works as described above. f2fs
+compression works a bit differently; it compresses a number of
+filesystem blocks into a smaller number of filesystem blocks.
+Therefore an f2fs-compressed file still uses fixed-size data units, and
+it is encrypted in a similar way to a file containing holes.
+
+As mentioned in `Key hierarchy`_, the default encryption setting uses
+per-file keys. In this case, the IV for each data unit is simply the
+index of the data unit in the file. However, users can select an
+encryption setting that does not use per-file keys. For these, some
+kind of file identifier is incorporated into the IVs as follows:
+
+- With `DIRECT_KEY policies`_, the data unit index is placed in bits
+ 0-63 of the IV, and the file's nonce is placed in bits 64-191.
+
+- With `IV_INO_LBLK_64 policies`_, the data unit index is placed in
+ bits 0-31 of the IV, and the file's inode number is placed in bits
+ 32-63. This setting is only allowed when data unit indices and
+ inode numbers fit in 32 bits.
+
+- With `IV_INO_LBLK_32 policies`_, the file's inode number is hashed
+ and added to the data unit index. The resulting value is truncated
+ to 32 bits and placed in bits 0-31 of the IV. This setting is only
+ allowed when data unit indices and inode numbers fit in 32 bits.
+
+The byte order of the IV is always little endian.
+
+If the user selects FSCRYPT_MODE_AES_128_CBC for the contents mode, an
+ESSIV layer is automatically included. In this case, before the IV is
+passed to AES-128-CBC, it is encrypted with AES-256 where the AES-256
+key is the SHA-256 hash of the file's contents encryption key.
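+
+To make the IV layouts above concrete, the following is a minimal C
+sketch.  It is illustrative only, not kernel code; the struct and
+function names are made up for this example::
+
+    #include <stdint.h>
+
+    /*
+     * All IV fields are little endian on disk; a big-endian CPU would
+     * need explicit byte-swapping.  The IV is zero-padded up to the
+     * size required by the encryption mode.
+     */
+    struct iv_sketch {
+        union {
+            uint8_t raw[32];
+            struct {
+                uint64_t dun;       /* data unit index (default case) */
+                uint8_t nonce[16];  /* DIRECT_KEY: nonce in bits 64-191 */
+            };
+            struct {
+                uint32_t dun32;     /* IV_INO_LBLK_*: bits 0-31 */
+                uint32_t ino;       /* IV_INO_LBLK_64: bits 32-63 */
+            };
+        };
+    };
+
+    /* IV_INO_LBLK_32: hashed inode number plus data unit index, mod 2^32 */
+    static uint32_t iv_ino_lblk_32(uint32_t hashed_ino, uint32_t dun)
+    {
+        return hashed_ino + dun;    /* unsigned addition wraps mod 2^32 */
+    }
+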
Filenames encryption
--------------------
@@ -404,11 +549,11 @@ alternatively has the file's nonce (for `DIRECT_KEY policies`_) or
inode number (for `IV_INO_LBLK_64 policies`_) included in the IVs.
Thus, IV reuse is limited to within a single directory.
-With CTS-CBC, the IV reuse means that when the plaintext filenames
-share a common prefix at least as long as the cipher block size (16
-bytes for AES), the corresponding encrypted filenames will also share
-a common prefix. This is undesirable. Adiantum does not have this
-weakness, as it is a wide-block encryption mode.
+With CBC-CTS, the IV reuse means that when the plaintext filenames share a
+common prefix at least as long as the cipher block size (16 bytes for AES), the
+corresponding encrypted filenames will also share a common prefix. This is
+undesirable. Adiantum and HCTR2 do not have this weakness, as they are
+wide-block encryption modes.
All supported filenames encryption modes accept any plaintext length
>= 16 bytes; cipher block alignment is not required. However,
@@ -436,9 +581,9 @@ FS_IOC_SET_ENCRYPTION_POLICY
The FS_IOC_SET_ENCRYPTION_POLICY ioctl sets an encryption policy on an
empty directory or verifies that a directory or regular file already
-has the specified encryption policy. It takes in a pointer to a
-:c:type:`struct fscrypt_policy_v1` or a :c:type:`struct
-fscrypt_policy_v2`, defined as follows::
+has the specified encryption policy. It takes in a pointer to
+struct fscrypt_policy_v1 or struct fscrypt_policy_v2, defined as
+follows::
#define FSCRYPT_POLICY_V1 0
#define FSCRYPT_KEY_DESCRIPTOR_SIZE 8
@@ -458,23 +603,31 @@ fscrypt_policy_v2`, defined as follows::
__u8 contents_encryption_mode;
__u8 filenames_encryption_mode;
__u8 flags;
- __u8 __reserved[4];
+ __u8 log2_data_unit_size;
+ __u8 __reserved[3];
__u8 master_key_identifier[FSCRYPT_KEY_IDENTIFIER_SIZE];
};
This structure must be initialized as follows:
-- ``version`` must be FSCRYPT_POLICY_V1 (0) if the struct is
- :c:type:`fscrypt_policy_v1` or FSCRYPT_POLICY_V2 (2) if the struct
- is :c:type:`fscrypt_policy_v2`. (Note: we refer to the original
- policy version as "v1", though its version code is really 0.) For
- new encrypted directories, use v2 policies.
+- ``version`` must be FSCRYPT_POLICY_V1 (0) if
+ struct fscrypt_policy_v1 is used or FSCRYPT_POLICY_V2 (2) if
+ struct fscrypt_policy_v2 is used. (Note: we refer to the original
+ policy version as "v1", though its version code is really 0.)
+ For new encrypted directories, use v2 policies.
- ``contents_encryption_mode`` and ``filenames_encryption_mode`` must
be set to constants from ``<linux/fscrypt.h>`` which identify the
encryption modes to use. If unsure, use FSCRYPT_MODE_AES_256_XTS
(1) for ``contents_encryption_mode`` and FSCRYPT_MODE_AES_256_CTS
- (4) for ``filenames_encryption_mode``.
+ (4) for ``filenames_encryption_mode``. For details, see `Encryption
+ modes and usage`_.
+
+ v1 encryption policies only support three combinations of modes:
+ (FSCRYPT_MODE_AES_256_XTS, FSCRYPT_MODE_AES_256_CTS),
+ (FSCRYPT_MODE_AES_128_CBC, FSCRYPT_MODE_AES_128_CTS), and
+ (FSCRYPT_MODE_ADIANTUM, FSCRYPT_MODE_ADIANTUM). v2 policies support
+ all combinations documented in `Supported modes`_.
- ``flags`` contains optional flags from ``<linux/fscrypt.h>``:
@@ -493,6 +646,29 @@ This structure must be initialized as follows:
The DIRECT_KEY, IV_INO_LBLK_64, and IV_INO_LBLK_32 flags are
mutually exclusive.
+- ``log2_data_unit_size`` is the log2 of the data unit size in bytes,
+ or 0 to select the default data unit size. The data unit size is
+ the granularity of file contents encryption. For example, setting
+ ``log2_data_unit_size`` to 12 causes file contents to be passed to the
+ underlying encryption algorithm (such as AES-256-XTS) in 4096-byte
+ data units, each with its own IV.
+
+ Not all filesystems support setting ``log2_data_unit_size``. ext4
+ and f2fs support it since Linux v6.7. On filesystems that support
+ it, the supported nonzero values are 9 through the log2 of the
+ filesystem block size, inclusively. The default value of 0 selects
+ the filesystem block size.
+
+ The main use case for ``log2_data_unit_size`` is for selecting a
+ data unit size smaller than the filesystem block size for
+ compatibility with inline encryption hardware that only supports
+ smaller data unit sizes. ``/sys/block/$disk/queue/crypto/`` may be
+ useful for checking which data unit sizes are supported by a
+ particular system's inline encryption hardware.
+
+ Leave this field zeroed unless you are certain you need it. Using
+ an unnecessarily small data unit size reduces performance.
+
- For v2 encryption policies, ``__reserved`` must be zeroed.
- For v1 encryption policies, ``master_key_descriptor`` specifies how
@@ -508,9 +684,9 @@ This structure must be initialized as follows:
replaced with ``master_key_identifier``, which is longer and cannot
be arbitrarily chosen. Instead, the key must first be added using
`FS_IOC_ADD_ENCRYPTION_KEY`_. Then, the ``key_spec.u.identifier``
- the kernel returned in the :c:type:`struct fscrypt_add_key_arg` must
- be used as the ``master_key_identifier`` in the :c:type:`struct
- fscrypt_policy_v2`.
+ the kernel returned in the struct fscrypt_add_key_arg must
+ be used as the ``master_key_identifier`` in
+ struct fscrypt_policy_v2.
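+
+As an illustration, here is a hedged sketch (error handling omitted)
+of setting a v2 policy with the default modes on an empty directory.
+The key identifier is assumed to have been returned earlier by
+`FS_IOC_ADD_ENCRYPTION_KEY`_::
+
+    #include <fcntl.h>
+    #include <string.h>
+    #include <sys/ioctl.h>
+    #include <unistd.h>
+    #include <linux/fscrypt.h>
+
+    void set_policy(const char *dir,
+                    const __u8 key_id[FSCRYPT_KEY_IDENTIFIER_SIZE])
+    {
+        struct fscrypt_policy_v2 policy = { 0 };
+        int fd = open(dir, O_RDONLY);
+
+        policy.version = FSCRYPT_POLICY_V2;
+        policy.contents_encryption_mode = FSCRYPT_MODE_AES_256_XTS;
+        policy.filenames_encryption_mode = FSCRYPT_MODE_AES_256_CTS;
+        policy.flags = FSCRYPT_POLICY_FLAGS_PAD_32;
+        memcpy(policy.master_key_identifier, key_id,
+               FSCRYPT_KEY_IDENTIFIER_SIZE);
+
+        if (ioctl(fd, FS_IOC_SET_ENCRYPTION_POLICY, &policy) != 0)
+            ;  /* handle error */
+        close(fd);
+    }
+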
If the file is not yet encrypted, then FS_IOC_SET_ENCRYPTION_POLICY
verifies that the file is an empty directory. If so, the specified
@@ -590,7 +766,7 @@ FS_IOC_GET_ENCRYPTION_POLICY_EX
The FS_IOC_GET_ENCRYPTION_POLICY_EX ioctl retrieves the encryption
policy, if any, for a directory or regular file. No additional
permissions are required beyond the ability to open the file. It
-takes in a pointer to a :c:type:`struct fscrypt_get_policy_ex_arg`,
+takes in a pointer to struct fscrypt_get_policy_ex_arg,
defined as follows::
struct fscrypt_get_policy_ex_arg {
@@ -637,9 +813,8 @@ The FS_IOC_GET_ENCRYPTION_POLICY ioctl can also retrieve the
encryption policy, if any, for a directory or regular file. However,
unlike `FS_IOC_GET_ENCRYPTION_POLICY_EX`_,
FS_IOC_GET_ENCRYPTION_POLICY only supports the original policy
-version. It takes in a pointer directly to a :c:type:`struct
-fscrypt_policy_v1` rather than a :c:type:`struct
-fscrypt_get_policy_ex_arg`.
+version. It takes in a pointer directly to struct fscrypt_policy_v1
+rather than struct fscrypt_get_policy_ex_arg.
The error codes for FS_IOC_GET_ENCRYPTION_POLICY are the same as those
for FS_IOC_GET_ENCRYPTION_POLICY_EX, except that
@@ -680,14 +855,15 @@ the filesystem, making all files on the filesystem which were
encrypted using that key appear "unlocked", i.e. in plaintext form.
It can be executed on any file or directory on the target filesystem,
but using the filesystem's root directory is recommended. It takes in
-a pointer to a :c:type:`struct fscrypt_add_key_arg`, defined as
-follows::
+a pointer to struct fscrypt_add_key_arg, defined as follows::
struct fscrypt_add_key_arg {
struct fscrypt_key_specifier key_spec;
__u32 raw_size;
__u32 key_id;
- __u32 __reserved[8];
+ #define FSCRYPT_ADD_KEY_FLAG_HW_WRAPPED 0x00000001
+ __u32 flags;
+ __u32 __reserved[7];
__u8 raw[];
};
@@ -706,21 +882,20 @@ follows::
struct fscrypt_provisioning_key_payload {
__u32 type;
- __u32 __reserved;
+ __u32 flags;
__u8 raw[];
};
-:c:type:`struct fscrypt_add_key_arg` must be zeroed, then initialized
+struct fscrypt_add_key_arg must be zeroed, then initialized
as follows:
- If the key is being added for use by v1 encryption policies, then
``key_spec.type`` must contain FSCRYPT_KEY_SPEC_TYPE_DESCRIPTOR, and
``key_spec.u.descriptor`` must contain the descriptor of the key
being added, corresponding to the value in the
- ``master_key_descriptor`` field of :c:type:`struct
- fscrypt_policy_v1`. To add this type of key, the calling process
- must have the CAP_SYS_ADMIN capability in the initial user
- namespace.
+ ``master_key_descriptor`` field of struct fscrypt_policy_v1.
+ To add this type of key, the calling process must have the
+ CAP_SYS_ADMIN capability in the initial user namespace.
Alternatively, if the key is being added for use by v2 encryption
policies, then ``key_spec.type`` must contain
@@ -735,23 +910,32 @@ as follows:
Alternatively, if ``key_id`` is nonzero, this field must be 0, since
in that case the size is implied by the specified Linux keyring key.
-- ``key_id`` is 0 if the raw key is given directly in the ``raw``
- field. Otherwise ``key_id`` is the ID of a Linux keyring key of
- type "fscrypt-provisioning" whose payload is a :c:type:`struct
- fscrypt_provisioning_key_payload` whose ``raw`` field contains the
- raw key and whose ``type`` field matches ``key_spec.type``. Since
- ``raw`` is variable-length, the total size of this key's payload
- must be ``sizeof(struct fscrypt_provisioning_key_payload)`` plus the
- raw key size. The process must have Search permission on this key.
-
- Most users should leave this 0 and specify the raw key directly.
- The support for specifying a Linux keyring key is intended mainly to
+- ``key_id`` is 0 if the key is given directly in the ``raw`` field.
+ Otherwise ``key_id`` is the ID of a Linux keyring key of type
+ "fscrypt-provisioning" whose payload is struct
+ fscrypt_provisioning_key_payload whose ``raw`` field contains the
+ key, whose ``type`` field matches ``key_spec.type``, and whose
+ ``flags`` field matches ``flags``. Since ``raw`` is
+ variable-length, the total size of this key's payload must be
+ ``sizeof(struct fscrypt_provisioning_key_payload)`` plus the number
+ of key bytes. The process must have Search permission on this key.
+
+ Most users should leave this 0 and specify the key directly. The
+ support for specifying a Linux keyring key is intended mainly to
allow re-adding keys after a filesystem is unmounted and re-mounted,
- without having to store the raw keys in userspace memory.
+ without having to store the keys in userspace memory.
+
+- ``flags`` contains optional flags from ``<linux/fscrypt.h>``:
+
+ - FSCRYPT_ADD_KEY_FLAG_HW_WRAPPED: This denotes that the key is a
+ hardware-wrapped key. See `Hardware-wrapped keys`_. This flag
+ can't be used if FSCRYPT_KEY_SPEC_TYPE_DESCRIPTOR is used.
- ``raw`` is a variable-length field which must contain the actual
key, ``raw_size`` bytes long. Alternatively, if ``key_id`` is
- nonzero, then this field is unused.
+ nonzero, then this field is unused. Note that despite being named
+ ``raw``, if FSCRYPT_ADD_KEY_FLAG_HW_WRAPPED is specified then it
+ will contain a wrapped key, not a raw key.
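+
+For illustration, a minimal sketch (error handling and key zeroization
+omitted) of adding a raw key for use by v2 policies::
+
+    #include <fcntl.h>
+    #include <stdlib.h>
+    #include <string.h>
+    #include <sys/ioctl.h>
+    #include <unistd.h>
+    #include <linux/fscrypt.h>
+
+    void add_key(const char *mountpoint, const __u8 *key, __u32 key_size,
+                 __u8 id_out[FSCRYPT_KEY_IDENTIFIER_SIZE])
+    {
+        /* 'raw' is a flexible array, so allocate the arg dynamically. */
+        struct fscrypt_add_key_arg *arg = calloc(1, sizeof(*arg) + key_size);
+        int fd = open(mountpoint, O_RDONLY);
+
+        arg->key_spec.type = FSCRYPT_KEY_SPEC_TYPE_IDENTIFIER;
+        arg->raw_size = key_size;
+        memcpy(arg->raw, key, key_size);
+
+        if (ioctl(fd, FS_IOC_ADD_ENCRYPTION_KEY, arg) == 0)
+            memcpy(id_out, arg->key_spec.u.identifier,
+                   FSCRYPT_KEY_IDENTIFIER_SIZE);
+        close(fd);
+        free(arg);  /* real code should zeroize the key bytes first */
+    }
+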
For v2 policy keys, the kernel keeps track of which user (identified
by effective user ID) added the key, and only allows the key to be
@@ -763,8 +947,8 @@ prevent that other user from unexpectedly removing it. Therefore,
FS_IOC_ADD_ENCRYPTION_KEY may also be used to add a v2 policy key
*again*, even if it's already added by other user(s). In this case,
FS_IOC_ADD_ENCRYPTION_KEY will just install a claim to the key for the
-current user, rather than actually add the key again (but the raw key
-must still be provided, as a proof of knowledge).
+current user, rather than actually add the key again (but the key must
+still be provided, as a proof of knowledge).
FS_IOC_ADD_ENCRYPTION_KEY returns 0 if the key or a claim to the key
was added or already exists.
@@ -773,20 +957,23 @@ FS_IOC_ADD_ENCRYPTION_KEY can fail with the following errors:
- ``EACCES``: FSCRYPT_KEY_SPEC_TYPE_DESCRIPTOR was specified, but the
caller does not have the CAP_SYS_ADMIN capability in the initial
- user namespace; or the raw key was specified by Linux key ID but the
+ user namespace; or the key was specified by Linux key ID but the
process lacks Search permission on the key.
+- ``EBADMSG``: invalid hardware-wrapped key
- ``EDQUOT``: the key quota for this user would be exceeded by adding
the key
- ``EINVAL``: invalid key size or key specifier type, or reserved bits
were set
-- ``EKEYREJECTED``: the raw key was specified by Linux key ID, but the
- key has the wrong type
-- ``ENOKEY``: the raw key was specified by Linux key ID, but no key
- exists with that ID
+- ``EKEYREJECTED``: the key was specified by Linux key ID, but the key
+ has the wrong type
+- ``ENOKEY``: the key was specified by Linux key ID, but no key exists
+ with that ID
- ``ENOTTY``: this type of filesystem does not implement encryption
- ``EOPNOTSUPP``: the kernel was not configured with encryption
support for this filesystem, or the filesystem superblock has not
- had encryption enabled on it
+ had encryption enabled on it; or a hardware-wrapped key was specified
+ but the filesystem does not support inline encryption or the hardware
+ does not support hardware-wrapped keys
Legacy method
~~~~~~~~~~~~~
@@ -849,9 +1036,8 @@ or removed by non-root users.
These ioctls don't work on keys that were added via the legacy
process-subscribed keyrings mechanism.
-Before using these ioctls, read the `Kernel memory compromise`_
-section for a discussion of the security goals and limitations of
-these ioctls.
+Before using these ioctls, read the `Online attacks`_ section for a
+discussion of the security goals and limitations of these ioctls.
FS_IOC_REMOVE_ENCRYPTION_KEY
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -860,8 +1046,8 @@ The FS_IOC_REMOVE_ENCRYPTION_KEY ioctl removes a claim to a master
encryption key from the filesystem, and possibly removes the key
itself. It can be executed on any file or directory on the target
filesystem, but using the filesystem's root directory is recommended.
-It takes in a pointer to a :c:type:`struct fscrypt_remove_key_arg`,
-defined as follows::
+It takes in a pointer to struct fscrypt_remove_key_arg, defined
+as follows::
struct fscrypt_remove_key_arg {
struct fscrypt_key_specifier key_spec;
@@ -956,8 +1142,8 @@ FS_IOC_GET_ENCRYPTION_KEY_STATUS
The FS_IOC_GET_ENCRYPTION_KEY_STATUS ioctl retrieves the status of a
master encryption key. It can be executed on any file or directory on
the target filesystem, but using the filesystem's root directory is
-recommended. It takes in a pointer to a :c:type:`struct
-fscrypt_get_key_status_arg`, defined as follows::
+recommended. It takes in a pointer to
+struct fscrypt_get_key_status_arg, defined as follows::
struct fscrypt_get_key_status_arg {
/* input */
@@ -988,8 +1174,8 @@ The caller must zero all input fields, then fill in ``key_spec``:
On success, 0 is returned and the kernel fills in the output fields:
- ``status`` indicates whether the key is absent, present, or
- incompletely removed. Incompletely removed means that the master
- secret has been removed, but some files are still in use; i.e.,
+ incompletely removed. Incompletely removed means that removal has
+ been initiated, but some files are still in use; i.e.,
`FS_IOC_REMOVE_ENCRYPTION_KEY`_ returned 0 but set the informational
status flag FSCRYPT_KEY_REMOVAL_STATUS_FLAG_FILES_BUSY.
@@ -1049,8 +1235,8 @@ astute users may notice some differences in behavior:
may be used to overwrite the source files but isn't guaranteed to be
effective on all filesystems and storage devices.
-- Direct I/O is not supported on encrypted files. Attempts to use
- direct I/O on such files will fall back to buffered I/O.
+- Direct I/O is supported on encrypted files only under some
+ circumstances. For details, see `Direct I/O support`_.
- The fallocate operations FALLOC_FL_COLLAPSE_RANGE and
FALLOC_FL_INSERT_RANGE are not supported on encrypted files and will
@@ -1065,11 +1251,6 @@ astute users may notice some differences in behavior:
- DAX (Direct Access) is not supported on encrypted files.
-- The st_size of an encrypted symlink will not necessarily give the
- length of the symlink target as required by POSIX. It will actually
- give the length of the ciphertext, which will be slightly longer
- than the plaintext due to NUL-padding and an extra 2-byte overhead.
-
- The maximum length of an encrypted symlink is 2 bytes shorter than
the maximum length of an unencrypted symlink. For example, on an
EXT4 filesystem with a 4K block size, unencrypted symlinks can be up
@@ -1142,23 +1323,158 @@ where applications may later write sensitive data. It is recommended
that systems implementing a form of "verified boot" take advantage of
this by validating all top-level encryption policies prior to access.
+Inline encryption support
+=========================
+
+By default, fscrypt uses the kernel crypto API for all cryptographic
+operations (other than HKDF, which fscrypt partially implements
+itself). The kernel crypto API supports hardware crypto accelerators,
+but only ones that work in the traditional way where all inputs and
+outputs (e.g. plaintexts and ciphertexts) are in memory. fscrypt can
+take advantage of such hardware, but the traditional acceleration
+model isn't particularly efficient and fscrypt hasn't been optimized
+for it.
+
+Instead, many newer systems (especially mobile SoCs) have *inline
+encryption hardware* that can encrypt/decrypt data while it is on its
+way to/from the storage device. Linux supports inline encryption
+through a set of extensions to the block layer called *blk-crypto*.
+blk-crypto allows filesystems to attach encryption contexts to bios
+(I/O requests) to specify how the data will be encrypted or decrypted
+in-line. For more information about blk-crypto, see
+:ref:`Documentation/block/inline-encryption.rst <inline_encryption>`.
+
+On supported filesystems (currently ext4 and f2fs), fscrypt can use
+blk-crypto instead of the kernel crypto API to encrypt/decrypt file
+contents. To enable this, set CONFIG_FS_ENCRYPTION_INLINE_CRYPT=y in
+the kernel configuration, and specify the "inlinecrypt" mount option
+when mounting the filesystem.
+
+Note that the "inlinecrypt" mount option just specifies to use inline
+encryption when possible; it doesn't force its use. fscrypt will
+still fall back to using the kernel crypto API on files where the
+inline encryption hardware doesn't have the needed crypto capabilities
+(e.g. support for the needed encryption algorithm and data unit size)
+and where blk-crypto-fallback is unusable. (For blk-crypto-fallback
+to be usable, it must be enabled in the kernel configuration with
+CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK=y, and the file must be
+protected by a raw key rather than a hardware-wrapped key.)
+
+By default, fscrypt uses the filesystem block size (which is usually
+4096 bytes) as the data unit size, so it can only use inline
+encryption hardware that supports that data unit size.  Selecting a
+smaller data unit size via ``log2_data_unit_size`` relaxes this
+requirement; see `FS_IOC_SET_ENCRYPTION_POLICY`_.
+
+Inline encryption doesn't affect the ciphertext or other aspects of
+the on-disk format, so users may freely switch back and forth between
+using "inlinecrypt" and not using "inlinecrypt". An exception is that
+files that are protected by a hardware-wrapped key can only be
+encrypted/decrypted by the inline encryption hardware and therefore
+can only be accessed when the "inlinecrypt" mount option is used. For
+more information about hardware-wrapped keys, see below.
+
+Hardware-wrapped keys
+---------------------
+
+fscrypt supports using *hardware-wrapped keys* when the inline
+encryption hardware supports it. Such keys are only present in kernel
+memory in wrapped (encrypted) form; they can only be unwrapped
+(decrypted) by the inline encryption hardware and are temporally bound
+to the current boot. This prevents the keys from being compromised if
+kernel memory is leaked. This is done without limiting the number of
+keys that can be used and while still allowing the execution of
+cryptographic tasks that are tied to the same key but can't use inline
+encryption hardware, e.g. filenames encryption.
+
+Note that hardware-wrapped keys aren't specific to fscrypt; they are a
+block layer feature (part of *blk-crypto*). For more details about
+hardware-wrapped keys, see the block layer documentation at
+:ref:`Documentation/block/inline-encryption.rst
+<hardware_wrapped_keys>`. The rest of this section just focuses on
+the details of how fscrypt can use hardware-wrapped keys.
+
+fscrypt supports hardware-wrapped keys by allowing the fscrypt master
+keys to be hardware-wrapped keys as an alternative to raw keys. To
+add a hardware-wrapped key with `FS_IOC_ADD_ENCRYPTION_KEY`_,
+userspace must specify FSCRYPT_ADD_KEY_FLAG_HW_WRAPPED in the
+``flags`` field of struct fscrypt_add_key_arg and also in the
+``flags`` field of struct fscrypt_provisioning_key_payload when
+applicable. The key must be in ephemerally-wrapped form, not
+long-term wrapped form.
+
+Some limitations apply. First, files protected by a hardware-wrapped
+key are tied to the system's inline encryption hardware. Therefore
+they can only be accessed when the "inlinecrypt" mount option is used,
+and they can't be included in portable filesystem images. Second,
+currently the hardware-wrapped key support is only compatible with
+`IV_INO_LBLK_64 policies`_ and `IV_INO_LBLK_32 policies`_, as it
+assumes that there is just one file contents encryption key per
+fscrypt master key rather than one per file. Future work may address
+this limitation by passing per-file nonces down the storage stack to
+allow the hardware to derive per-file keys.
+
+Implementation-wise, to encrypt/decrypt the contents of files that are
+protected by a hardware-wrapped key, fscrypt uses blk-crypto,
+attaching the hardware-wrapped key to the bio crypt contexts. As is
+the case with raw keys, the block layer will program the key into a
+keyslot when it isn't already in one. However, when programming a
+hardware-wrapped key, the hardware doesn't program the given key
+directly into a keyslot but rather unwraps it (using the hardware's
+ephemeral wrapping key) and derives the inline encryption key from it.
+The inline encryption key is the key that actually gets programmed
+into a keyslot, and it is never exposed to software.
+
+However, fscrypt doesn't just do file contents encryption; it also
+uses its master keys to derive filenames encryption keys, key
+identifiers, and sometimes some more obscure types of subkeys such as
+dirhash keys. So even with file contents encryption out of the
+picture, fscrypt still needs a raw key to work with. To get such a
+key from a hardware-wrapped key, fscrypt asks the inline encryption
+hardware to derive a cryptographically isolated "software secret" from
+the hardware-wrapped key. fscrypt uses this "software secret" to key
+its KDF to derive all subkeys other than file contents keys.
+
+Note that this implies that the hardware-wrapped key feature only
+protects the file contents encryption keys. It doesn't protect other
+fscrypt subkeys such as filenames encryption keys.
+
+Direct I/O support
+==================
+
+For direct I/O on an encrypted file to work, the following conditions
+must be met (in addition to the conditions for direct I/O on an
+unencrypted file):
+
+* The file must be using inline encryption. Usually this means that
+ the filesystem must be mounted with ``-o inlinecrypt`` and inline
+ encryption hardware must be present. However, a software fallback
+ is also available. For details, see `Inline encryption support`_.
+
+* The I/O request must be fully aligned to the filesystem block size.
+ This means that the file position the I/O is targeting, the lengths
+ of all I/O segments, and the memory addresses of all I/O buffers
+ must be multiples of this value. Note that the filesystem block
+ size may be greater than the logical block size of the block device.
+
+If either of the above conditions is not met, then direct I/O on the
+encrypted file will fall back to buffered I/O.
+
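+A userspace sketch of a direct read satisfying these rules, assuming a
+4096-byte filesystem block size (error handling omitted; real code
+should query the actual block size, e.g. via fstatfs())::
+
+    #define _GNU_SOURCE  /* for O_DIRECT */
+    #include <fcntl.h>
+    #include <stdlib.h>
+    #include <unistd.h>
+
+    ssize_t read_one_block(const char *path, off_t offset)
+    {
+        const size_t blksize = 4096;
+        void *buf;
+        int fd;
+        ssize_t ret;
+
+        if (posix_memalign(&buf, blksize, blksize))  /* aligned buffer */
+            return -1;
+        fd = open(path, O_RDONLY | O_DIRECT);
+        /* aligned file position and length */
+        ret = pread(fd, buf, blksize, offset & ~(off_t)(blksize - 1));
+        close(fd);
+        free(buf);
+        return ret;
+    }
+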
Implementation details
======================
Encryption context
------------------
-An encryption policy is represented on-disk by a :c:type:`struct
-fscrypt_context_v1` or a :c:type:`struct fscrypt_context_v2`. It is
-up to individual filesystems to decide where to store it, but normally
-it would be stored in a hidden extended attribute. It should *not* be
+An encryption policy is represented on-disk by
+struct fscrypt_context_v1 or struct fscrypt_context_v2. It is up to
+individual filesystems to decide where to store it, but normally it
+would be stored in a hidden extended attribute. It should *not* be
exposed by the xattr-related system calls such as getxattr() and
setxattr() because of the special semantics of the encryption xattr.
(In particular, there would be much confusion if an encryption policy
were to be added to or removed from anything other than an empty
directory.) These structs are defined as follows::
- #define FS_KEY_DERIVATION_NONCE_SIZE 16
+ #define FSCRYPT_FILE_NONCE_SIZE 16
#define FSCRYPT_KEY_DESCRIPTOR_SIZE 8
struct fscrypt_context_v1 {
@@ -1167,7 +1483,7 @@ directory.) These structs are defined as follows::
u8 filenames_encryption_mode;
u8 flags;
u8 master_key_descriptor[FSCRYPT_KEY_DESCRIPTOR_SIZE];
- u8 nonce[FS_KEY_DERIVATION_NONCE_SIZE];
+ u8 nonce[FSCRYPT_FILE_NONCE_SIZE];
};
#define FSCRYPT_KEY_IDENTIFIER_SIZE 16
@@ -1176,9 +1492,10 @@ directory.) These structs are defined as follows::
u8 contents_encryption_mode;
u8 filenames_encryption_mode;
u8 flags;
- u8 __reserved[4];
+ u8 log2_data_unit_size;
+ u8 __reserved[3];
u8 master_key_identifier[FSCRYPT_KEY_IDENTIFIER_SIZE];
- u8 nonce[FS_KEY_DERIVATION_NONCE_SIZE];
+ u8 nonce[FSCRYPT_FILE_NONCE_SIZE];
};
The context structs contain the same information as the corresponding
@@ -1191,12 +1508,19 @@ keys`_ and `DIRECT_KEY policies`_.
Data path changes
-----------------
-For the read path (->readpage()) of regular files, filesystems can
+When inline encryption is used, filesystems just need to associate
+encryption contexts with bios to specify how the block layer or the
+inline encryption hardware will encrypt/decrypt the file contents.
+
+When inline encryption isn't used, filesystems must encrypt/decrypt
+the file contents themselves, as described below:
+
+For the read path (->read_folio()) of regular files, filesystems can
read the ciphertext into the page cache and decrypt it in-place. The
-page lock must be held until decryption has finished, to prevent the
-page from becoming visible to userspace prematurely.
+folio lock must be held until decryption has finished, to prevent the
+folio from becoming visible to userspace prematurely.
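+
+A highly simplified pseudocode sketch of such a read path (not any
+real filesystem's code; read_ciphertext_into_folio() is a hypothetical
+placeholder, while the fscrypt_* helpers are the ones exported by
+fs/crypto/, whose exact signatures vary by kernel version)::
+
+    static int sketch_read_folio(struct file *file, struct folio *folio)
+    {
+        int err = read_ciphertext_into_folio(folio);  /* hypothetical */
+
+        if (!err && fscrypt_needs_contents_encryption(folio->mapping->host))
+            err = fscrypt_decrypt_pagecache_blocks(folio,
+                                                   folio_size(folio), 0);
+        folio_unlock(folio);  /* only after decryption has finished */
+        return err;
+    }
+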
-For the write path (->writepage()) of regular files, filesystems
+For the write path (->writepages()) of regular files, filesystems
cannot encrypt data in-place in the page cache, since the cached
plaintext must be preserved. Instead, filesystems must encrypt into a
temporary buffer or "bounce page", then write out the temporary
@@ -1225,20 +1549,20 @@ the user-supplied name to get the ciphertext.
Lookups without the key are more complicated. The raw ciphertext may
contain the ``\0`` and ``/`` characters, which are illegal in
-filenames. Therefore, readdir() must base64-encode the ciphertext for
-presentation. For most filenames, this works fine; on ->lookup(), the
-filesystem just base64-decodes the user-supplied name to get back to
-the raw ciphertext.
+filenames. Therefore, readdir() must base64url-encode the ciphertext
+for presentation. For most filenames, this works fine; on ->lookup(),
+the filesystem just base64url-decodes the user-supplied name to get
+back to the raw ciphertext.
-However, for very long filenames, base64 encoding would cause the
+However, for very long filenames, base64url encoding would cause the
filename length to exceed NAME_MAX. To prevent this, readdir()
actually presents long filenames in an abbreviated form which encodes
a strong "hash" of the ciphertext filename, along with the optional
filesystem-specific hash(es) needed for directory lookups. This
allows the filesystem to still, with a high degree of confidence, map
the filename given in ->lookup() back to a particular directory entry
-that was previously listed by readdir(). See :c:type:`struct
-fscrypt_nokey_name` in the source for more details.
+that was previously listed by readdir(). See
+struct fscrypt_nokey_name in the source for more details.
Note that the precise way that filenames are presented to userspace
without the key is subject to change in the future. It is only meant
@@ -1250,11 +1574,14 @@ Tests
To test fscrypt, use xfstests, which is Linux's de facto standard
filesystem test suite. First, run all the tests in the "encrypt"
-group on the relevant filesystem(s). For example, to test ext4 and
+group on the relevant filesystem(s). One can also run the tests
+with the 'inlinecrypt' mount option to test the implementation for
+inline encryption support. For example, to test ext4 and
f2fs encryption using `kvm-xfstests
<https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md>`_::
kvm-xfstests -c ext4,f2fs -g encrypt
+ kvm-xfstests -c ext4,f2fs -g encrypt -m inlinecrypt
UBIFS encryption can also be tested this way, but it should be done in
a separate command, and it takes some time for kvm-xfstests to set up
@@ -1276,6 +1603,7 @@ This tests the encrypted I/O paths more thoroughly. To do this with
kvm-xfstests, use the "encrypt" filesystem configuration::
kvm-xfstests -c ext4/encrypt,f2fs/encrypt -g auto
+ kvm-xfstests -c ext4/encrypt,f2fs/encrypt -g auto -m inlinecrypt
Because this runs many more tests than "-g encrypt" does, it takes
much longer to run; so also consider using `gce-xfstests
@@ -1283,3 +1611,4 @@ much longer to run; so also consider using `gce-xfstests
instead of kvm-xfstests::
gce-xfstests -c ext4/encrypt,f2fs/encrypt -g auto
+ gce-xfstests -c ext4/encrypt,f2fs/encrypt -g auto -m inlinecrypt
diff --git a/Documentation/filesystems/fsverity.rst b/Documentation/filesystems/fsverity.rst
index a95536b6443c..dacdbc1149e6 100644
--- a/Documentation/filesystems/fsverity.rst
+++ b/Documentation/filesystems/fsverity.rst
@@ -11,12 +11,12 @@ Introduction
fs-verity (``fs/verity/``) is a support layer that filesystems can
hook into to support transparent integrity and authenticity protection
-of read-only files. Currently, it is supported by the ext4 and f2fs
-filesystems. Like fscrypt, not too much filesystem-specific code is
-needed to support fs-verity.
+of read-only files. Currently, it is supported by the ext4, f2fs, and
+btrfs filesystems. Like fscrypt, not too much filesystem-specific
+code is needed to support fs-verity.
fs-verity is similar to `dm-verity
-<https://www.kernel.org/doc/Documentation/device-mapper/verity.txt>`_
+<https://www.kernel.org/doc/Documentation/admin-guide/device-mapper/verity.rst>`_
but works on files rather than block devices. On regular files on
filesystems supporting fs-verity, userspace can execute an ioctl that
causes the filesystem to build a Merkle tree for the file and persist
@@ -27,9 +27,9 @@ automatically verified against the file's Merkle tree. Reads of any
corrupted data, including mmap reads, will fail.
Userspace can use another ioctl to retrieve the root hash (actually
-the "file measurement", which is a hash that includes the root hash)
-that fs-verity is enforcing for the file. This ioctl executes in
-constant time, regardless of the file size.
+the "fs-verity file digest", which is a hash that includes the Merkle
+tree root hash) that fs-verity is enforcing for the file. This ioctl
+executes in constant time, regardless of the file size.
fs-verity is essentially a way to hash a file in constant time,
subject to the caveat that reads which would violate the hash will
@@ -38,20 +38,14 @@ fail at runtime.
Use cases
=========
-By itself, the base fs-verity feature only provides integrity
-protection, i.e. detection of accidental (non-malicious) corruption.
+By itself, fs-verity only provides integrity protection, i.e.
+detection of accidental (non-malicious) corruption.
However, because fs-verity makes retrieving the file hash extremely
efficient, it's primarily meant to be used as a tool to support
authentication (detection of malicious modifications) or auditing
(logging file hashes before use).
-Trusted userspace code (e.g. operating system code running on a
-read-only partition that is itself authenticated by dm-verity) can
-authenticate the contents of an fs-verity file by using the
-`FS_IOC_MEASURE_VERITY`_ ioctl to retrieve its hash, then verifying a
-digital signature of it.
-
A standard file hash could be used instead of fs-verity. However,
this is inefficient if the file is large and only a small portion may
be accessed. This is often the case for Android application package
@@ -69,13 +63,41 @@ still be used on read-only filesystems. fs-verity is for files that
must live on a read-write filesystem because they are independently
updated and potentially user-installed, so dm-verity cannot be used.
-The base fs-verity feature is a hashing mechanism only; actually
-authenticating the files is up to userspace. However, to meet some
-users' needs, fs-verity optionally supports a simple signature
-verification mechanism where users can configure the kernel to require
-that all fs-verity files be signed by a key loaded into a keyring; see
-`Built-in signature verification`_. Support for fs-verity file hashes
-in IMA (Integrity Measurement Architecture) policies is also planned.
+fs-verity does not mandate a particular scheme for authenticating its
+file hashes. (Similarly, dm-verity does not mandate a particular
+scheme for authenticating its block device root hashes.) Options for
+authenticating fs-verity file hashes include:
+
+- Trusted userspace code. Often, the userspace code that accesses
+ files can be trusted to authenticate them. Consider e.g. an
+ application that wants to authenticate data files before using them,
+ or an application loader that is part of the operating system (which
+ is already authenticated in a different way, such as by being loaded
+ from a read-only partition that uses dm-verity) and that wants to
+ authenticate applications before loading them. In these cases, this
+ trusted userspace code can authenticate a file's contents by
+ retrieving its fs-verity digest using `FS_IOC_MEASURE_VERITY`_, then
+ verifying a signature of it using any userspace cryptographic
+ library that supports digital signatures.
+
+- Integrity Measurement Architecture (IMA). IMA supports fs-verity
+ file digests as an alternative to its traditional full file digests.
+ "IMA appraisal" enforces that files contain a valid, matching
+ signature in their "security.ima" extended attribute, as controlled
+ by the IMA policy. For more information, see the IMA documentation.
+
+- Integrity Policy Enforcement (IPE). IPE supports enforcing access
+ control decisions based on immutable security properties of files,
+ including those protected by fs-verity's built-in signatures.
+ "IPE policy" specifically allows for the authorization of fs-verity
+ files using properties ``fsverity_digest`` for identifying
+ files by their verity digest, and ``fsverity_signature`` to authorize
+ files with a verified fs-verity built-in signature. For
+ details on configuring IPE policies and understanding its operational
+ modes, please refer to :doc:`IPE admin guide </admin-guide/LSM/ipe>`.
+
+- Trusted userspace code in combination with `Built-in signature
+ verification`_. This approach should be used only with great care.
User API
========
@@ -84,7 +106,7 @@ FS_IOC_ENABLE_VERITY
--------------------
The FS_IOC_ENABLE_VERITY ioctl enables fs-verity on a file. It takes
-in a pointer to a :c:type:`struct fsverity_enable_arg`, defined as
+in a pointer to a struct fsverity_enable_arg, defined as
follows::
struct fsverity_enable_arg {
@@ -100,29 +122,31 @@ follows::
};
This structure contains the parameters of the Merkle tree to build for
-the file, and optionally contains a signature. It must be initialized
-as follows:
+the file. It must be initialized as follows:
- ``version`` must be 1.
- ``hash_algorithm`` must be the identifier for the hash algorithm to
use for the Merkle tree, such as FS_VERITY_HASH_ALG_SHA256. See
``include/uapi/linux/fsverity.h`` for the list of possible values.
-- ``block_size`` must be the Merkle tree block size. Currently, this
- must be equal to the system page size, which is usually 4096 bytes.
- Other sizes may be supported in the future. This value is not
- necessarily the same as the filesystem block size.
+- ``block_size`` is the Merkle tree block size, in bytes. In Linux
+ v6.3 and later, this can be any power of 2 between (inclusively)
+ 1024 and the minimum of the system page size and the filesystem
+ block size. In earlier versions, the page size was the only allowed
+ value.
- ``salt_size`` is the size of the salt in bytes, or 0 if no salt is
provided. The salt is a value that is prepended to every hashed
block; it can be used to personalize the hashing for a particular
file or device. Currently the maximum salt size is 32 bytes.
- ``salt_ptr`` is the pointer to the salt, or NULL if no salt is
provided.
-- ``sig_size`` is the size of the signature in bytes, or 0 if no
- signature is provided. Currently the signature is (somewhat
- arbitrarily) limited to 16128 bytes. See `Built-in signature
- verification`_ for more information.
-- ``sig_ptr`` is the pointer to the signature, or NULL if no
- signature is provided.
+- ``sig_size`` is the size of the builtin signature in bytes, or 0 if no
+ builtin signature is provided. Currently the builtin signature is
+ (somewhat arbitrarily) limited to 16128 bytes.
+- ``sig_ptr`` is the pointer to the builtin signature, or NULL if no
+ builtin signature is provided. A builtin signature is only needed
+ if the `Built-in signature verification`_ feature is being used. It
+ is not needed for IMA appraisal, and it is not needed if the file
+ signature is being handled entirely in userspace.
- All reserved fields must be zeroed.
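+
+A hedged usage sketch (error handling omitted; the 4096-byte block
+size is an assumption, subject to the rules above)::
+
+    #include <fcntl.h>
+    #include <sys/ioctl.h>
+    #include <unistd.h>
+    #include <linux/fsverity.h>
+
+    int enable_verity(const char *path)
+    {
+        struct fsverity_enable_arg arg = { 0 };
+        int fd = open(path, O_RDONLY);  /* no writable fds may exist */
+        int err;
+
+        arg.version = 1;
+        arg.hash_algorithm = FS_VERITY_HASH_ALG_SHA256;
+        arg.block_size = 4096;
+        err = ioctl(fd, FS_IOC_ENABLE_VERITY, &arg);
+        close(fd);
+        return err;
+    }
+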
FS_IOC_ENABLE_VERITY causes the filesystem to build a Merkle tree for
@@ -146,19 +170,20 @@ fatal signal), no changes are made to the file.
FS_IOC_ENABLE_VERITY can fail with the following errors:
- ``EACCES``: the process does not have write access to the file
-- ``EBADMSG``: the signature is malformed
+- ``EBADMSG``: the builtin signature is malformed
- ``EBUSY``: this ioctl is already running on the file
- ``EEXIST``: the file already has verity enabled
- ``EFAULT``: the caller provided inaccessible memory
+- ``EFBIG``: the file is too large to enable verity on
- ``EINTR``: the operation was interrupted by a fatal signal
- ``EINVAL``: unsupported version, hash algorithm, or block size; or
reserved bits are set; or the file descriptor refers to neither a
regular file nor a directory.
- ``EISDIR``: the file descriptor refers to a directory
-- ``EKEYREJECTED``: the signature doesn't match the file
-- ``EMSGSIZE``: the salt or signature is too long
-- ``ENOKEY``: the fs-verity keyring doesn't contain the certificate
- needed to verify the signature
+- ``EKEYREJECTED``: the builtin signature doesn't match the file
+- ``EMSGSIZE``: the salt or builtin signature is too long
+- ``ENOKEY``: the ".fs-verity" keyring doesn't contain the certificate
+ needed to verify the builtin signature
- ``ENOPKG``: fs-verity recognizes the hash algorithm, but it's not
available in the kernel's crypto API as currently configured (e.g.
for SHA-512, missing CONFIG_CRYPTO_SHA512).
@@ -167,8 +192,8 @@ FS_IOC_ENABLE_VERITY can fail with the following errors:
support; or the filesystem superblock has not had the 'verity'
feature enabled on it; or the filesystem does not support fs-verity
on this file. (See `Filesystem support`_.)
-- ``EPERM``: the file is append-only; or, a signature is required and
- one was not provided.
+- ``EPERM``: the file is append-only; or, a builtin signature is
+ required and one was not provided.
- ``EROFS``: the filesystem is read-only
- ``ETXTBSY``: someone has the file open for writing. This can be the
caller's file descriptor, another open file descriptor, or the file
@@ -177,9 +202,10 @@ FS_IOC_ENABLE_VERITY can fail with the following errors:
FS_IOC_MEASURE_VERITY
---------------------
-The FS_IOC_MEASURE_VERITY ioctl retrieves the measurement of a verity
-file. The file measurement is a digest that cryptographically
-identifies the file contents that are being enforced on reads.
+The FS_IOC_MEASURE_VERITY ioctl retrieves the digest of a verity file.
+The fs-verity file digest is a cryptographic digest that identifies
+the file contents that are being enforced on reads; it is computed via
+a Merkle tree and is different from a traditional full-file digest.
This ioctl takes in a pointer to a variable-length structure::
@@ -197,7 +223,7 @@ On success, 0 is returned and the kernel fills in the structure as
follows:
- ``digest_algorithm`` will be the hash algorithm used for the file
- measurement. It will match ``fsverity_enable_arg::hash_algorithm``.
+ digest. It will match ``fsverity_enable_arg::hash_algorithm``.
- ``digest_size`` will be the size of the digest in bytes, e.g. 32
for SHA-256. (This can be redundant with ``digest_algorithm``.)
- ``digest`` will be the actual bytes of the digest.
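+
+A hedged usage sketch (error handling mostly omitted)::
+
+    #include <fcntl.h>
+    #include <stdlib.h>
+    #include <sys/ioctl.h>
+    #include <unistd.h>
+    #include <linux/fsverity.h>
+
+    struct fsverity_digest *measure_verity(const char *path)
+    {
+        const __u16 max_digest_size = 64;  /* enough for SHA-512 */
+        struct fsverity_digest *d = malloc(sizeof(*d) + max_digest_size);
+        int fd = open(path, O_RDONLY);
+
+        d->digest_size = max_digest_size;  /* in: buffer size; out: actual */
+        if (ioctl(fd, FS_IOC_MEASURE_VERITY, d) != 0) {
+            free(d);
+            d = NULL;
+        }
+        close(fd);
+        return d;
+    }
+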
@@ -216,6 +242,88 @@ FS_IOC_MEASURE_VERITY can fail with the following errors:
- ``EOVERFLOW``: the digest is longer than the specified
``digest_size`` bytes. Try providing a larger buffer.
+FS_IOC_READ_VERITY_METADATA
+---------------------------
+
+The FS_IOC_READ_VERITY_METADATA ioctl reads verity metadata from a
+verity file. This ioctl is available since Linux v5.12.
+
+This ioctl is useful for cases where the verity verification should be
+performed somewhere other than the currently running kernel.
+
+One example is a server program that takes a verity file and serves it
+to a client program, such that the client can do its own fs-verity
+compatible verification of the file. This only makes sense if the
+client doesn't trust the server and if the server needs to provide the
+storage for the client.
+
+Another example is copying verity metadata when creating filesystem
+images in userspace (such as with ``mkfs.ext4 -d``).
+
+This is a fairly specialized use case, and most fs-verity users won't
+need this ioctl.
+
+This ioctl takes in a pointer to the following structure::
+
+ #define FS_VERITY_METADATA_TYPE_MERKLE_TREE 1
+ #define FS_VERITY_METADATA_TYPE_DESCRIPTOR 2
+ #define FS_VERITY_METADATA_TYPE_SIGNATURE 3
+
+ struct fsverity_read_metadata_arg {
+ __u64 metadata_type;
+ __u64 offset;
+ __u64 length;
+ __u64 buf_ptr;
+ __u64 __reserved;
+ };
+
+``metadata_type`` specifies the type of metadata to read:
+
+- ``FS_VERITY_METADATA_TYPE_MERKLE_TREE`` reads the blocks of the
+ Merkle tree. The blocks are returned in order from the root level
+ to the leaf level. Within each level, the blocks are returned in
+ the same order that their hashes are themselves hashed.
+ See `Merkle tree`_ for more information.
+
+- ``FS_VERITY_METADATA_TYPE_DESCRIPTOR`` reads the fs-verity
+ descriptor. See `fs-verity descriptor`_.
+
+- ``FS_VERITY_METADATA_TYPE_SIGNATURE`` reads the builtin signature
+ which was passed to FS_IOC_ENABLE_VERITY, if any. See `Built-in
+ signature verification`_.
+
+The semantics are similar to those of ``pread()``. ``offset``
+specifies the offset in bytes into the metadata item to read from, and
+``length`` specifies the maximum number of bytes to read from the
+metadata item. ``buf_ptr`` is the pointer to the buffer to read into,
+cast to a 64-bit integer. ``__reserved`` must be 0. On success, the
+number of bytes read is returned. 0 is returned at the end of the
+metadata item. The returned length may be less than ``length``, for
+example if the ioctl is interrupted.
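+
+For example, a sketch of reading out the entire Merkle tree,
+``pread()``-style (error handling omitted)::
+
+    #include <stdint.h>
+    #include <sys/ioctl.h>
+    #include <linux/fsverity.h>
+
+    void read_merkle_tree(int fd, void (*consume)(const void *, size_t))
+    {
+        uint8_t buf[4096];
+        struct fsverity_read_metadata_arg arg = {
+            .metadata_type = FS_VERITY_METADATA_TYPE_MERKLE_TREE,
+            .buf_ptr = (uintptr_t)buf,
+            .length = sizeof(buf),
+        };
+        int ret;
+
+        while ((ret = ioctl(fd, FS_IOC_READ_VERITY_METADATA, &arg)) > 0) {
+            consume(buf, ret);
+            arg.offset += ret;
+        }
+    }
+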
+
+The metadata returned by FS_IOC_READ_VERITY_METADATA isn't guaranteed
+to be authenticated against the file digest that would be returned by
+`FS_IOC_MEASURE_VERITY`_, as the metadata is expected to be used to
+implement fs-verity compatible verification anyway (though absent a
+malicious disk, the metadata will indeed match). E.g. to implement
+this ioctl, the filesystem is allowed to just read the Merkle tree
+blocks from disk without actually verifying the path to the root node.
+
+FS_IOC_READ_VERITY_METADATA can fail with the following errors:
+
+- ``EFAULT``: the caller provided inaccessible memory
+- ``EINTR``: the ioctl was interrupted before any data was read
+- ``EINVAL``: reserved fields were set, or ``offset + length``
+ overflowed
+- ``ENODATA``: the file is not a verity file, or
+ FS_VERITY_METADATA_TYPE_SIGNATURE was requested but the file doesn't
+ have a builtin signature
+- ``ENOTTY``: this type of filesystem does not implement fs-verity, or
+ this ioctl is not yet implemented on it
+- ``EOPNOTSUPP``: the kernel was not configured with fs-verity
+ support, or the filesystem superblock has not had the 'verity'
+ feature enabled on it. (See `Filesystem support`_.)
+
FS_IOC_GETFLAGS
---------------
@@ -234,6 +342,8 @@ the file has fs-verity enabled. This can perform better than
FS_IOC_GETFLAGS and FS_IOC_MEASURE_VERITY because it doesn't require
opening the file, and opening verity files can be expensive.
+.. _accessing_verity_files:
+
Accessing verity files
======================
@@ -257,25 +367,24 @@ non-verity one, with the following exceptions:
with EIO (for read()) or SIGBUS (for mmap() reads).
- If the sysctl "fs.verity.require_signatures" is set to 1 and the
- file's verity measurement is not signed by a key in the fs-verity
- keyring, then opening the file will fail. See `Built-in signature
- verification`_.
+ file is not signed by a key in the ".fs-verity" keyring, then
+ opening the file will fail. See `Built-in signature verification`_.
Direct access to the Merkle tree is not supported. Therefore, if a
verity file is copied, or is backed up and restored, then it will lose
its "verity"-ness. fs-verity is primarily meant for files like
executables that are managed by a package manager.
-File measurement computation
-============================
+File digest computation
+=======================
This section describes how fs-verity hashes the file contents using a
-Merkle tree to produce the "file measurement" which cryptographically
-identifies the file contents. This algorithm is the same for all
-filesystems that support fs-verity.
+Merkle tree to produce the digest which cryptographically identifies
+the file contents. This algorithm is the same for all filesystems
+that support fs-verity.
Userspace only needs to be aware of this algorithm if it needs to
-compute the file measurement itself, e.g. in order to sign the file.
+compute fs-verity file digests itself, e.g. in order to sign files.
.. _fsverity_merkle_tree:
@@ -325,74 +434,141 @@ can't distinguish a large file from a small second file whose data
is exactly the top-level hash block of the first file. Ambiguities
also arise from the convention of padding to the next block boundary.
-To solve this problem, the verity file measurement is actually
-computed as a hash of the following structure, which contains the
-Merkle tree root hash as well as other fields such as the file size::
+To solve this problem, the fs-verity file digest is actually computed
+as a hash of the following structure, which contains the Merkle tree
+root hash as well as other fields such as the file size::
struct fsverity_descriptor {
__u8 version; /* must be 1 */
__u8 hash_algorithm; /* Merkle tree hash algorithm */
__u8 log_blocksize; /* log2 of size of data and tree blocks */
__u8 salt_size; /* size of salt in bytes; 0 if none */
- __le32 sig_size; /* must be 0 */
+ __le32 __reserved_0x04; /* must be 0 */
__le64 data_size; /* size of file the Merkle tree is built over */
__u8 root_hash[64]; /* Merkle tree root hash */
__u8 salt[32]; /* salt prepended to each hashed block */
__u8 __reserved[144]; /* must be 0's */
};
-Note that the ``sig_size`` field must be set to 0 for the purpose of
-computing the file measurement, even if a signature was provided (or
-will be provided) to `FS_IOC_ENABLE_VERITY`_.
-
Built-in signature verification
===============================
-With CONFIG_FS_VERITY_BUILTIN_SIGNATURES=y, fs-verity supports putting
-a portion of an authentication policy (see `Use cases`_) in the
-kernel. Specifically, it adds support for:
+CONFIG_FS_VERITY_BUILTIN_SIGNATURES=y adds support for in-kernel
+verification of fs-verity builtin signatures.
-1. At fs-verity module initialization time, a keyring ".fs-verity" is
- created. The root user can add trusted X.509 certificates to this
- keyring using the add_key() system call, then (when done)
- optionally use keyctl_restrict_keyring() to prevent additional
- certificates from being added.
+**IMPORTANT**! Please take great care before using this feature.
+It is not the only way to do signatures with fs-verity, and the
+alternatives (such as userspace signature verification, and IMA
+appraisal) can be much better. It's also easy to fall into a trap
+of thinking this feature solves more problems than it actually does.
+
+Enabling this option adds the following:
+
+1. At boot time, the kernel creates a keyring named ".fs-verity". The
+ root user can add trusted X.509 certificates to this keyring using
+ the add_key() system call.
2. `FS_IOC_ENABLE_VERITY`_ accepts a pointer to a PKCS#7 formatted
- detached signature in DER format of the file measurement. On
- success, this signature is persisted alongside the Merkle tree.
- Then, any time the file is opened, the kernel will verify the
- file's actual measurement against this signature, using the
- certificates in the ".fs-verity" keyring.
+ detached signature in DER format of the file's fs-verity digest.
+ On success, the ioctl persists the signature alongside the Merkle
+ tree. Then, any time the file is opened, the kernel verifies the
+ file's actual digest against this signature, using the certificates
+ in the ".fs-verity" keyring. This verification happens as long as the
+ file's signature exists, regardless of the state of the sysctl variable
+ "fs.verity.require_signatures" described in the next item. The IPE LSM
+ relies on this behavior to recognize and label fs-verity files
+ that contain a verified builtin fs-verity signature.
3. A new sysctl "fs.verity.require_signatures" is made available.
When set to 1, the kernel requires that all verity files have a
- correctly signed file measurement as described in (2).
+ correctly signed digest as described in (2).
-File measurements must be signed in the following format, which is
-similar to the structure used by `FS_IOC_MEASURE_VERITY`_::
+The data that must be signed, as described in (2), is the fs-verity
+file digest in the following format::
- struct fsverity_signed_digest {
+ struct fsverity_formatted_digest {
char magic[8]; /* must be "FSVerity" */
__le16 digest_algorithm;
__le16 digest_size;
__u8 digest[];
};
-fs-verity's built-in signature verification support is meant as a
-relatively simple mechanism that can be used to provide some level of
-authenticity protection for verity files, as an alternative to doing
-the signature verification in userspace or using IMA-appraisal.
-However, with this mechanism, userspace programs still need to check
-that the verity bit is set, and there is no protection against verity
-files being swapped around.
+That's it. It should be emphasized again that fs-verity builtin
+signatures are not the only way to do signatures with fs-verity. See
+`Use cases`_ for an overview of ways in which fs-verity can be used.
+fs-verity builtin signatures have some major limitations that should
+be carefully considered before using them:
+
+- Builtin signature verification does *not* make the kernel enforce
+ that any files actually have fs-verity enabled. Thus, it is not a
+ complete authentication policy. Currently, if it is used, one
+ way to complete the authentication policy is for trusted userspace
+ code to explicitly check whether files have fs-verity enabled with a
+ signature before they are accessed. (With
+ fs.verity.require_signatures=1, just checking whether fs-verity is
+ enabled suffices.) But, in this case the trusted userspace code
+ could just store the signature alongside the file and verify it
+ itself using a cryptographic library, instead of using this feature.
+
+- Another approach is to utilize fs-verity builtin signature
+ verification in conjunction with the IPE LSM, which supports defining
+ a kernel-enforced, system-wide authentication policy that allows only
+ files with a verified fs-verity builtin signature to perform certain
+ operations, such as execution. Note that IPE doesn't require
+ fs.verity.require_signatures=1.
+ Please refer to :doc:`IPE admin guide </admin-guide/LSM/ipe>` for
+ more details.
+
+- A file's builtin signature can only be set at the same time that
+ fs-verity is being enabled on the file. Changing or deleting the
+ builtin signature later requires re-creating the file.
+
+- Builtin signature verification uses the same set of public keys for
+ all fs-verity enabled files on the system. Different keys cannot be
+ trusted for different files; each key is all or nothing.
+
+- The sysctl fs.verity.require_signatures applies system-wide.
+ Setting it to 1 only works when all users of fs-verity on the system
+ agree that it should be set to 1. This limitation can prevent
+ fs-verity from being used in cases where it would be helpful.
+
+- Builtin signature verification can only use signature algorithms
+ that are supported by the kernel. For example, the kernel does not
+ yet support Ed25519, even though this is often the signature
+ algorithm that is recommended for new cryptographic designs.
+
+- fs-verity builtin signatures are in PKCS#7 format, and the public
+ keys are in X.509 format. These formats are commonly used,
+ including by some other kernel features (which is why the fs-verity
+ builtin signatures use them), and are very feature rich.
+ Unfortunately, history has shown that code that parses and handles
+ these formats (which are from the 1990s and are based on ASN.1)
+ often has vulnerabilities as a result of their complexity. This
+ complexity is not inherent to the cryptography itself.
+
+ fs-verity users who do not need advanced features of X.509 and
+ PKCS#7 should strongly consider using simpler formats, such as plain
+ Ed25519 keys and signatures, and verifying signatures in userspace.
+
+ fs-verity users who choose to use X.509 and PKCS#7 anyway should
+ still consider that verifying those signatures in userspace is more
+ flexible (for other reasons mentioned earlier in this document) and
+ eliminates the need to enable CONFIG_FS_VERITY_BUILTIN_SIGNATURES
+ and its associated increase in kernel attack surface. In some cases
+ it can even be necessary, since advanced X.509 and PKCS#7 features
+ do not always work as intended with the kernel. For example, the
+ kernel does not check X.509 certificate validity times.
+
+ Note: IMA appraisal, which supports fs-verity, does not use PKCS#7
+ for its signatures, so it partially avoids the issues discussed
+ here. IMA appraisal does use X.509.
Filesystem support
==================
-fs-verity is currently supported by the ext4 and f2fs filesystems.
-The CONFIG_FS_VERITY kconfig option must be enabled to use fs-verity
-on either filesystem.
+fs-verity is supported by several filesystems, described below. The
+CONFIG_FS_VERITY kconfig option must be enabled to use fs-verity on
+any of these filesystems.
``include/linux/fsverity.h`` declares the interface between the
``fs/verity/`` support layer and filesystems. Briefly, filesystems
@@ -412,17 +588,19 @@ To create verity files on an ext4 filesystem, the filesystem must have
been formatted with ``-O verity`` or had ``tune2fs -O verity`` run on
it. "verity" is an RO_COMPAT filesystem feature, so once set, old
kernels will only be able to mount the filesystem readonly, and old
-versions of e2fsck will be unable to check the filesystem. Moreover,
-currently ext4 only supports mounting a filesystem with the "verity"
-feature when its block size is equal to PAGE_SIZE (often 4096 bytes).
+versions of e2fsck will be unable to check the filesystem.
+
+Originally, an ext4 filesystem with the "verity" feature could only be
+mounted when its block size was equal to the system page size
+(typically 4096 bytes). In Linux v6.3, this limitation was removed.
ext4 sets the EXT4_VERITY_FL on-disk inode flag on verity files. It
can only be set by `FS_IOC_ENABLE_VERITY`_, and it cannot be cleared.
ext4 also supports encryption, which can be used simultaneously with
fs-verity. In this case, the plaintext data is verified rather than
-the ciphertext. This is necessary in order to make the file
-measurement meaningful, since every file is encrypted differently.
+the ciphertext. This is necessary in order to make the fs-verity file
+digest meaningful, since every file is encrypted differently.
ext4 stores the verity metadata (Merkle tree and fsverity_descriptor)
past the end of the file, starting at the first 64K boundary beyond
@@ -435,9 +613,7 @@ support paging multi-gigabyte xattrs into memory, and to support
encrypting xattrs. Note that the verity metadata *must* be encrypted
when the file is, since it contains hashes of the plaintext data.
-Currently, ext4 verity only supports the case where the Merkle tree
-block size, filesystem block size, and page size are all the same. It
-also only supports extent-based files.
+ext4 only allows verity on extent-based files.
f2fs
----
@@ -455,11 +631,17 @@ Like ext4, f2fs stores the verity metadata (Merkle tree and
fsverity_descriptor) past the end of the file, starting at the first
64K boundary beyond i_size. See explanation for ext4 above.
Moreover, f2fs supports at most 4096 bytes of xattr entries per inode,
-which wouldn't be enough for even a single Merkle tree block.
+which usually wouldn't be enough for even a single Merkle tree block.
-Currently, f2fs verity only supports a Merkle tree block size of 4096.
-Also, f2fs doesn't support enabling verity on files that currently
-have atomic or volatile writes pending.
+f2fs doesn't support enabling verity on files that currently have
+atomic or volatile writes pending.
+
+btrfs
+-----
+
+btrfs supports fs-verity since Linux v5.15. Verity-enabled inodes are
+marked with a RO_COMPAT inode flag, and the verity metadata is stored
+in separate btree items.
Implementation details
======================
@@ -476,52 +658,49 @@ already verified). Below, we describe how filesystems implement this.
Pagecache
~~~~~~~~~
-For filesystems using Linux's pagecache, the ``->readpage()`` and
-``->readpages()`` methods must be modified to verify pages before they
-are marked Uptodate. Merely hooking ``->read_iter()`` would be
+For filesystems using Linux's pagecache, the ``->read_folio()`` and
+``->readahead()`` methods must be modified to verify folios before
+they are marked Uptodate. Merely hooking ``->read_iter()`` would be
insufficient, since ``->read_iter()`` is not used for memory maps.
-Therefore, fs/verity/ provides a function fsverity_verify_page() which
-verifies a page that has been read into the pagecache of a verity
-inode, but is still locked and not Uptodate, so it's not yet readable
-by userspace. As needed to do the verification,
-fsverity_verify_page() will call back into the filesystem to read
-Merkle tree pages via fsverity_operations::read_merkle_tree_page().
+Therefore, fs/verity/ provides the function fsverity_verify_blocks()
+which verifies data that has been read into the pagecache of a verity
+inode. The containing folio must still be locked and not Uptodate, so
+it's not yet readable by userspace. As needed to do the verification,
+fsverity_verify_blocks() will call back into the filesystem to read
+hash blocks via fsverity_operations::read_merkle_tree_page().
-fsverity_verify_page() returns false if verification failed; in this
-case, the filesystem must not set the page Uptodate. Following this,
+fsverity_verify_blocks() returns false if verification failed; in this
+case, the filesystem must not set the folio Uptodate. Following this,
as per the usual Linux pagecache behavior, attempts by userspace to
-read() from the part of the file containing the page will fail with
-EIO, and accesses to the page within a memory map will raise SIGBUS.
-
-fsverity_verify_page() currently only supports the case where the
-Merkle tree block size is equal to PAGE_SIZE (often 4096 bytes).
+read() from the part of the file containing the folio will fail with
+EIO, and accesses to the folio within a memory map will raise SIGBUS.
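+
+As an illustration, the post-read step of a filesystem might look
+roughly like the following sketch. It is not any particular
+filesystem's code, and it assumes the whole-folio convenience wrapper
+``fsverity_verify_folio()`` from ``include/linux/fsverity.h``::
+
+    /*
+     * Runs while the folio is still locked and not yet Uptodate,
+     * after its data has been read (and decrypted, if needed).
+     */
+    static void post_read_verify(struct folio *folio)
+    {
+        struct inode *inode = folio->mapping->host;
+
+        if (fsverity_active(inode) && !fsverity_verify_folio(folio)) {
+            /*
+             * Leave the folio !Uptodate: read() of this range will
+             * fail with EIO, and mmap accesses will raise SIGBUS.
+             */
+            folio_unlock(folio);
+            return;
+        }
+        folio_mark_uptodate(folio);
+        folio_unlock(folio);
+    }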
-In principle, fsverity_verify_page() verifies the entire path in the
-Merkle tree from the data page to the root hash. However, for
-efficiency the filesystem may cache the hash pages. Therefore,
-fsverity_verify_page() only ascends the tree reading hash pages until
-an already-verified hash page is seen, as indicated by the PageChecked
-bit being set. It then verifies the path to that page.
+In principle, verifying a data block requires verifying the entire
+path in the Merkle tree from the data block to the root hash.
+However, for efficiency the filesystem may cache the hash blocks.
+Therefore, fsverity_verify_blocks() only ascends the tree reading hash
+blocks until an already-verified hash block is seen. It then verifies
+the path to that block.
This optimization, which is also used by dm-verity, results in
excellent sequential read performance. This is because usually (e.g.
-127 in 128 times for 4K blocks and SHA-256) the hash page from the
+127 in 128 times for 4K blocks and SHA-256) the hash block from the
bottom level of the tree will already be cached and checked from
-reading a previous data page. However, random reads perform worse.
+reading a previous data block. However, random reads perform worse.
Block device based filesystems
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Block device based filesystems (e.g. ext4 and f2fs) in Linux also use
the pagecache, so the above subsection applies too. However, they
-also usually read many pages from a file at once, grouped into a
+also usually read many data blocks from a file at once, grouped into a
structure called a "bio". To make it easier for these types of
filesystems to support fs-verity, fs/verity/ also provides a function
-fsverity_verify_bio() which verifies all pages in a bio.
+fsverity_verify_bio() which verifies all data blocks in a bio.
ext4 and f2fs also support encryption. If a verity file is also
-encrypted, the pages must be decrypted before being verified. To
+encrypted, the data must be decrypted before being verified. To
support this, these filesystems allocate a "post-read context" for
each bio and store it in ``->bi_private``::
@@ -536,17 +715,17 @@ each bio and store it in ``->bi_private``::
verity, or both is enabled. After the bio completes, for each needed
postprocessing step the filesystem enqueues the bio_post_read_ctx on a
workqueue, and then the workqueue work does the decryption or
-verification. Finally, pages where no decryption or verity error
-occurred are marked Uptodate, and the pages are unlocked.
+verification. Finally, folios where no decryption or verity error
+occurred are marked Uptodate, and the folios are unlocked.
-Files on ext4 and f2fs may contain holes. Normally, ``->readpages()``
-simply zeroes holes and sets the corresponding pages Uptodate; no bios
-are issued. To prevent this case from bypassing fs-verity, these
-filesystems use fsverity_verify_page() to verify hole pages.
+On many filesystems, files can contain holes. Normally,
+``->readahead()`` simply zeroes hole blocks and considers the
+corresponding data to be up-to-date; no bios are issued. To prevent
+this case from bypassing fs-verity, filesystems use
+fsverity_verify_blocks() to verify hole blocks.
-ext4 and f2fs disable direct I/O on verity files, since otherwise
-direct I/O would bypass fs-verity. (They also do the same for
-encrypted files.)
+Filesystems also disable direct I/O on verity files, since otherwise
+direct I/O would bypass fs-verity.
Userspace utility
=================
@@ -554,7 +733,7 @@ Userspace utility
This document focuses on the kernel, but a userspace utility for
fs-verity can be found at:
- https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/fsverity-utils.git
+ https://git.kernel.org/pub/scm/fs/fsverity/fsverity-utils.git
See the README.md file in the fsverity-utils source tree for details,
including examples of setting up fs-verity protected files.
@@ -565,7 +744,7 @@ Tests
To test fs-verity, use xfstests. For example, using `kvm-xfstests
<https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md>`_::
- kvm-xfstests -c ext4,f2fs -g verity
+ kvm-xfstests -c ext4,f2fs,btrfs -g verity
FAQ
===
@@ -581,19 +760,19 @@ weren't already directly answered in other parts of this document.
hashed and what to do with those hashes, such as log them,
authenticate them, or add them to a measurement list.
- IMA is planned to support the fs-verity hashing mechanism as an
- alternative to doing full file hashes, for people who want the
- performance and security benefits of the Merkle tree based hash.
- But it doesn't make sense to force all uses of fs-verity to be
- through IMA. As a standalone filesystem feature, fs-verity
- already meets many users' needs, and it's testable like other
+ IMA supports the fs-verity hashing mechanism as an alternative
+ to full file hashes, for those who want the performance and
+ security benefits of the Merkle tree based hash. However, it
+ doesn't make sense to force all uses of fs-verity to be through
+ IMA. fs-verity already meets many users' needs even as a
+ standalone filesystem feature, and it's testable like other
filesystem features e.g. with xfstests.
:Q: Isn't fs-verity useless because the attacker can just modify the
hashes in the Merkle tree, which is stored on-disk?
:A: To verify the authenticity of an fs-verity file you must verify
- the authenticity of the "file measurement", which is basically the
- root hash of the Merkle tree. See `Use cases`_.
+ the authenticity of the "fs-verity file digest", which
+ incorporates the root hash of the Merkle tree. See `Use cases`_.
:Q: Isn't fs-verity useless because the attacker can just replace a
verity file with a non-verity one?
@@ -659,7 +838,7 @@ weren't already directly answered in other parts of this document.
retrofit existing filesystems with new consistency mechanisms.
Data journalling is available on ext4, but is very slow.
- - Rebuilding the the Merkle tree after every write, which would be
+ - Rebuilding the Merkle tree after every write, which would be
extremely inefficient. Alternatively, a different authenticated
dictionary structure such as an "authenticated skiplist" could
be used. However, this would be far more complex.
@@ -688,25 +867,25 @@ weren't already directly answered in other parts of this document.
e.g. magically trigger construction of a Merkle tree.
:Q: Does fs-verity support remote filesystems?
-:A: Only ext4 and f2fs support is implemented currently, but in
- principle any filesystem that can store per-file verity metadata
- can support fs-verity, regardless of whether it's local or remote.
- Some filesystems may have fewer options of where to store the
- verity metadata; one possibility is to store it past the end of
- the file and "hide" it from userspace by manipulating i_size. The
- data verification functions provided by ``fs/verity/`` also assume
- that the filesystem uses the Linux pagecache, but both local and
- remote filesystems normally do so.
+:A: So far all filesystems that have implemented fs-verity support are
+ local filesystems, but in principle any filesystem that can store
+ per-file verity metadata can support fs-verity, regardless of
+ whether it's local or remote. Some filesystems may have fewer
+ options of where to store the verity metadata; one possibility is
+ to store it past the end of the file and "hide" it from userspace
+ by manipulating i_size. The data verification functions provided
+ by ``fs/verity/`` also assume that the filesystem uses the Linux
+ pagecache, but both local and remote filesystems normally do so.
:Q: Why is anything filesystem-specific at all? Shouldn't fs-verity
be implemented entirely at the VFS level?
:A: There are many reasons why this is not possible or would be very
difficult, including the following:
- - To prevent bypassing verification, pages must not be marked
+ - To prevent bypassing verification, folios must not be marked
Uptodate until they've been verified. Currently, each
- filesystem is responsible for marking pages Uptodate via
- ``->readpages()``. Therefore, currently it's not possible for
+ filesystem is responsible for marking folios Uptodate via
+ ``->readahead()``. Therefore, currently it's not possible for
the VFS to do the verification on its own. Changing this would
require significant changes to the VFS and all filesystems.
diff --git a/Documentation/filesystems/fuse-io-uring.rst b/Documentation/filesystems/fuse-io-uring.rst
new file mode 100644
index 000000000000..d73dd0dbd238
--- /dev/null
+++ b/Documentation/filesystems/fuse-io-uring.rst
@@ -0,0 +1,99 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================================
+FUSE-over-io-uring design documentation
+=======================================
+
+This documentation covers the basic details of how the fuse
+kernel/userspace communication through io-uring is configured
+and works. For generic details about FUSE, see fuse.rst.
+
+This document also covers the current interface, which is
+still in development and might change.
+
+Limitations
+===========
+As of now, not all request types are supported through io-uring;
+userspace is still required to handle some requests through /dev/fuse
+after io-uring setup is complete, specifically notifications
+(initiated from the daemon side) and interrupts.
+
+Fuse io-uring configuration
+===========================
+
+Fuse kernel requests are queued through the classical /dev/fuse
+read/write interface until io-uring setup is complete.
+
+In order to set up fuse-over-io-uring, the fuse server (user-space)
+needs to submit SQEs (opcode = IORING_OP_URING_CMD) to the /dev/fuse
+connection file descriptor. The initial submit uses the sub command
+FUSE_URING_REQ_REGISTER, which just registers entries to be
+available in the kernel.
+
+Once at least one entry per queue is submitted, the kernel starts
+to enqueue requests to the ring queues.
+Note that every CPU core has its own fuse-io-uring queue.
+Userspace handles the CQE/fuse-request and submits the result with
+the subcommand FUSE_URING_REQ_COMMIT_AND_FETCH; the kernel completes
+the request and also marks the entry as available again. If there
+are pending requests waiting, the next request is immediately
+submitted to the daemon again.
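+
+The shape of the registration submit, using liburing, might look
+roughly like the sketch below. It assumes the ring was initialized
+with ``IORING_SETUP_SQE128`` so the command payload fits into the
+SQE, and the ``FUSE_URING_CMD_REGISTER`` name is taken from the
+diagrams below; the payload layout describing the queue entry is
+elided::
+
+    #include <string.h>
+    #include <liburing.h>
+
+    /* Submit one registration command on the /dev/fuse connection fd. */
+    static int fuse_uring_register_one(struct io_uring *ring, int fuse_fd)
+    {
+        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
+
+        if (!sqe)
+            return -1;
+        memset(sqe, 0, sizeof(*sqe));
+        sqe->opcode = IORING_OP_URING_CMD;
+        sqe->fd = fuse_fd;
+        sqe->cmd_op = FUSE_URING_CMD_REGISTER; /* name as in the diagrams */
+        /* ... fill the big-SQE command area (sqe->cmd[]) here ... */
+        return io_uring_submit(ring);
+    }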
+
+Initial SQE
+-----------::
+
+ | | FUSE filesystem daemon
+ | |
+ | | >io_uring_submit()
+ | | IORING_OP_URING_CMD /
+ | | FUSE_URING_CMD_REGISTER
+ | | [wait cqe]
+ | | >io_uring_wait_cqe() or
+ | | >io_uring_submit_and_wait()
+ | |
+ | >fuse_uring_cmd() |
+ | >fuse_uring_register() |
+
+
+Sending requests with CQEs
+--------------------------::
+
+ | | FUSE filesystem daemon
+ | | [waiting for CQEs]
+ | "rm /mnt/fuse/file" |
+ | |
+ | >sys_unlink() |
+ | >fuse_unlink() |
+ | [allocate request] |
+ | >fuse_send_one() |
+ | ... |
+ | >fuse_uring_queue_fuse_req |
+ | [queue request on fg queue] |
+ | >fuse_uring_add_req_to_ring_ent() |
+ | ... |
+ | >fuse_uring_copy_to_ring() |
+ | >io_uring_cmd_done() |
+ | >request_wait_answer() |
+ | [sleep on req->waitq] |
+ | | [receives and handles CQE]
+ | | [submit result and fetch next]
+ | | >io_uring_submit()
+ | | IORING_OP_URING_CMD/
+ | | FUSE_URING_CMD_COMMIT_AND_FETCH
+ | >fuse_uring_cmd() |
+ | >fuse_uring_commit_fetch() |
+ | >fuse_uring_commit() |
+ | >fuse_uring_copy_from_ring() |
+ | [ copy the result to the fuse req] |
+ | >fuse_uring_req_end() |
+ | >fuse_request_end() |
+ | [wake up req->waitq] |
+ | >fuse_uring_next_fuse_req |
+ | [wait or handle next req] |
+ | |
+ | [req->waitq woken up] |
+ | <fuse_unlink() |
+ | <sys_unlink() |
+
+
+
diff --git a/Documentation/filesystems/fuse-io.rst b/Documentation/filesystems/fuse-io.rst
index 255a368fe534..6464de4266ad 100644
--- a/Documentation/filesystems/fuse-io.rst
+++ b/Documentation/filesystems/fuse-io.rst
@@ -15,7 +15,8 @@ The direct-io mode can be selected with the FOPEN_DIRECT_IO flag in the
FUSE_OPEN reply.
In direct-io mode the page cache is completely bypassed for reads and writes.
-No read-ahead takes place. Shared mmap is disabled.
+No read-ahead takes place. Shared mmap is disabled by default. To allow shared
+mmap, the FUSE_DIRECT_IO_ALLOW_MMAP flag may be enabled in the FUSE_INIT reply.
In cached mode reads may be satisfied from the page cache, and data may be
read-ahead by the kernel to fill the cache. The cache is always kept consistent
diff --git a/Documentation/filesystems/fuse-passthrough.rst b/Documentation/filesystems/fuse-passthrough.rst
new file mode 100644
index 000000000000..2b0e7c2da54a
--- /dev/null
+++ b/Documentation/filesystems/fuse-passthrough.rst
@@ -0,0 +1,133 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================
+FUSE Passthrough
+================
+
+Introduction
+============
+
+FUSE (Filesystem in Userspace) passthrough is a feature designed to improve the
+performance of FUSE filesystems for I/O operations. Typically, FUSE operations
+involve communication between the kernel and a userspace FUSE daemon, which can
+incur overhead. Passthrough allows certain operations on a FUSE file to bypass
+the userspace daemon and be executed directly by the kernel on an underlying
+"backing file".
+
+This is achieved by the FUSE daemon registering a file descriptor (pointing to
+the backing file on a lower filesystem) with the FUSE kernel module. The kernel
+then receives an identifier (``backing_id``) for this registered backing file.
+When a FUSE file is subsequently opened, the FUSE daemon can, in its response to
+the ``OPEN`` request, include this ``backing_id`` and set the
+``FOPEN_PASSTHROUGH`` flag. This establishes a direct link for specific
+operations.
+
+Currently, passthrough is supported for operations like ``read(2)``/``write(2)``
+(via ``read_iter``/``write_iter``), ``splice(2)``, and ``mmap(2)``.
+
+Enabling Passthrough
+====================
+
+To use FUSE passthrough (a short code sketch follows this list):
+
+ 1. The FUSE filesystem must be compiled with ``CONFIG_FUSE_PASSTHROUGH``
+ enabled.
+ 2. The FUSE daemon, during the ``FUSE_INIT`` handshake, must negotiate the
+ ``FUSE_PASSTHROUGH`` capability and specify its desired
+ ``max_stack_depth``.
+ 3. The (privileged) FUSE daemon uses the ``FUSE_DEV_IOC_BACKING_OPEN`` ioctl
+ on its connection file descriptor (e.g., ``/dev/fuse``) to register a
+ backing file descriptor and obtain a ``backing_id``.
+ 4. When handling an ``OPEN`` or ``CREATE`` request for a FUSE file, the daemon
+ replies with the ``FOPEN_PASSTHROUGH`` flag set in
+ ``fuse_open_out::open_flags`` and provides the corresponding ``backing_id``
+ in ``fuse_open_out::backing_id``.
+ 5. The FUSE daemon should eventually call ``FUSE_DEV_IOC_BACKING_CLOSE`` with
+ the ``backing_id`` to release the kernel's reference to the backing file
+ when it's no longer needed for passthrough setups.
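+
+A compressed sketch of steps 3 and 4 (error handling omitted; it
+assumes the ``struct fuse_backing_map``, ``struct fuse_open_out`` and
+ioctl definitions from ``<linux/fuse.h>``)::
+
+    #include <fcntl.h>
+    #include <sys/ioctl.h>
+    #include <linux/fuse.h>
+
+    /* Step 3: register a backing file; returns the backing_id. */
+    static int register_backing(int fuse_dev_fd, const char *path)
+    {
+        struct fuse_backing_map map = {
+            .fd = open(path, O_RDONLY),
+        };
+
+        return ioctl(fuse_dev_fd, FUSE_DEV_IOC_BACKING_OPEN, &map);
+    }
+
+    /* Step 4: in the OPEN reply, enable passthrough to that file. */
+    static void reply_passthrough(struct fuse_open_out *out, int backing_id)
+    {
+        out->open_flags |= FOPEN_PASSTHROUGH;
+        out->backing_id = backing_id;
+    }
+
+When the backing file is no longer needed (step 5), the daemon passes
+the ``backing_id`` to ``FUSE_DEV_IOC_BACKING_CLOSE`` to drop the
+kernel's reference.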
+
+Privilege Requirements
+======================
+
+Setting up passthrough functionality currently requires the FUSE daemon to
+possess the ``CAP_SYS_ADMIN`` capability. This requirement stems from several
+security and resource management considerations that are actively being
+discussed and worked on. The primary reasons for this restriction are detailed
+below.
+
+Resource Accounting and Visibility
+----------------------------------
+
+The core mechanism for passthrough involves the FUSE daemon opening a file
+descriptor to a backing file and registering it with the FUSE kernel module via
+the ``FUSE_DEV_IOC_BACKING_OPEN`` ioctl. This ioctl returns a ``backing_id``
+associated with a kernel-internal ``struct fuse_backing`` object, which holds a
+reference to the backing ``struct file``.
+
+A significant concern arises because the FUSE daemon can close its own file
+descriptor to the backing file after registration. The kernel, however, will
+still hold a reference to the ``struct file`` via the ``struct fuse_backing``
+object as long as it's associated with a ``backing_id`` (or subsequently, with
+an open FUSE file in passthrough mode).
+
+This behavior leads to two main issues for unprivileged FUSE daemons:
+
+ 1. **Invisibility to lsof and other inspection tools**: Once the FUSE
+ daemon closes its file descriptor, the open backing file held by the kernel
+ becomes "hidden." Standard tools like ``lsof``, which typically inspect
+ process file descriptor tables, would not be able to identify that this
+ file is still open by the system on behalf of the FUSE filesystem. This
+ makes it difficult for system administrators to track resource usage or
+ debug issues related to open files (e.g., preventing unmounts).
+
+ 2. **Bypassing RLIMIT_NOFILE**: The FUSE daemon process is subject to
+ resource limits, including the maximum number of open file descriptors
+ (``RLIMIT_NOFILE``). If an unprivileged daemon could register backing files
+ and then close its own FDs, it could potentially cause the kernel to hold
+ an unlimited number of open ``struct file`` references without these being
+ accounted against the daemon's ``RLIMIT_NOFILE``. This could lead to a
+ denial-of-service (DoS) by exhausting system-wide file resources.
+
+The ``CAP_SYS_ADMIN`` requirement acts as a safeguard against these issues,
+restricting this powerful capability to trusted processes.
+
+**NOTE**: ``io_uring`` solves a similar issue by exposing its "fixed files",
+which are visible via ``fdinfo`` and accounted under the registering user's
+``RLIMIT_NOFILE``.
+
+Filesystem Stacking and Shutdown Loops
+--------------------------------------
+
+Another concern relates to the potential for creating complex and problematic
+filesystem stacking scenarios if unprivileged users could set up passthrough.
+A FUSE passthrough filesystem might use a backing file that resides:
+
+ * On the *same* FUSE filesystem.
+ * On another filesystem (like OverlayFS) which itself might have an upper or
+ lower layer that is a FUSE filesystem.
+
+These configurations could create dependency loops, particularly during
+filesystem shutdown or unmount sequences, leading to deadlocks or system
+instability. This is conceptually similar to the risks associated with the
+``LOOP_SET_FD`` ioctl, which also requires ``CAP_SYS_ADMIN``.
+
+To mitigate this, FUSE passthrough already incorporates checks based on
+filesystem stacking depth (``sb->s_stack_depth`` and ``fc->max_stack_depth``).
+For example, during the ``FUSE_INIT`` handshake, the FUSE daemon can negotiate
+the ``max_stack_depth`` it supports. When a backing file is registered via
+``FUSE_DEV_IOC_BACKING_OPEN``, the kernel checks if the backing file's
+filesystem stack depth is within the allowed limit.
+
+The ``CAP_SYS_ADMIN`` requirement provides an additional layer of security,
+ensuring that only privileged users can create these potentially complex
+stacking arrangements.
+
+General Security Posture
+------------------------
+
+As a general principle for new kernel features that allow userspace to instruct
+the kernel to perform direct operations on its behalf based on user-provided
+file descriptors, starting with a higher privilege requirement (like
+``CAP_SYS_ADMIN``) is a conservative and common security practice. This allows
+the feature to be used and tested while further security implications are
+evaluated and addressed.
diff --git a/Documentation/filesystems/fuse.rst b/Documentation/filesystems/fuse.rst
index cd717f9bf940..1e31e87aee68 100644
--- a/Documentation/filesystems/fuse.rst
+++ b/Documentation/filesystems/fuse.rst
@@ -47,7 +47,7 @@ filesystems. A good example is sshfs: a secure network filesystem
using the sftp protocol.
The userspace library and utilities are available from the
-`FUSE homepage: <http://fuse.sourceforge.net/>`_
+`FUSE homepage: <https://github.com/libfuse/>`_
Filesystem type
===============
@@ -279,7 +279,7 @@ How are requirements fulfilled?
the filesystem or not.
Note that the *ptrace* check is not strictly necessary to
- prevent B/2/i, it is enough to check if mount owner has enough
+ prevent C/2/i, it is enough to check if mount owner has enough
privilege to send signal to the process accessing the
filesystem, since *SIGSTOP* can be used to get a similar effect.
@@ -288,10 +288,29 @@ I think these limitations are unacceptable?
If a sysadmin trusts the users enough, or can ensure through other
measures, that system processes will never enter non-privileged
-mounts, it can relax the last limitation with a 'user_allow_other'
-config option. If this config option is set, the mounting user can
-add the 'allow_other' mount option which disables the check for other
-users' processes.
+mounts, it can relax the last limitation in several ways:
+
+ - With the 'user_allow_other' config option. If this config option is
+ set, the mounting user can add the 'allow_other' mount option which
+ disables the check for other users' processes.
+
+ User namespaces have an unintuitive interaction with 'allow_other':
+ an unprivileged user - normally restricted from mounting with
+ 'allow_other' - could do so in a user namespace where they're
+ privileged. If any process could access such an 'allow_other' mount,
+ this would give the mounting user the ability to manipulate
+ processes in user namespaces where they're unprivileged. For this
+ reason, 'allow_other' restricts access to users in the same userns
+ or a descendant.
+
+ - With the 'allow_sys_admin_access' module option. If this option is
+ set, the super user's processes have unrestricted access to mounts,
+ irrespective of the allow_other setting or the user namespace of the
+ mounting user.
+
+Note that both of these relaxations expose the system to potential
+information leak or *DoS* as described in points B and C/2/i-ii in the
+preceding section.
Kernel - userspace interface
============================
diff --git a/Documentation/filesystems/gfs2-glocks.rst b/Documentation/filesystems/gfs2-glocks.rst
index d14f230f0b12..adc0d4c4d979 100644
--- a/Documentation/filesystems/gfs2-glocks.rst
+++ b/Documentation/filesystems/gfs2-glocks.rst
@@ -20,8 +20,7 @@ The gl_holders list contains all the queued lock requests (not
just the holders) associated with the glock. If there are any
held locks, then they will be contiguous entries at the head
of the list. Locks are granted in strictly the order that they
-are queued, except for those marked LM_FLAG_PRIORITY which are
-used only during recovery, and even then only for journal locks.
+are queued.
There are three lock states that users of the glock layer can request,
namely shared (SH), deferred (DF) and exclusive (EX). Those translate
@@ -41,14 +40,14 @@ shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O
operations. The glocks are basically a lock plus some routines which deal
with cache management. The following rules apply for the cache:
-========== ========== ============== ========== ==============
-Glock mode Cache data Cache Metadata Dirty Data Dirty Metadata
-========== ========== ============== ========== ==============
- UN No No No No
- SH Yes Yes No No
- DF No Yes No No
- EX Yes Yes Yes Yes
-========== ========== ============== ========== ==============
+========== ============== ========== ========== ==============
+Glock mode Cache Metadata Cache data Dirty Data Dirty Metadata
+========== ============== ========== ========== ==============
+ UN No No No No
+ DF Yes No No No
+ SH Yes Yes No No
+ EX Yes Yes Yes Yes
+========== ============== ========== ========== ==============
These rules are implemented using the various glock operations which
are defined for each type of glock. Not all types of glocks use
@@ -56,53 +55,50 @@ all the modes. Only inode glocks use the DF mode for example.
Table of glock operations and per type constants:
-============= =============================================================
+============== =============================================================
Field Purpose
-============= =============================================================
-go_xmote_th Called before remote state change (e.g. to sync dirty data)
+============== =============================================================
+go_sync Called before remote state change (e.g. to sync dirty data)
go_xmote_bh Called after remote state change (e.g. to refill cache)
go_inval Called if remote state change requires invalidating the cache
-go_demote_ok Returns boolean value of whether its ok to demote a glock
- (e.g. checks timeout, and that there is no cached data)
-go_lock Called for the first local holder of a lock
-go_unlock Called on the final local unlock of a lock
+go_instantiate Called when a glock has been acquired
+go_held Called every time a glock holder is acquired
go_dump Called to print content of object for debugfs file, or on
error to dump glock to the log.
-go_type The type of the glock, ``LM_TYPE_*``
go_callback Called if the DLM sends a callback to drop this lock
+go_unlocked Called when a glock is unlocked (dlm_unlock())
+go_type The type of the glock, ``LM_TYPE_*``
go_flags GLOF_ASPACE is set, if the glock has an address space
associated with it
-============= =============================================================
+============== =============================================================
The minimum hold time for each lock is the time after a remote lock
grant for which we ignore remote demote requests. This is in order to
prevent a situation where locks are being bounced around the cluster
from node to node with none of the nodes making any progress. This
-tends to show up most with shared mmaped files which are being written
+tends to show up most with shared mmapped files which are being written
to by multiple nodes. By delaying the demotion in response to a
remote callback, that gives the userspace program time to make
some progress before the pages are unmapped.
-There is a plan to try and remove the go_lock and go_unlock callbacks
-if possible, in order to try and speed up the fast path though the locking.
-Also, eventually we hope to make the glock "EX" mode locally shared
-such that any local locking will be done with the i_mutex as required
-rather than via the glock.
+Eventually, we hope to make the glock "EX" mode locally shared such that any
+local locking will be done with the i_mutex as required rather than via the
+glock.
Locking rules for glock operations:
-============= ====================== =============================
+============== ====================== =============================
Operation GLF_LOCK bit lock held gl_lockref.lock spinlock held
-============= ====================== =============================
-go_xmote_th Yes No
+============== ====================== =============================
+go_sync Yes No
go_xmote_bh Yes No
go_inval Yes No
-go_demote_ok Sometimes Yes
-go_lock Yes No
-go_unlock Yes No
+go_instantiate No No
+go_held No No
go_dump Sometimes Yes
go_callback Sometimes (N/A) Yes
-============= ====================== =============================
+go_unlocked Yes No
+============== ====================== =============================
.. Note::
diff --git a/Documentation/filesystems/gfs2.rst b/Documentation/filesystems/gfs2.rst
index 8d1ab589ce18..1bc48a13430c 100644
--- a/Documentation/filesystems/gfs2.rst
+++ b/Documentation/filesystems/gfs2.rst
@@ -1,53 +1,52 @@
.. SPDX-License-Identifier: GPL-2.0
-==================
-Global File System
-==================
+====================
+Global File System 2
+====================
-https://fedorahosted.org/cluster/wiki/HomePage
-
-GFS is a cluster file system. It allows a cluster of computers to
+GFS2 is a cluster file system. It allows a cluster of computers to
simultaneously use a block device that is shared between them (with FC,
-iSCSI, NBD, etc). GFS reads and writes to the block device like a local
+iSCSI, NBD, etc). GFS2 reads and writes to the block device like a local
file system, but also uses a lock module to allow the computers to coordinate
their I/O so file system consistency is maintained. One of the nifty
-features of GFS is perfect consistency -- changes made to the file system
+features of GFS2 is perfect consistency -- changes made to the file system
on one machine show up immediately on all other machines in the cluster.
-GFS uses interchangeable inter-node locking mechanisms, the currently
+GFS2 uses interchangeable inter-node locking mechanisms, the currently
supported mechanisms are:
lock_nolock
- - allows gfs to be used as a local file system
+ - allows GFS2 to be used as a local file system
lock_dlm
- - uses a distributed lock manager (dlm) for inter-node locking.
+ - uses the distributed lock manager (dlm) for inter-node locking.
The dlm is found at linux/fs/dlm/
-Lock_dlm depends on user space cluster management systems found
+lock_dlm depends on user space cluster management systems found
at the URL above.
-To use gfs as a local file system, no external clustering systems are
+To use GFS2 as a local file system, no external clustering systems are
needed, simply::
$ mkfs -t gfs2 -p lock_nolock -j 1 /dev/block_device
$ mount -t gfs2 /dev/block_device /dir
-If you are using Fedora, you need to install the gfs2-utils package
-and, for lock_dlm, you will also need to install the cman package
-and write a cluster.conf as per the documentation. For F17 and above
-cman has been replaced by the dlm package.
+The gfs2-utils package is required on all cluster nodes and, for lock_dlm, you
+will also need the dlm and corosync user space utilities configured as per the
+documentation.
+
+gfs2-utils can be found at https://pagure.io/gfs2-utils
GFS2 is not on-disk compatible with previous versions of GFS, but it
is pretty close.
-The following man pages can be found at the URL above:
+The following man pages are available from gfs2-utils:
============ =============================================
fsck.gfs2 to repair a filesystem
gfs2_grow to expand a filesystem online
gfs2_jadd to add journals to a filesystem online
tunegfs2 to manipulate, examine and tune a filesystem
- gfs2_convert to convert a gfs filesystem to gfs2 in-place
+ gfs2_convert to convert a gfs filesystem to GFS2 in-place
mkfs.gfs2 to make a filesystem
============ =============================================
diff --git a/Documentation/filesystems/hfs.rst b/Documentation/filesystems/hfs.rst
index ab17a005e9b1..776015c80e3f 100644
--- a/Documentation/filesystems/hfs.rst
+++ b/Documentation/filesystems/hfs.rst
@@ -76,7 +76,7 @@ Creating HFS filesystems
The hfsutils package from Robert Leslie contains a program called
hformat that can be used to create HFS filesystem. See
-<http://www.mars.org/home/rob/proj/hfs/> for details.
+<https://www.mars.org/home/rob/proj/hfs/> for details.
Credits
diff --git a/Documentation/filesystems/hpfs.rst b/Documentation/filesystems/hpfs.rst
index 0db152278572..7e0dd2f4373e 100644
--- a/Documentation/filesystems/hpfs.rst
+++ b/Documentation/filesystems/hpfs.rst
@@ -7,7 +7,7 @@ Read/Write HPFS 2.09
1998-2004, Mikulas Patocka
:email: mikulas@artax.karlin.mff.cuni.cz
-:homepage: http://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi
+:homepage: https://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi
Credits
=======
diff --git a/Documentation/filesystems/idmappings.rst b/Documentation/filesystems/idmappings.rst
new file mode 100644
index 000000000000..2a206129f828
--- /dev/null
+++ b/Documentation/filesystems/idmappings.rst
@@ -0,0 +1,1039 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Idmappings
+==========
+
+Most filesystem developers will have encountered idmappings. They are used when
+reading from or writing ownership to disk, reporting ownership to userspace, or
+for permission checking. This document is aimed at filesystem developers who
+want to know how idmappings work.
+
+Formal notes
+------------
+
+An idmapping is essentially a translation of a range of ids into another or the
+same range of ids. The notational convention for idmappings that is widely used
+in userspace is::
+
+ u:k:r
+
+``u`` indicates the first element in the upper idmapset ``U`` and ``k``
+indicates the first element in the lower idmapset ``K``. The ``r`` parameter
+indicates the range of the idmapping, i.e. how many ids are mapped. From now
+on, we will always prefix ids with ``u`` or ``k`` to make it clear whether
+we're talking about an id in the upper or lower idmapset.
+
+To see what this looks like in practice, let's take the following idmapping::
+
+ u22:k10000:r3
+
+and write down the mappings it will generate::
+
+ u22 -> k10000
+ u23 -> k10001
+ u24 -> k10002
+
+From a mathematical viewpoint ``U`` and ``K`` are well-ordered sets and an
+idmapping is an order isomorphism from ``U`` into ``K``. So ``U`` and ``K`` are
+order isomorphic. In fact, ``U`` and ``K`` are always well-ordered subsets of
+the set of all possible ids usable on a given system.
+
+Looking at this mathematically briefly will help us highlight some properties
+that make it easier to understand how we can translate between idmappings. For
+example, we know that the inverse idmapping is an order isomorphism as well::
+
+ k10000 -> u22
+ k10001 -> u23
+ k10002 -> u24
+
+Given that we are dealing with order isomorphisms plus the fact that we're
+dealing with subsets we can embed idmappings into each other, i.e. we can
+sensibly translate between different idmappings. For example, assume we've been
+given the three idmappings::
+
+ 1. u0:k10000:r10000
+ 2. u0:k20000:r10000
+ 3. u0:k30000:r10000
+
+and id ``k11000`` which has been generated by the first idmapping by mapping
+``u1000`` from the upper idmapset down to ``k11000`` in the lower idmapset.
+
+Because we're dealing with order isomorphic subsets it is meaningful to ask
+what id ``k11000`` corresponds to in the second or third idmapping. The
+straightforward algorithm to use is to apply the inverse of the first idmapping,
+mapping ``k11000`` up to ``u1000``. Afterwards, we can map ``u1000`` down using
+either the second or the third idmapping. The second
+idmapping would map ``u1000`` down to ``k21000``. The third idmapping would map
+``u1000`` down to ``k31000``.
+
+If we were given the same task for the following three idmappings::
+
+ 1. u0:k10000:r10000
+ 2. u0:k20000:r200
+ 3. u0:k30000:r300
+
+we would fail to translate as the sets aren't order isomorphic over the full
+range of the first idmapping anymore (however, they are order isomorphic over
+the full range of the second idmapping). Neither the second nor the third
+idmapping contains ``u1000`` in the upper idmapset ``U``. This is equivalent
+to not having
+an id mapped. We can simply say that ``u1000`` is unmapped in the second and
+third idmapping. The kernel will report unmapped ids as the overflowuid
+``(uid_t)-1`` or overflowgid ``(gid_t)-1`` to userspace.
+
+The algorithm to calculate what a given id maps to is pretty simple. First, we
+need to verify that the range can contain our target id. We will skip this step
+for simplicity. After that, if we want to know what ``id`` maps to, we can do
+simple calculations:
+
+- If we want to map from left to right::
+
+ u:k:r
+ id - u + k = n
+
+- If we want to map from right to left::
+
+ u:k:r
+ id - k + u = n
+
+Instead of "left to right" we can also say "down" and instead of "right to
+left" we can also say "up". Obviously mapping down and up invert each other.
+
+To see whether the simple formulas above work, consider the following two
+idmappings::
+
+ 1. u0:k20000:r10000
+ 2. u500:k30000:r10000
+
+Assume we are given ``k21000`` in the lower idmapset of the first idmapping. We
+want to know what id this was mapped from in the upper idmapset of the first
+idmapping. So we're mapping up in the first idmapping::
+
+ id - k + u = n
+ k21000 - k20000 + u0 = u1000
+
+Now assume we are given the id ``u1100`` in the upper idmapset of the second
+idmapping and we want to know what this id maps down to in the lower idmapset
+of the second idmapping. This means we're mapping down in the second
+idmapping::
+
+ id - u + k = n
+ u1100 - u500 + k30000 = k30600
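+
+The same calculations, including the range check we skipped above, can
+be written out as a small standalone program (an illustration, not
+kernel code)::
+
+    #include <stdio.h>
+
+    /*
+     * An idmapping u:k:r.  Each helper returns the mapped id, or -1
+     * if 'id' lies outside the relevant idmapset, i.e. is unmapped.
+     */
+    static long map_down(long u, long k, long r, long id) /* left to right */
+    {
+        return (id >= u && id < u + r) ? id - u + k : -1;
+    }
+
+    static long map_up(long u, long k, long r, long id) /* right to left */
+    {
+        return (id >= k && id < k + r) ? id - k + u : -1;
+    }
+
+    int main(void)
+    {
+        /* First example: map k21000 up in u0:k20000:r10000. */
+        printf("u%ld\n", map_up(0, 20000, 10000, 21000));    /* u1000 */
+
+        /* Second example: map u1100 down in u500:k30000:r10000. */
+        printf("k%ld\n", map_down(500, 30000, 10000, 1100)); /* k30600 */
+        return 0;
+    }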
+
+General notes
+-------------
+
+In the context of the kernel an idmapping can be interpreted as mapping a range
+of userspace ids into a range of kernel ids::
+
+ userspace-id:kernel-id:range
+
+A userspace id is always an element in the upper idmapset of an idmapping of
+type ``uid_t`` or ``gid_t`` and a kernel id is always an element in the lower
+idmapset of an idmapping of type ``kuid_t`` or ``kgid_t``. From now on
+"userspace id" will be used to refer to the well known ``uid_t`` and ``gid_t``
+types and "kernel id" will be used to refer to ``kuid_t`` and ``kgid_t``.
+
+The kernel is mostly concerned with kernel ids. They are used when performing
+permission checks and are stored in an inode's ``i_uid`` and ``i_gid`` field.
+A userspace id on the other hand is an id that is reported to userspace by the
+kernel, or is passed by userspace to the kernel, or a raw device id that is
+written or read from disk.
+
+Note that we are only concerned with idmappings as the kernel stores them not
+how userspace would specify them.
+
+For the rest of this document we will prefix all userspace ids with ``u`` and
+all kernel ids with ``k``. Ranges of idmappings will be prefixed with ``r``. So
+an idmapping will be written as ``u0:k10000:r10000``.
+
+For example, within this idmapping, the id ``u1000`` is an id in the upper
+idmapset or "userspace idmapset" starting with ``u0``. And it is mapped to
+``k11000`` which is a kernel id in the lower idmapset or "kernel idmapset"
+starting with ``k10000``.
+
+A kernel id is always created by an idmapping. Such idmappings are associated
+with user namespaces. Since we mainly care about how idmappings work we're not
+going to be concerned with how idmappings are created nor how they are used
+outside of the filesystem context. This is best left to an explanation of user
+namespaces.
+
+The initial user namespace is special. It always has an idmapping of the
+following form::
+
+ u0:k0:r4294967295
+
+which is an identity idmapping over the full range of ids available on this
+system.
+
+Other user namespaces usually have non-identity idmappings such as::
+
+ u0:k10000:r10000
+
+When a process creates or wants to change ownership of a file, or when the
+ownership of a file is read from disk by a filesystem, the userspace id is
+immediately translated into a kernel id according to the idmapping associated
+with the relevant user namespace.
+
+For instance, consider a file that is stored on disk by a filesystem as being
+owned by ``u1000``:
+
+- If a filesystem were to be mounted in the initial user namespace (as most
+ filesystems are) then the initial idmapping will be used. As we saw this is
+ simply the identity idmapping. This would mean id ``u1000`` read from disk
+ would be mapped to id ``k1000``. So an inode's ``i_uid`` and ``i_gid`` field
+ would contain ``k1000``.
+
+- If a filesystem were to be mounted with an idmapping of ``u0:k10000:r10000``
+ then ``u1000`` read from disk would be mapped to ``k11000``. So an inode's
+ ``i_uid`` and ``i_gid`` would contain ``k11000``.
+
+Translation algorithms
+----------------------
+
+We've already seen briefly that it is possible to translate between different
+idmappings. We'll now take a closer look how that works.
+
+Crossmapping
+~~~~~~~~~~~~
+
+This translation algorithm is used by the kernel in quite a few places. For
+example, it is used when reporting back the ownership of a file to userspace
+via the ``stat()`` system call family.
+
+If we've been given ``k11000`` from one idmapping we can map that id up in
+another idmapping. In order for this to work both idmappings need to contain
+the same kernel id in their kernel idmapsets. For example, consider the
+following idmappings::
+
+ 1. u0:k10000:r10000
+ 2. u20000:k10000:r10000
+
+and we are mapping ``u1000`` down to ``k11000`` in the first idmapping. We can
+then translate ``k11000`` into a userspace id in the second idmapping using the
+kernel idmapset of the second idmapping::
+
+ /* Map the kernel id up into a userspace id in the second idmapping. */
+ from_kuid(u20000:k10000:r10000, k11000) = u21000
+
+Note how we can get back to the kernel id in the first idmapping by inverting
+the algorithm::
+
+ /* Map the userspace id down into a kernel id in the second idmapping. */
+ make_kuid(u20000:k10000:r10000, u21000) = k11000
+
+ /* Map the kernel id up into a userspace id in the first idmapping. */
+ from_kuid(u0:k10000:r10000, k11000) = u1000
+
+This algorithm allows us to answer the question what userspace id a given
+kernel id corresponds to in a given idmapping. In order to be able to answer
+this question both idmappings need to contain the same kernel id in their
+respective kernel idmapsets.
+
+For example, when the kernel reads a raw userspace id from disk it maps it down
+into a kernel id according to the idmapping associated with the filesystem.
+Let's assume the filesystem was mounted with an idmapping of
+``u0:k20000:r10000`` and it reads a file owned by ``u1000`` from disk. This
+means ``u1000`` will be mapped to ``k21000`` which is what will be stored in
+the inode's ``i_uid`` and ``i_gid`` field.
+
+When someone in userspace calls ``stat()`` or a related function to get
+ownership information about the file the kernel can't simply map the id back up
+according to the filesystem's idmapping as this would give the wrong owner if
+the caller is using an idmapping.
+
+So the kernel will map the id back up in the idmapping of the caller. Let's
+assume the caller has the somewhat unconventional idmapping
+``u3000:k20000:r10000``; then ``k21000`` would map back up to ``u4000``.
+Consequently the user would see that this file is owned by ``u4000``.
+
+Remapping
+~~~~~~~~~
+
+It is possible to translate a kernel id from one idmapping to another one via
+the userspace idmapset of the two idmappings. This is equivalent to remapping
+a kernel id.
+
+Let's look at an example. We are given the following two idmappings::
+
+ 1. u0:k10000:r10000
+ 2. u0:k20000:r10000
+
+and we are given ``k11000`` in the first idmapping. In order to translate this
+kernel id in the first idmapping into a kernel id in the second idmapping we
+need to perform two steps:
+
+1. Map the kernel id up into a userspace id in the first idmapping::
+
+ /* Map the kernel id up into a userspace id in the first idmapping. */
+ from_kuid(u0:k10000:r10000, k11000) = u1000
+
+2. Map the userspace id down into a kernel id in the second idmapping::
+
+ /* Map the userspace id down into a kernel id in the second idmapping. */
+ make_kuid(u0:k20000:r10000, u1000) = k21000
+
+As you can see we used the userspace idmapset in both idmappings to translate
+the kernel id in one idmapping to a kernel id in another idmapping.
+
+This allows us to answer the question what kernel id we would need to use to
+get the same userspace id in another idmapping. In order to be able to answer
+this question both idmappings need to contain the same userspace id in their
+respective userspace idmapsets.
+
+Note how we can easily get back to the kernel id in the first idmapping by
+inverting the algorithm:
+
+1. Map the kernel id up into a userspace id in the second idmapping::
+
+ /* Map the kernel id up into a userspace id in the second idmapping. */
+ from_kuid(u0:k20000:r10000, k21000) = u1000
+
+2. Map the userspace id down into a kernel id in the first idmapping::
+
+ /* Map the userspace id down into a kernel id in the first idmapping. */
+ make_kuid(u0:k10000:r10000, u1000) = k11000
+
+Another way to look at this translation is to treat it as inverting one
+idmapping and applying another idmapping if both idmappings have the relevant
+userspace id mapped. This will come in handy when working with idmapped mounts.
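+
+Expressed as code, remapping is just the two helpers chained together.
+A minimal sketch under the names used in this document (the ``remap_kuid()``
+helper itself is hypothetical)::
+
+ kuid_t remap_kuid(struct user_namespace *from, struct user_namespace *to,
+                   kuid_t kuid)
+ {
+         /* Map the kernel id up into a userspace id in the first idmapping. */
+         uid_t uid = from_kuid(from, kuid);
+
+         if (uid == (uid_t)-1)
+                 return INVALID_UID;
+
+         /* Map the userspace id down into a kernel id in the second idmapping. */
+         return make_kuid(to, uid);
+ }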
+
+Invalid translations
+~~~~~~~~~~~~~~~~~~~~
+
+It is never valid to use an id in the kernel idmapset of one idmapping as the
+id in the userspace idmapset of another or the same idmapping. While the kernel
+idmapset always indicates an idmapset in the kernel id space the userspace
+idmapset indicates a userspace id. So the following translations are forbidden::
+
+ /* Map the userspace id down into a kernel id in the first idmapping. */
+ make_kuid(u0:k10000:r10000, u1000) = k11000
+
+ /* INVALID: Map the kernel id down into a kernel id in the second idmapping. */
+ make_kuid(u10000:k20000:r10000, k11000) = k21000
+ ~~~~~~
+
+and equally wrong::
+
+ /* Map the kernel id up into a userspace id in the first idmapping. */
+ from_kuid(u0:k10000:r10000, k11000) = u1000
+
+ /* INVALID: Map the userspace id up into a userspace id in the second idmapping. */
+ from_kuid(u20000:k0:r10000, u1000) = u21000
+ ~~~~~
+
+Since userspace ids have type ``uid_t`` and ``gid_t`` and kernel ids have type
+``kuid_t`` and ``kgid_t`` the compiler will throw an error when they are
+conflated. So the two examples above would cause a compilation failure.
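+
+The compiler can catch this because kernel ids are not plain integers.
+A sketch of the type definitions, modeled on ``include/linux/uidgid_types.h``::
+
+ /* Wrapping the value in a struct makes uid_t and kuid_t incompatible types. */
+ typedef struct {
+         uid_t val;
+ } kuid_t;
+
+ typedef struct {
+         gid_t val;
+ } kgid_t;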
+
+Idmappings when creating filesystem objects
+-------------------------------------------
+
+The concepts of mapping an id down or mapping an id up are expressed in the two
+kernel functions filesystem developers are rather familiar with and which we've
+already used in this document::
+
+ /* Map the userspace id down into a kernel id. */
+ make_kuid(idmapping, uid)
+
+ /* Map the kernel id up into a userspace id. */
+ from_kuid(idmapping, kuid)
+
+We will take an abbreviated look into how idmappings figure into creating
+filesystem objects. For simplicity we will only look at what happens when the
+VFS has already completed path lookup right before it calls into the filesystem
+itself. So we're concerned with what happens when e.g. ``vfs_mkdir()`` is
+called. We will also assume that the directory we're creating filesystem
+objects in is readable and writable for everyone.
+
+When creating a filesystem object the kernel will look at the caller's
+filesystem ids. These are just regular ``uid_t`` and ``gid_t`` userspace ids
+but they are exclusively used when determining file ownership which is why they
+are called "filesystem ids". They are usually identical to the uid and gid of
+the caller but can differ. We will just assume they are always identical to not
+get lost in too many details.
+
+When the caller enters the kernel two things happen:
+
+1. Map the caller's userspace ids down into kernel ids in the caller's
+ idmapping.
+ (To be precise, the kernel will simply look at the kernel ids stashed in the
+ credentials of the current task but for illustration we'll pretend this
+ translation happens just in time.)
+2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
+ filesystem's idmapping.
+
+The second step is important as a regular filesystem will ultimately need to map
+the kernel id back up into a userspace id when writing to disk.
+So with the second step the kernel guarantees that a valid userspace id can be
+written to disk. If it can't the kernel will refuse the creation request to not
+even remotely risk filesystem corruption.
+
+The astute reader will have realized that this is simply a variation of the
+crossmapping algorithm we mentioned above in a previous section. First, the
+kernel maps the caller's userspace id down into a kernel id according to the
+caller's idmapping and then maps that kernel id up according to the
+filesystem's idmapping.
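+
+A hedged sketch of this two-step check, using the real ``kuid_has_mapping()``
+helper and simplified error handling::
+
+ /* Map the caller's filesystem uid down in the caller's idmapping ... */
+ kuid_t kuid = make_kuid(current_user_ns(), fsuid);
+ if (!uid_valid(kuid))
+         return -EOVERFLOW;
+
+ /* ... and verify that it maps up in the filesystem's idmapping. */
+ if (!kuid_has_mapping(sb->s_user_ns, kuid))
+         return -EOVERFLOW;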
+
+From an implementation point of view it's worth mentioning how idmappings are represented.
+All idmappings are taken from the corresponding user namespace.
+
+ - caller's idmapping (usually taken from ``current_user_ns()``)
+ - filesystem's idmapping (``sb->s_user_ns``)
+ - mount's idmapping (``mnt_idmap(vfsmnt)``)
+
+Let's see some examples with caller/filesystem idmapping but without mount
+idmappings. This will exhibit some problems we can hit. After that we will
+revisit/reconsider these examples, this time using mount idmappings, to see how
+they can solve the problems we observed before.
+
+Example 1
+~~~~~~~~~
+
+::
+
+ caller id: u1000
+ caller idmapping: u0:k0:r4294967295
+ filesystem idmapping: u0:k0:r4294967295
+
+Both the caller and the filesystem use the identity idmapping:
+
+1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
+
+ make_kuid(u0:k0:r4294967295, u1000) = k1000
+
+2. Verify that the caller's kernel ids can be mapped to userspace ids in the
+ filesystem's idmapping.
+
+ For this second step the kernel will call the function
+ ``fsuidgid_has_mapping()`` which ultimately boils down to calling
+ ``from_kuid()``::
+
+ from_kuid(u0:k0:r4294967295, k1000) = u1000
+
+In this example both idmappings are the same so there's nothing exciting going
+on. Ultimately the userspace id that lands on disk will be ``u1000``.
+
+Example 2
+~~~~~~~~~
+
+::
+
+ caller id: u1000
+ caller idmapping: u0:k10000:r10000
+ filesystem idmapping: u0:k20000:r10000
+
+1. Map the caller's userspace ids down into kernel ids in the caller's
+ idmapping::
+
+ make_kuid(u0:k10000:r10000, u1000) = k11000
+
+2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
+ filesystem's idmapping::
+
+ from_kuid(u0:k20000:r10000, k11000) = u-1
+
+It's immediately clear that while the caller's userspace id could be
+successfully mapped down into kernel ids in the caller's idmapping, the kernel
+ids could not be mapped up according to the filesystem's idmapping. So the
+kernel will deny this creation request.
+
+Note that while this example is less common, because most filesystems can't be
+mounted with non-initial idmappings, this is a general problem as we can see in
+the next examples.
+
+Example 3
+~~~~~~~~~
+
+::
+
+ caller id: u1000
+ caller idmapping: u0:k10000:r10000
+ filesystem idmapping: u0:k0:r4294967295
+
+1. Map the caller's userspace ids down into kernel ids in the caller's
+ idmapping::
+
+ make_kuid(u0:k10000:r10000, u1000) = k11000
+
+2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
+ filesystem's idmapping::
+
+ from_kuid(u0:k0:r4294967295, k11000) = u11000
+
+We can see that the translation always succeeds. The userspace id that the
+filesystem will ultimately put to disk will always be identical to the value of
+the kernel id that was created in the caller's idmapping. This has two main
+consequences.
+
+First, we can't allow a caller to ultimately write to disk with another
+userspace id. We could only do this if we were to mount the whole filesystem
+with the caller's or another idmapping. But that solution is limited to a few
+filesystems and not very flexible, even though this is a use-case that is
+pretty important in containerized workloads.
+
+Second, the caller will usually not be able to create any files or access
+directories that have stricter permissions because none of the filesystem's
+kernel ids map up into valid userspace ids in the caller's idmapping:
+
+1. Map raw userspace ids down to kernel ids in the filesystem's idmapping::
+
+ make_kuid(u0:k0:r4294967295, u1000) = k1000
+
+2. Map kernel ids up to userspace ids in the caller's idmapping::
+
+ from_kuid(u0:k10000:r10000, k1000) = u-1
+
+Example 4
+~~~~~~~~~
+
+::
+
+ file id: u1000
+ caller idmapping: u0:k10000:r10000
+ filesystem idmapping: u0:k0:r4294967295
+
+In order to report ownership to userspace the kernel uses the crossmapping
+algorithm introduced in a previous section:
+
+1. Map the userspace id on disk down into a kernel id in the filesystem's
+ idmapping::
+
+ make_kuid(u0:k0:r4294967295, u1000) = k1000
+
+2. Map the kernel id up into a userspace id in the caller's idmapping::
+
+ from_kuid(u0:k10000:r10000, k1000) = u-1
+
+The crossmapping algorithm fails in this case because the kernel id in the
+filesystem idmapping cannot be mapped up to a userspace id in the caller's
+idmapping. Thus, the kernel will report the ownership of this file as the
+overflowid.
+
+Example 5
+~~~~~~~~~
+
+::
+
+ file id: u1000
+ caller idmapping: u0:k10000:r10000
+ filesystem idmapping: u0:k20000:r10000
+
+In order to report ownership to userspace the kernel uses the crossmapping
+algorithm introduced in a previous section:
+
+1. Map the userspace id on disk down into a kernel id in the filesystem's
+ idmapping::
+
+ make_kuid(u0:k20000:r10000, u1000) = k21000
+
+2. Map the kernel id up into a userspace id in the caller's idmapping::
+
+ from_kuid(u0:k10000:r10000, k21000) = u-1
+
+Again, the crossmapping algorithm fails in this case because the kernel id in
+the filesystem idmapping cannot be mapped to a userspace id in the caller's
+idmapping. Thus, the kernel will report the ownership of this file as the
+overflowid.
+
+Note how in the last two examples things would be simple if the caller were
+using the initial idmapping. For a filesystem mounted with the initial
+idmapping it would be trivial. So we only consider a filesystem with an
+idmapping of ``u0:k20000:r10000``:
+
+1. Map the userspace id on disk down into a kernel id in the filesystem's
+ idmapping::
+
+ make_kuid(u0:k20000:r10000, u1000) = k21000
+
+2. Map the kernel id up into a userspace id in the caller's idmapping::
+
+ from_kuid(u0:k0:r4294967295, k21000) = u21000
+
+Idmappings on idmapped mounts
+-----------------------------
+
+The examples we've seen in the previous section, where the caller's idmapping
+and the filesystem's idmapping are incompatible, cause various issues for
+workloads. For a more complex but common example, consider two containers
+started on the host. To completely prevent the two containers from affecting
+each other, an administrator may often use different non-overlapping idmappings
+for the two containers::
+
+ container1 idmapping: u0:k10000:r10000
+ container2 idmapping: u0:k20000:r10000
+ filesystem idmapping: u0:k30000:r10000
+
+An administrator wanting to provide easy read-write access to the following set
+of files::
+
+ dir id: u0
+ dir/file1 id: u1000
+ dir/file2 id: u2000
+
+to both containers currently can't do so.
+
+Of course the administrator has the option to recursively change ownership via
+``chown()``. For example, they could change ownership so that ``dir`` and all
+files below it can be crossmapped from the filesystem's into the container's
+idmapping. Let's assume they change ownership so it is compatible with the
+first container's idmapping::
+
+ dir id: u10000
+ dir/file1 id: u11000
+ dir/file2 id: u12000
+
+This would still leave ``dir`` rather useless to the second container. In fact,
+``dir`` and all files below it would continue to appear owned by the overflowid
+for the second container.
+
+Or consider another increasingly popular example. Some service managers such as
+systemd implement a concept called "portable home directories". A user may want
+to use their home directories on different machines where they are assigned
+different login userspace ids. Most users will have ``u1000`` as the login id
+on their machine at home and all files in their home directory will usually be
+owned by ``u1000``. At uni or at work they may have another login id such as
+``u1125``. This makes it rather difficult to interact with their home directory
+on their work machine.
+
+In both cases changing ownership recursively has grave implications. The most
+obvious one is that ownership is changed globally and permanently. In the home
+directory case this change in ownership would even need to happen every time the
+user switches from their home to their work machine. For really large sets of
+files this becomes increasingly costly.
+
+If the user is lucky, they are dealing with a filesystem that is mountable
+inside user namespaces. But this would also change ownership globally and the
+change in ownership is tied to the lifetime of the filesystem mount, i.e. the
+superblock. The only way to change ownership is to completely unmount the
+filesystem and mount it again in another user namespace. This is usually
+impossible because it would mean that all users currently accessing the
+filesystem can't anymore. And it means that ``dir`` still can't be shared
+between two containers with different idmappings.
+But usually the user doesn't even have this option since most filesystems
+aren't mountable inside containers. And not having them mountable might be
+desirable as it doesn't require the filesystem to deal with malicious
+filesystem images.
+
+But the usecases mentioned above and more can be handled by idmapped mounts.
+They make it possible to expose the same set of dentries with different
+ownership at different mounts. This is achieved by marking the mounts with
+a user namespace through the ``mount_setattr()`` system call. The idmapping
+associated with it is then used to translate from the caller's idmapping to
+the filesystem's idmapping and vice versa using the remapping algorithm we
+introduced above.
+
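+For illustration, here is a hedged userspace sketch of how such a mount can
+be created. It assumes an already prepared user namespace file descriptor
+``userns_fd`` whose idmapping should be attached to the mount (the system
+calls may need to be invoked via ``syscall(2)`` if libc provides no
+wrappers)::
+
+ struct mount_attr attr = {
+         .attr_set  = MOUNT_ATTR_IDMAP,
+         .userns_fd = userns_fd,
+ };
+
+ /* Clone /source into a detached mount ... */
+ int mnt_fd = open_tree(AT_FDCWD, "/source",
+                        OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC | AT_RECURSIVE);
+
+ /* ... mark it with the user namespace's idmapping ... */
+ mount_setattr(mnt_fd, "", AT_EMPTY_PATH | AT_RECURSIVE, &attr, sizeof(attr));
+
+ /* ... and attach the idmapped mount at /target. */
+ move_mount(mnt_fd, "", AT_FDCWD, "/target", MOVE_MOUNT_F_EMPTY_PATH);
+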
+Idmapped mounts make it possible to change ownership in a temporary and
+localized way. The ownership changes are restricted to a specific mount and the
+ownership changes are tied to the lifetime of the mount. All other users and
+locations where the filesystem is exposed are unaffected.
+
+Filesystems that support idmapped mounts don't have any real reason to support
+being mountable inside user namespaces. A filesystem could be exposed
+completely under an idmapped mount to get the same effect. This has the
+advantage that filesystems can leave the creation of the superblock to
+privileged users in the initial user namespace.
+
+However, it is perfectly possible to combine idmapped mounts with filesystems
+mountable inside user namespaces. We will touch on this further below.
+
+Filesystem types vs idmapped mount types
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+With the introduction of idmapped mounts we need to distinguish between
+filesystem ownership and mount ownership of a VFS object such as an inode. The
+owner of an inode might be different when looked at from a filesystem
+perspective than when looked at from an idmapped mount. Such fundamental
+conceptual distinctions should almost always be clearly expressed in the code.
+So, to distinguish idmapped mount ownership from filesystem ownership separate
+types have been introduced.
+
+If a uid or gid has been generated using the filesystem or caller's idmapping
+then we will use the ``kuid_t`` and ``kgid_t`` types. However, if a uid or gid
+has been generated using a mount idmapping then we will be using the dedicated
+``vfsuid_t`` and ``vfsgid_t`` types.
+
+All VFS helpers that generate or take uids and gids as arguments use the
+``vfsuid_t`` and ``vfsgid_t`` types and we will be able to rely on the compiler
+to catch errors that originate from conflating filesystem and VFS uids and gids.
+
+The ``vfsuid_t`` and ``vfsgid_t`` types are often mapped from and to ``kuid_t``
+and ``kgid_t`` types similar to how ``kuid_t`` and ``kgid_t`` types are mapped
+from and to ``uid_t`` and ``gid_t`` types::
+
+ uid_t <--> kuid_t <--> vfsuid_t
+ gid_t <--> kgid_t <--> vfsgid_t
+
+Whenever we report ownership based on a ``vfsuid_t`` or ``vfsgid_t`` type,
+e.g., during ``stat()``, or store ownership information in a shared VFS object
+based on a ``vfsuid_t`` or ``vfsgid_t`` type, e.g., during ``chown()``, we can
+use the ``vfsuid_into_kuid()`` and ``vfsgid_into_kgid()`` helpers.
+
+To illustrate why these helpers currently exist, consider what happens when we
+change ownership of an inode from an idmapped mount. After we generated
+a ``vfsuid_t`` or ``vfsgid_t`` based on the mount idmapping we later commit
+this ``vfsuid_t`` or ``vfsgid_t`` to become the new filesystem wide ownership.
+Thus, we are turning the ``vfsuid_t`` or ``vfsgid_t`` into a global ``kuid_t``
+or ``kgid_t``. And this can be done by using ``vfsuid_into_kuid()`` and
+``vfsgid_into_kgid()``.
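+
+A sketch of the type fence, modeled on ``include/linux/mnt_idmapping.h``
+(details may differ across kernel versions)::
+
+ /* A VFS id wraps the same value space as a kernel id but is a distinct type. */
+ typedef struct {
+         uid_t val;
+ } vfsuid_t;
+
+ static inline kuid_t vfsuid_into_kuid(vfsuid_t vfsuid)
+ {
+         /* Same numeric value, but now a filesystem wide kuid_t. */
+         return KUIDT_INIT(vfsuid.val);
+ }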
+
+Note that whenever a shared VFS object, e.g., a cached ``struct inode`` or
+a cached ``struct posix_acl``, stores ownership information, a filesystem or
+"global" ``kuid_t`` and ``kgid_t`` must be used.
+and ``vfsgid_t`` is specific to an idmapped mount.
+
+We already noted that ``vfsuid_t`` and ``vfsgid_t`` types are generated based
+on mount idmappings whereas ``kuid_t`` and ``kgid_t`` types are generated based
+on filesystem idmappings. To prevent abusing filesystem idmappings to generate
+``vfsuid_t`` or ``vfsgid_t`` types or mount idmappings to generate ``kuid_t``
+or ``kgid_t`` types filesystem idmappings and mount idmappings are different
+types as well.
+
+All helpers that map to or from ``vfsuid_t`` and ``vfsgid_t`` types require
+a mount idmapping to be passed which is of type ``struct mnt_idmap``. Passing
+a filesystem or caller idmapping will cause a compilation error.
+
+Similar to how we prefix all userspace ids in this document with ``u`` and all
+kernel ids with ``k`` we will prefix all VFS ids with ``v``. So a mount
+idmapping will be written as: ``u0:v10000:r10000``.
+
+Remapping helpers
+~~~~~~~~~~~~~~~~~
+
+Idmapping functions were added that translate between idmappings. They make use
+of the remapping algorithm we've introduced earlier. We're going to look at:
+
+- ``i_uid_into_vfsuid()`` and ``i_gid_into_vfsgid()``
+
+ The ``i_*id_into_vfs*id()`` functions translate filesystem's kernel ids into
+ VFS ids in the mount's idmapping::
+
+ /* Map the filesystem's kernel id up into a userspace id in the filesystem's idmapping. */
+ from_kuid(filesystem, kuid) = uid
+
+ /* Map the filesystem's userspace id down into a VFS id in the mount's idmapping. */
+ make_kuid(mount, uid) = vfsuid
+
+- ``mapped_fsuid()`` and ``mapped_fsgid()``
+
+ The ``mapped_fs*id()`` functions translate the caller's kernel ids into
+ kernel ids in the filesystem's idmapping. This translation is achieved by
+ remapping the caller's VFS ids using the mount's idmapping::
+
+ /* Map the caller's VFS id up into a userspace id in the mount's idmapping. */
+ from_kuid(mount, vfsuid) = uid
+
+ /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
+ make_kuid(filesystem, uid) = kuid
+
+- ``vfsuid_into_kuid()`` and ``vfsgid_into_kgid()``
+
+ Whenever a ``vfsuid_t`` or ``vfsgid_t`` needs to be committed to a shared
+ VFS object or reported to userspace, these helpers turn the VFS id back into
+ a filesystem wide ``kuid_t`` or ``kgid_t`` as described in the previous
+ section.
+
+Note that ``i_uid_into_vfsuid()`` and ``mapped_fsuid()`` invert each other.
+Consider the following
+idmappings::
+
+ caller idmapping: u0:k10000:r10000
+ filesystem idmapping: u0:k20000:r10000
+ mount idmapping: u0:v10000:r10000
+
+Assume a file owned by ``u1000`` is read from disk. The filesystem maps this id
+to ``k21000`` according to its idmapping. This is what is stored in the
+inode's ``i_uid`` and ``i_gid`` fields.
+
+When the caller queries the ownership of this file via ``stat()`` the kernel
+would usually simply use the crossmapping algorithm and map the filesystem's
+kernel id up to a userspace id in the caller's idmapping.
+
+But when the caller is accessing the file on an idmapped mount the kernel will
+first call ``i_uid_into_vfsuid()`` thereby translating the filesystem's kernel
+id into a VFS id in the mount's idmapping::
+
+ i_uid_into_vfsuid(k21000):
+ /* Map the filesystem's kernel id up into a userspace id. */
+ from_kuid(u0:k20000:r10000, k21000) = u1000
+
+ /* Map the filesystem's userspace id down into a VFS id in the mount's idmapping. */
+ make_kuid(u0:v10000:r10000, u1000) = v11000
+
+Finally, when the kernel reports the owner to the caller it will turn the
+VFS id in the mount's idmapping into a userspace id in the caller's
+idmapping::
+
+ k11000 = vfsuid_into_kuid(v11000)
+ from_kuid(u0:k10000:r10000, k11000) = u1000
+
+We can test whether this algorithm really works by verifying what happens when
+we create a new file. Let's say the user is creating a file with ``u1000``.
+
+The kernel maps this to ``k11000`` in the caller's idmapping. Usually the
+kernel would now apply the crossmapping, verifying that ``k11000`` can be
+mapped to a userspace id in the filesystem's idmapping. Since ``k11000`` can't
+be mapped up in the filesystem's idmapping directly this creation request
+fails.
+
+But when the caller is accessing the file on an idmapped mount the kernel will
+first call ``mapped_fs*id()`` thereby translating the caller's VFS id (here
+``v11000``, since the mount's idmapping mirrors the caller's) into a kernel id
+in the filesystem's idmapping via the mount's idmapping::
+
+ mapped_fsuid(v11000):
+ /* Map the caller's VFS id up into a userspace id in the mount's idmapping. */
+ from_kuid(u0:v10000:r10000, v11000) = u1000
+
+ /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
+ make_kuid(u0:k20000:r10000, u1000) = k21000
+
+When finally writing to disk the kernel will then map ``k21000`` up into a
+userspace id in the filesystem's idmapping::
+
+ from_kuid(u0:k20000:r10000, k21000) = u1000
+
+As we can see, we end up with an invertible and therefore information
+preserving algorithm. A file created from ``u1000`` on an idmapped mount will
+also be reported as being owned by ``u1000`` and vice versa.
+
+Let's now briefly reconsider the failing examples from earlier in the context
+of idmapped mounts.
+
+Example 2 reconsidered
+~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ caller id: u1000
+ caller idmapping: u0:k10000:r10000
+ filesystem idmapping: u0:k20000:r10000
+ mount idmapping: u0:v10000:r10000
+
+When the caller is using a non-initial idmapping the common case is to attach
+the same idmapping to the mount. We now perform three steps:
+
+1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
+
+ make_kuid(u0:k10000:r10000, u1000) = k11000
+
+2. Translate the caller's VFS id into a kernel id in the filesystem's
+ idmapping::
+
+ mapped_fsuid(v11000):
+ /* Map the VFS id up into a userspace id in the mount's idmapping. */
+ from_kuid(u0:v10000:r10000, v11000) = u1000
+
+ /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
+ make_kuid(u0:k20000:r10000, u1000) = k21000
+
+3. Verify that the caller's kernel ids can be mapped to userspace ids in the
+ filesystem's idmapping::
+
+ from_kuid(u0:k20000:r10000, k21000) = u1000
+
+So the ownership that lands on disk will be ``u1000``.
+
+Example 3 reconsidered
+~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ caller id: u1000
+ caller idmapping: u0:k10000:r10000
+ filesystem idmapping: u0:k0:r4294967295
+ mount idmapping: u0:v10000:r10000
+
+The same translation algorithm works with the third example.
+
+1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
+
+ make_kuid(u0:k10000:r10000, u1000) = k11000
+
+2. Translate the caller's VFS id into a kernel id in the filesystem's
+ idmapping::
+
+ mapped_fsuid(v11000):
+ /* Map the VFS id up into a userspace id in the mount's idmapping. */
+ from_kuid(u0:v10000:r10000, v11000) = u1000
+
+ /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
+ make_kuid(u0:k0:r4294967295, u1000) = k1000
+
+3. Verify that the caller's kernel ids can be mapped to userspace ids in the
+ filesystem's idmapping::
+
+ from_kuid(u0:k0:r4294967295, k1000) = u1000
+
+So the ownership that lands on disk will be ``u1000``.
+
+Example 4 reconsidered
+~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ file id: u1000
+ caller idmapping: u0:k10000:r10000
+ filesystem idmapping: u0:k0:r4294967295
+ mount idmapping: u0:v10000:r10000
+
+In order to report ownership to userspace the kernel now does three steps using
+the translation algorithm we introduced earlier:
+
+1. Map the userspace id on disk down into a kernel id in the filesystem's
+ idmapping::
+
+ make_kuid(u0:k0:r4294967295, u1000) = k1000
+
+2. Translate the kernel id into a VFS id in the mount's idmapping::
+
+ i_uid_into_vfsuid(k1000):
+ /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
+ from_kuid(u0:k0:r4294967295, k1000) = u1000
+
+ /* Map the userspace id down into a VFS id in the mount's idmapping. */
+ make_kuid(u0:v10000:r10000, u1000) = v11000
+
+3. Map the VFS id up into a userspace id in the caller's idmapping::
+
+ k11000 = vfsuid_into_kuid(v11000)
+ from_kuid(u0:k10000:r10000, k11000) = u1000
+
+Earlier, the file's kernel id couldn't be crossmapped into the caller's
+idmapping. With the idmapped mount in place it now can be crossmapped via the
+mount's idmapping. The file will now be reported as owned by ``u1000``
+according to the mount's idmapping.
+
+Example 5 reconsidered
+~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ file id: u1000
+ caller idmapping: u0:k10000:r10000
+ filesystem idmapping: u0:k20000:r10000
+ mount idmapping: u0:v10000:r10000
+
+Again, in order to report ownership to userspace the kernel now does three
+steps using the translation algorithm we introduced earlier:
+
+1. Map the userspace id on disk down into a kernel id in the filesystem's
+ idmapping::
+
+ make_kuid(u0:k20000:r10000, u1000) = k21000
+
+2. Translate the kernel id into a VFS id in the mount's idmapping::
+
+ i_uid_into_vfsuid(k21000):
+ /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
+ from_kuid(u0:k20000:r10000, k21000) = u1000
+
+ /* Map the userspace id down into a VFS id in the mount's idmapping. */
+ make_kuid(u0:v10000:r10000, u1000) = v11000
+
+3. Map the VFS id up into a userspace id in the caller's idmapping::
+
+ k11000 = vfsuid_into_kuid(v11000)
+ from_kuid(u0:k10000:r10000, k11000) = u1000
+
+Earlier, the file's kernel id couldn't be crossmapped into the caller's
+idmapping. With the idmapped mount in place it now can be crossmapped via the
+mount's idmapping. The file is now reported as owned by ``u1000`` according to
+the mount's idmapping.
+
+Changing ownership on a home directory
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We've seen above how idmapped mounts can be used to translate between
+idmappings when either the caller, the filesystem, or both use a non-initial
+idmapping. A wide range of usecases exist when the caller is using
+a non-initial idmapping. This mostly happens in the context of containerized
+workloads. The consequence, as we have seen, is that for both filesystems
+mounted with the initial idmapping and filesystems mounted with non-initial
+idmappings, access to the filesystem doesn't work because the kernel ids can't
+be crossmapped between the caller's and the filesystem's idmapping.
+
+As we've seen above idmapped mounts provide a solution to this by remapping the
+caller's or filesystem's idmapping according to the mount's idmapping.
+
+Aside from containerized workloads, idmapped mounts have the advantage that
+they also work when both the caller and the filesystem use the initial
+idmapping which means users on the host can change the ownership of directories
+and files on a per-mount basis.
+
+Consider our previous example where a user has their home directory on portable
+storage. At home they have id ``u1000`` and all files in their home directory
+are owned by ``u1000`` whereas at uni or work they have login id ``u1125``.
+
+Taking their home directory with them becomes problematic. They can't easily
+access their files, they might not be able to write to disk without applying
+lax permissions or ACLs and even if they can, they will end up with an annoying
+mix of files and directories owned by ``u1000`` and ``u1125``.
+
+Idmapped mounts make it possible to solve this problem. A user can create an idmapped
+mount for their home directory on their work computer or their computer at home
+depending on what ownership they would prefer to end up on the portable storage
+itself.
+
+Let's assume they want all files on disk to belong to ``u1000``. When the user
+plugs in their portable storage at their work station they can setup a job that
+creates an idmapped mount with the minimal idmapping ``u1000:v1125:r1``. So now
+when they create a file the kernel performs the following steps we already know
+from above::
+
+ caller id: u1125
+ caller idmapping: u0:k0:r4294967295
+ filesystem idmapping: u0:k0:r4294967295
+ mount idmapping: u1000:v1125:r1
+
+1. Map the caller's userspace ids into kernel ids in the caller's idmapping::
+
+ make_kuid(u0:k0:r4294967295, u1125) = k1125
+
+2. Translate the caller's VFS id into a kernel id in the filesystem's
+ idmapping::
+
+ mapped_fsuid(v1125):
+ /* Map the VFS id up into a userspace id in the mount's idmapping. */
+ from_kuid(u1000:v1125:r1, v1125) = u1000
+
+ /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
+ make_kuid(u0:k0:r4294967295, u1000) = k1000
+
+3. Verify that the caller's kernel ids can be mapped to userspace ids in the
+ filesystem's idmapping::
+
+ from_kuid(u0:k0:r4294967295, k1000) = u1000
+
+So ultimately the file will be created with ``u1000`` on disk.
+
+Now let's briefly look at what ownership the caller with id ``u1125`` will see
+on their work computer:
+
+::
+
+ file id: u1000
+ caller idmapping: u0:k0:r4294967295
+ filesystem idmapping: u0:k0:r4294967295
+ mount idmapping: u1000:v1125:r1
+
+1. Map the userspace id on disk down into a kernel id in the filesystem's
+ idmapping::
+
+ make_kuid(u0:k0:r4294967295, u1000) = k1000
+
+2. Translate the kernel id into a VFS id in the mount's idmapping::
+
+ i_uid_into_vfsuid(k1000):
+ /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
+ from_kuid(u0:k0:r4294967295, k1000) = u1000
+
+ /* Map the userspace id down into a VFS id in the mounts's idmapping. */
+ make_kuid(u1000:v1125:r1, u1000) = v1125
+
+3. Map the VFS id up into a userspace id in the caller's idmapping::
+
+ k1125 = vfsuid_into_kuid(v1125)
+ from_kuid(u0:k0:r4294967295, k1125) = u1125
+
+So ultimately the caller will see that the file belongs to ``u1125`` which is
+the caller's userspace id on their workstation in our example.
+
+The raw userspace id that is put on disk is ``u1000`` so when the user takes
+their home directory back to their home computer where they are assigned
+``u1000`` using the initial idmapping and mount the filesystem with the initial
+idmapping they will see all those files owned by ``u1000``.
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 4c536e66dc4c..11a599387266 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -29,13 +29,13 @@ algorithms work.
fiemap
files
locks
- mandatory-locking
+ multigrain-ts
mount_api
quota
seq_file
sharedsubtree
- sysfs-pci
- sysfs-tagging
+ idmappings
+ iomap/index
automount-support
@@ -52,9 +52,11 @@ filesystem implementations.
.. toctree::
:maxdepth: 2
+ buffer
journalling
fscrypt
fsverity
+ netfs_library
Filesystems
===========
@@ -70,14 +72,15 @@ Documentation for filesystem implementations.
afs
autofs
autofs-mount-control
+ bcachefs/index
befs
bfs
btrfs
- cifs/cifsroot
ceph
coda
configfs
cramfs
+ dax
debugfs
dlmfs
ecryptfs
@@ -85,6 +88,7 @@ Documentation for filesystem implementations.
erofs
ext2
ext3
+ ext4/index
f2fs
gfs2
gfs2-uevents
@@ -94,11 +98,13 @@ Documentation for filesystem implementations.
hpfs
fuse
fuse-io
+ fuse-io-uring
+ fuse-passthrough
inotify
isofs
nilfs2
nfs/index
- ntfs
+ ntfs3
ocfs2
ocfs2-online-filecheck
omfs
@@ -108,17 +114,17 @@ Documentation for filesystem implementations.
qnx6
ramfs-rootfs-initramfs
relay
+ resctrl
romfs
+ smb/index
spufs/index
squashfs
sysfs
- sysv-fs
tmpfs
ubifs
- ubifs-authentication.rst
+ ubifs-authentication
udf
virtiofs
vfat
- xfs-delayed-logging-design
- xfs-self-describing-metadata
+ xfs/index
zonefs
diff --git a/Documentation/filesystems/iomap/design.rst b/Documentation/filesystems/iomap/design.rst
new file mode 100644
index 000000000000..f2df9b6df988
--- /dev/null
+++ b/Documentation/filesystems/iomap/design.rst
@@ -0,0 +1,462 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _iomap_design:
+
+..
+ Dumb style notes to maintain the author's sanity:
+ Please try to start sentences on separate lines so that
+ sentence changes don't bleed colors in diff.
+ Heading decorations are documented in sphinx.rst.
+
+==============
+Library Design
+==============
+
+.. contents:: Table of Contents
+ :local:
+
+Introduction
+============
+
+iomap is a filesystem library for handling common file operations.
+The library has two layers:
+
+ 1. A lower layer that provides an iterator over ranges of file offsets.
+ This layer tries to obtain mappings of each file range to storage
+ from the filesystem, but the storage information is not necessarily
+ required.
+
+ 2. An upper layer that acts upon the space mappings provided by the
+ lower layer iterator.
+
+The iteration can involve mappings of a file's logical offset ranges to
+physical extents, but the storage layer information is not necessarily
+required, e.g. for walking cached file information.
+The library exports various APIs for implementing file operations such
+as:
+
+ * Pagecache reads and writes
+ * Folio write faults to the pagecache
+ * Writeback of dirty folios
+ * Direct I/O reads and writes
+ * fsdax I/O reads, writes, loads, and stores
+ * FIEMAP
+ * lseek ``SEEK_DATA`` and ``SEEK_HOLE``
+ * swapfile activation
+
+The origins of this library lie in the file I/O path that XFS once used; it
+has since been extended to cover several other operations.
+
+Who Should Read This?
+=====================
+
+The target audience for this document are filesystem, storage, and
+pagecache programmers and code reviewers.
+
+If you are working on PCI, machine architectures, or device drivers, you
+are most likely in the wrong place.
+
+How Is This Better?
+===================
+
+Unlike the classic Linux I/O model which breaks file I/O into small
+units (generally memory pages or blocks) and looks up space mappings on
+the basis of that unit, the iomap model asks the filesystem for the
+largest space mappings that it can create for a given file operation and
+initiates operations on that basis.
+This strategy improves the filesystem's visibility into the size of the
+operation being performed, which enables it to combat fragmentation with
+larger space allocations when possible.
+Larger space mappings improve runtime performance by amortizing the cost
+of mapping function calls into the filesystem across a larger amount of
+data.
+
+At a high level, an iomap operation `looks like this
+<https://lore.kernel.org/all/ZGbVaewzcCysclPt@dread.disaster.area/>`_:
+
+1. For each byte in the operation range...
+
+ 1. Obtain a space mapping via ``->iomap_begin``
+
+ 2. For each sub-unit of work...
+
+ 1. Revalidate the mapping and go back to (1) above, if necessary.
+ So far only the pagecache operations need to do this.
+
+ 2. Do the work
+
+ 3. Increment operation cursor
+
+ 4. Release the mapping via ``->iomap_end``, if necessary
+
+Each iomap operation will be covered in more detail below.
+This library was covered previously by an `LWN article
+<https://lwn.net/Articles/935934/>`_ and a `KernelNewbies page
+<https://kernelnewbies.org/KernelProjects/iomap>`_.
+
+The goal of this document is to provide a brief discussion of the
+design and capabilities of iomap, followed by a more detailed catalog
+of the interfaces presented by iomap.
+If you change iomap, please update this design document.
+
+File Range Iterator
+===================
+
+Definitions
+-----------
+
+ * **buffer head**: Shattered remnants of the old buffer cache.
+
+ * ``fsblock``: The block size of a file, also known as ``i_blocksize``.
+
+ * ``i_rwsem``: The VFS ``struct inode`` rwsemaphore.
+ Processes hold this in shared mode to read file state and contents.
+ Some filesystems may allow shared mode for writes.
+ Processes often hold this in exclusive mode to change file state and
+ contents.
+
+ * ``invalidate_lock``: The pagecache ``struct address_space``
+ rwsemaphore that protects against folio insertion and removal for
+ filesystems that support punching out folios below EOF.
+ Processes wishing to insert folios must hold this lock in shared
+ mode to prevent removal, though concurrent insertion is allowed.
+ Processes wishing to remove folios must hold this lock in exclusive
+ mode to prevent insertions.
+ Concurrent removals are not allowed.
+
+ * ``dax_read_lock``: The RCU read lock that dax takes to prevent a
+ device pre-shutdown hook from returning before other threads have
+ released resources.
+
+ * **filesystem mapping lock**: This synchronization primitive is
+ internal to the filesystem and must protect the file mapping data
+ from updates while a mapping is being sampled.
+ The filesystem author must determine how this coordination should
+ happen; it does not need to be an actual lock.
+
+ * **iomap internal operation lock**: This is a general term for
+ synchronization primitives that iomap functions take while holding a
+ mapping.
+ A specific example would be taking the folio lock while reading or
+ writing the pagecache.
+
+ * **pure overwrite**: A write operation that does not require any
+ metadata or zeroing operations to perform during either submission
+ or completion.
+ This implies that the filesystem must have already allocated space
+ on disk as ``IOMAP_MAPPED`` and the filesystem must not place any
+ constraints on IO alignment or size.
+ The only constraints on I/O alignment are device level (minimum I/O
+ size and alignment, typically sector size).
+
+``struct iomap``
+----------------
+
+The filesystem communicates to the iomap iterator the mapping of
+byte ranges of a file to byte ranges of a storage device with the
+structure below:
+
+.. code-block:: c
+
+ struct iomap {
+ u64 addr;
+ loff_t offset;
+ u64 length;
+ u16 type;
+ u16 flags;
+ struct block_device *bdev;
+ struct dax_device *dax_dev;
+ void *inline_data;
+ void *private;
+ const struct iomap_folio_ops *folio_ops;
+ u64 validity_cookie;
+ };
+
+The fields are as follows:
+
+ * ``offset`` and ``length`` describe the range of file offsets, in
+ bytes, covered by this mapping.
+ These fields must always be set by the filesystem.
+
+ * ``type`` describes the type of the space mapping:
+
+ * **IOMAP_HOLE**: No storage has been allocated.
+ This type must never be returned in response to an ``IOMAP_WRITE``
+ operation because writes must allocate and map space, and return
+ the mapping.
+ The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
+ iomap does not support writing (whether via pagecache or direct
+ I/O) to a hole.
+
+ * **IOMAP_DELALLOC**: A promise to allocate space at a later time
+ ("delayed allocation").
+ If the filesystem returns IOMAP_F_NEW here and the write fails, the
+ ``->iomap_end`` function must delete the reservation.
+ The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
+
+ * **IOMAP_MAPPED**: The file range maps to specific space on the
+ storage device.
+ The device is returned in ``bdev`` or ``dax_dev``.
+ The device address, in bytes, is returned via ``addr``.
+
+ * **IOMAP_UNWRITTEN**: The file range maps to specific space on the
+ storage device, but the space has not yet been initialized.
+ The device is returned in ``bdev`` or ``dax_dev``.
+ The device address, in bytes, is returned via ``addr``.
+ Reads from this type of mapping will return zeroes to the caller.
+ For a write or writeback operation, the ioend should update the
+ mapping to MAPPED.
+ Refer to the sections about ioends for more details.
+
+ * **IOMAP_INLINE**: The file range maps to the memory buffer
+ specified by ``inline_data``.
+ For write operation, the ``->iomap_end`` function presumably
+ handles persisting the data.
+ The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
+
+ * ``flags`` describe the status of the space mapping.
+ These flags should be set by the filesystem in ``->iomap_begin``:
+
+ * **IOMAP_F_NEW**: The space under the mapping is newly allocated.
+ Areas that will not be written to must be zeroed.
+ If a write fails and the mapping is a space reservation, the
+ reservation must be deleted.
+
+ * **IOMAP_F_DIRTY**: The inode will have uncommitted metadata needed
+ to access any data written.
+ fdatasync is required to commit these changes to persistent
+ storage.
+ This needs to take into account metadata changes that *may* be made
+ at I/O completion, such as file size updates from direct I/O.
+
+ * **IOMAP_F_SHARED**: The space under the mapping is shared.
+ Copy on write is necessary to avoid corrupting other file data.
+
+ * **IOMAP_F_BUFFER_HEAD**: This mapping requires the use of buffer
+ heads for pagecache operations.
+ Do not add more uses of this.
+
+ * **IOMAP_F_MERGED**: Multiple contiguous block mappings were
+ coalesced into this single mapping.
+ This is only useful for FIEMAP.
+
+ * **IOMAP_F_XATTR**: The mapping is for extended attribute data, not
+ regular file data.
+ This is only useful for FIEMAP.
+
+ * **IOMAP_F_BOUNDARY**: This indicates I/O and its completion must not be
+ merged with any other I/O or completion. Filesystems must use this when
+ submitting I/O to devices that cannot handle I/O crossing certain LBAs
+ (e.g. ZNS devices). This flag applies only to buffered I/O writeback; all
+ other functions ignore it.
+
+ * **IOMAP_F_PRIVATE**: This flag is reserved for filesystem private use.
+
+ * **IOMAP_F_ANON_WRITE**: Indicates that (write) I/O does not have a target
+ block assigned to it yet and the filesystem will do that in the bio
+ submission handler, splitting the I/O as needed.
+
+ * **IOMAP_F_ATOMIC_BIO**: This indicates write I/O must be submitted with the
+ ``REQ_ATOMIC`` flag set in the bio. Filesystems need to set this flag to
+ inform iomap that the write I/O operation requires torn-write protection
+ based on a HW-offload mechanism. They must also ensure that mapping updates
+ upon the completion of the I/O are performed in a single metadata
+ update.
+
+ These flags can be set by iomap itself during file operations.
+ The filesystem should supply an ``->iomap_end`` function if it needs
+ to observe these flags:
+
+ * **IOMAP_F_SIZE_CHANGED**: The file size has changed as a result of
+ using this mapping.
+
+ * **IOMAP_F_STALE**: The mapping was found to be stale.
+ iomap will call ``->iomap_end`` on this mapping and then
+ ``->iomap_begin`` to obtain a new mapping.
+
+ Currently, these flags are only set by pagecache operations.
+
+ * ``addr`` describes the device address, in bytes.
+
+ * ``bdev`` describes the block device for this mapping.
+ This only needs to be set for mapped or unwritten operations.
+
+ * ``dax_dev`` describes the DAX device for this mapping.
+ This only needs to be set for mapped or unwritten operations, and
+ only for a fsdax operation.
+
+ * ``inline_data`` points to a memory buffer for I/O involving
+ ``IOMAP_INLINE`` mappings.
+ This value is ignored for all other mapping types.
+
+ * ``private`` is a pointer to `filesystem-private information
+ <https://lore.kernel.org/all/20180619164137.13720-7-hch@lst.de/>`_.
+ This value will be passed unchanged to ``->iomap_end``.
+
+ * ``folio_ops`` will be covered in the section on pagecache operations.
+
+ * ``validity_cookie`` is a magic freshness value set by the filesystem
+ that should be used to detect stale mappings.
+ For pagecache operations this is critical for correct operation
+ because page faults can occur, which implies that filesystem locks
+ should not be held between ``->iomap_begin`` and ``->iomap_end``.
+ Filesystems with completely static mappings need not set this value.
+ Only pagecache operations revalidate mappings; see the section about
+ ``iomap_valid`` for details.
+
+``struct iomap_ops``
+--------------------
+
+Every iomap function requires the filesystem to pass an operations
+structure to obtain a mapping and (optionally) to release the mapping:
+
+.. code-block:: c
+
+ struct iomap_ops {
+ int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
+ unsigned flags, struct iomap *iomap,
+ struct iomap *srcmap);
+
+ int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
+ ssize_t written, unsigned flags,
+ struct iomap *iomap);
+ };
+
+``->iomap_begin``
+~~~~~~~~~~~~~~~~~
+
+iomap operations call ``->iomap_begin`` to obtain one file mapping for
+the range of bytes specified by ``pos`` and ``length`` for the file
+``inode``.
+This mapping should be returned through the ``iomap`` pointer.
+The mapping must cover at least the first byte of the supplied file
+range, but it does not need to cover the entire requested range.
+
+Each iomap operation describes the requested operation through the
+``flags`` argument.
+The exact value of ``flags`` will be documented in the
+operation-specific sections below.
+These flags can, at least in principle, apply generally to iomap
+operations:
+
+ * ``IOMAP_DIRECT`` is set when the caller wishes to issue file I/O to
+ block storage.
+
+ * ``IOMAP_DAX`` is set when the caller wishes to issue file I/O to
+ memory-like storage.
+
+ * ``IOMAP_NOWAIT`` is set when the caller wishes to perform a best
+ effort attempt to avoid any operation that would result in blocking
+ the submitting task.
+ This is similar in intent to ``O_NONBLOCK`` for network APIs - it is
+ intended for asynchronous applications to keep doing other work
+ instead of waiting for the specific unavailable filesystem resource
+ to become available.
+ Filesystems implementing ``IOMAP_NOWAIT`` semantics need to use
+ trylock algorithms.
+ They need to be able to satisfy the entire I/O request range with a
+ single iomap mapping.
+ They need to avoid reading or writing metadata synchronously.
+ They need to avoid blocking memory allocations.
+ They need to avoid waiting on transaction reservations to allow
+ modifications to take place.
+ They probably should not be allocating new space.
+ And so on.
+ If there is any doubt in the filesystem developer's mind as to
+ whether any specific ``IOMAP_NOWAIT`` operation may end up blocking,
+ then they should return ``-EAGAIN`` as early as possible rather than
+ start the operation and force the submitting task to block.
+ ``IOMAP_NOWAIT`` is often set on behalf of ``IOCB_NOWAIT`` or
+ ``RWF_NOWAIT``.
+
+ * ``IOMAP_DONTCACHE`` is set when the caller wishes to perform a
+ buffered file I/O and would like the kernel to drop the pagecache
+ after the I/O completes, if it isn't already being used by another
+ thread.
+
+If it is necessary to read existing file contents from a `different
+<https://lore.kernel.org/all/20191008071527.29304-9-hch@lst.de/>`_
+device or address range on a device, the filesystem should return that
+information via ``srcmap``.
+Only pagecache and fsdax operations support reading from one mapping and
+writing to another.
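+
+As a hedged illustration, a filesystem with a simple static block map might
+implement ``->iomap_begin`` roughly as follows.
+The ``myfs_*`` names and the in-memory block array are hypothetical, and a
+real implementation would also have to honor flags such as ``IOMAP_WRITE``
+(a hole must not be returned for writes) and ``IOMAP_NOWAIT``:
+
+.. code-block:: c
+
+ static int myfs_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
+                             unsigned flags, struct iomap *iomap,
+                             struct iomap *srcmap)
+ {
+         struct myfs_inode *mi = MYFS_I(inode); /* hypothetical helper */
+         sector_t block = pos >> inode->i_blkbits;
+
+         iomap->bdev = inode->i_sb->s_bdev;
+         iomap->offset = (loff_t)block << inode->i_blkbits;
+
+         if (block >= mi->nr_blocks) {
+                 /* No storage allocated for this range. */
+                 iomap->type = IOMAP_HOLE;
+                 iomap->addr = IOMAP_NULL_ADDR;
+                 iomap->length = length;
+         } else {
+                 /* Map at least the first byte at @pos; one fsblock here. */
+                 iomap->type = IOMAP_MAPPED;
+                 iomap->addr = (u64)mi->blocks[block] << inode->i_blkbits;
+                 iomap->length = i_blocksize(inode);
+         }
+         return 0;
+ }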
+
+``->iomap_end``
+~~~~~~~~~~~~~~~
+
+After the operation completes, the ``->iomap_end`` function, if present,
+is called to signal that iomap is finished with a mapping.
+Typically, implementations will use this function to tear down any
+context that was set up in ``->iomap_begin``.
+For example, a write might wish to commit the reservations for the bytes
+that were operated upon and unreserve any space that was not operated
+upon.
+``written`` might be zero if no bytes were touched.
+``flags`` will contain the same value passed to ``->iomap_begin``.
+iomap ops for reads are not likely to need to supply this function.
+
+Both functions should return a negative errno code on error, or zero on
+success.
+
+Preparing for File Operations
+=============================
+
+iomap only handles mapping and I/O.
+Filesystems must still call out to the VFS to check input parameters
+and file state before initiating an I/O operation.
+It does not handle obtaining filesystem freeze protection, updating of
+timestamps, stripping privileges, or access control.
+
+Locking Hierarchy
+=================
+
+iomap requires that filesystems supply their own locking model.
+There are three categories of synchronization primitives, as far as
+iomap is concerned:
+
+ * The **upper** level primitive is provided by the filesystem to
+ coordinate access to different iomap operations.
+ The exact primitive is specific to the filesystem and operation,
+ but is often a VFS inode, pagecache invalidation, or folio lock.
+ For example, a filesystem might take ``i_rwsem`` before calling
+ ``iomap_file_buffered_write`` and ``iomap_file_unshare`` to prevent
+ these two file operations from clobbering each other.
+ Pagecache writeback may lock a folio to prevent other threads from
+ accessing the folio until writeback is underway.
+
+ * The **lower** level primitive is taken by the filesystem in the
+ ``->iomap_begin`` and ``->iomap_end`` functions to coordinate
+ access to the file space mapping information.
+ The fields of the iomap object should be filled out while holding
+ this primitive.
+ The upper level synchronization primitive, if any, remains held
+ while acquiring the lower level synchronization primitive.
+ For example, XFS takes ``ILOCK_EXCL`` and ext4 takes ``i_data_sem``
+ while sampling mappings.
+ Filesystems with immutable mapping information may not require
+ synchronization here.
+
+ * The **operation** primitive is taken by an iomap operation to
+ coordinate access to its own internal data structures.
+ The upper level synchronization primitive, if any, remains held
+ while acquiring this primitive.
+ The lower level primitive is not held while acquiring this
+ primitive.
+ For example, pagecache write operations will obtain a file mapping,
+ then grab and lock a folio to copy new contents.
+ It may also lock an internal folio state object to update metadata.
+
+The exact locking requirements are specific to the filesystem; for
+certain operations, some of these locks can be elided.
+All further mentions of locking are *recommendations*, not mandates.
+Each filesystem author must figure out the locking for themselves.
+
+Bugs and Limitations
+====================
+
+ * No support for fscrypt.
+ * No support for compression.
+ * No support for fsverity yet.
+ * Strong assumptions that IO should work the way it does on XFS.
+ * Does iomap *actually* work for non-regular file data?
+
+Patches welcome!
diff --git a/Documentation/filesystems/iomap/index.rst b/Documentation/filesystems/iomap/index.rst
new file mode 100644
index 000000000000..3c6a52440250
--- /dev/null
+++ b/Documentation/filesystems/iomap/index.rst
@@ -0,0 +1,13 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+VFS iomap Documentation
+=======================
+
+.. toctree::
+ :maxdepth: 2
+ :numbered:
+
+ design
+ operations
+ porting
diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
new file mode 100644
index 000000000000..3b628e370d88
--- /dev/null
+++ b/Documentation/filesystems/iomap/operations.rst
@@ -0,0 +1,744 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _iomap_operations:
+
+..
+ Dumb style notes to maintain the author's sanity:
+ Please try to start sentences on separate lines so that
+ sentence changes don't bleed colors in diff.
+ Heading decorations are documented in sphinx.rst.
+
+=========================
+Supported File Operations
+=========================
+
+.. contents:: Table of Contents
+ :local:
+
+Below is a discussion of the high level file operations that iomap
+implements.
+
+Buffered I/O
+============
+
+Buffered I/O is the default file I/O path in Linux.
+File contents are cached in memory ("pagecache") to satisfy reads and
+writes.
+Dirty cache will be written back to disk at some point that can be
+forced via ``fsync`` and variants.
+
+iomap implements nearly all the folio and pagecache management that
+filesystems have to implement themselves under the legacy I/O model.
+This means that the filesystem need not know the details of allocating,
+mapping, managing uptodate and dirty state, or writeback of pagecache
+folios.
+Under the legacy I/O model, this was managed very inefficiently with
+linked lists of buffer heads instead of the per-folio bitmaps that iomap
+uses.
+Unless the filesystem explicitly opts in to buffer heads, they will not
+be used, which makes buffered I/O much more efficient, and the pagecache
+maintainer much happier.
+
+``struct address_space_operations``
+-----------------------------------
+
+The following iomap functions can be referenced directly from the
+address space operations structure:
+
+ * ``iomap_dirty_folio``
+ * ``iomap_release_folio``
+ * ``iomap_invalidate_folio``
+ * ``iomap_is_partially_uptodate``
+
+The following address space operations can be wrapped easily:
+
+ * ``read_folio``
+ * ``readahead``
+ * ``writepages``
+ * ``bmap``
+ * ``swap_activate``
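+
+As a hedged sketch, a filesystem (here the hypothetical ``myfs`` with its own
+``struct iomap_ops``) might wrap two of these as follows:
+
+.. code-block:: c
+
+ static int myfs_read_folio(struct file *file, struct folio *folio)
+ {
+         return iomap_read_folio(folio, &myfs_iomap_ops);
+ }
+
+ static void myfs_readahead(struct readahead_control *rac)
+ {
+         iomap_readahead(rac, &myfs_iomap_ops);
+ }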
+
+``struct iomap_folio_ops``
+--------------------------
+
+The ``->iomap_begin`` function for pagecache operations may set the
+``struct iomap::folio_ops`` field to an ops structure to override
+default behaviors of iomap:
+
+.. code-block:: c
+
+ struct iomap_folio_ops {
+ struct folio *(*get_folio)(struct iomap_iter *iter, loff_t pos,
+ unsigned len);
+ void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied,
+ struct folio *folio);
+ bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);
+ };
+
+iomap calls these functions:
+
+ - ``get_folio``: Called to allocate and return an active reference to
+ a locked folio prior to starting a write.
+ If this function is not provided, iomap will call
+ ``iomap_get_folio``.
+ This could be used to `set up per-folio filesystem state
+ <https://lore.kernel.org/all/20190429220934.10415-5-agruenba@redhat.com/>`_
+ for a write.
+
+ - ``put_folio``: Called to unlock and put a folio after a pagecache
+ operation completes.
+ If this function is not provided, iomap will ``folio_unlock`` and
+ ``folio_put`` on its own.
+ This could be used to `commit per-folio filesystem state
+ <https://lore.kernel.org/all/20180619164137.13720-6-hch@lst.de/>`_
+ that was set up by ``->get_folio``.
+
+ - ``iomap_valid``: The filesystem may not hold locks between
+ ``->iomap_begin`` and ``->iomap_end`` because pagecache operations
+ can take folio locks, fault on userspace pages, initiate writeback
+ for memory reclamation, or engage in other time-consuming actions.
+ If a file's space mapping data are mutable, it is possible that the
+ mapping for a particular pagecache folio can `change in the time it
+ takes
+ <https://lore.kernel.org/all/20221123055812.747923-8-david@fromorbit.com/>`_
+ to allocate, install, and lock that folio.
+
+ For the pagecache, races can happen if writeback doesn't take
+ ``i_rwsem`` or ``invalidate_lock`` and updates mapping information.
+ Races can also happen if the filesystem allows concurrent writes.
+ For such files, the mapping *must* be revalidated after the folio
+ lock has been taken so that iomap can manage the folio correctly.
+
+ fsdax does not need this revalidation because there's no writeback
+ and no support for unwritten extents.
+
+ Filesystems subject to this kind of race must provide a
+ ``->iomap_valid`` function to decide if the mapping is still valid.
+ If the mapping is not valid, the mapping will be sampled again.
+
+ To support making the validity decision, the filesystem's
+ ``->iomap_begin`` function may set ``struct iomap::validity_cookie``
+ at the same time that it populates the other iomap fields.
+ A simple validation cookie implementation is a sequence counter.
+ If the filesystem bumps the sequence counter every time it modifies
+ the inode's extent map, it can be placed in the ``struct
+ iomap::validity_cookie`` during ``->iomap_begin``.
+ If the value in the cookie is found to be different from the value
+ the filesystem holds when the mapping is passed back to
+ ``->iomap_valid``, then the iomap should be considered stale and the
+ validation has failed.
+
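+A minimal sketch of the sequence-counter scheme, assuming hypothetical
+``myfs_*`` names and a per-inode ``i_mapping_seq`` counter that the
+filesystem bumps whenever the extent map changes:
+
+.. code-block:: c
+
+   static int myfs_iomap_begin(struct inode *inode, loff_t pos,
+                               loff_t length, unsigned flags,
+                               struct iomap *iomap, struct iomap *srcmap)
+   {
+           /* ...map the file range and fill out the other iomap fields... */
+
+           /* Sample the extent map generation for later revalidation. */
+           iomap->validity_cookie = READ_ONCE(MYFS_I(inode)->i_mapping_seq);
+           return 0;
+   }
+
+   static bool myfs_iomap_valid(struct inode *inode, const struct iomap *iomap)
+   {
+           /* Stale if the extent map changed after ->iomap_begin sampled it. */
+           return iomap->validity_cookie ==
+                  READ_ONCE(MYFS_I(inode)->i_mapping_seq);
+   }
+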
+These ``struct kiocb`` flags are significant for buffered I/O with iomap:
+
+ * ``IOCB_NOWAIT``: Turns on ``IOMAP_NOWAIT``.
+
+ * ``IOCB_DONTCACHE``: Turns on ``IOMAP_DONTCACHE``.
+
+Internal per-Folio State
+------------------------
+
+If the fsblock size matches the size of a pagecache folio, it is assumed
+that all disk I/O operations will operate on the entire folio.
+The uptodate (memory contents are at least as new as what's on disk) and
+dirty (memory contents are newer than what's on disk) status of the
+folio are all that's needed for this case.
+
+If the fsblock size is less than the size of a pagecache folio, iomap
+tracks the per-fsblock uptodate and dirty state itself.
+This enables iomap to handle both "bs < ps" `filesystems
+<https://lore.kernel.org/all/20230725122932.144426-1-ritesh.list@gmail.com/>`_
+and large folios in the pagecache.
+
+iomap internally tracks two state bits per fsblock:
+
+ * ``uptodate``: iomap will try to keep folios fully up to date.
+ If there are read(ahead) errors, those fsblocks will not be marked
+ uptodate.
+ The folio itself will be marked uptodate when all fsblocks within the
+ folio are uptodate.
+
+ * ``dirty``: iomap will set the per-block dirty state when programs
+ write to the file.
+ The folio itself will be marked dirty when any fsblock within the
+ folio is dirty.
+
+iomap also tracks the number of read and write disk I/Os that are in
+flight.
+This structure is much lighter weight than ``struct buffer_head``
+because there is only one per folio, and the per-fsblock overhead is two
+bits vs. 104 bytes.
+
+Filesystems wishing to turn on large folios in the pagecache should call
+``mapping_set_large_folios`` when initializing the incore inode.
+
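+A minimal sketch, assuming a hypothetical inode-initialization helper:
+
+.. code-block:: c
+
+   static void myfs_setup_inode(struct inode *inode)
+   {
+           /* ...set i_op, i_fop, a_ops, and so on... */
+           mapping_set_large_folios(inode->i_mapping);
+   }
+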
+Buffered Readahead and Reads
+----------------------------
+
+The ``iomap_readahead`` function initiates readahead to the pagecache.
+The ``iomap_read_folio`` function reads one folio's worth of data into
+the pagecache.
+The ``flags`` argument to ``->iomap_begin`` will be set to zero.
+The pagecache takes whatever locks it needs before calling the
+filesystem.
+
+Buffered Writes
+---------------
+
+The ``iomap_file_buffered_write`` function writes an ``iocb`` to the
+pagecache.
+``IOMAP_WRITE`` or ``IOMAP_WRITE | IOMAP_NOWAIT`` will be passed as
+the ``flags`` argument to ``->iomap_begin``.
+Callers commonly take ``i_rwsem`` in either shared or exclusive mode
+before calling this function.
+
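+A minimal ``->write_iter`` sketch, assuming hypothetical ``myfs_*``
+names and the three-argument form of ``iomap_file_buffered_write``;
+real implementations add more checking:
+
+.. code-block:: c
+
+   static ssize_t myfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+   {
+           struct inode *inode = file_inode(iocb->ki_filp);
+           ssize_t ret;
+
+           inode_lock(inode);
+           ret = generic_write_checks(iocb, from);
+           if (ret > 0)
+                   ret = iomap_file_buffered_write(iocb, from, &myfs_iomap_ops);
+           inode_unlock(inode);
+           if (ret > 0)
+                   ret = generic_write_sync(iocb, ret);
+           return ret;
+   }
+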
+mmap Write Faults
+~~~~~~~~~~~~~~~~~
+
+The ``iomap_page_mkwrite`` function handles a write fault to a folio in
+the pagecache.
+``IOMAP_WRITE | IOMAP_FAULT`` will be passed as the ``flags`` argument
+to ``->iomap_begin``.
+Callers commonly take the mmap ``invalidate_lock`` in shared or
+exclusive mode before calling this function.
+
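+A sketch of such a fault handler, with hypothetical ``myfs_*`` names;
+freeze protection and timestamp updates are elided:
+
+.. code-block:: c
+
+   static vm_fault_t myfs_page_mkwrite(struct vm_fault *vmf)
+   {
+           struct inode *inode = file_inode(vmf->vma->vm_file);
+           vm_fault_t ret;
+
+           filemap_invalidate_lock_shared(inode->i_mapping);
+           ret = iomap_page_mkwrite(vmf, &myfs_iomap_ops);
+           filemap_invalidate_unlock_shared(inode->i_mapping);
+           return ret;
+   }
+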
+Buffered Write Failures
+~~~~~~~~~~~~~~~~~~~~~~~
+
+After a short write to the pagecache, the areas not written will not
+become marked dirty.
+The filesystem must arrange to `cancel
+<https://lore.kernel.org/all/20221123055812.747923-6-david@fromorbit.com/>`_
+such `reservations
+<https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/>`_
+because writeback will not consume the reservation.
+The ``iomap_write_delalloc_release`` function can be called from a
+``->iomap_end`` function to find all the clean areas of the folios
+caching a fresh (``IOMAP_F_NEW``) delalloc mapping.
+It takes the ``invalidate_lock``.
+
+The filesystem must supply a function ``punch`` to be called for
+each file range in this state.
+This function must *only* remove delayed allocation reservations, in
+case another thread racing with the current thread writes successfully
+to the same region and triggers writeback to flush the dirty data out to
+disk.
+
+Zeroing for File Operations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Filesystems can call ``iomap_zero_range`` to perform zeroing of the
+pagecache for non-truncation file operations that are not aligned to
+the fsblock size.
+``IOMAP_ZERO`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive
+mode before calling this function.
+
+Unsharing Reflinked File Data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Filesystems can call ``iomap_file_unshare`` to force a file sharing
+storage with another file to preemptively copy the shared data to newly
+allocated storage.
+``IOMAP_WRITE | IOMAP_UNSHARE`` will be passed as the ``flags`` argument
+to ``->iomap_begin``.
+Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive
+mode before calling this function.
+
+Truncation
+----------
+
+Filesystems can call ``iomap_truncate_page`` to zero the bytes in the
+pagecache from EOF to the end of the fsblock during a file truncation
+operation.
+``truncate_setsize`` or ``truncate_pagecache`` will take care of
+everything after the EOF block.
+``IOMAP_ZERO`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive
+mode before calling this function.
+
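+A truncation path might combine the two as in this sketch; the
+``myfs_*`` names are hypothetical and error handling is abbreviated:
+
+.. code-block:: c
+
+   static int myfs_truncate(struct inode *inode, loff_t newsize)
+   {
+           /* Zero the pagecache from the new EOF to the end of the fsblock. */
+           int ret = iomap_truncate_page(inode, newsize, NULL,
+                                         &myfs_iomap_ops);
+           if (ret)
+                   return ret;
+
+           /* Drop the pagecache beyond the new EOF... */
+           truncate_setsize(inode, newsize);
+           /* ...and free the on-disk blocks past it (hypothetical helper). */
+           return myfs_remove_blocks(inode, newsize);
+   }
+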
+Pagecache Writeback
+-------------------
+
+Filesystems can call ``iomap_writepages`` to respond to a request to
+write dirty pagecache folios to disk.
+The ``mapping`` and ``wbc`` parameters should be passed unchanged.
+The ``wpc`` pointer should be allocated by the filesystem and must
+be initialized to zero.
+
+The pagecache will lock each folio before trying to schedule it for
+writeback.
+It does not lock ``i_rwsem`` or ``invalidate_lock``.
+
+The dirty bit will be cleared for all folios run through the
+``->map_blocks`` machinery described below even if the writeback fails.
+This is to prevent dirty folio clots when storage devices fail; an
+``-EIO`` is recorded for userspace to collect via ``fsync``.
+
+The ``ops`` structure must be specified and is as follows:
+
+``struct iomap_writeback_ops``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ struct iomap_writeback_ops {
+ int (*map_blocks)(struct iomap_writepage_ctx *wpc, struct inode *inode,
+ loff_t offset, unsigned len);
+ int (*submit_ioend)(struct iomap_writepage_ctx *wpc, int status);
+ void (*discard_folio)(struct folio *folio, loff_t pos);
+ };
+
+The fields are as follows:
+
+ - ``map_blocks``: Sets ``wpc->iomap`` to the space mapping of the file
+ range (in bytes) given by ``offset`` and ``len``.
+ iomap calls this function for each dirty fs block in each dirty folio,
+ though it will `reuse mappings
+ <https://lore.kernel.org/all/20231207072710.176093-15-hch@lst.de/>`_
+ for runs of contiguous dirty fsblocks within a folio.
+ Do not return ``IOMAP_INLINE`` mappings here; the ``->iomap_end``
+ function must deal with persisting written data.
+ Do not return ``IOMAP_DELALLOC`` mappings here; iomap currently
+ requires mapping to allocated space.
+ Filesystems can skip a potentially expensive mapping lookup if the
+ mappings have not changed.
+ This revalidation must be open-coded by the filesystem; it is
+ unclear if ``iomap::validity_cookie`` can be reused for this
+ purpose.
+ This function must be supplied by the filesystem; a sketch follows
+ this list.
+
+ - ``submit_ioend``: Allows the filesystem to hook into writeback bio
+ submission.
+ This might include pre-write space accounting updates, or installing
+ a custom ``->bi_end_io`` function for internal purposes, such as
+ deferring the ioend completion to a workqueue to run metadata update
+ transactions from process context before submitting the bio.
+ This function is optional.
+
+ - ``discard_folio``: iomap calls this function after ``->map_blocks``
+ fails to schedule I/O for any part of a dirty folio.
+ The function should throw away any reservations that may have been
+ made for the write.
+ The folio will be marked clean and an ``-EIO`` recorded in the
+ pagecache.
+ Filesystems can use this callback to `remove
+ <https://lore.kernel.org/all/20201029163313.1766967-1-bfoster@redhat.com/>`_
+ delalloc reservations to avoid having delalloc reservations for
+ clean pagecache.
+ This function is optional.
+
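+A skeletal ``->map_blocks`` might look like the following; the
+revalidation check and the extent lookup are hypothetical filesystem
+internals:
+
+.. code-block:: c
+
+   static int myfs_map_blocks(struct iomap_writepage_ctx *wpc,
+                              struct inode *inode, loff_t offset,
+                              unsigned len)
+   {
+           /*
+            * Reuse the cached mapping if it is still valid and covers
+            * this fsblock.
+            */
+           if (myfs_wb_mapping_valid(wpc) &&
+               offset >= wpc->iomap.offset &&
+               offset < wpc->iomap.offset + wpc->iomap.length)
+                   return 0;
+
+           /* Look up (or allocate) space and fill out wpc->iomap. */
+           return myfs_map_writeback_extent(inode, offset, len, &wpc->iomap);
+   }
+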
+Pagecache Writeback Completion
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To handle the bookkeeping that must happen after disk I/O for writeback
+completes, iomap creates chains of ``struct iomap_ioend`` objects that
+wrap the ``bio`` that is used to write pagecache data to disk.
+By default, iomap finishes writeback ioends by clearing the writeback
+bit on the folios attached to the ``ioend``.
+If the write failed, it will also set the error bits on the folios and
+the address space.
+This can happen in interrupt or process context, depending on the
+storage device.
+
+Filesystems that need to update internal bookkeeping (e.g. unwritten
+extent conversions) should provide a ``->submit_ioend`` function to
+set ``struct iomap_ioend::bio::bi_end_io`` to its own function.
+This function should call ``iomap_finish_ioends`` after finishing its
+own work (e.g. unwritten extent conversion).
+
+Some filesystems may wish to `amortize the cost of running metadata
+transactions
+<https://lore.kernel.org/all/20220120034733.221737-1-david@fromorbit.com/>`_
+for post-writeback updates by batching them.
+They may also require transactions to run from process context, which
+implies punting batches to a workqueue.
+iomap ioends contain a ``list_head`` to enable batching.
+
+Given a batch of ioends, iomap has a few helpers to assist with
+amortization:
+
+ * ``iomap_sort_ioends``: Sort all the ioends in the list by file
+ offset.
+
+ * ``iomap_ioend_try_merge``: Given an ioend that is not in any list and
+ a separate list of sorted ioends, merge as many of the ioends from
+ the head of the list into the given ioend.
+ ioends can only be merged if the file range and storage addresses are
+ contiguous; the unwritten and shared status are the same; and the
+ write I/O outcome is the same.
+ The merged ioends become their own list.
+
+ * ``iomap_finish_ioends``: Finish an ioend that possibly has other
+ ioends linked to it.
+
+Direct I/O
+==========
+
+In Linux, direct I/O is defined as file I/O that is issued directly to
+storage, bypassing the pagecache.
+The ``iomap_dio_rw`` function implements O_DIRECT (direct I/O) reads and
+writes for files.
+
+.. code-block:: c
+
+ ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
+ const struct iomap_ops *ops,
+ const struct iomap_dio_ops *dops,
+ unsigned int dio_flags, void *private,
+ size_t done_before);
+
+The filesystem can provide the ``dops`` parameter if it needs to perform
+extra work before or after the I/O is issued to storage.
+The ``done_before`` parameter tells iomap how much of the request has
+already been transferred.
+It is used to continue a request asynchronously when `part of the
+request
+<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c03098d4b9ad76bca2966a8769dcfe59f7f85103>`_
+has already been completed synchronously.
+
+The ``done_before`` parameter should be set if writes for the ``iocb``
+have been initiated prior to the call.
+The direction of the I/O is determined from the ``iocb`` passed in.
+
+The ``dio_flags`` argument can be set to any combination of the
+following values:
+
+ * ``IOMAP_DIO_FORCE_WAIT``: Wait for the I/O to complete even if the
+ kiocb is not synchronous.
+
+ * ``IOMAP_DIO_OVERWRITE_ONLY``: Perform a pure overwrite for this range
+ or fail with ``-EAGAIN``.
+ This can be used by filesystems with complex unaligned I/O
+ write paths to provide an optimised fast path for unaligned writes.
+ If a pure overwrite can be performed, then serialisation against
+ other I/Os to the same filesystem block(s) is unnecessary as there is
+ no risk of stale data exposure or data loss.
+ If a pure overwrite cannot be performed, then the filesystem can
+ perform the serialisation steps needed to provide exclusive access
+ to the unaligned I/O range so that it can perform allocation and
+ sub-block zeroing safely.
+ Filesystems can use this flag to try to reduce locking contention,
+ but a lot of `detailed checking
+ <https://lore.kernel.org/linux-ext4/20230314130759.642710-1-bfoster@redhat.com/>`_
+ is required to do it `correctly
+ <https://lore.kernel.org/linux-ext4/20230810165559.946222-1-bfoster@redhat.com/>`_.
+
+ * ``IOMAP_DIO_PARTIAL``: If a page fault occurs, return whatever
+ progress has already been made.
+ The caller may deal with the page fault and retry the operation.
+ If the caller decides to retry the operation, it should pass the
+ accumulated return values of all previous calls as the
+ ``done_before`` parameter to the next call.
+
+These ``struct kiocb`` flags are significant for direct I/O with iomap:
+
+ * ``IOCB_NOWAIT``: Turns on ``IOMAP_NOWAIT``.
+
+ * ``IOCB_SYNC``: Ensure that the device has persisted data to disk
+ before completing the call.
+ In the case of pure overwrites, the I/O may be issued with FUA
+ enabled.
+
+ * ``IOCB_HIPRI``: Poll for I/O completion instead of waiting for an
+ interrupt.
+ Only meaningful for asynchronous I/O, and only if the entire I/O can
+ be issued as a single ``struct bio``.
+
+ * ``IOCB_DIO_CALLER_COMP``: Try to run I/O completion from the caller's
+ process context.
+ See ``linux/fs.h`` for more details.
+
+Filesystems should call ``iomap_dio_rw`` from ``->read_iter`` and
+``->write_iter``, and set ``FMODE_CAN_ODIRECT`` in the ``->open``
+function for the file.
+They should not set ``->direct_IO``, which is deprecated.
+
+If a filesystem wishes to perform its own work before direct I/O
+completion, it should call ``__iomap_dio_rw``.
+If its return value is not an error pointer or a NULL pointer, the
+filesystem should pass the return value to ``iomap_dio_complete`` after
+finishing its internal work.
+
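+A simple direct I/O ``->read_iter`` sketch with hypothetical ``myfs_*``
+names, no ``dops``, and no special flags:
+
+.. code-block:: c
+
+   static ssize_t myfs_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
+   {
+           struct inode *inode = file_inode(iocb->ki_filp);
+           ssize_t ret;
+
+           inode_lock_shared(inode);
+           ret = iomap_dio_rw(iocb, to, &myfs_iomap_ops, NULL, 0, NULL, 0);
+           inode_unlock_shared(inode);
+           return ret;
+   }
+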
+Return Values
+-------------
+
+``iomap_dio_rw`` can return one of the following:
+
+ * A non-negative number of bytes transferred.
+
+ * ``-ENOTBLK``: Fall back to buffered I/O.
+ iomap itself will return this value if it cannot invalidate the page
+ cache before issuing the I/O to storage.
+ The ``->iomap_begin`` or ``->iomap_end`` functions may also return
+ this value.
+
+ * ``-EIOCBQUEUED``: The asynchronous direct I/O request has been
+ queued and will be completed separately.
+
+ * Any of the other negative error codes.
+
+Direct Reads
+------------
+
+A direct I/O read initiates a read I/O from the storage device to the
+caller's buffer.
+Dirty parts of the pagecache are flushed to storage before initiating
+the read I/O.
+The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DIRECT`` with
+any combination of the following enhancements:
+
+ * ``IOMAP_NOWAIT``, as defined previously.
+
+Callers commonly hold ``i_rwsem`` in shared mode before calling this
+function.
+
+Direct Writes
+-------------
+
+A direct I/O write initiates a write I/O to the storage device from the
+caller's buffer.
+Dirty parts of the pagecache are flushed to storage before initiating
+the write I/O.
+The pagecache is invalidated both before and after the write I/O.
+The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DIRECT |
+IOMAP_WRITE`` with any combination of the following enhancements:
+
+ * ``IOMAP_NOWAIT``, as defined previously.
+
+ * ``IOMAP_OVERWRITE_ONLY``: Allocating blocks and zeroing partial
+ blocks is not allowed.
+ The entire file range must map to a single written or unwritten
+ extent.
+ The file I/O range must be aligned to the filesystem block size
+ if the mapping is unwritten and the filesystem cannot handle zeroing
+ the unaligned regions without exposing stale contents.
+
+ * ``IOMAP_ATOMIC``: This write is being issued with torn-write
+ protection.
+ Torn-write protection may be provided based on HW-offload or by a
+ software mechanism provided by the filesystem.
+
+ For HW-offload based support, only a single bio can be created for the
+ write, and the write must not be split into multiple I/O requests, i.e.
+the ``REQ_ATOMIC`` flag must be set.
+ The file range to write must be aligned to satisfy the requirements
+ of both the filesystem and the underlying block device's atomic
+ commit capabilities.
+ If filesystem metadata updates are required (e.g. unwritten extent
+ conversion or copy-on-write), all updates for the entire file range
+ must be committed atomically as well.
+ Untorn-writes may be longer than a single file block.
+ In all cases, the mapping start disk block must have at least the
+ same alignment as the write offset.
+ The filesystem must set ``IOMAP_F_ATOMIC_BIO`` to inform the iomap
+ core of an untorn-write based on HW-offload.
+
+ For untorn-writes based on a software mechanism provided by the
+ filesystem, all the disk block alignment and single bio restrictions
+ which apply for HW-offload based untorn-writes do not apply.
+ The mechanism would typically be used as a fallback when HW-offload
+ based untorn-writes cannot be issued, e.g. when the range of the
+ write covers multiple extents, meaning that it is not possible to
+ issue a single bio.
+ All filesystem metadata updates for the entire file range must be
+ committed atomically as well.
+
+Callers commonly hold ``i_rwsem`` in shared or exclusive mode before
+calling this function.
+
+``struct iomap_dio_ops``
+-------------------------
+
+.. code-block:: c
+
+ struct iomap_dio_ops {
+ void (*submit_io)(const struct iomap_iter *iter, struct bio *bio,
+ loff_t file_offset);
+ int (*end_io)(struct kiocb *iocb, ssize_t size, int error,
+ unsigned flags);
+ struct bio_set *bio_set;
+ };
+
+The fields of this structure are as follows:
+
+ - ``submit_io``: iomap calls this function when it has constructed a
+ ``struct bio`` object for the I/O requested, and wishes to submit it
+ to the block device.
+ If no function is provided, ``submit_bio`` will be called directly.
+ Filesystems that would like to perform additional work before bio
+ submission (e.g. data replication for btrfs) should implement this
+ function.
+
+ - ``end_io``: This is called after the ``struct bio`` completes.
+ This function should perform post-write conversions of unwritten
+ extent mappings, handle write failures, etc.; a sketch follows this
+ list.
+ The ``flags`` argument may be set to a combination of the following:
+
+ * ``IOMAP_DIO_UNWRITTEN``: The mapping was unwritten, so the ioend
+ should mark the extent as written.
+
+ * ``IOMAP_DIO_COW``: Writing to the space in the mapping required a
+ copy on write operation, so the ioend should switch mappings.
+
+ - ``bio_set``: This allows the filesystem to provide a custom bio_set
+ for allocating direct I/O bios.
+ This enables filesystems to `stash additional per-bio information
+ <https://lore.kernel.org/all/20220505201115.937837-3-hch@lst.de/>`_
+ for private use.
+ If this field is NULL, generic ``struct bio`` objects will be used.
+
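+An ``->end_io`` that converts unwritten extents after a successful
+write might look like this sketch; ``myfs_convert_unwritten`` is a
+hypothetical filesystem helper:
+
+.. code-block:: c
+
+   static int myfs_dio_end_io(struct kiocb *iocb, ssize_t size, int error,
+                              unsigned flags)
+   {
+           if (error)
+                   return error;
+           if (flags & IOMAP_DIO_UNWRITTEN)
+                   return myfs_convert_unwritten(file_inode(iocb->ki_filp),
+                                                 iocb->ki_pos, size);
+           return 0;
+   }
+
+   static const struct iomap_dio_ops myfs_dio_ops = {
+           .end_io = myfs_dio_end_io,
+   };
+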
+Filesystems that want to perform extra work after an I/O completion
+should set a custom ``->bi_end_io`` function via ``->submit_io``.
+Afterwards, the custom endio function must call
+``iomap_dio_bio_end_io`` to finish the direct I/O.
+
+DAX I/O
+=======
+
+Some storage devices can be directly mapped as memory.
+These devices support a new access mode known as "fsdax" that allows
+loads and stores through the CPU and memory controller.
+
+fsdax Reads
+-----------
+
+A fsdax read performs a memcpy from storage device to the caller's
+buffer.
+The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DAX`` with any
+combination of the following enhancements:
+
+ * ``IOMAP_NOWAIT``, as defined previously.
+
+Callers commonly hold ``i_rwsem`` in shared mode before calling this
+function.
+
+fsdax Writes
+------------
+
+A fsdax write initiates a memcpy to the storage device from the caller's
+buffer.
+The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DAX |
+IOMAP_WRITE`` with any combination of the following enhancements:
+
+ * ``IOMAP_NOWAIT``, as defined previously.
+
+ * ``IOMAP_OVERWRITE_ONLY``: The caller requires a pure overwrite to be
+ performed from this mapping.
+ This requires the filesystem extent mapping to already exist as an
+ ``IOMAP_MAPPED`` type and span the entire range of the write I/O
+ request.
+ If the filesystem cannot map this request in a way that allows the
+ iomap infrastructure to perform a pure overwrite, it must fail the
+ mapping operation with ``-EAGAIN``.
+
+Callers commonly hold ``i_rwsem`` in exclusive mode before calling this
+function.
+
+fsdax mmap Faults
+~~~~~~~~~~~~~~~~~
+
+The ``dax_iomap_fault`` function handles read and write faults to fsdax
+storage.
+For a read fault, ``IOMAP_DAX | IOMAP_FAULT`` will be passed as the
+``flags`` argument to ``->iomap_begin``.
+For a write fault, ``IOMAP_DAX | IOMAP_FAULT | IOMAP_WRITE`` will be
+passed as the ``flags`` argument to ``->iomap_begin``.
+
+Callers commonly hold the same locks as they do to call their iomap
+pagecache counterparts.
+
+fsdax Truncation, fallocate, and Unsharing
+------------------------------------------
+
+For fsdax files, the following functions are provided to replace their
+iomap pagecache I/O counterparts.
+The ``flags`` argument to ``->iomap_begin`` is the same as for the
+pagecache counterparts, with ``IOMAP_DAX`` added.
+
+ * ``dax_file_unshare``
+ * ``dax_zero_range``
+ * ``dax_truncate_page``
+
+Callers commonly hold the same locks as they do to call their iomap
+pagecache counterparts.
+
+fsdax Deduplication
+-------------------
+
+Filesystems implementing the ``FIDEDUPERANGE`` ioctl must call the
+``dax_remap_file_range_prep`` function with their own iomap read ops.
+
+Seeking Files
+=============
+
+iomap implements the two iterating whence modes of the ``llseek`` system
+call.
+
+SEEK_DATA
+---------
+
+The ``iomap_seek_data`` function implements the SEEK_DATA "whence" value
+for llseek.
+``IOMAP_REPORT`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+
+For unwritten mappings, the pagecache will be searched.
+Regions of the pagecache with a folio mapped and uptodate fsblocks
+within those folios will be reported as data areas.
+
+Callers commonly hold ``i_rwsem`` in shared mode before calling this
+function.
+
+SEEK_HOLE
+---------
+
+The ``iomap_seek_hole`` function implements the SEEK_HOLE "whence" value
+for llseek.
+``IOMAP_REPORT`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+
+For unwritten mappings, the pagecache will be searched.
+Regions of the pagecache with no folio mapped, or a !uptodate fsblock
+within a folio will be reported as sparse hole areas.
+
+Callers commonly hold ``i_rwsem`` in shared mode before calling this
+function.
+
+Swap File Activation
+====================
+
+The ``iomap_swapfile_activate`` function finds all the base-page aligned
+regions in a file and sets them up as swap space.
+The file will be ``fsync()``'d before activation.
+``IOMAP_REPORT`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+All mappings must be mapped or unwritten; they cannot be dirty or
+shared, and cannot span multiple block devices.
+Callers must hold ``i_rwsem`` in exclusive mode; this is already
+provided by ``swapon``.
+
+File Space Mapping Reporting
+============================
+
+iomap implements two of the file space mapping system calls.
+
+FS_IOC_FIEMAP
+-------------
+
+The ``iomap_fiemap`` function exports file extent mappings to userspace
+in the format specified by the ``FS_IOC_FIEMAP`` ioctl.
+``IOMAP_REPORT`` will be passed as the ``flags`` argument to
+``->iomap_begin``.
+Callers commonly hold ``i_rwsem`` in shared mode before calling this
+function.
+
+FIBMAP (deprecated)
+-------------------
+
+``iomap_bmap`` implements FIBMAP.
+The calling conventions are the same as for FIEMAP.
+This function is only provided to maintain compatibility for filesystems
+that implemented FIBMAP prior to conversion.
+This ioctl is deprecated; do **not** add a FIBMAP implementation to
+filesystems that do not have it.
+Callers should probably hold ``i_rwsem`` in shared mode before calling
+this function, but this is unclear.
diff --git a/Documentation/filesystems/iomap/porting.rst b/Documentation/filesystems/iomap/porting.rst
new file mode 100644
index 000000000000..3d49a32c0fff
--- /dev/null
+++ b/Documentation/filesystems/iomap/porting.rst
@@ -0,0 +1,120 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _iomap_porting:
+
+..
+ Dumb style notes to maintain the author's sanity:
+ Please try to start sentences on separate lines so that
+ sentence changes don't bleed colors in diff.
+ Heading decorations are documented in sphinx.rst.
+
+=======================
+Porting Your Filesystem
+=======================
+
+.. contents:: Table of Contents
+ :local:
+
+Why Convert?
+============
+
+There are several reasons to convert a filesystem to iomap:
+
+ 1. The classic Linux I/O path is not terribly efficient.
+ Pagecache operations lock a single base page at a time and then call
+ into the filesystem to return a mapping for only that page.
+ Direct I/O operations build I/O requests a single file block at a
+ time.
+ This worked well enough for direct/indirect-mapped filesystems such
+ as ext2, but is very inefficient for extent-based filesystems such
+ as XFS.
+
+ 2. Large folios are only supported via iomap; there are no plans to
+ convert the old buffer_head path to use them.
+
+ 3. Direct access to storage on memory-like devices (fsdax) is only
+ supported via iomap.
+
+ 4. Lower maintenance overhead for individual filesystem maintainers.
+ iomap handles common pagecache related operations itself, such as
+ allocating, instantiating, locking, and unlocking of folios.
+ No ->write_begin(), ->write_end() or ->direct_IO
+ address_space_operations need to be implemented by a
+ filesystem using iomap.
+
+How Do I Convert a Filesystem?
+==============================
+
+First, add ``#include <linux/iomap.h>`` to your source code and add
+``select FS_IOMAP`` to your filesystem's Kconfig option.
+Build the kernel, then run fstests with the ``-g all`` option across a
+wide variety of your filesystem's supported configurations to establish
+a baseline of which tests pass and which ones fail.
+
+The recommended approach is first to implement ``->iomap_begin`` (and
+``->iomap_end`` if necessary) to allow iomap to obtain a read-only
+mapping of a file range.
+In most cases, this is a relatively trivial conversion of the existing
+``get_block()`` function for read-only mappings.
+``FS_IOC_FIEMAP`` is a good first target because it is trivial to
+implement support for it and then to determine that the extent map
+iteration is correct from userspace.
+If FIEMAP is returning the correct information, it's a good sign that
+other read-only mapping operations will do the right thing.
+
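+A first cut can simply wrap ``iomap_fiemap`` with the new read-only
+mapping ops, as in this hypothetical sketch:
+
+.. code-block:: c
+
+   static int myfs_fiemap(struct inode *inode, struct fiemap_extent_info *fi,
+                          u64 start, u64 len)
+   {
+           return iomap_fiemap(inode, fi, start, len, &myfs_read_iomap_ops);
+   }
+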
+Next, modify the filesystem's ``get_block(create = false)``
+implementation to use the new ``->iomap_begin`` implementation to map
+file space for selected read operations.
+Hide behind a debugging knob the ability to switch on the iomap mapping
+functions for selected call paths.
+It is necessary to write some code to fill out the bufferhead-based
+mapping information from the ``iomap`` structure, but the new functions
+can be tested without needing to implement any iomap APIs.
+
+Once the read-only functions are working like this, convert each high
+level file operation one by one to use iomap native APIs instead of
+going through ``get_block()``.
+If you convert one operation at a time, regressions should be
+self-evident.
+You *do* have a regression test baseline for fstests, right?
+It is suggested to convert swap file activation, ``SEEK_DATA``, and
+``SEEK_HOLE`` before tackling the I/O paths.
+A likely complexity at this point will be converting the buffered read
+I/O path because of bufferheads.
+The buffered read I/O path doesn't need to be converted yet, though the
+direct I/O read path should be converted in this phase.
+
+At this point, you should look over your ``->iomap_begin`` function.
+If it switches between large blocks of code based on dispatching of the
+``flags`` argument, you should consider breaking it up into
+per-operation iomap ops with smaller, more cohesive functions.
+XFS is a good example of this.
+
+The next thing to do is implement ``get_block(create = true)``
+functionality in the ``->iomap_begin``/``->iomap_end`` methods.
+It is strongly recommended to create separate mapping functions and
+iomap ops for write operations.
+Then convert the direct I/O write path to iomap, and start running fsx
+with direct I/O enabled in earnest on the filesystem.
+This will flush out lots of data integrity corner case bugs that the new
+write mapping implementation introduces.
+
+Now, convert any remaining file operations to call the iomap functions.
+This will get the entire filesystem using the new mapping functions, and
+they should largely be debugged and working correctly after this step.
+
+Most likely at this point, the buffered read and write paths will still
+need to be converted.
+The mapping functions should all work correctly, so all that needs to be
+done is rewriting all the code that interfaces with bufferheads to
+interface with iomap and folios.
+It is much easier first to get regular file I/O (without any fancy
+features like fscrypt, fsverity, compression, or data=journaling)
+converted to use iomap.
+Some of those fancy features (fscrypt and compression) aren't
+implemented yet in iomap.
+For unjournalled filesystems that use the pagecache for symbolic links
+and directories, you might also try converting their handling to iomap.
+
+The rest is left as an exercise for the reader, as it will be different
+for every filesystem.
+If you encounter problems, email the people and lists in
+``get_maintainers.pl`` for help.
diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
index 58ce6b395206..863e93e623f7 100644
--- a/Documentation/filesystems/journalling.rst
+++ b/Documentation/filesystems/journalling.rst
@@ -10,27 +10,27 @@ Details
The journalling layer is easy to use. You need to first of all create a
journal_t data structure. There are two calls to do this dependent on
how you decide to allocate the physical media on which the journal
-resides. The :c:func:`jbd2_journal_init_inode` call is for journals stored in
-filesystem inodes, or the :c:func:`jbd2_journal_init_dev` call can be used
+resides. The jbd2_journal_init_inode() call is for journals stored in
+filesystem inodes, or the jbd2_journal_init_dev() call can be used
for journal stored on a raw device (in a continuous range of blocks). A
journal_t is a typedef for a struct pointer, so when you are finally
-finished make sure you call :c:func:`jbd2_journal_destroy` on it to free up
+finished make sure you call jbd2_journal_destroy() on it to free up
any used kernel memory.
Once you have got your journal_t object you need to 'mount' or load the
journal file. The journalling layer expects the space for the journal
was already allocated and initialized properly by the userspace tools.
-When loading the journal you must call :c:func:`jbd2_journal_load` to process
+When loading the journal you must call jbd2_journal_load() to process
journal contents. If the client file system detects the journal contents
does not need to be processed (or even need not have valid contents), it
-may call :c:func:`jbd2_journal_wipe` to clear the journal contents before
-calling :c:func:`jbd2_journal_load`.
+may call jbd2_journal_wipe() to clear the journal contents before
+calling jbd2_journal_load().
Note that jbd2_journal_wipe(..,0) calls
-:c:func:`jbd2_journal_skip_recovery` for you if it detects any outstanding
-transactions in the journal and similarly :c:func:`jbd2_journal_load` will
-call :c:func:`jbd2_journal_recover` if necessary. I would advise reading
-:c:func:`ext4_load_journal` in fs/ext4/super.c for examples on this stage.
+jbd2_journal_skip_recovery() for you if it detects any outstanding
+transactions in the journal and similarly jbd2_journal_load() will
+call jbd2_journal_recover() if necessary. I would advise reading
+ext4_load_journal() in fs/ext4/super.c for examples on this stage.
Now you can go ahead and start modifying the underlying filesystem.
Almost.
@@ -39,57 +39,57 @@ You still need to actually journal your filesystem changes, this is done
by wrapping them into transactions. Additionally you also need to wrap
the modification of each of the buffers with calls to the journal layer,
so it knows what the modifications you are actually making are. To do
-this use :c:func:`jbd2_journal_start` which returns a transaction handle.
+this use jbd2_journal_start() which returns a transaction handle.
-:c:func:`jbd2_journal_start` and its counterpart :c:func:`jbd2_journal_stop`,
+jbd2_journal_start() and its counterpart jbd2_journal_stop(),
which indicates the end of a transaction are nestable calls, so you can
reenter a transaction if necessary, but remember you must call
-:c:func:`jbd2_journal_stop` the same number of times as
-:c:func:`jbd2_journal_start` before the transaction is completed (or more
+jbd2_journal_stop() the same number of times as
+jbd2_journal_start() before the transaction is completed (or more
accurately leaves the update phase). Ext4/VFS makes use of this feature to
simplify handling of inode dirtying, quota support, etc.
Inside each transaction you need to wrap the modifications to the
individual buffers (blocks). Before you start to modify a buffer you
-need to call :c:func:`jbd2_journal_get_create_access()` /
-:c:func:`jbd2_journal_get_write_access()` /
-:c:func:`jbd2_journal_get_undo_access()` as appropriate, this allows the
+need to call jbd2_journal_get_create_access() /
+jbd2_journal_get_write_access() /
+jbd2_journal_get_undo_access() as appropriate, this allows the
journalling layer to copy the unmodified
data if it needs to. After all the buffer may be part of a previously
uncommitted transaction. At this point you are at last ready to modify a
buffer, and once you are have done so you need to call
-:c:func:`jbd2_journal_dirty_metadata`. Or if you've asked for access to a
+jbd2_journal_dirty_metadata(). Or if you've asked for access to a
buffer you now know is no longer required to be pushed back on the
-device you can call :c:func:`jbd2_journal_forget` in much the same way as you
-might have used :c:func:`bforget` in the past.
+device you can call jbd2_journal_forget() in much the same way as you
+might have used bforget() in the past.
-A :c:func:`jbd2_journal_flush` may be called at any time to commit and
+A jbd2_journal_flush() may be called at any time to commit and
checkpoint all your transactions.
-Then at umount time , in your :c:func:`put_super` you can then call
-:c:func:`jbd2_journal_destroy` to clean up your in-core journal object.
+Then at umount time, in your put_super() you can call
+jbd2_journal_destroy() to clean up your in-core journal object.
Unfortunately there a couple of ways the journal layer can cause a
deadlock. The first thing to note is that each task can only have a
single outstanding transaction at any one time, remember nothing commits
-until the outermost :c:func:`jbd2_journal_stop`. This means you must complete
+until the outermost jbd2_journal_stop(). This means you must complete
the transaction at the end of each file/inode/address etc. operation you
perform, so that the journalling system isn't re-entered on another
journal. Since transactions can't be nested/batched across differing
journals, and another filesystem other than yours (say ext4) may be
modified in a later syscall.
-The second case to bear in mind is that :c:func:`jbd2_journal_start` can block
+The second case to bear in mind is that jbd2_journal_start() can block
if there isn't enough space in the journal for your transaction (based
on the passed nblocks param) - when it blocks it merely(!) needs to wait
for transactions to complete and be committed from other tasks, so
-essentially we are waiting for :c:func:`jbd2_journal_stop`. So to avoid
-deadlocks you must treat :c:func:`jbd2_journal_start` /
-:c:func:`jbd2_journal_stop` as if they were semaphores and include them in
+essentially we are waiting for jbd2_journal_stop(). So to avoid
+deadlocks you must treat jbd2_journal_start() /
+jbd2_journal_stop() as if they were semaphores and include them in
your semaphore ordering rules to prevent
-deadlocks. Note that :c:func:`jbd2_journal_extend` has similar blocking
-behaviour to :c:func:`jbd2_journal_start` so you can deadlock here just as
-easily as on :c:func:`jbd2_journal_start`.
+deadlocks. Note that jbd2_journal_extend() has similar blocking
+behaviour to jbd2_journal_start() so you can deadlock here just as
+easily as on jbd2_journal_start().
Try to reserve the right number of blocks the first time. ;-). This will
be the maximum number of blocks you are going to touch in this
@@ -111,13 +111,11 @@ a callback function when the transaction is finally committed to disk,
so that you can do some of your own management. You ask the journalling
layer for calling the callback by simply setting
``journal->j_commit_callback`` function pointer and that function is
-called after each transaction commit. You can also use
-``transaction->t_private_list`` for attaching entries to a transaction
-that need processing when the transaction commits.
+called after each transaction commit.
JBD2 also provides a way to block all transaction updates via
-:c:func:`jbd2_journal_lock_updates()` /
-:c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
+jbd2_journal_lock_updates() /
+jbd2_journal_unlock_updates(). Ext4 uses this when it wants a
window with a clean and stable fs for a moment. E.g.
::
@@ -132,6 +130,37 @@ The opportunities for abuse and DOS attacks with this should be obvious,
if you allow unprivileged userspace to trigger codepaths containing
these calls.
+Fast commits
+~~~~~~~~~~~~
+
+JBD2 also allows you to perform file-system specific delta commits known as
+fast commits. In order to use fast commits, you will need to set the following
+callbacks that perform the corresponding work:
+
+`journal->j_fc_cleanup_cb`: Cleanup function called after every full commit and
+fast commit.
+
+`journal->j_fc_replay_cb`: Replay function called for replay of fast commit
+blocks.
+
+The file system is free to perform fast commits as and when it wants as long
+as it gets permission from JBD2 to do so by calling the function
+jbd2_fc_begin_commit(). Once a fast commit is done, the client
+file system should tell JBD2 about it by calling
+jbd2_fc_end_commit(). If the file system wants JBD2 to perform a full
+commit immediately after stopping the fast commit, it can do so by calling
+jbd2_fc_end_commit_fallback(). This is useful if the fast commit operation
+fails for some reason and the only way to guarantee consistency is for JBD2 to
+perform the full traditional commit.
+
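+A sketch of the calling sequence for one fast commit of transaction tid,
+where myfs_write_fc_blocks() is a hypothetical helper that writes the
+file system's delta blocks (error handling abbreviated)::
+
+	if (jbd2_fc_begin_commit(journal, tid) == 0) {
+		if (myfs_write_fc_blocks(journal))
+			jbd2_fc_end_commit_fallback(journal);
+		else
+			jbd2_fc_end_commit(journal);
+	}
+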
+JBD2 also provides helper functions to manage fast commit buffers. The file
+system can use jbd2_fc_get_buf() and jbd2_fc_wait_bufs() to allocate fast
+commit buffers and wait on their IO completion.
+
+Currently, only Ext4 implements fast commits. For details of its implementation
+of fast commits, please refer to the top level comments in
+fs/ext4/fast_commit.c.
+
Summary
~~~~~~~
@@ -168,7 +197,7 @@ Journal Level
.. kernel-doc:: fs/jbd2/recovery.c
:internal:
-Transasction Level
+Transaction Level
~~~~~~~~~~~~~~~~~~
.. kernel-doc:: fs/jbd2/transaction.c
diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index 318605de83f3..2e567e341c3b 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -17,7 +17,8 @@ dentry_operations
prototypes::
- int (*d_revalidate)(struct dentry *, unsigned int);
+ int (*d_revalidate)(struct inode *, const struct qstr *,
+ struct dentry *, unsigned int);
int (*d_weak_revalidate)(struct dentry *, unsigned int);
int (*d_hash)(const struct dentry *, struct qstr *);
int (*d_compare)(const struct dentry *,
@@ -29,7 +30,9 @@ prototypes::
char *(*d_dname)(struct dentry *dentry, char *buffer, int buflen);
struct vfsmount *(*d_automount)(struct path *path);
int (*d_manage)(const struct path *, bool);
- struct dentry *(*d_real)(struct dentry *, const struct inode *);
+ struct dentry *(*d_real)(struct dentry *, enum d_real_type type);
+ bool (*d_unalias_trylock)(const struct dentry *);
+ void (*d_unalias_unlock)(const struct dentry *);
locking rules:
@@ -49,6 +52,8 @@ d_dname: no no no no
d_automount: no no yes no
d_manage: no no yes (ref-walk) maybe
d_real no no yes no
+d_unalias_trylock yes no no no
+d_unalias_unlock yes no no no
================== =========== ======== ============== ========
inode_operations
@@ -56,37 +61,43 @@ inode_operations
prototypes::
- int (*create) (struct inode *,struct dentry *,umode_t, bool);
+ int (*create) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t, bool);
struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
int (*link) (struct dentry *,struct inode *,struct dentry *);
int (*unlink) (struct inode *,struct dentry *);
- int (*symlink) (struct inode *,struct dentry *,const char *);
- int (*mkdir) (struct inode *,struct dentry *,umode_t);
+ int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,const char *);
+ struct dentry *(*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
int (*rmdir) (struct inode *,struct dentry *);
- int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t);
- int (*rename) (struct inode *, struct dentry *,
+ int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t);
+ int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
struct inode *, struct dentry *, unsigned int);
int (*readlink) (struct dentry *, char __user *,int);
const char *(*get_link) (struct dentry *, struct inode *, struct delayed_call *);
void (*truncate) (struct inode *);
- int (*permission) (struct inode *, int, unsigned int);
- int (*get_acl)(struct inode *, int);
- int (*setattr) (struct dentry *, struct iattr *);
- int (*getattr) (const struct path *, struct kstat *, u32, unsigned int);
+ int (*permission) (struct mnt_idmap *, struct inode *, int, unsigned int);
+ struct posix_acl * (*get_inode_acl)(struct inode *, int, bool);
+ int (*setattr) (struct mnt_idmap *, struct dentry *, struct iattr *);
+ int (*getattr) (struct mnt_idmap *, const struct path *, struct kstat *, u32, unsigned int);
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len);
void (*update_time)(struct inode *, struct timespec *, int);
int (*atomic_open)(struct inode *, struct dentry *,
struct file *, unsigned open_flag,
umode_t create_mode);
- int (*tmpfile) (struct inode *, struct dentry *, umode_t);
+ int (*tmpfile) (struct mnt_idmap *, struct inode *,
+ struct file *, umode_t);
+ int (*fileattr_set)(struct mnt_idmap *idmap,
+ struct dentry *dentry, struct fileattr *fa);
+ int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
+ struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int);
+ struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
locking rules:
all may block
-============ =============================================
+============== ==================================================
ops i_rwsem(inode)
-============ =============================================
+============== ==================================================
lookup: shared
create: exclusive
link: exclusive (both)
@@ -95,11 +106,12 @@ symlink: exclusive
mkdir: exclusive
unlink: exclusive (both)
rmdir: exclusive (both)(see below)
-rename: exclusive (all) (see below)
+rename: exclusive (both parents, some children) (see below)
readlink: no
get_link: no
setattr: exclusive
permission: no (may not block if called in rcu-walk mode)
+get_inode_acl: no
get_acl: no
getattr: no
listxattr: no
@@ -107,12 +119,18 @@ fiemap: no
update_time: no
atomic_open: shared (exclusive if O_CREAT is set in open flags)
tmpfile: no
-============ =============================================
+fileattr_get: no or exclusive
+fileattr_set: exclusive
+get_offset_ctx no
+============== ==================================================
Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem
exclusive on victim.
cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
+ ->unlink() and ->rename() have ->i_rwsem exclusive on all non-directories
+ involved.
+ ->rename() has ->i_rwsem exclusive on any subdirectory that changes parent.
See Documentation/filesystems/directory-locking.rst for more detailed discussion
of the locking scheme for directory operations.
@@ -126,9 +144,10 @@ prototypes::
int (*get)(const struct xattr_handler *handler, struct dentry *dentry,
struct inode *inode, const char *name, void *buffer,
size_t size);
- int (*set)(const struct xattr_handler *handler, struct dentry *dentry,
- struct inode *inode, const char *name, const void *buffer,
- size_t size, int flags);
+ int (*set)(const struct xattr_handler *handler,
+ struct mnt_idmap *idmap,
+ struct dentry *dentry, struct inode *inode, const char *name,
+ const void *buffer, size_t size, int flags);
locking rules:
all may block
@@ -163,7 +182,6 @@ prototypes::
int (*show_options)(struct seq_file *, struct dentry *);
ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
- int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
locking rules:
All may block [not true, see below]
@@ -188,7 +206,6 @@ umount_begin: no
show_options: no (namespace_sem)
quota_read: no (see below)
quota_write: no (see below)
-bdev_try_to_free_page: no (see below)
====================== ============ ========================
->statfs() has s_umount (shared) when called by ustat(2) (native or
@@ -204,9 +221,6 @@ dqio_sem) (unless an admin really wants to screw up something and
writes to quota files with quotas on). For other details about locking
see also dquot_operations section.
-->bdev_try_to_free_page is called from the ->releasepage handler of
-the block device inode. See there for more details.
-
file_system_type
================
@@ -235,120 +249,63 @@ address_space_operations
========================
prototypes::
- int (*writepage)(struct page *page, struct writeback_control *wbc);
- int (*readpage)(struct file *, struct page *);
+ int (*read_folio)(struct file *, struct folio *);
int (*writepages)(struct address_space *, struct writeback_control *);
- int (*set_page_dirty)(struct page *page);
+ bool (*dirty_folio)(struct address_space *, struct folio *folio);
void (*readahead)(struct readahead_control *);
- int (*readpages)(struct file *filp, struct address_space *mapping,
- struct list_head *pages, unsigned nr_pages);
int (*write_begin)(struct file *, struct address_space *mapping,
- loff_t pos, unsigned len, unsigned flags,
- struct page **pagep, void **fsdata);
+ loff_t pos, unsigned len,
+ struct folio **foliop, void **fsdata);
int (*write_end)(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
- struct page *page, void *fsdata);
+ struct folio *folio, void *fsdata);
sector_t (*bmap)(struct address_space *, sector_t);
- void (*invalidatepage) (struct page *, unsigned int, unsigned int);
- int (*releasepage) (struct page *, int);
- void (*freepage)(struct page *);
+ void (*invalidate_folio) (struct folio *, size_t start, size_t len);
+ bool (*release_folio)(struct folio *, gfp_t);
+ void (*free_folio)(struct folio *);
int (*direct_IO)(struct kiocb *, struct iov_iter *iter);
- bool (*isolate_page) (struct page *, isolate_mode_t);
- int (*migratepage)(struct address_space *, struct page *, struct page *);
- void (*putback_page) (struct page *);
- int (*launder_page)(struct page *);
- int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
- int (*error_remove_page)(struct address_space *, struct page *);
- int (*swap_activate)(struct file *);
+ int (*migrate_folio)(struct address_space *, struct folio *dst,
+ struct folio *src, enum migrate_mode);
+ int (*launder_folio)(struct folio *);
+ bool (*is_partially_uptodate)(struct folio *, size_t from, size_t count);
+ int (*error_remove_folio)(struct address_space *, struct folio *);
+ int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span);
int (*swap_deactivate)(struct file *);
+ int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
locking rules:
- All except set_page_dirty and freepage may block
+ All except dirty_folio and free_folio may block
-====================== ======================== =========
-ops PageLocked(page) i_rwsem
-====================== ======================== =========
-writepage: yes, unlocks (see below)
-readpage: yes, unlocks
+====================== ======================== ========= ===============
+ops folio locked i_rwsem invalidate_lock
+====================== ======================== ========= ===============
+read_folio: yes, unlocks shared
writepages:
-set_page_dirty no
-readahead: yes, unlocks
-readpages: no
-write_begin: locks the page exclusive
+dirty_folio: maybe
+readahead: yes, unlocks shared
+write_begin: locks the folio exclusive
write_end: yes, unlocks exclusive
bmap:
-invalidatepage: yes
-releasepage: yes
-freepage: yes
+invalidate_folio: yes exclusive
+release_folio: yes
+free_folio: yes
direct_IO:
-isolate_page: yes
-migratepage: yes (both)
-putback_page: yes
-launder_page: yes
+migrate_folio: yes (both)
+launder_folio: yes
is_partially_uptodate: yes
-error_remove_page: yes
+error_remove_folio: yes
swap_activate: no
swap_deactivate: no
-====================== ======================== =========
+swap_rw: yes, unlocks
+====================== ======================== ========= ===============
-->write_begin(), ->write_end() and ->readpage() may be called from
+->write_begin(), ->write_end() and ->read_folio() may be called from
the request handler (/dev/loop).
-->readpage() unlocks the page, either synchronously or via I/O
+->read_folio() unlocks the folio, either synchronously or via I/O
completion.
-->readahead() unlocks the pages that I/O is attempted on like ->readpage().
-
-->readpages() populates the pagecache with the passed pages and starts
-I/O against them. They come unlocked upon I/O completion.
-
-->writepage() is used for two purposes: for "memory cleansing" and for
-"sync". These are quite different operations and the behaviour may differ
-depending upon the mode.
-
-If writepage is called for sync (wbc->sync_mode != WBC_SYNC_NONE) then
-it *must* start I/O against the page, even if that would involve
-blocking on in-progress I/O.
-
-If writepage is called for memory cleansing (sync_mode ==
-WBC_SYNC_NONE) then its role is to get as much writeout underway as
-possible. So writepage should try to avoid blocking against
-currently-in-progress I/O.
-
-If the filesystem is not called for "sync" and it determines that it
-would need to block against in-progress I/O to be able to start new I/O
-against the page the filesystem should redirty the page with
-redirty_page_for_writepage(), then unlock the page and return zero.
-This may also be done to avoid internal deadlocks, but rarely.
-
-If the filesystem is called for sync then it must wait on any
-in-progress I/O and then start new I/O.
-
-The filesystem should unlock the page synchronously, before returning to the
-caller, unless ->writepage() returns special WRITEPAGE_ACTIVATE
-value. WRITEPAGE_ACTIVATE means that page cannot really be written out
-currently, and VM should stop calling ->writepage() on this page for some
-time. VM does this by moving page to the head of the active list, hence the
-name.
-
-Unless the filesystem is going to redirty_page_for_writepage(), unlock the page
-and return zero, writepage *must* run set_page_writeback() against the page,
-followed by unlocking it. Once set_page_writeback() has been run against the
-page, write I/O can be submitted and the write I/O completion handler must run
-end_page_writeback() once the I/O is complete. If no I/O is submitted, the
-filesystem must run end_page_writeback() against the page before returning from
-writepage.
-
-That is: after 2.5.12, pages which are under writeout are *not* locked. Note,
-if the filesystem needs the page to be locked during writeout, that is ok, too,
-the page is allowed to be unlocked at any point in time between the calls to
-set_page_writeback() and end_page_writeback().
-
-Note, failure to run either redirty_page_for_writepage() or the combination of
-set_page_writeback()/end_page_writeback() on a page submitted to writepage
-will leave the page itself marked clean but it will be tagged as dirty in the
-radix tree. This incoherency can lead to all sorts of hard-to-debug problems
-in the filesystem like having dirty inodes at umount and losing written data.
+->readahead() unlocks the folios that I/O is attempted on like ->read_folio().
->writepages() is used for periodic writeback and for syscall-initiated
sync operations. The address_space should start I/O against at least
@@ -357,46 +314,60 @@ which is written. The address_space implementation may write more (or less)
pages than ``*nr_to_write`` asks for, but it should try to be reasonably close.
If nr_to_write is NULL, all dirty pages must be written.
-writepages should _only_ write pages which are present on
-mapping->io_pages.
+writepages should _only_ write pages which are present in
+mapping->i_pages.
-->set_page_dirty() is called from various places in the kernel
-when the target page is marked as needing writeback. It may be called
-under spinlock (it cannot block) and is sometimes called with the page
-not locked.
+->dirty_folio() is called from various places in the kernel when
+the target folio is marked as needing writeback. The folio cannot be
+truncated because either the caller holds the folio lock, or the caller
+has found the folio while holding the page table lock which will block
+truncation.
->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some
filesystems and by the swapper. The latter will eventually go away. Please,
keep it that way and don't breed new callers.
-->invalidatepage() is called when the filesystem must attempt to drop
+->invalidate_folio() is called when the filesystem must attempt to drop
some or all of the buffers from the page when it is being truncated. It
-returns zero on success. If ->invalidatepage is zero, the kernel uses
-block_invalidatepage() instead.
-
-->releasepage() is called when the kernel is about to try to drop the
-buffers from the page in preparation for freeing it. It returns zero to
-indicate that the buffers are (or may be) freeable. If ->releasepage is zero,
-the kernel assumes that the fs has no private interest in the buffers.
-
-->freepage() is called when the kernel is done dropping the page
+returns zero on success. The filesystem must exclusively acquire
+invalidate_lock before invalidating page cache in truncate / hole punch
+path (and thus calling into ->invalidate_folio) to block races between page
+cache invalidation and page cache filling functions (fault, read, ...).
+
+->release_folio() is called when the MM wants to make a change to the
+folio that would invalidate the filesystem's private data. For example,
+it may be about to be removed from the address_space or split. The folio
+is locked and not under writeback. It may be dirty. The gfp parameter
+is not usually used for allocation, but rather to indicate what the
+filesystem may do to attempt to free the private data. The filesystem may
+return false to indicate that the folio's private data cannot be freed.
+If it returns true, it should have already removed the private data from
+the folio. If a filesystem does not provide a ->release_folio method,
+the pagecache will assume that private data is buffer_heads and call
+try_to_free_buffers().
+
+->free_folio() is called when the kernel has dropped the folio
from the page cache.
-->launder_page() may be called prior to releasing a page if
-it is still found to be dirty. It returns zero if the page was successfully
-cleaned, or an error value if not. Note that in order to prevent the page
+->launder_folio() may be called prior to releasing a folio if
+it is still found to be dirty. It returns zero if the folio was successfully
+cleaned, or an error value if not. Note that in order to prevent the folio
getting mapped back in and redirtied, it needs to be kept locked
across the entire operation.
-->swap_activate will be called with a non-zero argument on
-files backing (non block device backed) swapfiles. A return value
-of zero indicates success, in which case this file can be used for
-backing swapspace. The swapspace operations will be proxied to the
-address space operations.
+->swap_activate() will be called to prepare the given file for swap. It
+should perform any validation and preparation necessary to ensure that
+writes can be performed with minimal memory allocation. It should call
+add_swap_extent(), or the helper iomap_swapfile_activate(), and return
+the number of extents added. If IO should be submitted through
+->swap_rw(), it should set SWP_FS_OPS, otherwise IO will be submitted
+directly to the block device ``sis->bdev``.
->swap_deactivate() will be called in the sys_swapoff()
path after ->swap_activate() returned success.
+->swap_rw will be called for swap IO if SWP_FS_OPS was set by ->swap_activate().
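+
+As an illustration, a minimal ->swap_activate() built on the iomap helper
+might look like the following sketch (``myfs_iomap_ops`` is a hypothetical
+iomap operations table; a filesystem that cannot map swap directly to the
+block device would instead set SWP_FS_OPS and call add_swap_extent()
+itself)::
+
+	static int myfs_swap_activate(struct swap_info_struct *sis,
+				      struct file *file, sector_t *span)
+	{
+		/* Walk the file's extents and register them with the swap
+		 * code; IO will then go directly to sis->bdev.  Returns
+		 * the number of extents added. */
+		return iomap_swapfile_activate(sis, file, span, &myfs_iomap_ops);
+	}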
+
file_lock_operations
====================
@@ -430,18 +401,22 @@ prototypes::
void (*lm_break)(struct file_lock *); /* break_lease callback */
int (*lm_change)(struct file_lock **, int);
bool (*lm_breaker_owns_lease)(struct file_lock *);
+ bool (*lm_lock_expirable)(struct file_lock *);
+ void (*lm_expire_lock)(void);
locking rules:
-========== ============= ================= =========
-ops inode->i_lock blocked_lock_lock may block
-========== ============= ================= =========
-lm_notify: yes yes no
+====================== ============= ================= =========
+ops flc_lock blocked_lock_lock may block
+====================== ============= ================= =========
+lm_notify: no yes no
lm_grant: no no no
lm_break: yes no no
lm_change yes no no
-lm_breaker_owns_lease: no no no
-========== ============= ================= =========
+lm_breaker_owns_lease: yes no no
+lm_lock_expirable yes no no
+lm_expire_lock no no yes
+====================== ============= ================= =========
buffer_head
===========
@@ -467,32 +442,25 @@ prototypes::
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*direct_access) (struct block_device *, sector_t, void **,
unsigned long *);
- int (*media_changed) (struct gendisk *);
void (*unlock_native_capacity) (struct gendisk *);
- int (*revalidate_disk) (struct gendisk *);
int (*getgeo)(struct block_device *, struct hd_geometry *);
void (*swap_slot_free_notify) (struct block_device *, unsigned long);
locking rules:
======================= ===================
-ops bd_mutex
+ops open_mutex
======================= ===================
open: yes
release: yes
ioctl: no
compat_ioctl: no
direct_access: no
-media_changed: no
unlock_native_capacity: no
-revalidate_disk: no
getgeo: no
swap_slot_free_notify: no (see below)
======================= ===================
-media_changed, unlock_native_capacity and revalidate_disk are called only from
-check_disk_change().
-
swap_slot_free_notify is called with swap_lock and sometimes the page lock
held.
@@ -507,7 +475,7 @@ prototypes::
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
- int (*iterate) (struct file *, struct dir_context *);
+ int (*iopoll) (struct kiocb *kiocb, bool spin);
int (*iterate_shared) (struct file *, struct dir_context *);
__poll_t (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
@@ -519,14 +487,6 @@ prototypes::
int (*fsync) (struct file *, loff_t start, loff_t end, int datasync);
int (*fasync) (int, struct file *, int);
int (*lock) (struct file *, int, struct file_lock *);
- ssize_t (*readv) (struct file *, const struct iovec *, unsigned long,
- loff_t *);
- ssize_t (*writev) (struct file *, const struct iovec *, unsigned long,
- loff_t *);
- ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t,
- void __user *);
- ssize_t (*sendpage) (struct file *, struct page *, int, size_t,
- loff_t *, int);
unsigned long (*get_unmapped_area)(struct file *, unsigned long,
unsigned long, unsigned long, unsigned long);
int (*check_flags)(int);
@@ -537,6 +497,14 @@ prototypes::
size_t, unsigned int);
int (*setlease)(struct file *, long, struct file_lock **, void **);
long (*fallocate)(struct file *, int, loff_t, loff_t);
+ void (*show_fdinfo)(struct seq_file *m, struct file *f);
+ unsigned (*mmap_capabilities)(struct file *);
+ ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
+ loff_t, size_t, unsigned int);
+ loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ loff_t len, unsigned int remap_flags);
+ int (*fadvise)(struct file *, loff_t, loff_t, int);
locking rules:
All may block.
@@ -549,9 +517,8 @@ mutex or just to use i_size_read() instead.
Note: this does not protect the file->f_pos against concurrent modifications
since this is something the userspace has to take care about.
-->iterate() is called with i_rwsem exclusive.
-
-->iterate_shared() is called with i_rwsem at least shared.
+->iterate_shared() is called with i_rwsem held for reading, and with the
+file's f_pos_lock held exclusively.
->fasync() is responsible for maintaining the FASYNC bit in filp->f_flags.
Most instances call fasync_helper(), which does that maintenance, so it's
@@ -571,6 +538,25 @@ in sys_read() and friends.
the lease within the individual filesystem to record the result of the
operation
+->fallocate implementation must be really careful to maintain page cache
+consistency when punching holes or performing other operations that invalidate
+page cache contents. Usually the filesystem needs to call
+truncate_inode_pages_range() to invalidate relevant range of the page cache.
+However the filesystem usually also needs to update its internal (and on disk)
+view of file offset -> disk block mapping. Until this update is finished, the
+filesystem needs to block page faults and reads from reloading now-stale page
+cache contents from the disk. Since VFS acquires mapping->invalidate_lock in
+shared mode when loading pages from disk (filemap_fault(), filemap_read(),
+readahead paths), the fallocate implementation must take the invalidate_lock to
+prevent reloading.
+
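+The expected pattern for a hole punch is roughly the following sketch
+(``myfs_remove_blocks()`` is a hypothetical helper that updates the file
+offset -> disk block mapping)::
+
+	static long myfs_punch_hole(struct inode *inode, loff_t off, loff_t len)
+	{
+		struct address_space *mapping = inode->i_mapping;
+		long error;
+
+		/* Block page faults and reads from refilling the range. */
+		filemap_invalidate_lock(mapping);
+		truncate_inode_pages_range(mapping, off, off + len - 1);
+		/* Update the offset -> block mapping while the stale data
+		 * cannot be reloaded from disk. */
+		error = myfs_remove_blocks(inode, off, len);
+		filemap_invalidate_unlock(mapping);
+		return error;
+	}
+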
+->copy_file_range and ->remap_file_range implementations need to serialize
+against modifications of file data while the operation is running. For
+blocking changes through write(2) and similar operations inode->i_rwsem can be
+used. To block changes to file contents via a memory mapping during the
+operation, the filesystem must take mapping->invalidate_lock to coordinate
+with ->page_mkwrite.
+
dquot_operations
================
@@ -607,50 +593,62 @@ vm_operations_struct
prototypes::
- void (*open)(struct vm_area_struct*);
- void (*close)(struct vm_area_struct*);
- vm_fault_t (*fault)(struct vm_area_struct*, struct vm_fault *);
+ void (*open)(struct vm_area_struct *);
+ void (*close)(struct vm_area_struct *);
+ vm_fault_t (*fault)(struct vm_fault *);
+ vm_fault_t (*huge_fault)(struct vm_fault *, unsigned int order);
+ vm_fault_t (*map_pages)(struct vm_fault *, pgoff_t start, pgoff_t end);
vm_fault_t (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
vm_fault_t (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *);
int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);
locking rules:
-============= ======== ===========================
+============= ========== ===========================
ops mmap_lock PageLocked(page)
-============= ======== ===========================
-open: yes
-close: yes
-fault: yes can return with page locked
-map_pages: yes
-page_mkwrite: yes can return with page locked
-pfn_mkwrite: yes
-access: yes
-============= ======== ===========================
-
-->fault() is called when a previously not present pte is about
-to be faulted in. The filesystem must find and return the page associated
-with the passed in "pgoff" in the vm_fault structure. If it is possible that
-the page may be truncated and/or invalidated, then the filesystem must lock
-the page, then ensure it is not already truncated (the page lock will block
+============= ========== ===========================
+open: write
+close: read/write
+fault: read can return with page locked
+huge_fault: maybe-read
+map_pages: maybe-read
+page_mkwrite: read can return with page locked
+pfn_mkwrite: read
+access: read
+============= ========== ===========================
+
+->fault() is called when a previously not present pte is about to be faulted
+in. The filesystem must find and return the page associated with the passed in
+"pgoff" in the vm_fault structure. If it is possible that the page may be
+truncated and/or invalidated, then the filesystem must lock invalidate_lock,
+then ensure the page is not already truncated (invalidate_lock will block
subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
locked. The VM will unlock the page.
+->huge_fault() is called when there is no PUD or PMD entry present. This
+gives the filesystem the opportunity to install a PUD or PMD sized page.
+Filesystems can also use the ->fault method to return a PMD sized page,
+so implementing this function may not be necessary. In particular,
+filesystems should not call filemap_fault() from ->huge_fault().
+The mmap_lock may not be held when this method is called.
+
->map_pages() is called when VM asks to map easy accessible pages.
Filesystem should find and map pages associated with offsets from "start_pgoff"
-till "end_pgoff". ->map_pages() is called with page table locked and must
+till "end_pgoff". ->map_pages() is called with the RCU lock held and must
not block. If it's not possible to reach a page without blocking,
-filesystem should skip it. Filesystem should use do_set_pte() to setup
+filesystem should skip it. Filesystem should use set_pte_range() to setup
page table entry. Pointer to entry associated with the page is passed in
"pte" field in vm_fault structure. Pointers to entries for other offsets
should be calculated relative to "pte".
-->page_mkwrite() is called when a previously read-only pte is
-about to become writeable. The filesystem again must ensure that there are
-no truncate/invalidate races, and then return with the page locked. If
-the page has been truncated, the filesystem should not look up a new page
-like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
-will cause the VM to retry the fault.
+->page_mkwrite() is called when a previously read-only pte is about to become
+writeable. The filesystem again must ensure that there are no
+truncate/invalidate races or races with operations such as ->remap_file_range
+or ->copy_file_range, and then return with the page locked. Usually
+mapping->invalidate_lock is suitable for proper serialization. If the page has
+been truncated, the filesystem should not look up a new page like the ->fault()
+handler, but simply return with VM_FAULT_NOPAGE, which will cause the VM to
+retry the fault.
->pfn_mkwrite() is the same as page_mkwrite but when the pte is
VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is
diff --git a/Documentation/filesystems/locks.rst b/Documentation/filesystems/locks.rst
index c5ae858b1aac..26429317dbbc 100644
--- a/Documentation/filesystems/locks.rst
+++ b/Documentation/filesystems/locks.rst
@@ -57,16 +57,9 @@ fcntl(), with all the problems that implies.
1.3 Mandatory Locking As A Mount Option
---------------------------------------
-Mandatory locking, as described in
-'Documentation/filesystems/mandatory-locking.rst' was prior to this release a
-general configuration option that was valid for all mounted filesystems. This
-had a number of inherent dangers, not the least of which was the ability to
-freeze an NFS server by asking it to read a file for which a mandatory lock
-existed.
-
-From this release of the kernel, mandatory locking can be turned on and off
-on a per-filesystem basis, using the mount options 'mand' and 'nomand'.
-The default is to disallow mandatory locking. The intention is that
-mandatory locking only be enabled on a local filesystem as the specific need
-arises.
+Mandatory locking was prior to this release a general configuration option
+that was valid for all mounted filesystems. This had a number of inherent
+dangers, not the least of which was the ability to freeze an NFS server by
+asking it to read a file for which a mandatory lock existed.
+This option was dropped in kernel v5.14.
diff --git a/Documentation/filesystems/mandatory-locking.rst b/Documentation/filesystems/mandatory-locking.rst
deleted file mode 100644
index 9ce73544a8f0..000000000000
--- a/Documentation/filesystems/mandatory-locking.rst
+++ /dev/null
@@ -1,188 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=====================================================
-Mandatory File Locking For The Linux Operating System
-=====================================================
-
- Andy Walker <andy@lysaker.kvaerner.no>
-
- 15 April 1996
-
- (Updated September 2007)
-
-0. Why you should avoid mandatory locking
------------------------------------------
-
-The Linux implementation is prey to a number of difficult-to-fix race
-conditions which in practice make it not dependable:
-
- - The write system call checks for a mandatory lock only once
- at its start. It is therefore possible for a lock request to
- be granted after this check but before the data is modified.
- A process may then see file data change even while a mandatory
- lock was held.
- - Similarly, an exclusive lock may be granted on a file after
- the kernel has decided to proceed with a read, but before the
- read has actually completed, and the reading process may see
- the file data in a state which should not have been visible
- to it.
- - Similar races make the claimed mutual exclusion between lock
- and mmap similarly unreliable.
-
-1. What is mandatory locking?
-------------------------------
-
-Mandatory locking is kernel enforced file locking, as opposed to the more usual
-cooperative file locking used to guarantee sequential access to files among
-processes. File locks are applied using the flock() and fcntl() system calls
-(and the lockf() library routine which is a wrapper around fcntl().) It is
-normally a process' responsibility to check for locks on a file it wishes to
-update, before applying its own lock, updating the file and unlocking it again.
-The most commonly used example of this (and in the case of sendmail, the most
-troublesome) is access to a user's mailbox. The mail user agent and the mail
-transfer agent must guard against updating the mailbox at the same time, and
-prevent reading the mailbox while it is being updated.
-
-In a perfect world all processes would use and honour a cooperative, or
-"advisory" locking scheme. However, the world isn't perfect, and there's
-a lot of poorly written code out there.
-
-In trying to address this problem, the designers of System V UNIX came up
-with a "mandatory" locking scheme, whereby the operating system kernel would
-block attempts by a process to write to a file that another process holds a
-"read" -or- "shared" lock on, and block attempts to both read and write to a
-file that a process holds a "write " -or- "exclusive" lock on.
-
-The System V mandatory locking scheme was intended to have as little impact as
-possible on existing user code. The scheme is based on marking individual files
-as candidates for mandatory locking, and using the existing fcntl()/lockf()
-interface for applying locks just as if they were normal, advisory locks.
-
-.. Note::
-
- 1. In saying "file" in the paragraphs above I am actually not telling
- the whole truth. System V locking is based on fcntl(). The granularity of
- fcntl() is such that it allows the locking of byte ranges in files, in
- addition to entire files, so the mandatory locking rules also have byte
- level granularity.
-
- 2. POSIX.1 does not specify any scheme for mandatory locking, despite
- borrowing the fcntl() locking scheme from System V. The mandatory locking
- scheme is defined by the System V Interface Definition (SVID) Version 3.
-
-2. Marking a file for mandatory locking
----------------------------------------
-
-A file is marked as a candidate for mandatory locking by setting the group-id
-bit in its file mode but removing the group-execute bit. This is an otherwise
-meaningless combination, and was chosen by the System V implementors so as not
-to break existing user programs.
-
-Note that the group-id bit is usually automatically cleared by the kernel when
-a setgid file is written to. This is a security measure. The kernel has been
-modified to recognize the special case of a mandatory lock candidate and to
-refrain from clearing this bit. Similarly the kernel has been modified not
-to run mandatory lock candidates with setgid privileges.
-
-3. Available implementations
-----------------------------
-
-I have considered the implementations of mandatory locking available with
-SunOS 4.1.x, Solaris 2.x and HP-UX 9.x.
-
-Generally I have tried to make the most sense out of the behaviour exhibited
-by these three reference systems. There are many anomalies.
-
-All the reference systems reject all calls to open() for a file on which
-another process has outstanding mandatory locks. This is in direct
-contravention of SVID 3, which states that only calls to open() with the
-O_TRUNC flag set should be rejected. The Linux implementation follows the SVID
-definition, which is the "Right Thing", since only calls with O_TRUNC can
-modify the contents of the file.
-
-HP-UX even disallows open() with O_TRUNC for a file with advisory locks, not
-just mandatory locks. That would appear to contravene POSIX.1.
-
-mmap() is another interesting case. All the operating systems mentioned
-prevent mandatory locks from being applied to an mmap()'ed file, but HP-UX
-also disallows advisory locks for such a file. SVID actually specifies the
-paranoid HP-UX behaviour.
-
-In my opinion only MAP_SHARED mappings should be immune from locking, and then
-only from mandatory locks - that is what is currently implemented.
-
-SunOS is so hopeless that it doesn't even honour the O_NONBLOCK flag for
-mandatory locks, so reads and writes to locked files always block when they
-should return EAGAIN.
-
-I'm afraid that this is such an esoteric area that the semantics described
-below are just as valid as any others, so long as the main points seem to
-agree.
-
-4. Semantics
-------------
-
-1. Mandatory locks can only be applied via the fcntl()/lockf() locking
- interface - in other words the System V/POSIX interface. BSD style
- locks using flock() never result in a mandatory lock.
-
-2. If a process has locked a region of a file with a mandatory read lock, then
- other processes are permitted to read from that region. If any of these
- processes attempts to write to the region it will block until the lock is
- released, unless the process has opened the file with the O_NONBLOCK
- flag in which case the system call will return immediately with the error
- status EAGAIN.
-
-3. If a process has locked a region of a file with a mandatory write lock, all
- attempts to read or write to that region block until the lock is released,
- unless a process has opened the file with the O_NONBLOCK flag in which case
- the system call will return immediately with the error status EAGAIN.
-
-4. Calls to open() with O_TRUNC, or to creat(), on a existing file that has
- any mandatory locks owned by other processes will be rejected with the
- error status EAGAIN.
-
-5. Attempts to apply a mandatory lock to a file that is memory mapped and
- shared (via mmap() with MAP_SHARED) will be rejected with the error status
- EAGAIN.
-
-6. Attempts to create a shared memory map of a file (via mmap() with MAP_SHARED)
- that has any mandatory locks in effect will be rejected with the error status
- EAGAIN.
-
-5. Which system calls are affected?
------------------------------------
-
-Those which modify a file's contents, not just the inode. That gives read(),
-write(), readv(), writev(), open(), creat(), mmap(), truncate() and
-ftruncate(). truncate() and ftruncate() are considered to be "write" actions
-for the purposes of mandatory locking.
-
-The affected region is usually defined as stretching from the current position
-for the total number of bytes read or written. For the truncate calls it is
-defined as the bytes of a file removed or added (we must also consider bytes
-added, as a lock can specify just "the whole file", rather than a specific
-range of bytes.)
-
-Note 3: I may have overlooked some system calls that need mandatory lock
-checking in my eagerness to get this code out the door. Please let me know, or
-better still fix the system calls yourself and submit a patch to me or Linus.
-
-6. Warning!
------------
-
-Not even root can override a mandatory lock, so runaway processes can wreak
-havoc if they lock crucial files. The way around it is to change the file
-permissions (remove the setgid bit) before trying to read or write to it.
-Of course, that might be a bit tricky if the system is hung :-(
-
-7. The "mand" mount option
---------------------------
-Mandatory locking is disabled on all filesystems by default, and must be
-administratively enabled by mounting with "-o mand". That mount option
-is only allowed if the mounting task has the CAP_SYS_ADMIN capability.
-
-Since kernel v4.5, it is possible to disable mandatory locking
-altogether by setting CONFIG_MANDATORY_FILE_LOCKING to "n". A kernel
-with this disabled will reject attempts to mount filesystems with the
-"mand" mount option with the error status EPERM.
diff --git a/Documentation/filesystems/mount_api.rst b/Documentation/filesystems/mount_api.rst
index dea22d64f060..e149b89118c8 100644
--- a/Documentation/filesystems/mount_api.rst
+++ b/Documentation/filesystems/mount_api.rst
@@ -1,7 +1,7 @@
.. SPDX-License-Identifier: GPL-2.0
====================
-fILESYSTEM Mount API
+Filesystem Mount API
====================
.. CONTENTS
@@ -79,7 +79,6 @@ context. This is represented by the fs_context structure::
unsigned int sb_flags;
unsigned int sb_flags_mask;
unsigned int s_iflags;
- unsigned int lsm_flags;
enum fs_context_purpose purpose:8;
...
};
@@ -213,7 +212,7 @@ The filesystem context points to a table of operations::
void (*free)(struct fs_context *fc);
int (*dup)(struct fs_context *fc, struct fs_context *src_fc);
int (*parse_param)(struct fs_context *fc,
- struct struct fs_parameter *param);
+ struct fs_parameter *param);
int (*parse_monolithic)(struct fs_context *fc, void *data);
int (*get_tree)(struct fs_context *fc);
int (*reconfigure)(struct fs_context *fc);
@@ -247,7 +246,7 @@ manage the filesystem context. They are as follows:
* ::
int (*parse_param)(struct fs_context *fc,
- struct struct fs_parameter *param);
+ struct fs_parameter *param);
Called when a parameter is being added to the filesystem context. param
points to the key name and maybe a value object. VFS-specific options
@@ -479,7 +478,7 @@ returned.
int vfs_parse_fs_param(struct fs_context *fc,
struct fs_parameter *param);
- Supply a single mount parameter to the filesystem context. This include
+ Supply a single mount parameter to the filesystem context. This includes
the specification of the source/device which is specified as the "source"
parameter (which may be specified multiple times if the filesystem
supports that).
@@ -562,17 +561,6 @@ or looking up of superblocks.
The following helpers all wrap sget_fc():
- * ::
-
- int vfs_get_super(struct fs_context *fc,
- enum vfs_get_super_keying keying,
- int (*fill_super)(struct super_block *sb,
- struct fs_context *fc))
-
- This creates/looks up a deviceless superblock. The keying indicates how
- many superblocks of this type may exist and in what manner they may be
- shared:
-
(1) vfs_get_single_super
Only one such superblock may exist in the system. Any further
@@ -592,8 +580,7 @@ The following helpers all wrap sget_fc():
one.
-=====================
-PARAMETER DESCRIPTION
+Parameter Description
=====================
Parameters are described using structures defined in linux/fs_parser.h.
@@ -658,6 +645,8 @@ The members are as follows:
fs_param_is_blockdev Blockdev path * Needs lookup
fs_param_is_path Path * Needs lookup
fs_param_is_fd File descriptor result->int_32
+ fs_param_is_uid User ID (u32) result->uid
+ fs_param_is_gid Group ID (u32) result->gid
======================= ======================= =====================
Note that if the value is of fs_param_is_bool type, fs_parse() will try
@@ -682,7 +671,6 @@ The members are as follows:
fsparam_bool() fs_param_is_bool
fsparam_u32() fs_param_is_u32
fsparam_u32oct() fs_param_is_u32_octal
- fsparam_u32hex() fs_param_is_u32_hex
fsparam_s32() fs_param_is_s32
fsparam_u64() fs_param_is_u64
fsparam_enum() fs_param_is_enum
@@ -691,6 +679,8 @@ The members are as follows:
fsparam_bdev() fs_param_is_blockdev
fsparam_path() fs_param_is_path
fsparam_fd() fs_param_is_fd
+ fsparam_uid() fs_param_is_uid
+ fsparam_gid() fs_param_is_gid
======================= ===============================================
all of which take two arguments, name string and option number - for
@@ -764,26 +754,12 @@ process the parameters it is given.
* ::
- bool validate_constant_table(const struct constant_table *tbl,
- size_t tbl_size,
- int low, int high, int special);
-
- Validate a constant table. Checks that all the elements are appropriately
- ordered, that there are no duplicates and that the values are between low
- and high inclusive, though provision is made for one allowable special
- value outside of that range. If no special value is required, special
- should just be set to lie inside the low-to-high range.
-
- If all is good, true is returned. If the table is invalid, errors are
- logged to dmesg and false is returned.
-
- * ::
-
- bool fs_validate_description(const struct fs_parameter_description *desc);
+ bool fs_validate_description(const char *name,
+ const struct fs_parameter_description *desc);
This performs some validation checks on a parameter description. It
returns true if the description is good and false if it is not. It will
- log errors to dmesg if validation fails.
+ log errors to the kernel log buffer if validation fails.
* ::
@@ -797,8 +773,9 @@ process the parameters it is given.
option number (which it returns).
If successful, and if the parameter type indicates the result is a
- boolean, integer or enum type, the value is converted by this function and
- the result stored in result->{boolean,int_32,uint_32,uint_64}.
+ boolean, integer, enum, uid, or gid type, the value is converted by this
+ function and the result stored in
+ result->{boolean,int_32,uint_32,uint_64,uid,gid}.
If a match isn't initially made, the key is prefixed with "no" and no
value is present, then an attempt will be made to look up the key with the
@@ -815,6 +792,7 @@ process the parameters it is given.
int fs_lookup_param(struct fs_context *fc,
struct fs_parameter *value,
bool want_bdev,
+ unsigned int flags,
struct path *_path);
This takes a parameter that carries a string or filename type and attempts
diff --git a/Documentation/filesystems/multigrain-ts.rst b/Documentation/filesystems/multigrain-ts.rst
new file mode 100644
index 000000000000..c779e47284e8
--- /dev/null
+++ b/Documentation/filesystems/multigrain-ts.rst
@@ -0,0 +1,125 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Multigrain Timestamps
+=====================
+
+Introduction
+============
+Historically, the kernel has always used coarse time values to stamp inodes.
+This value is updated every jiffy, so any change that happens within that jiffy
+will end up with the same timestamp.
+
+When the kernel goes to stamp an inode (due to a read or write), it first gets
+the current time and then compares it to the existing timestamp(s) to see
+whether anything will change. If nothing changed, then it can avoid updating
+the inode's metadata.
+
+Coarse timestamps are therefore good from a performance standpoint, since they
+reduce the need for metadata updates, but bad from the standpoint of
+determining whether anything has changed, since a lot of things can happen in a
+jiffy.
+
+They are particularly troublesome with NFSv3, where unchanging timestamps can
+make it difficult to tell whether to invalidate caches. NFSv4 provides a
+dedicated change attribute that should always show a visible change, but not
+all filesystems implement this properly, causing the NFS server to substitute
+the ctime in many cases.
+
+Multigrain timestamps aim to remedy this by selectively using fine-grained
+timestamps when a file has had its timestamps queried recently, and the current
+coarse-grained time does not cause a change.
+
+Inode Timestamps
+================
+There are currently 3 timestamps in the inode that are updated to the current
+wallclock time on different activity:
+
+ctime:
+ The inode change time. This is stamped with the current time whenever
+ the inode's metadata is changed. Note that this value is not settable
+ from userland.
+
+mtime:
+ The inode modification time. This is stamped with the current time
+ any time a file's contents change.
+
+atime:
+ The inode access time. This is stamped whenever an inode's contents are
+ read. Widely considered to be a terrible mistake. Usually avoided with
+ options like noatime or relatime.
+
+Updating the mtime always implies a change to the ctime, but updating the
+atime due to a read request does not.
+
+Multigrain timestamps are only tracked for the ctime and the mtime. atimes are
+not affected and always use the coarse-grained value (subject to the floor).
+
+Inode Timestamp Ordering
+========================
+
+In addition to just providing info about changes to individual files, file
+timestamps also serve an important purpose in applications like "make". These
+programs measure timestamps in order to determine whether source files might be
+newer than cached objects.
+
+Userland applications like make can only determine ordering based on
+operational boundaries. For a syscall those are the syscall entry and exit
+points. For io_uring or nfsd operations, that's the request submission and
+response. In the case of concurrent operations, userland can make no
+determination about the order in which things will occur.
+
+For instance, if a single thread modifies one file, and then another file in
+sequence, the second file must show an equal or later mtime than the first. The
+same is true if two threads are issuing similar operations that do not overlap
+in time.
+
+If however, two threads have racing syscalls that overlap in time, then there
+is no such guarantee, and the second file may appear to have been modified
+before, after or at the same time as the first, regardless of which one was
+submitted first.
+
+Note that the above assumes that the system doesn't experience a backward jump
+of the realtime clock. If that occurs at an inopportune time, then timestamps
+can appear to go backward, even on a properly functioning system.
+
+Multigrain Timestamp Implementation
+===================================
+Multigrain timestamps are aimed at ensuring that changes to a single file are
+always recognizable, without violating the ordering guarantees when multiple
+different files are modified. This affects the mtime and the ctime, but the
+atime will always use coarse-grained timestamps.
+
+It uses an unused bit in the i_ctime_nsec field to indicate whether the mtime
+or ctime has been queried. If either or both have, then the kernel takes
+special care to ensure the next timestamp update will display a visible change.
+This ensures tight cache coherency for use-cases like NFS, without sacrificing
+the benefits of reduced metadata updates when files aren't being watched.
+
+The Ctime Floor Value
+=====================
+It's not sufficient to simply use fine or coarse-grained timestamps based on
+whether the mtime or ctime has been queried. A file could get a fine grained
+timestamp, and then a second file modified later could get a coarse-grained one
+that appears earlier than the first, which would break the kernel's timestamp
+ordering guarantees.
+
+To mitigate this problem, the kernel maintains a global floor value that ensures
+this can't happen. The two files in the above example may appear to have been
+modified at the same time in such a case, but they will never show the reverse
+order. To avoid problems with realtime clock jumps, the floor is managed as a
+monotonic ktime_t, and the values are converted to realtime clock values as
+needed.
+
+Implementation Notes
+====================
+Multigrain timestamps are intended for use by local filesystems that get
+ctime values from the local clock. This is in contrast to network filesystems
+and the like that just mirror timestamp values from a server.
+
+For most filesystems, it's sufficient to just set the FS_MGTIME flag in the
+fstype->fs_flags in order to opt-in, providing the ctime is only ever set via
+inode_set_ctime_current(). If the filesystem has a ->getattr routine that
+doesn't call generic_fillattr, then it should call fill_mg_cmtime() to
+fill those values. For setattr, it should use setattr_copy() to update the
+timestamps, or otherwise mimic its behavior.
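+
+As a minimal sketch, opting in might look like this (``myfs`` is
+hypothetical)::
+
+	static struct file_system_type myfs_fs_type = {
+		.owner		= THIS_MODULE,
+		.name		= "myfs",
+		/* Opt in to multigrain ctime/mtime handling. */
+		.fs_flags	= FS_MGTIME,
+		...
+	};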
diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst
new file mode 100644
index 000000000000..ddd799df6ce3
--- /dev/null
+++ b/Documentation/filesystems/netfs_library.rst
@@ -0,0 +1,1051 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================
+Network Filesystem Services Library
+===================================
+
+.. Contents:
+
+ - Overview.
+ - Requests and streams.
+ - Subrequests.
+ - Result collection and retry.
+ - Local caching.
+ - Content encryption (fscrypt).
+ - Per-inode context.
+ - Inode context helper functions.
+ - Inode locking.
+ - Inode writeback.
+ - High-level VFS API.
+ - Unlocked read/write iter.
+ - Pre-locked read/write iter.
+ - Monolithic files API.
+ - Memory-mapped I/O API.
+ - High-level VM API.
+ - Deprecated PG_private_2 API.
+ - I/O request API.
+ - Request structure.
+ - Stream structure.
+ - Subrequest structure.
+ - Filesystem methods.
+ - Terminating a subrequest.
+ - Local cache API.
+ - API function reference.
+
+
+Overview
+========
+
+The network filesystem services library, netfslib, is a set of functions
+designed to aid a network filesystem in implementing VM/VFS API operations. It
+takes over the normal buffered read, readahead, write and writeback and also
+handles unbuffered and direct I/O.
+
+The library provides support for (re-)negotiation of I/O sizes and retrying
+failed I/O as well as local caching and will, in the future, provide content
+encryption.
+
+It insulates the filesystem from VM interface changes as much as possible and
+handles VM features such as large multipage folios. The filesystem basically
+just has to provide a way to perform read and write RPC calls.
+
+The way I/O is organised inside netfslib consists of a number of objects:
+
+ * A *request*. A request is used to track the progress of the I/O overall and
+ to hold on to resources. The collection of results is done at the request
+ level. The I/O within a request is divided into a number of parallel
+ streams of subrequests.
+
+ * A *stream*. A non-overlapping series of subrequests. The subrequests
+ within a stream do not have to be contiguous.
+
+ * A *subrequest*. This is the basic unit of I/O. It represents a single RPC
+ call or a single cache I/O operation. The library passes these to the
+ filesystem and the cache to perform.
+
+Requests and Streams
+--------------------
+
+When actually performing I/O (as opposed to just copying into the pagecache),
+netfslib will create one or more requests to track the progress of the I/O and
+to hold resources.
+
+A read operation will have a single stream and the subrequests within that
+stream may be of mixed origins, for instance mixing RPC subrequests and cache
+subrequests.
+
+On the other hand, a write operation may have multiple streams, where each
+stream targets a different destination. For instance, there may be one stream
+writing to the local cache and one to the server. Currently, only two streams
+are allowed, but this could be increased if parallel writes to multiple
+servers are desired.
+
+The subrequests within a write stream do not need to match alignment or size
+with the subrequests in another write stream and netfslib performs the tiling
+of subrequests in each stream over the source buffer independently. Further,
+each stream may contain holes that don't correspond to holes in the other
+stream.
+
+In addition, the subrequests do not need to correspond to the boundaries of the
+folios or vectors in the source/destination buffer. The library handles the
+collection of results and the wrangling of folio flags and references.
+
+Subrequests
+-----------
+
+Subrequests are at the heart of the interaction between netfslib and the
+filesystem using it. Each subrequest is expected to correspond to a single
+read or write RPC or cache operation. The library will stitch together the
+results from a set of subrequests to provide a higher level operation.
+
+Netfslib has two interactions with the filesystem or the cache when setting up
+a subrequest. First, there's an optional preparatory step that allows the
+filesystem to negotiate the limits on the subrequest, both in terms of maximum
+number of bytes and maximum number of vectors (e.g. for RDMA). This may
+involve negotiating with the server (e.g. cifs needing to acquire credits).
+
+And, secondly, there's the issuing step in which the subrequest is handed off
+to the filesystem to perform.
+
+Note that these two steps are done slightly differently between read and write:
+
+ * For reads, the VM/VFS tells us how much is being requested up front, so the
+ library can preset maximum values that the cache and then the filesystem
+ can reduce. The cache also gets consulted first on whether it wants to do
+ a read before the filesystem is consulted.
+
+ * For writeback, it is unknown how much there will be to write until the
+ pagecache is walked, so no limit is set by the library.
+
+Once a subrequest is completed, the filesystem or cache informs the library of
+the completion and then collection is invoked. Depending on whether the
+request is synchronous or asynchronous, the collection of results will be done
+in either the application thread or in a work queue.
+
+Result Collection and Retry
+---------------------------
+
+As subrequests complete, the results are collected and collated by the library
+and folio unlocking is performed progressively (if appropriate). Once the
+request is complete, async completion will be invoked (again, if appropriate).
+The filesystem can also provide interim progress reports to the library to
+cause folio unlocking to happen earlier where possible.
+
+If any subrequests fail, netfslib can retry them. It will wait until all
+subrequests are completed, offer the filesystem the opportunity to fiddle with
+the resources/state held by the request and poke at the subrequests before
+re-preparing and re-issuing the subrequests.
+
+This allows the tiling of contiguous sets of failed subrequests within a stream
+to be changed, adding more subrequests or ditching excess as necessary (for
+instance, if the network sizes change or the server decides it wants smaller
+chunks).
+
+Further, if one or more contiguous cache-read subrequests fail, the library
+will pass them to the filesystem to perform instead, renegotiating and retiling
+them as necessary to fit with the filesystem's parameters rather than those of
+the cache.
+
+Local Caching
+-------------
+
+One of the services netfslib provides, via ``fscache``, is the option to cache
+on local disk a copy of the data obtained from/written to a network filesystem.
+The library will manage the storing, retrieval and some invalidation of data
+automatically on behalf of the filesystem if a cookie is attached to the
+``netfs_inode``.
+
+Note that local caching used to use the PG_private_2 flag (aliased as
+PG_fscache) to keep track of a page that was being written to the cache, but
+this is now deprecated as PG_private_2 will be removed.
+
+Instead, folios that are read from the server for which there was no data in
+the cache will be marked as dirty and will have ``folio->private`` set to a
+special value (``NETFS_FOLIO_COPY_TO_CACHE``) and left for writeback to write
+to the cache. If the folio is modified before that happens, the special value
+will be cleared and the folio will be treated as normally dirty.
+
+When writeback occurs, folios that are so marked will only be written to the
+cache and not to the server. Writeback handles mixed cache-only writes and
+server-and-cache writes by using two streams, sending one to the cache and one
+to the server. The server stream will have gaps in it corresponding to those
+folios.
+
+Content Encryption (fscrypt)
+----------------------------
+
+Though it does not do so yet, at some point netfslib will acquire the ability
+to do client-side content encryption on behalf of the network filesystem (Ceph,
+for example). fscrypt can be used for this if appropriate (it may not be -
+cifs, for example).
+
+The data will be stored encrypted in the local cache using the same manner of
+encryption as the data written to the server and the library will impose bounce
+buffering and RMW cycles as necessary.
+
+
+Per-Inode Context
+=================
+
+The network filesystem helper library needs a place to store a bit of state for
+its use on each netfs inode it is helping to manage. To this end, a context
+structure is defined::
+
+ struct netfs_inode {
+ struct inode inode;
+ const struct netfs_request_ops *ops;
+ struct fscache_cookie * cache;
+ loff_t remote_i_size;
+ unsigned long flags;
+ ...
+ };
+
+A network filesystem that wants to use netfslib must place one of these in its
+inode wrapper struct instead of the VFS ``struct inode``. This can be done in
+a way similar to the following::
+
+ struct my_inode {
+ struct netfs_inode netfs; /* Netfslib context and vfs inode */
+ ...
+ };
+
+This allows netfslib to find its state by using ``container_of()`` from the
+inode pointer, thereby allowing the netfslib helper functions to be pointed to
+directly by the VFS/VM operation tables.
+
+The structure contains the following fields that are of interest to the
+filesystem:
+
+ * ``inode``
+
+ The VFS inode structure.
+
+ * ``ops``
+
+ The set of operations provided by the network filesystem to netfslib.
+
+ * ``cache``
+
+ Local caching cookie, or NULL if no caching is enabled. This field does not
+ exist if fscache is disabled.
+
+ * ``remote_i_size``
+
+ The size of the file on the server. This differs from inode->i_size if
+ local modifications have been made but not yet written back.
+
+ * ``flags``
+
+ A set of flags, some of which the filesystem might be interested in:
+
+ * ``NETFS_ICTX_MODIFIED_ATTR``
+
+ Set if netfslib modifies mtime/ctime. The filesystem is free to ignore
+ this or clear it.
+
+ * ``NETFS_ICTX_UNBUFFERED``
+
+ Do unbuffered I/O upon the file. Like direct I/O but without the
+ alignment limitations. RMW will be performed if necessary. The pagecache
+ will not be used unless mmap() is also used.
+
+ * ``NETFS_ICTX_WRITETHROUGH``
+
+ Do writethrough caching upon the file. I/O will be set up and dispatched
+ as buffered writes are made to the page cache. mmap() does the normal
+ writeback thing.
+
+ * ``NETFS_ICTX_SINGLE_NO_UPLOAD``
+
+ Set if the file has a monolithic content that must be read entirely in a
+ single go and must not be written back to the server, though it can be
+ cached (e.g. AFS directories).
+
+Inode Context Helper Functions
+------------------------------
+
+To help deal with the per-inode context, a number of helper functions are
+provided. Firstly, a function to perform basic initialisation on a context and
+set the operations table pointer::
+
+ void netfs_inode_init(struct netfs_inode *ctx,
+ const struct netfs_request_ops *ops);
+
+then a function to cast from the VFS inode structure to the netfs context::
+
+ struct netfs_inode *netfs_inode(struct inode *inode);
+
+and finally, a function to get the cache cookie pointer from the context
+attached to an inode (or NULL if fscache is disabled)::
+
+ struct fscache_cookie *netfs_i_cookie(struct netfs_inode *ctx);
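+
+As a sketch, a filesystem might initialise the context when allocating an
+inode (``myfs_inode_cachep`` and ``myfs_request_ops`` are hypothetical)::
+
+	static struct inode *myfs_alloc_inode(struct super_block *sb)
+	{
+		struct my_inode *mi;
+
+		mi = alloc_inode_sb(sb, myfs_inode_cachep, GFP_KERNEL);
+		if (!mi)
+			return NULL;
+		/* Set up the embedded netfs context and operations table. */
+		netfs_inode_init(&mi->netfs, &myfs_request_ops);
+		return &mi->netfs.inode;
+	}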
+
+Inode Locking
+-------------
+
+A number of functions are provided to manage the locking of i_rwsem for I/O and
+to effectively extend it to provide more separate classes of exclusion::
+
+ int netfs_start_io_read(struct inode *inode);
+ void netfs_end_io_read(struct inode *inode);
+ int netfs_start_io_write(struct inode *inode);
+ void netfs_end_io_write(struct inode *inode);
+ int netfs_start_io_direct(struct inode *inode);
+ void netfs_end_io_direct(struct inode *inode);
+
+The exclusion breaks down into four separate classes:
+
+ 1) Buffered reads and writes.
+
+ Buffered reads can run concurrently with each other and with buffered writes,
+ but buffered writes cannot run concurrently with each other.
+
+ 2) Direct reads and writes.
+
+ Direct (and unbuffered) reads and writes can run concurrently since they do
+ not share local buffering (i.e. the pagecache) and, in a network
+ filesystem, are expected to have exclusion managed on the server (though
+ this may not be the case for, say, Ceph).
+
+ 3) Other major inode modifying operations (e.g. truncate, fallocate).
+
+ These should just access i_rwsem directly.
+
+ 4) mmap().
+
+ mmap'd accesses might operate concurrently with any of the other classes.
+ They might form the buffer for an intra-file loopback DIO read/write. They
+ might be permitted on unbuffered files.
+
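+For instance, a filesystem wrapping the unbuffered read helper might bracket
+the I/O like this sketch::
+
+	ssize_t ret;
+
+	ret = netfs_start_io_direct(inode);
+	if (ret < 0)
+		return ret;
+	ret = netfs_unbuffered_read_iter_locked(iocb, iter);
+	netfs_end_io_direct(inode);
+	return ret;
+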
+Inode Writeback
+---------------
+
+Netfslib will pin resources on an inode for future writeback (such as pinning
+use of an fscache cookie) when an inode is dirtied. However, this pinning
+needs careful management. To manage the pinning, the following sequence
+occurs:
+
+ 1) An inode state flag ``I_PINNING_NETFS_WB`` is set by netfslib when the
+ pinning begins (when a folio is dirtied, for example) if the cache is
+ active to stop the cache structures from being discarded and the cache
+ space from being culled. This also prevents re-getting of cache resources
+ if the flag is already set.
+
+ 2) This flag is then cleared inside the inode lock during inode writeback in the
+ VM - and the fact that it was set is transferred to ``->unpinned_netfs_wb``
+ in ``struct writeback_control``.
+
+ 3) If ``->unpinned_netfs_wb`` is now set, the write_inode procedure is forced.
+
+ 4) The filesystem's ``->write_inode()`` function is invoked to do the cleanup.
+
+ 5) The filesystem invokes netfs to do its cleanup.
+
+To do the cleanup, netfslib provides a function to do the resource unpinning::
+
+ int netfs_unpin_writeback(struct inode *inode, struct writeback_control *wbc);
+
+If the filesystem doesn't need to do anything else, this may be set as its
+``.write_inode`` method.
+
+Further, if an inode is deleted, the filesystem's write_inode method may not
+get called, so::
+
+ void netfs_clear_inode_writeback(struct inode *inode, const void *aux);
+
+must be called from ``->evict_inode()`` *before* ``clear_inode()`` is called.
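+
+A sketch of how these might be wired up (error handling elided; passing NULL
+as the auxiliary data is an assumption)::
+
+	static void myfs_evict_inode(struct inode *inode)
+	{
+		truncate_inode_pages_final(&inode->i_data);
+		/* Drop any writeback pin before the inode is cleared. */
+		netfs_clear_inode_writeback(inode, NULL);
+		clear_inode(inode);
+	}
+
+	static const struct super_operations myfs_super_ops = {
+		.write_inode	= netfs_unpin_writeback,
+		.evict_inode	= myfs_evict_inode,
+		...
+	};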
+
+
+High-Level VFS API
+==================
+
+Netfslib provides a number of sets of API calls for the filesystem to delegate
+VFS operations to. Netfslib, in turn, will call out to the filesystem and the
+cache to negotiate I/O sizes, issue RPCs and provide places for it to intervene
+at various times.
+
+Unlocked Read/Write Iter
+------------------------
+
+The first API set is for the delegation of operations to netfslib when the
+filesystem is called through the standard VFS read/write_iter methods::
+
+ ssize_t netfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter);
+ ssize_t netfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from);
+ ssize_t netfs_buffered_read_iter(struct kiocb *iocb, struct iov_iter *iter);
+ ssize_t netfs_unbuffered_read_iter(struct kiocb *iocb, struct iov_iter *iter);
+ ssize_t netfs_unbuffered_write_iter(struct kiocb *iocb, struct iov_iter *from);
+
+They can be assigned directly to ``.read_iter`` and ``.write_iter``. They
+perform the inode locking themselves and the first two will switch between
+buffered I/O and DIO as appropriate.
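+
+For example, a file operations table using these might look like this
+sketch::
+
+	const struct file_operations myfs_file_ops = {
+		.llseek		= generic_file_llseek,
+		.read_iter	= netfs_file_read_iter,
+		.write_iter	= netfs_file_write_iter,
+		...
+	};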
+
+Pre-Locked Read/Write Iter
+--------------------------
+
+The second API set is for the delegation of operations to netfslib when the
+filesystem is called through the standard VFS methods, but needs to do some
+other stuff before or after calling netfslib whilst still inside the locked
+section (e.g. Ceph negotiating caps). The unbuffered read function is::
+
+ ssize_t netfs_unbuffered_read_iter_locked(struct kiocb *iocb, struct iov_iter *iter);
+
+This must not be assigned directly to ``.read_iter`` and the filesystem is
+responsible for performing the inode locking before calling it. In the case of
+buffered read, the filesystem should use ``filemap_read()``.
+
+There are three functions for writes::
+
+ ssize_t netfs_buffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *from,
+ struct netfs_group *netfs_group);
+ ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
+ struct netfs_group *netfs_group);
+ ssize_t netfs_unbuffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *iter,
+ struct netfs_group *netfs_group);
+
+These must not be assigned directly to ``.write_iter`` and the filesystem is
+responsible for performing the inode locking before calling them.
+
+The first two functions are for buffered writes; the first just adds some
+standard write checks and jumps to the second, but if the filesystem wants to
+do the checks itself, it can use the second directly. The third function is
+for unbuffered or DIO writes.
+
+On all three write functions, there is a writeback group pointer (which should
+be NULL if the filesystem doesn't use this). Writeback groups are set on
+folios when they're modified. If a folio to-be-modified is already marked with
+a different group, it is flushed first. The writeback API allows writing back
+of a specific group.
+
+Memory-Mapped I/O API
+---------------------
+
+An API for support of mmap()'d I/O is provided::
+
+ vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_group);
+
+This allows the filesystem to delegate ``.page_mkwrite`` to netfslib. The
+filesystem should not take the inode lock before calling it, but, as with the
+locked write functions above, this does take a writeback group pointer. If the
+page to be made writable is in a different group, it will be flushed first.
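+
+A sketch of the delegation, for a filesystem that doesn't use writeback
+groups (hence the NULL)::
+
+	static vm_fault_t myfs_page_mkwrite(struct vm_fault *vmf)
+	{
+		return netfs_page_mkwrite(vmf, NULL);
+	}
+
+	static const struct vm_operations_struct myfs_vm_ops = {
+		.fault		= filemap_fault,
+		.map_pages	= filemap_map_pages,
+		.page_mkwrite	= myfs_page_mkwrite,
+	};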
+
+Monolithic Files API
+--------------------
+
+There is also a special API set for files for which the content must be read in
+a single RPC (and not written back) and is maintained as a monolithic blob
+(e.g. an AFS directory), though it can be stored and updated in the local cache::
+
+ ssize_t netfs_read_single(struct inode *inode, struct file *file, struct iov_iter *iter);
+ void netfs_single_mark_inode_dirty(struct inode *inode);
+ int netfs_writeback_single(struct address_space *mapping,
+ struct writeback_control *wbc,
+ struct iov_iter *iter);
+
+The first function reads from a file into the given buffer, reading from the
+cache in preference if the data is cached there; the second function allows the
+inode to be marked dirty, causing a later writeback; and the third function can
+be called from the writeback code to write the data to the cache, if there is
+one.
+
+The inode should be marked ``NETFS_ICTX_SINGLE_NO_UPLOAD`` if this API is to be
+used. The writeback function requires the buffer to be of ITER_FOLIOQ type.
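+
+As a sketch, reading such a file into a filesystem-provided folio queue
+buffer might look like this (``fq`` and ``size`` are assumed to have been
+set up by the filesystem)::
+
+	struct iov_iter iter;
+	ssize_t ret;
+
+	iov_iter_folio_queue(&iter, ITER_DEST, fq, 0, 0, size);
+	ret = netfs_read_single(inode, file, &iter);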
+
+High-Level VM API
+==================
+
+Netfslib also provides a number of sets of API calls for the filesystem to
+delegate VM operations to. Again, netfslib, in turn, will call out to the
+filesystem and the cache to negotiate I/O sizes, issue RPCs and provide places
+for it to intervene at various times::
+
+ void netfs_readahead(struct readahead_control *);
+ int netfs_read_folio(struct file *, struct folio *);
+ int netfs_writepages(struct address_space *mapping,
+ struct writeback_control *wbc);
+ bool netfs_dirty_folio(struct address_space *mapping, struct folio *folio);
+ void netfs_invalidate_folio(struct folio *folio, size_t offset, size_t length);
+ bool netfs_release_folio(struct folio *folio, gfp_t gfp);
+
+These are ``address_space_operations`` methods and can be set directly in the
+operations table.
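+
+For instance (a sketch; other methods elided)::
+
+	const struct address_space_operations myfs_aops = {
+		.readahead		= netfs_readahead,
+		.read_folio		= netfs_read_folio,
+		.writepages		= netfs_writepages,
+		.dirty_folio		= netfs_dirty_folio,
+		.invalidate_folio	= netfs_invalidate_folio,
+		.release_folio		= netfs_release_folio,
+		...
+	};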
+
+Deprecated PG_private_2 API
+---------------------------
+
+There is also a deprecated function for filesystems that still use the
+``->write_begin`` method::
+
+ int netfs_write_begin(struct netfs_inode *inode, struct file *file,
+ struct address_space *mapping, loff_t pos, unsigned int len,
+ struct folio **_folio, void **_fsdata);
+
+It uses the deprecated PG_private_2 flag and so should not be used.
+
+
+I/O Request API
+===============
+
+The I/O request API comprises a number of structures and a number of functions
+that the filesystem may need to use.
+
+Request Structure
+-----------------
+
+The request structure manages the request as a whole, holding some resources
+and state on behalf of the filesystem and tracking the collection of results::
+
+ struct netfs_io_request {
+ enum netfs_io_origin origin;
+ struct inode *inode;
+ struct address_space *mapping;
+ struct netfs_group *group;
+ struct netfs_io_stream io_streams[];
+ void *netfs_priv;
+ void *netfs_priv2;
+ unsigned long long start;
+ unsigned long long len;
+ unsigned long long i_size;
+ unsigned int debug_id;
+ unsigned long flags;
+ ...
+ };
+
+Many of the fields are for internal use, but the fields shown here are of
+interest to the filesystem:
+
+ * ``origin``
+
+ The origin of the request (readahead, read_folio, DIO read, writeback, ...).
+
+ * ``inode``
+ * ``mapping``
+
+ The inode and the address space of the file being read from. The mapping
+ may or may not point to inode->i_data.
+
+ * ``group``
+
+ The writeback group this request is dealing with or NULL. This holds a ref
+ on the group.
+
+ * ``io_streams``
+
+ The parallel streams of subrequests available to the request. Currently two
+ are available, but this may be made extensible in future. ``NR_IO_STREAMS``
+ indicates the size of the array.
+
+ * ``netfs_priv``
+ * ``netfs_priv2``
+
+ The network filesystem's private data. The value for this can be passed in
+ to the helper functions or set during the request.
+
+ * ``start``
+ * ``len``
+
+ The file position of the start of the read request and the length. These
+ may be altered by the ->expand_readahead() op.
+
+ * ``i_size``
+
+ The size of the file at the start of the request.
+
+ * ``debug_id``
+
+ A number allocated to this operation that can be displayed in trace lines
+ for reference.
+
+ * ``flags``
+
+ Flags for managing and controlling the operation of the request. Some of
+ these may be of interest to the filesystem:
+
+ * ``NETFS_RREQ_RETRYING``
+
+ Netfslib sets this when generating retries.
+
+ * ``NETFS_RREQ_PAUSE``
+
+ The filesystem can set this to request to pause the library's subrequest
+ issuing loop - but care needs to be taken as netfslib may also set it.
+
+ * ``NETFS_RREQ_NONBLOCK``
+ * ``NETFS_RREQ_BLOCKED``
+
+ Netfslib sets the first to indicate that non-blocking mode was set by the
+ caller and the filesystem can set the second to indicate that it would
+ have had to block.
+
+ * ``NETFS_RREQ_USE_PGPRIV2``
+
+ The filesystem can set this if it wants to use PG_private_2 to track
+ whether a folio is being written to the cache. This is deprecated as
+ PG_private_2 is going to go away.
+
+If the filesystem wants more private data than is afforded by this structure,
+then it should wrap it and provide its own allocator.
+
+Stream Structure
+----------------
+
+A request is comprised of one or more parallel streams and each stream may be
+aimed at a different target.
+
+For read requests, only stream 0 is used. This can contain a mixture of
+subrequests aimed at different sources. For write requests, stream 0 is used
+for the server and stream 1 is used for the cache. For buffered writeback,
+stream 0 is not enabled unless a normal dirty folio is encountered, at which
+point ->begin_writeback() will be invoked and the filesystem can mark the
+stream available.
+
+The stream struct looks like::
+
+ struct netfs_io_stream {
+ unsigned char stream_nr;
+ bool avail;
+ size_t sreq_max_len;
+ unsigned int sreq_max_segs;
+ unsigned int submit_extendable_to;
+ ...
+ };
+
+A number of members are available for access/use by the filesystem:
+
+ * ``stream_nr``
+
+ The number of the stream within the request.
+
+ * ``avail``
+
+ True if the stream is available for use. The filesystem should set this on
+ stream zero if in ->begin_writeback().
+
+ * ``sreq_max_len``
+ * ``sreq_max_segs``
+
+ These are set by the filesystem or the cache in ->prepare_read() or
+ ->prepare_write() for each subrequest to indicate the maximum number of
+ bytes and, optionally, the maximum number of segments (if not 0) that the
+ subrequest can support.
+
+ * ``submit_extendable_to``
+
+ The size that a subrequest can be rounded up to beyond the EOF, given the
+ available buffer. This allows the cache to work out if it can do a DIO read
+ or write that straddles the EOF marker.
+
+Subrequest Structure
+--------------------
+
+Individual units of I/O are managed by the subrequest structure. These
+represent slices of the overall request and run independently::
+
+ struct netfs_io_subrequest {
+ struct netfs_io_request *rreq;
+ struct iov_iter io_iter;
+ unsigned long long start;
+ size_t len;
+ size_t transferred;
+ unsigned long flags;
+ short error;
+ unsigned short debug_index;
+ unsigned char stream_nr;
+ ...
+ };
+
+Each subrequest is expected to access a single source, though the library will
+handle falling back from one source type to another. The members are:
+
+ * ``rreq``
+
+ A pointer to the read request.
+
+ * ``io_iter``
+
+ An I/O iterator representing a slice of the buffer to be read into or
+ written from.
+
+ * ``start``
+ * ``len``
+
+ The file position of the start of this slice of the read request and the
+ length.
+
+ * ``transferred``
+
+ The amount of data transferred so far for this subrequest. The length of
+ each transfer made by an issuance of the subrequest should be added to
+ this. If the total is less than ``len``, the subrequest may be reissued to
+ continue.
+
+ * ``flags``
+
+ Flags for managing the subrequest. A number of these are of interest to
+ the filesystem or cache:
+
+ * ``NETFS_SREQ_MADE_PROGRESS``
+
+ Set by the filesystem to indicate that at least one byte of data was read
+ or written.
+
+ * ``NETFS_SREQ_HIT_EOF``
+
+ The filesystem should set this if a read hit the EOF on the file (in which
+ case ``transferred`` should stop at the EOF). Netfslib may expand the
+ subrequest out to the size of the folio containing the EOF on the off
+ chance that a third party change happened or a DIO read may have asked for
+ more than is available. The library will clear any excess pagecache.
+
+ * ``NETFS_SREQ_CLEAR_TAIL``
+
+ The filesystem can set this to indicate that the remainder of the slice,
+ from ``transferred`` to ``len``, should be cleared. Do not set this flag
+ if HIT_EOF is set.
+
+ * ``NETFS_SREQ_NEED_RETRY``
+
+ The filesystem can set this to tell netfslib to retry the subrequest.
+
+ * ``NETFS_SREQ_BOUNDARY``
+
+ This can be set by the filesystem on a subrequest to indicate that it ends
+ at a boundary with the filesystem structure (e.g. at the end of a Ceph
+ object). It tells netfslib not to retile subrequests across it.
+
+ * ``error``
+
+ This is for the filesystem to store the result of the subrequest. It
+ should be set to 0 if successful and a negative error code otherwise.
+
+ * ``debug_index``
+ * ``stream_nr``
+
+ A number allocated to this slice that can be displayed in trace lines for
+ reference and the number of the request stream that it belongs to.
+
+If necessary, the filesystem can get and put extra refs on the subrequest it is
+given::
+
+ void netfs_get_subrequest(struct netfs_io_subrequest *subreq,
+ enum netfs_sreq_ref_trace what);
+ void netfs_put_subrequest(struct netfs_io_subrequest *subreq,
+ enum netfs_sreq_ref_trace what);
+
+using netfs trace codes to indicate the reason. Care must be taken, however,
+as once control of the subrequest is returned to netfslib, the same subrequest
+can be reissued/retried.
+
+Filesystem Methods
+------------------
+
+The filesystem sets a table of operations in ``netfs_inode`` for netfslib to
+use::
+
+ struct netfs_request_ops {
+ mempool_t *request_pool;
+ mempool_t *subrequest_pool;
+ int (*init_request)(struct netfs_io_request *rreq, struct file *file);
+ void (*free_request)(struct netfs_io_request *rreq);
+ void (*free_subrequest)(struct netfs_io_subrequest *rreq);
+ void (*expand_readahead)(struct netfs_io_request *rreq);
+ int (*prepare_read)(struct netfs_io_subrequest *subreq);
+ void (*issue_read)(struct netfs_io_subrequest *subreq);
+ void (*done)(struct netfs_io_request *rreq);
+ void (*update_i_size)(struct inode *inode, loff_t i_size);
+ void (*post_modify)(struct inode *inode);
+ void (*begin_writeback)(struct netfs_io_request *wreq);
+ void (*prepare_write)(struct netfs_io_subrequest *subreq);
+ void (*issue_write)(struct netfs_io_subrequest *subreq);
+ void (*retry_request)(struct netfs_io_request *wreq,
+ struct netfs_io_stream *stream);
+ void (*invalidate_cache)(struct netfs_io_request *wreq);
+ };
+
+The table starts with a pair of optional pointers to memory pools from which
+requests and subrequests can be allocated. If these are not given, netfslib
+has default pools that it will use instead. If the filesystem wraps the netfs
+structs in its own larger structs, then it will need to use its own pools.
+Netfslib will allocate directly from the pools.
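+
+As a rough sketch, continuing the hypothetical "myfs" wrapper shown earlier,
+a wrapper-sized request pool might be set up like so (the reserve of 32
+objects is an arbitrary choice)::
+
+   static mempool_t myfs_request_pool;
+
+   static int __init myfs_init_request_pool(void)
+   {
+           return mempool_init_kmalloc_pool(&myfs_request_pool, 32,
+                                            sizeof(struct myfs_io_request));
+   }
+
+   static const struct netfs_request_ops myfs_request_ops = {
+           .request_pool   = &myfs_request_pool,
+           /* ... the method pointers described below ... */
+   };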
+
+The methods defined in the table are:
+
+ * ``init_request()``
+ * ``free_request()``
+ * ``free_subrequest()``
+
+ [Optional] A filesystem may implement these to initialise or clean up any
+ resources that it attaches to the request or subrequest.
+
+ * ``expand_readahead()``
+
+ [Optional] This is called to allow the filesystem to expand the size of a
+ readahead request. The filesystem gets to expand the request in both
+ directions, though it must retain the initial region as that may represent
+ an allocation already made. If local caching is enabled, it gets to expand
+ the request first.
+
+ Expansion is communicated by changing ->start and ->len in the request
+ structure. Note that if any change is made, ->len must be increased by at
+ least as much as ->start is reduced.
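+
+ For example, a minimal sketch that rounds the proposed window out to an
+ arbitrary 256KiB granule might look like this::
+
+     static void myfs_expand_readahead(struct netfs_io_request *rreq)
+     {
+             unsigned long long granule = 256 * 1024;
+             unsigned long long start = round_down(rreq->start, granule);
+
+             /* Reducing ->start must be matched by growing ->len. */
+             rreq->len = round_up(rreq->start + rreq->len, granule) - start;
+             rreq->start = start;
+     }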
+
+ * ``prepare_read()``
+
+ [Optional] This is called to allow the filesystem to limit the size of a
+ subrequest. It may also limit the number of individual regions in the
+ iterator, such as may be required by RDMA. This information should be set
+ on stream zero in::
+
+ rreq->io_streams[0].sreq_max_len
+ rreq->io_streams[0].sreq_max_segs
+
+ The filesystem can use this, for example, to chop up a request that has to
+ be split across multiple servers or to put multiple reads in flight.
+
+ Zero should be returned on success and an error code otherwise.
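+
+ A minimal sketch, assuming a hypothetical 1MiB per-RPC limit negotiated
+ with the server::
+
+     static int myfs_prepare_read(struct netfs_io_subrequest *subreq)
+     {
+             subreq->rreq->io_streams[0].sreq_max_len = 1024 * 1024;
+             return 0;
+     }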
+
+ * ``issue_read()``
+
+ [Required] Netfslib calls this to dispatch a subrequest to the server for
+ reading. In the subrequest, ->start, ->len and ->transferred indicate what
+ data should be read from the server and ->io_iter indicates the buffer to be
+ used.
+
+ There is no return value; the ``netfs_read_subreq_terminated()`` function
+ should be called to indicate that the subrequest completed either way.
+ ->error, ->transferred and ->flags should be updated before completing. The
+ termination can be done asynchronously.
+
+ Note: the filesystem must not deal with setting folios uptodate, unlocking
+ them or dropping their refs - the library deals with this as it may have to
+ stitch together the results of multiple subrequests that variously overlap
+ the set of folios.
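+
+ A skeletal synchronous implementation might look like the following, where
+ ``myfs_do_read()`` is a hypothetical transport call that reads into the
+ supplied iterator and returns the number of bytes transferred or a
+ negative error code::
+
+     static void myfs_issue_read(struct netfs_io_subrequest *subreq)
+     {
+             ssize_t ret = myfs_do_read(subreq->rreq->inode,
+                                        subreq->start + subreq->transferred,
+                                        &subreq->io_iter);
+
+             if (ret > 0) {
+                     subreq->transferred += ret;
+                     __set_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
+             } else if (ret == 0) {
+                     __set_bit(NETFS_SREQ_HIT_EOF, &subreq->flags);
+             } else {
+                     subreq->error = ret;
+             }
+             netfs_read_subreq_terminated(subreq);
+     }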
+
+ * ``done()``
+
+ [Optional] This is called after the folios in a read request have all been
+ unlocked (and marked uptodate if applicable).
+
+ * ``update_i_size()``
+
+ [Optional] This is invoked by netfslib at various points during the write
+ paths to ask the filesystem to update its idea of the file size. If not
+ given, netfslib will set i_size and i_blocks and update the local cache
+ cookie.
+
+ * ``post_modify()``
+
+ [Optional] This is called after netfslib writes to the pagecache or when it
+ allows an mmap'd page to be marked as writable.
+
+ * ``begin_writeback()``
+
+ [Optional] Netfslib calls this when processing a writeback request if it
+ finds a dirty page that isn't simply marked NETFS_FOLIO_COPY_TO_CACHE,
+ indicating it must be written to the server. This allows the filesystem to
+ only set up writeback resources when it knows it's going to have to perform
+ a write.
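+
+ Since stream 0 starts out disabled for buffered writeback, a minimal
+ implementation may need to do no more than mark it available::
+
+     static void myfs_begin_writeback(struct netfs_io_request *wreq)
+     {
+             wreq->io_streams[0].avail = true;
+     }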
+
+ * ``prepare_write()``
+
+ [Optional] This is called to allow the filesystem to limit the size of a
+ subrequest. It may also limit the number of individual regions in the
+ iterator, such as may be required by RDMA. This information should be set
+ on the stream to which the subrequest belongs::
+
+ rreq->io_streams[subreq->stream_nr].sreq_max_len
+ rreq->io_streams[subreq->stream_nr].sreq_max_segs
+
+ The filesystem can use this, for example, to chop up a request that has to
+ be split across multiple servers or to put multiple writes in flight.
+
+ This is not permitted to return an error. Instead, in the event of failure,
+ ``netfs_prepare_write_failed()`` must be called.
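+
+ As an illustration, assuming a hypothetical 1MiB server write limit and a
+ hypothetical ``myfs_reserve_write_credits()`` helper::
+
+     static void myfs_prepare_write(struct netfs_io_subrequest *subreq)
+     {
+             struct netfs_io_stream *stream =
+                     &subreq->rreq->io_streams[subreq->stream_nr];
+
+             stream->sreq_max_len = 1024 * 1024;
+             if (!myfs_reserve_write_credits(subreq)) {
+                     subreq->error = -EIO;
+                     netfs_prepare_write_failed(subreq);
+             }
+     }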
+
+ * ``issue_write()``
+
+ [Required] This is used to dispatch a subrequest to the server for writing.
+ In the subrequest, ->start, ->len and ->transferred indicate what data
+ should be written to the server and ->io_iter indicates the buffer to be
+ used.
+
+ There is no return value; the ``netfs_write_subrequest_terminated()``
+ function should be called to indicate that the subrequest completed either
+ way.
+ ->error, ->transferred and ->flags should be updated before completing. The
+ termination can be done asynchronously.
+
+ Note: the filesystem must not deal with removing the dirty or writeback
+ marks on folios involved in the operation and should not take refs or pins
+ on them, but should leave retention to netfslib.
+
+ * ``retry_request()``
+
+ [Optional] Netfslib calls this at the beginning of a retry cycle. This
+ allows the filesystem to examine the state of the request, of the
+ subrequests in the indicated stream and of its own data, and to make
+ adjustments or renegotiate resources.
+
+ * ``invalidate_cache()``
+
+ [Optional] This is called by netfslib to invalidate data stored in the local
+ cache in the event that writing to the local cache fails, providing updated
+ coherency data that netfs can't provide.
+
+Terminating a subrequest
+------------------------
+
+When a subrequest completes, there are a number of functions that the cache or
+subrequest can call to inform netfslib of the status change. One function is
+provided to terminate a write subrequest at the preparation stage and acts
+synchronously:
+
+ * ``void netfs_prepare_write_failed(struct netfs_io_subrequest *subreq);``
+
+ Indicate that the ->prepare_write() call failed. The ``error`` field should
+ have been updated.
+
+Note that ->prepare_read() can return an error as a read can simply be aborted.
+Dealing with writeback failure is trickier.
+
+The other functions are used for subrequests that got as far as being issued:
+
+ * ``void netfs_read_subreq_terminated(struct netfs_io_subrequest *subreq);``
+
+ Tell netfslib that a read subrequest has terminated. The ``error``,
+ ``flags`` and ``transferred`` fields should have been updated.
+
+ * ``void netfs_write_subrequest_terminated(void *_op, ssize_t transferred_or_error);``
+
+ Tell netfslib that a write subrequest has terminated. Either the amount of
+ data processed or the negative error code can be passed in. This can be
+ used as a kiocb completion function.
+
+ * ``void netfs_read_subreq_progress(struct netfs_io_subrequest *subreq);``
+
+ This is provided to optionally update netfslib on the incremental progress
+ of a read, allowing some folios to be unlocked early; it does not actually
+ terminate the subrequest. The ``transferred`` field should have been
+ updated.
+
+Local Cache API
+---------------
+
+Netfslib provides a separate API for a local cache to implement, though some
+of its routines are similar to those of the filesystem request API.
+
+Firstly, the netfs_io_request object contains a place for the cache to hang its
+state::
+
+ struct netfs_cache_resources {
+ const struct netfs_cache_ops *ops;
+ void *cache_priv;
+ void *cache_priv2;
+ unsigned int debug_id;
+ unsigned int inval_counter;
+ };
+
+This contains an operations table pointer and two private pointers, plus the
+debug ID of the fscache cookie for tracing purposes and an invalidation
+counter that is cranked by calls to ``fscache_invalidate()``, allowing cache
+subrequests to be invalidated after completion.
+
+The cache operation table looks like the following::
+
+ struct netfs_cache_ops {
+ void (*end_operation)(struct netfs_cache_resources *cres);
+ void (*expand_readahead)(struct netfs_cache_resources *cres,
+ loff_t *_start, size_t *_len, loff_t i_size);
+ enum netfs_io_source (*prepare_read)(struct netfs_io_subrequest *subreq,
+ loff_t i_size);
+ int (*read)(struct netfs_cache_resources *cres,
+ loff_t start_pos,
+ struct iov_iter *iter,
+ bool seek_data,
+ netfs_io_terminated_t term_func,
+ void *term_func_priv);
+ void (*prepare_write_subreq)(struct netfs_io_subrequest *subreq);
+ void (*issue_write)(struct netfs_io_subrequest *subreq);
+ };
+
+With a termination handler function pointer::
+
+ typedef void (*netfs_io_terminated_t)(void *priv,
+ ssize_t transferred_or_error,
+ bool was_async);
+
+The methods defined in the table are:
+
+ * ``end_operation()``
+
+ [Required] Called to clean up the resources at the end of the read request.
+
+ * ``expand_readahead()``
+
+ [Optional] Called at the beginning of a readahead operation to allow the
+ cache to expand a request in either direction. This allows the cache to
+ size the request appropriately for the cache granularity.
+
+ * ``prepare_read()``
+
+ [Required] Called to configure the next slice of a request. ->start and
+ ->len in the subrequest indicate where and how big the next slice can be;
+ the cache gets to reduce the length to match its granularity requirements.
+
+ The function is passed the subrequest, which holds the proposed start and
+ length, plus the size of the file for reference, and should adjust the
+ subrequest's start and length appropriately. It should return one of:
+
+ * ``NETFS_FILL_WITH_ZEROES``
+ * ``NETFS_DOWNLOAD_FROM_SERVER``
+ * ``NETFS_READ_FROM_CACHE``
+ * ``NETFS_INVALID_READ``
+
+ to indicate whether the slice should just be cleared or whether it should be
+ downloaded from the server or read from the cache - or whether slicing
+ should be given up at the current point.
+
+ * ``read()``
+
+ [Required] Called to read from the cache. The start file offset is given
+ along with an iterator to read to, which gives the length also. It can be
+ given a hint requesting that it seek forward from that start position for
+ data.
+
+ Also provided is a pointer to a termination handler function and private
+ data to pass to that function. The termination function should be called
+ with the number of bytes transferred or an error code, plus a flag
+ indicating whether the termination is definitely happening in the caller's
+ context.
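+
+ As a sketch, a termination handler passed in by the caller might look like
+ the following, assuming the private data pointer is the subrequest being
+ serviced::
+
+     static void myfs_cache_read_done(void *priv, ssize_t transferred_or_error,
+                                      bool was_async)
+     {
+             struct netfs_io_subrequest *subreq = priv;
+
+             if (transferred_or_error > 0) {
+                     subreq->transferred += transferred_or_error;
+                     __set_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags);
+             } else if (transferred_or_error < 0) {
+                     subreq->error = transferred_or_error;
+             }
+             netfs_read_subreq_terminated(subreq);
+     }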
+
+ * ``prepare_write_subreq()``
+
+ [Required] This is called to allow the cache to limit the size of a
+ subrequest. It may also limit the number of individual regions in the
+ iterator, such as may be required by DIO/DMA. This information should be
+ set on the stream to which the subrequest belongs::
+
+ rreq->io_streams[subreq->stream_nr].sreq_max_len
+ rreq->io_streams[subreq->stream_nr].sreq_max_segs
+
+ The cache can use this, for example, to chop up a request that has to be
+ split across multiple cache objects or to put multiple writes in flight.
+
+ This is not permitted to return an error. In the event of failure,
+ ``netfs_prepare_write_failed()`` must be called.
+
+ * ``issue_write()``
+
+ [Required] This is used to dispatch a subrequest to the cache for writing.
+ In the subrequest, ->start, ->len and ->transferred indicate what data
+ should be written to the cache and ->io_iter indicates the buffer to be
+ used.
+
+ There is no return value; the ``netfs_write_subrequest_terminated()``
+ function should be called to indicate that the subrequest completed either
+ way.
+ ->error, ->transferred and ->flags should be updated before completing. The
+ termination can be done asynchronously.
+
+
+API Function Reference
+======================
+
+.. kernel-doc:: include/linux/netfs.h
+.. kernel-doc:: fs/netfs/buffered_read.c
diff --git a/Documentation/filesystems/nfs/client-identifier.rst b/Documentation/filesystems/nfs/client-identifier.rst
new file mode 100644
index 000000000000..4804441155f5
--- /dev/null
+++ b/Documentation/filesystems/nfs/client-identifier.rst
@@ -0,0 +1,216 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+NFSv4 client identifier
+=======================
+
+This document explains how the NFSv4 protocol identifies client
+instances in order to maintain file open and lock state during
+system restarts. A special identifier and principal are maintained
+on each client. These can be set by administrators, scripts
+provided by site administrators, or tools provided by Linux
+distributors.
+
+There are risks if a client's NFSv4 identifier and its principal
+are not chosen carefully.
+
+
+Introduction
+------------
+
+The NFSv4 protocol uses "lease-based file locking". Leases help
+NFSv4 servers provide file lock guarantees and manage their
+resources.
+
+Simply put, an NFSv4 server creates a lease for each NFSv4 client.
+The server collects each client's file open and lock state under
+the lease for that client.
+
+The client is responsible for periodically renewing its leases.
+While a lease remains valid, the server holding that lease
+guarantees the file locks the client has created remain in place.
+
+If a client stops renewing its lease (for example, if it crashes),
+the NFSv4 protocol allows the server to remove the client's open
+and lock state after a certain period of time. When a client
+restarts, it indicates to servers that open and lock state
+associated with its previous leases is no longer valid and can be
+destroyed immediately.
+
+In addition, each NFSv4 server manages a persistent list of client
+leases. When the server restarts and clients attempt to recover
+their state, the server uses this list to distinguish amongst
+clients that held state before the server restarted and clients
+sending fresh OPEN and LOCK requests. This enables file locks to
+persist safely across server restarts.
+
+NFSv4 client identifiers
+------------------------
+
+Each NFSv4 client presents an identifier to NFSv4 servers so that
+they can associate the client with its lease. Each client's
+identifier consists of two elements:
+
+ - co_ownerid: An arbitrary but fixed string.
+
+ - boot verifier: A 64-bit incarnation verifier that enables a
+ server to distinguish successive boot epochs of the same client.
+
+The NFSv4.0 specification refers to these two items as an
+"nfs_client_id4". The NFSv4.1 specification refers to these two
+items as a "client_owner4".
+
+NFSv4 servers tie this identifier to the principal and security
+flavor that the client used when presenting it. Servers use this
+principal to authorize subsequent lease modification operations
+sent by the client. Effectively this principal is a third element of
+the identifier.
+
+As part of the identity presented to servers, a good
+"co_ownerid" string has several important properties:
+
+ - The "co_ownerid" string identifies the client during reboot
+ recovery, therefore the string is persistent across client
+ reboots.
+ - The "co_ownerid" string helps servers distinguish the client
+ from others, therefore the string is globally unique. Note
+ that there is no central authority that assigns "co_ownerid"
+ strings.
+ - Because it often appears on the network in the clear, the
+ "co_ownerid" string does not reveal private information about
+ the client itself.
+ - The content of the "co_ownerid" string is set and unchanging
+ before the client attempts NFSv4 mounts after a restart.
+ - The NFSv4 protocol places a 1024-byte limit on the size of the
+ "co_ownerid" string.
+
+Protecting NFSv4 lease state
+----------------------------
+
+NFSv4 servers utilize the "client_owner4" as described above to
+assign a unique lease to each client. Under this scheme, there are
+circumstances where clients can interfere with each other. This is
+referred to as "lease stealing".
+
+If distinct clients present the same "co_ownerid" string and use
+the same principal (for example, AUTH_SYS and UID 0), a server is
+unable to tell that the clients are not the same. Each distinct
+client presents a different boot verifier, so it appears to the
+server as if there is one client that is rebooting frequently.
+Neither client can maintain open or lock state in this scenario.
+
+If distinct clients present the same "co_ownerid" string and use
+distinct principals, the server is likely to allow the first client
+to operate normally but reject subsequent clients with the same
+"co_ownerid" string.
+
+If a client's "co_ownerid" string or principal are not stable,
+state recovery after a server or client reboot is not guaranteed.
+If a client unexpectedly restarts but presents a different
+"co_ownerid" string or principal to the server, the server orphans
+the client's previous open and lock state. This blocks access to
+locked files until the server removes the orphaned state.
+
+If the server restarts and a client presents a changed "co_ownerid"
+string or principal to the server, the server will not allow the
+client to reclaim its open and lock state, and may give those locks
+to other clients in the meantime. This is referred to as "lock
+stealing".
+
+Lease stealing and lock stealing increase the potential for denial
+of service and in rare cases even data corruption.
+
+Selecting an appropriate client identifier
+------------------------------------------
+
+By default, the Linux NFSv4 client implementation constructs its
+"co_ownerid" string starting with the words "Linux NFS" followed by
+the client's UTS node name (the same node name, incidentally, that
+is used as the "machine name" in an AUTH_SYS credential). In small
+deployments, this construction is usually adequate. Often, however,
+the node name by itself is not adequately unique, and can change
+unexpectedly. Problematic situations include:
+
+ - NFS-root (diskless) clients, where the local DHCP server (or
+ equivalent) does not provide a unique host name.
+
+ - "Containers" within a single Linux host. If each container has
+ a separate network namespace, but does not use the UTS namespace
+ to provide a unique host name, then there can be multiple NFS
+ client instances with the same host name.
+
+ - Clients across multiple administrative domains that access a
+ common NFS server. If hostnames are not assigned centrally
+ then uniqueness cannot be guaranteed unless a domain name is
+ included in the hostname.
+
+Linux provides two mechanisms to add uniqueness to its "co_ownerid"
+string:
+
+ nfs.nfs4_unique_id
+ This module parameter can set an arbitrary uniquifier string
+ via the kernel command line, or when the "nfs" module is
+ loaded.
+
+ /sys/fs/nfs/net/nfs_client/identifier
+ This virtual file, available since Linux 5.3, is local to the
+ network namespace in which it is accessed and so can provide
+ distinction between network namespaces (containers) when the
+ hostname remains uniform.
+
+Note that this file is empty on name-space creation. If the
+container system has access to some sort of per-container identity
+then that uniquifier can be used. For example, a uniquifier might
+be formed at boot using the container's internal identifier::
+
+  sha256sum /etc/machine-id | awk '{print $1}' \
+      > /sys/fs/nfs/net/nfs_client/identifier
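+
+Similarly, the nfs.nfs4_unique_id module parameter can be supplied on the
+kernel command line; the UUID shown here is only an arbitrary example::
+
+  nfs.nfs4_unique_id=a98cbfcd-26e6-4287-93ce-2c87a1a255cc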
+
+Security considerations
+-----------------------
+
+The use of cryptographic security for lease management operations
+is strongly encouraged.
+
+If NFS with Kerberos is not configured, a Linux NFSv4 client uses
+AUTH_SYS and UID 0 as the principal part of its client identity.
+This configuration is not only insecure, it increases the risk of
+lease and lock stealing. However, it might be the only choice for
+client configurations that have no local persistent storage.
+"co_ownerid" string uniqueness and persistence is critical in this
+case.
+
+When a Kerberos keytab is present on a Linux NFS client, the client
+attempts to use one of the principals in that keytab when
+identifying itself to servers. The "sec=" mount option does not
+control this behavior. Alternately, a single-user client with a
+Kerberos principal can use that principal in place of the client's
+host principal.
+
+Using Kerberos for this purpose enables the client and server to
+use the same lease for operations covered by all "sec=" settings.
+Additionally, the Linux NFS client uses the RPCSEC_GSS security
+flavor with Kerberos and the integrity QOS to prevent in-transit
+modification of lease modification requests.
+
+Additional notes
+----------------
+The Linux NFSv4 client establishes a single lease on each NFSv4
+server it accesses. NFSv4 mounts from a Linux NFSv4 client of a
+particular server then share that lease.
+
+Once a client establishes open and lock state, the NFSv4 protocol
+enables lease state to transition to other servers, following data
+that has been migrated. This hides data migration completely from
+running applications. The Linux NFSv4 client facilitates state
+migration by presenting the same "client_owner4" to all servers it
+encounters.
+
+========
+See Also
+========
+
+ - nfs(5)
+ - kerberos(7)
+ - RFC 7530 for the NFSv4.0 specification
+ - RFC 8881 for the NFSv4.1 specification.
diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst
index 33d588a01ace..de64d2d002a2 100644
--- a/Documentation/filesystems/nfs/exporting.rst
+++ b/Documentation/filesystems/nfs/exporting.rst
@@ -122,12 +122,9 @@ are exportable by setting the s_export_op field in the struct
super_block. This field must point to a "struct export_operations"
struct which has the following members:
- encode_fh (optional)
- Takes a dentry and creates a filehandle fragment which can later be used
- to find or create a dentry for the same object. The default
- implementation creates a filehandle fragment that encodes a 32bit inode
- and generation number for the inode encoded, and if necessary the
- same information for the parent.
+ encode_fh (mandatory)
+ Takes a dentry and creates a filehandle fragment which may later be used
+ to find or create a dentry for the same object.
fh_to_dentry (mandatory)
Given a filehandle fragment, this should find the implied object and
@@ -154,6 +151,11 @@ struct which has the following members:
to find potential names, and matches inode numbers to find the correct
match.
+ flags
+ Some filesystems may need to be handled differently than others. The
+ export_operations struct also includes a flags field that allows the
+ filesystem to communicate such information to nfsd. See the Export
+ Operations Flags section below for more explanation.
A filehandle fragment consists of an array of 1 or more 4byte words,
together with a one byte "type".
@@ -163,3 +165,76 @@ generated by encode_fh, in which case it will have been padded with
nuls. Rather, the encode_fh routine should choose a "type" which
indicates the decode_fh how much of the filehandle is valid, and how
it should be interpreted.
+
+Export Operations Flags
+-----------------------
+In addition to the operation vector pointers, struct export_operations also
+contains a "flags" field that allows the filesystem to communicate to nfsd
+that it may want to do things differently when dealing with it. The
+following flags are defined:
+
+ EXPORT_OP_NOWCC - disable NFSv3 WCC attributes on this filesystem
+ RFC 1813 recommends that servers always send weak cache consistency
+ (WCC) data to the client after each operation. The server should
+ atomically collect attributes about the inode, do an operation on it,
+ and then collect the attributes afterward. This allows the client to
+ skip issuing GETATTRs in some situations but means that the server
+ is calling vfs_getattr for almost all RPCs. On some filesystems
+ (particularly those that are clustered or networked) this is expensive
+ and atomicity is difficult to guarantee. This flag indicates to nfsd
+ that it should skip providing WCC attributes to the client in NFSv3
+ replies when doing operations on this filesystem. Consider enabling
+ this on filesystems that have an expensive ->getattr inode operation,
+ or when atomicity between pre and post operation attribute collection
+ is impossible to guarantee.
+
+ EXPORT_OP_NOSUBTREECHK - disallow subtree checking on this fs
+ Many NFS operations deal with filehandles, which the server must then
+ vet to ensure that they live inside of an exported tree. When the
+ export consists of an entire filesystem, this is trivial. nfsd can just
+ ensure that the filehandle lives on the filesystem. When only part of a
+ filesystem is exported, however, nfsd must walk the ancestors of the
+ inode to ensure that it's within an exported subtree. This is an
+ expensive operation and not all filesystems can support it properly.
+ This flag exempts the filesystem from subtree checking and causes
+ exportfs to get back an error if it tries to enable subtree checking
+ on it.
+
+ EXPORT_OP_CLOSE_BEFORE_UNLINK - always close cached files before unlinking
+ On some exportable filesystems (such as NFS) unlinking a file that
+ is still open can cause a fair bit of extra work. For instance,
+ the NFS client will do a "sillyrename" to ensure that the file
+ sticks around while it's still open. When reexporting, that open
+ file is held by nfsd so we usually end up doing a sillyrename, and
+ then immediately deleting the sillyrenamed file just afterward when
+ the link count actually goes to zero. Sometimes this delete can race
+ with other operations (for instance an rmdir of the parent directory).
+ This flag causes nfsd to close any open files for this inode _before_
+ calling into the vfs to do an unlink or a rename that would replace
+ an existing file.
+
+ EXPORT_OP_REMOTE_FS - Backing storage for this filesystem is remote
+ PF_LOCAL_THROTTLE exists for loopback NFSD, where a thread needs to
+ write to one bdi (the final bdi) in order to free up writes queued
+ to another bdi (the client bdi). Such threads get a private balance
+ of dirty pages so that dirty pages for the client bdi do not impact
+ the daemon writing to the final bdi. For filesystems whose durable
+ storage is not local (such as exported NFS filesystems), this
+ constraint has negative consequences. EXPORT_OP_REMOTE_FS enables
+ an export to disable writeback throttling.
+
+ EXPORT_OP_NOATOMIC_ATTR - Filesystem does not update attributes atomically
+ EXPORT_OP_NOATOMIC_ATTR indicates that the exported filesystem
+ cannot provide the semantics required by the "atomic" boolean in
+ NFSv4's change_info4. This boolean indicates to a client whether the
+ returned before and after change attributes were obtained atomically
+ with respect to the requested metadata operation (UNLINK,
+ OPEN/CREATE, MKDIR, etc).
+
+ EXPORT_OP_FLUSH_ON_CLOSE - Filesystem flushes file data on close(2)
+ On most filesystems, inodes can remain under writeback after the
+ file is closed. NFSD relies on client activity or local flusher
+ threads to handle writeback. Certain filesystems, such as NFS, flush
+ all of an inode's dirty data on last close. Exports that behave this
+ way should set EXPORT_OP_FLUSH_ON_CLOSE so that NFSD knows to skip
+ waiting for writeback when closing such files.
diff --git a/Documentation/filesystems/nfs/index.rst b/Documentation/filesystems/nfs/index.rst
index 65805624e39b..95c2c009874c 100644
--- a/Documentation/filesystems/nfs/index.rst
+++ b/Documentation/filesystems/nfs/index.rst
@@ -6,8 +6,12 @@ NFS
.. toctree::
:maxdepth: 1
+ client-identifier
+ exporting
+ localio
pnfs
rpc-cache
rpc-server-gss
nfs41-server
knfsd-stats
+ reexport
diff --git a/Documentation/filesystems/nfs/localio.rst b/Documentation/filesystems/nfs/localio.rst
new file mode 100644
index 000000000000..79808b37d745
--- /dev/null
+++ b/Documentation/filesystems/nfs/localio.rst
@@ -0,0 +1,357 @@
+===========
+NFS LOCALIO
+===========
+
+Overview
+========
+
+The LOCALIO auxiliary RPC protocol allows the Linux NFS client and
+server to reliably handshake to determine if they are on the same
+host. Select "NFS client and server support for LOCALIO auxiliary
+protocol" in menuconfig to enable CONFIG_NFS_LOCALIO in the kernel
+config (both CONFIG_NFS_FS and CONFIG_NFSD must also be enabled).
+
+Once an NFS client and server handshake as "local", the client will
+bypass the network RPC protocol for read, write and commit operations.
+Due to this XDR and RPC bypass, these operations will operate faster.
+
+The LOCALIO auxiliary protocol's implementation, which uses the same
+connection as NFS traffic, follows the pattern established by the NFS
+ACL protocol extension.
+
+The LOCALIO auxiliary protocol is needed to allow robust discovery of
+clients local to their servers. In a private implementation that
+preceded use of this LOCALIO protocol, a fragile sockaddr network
+address based match against all local network interfaces was attempted.
+But unlike the LOCALIO protocol, the sockaddr-based matching didn't
+handle use of iptables or containers.
+
+The robust handshake between local client and server is just the
+beginning; the ultimate use case that this locality makes possible is
+allowing the client to open files and issue reads, writes and commits
+directly to the server without having to go over the network. The
+requirement is to perform these loopback NFS operations as efficiently
+as possible; this is particularly useful for container use cases
+(e.g. kubernetes) where it is possible to run an IO job local to the
+server.
+
+The performance advantage realized from LOCALIO's ability to bypass
+using XDR and RPC for reads, writes and commits can be extreme, e.g.:
+
+fio for 20 secs with directio, qd of 8, 16 libaio threads:
+ - With LOCALIO:
+ 4K read: IOPS=979k, BW=3825MiB/s (4011MB/s)(74.7GiB/20002msec)
+ 4K write: IOPS=165k, BW=646MiB/s (678MB/s)(12.6GiB/20002msec)
+ 128K read: IOPS=402k, BW=49.1GiB/s (52.7GB/s)(982GiB/20002msec)
+ 128K write: IOPS=11.5k, BW=1433MiB/s (1503MB/s)(28.0GiB/20004msec)
+
+ - Without LOCALIO:
+ 4K read: IOPS=79.2k, BW=309MiB/s (324MB/s)(6188MiB/20003msec)
+ 4K write: IOPS=59.8k, BW=234MiB/s (245MB/s)(4671MiB/20002msec)
+ 128K read: IOPS=33.9k, BW=4234MiB/s (4440MB/s)(82.7GiB/20004msec)
+ 128K write: IOPS=11.5k, BW=1434MiB/s (1504MB/s)(28.0GiB/20011msec)
+
+fio for 20 secs with directio, qd of 8, 1 libaio thread:
+ - With LOCALIO:
+ 4K read: IOPS=230k, BW=898MiB/s (941MB/s)(17.5GiB/20001msec)
+ 4K write: IOPS=22.6k, BW=88.3MiB/s (92.6MB/s)(1766MiB/20001msec)
+ 128K read: IOPS=38.8k, BW=4855MiB/s (5091MB/s)(94.8GiB/20001msec)
+ 128K write: IOPS=11.4k, BW=1428MiB/s (1497MB/s)(27.9GiB/20001msec)
+
+ - Without LOCALIO:
+ 4K read: IOPS=77.1k, BW=301MiB/s (316MB/s)(6022MiB/20001msec)
+ 4K write: IOPS=32.8k, BW=128MiB/s (135MB/s)(2566MiB/20001msec)
+ 128K read: IOPS=24.4k, BW=3050MiB/s (3198MB/s)(59.6GiB/20001msec)
+ 128K write: IOPS=11.4k, BW=1430MiB/s (1500MB/s)(27.9GiB/20001msec)
+
+FAQ
+===
+
+1. What are the use cases for LOCALIO?
+
+ a. Workloads where the NFS client and server are on the same host
+ realize improved IO performance. In particular, it is common when
+ running containerised workloads for jobs to find themselves
+ running on the same host as the knfsd server being used for
+ storage.
+
+2. What are the requirements for LOCALIO?
+
+ a. Bypass use of the network RPC protocol as much as possible. This
+ includes bypassing XDR and RPC for open, read, write and commit
+ operations.
+ b. Allow client and server to autonomously discover if they are
+ running local to each other without making any assumptions about
+ the local network topology.
+ c. Support the use of containers by being compatible with relevant
+ namespaces (e.g. network, user, mount).
+ d. Support all versions of NFS. NFSv3 is of particular importance
+ because it has wide enterprise usage and pNFS flexfiles makes use
+ of it for the data path.
+
+3. Why doesn’t LOCALIO just compare IP addresses or hostnames when
+ deciding if the NFS client and server are co-located on the same
+ host?
+
+ Since one of the main use cases is containerised workloads, we cannot
+ assume that IP addresses will be shared between the client and
+ server. This sets up a requirement for a handshake protocol that
+ needs to go over the same connection as the NFS traffic in order to
+ identify that the client and the server really are running on the
+ same host. The handshake uses a secret that is sent over the wire,
+ and can be verified by both parties by comparing with a value stored
+ in shared kernel memory if they are truly co-located.
+
+4. Does LOCALIO improve pNFS flexfiles?
+
+ Yes, LOCALIO complements pNFS flexfiles by allowing it to take
+ advantage of NFS client and server locality. Policy that initiates
+ client IO as closely to the server where the data is stored naturally
+ benefits from the data path optimization LOCALIO provides.
+
+5. Why not develop a new pNFS layout to enable LOCALIO?
+
+ A new pNFS layout could be developed, but doing so would put the
+ onus on the server to somehow discover that the client is co-located
+ when deciding to hand out the layout.
+ There is value in a simpler approach (as provided by LOCALIO) that
+ allows the NFS client to negotiate and leverage locality without
+ requiring more elaborate modeling and discovery of such locality in a
+ more centralized manner.
+
+6. Why is having the client perform a server-side file OPEN, without
+ using RPC, beneficial? Is the benefit pNFS specific?
+
+ Avoiding the use of XDR and RPC for file opens is beneficial to
+ performance regardless of whether pNFS is used. Especially when
+ dealing with small files it's best to avoid going over the wire
+ whenever possible, otherwise it could reduce or even negate the
+ benefits of avoiding the wire for doing the small file I/O itself.
+ Given LOCALIO's requirements the current approach of having the
+ client perform a server-side file open, without using RPC, is ideal.
+ If in the future requirements change then we can adapt accordingly.
+
+7. Why is LOCALIO only supported with UNIX Authentication (AUTH_UNIX)?
+
+ Strong authentication is usually tied to the connection itself. It
+ works by establishing a context that is cached by the server, and
+ that acts as the key for discovering the authorisation token, which
+ can then be passed to rpc.mountd to complete the authentication
+ process. On the other hand, in the case of AUTH_UNIX, the credential
+ that was passed over the wire is used directly as the key in the
+ upcall to rpc.mountd. This simplifies the authentication process, and
+ so makes AUTH_UNIX easier to support.
+
+8. How do export options that translate RPC user IDs behave for LOCALIO
+ operations (eg. root_squash, all_squash)?
+
+ Export options that translate user IDs are managed by nfsd_setuser()
+ which is called by nfsd_setuser_and_check_port() which is called by
+ __fh_verify(). So they get handled exactly the same way for LOCALIO
+ as they do for non-LOCALIO.
+
+9. How does LOCALIO make certain that object lifetimes are managed
+ properly given NFSD and NFS operate in different contexts?
+
+ See the detailed "NFS Client and Server Interlock" section below.
+
+RPC
+===
+
+The LOCALIO auxiliary RPC protocol consists of a single "UUID_IS_LOCAL"
+RPC method that allows the Linux NFS client to verify the local Linux
+NFS server can see the nonce (single-use UUID) the client generated and
+made available in nfs_common. This protocol isn't part of an IETF
+standard, nor does it need to be, considering it is a Linux-to-Linux
+auxiliary RPC protocol that amounts to an implementation detail.
+
+The UUID_IS_LOCAL method encodes the client generated uuid_t in terms of
+the fixed UUID_SIZE (16 bytes). The fixed size opaque encode and decode
+XDR methods are used instead of the less efficient variable sized
+methods.
+
+The RPC program number for the NFS_LOCALIO_PROGRAM is 400122 (as assigned
+by IANA, see https://www.iana.org/assignments/rpc-program-numbers/ ):
+Linux Kernel Organization 400122 nfslocalio
+
+The LOCALIO protocol spec in rpcgen syntax is::
+
+ /* raw RFC 9562 UUID */
+ #define UUID_SIZE 16
+ typedef u8 uuid_t<UUID_SIZE>;
+
+ program NFS_LOCALIO_PROGRAM {
+ version LOCALIO_V1 {
+ void
+ NULL(void) = 0;
+
+ void
+ UUID_IS_LOCAL(uuid_t) = 1;
+ } = 1;
+ } = 400122;
+
+LOCALIO uses the same transport connection as NFS traffic. As such,
+LOCALIO is not registered with rpcbind.
+
+NFS Common and Client/Server Handshake
+======================================
+
+fs/nfs_common/nfslocalio.c provides interfaces that enable an NFS client
+to generate a nonce (single-use UUID) and an associated short-lived
+nfs_uuid_t struct, and to register it with nfs_common for subsequent
+lookup and verification by the NFS server; if matched, the NFS server
+populates members in the nfs_uuid_t struct. The NFS client then uses
+nfs_common to transfer the nfs_uuid_t from nfs_common's uuids_list to
+the nn->nfsd_serv clients_list. See:
+fs/nfs/localio.c:nfs_local_probe()
+
+nfs_common's nfs_uuids list is the basis for LOCALIO enablement; as such,
+it has members that point to nfsd memory for direct use by the client
+(e.g. 'net' is the server's network namespace; through it the client can
+access nn->nfsd_serv with proper rcu read access). It is this client
+and server synchronization that enables advanced usage and allows the
+lifetimes of objects to span from the host kernel's nfsd to per-container
+knfsd instances that are connected to NFS clients running on the same
+local host.
+
+NFS Client and Server Interlock
+===============================
+
+LOCALIO provides the nfs_uuid_t object and associated interfaces to
+allow proper network namespace (net-ns) and NFSD object refcounting.
+
+LOCALIO required the introduction and use of NFSD's percpu nfsd_net_ref
+to interlock nfsd_shutdown_net() and nfsd_open_local_fh(), to ensure
+each net-ns is not destroyed while in use by nfsd_open_local_fh(), and
+warrants a more detailed explanation:
+
+ nfsd_open_local_fh() uses nfsd_net_try_get() before opening its
+ nfsd_file handle and then the caller (NFS client) must drop the
+ reference for the nfsd_file and associated net-ns using
+ nfsd_file_put_local() once it has completed its IO.
+
+ This interlock relies heavily on nfsd_open_local_fh() being
+ afforded the ability to safely deal with the possibility that the
+ NFSD's net-ns (and nfsd_net by association) may have been destroyed
+ by nfsd_destroy_serv() via nfsd_shutdown_net().
+
+This interlock of the NFS client and server has been verified to fix an
+easy to hit crash that would occur if an NFSD instance running in a
+container, with a LOCALIO client mounted, is shut down. Upon restart of
+the container and associated NFSD, the client would crash due to a NULL
+pointer dereference caused by the LOCALIO client attempting to call
+nfsd_open_local_fh() without holding a proper reference on NFSD's
+net-ns.
+
+NFS Client issues IO instead of Server
+======================================
+
+Because LOCALIO is focused on protocol bypass to achieve improved IO
+performance, alternatives to the traditional NFS wire protocol (SUNRPC
+with XDR) must be provided to access the backing filesystem.
+
+See fs/nfs/localio.c:nfs_local_open_fh() and
+fs/nfsd/localio.c:nfsd_open_local_fh() for the interface that makes
+focused use of select nfs server objects to allow a client local to a
+server to open a file pointer without needing to go over the network.
+
+The client's fs/nfs/localio.c:nfs_local_open_fh() will call into the
+server's fs/nfsd/localio.c:nfsd_open_local_fh() and carefully access
+both the associated nfsd network namespace and nn->nfsd_serv in terms of
+RCU. If nfsd_open_local_fh() finds that the client no longer sees valid
+nfsd objects (be it struct net or nn->nfsd_serv) it returns -ENXIO
+to nfs_local_open_fh() and the client will try to reestablish the
+LOCALIO resources needed by calling nfs_local_probe() again. This
+recovery is needed if/when an nfsd instance running in a container were
+to reboot while a LOCALIO client is connected to it.
+
+Once the client has an open nfsd_file pointer it will issue reads,
+writes and commits directly to the underlying local filesystem (normally
+done by the nfs server). As such, for these operations, the NFS client
+is issuing IO to the underlying local filesystem that it is sharing with
+the NFS server. See: fs/nfs/localio.c:nfs_local_doio() and
+fs/nfs/localio.c:nfs_local_commit().
+
+With normal NFS that makes use of RPC to issue IO to the server, if an
+application uses O_DIRECT the NFS client will bypass the pagecache but
+the NFS server will not. The NFS server's use of buffered IO allows
+applications to be less precise with their alignment when issuing IO to
+the NFS client. But if all applications properly align their IO, LOCALIO
+can be configured to use end-to-end O_DIRECT semantics from the NFS
+client to the underlying local filesystem that it shares with
+the NFS server, by setting the 'localio_O_DIRECT_semantics' nfs module
+parameter to Y, e.g.::
+
+ echo Y > /sys/module/nfs/parameters/localio_O_DIRECT_semantics
+
+Once enabled, it will cause LOCALIO to use end-to-end O_DIRECT semantics
+(but again, this may cause IO to fail if applications do not properly
+align their IO).
+
+Security
+========
+
+LOCALIO is only supported when UNIX-style authentication (AUTH_UNIX, aka
+AUTH_SYS) is used.
+
+Care is taken to ensure the same NFS security mechanisms are used
+(authentication, etc) regardless of whether LOCALIO or regular NFS
+access is used. The auth_domain established as part of the traditional
+NFS client access to the NFS server is also used for LOCALIO.
+
+Relative to containers, LOCALIO gives the client access to the network
+namespace the server has. This is required to allow the client to access
+the server's per-namespace nfsd_net struct. With traditional NFS, the
+client is afforded this same level of access (albeit in terms of the NFS
+protocol via SUNRPC). No other namespaces (user, mount, etc) have been
+altered or purposely extended from the server to the client.
+
+Module Parameters
+=================
+
+/sys/module/nfs/parameters/localio_enabled (bool)
+controls whether LOCALIO is enabled; defaults to Y. If client and server
+are local but 'localio_enabled' is set to N then LOCALIO will not be used.
+
+/sys/module/nfs/parameters/localio_O_DIRECT_semantics (bool)
+controls whether O_DIRECT extends down to the underlying filesystem;
+defaults to N. Application IO must be logical blocksize aligned, otherwise
+O_DIRECT will fail.
+
+/sys/module/nfsv3/parameters/nfs3_localio_probe_throttle (uint)
+controls whether NFSv3 read and write IOs will trigger (re)enabling of
+LOCALIO every N (nfs3_localio_probe_throttle) IOs; defaults to 0
+(disabled). Must be a power of 2; the admin keeps all the pieces if they
+misconfigure it (too low a value or a non-power-of-2).
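+
+For example, to probe again every 16 IOs::
+
+  echo 16 > /sys/module/nfsv3/parameters/nfs3_localio_probe_throttle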
+
+Testing
+=======
+
+The LOCALIO auxiliary protocol and associated NFS LOCALIO read, write
+and commit access have proven stable against various test scenarios:
+
+- Client and server both on the same host.
+
+- All permutations of client and server support enablement for both
+ local and remote client and server.
+
+- Testing against NFS storage products that don't support the LOCALIO
+ protocol was also performed.
+
+- Client on host, server within a container (for both v3 and v4.2).
+ The container testing was in terms of podman managed containers and
+ includes a successful container stop/restart scenario.
+
+- Formalizing these test scenarios in terms of existing test
+ infrastructure is ongoing. Initial regular coverage is provided in
+ terms of ktest running xfstests against a LOCALIO-enabled NFS loopback
+ mount configuration, and includes lockdep and KASAN coverage, see:
+ https://evilpiepirate.org/~testdashboard/ci?user=snitzer&branch=snitm-nfs-next
+ https://github.com/koverstreet/ktest
+
+- Various kdevops testing (in terms of "Chuck's BuildBot") has been
+ performed to regularly verify the LOCALIO changes haven't caused any
+ regressions to non-LOCALIO NFS use cases.
+
+- All of Hammerspace's various sanity tests pass with LOCALIO enabled
+ (this includes numerous pNFS and flexfiles tests).
diff --git a/Documentation/filesystems/nfs/reexport.rst b/Documentation/filesystems/nfs/reexport.rst
new file mode 100644
index 000000000000..044be965d75e
--- /dev/null
+++ b/Documentation/filesystems/nfs/reexport.rst
@@ -0,0 +1,117 @@
+Reexporting NFS filesystems
+===========================
+
+Overview
+--------
+
+It is possible to reexport an NFS filesystem over NFS. However, this
+feature comes with a number of limitations. Before trying it, we
+recommend some careful research to determine whether it will work for
+your purposes.
+
+A discussion of current known limitations follows.
+
+"fsid=" required, crossmnt broken
+---------------------------------
+
+We require the "fsid=" export option on any reexport of an NFS
+filesystem. You can use "uuidgen -r" to generate a unique argument.
+
+The "crossmnt" export does not propagate "fsid=", so it will not allow
+traversing into further nfs filesystems; if you wish to export nfs
+filesystems mounted under the exported filesystem, you'll need to export
+them explicitly, assigning each its own unique "fsid= option.
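+
+A hypothetical /etc/exports entry for such a reexport might therefore look
+like::
+
+  /srv/reexport  *(rw,fsid=76c5d6f8-9bf4-4d57-8d4a-0aa9f88b5c59)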
+
+Reboot recovery
+---------------
+
+The NFS protocol's normal reboot recovery mechanisms don't work for the
+case when the reexport server reboots because the source server has not
+rebooted, and so it is not in grace. Since the source server is not in
+grace, it cannot offer any guarantees that the file won't have been
+changed between the locks getting lost and any attempt to recover them.
+The same applies to delegations and any associated locks. Clients are
+not allowed to get file locks or delegations from a reexport server; any
+attempt will fail with an "operation not supported" error.
+
+Filehandle limits
+-----------------
+
+If the original server uses an X byte filehandle for a given object, the
+reexport server's filehandle for the reexported object will be X+22
+bytes, rounded up to the nearest multiple of four bytes.
+
+The result must fit into the RFC-mandated filehandle size limits:
+
++-------+-----------+
+| NFSv2 | 32 bytes |
++-------+-----------+
+| NFSv3 | 64 bytes |
++-------+-----------+
+| NFSv4 | 128 bytes |
++-------+-----------+
+
+So, for example, you will only be able to reexport a filesystem over
+NFSv2 if the original server gives you filehandles that fit in 10
+bytes--which is unlikely.
+
+In general there's no way to know the maximum filehandle size given out
+by an NFS server without asking the server vendor.
+
+But the following table gives a few examples. The first column is the
+typical length of the filehandle from a Linux server exporting the given
+filesystem, the second is the length after that nfs export is reexported
+by another Linux host:
+
++--------+-------------------+----------------+
+| | filehandle length | after reexport |
++========+===================+================+
+| ext4: | 28 bytes | 52 bytes |
++--------+-------------------+----------------+
+| xfs: | 32 bytes | 56 bytes |
++--------+-------------------+----------------+
+| btrfs: | 40 bytes | 64 bytes |
++--------+-------------------+----------------+
+
+All will therefore fit in an NFSv3 or NFSv4 filehandle after reexport,
+but none are reexportable over NFSv2.
+
+Linux server filehandles are a bit more complicated than this, though;
+for example:
+
+ - The (non-default) "subtreecheck" export option generally
+ requires another 4 to 8 bytes in the filehandle.
+ - If you export a subdirectory of a filesystem (instead of
+ exporting the filesystem root), that also usually adds 4 to 8
+ bytes.
+ - If you export over NFSv2, knfsd usually uses a shorter
+ filesystem identifier that saves 8 bytes.
+ - The root directory of an export uses a filehandle that is
+ shorter.
+
+As you can see, the 128-byte NFSv4 filehandle is large enough that
+you're unlikely to have trouble using NFSv4 to reexport any filesystem
+exported from a Linux server. In general, if the original server is
+something that also supports NFSv3, you're *probably* OK. Re-exporting
+over NFSv3 may be dicier, and reexporting over NFSv2 will probably
+never work.
+
+For more details of Linux filehandle structure, the best reference is
+the source code and comments; see in particular:
+
+ - include/linux/exportfs.h:enum fid_type
+ - include/uapi/linux/nfsd/nfsfh.h:struct nfs_fhbase_new
+ - fs/nfsd/nfsfh.c:set_version_and_fsid_type
+ - fs/nfs/export.c:nfs_encode_fh
+
+Open DENY bits ignored
+----------------------
+
+NFS since NFSv4 supports ALLOW and DENY bits taken from Windows, which
+allow you, for example, to open a file in a mode which forbids other
+read opens or write opens. The Linux client doesn't use them, and the
+server's support has always been incomplete: they are enforced only
+against other NFS users, not against processes accessing the exported
+filesystem locally. A reexport server will also not pass them along to
+the original server, so they will not be enforced between clients of
+different reexport servers.
diff --git a/Documentation/filesystems/nfs/rpc-cache.rst b/Documentation/filesystems/nfs/rpc-cache.rst
index bb164eea969b..339efd75016a 100644
--- a/Documentation/filesystems/nfs/rpc-cache.rst
+++ b/Documentation/filesystems/nfs/rpc-cache.rst
@@ -78,7 +78,7 @@ Creating a Cache
include taking references to shared objects.
void update(struct cache_head \*orig, struct cache_head \*new)
- Set the 'content' fileds in 'new' from 'orig'.
+ Set the 'content' fields in 'new' from 'orig'.
int cache_show(struct seq_file \*m, struct cache_detail \*cd, struct cache_head \*h)
Optional. Used to provide a /proc file that lists the
diff --git a/Documentation/filesystems/nfs/rpc-server-gss.rst b/Documentation/filesystems/nfs/rpc-server-gss.rst
index 812754576845..5c1a1c58fc27 100644
--- a/Documentation/filesystems/nfs/rpc-server-gss.rst
+++ b/Documentation/filesystems/nfs/rpc-server-gss.rst
@@ -10,13 +10,12 @@ purposes of authentication.)
RPCGSS is specified in a few IETF documents:
- - RFC2203 v1: http://tools.ietf.org/rfc/rfc2203.txt
- - RFC5403 v2: http://tools.ietf.org/rfc/rfc5403.txt
+ - RFC2203 v1: https://tools.ietf.org/rfc/rfc2203.txt
+ - RFC5403 v2: https://tools.ietf.org/rfc/rfc5403.txt
-and there is a 3rd version being proposed:
+There is a third version that we don't currently implement:
- - http://tools.ietf.org/id/draft-williams-rpcsecgssv3.txt
- (At draft n. 02 at the time of writing)
+ - RFC7861 v3: https://tools.ietf.org/rfc/rfc7861.txt
Background
==========
@@ -30,7 +29,7 @@ The Linux kernel, at the moment, supports only the KRB5 mechanism, and
depends on GSSAPI extensions that are KRB5 specific.
GSSAPI is a complex library, and implementing it completely in kernel is
-unwarranted. However GSSAPI operations are fundementally separable in 2
+unwarranted. However GSSAPI operations are fundamentally separable in 2
parts:
- initial context establishment
diff --git a/Documentation/filesystems/nilfs2.rst b/Documentation/filesystems/nilfs2.rst
index 6c49f04e9e0a..e3a5c8977f2c 100644
--- a/Documentation/filesystems/nilfs2.rst
+++ b/Documentation/filesystems/nilfs2.rst
@@ -231,7 +231,7 @@ file structures (nilfs_finfo), and per block structures (nilfs_binfo)::
The logs include regular files, directory files, symbolic link files
-and several meta data files. The mata data files are the files used
+and several meta data files. The meta data files are the files used
to maintain file system meta data. The current version of NILFS2 uses
the following meta data files::
diff --git a/Documentation/filesystems/ntfs.rst b/Documentation/filesystems/ntfs.rst
deleted file mode 100644
index 5bb093a26485..000000000000
--- a/Documentation/filesystems/ntfs.rst
+++ /dev/null
@@ -1,466 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-================================
-The Linux NTFS filesystem driver
-================================
-
-
-.. Table of contents
-
- - Overview
- - Web site
- - Features
- - Supported mount options
- - Known bugs and (mis-)features
- - Using NTFS volume and stripe sets
- - The Device-Mapper driver
- - The Software RAID / MD driver
- - Limitations when using the MD driver
-
-
-Overview
-========
-
-Linux-NTFS comes with a number of user-space programs known as ntfsprogs.
-These include mkntfs, a full-featured ntfs filesystem format utility,
-ntfsundelete used for recovering files that were unintentionally deleted
-from an NTFS volume and ntfsresize which is used to resize an NTFS partition.
-See the web site for more information.
-
-To mount an NTFS 1.2/3.x (Windows NT4/2000/XP/2003) volume, use the file
-system type 'ntfs'. The driver currently supports read-only mode (with no
-fault-tolerance, encryption or journalling) and very limited, but safe, write
-support.
-
-For fault tolerance and raid support (i.e. volume and stripe sets), you can
-use the kernel's Software RAID / MD driver. See section "Using Software RAID
-with NTFS" for details.
-
-
-Web site
-========
-
-There is plenty of additional information on the linux-ntfs web site
-at http://www.linux-ntfs.org/
-
-The web site has a lot of additional information, such as a comprehensive
-FAQ, documentation on the NTFS on-disk format, information on the Linux-NTFS
-userspace utilities, etc.
-
-
-Features
-========
-
-- This is a complete rewrite of the NTFS driver that used to be in the 2.4 and
- earlier kernels. This new driver implements NTFS read support and is
- functionally equivalent to the old ntfs driver and it also implements limited
- write support. The biggest limitation at present is that files/directories
- cannot be created or deleted. See below for the list of write features that
- are so far supported. Another limitation is that writing to compressed files
- is not implemented at all. Also, neither read nor write access to encrypted
- files is so far implemented.
-- The new driver has full support for sparse files on NTFS 3.x volumes which
- the old driver isn't happy with.
-- The new driver supports execution of binaries due to mmap() now being
- supported.
-- The new driver supports loopback mounting of files on NTFS which is used by
- some Linux distributions to enable the user to run Linux from an NTFS
- partition by creating a large file while in Windows and then loopback
- mounting the file while in Linux and creating a Linux filesystem on it that
- is used to install Linux on it.
-- A comparison of the two drivers using::
-
- time find . -type f -exec md5sum "{}" \;
-
- run three times in sequence with each driver (after a reboot) on a 1.4GiB
- NTFS partition, showed the new driver to be 20% faster in total time elapsed
- (from 9:43 minutes on average down to 7:53). The time spent in user space
- was unchanged but the time spent in the kernel was decreased by a factor of
- 2.5 (from 85 CPU seconds down to 33).
-- The driver does not support short file names in general. For backwards
- compatibility, we implement access to files using their short file names if
- they exist. The driver will not create short file names however, and a
- rename will discard any existing short file name.
-- The new driver supports exporting of mounted NTFS volumes via NFS.
-- The new driver supports async io (aio).
-- The new driver supports fsync(2), fdatasync(2), and msync(2).
-- The new driver supports readv(2) and writev(2).
-- The new driver supports access time updates (including mtime and ctime).
-- The new driver supports truncate(2) and open(2) with O_TRUNC. But at present
- only very limited support for highly fragmented files, i.e. ones which have
- their data attribute split across multiple extents, is included. Another
- limitation is that at present truncate(2) will never create sparse files,
- since to mark a file sparse we need to modify the directory entry for the
- file and we do not implement directory modifications yet.
-- The new driver supports write(2) which can both overwrite existing data and
- extend the file size so that you can write beyond the existing data. Also,
- writing into sparse regions is supported and the holes are filled in with
- clusters. But at present only limited support for highly fragmented files,
- i.e. ones which have their data attribute split across multiple extents, is
- included. Another limitation is that write(2) will never create sparse
- files, since to mark a file sparse we need to modify the directory entry for
- the file and we do not implement directory modifications yet.
-
-Supported mount options
-=======================
-
-In addition to the generic mount options described by the manual page for the
-mount command (man 8 mount, also see man 5 fstab), the NTFS driver supports the
-following mount options:
-
-======================= =======================================================
-iocharset=name Deprecated option. Still supported but please use
- nls=name in the future. See description for nls=name.
-
-nls=name Character set to use when returning file names.
- Unlike VFAT, NTFS suppresses names that contain
- unconvertible characters. Note that most character
- sets contain insufficient characters to represent all
- possible Unicode characters that can exist on NTFS.
- To be sure you are not missing any files, you are
- advised to use nls=utf8 which is capable of
- representing all Unicode characters.
-
-utf8=<bool> Option no longer supported. Currently mapped to
- nls=utf8 but please use nls=utf8 in the future and
- make sure utf8 is compiled either as module or into
- the kernel. See description for nls=name.
-
-uid=
-gid=
-umask= Provide default owner, group, and access mode mask.
- These options work as documented in mount(8). By
- default, the files/directories are owned by root and
- he/she has read and write permissions, as well as
- browse permission for directories. No one else has any
- access permissions. I.e. the mode on all files is by
- default rw------- and for directories rwx------, a
- consequence of the default fmask=0177 and dmask=0077.
- Using a umask of zero will grant all permissions to
- everyone, i.e. all files and directories will have mode
- rwxrwxrwx.
-
-fmask=
-dmask= Instead of specifying umask which applies both to
- files and directories, fmask applies only to files and
- dmask only to directories.
-
-sloppy=<BOOL> If sloppy is specified, ignore unknown mount options.
- Otherwise the default behaviour is to abort mount if
- any unknown options are found.
-
-show_sys_files=<BOOL> If show_sys_files is specified, show the system files
- in directory listings. Otherwise the default behaviour
- is to hide the system files.
- Note that even when show_sys_files is specified, "$MFT"
- will not be visible due to bugs/mis-features in glibc.
- Further, note that irrespective of show_sys_files, all
- files are accessible by name, i.e. you can always do
- "ls -l \$UpCase" for example to specifically show the
- system file containing the Unicode upcase table.
-
-case_sensitive=<BOOL> If case_sensitive is specified, treat all file names as
- case sensitive and create file names in the POSIX
- namespace. Otherwise the default behaviour is to treat
- file names as case insensitive and to create file names
- in the WIN32/LONG name space. Note, the Linux NTFS
- driver will never create short file names and will
- remove them on rename/delete of the corresponding long
- file name.
- Note that files remain accessible via their short file
- name, if it exists. If case_sensitive, you will need
- to provide the correct case of the short file name.
-
-disable_sparse=<BOOL> If disable_sparse is specified, creation of sparse
- regions, i.e. holes, inside files is disabled for the
- volume (for the duration of this mount only). By
- default, creation of sparse regions is enabled, which
- is consistent with the behaviour of traditional Unix
- filesystems.
-
-errors=opt What to do when critical filesystem errors are found.
- Following values can be used for "opt":
-
- ======== =========================================
- continue DEFAULT, try to clean-up as much as
- possible, e.g. marking a corrupt inode as
- bad so it is no longer accessed, and then
- continue.
- recover At present only supported is recovery of
- the boot sector from the backup copy.
- If read-only mount, the recovery is done
- in memory only and not written to disk.
- ======== =========================================
-
- Note that the options are additive, i.e. specifying::
-
- errors=continue,errors=recover
-
- means the driver will attempt to recover and if that
- fails it will clean-up as much as possible and
- continue.
-
-mft_zone_multiplier= Set the MFT zone multiplier for the volume (this
- setting is not persistent across mounts and can be
- changed from mount to mount but cannot be changed on
- remount). Values of 1 to 4 are allowed, 1 being the
- default. The MFT zone multiplier determines how much
- space is reserved for the MFT on the volume. If all
- other space is used up, then the MFT zone will be
- shrunk dynamically, so this has no impact on the
- amount of free space. However, it can have an impact
- on performance by affecting fragmentation of the MFT.
- In general use the default. If you have a lot of small
- files then use a higher value. The values have the
- following meaning:
-
- ===== =================================
- Value MFT zone size (% of volume size)
- ===== =================================
- 1 12.5%
- 2 25%
- 3 37.5%
- 4 50%
- ===== =================================
-
- Note this option is irrelevant for read-only mounts.
-======================= =======================================================
-
-
-Known bugs and (mis-)features
-=============================
-
-- The link count on each directory inode entry is set to 1, due to Linux not
- supporting directory hard links. This may well confuse some user space
- applications, since the directory names will have the same inode numbers.
- This also speeds up ntfs_read_inode() immensely. And we haven't found any
- problems with this approach so far. If you find a problem with this, please
- let us know.
-
-
-Please send bug reports/comments/feedback/abuse to the Linux-NTFS development
-list at sourceforge: linux-ntfs-dev@lists.sourceforge.net
-
-
-Using NTFS volume and stripe sets
-=================================
-
-For support of volume and stripe sets, you can either use the kernel's
-Device-Mapper driver or the kernel's Software RAID / MD driver. The former is
-the recommended one to use for linear raid. But the latter is required for
-raid level 5. For striping and mirroring, either driver should work fine.
-
-
-The Device-Mapper driver
-------------------------
-
-You will need to create a table of the components of the volume/stripe set and
-how they fit together and load this into the kernel using the dmsetup utility
-(see man 8 dmsetup).
-
-Linear volume sets, i.e. linear raid, has been tested and works fine. Even
-though untested, there is no reason why stripe sets, i.e. raid level 0, and
-mirrors, i.e. raid level 1 should not work, too. Stripes with parity, i.e.
-raid level 5, unfortunately cannot work yet because the current version of the
-Device-Mapper driver does not support raid level 5. You may be able to use the
-Software RAID / MD driver for raid level 5, see the next section for details.
-
-To create the table describing your volume you will need to know each of its
-components and their sizes in sectors, i.e. multiples of 512-byte blocks.
-
-For NT4 fault tolerant volumes you can obtain the sizes using fdisk. So for
-example if one of your partitions is /dev/hda2 you would do::
-
- $ fdisk -ul /dev/hda
-
- Disk /dev/hda: 81.9 GB, 81964302336 bytes
- 255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors
- Units = sectors of 1 * 512 = 512 bytes
-
- Device Boot Start End Blocks Id System
- /dev/hda1 * 63 4209029 2104483+ 83 Linux
- /dev/hda2 4209030 37768814 16779892+ 86 NTFS
- /dev/hda3 37768815 46170809 4200997+ 83 Linux
-
-And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 =
-33559785 sectors.
-
-For Win2k and later dynamic disks, you can for example use the ldminfo utility
-which is part of the Linux LDM tools (the latest version at the time of
-writing is linux-ldm-0.0.8.tar.bz2). You can download it from:
-
- http://www.linux-ntfs.org/
-
-Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go
-into it (cd linux-ldm-0.0.8) and change to the test directory (cd test). You
-will find the precompiled (i386) ldminfo utility there. NOTE: You will not be
-able to compile this yourself easily so use the binary version!
-
-Then you would use ldminfo in dump mode to obtain the necessary information::
-
- $ ./ldminfo --dump /dev/hda
-
-This would dump the LDM database found on /dev/hda which describes all of your
-dynamic disks and all the volumes on them. At the bottom you will see the
-VOLUME DEFINITIONS section which is all you really need. You may need to look
-further above to determine which of the disks in the volume definitions is
-which device in Linux. Hint: Run ldminfo on each of your dynamic disks and
-look at the Disk Id close to the top of the output for each (the PRIVATE HEADER
-section). You can then find these Disk Ids in the VBLK DATABASE section in the
-<Disk> components where you will get the LDM Name for the disk that is found in
-the VOLUME DEFINITIONS section.
-
-Note you will also need to enable the LDM driver in the Linux kernel. If your
-distribution did not enable it, you will need to recompile the kernel with it
-enabled. This will create the LDM partitions on each device at boot time. You
-would then use those devices (for /dev/hda they would be /dev/hda1, 2, 3, etc)
-in the Device-Mapper table.
-
-You can also bypass using the LDM driver by using the main device (e.g.
-/dev/hda) and then using the offsets of the LDM partitions into this device as
-the "Start sector of device" when creating the table. Once again ldminfo would
-give you the correct information to do this.
-
-Assuming you know all your devices and their sizes things are easy.
-
-For a linear raid the table would look like this (note all values are in
-512-byte sectors)::
-
- # Offset into Size of this Raid type Device Start sector
- # volume device of device
- 0 1028161 linear /dev/hda1 0
- 1028161 3903762 linear /dev/hdb2 0
- 4931923 2103211 linear /dev/hdc1 0
-
-For a striped volume, i.e. raid level 0, you will need to know the chunk size
-you used when creating the volume. Windows uses 64kiB as the default, so it
-will probably be this unless you changes the defaults when creating the array.
-
-For a raid level 0 the table would look like this (note all values are in
-512-byte sectors)::
-
- # Offset Size Raid Number Chunk 1st Start 2nd Start
- # into of the type of size Device in Device in
- # volume volume stripes device device
- 0 2056320 striped 2 128 /dev/hda1 0 /dev/hdb1 0
-
-If there are more than two devices, just add each of them to the end of the
-line.
-
-Finally, for a mirrored volume, i.e. raid level 1, the table would look like
-this (note all values are in 512-byte sectors)::
-
- # Ofs Size Raid Log Number Region Should Number Source Start Target Start
- # in of the type type of log size sync? of Device in Device in
- # vol volume params mirrors Device Device
- 0 2056320 mirror core 2 16 nosync 2 /dev/hda1 0 /dev/hdb1 0
-
-If you are mirroring to multiple devices you can specify further targets at the
-end of the line.
-
-Note the "Should sync?" parameter "nosync" means that the two mirrors are
-already in sync which will be the case on a clean shutdown of Windows. If the
-mirrors are not clean, you can specify the "sync" option instead of "nosync"
-and the Device-Mapper driver will then copy the entirety of the "Source Device"
-to the "Target Device" or if you specified multiple target devices to all of
-them.
-
-Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1),
-and hand it over to dmsetup to work with, like so::
-
- $ dmsetup create myvolume1 /etc/ntfsvolume1
-
-You can obviously replace "myvolume1" with whatever name you like.
-
-If it all worked, you will now have the device /dev/device-mapper/myvolume1
-which you can then just use as an argument to the mount command as usual to
-mount the ntfs volume. For example::
-
- $ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1
-
-(You need to create the directory /mnt/myvol1 first and of course you can use
-anything you like instead of /mnt/myvol1 as long as it is an existing
-directory.)
-
-It is advisable to do the mount read-only to see if the volume has been setup
-correctly to avoid the possibility of causing damage to the data on the ntfs
-volume.
-
-
-The Software RAID / MD driver
------------------------------
-
-An alternative to using the Device-Mapper driver is to use the kernel's
-Software RAID / MD driver. For which you need to set up your /etc/raidtab
-appropriately (see man 5 raidtab).
-
-Linear volume sets, i.e. linear raid, as well as stripe sets, i.e. raid level
-0, have been tested and work fine (though see section "Limitations when using
-the MD driver with NTFS volumes" especially if you want to use linear raid).
-Even though untested, there is no reason why mirrors, i.e. raid level 1, and
-stripes with parity, i.e. raid level 5, should not work, too.
-
-You have to use the "persistent-superblock 0" option for each raid-disk in the
-NTFS volume/stripe you are configuring in /etc/raidtab as the persistent
-superblock used by the MD driver would damage the NTFS volume.
-
-Windows by default uses a stripe chunk size of 64k, so you probably want the
-"chunk-size 64k" option for each raid-disk, too.
-
-For example, if you have a stripe set consisting of two partitions /dev/hda5
-and /dev/hdb1 your /etc/raidtab would look like this::
-
- raiddev /dev/md0
- raid-level 0
- nr-raid-disks 2
- nr-spare-disks 0
- persistent-superblock 0
- chunk-size 64k
- device /dev/hda5
- raid-disk 0
- device /dev/hdb1
- raid-disk 1
-
-For linear raid, just change the raid-level above to "raid-level linear", for
-mirrors, change it to "raid-level 1", and for stripe sets with parity, change
-it to "raid-level 5".
-
-Note for stripe sets with parity you will also need to tell the MD driver
-which parity algorithm to use by specifying the option "parity-algorithm
-which", where you need to replace "which" with the name of the algorithm to
-use (see man 5 raidtab for available algorithms) and you will have to try the
-different available algorithms until you find one that works. Make sure you
-are working read-only when playing with this as you may damage your data
-otherwise. If you find which algorithm works please let us know (email the
-linux-ntfs developers list linux-ntfs-dev@lists.sourceforge.net or drop in on
-IRC in channel #ntfs on the irc.freenode.net network) so we can update this
-documentation.
-
-Once the raidtab is setup, run for example raid0run -a to start all devices or
-raid0run /dev/md0 to start a particular md device, in this case /dev/md0.
-
-Then just use the mount command as usual to mount the ntfs volume using for
-example::
-
- mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume
-
-It is advisable to do the mount read-only to see if the md volume has been
-setup correctly to avoid the possibility of causing damage to the data on the
-ntfs volume.
-
-
-Limitations when using the Software RAID / MD driver
------------------------------------------------------
-
-Using the md driver will not work properly if any of your NTFS partitions have
-an odd number of sectors. This is especially important for linear raid as all
-data after the first partition with an odd number of sectors will be offset by
-one or more sectors so if you mount such a partition with write support you
-will cause massive damage to the data on the volume which will only become
-apparent when you try to use the volume again under Windows.
-
-So when using linear raid, make sure that all your partitions have an even
-number of sectors BEFORE attempting to use it. You have been warned!
-
-Even better is to simply use the Device-Mapper for linear raid and then you do
-not have this problem with odd numbers of sectors.
diff --git a/Documentation/filesystems/ntfs3.rst b/Documentation/filesystems/ntfs3.rst
new file mode 100644
index 000000000000..2b86a9b3a6de
--- /dev/null
+++ b/Documentation/filesystems/ntfs3.rst
@@ -0,0 +1,123 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====
+NTFS3
+=====
+
+Summary and Features
+====================
+
+NTFS3 is a fully functional NTFS read-write driver. The driver works with
+NTFS versions up to 3.1. The file system type to use on mount is *ntfs3*.
+
+- This driver implements NTFS read/write support for normal, sparse and
+ compressed files.
+- Supports native journal replaying.
+- Supports NFS export of mounted NTFS volumes.
+- Supports extended attributes. Predefined extended attributes (an access
+ example follows this list):
+
+ - *system.ntfs_security* gets/sets the security descriptor
+
+ (of type SECURITY_DESCRIPTOR_RELATIVE)
+
+ - *system.ntfs_attrib* gets/sets ntfs file/dir attributes.
+
+ Note: applied to empty files, this allows switching the type between
+ sparse (0x200), compressed (0x800) and normal.
+
+ - *system.ntfs_attrib_be* gets/sets ntfs file/dir attributes.
+
+ Same value as system.ntfs_attrib but always represented as big-endian
+ (the endianness of system.ntfs_attrib matches that of the CPU).
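+
+For illustration, these attributes can be accessed with the getfattr and
+setfattr utilities from the attr package (the file name is hypothetical,
+and 0x800 is the compressed flag mentioned above)::
+
+    # read the attribute bits as a big-endian hex dump
+    getfattr -e hex -n system.ntfs_attrib_be file.bin
+    # mark an empty file as compressed
+    setfattr -n system.ntfs_attrib_be -v 0x00000800 file.bin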
+
+Mount Options
+=============
+
+The list below describes mount options supported by the NTFS3 driver, in
+addition to the generic ones. Every mount option can be negated by prefixing
+it with **no**; where an option is listed here already carrying the **no**
+prefix, the default is the form without **no**. A combined example follows
+the table.
+
+.. flat-table::
+ :widths: 1 5
+ :fill-cells:
+
+ * - iocharset=name
+ - This option informs the driver how to interpret path strings and
+ translate them to Unicode and back. If this option is not set, the
+ default codepage will be used (CONFIG_NLS_DEFAULT).
+
+ Example: iocharset=utf8
+
+ * - uid=
+ - :rspan:`1`
+ * - gid=
+
+ * - umask=
+ - Controls the default permissions for files/directories created after
+ the NTFS volume is mounted.
+
+ * - dmask=
+ - :rspan:`1` Instead of specifying umask which applies both to files and
+ directories, fmask applies only to files and dmask only to directories.
+ * - fmask=
+
+ * - nohidden
+ - Files with the Windows-specific HIDDEN (FILE_ATTRIBUTE_HIDDEN) attribute
+ will not be shown under Linux.
+
+ * - sys_immutable
+ - Files with the Windows-specific SYSTEM (FILE_ATTRIBUTE_SYSTEM) attribute
+ will be marked as system immutable files.
+
+ * - hide_dot_files
+ - Updates the Windows-specific HIDDEN (FILE_ATTRIBUTE_HIDDEN) attribute
+ when creating and moving or renaming files. Files whose names start
+ with a dot will have the HIDDEN attribute set and files whose names
+ do not start with a dot will have it unset.
+
+ * - windows_names
+ - Prevents the creation of files and directories with a name not allowed
+ by Windows, either because it contains a character that is not allowed
+ (the characters " * / : < > ? \\ | and those whose code is less than
+ 0x20), because the name (with or without extension) is a reserved file
+ name (CON, AUX, NUL, PRN, LPT1-9, COM1-9) or because the last character
+ is a space or a dot. Such existing files can still be read and renamed.
+
+ * - discard
+ - Enable support of the TRIM command for improved performance on delete
+ operations, which is recommended for use with solid-state drives
+ (SSDs).
+
+ * - force
+ - Forces the driver to mount partitions even if the volume is marked dirty.
+ Not recommended for use.
+
+ * - sparse
+ - Create new files as sparse.
+
+ * - showmeta
+ - Use this parameter to show all meta-files (System Files) on a mounted
+ NTFS partition. By default, all meta-files are hidden.
+
+ * - prealloc
+ - Aggressively preallocate space for files whose size is increasing through
+ writes. This decreases fragmentation in case of parallel write operations
+ to different files.
+
+ * - acl
+ - Support POSIX ACLs (Access Control Lists), effective only if ACL support
+ is enabled in the kernel. Not to be confused with NTFS ACLs.
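+
+As the combined example promised above, a hypothetical mount of /dev/sdb1
+using several of the options described in this table::
+
+    mount -t ntfs3 -o iocharset=utf8,uid=1000,gid=1000,windows_names,acl /dev/sdb1 /mnt/ntfs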
+
+Todo list
+=========
+- Full journaling support over JBD. Currently journal replaying is supported,
+ which is not necessarily as effective as JBD would be.
+
+References
+==========
+- Commercial version of the NTFS driver for Linux.
+ https://www.paragon-software.com/home/ntfs-linux-professional/
+
+- Direct e-mail address for feedback and requests on the NTFS3 implementation.
+ almaz.alexandrovich@paragon-software.com
diff --git a/Documentation/filesystems/ocfs2.rst b/Documentation/filesystems/ocfs2.rst
index 412386bc6506..5827062995cb 100644
--- a/Documentation/filesystems/ocfs2.rst
+++ b/Documentation/filesystems/ocfs2.rst
@@ -14,7 +14,7 @@ get "mount.ocfs2" and "ocfs2_hb_ctl".
Project web page: http://ocfs2.wiki.kernel.org
Tools git tree: https://github.com/markfasheh/ocfs2-tools
-OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/
+OCFS2 mailing lists: https://subspace.kernel.org/lists.linux.dev.html
All code copyright 2005 Oracle except when otherwise noted.
diff --git a/Documentation/filesystems/omfs.rst b/Documentation/filesystems/omfs.rst
index 4c8bb3074169..a104c25b7a2f 100644
--- a/Documentation/filesystems/omfs.rst
+++ b/Documentation/filesystems/omfs.rst
@@ -24,7 +24,7 @@ More information is available at:
Various utilities, including mkomfs and omfsck, are included with
omfsprogs, available at:
- http://bobcopeland.com/karma/
+ https://bobcopeland.com/karma/
Instructions are included in its README.
diff --git a/Documentation/filesystems/orangefs.rst b/Documentation/filesystems/orangefs.rst
index 463e37694250..931159e61796 100644
--- a/Documentation/filesystems/orangefs.rst
+++ b/Documentation/filesystems/orangefs.rst
@@ -274,7 +274,7 @@ then contains:
of kcalloced memory. This memory is used as an array of pointers
to each of the pages in the IO buffer through a call to get_user_pages.
* desc_array - a pointer to ``desc_count * (sizeof(struct orangefs_bufmap_desc))``
- bytes of kcalloced memory. This memory is further intialized:
+ bytes of kcalloced memory. This memory is further initialized:
user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc
structure. user_desc->ptr points to the IO buffer.
diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
index 660dbaf0b9b8..4133a336486d 100644
--- a/Documentation/filesystems/overlayfs.rst
+++ b/Documentation/filesystems/overlayfs.rst
@@ -39,18 +39,18 @@ objects in the original filesystem.
On 64bit systems, even if all overlay layers are not on the same
underlying filesystem, the same compliant behavior could be achieved
with the "xino" feature. The "xino" feature composes a unique object
-identifier from the real object st_ino and an underlying fsid index.
-
-If all underlying filesystems support NFS file handles and export file
-handles with 32bit inode number encoding (e.g. ext4), overlay filesystem
-will use the high inode number bits for fsid. Even when the underlying
-filesystem uses 64bit inode numbers, users can still enable the "xino"
-feature with the "-o xino=on" overlay mount option. That is useful for the
-case of underlying filesystems like xfs and tmpfs, which use 64bit inode
-numbers, but are very unlikely to use the high inode number bits. In case
+identifier from the real object st_ino and an underlying fsid number.
+The "xino" feature uses the high inode number bits for fsid, because the
+underlying filesystems rarely use the high inode number bits. In case
the underlying inode number does overflow into the high xino bits, overlay
filesystem will fall back to the non xino behavior for that inode.
+The "xino" feature can be enabled with the "-o xino=on" overlay mount option.
+If all underlying filesystems support NFS file handles, the value of st_ino
+for overlay filesystem objects is not only unique, but also persistent over
+the lifetime of the filesystem. The "-o xino=auto" overlay mount option
+enables the "xino" feature only if the persistent st_ino requirement is met.
+
The following table summarizes what can be expected in different overlay
configurations.
@@ -66,14 +66,13 @@ Inode properties
| All layers | Y | Y | Y | Y | Y | Y | Y | Y |
| on same fs | | | | | | | | |
+--------------+-----+------+-----+------+--------+--------+--------+-------+
-| Layers not | N | Y | Y | N | N | Y | N | Y |
+| Layers not | N | N | Y | N | N | Y | N | Y |
| on same fs, | | | | | | | | |
| xino=off | | | | | | | | |
+--------------+-----+------+-----+------+--------+--------+--------+-------+
| xino=on/auto | Y | Y | Y | Y | Y | Y | Y | Y |
-| | | | | | | | | |
+--------------+-----+------+-----+------+--------+--------+--------+-------+
-| xino=on/auto,| N | Y | Y | N | N | Y | N | Y |
+| xino=on/auto,| N | N | Y | N | N | Y | N | Y |
| ino overflow | | | | | | | | |
+--------------+-----+------+-----+------+--------+--------+--------+-------+
@@ -81,7 +80,6 @@ Inode properties
/proc files, such as /proc/locks and /proc/self/fdinfo/<fd> of an inotify
file descriptor.
-
Upper and Lower
---------------
@@ -97,11 +95,13 @@ directory trees to be in the same filesystem and there is no
requirement that the root of a filesystem be given for either upper or
lower.
-The lower filesystem can be any filesystem supported by Linux and does
-not need to be writable. The lower filesystem can even be another
-overlayfs. The upper filesystem will normally be writable and if it
-is it must support the creation of trusted.* extended attributes, and
-must provide valid d_type in readdir responses, so NFS is not suitable.
+A wide range of filesystems supported by Linux can be the lower filesystem,
+but not all filesystems that are mountable by Linux have the features
+needed for OverlayFS to work. The lower filesystem does not need to be
+writable. The lower filesystem can even be another overlayfs. The upper
+filesystem will normally be writable and if it is it must support the
+creation of trusted.* and/or user.* extended attributes, and must provide
+valid d_type in readdir responses, so NFS is not suitable.
A read-only overlay of two read-only filesystems may use any
filesystem type.
@@ -118,7 +118,7 @@ Where both upper and lower objects are directories, a merged directory
is formed.
At mount time, the two directories given as mount options "lowerdir" and
-"upperdir" are combined into a merged directory:
+"upperdir" are combined into a merged directory::
mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,\
workdir=/work /merged
@@ -145,7 +145,9 @@ filesystem, an overlay filesystem needs to record in the upper filesystem
that files have been removed. This is done using whiteouts and opaque
directories (non-directories are always opaque).
-A whiteout is created as a character device with 0/0 device number.
+A whiteout is created as a character device with 0/0 device number or
+as a zero-size regular file with the xattr "trusted.overlay.whiteout".
+
When a whiteout is found in the upper level of a merged directory, any
matching name in the lower level is ignored, and the whiteout itself
is also hidden.
@@ -154,6 +156,13 @@ A directory is made opaque by setting the xattr "trusted.overlay.opaque"
to "y". Where the upper filesystem contains an opaque directory, any
directory in the lower filesystem with the same name is ignored.
+An opaque directory should not contain any whiteouts, because they do not
+serve any purpose. A merge directory containing regular files with the xattr
+"trusted.overlay.whiteout", should be additionally marked by setting the xattr
+"trusted.overlay.opaque" to "x" on the merge directory itself.
+This is needed to avoid the overhead of checking the "trusted.overlay.whiteout"
+on all entries during readdir in the common case.
+
readdir
-------
@@ -172,12 +181,12 @@ directory is being read. This is unlikely to be noticed by many
programs.
seek offsets are assigned sequentially when the directories are read.
-Thus if
+Thus if:
- - read part of a directory
- - remember an offset, and close the directory
- - re-open the directory some time later
- - seek to the remembered offset
+ - read part of a directory
+ - remember an offset, and close the directory
+ - re-open the directory some time later
+ - seek to the remembered offset
there may be little correlation between the old and new locations in
the list of filenames, particularly if anything has changed in the
@@ -195,7 +204,7 @@ handle it in two different ways:
1. return EXDEV error: this error is returned by rename(2) when trying to
move a file or directory across filesystem boundaries. Hence
- applications are usually prepared to hande this error (mv(1) for example
+ applications are usually prepared to handle this error (mv(1) for example
recursively copies the directory tree). This is the default behavior.
2. If the "redirect_dir" feature is enabled, then the directory will be
@@ -231,12 +240,11 @@ Mount options:
Redirects are enabled.
- "redirect_dir=follow":
Redirects are not created, but followed.
-- "redirect_dir=off":
- Redirects are not created and only followed if "redirect_always_follow"
- feature is enabled in the kernel/module config.
- "redirect_dir=nofollow":
- Redirects are not created and not followed (equivalent to "redirect_dir=off"
- if "redirect_always_follow" feature is not enabled).
+ Redirects are not created and not followed.
+- "redirect_dir=off":
+ If "redirect_always_follow" is enabled in the kernel/module config,
+ this "off" translates to "follow", otherwise it translates to "nofollow".
When the NFS export feature is enabled, every copied up directory is
indexed by the file handle of the lower inode and a file handle of the
@@ -258,7 +266,7 @@ Non-directories
Objects that are not directories (files, symlinks, device-special
files etc.) are presented either from the upper or lower filesystem as
appropriate. When a file in the lower filesystem is accessed in a way
-the requires write-access, such as opening for write access, changing
+that requires write-access, such as opening for write access, changing
some metadata etc., the file is first copied from the lower filesystem
to the upper filesystem (copy_up). Note that creating a hard-link
also requires copy_up, though of course creation of a symlink does
@@ -284,21 +292,35 @@ rename or unlink will of course be noticed and handled).
Permission model
----------------
+An overlay filesystem stashes credentials that will be used when
+accessing lower or upper filesystems.
+
+In the old mount api the credentials of the task calling mount(2) are
+stashed. In the new mount api the credentials of the task creating the
+superblock through FSCONFIG_CMD_CREATE command of fsconfig(2) are
+stashed.
+
+Starting with kernel v6.15 it is possible to use the "override_creds"
+mount option which will cause the credentials of the calling task to be
+recorded. Note that "override_creds" is only meaningful when used with
+the new mount api as the old mount api combines setting options and
+superblock creation in a single mount(2) syscall.
+
Permission checking in the overlay filesystem follows these principles:
1) permission check SHOULD return the same result before and after copy up
2) task creating the overlay mount MUST NOT gain additional privileges
- 3) non-mounting task MAY gain additional privileges through the overlay,
- compared to direct access on underlying lower or upper filesystems
+ 3) task[*] MAY gain additional privileges through the overlay,
+ compared to direct access on underlying lower or upper filesystems
-This is achieved by performing two permission checks on each access
+This is achieved by performing two permission checks on each access:
a) check if current task is allowed access based on local DAC (owner,
group, mode and posix acl), as well as MAC checks
- b) check if mounting task would be allowed real operation on lower or
+ b) check if stashed credentials would be allowed real operation on lower or
upper layer based on underlying filesystem permissions, again including
MAC checks
@@ -307,16 +329,16 @@ are copied up. On the other hand it can result in server enforced
permissions (used by NFS, for example) being ignored (3).
Check (b) ensures that no task gains permissions to underlying layers that
-the mounting task does not have (2). This also means that it is possible
+the stashed credentials do not have (2). This also means that it is possible
to create setups where the consistency rule (1) does not hold; normally,
-however, the mounting task will have sufficient privileges to perform all
-operations.
+however, the stashed credentials will have sufficient privileges to
+perform all operations.
-Another way to demonstrate this model is drawing parallels between
+Another way to demonstrate this model is drawing parallels between::
mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,... /merged
-and
+and::
cp -a /lower /upper
mount --bind /upper /merged
@@ -328,8 +350,8 @@ the time of copy (on-demand vs. up-front).
Multiple lower layers
---------------------
-Multiple lower layers can now be given using the the colon (":") as a
-separator character between the directory names. For example:
+Multiple lower layers can now be given using the colon (":") as a
+separator character between the directory names. For example::
mount -t overlay overlay -olowerdir=/lower1:/lower2:/lower3 /merged
@@ -340,14 +362,30 @@ The specified lower directories will be stacked beginning from the
rightmost one and going left. In the above example lower1 will be the
top, lower2 the middle and lower3 the bottom layer.
+Note: directory names containing colons can be provided as lower layer by
+escaping the colons with a single backslash. For example::
+
+ mount -t overlay overlay -olowerdir=/a\:lower\:\:dir /merged
+
+Since kernel version v6.8, directory names containing colons can also
+be configured as lower layer using the "lowerdir+" mount options and the
+fsconfig syscall from new mount api. For example::
+
+ fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir+", "/a:lower::dir", 0);
+
+In the latter case, colons in lower layer directory names will be escaped
+as octal characters (\072) when displayed in /proc/self/mountinfo.
Metadata only copy up
---------------------
-When metadata only copy up feature is enabled, overlayfs will only copy
+When the "metacopy" feature is enabled, overlayfs will only copy
up metadata (as opposed to whole file), when a metadata specific operation
-like chown/chmod is performed. Full file will be copied up later when
-file is opened for WRITE operation.
+like chown/chmod is performed. An upper file in this state is marked with
+the "trusted.overlay.metacopy" xattr, which indicates that the upper file
+contains no data. The data will be copied up later when the file is opened
+for a WRITE operation. After the lower file's data is copied up,
+the "trusted.overlay.metacopy" xattr is removed from the upper file.
In other words, this is a delayed data copy up operation and data is copied
up only when there is a need to actually modify the data.
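
For example, a minimal sketch of observing a metadata-only copy up (paths
and the user name are illustrative; run as root)::

    mount -t overlay overlay -o lowerdir=/lower,upperdir=/upper,workdir=/work,metacopy=on /merged
    chown someuser /merged/bigfile          # copies up metadata only
    getfattr -n trusted.overlay.metacopy /upper/bigfile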
@@ -371,6 +409,122 @@ conflict with metacopy=on, and will result in an error.
[*] redirect_dir=follow only conflicts with metacopy=on if upperdir=... is
given.
+
+Data-only lower layers
+----------------------
+
+With "metacopy" feature enabled, an overlayfs regular file may be a composition
+of information from up to three different layers:
+
+ 1) metadata from a file in the upper layer
+
+ 2) st_ino and st_dev object identifier from a file in a lower layer
+
+ 3) data from a file in another lower layer (further below)
+
+The "lower data" file can be on any lower layer, except from the top most
+lower layer.
+
+Below the topmost lower layer, any number of lowermost layers may be defined
+as "data-only" lower layers, using double colon ("::") separators.
+A normal lower layer is not allowed to be below a data-only layer, so single
+colon separators are not allowed to the right of double colon ("::") separators.
+
+
+For example::
+
+ mount -t overlay overlay -olowerdir=/l1:/l2:/l3::/do1::/do2 /merged
+
+The paths of files in the "data-only" lower layers are not visible in the
+merged overlayfs directories and the metadata and st_ino/st_dev of files
+in the "data-only" lower layers are not visible in overlayfs inodes.
+
+Only the data of the files in the "data-only" lower layers may be visible
+when a "metacopy" file in one of the lower layers above it, has a "redirect"
+to the absolute path of the "lower data" file in the "data-only" lower layer.
+
+Instead of explicitly enabling "metacopy=on" it is sufficient to specify at
+least one data-only layer to enable redirection of data to a data-only layer.
+In this case other forms of metacopy are rejected. Note: this way data-only
+layers may be used toghether with "userxattr", in which case careful attention
+must be given to privileges needed to change the "user.overlay.redirect" xattr
+to prevent misuse.
+
+Since kernel version v6.8, "data-only" lower layers can also be added using
+the "datadir+" mount options and the fsconfig syscall from new mount api.
+For example::
+
+ fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir+", "/l1", 0);
+ fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir+", "/l2", 0);
+ fsconfig(fs_fd, FSCONFIG_SET_STRING, "lowerdir+", "/l3", 0);
+ fsconfig(fs_fd, FSCONFIG_SET_STRING, "datadir+", "/do1", 0);
+ fsconfig(fs_fd, FSCONFIG_SET_STRING, "datadir+", "/do2", 0);
+
+
+Specifying layers via file descriptors
+--------------------------------------
+
+Since kernel v6.13, overlayfs supports specifying layers via file descriptors in
+addition to specifying them as paths. This feature is available for the
+"datadir+", "lowerdir+", "upperdir", and "workdir+" mount options with the
+fsconfig syscall from the new mount api::
+
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower1);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower2);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower3);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "datadir+", NULL, fd_data1);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "datadir+", NULL, fd_data2);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "workdir", NULL, fd_work);
+ fsconfig(fs_fd, FSCONFIG_SET_FD, "upperdir", NULL, fd_upper);
+
+
+fs-verity support
+-----------------
+
+During metadata copy up of a lower file, if the source file has
+fs-verity enabled and overlay verity support is enabled, then the
+digest of the lower file is added to the "trusted.overlay.metacopy"
+xattr. This is then used to verify the content of the lower file
+each time the metacopy file is opened.
+
+When a layer containing verity xattrs is used, it means that any such
+metacopy file in the upper layer is guaranteed to match the content
+that was in the lower at the time of the copy-up. If at any time
+(during a mount, after a remount, etc) such a file in the lower is
+replaced or modified in any way, access to the corresponding file in
+overlayfs will result in EIO errors (either on open, due to overlayfs
+digest check, or from a later read due to fs-verity) and a detailed
+error is printed to the kernel logs. For more details of how fs-verity
+file access works, see :ref:`Documentation/filesystems/fsverity.rst
+<accessing_verity_files>`.
+
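+For example, a rough sketch with hypothetical paths, using the fsverity
+tool from fsverity-utils to protect a lower file before mounting with
+verity checking enabled::
+
+    fsverity enable /lower/app/data.img
+    mount -t overlay overlay -o lowerdir=/lower,upperdir=/upper,workdir=/work,metacopy=on,verity=on /merged
+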
+Verity can be used as a general robustness check to detect accidental
+changes in the overlayfs directories in use. But, with additional care
+it can also give more powerful guarantees. For example, if the upper
+layer is fully trusted (by using dm-verity or something similar), then
+an untrusted lower layer can be used to supply validated file content
+for all metacopy files. If additionally the untrusted lower
+directories are specified as "Data-only", then they can only supply
+such file content, and the entire mount can be trusted to match the
+upper layer.
+
+This feature is controlled by the "verity" mount option, which
+supports these values:
+
+- "off":
+ The metacopy digest is never generated or used. This is the
+ default if verity option is not specified.
+- "on":
+ Whenever a metacopy file specifies an expected digest, the
+ corresponding data file must match the specified digest. When
+ generating a metacopy file the verity digest will be set in it
+ based on the source file (if it has one).
+- "require":
+ Same as "on", but additionally all metacopy files must specify a
+ digest (or EIO is returned on open). This means metadata copy up
+ will only be used if the data file has fs-verity enabled,
+ otherwise a full copy-up is used.
+
Sharing and copying layers
--------------------------
@@ -388,29 +542,53 @@ though it will not result in a crash or deadlock.
Mounting an overlay using an upper layer path, where the upper layer path
was previously used by another mounted overlay in combination with a
-different lower layer path, is allowed, unless the "inodes index" feature
-or "metadata only copy up" feature is enabled.
+different lower layer path, is allowed, unless the "index" or "metacopy"
+features are enabled.
-With the "inodes index" feature, on the first time mount, an NFS file
+With the "index" feature, on the first time mount, an NFS file
handle of the lower layer root directory, along with the UUID of the lower
filesystem, are encoded and stored in the "trusted.overlay.origin" extended
attribute on the upper layer root directory. On subsequent mount attempts,
the lower root directory file handle and lower filesystem UUID are compared
to the stored origin in upper root directory. On failure to verify the
lower root origin, mount will fail with ESTALE. An overlayfs mount with
-"inodes index" enabled will fail with EOPNOTSUPP if the lower filesystem
+"index" enabled will fail with EOPNOTSUPP if the lower filesystem
does not support NFS export, lower filesystem does not have a valid UUID or
if the upper filesystem does not support extended attributes.
-For "metadata only copy up" feature there is no verification mechanism at
+For the "metacopy" feature, there is no verification mechanism at
mount time. So if same upper is mounted with different set of lower, mount
probably will succeed but expect the unexpected later on. So don't do it.
It is quite a common practice to copy overlay layers to a different
directory tree on the same or different underlying filesystem, and even
-to a different machine. With the "inodes index" feature, trying to mount
+to a different machine. With the "index" feature, trying to mount
the copied layers will fail the verification of the lower root file handle.
+Nesting overlayfs mounts
+------------------------
+
+It is possible to use a lower directory that is stored on an overlayfs
+mount. For regular files this does not need any special care. However, files
+that have overlayfs attributes, such as whiteouts or "overlay.*" xattrs, will
+be interpreted by the underlying overlayfs mount and stripped out. In order to
+allow the second overlayfs mount to see the attributes they must be escaped.
+
+Overlayfs specific xattrs are escaped by using a special prefix of
+"overlay.overlay.". So, a file with a "trusted.overlay.overlay.metacopy" xattr
+in the lower dir will be exposed as a regular file with a
+"trusted.overlay.metacopy" xattr in the overlayfs mount. This can be nested by
+repeating the prefix multiple times, as each instance only removes one prefix.
+
+A lower dir with a regular whiteout will always be handled by the overlayfs
+mount, so to support storing an effective whiteout file in an overlayfs mount an
+alternative form of whiteout is supported. This form is a regular, zero-size
+file with the "overlay.whiteout" xattr set, inside a directory with the
+"overlay.opaque" xattr set to "x" (see `whiteouts and opaque directories`_).
+These alternative whiteouts are never created by overlayfs, but can be used by
+userspace tools (like containers) that generate lower layers.
+These alternative whiteouts can be escaped using the standard xattr escape
+mechanism in order to properly nest to any depth.
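+
+For example, a tool preparing a lower layer could create such an alternative
+whiteout like this (paths are illustrative; the whiteout xattr value is shown
+as "y" for illustration, as it is the presence of the xattr that matters)::
+
+    setfattr -n trusted.overlay.opaque -v x layer/dir
+    touch layer/dir/deleted-name
+    setfattr -n trusted.overlay.whiteout -v y layer/dir/deleted-name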
Non-standard behavior
---------------------
@@ -420,17 +598,21 @@ filesystem.
This is the list of cases that overlayfs doesn't currently handle:
-a) POSIX mandates updating st_atime for reads. This is currently not
-done in the case when the file resides on a lower layer.
+ a) POSIX mandates updating st_atime for reads. This is currently not
+ done in the case when the file resides on a lower layer.
+
+ b) If a file residing on a lower layer is opened for read-only and then
+ memory mapped with MAP_SHARED, then subsequent changes to the file are not
+ reflected in the memory mapping.
-b) If a file residing on a lower layer is opened for read-only and then
-memory mapped with MAP_SHARED, then subsequent changes to the file are not
-reflected in the memory mapping.
+ c) If a file residing on a lower layer is being executed, then opening that
+ file for write or truncating the file will not be denied with ETXTBSY.
The following options allow overlayfs to act more like a standards
compliant filesystem:
-1) "redirect_dir"
+redirect_dir
+````````````
Enabled with the mount option or module option: "redirect_dir=on" or with
the kernel config option CONFIG_OVERLAY_FS_REDIRECT_DIR=y.
@@ -438,7 +620,8 @@ the kernel config option CONFIG_OVERLAY_FS_REDIRECT_DIR=y.
If this feature is disabled, then rename(2) on a lower or merged directory
will fail with EXDEV ("Invalid cross-device link").
-2) "inode index"
+index
+`````
Enabled with the mount option or module option "index=on" or with the
kernel config option CONFIG_OVERLAY_FS_INDEX=y.
@@ -447,7 +630,8 @@ If this feature is disabled and a file with multiple hard links is copied
up, then this will "break" the link. Changes will not be propagated to
other names referring to the same inode.
-3) "xino"
+xino
+````
Enabled with the mount option "xino=auto" or "xino=on", with the module
option "xino_auto=on" or with the kernel config option
@@ -459,7 +643,7 @@ enough free bits in the inode number, then overlayfs will not be able to
guarantee that the values of st_ino and st_dev returned by stat(2) and the
value of d_ino returned by readdir(3) will act like on a normal filesystem.
E.g. the value of st_dev may be different for two objects in the same
-overlay filesystem and the value of st_ino for directory objects may not be
+overlay filesystem and the value of st_ino for filesystem objects may not be
persistent and could change even while the overlay filesystem is mounted, as
summarized in the `Inode properties`_ table above.
@@ -467,14 +651,18 @@ summarized in the `Inode properties`_ table above.
Changes to underlying filesystems
---------------------------------
-Offline changes, when the overlay is not mounted, are allowed to either
-the upper or the lower trees.
-
Changes to the underlying filesystems while part of a mounted overlay
filesystem are not allowed. If the underlying filesystem is changed,
the behavior of the overlay is undefined, though it will not result in
a crash or deadlock.
+Offline changes, when the overlay is not mounted, are allowed to the
+upper tree. Offline changes to the lower tree are only allowed if the
+"metacopy", "index", "xino" and "redirect_dir" features
+have not been used. If the lower tree is modified and any of these
+features has been used, the behavior of the overlay is undefined,
+though it will not result in a crash or deadlock.
+
When the overlay NFS export feature is enabled, overlay filesystems
behavior on offline changes of the underlying lower layer is different
than the behavior when NFS export is disabled.
@@ -510,12 +698,13 @@ directory inode.
When encoding a file handle from an overlay filesystem object, the
following rules apply:
-1. For a non-upper object, encode a lower file handle from lower inode
-2. For an indexed object, encode a lower file handle from copy_up origin
-3. For a pure-upper object and for an existing non-indexed upper object,
- encode an upper file handle from upper inode
+ 1. For a non-upper object, encode a lower file handle from lower inode
+ 2. For an indexed object, encode a lower file handle from copy_up origin
+ 3. For a pure-upper object and for an existing non-indexed upper object,
+ encode an upper file handle from upper inode
The encoded overlay file handle includes:
+
- Header including path type information (e.g. lower/upper)
- UUID of the underlying filesystem
- Underlying filesystem encoding of underlying inode
@@ -525,15 +714,15 @@ are stored in extended attribute "trusted.overlay.origin".
When decoding an overlay file handle, the following steps are followed:
-1. Find underlying layer by UUID and path type information.
-2. Decode the underlying filesystem file handle to underlying dentry.
-3. For a lower file handle, lookup the handle in index directory by name.
-4. If a whiteout is found in index, return ESTALE. This represents an
- overlay object that was deleted after its file handle was encoded.
-5. For a non-directory, instantiate a disconnected overlay dentry from the
- decoded underlying dentry, the path type and index inode, if found.
-6. For a directory, use the connected underlying decoded dentry, path type
- and index, to lookup a connected overlay dentry.
+ 1. Find underlying layer by UUID and path type information.
+ 2. Decode the underlying filesystem file handle to underlying dentry.
+ 3. For a lower file handle, lookup the handle in index directory by name.
+ 4. If a whiteout is found in index, return ESTALE. This represents an
+ overlay object that was deleted after its file handle was encoded.
+ 5. For a non-directory, instantiate a disconnected overlay dentry from the
+ decoded underlying dentry, the path type and index inode, if found.
+ 6. For a directory, use the connected underlying decoded dentry, path type
+ and index, to lookup a connected overlay dentry.
Decoding a non-directory file handle may return a disconnected dentry.
copy_up of that disconnected dentry will create an upper index entry with
@@ -560,8 +749,74 @@ When the NFS export feature is enabled, all directory index entries are
verified on mount time to check that upper file handles are not stale.
This verification may cause significant overhead in some cases.
-Note: the mount options index=off,nfs_export=on are conflicting and will
-result in an error.
+Note: the mount options index=off,nfs_export=on are conflicting for a
+read-write mount and will result in an error.
+
+Note: the mount option uuid=off can be used to replace UUID of the underlying
+filesystem in file handles with null, and effectively disable UUID checks. This
+can be useful in case the underlying disk is copied and the UUID of this copy
+is changed. This is only applicable if all lower/upper/work directories are on
+the same filesystem, otherwise it will fall back to normal behaviour.
+
+
+UUID and fsid
+-------------
+
+The UUID of the overlayfs instance itself and the fsid reported by statfs(2)
+are controlled by the "uuid" mount option, which supports these values (an
+example follows the list):
+
+- "null":
+ UUID of overlayfs is null. fsid is taken from the uppermost filesystem.
+- "off":
+ UUID of overlayfs is null. fsid is taken from the uppermost filesystem.
+ UUID of underlying layers is ignored.
+- "on":
+ UUID of overlayfs is generated and used to report a unique fsid.
+ UUID is stored in xattr "trusted.overlay.uuid", making overlayfs fsid
+ unique and persistent. This option requires an overlayfs with upper
+ filesystem that supports xattrs.
+- "auto": (default)
+ UUID is taken from xattr "trusted.overlay.uuid" if it exists.
+ Upgrade to "uuid=on" on first time mount of new overlay filesystem that
+ meets the prerequites.
+ Downgrade to "uuid=null" for existing overlay filesystems that were never
+ mounted with "uuid=on".
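+
+For example, a first mount with "uuid=on" (paths are illustrative) generates
+and stores the UUID, which can then be inspected as root (assuming it is
+stored on the upper root directory)::
+
+    mount -t overlay overlay -o lowerdir=/lower,upperdir=/upper,workdir=/work,uuid=on /merged
+    getfattr -n trusted.overlay.uuid /upper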
+
+
+Volatile mount
+--------------
+
+This is enabled with the "volatile" mount option. Volatile mounts are not
+guaranteed to survive a crash. It is strongly recommended that volatile
+mounts are only used if data written to the overlay can be recreated
+without significant effort.
+
+The advantage of mounting with the "volatile" option is that all forms of
+sync calls to the upper filesystem are omitted.
+
+In order to avoid giving a false sense of safety, the syncfs (and fsync)
+semantics of volatile mounts are slightly different from those of the rest
+of the VFS. If any writeback error occurs on the upperdir's filesystem after a
+volatile mount takes place, all sync functions will return an error. Once this
+condition is reached, the filesystem will not recover, and every subsequent sync
+call will return an error, even if the upperdir has not experienced a new error
+since the last sync call.
+
+When an overlay is mounted with the "volatile" option, the directory
+"$workdir/work/incompat/volatile" is created. During the next mount, overlayfs
+checks for this directory and refuses to mount if it is present. This is a
+strong indicator that the user should throw away the upper and work
+directories and create fresh ones. In very limited cases where the user knows
+that the system has not crashed and the contents of the upperdir are intact,
+the "volatile" directory can be removed.
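+
+For example, a throwaway build overlay (paths are illustrative)::
+
+    mount -t overlay overlay -o lowerdir=/base,upperdir=/scratch/upper,workdir=/scratch/work,volatile /merged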
+
+
+User xattr
+----------
+
+The "-o userxattr" mount option forces overlayfs to use the
+"user.overlay." xattr namespace instead of "trusted.overlay.". This is
+useful for unprivileged mounting of overlayfs.
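+
+For example, a rough sketch of an unprivileged mount from inside a user and
+mount namespace (e.g. one created with "unshare -rm"; paths are
+illustrative)::
+
+    mount -t overlay overlay -o lowerdir=/lower,upperdir=/upper,workdir=/work,userxattr /merged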
Testsuite
@@ -570,9 +825,9 @@ Testsuite
There's a testsuite originally developed by David Howells and currently
maintained by Amir Goldstein at:
- https://github.com/amir73il/unionmount-testsuite.git
+https://github.com/amir73il/unionmount-testsuite.git
-Run as root:
+Run as root::
# cd unionmount-testsuite
# ./run --ov --verify
diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst
index f46b05e9b96c..9ced1135608e 100644
--- a/Documentation/filesystems/path-lookup.rst
+++ b/Documentation/filesystems/path-lookup.rst
@@ -43,15 +43,15 @@ characters, and "components" that are sequences of one or more
non-"``/``" characters. These form two kinds of paths. Those that
start with slashes are "absolute" and start from the filesystem root.
The others are "relative" and start from the current directory, or
-from some other location specified by a file descriptor given to a
-"``XXXat``" system call such as `openat() <openat_>`_.
+from some other location specified by a file descriptor given to
+"``*at()``" system calls such as `openat() <openat_>`_.
.. _execveat: http://man7.org/linux/man-pages/man2/execveat.2.html
It is tempting to describe the second kind as starting with a
component, but that isn't always accurate: a pathname can lack both
slashes and components, it can be empty, in other words. This is
-generally forbidden in POSIX, but some of those "xxx``at``" system calls
+generally forbidden in POSIX, but some of those "``*at()``" system calls
in Linux permit it when the ``AT_EMPTY_PATH`` flag is given. For
example, if you have an open file descriptor on an executable file you
can execute it by calling `execveat() <execveat_>`_ passing
@@ -69,17 +69,17 @@ pathname that is just slashes have a final component. If it does
exist, it could be "``.``" or "``..``" which are handled quite differently
from other components.
-.. _POSIX: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12
+.. _POSIX: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_12
If a pathname ends with a slash, such as "``/tmp/foo/``" it might be
tempting to consider that to have an empty final component. In many
ways that would lead to correct results, but not always. In
particular, ``mkdir()`` and ``rmdir()`` each create or remove a directory named
by the final component, and they are required to work with pathnames
-ending in "``/``". According to POSIX_
+ending in "``/``". According to POSIX_:
- A pathname that contains at least one non- &lt;slash> character and
- that ends with one or more trailing &lt;slash> characters shall not
+ A pathname that contains at least one non-<slash> character and
+ that ends with one or more trailing <slash> characters shall not
be resolved successfully unless the last pathname component before
the trailing <slash> characters names an existing directory or a
directory entry that is to be created for a directory immediately
@@ -229,7 +229,7 @@ happened to be looking at a dentry that was moved in this way,
it might end up continuing the search down the wrong chain,
and so miss out on part of the correct chain.
-The name-lookup process (``d_lookup()``) does _not_ try to prevent this
+The name-lookup process (``d_lookup()``) does *not* try to prevent this
from happening, but only to detect when it happens.
``rename_lock`` is a seqlock that is updated whenever any dentry is
renamed. If ``d_lookup`` finds that a rename happened while it
@@ -376,7 +376,7 @@ table, and the mount point hash table.
Bringing it together with ``struct nameidata``
----------------------------------------------
-.. _First edition Unix: http://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/u2.s
+.. _First edition Unix: https://minnie.tuhs.org/cgi-bin/utree.pl?file=V1/u2.s
Throughout the process of walking a path, the current status is stored
in a ``struct nameidata``, "namei" being the traditional name - dating
@@ -398,7 +398,7 @@ held.
``struct qstr last``
~~~~~~~~~~~~~~~~~~~~
-This is a string together with a length (i.e. _not_ ``nul`` terminated)
+This is a string together with a length (i.e. *not* ``nul`` terminated)
that is the "next" component in the pathname.
``int last_type``
@@ -448,15 +448,17 @@ described. If it finds a ``LAST_NORM`` component it first calls
filesystem to revalidate the result if it is that sort of filesystem.
If that doesn't get a good result, it calls "``lookup_slow()``" which
takes ``i_rwsem``, rechecks the cache, and then asks the filesystem
-to find a definitive answer. Each of these will call
-``follow_managed()`` (as described below) to handle any mount points.
-
-In the absence of symbolic links, ``walk_component()`` creates a new
-``struct path`` containing a counted reference to the new dentry and a
-reference to the new ``vfsmount`` which is only counted if it is
-different from the previous ``vfsmount``. It then calls
-``path_to_nameidata()`` to install the new ``struct path`` in the
-``struct nameidata`` and drop the unneeded references.
+to find a definitive answer.
+
+As the last step of walk_component(), step_into() will be called, either
+directly from walk_component() or from handle_dots(). It calls
+handle_mounts() to check and handle mount points; there a new
+``struct path`` is created containing a counted reference to the new dentry and
+a reference to the new ``vfsmount``, which is only counted if it is
+different from the previous ``vfsmount``. Then, if there is
+a symbolic link, step_into() calls pick_link() to deal with it;
+otherwise it installs the new ``struct path`` in the ``struct nameidata`` and
+drops the unneeded references.
This "hand-over-hand" sequencing of getting a reference to the new
dentry before dropping the reference to the previous dentry may
@@ -470,8 +472,8 @@ Handling the final component
``nd->last_type`` to refer to the final component of the path. It does
not call ``walk_component()`` that last time. Handling that final
component remains for the caller to sort out. Those callers are
-``path_lookupat()``, ``path_parentat()``, ``path_mountpoint()`` and
-``path_openat()`` each of which handles the differing requirements of
+path_lookupat(), path_parentat() and
+path_openat() each of which handles the differing requirements of
different system calls.
``path_parentat()`` is clearly the simplest - it just wraps a little bit
@@ -486,20 +488,18 @@ perform their operation.
object is wanted such as by ``stat()`` or ``chmod()``. It essentially just
calls ``walk_component()`` on the final component through a call to
``lookup_last()``. ``path_lookupat()`` returns just the final dentry.
-
-``path_mountpoint()`` handles the special case of unmounting which must
-not try to revalidate the mounted filesystem. It effectively
-contains, through a call to ``mountpoint_last()``, an alternate
-implementation of ``lookup_slow()`` which skips that step. This is
-important when unmounting a filesystem that is inaccessible, such as
+It is worth noting that when the flag ``LOOKUP_MOUNTPOINT`` is set,
+path_lookupat() will unset LOOKUP_JUMPED in nameidata so that in the
+subsequent path traversal d_weak_revalidate() won't be called.
+This is important when unmounting a filesystem that is inaccessible, such as
one provided by a dead NFS server.
Finally ``path_openat()`` is used for the ``open()`` system call; it
-contains, in support functions starting with "``do_last()``", all the
+contains, in support functions starting with "open_last_lookups()", all the
complexity needed to handle the different subtleties of O_CREAT (with
or without O_EXCL), final "``/``" characters, and trailing symbolic
links. We will revisit this in the final part of this series, which
-focuses on those symbolic links. "``do_last()``" will sometimes, but
+focuses on those symbolic links. "open_last_lookups()" will sometimes, but
not always, take ``i_rwsem``, depending on what it finds.
Each of these, or the functions which call them, need to be alert to
@@ -531,12 +531,11 @@ this retry process in the next article.
Automount points are locations in the filesystem where an attempt to
lookup a name can trigger changes to how that lookup should be
handled, in particular by mounting a filesystem there. These are
-covered in greater detail in autofs.txt in the Linux documentation
+covered in greater detail in autofs.rst in the Linux documentation
tree, but a few notes specifically related to path lookup are in order
here.
-The Linux VFS has a concept of "managed" dentries which is reflected
-in function names such as "``follow_managed()``". There are three
+The Linux VFS has a concept of "managed" dentries. There are three
potentially interesting things about these dentries corresponding
to three different flags that might be set in ``dentry->d_flags``:
@@ -652,11 +651,11 @@ RCU-walk finds it cannot stop gracefully, it simply gives up and
restarts from the top with REF-walk.
This pattern of "try RCU-walk, if that fails try REF-walk" can be
-clearly seen in functions like ``filename_lookup()``,
-``filename_parentat()``, ``filename_mountpoint()``,
-``do_filp_open()``, and ``do_file_open_root()``. These five
-correspond roughly to the four ``path_``* functions we met earlier,
-each of which calls ``link_path_walk()``. The ``path_*`` functions are
+clearly seen in functions like filename_lookup(),
+filename_parentat(),
+do_filp_open(), and do_file_open_root(). These four
+correspond roughly to the three ``path_*()`` functions we met earlier,
+each of which calls ``link_path_walk()``. The ``path_*()`` functions are
called using different mode flags until a mode is found which works.
They are first called with ``LOOKUP_RCU`` set to request "RCU-walk". If
that fails with the error ``ECHILD`` they are called again with no
@@ -720,7 +719,7 @@ against a dentry. The length and name pointer are copied into local
variables, then ``read_seqcount_retry()`` is called to confirm the two
are consistent, and only then is ``->d_compare()`` called. When
standard filename comparison is used, ``dentry_cmp()`` is called
-instead. Notably it does _not_ use ``read_seqcount_retry()``, but
+instead. Notably it does *not* use ``read_seqcount_retry()``, but
instead has a large comment explaining why the consistency guarantee
isn't necessary. A subsequent ``read_seqcount_retry()`` will be
sufficient to catch any problem that could occur at this point.
@@ -928,7 +927,7 @@ if anything goes wrong it is much safer to just abort and try a more
sedate approach.
The emphasis here is "try quickly and check". It should probably be
-"try quickly _and carefully,_ then check". The fact that checking is
+"try quickly *and carefully*, then check". The fact that checking is
needed is a reminder that the system is dynamic and only a limited
number of things are safe at all. The most likely cause of errors in
this whole process is assuming something is safe when in reality it
@@ -993,8 +992,8 @@ is 4096. There are a number of reasons for this limit; not letting the
kernel spend too much time on just one path is one of them. With
symbolic links you can effectively generate much longer paths so some
sort of limit is needed for the same reason. Linux imposes a limit of
-at most 40 symlinks in any one path lookup. It previously imposed a
-further limit of eight on the maximum depth of recursion, but that was
+at most 40 (MAXSYMLINKS) symlinks in any one path lookup. It previously imposed
+a further limit of eight on the maximum depth of recursion, but that was
raised to 40 when a separate stack was implemented, so there is now
just the one limit.
@@ -1061,42 +1060,26 @@ filesystem cannot successfully get a reference in RCU-walk mode, it
must return ``-ECHILD`` and ``unlazy_walk()`` will be called to return to
REF-walk mode in which the filesystem is allowed to sleep.
-The place for all this to happen is the ``i_op->follow_link()`` inode
-method. In the present mainline code this is never actually called in
-RCU-walk mode as the rewrite is not quite complete. It is likely that
-in a future release this method will be passed an ``inode`` pointer when
-called in RCU-walk mode so it both (1) knows to be careful, and (2) has the
-validated pointer. Much like the ``i_op->permission()`` method we
-looked at previously, ``->follow_link()`` would need to be careful that
+The place for all this to happen is the ``i_op->get_link()`` inode
+method. This is called both in RCU-walk and REF-walk. In RCU-walk the
+``dentry*`` argument is NULL, ``->get_link()`` can return -ECHILD to drop out of
+RCU-walk. Much like the ``i_op->permission()`` method we
+looked at previously, ``->get_link()`` would need to be careful that
all the data structures it references are safe to be accessed while
-holding no counted reference, only the RCU lock. Though getting a
-reference with ``->follow_link()`` is not yet done in RCU-walk mode, the
-code is ready to release the reference when that does happen.
-
-This need to drop the reference to a symlink adds significant
-complexity. It requires a reference to the inode so that the
-``i_op->put_link()`` inode operation can be called. In REF-walk, that
-reference is kept implicitly through a reference to the dentry, so
-keeping the ``struct path`` of the symlink is easiest. For RCU-walk,
-the pointer to the inode is kept separately. To allow switching from
-RCU-walk back to REF-walk in the middle of processing nested symlinks
-we also need the seq number for the dentry so we can confirm that
-switching back was safe.
-
-Finally, when providing a reference to a symlink, the filesystem also
-provides an opaque "cookie" that must be passed to ``->put_link()`` so that it
-knows what to free. This might be the allocated memory area, or a
-pointer to the ``struct page`` in the page cache, or something else
-completely. Only the filesystem knows what it is.
+holding no counted reference, only the RCU lock. A callback
+``struct delayed_call`` will be passed to ``->get_link()``:
+filesystems can set their own put_link function and argument through
+set_delayed_call(). Later on, when the VFS wants to put the link, it will call
+do_delayed_call() to invoke that callback function with the argument.
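+
+As an illustration, a filesystem's ``->get_link()`` could look roughly like
+this (a simplified sketch; foo_read_link() is a made-up helper returning a
+kmalloc'ed target string)::
+
+  static const char *foo_get_link(struct dentry *dentry, struct inode *inode,
+                                  struct delayed_call *done)
+  {
+          char *link;
+
+          if (!dentry)
+                  return ERR_PTR(-ECHILD);    /* opt out of RCU-walk */
+
+          link = foo_read_link(inode);        /* hypothetical */
+          if (IS_ERR(link))
+                  return link;
+          /* tell the VFS how to free the string once it is done with it */
+          set_delayed_call(done, kfree_link, link);
+          return link;
+  }
+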
In order for the reference to each symlink to be dropped when the walk completes,
whether in RCU-walk or REF-walk, the symlink stack needs to contain,
along with the path remnants:
-- the ``struct path`` to provide a reference to the inode in REF-walk
-- the ``struct inode *`` to provide a reference to the inode in RCU-walk
+- the ``struct path`` to provide a reference to the previous path
+- the ``const char *`` to provide a reference to the previous name
- the ``seq`` to allow the path to be safely switched from RCU-walk to REF-walk
-- the ``cookie`` that tells ``->put_path()`` what to put.
+- the ``struct delayed_call`` for later invocation.
This means that each entry in the symlink stack needs to hold five
pointers and an integer instead of just one pointer (the path
@@ -1120,12 +1103,10 @@ doesn't need to notice. Getting this ``name`` variable on and off the
stack is very straightforward; pushing and popping the references is
a little more complex.
-When a symlink is found, ``walk_component()`` returns the value ``1``
-(``0`` is returned for any other sort of success, and a negative number
-is, as usual, an error indicator). This causes ``get_link()`` to be
-called; it then gets the link from the filesystem. Providing that
-operation is successful, the old path ``name`` is placed on the stack,
-and the new value is used as the ``name`` for a while. When the end of
+When a symlink is found, walk_component() calls pick_link() via step_into()
+which returns the link from the filesystem.
+Providing that operation is successful, the old path ``name`` is placed on the
+stack, and the new value is used as the ``name`` for a while. When the end of
the path is found (i.e. ``*name`` is ``'\0'``) the old ``name`` is restored
off the stack and path walking continues.
@@ -1142,23 +1123,23 @@ stack in ``walk_component()`` immediately when the symlink is found;
old symlink as it walks that last component. So it is quite
convenient for ``walk_component()`` to release the old symlink and pop
the references just before pushing the reference information for the
-new symlink. It is guided in this by two flags; ``WALK_GET``, which
-gives it permission to follow a symlink if it finds one, and
-``WALK_PUT``, which tells it to release the current symlink after it has been
-followed. ``WALK_PUT`` is tested first, leading to a call to
-``put_link()``. ``WALK_GET`` is tested subsequently (by
-``should_follow_link()``) leading to a call to ``pick_link()`` which sets
-up the stack frame.
+new symlink. It is guided in this by three flags: ``WALK_NOFOLLOW``, which
+forbids it from following a symlink if it finds one; ``WALK_MORE``,
+which indicates that it is still too early to release the
+current symlink; and ``WALK_TRAILING``, which indicates that it is on the final
+component of the lookup, so the userspace flag ``LOOKUP_FOLLOW`` is checked
+to decide whether to follow a symlink, and ``may_follow_link()`` is called
+to check that we have the privilege to follow it.
Symlinks with no final component
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A pair of special-case symlinks deserve a little further explanation.
Both result in a new ``struct path`` (with mount and dentry) being set
-up in the ``nameidata``, and result in ``get_link()`` returning ``NULL``.
+up in the ``nameidata``, and result in pick_link() returning ``NULL``.
The more obvious case is a symlink to "``/``". All symlinks starting
-with "``/``" are detected in ``get_link()`` which resets the ``nameidata``
+with "``/``" are detected in pick_link() which resets the ``nameidata``
to point to the effective filesystem root. If the symlink only
contains "``/``" then there is nothing more to do, no components at all,
so ``NULL`` is returned to indicate that the symlink can be released and
@@ -1175,12 +1156,11 @@ something that looks like a symlink. It is really a reference to the
target file, not just the name of it. When you ``readlink`` these
objects you get a name that might refer to the same file - unless it
has been unlinked or mounted over. When ``walk_component()`` follows
-one of these, the ``->follow_link()`` method in "procfs" doesn't return
-a string name, but instead calls ``nd_jump_link()`` which updates the
-``nameidata`` in place to point to that target. ``->follow_link()`` then
-returns ``NULL``. Again there is no final component and ``get_link()``
-reports this by leaving the ``last_type`` field of ``nameidata`` as
-``LAST_BIND``.
+one of these, the ``->get_link()`` method in "procfs" doesn't return
+a string name, but instead calls nd_jump_link() which updates the
+``nameidata`` in place to point to that target. ``->get_link()`` then
+returns ``NULL``. Again there is no final component and pick_link()
+returns ``NULL``.
Following the symlink in the final component
--------------------------------------------
@@ -1197,42 +1177,38 @@ potentially need to call ``link_path_walk()`` again and again on
successive symlinks until one is found that doesn't point to another
symlink.
-This case is handled by the relevant caller of ``link_path_walk()``, such as
-``path_lookupat()`` using a loop that calls ``link_path_walk()``, and then
-handles the final component. If the final component is a symlink
-that needs to be followed, then ``trailing_symlink()`` is called to set
-things up properly and the loop repeats, calling ``link_path_walk()``
-again. This could loop as many as 40 times if the last component of
-each symlink is another symlink.
-
-The various functions that examine the final component and possibly
-report that it is a symlink are ``lookup_last()``, ``mountpoint_last()``
-and ``do_last()``, each of which use the same convention as
-``walk_component()`` of returning ``1`` if a symlink was found that needs
-to be followed.
-
-Of these, ``do_last()`` is the most interesting as it is used for
-opening a file. Part of ``do_last()`` runs with ``i_rwsem`` held and this
-part is in a separate function: ``lookup_open()``.
-
-Explaining ``do_last()`` completely is beyond the scope of this article,
-but a few highlights should help those interested in exploring the
-code.
-
-1. Rather than just finding the target file, ``do_last()`` needs to open
+This case is handled by the relevant callers of link_path_walk(), such as
+path_lookupat() and path_openat(), using a loop that calls link_path_walk(),
+and then handles the final component by calling open_last_lookups() or
+lookup_last(). If it is a symlink that needs to be followed,
+open_last_lookups() or lookup_last() will set things up properly and
+return the path so that the loop repeats, calling
+link_path_walk() again. This could loop as many as 40 times if the last
+component of each symlink is another symlink.
+
+Of the various functions that examine the final component,
+open_last_lookups() is the most interesting as it works in tandem
+with do_open() for opening a file. Part of open_last_lookups() runs
+with ``i_rwsem`` held and this part is in a separate function: lookup_open().
+
+Explaining open_last_lookups() and do_open() completely is beyond the scope
+of this article, but a few highlights should help those interested in exploring
+the code.
+
+1. Rather than just finding the target file, do_open() is used after
+ open_last_lookups() to open
it. If the file was found in the dcache, then ``vfs_open()`` is used for
this. If not, then ``lookup_open()`` will either call ``atomic_open()`` (if
the filesystem provides it) to combine the final lookup with the open, or
- will perform the separate ``lookup_real()`` and ``vfs_create()`` steps
+ will perform the separate ``i_op->lookup()`` and ``i_op->create()`` steps
directly. In the latter case the actual "open" of this newly found or
- created file will be performed by ``vfs_open()``, just as if the name
+ created file will be performed by vfs_open(), just as if the name
were found in the dcache.
-2. ``vfs_open()`` can fail with ``-EOPENSTALE`` if the cached information
- wasn't quite current enough. Rather than restarting the lookup from
- the top with ``LOOKUP_REVAL`` set, ``lookup_open()`` is called instead,
- giving the filesystem a chance to resolve small inconsistencies.
- If that doesn't work, only then is the lookup restarted from the top.
+2. vfs_open() can fail with ``-EOPENSTALE`` if the cached information
+ wasn't quite current enough. If it's in RCU-walk ``-ECHILD`` will be returned
+ otherwise ``-ESTALE`` is returned. When ``-ESTALE`` is returned, the caller may
+ retry with ``LOOKUP_REVAL`` flag set.
3. An open with O_CREAT **does** follow a symlink in the final component,
unlike other creation system calls (like ``mkdir``). So the sequence::
@@ -1242,8 +1218,8 @@ code.
will create a file called ``/tmp/bar``. This is not permitted if
``O_EXCL`` is set but otherwise is handled for an O_CREAT open much
- like for a non-creating open: ``should_follow_link()`` returns ``1``, and
- so does ``do_last()`` so that ``trailing_symlink()`` gets called and the
+ like for a non-creating open: lookup_last() or open_last_lookups()
+ returns a non-``NULL`` value, and link_path_walk() gets called and the
open process continues on the symlink that was found.
Updating the access time
@@ -1265,7 +1241,7 @@ Symlinks are different it seems. Both reading a symlink (with ``readlink()``)
and looking up a symlink on the way to some other destination can
update the atime on that symlink.
-.. _clearest statement: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_08
+.. _clearest statement: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_08
It is not clear why this is the case; POSIX has little to say on the
subject. The `clearest statement`_ is that, if a particular implementation
@@ -1321,18 +1297,18 @@ to lookup: RCU-walk, REF-walk, and REF-walk with forced revalidation.
yet. This is primarily used to tell the audit subsystem the full
context of a particular access being audited.
-``LOOKUP_ROOT`` indicates that the ``root`` field in the ``nameidata`` was
+``ND_ROOT_PRESET`` indicates that the ``root`` field in the ``nameidata`` was
provided by the caller, so it shouldn't be released when it is no
longer needed.
-``LOOKUP_JUMPED`` means that the current dentry was chosen not because
+``ND_JUMPED`` means that the current dentry was chosen not because
it had the right name but for some other reason. This happens when
following "``..``", following a symlink to ``/``, crossing a mount point
or accessing a "``/proc/$PID/fd/$FD``" symlink (also known as a "magic
link"). In this case the filesystem has not been asked to revalidate the
name (with ``d_revalidate()``). In such cases the inode may still need
to be revalidated, so ``d_op->d_weak_revalidate()`` is called if
-``LOOKUP_JUMPED`` is set when the look completes - which may be at the
+``ND_JUMPED`` is set when the lookup completes - which may be at the
final component or, when creating, unlinking, or renaming, at the penultimate component.
Resolution-restriction flags
@@ -1365,7 +1341,7 @@ as well as blocking ".." if it would jump outside the starting point.
resolution of "..". Magic-links are also blocked.
``LOOKUP_IN_ROOT`` resolves all path components as though the starting point
-were the filesystem root. ``nd_jump_root()`` brings the resolution back to to
+were the filesystem root. ``nd_jump_root()`` brings the resolution back to
the starting point, and ".." at the starting point will act as a no-op. As with
``LOOKUP_BENEATH``, ``rename_lock`` and ``mount_lock`` are used to detect
attacks against ".." resolution. Magic-links are also blocked.
diff --git a/Documentation/filesystems/path-lookup.txt b/Documentation/filesystems/path-lookup.txt
index 1aa7ce099f6f..d2cf2852e1f8 100644
--- a/Documentation/filesystems/path-lookup.txt
+++ b/Documentation/filesystems/path-lookup.txt
@@ -379,4 +379,4 @@ Papers and other documentation on dcache locking
2. http://lse.sourceforge.net/locking/dcache/dcache.html
-3. path-lookup.md in this directory.
+3. path-lookup.rst in this directory.
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 867036aa90b8..a5734bdd1cc7 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -45,6 +45,12 @@ typically between calling iget_locked() and unlocking the inode.
At some point that will become mandatory.
+**mandatory**
+
+The foo_inode_info should always be allocated through alloc_inode_sb() rather
+than kmem_cache_alloc() or kmalloc(), so that the inode reclaim context is
+set up correctly.
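+
+A minimal sketch of the expected pattern (struct foo_inode_info and
+foo_inode_cachep are placeholders for the filesystem's own types)::
+
+  static struct inode *foo_alloc_inode(struct super_block *sb)
+  {
+          struct foo_inode_info *fi;
+
+          /* sets up the memcg-aware reclaim context for this sb's inodes */
+          fi = alloc_inode_sb(sb, foo_inode_cachep, GFP_KERNEL);
+          if (!fi)
+                  return NULL;
+          return &fi->vfs_inode;
+  }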
+
---
**mandatory**
@@ -171,7 +177,7 @@ settles down a bit.
**mandatory**
s_export_op is now required for exporting a filesystem.
-isofs, ext2, ext3, resierfs, fat
+isofs, ext2, ext3, fat
can be used as examples of very different filesystems.
---
@@ -307,7 +313,7 @@ done.
**mandatory**
-block truncatation on error exit from ->write_begin, and ->direct_IO
+block truncation on error exit from ->write_begin, and ->direct_IO
moved from generic methods (block_write_begin, cont_write_begin,
nobh_write_begin, blockdev_direct_IO*) to callers. Take a look at
ext2_write_failed and callers for an example.
@@ -456,15 +462,15 @@ ERR_PTR(...).
argument; instead of passing IPERM_FLAG_RCU we add MAY_NOT_BLOCK into mask.
generic_permission() has also lost the check_acl argument; ACL checking
-has been taken to VFS and filesystems need to provide a non-NULL ->i_op->get_acl
-to read an ACL from disk.
+has been taken to VFS and filesystems need to provide a non-NULL
+->i_op->get_inode_acl to read an ACL from disk.
---
**mandatory**
If you implement your own ->llseek() you must handle SEEK_HOLE and
-SEEK_DATA. You can hanle this by returning -EINVAL, but it would be nicer to
+SEEK_DATA. You can handle this by returning -EINVAL, but it would be nicer to
support it in some way. The generic handler assumes that the entire file is
data and there is a virtual hole at the end of the file. So if the provided
offset is less than i_size and SEEK_DATA is specified, return the same offset.
@@ -511,7 +517,7 @@ The witch is dead! Well, 2/3 of it, anyway. ->d_revalidate() and
->create() doesn't take ``struct nameidata *``; unlike the previous
two, it gets "is it an O_EXCL or equivalent?" boolean argument. Note that
-local filesystems can ignore tha argument - they are guaranteed that the
+local filesystems can ignore this argument - they are guaranteed that the
object doesn't exist. It's remote/distributed ones that might care...
---
@@ -531,7 +537,7 @@ vfs_readdir() is gone; switch to iterate_dir() instead
**mandatory**
-->readdir() is gone now; switch to ->iterate()
+->readdir() is gone now; switch to ->iterate_shared()
**mandatory**
@@ -618,7 +624,7 @@ any symlink that might use page_follow_link_light/page_put_link() must
have inode_nohighmem(inode) called before anything might start playing with
its pagecache. No highmem pages should end up in the pagecache of such
symlinks. That includes any preseeding that might be done during symlink
-creation. __page_symlink() will honour the mapping gfp flags, so once
+creation. page_symlink() will honour the mapping gfp flags, so once
you've done inode_nohighmem() it's safe to use, but if you allocate and
insert the page manually, make sure to use the right gfp flags.
@@ -687,24 +693,19 @@ parallel now.
---
-**recommended**
+**mandatory**
-->iterate_shared() is added; it's a parallel variant of ->iterate().
+->iterate_shared() is added.
Exclusion on struct file level is still provided (as well as that
between it and lseek on the same struct file), but if your directory
has been opened several times, you can get these called in parallel.
Exclusion between that method and all directory-modifying ones is
still provided, of course.
-Often enough ->iterate() can serve as ->iterate_shared() without any
-changes - it is a read-only operation, after all. If you have any
-per-inode or per-dentry in-core data structures modified by ->iterate(),
-you might need something to serialize the access to them. If you
-do dcache pre-seeding, you'll need to switch to d_alloc_parallel() for
-that; look for in-tree examples.
-
-Old method is only used if the new one is absent; eventually it will
-be removed. Switch while you still can; the old one won't stay.
+If you have any per-inode or per-dentry in-core data structures modified
+by ->iterate_shared(), you might need something to serialize the access
+to them. If you do dcache pre-seeding, you'll need to switch to
+d_alloc_parallel() for that; look for in-tree examples.
---
@@ -717,6 +718,8 @@ be removed. Switch while you still can; the old one won't stay.
**mandatory**
->setxattr() and xattr_handler.set() get dentry and inode passed separately.
+The xattr_handler.set() gets passed the user namespace of the mount the inode
+is seen from so filesystems can idmap the i_uid and i_gid accordingly.
dentry might be yet to be attached to inode, so do _not_ use its ->d_inode
in the instances. Rationale: !@#!@# security_d_instantiate() needs to be
called before we attach dentry to inode and !@#!@##!@$!$#!@#$!@$!@$ smack
@@ -855,7 +858,7 @@ be misspelled d_alloc_anon().
**mandatory**
-[should've been added in 2016] stale comment in finish_open() nonwithstanding,
+[should've been added in 2016] stale comment in finish_open() notwithstanding,
failure exits in ->atomic_open() instances should *NOT* fput() the file,
no matter what. Everything is handled by the caller.
@@ -865,3 +868,393 @@ no matter what. Everything is handled by the caller.
clone_private_mount() returns a longterm mount now, so the proper destructor of
its result is kern_unmount() or kern_unmount_array().
+
+---
+
+**mandatory**
+
+zero-length bvec segments are disallowed; they must be filtered out before
+being passed on to an iterator.
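+
+A sketch of such filtering, compacting the array in place before handing it
+to an iterator (the helper name is illustrative)::
+
+  static unsigned int foo_prune_bvecs(struct bio_vec *bv, unsigned int nr)
+  {
+          unsigned int i, out = 0;
+
+          for (i = 0; i < nr; i++)
+                  if (bv[i].bv_len)       /* drop zero-length segments */
+                          bv[out++] = bv[i];
+          return out;
+          /* ...then e.g. iov_iter_bvec(&iter, ITER_SOURCE, bv, out, total) */
+  }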
+
+---
+
+**mandatory**
+
+For bvec based iterators bio_iov_iter_get_pages() now doesn't copy bvecs but
+uses the ones provided. Anyone issuing kiocb-I/O should ensure that the bvec
+and page references stay until I/O has completed, i.e. until ->ki_complete()
+has been called or the submission returned with a code other than
+-EIOCBQUEUED.
+
+---
+
+**mandatory**
+
+mnt_want_write_file() can now only be paired with mnt_drop_write_file(),
+whereas previously it could be paired with mnt_drop_write() as well.
+
+---
+
+**mandatory**
+
+iov_iter_copy_from_user_atomic() is gone; use copy_page_from_iter_atomic().
+The difference is copy_page_from_iter_atomic() advances the iterator and
+you don't need iov_iter_advance() after it. However, if you decide to use
+only a part of obtained data, you should do iov_iter_revert().
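+
+Roughly (a sketch; 'used' stands for however much of the copied data was
+actually consumed)::
+
+  size_t copied = copy_page_from_iter_atomic(page, offset, bytes, i);
+
+  /* the iterator has already been advanced past 'copied' bytes */
+  if (used < copied)
+          iov_iter_revert(i, copied - used);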
+
+---
+
+**mandatory**
+
+Calling conventions for file_open_root() changed; now it takes struct path *
+instead of passing mount and dentry separately. For callers that used to
+pass <mnt, mnt->mnt_root> pair (i.e. the root of given mount), a new helper
+is provided - file_open_root_mnt(). In-tree users adjusted.
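+
+For example (a sketch; error handling is elided and the paths are made up)::
+
+  struct path root;
+  struct file *f;
+
+  kern_path("/mnt/scratch", LOOKUP_DIRECTORY, &root);
+  f = file_open_root(&root, "etc/config", O_RDONLY, 0);
+  path_put(&root);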
+
+---
+
+**mandatory**
+
+no_llseek is gone; don't set .llseek to that - just leave it NULL instead.
+Checks for "does that file have llseek(2), or should it fail with ESPIPE"
+should be done by looking at FMODE_LSEEK in file->f_mode.
+
+---
+
+**mandatory**
+
+filldir_t (readdir callbacks) calling conventions have changed. Instead of
+returning 0 or -E... it returns bool now. false means "no more" (as -E... used
+to) and true - "keep going" (as 0 in old calling conventions). Rationale:
+callers never looked at specific -E... values anyway. ->iterate_shared()
+instances require no changes at all; all filldir_t ones in the tree have
+been converted.
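+
+A sketch of the new convention: a callback that records a single entry and
+then stops (the recording itself is elided)::
+
+  static bool foo_fill_one(struct dir_context *ctx, const char *name,
+                           int namlen, loff_t offset, u64 ino,
+                           unsigned int d_type)
+  {
+          /* ... stash name/ino somewhere reachable from ctx ... */
+          return false;   /* "no more", where -E... was returned before */
+  }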
+
+---
+
+**mandatory**
+
+Calling conventions for ->tmpfile() have changed. It now takes a struct
+file pointer instead of struct dentry pointer. d_tmpfile() is similarly
+changed to simplify callers. The passed file is in a non-open state and on
+success must be opened before returning (e.g. by calling
+finish_open_simple()).
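+
+A minimal sketch of the new convention (foo_new_inode() is a stand-in for
+the filesystem's inode allocation)::
+
+  static int foo_tmpfile(struct mnt_idmap *idmap, struct inode *dir,
+                         struct file *file, umode_t mode)
+  {
+          struct inode *inode = foo_new_inode(dir, mode); /* hypothetical */
+
+          if (IS_ERR(inode))
+                  return PTR_ERR(inode);
+          d_tmpfile(file, inode);
+          return finish_open_simple(file, 0);
+  }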
+
+---
+
+**mandatory**
+
+Calling convention for ->huge_fault has changed. It now takes a page
+order instead of an enum page_entry_size, and it may be called without the
+mmap_lock held. All in-tree users have been audited and do not seem to
+depend on the mmap_lock being held, but out of tree users should verify
+for themselves. If they do need it, they can return VM_FAULT_RETRY to
+be called with the mmap_lock held.
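+
+For instance, an instance that only ever serves PTE-sized faults could now
+look like this (a sketch; foo_fault() is hypothetical)::
+
+  static vm_fault_t foo_huge_fault(struct vm_fault *vmf, unsigned int order)
+  {
+          if (order != 0)
+                  return VM_FAULT_FALLBACK;  /* let the VM retry smaller */
+          return foo_fault(vmf);
+  }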
+
+---
+
+**mandatory**
+
+The order of opening block devices and matching or creating superblocks has
+changed.
+
+The old logic opened block devices first and then tried to find a
+suitable superblock to reuse based on the block device pointer.
+
+The new logic tries to find a suitable superblock first, based on the device
+number, and opens the block device afterwards.
+
+Since opening block devices cannot happen under s_umount because of lock
+ordering requirements, s_umount is now dropped while opening block devices and
+reacquired before calling fill_super().
+
+In the old logic concurrent mounters would find the superblock on the list of
+superblocks for the filesystem type. Since the first opener of the block device
+would hold s_umount they would wait until the superblock became either born or
+was discarded due to initialization failure.
+
+Since the new logic drops s_umount concurrent mounters could grab s_umount and
+would spin. Instead they are now made to wait using an explicit wait-wake
+mechanism without having to hold s_umount.
+
+---
+
+**mandatory**
+
+The holder of a block device is now the superblock.
+
+The holder of a block device used to be the file_system_type which wasn't
+particularly useful. It wasn't possible to go from block device to owning
+superblock without matching on the device pointer stored in the superblock.
+This mechanism would only work for a single device so the block layer couldn't
+find the owning superblock of any additional devices.
+
+In the old mechanism reusing or creating a superblock for a racing mount(2) and
+umount(2) relied on the file_system_type as the holder. This was severely
+underdocumented however:
+
+(1) Any concurrent mounter that managed to grab an active reference on an
+ existing superblock was made to wait until the superblock either became
+ ready or until the superblock was removed from the list of superblocks of
+ the filesystem type. If the superblock was ready the caller would simply
+ reuse it.
+
+(2) If the mounter came after deactivate_locked_super() but before
+ the superblock had been removed from the list of superblocks of the
+ filesystem type the mounter would wait until the superblock was shutdown,
+ reuse the block device and allocate a new superblock.
+
+(3) If the mounter came after deactivate_locked_super() and after
+ the superblock had been removed from the list of superblocks of the
+ filesystem type the mounter would reuse the block device and allocate a new
+ superblock (the bd_holder pointer may still be set to the filesystem type).
+
+Because the holder of the block device was the file_system_type any concurrent
+mounter could open the block devices of any superblock of the same
+file_system_type without risking seeing EBUSY because the block device was
+still in use by another superblock.
+
+Making the superblock the owner of the block device changes this as the holder
+is now a unique superblock and thus block devices associated with it cannot be
+reused by concurrent mounters. So a concurrent mounter in (2) could suddenly
+see EBUSY when trying to open a block device whose holder was a different
+superblock.
+
+The new logic thus waits until the superblock and the devices are shutdown in
+->kill_sb(). Removal of the superblock from the list of superblocks of the
+filesystem type is now moved to a later point when the devices are closed:
+
+(1) Any concurrent mounter managing to grab an active reference on an existing
+ superblock is made to wait until the superblock is either ready or until
+ the superblock and all devices are shutdown in ->kill_sb(). If the
+ superblock is ready the caller will simply reuse it.
+
+(2) If the mounter comes after deactivate_locked_super() but before
+ the superblock has been removed from the list of superblocks of the
+ filesystem type the mounter is made to wait until the superblock and the
+ devices are shut down in ->kill_sb() and the superblock is removed from the
+ list of superblocks of the filesystem type. The mounter will allocate a new
+ superblock and grab ownership of the block device (the bd_holder pointer of
+ the block device will be set to the newly allocated superblock).
+
+(3) This case is now collapsed into (2) as the superblock is left on the list
+ of superblocks of the filesystem type until all devices are shutdown in
+ ->kill_sb(). In other words, if the superblock isn't on the list of
+ superblocks of the filesystem type anymore then it has given up ownership of
+ all associated block devices (the bd_holder pointer is NULL).
+
+As this is a VFS level change it has no practical consequences for filesystems
+other than that all of them must use one of the provided kill_litter_super(),
+kill_anon_super(), or kill_block_super() helpers.
+
+---
+
+**mandatory**
+
+Lock ordering has been changed so that s_umount ranks above open_mutex again.
+All places where s_umount was taken under open_mutex have been fixed up.
+
+---
+
+**mandatory**
+
+export_operations ->encode_fh() no longer has a default implementation to
+encode FILEID_INO32_GEN* file handles.
+Filesystems that used the default implementation may use the generic helper
+generic_encode_ino32_fh() explicitly.
+
+---
+
+**mandatory**
+
+If ->rename() update of .. on cross-directory move needs an exclusion with
+directory modifications, do *not* lock the subdirectory in question in your
+->rename() - it's done by the caller now [that item should've been added in
+28eceeda130f "fs: Lock moved directories"].
+
+---
+
+**mandatory**
+
+On same-directory ->rename() the (tautological) update of .. is not protected
+by any locks; just don't do it if the old parent is the same as the new one.
+We really can't lock two subdirectories in same-directory rename - not without
+deadlocks.
+
+---
+
+**mandatory**
+
+lock_rename() and lock_rename_child() may fail in cross-directory case, if
+their arguments do not have a common ancestor. In that case ERR_PTR(-EXDEV)
+is returned, with no locks taken. In-tree users updated; out-of-tree ones
+would need to do so.
+
+---
+
+**mandatory**
+
+The list of children anchored in the parent dentry has been turned into an
+hlist now. Field names have changed (->d_children/->d_sib instead of
+->d_subdirs/->d_child for anchor/entries respectively), so any affected
+places will be immediately caught by the compiler.
+
+---
+
+**mandatory**
+
+->d_delete() instances are now called for dentries with ->d_lock held
+and refcount equal to 0. They are not permitted to drop/regain ->d_lock.
+None of in-tree instances did anything of that sort. Make sure yours do not...
+
+---
+
+**mandatory**
+
+->d_prune() instances are now called without ->d_lock held on the parent.
+->d_lock on dentry itself is still held; if you need per-parent exclusions (none
+of the in-tree instances did), use your own spinlock.
+
+->d_iput() and ->d_release() are called with victim dentry still in the
+list of parent's children. It is still unhashed, marked killed, etc., just not
+removed from parent's ->d_children yet.
+
+Anyone iterating through the list of children needs to be aware of the
+half-killed dentries that might be seen there; taking ->d_lock on those will
+see them negative, unhashed and with negative refcount, which means that most
+of the in-kernel users would've done the right thing anyway without any adjustment.
+
+---
+
+**recommended**
+
+Block device freezing and thawing have been moved to holder operations.
+
+Before this change, get_active_super() would only be able to find the
+superblock of the main block device, i.e., the one stored in sb->s_bdev. Block
+device freezing now works for any block device owned by a given superblock, not
+just the main block device. The get_active_super() helper and bd_fsfreeze_sb
+pointer are gone.
+
+---
+
+**mandatory**
+
+set_blocksize() now takes an opened struct file instead of a struct
+block_device, and the file *must* be opened exclusively.
+
+---
+
+**mandatory**
+
+->d_revalidate() gets two extra arguments - inode of parent directory and
+name our dentry is expected to have. Both are stable (dir is pinned in
+non-RCU case and will stay around during the call in RCU case, and name
+is guaranteed to stay unchanging). Your instance doesn't have to use
+either, but it often helps to avoid a lot of painful boilerplate.
+Note that while name->name is stable and NUL-terminated, it may (and
+often will) have name->name[name->len] equal to '/' rather than '\0' -
+in the normal case it points into the pathname being looked up.
+NOTE: if you need something like the full path from the root of the
+filesystem, you are still on your own - this assists with simple cases,
+but it's not magic.
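+
+The new signature looks like this (a sketch; foo_check() is made up)::
+
+  static int foo_d_revalidate(struct inode *dir, const struct qstr *name,
+                              struct dentry *dentry, unsigned int flags)
+  {
+          if (flags & LOOKUP_RCU)
+                  return -ECHILD;
+          /* dir and name are stable; no d_parent/d_name games needed */
+          return foo_check(dir, name, dentry);
+  }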
+
+---
+
+**recommended**
+
+kern_path_locked() and user_path_locked() no longer return a negative
+dentry so this doesn't need to be checked. If the name cannot be found,
+ERR_PTR(-ENOENT) is returned.
+
+---
+
+**recommended**
+
+lookup_one_qstr_excl() is changed to return errors in more cases, so
+these conditions don't require explicit checks:
+
+ - if LOOKUP_CREATE is NOT given, then the dentry won't be negative,
+ ERR_PTR(-ENOENT) is returned instead
+ - if LOOKUP_EXCL IS given, then the dentry won't be positive,
+ ERR_PTR(-EEXIST) is returned instead
+
+LOOKUP_EXCL now means "target must not exist". It can be combined with
+LOOKUP_CREATE or LOOKUP_RENAME_TARGET.
+
+---
+
+**mandatory**
+
+invalidate_inodes() is gone; use evict_inodes() instead.
+
+---
+
+**mandatory**
+
+->mkdir() now returns a dentry. If the created inode is found to
+already be in cache and have a dentry (often IS_ROOT()), it will need to
+be spliced into the given name in place of the given dentry. That dentry
+now needs to be returned. If the original dentry is used, NULL should
+be returned. Any error should be returned with ERR_PTR().
+
+In general, filesystems which use d_instantiate_new() to install the new
+inode can safely return NULL. Filesystems which may not have an I_NEW inode
+should call d_drop() then d_splice_alias(), and return the result of the
+latter.
+
+If a positive dentry cannot be returned for some reason, in-kernel
+clients such as cachefiles, nfsd, smb/server may not perform ideally but
+will fail-safe.
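+
+A sketch of a conforming instance (foo_new_dir_inode() is a placeholder for
+the filesystem's own directory inode creation)::
+
+  static struct dentry *foo_mkdir(struct mnt_idmap *idmap, struct inode *dir,
+                                  struct dentry *dentry, umode_t mode)
+  {
+          struct inode *inode = foo_new_dir_inode(dir, mode);
+
+          if (IS_ERR(inode))
+                  return ERR_CAST(inode);
+          d_instantiate_new(dentry, inode);
+          return NULL;    /* the dentry we were given was used as-is */
+  }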
+
+---
+
+**mandatory**
+
+lookup_one(), lookup_one_unlocked(), lookup_one_positive_unlocked() now
+take a qstr instead of a name and len. These, not the "one_len"
+versions, should be used whenever accessing a filesystem from outside
+that filesystem, through a mount point - which will have a mnt_idmap.
+
+---
+
+**mandatory**
+
+Functions try_lookup_one_len(), lookup_one_len(),
+lookup_one_len_unlocked() and lookup_positive_unlocked() have been
+renamed to try_lookup_noperm(), lookup_noperm(),
+lookup_noperm_unlocked(), lookup_noperm_positive_unlocked(). They now
+take a qstr instead of separate name and length. QSTR() can be used
+when strlen() is needed for the length.
+
+For try_lookup_noperm() a reference to the qstr is passed in case the
+hash might subsequently be needed.
+
+These functions no longer do any permission checking - they previously
+checked that the caller has 'X' permission on the parent. They must
+ONLY be used internally by a filesystem on itself when it knows that
+permissions are irrelevant, or in a context where permission checks have
+already been performed, such as after vfs_path_parent_lookup().
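+
+For example (a sketch using the unlocked variant; "state" names an entry the
+filesystem knows it owns)::
+
+  struct dentry *d;
+
+  /* QSTR() builds the qstr, computing strlen() internally */
+  d = lookup_noperm_unlocked(&QSTR("state"), parent);
+  if (IS_ERR(d))
+          return PTR_ERR(d);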
+
+---
+
+**mandatory**
+
+d_hash_and_lookup() is no longer exported or available outside the VFS.
+Use try_lookup_noperm() instead. This adds name validation and takes
+arguments in the opposite order but is otherwise identical.
+
+Using try_lookup_noperm() will require linux/namei.h to be included.
+
+---
+
+**mandatory**
+
+Calling conventions for ->d_automount() have changed; we should *not* grab
+an extra reference to the new mount - it should be returned with refcount 1.
+
+---
+
+collect_mounts()/drop_collected_mounts()/iterate_mounts() are gone now.
+Replacement is collect_paths()/drop_collected_path(), with no special
+iterator needed. Instead of a cloned mount tree, the new interface returns
+an array of struct path, one for each mount collect_mounts() would've
+created. These struct path point to locations in the caller's namespace
+that would be roots of the cloned mounts.
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 996f3cfe7030..5236cb52e357 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -47,6 +47,8 @@ fixes/update part 1.1 Stefani Seibold <stefani@seibold.net> June 9 2009
3.10 /proc/<pid>/timerslack_ns - Task timerslack value
3.11 /proc/<pid>/patch_state - Livepatch patch operation state
3.12 /proc/<pid>/arch_status - Task architecture specific information
+ 3.13 /proc/<pid>/fd - List of symlinks to open files
+ 3.14 /proc/<pid>/ksm_stat - Information about the process's ksm status
4 Configuring procfs
4.1 Mount options
@@ -84,7 +86,7 @@ contact Bodo Bauer at bb@ricochet.net. We'll be happy to add them to this
document.
The latest version of this document is available online at
-http://tldp.org/LDP/Linux-Filesystem-Hierarchy/html/proc.html
+https://www.kernel.org/doc/html/latest/filesystems/proc.html
If the above direction does not work for you, you could try the kernel
mailing list at linux-kernel@vger.kernel.org and/or try to reach me at
@@ -123,10 +125,20 @@ show you how you can use /proc/sys to change settings.
The directory /proc contains (among other things) one subdirectory for each
process running on the system, which is named after the process ID (PID).
-The link self points to the process reading the file system. Each process
+The link 'self' points to the process reading the file system. Each process
subdirectory has the entries listed in Table 1-1.
-Note that an open a file descriptor to /proc/<pid> or to any of its
+A process can read its own information from /proc/PID/* with no extra
+permissions. When reading /proc/PID/* information for other processes, the
+reading process is required to have either the CAP_SYS_PTRACE capability with
+PTRACE_MODE_READ access permissions, or, alternatively, the CAP_PERFMON
+capability. This applies to all read-only information like `maps`, `environ`,
+`pagemap`, etc. The only exception is the `mem` file, due to its read-write
+nature, which requires the CAP_SYS_PTRACE capability with the more elevated
+PTRACE_MODE_ATTACH permissions; the CAP_PERFMON capability does not grant
+access to /proc/PID/mem for other processes.
+
+Note that an open file descriptor to /proc/<pid> or to any of its
contained files or subdirectories does not prevent <pid> being reused
for some other process in the event that <pid> exits. Operations on
open /proc/<pid> file descriptors corresponding to dead processes
@@ -178,6 +190,7 @@ read the file /proc/PID/status::
Gid: 100 100 100 100
FDSize: 256
Groups: 100 14 16
+ Kthread: 0
VmPeak: 5004 kB
VmSize: 5004 kB
VmLck: 0 kB
@@ -210,6 +223,7 @@ read the file /proc/PID/status::
NoNewPrivs: 0
Seccomp: 0
Speculation_Store_Bypass: thread vulnerable
+ SpeculationIndirectBranch: conditional enabled
voluntary_ctxt_switches: 0
nonvoluntary_ctxt_switches: 1
file /proc/PID/status. Its fields are described in Table 1-2.
The statm file contains more detailed information about the process
memory usage. Its seven fields are explained in Table 1-3. The stat file
-contains details information about the process itself. Its fields are
+contains detailed information about the process itself. Its fields are
explained in Table 1-4.
(for SMP CONFIG users)
@@ -230,7 +244,7 @@ asynchronous manner and the value may not be very precise. To see a precise
snapshot of a moment, you can see /proc/<pid>/smaps file and scan page table.
It's slow but very precise.
-.. table:: Table 1-2: Contents of the status files (as of 4.19)
+.. table:: Table 1-2: Contents of the status fields (as of 4.19)
========================== ===================================================
Field Content
@@ -244,7 +258,8 @@ It's slow but very precise.
Ngid NUMA group ID (0 if none)
Pid process id
PPid process id of the parent process
- TracerPid PID of process tracing this process (0 if not)
+ TracerPid PID of process tracing this process (0 if not, or
+ the tracer is outside of the current pid namespace)
Uid Real, effective, saved set, and file system UIDs
Gid Real, effective, saved set, and file system GIDs
FDSize number of file descriptor slots currently allocated
@@ -253,6 +268,7 @@ It's slow but very precise.
NSpid descendant namespace process ID hierarchy
NSpgid descendant namespace process group ID hierarchy
NSsid descendant namespace session ID hierarchy
+ Kthread kernel thread flag, 1 is yes, 0 is no
VmPeak peak virtual memory size
VmSize total program size
VmLck locked memory size
@@ -292,6 +308,7 @@ It's slow but very precise.
NoNewPrivs no_new_privs, like prctl(PR_GET_NO_NEW_PRIV, ...)
Seccomp seccomp mode, like prctl(PR_GET_SECCOMP, ...)
Speculation_Store_Bypass speculative store bypass mitigation status
+ SpeculationIndirectBranch indirect branch speculation mode
Cpus_allowed mask of CPUs on which this process may run
Cpus_allowed_list Same as previous, but in "list format"
Mems_allowed mask of memory nodes allowed to this process
@@ -301,7 +318,7 @@ It's slow but very precise.
========================== ===================================================
-.. table:: Table 1-3: Contents of the statm files (as of 2.6.8-rc3)
+.. table:: Table 1-3: Contents of the statm fields (as of 2.6.8-rc3)
======== =============================== ==============================
Field Content
@@ -319,7 +336,7 @@ It's slow but very precise.
======== =============================== ==============================
-.. table:: Table 1-4: Contents of the stat files (as of 2.6.30-rc7)
+.. table:: Table 1-4: Contents of the stat fields (as of 2.6.30-rc7)
============= ===============================================================
Field Content
@@ -424,15 +441,28 @@ with the memory region, as the case would be with BSS (uninitialized data).
The "pathname" shows the name associated file for this mapping. If the mapping
is not associated with a file:
- ======= ====================================
+ =================== ===========================================
[heap] the heap of the program
[stack] the stack of the main process
[vdso] the "virtual dynamic shared object",
the kernel system call handler
- ======= ====================================
+ [anon:<name>] a private anonymous mapping that has been
+ named by userspace
+ [anon_shmem:<name>] an anonymous shared memory mapping that has
+ been named by userspace
+ =================== ===========================================
or if empty, the mapping is anonymous.
+Starting with the 6.11 kernel, /proc/PID/maps provides an alternative
+ioctl()-based API that gives the ability to flexibly and efficiently query and
+filter individual VMAs. This interface is binary and is meant for more
+efficient and easy programmatic use. `struct procmap_query`, defined in the
+linux/fs.h UAPI header, serves as an input/output argument to the
+`PROCMAP_QUERY` ioctl() command. See comments in the linux/fs.h UAPI header
+for details on query semantics, supported flags, data returned, and general
+API usage information.
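+
+A sketch of a minimal query from userspace (untested; field and flag names
+as defined in the linux/fs.h UAPI header)::
+
+  #include <fcntl.h>
+  #include <stdio.h>
+  #include <sys/ioctl.h>
+  #include <linux/fs.h>
+
+  int main(void)
+  {
+          struct procmap_query q = {
+                  .size = sizeof(q),
+                  .query_flags = PROCMAP_QUERY_COVERING_OR_NEXT_VMA,
+                  .query_addr = 0,        /* first VMA at or above 0 */
+          };
+          int fd = open("/proc/self/maps", O_RDONLY);
+
+          if (fd < 0 || ioctl(fd, PROCMAP_QUERY, &q))
+                  return 1;
+          printf("%llx-%llx\n", (unsigned long long)q.vma_start,
+                 (unsigned long long)q.vma_end);
+          return 0;
+  }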
+
The /proc/PID/smaps is an extension based on maps, showing the memory
consumption for each of the process's mappings. For each mapping (aka Virtual
Memory Area, or VMA) there is a series of lines such as the following::
@@ -444,12 +474,14 @@ Memory Area, or VMA) there is a series of lines such as the following::
MMUPageSize: 4 kB
Rss: 892 kB
Pss: 374 kB
+ Pss_Dirty: 0 kB
Shared_Clean: 892 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 892 kB
Anonymous: 0 kB
+ KSM: 0 kB
LazyFree: 0 kB
AnonHugePages: 0 kB
ShmemPmdMapped: 0 kB
@@ -463,23 +495,42 @@ Memory Area, or VMA) there is a series of lines such as the following::
THPeligible: 0
VmFlags: rd ex mr mw me dw
-The first of these lines shows the same information as is displayed for the
-mapping in /proc/PID/maps. Following lines show the size of the mapping
-(size); the size of each page allocated when backing a VMA (KernelPageSize),
-which is usually the same as the size in the page table entries; the page size
-used by the MMU when backing a VMA (in most cases, the same as KernelPageSize);
-the amount of the mapping that is currently resident in RAM (RSS); the
-process' proportional share of this mapping (PSS); and the number of clean and
-dirty shared and private pages in the mapping.
+The first of these lines shows the same information as is displayed for
+the mapping in /proc/PID/maps. Following lines show the size of the
+mapping (size); the size of each page allocated when backing a VMA
+(KernelPageSize), which is usually the same as the size in the page table
+entries; the page size used by the MMU when backing a VMA (in most cases,
+the same as KernelPageSize); the amount of the mapping that is currently
+resident in RAM (RSS); the process's proportional share of this mapping
+(PSS); and the number of clean and dirty shared and private pages in the
+mapping.
The "proportional set size" (PSS) of a process is the count of pages it has
in memory, where each page is divided by the number of processes sharing it.
So if a process has 1000 pages all to itself, and 1000 shared with one other
-process, its PSS will be 1500.
-
-Note that even a page which is part of a MAP_SHARED mapping, but has only
-a single pte mapped, i.e. is currently used by only one process, is accounted
-as private and not as shared.
+process, its PSS will be 1500. "Pss_Dirty" is the portion of PSS which
+consists of dirty pages. ("Pss_Clean" is not included, but it can be
+calculated by subtracting "Pss_Dirty" from "Pss".)
+
+Traditionally, a page is accounted as "private" if it is mapped exactly once,
+and a page is accounted as "shared" when mapped multiple times, even when
+mapped in the same process multiple times. Note that this accounting is
+independent of MAP_SHARED.
+
+In some kernel configurations, the semantics of pages part of a larger
+allocation (e.g., THP) can differ: a page is accounted as "private" if all
+pages part of the corresponding large allocation are *certainly* mapped in the
+same process, even if the page is mapped multiple times in that process. A
+page is accounted as "shared" if any page of the larger allocation
+is *maybe* mapped in a different process. In some cases, a large allocation
+might be treated as "maybe mapped by multiple processes" even though this
+is no longer the case.
+
+Some kernel configurations do not track the precise number of times a page part
+of a larger allocation is mapped. In this case, when calculating the PSS, the
+average number of mappings per page in this larger allocation might be used
+as an approximation for the number of mappings of a page. The PSS calculation
+will be imprecise in this case.
"Referenced" indicates the amount of memory currently marked as referenced or
accessed.
@@ -488,18 +539,21 @@ accessed.
a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
and a page is modified, the file page is replaced by a private anonymous copy.
+"KSM" reports how many of the pages are KSM pages. Note that KSM-placed zeropages
+are not included, only actual KSM pages.
+
"LazyFree" shows the amount of memory which is marked by madvise(MADV_FREE).
The memory isn't freed immediately with madvise(). It's freed in memory
pressure if the memory is clean. Please note that the printed value might
be lower than the real value due to optimizations used in the current
implementation. If this is not desirable please file a bug report.
-"AnonHugePages" shows the ammount of memory backed by transparent hugepage.
+"AnonHugePages" shows the amount of memory backed by transparent hugepage.
-"ShmemPmdMapped" shows the ammount of shared (shmem/tmpfs) memory backed by
+"ShmemPmdMapped" shows the amount of shared (shmem/tmpfs) memory backed by
huge pages.
-"Shared_Hugetlb" and "Private_Hugetlb" show the ammounts of memory backed by
+"Shared_Hugetlb" and "Private_Hugetlb" show the amounts of memory backed by
hugetlbfs page which is *not* counted in "RSS" or "PSS" field for historical
reasons. And these are not included in {Shared,Private}_{Clean,Dirty} field.
@@ -510,8 +564,10 @@ replaced by copy-on-write) part of the underlying shmem object out on swap.
"SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this
does not take into account swapped out page of underlying shmem objects.
"Locked" indicates whether the mapping is locked in memory or not.
-"THPeligible" indicates whether the mapping is eligible for allocating THP
-pages - 1 if true, 0 otherwise. It just shows the current status.
+
+"THPeligible" indicates whether the mapping is eligible for allocating
+naturally aligned THP pages of any currently enabled size. 1 if true, 0
+otherwise.
"VmFlags" field deserves a separate description. This member represents the
kernel flags associated with the particular virtual memory area in two letter
@@ -528,7 +584,6 @@ encoded manner. The codes are the following:
ms may share
gd stack segment grows down
pf pure PFN range
- dw disabled write to the mapped file
lo pages are locked in memory
io memory mapped I/O area
sr sequential read advise provided
@@ -538,14 +593,24 @@ encoded manner. The codes are the following:
ac area is accountable
nr swap space is not reserved for the area
ht area uses huge tlb pages
+ sf synchronous page fault
ar architecture specific flag
+ wf wipe on fork
dd do not include area into core dump
sd soft dirty flag
mm mixed map area
hg huge page advise flag
nh no huge page advise flag
- mg mergable advise flag
- bt - arm64 BTI guarded page
+ mg mergeable advise flag
+ bt arm64 BTI guarded page
+ mt arm64 MTE allocation tags are enabled
+ um userfaultfd missing tracking
+ uw userfaultfd wr-protect tracking
+ ui userfaultfd minor fault
+ ss shadow/guarded control stack page
+ sl sealed
+ lf lock on fault pages
+ dp always lazily freeable mapping
== =======================================
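+
+For example, the flags of a mapping can be inspected directly (a sketch;
+the exact flags shown depend on the kernel and the mapping)::
+
+    $ grep -m1 ^VmFlags /proc/self/smaps
+    VmFlags: rd ex mr mw me sd
+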
Note that there is no guarantee that every flag and associated mnemonic will
@@ -649,6 +714,11 @@ Where:
node locality page counters (N0 == node0, N1 == node1, ...) and the kernel page
size, in KB, that is backing the mapping up.
+Note that some kernel configurations do not track the precise number of times
+a page part of a larger allocation (e.g., THP) is mapped. In these
+configurations, "mapmax" might corresponds to the average number of mappings
+per page in such a larger allocation instead.
+
1.2 Kernel data
---------------
@@ -663,10 +733,17 @@ files are there, and which are missing.
============ ===============================================================
File Content
============ ===============================================================
+ allocinfo Memory allocations profiling information
apm Advanced power management info
+ bootconfig Kernel command line obtained from boot config,
+ and, if there were kernel parameters from the
+ boot loader, a "# Parameters from bootloader:"
+ line followed by a line containing those
+ parameters prefixed by "# ". (5.5)
buddyinfo Kernel memory allocator information (see text) (2.5)
bus Directory containing bus specific information
- cmdline Kernel command line
+ cmdline Kernel command line, both from bootloader and embedded
+ in the kernel image
cpuinfo Info about the CPU
devices Available devices (block and character)
dma Used DMA channels
@@ -684,7 +761,14 @@ files are there, and which are missing.
kcore Kernel core image (can be ELF or A.OUT(deprecated in 2.4))
kmsg Kernel messages
ksyms Kernel symbol table
- loadavg Load average of last 1, 5 & 15 minutes
+ loadavg Load average of last 1, 5 & 15 minutes;
+ number of processes currently runnable (running or on ready queue);
+ total number of processes in system;
+ last pid created.
+ All fields are separated by one space except "number of
+ processes currently runnable" and "total number of processes
+ in system", which are separated by a slash ('/'). Example:
+ 0.61 0.61 0.55 3/828 22084
locks Kernel locks
meminfo Memory info
misc Miscellaneous
@@ -782,7 +866,7 @@ SPU
For this case the APIC will generate the interrupt with a IRQ vector
of 0xff. This might also be generated by chipset bugs.
-RES, CAL, TLB]
+RES, CAL, TLB
rescheduling, call and TLB flush interrupts are
sent from one CPU to another per the needs of the OS. Typically,
their statistics are used by kernel developers and interested users to
@@ -794,7 +878,7 @@ suppressed when the system is a uniprocessor. As of this writing, only
i386 and x86_64 platforms support the new IRQ vector displays.
Of some interest is the introduction of the /proc/irq directory to 2.4.
-It could be used to set IRQ to CPU affinity, this means that you can "hook" an
+It could be used to set IRQ to CPU affinity. This means that you can "hook" an
IRQ to only one CPU, or to exclude a CPU of handling IRQs. The contents of the
irq subdir is one subdir for each IRQ, and two files; default_smp_affinity and
prof_cpu_mask.
@@ -808,7 +892,7 @@ For example::
smp_affinity
smp_affinity is a bitmask, in which you can specify which CPUs can handle the
-IRQ, you can set it by doing::
+IRQ. You can set it by doing::
> echo 1 > /proc/irq/10/smp_affinity
@@ -821,7 +905,7 @@ The contents of each smp_affinity file is the same by default::
ffffffff
There is an alternate interface, smp_affinity_list which allows specifying
-a cpu range instead of a bitmask::
+a CPU range instead of a bitmask::
> cat /proc/irq/0/smp_affinity_list
1024-1031
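+
+For example, restricting IRQ 10 (illustrative) to CPUs 0-3 can be
+expressed either as a mask or as a list (a sketch)::
+
+    # echo f > /proc/irq/10/smp_affinity
+    # echo 0-3 > /proc/irq/10/smp_affinity_list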
@@ -835,7 +919,7 @@ reports itself as being attached. This hardware locality information does not
include information about any possible driver locality preference.
prof_cpu_mask specifies which CPUs are to be profiled by the system wide
-profiler. Default value is ffffffff (all cpus if there are only 32 of them).
+profiler. Default value is ffffffff (all CPUs if there are only 32 of them).
The way IRQs are routed is handled by the IO-APIC, and it's Round Robin
between all the CPUs which are allowed to handle it. As usual the kernel has
@@ -897,7 +981,7 @@ pagetypeinfo::
Fragmentation avoidance in the kernel works by grouping pages of different
migrate types into the same contiguous regions of memory called page blocks.
-A page block is typically the size of the default hugepage size e.g. 2MB on
+A page block is typically the size of the default hugepage size, e.g. 2MB on
X86-64. By keeping pages grouped based on their ability to move, the kernel
can reclaim pages within a page block to satisfy a high-order allocation.
@@ -915,60 +999,117 @@ also be allocatable although a lot of filesystem metadata may have to be
reclaimed to achieve this.
+allocinfo
+~~~~~~~~~
+
+Provides information about memory allocations at all locations in the code
+base. Each allocation in the code is identified by its source file, line
+number, module (if originates from a loadable module) and the function calling
+the allocation. The number of bytes allocated and number of calls at each
+location are reported. The first line indicates the version of the file, the
+second line is the header listing fields in the file.
+
+Example output.
+
+::
+
+ > tail -n +3 /proc/allocinfo | sort -rn
+ 127664128 31168 mm/page_ext.c:270 func:alloc_page_ext
+ 56373248 4737 mm/slub.c:2259 func:alloc_slab_page
+ 14880768 3633 mm/readahead.c:247 func:page_cache_ra_unbounded
+ 14417920 3520 mm/mm_init.c:2530 func:alloc_large_system_hash
+ 13377536 234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs
+ 11718656 2861 mm/filemap.c:1919 func:__filemap_get_folio
+ 9192960 2800 kernel/fork.c:307 func:alloc_thread_stack_node
+ 4206592 4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable
+ 4136960 1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start
+ 3940352 962 mm/memory.c:4214 func:alloc_anon_folio
+ 2894464 22613 fs/kernfs/dir.c:615 func:__kernfs_new_node
+ ...
+
+
meminfo
~~~~~~~
Provides information about distribution and utilization of memory. This
-varies by architecture and compile options. The following is from a
-16GB PIII, which has highmem enabled. You may not have all of these fields.
+varies by architecture and compile options. Some of the counters reported
+here overlap. The memory reported by the non-overlapping counters may not
+add up to the overall memory usage and the difference for some workloads
+can be substantial. In many cases there are other means to find out
+additional memory using subsystem-specific interfaces, for instance
+/proc/net/sockstat for TCP memory allocations.
+
+Example output. You may not have all of these fields.
::
> cat /proc/meminfo
- MemTotal: 16344972 kB
- MemFree: 13634064 kB
- MemAvailable: 14836172 kB
- Buffers: 3656 kB
- Cached: 1195708 kB
- SwapCached: 0 kB
- Active: 891636 kB
- Inactive: 1077224 kB
- HighTotal: 15597528 kB
- HighFree: 13629632 kB
- LowTotal: 747444 kB
- LowFree: 4432 kB
- SwapTotal: 0 kB
- SwapFree: 0 kB
- Dirty: 968 kB
- Writeback: 0 kB
- AnonPages: 861800 kB
- Mapped: 280372 kB
- Shmem: 644 kB
- KReclaimable: 168048 kB
- Slab: 284364 kB
- SReclaimable: 159856 kB
- SUnreclaim: 124508 kB
- PageTables: 24448 kB
- NFS_Unstable: 0 kB
- Bounce: 0 kB
- WritebackTmp: 0 kB
- CommitLimit: 7669796 kB
- Committed_AS: 100056 kB
- VmallocTotal: 112216 kB
- VmallocUsed: 428 kB
- VmallocChunk: 111088 kB
- Percpu: 62080 kB
- HardwareCorrupted: 0 kB
- AnonHugePages: 49152 kB
- ShmemHugePages: 0 kB
- ShmemPmdMapped: 0 kB
+ MemTotal: 32858820 kB
+ MemFree: 21001236 kB
+ MemAvailable: 27214312 kB
+ Buffers: 581092 kB
+ Cached: 5587612 kB
+ SwapCached: 0 kB
+ Active: 3237152 kB
+ Inactive: 7586256 kB
+ Active(anon): 94064 kB
+ Inactive(anon): 4570616 kB
+ Active(file): 3143088 kB
+ Inactive(file): 3015640 kB
+ Unevictable: 0 kB
+ Mlocked: 0 kB
+ SwapTotal: 0 kB
+ SwapFree: 0 kB
+ Zswap: 1904 kB
+ Zswapped: 7792 kB
+ Dirty: 12 kB
+ Writeback: 0 kB
+ AnonPages: 4654780 kB
+ Mapped: 266244 kB
+ Shmem: 9976 kB
+ KReclaimable: 517708 kB
+ Slab: 660044 kB
+ SReclaimable: 517708 kB
+ SUnreclaim: 142336 kB
+ KernelStack: 11168 kB
+ PageTables: 20540 kB
+ SecPageTables: 0 kB
+ NFS_Unstable: 0 kB
+ Bounce: 0 kB
+ WritebackTmp: 0 kB
+ CommitLimit: 16429408 kB
+ Committed_AS: 7715148 kB
+ VmallocTotal: 34359738367 kB
+ VmallocUsed: 40444 kB
+ VmallocChunk: 0 kB
+ Percpu: 29312 kB
+ EarlyMemtestBad: 0 kB
+ HardwareCorrupted: 0 kB
+ AnonHugePages: 4149248 kB
+ ShmemHugePages: 0 kB
+ ShmemPmdMapped: 0 kB
+ FileHugePages: 0 kB
+ FilePmdMapped: 0 kB
+ CmaTotal: 0 kB
+ CmaFree: 0 kB
+ Unaccepted: 0 kB
+ Balloon: 0 kB
+ HugePages_Total: 0
+ HugePages_Free: 0
+ HugePages_Rsvd: 0
+ HugePages_Surp: 0
+ Hugepagesize: 2048 kB
+ Hugetlb: 0 kB
+ DirectMap4k: 401152 kB
+ DirectMap2M: 10008576 kB
+ DirectMap1G: 24117248 kB
MemTotal
- Total usable ram (i.e. physical ram minus a few reserved
+ Total usable RAM (i.e. physical RAM minus a few reserved
bits and the kernel binary code)
MemFree
- The sum of LowFree+HighFree
+ Total free RAM. On highmem systems, the sum of LowFree+HighFree
MemAvailable
An estimate of how much memory is available for starting new
applications, without swapping. Calculated from MemFree,
@@ -982,8 +1123,9 @@ Buffers
Relatively temporary storage for raw disk blocks
shouldn't get tremendously large (20MB or so)
Cached
- in-memory cache for files read from the disk (the
- pagecache). Doesn't include SwapCached
+ In-memory cache for files read from the disk (the
+ pagecache) as well as tmpfs & shmem.
+ Doesn't include SwapCached.
SwapCached
Memory that once was swapped out, is swapped back in but
still also is in the swapfile (if memory is needed it
@@ -995,8 +1137,13 @@ Active
Inactive
Memory which has been less recently used. It is more
eligible to be reclaimed for other purposes
+Unevictable
+ Memory allocated for userspace which cannot be reclaimed, such
+ as mlocked pages, ramfs backing pages, secret memfd pages etc.
+Mlocked
+ Memory locked with mlock().
HighTotal, HighFree
- Highmem is all memory above ~860MB of physical memory
+ Highmem is all memory above ~860MB of physical memory.
Highmem areas are for use by userspace programs, or
for the pagecache. The kernel must use tricks to access
this memory, making it slower to access than lowmem.
@@ -1011,26 +1158,26 @@ SwapTotal
SwapFree
Memory which has been evicted from RAM, and is temporarily
on the disk
+Zswap
+ Memory consumed by the zswap backend (compressed size)
+Zswapped
+ Amount of anonymous memory stored in zswap (original size)
Dirty
Memory which is waiting to get written back to the disk
Writeback
Memory which is actively being written back to the disk
AnonPages
- Non-file backed pages mapped into userspace page tables
-HardwareCorrupted
- The amount of RAM/memory in KB, the kernel identifies as
- corrupted.
-AnonHugePages
- Non-file backed huge pages mapped into userspace page tables
+ Non-file backed pages mapped into userspace page tables. Note that
+ some kernel configurations might consider all pages part of a
+ larger allocation (e.g., THP) as "mapped", as soon as a single
+ page is mapped.
Mapped
- files which have been mmaped, such as libraries
+ files which have been mmapped, such as libraries. Note that some
+ kernel configurations might consider all pages part of a larger
+ allocation (e.g., THP) as "mapped", as soon as a single page is
+ mapped.
Shmem
Total memory used by shared memory (shmem) and tmpfs
-ShmemHugePages
- Memory used by shared memory (shmem) and tmpfs allocated
- with huge pages
-ShmemPmdMapped
- Shared memory mapped into userspace with huge pages
KReclaimable
Kernel allocations that the kernel will attempt to reclaim
under memory pressure. Includes SReclaimable (below), and other
@@ -1041,9 +1188,13 @@ SReclaimable
Part of Slab, that might be reclaimed, such as caches
SUnreclaim
Part of Slab, that cannot be reclaimed on memory pressure
+KernelStack
+ Memory consumed by the kernel stacks of all tasks
PageTables
- amount of memory dedicated to the lowest level of page
- tables.
+ Memory consumed by userspace page tables
+SecPageTables
+ Memory consumed by secondary page tables; this currently includes
+ KVM mmu and IOMMU allocations on x86 and arm64.
NFS_Unstable
Always zero. Previous counted pages which had been written to
the server, but has not been committed to stable storage.
@@ -1068,23 +1219,23 @@ CommitLimit
yield a CommitLimit of 7.3G.
For more details, see the memory overcommit documentation
- in vm/overcommit-accounting.
+ in mm/overcommit-accounting.
Committed_AS
The amount of memory presently allocated on the system.
The committed memory is a sum of all of the memory which
has been allocated by processes, even if it has not been
"used" by them as of yet. A process which malloc()'s 1G
of memory, but only touches 300M of it will show up as
- using 1G. This 1G is memory which has been "committed" to
+ using 1G. This 1G is memory which has been "committed" to
by the VM and can be used at any time by the allocating
application. With strict overcommit enabled on the system
- (mode 2 in 'vm.overcommit_memory'),allocations which would
+ (mode 2 in 'vm.overcommit_memory'), allocations which would
exceed the CommitLimit (detailed above) will not be permitted.
This is useful if one needs to guarantee that processes will
not fail due to lack of memory once that memory has been
successfully allocated.
VmallocTotal
- total size of vmalloc memory area
+ total size of vmalloc virtual address space
VmallocUsed
amount of vmalloc area which is used
VmallocChunk
@@ -1092,6 +1243,41 @@ VmallocChunk
Percpu
Memory allocated to the percpu allocator used to back percpu
allocations. This stat excludes the cost of metadata.
+EarlyMemtestBad
+ The amount of RAM/memory in kB that was identified as corrupted
+ by early memtest. If memtest was not run, this field will not
+ be displayed at all. Size is never rounded down to 0 kB.
+ That means if 0 kB is reported, you can safely assume
+ there was at least one pass of memtest and none of the passes
+ found a single faulty byte of RAM.
+HardwareCorrupted
+ The amount of RAM/memory in KB, the kernel identifies as
+ corrupted.
+AnonHugePages
+ Non-file backed huge pages mapped into userspace page tables
+ShmemHugePages
+ Memory used by shared memory (shmem) and tmpfs allocated
+ with huge pages
+ShmemPmdMapped
+ Shared memory mapped into userspace with huge pages
+FileHugePages
+ Memory used for filesystem data (page cache) allocated
+ with huge pages
+FilePmdMapped
+ Page cache mapped into userspace with huge pages
+CmaTotal
+ Memory reserved for the Contiguous Memory Allocator (CMA)
+CmaFree
+ Free remaining memory in the CMA reserves
+Unaccepted
+ Memory that has not been accepted by the guest
+Balloon
+ Memory returned to Host by VM Balloon Drivers
+HugePages_Total, HugePages_Free, HugePages_Rsvd, HugePages_Surp, Hugepagesize, Hugetlb
+ See Documentation/admin-guide/mm/hugetlbpage.rst.
+DirectMap4k, DirectMap2M, DirectMap1G
+ Breakdown of page table sizes used in the kernel's
+ identity mapping of RAM
vmallocinfo
~~~~~~~~~~~
@@ -1099,7 +1285,7 @@ vmallocinfo
Provides information about vmalloced/vmaped areas. One line per area,
containing the virtual address range of the area, size in bytes,
caller information of the creator, and optional information depending
-on the kind of area :
+on the kind of area:
========== ===================================================
pages=nr number of pages
@@ -1144,101 +1330,23 @@ on the kind of area :
softirqs
~~~~~~~~
-Provides counts of softirq handlers serviced since boot time, for each cpu.
+Provides counts of softirq handlers serviced since boot time, for each CPU.
::
> cat /proc/softirqs
- CPU0 CPU1 CPU2 CPU3
+ CPU0 CPU1 CPU2 CPU3
HI: 0 0 0 0
- TIMER: 27166 27120 27097 27034
+ TIMER: 27166 27120 27097 27034
NET_TX: 0 0 0 17
NET_RX: 42 0 0 39
- BLOCK: 0 0 107 1121
- TASKLET: 0 0 0 290
- SCHED: 27035 26983 26971 26746
- HRTIMER: 0 0 0 0
- RCU: 1678 1769 2178 2250
-
-
-1.3 IDE devices in /proc/ide
-----------------------------
-
-The subdirectory /proc/ide contains information about all IDE devices of which
-the kernel is aware. There is one subdirectory for each IDE controller, the
-file drivers and a link for each IDE device, pointing to the device directory
-in the controller specific subtree.
-
-The file drivers contains general information about the drivers used for the
-IDE devices::
-
- > cat /proc/ide/drivers
- ide-cdrom version 4.53
- ide-disk version 1.08
-
-More detailed information can be found in the controller specific
-subdirectories. These are named ide0, ide1 and so on. Each of these
-directories contains the files shown in table 1-6.
-
-
-.. table:: Table 1-6: IDE controller info in /proc/ide/ide?
-
- ======= =======================================
- File Content
- ======= =======================================
- channel IDE channel (0 or 1)
- config Configuration (only for PCI/IDE bridge)
- mate Mate name
- model Type/Chipset of IDE controller
- ======= =======================================
-
-Each device connected to a controller has a separate subdirectory in the
-controllers directory. The files listed in table 1-7 are contained in these
-directories.
-
-
-.. table:: Table 1-7: IDE device information
-
- ================ ==========================================
- File Content
- ================ ==========================================
- cache The cache
- capacity Capacity of the medium (in 512Byte blocks)
- driver driver and version
- geometry physical and logical geometry
- identify device identify block
- media media type
- model device identifier
- settings device setup
- smart_thresholds IDE disk management thresholds
- smart_values IDE disk management values
- ================ ==========================================
-
-The most interesting file is ``settings``. This file contains a nice
-overview of the drive parameters::
-
- # cat /proc/ide/ide0/hda/settings
- name value min max mode
- ---- ----- --- --- ----
- bios_cyl 526 0 65535 rw
- bios_head 255 0 255 rw
- bios_sect 63 0 63 rw
- breada_readahead 4 0 127 rw
- bswap 0 0 1 r
- file_readahead 72 0 2097151 rw
- io_32bit 0 0 3 rw
- keepsettings 0 0 1 rw
- max_kb_per_request 122 1 127 rw
- multcount 0 0 8 rw
- nice1 1 0 1 rw
- nowerr 0 0 1 rw
- pio_mode write-only 0 255 w
- slow 0 0 1 rw
- unmaskirq 0 0 1 rw
- using_dma 0 0 1 rw
-
-
-1.4 Networking info in /proc/net
+ BLOCK: 0 0 107 1121
+ TASKLET: 0 0 0 290
+ SCHED: 27035 26983 26971 26746
+ HRTIMER: 0 0 0 0
+ RCU: 1678 1769 2178 2250
+
+1.3 Networking info in /proc/net
--------------------------------
The subdirectory /proc/net follows the usual pattern. Table 1-8 shows the
@@ -1284,6 +1392,7 @@ support this. Table 1-9 lists the files and their meaning.
rt_cache Routing cache
snmp SNMP data
sockstat Socket statistics
+ softnet_stat Per-CPU incoming packet queue statistics of online CPUs
tcp TCP sockets
udp UDP sockets
unix UNIX domain sockets
@@ -1317,12 +1426,12 @@ It will contain information that is specific to that bond, such as the
current slaves of the bond, the link status of the slaves, and how
many times the slaves link has failed.
-1.5 SCSI info
+1.4 SCSI info
-------------
-If you have a SCSI host adapter in your system, you'll find a subdirectory
-named after the driver for this adapter in /proc/scsi. You'll also see a list
-of all recognized SCSI devices in /proc/scsi::
+If you have a SCSI or ATA host adapter in your system, you'll find a
+subdirectory named after the driver for this adapter in /proc/scsi.
+You'll also see a list of all recognized SCSI devices in /proc/scsi::
>cat /proc/scsi/scsi
Attached devices:
@@ -1380,7 +1489,7 @@ AHA-2940 SCSI adapter::
Total transfers 0 (0 reads and 0 writes)
-1.6 Parallel port info in /proc/parport
+1.5 Parallel port info in /proc/parport
---------------------------------------
The directory /proc/parport contains information about the parallel ports of
@@ -1405,11 +1514,11 @@ These directories contain the four files shown in Table 1-10.
number or none).
========= ====================================================================
-1.7 TTY info in /proc/tty
+1.6 TTY info in /proc/tty
-------------------------
Information about the available and actually used tty's can be found in the
-directory /proc/tty.You'll find entries for drivers and line disciplines in
+directory /proc/tty. You'll find entries for drivers and line disciplines in
this directory, as shown in Table 1-11.
@@ -1440,7 +1549,7 @@ To see which tty's are currently in use, you can simply look into the file
unknown /dev/tty 4 1-63 console
-1.8 Miscellaneous kernel statistics in /proc/stat
+1.7 Miscellaneous kernel statistics in /proc/stat
-------------------------------------------------
Various pieces of information about kernel activity are available in the
@@ -1448,16 +1557,18 @@ Various pieces of information about kernel activity are available in the
since the system first booted. For a quick look, simply cat the file::
> cat /proc/stat
- cpu 2255 34 2290 22625563 6290 127 456 0 0 0
- cpu0 1132 34 1441 11311718 3675 127 438 0 0 0
- cpu1 1123 0 849 11313845 2614 0 18 0 0 0
- intr 114930548 113199788 3 0 5 263 0 4 [... lots more numbers ...]
- ctxt 1990473
- btime 1062191376
- processes 2915
- procs_running 1
+ cpu 237902850 368826709 106375398 1873517540 1135548 0 14507935 0 0 0
+ cpu0 60045249 91891769 26331539 468411416 495718 0 5739640 0 0 0
+ cpu1 59746288 91759249 26609887 468860630 312281 0 4384817 0 0 0
+ cpu2 59489247 92985423 26904446 467808813 171668 0 2268998 0 0 0
+ cpu3 58622065 92190267 26529524 468436680 155879 0 2114478 0 0 0
+ intr 8688370575 8 3373 0 0 0 0 0 0 1 40791 0 0 353317 0 0 0 0 224789828 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 190974333 41958554 123983334 43 0 224593 0 0 0 <more 0's deleted>
+ ctxt 22848221062
+ btime 1605316999
+ processes 746787147
+ procs_running 2
procs_blocked 0
- softirq 183433 0 21755 12 39 1137 231 21459 2263
+ softirq 12121874454 100099120 3938138295 127375644 2795979 187870761 0 173808342 3072582055 52608 224184354
The very first "cpu" line aggregates the numbers in all of the other "cpuN"
lines. These numbers identify the amount of time the CPU has spent performing
@@ -1471,9 +1582,9 @@ second). The meanings of the columns are as follows, from left to right:
- iowait: In a word, iowait stands for waiting for I/O to complete. But there
are several problems:
- 1. Cpu will not wait for I/O to complete, iowait is the time that a task is
- waiting for I/O to complete. When cpu goes into idle state for
- outstanding task io, another task will be scheduled on this CPU.
+ 1. CPU will not wait for I/O to complete; iowait is the time that a task is
+ waiting for I/O to complete. When CPU goes into idle state for
+ outstanding task I/O, another task will be scheduled on this CPU.
2. In a multi-core CPU, the task waiting for I/O to complete is not running
on any CPU, so the iowait of each CPU is difficult to calculate.
3. The value of iowait field in /proc/stat will decrease in certain
@@ -1513,14 +1624,14 @@ softirqs serviced; each subsequent column is the total for that particular
softirq.
-1.9 Ext4 file system parameters
+1.8 Ext4 file system parameters
-------------------------------
Information about mounted ext4 file systems can be found in
/proc/fs/ext4. Each mounted filesystem will have a directory in
/proc/fs/ext4 based on its device name (i.e., /proc/fs/ext4/hdc or
-/proc/fs/ext4/dm-0). The files in each per-device directory are shown
-in Table 1-12, below.
+/proc/fs/ext4/sda9 or /proc/fs/ext4/dm-0). The files in each per-device
+directory are shown in Table 1-12, below.
.. table:: Table 1-12: Files in /proc/fs/ext4/<devname>
@@ -1529,8 +1640,8 @@ in Table 1-12, below.
mb_groups details of multiblock allocator buddy cache of free blocks
============== ==========================================================
-2.0 /proc/consoles
-------------------
+1.9 /proc/consoles
+-------------------
Shows registered system console lines.
To see which character device lines are currently used for the system console
@@ -1590,10 +1701,9 @@ production system. Set up a development machine and test to make sure that
everything works the way you want it to. You may have no alternative but to
reboot the machine once an error has been made.
-To change a value, simply echo the new value into the file. An example is
-given below in the section on the file system data. You need to be root to do
-this. You can create your own boot script to perform this every time your
-system boots.
+To change a value, simply echo the new value into the file.
+You need to be root to do this. You can create your own boot script
+to perform this every time your system boots.
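+
+For example, the magic SysRq key can be enabled like this (a sketch;
+any writable file under /proc/sys works the same way)::
+
+    # echo 1 > /proc/sys/kernel/sysrq
+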
The files in /proc/sys can be used to fine tune and monitor miscellaneous and
general things in the operation of the Linux kernel. Since some of the files
@@ -1601,12 +1711,12 @@ can inadvertently disrupt your system, it is advisable to read both
documentation and source before actually making adjustments. In any case, be
very careful when writing to any of these files. The entries in /proc may
change slightly between the 2.1.* and the 2.2 kernel, so if there is any doubt
-review the kernel documentation in the directory /usr/src/linux/Documentation.
+review the kernel documentation in the directory linux/Documentation.
This chapter is heavily based on the documentation included in the pre 2.2
kernels, and became part of it in version 2.2.1 of the Linux kernel.
-Please see: Documentation/admin-guide/sysctl/ directory for descriptions of these
-entries.
+Please see: Documentation/admin-guide/sysctl/ directory for descriptions of
+these entries.
Summary
-------
@@ -1624,8 +1734,8 @@ Chapter 3: Per-process Parameters
3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj- Adjust the oom-killer score
--------------------------------------------------------------------------------
-These file can be used to adjust the badness heuristic used to select which
-process gets killed in out of memory conditions.
+These files can be used to adjust the badness heuristic used to select which
+process gets killed in out of memory (oom) conditions.
The badness heuristic assigns a value to each candidate task ranging from 0
(never kill) to 1000 (always kill) to determine which process is targeted. The
@@ -1634,9 +1744,6 @@ may allocate from based on an estimation of its current memory and swap use.
For example, if a task is using all allowed memory, its badness score will be
1000. If it is using half of its allowed memory, its score will be 500.
-There is an additional factor included in the badness score: the current memory
-and swap usage is discounted by 3% for root processes.
-
The amount of "allowed" memory depends on the context in which the oom killer
was called. If it is due to the memory assigned to the allocating task's cpuset
being exhausted, the allowed memory represents the set of mems assigned to that
@@ -1672,24 +1779,22 @@ The value of /proc/<pid>/oom_score_adj may be reduced no lower than the last
value set by a CAP_SYS_RESOURCE process. To reduce the value any lower
requires CAP_SYS_RESOURCE.
-Caveat: when a parent task is selected, the oom killer will sacrifice any first
-generation children with separate address spaces instead, if possible. This
-avoids servers and important system daemons from being killed and loses the
-minimal amount of work.
-
3.2 /proc/<pid>/oom_score - Display current oom-killer score
-------------------------------------------------------------
-This file can be used to check the current score used by the oom-killer is for
+This file can be used to check the current score used by the oom-killer for
any given <pid>. Use it together with /proc/<pid>/oom_score_adj to tune which
process should be killed in an out-of-memory situation.
+Please note that the exported value includes oom_score_adj, so it is
+effectively in range [0,2000].
+
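+For example, a process can be exempted from oom-killing and the effect
+checked in its score (a sketch; the pid is illustrative)::
+
+    # echo -1000 > /proc/3580/oom_score_adj
+    # cat /proc/3580/oom_score
+    0
+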
3.3 /proc/<pid>/io - Display the IO accounting fields
-------------------------------------------------------
-This file contains IO statistics for each running process
+This file contains IO statistics for each running process.
Example
~~~~~~~
@@ -1720,7 +1825,7 @@ The number of bytes which this task has caused to be read from storage. This
is simply the sum of bytes which this process passed to read() and pread().
It includes things like tty IO and it is unaffected by whether or not actual
physical disk IO was required (the read might have been satisfied from
-pagecache)
+pagecache).
wchar
@@ -1842,19 +1947,19 @@ For example::
This file contains lines of the form::
36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue
- (1)(2)(3) (4) (5) (6) (7) (8) (9) (10) (11)
-
- (1) mount ID: unique identifier of the mount (may be reused after umount)
- (2) parent ID: ID of parent (or of self for the top of the mount tree)
- (3) major:minor: value of st_dev for files on filesystem
- (4) root: root of the mount within the filesystem
- (5) mount point: mount point relative to the process's root
- (6) mount options: per mount options
- (7) optional fields: zero or more fields of the form "tag[:value]"
- (8) separator: marks the end of the optional fields
- (9) filesystem type: name of filesystem of the form "type[.subtype]"
- (10) mount source: filesystem specific information or "none"
- (11) super options: per super block options
+ (1)(2)(3) (4) (5) (6) (n…m) (m+1)(m+2) (m+3) (m+4)
+
+ (1) mount ID: unique identifier of the mount (may be reused after umount)
+ (2) parent ID: ID of parent (or of self for the top of the mount tree)
+ (3) major:minor: value of st_dev for files on filesystem
+ (4) root: root of the mount within the filesystem
+ (5) mount point: mount point relative to the process's root
+ (6) mount options: per mount options
+ (n…m) optional fields: zero or more fields of the form "tag[:value]"
+ (m+1) separator: marks the end of the optional fields
+ (m+2) filesystem type: name of filesystem of the form "type[.subtype]"
+ (m+3) mount source: filesystem specific information or "none"
+ (m+4) super options: per super block options
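+
+Because the number of optional fields varies, a parser has to scan for
+the separator before reading the fields behind it. For example, the
+filesystem type of every mount can be extracted like this (a sketch)::
+
+    $ awk '{for (i = 7; i <= NF; i++)
+            if ($i == "-") {print $(i + 1); break}}' /proc/self/mountinfo
+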
Parsers should ignore all unrecognised optional fields. Currently the
possible optional fields are:
@@ -1878,11 +1983,11 @@ For more information on mount propagation see:
3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm
--------------------------------------------------------
-These files provide a method to access a tasks comm value. It also allows for
+These files provide a method to access a task's comm value. It also allows for
a task to set its own or one of its thread siblings comm value. The comm value
is limited in size compared to the cmdline value, so writing anything longer
-then the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated
-comm value.
+than the kernel's TASK_COMM_LEN (currently 16 chars, including the NUL
+terminator) will result in a truncated comm value.
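+
+For example (a sketch; the written name is illustrative)::
+
+    $ echo "worker-thread-name-too-long" > /proc/$$/comm
+    $ cat /proc/$$/comm
+    worker-thread-n
+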
3.7 /proc/<pid>/task/<tid>/children - Information about task children
@@ -1891,32 +1996,34 @@ This file provides a fast way to retrieve first level children pids
of a task pointed by <pid>/<tid> pair. The format is a space separated
stream of pids.
-Note the "first level" here -- if a child has own children they will
-not be listed here, one needs to read /proc/<children-pid>/task/<tid>/children
+Note the "first level" here -- if a child has its own children they will
+not be listed here; one needs to read /proc/<children-pid>/task/<tid>/children
to obtain the descendants.
Since this interface is intended to be fast and cheap it doesn't
guarantee to provide precise results and some children might be
skipped, especially if they've exited right after we printed their
-pids, so one need to either stop or freeze processes being inspected
+pids, so one needs to either stop or freeze processes being inspected
if precise results are needed.
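+
+For example, the first-level children of the init task can be listed
+like this (a sketch; the pids shown are illustrative)::
+
+    $ cat /proc/1/task/1/children
+    163 164 165
+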
3.8 /proc/<pid>/fdinfo/<fd> - Information about opened file
---------------------------------------------------------------
This file provides information associated with an opened file. The regular
-files have at least three fields -- 'pos', 'flags' and mnt_id. The 'pos'
-represents the current offset of the opened file in decimal form [see lseek(2)
-for details], 'flags' denotes the octal O_xxx mask the file has been
-created with [see open(2) for details] and 'mnt_id' represents mount ID of
-the file system containing the opened file [see 3.5 /proc/<pid>/mountinfo
-for details].
+files have at least four fields -- 'pos', 'flags', 'mnt_id' and 'ino'.
+The 'pos' represents the current offset of the opened file in decimal
+form [see lseek(2) for details], 'flags' denotes the octal O_xxx mask the
+file has been created with [see open(2) for details] and 'mnt_id' represents
+mount ID of the file system containing the opened file [see 3.5
+/proc/<pid>/mountinfo for details]. 'ino' represents the inode number of
+the file.
A typical output is::
pos: 0
flags: 0100002
mnt_id: 19
+ ino: 63107
All locks associated with a file descriptor are shown in its fdinfo too::
@@ -1933,6 +2040,7 @@ Eventfd files
pos: 0
flags: 04002
mnt_id: 9
+ ino: 63107
eventfd-count: 5a
where 'eventfd-count' is hex value of a counter.
@@ -1945,6 +2053,7 @@ Signalfd files
pos: 0
flags: 04002
mnt_id: 9
+ ino: 63107
sigmask: 0000000000000200
where 'sigmask' is hex value of the signal mask associated
@@ -1958,6 +2067,7 @@ Epoll files
pos: 0
flags: 02
mnt_id: 9
+ ino: 63107
tfd: 5 events: 1d data: ffffffffffffffff pos:0 ino:61af sdev:7
where 'tfd' is a target file descriptor number in decimal form,
@@ -1974,9 +2084,11 @@ For inotify files the format is the following::
pos: 0
flags: 02000000
+ mnt_id: 9
+ ino: 63107
inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d
-where 'wd' is a watch descriptor in decimal form, ie a target file
+where 'wd' is a watch descriptor in decimal form, i.e. a target file
descriptor number, 'ino' and 'sdev' are inode and device where the
target file resides and the 'mask' is the mask of events, all in hex
form [see inotify(7) for more details].
@@ -1996,6 +2108,7 @@ For fanotify files the format is::
pos: 0
flags: 02
mnt_id: 9
+ ino: 63107
fanotify flags:10 event-flags:0
fanotify mnt_id:12 mflags:40 mask:38 ignored_mask:40000003
fanotify ino:4f969 sdev:800013 mflags:0 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:69f90400c275b5b4
@@ -2003,10 +2116,10 @@ For fanotify files the format is::
where fanotify 'flags' and 'event-flags' are values used in fanotify_init
call, 'mnt_id' is the mount point identifier, 'mflags' is the value of
flags associated with mark which are tracked separately from events
-mask. 'ino', 'sdev' are target inode and device, 'mask' is the events
+mask. 'ino' and 'sdev' are target inode and device, 'mask' is the events
mask and 'ignored_mask' is the mask of events which are to be ignored.
-All in hex format. Incorporation of 'mflags', 'mask' and 'ignored_mask'
-does provide information about flags and mask used in fanotify_mark
+All are in hex format. Incorporation of 'mflags', 'mask' and 'ignored_mask'
+provides information about the flags and mask used in the fanotify_mark
call [see fsnotify manpage for details].
While the first four lines are mandatory and always printed, the rest is
@@ -2020,6 +2133,7 @@ Timerfd files
pos: 0
flags: 02
mnt_id: 9
+ ino: 63107
clockid: 0
ticks: 0
settime flags: 01
@@ -2029,11 +2143,27 @@ Timerfd files
where 'clockid' is the clock type and 'ticks' is the number of the timer expirations
that have occurred [see timerfd_create(2) for details]. 'settime flags' are
flags in octal form been used to setup the timer [see timerfd_settime(2) for
-details]. 'it_value' is remaining time until the timer exiration.
+details]. 'it_value' is remaining time until the timer expiration.
'it_interval' is the interval for the timer. Note the timer might be set up
with TIMER_ABSTIME option which will be shown in 'settime flags', but 'it_value'
still exhibits timer's remaining time.
+DMA Buffer files
+~~~~~~~~~~~~~~~~
+
+::
+
+ pos: 0
+ flags: 04002
+ mnt_id: 9
+ ino: 63107
+ size: 32768
+ count: 2
+ exp_name: system-heap
+
+where 'size' is the size of the DMA buffer in bytes. 'count' is the file count of
+the DMA buffer file. 'exp_name' is the name of the DMA buffer exporter.
+
3.9 /proc/<pid>/map_files - Information about memory mapped files
---------------------------------------------------------------------
This directory contains symbolic links which represent memory mapped files
@@ -2059,13 +2189,13 @@ are actually shared.
3.10 /proc/<pid>/timerslack_ns - Task timerslack value
---------------------------------------------------------
This file provides the value of the task's timerslack value in nanoseconds.
-This value specifies a amount of time that normal timers may be deferred
+This value specifies an amount of time that normal timers may be deferred
in order to coalesce timers and avoid unnecessary wakeups.
-This allows a task's interactivity vs power consumption trade off to be
+This allows a task's interactivity vs power consumption tradeoff to be
adjusted.
-Writing 0 to the file will set the tasks timerslack to the default value.
+Writing 0 to the file will set the task's timerslack to the default value.
Valid values are from 0 - ULLONG_MAX
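+
+For example, reading the current value and resetting it to the default
+(a sketch; 50000 ns is the usual default)::
+
+    $ cat /proc/self/timerslack_ns
+    50000
+    $ echo 0 > /proc/self/timerslack_ns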
@@ -2105,10 +2235,10 @@ Example
Description
~~~~~~~~~~~
-x86 specific entries:
+x86 specific entries
~~~~~~~~~~~~~~~~~~~~~
-AVX512_elapsed_ms:
+AVX512_elapsed_ms
^^^^^^^^^^^^^^^^^^
If AVX512 is supported on the machine, this entry shows the milliseconds
@@ -2134,8 +2264,92 @@ AVX512_elapsed_ms:
the task is unlikely an AVX512 user, but depends on the workload and the
scheduling scenario, it also could be a false negative mentioned above.
-Configuring procfs
-------------------
+3.13 /proc/<pid>/fd - List of symlinks to open files
+-------------------------------------------------------
+This directory contains symbolic links which represent open files
+the process is maintaining. Example output::
+
+ lr-x------ 1 root root 64 Sep 20 17:53 0 -> /dev/null
+ l-wx------ 1 root root 64 Sep 20 17:53 1 -> /dev/null
+ lrwx------ 1 root root 64 Sep 20 17:53 10 -> 'socket:[12539]'
+ lrwx------ 1 root root 64 Sep 20 17:53 11 -> 'socket:[12540]'
+ lrwx------ 1 root root 64 Sep 20 17:53 12 -> 'socket:[12542]'
+
+The number of open files for the process is stored in the 'size' member
+of stat() output for /proc/<pid>/fd for fast access.
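+
+For example, the count can be read without listing the directory
+(a sketch using coreutils stat; the count shown is illustrative)::
+
+    $ stat -c %s /proc/$$/fd
+    4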
+
+3.14 /proc/<pid>/ksm_stat - Information about the process's ksm status
+----------------------------------------------------------------------
+When CONFIG_KSM is enabled, each process has this file, which displays
+its KSM merging status.
+
+Example
+~~~~~~~
+
+::
+
+ / # cat /proc/self/ksm_stat
+ ksm_rmap_items 0
+ ksm_zero_pages 0
+ ksm_merging_pages 0
+ ksm_process_profit 0
+ ksm_merge_any: no
+ ksm_mergeable: no
+
+Description
+~~~~~~~~~~~
+
+ksm_rmap_items
+^^^^^^^^^^^^^^
+
+The number of ksm_rmap_item structures in use. The structure
+ksm_rmap_item stores the reverse mapping information for virtual
+addresses. KSM will generate a ksm_rmap_item for each ksm-scanned page of
+the process.
+
+ksm_zero_pages
+^^^^^^^^^^^^^^
+
+When /sys/kernel/mm/ksm/use_zero_pages is enabled, it represents how many
+empty pages are merged with kernel zero pages by KSM.
+
+ksm_merging_pages
+^^^^^^^^^^^^^^^^^
+
+It represents how many pages of this process are involved in KSM merging
+(not including ksm_zero_pages). It is the same as what
+/proc/<pid>/ksm_merging_pages shows.
+
+ksm_process_profit
+^^^^^^^^^^^^^^^^^^
+
+The profit that KSM brings (Saved bytes). KSM can save memory by merging
+identical pages, but also can consume additional memory, because it needs
+to generate a number of rmap_items to save each scanned page's brief rmap
+information. Some of these pages may be merged, but some may never be
+merged even after being checked several times; for those pages the
+memory consumed is unprofitable.
+
+ksm_merge_any
+^^^^^^^^^^^^^
+
+It specifies whether the process's mm has been added by prctl() to the
+candidate list of KSM, i.e. whether KSM scanning is fully enabled at
+the process level.
+
+ksm_mergeable
+^^^^^^^^^^^^^
+
+It specifies whether any VMAs of the process's mm are currently
+applicable to KSM.
+
+More information about KSM can be found in
+Documentation/admin-guide/mm/ksm.rst.
+
+
+Chapter 4: Configuring procfs
+=============================
4.1 Mount options
---------------------
@@ -2162,7 +2376,7 @@ arguments are now protected against local eavesdroppers.
hidepid=invisible or hidepid=2 means hidepid=1 plus all /proc/<pid>/ will be
fully invisible to other users. It doesn't mean that it hides a fact whether a
process with a specific pid value exists (it can be learned by other means, e.g.
-by "kill -0 $PID"), but it hides process' uid and gid, which may be learned by
+by "kill -0 $PID"), but it hides process's uid and gid, which may be learned by
stat()'ing /proc/<pid>/ otherwise. It greatly complicates an intruder's task of
gathering information about running processes, whether some daemon runs with
elevated privileges, whether other user runs some sensitive program, whether
@@ -2178,47 +2392,45 @@ information about processes information, just add identd to this group.
subset=pid hides all top level files and directories in the procfs that
are not related to tasks.
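+
+For example, a reduced procfs instance that exposes only the process
+directories can be mounted like this (a sketch)::
+
+    # mount -t proc -o subset=pid proc /tmp/proc
+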
-5 Filesystem behavior
-----------------------------
+Chapter 5: Filesystem behavior
+==============================
-Originally, before the advent of pid namepsace, procfs was a global file
+Originally, before the advent of pid namespace, procfs was a global file
system. It means that there was only one procfs instance in the system.
When pid namespace was added, a separate procfs instance was mounted in
each pid namespace. So, procfs mount options are global among all
-mountpoints within the same namespace.
-
-::
+mountpoints within the same namespace::
-# grep ^proc /proc/mounts
-proc /proc proc rw,relatime,hidepid=2 0 0
+ # grep ^proc /proc/mounts
+ proc /proc proc rw,relatime,hidepid=2 0 0
-# strace -e mount mount -o hidepid=1 -t proc proc /tmp/proc
-mount("proc", "/tmp/proc", "proc", 0, "hidepid=1") = 0
-+++ exited with 0 +++
+ # strace -e mount mount -o hidepid=1 -t proc proc /tmp/proc
+ mount("proc", "/tmp/proc", "proc", 0, "hidepid=1") = 0
+ +++ exited with 0 +++
-# grep ^proc /proc/mounts
-proc /proc proc rw,relatime,hidepid=2 0 0
-proc /tmp/proc proc rw,relatime,hidepid=2 0 0
+ # grep ^proc /proc/mounts
+ proc /proc proc rw,relatime,hidepid=2 0 0
+ proc /tmp/proc proc rw,relatime,hidepid=2 0 0
and only after remounting procfs mount options will change at all
-mountpoints.
+mountpoints::
-# mount -o remount,hidepid=1 -t proc proc /tmp/proc
+ # mount -o remount,hidepid=1 -t proc proc /tmp/proc
-# grep ^proc /proc/mounts
-proc /proc proc rw,relatime,hidepid=1 0 0
-proc /tmp/proc proc rw,relatime,hidepid=1 0 0
+ # grep ^proc /proc/mounts
+ proc /proc proc rw,relatime,hidepid=1 0 0
+ proc /tmp/proc proc rw,relatime,hidepid=1 0 0
This behavior is different from the behavior of other filesystems.
The new procfs behavior is more like other filesystems. Each procfs mount
creates a new procfs instance. Mount options affect own procfs instance.
It means that it became possible to have several procfs instances
-displaying tasks with different filtering options in one pid namespace.
+displaying tasks with different filtering options in one pid namespace::
-# mount -o hidepid=invisible -t proc proc /proc
-# mount -o hidepid=noaccess -t proc proc /tmp/proc
-# grep ^proc /proc/mounts
-proc /proc proc rw,relatime,hidepid=invisible 0 0
-proc /tmp/proc proc rw,relatime,hidepid=noaccess 0 0
+ # mount -o hidepid=invisible -t proc proc /proc
+ # mount -o hidepid=noaccess -t proc proc /tmp/proc
+ # grep ^proc /proc/mounts
+ proc /proc proc rw,relatime,hidepid=invisible 0 0
+ proc /tmp/proc proc rw,relatime,hidepid=noaccess 0 0
diff --git a/Documentation/filesystems/qnx6.rst b/Documentation/filesystems/qnx6.rst
index fd13433d362c..560f3d470422 100644
--- a/Documentation/filesystems/qnx6.rst
+++ b/Documentation/filesystems/qnx6.rst
@@ -135,7 +135,7 @@ inode.
Character and block special devices do not exist in QNX as those files
are handled by the QNX kernel/drivers and created in /dev independent of the
-underlaying filesystem.
+underlying filesystem.
Long filenames
--------------
@@ -176,7 +176,7 @@ Then userspace.
The requirement for a static, fixed preallocated system area comes from how
qnx6fs deals with writes.
-Each superblock got it's own half of the system area. So superblock #1
+Each superblock has its own half of the system area. So superblock #1
always uses blocks from the lower half while superblock #2 just writes to
blocks represented by the upper half bitmap system area bits.
diff --git a/Documentation/filesystems/quota.rst b/Documentation/filesystems/quota.rst
index a30cdd47c652..abd4303c546e 100644
--- a/Documentation/filesystems/quota.rst
+++ b/Documentation/filesystems/quota.rst
@@ -18,7 +18,7 @@ Quota limits (and amount of grace time) are set independently for each
filesystem.
For more details about quota design, see the documentation in quota-tools package
-(http://sourceforge.net/projects/linuxquota).
+(https://sourceforge.net/projects/linuxquota).
Quota netlink interface
=======================
@@ -31,11 +31,11 @@ the above events to userspace. There they can be captured by an application
and processed accordingly.
The interface uses generic netlink framework (see
-http://lwn.net/Articles/208755/ and http://people.suug.ch/~tgr/libnl/ for more
-details about this layer). The name of the quota generic netlink interface
-is "VFS_DQUOT". Definitions of constants below are in <linux/quota.h>.
-Since the quota netlink protocol is not namespace aware, quota netlink messages
-are sent only in initial network namespace.
+https://lwn.net/Articles/208755/ and http://www.infradead.org/~tgr/libnl/ for
+more details about this layer). The name of the quota generic netlink interface
+is "VFS_DQUOT". Definitions of constants below are in <linux/quota.h>. Since
+the quota netlink protocol is not namespace aware, quota netlink messages are
+sent only in initial network namespace.
Currently, the interface supports only one message type QUOTA_NL_C_WARNING.
This command is used to send a notification about any of the above mentioned
diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.rst b/Documentation/filesystems/ramfs-rootfs-initramfs.rst
index 3fddacc6bf14..fa4f81099cb4 100644
--- a/Documentation/filesystems/ramfs-rootfs-initramfs.rst
+++ b/Documentation/filesystems/ramfs-rootfs-initramfs.rst
@@ -6,8 +6,7 @@ Ramfs, rootfs and initramfs
October 17, 2005
-Rob Landley <rob@landley.net>
-=============================
+:Author: Rob Landley <rob@landley.net>
What is ramfs?
--------------
@@ -170,7 +169,7 @@ Documentation/driver-api/early-userspace/early_userspace_support.rst for more de
The kernel does not depend on external cpio tools. If you specify a
directory instead of a configuration file, the kernel's build infrastructure
creates a configuration file from that directory (usr/Makefile calls
-usr/gen_initramfs_list.sh), and proceeds to package up that directory
+usr/gen_initramfs.sh), and proceeds to package up that directory
using the config file (by feeding it to usr/gen_init_cpio, which is created
from usr/gen_init_cpio.c). The kernel's build-time cpio creation code is
entirely self-contained, and the kernel's boot-time extractor is also
@@ -246,15 +245,15 @@ If you don't already understand what shared libraries, devices, and paths
you need to get a minimal root filesystem up and running, here are some
references:
-- http://www.tldp.org/HOWTO/Bootdisk-HOWTO/
-- http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html
+- https://www.tldp.org/HOWTO/Bootdisk-HOWTO/
+- https://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html
- http://www.linuxfromscratch.org/lfs/view/stable/
-The "klibc" package (http://www.kernel.org/pub/linux/libs/klibc) is
+The "klibc" package (https://www.kernel.org/pub/linux/libs/klibc) is
designed to be a tiny C library to statically link early userspace
code against, along with some related utilities. It is BSD licensed.
-I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net)
+I use uClibc (https://www.uclibc.org) and busybox (https://www.busybox.net)
myself. These are LGPL and GPL, respectively. (A self-contained initramfs
package is planned for the busybox 1.3 release.)
@@ -316,7 +315,7 @@ the above threads) is:
2) The cpio archive format chosen by the kernel is simpler and cleaner (and
thus easier to create and parse) than any of the (literally dozens of)
various tar archive formats. The complete initramfs archive format is
- explained in buffer-format.txt, created in usr/gen_init_cpio.c, and
+ explained in buffer-format.rst, created in usr/gen_init_cpio.c, and
extracted in init/initramfs.c. All three together come to less than 26k
total of human-readable text.
diff --git a/Documentation/filesystems/relay.rst b/Documentation/filesystems/relay.rst
index 04ad083cfe62..301ff4c6e6c6 100644
--- a/Documentation/filesystems/relay.rst
+++ b/Documentation/filesystems/relay.rst
@@ -32,7 +32,7 @@ functions in the relay interface code - please see that for details.
Semantics
=========
-Each relay channel has one buffer per CPU, each buffer has one or more
+Each relay channel has one buffer per CPU; each buffer has one or more
sub-buffers. Messages are written to the first sub-buffer until it is
too full to contain a new message, in which case it is written to
the next (if available). Messages are never split across sub-buffers.
@@ -40,7 +40,7 @@ At this point, userspace can be notified so it empties the first
sub-buffer, while the kernel continues writing to the next.
When notified that a sub-buffer is full, the kernel knows how many
-bytes of it are padding i.e. unused space occurring because a complete
+bytes of it are padding, i.e., unused space occurring because a complete
message couldn't fit into a sub-buffer. Userspace can use this
knowledge to copy only valid data.
@@ -71,7 +71,7 @@ klog and relay-apps example code
================================
The relay interface itself is ready to use, but to make things easier,
-a couple simple utility functions and a set of examples are provided.
+a couple of simple utility functions and a set of examples are provided.
The relay-apps example tarball, available on the relay sourceforge
site, contains a set of self-contained examples, each consisting of a
@@ -91,7 +91,7 @@ registered will data actually be logged (see the klog and kleak
examples for details).
It is of course possible to use the relay interface from scratch,
-i.e. without using any of the relay-apps example code or klog, but
+i.e., without using any of the relay-apps example code or klog, but
you'll have to implement communication between userspace and kernel,
allowing both to convey the state of buffers (full, empty, amount of
padding). The read() interface both removes padding and internally
@@ -119,7 +119,7 @@ mmap() results in channel buffer being mapped into the caller's
must map the entire file, which is NRBUF * SUBBUFSIZE.
read() read the contents of a channel buffer. The bytes read are
- 'consumed' by the reader, i.e. they won't be available
+ 'consumed' by the reader, i.e., they won't be available
again to subsequent reads. If the channel is being used
in no-overwrite mode (the default), it can be read at any
time even if there's an active kernel writer. If the
@@ -138,7 +138,7 @@ poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are
notified when sub-buffer boundaries are crossed.
close() decrements the channel buffer's refcount. When the refcount
- reaches 0, i.e. when no process or kernel client has the
+ reaches 0, i.e., when no process or kernel client has the
buffer open, the channel buffer is freed.
=========== ============================================================
@@ -149,7 +149,7 @@ host filesystem must be mounted. For example::
.. Note::
- the host filesystem doesn't need to be mounted for kernel
+ The host filesystem doesn't need to be mounted for kernel
clients to create or use channels - it only needs to be
mounted when user space applications need access to the buffer
data.
@@ -301,16 +301,6 @@ user-defined data with a channel, and is immediately available
(including in create_buf_file()) via chan->private_data or
buf->chan->private_data.
-Buffer-only channels
---------------------
-
-These channels have no files associated and can be created with
-relay_open(NULL, NULL, ...). Such channels are useful in scenarios such
-as when doing early tracing in the kernel, before the VFS is up. In these
-cases, one may open a buffer-only channel and then call
-relay_late_setup_files() when the kernel is ready to handle files,
-to expose the buffered data to the userspace.
-
Channel 'modes'
---------------
@@ -325,7 +315,7 @@ section, as it pertains mainly to mmap() implementations.
In 'overwrite' mode, also known as 'flight recorder' mode, writes
continuously cycle around the buffer and will never fail, but will
unconditionally overwrite old data regardless of whether it's actually
-been consumed. In no-overwrite mode, writes will fail, i.e. data will
+been consumed. In no-overwrite mode, writes will fail, i.e., data will
be lost, if the number of unconsumed sub-buffers equals the total
number of sub-buffers in the channel. It should be clear that if
there is no consumer or if the consumer can't consume sub-buffers fast
@@ -344,7 +334,7 @@ initialize the next sub-buffer if appropriate 2) finalize the previous
sub-buffer if appropriate and 3) return a boolean value indicating
whether or not to actually move on to the next sub-buffer.
-To implement 'no-overwrite' mode, the userspace client would provide
+To implement 'no-overwrite' mode, the userspace client provides
an implementation of the subbuf_start() callback something like the
following::
@@ -364,9 +354,9 @@ following::
return 1;
}
-If the current buffer is full, i.e. all sub-buffers remain unconsumed,
+If the current buffer is full, i.e., all sub-buffers remain unconsumed,
the callback returns 0 to indicate that the buffer switch should not
-occur yet, i.e. until the consumer has had a chance to read the
+occur yet, i.e., until the consumer has had a chance to read the
current set of ready sub-buffers. For the relay_buf_full() function
to make sense, the consumer is responsible for notifying the relay
interface when sub-buffers have been consumed via
@@ -400,7 +390,7 @@ consulted.
The default subbuf_start() implementation, used if the client doesn't
define any callbacks, or doesn't define the subbuf_start() callback,
-implements the simplest possible 'no-overwrite' mode, i.e. it does
+implements the simplest possible 'no-overwrite' mode, i.e., it does
nothing but return 0.
Header information can be reserved at the beginning of each sub-buffer
@@ -467,7 +457,7 @@ rather than open and close a new channel for each use. relay_reset()
can be used for this purpose - it resets a channel to its initial
state without reallocating channel buffer memory or destroying
existing mappings. It should however only be called when it's safe to
-do so, i.e. when the channel isn't currently being written to.
+do so, i.e., when the channel isn't currently being written to.
Finally, there are a couple of utility callbacks that can be used for
different purposes. buf_mapped() is called whenever a channel buffer
diff --git a/Documentation/filesystems/resctrl.rst b/Documentation/filesystems/resctrl.rst
new file mode 100644
index 000000000000..c7949dd44f2f
--- /dev/null
+++ b/Documentation/filesystems/resctrl.rst
@@ -0,0 +1,1523 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+=====================================================
+User Interface for Resource Control feature (resctrl)
+=====================================================
+
+:Copyright: |copy| 2016 Intel Corporation
+:Authors: - Fenghua Yu <fenghua.yu@intel.com>
+ - Tony Luck <tony.luck@intel.com>
+ - Vikas Shivappa <vikas.shivappa@intel.com>
+
+
+Intel refers to this feature as Intel Resource Director Technology (Intel(R) RDT).
+AMD refers to this feature as AMD Platform Quality of Service (AMD QoS).
+
+This feature is enabled by the CONFIG_X86_CPU_RESCTRL and the x86 /proc/cpuinfo
+flag bits:
+
+=============================================== ================================
+RDT (Resource Director Technology) Allocation "rdt_a"
+CAT (Cache Allocation Technology) "cat_l3", "cat_l2"
+CDP (Code and Data Prioritization) "cdp_l3", "cdp_l2"
+CQM (Cache QoS Monitoring) "cqm_llc", "cqm_occup_llc"
+MBM (Memory Bandwidth Monitoring) "cqm_mbm_total", "cqm_mbm_local"
+MBA (Memory Bandwidth Allocation) "mba"
+SMBA (Slow Memory Bandwidth Allocation) ""
+BMEC (Bandwidth Monitoring Event Configuration) ""
+=============================================== ================================
+
+Historically, new features were made visible by default in /proc/cpuinfo. This
+resulted in the feature flags becoming hard to parse by humans. Adding a new
+flag to /proc/cpuinfo should be avoided if user space can obtain information
+about the feature from resctrl's info directory.
+
+To use the feature, mount the file system::
+
+ # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps][,debug]] /sys/fs/resctrl
+
+Mount options are:
+
+"cdp":
+ Enable code/data prioritization in L3 cache allocations.
+"cdpl2":
+ Enable code/data prioritization in L2 cache allocations.
+"mba_MBps":
+ Enable the MBA Software Controller (mba_sc) to specify MBA
+ bandwidth in MiBps.
+"debug":
+ Make debug files accessible. Available debug files are annotated with
+ "Available only with debug option".
+
+L2 and L3 CDP are controlled separately.
+
+RDT features are orthogonal. A particular system may support only
+monitoring, only control, or both monitoring and control. Cache
+pseudo-locking is a unique way of using cache control to "pin" or
+"lock" data in the cache. Details can be found in
+"Cache Pseudo-Locking".
+
+
+The mount succeeds if either allocation or monitoring is present, but
+only those files and directories supported by the system will be created.
+For more details on the behavior of the interface during monitoring
+and allocation, see the "Resource alloc and monitor groups" section.
+
+Info directory
+==============
+
+The 'info' directory contains information about the enabled
+resources. Each resource has its own subdirectory. The subdirectory
+names reflect the resource names.
+
+Each subdirectory contains the following files with respect to
+allocation.
+
+The cache resource (L3/L2) subdirectory contains the following files
+related to allocation:
+
+"num_closids":
+ The number of CLOSIDs which are valid for this
+ resource. The kernel uses the smallest number of
+ CLOSIDs of all enabled resources as the limit.
+"cbm_mask":
+ The bitmask which is valid for this resource.
+ This mask is equivalent to 100%.
+"min_cbm_bits":
+ The minimum number of consecutive bits which
+ must be set when writing a mask.
+
+"shareable_bits":
+ Bitmask of shareable resource with other executing
+ entities (e.g. I/O). The user can use this when
+ setting up exclusive cache partitions. Note that
+ some platforms support devices that have their
+ own settings for cache use which can override
+ these bits.
+"bit_usage":
+ Annotated capacity bitmasks showing how all
+ instances of the resource are used. The legend is:
+
+ "0":
+ Corresponding region is unused. When the system's
+ resources have been allocated and a "0" is found
+ in "bit_usage" it is a sign that resources are
+ wasted.
+
+ "H":
+ Corresponding region is used by hardware only
+ but available for software use. If a resource
+ has bits set in "shareable_bits" but not all
+ of these bits appear in the resource groups'
+ schemata, then the bits appearing in
+ "shareable_bits" but in no resource group's
+ schemata will be marked as "H".
+ "X":
+ Corresponding region is available for sharing and
+ used by hardware and software. These are the
+ bits that appear in "shareable_bits" as
+ well as a resource group's allocation.
+ "S":
+ Corresponding region is used by software
+ and available for sharing.
+ "E":
+ Corresponding region is used exclusively by
+ one resource group. No sharing allowed.
+ "P":
+ Corresponding region is pseudo-locked. No
+ sharing allowed.
+"sparse_masks":
+ Indicates if non-contiguous 1s value in CBM is supported.
+
+ "0":
+ Only contiguous 1s value in CBM is supported.
+ "1":
+ Non-contiguous 1s value in CBM is supported.
+
+The memory bandwidth (MB) subdirectory contains the following files
+with respect to allocation:
+
+"min_bandwidth":
+ The minimum memory bandwidth percentage which
+ the user can request.
+
+"bandwidth_gran":
+ The granularity in which the memory bandwidth
+ percentage is allocated. The allocated
+ b/w percentage is rounded off to the next
+ control step available on the hardware. The
+ available bandwidth control steps are:
+ min_bandwidth + N * bandwidth_gran.
+
+"delay_linear":
+ Indicates if the delay scale is linear or
+ non-linear. This field is purely informational.
+
+"thread_throttle_mode":
+ Indicator on Intel systems of how tasks running on threads
+ of a physical core are throttled in cases where they
+ request different memory bandwidth percentages:
+
+ "max":
+ the smallest percentage is applied
+ to all threads
+ "per-thread":
+ bandwidth percentages are directly applied to
+ the threads running on the core
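+
+As a worked example of the bandwidth granularity rounding above, assume
+hypothetical values of min_bandwidth = 10 and bandwidth_gran = 10 (the
+real values are CPU model specific)::
+
+ # cat /sys/fs/resctrl/info/MB/min_bandwidth
+ 10
+ # cat /sys/fs/resctrl/info/MB/bandwidth_gran
+ 10
+
+The valid control steps are then 10, 20, ... 100, and a requested
+bandwidth percentage of 35 is rounded to the 40 control step.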
+
+If RDT monitoring is available there will be an "L3_MON" directory
+with the following files:
+
+"num_rmids":
+ The number of RMIDs available. This is the
+ upper bound for how many "CTRL_MON" + "MON"
+ groups can be created.
+
+"mon_features":
+ Lists the monitoring events if
+ monitoring is enabled for the resource.
+ Example::
+
+ # cat /sys/fs/resctrl/info/L3_MON/mon_features
+ llc_occupancy
+ mbm_total_bytes
+ mbm_local_bytes
+
+ If the system supports Bandwidth Monitoring Event
+ Configuration (BMEC), then the bandwidth events will
+ be configurable. The output will be::
+
+ # cat /sys/fs/resctrl/info/L3_MON/mon_features
+ llc_occupancy
+ mbm_total_bytes
+ mbm_total_bytes_config
+ mbm_local_bytes
+ mbm_local_bytes_config
+
+"mbm_total_bytes_config", "mbm_local_bytes_config":
+ Read/write files containing the configuration for the mbm_total_bytes
+ and mbm_local_bytes events, respectively, when the Bandwidth
+ Monitoring Event Configuration (BMEC) feature is supported.
+ The event configuration settings are domain specific and affect
+ all the CPUs in the domain. When either event configuration is
+ changed, the bandwidth counters for all RMIDs of both events
+ (mbm_total_bytes as well as mbm_local_bytes) are cleared for that
+ domain. The next read for every RMID will report "Unavailable"
+ and subsequent reads will report the valid value.
+
+ The following event types are supported:
+
+ ==== ========================================================
+ Bits Description
+ ==== ========================================================
+ 6 Dirty Victims from the QOS domain to all types of memory
+ 5 Reads to slow memory in the non-local NUMA domain
+ 4 Reads to slow memory in the local NUMA domain
+ 3 Non-temporal writes to non-local NUMA domain
+ 2 Non-temporal writes to local NUMA domain
+ 1 Reads to memory in the non-local NUMA domain
+ 0 Reads to memory in the local NUMA domain
+ ==== ========================================================
+
+ By default, the mbm_total_bytes configuration is set to 0x7f to count
+ all the event types and the mbm_local_bytes configuration is set to
+ 0x15 to count all the local memory events.
+
+ Examples:
+
+ * To view the current configuration:
+ ::
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
+ 0=0x7f;1=0x7f;2=0x7f;3=0x7f
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
+ 0=0x15;1=0x15;3=0x15;4=0x15
+
+ * To change the mbm_total_bytes to count only reads on domain 0,
+ the bits 0, 1, 4 and 5 need to be set, which is 110011b in binary
+ (in hexadecimal 0x33):
+ ::
+
+ # echo "0=0x33" > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
+ 0=0x33;1=0x7f;2=0x7f;3=0x7f
+
+ * To change the mbm_local_bytes to count all the slow memory reads on
+ domain 0 and 1, the bits 4 and 5 need to be set, which is 110000b
+ in binary (in hexadecimal 0x30):
+ ::
+
+ # echo "0=0x30;1=0x30" > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
+ 0=0x30;1=0x30;3=0x15;4=0x15
+
+"max_threshold_occupancy":
+ A read/write file that provides the largest value (in
+ bytes) at which a previously used LLC_occupancy
+ counter can be considered for re-use.
+
+Finally, in the top level of the "info" directory there is a file
+named "last_cmd_status". This is reset with every "command" issued
+via the file system (making new directories or writing to any of the
+control files). If the command was successful, it will read as "ok".
+If the command failed, it will provide more information than can be
+conveyed in the error returns from file operations. E.g.
+::
+
+ # echo L3:0=f7 > schemata
+ bash: echo: write error: Invalid argument
+ # cat info/last_cmd_status
+ mask f7 has non-consecutive 1-bits
+
+Resource alloc and monitor groups
+=================================
+
+Resource groups are represented as directories in the resctrl file
+system. The default group is the root directory which, immediately
+after mounting, owns all the tasks and cpus in the system and can make
+full use of all resources.
+
+On a system with RDT control features additional directories can be
+created in the root directory that specify different amounts of each
+resource (see "schemata" below). The root and these additional top level
+directories are referred to as "CTRL_MON" groups below.
+
+On a system with RDT monitoring the root directory and other top level
+directories contain a directory named "mon_groups" in which additional
+directories can be created to monitor subsets of tasks in the CTRL_MON
+group that is their ancestor. These are called "MON" groups in the rest
+of this document.
+
+Removing a directory will move all tasks and cpus owned by the group it
+represents to the parent. Removing one of the created CTRL_MON groups
+will automatically remove all MON groups below it.
+
+Moving MON group directories to a new parent CTRL_MON group is supported
+for the purpose of changing the resource allocations of a MON group
+without impacting its monitoring data or assigned tasks. This operation
+is not allowed for MON groups which monitor CPUs. No other move
+operation is currently allowed other than simply renaming a CTRL_MON or
+MON group.
+
+All groups contain the following files:
+
+"tasks":
+ Reading this file shows the list of all tasks that belong to
+ this group. Writing a task id to the file will add a task to the
+ group. Multiple tasks can be added by separating the task ids
+ with commas. Tasks will be assigned sequentially. Multiple
+ failures are not supported. A single failure encountered while
+ attempting to assign a task will cause the operation to abort;
+ tasks added before the failure will remain in the group.
+ Failures will be logged to /sys/fs/resctrl/info/last_cmd_status.
+
+ If the group is a CTRL_MON group the task is removed from
+ whichever previous CTRL_MON group owned the task and also from
+ any MON group that owned the task. If the group is a MON group,
+ then the task must already belong to the CTRL_MON parent of this
+ group. The task is removed from any previous MON group.
+
+
+"cpus":
+ Reading this file shows a bitmask of the logical CPUs owned by
+ this group. Writing a mask to this file will add and remove
+ CPUs to/from this group. As with the tasks file a hierarchy is
+ maintained where MON groups may only include CPUs owned by the
+ parent CTRL_MON group.
+ When the resource group is in pseudo-locked mode this file will
+ only be readable, reflecting the CPUs associated with the
+ pseudo-locked region.
+
+
+"cpus_list":
+ Just like "cpus", only using ranges of CPUs instead of bitmasks.
+
+
+When control is enabled all CTRL_MON groups will also contain:
+
+"schemata":
+ A list of all the resources available to this group.
+ Each resource has its own line and format - see below for details.
+
+"size":
+ Mirrors the display of the "schemata" file to display the size in
+ bytes of each allocation instead of the bits representing the
+ allocation.
+
+"mode":
+ The "mode" of the resource group dictates the sharing of its
+ allocations. A "shareable" resource group allows sharing of its
+ allocations while an "exclusive" resource group does not. A
+ cache pseudo-locked region is created by first writing
+ "pseudo-locksetup" to the "mode" file before writing the cache
+ pseudo-locked region's schemata to the resource group's "schemata"
+ file. On successful pseudo-locked region creation the mode will
+ automatically change to "pseudo-locked".
+
+"ctrl_hw_id":
+ Available only with debug option. The identifier used by hardware
+ for the control group. On x86 this is the CLOSID.
+
+When monitoring is enabled all MON groups will also contain:
+
+"mon_data":
+ This contains a set of files organized by L3 domain and by
+ RDT event. E.g. on a system with two L3 domains there will
+ be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
+ directories has one file per event (e.g. "llc_occupancy",
+ "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
+ files provide a read out of the current value of the event for
+ all tasks in the group. In CTRL_MON groups these files provide
+ the sum for all tasks in the CTRL_MON group and all tasks in
+ MON groups. Please see the example section for more details on usage.
+ On systems with Sub-NUMA Cluster (SNC) enabled there are extra
+ directories for each node (located within the "mon_L3_XX" directory
+ for the L3 cache they occupy). These are named "mon_sub_L3_YY"
+ where "YY" is the node number.
+
+"mon_hw_id":
+ Available only with debug option. The identifier used by hardware
+ for the monitor group. On x86 this is the RMID.
+
+When the "mba_MBps" mount option is used all CTRL_MON groups will also contain:
+
+"mba_MBps_event":
+ Reading this file shows which memory bandwidth event is used
+ as input to the software feedback loop that keeps memory bandwidth
+ below the value specified in the schemata file. Writing the
+ name of one of the supported memory bandwidth events found in
+ /sys/fs/resctrl/info/L3_MON/mon_features changes the input
+ event.
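+
+ For example, to inspect the current input event and switch the
+ feedback loop to the total bandwidth event (valid names are those
+ listed in mon_features; the initial value shown is illustrative)::
+
+ # cat mba_MBps_event
+ mbm_local_bytes
+ # echo mbm_total_bytes > mba_MBps_event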
+
+Resource allocation rules
+-------------------------
+
+When a task is running the following rules define which resources are
+available to it:
+
+1) If the task is a member of a non-default group, then the schemata
+ for that group is used.
+
+2) Else if the task belongs to the default group, but is running on a
+ CPU that is assigned to some specific group, then the schemata for the
+ CPU's group is used.
+
+3) Otherwise the schemata for the default group is used.
+
+Resource monitoring rules
+-------------------------
+1) If a task is a member of a MON group, or non-default CTRL_MON group
+ then RDT events for the task will be reported in that group.
+
+2) If a task is a member of the default CTRL_MON group, but is running
+ on a CPU that is assigned to some specific group, then the RDT events
+ for the task will be reported in that group.
+
+3) Otherwise RDT events for the task will be reported in the root level
+ "mon_data" group.
+
+
+Notes on cache occupancy monitoring and control
+===============================================
+When moving a task from one group to another you should remember that
+this only affects *new* cache allocations by the task. E.g. you may have
+a task in a monitor group showing 3 MB of cache occupancy. If you move it
+to a new group and immediately check the occupancy of the old and new
+groups you will likely see that the old group is still showing 3 MB and
+the new group zero. When the task accesses locations still in cache from
+before the move, the h/w does not update any counters. On a busy system
+you will likely see the occupancy in the old group go down as cache lines
+are evicted and re-used while the occupancy in the new group rises as
+the task accesses memory and loads into the cache are counted based on
+membership in the new group.
+
+The same applies to cache allocation control. Moving a task to a group
+with a smaller cache partition will not evict any cache lines. The
+process may continue to use them from the old partition.
+
+Hardware uses a CLOSid (Class of service ID) and an RMID (Resource monitoring ID)
+to identify a control group and a monitoring group respectively. Each of
+the resource groups is mapped to these IDs based on the kind of group. The
+number of CLOSids and RMIDs is limited by the hardware and hence the creation
+of a "CTRL_MON" directory may fail if we run out of either CLOSIDs or RMIDs,
+and creation of a "MON" group may fail if we run out of RMIDs.
+
+max_threshold_occupancy - generic concepts
+------------------------------------------
+
+Note that an RMID, once freed, may not be immediately available for use
+as the RMID may still be tagged to the cache lines of its previous user.
+Hence such RMIDs are placed on a limbo list and checked back in once the
+cache occupancy has gone down. If the system has many limbo RMIDs that
+are not yet ready to be used, the user may see an -EBUSY error during
+mkdir.
+
+max_threshold_occupancy is a user configurable value to determine the
+occupancy at which an RMID can be freed.
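+
+For example, to read the current threshold and raise it (values are in
+bytes; the ones shown are illustrative)::
+
+ # cat /sys/fs/resctrl/info/L3_MON/max_threshold_occupancy
+ 540672
+ # echo 1048576 > /sys/fs/resctrl/info/L3_MON/max_threshold_occupancy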
+
+The mon_llc_occupancy_limbo tracepoint gives the precise occupancy in bytes
+for a subset of RMIDs that are not immediately available for allocation.
+This can't be relied on to produce output every second; it may be necessary
+to attempt to create an empty monitor group to force an update. Output may
+only be produced if creation of a control or monitor group fails.
+
+Schemata files - general concepts
+---------------------------------
+Each line in the file describes one resource. The line starts with
+the name of the resource, followed by specific values to be applied
+in each of the instances of that resource on the system.
+
+Cache IDs
+---------
+On current generation systems there is one L3 cache per socket and L2
+caches are generally just shared by the hyperthreads on a core, but this
+isn't an architectural requirement. We could have multiple separate L3
+caches on a socket, and multiple cores could share an L2 cache. So instead
+of using "socket" or "core" to define the set of logical CPUs sharing
+a resource we use a "Cache ID". At a given cache level this will be a
+unique number across the whole system (but it isn't guaranteed to be a
+contiguous sequence, there may be gaps). To find the ID for each logical
+CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id.
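+
+For example, on a typical x86 system where cache index3 is the L3 cache,
+the L3 cache ID of CPU 0 can be read as follows (the value shown is
+illustrative)::
+
+ # cat /sys/devices/system/cpu/cpu0/cache/index3/id
+ 0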
+
+Cache Bit Masks (CBM)
+---------------------
+For cache resources we describe the portion of the cache that is available
+for allocation using a bitmask. The maximum value of the mask is defined
+by each cpu model (and may be different for different cache levels). It
+is found using CPUID, but is also provided in the "info" directory of
+the resctrl file system in "info/{resource}/cbm_mask". Some Intel hardware
+requires that these masks have all the '1' bits in a contiguous block. So
+0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
+and 0xA are not. Check /sys/fs/resctrl/info/{resource}/sparse_masks to
+see whether non-contiguous 1s values are supported. On a system with a
+20-bit mask each bit represents 5% of the capacity of the cache. You
+could partition the cache into four equal parts with masks: 0x1f,
+0x3e0, 0x7c00, 0xf8000.
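+
+For example, on a hypothetical system with a 20-bit, contiguous-only L3
+mask the info files would read::
+
+ # cat /sys/fs/resctrl/info/L3/cbm_mask
+ fffff
+ # cat /sys/fs/resctrl/info/L3/sparse_masks
+ 0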
+
+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes.
+
+The top-level monitoring files in each "mon_L3_XX" directory provide
+the sum of data across all SNC nodes sharing an L3 cache instance.
+Users who bind tasks to the CPUs of a specific Sub-NUMA node can read
+the "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes" in the
+"mon_sub_L3_YY" directories to get node local data.
+
+Memory bandwidth allocation is still performed at the L3 cache
+level. I.e. throttling controls are applied to all SNC nodes.
+
+L3 cache allocation bitmaps also apply to all SNC nodes. But note that
+the amount of L3 cache represented by each bit is divided by the number
+of SNC nodes per L3 cache. E.g. with a 100MB cache on a system with 10-bit
+allocation masks each bit normally represents 10MB. With SNC mode enabled
+with two SNC nodes per L3 cache, each bit only represents 5MB.
+
+Memory bandwidth Allocation and monitoring
+==========================================
+
+For the memory bandwidth resource, the user by default controls the
+resource by indicating the percentage of total memory bandwidth.
+
+The minimum bandwidth percentage value for each cpu model is predefined
+and can be looked up through "info/MB/min_bandwidth". The bandwidth
+granularity that is allocated is also dependent on the cpu model and can
+be looked up at "info/MB/bandwidth_gran". The available bandwidth
+control steps are: min_bw + N * bw_gran. Intermediate values are rounded
+to the next control step available on the hardware.
+
+Bandwidth throttling is a core-specific mechanism on some Intel
+SKUs. Using a high bandwidth and a low bandwidth setting on two threads
+sharing a core may result in both threads being throttled to use the
+low bandwidth (see "thread_throttle_mode").
+
+The fact that Memory bandwidth allocation (MBA) may be a core-specific
+mechanism, whereas memory bandwidth monitoring (MBM) is done at the
+package level, may lead to confusion when users try to apply control
+via MBA and then monitor the bandwidth to see if the controls are
+effective. Below are such scenarios:
+
+1. Users may *not* see an increase in actual bandwidth when percentage
+   values are increased:
+
+This can occur when aggregate L2 external bandwidth is more than L3
+external bandwidth. Consider an SKL SKU with 24 cores on a package
+where L2 external bandwidth is 10GBps (hence aggregate L2 external
+bandwidth is 240GBps) and L3 external bandwidth is 100GBps. Now a
+workload with '20 threads, having 50% bandwidth, each consuming 5GBps'
+consumes the max L3 bandwidth of 100GBps although the percentage value
+specified is only 50% << 100%. Hence increasing the bandwidth percentage
+will not yield any more bandwidth. This is because although the L2
+external bandwidth still has capacity, the L3 external bandwidth is
+fully used. Also note that this would be dependent on the number of
+cores the benchmark is run on.
+
+2. The same bandwidth percentage may mean different actual bandwidth
+   depending on the number of threads:
+
+For the same SKU as in #1, a 'single thread, with 10% bandwidth' and '4
+threads, with 10% bandwidth' can consume up to 10GBps and 40GBps although
+they have the same percentage bandwidth of 10%. This is simply because as
+threads start using more cores in an rdtgroup, the actual bandwidth may
+increase or vary although the user-specified bandwidth percentage is the same.
+
+In order to mitigate this and make the interface more user friendly,
+resctrl added support for specifying the bandwidth in MiBps as well. The
+kernel underneath would use a software feedback mechanism or a "Software
+Controller (mba_sc)" which reads the actual bandwidth using MBM counters
+and adjusts the memory bandwidth percentages to ensure::
+
+ "actual bandwidth < user specified bandwidth".
+
+By default, the schemata takes the bandwidth percentage values,
+whereas the user can switch to the "MBA software controller" mode using
+the mount option 'mba_MBps'. The schemata format is specified in the
+sections below.
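+
+For example, to enable the software controller at mount time (using the
+'mba_MBps' option documented above)::
+
+ # mount -t resctrl -o mba_MBps resctrl /sys/fs/resctrl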
+
+L3 schemata file details (code and data prioritization disabled)
+----------------------------------------------------------------
+With CDP disabled the L3 schemata format is::
+
+ L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+
+L3 schemata file details (CDP enabled via mount option to resctrl)
+------------------------------------------------------------------
+When CDP is enabled L3 control is split into two separate resources
+so you can specify independent masks for code and data like this::
+
+ L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+ L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+
+L2 schemata file details
+------------------------
+CDP is supported at L2 using the 'cdpl2' mount option. The schemata
+format is either::
+
+ L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+
+or::
+
+ L2DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+ L2CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
+
+
+Memory bandwidth Allocation (default mode)
+------------------------------------------
+
+Memory b/w domain is L3 cache.
+::
+
+ MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
+
+Memory bandwidth Allocation specified in MiBps
+----------------------------------------------
+
+Memory bandwidth domain is L3 cache.
+::
+
+ MB:<cache_id0>=bw_MiBps0;<cache_id1>=bw_MiBps1;...
+
+Slow Memory Bandwidth Allocation (SMBA)
+---------------------------------------
+AMD hardware supports Slow Memory Bandwidth Allocation (SMBA).
+CXL.memory is the only supported "slow" memory device. With the
+support of SMBA, the hardware enables bandwidth allocation on
+the slow memory devices. If there are multiple such devices in
+the system, the throttling logic groups all the slow sources
+together and applies the limit on them as a whole.
+
+The presence of SMBA (with CXL.memory) is independent of slow memory
+devices presence. If there are no such devices on the system, then
+configuring SMBA will have no impact on the performance of the system.
+
+The bandwidth domain for slow memory is L3 cache. Its schemata file
+is formatted as:
+::
+
+ SMBA:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
+
+Reading/writing the schemata file
+---------------------------------
+Reading the schemata file will show the state of all resources
+on all domains. When writing you only need to specify those values
+which you wish to change. E.g.
+::
+
+ # cat schemata
+ L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
+ L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
+ # echo "L3DATA:2=3c0;" > schemata
+ # cat schemata
+ L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
+ L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
+
+Reading/writing the schemata file (on AMD systems)
+--------------------------------------------------
+Reading the schemata file will show the current bandwidth limit on all
+domains. The allocated resources are in multiples of one eighth GB/s.
+When writing to the file, you need to specify the cache id for which you
+wish to configure the bandwidth limit.
+
+For example, to allocate 2GB/s limit on the first cache id:
+
+::
+
+ # cat schemata
+ MB:0=2048;1=2048;2=2048;3=2048
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+ # echo "MB:1=16" > schemata
+ # cat schemata
+ MB:0=2048;1= 16;2=2048;3=2048
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+Reading/writing the schemata file (on AMD systems) with SMBA feature
+--------------------------------------------------------------------
+Reading and writing the schemata file is the same as described in the
+section above for systems without SMBA.
+
+For example, to allocate 8GB/s limit on the first cache id:
+
+::
+
+ # cat schemata
+ SMBA:0=2048;1=2048;2=2048;3=2048
+ MB:0=2048;1=2048;2=2048;3=2048
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+ # echo "SMBA:1=64" > schemata
+ # cat schemata
+ SMBA:0=2048;1= 64;2=2048;3=2048
+ MB:0=2048;1=2048;2=2048;3=2048
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+Cache Pseudo-Locking
+====================
+CAT enables a user to specify the amount of cache space that an
+application can fill. Cache pseudo-locking builds on the fact that a
+CPU can still read and write data pre-allocated outside its current
+allocated area on a cache hit. With cache pseudo-locking, data can be
+preloaded into a reserved portion of cache that no application can
+fill, and from that point on will only serve cache hits. The cache
+pseudo-locked memory is made accessible to user space where an
+application can map it into its virtual address space and thus have
+a region of memory with reduced average read latency.
+
+The creation of a cache pseudo-locked region is triggered by a request
+from the user to do so that is accompanied by a schemata of the region
+to be pseudo-locked. The cache pseudo-locked region is created as follows:
+
+- Create a CAT allocation CLOSNEW with a CBM matching the schemata
+ from the user of the cache region that will contain the pseudo-locked
+ memory. This region must not overlap with any current CAT allocation/CLOS
+ on the system and no future overlap with this cache region is allowed
+ while the pseudo-locked region exists.
+- Create a contiguous region of memory of the same size as the cache
+ region.
+- Flush the cache, disable hardware prefetchers, disable preemption.
+- Make CLOSNEW the active CLOS and touch the allocated memory to load
+ it into the cache.
+- Set the previous CLOS as active.
+- At this point the closid CLOSNEW can be released - the cache
+ pseudo-locked region is protected as long as its CBM does not appear in
+ any CAT allocation. Even though the cache pseudo-locked region will from
+ this point on not appear in any CBM of any CLOS, an application running with
+ any CLOS will be able to access the memory in the pseudo-locked region since
+ the region continues to serve cache hits.
+- The contiguous region of memory loaded into the cache is exposed to
+ user-space as a character device.
+
+Cache pseudo-locking increases the probability that data will remain
+in the cache by carefully configuring the CAT feature and controlling
+application behavior. There is no guarantee that data is placed in
+cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
+“locked” data from cache. Power management C-states may shrink or
+power off cache. Deeper C-states will automatically be restricted on
+pseudo-locked region creation.
+
+It is required that an application using a pseudo-locked region runs
+with affinity to the cores (or a subset of the cores) associated
+with the cache on which the pseudo-locked region resides. A sanity check
+within the code will not allow an application to map pseudo-locked memory
+unless it runs with affinity to cores associated with the cache on which the
+pseudo-locked region resides. The sanity check is only done during the
+initial mmap() handling; there is no enforcement afterwards and the
+application itself needs to ensure it remains affine to the correct cores.
+
+Pseudo-locking is accomplished in two stages:
+
+1) During the first stage the system administrator allocates a portion
+ of cache that should be dedicated to pseudo-locking. At this time an
+ equivalent portion of memory is allocated, loaded into allocated
+ cache portion, and exposed as a character device.
+2) During the second stage a user-space application maps (mmap()) the
+ pseudo-locked memory into its address space.
+
+Cache Pseudo-Locking Interface
+------------------------------
+A pseudo-locked region is created using the resctrl interface as follows:
+
+1) Create a new resource group by creating a new directory in /sys/fs/resctrl.
+2) Change the new resource group's mode to "pseudo-locksetup" by writing
+ "pseudo-locksetup" to the "mode" file.
+3) Write the schemata of the pseudo-locked region to the "schemata" file. All
+ bits within the schemata should be "unused" according to the "bit_usage"
+ file.
+
+On successful pseudo-locked region creation the "mode" file will contain
+"pseudo-locked" and a new character device with the same name as the resource
+group will exist in /dev/pseudo_lock. This character device can be mmap()'ed
+by user space in order to obtain access to the pseudo-locked memory region.
+
+An example of cache pseudo-locked region creation and usage can be found below.
+
+Cache Pseudo-Locking Debugging Interface
+----------------------------------------
+The pseudo-locking debugging interface is enabled by default (if
+CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl.
+
+There is no explicit way for the kernel to test if a provided memory
+location is present in the cache. The pseudo-locking debugging interface uses
+the tracing infrastructure to provide two ways to measure cache residency of
+the pseudo-locked region:
+
+1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data
+ from these measurements are best visualized using a hist trigger (see
+ example below). In this test the pseudo-locked region is traversed at
+ a stride of 32 bytes while hardware prefetchers and preemption
+ are disabled. This also provides a substitute visualization of cache
+ hits and misses.
+2) Cache hit and miss measurements using model specific precision counters if
+ available. Depending on the levels of cache on the system the pseudo_lock_l2
+ and pseudo_lock_l3 tracepoints are available.
+
+When a pseudo-locked region is created, a new directory is created for
+it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
+write-only file, pseudo_lock_measure, is present in this directory. The
+measurement of the pseudo-locked region depends on the number written to this
+debugfs file:
+
+1:
+ writing "1" to the pseudo_lock_measure file will trigger the latency
+ measurement captured in the pseudo_lock_mem_latency tracepoint. See
+ example below.
+2:
+ writing "2" to the pseudo_lock_measure file will trigger the L2 cache
+ residency (cache hits and misses) measurement captured in the
+ pseudo_lock_l2 tracepoint. See example below.
+3:
+ writing "3" to the pseudo_lock_measure file will trigger the L3 cache
+ residency (cache hits and misses) measurement captured in the
+ pseudo_lock_l3 tracepoint.
+
+All measurements are recorded with the tracing infrastructure. This requires
+the relevant tracepoints to be enabled before the measurement is triggered.
+
+Example of latency debugging interface
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+In this example a pseudo-locked region named "newlock" was created. Here is
+how we can measure the latency in cycles of reading from this region and
+visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS
+is set::
+
+ # :> /sys/kernel/tracing/trace
+ # echo 'hist:keys=latency' > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/trigger
+ # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
+ # echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
+ # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
+ # cat /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/hist
+
+ # event histogram
+ #
+ # trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active]
+ #
+
+ { latency: 456 } hitcount: 1
+ { latency: 50 } hitcount: 83
+ { latency: 36 } hitcount: 96
+ { latency: 44 } hitcount: 174
+ { latency: 48 } hitcount: 195
+ { latency: 46 } hitcount: 262
+ { latency: 42 } hitcount: 693
+ { latency: 40 } hitcount: 3204
+ { latency: 38 } hitcount: 3484
+
+ Totals:
+ Hits: 8192
+ Entries: 9
+ Dropped: 0
+
+Example of cache hits/misses debugging
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+In this example a pseudo-locked region named "newlock" was created on the L2
+cache of a platform. Here is how we can obtain details of the cache hits
+and misses using the platform's precision counters.
+::
+
+ # :> /sys/kernel/tracing/trace
+ # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
+ # echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
+ # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
+ # cat /sys/kernel/tracing/trace
+
+ # tracer: nop
+ #
+ # _-----=> irqs-off
+ # / _----=> need-resched
+ # | / _---=> hardirq/softirq
+ # || / _--=> preempt-depth
+ # ||| / delay
+ # TASK-PID CPU# |||| TIMESTAMP FUNCTION
+ # | | | |||| | |
+ pseudo_lock_mea-1672 [002] .... 3132.860500: pseudo_lock_l2: hits=4097 miss=0
+
+
+Examples for RDT allocation usage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+1) Example 1
+
+On a two socket machine (one L3 cache per socket) with just four bits
+for cache bit masks, minimum b/w of 10% with a memory bandwidth
+granularity of 10%.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+ # mkdir p0 p1
+ # echo "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
+ # echo "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata
+
+The default resource group is unmodified, so we have access to all parts
+of all caches (its schemata file reads "L3:0=f;1=f").
+
+Tasks that are under the control of group "p0" may only allocate from the
+"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
+Tasks in group "p1" use the "lower" 50% of cache on both sockets.
+
+Similarly, tasks that are under the control of group "p0" may use a
+maximum memory b/w of 50% on socket 0 and 50% on socket 1.
+Tasks in group "p1" may also use 50% memory b/w on both sockets.
+Note that unlike cache masks, memory b/w cannot specify whether these
+allocations can overlap or not. The allocation specifies the maximum
+b/w that the group may be able to use and the system admin can configure
+the b/w accordingly.
+
+If resctrl is using the software controller (mba_sc) then the user can
+enter the max b/w in MiBps rather than percentage values.
+::
+
+ # echo "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
+ # echo "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata
+
+In the above example the tasks in "p1" and "p0" on socket 0 would use a max b/w
+of 1024MB whereas on socket 1 they would use 500MB.
+
+2) Example 2
+
+Again two sockets, but this time with a more realistic 20-bit mask.
+
+Two real-time tasks, pid=1234 running on processor 0 and pid=5678 running on
+processor 1, on socket 0 of a two-socket, dual-core machine. To avoid noisy
+neighbors, each of the two real-time tasks exclusively occupies one quarter
+of L3 cache on socket 0.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+
+First we reset the schemata for the default group so that the "upper"
+50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
+ordinary tasks::
+
+ # echo "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata
+
+Next we make a resource group for our first real time task and give
+it access to the "top" 25% of the cache on socket 0.
+::
+
+ # mkdir p0
+ # echo "L3:0=f8000;1=fffff" > p0/schemata
+
+Finally we move our first real time task into this resource group. We
+also use taskset(1) to ensure the task always runs on a dedicated CPU
+on socket 0. Most uses of resource groups will also constrain which
+processors tasks run on.
+::
+
+ # echo 1234 > p0/tasks
+ # taskset -cp 1 1234
+
+Ditto for the second real time task (with the remaining 25% of cache)::
+
+ # mkdir p1
+ # echo "L3:0=7c00;1=fffff" > p1/schemata
+ # echo 5678 > p1/tasks
+ # taskset -cp 2 5678
+
+For the same 2 socket system with memory b/w resource and CAT L3 the
+schemata would look like this (assuming min_bandwidth is 10 and
+bandwidth_gran is 10):
+
+For our first real time task this would request 20% memory b/w on socket 0.
+::
+
+ # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata
+
+For our second real time task this would request another 20% memory b/w
+on socket 0.
+::
+
+ # echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata
+
+3) Example 3
+
+A single socket system which has real-time tasks running on cores 4-7 and
+a non-real-time workload assigned to cores 0-3. The real-time tasks share text
+and data, so a per-task association is not required and due to interaction
+with the kernel it's desired that the kernel on these cores shares L3 with
+the tasks.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+
+First we reset the schemata for the default group so that the "upper"
+50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
+cannot be used by ordinary tasks::
+
+ # echo "L3:0=3ff\nMB:0=50" > schemata
+
+Next we make a resource group for our real time cores and give it access
+to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
+socket 0.
+::
+
+ # mkdir p0
+ # echo "L3:0=ffc00\nMB:0=50" > p0/schemata
+
+Finally we move core 4-7 over to the new group and make sure that the
+kernel and the tasks running there get 50% of the cache. They should
+also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
+siblings and only the real time threads are scheduled on the cores 4-7.
+::
+
+ # echo F0 > p0/cpus
+
+4) Example 4
+
+The resource groups in previous examples were all in the default "shareable"
+mode allowing sharing of their cache allocations. If one resource group
+configures a cache allocation then nothing prevents another resource group
+from overlapping with that allocation.
+
+In this example a new exclusive resource group will be created on an L2 CAT
+system with two L2 cache instances that can be configured with an 8-bit
+capacity bitmask. The new exclusive resource group will be configured to use
+25% of each cache instance.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl/
+ # cd /sys/fs/resctrl
+
+First, we observe that the default group is configured to allocate to all L2
+cache::
+
+ # cat schemata
+ L2:0=ff;1=ff
+
+We could attempt to create the new resource group at this point, but it will
+fail because of the overlap with the schemata of the default group::
+
+ # mkdir p0
+ # echo 'L2:0=0x3;1=0x3' > p0/schemata
+ # cat p0/mode
+ shareable
+ # echo exclusive > p0/mode
+ -sh: echo: write error: Invalid argument
+ # cat info/last_cmd_status
+ schemata overlaps
+
+To ensure that there is no overlap with another resource group the default
+resource group's schemata has to change, making it possible for the new
+resource group to become exclusive.
+::
+
+ # echo 'L2:0=0xfc;1=0xfc' > schemata
+ # echo exclusive > p0/mode
+ # grep . p0/*
+ p0/cpus:0
+ p0/mode:exclusive
+ p0/schemata:L2:0=03;1=03
+ p0/size:L2:0=262144;1=262144
+
+A newly created resource group will not overlap with an exclusive resource
+group::
+
+ # mkdir p1
+ # grep . p1/*
+ p1/cpus:0
+ p1/mode:shareable
+ p1/schemata:L2:0=fc;1=fc
+ p1/size:L2:0=786432;1=786432
+
+The bit_usage will reflect how the cache is used::
+
+ # cat info/L2/bit_usage
+ 0=SSSSSSEE;1=SSSSSSEE
+
+A resource group cannot be forced to overlap with an exclusive resource group::
+
+ # echo 'L2:0=0x1;1=0x1' > p1/schemata
+ -sh: echo: write error: Invalid argument
+ # cat info/last_cmd_status
+ overlaps with exclusive group
+
+Example of Cache Pseudo-Locking
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Lock a portion of the L2 cache from cache id 1 using CBM 0x3. The
+pseudo-locked region is exposed at /dev/pseudo_lock/newlock and can be
+provided to an application as an argument to mmap().
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl/
+ # cd /sys/fs/resctrl
+
+Ensure that there are bits available that can be pseudo-locked. Since only
+unused bits can be pseudo-locked, the bits to be pseudo-locked need to be
+removed from the default resource group's schemata::
+
+ # cat info/L2/bit_usage
+ 0=SSSSSSSS;1=SSSSSSSS
+ # echo 'L2:1=0xfc' > schemata
+ # cat info/L2/bit_usage
+ 0=SSSSSSSS;1=SSSSSS00
+
+Create a new resource group that will be associated with the pseudo-locked
+region, indicate that it will be used for a pseudo-locked region, and
+configure the requested pseudo-locked region capacity bitmask::
+
+ # mkdir newlock
+ # echo pseudo-locksetup > newlock/mode
+ # echo 'L2:1=0x3' > newlock/schemata
+
+On success the resource group's mode will change to pseudo-locked, the
+bit_usage will reflect the pseudo-locked region, and the character device
+exposing the pseudo-locked region will exist::
+
+ # cat newlock/mode
+ pseudo-locked
+ # cat info/L2/bit_usage
+ 0=SSSSSSSS;1=SSSSSSPP
+ # ls -l /dev/pseudo_lock/newlock
+ crw------- 1 root root 243, 0 Apr 3 05:01 /dev/pseudo_lock/newlock
+
+::
+
+ /*
+ * Example code to access one page of pseudo-locked cache region
+ * from user space.
+ */
+ #define _GNU_SOURCE
+ #include <fcntl.h>
+ #include <sched.h>
+ #include <stdio.h>
+ #include <stdlib.h>
+ #include <unistd.h>
+ #include <sys/mman.h>
+
+ /*
+ * It is required that the application runs with affinity to only
+ * cores associated with the pseudo-locked region. Here the cpu
+ * is hardcoded for convenience of example.
+ */
+ static int cpuid = 2;
+
+ int main(int argc, char *argv[])
+ {
+ cpu_set_t cpuset;
+ long page_size;
+ void *mapping;
+ int dev_fd;
+ int ret;
+
+ page_size = sysconf(_SC_PAGESIZE);
+
+ CPU_ZERO(&cpuset);
+ CPU_SET(cpuid, &cpuset);
+ ret = sched_setaffinity(0, sizeof(cpuset), &cpuset);
+ if (ret < 0) {
+ perror("sched_setaffinity");
+ exit(EXIT_FAILURE);
+ }
+
+ dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR);
+ if (dev_fd < 0) {
+ perror("open");
+ exit(EXIT_FAILURE);
+ }
+
+ mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
+ dev_fd, 0);
+ if (mapping == MAP_FAILED) {
+ perror("mmap");
+ close(dev_fd);
+ exit(EXIT_FAILURE);
+ }
+
+ /* Application interacts with pseudo-locked memory @mapping */
+
+ ret = munmap(mapping, page_size);
+ if (ret < 0) {
+ perror("munmap");
+ close(dev_fd);
+ exit(EXIT_FAILURE);
+ }
+
+ close(dev_fd);
+ exit(EXIT_SUCCESS);
+ }
+
+Locking between applications
+----------------------------
+
+Certain operations on the resctrl filesystem, composed of read/writes
+to/from multiple files, must be atomic.
+
+As an example, the allocation of an exclusive reservation of L3 cache
+involves:
+
+ 1. Read the cbmmasks from each directory or the per-resource "bit_usage"
+ 2. Find a contiguous set of bits in the global CBM bitmask that is clear
+ in any of the directory cbmmasks
+ 3. Create a new directory
+ 4. Set the bits found in step 2 to the new directory "schemata" file
+
+If two applications attempt to allocate space concurrently then they can
+end up allocating the same bits so the reservations are shared instead of
+exclusive.
+
+To coordinate atomic operations on the resctrlfs and to avoid the problem
+above, the following locking procedure is recommended:
+
+Locking is based on flock, which is available in libc and also as a shell
+script command.
+
+Write lock:
+
+ A) Take flock(LOCK_EX) on /sys/fs/resctrl
+ B) Read/write the directory structure.
+ C) Release the lock: flock(LOCK_UN)
+
+Read lock:
+
+ A) Take flock(LOCK_SH) on /sys/fs/resctrl
+ B) On success, read the directory structure.
+ C) Release the lock: flock(LOCK_UN)
+
+Example with bash::
+
+ # Atomically read directory structure
+ $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl
+
+ # Read directory contents and create new subdirectory
+
+ $ cat create-dir.sh
+ find /sys/fs/resctrl/ > output.txt
+ mask=$(function-of output.txt)  # placeholder: derive an unused CBM
+ mkdir /sys/fs/resctrl/newres/
+ echo "$mask" > /sys/fs/resctrl/newres/schemata
+
+ $ flock /sys/fs/resctrl/ ./create-dir.sh
+
+Example with C::
+
+ /*
+  * Example code to take advisory locks
+  * before accessing the resctrl filesystem
+  */
+ #include <sys/file.h>
+ #include <fcntl.h>
+ #include <stdio.h>
+ #include <stdlib.h>
+ #include <unistd.h>
+
+ void resctrl_take_shared_lock(int fd)
+ {
+ 	int ret;
+
+ 	/* take shared lock on resctrl filesystem */
+ 	ret = flock(fd, LOCK_SH);
+ 	if (ret) {
+ 		perror("flock");
+ 		exit(-1);
+ 	}
+ }
+
+ void resctrl_take_exclusive_lock(int fd)
+ {
+ 	int ret;
+
+ 	/* take exclusive lock on resctrl filesystem */
+ 	ret = flock(fd, LOCK_EX);
+ 	if (ret) {
+ 		perror("flock");
+ 		exit(-1);
+ 	}
+ }
+
+ void resctrl_release_lock(int fd)
+ {
+ 	int ret;
+
+ 	/* release lock on resctrl filesystem */
+ 	ret = flock(fd, LOCK_UN);
+ 	if (ret) {
+ 		perror("flock");
+ 		exit(-1);
+ 	}
+ }
+
+ int main(void)
+ {
+ 	int fd;
+
+ 	fd = open("/sys/fs/resctrl", O_RDONLY | O_DIRECTORY);
+ 	if (fd == -1) {
+ 		perror("open");
+ 		exit(-1);
+ 	}
+ 	resctrl_take_shared_lock(fd);
+ 	/* code to read directory contents */
+ 	resctrl_release_lock(fd);
+
+ 	resctrl_take_exclusive_lock(fd);
+ 	/* code to read and write directory contents */
+ 	resctrl_release_lock(fd);
+
+ 	close(fd);
+ 	return 0;
+ }
+
+Examples for RDT Monitoring along with allocation usage
+=======================================================
+Reading monitored data
+----------------------
+Reading an event file (for example, mon_data/mon_L3_00/llc_occupancy) shows
+the current snapshot of LLC occupancy of the corresponding MON
+group or CTRL_MON group.
+
+
+Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
+------------------------------------------------------------------------
+On a two socket machine (one L3 cache per socket) with just four bits
+for cache bit masks::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+ # mkdir p0 p1
+ # echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
+ # echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
+ # echo 5678 > p1/tasks
+ # echo 5679 > p1/tasks
+
+The default resource group is unmodified, so we have access to all parts
+of all caches (its schemata file reads "L3:0=f;1=f").
+
+Tasks that are under the control of group "p0" may only allocate from the
+"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
+Tasks in group "p1" use the "lower" 50% of cache on both sockets.
+
+Create monitor groups and assign a subset of tasks to each monitor group.
+::
+
+ # cd /sys/fs/resctrl/p1/mon_groups
+ # mkdir m11 m12
+ # echo 5678 > m11/tasks
+ # echo 5679 > m12/tasks
+
+Fetch the data (shown in bytes)
+::
+
+ # cat m11/mon_data/mon_L3_00/llc_occupancy
+ 16234000
+ # cat m11/mon_data/mon_L3_01/llc_occupancy
+ 14789000
+ # cat m12/mon_data/mon_L3_00/llc_occupancy
+ 16789000
+
+The parent CTRL_MON group shows the aggregated data.
+::
+
+ # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
+ 31234000
+
+Example 2 (Monitor a task from its creation)
+--------------------------------------------
+On a two socket machine (one L3 cache per socket)::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+ # mkdir p0 p1
+
+An RMID is allocated to the group once it is created and hence the <cmd>
+below is monitored from its creation.
+::
+
+ # echo $$ > /sys/fs/resctrl/p1/tasks
+ # <cmd>
+
+Fetch the data::
+
+ # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
+ 31789000
+
+Example 3 (Monitor without CAT support or before creating CAT groups)
+---------------------------------------------------------------------
+
+Assume a system like HSW has only CQM and no CAT support. In this case
+resctrl will still mount but cannot create CTRL_MON directories.
+But the user can create different MON groups within the root group and
+thereby monitor all tasks, including kernel threads.
+
+This can also be used to profile a job's cache size footprint before it
+is allocated to different allocation groups.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+ # mkdir mon_groups/m01
+ # mkdir mon_groups/m02
+
+ # echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
+ # echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks
+
+Monitor the groups separately and also get per domain data. From the
+output below it is apparent that the tasks are mostly doing work on
+domain (socket) 0.
+::
+
+ # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_00/llc_occupancy
+ 31234000
+ # cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_01/llc_occupancy
+ 34555
+ # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_00/llc_occupancy
+ 31234000
+ # cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_01/llc_occupancy
+ 32789
+
+
+Example 4 (Monitor real time tasks)
+-----------------------------------
+
+A single socket system which has real time tasks running on cores 4-7
+and non real time tasks on other cpus. We want to monitor the cache
+occupancy of the real time threads on these cores.
+::
+
+ # mount -t resctrl resctrl /sys/fs/resctrl
+ # cd /sys/fs/resctrl
+ # mkdir p1
+
+Move the cpus 4-7 over to p1::
+
+ # echo f0 > p1/cpus
+
+View the llc occupancy snapshot::
+
+ # cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
+ 11234000
+
+Intel RDT Errata
+================
+
+Intel MBM Counters May Report System Memory Bandwidth Incorrectly
+-----------------------------------------------------------------
+
+Errata SKX99 for Skylake server and BDF102 for Broadwell server.
+
+Problem: Intel Memory Bandwidth Monitoring (MBM) counters track metrics
+according to the assigned Resource Monitor ID (RMID) for that logical
+core. The IA32_QM_CTR register (MSR 0xC8E), used to report these
+metrics, may report incorrect system bandwidth for certain RMID values.
+
+Implication: Due to the errata, system memory bandwidth may not match
+what is reported.
+
+Workaround: MBM total and local readings are corrected according to the
+following correction factor table:
+
++---------------+---------------+---------------+-----------------+
+|core count     |rmid count     |rmid threshold |correction factor|
++---------------+---------------+---------------+-----------------+
+|1              |8              |0              |1.000000         |
++---------------+---------------+---------------+-----------------+
+|2              |16             |0              |1.000000         |
++---------------+---------------+---------------+-----------------+
+|3              |24             |15             |0.969650         |
++---------------+---------------+---------------+-----------------+
+|4              |32             |0              |1.000000         |
++---------------+---------------+---------------+-----------------+
+|6              |48             |31             |0.969650         |
++---------------+---------------+---------------+-----------------+
+|7              |56             |47             |1.142857         |
++---------------+---------------+---------------+-----------------+
+|8              |64             |0              |1.000000         |
++---------------+---------------+---------------+-----------------+
+|9              |72             |63             |1.185115         |
++---------------+---------------+---------------+-----------------+
+|10             |80             |63             |1.066553         |
++---------------+---------------+---------------+-----------------+
+|11             |88             |79             |1.454545         |
++---------------+---------------+---------------+-----------------+
+|12             |96             |0              |1.000000         |
++---------------+---------------+---------------+-----------------+
+|13             |104            |95             |1.230769         |
++---------------+---------------+---------------+-----------------+
+|14             |112            |95             |1.142857         |
++---------------+---------------+---------------+-----------------+
+|15             |120            |95             |1.066667         |
++---------------+---------------+---------------+-----------------+
+|16             |128            |0              |1.000000         |
++---------------+---------------+---------------+-----------------+
+|17             |136            |127            |1.254863         |
++---------------+---------------+---------------+-----------------+
+|18             |144            |127            |1.185255         |
++---------------+---------------+---------------+-----------------+
+|19             |152            |0              |1.000000         |
++---------------+---------------+---------------+-----------------+
+|20             |160            |127            |1.066667         |
++---------------+---------------+---------------+-----------------+
+|21             |168            |0              |1.000000         |
++---------------+---------------+---------------+-----------------+
+|22             |176            |159            |1.454334         |
++---------------+---------------+---------------+-----------------+
+|23             |184            |0              |1.000000         |
++---------------+---------------+---------------+-----------------+
+|24             |192            |127            |0.969744         |
++---------------+---------------+---------------+-----------------+
+|25             |200            |191            |1.280246         |
++---------------+---------------+---------------+-----------------+
+|26             |208            |191            |1.230921         |
++---------------+---------------+---------------+-----------------+
+|27             |216            |0              |1.000000         |
++---------------+---------------+---------------+-----------------+
+|28             |224            |191            |1.143118         |
++---------------+---------------+---------------+-----------------+
+
+If rmid > rmid threshold, MBM total and local values should be multiplied
+by the correction factor.
+
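+The following is a minimal sketch in C of how such a correction could be
+applied (this is an illustration, not the kernel's implementation; the
+scaling shift is an assumption, and the sample entry is the 24-core row
+from the table above)::
+
+	#include <stdint.h>
+
+	#define CF_SHIFT 20	/* scale factors by 2^20 for integer math */
+
+	struct mbm_correction {
+		unsigned int rmid_threshold;
+		uint64_t cf;	/* correction factor * 2^20 */
+	};
+
+	/* 24 cores, 192 RMIDs: rmid threshold 127, factor 0.969744 */
+	static const struct mbm_correction cf_24core = {
+		.rmid_threshold = 127,
+		.cf = (uint64_t)(0.969744 * (1 << CF_SHIFT)),
+	};
+
+	static uint64_t correct_mbm_count(uint64_t raw, unsigned int rmid,
+					  const struct mbm_correction *c)
+	{
+		if (rmid > c->rmid_threshold)
+			return (raw * c->cf) >> CF_SHIFT;
+		return raw;
+	}
+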
+See:
+
+1. Erratum SKX99 in Intel Xeon Processor Scalable Family Specification Update:
+http://web.archive.org/web/20200716124958/https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html
+
+2. Erratum BDF102 in Intel Xeon E5-2600 v4 Processor Product Family Specification Update:
+http://web.archive.org/web/20191125200531/https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v4-spec-update.pdf
+
+3. The errata in Intel Resource Director Technology (Intel RDT) on 2nd Generation Intel Xeon Scalable Processors Reference Manual:
+https://software.intel.com/content/www/us/en/develop/articles/intel-resource-director-technology-rdt-reference-manual.html
+
+for further information.
diff --git a/Documentation/filesystems/seq_file.rst b/Documentation/filesystems/seq_file.rst
index 7f7ee06b2693..1e1713d00010 100644
--- a/Documentation/filesystems/seq_file.rst
+++ b/Documentation/filesystems/seq_file.rst
@@ -129,7 +129,9 @@ also a special value which can be returned by the start() function
called SEQ_START_TOKEN; it can be used if you wish to instruct your
show() function (described below) to print a header at the top of the
output. SEQ_START_TOKEN should only be used if the offset is zero,
-however.
+however. SEQ_START_TOKEN has no special meaning to the core seq_file
+code. It is provided as a convenience for a start() function to
+communicate with the next() and show() functions.
The next function to implement is called, amazingly, next(); its job is to
move the iterator forward to the next position in the sequence. The
@@ -145,6 +147,22 @@ complete. Here's the example version::
return spos;
}
+The next() function should set ``*pos`` to a value that start() can use
+to find the new location in the sequence. When the iterator is being
+stored in the private data area, rather than being reinitialized on each
+start(), it might seem sufficient to simply set ``*pos`` to any non-zero
+value (zero always tells start() to restart the sequence). This is not
+sufficient due to historical problems.
+
+Historically, many next() functions have *not* updated ``*pos`` at
+end-of-file. If the value is then used by start() to initialise the
+iterator, this can result in corner cases where the last entry in the
+sequence is reported twice in the file. In order to discourage this bug
+from being resurrected, the core seq_file code now produces a warning if
+a next() function does not change the value of ``*pos``. Consequently a
+next() function *must* change the value of ``*pos``, and of course must
+set it to a non-zero value.
+
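+A variant of the example next() shown above that obeys this rule even at
+end-of-file might look like this (``max_entries`` is a hypothetical bound
+marking the end of the sequence)::
+
+	static void *ct_seq_next(struct seq_file *s, void *v, loff_t *pos)
+	{
+		loff_t *spos = v;
+
+		++*pos;		/* always advance *pos, even at EOF */
+		if (++*spos >= max_entries)
+			return NULL;	/* end of sequence */
+		return spos;
+	}
+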
The stop() function closes a session; its job, of course, is to clean
up. If dynamic memory is allocated for the iterator, stop() is the
place to free it; if a lock was taken by start(), stop() must release
@@ -199,6 +217,12 @@ between the calls to start() and stop(), so holding a lock during that time
is a reasonable thing to do. The seq_file code will also avoid taking any
other locks while the iterator is active.
+The iterator value returned by start() or next() is guaranteed to be
+passed to a subsequent next() or stop() call. This allows resources
+such as locks that were taken to be reliably released. There is *no*
+guarantee that the iterator will be passed to show(), though in practice
+it often will be.
+
Formatted output
================
diff --git a/Documentation/filesystems/sharedsubtree.rst b/Documentation/filesystems/sharedsubtree.rst
index d83395354250..1cf56489ed48 100644
--- a/Documentation/filesystems/sharedsubtree.rst
+++ b/Documentation/filesystems/sharedsubtree.rst
@@ -147,6 +147,7 @@ replicas continue to be exactly same.
3) Setting mount states
+-----------------------
The mount command (util-linux package) can be used to set mount
states::
@@ -612,6 +613,7 @@ replicas continue to be exactly same.
6) Quiz
+-------
A. What is the result of the following command sequence?
@@ -673,6 +675,7 @@ replicas continue to be exactly same.
/mnt/1/test be?
7) FAQ
+------
Q1. Why is bind mount needed? How is it different from symbolic links?
symbolic links can get stale if the destination mount gets
@@ -841,6 +844,7 @@ replicas continue to be exactly same.
tmp usr tmp usr tmp usr
8) Implementation
+-----------------
8A) Datastructure
diff --git a/Documentation/filesystems/cifs/cifsroot.rst b/Documentation/filesystems/smb/cifsroot.rst
index 4930bb443134..bf2d9db3acb9 100644
--- a/Documentation/filesystems/cifs/cifsroot.rst
+++ b/Documentation/filesystems/smb/cifsroot.rst
@@ -59,7 +59,7 @@ the root file system via SMB protocol.
Enables the kernel to mount the root file system via SMB that are
located in the <server-ip> and <share> specified in this option.
-The default mount options are set in fs/cifs/cifsroot.c.
+The default mount options are set in fs/smb/client/cifsroot.c.
server-ip
IPv4 address of the server.
diff --git a/Documentation/filesystems/smb/index.rst b/Documentation/filesystems/smb/index.rst
new file mode 100644
index 000000000000..6df23b0e45c8
--- /dev/null
+++ b/Documentation/filesystems/smb/index.rst
@@ -0,0 +1,11 @@
+===============================
+CIFS
+===============================
+
+
+.. toctree::
+ :maxdepth: 1
+
+ ksmbd
+ cifsroot
+ smbdirect
diff --git a/Documentation/filesystems/smb/ksmbd.rst b/Documentation/filesystems/smb/ksmbd.rst
new file mode 100644
index 000000000000..67cb68ea6e68
--- /dev/null
+++ b/Documentation/filesystems/smb/ksmbd.rst
@@ -0,0 +1,186 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+KSMBD - SMB3 Kernel Server
+==========================
+
+KSMBD is a Linux kernel server which implements the SMB3 protocol in
+kernel space for sharing files over the network.
+
+KSMBD architecture
+==================
+
+The subset of operations that are performance related belongs in kernel
+space, while operations that are not really performance related belong
+in user space. Hence, DCE/RPC management, which has historically been
+the source of a number of buffer overflow issues and dangerous security
+bugs, and user account management are implemented in user space as
+ksmbd.mountd, while file operations that are performance related
+(open/read/write/close etc.) are handled in kernel space (ksmbd). This
+also allows for easier integration with the VFS interface for all file
+operations.
+
+ksmbd (kernel daemon)
+---------------------
+
+When the server daemon is started, it starts up a forker thread
+(ksmbd/interface name) at initialization time and opens a dedicated port
+445 for listening to SMB requests. Whenever new clients make a request,
+the forker thread accepts the client connection and forks a new thread
+to provide a dedicated communication channel between the client and the
+server. This allows for parallel processing of SMB requests (commands)
+from clients as well as allowing new clients to make new connections.
+Each instance is named ksmbd/1~n (port number) to indicate a connected
+client. Depending on the SMB request type, each new thread can decide to
+pass a command through to user space (ksmbd.mountd); currently DCE/RPC
+commands are identified to be handled through user space. To further
+utilize the Linux kernel, the commands are processed as work items
+executed by the handlers of the ksmbd-io kworker threads. This allows
+for multiplexing of the handlers, as the kernel takes care of initiating
+extra worker threads if the load increases and, vice versa, destroys the
+extra worker threads if the load decreases. So, after the connection is
+established with the client, the dedicated ksmbd/1..n (port number)
+thread takes complete ownership of receiving and parsing SMB commands.
+Each received command is worked in parallel, i.e. there can be multiple
+client commands being worked in parallel. For each received command a
+separate kernel work item is prepared and queued to be handled by the
+ksmbd-io kworkers. This allows load sharing to be managed optimally by
+the kernel and optimizes client performance by handling client commands
+in parallel.
+
+ksmbd.mountd (user space daemon)
+--------------------------------
+
+ksmbd.mountd is a userspace process which transfers the user account and
+password that are registered using ksmbd.adduser (part of the userspace
+utilities). It also passes the share information parameters that are
+parsed from smb.conf to ksmbd in the kernel. For execution it runs a
+daemon which is continuously connected to the kernel interface using a
+netlink socket, waiting for requests (DCE/RPC and share/user info). It
+handles the RPC calls (at a minimum a few dozen), such as NetShareEnum
+and NetServerGetInfo, that are most important for a file server. The
+complete DCE/RPC response is prepared in user space and passed over to
+the associated kernel thread for the client.
+
+
+KSMBD Feature Status
+====================
+
+============================== =================================================
+Feature name                   Status
+============================== =================================================
+Dialects                       Supported. SMB2.1, SMB3.0 and SMB3.1.1 dialects
+                               (intentionally excludes the security-vulnerable
+                               SMB1 dialect).
+Auto Negotiation               Supported.
+Compound Request               Supported.
+Oplock Cache Mechanism         Supported.
+SMB2 leases(v1 lease)          Supported.
+Directory leases(v2 lease)     Supported.
+Multi-credits                  Supported.
+NTLM/NTLMv2                    Supported.
+HMAC-SHA256 Signing            Supported.
+Secure negotiate               Supported.
+Signing Update                 Supported.
+Pre-authentication integrity   Supported.
+SMB3 encryption(CCM, GCM)      Supported. (CCM/GCM128 and CCM/GCM256 supported)
+SMB direct(RDMA)               Supported.
+SMB3 Multi-channel             Partially Supported. Replay/retry mechanisms
+                               are planned for the future.
+Receive Side Scaling mode      Supported.
+SMB3.1.1 POSIX extension       Supported.
+ACLs                           Partially Supported. Only DACLs available; SACLs
+                               (auditing) are planned for the future. For
+                               ownership (SIDs) ksmbd generates random subauth
+                               values (then stores them to disk) and uses the
+                               uid/gid from the inode as the RID for the local
+                               domain SID. The current ACL implementation is
+                               limited to a standalone server, not a domain
+                               member. Integration with Samba tools is being
+                               worked on to allow future support for running
+                               as a domain member.
+Kerberos                       Supported.
+Durable handle v1,v2           Planned for future.
+Persistent handle              Planned for future.
+SMB2 notify                    Planned for future.
+Sparse file support            Supported.
+DCE/RPC support                Partially Supported. A few calls
+                               (NetShareEnumAll, NetServerGetInfo, SAMR,
+                               LSARPC) that are needed for a file server
+                               are handled via the netlink interface from
+                               ksmbd.mountd. Additional integration with
+                               Samba tools and libraries via upcall is
+                               being investigated to allow support for
+                               additional DCE/RPC management calls (and
+                               future support for e.g. the Witness
+                               protocol).
+ksmbd/nfsd interoperability    Planned for future. The features that ksmbd
+                               supports are Leases, Notify, ACLs and Share
+                               modes.
+SMB3.1.1 Compression           Planned for future.
+SMB3.1.1 over QUIC             Planned for future.
+Signing/Encryption over RDMA   Planned for future.
+SMB3.1.1 GMAC signing support  Planned for future.
+============================== =================================================
+
+
+How to run
+==========
+
+1. Download ksmbd-tools (https://github.com/cifsd-team/ksmbd-tools/releases) and
+   compile them.
+
+   - Refer to the README (https://github.com/cifsd-team/ksmbd-tools/blob/master/README.md)
+     to learn how to use the ksmbd.mountd/adduser/addshare/control utils.
+
+ $ ./autogen.sh
+ $ ./configure --with-rundir=/run
+ $ make && sudo make install
+
+2. Create the /usr/local/etc/ksmbd/ksmbd.conf file and add an SMB share to it.
+
+   - Refer to ksmbd.conf.example in ksmbd-tools, and see the ksmbd.conf
+     manpage for details on configuring shares.
+
+ $ man ksmbd.conf
+
+3. Create user/password for SMB share.
+
+ - See ksmbd.adduser manpage.
+
+ $ man ksmbd.adduser
+ $ sudo ksmbd.adduser -a <Enter USERNAME for SMB share access>
+
+4. Insert the ksmbd.ko module after you build your kernel. No need to load the module
+ if ksmbd is built into the kernel.
+
+   - Set ksmbd in menuconfig (e.g. $ make menuconfig)
+       [*] Network File Systems  --->
+           <M> SMB3 server support (EXPERIMENTAL)
+
+   $ sudo modprobe ksmbd
+
+5. Start ksmbd user space daemon
+
+ $ sudo ksmbd.mountd
+
+6. Access share from Windows or Linux using SMB3 client (cifs.ko or smbclient of samba)
+
+Shutdown KSMBD
+==============
+
+1. Kill the user space and kernel space daemons
+   # sudo ksmbd.control -s
+
+How to turn debug print on
+==========================
+
+Debug prints can be enabled per layer via
+/sys/class/ksmbd-control/debug.
+
+1. Enable all component prints
+ # sudo ksmbd.control -d "all"
+
+2. Enable one of the components (smb, auth, vfs, oplock, ipc, conn, rdma)
+ # sudo ksmbd.control -d "smb"
+
+3. Show what prints are enabled.
+ # cat /sys/class/ksmbd-control/debug
+ [smb] auth vfs oplock ipc conn [rdma]
+
+4. Disable prints:
+   If you write the selected component once more, it is disabled and shown
+   without brackets.
diff --git a/Documentation/filesystems/smb/smbdirect.rst b/Documentation/filesystems/smb/smbdirect.rst
new file mode 100644
index 000000000000..ca6927c0b2c0
--- /dev/null
+++ b/Documentation/filesystems/smb/smbdirect.rst
@@ -0,0 +1,103 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================
+SMB Direct - SMB3 over RDMA
+===========================
+
+This document describes how to set up the Linux SMB client and server to
+use RDMA.
+
+Overview
+========
+The Linux SMB kernel client supports SMB Direct, which is a transport
+scheme for SMB3 that uses RDMA (Remote Direct Memory Access) to provide
+high throughput and low latencies by bypassing the traditional TCP/IP
+stack.
+SMB Direct on the Linux SMB client can be tested against KSMBD - a
+kernel-space SMB server.
+
+Installation
+=============
+- Install an RDMA device. As long as the RDMA device driver is supported
+ by the kernel, it should work. This includes both software emulators (soft
+ RoCE, soft iWARP) and hardware devices (InfiniBand, RoCE, iWARP).
+
+- Install a kernel with SMB Direct support. The first kernel release to
+ support SMB Direct on both the client and server side is 5.15. Therefore,
+ a distribution compatible with kernel 5.15 or later is required.
+
+- Install cifs-utils, which provides the `mount.cifs` command to mount SMB
+ shares.
+
+- Configure the RDMA stack
+
+ Make sure that your kernel configuration has RDMA support enabled. Under
+ Device Drivers -> Infiniband support, update the kernel configuration to
+ enable Infiniband support.
+
+ Enable the appropriate IB HCA support or iWARP adapter support,
+ depending on your hardware.
+
+ If you are using InfiniBand, enable IP-over-InfiniBand support.
+
+  For soft RDMA, enable either the soft iWARP (`RDMA_SIW`) or soft RoCE
+  (`RDMA_RXE`) module. Install the `iproute2` package and use the
+  `rdma link add` command to load the module and create an
+  RDMA interface.
+
+  e.g. if your local ethernet interface is `eth0`, you can use:
+
+  .. code-block:: bash
+
+     sudo rdma link add siw0 type siw netdev eth0
+
+- Enable SMB Direct support for both the server and the client in the kernel
+ configuration.
+
+  Server Setup
+
+  .. code-block:: text
+
+     Network File Systems --->
+       <M> SMB3 server support
+       [*] Support for SMB Direct protocol
+
+  Client Setup
+
+  .. code-block:: text
+
+     Network File Systems --->
+       <M> SMB3 and CIFS support (advanced network filesystem)
+       [*] SMB Direct support
+
+- Build and install the kernel. SMB Direct support will be enabled in the
+ cifs.ko and ksmbd.ko modules.
+
+Setup and Usage
+================
+
+- Set up and start a KSMBD server as described in the `KSMBD documentation
+ <https://www.kernel.org/doc/Documentation/filesystems/smb/ksmbd.rst>`_.
+ Also add the "server multi channel support = yes" parameter to ksmbd.conf.
+
+- On the client, mount the share with the `rdma` mount option to use SMB
+  Direct (specify an SMB version of 3.0 or higher using `vers`).
+
+  For example:
+
+  .. code-block:: bash
+
+     mount -t cifs //server/share /mnt/point -o vers=3.1.1,rdma
+
+- To verify that the mount is using SMB Direct, you can check dmesg for the
+  following log line after mounting:
+
+  .. code-block:: text
+
+     CIFS: VFS: RDMA transport established
+
+  Or, verify the `rdma` mount option for the share in `/proc/mounts`:
+
+  .. code-block:: bash
+
+     cat /proc/mounts | grep cifs
diff --git a/Documentation/filesystems/spufs/spufs.rst b/Documentation/filesystems/spufs/spufs.rst
index 8a42859bb100..ca0441cbe37e 100644
--- a/Documentation/filesystems/spufs/spufs.rst
+++ b/Documentation/filesystems/spufs/spufs.rst
@@ -227,7 +227,7 @@ Files
from the data buffer, updating the value of the specified signal
notification register. The signal notification register will
either be replaced with the input data or will be updated to the
- bitwise OR or the old value and the input data, depending on the
+ bitwise OR of the old value and the input data, depending on the
contents of the signal1_type, or signal2_type respectively,
file.
diff --git a/Documentation/filesystems/squashfs.rst b/Documentation/filesystems/squashfs.rst
index df42106bae71..45653b3228f9 100644
--- a/Documentation/filesystems/squashfs.rst
+++ b/Documentation/filesystems/squashfs.rst
@@ -6,7 +6,7 @@ Squashfs 4.0 Filesystem
Squashfs is a compressed read-only filesystem for Linux.
-It uses zlib, lz4, lzo, or xz compression to compress files, inodes and
+It uses zlib, lz4, lzo, xz or zstd compression to compress files, inodes and
directories. Inodes in the system are very small and all blocks are packed to
minimise data overhead. Block sizes greater than 4K are supported up to a
maximum of 1Mbytes (default block size 128K).
@@ -16,8 +16,8 @@ use (i.e. in cases where a .tar.gz file may be used), and in constrained
block device/memory systems (e.g. embedded systems) where low overhead is
needed.
-Mailing list: squashfs-devel@lists.sourceforge.net
-Web site: www.squashfs.org
+Mailing list (kernel code): linux-fsdevel@vger.kernel.org
+Web site: github.com/plougher/squashfs-tools
1. Filesystem Features
----------------------
@@ -58,11 +58,69 @@ inodes have different sizes).
As squashfs is a read-only filesystem, the mksquashfs program must be used to
create populated squashfs filesystems. This and other squashfs utilities
-can be obtained from http://www.squashfs.org. Usage instructions can be
-obtained from this site also.
+are very likely packaged by your Linux distribution (called squashfs-tools).
+The source code can be obtained from github.com/plougher/squashfs-tools.
+Usage instructions can also be obtained from this site.
-The squashfs-tools development tree is now located on kernel.org
- git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git
+2.1 Mount options
+-----------------
+=================== =========================================================
+errors=%s           Specify whether squashfs errors trigger a kernel panic
+                    or not
+
+                    ========== =============================================
+                    continue   errors don't trigger a panic (default)
+                    panic      trigger a panic when errors are encountered,
+                               similar to several other filesystems (e.g.
+                               btrfs, ext4, f2fs, GFS2, jfs, ntfs, ubifs)
+
+                               This allows a kernel dump to be saved,
+                               useful for analyzing and debugging the
+                               corruption.
+                    ========== =============================================
+threads=%s          Select the decompression mode or the number of threads
+
+                    If SQUASHFS_CHOICE_DECOMP_BY_MOUNT is set:
+
+                    ========== =============================================
+                    single     use single-threaded decompression (default)
+
+                               Only one block (data or metadata) can be
+                               decompressed at any one time. This limits
+                               CPU and memory usage to a minimum, but it
+                               also gives poor performance on parallel I/O
+                               workloads when using multiple CPU machines
+                               due to waiting on decompressor availability.
+                    multi      use up to two parallel decompressors per core
+
+                               If you have a parallel I/O workload and your
+                               system has enough memory, using this option
+                               may improve overall I/O performance. It
+                               dynamically allocates decompressors on a
+                               demand basis.
+                    percpu     use a maximum of one decompressor per core
+
+                               It uses percpu variables to ensure
+                               decompression is load-balanced across the
+                               cores.
+                    1|2|3|...  configure the number of threads used for
+                               decompression
+
+                               The upper limit is num_online_cpus() * 2.
+                    ========== =============================================
+
+                    If SQUASHFS_CHOICE_DECOMP_BY_MOUNT is **not** set and
+                    SQUASHFS_DECOMP_MULTI, SQUASHFS_MOUNT_DECOMP_THREADS are
+                    both set:
+
+                    ========== =============================================
+                    2|3|...    configure the number of threads used for
+                               decompression
+
+                               The upper limit is num_online_cpus() * 2.
+                    ========== =============================================
+
+=================== =========================================================
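+
+For example, assuming a kernel built with SQUASHFS_CHOICE_DECOMP_BY_MOUNT
+and a hypothetical image file, a parallel decompression mode can be
+selected at mount time::
+
+	mount -t squashfs -o errors=panic,threads=multi image.sqsh /mnt
+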
3. Squashfs Filesystem Design
-----------------------------
diff --git a/Documentation/filesystems/sysfs-pci.rst b/Documentation/filesystems/sysfs-pci.rst
deleted file mode 100644
index a265f3e2cc80..000000000000
--- a/Documentation/filesystems/sysfs-pci.rst
+++ /dev/null
@@ -1,138 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-============================================
-Accessing PCI device resources through sysfs
-============================================
-
-sysfs, usually mounted at /sys, provides access to PCI resources on platforms
-that support it. For example, a given bus might look like this::
-
- /sys/devices/pci0000:17
- |-- 0000:17:00.0
- | |-- class
- | |-- config
- | |-- device
- | |-- enable
- | |-- irq
- | |-- local_cpus
- | |-- remove
- | |-- resource
- | |-- resource0
- | |-- resource1
- | |-- resource2
- | |-- revision
- | |-- rom
- | |-- subsystem_device
- | |-- subsystem_vendor
- | `-- vendor
- `-- ...
-
-The topmost element describes the PCI domain and bus number. In this case,
-the domain number is 0000 and the bus number is 17 (both values are in hex).
-This bus contains a single function device in slot 0. The domain and bus
-numbers are reproduced for convenience. Under the device directory are several
-files, each with their own function.
-
- =================== =====================================================
- file function
- =================== =====================================================
- class PCI class (ascii, ro)
- config PCI config space (binary, rw)
- device PCI device (ascii, ro)
- enable Whether the device is enabled (ascii, rw)
- irq IRQ number (ascii, ro)
- local_cpus nearby CPU mask (cpumask, ro)
- remove remove device from kernel's list (ascii, wo)
- resource PCI resource host addresses (ascii, ro)
- resource0..N PCI resource N, if present (binary, mmap, rw\ [1]_)
- resource0_wc..N_wc PCI WC map resource N, if prefetchable (binary, mmap)
- revision PCI revision (ascii, ro)
- rom PCI ROM resource, if present (binary, ro)
- subsystem_device PCI subsystem device (ascii, ro)
- subsystem_vendor PCI subsystem vendor (ascii, ro)
- vendor PCI vendor (ascii, ro)
- =================== =====================================================
-
-::
-
- ro - read only file
- rw - file is readable and writable
- wo - write only file
- mmap - file is mmapable
- ascii - file contains ascii text
- binary - file contains binary data
- cpumask - file contains a cpumask type
-
-.. [1] rw for RESOURCE_IO (I/O port) regions only
-
-The read only files are informational, writes to them will be ignored, with
-the exception of the 'rom' file. Writable files can be used to perform
-actions on the device (e.g. changing config space, detaching a device).
-mmapable files are available via an mmap of the file at offset 0 and can be
-used to do actual device programming from userspace. Note that some platforms
-don't support mmapping of certain resources, so be sure to check the return
-value from any attempted mmap. The most notable of these are I/O port
-resources, which also provide read/write access.
-
-The 'enable' file provides a counter that indicates how many times the device
-has been enabled. If the 'enable' file currently returns '4', and a '1' is
-echoed into it, it will then return '5'. Echoing a '0' into it will decrease
-the count. Even when it returns to 0, though, some of the initialisation
-may not be reversed.
-
-The 'rom' file is special in that it provides read-only access to the device's
-ROM file, if available. It's disabled by default, however, so applications
-should write the string "1" to the file to enable it before attempting a read
-call, and disable it following the access by writing "0" to the file. Note
-that the device must be enabled for a rom read to return data successfully.
-In the event a driver is not bound to the device, it can be enabled using the
-'enable' file, documented above.
-
-The 'remove' file is used to remove the PCI device, by writing a non-zero
-integer to the file. This does not involve any kind of hot-plug functionality,
-e.g. powering off the device. The device is removed from the kernel's list of
-PCI devices, the sysfs directory for it is removed, and the device will be
-removed from any drivers attached to it. Removal of PCI root buses is
-disallowed.
-
-Accessing legacy resources through sysfs
-----------------------------------------
-
-Legacy I/O port and ISA memory resources are also provided in sysfs if the
-underlying platform supports them. They're located in the PCI class hierarchy,
-e.g.::
-
- /sys/class/pci_bus/0000:17/
- |-- bridge -> ../../../devices/pci0000:17
- |-- cpuaffinity
- |-- legacy_io
- `-- legacy_mem
-
-The legacy_io file is a read/write file that can be used by applications to
-do legacy port I/O. The application should open the file, seek to the desired
-port (e.g. 0x3e8) and do a read or a write of 1, 2 or 4 bytes. The legacy_mem
-file should be mmapped with an offset corresponding to the memory offset
-desired, e.g. 0xa0000 for the VGA frame buffer. The application can then
-simply dereference the returned pointer (after checking for errors of course)
-to access legacy memory space.
-
-Supporting PCI access on new platforms
---------------------------------------
-
-In order to support PCI resource mapping as described above, Linux platform
-code should ideally define ARCH_GENERIC_PCI_MMAP_RESOURCE and use the generic
-implementation of that functionality. To support the historical interface of
-mmap() through files in /proc/bus/pci, platforms may also set HAVE_PCI_MMAP.
-
-Alternatively, platforms which set HAVE_PCI_MMAP may provide their own
-implementation of pci_mmap_page_range() instead of defining
-ARCH_GENERIC_PCI_MMAP_RESOURCE.
-
-Platforms which support write-combining maps of PCI resources must define
-arch_can_pci_mmap_wc() which shall evaluate to non-zero at runtime when
-write-combining is permitted. Platforms which support maps of I/O resources
-define arch_can_pci_mmap_io() similarly.
-
-Legacy resources are protected by the HAVE_PCI_LEGACY define. Platforms
-wishing to support legacy functionality should define it and provide
-pci_legacy_read, pci_legacy_write and pci_mmap_legacy_page_range functions.
diff --git a/Documentation/filesystems/sysfs-tagging.rst b/Documentation/filesystems/sysfs-tagging.rst
deleted file mode 100644
index 8888a05c398e..000000000000
--- a/Documentation/filesystems/sysfs-tagging.rst
+++ /dev/null
@@ -1,48 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=============
-Sysfs tagging
-=============
-
-(Taken almost verbatim from Eric Biederman's netns tagging patch
-commit msg)
-
-The problem. Network devices show up in sysfs and with the network
-namespace active multiple devices with the same name can show up in
-the same directory, ouch!
-
-To avoid that problem and allow existing applications in network
-namespaces to see the same interface that is currently presented in
-sysfs, sysfs now has tagging directory support.
-
-By using the network namespace pointers as tags to separate out the
-the sysfs directory entries we ensure that we don't have conflicts
-in the directories and applications only see a limited set of
-the network devices.
-
-Each sysfs directory entry may be tagged with a namespace via the
-``void *ns member`` of its ``kernfs_node``. If a directory entry is tagged,
-then ``kernfs_node->flags`` will have a flag between KOBJ_NS_TYPE_NONE
-and KOBJ_NS_TYPES, and ns will point to the namespace to which it
-belongs.
-
-Each sysfs superblock's kernfs_super_info contains an array
-``void *ns[KOBJ_NS_TYPES]``. When a task in a tagging namespace
-kobj_nstype first mounts sysfs, a new superblock is created. It
-will be differentiated from other sysfs mounts by having its
-``s_fs_info->ns[kobj_nstype]`` set to the new namespace. Note that
-through bind mounting and mounts propagation, a task can easily view
-the contents of other namespaces' sysfs mounts. Therefore, when a
-namespace exits, it will call kobj_ns_exit() to invalidate any
-kernfs_node->ns pointers pointing to it.
-
-Users of this interface:
-
-- define a type in the ``kobj_ns_type`` enumeration.
-- call kobj_ns_type_register() with its ``kobj_ns_type_operations`` which has
-
- - current_ns() which returns current's namespace
- - netlink_ns() which returns a socket's namespace
- - initial_ns() which returns the initial namesapce
-
-- call kobj_ns_exit() when an individual tag is no longer valid
diff --git a/Documentation/filesystems/sysfs.rst b/Documentation/filesystems/sysfs.rst
index ab0f7795792b..c32993bc83c7 100644
--- a/Documentation/filesystems/sysfs.rst
+++ b/Documentation/filesystems/sysfs.rst
@@ -12,10 +12,10 @@ Mike Murphy <mamurph@cs.clemson.edu>
:Original: 10 January 2003
-What it is:
-~~~~~~~~~~~
+What it is
+~~~~~~~~~~
-sysfs is a ram-based filesystem initially based on ramfs. It provides
+sysfs is a RAM-based filesystem initially based on ramfs. It provides
a means to export kernel data structures, their attributes, and the
linkages between them to userspace.
@@ -43,7 +43,7 @@ userspace. Top-level directories in sysfs represent the common
ancestors of object hierarchies; i.e. the subsystems the objects
belong to.
-Sysfs internally stores a pointer to the kobject that implements a
+sysfs internally stores a pointer to the kobject that implements a
directory in the kernfs_node object associated with the directory. In
the past this kobject pointer has been used by sysfs to do reference
counting directly on the kobject whenever the file is opened or closed.
@@ -55,7 +55,7 @@ Attributes
~~~~~~~~~~
Attributes can be exported for kobjects in the form of regular files in
-the filesystem. Sysfs forwards file I/O operations to methods defined
+the filesystem. sysfs forwards file I/O operations to methods defined
for the attributes, providing a means to read and write kernel
attributes.
@@ -72,8 +72,8 @@ you publicly humiliated and your code rewritten without notice.
An attribute definition is simply::
struct attribute {
- char * name;
- struct module *owner;
+ char *name;
+ struct module *owner;
umode_t mode;
};
@@ -138,7 +138,7 @@ __ATTR_WO(name):
assumes a name_store only and is restricted to mode
0200 that is root write access only.
__ATTR_RO_MODE(name, mode):
- fore more restrictive RO access currently
+ for more restrictive RO access; currently
only use case is the EFI System Resource Table
(see drivers/firmware/efi/esrt.c)
__ATTR_RW(name):
@@ -172,14 +172,13 @@ calls the associated methods.
To illustrate::
- #define to_dev(obj) container_of(obj, struct device, kobj)
#define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr)
static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr,
char *buf)
{
struct device_attribute *dev_attr = to_dev_attr(attr);
- struct device *dev = to_dev(kobj);
+ struct device *dev = kobj_to_dev(kobj);
ssize_t ret = -EIO;
if (dev_attr->show)
@@ -208,7 +207,7 @@ IOW, they should take only an object, an attribute, and a buffer as parameters.
sysfs allocates a buffer of size (PAGE_SIZE) and passes it to the
-method. Sysfs will call the method exactly once for each read or
+method. sysfs will call the method exactly once for each read or
write. This forces the following behavior on the method
implementations:
@@ -222,7 +221,7 @@ implementations:
be called again, rearmed, to fill the buffer.
- On write(2), sysfs expects the entire buffer to be passed during the
- first write. Sysfs then passes the entire buffer to the store() method.
+ first write. sysfs then passes the entire buffer to the store() method.
A terminating null is added after the data on stores. This makes
functions like sysfs_streq() safe to use.
@@ -238,16 +237,14 @@ Other notes:
- Writing causes the show() method to be rearmed regardless of current
file position.
-- The buffer will always be PAGE_SIZE bytes in length. On i386, this
+- The buffer will always be PAGE_SIZE bytes in length. On x86, this
is 4096.
- show() methods should return the number of bytes printed into the
- buffer. This is the return value of scnprintf().
+ buffer.
-- show() must not use snprintf() when formatting the value to be
- returned to user space. If you can guarantee that an overflow
- will never happen you can use sprintf() otherwise you must use
- scnprintf().
+- show() should only use sysfs_emit() or sysfs_emit_at() when formatting
+ the value to be returned to user space.
- store() should return the number of bytes used from the buffer. If the
entire buffer has been used, just return the count argument.
@@ -256,7 +253,7 @@ Other notes:
through, be sure to return an error.
- The object passed to the methods will be pinned in memory via sysfs
- referencing counting its embedded object. However, the physical
+ reference counting its embedded object. However, the physical
entity (e.g. device) the object represents may not be present. Be
sure to have a way to check this, if necessary.
@@ -266,7 +263,7 @@ A very simple (and naive) implementation of a device attribute is::
static ssize_t show_name(struct device *dev, struct device_attribute *attr,
char *buf)
{
- return scnprintf(buf, PAGE_SIZE, "%s\n", dev->name);
+ return sysfs_emit(buf, "%s\n", dev->name);
}
static ssize_t store_name(struct device *dev, struct device_attribute *attr,
@@ -298,8 +295,12 @@ The top level sysfs directory looks like::
dev/
devices/
firmware/
- net/
fs/
+ hypervisor/
+ kernel/
+ module/
+ net/
+ power/
devices/ contains a filesystem representation of the device tree. It maps
directly to the internal kernel device tree, which is a hierarchy of
@@ -320,15 +321,18 @@ span multiple bus types).
fs/ contains a directory for some filesystems. Currently each
filesystem wanting to export attributes must create its own hierarchy
-below fs/ (see ./fuse.txt for an example).
+below fs/ (see ./fuse.rst for an example).
+
+module/ contains parameter values and state information for all loaded
+kernel modules, both builtin and loadable.
-dev/ contains two directories char/ and block/. Inside these two
+dev/ contains two directories: char/ and block/. Inside these two
directories there are symlinks named <major>:<minor>. These symlinks
point to the sysfs directory for the given device. /sys/dev provides a
quick way to lookup the sysfs interface for a device from the result of
a stat(2) operation.
-More information can driver-model specific features can be found in
+More information on driver-model specific features can be found in
Documentation/driver-api/driver-model/.
@@ -338,7 +342,7 @@ TODO: Finish this section.
Current Interfaces
~~~~~~~~~~~~~~~~~~
-The following interface layers currently exist in sysfs:
+The following interface layers currently exist in sysfs.
devices (include/linux/device.h)
@@ -369,8 +373,8 @@ Structure::
struct bus_attribute {
struct attribute attr;
- ssize_t (*show)(struct bus_type *, char * buf);
- ssize_t (*store)(struct bus_type *, const char * buf, size_t count);
+ ssize_t (*show)(const struct bus_type *, char * buf);
+ ssize_t (*store)(const struct bus_type *, const char * buf, size_t count);
};
Declaring::
diff --git a/Documentation/filesystems/sysv-fs.rst b/Documentation/filesystems/sysv-fs.rst
deleted file mode 100644
index 89e40911ad7c..000000000000
--- a/Documentation/filesystems/sysv-fs.rst
+++ /dev/null
@@ -1,264 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-==================
-SystemV Filesystem
-==================
-
-It implements all of
- - Xenix FS,
- - SystemV/386 FS,
- - Coherent FS.
-
-To install:
-
-* Answer the 'System V and Coherent filesystem support' question with 'y'
- when configuring the kernel.
-* To mount a disk or a partition, use::
-
- mount [-r] -t sysv device mountpoint
-
- The file system type names::
-
- -t sysv
- -t xenix
- -t coherent
-
- may be used interchangeably, but the last two will eventually disappear.
-
-Bugs in the present implementation:
-
-- Coherent FS:
-
- - The "free list interleave" n:m is currently ignored.
- - Only file systems with no filesystem name and no pack name are recognized.
- (See Coherent "man mkfs" for a description of these features.)
-
-- SystemV Release 2 FS:
-
- The superblock is only searched in the blocks 9, 15, 18, which
- corresponds to the beginning of track 1 on floppy disks. No support
- for this FS on hard disk yet.
-
-
-These filesystems are rather similar. Here is a comparison with Minix FS:
-
-* Linux fdisk reports on partitions
-
- - Minix FS 0x81 Linux/Minix
- - Xenix FS ??
- - SystemV FS ??
- - Coherent FS 0x08 AIX bootable
-
-* Size of a block or zone (data allocation unit on disk)
-
- - Minix FS 1024
- - Xenix FS 1024 (also 512 ??)
- - SystemV FS 1024 (also 512 and 2048)
- - Coherent FS 512
-
-* General layout: all have one boot block, one super block and
- separate areas for inodes and for directories/data.
- On SystemV Release 2 FS (e.g. Microport) the first track is reserved and
- all the block numbers (including the super block) are offset by one track.
-
-* Byte ordering of "short" (16 bit entities) on disk:
-
- - Minix FS little endian 0 1
- - Xenix FS little endian 0 1
- - SystemV FS little endian 0 1
- - Coherent FS little endian 0 1
-
- Of course, this affects only the file system, not the data of files on it!
-
-* Byte ordering of "long" (32 bit entities) on disk:
-
- - Minix FS little endian 0 1 2 3
- - Xenix FS little endian 0 1 2 3
- - SystemV FS little endian 0 1 2 3
- - Coherent FS PDP-11 2 3 0 1
-
- Of course, this affects only the file system, not the data of files on it!
-
-* Inode on disk: "short", 0 means non-existent, the root dir ino is:
-
- ================================= ==
- Minix FS 1
- Xenix FS, SystemV FS, Coherent FS 2
- ================================= ==
-
-* Maximum number of hard links to a file:
-
- =========== =========
- Minix FS 250
- Xenix FS ??
- SystemV FS ??
- Coherent FS >=10000
- =========== =========
-
-* Free inode management:
-
- - Minix FS
- a bitmap
- - Xenix FS, SystemV FS, Coherent FS
- There is a cache of a certain number of free inodes in the super-block.
- When it is exhausted, new free inodes are found using a linear search.
-
-* Free block management:
-
- - Minix FS
- a bitmap
- - Xenix FS, SystemV FS, Coherent FS
- Free blocks are organized in a "free list". Maybe a misleading term,
- since it is not true that every free block contains a pointer to
- the next free block. Rather, the free blocks are organized in chunks
- of limited size, and every now and then a free block contains pointers
- to the free blocks pertaining to the next chunk; the first of these
- contains pointers and so on. The list terminates with a "block number"
- 0 on Xenix FS and SystemV FS, with a block zeroed out on Coherent FS.
-
-* Super-block location:
-
- =========== ==========================
- Minix FS block 1 = bytes 1024..2047
- Xenix FS block 1 = bytes 1024..2047
- SystemV FS bytes 512..1023
- Coherent FS block 1 = bytes 512..1023
- =========== ==========================
-
-* Super-block layout:
-
- - Minix FS::
-
- unsigned short s_ninodes;
- unsigned short s_nzones;
- unsigned short s_imap_blocks;
- unsigned short s_zmap_blocks;
- unsigned short s_firstdatazone;
- unsigned short s_log_zone_size;
- unsigned long s_max_size;
- unsigned short s_magic;
-
- - Xenix FS, SystemV FS, Coherent FS::
-
- unsigned short s_firstdatazone;
- unsigned long s_nzones;
- unsigned short s_fzone_count;
- unsigned long s_fzones[NICFREE];
- unsigned short s_finode_count;
- unsigned short s_finodes[NICINOD];
- char s_flock;
- char s_ilock;
- char s_modified;
- char s_rdonly;
- unsigned long s_time;
- short s_dinfo[4]; -- SystemV FS only
- unsigned long s_free_zones;
- unsigned short s_free_inodes;
- short s_dinfo[4]; -- Xenix FS only
- unsigned short s_interleave_m,s_interleave_n; -- Coherent FS only
- char s_fname[6];
- char s_fpack[6];
-
- then they differ considerably:
-
- Xenix FS::
-
- char s_clean;
- char s_fill[371];
- long s_magic;
- long s_type;
-
- SystemV FS::
-
- long s_fill[12 or 14];
- long s_state;
- long s_magic;
- long s_type;
-
- Coherent FS::
-
- unsigned long s_unique;
-
- Note that Coherent FS has no magic.
-
-* Inode layout:
-
- - Minix FS::
-
- unsigned short i_mode;
- unsigned short i_uid;
- unsigned long i_size;
- unsigned long i_time;
- unsigned char i_gid;
- unsigned char i_nlinks;
- unsigned short i_zone[7+1+1];
-
- - Xenix FS, SystemV FS, Coherent FS::
-
- unsigned short i_mode;
- unsigned short i_nlink;
- unsigned short i_uid;
- unsigned short i_gid;
- unsigned long i_size;
- unsigned char i_zone[3*(10+1+1+1)];
- unsigned long i_atime;
- unsigned long i_mtime;
- unsigned long i_ctime;
-
-
-* Regular file data blocks are organized as
-
- - Minix FS:
-
- - 7 direct blocks
- - 1 indirect block (pointers to blocks)
- - 1 double-indirect block (pointer to pointers to blocks)
-
- - Xenix FS, SystemV FS, Coherent FS:
-
- - 10 direct blocks
- - 1 indirect block (pointers to blocks)
- - 1 double-indirect block (pointer to pointers to blocks)
- - 1 triple-indirect block (pointer to pointers to pointers to blocks)
-
-
- =========== ========== ================
- Inode size inodes per block
- =========== ========== ================
- Minix FS 32 32
- Xenix FS 64 16
- SystemV FS 64 16
- Coherent FS 64 8
- =========== ========== ================
-
-* Directory entry on disk
-
- - Minix FS::
-
- unsigned short inode;
- char name[14/30];
-
- - Xenix FS, SystemV FS, Coherent FS::
-
- unsigned short inode;
- char name[14];
-
- =========== ============== =====================
- Dir entry size dir entries per block
- =========== ============== =====================
- Minix FS 16/32 64/32
- Xenix FS 16 64
- SystemV FS 16 64
- Coherent FS 16 32
- =========== ============== =====================
-
-* How to implement symbolic links such that the host fsck doesn't scream:
-
- - Minix FS normal
- - Xenix FS kludge: as regular files with chmod 1000
- - SystemV FS ??
- - Coherent FS kludge: as regular files with chmod 1000
-
-
-Notation: We often speak of a "block" but mean a zone (the allocation unit)
-and not the disk driver's notion of "block".
diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
index 4e95929301a5..d677e0428c3f 100644
--- a/Documentation/filesystems/tmpfs.rst
+++ b/Documentation/filesystems/tmpfs.rst
@@ -4,7 +4,7 @@
Tmpfs
=====
-Tmpfs is a file system which keeps all files in virtual memory.
+Tmpfs is a file system which keeps all of its files in virtual memory.
Everything in tmpfs is temporary in the sense that no files will be
@@ -13,17 +13,29 @@ everything stored therein is lost.
tmpfs puts everything into the kernel internal caches and grows and
shrinks to accommodate the files it contains and is able to swap
-unneeded pages out to swap space. It has maximum size limits which can
-be adjusted on the fly via 'mount -o remount ...'
-
-If you compare it to ramfs (which was the template to create tmpfs)
-you gain swapping and limit checking. Another similar thing is the RAM
-disk (/dev/ram*), which simulates a fixed size hard disk in physical
-RAM, where you have to create an ordinary filesystem on top. Ramdisks
-cannot swap and you do not have the possibility to resize them.
-
-Since tmpfs lives completely in the page cache and on swap, all tmpfs
-pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
+unneeded pages out to swap space, if swap is enabled for the tmpfs
+mount. tmpfs also supports Transparent Huge Pages (THP).
+
+tmpfs extends ramfs with a few userspace configurable options listed and
+explained further below, some of which can be reconfigured dynamically on the
+fly using a remount ('mount -o remount ...') of the filesystem. A tmpfs
+filesystem can be resized but it cannot be resized to a size below its current
+usage. tmpfs also supports POSIX ACLs, and extended attributes for the
+trusted.*, security.* and user.* namespaces. ramfs does not use swap and you
+cannot modify any parameter for a ramfs filesystem. The size limit of a ramfs
+filesystem is how much memory you have available, so care must be taken not
+to run out of memory when using it.
+
+An alternative to tmpfs and ramfs is to use brd to create RAM disks
+(/dev/ram*), which allows you to simulate a block device disk in physical RAM.
+To write data you would then just need to create a regular filesystem on top
+of this ramdisk. As with ramfs, brd ramdisks cannot swap. brd ramdisks are
+also configured in size at initialization and you cannot dynamically resize
+them. Unlike brd ramdisks, tmpfs has its own filesystem; it does not rely on
+the block layer at all.
+
+Since tmpfs lives completely in the page cache and optionally on swap,
+all tmpfs pages will be shown as "Shmem" in /proc/meminfo and "Shared" in
free(1). Notice that these counters also include shared memory
(shmem, see ipcs(1)). The most reliable way to get the count is
using df(1) and du(1).
@@ -35,7 +47,7 @@ tmpfs has the following uses:
memory.
This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not
- set, the user visible part of tmpfs is not build. But the internal
+ set, the user visible part of tmpfs is not built. But the internal
mechanisms are always present.
2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
@@ -50,7 +62,7 @@ tmpfs has the following uses:
This mount is _not_ needed for SYSV shared memory. The internal
mount is used for that. (In the 2.3 kernel versions it was
necessary to mount the predecessor of tmpfs (shm fs) to use SYSV
- shared memory)
+ shared memory.)
3) Some people (including me) find it very convenient to mount it
e.g. on /tmp and /var/tmp and have a big swap partition. And now
@@ -83,8 +95,67 @@ If nr_blocks=0 (or size=0), blocks will not be limited in that instance;
if nr_inodes=0, inodes will not be limited. It is generally unwise to
mount with such options, since it allows any user with write access to
use up all the memory on the machine; but enhances the scalability of
-that instance in a system with many cpus making intensive use of it.
-
+that instance in a system with many CPUs making intensive use of it.
+
+If nr_inodes is not 0, that limited space for inodes is also used up by
+extended attributes: "df -i"'s IUsed and IUse% increase, IFree decreases.
+
+tmpfs blocks may be swapped out when there is a shortage of memory.
+tmpfs has a mount option to disable its use of swap:
+
+====== ===========================================================
+noswap Disables swap. Remounts must respect the original settings.
+       By default swap is enabled.
+====== ===========================================================
+
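+For example, to mount a tmpfs instance that will never use swap::
+
+	mount -t tmpfs -o noswap tmpfs /mytmpfs
+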
+tmpfs also supports Transparent Huge Pages, which requires a kernel
+configured with CONFIG_TRANSPARENT_HUGEPAGE and with huge pages supported
+on your system (has_transparent_hugepage(), which is architecture specific).
+The mount options for this are:
+
+================ ==============================================================
+huge=never       Do not allocate huge pages. This is the default.
+huge=always      Attempt to allocate huge page every time a new page is needed.
+huge=within_size Only allocate huge page if it will be fully within i_size.
+                 Also respect madvise(2) hints.
+huge=advise      Only allocate huge page if requested with madvise(2).
+================ ==============================================================
+
+See also Documentation/admin-guide/mm/transhuge.rst, which describes the
+sysfs file /sys/kernel/mm/transparent_hugepage/shmem_enabled, which can
+be used to deny huge pages on all tmpfs mounts in an emergency, or to
+force huge pages on all tmpfs mounts for testing.
+
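+For example, to allocate huge pages only when they fit within the file
+size, on a mount capped at 10 GiB::
+
+	mount -t tmpfs -o size=10G,huge=within_size tmpfs /mytmpfs
+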
+tmpfs also supports quota with the following mount options:
+
+======================== =================================================
+quota                    User and group quota accounting and enforcement
+                         is enabled on the mount. tmpfs uses hidden
+                         system quota files that are initialized on mount.
+usrquota                 User quota accounting and enforcement is enabled
+                         on the mount.
+grpquota                 Group quota accounting and enforcement is enabled
+                         on the mount.
+usrquota_block_hardlimit Set global user quota block hard limit.
+usrquota_inode_hardlimit Set global user quota inode hard limit.
+grpquota_block_hardlimit Set global group quota block hard limit.
+grpquota_inode_hardlimit Set global group quota inode hard limit.
+======================== =================================================
+
+None of the quota related mount options can be set or changed on remount.
+
+Quota limit parameters accept a suffix k, m or g for kilo, mega and giga
+and can't be changed on remount. Default global quota limits take effect
+for any and all users/groups/projects except root the first time the
+quota entry for a user/group/project id is accessed - typically the
+first time an inode with a particular id ownership is created after
+the mount. In other words, instead of the limits being initialized to zero,
+they are initialized with the particular value provided with these mount
+options. The limits can be changed for any user/group id at any time, as
+they normally can be.
+
+Note that tmpfs quotas do not support user namespaces so no uid/gid
+translation is done if quotas are enabled inside user namespaces.
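+
+For example, to enable user and group quotas with a global per-user block
+hard limit of 1 GiB::
+
+	mount -t tmpfs -o usrquota,grpquota,usrquota_block_hardlimit=1g tmpfs /mytmpfs
+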
tmpfs has a mount option to set the NUMA memory allocation policy for
all files in that instance (if CONFIG_NUMA is enabled) - which can be
@@ -150,10 +221,48 @@ These options do not have any effect on remount. You can change these
parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem.
+tmpfs has a mount option to select whether it will wrap at 32- or 64-bit inode
+numbers:
+
+======= ========================
+inode64 Use 64-bit inode numbers
+inode32 Use 32-bit inode numbers
+======= ========================
+
+On a 32-bit kernel, inode32 is implicit, and inode64 is refused at mount time.
+On a 64-bit kernel, CONFIG_TMPFS_INODE64 sets the default. inode64 avoids the
+possibility of multiple files with the same inode number on a single device;
+but risks glibc failing with EOVERFLOW once 33-bit inode numbers are reached -
+if a long-lived tmpfs is accessed by 32-bit applications so ancient that
+opening a file larger than 2GiB fails with EINVAL.
+
+
So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
will give you tmpfs instance on /mytmpfs which can allocate 10GB
RAM/SWAP in 10240 inodes and it is only accessible by root.
+tmpfs has the following mounting options for case-insensitive lookup support:
+
+================= ==============================================================
+casefold          Enable casefold support at this mount point using the given
+                  argument as the encoding standard. Currently only UTF-8
+                  encodings are supported. If no argument is used, it will load
+                  the latest UTF-8 encoding available.
+strict_encoding   Enable strict encoding at this mount point (disabled by
+                  default). In this mode, the filesystem refuses to create files
+                  and directories with names containing invalid UTF-8 characters.
+================= ==============================================================
+
+This option doesn't render the entire filesystem case-insensitive. One still
+needs to set the casefold flag per directory, by flipping the +F attribute in
+an empty directory. Nevertheless, new directories will inherit the attribute.
+The mountpoint itself cannot be made case-insensitive.
+
+Example::
+
+ $ mount -t tmpfs -o casefold=utf8-12.1.0,strict_encoding fs_name /mytmpfs
+ $ mount -t tmpfs -o casefold fs_name /mytmpfs
+
:Author:
Christoph Rohland <cr@sap.com>, 1.12.01
@@ -161,3 +270,7 @@ RAM/SWAP in 10240 inodes and it is only accessible by root.
Hugh Dickins, 4 June 2007
:Updated:
KOSAKI Motohiro, 16 Mar 2010
+:Updated:
+ Chris Down, 13 July 2020
+:Updated:
+ André Almeida, 23 Aug 2024
diff --git a/Documentation/filesystems/ubifs-authentication.rst b/Documentation/filesystems/ubifs-authentication.rst
index 16efd729bf7c..3d85ee88719a 100644
--- a/Documentation/filesystems/ubifs-authentication.rst
+++ b/Documentation/filesystems/ubifs-authentication.rst
@@ -1,11 +1,13 @@
.. SPDX-License-Identifier: GPL-2.0
-:orphan:
-
.. UBIFS Authentication
.. sigma star gmbh
.. 2018
+============================
+UBIFS Authentication Support
+============================
+
Introduction
============
@@ -128,7 +130,7 @@ marked as dirty are written to the flash to update the persisted index.
Journal
~~~~~~~
-To avoid wearing out the flash, the index is only persisted (*commited*) when
+To avoid wearing out the flash, the index is only persisted (*committed*) when
certain conditions are met (eg. ``fsync(2)``). The journal is used to record
any changes (in form of inode nodes, data nodes etc.) between commits
of the index. During mount, the journal is read from the flash and replayed
@@ -433,9 +435,9 @@ will then have to be provided beforehand in the normal way.
References
==========
-[CRYPTSETUP2] http://www.saout.de/pipermail/dm-crypt/2017-November/005745.html
+[CRYPTSETUP2] https://www.saout.de/pipermail/dm-crypt/2017-November/005745.html
-[DMC-CBC-ATTACK] http://www.jakoblell.com/blog/2013/12/22/practical-malleability-attack-against-cbc-encrypted-luks-partitions/
+[DMC-CBC-ATTACK] https://www.jakoblell.com/blog/2013/12/22/practical-malleability-attack-against-cbc-encrypted-luks-partitions/
[DM-INTEGRITY] https://www.kernel.org/doc/Documentation/device-mapper/dm-integrity.rst
diff --git a/Documentation/filesystems/ubifs.rst b/Documentation/filesystems/ubifs.rst
index e6ee99762534..ced2f7679ddb 100644
--- a/Documentation/filesystems/ubifs.rst
+++ b/Documentation/filesystems/ubifs.rst
@@ -59,7 +59,7 @@ differences.
* JFFS2 is a write-through file-system, while UBIFS supports write-back,
which makes UBIFS much faster on writes.
-Similarly to JFFS2, UBIFS supports on-the-flight compression which makes
+Similarly to JFFS2, UBIFS supports on-the-fly compression which makes
it possible to fit quite a lot of data to the flash.
Similarly to JFFS2, UBIFS is tolerant of unclean reboots and power-cuts.
diff --git a/Documentation/filesystems/udf.rst b/Documentation/filesystems/udf.rst
index d9badbf285b2..f9489ddbb767 100644
--- a/Documentation/filesystems/udf.rst
+++ b/Documentation/filesystems/udf.rst
@@ -72,4 +72,4 @@ For the latest version and toolset see:
Documentation on UDF and ECMA 167 is available FREE from:
- http://www.osta.org/
- - http://www.ecma-international.org/
+ - https://www.ecma-international.org/
diff --git a/Documentation/filesystems/vfat.rst b/Documentation/filesystems/vfat.rst
index e85d74e91295..b289c4449cd0 100644
--- a/Documentation/filesystems/vfat.rst
+++ b/Documentation/filesystems/vfat.rst
@@ -50,7 +50,7 @@ VFAT MOUNT OPTIONS
Normally utime(2) checks current process is owner of
the file, or it has CAP_FOWNER capability. But FAT
filesystem doesn't have uid/gid on disk, so normal
- check is too unflexible. With this option you can
+ check is too inflexible. With this option you can
relax it.
**codepage=###**
@@ -189,7 +189,7 @@ VFAT MOUNT OPTIONS
**discard**
If set, issues discard/TRIM commands to the block
device when blocks are freed. This is useful for SSD devices
- and sparse/thinly-provisoned LUNs.
+ and sparse/thinly-provisioned LUNs.
**nfs=stale_rw|nostale_ro**
Enable this only if you want to export the FAT filesystem
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index ed17771c212b..fd32a9a17bfb 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -107,22 +107,32 @@ file /proc/filesystems.
struct file_system_type
-----------------------
-This describes the filesystem. As of kernel 2.6.39, the following
+This describes the filesystem. The following
members are defined:
.. code-block:: c
- struct file_system_operations {
+ struct file_system_type {
const char *name;
int fs_flags;
+ int (*init_fs_context)(struct fs_context *);
+ const struct fs_parameter_spec *parameters;
struct dentry *(*mount) (struct file_system_type *, int,
- const char *, void *);
+ const char *, void *);
void (*kill_sb) (struct super_block *);
struct module *owner;
struct file_system_type * next;
- struct list_head fs_supers;
+ struct hlist_head fs_supers;
+
struct lock_class_key s_lock_key;
struct lock_class_key s_umount_key;
+ struct lock_class_key s_vfs_rename_key;
+ struct lock_class_key s_writers_key[SB_FREEZE_LEVELS];
+
+ struct lock_class_key i_lock_key;
+ struct lock_class_key i_mutex_key;
+ struct lock_class_key invalidate_lock_key;
+ struct lock_class_key i_mutex_dir_key;
};
``name``
@@ -132,6 +142,15 @@ members are defined:
``fs_flags``
	various flags (e.g. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.)
+``init_fs_context``
+ Initializes 'struct fs_context' ->ops and ->fs_private fields with
+ filesystem-specific data.
+
+``parameters``
+ Pointer to the array of filesystem parameter descriptors,
+ 'struct fs_parameter_spec'.
+ More info in Documentation/filesystems/mount_api.rst.
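+
+ A minimal sketch of these two members working together. The
+ ``examplefs`` names, the option enum and the fs_context
+ operations table are illustrative assumptions, not part of this
+ document:
+
+ .. code-block:: c
+
+     enum { Opt_mode };
+
+     static const struct fs_parameter_spec examplefs_parameters[] = {
+             fsparam_u32oct("mode", Opt_mode),
+             {}
+     };
+
+     static int examplefs_init_fs_context(struct fs_context *fc)
+     {
+             /* examplefs_context_ops is a hypothetical
+              * fs_context_operations table providing .parse_param
+              * and .get_tree. */
+             fc->ops = &examplefs_context_ops;
+             return 0;
+     }
+
+     static struct file_system_type examplefs_fs_type = {
+             .owner           = THIS_MODULE,
+             .name            = "examplefs",
+             .init_fs_context = examplefs_init_fs_context,
+             .parameters      = examplefs_parameters,
+             .kill_sb         = kill_litter_super,
+     };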
+
``mount``
the method to call when a new instance of this filesystem should
be mounted
@@ -148,7 +167,11 @@ members are defined:
``next``
for internal VFS use: you should initialize this to NULL
- s_lock_key, s_umount_key: lockdep-specific
+``fs_supers``
+ for internal VFS use: hlist of filesystem instances (superblocks)
+
+ s_lock_key, s_umount_key, s_vfs_rename_key, s_writers_key,
+ i_lock_key, i_mutex_key, invalidate_lock_key, i_mutex_dir_key: lockdep-specific
The mount() method has the following arguments:
@@ -222,33 +245,44 @@ struct super_operations
-----------------------
This describes how the VFS can manipulate the superblock of your
-filesystem. As of kernel 2.6.22, the following members are defined:
+filesystem. The following members are defined:
.. code-block:: c
struct super_operations {
struct inode *(*alloc_inode)(struct super_block *sb);
void (*destroy_inode)(struct inode *);
+ void (*free_inode)(struct inode *);
void (*dirty_inode) (struct inode *, int flags);
- int (*write_inode) (struct inode *, int);
- void (*drop_inode) (struct inode *);
- void (*delete_inode) (struct inode *);
+ int (*write_inode) (struct inode *, struct writeback_control *wbc);
+ int (*drop_inode) (struct inode *);
+ void (*evict_inode) (struct inode *);
void (*put_super) (struct super_block *);
int (*sync_fs)(struct super_block *sb, int wait);
+ int (*freeze_super) (struct super_block *sb,
+ enum freeze_holder who);
int (*freeze_fs) (struct super_block *);
+ int (*thaw_super) (struct super_block *sb,
+ enum freeze_holder who);
int (*unfreeze_fs) (struct super_block *);
int (*statfs) (struct dentry *, struct kstatfs *);
int (*remount_fs) (struct super_block *, int *, char *);
- void (*clear_inode) (struct inode *);
void (*umount_begin) (struct super_block *);
int (*show_options)(struct seq_file *, struct dentry *);
+ int (*show_devname)(struct seq_file *, struct dentry *);
+ int (*show_path)(struct seq_file *, struct dentry *);
+ int (*show_stats)(struct seq_file *, struct dentry *);
ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
- int (*nr_cached_objects)(struct super_block *);
- void (*free_cached_objects)(struct super_block *, int);
+ struct dquot **(*get_dquots)(struct inode *);
+
+ long (*nr_cached_objects)(struct super_block *,
+ struct shrink_control *);
+ long (*free_cached_objects)(struct super_block *,
+ struct shrink_control *);
};
All methods are called without any locks being held, unless otherwise
@@ -269,8 +303,19 @@ or bottom half).
->alloc_inode was defined and simply undoes anything done by
->alloc_inode.
+``free_inode``
+ this method is called from an RCU callback. If you use
+ call_rcu() in ->destroy_inode to free the 'struct inode'
+ memory, then it is better to release the memory in this method.
+
``dirty_inode``
- this method is called by the VFS to mark an inode dirty.
+ this method is called by the VFS when an inode is marked dirty.
+ This is specifically for the inode itself being marked dirty,
+ not its data. If the update needs to be persisted by fdatasync(),
+ then I_DIRTY_DATASYNC will be set in the flags argument.
+ I_DIRTY_TIME will be set in the flags if lazytime is enabled
+ and the inode has had its times updated since the last
+ ->dirty_inode call.
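+
+ A hedged sketch of interpreting the flags; examplefs_log_inode()
+ is a hypothetical helper that records the inode in the journal:
+
+ .. code-block:: c
+
+     static void examplefs_dirty_inode(struct inode *inode, int flags)
+     {
+             /* With lazytime, pure timestamp updates may be
+              * deferred until something else dirties the inode. */
+             if (flags == I_DIRTY_TIME)
+                     return;
+             examplefs_log_inode(inode);
+     }
+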
``write_inode``
this method is called when the VFS needs to write an inode to
@@ -290,8 +335,12 @@ or bottom half).
practice of using "force_delete" in the put_inode() case, but
does not have the races that the "force_delete()" approach had.
-``delete_inode``
- called when the VFS wants to delete an inode
+``evict_inode``
+ called when the VFS wants to evict an inode. Caller does
+ *not* evict the pagecache or inode-associated metadata buffers;
+ the method has to use truncate_inode_pages_final() to get rid
+ of those. Caller makes sure async writeback cannot be running for
+ the inode while (or after) ->evict_inode() is called. Optional.
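+
+ A minimal sketch; examplefs_release_inode() is a hypothetical
+ helper that tears down filesystem-private state:
+
+ .. code-block:: c
+
+     static void examplefs_evict_inode(struct inode *inode)
+     {
+             /* The caller does not evict the pagecache, so it
+              * must be truncated here. */
+             truncate_inode_pages_final(&inode->i_data);
+             clear_inode(inode);
+             examplefs_release_inode(inode);
+     }
+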
``put_super``
called when the VFS wishes to free the superblock
@@ -302,14 +351,25 @@ or bottom half).
superblock. The second parameter indicates whether the method
should wait until the write out has been completed. Optional.
+``freeze_super``
+ Called instead of the ->freeze_fs callback if provided.
+ The main difference is that ->freeze_super is called without
+ down_write(&sb->s_umount) held. If the filesystem implements it
+ and wants ->freeze_fs to be called too, then it has to call
+ ->freeze_fs explicitly from this callback. Optional.
+
``freeze_fs``
called when VFS is locking a filesystem and forcing it into a
consistent state. This method is currently used by the Logical
- Volume Manager (LVM).
+ Volume Manager (LVM) and ioctl(FIFREEZE). Optional.
+
+``thaw_super``
+ called when VFS is unlocking a filesystem and making it writable
+ again after ->freeze_super. Optional.
``unfreeze_fs``
called when VFS is unlocking a filesystem and making it writable
- again.
+ again after ->freeze_fs. Optional.
``statfs``
called when the VFS needs to get filesystem statistics.
@@ -318,22 +378,37 @@ or bottom half).
called when the filesystem is remounted. This is called with
the kernel lock held
-``clear_inode``
- called then the VFS clears the inode. Optional
-
``umount_begin``
called when the VFS is unmounting a filesystem.
``show_options``
- called by the VFS to show mount options for /proc/<pid>/mounts.
+ called by the VFS to show mount options for /proc/<pid>/mounts
+ and /proc/<pid>/mountinfo.
(see "Mount Options" section)
+``show_devname``
+ Optional. Called by the VFS to show device name for
+ /proc/<pid>/{mounts,mountinfo,mountstats}. If not provided then
+ '(struct mount).mnt_devname' will be used.
+
+``show_path``
+ Optional. Called by the VFS (for /proc/<pid>/mountinfo) to show
+ the mount root dentry path relative to the filesystem root.
+
+``show_stats``
+ Optional. Called by the VFS (for /proc/<pid>/mountstats) to show
+ filesystem-specific mount statistics.
+
``quota_read``
called by the VFS to read from filesystem quota file.
``quota_write``
called by the VFS to write to filesystem quota file.
+``get_dquots``
+ called by quota to get 'struct dquot' array for a particular inode.
+ Optional.
+
``nr_cached_objects``
called by the sb cache shrinking function for the filesystem to
return the number of freeable cached objects it contains.
@@ -362,7 +437,7 @@ field. This is a pointer to a "struct inode_operations" which describes
the methods that can be performed on individual inodes.
-struct xattr_handlers
+struct xattr_handler
---------------------
On filesystems that support extended attributes (xattrs), the s_xattr
@@ -392,7 +467,7 @@ Extended attributes are name:value pairs.
``set``
Called by the VFS to set the value of a particular extended
attribute. When the new value is NULL, called to remove a
- particular extended attribute. This method is called by the the
+ particular extended attribute. This method is called by the
setxattr(2) and removexattr(2) system calls.
When none of the xattr handlers of a filesystem match the specified
@@ -415,28 +490,34 @@ As of kernel 2.6.22, the following members are defined:
.. code-block:: c
struct inode_operations {
- int (*create) (struct inode *,struct dentry *, umode_t, bool);
+ int (*create) (struct mnt_idmap *, struct inode *,struct dentry *, umode_t, bool);
struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
int (*link) (struct dentry *,struct inode *,struct dentry *);
int (*unlink) (struct inode *,struct dentry *);
- int (*symlink) (struct inode *,struct dentry *,const char *);
- int (*mkdir) (struct inode *,struct dentry *,umode_t);
+ int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,const char *);
+ struct dentry *(*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
int (*rmdir) (struct inode *,struct dentry *);
- int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t);
- int (*rename) (struct inode *, struct dentry *,
+ int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t);
+ int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
struct inode *, struct dentry *, unsigned int);
int (*readlink) (struct dentry *, char __user *,int);
const char *(*get_link) (struct dentry *, struct inode *,
struct delayed_call *);
- int (*permission) (struct inode *, int);
- int (*get_acl)(struct inode *, int);
- int (*setattr) (struct dentry *, struct iattr *);
- int (*getattr) (const struct path *, struct kstat *, u32, unsigned int);
+ int (*permission) (struct mnt_idmap *, struct inode *, int);
+ struct posix_acl * (*get_inode_acl)(struct inode *, int, bool);
+ int (*setattr) (struct mnt_idmap *, struct dentry *, struct iattr *);
+ int (*getattr) (struct mnt_idmap *, const struct path *, struct kstat *, u32, unsigned int);
ssize_t (*listxattr) (struct dentry *, char *, size_t);
void (*update_time)(struct inode *, struct timespec *, int);
int (*atomic_open)(struct inode *, struct dentry *, struct file *,
unsigned open_flag, umode_t create_mode);
- int (*tmpfile) (struct inode *, struct dentry *, umode_t);
+ int (*tmpfile) (struct mnt_idmap *, struct inode *, struct file *, umode_t);
+ struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int);
+ int (*set_acl)(struct mnt_idmap *, struct dentry *, struct posix_acl *, int);
+ int (*fileattr_set)(struct mnt_idmap *idmap,
+ struct dentry *dentry, struct fileattr *fa);
+ int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
+ struct offset_ctx *(*get_offset_ctx)(struct inode *inode);
};
Again, all methods are called without any locks being held, unless
@@ -481,7 +562,26 @@ otherwise noted.
``mkdir``
called by the mkdir(2) system call. Only required if you want
to support creating subdirectories. You will probably need to
- call d_instantiate() just as you would in the create() method
+ call d_instantiate_new() just as you would in the create() method.
+
+ If d_instantiate_new() is not used and if the fh_to_dentry()
+ export operation is provided, or if the storage might be
+ accessible by another path (e.g. with a network filesystem)
+ then more care may be needed. Importantly, d_instantiate()
+ should not be used with an inode that is no longer I_NEW if there
+ is any chance that the inode could already be attached to a dentry.
+ This is because of a hard rule in the VFS that a directory must
+ only ever have one dentry.
+
+ For example, if an NFS filesystem is mounted twice the new directory
+ could be visible on the other mount before it is on the original
+ mount, and a pair of name_to_handle_at(), open_by_handle_at()
+ calls could instantiate the directory inode with an IS_ROOT()
+ dentry before the first mkdir returns.
+
+ If there is any chance this could happen, then the new inode
+ should be d_drop()ed and attached with d_splice_alias(). The
+ returned dentry (if any) should be returned by ->mkdir().
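+
+ A sketch of the careful variant described above;
+ examplefs_new_dir_inode() is a hypothetical helper that
+ allocates and fills in the new directory inode:
+
+ .. code-block:: c
+
+     static struct dentry *examplefs_mkdir(struct mnt_idmap *idmap,
+                                           struct inode *dir,
+                                           struct dentry *dentry,
+                                           umode_t mode)
+     {
+             struct inode *inode = examplefs_new_dir_inode(dir, mode);
+
+             if (IS_ERR(inode))
+                     return ERR_CAST(inode);
+             /* The inode may already have a dentry attached via
+              * another path, so drop ours and let d_splice_alias()
+              * find or attach the one allowed alias. */
+             d_drop(dentry);
+             return d_splice_alias(inode, dentry);
+     }
+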
``rmdir``
called by the rmdir(2) system call. Only required if you want
@@ -582,8 +682,25 @@ otherwise noted.
``tmpfile``
called in the end of O_TMPFILE open(). Optional, equivalent to
atomically creating, opening and unlinking a file in given
- directory.
-
+ directory. On success it needs to return with the file already
+ open; this can be done by calling finish_open_simple() right at
+ the end.
+
+``fileattr_get``
+ called on ioctl(FS_IOC_GETFLAGS) and ioctl(FS_IOC_FSGETXATTR) to
+ retrieve miscellaneous file flags and attributes. Also called
+ before the relevant SET operation to check what is being changed
+ (in this case with i_rwsem locked exclusive). If unset, then
+ fall back to f_op->ioctl().
+
+``fileattr_set``
+ called on ioctl(FS_IOC_SETFLAGS) and ioctl(FS_IOC_FSSETXATTR) to
+ change miscellaneous file flags and attributes. Callers hold
+ i_rwsem exclusive. If unset, then fall back to f_op->ioctl().
+
+``get_offset_ctx``
+ called to get the offset context for a directory inode. A
+ filesystem must define this operation to use
+ simple_offset_dir_operations.
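+
+ A minimal sketch, assuming a hypothetical examplefs inode
+ container with an embedded struct offset_ctx:
+
+ .. code-block:: c
+
+     static struct offset_ctx *examplefs_get_offset_ctx(struct inode *inode)
+     {
+             return &EXAMPLEFS_I(inode)->dir_offsets;
+     }
+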
The Address Space Object
========================
@@ -599,11 +716,10 @@ page lookup by address, and keeping track of pages tagged as Dirty or
Writeback.
The first can be used independently of the others. The VM can try to
-either write dirty pages in order to clean them, or release clean pages
-in order to reuse them. To do this it can call the ->writepage method
-on dirty pages, and ->releasepage on clean pages with PagePrivate set.
-Clean pages without PagePrivate and with no external references will be
-released without notice being given to the address_space.
+release clean pages in order to reuse them. To do this it can call
+->release_folio on clean folios with the private
+flag set. Clean pages without PagePrivate and with no external references
+will be released without notice being given to the address_space.
To achieve this functionality, pages need to be placed on an LRU with
lru_cache_add and mark_page_active needs to be called whenever the page
@@ -614,8 +730,8 @@ maintains information about the PG_Dirty and PG_Writeback status of each
page, so that pages with either of these flags can be found quickly.
The Dirty tag is primarily used by mpage_writepages - the default
-->writepages method. It uses the tag to find dirty pages to call
-->writepage on. If mpage_writepages is not used (i.e. the address
+->writepages method. It uses the tag to find dirty pages to
+write back. If mpage_writepages is not used (i.e. the address
provides its own ->writepages) , the PAGECACHE_TAG_DIRTY tag is almost
unused. write_inode_now and sync_inode do use it (through
__sync_single_inode) to check if ->writepages has been successful in
@@ -637,25 +753,25 @@ by memory-mapping the page. Data is written into the address space by
the application, and then written-back to storage typically in whole
pages, however the address_space has finer control of write sizes.
-The read process essentially only requires 'readpage'. The write
+The read process essentially only requires 'read_folio'. The write
process is more complicated and uses write_begin/write_end or
-set_page_dirty to write data into the address_space, and writepage and
+dirty_folio to write data into the address_space, and
writepages to writeback data to storage.
Adding and removing pages to/from an address_space is protected by the
inode's i_mutex.
When data is written to a page, the PG_Dirty flag should be set. It
-typically remains set until writepage asks for it to be written. This
+typically remains set until writepages asks for it to be written. This
should clear PG_Dirty and set PG_Writeback. It can be actually written
at any point after PG_Dirty is clear. Once it is known to be safe,
PG_Writeback is cleared.
Writeback makes use of a writeback_control structure to direct the
-operations. This gives the the writepage and writepages operations some
+operations. This gives the writepages operation some
information about the nature of and reason for the writeback request,
and the constraints under which it is being done. It is also used to
-return information back to the caller about the result of a writepage or
+return information back to the caller about the result of a
writepages request.
@@ -669,7 +785,7 @@ is an error during writeback, they expect that error to be reported when
a file sync request is made. After an error has been reported on one
request, subsequent requests on the same file descriptor should return
0, unless further writeback errors have occurred since the previous file
-syncronization.
+synchronization.
Ideally, the kernel would report errors only on file descriptions on
which writes were done that subsequently failed to be written back. The
@@ -702,106 +818,100 @@ cache in your filesystem. The following members are defined:
.. code-block:: c
struct address_space_operations {
- int (*writepage)(struct page *page, struct writeback_control *wbc);
- int (*readpage)(struct file *, struct page *);
+ int (*read_folio)(struct file *, struct folio *);
int (*writepages)(struct address_space *, struct writeback_control *);
- int (*set_page_dirty)(struct page *page);
+ bool (*dirty_folio)(struct address_space *, struct folio *);
void (*readahead)(struct readahead_control *);
- int (*readpages)(struct file *filp, struct address_space *mapping,
- struct list_head *pages, unsigned nr_pages);
int (*write_begin)(struct file *, struct address_space *mapping,
- loff_t pos, unsigned len, unsigned flags,
+ loff_t pos, unsigned len,
			struct folio **foliop, void **fsdata);
int (*write_end)(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
- struct page *page, void *fsdata);
+ struct folio *folio, void *fsdata);
sector_t (*bmap)(struct address_space *, sector_t);
- void (*invalidatepage) (struct page *, unsigned int, unsigned int);
- int (*releasepage) (struct page *, int);
- void (*freepage)(struct page *);
+ void (*invalidate_folio) (struct folio *, size_t start, size_t len);
+ bool (*release_folio)(struct folio *, gfp_t);
+ void (*free_folio)(struct folio *);
ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
- /* isolate a page for migration */
- bool (*isolate_page) (struct page *, isolate_mode_t);
- /* migrate the contents of a page to the specified target */
- int (*migratepage) (struct page *, struct page *);
- /* put migration-failed page back to right list */
- void (*putback_page) (struct page *);
- int (*launder_page) (struct page *);
-
- int (*is_partially_uptodate) (struct page *, unsigned long,
- unsigned long);
- void (*is_dirty_writeback) (struct page *, bool *, bool *);
- int (*error_remove_page) (struct mapping *mapping, struct page *page);
- int (*swap_activate)(struct file *);
+ int (*migrate_folio)(struct address_space *, struct folio *dst,
+ struct folio *src, enum migrate_mode);
+ int (*launder_folio) (struct folio *);
+
+ bool (*is_partially_uptodate) (struct folio *, size_t from,
+ size_t count);
+ void (*is_dirty_writeback)(struct folio *, bool *, bool *);
+ int (*error_remove_folio)(struct address_space *mapping, struct folio *);
+ int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span);
int (*swap_deactivate)(struct file *);
+ int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);
};
-``writepage``
- called by the VM to write a dirty page to backing store. This
- may happen for data integrity reasons (i.e. 'sync'), or to free
- up memory (flush). The difference can be seen in
- wbc->sync_mode. The PG_Dirty flag has been cleared and
- PageLocked is true. writepage should start writeout, should set
- PG_Writeback, and should make sure the page is unlocked, either
- synchronously or asynchronously when the write operation
- completes.
-
- If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
- try too hard if there are problems, and may choose to write out
- other pages from the mapping if that is easier (e.g. due to
- internal dependencies). If it chooses not to start writeout, it
- should return AOP_WRITEPAGE_ACTIVATE so that the VM will not
- keep calling ->writepage on that page.
-
- See the file "Locking" for more details.
-
-``readpage``
- called by the VM to read a page from backing store. The page
- will be Locked when readpage is called, and should be unlocked
- and marked uptodate once the read completes. If ->readpage
- discovers that it needs to unlock the page for some reason, it
- can do so, and then return AOP_TRUNCATED_PAGE. In this case,
- the page will be relocated, relocked and if that all succeeds,
- ->readpage will be called again.
+``read_folio``
+ Called by the page cache to read a folio from the backing store.
+ The 'file' argument supplies authentication information to network
+ filesystems, and is generally not used by block based filesystems.
+ It may be NULL if the caller does not have an open file (eg if
+ the kernel is performing a read for itself rather than on behalf
+ of a userspace process with an open file).
+
+ If the mapping does not support large folios, the folio will
+ contain a single page. The folio will be locked when read_folio
+ is called. If the read completes successfully, the folio should
+ be marked uptodate. The filesystem should unlock the folio
+ once the read has completed, whether it was successful or not.
+ The filesystem does not need to modify the refcount on the folio;
+ the page cache holds a reference count and that will not be
+ released until the folio is unlocked.
+
+ Filesystems may implement ->read_folio() synchronously.
+ In normal operation, folios are read through the ->readahead()
+ method. Only if this fails, or if the caller needs to wait for
+ the read to complete will the page cache call ->read_folio().
+ Filesystems should not attempt to perform their own readahead
+ in the ->read_folio() operation.
+
+ If the filesystem cannot perform the read at this time, it can
+ unlock the folio, do whatever action it needs to ensure that the
+ read will succeed in the future and return AOP_TRUNCATED_PAGE.
+ In this case, the caller should look up the folio, lock it,
+ and call ->read_folio again.
+
+ Callers may invoke the ->read_folio() method directly, but using
+ read_mapping_folio() will take care of locking, waiting for the
+ read to complete and handling cases such as AOP_TRUNCATED_PAGE.
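+
+ As a purely illustrative sketch, a trivial synchronous
+ implementation for a filesystem whose data can be generated in
+ place might look like this:
+
+ .. code-block:: c
+
+     static int examplefs_read_folio(struct file *file, struct folio *folio)
+     {
+             /* Hypothetical: contents are synthesized rather than
+              * read from a device, so just zero-fill. */
+             folio_zero_range(folio, 0, folio_size(folio));
+             folio_mark_uptodate(folio);
+             folio_unlock(folio);
+             return 0;
+     }
+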
``writepages``
called by the VM to write out pages associated with the
- address_space object. If wbc->sync_mode is WBC_SYNC_ALL, then
+ address_space object. If wbc->sync_mode is WB_SYNC_ALL, then
the writeback_control will specify a range of pages that must be
- written out. If it is WBC_SYNC_NONE, then a nr_to_write is
+ written out. If it is WB_SYNC_NONE, then a nr_to_write is
given and that many pages should be written if possible. If no
->writepages is given, then mpage_writepages is used instead.
This will choose pages from the address space that are tagged as
- DIRTY and will pass them to ->writepage.
+ DIRTY and will write them back.
-``set_page_dirty``
- called by the VM to set a page dirty. This is particularly
- needed if an address space attaches private data to a page, and
- that data needs to be updated when a page is dirtied. This is
+``dirty_folio``
+ called by the VM to mark a folio as dirty. This is particularly
+ needed if an address space attaches private data to a folio, and
+ that data needs to be updated when a folio is dirtied. This is
called, for example, when a memory mapped page gets modified.
- If defined, it should set the PageDirty flag, and the
- PAGECACHE_TAG_DIRTY tag in the radix tree.
+ If defined, it should set the folio dirty flag, and the
+ PAGECACHE_TAG_DIRTY search mark in i_pages.
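+
+ Filesystems with no folio-private state to update can often
+ defer to the generic helper; a sketch:
+
+ .. code-block:: c
+
+     static bool examplefs_dirty_folio(struct address_space *mapping,
+                                       struct folio *folio)
+     {
+             /* filemap_dirty_folio() sets the dirty flag and the
+              * PAGECACHE_TAG_DIRTY mark in i_pages. */
+             return filemap_dirty_folio(mapping, folio);
+     }
+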
``readahead``
Called by the VM to read pages associated with the address_space
object. The pages are consecutive in the page cache and are
locked. The implementation should decrement the page refcount
after starting I/O on each page. Usually the page will be
- unlocked by the I/O completion handler. If the filesystem decides
- to stop attempting I/O before reaching the end of the readahead
- window, it can simply return. The caller will decrement the page
- refcount and unlock the remaining pages for you. Set PageUptodate
- if the I/O completes successfully. Setting PageError on any page
- will be ignored; simply unlock the page if an I/O error occurs.
-
-``readpages``
- called by the VM to read pages associated with the address_space
- object. This is essentially just a vector version of readpage.
- Instead of just one page, several pages are requested.
- readpages is only used for read-ahead, so read errors are
- ignored. If anything goes wrong, feel free to give up.
- This interface is deprecated and will be removed by the end of
- 2020; implement readahead instead.
+ unlocked by the I/O completion handler. The set of pages is
+ divided into some sync pages followed by some async pages;
+ rac->ra->async_size gives the number of async pages. The
+ filesystem should attempt to read all sync pages but may decide
+ to stop once it reaches the async pages. If it does decide to
+ stop attempting I/O, it can simply return. The caller will
+ remove the remaining pages from the address space, unlock them
+ and decrement the page refcount. Set PageUptodate if the I/O
+ completes successfully.
``write_begin``
Called by the generic buffered write code to ask the filesystem
@@ -813,15 +923,12 @@ cache in your filesystem. The following members are defined:
(if they haven't been read already) so that the updated blocks
can be written out properly.
- The filesystem must return the locked pagecache page for the
- specified offset, in ``*pagep``, for the caller to write into.
+ The filesystem must return the locked pagecache folio for the
+ specified offset, in ``*foliop``, for the caller to write into.
It must be able to cope with short writes (where the length
passed to write_begin is greater than the number of bytes copied
- into the page).
-
- flags is a field for AOP_FLAG_xxx flags, described in
- include/linux/fs.h.
+ into the folio).
A void * may be returned in fsdata, which then gets passed into
write_end.
@@ -834,8 +941,8 @@ cache in your filesystem. The following members are defined:
called. len is the original len passed to write_begin, and
copied is the amount that was able to be copied.
- The filesystem must take care of unlocking the page and
- releasing it refcount, and updating i_size.
+ The filesystem must take care of unlocking the folio,
+ decrementing its refcount, and updating i_size.
Returns < 0 on failure, otherwise the number of bytes (<=
'copied') that were able to be copied into pagecache.
@@ -849,42 +956,41 @@ cache in your filesystem. The following members are defined:
to find out where the blocks in the file are and uses those
addresses directly.
-``invalidatepage``
- If a page has PagePrivate set, then invalidatepage will be
- called when part or all of the page is to be removed from the
+``invalidate_folio``
+ If a folio has private data, then invalidate_folio will be
+ called when part or all of the folio is to be removed from the
address space. This generally corresponds to either a
truncation, punch hole or a complete invalidation of the address
space (in the latter case 'offset' will always be 0 and 'length'
- will be PAGE_SIZE). Any private data associated with the page
+ will be folio_size()). Any private data associated with the folio
should be updated to reflect this truncation. If offset is 0
- and length is PAGE_SIZE, then the private data should be
- released, because the page must be able to be completely
- discarded. This may be done by calling the ->releasepage
+ and length is folio_size(), then the private data should be
+ released, because the folio must be able to be completely
+ discarded. This may be done by calling the ->release_folio
function, but in this case the release MUST succeed.
-``releasepage``
- releasepage is called on PagePrivate pages to indicate that the
- page should be freed if possible. ->releasepage should remove
- any private data from the page and clear the PagePrivate flag.
- If releasepage() fails for some reason, it must indicate failure
- with a 0 return value. releasepage() is used in two distinct
- though related cases. The first is when the VM finds a clean
- page with no active users and wants to make it a free page. If
- ->releasepage succeeds, the page will be removed from the
- address_space and become free.
+``release_folio``
+ release_folio is called on folios with private data to tell the
+ filesystem that the folio is about to be freed. ->release_folio
+ should remove any private data from the folio and clear the
+ private flag. If release_folio() fails, it should return false.
+ release_folio() is used in two distinct though related cases.
+ The first is when the VM wants to free a clean folio with no
+ active users. If ->release_folio succeeds, the folio will be
+ removed from the address_space and be freed.
The second case is when a request has been made to invalidate
- some or all pages in an address_space. This can happen through
- the fadvise(POSIX_FADV_DONTNEED) system call or by the
- filesystem explicitly requesting it as nfs and 9fs do (when they
+ some or all folios in an address_space. This can happen
+ through the fadvise(POSIX_FADV_DONTNEED) system call or by the
+ filesystem explicitly requesting it as nfs and 9p do (when they
believe the cache may be out of date with storage) by calling
invalidate_inode_pages2(). If the filesystem makes such a call,
- and needs to be certain that all pages are invalidated, then its
- releasepage will need to ensure this. Possibly it can clear the
- PageUptodate bit if it cannot free private data yet.
+ and needs to be certain that all folios are invalidated, then
+ its release_folio will need to ensure this. Possibly it can
+ clear the uptodate flag if it cannot free private data yet.
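+
+ A hedged sketch; examplefs_private_busy() is a hypothetical
+ check on the filesystem's folio-private state, which this
+ sketch assumes was kmalloc()ed:
+
+ .. code-block:: c
+
+     static bool examplefs_release_folio(struct folio *folio, gfp_t gfp)
+     {
+             if (examplefs_private_busy(folio))
+                     return false;   /* cannot free it yet */
+             kfree(folio_detach_private(folio));
+             return true;
+     }
+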
-``freepage``
- freepage is called once the page is no longer visible in the
+``free_folio``
+ free_folio is called once the folio is no longer visible in the
page cache in order to allow the cleanup of any private data.
Since it may be called by the memory reclaimer, it should not
assume that the original address_space mapping still exists, and
@@ -896,59 +1002,57 @@ cache in your filesystem. The following members are defined:
data directly between the storage and the application's address
space.
-``isolate_page``
- Called by the VM when isolating a movable non-lru page. If page
- is successfully isolated, VM marks the page as PG_isolated via
- __SetPageIsolated.
-
-``migrate_page``
+``migrate_folio``
This is used to compact the physical memory usage. If the VM
- wants to relocate a page (maybe off a memory card that is
- signalling imminent failure) it will pass a new page and an old
- page to this function. migrate_page should transfer any private
- data across and update any references that it has to the page.
-
-``putback_page``
- Called by the VM when isolated page's migration fails.
-
-``launder_page``
- Called before freeing a page - it writes back the dirty page.
- To prevent redirtying the page, it is kept locked during the
+ wants to relocate a folio (maybe from a memory device that is
+ signalling imminent failure) it will pass a new folio and an old
+ folio to this function. migrate_folio should transfer any private
+ data across and update any references that it has to the folio.
+
+``launder_folio``
+ Called before freeing a folio - it writes back the dirty folio.
+ To prevent redirtying the folio, it is kept locked during the
whole operation.
``is_partially_uptodate``
Called by the VM when reading a file through the pagecache when
- the underlying blocksize != pagesize. If the required block is
- up to date then the read can complete without needing the IO to
- bring the whole page up to date.
+ the underlying blocksize is smaller than the size of the folio.
+ If the required block is up to date then the read can complete
+ without needing I/O to bring the whole folio up to date.
``is_dirty_writeback``
- Called by the VM when attempting to reclaim a page. The VM uses
+ Called by the VM when attempting to reclaim a folio. The VM uses
dirty and writeback information to determine if it needs to
stall to allow flushers a chance to complete some IO.
- Ordinarily it can use PageDirty and PageWriteback but some
- filesystems have more complex state (unstable pages in NFS
+ Ordinarily it can use folio_test_dirty and folio_test_writeback but
+ some filesystems have more complex state (unstable folios in NFS
prevent reclaim) or do not set those flags due to locking
problems. This callback allows a filesystem to indicate to the
- VM if a page should be treated as dirty or writeback for the
+ VM if a folio should be treated as dirty or writeback for the
purposes of stalling.
-``error_remove_page``
- normally set to generic_error_remove_page if truncation is ok
+``error_remove_folio``
+ normally set to generic_error_remove_folio if truncation is ok
for this address space. Used for memory failure handling.
Setting this implies you deal with pages going away under you,
unless you have them locked or reference counts increased.
``swap_activate``
- Called when swapon is used on a file to allocate space if
- necessary and pin the block lookup information in memory. A
- return value of zero indicates success, in which case this file
- can be used to back swapspace.
+ Called to prepare the given file for swap. It should perform
+ any validation and preparation necessary to ensure that writes
+ can be performed with minimal memory allocation. It should call
+ add_swap_extent(), or the helper iomap_swapfile_activate(), and
+ return the number of extents added. If IO should be submitted
+ through ->swap_rw(), it should set SWP_FS_OPS, otherwise IO will
+ be submitted directly to the block device ``sis->bdev``.
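+
+ A minimal sketch for an iomap-based filesystem, where
+ examplefs_iomap_ops is an assumed iomap_ops table:
+
+ .. code-block:: c
+
+     static int examplefs_swap_activate(struct swap_info_struct *sis,
+                                        struct file *file, sector_t *span)
+     {
+             /* Let the iomap helper walk the extents and call
+              * add_swap_extent() for us. */
+             return iomap_swapfile_activate(sis, file, span,
+                                            &examplefs_iomap_ops);
+     }
+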
``swap_deactivate``
Called during swapoff on files where swap_activate was
successful.
+``swap_rw``
+ Called to read or write swap pages when SWP_FS_OPS is set.
The File Object
===============
@@ -973,7 +1077,6 @@ This describes how the VFS can manipulate an open file. As of kernel
ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
int (*iopoll)(struct kiocb *kiocb, bool spin);
- int (*iterate) (struct file *, struct dir_context *);
int (*iterate_shared) (struct file *, struct dir_context *);
__poll_t (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
@@ -985,7 +1088,6 @@ This describes how the VFS can manipulate an open file. As of kernel
int (*fsync) (struct file *, loff_t, loff_t, int datasync);
int (*fasync) (int, struct file *, int);
int (*lock) (struct file *, int, struct file_lock *);
- ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
int (*check_flags)(int);
int (*flock) (struct file *, int, struct file_lock *);
@@ -1026,12 +1128,8 @@ otherwise noted.
``iopoll``
called when aio wants to poll for completions on HIPRI iocbs
-``iterate``
- called when the VFS needs to read the directory contents
-
``iterate_shared``
- called when the VFS needs to read the directory contents when
- filesystem supports concurrent dir iterators
+ called when the VFS needs to read the directory contents
``poll``
called by the VFS when a process wants to check if there is
@@ -1116,7 +1214,7 @@ otherwise noted.
before any bytes were remapped. The remap_flags parameter
accepts REMAP_FILE_* flags. If REMAP_FILE_DEDUP is set then the
implementation must only remap if the requested file ranges have
- identical contents. If REMAP_CAN_SHORTEN is set, the caller is
+ identical contents. If REMAP_FILE_CAN_SHORTEN is set, the caller is
ok with the implementation shortening the request length to
satisfy alignment or EOF requirements (or any other reason).
@@ -1151,7 +1249,8 @@ defined:
.. code-block:: c
struct dentry_operations {
- int (*d_revalidate)(struct dentry *, unsigned int);
+ int (*d_revalidate)(struct inode *, const struct qstr *,
+ struct dentry *, unsigned int);
int (*d_weak_revalidate)(struct dentry *, unsigned int);
int (*d_hash)(const struct dentry *, struct qstr *);
int (*d_compare)(const struct dentry *,
@@ -1163,7 +1262,9 @@ defined:
char *(*d_dname)(struct dentry *, char *, int);
struct vfsmount *(*d_automount)(struct path *);
int (*d_manage)(const struct path *, bool);
- struct dentry *(*d_real)(struct dentry *, const struct inode *);
+ struct dentry *(*d_real)(struct dentry *, enum d_real_type type);
+ bool (*d_unalias_trylock)(const struct dentry *);
+ void (*d_unalias_unlock)(const struct dentry *);
};
``d_revalidate``
@@ -1188,7 +1289,7 @@ defined:
return
-ECHILD and it will be called again in ref-walk mode.
-``_weak_revalidate``
+``d_weak_revalidate``
called when the VFS needs to revalidate a "jumped" dentry. This
is called when a path-walk ends at dentry that was not acquired
by doing a lookup in the parent directory. This includes "/",
@@ -1289,9 +1390,7 @@ defined:
If a vfsmount is returned, the caller will attempt to mount it
on the mountpoint and will remove the vfsmount from its
- expiration list in the case of failure. The vfsmount should be
- returned with 2 refs on it to prevent automatic expiration - the
- caller will clean up the additional ref.
+ expiration list in the case of failure.
This function is only used if DCACHE_NEED_AUTOMOUNT is set on
the dentry. This is set by __d_instantiate() if S_AUTOMOUNT is
@@ -1318,16 +1417,33 @@ defined:
the dentry being transited from.
``d_real``
- overlay/union type filesystems implement this method to return
- one of the underlying dentries hidden by the overlay. It is
- used in two different modes:
+ overlay/union type filesystems implement this method to return one
+ of the underlying dentries of a regular file hidden by the overlay.
+
+ The 'type' argument takes the values D_REAL_DATA or D_REAL_METADATA
+ for returning the real underlying dentry that refers to the inode
+ hosting the file's data or metadata respectively.
+
+ For non-regular files, the 'dentry' argument is returned.
+
+``d_unalias_trylock``
+ if present, will be called by d_splice_alias() before moving a
+ preexisting attached alias. Returning false prevents __d_move(),
+ making d_splice_alias() fail with -ESTALE.
+
+ Rationale: setting FS_RENAME_DOES_D_MOVE will prevent d_move()
+ and d_exchange() calls from the outside of filesystem methods;
+ however, it does not guarantee that attached dentries won't
+ be renamed or moved by d_splice_alias() finding a preexisting
+ alias for a directory inode. Normally we would not care;
+ however, something that wants to stabilize the entire path to
+ root over a blocking operation might need that. See 9p for one
+ (and hopefully the only) example.
- Called from file_dentry() it returns the real dentry matching
- the inode argument. The real dentry may be from a lower layer
- already copied up, but still referenced from the file. This
- mode is selected with a non-NULL inode argument.
+``d_unalias_unlock``
+ should be paired with ``d_unalias_trylock``; it is called after
+ the __d_move() call in __d_unalias().
- With NULL inode the topmost real underlying dentry is returned.
Each dentry has a pointer to its parent dentry, as well as a hash list
of child dentries. Child dentries are basically like files in a
@@ -1431,13 +1547,13 @@ Resources
version.)
Creating Linux virtual filesystems. 2002
- <http://lwn.net/Articles/13325/>
+ <https://lwn.net/Articles/13325/>
The Linux Virtual File-system Layer by Neil Brown. 1999
<http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html>
A tour of the Linux VFS by Michael K. Johnson. 1996
- <http://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html>
+ <https://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html>
A small trail through the Linux kernel by Andries Brouwer. 2001
- <http://www.win.tue.nl/~aeb/linux/vfs/trail.html>
+ <https://www.win.tue.nl/~aeb/linux/vfs/trail.html>
diff --git a/Documentation/filesystems/xfs/index.rst b/Documentation/filesystems/xfs/index.rst
new file mode 100644
index 000000000000..ab66c57a5d18
--- /dev/null
+++ b/Documentation/filesystems/xfs/index.rst
@@ -0,0 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
+XFS Filesystem Documentation
+============================
+
+.. toctree::
+ :maxdepth: 2
+ :numbered:
+
+ xfs-delayed-logging-design
+ xfs-maintainer-entry-profile
+ xfs-self-describing-metadata
+ xfs-online-fsck-design
diff --git a/Documentation/filesystems/xfs-delayed-logging-design.rst b/Documentation/filesystems/xfs/xfs-delayed-logging-design.rst
index 464405d2801e..2a2705e975e8 100644
--- a/Documentation/filesystems/xfs-delayed-logging-design.rst
+++ b/Documentation/filesystems/xfs/xfs-delayed-logging-design.rst
@@ -1,29 +1,314 @@
.. SPDX-License-Identifier: GPL-2.0
-==========================
-XFS Delayed Logging Design
-==========================
-
-Introduction to Re-logging in XFS
-=================================
-
-XFS logging is a combination of logical and physical logging. Some objects,
-such as inodes and dquots, are logged in logical format where the details
-logged are made up of the changes to in-core structures rather than on-disk
-structures. Other objects - typically buffers - have their physical changes
-logged. The reason for these differences is to reduce the amount of log space
-required for objects that are frequently logged. Some parts of inodes are more
-frequently logged than others, and inodes are typically more frequently logged
-than any other object (except maybe the superblock buffer) so keeping the
-amount of metadata logged low is of prime importance.
-
-The reason that this is such a concern is that XFS allows multiple separate
-modifications to a single object to be carried in the log at any given time.
-This allows the log to avoid needing to flush each change to disk before
-recording a new change to the object. XFS does this via a method called
-"re-logging". Conceptually, this is quite simple - all it requires is that any
-new change to the object is recorded with a *new copy* of all the existing
-changes in the new transaction that is written to the log.
+==================
+XFS Logging Design
+==================
+
+Preamble
+========
+
+This document describes the design and algorithms that the XFS journalling
+subsystem is based on, so that readers may familiarize themselves with the
+general concepts of how transaction processing in XFS works.
+
+We begin with an overview of transactions in XFS, followed by describing how
+transaction reservations are structured and accounted, and then move into how we
+guarantee forwards progress for long running transactions with finite initial
+reservation bounds. At this point we need to explain how relogging works. With
+the basic concepts covered, the design of the delayed logging mechanism is
+documented.
+
+
+Introduction
+============
+
+XFS uses Write Ahead Logging for ensuring changes to the filesystem metadata
+are atomic and recoverable. For reasons of space and time efficiency, the
+logging mechanisms are varied and complex, combining intents, logical and
+physical logging mechanisms to provide the necessary recovery guarantees the
+filesystem requires.
+
+Some objects, such as inodes and dquots, are logged in logical format where the
+details logged are made up of the changes to in-core structures rather than
+on-disk structures. Other objects - typically buffers - have their physical
+changes logged. Long running atomic modifications have individual changes
+chained together by intents, ensuring that journal recovery can restart and
+finish an operation that was only partially done when the system stopped
+functioning.
+
+The reason for these differences is to keep the amount of log space and CPU time
+required to process objects being modified as small as possible and hence the
+logging overhead as low as possible. Some items are very frequently modified,
+and some parts of objects are more frequently modified than others, so keeping
+the overhead of metadata logging low is of prime importance.
+
+The method used to log an item or chain modifications together isn't
+particularly important in the scope of this document. It suffices to know that
+the methods used for logging a particular object or chaining modifications
+together differ and depend on the object and/or modification being
+performed. The logging subsystem only cares that certain specific rules are
+followed to guarantee forwards progress and prevent deadlocks.
+
+
+Transactions in XFS
+===================
+
+XFS has two types of high level transactions, defined by the type of log space
+reservation they take. These are known as "one shot" and "permanent"
+transactions. Permanent transactions can take reservations that span
+commit boundaries, whilst "one shot" transactions are for a single atomic
+modification.
+
+The type and size of reservation must be matched to the modification taking
+place. This means that permanent transactions can be used for one-shot
+modifications, but one-shot reservations cannot be used for permanent
+transactions.
+
+In the code, a one-shot transaction pattern looks somewhat like this::
+
+ tp = xfs_trans_alloc(<reservation>)
+ <lock items>
+ <join item to transaction>
+ <do modification>
+ xfs_trans_commit(tp);
+
+As items are modified in the transaction, the dirty regions in those items are
+tracked via the transaction handle. Once the transaction is committed, all
+resources joined to it are released, along with the remaining unused reservation
+space that was taken at the transaction allocation time.
+
+In contrast, a permanent transaction is made up of multiple linked individual
+transactions, and the pattern looks like this::
+
+ tp = xfs_trans_alloc(<reservation>)
+ xfs_ilock(ip, XFS_ILOCK_EXCL)
+
+ loop {
+ xfs_trans_ijoin(tp, 0);
+ <do modification>
+ xfs_trans_log_inode(tp, ip);
+ xfs_trans_roll(&tp);
+ }
+
+ xfs_trans_commit(tp);
+ xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+While this might look similar to a one-shot transaction, there is an important
+difference: xfs_trans_roll() performs a specific operation that links two
+transactions together::
+
+ ntp = xfs_trans_dup(tp);
+ xfs_trans_commit(tp);
+ xfs_trans_reserve(ntp);
+
+This results in a series of "rolling transactions" where the inode is locked
+across the entire chain of transactions. Hence while this series of rolling
+transactions is running, nothing else can read from or write to the inode and
+this provides a mechanism for complex changes to appear atomic from an external
+observer's point of view.
+
+It is important to note that a series of rolling transactions in a permanent
+transaction does not form an atomic change in the journal. While each
+individual modification is atomic, the chain is *not atomic*. If we crash half
+way through, then recovery will only replay up to the last transactional
+modification the loop made that was committed to the journal.
+
+This affects long running permanent transactions in that it is not possible to
+predict how much of a long running operation will actually be recovered because
+there is no guarantee of how much of the operation reached stale storage. Hence
+if a long running operation requires multiple transactions to fully complete,
+the high level operation must use intents and deferred operations to guarantee
+recovery can complete the operation once the first transaction is persisted in
+the on-disk journal.
+
+
+Transactions are Asynchronous
+=============================
+
+In XFS, all high level transactions are asynchronous by default. This means that
+xfs_trans_commit() does not guarantee that the modification has been committed
+to stable storage when it returns. Hence when a system crashes, not all the
+completed transactions will be replayed during recovery.
+
+However, the logging subsystem does provide global ordering guarantees, such
+that if a specific change is seen after recovery, all metadata modifications
+that were committed prior to that change will also be seen.
+
+For single shot operations that need to reach stable storage immediately, or to
+ensure that a long running permanent transaction is fully committed once it is
+complete, we can explicitly tag a transaction as synchronous. This will trigger
+a "log force" to flush the outstanding committed transactions to stable storage
+in the journal and wait for that to complete.
+
+Synchronous transactions are rarely used, however, because they limit logging
+throughput to the IO latency limitations of the underlying storage. Instead, we
+tend to use log forces to ensure modifications are on stable storage only when
+a user operation requires a synchronisation point to occur (e.g. fsync).
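+
+For illustration, this follows the one-shot pattern shown earlier with one
+extra step before the commit; xfs_trans_set_sync() is the helper that tags
+the transaction::
+
+ tp = xfs_trans_alloc(<reservation>)
+ <lock items>
+ <join item to transaction>
+ <do modification>
+ xfs_trans_set_sync(tp);
+ xfs_trans_commit(tp);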
+
+
+Transaction Reservations
+========================
+
+It has been mentioned a number of times now that the logging subsystem needs to
+provide a forwards progress guarantee so that no modification ever stalls
+because it can't be written to the journal due to a lack of space in the
+journal. This is achieved by the transaction reservations that are made when
+a transaction is first allocated. For permanent transactions, these reservations
+are maintained as part of the transaction rolling mechanism.
+
+A transaction reservation provides a guarantee that there is physical log space
+available to write the modification into the journal before we start making
+modifications to objects and items. As such, the reservation needs to be large
+enough to take into account the amount of metadata that the change might need to
+log in the worst case. This means that if we are modifying a btree in the
+transaction, we have to reserve enough space to record a full leaf-to-root split
+of the btree. As such, the reservations are quite complex because we have to
+take into account all the hidden changes that might occur.
+
+For example, a user data extent allocation involves allocating an extent from
+free space, which modifies the free space trees. That's two btrees. Inserting
+the extent into the inode's extent map might require a split of the extent map
+btree, which requires another allocation that can modify the free space trees
+again. Then we might have to update reverse mappings, which modifies yet
+another btree which might require more space. And so on. Hence the amount of
+metadata that a "simple" operation can modify can be quite large.
+
+This "worst case" calculation provides us with the static "unit reservation"
+for the transaction that is calculated at mount time. We must guarantee that the
+log has this much space available before the transaction is allowed to proceed
+so that when we come to write the dirty metadata into the log we don't run out
+of log space half way through the write.
+
+For one-shot transactions, a single unit space reservation is all that is
+required for the transaction to proceed. For permanent transactions, however, we
+also have a "log count" that affects the size of the reservation that is to be
+made.
+
+While a permanent transaction can get by with a single unit of space
+reservation, it is somewhat inefficient to do this as it requires the
+transaction rolling mechanism to re-reserve space on every transaction roll. We
+know from the implementation of the permanent transactions how many transaction
+rolls are likely for the common modifications that need to be made.
+
+For example, an inode allocation is typically two transactions - one to
+physically allocate a free inode chunk on disk, and another to allocate an inode
+from an inode chunk that has free inodes in it. Hence for an inode allocation
+transaction, we might set the reservation log count to a value of 2 to indicate
+that the common/fast path transaction will commit two linked transactions in a
+chain. Each time a permanent transaction rolls, it consumes an entire unit
+reservation.
+
+Hence when the permanent transaction is first allocated, the log space
+reservation is increased from a single unit reservation to multiple unit
+reservations. That multiple is defined by the reservation log count, and this
+means we can roll the transaction multiple times before we have to re-reserve
+log space. This ensures that the common
+modifications we make only need to reserve log space once.
+
+If the log count for a permanent transaction reaches zero, then it needs to
+re-reserve physical space in the log. This is somewhat complex, and requires
+an understanding of how the log accounts for space that has been reserved.
+
+
+Log Space Accounting
+====================
+
+The position in the log is typically referred to as a Log Sequence Number (LSN).
+The log is circular, so the positions in the log are defined by the combination
+of a cycle number - the number of times the log has been overwritten - and the
+offset into the log. An LSN carries the cycle in the upper 32 bits and the
+offset in the lower 32 bits. The offset is in units of "basic blocks" (512
+bytes). Hence we can do relatively simple LSN based math to keep track of
+available space in the log.
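+
+As a hedged illustration of that math, mirroring the in-tree CYCLE_LSN() and
+BLOCK_LSN() macros (the helper names below are illustrative)::
+
+ static inline uint32_t example_lsn_cycle(uint64_t lsn)
+ {
+         return lsn >> 32;               /* number of times the log wrapped */
+ }
+
+ static inline uint32_t example_lsn_block(uint64_t lsn)
+ {
+         return lsn & 0xffffffff;        /* offset in 512 byte basic blocks */
+ }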
+
+Log space accounting is done via a pair of constructs called "grant heads". The
+position of the grant heads is an absolute value, so the amount of space
+available in the log is defined by the distance between the position of the
+grant head and the current log tail. That is, how much space can be
+reserved/consumed before the grant heads would fully wrap the log and overtake
+the tail position.
+
+The first grant head is the "reserve" head. This tracks the byte count of the
+reservations currently held by active transactions. It is a purely in-memory
+accounting of the space reservation and, as such, actually tracks byte offsets
+into the log rather than basic blocks. Hence it technically isn't using LSNs to
+represent the log position, but it is still treated like a split {cycle,offset}
+tuple for the purposes of tracking reservation space.
+
+The reserve grant head is used to accurately account for exact transaction
+reservation amounts and the exact byte count that modifications actually make
+and need to write into the log. The reserve head is used to prevent new
+transactions from taking new reservations when the head reaches the current
+tail. It will block new reservations in a FIFO queue and as the log tail moves
+forward it will wake them in order once sufficient space is available. This FIFO
+mechanism ensures no transaction is starved of resources when log space
+shortages occur.
+
+The other grant head is the "write" head. Unlike the reserve head, this grant
+head contains an LSN and it tracks the physical space usage in the log. While
+this might sound like it is accounting the same state as the reserve grant head
+- and it mostly does track exactly the same location as the reserve grant head -
+there are critical differences in behaviour between them that provide the
+forwards progress guarantees that rolling permanent transactions require.
+
+These differences come into play when a permanent transaction is rolled, its
+internal "log count" reaches zero, and the initial set of unit reservations has
+been exhausted. At this point, we still require a log space reservation to
+continue the next transaction in the sequence, but we have none remaining. We
+cannot sleep during the transaction commit process waiting for new log space to
+become available, as we may end up at the end of the FIFO queue and the items
+we have locked while we sleep could end up pinning the tail of the log before
+there is enough free space in the log to fulfill all of the pending
+reservations and wake the sleeping transaction commits.
+
+To take a new reservation without sleeping requires us to be able to take a
+reservation even if there is no reservation space currently available. That is,
+we need to be able to *overcommit* the log reservation space. As has already
+been detailed, we cannot overcommit physical log space. However, the reserve
+grant head does not track physical space - it only accounts for the amount of
+reservations we currently have outstanding. Hence if the reserve head passes
+over the tail of the log all it means is that new reservations will be throttled
+immediately and remain throttled until the log tail is moved forward far enough
+to remove the overcommit and start taking new reservations. In other words, we
+can overcommit the reserve head without violating the physical log head and tail
+rules.
+
+As a result, permanent transactions only "regrant" reservation space during
+xfs_trans_commit() calls, while the physical log space reservation - tracked by
+the write head - is then reserved separately by a call to xfs_log_reserve()
+after the commit completes. At that point we can sleep waiting for
+physical log space to be reserved from the write grant head, but only if one
+critical rule has been observed::
+
+ Code using permanent reservations must always log the items they hold
+ locked across each transaction they roll in the chain.
+
+"Re-logging" the locked items on every transaction roll ensures that the items
+attached to the transaction chain being rolled are always relocated to the
+physical head of the log and so do not pin the tail of the log. If a locked item
+pins the tail of the log when we sleep on the write reservation, then we will
+deadlock the log as we cannot take the locks needed to write back that item and
+move the tail of the log forwards to free up write grant space. Re-logging the
+locked items avoids this deadlock and guarantees that the log reservation we are
+making cannot self-deadlock.
+
+If all rolling transactions obey this rule, then they can all make forwards
+progress independently because nothing will block the log tail from moving
+forwards, which ensures that write grant space is always (eventually) made
+available to permanent transactions no matter how many times they roll.
+
+
+Re-logging Explained
+====================
+
+XFS allows multiple separate modifications to a single object to be carried in
+the log at any given time. This allows the log to avoid needing to flush each
+change to disk before recording a new change to the object. XFS does this via a
+method called "re-logging". Conceptually, this is quite simple - all it requires
+is that any new change to the object is recorded with a *new copy* of all the
+existing changes in the new transaction that is written to the log.
That is, if we have a sequence of changes A through to F, and the object was
written to disk after change D, we would see in the log the following series
@@ -42,16 +327,13 @@ transaction::
In other words, each time an object is relogged, the new transaction contains
the aggregation of all the previous changes currently held only in the log.
-This relogging technique also allows objects to be moved forward in the log so
-that an object being relogged does not prevent the tail of the log from ever
-moving forward. This can be seen in the table above by the changing
-(increasing) LSN of each subsequent transaction - the LSN is effectively a
-direct encoding of the location in the log of the transaction.
+This relogging technique allows objects to be moved forward in the log so that
+an object being relogged does not prevent the tail of the log from ever moving
+forward. This can be seen in the table above by the changing (increasing) LSN
+of each subsequent transaction, and it's the technique that allows us to
+implement long-running, multiple-commit permanent transactions.
-This relogging is also used to implement long-running, multiple-commit
-transactions. These transaction are known as rolling transactions, and require
-a special log reservation known as a permanent transaction reservation. A
-typical example of a rolling transaction is the removal of extents from an
+A typical example of a rolling transaction is the removal of extents from an
inode which can only be done at a rate of two extents per transaction because
of reservation size limitations. Hence a rolling extent removal transaction
keeps relogging the inode and btree buffers as they get modified in each
@@ -67,12 +349,13 @@ the log over and over again. Worse is the fact that objects tend to get
dirtier as they get relogged, so each subsequent transaction is writing more
metadata into the log.
-Another feature of the XFS transaction subsystem is that most transactions are
-asynchronous. That is, they don't commit to disk until either a log buffer is
-filled (a log buffer can hold multiple transactions) or a synchronous operation
-forces the log buffers holding the transactions to disk. This means that XFS is
-doing aggregation of transactions in memory - batching them, if you like - to
-minimise the impact of the log IO on transaction throughput.
+It should now also be obvious how relogging and asynchronous transactions go
+hand in hand. That is, transactions don't get written to the physical journal
+until either a log buffer is filled (a log buffer can hold multiple
+transactions) or a synchronous operation forces the log buffers holding the
+transactions to disk. This means that XFS is doing aggregation of transactions
+in memory - batching them, if you like - to minimise the impact of the log IO on
+transaction throughput.
The limitation on asynchronous transaction throughput is the number and size of
log buffers made available by the log manager. By default there are 8 log
@@ -268,14 +551,14 @@ Essentially, this shows that an item that is in the AIL can still be modified
and relogged, so any tracking must be separate to the AIL infrastructure. As
such, we cannot reuse the AIL list pointers for tracking committed items, nor
can we store state in any field that is protected by the AIL lock. Hence the
-committed item tracking needs it's own locks, lists and state fields in the log
+committed item tracking needs its own locks, lists and state fields in the log
item.
Similar to the AIL, tracking of committed items is done through a new list
called the Committed Item List (CIL). The list tracks log items that have been
committed and have formatted memory buffers attached to them. It tracks objects
in transaction commit order, so when an object is relogged it is removed from
-it's place in the list and re-inserted at the tail. This is entirely arbitrary
+its place in the list and re-inserted at the tail. This is entirely arbitrary
and done to make it easy for debugging - the last items in the list are the
ones that are most recently modified. Ordering of the CIL is not necessary for
transactional integrity (as discussed in the next section) so the ordering is
@@ -332,7 +615,7 @@ those changes into the current checkpoint context. We then initialise a new
context and attach that to the CIL for aggregation of new transactions.
This allows us to unlock the CIL immediately after transfer of all the
-committed items and effectively allow new transactions to be issued while we
+committed items and effectively allows new transactions to be issued while we
are formatting the checkpoint into the log. It also allows concurrent
checkpoints to be written into the log buffers in the case of log force heavy
workloads, just like the existing transaction commit code does. This, however,
@@ -601,9 +884,9 @@ pin the object the first time it is inserted into the CIL - if it is already in
the CIL during a transaction commit, then we do not pin it again. Because there
can be multiple outstanding checkpoint contexts, we can still see elevated pin
counts, but as each checkpoint completes the pin count will retain the correct
-value according to it's context.
+value according to its context.
-Just to make matters more slightly more complex, this checkpoint level context
+Just to make matters slightly more complex, this checkpoint level context
for the pin count means that the pinning of an item must take place under the
CIL commit/flush lock. If we pin the object outside this lock, we cannot
guarantee which context the pin count is associated with. This is because of
diff --git a/Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst b/Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst
new file mode 100644
index 000000000000..ce4584fb3103
--- /dev/null
+++ b/Documentation/filesystems/xfs/xfs-maintainer-entry-profile.rst
@@ -0,0 +1,194 @@
+XFS Maintainer Entry Profile
+============================
+
+Overview
+--------
+XFS is a well-known, high-performance filesystem in the Linux kernel.
+The aim of this project is to provide and maintain a robust and
+performant filesystem.
+
+Patches are generally merged to the for-next branch of the appropriate
+git repository.
+After a testing period, the for-next branch is merged to the master
+branch.
+
+Kernel code is merged to the xfs-linux tree[0].
+Userspace code is merged to the xfsprogs tree[1].
+Test cases are merged to the xfstests tree[2].
+Ondisk format documentation is merged to the xfs-documentation tree[3].
+
+All patchsets involving XFS *must* be cc'd in their entirety to the mailing
+list linux-xfs@vger.kernel.org.
+
+Roles
+-----
+There are eight key roles in the XFS project.
+A person can take on multiple roles, and a role can be filled by
+multiple people.
+Anyone taking on a role is advised to check in with themselves and
+others on a regular basis about burnout.
+
+- **Outside Contributor**: Anyone who sends a patch but is not involved
+ in the XFS project on a regular basis.
+ These folks are usually people who work on other filesystems or
+ elsewhere in the kernel community.
+
+- **Developer**: Someone who is familiar with the XFS codebase enough to
+ write new code, documentation, and tests.
+
+ Developers can often be found in the IRC channel mentioned by the ``C:``
+ entry in the kernel MAINTAINERS file.
+
+- **Senior Developer**: A developer who is very familiar with at least
+ some part of the XFS codebase and/or other subsystems in the kernel.
+ These people collectively decide the long term goals of the project
+ and nudge the community in that direction.
+ They should help prioritize development and review work for each release
+ cycle.
+
+ Senior developers tend to be more active participants in the IRC channel.
+
+- **Reviewer**: Someone (most likely also a developer) who reads code
+ submissions to decide:
+
+ 0. Is the idea behind the contribution sound?
+ 1. Does the idea fit the goals of the project?
+ 2. Is the contribution designed correctly?
+ 3. Is the contribution polished?
+ 4. Can the contribution be tested effectively?
+
+ Reviewers should identify themselves with an ``R:`` entry in the kernel
+ and fstests MAINTAINERS files.
+
+- **Testing Lead**: This person is responsible for setting the test
+ coverage goals of the project, negotiating with developers to decide
+ on new tests for new features, and making sure that developers and
+ release managers execute on the testing.
+
+ The testing lead should identify themselves with an ``M:`` entry in
+ the XFS section of the fstests MAINTAINERS file.
+
+- **Bug Triager**: Someone who examines incoming bug reports in just
+ enough detail to identify the person to whom the report should be
+ forwarded.
+
+ The bug triagers should identify themselves with a ``B:`` entry in
+ the kernel MAINTAINERS file.
+
+- **Release Manager**: This person merges reviewed patchsets into an
+ integration branch, tests the result locally, pushes the branch to a
+ public git repository, and sends pull requests further upstream.
+ The release manager is not expected to work on new feature patchsets.
+ If a developer and a reviewer fail to reach a resolution on some point,
+ the release manager must have the ability to intervene to try to drive a
+ resolution.
+
+ The release manager should identify themselves with an ``M:`` entry in
+ the kernel MAINTAINERS file.
+
+- **Community Manager**: This person calls and moderates meetings of as many
+ XFS participants as they can get when mailing list discussions prove
+ insufficient for collective decisionmaking.
+ They may also serve as liaison between managers of the organizations
+ sponsoring work on any part of XFS.
+
+- **LTS Maintainer**: Someone who backports and tests bug fixes from
+ upstream to the LTS kernels.
+ There tend to be six separate LTS trees at any given time.
+
+ The maintainer for a given LTS release should identify themselves with an
+ ``M:`` entry in the MAINTAINERS file for that LTS tree.
+ Unmaintained LTS kernels should be marked with status ``S: Orphan`` in that
+ same file.
+
+Submission Checklist Addendum
+-----------------------------
+Please follow these additional rules when submitting to XFS:
+
+- Patches affecting only the filesystem itself should be based against
+ the latest -rc or the for-next branch.
+ These patches will be merged back to the for-next branch.
+
+- Authors of patches touching other subsystems need to coordinate with
+ the maintainers of XFS and the relevant subsystems to decide how to
+ proceed with a merge.
+
+- Any patchset changing XFS should be cc'd in its entirety to linux-xfs.
+ Do not send partial patchsets; that makes analysis of the broader
+ context of the changes unnecessarily difficult.
+
+- Anyone making kernel changes that have corresponding changes to the
+ userspace utilities should send the userspace changes as separate
+ patchsets immediately after the kernel patchsets.
+
+- Authors of bug fix patches are expected to use fstests[2] to perform
+ an A/B test of the patch to determine that there are no regressions.
+ When possible, a new regression test case should be written for
+ fstests.
+
+- Authors of new feature patchsets must ensure that fstests will have
+ appropriate functional and input corner-case test cases for the new
+ feature.
+
+- When implementing a new feature, it is strongly suggested that the
+ developers write a design document to answer the following questions:
+
+ * **What** problem is this trying to solve?
+
+ * **Who** will benefit from this solution, and **where** will they
+ access it?
+
+ * **How** will this new feature work? This should touch on major data
+ structures and algorithms supporting the solution at a higher level
+ than code comments.
+
+ * **What** userspace interfaces are necessary to build off of the new
+ features?
+
+ * **How** will this work be tested to ensure that it solves the
+ problems laid out in the design document without causing new
+ problems?
+
+ The design document should be committed in the kernel documentation
+ directory.
+ It may be omitted if the feature is already well known to the
+ community.
+
+- Patchsets for the new tests should be submitted as separate patchsets
+ immediately after the kernel and userspace code patchsets.
+
+- Changes to the on-disk format of XFS must be described in the ondisk
+ format document[3] and submitted as a patchset after the fstests
+ patchsets.
+
+- Patchsets implementing bug fixes and further code cleanups should put
+ the bug fixes at the beginning of the series to ease backporting.
+
+Key Release Cycle Dates
+-----------------------
+Bug fixes may be sent at any time, though the release manager may decide to
+defer a patch when the next merge window is close.
+
+Code submissions targeting the next merge window should be sent between
+-rc1 and -rc6.
+This gives the community time to review the changes, to suggest other changes,
+and for the author to retest those changes.
+
+Code submissions also requiring changes to fs/iomap and targeting the
+next merge window should be sent between -rc1 and -rc4.
+This allows the broader kernel community adequate time to test the
+infrastructure changes.
+
+Review Cadence
+--------------
+In general, please wait at least one week before pinging for feedback.
+To find reviewers, either consult the MAINTAINERS file, or ask
+developers that have Reviewed-by tags for XFS changes to take a look and
+offer their opinion.
+
+References
+----------
+| [0] https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/
+| [1] https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/
+| [2] https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/
+| [3] https://git.kernel.org/pub/scm/fs/xfs/xfs-documentation.git/
diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
new file mode 100644
index 000000000000..e231d127cd40
--- /dev/null
+++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
@@ -0,0 +1,5503 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _xfs_online_fsck_design:
+
+..
+ Mapping of heading styles within this document:
+ Heading 1 uses "====" above and below
+ Heading 2 uses "===="
+ Heading 3 uses "----"
+ Heading 4 uses "````"
+ Heading 5 uses "^^^^"
+ Heading 6 uses "~~~~"
+ Heading 7 uses "...."
+
+ Sections are manually numbered because apparently that's what everyone
+ does in the kernel.
+
+======================
+XFS Online Fsck Design
+======================
+
+This document captures the design of the online filesystem check feature for
+XFS.
+The purpose of this document is threefold:
+
+- To help kernel distributors understand exactly what the XFS online fsck
+ feature is, and issues about which they should be aware.
+
+- To help people reading the code to familiarize themselves with the relevant
+ concepts and design points before they start digging into the code.
+
+- To help developers maintaining the system by capturing the reasons
+ supporting higher level decision making.
+
+As the online fsck code is merged, the links in this document to topic branches
+will be replaced with links to code.
+
+This document is licensed under the terms of the GNU General Public License,
+v2.
+The primary author is Darrick J. Wong.
+
+This design document is split into seven parts.
+Part 1 defines what fsck tools are and the motivations for writing a new one.
+Parts 2 and 3 present a high level overview of how the online fsck process works
+and how it is tested to ensure correct functionality.
+Part 4 discusses the user interface and the intended usage modes of the new
+program.
+Parts 5 and 6 show off the high level components and how they fit together, and
+then present case studies of how each repair function actually works.
+Part 7 sums up what has been discussed so far and speculates about what else
+might be built atop online fsck.
+
+.. contents:: Table of Contents
+ :local:
+
+1. What is a Filesystem Check?
+==============================
+
+A Unix filesystem has four main responsibilities:
+
+- Provide a hierarchy of names through which application programs can associate
+ arbitrary blobs of data for any length of time,
+
+- Virtualize physical storage media across those names,
+
+- Retrieve the named data blobs at any time, and
+
+- Examine resource usage.
+
+Metadata directly supporting these functions (e.g. files, directories, space
+mappings) are sometimes called primary metadata.
+Secondary metadata (e.g. reverse mapping and directory parent pointers) support
+operations internal to the filesystem, such as internal consistency checking
+and reorganization.
+Summary metadata, as the name implies, condense information contained in
+primary metadata for performance reasons.
+
+The filesystem check (fsck) tool examines all the metadata in a filesystem
+to look for errors.
+In addition to looking for obvious metadata corruptions, fsck also
+cross-references different types of metadata records with each other to look
+for inconsistencies.
+People do not like losing data, so most fsck tools also contain some ability
+to correct any problems found.
+As a word of caution -- the primary goal of most Linux fsck tools is to restore
+the filesystem metadata to a consistent state, not to maximize the data
+recovered.
+That precedent will not be challenged here.
+
+Filesystems of the 20th century generally lacked any redundancy in the ondisk
+format, which means that fsck can only respond to errors by erasing files until
+errors are no longer detected.
+More recent filesystem designs contain enough redundancy in their metadata that
+it is now possible to regenerate data structures when non-catastrophic errors
+occur; this capability aids both strategies.
+
++--------------------------------------------------------------------------+
+| **Note**: |
++--------------------------------------------------------------------------+
+| System administrators avoid data loss by increasing the number of |
+| separate storage systems through the creation of backups; and they avoid |
+| downtime by increasing the redundancy of each storage system through the |
+| creation of RAID arrays. |
+| fsck tools address only the first problem. |
++--------------------------------------------------------------------------+
+
+TLDR; Show Me the Code!
+-----------------------
+
+Code is posted to the kernel.org git trees as follows:
+`kernel changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_,
+`userspace changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, and
+`QA test changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
+Each kernel patchset adding an online repair function will use the same branch
+name across the kernel, xfsprogs, and fstests git repos.
+
+Existing Tools
+--------------
+
+The online fsck tool described here will be the third tool in the history of
+XFS (on Linux) to check and repair filesystems.
+Two programs precede it:
+
+The first program, ``xfs_check``, was created as part of the XFS debugger
+(``xfs_db``) and can only be used with unmounted filesystems.
+It walks all metadata in the filesystem looking for inconsistencies in the
+metadata, though it lacks any ability to repair what it finds.
+Due to its high memory requirements and inability to repair things, this
+program is now deprecated and will not be discussed further.
+
+The second program, ``xfs_repair``, was created to be faster and more robust
+than the first program.
+Like its predecessor, it can only be used with unmounted filesystems.
+It uses extent-based in-memory data structures to reduce memory consumption,
+and tries to schedule readahead IO appropriately to reduce I/O waiting time
+while it scans the metadata of the entire filesystem.
+The most important feature of this tool is its ability to respond to
+inconsistencies in file metadata and the directory tree by erasing things as needed
+to eliminate problems.
+Space usage metadata are rebuilt from the observed file metadata.
+
+Problem Statement
+-----------------
+
+The current XFS tools leave several problems unsolved:
+
+1. **User programs** suddenly **lose access** to the filesystem when unexpected
+ shutdowns occur as a result of silent corruptions in the metadata.
+ These occur **unpredictably** and often without warning.
+
+2. **Users** experience a **total loss of service** during the recovery period
+ after an **unexpected shutdown** occurs.
+
+3. **Users** experience a **total loss of service** if the filesystem is taken
+ offline to **look for problems** proactively.
+
+4. **Data owners** cannot **check the integrity** of their stored data without
+ reading all of it.
+ This may expose them to substantial billing costs when a linear media scan
+ performed by the storage system administrator might suffice.
+
+5. **System administrators** cannot **schedule** a maintenance window to deal
+ with corruptions if they **lack the means** to assess filesystem health
+ while the filesystem is online.
+
+6. **Fleet monitoring tools** cannot **automate periodic checks** of filesystem
+ health when doing so requires **manual intervention** and downtime.
+
+7. **Users** can be tricked into **doing things they do not desire** when
+ malicious actors **exploit quirks of Unicode** to place misleading names
+ in directories.
+
+Given this definition of the problems to be solved and the actors who would
+benefit, the proposed solution is a third fsck tool that acts on a running
+filesystem.
+
+This new third program has three components: an in-kernel facility to check
+metadata, an in-kernel facility to repair metadata, and a userspace driver
+program to drive fsck activity on a live filesystem.
+``xfs_scrub`` is the name of the driver program.
+The rest of this document presents the goals and use cases of the new fsck
+tool, describes its major design points in connection to those goals, and
+discusses the similarities and differences with existing tools.
+
++--------------------------------------------------------------------------+
+| **Note**: |
++--------------------------------------------------------------------------+
+| Throughout this document, the existing offline fsck tool can also be |
+| referred to by its current name "``xfs_repair``". |
+| The userspace driver program for the new online fsck tool can be |
+| referred to as "``xfs_scrub``". |
+| The kernel portion of online fsck that validates metadata is called |
+| "online scrub", and portion of the kernel that fixes metadata is called |
+| "online repair". |
++--------------------------------------------------------------------------+
+
+The naming hierarchy is broken up into objects known as directories and files,
+and the physical space is split into pieces known as allocation groups.
+Sharding enables better performance on highly parallel systems and helps to
+contain the damage when corruptions occur.
+The division of the filesystem into principal objects (allocation groups and
+inodes) means that there are ample opportunities to perform targeted checks and
+repairs on a subset of the filesystem.
+
+While a check or repair is underway, the rest of the filesystem continues
+processing IO requests.
+Even if a piece of filesystem metadata can only be regenerated by scanning the
+entire system, the scan can still be done in the background while other file
+operations continue.
+
+In summary, online fsck takes advantage of resource sharding and redundant
+metadata to enable targeted checking and repair operations while the system
+is running.
+This capability will be coupled to automatic system management so that
+autonomous self-healing of XFS maximizes service availability.
+
+2. Theory of Operation
+======================
+
+Because it is necessary for online fsck to lock and scan live metadata objects,
+online fsck consists of three separate code components.
+The first is the userspace driver program ``xfs_scrub``, which is responsible
+for identifying individual metadata items, scheduling work items for them,
+reacting to the outcomes appropriately, and reporting results to the system
+administrator.
+The second and third are in the kernel, which implements functions to check
+and repair each type of online fsck work item.
+
++------------------------------------------------------------------+
+| **Note**: |
++------------------------------------------------------------------+
+| For brevity, this document shortens the phrase "online fsck work |
+| item" to "scrub item". |
++------------------------------------------------------------------+
+
+Scrub item types are delineated in a manner consistent with the Unix design
+philosophy, which is to say that each item should handle one aspect of a
+metadata structure, and handle it well.
+
+Scope
+-----
+
+In principle, online fsck should be able to check and to repair everything that
+the offline fsck program can handle.
+However, online fsck cannot be running 100% of the time, which means that
+latent errors may creep in after a scrub completes.
+If these errors cause the next mount to fail, offline fsck is the only
+solution.
+This limitation means that maintenance of the offline fsck tool will continue.
+A second limitation of online fsck is that it must follow the same resource
+sharing and lock acquisition rules as the regular filesystem.
+This means that scrub cannot take *any* shortcuts to save time, because doing
+so could lead to concurrency problems.
+In other words, online fsck is not a complete replacement for offline fsck, and
+a complete run of online fsck may take longer than a run of offline fsck.
+However, both of these limitations are acceptable tradeoffs to satisfy the
+different motivations of online fsck, which are to **minimize system downtime**
+and to **increase predictability of operation**.
+
+.. _scrubphases:
+
+Phases of Work
+--------------
+
+The userspace driver program ``xfs_scrub`` splits the work of checking and
+repairing an entire filesystem into seven phases.
+Each phase concentrates on checking specific types of scrub items and depends
+on the success of all previous phases.
+The seven phases are as follows:
+
+1. Collect geometry information about the mounted filesystem and computer,
+ discover the online fsck capabilities of the kernel, and open the
+ underlying storage devices.
+
+2. Check allocation group metadata, all realtime volume metadata, and all quota
+ files.
+ Each metadata structure is scheduled as a separate scrub item.
+ If corruption is found in the inode header or inode btree and ``xfs_scrub``
+ is permitted to perform repairs, then those scrub items are repaired to
+ prepare for phase 3.
+ Repairs are implemented by using the information in the scrub item to
+ resubmit the kernel scrub call with the repair flag enabled; this is
+ discussed in the next section.
+ Optimizations and all other repairs are deferred to phase 4.
+
+3. Check all metadata of every file in the filesystem.
+ Each metadata structure is also scheduled as a separate scrub item.
+ If repairs are needed and ``xfs_scrub`` is permitted to perform repairs,
+ and there were no problems detected during phase 2, then those scrub items
+ are repaired immediately.
+ Optimizations, deferred repairs, and unsuccessful repairs are deferred to
+ phase 4.
+
+4. All remaining repairs and scheduled optimizations are performed during this
+ phase, if the caller permits them.
+ Before starting repairs, the summary counters are checked and any necessary
+ repairs are performed so that subsequent repairs will not fail the resource
+ reservation step due to wildly incorrect summary counters.
+ Unsuccessful repairs are requeued as long as forward progress on repairs is
+ made somewhere in the filesystem.
+ Free space in the filesystem is trimmed at the end of phase 4 if the
+ filesystem is clean.
+
+5. By the start of this phase, all primary and secondary filesystem metadata
+ must be correct.
+ Summary counters such as the free space counts and quota resource counts
+ are checked and corrected.
+ Directory entry names and extended attribute names are checked for
+ suspicious entries such as control characters or confusing Unicode sequences
+ appearing in names.
+
+6. If the caller asks for a media scan, read all allocated and written data
+ file extents in the filesystem.
+ The ability to use hardware-assisted data file integrity checking is new
+to online fsck; neither of the previous tools has this capability.
+ If media errors occur, they will be mapped to the owning files and reported.
+
+7. Re-check the summary counters and present the caller with a summary of
+ space usage and file counts.
+
+This allocation of responsibilities will be :ref:`revisited <scrubcheck>`
+later in this document.
+
+Steps for Each Scrub Item
+-------------------------
+
+The kernel scrub code uses a three-step strategy for checking and repairing
+the one aspect of a metadata object represented by a scrub item (sketched in
+code after this list):
+
+1. The scrub item of interest is checked for corruptions; opportunities for
+ optimization; and for values that are directly controlled by the system
+ administrator but look suspicious.
+ If the item is not corrupt or does not need optimization, resources are
+ released and the positive scan results are returned to userspace.
+ If the item is corrupt or could be optimized but the caller does not permit
+ this, resources are released and the negative scan results are returned to
+ userspace.
+ Otherwise, the kernel moves on to the second step.
+
+2. The repair function is called to rebuild the data structure.
+ Repair functions generally choose to rebuild a structure from other metadata
+ rather than try to salvage the existing structure.
+ If the repair fails, the scan results from the first step are returned to
+ userspace.
+ Otherwise, the kernel moves on to the third step.
+
+3. In the third step, the kernel runs the same checks over the new metadata
+ item to assess the efficacy of the repairs.
+ The results of the reassessment are returned to userspace.
+
+Classification of Metadata
+--------------------------
+
+Each type of metadata object (and therefore each type of scrub item) is
+classified as follows:
+
+Primary Metadata
+````````````````
+
+Metadata structures in this category should be most familiar to filesystem
+users either because they are directly created by the user or they index
+objects created by the user.
+Most filesystem objects fall into this class:
+
+- Free space and reference count information
+
+- Inode records and indexes
+
+- Storage mapping information for file data
+
+- Directories
+
+- Extended attributes
+
+- Symbolic links
+
+- Quota limits
+
+Scrub obeys the same rules as regular filesystem accesses for resource and lock
+acquisition.
+
+Primary metadata objects are the simplest for scrub to process.
+The principal filesystem object (either an allocation group or an inode) that
+owns the item being scrubbed is locked to guard against concurrent updates.
+The check function examines every record associated with the type for obvious
+errors and cross-references healthy records against other metadata to look for
+inconsistencies.
+Repairs for this class of scrub item are simple, since the repair function
+starts by holding all the resources acquired in the previous step.
+The repair function scans available metadata as needed to record all the
+observations needed to complete the structure.
+Next, it stages the observations in a new ondisk structure and commits it
+atomically to complete the repair.
+Finally, the storage from the old data structure is carefully reaped.
+
+Because ``xfs_scrub`` locks a primary object for the duration of the repair,
+this is effectively an offline repair operation performed on a subset of the
+filesystem.
+This minimizes the complexity of the repair code because it is not necessary to
+handle concurrent updates from other threads, nor is it necessary to access
+any other part of the filesystem.
+As a result, indexed structures can be rebuilt very quickly, and programs
+trying to access the damaged structure will be blocked until repairs complete.
+The only infrastructure needed by the repair code is the staging area for
+observations and a means to write new structures to disk.
+Despite these limitations, the advantage that online repair holds is clear:
+targeted work on individual shards of the filesystem avoids total loss of
+service.
+
+This mechanism is described in section 2.1 ("Off-Line Algorithm") of
+V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction
+Algorithms" <https://minds.wisconsin.edu/bitstream/handle/1793/59524/TR1047.pdf>`_,
+*Extending Database Technology*, pp. 293-309, 1992.
+
+Most primary metadata repair functions stage their intermediate results in an
+in-memory array prior to formatting the new ondisk structure, which is very
+similar to the list-based algorithm discussed in section 2.3 ("List-Based
+Algorithms") of Srinivasan.
+However, any data structure builder that maintains a resource lock for the
+duration of the repair is *always* an offline algorithm.
+
+.. _secondary_metadata:
+
+Secondary Metadata
+``````````````````
+
+Metadata structures in this category reflect records found in primary metadata,
+but are only needed for online fsck or for reorganization of the filesystem.
+
+Secondary metadata include:
+
+- Reverse mapping information
+
+- Directory parent pointers
+
+This class of metadata is difficult for scrub to process because scrub attaches
+to the secondary object but needs to check primary metadata, which runs counter
+to the usual order of resource acquisition.
+Frequently, this means that full filesystem scans are necessary to rebuild the
+metadata.
+Check functions can be limited in scope to reduce runtime.
+Repairs, however, require a full scan of primary metadata, which can take a
+long time to complete.
+Under these conditions, ``xfs_scrub`` cannot lock resources for the entire
+duration of the repair.
+
+Instead, repair functions set up an in-memory staging structure to store
+observations.
+Depending on the requirements of the specific repair function, the staging
+index will either have the same format as the ondisk structure or a design
+specific to that repair function.
+The next step is to release all locks and start the filesystem scan.
+When the repair scanner needs to record an observation, the staging data are
+locked long enough to apply the update.
+While the filesystem scan is in progress, the repair function hooks the
+filesystem so that it can apply pending filesystem updates to the staging
+information.
+Once the scan is done, the owning object is re-locked, the live data is used to
+write a new ondisk structure, and the repairs are committed atomically.
+The hooks are disabled and the staging area is freed.
+Finally, the storage from the old data structure is carefully reaped.
+
+Introducing concurrency helps online repair avoid various locking problems, but
+comes at a high cost to code complexity.
+Live filesystem code has to be hooked so that the repair function can observe
+updates in progress.
+The staging area has to become a fully functional parallel structure so that
+updates can be merged from the hooks.
+Finally, the hook, the filesystem scan, and the inode locking model must be
+sufficiently well integrated that a hook event can decide if a given update
+should be applied to the staging structure.
+
+In theory, the scrub implementation could apply these same techniques for
+primary metadata, but doing so would make it massively more complex and less
+performant.
+Programs attempting to access the damaged structures are not blocked from
+operation, which may cause application failure or an unplanned filesystem
+shutdown.
+
+Inspiration for the secondary metadata repair strategy was drawn from section
+2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without Side-File")
+and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms for
+Creating Indexes for Very Large Tables Without Quiescing Updates"
+<https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992.
+
+The sidecar index mentioned above bears some resemblance to the side file
+method mentioned in Srinivasan and Mohan.
+Their method consists of an index builder that extracts relevant record data to
+build the new structure as quickly as possible; and an auxiliary structure that
+captures all updates that would be committed to the index by other threads were
+the new index already online.
+After the index building scan finishes, the updates recorded in the side file
+are applied to the new index.
+To avoid conflicts between the index builder and other writer threads, the
+builder maintains a publicly visible cursor that tracks the progress of the
+scan through the record space.
+To avoid duplication of work between the side file and the index builder, side
+file updates are elided when the record ID for the update is greater than the
+cursor position within the record ID space.
+
+To minimize changes to the rest of the codebase, XFS online repair keeps the
+replacement index hidden until it's completely ready to go.
+In other words, there is no attempt to expose the keyspace of the new index
+while repair is running.
+The complexity of such an approach would be very high and perhaps more
+appropriate to building *new* indices.
+
+**Future Work Question**: Can the full scan and live update code used to
+facilitate a repair also be used to implement a comprehensive check?
+
+*Answer*: In theory, yes. Check would be much stronger if each scrub function
+employed these live scans to build a shadow copy of the metadata and then
+compared the shadow records to the ondisk records.
+However, doing that is a fair amount more work than what the checking functions
+do now, and it would increase the runtime of those scrub functions.
+The live scans and hooks were also developed much later than the original
+checking code.
+
+Summary Information
+```````````````````
+
+Metadata structures in this last category summarize the contents of primary
+metadata records.
+These are often used to speed up resource usage queries, and are many times
+smaller than the primary metadata which they represent.
+
+Examples of summary information include:
+
+- Summary counts of free space and inodes
+
+- File link counts from directories
+
+- Quota resource usage counts
+
+Check and repair require full filesystem scans, but resource and lock
+acquisition follow the same paths as regular filesystem accesses.
+
+The superblock summary counters have special requirements due to the underlying
+implementation of the incore counters, and will be treated separately.
+Check and repair of the other types of summary counters (quota resource counts
+and file link counts) employ the same filesystem scanning and hooking
+techniques as outlined above, but because the underlying data are sets of
+integer counters, the staging data need not be a fully functional mirror of the
+ondisk structure.
+
+Inspiration for quota and file link count repair strategies was drawn from
+sections 2.12 ("Online Index Operations") through 2.14 ("Incremental View
+Maintenance") of G. Graefe, `"Concurrent Queries and Updates in Summary Views
+and Their Indexes"
+<http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011.
+
+Since quotas are non-negative integer counts of resource usage, online
+quotacheck can use the incremental view deltas described in section 2.14 to
+track pending changes to the block and inode usage counts in each transaction,
+and commit those changes to a dquot side file when the transaction commits.
+Delta tracking is necessary for dquots because the index builder scans inodes,
+whereas the data structure being rebuilt is an index of dquots.
+Link count checking combines the view deltas and commit step into one because
+it sets attributes of the objects being scanned instead of writing them to a
+separate data structure.
+Each online fsck function will be discussed as case studies later in this
+document.
+
+Risk Management
+---------------
+
+During the development of online fsck, several risk factors were identified
+that may make the feature unsuitable for certain distributors and users.
+Steps can be taken to mitigate or eliminate those risks, though at a cost to
+functionality.
+
+- **Decreased performance**: Adding metadata indices to the filesystem
+ increases the time cost of persisting changes to disk, and the reverse space
+ mapping and directory parent pointers are no exception.
+ System administrators who require the maximum performance can disable the
+ reverse mapping features at format time, though this choice dramatically
+ reduces the ability of online fsck to find inconsistencies and repair them.
+
+- **Incorrect repairs**: As with all software, there might be defects in the
+ software that result in incorrect repairs being written to the filesystem.
+ Systematic fuzz testing (detailed in the next section) is employed by the
+ authors to find bugs early, but it might not catch everything.
+ The kernel build system provides Kconfig options (``CONFIG_XFS_ONLINE_SCRUB``
+ and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to choose not to
+ accept this risk.
+ The xfsprogs build system has a configure option (``--enable-scrub=no``) that
+ disables building of the ``xfs_scrub`` binary, though this is not a risk
+ mitigation if the kernel functionality remains enabled.
+
+- **Inability to repair**: Sometimes, a filesystem is too badly damaged to be
+ repairable.
+ If the keyspaces of several metadata indices overlap in some manner but a
+ coherent narrative cannot be formed from records collected, then the repair
+ fails.
+ To reduce the chance that a repair will fail with a dirty transaction and
+ render the filesystem unusable, the online repair functions have been
+ designed to stage and validate all new records before committing the new
+ structure.
+
+- **Misbehavior**: Online fsck requires many privileges -- raw IO to block
+ devices, opening files by handle, ignoring Unix discretionary access control,
+ and the ability to perform administrative changes.
+ Running this automatically in the background scares people, so the systemd
+ background service is configured to run with only the privileges required.
+ Obviously, this cannot address certain problems like the kernel crashing or
+ deadlocking, but it should be sufficient to prevent the scrub process from
+ escaping and reconfiguring the system.
+ The cron job does not have this protection.
+
+- **Fuzz Kiddiez**: There are many people now who seem to think that running
+ automated fuzz testing of ondisk artifacts to find mischievous behavior and
+ spraying exploit code onto the public mailing list for instant zero-day
+ disclosure is somehow of some social benefit.
+ In the view of this author, the benefit is realized only when the fuzz
+ operators help to **fix** the flaws, but this opinion apparently is not
+ widely shared among security "researchers".
+ The XFS maintainers' continuing ability to manage these events presents an
+ ongoing risk to the stability of the development process.
+ Automated testing should front-load some of the risk while the feature is
+ considered EXPERIMENTAL.
+
+Many of these risks are inherent to software programming.
+Despite this, it is hoped that this new functionality will prove useful in
+reducing unexpected downtime.
+
+3. Testing Plan
+===============
+
+As stated before, fsck tools have three main goals:
+
+1. Detect inconsistencies in the metadata;
+
+2. Eliminate those inconsistencies; and
+
+3. Minimize further loss of data.
+
+Demonstrations of correct operation are necessary to build users' confidence
+that the software behaves within expectations.
+Unfortunately, it was not really feasible to perform regular exhaustive testing
+of every aspect of a fsck tool until the introduction of low-cost virtual
+machines with high-IOPS storage.
+With ample hardware availability in mind, the testing strategy for the online
+fsck project involves differential analysis against the existing fsck tools and
+systematic testing of every attribute of every type of metadata object.
+Testing can be split into four major categories, as discussed below.
+
+Integrated Testing with fstests
+-------------------------------
+
+The primary goal of any free software QA effort is to make testing as
+inexpensive and widespread as possible to maximize the scaling advantages of
+the community.
+In other words, testing should maximize the breadth of filesystem configuration
+scenarios and hardware setups.
+This improves code quality by enabling the authors of online fsck to find and
+fix bugs early, and helps developers of new features to find integration
+issues earlier in their development effort.
+
+The Linux filesystem community shares a common QA testing suite,
+`fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
+functional and regression testing.
+Even before development work began on online fsck, fstests (when run on XFS)
+would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
+scratch filesystems between each test.
+This provides a level of assurance that the kernel and the fsck tools stay in
+alignment about what constitutes consistent metadata.
+During development of the online checking code, fstests was modified to run
+``xfs_scrub -n`` between each test to ensure that the new checking code
+produces the same results as the two existing fsck tools.
+
+To start development of online repair, fstests was modified to run
+``xfs_repair`` to rebuild the filesystem's metadata indices between tests.
+This ensures that offline repair does not crash, leave a corrupt filesystem
+after it exits, or trigger complaints from the online check.
+This also established a baseline for what can and cannot be repaired offline.
+To complete the first phase of development of online repair, fstests was
+modified to be able to run ``xfs_scrub`` in a "force rebuild" mode.
+This enables a comparison of the effectiveness of online repair as compared to
+the existing offline repair tools.
+
+General Fuzz Testing of Metadata Blocks
+---------------------------------------
+
+XFS benefits greatly from having a very robust debugging tool, ``xfs_db``.
+
+Before development of online fsck even began, a set of fstests were created
+to test the rather common fault that entire metadata blocks get corrupted.
+This required the creation of fstests library code that can create a filesystem
+containing every possible type of metadata object.
+Next, individual test cases were created to create a test filesystem, identify
+a single block of a specific type of metadata object, trash it with the
+existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a
+particular metadata validation strategy.
+
+This earlier test suite enabled XFS developers to test the ability of the
+in-kernel validation functions and the ability of the offline fsck tool to
+detect and eliminate the inconsistent metadata.
+This part of the test suite was extended to cover online fsck in exactly the
+same manner.
+
+In other words, for a given fstests filesystem configuration:
+
+* For each metadata object existing on the filesystem:
+
+ * Write garbage to it
+
+ * Test the reactions of:
+
+ 1. The kernel verifiers to stop obviously bad metadata
+ 2. Offline repair (``xfs_repair``) to detect and fix
+ 3. Online repair (``xfs_scrub``) to detect and fix
+
+Targeted Fuzz Testing of Metadata Records
+-----------------------------------------
+
+The testing plan for online fsck includes extending the existing fs testing
+infrastructure to provide a much more powerful facility: targeted fuzz testing
+of every metadata field of every metadata object in the filesystem.
+``xfs_db`` can modify every field of every metadata structure in every
+block in the filesystem to simulate the effects of memory corruption and
+software bugs.
+Given that fstests already contains the ability to create a filesystem
+containing every metadata format known to the filesystem, ``xfs_db`` can be
+used to perform exhaustive fuzz testing!
+
+For a given fstests filesystem configuration:
+
+* For each metadata object existing on the filesystem...
+
+ * For each record inside that metadata object...
+
+ * For each field inside that record...
+
+ * For each conceivable type of transformation that can be applied to a bit field...
+
+ 1. Clear all bits
+ 2. Set all bits
+ 3. Toggle the most significant bit
+ 4. Toggle the middle bit
+ 5. Toggle the least significant bit
+ 6. Add a small quantity
+ 7. Subtract a small quantity
+ 8. Randomize the contents
+
+ * ...test the reactions of:
+
+ 1. The kernel verifiers to stop obviously bad metadata
+ 2. Offline checking (``xfs_repair -n``)
+ 3. Offline repair (``xfs_repair``)
+ 4. Online checking (``xfs_scrub -n``)
+ 5. Online repair (``xfs_scrub``)
+ 6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed)
+
+This is quite the combinatoric explosion!
+
+Fortunately, having this much test coverage makes it easy for XFS developers to
+check the responses of XFS' fsck tools.
+Since the introduction of the fuzz testing framework, these tests have been
+used to discover incorrect repair code and missing functionality for entire
+classes of metadata objects in ``xfs_repair``.
+The enhanced testing was used to finalize the deprecation of ``xfs_check`` by
+confirming that ``xfs_repair`` could detect at least as many corruptions as
+the older tool.
+
+These tests have been very valuable for ``xfs_scrub`` in the same ways -- they
+allow the online fsck developers to compare online fsck against offline fsck,
+and they enable XFS developers to find deficiencies in the code base.
+
+Proposed patchsets include
+`general fuzzer improvements
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_,
+`fuzzing baselines
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
+and `improvements in fuzz testing comprehensiveness
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_.
+
+Stress Testing
+--------------
+
+A requirement unique to online fsck is the ability to operate on a filesystem
+concurrently with regular workloads.
+Although it is of course impossible to run ``xfs_scrub`` with *zero* observable
+impact on the running system, the online repair code should never introduce
+inconsistencies into the filesystem metadata, and regular workloads should
+never notice resource starvation.
+To verify that these conditions are being met, fstests has been enhanced in
+the following ways:
+
+* For each scrub item type, create a test to exercise checking that item type
+ while running ``fsstress``.
+* For each scrub item type, create a test to exercise repairing that item type
+ while running ``fsstress``.
+* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
+ filesystem doesn't cause problems.
+* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
+ force-repairing the whole filesystem doesn't cause problems.
+* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
+ freezing and thawing the filesystem.
+* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
+ remounting the filesystem read-only and read-write.
+* The same, but running ``fsx`` instead of ``fsstress``. (Not done yet?)
+
+Success is defined by the ability to run all of these tests without observing
+any unexpected filesystem shutdowns due to corrupted metadata, kernel hang
+check warnings, or any other sort of mischief.
+
+Proposed patchsets include `general stress testing
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_
+and the `evolution of existing per-function stress testing
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.
+
+4. User Interface
+=================
+
+The primary user of online fsck is the system administrator, just like offline
+repair.
+Online fsck presents two modes of operation to administrators:
+a foreground CLI process for online fsck on demand, and a background service
+that performs autonomous checking and repair.
+
+Checking on Demand
+------------------
+
+For administrators who want the absolute freshest information about the
+metadata in a filesystem, ``xfs_scrub`` can be run as a foreground process on
+a command line.
+The program checks every piece of metadata in the filesystem while the
+administrator waits for the results to be reported, just like the existing
+``xfs_repair`` tool.
+Both tools share a ``-n`` option to perform a read-only scan, and a ``-v``
+option to increase the verbosity of the information reported.
+
+A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error
+correction capabilities of the hardware to check data file contents.
+The media scan is not enabled by default because it may dramatically increase
+program runtime and consume a lot of bandwidth on older storage hardware.
+
+The output of a foreground invocation is captured in the system log.
+
+The ``xfs_scrub_all`` program walks the list of mounted filesystems and
+initiates ``xfs_scrub`` for each of them in parallel.
+It serializes scans for any filesystems that resolve to the same top level
+kernel block device to prevent resource overconsumption.
+
+Background Service
+------------------
+
+To reduce the workload of system administrators, the ``xfs_scrub`` package
+provides a suite of `systemd <https://systemd.io/>`_ timers and services that
+run online fsck automatically on weekends by default.
+The background service configures scrub to run with as little privilege as
+possible, the lowest CPU and IO priority, and in a CPU-constrained
+single-threaded mode.
+This can be tuned by the systemd administrator at any time to suit the latency
+and throughput requirements of customer workloads.
+
+The output of the background service is also captured in the system log.
+If desired, reports of failures (either due to inconsistencies or mere runtime
+errors) can be emailed automatically by setting the ``EMAIL_ADDR`` environment
+variable in the following service files:
+
+* ``xfs_scrub_fail@.service``
+* ``xfs_scrub_media_fail@.service``
+* ``xfs_scrub_all_fail.service``
+
+The decision to enable the background scan is left to the system administrator.
+This can be done by enabling either of the following services:
+
+* ``xfs_scrub_all.timer`` on systemd systems
+* ``xfs_scrub_all.cron`` on non-systemd systems
+
+This automatic weekly scan is configured out of the box to perform an
+additional media scan of all file data once per month.
+This is less foolproof than, say, storing file data block checksums, but much
+more performant if application software provides its own integrity checking,
+redundancy can be provided elsewhere above the filesystem, or the storage
+device's integrity guarantees are deemed sufficient.
+
+The systemd unit file definitions have been subjected to a security audit
+(as of systemd 249) to ensure that the xfs_scrub processes have as little
+access to the rest of the system as possible.
+This was performed via ``systemd-analyze security``, after which privileges
+were restricted to the minimum required; sandboxing and system call filtering
+were enabled to the maximal extent possible; and access to the filesystem
+tree was restricted to the minimum needed to start the program and access the
+filesystem being scanned.
+The service definition files restrict CPU usage to 80% of one CPU core, and
+apply the lowest possible priority to IO and CPU scheduling.
+This measure was taken to minimize delays in the rest of the filesystem.
+No such hardening has been performed for the cron job.
+
+Proposed patchset:
+`Enabling the xfs_scrub background service
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_.
+
+Health Reporting
+----------------
+
+XFS caches a summary of each filesystem's health status in memory.
+The information is updated whenever ``xfs_scrub`` is run, or whenever
+inconsistencies are detected in the filesystem metadata during regular
+operations.
+System administrators should use the ``health`` command of ``xfs_spaceman`` to
+retrieve this information in a human-readable format.
+If problems have been observed, the administrator can schedule a reduced
+service window to run the online repair tool to correct the problem.
+Failing that, the administrator can decide to schedule a maintenance window to
+run the traditional offline repair tool to correct the problem.
+
+**Future Work Question**: Should the health reporting integrate with the new
+inotify fs error notification system?
+Would it be helpful for sysadmins to have a daemon to listen for corruption
+notifications and initiate a repair?
+
+*Answer*: These questions remain unanswered, but should be a part of the
+conversation with early adopters and potential downstream users of XFS.
+
+Proposed patchsets include
+`wiring up health reports to correction returns
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-reports>`_
+and
+`preservation of sickness info during memory reclaim
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_.
+
+5. Kernel Algorithms and Data Structures
+========================================
+
+This section discusses the key algorithms and data structures of the kernel
+code that provide the ability to check and repair metadata while the system
+is running.
+The first chapters in this section reveal the pieces that provide the
+foundation for checking metadata.
+The remainder of this section presents the mechanisms through which XFS
+regenerates itself.
+
+Self Describing Metadata
+------------------------
+
+Starting with XFS version 5 in 2012, XFS updated the format of nearly every
+ondisk block header to record a magic number, a checksum, a universally
+"unique" identifier (UUID), an owner code, the ondisk address of the block,
+and a log sequence number.
+When loading a block buffer from disk, the magic number, UUID, owner, and
+ondisk address confirm that the retrieved block matches the specific owner of
+the current filesystem, and that the information contained in the block is
+supposed to be found at the ondisk address.
+The first three components enable checking tools to disregard alleged metadata
+that doesn't belong to the filesystem, and the fourth component enables the
+filesystem to detect lost writes.
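+
+The following sketch illustrates the fields described above; the struct and
+field names here are illustrative only, since each ondisk structure defines
+its own header layout:
+
+.. code-block:: c
+
+	/*
+	 * Hypothetical sketch of the self-describing header fields added in
+	 * XFS version 5.  Real structures embed these fields with their own
+	 * names and ordering.
+	 */
+	struct example_block_header {
+		__be32		magic;	/* structure type code */
+		__be32		crc;	/* checksum of this block */
+		uuid_t		uuid;	/* filesystem identifier */
+		__be64		owner;	/* AG or inode owning this block */
+		__be64		blkno;	/* ondisk address of this block */
+		__be64		lsn;	/* last transaction touching this */
+	};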
+
+Whenever a file system operation modifies a block, the change is submitted
+to the log as part of a transaction.
+The log then processes these transactions, marking them done once they are
+safely persisted to storage.
+The logging code maintains the checksum and the log sequence number of the last
+transactional update.
+Checksums are useful for detecting torn writes and other discrepancies that can
+be introduced between the computer and its storage devices.
+Sequence number tracking enables log recovery to avoid applying out of date
+log updates to the filesystem.
+
+These two features improve overall runtime resiliency by providing a means for
+the filesystem to detect obvious corruption when reading metadata blocks from
+disk, but these buffer verifiers cannot provide any consistency checking
+between metadata structures.
+
+For more information, please see the documentation for
+Documentation/filesystems/xfs/xfs-self-describing-metadata.rst.
+
+Reverse Mapping
+---------------
+
+The original design of XFS (circa 1993) is an improvement upon 1980s Unix
+filesystem design.
+In those days, storage density was expensive, CPU time was scarce, and
+excessive seek time could kill performance.
+For performance reasons, filesystem authors were reluctant to add redundancy to
+the filesystem, even at the cost of data integrity.
+Filesystem designers in the early 21st century chose different strategies to
+increase internal redundancy -- either storing nearly identical copies of
+metadata, or using more space-efficient encoding techniques.
+
+For XFS, a different redundancy strategy was chosen to modernize the design:
+a secondary space usage index that maps allocated disk extents back to their
+owners.
+By adding a new index, the filesystem retains most of its ability to scale
+well to heavily threaded workloads involving large datasets, since the primary
+file metadata (the directory tree, the file block map, and the allocation
+groups) remain unchanged.
+Like any system that improves redundancy, the reverse-mapping feature increases
+overhead costs for space mapping activities.
+However, it has two critical advantages: first, the reverse index is key to
+enabling online fsck and other requested functionality such as free space
+defragmentation, better media failure reporting, and filesystem shrinking.
+Second, the different ondisk storage format of the reverse mapping btree
+defeats device-level deduplication because the filesystem requires real
+redundancy.
+
++--------------------------------------------------------------------------+
+| **Sidebar**: |
++--------------------------------------------------------------------------+
+| A criticism of adding the secondary index is that it does nothing to |
+| improve the robustness of user data storage itself. |
+| This is a valid point, but adding a new index for file data block |
+| checksums increases write amplification by turning data overwrites into |
+| copy-writes, which age the filesystem prematurely. |
+| In keeping with thirty years of precedent, users who want file data |
+| integrity can supply as powerful a solution as they require. |
+| As for metadata, the complexity of adding a new secondary index of space |
+| usage is much less than adding volume management and storage device |
+| mirroring to XFS itself. |
+| Perfection of RAID and volume management are best left to existing |
+| layers in the kernel. |
++--------------------------------------------------------------------------+
+
+The information captured in a reverse space mapping record is as follows:
+
+.. code-block:: c
+
+ struct xfs_rmap_irec {
+ xfs_agblock_t rm_startblock; /* extent start block */
+ xfs_extlen_t rm_blockcount; /* extent length */
+ uint64_t rm_owner; /* extent owner */
+ uint64_t rm_offset; /* offset within the owner */
+ unsigned int rm_flags; /* state flags */
+ };
+
+The first two fields capture the location and size of the physical space,
+in units of filesystem blocks.
+The owner field tells scrub which metadata structure or file inode has been
+assigned this space.
+For space allocated to files, the offset field tells scrub where the space was
+mapped within the file fork.
+Finally, the flags field provides extra information about the space usage --
+is this an attribute fork extent? A file mapping btree extent? Or an
+unwritten data extent?
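+
+As a brief illustration, the flags can be decoded with the constants from
+``fs/xfs/libxfs/xfs_rmap.h``; the helper below is a hypothetical sketch:
+
+.. code-block:: c
+
+	/* Classify a reverse mapping record by its state flags. */
+	static const char *
+	example_classify_rmap(const struct xfs_rmap_irec *irec)
+	{
+		if (irec->rm_flags & XFS_RMAP_ATTR_FORK)
+			return "attribute fork extent";
+		if (irec->rm_flags & XFS_RMAP_BMBT_BLOCK)
+			return "block mapping btree block";
+		if (irec->rm_flags & XFS_RMAP_UNWRITTEN)
+			return "unwritten data extent";
+		return "written data extent";
+	}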
+
+Online filesystem checking judges the consistency of each primary metadata
+record by comparing its information against all other space indices.
+The reverse mapping index plays a key role in the consistency checking process
+because it contains a centralized alternate copy of all space allocation
+information.
+Program runtime and ease of resource acquisition are the only real limits to
+what online checking can consult.
+For example, a file data extent mapping can be checked against:
+
+* The absence of an entry in the free space information.
+* The absence of an entry in the inode index.
+* The absence of an entry in the reference count data if the file is not
+ marked as having shared extents.
+* The correspondence of an entry in the reverse mapping information.
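+
+A sketch of such a cross-reference might look like the following; the
+``xchk_xref_*`` helper names mirror the style of the scrub code, but the
+signatures shown here are simplified assumptions:
+
+.. code-block:: c
+
+	/* Cross-reference one file data extent against the space indices. */
+	static void example_xref_data_extent(struct xfs_scrub *sc,
+					     xfs_agblock_t agbno,
+					     xfs_extlen_t len,
+					     const struct xfs_owner_info *oinfo)
+	{
+		/* Must not also be free space. */
+		xchk_xref_is_used_space(sc, agbno, len);
+		/* Must not overlap with inode chunks. */
+		xchk_xref_is_not_inode_chunk(sc, agbno, len);
+		/* No refcount entry unless the file shares extents. */
+		xchk_xref_is_not_shared(sc, agbno, len);
+		/* The rmap index must list exactly this owner. */
+		xchk_xref_is_only_owned_by(sc, agbno, len, oinfo);
+	}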
+
+There are several observations to make about reverse mapping indices:
+
+1. Reverse mappings can provide a positive affirmation of correctness if any of
+ the above primary metadata are in doubt.
+ The checking code for most primary metadata follows a path similar to the
+ one outlined above.
+
+2. Proving the consistency of secondary metadata with the primary metadata is
+ difficult because that requires a full scan of all primary space metadata,
+ which is very time intensive.
+ For example, checking a reverse mapping record for a file extent mapping
+ btree block requires locking the file and searching the entire btree to
+ confirm the block.
+ Instead, scrub relies on rigorous cross-referencing during the primary space
+ mapping structure checks.
+
+3. Consistency scans must use non-blocking lock acquisition primitives if the
+ required locking order is not the same order used by regular filesystem
+ operations.
+ For example, if the filesystem normally takes a file ILOCK before taking
+ the AGF buffer lock but scrub wants to take a file ILOCK while holding
+ an AGF buffer lock, scrub cannot block on that second acquisition.
+ This means that forward progress during this part of a scan of the reverse
+ mapping data cannot be guaranteed if system load is heavy.
+
+In summary, reverse mappings play a key role in reconstruction of primary
+metadata.
+The details of how these records are staged, written to disk, and committed
+into the filesystem are covered in subsequent sections.
+
+Checking and Cross-Referencing
+------------------------------
+
+The first step of checking a metadata structure is to examine every record
+contained within the structure and its relationship with the rest of the
+system.
+XFS contains multiple layers of checking to try to prevent inconsistent
+metadata from wreaking havoc on the system.
+Each of these layers contributes information that helps the kernel to make
+five decisions about the health of a metadata structure:
+
+- Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ?
+- Is this structure inconsistent with the rest of the system
+ (``XFS_SCRUB_OFLAG_XCORRUPT``) ?
+- Is there so much damage around the filesystem that cross-referencing is not
+ possible (``XFS_SCRUB_OFLAG_XFAIL``) ?
+- Can the structure be optimized to improve performance or reduce the size of
+ metadata (``XFS_SCRUB_OFLAG_PREEN``) ?
+- Does the structure contain data that is not inconsistent but deserves review
+ by the system administrator (``XFS_SCRUB_OFLAG_WARNING``) ?
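+
+These decisions are reported to userspace as a bitmask of output flags in the
+scrub ioctl structure.
+The following is a minimal sketch of how a caller might interpret them,
+assuming the userspace ``xfs/xfs.h`` header exposes the scrub ioctl:
+
+.. code-block:: c
+
+	#include <stdio.h>
+	#include <sys/ioctl.h>
+	#include <xfs/xfs.h>
+
+	/* Check one metadata type and print the kernel's verdict. */
+	static void example_report_health(int fd, __u32 type)
+	{
+		struct xfs_scrub_metadata sm = { .sm_type = type };
+
+		if (ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
+			return;
+
+		if (sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
+			printf("type %u: corrupt\n", type);
+		else if (sm.sm_flags & XFS_SCRUB_OFLAG_XCORRUPT)
+			printf("type %u: disagrees with other metadata\n", type);
+		else if (sm.sm_flags & XFS_SCRUB_OFLAG_XFAIL)
+			printf("type %u: cross-referencing failed\n", type);
+		else if (sm.sm_flags & XFS_SCRUB_OFLAG_PREEN)
+			printf("type %u: could be optimized\n", type);
+		else if (sm.sm_flags & XFS_SCRUB_OFLAG_WARNING)
+			printf("type %u: deserves administrator review\n", type);
+	}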
+
+The following sections describe how the metadata scrubbing process works.
+
+Metadata Buffer Verification
+````````````````````````````
+
+The lowest layer of metadata protection in XFS is the set of metadata
+verifiers built into the buffer cache.
+These functions perform inexpensive internal consistency checking of the block
+itself, and answer these questions:
+
+- Does the block belong to this filesystem?
+
+- Does the block belong to the structure that asked for the read?
+ This assumes that metadata blocks only have one owner, which is always true
+ in XFS.
+
+- Is the type of data stored in the block within a reasonable range of what
+ scrub is expecting?
+
+- Does the physical location of the block match the location it was read from?
+
+- Does the block checksum match the data?
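+
+A read verifier answering these questions might be structured as follows;
+the ``example_*`` names are hypothetical, while ``struct xfs_buf_ops``,
+``xfs_verify_cksum``, and ``xfs_verifier_error`` are the real kernel
+interfaces:
+
+.. code-block:: c
+
+	/* Sketch of a buffer read verifier for a v5-format block. */
+	static void example_buf_read_verify(struct xfs_buf *bp)
+	{
+		/* Does the block checksum match the data? */
+		if (!xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
+				      EXAMPLE_CRC_OFF))
+			xfs_verifier_error(bp, -EFSBADCRC, __this_address);
+		/* Do the magic, UUID, owner, and address match? */
+		else if (!example_verify_header(bp))
+			xfs_verifier_error(bp, -EFSCORRUPTED, __this_address);
+	}
+
+	const struct xfs_buf_ops example_buf_ops = {
+		.name		= "example",
+		.verify_read	= example_buf_read_verify,
+	};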
+
+The scope of the protections here is very limited -- verifiers can only
+establish that the filesystem code is reasonably free of gross corruption bugs
+and that the storage system is reasonably competent at retrieval.
+Corruption problems observed at runtime cause the generation of health reports,
+failed system calls, and in the extreme case, filesystem shutdowns if the
+corrupt metadata forces the cancellation of a dirty transaction.
+
+Every online fsck scrubbing function is expected to read every ondisk metadata
+block of a structure in the course of checking the structure.
+Corruption problems observed during a check are immediately reported to
+userspace as corruption; during a cross-reference, they are reported as a
+failure to cross-reference once the full examination is complete.
+Reads satisfied by a buffer already in cache (and hence already verified)
+bypass these checks.
+
+Internal Consistency Checks
+```````````````````````````
+
+After the buffer cache, the next level of metadata protection is the internal
+record verification code built into the filesystem.
+These checks are split between the buffer verifiers, the in-filesystem users of
+the buffer cache, and the scrub code itself, depending on the amount of higher
+level context required.
+The scope of checking is still internal to the block.
+These higher level checking functions answer these questions:
+
+- Does the type of data stored in the block match what scrub is expecting?
+
+- Does the block belong to the owning structure that asked for the read?
+
+- If the block contains records, do the records fit within the block?
+
+- If the block tracks internal free space information, is it consistent with
+ the record areas?
+
+- Are the records contained inside the block free of obvious corruptions?
+
+Record checks in this category are more rigorous and more time-intensive.
+For example, block pointers and inumbers are checked to ensure that they point
+within the dynamically allocated parts of an allocation group and within
+the filesystem.
+Names are checked for invalid characters, and flags are checked for invalid
+combinations.
+Other record attributes are checked for sensible values.
+Btree records spanning an interval of the btree keyspace are checked for
+correct order and lack of mergeability (except for file fork mappings).
+For performance reasons, regular code may skip some of these checks unless
+debugging is enabled or a write is about to occur.
+Scrub functions, of course, must check all possible problems.
+
+Validation of Userspace-Controlled Record Attributes
+````````````````````````````````````````````````````
+
+Various pieces of filesystem metadata are directly controlled by userspace.
+Because of this nature, validation work cannot be more precise than checking
+that a value is within the possible range.
+These fields include:
+
+- Superblock fields controlled by mount options
+- Filesystem labels
+- File timestamps
+- File permissions
+- File size
+- File flags
+- Names present in directory entries, extended attribute keys, and filesystem
+ labels
+- Extended attribute key namespaces
+- Extended attribute values
+- File data block contents
+- Quota limits
+- Quota timer expiration (if resource usage exceeds the soft limit)
+
+Cross-Referencing Space Metadata
+````````````````````````````````
+
+After internal block checks, the next higher level of checking is
+cross-referencing records between metadata structures.
+For regular runtime code, the cost of these checks is considered to be
+prohibitively expensive, but as scrub is dedicated to rooting out
+inconsistencies, it must pursue all avenues of inquiry.
+The exact set of cross-referencing is highly dependent on the context of the
+data structure being checked.
+
+The XFS btree code has keyspace scanning functions that online fsck uses to
+cross reference one structure with another.
+Specifically, scrub can scan the key space of an index to determine if that
+keyspace is fully, sparsely, or not at all mapped to records.
+For the reverse mapping btree, it is possible to mask parts of the key for the
+purposes of performing a keyspace scan so that scrub can decide if the rmap
+btree contains records mapping a certain extent of physical space without the
+sparseness of the rest of the rmap keyspace getting in the way.
+
+Btree blocks undergo the following checks before cross-referencing:
+
+- Does the type of data stored in the block match what scrub is expecting?
+
+- Does the block belong to the owning structure that asked for the read?
+
+- Do the records fit within the block?
+
+- Are the records contained inside the block free of obvious corruptions?
+
+- Are the name hashes in the correct order?
+
+- Do node pointers within the btree point to valid block addresses for the type
+ of btree?
+
+- Do child pointers point towards the leaves?
+
+- Do sibling pointers point across the same level?
+
+- For each node block record, does the record key accurately reflect the
+  contents of the child block?
+
+Space allocation records are cross-referenced as follows:
+
+1. Any space mentioned by any metadata structure is cross-referenced as
+   follows:
+
+ - Does the reverse mapping index list only the appropriate owner as the
+ owner of each block?
+
+ - Are none of the blocks claimed as free space?
+
+ - If these aren't file data blocks, are none of the blocks claimed as space
+ shared by different owners?
+
+2. Btree blocks are cross-referenced as follows:
+
+ - Everything in class 1 above.
+
+ - If there's a parent node block, do the keys listed for this block match the
+ keyspace of this block?
+
+ - Do the sibling pointers point to valid blocks? Of the same level?
+
+ - Do the child pointers point to valid blocks? Of the next level down?
+
+3. Free space btree records are cross-referenced as follows:
+
+ - Everything in class 1 and 2 above.
+
+ - Does the reverse mapping index list no owners of this space?
+
+ - Is this space not claimed by the inode index for inodes?
+
+ - Is it not mentioned by the reference count index?
+
+ - Is there a matching record in the other free space btree?
+
+4. Inode btree records are cross-referenced as follows:
+
+ - Everything in class 1 and 2 above.
+
+   - Is there a matching record in the free inode btree?
+
+ - Do cleared bits in the holemask correspond with inode clusters?
+
+ - Do set bits in the freemask correspond with inode records with zero link
+ count?
+
+5. Inode records are cross-referenced as follows:
+
+ - Everything in class 1.
+
+ - Do all the fields that summarize information about the file forks actually
+ match those forks?
+
+ - Does each inode with zero link count correspond to a record in the free
+ inode btree?
+
+6. File fork space mapping records are cross-referenced as follows:
+
+ - Everything in class 1 and 2 above.
+
+ - Is this space not mentioned by the inode btrees?
+
+ - If this is a CoW fork mapping, does it correspond to a CoW entry in the
+ reference count btree?
+
+7. Reference count records are cross-referenced as follows:
+
+ - Everything in class 1 and 2 above.
+
+ - Within the space subkeyspace of the rmap btree (that is to say, all
+ records mapped to a particular space extent and ignoring the owner info),
+ are there the same number of reverse mapping records for each block as the
+ reference count record claims?
+
+Proposed patchsets are the series to find gaps in
+`refcount btree
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps>`_,
+`inode btree
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_, and
+`rmap btree
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps>`_ records;
+to find
+`mergeable records
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records>`_;
+and to
+`improve cross referencing with rmap
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking>`_
+before starting a repair.
+
+Checking Extended Attributes
+````````````````````````````
+
+Extended attributes implement a key-value store that enables fragments of data
+to be attached to any file.
+Both the kernel and userspace can access the keys and values, subject to
+namespace and privilege restrictions.
+Most typically these fragments are metadata about the file -- origins, security
+contexts, user-supplied labels, indexing information, etc.
+
+Names can be as long as 255 bytes and can exist in several different
+namespaces.
+Values can be as large as 64KB.
+A file's extended attributes are stored in blocks mapped by the attr fork.
+The mappings point to leaf blocks, remote value blocks, or dabtree blocks.
+Block 0 in the attribute fork is always the top of the structure, but otherwise
+each of the three types of blocks can be found at any offset in the attr fork.
+Leaf blocks contain attribute key records that point to the name and the value.
+Names are always stored elsewhere in the same leaf block.
+Values that are less than 3/4 the size of a filesystem block are also stored
+elsewhere in the same leaf block.
+Remote value blocks contain values that are too large to fit inside a leaf.
+If the leaf information exceeds a single filesystem block, a dabtree (also
+rooted at block 0) is created to map hashes of the attribute names to leaf
+blocks in the attr fork.
+
+Checking an extended attribute structure is not so straightforward due to the
+lack of separation between attr blocks and index blocks.
+Scrub must read each block mapped by the attr fork and ignore the non-leaf
+blocks:
+
+1. Walk the dabtree in the attr fork (if present) to ensure that there are no
+ irregularities in the blocks or dabtree mappings that do not point to
+ attr leaf blocks.
+
+2. Walk the blocks of the attr fork looking for leaf blocks.
+ For each entry inside a leaf:
+
+ a. Validate that the name does not contain invalid characters.
+
+ b. Read the attr value.
+ This performs a named lookup of the attr name to ensure the correctness
+ of the dabtree.
+ If the value is stored in a remote block, this also validates the
+ integrity of the remote value block.
+
+Checking and Cross-Referencing Directories
+``````````````````````````````````````````
+
+The filesystem directory tree is a directed acyclic graph structure, with files
+constituting the nodes, and directory entries (dirents) constituting the edges.
+Directories are a special type of file containing a set of mappings from a
+255-byte sequence (name) to an inumber.
+These are called directory entries, or dirents for short.
+Each directory file must have exactly one parent directory pointing to it.
+A root directory points to itself.
+Directory entries point to files of any type.
+Each non-directory file may have multiple directories pointing to it.
+
+In XFS, directories are implemented as a file containing up to three 32GB
+partitions.
+The first partition contains directory entry data blocks.
+Each data block contains variable-sized records associating a user-provided
+name with an inumber and, optionally, a file type.
+If the directory entry data grows beyond one block, the second partition (which
+exists as post-EOF extents) is populated with a block containing free space
+information and an index that maps hashes of the dirent names to directory data
+blocks in the first partition.
+This makes directory name lookups very fast.
+If this second partition grows beyond one block, the third partition is
+populated with a linear array of free space information for faster
+expansions.
+If the free space has been separated and the second partition grows again
+beyond one block, then a dabtree is used to map hashes of dirent names to
+directory data blocks.
+
+Checking a directory is pretty straightforward:
+
+1. Walk the dabtree in the second partition (if present) to ensure that there
+ are no irregularities in the blocks or dabtree mappings that do not point to
+ dirent blocks.
+
+2. Walk the blocks of the first partition looking for directory entries.
+ Each dirent is checked as follows:
+
+ a. Does the name contain no invalid characters?
+
+ b. Does the inumber correspond to an actual, allocated inode?
+
+ c. Does the child inode have a nonzero link count?
+
+ d. If a file type is included in the dirent, does it match the type of the
+ inode?
+
+ e. If the child is a subdirectory, does the child's dotdot pointer point
+ back to the parent?
+
+ f. If the directory has a second partition, perform a named lookup of the
+ dirent name to ensure the correctness of the dabtree.
+
+3. Walk the free space list in the third partition (if present) to ensure that
+ the free spaces it describes are really unused.
+
+Checking operations involving :ref:`parents <dirparent>` and
+:ref:`file link counts <nlinks>` are discussed in more detail in later
+sections.
+
+Checking Directory/Attribute Btrees
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As stated in previous sections, the directory/attribute btree (dabtree) index
+maps user-provided names to improve lookup times by avoiding linear scans.
+Internally, it maps a 32-bit hash of the name to a block offset within the
+appropriate file fork.
+
+The internal structure of a dabtree closely resembles the btrees that record
+fixed-size metadata records -- each dabtree block contains a magic number, a
+checksum, sibling pointers, a UUID, a tree level, and a log sequence number.
+The format of leaf and node records is the same -- each entry points to the
+next level down in the hierarchy, with dabtree node records pointing to dabtree
+leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere
+in the fork.
+
+Checking and cross-referencing the dabtree is very similar to what is done for
+space btrees:
+
+- Does the type of data stored in the block match what scrub is expecting?
+
+- Does the block belong to the owning structure that asked for the read?
+
+- Do the records fit within the block?
+
+- Are the records contained inside the block free of obvious corruptions?
+
+- Are the name hashes in the correct order?
+
+- Do node pointers within the dabtree point to valid fork offsets for dabtree
+ blocks?
+
+- Do leaf pointers within the dabtree point to valid fork offsets for directory
+ or attr leaf blocks?
+
+- Do child pointers point towards the leaves?
+
+- Do sibling pointers point across the same level?
+
+- For each dabtree node record, does the record key accurately reflect the
+  contents of the child dabtree block?
+
+- For each dabtree leaf record, does the record key accurately reflect the
+  contents of the directory or attr block?
+
+Cross-Referencing Summary Counters
+``````````````````````````````````
+
+XFS maintains three classes of summary counters: available resources, quota
+resource usage, and file link counts.
+
+In theory, the amount of available resources (data blocks, inodes, realtime
+extents) can be found by walking the entire filesystem.
+This would make for very slow reporting, so a transactional filesystem can
+maintain summaries of this information in the superblock.
+Cross-referencing these values against the filesystem metadata should be a
+simple matter of walking the free space and inode metadata in each AG and the
+realtime bitmap, but there are complications that will be discussed in
+:ref:`more detail <fscounters>` later.
+
+:ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>`
+checking are sufficiently complicated to warrant separate sections.
+
+Post-Repair Reverification
+``````````````````````````
+
+After performing a repair, the checking code is run a second time to validate
+the new structure, and the results of the health assessment are recorded
+internally and returned to the calling process.
+This step is critical for enabling system administrators to monitor the status
+of the filesystem and the progress of any repairs.
+For developers, it is a useful means to judge the efficacy of error detection
+and correction in the online and offline checking tools.
+
+Eventual Consistency vs. Online Fsck
+------------------------------------
+
+Complex operations can make modifications to multiple per-AG data structures
+with a chain of transactions.
+These chains, once committed to the log, are restarted during log recovery if
+the system crashes while processing the chain.
+Because the AG header buffers are unlocked between transactions within a chain,
+online checking must coordinate with chained operations that are in progress to
+avoid incorrectly detecting inconsistencies due to pending chains.
+Furthermore, online repair must not run when operations are pending because
+the metadata are temporarily inconsistent with each other, and rebuilding is
+not possible.
+
+Only online fsck has this requirement of total consistency of AG metadata, and
+such checking should be relatively rare compared to filesystem change
+operations.
+Online fsck coordinates with transaction chains as follows:
+
+* For each AG, maintain a count of intent items targeting that AG.
+ The count should be bumped whenever a new item is added to the chain.
+ The count should be dropped when the filesystem has locked the AG header
+ buffers and finished the work.
+
+* When online fsck wants to examine an AG, it should lock the AG header
+ buffers to quiesce all transaction chains that want to modify that AG.
+ If the count is zero, proceed with the checking operation.
+ If it is nonzero, cycle the buffer locks to allow the chain to make forward
+ progress.
+
+This may lead to online fsck taking a long time to complete, but regular
+filesystem updates take precedence over background checking activity.
+Details about the discovery of this situation are presented in the
+:ref:`next section <chain_coordination>`, and details about the solution
+are presented :ref:`after that<intent_drains>`.
+
+.. _chain_coordination:
+
+Discovery of the Problem
+````````````````````````
+
+Midway through the development of online scrubbing, the fsstress tests
+uncovered a misinteraction between online fsck and compound transaction chains
+created by other writer threads that resulted in false reports of metadata
+inconsistency.
+The root cause of these reports is the eventual consistency model introduced by
+the expansion of deferred work items and compound transaction chains when
+reverse mapping and reflink were introduced.
+
+Originally, transaction chains were added to XFS to avoid deadlocks when
+unmapping space from files.
+Deadlock avoidance rules require that AGs only be locked in increasing order,
+which makes it impossible (say) to use a single transaction to free a space
+extent in AG 7 and then try to free a now superfluous block mapping btree block
+in AG 3.
+To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log
+items to commit to freeing some space in one transaction while deferring the
+actual metadata updates to a fresh transaction.
+The transaction sequence looks like this:
+
+1. The first transaction contains a physical update to the file's block mapping
+ structures to remove the mapping from the btree blocks.
+ It then attaches to the in-memory transaction an action item to schedule
+ deferred freeing of space.
+ Concretely, each transaction maintains a list of ``struct
+ xfs_defer_pending`` objects, each of which maintains a list of ``struct
+ xfs_extent_free_item`` objects.
+ Returning to the example above, the action item tracks the freeing of both
+ the unmapped space from AG 7 and the block mapping btree (BMBT) block from
+ AG 3.
+ Deferred frees recorded in this manner are committed in the log by creating
+ an EFI log item from the ``struct xfs_extent_free_item`` object and
+ attaching the log item to the transaction.
+ When the log is persisted to disk, the EFI item is written into the ondisk
+ transaction record.
+ EFIs can list up to 16 extents to free, all sorted in AG order.
+
+2. The second transaction contains a physical update to the free space btrees
+ of AG 3 to release the former BMBT block and a second physical update to the
+ free space btrees of AG 7 to release the unmapped file space.
+ Observe that the physical updates are resequenced in the correct order
+ when possible.
+   Attached to the transaction is an extent free done (EFD) log item.
+ The EFD contains a pointer to the EFI logged in transaction #1 so that log
+ recovery can tell if the EFI needs to be replayed.
+
+If the system goes down after transaction #1 is written back to the filesystem
+but before #2 is committed, a scan of the filesystem metadata would show
+inconsistent filesystem metadata because there would not appear to be any owner
+of the unmapped space.
+Happily, log recovery corrects this inconsistency for us -- when recovery finds
+an intent log item but does not find a corresponding intent done item, it will
+reconstruct the incore state of the intent item and finish it.
+In the example above, the log must replay both frees described in the recovered
+EFI to complete the recovery phase.
+
+There are subtleties to XFS' transaction chaining strategy to consider:
+
+* Log items must be added to a transaction in the correct order to prevent
+ conflicts with principal objects that are not held by the transaction.
+ In other words, all per-AG metadata updates for an unmapped block must be
+ completed before the last update to free the extent, and extents should not
+ be reallocated until that last update commits to the log.
+
+* AG header buffers are released between each transaction in a chain.
+ This means that other threads can observe an AG in an intermediate state,
+ but as long as the first subtlety is handled, this should not affect the
+ correctness of filesystem operations.
+
+* Unmounting the filesystem flushes all pending work to disk, which means that
+ offline fsck never sees the temporary inconsistencies caused by deferred
+ work item processing.
+
+In this manner, XFS employs a form of eventual consistency to avoid deadlocks
+and increase parallelism.
+
+During the design phase of the reverse mapping and reflink features, it was
+decided that it was impractical to cram all the reverse mapping updates for a
+single filesystem change into a single transaction because a single file
+mapping operation can explode into many small updates:
+
+* The block mapping update itself
+* A reverse mapping update for the block mapping update
+* Fixing the freelist
+* A reverse mapping update for the freelist fix
+
+* A shape change to the block mapping btree
+* A reverse mapping update for the btree update
+* Fixing the freelist (again)
+* A reverse mapping update for the freelist fix
+
+* An update to the reference counting information
+* A reverse mapping update for the refcount update
+* Fixing the freelist (a third time)
+* A reverse mapping update for the freelist fix
+
+* Freeing any space that was unmapped and not owned by any other file
+* Fixing the freelist (a fourth time)
+* A reverse mapping update for the freelist fix
+
+* Freeing the space used by the block mapping btree
+* Fixing the freelist (a fifth time)
+* A reverse mapping update for the freelist fix
+
+Free list fixups are not usually needed more than once per AG per transaction
+chain, but it is theoretically possible if space is very tight.
+For copy-on-write updates this is even worse, because this must be done once to
+remove the space from a staging area and again to map it into the file!
+
+To deal with this explosion in a calm manner, XFS expands its use of deferred
+work items to cover most reverse mapping updates and all refcount updates.
+This reduces the worst case size of transaction reservations by breaking the
+work into a long chain of small updates, which increases the degree of eventual
+consistency in the system.
+Again, this generally isn't a problem because XFS orders its deferred work
+items carefully to avoid resource reuse conflicts between unsuspecting threads.
+
+However, online fsck changes the rules -- remember that although physical
+updates to per-AG structures are coordinated by locking the buffers for AG
+headers, buffer locks are dropped between transactions.
+Once scrub acquires resources and takes locks for a data structure, it must do
+all the validation work without releasing the lock.
+If the main lock for a space btree is an AG header buffer lock, scrub may have
+interrupted another thread that is midway through finishing a chain.
+For example, if a thread performing a copy-on-write has completed a reverse
+mapping update but not the corresponding refcount update, the two AG btrees
+will appear inconsistent to scrub and an observation of corruption will be
+recorded.
+This observation will not be correct.
+If a repair is attempted in this state, the results will be catastrophic!
+
+Several other solutions to this problem were evaluated upon discovery of this
+flaw and rejected:
+
+1. Add a higher level lock to allocation groups and require writer threads to
+ acquire the higher level lock in AG order before making any changes.
+ This would be very difficult to implement in practice because it is
+ difficult to determine which locks need to be obtained, and in what order,
+ without simulating the entire operation.
+ Performing a dry run of a file operation to discover necessary locks would
+ make the filesystem very slow.
+
+2. Make the deferred work coordinator code aware of consecutive intent items
+ targeting the same AG and have it hold the AG header buffers locked across
+ the transaction roll between updates.
+ This would introduce a lot of complexity into the coordinator since it is
+ only loosely coupled with the actual deferred work items.
+ It would also fail to solve the problem because deferred work items can
+ generate new deferred subtasks, but all subtasks must be complete before
+ work can start on a new sibling task.
+
+3. Teach online fsck to walk all transactions waiting for whichever lock(s)
+ protect the data structure being scrubbed to look for pending operations.
+ The checking and repair operations must factor these pending operations into
+ the evaluations being performed.
+ This solution is a nonstarter because it is *extremely* invasive to the main
+ filesystem.
+
+.. _intent_drains:
+
+Intent Drains
+`````````````
+
+Online fsck uses an atomic intent item counter and lock cycling to coordinate
+with transaction chains.
+There are two key properties to the drain mechanism.
+First, the counter is incremented when a deferred work item is *queued* to a
+transaction, and it is decremented after the associated intent done log item is
+*committed* to another transaction.
+The second property is that deferred work can be added to a transaction without
+holding an AG header lock, but per-AG work items cannot be marked done without
+locking that AG header buffer to log the physical updates and the intent done
+log item.
+The first property enables scrub to yield to running transaction chains, which
+is an explicit deprioritization of online fsck to benefit file operations.
+The second property of the drain is key to the correct coordination of scrub,
+since scrub will always be able to decide if a conflict is possible.
+
+For regular filesystem code, the drain works as follows:
+
+1. Call the appropriate subsystem function to add a deferred work item to a
+ transaction.
+
+2. The function calls ``xfs_defer_drain_bump`` to increase the counter.
+
+3. When the deferred item manager wants to finish the deferred work item, it
+ calls ``->finish_item`` to complete it.
+
+4. The ``->finish_item`` implementation logs some changes and calls
+ ``xfs_defer_drain_drop`` to decrease the sloppy counter and wake up any threads
+ waiting on the drain.
+
+5. The subtransaction commits, which unlocks the resource associated with the
+ intent item.
+
+For scrub, the drain works as follows:
+
+1. Lock the resource(s) associated with the metadata being scrubbed.
+ For example, a scan of the refcount btree would lock the AGI and AGF header
+ buffers.
+
+2. If the counter is zero (``xfs_defer_drain_busy`` returns false), there are no
+ chains in progress and the operation may proceed.
+
+3. Otherwise, release the resources grabbed in step 1.
+
+4. Wait for the intent counter to reach zero (``xfs_defer_drain_intents``), then go
+ back to step 1 unless a signal has been caught.
+
+To avoid polling in step 4, the drain provides a waitqueue for scrub threads to
+be woken up whenever the intent count drops to zero.
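+
+A simplified model of the drain, assuming it pairs an atomic counter with a
+waitqueue (the real implementation lives in ``fs/xfs/xfs_drain.c``), might
+look like this:
+
+.. code-block:: c
+
+	struct example_drain {
+		atomic_t		dr_count;	/* pending intents */
+		struct wait_queue_head	dr_waiters;	/* waiting scrubbers */
+	};
+
+	/* Called when a deferred work item is queued to a transaction. */
+	static inline void example_drain_bump(struct example_drain *dr)
+	{
+		atomic_inc(&dr->dr_count);
+	}
+
+	/* Called after the intent done item is committed. */
+	static inline void example_drain_drop(struct example_drain *dr)
+	{
+		if (atomic_dec_and_test(&dr->dr_count))
+			wake_up(&dr->dr_waiters);
+	}
+
+	static inline bool example_drain_busy(struct example_drain *dr)
+	{
+		return atomic_read(&dr->dr_count) > 0;
+	}
+
+	/* Sleep until the chains drain, or a fatal signal arrives. */
+	static inline int example_drain_wait(struct example_drain *dr)
+	{
+		return wait_event_killable(dr->dr_waiters,
+				!example_drain_busy(dr));
+	}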
+
+The proposed patchset is the
+`scrub intent drain series
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_.
+
+.. _jump_labels:
+
+Static Keys (aka Jump Label Patching)
+`````````````````````````````````````
+
+Online fsck for XFS separates the regular filesystem from the checking and
+repair code as much as possible.
+However, there are a few parts of online fsck (such as the intent drains, and
+later, live update hooks) where it is useful for the online fsck code to know
+what's going on in the rest of the filesystem.
+Since it is not expected that online fsck will be constantly running in the
+background, it is very important to minimize the runtime overhead imposed by
+these hooks when online fsck is compiled into the kernel but not actively
+running on behalf of userspace.
+Taking locks in the hot path of a writer thread to access a data structure only
+to find that no further action is necessary is expensive -- on the author's
+computer, this has an overhead of 40-50ns per access.
+Fortunately, the kernel supports dynamic code patching, which enables XFS to
+replace a static branch to hook code with ``nop`` sleds when online fsck isn't
+running.
+This sled has an overhead of however long it takes the instruction decoder to
+skip past the sled, which seems to be on the order of less than 1ns and
+does not access memory outside of instruction fetching.
+
+When online fsck enables the static key, the sled is replaced with an
+unconditional branch to call the hook code.
+The switchover is quite expensive (~22000ns) but is paid entirely by the
+program that invoked online fsck, and can be amortized if multiple threads
+enter online fsck at the same time, or if multiple filesystems are being
+checked at the same time.
+Changing the branch direction requires taking the CPU hotplug lock, and since
+CPU initialization requires memory allocation, online fsck must be careful not
+to change a static key while holding any locks or resources that could be
+accessed in the memory reclaim paths.
+To minimize contention on the CPU hotplug lock, care should be taken not to
+enable or disable static keys unnecessarily.
+
+Because static keys are intended to minimize hook overhead for regular
+filesystem operations when xfs_scrub is not running, the intended usage
+patterns are as follows:
+
+- The hooked part of XFS should declare a static-scoped static key that
+ defaults to false.
+ The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this.
+ The static key itself should be declared as a ``static`` variable.
+
+- When deciding to invoke code that's only used by scrub, the regular
+ filesystem should call the ``static_branch_unlikely`` predicate to avoid the
+ scrub-only hook code if the static key is not enabled.
+
+- The regular filesystem should export helper functions that call
+ ``static_branch_inc`` to enable and ``static_branch_dec`` to disable the
+ static key.
+ Wrapper functions make it easy to compile out the relevant code if the kernel
+ distributor turns off online fsck at build time.
+
+- Scrub functions wanting to turn on scrub-only XFS functionality should call
+  ``xchk_fsgates_enable`` from the setup function to enable a specific
+  hook.
+ This must be done before obtaining any resources that are used by memory
+ reclaim.
+ Callers had better be sure they really need the functionality gated by the
+ static key; the ``TRY_HARDER`` flag is useful here.
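+
+Put together, the pattern described in the list above might look like the
+following sketch; the hook and helper names are hypothetical, but the jump
+label macros are the real API from ``include/linux/jump_label.h``:
+
+.. code-block:: c
+
+	DEFINE_STATIC_KEY_FALSE(example_scrub_gate);
+
+	/* Hot path: pays only for a nop sled while scrub is idle. */
+	static inline void example_filesystem_op(struct xfs_mount *mp)
+	{
+		if (static_branch_unlikely(&example_scrub_gate))
+			example_call_scrub_hook(mp);
+	}
+
+	/* Wrappers exported to the scrub code to flip the gate. */
+	void example_scrub_hooks_enable(void)
+	{
+		static_branch_inc(&example_scrub_gate);
+	}
+
+	void example_scrub_hooks_disable(void)
+	{
+		static_branch_dec(&example_scrub_gate);
+	}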
+
+Online scrub has resource acquisition helpers (e.g. ``xchk_perag_lock``) to
+handle locking AGI and AGF buffers for all scrubber functions.
+If it detects a conflict between scrub and the running transactions, it will
+try to wait for intents to complete.
+If the caller of the helper has not enabled the static key, the helper will
+return -EDEADLOCK, which should result in the scrub being restarted with the
+``TRY_HARDER`` flag set.
+The scrub setup function should detect that flag, enable the static key, and
+try the scrub again.
+Scrub teardown disables all static keys obtained by ``xchk_fsgates_enable``.
+
+For more information, please see the kernel documentation of
+Documentation/staging/static-keys.rst.
+
+.. _xfile:
+
+Pageable Kernel Memory
+----------------------
+
+Some online checking functions work by scanning the filesystem to build a
+shadow copy of an ondisk metadata structure in memory and comparing the two
+copies.
+For online repair to rebuild a metadata structure, it must compute the record
+set that will be stored in the new structure before it can persist that new
+structure to disk.
+Ideally, repairs complete with a single atomic commit that introduces
+a new data structure.
+To meet these goals, the kernel needs to collect a large amount of information
+in a place that doesn't require the correct operation of the filesystem.
+
+Kernel memory isn't suitable because:
+
+* Allocating a contiguous region of memory to create a C array is very
+ difficult, especially on 32-bit systems.
+
+* Linked lists of records introduce double pointer overhead which is very high
+ and eliminate the possibility of indexed lookups.
+
+* Kernel memory is pinned, which can drive the system into OOM conditions.
+
+* The system might not have sufficient memory to stage all the information.
+
+At any given time, online fsck does not need to keep the entire record set in
+memory, which means that individual records can be paged out if necessary.
+Continued development of online fsck demonstrated that the ability to perform
+indexed data storage would also be very useful.
+Fortunately, the Linux kernel already has a facility for byte-addressable and
+pageable storage: tmpfs.
+In-kernel graphics drivers (most notably i915) take advantage of tmpfs files
+to store intermediate data that doesn't need to be in memory at all times, so
+that usage precedent is already established.
+Hence, the ``xfile`` was born!
+
++--------------------------------------------------------------------------+
+| **Historical Sidebar**: |
++--------------------------------------------------------------------------+
+| The first edition of online repair inserted records into a new btree as  |
+| it found them, which failed because the filesystem could shut down with  |
+| a half-built structure, which would be live after recovery finished.     |
+| |
+| The second edition solved the half-rebuilt structure problem by storing |
+| everything in memory, but frequently ran the system out of memory. |
+| |
+| The third edition solved the OOM problem by using linked lists, but the |
+| memory overhead of the list pointers was extreme. |
++--------------------------------------------------------------------------+
+
+xfile Access Models
+```````````````````
+
+A survey of the intended uses of xfiles suggested these use cases:
+
+1. Arrays of fixed-sized records (space management btrees, directory and
+ extended attribute entries)
+
+2. Sparse arrays of fixed-sized records (quotas and link counts)
+
+3. Large binary objects (BLOBs) of variable sizes (directory and extended
+ attribute names and values)
+
+4. Staging btrees in memory (reverse mapping btrees)
+
+5. Arbitrary contents (realtime space management)
+
+To support the first four use cases, high level data structures wrap the xfile
+to share functionality between online fsck functions.
+The rest of this section discusses the interfaces that the xfile presents to
+four of those five higher level data structures.
+The fifth use case is discussed in the :ref:`realtime summary <rtsummary>` case
+study.
+
+XFS is very record-based, which suggests that the ability to load and store
+complete records is important.
+To support these cases, a pair of ``xfile_load`` and ``xfile_store``
+functions are provided to read and persist objects into an xfile; they treat
+any error as an out of memory error.
+For online repair, squashing error conditions in this manner is an acceptable
+behavior because the only reaction is to abort the operation back to
+userspace.
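+
+A sketch of this usage, assuming ``xfile_load`` and ``xfile_store`` take a
+buffer, a byte count, and a byte offset and return 0 or a negative errno:
+
+.. code-block:: c
+
+	struct example_rec {
+		uint64_t	key;
+		uint64_t	value;
+	};
+
+	/* Persist a fixed-size record at a given array slot. */
+	static int example_save_rec(struct xfile *xf, uint64_t idx,
+				    const struct example_rec *rec)
+	{
+		return xfile_store(xf, rec, sizeof(*rec), idx * sizeof(*rec));
+	}
+
+	/* Read the record back; any error is treated as ENOMEM. */
+	static int example_get_rec(struct xfile *xf, uint64_t idx,
+				   struct example_rec *rec)
+	{
+		return xfile_load(xf, rec, sizeof(*rec), idx * sizeof(*rec));
+	}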
+
+However, no discussion of file access idioms is complete without answering the
+question, "But what about mmap?"
+It is convenient to access storage directly with pointers, just like userspace
+code does with regular memory.
+Online fsck must not drive the system into OOM conditions, which means that
+xfiles must be responsive to memory reclamation.
+tmpfs can only push a pagecache folio to the swap cache if the folio is neither
+pinned nor locked, which means the xfile must not pin too many folios.
+
+Short term direct access to xfile contents is done by locking the pagecache
+folio and mapping it into kernel address space. Object load and store uses this
+mechanism. Folio locks are not supposed to be held for long periods of time, so
+long term direct access to xfile contents is done by bumping the folio refcount,
+mapping it into kernel address space, and dropping the folio lock.
+These long term users *must* be responsive to memory reclaim by hooking into
+the shrinker infrastructure to know when to release folios.
+
+The ``xfile_get_folio`` and ``xfile_put_folio`` functions are provided to
+retrieve the (locked) folio that backs part of an xfile and to release it.
+The only code to use these folio lease functions are the xfarray
+:ref:`sorting<xfarray_sort>` algorithms and the :ref:`in-memory
+btrees<xfbtree>`.
+
+xfile Access Coordination
+`````````````````````````
+
+For security reasons, xfiles must be owned privately by the kernel.
+They are marked ``S_PRIVATE`` to prevent interference from the security system,
+must never be mapped into process file descriptor tables, and their pages must
+never be mapped into userspace processes.
+
+To avoid locking recursion issues with the VFS, all accesses to the shmfs file
+are performed by manipulating the page cache directly.
+xfile writers call the ``->write_begin`` and ``->write_end`` functions of the
+xfile's address space to grab writable pages, copy the caller's buffer into the
+page, and release the pages.
+xfile readers call ``shmem_read_mapping_page_gfp`` to grab pages directly
+before copying the contents into the caller's buffer.
+In other words, xfiles ignore the VFS read and write code paths to avoid
+having to create a dummy ``struct kiocb`` and to avoid taking inode and
+freeze locks.
+tmpfs cannot be frozen, and xfiles must not be exposed to userspace.
+
+If an xfile is shared between threads to stage repairs, the caller must provide
+its own locks to coordinate access.
+For example, if a scrub function stores scan results in an xfile and needs
+other threads to provide updates to the scanned data, the scrub function must
+provide a lock for all threads to share.
+
+.. _xfarray:
+
+Arrays of Fixed-Sized Records
+`````````````````````````````
+
+In XFS, each type of indexed space metadata (free space, inodes, reference
+counts, file fork space, and reverse mappings) consists of a set of fixed-size
+records indexed with a classic B+ tree.
+Directories have a set of fixed-size dirent records that point to the names,
+and extended attributes have a set of fixed-size attribute keys that point to
+names and values.
+Quota counters and file link counters index records with numbers.
+During a repair, scrub needs to stage new records during the gathering step and
+retrieve them during the btree building step.
+
+Although this requirement can be satisfied by calling the read and write
+methods of the xfile directly, it is simpler for callers for there to be a
+higher level abstraction to take care of computing array offsets, to provide
+iterator functions, and to deal with sparse records and sorting.
+The ``xfarray`` abstraction presents a linear array for fixed-size records atop
+the byte-accessible xfile.
+
+.. _xfarray_access_patterns:
+
+Array Access Patterns
+^^^^^^^^^^^^^^^^^^^^^
+
+Array access patterns in online fsck tend to fall into three categories.
+Iteration of records is assumed to be necessary for all cases and will be
+covered in the next section.
+
+The first type of caller handles records that are indexed by position.
+Gaps may exist between records, and a record may be updated multiple times
+during the collection step.
+In other words, these callers want a sparse linearly addressed table file.
+Typical use cases are quota records and file link count records.
+Access to array elements is performed programmatically via ``xfarray_load`` and
+``xfarray_store`` functions, which wrap the similarly-named xfile functions to
+provide loading and storing of array elements at arbitrary array indices.
+Gaps are defined to be null records, and null records are defined to be a
+sequence of all zero bytes.
+Null records are detected by calling ``xfarray_element_is_null``.
+They are created either by calling ``xfarray_unset`` to null out an existing
+record or by never storing anything to an array index.
+
+The second type of caller handles records that are not indexed by position
+and do not require multiple updates to a record.
+The typical use case here is rebuilding space btrees and key/value btrees.
+These callers can add records to the array without caring about array indices
+via the ``xfarray_append`` function, which stores a record at the end of the
+array.
+For callers that require records to be presentable in a specific order (e.g.
+rebuilding btree data), the ``xfarray_sort`` function can arrange the sorted
+records; this function will be covered later.
+
+The third type of caller is a bag, which is useful for counting records.
+The typical use case here is constructing space extent reference counts from
+reverse mapping information.
+Records can be put in the bag in any order, they can be removed from the bag
+at any time, and uniqueness of records is left to callers.
+The ``xfarray_store_anywhere`` function is used to insert a record in any
+null record slot in the bag; and the ``xfarray_unset`` function removes a
+record from the bag.
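+
+The three access patterns can be sketched as follows, assuming the xfarray
+functions take an array, an index (where applicable), and a record pointer,
+with error handling elided:
+
+.. code-block:: c
+
+	static void example_xfarray_usage(struct xfarray *array,
+					  struct example_rec *rec)
+	{
+		/* 1. Sparse table: load and store by index; gaps are null. */
+		xfarray_store(array, 42, rec);
+		xfarray_load(array, 42, rec);
+		if (xfarray_element_is_null(array, rec))
+			return;	/* slot was never written, or was unset */
+
+		/* 2. Append-only staging for btree rebuilds. */
+		xfarray_append(array, rec);
+
+		/* 3. Bag semantics: insert anywhere, remove by index. */
+		xfarray_store_anywhere(array, rec);
+		xfarray_unset(array, 42);
+	}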
+
+The proposed patchset is the
+`big in-memory array
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.
+
+Iterating Array Elements
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Most users of the xfarray require the ability to iterate the records stored in
+the array.
+Callers can probe every possible array index with the following:
+
+.. code-block:: c
+
+ xfarray_idx_t i;
+ foreach_xfarray_idx(array, i) {
+ xfarray_load(array, i, &rec);
+
+ /* do something with rec */
+ }
+
+All users of this idiom must be prepared to handle null records or must already
+know that there aren't any.
+
+For xfarray users that want to iterate a sparse array, the ``xfarray_iter``
+function ignores indices in the xfarray that have never been written to by
+calling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to skip areas
+of the array that are not populated with memory pages.
+Once it finds a page, it will skip the zeroed areas of the page.
+
+.. code-block:: c
+
+ xfarray_idx_t i = XFARRAY_CURSOR_INIT;
+ while ((ret = xfarray_iter(array, &i, &rec)) == 1) {
+ /* do something with rec */
+ }
+
+.. _xfarray_sort:
+
+Sorting Array Elements
+^^^^^^^^^^^^^^^^^^^^^^
+
+During the fourth demonstration of online repair, a community reviewer remarked
+that for performance reasons, online repair ought to load batches of records
+into btree record blocks instead of inserting records into a new btree one at a
+time.
+The btree insertion code in XFS is responsible for maintaining correct ordering
+of the records, so naturally the xfarray must also support sorting the record
+set prior to bulk loading.
+
+Case Study: Sorting xfarrays
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The sorting algorithm used in the xfarray is actually a combination of adaptive
+quicksort and a heapsort subalgorithm in the spirit of
+`Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and
+`pdqsort <https://github.com/orlp/pdqsort>`_, with customizations for the Linux
+kernel.
+To sort records in a reasonably short amount of time, ``xfarray`` takes
+advantage of the binary subpartitioning offered by quicksort, but it also uses
+heapsort to hedge against performance collapse if the chosen quicksort pivots
+are poor.
+Both algorithms are (in general) O(n * lg(n)), but there is a wide performance
+gulf between the two implementations.
+
+The Linux kernel already contains a reasonably fast implementation of heapsort.
+It only operates on regular C arrays, which limits the scope of its usefulness.
+There are two key places where the xfarray uses it:
+
+* Sorting any record subset backed by a single xfile page.
+
+* Loading a small number of xfarray records from potentially disparate parts
+ of the xfarray into a memory buffer, and sorting the buffer.
+
+In other words, ``xfarray`` uses heapsort to constrain the nested recursion of
+quicksort, thereby mitigating quicksort's worst runtime behavior.
+
+Choosing a quicksort pivot is a tricky business.
+A good pivot splits the set to sort in half, leading to the divide and conquer
+behavior that is crucial to O(n * lg(n)) performance.
+A poor pivot barely splits the subset at all, leading to O(n\ :sup:`2`)
+runtime.
+The xfarray sort routine tries to avoid picking a bad pivot by sampling nine
+records into a memory buffer and using the kernel heapsort to identify the
+median of the nine.
+
+Most modern quicksort implementations employ Tukey's "ninther" to select a
+pivot from a classic C array.
+Typical ninther implementations pick three unique triads of records, sort each
+of the triads, and then sort the middle value of each triad to determine the
+ninther value.
+As stated previously, however, xfile accesses are not entirely cheap.
+It turned out to be much more performant to read the nine elements into a
+memory buffer, run the kernel's in-memory heapsort on the buffer, and choose
+the middle element (index 4) of that buffer as the pivot.
+Tukey's ninthers are described in J. W. Tukey, `The ninther, a technique for
+low-effort robust (resistant) location in large samples`, in *Contributions to
+Survey Sampling and Applied Statistics*, edited by H. David, (Academic Press,
+1978), pp. 251–257.
+
+The partitioning of quicksort is fairly textbook -- rearrange the record
+subset around the pivot, then set up the current and next stack frames to
+sort with the larger and the smaller halves of the pivot, respectively.
+This keeps the stack space requirements to log2(record count).
+
+As a final performance optimization, the hi and lo scanning phase of quicksort
+keeps examined xfile pages mapped in the kernel for as long as possible to
+reduce map/unmap cycles.
+Surprisingly, this reduces overall sort runtime by nearly half again after
+accounting for the application of heapsort directly onto xfile pages.
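+
+The pivot selection strategy can be sketched with the kernel's generic
+``sort()`` helper as follows.
+This is an illustration of the technique, not the actual xfarray sort code;
+the function and parameter names are assumptions.
+
+.. code-block:: c
+
+    #include <linux/sort.h>
+    #include <linux/string.h>
+
+    /* samples points to nine records copied from the xfarray. */
+    static void
+    pick_pivot(void *samples, size_t recsz, void *pivot,
+               int (*cmp)(const void *, const void *))
+    {
+        sort(samples, 9, recsz, cmp, NULL);
+
+        /* The median of nine sorted records is the middle element. */
+        memcpy(pivot, samples + 4 * recsz, recsz);
+    }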
+
+.. _xfblob:
+
+Blob Storage
+````````````
+
+Extended attributes and directories add an additional requirement for staging
+records: arbitrary byte sequences of finite length.
+Each directory entry record needs to store the entry name,
+and each extended attribute needs to store both the attribute name and value.
+The names, keys, and values can consume a large amount of memory, so the
+``xfblob`` abstraction was created to simplify management of these blobs
+atop an xfile.
+
+Blob arrays provide ``xfblob_load`` and ``xfblob_store`` functions to retrieve
+and persist objects.
+The store function returns a magic cookie for every object that it persists.
+Later, callers provide this cookie to ``xfblob_load`` to recall the object.
+The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate``
+function frees them all because compaction is not needed.
+
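+A directory repairer might stage a name and recall it later with a sequence
+like the following sketch.
+The signatures shown here are assumptions based on the description above.
+
+.. code-block:: c
+
+    xfblob_cookie cookie;
+
+    /* Persist the name and get back a cookie. */
+    error = xfblob_store(blob, &cookie, name, namelen);
+    if (error)
+        return error;
+
+    /* Much later, recall the name by its cookie. */
+    error = xfblob_load(blob, cookie, name_buf, namelen);
+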
+The details of repairing directories and extended attributes will be discussed
+in a subsequent section about atomic file content exchanges.
+However, it should be noted that these repair functions only use blob storage
+to cache a small number of entries before adding them to a temporary ondisk
+file, which is why compaction is not required.
+
+The proposed patchset is at the start of the
+`extended attribute repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ series.
+
+.. _xfbtree:
+
+In-Memory B+Trees
+`````````````````
+
+The chapter about :ref:`secondary metadata<secondary_metadata>` mentioned that
+checking and repairing of secondary metadata commonly requires coordination
+between a live metadata scan of the filesystem and writer threads that are
+updating that metadata.
+Keeping the scan data up to date requires the ability to propagate
+metadata updates from the filesystem into the data being collected by the scan.
+This *can* be done by appending concurrent updates into a separate log file and
+applying them before writing the new metadata to disk, but this leads to
+unbounded memory consumption if the rest of the system is very busy.
+Another option is to skip the side-log and commit live updates from the
+filesystem directly into the scan data, which trades more overhead for a lower
+maximum memory requirement.
+In both cases, the data structure holding the scan results must support indexed
+access to perform well.
+
+Given that indexed lookups of scan data are required for both strategies, online
+fsck employs the second strategy of committing live updates directly into
+scan data.
+Because xfarrays are not indexed and do not enforce record ordering, they
+are not suitable for this task.
+Conveniently, however, XFS has a library to create and maintain ordered reverse
+mapping records: the existing rmap btree code!
+If only there was a means to create one in memory.
+
+Recall that the :ref:`xfile <xfile>` abstraction represents memory pages as a
+regular file, which means that the kernel can create byte or block addressable
+virtual address spaces at will.
+The XFS buffer cache specializes in abstracting IO to block-oriented address
+spaces, which means that adaptation of the buffer cache to interface with
+xfiles enables reuse of the entire btree library.
+Btrees built atop an xfile are collectively known as ``xfbtrees``.
+The next few sections describe how they actually work.
+
+The proposed patchset is the
+`in-memory btree
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_
+series.
+
+Using xfiles as a Buffer Cache Target
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Two modifications are necessary to support xfiles as a buffer cache target.
+The first is to make it possible for the ``struct xfs_buftarg`` structure to
+host the ``struct xfs_buf`` rhashtable, because normally those are held by a
+per-AG structure.
+The second change is to modify the buffer ``ioapply`` function to "read" cached
+pages from the xfile and "write" cached pages back to the xfile.
+Concurrent access to individual buffers is controlled by the ``xfs_buf`` lock,
+since the xfile does not provide any locking on its own.
+With this adaptation in place, users of the xfile-backed buffer cache use
+exactly the same APIs as users of the disk-backed buffer cache.
+The separation between xfile and buffer cache implies higher memory usage since
+they do not share pages, but this property could some day enable transactional
+updates to an in-memory btree.
+Today, however, it simply eliminates the need for new code.
+
+Space Management with an xfbtree
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Space management for an xfile is very simple -- each btree block is one memory
+page in size.
+These blocks use the same header format as an on-disk btree, but the in-memory
+block verifiers ignore the checksums, assuming that xfile memory is no more
+corruption-prone than regular DRAM.
+Reusing existing code here is more important than absolute memory efficiency.
+
+The very first block of an xfile backing an xfbtree is a header block.
+The header describes the owner, height, and the block number of the root
+xfbtree block.
+
+To allocate a btree block, use ``xfile_seek_data`` to find a gap in the file.
+If there are no gaps, create one by extending the length of the xfile.
+Preallocate space for the block with ``xfile_prealloc``, and hand back the
+location.
+To free an xfbtree block, use ``xfile_discard`` (which internally uses
+``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile.
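+
+A sketch of this strategy follows; the helper signatures are assumptions,
+and only the function names come from the description above.
+
+.. code-block:: c
+
+    /* Allocate: find a gap (or extend the xfile), then back the
+     * chosen position with a memory page. */
+    error = xfile_prealloc(xfile, pos, PAGE_SIZE);
+
+    /* Free: punch out the memory page backing the btree block. */
+    error = xfile_discard(xfile, pos, PAGE_SIZE);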
+
+Populating an xfbtree
+^^^^^^^^^^^^^^^^^^^^^
+
+An online fsck function that wants to create an xfbtree should proceed as
+follows; a sketch of the full lifecycle appears after the list:
+
+1. Call ``xfile_create`` to create an xfile.
+
+2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target structure
+ pointing to the xfile.
+
+3. Pass the buffer cache target, buffer ops, and other information to
+ ``xfbtree_init`` to initialize the passed in ``struct xfbtree`` and write an
+ initial root block to the xfile.
+ Each btree type should define a wrapper that passes necessary arguments to
+ the creation function.
+ For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of
+ all the necessary details for callers.
+
+4. Pass the xfbtree object to the btree cursor creation function for the
+ btree type.
+ Following the example above, ``xfs_rmapbt_mem_cursor`` takes care of this
+ for callers.
+
+5. Pass the btree cursor to the regular btree functions to make queries against
+ and to update the in-memory btree.
+ For example, a btree cursor for an rmap xfbtree can be passed to the
+ ``xfs_rmap_*`` functions just like any other btree cursor.
+ See the :ref:`next section<xfbtree_commit>` for information on dealing with
+ xfbtree updates that are logged to a transaction.
+
+6. When finished, delete the btree cursor, destroy the xfbtree object, free the
+   buffer target, and destroy the xfile to release all resources.
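+
+The lifecycle above might look like the following sketch for an rmap
+xfbtree.
+All signatures and the teardown helpers are assumptions; only the function
+names appear in the steps above.
+
+.. code-block:: c
+
+    error = xfile_create("rmap repair", 0, &xfile);        /* step 1 */
+    error = xfs_alloc_memory_buftarg(mp, xfile, &btp);     /* step 2 */
+    error = xfs_rmapbt_mem_create(mp, agno, btp, &xfbt);   /* step 3 */
+    cur = xfs_rmapbt_mem_cursor(mp, tp, &xfbt);            /* step 4 */
+
+    /* step 5: pass cur to the xfs_rmap_* functions as usual */
+
+    /* step 6: tear everything down */
+    xfs_btree_del_cursor(cur, error);
+    xfbtree_destroy(&xfbt);
+    xfs_free_buftarg(btp);
+    xfile_destroy(xfile);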
+
+.. _xfbtree_commit:
+
+Committing Logged xfbtree Buffers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Although it is a clever hack to reuse the rmap btree code to handle the staging
+structure, the ephemeral nature of the in-memory btree block storage presents
+some challenges of its own.
+The XFS transaction manager must not commit buffer log items for buffers backed
+by an xfile because the log format does not understand updates for devices
+other than the data device.
+An ephemeral xfbtree probably will not exist by the time the AIL checkpoints
+log transactions back into the filesystem, and certainly won't exist during
+log recovery.
+For these reasons, any code updating an xfbtree in transaction context must
+remove the buffer log items from the transaction and write the updates into the
+backing xfile before committing or cancelling the transaction.
+
+The ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions implement
+this functionality as follows:
+
+1. Find each buffer log item whose buffer targets the xfile.
+
+2. Record the dirty/ordered status of the log item.
+
+3. Detach the log item from the buffer.
+
+4. Queue the buffer to a special delwri list.
+
+5. Clear the transaction dirty flag if the only dirty log items were the ones
+ that were detached in step 3.
+
+6. Submit the delwri list to commit the changes to the xfile, if the updates
+ are being committed.
+
+After removing xfile logged buffers from the transaction in this manner, the
+transaction can be committed or cancelled.
+
+Bulk Loading of Ondisk B+Trees
+------------------------------
+
+As mentioned previously, early iterations of online repair built new btree
+structures by creating a new btree and adding observations individually.
+Loading a btree one record at a time had a slight advantage of not requiring
+the incore records to be sorted prior to commit, but was very slow and leaked
+blocks if the system went down during a repair.
+Loading records one at a time also meant that repair could not control the
+loading factor of the blocks in the new btree.
+
+Fortunately, the venerable ``xfs_repair`` tool had a more efficient means for
+rebuilding a btree index from a collection of records -- bulk btree loading.
+This was implemented rather inefficiently code-wise, since ``xfs_repair``
+had separate copy-pasted implementations for each btree type.
+
+To prepare for online fsck, each of the four bulk loaders was studied, notes
+were taken, and the four were refactored into a single generic btree bulk
+loading mechanism.
+Those notes in turn have been refreshed and are presented below.
+
+Geometry Computation
+````````````````````
+
+The zeroth step of bulk loading is to assemble the entire record set that will
+be stored in the new btree, and sort the records.
+Next, call ``xfs_btree_bload_compute_geometry`` to compute the shape of the
+btree from the record set, the type of btree, and any load factor preferences.
+This information is required for resource reservation.
+
+First, the geometry computation computes the minimum and maximum records that
+will fit in a leaf block from the size of a btree block and the size of the
+block header.
+Roughly speaking, the maximum number of records is::
+
+ maxrecs = (block_size - header_size) / record_size
+
+The XFS design specifies that btree blocks should be merged when possible,
+which means the minimum number of records is half of maxrecs::
+
+ minrecs = maxrecs / 2
+
+The next variable to determine is the desired loading factor.
+This must be at least minrecs and no more than maxrecs.
+Choosing minrecs is undesirable because it wastes half the block.
+Choosing maxrecs is also undesirable because adding a single record to each
+newly rebuilt leaf block will cause a tree split, which causes a noticeable
+drop in performance immediately afterwards.
+The default loading factor was chosen to be 75% of maxrecs, which provides a
+reasonably compact structure without any immediate split penalties::
+
+ default_load_factor = (maxrecs + minrecs) / 2
+
+If space is tight, the loading factor will be set to maxrecs to try to avoid
+running out of space::
+
+ leaf_load_factor = enough space ? default_load_factor : maxrecs
+
+Load factor is computed for btree node blocks using the combined size of the
+btree key and pointer as the record size::
+
+ maxrecs = (block_size - header_size) / (key_size + ptr_size)
+ minrecs = maxrecs / 2
+ node_load_factor = enough space ? default_load_factor : maxrecs
+
+Once that's done, the number of leaf blocks required to store the record set
+can be computed as::
+
+ leaf_blocks = ceil(record_count / leaf_load_factor)
+
+The number of node blocks needed to point to the next level down in the tree
+is computed as::
+
+ n_blocks = (n == 0 ? leaf_blocks : node_blocks[n])
+ node_blocks[n + 1] = ceil(n_blocks / node_load_factor)
+
+The entire computation is performed recursively until the current level only
+needs one block.
+The resulting geometry is as follows:
+
+- For AG-rooted btrees, this level is the root level, so the height of the new
+ tree is ``level + 1`` and the space needed is the summation of the number of
+ blocks on each level.
+
+- For inode-rooted btrees where the records in the top level do not fit in the
+ inode fork area, the height is ``level + 2``, the space needed is the
+ summation of the number of blocks on each level, and the inode fork points to
+ the root block.
+
+- For inode-rooted btrees where the records in the top level can be stored in
+ the inode fork area, then the root block can be stored in the inode, the
+ height is ``level + 1``, and the space needed is one less than the summation
+ of the number of blocks on each level.
+ This only becomes relevant when non-bmap btrees gain the ability to root in
+ an inode, which is a future patchset and only included here for completeness.
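+
+To make the computation concrete, consider an illustrative example; all of
+the numbers below are chosen for round arithmetic and are not taken from any
+actual XFS geometry::
+
+    block_size = 4096, header_size = 56, record_size = 16
+    maxrecs = (4096 - 56) / 16 = 252
+    minrecs = 252 / 2 = 126
+    leaf_load_factor = (252 + 126) / 2 = 189
+
+    record_count = 1,000,000
+    leaf_blocks = ceil(1,000,000 / 189) = 5292
+
+    key_size + ptr_size = 40
+    node maxrecs = (4096 - 56) / 40 = 101
+    node minrecs = 101 / 2 = 50
+    node_load_factor = (101 + 50) / 2 = 75
+
+    node_blocks[1] = ceil(5292 / 75) = 71
+    node_blocks[2] = ceil(71 / 75) = 1
+
+For an AG-rooted btree, the result is a tree three levels high requiring
+5292 + 71 + 1 = 5364 blocks.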
+
+.. _newbt:
+
+Reserving New B+Tree Blocks
+```````````````````````````
+
+Once repair knows the number of blocks needed for the new btree, it allocates
+those blocks using the free space information.
+Each reserved extent is tracked separately by the btree builder state data.
+To improve crash resilience, the reservation code also logs an Extent Freeing
+Intent (EFI) item in the same transaction as each space allocation and attaches
+its in-memory ``struct xfs_extent_free_item`` object to the space reservation.
+If the system goes down, log recovery will use the unfinished EFIs to free the
+unused space, leaving the filesystem unchanged.
+
+Each time the btree builder claims a block for the btree from a reserved
+extent, it updates the in-memory reservation to reflect the claimed space.
+Block reservation tries to allocate as much contiguous space as possible to
+reduce the number of EFIs in play.
+
+While repair is writing these new btree blocks, the EFIs created for the space
+reservations pin the tail of the ondisk log.
+It's possible that other parts of the system will remain busy and push the head
+of the log towards the pinned tail.
+To avoid livelocking the filesystem, the EFIs must not pin the tail of the log
+for too long.
+To alleviate this problem, the dynamic relogging capability of the deferred ops
+mechanism is reused here to commit a transaction at the log head containing an
+EFD for the old EFI and a new EFI at the head.
+This enables the log to release the old EFI to keep the log moving forwards.
+
+EFIs have a role to play during the commit and reaping phases; please see the
+next section and the section about :ref:`reaping<reaping>` for more details.
+
+Proposed patchsets are the
+`bitmap rework
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_
+and the
+`preparation for bulk loading btrees
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_.
+
+
+Writing the New Tree
+````````````````````
+
+This part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims
+a block from the reserved list, writes the new btree block header, fills the
+rest of the block with records, and adds the new leaf block to a list of
+written blocks::
+
+ ┌────┐
+ │leaf│
+ │RRR │
+ └────┘
+
+Sibling pointers are set every time a new block is added to the level::
+
+ ┌────┐ ┌────┐ ┌────┐ ┌────┐
+ │leaf│→│leaf│→│leaf│→│leaf│
+ │RRR │←│RRR │←│RRR │←│RRR │
+ └────┘ └────┘ └────┘ └────┘
+
+When it finishes writing the record leaf blocks, it moves on to the node
+blocks.
+To fill a node block, it walks each block in the next level down in the tree
+to compute the relevant keys and write them into the parent node::
+
+ ┌────┐ ┌────┐
+ │node│──────→│node│
+ │PP │←──────│PP │
+ └────┘ └────┘
+ ↙ ↘ ↙ ↘
+ ┌────┐ ┌────┐ ┌────┐ ┌────┐
+ │leaf│→│leaf│→│leaf│→│leaf│
+ │RRR │←│RRR │←│RRR │←│RRR │
+ └────┘ └────┘ └────┘ └────┘
+
+When it reaches the root level, it is ready to commit the new btree!::
+
+ ┌─────────┐
+ │ root │
+ │ PP │
+ └─────────┘
+ ↙ ↘
+ ┌────┐ ┌────┐
+ │node│──────→│node│
+ │PP │←──────│PP │
+ └────┘ └────┘
+ ↙ ↘ ↙ ↘
+ ┌────┐ ┌────┐ ┌────┐ ┌────┐
+ │leaf│→│leaf│→│leaf│→│leaf│
+ │RRR │←│RRR │←│RRR │←│RRR │
+ └────┘ └────┘ └────┘ └────┘
+
+The first step to commit the new btree is to persist the btree blocks to disk
+synchronously.
+This is a little complicated because a new btree block could have been freed
+in the recent past, so the builder must use ``xfs_buf_delwri_queue_here`` to
+remove the (stale) buffer from the AIL list before it can write the new blocks
+to disk.
+Blocks are queued for IO using a delwri list and written in one large batch
+with ``xfs_buf_delwri_submit``.
+
+Once the new blocks have been persisted to disk, control returns to the
+individual repair function that called the bulk loader.
+The repair function must log the location of the new root in a transaction,
+clean up the space reservations that were made for the new btree, and reap the
+old metadata blocks:
+
+1. Commit the location of the new btree root.
+
+2. For each incore reservation:
+
+ a. Log Extent Freeing Done (EFD) items for all the space that was consumed
+ by the btree builder. The new EFDs must point to the EFIs attached to
+ the reservation to prevent log recovery from freeing the new blocks.
+
+ b. For unclaimed portions of incore reservations, create a regular deferred
+      extent free work item to free the unused space later in the
+ transaction chain.
+
+ c. The EFDs and EFIs logged in steps 2a and 2b must not overrun the
+ reservation of the committing transaction.
+ If the btree loading code suspects this might be about to happen, it must
+ call ``xrep_defer_finish`` to clear out the deferred work and obtain a
+ fresh transaction.
+
+3. Clear out the deferred work a second time to finish the commit and clean
+ the repair transaction.
+
+The transaction rolling in steps 2c and 3 represents a weakness in the repair
+algorithm, because a log flush and a crash before the end of the reap step can
+result in space leaking.
+Online repair functions minimize the chances of this occurring by using very
+large transactions, each of which can accommodate many thousands of block
+freeing instructions.
+Repair moves on to reaping the old blocks, which will be presented in a
+subsequent :ref:`section<reaping>` after a few case studies of bulk loading.
+
+Case Study: Rebuilding the Inode Index
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The high level process to rebuild the inode index btree is:
+
+1. Walk the reverse mapping records to generate ``struct xfs_inobt_rec``
+ records from the inode chunk information and a bitmap of the old inode btree
+ blocks.
+
+2. Append the records to an xfarray in inode order.
+
+3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
+ of blocks needed for the inode btree.
+ If the free space inode btree is enabled, call it again to estimate the
+ geometry of the finobt.
+
+4. Allocate the number of blocks computed in the previous step.
+
+5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
+ generate the internal node blocks.
+ If the free space inode btree is enabled, call it again to load the finobt.
+
+6. Commit the location of the new btree root block(s) to the AGI.
+
+7. Reap the old btree blocks using the bitmap created in step 1.
+
+Details are as follows.
+
+The inode btree maps inumbers to the ondisk location of the associated
+inode records, which means that the inode btrees can be rebuilt from the
+reverse mapping information.
+Reverse mapping records with an owner of ``XFS_RMAP_OWN_INOBT`` mark the
+location of the old inode btree blocks.
+Each reverse mapping record with an owner of ``XFS_RMAP_OWN_INODES`` marks the
+location of at least one inode cluster buffer.
+A cluster is the smallest number of ondisk inodes that can be allocated or
+freed in a single transaction; it is never smaller than 1 fs block or 4 inodes.
+
+For the space represented by each inode cluster, ensure that there are no
+records in the free space btrees nor any records in the reference count btree.
+If there are, the space metadata inconsistencies are reason enough to abort the
+operation.
+Otherwise, read each cluster buffer to check that its contents appear to be
+ondisk inodes and to decide if the file is allocated
+(``xfs_dinode.i_mode != 0``) or free (``xfs_dinode.i_mode == 0``).
+Accumulate the results of successive inode cluster buffer reads until there is
+enough information to fill a single inode chunk record, which is 64 consecutive
+numbers in the inumber keyspace.
+If the chunk is sparse, the chunk record may include holes.
+
+Once the repair function accumulates one chunk's worth of data, it calls
+``xfarray_append`` to add the inode btree record to the xfarray.
+This xfarray is walked twice during the btree creation step -- once to populate
+the inode btree with all inode chunk records, and a second time to populate the
+free inode btree with records for chunks that have free non-sparse inodes.
+The number of records for the inode btree is the number of xfarray records,
+but the record count for the free inode btree has to be computed as inode chunk
+records are stored in the xfarray.
+
+The proposed patchset is the
+`AG btree repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
+series.
+
+Case Study: Rebuilding the Space Reference Counts
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Reverse mapping records are used to rebuild the reference count information.
+Reference counts are required for correct operation of copy on write for shared
+file data.
+Imagine the reverse mapping entries as rectangles representing extents of
+physical blocks, and that the rectangles can be laid down to allow them to
+overlap each other.
+From the diagram below, it is apparent that a reference count record must start
+or end wherever the height of the stack changes.
+In other words, the record emission stimulus is level-triggered::
+
+ █ ███
+ ██ █████ ████ ███ ██████
+ ██ ████ ███████████ ████ █████████
+ ████████████████████████████████ ███████████
+ ^ ^ ^^ ^^ ^ ^^ ^^^ ^^^^ ^ ^^ ^ ^ ^
+ 2 1 23 21 3 43 234 2123 1 01 2 3 0
+
+The ondisk reference count btree does not store the refcount == 0 cases because
+the free space btree already records which blocks are free.
+Extents being used to stage copy-on-write operations should be the only records
+with refcount == 1.
+Single-owner file blocks aren't recorded in either the free space or the
+reference count btrees.
+
+The high level process to rebuild the reference count btree is:
+
+1. Walk the reverse mapping records to generate ``struct xfs_refcount_irec``
+ records for any space having more than one reverse mapping and add them to
+ the xfarray.
+ Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the xfarray
+ because these are extents allocated to stage a copy on write operation and
+ are tracked in the refcount btree.
+
+ Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap of old
+ refcount btree blocks.
+
+2. Sort the records in physical extent order, putting the CoW staging extents
+ at the end of the xfarray.
+ This matches the sorting order of records in the refcount btree.
+
+3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
+ of blocks needed for the new tree.
+
+4. Allocate the number of blocks computed in the previous step.
+
+5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
+ generate the internal node blocks.
+
+6. Commit the location of new btree root block to the AGF.
+
+7. Reap the old btree blocks using the bitmap created in step 1.
+
+Details are as follows; the same algorithm is used by ``xfs_repair`` to
+generate refcount information from reverse mapping records.
+
+- Until the reverse mapping btree runs out of records:
+
+ - Retrieve the next record from the btree and put it in a bag.
+
+ - Collect all records with the same starting block from the btree and put
+ them in the bag.
+
+ - While the bag isn't empty:
+
+ - Among the mappings in the bag, compute the lowest block number where the
+ reference count changes.
+ This position will be either the starting block number of the next
+ unprocessed reverse mapping or the next block after the shortest mapping
+ in the bag.
+
+ - Remove all mappings from the bag that end at this position.
+
+ - Collect all reverse mappings that start at this position from the btree
+ and put them in the bag.
+
+ - If the size of the bag changed and is greater than one, create a new
+ refcount record associating the block number range that we just walked to
+ the size of the bag.
+
+The bag-like structure in this case is a type 3 xfarray as discussed in the
+:ref:`xfarray access patterns<xfarray_access_patterns>` section.
+Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and
+removed via ``xfarray_unset``.
+Bag members are examined through ``xfarray_iter`` loops.
+
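+Expressed as pseudocode, and assuming hypothetical helpers for the bag
+operations named above, the level-triggered record emission might look like
+this::
+
+    cur_bno = start of the first unprocessed reverse mapping
+    while (bag is not empty) {
+        old_refcount = size of the bag
+        next_bno = min(start of the next unprocessed reverse mapping,
+                       lowest ending block among bag members)
+
+        remove from the bag all mappings ending at next_bno    /* xfarray_unset  */
+        add to the bag all mappings starting at next_bno       /* store_anywhere */
+
+        if (size of the bag != old_refcount) {
+            if (old_refcount > 1)
+                emit refcount record (cur_bno, next_bno, old_refcount)
+            cur_bno = next_bno
+        }
+    }
+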
+The proposed patchset is the
+`AG btree repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
+series.
+
+Case Study: Rebuilding File Fork Mapping Indices
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The high level process to rebuild a data/attr fork mapping btree is:
+
+1. Walk the reverse mapping records to generate ``struct xfs_bmbt_rec``
+ records from the reverse mapping records for that inode and fork.
+ Append these records to an xfarray.
+ Compute the bitmap of the old bmap btree blocks from the ``BMBT_BLOCK``
+ records.
+
+2. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
+ of blocks needed for the new tree.
+
+3. Sort the records in file offset order.
+
+4. If the extent records would fit in the inode fork immediate area, commit the
+ records to that immediate area and skip to step 8.
+
+5. Allocate the number of blocks computed in step 2.
+
+6. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
+ generate the internal node blocks.
+
+7. Commit the new btree root block to the inode fork immediate area.
+
+8. Reap the old btree blocks using the bitmap created in step 1.
+
+There are some complications here:
+First, it's possible to move the fork offset to adjust the sizes of the
+immediate areas if the data and attr forks are not both in BMBT format.
+Second, if there are sufficiently few fork mappings, it may be possible to use
+EXTENTS format instead of BMBT, which may require a conversion.
+Third, the incore extent map must be reloaded carefully to avoid disturbing
+any delayed allocation extents.
+
+The proposed patchset is the
+`file mapping repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings>`_
+series.
+
+.. _reaping:
+
+Reaping Old Metadata Blocks
+---------------------------
+
+Whenever online fsck builds a new data structure to replace one that is
+suspect, there is a question of how to find and dispose of the blocks that
+belonged to the old structure.
+The laziest method of course is not to deal with them at all, but this slowly
+leads to service degradations as space leaks out of the filesystem.
+Hopefully, someone will schedule a rebuild of the free space information to
+plug all those leaks.
+Offline repair rebuilds all space metadata after recording the usage of
+the files and directories that it decides not to clear, hence it can build new
+structures in the discovered free space and avoid the question of reaping.
+
+As part of a repair, online fsck relies heavily on the reverse mapping records
+to find space that is owned by the corresponding rmap owner yet truly free.
+Cross referencing rmap records with other rmap records is necessary because
+there may be other data structures that also think they own some of those
+blocks (e.g. crosslinked trees).
+Permitting the block allocator to hand them out again will not push the system
+towards consistency.
+
+For space metadata, the process of finding extents to dispose of generally
+follows this format:
+
+1. Create a bitmap of space used by data structures that must be preserved.
+ The space reservations used to create the new metadata can be used here if
+ the same rmap owner code is used to denote all of the objects being rebuilt.
+
+2. Survey the reverse mapping data to create a bitmap of space owned by the
+ same ``XFS_RMAP_OWN_*`` number for the metadata that is being preserved.
+
+3. Use the bitmap disunion operator to subtract (1) from (2).
+ The remaining set bits represent candidate extents that could be freed.
+ The process moves on to step 4 below.
+
+Repairs for file-based metadata such as extended attributes, directories,
+symbolic links, quota files and realtime bitmaps are performed by building a
+new structure attached to a temporary file and exchanging all mappings in the
+file forks.
+Afterward, the mappings in the old file fork are the candidate blocks for
+disposal.
+
+The process for disposing of old extents is as follows:
+
+4. For each candidate extent, count the number of reverse mapping records for
+ the first block in that extent that do not have the same rmap owner for the
+ data structure being repaired.
+
+ - If zero, the block has a single owner and can be freed.
+
+ - If not, the block is part of a crosslinked structure and must not be
+ freed.
+
+5. Starting with the next block in the extent, figure out how many more blocks
+ have the same zero/nonzero other owner status as that first block.
+
+6. If the region is crosslinked, delete the reverse mapping entry for the
+ structure being repaired and move on to the next region.
+
+7. If the region is to be freed, mark any corresponding buffers in the buffer
+ cache as stale to prevent log writeback.
+
+8. Free the region and move on.
+
+However, there is one complication to this procedure.
+Transactions are of finite size, so the reaping process must be careful to roll
+the transactions to avoid overruns.
+Overruns come from two sources:
+
+a. EFIs logged on behalf of space that is no longer occupied
+
+b. Log items for buffer invalidations
+
+This is also a window in which a crash during the reaping process can leak
+blocks.
+As stated earlier, online repair functions use very large transactions to
+minimize the chances of this occurring.
+
+The proposed patchset is the
+`preparation for bulk loading btrees
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_
+series.
+
+Case Study: Reaping After a Regular Btree Repair
+````````````````````````````````````````````````
+
+Old reference count and inode btrees are the easiest to reap because they have
+rmap records with special owner codes: ``XFS_RMAP_OWN_REFC`` for the refcount
+btree, and ``XFS_RMAP_OWN_INOBT`` for the inode and free inode btrees.
+Creating a list of extents to reap the old btree blocks is quite simple,
+conceptually:
+
+1. Lock the relevant AGI/AGF header buffers to prevent allocation and frees.
+
+2. For each reverse mapping record with an rmap owner corresponding to the
+ metadata structure being rebuilt, set the corresponding range in a bitmap.
+
+3. Walk the current data structures that have the same rmap owner.
+ For each block visited, clear that range in the above bitmap.
+
+4. Each set bit in the bitmap represents a block that could be a block from the
+ old data structures and hence is a candidate for reaping.
+ In other words, ``(rmap_records_owned_by & ~blocks_reachable_by_walk)``
+ are the blocks that might be freeable.
+
+If it is possible to maintain the AGF lock throughout the repair (which is the
+common case), then step 2 can be performed at the same time as the reverse
+mapping record walk that creates the records for the new btree.
+
+Case Study: Rebuilding the Free Space Indices
+`````````````````````````````````````````````
+
+The high level process to rebuild the free space indices is:
+
+1. Walk the reverse mapping records to generate ``struct xfs_alloc_rec_incore``
+ records from the gaps in the reverse mapping btree.
+
+2. Append the records to an xfarray.
+
+3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
+ of blocks needed for each new tree.
+
+4. Allocate the number of blocks computed in the previous step from the free
+ space information collected.
+
+5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
+ generate the internal node blocks for the free space by length index.
+ Call it again for the free space by block number index.
+
+6. Commit the locations of the new btree root blocks to the AGF.
+
+7. Reap the old btree blocks by looking for space that is not recorded by the
+ reverse mapping btree, the new free space btrees, or the AGFL.
+
+Repairing the free space btrees has three key complications over a regular
+btree repair:
+
+First, free space is not explicitly tracked in the reverse mapping records.
+Hence, the new free space records must be inferred from gaps in the physical
+space component of the keyspace of the reverse mapping btree.
+
+Second, free space repairs cannot use the common btree reservation code because
+new blocks are reserved out of the free space btrees.
+This is impossible when repairing the free space btrees themselves.
+However, repair holds the AGF buffer lock for the duration of the free space
+index reconstruction, so it can use the collected free space information to
+supply the blocks for the new free space btrees.
+It is not necessary to back each reserved extent with an EFI because the new
+free space btrees are constructed in what the ondisk filesystem thinks is
+unowned space.
+However, if reserving blocks for the new btrees from the collected free space
+information changes the number of free space records, repair must re-estimate
+the new free space btree geometry with the new record count until the
+reservation is sufficient.
+As part of committing the new btrees, repair must ensure that reverse mappings
+are created for the reserved blocks and that unused reserved blocks are
+inserted into the free space btrees.
+Deferred rmap and freeing operations are used to ensure that this transition
+is atomic, similar to the other btree repair functions.
+
+Third, finding the blocks to reap after the repair is not overly
+straightforward.
+Blocks for the free space btrees and the reverse mapping btrees are supplied by
+the AGFL.
+Blocks put onto the AGFL have reverse mapping records with the owner
+``XFS_RMAP_OWN_AG``.
+This ownership is retained when blocks move from the AGFL into the free space
+btrees or the reverse mapping btrees.
+When repair walks reverse mapping records to synthesize free space records, it
+creates a bitmap (``ag_owner_bitmap``) of all the space claimed by
+``XFS_RMAP_OWN_AG`` records.
+The repair context maintains a second bitmap corresponding to the rmap btree
+blocks and the AGFL blocks (``rmap_agfl_bitmap``).
+When the walk is complete, the bitmap disunion operation ``(ag_owner_bitmap &
+~rmap_agfl_bitmap)`` computes the extents that are used by the old free space
+btrees.
+These blocks can then be reaped using the methods outlined above.
+
+The proposed patchset is the
+`AG btree repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
+series.
+
+.. _rmap_reap:
+
+Case Study: Reaping After Repairing Reverse Mapping Btrees
+``````````````````````````````````````````````````````````
+
+Old reverse mapping btrees are less difficult to reap after a repair.
+As mentioned in the previous section, blocks on the AGFL, the two free space
+btree blocks, and the reverse mapping btree blocks all have reverse mapping
+records with ``XFS_RMAP_OWN_AG`` as the owner.
+The full process of gathering reverse mapping records and building a new btree
+are described in the case study of
+:ref:`live rebuilds of rmap data <rmap_repair>`, but a crucial point from that
+discussion is that the new rmap btree will not contain any records for the old
+rmap btree, nor will the old btree blocks be tracked in the free space btrees.
+The list of candidate reaping blocks is computed by setting the bits
+corresponding to the gaps in the new rmap btree records, and then clearing the
+bits corresponding to extents in the free space btrees and the current AGFL
+blocks.
+The result ``(new_rmapbt_gaps & ~(agfl | bnobt_records))`` are reaped using the
+methods outlined above.
+
+The rest of the process of rebuilding the reverse mapping btree is discussed
+in a separate :ref:`case study<rmap_repair>`.
+
+The proposed patchset is the
+`AG btree repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
+series.
+
+Case Study: Rebuilding the AGFL
+```````````````````````````````
+
+The allocation group free block list (AGFL) is repaired as follows:
+
+1. Create a bitmap for all the space that the reverse mapping data claims is
+ owned by ``XFS_RMAP_OWN_AG``.
+
+2. Subtract the space used by the two free space btrees and the rmap btree.
+
+3. Subtract any space that the reverse mapping data claims is owned by any
+ other owner, to avoid re-adding crosslinked blocks to the AGFL.
+
+4. Once the AGFL is full, reap any leftover blocks.
+
+5. The next operation to fix the freelist will right-size the list.
+
+See `fs/xfs/scrub/agheader_repair.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/xfs/scrub/agheader_repair.c>`_ for more details.
+
+Inode Record Repairs
+--------------------
+
+Inode records must be handled carefully, because they have both ondisk records
+("dinodes") and an in-memory ("cached") representation.
+There is a very high potential for cache coherency issues if online fsck is not
+careful to access the ondisk metadata *only* when the ondisk metadata is so
+badly damaged that the filesystem cannot load the in-memory representation.
+When online fsck wants to open a damaged file for scrubbing, it must use
+specialized resource acquisition functions that return either the in-memory
+representation *or* a lock on whichever object is necessary to prevent any
+update to the ondisk location.
+
+The only repairs that should be made to the ondisk inode buffers are whatever
+is necessary to get the in-core structure loaded.
+This means fixing whatever is caught by the inode cluster buffer and inode fork
+verifiers, and retrying the ``iget`` operation.
+If the second ``iget`` fails, the repair has failed.
+
+Once the in-memory representation is loaded, repair can lock the inode and can
+subject it to comprehensive checks, repairs, and optimizations.
+Most inode attributes are easy to check and constrain, or are user-controlled
+arbitrary bit patterns; these are both easy to fix.
+Dealing with the data and attr fork extent counts and the file block counts is
+more complicated, because computing the correct value requires traversing the
+forks, or if that fails, leaving the fields invalid and waiting for the fork
+fsck functions to run.
+
+The proposed patchset is the
+`inode
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_
+repair series.
+
+Quota Record Repairs
+--------------------
+
+Similar to inodes, quota records ("dquots") also have both ondisk records and
+an in-memory representation, and hence are subject to the same cache coherency
+issues.
+Somewhat confusingly, both are known as dquots in the XFS codebase.
+
+The only repairs that should be made to the ondisk quota record buffers are
+whatever is necessary to get the in-core structure loaded.
+Once the in-memory representation is loaded, the only attributes needing
+checking are obviously bad limits and timer values.
+
+Quota usage counters are checked, repaired, and discussed separately in the
+section about :ref:`live quotacheck <quotacheck>`.
+
+The proposed patchset is the
+`quota
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
+repair series.
+
+.. _fscounters:
+
+Freezing to Fix Summary Counters
+--------------------------------
+
+Filesystem summary counters track availability of filesystem resources such
+as free blocks, free inodes, and allocated inodes.
+This information could be compiled by walking the free space and inode indexes,
+but this is a slow process, so XFS maintains a copy in the ondisk superblock
+that should reflect the ondisk metadata, at least when the filesystem has been
+unmounted cleanly.
+For performance reasons, XFS also maintains incore copies of those counters,
+which are key to enabling resource reservations for active transactions.
+Writer threads reserve the worst-case quantities of resources from the
+incore counter and give back whatever they don't use at commit time.
+It is therefore only necessary to serialize on the superblock when the
+superblock is being committed to disk.
+
+The lazy superblock counter feature introduced in XFS v5 took this even further
+by training log recovery to recompute the summary counters from the AG headers,
+which eliminated the need for most transactions even to touch the superblock.
+The only time XFS commits the summary counters is at filesystem unmount.
+To reduce contention even further, the incore counter is implemented as a
+percpu counter, which means that each CPU is allocated a batch of blocks from a
+global incore counter and can satisfy small allocations from the local batch.
+
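+The tension between cheap, imprecise reads and expensive, precise sums can be
+illustrated with the kernel's generic percpu counter API; the counter in this
+sketch is illustrative and is not the actual XFS counter object.
+
+.. code-block:: c
+
+    #include <linux/percpu_counter.h>
+
+    struct percpu_counter free_blocks;
+
+    percpu_counter_init(&free_blocks, 0, GFP_KERNEL);
+
+    /* Cheap: usually touches only this CPU's batch. */
+    percpu_counter_add(&free_blocks, 128);
+
+    /* Fast but approximate: ignores other CPUs' unflushed batches. */
+    s64 approx = percpu_counter_read(&free_blocks);
+
+    /* Exact but expensive: sums every CPU's batch, and may be stale
+     * the moment it returns if writers are still running. */
+    s64 exact = percpu_counter_sum(&free_blocks);
+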
+The high-performance nature of the summary counters makes it difficult for
+online fsck to check them, since there is no way to quiesce a percpu counter
+while the system is running.
+Although online fsck can read the filesystem metadata to compute the correct
+values of the summary counters, there's no way to hold the value of a percpu
+counter stable, so it's quite possible that the counter will be out of date by
+the time the walk is complete.
+Earlier versions of online scrub would return to userspace with an incomplete
+scan flag, but this is not a satisfying outcome for a system administrator.
+For repairs, the in-memory counters must be stabilized while walking the
+filesystem metadata to get an accurate reading and install it in the percpu
+counter.
+
+To satisfy this requirement, online fsck must prevent other programs in the
+system from initiating new writes to the filesystem, it must disable background
+garbage collection threads, and it must wait for existing writer programs to
+exit the kernel.
+Once that has been established, scrub can walk the AG free space indexes, the
+inode btrees, and the realtime bitmap to compute the correct value of all
+four summary counters.
+This is very similar to a filesystem freeze, though not all of the pieces are
+necessary:
+
+- The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to
+ prevent other threads from thawing the filesystem, or other scrub threads
+ from initiating another fscounters freeze.
+
+- It does not quiesce the log.
+
+With this code in place, it is now possible to pause the filesystem for just
+long enough to check and correct the summary counters.
+
++--------------------------------------------------------------------------+
+| **Historical Sidebar**: |
++--------------------------------------------------------------------------+
+| The initial implementation used the actual VFS filesystem freeze |
+| mechanism to quiesce filesystem activity. |
+| With the filesystem frozen, it is possible to resolve the counter values |
+| with exact precision, but there are many problems with calling the VFS |
+| methods directly: |
+| |
+| - Other programs can unfreeze the filesystem without our knowledge. |
+| This leads to incorrect scan results and incorrect repairs. |
+| |
+| - Adding an extra lock to prevent others from thawing the filesystem |
+| required the addition of a ``->freeze_super`` function to wrap |
+| ``freeze_fs()``. |
+| This in turn caused other subtle problems because it turns out that |
+| the VFS ``freeze_super`` and ``thaw_super`` functions can drop the |
+| last reference to the VFS superblock, and any subsequent access |
+| becomes a UAF bug! |
+| This can happen if the filesystem is unmounted while the underlying |
+| block device has frozen the filesystem. |
+| This problem could be solved by grabbing extra references to the |
+| superblock, but it felt suboptimal given the other inadequacies of |
+| this approach. |
+| |
+| - The log need not be quiesced to check the summary counters, but a VFS |
+| freeze initiates one anyway. |
+| This adds unnecessary runtime to live fscounter fsck operations. |
+| |
+| - Quiescing the log means that XFS flushes the (possibly incorrect) |
+| counters to disk as part of cleaning the log. |
+| |
+| - A bug in the VFS meant that freeze could complete even when |
+| sync_filesystem fails to flush the filesystem and returns an error. |
+| This bug was fixed in Linux 5.17. |
++--------------------------------------------------------------------------+
+
+The proposed patchset is the
+`summary counter cleanup
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
+series.
+
+Full Filesystem Scans
+---------------------
+
+Certain types of metadata can only be checked by walking every file in the
+entire filesystem to record observations and comparing the observations against
+what's recorded on disk.
+Like every other type of online repair, repairs are made by writing those
+observations to disk in a replacement structure and committing it atomically.
+However, it is not practical to shut down the entire filesystem to examine
+hundreds of billions of files because the downtime would be excessive.
+Therefore, online fsck must build the infrastructure to manage a live scan of
+all the files in the filesystem.
+There are two questions that need to be solved to perform a live walk:
+
+- How does scrub manage the scan while it is collecting data?
+
+- How does the scan keep abreast of changes being made to the system by other
+ threads?
+
+.. _iscan:
+
+Coordinated Inode Scans
+```````````````````````
+
+In the original Unix filesystems of the 1970s, each directory entry contained
+an index number (*inumber*) which was used as an index into an ondisk array
+(*itable*) of fixed-size records (*inodes*) describing a file's attributes and
+its data block mapping.
+This system is described by J. Lions, `"inode (5659)"
+<http://www.lemis.com/grog/Documentation/Lions/>`_ in *Lions' Commentary on
+UNIX, 6th Edition*, (Dept. of Computer Science, the University of New South
+Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson,
+`"Implementation of the File System"
+<https://archive.org/details/bstj57-6-1905/page/n8/mode/1up>`_, from *The UNIX
+Time-Sharing System*, (The Bell System Technical Journal, July 1978), pp.
+1913-4.
+
+XFS retains most of this design, except now inumbers are search keys over all
+the space in the data section of the filesystem.
+They form a continuous keyspace that can be expressed as a 64-bit integer,
+though the inodes themselves are sparsely distributed within the keyspace.
+Scans proceed in a linear fashion across the inumber keyspace, starting from
+``0x0`` and ending at ``0xFFFFFFFFFFFFFFFF``.
+Naturally, a scan through a keyspace requires a scan cursor object to track the
+scan progress.
+Because this keyspace is sparse, this cursor contains two parts.
+The first part of this scan cursor object tracks the inode that will be
+examined next; call this the examination cursor.
+Somewhat less obviously, the scan cursor object must also track which parts of
+the keyspace have already been visited, which is critical for deciding if a
+concurrent filesystem update needs to be incorporated into the scan data.
+Call this the visited inode cursor.
+
+Advancing the scan cursor is a multi-step process encapsulated in
+``xchk_iscan_iter``:
+
+1. Lock the AGI buffer of the AG containing the inode pointed to by the visited
+ inode cursor.
+   This guarantees that inodes in this AG cannot be allocated or freed while
+ advancing the cursor.
+
+2. Use the per-AG inode btree to look up the next inumber after the one that
+ was just visited, since it may not be keyspace adjacent.
+
+3. If there are no more inodes left in this AG:
+
+ a. Move the examination cursor to the point of the inumber keyspace that
+ corresponds to the start of the next AG.
+
+ b. Adjust the visited inode cursor to indicate that it has "visited" the
+ last possible inode in the current AG's inode keyspace.
+ XFS inumbers are segmented, so the cursor needs to be marked as having
+ visited the entire keyspace up to just before the start of the next AG's
+ inode keyspace.
+
+ c. Unlock the AGI and return to step 1 if there are unexamined AGs in the
+ filesystem.
+
+ d. If there are no more AGs to examine, set both cursors to the end of the
+ inumber keyspace.
+ The scan is now complete.
+
+4. Otherwise, there is at least one more inode to scan in this AG:
+
+ a. Move the examination cursor ahead to the next inode marked as allocated
+ by the inode btree.
+
+ b. Adjust the visited inode cursor to point to the inode just prior to where
+ the examination cursor is now.
+ Because the scanner holds the AGI buffer lock, no inodes could have been
+ created in the part of the inode keyspace that the visited inode cursor
+      just advanced through.
+
+5. Get the incore inode for the inumber of the examination cursor.
+ By maintaining the AGI buffer lock until this point, the scanner knows that
+ it was safe to advance the examination cursor across the entire keyspace,
+ and that it has stabilized this next inode so that it cannot disappear from
+ the filesystem until the scan releases the incore inode.
+
+6. Drop the AGI lock and return the incore inode to the caller.
+
+Online fsck functions scan all files in the filesystem as follows; a sketch
+of the loop appears after the list:
+
+1. Start a scan by calling ``xchk_iscan_start``.
+
+2. Advance the scan cursor (``xchk_iscan_iter``) to get the next inode.
+ If one is provided:
+
+ a. Lock the inode to prevent updates during the scan.
+
+ b. Scan the inode.
+
+ c. While still holding the inode lock, adjust the visited inode cursor
+ (``xchk_iscan_mark_visited``) to point to this inode.
+
+ d. Unlock and release the inode.
+
+3. Call ``xchk_iscan_teardown`` to complete the scan.
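+
+The loop might look like the following sketch.
+The ``xchk_iscan_*`` names come from the text; the signatures, the locking
+details, and the per-inode scan helper are assumptions.
+
+.. code-block:: c
+
+    struct xchk_iscan iscan;
+    struct xfs_inode *ip;
+    int ret;
+
+    xchk_iscan_start(sc, &iscan);
+    while ((ret = xchk_iscan_iter(&iscan, &ip)) == 1) {
+        xfs_ilock(ip, XFS_ILOCK_EXCL);        /* a: prevent updates */
+        ret = xchk_scan_one_inode(sc, ip);    /* b: hypothetical scanner */
+        xchk_iscan_mark_visited(&iscan, ip);  /* c: advance visited cursor */
+        xfs_iunlock(ip, XFS_ILOCK_EXCL);      /* d: unlock... */
+        xchk_irele(sc, ip);                   /* d: ...and release */
+        if (ret)
+            break;
+    }
+    xchk_iscan_teardown(&iscan);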
+
+There are subtleties with the inode cache that complicate grabbing the incore
+inode for the caller.
+First, it is an absolute requirement that the inode metadata be consistent
+enough to load it into the inode cache.
+Second, if the incore inode is stuck in some intermediate state, the scan
+coordinator must release the AGI and push the main filesystem to get the inode
+back into a loadable state.
+
+The proposed patches are the
+`inode scanner
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
+series.
+The first user of the new functionality is the
+`online quotacheck
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
+series.
+
+Inode Management
+````````````````
+
+In regular filesystem code, references to allocated XFS incore inodes are
+always obtained (``xfs_iget``) outside of transaction context because the
+creation of the incore context for an existing file does not require metadata
+updates.
+However, it is important to note that references to incore inodes obtained as
+part of file creation must be performed in transaction context because the
+filesystem must ensure the atomicity of the ondisk inode btree index updates
+and the initialization of the actual ondisk inode.
+
+References to incore inodes are always released (``xfs_irele``) outside of
+transaction context because there are a handful of activities that might
+require ondisk updates:
+
+- The VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode
+ release.
+
+- Speculative preallocations need to be unreserved.
+
+- An unlinked file may have lost its last reference, in which case the entire
+ file must be inactivated, which involves releasing all of its resources in
+ the ondisk metadata and freeing the inode.
+
+These activities are collectively called inode inactivation.
+Inactivation has two parts -- the VFS part, which initiates writeback on all
+dirty file pages, and the XFS part, which cleans up XFS-specific information
+and frees the inode if it was unlinked.
+If the inode is unlinked (or unconnected after a file handle operation), the
+kernel drops the inode into the inactivation machinery immediately.
+
+During normal operation, resource acquisition for an update follows this order
+to avoid deadlocks:
+
+1. Inode reference (``iget``).
+
+2. Filesystem freeze protection, if repairing (``mnt_want_write_file``).
+
+3. Inode ``IOLOCK`` (VFS ``i_rwsem``) lock to control file IO.
+
+4. Inode ``MMAPLOCK`` (page cache ``invalidate_lock``) lock for operations that
+ can update page cache mappings.
+
+5. Log feature enablement.
+
+6. Transaction log space grant.
+
+7. Space on the data and realtime devices for the transaction.
+
+8. Incore dquot references, if a file is being repaired.
+ Note that they are not locked, merely acquired.
+
+9. Inode ``ILOCK`` for file metadata updates.
+
+10. AG header buffer locks / Realtime metadata inode ILOCK.
+
+11. Realtime metadata buffer locks, if applicable.
+
+12. Extent mapping btree blocks, if applicable.
+
+Resources are often released in the reverse order, though this is not required.
+However, online fsck differs from regular XFS operations because it may examine
+an object that normally is acquired in a later stage of the locking order, and
+then decide to cross-reference the object with an object that is acquired
+earlier in the order.
+The next few sections detail the specific ways in which online fsck takes care
+to avoid deadlocks.
+
+iget and irele During a Scrub
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+An inode scan performed on behalf of a scrub operation runs in transaction
+context, and possibly with resources already locked and bound to it.
+This isn't much of a problem for ``iget`` since it can operate in the context
+of an existing transaction, as long as all of the bound resources are acquired
+before the inode reference in the regular filesystem.
+
+When the VFS ``iput`` function is given a linked inode with no other
+references, it normally puts the inode on an LRU list in the hope that it can
+save time if another process re-opens the file before the system runs out
+of memory and frees it.
+Filesystem callers can short-circuit the LRU process by setting a ``DONTCACHE``
+flag on the inode to cause the kernel to try to drop the inode into the
+inactivation machinery immediately.
+
+In the past, inactivation was always done from the process that dropped the
+inode, which was a problem for scrub because scrub may already hold a
+transaction, and XFS does not support nesting transactions.
+On the other hand, if there is no scrub transaction, it is desirable to drop
+otherwise unused inodes immediately to avoid polluting caches.
+To capture these nuances, the online fsck code has a separate ``xchk_irele``
+function to set or clear the ``DONTCACHE`` flag to get the required release
+behavior.
+
+Proposed patchsets include fixing
+`scrub iget usage
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and
+`dir iget usage
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`_.
+
+.. _ilocking:
+
+Locking Inodes
+^^^^^^^^^^^^^^
+
+In regular filesystem code, the VFS and XFS will acquire multiple IOLOCK locks
+in a well-known order: parent → child when updating the directory tree, and
+in numerical order of the addresses of their ``struct inode`` objects otherwise.
+For regular files, the MMAPLOCK can be acquired after the IOLOCK to stop page
+faults.
+If two MMAPLOCKs must be acquired, they are acquired in numerical order of
+the addresses of their ``struct address_space`` objects.
+Due to the structure of existing filesystem code, IOLOCKs and MMAPLOCKs must be
+acquired before transactions are allocated.
+If two ILOCKs must be acquired, they are acquired in inumber order.
+
+Inode lock acquisition must be done carefully during a coordinated inode scan.
+Online fsck cannot abide these conventions, because for a directory tree
+scanner, the scrub process holds the IOLOCK of the file being scanned and it
+needs to take the IOLOCK of the file at the other end of the directory link.
+If the directory tree is corrupt because it contains a cycle, ``xfs_scrub``
+cannot use the regular inode locking functions and avoid becoming trapped in an
+ABBA deadlock.
+
+Solving both of these problems is straightforward -- any time online fsck
+needs to take a second lock of the same class, it uses trylock to avoid an ABBA
+deadlock.
+If the trylock fails, scrub drops all inode locks and uses trylock loops to
+(re)acquire all necessary resources.
+Trylock loops enable scrub to check for pending fatal signals, which is how
+scrub avoids deadlocking the filesystem or becoming an unresponsive process.
+However, trylock loops mean that online fsck must be prepared to measure the
+resource being scrubbed before and after the lock cycle to detect changes and
+react accordingly.
+
+.. _dirparent:
+
+Case Study: Finding a Directory Parent
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Consider the directory parent pointer repair code as an example.
+Online fsck must verify that the dotdot dirent of a directory points up to a
+parent directory, and that the parent directory contains exactly one dirent
+pointing down to the child directory.
+Fully validating this relationship (and repairing it if possible) requires a
+walk of every directory on the filesystem while holding the child locked, and
+while updates to the directory tree are being made.
+The coordinated inode scan provides a way to walk the filesystem without the
+possibility of missing an inode.
+The child directory is kept locked to prevent updates to the dotdot dirent, but
+if the scanner fails to lock a parent, it can drop and relock both the child
+and the prospective parent.
+If the dotdot entry changes while the directory is unlocked, then a move or
+rename operation must have changed the child's parentage, and the scan can
+exit early.
+
+The proposed patchset is the
+`directory repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
+series.
+
+.. _fshooks:
+
+Filesystem Hooks
+`````````````````
+
+The second piece of support that online fsck functions need during a full
+filesystem scan is the ability to stay informed about updates being made by
+other threads in the filesystem, since comparisons against the past are useless
+in a dynamic environment.
+Two pieces of Linux kernel infrastructure enable online fsck to monitor regular
+filesystem operations: filesystem hooks and :ref:`static keys<jump_labels>`.
+
+Filesystem hooks convey information about an ongoing filesystem operation to
+a downstream consumer.
+In this case, the downstream consumer is always an online fsck function.
+Because multiple fsck functions can run in parallel, online fsck uses the Linux
+notifier call chain facility to dispatch updates to any number of interested
+fsck processes.
+Call chains are a dynamic list, which means that they can be configured at
+run time.
+Because these hooks are private to the XFS module, the information passed along
+contains exactly what the checking function needs to update its observations.
+
+The current implementation of XFS hooks uses SRCU notifier chains to reduce the
+impact on highly threaded workloads.
+Regular blocking notifier chains use a rwsem and seem to have a much lower
+overhead for single-threaded applications.
+However, it may turn out that blocking chains combined with static keys are
+the more performant combination; more study is needed here.
+
+The following pieces are necessary to hook a certain point in the filesystem:
+
+- A ``struct xfs_hooks`` object must be embedded in a convenient place such as
+ a well-known incore filesystem object.
+
+- Each hook must define an action code and a structure containing more context
+ about the action.
+
+- Hook providers should provide appropriate wrapper functions and structs
+ around the ``xfs_hooks`` and ``xfs_hook`` objects to take advantage of type
+ checking to ensure correct usage.
+
+- A callsite in the regular filesystem code must be chosen to call
+ ``xfs_hooks_call`` with the action code and data structure.
+ This place should be adjacent to (and not earlier than) the place where
+ the filesystem update is committed to the transaction.
+ In general, when the filesystem calls a hook chain, it should be able to
+ handle sleeping and should not be vulnerable to memory reclaim or locking
+ recursion.
+ However, the exact requirements are very dependent on the context of the hook
+ caller and the callee.
+
+- The online fsck function should define a structure to hold scan data, a lock
+ to coordinate access to the scan data, and a ``struct xfs_hook`` object.
+ The scanner function and the regular filesystem code must acquire resources
+ in the same order; see the next section for details.
+
+- The online fsck code must contain a C function to catch the hook action code
+ and data structure.
+ If the object being updated has already been visited by the scan, then the
+ hook information must be applied to the scan data.
+
+- Prior to unlocking inodes to start the scan, online fsck must call
+ ``xfs_hooks_setup`` to initialize the ``struct xfs_hook``, and
+ ``xfs_hooks_add`` to enable the hook.
+
+- Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan is
+ complete.
+
+The number of hooks should be kept to a minimum to reduce complexity.
+Static keys are used to reduce the overhead of filesystem hooks to nearly
+zero when online fsck is not running.
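+
+As an illustration, a hook provider for directory updates might look like the
+sketch below.
+It is loosely modeled on the directory update hook in the proposed patches;
+the parameter structure, its fields, and the ``m_dir_update_hooks`` member of
+``struct xfs_mount`` are assumptions made for this example:
+
+.. code-block:: c
+
+	/* context describing one directory update */
+	struct xfs_dir_update_params {
+		struct xfs_inode	*dp;	/* parent directory */
+		struct xfs_inode	*ip;	/* child file */
+		const struct xfs_name	*name;	/* dirent name */
+		int			delta;	/* link count change */
+	};
+
+	/* typed wrapper, called adjacent to committing the dirent update */
+	static inline void
+	xfs_dir_update_hook(
+		struct xfs_inode		*dp,
+		struct xfs_inode		*ip,
+		int				delta,
+		const struct xfs_name		*name)
+	{
+		struct xfs_dir_update_params	p = {
+			.dp	= dp,
+			.ip	= ip,
+			.name	= name,
+			.delta	= delta,
+		};
+
+		xfs_hooks_call(&dp->i_mount->m_dir_update_hooks, 0, &p);
+	}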
+
+.. _liveupdate:
+
+Live Updates During a Scan
+``````````````````````````
+
+The code paths of the online fsck scanning code and the :ref:`hooked<fshooks>`
+filesystem code look like this::
+
+ other program
+ ↓
+ inode lock ←────────────────────┐
+ ↓ │
+ AG header lock │
+ ↓ │
+ filesystem function │
+ ↓ │
+ notifier call chain │ same
+ ↓ ├─── inode
+ scrub hook function │ lock
+ ↓ │
+ scan data mutex ←──┐ same │
+ ↓ ├─── scan │
+ update scan data │ lock │
+ ↑ │ │
+ scan data mutex ←──┘ │
+ ↑ │
+ inode lock ←────────────────────┘
+ ↑
+ scrub function
+ ↑
+ inode scanner
+ ↑
+ xfs_scrub
+
+These rules must be followed to ensure correct interactions between the
+checking code and the code making an update to the filesystem:
+
+- Prior to invoking the notifier call chain, the filesystem function being
+ hooked must acquire the same lock that the scrub scanning function acquires
+ to scan the inode.
+
+- The scanning function and the scrub hook function must coordinate access to
+ the scan data by acquiring a lock on the scan data.
+
+- The scrub hook function must not add the live update information to the scan
+ observations unless the inode being updated has already been scanned.
+ The scan coordinator has a helper predicate (``xchk_iscan_want_live_update``)
+ for this.
+
+- Scrub hook functions must not change the caller's state, including the
+ transaction that it is running.
+ They must not acquire any resources that might conflict with the filesystem
+ function being hooked.
+
+- The hook function can abort the inode scan to avoid breaking the other rules.
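+
+A scrub hook function that obeys these rules might look like this sketch,
+assuming ``struct xfs_hook`` embeds a ``notifier_block``; the ``xchk_example``
+structure, its fields, and the observation helper are invented for
+illustration:
+
+.. code-block:: c
+
+	struct xchk_example {
+		struct xchk_iscan	iscan;
+		struct mutex		lock;	/* protects shadow data */
+		struct xfs_hook		hook;
+	};
+
+	static int
+	xchk_example_hook(
+		struct notifier_block	*nb,
+		unsigned long		action,
+		void			*data)
+	{
+		struct xfs_dir_update_params	*p = data;
+		struct xchk_example		*xe;
+
+		xe = container_of(nb, struct xchk_example, hook.nb);
+
+		/* skip objects that the scan has not yet visited */
+		if (!xchk_iscan_want_live_update(&xe->iscan, p->ip->i_ino))
+			return NOTIFY_DONE;
+
+		/* update shadow data under the scan lock; take nothing else */
+		mutex_lock(&xe->lock);
+		xchk_example_observe(xe, p);	/* illustrative helper */
+		mutex_unlock(&xe->lock);
+
+		return NOTIFY_DONE;
+	}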
+
+The inode scan APIs are pretty simple:
+
+- ``xchk_iscan_start`` starts a scan
+
+- ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or
+ returns zero if there is nothing left to scan
+
+- ``xchk_iscan_want_live_update`` decides if an inode has already been
+ visited in the scan.
+ This is critical for hook functions to decide if they need to update the
+ in-memory scan information.
+
+- ``xchk_iscan_mark_visited`` marks an inode as having been visited in the
+ scan
+
+- ``xchk_iscan_teardown`` finishes the scan
+
+This functionality is also a part of the
+`inode scanner
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
+series.
+
+.. _quotacheck:
+
+Case Study: Quota Counter Checking
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+It is useful to compare the mount time quotacheck code to the online repair
+quotacheck code.
+Mount time quotacheck does not have to contend with concurrent operations, so
+it does the following:
+
+1. Make sure the ondisk dquots are in good enough shape that all the incore
+ dquots will actually load, and zero the resource usage counters in the
+ ondisk buffer.
+
+2. Walk every inode in the filesystem.
+ Add each file's resource usage to the incore dquot.
+
+3. Walk each incore dquot.
+ If the incore dquot is not being flushed, add the ondisk buffer backing the
+ incore dquot to a delayed write (delwri) list.
+
+4. Write the buffer list to disk.
+
+Like most online fsck functions, online quotacheck can't write to regular
+filesystem objects until the newly collected metadata reflect all filesystem
+state.
+Therefore, online quotacheck records file resource usage to a shadow dquot
+index implemented with a sparse ``xfarray``, and only writes to the real dquots
+once the scan is complete.
+Handling transactional updates is tricky because quota resource usage updates
+are handled in phases to minimize contention on dquots:
+
+1. The inodes involved are joined and locked to a transaction.
+
+2. For each dquot attached to the file:
+
+ a. The dquot is locked.
+
+ b. A quota reservation is added to the dquot's resource usage.
+ The reservation is recorded in the transaction.
+
+ c. The dquot is unlocked.
+
+3. Changes in actual quota usage are tracked in the transaction.
+
+4. At transaction commit time, each dquot is examined again:
+
+ a. The dquot is locked again.
+
+ b. Quota usage changes are logged and unused reservation is given back to
+ the dquot.
+
+ c. The dquot is unlocked.
+
+For online quotacheck, hooks are placed in steps 2 and 4.
+The step 2 hook creates a shadow version of the transaction dquot context
+(``dqtrx``) that operates in a similar manner to the regular code.
+The step 4 hook commits the shadow ``dqtrx`` changes to the shadow dquots.
+Notice that both hooks are called with the inode locked, which is how the
+live update coordinates with the inode scanner.
+
+The quotacheck scan looks like this:
+
+1. Set up a coordinated inode scan.
+
+2. For each inode returned by the inode scan iterator:
+
+ a. Grab and lock the inode.
+
+ b. Determine that inode's resource usage (data blocks, inode counts,
+ realtime blocks) and add that to the shadow dquots for the user, group,
+ and project ids associated with the inode.
+
+ c. Unlock and release the inode.
+
+3. For each dquot in the system:
+
+ a. Grab and lock the dquot.
+
+ b. Check the dquot against the shadow dquots created by the scan and updated
+ by the live hooks.
+
+Live updates are key to being able to walk every quota record without
+needing to hold any locks for a long duration.
+If repairs are desired, the real and shadow dquots are locked and their
+resource counts are set to the values in the shadow dquot.
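+
+The comparison in step 3b might look like the following sketch.
+The shadow record layout, the ``xqc`` context, and the corruption helper are
+illustrative paraphrases of the proposed patches, not quotations from them:
+
+.. code-block:: c
+
+	struct xqcheck_dquot	xcdq;	/* shadow counters for one dquot */
+	int			error;
+
+	mutex_lock(&xqc->lock);
+	error = xfarray_load(counts, dq->q_id, &xcdq);
+	if (!error &&
+	    (xcdq.icount != dq->q_ino.count ||
+	     xcdq.bcount != dq->q_blk.count ||
+	     xcdq.rtbcount != dq->q_rtb.count))
+		xqcheck_set_corrupt(xqc, dqtype, dq->q_id);
+	mutex_unlock(&xqc->lock);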
+
+The proposed patchset is the
+`online quotacheck
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
+series.
+
+.. _nlinks:
+
+Case Study: File Link Count Checking
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+File link count checking also uses live update hooks.
+The coordinated inode scanner is used to visit all directories on the
+filesystem, and per-file link count records are stored in a sparse ``xfarray``
+indexed by inumber.
+During the scanning phase, each entry in a directory generates observation
+data as follows:
+
+1. If the entry is a dotdot (``'..'``) entry of the root directory, the
+ directory's parent link count is bumped because the root directory's dotdot
+ entry is self referential.
+
+2. If the entry is a dotdot entry of a subdirectory, the parent's backref
+ count is bumped.
+
+3. If the entry is neither a dot nor a dotdot entry, the target file's parent
+ count is bumped.
+
+4. If the target is a subdirectory, the parent's child link count is bumped.
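+
+Expressed as code, these rules for a single dirent ``(dp, name) → ip`` might
+look like the sketch below; the ``is_dot``/``is_dotdot`` predicates and the
+``xchk_nlinks_bump_*`` helpers are invented for illustration:
+
+.. code-block:: c
+
+	if (is_dotdot) {
+		if (dp->i_ino == mp->m_sb.sb_rootino)
+			xchk_nlinks_bump_parent(xnc, ip->i_ino);  /* rule 1 */
+		else
+			xchk_nlinks_bump_backref(xnc, ip->i_ino); /* rule 2 */
+	} else if (!is_dot) {
+		xchk_nlinks_bump_parent(xnc, ip->i_ino);	  /* rule 3 */
+		if (S_ISDIR(VFS_I(ip)->i_mode))
+			xchk_nlinks_bump_child(xnc, dp->i_ino);	  /* rule 4 */
+	}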
+
+A crucial point to understand about how the link count inode scanner interacts
+with the live update hooks is that the scan cursor tracks which *parent*
+directories have been scanned.
+In other words, the live updates ignore any update about ``A → B`` when A has
+not been scanned, even if B has been scanned.
+Furthermore, a subdirectory A with a dotdot entry pointing back to B is
+accounted as a backref counter in the shadow data for B, since child dotdot
+entries affect the parent's link count.
+Live update hooks are carefully placed in all parts of the filesystem that
+create, change, or remove directory entries, since those operations involve
+bumplink and droplink.
+
+For any file, the correct link count is the number of parents plus the number
+of child subdirectories.
+Non-directories never have children of any kind.
+The backref information is used to detect inconsistencies in the number of
+links pointing to child subdirectories and the number of dotdot entries
+pointing back.
+
+After the scan completes, the link count of each file can be checked by locking
+both the inode and the shadow data, and comparing the link counts.
+A second coordinated inode scan cursor is used for comparisons.
+Live updates are key to being able to walk every inode without needing to hold
+any locks between inodes.
+If repairs are desired, the inode's link count is set to the value in the
+shadow information.
+If no parents are found, the file must be :ref:`reparented <orphanage>` to the
+orphanage to prevent the file from being lost forever.
+
+The proposed patchset is the
+`file link count repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_
+series.
+
+.. _rmap_repair:
+
+Case Study: Rebuilding Reverse Mapping Records
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Most repair functions follow the same pattern: lock filesystem resources,
+walk the surviving ondisk metadata looking for replacement metadata records,
+and use an :ref:`in-memory array <xfarray>` to store the gathered observations.
+The primary advantage of this approach is the simplicity and modularity of the
+repair code -- code and data are entirely contained within the scrub module,
+do not require hooks in the main filesystem, and are usually the most efficient
+in memory use.
+A secondary advantage of this repair approach is atomicity -- once the kernel
+decides a structure is corrupt, no other threads can access the metadata until
+the kernel finishes repairing and revalidating the metadata.
+
+For repairs going on within a shard of the filesystem, these advantages
+outweigh the delays inherent in locking the shard while repairing parts of the
+shard.
+Unfortunately, repairs to the reverse mapping btree cannot use the "standard"
+btree repair strategy because it must scan every space mapping of every fork of
+every file in the filesystem, and the filesystem cannot stop.
+Therefore, rmap repair foregoes atomicity between scrub and repair.
+It combines a :ref:`coordinated inode scanner <iscan>`, :ref:`live update hooks
+<liveupdate>`, and an :ref:`in-memory rmap btree <xfbtree>` to complete the
+scan for reverse mapping records.
+
+1. Set up an xfbtree to stage rmap records.
+
+2. While holding the locks on the AGI and AGF buffers acquired during the
+ scrub, generate reverse mappings for all AG metadata: inodes, btrees, CoW
+ staging extents, and the internal log.
+
+3. Set up an inode scanner.
+
+4. Hook into rmap updates for the AG being repaired so that the live scan data
+ can receive updates to the rmap btree from the rest of the filesystem during
+ the file scan.
+
+5. For each space mapping found in either fork of each file scanned,
+ decide if the mapping matches the AG of interest.
+ If so:
+
+ a. Create a btree cursor for the in-memory btree.
+
+ b. Use the rmap code to add the record to the in-memory btree.
+
+ c. Use the :ref:`special commit function <xfbtree_commit>` to write the
+ xfbtree changes to the xfile.
+
+6. For each live update received via the hook, decide if the owner has already
+ been scanned.
+ If so, apply the live update into the scan data:
+
+ a. Create a btree cursor for the in-memory btree.
+
+ b. Replay the operation into the in-memory btree.
+
+ c. Use the :ref:`special commit function <xfbtree_commit>` to write the
+ xfbtree changes to the xfile.
+ This is performed with an empty transaction to avoid changing the
+ caller's state.
+
+7. When the inode scan finishes, create a new scrub transaction and relock the
+ two AG headers.
+
+8. Compute the new btree geometry using the number of rmap records in the
+ shadow btree, like all other btree rebuilding functions.
+
+9. Allocate the number of blocks computed in the previous step.
+
+10. Perform the usual btree bulk loading and commit to install the new rmap
+ btree.
+
+11. Reap the old rmap btree blocks as discussed in the case study about how
+ to :ref:`reap after rmap btree repair <rmap_reap>`.
+
+12. Free the xfbtree now that it is no longer needed.
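+
+Steps 5a through 5c might look like the following sketch; the function names
+follow the proposed patches, though the exact prototypes may differ:
+
+.. code-block:: c
+
+	struct xfs_btree_cur	*mcur;
+	int			error;
+
+	/* 5a: create a cursor for the in-memory btree */
+	mcur = xfs_rmapbt_mem_cursor(sc->sa.pag, sc->tp, &rr->rmap_btree);
+
+	/* 5b: add the observed record to the in-memory btree */
+	error = xfs_rmap_map_raw(mcur, &rmap);
+	xfs_btree_del_cursor(mcur, error);
+
+	/* 5c: commit the xfbtree changes to the xfile */
+	if (!error)
+		error = xfbtree_trans_commit(&rr->rmap_btree, sc->tp);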
+
+The proposed patchset is the
+`rmap repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
+series.
+
+Staging Repairs with Temporary Files on Disk
+--------------------------------------------
+
+XFS stores a substantial amount of metadata in file forks: directories,
+extended attributes, symbolic link targets, free space bitmaps and summary
+information for the realtime volume, and quota records.
+File forks map 64-bit logical file fork space extents to physical storage space
+extents, similar to how a memory management unit maps 64-bit virtual addresses
+to physical memory addresses.
+Therefore, file-based tree structures (such as directories and extended
+attributes) use blocks mapped in the file fork offset address space that point
+to other blocks mapped within that same address space, and file-based linear
+structures (such as bitmaps and quota records) compute array element offsets in
+the file fork offset address space.
+
+Because file forks can consume as much space as the entire filesystem, repairs
+cannot be staged in memory, even when a paging scheme is available.
+Therefore, online repair of file-based metadata creates a temporary file in
+the XFS filesystem, writes a new structure at the correct offsets into the
+temporary file, and atomically exchanges all file fork mappings (and hence the
+fork contents) to commit the repair.
+Once the repair is complete, the old fork can be reaped as necessary; if the
+system goes down during the reap, the iunlink code will delete the blocks
+during log recovery.
+
+**Note**: All space usage and inode indices in the filesystem *must* be
+consistent to use a temporary file safely!
+This dependency is the reason why online repair can only use pageable kernel
+memory to stage ondisk space usage information.
+
+Exchanging metadata file mappings with a temporary file requires the owner
+field of the block headers to match the file being repaired and not the
+temporary file.
+The directory, extended attribute, and symbolic link functions were all
+modified to allow callers to specify owner numbers explicitly.
+
+There is a downside to the reaping process -- if the system crashes during the
+reap phase and the fork extents are crosslinked, the iunlink processing will
+fail because freeing space will find the extra reverse mappings and abort.
+
+Temporary files created for repair are similar to ``O_TMPFILE`` files created
+by userspace.
+They are not linked into a directory and the entire file will be reaped when
+the last reference to the file is lost.
+The key differences are that these files must have no access permission outside
+the kernel at all, they must be specially marked to prevent them from being
+opened by handle, and they must never be linked into the directory tree.
+
++--------------------------------------------------------------------------+
+| **Historical Sidebar**: |
++--------------------------------------------------------------------------+
+| In the initial iteration of file metadata repair, the damaged metadata |
+| blocks would be scanned for salvageable data; the extents in the file |
+| fork would be reaped; and then a new structure would be built in its |
+| place. |
+| This strategy did not survive the introduction of the atomic repair |
+| requirement expressed earlier in this document. |
+| |
+| The second iteration explored building a second structure at a high |
+| offset in the fork from the salvage data, reaping the old extents, and |
+| using a ``COLLAPSE_RANGE`` operation to slide the new extents into |
+| place. |
+| |
+| This had many drawbacks: |
+| |
+| - Array structures are linearly addressed, and the regular filesystem |
+| codebase does not have the concept of a linear offset that could be |
+| applied to the record offset computation to build an alternate copy. |
+| |
+| - Extended attributes are allowed to use the entire attr fork offset |
+| address space. |
+| |
+| - Even if repair could build an alternate copy of a data structure in a |
+| different part of the fork address space, the atomic repair commit |
+| requirement means that online repair would have to be able to perform |
+| a log assisted ``COLLAPSE_RANGE`` operation to ensure that the old |
+| structure was completely replaced. |
+| |
+| - A crash after construction of the secondary tree but before the range |
+| collapse would leave unreachable blocks in the file fork. |
+| This would likely confuse things further. |
+| |
+| - Reaping blocks after a repair is not a simple operation, and |
+| initiating a reap operation from a restarted range collapse operation |
+| during log recovery is daunting. |
+| |
+| - Directory entry blocks and quota records record the file fork offset |
+| in the header area of each block. |
+| An atomic range collapse operation would have to rewrite this part of |
+| each block header. |
+| Rewriting a single field in block headers is not a huge problem, but |
+| it's something to be aware of. |
+| |
+| - Each block in a directory or extended attributes btree index contains |
+| sibling and child block pointers. |
+| Were the atomic commit to use a range collapse operation, each block |
+| would have to be rewritten very carefully to preserve the graph |
+| structure. |
+| Doing this as part of a range collapse means rewriting a large number |
+| of blocks repeatedly, which is not conducive to quick repairs. |
+| |
+| This led to the introduction of temporary file staging.                    |
++--------------------------------------------------------------------------+
+
+Using a Temporary File
+``````````````````````
+
+Online repair code should use the ``xrep_tempfile_create`` function to create a
+temporary file inside the filesystem.
+This allocates an inode, marks the in-core inode private, and attaches it to
+the scrub context.
+These files are hidden from userspace, may not be added to the directory tree,
+and must be kept private.
+
+Temporary files only use two inode locks: the IOLOCK and the ILOCK.
+The MMAPLOCK is not needed here, because there must not be page faults from
+userspace for data fork blocks.
+The usage patterns of these two locks are the same as for any other XFS file --
+access to file data are controlled via the IOLOCK, and access to file metadata
+are controlled via the ILOCK.
+Locking helpers are provided so that the temporary file and its lock state can
+be cleaned up by the scrub context.
+To comply with the nested locking strategy laid out in the :ref:`inode
+locking<ilocking>` section, it is recommended that scrub functions use the
+``xrep_tempfile_ilock*_nowait`` lock helpers.
+
+Data can be written to a temporary file by two means:
+
+1. ``xrep_tempfile_copyin`` can be used to set the contents of a regular
+ temporary file from an xfile.
+
+2. The regular directory, symbolic link, and extended attribute functions can
+ be used to write to the temporary file.
+
+Once a good copy of a data file has been constructed in a temporary file, it
+must be conveyed to the file being repaired, which is the topic of the next
+section.
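+
+A repair function might stage its new contents like this sketch;
+``xrep_tempfile_create`` and ``xrep_tempfile_copyin`` are named above, but the
+prototypes and the locking helper names here are guesses, not the actual API:
+
+.. code-block:: c
+
+	/* create the tempfile and attach it to the scrub context */
+	error = xrep_tempfile_create(sc, S_IFREG);
+	if (error)
+		return error;
+
+	/* copy the new contents from an xfile into the tempfile */
+	xrep_tempfile_iolock(sc);
+	error = xrep_tempfile_copyin(sc, xfile, 0, length);
+	xrep_tempfile_iounlock(sc);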
+
+The proposed patches are in the
+`repair temporary files
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
+series.
+
+Logged File Content Exchanges
+-----------------------------
+
+Once repair builds a temporary file with a new data structure written into
+it, it must commit the new changes into the existing file.
+It is not possible to swap the inumbers of two files, so instead the new
+metadata must replace the old.
+This suggests the need for the ability to swap extents, but the existing extent
+swapping code used by the file defragmenting tool ``xfs_fsr`` is not sufficient
+for online repair because:
+
+a. When the reverse-mapping btree is enabled, the swap code must keep the
+ reverse mapping information up to date with every exchange of mappings.
+ Therefore, it can only exchange one mapping per transaction, and each
+ transaction is independent.
+
+b. Reverse-mapping is critical for the operation of online fsck, so the old
+ defragmentation code (which swapped entire extent forks in a single
+ operation) is not useful here.
+
+c. Defragmentation is assumed to occur between two files with identical
+ contents.
+ For this use case, an incomplete exchange will not result in a user-visible
+ change in file contents, even if the operation is interrupted.
+
+d. Online repair needs to swap the contents of two files that are by definition
+ *not* identical.
+ For directory and xattr repairs, the user-visible contents might be the
+ same, but the contents of individual blocks may be very different.
+
+e. Old blocks in the file may be cross-linked with another structure and must
+ not reappear if the system goes down mid-repair.
+
+These problems are overcome by creating a new deferred operation and a new type
+of log intent item to track the progress of an operation to exchange two file
+ranges.
+The new exchange operation type chains together the same transactions used by
+the reverse-mapping extent swap code, but records intermediate progress in the
+log so that operations can be restarted after a crash.
+This new functionality is called the file contents exchange
+(``xfs_exchrange``) code.
+The underlying implementation exchanges file fork mappings
+(``xfs_exchmaps``).
+The new log item records the progress of the exchange to ensure that once an
+exchange begins, it will always run to completion, even if there are
+interruptions.
+The new ``XFS_SB_FEAT_INCOMPAT_EXCHRANGE`` incompatible feature flag
+in the superblock protects these new log item records from being replayed on
+old kernels.
+
+The proposed patchset is the
+`file contents exchange
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
+series.
+
++--------------------------------------------------------------------------+
+| **Sidebar: Using Log-Incompatible Feature Flags** |
++--------------------------------------------------------------------------+
+| Starting with XFS v5, the superblock contains a |
+| ``sb_features_log_incompat`` field to indicate that the log contains |
+| records that might not be readable by all kernels that could mount this    |
+| filesystem. |
+| In short, log incompat features protect the log contents against kernels |
+| that will not understand the contents. |
+| Unlike the other superblock feature bits, log incompat bits are |
+| ephemeral because an empty (clean) log does not need protection. |
+| The log cleans itself after its contents have been committed into the |
+| filesystem, either as part of an unmount or because the system is |
+| otherwise idle. |
+| Because upper level code can be working on a transaction at the same |
+| time that the log cleans itself, it is necessary for upper level code to |
+| communicate to the log when it is going to use a log incompatible |
+| feature. |
+| |
+| The log coordinates access to incompatible features through the use of |
+| one ``struct rw_semaphore`` for each feature. |
+| The log cleaning code tries to take this rwsem in exclusive mode to |
+| clear the bit; if the lock attempt fails, the feature bit remains set. |
+| The code supporting a log incompat feature should create wrapper |
+| functions to obtain the log feature and call |
+| ``xfs_add_incompat_log_feature`` to set the feature bits in the primary |
+| superblock. |
+| The superblock update is performed transactionally, so the wrapper to |
+| obtain log assistance must be called just prior to the creation of the |
+| transaction that uses the functionality. |
+| For a file operation, this step must happen after taking the IOLOCK |
+| and the MMAPLOCK, but before allocating the transaction. |
+| When the transaction is complete, the ``xlog_drop_incompat_feat`` |
+| function is called to release the feature. |
+| The feature bit will not be cleared from the superblock until the log |
+| becomes clean. |
+| |
+| Log-assisted extended attribute updates and file content exchanges both    |
+| use log incompat features and provide convenience wrappers around the |
+| functionality. |
++--------------------------------------------------------------------------+
+
+Mechanics of a Logged File Content Exchange
+```````````````````````````````````````````
+
+Exchanging contents between file forks is a complex task.
+The goal is to exchange all file fork mappings between two file fork offset
+ranges.
+There are likely to be many extent mappings in each fork, and the edges of
+the mappings aren't necessarily aligned.
+Furthermore, there may be other updates that need to happen after the exchange,
+such as exchanging file sizes, inode flags, or conversion of fork data to local
+format.
+This is roughly the format of the new deferred exchange-mapping work item:
+
+.. code-block:: c
+
+ struct xfs_exchmaps_intent {
+ /* Inodes participating in the operation. */
+ struct xfs_inode *xmi_ip1;
+ struct xfs_inode *xmi_ip2;
+
+ /* File offset range information. */
+ xfs_fileoff_t xmi_startoff1;
+ xfs_fileoff_t xmi_startoff2;
+ xfs_filblks_t xmi_blockcount;
+
+ /* Set these file sizes after the operation, unless negative. */
+ xfs_fsize_t xmi_isize1;
+ xfs_fsize_t xmi_isize2;
+
+ /* XFS_EXCHMAPS_* log operation flags */
+ uint64_t xmi_flags;
+ };
+
+The new log intent item contains enough information to track two logical fork
+offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2,
+blockcount)``.
+Each step of an exchange operation exchanges the largest file range mapping
+possible from one file to the other.
+After each step in the exchange operation, the two startoff fields are
+incremented and the blockcount field is decremented to reflect the progress
+made.
+The flags field captures behavioral parameters such as exchanging attr fork
+mappings instead of the data fork and other work to be done after the exchange.
+The two isize fields are used to exchange the file sizes at the end of the
+operation if the file data fork is the target of the operation.
+
+When the exchange is initiated, the sequence of operations is as follows:
+
+1. Create a deferred work item for the file mapping exchange.
+ At the start, it should contain the entirety of the file block ranges to be
+ exchanged.
+
+2. Call ``xfs_defer_finish`` to process the exchange.
+ This is encapsulated in ``xrep_tempexch_contents`` for scrub operations.
+ This will log an extent swap intent item to the transaction for the deferred
+ mapping exchange work item.
+
+3. Until ``xmi_blockcount`` of the deferred mapping exchange work item is zero,
+
+ a. Read the block maps of both file ranges starting at ``xmi_startoff1`` and
+ ``xmi_startoff2``, respectively, and compute the longest extent that can
+ be exchanged in a single step.
+ This is the minimum of the two ``br_blockcount`` values in the mappings.
+ Keep advancing through the file forks until at least one of the mappings
+ contains written blocks.
+ Mutual holes, unwritten extents, and extent mappings to the same physical
+ space are not exchanged.
+
+ For the next few steps, this document will refer to the mapping that came
+ from file 1 as "map1", and the mapping that came from file 2 as "map2".
+
+ b. Create a deferred block mapping update to unmap map1 from file 1.
+
+ c. Create a deferred block mapping update to unmap map2 from file 2.
+
+ d. Create a deferred block mapping update to map map1 into file 2.
+
+ e. Create a deferred block mapping update to map map2 into file 1.
+
+ f. Log the block, quota, and extent count updates for both files.
+
+ g. Extend the ondisk size of either file if necessary.
+
+ h. Log a mapping exchange done log item for the mapping exchange intent
+ log item that was read at the start of step 3.
+
+ i. Compute the amount of file range that has just been covered.
+ This quantity is ``(map1.br_startoff + map1.br_blockcount -
+ xmi_startoff1)``, because step 3a could have skipped holes.
+
+ j. Increase the starting offsets of ``xmi_startoff1`` and ``xmi_startoff2``
+ by the number of blocks computed in the previous step, and decrease
+ ``xmi_blockcount`` by the same quantity.
+ This advances the cursor.
+
+ k. Log a new mapping exchange intent log item reflecting the advanced state
+ of the work item.
+
+ l. Return the proper error code (EAGAIN) to the deferred operation manager
+ to inform it that there is more work to be done.
+ The operation manager completes the deferred work in steps 3b-3e before
+ moving back to the start of step 3.
+
+4. Perform any post-processing.
+ This will be discussed in more detail in subsequent sections.
+
+If the filesystem goes down in the middle of an operation, log recovery will
+find the most recent unfinished mapping exchange log intent item and restart
+from there.
+This is how atomic file mapping exchanges guarantee that an outside observer
+will either see the old broken structure or the new one, and never a mishmash
+of both.
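+
+Steps 3i and 3j amount to simple cursor arithmetic on the intent structure
+shown earlier:
+
+.. code-block:: c
+
+	/* advance past the range just covered, including skipped holes */
+	xfs_filblks_t	advance;
+
+	advance = map1.br_startoff + map1.br_blockcount - xmi->xmi_startoff1;
+	xmi->xmi_startoff1 += advance;
+	xmi->xmi_startoff2 += advance;
+	xmi->xmi_blockcount -= advance;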
+
+Preparation for File Content Exchanges
+``````````````````````````````````````
+
+There are a few things that need to be taken care of before initiating an
+atomic file mapping exchange operation.
+First, regular files require the page cache to be flushed to disk before the
+operation begins, and directio writes to be quiesced.
+Like any filesystem operation, file mapping exchanges must determine the
+maximum amount of disk space and quota that can be consumed on behalf of both
+files in the operation, and reserve that quantity of resources to avoid an
+unrecoverable out of space failure once it starts dirtying metadata.
+The preparation step scans the ranges of both files to estimate:
+
+- Data device blocks needed to handle the repeated updates to the fork
+ mappings.
+- Change in data and realtime block counts for both files.
+- Increase in quota usage for both files, if the two files do not share the
+ same set of quota ids.
+- The number of extent mappings that will be added to each file.
+- Whether or not there are partially written realtime extents.
+ User programs must never be able to access a realtime file extent that maps
+ to different extents on the realtime volume, which could happen if the
+ operation fails to run to completion.
+
+The need for precise estimation increases the run time of the exchange
+operation, but it is very important to maintain correct accounting.
+The filesystem must not run completely out of free space, nor can the mapping
+exchange ever add more extent mappings to a fork than it can support.
+Regular users are required to abide by the quota limits, though metadata
+repairs may exceed quota to resolve inconsistent metadata elsewhere.
+
+Special Features for Exchanging Metadata File Contents
+``````````````````````````````````````````````````````
+
+Extended attributes, symbolic links, and directories can set the fork format to
+"local" and treat the fork as a literal area for data storage.
+Metadata repairs must take extra steps to support these cases:
+
+- If both forks are in local format and the fork areas are large enough, the
+ exchange is performed by copying the incore fork contents, logging both
+ forks, and committing.
+ The atomic file mapping exchange mechanism is not necessary, since this can
+ be done with a single transaction.
+
+- If both forks map blocks, then the regular atomic file mapping exchange is
+ used.
+
+- Otherwise, only one fork is in local format.
+ The contents of the local format fork are converted to a block to perform the
+ exchange.
+ The conversion to block format must be done in the same transaction that
+ logs the initial mapping exchange intent log item.
+ The regular atomic mapping exchange is used to exchange the metadata file
+ mappings.
+ Special flags are set on the exchange operation so that the transaction can
+ be rolled one more time to convert the second file's fork back to local
+ format so that the second file will be ready to go as soon as the ILOCK is
+ dropped.
+
+Extended attributes and directories stamp the owning inode into every block,
+but the buffer verifiers do not actually check the inode number!
+Although there is no verification, it is still important to maintain
+referential integrity, so prior to performing the mapping exchange, online
+repair builds every block in the new data structure with the owner field of the
+file being repaired.
+
+After a successful exchange operation, the repair operation must reap the old
+fork blocks by processing each fork mapping through the standard :ref:`file
+extent reaping <reaping>` mechanism that is done post-repair.
+If the filesystem should go down during the reap part of the repair, the
+iunlink processing at the end of recovery will free both the temporary file and
+whatever blocks were not reaped.
+However, this iunlink processing omits the cross-link detection of online
+repair, and is not completely foolproof.
+
+Exchanging Temporary File Contents
+``````````````````````````````````
+
+To repair a metadata file, online repair proceeds as follows:
+
+1. Create a temporary repair file.
+
+2. Use the staging data to write out new contents into the temporary repair
+ file.
+ The same fork must be written to as is being repaired.
+
+3. Commit the scrub transaction, since the exchange resource estimation step
+ must be completed before transaction reservations are made.
+
+4. Call ``xrep_tempexch_trans_alloc`` to allocate a new scrub transaction with
+ the appropriate resource reservations, locks, and fill out a ``struct
+ xfs_exchmaps_req`` with the details of the exchange operation.
+
+5. Call ``xrep_tempexch_contents`` to exchange the contents.
+
+6. Commit the transaction to complete the repair.
+
+.. _rtsummary:
+
+Case Study: Repairing the Realtime Summary File
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In the "realtime" section of an XFS filesystem, free space is tracked via a
+bitmap, similar to Unix FFS.
+Each bit in the bitmap represents one realtime extent, which is a multiple of
+the filesystem block size between 4KiB and 1GiB in size.
+The realtime summary file indexes the number of free extents of a given size to
+the offset of the block within the realtime free space bitmap where those free
+extents begin.
+In other words, the summary file helps the allocator find free extents by
+length, similar to what the free space by count (cntbt) btree does for the data
+section.
+
+The summary file itself is a flat file (with no block headers or checksums!)
+partitioned into ``log2(total rt extents)`` sections containing enough 32-bit
+counters to match the number of blocks in the rt bitmap.
+Each counter records the number of free extents that start in that bitmap block
+and can satisfy a power-of-two allocation request.
+
+To check the summary file against the bitmap:
+
+1. Take the ILOCK of both the realtime bitmap and summary files.
+
+2. For each free space extent recorded in the bitmap:
+
+ a. Compute the position in the summary file that contains a counter that
+ represents this free extent.
+
+ b. Read the counter from the xfile.
+
+ c. Increment it, and write it back to the xfile.
+
+3. Compare the contents of the xfile against the ondisk file.
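+
+The counter position computed in step 2a mirrors the layout described above.
+A sketch, with illustrative variable names standing in for the equivalent
+kernel macros:
+
+.. code-block:: c
+
+	/* free extent of len rtextents starting at rt extent number rtx */
+	unsigned int	log = xfs_highbit64(len);	/* power-of-two bucket */
+	xfs_fileoff_t	bbno = rtx >> mp->m_blkbit_log;	/* rt bitmap block */
+	unsigned long	sumoff = log * mp->m_sb.sb_rbmblocks + bbno;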
+
+To repair the summary file, write the xfile contents into the temporary file
+and use atomic mapping exchange to commit the new contents.
+The temporary file is then reaped.
+
+The proposed patchset is the
+`realtime summary repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_
+series.
+
+Case Study: Salvaging Extended Attributes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In XFS, extended attributes are implemented as a namespaced name-value store.
+Values are limited in size to 64KiB, but there is no limit in the number of
+names.
+The attribute fork is unpartitioned, which means that the root of the attribute
+structure is always in logical block zero, but attribute leaf blocks, dabtree
+index blocks, and remote value blocks are intermixed.
+Attribute leaf blocks contain variable-sized records that associate
+user-provided names with the user-provided values.
+Values larger than a block are allocated separate extents and written there.
+If the leaf information expands beyond a single block, a directory/attribute
+btree (``dabtree``) is created to map hashes of attribute names to entries
+for fast lookup.
+
+Salvaging extended attributes is done as follows:
+
+1. Walk the attr fork mappings of the file being repaired to find the attribute
+ leaf blocks.
+ When one is found,
+
+ a. Walk the attr leaf block to find candidate keys.
+ When one is found,
+
+ 1. Check the name for problems, and ignore the name if there are any.
+
+ 2. Retrieve the value.
+ If that succeeds, add the name and value to the staging xfarray and
+ xfblob.
+
+2. If the memory usage of the xfarray and xfblob exceed a certain amount of
+ memory or there are no more attr fork blocks to examine, unlock the file and
+ add the staged extended attributes to the temporary file.
+
+3. Use atomic file mapping exchange to exchange the new and old extended
+ attribute structures.
+ The old attribute blocks are now attached to the temporary file.
+
+4. Reap the temporary file.
+
+The proposed patchset is the
+`extended attribute repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
+series.
+
+Fixing Directories
+------------------
+
+Fixing directories is difficult with currently available filesystem features,
+since directory entries are not redundant.
+The offline repair tool scans all inodes to find files with nonzero link count,
+and then it scans all directories to establish parentage of those linked files.
+Damaged files and directories are zapped, and files with no parent are
+moved to the ``/lost+found`` directory.
+It does not try to salvage anything.
+
+The best that online repair can do at this time is to read directory data
+blocks and salvage any dirents that look plausible, correct link counts, and
+move orphans back into the directory tree.
+The salvage process is discussed in the case study at the end of this section.
+The :ref:`file link count fsck <nlinks>` code takes care of fixing link counts
+and moving orphans to the ``/lost+found`` directory.
+
+Case Study: Salvaging Directories
+`````````````````````````````````
+
+Unlike extended attributes, directory blocks are all the same size, so
+salvaging directories is straightforward:
+
+1. Find the parent of the directory.
+ If the dotdot entry is readable, try to confirm that the alleged parent
+ has a child entry pointing back to the directory being repaired.
+ Otherwise, walk the filesystem to find it.
+
+2. Walk the first partition of the data fork of the directory to find the
+ directory entry data blocks.
+ When one is found,
+
+ a. Walk the directory data block to find candidate entries.
+ When an entry is found:
+
+ i. Check the name for problems, and ignore the name if there are any.
+
+ ii. Retrieve the inumber and grab the inode.
+ If that succeeds, add the name, inode number, and file type to the
+ staging xfarray and xfblob.
+
+3. If the memory usage of the xfarray and xfblob exceed a certain amount of
+ memory or there are no more directory data blocks to examine, unlock the
+ directory and add the staged dirents into the temporary directory.
+ Truncate the staging files.
+
+4. Use atomic file mapping exchange to exchange the new and old directory
+ structures.
+ The old directory blocks are now attached to the temporary file.
+
+5. Reap the temporary file.
+
+**Future Work Question**: Should repair revalidate the dentry cache when
+rebuilding a directory?
+
+*Answer*: Yes, it should.
+
+In theory it is necessary to scan all dentry cache entries for a directory to
+ensure that one of the following apply:
+
+1. The cached dentry reflects an ondisk dirent in the new directory.
+
+2. The cached dentry no longer has a corresponding ondisk dirent in the new
+ directory and the dentry can be purged from the cache.
+
+3. The cached dentry no longer has an ondisk dirent but the dentry cannot be
+ purged.
+ This is the problem case.
+
+Unfortunately, the current dentry cache design doesn't provide a means to walk
+every child dentry of a specific directory, which makes this a hard problem.
+There is no known solution.
+
+The proposed patchset is the
+`directory repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
+series.
+
+Parent Pointers
+```````````````
+
+A parent pointer is a piece of file metadata that enables a user to locate the
+file's parent directory without having to traverse the directory tree from the
+root.
+Without them, reconstruction of directory trees is hindered in much the same
+way that the historic lack of reverse space mapping information once hindered
+reconstruction of filesystem space metadata.
+The parent pointer feature, however, makes total directory reconstruction
+possible.
+
+XFS parent pointers contain the information needed to identify the
+corresponding directory entry in the parent directory.
+In other words, child files use extended attributes to store pointers to
+parents in the form ``(dirent_name) → (parent_inum, parent_gen)``.
+The directory checking process can be strengthened to ensure that the target of
+each dirent also contains a parent pointer pointing back to the dirent.
+Likewise, each parent pointer can be checked by ensuring that the target of
+each parent pointer is a directory and that it contains a dirent matching
+the parent pointer.
+Both online and offline repair can use this strategy.
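+
+Concretely, the attribute name is the dirent name and the attribute value can
+be sketched as the following record; the field names approximate the merged
+ondisk format:
+
+.. code-block:: c
+
+	struct xfs_parent_rec {
+		__be64	p_ino;	/* parent directory inumber */
+		__be32	p_gen;	/* parent directory generation */
+	};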
+
++--------------------------------------------------------------------------+
+| **Historical Sidebar**: |
++--------------------------------------------------------------------------+
+| Directory parent pointers were first proposed as an XFS feature more |
+| than a decade ago by SGI. |
+| Each link from a parent directory to a child file is mirrored with an |
+| extended attribute in the child that could be used to identify the |
+| parent directory. |
+| Unfortunately, this early implementation had major shortcomings and was |
+| never merged into Linux XFS: |
+| |
+| 1. The XFS codebase of the late 2000s did not have the infrastructure to |
+| enforce strong referential integrity in the directory tree. |
+| It did not guarantee that a change in a forward link would always be |
+| followed up with the corresponding change to the reverse links. |
+| |
+| 2. Referential integrity was not integrated into offline repair. |
+| Checking and repairs were performed on mounted filesystems without |
+| taking any kernel or inode locks to coordinate access. |
+| It is not clear how this actually worked properly. |
+| |
+| 3. The extended attribute did not record the name of the directory entry |
+| in the parent, so the SGI parent pointer implementation cannot be |
+| used to reconnect the directory tree. |
+| |
+| 4. Extended attribute forks only support 65,536 extents, which means |
+| that parent pointer attribute creation is likely to fail at some |
+| point before the maximum file link count is achieved. |
+| |
+| The original parent pointer design was too unstable for something like |
+| a file system repair to depend on. |
+| Allison Henderson, Chandan Babu, and Catherine Hoang are working on a |
+| second implementation that solves all shortcomings of the first. |
+| During 2022, Allison introduced log intent items to track physical |
+| manipulations of the extended attribute structures. |
+| This solves the referential integrity problem by making it possible to |
+| commit a dirent update and a parent pointer update in the same |
+| transaction. |
+| Chandan increased the maximum extent counts of both data and attribute |
+| forks, thereby ensuring that the extended attribute structure can grow |
+| to handle the maximum hardlink count of any file. |
+| |
+| For this second effort, the ondisk parent pointer format as originally |
+| proposed was ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``. |
+| The format was changed during development to eliminate the requirement |
+| of repair tools needing to ensure that the ``dirent_pos`` field always |
+| matched when reconstructing a directory. |
+| |
+| There were a few other ways to have solved that problem: |
+| |
+| 1. The field could be designated advisory, since the other three values |
+| are sufficient to find the entry in the parent. |
+| However, this makes indexed key lookup impossible while repairs are |
+| ongoing. |
+| |
+| 2. We could allow creating directory entries at specified offsets, which |
+| solves the referential integrity problem but runs the risk that |
+| dirent creation will fail due to conflicts with the free space in the |
+| directory. |
+| |
+| These conflicts could be resolved by appending the directory entry |
+| and amending the xattr code to support updating an xattr key and |
+| reindexing the dabtree, though this would have to be performed with |
+| the parent directory still locked. |
+| |
+| 3. Same as above, but remove the old parent pointer entry and add a new |
+| one atomically. |
+| |
+| 4. Change the ondisk xattr format to |
+| ``(parent_inum, name) → (parent_gen)``, which would provide the attr |
+| name uniqueness that we require, without forcing repair code to |
+| update the dirent position. |
+| Unfortunately, this requires changes to the xattr code to support |
+| attr names as long as 263 bytes. |
+| |
+| 5. Change the ondisk xattr format to ``(parent_inum, hash(name)) → |
+| (name, parent_gen)``. |
+| If the hash is sufficiently resistant to collisions (e.g. sha256) |
+| then this should provide the attr name uniqueness that we require. |
+| Names shorter than 247 bytes could be stored directly. |
+| |
+| 6. Change the ondisk xattr format to ``(dirent_name) → (parent_ino, |
+| parent_gen)``. This format doesn't require any of the complicated |
+| nested name hashing of the previous suggestions. However, it was |
+| discovered that multiple hardlinks to the same inode with the same |
+| filename caused performance problems with hashed xattr lookups, so |
+| the parent inumber is now xor'd into the hash index. |
+| |
+| In the end, it was decided that solution #6 was the most compact and the |
+| most performant. A new hash function was designed for parent pointers. |
++--------------------------------------------------------------------------+
+
+
+Case Study: Repairing Directories with Parent Pointers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Directory rebuilding uses a :ref:`coordinated inode scan <iscan>` and
+a :ref:`directory entry live update hook <liveupdate>` as follows:
+
+1. Set up a temporary directory for generating the new directory structure,
+ an xfblob for storing entry names, and an xfarray for stashing the fixed
+ size fields involved in a directory update: ``(child inumber, add vs.
+ remove, name cookie, ftype)``.
+
+2. Set up an inode scanner and hook into the directory entry code to receive
+ updates on directory operations.
+
+3. For each parent pointer found in each file scanned, decide if the parent
+ pointer references the directory of interest.
+ If so:
+
+ a. Stash the parent pointer name and an addname entry for this dirent in the
+ xfblob and xfarray, respectively.
+
+ b. When finished scanning that file or the kernel memory consumption exceeds
+ a threshold, flush the stashed updates to the temporary directory.
+
+4. For each live directory update received via the hook, decide if the child
+ has already been scanned.
+ If so:
+
+ a. Stash the parent pointer name and an addname or removename entry for
+ this dirent update in the xfblob and xfarray for later.
+ We cannot write directly to the temporary directory because hook
+ functions are not allowed to modify filesystem metadata.
+ Instead, we stash updates in the xfarray and rely on the scanner thread
+ to apply the stashed updates to the temporary directory.
+
+5. When the scan is complete, replay any stashed entries in the xfarray.
+
+6. When the scan is complete, atomically exchange the contents of the temporary
+ directory and the directory being repaired.
+ The temporary directory now contains the damaged directory structure.
+
+7. Reap the temporary directory.
+
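+A minimal sketch of the fixed size stash record from step 1; the type and
+field names here are illustrative, not the kernel's:
+
+.. code-block:: c
+
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    /* One deferred directory update.  The entry name itself lives in
+     * the xfblob and is referenced here by its cookie. */
+    struct stashed_dirent_update {
+        uint64_t child_ino;    /* child inumber */
+        uint64_t name_cookie;  /* xfblob cookie for the entry name */
+        uint8_t  ftype;        /* directory entry file type */
+        bool     remove;       /* removename if true, else addname */
+    };
+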
+The proposed patchset is the
+`parent pointers directory repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
+series.
+
+Case Study: Repairing Parent Pointers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Online reconstruction of a file's parent pointer information works similarly to
+directory reconstruction:
+
+1. Set up a temporary file for generating a new extended attribute structure,
+ an xfblob for storing parent pointer names, and an xfarray for stashing the
+ fixed size fields involved in a parent pointer update: ``(parent inumber,
+ parent generation, add vs. remove, name cookie)``.
+
+2. Set up an inode scanner and hook into the directory entry code to receive
+ updates on directory operations.
+
+3. For each directory entry found in each directory scanned, decide if the
+ dirent references the file of interest.
+ If so:
+
+ a. Stash the dirent name and an addpptr entry for this parent pointer in the
+ xfblob and xfarray, respectively.
+
+ b. When finished scanning the directory or the kernel memory consumption
+ exceeds a threshold, flush the stashed updates to the temporary file.
+
+4. For each live directory update received via the hook, decide if the parent
+ has already been scanned.
+ If so:
+
+ a. Stash the dirent name and an addpptr or removepptr entry for this dirent
+ update in the xfblob and xfarray for later.
+ We cannot write parent pointers directly to the temporary file because
+ hook functions are not allowed to modify filesystem metadata.
+ Instead, we stash updates in the xfarray and rely on the scanner thread
+ to apply the stashed parent pointer updates to the temporary file.
+
+5. When the scan is complete, replay any stashed entries in the xfarray.
+
+6. Copy all non-parent pointer extended attributes to the temporary file.
+
+7. When the scan is complete, atomically exchange the mappings of the attribute
+ forks of the temporary file and the file being repaired.
+ The temporary file now contains the damaged extended attribute structure.
+
+8. Reap the temporary file.
+
+The proposed patchset is the
+`parent pointers repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_
+series.
+
+Digression: Offline Checking of Parent Pointers
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Examining parent pointers in offline repair works differently because corrupt
+files are erased long before directory tree connectivity checks are performed.
+Parent pointer checks are therefore a second pass to be added to the existing
+connectivity checks:
+
+1. After the set of surviving files has been established (phase 6),
+ walk the surviving directories of each AG in the filesystem.
+ This is already performed as part of the connectivity checks.
+
+2. For each directory entry found,
+
+ a. If the name has already been stored in the xfblob, then use that cookie
+ and skip the next step.
+
+ b. Otherwise, record the name in an xfblob, and remember the xfblob cookie.
+ Unique mappings are critical for
+
+ 1. Deduplicating names to reduce memory usage, and
+
+ 2. Creating a stable sort key for the parent pointer indexes so that the
+ parent pointer validation described below will work.
+
+ c. Store ``(child_ag_inum, parent_inum, parent_gen, name_hash, name_len,
+ name_cookie)`` tuples in a per-AG in-memory slab. The ``name_hash``
+ referenced in this section is the regular directory entry name hash, not
+ the specialized one used for parent pointer xattrs.
+
+3. For each AG in the filesystem,
+
+ a. Sort the per-AG tuple set in order of ``child_ag_inum``, ``parent_inum``,
+ ``name_hash``, and ``name_cookie``.
+ Having a single ``name_cookie`` for each ``name`` is critical for
+ handling the uncommon case of a directory containing multiple hardlinks
+ to the same file where all the names hash to the same value.
+
+ b. For each inode in the AG,
+
+ 1. Scan the inode for parent pointers.
+ For each parent pointer found,
+
+ a. Validate the ondisk parent pointer.
+ If validation fails, move on to the next parent pointer in the
+ file.
+
+ b. If the name has already been stored in the xfblob, then use that
+ cookie and skip the next step.
+
+ c. Record the name in a per-file xfblob, and remember the xfblob
+ cookie.
+
+ d. Store ``(parent_inum, parent_gen, name_hash, name_len,
+ name_cookie)`` tuples in a per-file slab.
+
+ 2. Sort the per-file tuples in order of ``parent_inum``, ``name_hash``,
+ and ``name_cookie``.
+
+ 3. Position one slab cursor at the start of the inode's records in the
+ per-AG tuple slab.
+ This should be trivial since the per-AG tuples are in child inumber
+ order.
+
+ 4. Position a second slab cursor at the start of the per-file tuple slab.
+
+    5. Iterate the two cursors in lockstep, comparing the ``parent_inum``,
+       ``name_hash``, and ``name_cookie`` fields of the records under each
+       cursor, as sketched in the code fragment after this list:
+
+ a. If the per-AG cursor is at a lower point in the keyspace than the
+ per-file cursor, then the per-AG cursor points to a missing parent
+ pointer.
+ Add the parent pointer to the inode and advance the per-AG
+ cursor.
+
+ b. If the per-file cursor is at a lower point in the keyspace than
+ the per-AG cursor, then the per-file cursor points to a dangling
+ parent pointer.
+ Remove the parent pointer from the inode and advance the per-file
+ cursor.
+
+ c. Otherwise, both cursors point at the same parent pointer.
+ Update the parent_gen component if necessary.
+ Advance both cursors.
+
+4. Move on to examining link counts, as we do today.
+
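+The merge loop in step 5 resembles the following sketch; the cursor type and
+helpers are illustrative stand-ins for the actual xfs_repair code, and
+``cmp_tuple`` orders records by ``(parent_inum, name_hash, name_cookie)``:
+
+.. code-block:: c
+
+    #include <stddef.h>
+
+    struct pptr_tuple;
+    struct slab_cursor;
+
+    extern struct pptr_tuple *cursor_peek(struct slab_cursor *cur);
+    extern void cursor_advance(struct slab_cursor *cur);
+    extern int cmp_tuple(const struct pptr_tuple *a,
+                         const struct pptr_tuple *b);
+    extern void add_missing_pptr(const struct pptr_tuple *dirent);
+    extern void remove_dangling_pptr(const struct pptr_tuple *pptr);
+    extern void update_parent_gen(const struct pptr_tuple *dirent,
+                                  const struct pptr_tuple *pptr);
+
+    static void merge_parent_pointers(struct slab_cursor *ag_cur,
+                                      struct slab_cursor *file_cur)
+    {
+        for (;;) {
+            struct pptr_tuple *dirent = cursor_peek(ag_cur);
+            struct pptr_tuple *pptr = cursor_peek(file_cur);
+
+            if (!dirent && !pptr)
+                break;
+            if (!pptr || (dirent && cmp_tuple(dirent, pptr) < 0)) {
+                /* dirent with no parent pointer: add the missing one */
+                add_missing_pptr(dirent);
+                cursor_advance(ag_cur);
+            } else if (!dirent || cmp_tuple(dirent, pptr) > 0) {
+                /* dangling parent pointer: remove it */
+                remove_dangling_pptr(pptr);
+                cursor_advance(file_cur);
+            } else {
+                /* both cursors agree; freshen the generation */
+                update_parent_gen(dirent, pptr);
+                cursor_advance(ag_cur);
+                cursor_advance(file_cur);
+            }
+        }
+    }
+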
+The proposed patchset is the
+`offline parent pointers repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-fsck>`_
+series.
+
+Rebuilding directories from parent pointers in offline repair would be very
+challenging because xfs_repair currently uses two single-pass scans of the
+filesystem during phases 3 and 4 to decide which files are corrupt enough to be
+zapped.
+This scan would have to be converted into a multi-pass scan:
+
+1. The first pass of the scan zaps corrupt inodes, forks, and attributes
+ much as it does now.
+ Corrupt directories are noted but not zapped.
+
+2. The next pass records parent pointers pointing to the directories noted
+ as being corrupt in the first pass.
+ This second pass may have to happen after the phase 4 scan for duplicate
+ blocks, if phase 4 is also capable of zapping directories.
+
+3. The third pass resets corrupt directories to an empty shortform directory.
+ Free space metadata has not been ensured yet, so repair cannot yet use the
+ directory building code in libxfs.
+
+4. At the start of phase 6, space metadata have been rebuilt.
+ Use the parent pointer information recorded during step 2 to reconstruct
+ the dirents and add them to the now-empty directories.
+
+This code has not yet been constructed.
+
+.. _dirtree:
+
+Case Study: Directory Tree Structure
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As mentioned earlier, the filesystem directory tree is supposed to be a
+directed acyclic graph structure.
+However, each node in this graph is a separate ``xfs_inode`` object with its
+own locks, which makes validating the tree qualities difficult.
+Fortunately, non-directories are allowed to have multiple parents and cannot
+have children, so only directories need to be scanned.
+Directories typically constitute 5-10% of the files in a filesystem, which
+reduces the amount of work dramatically.
+
+If the directory tree could be frozen, it would be easy to discover cycles and
+disconnected regions by running a depth (or breadth) first search downwards
+from the root directory and marking a bitmap for each directory found.
+At any point in the walk, trying to set an already set bit means there is a
+cycle.
+After the scan completes, XORing the marked inode bitmap with the inode
+allocation bitmap reveals disconnected inodes.
+However, one of online repair's design goals is to avoid locking the entire
+filesystem unless it's absolutely necessary.
+Directory tree updates can move subtrees across the scanner wavefront on a live
+filesystem, so the bitmap algorithm cannot be applied.
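+
+For illustration only, the frozen-tree walk described above might look like
+the following sketch; the bitmap and graph accessors are hypothetical, since
+no real accessor can provide this view without freezing directory tree
+updates:
+
+.. code-block:: c
+
+    #include <stdbool.h>
+    #include <stddef.h>
+    #include <stdint.h>
+
+    extern bool bitmap_test_and_set(uint64_t ino); /* returns old value */
+    extern size_t child_dirs(uint64_t dir_ino, uint64_t *children,
+                             size_t max);
+
+    /* depth-first walk; returns true if a cycle was found */
+    static bool walk_finds_cycle(uint64_t dir_ino)
+    {
+        uint64_t children[64];
+        size_t nr;
+
+        /* setting an already-set bit means there is a cycle */
+        if (bitmap_test_and_set(dir_ino))
+            return true;
+
+        nr = child_dirs(dir_ino, children, 64);
+        for (size_t i = 0; i < nr; i++)
+            if (walk_finds_cycle(children[i]))
+                return true;
+        return false;
+    }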
+
+Directory parent pointers enable an incremental approach to validation of the
+tree structure.
+Instead of using one thread to scan the entire filesystem, multiple threads can
+walk from individual subdirectories upwards towards the root.
+For this to work, all directory entries and parent pointers must be internally
+consistent, each directory entry must have a parent pointer, and the link
+counts of all directories must be correct.
+Each scanner thread must be able to take the IOLOCK of an alleged parent
+directory while holding the IOLOCK of the child directory to prevent either
+directory from being moved within the tree.
+This is not possible since the VFS does not take the IOLOCK of a child
+subdirectory when moving that subdirectory, so instead the scanner stabilizes
+the parent -> child relationship by taking the ILOCKs and installing a dirent
+update hook to detect changes.
+
+The scanning process uses a dirent hook to detect changes to the directories
+mentioned in the scan data.
+The scan works as follows:
+
+1. For each subdirectory in the filesystem,
+
+ a. For each parent pointer of that subdirectory,
+
+ 1. Create a path object for that parent pointer, and mark the
+ subdirectory inode number in the path object's bitmap.
+
+ 2. Record the parent pointer name and inode number in a path structure.
+
+ 3. If the alleged parent is the subdirectory being scrubbed, the path is
+ a cycle.
+ Mark the path for deletion and repeat step 1a with the next
+ subdirectory parent pointer.
+
+ 4. Try to mark the alleged parent inode number in a bitmap in the path
+ object.
+ If the bit is already set, then there is a cycle in the directory
+ tree.
+ Mark the path as a cycle and repeat step 1a with the next subdirectory
+ parent pointer.
+
+ 5. Load the alleged parent.
+ If the alleged parent is not a linked directory, abort the scan
+ because the parent pointer information is inconsistent.
+
+ 6. For each parent pointer of this alleged ancestor directory,
+
+ a. Record the parent pointer name and inode number in the path object
+ if no parent has been set for that level.
+
+ b. If an ancestor has more than one parent, mark the path as corrupt.
+ Repeat step 1a with the next subdirectory parent pointer.
+
+ c. Repeat steps 1a3-1a6 for the ancestor identified in step 1a6a.
+ This repeats until the directory tree root is reached or no parents
+ are found.
+
+ 7. If the walk terminates at the root directory, mark the path as ok.
+
+ 8. If the walk terminates without reaching the root, mark the path as
+ disconnected.
+
+2. If the directory entry update hook triggers, check all paths already found
+ by the scan.
+ If the entry matches part of a path, mark that path and the scan stale.
+ When the scanner thread sees that the scan has been marked stale, it deletes
+ all scan data and starts over.
+
+Repairing the directory tree works as follows (a sketch of the disposition
+logic follows the list):
+
+1. Walk each path of the target subdirectory.
+
+ a. Corrupt paths and cycle paths are counted as suspect.
+
+ b. Paths already marked for deletion are counted as bad.
+
+ c. Paths that reached the root are counted as good.
+
+2. If the subdirectory is either the root directory or has zero link count,
+ delete all incoming directory entries in the immediate parents.
+ Repairs are complete.
+
+3. If the subdirectory has exactly one path, set the dotdot entry to the
+ parent and exit.
+
+4. If the subdirectory has at least one good path, delete all the other
+ incoming directory entries in the immediate parents.
+
+5. If the subdirectory has no good paths and more than one suspect path, delete
+ all the other incoming directory entries in the immediate parents.
+
+6. If the subdirectory has zero paths, attach it to the lost and found.
+
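+A sketch of this disposition logic, with illustrative types; the kernel
+tracks considerably more state than shown here:
+
+.. code-block:: c
+
+    #include <stdbool.h>
+
+    struct dirtree_scan {
+        unsigned int good;      /* paths that reached the root */
+        unsigned int suspect;   /* corrupt or cyclic paths */
+        unsigned int bad;       /* paths already marked for deletion */
+        bool is_root;
+        bool zero_link_count;
+    };
+
+    enum dirtree_action {
+        DELETE_ALL_INCOMING,    /* sever every incoming dirent */
+        SET_DOTDOT,             /* one path: point dotdot at the parent */
+        PRUNE_EXTRA_PATHS,      /* keep one path, delete the others */
+        ADOPT_ORPHAN,           /* attach to the lost and found */
+        NO_ACTION,
+    };
+
+    static enum dirtree_action dispose_paths(const struct dirtree_scan *sc)
+    {
+        unsigned int nr_paths = sc->good + sc->suspect + sc->bad;
+
+        if (sc->is_root || sc->zero_link_count)
+            return DELETE_ALL_INCOMING;             /* step 2 */
+        if (nr_paths == 1)
+            return SET_DOTDOT;                      /* step 3 */
+        if (sc->good > 0 || sc->suspect > 1)
+            return PRUNE_EXTRA_PATHS;               /* steps 4-5 */
+        if (nr_paths == 0)
+            return ADOPT_ORPHAN;                    /* step 6 */
+        return NO_ACTION;
+    }
+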
+The proposed patches are in the
+`directory tree repair
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-directory-tree>`_
+series.
+
+
+.. _orphanage:
+
+The Orphanage
+-------------
+
+Filesystems present files as a directed, and hopefully acyclic, graph.
+In other words, a tree.
+The root of the filesystem is a directory, and each entry in a directory points
+downwards either to more subdirectories or to non-directory files.
+Unfortunately, a disruption in the directory graph pointers results in a
+disconnected graph, which makes files impossible to access via regular path
+resolution.
+
+Without parent pointers, the directory parent pointer online scrub code can
+detect a dotdot entry pointing to a parent directory that doesn't have a link
+back to the child directory and the file link count checker can detect a file
+that isn't pointed to by any directory in the filesystem.
+If such a file has a positive link count, the file is an orphan.
+
+With parent pointers, directories can be rebuilt by scanning parent pointers
+and parent pointers can be rebuilt by scanning directories.
+This should reduce the incidence of files ending up in ``/lost+found``.
+
+When orphans are found, they should be reconnected to the directory tree.
+Offline fsck solves the problem by creating a directory ``/lost+found`` to
+serve as an orphanage, and linking orphan files into the orphanage by using the
+inumber as the name.
+Reparenting a file to the orphanage does not reset any of its permissions or
+ACLs.
+
+This process is more involved in the kernel than it is in userspace.
+The directory and file link count repair setup functions must use the regular
+VFS mechanisms to create the orphanage directory with all the necessary
+security attributes and dentry cache entries, just like a regular directory
+tree modification.
+
+Orphaned files are adopted by the orphanage as follows:
+
+1. Call ``xrep_orphanage_try_create`` at the start of the scrub setup function
+ to try to ensure that the lost and found directory actually exists.
+ This also attaches the orphanage directory to the scrub context.
+
+2. If the decision is made to reconnect a file, take the IOLOCK of both the
+ orphanage and the file being reattached.
+ The ``xrep_orphanage_iolock_two`` function follows the inode locking
+ strategy discussed earlier.
+
+3. Use ``xrep_adoption_trans_alloc`` to reserve resources to the repair
+ transaction.
+
+4. Call ``xrep_orphanage_compute_name`` to compute the new name in the
+ orphanage.
+
+5. If the adoption is going to happen, call ``xrep_adoption_reparent`` to
+ reparent the orphaned file into the lost and found and invalidate the dentry
+ cache.
+
+6. Call ``xrep_adoption_finish`` to commit any filesystem updates, release the
+ orphanage ILOCK, and clean the scrub transaction. Call
+ ``xrep_adoption_commit`` to commit the updates and the scrub transaction.
+
+7. If a runtime error happens, call ``xrep_adoption_cancel`` to release all
+ resources.
+
+The proposed patches are in the
+`orphanage adoption
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
+series.
+
+6. Userspace Algorithms and Data Structures
+===========================================
+
+This section discusses the key algorithms and data structures of the userspace
+program, ``xfs_scrub``, that provide the ability to drive metadata checks and
+repairs in the kernel, verify file data, and look for other potential problems.
+
+.. _scrubcheck:
+
+Checking Metadata
+-----------------
+
+Recall the :ref:`phases of fsck work<scrubphases>` outlined earlier.
+That structure follows naturally from the data dependencies designed into the
+filesystem from its beginnings in 1993.
+In XFS, there are several groups of metadata dependencies:
+
+a. Filesystem summary counts depend on consistency within the inode indices,
+ the allocation group space btrees, and the realtime volume space
+ information.
+
+b. Quota resource counts depend on consistency within the quota file data
+ forks, inode indices, inode records, and the forks of every file on the
+ system.
+
+c. The naming hierarchy depends on consistency within the directory and
+ extended attribute structures.
+ This includes file link counts.
+
+d. Directories, extended attributes, and file data depend on consistency within
+ the file forks that map directory and extended attribute data to physical
+ storage media.
+
+e. The file forks depend on consistency within inode records and the space
+ metadata indices of the allocation groups and the realtime volume.
+ This includes quota and realtime metadata files.
+
+f. Inode records depend on consistency within the inode metadata indices.
+
+g. Realtime space metadata depend on the inode records and data forks of the
+ realtime metadata inodes.
+
+h. The allocation group metadata indices (free space, inodes, reference count,
+ and reverse mapping btrees) depend on consistency within the AG headers and
+ between all the AG metadata btrees.
+
+i. ``xfs_scrub`` depends on the filesystem being mounted and kernel support
+ for online fsck functionality.
+
+Therefore, a metadata dependency graph is a convenient way to schedule checking
+operations in the ``xfs_scrub`` program:
+
+- Phase 1 checks that the provided path maps to an XFS filesystem and detects
+ the kernel's scrubbing abilities, which validates group (i).
+
+- Phase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue.
+
+- Phase 3 scans inodes in parallel.
+ For each inode, groups (f), (e), and (d) are checked, in that order.
+
+- Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6
+ may run reliably.
+
+- Phase 5 starts by checking groups (b) and (c) in parallel before moving on
+ to checking names.
+
+- Phase 6 depends on groups (i) through (b) to find file data blocks to verify,
+ to read them, and to report which blocks of which files are affected.
+
+- Phase 7 checks group (a), having validated everything else.
+
+Notice that the data dependencies between groups are enforced by the structure
+of the program flow.
+
+Parallel Inode Scans
+--------------------
+
+An XFS filesystem can easily contain hundreds of millions of inodes.
+Given that XFS targets installations with large high-performance storage,
+it is desirable to scrub inodes in parallel to minimize runtime, particularly
+if the program has been invoked manually from a command line.
+This requires careful scheduling to keep the threads as evenly loaded as
+possible.
+
+Early iterations of the ``xfs_scrub`` inode scanner naïvely created a single
+workqueue and scheduled a single workqueue item per AG.
+Each workqueue item walked the inode btree (with ``XFS_IOC_INUMBERS``) to find
+inode chunks and then called bulkstat (``XFS_IOC_BULKSTAT``) to gather enough
+information to construct file handles.
+The file handle was then passed to a function to generate scrub items for each
+metadata object of each inode.
+This simple algorithm leads to thread balancing problems in phase 3 if the
+filesystem contains one AG with a few large sparse files and the rest of the
+AGs contain many smaller files.
+The inode scan dispatch function was not sufficiently granular; it should have
+been dispatching at the level of individual inodes, or, to constrain memory
+consumption, inode btree records.
+
+Thanks to Dave Chinner, bounded workqueues in userspace enable ``xfs_scrub`` to
+avoid this problem with ease by adding a second workqueue.
+Just like before, the first workqueue is seeded with one workqueue item per AG,
+and it uses INUMBERS to find inode btree chunks.
+The second workqueue, however, is configured with an upper bound on the number
+of items that can be waiting to be run.
+Each inode btree chunk found by the first workqueue's workers is queued to the
+second workqueue, and it is this second workqueue that queries BULKSTAT,
+creates a file handle, and passes it to a function to generate scrub items for
+each metadata object of each inode.
+If the second workqueue is too full, the workqueue add function blocks the
+first workqueue's workers until the backlog eases.
+This doesn't completely solve the balancing problem, but reduces it enough to
+move on to more pressing issues.
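+
+A generic pthreads sketch of the bounded queue that provides this
+backpressure follows; the real implementation is the libfrog workqueue code,
+and this is not its API:
+
+.. code-block:: c
+
+    #include <pthread.h>
+
+    #define QUEUE_BOUND 64
+
+    struct bounded_queue {
+        pthread_mutex_t lock;
+        pthread_cond_t  not_full;
+        pthread_cond_t  not_empty;
+        void            *items[QUEUE_BOUND];
+        unsigned int    head, tail, count;
+    };
+
+    /* called by the first (per-AG INUMBERS) workqueue's workers */
+    void bq_put(struct bounded_queue *q, void *inode_chunk)
+    {
+        pthread_mutex_lock(&q->lock);
+        /* backpressure: block the producer while the queue is full */
+        while (q->count == QUEUE_BOUND)
+            pthread_cond_wait(&q->not_full, &q->lock);
+        q->items[q->tail] = inode_chunk;
+        q->tail = (q->tail + 1) % QUEUE_BOUND;
+        q->count++;
+        pthread_cond_signal(&q->not_empty);
+        pthread_mutex_unlock(&q->lock);
+    }
+
+    /* called by the second workqueue's workers before running BULKSTAT */
+    void *bq_get(struct bounded_queue *q)
+    {
+        void *inode_chunk;
+
+        pthread_mutex_lock(&q->lock);
+        while (q->count == 0)
+            pthread_cond_wait(&q->not_empty, &q->lock);
+        inode_chunk = q->items[q->head];
+        q->head = (q->head + 1) % QUEUE_BOUND;
+        q->count--;
+        pthread_cond_signal(&q->not_full);
+        pthread_mutex_unlock(&q->lock);
+        return inode_chunk;
+    }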
+
+The proposed patchsets are the scrub
+`performance tweaks
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-tweaks>`_
+and the
+`inode scan rebalance
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalance>`_
+series.
+
+.. _scrubrepair:
+
+Scheduling Repairs
+------------------
+
+During phase 2, corruptions and inconsistencies reported in any AGI header or
+inode btree are repaired immediately, because phase 3 relies on proper
+functioning of the inode indices to find inodes to scan.
+Failed repairs are rescheduled to phase 4.
+Problems reported in any other space metadata are deferred to phase 4.
+Optimization opportunities are always deferred to phase 4, no matter their
+origin.
+
+During phase 3, corruptions and inconsistencies reported in any part of a
+file's metadata are repaired immediately if all space metadata were validated
+during phase 2.
+Repairs that fail or cannot be repaired immediately are scheduled for phase 4.
+
+In the original design of ``xfs_scrub``, it was thought that repairs would be
+so infrequent that the ``struct xfs_scrub_metadata`` objects used to
+communicate with the kernel could also be used as the primary object to
+schedule repairs.
+With recent increases in the number of optimizations possible for a given
+filesystem object, it became much more memory-efficient to track all eligible
+repairs for a given filesystem object with a single repair item.
+Each repair item represents a single lockable object -- AGs, metadata files,
+individual inodes, or a class of summary information.
+
+Phase 4 is responsible for scheduling a lot of repair work as quickly as is
+practical.
+The :ref:`data dependencies <scrubcheck>` outlined earlier still apply, which
+means that ``xfs_scrub`` must try to complete the repair work scheduled by
+phase 2 before trying repair work scheduled by phase 3.
+The repair process is as follows (a condensed sketch of the retry loop
+follows the list):
+
+1. Start a round of repair with a workqueue and enough workers to keep the CPUs
+ as busy as the user desires.
+
+ a. For each repair item queued by phase 2,
+
+ i. Ask the kernel to repair everything listed in the repair item for a
+ given filesystem object.
+
+ ii. Make a note if the kernel made any progress in reducing the number
+ of repairs needed for this object.
+
+ iii. If the object no longer requires repairs, revalidate all metadata
+ associated with this object.
+ If the revalidation succeeds, drop the repair item.
+ If not, requeue the item for more repairs.
+
+ b. If any repairs were made, jump back to 1a to retry all the phase 2 items.
+
+ c. For each repair item queued by phase 3,
+
+ i. Ask the kernel to repair everything listed in the repair item for a
+ given filesystem object.
+
+ ii. Make a note if the kernel made any progress in reducing the number
+ of repairs needed for this object.
+
+ iii. If the object no longer requires repairs, revalidate all metadata
+ associated with this object.
+ If the revalidation succeeds, drop the repair item.
+ If not, requeue the item for more repairs.
+
+ d. If any repairs were made, jump back to 1c to retry all the phase 3 items.
+
+2. If step 1 made any repair progress of any kind, jump back to step 1 to start
+ another round of repair.
+
+3. If there are items left to repair, run them all serially one more time.
+ Complain if the repairs were not successful, since this is the last chance
+ to repair anything.
+
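+A condensed sketch of the retry loop, with stand-in helpers for the real
+``xfs_scrub`` machinery:
+
+.. code-block:: c
+
+    #include <stdbool.h>
+
+    struct repair_list;
+
+    /* one pass over a list (steps 1a-1b or 1c-1d); true if progress */
+    extern bool repair_round(struct repair_list *list);
+    extern bool list_empty(struct repair_list *list);
+    extern void final_serial_pass(struct repair_list *list);
+
+    void phase4_repair(struct repair_list *phase2_items,
+                       struct repair_list *phase3_items)
+    {
+        bool progress;
+
+        do {
+            progress = false;
+
+            /* retry the phase 2 items until they stop improving */
+            while (repair_round(phase2_items))
+                progress = true;
+
+            /* then retry the phase 3 items the same way */
+            while (repair_round(phase3_items))
+                progress = true;
+        } while (progress); /* any progress warrants another round */
+
+        /* last chance: run the stragglers serially, complain on failure */
+        if (!list_empty(phase2_items))
+            final_serial_pass(phase2_items);
+        if (!list_empty(phase3_items))
+            final_serial_pass(phase3_items);
+    }
+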
+Corruptions and inconsistencies encountered during phases 5 and 7 are repaired
+immediately.
+Corrupt file data blocks reported by phase 6 cannot be recovered by the
+filesystem.
+
+The proposed patchsets are the
+`repair warning improvements
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warnings>`_,
+refactoring of the
+`repair data dependency
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-deps>`_
+and
+`object tracking
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracking>`_,
+and the
+`repair scheduling
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-scheduling>`_
+improvement series.
+
+Checking Names for Confusable Unicode Sequences
+-----------------------------------------------
+
+If ``xfs_scrub`` succeeds in validating the filesystem metadata by the end of
+phase 4, it moves on to phase 5, which checks for suspicious looking names in
+the filesystem.
+These names consist of the filesystem label, names in directory entries, and
+the names of extended attributes.
+Like most Unix filesystems, XFS imposes the sparest of constraints on the
+contents of a name:
+
+- Slashes and null bytes are not allowed in directory entries.
+
+- Null bytes are not allowed in userspace-visible extended attributes.
+
+- Null bytes are not allowed in the filesystem label.
+
+Directory entries and attribute keys store the length of the name explicitly
+ondisk, which means that nulls are not name terminators.
+For this section, the term "naming domain" refers to any place where names are
+presented together -- all the names in a directory, or all the attributes of a
+file.
+
+Although the Unix naming constraints are very permissive, the reality of most
+modern-day Linux systems is that programs work with Unicode character code
+points to support international languages.
+These programs typically encode those code points in UTF-8 when interfacing
+with the C library because the kernel expects null-terminated names.
+In the common case, therefore, names found in an XFS filesystem are actually
+UTF-8 encoded Unicode data.
+
+To maximize its expressiveness, the Unicode standard defines separate control
+points for various characters that render similarly or identically in writing
+systems around the world.
+For example, the character "Cyrillic Small Letter A" U+0430 "а" often renders
+identically to "Latin Small Letter A" U+0061 "a".
+
+The standard also permits characters to be constructed in multiple ways --
+either by using a defined code point, or by combining one code point with
+various combining marks.
+For example, the character "Angstrom Sign U+212B "Å" can also be expressed
+as "Latin Capital Letter A" U+0041 "A" followed by "Combining Ring Above"
+U+030A "◌̊".
+Both sequences render identically.
+
+Like the standards that preceded it, Unicode also defines various control
+characters to alter the presentation of text.
+For example, the character "Right-to-Left Override" U+202E can trick some
+programs into rendering "moo\\xe2\\x80\\xaegnp.txt" as "mootxt.png".
+A second category of rendering problems involves whitespace characters.
+If the character "Zero Width Space" U+200B is encountered in a file name, the
+name will render identically to a name that does not have the zero width
+space.
+
+If two names within a naming domain have different byte sequences but render
+identically, a user may be confused by it.
+The kernel, in its indifference to upper level encoding schemes, permits this.
+Most filesystem drivers persist the byte sequence names that are given to them
+by the VFS.
+
+Techniques for detecting confusable names are explained in great detail in
+sections 4 and 5 of the
+`Unicode Security Mechanisms <https://unicode.org/reports/tr39/>`_
+document.
+When ``xfs_scrub`` detects UTF-8 encoding in use on a system, it uses the
+Unicode normalization form NFD in conjunction with the confusable name
+detection component of
+`libicu <https://github.com/unicode-org/icu>`_
+to identify names within a directory or within a file's extended attributes
+that could be confused for each other.
+Names are also checked for control characters, non-rendering characters, and
+mixing of bidirectional characters.
+All of these potential issues are reported to the system administrator during
+phase 5.
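+
+As a rough illustration, a pairwise confusable-name check with libicu's
+spoof checker might look like the sketch below; the real ``xfs_scrub`` also
+applies NFD normalization and examines entire naming domains, not just
+pairs.  Build with ``-licui18n -licuuc``:
+
+.. code-block:: c
+
+    #include <stdio.h>
+    #include <unicode/uspoof.h>
+
+    /* returns 1 if confusable, 0 if not, -1 on ICU error */
+    static int names_confusable(const char *name1, const char *name2)
+    {
+        UErrorCode status = U_ZERO_ERROR;
+        USpoofChecker *sc = uspoof_open(&status);
+        int32_t result;
+
+        if (U_FAILURE(status))
+            return -1;
+
+        /* nonzero means the two UTF-8 strings render confusably */
+        result = uspoof_areConfusableUTF8(sc, name1, -1, name2, -1,
+                                          &status);
+        uspoof_close(sc);
+        return U_FAILURE(status) ? -1 : result != 0;
+    }
+
+    int main(void)
+    {
+        /* Latin "paypal" vs. a look-alike with Cyrillic U+0430 */
+        printf("confusable: %d\n",
+               names_confusable("paypal", "p\xD0\xB0ypal"));
+        return 0;
+    }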
+
+Media Verification of File Data Extents
+---------------------------------------
+
+The system administrator can elect to initiate a media scan of all file data
+blocks.
+This scan runs as phase 6, after validation of all filesystem metadata
+(except for the summary counters).
+The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the filesystem space map
+to find areas that are allocated to file data fork extents.
+Gaps between data fork extents that are smaller than 64k are treated as if
+they were data fork extents to reduce the command setup overhead.
+When the space map scan accumulates a region larger than 32MB, a media
+verification request is sent to the disk as a directio read of the raw block
+device.
+
+If the verification read fails, ``xfs_scrub`` retries with single-block reads
+to narrow the failure down to the specific region of the media, and records
+that region.
+When it has finished issuing verification requests, it again uses the space
+mapping ioctl to map the recorded media errors back to metadata structures
+and report what has been lost.
+For media errors in blocks owned by files, parent pointers can be used to
+construct file paths from inode numbers for user-friendly reporting.
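+
+A sketch of the space map walk follows.  ``FS_IOC_GETFSMAP`` and its
+structures are the real uapi from ``<linux/fsmap.h>``; the gap coalescing
+and the O_DIRECT verification reads are omitted:
+
+.. code-block:: c
+
+    #include <errno.h>
+    #include <stdio.h>
+    #include <stdlib.h>
+    #include <string.h>
+    #include <sys/ioctl.h>
+    #include <linux/fsmap.h>
+
+    #define NR_RECS 128U
+
+    static int walk_data_extents(int fsfd)
+    {
+        struct fsmap_head *head;
+        unsigned int i;
+
+        head = calloc(1, fsmap_sizeof(NR_RECS));
+        if (!head)
+            return -ENOMEM;
+        head->fmh_count = NR_RECS;
+        /* low key is all zeroes; set the high key to "everything" */
+        memset(&head->fmh_keys[1], 0xFF, sizeof(struct fsmap));
+
+        for (;;) {
+            if (ioctl(fsfd, FS_IOC_GETFSMAP, head)) {
+                free(head);
+                return -errno;
+            }
+            if (!head->fmh_entries)
+                break;
+            for (i = 0; i < head->fmh_entries; i++) {
+                struct fsmap *rec = &head->fmh_recs[i];
+
+                /* skip metadata; file data has a plain inode owner */
+                if (rec->fmr_flags & (FMR_OF_SPECIAL_OWNER |
+                                      FMR_OF_EXTENT_MAP |
+                                      FMR_OF_ATTR_FORK))
+                    continue;
+                printf("dev %u phys %llu len %llu\n",
+                       rec->fmr_device,
+                       (unsigned long long)rec->fmr_physical,
+                       (unsigned long long)rec->fmr_length);
+            }
+            if (head->fmh_recs[head->fmh_entries - 1].fmr_flags &
+                FMR_OF_LAST)
+                break;
+            fsmap_advance(head);
+        }
+        free(head);
+        return 0;
+    }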
+
+7. Conclusion and Future Work
+=============================
+
+It is hoped that the reader has followed the designs laid out in this
+document and now has some familiarity with how XFS performs online
+rebuilding of its metadata indices, and how filesystem users can interact
+with that functionality.
+Although the scope of this work is daunting, it is hoped that this guide will
+make it easier for code readers to understand what has been built, for whom it
+has been built, and why.
+Please feel free to contact the XFS mailing list with questions.
+
+XFS_IOC_EXCHANGE_RANGE
+----------------------
+
+As discussed earlier, a second frontend to the atomic file mapping exchange
+mechanism is a new ioctl call that userspace programs can use to commit updates
+to files atomically.
+This frontend has been out for review for several years now, though the
+necessary refinements to online repair and lack of customer demand mean that
+the proposal has not been pushed very hard.
+
+File Content Exchanges with Regular User Files
+``````````````````````````````````````````````
+
+As mentioned earlier, XFS has long had the ability to swap extents between
+files, which is used almost exclusively by ``xfs_fsr`` to defragment files.
+The earliest form of this was the fork swap mechanism, where the entire
+contents of data forks could be exchanged between two files by exchanging the
+raw bytes in each inode fork's immediate area.
+When XFS v5 came along with self-describing metadata, this old mechanism grew
+some log support to continue rewriting the owner fields of BMBT blocks during
+log recovery.
+When the reverse mapping btree was later added to XFS, the only way to maintain
+the consistency of the fork mappings with the reverse mapping index was to
+develop an iterative mechanism that used deferred bmap and rmap operations to
+swap mappings one at a time.
+This mechanism is identical to steps 2-3 from the procedure above except for
+the new tracking items, because the atomic file mapping exchange mechanism is
+an iteration of an existing mechanism and not something totally novel.
+For the narrow case of file defragmentation, the file contents must be
+identical, so the recovery guarantees are not much of a gain.
+
+Atomic file content exchanges are much more flexible than the existing
+swapext implementations because they can guarantee that the caller never
+sees a mix of old and new contents even after a crash, and they can operate
+on two arbitrary file fork ranges.
+The extra flexibility enables several new use cases; a sketch of the first
+one follows this list:
+
+- **Atomic commit of file writes**: A userspace process opens a file that it
+ wants to update.
+ Next, it opens a temporary file and calls the file clone operation to reflink
+ the first file's contents into the temporary file.
+ Writes to the original file should instead be written to the temporary file.
+ Finally, the process calls the atomic file mapping exchange system call
+ (``XFS_IOC_EXCHANGE_RANGE``) to exchange the file contents, thereby
+ committing all of the updates to the original file, or none of them.
+
+.. _exchrange_if_unchanged:
+
+- **Transactional file updates**: The same mechanism as above, but the caller
+ only wants the commit to occur if the original file's contents have not
+ changed.
+ To make this happen, the calling process snapshots the file modification and
+ change timestamps of the original file before reflinking its data to the
+ temporary file.
+ When the program is ready to commit the changes, it passes the timestamps
+ into the kernel as arguments to the atomic file mapping exchange system call.
+ The kernel only commits the changes if the provided timestamps match the
+ original file.
+ A new ioctl (``XFS_IOC_COMMIT_RANGE``) is provided to perform this.
+
+- **Emulation of atomic block device writes**: Export a block device with a
+ logical sector size matching the filesystem block size to force all writes
+ to be aligned to the filesystem block size.
+ Stage all writes to a temporary file, and when that is complete, call the
+ atomic file mapping exchange system call with a flag to indicate that holes
+ in the temporary file should be ignored.
+ This emulates an atomic device write in software, and can support arbitrary
+ scattered writes.
+
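+As an illustration of the first use case, here is a sketch of an atomic
+file commit.  The ``struct xfs_exchange_range`` layout and ioctl number are
+assumed from the proposal at the time of writing; verify them against the
+kernel's actual uapi headers before relying on this.  Error handling is
+abbreviated, and the staging file must live on the same XFS filesystem as
+the target:
+
+.. code-block:: c
+
+    #define _GNU_SOURCE         /* O_TMPFILE */
+    #include <fcntl.h>
+    #include <string.h>
+    #include <unistd.h>
+    #include <sys/ioctl.h>
+    #include <linux/fs.h>       /* FICLONE */
+    #include <linux/types.h>
+
+    /* assumed uapi layout from the proposed patchset */
+    struct xfs_exchange_range {
+        __s32 file1_fd;
+        __u32 pad;              /* must be zero */
+        __u64 file1_offset;     /* bytes */
+        __u64 file2_offset;     /* bytes */
+        __u64 length;           /* bytes to exchange */
+        __u64 flags;
+    };
+    #define XFS_IOC_EXCHANGE_RANGE _IOW('X', 129, struct xfs_exchange_range)
+
+    /* replace @len bytes at @offset in @path atomically, or not at all;
+     * @dir names a directory on the same filesystem for the staging file */
+    static int commit_file_update(const char *dir, const char *path,
+                                  const void *new_data, size_t len,
+                                  off_t offset)
+    {
+        struct xfs_exchange_range xchg;
+        int fd, tmpfd, ret = -1;
+
+        fd = open(path, O_RDWR);
+        if (fd < 0)
+            return -1;
+        tmpfd = open(dir, O_TMPFILE | O_RDWR, 0600);
+        if (tmpfd < 0)
+            goto out_fd;
+
+        /* reflink the original contents into the staging file */
+        if (ioctl(tmpfd, FICLONE, fd))
+            goto out_tmp;
+
+        /* stage the update in the temporary file */
+        if (pwrite(tmpfd, new_data, len, offset) != (ssize_t)len)
+            goto out_tmp;
+        if (fsync(tmpfd))
+            goto out_tmp;
+
+        /* exchange the staged range back into the original file */
+        memset(&xchg, 0, sizeof(xchg));
+        xchg.file1_fd = tmpfd;
+        xchg.file1_offset = offset;
+        xchg.file2_offset = offset;
+        xchg.length = len;
+        ret = ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &xchg);
+
+    out_tmp:
+        close(tmpfd);
+    out_fd:
+        close(fd);
+        return ret;
+    }
+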
+Vectorized Scrub
+----------------
+
+As it turns out, the :ref:`refactoring <scrubrepair>` of repair items mentioned
+earlier was a catalyst for enabling a vectorized scrub system call.
+Since 2018, the cost of making a kernel call has increased considerably on some
+systems to mitigate the effects of speculative execution attacks.
+This incentivizes program authors to make as few system calls as possible to
+reduce the number of times an execution path crosses a security boundary.
+
+With vectorized scrub, userspace pushes to the kernel the identity of a
+filesystem object, a list of scrub types to run against that object, and a
+simple representation of the data dependencies between the selected scrub
+types.
+The kernel executes as much of the caller's plan as it can until it hits a
+dependency that cannot be satisfied due to a corruption, and tells userspace
+how much was accomplished.
+It is hoped that ``io_uring`` will pick up enough of this functionality that
+online fsck can use that instead of adding a separate vectored scrub system
+call to XFS.
+
+The relevant patchsets are the
+`kernel vectorized scrub
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_
+and
+`userspace vectorized scrub
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_
+series.
+
+Quality of Service Targets for Scrub
+------------------------------------
+
+One serious shortcoming of the online fsck code is that the amount of time that
+it can spend in the kernel holding resource locks is basically unbounded.
+Userspace is allowed to send a fatal signal to the process which will cause
+``xfs_scrub`` to exit when it reaches a good stopping point, but there's no way
+for userspace to provide a time budget to the kernel.
+Given that the scrub codebase has helpers to detect fatal signals, it shouldn't
+be too much work to allow userspace to specify a timeout for a scrub/repair
+operation and abort the operation if it exceeds budget.
+However, most repair functions have the property that once they begin to touch
+ondisk metadata, the operation cannot be cancelled cleanly, after which a QoS
+timeout is no longer useful.
+
+Defragmenting Free Space
+------------------------
+
+Over the years, many XFS users have requested the creation of a program to
+clear a portion of the physical storage underlying a filesystem so that it
+becomes a contiguous chunk of free space.
+Call this free space defragmenter ``clearspace`` for short.
+
+The first piece the ``clearspace`` program needs is the ability to read the
+reverse mapping index from userspace.
+This already exists in the form of the ``FS_IOC_GETFSMAP`` ioctl.
+The second piece it needs is a new fallocate mode
+(``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in a region and
+maps it to a file.
+Call this file the "space collector" file.
+The third piece is the ability to force an online repair.
+
+To clear all the metadata out of a portion of physical storage, clearspace
+uses the new fallocate map-freespace call to map any free space in that region
+to the space collector file.
+Next, clearspace finds all metadata blocks in that region by way of
+``GETFSMAP`` and issues forced repair requests on the data structure.
+This often results in the metadata being rebuilt somewhere that is not being
+cleared.
+After each relocation, clearspace calls the "map free space" function again to
+collect any newly freed space in the region being cleared.
+
+To clear all the file data out of a portion of the physical storage, clearspace
+uses the FSMAP information to find relevant file data blocks.
+Having identified a good target, it uses the ``FICLONERANGE`` call on that part
+of the file to try to share the physical space with a dummy file.
+Cloning the extent means that the original owners cannot overwrite the
+contents; any changes will be written somewhere else via copy-on-write.
+Clearspace makes its own copy of the frozen extent in an area that is not being
+cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic file content exchanges
+<exchrange_if_unchanged>` feature) to change the target file's data extent
+mapping away from the area being cleared.
+When all other mappings have been moved, clearspace reflinks the space into the
+space collector file so that it becomes unavailable.
+
+There are further optimizations that could apply to the above algorithm.
+To clear a piece of physical storage that has a high sharing factor, it is
+strongly desirable to retain this sharing factor.
+In fact, these extents should be moved first to maximize sharing factor after
+the operation completes.
+To make this work smoothly, clearspace needs a new ioctl
+(``FS_IOC_GETREFCOUNTS``) to report reference count information to userspace.
+With the refcount information exposed, clearspace can quickly find the longest,
+most shared data extents in the filesystem, and target them first.
+
+**Future Work Question**: How might the filesystem move inode chunks?
+
+*Answer*: To move inode chunks, Dave Chinner constructed a prototype program
+that creates a new file with the old contents and then locklessly runs around
+the filesystem updating directory entries.
+The operation cannot complete if the filesystem goes down.
+That problem isn't totally insurmountable: create an inode remapping table
+hidden behind a jump label, and a log item that tracks the kernel walking the
+filesystem to update directory entries.
+The trouble is, the kernel can't do anything about open files, since it cannot
+revoke them.
+
+**Future Work Question**: Can static keys be used to minimize the cost of
+supporting ``revoke()`` on XFS files?
+
+*Answer*: Yes.
+Until the first revocation, the bailout code need not be in the call path at
+all.
+
+The relevant patchsets are the
+`kernel freespace defrag
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace>`_
+and
+`userspace freespace defrag
+<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace>`_
+series.
+
+Shrinking Filesystems
+---------------------
+
+Removing the end of the filesystem ought to be a simple matter of evacuating
+the data and metadata at the end of the filesystem, and handing the freed space
+to the shrink code.
+That requires an evacuation of the space at the end of the filesystem, which
+is a use of free space defragmentation!
diff --git a/Documentation/filesystems/xfs-self-describing-metadata.rst b/Documentation/filesystems/xfs/xfs-self-describing-metadata.rst
index b79dbf36dc94..a10c4ae6955e 100644
--- a/Documentation/filesystems/xfs-self-describing-metadata.rst
+++ b/Documentation/filesystems/xfs/xfs-self-describing-metadata.rst
@@ -1,4 +1,5 @@
.. SPDX-License-Identifier: GPL-2.0
+.. _xfs_self_describing_metadata:
============================
XFS Self Describing Metadata
diff --git a/Documentation/filesystems/zonefs.rst b/Documentation/filesystems/zonefs.rst
index 71d845c6a700..c22124c2213d 100644
--- a/Documentation/filesystems/zonefs.rst
+++ b/Documentation/filesystems/zonefs.rst
@@ -110,14 +110,14 @@ contain files named "0", "1", "2", ... The file numbers also represent
increasing zone start sector on the device.
All read and write operations to zone files are not allowed beyond the file
-maximum size, that is, beyond the zone size. Any access exceeding the zone
-size is failed with the -EFBIG error.
+maximum size, that is, beyond the zone capacity. Any access exceeding the zone
+capacity is failed with the -EFBIG error.
Creating, deleting, renaming or modifying any attribute of files and
sub-directories is not allowed.
The number of blocks of a file as reported by stat() and fstat() indicates the
-size of the file zone, or in other words, the maximum file size.
+capacity of the zone file, or in other words, the maximum file size.
Conventional zone files
-----------------------
@@ -156,8 +156,8 @@ all accepted.
Truncating sequential zone files is allowed only down to 0, in which case, the
zone is reset to rewind the file zone write pointer position to the start of
-the zone, or up to the zone size, in which case the file's zone is transitioned
-to the FULL state (finish zone operation).
+the zone, or up to the zone capacity, in which case the file's zone is
+transitioned to the FULL state (finish zone operation).
Format options
--------------
@@ -306,8 +306,15 @@ Further notes:
Mount options
-------------
-zonefs define the "errors=<behavior>" mount option to allow the user to specify
-zonefs behavior in response to I/O errors, inode size inconsistencies or zone
+zonefs defines several mount options:
+
+* errors=<behavior>
+* explicit-open
+
+"errors=<behavior>" option
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The "errors=<behavior>" option mount option allows the user to specify zonefs
+behavior in response to I/O errors, inode size inconsistencies or zone
condition changes. The defined behaviors are as follows:
* remount-ro (default)
@@ -324,7 +331,63 @@ file size set to 0. This is necessary as the write pointer of read-only zones
is defined as invalid by the ZBC and ZAC standards, making it impossible to
discover the amount of data that has been written to the zone. In the case of a
read-only zone discovered at run-time, as indicated in the previous section.
-the size of the zone file is left unchanged from its last updated value.
+The size of the zone file is left unchanged from its last updated value.
+
+"explicit-open" option
+~~~~~~~~~~~~~~~~~~~~~~
+
+A zoned block device (e.g. an NVMe Zoned Namespace device) may have limits on
+the number of zones that can be active, that is, zones that are in the
+implicit open, explicit open or closed conditions. This potential limitation
+translates into a risk for applications to see write IO errors due to this
+limit being exceeded if the zone of a file is not already active when a write
+request is issued by the user.
+
+To avoid these potential errors, the "explicit-open" mount option forces zones
+to be made active using an open zone command when a file is opened for writing
+for the first time. If the zone open command succeeds, the application is then
+guaranteed that write requests can be processed. Conversely, the
+"explicit-open" mount option will result in a zone close command being issued
+to the device on the last close() of a zone file if the zone is not full nor
+empty.
+
+Runtime sysfs attributes
+------------------------
+
+zonefs defines several sysfs attributes for mounted devices. All attributes
+are user readable and can be found in the directory /sys/fs/zonefs/<dev>/,
+where <dev> is the name of the mounted zoned block device.
+
+The attributes defined are as follows.
+
+* **max_wro_seq_files**: This attribute reports the maximum number of
+ sequential zone files that can be open for writing. This number corresponds
+ to the maximum number of explicitly or implicitly open zones that the device
+ supports. A value of 0 means that the device has no limit and that any zone
+ (any file) can be open for writing and written at any time, regardless of the
+ state of other zones. When the *explicit-open* mount option is used, zonefs
+ will fail any open() system call requesting to open a sequential zone file for
+ writing when the number of sequential zone files already open for writing has
+ reached the *max_wro_seq_files* limit.
+* **nr_wro_seq_files**: This attribute reports the current number of sequential
+ zone files open for writing. When the "explicit-open" mount option is used,
+ this number can never exceed *max_wro_seq_files*. If the *explicit-open*
+ mount option is not used, the reported number can be greater than
+ *max_wro_seq_files*. In such case, it is the responsibility of the
+ application to not write simultaneously more than *max_wro_seq_files*
+ sequential zone files. Failure to do so can result in write errors.
+* **max_active_seq_files**: This attribute reports the maximum number of
+ sequential zone files that are in an active state, that is, sequential zone
+ files that are partially written (not empty nor full) or that have a zone that
+ is explicitly open (which happens only if the *explicit-open* mount option is
+ used). This number is always equal to the maximum number of active zones that
+ the device supports. A value of 0 means that the mounted device has no limit
+ on the number of sequential zone files that can be active.
+* **nr_active_seq_files**: This attribute reports the current number of
+ sequential zone files that are active. If *max_active_seq_files* is not 0,
+ then the value of *nr_active_seq_files* can never exceed the value of
+  *max_active_seq_files*, regardless of the use of the *explicit-open* mount
+ option.
Zonefs User Space Tools
=======================
@@ -401,8 +464,9 @@ append-writes to the file::
# ls -l /mnt/seq/0
-rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0
-Since files are statically mapped to zones on the disk, the number of blocks of
-a file as reported by stat() and fstat() indicates the size of the file zone::
+Since files are statically mapped to zones on the disk, the number of blocks
+of a file as reported by stat() and fstat() indicates the capacity of the file
+zone::
# stat /mnt/seq/0
File: /mnt/seq/0
@@ -416,5 +480,6 @@ a file as reported by stat() and fstat() indicates the size of the file zone::
The number of blocks of the file ("Blocks") in units of 512B blocks gives the
maximum file size of 524288 * 512 B = 256 MB, corresponding to the device zone
-size in this example. Of note is that the "IO block" field always indicates the
-minimum I/O size for writes and corresponds to the device physical sector size.
+capacity in this example. Of note is that the "IO block" field always
+indicates the minimum I/O size for writes and corresponds to the device
+physical sector size.