Diffstat (limited to 'Documentation/filesystems')
-rw-r--r--  Documentation/filesystems/adfs.txt | 24
-rw-r--r--  Documentation/filesystems/automount-support.txt | 2
-rw-r--r--  Documentation/filesystems/debugfs.txt | 6
-rw-r--r--  Documentation/filesystems/f2fs.txt | 216
-rw-r--r--  Documentation/filesystems/fscrypt.rst | 81
-rw-r--r--  Documentation/filesystems/fuse.rst (renamed from Documentation/filesystems/fuse.txt) | 163
-rw-r--r--  Documentation/filesystems/index.rst | 3
-rw-r--r--  Documentation/filesystems/mount_api.txt | 12
-rw-r--r--  Documentation/filesystems/nfs/fault_injection.txt | 69
-rw-r--r--  Documentation/filesystems/nfs/idmapper.txt | 75
-rw-r--r--  Documentation/filesystems/nfs/nfs-rdma.txt | 274
-rw-r--r--  Documentation/filesystems/nfs/nfs.txt | 136
-rw-r--r--  Documentation/filesystems/nfs/nfsd-admin-interfaces.txt | 41
-rw-r--r--  Documentation/filesystems/nfs/nfsroot.txt | 355
-rw-r--r--  Documentation/filesystems/nfs/pnfs-block-server.txt | 37
-rw-r--r--  Documentation/filesystems/nfs/pnfs-scsi-server.txt | 23
-rw-r--r--  Documentation/filesystems/path-lookup.rst | 68
-rw-r--r--  Documentation/filesystems/porting.rst | 8
-rw-r--r--  Documentation/filesystems/vfat.rst | 387
-rw-r--r--  Documentation/filesystems/vfat.txt | 347
-rw-r--r--  Documentation/filesystems/zonefs.txt | 404
21 files changed, 1080 insertions, 1651 deletions
diff --git a/Documentation/filesystems/adfs.txt b/Documentation/filesystems/adfs.txt
index 5949766353f7..0baa8e8c1fc1 100644
--- a/Documentation/filesystems/adfs.txt
+++ b/Documentation/filesystems/adfs.txt
@@ -1,3 +1,27 @@
+Filesystems supported by ADFS
+-----------------------------
+
+The ADFS module supports the following Filecore formats which have:
+
+- new maps
+- new directories or big directories
+
+In terms of the named formats, this means we support:
+
+- E and E+, with or without boot block
+- F and F+
+
+We fully support reading files from these filesystems, and writing to
+existing files within their existing allocation. Essentially, we do
+not support changing any of the filesystem metadata.
+
+This is intended to support loopback mounted Linux native filesystems
+on a RISC OS Filecore filesystem, but will allow the data within files
+to be changed.
+
+If write support (ADFS_FS_RW) is configured, we allow rudimentary
+directory updates, specifically updating the access mode and timestamp.
+
Mount options for ADFS
----------------------
diff --git a/Documentation/filesystems/automount-support.txt b/Documentation/filesystems/automount-support.txt
index b0afd3d55eaf..7d9f82607562 100644
--- a/Documentation/filesystems/automount-support.txt
+++ b/Documentation/filesystems/automount-support.txt
@@ -9,7 +9,7 @@ also be requested by userspace.
IN-KERNEL AUTOMOUNTING
======================
-See section "Mount Traps" of Documentation/filesystems/autofs.txt
+See section "Mount Traps" of Documentation/filesystems/autofs.rst
Then from userspace, you can just do something like:
diff --git a/Documentation/filesystems/debugfs.txt b/Documentation/filesystems/debugfs.txt
index dc497b96fa4f..55336a47a110 100644
--- a/Documentation/filesystems/debugfs.txt
+++ b/Documentation/filesystems/debugfs.txt
@@ -164,9 +164,9 @@ file.
void __iomem *base;
};
- struct dentry *debugfs_create_regset32(const char *name, umode_t mode,
- struct dentry *parent,
- struct debugfs_regset32 *regset);
+ void debugfs_create_regset32(const char *name, umode_t mode,
+ struct dentry *parent,
+ struct debugfs_regset32 *regset);
void debugfs_print_regs32(struct seq_file *s, struct debugfs_reg32 *regs,
int nregs, void __iomem *base, char *prefix);
diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt
index 3135b80df6da..4eb3e2ddd00e 100644
--- a/Documentation/filesystems/f2fs.txt
+++ b/Documentation/filesystems/f2fs.txt
@@ -235,6 +235,17 @@ checkpoint=%s[:%u[%]] Set to "disable" to turn off checkpointing. Set to "en
hide up to all remaining free space. The actual space that
would be unusable can be viewed at /sys/fs/f2fs/<disk>/unusable
This space is reclaimed once checkpoint=enable.
+compress_algorithm=%s Select the compression algorithm. f2fs currently supports
+ the "lzo" and "lz4" algorithms.
+compress_log_size=%u Configure the compression cluster size, which is
+ 4KB * (1 << %u). 16KB is both the minimum size and the
+ default size.
+compress_extension=%s Add the specified extension so that f2fs enables
+ compression on matching files. E.g. if all files with the
+ '.ext' extension have a high compression rate, '.ext' can
+ be put on the compression extension list so that these
+ files are compressed by default rather than via ioctl.
+ For other files, compression can still be enabled via ioctl.
================================================================================
DEBUGFS ENTRIES
@@ -259,170 +270,6 @@ The files in each per-device directory are shown in table below.
Files in /sys/fs/f2fs/<devname>
(see also Documentation/ABI/testing/sysfs-fs-f2fs)
-..............................................................................
- File Content
-
- gc_urgent_sleep_time This parameter controls sleep time for gc_urgent.
- 500 ms is set by default. See above gc_urgent.
-
- gc_min_sleep_time This tuning parameter controls the minimum sleep
- time for the garbage collection thread. Time is
- in milliseconds.
-
- gc_max_sleep_time This tuning parameter controls the maximum sleep
- time for the garbage collection thread. Time is
- in milliseconds.
-
- gc_no_gc_sleep_time This tuning parameter controls the default sleep
- time for the garbage collection thread. Time is
- in milliseconds.
-
- gc_idle This parameter controls the selection of victim
- policy for garbage collection. Setting gc_idle = 0
- (default) will disable this option. Setting
- gc_idle = 1 will select the Cost Benefit approach
- & setting gc_idle = 2 will select the greedy approach.
-
- gc_urgent This parameter controls triggering background GCs
- urgently or not. Setting gc_urgent = 0 [default]
- makes back to default behavior, while if it is set
- to 1, background thread starts to do GC by given
- gc_urgent_sleep_time interval.
-
- reclaim_segments This parameter controls the number of prefree
- segments to be reclaimed. If the number of prefree
- segments is larger than the number of segments
- in the proportion to the percentage over total
- volume size, f2fs tries to conduct checkpoint to
- reclaim the prefree segments to free segments.
- By default, 5% over total # of segments.
-
- main_blkaddr This value gives the first block address of
- MAIN area in the partition.
-
- max_small_discards This parameter controls the number of discard
- commands that consist small blocks less than 2MB.
- The candidates to be discarded are cached until
- checkpoint is triggered, and issued during the
- checkpoint. By default, it is disabled with 0.
-
- discard_granularity This parameter controls the granularity of discard
- command size. It will issue discard commands iif
- the size is larger than given granularity. Its
- unit size is 4KB, and 4 (=16KB) is set by default.
- The maximum value is 128 (=512KB).
-
- reserved_blocks This parameter indicates the number of blocks that
- f2fs reserves internally for root.
-
- batched_trim_sections This parameter controls the number of sections
- to be trimmed out in batch mode when FITRIM
- conducts. 32 sections is set by default.
-
- ipu_policy This parameter controls the policy of in-place
- updates in f2fs. There are five policies:
- 0x01: F2FS_IPU_FORCE, 0x02: F2FS_IPU_SSR,
- 0x04: F2FS_IPU_UTIL, 0x08: F2FS_IPU_SSR_UTIL,
- 0x10: F2FS_IPU_FSYNC.
-
- min_ipu_util This parameter controls the threshold to trigger
- in-place-updates. The number indicates percentage
- of the filesystem utilization, and used by
- F2FS_IPU_UTIL and F2FS_IPU_SSR_UTIL policies.
-
- min_fsync_blocks This parameter controls the threshold to trigger
- in-place-updates when F2FS_IPU_FSYNC mode is set.
- The number indicates the number of dirty pages
- when fsync needs to flush on its call path. If
- the number is less than this value, it triggers
- in-place-updates.
-
- min_seq_blocks This parameter controls the threshold to serialize
- write IOs issued by multiple threads in parallel.
-
- min_hot_blocks This parameter controls the threshold to allocate
- a hot data log for pending data blocks to write.
-
- min_ssr_sections This parameter adds the threshold when deciding
- SSR block allocation. If this is large, SSR mode
- will be enabled early.
-
- ram_thresh This parameter controls the memory footprint used
- by free nids and cached nat entries. By default,
- 1 is set, which indicates 10 MB / 1 GB RAM.
-
- ra_nid_pages When building free nids, F2FS reads NAT blocks
- ahead for speed up. Default is 0.
-
- dirty_nats_ratio Given dirty ratio of cached nat entries, F2FS
- determines flushing them in background.
-
- max_victim_search This parameter controls the number of trials to
- find a victim segment when conducting SSR and
- cleaning operations. The default value is 4096
- which covers 8GB block address range.
-
- migration_granularity For large-sized sections, F2FS can stop GC given
- this granularity instead of reclaiming entire
- section.
-
- dir_level This parameter controls the directory level to
- support large directory. If a directory has a
- number of files, it can reduce the file lookup
- latency by increasing this dir_level value.
- Otherwise, it needs to decrease this value to
- reduce the space overhead. The default value is 0.
-
- cp_interval F2FS tries to do checkpoint periodically, 60 secs
- by default.
-
- idle_interval F2FS detects system is idle, if there's no F2FS
- operations during given interval, 5 secs by
- default.
-
- discard_idle_interval F2FS detects the discard thread is idle, given
- time interval. Default is 5 secs.
-
- gc_idle_interval F2FS detects the GC thread is idle, given time
- interval. Default is 5 secs.
-
- umount_discard_timeout When unmounting the disk, F2FS waits for finishing
- queued discard commands which can take huge time.
- This gives time out for it, 5 secs by default.
-
- iostat_enable This controls to enable/disable iostat in F2FS.
-
- readdir_ra This enables/disabled readahead of inode blocks
- in readdir, and default is enabled.
-
- gc_pin_file_thresh This indicates how many GC can be failed for the
- pinned file. If it exceeds this, F2FS doesn't
- guarantee its pinning state. 2048 trials is set
- by default.
-
- extension_list This enables to change extension_list for hot/cold
- files in runtime.
-
- inject_rate This controls injection rate of arbitrary faults.
-
- inject_type This controls injection type of arbitrary faults.
-
- dirty_segments This shows # of dirty segments.
-
- lifetime_write_kbytes This shows # of data written to the disk.
-
- features This shows current features enabled on F2FS.
-
- current_reserved_blocks This shows # of blocks currently reserved.
-
- unusable If checkpoint=disable, this shows the number of
- blocks that are unusable.
- If checkpoint=enable it shows the number of blocks
- that would be unusable if checkpoint=disable were
- to be set.
-
-encoding This shows the encoding used for casefolding.
- If casefolding is not enabled, returns (none)
================================================================================
USAGE
@@ -840,3 +687,44 @@ zero or random data, which is useful to the below scenario where:
4. address = fibmap(fd, offset)
5. open(blkdev)
6. write(blkdev, address)
+
+Compression implementation
+--------------------------
+
+- A new term, cluster, is defined as the basic unit of compression. A file can
+be divided into multiple clusters logically. One cluster includes 4 << n
+(n >= 0) logical pages. The compression size is also the cluster size, and
+each cluster can be compressed or not.
+
+- In the cluster metadata layout, one special block address indicates whether
+a cluster is a compressed one or a normal one. For a compressed cluster, the
+following metadata maps the cluster to [1, (4 << n) - 1] physical blocks, in
+which f2fs stores data including the compress header and compressed data.
+
+- In order to eliminate write amplification during overwrite, F2FS only
+supports compression on write-once files. Data can be compressed only when
+all logical blocks in the file are valid and the cluster compress ratio is
+lower than the specified threshold.
+
+- To enable compression on regular inode, there are three ways:
+* chattr +c file
+* chattr +c dir; touch dir/file
+* mount w/ -o compress_extension=ext; touch file.ext
+
+Compress metadata layout:
+ [Dnode Structure]
+ +-----------------------------------------------+
+ | cluster 1 | cluster 2 | ......... | cluster N |
+ +-----------------------------------------------+
+ . . . .
+ . . . .
+ . Compressed Cluster . . Normal Cluster .
++----------+---------+---------+---------+ +---------+---------+---------+---------+
+|compr flag| block 1 | block 2 | block 3 | | block 1 | block 2 | block 3 | block 4 |
++----------+---------+---------+---------+ +---------+---------+---------+---------+
+ . .
+ . .
+ . .
+ +-------------+-------------+----------+----------------------------+
+ | data length | data chksum | reserved | compressed data |
+ +-------------+-------------+----------+----------------------------+
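Note on the f2fs.txt hunks above: the cluster arithmetic they describe can be sketched in a few lines of C. This is only an illustration under stated assumptions. The helper names are hypothetical and are not f2fs code, the 4KB block size is taken from the option text, and the upper bound checked in the assertion exists only to keep the shift well defined (the text above does not state a maximum).

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE_BYTES 4096u   /* the "4KB" from the compress_log_size text */
#define MIN_LOG_SIZE     2u      /* 4KB * (1 << 2) = 16KB, the stated minimum */

/* Hypothetical helper: pages per cluster for a compress_log_size value. */
static inline uint32_t cluster_pages(uint32_t log_size)
{
	/* The "< 32" bound is an assumption that only guards the shift. */
	assert(log_size >= MIN_LOG_SIZE && log_size < 32);
	return 1u << log_size;
}

/* Hypothetical helper: cluster size in bytes, 4KB * (1 << log_size). */
static inline uint64_t cluster_bytes(uint32_t log_size)
{
	return (uint64_t)BLOCK_SIZE_BYTES * cluster_pages(log_size);
}

/*
 * Hypothetical helper: the largest number of physical blocks a compressed
 * cluster maps to. With n = log_size - 2 this is (4 << n) - 1, i.e. one
 * block fewer than the cluster's logical pages.
 */
static inline uint32_t max_compressed_blocks(uint32_t log_size)
{
	return cluster_pages(log_size) - 1;
}
```

With compress_log_size=2 this gives a 4-page, 16KB cluster that can occupy at most 3 physical blocks when compressed, which matches the "Compressed Cluster" row of the layout diagram (one compr flag slot plus three blocks).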
diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst
index 68c2bc8275cf..bd9932344804 100644
--- a/Documentation/filesystems/fscrypt.rst
+++ b/Documentation/filesystems/fscrypt.rst
@@ -234,8 +234,8 @@ HKDF is more flexible, is nonreversible, and evenly distributes
entropy from the master key. HKDF is also standardized and widely
used by other software, whereas the AES-128-ECB based KDF is ad-hoc.
-Per-file keys
--------------
+Per-file encryption keys
+------------------------
Since each master key can protect many files, it is necessary to
"tweak" the encryption of each file so that the same plaintext in two
@@ -268,9 +268,9 @@ is greater than that of an AES-256-XTS key.
Therefore, to improve performance and save memory, for Adiantum a
"direct key" configuration is supported. When the user has enabled
this by setting FSCRYPT_POLICY_FLAG_DIRECT_KEY in the fscrypt policy,
-per-file keys are not used. Instead, whenever any data (contents or
-filenames) is encrypted, the file's 16-byte nonce is included in the
-IV. Moreover:
+per-file encryption keys are not used. Instead, whenever any data
+(contents or filenames) is encrypted, the file's 16-byte nonce is
+included in the IV. Moreover:
- For v1 encryption policies, the encryption is done directly with the
master key. Because of this, users **must not** use the same master
@@ -302,6 +302,16 @@ For master keys used for v2 encryption policies, a unique 16-byte "key
identifier" is also derived using the KDF. This value is stored in
the clear, since it is needed to reliably identify the key itself.
+Dirhash keys
+------------
+
+For directories that are indexed using a secret-keyed dirhash over the
+plaintext filenames, the KDF is also used to derive a 128-bit
+SipHash-2-4 key per directory in order to hash filenames. This works
+just like deriving a per-file encryption key, except that a different
+KDF context is used. Currently, only casefolded ("case-insensitive")
+encrypted directories use this style of hashing.
+
Encryption modes and usage
==========================
@@ -325,11 +335,11 @@ used.
Adiantum is a (primarily) stream cipher-based mode that is fast even
on CPUs without dedicated crypto instructions. It's also a true
wide-block mode, unlike XTS. It can also eliminate the need to derive
-per-file keys. However, it depends on the security of two primitives,
-XChaCha12 and AES-256, rather than just one. See the paper
-"Adiantum: length-preserving encryption for entry-level processors"
-(https://eprint.iacr.org/2018/720.pdf) for more details. To use
-Adiantum, CONFIG_CRYPTO_ADIANTUM must be enabled. Also, fast
+per-file encryption keys. However, it depends on the security of two
+primitives, XChaCha12 and AES-256, rather than just one. See the
+paper "Adiantum: length-preserving encryption for entry-level
+processors" (https://eprint.iacr.org/2018/720.pdf) for more details.
+To use Adiantum, CONFIG_CRYPTO_ADIANTUM must be enabled. Also, fast
implementations of ChaCha and NHPoly1305 should be enabled, e.g.
CONFIG_CRYPTO_CHACHA20_NEON and CONFIG_CRYPTO_NHPOLY1305_NEON for ARM.
@@ -513,7 +523,9 @@ FS_IOC_SET_ENCRYPTION_POLICY can fail with the following errors:
- ``EEXIST``: the file is already encrypted with an encryption policy
different from the one specified
- ``EINVAL``: an invalid encryption policy was specified (invalid
- version, mode(s), or flags; or reserved bits were set)
+ version, mode(s), or flags; or reserved bits were set); or a v1
+ encryption policy was specified but the directory has the casefold
+ flag enabled (casefolding is incompatible with v1 policies).
- ``ENOKEY``: a v2 encryption policy was specified, but the key with
the specified ``master_key_identifier`` has not been added, nor does
the process have the CAP_FOWNER capability in the initial user
@@ -638,7 +650,8 @@ follows::
struct fscrypt_add_key_arg {
struct fscrypt_key_specifier key_spec;
__u32 raw_size;
- __u32 __reserved[9];
+ __u32 key_id;
+ __u32 __reserved[8];
__u8 raw[];
};
@@ -655,6 +668,12 @@ follows::
} u;
};
+ struct fscrypt_provisioning_key_payload {
+ __u32 type;
+ __u32 __reserved;
+ __u8 raw[];
+ };
+
:c:type:`struct fscrypt_add_key_arg` must be zeroed, then initialized
as follows:
@@ -677,9 +696,26 @@ as follows:
``Documentation/security/keys/core.rst``).
- ``raw_size`` must be the size of the ``raw`` key provided, in bytes.
+ Alternatively, if ``key_id`` is nonzero, this field must be 0, since
+ in that case the size is implied by the specified Linux keyring key.
+
+- ``key_id`` is 0 if the raw key is given directly in the ``raw``
+ field. Otherwise ``key_id`` is the ID of a Linux keyring key of
+ type "fscrypt-provisioning" whose payload is a :c:type:`struct
+ fscrypt_provisioning_key_payload` whose ``raw`` field contains the
+ raw key and whose ``type`` field matches ``key_spec.type``. Since
+ ``raw`` is variable-length, the total size of this key's payload
+ must be ``sizeof(struct fscrypt_provisioning_key_payload)`` plus the
+ raw key size. The process must have Search permission on this key.
+
+ Most users should leave this 0 and specify the raw key directly.
+ The support for specifying a Linux keyring key is intended mainly to
+ allow re-adding keys after a filesystem is unmounted and re-mounted,
+ without having to store the raw keys in userspace memory.
- ``raw`` is a variable-length field which must contain the actual
- key, ``raw_size`` bytes long.
+ key, ``raw_size`` bytes long. Alternatively, if ``key_id`` is
+ nonzero, then this field is unused.
For v2 policy keys, the kernel keeps track of which user (identified
by effective user ID) added the key, and only allows the key to be
@@ -701,11 +737,16 @@ FS_IOC_ADD_ENCRYPTION_KEY can fail with the following errors:
- ``EACCES``: FSCRYPT_KEY_SPEC_TYPE_DESCRIPTOR was specified, but the
caller does not have the CAP_SYS_ADMIN capability in the initial
- user namespace
+ user namespace; or the raw key was specified by Linux key ID but the
+ process lacks Search permission on the key.
- ``EDQUOT``: the key quota for this user would be exceeded by adding
the key
- ``EINVAL``: invalid key size or key specifier type, or reserved bits
were set
+- ``EKEYREJECTED``: the raw key was specified by Linux key ID, but the
+ key has the wrong type
+- ``ENOKEY``: the raw key was specified by Linux key ID, but no key
+ exists with that ID
- ``ENOTTY``: this type of filesystem does not implement encryption
- ``EOPNOTSUPP``: the kernel was not configured with encryption
support for this filesystem, or the filesystem superblock has not
@@ -975,9 +1016,9 @@ astute users may notice some differences in behavior:
- Direct I/O is not supported on encrypted files. Attempts to use
direct I/O on such files will fall back to buffered I/O.
-- The fallocate operations FALLOC_FL_COLLAPSE_RANGE,
- FALLOC_FL_INSERT_RANGE, and FALLOC_FL_ZERO_RANGE are not supported
- on encrypted files and will fail with EOPNOTSUPP.
+- The fallocate operations FALLOC_FL_COLLAPSE_RANGE and
+ FALLOC_FL_INSERT_RANGE are not supported on encrypted files and will
+ fail with EOPNOTSUPP.
- Online defragmentation of encrypted files is not supported. The
EXT4_IOC_MOVE_EXT and F2FS_IOC_MOVE_RANGE ioctls will fail with
@@ -1108,8 +1149,8 @@ The context structs contain the same information as the corresponding
policy structs (see `Setting an encryption policy`_), except that the
context structs also contain a nonce. The nonce is randomly generated
by the kernel and is used as KDF input or as a tweak to cause
-different files to be encrypted differently; see `Per-file keys`_ and
-`DIRECT_KEY policies`_.
+different files to be encrypted differently; see `Per-file encryption
+keys`_ and `DIRECT_KEY policies`_.
Data path changes
-----------------
@@ -1161,7 +1202,7 @@ filesystem-specific hash(es) needed for directory lookups. This
allows the filesystem to still, with a high degree of confidence, map
the filename given in ->lookup() back to a particular directory entry
that was previously listed by readdir(). See :c:type:`struct
-fscrypt_digested_name` in the source for more details.
+fscrypt_nokey_name` in the source for more details.
Note that the precise way that filenames are presented to userspace
without the key is subject to change in the future. It is only meant
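Note on the fscrypt.rst hunks above: the payload sizing rule for "fscrypt-provisioning" keys (struct size plus raw key size) can be sketched as follows. This is a minimal illustration, not kernel or library code. The struct is a local mirror of the one shown in the diff so that the sketch builds without kernel headers, its second field is named `reserved` here instead of `__reserved`, and both helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local mirror of struct fscrypt_provisioning_key_payload from the diff. */
struct provisioning_key_payload {
	uint32_t type;      /* must match key_spec.type */
	uint32_t reserved;  /* __reserved in the kernel struct; left zero */
	uint8_t raw[];      /* variable-length raw key */
};

/* Hypothetical helper: total payload size for a raw key of raw_size bytes. */
static inline size_t payload_size(size_t raw_size)
{
	return sizeof(struct provisioning_key_payload) + raw_size;
}

/*
 * Hypothetical helper: allocate a zeroed payload and copy the raw key in.
 * The caller owns the result and must free it. Returns NULL on failure.
 */
static inline struct provisioning_key_payload *
make_payload(uint32_t type, const uint8_t *raw, size_t raw_size)
{
	struct provisioning_key_payload *p;

	assert(raw != NULL);
	p = calloc(1, payload_size(raw_size));
	if (!p)
		return NULL;
	p->type = type;
	memcpy(p->raw, raw, raw_size);
	return p;
}
```

A buffer built this way would be what a caller hands to add_key(2) with key type "fscrypt-provisioning". The resulting key ID then goes into `key_id`, with `raw_size` left at 0 as the text above requires.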
diff --git a/Documentation/filesystems/fuse.txt b/Documentation/filesystems/fuse.rst
index 13af4a49e7db..8e455065ce9e 100644
--- a/Documentation/filesystems/fuse.txt
+++ b/Documentation/filesystems/fuse.rst
@@ -1,41 +1,40 @@
+.. SPDX-License-Identifier: GPL-2.0
+==============
+FUSE
+==============
+
Definitions
-~~~~~~~~~~~
+===========
Userspace filesystem:
-
A filesystem in which data and metadata are provided by an ordinary
userspace process. The filesystem can be accessed normally through
the kernel interface.
Filesystem daemon:
-
The process(es) providing the data and metadata of the filesystem.
Non-privileged mount (or user mount):
-
A userspace filesystem mounted by a non-privileged (non-root) user.
The filesystem daemon is running with the privileges of the mounting
user. NOTE: this is not the same as mounts allowed with the "user"
option in /etc/fstab, which is not discussed here.
Filesystem connection:
-
A connection between the filesystem daemon and the kernel. The
connection exists until either the daemon dies, or the filesystem is
umounted. Note that detaching (or lazy umounting) the filesystem
- does _not_ break the connection, in this case it will exist until
+ does *not* break the connection, in this case it will exist until
the last reference to the filesystem is released.
Mount owner:
-
The user who does the mounting.
User:
-
The user who is performing filesystem operations.
What is FUSE?
-~~~~~~~~~~~~~
+=============
FUSE is a userspace filesystem framework. It consists of a kernel
module (fuse.ko), a userspace library (libfuse.*) and a mount utility
@@ -46,50 +45,41 @@ non-privileged mounts. This opens up new possibilities for the use of
filesystems. A good example is sshfs: a secure network filesystem
using the sftp protocol.
-The userspace library and utilities are available from the FUSE
-homepage:
-
- http://fuse.sourceforge.net/
+The userspace library and utilities are available from the
+`FUSE homepage <http://fuse.sourceforge.net/>`_.
Filesystem type
-~~~~~~~~~~~~~~~
+===============
The filesystem type given to mount(2) can be one of the following:
-'fuse'
-
- This is the usual way to mount a FUSE filesystem. The first
- argument of the mount system call may contain an arbitrary string,
- which is not interpreted by the kernel.
+ fuse
+ This is the usual way to mount a FUSE filesystem. The first
+ argument of the mount system call may contain an arbitrary string,
+ which is not interpreted by the kernel.
-'fuseblk'
-
- The filesystem is block device based. The first argument of the
- mount system call is interpreted as the name of the device.
+ fuseblk
+ The filesystem is block device based. The first argument of the
+ mount system call is interpreted as the name of the device.
Mount options
-~~~~~~~~~~~~~
-
-'fd=N'
+=============
+fd=N
The file descriptor to use for communication between the userspace
filesystem and the kernel. The file descriptor must have been
obtained by opening the FUSE device ('/dev/fuse').
-'rootmode=M'
-
+rootmode=M
The file mode of the filesystem's root in octal representation.
-'user_id=N'
-
+user_id=N
The numeric user id of the mount owner.
-'group_id=N'
-
+group_id=N
The numeric group id of the mount owner.
-'default_permissions'
-
+default_permissions
By default FUSE doesn't check file access permissions, the
filesystem is free to implement its access policy or leave it to
the underlying file access mechanism (e.g. in case of network
@@ -97,28 +87,25 @@ Mount options
access based on file mode. It is usually useful together with the
'allow_other' mount option.
-'allow_other'
-
+allow_other
This option overrides the security measure restricting file access
to the user mounting the filesystem. This option is by default only
allowed to root, but this restriction can be removed with a
(userspace) configuration option.
-'max_read=N'
-
+max_read=N
With this option the maximum size of read operations can be set.
The default is infinite. Note that the size of read requests is
limited anyway to 32 pages (which is 128kbyte on i386).
-'blksize=N'
-
+blksize=N
Set the block size for the filesystem. The default is 512. This
option is only valid for 'fuseblk' type mounts.
Control filesystem
-~~~~~~~~~~~~~~~~~~
+==================
-There's a control filesystem for FUSE, which can be mounted by:
+There's a control filesystem for FUSE, which can be mounted by::
mount -t fusectl none /sys/fs/fuse/connections
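Note on the mount options documented in the fuse.rst hunks above: a mount helper has to join them into the comma-separated data string given to mount(2). The sketch below shows one way to do that, assuming the four options fd, rootmode, user_id and group_id are always present. The function name and the `extra` parameter are hypothetical. The only detail taken from the text is that rootmode is written in octal.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format "fd=N,rootmode=M,user_id=N,group_id=N" into
 * buf, optionally followed by further options such as "allow_other".
 * Returns 0 on success and -1 if buf is too small or formatting fails.
 */
static inline int fuse_mount_opts(char *buf, size_t len, int fd,
				  unsigned int rootmode, unsigned int uid,
				  unsigned int gid, const char *extra)
{
	int n, m;

	assert(buf != NULL);
	/* %o prints rootmode in octal, as the option description requires. */
	n = snprintf(buf, len, "fd=%d,rootmode=%o,user_id=%u,group_id=%u",
		     fd, rootmode, uid, gid);
	if (n < 0 || (size_t)n >= len)
		return -1;

	if (extra && strlen(extra) > 0) {
		m = snprintf(buf + n, len - (size_t)n, ",%s", extra);
		if (m < 0 || (size_t)m >= len - (size_t)n)
			return -1;
	}
	return 0;
}
```

For a directory root with mode 0755 (040755 including the file type bits) opened on descriptor 3 by uid and gid 1000, the result is `fd=3,rootmode=40755,user_id=1000,group_id=1000`.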
@@ -130,53 +117,51 @@ named by a unique number.
For each connection the following files exist within this directory:
- 'waiting'
-
- The number of requests which are waiting to be transferred to
- userspace or being processed by the filesystem daemon. If there is
- no filesystem activity and 'waiting' is non-zero, then the
- filesystem is hung or deadlocked.
-
- 'abort'
+ waiting
+ The number of requests which are waiting to be transferred to
+ userspace or being processed by the filesystem daemon. If there is
+ no filesystem activity and 'waiting' is non-zero, then the
+ filesystem is hung or deadlocked.
- Writing anything into this file will abort the filesystem
- connection. This means that all waiting requests will be aborted an
- error returned for all aborted and new requests.
+ abort
+ Writing anything into this file will abort the filesystem
+ connection. This means that all waiting requests will be aborted and an
+ error returned for all aborted and new requests.
Only the owner of the mount may read or write these files.
Interrupting filesystem operations
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+##################################
If a process issuing a FUSE filesystem request is interrupted, the
following will happen:
- 1) If the request is not yet sent to userspace AND the signal is
+ - If the request is not yet sent to userspace AND the signal is
fatal (SIGKILL or unhandled fatal signal), then the request is
dequeued and returns immediately.
- 2) If the request is not yet sent to userspace AND the signal is not
- fatal, then an 'interrupted' flag is set for the request. When
+ - If the request is not yet sent to userspace AND the signal is not
+ fatal, then an interrupted flag is set for the request. When
the request has been successfully transferred to userspace and
this flag is set, an INTERRUPT request is queued.
- 3) If the request is already sent to userspace, then an INTERRUPT
+ - If the request is already sent to userspace, then an INTERRUPT
request is queued.
INTERRUPT requests take precedence over other requests, so the
userspace filesystem will receive queued INTERRUPTs before any others.
The userspace filesystem may ignore the INTERRUPT requests entirely,
-or may honor them by sending a reply to the _original_ request, with
+or may honor them by sending a reply to the *original* request, with
the error set to EINTR.
It is also possible that there's a race between processing the
original request and its INTERRUPT request. There are two possibilities:
- 1) The INTERRUPT request is processed before the original request is
+ 1. The INTERRUPT request is processed before the original request is
processed
- 2) The INTERRUPT request is processed after the original request has
+ 2. The INTERRUPT request is processed after the original request has
been answered
If the filesystem cannot find the original request, it should wait for
@@ -186,7 +171,7 @@ should reply to the INTERRUPT request with an EAGAIN error. In case
reply will be ignored.
Aborting a filesystem connection
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+================================
It is possible to get into certain situations where the filesystem is
not responding. Reasons for this may be:
@@ -216,7 +201,7 @@ the filesystem. There are several ways to do this:
powerful method, always works.
How do non-privileged mounts work?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+==================================
Since the mount() system call is a privileged operation, a helper
program (fusermount) is needed, which is installed setuid root.
@@ -235,15 +220,13 @@ system. Obvious requirements arising from this are:
other users' or the super user's processes
How are requirements fulfilled?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+===============================
A) The mount owner could gain elevated privileges by either:
- 1) creating a filesystem containing a device file, then opening
- this device
+ 1. creating a filesystem containing a device file, then opening this device
- 2) creating a filesystem containing a suid or sgid application,
- then executing this application
+ 2. creating a filesystem containing a suid or sgid application, then executing this application
The solution is not to allow opening device files and ignore
setuid and setgid bits when executing programs. To ensure this
@@ -275,16 +258,16 @@ How are requirements fulfilled?
of other users' processes.
i) It can slow down or indefinitely delay the execution of a
- filesystem operation creating a DoS against the user or the
- whole system. For example a suid application locking a
- system file, and then accessing a file on the mount owner's
- filesystem could be stopped, and thus causing the system
- file to be locked forever.
+ filesystem operation creating a DoS against the user or the
+ whole system. For example a suid application locking a
+ system file, and then accessing a file on the mount owner's
+ filesystem could be stopped, and thus causing the system
+ file to be locked forever.
ii) It can present files or directories of unlimited length, or
- directory structures of unlimited depth, possibly causing a
- system process to eat up diskspace, memory or other
- resources, again causing DoS.
+ directory structures of unlimited depth, possibly causing a
+ system process to eat up diskspace, memory or other
+ resources, again causing *DoS*.
The solution to this as well as B) is not to allow processes
to access the filesystem, which could otherwise not be
@@ -294,28 +277,27 @@ How are requirements fulfilled?
ptrace can be used to check if a process is allowed to access
the filesystem or not.
- Note that the ptrace check is not strictly necessary to
+ Note that the *ptrace* check is not strictly necessary to
prevent B/2/i, it is enough to check if mount owner has enough
privilege to send signal to the process accessing the
- filesystem, since SIGSTOP can be used to get a similar effect.
+ filesystem, since *SIGSTOP* can be used to get a similar effect.
I think these limitations are unacceptable?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+===========================================
If a sysadmin trusts the users enough, or can ensure through other
measures, that system processes will never enter non-privileged
-mounts, it can relax the last limitation with a "user_allow_other"
+mounts, it can relax the last limitation with a 'user_allow_other'
config option. If this config option is set, the mounting user can
-add the "allow_other" mount option which disables the check for other
+add the 'allow_other' mount option which disables the check for other
users' processes.
Kernel - userspace interface
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+============================
The following diagram shows how a filesystem operation (in this
-example unlink) is performed in FUSE.
+example unlink) is performed in FUSE. ::
-NOTE: everything in this description is greatly simplified
| "rm /mnt/fuse/file" | FUSE filesystem daemon
| |
@@ -357,12 +339,13 @@ NOTE: everything in this description is greatly simplified
| <fuse_unlink() |
| <sys_unlink() |
+.. note:: Everything in the description above is greatly simplified
+
There are a couple of ways in which to deadlock a FUSE filesystem.
Since we are talking about unprivileged userspace programs,
something must be done about these.
-Scenario 1 - Simple deadlock
------------------------------
+**Scenario 1 - Simple deadlock**::
| "rm /mnt/fuse/file" | FUSE filesystem daemon
| |
@@ -379,12 +362,12 @@ Scenario 1 - Simple deadlock
The solution for this is to allow the filesystem to be aborted.
-Scenario 2 - Tricky deadlock
-----------------------------
+**Scenario 2 - Tricky deadlock**
+
This one needs a carefully crafted filesystem. It's a variation on
the above, only the call back to the filesystem is not explicit,
-but is caused by a pagefault.
+but is caused by a pagefault. ::
| Kamikaze filesystem thread 1 | Kamikaze filesystem thread 2
| |
@@ -410,7 +393,7 @@ but is caused by a pagefault.
| | [lock page]
| | * DEADLOCK *
-Solution is basically the same as above.
+The solution is basically the same as above.
An additional problem is that while the write buffer is being copied
to the request, the request must not be interrupted/aborted. This is
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index ad6315a48d14..386eaad008b2 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -47,4 +47,7 @@ Documentation for filesystem implementations.
:maxdepth: 2
autofs
+ fuse
+ overlayfs
virtiofs
+ vfat
diff --git a/Documentation/filesystems/mount_api.txt b/Documentation/filesystems/mount_api.txt
index 00ff0cfccfa7..87c14bbb2b35 100644
--- a/Documentation/filesystems/mount_api.txt
+++ b/Documentation/filesystems/mount_api.txt
@@ -427,7 +427,6 @@ returned.
fs_value_is_string, Value is a string
fs_value_is_blob, Value is a binary blob
fs_value_is_filename, Value is a filename* + dirfd
- fs_value_is_filename_empty, Value is a filename* + dirfd + AT_EMPTY_PATH
fs_value_is_file, Value is an open file (file*)
If there is a value, that value is stored in a union in the struct in one
@@ -519,7 +518,6 @@ Parameters are described using structures defined in linux/fs_parser.h.
There's a core description struct that links everything together:
struct fs_parameter_description {
- const char name[16];
const struct fs_parameter_spec *specs;
const struct fs_parameter_enum *enums;
};
@@ -535,19 +533,13 @@ For example:
};
static const struct fs_parameter_description afs_fs_parameters = {
- .name = "kAFS",
.specs = afs_param_specs,
.enums = afs_param_enums,
};
The members are as follows:
- (1) const char name[16];
-
- The name to be used in error messages generated by the parse helper
- functions.
-
- (2) const struct fs_parameter_specification *specs;
+ (1) const struct fs_parameter_specification *specs;
Table of parameter specifications, terminated with a null entry, where the
entries are of type:
@@ -626,7 +618,7 @@ The members are as follows:
of arguments to specify the type and the flags for anything that doesn't
match one of the above macros.
- (6) const struct fs_parameter_enum *enums;
+ (2) const struct fs_parameter_enum *enums;
Table of enum value names to integer mappings, terminated with a null
entry. This is of type:
diff --git a/Documentation/filesystems/nfs/fault_injection.txt b/Documentation/filesystems/nfs/fault_injection.txt
deleted file mode 100644
index f3a5b0a8ac05..000000000000
--- a/Documentation/filesystems/nfs/fault_injection.txt
+++ /dev/null
@@ -1,69 +0,0 @@
-
-Fault Injection
-===============
-Fault injection is a method for forcing errors that may not normally occur, or
-may be difficult to reproduce. Forcing these errors in a controlled environment
-can help the developer find and fix bugs before their code is shipped in a
-production system. Injecting an error on the Linux NFS server will allow us to
-observe how the client reacts and if it manages to recover its state correctly.
-
-NFSD_FAULT_INJECTION must be selected when configuring the kernel to use this
-feature.
-
-
-Using Fault Injection
-=====================
-On the client, mount the fault injection server through NFS v4.0+ and do some
-work over NFS (open files, take locks, ...).
-
-On the server, mount the debugfs filesystem to <debug_dir> and ls
-<debug_dir>/nfsd. This will show a list of files that will be used for
-injecting faults on the NFS server. As root, write a number n to the file
-corresponding to the action you want the server to take. The server will then
-process the first n items it finds. So if you want to forget 5 locks, echo '5'
-to <debug_dir>/nfsd/forget_locks. A value of 0 will tell the server to forget
-all corresponding items. A log message will be created containing the number
-of items forgotten (check dmesg).
-
-Go back to work on the client and check if the client recovered from the error
-correctly.
-
-
-Available Faults
-================
-forget_clients:
- The NFS server keeps a list of clients that have placed a mount call. If
- this list is cleared, the server will have no knowledge of who the client
- is, forcing the client to reauthenticate with the server.
-
-forget_openowners:
- The NFS server keeps a list of what files are currently opened and who
- they were opened by. Clearing this list will force the client to reopen
- its files.
-
-forget_locks:
- The NFS server keeps a list of what files are currently locked in the VFS.
- Clearing this list will force the client to reclaim its locks (files are
- unlocked through the VFS as they are cleared from this list).
-
-forget_delegations:
- A delegation is used to assure the client that a file, or part of a file,
- has not changed since the delegation was awarded. Clearing this list will
- force the client to reacquire its delegation before accessing the file
- again.
-
-recall_delegations:
- Delegations can be recalled by the server when another client attempts to
- access a file. This test will notify the client that its delegation has
- been revoked, forcing the client to reacquire the delegation before using
- the file again.
-
-
-tools/nfs/inject_faults.sh script
-=================================
-This script has been created to ease the fault injection process. This script
-will detect the mounted debugfs directory and write to the files located there
-based on the arguments passed by the user. For example, running
-`inject_faults.sh forget_locks 1` as root will instruct the server to forget
-one lock. Running `inject_faults.sh forget_locks` will instruct the server to
-forget all locks.
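The write-a-count procedure described in the removed text can be sketched as follows. This is only an illustration, not the removed `inject_faults.sh` script: the function name `inject_fault` and its arguments are hypothetical, and the debugfs mount point must be supplied by the caller. The fault names are the five listed in the document.

```python
from pathlib import Path

# Fault files named in the removed document (under <debug_dir>/nfsd).
KNOWN_FAULTS = {
    "forget_clients",
    "forget_openowners",
    "forget_locks",
    "forget_delegations",
    "recall_delegations",
}


def inject_fault(debug_dir, fault, count=0):
    """Write `count` to <debug_dir>/nfsd/<fault>.

    Per the document, the server processes the first `count` items it
    finds, and a count of 0 means "all corresponding items".
    Hypothetical helper: name and signature are assumptions.
    """
    if fault not in KNOWN_FAULTS:
        raise ValueError(f"unknown fault: {fault}")
    if count < 0:
        raise ValueError("count must be >= 0")
    target = Path(debug_dir) / "nfsd" / fault
    # Equivalent to: echo '<count>' > <debug_dir>/nfsd/<fault>
    target.write_text(f"{count}\n")
    return target
```

On a real server this would need root and a kernel built with NFSD_FAULT_INJECTION; the number of items forgotten is reported in the kernel log.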
diff --git a/Documentation/filesystems/nfs/idmapper.txt b/Documentation/filesystems/nfs/idmapper.txt
deleted file mode 100644
index b86831acd583..000000000000
--- a/Documentation/filesystems/nfs/idmapper.txt
+++ /dev/null
@@ -1,75 +0,0 @@
-
-=========
-ID Mapper
-=========
-Id mapper is used by NFS to translate user and group ids into names, and to
-translate user and group names into ids. Part of this translation involves
-performing an upcall to userspace to request the information. There are two
-ways NFS could obtain this information: placing a call to /sbin/request-key
-or by placing a call to the rpc.idmap daemon.
-
-NFS will attempt to call /sbin/request-key first. If this succeeds, the
-result will be cached using the generic request-key cache. This call should
-only fail if /etc/request-key.conf is not configured for the id_resolver key
-type, see the "Configuring" section below if you wish to use the request-key
-method.
-
-If the call to /sbin/request-key fails (if /etc/request-key.conf is not
-configured with the id_resolver key type), then the idmapper will ask the
-legacy rpc.idmap daemon for the id mapping. This result will be stored
-in a custom NFS idmap cache.
-
-
-===========
-Configuring
-===========
-The file /etc/request-key.conf will need to be modified so /sbin/request-key can
-direct the upcall. The following line should be added:
-
-#OP TYPE DESCRIPTION CALLOUT INFO PROGRAM ARG1 ARG2 ARG3 ...
-#====== ======= =============== =============== ===============================
-create id_resolver * * /usr/sbin/nfs.idmap %k %d 600
-
-This will direct all id_resolver requests to the program /usr/sbin/nfs.idmap.
-The last parameter, 600, defines how many seconds into the future the key will
-expire. This parameter is optional for /usr/sbin/nfs.idmap. When the timeout
-is not specified, nfs.idmap will default to 600 seconds.
-
-id mapper uses four key descriptions:
- uid: Find the UID for the given user
- gid: Find the GID for the given group
- user: Find the user name for the given UID
- group: Find the group name for the given GID
-
-You can handle any of these individually, rather than using the generic upcall
-program. If you would like to use your own program for a uid lookup then you
-would edit your request-key.conf so it looks similar to this:
-
-#OP TYPE DESCRIPTION CALLOUT INFO PROGRAM ARG1 ARG2 ARG3 ...
-#====== ======= =============== =============== ===============================
-create id_resolver uid:* * /some/other/program %k %d 600
-create id_resolver * * /usr/sbin/nfs.idmap %k %d 600
-
-Notice that the new line was added above the line for the generic program.
-request-key will find the first matching line and corresponding program. In
-this case, /some/other/program will handle all uid lookups and
-/usr/sbin/nfs.idmap will handle gid, user, and group lookups.
-
-See <file:Documentation/security/keys/request-key.rst> for more information
-about the request-key function.
-
-
-=========
-nfs.idmap
-=========
-nfs.idmap is designed to be called by request-key, and should not be run "by
-hand". This program takes two arguments, a serialized key and a key
-description. The serialized key is first converted into a key_serial_t, and
-then passed as an argument to keyctl_instantiate (both are part of keyutils.h).
-
-The actual lookups are performed by functions found in nfsidmap.h. nfs.idmap
-determines the correct function to call by looking at the first part of the
-description string. For example, a uid lookup description will appear as
-"uid:user@domain".
-
-nfs.idmap will return 0 if the key was instantiated, and non-zero otherwise.
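The first-match rule for request-key.conf described above can be illustrated with a small sketch. It is an approximation under stated assumptions: the function `find_program` is hypothetical, and shell-style matching via `fnmatchcase` stands in for whatever pattern matching request-key really performs. The two configuration lines are the ones from the document.

```python
from fnmatch import fnmatchcase

# The two example lines from the document, most specific first.
CONF = """\
#OP TYPE DESCRIPTION CALLOUT INFO PROGRAM ARG1 ARG2 ARG3 ...
create id_resolver uid:* * /some/other/program %k %d 600
create id_resolver * * /usr/sbin/nfs.idmap %k %d 600
"""


def find_program(conf_text, op, key_type, description, callout=""):
    """Return the program of the first matching line, or None.

    Hypothetical helper; glob matching is an assumption made for
    illustration only.
    """
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 5:
            continue  # not a complete rule
        l_op, l_type, l_desc, l_callout, program = fields[:5]
        if (l_op == op and l_type == key_type
                and fnmatchcase(description, l_desc)
                and fnmatchcase(callout, l_callout)):
            return program  # first match wins, so line order matters
    return None
```

With this ordering a `uid:user@domain` description resolves to /some/other/program, while gid, user and group lookups fall through to /usr/sbin/nfs.idmap. Swapping the two lines would send every lookup to the generic program.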
diff --git a/Documentation/filesystems/nfs/nfs-rdma.txt b/Documentation/filesystems/nfs/nfs-rdma.txt
deleted file mode 100644
index 22dc0dd6889c..000000000000
--- a/Documentation/filesystems/nfs/nfs-rdma.txt
+++ /dev/null
@@ -1,274 +0,0 @@
-################################################################################
-# #
-# NFS/RDMA README #
-# #
-################################################################################
-
- Author: NetApp and Open Grid Computing
- Date: May 29, 2008
-
-Table of Contents
-~~~~~~~~~~~~~~~~~
- - Overview
- - Getting Help
- - Installation
- - Check RDMA and NFS Setup
- - NFS/RDMA Setup
-
-Overview
-~~~~~~~~
-
- This document describes how to install and setup the Linux NFS/RDMA client
- and server software.
-
- The NFS/RDMA client was first included in Linux 2.6.24. The NFS/RDMA server
- was first included in the following release, Linux 2.6.25.
-
- In our testing, we have obtained excellent performance results (full 10Gbit
- wire bandwidth at minimal client CPU) under many workloads. The code passes
- the full Connectathon test suite and operates over both Infiniband and iWARP
- RDMA adapters.
-
-Getting Help
-~~~~~~~~~~~~
-
- If you get stuck, you can ask questions on the
-
- nfs-rdma-devel@lists.sourceforge.net
-
- mailing list.
-
-Installation
-~~~~~~~~~~~~
-
- These instructions are a step by step guide to building a machine for
- use with NFS/RDMA.
-
- - Install an RDMA device
-
- Any device supported by the drivers in drivers/infiniband/hw is acceptable.
-
- Testing has been performed using several Mellanox-based IB cards, the
- Ammasso AMS1100 iWARP adapter, and the Chelsio cxgb3 iWARP adapter.
-
- - Install a Linux distribution and tools
-
- The first kernel release to contain both the NFS/RDMA client and server was
- Linux 2.6.25. Therefore, a distribution compatible with this and subsequent
- Linux kernel releases should be installed.
-
- The procedures described in this document have been tested with
- distributions from Red Hat's Fedora Project (http://fedora.redhat.com/).
-
- - Install nfs-utils-1.1.2 or greater on the client
-
- An NFS/RDMA mount point can be obtained by using the mount.nfs command in
- nfs-utils-1.1.2 or greater (nfs-utils-1.1.1 was the first nfs-utils
- version with support for NFS/RDMA mounts, but for various reasons we
- recommend using nfs-utils-1.1.2 or greater). To see which version of
- mount.nfs you are using, type:
-
- $ /sbin/mount.nfs -V
-
- If the version is less than 1.1.2 or the command does not exist,
- you should install the latest version of nfs-utils.
-
- Download the latest package from:
-
- http://www.kernel.org/pub/linux/utils/nfs
-
- Uncompress the package and follow the installation instructions.
-
- If you will not need the idmapper and gssd executables (you do not need
- these to create an NFS/RDMA enabled mount command), the installation
- process can be simplified by disabling these features when running
- configure:
-
- $ ./configure --disable-gss --disable-nfsv4
-
- To build nfs-utils you will need the tcp_wrappers package installed. For
- more information on this see the package's README and INSTALL files.
-
- After building the nfs-utils package, there will be a mount.nfs binary in
- the utils/mount directory. This binary can be used to initiate NFS v2, v3,
- or v4 mounts. To initiate a v4 mount, the binary must be called
- mount.nfs4. The standard technique is to create a symlink called
- mount.nfs4 to mount.nfs.
-
- This mount.nfs binary should be installed at /sbin/mount.nfs as follows:
-
- $ sudo cp utils/mount/mount.nfs /sbin/mount.nfs
-
- In this location, mount.nfs will be invoked automatically for NFS mounts
- by the system mount command.
-
- NOTE: mount.nfs and therefore nfs-utils-1.1.2 or greater is only needed
- on the NFS client machine. You do not need this specific version of
- nfs-utils on the server. Furthermore, only the mount.nfs command from
- nfs-utils-1.1.2 is needed on the client.
-
- - Install a Linux kernel with NFS/RDMA
-
- The NFS/RDMA client and server are both included in the mainline Linux
- kernel version 2.6.25 and later. This and other versions of the Linux
- kernel can be found at:
-
- https://www.kernel.org/pub/linux/kernel/
-
- Download the sources and place them in an appropriate location.
-
- - Configure the RDMA stack
-
- Make sure your kernel configuration has RDMA support enabled. Under
- Device Drivers -> InfiniBand support, update the kernel configuration
- to enable InfiniBand support [NOTE: the option name is misleading. Enabling
- InfiniBand support is required for all RDMA devices (IB, iWARP, etc.)].
-
- Enable the appropriate IB HCA support (mlx4, mthca, ehca, ipath, etc.) or
- iWARP adapter support (amso, cxgb3, etc.).
-
- If you are using InfiniBand, be sure to enable IP-over-InfiniBand support.
-
- - Configure the NFS client and server
-
- Your kernel configuration must also have NFS file system support and/or
- NFS server support enabled. These and other NFS related configuration
- options can be found under File Systems -> Network File Systems.
-
- - Build, install, reboot
-
- The NFS/RDMA code will be enabled automatically if NFS and RDMA
- are turned on. The NFS/RDMA client and server are configured via the hidden
- SUNRPC_XPRT_RDMA config option that depends on SUNRPC and INFINIBAND. The
- value of SUNRPC_XPRT_RDMA will be:
-
- - N if either SUNRPC or INFINIBAND are N, in this case the NFS/RDMA client
- and server will not be built
- - M if both SUNRPC and INFINIBAND are on (M or Y) and at least one is M,
- in this case the NFS/RDMA client and server will be built as modules
- - Y if both SUNRPC and INFINIBAND are Y, in this case the NFS/RDMA client
- and server will be built into the kernel
-
- Therefore, if you have followed the steps above and turned on NFS and RDMA,
- the NFS/RDMA client and server will be built.
-
- Build a new kernel, install it, boot it.
-
-Check RDMA and NFS Setup
-~~~~~~~~~~~~~~~~~~~~~~~~
-
- Before configuring the NFS/RDMA software, it is a good idea to test
- your new kernel to ensure that the kernel is working correctly.
- In particular, it is a good idea to verify that the RDMA stack
- is functioning as expected and standard NFS over TCP/IP and/or UDP/IP
- is working properly.
-
- - Check RDMA Setup
-
- If you built the RDMA components as modules, load them at
- this time. For example, if you are using a Mellanox Tavor/Sinai/Arbel
- card:
-
- $ modprobe ib_mthca
- $ modprobe ib_ipoib
-
- If you are using InfiniBand, make sure there is a Subnet Manager (SM)
- running on the network. If your IB switch has an embedded SM, you can
- use it. Otherwise, you will need to run an SM, such as OpenSM, on one
- of your end nodes.
-
- If an SM is running on your network, you should see the following:
-
- $ cat /sys/class/infiniband/driverX/ports/1/state
- 4: ACTIVE
-
- where driverX is mthca0, ipath5, ehca3, etc.
-
- To further test the InfiniBand software stack, use IPoIB (this
- assumes you have two IB hosts named host1 and host2):
-
- host1$ ip link set dev ib0 up
- host1$ ip address add dev ib0 a.b.c.x
- host2$ ip link set dev ib0 up
- host2$ ip address add dev ib0 a.b.c.y
- host1$ ping a.b.c.y
- host2$ ping a.b.c.x
-
- For other device types, follow the appropriate procedures.
-
- - Check NFS Setup
-
- For the NFS components enabled above (client and/or server),
- test their functionality over standard Ethernet using TCP/IP or UDP/IP.
-
-NFS/RDMA Setup
-~~~~~~~~~~~~~~
-
- We recommend that you use two machines, one to act as the client and
- one to act as the server.
-
- One time configuration:
-
- - On the server system, configure the /etc/exports file and
- start the NFS/RDMA server.
-
- Exports entries with the following formats have been tested:
-
- /vol0 192.168.0.47(fsid=0,rw,async,insecure,no_root_squash)
- /vol0 192.168.0.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash)
-
- The IP address(es) is(are) the client's IPoIB address for an InfiniBand
- HCA or the client's iWARP address(es) for an RNIC.
-
- NOTE: The "insecure" option must be used because the NFS/RDMA client does
- not use a reserved port.
-
- Each time a machine boots:
-
- - Load and configure the RDMA drivers
-
- For InfiniBand using a Mellanox adapter:
-
- $ modprobe ib_mthca
- $ modprobe ib_ipoib
- $ ip li set dev ib0 up
- $ ip addr add dev ib0 a.b.c.d
-
- NOTE: use unique addresses for the client and server
-
- - Start the NFS server
-
- If the NFS/RDMA server was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in
- kernel config), load the RDMA transport module:
-
- $ modprobe svcrdma
-
- Regardless of how the server was built (module or built-in), start the
- server:
-
- $ /etc/init.d/nfs start
-
- or
-
- $ service nfs start
-
- Instruct the server to listen on the RDMA transport:
-
- $ echo rdma 20049 > /proc/fs/nfsd/portlist
-
- - On the client system
-
- If the NFS/RDMA client was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in
- kernel config), load the RDMA client module:
-
- $ modprobe xprtrdma
-
- Regardless of how the client was built (module or built-in), use this
- command to mount the NFS/RDMA server:
-
- $ mount -o rdma,port=20049 <IPoIB-server-name-or-address>:/<export> /mnt
-
- To verify that the mount is using RDMA, run "cat /proc/mounts" and check
- the "proto" field for the given mount.
-
- Congratulations! You're using NFS/RDMA!
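The rule the removed README gives for the hidden SUNRPC_XPRT_RDMA option (N if either dependency is N, M if both are on and at least one is M, Y if both are Y) can be written as a tiny function. This is a sketch of the rule as stated in the text, not of the Kconfig machinery itself, and the function name is hypothetical.

```python
def sunrpc_xprt_rdma(sunrpc, infiniband):
    """Derive SUNRPC_XPRT_RDMA from SUNRPC and INFINIBAND.

    Each argument is one of "N", "M" or "Y". Hypothetical helper that
    mirrors the three cases listed in the document.
    """
    values = {sunrpc, infiniband}
    if not values <= {"N", "M", "Y"}:
        raise ValueError("values must be 'N', 'M' or 'Y'")
    if "N" in values:
        return "N"  # either one off: client and server are not built
    if "M" in values:
        return "M"  # both on, at least one modular: built as modules
    return "Y"      # both built in: built into the kernel
```

For example, SUNRPC=Y with INFINIBAND=M yields M, which is the case where the transport modules have to be loaded with modprobe before use.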
diff --git a/Documentation/filesystems/nfs/nfs.txt b/Documentation/filesystems/nfs/nfs.txt
deleted file mode 100644
index f2571c8bef74..000000000000
--- a/Documentation/filesystems/nfs/nfs.txt
+++ /dev/null
@@ -1,136 +0,0 @@
-
-The NFS client
-==============
-
-The NFS version 2 protocol was first documented in RFC1094 (March 1989).
-Since then two more major releases of NFS have been published, with NFSv3
-being documented in RFC1813 (June 1995), and NFSv4 in RFC3530 (April
-2003).
-
-The Linux NFS client currently supports all the above published versions,
-and work is in progress on adding support for minor version 1 of the NFSv4
-protocol.
-
-The purpose of this document is to provide information on some of the
-special features of the NFS client that can be configured by system
-administrators.
-
-
-The nfs4_unique_id parameter
-============================
-
-NFSv4 requires clients to identify themselves to servers with a unique
-string. File open and lock state shared between one client and one server
-is associated with this identity. To support robust NFSv4 state recovery
-and transparent state migration, this identity string must not change
-across client reboots.
-
-Without any other intervention, the Linux client uses a string that contains
-the local system's node name. System administrators, however, often do not
-take care to ensure that node names are fully qualified and do not change
-over the lifetime of a client system. Node names can have other
-administrative requirements that require particular behavior that does not
-work well as part of an nfs_client_id4 string.
-
-The nfs.nfs4_unique_id boot parameter specifies a unique string that can be
-used instead of a system's node name when an NFS client identifies itself to
-a server. Thus, if the system's node name is not unique, or it changes, its
-nfs.nfs4_unique_id stays the same, preventing collision with other clients
-or loss of state during NFS reboot recovery or transparent state migration.
-
-The nfs.nfs4_unique_id string is typically a UUID, though it can contain
-anything that is believed to be unique across all NFS clients. An
-nfs4_unique_id string should be chosen when a client system is installed,
-just as a system's root file system gets a fresh UUID in its label at
-install time.
-
-The string should remain fixed for the lifetime of the client. It can be
-changed safely if care is taken that the client shuts down cleanly and all
-outstanding NFSv4 state has expired, to prevent loss of NFSv4 state.
-
-This string can be stored in an NFS client's grub.conf, or it can be provided
-via a net boot facility such as PXE. It may also be specified as an nfs.ko
-module parameter. Specifying a uniquifier string is not supported for NFS
-clients running in containers.
-
-
-The DNS resolver
-================
-
-NFSv4 allows for one server to refer the NFS client to data that has been
-migrated onto another server by means of the special "fs_locations"
-attribute. See
- http://tools.ietf.org/html/rfc3530#section-6
-and
- http://tools.ietf.org/html/draft-ietf-nfsv4-referrals-00
-
-The fs_locations information can take the form of either an ip address and
-a path, or a DNS hostname and a path. The latter requires the NFS client to
-do a DNS lookup in order to mount the new volume, and hence the need for an
-upcall to allow userland to provide this service.
-
-Assuming that the user has the 'rpc_pipefs' filesystem mounted in the usual
-/var/lib/nfs/rpc_pipefs, the upcall consists of the following steps:
-
- (1) The process checks the dns_resolve cache to see if it contains a
- valid entry. If so, it returns that entry and exits.
-
- (2) If no valid entry exists, the helper script '/sbin/nfs_cache_getent'
- (may be changed using the 'nfs.cache_getent' kernel boot parameter)
- is run, with two arguments:
- - the cache name, "dns_resolve"
- - the hostname to resolve
-
- (3) After looking up the corresponding ip address, the helper script
- writes the result into the rpc_pipefs pseudo-file
- '/var/lib/nfs/rpc_pipefs/cache/dns_resolve/channel'
- in the following (text) format:
-
- "<ip address> <hostname> <ttl>\n"
-
- Where <ip address> is in the usual IPv4 (123.456.78.90) or IPv6
- (ffee:ddcc:bbaa:9988:7766:5544:3322:1100, ffee::1100, ...) format.
- <hostname> is identical to the second argument of the helper
- script, and <ttl> is the 'time to live' of this cache entry (in
- units of seconds).
-
- Note: If <ip address> is invalid, say the string "0", then a negative
- entry is created, which will cause the kernel to treat the hostname
- as having no valid DNS translation.
-
-
-
-
-A basic sample /sbin/nfs_cache_getent
-=====================================
-
-#!/bin/bash
-#
-ttl=600
-#
-cut=/usr/bin/cut
-getent=/usr/bin/getent
-rpc_pipefs=/var/lib/nfs/rpc_pipefs
-#
-die()
-{
- echo "Usage: $0 cache_name entry_name"
- exit 1
-}
-
-[ $# -lt 2 ] && die
-cachename="$1"
-cache_path=${rpc_pipefs}/cache/${cachename}/channel
-
-case "${cachename}" in
- dns_resolve)
- name="$2"
- result="$(${getent} hosts ${name} | ${cut} -f1 -d\ )"
- [ -z "${result}" ] && result="0"
- ;;
- *)
- die
- ;;
-esac
-echo "${result} ${name} ${ttl}" >${cache_path}
-
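The text format that the helper writes to the dns_resolve channel, "<ip address> <hostname> <ttl>\n", can also be sketched in Python. The function name `dns_resolve_entry` is hypothetical, and validating the address with the standard `ipaddress` module is an addition for illustration; the sample script above simply passes through whatever getent returns. A failed lookup is encoded as the string "0", which the document says creates a negative entry.

```python
import ipaddress


def dns_resolve_entry(hostname, address, ttl=600):
    """Build one dns_resolve cache line.

    `address` is an IPv4/IPv6 string, or None when the lookup failed.
    Hypothetical helper; 600 matches the ttl in the sample script.
    """
    if address is None:
        field = "0"  # invalid address => negative cache entry
    else:
        # Raises ValueError for strings that are not valid addresses.
        field = str(ipaddress.ip_address(address))
    return f"{field} {hostname} {ttl}\n"
```

The resulting line would be written to the channel file under the rpc_pipefs mount, as the last line of the sample script does.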
diff --git a/Documentation/filesystems/nfs/nfsd-admin-interfaces.txt b/Documentation/filesystems/nfs/nfsd-admin-interfaces.txt
deleted file mode 100644
index 56a96fb08a73..000000000000
--- a/Documentation/filesystems/nfs/nfsd-admin-interfaces.txt
+++ /dev/null
@@ -1,41 +0,0 @@
-Administrative interfaces for nfsd
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Note that normally these interfaces are used only by the utilities in
-nfs-utils.
-
-nfsd is controlled mainly by pseudofiles under the "nfsd" filesystem,
-which is normally mounted at /proc/fs/nfsd/.
-
-The server is always started by the first write of a nonzero value to
-nfsd/threads.
-
-Before doing that, NFSD can be told which sockets to listen on by
-writing to nfsd/portlist; that write may be:
-
- - an ascii-encoded file descriptor, which should refer to a
- bound (and listening, for tcp) socket, or
- - "transportname port", where transportname is currently either
- "udp", "tcp", or "rdma".
-
-If nfsd is started without doing any of these, then it will create one
-udp and one tcp listener at port 2049 (see nfsd_init_socks).
-
-On startup, nfsd and lockd grace periods start.
-
-nfsd is shut down by a write of 0 to nfsd/threads. All locks and state
-are thrown away at that point.
-
-Between startup and shutdown, the number of threads may be adjusted up
-or down by additional writes to nfsd/threads or by writes to
-nfsd/pool_threads.
-
-For more detail about files under nfsd/ and what they control, see
-fs/nfsd/nfsctl.c; most of them have detailed comments.
-
-Implementation notes
-^^^^^^^^^^^^^^^^^^^^
-
-Note that the rpc server requires the caller to serialize addition and
-removal of listening sockets, and startup and shutdown of the server.
-For nfsd this is done using nfsd_mutex.
diff --git a/Documentation/filesystems/nfs/nfsroot.txt b/Documentation/filesystems/nfs/nfsroot.txt
deleted file mode 100644
index ae4332464560..000000000000
--- a/Documentation/filesystems/nfs/nfsroot.txt
+++ /dev/null
@@ -1,355 +0,0 @@
-Mounting the root filesystem via NFS (nfsroot)
-===============================================
-
-Written 1996 by Gero Kuhlmann <gero@gkminix.han.de>
-Updated 1997 by Martin Mares <mj@atrey.karlin.mff.cuni.cz>
-Updated 2006 by Nico Schottelius <nico-kernel-nfsroot@schottelius.org>
-Updated 2006 by Horms <horms@verge.net.au>
-Updated 2018 by Chris Novakovic <chris@chrisn.me.uk>
-
-
-
-In order to use a diskless system, such as an X-terminal or printer server
-for example, it is necessary for the root filesystem to be present on a
-non-disk device. This may be an initramfs (see Documentation/filesystems/
-ramfs-rootfs-initramfs.txt), a ramdisk (see Documentation/admin-guide/initrd.rst) or a
-filesystem mounted via NFS. The following text describes how to use NFS
-for the root filesystem. For the rest of this text 'client' means the
-diskless system, and 'server' means the NFS server.
-
-
-
-
-1.) Enabling nfsroot capabilities
- -----------------------------
-
-In order to use nfsroot, NFS client support needs to be selected as
-built-in during configuration. Once this has been selected, the nfsroot
-option will become available, which should also be selected.
-
-In the networking options, kernel level autoconfiguration can be selected,
-along with the types of autoconfiguration to support. Selecting all of
-DHCP, BOOTP and RARP is safe.
-
-
-
-
-2.) Kernel command line
- -------------------
-
-When the kernel has been loaded by a boot loader (see below) it needs to be
-told what root fs device to use. And in the case of nfsroot, where to find
-both the server and the name of the directory on the server to mount as root.
-This can be established using the following kernel command line parameters:
-
-
-root=/dev/nfs
-
- This is necessary to enable the pseudo-NFS-device. Note that it's not a
- real device but just a synonym to tell the kernel to use NFS instead of
- a real device.
-
-
-nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>]
-
- If the `nfsroot' parameter is NOT given on the command line,
- the default "/tftpboot/%s" will be used.
-
- <server-ip> Specifies the IP address of the NFS server.
- The default address is determined by the `ip' parameter
- (see below). This parameter allows the use of different
- servers for IP autoconfiguration and NFS.
-
- <root-dir> Name of the directory on the server to mount as root.
- If there is a "%s" token in the string, it will be
- replaced by the ASCII-representation of the client's
- IP address.
-
- <nfs-options> Standard NFS options. All options are separated by commas.
- The following defaults are used:
- port = as given by server portmap daemon
- rsize = 4096
- wsize = 4096
- timeo = 7
- retrans = 3
- acregmin = 3
- acregmax = 60
- acdirmin = 30
- acdirmax = 60
- flags = hard, nointr, noposix, cto, ac
-
-
-ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>:
- <dns0-ip>:<dns1-ip>:<ntp0-ip>
-
- This parameter tells the kernel how to configure IP addresses of devices
- and also how to set up the IP routing table. It was originally called
- `nfsaddrs', but now the boot-time IP configuration works independently of
- NFS, so it was renamed to `ip' and the old name remained as an alias for
- compatibility reasons.
-
- If this parameter is missing from the kernel command line, all fields are
- assumed to be empty, and the defaults mentioned below apply. In general
- this means that the kernel tries to configure everything using
- autoconfiguration.
-
- The <autoconf> parameter can appear alone as the value to the `ip'
- parameter (without all the ':' characters before). If the value is
- "ip=off" or "ip=none", no autoconfiguration will take place, otherwise
- autoconfiguration will take place. The most common way to use this
- is "ip=dhcp".
-
- <client-ip> IP address of the client.
-
- Default: Determined using autoconfiguration.
-
- <server-ip> IP address of the NFS server. If RARP is used to determine
- the client address and this parameter is NOT empty only
- replies from the specified server are accepted.
-
- Only required for NFS root. That is autoconfiguration
- will not be triggered if it is missing and NFS root is not
- in operation.
-
- Value is exported to /proc/net/pnp with the prefix "bootserver "
- (see below).
-
- Default: Determined using autoconfiguration.
- The address of the autoconfiguration server is used.
-
- <gw-ip> IP address of a gateway if the server is on a different subnet.
-
- Default: Determined using autoconfiguration.
-
- <netmask> Netmask for local network interface. If unspecified
- the netmask is derived from the client IP address assuming
- classful addressing.
-
- Default: Determined using autoconfiguration.
-
- <hostname> Name of the client. If a '.' character is present, anything
- before the first '.' is used as the client's hostname, and anything
- after it is used as its NIS domain name. May be supplied by
- autoconfiguration, but its absence will not trigger autoconfiguration.
- If specified and DHCP is used, the user-provided hostname (and NIS
- domain name, if present) will be carried in the DHCP request; this
- may cause a DNS record to be created or updated for the client.
-
- Default: Client IP address is used in ASCII notation.
-
- <device> Name of network device to use.
-
- Default: If the host only has one device, it is used.
- Otherwise the device is determined using
- autoconfiguration. This is done by sending
- autoconfiguration requests out of all devices,
- and using the device that received the first reply.
-
- <autoconf> Method to use for autoconfiguration. In the case of options
- which specify multiple autoconfiguration protocols,
- requests are sent using all protocols, and the first one
- to reply is used.
-
- Only autoconfiguration protocols that have been compiled
- into the kernel will be used, regardless of the value of
- this option.
-
- off or none: don't use autoconfiguration
- (do static IP assignment instead)
- on or any: use any protocol available in the kernel
- (default)
- dhcp: use DHCP
- bootp: use BOOTP
- rarp: use RARP
- both: use both BOOTP and RARP but not DHCP
- (old option kept for backwards compatibility)
-
- if dhcp is used, the client identifier can be used by following
- format "ip=dhcp,client-id-type,client-id-value"
-
- Default: any
-
- <dns0-ip> IP address of primary nameserver.
- Value is exported to /proc/net/pnp with the prefix "nameserver "
- (see below).
-
- Default: None if not using autoconfiguration; determined
- automatically if using autoconfiguration.
-
- <dns1-ip> IP address of secondary nameserver.
- See <dns0-ip>.
-
- <ntp0-ip> IP address of a Network Time Protocol (NTP) server.
- Value is exported to /proc/net/ipconfig/ntp_servers, but is
- otherwise unused (see below).
-
- Default: None if not using autoconfiguration; determined
- automatically if using autoconfiguration.
-
- After configuration (whether manual or automatic) is complete, two files
- are created in the following format; lines are omitted if their respective
- value is empty following configuration:
-
- - /proc/net/pnp:
-
- #PROTO: <DHCP|BOOTP|RARP|MANUAL> (depending on configuration method)
- domain <dns-domain> (if autoconfigured, the DNS domain)
- nameserver <dns0-ip> (primary name server IP)
- nameserver <dns1-ip> (secondary name server IP)
- nameserver <dns2-ip> (tertiary name server IP)
- bootserver <server-ip> (NFS server IP)
-
- - /proc/net/ipconfig/ntp_servers:
-
- <ntp0-ip> (NTP server IP)
- <ntp1-ip> (NTP server IP)
- <ntp2-ip> (NTP server IP)
-
- <dns-domain> and <dns2-ip> (in /proc/net/pnp) and <ntp1-ip> and <ntp2-ip>
- (in /proc/net/ipconfig/ntp_servers) are requested during autoconfiguration;
- they cannot be specified as part of the "ip=" kernel command line parameter.
-
- Because the "domain" and "nameserver" options are recognised by DNS
- resolvers, /etc/resolv.conf is often linked to /proc/net/pnp on systems
- that use an NFS root filesystem.
-
- Note that the kernel will not synchronise the system time with any NTP
- servers it discovers; this is the responsibility of a user space process
- (e.g. an initrd/initramfs script that passes the IP addresses listed in
- /proc/net/ipconfig/ntp_servers to an NTP client before mounting the real
- root filesystem if it is on NFS).
-
-
-nfsrootdebug
-
- This parameter enables debugging messages to appear in the kernel
- log at boot time so that administrators can verify that the correct
- NFS mount options, server address, and root path are passed to the
- NFS client.
-
-
-rdinit=<executable file>
-
- To specify which file contains the program that starts system
- initialization, administrators can use this command line parameter.
- The default value of this parameter is "/init". If the specified
- file exists and the kernel can execute it, root filesystem related
- kernel command line parameters, including `nfsroot=', are ignored.
-
- A description of the process of mounting the root file system can be
- found in:
-
- Documentation/driver-api/early-userspace/early_userspace_support.rst
-
-
-
-
-3.) Boot Loader
- ----------
-
-To get the kernel into memory different approaches can be used.
-They depend on various facilities being available:
-
-
-3.1) Booting from a floppy using syslinux
-
- When building kernels, an easy way to create a boot floppy that uses
- syslinux is to use the zdisk or bzdisk make targets which use zimage
- and bzimage images respectively. Both targets accept the
- FDARGS parameter which can be used to set the kernel command line.
-
- e.g.
- make bzdisk FDARGS="root=/dev/nfs"
-
- Note that the user running this command will need to have
- access to the floppy drive device, /dev/fd0
-
- For more information on syslinux, including how to create bootdisks
- for prebuilt kernels, see http://syslinux.zytor.com/
-
- N.B: Previously it was possible to write a kernel directly to
- a floppy using dd, configure the boot device using rdev, and
- boot using the resulting floppy. Linux no longer supports this
- method of booting.
-
-3.2) Booting from a cdrom using isolinux
-
- When building kernels, an easy way to create a bootable cdrom that
- uses isolinux is to use the isoimage target which uses a bzimage
- image. Like zdisk and bzdisk, this target accepts the FDARGS
- parameter which can be used to set the kernel command line.
-
- e.g.
- make isoimage FDARGS="root=/dev/nfs"
-
- The resulting iso image will be arch/<ARCH>/boot/image.iso
- This can be written to a cdrom using a variety of tools including
- cdrecord.
-
- e.g.
- cdrecord dev=ATAPI:1,0,0 arch/x86/boot/image.iso
-
- For more information on isolinux, including how to create bootdisks
- for prebuilt kernels, see http://syslinux.zytor.com/
-
-3.2) Using LILO
- When using LILO all the necessary command line parameters may be
- specified using the 'append=' directive in the LILO configuration
- file.
-
- However, to use the 'root=' directive you also need to create
- a dummy root device, which may be removed after LILO is run.
-
- mknod /dev/boot255 c 0 255
-
- For information on configuring LILO, please refer to its documentation.
-
-3.3) Using GRUB
- When using GRUB, kernel parameters are simply appended after the kernel
- specification: kernel <kernel> <parameters>
-
-3.4) Using loadlin
- loadlin may be used to boot Linux from a DOS command prompt without
- requiring a local hard disk to mount as root. This has not been
- thoroughly tested by the authors of this document, but in general
- it should be possible to configure the kernel command line similarly
- to the configuration of LILO.
-
- Please refer to the loadlin documentation for further information.
-
-3.5) Using a boot ROM
- This is probably the most elegant way of booting a diskless client.
- With a boot ROM the kernel is loaded using the TFTP protocol. The
- authors of this document are not aware of any commercial boot
- ROMs that support booting Linux over the network. However, there
- are two free implementations of a boot ROM, netboot-nfs and
- etherboot, both of which are available on sunsite.unc.edu, and both
- of which contain everything you need to boot a diskless Linux client.
-
-3.6) Using pxelinux
- Pxelinux may be used to boot linux using the PXE boot loader
- which is present on many modern network cards.
-
- When using pxelinux, the kernel image is specified using
- "kernel <relative-path-below /tftpboot>". The nfsroot parameters
- are passed to the kernel by adding them to the "append" line.
- It is common to use serial console in conjunction with pxelinux,
- see Documentation/admin-guide/serial-console.rst for more information.
-
- For more information on isolinux, including how to create bootdisks
- for prebuilt kernels, see http://syslinux.zytor.com/
-
-
-
-
-4.) Credits
- -------
-
- The nfsroot code in the kernel and the RARP support have been written
- by Gero Kuhlmann <gero@gkminix.han.de>.
-
- The rest of the IP layer autoconfiguration code has been written
- by Martin Mares <mj@atrey.karlin.mff.cuni.cz>.
-
- In order to write the initial version of nfsroot I would like to thank
- Jens-Uwe Mager <jum@anubis.han.de> for his help.
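
The colon-separated layout of the `ip=` parameter described in the removed nfsroot.txt text above can be illustrated with a small user-space sketch. This is not how the kernel parses the parameter; the function name `split_ip_param` and the `enum ip_field` names are hypothetical and exist only for this illustration. It keeps empty fields (so defaults can apply) and treats a value without any ':' as `<autoconf>` alone, as the text describes for "ip=dhcp".

```c
#include <stddef.h>
#include <string.h>

#define IP_PARAM_FIELDS 10

/* Field order as documented for ip=; the names are illustrative only. */
enum ip_field {
	IP_CLIENT, IP_SERVER, IP_GATEWAY, IP_NETMASK, IP_HOSTNAME,
	IP_DEVICE, IP_AUTOCONF, IP_DNS0, IP_DNS1, IP_NTP0
};

/*
 * Split the value of an "ip=" parameter in place.
 * Empty fields are kept as empty strings so that defaults can apply.
 * A value without any ':' is treated as <autoconf> on its own.
 * Returns the number of fields found, or -1 if there are too many.
 */
int split_ip_param(char *value, const char *fields[IP_PARAM_FIELDS])
{
	int n = 0;
	char *p = value;

	for (int i = 0; i < IP_PARAM_FIELDS; i++)
		fields[i] = "";

	if (strchr(value, ':') == NULL) {
		fields[IP_AUTOCONF] = value;
		return 1;
	}

	while (p != NULL) {
		char *colon = strchr(p, ':');

		if (n == IP_PARAM_FIELDS)
			return -1;
		if (colon != NULL)
			*colon++ = '\0';	/* terminate this field */
		fields[n++] = p;
		p = colon;			/* NULL after the last field */
	}
	return n;
}
```

Splitting by hand with `strchr()` rather than `strtok()` is deliberate: `strtok()` collapses consecutive separators, which would lose the empty fields that the parameter format relies on.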
diff --git a/Documentation/filesystems/nfs/pnfs-block-server.txt b/Documentation/filesystems/nfs/pnfs-block-server.txt
deleted file mode 100644
index 2143673cf154..000000000000
--- a/Documentation/filesystems/nfs/pnfs-block-server.txt
+++ /dev/null
@@ -1,37 +0,0 @@
-pNFS block layout server user guide
-
-The Linux NFS server now supports the pNFS block layout extension. In this
-case the NFS server acts as Metadata Server (MDS) for pNFS, which in addition
-to handling all the metadata access to the NFS export also hands out layouts
-to the clients to directly access the underlying block devices that are
-shared with the client.
-
-To use pNFS block layouts with the Linux NFS server the exported file
-system needs to support the pNFS block layouts (currently just XFS), and the
-file system must sit on shared storage (typically iSCSI) that is accessible
-to the clients in addition to the MDS. As of now the file system needs to
-sit directly on the exported volume, striping or concatenation of
-volumes on the MDS and clients is not supported yet.
-
-On the server, pNFS block volume support is automatically enabled if the
-file system supports it. On the client make sure the kernel has the CONFIG_PNFS_BLOCK
-option enabled, the blkmapd daemon from nfs-utils is running, and the
-file system is mounted using the NFSv4.1 protocol version (mount -o vers=4.1).
-
-If the nfsd server needs to fence a non-responding client it calls
-/sbin/nfsd-recall-failed with the first argument set to the IP address of
-the client, and the second argument set to the device node without the /dev
-prefix for the file system to be fenced. Below is an example file that shows
-how to translate the device into a serial number from SCSI EVPD 0x80:
-
-cat > /sbin/nfsd-recall-failed << 'EOF'
-#!/bin/sh
-
-CLIENT="$1"
-DEV="/dev/$2"
-EVPD=`sg_inq --page=0x80 ${DEV} | \
- grep "Unit serial number:" | \
- awk -F ': ' '{print $2}'`
-
-echo "fencing client ${CLIENT} serial ${EVPD}" >> /var/log/pnfsd-fence.log
-EOF
diff --git a/Documentation/filesystems/nfs/pnfs-scsi-server.txt b/Documentation/filesystems/nfs/pnfs-scsi-server.txt
deleted file mode 100644
index 5bef7268bd9f..000000000000
--- a/Documentation/filesystems/nfs/pnfs-scsi-server.txt
+++ /dev/null
@@ -1,23 +0,0 @@
-
-pNFS SCSI layout server user guide
-==================================
-
-This document describes support for pNFS SCSI layouts in the Linux NFS server.
-With pNFS SCSI layouts, the NFS server acts as Metadata Server (MDS) for pNFS,
-which in addition to handling all the metadata access to the NFS export,
-also hands out layouts to the clients so that they can directly access the
-underlying SCSI LUNs that are shared with the client.
-
-To use pNFS SCSI layouts with the Linux NFS server, the exported file
-system needs to support the pNFS SCSI layouts (currently just XFS), and the
-file system must sit on a SCSI LUN that is accessible to the clients in
-addition to the MDS. As of now the file system needs to sit directly on the
-exported LUN, striping or concatenation of LUNs on the MDS and clients
-is not supported yet.
-
-On a server built with CONFIG_NFSD_SCSI, the pNFS SCSI volume support is
-automatically enabled if the file system is exported using the "pnfs"
-option and the underlying SCSI device supports persistent reservations.
-On the client make sure the kernel has the CONFIG_PNFS_BLOCK option
-enabled, and the file system is mounted using the NFSv4.1 protocol
-version (mount -o vers=4.1).
diff --git a/Documentation/filesystems/path-lookup.rst b/Documentation/filesystems/path-lookup.rst
index 434a07b0002b..a3216979298b 100644
--- a/Documentation/filesystems/path-lookup.rst
+++ b/Documentation/filesystems/path-lookup.rst
@@ -13,6 +13,7 @@ It has subsequently been updated to reflect changes in the kernel
including:
- per-directory parallel name lookup.
+- ``openat2()`` resolution restriction flags.
Introduction to pathname lookup
===============================
@@ -235,6 +236,13 @@ renamed. If ``d_lookup`` finds that a rename happened while it
unsuccessfully scanned a chain in the hash table, it simply tries
again.
+``rename_lock`` is also used to detect and defend against potential attacks
+against ``LOOKUP_BENEATH`` and ``LOOKUP_IN_ROOT`` when resolving ".." (where
+the parent directory is moved outside the root, bypassing the ``path_equal()``
+check). If ``rename_lock`` is updated during the lookup and the path encounters
+a "..", a potential attack occurred and ``handle_dots()`` will bail out with
+``-EAGAIN``.
+
inode->i_rwsem
~~~~~~~~~~~~~~
@@ -348,6 +356,13 @@ any changes to any mount points while stepping up. This locking is
needed to stabilize the link to the mounted-on dentry, which the
refcount on the mount itself doesn't ensure.
+``mount_lock`` is also used to detect and defend against potential attacks
+against ``LOOKUP_BENEATH`` and ``LOOKUP_IN_ROOT`` when resolving ".." (where
+the parent directory is moved outside the root, bypassing the ``path_equal()``
+check). If ``mount_lock`` is updated during the lookup and the path encounters
+a "..", a potential attack occurred and ``handle_dots()`` will bail out with
+``-EAGAIN``.
+
RCU
~~~
@@ -405,6 +420,10 @@ is requested. Keeping a reference in the ``nameidata`` ensures that
only one root is in effect for the entire path walk, even if it races
with a ``chroot()`` system call.
+It should be noted that in the case of ``LOOKUP_IN_ROOT`` or
+``LOOKUP_BENEATH``, the effective root becomes the directory file descriptor
+passed to ``openat2()`` (which exposes these ``LOOKUP_`` flags).
+
The root is needed when either of two conditions holds: (1) either the
pathname or a symbolic link starts with a "'/'", or (2) a "``..``"
component is being handled, since "``..``" from the root must always stay
@@ -1149,7 +1168,7 @@ so ``NULL`` is returned to indicate that the symlink can be released and
the stack frame discarded.
The other case involves things in ``/proc`` that look like symlinks but
-aren't really::
+aren't really (and are therefore commonly referred to as "magic-links")::
$ ls -l /proc/self/fd/1
lrwx------ 1 neilb neilb 64 Jun 13 10:19 /proc/self/fd/1 -> /dev/pts/4
@@ -1286,7 +1305,9 @@ A few flags
A suitable way to wrap up this tour of pathname walking is to list
the various flags that can be stored in the ``nameidata`` to guide the
lookup process. Many of these are only meaningful on the final
-component, others reflect the current state of the pathname lookup.
+component, others reflect the current state of the pathname lookup, and some
+apply restrictions to all path components encountered in the path lookup.
+
And then there is ``LOOKUP_EMPTY``, which doesn't fit conceptually with
the others. If this is not set, an empty pathname causes an error
very early on. If it is set, empty pathnames are not considered to be
@@ -1310,13 +1331,48 @@ longer needed.
``LOOKUP_JUMPED`` means that the current dentry was chosen not because
it had the right name but for some other reason. This happens when
following "``..``", following a symlink to ``/``, crossing a mount point
-or accessing a "``/proc/$PID/fd/$FD``" symlink. In this case the
-filesystem has not been asked to revalidate the name (with
-``d_revalidate()``). In such cases the inode may still need to be
-revalidated, so ``d_op->d_weak_revalidate()`` is called if
+or accessing a "``/proc/$PID/fd/$FD``" symlink (also known as a "magic
+link"). In this case the filesystem has not been asked to revalidate the
+name (with ``d_revalidate()``). In such cases the inode may still need
+to be revalidated, so ``d_op->d_weak_revalidate()`` is called if
``LOOKUP_JUMPED`` is set when the look completes - which may be at the
final component or, when creating, unlinking, or renaming, at the penultimate component.
+Resolution-restriction flags
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In order to allow userspace to protect itself against certain race conditions
+and attack scenarios involving changing path components, a series of flags are
+available which apply restrictions to all path components encountered during
+path lookup. These flags are exposed through ``openat2()``'s ``resolve`` field.
+
+``LOOKUP_NO_SYMLINKS`` blocks all symlink traversals (including magic-links).
+This is distinctly different from ``LOOKUP_FOLLOW``, because the latter only
+relates to restricting the following of trailing symlinks.
+
+``LOOKUP_NO_MAGICLINKS`` blocks all magic-link traversals. Filesystems must
+ensure that they return errors from ``nd_jump_link()``, because that is how
+``LOOKUP_NO_MAGICLINKS`` and other magic-link restrictions are implemented.
+
+``LOOKUP_NO_XDEV`` blocks all ``vfsmount`` traversals (this includes both
+bind-mounts and ordinary mounts). Note that the ``vfsmount`` which contains the
+lookup is determined by the first mountpoint the path lookup reaches --
+absolute paths start with the ``vfsmount`` of ``/``, and relative paths start
+with the ``dfd``'s ``vfsmount``. Magic-links are only permitted if the
+``vfsmount`` of the path is unchanged.
+
+``LOOKUP_BENEATH`` blocks any path components which resolve outside the
+starting point of the resolution. This is done by blocking ``nd_jump_root()``
+as well as blocking ".." if it would jump outside the starting point.
+``rename_lock`` and ``mount_lock`` are used to detect attacks against the
+resolution of "..". Magic-links are also blocked.
+
+``LOOKUP_IN_ROOT`` resolves all path components as though the starting point
+were the filesystem root. ``nd_jump_root()`` brings the resolution back to
+the starting point, and ".." at the starting point will act as a no-op. As with
+``LOOKUP_BENEATH``, ``rename_lock`` and ``mount_lock`` are used to detect
+attacks against ".." resolution. Magic-links are also blocked.
+
Final-component flags
~~~~~~~~~~~~~~~~~~~~~
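
The resolution-restriction flags added in the path-lookup.rst hunk above are kernel-internal `LOOKUP_*` names; user space requests them through the `resolve` field of `struct open_how` using the corresponding `RESOLVE_*` constants from `<linux/openat2.h>`. The following is a minimal sketch, assuming a Linux 5.6 or later kernel and matching UAPI headers. The helper name `open_beneath` is hypothetical and not part of any library.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/openat2.h>	/* struct open_how, RESOLVE_* */
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: open `path` read-only relative to `dirfd`,
 * refusing any component that would resolve outside `dirfd`
 * (absolute paths, ".." past the starting point, escaping symlinks).
 * Returns a file descriptor, or -1 with errno set.
 */
int open_beneath(int dirfd, const char *path)
{
	struct open_how how;

	memset(&how, 0, sizeof(how));
	how.flags = O_RDONLY | O_CLOEXEC;
	how.resolve = RESOLVE_BENEATH;

	/* The C library may not provide a wrapper, so use syscall(2). */
	return (int)syscall(SYS_openat2, dirfd, path, &how, sizeof(how));
}
```

A caller would typically obtain `dirfd` with `open(dir, O_RDONLY | O_DIRECTORY)` and should be prepared to retry when the call fails with `EAGAIN`, since the text above notes that a concurrent rename or mount change during ".." resolution makes the lookup bail out with that error.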
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index f18506083ced..26c093969573 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -850,3 +850,11 @@ business doing so.
d_alloc_pseudo() is internal-only; uses outside of alloc_file_pseudo() are
very suspect (and won't work in modules). Such uses are very likely to
be misspelled d_alloc_anon().
+
+---
+
+**mandatory**
+
+[should've been added in 2016] stale comment in finish_open() notwithstanding,
+failure exits in ->atomic_open() instances should *NOT* fput() the file,
+no matter what. Everything is handled by the caller.
diff --git a/Documentation/filesystems/vfat.rst b/Documentation/filesystems/vfat.rst
new file mode 100644
index 000000000000..e85d74e91295
--- /dev/null
+++ b/Documentation/filesystems/vfat.rst
@@ -0,0 +1,387 @@
+====
+VFAT
+====
+
+USING VFAT
+==========
+
+To use the vfat filesystem, use the filesystem type 'vfat'. i.e.::
+
+ mount -t vfat /dev/fd0 /mnt
+
+
+No special partition formatter is required,
+'mkdosfs' will work fine if you want to format from within Linux.
+
+VFAT MOUNT OPTIONS
+==================
+
+**uid=###**
+ Set the owner of all files on this filesystem.
+ The default is the uid of current process.
+
+**gid=###**
+ Set the group of all files on this filesystem.
+ The default is the gid of current process.
+
+**umask=###**
+ The permission mask (for files and directories, see *umask(1)*).
+ The default is the umask of current process.
+
+**dmask=###**
+ The permission mask for the directory.
+ The default is the umask of current process.
+
+**fmask=###**
+ The permission mask for files.
+ The default is the umask of current process.
+
+**allow_utime=###**
+ This option controls the permission check of mtime/atime.
+
+ **-20**: If current process is in group of file's group ID,
+ you can change timestamp.
+
+ **-2**: Other users can change timestamp.
+
+ The default is set from dmask option. If the directory is
+ writable, utime(2) is also allowed. i.e. ~dmask & 022.
+
+ Normally utime(2) checks current process is owner of
+ the file, or it has CAP_FOWNER capability. But FAT
+ filesystem doesn't have uid/gid on disk, so normal
+ check is too inflexible. With this option you can
+ relax it.
+
+**codepage=###**
+ Sets the codepage number for converting to shortname
+ characters on FAT filesystem.
+ By default, FAT_DEFAULT_CODEPAGE setting is used.
+
+**iocharset=<name>**
+ Character set to use for converting between the
+ encoding used for user-visible filenames and 16-bit
+ Unicode characters. Long filenames are stored on disk
+ in Unicode format, but Unix for the most part doesn't
+ know how to deal with Unicode.
+ By default, FAT_DEFAULT_IOCHARSET setting is used.
+
+ There is also an option of doing UTF-8 translations
+ with the utf8 option.
+
+.. note:: ``iocharset=utf8`` is not recommended. If unsure, you should consider
+ the utf8 option instead.
+
+**utf8=<bool>**
+ UTF-8 is the filesystem safe version of Unicode that
+ is used by the console. It can be enabled or disabled
+ for the filesystem with this option.
+ If 'uni_xlate' gets set, UTF-8 gets disabled.
+ By default, FAT_DEFAULT_UTF8 setting is used.
+
+**uni_xlate=<bool>**
+ Translate unhandled Unicode characters to special
+ escaped sequences. This would let you backup and
+ restore filenames that are created with any Unicode
+ characters. Until Linux supports Unicode for real,
+ this gives you an alternative. Without this option,
+ a '?' is used when no translation is possible. The
+ escape character is ':' because it is otherwise
+ illegal on the vfat filesystem. The escape sequence
+ that gets used is ':' and the four digits of hexadecimal
+ unicode.
+
+**nonumtail=<bool>**
+ When creating 8.3 aliases, normally the alias will
+ end in '~1' or tilde followed by some number. If this
+ option is set, then if the filename is
+ "longfilename.txt" and "longfile.txt" does not
+ currently exist in the directory, longfile.txt will
+ be the short alias instead of longfi~1.txt.
+
+**usefree**
+ Use the "free clusters" value stored on FSINFO. It will
+ be used to determine the number of free clusters without
+ scanning the disk. It is not used by default, because
+ recent versions of Windows don't update it correctly in
+ some cases. If you are sure the "free clusters" value on
+ FSINFO is correct, this option lets you avoid scanning the disk.
+
+**quiet**
+ Stops printing certain warning messages.
+
+**check=s|r|n**
+ Case sensitivity checking setting.
+
+ **s**: strict, case sensitive
+
+ **r**: relaxed, case insensitive
+
+ **n**: normal, default setting, currently case insensitive
+
+**nocase**
+ This was deprecated for vfat. Use ``shortname=win95`` instead.
+
+**shortname=lower|win95|winnt|mixed**
+ Shortname display/create setting.
+
+ **lower**: convert to lowercase for display,
+ emulate the Windows 95 rule for create.
+
+ **win95**: emulate the Windows 95 rule for display/create.
+
+ **winnt**: emulate the Windows NT rule for display/create.
+
+ **mixed**: emulate the Windows NT rule for display,
+ emulate the Windows 95 rule for create.
+
+ Default setting is `mixed`.
+
+**tz=UTC**
+ Interpret timestamps as UTC rather than local time.
+ This option disables the conversion of timestamps
+ between local time (as used by Windows on FAT) and UTC
+ (which Linux uses internally). This is particularly
+ useful when mounting devices (like digital cameras)
+ that are set to UTC in order to avoid the pitfalls of
+ local time.
+
+**time_offset=minutes**
+ Set offset for conversion of timestamps from local time
+ used by FAT to UTC. I.e. <minutes> minutes will be subtracted
+ from each timestamp to convert it to UTC used internally by
+ Linux. This is useful when time zone set in ``sys_tz`` is
+ not the time zone used by the filesystem. Note that this
+ option still does not provide correct time stamps in all
+ cases in presence of DST - time stamps in a different DST
+ setting will be off by one hour.
+
+**showexec**
+ If set, the execute permission bits of the file will be
+ allowed only if the extension part of the name is .EXE,
+ .COM, or .BAT. Not set by default.
+
+**debug**
+ Can be set, but unused by the current implementation.
+
+**sys_immutable**
+ If set, ATTR_SYS attribute on FAT is handled as
+ IMMUTABLE flag on Linux. Not set by default.
+
+**flush**
+ If set, the filesystem will try to flush to disk
+ earlier than normal. Not set by default.
+
+**rodir**
+ FAT has the ATTR_RO (read-only) attribute. On Windows,
+ the ATTR_RO of the directory will just be ignored,
+ and is used only by applications as a flag (e.g. it's set
+ for the customized folder).
+
+ If you want to use ATTR_RO as read-only flag even for
+ the directory, set this option.
+
+**errors=panic|continue|remount-ro**
+ specify FAT behavior on critical errors: panic, continue
+ without doing anything or remount the partition in
+ read-only mode (default behavior).
+
+**discard**
+ If set, issues discard/TRIM commands to the block
+ device when blocks are freed. This is useful for SSD devices
+ and sparse/thinly-provisioned LUNs.
+
+**nfs=stale_rw|nostale_ro**
+ Enable this only if you want to export the FAT filesystem
+ over NFS.
+
+ **stale_rw**: This option maintains an index (cache) of directory
+ *inodes* by *i_logstart* which is used by the nfs-related code to
+ improve look-ups. Full file operations (read/write) over NFS are
+ supported, but cache eviction at the NFS server could
+ result in ESTALE issues.
+
+ **nostale_ro**: This option bases the *inode* number and filehandle
+ on the on-disk location of a file in the MS-DOS directory entry.
+ This ensures that ESTALE will not be returned after a file is
+ evicted from the inode cache. However, it means that operations
+ such as rename, create and unlink could cause filehandles that
+ previously pointed at one file to point at a different file,
+ potentially causing data corruption. For this reason, this
+ option also mounts the filesystem readonly.
+
+ To maintain backward compatibility, ``'-o nfs'`` is also accepted,
+ defaulting to "stale_rw".
+
+**dos1xfloppy <bool>: 0,1,yes,no,true,false**
+ If set, use a fallback default BIOS Parameter Block
+ configuration, determined by backing device size. These static
+ parameters match defaults assumed by DOS 1.x for 160 kiB,
+ 180 kiB, 320 kiB, and 360 kiB floppies and floppy images.
+
+
+
+LIMITATION
+==========
+
+The fallocated region of a file is discarded at umount/evict time
+when using fallocate with FALLOC_FL_KEEP_SIZE.
+The user should therefore assume that the fallocated region can be
+discarded at last close if memory pressure evicts the inode from
+memory. Any user that depends on the fallocated region should
+recheck fallocate after reopening the file.
+
+TODO
+====
+Need to get rid of the raw scanning stuff. Instead, always use
+a get next directory entry approach. The only thing left that uses
+raw scanning is the directory renaming code.
+
+
+POSSIBLE PROBLEMS
+=================
+
+- vfat_valid_longname does not properly check reserved names.
+- When a volume name is the same as a directory name in the root
+ directory of the filesystem, the directory name sometimes shows
+ up as an empty file.
+- autoconv option does not work correctly.
+
+
+TEST SUITE
+==========
+If you plan to make any modifications to the vfat filesystem, please
+get the test suite that comes with the vfat distribution at
+
+`<http://web.archive.org/web/*/http://bmrc.berkeley.edu/people/chaffee/vfat.html>`_
+
+This tests quite a few parts of the vfat filesystem and additional
+tests for new features or untested features would be appreciated.
+
+NOTES ON THE STRUCTURE OF THE VFAT FILESYSTEM
+=============================================
+This documentation was provided by Galen C. Hunt gchunt@cs.rochester.edu and
+lightly annotated by Gordon Chaffee.
+
+This document presents a very rough, technical overview of my
+knowledge of the extended FAT file system used in Windows NT 3.5 and
+Windows 95. I don't guarantee that any of the following is correct,
+but it appears to be so.
+
+The extended FAT file system is almost identical to the FAT
+file system used in DOS versions up to and including *6.223410239847*
+:-). The significant change has been the addition of long file names.
+These names support up to 255 characters including spaces and lower
+case characters as opposed to the traditional 8.3 short names.
+
+Here is the description of the traditional FAT entry in the current
+Windows 95 filesystem::
+
+ struct directory { // Short 8.3 names
+ unsigned char name[8]; // file name
+ unsigned char ext[3]; // file extension
+ unsigned char attr; // attribute byte
+ unsigned char lcase; // Case for base and extension
+ unsigned char ctime_ms; // Creation time, milliseconds
+ unsigned char ctime[2]; // Creation time
+ unsigned char cdate[2]; // Creation date
+ unsigned char adate[2]; // Last access date
+ unsigned char reserved[2]; // reserved values (ignored)
+ unsigned char time[2]; // time stamp
+ unsigned char date[2]; // date stamp
+ unsigned char start[2]; // starting cluster number
+ unsigned char size[4]; // size of the file
+ };
+
+
+The lcase field specifies if the base and/or the extension of an 8.3
+name should be capitalized. This field does not seem to be used by
+Windows 95 but it is used by Windows NT. The case of filenames is not
+completely compatible from Windows NT to Windows 95. It is completely
+compatible in the reverse direction, however. Filenames that fit in
+the 8.3 namespace and are written on Windows NT to be lowercase will
+show up as uppercase on Windows 95.
+
+.. note:: Note that the ``start`` and ``size`` values are actually little
+ endian integer values. The descriptions of the fields in this
+ structure are public knowledge and can be found elsewhere.
+
+With the extended FAT system, Microsoft has inserted extra
+directory entries for any files with extended names. (Any name which
+legally fits within the old 8.3 encoding scheme does not have extra
+entries.) I call these extra entries slots. Basically, a slot is a
+specially formatted directory entry which holds up to 13 characters of
+a file's extended name. Think of slots as additional labeling for the
+directory entry of the file to which they correspond. Microsoft
+prefers to refer to the 8.3 entry for a file as its alias and the
+extended slot directory entries as the file name.
+
+The C structure for a slot directory entry follows::
+
+ struct slot { // Up to 13 characters of a long name
+ unsigned char id; // sequence number for slot
+ unsigned char name0_4[10]; // first 5 characters in name
+ unsigned char attr; // attribute byte
+ unsigned char reserved; // always 0
+ unsigned char alias_checksum; // checksum for 8.3 alias
+ unsigned char name5_10[12]; // 6 more characters in name
+ unsigned char start[2]; // starting cluster number
+ unsigned char name11_12[4]; // last 2 characters in name
+ };
+
+
+If the layout of the slots looks a little odd, it's only
+because of Microsoft's efforts to maintain compatibility with old
+software. The slots must be disguised to prevent old software from
+panicking. To this end, a number of measures are taken:
+
+ 1) The attribute byte for a slot directory entry is always set
+ to 0x0f. This corresponds to an old directory entry with
+ attributes of "hidden", "system", "read-only", and "volume
+ label". Most old software will ignore any directory
+ entries with the "volume label" bit set. Real volume label
+ entries don't have the other three bits set.
+
+ 2) The starting cluster is always set to 0, an impossible
+ value for a DOS file.
+
+Because the extended FAT system is backward compatible, it is
+possible for old software to modify directory entries. Measures must
+be taken to ensure the validity of slots. An extended FAT system can
+verify that a slot does in fact belong to an 8.3 directory entry by
+the following:
+
+ 1) Positioning. Slots for a file always immediately precede
+ their corresponding 8.3 directory entry. In addition, each
+ slot has an id which marks its order in the extended file
+ name. Here is a very abbreviated view of an 8.3 directory
+ entry and its corresponding long name slots for the file
+ "My Big File.Extension which is long"::
+
+ <preceding files...>
+ <slot #3, id = 0x43, characters = "h is long">
+ <slot #2, id = 0x02, characters = "xtension whic">
+ <slot #1, id = 0x01, characters = "My Big File.E">
+ <directory entry, name = "MYBIGFIL.EXT">
+
+
+ .. note:: Note that the slots are stored from last to first. Slots
+ are numbered from 1 to N. The Nth slot is ``or'ed`` with
+ 0x40 to mark it as the last one.
+
+ 2) Checksum. Each slot has an alias_checksum value. The
+ checksum is calculated from the 8.3 name using the
+ following algorithm::
+
+ for (sum = i = 0; i < 11; i++) {
+ sum = (((sum&1)<<7)|((sum&0xfe)>>1)) + name[i];
+ }
+
+
+ 3) If there is free space in the final slot, a Unicode ``NULL (0x0000)``
+ is stored after the final character. After that, all unused
+ characters in the final slot are set to Unicode 0xFFFF.
+
+Finally, note that the extended name is stored in Unicode. Each Unicode
+character takes either two or four bytes, UTF-16LE encoded.
diff --git a/Documentation/filesystems/vfat.txt b/Documentation/filesystems/vfat.txt
deleted file mode 100644
index 91031298beb1..000000000000
--- a/Documentation/filesystems/vfat.txt
+++ /dev/null
@@ -1,347 +0,0 @@
-USING VFAT
-----------------------------------------------------------------------
-To use the vfat filesystem, use the filesystem type 'vfat'. i.e.
- mount -t vfat /dev/fd0 /mnt
-
-No special partition formatter is required. mkdosfs will work fine
-if you want to format from within Linux.
-
-VFAT MOUNT OPTIONS
-----------------------------------------------------------------------
-uid=### -- Set the owner of all files on this filesystem.
- The default is the uid of current process.
-
-gid=### -- Set the group of all files on this filesystem.
- The default is the gid of current process.
-
-umask=### -- The permission mask (for files and directories, see umask(1)).
- The default is the umask of current process.
-
-dmask=### -- The permission mask for the directory.
- The default is the umask of current process.
-
-fmask=### -- The permission mask for files.
- The default is the umask of current process.
-
-allow_utime=### -- This option controls the permission check of mtime/atime.
-
- 20 - If current process is in group of file's group ID,
- you can change timestamp.
- 2 - Other users can change timestamp.
-
- The default is set from `dmask' option. (If the directory is
- writable, utime(2) is also allowed. I.e. ~dmask & 022)
-
- Normally utime(2) checks current process is owner of
- the file, or it has CAP_FOWNER capability. But FAT
- filesystem doesn't have uid/gid on disk, so normal
- check is too unflexible. With this option you can
- relax it.
-
-codepage=### -- Sets the codepage number for converting to shortname
- characters on FAT filesystem.
- By default, FAT_DEFAULT_CODEPAGE setting is used.
-
-iocharset=<name> -- Character set to use for converting between the
- encoding is used for user visible filename and 16 bit
- Unicode characters. Long filenames are stored on disk
- in Unicode format, but Unix for the most part doesn't
- know how to deal with Unicode.
- By default, FAT_DEFAULT_IOCHARSET setting is used.
-
- There is also an option of doing UTF-8 translations
- with the utf8 option.
-
- NOTE: "iocharset=utf8" is not recommended. If unsure,
- you should consider the following option instead.
-
-utf8=<bool> -- UTF-8 is the filesystem safe version of Unicode that
- is used by the console. It can be enabled or disabled
- for the filesystem with this option.
- If 'uni_xlate' gets set, UTF-8 gets disabled.
- By default, FAT_DEFAULT_UTF8 setting is used.
-
-uni_xlate=<bool> -- Translate unhandled Unicode characters to special
- escaped sequences. This would let you backup and
- restore filenames that are created with any Unicode
- characters. Until Linux supports Unicode for real,
- this gives you an alternative. Without this option,
- a '?' is used when no translation is possible. The
- escape character is ':' because it is otherwise
- illegal on the vfat filesystem. The escape sequence
- that gets used is ':' and the four digits of hexadecimal
- unicode.
-
-nonumtail=<bool> -- When creating 8.3 aliases, normally the alias will
- end in '~1' or tilde followed by some number. If this
- option is set, then if the filename is
- "longfilename.txt" and "longfile.txt" does not
- currently exist in the directory, 'longfile.txt' will
- be the short alias instead of 'longfi~1.txt'.
-
-usefree -- Use the "free clusters" value stored on FSINFO. It'll
- be used to determine number of free clusters without
- scanning disk. But it's not used by default, because
- recent Windows don't update it correctly in some
- case. If you are sure the "free clusters" on FSINFO is
- correct, by this option you can avoid scanning disk.
-
-quiet -- Stops printing certain warning messages.
-
-check=s|r|n -- Case sensitivity checking setting.
- s: strict, case sensitive
- r: relaxed, case insensitive
- n: normal, default setting, currently case insensitive
-
-nocase -- This was deprecated for vfat. Use shortname=win95 instead.
-
-shortname=lower|win95|winnt|mixed
- -- Shortname display/create setting.
- lower: convert to lowercase for display,
- emulate the Windows 95 rule for create.
- win95: emulate the Windows 95 rule for display/create.
- winnt: emulate the Windows NT rule for display/create.
- mixed: emulate the Windows NT rule for display,
- emulate the Windows 95 rule for create.
- Default setting is `mixed'.
-
-tz=UTC -- Interpret timestamps as UTC rather than local time.
- This option disables the conversion of timestamps
- between local time (as used by Windows on FAT) and UTC
- (which Linux uses internally). This is particularly
- useful when mounting devices (like digital cameras)
- that are set to UTC in order to avoid the pitfalls of
- local time.
-time_offset=minutes
- -- Set offset for conversion of timestamps from local time
- used by FAT to UTC. I.e. <minutes> minutes will be subtracted
- from each timestamp to convert it to UTC used internally by
- Linux. This is useful when time zone set in sys_tz is
- not the time zone used by the filesystem. Note that this
- option still does not provide correct time stamps in all
- cases in presence of DST - time stamps in a different DST
- setting will be off by one hour.
-
-showexec -- If set, the execute permission bits of the file will be
- allowed only if the extension part of the name is .EXE,
- .COM, or .BAT. Not set by default.
-
-debug -- Can be set, but unused by the current implementation.
-
-sys_immutable -- If set, ATTR_SYS attribute on FAT is handled as
- IMMUTABLE flag on Linux. Not set by default.
-
-flush -- If set, the filesystem will try to flush to disk more
- early than normal. Not set by default.
-
-rodir -- FAT has the ATTR_RO (read-only) attribute. On Windows,
- the ATTR_RO of the directory will just be ignored,
- and is used only by applications as a flag (e.g. it's set
- for the customized folder).
-
- If you want to use ATTR_RO as read-only flag even for
- the directory, set this option.
-
-errors=panic|continue|remount-ro
- -- specify FAT behavior on critical errors: panic, continue
- without doing anything or remount the partition in
- read-only mode (default behavior).
-
-discard -- If set, issues discard/TRIM commands to the block
- device when blocks are freed. This is useful for SSD devices
- and sparse/thinly-provisoned LUNs.
-
-nfs=stale_rw|nostale_ro
- Enable this only if you want to export the FAT filesystem
- over NFS.
-
- stale_rw: This option maintains an index (cache) of directory
- inodes by i_logstart which is used by the nfs-related code to
- improve look-ups. Full file operations (read/write) over NFS is
- supported but with cache eviction at NFS server, this could
- result in ESTALE issues.
-
- nostale_ro: This option bases the inode number and filehandle
- on the on-disk location of a file in the MS-DOS directory entry.
- This ensures that ESTALE will not be returned after a file is
- evicted from the inode cache. However, it means that operations
- such as rename, create and unlink could cause filehandles that
- previously pointed at one file to point at a different file,
- potentially causing data corruption. For this reason, this
- option also mounts the filesystem readonly.
-
- To maintain backward compatibility, '-o nfs' is also accepted,
- defaulting to stale_rw
-
-dos1xfloppy -- If set, use a fallback default BIOS Parameter Block
- configuration, determined by backing device size. These static
- parameters match defaults assumed by DOS 1.x for 160 kiB,
- 180 kiB, 320 kiB, and 360 kiB floppies and floppy images.
-
-
-<bool>: 0,1,yes,no,true,false
-
-LIMITATION
----------------------------------------------------------------------
-* The fallocated region of file is discarded at umount/evict time
- when using fallocate with FALLOC_FL_KEEP_SIZE.
- So, User should assume that fallocated region can be discarded at
- last close if there is memory pressure resulting in eviction of
- the inode from the memory. As a result, for any dependency on
- the fallocated region, user should make sure to recheck fallocate
- after reopening the file.
-
-TODO
-----------------------------------------------------------------------
-* Need to get rid of the raw scanning stuff. Instead, always use
- a get next directory entry approach. The only thing left that uses
- raw scanning is the directory renaming code.
-
-
-POSSIBLE PROBLEMS
-----------------------------------------------------------------------
-* vfat_valid_longname does not properly checked reserved names.
-* When a volume name is the same as a directory name in the root
- directory of the filesystem, the directory name sometimes shows
- up as an empty file.
-* autoconv option does not work correctly.
-
-BUG REPORTS
-----------------------------------------------------------------------
-If you have trouble with the VFAT filesystem, mail bug reports to
-chaffee@bmrc.cs.berkeley.edu. Please specify the filename
-and the operation that gave you trouble.
-
-TEST SUITE
-----------------------------------------------------------------------
-If you plan to make any modifications to the vfat filesystem, please
-get the test suite that comes with the vfat distribution at
-
- http://web.archive.org/web/*/http://bmrc.berkeley.edu/
- people/chaffee/vfat.html
-
-This tests quite a few parts of the vfat filesystem and additional
-tests for new features or untested features would be appreciated.
-
-NOTES ON THE STRUCTURE OF THE VFAT FILESYSTEM
-----------------------------------------------------------------------
-(This documentation was provided by Galen C. Hunt <gchunt@cs.rochester.edu>
- and lightly annotated by Gordon Chaffee).
-
-This document presents a very rough, technical overview of my
-knowledge of the extended FAT file system used in Windows NT 3.5 and
-Windows 95. I don't guarantee that any of the following is correct,
-but it appears to be so.
-
-The extended FAT file system is almost identical to the FAT
-file system used in DOS versions up to and including 6.223410239847
-:-). The significant change has been the addition of long file names.
-These names support up to 255 characters including spaces and lower
-case characters as opposed to the traditional 8.3 short names.
-
-Here is the description of the traditional FAT entry in the current
-Windows 95 filesystem:
-
- struct directory { // Short 8.3 names
- unsigned char name[8]; // file name
- unsigned char ext[3]; // file extension
- unsigned char attr; // attribute byte
- unsigned char lcase; // Case for base and extension
- unsigned char ctime_ms; // Creation time, milliseconds
- unsigned char ctime[2]; // Creation time
- unsigned char cdate[2]; // Creation date
- unsigned char adate[2]; // Last access date
- unsigned char reserved[2]; // reserved values (ignored)
- unsigned char time[2]; // time stamp
- unsigned char date[2]; // date stamp
- unsigned char start[2]; // starting cluster number
- unsigned char size[4]; // size of the file
- };
-
-The lcase field specifies if the base and/or the extension of an 8.3
-name should be capitalized. This field does not seem to be used by
-Windows 95 but it is used by Windows NT. The case of filenames is not
-completely compatible from Windows NT to Windows 95. It is not completely
-compatible in the reverse direction, however. Filenames that fit in
-the 8.3 namespace and are written on Windows NT to be lowercase will
-show up as uppercase on Windows 95.
-
-Note that the "start" and "size" values are actually little
-endian integer values. The descriptions of the fields in this
-structure are public knowledge and can be found elsewhere.
-
-With the extended FAT system, Microsoft has inserted extra
-directory entries for any files with extended names. (Any name which
-legally fits within the old 8.3 encoding scheme does not have extra
-entries.) I call these extra entries slots. Basically, a slot is a
-specially formatted directory entry which holds up to 13 characters of
-a file's extended name. Think of slots as additional labeling for the
-directory entry of the file to which they correspond. Microsoft
-prefers to refer to the 8.3 entry for a file as its alias and the
-extended slot directory entries as the file name.
-
-The C structure for a slot directory entry follows:
-
- struct slot { // Up to 13 characters of a long name
- unsigned char id; // sequence number for slot
- unsigned char name0_4[10]; // first 5 characters in name
- unsigned char attr; // attribute byte
- unsigned char reserved; // always 0
- unsigned char alias_checksum; // checksum for 8.3 alias
- unsigned char name5_10[12]; // 6 more characters in name
- unsigned char start[2]; // starting cluster number
- unsigned char name11_12[4]; // last 2 characters in name
- };
-
-If the layout of the slots looks a little odd, it's only
-because of Microsoft's efforts to maintain compatibility with old
-software. The slots must be disguised to prevent old software from
-panicking. To this end, a number of measures are taken:
-
- 1) The attribute byte for a slot directory entry is always set
- to 0x0f. This corresponds to an old directory entry with
- attributes of "hidden", "system", "read-only", and "volume
- label". Most old software will ignore any directory
- entries with the "volume label" bit set. Real volume label
- entries don't have the other three bits set.
-
- 2) The starting cluster is always set to 0, an impossible
- value for a DOS file.
-
-Because the extended FAT system is backward compatible, it is
-possible for old software to modify directory entries. Measures must
-be taken to ensure the validity of slots. An extended FAT system can
-verify that a slot does in fact belong to an 8.3 directory entry by
-the following:
-
- 1) Positioning. Slots for a file always immediately proceed
- their corresponding 8.3 directory entry. In addition, each
- slot has an id which marks its order in the extended file
- name. Here is a very abbreviated view of an 8.3 directory
- entry and its corresponding long name slots for the file
- "My Big File.Extension which is long":
-
- <proceeding files...>
- <slot #3, id = 0x43, characters = "h is long">
- <slot #2, id = 0x02, characters = "xtension whic">
- <slot #1, id = 0x01, characters = "My Big File.E">
- <directory entry, name = "MYBIGFIL.EXT">
-
- Note that the slots are stored from last to first. Slots
- are numbered from 1 to N. The Nth slot is or'ed with 0x40
- to mark it as the last one.
-
- 2) Checksum. Each slot has an "alias_checksum" value. The
- checksum is calculated from the 8.3 name using the
- following algorithm:
-
- for (sum = i = 0; i < 11; i++) {
- sum = (((sum&1)<<7)|((sum&0xfe)>>1)) + name[i]
- }
-
- 3) If there is free space in the final slot, a Unicode NULL (0x0000)
- is stored after the final character. After that, all unused
- characters in the final slot are set to Unicode 0xFFFF.
-
-Finally, note that the extended name is stored in Unicode. Each Unicode
-character takes either two or four bytes, UTF-16LE encoded.
diff --git a/Documentation/filesystems/zonefs.txt b/Documentation/filesystems/zonefs.txt
new file mode 100644
index 000000000000..d54fa98ac158
--- /dev/null
+++ b/Documentation/filesystems/zonefs.txt
@@ -0,0 +1,404 @@
+ZoneFS - Zone filesystem for Zoned block devices
+
+Introduction
+============
+
+zonefs is a very simple file system exposing each zone of a zoned block device
+as a file. Unlike a regular POSIX-compliant file system with native zoned block
+device support (e.g. f2fs), zonefs does not hide the sequential write
+constraint of zoned block devices from the user. Files representing sequential
+write zones of the device must be written sequentially starting from the end
+of the file (append only writes).
+
+As such, zonefs is in essence closer to a raw block device access interface
+than to a full-featured POSIX file system. The goal of zonefs is to simplify
+the implementation of zoned block device support in applications by replacing
+raw block device file accesses with a richer file API, avoiding relying on
+direct block device file ioctls which may be more obscure to developers. One
+example of this approach is the implementation of LSM (log-structured merge)
+tree structures (such as used in RocksDB and LevelDB) on zoned block devices
+by allowing SSTables to be stored in a zone file similarly to a regular file
+system rather than as a range of sectors of the entire disk. The introduction
+of the higher level construct "one file is one zone" can help reduce the
+amount of changes needed in the application as well as introducing support for
+different application programming languages.
+
+Zoned block devices
+-------------------
+
+Zoned storage devices belong to a class of storage devices with an address
+space that is divided into zones. A zone is a group of consecutive LBAs and all
+zones are contiguous (there are no LBA gaps). Zones may have different types.
+* Conventional zones: there are no access constraints to LBAs belonging to
+ conventional zones. Any read or write access can be executed, similarly to a
+ regular block device.
+* Sequential zones: these zones accept random reads but must be written
+ sequentially. Each sequential zone has a write pointer maintained by the
+ device that keeps track of the mandatory start LBA position of the next write
+ to the device. As a result of this write constraint, LBAs in a sequential zone
+ cannot be overwritten. Sequential zones must first be erased using a special
+ command (zone reset) before rewriting.
+
+Zoned storage devices can be implemented using various recording and media
+technologies. The most common form of zoned storage today uses the SCSI Zoned
+Block Commands (ZBC) and Zoned ATA Commands (ZAC) interfaces on Shingled
+Magnetic Recording (SMR) HDDs.
+
+Solid State Disks (SSD) storage devices can also implement a zoned interface
+to, for instance, reduce internal write amplification due to garbage collection.
+The NVMe Zoned NameSpace (ZNS) is a technical proposal of the NVMe standard
+committee aiming at adding a zoned storage interface to the NVMe protocol.
+
+Zonefs Overview
+===============
+
+Zonefs exposes the zones of a zoned block device as files. The files
+representing zones are grouped by zone type, which are themselves represented
+by sub-directories. This file structure is built entirely using zone information
+provided by the device and so does not require any complex on-disk metadata
+structure.
+
+On-disk metadata
+----------------
+
+zonefs on-disk metadata is reduced to an immutable super block which
+persistently stores a magic number and optional feature flags and values. On
+mount, zonefs uses blkdev_report_zones() to obtain the device zone configuration
+and populates the mount point with a static file tree solely based on this
+information. File sizes come from the device zone type and write pointer
+position managed by the device itself.
+
+The super block is always written on disk at sector 0. The first zone of the
+device storing the super block is never exposed as a zone file by zonefs. If
+the zone containing the super block is a sequential zone, the mkzonefs format
+tool always "finishes" the zone, that is, it transitions the zone to a full
+state to make it read-only, preventing any data write.
+
+Zone type sub-directories
+-------------------------
+
+Files representing zones of the same type are grouped together under the same
+sub-directory automatically created on mount.
+
+For conventional zones, the sub-directory "cnv" is used. This directory is
+however created if and only if the device has usable conventional zones. If
+the device only has a single conventional zone at sector 0, the zone will not
+be exposed as a file as it will be used to store the zonefs super block. For
+such devices, the "cnv" sub-directory will not be created.
+
+For sequential write zones, the sub-directory "seq" is used.
+
+These two directories are the only directories that exist in zonefs. Users
+cannot create other directories and cannot rename nor delete the "cnv" and
+"seq" sub-directories.
+
+The size of each directory, reported in the st_size field of struct stat by
+the stat() or fstat() system calls, is the number of files existing under
+that directory.
+
+Zone files
+----------
+
+Zone files are named using the number of the zone they represent within the set
+of zones of a particular type. That is, both the "cnv" and "seq" directories
+contain files named "0", "1", "2", ... The file numbers also represent
+increasing zone start sector on the device.
+
+Read and write operations on zone files are not allowed beyond the maximum
+file size, that is, beyond the zone size. Any access exceeding the zone
+size fails with the -EFBIG error.
+
+Creating, deleting, renaming or modifying any attribute of files and
+sub-directories is not allowed.
+
+The number of blocks of a file as reported by stat() and fstat() indicates the
+size of the file zone, or in other words, the maximum file size.
+
+Conventional zone files
+-----------------------
+
+The size of conventional zone files is fixed to the size of the zone they
+represent. Conventional zone files cannot be truncated.
+
+These files can be randomly read and written using any type of I/O operation:
+buffered I/Os, direct I/Os, memory mapped I/Os (mmap), etc. There are no I/O
+constraints for these files beyond the file size limit mentioned above.
+
+Sequential zone files
+---------------------
+
+The size of sequential zone files grouped in the "seq" sub-directory represents
+the file's zone write pointer position relative to the zone start sector.
+
+Sequential zone files can only be written sequentially, starting from the file
+end, that is, write operations can only be append writes. Zonefs makes no
+attempt at accepting random writes and will fail any write request that has a
+start offset not corresponding to the end of the file, or to the end of the last
+write issued and still in-flight (for asynchronous I/O operations).
+
+Since dirty page writeback by the page cache does not guarantee a sequential
+write pattern, zonefs prevents buffered writes and writeable shared mappings
+on sequential files. Only direct I/O writes are accepted for these files.
+zonefs relies on the sequential delivery of write I/O requests to the device
+implemented by the block layer elevator. An elevator implementing the sequential
+write feature for zoned block device (ELEVATOR_F_ZBD_SEQ_WRITE elevator feature)
+must be used. This type of elevator (e.g. mq-deadline) is set by default
+for zoned block devices on device initialization.
+
+There are no restrictions on the type of I/O used for read operations in
+sequential zone files. Buffered I/Os, direct I/Os and shared read mappings are
+all accepted.
+
+Truncating sequential zone files is allowed only down to 0, in which case, the
+zone is reset to rewind the file zone write pointer position to the start of
+the zone, or up to the zone size, in which case the file's zone is transitioned
+to the FULL state (finish zone operation).
+
+Format options
+--------------
+
+Several optional features of zonefs can be enabled at format time.
+* Conventional zone aggregation: ranges of contiguous conventional zones can be
+ aggregated into a single larger file instead of the default one file per zone.
+* File ownership: The owner UID and GID of zone files is by default 0 (root)
+ but can be changed to any valid UID/GID.
+* File access permissions: the default 640 access permissions can be changed.
+
+IO error handling
+-----------------
+
+Zoned block devices may fail I/O requests for reasons similar to regular block
+devices, e.g. due to bad sectors. However, in addition to such known I/O
+failure patterns, the standards governing zoned block device behavior define
+additional conditions that result in I/O errors.
+
+* A zone may transition to the read-only condition (BLK_ZONE_COND_READONLY):
+ While the data already written in the zone is still readable, the zone can
+ no longer be written. No user action on the zone (zone management command or
+ read/write access) can change the zone condition back to a normal read/write
+ state. While the reasons for the device to transition a zone to read-only
+ state are not defined by the standards, a typical cause for such transition
+ would be a defective write head on an HDD (all zones under this head are
+ changed to read-only).
+
+* A zone may transition to the offline condition (BLK_ZONE_COND_OFFLINE):
+ An offline zone cannot be read nor written. No user action can transition an
+ offline zone back to an operational good state. Similarly to zone read-only
+ transitions, the reasons for a drive to transition a zone to the offline
+ condition are undefined. A typical cause would be a defective read-write head
+ on an HDD causing all zones on the platter under the broken head to be
+ inaccessible.
+
+* Unaligned write errors: These errors result from the host issuing write
+ requests with a start sector that does not correspond to a zone write pointer
+ position when the write request is executed by the device. Even though zonefs
+ enforces sequential file write for sequential zones, unaligned write errors
+ may still happen in the case of a partial failure of a very large direct I/O
+ operation split into multiple BIOs/requests or asynchronous I/O operations.
+ If one of the write requests within the set of sequential write requests
+ issued to the device fails, all write requests queued after it will
+ become unaligned and fail.
+
+* Delayed write errors: similarly to regular block devices, if the device side
+ write cache is enabled, write errors may occur in ranges of previously
+ completed writes when the device write cache is flushed, e.g. on fsync().
+ Similarly to the previous immediate unaligned write error case, delayed write
+ errors can propagate through a stream of cached sequential data for a zone
+ causing all data to be dropped after the sector that caused the error.
+
+All I/O errors detected by zonefs are notified to the user with an error code
+return for the system call that triggered or detected the error. The recovery
+actions taken by zonefs in response to I/O errors depend on the I/O type (read
+vs write) and on the reason for the error (bad sector, unaligned writes or zone
+condition change).
+
+* For read I/O errors, zonefs does not execute any particular recovery action,
+ but only if the file zone is still in a good condition and there is no
+ inconsistency between the file inode size and its zone write pointer position.
+ If a problem is detected, I/O error recovery is executed (see below table).
+
+* For write I/O errors, zonefs I/O error recovery is always executed.
+
+* A zone condition change to read-only or offline also always triggers zonefs
+ I/O error recovery.
+
+Zonefs minimal I/O error recovery may change a file size and file access
+permissions.
+
+* File size changes:
+ Immediate or delayed write errors in a sequential zone file may cause the file
+ inode size to be inconsistent with the amount of data successfully written in
+ the file zone. For instance, the partial failure of a multi-BIO large write
+ operation will cause the zone write pointer to advance partially, even though
+ the entire write operation will be reported as failed to the user. In such
+ case, the file inode size must be advanced to reflect the zone write pointer
+ change and eventually allow the user to restart writing at the end of the
+ file.
+ A file size may also be reduced to reflect a delayed write error detected on
+ fsync(): in this case, the amount of data effectively written in the zone may
+ be less than originally indicated by the file inode size. After such I/O
+ error, zonefs always fixes the file inode size to reflect the amount of data
+ persistently stored in the file zone.
+
+* Access permission changes:
+ A zone condition change to read-only is indicated with a change in the file
+ access permissions to render the file read-only. This disables changes to the
+ file attributes and data modification. For offline zones, all permissions
+ (read and write) to the file are disabled.
+
+Further action taken by zonefs I/O error recovery can be controlled by the user
+with the "errors=xxx" mount option. The table below summarizes the result of
+zonefs I/O error processing depending on the mount option and on the zone
+conditions.
+
+ +--------------+-----------+-----------------------------------------+
+ | | | Post error state |
+ | "errors=xxx" | device | access permissions |
+ | mount | zone | file file device zone |
+ | option | condition | size read write read write |
+ +--------------+-----------+-----------------------------------------+
+ | | good | fixed yes no yes yes |
+ | remount-ro | read-only | fixed yes no yes no |
+ | (default) | offline | 0 no no no no |
+ +--------------+-----------+-----------------------------------------+
+ | | good | fixed yes no yes yes |
+ | zone-ro | read-only | fixed yes no yes no |
+ | | offline | 0 no no no no |
+ +--------------+-----------+-----------------------------------------+
+ | | good | 0 no no yes yes |
+ | zone-offline | read-only | 0 no no yes no |
+ | | offline | 0 no no no no |
+ +--------------+-----------+-----------------------------------------+
+ | | good | fixed yes yes yes yes |
+ | repair | read-only | fixed yes no yes no |
+ | | offline | 0 no no no no |
+ +--------------+-----------+-----------------------------------------+
+
+Further notes:
+* The "errors=remount-ro" mount option is the default behavior of zonefs I/O
+ error processing if no errors mount option is specified.
+* With the "errors=remount-ro" mount option, the change of the file access
+ permissions to read-only applies to all files. The file system is remounted
+ read-only.
+* Access permission and file size changes due to the device transitioning zones
+ to the offline condition are permanent. Remounting or reformatting the device
+ with mkfs.zonefs (mkzonefs) will not change back offline zone files to a good
+ state.
+* File access permission changes to read-only due to the device transitioning
+ zones to the read-only condition are permanent. Remounting or reformatting
+ the device will not re-enable file write access.
+* File access permission changes implied by the remount-ro, zone-ro and
+ zone-offline mount options are temporary for zones in a good condition.
+ Unmounting and remounting the file system will restore the previous default
+ (format time values) access rights to the files affected.
+* The repair mount option triggers only the minimal set of I/O error recovery
+ actions, that is, file size fixes for zones in a good condition. Zones
+ indicated as being read-only or offline by the device still imply changes to
+ the zone file access permissions as noted in the table above.
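+
+The table above can also be read as a simple lookup. The following is a minimal
+sketch in shell, not part of zonefs or zonefs-tools: the function name
+zonefs_post_error_state is hypothetical, and it reproduces only the "file"
+columns of the table (size, read, write) for a given "errors=xxx" option and
+device zone condition.
+
```shell
# Hypothetical helper (illustration only): print the post error file state
# as "<size> <read> <write>", following the table above.
#   $1: errors= behavior (remount-ro, zone-ro, zone-offline, repair)
#   $2: device zone condition (good, read-only, offline)
zonefs_post_error_state() {
    case "$1:$2" in
        repair:good)
            # Only the file size is fixed; the file stays writable.
            echo "fixed yes yes" ;;
        remount-ro:good|remount-ro:read-only|zone-ro:good|zone-ro:read-only|repair:read-only)
            # Size fixed, file made read-only. With remount-ro, the whole
            # file system is additionally remounted read-only.
            echo "fixed yes no" ;;
        zone-offline:good|zone-offline:read-only|remount-ro:offline|zone-ro:offline|zone-offline:offline|repair:offline)
            # File treated as offline: size 0, all access disabled.
            echo "0 no no" ;;
        *)
            echo "unknown option or zone condition: $1 $2" >&2
            return 1 ;;
    esac
}

zonefs_post_error_state zone-ro read-only
zonefs_post_error_state repair good
```
+
+The device zone read and write columns are not modeled here since, in the
+table, they depend only on the zone condition and not on the mount option.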
+
+Mount options
+-------------
+
+zonefs defines the "errors=<behavior>" mount option to allow the user to specify
+zonefs behavior in response to I/O errors, inode size inconsistencies or zone
+condition changes. The defined behaviors are as follows:
+* remount-ro (default)
+* zone-ro
+* zone-offline
+* repair
+
+The I/O error actions defined for each behavior are detailed in the previous
+section.
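+
+As an illustration of how the option is passed, the sketch below builds the
+corresponding mount command line. The helper name zonefs_mount_cmd is
+hypothetical and the device and mount point are placeholders; the helper only
+prints the command and does not run it.
+
```shell
# Hypothetical helper (illustration only): print, without running it, a
# zonefs mount command using the "errors=<behavior>" option.
#   $1: behavior, $2: zoned block device, $3: mount point
zonefs_mount_cmd() {
    case "$1" in
        remount-ro|zone-ro|zone-offline|repair)
            echo "mount -t zonefs -o errors=$1 $2 $3" ;;
        *)
            echo "invalid errors= behavior: $1" >&2
            return 1 ;;
    esac
}

# Show the command selecting the zone-ro behavior.
zonefs_mount_cmd zone-ro /dev/sdX /mnt
```
+
+Since remount-ro is the default, omitting the option is equivalent to passing
+"errors=remount-ro".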
+
+Zonefs User Space Tools
+=======================
+
+The mkzonefs tool is used to format zoned block devices for use with zonefs.
+This tool is available on GitHub at:
+
+https://github.com/damien-lemoal/zonefs-tools
+
+zonefs-tools also includes a test suite which can be run against any zoned
+block device, including a null_blk block device created in zoned mode.
+
+Examples
+--------
+
+The following formats a 15 TB host-managed SMR HDD with 256 MB zones
+with the conventional zones aggregation feature enabled.
+
+# mkzonefs -o aggr_cnv /dev/sdX
+# mount -t zonefs /dev/sdX /mnt
+# ls -l /mnt/
+total 0
+dr-xr-xr-x 2 root root 1 Nov 25 13:23 cnv
+dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq
+
+The size of each zone file sub-directory indicates the number of files
+existing for that type of zone. In this example, there is only one
+conventional zone file (all conventional zones are aggregated under a single
+file).
+
+# ls -l /mnt/cnv
+total 137101312
+-rw-r----- 1 root root 140391743488 Nov 25 13:23 0
+
+This aggregated conventional zone file can be used as a regular file.
+
+# mkfs.ext4 /mnt/cnv/0
+# mount -o loop /mnt/cnv/0 /data
+
+The "seq" sub-directory, which groups the files for sequential write zones,
+contains 55356 zone files in this example.
+
+# ls -lv /mnt/seq
+total 14511243264
+-rw-r----- 1 root root 0 Nov 25 13:23 0
+-rw-r----- 1 root root 0 Nov 25 13:23 1
+-rw-r----- 1 root root 0 Nov 25 13:23 2
+...
+-rw-r----- 1 root root 0 Nov 25 13:23 55354
+-rw-r----- 1 root root 0 Nov 25 13:23 55355
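+
+The numbers in the two listings above can be cross-checked against the 256 MB
+zone size with a little arithmetic. The sketch below is only an illustration:
+the helper names are hypothetical, and it assumes that ls reports its "total"
+in units of 1024 bytes.
+
```shell
# Zone size of the example device: 256 MB, i.e. 256 * 1024 * 1024 bytes.
ZONE_SIZE=$((256 * 1024 * 1024))

# Hypothetical helper: number of zones backing a file of $1 bytes.
zones_in_file() {
    echo $(( $1 / ZONE_SIZE ))
}

# Hypothetical helper: "total" expected from ls for $1 zones, assuming ls
# counts allocated space in 1024-byte units.
ls_total_kib() {
    echo $(( $1 * ZONE_SIZE / 1024 ))
}

zones_in_file 140391743488   # zones aggregated in cnv/0: 523
ls_total_kib 523             # matches "total 137101312" for cnv
ls_total_kib 55356           # matches "total 14511243264" for seq
```
+
+With 523 conventional zones and 55356 sequential zones of 256 MB each, the
+device capacity comes out at about 15 TB, consistent with the drive used in
+this example.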
+
+For sequential write zone files, the file size changes as data is appended at
+the end of the file, similarly to any regular file system.
+
+# dd if=/dev/zero of=/mnt/seq/0 bs=4096 count=1 conv=notrunc oflag=direct
+1+0 records in
+1+0 records out
+4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00044121 s, 9.3 MB/s
+
+# ls -l /mnt/seq/0
+-rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0
+
+The written file can be truncated to the zone size, preventing any further
+write operation.
+
+# truncate -s 268435456 /mnt/seq/0
+# ls -l /mnt/seq/0
+-rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0
+
+Truncation to 0 size frees the file zone storage space and allows restarting
+append-writes to the file.
+
+# truncate -s 0 /mnt/seq/0
+# ls -l /mnt/seq/0
+-rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0
+
+Since files are statically mapped to zones on the disk, the number of blocks of
+a file as reported by stat() and fstat() indicates the size of the file zone.
+
+# stat /mnt/seq/0
+ File: /mnt/seq/0
+ Size: 0 Blocks: 524288 IO Block: 4096 regular empty file
+Device: 870h/2160d Inode: 50431 Links: 1
+Access: (0640/-rw-r-----) Uid: ( 0/ root) Gid: ( 0/ root)
+Access: 2019-11-25 13:23:57.048971997 +0900
+Modify: 2019-11-25 13:52:25.553805765 +0900
+Change: 2019-11-25 13:52:25.553805765 +0900
+ Birth: -
+
+The number of blocks of the file ("Blocks"), in units of 512 B blocks, gives the
+maximum file size of 524288 * 512 B = 256 MB, corresponding to the device zone
+size in this example. Note that the "IO Block" field always indicates the
+minimum I/O size for writes and corresponds to the device physical sector size.
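+
+The same calculation can be scripted. The following is a minimal sketch with
+hypothetical helper names; it assumes that the block count is expressed in
+512 B units, as in the stat output above, and that the space left for
+append-writes in a sequential zone file is the zone size minus the current
+file size.
+
```shell
# Hypothetical helper: zone size in bytes from a block count ($1) given in
# 512-byte units, as reported in the "Blocks" field of stat.
zone_size_bytes() {
    echo $(( $1 * 512 ))
}

# Hypothetical helper: bytes that can still be appended to a sequential
# zone file with block count $1 and current size $2 (in bytes).
zone_remaining_bytes() {
    echo $(( $1 * 512 - $2 ))
}

zone_size_bytes 524288             # Blocks value from the stat output above
zone_remaining_bytes 524288 4096   # after a single 4096-byte write
```
+
+On a mounted zonefs, the two inputs could be obtained with GNU stat, for
+instance with stat -c '%b %s' /mnt/seq/0, which prints the block count followed
+by the file size. The value printed by zone_size_bytes is also the size to pass
+to truncate -s to fill a sequential zone file up to its zone size.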