aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/Documentation/filesystems/zonefs.txt
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/filesystems/zonefs.txt')
-rw-r--r--Documentation/filesystems/zonefs.txt404
1 files changed, 0 insertions, 404 deletions
diff --git a/Documentation/filesystems/zonefs.txt b/Documentation/filesystems/zonefs.txt
deleted file mode 100644
index d54fa98ac158..000000000000
--- a/Documentation/filesystems/zonefs.txt
+++ /dev/null
@@ -1,404 +0,0 @@
-ZoneFS - Zone filesystem for Zoned block devices
-
-Introduction
-============
-
-zonefs is a very simple file system exposing each zone of a zoned block device
-as a file. Unlike a regular POSIX-compliant file system with native zoned block
-device support (e.g. f2fs), zonefs does not hide the sequential write
-constraint of zoned block devices to the user. Files representing sequential
-write zones of the device must be written sequentially starting from the end
-of the file (append only writes).
-
-As such, zonefs is in essence closer to a raw block device access interface
-than to a full-featured POSIX file system. The goal of zonefs is to simplify
-the implementation of zoned block device support in applications by replacing
-raw block device file accesses with a richer file API, avoiding relying on
-direct block device file ioctls which may be more obscure to developers. One
-example of this approach is the implementation of LSM (log-structured merge)
-tree structures (such as used in RocksDB and LevelDB) on zoned block devices
-by allowing SSTables to be stored in a zone file similarly to a regular file
-system rather than as a range of sectors of the entire disk. The introduction
-of the higher level construct "one file is one zone" can help reducing the
-amount of changes needed in the application as well as introducing support for
-different application programming languages.
-
-Zoned block devices
--------------------
-
-Zoned storage devices belong to a class of storage devices with an address
-space that is divided into zones. A zone is a group of consecutive LBAs and all
-zones are contiguous (there are no LBA gaps). Zones may have different types.
-* Conventional zones: there are no access constraints to LBAs belonging to
- conventional zones. Any read or write access can be executed, similarly to a
- regular block device.
-* Sequential zones: these zones accept random reads but must be written
- sequentially. Each sequential zone has a write pointer maintained by the
- device that keeps track of the mandatory start LBA position of the next write
- to the device. As a result of this write constraint, LBAs in a sequential zone
- cannot be overwritten. Sequential zones must first be erased using a special
- command (zone reset) before rewriting.
-
-Zoned storage devices can be implemented using various recording and media
-technologies. The most common form of zoned storage today uses the SCSI Zoned
-Block Commands (ZBC) and Zoned ATA Commands (ZAC) interfaces on Shingled
-Magnetic Recording (SMR) HDDs.
-
-Solid State Disks (SSD) storage devices can also implement a zoned interface
-to, for instance, reduce internal write amplification due to garbage collection.
-The NVMe Zoned NameSpace (ZNS) is a technical proposal of the NVMe standard
-committee aiming at adding a zoned storage interface to the NVMe protocol.
-
-Zonefs Overview
-===============
-
-Zonefs exposes the zones of a zoned block device as files. The files
-representing zones are grouped by zone type, which are themselves represented
-by sub-directories. This file structure is built entirely using zone information
-provided by the device and so does not require any complex on-disk metadata
-structure.
-
-On-disk metadata
-----------------
-
-zonefs on-disk metadata is reduced to an immutable super block which
-persistently stores a magic number and optional feature flags and values. On
-mount, zonefs uses blkdev_report_zones() to obtain the device zone configuration
-and populates the mount point with a static file tree solely based on this
-information. File sizes come from the device zone type and write pointer
-position managed by the device itself.
-
-The super block is always written on disk at sector 0. The first zone of the
-device storing the super block is never exposed as a zone file by zonefs. If
-the zone containing the super block is a sequential zone, the mkzonefs format
-tool always "finishes" the zone, that is, it transitions the zone to a full
-state to make it read-only, preventing any data write.
-
-Zone type sub-directories
--------------------------
-
-Files representing zones of the same type are grouped together under the same
-sub-directory automatically created on mount.
-
-For conventional zones, the sub-directory "cnv" is used. This directory is
-however created if and only if the device has usable conventional zones. If
-the device only has a single conventional zone at sector 0, the zone will not
-be exposed as a file as it will be used to store the zonefs super block. For
-such devices, the "cnv" sub-directory will not be created.
-
-For sequential write zones, the sub-directory "seq" is used.
-
-These two directories are the only directories that exist in zonefs. Users
-cannot create other directories and cannot rename nor delete the "cnv" and
-"seq" sub-directories.
-
-The size of the directories indicated by the st_size field of struct stat,
-obtained with the stat() or fstat() system calls, indicates the number of files
-existing under the directory.
-
-Zone files
-----------
-
-Zone files are named using the number of the zone they represent within the set
-of zones of a particular type. That is, both the "cnv" and "seq" directories
-contain files named "0", "1", "2", ... The file numbers also represent
-increasing zone start sector on the device.
-
-All read and write operations to zone files are not allowed beyond the file
-maximum size, that is, beyond the zone size. Any access exceeding the zone
-size is failed with the -EFBIG error.
-
-Creating, deleting, renaming or modifying any attribute of files and
-sub-directories is not allowed.
-
-The number of blocks of a file as reported by stat() and fstat() indicates the
-size of the file zone, or in other words, the maximum file size.
-
-Conventional zone files
------------------------
-
-The size of conventional zone files is fixed to the size of the zone they
-represent. Conventional zone files cannot be truncated.
-
-These files can be randomly read and written using any type of I/O operation:
-buffered I/Os, direct I/Os, memory mapped I/Os (mmap), etc. There are no I/O
-constraint for these files beyond the file size limit mentioned above.
-
-Sequential zone files
----------------------
-
-The size of sequential zone files grouped in the "seq" sub-directory represents
-the file's zone write pointer position relative to the zone start sector.
-
-Sequential zone files can only be written sequentially, starting from the file
-end, that is, write operations can only be append writes. Zonefs makes no
-attempt at accepting random writes and will fail any write request that has a
-start offset not corresponding to the end of the file, or to the end of the last
-write issued and still in-flight (for asynchronous I/O operations).
-
-Since dirty page writeback by the page cache does not guarantee a sequential
-write pattern, zonefs prevents buffered writes and writeable shared mappings
-on sequential files. Only direct I/O writes are accepted for these files.
-zonefs relies on the sequential delivery of write I/O requests to the device
-implemented by the block layer elevator. An elevator implementing the sequential
-write feature for zoned block device (ELEVATOR_F_ZBD_SEQ_WRITE elevator feature)
-must be used. This type of elevator (e.g. mq-deadline) is set by default
-for zoned block devices on device initialization.
-
-There are no restrictions on the type of I/O used for read operations in
-sequential zone files. Buffered I/Os, direct I/Os and shared read mappings are
-all accepted.
-
-Truncating sequential zone files is allowed only down to 0, in which case, the
-zone is reset to rewind the file zone write pointer position to the start of
-the zone, or up to the zone size, in which case the file's zone is transitioned
-to the FULL state (finish zone operation).
-
-Format options
---------------
-
-Several optional features of zonefs can be enabled at format time.
-* Conventional zone aggregation: ranges of contiguous conventional zones can be
- aggregated into a single larger file instead of the default one file per zone.
-* File ownership: The owner UID and GID of zone files is by default 0 (root)
- but can be changed to any valid UID/GID.
-* File access permissions: the default 640 access permissions can be changed.
-
-IO error handling
------------------
-
-Zoned block devices may fail I/O requests for reasons similar to regular block
-devices, e.g. due to bad sectors. However, in addition to such known I/O
-failure pattern, the standards governing zoned block devices behavior define
-additional conditions that result in I/O errors.
-
-* A zone may transition to the read-only condition (BLK_ZONE_COND_READONLY):
- While the data already written in the zone is still readable, the zone can
- no longer be written. No user action on the zone (zone management command or
- read/write access) can change the zone condition back to a normal read/write
- state. While the reasons for the device to transition a zone to read-only
- state are not defined by the standards, a typical cause for such transition
- would be a defective write head on an HDD (all zones under this head are
- changed to read-only).
-
-* A zone may transition to the offline condition (BLK_ZONE_COND_OFFLINE):
- An offline zone cannot be read nor written. No user action can transition an
- offline zone back to an operational good state. Similarly to zone read-only
- transitions, the reasons for a drive to transition a zone to the offline
- condition are undefined. A typical cause would be a defective read-write head
- on an HDD causing all zones on the platter under the broken head to be
- inaccessible.
-
-* Unaligned write errors: These errors result from the host issuing write
- requests with a start sector that does not correspond to a zone write pointer
- position when the write request is executed by the device. Even though zonefs
- enforces sequential file write for sequential zones, unaligned write errors
- may still happen in the case of a partial failure of a very large direct I/O
- operation split into multiple BIOs/requests or asynchronous I/O operations.
- If one of the write request within the set of sequential write requests
- issued to the device fails, all write requests queued after it will
- become unaligned and fail.
-
-* Delayed write errors: similarly to regular block devices, if the device side
- write cache is enabled, write errors may occur in ranges of previously
- completed writes when the device write cache is flushed, e.g. on fsync().
- Similarly to the previous immediate unaligned write error case, delayed write
- errors can propagate through a stream of cached sequential data for a zone
- causing all data to be dropped after the sector that caused the error.
-
-All I/O errors detected by zonefs are notified to the user with an error code
-return for the system call that triggered or detected the error. The recovery
-actions taken by zonefs in response to I/O errors depend on the I/O type (read
-vs write) and on the reason for the error (bad sector, unaligned writes or zone
-condition change).
-
-* For read I/O errors, zonefs does not execute any particular recovery action,
- but only if the file zone is still in a good condition and there is no
- inconsistency between the file inode size and its zone write pointer position.
- If a problem is detected, I/O error recovery is executed (see below table).
-
-* For write I/O errors, zonefs I/O error recovery is always executed.
-
-* A zone condition change to read-only or offline also always triggers zonefs
- I/O error recovery.
-
-Zonefs minimal I/O error recovery may change a file size and file access
-permissions.
-
-* File size changes:
- Immediate or delayed write errors in a sequential zone file may cause the file
- inode size to be inconsistent with the amount of data successfully written in
- the file zone. For instance, the partial failure of a multi-BIO large write
- operation will cause the zone write pointer to advance partially, even though
- the entire write operation will be reported as failed to the user. In such
- case, the file inode size must be advanced to reflect the zone write pointer
- change and eventually allow the user to restart writing at the end of the
- file.
- A file size may also be reduced to reflect a delayed write error detected on
- fsync(): in this case, the amount of data effectively written in the zone may
- be less than originally indicated by the file inode size. After such I/O
- error, zonefs always fixes the file inode size to reflect the amount of data
- persistently stored in the file zone.
-
-* Access permission changes:
- A zone condition change to read-only is indicated with a change in the file
- access permissions to render the file read-only. This disables changes to the
- file attributes and data modification. For offline zones, all permissions
- (read and write) to the file are disabled.
-
-Further action taken by zonefs I/O error recovery can be controlled by the user
-with the "errors=xxx" mount option. The table below summarizes the result of
-zonefs I/O error processing depending on the mount option and on the zone
-conditions.
-
- +--------------+-----------+-----------------------------------------+
- | | | Post error state |
- | "errors=xxx" | device | access permissions |
- | mount | zone | file file device zone |
- | option | condition | size read write read write |
- +--------------+-----------+-----------------------------------------+
- | | good | fixed yes no yes yes |
- | remount-ro | read-only | fixed yes no yes no |
- | (default) | offline | 0 no no no no |
- +--------------+-----------+-----------------------------------------+
- | | good | fixed yes no yes yes |
- | zone-ro | read-only | fixed yes no yes no |
- | | offline | 0 no no no no |
- +--------------+-----------+-----------------------------------------+
- | | good | 0 no no yes yes |
- | zone-offline | read-only | 0 no no yes no |
- | | offline | 0 no no no no |
- +--------------+-----------+-----------------------------------------+
- | | good | fixed yes yes yes yes |
- | repair | read-only | fixed yes no yes no |
- | | offline | 0 no no no no |
- +--------------+-----------+-----------------------------------------+
-
-Further notes:
-* The "errors=remount-ro" mount option is the default behavior of zonefs I/O
- error processing if no errors mount option is specified.
-* With the "errors=remount-ro" mount option, the change of the file access
- permissions to read-only applies to all files. The file system is remounted
- read-only.
-* Access permission and file size changes due to the device transitioning zones
- to the offline condition are permanent. Remounting or reformatting the device
- with mkfs.zonefs (mkzonefs) will not change back offline zone files to a good
- state.
-* File access permission changes to read-only due to the device transitioning
- zones to the read-only condition are permanent. Remounting or reformatting
- the device will not re-enable file write access.
-* File access permission changes implied by the remount-ro, zone-ro and
- zone-offline mount options are temporary for zones in a good condition.
- Unmounting and remounting the file system will restore the previous default
- (format time values) access rights to the files affected.
-* The repair mount option triggers only the minimal set of I/O error recovery
- actions, that is, file size fixes for zones in a good condition. Zones
- indicated as being read-only or offline by the device still imply changes to
- the zone file access permissions as noted in the table above.
-
-Mount options
--------------
-
-zonefs define the "errors=<behavior>" mount option to allow the user to specify
-zonefs behavior in response to I/O errors, inode size inconsistencies or zone
-condition changes. The defined behaviors are as follow:
-* remount-ro (default)
-* zone-ro
-* zone-offline
-* repair
-
-The I/O error actions defined for each behavior are detailed in the previous
-section.
-
-Zonefs User Space Tools
-=======================
-
-The mkzonefs tool is used to format zoned block devices for use with zonefs.
-This tool is available on Github at:
-
-https://github.com/damien-lemoal/zonefs-tools
-
-zonefs-tools also includes a test suite which can be run against any zoned
-block device, including null_blk block device created with zoned mode.
-
-Examples
---------
-
-The following formats a 15TB host-managed SMR HDD with 256 MB zones
-with the conventional zones aggregation feature enabled.
-
-# mkzonefs -o aggr_cnv /dev/sdX
-# mount -t zonefs /dev/sdX /mnt
-# ls -l /mnt/
-total 0
-dr-xr-xr-x 2 root root 1 Nov 25 13:23 cnv
-dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq
-
-The size of the zone files sub-directories indicate the number of files
-existing for each type of zones. In this example, there is only one
-conventional zone file (all conventional zones are aggregated under a single
-file).
-
-# ls -l /mnt/cnv
-total 137101312
--rw-r----- 1 root root 140391743488 Nov 25 13:23 0
-
-This aggregated conventional zone file can be used as a regular file.
-
-# mkfs.ext4 /mnt/cnv/0
-# mount -o loop /mnt/cnv/0 /data
-
-The "seq" sub-directory grouping files for sequential write zones has in this
-example 55356 zones.
-
-# ls -lv /mnt/seq
-total 14511243264
--rw-r----- 1 root root 0 Nov 25 13:23 0
--rw-r----- 1 root root 0 Nov 25 13:23 1
--rw-r----- 1 root root 0 Nov 25 13:23 2
-...
--rw-r----- 1 root root 0 Nov 25 13:23 55354
--rw-r----- 1 root root 0 Nov 25 13:23 55355
-
-For sequential write zone files, the file size changes as data is appended at
-the end of the file, similarly to any regular file system.
-
-# dd if=/dev/zero of=/mnt/seq/0 bs=4096 count=1 conv=notrunc oflag=direct
-1+0 records in
-1+0 records out
-4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00044121 s, 9.3 MB/s
-
-# ls -l /mnt/seq/0
--rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0
-
-The written file can be truncated to the zone size, preventing any further
-write operation.
-
-# truncate -s 268435456 /mnt/seq/0
-# ls -l /mnt/seq/0
--rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0
-
-Truncation to 0 size allows freeing the file zone storage space and restart
-append-writes to the file.
-
-# truncate -s 0 /mnt/seq/0
-# ls -l /mnt/seq/0
--rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0
-
-Since files are statically mapped to zones on the disk, the number of blocks of
-a file as reported by stat() and fstat() indicates the size of the file zone.
-
-# stat /mnt/seq/0
- File: /mnt/seq/0
- Size: 0 Blocks: 524288 IO Block: 4096 regular empty file
-Device: 870h/2160d Inode: 50431 Links: 1
-Access: (0640/-rw-r-----) Uid: ( 0/ root) Gid: ( 0/ root)
-Access: 2019-11-25 13:23:57.048971997 +0900
-Modify: 2019-11-25 13:52:25.553805765 +0900
-Change: 2019-11-25 13:52:25.553805765 +0900
- Birth: -
-
-The number of blocks of the file ("Blocks") in units of 512B blocks gives the
-maximum file size of 524288 * 512 B = 256 MB, corresponding to the device zone
-size in this example. Of note is that the "IO block" field always indicates the
-minimum I/O size for writes and corresponds to the device physical sector size.