Diffstat (limited to 'Documentation/filesystems/netfs_library.rst')
-rw-r--r-- | Documentation/filesystems/netfs_library.rst | 1051 |
1 files changed, 1051 insertions, 0 deletions
diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst new file mode 100644 index 000000000000..ddd799df6ce3 --- /dev/null +++ b/Documentation/filesystems/netfs_library.rst @@ -0,0 +1,1051 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================================== +Network Filesystem Services Library +=================================== + +.. Contents: + + - Overview. + - Requests and streams. + - Subrequests. + - Result collection and retry. + - Local caching. + - Content encryption (fscrypt). + - Per-inode context. + - Inode context helper functions. + - Inode locking. + - Inode writeback. + - High-level VFS API. + - Unlocked read/write iter. + - Pre-locked read/write iter. + - Monolithic files API. + - Memory-mapped I/O API. + - High-level VM API. + - Deprecated PG_private2 API. + - I/O request API. + - Request structure. + - Stream structure. + - Subrequest structure. + - Filesystem methods. + - Terminating a subrequest. + - Local cache API. + - API function reference. + + +Overview +======== + +The network filesystem services library, netfslib, is a set of functions +designed to aid a network filesystem in implementing VM/VFS API operations. It +takes over the normal buffered read, readahead, write and writeback and also +handles unbuffered and direct I/O. + +The library provides support for (re-)negotiation of I/O sizes and retrying +failed I/O as well as local caching and will, in the future, provide content +encryption. + +It insulates the filesystem from VM interface changes as much as possible and +handles VM features such as large multipage folios. The filesystem basically +just has to provide a way to perform read and write RPC calls. + +The way I/O is organised inside netfslib consists of a number of objects: + + * A *request*. A request is used to track the progress of the I/O overall and + to hold on to resources. The collection of results is done at the request + level. 
The I/O within a request is divided into a number of parallel + streams of subrequests. + + * A *stream*. A non-overlapping series of subrequests. The subrequests + within a stream do not have to be contiguous. + + * A *subrequest*. This is the basic unit of I/O. It represents a single RPC + call or a single cache I/O operation. The library passes these to the + filesystem and the cache to perform. + +Requests and Streams +-------------------- + +When actually performing I/O (as opposed to just copying into the pagecache), +netfslib will create one or more requests to track the progress of the I/O and +to hold resources. + +A read operation will have a single stream and the subrequests within that +stream may be of mixed origins, for instance mixing RPC subrequests and cache +subrequests. + +On the other hand, a write operation may have multiple streams, where each +stream targets a different destination. For instance, there may be one stream +writing to the local cache and one to the server. Currently, only two streams +are allowed, but this could be increased if parallel writes to multiple servers +is desired. + +The subrequests within a write stream do not need to match alignment or size +with the subrequests in another write stream and netfslib performs the tiling +of subrequests in each stream over the source buffer independently. Further, +each stream may contain holes that don't correspond to holes in the other +stream. + +In addition, the subrequests do not need to correspond to the boundaries of the +folios or vectors in the source/destination buffer. The library handles the +collection of results and the wrangling of folio flags and references. + +Subrequests +----------- + +Subrequests are at the heart of the interaction between netfslib and the +filesystem using it. Each subrequest is expected to correspond to a single +read or write RPC or cache operation. 
The library will stitch together the +results from a set of subrequests to provide a higher level operation. + +Netfslib has two interactions with the filesystem or the cache when setting up +a subrequest. First, there's an optional preparatory step that allows the +filesystem to negotiate the limits on the subrequest, both in terms of maximum +number of bytes and maximum number of vectors (e.g. for RDMA). This may +involve negotiating with the server (e.g. cifs needing to acquire credits). + +And, secondly, there's the issuing step in which the subrequest is handed off +to the filesystem to perform. + +Note that these two steps are done slightly differently between read and write: + + * For reads, the VM/VFS tells us how much is being requested up front, so the + library can preset maximum values that the cache and then the filesystem can + then reduce. The cache also gets consulted first on whether it wants to do + a read before the filesystem is consulted. + + * For writeback, it is unknown how much there will be to write until the + pagecache is walked, so no limit is set by the library. + +Once a subrequest is completed, the filesystem or cache informs the library of +the completion and then collection is invoked. Depending on whether the +request is synchronous or asynchronous, the collection of results will be done +in either the application thread or in a work queue. + +Result Collection and Retry +--------------------------- + +As subrequests complete, the results are collected and collated by the library +and folio unlocking is performed progressively (if appropriate). Once the +request is complete, async completion will be invoked (again, if appropriate). +It is possible for the filesystem to provide interim progress reports to the +library to cause folio unlocking to happen earlier if possible. + +If any subrequests fail, netfslib can retry them. 
It will wait until all +subrequests are completed, offer the filesystem the opportunity to fiddle with +the resources/state held by the request and poke at the subrequests before +re-preparing and re-issuing the subrequests. + +This allows the tiling of contiguous sets of failed subrequest within a stream +to be changed, adding more subrequests or ditching excess as necessary (for +instance, if the network sizes change or the server decides it wants smaller +chunks). + +Further, if one or more contiguous cache-read subrequests fail, the library +will pass them to the filesystem to perform instead, renegotiating and retiling +them as necessary to fit with the filesystem's parameters rather than those of +the cache. + +Local Caching +------------- + +One of the services netfslib provides, via ``fscache``, is the option to cache +on local disk a copy of the data obtained from/written to a network filesystem. +The library will manage the storing, retrieval and some invalidation of data +automatically on behalf of the filesystem if a cookie is attached to the +``netfs_inode``. + +Note that local caching used to use the PG_private_2 (aliased as PG_fscache) to +keep track of a page that was being written to the cache, but this is now +deprecated as PG_private_2 will be removed. + +Instead, folios that are read from the server for which there was no data in +the cache will be marked as dirty and will have ``folio->private`` set to a +special value (``NETFS_FOLIO_COPY_TO_CACHE``) and left to writeback to write. +If the folio is modified before that happened, the special value will be +cleared and the write will become normally dirty. + +When writeback occurs, folios that are so marked will only be written to the +cache and not to the server. Writeback handles mixed cache-only writes and +server-and-cache writes by using two streams, sending one to the cache and one +to the server. The server stream will have gaps in it corresponding to those +folios. 
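The two-stream split described above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: the enum, the stream constants and both functions are hypothetical stand-ins rather than netfslib types, caching is assumed to be enabled, and real writeback walks folios in the pagecache rather than an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of how a dirty folio is marked. */
enum folio_mark {
	FOLIO_DIRTY_NORMAL,	/* Modified data: goes to server and cache */
	FOLIO_COPY_TO_CACHE,	/* Unmodified data read from the server: cache only */
};

/* Hypothetical stream numbers, following the text: 0 = server, 1 = cache. */
enum { STREAM_SERVER = 0, STREAM_CACHE = 1, NR_STREAMS = 2 };

/* Decide which write streams a folio should be added to. */
static void route_folio(enum folio_mark mark, bool to_stream[NR_STREAMS])
{
	to_stream[STREAM_CACHE] = true;
	to_stream[STREAM_SERVER] = (mark == FOLIO_DIRTY_NORMAL);
}

/* Count the gaps that cache-only folios leave in the server stream. */
static size_t count_server_gaps(const enum folio_mark *marks, size_t n)
{
	bool to_stream[NR_STREAMS];
	bool in_gap = false;
	size_t gaps = 0;

	assert(marks || n == 0);
	for (size_t i = 0; i < n; i++) {
		route_folio(marks[i], to_stream);
		if (!to_stream[STREAM_SERVER] && !in_gap)
			gaps++;		/* A new run of cache-only folios starts here */
		in_gap = !to_stream[STREAM_SERVER];
	}
	return gaps;
}
```

In this model every folio lands in the cache stream, while adjacent cache-only folios merge into a single gap in the server stream.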
+ +Content Encryption (fscrypt) +---------------------------- + +Though it does not do so yet, at some point netfslib will acquire the ability +to do client-side content encryption on behalf of the network filesystem (Ceph, +for example). fscrypt can be used for this if appropriate (it may not be - +cifs, for example). + +The data will be stored encrypted in the local cache using the same manner of +encryption as the data written to the server and the library will impose bounce +buffering and RMW cycles as necessary. + + +Per-Inode Context +================= + +The network filesystem helper library needs a place to store a bit of state for +its use on each netfs inode it is helping to manage. To this end, a context +structure is defined:: + + struct netfs_inode { + struct inode inode; + const struct netfs_request_ops *ops; + struct fscache_cookie * cache; + loff_t remote_i_size; + unsigned long flags; + ... + }; + +A network filesystem that wants to use netfslib must place one of these in its +inode wrapper struct instead of the VFS ``struct inode``. This can be done in +a way similar to the following:: + + struct my_inode { + struct netfs_inode netfs; /* Netfslib context and vfs inode */ + ... + }; + +This allows netfslib to find its state by using ``container_of()`` from the +inode pointer, thereby allowing the netfslib helper functions to be pointed to +directly by the VFS/VM operation tables. + +The structure contains the following fields that are of interest to the +filesystem: + + * ``inode`` + + The VFS inode structure. + + * ``ops`` + + The set of operations provided by the network filesystem to netfslib. + + * ``cache`` + + Local caching cookie, or NULL if no caching is enabled. This field does not + exist if fscache is disabled. + + * ``remote_i_size`` + + The size of the file on the server. This differs from inode->i_size if + local modifications have been made but not yet written back. 
+ + * ``flags`` + + A set of flags, some of which the filesystem might be interested in: + + * ``NETFS_ICTX_MODIFIED_ATTR`` + + Set if netfslib modifies mtime/ctime. The filesystem is free to ignore + this or clear it. + + * ``NETFS_ICTX_UNBUFFERED`` + + Do unbuffered I/O upon the file. Like direct I/O but without the + alignment limitations. RMW will be performed if necessary. The pagecache + will not be used unless mmap() is also used. + + * ``NETFS_ICTX_WRITETHROUGH`` + + Do writethrough caching upon the file. I/O will be set up and dispatched + as buffered writes are made to the page cache. mmap() does the normal + writeback thing. + + * ``NETFS_ICTX_SINGLE_NO_UPLOAD`` + + Set if the file has a monolithic content that must be read entirely in a + single go and must not be written back to the server, though it can be + cached (e.g. AFS directories). + +Inode Context Helper Functions +------------------------------ + +To help deal with the per-inode context, a number of helper functions are +provided. 
Firstly, a function to perform basic initialisation on a context and +set the operations table pointer:: + + void netfs_inode_init(struct netfs_inode *ctx, + const struct netfs_request_ops *ops); + +then a function to cast from the VFS inode structure to the netfs context:: + + struct netfs_inode *netfs_inode(struct inode *inode); + +and finally, a function to get the cache cookie pointer from the context +attached to an inode (or NULL if fscache is disabled):: + + struct fscache_cookie *netfs_i_cookie(struct netfs_inode *ctx); + +Inode Locking +------------- + +A number of functions are provided to manage the locking of i_rwsem for I/O and +to effectively extend it to provide more separate classes of exclusion:: + + int netfs_start_io_read(struct inode *inode); + void netfs_end_io_read(struct inode *inode); + int netfs_start_io_write(struct inode *inode); + void netfs_end_io_write(struct inode *inode); + int netfs_start_io_direct(struct inode *inode); + void netfs_end_io_direct(struct inode *inode); + +The exclusion breaks down into four separate classes: + + 1) Buffered reads and writes. + + Buffered reads can run concurrently with each other and with buffered writes, + but buffered writes cannot run concurrently with each other. + + 2) Direct reads and writes. + + Direct (and unbuffered) reads and writes can run concurrently since they do + not share local buffering (i.e. the pagecache) and, in a network + filesystem, are expected to have exclusion managed on the server (though + this may not be the case for, say, Ceph). + + 3) Other major inode modifying operations (e.g. truncate, fallocate). + + These should just access i_rwsem directly. + + 4) mmap(). + + mmap'd accesses might operate concurrently with any of the other classes. + They might form the buffer for an intra-file loopback DIO read/write. They + might be permitted on unbuffered files. 
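The start/end helpers listed above are meant to bracket the I/O. The following userspace sketch illustrates that pattern only: ``struct inode`` is a mock, the two ``netfs_*`` functions are stubs that merely share the documented prototypes, ``my_fs_buffered_read()`` is a hypothetical filesystem function, and ``-EINTR`` is an assumed example of a failure code rather than something the text specifies.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for the VFS inode; it only tracks how many I/O sections are open. */
struct inode {
	int io_depth;
	int fail_next_lock;	/* Test knob: make the next lock attempt fail */
};

/* Userspace stubs with the same prototypes as the netfslib helpers. */
static int netfs_start_io_read(struct inode *inode)
{
	if (inode->fail_next_lock) {
		inode->fail_next_lock = 0;
		return -EINTR;	/* Assumed failure code, for illustration */
	}
	inode->io_depth++;
	return 0;
}

static void netfs_end_io_read(struct inode *inode)
{
	assert(inode->io_depth > 0);
	inode->io_depth--;
}

/* Hypothetical filesystem read path: bracket the I/O with start/end. */
static long my_fs_buffered_read(struct inode *inode, size_t len)
{
	long ret;

	ret = netfs_start_io_read(inode);
	if (ret < 0)
		return ret;	/* Lock not taken, so do not call the end helper */

	ret = (long)len;	/* Placeholder for the actual read */

	netfs_end_io_read(inode);
	return ret;
}
```

The point of the sketch is the error path: the end helper is only called when the matching start helper returned success.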
+ +Inode Writeback +--------------- + +Netfslib will pin resources on an inode for future writeback (such as pinning +use of an fscache cookie) when an inode is dirtied. However, this pinning +needs careful management. To manage the pinning, the following sequence +occurs: + + 1) An inode state flag ``I_PINNING_NETFS_WB`` is set by netfslib when the + pinning begins (when a folio is dirtied, for example) if the cache is + active to stop the cache structures from being discarded and the cache + space from being culled. This also prevents re-getting of cache resources + if the flag is already set. + + 2) This flag is then cleared inside the inode lock during inode writeback in the + VM - and the fact that it was set is transferred to ``->unpinned_netfs_wb`` + in ``struct writeback_control``. + + 3) If ``->unpinned_netfs_wb`` is now set, the write_inode procedure is forced. + + 4) The filesystem's ``->write_inode()`` function is invoked to do the cleanup. + + 5) The filesystem invokes netfs to do its cleanup. + +To do the cleanup, netfslib provides a function to do the resource unpinning:: + + int netfs_unpin_writeback(struct inode *inode, struct writeback_control *wbc); + +If the filesystem doesn't need to do anything else, this may be set as its +``.write_inode`` method. + +Further, if an inode is deleted, the filesystem's write_inode method may not +get called, so:: + + void netfs_clear_inode_writeback(struct inode *inode, const void *aux); + +must be called from ``->evict_inode()`` *before* ``clear_inode()`` is called. + + +High-Level VFS API +================== + +Netfslib provides a number of sets of API calls for the filesystem to delegate +VFS operations to. Netfslib, in turn, will call out to the filesystem and the +cache to negotiate I/O sizes, issue RPCs and provide places for it to intervene +at various times. 
+ +Unlocked Read/Write Iter +------------------------ + +The first API set is for the delegation of operations to netfslib when the +filesystem is called through the standard VFS read/write_iter methods:: + + ssize_t netfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter); + ssize_t netfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from); + ssize_t netfs_buffered_read_iter(struct kiocb *iocb, struct iov_iter *iter); + ssize_t netfs_unbuffered_read_iter(struct kiocb *iocb, struct iov_iter *iter); + ssize_t netfs_unbuffered_write_iter(struct kiocb *iocb, struct iov_iter *from); + +They can be assigned directly to ``.read_iter`` and ``.write_iter``. They +perform the inode locking themselves and the first two will switch between +buffered I/O and DIO as appropriate. + +Pre-Locked Read/Write Iter +-------------------------- + +The second API set is for the delegation of operations to netfslib when the +filesystem is called through the standard VFS methods, but needs to do some +other work before or after calling netfslib whilst still inside the locked section +(e.g. Ceph negotiating caps). The unbuffered read function is:: + + ssize_t netfs_unbuffered_read_iter_locked(struct kiocb *iocb, struct iov_iter *iter); + +This must not be assigned directly to ``.read_iter`` and the filesystem is +responsible for performing the inode locking before calling it. In the case of +buffered read, the filesystem should use ``filemap_read()``. 
+ +There are three functions for writes:: + + ssize_t netfs_buffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *from, + struct netfs_group *netfs_group); + ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter, + struct netfs_group *netfs_group); + ssize_t netfs_unbuffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *iter, + struct netfs_group *netfs_group); + +These must not be assigned directly to ``.write_iter`` and the filesystem is +responsible for performing the inode locking before calling them. + +The first two functions are for buffered writes; the first just adds some +standard write checks and jumps to the second, but if the filesystem wants to +do the checks itself, it can use the second directly. The third function is +for unbuffered or DIO writes. + +On all three write functions, there is a writeback group pointer (which should +be NULL if the filesystem doesn't use this). Writeback groups are set on +folios when they're modified. If a folio to-be-modified is already marked with +a different group, it is flushed first. The writeback API allows writing back +of a specific group. + +Memory-Mapped I/O API +--------------------- + +An API for support of mmap()'d I/O is provided:: + + vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_group); + +This allows the filesystem to delegate ``.page_mkwrite`` to netfslib. The +filesystem should not take the inode lock before calling it, but, as with the +locked write functions above, this does take a writeback group pointer. If the +page to be made writable is in a different group, it will be flushed first. + +Monolithic Files API +-------------------- + +There is also a special API set for files for which the content must be read in +a single RPC (and not written back) and is maintained as a monolithic blob +(e.g. 
an AFS directory), though it can be stored and updated in the local cache:: + + ssize_t netfs_read_single(struct inode *inode, struct file *file, struct iov_iter *iter); + void netfs_single_mark_inode_dirty(struct inode *inode); + int netfs_writeback_single(struct address_space *mapping, + struct writeback_control *wbc, + struct iov_iter *iter); + +The first function reads from a file into the given buffer, reading from the +cache in preference if the data is cached there; the second function allows the +inode to be marked dirty, causing a later writeback; and the third function can +be called from the writeback code to write the data to the cache, if there is +one. + +The inode should be marked ``NETFS_ICTX_SINGLE_NO_UPLOAD`` if this API is to be +used. The writeback function requires the buffer to be of ITER_FOLIOQ type. + +High-Level VM API +================== + +Netfslib also provides a number of sets of API calls for the filesystem to +delegate VM operations to. Again, netfslib, in turn, will call out to the +filesystem and the cache to negotiate I/O sizes, issue RPCs and provide places +for it to intervene at various times:: + + void netfs_readahead(struct readahead_control *); + int netfs_read_folio(struct file *, struct folio *); + int netfs_writepages(struct address_space *mapping, + struct writeback_control *wbc); + bool netfs_dirty_folio(struct address_space *mapping, struct folio *folio); + void netfs_invalidate_folio(struct folio *folio, size_t offset, size_t length); + bool netfs_release_folio(struct folio *folio, gfp_t gfp); + +These are ``address_space_operations`` methods and can be set directly in the +operations table. 
+ +Deprecated PG_private_2 API +--------------------------- + +There is also a deprecated function for filesystems that still use the +``->write_begin`` method:: + + int netfs_write_begin(struct netfs_inode *inode, struct file *file, + struct address_space *mapping, loff_t pos, unsigned int len, + struct folio **_folio, void **_fsdata); + +It uses the deprecated PG_private_2 flag and so should not be used. + + +I/O Request API +=============== + +The I/O request API comprises a number of structures and a number of functions +that the filesystem may need to use. + +Request Structure +----------------- + +The request structure manages the request as a whole, holding some resources +and state on behalf of the filesystem and tracking the collection of results:: + + struct netfs_io_request { + enum netfs_io_origin origin; + struct inode *inode; + struct address_space *mapping; + struct netfs_group *group; + struct netfs_io_stream io_streams[]; + void *netfs_priv; + void *netfs_priv2; + unsigned long long start; + unsigned long long len; + unsigned long long i_size; + unsigned int debug_id; + unsigned long flags; + ... + }; + +Many of the fields are for internal use, but the fields shown here are of +interest to the filesystem: + + * ``origin`` + + The origin of the request (readahead, read_folio, DIO read, writeback, ...). + + * ``inode`` + * ``mapping`` + + The inode and the address space of the file being read from. The mapping + may or may not point to inode->i_data. + + * ``group`` + + The writeback group this request is dealing with or NULL. This holds a ref + on the group. + + * ``io_streams`` + + The parallel streams of subrequests available to the request. Currently two + are available, but this may be made extensible in future. ``NR_IO_STREAMS`` + indicates the size of the array. + + * ``netfs_priv`` + * ``netfs_priv2`` + + The network filesystem's private data. The value for this can be passed in + to the helper functions or set during the request. 
+ + * ``start`` + * ``len`` + + The file position of the start of the read request and the length. These + may be altered by the ->expand_readahead() op. + + * ``i_size`` + + The size of the file at the start of the request. + + * ``debug_id`` + + A number allocated to this operation that can be displayed in trace lines + for reference. + + * ``flags`` + + Flags for managing and controlling the operation of the request. Some of + these may be of interest to the filesystem: + + * ``NETFS_RREQ_RETRYING`` + + Netfslib sets this when generating retries. + + * ``NETFS_RREQ_PAUSE`` + + The filesystem can set this to request to pause the library's subrequest + issuing loop - but care needs to be taken as netfslib may also set it. + + * ``NETFS_RREQ_NONBLOCK`` + * ``NETFS_RREQ_BLOCKED`` + + Netfslib sets the first to indicate that non-blocking mode was set by the + caller and the filesystem can set the second to indicate that it would + have had to block. + + * ``NETFS_RREQ_USE_PGPRIV2`` + + The filesystem can set this if it wants to use PG_private_2 to track + whether a folio is being written to the cache. This is deprecated as + PG_private_2 is going to go away. + +If the filesystem wants more private data than is afforded by this structure, +then it should wrap it and provide its own allocator. + +Stream Structure +---------------- + +A request is comprised of one or more parallel streams and each stream may be +aimed at a different target. + +For read requests, only stream 0 is used. This can contain a mixture of +subrequests aimed at different sources. For write requests, stream 0 is used +for the server and stream 1 is used for the cache. For buffered writeback, +stream 0 is not enabled unless a normal dirty folio is encountered, at which +point ->begin_writeback() will be invoked and the filesystem can mark the +stream available. 
+ +The stream struct looks like:: + + struct netfs_io_stream { + unsigned char stream_nr; + bool avail; + size_t sreq_max_len; + unsigned int sreq_max_segs; + unsigned int submit_extendable_to; + ... + }; + +A number of members are available for access/use by the filesystem: + + * ``stream_nr`` + + The number of the stream within the request. + + * ``avail`` + + True if the stream is available for use. The filesystem should set this on + stream zero if in ->begin_writeback(). + + * ``sreq_max_len`` + * ``sreq_max_segs`` + + These are set by the filesystem or the cache in ->prepare_read() or + ->prepare_write() for each subrequest to indicate the maximum number of + bytes and, optionally, the maximum number of segments (if not 0) that that + subrequest can support. + + * ``submit_extendable_to`` + + The size that a subrequest can be rounded up to beyond the EOF, given the + available buffer. This allows the cache to work out if it can do a DIO read + or write that straddles the EOF marker. + +Subrequest Structure +-------------------- + +Individual units of I/O are managed by the subrequest structure. These +represent slices of the overall request and run independently:: + + struct netfs_io_subrequest { + struct netfs_io_request *rreq; + struct iov_iter io_iter; + unsigned long long start; + size_t len; + size_t transferred; + unsigned long flags; + short error; + unsigned short debug_index; + unsigned char stream_nr; + ... + }; + +Each subrequest is expected to access a single source, though the library will +handle falling back from one source type to another. The members are: + + * ``rreq`` + + A pointer to the read request. + + * ``io_iter`` + + An I/O iterator representing a slice of the buffer to be read into or + written from. + + * ``start`` + * ``len`` + + The file position of the start of this slice of the read request and the + length. + + * ``transferred`` + + The amount of data transferred so far for this subrequest. 
This should be + added to with the length of the transfer made by this issuance of the + subrequest. If this is less than ``len`` then the subrequest may be + reissued to continue. + + * ``flags`` + + Flags for managing the subrequest. A number of these are of interest to the + filesystem or cache: + + * ``NETFS_SREQ_MADE_PROGRESS`` + + Set by the filesystem to indicate that at least one byte of data was read + or written. + + * ``NETFS_SREQ_HIT_EOF`` + + The filesystem should set this if a read hit the EOF on the file (in which + case ``transferred`` should stop at the EOF). Netfslib may expand the + subrequest out to the size of the folio containing the EOF on the off + chance that a third party change happened or a DIO read may have asked for + more than is available. The library will clear any excess pagecache. + + * ``NETFS_SREQ_CLEAR_TAIL`` + + The filesystem can set this to indicate that the remainder of the slice, + from transferred to len, should be cleared. Do not set if HIT_EOF is set. + + * ``NETFS_SREQ_NEED_RETRY`` + + The filesystem can set this to tell netfslib to retry the subrequest. + + * ``NETFS_SREQ_BOUNDARY`` + + This can be set by the filesystem on a subrequest to indicate that it ends + at a boundary with the filesystem structure (e.g. at the end of a Ceph + object). It tells netfslib not to retile subrequests across it. + + * ``error`` + + This is for the filesystem to store the result of the subrequest. It should be + set to 0 if successful and a negative error code otherwise. + + * ``debug_index`` + * ``stream_nr`` + + A number allocated to this slice that can be displayed in trace lines for + reference and the number of the request stream that it belongs to. 
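The bookkeeping that the ``transferred``, ``flags`` and ``error`` descriptions above ask of the filesystem can be sketched in userspace C. Everything here is a mock made up for illustration: ``struct mock_subrequest`` copies only the fields discussed, the ``MOCK_SREQ_*`` bit values are invented (the real ``NETFS_SREQ_*`` definitions live in the kernel headers), and the EOF test is one plausible reading of "a read hit the EOF".

```c
#include <assert.h>
#include <stddef.h>

/* Mock flag bits; not the real NETFS_SREQ_* values. */
#define MOCK_SREQ_MADE_PROGRESS	(1UL << 0)
#define MOCK_SREQ_HIT_EOF	(1UL << 1)

/* Cut-down mock of the subrequest fields discussed above. */
struct mock_subrequest {
	unsigned long long start;
	size_t len;
	size_t transferred;
	unsigned long flags;
	short error;
};

/*
 * Record the outcome of one issuance of a read.  @result is the number of
 * bytes read or a negative error code; @file_size is the server's idea of EOF.
 */
static void mock_read_done(struct mock_subrequest *subreq, long result,
			   unsigned long long file_size)
{
	assert(subreq);
	if (result < 0) {
		subreq->error = (short)result;
		return;
	}
	subreq->error = 0;
	if (result > 0) {
		/* Accumulate: an earlier issuance may already have made progress */
		subreq->transferred += (size_t)result;
		subreq->flags |= MOCK_SREQ_MADE_PROGRESS;
	}
	if (subreq->start + subreq->transferred >= file_size)
		subreq->flags |= MOCK_SREQ_HIT_EOF;
}

/* True if the subrequest would need reissuing to fetch the rest. */
static int mock_needs_reissue(const struct mock_subrequest *subreq)
{
	return !subreq->error &&
	       !(subreq->flags & MOCK_SREQ_HIT_EOF) &&
	       subreq->transferred < subreq->len;
}
```

A short read that stops before ``len`` without reaching the EOF leaves the mock in a state where it would be reissued, which matches the description of ``transferred`` above.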
+ +If necessary, the filesystem can get and put extra refs on the subrequest it is +given:: + + void netfs_get_subrequest(struct netfs_io_subrequest *subreq, + enum netfs_sreq_ref_trace what); + void netfs_put_subrequest(struct netfs_io_subrequest *subreq, + enum netfs_sreq_ref_trace what); + +using netfs trace codes to indicate the reason. Care must be taken, however, +as once control of the subrequest is returned to netfslib, the same subrequest +can be reissued/retried. + +Filesystem Methods +------------------ + +The filesystem sets a table of operations in ``netfs_inode`` for netfslib to +use:: + + struct netfs_request_ops { + mempool_t *request_pool; + mempool_t *subrequest_pool; + int (*init_request)(struct netfs_io_request *rreq, struct file *file); + void (*free_request)(struct netfs_io_request *rreq); + void (*free_subrequest)(struct netfs_io_subrequest *rreq); + void (*expand_readahead)(struct netfs_io_request *rreq); + int (*prepare_read)(struct netfs_io_subrequest *subreq); + void (*issue_read)(struct netfs_io_subrequest *subreq); + void (*done)(struct netfs_io_request *rreq); + void (*update_i_size)(struct inode *inode, loff_t i_size); + void (*post_modify)(struct inode *inode); + void (*begin_writeback)(struct netfs_io_request *wreq); + void (*prepare_write)(struct netfs_io_subrequest *subreq); + void (*issue_write)(struct netfs_io_subrequest *subreq); + void (*retry_request)(struct netfs_io_request *wreq, + struct netfs_io_stream *stream); + void (*invalidate_cache)(struct netfs_io_request *wreq); + }; + +The table starts with a pair of optional pointers to memory pools from which +requests and subrequests can be allocated. If these are not given, netfslib +has default pools that it will use instead. If the filesystem wraps the netfs +structs in its own larger structs, then it will need to use its own pools. +Netfslib will allocate directly from the pools. 
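Wrapping the netfs structs in larger filesystem structs, as mentioned above, might look like the following userspace sketch. The assumptions are: ``struct netfs_io_request`` is cut down to two fields, ``struct my_fs_request`` and both helper functions are hypothetical, and the library struct is placed first on the assumption that netfslib treats whatever the pool hands back as a ``struct netfs_io_request``.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Cut-down stand-in for the library's request structure. */
struct netfs_io_request {
	unsigned long long start;
	unsigned long long len;
};

/* Hypothetical filesystem wrapper carrying extra per-request state. */
struct my_fs_request {
	struct netfs_io_request rreq;	/* What netfslib sees; kept first */
	int credits;			/* Example of filesystem-private state */
};

/* Recover the wrapper from the pointer netfslib hands to a method. */
static struct my_fs_request *my_fs_req(struct netfs_io_request *rreq)
{
	assert(rreq);
	return container_of(rreq, struct my_fs_request, rreq);
}

/* Element size a filesystem-owned request pool would need to provide. */
static size_t my_fs_request_pool_elem_size(void)
{
	return sizeof(struct my_fs_request);
}
```

Because the pool elements are sized for the wrapper, the default pools (sized for the bare netfs structs) could not be used in this arrangement.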
+ +The methods defined in the table are: + + * ``init_request()`` + * ``free_request()`` + * ``free_subrequest()`` + + [Optional] A filesystem may implement these to initialise or clean up any + resources that it attaches to the request or subrequest. + + * ``expand_readahead()`` + + [Optional] This is called to allow the filesystem to expand the size of a + readahead request. The filesystem gets to expand the request in both + directions, though it must retain the initial region as that may represent + an allocation already made. If local caching is enabled, it gets to expand + the request first. + + Expansion is communicated by changing ->start and ->len in the request + structure. Note that if any change is made, ->len must be increased by at + least as much as ->start is reduced. + + * ``prepare_read()`` + + [Optional] This is called to allow the filesystem to limit the size of a + subrequest. It may also limit the number of individual regions in iterator, + such as required by RDMA. This information should be set on stream zero in:: + + rreq->io_streams[0].sreq_max_len + rreq->io_streams[0].sreq_max_segs + + The filesystem can use this, for example, to chop up a request that has to + be split across multiple servers or to put multiple reads in flight. + + Zero should be returned on success and an error code otherwise. + + * ``issue_read()`` + + [Required] Netfslib calls this to dispatch a subrequest to the server for + reading. In the subrequest, ->start, ->len and ->transferred indicate what + data should be read from the server and ->io_iter indicates the buffer to be + used. + + There is no return value; the ``netfs_read_subreq_terminated()`` function + should be called to indicate that the subrequest completed either way. + ->error, ->transferred and ->flags should be updated before completing. The + termination can be done asynchronously. 
+ + Note: the filesystem must not deal with setting folios uptodate, unlocking + them or dropping their refs - the library deals with this as it may have to + stitch together the results of multiple subrequests that variously overlap + the set of folios. + + * ``done()`` + + [Optional] This is called after the folios in a read request have all been + unlocked (and marked uptodate if applicable). + + * ``update_i_size()`` + + [Optional] This is invoked by netfslib at various points during the write + paths to ask the filesystem to update its idea of the file size. If not + given, netfslib will set i_size and i_blocks and update the local cache + cookie. + + * ``post_modify()`` + + [Optional] This is called after netfslib writes to the pagecache or when it + allows an mmap'd page to be marked as writable. + + * ``begin_writeback()`` + + [Optional] Netfslib calls this when processing a writeback request if it + finds a dirty page that isn't simply marked NETFS_FOLIO_COPY_TO_CACHE, + indicating it must be written to the server. This allows the filesystem to + only set up writeback resources when it knows it's going to have to perform + a write. + + * ``prepare_write()`` + + [Optional] This is called to allow the filesystem to limit the size of a + subrequest. It may also limit the number of individual regions in iterator, + such as required by RDMA. This information should be set on stream to which + the subrequest belongs:: + + rreq->io_streams[subreq->stream_nr].sreq_max_len + rreq->io_streams[subreq->stream_nr].sreq_max_segs + + The filesystem can use this, for example, to chop up a request that has to + be split across multiple servers or to put multiple writes in flight. + + This is not permitted to return an error. Instead, in the event of failure, + ``netfs_prepare_write_failed()`` must be called. + + * ``issue_write()`` + + [Required] This is used to dispatch a subrequest to the server for writing. 
+   In the subrequest, ->start, ->len and ->transferred indicate what data
+   should be written to the server and ->io_iter indicates the buffer to be
+   used.
+
+   There is no return value; the ``netfs_write_subrequest_terminated()``
+   function should be called to indicate that the subrequest completed either
+   way. ->error, ->transferred and ->flags should be updated before
+   completing. The termination can be done asynchronously.
+
+   Note: the filesystem must not deal with removing the dirty or writeback
+   marks on folios involved in the operation and should not take refs or pins
+   on them, but should leave retention to netfslib.
+
+ * ``retry_request()``
+
+   [Optional] Netfslib calls this at the beginning of a retry cycle. This
+   allows the filesystem to examine the state of the request, the subrequests
+   in the indicated stream and of its own data and make adjustments or
+   renegotiate resources.
+
+ * ``invalidate_cache()``
+
+   [Optional] This is called by netfslib to invalidate data stored in the local
+   cache in the event that writing to the local cache fails, providing updated
+   coherency data that netfs can't provide.
+
+Terminating a subrequest
+------------------------
+
+When a subrequest completes, there are a number of functions that the cache or
+filesystem can call to inform netfslib of the status change. One function is
+provided to terminate a write subrequest at the preparation stage and acts
+synchronously:
+
+ * ``void netfs_prepare_write_failed(struct netfs_io_subrequest *subreq);``
+
+   Indicate that the ->prepare_write() call failed. The ``error`` field should
+   have been updated.
+
+Note that ->prepare_read() can return an error as a read can simply be aborted.
+Dealing with writeback failure is trickier.
+
+The other functions are used for subrequests that got as far as being issued:
+
+ * ``void netfs_read_subreq_terminated(struct netfs_io_subrequest *subreq);``
+
+   Tell netfslib that a read subrequest has terminated.
+   The ``error``, ``flags`` and ``transferred`` fields should have been
+   updated.
+
+ * ``void netfs_write_subrequest_terminated(void *_op, ssize_t transferred_or_error);``
+
+   Tell netfslib that a write subrequest has terminated. Either the amount of
+   data processed or the negative error code can be passed in. This can be
+   used as a kiocb completion function.
+
+ * ``void netfs_read_subreq_progress(struct netfs_io_subrequest *subreq);``
+
+   This is provided to optionally update netfslib on the incremental progress
+   of a read, allowing some folios to be unlocked early; it does not actually
+   terminate the subrequest. The ``transferred`` field should have been
+   updated.
+
+Local Cache API
+---------------
+
+Netfslib provides a separate API for a local cache to implement, though it
+provides routines somewhat similar to those of the filesystem request API.
+
+Firstly, the netfs_io_request object contains a place for the cache to hang its
+state::
+
+	struct netfs_cache_resources {
+		const struct netfs_cache_ops	*ops;
+		void				*cache_priv;
+		void				*cache_priv2;
+		unsigned int			debug_id;
+		unsigned int			inval_counter;
+	};
+
+This contains an operations table pointer and two private pointers plus the
+debug ID of the fscache cookie for tracing purposes and an invalidation counter
+that is cranked by calls to ``fscache_invalidate()`` allowing cache subrequests
+to be invalidated after completion.
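The role of the invalidation counter can be illustrated with a small userspace
model. This is only a sketch of the snapshot-and-compare idea, not kernel code:
the ``demo_*`` types and functions are hypothetical, and the assumption that
the counter is copied when an operation begins and compared when it completes
is an illustrative reading of the paragraph above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the cache cookie; only the counter matters. */
struct demo_cookie {
	unsigned int inval_counter;	/* bumped on every invalidation */
};

/* Cut-down model of struct netfs_cache_resources as shown in the text. */
struct demo_cache_resources {
	const void *ops;		/* would point at the cache ops table */
	void *cache_priv;
	void *cache_priv2;
	unsigned int debug_id;
	unsigned int inval_counter;	/* snapshot taken when the op begins */
};

/* Begin an operation: remember the invalidation generation we started in. */
static void demo_begin_operation(struct demo_cache_resources *cres,
				 const struct demo_cookie *cookie,
				 unsigned int debug_id)
{
	assert(cres != NULL && cookie != NULL);
	cres->ops = NULL;
	cres->cache_priv = NULL;
	cres->cache_priv2 = NULL;
	cres->debug_id = debug_id;
	cres->inval_counter = cookie->inval_counter;
}

/* Model of an invalidation: crank the counter held by the cookie. */
static void demo_invalidate(struct demo_cookie *cookie)
{
	cookie->inval_counter++;
}

/* On completion: the result is stale if an invalidation happened meanwhile. */
static bool demo_result_still_valid(const struct demo_cache_resources *cres,
				    const struct demo_cookie *cookie)
{
	return cres->inval_counter == cookie->inval_counter;
}
```

In this model a subrequest that completes after ``demo_invalidate()`` has run
sees a mismatching counter and can discard its result rather than trusting it.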
+
+The cache operation table looks like the following::
+
+	struct netfs_cache_ops {
+		void (*end_operation)(struct netfs_cache_resources *cres);
+		void (*expand_readahead)(struct netfs_cache_resources *cres,
+					 loff_t *_start, size_t *_len, loff_t i_size);
+		enum netfs_io_source (*prepare_read)(struct netfs_io_subrequest *subreq,
+						     loff_t i_size);
+		int (*read)(struct netfs_cache_resources *cres,
+			    loff_t start_pos,
+			    struct iov_iter *iter,
+			    bool seek_data,
+			    netfs_io_terminated_t term_func,
+			    void *term_func_priv);
+		void (*prepare_write_subreq)(struct netfs_io_subrequest *subreq);
+		void (*issue_write)(struct netfs_io_subrequest *subreq);
+	};
+
+With a termination handler function pointer::
+
+	typedef void (*netfs_io_terminated_t)(void *priv,
+					      ssize_t transferred_or_error,
+					      bool was_async);
+
+The methods defined in the table are:
+
+ * ``end_operation()``
+
+   [Required] Called to clean up the resources at the end of the read request.
+
+ * ``expand_readahead()``
+
+   [Optional] Called at the beginning of a readahead operation to allow the
+   cache to expand a request in either direction. This allows the cache to
+   size the request appropriately for the cache granularity.
+
+ * ``prepare_read()``
+
+   [Required] Called to configure the next slice of a request. ->start and
+   ->len in the subrequest indicate where and how big the next slice can be;
+   the cache gets to reduce the length to match its granularity requirements.
+
+   The function is passed the subrequest, which carries the start and length,
+   plus the size of the file for reference, and adjusts the start and length
+   appropriately. It should return one of:
+
+   * ``NETFS_FILL_WITH_ZEROES``
+   * ``NETFS_DOWNLOAD_FROM_SERVER``
+   * ``NETFS_READ_FROM_CACHE``
+   * ``NETFS_INVALID_READ``
+
+   to indicate whether the slice should just be cleared or whether it should be
+   downloaded from the server or read from the cache - or whether slicing
+   should be given up at the current point.
+
+ * ``read()``
+
+   [Required] Called to read from the cache. The start file offset is given
+   along with an iterator to read into, which also gives the length. It can be
+   given a hint requesting that it seek forward from that start position for
+   data.
+
+   Also provided is a pointer to a termination handler function and private
+   data to pass to that function. The termination function should be called
+   with the number of bytes transferred or an error code, plus a flag
+   indicating whether the termination is definitely happening in the caller's
+   context.
+
+ * ``prepare_write_subreq()``
+
+   [Required] This is called to allow the cache to limit the size of a
+   subrequest. It may also limit the number of individual regions in the
+   iterator, such as is required by DIO/DMA. This information should be set on
+   the stream to which the subrequest belongs::
+
+	rreq->io_streams[subreq->stream_nr].sreq_max_len
+	rreq->io_streams[subreq->stream_nr].sreq_max_segs
+
+   The filesystem can use this, for example, to chop up a request that has to
+   be split across multiple servers or to put multiple writes in flight.
+
+   This is not permitted to return an error. In the event of failure,
+   ``netfs_prepare_write_failed()`` must be called.
+
+ * ``issue_write()``
+
+   [Required] This is used to dispatch a subrequest to the cache for writing.
+   In the subrequest, ->start, ->len and ->transferred indicate what data
+   should be written to the cache and ->io_iter indicates the buffer to be
+   used.
+
+   There is no return value; the ``netfs_write_subrequest_terminated()``
+   function should be called to indicate that the subrequest completed either
+   way. ->error, ->transferred and ->flags should be updated before
+   completing. The termination can be done asynchronously.
+
+
+API Function Reference
+======================
+
+.. kernel-doc:: include/linux/netfs.h
+.. kernel-doc:: fs/netfs/buffered_read.c
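As an illustration of the termination-handler convention described for the
cache ``read()`` method, the following userspace sketch shows how a single
``ssize_t`` can carry either a byte count or a negative error code. It is not
kernel code: every ``demo_*`` name is hypothetical, the read completes
synchronously for simplicity, and passing ``false`` for ``was_async`` in that
case is an assumption made for the model.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>	/* ssize_t */

/* Same shape as the netfs_io_terminated_t typedef shown in the text. */
typedef void (*demo_io_terminated_t)(void *priv,
				     ssize_t transferred_or_error,
				     bool was_async);

/* Hypothetical per-subrequest state kept by the caller. */
struct demo_subreq {
	size_t len;		/* bytes requested */
	size_t transferred;	/* bytes moved so far */
	int error;		/* 0 or a negative errno value */
	bool done;
};

/* Termination handler: split the combined value into count or error. */
static void demo_subreq_terminated(void *priv, ssize_t transferred_or_error,
				   bool was_async)
{
	struct demo_subreq *subreq = priv;

	assert(subreq != NULL);
	(void)was_async;	/* not needed in this synchronous model */
	if (transferred_or_error < 0)
		subreq->error = (int)transferred_or_error;
	else
		subreq->transferred += (size_t)transferred_or_error;
	subreq->done = true;
}

/* Hypothetical cache read: 'avail' bytes are cached, 'want' are requested. */
static void demo_cache_read(size_t avail, size_t want,
			    demo_io_terminated_t term_func,
			    void *term_func_priv)
{
	if (avail == 0) {
		/* Nothing cached: report failure through the handler. */
		term_func(term_func_priv, -EIO, false);
		return;
	}
	/* A short read is reported as a smaller positive count. */
	term_func(term_func_priv, (ssize_t)(want < avail ? want : avail), false);
}
```

Reporting both outcomes through one callback argument means the completion
path needs only a single entry point, whichever way the operation ended.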