diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst
new file mode 100644
index 000000000000..73a4176144b3
--- /dev/null
+++ b/Documentation/filesystems/netfs_library.rst
@@ -0,0 +1,610 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================================
+Network Filesystem Helper Library
+=================================
+
+.. Contents:
+
+ - Overview.
+ - Per-inode context.
+ - Inode context helper functions.
+ - Buffered read helpers.
+ - Read helper functions.
+ - Read helper structures.
+ - Read helper operations.
+ - Read helper procedure.
+ - Read helper cache API.
+
+
+Overview
+========
+
+The network filesystem helper library is a set of functions designed to aid a
+network filesystem in implementing VM/VFS operations. For the moment, that
+just includes turning various VM buffered read operations into requests to read
+from the server. The helper library, however, can also interpose other
+services, such as local caching or local data encryption.
+
+Note that the library module doesn't link against local caching directly, so
+access must be provided by the netfs.
+
+
+Per-Inode Context
+=================
+
+The network filesystem helper library needs a place to store a bit of state for
+its use on each netfs inode it is helping to manage. To this end, a context
+structure is defined::
+
+ struct netfs_inode {
+ struct inode inode;
+ const struct netfs_request_ops *ops;
+ struct fscache_cookie *cache;
+ };
+
+A network filesystem that wants to use netfs lib must place one of these in its
+inode wrapper struct instead of the VFS ``struct inode``. This can be done in
+a way similar to the following::
+
+ struct my_inode {
+ struct netfs_inode netfs; /* Netfslib context and vfs inode */
+ ...
+ };
+
+This allows netfslib to find its state by using ``container_of()`` from the
+inode pointer, thereby allowing the netfslib helper functions to be pointed to
+directly by the VFS/VM operation tables.
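+
+The same layout lets the filesystem convert back from a VFS inode to its own
+inode structure.  For illustration only, it might provide an accessor along
+these lines (``MY_I()`` is a hypothetical name, not part of the library)::
+
+	static inline struct my_inode *MY_I(struct inode *inode)
+	{
+		return container_of(inode, struct my_inode, netfs.inode);
+	}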
+
+The structure contains the following fields:
+
+ * ``inode``
+
+ The VFS inode structure.
+
+ * ``ops``
+
+ The set of operations provided by the network filesystem to netfslib.
+
+ * ``cache``
+
+ Local caching cookie, or NULL if no caching is enabled. This field does not
+ exist if fscache is disabled.
+
+
+Inode Context Helper Functions
+------------------------------
+
+To help deal with the per-inode context, a number of helper functions are
+provided. Firstly, a function to perform basic initialisation on a context and
+set the operations table pointer::
+
+ void netfs_inode_init(struct netfs_inode *ctx,
+ const struct netfs_request_ops *ops);
+
+then a function to cast from the VFS inode structure to the netfs context::
+
+	struct netfs_inode *netfs_inode(struct inode *inode);
+
+and finally, a function to get the cache cookie pointer from the context
+attached to an inode (or NULL if fscache is disabled)::
+
+ struct fscache_cookie *netfs_i_cookie(struct netfs_inode *ctx);
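+
+As a brief sketch, assuming the hypothetical ``struct my_inode`` from above and
+a hypothetical operations table ``my_netfs_request_ops`` defined elsewhere with
+the filesystem's methods filled in, the context might be initialised while
+setting up an inode::
+
+	static void my_set_up_netfs_context(struct my_inode *mi)
+	{
+		/* Attach the operations table before any of the read helpers
+		 * are used on this inode.
+		 */
+		netfs_inode_init(&mi->netfs, &my_netfs_request_ops);
+	}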
+
+
+Buffered Read Helpers
+=====================
+
+The library provides a set of read helpers that handle the ->read_folio(),
+->readahead() and much of the ->write_begin() VM operations and translate them
+into a common call framework.
+
+The following services are provided:
+
+ * Handle folios that span multiple pages.
+
+ * Insulate the netfs from VM interface changes.
+
+ * Allow the netfs to arbitrarily split reads up into pieces, even ones that
+ don't match folio sizes or folio alignments and that may cross folios.
+
+ * Allow the netfs to expand a readahead request in both directions to meet its
+ needs.
+
+ * Allow the netfs to partially fulfil a read, which will then be resubmitted.
+
+ * Handle local caching, allowing cached data and server-read data to be
+ interleaved for a single request.
+
+ * Handle clearing of buffer areas for which there is no data on the server.
+
+ * Handle retrying of reads that failed, switching reads from the cache to the
+ server as necessary.
+
+ * In the future, this is a place where other services can be performed, such
+   as local encryption of data to be stored remotely or in the cache.
+
+From the network filesystem, the helpers require a table of operations. This
+includes a mandatory method to issue a read operation along with a number of
+optional methods.
+
+
+Read Helper Functions
+---------------------
+
+Three read helpers are provided::
+
+ void netfs_readahead(struct readahead_control *ractl);
+ int netfs_read_folio(struct file *file,
+ struct folio *folio);
+ int netfs_write_begin(struct netfs_inode *ctx,
+ struct file *file,
+ struct address_space *mapping,
+ loff_t pos,
+ unsigned int len,
+ struct folio **_folio,
+ void **_fsdata);
+
+Each corresponds to a VM address space operation. These operations use the
+state in the per-inode context.
+
+For ->readahead() and ->read_folio(), the network filesystem can just point
+directly at the corresponding read helper; for ->write_begin(), things may be a
+little more complicated as the network filesystem might want to flush
+conflicting writes or track dirty data, and it needs to put the acquired folio
+if an error occurs after calling the helper.
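+
+For instance, the wiring might look something like the following sketch, in
+which ``my_aops``, ``my_grab_folio_for_write()`` and
+``my_flush_conflicting_write()`` are hypothetical names rather than part of the
+library::
+
+	const struct address_space_operations my_aops = {
+		.read_folio	= netfs_read_folio,
+		.readahead	= netfs_readahead,
+		/* ... the filesystem's other methods ... */
+	};
+
+	/* A helper that the filesystem's own ->write_begin() handler might
+	 * call.  The folio returned in *foliop must be put by the caller if a
+	 * later step fails.
+	 */
+	static int my_grab_folio_for_write(struct file *file, loff_t pos,
+					   unsigned int len,
+					   struct folio **foliop, void **fsdata)
+	{
+		my_flush_conflicting_write(file, pos, len);
+		return netfs_write_begin(netfs_inode(file_inode(file)), file,
+					 file->f_mapping, pos, len,
+					 foliop, fsdata);
+	}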
+
+The helpers manage the read request, calling back into the network filesystem
+through the supplied table of operations. Waits will be performed as
+necessary before returning for helpers that are meant to be synchronous.
+
+If an error occurs, ->free_request() will be called to clean up the allocated
+netfs_io_request struct. If some parts of the request are in progress when an
+error occurs, the request will get partially completed if sufficient data has
+been read.
+
+Additionally, there is::
+
+ * void netfs_subreq_terminated(struct netfs_io_subrequest *subreq,
+ ssize_t transferred_or_error,
+ bool was_async);
+
+which should be called to complete a read subrequest. This is given the number
+of bytes transferred or a negative error code, plus a flag indicating whether
+the operation was asynchronous (ie. whether the follow-on processing can be
+done in the current context, given this may involve sleeping).
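+
+As an illustrative sketch only, a filesystem's RPC completion handler might end
+up calling it like this (``struct my_read_call`` and its fields are
+hypothetical)::
+
+	struct my_read_call {
+		struct netfs_io_subrequest *subreq;
+		int error;			/* 0 or a negative errno */
+		size_t bytes_received;
+	};
+
+	static void my_read_rpc_done(struct my_read_call *call)
+	{
+		/* Report either the byte count or the error; was_async is true
+		 * because this runs from an RPC completion context rather than
+		 * the original caller's context.
+		 */
+		ssize_t result = call->error ?
+			call->error : (ssize_t)call->bytes_received;
+
+		netfs_subreq_terminated(call->subreq, result, true);
+	}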
+
+
+Read Helper Structures
+----------------------
+
+The read helpers make use of a couple of structures to maintain the state of
+the read. The first is a structure that manages a read request as a whole::
+
+ struct netfs_io_request {
+ struct inode *inode;
+ struct address_space *mapping;
+ struct netfs_cache_resources cache_resources;
+ void *netfs_priv;
+ loff_t start;
+ size_t len;
+ loff_t i_size;
+ const struct netfs_request_ops *netfs_ops;
+ unsigned int debug_id;
+ ...
+ };
+
+The above fields are the ones the netfs can use. They are:
+
+ * ``inode``
+ * ``mapping``
+
+ The inode and the address space of the file being read from. The mapping
+ may or may not point to inode->i_data.
+
+ * ``cache_resources``
+
+ Resources for the local cache to use, if present.
+
+ * ``netfs_priv``
+
+ The network filesystem's private data. The value for this can be passed in
+ to the helper functions or set during the request.
+
+ * ``start``
+ * ``len``
+
+ The file position of the start of the read request and the length. These
+ may be altered by the ->expand_readahead() op.
+
+ * ``i_size``
+
+ The size of the file at the start of the request.
+
+ * ``netfs_ops``
+
+ A pointer to the operation table. The value for this is passed into the
+ helper functions.
+
+ * ``debug_id``
+
+ A number allocated to this operation that can be displayed in trace lines
+ for reference.
+
+
+The second structure is used to manage individual slices of the overall read
+request::
+
+ struct netfs_io_subrequest {
+ struct netfs_io_request *rreq;
+ loff_t start;
+ size_t len;
+ size_t transferred;
+ unsigned long flags;
+ unsigned short debug_index;
+ ...
+ };
+
+Each subrequest is expected to access a single source, though the helpers will
+handle falling back from one source type to another. The members are:
+
+ * ``rreq``
+
+ A pointer to the read request.
+
+ * ``start``
+ * ``len``
+
+ The file position of the start of this slice of the read request and the
+ length.
+
+ * ``transferred``
+
+ The amount of data transferred so far of the length of this slice. The
+ network filesystem or cache should start the operation this far into the
+ slice. If a short read occurs, the helpers will call again, having updated
+ this to reflect the amount read so far.
+
+ * ``flags``
+
+ Flags pertaining to the read. There are two of interest to the filesystem
+ or cache:
+
+ * ``NETFS_SREQ_CLEAR_TAIL``
+
+ This can be set to indicate that the remainder of the slice, from
+ transferred to len, should be cleared.
+
+ * ``NETFS_SREQ_SEEK_DATA_READ``
+
+ This is a hint to the cache that it might want to try skipping ahead to
+ the next data (ie. using SEEK_DATA).
+
+ * ``debug_index``
+
+ A number allocated to this slice that can be displayed in trace lines for
+ reference.
+
+
+Read Helper Operations
+----------------------
+
+The network filesystem must provide the read helpers with a table of operations
+through which it can issue requests and negotiate::
+
+ struct netfs_request_ops {
+ void (*init_request)(struct netfs_io_request *rreq, struct file *file);
+ void (*free_request)(struct netfs_io_request *rreq);
+ int (*begin_cache_operation)(struct netfs_io_request *rreq);
+ void (*expand_readahead)(struct netfs_io_request *rreq);
+ bool (*clamp_length)(struct netfs_io_subrequest *subreq);
+ void (*issue_read)(struct netfs_io_subrequest *subreq);
+ bool (*is_still_valid)(struct netfs_io_request *rreq);
+ int (*check_write_begin)(struct file *file, loff_t pos, unsigned len,
+ struct folio **foliop, void **_fsdata);
+ void (*done)(struct netfs_io_request *rreq);
+ };
+
+The operations are as follows:
+
+ * ``init_request()``
+
+ [Optional] This is called to initialise the request structure. It is given
+ the file for reference.
+
+ * ``free_request()``
+
+ [Optional] This is called as the request is being deallocated so that the
+ filesystem can clean up any state it has attached there.
+
+ * ``begin_cache_operation()``
+
+ [Optional] This is called to ask the network filesystem to call into the
+ cache (if present) to initialise the caching state for this read. The netfs
+ library module cannot access the cache directly, so the cache should call
+ something like fscache_begin_read_operation() to do this.
+
+ The cache gets to store its state in ->cache_resources and must set a table
+ of operations of its own there (though of a different type).
+
+ This should return 0 on success and an error code otherwise. If an error is
+ reported, the operation may proceed anyway, just without local caching (only
+ out of memory and interruption errors cause failure here).
+
+ * ``expand_readahead()``
+
+ [Optional] This is called to allow the filesystem to expand the size of a
+ readahead read request. The filesystem gets to expand the request in both
+ directions, though it's not permitted to reduce it as the numbers may
+ represent an allocation already made. If local caching is enabled, it gets
+ to expand the request first.
+
+ Expansion is communicated by changing ->start and ->len in the request
+ structure. Note that if any change is made, ->len must be increased by at
+ least as much as ->start is reduced.
+
+ * ``clamp_length()``
+
+ [Optional] This is called to allow the filesystem to reduce the size of a
+ subrequest. The filesystem can use this, for example, to chop up a request
+ that has to be split across multiple servers or to put multiple reads in
+ flight.
+
+   This should return true on success and false on failure.  A sketch of such
+   a method is given after this list.
+
+ * ``issue_read()``
+
+ [Required] The helpers use this to dispatch a subrequest to the server for
+ reading. In the subrequest, ->start, ->len and ->transferred indicate what
+ data should be read from the server.
+
+ There is no return value; the netfs_subreq_terminated() function should be
+ called to indicate whether or not the operation succeeded and how much data
+ it transferred. The filesystem also should not deal with setting folios
+ uptodate, unlocking them or dropping their refs - the helpers need to deal
+ with this as they have to coordinate with copying to the local cache.
+
+ Note that the helpers have the folios locked, but not pinned. It is
+ possible to use the ITER_XARRAY iov iterator to refer to the range of the
+ inode that is being operated upon without the need to allocate large bvec
+ tables.
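+
+   A sketch of such a method is given after this list.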
+
+ * ``is_still_valid()``
+
+ [Optional] This is called to find out if the data just read from the local
+ cache is still valid. It should return true if it is still valid and false
+ if not. If it's not still valid, it will be reread from the server.
+
+ * ``check_write_begin()``
+
+ [Optional] This is called from the netfs_write_begin() helper once it has
+ allocated/grabbed the folio to be modified to allow the filesystem to flush
+ conflicting state before allowing it to be modified.
+
+   It may unlock and discard the folio it was given and set the caller's folio
+   pointer to NULL. It should return 0 if everything is now fine (``*foliop``
+   left set) or if the op should be retried (``*foliop`` cleared), and any
+   other error code to abort the operation.
+
+ * ``done()``
+
+ [Optional] This is called after the folios in the request have all been
+ unlocked (and marked uptodate if applicable).
+
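+As referenced in the entries above, the sketches below illustrate one possible
+shape for ``clamp_length()`` and ``issue_read()``.  ``MY_RSIZE`` and
+``my_fetch_data()`` are hypothetical and not part of the library; treat this as
+a sketch under those assumptions rather than a reference implementation::
+
+	/* Hypothetical per-server maximum transfer size (a power of two). */
+	#define MY_RSIZE 65536
+
+	/* A possible ->clamp_length() method: cap each slice at the rsize so
+	 * that multiple reads can be put in flight.
+	 */
+	static bool my_clamp_length(struct netfs_io_subrequest *subreq)
+	{
+		subreq->len = min_t(size_t, subreq->len, MY_RSIZE);
+		return true;
+	}
+
+	/* A possible ->issue_read() method: point an ITER_XARRAY iterator at
+	 * the part of the pagecache covered by this slice, allowing for a
+	 * previous short read, then fetch the data synchronously.
+	 * my_fetch_data() stands in for the filesystem's actual transport and
+	 * is assumed to return the number of bytes read or a negative errno.
+	 */
+	static void my_issue_read(struct netfs_io_subrequest *subreq)
+	{
+		struct netfs_io_request *rreq = subreq->rreq;
+		struct iov_iter iter;
+		ssize_t ret;
+
+		iov_iter_xarray(&iter, READ, &rreq->mapping->i_pages,
+				subreq->start + subreq->transferred,
+				subreq->len - subreq->transferred);
+
+		ret = my_fetch_data(rreq->inode,
+				    subreq->start + subreq->transferred, &iter);
+
+		/* Completion happens in the caller's context here, hence
+		 * was_async is false.
+		 */
+		netfs_subreq_terminated(subreq, ret, false);
+	}
+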
+
+
+Read Helper Procedure
+---------------------
+
+The read helpers work by the following general procedure:
+
+ * Set up the request.
+
+ * For readahead, allow the local cache and then the network filesystem to
+ propose expansions to the read request. This is then proposed to the VM.
+ If the VM cannot fully perform the expansion, a partially expanded read will
+ be performed, though this may not get written to the cache in its entirety.
+
+ * Loop around slicing chunks off of the request to form subrequests:
+
+ * If a local cache is present, it gets to do the slicing, otherwise the
+ helpers just try to generate maximal slices.
+
+ * The network filesystem gets to clamp the size of each slice if it is to be
+ the source. This allows rsize and chunking to be implemented.
+
+   * The helpers issue a read from the cache, a read from the server or just
+     clear the slice, as appropriate.
+
+ * The next slice begins at the end of the last one.
+
+ * As slices finish being read, they terminate.
+
+ * When all the subrequests have terminated, the subrequests are assessed and
+ any that are short or have failed are reissued:
+
+ * Failed cache requests are issued against the server instead.
+
+ * Failed server requests just fail.
+
+ * Short reads against either source will be reissued against that source
+ provided they have transferred some more data:
+
+ * The cache may need to skip holes that it can't do DIO from.
+
+ * If NETFS_SREQ_CLEAR_TAIL was set, a short read will be cleared to the
+ end of the slice instead of reissuing.
+
+ * Once the data is read, the folios that have been fully read/cleared:
+
+ * Will be marked uptodate.
+
+ * If a cache is present, will be marked with PG_fscache.
+
+   * Will be unlocked.
+
+ * Any folios that need writing to the cache will then have DIO writes issued.
+
+ * Synchronous operations will wait for reading to be complete.
+
+ * Writes to the cache will proceed asynchronously and the folios will have the
+ PG_fscache mark removed when that completes.
+
+ * The request structures will be cleaned up when everything has completed.
+
+
+Read Helper Cache API
+---------------------
+
+When implementing a local cache to be used by the read helpers, two things are
+required: some way for the network filesystem to initialise the caching for a
+read request and a table of operations for the helpers to call.
+
+The network filesystem's ->begin_cache_operation() method is called to set up a
+cache and this must call into the cache to do the work. If using fscache, for
+example, the cache would call::
+
+ int fscache_begin_read_operation(struct netfs_io_request *rreq,
+ struct fscache_cookie *cookie);
+
+passing in the request pointer and the cookie corresponding to the file.
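+
+By way of a hedged example, the network filesystem's side of this might look
+roughly like the following sketch (assuming the inode context helpers described
+earlier; a real implementation would typically be conditional on fscache being
+enabled)::
+
+	static int my_begin_cache_operation(struct netfs_io_request *rreq)
+	{
+		/* Hand the request over to fscache, using the cookie kept in
+		 * the per-inode netfs context (NULL if caching isn't enabled
+		 * on this inode).
+		 */
+		return fscache_begin_read_operation(rreq,
+				netfs_i_cookie(netfs_inode(rreq->inode)));
+	}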
+
+The netfs_io_request object contains a place for the cache to hang its
+state::
+
+ struct netfs_cache_resources {
+ const struct netfs_cache_ops *ops;
+ void *cache_priv;
+ void *cache_priv2;
+ };
+
+This contains an operations table pointer and two private pointers. The
+operation table looks like the following::
+
+ struct netfs_cache_ops {
+ void (*end_operation)(struct netfs_cache_resources *cres);
+
+ void (*expand_readahead)(struct netfs_cache_resources *cres,
+ loff_t *_start, size_t *_len, loff_t i_size);
+
+ enum netfs_io_source (*prepare_read)(struct netfs_io_subrequest *subreq,
+ loff_t i_size);
+
+ int (*read)(struct netfs_cache_resources *cres,
+ loff_t start_pos,
+ struct iov_iter *iter,
+ bool seek_data,
+ netfs_io_terminated_t term_func,
+ void *term_func_priv);
+
+ int (*prepare_write)(struct netfs_cache_resources *cres,
+ loff_t *_start, size_t *_len, loff_t i_size,
+ bool no_space_allocated_yet);
+
+ int (*write)(struct netfs_cache_resources *cres,
+ loff_t start_pos,
+ struct iov_iter *iter,
+ netfs_io_terminated_t term_func,
+ void *term_func_priv);
+
+ int (*query_occupancy)(struct netfs_cache_resources *cres,
+ loff_t start, size_t len, size_t granularity,
+ loff_t *_data_start, size_t *_data_len);
+ };
+
+With a termination handler function pointer::
+
+ typedef void (*netfs_io_terminated_t)(void *priv,
+ ssize_t transferred_or_error,
+ bool was_async);
+
+The methods defined in the table are:
+
+ * ``end_operation()``
+
+ [Required] Called to clean up the resources at the end of the read request.
+
+ * ``expand_readahead()``
+
+ [Optional] Called at the beginning of a netfs_readahead() operation to allow
+ the cache to expand a request in either direction. This allows the cache to
+ size the request appropriately for the cache granularity.
+
+   The function is passed pointers to the start and length in its parameters,
+   plus the size of the file for reference, and adjusts the start and length
+   appropriately.
+
+ * ``prepare_read()``
+
+   [Required] Called to configure the next slice of a request.  ->start and
+   ->len in the subrequest indicate where and how big the next slice can be;
+   the cache gets to reduce the length to match its granularity requirements.
+   It should return one of:
+
+     * ``NETFS_FILL_WITH_ZEROES``
+     * ``NETFS_DOWNLOAD_FROM_SERVER``
+     * ``NETFS_READ_FROM_CACHE``
+     * ``NETFS_INVALID_READ``
+
+   to indicate whether the slice should just be cleared or whether it should be
+   downloaded from the server or read from the cache - or whether slicing
+   should be given up at the current point.
+
+ * ``read()``
+
+ [Required] Called to read from the cache. The start file offset is given
+ along with an iterator to read to, which gives the length also. It can be
+ given a hint requesting that it seek forward from that start position for
+ data.
+
+ Also provided is a pointer to a termination handler function and private
+ data to pass to that function. The termination function should be called
+ with the number of bytes transferred or an error code, plus a flag
+ indicating whether the termination is definitely happening in the caller's
+ context.
+
+ * ``prepare_write()``
+
+ [Required] Called to prepare a write to the cache to take place. This
+ involves checking to see whether the cache has sufficient space to honour
+ the write. ``*_start`` and ``*_len`` indicate the region to be written; the
+ region can be shrunk or it can be expanded to a page boundary either way as
+ necessary to align for direct I/O. i_size holds the size of the object and
+ is provided for reference. no_space_allocated_yet is set to true if the
+ caller is certain that no data has been written to that region - for example
+ if it tried to do a read from there already.
+
+ * ``write()``
+
+ [Required] Called to write to the cache. The start file offset is given
+ along with an iterator to write from, which gives the length also.
+
+ Also provided is a pointer to a termination handler function and private
+ data to pass to that function. The termination function should be called
+ with the number of bytes transferred or an error code, plus a flag
+ indicating whether the termination is definitely happening in the caller's
+ context.
+
+ * ``query_occupancy()``
+
+ [Required] Called to find out where the next piece of data is within a
+ particular region of the cache. The start and length of the region to be
+ queried are passed in, along with the granularity to which the answer needs
+ to be aligned. The function passes back the start and length of the data,
+ if any, available within that region. Note that there may be a hole at the
+ front.
+
+ It returns 0 if some data was found, -ENODATA if there was no usable data
+ within the region or -ENOBUFS if there is no caching on this file.
+
+Note that these methods are passed a pointer to the cache resource structure,
+not the read request structure, as they could also be used in situations where
+there isn't a read request, such as writing dirty data to the cache.
+
+
+API Function Reference
+======================
+
+.. kernel-doc:: include/linux/netfs.h
+.. kernel-doc:: fs/netfs/buffered_read.c
+.. kernel-doc:: fs/netfs/io.c