diff options
Diffstat (limited to 'Documentation/filesystems/nfs')
-rw-r--r-- | Documentation/filesystems/nfs/client-identifier.rst | 216 | ||||
-rw-r--r-- | Documentation/filesystems/nfs/exporting.rst | 87 | ||||
-rw-r--r-- | Documentation/filesystems/nfs/index.rst | 4 | ||||
-rw-r--r-- | Documentation/filesystems/nfs/localio.rst | 357 | ||||
-rw-r--r-- | Documentation/filesystems/nfs/reexport.rst | 117 | ||||
-rw-r--r-- | Documentation/filesystems/nfs/rpc-cache.rst | 2 | ||||
-rw-r--r-- | Documentation/filesystems/nfs/rpc-server-gss.rst | 11 |
7 files changed, 781 insertions, 13 deletions
diff --git a/Documentation/filesystems/nfs/client-identifier.rst b/Documentation/filesystems/nfs/client-identifier.rst new file mode 100644 index 000000000000..4804441155f5 --- /dev/null +++ b/Documentation/filesystems/nfs/client-identifier.rst @@ -0,0 +1,216 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================= +NFSv4 client identifier +======================= + +This document explains how the NFSv4 protocol identifies client +instances in order to maintain file open and lock state during +system restarts. A special identifier and principal are maintained +on each client. These can be set by administrators, scripts +provided by site administrators, or tools provided by Linux +distributors. + +There are risks if a client's NFSv4 identifier and its principal +are not chosen carefully. + + +Introduction +------------ + +The NFSv4 protocol uses "lease-based file locking". Leases help +NFSv4 servers provide file lock guarantees and manage their +resources. + +Simply put, an NFSv4 server creates a lease for each NFSv4 client. +The server collects each client's file open and lock state under +the lease for that client. + +The client is responsible for periodically renewing its leases. +While a lease remains valid, the server holding that lease +guarantees the file locks the client has created remain in place. + +If a client stops renewing its lease (for example, if it crashes), +the NFSv4 protocol allows the server to remove the client's open +and lock state after a certain period of time. When a client +restarts, it indicates to servers that open and lock state +associated with its previous leases is no longer valid and can be +destroyed immediately. + +In addition, each NFSv4 server manages a persistent list of client +leases. When the server restarts and clients attempt to recover +their state, the server uses this list to distinguish amongst +clients that held state before the server restarted and clients +sending fresh OPEN and LOCK requests. This enables file locks to +persist safely across server restarts. + +NFSv4 client identifiers +------------------------ + +Each NFSv4 client presents an identifier to NFSv4 servers so that +they can associate the client with its lease. Each client's +identifier consists of two elements: + + - co_ownerid: An arbitrary but fixed string. + + - boot verifier: A 64-bit incarnation verifier that enables a + server to distinguish successive boot epochs of the same client. + +The NFSv4.0 specification refers to these two items as an +"nfs_client_id4". The NFSv4.1 specification refers to these two +items as a "client_owner4". + +NFSv4 servers tie this identifier to the principal and security +flavor that the client used when presenting it. Servers use this +principal to authorize subsequent lease modification operations +sent by the client. Effectively this principal is a third element of +the identifier. + +As part of the identity presented to servers, a good +"co_ownerid" string has several important properties: + + - The "co_ownerid" string identifies the client during reboot + recovery, therefore the string is persistent across client + reboots. + - The "co_ownerid" string helps servers distinguish the client + from others, therefore the string is globally unique. Note + that there is no central authority that assigns "co_ownerid" + strings. + - Because it often appears on the network in the clear, the + "co_ownerid" string does not reveal private information about + the client itself. + - The content of the "co_ownerid" string is set and unchanging + before the client attempts NFSv4 mounts after a restart. + - The NFSv4 protocol places a 1024-byte limit on the size of the + "co_ownerid" string. + +Protecting NFSv4 lease state +---------------------------- + +NFSv4 servers utilize the "client_owner4" as described above to +assign a unique lease to each client. Under this scheme, there are +circumstances where clients can interfere with each other. This is +referred to as "lease stealing". + +If distinct clients present the same "co_ownerid" string and use +the same principal (for example, AUTH_SYS and UID 0), a server is +unable to tell that the clients are not the same. Each distinct +client presents a different boot verifier, so it appears to the +server as if there is one client that is rebooting frequently. +Neither client can maintain open or lock state in this scenario. + +If distinct clients present the same "co_ownerid" string and use +distinct principals, the server is likely to allow the first client +to operate normally but reject subsequent clients with the same +"co_ownerid" string. + +If a client's "co_ownerid" string or principal are not stable, +state recovery after a server or client reboot is not guaranteed. +If a client unexpectedly restarts but presents a different +"co_ownerid" string or principal to the server, the server orphans +the client's previous open and lock state. This blocks access to +locked files until the server removes the orphaned state. + +If the server restarts and a client presents a changed "co_ownerid" +string or principal to the server, the server will not allow the +client to reclaim its open and lock state, and may give those locks +to other clients in the meantime. This is referred to as "lock +stealing". + +Lease stealing and lock stealing increase the potential for denial +of service and in rare cases even data corruption. + +Selecting an appropriate client identifier +------------------------------------------ + +By default, the Linux NFSv4 client implementation constructs its +"co_ownerid" string starting with the words "Linux NFS" followed by +the client's UTS node name (the same node name, incidentally, that +is used as the "machine name" in an AUTH_SYS credential). In small +deployments, this construction is usually adequate. Often, however, +the node name by itself is not adequately unique, and can change +unexpectedly. Problematic situations include: + + - NFS-root (diskless) clients, where the local DHCP server (or + equivalent) does not provide a unique host name. + + - "Containers" within a single Linux host. If each container has + a separate network namespace, but does not use the UTS namespace + to provide a unique host name, then there can be multiple NFS + client instances with the same host name. + + - Clients across multiple administrative domains that access a + common NFS server. If hostnames are not assigned centrally + then uniqueness cannot be guaranteed unless a domain name is + included in the hostname. + +Linux provides two mechanisms to add uniqueness to its "co_ownerid" +string: + + nfs.nfs4_unique_id + This module parameter can set an arbitrary uniquifier string + via the kernel command line, or when the "nfs" module is + loaded. + + /sys/fs/nfs/net/nfs_client/identifier + This virtual file, available since Linux 5.3, is local to the + network namespace in which it is accessed and so can provide + distinction between network namespaces (containers) when the + hostname remains uniform. + +Note that this file is empty on name-space creation. If the +container system has access to some sort of per-container identity +then that uniquifier can be used. For example, a uniquifier might +be formed at boot using the container's internal identifier: + + sha256sum /etc/machine-id | awk '{print $1}' \\ + > /sys/fs/nfs/net/nfs_client/identifier + +Security considerations +----------------------- + +The use of cryptographic security for lease management operations +is strongly encouraged. + +If NFS with Kerberos is not configured, a Linux NFSv4 client uses +AUTH_SYS and UID 0 as the principal part of its client identity. +This configuration is not only insecure, it increases the risk of +lease and lock stealing. However, it might be the only choice for +client configurations that have no local persistent storage. +"co_ownerid" string uniqueness and persistence is critical in this +case. + +When a Kerberos keytab is present on a Linux NFS client, the client +attempts to use one of the principals in that keytab when +identifying itself to servers. The "sec=" mount option does not +control this behavior. Alternately, a single-user client with a +Kerberos principal can use that principal in place of the client's +host principal. + +Using Kerberos for this purpose enables the client and server to +use the same lease for operations covered by all "sec=" settings. +Additionally, the Linux NFS client uses the RPCSEC_GSS security +flavor with Kerberos and the integrity QOS to prevent in-transit +modification of lease modification requests. + +Additional notes +---------------- +The Linux NFSv4 client establishes a single lease on each NFSv4 +server it accesses. NFSv4 mounts from a Linux NFSv4 client of a +particular server then share that lease. + +Once a client establishes open and lock state, the NFSv4 protocol +enables lease state to transition to other servers, following data +that has been migrated. This hides data migration completely from +running applications. The Linux NFSv4 client facilitates state +migration by presenting the same "client_owner4" to all servers it +encounters. + +======== +See Also +======== + + - nfs(5) + - kerberos(7) + - RFC 7530 for the NFSv4.0 specification + - RFC 8881 for the NFSv4.1 specification. diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst index 33d588a01ace..de64d2d002a2 100644 --- a/Documentation/filesystems/nfs/exporting.rst +++ b/Documentation/filesystems/nfs/exporting.rst @@ -122,12 +122,9 @@ are exportable by setting the s_export_op field in the struct super_block. This field must point to a "struct export_operations" struct which has the following members: - encode_fh (optional) - Takes a dentry and creates a filehandle fragment which can later be used - to find or create a dentry for the same object. The default - implementation creates a filehandle fragment that encodes a 32bit inode - and generation number for the inode encoded, and if necessary the - same information for the parent. + encode_fh (mandatory) + Takes a dentry and creates a filehandle fragment which may later be used + to find or create a dentry for the same object. fh_to_dentry (mandatory) Given a filehandle fragment, this should find the implied object and @@ -154,6 +151,11 @@ struct which has the following members: to find potential names, and matches inode numbers to find the correct match. + flags + Some filesystems may need to be handled differently than others. The + export_operations struct also includes a flags field that allows the + filesystem to communicate such information to nfsd. See the Export + Operations Flags section below for more explanation. A filehandle fragment consists of an array of 1 or more 4byte words, together with a one byte "type". @@ -163,3 +165,76 @@ generated by encode_fh, in which case it will have been padded with nuls. Rather, the encode_fh routine should choose a "type" which indicates the decode_fh how much of the filehandle is valid, and how it should be interpreted. + +Export Operations Flags +----------------------- +In addition to the operation vector pointers, struct export_operations also +contains a "flags" field that allows the filesystem to communicate to nfsd +that it may want to do things differently when dealing with it. The +following flags are defined: + + EXPORT_OP_NOWCC - disable NFSv3 WCC attributes on this filesystem + RFC 1813 recommends that servers always send weak cache consistency + (WCC) data to the client after each operation. The server should + atomically collect attributes about the inode, do an operation on it, + and then collect the attributes afterward. This allows the client to + skip issuing GETATTRs in some situations but means that the server + is calling vfs_getattr for almost all RPCs. On some filesystems + (particularly those that are clustered or networked) this is expensive + and atomicity is difficult to guarantee. This flag indicates to nfsd + that it should skip providing WCC attributes to the client in NFSv3 + replies when doing operations on this filesystem. Consider enabling + this on filesystems that have an expensive ->getattr inode operation, + or when atomicity between pre and post operation attribute collection + is impossible to guarantee. + + EXPORT_OP_NOSUBTREECHK - disallow subtree checking on this fs + Many NFS operations deal with filehandles, which the server must then + vet to ensure that they live inside of an exported tree. When the + export consists of an entire filesystem, this is trivial. nfsd can just + ensure that the filehandle live on the filesystem. When only part of a + filesystem is exported however, then nfsd must walk the ancestors of the + inode to ensure that it's within an exported subtree. This is an + expensive operation and not all filesystems can support it properly. + This flag exempts the filesystem from subtree checking and causes + exportfs to get back an error if it tries to enable subtree checking + on it. + + EXPORT_OP_CLOSE_BEFORE_UNLINK - always close cached files before unlinking + On some exportable filesystems (such as NFS) unlinking a file that + is still open can cause a fair bit of extra work. For instance, + the NFS client will do a "sillyrename" to ensure that the file + sticks around while it's still open. When reexporting, that open + file is held by nfsd so we usually end up doing a sillyrename, and + then immediately deleting the sillyrenamed file just afterward when + the link count actually goes to zero. Sometimes this delete can race + with other operations (for instance an rmdir of the parent directory). + This flag causes nfsd to close any open files for this inode _before_ + calling into the vfs to do an unlink or a rename that would replace + an existing file. + + EXPORT_OP_REMOTE_FS - Backing storage for this filesystem is remote + PF_LOCAL_THROTTLE exists for loopback NFSD, where a thread needs to + write to one bdi (the final bdi) in order to free up writes queued + to another bdi (the client bdi). Such threads get a private balance + of dirty pages so that dirty pages for the client bdi do not imact + the daemon writing to the final bdi. For filesystems whose durable + storage is not local (such as exported NFS filesystems), this + constraint has negative consequences. EXPORT_OP_REMOTE_FS enables + an export to disable writeback throttling. + + EXPORT_OP_NOATOMIC_ATTR - Filesystem does not update attributes atomically + EXPORT_OP_NOATOMIC_ATTR indicates that the exported filesystem + cannot provide the semantics required by the "atomic" boolean in + NFSv4's change_info4. This boolean indicates to a client whether the + returned before and after change attributes were obtained atomically + with the respect to the requested metadata operation (UNLINK, + OPEN/CREATE, MKDIR, etc). + + EXPORT_OP_FLUSH_ON_CLOSE - Filesystem flushes file data on close(2) + On most filesystems, inodes can remain under writeback after the + file is closed. NFSD relies on client activity or local flusher + threads to handle writeback. Certain filesystems, such as NFS, flush + all of an inode's dirty data on last close. Exports that behave this + way should set EXPORT_OP_FLUSH_ON_CLOSE so that NFSD knows to skip + waiting for writeback when closing such files. diff --git a/Documentation/filesystems/nfs/index.rst b/Documentation/filesystems/nfs/index.rst index 65805624e39b..95c2c009874c 100644 --- a/Documentation/filesystems/nfs/index.rst +++ b/Documentation/filesystems/nfs/index.rst @@ -6,8 +6,12 @@ NFS .. toctree:: :maxdepth: 1 + client-identifier + exporting + localio pnfs rpc-cache rpc-server-gss nfs41-server knfsd-stats + reexport diff --git a/Documentation/filesystems/nfs/localio.rst b/Documentation/filesystems/nfs/localio.rst new file mode 100644 index 000000000000..79808b37d745 --- /dev/null +++ b/Documentation/filesystems/nfs/localio.rst @@ -0,0 +1,357 @@ +=========== +NFS LOCALIO +=========== + +Overview +======== + +The LOCALIO auxiliary RPC protocol allows the Linux NFS client and +server to reliably handshake to determine if they are on the same +host. Select "NFS client and server support for LOCALIO auxiliary +protocol" in menuconfig to enable CONFIG_NFS_LOCALIO in the kernel +config (both CONFIG_NFS_FS and CONFIG_NFSD must also be enabled). + +Once an NFS client and server handshake as "local", the client will +bypass the network RPC protocol for read, write and commit operations. +Due to this XDR and RPC bypass, these operations will operate faster. + +The LOCALIO auxiliary protocol's implementation, which uses the same +connection as NFS traffic, follows the pattern established by the NFS +ACL protocol extension. + +The LOCALIO auxiliary protocol is needed to allow robust discovery of +clients local to their servers. In a private implementation that +preceded use of this LOCALIO protocol, a fragile sockaddr network +address based match against all local network interfaces was attempted. +But unlike the LOCALIO protocol, the sockaddr-based matching didn't +handle use of iptables or containers. + +The robust handshake between local client and server is just the +beginning, the ultimate use case this locality makes possible is the +client is able to open files and issue reads, writes and commits +directly to the server without having to go over the network. The +requirement is to perform these loopback NFS operations as efficiently +as possible, this is particularly useful for container use cases +(e.g. kubernetes) where it is possible to run an IO job local to the +server. + +The performance advantage realized from LOCALIO's ability to bypass +using XDR and RPC for reads, writes and commits can be extreme, e.g.: + +fio for 20 secs with directio, qd of 8, 16 libaio threads: + - With LOCALIO: + 4K read: IOPS=979k, BW=3825MiB/s (4011MB/s)(74.7GiB/20002msec) + 4K write: IOPS=165k, BW=646MiB/s (678MB/s)(12.6GiB/20002msec) + 128K read: IOPS=402k, BW=49.1GiB/s (52.7GB/s)(982GiB/20002msec) + 128K write: IOPS=11.5k, BW=1433MiB/s (1503MB/s)(28.0GiB/20004msec) + + - Without LOCALIO: + 4K read: IOPS=79.2k, BW=309MiB/s (324MB/s)(6188MiB/20003msec) + 4K write: IOPS=59.8k, BW=234MiB/s (245MB/s)(4671MiB/20002msec) + 128K read: IOPS=33.9k, BW=4234MiB/s (4440MB/s)(82.7GiB/20004msec) + 128K write: IOPS=11.5k, BW=1434MiB/s (1504MB/s)(28.0GiB/20011msec) + +fio for 20 secs with directio, qd of 8, 1 libaio thread: + - With LOCALIO: + 4K read: IOPS=230k, BW=898MiB/s (941MB/s)(17.5GiB/20001msec) + 4K write: IOPS=22.6k, BW=88.3MiB/s (92.6MB/s)(1766MiB/20001msec) + 128K read: IOPS=38.8k, BW=4855MiB/s (5091MB/s)(94.8GiB/20001msec) + 128K write: IOPS=11.4k, BW=1428MiB/s (1497MB/s)(27.9GiB/20001msec) + + - Without LOCALIO: + 4K read: IOPS=77.1k, BW=301MiB/s (316MB/s)(6022MiB/20001msec) + 4K write: IOPS=32.8k, BW=128MiB/s (135MB/s)(2566MiB/20001msec) + 128K read: IOPS=24.4k, BW=3050MiB/s (3198MB/s)(59.6GiB/20001msec) + 128K write: IOPS=11.4k, BW=1430MiB/s (1500MB/s)(27.9GiB/20001msec) + +FAQ +=== + +1. What are the use cases for LOCALIO? + + a. Workloads where the NFS client and server are on the same host + realize improved IO performance. In particular, it is common when + running containerised workloads for jobs to find themselves + running on the same host as the knfsd server being used for + storage. + +2. What are the requirements for LOCALIO? + + a. Bypass use of the network RPC protocol as much as possible. This + includes bypassing XDR and RPC for open, read, write and commit + operations. + b. Allow client and server to autonomously discover if they are + running local to each other without making any assumptions about + the local network topology. + c. Support the use of containers by being compatible with relevant + namespaces (e.g. network, user, mount). + d. Support all versions of NFS. NFSv3 is of particular importance + because it has wide enterprise usage and pNFS flexfiles makes use + of it for the data path. + +3. Why doesn’t LOCALIO just compare IP addresses or hostnames when + deciding if the NFS client and server are co-located on the same + host? + + Since one of the main use cases is containerised workloads, we cannot + assume that IP addresses will be shared between the client and + server. This sets up a requirement for a handshake protocol that + needs to go over the same connection as the NFS traffic in order to + identify that the client and the server really are running on the + same host. The handshake uses a secret that is sent over the wire, + and can be verified by both parties by comparing with a value stored + in shared kernel memory if they are truly co-located. + +4. Does LOCALIO improve pNFS flexfiles? + + Yes, LOCALIO complements pNFS flexfiles by allowing it to take + advantage of NFS client and server locality. Policy that initiates + client IO as closely to the server where the data is stored naturally + benefits from the data path optimization LOCALIO provides. + +5. Why not develop a new pNFS layout to enable LOCALIO? + + A new pNFS layout could be developed, but doing so would put the + onus on the server to somehow discover that the client is co-located + when deciding to hand out the layout. + There is value in a simpler approach (as provided by LOCALIO) that + allows the NFS client to negotiate and leverage locality without + requiring more elaborate modeling and discovery of such locality in a + more centralized manner. + +6. Why is having the client perform a server-side file OPEN, without + using RPC, beneficial? Is the benefit pNFS specific? + + Avoiding the use of XDR and RPC for file opens is beneficial to + performance regardless of whether pNFS is used. Especially when + dealing with small files its best to avoid going over the wire + whenever possible, otherwise it could reduce or even negate the + benefits of avoiding the wire for doing the small file I/O itself. + Given LOCALIO's requirements the current approach of having the + client perform a server-side file open, without using RPC, is ideal. + If in the future requirements change then we can adapt accordingly. + +7. Why is LOCALIO only supported with UNIX Authentication (AUTH_UNIX)? + + Strong authentication is usually tied to the connection itself. It + works by establishing a context that is cached by the server, and + that acts as the key for discovering the authorisation token, which + can then be passed to rpc.mountd to complete the authentication + process. On the other hand, in the case of AUTH_UNIX, the credential + that was passed over the wire is used directly as the key in the + upcall to rpc.mountd. This simplifies the authentication process, and + so makes AUTH_UNIX easier to support. + +8. How do export options that translate RPC user IDs behave for LOCALIO + operations (eg. root_squash, all_squash)? + + Export options that translate user IDs are managed by nfsd_setuser() + which is called by nfsd_setuser_and_check_port() which is called by + __fh_verify(). So they get handled exactly the same way for LOCALIO + as they do for non-LOCALIO. + +9. How does LOCALIO make certain that object lifetimes are managed + properly given NFSD and NFS operate in different contexts? + + See the detailed "NFS Client and Server Interlock" section below. + +RPC +=== + +The LOCALIO auxiliary RPC protocol consists of a single "UUID_IS_LOCAL" +RPC method that allows the Linux NFS client to verify the local Linux +NFS server can see the nonce (single-use UUID) the client generated and +made available in nfs_common. This protocol isn't part of an IETF +standard, nor does it need to be considering it is Linux-to-Linux +auxiliary RPC protocol that amounts to an implementation detail. + +The UUID_IS_LOCAL method encodes the client generated uuid_t in terms of +the fixed UUID_SIZE (16 bytes). The fixed size opaque encode and decode +XDR methods are used instead of the less efficient variable sized +methods. + +The RPC program number for the NFS_LOCALIO_PROGRAM is 400122 (as assigned +by IANA, see https://www.iana.org/assignments/rpc-program-numbers/ ): +Linux Kernel Organization 400122 nfslocalio + +The LOCALIO protocol spec in rpcgen syntax is:: + + /* raw RFC 9562 UUID */ + #define UUID_SIZE 16 + typedef u8 uuid_t<UUID_SIZE>; + + program NFS_LOCALIO_PROGRAM { + version LOCALIO_V1 { + void + NULL(void) = 0; + + void + UUID_IS_LOCAL(uuid_t) = 1; + } = 1; + } = 400122; + +LOCALIO uses the same transport connection as NFS traffic. As such, +LOCALIO is not registered with rpcbind. + +NFS Common and Client/Server Handshake +====================================== + +fs/nfs_common/nfslocalio.c provides interfaces that enable an NFS client +to generate a nonce (single-use UUID) and associated short-lived +nfs_uuid_t struct, register it with nfs_common for subsequent lookup and +verification by the NFS server and if matched the NFS server populates +members in the nfs_uuid_t struct. The NFS client then uses nfs_common to +transfer the nfs_uuid_t from its nfs_uuids to the nn->nfsd_serv +clients_list from the nfs_common's uuids_list. See: +fs/nfs/localio.c:nfs_local_probe() + +nfs_common's nfs_uuids list is the basis for LOCALIO enablement, as such +it has members that point to nfsd memory for direct use by the client +(e.g. 'net' is the server's network namespace, through it the client can +access nn->nfsd_serv with proper rcu read access). It is this client +and server synchronization that enables advanced usage and lifetime of +objects to span from the host kernel's nfsd to per-container knfsd +instances that are connected to nfs client's running on the same local +host. + +NFS Client and Server Interlock +=============================== + +LOCALIO provides the nfs_uuid_t object and associated interfaces to +allow proper network namespace (net-ns) and NFSD object refcounting. + +LOCALIO required the introduction and use of NFSD's percpu nfsd_net_ref +to interlock nfsd_shutdown_net() and nfsd_open_local_fh(), to ensure +each net-ns is not destroyed while in use by nfsd_open_local_fh(), and +warrants a more detailed explanation: + + nfsd_open_local_fh() uses nfsd_net_try_get() before opening its + nfsd_file handle and then the caller (NFS client) must drop the + reference for the nfsd_file and associated net-ns using + nfsd_file_put_local() once it has completed its IO. + + This interlock working relies heavily on nfsd_open_local_fh() being + afforded the ability to safely deal with the possibility that the + NFSD's net-ns (and nfsd_net by association) may have been destroyed + by nfsd_destroy_serv() via nfsd_shutdown_net(). + +This interlock of the NFS client and server has been verified to fix an +easy to hit crash that would occur if an NFSD instance running in a +container, with a LOCALIO client mounted, is shutdown. Upon restart of +the container and associated NFSD, the client would go on to crash due +to NULL pointer dereference that occurred due to the LOCALIO client's +attempting to nfsd_open_local_fh() without having a proper reference on +NFSD's net-ns. + +NFS Client issues IO instead of Server +====================================== + +Because LOCALIO is focused on protocol bypass to achieve improved IO +performance, alternatives to the traditional NFS wire protocol (SUNRPC +with XDR) must be provided to access the backing filesystem. + +See fs/nfs/localio.c:nfs_local_open_fh() and +fs/nfsd/localio.c:nfsd_open_local_fh() for the interface that makes +focused use of select nfs server objects to allow a client local to a +server to open a file pointer without needing to go over the network. + +The client's fs/nfs/localio.c:nfs_local_open_fh() will call into the +server's fs/nfsd/localio.c:nfsd_open_local_fh() and carefully access +both the associated nfsd network namespace and nn->nfsd_serv in terms of +RCU. If nfsd_open_local_fh() finds that the client no longer sees valid +nfsd objects (be it struct net or nn->nfsd_serv) it returns -ENXIO +to nfs_local_open_fh() and the client will try to reestablish the +LOCALIO resources needed by calling nfs_local_probe() again. This +recovery is needed if/when an nfsd instance running in a container were +to reboot while a LOCALIO client is connected to it. + +Once the client has an open nfsd_file pointer it will issue reads, +writes and commits directly to the underlying local filesystem (normally +done by the nfs server). As such, for these operations, the NFS client +is issuing IO to the underlying local filesystem that it is sharing with +the NFS server. See: fs/nfs/localio.c:nfs_local_doio() and +fs/nfs/localio.c:nfs_local_commit(). + +With normal NFS that makes use of RPC to issue IO to the server, if an +application uses O_DIRECT the NFS client will bypass the pagecache but +the NFS server will not. The NFS server's use of buffered IO affords +applications to be less precise with their alignment when issuing IO to +the NFS client. But if all applications properly align their IO, LOCALIO +can be configured to use end-to-end O_DIRECT semantics from the NFS +client to the underlying local filesystem, that it is sharing with +the NFS server, by setting the 'localio_O_DIRECT_semantics' nfs module +parameter to Y, e.g.: + + echo Y > /sys/module/nfs/parameters/localio_O_DIRECT_semantics + +Once enabled, it will cause LOCALIO to use end-to-end O_DIRECT semantics +(but again, this may cause IO to fail if applications do not properly +align their IO). + +Security +======== + +LOCALIO is only supported when UNIX-style authentication (AUTH_UNIX, aka +AUTH_SYS) is used. + +Care is taken to ensure the same NFS security mechanisms are used +(authentication, etc) regardless of whether LOCALIO or regular NFS +access is used. The auth_domain established as part of the traditional +NFS client access to the NFS server is also used for LOCALIO. + +Relative to containers, LOCALIO gives the client access to the network +namespace the server has. This is required to allow the client to access +the server's per-namespace nfsd_net struct. With traditional NFS, the +client is afforded this same level of access (albeit in terms of the NFS +protocol via SUNRPC). No other namespaces (user, mount, etc) have been +altered or purposely extended from the server to the client. + +Module Parameters +================= + +/sys/module/nfs/parameters/localio_enabled (bool) +controls if LOCALIO is enabled, defaults to Y. If client and server are +local but 'localio_enabled' is set to N then LOCALIO will not be used. + +/sys/module/nfs/parameters/localio_O_DIRECT_semantics (bool) +controls if O_DIRECT extends down to the underlying filesystem, defaults +to N. Application IO must be logical blocksize aligned, otherwise +O_DIRECT will fail. + +/sys/module/nfsv3/parameters/nfs3_localio_probe_throttle (uint) +controls if NFSv3 read and write IOs will trigger (re)enabling of +LOCALIO every N (nfs3_localio_probe_throttle) IOs, defaults to 0 +(disabled). Must be power-of-2, admin keeps all the pieces if they +misconfigure (too low a value or non-power-of-2). + +Testing +======= + +The LOCALIO auxiliary protocol and associated NFS LOCALIO read, write +and commit access have proven stable against various test scenarios: + +- Client and server both on the same host. + +- All permutations of client and server support enablement for both + local and remote client and server. + +- Testing against NFS storage products that don't support the LOCALIO + protocol was also performed. + +- Client on host, server within a container (for both v3 and v4.2). + The container testing was in terms of podman managed containers and + includes successful container stop/restart scenario. + +- Formalizing these test scenarios in terms of existing test + infrastructure is on-going. Initial regular coverage is provided in + terms of ktest running xfstests against a LOCALIO-enabled NFS loopback + mount configuration, and includes lockdep and KASAN coverage, see: + https://evilpiepirate.org/~testdashboard/ci?user=snitzer&branch=snitm-nfs-next + https://github.com/koverstreet/ktest + +- Various kdevops testing (in terms of "Chuck's BuildBot") has been + performed to regularly verify the LOCALIO changes haven't caused any + regressions to non-LOCALIO NFS use cases. + +- All of Hammerspace's various sanity tests pass with LOCALIO enabled + (this includes numerous pNFS and flexfiles tests). diff --git a/Documentation/filesystems/nfs/reexport.rst b/Documentation/filesystems/nfs/reexport.rst new file mode 100644 index 000000000000..044be965d75e --- /dev/null +++ b/Documentation/filesystems/nfs/reexport.rst @@ -0,0 +1,117 @@ +Reexporting NFS filesystems +=========================== + +Overview +-------- + +It is possible to reexport an NFS filesystem over NFS. However, this +feature comes with a number of limitations. Before trying it, we +recommend some careful research to determine whether it will work for +your purposes. + +A discussion of current known limitations follows. + +"fsid=" required, crossmnt broken +--------------------------------- + +We require the "fsid=" export option on any reexport of an NFS +filesystem. You can use "uuidgen -r" to generate a unique argument. + +The "crossmnt" export does not propagate "fsid=", so it will not allow +traversing into further nfs filesystems; if you wish to export nfs +filesystems mounted under the exported filesystem, you'll need to export +them explicitly, assigning each its own unique "fsid= option. + +Reboot recovery +--------------- + +The NFS protocol's normal reboot recovery mechanisms don't work for the +case when the reexport server reboots because the source server has not +rebooted, and so it is not in grace. Since the source server is not in +grace, it cannot offer any guarantees that the file won't have been +changed between the locks getting lost and any attempt to recover them. +The same applies to delegations and any associated locks. Clients are +not allowed to get file locks or delegations from a reexport server, any +attempts will fail with operation not supported. + +Filehandle limits +----------------- + +If the original server uses an X byte filehandle for a given object, the +reexport server's filehandle for the reexported object will be X+22 +bytes, rounded up to the nearest multiple of four bytes. + +The result must fit into the RFC-mandated filehandle size limits: + ++-------+-----------+ +| NFSv2 | 32 bytes | ++-------+-----------+ +| NFSv3 | 64 bytes | ++-------+-----------+ +| NFSv4 | 128 bytes | ++-------+-----------+ + +So, for example, you will only be able to reexport a filesystem over +NFSv2 if the original server gives you filehandles that fit in 10 +bytes--which is unlikely. + +In general there's no way to know the maximum filehandle size given out +by an NFS server without asking the server vendor. + +But the following table gives a few examples. The first column is the +typical length of the filehandle from a Linux server exporting the given +filesystem, the second is the length after that nfs export is reexported +by another Linux host: + ++--------+-------------------+----------------+ +| | filehandle length | after reexport | ++========+===================+================+ +| ext4: | 28 bytes | 52 bytes | ++--------+-------------------+----------------+ +| xfs: | 32 bytes | 56 bytes | ++--------+-------------------+----------------+ +| btrfs: | 40 bytes | 64 bytes | ++--------+-------------------+----------------+ + +All will therefore fit in an NFSv3 or NFSv4 filehandle after reexport, +but none are reexportable over NFSv2. + +Linux server filehandles are a bit more complicated than this, though; +for example: + + - The (non-default) "subtreecheck" export option generally + requires another 4 to 8 bytes in the filehandle. + - If you export a subdirectory of a filesystem (instead of + exporting the filesystem root), that also usually adds 4 to 8 + bytes. + - If you export over NFSv2, knfsd usually uses a shorter + filesystem identifier that saves 8 bytes. + - The root directory of an export uses a filehandle that is + shorter. + +As you can see, the 128-byte NFSv4 filehandle is large enough that +you're unlikely to have trouble using NFSv4 to reexport any filesystem +exported from a Linux server. In general, if the original server is +something that also supports NFSv3, you're *probably* OK. Re-exporting +over NFSv3 may be dicier, and reexporting over NFSv2 will probably +never work. + +For more details of Linux filehandle structure, the best reference is +the source code and comments; see in particular: + + - include/linux/exportfs.h:enum fid_type + - include/uapi/linux/nfsd/nfsfh.h:struct nfs_fhbase_new + - fs/nfsd/nfsfh.c:set_version_and_fsid_type + - fs/nfs/export.c:nfs_encode_fh + +Open DENY bits ignored +---------------------- + +NFS since NFSv4 supports ALLOW and DENY bits taken from Windows, which +allow you, for example, to open a file in a mode which forbids other +read opens or write opens. The Linux client doesn't use them, and the +server's support has always been incomplete: they are enforced only +against other NFS users, not against processes accessing the exported +filesystem locally. A reexport server will also not pass them along to +the original server, so they will not be enforced between clients of +different reexport servers. diff --git a/Documentation/filesystems/nfs/rpc-cache.rst b/Documentation/filesystems/nfs/rpc-cache.rst index bb164eea969b..339efd75016a 100644 --- a/Documentation/filesystems/nfs/rpc-cache.rst +++ b/Documentation/filesystems/nfs/rpc-cache.rst @@ -78,7 +78,7 @@ Creating a Cache include taking references to shared objects. void update(struct cache_head \*orig, struct cache_head \*new) - Set the 'content' fileds in 'new' from 'orig'. + Set the 'content' fields in 'new' from 'orig'. int cache_show(struct seq_file \*m, struct cache_detail \*cd, struct cache_head \*h) Optional. Used to provide a /proc file that lists the diff --git a/Documentation/filesystems/nfs/rpc-server-gss.rst b/Documentation/filesystems/nfs/rpc-server-gss.rst index 812754576845..5c1a1c58fc27 100644 --- a/Documentation/filesystems/nfs/rpc-server-gss.rst +++ b/Documentation/filesystems/nfs/rpc-server-gss.rst @@ -10,13 +10,12 @@ purposes of authentication.) RPCGSS is specified in a few IETF documents: - - RFC2203 v1: http://tools.ietf.org/rfc/rfc2203.txt - - RFC5403 v2: http://tools.ietf.org/rfc/rfc5403.txt + - RFC2203 v1: https://tools.ietf.org/rfc/rfc2203.txt + - RFC5403 v2: https://tools.ietf.org/rfc/rfc5403.txt -and there is a 3rd version being proposed: +There is a third version that we don't currently implement: - - http://tools.ietf.org/id/draft-williams-rpcsecgssv3.txt - (At draft n. 02 at the time of writing) + - RFC7861 v3: https://tools.ietf.org/rfc/rfc7861.txt Background ========== @@ -30,7 +29,7 @@ The Linux kernel, at the moment, supports only the KRB5 mechanism, and depends on GSSAPI extensions that are KRB5 specific. GSSAPI is a complex library, and implementing it completely in kernel is -unwarranted. However GSSAPI operations are fundementally separable in 2 +unwarranted. However GSSAPI operations are fundamentally separable in 2 parts: - initial context establishment |