Age | Commit message (Collapse) | Author | Files | Lines |
|
Pull NFS client updates from Anna Schumaker:
"New Features:
- Enable using direct IO with localio
- Added localio related tracepoints
Bugfixes:
- Sunrpc fixes for working with a very large cl_tasks list
- Fix a possible buffer overflow in nfs_sysfs_link_rpc_client()
- Fixes for handling reconnections with localio
- Fix how the NFS_FSCACHE kconfig option interacts with NETFS_SUPPORT
- Fix COPY_NOTIFY xdr_buf size calculations
- pNFS/Flexfiles fix for retrying requesting a layout segment for
reads
- Sunrpc fix for retrying on EKEYEXPIRED error when the TGT is
expired
Cleanups:
- Various other nfs & nfsd localio cleanups
- Prepratory patches for async copy improvements that are under
development
- Make OFFLOAD_CANCEL, LAYOUTSTATS, and LAYOUTERR moveable to other
xprts
- Add netns inum and srcaddr to debugfs rpc_xprt info"
* tag 'nfs-for-6.14-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (28 commits)
SUNRPC: do not retry on EKEYEXPIRED when user TGT ticket expired
sunrpc: add netns inum and srcaddr to debugfs rpc_xprt info
pnfs/flexfiles: retry getting layout segment for reads
NFSv4.2: make LAYOUTSTATS and LAYOUTERROR MOVEABLE
NFSv4.2: mark OFFLOAD_CANCEL MOVEABLE
NFSv4.2: fix COPY_NOTIFY xdr buf size calculation
NFS: Rename struct nfs4_offloadcancel_data
NFS: Fix typo in OFFLOAD_CANCEL comment
NFS: CB_OFFLOAD can return NFS4ERR_DELAY
nfs: Make NFS_FSCACHE select NETFS_SUPPORT instead of depending on it
nfs: fix incorrect error handling in LOCALIO
nfs: probe for LOCALIO when v3 client reconnects to server
nfs: probe for LOCALIO when v4 client reconnects to server
nfs/localio: remove redundant code and simplify LOCALIO enablement
nfs_common: add nfs_localio trace events
nfs_common: track all open nfsd_files per LOCALIO nfs_client
nfs_common: rename nfslocalio nfs_uuid_lock to nfs_uuids_lock
nfsd: nfsd_file_acquire_local no longer returns GC'd nfsd_file
nfsd: rename nfsd_serv_ prefixed methods and variables with nfsd_net_
nfsd: update percpu_ref to manage references on nfsd_net
...
|
|
This tracking enables __nfsd_file_cache_purge() to call
nfs_localio_invalidate_clients(), upon shutdown or export change, to
nfs_close_local_fh() all open nfsd_files that are still cached by the
LOCALIO nfs clients associated with nfsd_net that is being shutdown.
Now that the client must track all open nfsd_files there was more work
than necessary being done with the global nfs_uuids_lock contended.
This manifested in various RCU issues, e.g.:
hrtimer: interrupt took 47969440 ns
rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Use nfs_uuid->lock to protect all nfs_uuid_t members, instead of
nfs_uuids_lock, once nfs_uuid_is_local() adds the client to
nn->local_clients.
Also add 'local_clients_lock' to 'struct nfsd_net' to protect
nn->local_clients. And store a pointer to spinlock in the 'list_lock'
member of nfs_uuid_t so nfs_localio_disable_client() can use it to
avoid taking the global nfs_uuids_lock.
In combination, these split out locks eliminate the use of the single
nfslocalio.c global nfs_uuids_lock in the IO paths (open and close).
Also refactored associated fs/nfs_common/nfslocalio.c methods' locking
to reduce work performed with spinlocks held in general.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
Also update Documentation/filesystems/nfs/localio.rst accordingly
and reduce the technical documentation debt that was previously
captured in that document.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
Now that the connection limit only apply to unconfirmed connections,
there is no need to configure it. So remove all the configuration and
fix the number of unconfirmed connections as always 64 - which is
now given a name: XPT_MAX_TMP_CONN
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The heuristic for limiting the number of incoming connections to nfsd
currently uses sv_nrthreads - allowing more connections if more threads
were configured.
A future patch will allow number of threads to grow dynamically so that
there will be no need to configure sv_nrthreads. So we need a different
solution for limiting connections.
It isn't clear what problem is solved by limiting connections (as
mentioned in a code comment) but the most likely problem is a connection
storm - many connections that are not doing productive work. These will
be closed after about 6 minutes already but it might help to slow down a
storm.
This patch adds a per-connection flag XPT_PEER_VALID which indicates
that the peer has presented a filehandle for which it has some sort of
access. i.e the peer is known to be trusted in some way. We now only
count connections which have NOT been determined to be valid. There
should be relative few of these at any given time.
If the number of non-validated peer exceed a limit - currently 64 - we
close the oldest non-validated peer to avoid having too many of these
useless connections.
Note that this patch significantly changes the meaning of the various
configuration parameters for "max connections". The next patch will
remove all of these.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Add server support for bypassing NFS for localhost reads, writes, and
commits. This is only useful when both the client and server are
running on the same host.
If nfsd_open_local_fh() fails then the NFS client will both retry and
fallback to normal network-based read, write and commit operations if
localio is no longer supported.
Care is taken to ensure the same NFS security mechanisms are used
(authentication, etc) regardless of whether localio or regular NFS
access is used. The auth_domain established as part of the traditional
NFS client access to the NFS server is also used for localio. Store
auth_domain for localio in nfsd_uuid_t and transfer it to the client
if it is local to the server.
Relative to containers, localio gives the client access to the network
namespace the server has. This is required to allow the client to
access the server's per-namespace nfsd_net struct.
This commit also introduces the use of NFSD's percpu_ref to interlock
nfsd_destroy_serv and nfsd_open_local_fh, to ensure nn->nfsd_serv is
not destroyed while in use by nfsd_open_local_fh and other LOCALIO
client code.
CONFIG_NFS_LOCALIO enables NFS server support for LOCALIO.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Co-developed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Co-developed-by: NeilBrown <neilb@suse.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
Introduce nfsd_serv_try_get and nfsd_serv_put and update the nfsd code
to prevent nfsd_destroy_serv from destroying nn->nfsd_serv until any
caller of nfsd_serv_try_get releases their reference using nfsd_serv_put.
A percpu_ref is used to implement the interlock between
nfsd_destroy_serv and any caller of nfsd_serv_try_get.
This interlock is needed to properly wait for the completion of client
initiated localio calls to nfsd (that are _not_ in the context of nfsd).
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
Nothing appears to limit the number of concurrent async COPY
operations that clients can start. In addition, AFAICT each async
COPY can copy an unlimited number of 4MB chunks, so can run for a
long time. Thus IMO async COPY can become a DoS vector.
Add a restriction mechanism that bounds the number of concurrent
background COPY operations. Start simple and try to be fair -- this
patch implements a per-namespace limit.
An async COPY request that occurs while this limit is exceeded gets
NFS4ERR_DELAY. The requesting client can choose to send the request
again after a delay or fall back to a traditional read/write style
copy.
If there is need to make the mechanism more sophisticated, we can
visit that in future patches.
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Instead of using kmalloc to allocate an array for storing active version
info, just declare an array to the max size - it is only 5 or so.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Introduce write_version netlink command through a "declarative" interface.
This patch introduces a change in behavior since for version-set userspace
is expected to provide a NFS major/minor version list it wants to enable
while all the other ones will be disabled. (procfs write_version
command implements imperative interface where the admin writes +3/-3 to
enable/disable a single version.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
For NFSv4.1 and later the client easily discovers if there is any
admin-revoked state and will then find and explicitly free it.
For NFSv4.0 there is no such mechanism. The client can only find that
state is admin-revoked if it tries to use that state, and there is no
way for it to explicitly free the state. So the server must hold on to
the stateid (at least) for an indefinite amount of time. A
RELEASE_LOCKOWNER request might justify forgetting some of these
stateids, as would the whole clients lease lapsing, but these are not
reliable.
This patch takes two approaches.
Whenever a client uses an revoked stateid, that stateid is then
discarded and will not be recognised again. This might confuse a client
which expect to get NFS4ERR_ADMIN_REVOKED consistently once it get it at
all, but should mostly work. Hopefully one error will lead to other
resources being closed (e.g. process exits), which will result in more
stateid being freed when a CLOSE attempt gets NFS4ERR_ADMIN_REVOKED.
Also, any admin-revoked stateids that have been that way for more than
one lease time are periodically revoke.
No actual freeing of state happens in this patch. That will come in
future patches which handle the different sorts of revoked state.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The final bit of stats that is global is the rpc svc_stat. Move this
into the nfsd_net struct and use that everywhere instead of the global
struct. Remove the unused global struct.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
We have a global set of counters that we modify for all of the nfsd
operations, but now that we're exposing these stats across all network
namespaces we need to make the stats also be per-network namespace. We
already have some caching stats that are per-network namespace, so move
these definitions into the same counter and then adjust all the helpers
and users of these stats to provide the appropriate nfsd_net struct so
that the stats are maintained for the per-network namespace objects.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
We're going to merge the stats all into per network namespace in
subsequent patches, rename these nn counters to be consistent with the
rest of the stats.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
sv_refcnt is no longer useful.
lockd and nfs-cb only ever have the svc active when there are a non-zero
number of threads, so sv_refcnt mirrors sv_nrthreads.
nfsd also keeps the svc active between when a socket is added and when
the first thread is started, but we don't really need a refcount for
that. We can simply not destroy the svc while there are any permanent
sockets attached.
So remove sv_refcnt and the get/put functions.
Instead of a final call to svc_put(), call svc_destroy() instead.
This is changed to also store NULL in the passed-in pointer to make it
easier to avoid use-after-free situations.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
A future patch will remove refcounting on svc_serv as it is of little
use.
It is currently used to keep the svc around while the pool_stats file is
open.
Change this to get the pointer, protected by the mutex, only in
seq_start, and the release the mutex in seq_stop.
This means that if the nfsd server is stopped and restarted while the
pool_stats file it open, then some pool stats info could be from the
first instance and some from the second. This might appear odd, but is
unlikely to be a problem in practice.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
In preparation for implementing lockless slab shrink, use new APIs to
dynamically allocate the nfsd-reply shrinker, so that it can be freed
asynchronously via RCU. Then it doesn't need to wait for RCU read-side
critical section when releasing the struct nfsd_net.
Link: https://lkml.kernel.org/r/20230911094444.68966-34-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Chuck Lever <cel@kernel.org>
Cc: Coly Li <colyli@suse.de>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Airlie <airlied@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sean Paul <sean@poorly.run>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Song Liu <song@kernel.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In preparation for implementing lockless slab shrink, use new APIs to
dynamically allocate the nfsd-client shrinker, so that it can be freed
asynchronously via RCU. Then it doesn't need to wait for RCU read-side
critical section when releasing the struct nfsd_net.
Link: https://lkml.kernel.org/r/20230911094444.68966-33-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Chuck Lever <cel@kernel.org>
Cc: Coly Li <colyli@suse.de>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Airlie <airlied@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sean Paul <sean@poorly.run>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Song Liu <song@kernel.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Pull file locking updates from Jeff Layton:
"The main change here is that I've broken out most of the file locking
definitions into a new header file. I also went ahead and completed
the removal of locks_inode function"
* tag 'locks-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
fs: remove locks_inode
filelock: move file locking definitions to separate header file
|
|
Since nfsd4_state_shrinker_count always calls mod_delayed_work with
0 delay, we can replace delayed_work with work_struct to save some
space and overhead.
Also add the call to cancel_work after unregister the shrinker
in nfs4_state_shutdown_net.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The file locking definitions have lived in fs.h since the dawn of time,
but they are only used by a small subset of the source files that
include it.
Move the file locking definitions to a new header file, and add the
appropriate #include directives to the source files that need them. By
doing this we trim down fs.h a bit and limit the amount of rebuilding
that has to be done when we make changes to the file locking APIs.
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Steve French <stfrench@microsoft.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
|
|
Add courtesy_client_reaper to react to low memory condition triggered
by the system memory shrinker.
The delayed_work for the courtesy_client_reaper is scheduled on
the shrinker's count callback using the laundry_wq.
The shrinker's scan callback is not used for expiring the courtesy
clients due to potential deadlocks.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Add counter nfs4_courtesy_client_count to nfsd_net to keep track
of the number of courtesy clients in the system.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Currently there is no limit on how many v4 clients are supported
by the system. This can be a problem in systems with small memory
configuration to function properly when a very large number of
clients exist that creates memory shortage conditions.
This patch enforces a limit of 1024 NFSv4 clients, including courtesy
clients, per 1GB of system memory. When the number of the clients
reaches the limit, requests that create new clients are returned
with NFS4ERR_DELAY and the laundromat is kicked start to trim old
clients. Due to the overhead of the upcall to remove the client
record, the maximun number of clients the laundromat removes on
each run is limited to 128. This is done to ensure the laundromat
can still process the other tasks in a timely manner.
Since there is now a limit of the number of clients, the 24-hr
idle time limit of courtesy client is no longer needed and was
removed.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Add counter nfs4_client_count to keep track of the total number
of v4 clients, including courtesy clients, in the system.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Clean up: These functions handle what the specs call a write
verifier, which in the Linux NFS server implementation is now
divorced from the server's boot instance
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
There are two boot-time fields in struct nfsd_net: one called
boot_time and one called nfssvc_boot. The latter is used only to
form write verifiers, but its documenting comment declares:
/* Time of server startup */
Since commit 27c438f53e79 ("nfsd: Support the server resetting the
boot verifier"), this field can be reset at any time; it's no
longer tied to server restart. So that comment is stale.
Also, according to pahole, struct timespec64 is 16 bytes long on
x86_64. The nfssvc_boot field is used only to form a write verifier,
which is 8 bytes long.
Let's clarify this situation by manufacturing an 8-byte verifier
in nfs_reset_boot_verifier() and storing only that in struct
nfsd_net.
We're grabbing 128 bits of time, so compress all of those into a
64-bit verifier instead of throwing out the high-order bits.
In the future, the siphash_key can be re-used for other hashed
objects per-nfsd_net.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
We currently have a 'laundrette' for closing cached files - a different
work-item for each network-namespace.
These 'laundrettes' (aka struct nfsd_fcache_disposal) are currently on a
list, and are freed using rcu.
The list is not necessary as we have a per-namespace structure (struct
nfsd_net) which can hold a link to the nfsd_fcache_disposal.
The use of kfree_rcu is also unnecessary as the cache is cleaned of all
files associated with a given namespace, and no new files can be added,
before the nfsd_fcache_disposal is freed.
So add a '->fcache_disposal' link to nfsd_net, and discard the list
management and rcu usage.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
nfsd currently maintains an open-coded read/write semaphore (refcount
and wait queue) for each network namespace to ensure the nfs service
isn't shut down while the notifier is running.
This is excessive. As there is unlikely to be contention between
notifiers and they run without sleeping, a single spinlock is sufficient
to avoid problems.
Signed-off-by: NeilBrown <neilb@suse.de>
[ cel: ensure nfsd_notifier_lock is static ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
nfsd cannot currently use svc_set_num_threads_sync. It instead
uses svc_set_num_threads which does *not* wait for threads to all
exit, and has a separate mechanism (nfsd_shutdown_complete) to wait
for completion.
The reason that nfsd is unlike other services is that nfsd threads can
exit separately from svc_set_num_threads being called - they die on
receipt of SIGKILL. Also, when the last thread exits, the service must
be shut down (sockets closed).
For this, the nfsd_mutex needs to be taken, and as that mutex needs to
be held while svc_set_num_threads is called, the one cannot wait for
the other.
This patch changes the nfsd thread so that it can drop the ref on the
service without blocking on nfsd_mutex, so that svc_set_num_threads_sync
can be used:
- if it can drop a non-last reference, it does that. This does not
trigger shutdown and does not require a mutex. This will likely
happen for all but the last thread signalled, and for all threads
being shut down by nfsd_shutdown_threads()
- if it can get the mutex without blocking (trylock), it does that
and then drops the reference. This will likely happen for the
last thread killed by SIGKILL
- Otherwise there might be an unrelated task holding the mutex,
possibly in another network namespace, or nfsd_shutdown_threads()
might be just about to get a reference on the service, after which
we can drop ours safely.
We cannot conveniently get wakeup notifications on these events,
and we are unlikely to need to, so we sleep briefly and check again.
With this we can discard nfsd_shutdown_complete and
nfsd_complete_shutdown(), and switch to svc_set_num_threads_sync.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The use of sv_nrthreads as a general refcount results in clumsy code, as
is seen by various comments needed to explain the situation.
This patch introduces a 'struct kref' and uses that for reference
counting, leaving sv_nrthreads to be a pure count of threads. The kref
is managed particularly in svc_get() and svc_put(), and also nfsd_put();
svc_destroy() now takes a pointer to the embedded kref, rather than to
the serv.
nfsd allows the svc_serv to exist with ->sv_nrhtreads being zero. This
happens when a transport is created before the first thread is started.
To support this, a 'keep_active' flag is introduced which holds a ref on
the svc_serv. This is set when any listening socket is successfully
added (unless there are running threads), and cleared when the number of
threads is set. So when the last thread exits, the nfs_serv will be
destroyed.
The use of 'keep_active' replaces previous code which checked if there
were any permanent sockets.
We no longer clear ->rq_server when nfsd() exits. This was done
to prevent svc_exit_thread() from calling svc_destroy().
Instead we take an extra reference to the svc_serv to prevent
svc_destroy() from being called.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Currently the source's export is mounted and unmounted on every
inter-server copy operation. This patch is an enhancement to delay
the unmount of the source export for a certain period of time to
eliminate the mount and unmount overhead on subsequent copy operations.
After a copy operation completes, a work entry is added to the
delayed unmount list with an expiration time. This list is serviced
by the laundromat thread to unmount the export of the expired entries.
Each time the export is being used again, its expiration time is
extended and the entry is re-inserted to the tail of the list.
The unmount task and the mount operation of the copy request are
synced to make sure the export is not unmounted while it's being
used.
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
In order to ensure that knfsd threads don't linger once the nfsd
pseudofs is unmounted (e.g. when the container is killed) we let
nfsd_umount() shut down those threads and wait for them to exit.
This also should ensure that we don't need to do a kernel mount of
the pseudofs, since the thread lifetime is now limited by the
lifetime of the filesystem.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
nfsd stats counters can be updated by concurrent nfsd threads without any
protection.
Convert some nfsd_stats and nfsd_net struct members to use percpu counters.
The longest_chain* members of struct nfsd_net remain unprotected.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Drop the repeated word "the" in a comment.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
I made every global per-network-namespace instead. But perhaps doing
that to this slab was a step too far.
The kmem_cache_create call in our net init method also seems to be
responsible for this lockdep warning:
[ 45.163710] Unable to find swap-space signature
[ 45.375718] trinity-c1 (855): attempted to duplicate a private mapping with mremap. This is not supported.
[ 46.055744] futex_wake_op: trinity-c1 tries to shift op by -209; fix this program
[ 51.011723]
[ 51.013378] ======================================================
[ 51.013875] WARNING: possible circular locking dependency detected
[ 51.014378] 5.2.0-rc2 #1 Not tainted
[ 51.014672] ------------------------------------------------------
[ 51.015182] trinity-c2/886 is trying to acquire lock:
[ 51.015593] 000000005405f099 (slab_mutex){+.+.}, at: slab_attr_store+0xa2/0x130
[ 51.016190]
[ 51.016190] but task is already holding lock:
[ 51.016652] 00000000ac662005 (kn->count#43){++++}, at: kernfs_fop_write+0x286/0x500
[ 51.017266]
[ 51.017266] which lock already depends on the new lock.
[ 51.017266]
[ 51.017909]
[ 51.017909] the existing dependency chain (in reverse order) is:
[ 51.018497]
[ 51.018497] -> #1 (kn->count#43){++++}:
[ 51.018956] __lock_acquire+0x7cf/0x1a20
[ 51.019317] lock_acquire+0x17d/0x390
[ 51.019658] __kernfs_remove+0x892/0xae0
[ 51.020020] kernfs_remove_by_name_ns+0x78/0x110
[ 51.020435] sysfs_remove_link+0x55/0xb0
[ 51.020832] sysfs_slab_add+0xc1/0x3e0
[ 51.021332] __kmem_cache_create+0x155/0x200
[ 51.021720] create_cache+0xf5/0x320
[ 51.022054] kmem_cache_create_usercopy+0x179/0x320
[ 51.022486] kmem_cache_create+0x1a/0x30
[ 51.022867] nfsd_reply_cache_init+0x278/0x560
[ 51.023266] nfsd_init_net+0x20f/0x5e0
[ 51.023623] ops_init+0xcb/0x4b0
[ 51.023928] setup_net+0x2fe/0x670
[ 51.024315] copy_net_ns+0x30a/0x3f0
[ 51.024653] create_new_namespaces+0x3c5/0x820
[ 51.025257] unshare_nsproxy_namespaces+0xd1/0x240
[ 51.025881] ksys_unshare+0x506/0x9c0
[ 51.026381] __x64_sys_unshare+0x3a/0x50
[ 51.026937] do_syscall_64+0x110/0x10b0
[ 51.027509] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 51.028175]
[ 51.028175] -> #0 (slab_mutex){+.+.}:
[ 51.028817] validate_chain+0x1c51/0x2cc0
[ 51.029422] __lock_acquire+0x7cf/0x1a20
[ 51.029947] lock_acquire+0x17d/0x390
[ 51.030438] __mutex_lock+0x100/0xfa0
[ 51.030995] mutex_lock_nested+0x27/0x30
[ 51.031516] slab_attr_store+0xa2/0x130
[ 51.032020] sysfs_kf_write+0x11d/0x180
[ 51.032529] kernfs_fop_write+0x32a/0x500
[ 51.033056] do_loop_readv_writev+0x21d/0x310
[ 51.033627] do_iter_write+0x2e5/0x380
[ 51.034148] vfs_writev+0x170/0x310
[ 51.034616] do_pwritev+0x13e/0x160
[ 51.035100] __x64_sys_pwritev+0xa3/0x110
[ 51.035633] do_syscall_64+0x110/0x10b0
[ 51.036200] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 51.036924]
[ 51.036924] other info that might help us debug this:
[ 51.036924]
[ 51.037876] Possible unsafe locking scenario:
[ 51.037876]
[ 51.038556] CPU0 CPU1
[ 51.039130] ---- ----
[ 51.039676] lock(kn->count#43);
[ 51.040084] lock(slab_mutex);
[ 51.040597] lock(kn->count#43);
[ 51.041062] lock(slab_mutex);
[ 51.041320]
[ 51.041320] *** DEADLOCK ***
[ 51.041320]
[ 51.041793] 3 locks held by trinity-c2/886:
[ 51.042128] #0: 000000001f55e152 (sb_writers#5){.+.+}, at: vfs_writev+0x2b9/0x310
[ 51.042739] #1: 00000000c7d6c034 (&of->mutex){+.+.}, at: kernfs_fop_write+0x25b/0x500
[ 51.043400] #2: 00000000ac662005 (kn->count#43){++++}, at: kernfs_fop_write+0x286/0x500
Reported-by: kernel test robot <lkp@intel.com>
Fixes: 3ba75830ce17 "drc containerization"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Currently, nfsd4_encode_exchange_id() encodes the utsname nodename
string in the server_scope field. In a multi-host container
environemnt, if an nfsd container is restarted on a different host than
it was originally running on, clients will see a server_scope mismatch
and will not attempt to reclaim opens.
Instead, set the server_scope while we're in a process context during
service startup, so we get the utsname nodename of the current process
and store that in nfsd_net.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
[bfields: fix up major_id too]
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
A couple of time_t variables are only used to track the state of the
lease time and its expiration. The code correctly uses the 'time_after()'
macro to make this work on 32-bit architectures even beyond year 2038,
but the get_seconds() function and the time_t type itself are deprecated
as they behave inconsistently between 32-bit and 64-bit architectures
and often lead to code that is not y2038 safe.
As a minor issue, using get_seconds() leads to problems with concurrent
settimeofday() or clock_settime() calls, in the worst case timeout never
triggering after the time has been set backwards.
Change nfsd to use time64_t and ktime_get_boottime_seconds() here. This
is clearly excessive, as boottime by itself means we never go beyond 32
bits, but it does mean we handle this correctly and consistently without
having to worry about corner cases and should be no more expensive than
the previous implementation on 64-bit architectures.
The max_cb_time() function gets changed in order to avoid an expensive
64-bit division operation, but as the lease time is at most one hour,
there is no change in behavior.
Also do the same for server-to-server copy expiration time.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
[bfields@redhat.com: fix up copy expiration]
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
The local boot time variable gets truncated to time_t at the moment,
which can lead to slightly odd behavior on 32-bit architectures.
Use ktime_get_real_seconds() instead of get_seconds() to always
get a 64-bit result, and keep it that way wherever possible.
It still gets truncated in a few places:
- When assigning to cl_clientid.cl_boot, this is already documented
and is only used as a unique identifier.
- In clients_still_reclaiming(), the truncation is to 'unsigned long'
in order to use the 'time_before() helper.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Add support to allow the server to reset the boot verifier in order to
force clients to resend I/O after a timeout failure.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
We want clientid's on the wire to be randomized for reasons explained in
ebd7c72c63ac "nfsd: randomize SETCLIENTID reply to help distinguish
servers". But I'd rather have mostly small integers for the clients/
directory.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
I plan to expose some information about nfsv4 clients here.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Keep around one internal mount of the nfsd filesystem so that we can add
stuff to it when clients come and go, regardless of whether anyone has
it mounted.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
After 89a26b3d295d "nfsd: split DRC global spinlock into per-bucket
locks", there is no longer a single global spinlock to protect these
stats.
So, really we need to fix that. For now, at least fix the comment.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
The nfsd duplicate reply cache should not be shared between network
namespaces.
The most straightforward way to fix this is just to move every global in
the code to per-net-namespace memory, so that's what we do.
Still todo: sort out which members of nfsd_stats should be global and
which per-net-namespace.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Based on 2 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version this program is distributed in the
hope that it will be useful but without any warranty without even
the implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details you
should have received a copy of the gnu general public license along
with this program if not write to the free software foundation inc
51 franklin street fifth floor boston ma 02110 1301 usa
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option [no]_[pad]_[ctrl] any later version this program is
distributed in the hope that it will be useful but without any
warranty without even the implied warranty of merchantability or
fitness for a particular purpose see the gnu general public license
for more details you should have received a copy of the gnu general
public license along with this program if not write to the free
software foundation inc 51 franklin street fifth floor boston ma
02110 1301 usa
extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 176 file(s).
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190519154040.652910950@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Support use of the --nfs-version/--no-nfs-version arguments to rpc.nfsd
in containers.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
When using nfsdcld for NFSv4 client tracking, track the number of
RECLAIM_COMPLETE operations we receive from "known" clients to help in
deciding if we can lift the grace period early (or whether we need to
start a v4 grace period at all).
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Upon receiving a request for async copy, create a new kthread. If we
get asynchronous request, make sure to copy the needed arguments/state
from the stack before starting the copy. Then start the thread and reply
back to the client indicating copy is asynchronous.
nfsd_copy_file_range() will copy in a loop over the total number of
bytes is needed to copy. In case a failure happens in the middle, we
ignore the error and return how much we copied so far. Once done
creating a workitem for the callback workqueue and send CB_OFFLOAD with
the results.
The lifetime of the copy stateid is bound to the vfs copy. This way we
don't need to keep the nfsd_net structure for the callback. We could
keep it around longer so that an OFFLOAD_STATUS that came late would
still get results, but clients should be able to deal without that.
We handle OFFLOAD_CANCEL by sending a signal to the copy thread and
calling kthread_stop.
A client should cancel any ongoing copies before calling DESTROY_CLIENT;
if not, we return a CLIENT_BUSY error.
If the client is destroyed for some other reason (lease expiration, or
server shutdown), we must clean up any ongoing copies ourselves.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
[colin.king@canonical.com: fix leak in error case]
[bfields@fieldses.org: remove signalling, merge patches]
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
If the client is only renewing state a little sooner than once a lease
period, then it might not discover the server has restarted till close
to the end of the grace period, and might run out of time to do the
actual reclaim.
Extend the grace period by a second each time we notice there are
clients still trying to reclaim, up to a limit of another whole lease
period.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|