aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/RCU/checklist.rst
diff options
context:
space:
mode:
Diffstat (limited to '')
-rw-r--r--Documentation/RCU/checklist.rst491
1 files changed, 491 insertions, 0 deletions
diff --git a/Documentation/RCU/checklist.rst b/Documentation/RCU/checklist.rst
new file mode 100644
index 000000000000..048c5bc1f813
--- /dev/null
+++ b/Documentation/RCU/checklist.rst
@@ -0,0 +1,491 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================
+Review Checklist for RCU Patches
+================================
+
+
+This document contains a checklist for producing and reviewing patches
+that make use of RCU. Violating any of the rules listed below will
+result in the same sorts of problems that leaving out a locking primitive
+would cause. This list is based on experiences reviewing such patches
+over a rather long period of time, but improvements are always welcome!
+
+0. Is RCU being applied to a read-mostly situation? If the data
+ structure is updated more than about 10% of the time, then you
+ should strongly consider some other approach, unless detailed
+ performance measurements show that RCU is nonetheless the right
+ tool for the job. Yes, RCU does reduce read-side overhead by
+ increasing write-side overhead, which is exactly why normal uses
+ of RCU will do much more reading than updating.
+
+ Another exception is where performance is not an issue, and RCU
+ provides a simpler implementation. An example of this situation
+ is the dynamic NMI code in the Linux 2.6 kernel, at least on
+ architectures where NMIs are rare.
+
+ Yet another exception is where the low real-time latency of RCU's
+ read-side primitives is critically important.
+
+ One final exception is where RCU readers are used to prevent
+ the ABA problem (https://en.wikipedia.org/wiki/ABA_problem)
+ for lockless updates. This does result in the mildly
+ counter-intuitive situation where rcu_read_lock() and
+ rcu_read_unlock() are used to protect updates, however, this
+ approach provides the same potential simplifications that garbage
+ collectors do.
+
+1. Does the update code have proper mutual exclusion?
+
+ RCU does allow *readers* to run (almost) naked, but *writers* must
+ still use some sort of mutual exclusion, such as:
+
+ a. locking,
+ b. atomic operations, or
+ c. restricting updates to a single task.
+
+ If you choose #b, be prepared to describe how you have handled
+ memory barriers on weakly ordered machines (pretty much all of
+ them -- even x86 allows later loads to be reordered to precede
+ earlier stores), and be prepared to explain why this added
+ complexity is worthwhile. If you choose #c, be prepared to
+ explain how this single task does not become a major bottleneck on
+ big multiprocessor machines (for example, if the task is updating
+ information relating to itself that other tasks can read, there
+ by definition can be no bottleneck). Note that the definition
+ of "large" has changed significantly: Eight CPUs was "large"
+ in the year 2000, but a hundred CPUs was unremarkable in 2017.
+
+2. Do the RCU read-side critical sections make proper use of
+ rcu_read_lock() and friends? These primitives are needed
+ to prevent grace periods from ending prematurely, which
+ could result in data being unceremoniously freed out from
+ under your read-side code, which can greatly increase the
+ actuarial risk of your kernel.
+
+ As a rough rule of thumb, any dereference of an RCU-protected
+ pointer must be covered by rcu_read_lock(), rcu_read_lock_bh(),
+ rcu_read_lock_sched(), or by the appropriate update-side lock.
+ Explicit disabling of preemption (preempt_disable(), for example)
+ can serve as rcu_read_lock_sched(), but is less readable and
+ prevents lockdep from detecting locking issues.
+
+ Please not that you *cannot* rely on code known to be built
+ only in non-preemptible kernels. Such code can and will break,
+ especially in kernels built with CONFIG_PREEMPT_COUNT=y.
+
+ Letting RCU-protected pointers "leak" out of an RCU read-side
+ critical section is every bit as bad as letting them leak out
+ from under a lock. Unless, of course, you have arranged some
+ other means of protection, such as a lock or a reference count
+ *before* letting them out of the RCU read-side critical section.
+
+3. Does the update code tolerate concurrent accesses?
+
+ The whole point of RCU is to permit readers to run without
+ any locks or atomic operations. This means that readers will
+ be running while updates are in progress. There are a number
+ of ways to handle this concurrency, depending on the situation:
+
+ a. Use the RCU variants of the list and hlist update
+ primitives to add, remove, and replace elements on
+ an RCU-protected list. Alternatively, use the other
+ RCU-protected data structures that have been added to
+ the Linux kernel.
+
+ This is almost always the best approach.
+
+ b. Proceed as in (a) above, but also maintain per-element
+ locks (that are acquired by both readers and writers)
+ that guard per-element state. Of course, fields that
+ the readers refrain from accessing can be guarded by
+ some other lock acquired only by updaters, if desired.
+
+ This works quite well, also.
+
+ c. Make updates appear atomic to readers. For example,
+ pointer updates to properly aligned fields will
+ appear atomic, as will individual atomic primitives.
+ Sequences of operations performed under a lock will *not*
+ appear to be atomic to RCU readers, nor will sequences
+ of multiple atomic primitives.
+
+ This can work, but is starting to get a bit tricky.
+
+ d. Carefully order the updates and the reads so that
+ readers see valid data at all phases of the update.
+ This is often more difficult than it sounds, especially
+ given modern CPUs' tendency to reorder memory references.
+ One must usually liberally sprinkle memory barriers
+ (smp_wmb(), smp_rmb(), smp_mb()) through the code,
+ making it difficult to understand and to test.
+
+ It is usually better to group the changing data into
+ a separate structure, so that the change may be made
+ to appear atomic by updating a pointer to reference
+ a new structure containing updated values.
+
+4. Weakly ordered CPUs pose special challenges. Almost all CPUs
+ are weakly ordered -- even x86 CPUs allow later loads to be
+ reordered to precede earlier stores. RCU code must take all of
+ the following measures to prevent memory-corruption problems:
+
+ a. Readers must maintain proper ordering of their memory
+ accesses. The rcu_dereference() primitive ensures that
+ the CPU picks up the pointer before it picks up the data
+ that the pointer points to. This really is necessary
+ on Alpha CPUs.
+
+ The rcu_dereference() primitive is also an excellent
+ documentation aid, letting the person reading the
+ code know exactly which pointers are protected by RCU.
+ Please note that compilers can also reorder code, and
+ they are becoming increasingly aggressive about doing
+ just that. The rcu_dereference() primitive therefore also
+ prevents destructive compiler optimizations. However,
+ with a bit of devious creativity, it is possible to
+ mishandle the return value from rcu_dereference().
+ Please see rcu_dereference.rst for more information.
+
+ The rcu_dereference() primitive is used by the
+ various "_rcu()" list-traversal primitives, such
+ as the list_for_each_entry_rcu(). Note that it is
+ perfectly legal (if redundant) for update-side code to
+ use rcu_dereference() and the "_rcu()" list-traversal
+ primitives. This is particularly useful in code that
+ is common to readers and updaters. However, lockdep
+ will complain if you access rcu_dereference() outside
+ of an RCU read-side critical section. See lockdep.rst
+ to learn what to do about this.
+
+ Of course, neither rcu_dereference() nor the "_rcu()"
+ list-traversal primitives can substitute for a good
+ concurrency design coordinating among multiple updaters.
+
+ b. If the list macros are being used, the list_add_tail_rcu()
+ and list_add_rcu() primitives must be used in order
+ to prevent weakly ordered machines from misordering
+ structure initialization and pointer planting.
+ Similarly, if the hlist macros are being used, the
+ hlist_add_head_rcu() primitive is required.
+
+ c. If the list macros are being used, the list_del_rcu()
+ primitive must be used to keep list_del()'s pointer
+ poisoning from inflicting toxic effects on concurrent
+ readers. Similarly, if the hlist macros are being used,
+ the hlist_del_rcu() primitive is required.
+
+ The list_replace_rcu() and hlist_replace_rcu() primitives
+ may be used to replace an old structure with a new one
+ in their respective types of RCU-protected lists.
+
+ d. Rules similar to (4b) and (4c) apply to the "hlist_nulls"
+ type of RCU-protected linked lists.
+
+ e. Updates must ensure that initialization of a given
+ structure happens before pointers to that structure are
+ publicized. Use the rcu_assign_pointer() primitive
+ when publicizing a pointer to a structure that can
+ be traversed by an RCU read-side critical section.
+
+5. If call_rcu() or call_srcu() is used, the callback function will
+ be called from softirq context. In particular, it cannot block.
+ If you need the callback to block, run that code in a workqueue
+ handler scheduled from the callback. The queue_rcu_work()
+ function does this for you in the case of call_rcu().
+
+6. Since synchronize_rcu() can block, it cannot be called
+ from any sort of irq context. The same rule applies
+ for synchronize_srcu(), synchronize_rcu_expedited(), and
+ synchronize_srcu_expedited().
+
+ The expedited forms of these primitives have the same semantics
+ as the non-expedited forms, but expediting is both expensive and
+ (with the exception of synchronize_srcu_expedited()) unfriendly
+ to real-time workloads. Use of the expedited primitives should
+ be restricted to rare configuration-change operations that would
+ not normally be undertaken while a real-time workload is running.
+ However, real-time workloads can use rcupdate.rcu_normal kernel
+ boot parameter to completely disable expedited grace periods,
+ though this might have performance implications.
+
+ In particular, if you find yourself invoking one of the expedited
+ primitives repeatedly in a loop, please do everyone a favor:
+ Restructure your code so that it batches the updates, allowing
+ a single non-expedited primitive to cover the entire batch.
+ This will very likely be faster than the loop containing the
+ expedited primitive, and will be much much easier on the rest
+ of the system, especially to real-time workloads running on
+ the rest of the system.
+
+7. As of v4.20, a given kernel implements only one RCU flavor, which
+ is RCU-sched for PREEMPTION=n and RCU-preempt for PREEMPTION=y.
+ If the updater uses call_rcu() or synchronize_rcu(), then
+ the corresponding readers may use: (1) rcu_read_lock() and
+ rcu_read_unlock(), (2) any pair of primitives that disables
+ and re-enables softirq, for example, rcu_read_lock_bh() and
+ rcu_read_unlock_bh(), or (3) any pair of primitives that disables
+ and re-enables preemption, for example, rcu_read_lock_sched() and
+ rcu_read_unlock_sched(). If the updater uses synchronize_srcu()
+ or call_srcu(), then the corresponding readers must use
+ srcu_read_lock() and srcu_read_unlock(), and with the same
+ srcu_struct. The rules for the expedited RCU grace-period-wait
+ primitives are the same as for their non-expedited counterparts.
+
+ If the updater uses call_rcu_tasks() or synchronize_rcu_tasks(),
+ then the readers must refrain from executing voluntary
+ context switches, that is, from blocking. If the updater uses
+ call_rcu_tasks_trace() or synchronize_rcu_tasks_trace(), then
+ the corresponding readers must use rcu_read_lock_trace() and
+ rcu_read_unlock_trace(). If an updater uses call_rcu_tasks_rude()
+ or synchronize_rcu_tasks_rude(), then the corresponding readers
+ must use anything that disables interrupts.
+
+ Mixing things up will result in confusion and broken kernels, and
+ has even resulted in an exploitable security issue. Therefore,
+ when using non-obvious pairs of primitives, commenting is
+ of course a must. One example of non-obvious pairing is
+ the XDP feature in networking, which calls BPF programs from
+ network-driver NAPI (softirq) context. BPF relies heavily on RCU
+ protection for its data structures, but because the BPF program
+ invocation happens entirely within a single local_bh_disable()
+ section in a NAPI poll cycle, this usage is safe. The reason
+ that this usage is safe is that readers can use anything that
+ disables BH when updaters use call_rcu() or synchronize_rcu().
+
+8. Although synchronize_rcu() is slower than is call_rcu(), it
+ usually results in simpler code. So, unless update performance is
+ critically important, the updaters cannot block, or the latency of
+ synchronize_rcu() is visible from userspace, synchronize_rcu()
+ should be used in preference to call_rcu(). Furthermore,
+ kfree_rcu() usually results in even simpler code than does
+ synchronize_rcu() without synchronize_rcu()'s multi-millisecond
+ latency. So please take advantage of kfree_rcu()'s "fire and
+ forget" memory-freeing capabilities where it applies.
+
+ An especially important property of the synchronize_rcu()
+ primitive is that it automatically self-limits: if grace periods
+ are delayed for whatever reason, then the synchronize_rcu()
+ primitive will correspondingly delay updates. In contrast,
+ code using call_rcu() should explicitly limit update rate in
+ cases where grace periods are delayed, as failing to do so can
+ result in excessive realtime latencies or even OOM conditions.
+
+ Ways of gaining this self-limiting property when using call_rcu()
+ include:
+
+ a. Keeping a count of the number of data-structure elements
+ used by the RCU-protected data structure, including
+ those waiting for a grace period to elapse. Enforce a
+ limit on this number, stalling updates as needed to allow
+ previously deferred frees to complete. Alternatively,
+ limit only the number awaiting deferred free rather than
+ the total number of elements.
+
+ One way to stall the updates is to acquire the update-side
+ mutex. (Don't try this with a spinlock -- other CPUs
+ spinning on the lock could prevent the grace period
+ from ever ending.) Another way to stall the updates
+ is for the updates to use a wrapper function around
+ the memory allocator, so that this wrapper function
+ simulates OOM when there is too much memory awaiting an
+ RCU grace period. There are of course many other
+ variations on this theme.
+
+ b. Limiting update rate. For example, if updates occur only
+ once per hour, then no explicit rate limiting is
+ required, unless your system is already badly broken.
+ Older versions of the dcache subsystem take this approach,
+ guarding updates with a global lock, limiting their rate.
+
+ c. Trusted update -- if updates can only be done manually by
+ superuser or some other trusted user, then it might not
+ be necessary to automatically limit them. The theory
+ here is that superuser already has lots of ways to crash
+ the machine.
+
+ d. Periodically invoke synchronize_rcu(), permitting a limited
+ number of updates per grace period. Better yet, periodically
+ invoke rcu_barrier() to wait for all outstanding callbacks.
+
+ The same cautions apply to call_srcu() and kfree_rcu().
+
+ Note that although these primitives do take action to avoid memory
+ exhaustion when any given CPU has too many callbacks, a determined
+ user could still exhaust memory. This is especially the case
+ if a system with a large number of CPUs has been configured to
+ offload all of its RCU callbacks onto a single CPU, or if the
+ system has relatively little free memory.
+
+9. All RCU list-traversal primitives, which include
+ rcu_dereference(), list_for_each_entry_rcu(), and
+ list_for_each_safe_rcu(), must be either within an RCU read-side
+ critical section or must be protected by appropriate update-side
+ locks. RCU read-side critical sections are delimited by
+ rcu_read_lock() and rcu_read_unlock(), or by similar primitives
+ such as rcu_read_lock_bh() and rcu_read_unlock_bh(), in which
+ case the matching rcu_dereference() primitive must be used in
+ order to keep lockdep happy, in this case, rcu_dereference_bh().
+
+ The reason that it is permissible to use RCU list-traversal
+ primitives when the update-side lock is held is that doing so
+ can be quite helpful in reducing code bloat when common code is
+ shared between readers and updaters. Additional primitives
+ are provided for this case, as discussed in lockdep.rst.
+
+ One exception to this rule is when data is only ever added to
+ the linked data structure, and is never removed during any
+ time that readers might be accessing that structure. In such
+ cases, READ_ONCE() may be used in place of rcu_dereference()
+ and the read-side markers (rcu_read_lock() and rcu_read_unlock(),
+ for example) may be omitted.
+
+10. Conversely, if you are in an RCU read-side critical section,
+ and you don't hold the appropriate update-side lock, you *must*
+ use the "_rcu()" variants of the list macros. Failing to do so
+ will break Alpha, cause aggressive compilers to generate bad code,
+ and confuse people trying to read your code.
+
+11. Any lock acquired by an RCU callback must be acquired elsewhere
+ with softirq disabled, e.g., via spin_lock_irqsave(),
+ spin_lock_bh(), etc. Failing to disable softirq on a given
+ acquisition of that lock will result in deadlock as soon as
+ the RCU softirq handler happens to run your RCU callback while
+ interrupting that acquisition's critical section.
+
+12. RCU callbacks can be and are executed in parallel. In many cases,
+ the callback code simply wrappers around kfree(), so that this
+ is not an issue (or, more accurately, to the extent that it is
+ an issue, the memory-allocator locking handles it). However,
+ if the callbacks do manipulate a shared data structure, they
+ must use whatever locking or other synchronization is required
+ to safely access and/or modify that data structure.
+
+ Do not assume that RCU callbacks will be executed on the same
+ CPU that executed the corresponding call_rcu() or call_srcu().
+ For example, if a given CPU goes offline while having an RCU
+ callback pending, then that RCU callback will execute on some
+ surviving CPU. (If this was not the case, a self-spawning RCU
+ callback would prevent the victim CPU from ever going offline.)
+ Furthermore, CPUs designated by rcu_nocbs= might well *always*
+ have their RCU callbacks executed on some other CPUs, in fact,
+ for some real-time workloads, this is the whole point of using
+ the rcu_nocbs= kernel boot parameter.
+
+13. Unlike other forms of RCU, it *is* permissible to block in an
+ SRCU read-side critical section (demarked by srcu_read_lock()
+ and srcu_read_unlock()), hence the "SRCU": "sleepable RCU".
+ Please note that if you don't need to sleep in read-side critical
+ sections, you should be using RCU rather than SRCU, because RCU
+ is almost always faster and easier to use than is SRCU.
+
+ Also unlike other forms of RCU, explicit initialization and
+ cleanup is required either at build time via DEFINE_SRCU()
+ or DEFINE_STATIC_SRCU() or at runtime via init_srcu_struct()
+ and cleanup_srcu_struct(). These last two are passed a
+ "struct srcu_struct" that defines the scope of a given
+ SRCU domain. Once initialized, the srcu_struct is passed
+ to srcu_read_lock(), srcu_read_unlock() synchronize_srcu(),
+ synchronize_srcu_expedited(), and call_srcu(). A given
+ synchronize_srcu() waits only for SRCU read-side critical
+ sections governed by srcu_read_lock() and srcu_read_unlock()
+ calls that have been passed the same srcu_struct. This property
+ is what makes sleeping read-side critical sections tolerable --
+ a given subsystem delays only its own updates, not those of other
+ subsystems using SRCU. Therefore, SRCU is less prone to OOM the
+ system than RCU would be if RCU's read-side critical sections
+ were permitted to sleep.
+
+ The ability to sleep in read-side critical sections does not
+ come for free. First, corresponding srcu_read_lock() and
+ srcu_read_unlock() calls must be passed the same srcu_struct.
+ Second, grace-period-detection overhead is amortized only
+ over those updates sharing a given srcu_struct, rather than
+ being globally amortized as they are for other forms of RCU.
+ Therefore, SRCU should be used in preference to rw_semaphore
+ only in extremely read-intensive situations, or in situations
+ requiring SRCU's read-side deadlock immunity or low read-side
+ realtime latency. You should also consider percpu_rw_semaphore
+ when you need lightweight readers.
+
+ SRCU's expedited primitive (synchronize_srcu_expedited())
+ never sends IPIs to other CPUs, so it is easier on
+ real-time workloads than is synchronize_rcu_expedited().
+
+ Note that rcu_assign_pointer() relates to SRCU just as it does to
+ other forms of RCU, but instead of rcu_dereference() you should
+ use srcu_dereference() in order to avoid lockdep splats.
+
+14. The whole point of call_rcu(), synchronize_rcu(), and friends
+ is to wait until all pre-existing readers have finished before
+ carrying out some otherwise-destructive operation. It is
+ therefore critically important to *first* remove any path
+ that readers can follow that could be affected by the
+ destructive operation, and *only then* invoke call_rcu(),
+ synchronize_rcu(), or friends.
+
+ Because these primitives only wait for pre-existing readers, it
+ is the caller's responsibility to guarantee that any subsequent
+ readers will execute safely.
+
+15. The various RCU read-side primitives do *not* necessarily contain
+ memory barriers. You should therefore plan for the CPU
+ and the compiler to freely reorder code into and out of RCU
+ read-side critical sections. It is the responsibility of the
+ RCU update-side primitives to deal with this.
+
+ For SRCU readers, you can use smp_mb__after_srcu_read_unlock()
+ immediately after an srcu_read_unlock() to get a full barrier.
+
+16. Use CONFIG_PROVE_LOCKING, CONFIG_DEBUG_OBJECTS_RCU_HEAD, and the
+ __rcu sparse checks to validate your RCU code. These can help
+ find problems as follows:
+
+ CONFIG_PROVE_LOCKING:
+ check that accesses to RCU-protected data
+ structures are carried out under the proper RCU
+ read-side critical section, while holding the right
+ combination of locks, or whatever other conditions
+ are appropriate.
+
+ CONFIG_DEBUG_OBJECTS_RCU_HEAD:
+ check that you don't pass the
+ same object to call_rcu() (or friends) before an RCU
+ grace period has elapsed since the last time that you
+ passed that same object to call_rcu() (or friends).
+
+ __rcu sparse checks:
+ tag the pointer to the RCU-protected data
+ structure with __rcu, and sparse will warn you if you
+ access that pointer without the services of one of the
+ variants of rcu_dereference().
+
+ These debugging aids can help you find problems that are
+ otherwise extremely difficult to spot.
+
+17. If you register a callback using call_rcu() or call_srcu(), and
+ pass in a function defined within a loadable module, then it in
+ necessary to wait for all pending callbacks to be invoked after
+ the last invocation and before unloading that module. Note that
+ it is absolutely *not* sufficient to wait for a grace period!
+ The current (say) synchronize_rcu() implementation is *not*
+ guaranteed to wait for callbacks registered on other CPUs.
+ Or even on the current CPU if that CPU recently went offline
+ and came back online.
+
+ You instead need to use one of the barrier functions:
+
+ - call_rcu() -> rcu_barrier()
+ - call_srcu() -> srcu_barrier()
+
+ However, these barrier functions are absolutely *not* guaranteed
+ to wait for a grace period. In fact, if there are no call_rcu()
+ callbacks waiting anywhere in the system, rcu_barrier() is within
+ its rights to return immediately.
+
+ So if you need to wait for both an RCU grace period and for
+ all pre-existing call_rcu() callbacks, you will need to execute
+ both rcu_barrier() and synchronize_rcu(), if necessary, using
+ something like workqueues to execute them concurrently.
+
+ See rcubarrier.rst for more information.