1 files changed, 138 insertions, 68 deletions
diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
index 43c4e2f05f40..9fca73e03a98 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.html
+++ b/Documentation/RCU/Design/Requirements/Requirements.html
@@ -900,8 +900,6 @@ Except where otherwise noted, these non-guarantees were premeditated.
 	Grace Periods Don't Partition Read-Side Critical Sections</a>
 <li>	<a href="#Read-Side Critical Sections Don't Partition Grace Periods">
 	Read-Side Critical Sections Don't Partition Grace Periods</a>
-<li>	<a href="#Disabling Preemption Does Not Block Grace Periods">
-	Disabling Preemption Does Not Block Grace Periods</a>
 </ol>
 
 <h3><a name="Readers Impose Minimal Ordering">Readers Impose Minimal Ordering</a></h3>
@@ -1259,54 +1257,6 @@ of RCU grace periods.
 <tr><td>&nbsp;</td></tr>
 </table>
 
-<h3><a name="Disabling Preemption Does Not Block Grace Periods">
-Disabling Preemption Does Not Block Grace Periods</a></h3>
-
-<p>
-There was a time when disabling preemption on any given CPU would block
-subsequent grace periods.
-However, this was an accident of implementation and is not a requirement.
-And in the current Linux-kernel implementation, disabling preemption
-on a given CPU in fact does not block grace periods, as Oleg Nesterov
-<a href="https://lkml.kernel.org/g/20150614193825.GA19582@redhat.com">demonstrated</a>.
-
-<p>
-If you need a preempt-disable region to block grace periods, you need to add
-<tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>, for example
-as follows:
-
-<blockquote>
-<pre>
- 1 preempt_disable();
- 2 rcu_read_lock();
- 3 do_something();
- 4 rcu_read_unlock();
- 5 preempt_enable();
- 6
- 7 /* Spinlocks implicitly disable preemption. */
- 8 spin_lock(&amp;mylock);
- 9 rcu_read_lock();
-10 do_something();
-11 rcu_read_unlock();
-12 spin_unlock(&amp;mylock);
-</pre>
-</blockquote>
-
-<p>
-In theory, you could enter the RCU read-side critical section first,
-but it is more efficient to keep the entire RCU read-side critical
-section contained in the preempt-disable region as shown above.
-Of course, RCU read-side critical sections that extend outside of
-preempt-disable regions will work correctly, but such critical sections
-can be preempted, which forces <tt>rcu_read_unlock()</tt> to do
-more work.
-And no, this is <i>not</i> an invitation to enclose all of your RCU
-read-side critical sections within preempt-disable regions, because
-doing so would degrade real-time response.
-
-<p>
-This non-requirement appeared with preemptible RCU.
-
 <h2><a name="Parallelism Facts of Life">Parallelism Facts of Life</a></h2>
 
 <p>
@@ -1381,6 +1331,7 @@ Classes of quality-of-implementation requirements are as follows:
 <ol>
 <li>	<a href="#Specialization">Specialization</a>
 <li>	<a href="#Performance and Scalability">Performance and Scalability</a>
+<li>	<a href="#Forward Progress">Forward Progress</a>
 <li>	<a href="#Composability">Composability</a>
 <li>	<a href="#Corner Cases">Corner Cases</a>
 </ol>
@@ -1645,7 +1596,7 @@ used in place of <tt>synchronize_rcu()</tt> as follows:
 16   struct foo *p;
 17
 18   spin_lock(&amp;gp_lock);
-19   p = rcu_dereference(gp);
+19   p = rcu_access_pointer(gp);
 20   if (!p) {
 21     spin_unlock(&amp;gp_lock);
 22     return false;
@@ -1822,6 +1773,106 @@ so it is too early to tell whether they will stand the test of time.
 RCU thus provides a range of tools to allow updaters to strike the
 required tradeoff between latency, flexibility and CPU overhead.
 
+<h3><a name="Forward Progress">Forward Progress</a></h3>
+
+<p>
+In theory, delaying grace-period completion and callback invocation
+is harmless.
+In practice, not only are memory sizes finite but also callbacks sometimes
+do wakeups, and sufficiently deferred wakeups can be difficult
+to distinguish from system hangs.
+Therefore, RCU must provide a number of mechanisms to promote forward
+progress.
+
+<p>
+These mechanisms are not foolproof, nor can they be.
+For one simple example, an infinite loop in an RCU read-side critical
+section must by definition prevent later grace periods from ever completing.
+For a more involved example, consider a 64-CPU system built with
+<tt>CONFIG_RCU_NOCB_CPU=y</tt> and booted with <tt>rcu_nocbs=1-63</tt>,
+where CPUs&nbsp;1 through&nbsp;63 spin in tight loops that invoke
+<tt>call_rcu()</tt>.
+Even if these tight loops also contain calls to <tt>cond_resched()</tt>
+(thus allowing grace periods to complete), CPU&nbsp;0 simply will
+not be able to invoke callbacks as fast as the other 63 CPUs can
+register them, at least not until the system runs out of memory.
+In both of these examples, the Spiderman principle applies:  With great
+power comes great responsibility.
+However, short of this level of abuse, RCU is required to
+ensure timely completion of grace periods and timely invocation of
+callbacks.
+
+<p>
+RCU takes the following steps to encourage timely completion of
+grace periods:
+
+<ol>
+<li>	If a grace period fails to complete within 100&nbsp;milliseconds,
+	RCU causes future invocations of <tt>cond_resched()</tt> on
+	the holdout CPUs to provide an RCU quiescent state.
+	RCU also causes those CPUs' <tt>need_resched()</tt> invocations
+	to return <tt>true</tt>, but only after the corresponding CPU's
+	next scheduling-clock.
+<li>	CPUs mentioned in the <tt>nohz_full</tt> kernel boot parameter
+	can run indefinitely in the kernel without scheduling-clock
+	interrupts, which defeats the above <tt>need_resched()</tt>
+	strategem.
+	RCU will therefore invoke <tt>resched_cpu()</tt> on any
+	<tt>nohz_full</tt> CPUs still holding out after
+	109&nbsp;milliseconds.
+<li>	In kernels built with <tt>CONFIG_RCU_BOOST=y</tt>, if a given
+	task that has been preempted within an RCU read-side critical
+	section is holding out for more than 500&nbsp;milliseconds,
+	RCU will resort to priority boosting.
+<li>	If a CPU is still holding out 10&nbsp;seconds into the grace
+	period, RCU will invoke <tt>resched_cpu()</tt> on it regardless
+	of its <tt>nohz_full</tt> state.
+</ol>
+
+<p>
+The above values are defaults for systems running with <tt>HZ=1000</tt>.
+They will vary as the value of <tt>HZ</tt> varies, and can also be
+changed using the relevant Kconfig options and kernel boot parameters.
+RCU currently does not do much sanity checking of these
+parameters, so please use caution when changing them.
+Note that these forward-progress measures are provided only for RCU,
+not for
+<a href="#Sleepable RCU">SRCU</a> or
+<a href="#Tasks RCU">Tasks RCU</a>.
+
+<p>
+RCU takes the following steps in <tt>call_rcu()</tt> to encourage timely
+invocation of callbacks when any given non-<tt>rcu_nocbs</tt> CPU has
+10,000 callbacks, or has 10,000 more callbacks than it had the last time
+encouragement was provided:
+
+<ol>
+<li>	Starts a grace period, if one is not already in progress.
+<li>	Forces immediate checking for quiescent states, rather than
+	waiting for three milliseconds to have elapsed since the
+	beginning of the grace period.
+<li>	Immediately tags the CPU's callbacks with their grace period
+	completion numbers, rather than waiting for the <tt>RCU_SOFTIRQ</tt>
+	handler to get around to it.
+<li>	Lifts callback-execution batch limits, which speeds up callback
+	invocation at the expense of degrading realtime response.
+</ol>
+
+<p>
+Again, these are default values when running at <tt>HZ=1000</tt>,
+and can be overridden.
+Again, these forward-progress measures are provided only for RCU,
+not for
+<a href="#Sleepable RCU">SRCU</a> or
+<a href="#Tasks RCU">Tasks RCU</a>.
+Even for RCU, callback-invocation forward progress for <tt>rcu_nocbs</tt>
+CPUs is much less well-developed, in part because workloads benefiting
+from <tt>rcu_nocbs</tt> CPUs tend to invoke <tt>call_rcu()</tt>
+relatively infrequently.
+If workloads emerge that need both <tt>rcu_nocbs</tt> CPUs and high
+<tt>call_rcu()</tt> invocation rates, then additional forward-progress
+work will be required.
+
 <h3><a name="Composability">Composability</a></h3>
 
 <p>
@@ -2272,7 +2323,7 @@ that meets this requirement.
 Furthermore, NMI handlers can be interrupted by what appear to RCU
 to be normal interrupts.
 One way that this can happen is for code that directly invokes
-<tt>rcu_irq_enter()</tt> and </tt>rcu_irq_exit()</tt> to be called
+<tt>rcu_irq_enter()</tt> and <tt>rcu_irq_exit()</tt> to be called
 from an NMI handler.
 This astonishing fact of life prompted the current code structure,
 which has <tt>rcu_irq_enter()</tt> invoking <tt>rcu_nmi_enter()</tt>
@@ -2294,7 +2345,7 @@ via <tt>del_timer_sync()</tt> or similar.
 <p>
 Unfortunately, there is no way to cancel an RCU callback;
 once you invoke <tt>call_rcu()</tt>, the callback function is
-going to eventually be invoked, unless the system goes down first.
+eventually going to be invoked, unless the system goes down first.
 Because it is normally considered socially irresponsible to crash the system
 in response to a module unload request, we need some other way
 to deal with in-flight RCU callbacks.
@@ -2424,23 +2475,37 @@ for context-switch-heavy <tt>CONFIG_NO_HZ_FULL=y</tt> workloads,
 but there is room for further improvement.
 
 <p>
-In the past, it was forbidden to disable interrupts across an
-<tt>rcu_read_unlock()</tt> unless that interrupt-disabled region
-of code also included the matching <tt>rcu_read_lock()</tt>.
-Violating this restriction could result in deadlocks involving the
-scheduler's runqueue and priority-inheritance spinlocks.
-This restriction was lifted when interrupt-disabled calls to
-<tt>rcu_read_unlock()</tt> started deferring the reporting of
-the resulting RCU-preempt quiescent state until the end of that
+It is forbidden to hold any of scheduler's runqueue or priority-inheritance
+spinlocks across an <tt>rcu_read_unlock()</tt> unless interrupts have been
+disabled across the entire RCU read-side critical section, that is,
+up to and including the matching <tt>rcu_read_lock()</tt>.
+Violating this restriction can result in deadlocks involving these
+scheduler spinlocks.
+There was hope that this restriction might be lifted when interrupt-disabled
+calls to <tt>rcu_read_unlock()</tt> started deferring the reporting of
+the resulting RCU-preempt quiescent state until the end of the corresponding
 interrupts-disabled region.
-This deferred reporting means that the scheduler's runqueue and
-priority-inheritance locks cannot be held while reporting an RCU-preempt
-quiescent state, which lifts the earlier restriction, at least from
-a deadlock perspective.
-Unfortunately, real-time systems using RCU priority boosting may
+Unfortunately, timely reporting of the corresponding quiescent state
+to expedited grace periods requires a call to <tt>raise_softirq()</tt>,
+which can acquire these scheduler spinlocks.
+In addition, real-time systems using RCU priority boosting
 need this restriction to remain in effect because deferred
-quiescent-state reporting also defers deboosting, which in turn
-degrades real-time latencies.
+quiescent-state reporting would also defer deboosting, which in turn
+would degrade real-time latencies.
+
+<p>
+In theory, if a given RCU read-side critical section could be
+guaranteed to be less than one second in duration, holding a scheduler
+spinlock across that critical section's <tt>rcu_read_unlock()</tt>
+would require only that preemption be disabled across the entire
+RCU read-side critical section, not interrupts.
+Unfortunately, given the possibility of vCPU preemption, long-running
+interrupts, and so on, it is not possible in practice to guarantee
+that a given RCU read-side critical section will complete in less than
+one second.
+Therefore, as noted above, if scheduler spinlocks are held across
+a given call to <tt>rcu_read_unlock()</tt>, interrupts must be
+disabled across the entire RCU read-side critical section.
 
 <h3><a name="Tracing and RCU">Tracing and RCU</a></h3>
 
@@ -3233,6 +3298,11 @@ For example, RCU callback overhead might be charged back to the
 originating <tt>call_rcu()</tt> instance, though probably not
 in production kernels.
 
+<p>
+Additional work may be required to provide reasonable forward-progress
+guarantees under heavy load for grace periods and for callback
+invocation.
+
 <h2><a name="Summary">Summary</a></h2>
 
 <p>