Diffstat (limited to 'Documentation')
-rw-r--r-- Documentation/RCU/Design/Data-Structures/Data-Structures.html | 3
-rw-r--r-- Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html | 4
-rw-r--r-- Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html | 5
-rw-r--r-- Documentation/RCU/NMI-RCU.txt | 13
-rw-r--r-- Documentation/RCU/UP.txt | 6
-rw-r--r-- Documentation/RCU/checklist.txt | 91
-rw-r--r-- Documentation/RCU/rcu.txt | 8
-rw-r--r-- Documentation/RCU/rcu_dereference.txt | 103
-rw-r--r-- Documentation/RCU/rcubarrier.txt | 27
-rw-r--r-- Documentation/RCU/whatisRCU.txt | 10
-rw-r--r-- Documentation/accounting/psi.txt | 12
-rw-r--r-- Documentation/admin-guide/kernel-parameters.txt | 36
-rw-r--r-- Documentation/atomic_t.txt | 17
-rw-r--r-- Documentation/bpf/btf.rst | 8
-rw-r--r-- Documentation/core-api/cachetlb.rst | 10
-rw-r--r-- Documentation/devicetree/bindings/arm/cpus.yaml | 2
-rw-r--r-- Documentation/devicetree/bindings/hwmon/adc128d818.txt | 4
-rw-r--r-- Documentation/devicetree/bindings/i2c/i2c-iop3xx.txt (renamed from Documentation/devicetree/bindings/i2c/i2c-xscale.txt) | 0
-rw-r--r-- Documentation/devicetree/bindings/i2c/i2c-mt65xx.txt (renamed from Documentation/devicetree/bindings/i2c/i2c-mtk.txt) | 0
-rw-r--r-- Documentation/devicetree/bindings/i2c/i2c-stu300.txt (renamed from Documentation/devicetree/bindings/i2c/i2c-st-ddci2c.txt) | 0
-rw-r--r-- Documentation/devicetree/bindings/i2c/i2c-sun6i-p2wi.txt (renamed from Documentation/devicetree/bindings/i2c/i2c-sunxi-p2wi.txt) | 0
-rw-r--r-- Documentation/devicetree/bindings/i2c/i2c-wmt.txt (renamed from Documentation/devicetree/bindings/i2c/i2c-vt8500.txt) | 0
-rw-r--r-- Documentation/devicetree/bindings/net/davinci_emac.txt | 2
-rw-r--r-- Documentation/devicetree/bindings/net/dsa/qca8k.txt | 73
-rw-r--r-- Documentation/devicetree/bindings/net/ethernet.txt | 5
-rw-r--r-- Documentation/devicetree/bindings/net/macb.txt | 4
-rw-r--r-- Documentation/devicetree/bindings/serial/mtk-uart.txt | 1
-rw-r--r-- Documentation/driver-api/usb/power-management.rst | 14
-rw-r--r-- Documentation/filesystems/mount_api.txt | 367
-rw-r--r-- Documentation/i2c/busses/i2c-i801 | 1
-rw-r--r-- Documentation/kprobes.txt | 6
-rw-r--r-- Documentation/lzo.txt | 8
-rw-r--r-- Documentation/media/uapi/rc/rc-tables.rst | 4
-rw-r--r-- Documentation/networking/bpf_flow_dissector.rst | 126
-rw-r--r-- Documentation/networking/decnet.txt | 2
-rw-r--r-- Documentation/networking/index.rst | 1
-rw-r--r-- Documentation/networking/ip-sysctl.txt | 3
-rw-r--r-- Documentation/networking/msg_zerocopy.rst | 2
-rw-r--r-- Documentation/networking/netdev-FAQ.rst | 13
-rw-r--r-- Documentation/networking/nf_flowtable.txt | 8
-rw-r--r-- Documentation/networking/rxrpc.txt | 16
-rw-r--r-- Documentation/networking/snmp_counter.rst | 12
-rw-r--r-- Documentation/sysctl/vm.txt | 16
-rw-r--r-- Documentation/translations/ko_KR/memory-barriers.txt | 49
-rw-r--r-- Documentation/virtual/kvm/api.txt | 88
-rw-r--r-- Documentation/virtual/kvm/mmu.txt | 11
46 files changed, 808 insertions, 383 deletions
diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.html b/Documentation/RCU/Design/Data-Structures/Data-Structures.html
index 18f179807563..c30c1957c7e6 100644
--- a/Documentation/RCU/Design/Data-Structures/Data-Structures.html
+++ b/Documentation/RCU/Design/Data-Structures/Data-Structures.html
@@ -155,8 +155,7 @@ keeping lock contention under control at all tree levels regardless
of the level of loading on the system.
</p><p>RCU updaters wait for normal grace periods by registering
-RCU callbacks, either directly via <tt>call_rcu()</tt> and
-friends (namely <tt>call_rcu_bh()</tt> and <tt>call_rcu_sched()</tt>),
+RCU callbacks, either directly via <tt>call_rcu()</tt>
or indirectly via <tt>synchronize_rcu()</tt> and friends.
RCU callbacks are represented by <tt>rcu_head</tt> structures,
which are queued on <tt>rcu_data</tt> structures while they are
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html
index 19e7a5fb6b73..57300db4b5ff 100644
--- a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html
@@ -56,6 +56,7 @@ sections.
RCU-preempt Expedited Grace Periods</a></h2>
<p>
+<tt>CONFIG_PREEMPT=y</tt> kernels implement RCU-preempt.
The overall flow of the handling of a given CPU by an RCU-preempt
expedited grace period is shown in the following diagram:
@@ -139,6 +140,7 @@ or offline, among other things.
RCU-sched Expedited Grace Periods</a></h2>
<p>
+<tt>CONFIG_PREEMPT=n</tt> kernels implement RCU-sched.
The overall flow of the handling of a given CPU by an RCU-sched
expedited grace period is shown in the following diagram:
@@ -146,7 +148,7 @@ expedited grace period is shown in the following diagram:
<p>
As with RCU-preempt, RCU-sched's
-<tt>synchronize_sched_expedited()</tt> ignores offline and
+<tt>synchronize_rcu_expedited()</tt> ignores offline and
idle CPUs, again because they are in remotely detectable
quiescent states.
However, because the
diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
index 8d21af02b1f0..c64f8d26609f 100644
--- a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
+++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
@@ -34,12 +34,11 @@ Similarly, any code that happens before the beginning of a given RCU grace
period is guaranteed to see the effects of all accesses following the end
of that grace period that are within RCU read-side critical sections.
-<p>This guarantee is particularly pervasive for <tt>synchronize_sched()</tt>,
-for which RCU-sched read-side critical sections include any region
+<p>Note well that RCU-sched read-side critical sections include any region
of code for which preemption is disabled.
Given that each individual machine instruction can be thought of as
an extremely small region of preemption-disabled code, one can think of
-<tt>synchronize_sched()</tt> as <tt>smp_mb()</tt> on steroids.
+<tt>synchronize_rcu()</tt> as <tt>smp_mb()</tt> on steroids.
<p>RCU updaters use this guarantee by splitting their updates into
two phases, one of which is executed before the grace period and
diff --git a/Documentation/RCU/NMI-RCU.txt b/Documentation/RCU/NMI-RCU.txt
index 687777f83b23..881353fd5bff 100644
--- a/Documentation/RCU/NMI-RCU.txt
+++ b/Documentation/RCU/NMI-RCU.txt
@@ -81,18 +81,19 @@ currently executing on some other CPU. We therefore cannot free
up any data structures used by the old NMI handler until execution
of it completes on all other CPUs.
-One way to accomplish this is via synchronize_sched(), perhaps as
+One way to accomplish this is via synchronize_rcu(), perhaps as
follows:
unset_nmi_callback();
- synchronize_sched();
+ synchronize_rcu();
kfree(my_nmi_data);
-This works because synchronize_sched() blocks until all CPUs complete
-any preemption-disabled segments of code that they were executing.
-Since NMI handlers disable preemption, synchronize_sched() is guaranteed
+This works because (as of v4.20) synchronize_rcu() blocks until all
+CPUs complete any preemption-disabled segments of code that they were
+executing.
+Since NMI handlers disable preemption, synchronize_rcu() is guaranteed
not to return until all ongoing NMI handlers exit. It is therefore safe
-to free up the handler's data as soon as synchronize_sched() returns.
+to free up the handler's data as soon as synchronize_rcu() returns.
Important note: for this to work, the architecture in question must
invoke nmi_enter() and nmi_exit() on NMI entry and exit, respectively.
diff --git a/Documentation/RCU/UP.txt b/Documentation/RCU/UP.txt
index 90ec5341ee98..53bde717017b 100644
--- a/Documentation/RCU/UP.txt
+++ b/Documentation/RCU/UP.txt
@@ -86,10 +86,8 @@ even on a UP system. So do not do it! Even on a UP system, the RCU
infrastructure -must- respect grace periods, and -must- invoke callbacks
from a known environment in which no locks are held.
-It -is- safe for synchronize_sched() and synchronize_rcu_bh() to return
-immediately on an UP system. It is also safe for synchronize_rcu()
-to return immediately on UP systems, except when running preemptable
-RCU.
+Note that it -is- safe for synchronize_rcu() to return immediately on
+UP systems, including !PREEMPT SMP builds running on UP systems.
Quick Quiz #3: Why can't synchronize_rcu() return immediately on
UP systems running preemptable RCU?
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
index 6f469864d9f5..e98ff261a438 100644
--- a/Documentation/RCU/checklist.txt
+++ b/Documentation/RCU/checklist.txt
@@ -182,16 +182,13 @@ over a rather long period of time, but improvements are always welcome!
when publicizing a pointer to a structure that can
be traversed by an RCU read-side critical section.
-5. If call_rcu(), or a related primitive such as call_rcu_bh(),
- call_rcu_sched(), or call_srcu() is used, the callback function
- will be called from softirq context. In particular, it cannot
- block.
+5. If call_rcu() or call_srcu() is used, the callback function will
+ be called from softirq context. In particular, it cannot block.
-6. Since synchronize_rcu() can block, it cannot be called from
- any sort of irq context. The same rule applies for
- synchronize_rcu_bh(), synchronize_sched(), synchronize_srcu(),
- synchronize_rcu_expedited(), synchronize_rcu_bh_expedited(),
- synchronize_sched_expedite(), and synchronize_srcu_expedited().
+6. Since synchronize_rcu() can block, it cannot be called
+ from any sort of irq context. The same rule applies
+ for synchronize_srcu(), synchronize_rcu_expedited(), and
+ synchronize_srcu_expedited().
The expedited forms of these primitives have the same semantics
as the non-expedited forms, but expediting is both expensive and
@@ -212,20 +209,20 @@ over a rather long period of time, but improvements are always welcome!
of the system, especially to real-time workloads running on
the rest of the system.
-7. If the updater uses call_rcu() or synchronize_rcu(), then the
- corresponding readers must use rcu_read_lock() and
- rcu_read_unlock(). If the updater uses call_rcu_bh() or
- synchronize_rcu_bh(), then the corresponding readers must
- use rcu_read_lock_bh() and rcu_read_unlock_bh(). If the
- updater uses call_rcu_sched() or synchronize_sched(), then
- the corresponding readers must disable preemption, possibly
- by calling rcu_read_lock_sched() and rcu_read_unlock_sched().
- If the updater uses synchronize_srcu() or call_srcu(), then
- the corresponding readers must use srcu_read_lock() and
+7. As of v4.20, a given kernel implements only one RCU flavor,
+ which is RCU-sched for PREEMPT=n and RCU-preempt for PREEMPT=y.
+ If the updater uses call_rcu() or synchronize_rcu(),
+ then the corresponding readers may use rcu_read_lock() and
+ rcu_read_unlock(), rcu_read_lock_bh() and rcu_read_unlock_bh(),
+ or any pair of primitives that disables and re-enables preemption,
+ for example, rcu_read_lock_sched() and rcu_read_unlock_sched().
+ If the updater uses synchronize_srcu() or call_srcu(),
+ then the corresponding readers must use srcu_read_lock() and
srcu_read_unlock(), and with the same srcu_struct. The rules for
the expedited primitives are the same as for their non-expedited
counterparts. Mixing things up will result in confusion and
- broken kernels.
+ broken kernels, and has even resulted in an exploitable security
+ issue.
One exception to this rule: rcu_read_lock() and rcu_read_unlock()
may be substituted for rcu_read_lock_bh() and rcu_read_unlock_bh()
@@ -288,8 +285,7 @@ over a rather long period of time, but improvements are always welcome!
d. Periodically invoke synchronize_rcu(), permitting a limited
number of updates per grace period.
- The same cautions apply to call_rcu_bh(), call_rcu_sched(),
- call_srcu(), and kfree_rcu().
+ The same cautions apply to call_srcu() and kfree_rcu().
Note that although these primitives do take action to avoid memory
exhaustion when any given CPU has too many callbacks, a determined
@@ -322,7 +318,7 @@ over a rather long period of time, but improvements are always welcome!
11. Any lock acquired by an RCU callback must be acquired elsewhere
with softirq disabled, e.g., via spin_lock_irqsave(),
- spin_lock_bh(), etc. Failing to disable irq on a given
+ spin_lock_bh(), etc. Failing to disable softirq on a given
acquisition of that lock will result in deadlock as soon as
the RCU softirq handler happens to run your RCU callback while
interrupting that acquisition's critical section.
@@ -335,13 +331,16 @@ over a rather long period of time, but improvements are always welcome!
must use whatever locking or other synchronization is required
to safely access and/or modify that data structure.
- RCU callbacks are -usually- executed on the same CPU that executed
- the corresponding call_rcu(), call_rcu_bh(), or call_rcu_sched(),
- but are by -no- means guaranteed to be. For example, if a given
- CPU goes offline while having an RCU callback pending, then that
- RCU callback will execute on some surviving CPU. (If this was
- not the case, a self-spawning RCU callback would prevent the
- victim CPU from ever going offline.)
+ Do not assume that RCU callbacks will be executed on the same
+ CPU that executed the corresponding call_rcu() or call_srcu().
+ For example, if a given CPU goes offline while having an RCU
+ callback pending, then that RCU callback will execute on some
+ surviving CPU. (If this was not the case, a self-spawning RCU
+ callback would prevent the victim CPU from ever going offline.)
+ Furthermore, CPUs designated by rcu_nocbs= might well -always-
+ have their RCU callbacks executed on some other CPUs, in fact,
+ for some real-time workloads, this is the whole point of using
+ the rcu_nocbs= kernel boot parameter.
13. Unlike other forms of RCU, it -is- permissible to block in an
SRCU read-side critical section (demarked by srcu_read_lock()
@@ -381,11 +380,11 @@ over a rather long period of time, but improvements are always welcome!
SRCU's expedited primitive (synchronize_srcu_expedited())
never sends IPIs to other CPUs, so it is easier on
- real-time workloads than is synchronize_rcu_expedited(),
- synchronize_rcu_bh_expedited() or synchronize_sched_expedited().
+ real-time workloads than is synchronize_rcu_expedited().
- Note that rcu_dereference() and rcu_assign_pointer() relate to
- SRCU just as they do to other forms of RCU.
+ Note that rcu_assign_pointer() relates to SRCU just as it does to
+ other forms of RCU, but instead of rcu_dereference() you should
+ use srcu_dereference() in order to avoid lockdep splats.
14. The whole point of call_rcu(), synchronize_rcu(), and friends
is to wait until all pre-existing readers have finished before
@@ -405,6 +404,9 @@ over a rather long period of time, but improvements are always welcome!
read-side critical sections. It is the responsibility of the
RCU update-side primitives to deal with this.
+ For SRCU readers, you can use smp_mb__after_srcu_read_unlock()
+ immediately after an srcu_read_unlock() to get a full barrier.
+
16. Use CONFIG_PROVE_LOCKING, CONFIG_DEBUG_OBJECTS_RCU_HEAD, and the
__rcu sparse checks to validate your RCU code. These can help
find problems as follows:
@@ -428,22 +430,19 @@ over a rather long period of time, but improvements are always welcome!
These debugging aids can help you find problems that are
otherwise extremely difficult to spot.
-17. If you register a callback using call_rcu(), call_rcu_bh(),
- call_rcu_sched(), or call_srcu(), and pass in a function defined
- within a loadable module, then it in necessary to wait for
- all pending callbacks to be invoked after the last invocation
- and before unloading that module. Note that it is absolutely
- -not- sufficient to wait for a grace period! The current (say)
- synchronize_rcu() implementation waits only for all previous
- callbacks registered on the CPU that synchronize_rcu() is running
- on, but it is -not- guaranteed to wait for callbacks registered
- on other CPUs.
+17. If you register a callback using call_rcu() or call_srcu(), and
+ pass in a function defined within a loadable module, then it is
+ necessary to wait for all pending callbacks to be invoked after
+ the last invocation and before unloading that module. Note that
+ it is absolutely -not- sufficient to wait for a grace period!
+ The current (say) synchronize_rcu() implementation is -not-
+ guaranteed to wait for callbacks registered on other CPUs.
+ Or even on the current CPU if that CPU recently went offline
+ and came back online.
You instead need to use one of the barrier functions:
o call_rcu() -> rcu_barrier()
- o call_rcu_bh() -> rcu_barrier()
- o call_rcu_sched() -> rcu_barrier()
o call_srcu() -> srcu_barrier()
However, these barrier functions are absolutely -not- guaranteed
diff --git a/Documentation/RCU/rcu.txt b/Documentation/RCU/rcu.txt
index 721b3e426515..c818cf65c5a9 100644
--- a/Documentation/RCU/rcu.txt
+++ b/Documentation/RCU/rcu.txt
@@ -52,10 +52,10 @@ o If I am running on a uniprocessor kernel, which can only do one
o How can I see where RCU is currently used in the Linux kernel?
Search for "rcu_read_lock", "rcu_read_unlock", "call_rcu",
- "rcu_read_lock_bh", "rcu_read_unlock_bh", "call_rcu_bh",
- "srcu_read_lock", "srcu_read_unlock", "synchronize_rcu",
- "synchronize_net", "synchronize_srcu", and the other RCU
- primitives. Or grab one of the cscope databases from:
+ "rcu_read_lock_bh", "rcu_read_unlock_bh", "srcu_read_lock",
+ "srcu_read_unlock", "synchronize_rcu", "synchronize_net",
+ "synchronize_srcu", and the other RCU primitives. Or grab one
+ of the cscope databases from:
http://www.rdrop.com/users/paulmck/RCU/linuxusage/rculocktab.html
diff --git a/Documentation/RCU/rcu_dereference.txt b/Documentation/RCU/rcu_dereference.txt
index ab96227bad42..bf699e8cfc75 100644
--- a/Documentation/RCU/rcu_dereference.txt
+++ b/Documentation/RCU/rcu_dereference.txt
@@ -351,3 +351,106 @@ garbage values.
In short, rcu_dereference() is -not- optional when you are going to
dereference the resulting pointer.
+
+
+WHICH MEMBER OF THE rcu_dereference() FAMILY SHOULD YOU USE?
+
+First, please avoid using rcu_dereference_raw() and also please avoid
+using rcu_dereference_check() and rcu_dereference_protected() with a
+second argument with a constant value of 1 (or true, for that matter).
+With that caution out of the way, here is some guidance for which
+member of the rcu_dereference() family to use in various situations:
+
+1. If the access needs to be within an RCU read-side critical
+ section, use rcu_dereference(). With the new consolidated
+ RCU flavors, an RCU read-side critical section is entered
+ using rcu_read_lock(), anything that disables bottom halves,
+ anything that disables interrupts, or anything that disables
+ preemption.
+
+2. If the access might be within an RCU read-side critical section
+ on the one hand, or protected by (say) my_lock on the other,
+ use rcu_dereference_check(), for example:
+
+ p1 = rcu_dereference_check(p->rcu_protected_pointer,
+ lockdep_is_held(&my_lock));
+
+
+3. If the access might be within an RCU read-side critical section
+ on the one hand, or protected by either my_lock or your_lock on
+ the other, again use rcu_dereference_check(), for example:
+
+ p1 = rcu_dereference_check(p->rcu_protected_pointer,
+ lockdep_is_held(&my_lock) ||
+ lockdep_is_held(&your_lock));
+
+4. If the access is on the update side, so that it is always protected
+ by my_lock, use rcu_dereference_protected():
+
+ p1 = rcu_dereference_protected(p->rcu_protected_pointer,
+ lockdep_is_held(&my_lock));
+
+ This can be extended to handle multiple locks as in #3 above,
+ and both can be extended to check other conditions as well.
+
+5. If the protection is supplied by the caller, and is thus unknown
+ to this code, that is the rare case when rcu_dereference_raw()
+ is appropriate. In addition, rcu_dereference_raw() might be
+ appropriate when the lockdep expression would be excessively
+ complex, except that a better approach in that case might be to
+ take a long hard look at your synchronization design. Still,
+ there are data-locking cases where any one of a very large number
+ of locks or reference counters suffices to protect the pointer,
+ so rcu_dereference_raw() does have its place.
+
+ However, its place is probably quite a bit smaller than one
+ might expect given the number of uses in the current kernel.
+ Ditto for its synonym, rcu_dereference_check( ... , 1), and
+ its close relative, rcu_dereference_protected(... , 1).
+
+
+SPARSE CHECKING OF RCU-PROTECTED POINTERS
+
+The sparse static-analysis tool checks for direct access to RCU-protected
+pointers, which can result in "interesting" bugs due to compiler
+optimizations involving invented loads and perhaps also load tearing.
+For example, suppose someone mistakenly does something like this:
+
+ p = q->rcu_protected_pointer;
+ do_something_with(p->a);
+ do_something_else_with(p->b);
+
+If register pressure is high, the compiler might optimize "p" out
+of existence, transforming the code to something like this:
+
+ do_something_with(q->rcu_protected_pointer->a);
+ do_something_else_with(q->rcu_protected_pointer->b);
+
+This could fatally disappoint your code if q->rcu_protected_pointer
+changed in the meantime. Nor is this a theoretical problem: Exactly
+this sort of bug cost Paul E. McKenney (and several of his innocent
+colleagues) a three-day weekend back in the early 1990s.
+
+Load tearing could of course result in dereferencing a mashup of a pair
+of pointers, which also might fatally disappoint your code.
+
+These problems could have been avoided simply by making the code instead
+read as follows:
+
+ p = rcu_dereference(q->rcu_protected_pointer);
+ do_something_with(p->a);
+ do_something_else_with(p->b);
+
+Unfortunately, these sorts of bugs can be extremely hard to spot during
+review. This is where the sparse tool comes into play, along with the
+"__rcu" marker. If you mark a pointer declaration, whether in a structure
+or as a formal parameter, with "__rcu", sparse will complain if
+this pointer is accessed directly. It will also cause sparse to complain
+if a pointer not marked with "__rcu" is accessed using rcu_dereference()
+and friends. For example, ->rcu_protected_pointer might be declared as
+follows:
+
+ struct foo __rcu *rcu_protected_pointer;
+
+Use of "__rcu" is opt-in. If you choose not to use it, then you should
+ignore the sparse warnings.
diff --git a/Documentation/RCU/rcubarrier.txt b/Documentation/RCU/rcubarrier.txt
index 5d7759071a3e..a2782df69732 100644
--- a/Documentation/RCU/rcubarrier.txt
+++ b/Documentation/RCU/rcubarrier.txt
@@ -83,16 +83,15 @@ Pseudo-code using rcu_barrier() is as follows:
2. Execute rcu_barrier().
3. Allow the module to be unloaded.
-There are also rcu_barrier_bh(), rcu_barrier_sched(), and srcu_barrier()
-functions for the other flavors of RCU, and you of course must match
-the flavor of rcu_barrier() with that of call_rcu(). If your module
-uses multiple flavors of call_rcu(), then it must also use multiple
+There is also an srcu_barrier() function for SRCU, and you of course
+must match the flavor of rcu_barrier() with that of call_rcu(). If your
+module uses multiple flavors of call_rcu(), then it must also use multiple
flavors of rcu_barrier() when unloading that module. For example, if
-it uses call_rcu_bh(), call_srcu() on srcu_struct_1, and call_srcu() on
+it uses call_rcu(), call_srcu() on srcu_struct_1, and call_srcu() on
srcu_struct_2(), then the following three lines of code will be required
when unloading:
- 1 rcu_barrier_bh();
+ 1 rcu_barrier();
2 srcu_barrier(&srcu_struct_1);
3 srcu_barrier(&srcu_struct_2);
@@ -185,12 +184,12 @@ module invokes call_rcu() from timers, you will need to first cancel all
the timers, and only then invoke rcu_barrier() to wait for any remaining
RCU callbacks to complete.
-Of course, if you module uses call_rcu_bh(), you will need to invoke
-rcu_barrier_bh() before unloading. Similarly, if your module uses
-call_rcu_sched(), you will need to invoke rcu_barrier_sched() before
-unloading. If your module uses call_rcu(), call_rcu_bh(), -and-
-call_rcu_sched(), then you will need to invoke each of rcu_barrier(),
-rcu_barrier_bh(), and rcu_barrier_sched().
+Of course, if your module uses call_rcu(), you will need to invoke
+rcu_barrier() before unloading. Similarly, if your module uses
+call_srcu(), you will need to invoke srcu_barrier() before unloading,
+and on the same srcu_struct structure. If your module uses call_rcu()
+-and- call_srcu(), then you will need to invoke rcu_barrier() -and-
+srcu_barrier().
Implementing rcu_barrier()
@@ -223,8 +222,8 @@ shown below. Note that the final "1" in on_each_cpu()'s argument list
ensures that all the calls to rcu_barrier_func() will have completed
before on_each_cpu() returns. Line 9 then waits for the completion.
-This code was rewritten in 2008 to support rcu_barrier_bh() and
-rcu_barrier_sched() in addition to the original rcu_barrier().
+This code was rewritten in 2008 and several times thereafter, but this
+still gives the general idea.
The rcu_barrier_func() runs on each CPU, where it invokes call_rcu()
to post an RCU callback, as follows:
diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt
index 1ace20815bb1..981651a8b65d 100644
--- a/Documentation/RCU/whatisRCU.txt
+++ b/Documentation/RCU/whatisRCU.txt
@@ -310,7 +310,7 @@ reader, updater, and reclaimer.
rcu_assign_pointer()
- +--------+
+ +--------+
+---------------------->| reader |---------+
| +--------+ |
| | |
@@ -318,12 +318,12 @@ reader, updater, and reclaimer.
| | | rcu_read_lock()
| | | rcu_read_unlock()
| rcu_dereference() | |
- +---------+ | |
- | updater |<---------------------+ |
- +---------+ V
+ +---------+ | |
+ | updater |<----------------+ |
+ +---------+ V
| +-----------+
+----------------------------------->| reclaimer |
- +-----------+
+ +-----------+
Defer:
synchronize_rcu() & call_rcu()
diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
index b8ca28b60215..7e71c9c1d8e9 100644
--- a/Documentation/accounting/psi.txt
+++ b/Documentation/accounting/psi.txt
@@ -56,12 +56,12 @@ situation from a state where some tasks are stalled but the CPU is
still doing productive work. As such, time spent in this subset of the
stall state is tracked separately and exported in the "full" averages.
-The ratios are tracked as recent trends over ten, sixty, and three
-hundred second windows, which gives insight into short term events as
-well as medium and long term trends. The total absolute stall time is
-tracked and exported as well, to allow detection of latency spikes
-which wouldn't necessarily make a dent in the time averages, or to
-average trends over custom time frames.
+The ratios (in %) are tracked as recent trends over ten, sixty, and
+three hundred second windows, which gives insight into short term events
+as well as medium and long term trends. The total absolute stall time
+(in us) is tracked and exported as well, to allow detection of latency
+spikes which wouldn't necessarily make a dent in the time averages,
+or to average trends over custom time frames.
Cgroup2 interface
=================
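Reviewer note on the psi.txt hunk above: the new "(in %)" and "(in us)" annotations can be made concrete with a small userspace sketch. It assumes a pressure line of the form "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"; the struct psi_line type and the psi_parse_line() helper are hypothetical names made up for this illustration and are not part of the kernel or of this patch.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical container for one parsed pressure line. */
struct psi_line {
	char kind[8];                 /* "some" or "full" */
	double avg10, avg60, avg300;  /* recent-trend ratios, in percent */
	unsigned long long total_us;  /* absolute stall time, in microseconds */
};

/*
 * Parse one line in the assumed format
 *   "<kind> avg10=<pct> avg60=<pct> avg300=<pct> total=<us>"
 * Returns 0 on success and -1 if the line does not match.
 */
static int psi_parse_line(const char *line, struct psi_line *out)
{
	int n;

	assert(line != NULL && out != NULL);
	n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		   out->kind, &out->avg10, &out->avg60, &out->avg300,
		   &out->total_us);
	return n == 5 ? 0 : -1;
}
```

A caller would read a line from a pressure file with fgets() and pass it to psi_parse_line(). Keeping the averages as doubles and the total as an integer mirrors the two units the hunk documents.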
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 2b8ee90bb644..b7e23e9d1770 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2544,6 +2544,38 @@
in the "bleeding edge" mini2440 support kernel at
http://repo.or.cz/w/linux-2.6/mini2440.git
+ mitigations=
+ [X86,PPC,S390] Control optional mitigations for CPU
+ vulnerabilities. This is a set of curated,
+ arch-independent options, each of which is an
+ aggregation of existing arch-specific options.
+
+ off
+ Disable all optional CPU mitigations. This
+ improves system performance, but it may also
+ expose users to several CPU vulnerabilities.
+ Equivalent to: nopti [X86,PPC]
+ nospectre_v1 [PPC]
+ nobp=0 [S390]
+ nospectre_v2 [X86,PPC,S390]
+ spectre_v2_user=off [X86]
+ spec_store_bypass_disable=off [X86,PPC]
+ l1tf=off [X86]
+
+ auto (default)
+ Mitigate all CPU vulnerabilities, but leave SMT
+ enabled, even if it's vulnerable. This is for
+ users who don't want to be surprised by SMT
+ getting disabled across kernel upgrades, or who
+ have other ways of avoiding SMT-based attacks.
+ Equivalent to: (default behavior)
+
+ auto,nosmt
+ Mitigate all CPU vulnerabilities, disabling SMT
+ if needed. This is for users who always want to
+ be fully mitigated, even if it means losing SMT.
+ Equivalent to: l1tf=flush,nosmt [X86]
+
mminit_loglevel=
[KNL] When CONFIG_DEBUG_MEMORY_INIT is set, this
parameter allows control of the logging verbosity for
@@ -3623,7 +3655,9 @@
see CONFIG_RAS_CEC help text.
rcu_nocbs= [KNL]
- The argument is a cpu list, as described above.
+ The argument is a cpu list, as described above,
+ except that the string "all" can be used to
+ specify every CPU on the system.
In kernels built with CONFIG_RCU_NOCB_CPU=y, set
the specified list of CPUs to be no-callback CPUs.
diff --git a/Documentation/atomic_t.txt b/Documentation/atomic_t.txt
index 913396ac5824..dca3fb0554db 100644
--- a/Documentation/atomic_t.txt
+++ b/Documentation/atomic_t.txt
@@ -56,6 +56,23 @@ Barriers:
smp_mb__{before,after}_atomic()
+TYPES (signed vs unsigned)
+-----
+
+While atomic_t, atomic_long_t and atomic64_t use int, long and s64
+respectively (for hysterical raisins), the kernel uses -fno-strict-overflow
+(which implies -fwrapv) and defines signed overflow to behave like
+2s-complement.
+
+Therefore, an explicitly unsigned variant of the atomic ops is strictly
+unnecessary and we can simply cast, there is no UB.
+
+There was a bug in UBSAN prior to GCC-8 that would generate UB warnings for
+signed types.
+
+With this we also conform to the C/C++ _Atomic behaviour and things like
+P1236R1.
+
SEMANTICS
---------
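Reviewer note on the atomic_t.txt hunk above: the "we can simply cast" remark can be illustrated outside the kernel. The sketch below is plain userspace C, not kernel code. It does not rely on -fwrapv; instead it does the arithmetic in unsigned int, where wraparound is defined by the C standard, and casts back. The conversion back to int is implementation-defined in ISO C; GCC and Clang document it as modulo 2^N, which is the assumption here. The names wrapping_add() and wrapping_add_selfcheck() are hypothetical.

```c
#include <assert.h>
#include <limits.h>

/*
 * Add two ints with 2s-complement wraparound and no undefined behaviour:
 * unsigned addition wraps modulo 2^N by definition, and the cast back to
 * int is assumed to reduce modulo 2^N (true for GCC and Clang).
 */
static int wrapping_add(int a, int b)
{
	return (int)((unsigned int)a + (unsigned int)b);
}

/* Small self-check showing the wrap at both ends of the range. */
static void wrapping_add_selfcheck(void)
{
	assert(wrapping_add(INT_MAX, 1) == INT_MIN);
	assert(wrapping_add(INT_MIN, -1) == INT_MAX);
	assert(wrapping_add(-1, 1) == 0);
}
```

This is the same result that a signed add produces under -fno-strict-overflow, which is why the text argues that separate unsigned atomic variants are unnecessary.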
diff --git a/Documentation/bpf/btf.rst b/Documentation/bpf/btf.rst
index 9a60a5d60e38..7313d354f20e 100644
--- a/Documentation/bpf/btf.rst
+++ b/Documentation/bpf/btf.rst
@@ -148,16 +148,16 @@ The ``btf_type.size * 8`` must be equal to or greater than ``BTF_INT_BITS()``
for the type. The maximum value of ``BTF_INT_BITS()`` is 128.
The ``BTF_INT_OFFSET()`` specifies the starting bit offset to calculate values
-for this int. For example, a bitfield struct member has: * btf member bit
-offset 100 from the start of the structure, * btf member pointing to an int
-type, * the int type has ``BTF_INT_OFFSET() = 2`` and ``BTF_INT_BITS() = 4``
+for this int. For example, a bitfield struct member has:
+ * btf member bit offset 100 from the start of the structure,
+ * btf member pointing to an int type,
+ * the int type has ``BTF_INT_OFFSET() = 2`` and ``BTF_INT_BITS() = 4``
Then in the struct memory layout, this member will occupy ``4`` bits starting
from bits ``100 + 2 = 102``.
Alternatively, the bitfield struct member can be the following to access the
same bits as the above:
-
* btf member bit offset 102,
* btf member pointing to an int type,
* the int type has ``BTF_INT_OFFSET() = 0`` and ``BTF_INT_BITS() = 4``
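Reviewer note on the btf.rst hunk above: both descriptions of the bitfield member (member offset 100 with BTF_INT_OFFSET() = 2, or member offset 102 with BTF_INT_OFFSET() = 0) select the same 4 bits starting at bit 102. The sketch below shows that arithmetic in userspace C. The extract_bits() helper is hypothetical, and it assumes little-endian bit numbering (bit i lives in byte i / 8 at position i % 8), which is an assumption of this example and not something the hunk states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return nbits bits of buf starting at absolute bit offset bit_off,
 * assuming little-endian bit numbering within the buffer.
 */
static uint64_t extract_bits(const uint8_t *buf, size_t bit_off,
			     unsigned int nbits)
{
	uint64_t v = 0;
	unsigned int i;

	assert(nbits >= 1 && nbits <= 64);
	for (i = 0; i < nbits; i++) {
		size_t bit = bit_off + i;
		uint64_t b = (buf[bit / 8] >> (bit % 8)) & 1u;

		v |= b << i;	/* place source bit at position i */
	}
	return v;
}
```

Calling extract_bits(buf, 100 + 2, 4) and extract_bits(buf, 102 + 0, 4) reads the same bits, which is the equivalence the two bullet lists describe.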
diff --git a/Documentation/core-api/cachetlb.rst b/Documentation/core-api/cachetlb.rst
index 6eb9d3f090cd..93cb65d52720 100644
--- a/Documentation/core-api/cachetlb.rst
+++ b/Documentation/core-api/cachetlb.rst
@@ -101,16 +101,6 @@ changes occur:
translations for software managed TLB configurations.
The sparc64 port currently does this.
-6) ``void tlb_migrate_finish(struct mm_struct *mm)``
-
- This interface is called at the end of an explicit
- process migration. This interface provides a hook
- to allow a platform to update TLB or context-specific
- information for the address space.
-
- The ia64 sn2 platform is one example of a platform
- that uses this interface.
-
Next, we have the cache flushing interfaces. In general, when Linux
is changing an existing virtual-->physical mapping to a new value,
the sequence will be in one of the following forms::
diff --git a/Documentation/devicetree/bindings/arm/cpus.yaml b/Documentation/devicetree/bindings/arm/cpus.yaml
index 365dcf384d73..82dd7582e945 100644
--- a/Documentation/devicetree/bindings/arm/cpus.yaml
+++ b/Documentation/devicetree/bindings/arm/cpus.yaml
@@ -228,7 +228,7 @@ patternProperties:
- renesas,r9a06g032-smp
- rockchip,rk3036-smp
- rockchip,rk3066-smp
- - socionext,milbeaut-m10v-smp
+ - socionext,milbeaut-m10v-smp
- ste,dbx500-smp
cpu-release-addr:
diff --git a/Documentation/devicetree/bindings/hwmon/adc128d818.txt b/Documentation/devicetree/bindings/hwmon/adc128d818.txt
index 08bab0e94d25..d0ae46d7bac3 100644
--- a/Documentation/devicetree/bindings/hwmon/adc128d818.txt
+++ b/Documentation/devicetree/bindings/hwmon/adc128d818.txt
@@ -26,7 +26,7 @@ Required node properties:
Optional node properties:
- - ti,mode: Operation mode (see above).
+ - ti,mode: Operation mode (u8) (see above).
Example (operation mode 2):
@@ -34,5 +34,5 @@ Example (operation mode 2):
adc128d818@1d {
compatible = "ti,adc128d818";
reg = <0x1d>;
- ti,mode = <2>;
+ ti,mode = /bits/ 8 <2>;
};
diff --git a/Documentation/devicetree/bindings/i2c/i2c-xscale.txt b/Documentation/devicetree/bindings/i2c/i2c-iop3xx.txt
index dcc8390e0d24..dcc8390e0d24 100644
--- a/Documentation/devicetree/bindings/i2c/i2c-xscale.txt
+++ b/Documentation/devicetree/bindings/i2c/i2c-iop3xx.txt
diff --git a/Documentation/devicetree/bindings/i2c/i2c-mtk.txt b/Documentation/devicetree/bindings/i2c/i2c-mt65xx.txt
index ee4c32454198..ee4c32454198 100644
--- a/Documentation/devicetree/bindings/i2c/i2c-mtk.txt
+++ b/Documentation/devicetree/bindings/i2c/i2c-mt65xx.txt
diff --git a/Documentation/devicetree/bindings/i2c/i2c-st-ddci2c.txt b/Documentation/devicetree/bindings/i2c/i2c-stu300.txt
index bd81a482634f..bd81a482634f 100644
--- a/Documentation/devicetree/bindings/i2c/i2c-st-ddci2c.txt
+++ b/Documentation/devicetree/bindings/i2c/i2c-stu300.txt
diff --git a/Documentation/devicetree/bindings/i2c/i2c-sunxi-p2wi.txt b/Documentation/devicetree/bindings/i2c/i2c-sun6i-p2wi.txt
index 49df0053347a..49df0053347a 100644
--- a/Documentation/devicetree/bindings/i2c/i2c-sunxi-p2wi.txt
+++ b/Documentation/devicetree/bindings/i2c/i2c-sun6i-p2wi.txt
diff --git a/Documentation/devicetree/bindings/i2c/i2c-vt8500.txt b/Documentation/devicetree/bindings/i2c/i2c-wmt.txt
index 94a425eaa6c7..94a425eaa6c7 100644
--- a/Documentation/devicetree/bindings/i2c/i2c-vt8500.txt
+++ b/Documentation/devicetree/bindings/i2c/i2c-wmt.txt
diff --git a/Documentation/devicetree/bindings/net/davinci_emac.txt b/Documentation/devicetree/bindings/net/davinci_emac.txt
index 24c5cdaba8d2..ca83dcc84fb8 100644
--- a/Documentation/devicetree/bindings/net/davinci_emac.txt
+++ b/Documentation/devicetree/bindings/net/davinci_emac.txt
@@ -20,6 +20,8 @@ Required properties:
Optional properties:
- phy-handle: See ethernet.txt file in the same directory.
If absent, davinci_emac driver defaults to 100/FULL.
+- nvmem-cells: phandle, reference to an nvmem node for the MAC address
+- nvmem-cell-names: string, should be "mac-address" if nvmem is to be used
- ti,davinci-rmii-en: 1 byte, 1 means use RMII
- ti,davinci-no-bd-ram: boolean, does EMAC have BD RAM?
diff --git a/Documentation/devicetree/bindings/net/dsa/qca8k.txt b/Documentation/devicetree/bindings/net/dsa/qca8k.txt
index bbcb255c3150..93a7469e70d4 100644
--- a/Documentation/devicetree/bindings/net/dsa/qca8k.txt
+++ b/Documentation/devicetree/bindings/net/dsa/qca8k.txt
@@ -12,10 +12,15 @@ Required properties:
Subnodes:
The integrated switch subnode should be specified according to the binding
-described in dsa/dsa.txt. As the QCA8K switches do not have a N:N mapping of
-port and PHY id, each subnode describing a port needs to have a valid phandle
-referencing the internal PHY connected to it. The CPU port of this switch is
-always port 0.
+described in dsa/dsa.txt. If the QCA8K switch is connected to a SoC's external
+mdio-bus, each subnode describing a port needs to have a valid phandle
+referencing the internal PHY it is connected to. This is because there's no
+N:N mapping of port and PHY id.
+
+Don't use mixed external and internal mdio-bus configurations, as this is
+not supported by the hardware.
+
+The CPU port of this switch is always port 0.
A CPU port node has the following optional node:
@@ -31,8 +36,9 @@ For QCA8K the 'fixed-link' sub-node supports only the following properties:
- 'full-duplex' (boolean, optional), to indicate that full duplex is
used. When absent, half duplex is assumed.
-Example:
+Examples:
+for the external mdio-bus configuration:
&mdio0 {
phy_port1: phy@0 {
@@ -55,12 +61,12 @@ Example:
reg = <4>;
};
- switch0@0 {
+ switch@10 {
compatible = "qca,qca8337";
#address-cells = <1>;
#size-cells = <0>;
- reg = <0>;
+ reg = <0x10>;
ports {
#address-cells = <1>;
@@ -108,3 +114,56 @@ Example:
};
};
};
+
+for the internal master mdio-bus configuration:
+
+ &mdio0 {
+ switch@10 {
+ compatible = "qca,qca8337";
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ reg = <0x10>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ label = "cpu";
+ ethernet = <&gmac1>;
+ phy-mode = "rgmii";
+ fixed-link {
+ speed = <1000>;
+ full-duplex;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ label = "lan1";
+ };
+
+ port@2 {
+ reg = <2>;
+ label = "lan2";
+ };
+
+ port@3 {
+ reg = <3>;
+ label = "lan3";
+ };
+
+ port@4 {
+ reg = <4>;
+ label = "lan4";
+ };
+
+ port@5 {
+ reg = <5>;
+ label = "wan";
+ };
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/net/ethernet.txt b/Documentation/devicetree/bindings/net/ethernet.txt
index cfc376bc977a..a68621580584 100644
--- a/Documentation/devicetree/bindings/net/ethernet.txt
+++ b/Documentation/devicetree/bindings/net/ethernet.txt
@@ -10,15 +10,14 @@ Documentation/devicetree/bindings/phy/phy-bindings.txt.
the boot program; should be used in cases where the MAC address assigned to
the device by the boot program is different from the "local-mac-address"
property;
-- nvmem-cells: phandle, reference to an nvmem node for the MAC address;
-- nvmem-cell-names: string, should be "mac-address" if nvmem is to be used;
- max-speed: number, specifies maximum speed in Mbit/s supported by the device;
- max-frame-size: number, maximum transfer unit (IEEE defined MTU), rather than
the maximum frame size (there's contradiction in the Devicetree
Specification).
- phy-mode: string, operation mode of the PHY interface. This is now a de-facto
standard property; supported values are:
- * "internal"
+ * "internal" (Internal means there is not a standard bus between the MAC and
+ the PHY; something proprietary is being used to embed the PHY in the MAC.)
* "mii"
* "gmii"
* "sgmii"
diff --git a/Documentation/devicetree/bindings/net/macb.txt b/Documentation/devicetree/bindings/net/macb.txt
index 174f292d8a3e..8b80515729d7 100644
--- a/Documentation/devicetree/bindings/net/macb.txt
+++ b/Documentation/devicetree/bindings/net/macb.txt
@@ -26,6 +26,10 @@ Required properties:
Optional elements: 'tsu_clk'
- clocks: Phandles to input clocks.
+Optional properties:
+- nvmem-cells: phandle, reference to an nvmem node for the MAC address
+- nvmem-cell-names: string, should be "mac-address" if nvmem is to be used
+
Optional properties for PHY child node:
- reset-gpios : Should specify the gpio for phy reset
- magic-packet : If present, indicates that the hardware supports waking
diff --git a/Documentation/devicetree/bindings/serial/mtk-uart.txt b/Documentation/devicetree/bindings/serial/mtk-uart.txt
index 742cb470595b..bcfb13194f16 100644
--- a/Documentation/devicetree/bindings/serial/mtk-uart.txt
+++ b/Documentation/devicetree/bindings/serial/mtk-uart.txt
@@ -16,6 +16,7 @@ Required properties:
* "mediatek,mt8127-uart" for MT8127 compatible UARTS
* "mediatek,mt8135-uart" for MT8135 compatible UARTS
* "mediatek,mt8173-uart" for MT8173 compatible UARTS
+ * "mediatek,mt8183-uart", "mediatek,mt6577-uart" for MT8183 compatible UARTS
* "mediatek,mt6577-uart" for MT6577 and all of the above
- reg: The base address of the UART register bank.
diff --git a/Documentation/driver-api/usb/power-management.rst b/Documentation/driver-api/usb/power-management.rst
index 79beb807996b..4a74cf6f2797 100644
--- a/Documentation/driver-api/usb/power-management.rst
+++ b/Documentation/driver-api/usb/power-management.rst
@@ -370,11 +370,15 @@ autosuspend the interface's device. When the usage counter is = 0
then the interface is considered to be idle, and the kernel may
autosuspend the device.
-Drivers need not be concerned about balancing changes to the usage
-counter; the USB core will undo any remaining "get"s when a driver
-is unbound from its interface. As a corollary, drivers must not call
-any of the ``usb_autopm_*`` functions after their ``disconnect``
-routine has returned.
+Drivers must be careful to balance their overall changes to the usage
+counter. Unbalanced "get"s will remain in effect when a driver is
+unbound from its interface, preventing the device from going into
+runtime suspend should the interface be bound to a driver again. On
+the other hand, drivers are allowed to achieve this balance by calling
+the ``usb_autopm_*`` functions even after their ``disconnect`` routine
+has returned -- say from within a work-queue routine -- provided they
+retain an active reference to the interface (via ``usb_get_intf`` and
+``usb_put_intf``).
Drivers using the async routines are responsible for their own
synchronization and mutual exclusion.
diff --git a/Documentation/filesystems/mount_api.txt b/Documentation/filesystems/mount_api.txt
index 944d1965e917..00ff0cfccfa7 100644
--- a/Documentation/filesystems/mount_api.txt
+++ b/Documentation/filesystems/mount_api.txt
@@ -12,11 +12,13 @@ CONTENTS
(4) Filesystem context security.
- (5) VFS filesystem context operations.
+ (5) VFS filesystem context API.
- (6) Parameter description.
+ (6) Superblock creation helpers.
- (7) Parameter helper functions.
+ (7) Parameter description.
+
+ (8) Parameter helper functions.
========
@@ -41,12 +43,15 @@ The creation of new mounts is now to be done in a multistep process:
(7) Destroy the context.
-To support this, the file_system_type struct gains a new field:
+To support this, the file_system_type struct gains two new fields:
int (*init_fs_context)(struct fs_context *fc);
+ const struct fs_parameter_description *parameters;
-which is invoked to set up the filesystem-specific parts of a filesystem
-context, including the additional space.
+The first is invoked to set up the filesystem-specific parts of a filesystem
+context, including the additional space, and the second points to the
+parameter description for validation at registration time and querying by a
+future system call.
Note that security initialisation is done *after* the filesystem is called so
that the namespaces may be adjusted first.
@@ -73,9 +78,9 @@ context. This is represented by the fs_context structure:
void *s_fs_info;
unsigned int sb_flags;
unsigned int sb_flags_mask;
+ unsigned int s_iflags;
+ unsigned int lsm_flags;
enum fs_context_purpose purpose:8;
- bool sloppy:1;
- bool silent:1;
...
};
@@ -141,6 +146,10 @@ The fs_context fields are as follows:
Which bits SB_* flags are to be set/cleared in super_block::s_flags.
+ (*) unsigned int s_iflags
+
+ These will be bitwise-OR'd with s->s_iflags when a superblock is created.
+
(*) enum fs_context_purpose
This indicates the purpose for which the context is intended. The
@@ -150,17 +159,6 @@ The fs_context fields are as follows:
FS_CONTEXT_FOR_SUBMOUNT -- New automatic submount of extant mount
FS_CONTEXT_FOR_RECONFIGURE -- Change an existing mount
- (*) bool sloppy
- (*) bool silent
-
- These are set if the sloppy or silent mount options are given.
-
- [NOTE] sloppy is probably unnecessary when userspace passes over one
- option at a time since the error can just be ignored if userspace deems it
- to be unimportant.
-
- [NOTE] silent is probably redundant with sb_flags & SB_SILENT.
-
The mount context is created by calling vfs_new_fs_context() or
vfs_dup_fs_context() and is destroyed with put_fs_context(). Note that the
structure is not refcounted.
@@ -342,28 +340,47 @@ number of operations used by the new mount code for this purpose:
It should return 0 on success or a negative error code on failure.
-=================================
-VFS FILESYSTEM CONTEXT OPERATIONS
-=================================
+==========================
+VFS FILESYSTEM CONTEXT API
+==========================
-There are four operations for creating a filesystem context and
-one for destroying a context:
+There are four operations for creating a filesystem context and one for
+destroying a context:
- (*) struct fs_context *vfs_new_fs_context(struct file_system_type *fs_type,
- struct dentry *reference,
- unsigned int sb_flags,
- unsigned int sb_flags_mask,
- enum fs_context_purpose purpose);
+ (*) struct fs_context *fs_context_for_mount(
+ struct file_system_type *fs_type,
+ unsigned int sb_flags);
- Create a filesystem context for a given filesystem type and purpose. This
- allocates the filesystem context, sets the superblock flags, initialises
- the security and calls fs_type->init_fs_context() to initialise the
- filesystem private data.
+ Allocate a filesystem context for the purpose of setting up a new mount,
+ whether that be with a new superblock or sharing an existing one. This
+ sets the superblock flags, initialises the security and calls
+ fs_type->init_fs_context() to initialise the filesystem private data.
- reference can be NULL or it may indicate the root dentry of a superblock
- that is going to be reconfigured (FS_CONTEXT_FOR_RECONFIGURE) or
- the automount point that triggered a submount (FS_CONTEXT_FOR_SUBMOUNT).
- This is provided as a source of namespace information.
+ fs_type specifies the filesystem type that will manage the context and
+ sb_flags presets the superblock flags stored therein.
+
+ (*) struct fs_context *fs_context_for_reconfigure(
+ struct dentry *dentry,
+ unsigned int sb_flags,
+ unsigned int sb_flags_mask);
+
+ Allocate a filesystem context for the purpose of reconfiguring an
+ existing superblock. dentry provides a reference to the superblock to be
+ configured. sb_flags and sb_flags_mask indicate which superblock flags
+ need changing and to what.
+
+ (*) struct fs_context *fs_context_for_submount(
+ struct file_system_type *fs_type,
+ struct dentry *reference);
+
+ Allocate a filesystem context for the purpose of creating a new mount for
+ an automount point or other derived superblock. fs_type specifies the
+ filesystem type that will manage the context and the reference dentry
+ supplies the parameters. Namespaces are propagated from the reference
+ dentry's superblock also.
+
+ Note that it's not a requirement that the reference dentry be of the same
+ filesystem type as fs_type.
(*) struct fs_context *vfs_dup_fs_context(struct fs_context *src_fc);
@@ -390,20 +407,6 @@ context pointer or a negative error code.
For the remaining operations, if an error occurs, a negative error code will be
returned.
- (*) int vfs_get_tree(struct fs_context *fc);
-
- Get or create the mountable root and superblock, using the parameters in
- the filesystem context to select/configure the superblock. This invokes
- the ->validate() op and then the ->get_tree() op.
-
- [NOTE] ->validate() could perhaps be rolled into ->get_tree() and
- ->reconfigure().
-
- (*) struct vfsmount *vfs_create_mount(struct fs_context *fc);
-
- Create a mount given the parameters in the specified filesystem context.
- Note that this does not attach the mount to anything.
-
(*) int vfs_parse_fs_param(struct fs_context *fc,
struct fs_parameter *param);
@@ -432,17 +435,80 @@ returned.
clear the pointer, but then becomes responsible for disposing of the
object.
- (*) int vfs_parse_fs_string(struct fs_context *fc, char *key,
+ (*) int vfs_parse_fs_string(struct fs_context *fc, const char *key,
const char *value, size_t v_size);
- A wrapper around vfs_parse_fs_param() that just passes a constant string.
+ A wrapper around vfs_parse_fs_param() that copies the value string it is
+ passed.
(*) int generic_parse_monolithic(struct fs_context *fc, void *data);
Parse a sys_mount() data page, assuming the form to be a text list
consisting of key[=val] options separated by commas. Each item in the
list is passed to vfs_mount_option(). This is the default when the
- ->parse_monolithic() operation is NULL.
+ ->parse_monolithic() method is NULL.
+
+ (*) int vfs_get_tree(struct fs_context *fc);
+
+ Get or create the mountable root and superblock, using the parameters in
+ the filesystem context to select/configure the superblock. This invokes
+ the ->get_tree() method.
+
+ (*) struct vfsmount *vfs_create_mount(struct fs_context *fc);
+
+ Create a mount given the parameters in the specified filesystem context.
+ Note that this does not attach the mount to anything.
+
+
+===========================
+SUPERBLOCK CREATION HELPERS
+===========================
+
+A number of VFS helpers are available for use by filesystems for the creation
+or looking up of superblocks.
+
+ (*) struct super_block *
+ sget_fc(struct fs_context *fc,
+ int (*test)(struct super_block *sb, struct fs_context *fc),
+ int (*set)(struct super_block *sb, struct fs_context *fc));
+
+ This is the core routine. If test is non-NULL, it searches for an
+ existing superblock matching the criteria held in the fs_context, using
+ the test function to match them. If no match is found, a new superblock
+ is created and the set function is called to set it up.
+
+ Prior to the set function being called, fc->s_fs_info will be transferred
+ to sb->s_fs_info - and fc->s_fs_info will be cleared if set returns
+ success (ie. 0).
+
+The following helpers all wrap sget_fc():
+
+ (*) int vfs_get_super(struct fs_context *fc,
+ enum vfs_get_super_keying keying,
+ int (*fill_super)(struct super_block *sb,
+ struct fs_context *fc))
+
+ This creates/looks up a deviceless superblock. The keying indicates how
+ many superblocks of this type may exist and in what manner they may be
+ shared:
+
+ (1) vfs_get_single_super
+
+ Only one such superblock may exist in the system. Any further
+ attempt to get a new superblock gets this one (and any parameter
+ differences are ignored).
+
+ (2) vfs_get_keyed_super
+
+ Multiple superblocks of this type may exist and they're keyed on
+ their s_fs_info pointer (for example this may refer to a
+ namespace).
+
+ (3) vfs_get_independent_super
+
+ Multiple independent superblocks of this type may exist. This
+ function never matches an existing one and always creates a new
+ one.
=====================
@@ -454,35 +520,22 @@ There's a core description struct that links everything together:
struct fs_parameter_description {
const char name[16];
- u8 nr_params;
- u8 nr_alt_keys;
- u8 nr_enums;
- bool ignore_unknown;
- bool no_source;
- const char *const *keys;
- const struct constant_table *alt_keys;
const struct fs_parameter_spec *specs;
const struct fs_parameter_enum *enums;
};
For example:
- enum afs_param {
+ enum {
Opt_autocell,
Opt_bar,
Opt_dyn,
Opt_foo,
Opt_source,
- nr__afs_params
};
static const struct fs_parameter_description afs_fs_parameters = {
.name = "kAFS",
- .nr_params = nr__afs_params,
- .nr_alt_keys = ARRAY_SIZE(afs_param_alt_keys),
- .nr_enums = ARRAY_SIZE(afs_param_enums),
- .keys = afs_param_keys,
- .alt_keys = afs_param_alt_keys,
.specs = afs_param_specs,
.enums = afs_param_enums,
};
@@ -494,28 +547,24 @@ The members are as follows:
The name to be used in error messages generated by the parse helper
functions.
- (2) u8 nr_params;
-
- The number of discrete parameter identifiers. This indicates the number
- of elements in the ->types[] array and also limits the values that may be
- used in the values that the ->keys[] array maps to.
-
- It is expected that, for example, two parameters that are related, say
- "acl" and "noacl" with have the same ID, but will be flagged to indicate
- that one is the inverse of the other. The value can then be picked out
- from the parse result.
+ (2) const struct fs_parameter_spec *specs;
- (3) const struct fs_parameter_specification *specs;
+ Table of parameter specifications, terminated with a null entry, where the
+ entries are of type:
- Table of parameter specifications, where the entries are of type:
-
- struct fs_parameter_type {
- enum fs_parameter_spec type:8;
- u8 flags;
+ struct fs_parameter_spec {
+ const char *name;
+ u8 opt;
+ enum fs_parameter_type type:8;
+ unsigned short flags;
};
- and the parameter identifier is the index to the array. 'type' indicates
- the desired value type and must be one of:
+ The 'name' field is a string to match exactly to the parameter key (no
+ wildcards, no patterns and no case-independence) and 'opt' is the value that
+ will be returned by the fs_parse() function in the case of a successful
+ match.
+
+ The 'type' field indicates the desired value type and must be one of:
TYPE NAME EXPECTED VALUE RESULT IN
======================= ======================= =====================
@@ -525,85 +574,65 @@ The members are as follows:
fs_param_is_u32_octal 32-bit octal int result->uint_32
fs_param_is_u32_hex 32-bit hex int result->uint_32
fs_param_is_s32 32-bit signed int result->int_32
+ fs_param_is_u64 64-bit unsigned int result->uint_64
fs_param_is_enum Enum value name result->uint_32
fs_param_is_string Arbitrary string param->string
fs_param_is_blob Binary blob param->blob
fs_param_is_blockdev Blockdev path * Needs lookup
fs_param_is_path Path * Needs lookup
- fs_param_is_fd File descriptor param->file
-
- And each parameter can be qualified with 'flags':
-
- fs_param_v_optional The value is optional
- fs_param_neg_with_no If key name is prefixed with "no", it is false
- fs_param_neg_with_empty If value is "", it is false
- fs_param_deprecated The parameter is deprecated.
-
- For example:
-
- static const struct fs_parameter_spec afs_param_specs[nr__afs_params] = {
- [Opt_autocell] = { fs_param_is flag },
- [Opt_bar] = { fs_param_is_enum },
- [Opt_dyn] = { fs_param_is flag },
- [Opt_foo] = { fs_param_is_bool, fs_param_neg_with_no },
- [Opt_source] = { fs_param_is_string },
- };
+ fs_param_is_fd File descriptor result->int_32
Note that if the value is of fs_param_is_bool type, fs_parse() will try
to match any string value against "0", "1", "no", "yes", "false", "true".
- [!] NOTE that the table must be sorted according to primary key name so
- that ->keys[] is also sorted.
-
- (4) const char *const *keys;
-
- Table of primary key names for the parameters. There must be one entry
- per defined parameter. The table is optional if ->nr_params is 0. The
- table is just an array of names e.g.:
+ Each parameter can also be qualified with 'flags':
- static const char *const afs_param_keys[nr__afs_params] = {
- [Opt_autocell] = "autocell",
- [Opt_bar] = "bar",
- [Opt_dyn] = "dyn",
- [Opt_foo] = "foo",
- [Opt_source] = "source",
- };
-
- [!] NOTE that the table must be sorted such that the table can be searched
- with bsearch() using strcmp(). This means that the Opt_* values must
- correspond to the entries in this table.
-
- (5) const struct constant_table *alt_keys;
- u8 nr_alt_keys;
-
- Table of additional key names and their mappings to parameter ID plus the
- number of elements in the table. This is optional. The table is just an
- array of { name, integer } pairs, e.g.:
+ fs_param_v_optional The value is optional
+ fs_param_neg_with_no result->negated set if key is prefixed with "no"
+ fs_param_neg_with_empty result->negated set if value is ""
+ fs_param_deprecated The parameter is deprecated.
- static const struct constant_table afs_param_keys[] = {
- { "baz", Opt_bar },
- { "dynamic", Opt_dyn },
+ These are wrapped by a number of convenience macros:
+
+ MACRO SPECIFIES
+ ======================= ===============================================
+ fsparam_flag() fs_param_is_flag
+ fsparam_flag_no() fs_param_is_flag, fs_param_neg_with_no
+ fsparam_bool() fs_param_is_bool
+ fsparam_u32() fs_param_is_u32
+ fsparam_u32oct() fs_param_is_u32_octal
+ fsparam_u32hex() fs_param_is_u32_hex
+ fsparam_s32() fs_param_is_s32
+ fsparam_u64() fs_param_is_u64
+ fsparam_enum() fs_param_is_enum
+ fsparam_string() fs_param_is_string
+ fsparam_blob() fs_param_is_blob
+ fsparam_bdev() fs_param_is_blockdev
+ fsparam_path() fs_param_is_path
+ fsparam_fd() fs_param_is_fd
+
+ all of which take two arguments, name string and option number - for
+ example:
+
+ static const struct fs_parameter_spec afs_param_specs[] = {
+ fsparam_flag ("autocell", Opt_autocell),
+ fsparam_flag ("dyn", Opt_dyn),
+ fsparam_string ("source", Opt_source),
+ fsparam_flag_no ("foo", Opt_foo),
+ {}
};
- [!] NOTE that the table must be sorted such that strcmp() can be used with
- bsearch() to search the entries.
-
- The parameter ID can also be fs_param_key_removed to indicate that a
- deprecated parameter has been removed and that an error will be given.
- This differs from fs_param_deprecated where the parameter may still have
- an effect.
-
- Further, the behaviour of the parameter may differ when an alternate name
- is used (for instance with NFS, "v3", "v4.2", etc. are alternate names).
+ An additional macro, __fsparam() is provided that takes an additional pair
+ of arguments to specify the type and the flags for anything that doesn't
+ match one of the above macros.
(6) const struct fs_parameter_enum *enums;
- u8 nr_enums;
- Table of enum value names to integer mappings and the number of elements
- stored therein. This is of type:
+ Table of enum value names to integer mappings, terminated with a null
+ entry. This is of type:
struct fs_parameter_enum {
- u8 param_id;
+ u8 opt;
char name[14];
u8 value;
};
@@ -621,11 +650,6 @@ The members are as follows:
try to look the value up in the enum table and the result will be stored
in the parse result.
- (7) bool no_source;
-
- If this is set, fs_parse() will ignore any "source" parameter and not
- pass it to the filesystem.
-
The parser should be pointed to by the parser pointer in the file_system_type
struct as this will provide validation on registration (if
CONFIG_VALIDATE_FS_PARSER=y) and will allow the description to be queried from
@@ -650,9 +674,8 @@ process the parameters it is given.
int value;
};
- and it must be sorted such that it can be searched using bsearch() using
- strcmp(). If a match is found, the corresponding value is returned. If a
- match isn't found, the not_found value is returned instead.
+ If a match is found, the corresponding value is returned. If a match
+ isn't found, the not_found value is returned instead.
(*) bool validate_constant_table(const struct constant_table *tbl,
size_t tbl_size,
@@ -665,36 +688,36 @@ process the parameters it is given.
should just be set to lie inside the low-to-high range.
If all is good, true is returned. If the table is invalid, errors are
- logged to dmesg, the stack is dumped and false is returned.
+ logged to dmesg and false is returned.
+
+ (*) bool fs_validate_description(const struct fs_parameter_description *desc);
+
+ This performs some validation checks on a parameter description. It
+ returns true if the description is good and false if it is not. It will
+ log errors to dmesg if validation fails.
(*) int fs_parse(struct fs_context *fc,
- const struct fs_param_parser *parser,
+ const struct fs_parameter_description *desc,
struct fs_parameter *param,
- struct fs_param_parse_result *result);
+ struct fs_parse_result *result);
This is the main interpreter of parameters. It uses the parameter
- description (parser) to look up the name of the parameter to use and to
- convert that to a parameter ID (stored in result->key).
+ description to look up a parameter by key name and to convert that to an
+ option number (which it returns).
If successful, and if the parameter type indicates the result is a
boolean, integer or enum type, the value is converted by this function and
- the result stored in result->{boolean,int_32,uint_32}.
+ the result stored in result->{boolean,int_32,uint_32,uint_64}.
If a match isn't initially made, the key is prefixed with "no" and no
value is present then an attempt will be made to look up the key with the
prefix removed. If this matches a parameter for which the type has flag
- fs_param_neg_with_no set, then a match will be made and the value will be
- set to false/0/NULL.
-
- If the parameter is successfully matched and, optionally, parsed
- correctly, 1 is returned. If the parameter isn't matched and
- parser->ignore_unknown is set, then 0 is returned. Otherwise -EINVAL is
- returned.
-
- (*) bool fs_validate_description(const struct fs_parameter_description *desc);
+ fs_param_neg_with_no set, then a match will be made and result->negated
+ will be set to true.
- This is validates the parameter description. It returns true if the
- description is good and false if it is not.
+ If the parameter isn't matched, -ENOPARAM will be returned; if the
+ parameter is matched, but the value is erroneous, -EINVAL will be
+ returned; otherwise the parameter's option number will be returned.
(*) int fs_lookup_param(struct fs_context *fc,
struct fs_parameter *value,
diff --git a/Documentation/i2c/busses/i2c-i801 b/Documentation/i2c/busses/i2c-i801
index d1ee484a787d..ee9984f35868 100644
--- a/Documentation/i2c/busses/i2c-i801
+++ b/Documentation/i2c/busses/i2c-i801
@@ -36,6 +36,7 @@ Supported adapters:
* Intel Cannon Lake (PCH)
* Intel Cedar Fork (PCH)
* Intel Ice Lake (PCH)
+ * Intel Comet Lake (PCH)
Datasheets: Publicly available at the Intel website
On Intel Patsburg and later chipsets, both the normal host SMBus controller
diff --git a/Documentation/kprobes.txt b/Documentation/kprobes.txt
index 10f4499e677c..ee60e519438a 100644
--- a/Documentation/kprobes.txt
+++ b/Documentation/kprobes.txt
@@ -243,10 +243,10 @@ Optimization
^^^^^^^^^^^^
The Kprobe-optimizer doesn't insert the jump instruction immediately;
-rather, it calls synchronize_sched() for safety first, because it's
+rather, it calls synchronize_rcu() for safety first, because it's
possible for a CPU to be interrupted in the middle of executing the
-optimized region [3]_. As you know, synchronize_sched() can ensure
-that all interruptions that were active when synchronize_sched()
+optimized region [3]_. As you know, synchronize_rcu() can ensure
+that all interruptions that were active when synchronize_rcu()
was called are done, but only if CONFIG_PREEMPT=n. So, this version
of kprobe optimization supports only kernels with CONFIG_PREEMPT=n [4]_.
diff --git a/Documentation/lzo.txt b/Documentation/lzo.txt
index f79934225d8d..ca983328976b 100644
--- a/Documentation/lzo.txt
+++ b/Documentation/lzo.txt
@@ -102,9 +102,11 @@ Byte sequences
dictionary which is empty, and that it will always be
invalid at this place.
- 17 : bitstream version. If the first byte is 17, the next byte
- gives the bitstream version (version 1 only). If the first byte
- is not 17, the bitstream version is 0.
+ 17 : bitstream version. If the first byte is 17, and compressed
+ stream length is at least 5 bytes (length of shortest possible
+ versioned bitstream), the next byte gives the bitstream version
+ (version 1 only).
+ Otherwise, the bitstream version is 0.
18..21 : copy 0..3 literals
state = (byte - 17) = 0..3 [ copy <state> literals ]
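As an illustration only (not part of the patch, and not the kernel's decompressor), the version-detection rule from the lzo.txt hunk above can be sketched in user-space C. The function and macro names are hypothetical; the constants 17 and 5 are the ones stated in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LZO_VERSION_MARKER	17	/* first byte announcing a versioned stream */
#define LZO_MIN_VERSIONED_LEN	5	/* shortest possible versioned bitstream */

/*
 * Return the bitstream version of a compressed stream: the second byte
 * if the stream starts with 17 and is at least 5 bytes long, 0 otherwise.
 */
static unsigned int lzo_bitstream_version(const uint8_t *in, size_t in_len)
{
	assert(in != NULL || in_len == 0);

	if (in_len >= LZO_MIN_VERSIONED_LEN && in[0] == LZO_VERSION_MARKER)
		return in[1];
	return 0;
}
```

The length check comes first so that a stream shorter than 5 bytes is treated as version 0 even when its first byte happens to be 17, which is the case the reworded paragraph adds.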
diff --git a/Documentation/media/uapi/rc/rc-tables.rst b/Documentation/media/uapi/rc/rc-tables.rst
index f460031d8531..177ac44fa0fa 100644
--- a/Documentation/media/uapi/rc/rc-tables.rst
+++ b/Documentation/media/uapi/rc/rc-tables.rst
@@ -623,7 +623,7 @@ the remote via /dev/input/event devices.
- .. row 78
- - ``KEY_SCREEN``
+ - ``KEY_ASPECT_RATIO``
- Select screen aspect ratio
@@ -631,7 +631,7 @@ the remote via /dev/input/event devices.
- .. row 79
- - ``KEY_ZOOM``
+ - ``KEY_FULL_SCREEN``
- Put device into zoom/full screen mode
diff --git a/Documentation/networking/bpf_flow_dissector.rst b/Documentation/networking/bpf_flow_dissector.rst
new file mode 100644
index 000000000000..b375ae2ec2c4
--- /dev/null
+++ b/Documentation/networking/bpf_flow_dissector.rst
@@ -0,0 +1,126 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
+BPF Flow Dissector
+==================
+
+Overview
+========
+
+The flow dissector is a routine that parses metadata out of packets. It is
+used in various places in the networking subsystem (RFS, flow hash, etc.).
+
+The BPF flow dissector is an attempt to reimplement the C-based flow dissector
+logic in BPF to gain all the benefits of the BPF verifier (namely, limits on
+the number of instructions and tail calls).
+
+API
+===
+
+BPF flow dissector programs operate on an ``__sk_buff``. However, only a
+limited set of fields is accessible: ``data``, ``data_end`` and ``flow_keys``.
+``flow_keys`` is a ``struct bpf_flow_keys`` and contains the flow dissector's
+input and output arguments.
+
+The inputs are:
+ * ``nhoff`` - initial offset of the networking header
+ * ``thoff`` - initial offset of the transport header, initialized to nhoff
+ * ``n_proto`` - L3 protocol type, parsed out of L2 header
+
+The flow dissector BPF program should fill out the rest of the ``struct
+bpf_flow_keys`` fields. The input arguments ``nhoff/thoff/n_proto`` should
+also be adjusted accordingly.
+
+The return code of the BPF program is either BPF_OK to indicate successful
+dissection, or BPF_DROP to indicate a parsing error.
+
+__sk_buff->data
+===============
+
+In the VLAN-less case, this is what the initial state of the BPF flow
+dissector looks like::
+
+ +------+------+------------+-----------+
+ | DMAC | SMAC | ETHER_TYPE | L3_HEADER |
+ +------+------+------------+-----------+
+ ^
+ |
+ +-- flow dissector starts here
+
+
+.. code:: c
+
+ skb->data + flow_keys->nhoff point to the first byte of L3_HEADER
+ flow_keys->thoff = nhoff
+ flow_keys->n_proto = ETHER_TYPE
+
+In the VLAN case, the flow dissector can be called in two different states.
+
+Pre-VLAN parsing::
+
+ +------+------+------+-----+-----------+-----------+
+ | DMAC | SMAC | TPID | TCI |ETHER_TYPE | L3_HEADER |
+ +------+------+------+-----+-----------+-----------+
+ ^
+ |
+ +-- flow dissector starts here
+
+.. code:: c
+
+ skb->data + flow_keys->nhoff point to the first byte of TCI
+ flow_keys->thoff = nhoff
+ flow_keys->n_proto = TPID
+
+Please note that TPID can be 802.1AD and, hence, the BPF program would
+have to parse VLAN information twice for double-tagged packets.
+
+
+Post-VLAN parsing::
+
+ +------+------+------+-----+-----------+-----------+
+ | DMAC | SMAC | TPID | TCI |ETHER_TYPE | L3_HEADER |
+ +------+------+------+-----+-----------+-----------+
+ ^
+ |
+ +-- flow dissector starts here
+
+.. code:: c
+
+ skb->data + flow_keys->nhoff point to the first byte of L3_HEADER
+ flow_keys->thoff = nhoff
+ flow_keys->n_proto = ETHER_TYPE
+
+In this case VLAN information has been processed before the flow dissector
+and BPF flow dissector is not required to handle it.
+
+
+The takeaway here is as follows: the BPF flow dissector program can be called
+with or without a VLAN header at its start offset, and the same program is
+used in both cases. It should therefore be written to gracefully handle a
+single VLAN tag, a double VLAN tag, and no VLAN tag at all.
+
+
+Reference Implementation
+========================
+
+See ``tools/testing/selftests/bpf/progs/bpf_flow.c`` for the reference
+implementation and ``tools/testing/selftests/bpf/flow_dissector_load.[hc]``
+for the loader. bpftool can be used to load BPF flow dissector program as well.
+
+The reference implementation is organized as follows:
+ * ``jmp_table`` map that contains sub-programs for each supported L3 protocol
+ * ``_dissect`` routine - entry point; it does input ``n_proto`` parsing and
+ does ``bpf_tail_call`` to the appropriate L3 handler
+
+Since BPF at this point doesn't support looping (or any jumping back),
+jmp_table is used instead to handle multiple levels of encapsulation (and
+IPv6 options).
+
+
+Current Limitations
+===================
+The BPF flow dissector doesn't support exporting all the metadata that the
+in-kernel C-based implementation can export. Notable examples are single VLAN
+(802.1Q) and double VLAN (802.1AD) tags. Please refer to
+``struct bpf_flow_keys`` for the set of information that can currently be
+exported from the BPF context.
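Reviewer note on the new bpf_flow_dissector.rst above: the pre-VLAN and post-VLAN entry states can be handled by one code path. The sketch below is plain user-space C rather than a BPF program, so it can be compiled and checked without kernel headers. It is not the reference implementation in bpf_flow.c; struct flow_keys_sketch, skip_vlan_tags() and the SKETCH_* constants are hypothetical names, n_proto is kept in host byte order for simplicity, and treating more than two tags as an error is a choice made only for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ETH_P_8021Q  0x8100  /* single VLAN tag (802.1Q) */
#define SKETCH_ETH_P_8021AD 0x88A8  /* outer tag of a double-tagged frame */

/* Hypothetical stand-in for the three input fields of struct bpf_flow_keys. */
struct flow_keys_sketch {
    uint16_t nhoff;
    uint16_t thoff;
    uint16_t n_proto;   /* host byte order in this sketch */
};

static int is_vlan_tpid(uint16_t proto)
{
    return proto == SKETCH_ETH_P_8021Q || proto == SKETCH_ETH_P_8021AD;
}

/*
 * Pre-VLAN state: nhoff points at the TCI and n_proto holds the TPID.
 * Post-VLAN (or VLAN-less) state: n_proto is already the L3 EtherType,
 * so the loop body never runs. Returns 0 on success, -1 on a parse error.
 */
static int skip_vlan_tags(const uint8_t *data, size_t len,
                          struct flow_keys_sketch *keys)
{
    int tags;

    assert(data != NULL && keys != NULL);

    for (tags = 0; tags < 2 && is_vlan_tpid(keys->n_proto); tags++) {
        /* Need 2 bytes of TCI plus 2 bytes of encapsulated EtherType. */
        if ((size_t)keys->nhoff + 4 > len)
            return -1;
        keys->n_proto = (uint16_t)((data[keys->nhoff + 2] << 8) |
                                   data[keys->nhoff + 3]);
        keys->nhoff += 4;
    }

    /* Sketch-only policy: more than two tags is reported as an error. */
    if (is_vlan_tpid(keys->n_proto))
        return -1;

    keys->thoff = keys->nhoff;
    return 0;
}
```

Called in the pre-VLAN state on a single-tagged frame, the helper advances nhoff past the tag and replaces n_proto with the encapsulated EtherType; called in the post-VLAN state, it leaves both untouched.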
diff --git a/Documentation/networking/decnet.txt b/Documentation/networking/decnet.txt
index e12a4900cf72..d192f8b9948b 100644
--- a/Documentation/networking/decnet.txt
+++ b/Documentation/networking/decnet.txt
@@ -22,8 +22,6 @@ you'll need the following options as well...
CONFIG_DECNET_ROUTER (to be able to add/delete routes)
CONFIG_NETFILTER (will be required for the DECnet routing daemon)
- CONFIG_DECNET_ROUTE_FWMARK is optional
-
Don't turn on SIOCGIFCONF support for DECnet unless you are really sure
that you need it, in general you won't and it can cause ifconfig to
malfunction.
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 5449149be496..984e68f9e026 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -9,6 +9,7 @@ Contents:
netdev-FAQ
af_xdp
batman-adv
+ bpf_flow_dissector
can
can_ucan_protocol
device_drivers/freescale/dpaa2/index
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index acdfb5d2bcaa..c4ac35234f05 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -422,6 +422,7 @@ tcp_min_rtt_wlen - INTEGER
minimum RTT when it is moved to a longer path (e.g., due to traffic
engineering). A longer window makes the filter more resistant to RTT
inflations such as transient congestion. The unit is seconds.
+ Possible values: 0 - 86400 (1 day)
Default: 300
tcp_moderate_rcvbuf - BOOLEAN
@@ -1336,6 +1337,7 @@ tag - INTEGER
Default value is 0.
xfrm4_gc_thresh - INTEGER
+ (Obsolete since linux-4.14)
The threshold at which we will start garbage collecting for IPv4
destination cache entries. At twice this value the system will
refuse new allocations.
@@ -1919,6 +1921,7 @@ echo_ignore_all - BOOLEAN
Default: 0
xfrm6_gc_thresh - INTEGER
+ (Obsolete since linux-4.14)
The threshold at which we will start garbage collecting for IPv6
destination cache entries. At twice this value the system will
refuse new allocations.
diff --git a/Documentation/networking/msg_zerocopy.rst b/Documentation/networking/msg_zerocopy.rst
index 18c1415e7bfa..ace56204dd03 100644
--- a/Documentation/networking/msg_zerocopy.rst
+++ b/Documentation/networking/msg_zerocopy.rst
@@ -50,7 +50,7 @@ the excellent reporting over at LWN.net or read the original code.
patchset
[PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
- http://lkml.kernel.org/r/20170803202945.70750-1-willemdebruijn.kernel@gmail.com
+ https://lkml.kernel.org/netdev/20170803202945.70750-1-willemdebruijn.kernel@gmail.com
Interface
diff --git a/Documentation/networking/netdev-FAQ.rst b/Documentation/networking/netdev-FAQ.rst
index 0ac5fa77f501..642fa963be3c 100644
--- a/Documentation/networking/netdev-FAQ.rst
+++ b/Documentation/networking/netdev-FAQ.rst
@@ -131,6 +131,19 @@ it to the maintainer to figure out what is the most recent and current
version that should be applied. If there is any doubt, the maintainer
will reply and ask what should be done.
+Q: I made changes to only a few patches in a patch series should I resend only those changed?
+---------------------------------------------------------------------------------------------
+A: No, please resend the entire patch series and make sure you do number your
+patches such that it is clear this is the latest and greatest set of patches
+that can be applied.
+
+Q: I submitted multiple versions of a patch series and it looks like a version other than the last one has been accepted, what should I do?
+-------------------------------------------------------------------------------------------------------------------------------------------
+A: No revert is possible; once it is pushed out, it stays that way.
+Please send incremental patches on top of what has been merged, so that the
+result looks the way it would if your latest patch series had been merged
+instead.
+
Q: How can I tell what patches are queued up for backporting to the various stable releases?
--------------------------------------------------------------------------------------------
A: Normally Greg Kroah-Hartman collects stable commits himself, but for
diff --git a/Documentation/networking/nf_flowtable.txt b/Documentation/networking/nf_flowtable.txt
index 54128c50d508..ca2136c76042 100644
--- a/Documentation/networking/nf_flowtable.txt
+++ b/Documentation/networking/nf_flowtable.txt
@@ -44,10 +44,10 @@ including the Netfilter hooks and the flowtable fastpath bypass.
/ \ / \ |Routing | / \
--> ingress ---> prerouting ---> |decision| | postrouting |--> neigh_xmit
\_________/ \__________/ ---------- \____________/ ^
- | ^ | | ^ |
- flowtable | | ____\/___ | |
- | | | / \ | |
- __\/___ | --------->| forward |------------ |
+ | ^ | ^ |
+ flowtable | ____\/___ | |
+ | | / \ | |
+ __\/___ | | forward |------------ |
|-----| | \_________/ |
|-----| | 'flow offload' rule |
|-----| | adds entry to |
diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt
index 2df5894353d6..cd7303d7fa25 100644
--- a/Documentation/networking/rxrpc.txt
+++ b/Documentation/networking/rxrpc.txt
@@ -1009,16 +1009,18 @@ The kernel interface functions are as follows:
(*) Check call still alive.
- u32 rxrpc_kernel_check_life(struct socket *sock,
- struct rxrpc_call *call);
+ bool rxrpc_kernel_check_life(struct socket *sock,
+ struct rxrpc_call *call,
+ u32 *_life);
void rxrpc_kernel_probe_life(struct socket *sock,
struct rxrpc_call *call);
- The first function returns a number that is updated when ACKs are received
- from the peer (notably including PING RESPONSE ACKs which we can elicit by
- sending PING ACKs to see if the call still exists on the server). The
- caller should compare the numbers of two calls to see if the call is still
- alive after waiting for a suitable interval.
+ The first function passes back in *_life a number that is updated when
+ ACKs are received from the peer (notably including PING RESPONSE ACKs
+ which we can elicit by sending PING ACKs to see if the call still exists
+ on the server). The caller should compare the numbers of two calls to see
+ if the call is still alive after waiting for a suitable interval. It also
+ returns true as long as the call hasn't yet reached the completed state.
This allows the caller to work out if the server is still contactable and
if the call is still alive on the server while waiting for the server to
diff --git a/Documentation/networking/snmp_counter.rst b/Documentation/networking/snmp_counter.rst
index 52b026be028f..38a4edc4522b 100644
--- a/Documentation/networking/snmp_counter.rst
+++ b/Documentation/networking/snmp_counter.rst
@@ -413,7 +413,7 @@ algorithm.
.. _F-RTO: https://tools.ietf.org/html/rfc5682
TCP Fast Path
-============
+=============
When the kernel receives a TCP packet, it has two paths to handle the
packet: one is the fast path, the other is the slow path. The comment in the
kernel code provides a good explanation of them; it is pasted below::
@@ -681,6 +681,7 @@ The TCP stack receives an out of order duplicate packet, so it sends a
DSACK to the sender.
* TcpExtTCPDSACKRecv
+
The TCP stack receives a DSACK, which indicates an acknowledged
duplicate packet is received.
@@ -690,7 +691,7 @@ The TCP stack receives a DSACK, which indicate an out of order
duplicate packet is received.
invalid SACK and DSACK
-====================
+======================
When a SACK (or DSACK) block is invalid, a corresponding counter is
updated. The validation method is based on the start/end sequence
number of the SACK block. For more details, please refer to the comment
@@ -704,11 +705,13 @@ explaination:
.. _Add counters for discarded SACK blocks: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=18f02545a9a16c9a89778b91a162ad16d510bb32
* TcpExtTCPSACKDiscard
+
This counter indicates how many SACK blocks are invalid. If the invalid
SACK block is caused by ACK recording, the TCP stack will only ignore
it and won't update this counter.
* TcpExtTCPDSACKIgnoredOld and TcpExtTCPDSACKIgnoredNoUndo
+
When a DSACK block is invalid, one of these two counters would be
updated. Which counter will be updated depends on the undo_marker flag
of the TCP socket. If the undo_marker is not set, the TCP stack isn't
@@ -719,7 +722,7 @@ will be updated. If the undo_marker is set, TcpExtTCPDSACKIgnoredOld
will be updated. As implied in its name, it might be an old packet.
SACK shift
-=========
+==========
The Linux networking stack stores data in the sk_buff struct (skb for
short). If a SACK block spans multiple skbs, the TCP stack will try
to re-arrange data in these skbs. E.g. if a SACK block acknowledges seq
@@ -730,12 +733,15 @@ seq 14 to 20. All data in skb2 will be moved to skb1, and skb2 will be
discarded; this operation is 'merge'.
* TcpExtTCPSackShifted
+
A skb is shifted
* TcpExtTCPSackMerged
+
A skb is merged
* TcpExtTCPSackShiftFallback
+
A skb should be shifted or merged, but the TCP stack doesn't do it for
some reasons.
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 6af24cdb25cc..3f13d8599337 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -866,14 +866,14 @@ The intent is that compaction has less work to do in the future and to
increase the success rate of future high-order allocations such as SLUB
allocations, THP and hugetlbfs pages.
-To make it sensible with respect to the watermark_scale_factor parameter,
-the unit is in fractions of 10,000. The default value of 15,000 means
-that up to 150% of the high watermark will be reclaimed in the event of
-a pageblock being mixed due to fragmentation. The level of reclaim is
-determined by the number of fragmentation events that occurred in the
-recent past. If this value is smaller than a pageblock then a pageblocks
-worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor
-of 0 will disable the feature.
+To make it sensible with respect to the watermark_scale_factor
+parameter, the unit is in fractions of 10,000. The default value of
+15,000 on !DISCONTIGMEM configurations means that up to 150% of the high
+watermark will be reclaimed in the event of a pageblock being mixed due
+to fragmentation. The level of reclaim is determined by the number of
+fragmentation events that occurred in the recent past. If this value is
+smaller than a pageblock then a pageblock's worth of pages will be reclaimed
+(e.g. 2MB on 64-bit x86). A boost factor of 0 will disable the feature.
=============================================================
diff --git a/Documentation/translations/ko_KR/memory-barriers.txt b/Documentation/translations/ko_KR/memory-barriers.txt
index 7f01fb1c1084..db0b9d8619f1 100644
--- a/Documentation/translations/ko_KR/memory-barriers.txt
+++ b/Documentation/translations/ko_KR/memory-barriers.txt
@@ -493,10 +493,8 @@ There are some minimal guarantees that may be expected of a CPU
This type of operation acts as a one-way permeable barrier. All memory
operations after the ACQUIRE operation are guaranteed to appear, to the
other components of the system, to happen after the ACQUIRE operation.
- LOCK operations and the smp_load_acquire() and smp_cond_acquire() operations
- are also ACQUIRE operations. The smp_cond_acquire() operation uses a
- control dependency and smp_rmb() to satisfy the semantic requirements of
- ACQUIRE.
+ LOCK operations and the smp_load_acquire() and smp_cond_load_acquire()
+ operations are also ACQUIRE operations.
Memory operations before the ACQUIRE operation may appear to be performed
after the ACQUIRE operation completes.
@@ -2146,33 +2144,40 @@ set_current_state() may also be wrapped by the following:
event_indicated = 1;
wake_up_process(event_daemon);
-A write memory barrier is implied by wake_up() and co., provided that they
-wake something up. The barrier is performed before the task state is cleared,
-so it sits between the STORE that indicates the event and the STORE that sets
-the task state to TASK_RUNNING.
+If wake_up() wakes something up, it executes a general memory barrier.
+If it wakes nothing up, a memory barrier may or may not be executed;
+do not assume that one is executed in that case. The barrier is executed
+before the task state is accessed; specifically, it is executed between the
+STORE that indicates the event and the STORE that writes TASK_RUNNING:
- CPU 1                           CPU 2
+ CPU 1 (Sleeper)                 CPU 2 (Waker)
 =============================== ===============================
 set_current_state();            STORE event_indicated
   smp_store_mb();               wake_up();
-    STORE current->state          <write barrier>
-    <general barrier>             STORE current->state
-    LOAD event_indicated
+    STORE current->state          ...
+    <general barrier>             <general barrier>
+    LOAD event_indicated          if ((LOAD task->state) & TASK_NORMAL)
+                                    STORE task->state
-To repeat, this write memory barrier is executed only when the code actually
-wakes something up. To illustrate this, consider the following sequence of
-events, assuming that X and Y are both initialized to 0:
+Here "task" is the thread being woken up, and it equals CPU 1's "current".
+
+To repeat, a general memory barrier is guaranteed to be executed if wake_up()
+actually wakes something up, but there is no such guarantee otherwise. To
+understand this, consider the following sequence of events, assuming that X
+and Y are both initialized to 0:
 CPU 1                           CPU 2
 =============================== ===============================
- X = 1;                          STORE event_indicated
+ X = 1;                          Y = 1;
 smp_mb();                       wake_up();
- Y = 1;                          wait_event(wq, Y == 1);
- wake_up();                        load from Y sees 1, no memory barrier
-                                 load from X might see 0
+ LOAD Y                          LOAD X
+
+If a wakeup does occur, (at least) one of the two loads sees 1.
+If, on the other hand, no wakeup occurs, both loads might see 0.
-In contrast to the case in the example above, if a wakeup does occur, CPU 2's
-load from X would be guaranteed to see 1.
+wake_up_process() always executes a general memory barrier. This barrier is
+also executed before the task state is accessed. In particular, if wake_up()
+in the example above is replaced by wake_up_process(), one of the two loads
+is guaranteed to see 1.
The available waker functions include:
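@@ -2192,6 +2197,8 @@ A write memory barrier is implied by wake_up() and co. If
wake_up_poll();
wake_up_process();
+In terms of memory ordering, these functions all provide ordering guarantees
+that are the same as or stronger than those of wake_up().
[!] Regarding the memory barriers implied by the sleeper and the waker, stores
made before the wake-up, and what the sleeper performs after calling set_current_state()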
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 7de9eee73fcd..64b38dfcc243 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -5,25 +5,32 @@ The Definitive KVM (Kernel-based Virtual Machine) API Documentation
----------------------
The kvm API is a set of ioctls that are issued to control various aspects
-of a virtual machine. The ioctls belong to three classes
+of a virtual machine. The ioctls belong to three classes:
- System ioctls: These query and set global attributes which affect the
whole kvm subsystem. In addition a system ioctl is used to create
- virtual machines
+ virtual machines.
- VM ioctls: These query and set attributes that affect an entire virtual
machine, for example memory layout. In addition a VM ioctl is used to
- create virtual cpus (vcpus).
+ create virtual cpus (vcpus) and devices.
- Only run VM ioctls from the same process (address space) that was used
- to create the VM.
+ VM ioctls must be issued from the same process (address space) that was
+ used to create the VM.
- vcpu ioctls: These query and set attributes that control the operation
of a single virtual cpu.
- Only run vcpu ioctls from the same thread that was used to create the
- vcpu.
+ vcpu ioctls should be issued from the same thread that was used to create
+ the vcpu, except for asynchronous vcpu ioctls that are marked as such in
+ the documentation. Otherwise, the first ioctl after switching threads
+ could see a performance impact.
+ - device ioctls: These query and set attributes that control the operation
+ of a single device.
+
+ device ioctls must be issued from the same process (address space) that
+ was used to create the VM.
2. File descriptors
-------------------
@@ -32,17 +39,34 @@ The kvm API is centered around file descriptors. An initial
open("/dev/kvm") obtains a handle to the kvm subsystem; this handle
can be used to issue system ioctls. A KVM_CREATE_VM ioctl on this
handle will create a VM file descriptor which can be used to issue VM
-ioctls. A KVM_CREATE_VCPU ioctl on a VM fd will create a virtual cpu
-and return a file descriptor pointing to it. Finally, ioctls on a vcpu
-fd can be used to control the vcpu, including the important task of
-actually running guest code.
+ioctls. A KVM_CREATE_VCPU or KVM_CREATE_DEVICE ioctl on a VM fd will
+create a virtual cpu or device and return a file descriptor pointing to
+the new resource. Finally, ioctls on a vcpu or device fd can be used
+to control the vcpu or device. For vcpus, this includes the important
+task of actually running guest code.
In general file descriptors can be migrated among processes by means
of fork() and the SCM_RIGHTS facility of unix domain socket. These
kinds of tricks are explicitly not supported by kvm. While they will
not cause harm to the host, their actual behavior is not guaranteed by
-the API. The only supported use is one virtual machine per process,
-and one vcpu per thread.
+the API. See "General description" for details on the ioctl usage
+model that is supported by KVM.
+
+It is important to note that although VM ioctls may only be issued from
+the process that created the VM, a VM's lifecycle is associated with its
+file descriptor, not its creator (process). In other words, the VM and
+its resources, *including the associated address space*, are not freed
+until the last reference to the VM's file descriptor has been released.
+For example, if fork() is issued after ioctl(KVM_CREATE_VM), the VM will
+not be freed until both the parent (original) process and its child have
+put their references to the VM's file descriptor.
+
+Because a VM's resources are not freed until the last reference to its
+file descriptor is released, creating additional references to a VM
+via fork(), dup(), etc. without careful consideration is strongly
+discouraged and may have unwanted side effects, e.g. memory allocated
+by and on behalf of the VM's process may not be freed/unaccounted when
+the VM is shut down.
It is important to note that although VM ioctls may only be issued from
@@ -297,7 +321,7 @@ cpu's hardware control block.
4.8 KVM_GET_DIRTY_LOG (vm ioctl)
Capability: basic
-Architectures: x86
+Architectures: all
Type: vm ioctl
Parameters: struct kvm_dirty_log (in/out)
Returns: 0 on success, -1 on error
@@ -515,11 +539,15 @@ c) KVM_INTERRUPT_SET_LEVEL
Note that any value for 'irq' other than the ones stated above is invalid
and incurs unexpected behavior.
+This is an asynchronous vcpu ioctl and can be invoked from any thread.
+
MIPS:
Queues an external interrupt to be injected into the virtual CPU. A negative
interrupt number dequeues the interrupt.
+This is an asynchronous vcpu ioctl and can be invoked from any thread.
+
4.17 KVM_DEBUG_GUEST
@@ -1086,14 +1114,12 @@ struct kvm_userspace_memory_region {
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)
-This ioctl allows the user to create or modify a guest physical memory
-slot. When changing an existing slot, it may be moved in the guest
-physical memory space, or its flags may be modified. It may not be
-resized. Slots may not overlap in guest physical address space.
-Bits 0-15 of "slot" specifies the slot id and this value should be
-less than the maximum number of user memory slots supported per VM.
-The maximum allowed slots can be queried using KVM_CAP_NR_MEMSLOTS,
-if this capability is supported by the architecture.
+This ioctl allows the user to create, modify or delete a guest physical
+memory slot. Bits 0-15 of "slot" specify the slot id and this value
+should be less than the maximum number of user memory slots supported per
+VM. The maximum allowed slots can be queried using KVM_CAP_NR_MEMSLOTS,
+if this capability is supported by the architecture. Slots may not
+overlap in guest physical address space.
If KVM_CAP_MULTI_ADDRESS_SPACE is available, bits 16-31 of "slot"
specifies the address space which is being modified. They must be
@@ -1102,6 +1128,10 @@ KVM_CAP_MULTI_ADDRESS_SPACE capability. Slots in separate address spaces
are unrelated; the restriction on overlapping slots only applies within
each address space.
+Deleting a slot is done by passing zero for memory_size. When changing
+an existing slot, it may be moved in the guest physical memory space,
+or its flags may be modified, but it may not be resized.
+
Memory for the region is taken starting at the address denoted by the
field userspace_addr, which must point at user addressable memory for
the entire memory slot size. Any object may back this memory, including
@@ -2493,7 +2523,7 @@ KVM_S390_MCHK (vm, vcpu) - machine check interrupt; cr 14 bits in parm,
machine checks needing further payload are not
supported by this ioctl)
-Note that the vcpu ioctl is asynchronous to vcpu execution.
+This is an asynchronous vcpu ioctl and can be invoked from any thread.
4.78 KVM_PPC_GET_HTAB_FD
@@ -3042,8 +3072,7 @@ KVM_S390_INT_EMERGENCY - sigp emergency; parameters in .emerg
KVM_S390_INT_EXTERNAL_CALL - sigp external call; parameters in .extcall
KVM_S390_MCHK - machine check interrupt; parameters in .mchk
-
-Note that the vcpu ioctl is asynchronous to vcpu execution.
+This is an asynchronous vcpu ioctl and can be invoked from any thread.
4.94 KVM_S390_GET_IRQ_STATE
@@ -3781,7 +3810,7 @@ to I/O ports.
4.117 KVM_CLEAR_DIRTY_LOG (vm ioctl)
Capability: KVM_CAP_MANUAL_DIRTY_LOG_PROTECT
-Architectures: x86
+Architectures: x86, arm, arm64, mips
Type: vm ioctl
Parameters: struct kvm_dirty_log (in)
Returns: 0 on success, -1 on error
@@ -3801,8 +3830,9 @@ The ioctl clears the dirty status of pages in a memory slot, according to
the bitmap that is passed in struct kvm_clear_dirty_log's dirty_bitmap
field. Bit 0 of the bitmap corresponds to page "first_page" in the
memory slot, and num_pages is the size in bits of the input bitmap.
-Both first_page and num_pages must be a multiple of 64. For each bit
-that is set in the input bitmap, the corresponding page is marked "clean"
+first_page must be a multiple of 64; num_pages must also be a multiple of
+64 unless first_page + num_pages is the size of the memory slot. For each
+bit that is set in the input bitmap, the corresponding page is marked "clean"
in KVM's dirty bitmap, and dirty tracking is re-enabled for that page
(for example via write-protection, or by clearing the dirty bit in
a page table entry).
@@ -4770,7 +4800,7 @@ and injected exceptions.
7.18 KVM_CAP_MANUAL_DIRTY_LOG_PROTECT
-Architectures: all
+Architectures: x86, arm, arm64, mips
Parameters: args[0] whether feature should be enabled or not
With this capability enabled, KVM_GET_DIRTY_LOG will not automatically
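Reviewer note on the api.txt hunks above: two of the documented rules are simple enough to check in plain C. The sketch below packs the "slot" field of KVM_SET_USER_MEMORY_REGION (bits 0-15 slot id, bits 16-31 address space) and validates the first_page/num_pages rule stated for KVM_CLEAR_DIRTY_LOG. Both helpers and the self-check function are hypothetical names, not part of the KVM API, and the check that the range lies inside the slot is an assumption of this sketch rather than something the text above states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: bits 0-15 carry the slot id, bits 16-31 the address space. */
static uint32_t kvm_slot_field(uint16_t as_id, uint16_t slot_id)
{
    return ((uint32_t)as_id << 16) | slot_id;
}

/*
 * Hypothetical helper: first_page must be a multiple of 64; num_pages must
 * also be a multiple of 64 unless first_page + num_pages is the size of the
 * memory slot (slot_pages).
 */
static bool clear_dirty_log_args_ok(uint64_t first_page, uint64_t num_pages,
                                    uint64_t slot_pages)
{
    if (first_page % 64 != 0)
        return false;
    /* Assumption of this sketch: the range must lie inside the slot. */
    if (first_page > slot_pages || num_pages > slot_pages - first_page)
        return false;
    if (num_pages % 64 != 0 && first_page + num_pages != slot_pages)
        return false;
    return true;
}

/* Usage example on a hypothetical 100-page slot. */
static void kvm_sketch_selfcheck(void)
{
    assert(kvm_slot_field(1, 3) == 0x00010003u);
    assert(clear_dirty_log_args_ok(0, 64, 100));    /* aligned start and length */
    assert(clear_dirty_log_args_ok(64, 36, 100));   /* tail: ends at slot size */
    assert(!clear_dirty_log_args_ok(64, 30, 100));  /* unaligned, not the tail */
    assert(!clear_dirty_log_args_ok(32, 64, 100));  /* unaligned first_page */
}
```

The relaxed num_pages rule matters for slots whose size is not a multiple of 64 pages: without it, the final partial group of pages in such a slot could never be cleared.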
diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
index f365102c80f5..2efe0efc516e 100644
--- a/Documentation/virtual/kvm/mmu.txt
+++ b/Documentation/virtual/kvm/mmu.txt
@@ -142,7 +142,7 @@ Shadow pages contain the following information:
If clear, this page corresponds to a guest page table denoted by the gfn
field.
role.quadrant:
- When role.cr4_pae=0, the guest uses 32-bit gptes while the host uses 64-bit
+ When role.gpte_is_8_bytes=0, the guest uses 32-bit gptes while the host uses 64-bit
sptes. That means a guest page table contains more ptes than the host,
so multiple shadow pages are needed to shadow one guest page.
For first-level shadow pages, role.quadrant can be 0 or 1 and denotes the
@@ -158,9 +158,9 @@ Shadow pages contain the following information:
The page is invalid and should not be used. It is a root page that is
currently pinned (by a cpu hardware register pointing to it); once it is
unpinned it will be destroyed.
- role.cr4_pae:
- Contains the value of cr4.pae for which the page is valid (e.g. whether
- 32-bit or 64-bit gptes are in use).
+ role.gpte_is_8_bytes:
+ Reflects the size of the guest PTE for which the page is valid, i.e. '1'
+ if 64-bit gptes are in use, '0' if 32-bit gptes are in use.
role.nxe:
Contains the value of efer.nxe for which the page is valid.
role.cr0_wp:
@@ -173,6 +173,9 @@ Shadow pages contain the following information:
Contains the value of cr4.smap && !cr0.wp for which the page is valid
(pages for which this is true are different from other pages; see the
treatment of cr0.wp=0 below).
+ role.ept_sp:
+ This is a virtual flag to denote a shadowed nested EPT page. ept_sp
+ is true if "cr0_wp && smap_andnot_wp", an otherwise invalid combination.
role.smm:
Is 1 if the page is valid in system management mode. This field
determines which of the kvm_memslots array was used to build this