aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/RCU/RTFP.txt149
-rw-r--r--Documentation/RCU/checklist.txt18
-rw-r--r--Documentation/arm64/memory.txt16
-rw-r--r--Documentation/devicetree/bindings/ata/ahci-platform.txt22
-rw-r--r--Documentation/devicetree/bindings/ata/apm-xgene.txt76
-rw-r--r--Documentation/devicetree/bindings/net/micrel-ks8851.txt1
-rw-r--r--Documentation/kernel-parameters.txt11
-rw-r--r--Documentation/kernel-per-CPU-kthreads.txt13
-rw-r--r--Documentation/memory-barriers.txt137
-rw-r--r--Documentation/networking/netlink_mmap.txt4
-rw-r--r--Documentation/sysctl/kernel.txt15
-rw-r--r--Documentation/x86/boot.txt4
12 files changed, 372 insertions, 94 deletions
diff --git a/Documentation/RCU/RTFP.txt b/Documentation/RCU/RTFP.txt
index 273e654d7d08..2f0fcb2112d2 100644
--- a/Documentation/RCU/RTFP.txt
+++ b/Documentation/RCU/RTFP.txt
@@ -31,6 +31,14 @@ has lapsed, so this approach may be used in non-GPL software, if desired.
(In contrast, implementation of RCU is permitted only in software licensed
under either GPL or LGPL. Sorry!!!)
+In 1987, Rashid et al. described lazy TLB-flush [RichardRashid87a].
+At first glance, this has nothing to do with RCU, but nevertheless
+this paper helped inspire the update-side batching used in the later
+RCU implementation in DYNIX/ptx. In 1988, Barbara Liskov published
+a description of Argus that noted that use of out-of-date values can
+be tolerated in some situations. Thus, this paper provides some early
+theoretical justification for use of stale data.
+
In 1990, Pugh [Pugh90] noted that explicitly tracking which threads
were reading a given data structure permitted deferred free to operate
in the presence of non-terminating threads. However, this explicit
@@ -41,11 +49,11 @@ providing a fine-grained locking design, however, it would be interesting
to see how much of the performance advantage reported in 1990 remains
today.
-At about this same time, Adams [Adams91] described ``chaotic relaxation'',
-where the normal barriers between successive iterations of convergent
-numerical algorithms are relaxed, so that iteration $n$ might use
-data from iteration $n-1$ or even $n-2$. This introduces error,
-which typically slows convergence and thus increases the number of
+At about this same time, Andrews [Andrews91textbook] described ``chaotic
+relaxation'', where the normal barriers between successive iterations
+of convergent numerical algorithms are relaxed, so that iteration $n$
+might use data from iteration $n-1$ or even $n-2$. This introduces
+error, which typically slows convergence and thus increases the number of
iterations required. However, this increase is sometimes more than made
up for by a reduction in the number of expensive barrier operations,
which are otherwise required to synchronize the threads at the end
@@ -55,7 +63,8 @@ is thus inapplicable to most data structures in operating-system kernels.
In 1992, Henry (now Alexia) Massalin completed a dissertation advising
parallel programmers to defer processing when feasible to simplify
-synchronization. RCU makes extremely heavy use of this advice.
+synchronization [HMassalinPhD]. RCU makes extremely heavy use of
+this advice.
In 1993, Jacobson [Jacobson93] verbally described what is perhaps the
simplest deferred-free technique: simply waiting a fixed amount of time
@@ -90,27 +99,29 @@ mechanism, which is quite similar to RCU [Gamsa99]. These operating
systems made pervasive use of RCU in place of "existence locks", which
greatly simplifies locking hierarchies and helps avoid deadlocks.
-2001 saw the first RCU presentation involving Linux [McKenney01a]
-at OLS. The resulting abundance of RCU patches was presented the
-following year [McKenney02a], and use of RCU in dcache was first
-described that same year [Linder02a].
+The year 2000 saw an email exchange that would likely have
+led to yet another independent invention of something like RCU
+[RustyRussell2000a,RustyRussell2000b]. Instead, 2001 saw the first
+RCU presentation involving Linux [McKenney01a] at OLS. The resulting
+abundance of RCU patches was presented the following year [McKenney02a],
+and use of RCU in dcache was first described that same year [Linder02a].
Also in 2002, Michael [Michael02b,Michael02a] presented "hazard-pointer"
techniques that defer the destruction of data structures to simplify
non-blocking synchronization (wait-free synchronization, lock-free
synchronization, and obstruction-free synchronization are all examples of
-non-blocking synchronization). In particular, this technique eliminates
-locking, reduces contention, reduces memory latency for readers, and
-parallelizes pipeline stalls and memory latency for writers. However,
-these techniques still impose significant read-side overhead in the
-form of memory barriers. Researchers at Sun worked along similar lines
-in the same timeframe [HerlihyLM02]. These techniques can be thought
-of as inside-out reference counts, where the count is represented by the
-number of hazard pointers referencing a given data structure rather than
-the more conventional counter field within the data structure itself.
-The key advantage of inside-out reference counts is that they can be
-stored in immortal variables, thus allowing races between access and
-deletion to be avoided.
+non-blocking synchronization). The corresponding journal article appeared
+in 2004 [MagedMichael04a]. This technique eliminates locking, reduces
+contention, reduces memory latency for readers, and parallelizes pipeline
+stalls and memory latency for writers. However, these techniques still
+impose significant read-side overhead in the form of memory barriers.
+Researchers at Sun worked along similar lines in the same timeframe
+[HerlihyLM02]. These techniques can be thought of as inside-out reference
+counts, where the count is represented by the number of hazard pointers
+referencing a given data structure rather than the more conventional
+counter field within the data structure itself. The key advantage
+of inside-out reference counts is that they can be stored in immortal
+variables, thus allowing races between access and deletion to be avoided.
By the same token, RCU can be thought of as a "bulk reference count",
where some form of reference counter covers all reference by a given CPU
@@ -123,8 +134,10 @@ can be thought of in other terms as well.
In 2003, the K42 group described how RCU could be used to create
hot-pluggable implementations of operating-system functions [Appavoo03a].
-Later that year saw a paper describing an RCU implementation of System
-V IPC [Arcangeli03], and an introduction to RCU in Linux Journal
+Later that year saw a paper describing an RCU implementation
+of System V IPC [Arcangeli03] (following up on a suggestion by
+Hugh Dickins [Dickins02a] and an implementation by Mingming Cao
+[MingmingCao2002IPCRCU]), and an introduction to RCU in Linux Journal
[McKenney03a].
2004 has seen a Linux-Journal article on use of RCU in dcache
@@ -383,6 +396,21 @@ for Programming Languages and Operating Systems}"
}
}
+@phdthesis{HMassalinPhD
+,author="H. Massalin"
+,title="Synthesis: An Efficient Implementation of Fundamental Operating
+System Services"
+,school="Columbia University"
+,address="New York, NY"
+,year="1992"
+,annotation={
+ Mondo optimizing compiler.
+ Wait-free stuff.
+ Good advice: defer work to avoid synchronization. See page 90
+ (PDF page 106), Section 5.4, fourth bullet point.
+}
+}
+
@unpublished{Jacobson93
,author="Van Jacobson"
,title="Avoid Read-Side Locking Via Delayed Free"
@@ -671,6 +699,20 @@ Orran Krieger and Rusty Russell and Dipankar Sarma and Maneesh Soni"
[Viewed October 18, 2004]"
}
+@conference{Michael02b
+,author="Maged M. Michael"
+,title="High Performance Dynamic Lock-Free Hash Tables and List-Based Sets"
+,Year="2002"
+,Month="August"
+,booktitle="{Proceedings of the 14\textsuperscript{th} Annual ACM
+Symposium on Parallel
+Algorithms and Architecture}"
+,pages="73-82"
+,annotation={
+Like the title says...
+}
+}
+
@Conference{Linder02a
,Author="Hanna Linder and Dipankar Sarma and Maneesh Soni"
,Title="Scalability of the Directory Entry Cache"
@@ -727,6 +769,24 @@ Andrea Arcangeli and Andi Kleen and Orran Krieger and Rusty Russell"
}
}
+@conference{Michael02a
+,author="Maged M. Michael"
+,title="Safe Memory Reclamation for Dynamic Lock-Free Objects Using Atomic
+Reads and Writes"
+,Year="2002"
+,Month="August"
+,booktitle="{Proceedings of the 21\textsuperscript{st} Annual ACM
+Symposium on Principles of Distributed Computing}"
+,pages="21-30"
+,annotation={
+ Each thread keeps an array of pointers to items that it is
+ currently referencing. Sort of an inside-out garbage collection
+ mechanism, but one that requires the accessing code to explicitly
+ state its needs. Also requires read-side memory barriers on
+ most architectures.
+}
+}
+
@unpublished{Dickins02a
,author="Hugh Dickins"
,title="Use RCU for System-V IPC"
@@ -735,6 +795,17 @@ Andrea Arcangeli and Andi Kleen and Orran Krieger and Rusty Russell"
,note="private communication"
}
+@InProceedings{HerlihyLM02
+,author={Maurice Herlihy and Victor Luchangco and Mark Moir}
+,title="The Repeat Offender Problem: A Mechanism for Supporting Dynamic-Sized,
+Lock-Free Data Structures"
+,booktitle={Proceedings of 16\textsuperscript{th} International
+Symposium on Distributed Computing}
+,year=2002
+,month="October"
+,pages="339-353"
+}
+
@unpublished{Sarma02b
,Author="Dipankar Sarma"
,Title="Some dcache\_rcu benchmark numbers"
@@ -749,6 +820,19 @@ Andrea Arcangeli and Andi Kleen and Orran Krieger and Rusty Russell"
}
}
+@unpublished{MingmingCao2002IPCRCU
+,Author="Mingming Cao"
+,Title="[PATCH]updated ipc lock patch"
+,month="October"
+,year="2002"
+,note="Available:
+\url{https://lkml.org/lkml/2002/10/24/262}
+[Viewed February 15, 2014]"
+,annotation={
+ Mingming Cao's patch to introduce RCU to SysV IPC.
+}
+}
+
@unpublished{LinusTorvalds2003a
,Author="Linus Torvalds"
,Title="Re: {[PATCH]} small fixes in brlock.h"
@@ -982,6 +1066,23 @@ Realtime Applications"
}
}
+@article{MagedMichael04a
+,author="Maged M. Michael"
+,title="Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects"
+,Year="2004"
+,Month="June"
+,journal="IEEE Transactions on Parallel and Distributed Systems"
+,volume="15"
+,number="6"
+,pages="491-504"
+,url="Available:
+\url{http://www.research.ibm.com/people/m/michael/ieeetpds-2004.pdf}
+[Viewed March 1, 2005]"
+,annotation={
+ New canonical hazard-pointer citation.
+}
+}
+
@phdthesis{PaulEdwardMcKenneyPhD
,author="Paul E. McKenney"
,title="Exploiting Deferred Destruction:
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
index 91266193b8f4..9d10d1db16a5 100644
--- a/Documentation/RCU/checklist.txt
+++ b/Documentation/RCU/checklist.txt
@@ -256,10 +256,10 @@ over a rather long period of time, but improvements are always welcome!
variations on this theme.
b. Limiting update rate. For example, if updates occur only
- once per hour, then no explicit rate limiting is required,
- unless your system is already badly broken. The dcache
- subsystem takes this approach -- updates are guarded
- by a global lock, limiting their rate.
+ once per hour, then no explicit rate limiting is
+ required, unless your system is already badly broken.
+ Older versions of the dcache subsystem take this approach,
+ guarding updates with a global lock, limiting their rate.
c. Trusted update -- if updates can only be done manually by
superuser or some other trusted user, then it might not
@@ -268,7 +268,8 @@ over a rather long period of time, but improvements are always welcome!
the machine.
d. Use call_rcu_bh() rather than call_rcu(), in order to take
- advantage of call_rcu_bh()'s faster grace periods.
+ advantage of call_rcu_bh()'s faster grace periods. (This
+ is only a partial solution, though.)
e. Periodically invoke synchronize_rcu(), permitting a limited
number of updates per grace period.
@@ -276,6 +277,13 @@ over a rather long period of time, but improvements are always welcome!
The same cautions apply to call_rcu_bh(), call_rcu_sched(),
call_srcu(), and kfree_rcu().
+ Note that although these primitives do take action to avoid memory
+ exhaustion when any given CPU has too many callbacks, a determined
+ user could still exhaust memory. This is especially the case
+ if a system with a large number of CPUs has been configured to
+ offload all of its RCU callbacks onto a single CPU, or if the
+ system has relatively little free memory.
+
9. All RCU list-traversal primitives, which include
rcu_dereference(), list_for_each_entry_rcu(), and
list_for_each_safe_rcu(), must be either within an RCU read-side
diff --git a/Documentation/arm64/memory.txt b/Documentation/arm64/memory.txt
index 5e054bfe4dde..85e24c4f215c 100644
--- a/Documentation/arm64/memory.txt
+++ b/Documentation/arm64/memory.txt
@@ -35,11 +35,13 @@ ffffffbc00000000 ffffffbdffffffff 8GB vmemmap
ffffffbe00000000 ffffffbffbbfffff ~8GB [guard, future vmmemap]
-ffffffbffbc00000 ffffffbffbdfffff 2MB earlyprintk device
+ffffffbffa000000 ffffffbffaffffff 16MB PCI I/O space
+
+ffffffbffb000000 ffffffbffbbfffff 12MB [guard]
-ffffffbffbe00000 ffffffbffbe0ffff 64KB PCI I/O space
+ffffffbffbc00000 ffffffbffbdfffff 2MB earlyprintk device
-ffffffbffbe10000 ffffffbcffffffff ~2MB [guard]
+ffffffbffbe00000 ffffffbffbffffff 2MB [guard]
ffffffbffc000000 ffffffbfffffffff 64MB modules
@@ -60,11 +62,13 @@ fffffdfc00000000 fffffdfdffffffff 8GB vmemmap
fffffdfe00000000 fffffdfffbbfffff ~8GB [guard, future vmmemap]
-fffffdfffbc00000 fffffdfffbdfffff 2MB earlyprintk device
+fffffdfffa000000 fffffdfffaffffff 16MB PCI I/O space
+
+fffffdfffb000000 fffffdfffbbfffff 12MB [guard]
-fffffdfffbe00000 fffffdfffbe0ffff 64KB PCI I/O space
+fffffdfffbc00000 fffffdfffbdfffff 2MB earlyprintk device
-fffffdfffbe10000 fffffdfffbffffff ~2MB [guard]
+fffffdfffbe00000 fffffdfffbffffff 2MB [guard]
fffffdfffc000000 fffffdffffffffff 64MB modules
diff --git a/Documentation/devicetree/bindings/ata/ahci-platform.txt b/Documentation/devicetree/bindings/ata/ahci-platform.txt
index 89de1564950c..48b285ffa3a6 100644
--- a/Documentation/devicetree/bindings/ata/ahci-platform.txt
+++ b/Documentation/devicetree/bindings/ata/ahci-platform.txt
@@ -4,17 +4,33 @@ SATA nodes are defined to describe on-chip Serial ATA controllers.
Each SATA controller should have its own node.
Required properties:
-- compatible : compatible list, contains "snps,spear-ahci"
+- compatible : compatible list, one of "snps,spear-ahci",
+ "snps,exynos5440-ahci", "ibm,476gtr-ahci",
+ "allwinner,sun4i-a10-ahci", "fsl,imx53-ahci"
+ "fsl,imx6q-ahci" or "snps,dwc-ahci"
- interrupts : <interrupt mapping for SATA IRQ>
- reg : <registers mapping>
Optional properties:
- dma-coherent : Present if dma operations are coherent
+- clocks : a list of phandle + clock specifier pairs
+- target-supply : regulator for SATA target power
-Example:
+"fsl,imx53-ahci", "fsl,imx6q-ahci" required properties:
+- clocks : must contain the sata, sata_ref and ahb clocks
+- clock-names : must contain "ahb" for the ahb clock
+
+Examples:
sata@ffe08000 {
compatible = "snps,spear-ahci";
reg = <0xffe08000 0x1000>;
interrupts = <115>;
-
};
+
+ ahci: sata@01c18000 {
+ compatible = "allwinner,sun4i-a10-ahci";
+ reg = <0x01c18000 0x1000>;
+ interrupts = <56>;
+ clocks = <&pll6 0>, <&ahb_gates 25>;
+ target-supply = <&reg_ahci_5v>;
+ };
diff --git a/Documentation/devicetree/bindings/ata/apm-xgene.txt b/Documentation/devicetree/bindings/ata/apm-xgene.txt
new file mode 100644
index 000000000000..7bcfbf59810e
--- /dev/null
+++ b/Documentation/devicetree/bindings/ata/apm-xgene.txt
@@ -0,0 +1,76 @@
+* APM X-Gene 6.0 Gb/s SATA host controller nodes
+
+SATA host controller nodes are defined to describe on-chip Serial ATA
+controllers. Each SATA controller (pair of ports) have its own node.
+
+Required properties:
+- compatible : Shall contain:
+ * "apm,xgene-ahci"
+- reg : First memory resource shall be the AHCI memory
+ resource.
+ Second memory resource shall be the host controller
+ core memory resource.
+ Third memory resource shall be the host controller
+ diagnostic memory resource.
+ 4th memory resource shall be the host controller
+ AXI memory resource.
+ 5th optional memory resource shall be the host
+ controller MUX memory resource if required.
+- interrupts : Interrupt-specifier for SATA host controller IRQ.
+- clocks : Reference to the clock entry.
+- phys : A list of phandles + phy-specifiers, one for each
+ entry in phy-names.
+- phy-names : Should contain:
+ * "sata-phy" for the SATA 6.0Gbps PHY
+
+Optional properties:
+- status : Shall be "ok" if enabled or "disabled" if disabled.
+ Default is "ok".
+
+Example:
+ sataclk: sataclk {
+ compatible = "fixed-clock";
+ #clock-cells = <1>;
+ clock-frequency = <100000000>;
+ clock-output-names = "sataclk";
+ };
+
+ phy2: phy@1f22a000 {
+ compatible = "apm,xgene-phy";
+ reg = <0x0 0x1f22a000 0x0 0x100>;
+ #phy-cells = <1>;
+ };
+
+ phy3: phy@1f23a000 {
+ compatible = "apm,xgene-phy";
+ reg = <0x0 0x1f23a000 0x0 0x100>;
+ #phy-cells = <1>;
+ };
+
+ sata2: sata@1a400000 {
+ compatible = "apm,xgene-ahci";
+ reg = <0x0 0x1a400000 0x0 0x1000>,
+ <0x0 0x1f220000 0x0 0x1000>,
+ <0x0 0x1f22d000 0x0 0x1000>,
+ <0x0 0x1f22e000 0x0 0x1000>,
+ <0x0 0x1f227000 0x0 0x1000>;
+ interrupts = <0x0 0x87 0x4>;
+ status = "ok";
+ clocks = <&sataclk 0>;
+ phys = <&phy2 0>;
+ phy-names = "sata-phy";
+ };
+
+ sata3: sata@1a800000 {
+ compatible = "apm,xgene-ahci-pcie";
+ reg = <0x0 0x1a800000 0x0 0x1000>,
+ <0x0 0x1f230000 0x0 0x1000>,
+ <0x0 0x1f23d000 0x0 0x1000>,
+ <0x0 0x1f23e000 0x0 0x1000>,
+ <0x0 0x1f237000 0x0 0x1000>;
+ interrupts = <0x0 0x88 0x4>;
+ status = "ok";
+ clocks = <&sataclk 0>;
+ phys = <&phy3 0>;
+ phy-names = "sata-phy";
+ };
diff --git a/Documentation/devicetree/bindings/net/micrel-ks8851.txt b/Documentation/devicetree/bindings/net/micrel-ks8851.txt
index 11ace3c3d805..4fc392763611 100644
--- a/Documentation/devicetree/bindings/net/micrel-ks8851.txt
+++ b/Documentation/devicetree/bindings/net/micrel-ks8851.txt
@@ -7,3 +7,4 @@ Required properties:
Optional properties:
- local-mac-address : Ethernet mac address to use
+- vdd-supply: supply for Ethernet mac
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 7116fda7077f..67755ea834a7 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1011,6 +1011,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
parameter will force ia64_sal_cache_flush to call
ia64_pal_cache_flush instead of SAL_CACHE_FLUSH.
+ forcepae [X86-32]
+ Forcefully enable Physical Address Extension (PAE).
+ Many Pentium M systems disable PAE but may have a
+ functionally usable PAE implementation.
+ Warning: use of this parameter will taint the kernel
+ and may cause unknown problems.
+
ftrace=[tracer]
[FTRACE] will set and start the specified tracer
as early as possible in order to facilitate early
@@ -2053,8 +2060,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
IOAPICs that may be present in the system.
nokaslr [X86]
- Disable kernel base offset ASLR (Address Space
- Layout Randomization) if built into the kernel.
+ Disable kernel and module base offset ASLR (Address
+ Space Layout Randomization) if built into the kernel.
noautogroup Disable scheduler automatic task group creation.
diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
index 827104fb9364..f3cd299fcc41 100644
--- a/Documentation/kernel-per-CPU-kthreads.txt
+++ b/Documentation/kernel-per-CPU-kthreads.txt
@@ -162,7 +162,18 @@ Purpose: Execute workqueue requests
To reduce its OS jitter, do any of the following:
1. Run your workload at a real-time priority, which will allow
preempting the kworker daemons.
-2. Do any of the following needed to avoid jitter that your
+2. A given workqueue can be made visible in the sysfs filesystem
+ by passing the WQ_SYSFS to that workqueue's alloc_workqueue().
+ Such a workqueue can be confined to a given subset of the
+ CPUs using the /sys/devices/virtual/workqueue/*/cpumask sysfs
+ files. The set of WQ_SYSFS workqueues can be displayed using
+ "ls sys/devices/virtual/workqueue". That said, the workqueues
+ maintainer would like to caution people against indiscriminately
+ sprinkling WQ_SYSFS across all the workqueues. The reason for
+ caution is that it is easy to add WQ_SYSFS, but because sysfs is
+ part of the formal user/kernel API, it can be nearly impossible
+ to remove it, even if its addition was a mistake.
+3. Do any of the following needed to avoid jitter that your
application cannot tolerate:
a. Build your kernel with CONFIG_SLUB=y rather than
CONFIG_SLAB=y, thus avoiding the slab allocator's periodic
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 102dc19c4119..11c1d2049662 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -608,26 +608,30 @@ as follows:
b = p; /* BUG: Compiler can reorder!!! */
do_something();
-The solution is again ACCESS_ONCE(), which preserves the ordering between
-the load from variable 'a' and the store to variable 'b':
+The solution is again ACCESS_ONCE() and barrier(), which preserves the
+ordering between the load from variable 'a' and the store to variable 'b':
q = ACCESS_ONCE(a);
if (q) {
+ barrier();
ACCESS_ONCE(b) = p;
do_something();
} else {
+ barrier();
ACCESS_ONCE(b) = p;
do_something_else();
}
-You could also use barrier() to prevent the compiler from moving
-the stores to variable 'b', but barrier() would not prevent the
-compiler from proving to itself that a==1 always, so ACCESS_ONCE()
-is also needed.
+The initial ACCESS_ONCE() is required to prevent the compiler from
+proving the value of 'a', and the pair of barrier() invocations are
+required to prevent the compiler from pulling the two identical stores
+to 'b' out from the legs of the "if" statement.
It is important to note that control dependencies absolutely require a
a conditional. For example, the following "optimized" version of
-the above example breaks ordering:
+the above example breaks ordering, which is why the barrier() invocations
+are absolutely required if you have identical stores in both legs of
+the "if" statement:
q = ACCESS_ONCE(a);
ACCESS_ONCE(b) = p; /* BUG: No ordering vs. load from a!!! */
@@ -643,9 +647,11 @@ It is of course legal for the prior load to be part of the conditional,
for example, as follows:
if (ACCESS_ONCE(a) > 0) {
+ barrier();
ACCESS_ONCE(b) = q / 2;
do_something();
} else {
+ barrier();
ACCESS_ONCE(b) = q / 3;
do_something_else();
}
@@ -659,9 +665,11 @@ the needed conditional. For example:
q = ACCESS_ONCE(a);
if (q % MAX) {
+ barrier();
ACCESS_ONCE(b) = p;
do_something();
} else {
+ barrier();
ACCESS_ONCE(b) = p;
do_something_else();
}
@@ -723,8 +731,13 @@ In summary:
use smb_rmb(), smp_wmb(), or, in the case of prior stores and
later loads, smp_mb().
+ (*) If both legs of the "if" statement begin with identical stores
+ to the same variable, a barrier() statement is required at the
+ beginning of each leg of the "if" statement.
+
(*) Control dependencies require at least one run-time conditional
- between the prior load and the subsequent store. If the compiler
+ between the prior load and the subsequent store, and this
+ conditional must involve the prior load. If the compiler
is able to optimize the conditional away, it will have also
optimized away the ordering. Careful use of ACCESS_ONCE() can
help to preserve the needed conditional.
@@ -1249,6 +1262,23 @@ The ACCESS_ONCE() function can prevent any number of optimizations that,
while perfectly safe in single-threaded code, can be fatal in concurrent
code. Here are some examples of these sorts of optimizations:
+ (*) The compiler is within its rights to reorder loads and stores
+ to the same variable, and in some cases, the CPU is within its
+ rights to reorder loads to the same variable. This means that
+ the following code:
+
+ a[0] = x;
+ a[1] = x;
+
+ Might result in an older value of x stored in a[1] than in a[0].
+ Prevent both the compiler and the CPU from doing this as follows:
+
+ a[0] = ACCESS_ONCE(x);
+ a[1] = ACCESS_ONCE(x);
+
+ In short, ACCESS_ONCE() provides cache coherence for accesses from
+ multiple CPUs to a single variable.
+
(*) The compiler is within its rights to merge successive loads from
the same variable. Such merging can cause the compiler to "optimize"
the following code:
@@ -1644,12 +1674,12 @@ for each construct. These operations all imply certain barriers:
Memory operations issued after the ACQUIRE will be completed after the
ACQUIRE operation has completed.
- Memory operations issued before the ACQUIRE may be completed after the
- ACQUIRE operation has completed. An smp_mb__before_spinlock(), combined
- with a following ACQUIRE, orders prior loads against subsequent stores and
- stores and prior stores against subsequent stores. Note that this is
- weaker than smp_mb()! The smp_mb__before_spinlock() primitive is free on
- many architectures.
+ Memory operations issued before the ACQUIRE may be completed after
+ the ACQUIRE operation has completed. An smp_mb__before_spinlock(),
+ combined with a following ACQUIRE, orders prior loads against
+ subsequent loads and stores and also orders prior stores against
+ subsequent stores. Note that this is weaker than smp_mb()! The
+ smp_mb__before_spinlock() primitive is free on many architectures.
(2) RELEASE operation implication:
@@ -1694,24 +1724,21 @@ may occur as:
ACQUIRE M, STORE *B, STORE *A, RELEASE M
-This same reordering can of course occur if the lock's ACQUIRE and RELEASE are
-to the same lock variable, but only from the perspective of another CPU not
-holding that lock.
-
-In short, a RELEASE followed by an ACQUIRE may -not- be assumed to be a full
-memory barrier because it is possible for a preceding RELEASE to pass a
-later ACQUIRE from the viewpoint of the CPU, but not from the viewpoint
-of the compiler. Note that deadlocks cannot be introduced by this
-interchange because if such a deadlock threatened, the RELEASE would
-simply complete.
-
-If it is necessary for a RELEASE-ACQUIRE pair to produce a full barrier, the
-ACQUIRE can be followed by an smp_mb__after_unlock_lock() invocation. This
-will produce a full barrier if either (a) the RELEASE and the ACQUIRE are
-executed by the same CPU or task, or (b) the RELEASE and ACQUIRE act on the
-same variable. The smp_mb__after_unlock_lock() primitive is free on many
-architectures. Without smp_mb__after_unlock_lock(), the critical sections
-corresponding to the RELEASE and the ACQUIRE can cross:
+When the ACQUIRE and RELEASE are a lock acquisition and release,
+respectively, this same reordering can occur if the lock's ACQUIRE and
+RELEASE are to the same lock variable, but only from the perspective of
+another CPU not holding that lock. In short, a ACQUIRE followed by an
+RELEASE may -not- be assumed to be a full memory barrier.
+
+Similarly, the reverse case of a RELEASE followed by an ACQUIRE does not
+imply a full memory barrier. If it is necessary for a RELEASE-ACQUIRE
+pair to produce a full barrier, the ACQUIRE can be followed by an
+smp_mb__after_unlock_lock() invocation. This will produce a full barrier
+if either (a) the RELEASE and the ACQUIRE are executed by the same
+CPU or task, or (b) the RELEASE and ACQUIRE act on the same variable.
+The smp_mb__after_unlock_lock() primitive is free on many architectures.
+Without smp_mb__after_unlock_lock(), the CPU's execution of the critical
+sections corresponding to the RELEASE and the ACQUIRE can cross, so that:
*A = a;
RELEASE M
@@ -1722,7 +1749,36 @@ could occur as:
ACQUIRE N, STORE *B, STORE *A, RELEASE M
-With smp_mb__after_unlock_lock(), they cannot, so that:
+It might appear that this reordering could introduce a deadlock.
+However, this cannot happen because if such a deadlock threatened,
+the RELEASE would simply complete, thereby avoiding the deadlock.
+
+ Why does this work?
+
+ One key point is that we are only talking about the CPU doing
+ the reordering, not the compiler. If the compiler (or, for
+ that matter, the developer) switched the operations, deadlock
+ -could- occur.
+
+ But suppose the CPU reordered the operations. In this case,
+ the unlock precedes the lock in the assembly code. The CPU
+ simply elected to try executing the later lock operation first.
+ If there is a deadlock, this lock operation will simply spin (or
+ try to sleep, but more on that later). The CPU will eventually
+ execute the unlock operation (which preceded the lock operation
+ in the assembly code), which will unravel the potential deadlock,
+ allowing the lock operation to succeed.
+
+ But what if the lock is a sleeplock? In that case, the code will
+ try to enter the scheduler, where it will eventually encounter
+ a memory barrier, which will force the earlier unlock operation
+ to complete, again unraveling the deadlock. There might be
+ a sleep-unlock race, but the locking primitive needs to resolve
+ such races properly in any case.
+
+With smp_mb__after_unlock_lock(), the two critical sections cannot overlap.
+For example, with the following code, the store to *A will always be
+seen by other CPUs before the store to *B:
*A = a;
RELEASE M
@@ -1730,13 +1786,18 @@ With smp_mb__after_unlock_lock(), they cannot, so that:
smp_mb__after_unlock_lock();
*B = b;
-will always occur as either of the following:
+The operations will always occur in one of the following orders:
- STORE *A, RELEASE, ACQUIRE, STORE *B
- STORE *A, ACQUIRE, RELEASE, STORE *B
+ STORE *A, RELEASE, ACQUIRE, smp_mb__after_unlock_lock(), STORE *B
+ STORE *A, ACQUIRE, RELEASE, smp_mb__after_unlock_lock(), STORE *B
+ ACQUIRE, STORE *A, RELEASE, smp_mb__after_unlock_lock(), STORE *B
If the RELEASE and ACQUIRE were instead both operating on the same lock
-variable, only the first of these two alternatives can occur.
+variable, only the first of these alternatives can occur. In addition,
+the more strongly ordered systems may rule out some of the above orders.
+But in any case, as noted earlier, the smp_mb__after_unlock_lock()
+ensures that the store to *A will always be seen as happening before
+the store to *B.
Locks and semaphores may not provide any guarantee of ordering on UP compiled
systems, and so cannot be counted on in such a situation to actually achieve
@@ -2757,7 +2818,7 @@ in that order, but, without intervention, the sequence may have almost any
combination of elements combined or discarded, provided the program's view of
the world remains consistent. Note that ACCESS_ONCE() is -not- optional
in the above example, as there are architectures where a given CPU might
-interchange successive loads to the same location. On such architectures,
+reorder successive loads to the same location. On such architectures,
ACCESS_ONCE() does whatever is necessary to prevent this, for example, on
Itanium the volatile casts used by ACCESS_ONCE() cause GCC to emit the
special ld.acq and st.rel instructions that prevent such reordering.
diff --git a/Documentation/networking/netlink_mmap.txt b/Documentation/networking/netlink_mmap.txt
index b26122973525..c6af4bac5aa8 100644
--- a/Documentation/networking/netlink_mmap.txt
+++ b/Documentation/networking/netlink_mmap.txt
@@ -226,9 +226,9 @@ Ring setup:
void *rx_ring, *tx_ring;
/* Configure ring parameters */
- if (setsockopt(fd, NETLINK_RX_RING, &req, sizeof(req)) < 0)
+ if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0)
exit(1);
- if (setsockopt(fd, NETLINK_TX_RING, &req, sizeof(req)) < 0)
+ if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0)
exit(1)
/* Calculate size of each individual ring */
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index e55124e7c40c..ec8be46bf48d 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -320,10 +320,11 @@ This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
==============================================================
-hung_task_warning:
+hung_task_warnings:
The maximum number of warnings to report. During a check interval
-When this value is reached, no more the warnings will be reported.
+if a hung task is detected, this value is decreased by 1.
+When this value reaches 0, no more warnings will be reported.
This file shows up if CONFIG_DETECT_HUNG_TASK is enabled.
-1: report an infinite number of warnings.
@@ -441,8 +442,7 @@ feature should be disabled. Otherwise, if the system overhead from the
feature is too high then the rate the kernel samples for NUMA hinting
faults may be controlled by the numa_balancing_scan_period_min_ms,
numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
-numa_balancing_scan_size_mb, numa_balancing_settle_count sysctls and
-numa_balancing_migrate_deferred.
+numa_balancing_scan_size_mb, and numa_balancing_settle_count sysctls.
==============================================================
@@ -483,13 +483,6 @@ rate for each task.
numa_balancing_scan_size_mb is how many megabytes worth of pages are
scanned for a given scan.
-numa_balancing_migrate_deferred is how many page migrations get skipped
-unconditionally, after a page migration is skipped because a page is shared
-with other tasks. This reduces page migration overhead, and determines
-how much stronger the "move task near its memory" policy scheduler becomes,
-versus the "move memory near its task" memory management policy, for workloads
-with shared memory.
-
==============================================================
osrelease, ostype & version:
diff --git a/Documentation/x86/boot.txt b/Documentation/x86/boot.txt
index cb81741d3b0b..a75e3adaa39d 100644
--- a/Documentation/x86/boot.txt
+++ b/Documentation/x86/boot.txt
@@ -182,7 +182,7 @@ Offset Proto Name Meaning
0226/1 2.02+(3 ext_loader_ver Extended boot loader version
0227/1 2.02+(3 ext_loader_type Extended boot loader ID
0228/4 2.02+ cmd_line_ptr 32-bit pointer to the kernel command line
-022C/4 2.03+ ramdisk_max Highest legal initrd address
+022C/4 2.03+ initrd_addr_max Highest legal initrd address
0230/4 2.05+ kernel_alignment Physical addr alignment required for kernel
0234/1 2.05+ relocatable_kernel Whether kernel is relocatable or not
0235/1 2.10+ min_alignment Minimum alignment, as a power of two
@@ -534,7 +534,7 @@ Protocol: 2.02+
zero, the kernel will assume that your boot loader does not support
the 2.02+ protocol.
-Field name: ramdisk_max
+Field name: initrd_addr_max
Type: read
Offset/size: 0x22c/4
Protocol: 2.03+